How fairness metrics can be misleading

What fairness metrics are and how you can use them.

Decision makers can rely upon machine learning models for guidance. However, these models can contain hidden bias, where certain people or groups are treated unfairly because of individual features. To capture this discrimination, a variety of fairness metrics are used. Each of these measures can generally be classified as either a group or an individual/counterfactual fairness metric, depending on what it attempts to measure. For example, a group fairness metric looks at whether different demographic groups observe varied model behaviour, e.g. are there more true positive model predictions for one gender group than for another? In contrast, individual/counterfactual fairness measures assess whether similar individuals experience similar model treatment. Since these two metric types focus on different aspects of fairness, we expect that their results will be somewhat incompatible, although the amount of attention given to this topic in the literature is limited. Here, we begin with a review of the previous studies that do discuss this potential incompatibility between group and individual fairness metrics. Then, various measures are computed for an example dataset to explore the findings of the literature review. Moreover, we perform a dataset repair that optimises group fairness. From these results, inconsistencies between the different metric types are apparent. Hence, fairness measures are useful approximations for highlighting bias in machine learning models, but caution should be used when applying them as stand-alone measures. In particular, optimising for one class of metric can lead to another being compromised.

Caption: Different fairness metrics grouped by gender demonstrate their incompatibility.

Machine learning algorithms are widely used to construct prediction models that inform decision-making processes. For example, a bank may use a model to predict the likelihood of someone paying back a loan based upon certain individual characteristics. This likelihood estimate can then help the bank to decide who to approve for a loan. However, it is well-known that machine learning models can exhibit prejudice towards specific types of people due to, for instance, their race or gender. This discrimination can be the result of the machine learning algorithm, or the data used to train the model, or both. Recent studies have focussed upon ways to evaluate the bias that may exist within a model, as well as techniques that reduce the impact of biases in datasets and algorithms.

One method commonly used to assess the fairness of a prediction model is the computation of fairness metrics. These attempt to evaluate the discrimination that can occur at an individual level and group level. Individual fairness is upheld if similar individuals receive similar model outcomes. Conversely, group fairness examines whether distinct groups of individuals, such as males and females, experience consistent treatment. Moreover, these metrics are used for dataset repairs, where a dataset will be modified in some way, and the model retrained, so that a particular fairness metric is optimised.

Dwork et al. (2012) showed that when demographic parity (a group fairness measure) is achieved, large disparities amongst similar individuals can still exist, which means that group fairness does not necessarily correspond to individual fairness. In response to this, they proposed an individual fairness metric that measures whether individuals with similar features observe the same model responses. More specifically, the difference between two individuals, say a and b, is quantified by some distance measure d(a,b) – this can account for one or more features – then essentially individual fairness is satisfied when

$\sum_i|P(i|a)-P(i|b)| \leq d(a,b), \quad (1)$

where P(i|a) is the probability of outcome i for individual a. Another well-cited individual fairness measure is the consistency index formulated by Zemel et al. (2013). This evaluates the difference between each individual’s model classification and their k nearest neighbours, which are selected on the basis of individual commonalities. Again, this assesses whether a model behaves consistently for similar individuals, although only binary response models are considered. Written in full, consistency is expressed as

$\text{consistency}=1-\frac{1}{n}\sum_i\left|\hat{Y}_i-\frac{1}{k}\sum_{j \in kNN(x_i)}\hat{Y}_j\right|, \quad (2)$

where n is the total number of individuals, $\hat{Y}_i$ is the model prediction for individual i, $x_i$ is the feature vector for individual i and $kNN(x_i)$ is the set of the k nearest neighbours of $x_i$.
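As a rough illustration of equation (2), the sketch below computes the consistency index by brute force in plain Python. The function name `consistency_index`, the use of Euclidean distance and the toy data are assumptions made for this example. Each individual is also excluded from its own neighbour list here, which is a choice that library implementations may make differently.

```python
import math


def consistency_index(features, predictions, k=5):
    """Consistency as in equation (2), using brute-force nearest neighbours.

    Assumptions for this sketch: Euclidean distance between feature vectors,
    binary (0/1) predictions, and an individual is not its own neighbour.
    """
    n = len(features)
    if n < 2:
        raise ValueError("at least two individuals are needed")
    total_gap = 0.0
    for i in range(n):
        # Indices of the k individuals closest to individual i.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(features[i], features[j]),
        )[:k]
        neighbour_mean = sum(predictions[j] for j in neighbours) / len(neighbours)
        total_gap += abs(predictions[i] - neighbour_mean)
    return 1.0 - total_gap / n


# Toy data: two well-separated clusters of three individuals each.
toy_features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
uniform_within_clusters = [1, 1, 1, 0, 0, 0]
one_outlier_prediction = [1, 0, 1, 0, 0, 0]

print(consistency_index(toy_features, uniform_within_clusters, k=2))
print(consistency_index(toy_features, one_outlier_prediction, k=2))
```

When every cluster receives a single prediction the index is at its maximum of 1, and it falls as soon as one individual is classified differently from its neighbours.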

There are several group fairness measures, where metrics are computed for each group and then compared. For example, equal opportunity is a group level measure that examines the probability of a correctly predicted positive outcome for each group. If this probability is equal across all the groups then equal opportunity is satisfied. Equalised odds is an extension of this metric, which also considers the group probability of a correctly predicted negative outcome.

Another group metric is demographic parity, which uses the group probability of a predicted positive outcome, and is upheld if this probability is consistent over all the groups. Refer to Mehrabi et al. (2021) for a survey of other fairness measures. As suggested by the examples given here, each of the various group-level metrics focusses on a different aspect of fairness. These differences, however, lead to inconsistencies: it can be impossible to satisfy certain combinations of group fairness metrics simultaneously. For example, Garg et al. (2020) demonstrate that certain group measures, including equalised odds and demographic parity, cannot be upheld at the same time.
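To make the group-level definitions above concrete, the following sketch computes the per-group quantities behind demographic parity, equal opportunity and equalised odds. The helper names and the ten-person toy dataset are hypothetical and chosen only for illustration; this is not the implementation used by any particular library.

```python
def positive_rate(y_pred, group, value):
    """P(prediction = 1 | group = value): compared across groups for demographic parity."""
    preds = [p for p, g in zip(y_pred, group) if g == value]
    return sum(preds) / len(preds)


def true_positive_rate(y_true, y_pred, group, value):
    """P(prediction = 1 | label = 1, group = value): compared for equal opportunity."""
    preds = [p for t, p, g in zip(y_true, y_pred, group) if g == value and t == 1]
    return sum(preds) / len(preds)


def true_negative_rate(y_true, y_pred, group, value):
    """P(prediction = 0 | label = 0, group = value): the extra term in equalised odds."""
    hits = [1 - p for t, p, g in zip(y_true, y_pred, group) if g == value and t == 0]
    return sum(hits) / len(hits)


# Hypothetical labels and predictions for five males and five females.
y_true = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
group = ["m"] * 5 + ["f"] * 5

demographic_parity_gap = positive_rate(y_pred, group, "m") - positive_rate(y_pred, group, "f")
equal_opportunity_gap = (true_positive_rate(y_true, y_pred, group, "m")
                         - true_positive_rate(y_true, y_pred, group, "f"))
tnr_gap = (true_negative_rate(y_true, y_pred, group, "m")
           - true_negative_rate(y_true, y_pred, group, "f"))

print("demographic parity gap:", demographic_parity_gap)
print("equal opportunity gap: ", equal_opportunity_gap)
print("true negative rate gap:", tnr_gap)
```

A gap of zero means the corresponding criterion holds for the two groups. Equalised odds requires both the equal opportunity gap and the true negative rate gap to vanish.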

While the incompatibility of group-level metrics has been well-established in the literature, less attention has been given to the relationship between individual and group fairness measures. This relationship is the focus of our article. In particular, we want to understand how performing a dataset repair that optimises group fairness impacts individual fairness, or vice versa. More specifically, we begin with a review of the relevant literature. Next, a repair method is applied to an example dataset to further explore the relationship between individual and group fairness. The fairness metrics computed before and after the repair are presented and compared to the findings from the literature.

Some studies have already considered a potential incompatibility between individual and group fairness measures. For instance, Binns (2020) discusses how often there will be a trade-off when trying to uphold both classes of metric. Firstly, they explain that if individual fairness is ignored in favour of group measures, then the issue of models classifying alike individuals very differently can persist. Secondly, if individual fairness is the only focus, they believe this can lead to notable differences in outcomes at the group level. Fleisher (2021) argues that individual fairness is an invalid choice of metric in isolation due to a number of factors, including its inability to ensure general fairness. An example Fleisher gives is a model that only predicts negative outcomes for every individual, which would satisfy individual fairness but is obviously an unfair model. He also concedes that all the group metrics fail to be stand-alone measures. Ultimately, Fleisher proposes that a variety of measures should be applied to fully assess discrimination in machine learning and that individual fairness should be viewed as “one kind of tool among many”.

Speicher et al. (2018) describe an index for overall unfairness, which measures model inequality by evaluating the level of beneficial treatment received by each individual. Moreover, when the data is partitioned into distinct groups, they show that this metric can be rewritten as the sum of two new measures referred to as between-group and within-group unfairness, which correspond respectively to the group-level and individual-level bias indicators discussed here. This suggests that the overall unfairness index assesses discrimination on both levels. The authors then use this decomposed expression to demonstrate the possible imbalance between group and individual fairness. Firstly, they consider having a small number of groups, which means that within the groups, both the number of people and the variation in model response will be considerable. Hence, in this situation, reducing group unfairness is relatively straightforward, although they claim that this could unintentionally affect the overall model unfairness and therefore increase the within-group unfairness. Alternatively, they look at when the number of groups grows significantly. As a result, the model response within the groups will be less varied, upholding individual fairness, and thus, the within-group unfairness is negligible. In order to now reduce the overall unfairness, group-level fairness must be improved, which they argue is a difficult task computation-wise due to the large number of groups. Thus, the behaviour described by Speicher et al. suggests that by obtaining group level fairness, it is impossible to fully uphold individual fairness, and vice versa.
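The decomposition described above can be sketched with a generalised entropy index, the family of inequality indices that Speicher et al. build on. In this sketch the benefit of individual i is taken to be prediction minus label plus one and the parameter alpha is set to 2; both should be read as assumptions of the example, as are the function names and the toy data.

```python
def generalized_entropy(benefits, alpha=2):
    """Generalised entropy index of a list of non-negative benefits (mean must be > 0)."""
    n = len(benefits)
    mu = sum(benefits) / n
    return sum((b / mu) ** alpha - 1 for b in benefits) / (n * alpha * (alpha - 1))


def decompose(benefits, groups, alpha=2):
    """Split the overall index into (within-group, between-group) components."""
    n = len(benefits)
    mu = sum(benefits) / n
    within = 0.0
    between = 0.0
    for g in set(groups):
        members = [b for b, label in zip(benefits, groups) if label == g]
        n_g = len(members)
        mu_g = sum(members) / n_g  # assumed > 0 for every group
        # Within-group: each group's own index, weighted by size and relative mean.
        within += (n_g / n) * (mu_g / mu) ** alpha * generalized_entropy(members, alpha)
        # Between-group: the index obtained if everyone received their group mean.
        between += n_g * ((mu_g / mu) ** alpha - 1) / (n * alpha * (alpha - 1))
    return within, between


# Hypothetical labels and predictions for five males and five females.
y_true = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
groups = ["m"] * 5 + ["f"] * 5

# Benefit: 2 for a false positive, 1 for a correct prediction, 0 for a false negative.
benefits = [p - t + 1 for t, p in zip(y_true, y_pred)]

total = generalized_entropy(benefits)
within, between = decompose(benefits, groups)
print("overall:", total, "within:", within, "between:", between)
```

The two components add up to the overall index, so lowering the between-group term while holding the overall index fixed necessarily raises the within-group term. This is the tension the authors point to.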

In addition, Friedler et al. (2016) explain this group and individual incompatibility by firstly defining two opposing fairness perspectives, which they call WYSIWYG (what you see is what you get) and WAE (we’re all equal). They also define three information spaces, which they refer to as the unobserved, the observed and the decision spaces. From the WYSIWYG viewpoint, the unobserved and observed information spaces are the same. WAE, in contrast, assumes that there are no discriminatory differences between groups in the unobserved space, but that bias outside an individual’s control is introduced in the observed space. The authors discuss the example of Black students’ SAT verbal question scores being generally weaker (observed information), although unobserved qualities, such as intelligence, are the same amongst the groups.

They also define the different mappings between the observed space and the decision space for the two approaches. The WYSIWYG view has little distortion between the two spaces, i.e. the distance between individuals in the observed space and the decision space is generally the same (upholds individual fairness), whilst WAE reduces the distance between the groups in the decision space (upholds group fairness). The authors encourage applying WAE when a decision is made using observed data with known group bias, whereas WYSIWYG is recommended when decisions need to be informed by individual performance. They also show that the WYSIWYG perspective ensures only individual fairness is met and that group fairness is impossible, since in this mindset applying a group fairness mechanism, which causes distortion, would be considered discriminatory. In contrast, the WAE approach only imposes group fairness and individual fairness is unattainable. This is because, by applying an individual fairness measure here, groups that are a sizeable distance apart in the observed space will also be separated in the decision space, which is unfair according to WAE.

It should be noted that the literature reviewed here does not extend its discussion of individual and group incompatibility to the difference measures that are often used to highlight group disparities, e.g. equal opportunity = (male group probability of a correctly predicted positive outcome) - (female group probability of a correctly predicted positive outcome). Moreover, the publications mentioned do not discuss counterfactual metrics, which are individual-level fairness measures that also consider the impact of a specific feature, such as gender, on the model behaviour. For example, they can be used to evaluate how the probability of a positive outcome changes if the individual were male instead of female, whilst the remaining information about the individual stays the same. Refer to Pearl (2010) for more details.
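The counterfactual question described above can be sketched as a simple attribute flip: change the gender of one individual, keep everything else fixed, and compare the predicted probabilities. The scoring function below is a hypothetical stand-in for a trained model, with made-up coefficients, and the flip deliberately ignores any causal effect of gender on the other features, which a full causal treatment in the sense of Pearl would have to model.

```python
import math


def predict_proba(person):
    """Hypothetical logistic scorer standing in for a trained model."""
    score = -3.0 + 0.05 * person["hours_per_week"] + 0.8 * person["is_male"]
    return 1.0 / (1.0 + math.exp(-score))


def counterfactual_gap(person, feature="is_male"):
    """Change in P(positive outcome) when a binary feature is flipped, all else equal."""
    flipped = dict(person)
    flipped[feature] = 1 - person[feature]
    return predict_proba(flipped) - predict_proba(person)


# Two hypothetical individuals who differ only in gender.
woman = {"hours_per_week": 40, "is_male": 0}
man = {"hours_per_week": 40, "is_male": 1}

gap_for_woman = counterfactual_gap(woman)  # effect of "being male instead"
gap_for_man = counterfactual_gap(man)      # effect of "being female instead"
print(gap_for_woman, gap_for_man)
```

A gap of zero for every individual would indicate that the model output does not depend directly on the flipped feature. A non-zero gap, as in this toy scorer, flags individual-level dependence on gender.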

This literature review suggests there is a potential imbalance between group and individual fairness in machine learning models. Failing to recognise individual-level bias and trying to maximise group fairness can result in unfair classification discrepancies amongst similar individuals. Conversely, using individual fairness as a stand-alone measure to assess and remove bias can lead to overt group discrimination. This issue of incompatibility is further examined with the application of a debiasing algorithm.

We apply a repair method to an example dataset to demonstrate the inconsistencies of different fairness measures. The aim of this method is to improve group fairness, where the groupings are chosen based upon certain demographics, such as females, being vulnerable to discrimination. In this example, the groups considered are male and female, and the prejudice experienced by females in contrast to males is highlighted with the group True Positive Rate (TPR - the probability of a correctly predicted positive outcome). Furthermore, the consistency metric defined above (see (2)) is applied to assess individual fairness.

The Adult dataset is used in this example to train an income prediction model, together with a resampling-based repair algorithm from the Etiq library. A sample of the Python code needed to build the model and then perform the repair with the Etiq library is shown below. Essentially, running this code generates two models, where one is trained with the original dataset and the other with the repaired, debiased dataset. Fairness metrics are computed for each model, which means that measures for before and after the repair are obtained. Next, we compare these two sets of results to evaluate the impact of debiasing on fairness and determine whether they support the literature.

```
from etiq_core import *

data = load_sample('adultdata')

# NOTE: `transforms` and `debias_param` are not defined in this sample and
# must be set up beforehand (presumably the preprocessing steps and the
# protected-attribute configuration, respectively).

# DatasetLoader transforms the data and splits it into training/validation/testing data.
dl = DatasetLoader(data=data,
                   label='income',
                   transforms=transforms,
                   bias_params=debias_param,
                   train_valid_test_splits=[0.8, 0.1, 0.1],
                   names_col=data.columns.values)
metrics_initial = [accuracy, equal_opportunity, consistency]
xgb = DefaultXGBoostClassifier()

# DataPipeline computes metrics using the model provided.
pipeline_initial = DataPipeline(dataset_loader=dl, model=xgb, metrics=metrics_initial)
pipeline_initial.run()

# Identify bias issues.
identify_pipeline = IdentifyBiasSources(nr_groups=20,  # nr_groups = number of segments
                                        train_model_segment=True,
                                        group_def=['unsupervised'],
                                        fit_metrics=[accuracy, equal_opportunity],
                                        cutoff=0.2)

# Apply the repair.
repair_pipeline = RepairResamplePipeline(steps=[ResampleUnbiasedSegmentsStep(ratio_resample=1)],
                                         random_seed=4)

# DebiasPipeline computes metrics for the repaired dataset using the model provided.
debias_pipeline = DebiasPipeline(data_pipeline=pipeline_initial,
                                 model=xgb,
                                 metrics=metrics_initial,
                                 identify_pipeline=identify_pipeline,
                                 repair_pipeline=repair_pipeline)
debias_pipeline.run()

# Retrieve the calculated metrics.
debias_pipeline.get_protected_metrics()
```

The fairness results that correspond to the Adult dataset before (referred to as the baseline) and after the repair are shown in Figures 1-3. In Figure 1, the model accuracy (left) and equal opportunity measure (TPR males - TPR females) (right) are depicted. These two measures help to assess the effectiveness of the repair: if it is successful, accuracy is expected to remain relatively unchanged and equal opportunity should reflect the removal of group bias. Here, accuracy is shown to be generally unaffected by debiasing, and the baseline bias towards males changes to a slight bias towards females through repair. Hence, these plots suggest that the debiasing process has been effective. However, the full impact of the repair should be further examined with other metric types, such as individual fairness measures.

In Figure 2, the TPR (top panel) and the consistency measure (bottom panel) are portrayed, which are group and individual fairness metrics respectively. The results for the original dataset and the repaired dataset are compared. The plots on the left and right correspond to the male group and the female group respectively. These figures reveal a sizeable increase in the female TPR following repair, whilst the male TPR falls very slightly. In particular, the baseline male TPR was notably higher than the baseline female TPR, whereas after the repair, the female TPR is slightly higher than the male TPR. When the female TPR is low, the consistency measure is around 0.84, which suggests that individual fairness amongst females is relatively strong. However, after the repair is applied and the female TPR increases, the female consistency metric drops. This ‘seesaw effect’ between female TPR (group measure) and female consistency (individual measure) supports our findings from the literature. Note that male consistency has also decreased as a result of the repair, although the reduction is comparatively small. This behaviour also aligns with our previous discussion, since individual fairness should decrease for all individuals if group fairness mechanisms are used to remove bias. Note that the gender-specific consistency index is defined as in (2), except that index i only runs over the females (males) in the dataset and n is the number of females (males). It should also be noted that here k=5, which is consistent with AI Fairness 360’s (https://aif360.mybluemix.net/) implementation of the consistency measure.

In Figure 3, the results for a counterfactual-type fairness metric are presented for males (left) and females (right). This metric looks at whether females and males who are similar according to their features are treated equally by the model. More precisely, it is defined as

$consistency=1-\frac{\text{No. neg. pred. males (females) with at least 1 female (male) pos. pred. nn}}{\text{No. neg. pred. males (females)}}, (3)$

where nn represents nearest neighbour and only the 5 closest neighbours are considered. This measure does evaluate individual fairness since it examines whether similar individuals have similar model treatment, although it specifically concentrates on similar males and females with opposing model predictions. From these plots, it appears that males have very strong individual fairness, although there is a sizeable reduction following the repair. In contrast, the female group has a very weak individual fairness score, which also decreases after debiasing, although this reduction is smaller. These results portray a somewhat conflicting viewpoint to that conveyed by Figure 2. In particular, here female individual fairness is much lower than that of the male group and the repair seems to have more impact upon males than females. However, this fairness metric does decrease across the genders due to debiasing, which agrees with the consistency plots and the literature. This demonstrates that not only are there incompatibilities between group and individual fairness and between certain combinations of group metrics, but potentially there are also inconsistencies between different types of individual fairness metrics, such as consistency and the counterfactual-type measure used here. It should be noted that the literature reviewed above did not specifically consider counterfactual fairness and its relationship with group fairness.
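One possible reading of metric (3) is sketched below in plain Python. The function name, the Euclidean distance, the toy data and the choice to draw the nearest neighbours from the whole dataset (rather than from the opposite gender only) are assumptions of this example, since the text does not pin these details down.

```python
import math


def cross_gender_score(features, predictions, gender, target, k=5):
    """Metric (3) for the group `target`, under the assumptions stated above.

    Counts negatively predicted members of `target` that have at least one
    positively predicted opposite-gender individual among their k nearest
    neighbours, and returns one minus that fraction.
    """
    n = len(features)
    negatives = [i for i in range(n) if gender[i] == target and predictions[i] == 0]
    if not negatives:
        raise ValueError("no negatively predicted individuals in the target group")
    flagged = 0
    for i in negatives:
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(features[i], features[j]),
        )[:k]
        if any(gender[j] != target and predictions[j] == 1 for j in neighbours):
            flagged += 1
    return 1.0 - flagged / len(negatives)


# Toy data: one feature, two clusters; a single male receives a positive prediction.
toy_features = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
toy_gender = ["f", "m", "f", "m", "m", "f"]
toy_predictions = [0, 1, 0, 0, 0, 0]

female_score = cross_gender_score(toy_features, toy_predictions, toy_gender, "f", k=2)
male_score = cross_gender_score(toy_features, toy_predictions, toy_gender, "m", k=2)
print("female:", female_score, "male:", male_score)
```

In this toy example two of the three negatively predicted females sit next to the positively predicted male, so the female score is low, while no negatively predicted male has a positively predicted female neighbour, so the male score is 1.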

The potential conflict between individual and group-level fairness has been investigated. Firstly, previous studies that highlighted this incompatibility were detailed in the literature review. Next, a repair method from the Etiq library, designed to remove group bias, was applied to the Adult dataset. This example supported the literature findings, in that inconsistencies between individual and group-level metrics were identified. In particular, a ‘seesaw effect’ between TPR and consistency was observed before and after the repair. This behaviour demonstrated that by optimising group fairness using the ‘debiased’ dataset, individual fairness was compromised. Additionally, an alternative individual fairness metric was considered that focussed upon gender. Similar to the inconsistency of certain group-level measures identified by Garg et al., a potential incompatibility of individual-level metrics was suggested by the conflicting results of the consistency index and the counterfactual-type measure. Thus, when designing a machine learning model, it seems necessary to consider a variety of different metrics at both the individual and group level so that a justifiable compromise between opposing metric types can be attained.

- Binns, R., 2020. On the apparent conflict between individual and group fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20). ACM, New York, NY, USA, 514-524. https://doi.org/10.1145/3351095.3372864
- Dwork, C., Hardt, M., Pitassi, T., Reingold, O. and Zemel, R., 2012. Fairness through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS ’12). ACM, New York, NY, USA, 214-226. https://doi.org/10.1145/2090236.2090255
- Fleisher, W., 2021. What's Fair about Individual Fairness? In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES '21). ACM, New York, NY, USA, 480-490. https://doi.org/10.1145/3461702.3462621
- Friedler, S., Scheidegger, C. and Venkatasubramanian, S., 2016. On the (Im)possibility of fairness. arXiv preprint arXiv:1609.07236. https://arxiv.org/pdf/1609.07236.pdf
- Garg, P., Villasenor, J. and Foggo, V., 2020. Fairness Metrics: A Comparative Analysis. arXiv preprint arXiv:2001.07864. https://arxiv.org/pdf/2001.07864.pdf
- Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. and Galstyan, A., 2021. A Survey on Bias and Fairness in Machine Learning. ACM Computing Surveys, 54(6), 1-35. https://dl.acm.org/doi/pdf/10.1145/3457607
- Pearl, J., 2010. An introduction to causal inference. Int J Biostat, 6(2):Article 7. https://doi.org/10.2202/1557-4679.1203
- Speicher, T., Heidari, H., Grgic-Hlaca, N., Gummadi, K., Singla, A., Weller, A. and Zafar, M.B., 2018. A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality Indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 2239-2248. https://doi.org/10.1145/3219819.3220046
- Zemel, R., Wu, Y., Swersky, K., Pitassi, T. and Dwork, C., 2013. Learning Fair Representations. In Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):325-333. https://proceedings.mlr.press/v28/zemel13.html