Practical examples unveiling sources of unintended bias using synthetic data
Almost every week, the press highlights examples of machine learning models with biased outputs. With discrimination at the forefront of public discussion, how is social inequality reflected in the biased outputs of ML models? Decisions made at every step of a typical data science pipeline, from formulating questions to collecting data to training and deploying models, can ultimately harm downstream users¹. Our goal is to build a practical understanding of how different sources of bias show up in the data. To that end, we’ll build examples using synthetic data to illustrate how different sources of bias affect ML outputs and their underlying characteristics. The guiding principle is that a good way to understand something is to build it yourself!
We’ll think about biased datasets in the context of classification, here the task of predicting a binary target outcome (the label): “Will the credit card offer be accepted or rejected?”, “Will the applicant pay back the loan or not?”. A predictive model uses the features of a particular application (what the bank knows about the applicant) to predict the associated label. The workflow for building predictive models is to put together a dataset with features that may be relevant to the predicted label and train a model that predicts the label on the training dataset with the highest accuracy.
For many data scientists, the goal when doing predictive modelling is to train a model of tolerable complexity that is sufficiently accurate at predicting the label. In recent years, many have pointed out that this approach of optimizing model accuracy obscures the goal of building fair, equitable models for all users. There are many examples of models that are accurate but nevertheless lead to very harmful outcomes, especially for groups defined by protected characteristics (for example, age or race) that have historically faced discrimination: from HR applications that invariably predict male applicants are more qualified for the job², to bail models that predict black persons are more likely to re-offend³, to health insurance models that recommend less sick white people over more sick black people for preventative care⁴. The law in many countries defines a person’s protected attributes as race, gender, age, sexual orientation, physical or mental disability, and marital status⁵ ⁶. Discrimination appears between privileged (e.g., men) and unprivileged (e.g., women or LGBTQ people) groups across a protected attribute (e.g., gender or sexual orientation). It is also important to evaluate outcomes for groups at the intersection of several protected attributes, such as black LGBTQ individuals vs white men.
Exploratory data analysis (EDA) and fairness metrics are important tools. EDA can warn us if there is a large disparity in the proportion of labels for a group (for example, only 10% of the positive labels belong to women), and motivate further investigation. Fairness metrics⁷ ⁹ allow us to set a desirable output for the model and check whether this output is achieved across groups. One word of caution: fairness metrics cannot all be satisfied simultaneously⁷. We’ll use two commonly employed fairness metrics: demographic parity and equal opportunity. Demographic parity asks that assigned labels are independent of group membership. The metric is computed as the ratio of the rate of positive labels in the unprivileged group to the rate in the privileged group. The ratio is 1 when the probability of a positive label is independent of group membership; a ratio of at least 0.8 is considered reasonable based on a generalization of the 80 percent rule advocated by the US EEOC⁸, and smaller numbers are indicative of bias. The equal opportunity metric highlights the fact that a positive label is often a desirable outcome (“an opportunity” such as “the mortgage loan is approved”) and thus focuses on comparing the True Positive Rate (TPR) across groups: the rate at which positive labels are predicted correctly as positive. The TPR metric has the added benefit of accommodating different baseline rates for the compared groups, as it asks what percentage of the expected number of positive labels have been found.
Top row, Actual labels. Bottom row, Predicted labels
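The two metrics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the figures in this post: the function names, the toy labels and the "M"/"F" group coding are all made up for the example.

```python
import numpy as np

def demographic_parity_ratio(y_pred, group, unprivileged, privileged):
    """Positive-prediction rate of the unprivileged group divided by
    the positive-prediction rate of the privileged group."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_unpriv = y_pred[group == unprivileged].mean()
    rate_priv = y_pred[group == privileged].mean()
    return rate_unpriv / rate_priv

def true_positive_rate(y_true, y_pred, group, value):
    """Share of the actual positives in one group that are predicted positive."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    actual_positives = (group == value) & (y_true == 1)
    return y_pred[actual_positives].mean()

# Toy data, invented for illustration: 1 is the positive label.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
group = np.array(["M", "M", "M", "M", "F", "F", "F", "F"])

dp = demographic_parity_ratio(y_pred, group, unprivileged="F", privileged="M")
tpr_m = true_positive_rate(y_true, y_pred, group, "M")
tpr_f = true_positive_rate(y_true, y_pred, group, "F")

print(f"demographic parity ratio: {dp:.2f}")
print(f"TPR men: {tpr_m:.2f}, TPR women: {tpr_f:.2f}")
```

In this toy example the demographic parity ratio falls well below 0.8, and the TPR for women is half the TPR for men, so both metrics would flag the predictions.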
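The five sources of bias we’ll dig into (based on⁹) are proxies, limited features, skewed samples, sample size disparity and tainted examples.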
Our starting point is a simple scenario. Imagine an HR company that is trying to predict an applicant’s salary when considering what offer to extend them. The company has collected around 100k data points. For each applicant, the company knows their previous job type (backend or frontend developer), years of work experience, gender, extra certifications and salary, which for simplicity is thresholded into under or over 52k. We built this dataset using a simple linear regression model:
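A dataset of this shape could be generated along the following lines. This is a sketch under stated assumptions rather than the original generating code: the coefficients, the noise level, the 0 to 20 year range and the 0/1 codings are all invented, and the numbers it produces will not match the figures in this post. Only the column names, the 52k threshold and the roughly 0.7 gender/certification correlation mentioned later come from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000  # roughly the dataset size mentioned in the text

# Features. All distributions and codings below are assumptions.
gender = rng.integers(0, 2, size=n)    # 0 = male, 1 = female (arbitrary coding)
backend = rng.integers(0, 2, size=n)   # 1 = backend, 0 = frontend
years = rng.uniform(0, 20, size=n)     # years of work experience

# Certifications agree with gender 85% of the time. For two balanced
# binary variables this gives a correlation of about 2 * 0.85 - 1 = 0.7.
agree = rng.random(n) < 0.85
certifications = np.where(agree, gender, 1 - gender)

# Linear model for salary (in thousands) plus Gaussian noise.
# The coefficients are hypothetical. Gender has no direct effect.
salary = (
    30
    + 2.0 * years
    + 2.0 * backend
    + 0.5 * certifications
    + rng.normal(0, 3, size=n)
)

# Binary label: is the salary over the 52k threshold?
label = (salary > 52).astype(int)

print("share of salaries over 52k:", label.mean().round(3))
print("corr(gender, certifications):",
      np.corrcoef(gender, certifications)[0, 1].round(3))
```

Building the label from an explicit formula means we know exactly which features drive the outcome, which is what later lets us check whether an analysis tool finds a bias we planted ourselves.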
We start with a synthetic dataset of ~ 100K applicants. Here are the first few rows of the data:
Initially, there’s no difference between salaries for men and women, while backend developers make on average slightly more than frontend engineers.
Top row, initially there is no difference in average salary based on gender, certifications or job type. Bottom row, salary increases linearly with longer work history. Error bars are SD.
There isn’t an imbalance in the distribution of labels. The number of applicants with salaries over and under the threshold is comparable across genders.
Similar numbers of males and females have salaries over and under the threshold in the initial dataset.
One source of bias that is very hard to avoid comes from features that are correlated with protected attributes like gender (proxies). In our dataset, there is a 0.7 correlation between gender and certifications. In a real dataset, most variables will have some correlation with gender (preferred sport, for example). Removing all the correlated variables would be impractical as we’d be left with no data to train the model. If features correlated with variables like gender are not removed, then the model can use them for classification, thus potentially leading to biased outputs. Simply removing the gender variable is ineffectual because the model will still use the gender information available in proxy variables to classify based on gender. Furthermore, removing the gender information from the dataset makes it impossible to evaluate the model by computing fairness metrics. Thus the more realistic setup is that the model is trained on features that are correlated with protected characteristics such as gender to various degrees, though perhaps not on gender information directly.
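One quick way to look for proxies is to correlate each feature with the protected attribute. The sketch below uses a small synthetic sample in which certifications agree with gender 85% of the time (an assumption chosen to give a correlation near 0.7); the feature names and codings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 10_000

gender = rng.integers(0, 2, size=n)  # 0 = male, 1 = female (arbitrary coding)
agree = rng.random(n) < 0.85         # assumption: 85% agreement with gender
features = {
    "certifications": np.where(agree, gender, 1 - gender),
    "years_experience": rng.uniform(0, 20, size=n),
    "backend": rng.integers(0, 2, size=n),
}

# Pearson correlation of each feature with the protected attribute.
correlations = {
    name: float(np.corrcoef(gender, column)[0, 1])
    for name, column in features.items()
}
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>18}: {r:+.2f}")

# How often gender can be guessed from the proxy alone,
# even though gender itself is never given to a model.
recovered = float((features["certifications"] == gender).mean())
print(f"gender recovered from certifications: {recovered:.0%}")
```

Pairwise correlation is only a first check: a model can also reconstruct a protected attribute from combinations of features that are each weakly correlated with it.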
One type of measurement bias common in models built for the entire population is a lack of informative features for minority groups. We illustrate this type of bias in the example below, where everybody’s salary is still a function of job type, years of work experience and certifications, but for a percentage of women who are backend developers, salary is determined by what software tools and frameworks they know (kubernetes!). Our example illustrates the scenario in which women are required to know more frameworks than their male counterparts as they are less likely to be hired on potential. We changed the salary of 70% of the women with 7 to 15 years of experience and backend jobs. Changing such a specific group of users allows us to check to what extent analysis tools can find them. For these women, we randomly pick their salary from a uniform distribution when in fact the salary would be fully specified had we collected information regarding software frameworks.
Compare the distribution of salaries for women before change (orange) and after change (green). In blue, the distribution of salaries for all women.
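The salary change described above could be injected roughly as follows. This is a sketch, not the original code: the salary formula and the range of the uniform distribution (here, the range of observed salaries) are assumptions, and the random mask selects approximately, not exactly, 70% of the targeted women.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 100_000

# Hypothetical baseline data (coefficients and codings are assumptions).
gender = rng.integers(0, 2, size=n)   # 1 = female
backend = rng.integers(0, 2, size=n)  # 1 = backend developer
years = rng.uniform(0, 20, size=n)
salary = 30 + 2.0 * years + 2.0 * backend + rng.normal(0, 3, size=n)
original = salary.copy()

# Target group from the text: women with backend jobs and
# 7 to 15 years of experience.
target = (gender == 1) & (backend == 1) & (years >= 7) & (years <= 15)

# Pick roughly 70% of the target group at random.
chosen = target & (rng.random(n) < 0.70)

# Replace their salary with a uniform draw, which removes the link
# between the recorded features and the salary for these applicants.
low, high = original.min(), original.max()
salary[chosen] = rng.uniform(low, high, size=chosen.sum())

women = gender == 1
print("share of target group changed:", round(chosen.sum() / target.sum(), 3))
print("mean salary, women, before:", original[women].mean().round(2))
print("mean salary, women, after: ", salary[women].mean().round(2))
```

Keeping a copy of the unmodified salaries makes it easy to compare distributions before and after the change.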
We learn from exploratory data analysis that after the change in salaries, the salary distributions across the different variables do not change significantly, apart from a small drop in mean salary for women.
The change in salaries we introduced earlier is not obvious after conducting some exploratory data analysis
Furthermore, the distribution of labels is similarly balanced across males and females (women account for 42% of the salaries above the threshold, and men for 58%). This particular dataset is balanced in terms of the distribution of the labels we are trying to predict, so we are not in the sample size disparity scenario, which we’ll describe in more detail in part 2. Because exploratory analysis alone does not reveal the change we introduced, computing actual fairness metrics is an important step in the model building process.
Counts of male and female applicants under and over the salary threshold are fairly similar
With this dataset in hand, we now train a model to predict whether applicants make under or over 52k (the threshold). We split our dataset into a training set and a test set, train the model on the training set and evaluate it on the test set. A minimally tuned XGBoost model trained on the modified dataset has an 86% accuracy in predicting whether the applicant’s salary is over or under the cutoff. Let’s compute the fairness metrics: demographic parity and equal opportunity.
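The workflow above (split, train, evaluate, then compute metrics per group) can be sketched end to end. To keep the example dependent only on NumPy and scikit-learn, it swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost. The data generation, coefficients and codings are assumptions, so the numbers it prints will differ from the 86% accuracy and the metric values reported in this post.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=3)
n = 40_000

# Hypothetical synthetic data (all coefficients are assumptions).
gender = rng.integers(0, 2, size=n)   # 1 = female
backend = rng.integers(0, 2, size=n)  # 1 = backend developer
years = rng.uniform(0, 20, size=n)
certs = np.where(rng.random(n) < 0.85, gender, 1 - gender)  # proxy for gender
salary = 30 + 2.0 * years + 2.0 * backend + rng.normal(0, 3, size=n)

# Limited-features change: about 70% of backend women with 7 to 15
# years of experience get a salary unrelated to the recorded features.
chosen = ((gender == 1) & (backend == 1) & (years >= 7) & (years <= 15)
          & (rng.random(n) < 0.70))
salary[chosen] = rng.uniform(salary.min(), salary.max(), size=chosen.sum())

y = (salary > 52).astype(int)
# Gender is not a feature; it is kept aside only to evaluate fairness.
X = np.column_stack([backend, years, certs])

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, gender, test_size=0.3, random_state=0
)
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

accuracy = (pred == y_te).mean()
# Rate of positive predictions per group (0 = men, 1 = women).
pos_rate = {g: pred[g_te == g].mean() for g in (0, 1)}
# True positive rate per group.
tpr = {g: pred[(g_te == g) & (y_te == 1)].mean() for g in (0, 1)}

print(f"accuracy: {accuracy:.3f}")
print(f"demographic parity ratio (women / men): {pos_rate[1] / pos_rate[0]:.3f}")
print(f"TPR men: {tpr[0]:.3f}, TPR women: {tpr[1]:.3f}")
```

Note that gender has to travel through the split alongside the features, even though the model never sees it; without it, neither metric could be computed on the test set.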
The equal opportunity metric highlights a gap between males and females with TPR=0.77. To see where this gap comes from, recall that, as we’d expect from the linear model we used to generate the data, there’s a linear relationship between years of work experience and salary. After changing the salary of a portion of female applicants, the relationship between salary and years of experience changes. For applicants with 7 to 15 years of work experience, salary can take a larger range of values. The model can no longer predict outcomes for this group as accurately as before and makes more errors, which lowers the TPR for women and therefore the equal opportunity metric.
Top, there is a linear relationship between years of work experience (distribution on the top edge of the graph) and salary (distribution on the right edge). Bottom: after modifying the salaries of 70% of the women with 7 to 15 years of experience, there is a wider range of possible salaries for applicants in that group, thus degrading model performance.
Synthetic datasets such as the one we created above can be very helpful in testing different debiasing approaches. In the case above, where the dataset has limited features for applicants from unprivileged groups, standard exploratory data analysis was not as helpful as fairness metrics in identifying poor model performance. Once the additional metrics have identified the limited feature issue, the task ahead is to improve prediction performance for this group of applicants. An added source of difficulty with real-life datasets is that many sources of bias will be present simultaneously, making the task of building an equitable model difficult. Lastly, it is worth keeping in mind that the sources of bias we’ve explored so far are easier to diagnose because they cause models to predict outcomes less accurately for a particular group¹⁰. The most difficult task is to recognize when models make perfectly accurate predictions, but those predictions reflect the inequalities present in our society. Beyond the data itself, there are many other sources of bias when it comes to machine learning. A diverse team evaluating model predictions and a critical attitude towards these predictions are essential to prevent the automatic propagation of discrimination.
In part 2, we’ll look at other sources of bias: skewed samples, sample size disparity and tainted examples.