Evaluating the Effect of Data Missingness on Fairness of Machine Learning Algorithms

Welcome to the website for our project! Here we will be introducing the core ideas of our research project, the motivation behind it, and what work we have done in order to answer the question: Does Data Missingness affect Fairness in Machine Learning?

Introduction

What's Fairness?

Fairness is the degree to which a machine learning algorithm is able to adequately respond to bias present in data. Pretty much all data in the world is biased in some way, for a variety of reasons: it may be an issue of data collection, it may be a deliberate decision to bias the data by favoring a certain population, or it might be caused by a confounding factor that is not accounted for. What we care about in our project is something known as Selection Bias: when the data that is present tells a story that is different from the real-world situation.

How Does Missingness Play Into It?

Missingness is a common occurrence in the world of data science, and the term refers to a pattern (or lack thereof) of missing observations in a dataset. An example would be a dataset containing information on various vehicles registered at a local DMV, in which some license plate numbers are missing. The fact that multiple data points are missing, and that there might be an underlying pattern to which ones, is what constitutes the concept of missingness.
If some data is missing, then it means that the data might not be representative of the real world; however, the way the data is missing determines how different the story the data tells is from the actual population. Data missing because of data entry errors is different from data being missing because those who recorded the data simply did not record certain values. So, data missingness is generally categorized based on the patterns of missingness and on dependencies between different variables (i.e. what the observations are recorded for, the typical "columns" in a data table).
The following types are generally identified, and these are the ones we focus on in our research: Missing Completely at Random (MCAR), Missing at Random (MAR), and Not Missing at Random (NMAR). Each is defined in more detail in the Processing and Evaluating Methods section below.

Our Approach to the Question

Data

In order to work on our research question, we needed a dataset that could plausibly be biased and that contains the specific missingness patterns we are trying to evaluate. As a result, we decided to create a synthetic dataset based on real data collected on citizens' complaints to the New York Police Department. We used this data because it contains information about the complainants' gender, ethnicity, age, and other attributes that could be discriminated upon by the police department. Therefore, the data might show some biases, and an algorithm that processes this data might pick up on those biases and produce unfair decisions.
Our dataset is synthetic in the sense that we take the original dataset and simulate our own distribution of missingness: we use statistical representations of each type of missingness to select which observations to "drop" from the dataset. As a result, we end up with a dataset whose missing values follow a pattern that we control. More specifically, the complainant age attribute was chosen to contain the missingness, for two reasons. First, the attribute with missing values must be highly correlated with the chosen sensitive attribute and the label outcome attribute, because we do not want to alter those two delicate attributes directly. Second, when we synthetically create MAR data, the missingness we introduce has to make sense, which means it must depend strongly on either the sensitive attribute or the label outcome attribute. If you want to know more about how we simulate our dataset, feel free to visit the Methods In-Depth page.
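As a rough sketch of the "drop" step (the column names, toy data, and drop probabilities below are illustrative placeholders, not the project's actual values), blanking out the complainant age under a pattern that depends on an observed attribute might look like this:

```python
# Rough sketch of the "drop" step; column names, toy data, and probabilities
# below are illustrative, not the project's actual values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the complaint data.
complaints = pd.DataFrame({
    "complainant_gender": rng.integers(0, 2, 1000),
    "complainant_age": rng.integers(18, 80, 1000).astype(float),
})

def apply_missingness(df, target, p_missing):
    """Blank out `target` wherever an independent Bernoulli(p_missing) draw is 1."""
    out = df.copy()
    out.loc[rng.random(len(out)) < p_missing, target] = np.nan
    return out

# Example: a MAR-style pattern where the drop probability depends on the
# (observed) sensitive attribute rather than on the age value itself.
p = np.where(complaints["complainant_gender"] == 1, 0.35, 0.05)
complaints_mar = apply_missingness(complaints, "complainant_age", p)
print(complaints_mar["complainant_age"].isna().mean())
```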

Processing and Evaluating Methods

Missingness in machine learning data sets is common, and several types of missingness are generally distinguished: Missing Completely at Random (MCAR), Not Missing at Random (NMAR), and Missing at Random (MAR).

MCAR: Under MCAR, the missingness of the data is completely independent of both observed and unobserved values. Suppose X is a random variable and R is the binary indicator of missingness, with R = 1 (missing) and R = 0 (not missing). Because X and R are independent under MCAR, their joint distribution factors as:

P(X = x, R = r) = P(X = x) P(R = r)

NMAR: Under NMAR, missingness depends on the unobserved (missing) values themselves, so the observed data alone cannot be used to estimate the distribution of the missing values. Correlation between observed and missing data can, however, help to estimate a conditional distribution. Suppose Y is the key variable, R is the binary indicator of the missingness of Y with R = 1 (missing) and R = 0 (not missing), and X represents observed variables related to Y and R. The joint probability distribution of Y, X, and R can be factored as:

𝑃(π‘Œ = 𝑦, 𝑋 = π‘₯, 𝑅 = π‘Ÿ) = 𝑃(π‘Œ = 𝑦 | 𝑋 = π‘₯, 𝑅 = π‘Ÿ) 𝑃(𝑋 = π‘₯, 𝑅 = π‘Ÿ)

The estimated conditional distribution is then used to fill in the missing values in the data set.
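As a generic illustration of filling gaps from an estimated conditional distribution (not the project's exact procedure; the toy data and group-wise mean below are our own simplification):

```python
# Generic illustration of conditional imputation: estimate E[age | group]
# from the observed rows and use it to fill the gaps. Toy data, made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": rng.integers(0, 3, 200),            # fully observed covariate
    "age": rng.normal(40, 12, 200).round(),      # variable with missing values
})
df.loc[rng.random(200) < 0.2, "age"] = np.nan    # introduce some gaps

group_means = df.groupby("group")["age"].transform("mean")  # ignores NaNs
df["age_imputed"] = df["age"].fillna(group_means)
print(df["age_imputed"].isna().sum())            # 0: every gap filled
```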

MAR: Under MAR, missingness depends only on variables that are observed in the data set, not on the value that is missing. Suppose we have a data set in which Y is the dependent variable, X is the group of observed covariates, and R indicates the presence or absence of Y with R = 1 (missing) and R = 0 (not missing). If the probability of missingness depends purely on the observed variables X, the data are MAR:

P(R = 1 | Y, X) = P(R = 1 | X)
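To make the three mechanisms concrete, here is a minimal sketch (our own illustration, with made-up probabilities and variable names) of how the missingness indicator R could be generated under each:

```python
# Illustrative only: generate the missingness indicator R under each mechanism,
# given an observed covariate x and a value y that may go missing.
# All probabilities and variable names here are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.integers(0, 2, n)     # observed covariate (e.g. a sensitive attribute)
y = rng.normal(40, 12, n)     # variable whose values may go missing

# MCAR: P(R = 1) is a constant, independent of both x and y.
r_mcar = rng.random(n) < 0.2

# MAR: P(R = 1 | Y, X) = P(R = 1 | X) -- depends only on the observed x.
r_mar = rng.random(n) < np.where(x == 1, 0.35, 0.05)

# NMAR: P(R = 1 | Y, X) still depends on the (soon-to-be-missing) value y itself.
r_nmar = rng.random(n) < np.where(y > 50, 0.40, 0.05)

# Applying a mask leaves NaN wherever R = 1, e.g. for the MAR pattern:
y_observed = np.where(r_mar, np.nan, y)
```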

We apply a data set repair technique to identify fairness issues under each of the missingness patterns described above. The AIF360 library is used to remove the dependence of the predictions on the sensitive variable and thereby produce less biased results; the algorithm that does this is called Adversarial Debiasing. To evaluate the fairness notions, the data set is divided into training data and test data: the training data is used to learn the relationships between the variables, and the test data is used to measure how fair the results are. One-hot encoding (OHE) from the scikit-learn library is applied to the categorical variables in the training data, and the same encoded data can be reused when evaluating the different notions of fairness.
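A minimal sketch of this pipeline follows; the toy data frame, column names, and group encodings are our own placeholders rather than the project's actual schema, and we assume AIF360's TensorFlow-based AdversarialDebiasing and ClassificationMetric.

```python
# Minimal sketch of the processing pipeline; the toy data, column names, and
# group encodings are placeholders, not the project's actual schema.
import numpy as np
import pandas as pd
import tensorflow.compat.v1 as tf
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.inprocessing import AdversarialDebiasing
from aif360.metrics import ClassificationMetric

tf.disable_eager_execution()  # AIF360's adversarial debiasing uses TF1 sessions

# Toy stand-in for the (already one-hot encoded) complaint data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": rng.integers(0, 2, 1000),   # sensitive attribute
    "age": rng.integers(18, 80, 1000),    # feature (the one we blank out elsewhere)
    "label": rng.integers(0, 2, 1000),    # outcome being predicted
})
data = BinaryLabelDataset(df=df, label_names=["label"],
                          protected_attribute_names=["gender"])
train, test = data.split([0.7], shuffle=True)

priv, unpriv = [{"gender": 1}], [{"gender": 0}]
sess = tf.Session()
model = AdversarialDebiasing(privileged_groups=priv, unprivileged_groups=unpriv,
                             scope_name="debiased_classifier", debias=True,
                             sess=sess)
model.fit(train)
preds = model.predict(test)

# Fairness notions evaluated on the held-out test data.
metric = ClassificationMetric(test, preds,
                              unprivileged_groups=unpriv, privileged_groups=priv)
print("Statistical parity difference:", metric.statistical_parity_difference())
print("Average odds difference:", metric.average_odds_difference())
sess.close()
```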

We cover the meaning of each fairness notion in more detail on the relevant Methods subpage.

Results and Interpretation

As we evaluate our results, we first notice that our dataset with no initial missingness performs quite well in terms of both fairness and accuracy. This serves as a baseline for what is most likely the best performance our model can achieve: once we introduce missingness, there is a chance that the model becomes more unfair and potentially less accurate. It is also possible, though unlikely, for the results to become fairer or more accurate under one of the missingness types.
As we introduce NMAR, MAR, and MCAR missingness, we begin to see differences in the measurements of fairness. To begin with, all of our fairness measurements increase slightly, meaning that the missing data is having an effect on the measured fairness of the dataset. While NMAR does not show a significant difference in fairness measurements from the dataset with no missing data, the fairness values increase more with MCAR. Last but not least, MAR performs the poorest of all the missingness types: each fairness notion for MAR increases, indicating poor performance in fairness. Since we are using Adversarial Debiasing as our fairness intervention technique (see the Methods In-Depth page), the accuracies amongst the four datasets hardly change at all.
Based on our results, MAR performed the worst, MCAR performed the second worst, and NMAR performed the best in terms of our chosen fairness notions. It is first important to mention that all types of missingness (including no missingness at all) had very similar accuracies. This is expected, since Adversarial Debiasing, our in-processing fairness intervention technique (which also served as our classifier), maximizes accuracy while improving fairness as much as possible. Because it does this with a neural network, in each case the loss (which incorporates the fairness considerations built into AIF360's Adversarial Debiasing method) was minimized with the accuracy landing around 76%. In other words, the model found no tradeoff worth making: significantly decreasing accuracy would only slightly reduce bias, so the accuracy stayed around this level in every case. This does not bother us, since we are more focused on how missingness affects our fairness notions; while accuracy is always important to consider, we are satisfied with over three quarters of the classifier's predictions being correct.

With that established, we can focus on why each missingness type's fairness results look the way they do. Our results matched the logical expectations we formed at the beginning of the project. A pattern of missingness should noticeably affect fairness when the missingness depends on a sensitive attribute; that suggests a pattern such as MAR, which depends on a different (observed) attribute, would be the one to introduce the most bias into the dataset. Our data supports this, with MAR consistently being the type of missingness that causes the highest divergence from 0 across fairness notions. We also observed MCAR behaving similarly, which can be explained by the random distribution of missing data across the dataset: sensitive attributes lose data points at random, at the same rate as non-sensitive attributes, and it is the loss of points tied to sensitive attributes that drives the fairness measurements, so the results for MCAR are consistent with our intuition.

In conclusion, the types of missingness whose pattern does not depend on the missing values themselves (MCAR and MAR) are the ones that caused the most bias in our experiments. With regards to fairness, one should generally be most concerned with how MAR affects bias, because a non-sensitive attribute can depend on a sensitive attribute's outcome, which inherently carries the most bias. MCAR should be less of a concern, since its pattern can be explained completely by randomness. NMAR also should generally not influence the outcome much, assuming the attribute with missing values is not a sensitive one, although it should still be handled with care.

What We Learned

Throughout our project, we studied the effects of missingness on fairness, as well as how fairness itself can be evaluated and approached during the implementation of an algorithm. Fairness can be evaluated using a series of fairness notions, and we learned not only what they are and how to calculate them, but also how to use them to work towards a debiased dataset. Fairness notions indicate whether one group is favored over another in the dataset; if we can reconstruct our dataset in a way that brings a notion closer to zero, that means there is less difference between the groups. We learned how to use dataset repair techniques from packages such as AIF360 to approach this problem and to reshape the dataset in a way that creates a fairer picture of the real world while still keeping the data representative of the population.
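For example, one such notion, statistical parity difference, is simply the gap in positive-prediction rates between groups; a tiny illustration (with made-up predictions, not our project's results):

```python
# Tiny illustration of one fairness notion, statistical parity difference:
# the gap in positive-prediction rates between groups. Numbers are made up.
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # classifier decisions
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # 0 = unprivileged, 1 = privileged

rate_unpriv = y_pred[group == 0].mean()  # 0.6
rate_priv   = y_pred[group == 1].mean()  # 0.8
print("Statistical parity difference:", rate_unpriv - rate_priv)  # ~ -0.2; closer to 0 is fairer
```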
A lot of what we learned lies in the definition of missingness, its distribution, and how we can statistically explain and evaluate missingness in a dataset. At the end of the day, missingness is just a statistical distribution of missing values in the dataset; like any other recognizable pattern in the data, it can affect the outcome of the analysis. In our research, we learned about different types of bias and how they are mathematically defined, as well as about the relationships between sensitive and non-sensitive attributes, the final outcome, and the original labels used for training. All of these are linked in a causal graph (Salimi et al., 2019) that showcases the relationships between the attributes. If a sensitive attribute that is shown to be involved in causal relationships with the outcome has noticeable patterns of missing data, then it can severely skew the labels that the algorithm will be predicting.
Logically, missingness can be treated like any other data, but the difference is that there is no story that can be told from this "data": non-missing values are tied to specific outcomes in the real world, while the absence of values does not tell us or the algorithm anything about what is happening. Hence, we needed to see how absent data can affect the algorithm: how will it learn to predict the outcome if it is missing whole chunks of data that could be crucial to providing an accurate and unbiased representation of the target population?