Types of bias
Understanding bias in ML requires us to categorize it into different types based on its source and manifestation. This helps us to identify the root cause and implement targeted mitigation strategies. Here are the key types of bias to consider:
- Disparate impact: Disparate impact refers to situations where an ML system’s outcomes disproportionately disadvantage a specific group, typically a protected class defined by attributes such as race, sex, age, or religion. This often occurs even when the protected attribute isn’t explicitly included in the model.
- Disparate treatment: In contrast to disparate impact, disparate treatment happens when an ML model treats individuals differently based on their membership in a protected group, implying a direct, discriminatory effect. This occurs when the protected attribute is explicitly used in the decision-making process.
- Pre-existing bias: Also known as historical bias, pre-existing bias emerges when the data used to train an ML model reflects existing prejudices or societal biases. The model, in essence, learns these biases and propagates them in its predictions.
- Sample bias: Sample bias occurs when the data used to train a model isn’t representative of the population it’s supposed to serve. This can result in a model that performs well on the training data but poorly on the actual data it encounters in production, leading to unfair outcomes.
- Measurement bias: Measurement bias arises when there are systematic errors in the way data is collected or measured. This can distort the training data and cause the model to learn incorrect associations.
- Aggregation bias: Aggregation bias occurs when a model oversimplifies or fails to capture the diversity and nuances within a group. This can happen when distinct subgroups are treated as one homogeneous group, which can lead to the model making incorrect or unfair generalizations.
- Proxy bias: Proxy bias takes place when a model uses an attribute as a stand-in for a protected attribute. For instance, zip codes might be used as a proxy for race or income level, leading to biased outcomes even when the protected attribute isn’t directly used.
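The disparate impact described above is often quantified with the four-fifths (80%) rule: compare the positive-outcome rate of the protected group with that of everyone else, and flag a ratio below 0.8. A minimal sketch, using hypothetical hiring data (the group labels and outcomes are invented for illustration):

```python
def disparate_impact_ratio(outcomes, groups, protected_value):
    """Ratio of positive-outcome rates: protected group vs. everyone else."""
    protected = [o for o, g in zip(outcomes, groups) if g == protected_value]
    others = [o for o, g in zip(outcomes, groups) if g != protected_value]
    rate_protected = sum(protected) / len(protected)
    rate_others = sum(others) / len(others)
    return rate_protected / rate_others

# Hypothetical hiring decisions (1 = hired) for groups "A" and "B"
outcomes = [1, 0, 0, 0, 1, 1, 1, 0, 1, 1]
groups   = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

ratio = disparate_impact_ratio(outcomes, groups, "A")
print(f"Disparate impact ratio: {ratio:.2f}")  # 0.40 / 0.80 = 0.50, below the 0.8 threshold
```

Here group A is hired at 40% versus 80% for group B, giving a ratio of 0.5 — a signal of possible disparate impact even though group membership never appears as a model feature.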
Each of these types of bias has different implications for fairness in ML and requires different strategies to detect and mitigate. By identifying the type of bias at play, we can take targeted steps to reduce its impact and work toward more fair and equitable ML systems.
Sources of algorithmic bias
ML models, which learn from past data, may unintentionally propagate bias present in their training datasets. Recognizing the roots of this bias is a vital first step toward fairer models:
- One such source is historical bias. This form of bias mirrors existing prejudices and systemic inequalities present in society. An example would be a recruitment model trained on a company’s past hiring data. If the organization historically favored a specific group for certain roles, the model could replicate these biases, continuing the cycle of bias.
- Representation or sample bias is another significant contributor. It occurs when certain groups are over- or underrepresented in the training data. For instance, training a facial recognition model predominantly on images of light-skinned individuals may cause the model to perform poorly when identifying faces with darker skin tones, favoring one group over the other.
- Proxy bias occurs when models use data from a related but distinct domain as a stand-in for the quantity of interest, leading to biased outcomes. An example would be using arrest records to predict crime rates, which may introduce bias, as the arrest data could be influenced by systemic prejudice in law enforcement.
- Aggregation bias refers to the inappropriate grouping of data, simplifying the task at the cost of accuracy. An example would be using average hemoglobin levels to predict diabetes, even though these levels vary among different ethnicities due to more complex factors.
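One simple way to surface a potential proxy like the ones above is to check how strongly a candidate feature correlates with the protected attribute it is not supposed to encode. A minimal sketch, assuming a numeric encoding of zip codes and a binary protected attribute (both the encoding and the data are hypothetical, and the 0.7 threshold is a judgment call, not a standard):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: zip codes encoded as integers, protected attribute as 0/1
zip_code  = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
protected = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

r = pearson_r(zip_code, protected)
if abs(r) > 0.7:  # strong association: the feature may leak the protected attribute
    print(f"zip_code may act as a proxy (r = {r:.2f})")
```

A strong correlation does not prove discrimination, but it flags features that deserve scrutiny before they are dropped, transformed, or retained with monitoring. In practice, categorical attributes call for association measures beyond Pearson correlation, but the idea is the same.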
Understanding these sources of algorithmic bias is the foundation upon which we can build strategies to prevent and mitigate bias in our ML models. Thus, we contribute to the development of more equitable, fair, and inclusive AI systems.