Introduction to the COMPAS dataset case study
In the realm of machine learning, where data drives decision-making, the line between algorithmic precision and ethical fairness often blurs. The COMPAS dataset, a collection of criminal offenders screened in Broward County, Florida, during 2013–2014, serves as a poignant reminder of this intricate dance. On the surface, it might appear to be a straightforward binary classification task, but the implications ripple far beyond simple predictions: each row isn’t just a set of features and a class label; it represents a person, with years, if not decades, of experiences, aspirations, and challenges behind them. With a primary focus on predicting recidivism (the likelihood that an offender will re-offend), we’re confronted not just with the challenge of achieving model accuracy but also with the monumental responsibility of ensuring fairness. Systemic privilege, racial discrepancies, and inherent biases in the data further accentuate the need for an approach that recognizes and mitigates these imbalances. This case study endeavors to navigate these tumultuous waters, offering insights into the biases present and, more importantly, exploring ways to strike a balance between ML accuracy and the paramount importance of human fairness. Let’s embark on this journey, bearing in mind the weight of the decisions we make and the profound impact they have in real-world scenarios.
The core of this exploration is the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset, which aggregates data on criminal offenders processed in Broward County, Florida, between 2013 and 2014. Our focus narrows to a specific subset of this data, dedicated to the binary classification task of determining the probability of recidivism based on an individual’s attributes.
For those interested, the dataset is accessible here: https://www.kaggle.com/danofer/compass.
At first glance, the task seems straightforward: a binary classification with no missing data, so why not dive right in? The conundrum surfaces when one realizes the profound consequences our ML models can have on real human lives. The mantle of responsibility is upon us, as ML engineers and data practitioners, to not just craft efficient models but also to ensure the outcomes are inherently “just.”
Throughout this case study, we’ll endeavor to delineate the multifaceted nature of “fairness.” Multiple definitions exist, and the crux lies in discerning which notion of fairness aligns with the specific domain at hand. By unpacking various fairness perspectives, we aim to clarify what each one is intended to capture.
Note
This case study is illustrative and should not be misconstrued as an exhaustive statistical analysis or a critique of America’s criminal justice framework. Rather, it’s an endeavor to spotlight potential biases in datasets and champion techniques to enhance fairness in our ML algorithms.
Without further ado, let’s dive right into our dataset:
import pandas as pd
import numpy as np

# Load the COMPAS two-year recidivism data
compas_data = pd.read_csv('../data/compas-scores-two-years.csv')

# Inspect the first five rows
compas_data.head()
Figure 15.1 shows the first five rows of our dataset:

Figure 15.1 – An initial view of the COMPAS dataset
This unveils certain sensitive data concerning individuals previously incarcerated in Broward County, Florida. The key label here is two_year_recid, which addresses the binary query: “Did this individual get incarcerated again within 24 months of release?”
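Before going any further, it’s worth checking how balanced this label is. The following quick check is illustrative; the exact proportions depend on the version of the dataset you download:
# Proportion of individuals who did / did not recidivate within two years
compas_data['two_year_recid'].value_counts(normalize=True)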
The 2016 ProPublica investigation, which scrutinized the fairness of the COMPAS algorithm and its underlying data, placed significant emphasis on the decile scores assigned to each subject. A decile score partitions a population into 10 equal segments, similar in concept to percentiles: each individual is ranked from 1 to 10, where each score denotes the tenth of the population they fall into on a given metric. To illustrate, a decile score of 3 means that 70% of subjects were assessed as posing a higher risk of re-offending (scores of 4 through 10), while 20% were assessed as posing a lower risk (scores of 1 or 2). Conversely, a score of 7 indicates that 30% received a higher risk score (8 through 10), whereas 60% received a lower one (1 through 6).
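To make the decile idea concrete, here is a small, self-contained sketch using pandas’ qcut on synthetic scores (the scores and variable names are invented for illustration and are not part of the COMPAS data):
import numpy as np
import pandas as pd

# Synthetic risk scores, purely for illustration
rng = np.random.default_rng(0)
synthetic_scores = pd.Series(rng.normal(size=1000))

# qcut splits the values into 10 equally populated bins, labeled 1-10
synthetic_deciles = pd.qcut(synthetic_scores, q=10, labels=range(1, 11))

# Each decile holds roughly 10% of the population
print(synthetic_deciles.value_counts(normalize=True).sort_index())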
Subsequent analyses revealed clear disparities in how decile scores were allocated, particularly along racial lines. Plotting the score distribution by race makes these differences evident:
# Plot the distribution of decile scores within each racial group
compas_data.groupby('race')['decile_score'].value_counts(
    normalize=True
).unstack().plot(
    kind='bar', figsize=(20, 7),
    title='Decile Score Histogram by Race', ylabel='% with Decile Score'
)
Figure 15.2 shows the resulting graph:

Figure 15.2 – The racial variances in decile score distribution are evident
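To complement the visual comparison, one quick and purely illustrative way to quantify the gap is to compute the share of each group assigned a high-risk decile, here taken as 8 or above (the threshold is our own choice):
# Share of each racial group assigned a decile score of 8, 9, or 10
high_risk_share = compas_data.groupby('race')['decile_score'].apply(
    lambda scores: (scores >= 8).mean()
)
print(high_risk_share)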
We could delve deeper into how the ProPublica investigation interpreted its findings, but our interest lies in constructing a binary classifier from the data, setting aside the pre-existing decile scores.
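As a rough sketch of that setup, assuming only the columns we have already seen, we might separate the label from the features and drop the COMPAS decile score so our model never relies on it; the actual feature selection and modeling come later:
# Separate the binary label from the remaining features,
# discarding the pre-existing decile score
label = compas_data['two_year_recid']
features = compas_data.drop(columns=['two_year_recid', 'decile_score'])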