Understanding the task/outlining success – Navigating Real-World Data Science Case Studies in Action

Understanding the task/outlining success

The core of our investigation is binary classification. Our mission can be encapsulated in the question: “Considering various attributes of an individual, can we predict the likelihood of them re-offending, both efficiently and impartially?”

The notion of efficiency is straightforward. We have an arsenal of metrics such as accuracy, precision, and AUC to evaluate model efficacy. But when we discuss “impartiality,” we need to acquaint ourselves with novel concepts and metrics. Before delving into bias and fairness quantification, we should conduct some preliminary data exploration.
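To make the efficiency metrics concrete, here is a minimal sketch that computes accuracy, precision, and AUC by hand. The labels, scores, and the 0.5 threshold are invented for illustration and are not drawn from the recidivism dataset. In practice a library such as scikit-learn would normally be used.

```python
# Toy labels and model scores, invented purely for illustration (not COMPAS data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

# Turn scores into hard predictions with an assumed threshold of 0.5.
threshold = 0.5
y_pred = [int(score >= threshold) for score in y_score]

# Count the confusion-matrix cells needed for the two metrics.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct

# AUC as the probability that a random positive outscores a random negative
# (ties count as half a win).
pos_scores = [s for t, s in zip(y_true, y_score) if t == 1]
neg_scores = [s for t, s in zip(y_true, y_score) if t == 0]
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos_scores
    for n in neg_scores
)
auc = wins / (len(pos_scores) * len(neg_scores))

print(f"accuracy={accuracy}, precision={precision}, auc={auc}")
```

Accuracy and precision depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.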

Preliminary data exploration

The intention is to predict the two_year_recid label using the dataset’s features. Specifically, the features that we’re working with are as follows:

  • sex – Binary: “Male” or “Female”
  • age – Numeric value indicating years
  • race – Categorical
  • juv_fel_count – Numeric value denoting prior juvenile felonies
  • juv_misd_count – Numeric value denoting previous juvenile misdemeanors
  • juv_other_count – Numeric value representing other juvenile convictions
  • priors_count – Numeric value indicating earlier criminal offenses
  • c_charge_degree – Binary: “F” indicating felony and “M” indicating misdemeanor

The target variable is as follows:

  • two_year_recid – Binary, indicating whether the individual re-offended within two years

Note that three separate columns describe juvenile offenses. We might consider merging them into a single column that holds the total count of juvenile offenses.

Given our goal of building an accurate and unbiased model, let’s inspect the distribution of recidivism by race. Grouping the dataset by race and summarizing the two_year_recid label shows that baseline recidivism rates vary across racial groups:


compas_df.groupby('race')['two_year_recid'].describe()

Figure 15.3 shows the resulting table of descriptive statistics:

Figure 15.3 – Recidivism descriptive statistics categorized by race; distinctive disparities in recidivism rates across racial groups are visible
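When reading a table like this, it helps to remember that the label is 0/1, so the mean column of describe() is simply the share of positive labels in each group. A minimal sketch on made-up data follows. The frame toy_df and the group names "A" and "B" are hypothetical stand-ins, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for compas_df; all values are invented.
toy_df = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "two_year_recid": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Same call as in the text: per-group descriptive statistics of the label.
summary = toy_df.groupby("race")["two_year_recid"].describe()

# For a binary 0/1 column, the mean is the fraction of rows equal to 1,
# i.e. the group's base rate for the positive label.
rates = summary["mean"]

print(summary[["count", "mean"]])
```

The count column is worth checking alongside the mean, since a rate computed from very few rows is unreliable.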

We also observe that two racial groups, Asian and Native American, are sparsely represented. Such small groups can lead to unreliable or biased inferences. For context, Asians comprise about 4% of the Broward County, Florida, population, but only about 0.44% of this dataset. In this study, we’ll recategorize individuals from the Asian and Native American groups as Other to mitigate this imbalance. This yields a more balanced distribution across race categories:


# Relabel the sparsely represented "Asian" and "Native American" groups as "Other"
# (done for educational purposes and to address imbalance in the dataset)
compas_df.loc[compas_df['race'].isin(['Native American', 'Asian']), 'race'] = 'Other'

# Visualize recidivism rates across the refined racial groups
compas_df.groupby('race')['two_year_recid'].value_counts(
    normalize=True
).unstack().plot(
    kind='bar', figsize=(10, 5), title='Recidivism Rates Classified by Race'
)

Figure 15.4 shows the resulting bar plot highlighting the differences in recidivism by race:

Figure 15.4 – A bar graph illustrating recidivism rates per racial category
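The effect of the recategorization step can be checked by comparing each group's share of the rows before and after relabeling. The sketch below does this on a small made-up frame. The frame toy_df and its values are invented for illustration, and every group is deliberately given the same rate so that nothing should be read into the numbers.

```python
import pandas as pd

# Hypothetical toy frame; the rows are invented and do not reflect the real data.
toy_df = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4 + ["Asian", "Native American"],
    "two_year_recid": [1, 0, 1, 0, 1, 0, 1, 0, 0, 1],
})

# Share of rows per group before relabeling.
shares_before = toy_df["race"].value_counts(normalize=True)

# Fold the two small groups into a single "Other" category, as in the text.
small_groups = ["Asian", "Native American"]
toy_df.loc[toy_df["race"].isin(small_groups), "race"] = "Other"

# Share of rows and positive-label rate per group after relabeling.
shares_after = toy_df["race"].value_counts(normalize=True)
rates = toy_df.groupby("race")["two_year_recid"].mean()

print(shares_before, shares_after, rates, sep="\n\n")
```

Merging does not add information about the small groups. It only avoids reporting rates for categories with too few rows to support them.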

Our findings reveal a higher recidivism rate among African-American individuals compared to the Caucasian, Hispanic, and Other groups. The underlying reasons are multifaceted and beyond this study’s scope. However, it is important to keep these disparities in baseline rates in mind.

Note

We could also have analyzed gender bias, since males and females are clearly represented differently in the dataset. For this study’s objectives, we’ll focus on racial bias.

Advancing further, let’s analyze other dataset attributes:

compas_df['c_charge_degree'].value_counts(normalize=True).plot(
    kind='bar', title='% of Charge Degree', ylabel='%', xlabel='Charge Degree'
)

The charge degree attribute is binary and, once converted to a Boolean, should be readily usable (the plot is shown in Figure 15.5):

Figure 15.5 – A breakdown of our dataset depicting felonies versus misdemeanors

Approximately 65% of charges are felonies, with the remainder being misdemeanors.
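Two feature-preparation ideas came up in this section: merging the three juvenile columns into one total, and converting the charge degree to a Boolean. The following minimal sketch shows one way to do both on made-up rows. The frame toy_df and the new column names juv_count and is_felony are hypothetical choices for illustration, not names taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows using the column names from the feature list; values are invented.
toy_df = pd.DataFrame({
    "juv_fel_count": [0, 1, 0],
    "juv_misd_count": [2, 0, 0],
    "juv_other_count": [1, 1, 0],
    "c_charge_degree": ["F", "M", "F"],
})

# Merge the three juvenile columns into a single total (row-wise sum).
juv_cols = ["juv_fel_count", "juv_misd_count", "juv_other_count"]
toy_df["juv_count"] = toy_df[juv_cols].sum(axis=1)

# Encode the binary charge degree as a Boolean: True for felony ("F"), False for misdemeanor ("M").
toy_df["is_felony"] = toy_df["c_charge_degree"] == "F"

print(toy_df[["juv_count", "is_felony"]])
```

Summing treats every juvenile offense type as equally informative, which is a modeling assumption. Keeping the original columns alongside the total would let a model weigh them separately.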