# Some initial setup
options(digits = 3)
library(tidyverse)
theme_set(theme_bw())
# Because huge plots are ugly
options(repr.plot.width = 6, repr.plot.height = 4)
# Read the data
compas_df <- read_rds("../data/compas.rds")
# Recap the model
recid_model <- glm(is_recid ~ priors_count + age, data = compas_df, family = "binomial")
compas_df <- compas_df %>%
mutate(
risk = predict(recid_model, type = "response"),
risk_bin = round(risk * 10),
binary_recid = risk >= 0.5
)
Recall that the cleaned version of the COMPAS data is loaded as compas_df
, with the following columns
id
: unique identifiers for each casesex
, dob
, age
, race
: demographic information for each defendantrecid_score
, violence_score
: COMPAS scores assessing risk that a defendant will recidivate (violence_score
for violent crimes) within two years of release (higher score correspond to higher risk)priors_count
: number of prior arrestsis_recid
, is_violent_recid
: Indicator variable that is 1
if the defendant was arrested for a new (violent) crime within two years of release, and 0
otherwise.and after fitting our model, we have added the following columns
risk
: the model-predicted probability of recidivismpredicted_risk_score
: a integer risk score between 0 and 10pred_recid
: a binary prediction of whether each defendant will recidivatehead(compas_df)
Last week we examined how our recidivism prediction model performed for different racial groups, and it turned out our model was well calibrated for white and Black defendants.
For this exercise, we will continue examining calibration of our model predictions, but for different genders.
We will reuse the risk_bin
we calculated last week, which is a discretized (rounded) version of predicted risk probability.
For Exercise 1, calculate recidivism rates for male and female defendants in our dataset by creating a data frame called calibration_by_gender
containing three columns: sex
, risk_bin
, and recidivism_rate
.
Additionally, to ensure we have enough defendents of each gender in every score bucket, we will limit our maximum score to 8 and remove everyone with score greater than 8.
# Calculate discretized risk score
# group people with risk score equal or gretaer than 8
calibration_by_gender <- compas_df %>%
# WRITE CODE HERE
# Put the recidivism rates of different races side by side
calibration_by_gender %>%
spread(sex, recidivism_rate) %>%
group_by(risk_bin) %>%
summarize(
n = sum(n),
Female = first(na.omit(Female)),
Male = first(na.omit(Male))
)
# Calibration plot
ggplot(compas_df,
aes(x = risk_bin, y = is_recid, color = sex, group=sex)) +
geom_smooth(method="glm", method.args=list(family="binomial")) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1))+
scale_x_continuous(breaks = seq(0, 10, 2), limits = c(1, 10))+
labs(x = "\nDiscretized risk score",
y = "Recidivism rate\n")
Given the plot above, do you think a gender-blind model is "fair"?
Note that we observe roughly up to a 1-point difference for male and female risk scores from the plot above.
For example, male defendants who were scored as 4
recidivated at a rate of 40%, while females who recidivated at a similar rate were given a higher score of 5
.
Because the model is "blind" to gender, women have lower risk compared to their male counterparts who have the same score.
One way to reduce this gender disparity is to explicitly include gender (sex
) as a variable.
In this exercise, build a gender-aware ricidivism prediction model with priors_count
, age
, sex
and add two columns gender_specific_risk
and gender_specific_risk_bin
(the prediction rounded to the nearest 10%) to the data frame.
# Refit the model by including gender, look at the coefficients of the fitted model,
# and generate gender-specific recidivism rate by risk score
# WRITE CODE HERE
# compute calibration by gender
calibration_by_gender <- compas_df %>%
filter(gender_specific_risk_score <= 8) %>%
group_by(sex, gender_specific_risk_score) %>%
summarize(recidivism_rate = mean(is_recid))
# Put the recidivism rates of different races side by side
calibration_by_gender %>%
spread(sex, recidivism_rate)
# Calibration plot
ggplot(compas_df,
aes(x = gender_specific_risk_score, y = is_recid, color = sex, group=sex)) +
geom_smooth(method="glm", method.args=list(family="binomial")) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1))+
scale_x_continuous(breaks = seq(0, 10, 2), limits = c(1, 10))+
labs(x = "\nDiscretized risk score",
y = "Recidivism rate\n")
Now, let's compare our gender-specific and gender-blind models by examining the number of men and women detained at a detention risk threshold of 50%.
# Calculate number of men and women detained for gender-specific and gender-blind models
# WRITE CODE HERE
compas_df %>%
By including gender in our model, we are able to obtain a calibrated model with fewer number of women detained. However, by explicitly using gender, we violate anti-classification. What do you think of this approach?
We now introduce false positive rate (FPR) and false negative rate (FNR), two common metrics for evaluating model performance.
In our application, the false positive rate is the proporition of people who are flagged as high risk by the algorithm, among those who ultimately did not recidivate. Conversely, the false negative rate is the proportion of people who are flagged as low risk by the algorithmic, among those who ultimately did recidivate.
To more formally define these error rates, we introduce a few more terms:
Their definitions can be illustrated using following table:
Real positive | Real negative | |
---|---|---|
Predicted positive | TP | FP |
Predicted negative | FN | TN |
Then the false positive rate (FPR) is given by:
\begin{equation} FPR = \frac{FP}{N_-} = \frac{FP}{FP + TN} \end{equation}Similarly, the false negative rate (FNR) is given by:
\begin{equation} FNR = \frac{FN}{N_+} = \frac{FN}{TP + FN} \end{equation}Coming back to the racial disparity we observed last week, in this exercise, let's calculate our model's FPR and FNR for white and Black defendants using a threshold of 50% for our binary prediction (binary_recid
).
# Complete the function calc_fpr_fnr, which takes a data frame that has at least three columns: race, is_recid, and binary_recid,
# and returns a data frame with three columns: race, FPR, and FNR
calc_fpr_fnr <- function(df) {
# WRITE CODE HERE
}
calc_fpr_fnr(compas_df)
Some have advocated for equalizing FPR to create a "fair" model. One way to do that is to set different thresholds for white defendants and Black defendants. To do so, we will fix our classification threshold for white defendants at 50% and tune the threshold for Black defendants so that the false positive rates are equal for both groups.
Similarly, find the threshold for Black defendants that equalizes the false negative rates for both groups.
white_threshold = 0.5
black_threshold = 0.5 # WRITE CODE HERE
# Calculate detention and recidivism rate by race
compas_df %>%
mutate(binary_recid = risk > if_else(race == "Caucasian", white_threshold, black_threshold)) %>%
calc_fpr_fnr()
# See where the thresholds are on the risk distribution
options(repr.plot.width = 7, repr.plot.height = 3.5)
# Recall this risk distribution plot
# Now we add our thresholds in the plots
ggplot(compas_df, aes(x = risk, fill = race)) +
geom_density(alpha = 0.5, color = NA) +
scale_x_continuous("Estimated risk", labels = scales::percent_format(), expand = c(0, 0)) +
scale_y_continuous(element_blank(), expand = c(0, 0)) +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank())+
geom_vline(
xintercept = c(black_threshold, white_threshold),
color = c("red", "blue"),
alpha = 0.5
)