from fairmlhealth import report, measure, stat_utils
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, TweedieRegressor
# First, we'll create a semi-randomized dataframe with specific columns for our attributes of interest
rng = np.random.RandomState(506)
N = 240
X = pd.DataFrame({'col1': rng.randint(1, 4, N),
                  'col2': rng.randint(1, 75, N),
                  'col3': rng.randint(0, 2, N),
                  'gender': [0, 1]*int(N/2),
                  'ethnicity': [1, 1, 0, 0]*int(N/4),
                  'other': [1, 0, 0, 0, 1, 0, 0, 1]*int(N/8)
                  })
# Second, we'll create a randomized target variable
y = pd.Series((X['col3']+X['gender']).values + rng.uniform(0, 6, N), name='Example_Target')
# Third, we'll split the data and use it to train two generic models
splits = train_test_split(X, y, test_size=0.5, random_state=42)
X_train, X_test, y_train, y_test = splits
model_1 = LinearRegression().fit(X_train, y_train)
model_2 = TweedieRegressor().fit(X_train, y_train)
display(X.head(), y.head())
| | col1 | col2 | col3 | gender | ethnicity | other |
|---|---|---|---|---|---|---|
| 0 | 1 | 15 | 0 | 0 | 1 | 1 |
| 1 | 3 | 51 | 1 | 1 | 1 | 0 |
| 2 | 1 | 30 | 1 | 0 | 0 | 0 |
| 3 | 2 | 28 | 1 | 1 | 0 | 0 |
| 4 | 1 | 72 | 0 | 0 | 1 | 1 |
0    1.700759
1    2.312593
2    6.117705
3    3.481302
4    1.051515
Name: Example_Target, dtype: float64
fairMLHealth has tools to create generalized reports of model bias and performance.
The primary reporting tool is now the compare function, which generates side-by-side comparisons for any number of models, for either binary classification or regression problems. Model performance metrics such as accuracy and precision (or MAE and R-squared for regression problems) are also provided to facilitate comparison.
A flagging protocol is applied by default to highlight any cells with values that are out of range. This can be turned off by passing *flag_oor = False* to report.compare().
Below is an example applying the function to a regression model. Note that the "fair" range used to evaluate regression metrics requires judgment on the part of the user. Default ranges have been set to [0.8, 1.2] for ratios, 10% of the available target range for Mean Prediction Difference, and 10% of the available MAE range for MAE Difference. If the default flags do not meet your needs, they can be turned off by passing *flag_oor = False* to report.compare(). More information is available in our Evaluating Fairness Documentation.
# Generate a measure report
report.compare(X_test, y_test, X_test['gender'], model_1, pred_type="regression")
| Metric | Measure | model 1 |
|---|---|---|
| Group Fairness | MAE Difference | 0.3878 |
| | MAE Ratio | 1.2864 |
| | Mean Prediction Difference | -1.0663 |
| | Mean Prediction Ratio | 0.7721 |
| Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 |
| | Consistency Score | 0.3652 |
| Model Performance | MAE | 1.5547 |
| | MSE | 3.3753 |
| | Mean Error | -0.1224 |
| | Mean Example_Target | 4.2513 |
| | Mean Prediction | 4.1290 |
| | Rsqrd | 0.1326 |
| | Std. Dev. Error | 1.8408 |
| | Std. Dev. Example_Target | 1.9809 |
| | Std. Dev. Prediction | 0.9631 |
| Data Metrics | Prevalence of Privileged Class (%) | 48.0000 |
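As a rough illustration of how the default ranges translate into flags, the sketch below checks three values from the report above against them. It is not fairMLHealth's implementation: the helper `is_out_of_range` is hypothetical, the 10% rule is read here as a symmetric band of plus or minus 10% of the target range, and the range [0, 8] is the theoretical range implied by how `y` was constructed rather than the observed one.

```python
def is_out_of_range(value, lower, upper):
    """Return True when value falls outside the closed interval [lower, upper]."""
    return not (lower <= value <= upper)

# Default range for ratio measures, as stated in the text
RATIO_RANGE = (0.8, 1.2)

# Assumption: "10% of the available target range" is a symmetric band around 0.
# The range [0, 8] follows from y = col3 + gender + uniform(0, 6).
target_min, target_max = 0.0, 8.0
diff_bound = 0.10 * (target_max - target_min)
DIFF_RANGE = (-diff_bound, diff_bound)

# Values copied from the regression report for model 1
measures = {
    "MAE Ratio": (1.2864, RATIO_RANGE),
    "Mean Prediction Ratio": (0.7721, RATIO_RANGE),
    "Mean Prediction Difference": (-1.0663, DIFF_RANGE),
}

flags = {name: is_out_of_range(value, *bounds)
         for name, (value, bounds) in measures.items()}
for name, flagged in flags.items():
    print(f"{name}: {'out of range' if flagged else 'within range'}")
```

Under these assumptions all three measures fall outside their ranges, which is why a judgment call on the appropriate range matters for regression targets.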
# Display the same report without performance measures
bias_report = report.compare(test_data=X_test,
targets=y_test,
protected_attr=X_test['gender'],
models=model_1,
pred_type="regression",
skip_performance=True)
print("Returned type:", type(bias_report))
display(bias_report)
Returned type: <class 'pandas.io.formats.style.Styler'>
| Metric | Measure | model 1 |
|---|---|---|
| Group Fairness | MAE Difference | 0.3878 |
| | MAE Ratio | 1.2864 |
| | Mean Prediction Difference | -1.0663 |
| | Mean Prediction Ratio | 0.7721 |
| Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 |
| | Consistency Score | 0.3652 |
| Data Metrics | Prevalence of Privileged Class (%) | 48.0000 |
By default, the compare function returns a flagged comparison of type pandas Styler (pandas.io.formats.style.Styler). When flags are disabled, the default return type is a pandas DataFrame. Outputs can also be returned as embedded HTML, with or without flags, by specifying output_type="html".
# With flags disabled, the report is returned as a pandas DataFrame
df = report.compare(test_data=X_test,
targets=y_test,
protected_attr=X_test['gender'],
models=model_1,
pred_type="regression",
flag_oor=False)
print("Returned type:", type(df))
display(df.head(2))
Returned type: <class 'pandas.core.frame.DataFrame'>
| Metric | Measure | model 1 |
|---|---|---|
| Group Fairness | MAE Difference | 0.3878 |
| | MAE Ratio | 1.2864 |
# Comparisons can also be returned as embedded HTML
from IPython.display import HTML
html_output = report.compare(test_data=X_test,
targets=y_test,
protected_attr=X_test['gender'],
models=model_1,
pred_type="regression",
output_type="html")
print("Returned type:", type(html_output))
HTML(html_output)
Returned type: <class 'str'>
| Metric | Measure | model 1 |
|---|---|---|
| Group Fairness | MAE Difference | 0.3878 |
| | MAE Ratio | 1.2864 |
| | Mean Prediction Difference | -1.0663 |
| | Mean Prediction Ratio | 0.7721 |
| Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 |
| | Consistency Score | 0.3652 |
| Model Performance | MAE | 1.5547 |
| | MSE | 3.3753 |
| | Mean Error | -0.1224 |
| | Mean Example_Target | 4.2513 |
| | Mean Prediction | 4.1290 |
| | Rsqrd | 0.1326 |
| | Std. Dev. Error | 1.8408 |
| | Std. Dev. Example_Target | 1.9809 |
| | Std. Dev. Prediction | 0.9631 |
| Data Metrics | Prevalence of Privileged Class (%) | 48.0000 |
The compare tool can also be used to measure two different models or two different protected attributes. Protected attributes are measured separately and cannot yet be combined together with the compare tool, although they can be grouped as cohorts in the stratified tables as shown below.
Here is an example output comparing the two test models defined above. Missing values have been added for metrics requiring prediction probabilities, which the second model does not have (note the warning below).
# Generate a side-by-side comparison of measures for both models
report.compare(X_test,
y_test,
X_test['gender'],
{'model 1':model_1, 'model 2':model_2},
pred_type="regression")
| Metric | Measure | model 1 | model 2 |
|---|---|---|---|
| Group Fairness | MAE Difference | 0.3878 | 0.3357 |
| | MAE Ratio | 1.2864 | 1.2271 |
| | Mean Prediction Difference | -1.0663 | -0.2019 |
| | Mean Prediction Ratio | 0.7721 | 0.9523 |
| Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 | 0.0000 |
| | Consistency Score | 0.3652 | 0.8737 |
| Model Performance | MAE | 1.5547 | 1.6516 |
| | MSE | 3.3753 | 3.7409 |
| | Mean Error | -0.1224 | -0.1204 |
| | Mean Example_Target | 4.2513 | 4.2513 |
| | Mean Prediction | 4.1290 | 4.1310 |
| | Rsqrd | 0.1326 | 0.0386 |
| | Std. Dev. Error | 1.8408 | 1.9385 |
| | Std. Dev. Example_Target | 1.9809 | 1.9809 |
| | Std. Dev. Prediction | 0.9631 | 0.2086 |
| Data Metrics | Prevalence of Privileged Class (%) | 48.0000 | 48.0000 |
It is generally recommended to test whether any differences in model outcomes for protected attributes are the effect of a sampling error in our test. FairMLHealth comes with a bootstrapping utility and supporting functions that can be used in statistical testing. The bootstrapping utility accepts any function that returns a p-value, and returns True or False depending on whether that p-value falls below some alpha in at least a threshold proportion of randomly sampled trials. While the selection of proper statistical tests is beyond the scope of this notebook, three examples using the built-in Kruskal-Wallis test function, two of them through the bootstrap_significance tool, are shown below.
# Example 1: Bootstrap test results applying Kruskal-Wallis relative to gender
isMale = X['gender'].eq(1)
reject_h0 = stat_utils.bootstrap_significance(func=stat_utils.kruskal_pval,
                                              a=y.loc[isMale],
                                              b=y.loc[~isMale])
print("Is the y value different for male vs female?\n", reject_h0)
Is the y value different for male vs female?
 True
# Example 2: Bootstrap test results applying Kruskal-Wallis relative to ethnicity
isCaucasian = X['ethnicity'].eq(1)
reject_h0 = stat_utils.bootstrap_significance(func=stat_utils.kruskal_pval,
                                              a=y.loc[isCaucasian],
                                              b=y.loc[~isCaucasian])
print("Is the y value different for caucasian vs not-caucasian?\n", reject_h0)
Is the y value different for caucasian vs not-caucasian?
 False
# Example 3: Single Kruskal-Wallis test
pval = stat_utils.kruskal_pval(a=y.loc[X['col3'].eq(1)],
b=y.loc[X['col3'].eq(0)],
# If n_sample set to None, tests on full dataset rather than sample
n_sample=None
)
print("P-Value of single K-W test:", pval)
P-Value of single K-W test: 2.981592458110808e-10
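The general idea of bootstrapped significance testing can be sketched with the standard library alone. The block below is an illustration under stated assumptions, not fairMLHealth's implementation: the function names, the parameters (`n_trials`, `n_sample`, `threshold`, `seed`), and their defaults are hypothetical, and a simple two-sided z-test on the difference in means stands in for Kruskal-Wallis to avoid extra dependencies. The rule is read here as rejecting the null hypothesis when the p-value falls below alpha in at least a threshold share of trials.

```python
import random
from statistics import NormalDist, mean, variance

def z_test_pval(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def bootstrap_reject(pval_func, a, b, alpha=0.05, n_trials=100,
                     n_sample=50, threshold=0.7, seed=0):
    """Return True if pval_func gives p < alpha in at least `threshold`
    of the trials, each run on random resamples drawn with replacement."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        sample_a = rng.choices(a, k=n_sample)
        sample_b = rng.choices(b, k=n_sample)
        if pval_func(sample_a, sample_b) < alpha:
            hits += 1
    return hits / n_trials >= threshold

# Hypothetical data: one group, a shifted copy, and an identical copy
data_rng = random.Random(506)
group_a = [data_rng.uniform(0, 6) for _ in range(120)]
group_b = [value + 2.0 for value in group_a]  # a real difference in location
group_c = list(group_a)                       # no difference at all

shifted_result = bootstrap_reject(z_test_pval, group_a, group_b)
identical_result = bootstrap_reject(z_test_pval, group_a, group_c)
print("Shifted groups differ:", shifted_result)
print("Identical groups differ:", identical_result)
```

Requiring agreement across many resampled trials makes the decision less sensitive to any single unlucky sample than one test on the full data would be.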
FairMLHealth also provides tools for detailed analysis of model variance by way of stratified data, performance, and bias tables. Beyond evaluating fairness, these tools are intended for flexible use in any generic assessment of model bias. Tables can evaluate multiple features at once. An important update starting in Version 1.0.0 is that all of these features are now contained in the measure.py module (previously named reports.py).
All tables display a summary row for "All Features, All Values". This summary can be turned off by passing *add_overview=False* to measure.data().
The stratified data table can be used to evaluate data against one or multiple targets. Two methods are available for identifying which features to assess, as shown in the examples below.
# Arguments Option 1: pass full set of data, subsetting with *features* argument
measure.data(X_test, y_test, features=['gender'])
| | Feature Name | Feature Value | Obs. | Entropy | Mean Example_Target | Median Example_Target | Missing Values | Std. Dev. Example_Target | Value Prevalence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 120 | NaN | 4.2513 | 4.5745 | 0 | 1.9809 | 1.0000 |
| 1 | gender | 0 | 62 | 0.9992 | 3.5410 | 3.7835 | 0 | 2.0357 | 0.5167 |
| 2 | gender | 1 | 58 | 0.9992 | 5.0106 | 5.0673 | 0 | 1.6192 | 0.4833 |
# Arguments Option 2: pass the data subset of interest without using the *features* argument
measure.data(X_test['gender'], y_test)
| | Feature Name | Feature Value | Obs. | Entropy | Mean Example_Target | Median Example_Target | Missing Values | Std. Dev. Example_Target | Value Prevalence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 120 | NaN | 4.2513 | 4.5745 | 0 | 1.9809 | 1.0000 |
| 1 | gender | 0 | 62 | 0.9992 | 3.5410 | 3.7835 | 0 | 2.0357 | 0.5167 |
| 2 | gender | 1 | 58 | 0.9992 | 5.0106 | 5.0673 | 0 | 1.6192 | 0.4833 |
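To make the Entropy and Value Prevalence columns more concrete, here is a minimal sketch that recomputes them for the gender rows from the observation counts alone. It assumes that Entropy is the base-2 Shannon entropy of the feature's value distribution, which is consistent with the numbers shown but is not confirmed by the library's documentation here. The function name `stratify` is hypothetical.

```python
from collections import Counter
from math import log2

def stratify(values):
    """Summarize one feature: count, prevalence, and entropy per value."""
    counts = Counter(values)
    total = sum(counts.values())
    prevalence = {value: n / total for value, n in counts.items()}
    # Shannon entropy (base 2) of the feature's value distribution;
    # the same number is repeated on every row for that feature
    entropy = -sum(p * log2(p) for p in prevalence.values())
    return [
        {"Feature Value": value,
         "Obs.": counts[value],
         "Entropy": round(entropy, 4),
         "Value Prevalence": round(prevalence[value], 4)}
        for value in sorted(counts)
    ]

# 62 observations with gender == 0 and 58 with gender == 1, as in the table
gender = [0] * 62 + [1] * 58
rows = stratify(gender)
for row in rows:
    print(row)
```

Because entropy describes the whole feature rather than a single value, it is identical across all rows of a given feature, and it approaches 1.0 for a binary feature as the two groups approach equal size.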
# Display a similar report for multiple targets, dropping the summary row
measure.data(X=X_test, # used to define rows
Y=X_test, # used to define columns
features=['gender', 'col1'], # optional subset of X
targets=['col2', 'col3'], # optional subset of Y
add_overview=False # turns off "All Features, All Values" row
)
| | Feature Name | Feature Value | Obs. | Entropy | Mean col2 | Mean col3 | Median col2 | Median col3 | Missing Values | Std. Dev. col2 | Std. Dev. col3 | Value Prevalence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gender | 0 | 62 | 0.9992 | 36.6452 | 0.4677 | 34.5 | 0.0 | 0 | 22.5811 | 0.5030 | 0.5167 |
| 1 | gender | 1 | 58 | 0.9992 | 36.2241 | 0.6034 | 32.5 | 1.0 | 0 | 20.9821 | 0.4935 | 0.4833 |
| 2 | col1 | 1 | 51 | 1.5579 | 36.4706 | 0.6471 | 32.0 | 1.0 | 0 | 21.2935 | 0.4826 | 0.4250 |
| 3 | col1 | 2 | 33 | 1.5579 | 33.6364 | 0.4545 | 30.0 | 0.0 | 0 | 21.4226 | 0.5056 | 0.2750 |
| 4 | col1 | 3 | 36 | 1.5579 | 38.9722 | 0.4444 | 40.0 | 0.0 | 0 | 22.9016 | 0.5040 | 0.3000 |
# Analytical tables are output as pandas DataFrames
test_table = measure.data(X=X_test[['gender', 'col1']], # used to define rows
Y=X_test[['col2', 'col3']], # used to define columns
)
test_table.loc[test_table['Feature Value'].eq("1"), ['Feature Name', 'Feature Value', 'Mean col2', 'Mean col3']]
| | Feature Name | Feature Value | Mean col2 | Mean col3 |
|---|---|---|---|---|
| 2 | gender | 1 | 36.2241 | 0.6034 |
| 3 | col1 | 1 | 36.4706 | 0.6471 |
The stratified performance table evaluates model performance specific to each feature-value subset. These tables are compatible with both classification and regression models.
# Performance table example
measure.performance(X_test[['gender']], y_test, model_1.predict(X_test),
pred_type="regression")
| | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | MAE | MSE | Mean Error | Rsqrd | Std. Dev. Error | Std. Dev. Prediction | Std. Dev. Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 120.0 | 4.2513 | 4.1290 | 1.5547 | 3.3753 | -0.1224 | 0.1326 | 1.8408 | 0.9631 | 1.9809 |
| 1 | gender | 0 | 62.0 | 3.5410 | 3.6136 | 1.7422 | 3.8787 | 0.0725 | 0.0487 | 1.9842 | 0.7671 | 2.0357 |
| 2 | gender | 1 | 58.0 | 5.0106 | 4.6799 | 1.3544 | 2.8372 | -0.3307 | -0.1012 | 1.6660 | 0.8420 | 1.6192 |
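The kind of calculation behind a stratified performance table can be sketched as a group-by over feature values, with each regression metric computed inside its subgroup. This is a simplified illustration on hypothetical data, not the library's code. The function name `grouped_performance` is an assumption, and the error is taken as prediction minus target, which matches the sign of the Mean Error column above.

```python
from collections import defaultdict

def grouped_performance(groups, y_true, y_pred):
    """Compute regression metrics separately for each group value."""
    buckets = defaultdict(list)
    for group, target, pred in zip(groups, y_true, y_pred):
        buckets[group].append((target, pred))

    table = {}
    for group, pairs in sorted(buckets.items()):
        n = len(pairs)
        errors = [pred - target for target, pred in pairs]
        mean_target = sum(target for target, _ in pairs) / n
        ss_res = sum(e ** 2 for e in errors)
        ss_tot = sum((target - mean_target) ** 2 for target, _ in pairs)
        table[group] = {
            "Obs.": n,
            "MAE": sum(abs(e) for e in errors) / n,
            "MSE": ss_res / n,
            "Mean Error": sum(errors) / n,
            # R^2 uses the subgroup's own mean, so it can be negative
            "Rsqrd": 1 - ss_res / ss_tot if ss_tot else float("nan"),
        }
    return table

# Tiny hypothetical example with two groups of two observations each
groups = [0, 0, 1, 1]
y_true = [1.0, 3.0, 2.0, 4.0]
y_pred = [2.0, 3.0, 1.0, 5.0]
perf = grouped_performance(groups, y_true, y_pred)
for group, metrics in perf.items():
    print(group, metrics)
```

Computing R-squared within each subgroup explains how a subgroup value can be negative even when the overall value is positive, as with gender 1 in the table above.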
The stratified bias analysis feature applies fairness-related metrics to each feature-value pair. It treats a given feature value as the "privileged" group relative to all other possible values for the feature. For example, in the table output of the cell below, the row with index 3 displays measures for "col1" with a value of "2". For this row, "2" is considered to be the privileged group, while all other non-null values (namely "1" and "3") are considered unprivileged.
Note that the flag function is compatible with both measure.bias() and measure.summary() (the latter is demonstrated below). However, to enable colored cells, the tool returns a pandas Styler rather than a DataFrame. For this reason, flag_oor is False by default for these features. Flagging can be turned on by passing flag_oor=True to either function. As an added feature, optional custom ranges can be passed to either measure.bias() or measure.summary() to facilitate regression evaluation, as shown below.
# Custom "fair" ranges may be passed as dictionaries of tuples whose keys
# are case-insensitive measure names
my_ranges = {'mean prediction difference':(-2, 2)}
# Note that flag_oor is set to False by default for this feature
measure.bias(X_test[['gender', 'col1']],
y_test,
model_1.predict(X_test),
pred_type="regression",
flag_oor=True,
custom_ranges=my_ranges)
| | Feature Name | Feature Value | MAE Difference | MAE Ratio | Mean Prediction Difference | Mean Prediction Ratio |
|---|---|---|---|---|---|---|
| 0 | gender | 0 | -0.3878 | 0.7774 | 1.0663 | 1.2951 |
| 1 | gender | 1 | 0.3878 | 1.2864 | -1.0663 | 0.7721 |
| 2 | col1 | 1 | -0.2275 | 0.8650 | 0.1545 | 1.0382 |
| 3 | col1 | 2 | 0.2495 | 1.1816 | 0.1337 | 1.0332 |
| 4 | col1 | 3 | 0.0279 | 1.0182 | -0.3067 | 0.9294 |
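The gender rows of this table can be approximately reproduced from the per-group values in the stratified performance table. The sketch below assumes that differences are computed as unprivileged minus privileged and ratios as unprivileged divided by privileged. That convention is inferred from the numbers in this notebook rather than taken from the library's source, and the function name `bias_measures` is hypothetical.

```python
def bias_measures(privileged, unprivileged):
    """Compare summary statistics of two groups.

    Each argument is a dict with "MAE" and "Mean Prediction" entries.
    Differences are unprivileged minus privileged; ratios are
    unprivileged divided by privileged.
    """
    return {
        "MAE Difference": unprivileged["MAE"] - privileged["MAE"],
        "MAE Ratio": unprivileged["MAE"] / privileged["MAE"],
        "Mean Prediction Difference":
            unprivileged["Mean Prediction"] - privileged["Mean Prediction"],
        "Mean Prediction Ratio":
            unprivileged["Mean Prediction"] / privileged["Mean Prediction"],
    }

# Rounded per-group values copied from the stratified performance table
gender_0 = {"MAE": 1.7422, "Mean Prediction": 3.6136}
gender_1 = {"MAE": 1.3544, "Mean Prediction": 4.6799}

# Each row treats its own feature value as the privileged group
row_gender_1 = bias_measures(privileged=gender_1, unprivileged=gender_0)
row_gender_0 = bias_measures(privileged=gender_0, unprivileged=gender_1)

for label, row in (("gender=0", row_gender_0), ("gender=1", row_gender_1)):
    print(label, {name: round(value, 4) for name, value in row.items()})
```

Because the inputs are already rounded to four decimals, the recomputed ratios can differ from the table in the last digit. For a feature with more than two values, such as col1, the unprivileged statistics would need to be pooled across all remaining values before applying the same comparison.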
The measure module also contains a summary function that works similarly to report.compare(). While it can only be applied to one model at a time, it accepts custom "fair" ranges and cohort groups, as shown in the next section.
# Example summary output for the regression model with custom ranges
measure.summary(X_test[['gender', 'col1']],
y_test,
model_1.predict(X_test),
prtc_attr=X_test['gender'],
pred_type="regression",
flag_oor=True,
custom_ranges={ 'mean prediction difference':(-0.5, 2)})
| Metric | Measure | Value |
|---|---|---|
| Group Fairness | MAE Difference | 0.3878 |
| | MAE Ratio | 1.2864 |
| | Mean Prediction Difference | -1.0663 |
| | Mean Prediction Ratio | 0.7721 |
| Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 |
| | Consistency Score | 0.3141 |
| Model Performance | MAE | 1.5547 |
| | MSE | 3.3753 |
| | Mean Error | -0.1224 |
| | Mean Example_Target | 4.2513 |
| | Mean Prediction | 4.1290 |
| | Rsqrd | 0.1326 |
| | Std. Dev. Error | 1.8408 |
| | Std. Dev. Example_Target | 1.9809 |
| | Std. Dev. Prediction | 0.9631 |
| Data Metrics | Prevalence of Privileged Class (%) | 48.0000 |
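The effect of the two custom ranges used in this section can be illustrated with a small sketch of case-insensitive range lookup. The helpers `resolve_ranges` and `is_flagged` are hypothetical and only mimic the behavior described in the text (dictionaries of tuples keyed by case-insensitive measure names); they are not the library's internals.

```python
def resolve_ranges(defaults, custom_ranges=None):
    """Merge custom ranges over defaults, matching names case-insensitively."""
    merged = {name.lower(): bounds for name, bounds in defaults.items()}
    for name, bounds in (custom_ranges or {}).items():
        lower, upper = bounds
        if lower > upper:
            raise ValueError(f"Invalid range for {name!r}: {bounds}")
        merged[name.lower()] = (lower, upper)
    return merged

def is_flagged(measure, value, ranges):
    """Return True when the value lies outside the range for that measure."""
    lower, upper = ranges[measure.lower()]
    return not (lower <= value <= upper)

# Default ratio range from the text; the difference ranges are user supplied
defaults = {"Mean Prediction Ratio": (0.8, 1.2)}
value = -1.0663  # Mean Prediction Difference reported for gender

wide = resolve_ranges(defaults, {"mean prediction difference": (-2, 2)})
narrow = resolve_ranges(defaults, {"Mean Prediction Difference": (-0.5, 2)})

flag_wide = is_flagged("Mean Prediction Difference", value, wide)
flag_narrow = is_flagged("Mean Prediction Difference", value, narrow)
print("Flagged with (-2, 2):", flag_wide)
print("Flagged with (-0.5, 2):", flag_narrow)
```

The same value passes under the wider range used in the measure.bias() example and is flagged under the narrower range used in the measure.summary() example, which shows how much the chosen range drives the result.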
Table-generating functions in the measure module can additionally be grouped using the cohort_labels argument, which specifies additional labels for each observation. Cohorts may consist of either a single label or a set of labels, and may be either separate from or attached to the existing data.
# Define cohort labels relative to the true values of the target
cohort_labels = pd.qcut(y_test, 3, labels=False).rename('True Value Group')
# Separate, Single-Level Cohorts
measure.bias(X_test['col3'], y_test, model_1.predict(X_test),
pred_type="regression", flag_oor=True,
cohort_labels=cohort_labels)
| | True Value Group | Feature Name | Feature Value | MAE Difference | MAE Ratio | Mean Prediction Difference | Mean Prediction Ratio |
|---|---|---|---|---|---|---|---|
| 0 | 0 | col3 | 0 | 0.9421 | 1.5954 | 1.4668 | 1.4613 |
| 1 | 0 | col3 | 1 | -0.9421 | 0.6268 | -1.4668 | 0.6843 |
| 2 | 1 | col3 | 0 | -0.4956 | 0.5698 | 1.4232 | 1.4092 |
| 3 | 1 | col3 | 1 | 0.4956 | 1.7549 | -1.4232 | 0.7096 |
| 4 | 2 | col3 | 0 | -1.1291 | 0.5810 | 1.2833 | 1.3623 |
| 5 | 2 | col3 | 1 | 1.1291 | 1.7211 | -1.2833 | 0.7340 |
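The cohort labels in the example above split observations into three groups by the size of the true target value. A rough standard-library sketch of that idea follows, using hypothetical data. It is intended to mimic the effect of quantile-based binning, not to reproduce pandas exactly, and the helper names `quantile_labels` and `mean_by_cohort` are assumptions.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean, quantiles

def quantile_labels(values, n_groups=3):
    """Label each value 0..n_groups-1 by the quantile bin it falls in."""
    # "inclusive" interpolates between the observed minimum and maximum
    cuts = quantiles(values, n=n_groups, method="inclusive")
    # bisect_left keeps a value equal to a cut point in the lower bin
    return [bisect_left(cuts, value) for value in values]

def mean_by_cohort(labels, values):
    """Average the values separately within each cohort label."""
    cohorts = defaultdict(list)
    for label, value in zip(labels, values):
        cohorts[label].append(value)
    return {label: mean(group) for label, group in sorted(cohorts.items())}

# Hypothetical target values
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
labels = quantile_labels(targets)
cohort_means = mean_by_cohort(labels, targets)
print(labels)
print(cohort_means)
```

Any per-observation labels can play the same role, so a table is effectively recomputed once per cohort and the results are stacked with the cohort label as an extra column or index level.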
# Multi-level cohorts for the data table
measure.data(X=X_test[['col3']], Y=y_test, cohort_labels=X_test[['gender', 'ethnicity']])
| gender | ethnicity | Feature Name | Feature Value | Obs. | Entropy | Mean Example_Target | Median Example_Target | Missing Values | Std. Dev. Example_Target | Value Prevalence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | ALL FEATURES | ALL VALUES | 29 | NaN | 3.9273 | 3.9847 | 0 | 1.9164 | 1.0000 |
| | 0 | col3 | 0 | 15 | 0.9991 | 3.3797 | 3.7754 | 0 | 1.9532 | 0.5172 |
| | 0 | col3 | 1 | 14 | 0.9991 | 4.5141 | 4.4485 | 0 | 1.7564 | 0.4828 |
| | 1 | ALL FEATURES | ALL VALUES | 33 | NaN | 3.2016 | 2.6024 | 0 | 2.1053 | 1.0000 |
| | 1 | col3 | 0 | 18 | 0.9940 | 2.4920 | 1.6709 | 0 | 1.8859 | 0.5455 |
| | 1 | col3 | 1 | 15 | 0.9940 | 4.0530 | 5.1426 | 0 | 2.0949 | 0.4545 |
| 1 | 0 | ALL FEATURES | ALL VALUES | 26 | NaN | 4.9544 | 4.7895 | 0 | 1.4701 | 1.0000 |
| | 0 | col3 | 0 | 11 | 0.9829 | 4.6557 | 4.5711 | 0 | 1.6014 | 0.4231 |
| | 0 | col3 | 1 | 15 | 0.9829 | 5.1735 | 5.0948 | 0 | 1.3805 | 0.5769 |
| | 1 | ALL FEATURES | ALL VALUES | 32 | NaN | 5.0563 | 5.2957 | 0 | 1.7530 | 1.0000 |
| | 1 | col3 | 0 | 12 | 0.9544 | 4.2436 | 4.2397 | 0 | 1.7731 | 0.3750 |
| | 1 | col3 | 1 | 20 | 0.9544 | 5.5440 | 5.6740 | 0 | 1.5894 | 0.6250 |