from fairmlhealth import report, measure, stat_utils
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
# Build a semi-random demonstration dataframe: a few generic numeric columns
# plus binary/categorical "attribute of interest" columns (gender, ethnicity,
# other) laid out in fixed repeating patterns.
rng = np.random.RandomState(506)
N = 240
# NOTE: the rng.randint calls must stay in this order so the seeded values
# are reproducible across runs of this notebook.
col_values = {
    'col1': rng.randint(1, 4, N),
    'col2': rng.randint(1, 75, N),
    'col3': rng.randint(0, 2, N),
    'gender': [0, 1] * (N // 2),
    'ethnicity': [1, 1, 0, 0] * (N // 4),
    'other': [1, 0, 0, 0, 1, 0, 0, 1] * (N // 8),
}
X = pd.DataFrame(col_values)
# Build a noisy binary target correlated with col3, capped at 1
noise = rng.randint(0, 2, N)
y = pd.Series(X['col3'].values + noise, name='Example_Target').clip(upper=1)
# Third, we'll split the data and use it to train two generic models
splits = train_test_split(X, y, stratify=y, test_size=0.5, random_state=60)
X_train, X_test, y_train, y_test = splits
model_1 = BernoulliNB().fit(X_train, y_train)
model_2 = DecisionTreeClassifier().fit(X_train, y_train)
display(X.head(), y.head())
col1 | col2 | col3 | gender | ethnicity | other | |
---|---|---|---|---|---|---|
0 | 1 | 15 | 0 | 0 | 1 | 1 |
1 | 3 | 51 | 1 | 1 | 1 | 0 |
2 | 1 | 30 | 1 | 0 | 0 | 0 |
3 | 2 | 28 | 1 | 1 | 0 | 0 |
4 | 1 | 72 | 0 | 0 | 1 | 1 |
0 0 1 1 2 1 3 1 4 1 Name: Example_Target, dtype: int64
fairMLHealth has tools to create generalized reports of model bias and performance.
The primary reporting tool is now the compare function, which can be used to generate side-by-side comparisons for any number of models, and for either binary classification or for regression problems. Model performance metrics such as accuracy and precision (or MAE and RSquared for regression problems) are also provided to facilitate comparison. Below is an example output comparing the two example models defined above. Missing values have been added for metrics requiring prediction probabilities (which the second model does not have).
A flagging protocol is applied by default to highlight any cells with values that are out of range. This can be turned off by passing *flag_oor = False* to report.compare().
Note that the Equal Odds Ratio has been dropped from the example below. This is because the false positive rate is approximately zero for both the entire dataset and for the privileged class, leading to a zero in the denominator of the False Positive Rate Ratio: $\frac{{FPR}_{unprivileged}}{{FPR}_{privileged}}$. The result is therefore undefined and cannot be compared in the Equal Odds Ratio.
# Compare one model across two protected attributes at once. Passing a dict
# for protected_attr labels each column of the output; the same model is
# supplied under both keys. The Equal Odds Ratio is not displayed because it
# is undefined here: the False Positive Rate for the privileged group
# (ethnicity = 1) is 0.0.
protected = {'Gender': X_test['gender'],
             'Ethnicity': X_test['ethnicity']}
report.compare(test_data=X_test, targets=y_test,
               protected_attr=protected, models=model_1)
~/repos/fairMLHealth/fairmlhealth/measure.py:888: UserWarning: The following measures are undefined and have been dropped: ['Equal Odds Ratio'] warn(f"The following measures are undefined and have been dropped: {undefined}")
Gender | Ethnicity | ||
---|---|---|---|
Metric | Measure | ||
Group Fairness | AUC Difference | -0.0250 | -0.0778 |
Balanced Accuracy Difference | 0.0988 | -0.3667 | |
Balanced Accuracy Ratio | 1.1576 | 0.5769 | |
Disparate Impact Ratio | 0.9068 | 1.8182 | |
Equal Odds Difference | -0.2036 | 1.0000 | |
Equal Odds Ratio | 0.6691 | nan | |
Positive Predictive Parity Difference | 0.0111 | -0.2500 | |
Positive Predictive Parity Ratio | 1.0133 | 0.7500 | |
Statistical Parity Difference | -0.0759 | 0.4500 | |
Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 | 0.0241 |
Consistency Score | 0.7683 | 0.7683 | |
Model Performance | Accuracy | 1.0000 | 1.0000 |
F1-Score | 1.0000 | 1.0000 | |
FPR | 0.0000 | 0.0000 | |
Mean Example_Target | 0.7750 | 0.7750 | |
Precision | 1.0000 | 1.0000 | |
TPR | 1.0000 | 1.0000 | |
Data Metrics | Prevalence of Privileged Class (%) | 49.0000 | 50.0000 |
# Generate a measure report for a single protected attribute (ethnicity)
report.compare(test_data=X_test,
               targets=y_test,
               protected_attr=X_test['ethnicity'],
               models=model_1)
~/repos/fairMLHealth/fairmlhealth/measure.py:888: UserWarning: The following measures are undefined and have been dropped: ['Equal Odds Ratio'] warn(f"The following measures are undefined and have been dropped: {undefined}")
model 1 | ||
---|---|---|
Metric | Measure | |
Group Fairness | AUC Difference | -0.0778 |
Balanced Accuracy Difference | -0.3667 | |
Balanced Accuracy Ratio | 0.5769 | |
Disparate Impact Ratio | 1.8182 | |
Equal Odds Difference | 1.0000 | |
Positive Predictive Parity Difference | -0.2500 | |
Positive Predictive Parity Ratio | 0.7500 | |
Statistical Parity Difference | 0.4500 | |
Individual Fairness | Between-Group Gen. Entropy Error | 0.0241 |
Consistency Score | 0.7683 | |
Model Performance | Accuracy | 1.0000 |
F1-Score | 1.0000 | |
FPR | 0.0000 | |
Mean Example_Target | 0.7750 | |
Precision | 1.0000 | |
TPR | 1.0000 | |
Data Metrics | Prevalence of Privileged Class (%) | 50.0000 |
# Generate a fairness-only report: skip_performance=True omits the
# "Model Performance" section from the output
bias_report = report.compare(
    test_data=X_test,
    targets=y_test,
    protected_attr=X_test['gender'],
    models=model_1,
    pred_type="classification",
    skip_performance=True,
)
print("Returned type:", type(bias_report))
display(bias_report)
Returned type: <class 'pandas.io.formats.style.Styler'>
model 1 | ||
---|---|---|
Metric | Measure | |
Group Fairness | AUC Difference | -0.0250 |
Balanced Accuracy Difference | 0.0988 | |
Balanced Accuracy Ratio | 1.1576 | |
Disparate Impact Ratio | 0.9068 | |
Equal Odds Difference | -0.2036 | |
Equal Odds Ratio | 0.6691 | |
Positive Predictive Parity Difference | 0.0111 | |
Positive Predictive Parity Ratio | 1.0133 | |
Statistical Parity Difference | -0.0759 | |
Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 |
Consistency Score | 0.7683 | |
Data Metrics | Prevalence of Privileged Class (%) | 49.0000 |
By default the compare function returns a flagged comparison of type pandas Styler (pandas.io.formats.style.Styler). When flags are disabled, the default return type is a pandas DataFrame. Outputs can also be returned as embedded HTML -- with or without flags -- by specifying output_type="html".
# Disable flagging with flag_oor=False. NOTE(review): because
# output_type="styler" is also passed, the result here is still a pandas
# Styler; without output_type, disabling flags would return a plain
# DataFrame instead.
df = report.compare(test_data=X_test,
targets=y_test,
protected_attr=X_test['gender'],
models=model_1,
pred_type="classification",
flag_oor=False,
output_type="styler")
print("Returned type:", type(df))
# display(df.head(2))  # disabled: df is a Styler here, not a DataFrame
isinstance(df, pd.io.formats.style.Styler)
Returned type: <class 'pandas.io.formats.style.Styler'>
True
# Request the comparison as an embedded-HTML string and render it manually
from IPython.core.display import HTML

html_output = report.compare(
    test_data=X_test,
    targets=y_test,
    protected_attr=X_test['gender'],
    models=model_1,
    pred_type="classification",
    output_type="html",
)
print("Returned type:", type(html_output))
HTML(html_output)
Returned type: <class 'str'>
model 1 | ||
---|---|---|
Metric | Measure | |
Group Fairness | AUC Difference | -0.0250 |
Balanced Accuracy Difference | 0.0988 | |
Balanced Accuracy Ratio | 1.1576 | |
Disparate Impact Ratio | 0.9068 | |
Equal Odds Difference | -0.2036 | |
Equal Odds Ratio | 0.6691 | |
Positive Predictive Parity Difference | 0.0111 | |
Positive Predictive Parity Ratio | 1.0133 | |
Statistical Parity Difference | -0.0759 | |
Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 |
Consistency Score | 0.7683 | |
Model Performance | Accuracy | 1.0000 |
F1-Score | 1.0000 | |
FPR | 0.0000 | |
Mean Example_Target | 0.7750 | |
Precision | 1.0000 | |
TPR | 1.0000 | |
Data Metrics | Prevalence of Privileged Class (%) | 49.0000 |
The compare tool can also be used to measure two different models or two different protected attributes. Protected attributes are measured separately and cannot yet be combined together with the compare tool, although they can be grouped as cohorts in the stratified tables as shown below.
Below is an example output comparing the two test models defined above.
# Pass a dict of models to compare several at once; the dict keys become
# the column labels in the report
model_dict = {'Any Name 1': model_1, 'Model 2': model_2}
report.compare(test_data=X_test, targets=y_test,
               protected_attr=X_test['gender'], models=model_dict)
Any Name 1 | Model 2 | ||
---|---|---|---|
Metric | Measure | ||
Group Fairness | AUC Difference | -0.0250 | 0.0623 |
Balanced Accuracy Difference | 0.0988 | 0.0709 | |
Balanced Accuracy Ratio | 1.1576 | 1.1151 | |
Disparate Impact Ratio | 0.9068 | 0.7820 | |
Equal Odds Difference | -0.2036 | -0.2624 | |
Equal Odds Ratio | 0.6691 | 0.5735 | |
Positive Predictive Parity Difference | 0.0111 | 0.0123 | |
Positive Predictive Parity Ratio | 1.0133 | 1.0148 | |
Statistical Parity Difference | -0.0759 | -0.1737 | |
Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 | 0.0018 |
Consistency Score | 0.7683 | 0.7367 | |
Model Performance | Accuracy | 1.0000 | 1.0000 |
F1-Score | 1.0000 | 1.0000 | |
FPR | 0.0000 | 0.0000 | |
Mean Example_Target | 0.7750 | 0.7083 | |
Precision | 1.0000 | 1.0000 | |
TPR | 1.0000 | 1.0000 | |
Data Metrics | Prevalence of Privileged Class (%) | 49.0000 | 49.0000 |
It is generally recommended to test whether any differences in model outcomes for protected attributes are the effect of a sampling error in our test. FairMLHealth comes with a bootstrapping utility and supporting functions that can be used in statistical testing. The bootstrapping utility accepts any function that returns a p-value and will return a True or False if the p-value is greater than some alpha for a threshold number of randomly sampled trials. While the selection of proper statistical tests is beyond the scope of this notebook, two examples using the bootstrap_significance tool with built-in test functions are shown below: 1) using Kruskal-Wallis, and 2) using Chi-Square.
model_1_preds = pd.Series(model_1.predict(X_test))
# Example 1: bootstrap Kruskal-Wallis test comparing the prediction
# distributions for the two gender groups
is_male = X_test.reset_index(drop=True)['gender'].eq(1)
reject_h0 = stat_utils.bootstrap_significance(
    alpha=0.05,
    threshold=0.70,
    func=stat_utils.kruskal_pval,
    a=model_1_preds.loc[is_male],
    b=model_1_preds.loc[~is_male],
)
print("Can we reject the hypothesis that y values have the same distribution, regardless of gender?\n",
      reject_h0)
Can we reject the hypothesis that y values have the same distribution, regardless of gender? False
# Example 2: bootstrap Chi-Square test on the distribution of prediction
# successes/failures across gender groups
model_1_results = stat_utils.binary_result_labels(y_test, model_1_preds)
reject_h0 = stat_utils.bootstrap_significance(
    alpha=0.05,
    threshold=0.70,
    func=stat_utils.chisquare_pval,
    group=X_test['gender'],
    values=model_1_results,
)
print("Can we reject the hypothesis that prediction results are from the same",
      "distribution regardless of gender?\n", reject_h0)
Can we reject the hypothesis that prediction results are from the same distribution regardless of gender? False
# Run one Chi-Square test directly (no bootstrapping).
# n_sample=None evaluates the full dataset rather than a random sample.
pval = stat_utils.chisquare_pval(
    group=X_test['gender'], values=model_1_results, n_sample=None)
print("P-Value of single Chi-Square test:", pval)
P-Value of single Chi-Square test: 0.9214286123014678
FairMLHealth also provides tools for detailed analysis of model variance by way of stratified data, performance, and bias tables. Beyond evaluating fairness, these tools are intended for flexible use in any generic assessment of model bias. Tables can evaluate multiple features at once. An important update starting in Version 1.0.0 is that all of these features are now contained in the *measure.py* module (previously named *reports.py*).
All tables display a summary row for "All Features, All Values". This summary can be turned off by passing *add_overview=False* to measure.data().
The stratified data table can be used to evaluate data against one or multiple targets. Two methods are available for identifying which features to assess, as shown in the examples below.
# Arguments Option 1: keep the full dataframe and select the columns to
# evaluate with the *features* argument
measure.data(X=X_test, Y=y_test, features=['gender', 'other', 'col1'])
Feature Name | Feature Value | Obs. | Entropy | Mean Example_Target | Median Example_Target | Missing Values | Std. Dev. Example_Target | Value Prevalence | |
---|---|---|---|---|---|---|---|---|---|
0 | ALL FEATURES | ALL VALUES | 120 | NaN | 0.7500 | 1.0 | 0 | 0.4348 | 1.0000 |
1 | gender | 0 | 61 | 0.9998 | 0.7213 | 1.0 | 0 | 0.4521 | 0.5083 |
2 | gender | 1 | 59 | 0.9998 | 0.7797 | 1.0 | 0 | 0.4180 | 0.4917 |
3 | other | 0 | 77 | 0.9413 | 0.7922 | 1.0 | 0 | 0.4084 | 0.6417 |
4 | other | 1 | 43 | 0.9413 | 0.6744 | 1.0 | 0 | 0.4741 | 0.3583 |
5 | col1 | 1 | 42 | 1.5838 | 0.7857 | 1.0 | 0 | 0.4153 | 0.3500 |
6 | col1 | 2 | 40 | 1.5838 | 0.8250 | 1.0 | 0 | 0.3848 | 0.3333 |
7 | col1 | 3 | 38 | 1.5838 | 0.6316 | 1.0 | 0 | 0.4889 | 0.3167 |
# Arguments Option 2: pre-select the columns of interest instead of
# passing the *features* argument
subset = X_test[['gender', 'other', 'col1']]
measure.data(subset, y_test)
Feature Name | Feature Value | Obs. | Entropy | Mean Example_Target | Median Example_Target | Missing Values | Std. Dev. Example_Target | Value Prevalence | |
---|---|---|---|---|---|---|---|---|---|
0 | ALL FEATURES | ALL VALUES | 120 | NaN | 0.7500 | 1.0 | 0 | 0.4348 | 1.0000 |
1 | gender | 0 | 61 | 0.9998 | 0.7213 | 1.0 | 0 | 0.4521 | 0.5083 |
2 | gender | 1 | 59 | 0.9998 | 0.7797 | 1.0 | 0 | 0.4180 | 0.4917 |
3 | other | 0 | 77 | 0.9413 | 0.7922 | 1.0 | 0 | 0.4084 | 0.6417 |
4 | other | 1 | 43 | 0.9413 | 0.6744 | 1.0 | 0 | 0.4741 | 0.3583 |
5 | col1 | 1 | 42 | 1.5838 | 0.7857 | 1.0 | 0 | 0.4153 | 0.3500 |
6 | col1 | 2 | 40 | 1.5838 | 0.8250 | 1.0 | 0 | 0.3848 | 0.3333 |
7 | col1 | 3 | 38 | 1.5838 | 0.6316 | 1.0 | 0 | 0.4889 | 0.3167 |
# Multiple targets can be evaluated at once (again using Arguments Option 2):
# X defines the table's rows and Y defines its columns
row_data = X_test[['gender', 'col1']]
target_data = X_test[['col2', 'col3']]
measure.data(X=row_data, Y=target_data)
Feature Name | Feature Value | Obs. | Entropy | Mean col2 | Mean col3 | Median col2 | Median col3 | Missing Values | Std. Dev. col2 | Std. Dev. col3 | Value Prevalence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ALL FEATURES | ALL VALUES | 120 | NaN | 39.3167 | 0.5083 | 40.5 | 1.0 | 0 | 20.7543 | 0.5020 | 1.0000 |
1 | gender | 0 | 61 | 0.9998 | 39.7705 | 0.4098 | 39.0 | 0.0 | 0 | 21.6482 | 0.4959 | 0.5083 |
2 | gender | 1 | 59 | 0.9998 | 38.8475 | 0.6102 | 41.0 | 1.0 | 0 | 19.9627 | 0.4919 | 0.4917 |
3 | col1 | 1 | 42 | 1.5838 | 38.5000 | 0.5952 | 39.5 | 1.0 | 0 | 20.6424 | 0.4968 | 0.3500 |
4 | col1 | 2 | 40 | 1.5838 | 35.7250 | 0.5250 | 34.0 | 1.0 | 0 | 18.2897 | 0.5057 | 0.3333 |
5 | col1 | 3 | 38 | 1.5838 | 44.0000 | 0.3947 | 49.5 | 0.0 | 0 | 22.8769 | 0.4954 | 0.3167 |
# Analytical tables are plain pandas DataFrames, so standard indexing works.
# add_overview=False suppresses the "All Features, All Values" summary row.
test_table = measure.data(X=X_test[['gender', 'col1']],  # rows
                          Y=X_test[['col2', 'col3']],    # columns
                          add_overview=False)
value_is_one = test_table['Feature Value'].eq("1")
test_table.loc[value_is_one, ['Feature Name', 'Feature Value', 'Mean col2', 'Mean col3']]
Feature Name | Feature Value | Mean col2 | Mean col3 | |
---|---|---|---|---|
1 | gender | 1 | 38.8475 | 0.6102 |
2 | col1 | 1 | 38.5000 | 0.5952 |
The stratified performance table evaluates model performance specific to each feature-value subset. These tables are compatible with both classification and regression models. For classification models with the predict_proba() method, additional ROC_AUC and PR_AUC values will be included if possible.
# Stratified performance table for the gender feature
measure.performance(X_test[['gender']],
                    y_true=y_test,
                    y_pred=model_1.predict(X_test))
Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | Precision | TPR | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ALL FEATURES | ALL VALUES | 120.0 | 0.7500 | 0.7750 | 0.7750 | 0.8525 | 0.5000 | 0.8387 | 0.8667 |
1 | gender | 0 | 61.0 | 0.7213 | 0.7377 | 0.7869 | 0.8539 | 0.4118 | 0.8444 | 0.8636 |
2 | gender | 1 | 59.0 | 0.7797 | 0.8136 | 0.7627 | 0.8511 | 0.6154 | 0.8333 | 0.8696 |
# Supplying y_prob adds ROC AUC and PR AUC columns to the performance table
predictions = model_1.predict(X_test)
probabilities = model_1.predict_proba(X_test)[:, 1]
measure.performance(X_test[['gender']], y_true=y_test,
                    y_pred=predictions, y_prob=probabilities)
Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ALL FEATURES | ALL VALUES | 120.0 | 0.7500 | 0.7750 | 0.7750 | 0.8525 | 0.5000 | 0.2062 | 0.8387 | 0.8583 | 0.8667 |
1 | gender | 0 | 61.0 | 0.7213 | 0.7377 | 0.7869 | 0.8539 | 0.4118 | 0.2261 | 0.8444 | 0.8429 | 0.8636 |
2 | gender | 1 | 59.0 | 0.7797 | 0.8136 | 0.7627 | 0.8511 | 0.6154 | 0.1873 | 0.8333 | 0.8679 | 0.8696 |
# The same performance table, stratified by ethnicity instead of gender
measure.performance(X_test[['ethnicity']],
                    y_true=y_test,
                    y_pred=model_1.predict(X_test))
Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | Precision | TPR | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ALL FEATURES | ALL VALUES | 120.0 | 0.75 | 0.775 | 0.775 | 0.8525 | 0.5 | 0.8387 | 0.8667 |
1 | ethnicity | 0 | 60.0 | 0.75 | 1.000 | 0.750 | 0.8571 | 1.0 | 0.7500 | 1.0000 |
2 | ethnicity | 1 | 60.0 | 0.75 | 0.550 | 0.800 | 0.8462 | 0.0 | 1.0000 | 0.7333 |
The stratified bias analysis feature applies fairness-related metrics for each feature-value pair. It assumes a given feature-value as the "privileged" group relative to all other possible values for the feature. For example, in the table output shown in the cell below, row 2 displays measures for "col1" with a value of "2". For this row, "2" is considered to be the privileged group, while all other non-null values (namely "1" and "3") are considered unprivileged.
To simplify the table, fairness measures have been reduced to their component parts. For example, the Equal Odds Ratio has been reduced to the True Positive Rate (TPR) Ratio and False Positive Rate (FPR) Ratio.
Note that the flag function is compatible with both measure.bias() and measure.summary() (which is demonstrated below). However, to enable colored cells the tool returns a pandas Styler rather than a DataFrame. For this reason, flag_oor is False by default for these features. Flagging can be turned on by passing flag_oor=True to either function. As an added feature, optional custom ranges can be passed to either measure.bias() or measure.summary() to facilitate regression evaluation, shown in Example-ToolUsage_Regression.
# Stratified bias table with out-of-range flagging enabled
measure.bias(X_test[['gender', 'col3']],
             y_test,
             model_1.predict(X_test),
             flag_oor=True)
Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | gender | 0 | -0.0988 | 0.8638 | 0.2036 | 1.4945 | -0.0111 | 0.9868 | 0.0759 | 1.1028 | 0.0059 | 1.0069 |
1 | gender | 1 | 0.0988 | 1.1576 | -0.2036 | 0.6691 | 0.0111 | 1.0133 | -0.0759 | 0.9068 | -0.0059 | 0.9932 |
2 | col3 | 0 | 0.4569 | 1.8413 | -0.5000 | 0.0000 | 0.4688 | 1.8824 | 0.4576 | 1.8438 | 0.4138 | 1.7059 |
3 | col3 | 1 | -0.4569 | 0.5431 | 0.5000 | nan | -0.4688 | 0.5312 | -0.4576 | 0.5424 | -0.4138 | 0.5862 |
The measure module also contains a summary function that works similarly to report.compare(). While it can only be applied to one model at a time, it can accept custom "fair" ranges, and accept cohort groups as will be shown in the next section.
# Summary report for a single model, with the performance section skipped
# and out-of-range flagging enabled
measure.summary(X_test[['col2']], y_test, model_1.predict(X_test),
                prtc_attr=X_test['gender'], pred_type="classification",
                skip_performance=True, flag_oor=True)
Value | ||
---|---|---|
Metric | Measure | |
Group Fairness | Balanced Accuracy Difference | 0.0988 |
Balanced Accuracy Ratio | 1.1576 | |
Disparate Impact Ratio | 0.9068 | |
Equal Odds Difference | -0.2036 | |
Equal Odds Ratio | 0.6691 | |
Positive Predictive Parity Difference | 0.0111 | |
Positive Predictive Parity Ratio | 1.0133 | |
Statistical Parity Difference | -0.0759 | |
Individual Fairness | Between-Group Gen. Entropy Error | 0.0000 |
Consistency Score | 0.7350 | |
Data Metrics | Prevalence of Privileged Class (%) | 49.0000 |
Table-generating functions in the measure module can all be additionally grouped using the cohort_labels argument to specify additional labels for each observation. Cohorts may consist of either a single label or a set of labels, and may be either separate from or attached to the existing data.
# Separate, single-level cohorts: group the bias table by a labels Series
# that is not part of the measured data
cohort_labels = X_test['gender']
measure.bias(X_test['col3'],
             y_test,
             model_1.predict(X_test),
             flag_oor=True,
             cohort_labels=cohort_labels)
gender | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | col3 | 0 | 0.3638 | 1.5718 | -0.4118 | 0.0000 | 0.3500 | 1.5385 | 0.4444 | 1.8000 | 0.3158 | 1.4615 |
1 | 0 | col3 | 1 | -0.3638 | 0.6362 | 0.4118 | nan | -0.3500 | 0.6500 | -0.4444 | 0.5556 | -0.3158 | 0.6842 |
2 | 1 | col3 | 0 | 0.6077 | 2.5490 | -0.6154 | 0.0000 | 0.6667 | 3.0000 | 0.4783 | 1.9167 | 0.6000 | 2.5000 |
3 | 1 | col3 | 1 | -0.6077 | 0.3923 | 0.6154 | nan | -0.6667 | 0.3333 | -0.4783 | 0.5217 | -0.6000 | 0.4000 |
# Associated, multi-level cohorts: pass multiple label columns taken from
# the same dataframe as the measured data
multi_cohorts = X_test[['gender', 'ethnicity']]
measure.data(X=X_test['col3'], Y=y_test, cohort_labels=multi_cohorts)
Feature Name | Feature Value | Obs. | Entropy | Mean Example_Target | Median Example_Target | Missing Values | Std. Dev. Example_Target | Value Prevalence | ||
---|---|---|---|---|---|---|---|---|---|---|
gender | ethnicity | |||||||||
0 | 0 | ALL FEATURES | ALL VALUES | 32 | NaN | 0.7812 | 1.0 | 0 | 0.4200 | 1.0000 |
0 | col3 | 0 | 20 | 0.9544 | 0.6500 | 1.0 | 0 | 0.4894 | 0.6250 | |
0 | col3 | 1 | 12 | 0.9544 | 1.0000 | 1.0 | 0 | 0.0000 | 0.3750 | |
1 | ALL FEATURES | ALL VALUES | 29 | NaN | 0.6552 | 1.0 | 0 | 0.4837 | 1.0000 | |
1 | col3 | 0 | 16 | 0.9923 | 0.3750 | 0.0 | 0 | 0.5000 | 0.5517 | |
1 | col3 | 1 | 13 | 0.9923 | 1.0000 | 1.0 | 0 | 0.0000 | 0.4483 | |
1 | 0 | ALL FEATURES | ALL VALUES | 28 | NaN | 0.7143 | 1.0 | 0 | 0.4600 | 1.0000 |
0 | col3 | 0 | 12 | 0.9852 | 0.3333 | 0.0 | 0 | 0.4924 | 0.4286 | |
0 | col3 | 1 | 16 | 0.9852 | 1.0000 | 1.0 | 0 | 0.0000 | 0.5714 | |
1 | ALL FEATURES | ALL VALUES | 31 | NaN | 0.8387 | 1.0 | 0 | 0.3739 | 1.0000 | |
1 | col3 | 0 | 11 | 0.9383 | 0.5455 | 1.0 | 0 | 0.5222 | 0.3548 | |
1 | col3 | 1 | 20 | 0.9383 | 1.0000 | 1.0 | 0 | 0.0000 | 0.6452 |
# Cohort grouping also works with summary tables; here each observation is
# labeled by both ethnicity and col3
cohorts = X_test[['ethnicity', 'col3']]
measure.summary(X_test[['col2']], y_test, model_1.predict(X_test),
                prtc_attr=X_test['gender'], pred_type="classification",
                flag_oor=False, skip_performance=True,
                cohort_labels=cohorts)
~/opt/anaconda3/envs/exactech/lib/python3.6/site-packages/aif360/sklearn/metrics/metrics.py:116: UndefinedMetricWarning: The ratio is ill-defined and being set to 0.0 because 'predmean' for privileged samples is 0. UndefinedMetricWarning) ~/repos/fairMLHealth/fairmlhealth/measure.py:888: UserWarning: The following measures are undefined and have been dropped: ['Positive Predictive Parity Ratio', 'Equal Odds Ratio'] warn(f"The following measures are undefined and have been dropped: {undefined}") ~/repos/fairMLHealth/fairmlhealth/__utils.py:214: UserWarning: Could not evaluate function for group(s): {errant_list}. This is commonly caused when there is too little data or there is only a single feature-value pair is available in a given cohort. Each cohort must have 5 observations. warn(msg)
Value | ||||
---|---|---|---|---|
Metric | Measure | ethnicity | col3 | |
Group Fairness | Balanced Accuracy Difference | 0 | 0 | 0.0000 |
Balanced Accuracy Ratio | 0 | 0 | 1.0000 | |
Disparate Impact Ratio | 0 | 0 | 1.0000 | |
Equal Odds Difference | 0 | 0 | 0.0000 | |
Equal Odds Ratio | 0 | 0 | 1.0000 | |
Positive Predictive Parity Difference | 0 | 0 | 0.3167 | |
Positive Predictive Parity Ratio | 0 | 0 | 1.9500 | |
Statistical Parity Difference | 0 | 0 | 0.0000 | |
Individual Fairness | Between-Group Gen. Entropy Error | 0 | 0 | 0.0054 |
Consistency Score | 0 | 0 | 1.0000 | |
Data Metrics | Prevalence of Privileged Class (%) | 0 | 0 | 38.0000 |
Group Fairness | Balanced Accuracy Difference | 1 | 0 | 0.0000 |
Balanced Accuracy Ratio | 1 | 0 | 1.0000 | |
Disparate Impact Ratio | 1 | 0 | 0.0000 | |
Equal Odds Difference | 1 | 0 | 0.0000 | |
Positive Predictive Parity Difference | 1 | 0 | 0.0000 | |
Statistical Parity Difference | 1 | 0 | 0.0000 | |
Individual Fairness | Between-Group Gen. Entropy Error | 1 | 0 | 0.0114 |
Consistency Score | 1 | 0 | 1.0000 | |
Data Metrics | Prevalence of Privileged Class (%) | 1 | 0 | 41.0000 |