%reload_ext autoreload
%autoreload 2
%matplotlib inline
ktrain
As of v0.27.x, ktrain supports causal inference using meta-learners. We will use the well-studied Adult Census dataset from the UCI Machine Learning Repository, which consists of U.S. census data from the mid-1990s. The objective is to estimate how much earning a PhD increases the probability of making over $50K in salary. (Note that this dataset is used purely as a simple demonstration of estimation. In a real-world scenario, you would spend more time identifying which variables you should control for and which you should not.) Unlike conventional supervised machine learning, causal inference models typically have no ground truth to compare against unless you're using a simulated dataset. So, for illustration purposes, we will simply check whether our estimates agree with intuition, in addition to inspecting their robustness.
Let's begin by loading the dataset and creating a binary treatment (1 for PhD and 0 for no PhD).
!wget https://raw.githubusercontent.com/amaiya/ktrain/master/tests/resources/tabular_data/adults.csv -O /tmp/adults.csv
--2021-07-20 14:17:32--  https://raw.githubusercontent.com/amaiya/ktrain/master/tests/resources/tabular_data/adults.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573758 (4.4M) [text/plain]
Saving to: ‘/tmp/adults.csv’

/tmp/adults.csv     100%[===================>]   4.36M  26.3MB/s    in 0.2s

2021-07-20 14:17:32 (26.3 MB/s) - ‘/tmp/adults.csv’ saved [4573758/4573758]
import pandas as pd
df = pd.read_csv('/tmp/adults.csv')
df = df.rename(columns=lambda x: x.strip())
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
filter_set = {'Doctorate'}  # use a set: `x in 'Doctorate'` would do substring matching
df['treatment'] = df['education'].apply(lambda x: 1 if x in filter_set else 0)
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class | treatment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 178478 | Bachelors | 13 | Never-married | Tech-support | Own-child | White | Female | 0 | 0 | 40 | United-States | <=50K | 0 |
| 1 | 23 | State-gov | 61743 | 5th-6th | 3 | Never-married | Transport-moving | Not-in-family | White | Male | 0 | 0 | 35 | United-States | <=50K | 0 |
| 2 | 46 | Private | 376789 | HS-grad | 9 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 15 | United-States | <=50K | 0 |
| 3 | 55 | ? | 200235 | HS-grad | 9 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 50 | United-States | >50K | 0 |
| 4 | 36 | Private | 224541 | 7th-8th | 4 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 40 | El-Salvador | <=50K | 0 |
Next, let's invoke the `causal_inference_model` function to create a `CausalInferenceModel` instance and invoke `fit` to estimate the individualized treatment effect for each row in this dataset. By default, a T-Learner metalearner is used with LightGBM models as base learners; these can be adjusted using the `method` and `learner` parameters. Since this example is simply for illustration purposes, we will ignore the `fnlwgt` column, which represents the number of people the census believes the entry represents. In practice, one might incorporate domain knowledge when choosing which variables to include and ignore. For instance, variables thought to be common effects of both the treatment and the outcome might be excluded as colliders. Finally, we will also exclude the education-related columns, as they are already captured in the treatment.
from ktrain.tabular.causalinference import causal_inference_model
cm = causal_inference_model(df,
treatment_col='treatment',
outcome_col='class',
ignore_cols=['fnlwgt', 'education','education-num']).fit()
replaced ['<=50K', '>50K'] in column "class" with [0, 1]
outcome column (categorical): class
treatment column: treatment
numerical/categorical covariates: ['age', 'workclass', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
preprocess time:  0.5897183418273926 sec
start fitting causal inference model
time to fit causal inference model:  0.9125957489013672 sec
As shown above, the dataset is automatically preprocessed and fitted very quickly.
The overall average treatment effect across all examples is 0.20. That is, having a PhD increases the probability of making over $50K by roughly 20 percentage points.
cm.estimate_ate()
{'ate': 0.20340645077516034}
We can also compute treatment effects after conditioning on attributes (i.e., conditional average treatment effects, or CATEs).
For those with Master's degrees, the effect is lower than for the overall population, as expected, but still positive (qualitatively consistent with studies by the Census Bureau):
cm.estimate_ate(cm.df['education'] == 'Masters')
{'ate': 0.17672418257642838}
For those who dropped out of school, the effect is higher (also as expected):
cm.estimate_ate(cm.df['education'].isin(['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '12th']))
{'ate': 0.2586697863578173}
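Conceptually, each of these conditional estimates is just an average of per-row treatment effects over the matching subset of rows. The sketch below illustrates that averaging with a small hypothetical DataFrame (toy values, not output from the model fitted above):

```python
import pandas as pd

# Toy stand-in for a model's augmented DataFrame (hypothetical values)
toy = pd.DataFrame({
    'education': ['Masters', 'Masters', '9th', '12th', 'Doctorate'],
    'treatment_effect': [0.15, 0.19, 0.30, 0.26, 0.05],
})

# Overall ATE: mean of all individualized treatment effects
ate = toy['treatment_effect'].mean()

# CATE for a subpopulation: mean over a boolean mask
mask = toy['education'] == 'Masters'
cate_masters = toy.loc[mask, 'treatment_effect'].mean()

print(round(ate, 3))           # 0.19
print(round(cate_masters, 3))  # 0.17
```

This is the same pattern as passing a boolean condition like `cm.df['education'] == 'Masters'` above: the condition selects the subset over which effects are averaged.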
The CATEs above illustrate how causal effects vary across different subpopulations in the dataset. In fact, `CausalInferenceModel.df` stores a DataFrame representation of your dataset that has been augmented with a column called `treatment_effect`, which holds the individualized treatment effect for each row.
For instance, the individuals below are predicted to benefit the most from a PhD, with an increase of nearly 100 percentage points in the probability of making over $50K (see the `treatment_effect` column).
drop_cols = ['fnlwgt', 'education-num', 'capital-gain', 'capital-loss'] # omitted for readability
cm.df.sort_values('treatment_effect', ascending=False).drop(drop_cols, axis=1).head()
| | age | workclass | education | marital-status | occupation | relationship | race | sex | hours-per-week | native-country | class | treatment | treatment_effect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19283 | 40 | Private | HS-grad | Never-married | Adm-clerical | Not-in-family | White | Female | 38 | United-States | 0 | 0 | 0.991928 |
| 16500 | 35 | Private | Assoc-voc | Divorced | Adm-clerical | Not-in-family | White | Female | 40 | United-States | 0 | 0 | 0.991656 |
| 30597 | 72 | Private | Assoc-voc | Separated | Other-service | Unmarried | White | Female | 25 | United-States | 0 | 0 | 0.991625 |
| 9888 | 27 | Private | HS-grad | Divorced | Machine-op-inspct | Not-in-family | White | Male | 40 | United-States | 0 | 0 | 0.989816 |
| 29341 | 39 | Private | HS-grad | Divorced | Other-service | Unmarried | Amer-Indian-Eskimo | Female | 40 | United-States | 0 | 0 | 0.989737 |
Examining how the treatment effect varies across units in the population can be useful for a variety of applications. Uplift modeling, for example, is often used by companies for targeted promotions: individuals with the highest estimated treatment effects are targeted first. Measuring the impact of such campaigns after the fact is yet another way to assess the model.
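As a minimal sketch of how such targeting might work (using a hypothetical uplift table with made-up values, not the model above), one could simply rank units by estimated effect and take the top fraction:

```python
import pandas as pd

# Hypothetical per-customer uplift estimates (toy data)
uplift = pd.DataFrame({
    'customer_id': range(10),
    'treatment_effect': [0.02, 0.31, 0.07, 0.25, 0.01,
                         0.18, 0.40, 0.03, 0.12, 0.09],
})

# Target the top 30% of customers by estimated treatment effect
k = int(len(uplift) * 0.3)
targets = uplift.nlargest(k, 'treatment_effect')
print(targets['customer_id'].tolist())  # [6, 1, 3]
```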
Finally, we can predict treatment effects on new examples, as long as they are formatted like the original DataFrame. For instance, let's make a prediction for one of the rows we already examined:
df_example = cm.df.sort_values('treatment_effect', ascending=False).iloc[[0]]
df_example
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class | treatment | treatment_effect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19283 | 40 | Private | 207025 | HS-grad | 9 | Never-married | Adm-clerical | Not-in-family | White | Female | 6849 | 0 | 38 | United-States | 0 | 0 | 0.991928 |
cm.predict(df_example)
array([[0.99192821]])
As mentioned above, there is no ground truth for this problem to validate our estimates. In the cells above, we simply inspected the estimates to see if they correspond to our intuition on what should happen. Another approach to validating causal estimates is to evaluate robustness to various data manipulations (i.e., sensitivity analysis). For instance, the Placebo Treatment test replaces the treatment with a random covariate. We see below that this causes our estimate to drop to near zero, which is expected and exactly what we want. Such tests might be used to compare different models.
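The intuition behind the placebo test can be illustrated with a small toy simulation (independent of the model above): when a "treatment" is assigned at random and has no influence on the outcome, the naive difference in mean outcomes between treated and untreated units should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Outcome generated independently of a randomly assigned placebo treatment
outcome = rng.binomial(1, 0.25, size=n)
placebo = rng.binomial(1, 0.5, size=n)

# Difference in mean outcomes between "treated" and "untreated" groups
effect = outcome[placebo == 1].mean() - outcome[placebo == 0].mean()
print(round(effect, 3))  # close to 0
```

A robust causal model should behave the same way when its real treatment column is swapped for such a random one, which is what the Placebo Treatment row in the table below checks.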
cm.evaluate_robustness()
| | Method | ATE | New ATE | New ATE LB | New ATE UB | Distance from Desired (should be near 0) |
|---|---|---|---|---|---|---|
| 0 | Placebo Treatment | 0.203406 | 0.00164019 | -0.00408386 | 0.00736424 | 0.00164019 |
| 0 | Random Cause | 0.203406 | 0.230316 | 0.216585 | 0.244046 | 0.0269094 |
| 0 | Subset Data(sample size @0.8) | 0.203406 | 0.194687 | 0.173465 | 0.215908 | -0.00871967 |
| 0 | Random Replace | 0.203406 | 0.214506 | 0.201208 | 0.227804 | 0.0110997 |
ktrain uses the CausalNLP package for inferring causality. For more information, see the CausalNLP documentation.