This notebook shows the functionality of the DummyEncoder and InteractionEncoder classes of Appelpy 🍏🥧 in depth, applied to an econometrics dataset. These classes are in the utils module.
Notebook structure:
- DummyEncoder functionality: basic examples of categorical columns being encoded into dummy columns.
- InteractionEncoder functionality: multiple scenarios are covered for interactions between different data types.
The notebook ends with an example of a simple model pipeline using the InteractionEncoder.
import pandas as pd
import numpy as np
# Appelpy imports:
from appelpy.utils import DummyEncoder, InteractionEncoder
from appelpy.linear_model import OLS
# Hide Numpy warnings from Statsmodels
import warnings
warnings.filterwarnings('ignore')
The hsbdemo DTA file in this example is a dataset with 200 observations on the academic choices of students and other information about the students themselves, e.g. their academic profiles and demographic information.
df_raw = pd.read_stata('https://stats.idre.ucla.edu/stat/data/hsbdemo.dta')
df_raw.head()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45.0 | female | low | public | vocation | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | not enrolled | 0.0 | 1 |
1 | 108.0 | male | middle | public | general | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | not enrolled | 0.0 | 1 |
2 | 15.0 | male | high | public | vocation | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | not enrolled | 0.0 | 1 |
3 | 67.0 | male | low | public | vocation | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | not enrolled | 0.0 | 1 |
4 | 153.0 | male | middle | public | vocation | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | not enrolled | 0.0 | 1 |
df_raw.nunique()
id         200
female       2
ses          3
schtyp       2
prog         3
read        30
write       29
math        40
science     34
socst       22
honors       2
awards       7
cid         20
dtype: int64
The categorical columns from the Stata file are already set up to be recognised by Pandas as the pd.Categorical dtype.
NOTE: categorical data fed to the encoders should be in the pd.Categorical dtype in order for the encoding to work! They must not be in the generic object dtype.
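For example, an object column can be converted with astype (a minimal sketch using toy data, not the hsbdemo dataset):

```python
import pandas as pd

# An object-dtype column must be converted to the category dtype
# before being passed to an encoder.
df = pd.DataFrame({'ses': ['low', 'middle', 'high', 'low']})
df['ses'] = df['ses'].astype('category')  # now pd.Categorical dtype
print(df['ses'].dtype)  # category
```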
Of course the DummyEncoder also handles cases where there are NaN values for categorical data (via the nan_policy argument)! That functionality will be covered separately in another notebook.
df_raw.dtypes
id         float32
female     category
ses        category
schtyp     category
prog       category
read       float32
write      float32
math       float32
science    float32
socst      float32
honors     category
awards     float32
cid        int16
dtype: object
The female column will be recoded here as a Boolean column with values in {0, 1}, rather than the {'male', 'female'} format originally in the dataset.
NOTE: Boolean data fed to the encoders should be restricted to values in {0, 1} in order for the encoding to work!
# Recode 'female' col into 1 and 0 vals
df_raw['female'] = np.where(df_raw['female'] == 'female', 1, 0)
# Create another Bool col for use later on - col for 'read' value being higher than the mean
df_raw['read_gt_mean'] = np.where(df_raw['read'] > df_raw['read'].mean(), 1, 0)
These are some examples of the types of data in the dataset.
Boolean variables: female
Categorical variables: ses, prog
Continuous variables: read, write, math, science, socst
DummyEncoder functionality

Make a new copy of the df_raw dataframe.
The dummy_encoder object is an instance of the DummyEncoder class.
The encoder object must be initialized with a dataframe. It also takes a dictionary, where each column name is paired with a base level. If a base level is specified, then the dummy column for that category is dropped from the final dataframe.
By default, the _ separator is used to produce the dummy column names.
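Dropping a base level amounts to the following plain-pandas logic (a sketch of the concept with toy data, not appelpy's implementation):

```python
import pandas as pd

# Encode dummies for a categorical column, then drop the base
# category's column so it acts as the reference level.
df = pd.DataFrame({'prog': pd.Categorical(['general', 'academic', 'vocation'])})
dummies = pd.get_dummies(df['prog'], prefix='prog', prefix_sep='_')
dummies = dummies.drop(columns='prog_general')  # 'general' as base level
print(list(dummies.columns))  # ['prog_academic', 'prog_vocation']
```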
dummy_encoder = DummyEncoder(df_raw, {'schtyp': None,
'prog': None,
'honors': None})
Create the transformed dataframe with the transform method.
# Overwrite the dataframe - encode dummies from the categorical variables specified
df = dummy_encoder.transform()
print(f"Default NaN policy: {dummy_encoder.nan_policy}")
Default NaN policy: row_of_zero
df.head()
id | female | ses | read | write | math | science | socst | awards | cid | read_gt_mean | schtyp_public | schtyp_private | prog_general | prog_academic | prog_vocation | honors_not enrolled | honors_enrolled | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45.0 | 1 | low | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
1 | 108.0 | 0 | middle | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | 0.0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 15.0 | 0 | high | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
3 | 67.0 | 0 | low | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
4 | 153.0 | 0 | middle | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
There are three categorical variables fed to the DummyEncoder.
The original columns for all three are removed from the final dataframe once their dummy-variable equivalents have been encoded.
list(dummy_encoder.categorical_col_base_levels.keys())
['schtyp', 'prog', 'honors']
from appelpy.utils import get_dataframe_columns_diff
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df)}")
print(f"Columns added: {get_dataframe_columns_diff(df, df_raw)}")
Columns removed: ['prog', 'honors', 'schtyp'] Columns added: ['prog_academic', 'honors_not enrolled', 'honors_enrolled', 'schtyp_public', 'prog_vocation', 'prog_general', 'schtyp_private']
InteractionEncoder functionality

Make a new copy of the df_raw dataframe.
The int_encoder object is an instance of the InteractionEncoder class.
The encoder object must be initialized with a dataframe.
The # separator is used to represent the interaction between two variables in the columns produced by the encoder.
df = df_raw.copy()
Examples of interactions between variables will be given for these cases:
female × read_gt_mean
int_encoder = InteractionEncoder(df, {'female': ['read_gt_mean']})
df_enc = int_encoder.transform()
df_enc.tail()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | female#read_gt_mean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | academic | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 1 |
196 | 143.0 | 0 | middle | public | vocation | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0 |
197 | 68.0 | 0 | middle | public | academic | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0 |
198 | 57.0 | 1 | middle | public | academic | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 1 |
199 | 132.0 | 0 | middle | public | academic | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0 |
The columns for the main effects are both Boolean, so they must be kept in the final dataframe.
There is only one interaction effect between the two Boolean variables, so one column is added to the dataframe.
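A Boolean-by-Boolean interaction column is just the elementwise product of the two 0/1 columns. A sketch of what female#read_gt_mean holds (plain pandas with toy data, not appelpy's implementation):

```python
import pandas as pd

# The interaction is 1 only where both indicator columns are 1.
df = pd.DataFrame({'female': [1, 0, 1, 0], 'read_gt_mean': [1, 1, 0, 0]})
df['female#read_gt_mean'] = df['female'] * df['read_gt_mean']
print(df['female#read_gt_mean'].tolist())  # [1, 0, 0, 0]
```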
The get_dataframe_columns_diff function is useful for checking how the final dataframe differs from the original dataframe after the encoding process.
print(f"Columns removed: {get_dataframe_columns_diff(df, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df)}")
Columns removed: [] Columns added: ['female#read_gt_mean']
The code is essentially comparing the columns of the dataframes through sets.
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df)}")
print(f"Columns added: {get_dataframe_columns_diff(df, df_raw)}")
Columns removed: [] Columns added: []
print(f"Columns removed: {list(set(df.columns) - set(df_enc.columns))}")
print(f"Columns added: {list(set(df_enc.columns) - set(df.columns))}")
Columns removed: [] Columns added: ['female#read_gt_mean']
read × write
Tip: do a one-line transformation by calling transform on an instance of the encoder class.
df_enc = InteractionEncoder(df_raw, {'read': ['write']}).transform()
df_enc.tail()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | read#write | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | academic | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 4095.0 |
196 | 143.0 | 0 | middle | public | vocation | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 3969.0 |
197 | 68.0 | 0 | middle | public | academic | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 4891.0 |
198 | 57.0 | 1 | middle | public | academic | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 4615.0 |
199 | 132.0 | 0 | middle | public | academic | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 4526.0 |
The columns for the main effects are both continuous, so they must be kept in the final dataframe.
There is only one interaction effect between the two continuous variables, so one column is added to the dataframe.
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: [] Columns added: ['read#write']
prog × ses
df_enc = InteractionEncoder(df_raw, {'prog': ['ses']}).transform()
df_enc.tail()
id | female | schtyp | read | write | math | science | socst | honors | awards | ... | ses_high | prog_general#ses_low | prog_general#ses_middle | prog_general#ses_high | prog_academic#ses_low | prog_academic#ses_middle | prog_academic#ses_high | prog_vocation#ses_low | prog_vocation#ses_middle | prog_vocation#ses_high | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
196 | 143.0 | 0 | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
197 | 68.0 | 0 | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
198 | 57.0 | 1 | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
199 | 132.0 | 0 | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
The columns for the main effects are both categorical: the values in those columns are all strings. The original columns prog and ses are removed from the final dataframe, as the DummyEncoder is used on them to produce dummy columns in the final dataframe. The original columns thus become redundant.
These are the columns added to the final dataframe via the encoding:
NOTE: one of the categories could be used as a 'base level' in a regression model.
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: ['ses', 'prog'] Columns added: ['prog_vocation#ses_low', 'prog_academic#ses_middle', 'ses_low', 'prog_academic', 'prog_general#ses_high', 'prog_general#ses_low', 'prog_vocation#ses_middle', 'prog_academic#ses_low', 'prog_vocation', 'ses_high', 'prog_academic#ses_high', 'prog_vocation#ses_high', 'prog_general#ses_middle', 'ses_middle', 'prog_general']
The key-value pair in the class initialization can also be switched to produce a dataframe with the same information, although the column names for the interaction effects will differ.
df_enc = InteractionEncoder(df_raw, {'ses': ['prog']}).transform()
df_enc.tail()
id | female | schtyp | read | write | math | science | socst | honors | awards | ... | prog_vocation | ses_low#prog_general | ses_low#prog_academic | ses_low#prog_vocation | ses_middle#prog_general | ses_middle#prog_academic | ses_middle#prog_vocation | ses_high#prog_general | ses_high#prog_academic | ses_high#prog_vocation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
196 | 143.0 | 0 | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
197 | 68.0 | 0 | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
198 | 57.0 | 1 | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
199 | 132.0 | 0 | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: ['ses', 'prog'] Columns added: ['ses_low', 'prog_academic', 'ses_middle#prog_academic', 'prog_general', 'ses_middle#prog_general', 'ses_low#prog_general', 'ses_high#prog_academic', 'ses_low#prog_academic', 'prog_vocation', 'ses_low#prog_vocation', 'ses_high', 'ses_middle', 'ses_middle#prog_vocation', 'ses_high#prog_vocation', 'ses_high#prog_general']
prog × female
df_enc = InteractionEncoder(df_raw, {'prog': ['female']}).transform()
df_enc.tail()
id | female | ses | schtyp | read | write | math | science | socst | honors | awards | cid | read_gt_mean | prog_general | prog_academic | prog_vocation | prog_general#female | prog_academic#female | prog_vocation#female | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
196 | 143.0 | 0 | middle | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
197 | 68.0 | 0 | middle | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
198 | 57.0 | 1 | middle | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
199 | 132.0 | 0 | middle | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
One of the main effect columns is for a Boolean variable, so that must be kept in the final dataframe. The other main effect is a categorical variable, so dummy columns are encoded for it and the original column is removed in the final dataframe.
The columns added:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: ['prog'] Columns added: ['prog_academic', 'prog_vocation#female', 'prog_academic#female', 'prog_vocation', 'prog_general#female', 'prog_general']
In this case let's encode interactions between female and two continuous variables!

female × read and write
df_enc = InteractionEncoder(df_raw, {'female': ['read', 'write']}).transform()
df_enc.tail()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | female#read | female#write | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | academic | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 63.0 | 65.0 |
196 | 143.0 | 0 | middle | public | vocation | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0.0 | 0.0 |
197 | 68.0 | 0 | middle | public | academic | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0.0 | 0.0 |
198 | 57.0 | 1 | middle | public | academic | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 71.0 | 65.0 |
199 | 132.0 | 0 | middle | public | academic | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0.0 | 0.0 |
The columns for the main effects are Boolean or continuous, so they must be kept in the final dataframe.
There is one interaction effect for each Boolean–continuous pairing, so one column is added to the dataframe per pairing.
(In this case, two continuous variables were interacted with female, so two interaction effects are added to the final dataframe.)
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: [] Columns added: ['female#write', 'female#read']
df_enc = InteractionEncoder(df_raw, {'read': ['female'],
'write': ['female']}).transform()
df_enc.tail()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | read#female | write#female | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | academic | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 63.0 | 65.0 |
196 | 143.0 | 0 | middle | public | vocation | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0.0 | 0.0 |
197 | 68.0 | 0 | middle | public | academic | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0.0 | 0.0 |
198 | 57.0 | 1 | middle | public | academic | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 71.0 | 65.0 |
199 | 132.0 | 0 | middle | public | academic | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0.0 | 0.0 |
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: [] Columns added: ['read#female', 'write#female']
prog × socst
df_enc = InteractionEncoder(df_raw, {'socst': ['prog']}).transform()
df_enc.tail()
id | female | ses | schtyp | read | write | math | science | socst | honors | awards | cid | read_gt_mean | prog_general | prog_academic | prog_vocation | socst#prog_general | socst#prog_academic | socst#prog_vocation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 71.0 | 0.0 |
196 | 143.0 | 0 | middle | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0 | 0 | 1 | 0.0 | 0.0 | 66.0 |
197 | 68.0 | 0 | middle | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 66.0 | 0.0 |
198 | 57.0 | 1 | middle | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 56.0 | 0.0 |
199 | 132.0 | 0 | middle | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 66.0 | 0.0 |
One of the main effects is continuous, so the column for that one must be kept in the final dataframe. The other main effect is a categorical variable, so the original column is dropped from the final dataframe after dummy columns are encoded from it.
There is an interaction effect between each of the dummy variables and the continuous variable.
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: ['prog'] Columns added: ['prog_academic', 'socst#prog_vocation', 'prog_vocation', 'socst#prog_general', 'socst#prog_academic', 'prog_general']
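A continuous-by-dummy interaction column such as socst#prog_academic holds the continuous score where the dummy is 1 and zero elsewhere. A sketch with toy data (plain pandas, not appelpy's implementation):

```python
import pandas as pd

# Product of a continuous column and a 0/1 dummy column.
df = pd.DataFrame({'socst': [71.0, 66.0, 66.0],
                   'prog_academic': [1, 0, 1]})
df['socst#prog_academic'] = df['socst'] * df['prog_academic']
print(df['socst#prog_academic'].tolist())  # [71.0, 0.0, 66.0]
```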
InteractionEncoder(df_raw, {'prog': ['socst']}).transform().tail()
id | female | ses | schtyp | read | write | math | science | socst | honors | awards | cid | read_gt_mean | prog_general | prog_academic | prog_vocation | prog_general#socst | prog_academic#socst | prog_vocation#socst | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 71.0 | 0.0 |
196 | 143.0 | 0 | middle | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0 | 0 | 1 | 0.0 | 0.0 | 66.0 |
197 | 68.0 | 0 | middle | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 66.0 | 0.0 |
198 | 57.0 | 1 | middle | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 56.0 | 0.0 |
199 | 132.0 | 0 | middle | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 66.0 | 0.0 |
Let's do basic OLS regression models using the dataset, where interaction effects are also used as variables in modelling.
UCLA's online resources have models of interaction effects on this dataset with Stata output (e.g. using female, which is made Boolean in this notebook). The Stata output for each model is also provided in this notebook for comparison against the models done through Appelpy.
Create a new dataframe and set up the InteractionEncoder object.
df_model = df_raw.copy()
Let's regress read on the scores for math, socst and the interaction between math & socst.
To get the interaction effect in the dataframe, we need to do some encoding to get the column math#socst.
df_model = InteractionEncoder(df_model, {'math': ['socst']}).transform()
df_model.head()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | math#socst | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45.0 | 1 | low | public | vocation | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | not enrolled | 0.0 | 1 | 0 | 1066.0 |
1 | 108.0 | 0 | middle | public | general | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | not enrolled | 0.0 | 1 | 0 | 1476.0 |
2 | 15.0 | 0 | high | public | vocation | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | not enrolled | 0.0 | 1 | 0 | 1848.0 |
3 | 67.0 | 0 | low | public | vocation | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | not enrolled | 0.0 | 1 | 0 | 1344.0 |
4 | 153.0 | 0 | middle | public | vocation | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | not enrolled | 0.0 | 1 | 0 | 2040.0 |
y_list = ['read']
X_list = ['math', 'socst', 'math#socst']
model = OLS(df_model, y_list, X_list).fit()
model.results_output
Dep. Variable: | read | R-squared: | 0.546 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.539 |
Method: | Least Squares | F-statistic: | 78.61 |
Date: | Fri, 03 Jan 2020 | Prob (F-statistic): | 1.99e-33 |
Time: | 21:39:12 | Log-Likelihood: | -669.80 |
No. Observations: | 200 | AIC: | 1348. |
Df Residuals: | 196 | BIC: | 1361. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 37.8427 | 14.545 | 2.602 | 0.010 | 9.158 | 66.528 |
math | -0.1105 | 0.292 | -0.379 | 0.705 | -0.686 | 0.465 |
socst | -0.2200 | 0.272 | -0.810 | 0.419 | -0.756 | 0.316 |
math#socst | 0.0113 | 0.005 | 2.157 | 0.032 | 0.001 | 0.022 |
Omnibus: | 3.611 | Durbin-Watson: | 1.839 |
---|---|---|---|
Prob(Omnibus): | 0.164 | Jarque-Bera (JB): | 3.555 |
Skew: | 0.325 | Prob(JB): | 0.169 |
Kurtosis: | 2.942 | Cond. No. | 8.76e+04 |
The interaction between math and socst, i.e. math#socst, is significant.
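With an interaction term, the marginal effect of math on read depends on the level of socst. A quick sketch using the coefficients from the regression table above:

```python
# Slope of read with respect to math, given the interaction:
# d(read)/d(math) = b_math + b_interaction * socst
# (coefficients taken from the regression table above)
b_math = -0.1105
b_interaction = 0.0113

def math_slope(socst):
    return b_math + b_interaction * socst

print(math_slope(50))  # ≈ 0.45 at a socst score of 50
```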
model.model_selection_stats
{'root_mse': 6.96003820368867, 'r_squared': 0.5461318818125249, 'r_squared_adj': 0.5391849208198595, 'aic': 1347.6088571651621, 'bic': 1360.8021266313542}
This is what the model output would be from Stata:
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 3, 196) = 78.61
Model | 11424.7622 3 3808.25406 Prob > F = 0.0000
Residual | 9494.65783 196 48.4421318 R-squared = 0.5461
-------------+------------------------------ Adj R-squared = 0.5392
Total | 20919.42 199 105.122714 Root MSE = 6.96
------------------------------------------------------------------------------
read | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | -.1105123 .2916338 -0.38 0.705 -.6856552 .4646307
socst | -.2200442 .2717539 -0.81 0.419 -.7559812 .3158928
|
c.math#|
c.socst | .0112807 .0052294 2.16 0.032 .0009677 .0215938
|
_cons | 37.84271 14.54521 2.60 0.010 9.157506 66.52792
------------------------------------------------------------------------------
df_model = InteractionEncoder(df_raw, {'female': ['socst']}).transform()
df_model.head()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | female#socst | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45.0 | 1 | low | public | vocation | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | not enrolled | 0.0 | 1 | 0 | 26.0 |
1 | 108.0 | 0 | middle | public | general | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | not enrolled | 0.0 | 1 | 0 | 0.0 |
2 | 15.0 | 0 | high | public | vocation | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | not enrolled | 0.0 | 1 | 0 | 0.0 |
3 | 67.0 | 0 | low | public | vocation | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | not enrolled | 0.0 | 1 | 0 | 0.0 |
4 | 153.0 | 0 | middle | public | vocation | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | not enrolled | 0.0 | 1 | 0 | 0.0 |
model = OLS(df_model, ['write'], ['female', 'socst', 'female#socst']).fit()
model.results_output
Dep. Variable: | write | R-squared: | 0.430 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.421 |
Method: | Least Squares | F-statistic: | 49.26 |
Date: | Fri, 03 Jan 2020 | Prob (F-statistic): | 9.02e-24 |
Time: | 21:39:12 | Log-Likelihood: | -676.91 |
No. Observations: | 200 | AIC: | 1362. |
Df Residuals: | 196 | BIC: | 1375. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 17.7619 | 3.555 | 4.996 | 0.000 | 10.751 | 24.773 |
female | 15.0000 | 5.098 | 2.942 | 0.004 | 4.946 | 25.054 |
socst | 0.6248 | 0.067 | 9.315 | 0.000 | 0.493 | 0.757 |
female#socst | -0.2047 | 0.095 | -2.147 | 0.033 | -0.393 | -0.017 |
Omnibus: | 2.193 | Durbin-Watson: | 1.266 |
---|---|---|---|
Prob(Omnibus): | 0.334 | Jarque-Bera (JB): | 2.004 |
Skew: | -0.152 | Prob(JB): | 0.367 |
Kurtosis: | 2.615 | Cond. No. | 713. |
The interaction between female and socst, i.e. female#socst, is significant.
In the UCLA resources the chart shows how the slopes for the effect of socst vary by gender.
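Those gender-specific slopes can be read straight off the fitted coefficients (a sketch using the values from the table above):

```python
# Slopes of write on socst implied by the coefficients above:
# male (female == 0): b_socst
# female (female == 1): b_socst + b_interaction
b_socst = 0.6248
b_interaction = -0.2047

slope_male = b_socst
slope_female = b_socst + b_interaction
print(slope_male, slope_female)  # 0.6248 and roughly 0.42
```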
model.model_selection_stats
{'root_mse': 7.211611852775864, 'r_squared': 0.42986123794053965, 'r_squared_adj': 0.4211346242355479, 'aic': 1361.811865520546, 'bic': 1375.005134986738}
This is what the regression output would be from Stata:
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 3, 196) = 49.26
Model | 7685.43528 3 2561.81176 Prob > F = 0.0000
Residual | 10193.4397 196 52.0073455 R-squared = 0.4299
-------------+------------------------------ Adj R-squared = 0.4211
Total | 17878.875 199 89.843593 Root MSE = 7.2116
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.female | 15.00001 5.09795 2.94 0.004 4.946132 25.05389
socst | .6247968 .0670709 9.32 0.000 .4925236 .7570701
|
female#|
c.socst |
1 | -.2047288 .0953726 -2.15 0.033 -.3928171 -.0166405
|
_cons | 17.7619 3.554993 5.00 0.000 10.75095 24.77284
------------------------------------------------------------------------------
It's possible to make model pipelines by chaining Appelpy steps with the Pandas pipe method.
def process_data(raw_df):
return (raw_df
.pipe(InteractionEncoder, {'female': ['socst']})
.transform())
def fit_model(df, y_list, X_list):
return OLS(df, y_list, X_list).fit()
The cell below retrieves the previous model_selection_stats via a Pandas pipeline.
(df_raw
.pipe(process_data)
.pipe(fit_model, ['write'], ['female', 'socst', 'female#socst'])
.model_selection_stats)
{'root_mse': 7.211611852775864, 'r_squared': 0.42986123794053965, 'r_squared_adj': 0.4211346242355479, 'aic': 1361.811865520546, 'bic': 1375.005134986738}