This notebook shows the functionality of the DummyEncoder and InteractionEncoder classes of Appelpy 🍏🥧 in depth, applied to an econometrics dataset. These classes are in the utils module.
Notebook structure:
- DummyEncoder functionality: basic examples of categorical columns being encoded into dummy columns.
- InteractionEncoder functionality: multiple scenarios are covered for interactions between different data types.
The notebook ends with an example of a simple model pipeline using the InteractionEncoder.
import pandas as pd
import numpy as np
# Appelpy imports:
from appelpy.utils import DummyEncoder, InteractionEncoder
from appelpy.linear_model import OLS
# Hide Numpy warnings from Statsmodels
import warnings
warnings.filterwarnings('ignore')
The hsbdemo DTA file in this example is a dataset with 200 observations on the academic choices of students and other information about the students themselves, e.g. their academic profiles and demographic information.
df_raw = pd.read_stata('https://stats.idre.ucla.edu/stat/data/hsbdemo.dta')
df_raw.head()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45.0 | female | low | public | vocation | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | not enrolled | 0.0 | 1 |
1 | 108.0 | male | middle | public | general | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | not enrolled | 0.0 | 1 |
2 | 15.0 | male | high | public | vocation | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | not enrolled | 0.0 | 1 |
3 | 67.0 | male | low | public | vocation | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | not enrolled | 0.0 | 1 |
4 | 153.0 | male | middle | public | vocation | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | not enrolled | 0.0 | 1 |
df_raw.nunique()
id         200
female       2
ses          3
schtyp       2
prog         3
read        30
write       29
math        40
science     34
socst       22
honors       2
awards       7
cid         20
dtype: int64
The categorical columns from the Stata file are already set up to be recognised by Pandas as the pd.Categorical dtype.
NOTE: categorical data fed to the encoders should be in the pd.Categorical dtype in order for the encoding to work! They must not be in the generic object dtype.
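For example, an object column can be converted with astype (a minimal sketch using toy data, not the hsbdemo dataset):

```python
import pandas as pd

# An object-dtype column must be converted to the category dtype
# before being passed to an encoder.
df = pd.DataFrame({'ses': ['low', 'middle', 'high', 'low']})
df['ses'] = df['ses'].astype('category')  # now pd.Categorical dtype
print(df['ses'].dtype)  # category
```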
Of course the DummyEncoder also handles cases where there are NaN values for categorical data (via the nan_policy argument)! That functionality will be covered separately in another notebook.
df_raw.dtypes
id         float32
female     category
ses        category
schtyp     category
prog       category
read       float32
write      float32
math       float32
science    float32
socst      float32
honors     category
awards     float32
cid        int16
dtype: object
The female column will be recoded here as a Boolean column with values in {0, 1}, rather than the {'male', 'female'} format originally in the dataset.
NOTE: Boolean data fed to the encoders should be restricted to values in {0, 1} in order for the encoding to work!
# Recode 'female' col into 1 and 0 vals
df_raw['female'] = np.where(df_raw['female'] == 'female', 1, 0)
# Create another Bool col for use later on - col for 'read' value being higher than the mean
df_raw['read_gt_mean'] = np.where(df_raw['read'] > df_raw['read'].mean(), 1, 0)
These are some examples of the types of data in the dataset.
Boolean variables: female
Categorical variables: ses, prog
Continuous variables: read, write, math, science, socst
DummyEncoder functionality

Make a new copy of the df_raw dataframe.
The dummy_encoder object is an instance of the DummyEncoder class.
The encoder object must be initialized with a dataframe. It also takes a dictionary, where each column name is paired with a base level. If a base level is specified, then the dummy column for that category is dropped from the final dataframe.
By default, the _ separator is used to produce the dummy column names.
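Dropping a base level amounts to the following plain-pandas logic (a sketch of the concept with toy data, not appelpy's implementation):

```python
import pandas as pd

# Encode dummies for a categorical column, then drop the base
# category's column so it acts as the reference level.
df = pd.DataFrame({'prog': pd.Categorical(['general', 'academic', 'vocation'])})
dummies = pd.get_dummies(df['prog'], prefix='prog', prefix_sep='_')
dummies = dummies.drop(columns='prog_general')  # 'general' as base level
print(list(dummies.columns))  # ['prog_academic', 'prog_vocation']
```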
dummy_encoder = DummyEncoder(df_raw, {'schtyp': None,
'prog': None,
'honors': None})
Create the transformed dataframe with the transform method.
# Overwrite the dataframe - encode dummies from the categorical variables specified
df = dummy_encoder.transform()
print(f"Default NaN policy: {dummy_encoder.nan_policy}")
Default NaN policy: row_of_zero
df.head()
id | female | ses | read | write | math | science | socst | awards | cid | read_gt_mean | schtyp_public | schtyp_private | prog_general | prog_academic | prog_vocation | honors_not enrolled | honors_enrolled | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45.0 | 1 | low | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
1 | 108.0 | 0 | middle | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | 0.0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 15.0 | 0 | high | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
3 | 67.0 | 0 | low | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
4 | 153.0 | 0 | middle | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
There are three categorical variables fed to the DummyEncoder.
The original columns for all three are removed from the final dataframe once their dummy-variable equivalents have been encoded.
list(dummy_encoder.categorical_col_base_levels.keys())
['schtyp', 'prog', 'honors']
from appelpy.utils import get_dataframe_columns_diff
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df)}")
print(f"Columns added: {get_dataframe_columns_diff(df, df_raw)}")
Columns removed: ['prog', 'honors', 'schtyp'] Columns added: ['prog_academic', 'honors_not enrolled', 'honors_enrolled', 'schtyp_public', 'prog_vocation', 'prog_general', 'schtyp_private']
InteractionEncoder functionality

Make a new copy of the df_raw dataframe.
The int_encoder object is an instance of the InteractionEncoder class.
The encoder object must be initialized with a dataframe.
The # separator is used to represent the interaction between two variables in the columns produced by the encoder.
df = df_raw.copy()
Examples of interactions between variables will be given for these cases:
female × read_gt_mean
int_encoder = InteractionEncoder(df, {'female': ['read_gt_mean']})
df_enc = int_encoder.transform()
df_enc.tail()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | female#read_gt_mean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | academic | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 1 |
196 | 143.0 | 0 | middle | public | vocation | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0 |
197 | 68.0 | 0 | middle | public | academic | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0 |
198 | 57.0 | 1 | middle | public | academic | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 1 |
199 | 132.0 | 0 | middle | public | academic | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0 |
The columns for the main effects are both Boolean, so they must be kept in the final dataframe.
There is only one interaction effect between the two Boolean variables, so one column is added to the dataframe.
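A Boolean-by-Boolean interaction column is just the elementwise product of the two 0/1 columns. A sketch of what female#read_gt_mean holds (plain pandas with toy data, not appelpy's implementation):

```python
import pandas as pd

# The interaction is 1 only where both indicator columns are 1.
df = pd.DataFrame({'female': [1, 0, 1, 0], 'read_gt_mean': [1, 1, 0, 0]})
df['female#read_gt_mean'] = df['female'] * df['read_gt_mean']
print(df['female#read_gt_mean'].tolist())  # [1, 0, 0, 0]
```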
The get_dataframe_columns_diff function is useful for checking how the final dataframe differs from the original dataframe after the encoding process.
print(f"Columns removed: {get_dataframe_columns_diff(df, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df)}")
Columns removed: [] Columns added: ['female#read_gt_mean']
The code is essentially comparing the columns of the dataframes through sets.
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df)}")
print(f"Columns added: {get_dataframe_columns_diff(df, df_raw)}")
Columns removed: [] Columns added: []
print(f"Columns removed: {list(set(df.columns) - set(df_enc.columns))}")
print(f"Columns added: {list(set(df_enc.columns) - set(df.columns))}")
Columns removed: [] Columns added: ['female#read_gt_mean']
read × write
Tip: do a one-line transformation by calling transform on an instance of the encoder class.
df_enc = InteractionEncoder(df_raw, {'read': ['write']}).transform()
df_enc.tail()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | read#write | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | academic | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 4095.0 |
196 | 143.0 | 0 | middle | public | vocation | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 3969.0 |
197 | 68.0 | 0 | middle | public | academic | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 4891.0 |
198 | 57.0 | 1 | middle | public | academic | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 4615.0 |
199 | 132.0 | 0 | middle | public | academic | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 4526.0 |
The columns for the main effects are both continuous, so they must be kept in the final dataframe.
There is only one interaction effect between the two continuous variables, so one column is added to the dataframe.
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: [] Columns added: ['read#write']
prog × ses
df_enc = InteractionEncoder(df_raw, {'prog': ['ses']}).transform()
df_enc.tail()
id | female | schtyp | read | write | math | science | socst | honors | awards | ... | ses_high | prog_general#ses_low | prog_general#ses_middle | prog_general#ses_high | prog_academic#ses_low | prog_academic#ses_middle | prog_academic#ses_high | prog_vocation#ses_low | prog_vocation#ses_middle | prog_vocation#ses_high | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
196 | 143.0 | 0 | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
197 | 68.0 | 0 | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
198 | 57.0 | 1 | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
199 | 132.0 | 0 | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
The columns for the main effects are both categorical: the values in those columns are all strings. The original columns prog and ses are removed from the final dataframe, as the DummyEncoder is used on them to produce dummy columns in the final dataframe. The original columns thus become redundant.
These are the columns added to the final dataframe via the encoding:
NOTE: one of the categories could be used as a 'base level' in a regression model.
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: ['ses', 'prog'] Columns added: ['prog_vocation#ses_low', 'prog_academic#ses_middle', 'ses_low', 'prog_academic', 'prog_general#ses_high', 'prog_general#ses_low', 'prog_vocation#ses_middle', 'prog_academic#ses_low', 'prog_vocation', 'ses_high', 'prog_academic#ses_high', 'prog_vocation#ses_high', 'prog_general#ses_middle', 'ses_middle', 'prog_general']
The key-value pair in the class initialization can also be switched to produce a dataframe with the same information, although the column names for the interaction effects will differ.
df_enc = InteractionEncoder(df_raw, {'ses': ['prog']}).transform()
df_enc.tail()
id | female | schtyp | read | write | math | science | socst | honors | awards | ... | prog_vocation | ses_low#prog_general | ses_low#prog_academic | ses_low#prog_vocation | ses_middle#prog_general | ses_middle#prog_academic | ses_middle#prog_vocation | ses_high#prog_general | ses_high#prog_academic | ses_high#prog_vocation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
196 | 143.0 | 0 | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
197 | 68.0 | 0 | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
198 | 57.0 | 1 | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
199 | 132.0 | 0 | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: ['ses', 'prog'] Columns added: ['ses_low', 'prog_academic', 'ses_middle#prog_academic', 'prog_general', 'ses_middle#prog_general', 'ses_low#prog_general', 'ses_high#prog_academic', 'ses_low#prog_academic', 'prog_vocation', 'ses_low#prog_vocation', 'ses_high', 'ses_middle', 'ses_middle#prog_vocation', 'ses_high#prog_vocation', 'ses_high#prog_general']
prog × female
df_enc = InteractionEncoder(df_raw, {'prog': ['female']}).transform()
df_enc.tail()
id | female | ses | schtyp | read | write | math | science | socst | honors | awards | cid | read_gt_mean | prog_general | prog_academic | prog_vocation | prog_general#female | prog_academic#female | prog_vocation#female | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
196 | 143.0 | 0 | middle | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
197 | 68.0 | 0 | middle | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
198 | 57.0 | 1 | middle | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
199 | 132.0 | 0 | middle | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
One of the main effect columns is for a Boolean variable, so that must be kept in the final dataframe. The other main effect is a categorical variable, so dummy columns are encoded for it and the original column is removed in the final dataframe.
The columns added:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: ['prog'] Columns added: ['prog_academic', 'prog_vocation#female', 'prog_academic#female', 'prog_vocation', 'prog_general#female', 'prog_general']
In this case let's encode interactions between female and two continuous variables!

female × read and write
df_enc = InteractionEncoder(df_raw, {'female': ['read', 'write']}).transform()
df_enc.tail()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | female#read | female#write | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | academic | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 63.0 | 65.0 |
196 | 143.0 | 0 | middle | public | vocation | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0.0 | 0.0 |
197 | 68.0 | 0 | middle | public | academic | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0.0 | 0.0 |
198 | 57.0 | 1 | middle | public | academic | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 71.0 | 65.0 |
199 | 132.0 | 0 | middle | public | academic | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0.0 | 0.0 |
The columns for the main effects are Boolean or continuous, so they must be kept in the final dataframe.
There is one interaction effect for each Boolean–continuous pairing, so one column is added to the dataframe per pairing.
(In this case, two continuous variables were interacted with female, so two interaction effects are added to the final dataframe.)
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: [] Columns added: ['female#write', 'female#read']
df_enc = InteractionEncoder(df_raw, {'read': ['female'],
'write': ['female']}).transform()
df_enc.tail()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | read#female | write#female | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | academic | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 63.0 | 65.0 |
196 | 143.0 | 0 | middle | public | vocation | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0.0 | 0.0 |
197 | 68.0 | 0 | middle | public | academic | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0.0 | 0.0 |
198 | 57.0 | 1 | middle | public | academic | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 71.0 | 65.0 |
199 | 132.0 | 0 | middle | public | academic | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0.0 | 0.0 |
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: [] Columns added: ['read#female', 'write#female']
prog × socst
df_enc = InteractionEncoder(df_raw, {'socst': ['prog']}).transform()
df_enc.tail()
id | female | ses | schtyp | read | write | math | science | socst | honors | awards | cid | read_gt_mean | prog_general | prog_academic | prog_vocation | socst#prog_general | socst#prog_academic | socst#prog_vocation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 71.0 | 0.0 |
196 | 143.0 | 0 | middle | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0 | 0 | 1 | 0.0 | 0.0 | 66.0 |
197 | 68.0 | 0 | middle | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 66.0 | 0.0 |
198 | 57.0 | 1 | middle | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 56.0 | 0.0 |
199 | 132.0 | 0 | middle | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 66.0 | 0.0 |
One of the main effects is continuous, so the column for that one must be kept in the final dataframe. The other main effect is a categorical variable, so the original column is dropped from the final dataframe after dummy columns are encoded from it.
There is an interaction effect between each of the dummy variables and the continuous variable.
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
Columns removed: ['prog'] Columns added: ['prog_academic', 'socst#prog_vocation', 'prog_vocation', 'socst#prog_general', 'socst#prog_academic', 'prog_general']
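A continuous-by-dummy interaction column such as socst#prog_academic holds the continuous score where the dummy is 1 and zero elsewhere. A sketch with toy data (plain pandas, not appelpy's implementation):

```python
import pandas as pd

# Product of a continuous column and a 0/1 dummy column.
df = pd.DataFrame({'socst': [71.0, 66.0, 66.0],
                   'prog_academic': [1, 0, 1]})
df['socst#prog_academic'] = df['socst'] * df['prog_academic']
print(df['socst#prog_academic'].tolist())  # [71.0, 0.0, 66.0]
```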
InteractionEncoder(df_raw, {'prog': ['socst']}).transform().tail()
id | female | ses | schtyp | read | write | math | science | socst | honors | awards | cid | read_gt_mean | prog_general | prog_academic | prog_vocation | prog_general#socst | prog_academic#socst | prog_vocation#socst | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | 100.0 | 1 | high | public | 63.0 | 65.0 | 71.0 | 69.0 | 71.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 71.0 | 0.0 |
196 | 143.0 | 0 | middle | public | 63.0 | 63.0 | 75.0 | 72.0 | 66.0 | enrolled | 4.0 | 20 | 1 | 0 | 0 | 1 | 0.0 | 0.0 | 66.0 |
197 | 68.0 | 0 | middle | public | 73.0 | 67.0 | 71.0 | 63.0 | 66.0 | enrolled | 7.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 66.0 | 0.0 |
198 | 57.0 | 1 | middle | public | 71.0 | 65.0 | 72.0 | 66.0 | 56.0 | enrolled | 5.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 56.0 | 0.0 |
199 | 132.0 | 0 | middle | public | 73.0 | 62.0 | 73.0 | 69.0 | 66.0 | enrolled | 3.0 | 20 | 1 | 0 | 1 | 0 | 0.0 | 66.0 | 0.0 |
Let's do basic OLS regression models using the dataset, where interaction effects are also used as variables in modelling.
UCLA's online resources have models of interaction effects on this dataset with Stata output (e.g. using female, which is made Boolean in this notebook). The Stata output for each model is also provided in this notebook for comparison against the models done through Appelpy.
Create a new dataframe and set up the InteractionEncoder object.
df_model = df_raw.copy()
Let's regress read on the scores for math, socst and the interaction between math & socst.
To get the interaction effect in the dataframe, we need to do some encoding to get the column math#socst.
df_model = InteractionEncoder(df_model, {'math': ['socst']}).transform()
df_model.head()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | math#socst | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45.0 | 1 | low | public | vocation | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | not enrolled | 0.0 | 1 | 0 | 1066.0 |
1 | 108.0 | 0 | middle | public | general | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | not enrolled | 0.0 | 1 | 0 | 1476.0 |
2 | 15.0 | 0 | high | public | vocation | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | not enrolled | 0.0 | 1 | 0 | 1848.0 |
3 | 67.0 | 0 | low | public | vocation | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | not enrolled | 0.0 | 1 | 0 | 1344.0 |
4 | 153.0 | 0 | middle | public | vocation | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | not enrolled | 0.0 | 1 | 0 | 2040.0 |
y_list = ['read']
X_list = ['math', 'socst', 'math#socst']
model = OLS(df_model, y_list, X_list).fit()
model.results_output
Dep. Variable: | read | R-squared: | 0.546 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.539 |
Method: | Least Squares | F-statistic: | 78.61 |
Date: | Fri, 03 Jan 2020 | Prob (F-statistic): | 1.99e-33 |
Time: | 21:39:12 | Log-Likelihood: | -669.80 |
No. Observations: | 200 | AIC: | 1348. |
Df Residuals: | 196 | BIC: | 1361. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 37.8427 | 14.545 | 2.602 | 0.010 | 9.158 | 66.528 |
math | -0.1105 | 0.292 | -0.379 | 0.705 | -0.686 | 0.465 |
socst | -0.2200 | 0.272 | -0.810 | 0.419 | -0.756 | 0.316 |
math#socst | 0.0113 | 0.005 | 2.157 | 0.032 | 0.001 | 0.022 |
Omnibus: | 3.611 | Durbin-Watson: | 1.839 |
---|---|---|---|
Prob(Omnibus): | 0.164 | Jarque-Bera (JB): | 3.555 |
Skew: | 0.325 | Prob(JB): | 0.169 |
Kurtosis: | 2.942 | Cond. No. | 8.76e+04 |
The interaction between math and socst, i.e. math#socst, is significant.
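With an interaction term, the marginal effect of math on read depends on the level of socst. A quick sketch using the coefficients from the regression table above:

```python
# Slope of read with respect to math, given the interaction:
# d(read)/d(math) = b_math + b_interaction * socst
# (coefficients taken from the regression table above)
b_math = -0.1105
b_interaction = 0.0113

def math_slope(socst):
    return b_math + b_interaction * socst

print(math_slope(50))  # ≈ 0.45 at a socst score of 50
```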
model.model_selection_stats
{'root_mse': 6.96003820368867, 'r_squared': 0.5461318818125249, 'r_squared_adj': 0.5391849208198595, 'aic': 1347.6088571651621, 'bic': 1360.8021266313542}
This is what the model output would be from Stata:
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 3, 196) = 78.61
Model | 11424.7622 3 3808.25406 Prob > F = 0.0000
Residual | 9494.65783 196 48.4421318 R-squared = 0.5461
-------------+------------------------------ Adj R-squared = 0.5392
Total | 20919.42 199 105.122714 Root MSE = 6.96
------------------------------------------------------------------------------
read | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | -.1105123 .2916338 -0.38 0.705 -.6856552 .4646307
socst | -.2200442 .2717539 -0.81 0.419 -.7559812 .3158928
|
c.math#|
c.socst | .0112807 .0052294 2.16 0.032 .0009677 .0215938
|
_cons | 37.84271 14.54521 2.60 0.010 9.157506 66.52792
------------------------------------------------------------------------------
df_model = InteractionEncoder(df_raw, {'female': ['socst']}).transform()
df_model.head()
id | female | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | read_gt_mean | female#socst | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45.0 | 1 | low | public | vocation | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | not enrolled | 0.0 | 1 | 0 | 26.0 |
1 | 108.0 | 0 | middle | public | general | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | not enrolled | 0.0 | 1 | 0 | 0.0 |
2 | 15.0 | 0 | high | public | vocation | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | not enrolled | 0.0 | 1 | 0 | 0.0 |
3 | 67.0 | 0 | low | public | vocation | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | not enrolled | 0.0 | 1 | 0 | 0.0 |
4 | 153.0 | 0 | middle | public | vocation | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | not enrolled | 0.0 | 1 | 0 | 0.0 |
model = OLS(df_model, ['write'], ['female', 'socst', 'female#socst']).fit()
model.results_output
Dep. Variable: | write | R-squared: | 0.430 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.421 |
Method: | Least Squares | F-statistic: | 49.26 |
Date: | Fri, 03 Jan 2020 | Prob (F-statistic): | 9.02e-24 |
Time: | 21:39:12 | Log-Likelihood: | -676.91 |
No. Observations: | 200 | AIC: | 1362. |
Df Residuals: | 196 | BIC: | 1375. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 17.7619 | 3.555 | 4.996 | 0.000 | 10.751 | 24.773 |
female | 15.0000 | 5.098 | 2.942 | 0.004 | 4.946 | 25.054 |
socst | 0.6248 | 0.067 | 9.315 | 0.000 | 0.493 | 0.757 |
female#socst | -0.2047 | 0.095 | -2.147 | 0.033 | -0.393 | -0.017 |
Omnibus: | 2.193 | Durbin-Watson: | 1.266 |
---|---|---|---|
Prob(Omnibus): | 0.334 | Jarque-Bera (JB): | 2.004 |
Skew: | -0.152 | Prob(JB): | 0.367 |
Kurtosis: | 2.615 | Cond. No. | 713. |
The interaction between female and socst, i.e. female#socst, is significant.
In the UCLA resources the chart shows how the slopes for the effect of socst vary by gender.
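Those gender-specific slopes can be read straight off the fitted coefficients (a sketch using the values from the table above):

```python
# Slopes of write on socst implied by the coefficients above:
# male (female == 0): b_socst
# female (female == 1): b_socst + b_interaction
b_socst = 0.6248
b_interaction = -0.2047

slope_male = b_socst
slope_female = b_socst + b_interaction
print(slope_male, slope_female)  # 0.6248 and roughly 0.42
```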
model.model_selection_stats
{'root_mse': 7.211611852775864, 'r_squared': 0.42986123794053965, 'r_squared_adj': 0.4211346242355479, 'aic': 1361.811865520546, 'bic': 1375.005134986738}
This is what the regression output would be from Stata:
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 3, 196) = 49.26
Model | 7685.43528 3 2561.81176 Prob > F = 0.0000
Residual | 10193.4397 196 52.0073455 R-squared = 0.4299
-------------+------------------------------ Adj R-squared = 0.4211
Total | 17878.875 199 89.843593 Root MSE = 7.2116
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.female | 15.00001 5.09795 2.94 0.004 4.946132 25.05389
socst | .6247968 .0670709 9.32 0.000 .4925236 .7570701
|
female#|
c.socst |
1 | -.2047288 .0953726 -2.15 0.033 -.3928171 -.0166405
|
_cons | 17.7619 3.554993 5.00 0.000 10.75095 24.77284
------------------------------------------------------------------------------
It's possible to make model pipelines by chaining Appelpy steps with the Pandas pipe method.
def process_data(raw_df):
return (raw_df
.pipe(InteractionEncoder, {'female': ['socst']})
.transform())
def fit_model(df, y_list, X_list):
return OLS(df, y_list, X_list).fit()
The cell below retrieves the previous model_selection_stats via a Pandas pipeline.
(df_raw
.pipe(process_data)
.pipe(fit_model, ['write'], ['female', 'socst', 'female#socst'])
.model_selection_stats)
{'root_mse': 7.211611852775864, 'r_squared': 0.42986123794053965, 'r_squared_adj': 0.4211346242355479, 'aic': 1361.811865520546, 'bic': 1375.005134986738}