# import some basic libraries
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
The syntax for handling categorical predictors differs between standard regression models/two-stage models (i.e. `Lm` and `Lm2`) and multi-level models (`Lmer`) in `pymer4`. This is because formula parsing is passed to R for `Lmer` models, but handled by Python for the other models.
`Lm` and `Lm2` models use patsy to parse model formulae. Patsy is very powerful and has built-in support for categorical coding schemes: you wrap a predictor in `C()` within the model formula. Patsy can also perform some pre-processing, such as scaling and standardization, using special functions like `center()`. Here are some examples.
# import sample data and model
from pymer4.utils import get_resource_path
from pymer4.models import Lm
# IV3 is a categorical predictor with 3 levels in the sample data
df = pd.read_csv(os.path.join(get_resource_path(), 'sample_data.csv'))
# Estimate a model using Treatment contrasts (dummy-coding)
# with '1.0' as the reference level
# This is the default of the C() function
model = Lm("DV ~ C(IV3, levels=[1.0, 0.5, 1.5])", data=df)
model.fit()
Formula: DV~C(IV3,levels=[1.0,0.5,1.5])

Family: gaussian    Estimator: OLS
Std-errors: non-robust    CIs: standard 95%    Inference: parametric
Number of observations: 564    R^2: 0.004    R^2_adj: 0.001
Log-likelihood: -2728.620    AIC: 5463.241    BIC: 5476.246

Fixed effects:
|  | Estimate | 2.5_ci | 97.5_ci | SE | DF | T-stat | P-val | Sig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | 42.721 | 38.334 | 47.108 | 2.233 | 561 | 19.129 | 0.000 | *** |
| C(IV3, levels=[1.0, 0.5, 1.5])[T.0.5] | 1.463 | -4.741 | 7.667 | 3.158 | 561 | 0.463 | 0.643 |  |
| C(IV3, levels=[1.0, 0.5, 1.5])[T.1.5] | -3.419 | -9.622 | 2.785 | 3.158 | 561 | -1.082 | 0.280 |  |
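Beyond the printed summary, the fitted estimates are also available as a pandas DataFrame, which makes them easy to work with programmatically. A minimal sketch, assuming the `.coefs` attribute provided by current `pymer4` releases:

# The summary table above is also stored on the model as a DataFrame;
# '.coefs' here is assumed based on current pymer4 releases
print(model.coefs.loc['Intercept', 'Estimate'])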
# Patsy can also estimate orthogonal polynomial contrasts
# using the Poly argument to the C() function
model = Lm('DV ~ C(IV3, Poly)', data=df)
model.fit()
Formula: DV~C(IV3,Poly)

Family: gaussian    Estimator: OLS
Std-errors: non-robust    CIs: standard 95%    Inference: parametric
Number of observations: 564    R^2: 0.004    R^2_adj: 0.001
Log-likelihood: -2728.620    AIC: 5463.241    BIC: 5476.246

Fixed effects:
|  | Estimate | 2.5_ci | 97.5_ci | SE | DF | T-stat | P-val | Sig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | 42.069 | 39.537 | 44.602 | 1.289 | 561 | 32.627 | 0.000 | *** |
| C(IV3, Poly).Linear | -3.452 | -7.838 | 0.935 | 2.233 | 561 | -1.546 | 0.123 |  |
| C(IV3, Poly).Quadratic | -0.798 | -5.185 | 3.588 | 2.233 | 561 | -0.357 | 0.721 |  |
# Sum-to-zero (deviation) coding works similarly, with the Sum argument
model = Lm('DV ~ C(IV3, Sum)', data=df)
model.fit()
Formula: DV~C(IV3,Sum)

Family: gaussian    Estimator: OLS
Std-errors: non-robust    CIs: standard 95%    Inference: parametric
Number of observations: 564    R^2: 0.004    R^2_adj: 0.001
Log-likelihood: -2728.620    AIC: 5463.241    BIC: 5476.246

Fixed effects:
|  | Estimate | 2.5_ci | 97.5_ci | SE | DF | T-stat | P-val | Sig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | 42.069 | 39.537 | 44.602 | 1.289 | 561 | 32.627 | 0.000 | *** |
| C(IV3, Sum)[S.0.5] | 2.115 | -1.467 | 5.697 | 1.823 | 561 | 1.160 | 0.247 |  |
| C(IV3, Sum)[S.1.0] | 0.652 | -2.930 | 4.234 | 1.823 | 561 | 0.357 | 0.721 |  |
# Moderation with IV2, but centering IV2 first
model = Lm('DV ~ center(IV2) * C(IV3, Sum)', data=df)
model.fit()
Formula: DV~center(IV2)*C(IV3,Sum)

Family: gaussian    Estimator: OLS
Std-errors: non-robust    CIs: standard 95%    Inference: parametric
Number of observations: 564    R^2: 0.511    R^2_adj: 0.507
Log-likelihood: -2528.051    AIC: 5068.102    BIC: 5094.113

Fixed effects:
|  | Estimate | 2.5_ci | 97.5_ci | SE | DF | T-stat | P-val | Sig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | 42.051 | 40.268 | 43.833 | 0.908 | 558 | 46.329 | 0.000 | *** |
| C(IV3, Sum)[S.0.5] | 0.580 | -1.942 | 3.102 | 1.284 | 558 | 0.452 | 0.652 |  |
| C(IV3, Sum)[S.1.0] | 0.383 | -2.136 | 2.903 | 1.282 | 558 | 0.299 | 0.765 |  |
| center(IV2) | 0.746 | 0.685 | 0.807 | 0.031 | 558 | 24.012 | 0.000 | *** |
| center(IV2):C(IV3, Sum)[S.0.5] | 0.050 | -0.037 | 0.137 | 0.044 | 558 | 1.132 | 0.258 |  |
| center(IV2):C(IV3, Sum)[S.1.0] | -0.057 | -0.144 | 0.029 | 0.044 | 558 | -1.306 | 0.192 |  |
Please refer to the patsy documentation for more details when working with categorical predictors in `Lm` or `Lm2` models.
`Lmer()` models currently support handling categorical predictors in one of three ways, based on how R's `factor()` works (see the note at the end of this tutorial):

- Dummy coding (treatment contrasts) against a specified reference level
- Orthogonal polynomial contrasts across ordered factor levels
- Custom contrasts specified in the manner R expects
To make re-parameterizing models easier, factor codings are passed as a dictionary to the `factors` argument of a model's `.fit()`. This obviates the need for adjusting data-frame properties as in R. Note that this is different from the `Lm` and `Lm2` models above, which expect factor codings in their formulae (because patsy handles their formula parsing).
Each of these ways also enables you to easily compute post-hoc comparisons between factor levels, as well as interactions between continuous predictors and each factor level. See tutorial 3 for more on post-hoc tests.
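As a rough preview (a sketch only; tutorial 3 is the authoritative walkthrough, and the `post_hoc` arguments here reflect the current `pymer4` API rather than anything shown in this tutorial), computing comparisons between factor levels looks something like:

# Sketch only: assumes 'model' is an already-fitted Lmer model with IV3
# supplied via the factors argument; see tutorial 3 for details
# marginal_estimates, comparisons = model.post_hoc(marginal_vars='IV3')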
Due to recent updates in `rpy2` and `pandas`, both of which are dependencies of `pymer4`, the contrast syntax shown below will raise errors and fail to run. For the time being, you can use `rpy2==3.1.0` and `pandas>0.23,<1.0` on your own system to ensure compatibility. Thanks for your patience while I implement the necessary fixes!
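If you need to pin those versions, a one-off install along these lines should work (a sketch; adapt to your own environment and packaging tool):

# Pin the dependency versions mentioned above
# (notebook cell; drop the '!' when running in a shell)
!pip install "rpy2==3.1.0" "pandas>0.23,<1.0"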
from pymer4.models import Lmer
# We're going to fit a multi-level logistic regression using the
# dichotomous DV_l variable and the same categorical predictor (IV3)
# as before
model = Lmer('DV_l ~ IV3 + (IV3|Group)', data=df, family='binomial')
First we'll use dummy-coding/treatment contrasts with 1.0 as the reference level. This will compute two coefficients: 0.5 vs 1.0 and 1.5 vs 1.0.
# See compatibility note!
model.fit(factors={
'IV3': ['1.0', '0.5', '1.5']
})
boundary (singular) fit: see ?isSingular

Formula: DV_l~IV3+(IV3|Group)

Family: binomial    Inference: parametric
Number of observations: 564    Groups: {'Group': 47.0}
Log-likelihood: -389.003    AIC: 796.006

Random effects:

              Name    Var    Std
Group  (Intercept)  0.022  0.148
Group       IV30.5  0.060  0.246
Group       IV31.5  0.038  0.196

               IV1     IV2  Corr
Group  (Intercept)  IV30.5  -1.0
Group  (Intercept)  IV31.5  -1.0
Group       IV30.5  IV31.5   1.0

Fixed effects:
|  | Estimate | 2.5_ci | 97.5_ci | SE | OR | OR_2.5_ci | OR_97.5_ci | Prob | Prob_2.5_ci | Prob_97.5_ci | Z-stat | P-val | Sig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | -0.129 | -0.419 | 0.162 | 0.148 | 0.879 | 0.658 | 1.176 | 0.468 | 0.397 | 0.540 | -0.867 | 0.386 |  |
| IV30.5 | 0.129 | -0.283 | 0.540 | 0.210 | 1.137 | 0.753 | 1.716 | 0.532 | 0.430 | 0.632 | 0.612 | 0.541 |  |
| IV31.5 | -0.128 | -0.539 | 0.283 | 0.210 | 0.880 | 0.583 | 1.327 | 0.468 | 0.368 | 0.570 | -0.612 | 0.541 |  |
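The `OR` and `Prob` columns are simple transformations of the logit estimates, which you can verify by hand from the table above:

import numpy as np

# Odds ratios are exponentiated logit coefficients
est = 0.129               # IV30.5 estimate from the table above
odds = np.exp(est)        # ~1.137, matching the OR column
prob = odds / (1 + odds)  # ~0.532, matching the Prob column
print(odds, prob)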
Second we'll use orthogonal polynomial contrasts. This is accomplished using the `ordered=True` argument and specifying the factor levels in increasing order for the linear contrast. R will automatically compute higher-order polynomial contrasts that are orthogonal to this linear contrast. In this example, since there are 3 factor levels, this will result in two polynomial terms: a linear contrast we specify below, corresponding to 0.5 < 1.0 < 1.5, and an orthogonal quadratic contrast automatically determined by R, corresponding to 0.5 > 1.0 < 1.5.
# See compatibility note!
model.fit(factors={
'IV3': ['0.5', '1.0', '1.5']},
ordered=True
)
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular

Formula: DV_l~IV3+(IV3|Group)

Family: binomial    Inference: parametric
Number of observations: 564    Groups: {'Group': 47.0}
Log-likelihood: -389.003    AIC: 796.006

Random effects:

              Name    Var    Std
Group  (Intercept)  0.000  0.000
Group        IV3.L  0.001  0.035
Group        IV3.Q  0.032  0.180

               IV1    IV2  Corr
Group  (Intercept)  IV3.L   NaN
Group  (Intercept)  IV3.Q   NaN
Group        IV3.L  IV3.Q  -1.0

Fixed effects:
|  | Estimate | 2.5_ci | 97.5_ci | SE | OR | OR_2.5_ci | OR_97.5_ci | Prob | Prob_2.5_ci | Prob_97.5_ci | Z-stat | P-val | Sig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | -0.128 | -0.294 | 0.037 | 0.085 | 0.879 | 0.745 | 1.038 | 0.468 | 0.427 | 0.509 | -1.518 | 0.129 |  |
| IV3.L | -0.182 | -0.469 | 0.106 | 0.147 | 0.834 | 0.626 | 1.112 | 0.455 | 0.385 | 0.526 | -1.238 | 0.216 |  |
| IV3.Q | 0.000 | -0.292 | 0.292 | 0.149 | 1.000 | 0.747 | 1.339 | 0.500 | 0.428 | 0.572 | 0.001 | 1.000 |  |
`Lmer` models can also take custom factor contrasts, specified the way R expects them (see the note at the end of this tutorial for how contrasts work in R). Remember that there can be at most k-1 model terms representing any k-level factor without over-parameterizing the model. If you specify a custom contrast, R will generate a set of orthogonal contrasts for the rest of your model terms.
# See compatibility note!
# Compare level '1.0' to the mean of levels '0.5' and '1.5'
# and let R determine the second contrast orthogonal to it
model.fit(factors={
'IV3': {'1.0': 1, '0.5': -.5, '1.5': -.5}
})
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular

Formula: DV_l~IV3+(IV3|Group)

Family: binomial    Inference: parametric
Number of observations: 564    Groups: {'Group': 47.0}
Log-likelihood: -389.003    AIC: 796.006

Random effects:

              Name    Var    Std
Group  (Intercept)  0.022  0.148
Group       IV30.5  0.060  0.246
Group       IV31.5  0.038  0.196

               IV1     IV2  Corr
Group  (Intercept)  IV30.5  -1.0
Group  (Intercept)  IV31.5  -1.0
Group       IV30.5  IV31.5   1.0

Fixed effects:
|  | Estimate | 2.5_ci | 97.5_ci | SE | OR | OR_2.5_ci | OR_97.5_ci | Prob | Prob_2.5_ci | Prob_97.5_ci | Z-stat | P-val | Sig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | -0.129 | -0.419 | 0.162 | 0.148 | 0.879 | 0.658 | 1.176 | 0.468 | 0.397 | 0.540 | -0.867 | 0.386 |  |
| IV30.5 | 0.129 | -0.283 | 0.540 | 0.210 | 1.137 | 0.753 | 1.716 | 0.532 | 0.430 | 0.632 | 0.612 | 0.541 |  |
| IV31.5 | -0.128 | -0.539 | 0.283 | 0.210 | 0.880 | 0.583 | 1.327 | 0.468 | 0.368 | 0.570 | -0.612 | 0.541 |  |
Another option available to you is fitting a model with only your desired contrast(s), rather than a full set of k-1 contrasts. Contrary to how statistics is usually taught, you don't ever have to include a full set of k-1 contrasts for a k-level factor! The upside is that you won't need to rely on R to compute anything for you (aside from the model fit), and your model will have exactly as many terms as contrasts you desire, giving you complete control. The downside is that post-hoc tests will no longer be available (see tutorial 3 for more information on post-hoc tests), but it's unlikely you're doing post-hoc tests if you're computing a specific subset of contrasts anyway. This is also a useful approach if you don't want to use patsy's formula syntax with `Lm` and `Lm2`, as noted above.
This can be accomplished by creating new columns in your dataframe that encode specific hypotheses, which is trivial to do with pandas' `map` and `assign` methods. For example, here we manually compute a linear contrast by creating a new column in our dataframe and treating it as a continuous variable.
# Create a new column in the dataframe with a custom (linear) contrast
df = df.assign(
IV3_custom_lin=df['IV3'].map({
0.5: -1,
1.0: 0,
1.5: 1
})
)
df.head()
|  | Group | IV1 | DV_l | DV | IV2 | IV3 | IV3_custom_lin |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 20.0 | 0 | 7.936508 | 4.563492 | 0.5 | -1 |
| 1 | 1 | 20.0 | 0 | 15.277778 | 0.000000 | 1.0 | 0 |
| 2 | 1 | 20.0 | 1 | 0.000000 | 0.000000 | 1.5 | 1 |
| 3 | 1 | 20.0 | 1 | 9.523810 | 0.000000 | 0.5 | -1 |
| 4 | 1 | 12.5 | 0 | 0.000000 | 0.000000 | 1.0 | 0 |
Now we can use this variable as a continuous predictor without the need for the `factors` argument. Notice how the z-stat and p-value of the estimate are the same as for the linear polynomial contrast estimated above. The coefficients differ in scale only because R uses [~-0.707, ~0, ~0.707] for its polynomial contrasts rather than [-1, 0, 1] as we did (see the quick check after the output below).
# Estimate model
model = Lmer('DV_l ~ IV3_custom_lin + (IV3_custom_lin|Group)', data=df, family='binomial')
model.fit()
boundary (singular) fit: see ?isSingular

Formula: DV_l~IV3_custom_lin+(IV3_custom_lin|Group)

Family: binomial    Inference: parametric
Number of observations: 564    Groups: {'Group': 47.0}
Log-likelihood: -389.016    AIC: 788.031

Random effects:

                  Name  Var  Std
Group      (Intercept)  0.0  0.0
Group   IV3_custom_lin  0.0  0.0

               IV1             IV2  Corr
Group  (Intercept)  IV3_custom_lin   NaN

Fixed effects:
|  | Estimate | 2.5_ci | 97.5_ci | SE | OR | OR_2.5_ci | OR_97.5_ci | Prob | Prob_2.5_ci | Prob_97.5_ci | Z-stat | P-val | Sig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | -0.128 | -0.294 | 0.037 | 0.085 | 0.88 | 0.745 | 1.038 | 0.468 | 0.427 | 0.509 | -1.517 | 0.129 |  |
| IV3_custom_lin | -0.128 | -0.331 | 0.075 | 0.104 | 0.88 | 0.718 | 1.077 | 0.468 | 0.418 | 0.519 | -1.239 | 0.215 |  |
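To make the scale difference concrete: R's linear polynomial contrast is just the unit-normalized version of our [-1, 0, 1], so the two estimates differ by a factor of sqrt(2). A quick check:

import numpy as np

# R unit-normalizes its polynomial contrast weights
lin = np.array([-1.0, 0.0, 1.0])
print(lin / np.linalg.norm(lin))  # [-0.707, 0., 0.707]

# Rescaling our estimate recovers the polynomial estimate from earlier:
# -0.128 * sqrt(2) is ~ -0.182
print(-0.128 * np.sqrt(2))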
This is just for folks curious about how contrasts in R work.

Specifying multiple custom contrasts in R has always been a point of confusion amongst users. This is because the `contrasts()` command in R doesn't actually expect contrast weights (i.e. a design matrix) as one would intuit. Rather, it is made for generating contrast coding schemes, which are the inverse of the contrast weight matrix. For a longer explanation with examples see this reference and this reference. For these situations `pymer4` offers a few utility functions in `pymer4.utils` to convert between these matrix types if desired: `R2con()` and `con2R()`.
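A minimal sketch of the round trip, assuming `con2R()` and `R2con()` accept numpy arrays of contrast weights with rows as contrasts (check the `pymer4.utils` documentation for the exact signatures in your version):

import numpy as np
from pymer4.utils import con2R, R2con

# Desired contrast weights for a 3-level factor:
# compare the middle level against the mean of the other two
weights = np.array([[-0.5, 1.0, -0.5]])

# Convert weights into the coding-scheme matrix R's contrasts() expects
coding = con2R(weights)

# Invert back to recover interpretable contrast weights
recovered = R2con(coding)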