Generalized Linear Mixed‐effects Model Playground

or, the many ways to fit a GLMM in Python

A comparison among:
StatsModels
Theano
PyMC3 (based on Theano)
TensorFlow
Stan and PyStan
Keras with a TensorFlow backend
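
All of the models below share the same structure: a linear mixed model with fixed effects for the experimental conditions and a random intercept per subject. In notation (mine, stated here for reference):

$$y = X\beta + Zb + \epsilon, \qquad b \sim \mathcal{N}(0, \sigma_b^2 I), \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

where $X$ is the fixed-effects design matrix, $Z$ the matrix of subject indicators, $\beta$ the fixed effects, and $b$ the per-subject random intercepts. Each library is just a different way of estimating $\beta$, $b$, and the two variances.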

In [1]:
import numpy as np
#import pandas as pd
import seaborn as sns
from theano import tensor as tt
%pylab inline
%config InlineBackend.figure_format = 'retina'
%qtconsole --colors=linux

%load_ext watermark
Populating the interactive namespace from numpy and matplotlib
In [2]:
%watermark -v -m -p numpy,pandas,seaborn,statsmodels,theano,pymc3,tensorflow,keras,matplotlib
CPython 3.5.2
IPython 6.2.1

numpy 1.14.0
pandas 0.21.0
seaborn 0.8.1
statsmodels 0.8.0.dev0+abb5996
theano 1.0.0
pymc3 3.2
tensorflow 1.4.1
keras 2.1.2
matplotlib 2.1.0+224.g95805d2

compiler   : GCC 5.4.0 20160609
system     : Linux
release    : 4.4.0-104-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit
Using TensorFlow backend.

Prepare data and quick visualization

In [3]:
# load data
import pandas as pd
Tbl_beh  = pd.read_csv('./behavioral_data.txt', delimiter='\t')
Tbl_beh["subj"]  = Tbl_beh["subj"].astype('category')
#%% visualize the data
# Draw a nested violinplot and split the violins for easier comparison
sns.set(style="ticks", palette="muted", color_codes=True)
plt.figure()
Tbl_beh['cbcond']  = pd.Series(Tbl_beh['chimera'] + '-' + Tbl_beh['identity'], index=Tbl_beh.index)
sns.violinplot(y="cbcond", x="rt", hue="group", data=Tbl_beh, split=True,
               inner="quart", palette={"cp": "b", "cg": "y"})
sns.despine(left=True)

Tbl_beh is the raw trial-level data table.

In [4]:
Tbl_beh.head(5)
Out[4]:
   subj  trial chimera identity orientation      rt  acc group     cbcond
0     1      1     L_R    other     upright   754.6    0    cp  L_R-other
1     1      2     L_L    other     upright  1036.1    1    cp  L_L-other
2     1      3     R_R     self     upright   433.0    1    cp   R_R-self
3     1      4     L_R     self     upright   463.4    1    cp   L_R-self
4     1      5     L_L     self     upright   483.3    1    cp   L_L-self
In [5]:
#%% Compute the conditional mean of dataset
condi_sel = ['subj', 'chimera', 'identity', 'orientation']
tblmean = pd.DataFrame({'Nt' : Tbl_beh.groupby( condi_sel ).size()}).reset_index()
tblmean["subj"] = tblmean["subj"].astype('category')
tblmean['rt']  = np.asarray(Tbl_beh.groupby(condi_sel)['rt'].mean())
tblmean['ACC'] = np.asarray(Tbl_beh.groupby(condi_sel)['acc'].mean())
# note: .all() on this object column happens to return the (constant) group
# label within each cell; .first() would be the clearer choice
tblmean['group']= np.asarray(Tbl_beh.groupby(condi_sel)['group'].all())

tblmean is the summary data table.
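
As an aside, the same summary table can be built with .first() instead of the .all() trick above (a sketch, not the original code):

g = Tbl_beh.groupby(condi_sel)
tblmean2 = pd.DataFrame({'Nt': g.size()}).reset_index()
tblmean2['rt']    = np.asarray(g['rt'].mean())
tblmean2['ACC']   = np.asarray(g['acc'].mean())
tblmean2['group'] = np.asarray(g['group'].first())  # label is constant within each cell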

In [6]:
tblmean['cbcond']  = pd.Series(tblmean['chimera'] + '-' + tblmean['identity'], 
                index=tblmean.index)

## boxplot + scatter plot to show accuracy
#ax = sns.boxplot(y="cbcond", x="acc", data=tbltest,
#                 whis=np.inf, color="c")
## Add in points to show each observation
#sns.stripplot(y="cbcond", x="acc", data=tbltest,
#              jitter=True, size=3, color=".3", linewidth=0)
plt.figure()
g1 = sns.violinplot(y="cbcond", x="rt", hue="group", data=tblmean, split=True,
               inner="quart", palette={"cp": "b", "cg": "y"})
g1.set(xlim=(0, 3000))
# trim the axis spines to the outermost ticks
sns.despine(trim=True)
In [7]:
tblmean.head(5)
Out[7]:
  subj chimera identity orientation  Nt      rt  ACC group     cbcond
0    1     L_L    other    inverted  10  713.37  0.8    cp  L_L-other
1    1     L_L    other     upright  10  749.55  0.8    cp  L_L-other
2    1     L_L     self    inverted  10  538.64  0.6    cp   L_L-self
3    1     L_L     self     upright  10  629.01  0.9    cp   L_L-self
4    1     L_R    other    inverted  10  784.70  1.0    cp  L_R-other

Here I use the table of condition means (tblmean) as the data for the models below: it is much smaller than the trial-level table, so everything runs faster.

In [8]:
tbltest = tblmean
tbltest.shape
Out[8]:
(320, 9)
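
A quick sanity check (a sketch): 320 rows over 20 subjects should mean each subject contributes exactly 16 condition means, matching the group sizes reported by the mixed model below.

tbltest.groupby('subj').size().unique()  # expected: array([16])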

Using StatsModels to fit a linear mixed model of reaction time

In [9]:
import statsmodels.formula.api as smf
from patsy import dmatrices
formula = "rt ~ group*orientation*identity"
#formula = "rt ~ -1 + cbcond"
md  = smf.mixedlm(formula, tbltest, groups=tbltest["subj"])
mdf = md.fit()
print(mdf.summary())

fe_params = pd.DataFrame(mdf.fe_params,columns=['LMM'])
random_effects = pd.DataFrame(mdf.random_effects)
random_effects = random_effects.transpose()
random_effects = random_effects.rename(index=str, columns={'groups': 'LMM'})

#%% Generate Design Matrix for later use
Y, X   = dmatrices(formula, data=tbltest, return_type='matrix')
Terms  = X.design_info.column_names
_, Z   = dmatrices('rt ~ -1+subj', data=tbltest, return_type='matrix')
X      = np.asarray(X) # fixed-effects design matrix
Z      = np.asarray(Z) # random-effects design matrix (one indicator column per subject)
Y      = np.asarray(Y).flatten()
nfixed = np.shape(X)
nrandm = np.shape(Z)
                               Mixed Linear Model Regression Results
===================================================================================================
Model:                           MixedLM               Dependent Variable:               rt        
No. Observations:                320                   Method:                           REML      
No. Groups:                      20                    Scale:                            4502.8886 
Min. group size:                 16                    Likelihood:                       -1796.2120
Max. group size:                 16                    Converged:                        Yes       
Mean group size:                 16.0                                                              
---------------------------------------------------------------------------------------------------
                                                     Coef.   Std.Err.   z    P>|z|  [0.025   0.975]
---------------------------------------------------------------------------------------------------
Intercept                                            689.538   21.918 31.459 0.000  646.579 732.497
group[T.cp]                                          151.113   37.049  4.079 0.000   78.498 223.727
orientation[T.upright]                               -64.602   13.160 -4.909 0.000  -90.395 -38.808
identity[T.self]                                     -17.581   13.160 -1.336 0.182  -43.374   8.212
group[T.cp]:orientation[T.upright]                   -87.712   22.245 -3.943 0.000 -131.311 -44.113
group[T.cp]:identity[T.self]                         -67.411   22.245 -3.030 0.002 -111.010 -23.812
orientation[T.upright]:identity[T.self]              -20.020   18.611 -1.076 0.282  -56.498  16.457
group[T.cp]:orientation[T.upright]:identity[T.self]   36.768   31.459  1.169 0.242  -24.890  98.426
groups RE                                           5119.673   27.637                              
===================================================================================================
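
The two variance components also live on the fitted results object (standard MixedLMResults attributes):

print(mdf.cov_re)  # random-intercept variance (the 'groups RE' row above)
print(mdf.scale)   # residual variance (the 'Scale' entry above)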

A helper function for plotting

In [10]:
#%% plotting function
def plotfitted(fe_params=fe_params,random_effects=random_effects,X=X,Z=Z,Y=Y):
    plt.figure(figsize=(18,9))
    ax1 = plt.subplot2grid((2,2), (0, 0))
    ax2 = plt.subplot2grid((2,2), (0, 1))
    ax3 = plt.subplot2grid((2,2), (1, 0), colspan=2)
    
    fe_params.plot(ax=ax1)
    random_effects.plot(ax=ax2)
    
    ax3.plot(Y.flatten(),'o',color='k',label = 'Observed', alpha=.25)
    for iname in fe_params.columns:
        fitted = np.dot(X,fe_params[iname])+np.dot(Z,random_effects[iname]).flatten()
        print("The MSE of "+iname+ " is " + str(np.mean(np.square(Y.flatten()-fitted))))
        ax3.plot(fitted,lw=1,label = iname, alpha=.5)
    ax3.legend(loc=0)
    #plt.ylim([0,5])
    plt.show()

plotfitted(fe_params=fe_params,random_effects=random_effects,X=X,Z=Z,Y=Y)
The MSE of LMM is 4150.212099971669

Using PyMC3

In [12]:
beta0     = np.linalg.lstsq(X, Y, rcond=None)  # OLS solution, kept as a reference starting point

# column indices of the design matrices; only meaningful under one-hot/cell-means
# coding, and used only by the commented-out indexing version of mu_est below
fixedpred = np.argmax(X, axis=1)
randmpred = np.argmax(Z, axis=1)

con  = tt.constant(fixedpred)
sbj  = tt.constant(randmpred)
import pymc3 as pm
with pm.Model() as glmm1:
    # Fixed effect
    beta = pm.Normal('beta', mu=0., sd=100., shape=(nfixed[1]))
    # random effect
    s    = pm.HalfCauchy('s', 10.)
    b    = pm.Normal('b', mu=0., sd=s, shape=(nrandm[1]))
    eps  = pm.HalfCauchy('eps', 5.)
    
    #mu_est = pm.Deterministic('mu_est',beta[con] + b[sbj])
    mu_est = pm.Deterministic('mu_est', tt.dot(X,beta)+tt.dot(Z,b))
    RT = pm.Normal('RT', mu_est, eps, observed=Y)
    
    trace = pm.sample(1000, tune=1000)
    
pm.traceplot(trace, varnames=['beta', 'b', 's'])
plt.show()

fixed_pymc = pm.summary(trace, varnames=['beta'])
randm_pymc = pm.summary(trace, varnames=['b'])
fe_params['PyMC'] = pd.Series(np.asarray(fixed_pymc['mean']), index=fe_params.index)
random_effects['PyMC'] = pd.Series(np.asarray(randm_pymc['mean']),index=random_effects.index)
print(fe_params)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [eps_log__, b, s_log__, beta]
100%|██████████| 2000/2000 [00:10<00:00, 194.41it/s]
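
With the PyMC3 posterior means added as new columns, the same plotting helper from above can redraw the comparison and print the MSE of each fit:

plotfitted(fe_params=fe_params, random_effects=random_effects, X=X, Z=Z, Y=Y)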