%matplotlib inline
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import edward as ed
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from edward.models import Normal
plt.style.use('ggplot')
ed.set_seed(42)
We use the InstEval data set from the popular lme4 R package (Bates, Mächler, Bolker, & Walker, 2015), located here. It is a data set of instructor evaluation ratings, where the inputs (covariates) include categories such as students and departments, and our response variable of interest is the instructor evaluation rating.
# s - students - 1:2972
# d - instructors - codes that need to be remapped
# dept also needs to be remapped
data = pd.read_csv('../examples/data/insteval.csv')
data['dcodes'] = data['d'].astype('category').cat.codes
data['deptcodes'] = data['dept'].astype('category').cat.codes
data['s'] = data['s'] - 1
train = data.sample(frac=0.8)
test = data.drop(train.index)
train.head()
| | Unnamed: 0 | s | d | studage | lectage | service | dept | y | dcodes | deptcodes |
|---|---|---|---|---|---|---|---|---|---|---|
| 66702 | 66703 | 2714 | 474 | 8 | 5 | 1 | 1 | 4 | 248 | 0 |
| 51671 | 51672 | 2074 | 102 | 8 | 1 | 1 | 1 | 2 | 55 | 0 |
| 35762 | 35763 | 1456 | 139 | 6 | 4 | 0 | 12 | 2 | 73 | 11 |
| 43777 | 43778 | 1772 | 2096 | 8 | 3 | 0 | 10 | 4 | 1092 | 9 |
| 4788 | 4789 | 178 | 554 | 6 | 6 | 1 | 6 | 5 | 290 | 5 |
In the code, we denote:

- students as s
- instructors as d
- departments as dept
- service as service
s_train = train['s'].values.astype(int)
d_train = train['dcodes'].values.astype(int)
dept_train = train['deptcodes'].values.astype(int)
y_train = train['y'].values.astype(float)
service_train = train['service'].values.astype(int)
n_obs_train = train.shape[0]
s_test = test['s'].values.astype(int)
d_test = test['dcodes'].values.astype(int)
dept_test = test['deptcodes'].values.astype(int)
y_test = test['y'].values.astype(float)
service_test = test['service'].values.astype(int)
n_obs_test = test.shape[0]
n_s = 2972 # number of students
n_d = 1128 # number of instructors
n_dept = 14 # number of departments
n_obs = train.shape[0] # number of observations
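If you prefer not to hard-code these counts, they can equivalently be derived from the full data frame (a sketch; it assumes the code columns built above, where s and the remapped codes are 0-indexed):

# Data-derived group counts, kept in sync with the data automatically.
n_s = data['s'].nunique()             # 2972 students
n_d = data['dcodes'].nunique()        # 1128 instructors
n_dept = data['deptcodes'].nunique()  # 14 departments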
Standard linear regression assumes the data points are independent and share a single slope and intercept. In our setting, the observations come from groups which may have varying slopes and intercepts, so we'd like to build a model that can capture this behavior (Gelman & Hill, 2006).
Examples of this phenomenon:

- Observations from a given student are not independent of each other. Rather, some students may systematically give low (or high) lecture ratings.
- Observations from a given teacher are not independent of each other. We expect good teachers to get generally good ratings and bad teachers to get generally bad ratings.
- Observations from a given department are not independent of each other. One department may generally have dry material and thus be rated lower than others (the quick check below illustrates this).
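As a quick, informal look at this department-level variation (a sketch over the training split loaded above):

# Mean rating per department; the spread across departments hints at
# group-level effects worth modeling.
train.groupby('dept')['y'].mean()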
Typical linear regression takes the form
\begin{equation*}
\mathbf{y} = \mathbf{X}\beta + \epsilon,
\end{equation*}
where $\mathbf{X}$ corresponds to fixed effects with coefficients $\beta$ and $\epsilon$ corresponds to random noise, $\epsilon\sim\mathcal{N}(\mathbf{0}, \mathbf{I})$.

In a linear mixed effects model, we add an additional term $\mathbf{Z}\eta$, where $\mathbf{Z}$ corresponds to random effects with coefficients $\eta$. The model takes the form
\begin{align*}
\eta &\sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}), \\
\mathbf{y} &= \mathbf{X}\beta + \mathbf{Z}\eta + \epsilon.
\end{align*}
Given data, the goal is to infer $\beta$, $\eta$, and $\sigma^2$, where $\beta$ are model parameters ("fixed effects"), $\eta$ are latent variables ("random effects"), and $\sigma^2$ is a variance component parameter.
Because the random effects have mean 0, the data's mean is captured by $\mathbf{X}\beta$. The random effects component $\mathbf{Z}\eta$ captures variations in the data (e.g. Instructor #54 is rated 1.4 points higher than the mean).
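Concretely, in our setting the model we build below decomposes each rating as
\begin{equation*}
y_n = \mu + \beta\,\text{service}_n + \eta^{(s)}_{s[n]} + \eta^{(d)}_{d[n]} + \eta^{(dept)}_{dept[n]} + \epsilon_n,
\end{equation*}
where $s[n]$, $d[n]$, and $dept[n]$ index the student, instructor, and department of the $n$-th observation, and each random effect vector $\eta^{(\cdot)}$ has its own variance component.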
A natural question is the difference between fixed and random effects. A fixed effect is an effect that is constant for a given population; a random effect is an effect that varies for a given population (i.e., it may be constant within subpopulations but vary across the overall population). We illustrate below in our example:

- service as the fixed effect. It is a binary covariate corresponding to whether the lecture belongs to the lecturer's main department. No matter how much additional data we collect, it can only take on the values $0$ and $1$.
- students, teachers, and departments as the random effects. Given more observations from the population of instructor evaluation ratings, we may be looking at new students, teachers, or departments.
In the syntax of R's lme4 package (Bates et al., 2015), the model can be summarized as
y ~ 1 + (1|students) + (1|instructor) + (1|dept) + service
where 1 denotes an intercept term, (1|x) denotes a random effect for x, and x denotes a fixed effect.
# Set up placeholders for the data inputs.
s_ph = tf.placeholder(tf.int32, [None])
d_ph = tf.placeholder(tf.int32, [None])
dept_ph = tf.placeholder(tf.int32, [None])
service_ph = tf.placeholder(tf.float32, [None])
# Set up fixed effects.
mu = tf.Variable(tf.random_normal([]))
service = tf.Variable(tf.random_normal([]))
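# Set up variance components; the sqrt(exp(.)) transform of an
# unconstrained variable keeps each standard deviation positive.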
sigma_s = tf.sqrt(tf.exp(tf.Variable(tf.random_normal([]))))
sigma_d = tf.sqrt(tf.exp(tf.Variable(tf.random_normal([]))))
sigma_dept = tf.sqrt(tf.exp(tf.Variable(tf.random_normal([]))))
# Set up random effects.
eta_s = Normal(loc=tf.zeros(n_s), scale=sigma_s * tf.ones(n_s))
eta_d = Normal(loc=tf.zeros(n_d), scale=sigma_d * tf.ones(n_d))
eta_dept = Normal(loc=tf.zeros(n_dept), scale=sigma_dept * tf.ones(n_dept))
yhat = tf.gather(eta_s, s_ph) + \
tf.gather(eta_d, d_ph) + \
tf.gather(eta_dept, dept_ph) + \
mu + service * service_ph
y = Normal(loc=yhat, scale=tf.ones(n_obs))
Given data, we aim to infer the model's fixed and random effects. In this analysis, we use variational inference with the $\text{KL}(q\|p)$ divergence measure. We specify fully factorized normal approximations for the random effects and pass in all training data for inference. Under this algorithm, the fixed effects are estimated with a variational EM scheme.
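As a reminder (this is standard variational inference, not notation specific to Edward), minimizing $\text{KL}(q\|p)$ corresponds to maximizing the evidence lower bound
\begin{equation*}
\mathcal{L}(q) = \mathbb{E}_{q(\eta)}\big[\log p(\mathbf{y}\mid \eta, \beta, \sigma^2)\big] - \text{KL}\big(q(\eta)\,\|\,p(\eta)\big),
\end{equation*}
where $\eta$ collects all the random effects; the fixed effects and variance components (the tf.Variables above) are optimized as point estimates alongside the variational parameters.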
q_eta_s = Normal(
loc=tf.Variable(tf.random_normal([n_s])),
scale=tf.nn.softplus(tf.Variable(tf.random_normal([n_s]))))
q_eta_d = Normal(
loc=tf.Variable(tf.random_normal([n_d])),
scale=tf.nn.softplus(tf.Variable(tf.random_normal([n_d]))))
q_eta_dept = Normal(
loc=tf.Variable(tf.random_normal([n_dept])),
scale=tf.nn.softplus(tf.Variable(tf.random_normal([n_dept]))))
latent_vars = {
eta_s: q_eta_s,
eta_d: q_eta_d,
eta_dept: q_eta_dept}
data = {
y: y_train,
s_ph: s_train,
d_ph: d_train,
dept_ph: dept_train,
service_ph: service_train}
inference = ed.KLqp(latent_vars, data)
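When no per-iteration diagnostics are needed, Edward can run the whole procedure in one call; below we instead drive the updates manually so we can plot residuals as training progresses.

# Equivalent one-liner alternative to the manual loop below:
# inference.run(n_print=2000, n_iter=10000)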
One way to critique the fitted model is a residual plot, i.e., a plot of the difference between the predicted value and the observed value for each data point. Below we manually run inference, initializing the algorithm and performing individual updates within a loop. We form residual plots as the algorithm progresses. This helps us examine how the algorithm proceeds to infer the random and fixed effects from data.
To form residuals, we first make predictions on test data. We do this by copying yhat defined in the model and replacing its dependence on the random effects with their inferred means. During the algorithm, we evaluate the predictions, feeding in the test inputs.

We have also fit the same model (y ~ service + (1|dept) + (1|s) + (1|d)) in lme4, fit on the entire InstEval data set. We have saved its random effect estimates and will compare them to our learned parameters.
yhat_test = ed.copy(yhat, {
eta_s: q_eta_s.mean(),
eta_d: q_eta_d.mean(),
eta_dept: q_eta_dept.mean()})
inference.initialize(n_print=2000, n_iter=10000)
tf.global_variables_initializer().run()
for _ in range(inference.n_iter):
  # Update and print progress of algorithm.
  info_dict = inference.update()
  inference.print_progress(info_dict)

  t = info_dict['t']
  if t == 1 or t % inference.n_print == 0:
    # Make predictions on test data.
    yhat_vals = yhat_test.eval(feed_dict={
        s_ph: s_test,
        d_ph: d_test,
        dept_ph: dept_test,
        service_ph: service_test})

    # Form residual plot.
    plt.title("Residuals for Predicted Ratings on Test Set")
    plt.xlim(-4, 4)
    plt.ylim(0, 800)
    plt.hist(yhat_vals - y_test, 75)
    plt.show()
1/10000 [ 0%] ETA: 17377s | Loss: 1378558.375
2000/10000 [ 20%] ██████ ETA: 40s | Loss: 97301.945
4000/10000 [ 40%] ████████████ ETA: 27s | Loss: 97092.609
6000/10000 [ 60%] ██████████████████ ETA: 17s | Loss: 97104.344
8000/10000 [ 80%] ████████████████████████ ETA: 8s | Loss: 97135.047
10000/10000 [100%] ██████████████████████████████ Elapsed: 43s | Loss: 97130.734
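As a complementary numeric check (a small sketch using yhat_vals from the final iteration of the loop above):

import numpy as np

# Summarize the final test-set residuals numerically.
residuals = yhat_vals - y_test
print("mean: {:.3f}  std: {:.3f}  MSE: {:.3f}".format(
    residuals.mean(), residuals.std(), np.mean(residuals ** 2)))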
Above, we described a method for diagnosing the fit of the model via residual plots; see the final plot produced at the end of training. The residuals appear normally distributed with mean 0, which is a good sanity check for the model.
We can also compare our learned parameters to those estimated by R's lme4.
student_effects_lme4 = pd.read_csv('../examples/data/insteval_student_ranefs_r.csv')
instructor_effects_lme4 = pd.read_csv('../examples/data/insteval_instructor_ranefs_r.csv')
dept_effects_lme4 = pd.read_csv('../examples/data/insteval_dept_ranefs_r.csv')
student_effects_edward = q_eta_s.mean().eval()
instructor_effects_edward = q_eta_d.mean().eval()
dept_effects_edward = q_eta_dept.mean().eval()
plt.title("Student Effects Comparison")
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.xlabel("Student Effects from lme4")
plt.ylabel("Student Effects from edward")
plt.scatter(student_effects_lme4["(Intercept)"],
            student_effects_edward,
            alpha=0.25)
plt.show()
plt.title("Instructor Effects Comparison")
plt.xlim(-1.5, 1.5)
plt.ylim(-1.5, 1.5)
plt.xlabel("Instructor Effects from lme4")
plt.ylabel("Instructor Effects from edward")
plt.scatter(instructor_effects_lme4["(Intercept)"],
            instructor_effects_edward,
            alpha=0.25)
plt.show()
Great! Our estimates for both student and instructor effects seem to match those from lme4 closely. We have set up a slightly different model here (for example, our overall mean is regularized, as are our variances for the student, department, and instructor effects, which is not true of lme4's model), and we use a different inference method, so we should not expect to find exactly the same parameters as lme4. But it is reassuring that they match up closely!
# Add in the intercepts: 3.28259 is the intercept estimated by lme4,
# and mu is the intercept learned by edward.
dept_effects_and_intercept_lme4 = 3.28259 + dept_effects_lme4["(Intercept)"]
dept_effects_and_intercept_edward = mu.eval() + dept_effects_edward
plt.title("Departmental Effects Comparison")
plt.xlim(3.0, 3.5)
plt.ylim(3.0, 3.5)
plt.xlabel("Department Effects from lme4")
plt.ylabel("Department Effects from edward")
# Scale point sizes by the number of training observations per department;
# sort_index aligns the counts with the department-ordered effects.
plt.scatter(dept_effects_and_intercept_lme4,
            dept_effects_and_intercept_edward,
            s=0.01 * train.dept.value_counts().sort_index())
plt.show()
Our department effects do not match up nearly as well with those from lme4. There are likely several reasons for this:

- We regularize the overall mean, while lme4 doesn't, which causes the edward model to put some of the intercept into the department effects, which are allowed to vary more widely since we learn a variance.
- We train on a random 80% subset of the data, while the lme4 estimate uses the whole InstEval data set.
We thank Mayank Agrawal for writing the initial version of this tutorial.