import numpy as np
import scipy as sp
import pandas as pd
import patsy
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt
import itertools
%matplotlib inline
sns.set_style("white")
Recall from last week's lecture that the general form of the GLM is
$y = f(\mathbf{X} \beta) + \epsilon$
All the models we've considered up until now are linear models: the link function $f()$ is the identity --- that is, it doesn't change $\mathbf{X} \beta$ at all, and the noise $\epsilon$ is assumed to be Gaussian.
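As a quick sanity check of this framing, a GLM with the Gaussian family and identity link should recover exactly the same coefficients as ordinary least squares. A minimal sketch with simulated (hypothetical) data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: a linear relationship with Gaussian noise
rng = np.random.RandomState(0)
df = pd.DataFrame({'x': np.linspace(0, 10, 50)})
df['y'] = 2.0 + 0.5 * df['x'] + rng.normal(scale=1.0, size=50)

# with an identity link and Gaussian errors, the GLM reduces to OLS
fit_glm = smf.glm('y ~ x', df).fit()
fit_ols = smf.ols('y ~ x', df).fit()
print(np.allclose(fit_glm.params.values, fit_ols.params.values))  # True
```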
But for many response variables, linear predictions don't make a lot of sense. Consider one of the example datasets we've given you: "g-forces" (from here). From the description:
Military pilots sometimes black out when their brains are deprived of oxygen due to G-forces during violent maneuvers.
Glaister and Miller (1990) produced similar symptoms by exposing volunteers’ lower bodies to negative air pressure, likewise decreasing oxygen to the brain. The data lists the subjects' ages and whether they showed syncopal blackout related signs (pallor, sweating, slow heartbeat, unconsciousness) during an 18 minute period.
So Glaister and Miller (1990) wanted to know whether the propensity to black out was related to the subject's age. They tested this in 8 people. Let's import the data:
gf = pd.read_csv('../Datasets/gforces.txt', sep='\t')
# make all columns lower case:
gf.columns = [c.lower() for c in gf.columns]
print(gf)
  subject  age  signs
0      JW   39      0
1      JM   42      1
2      DT   20      0
3      LK   37      1
4      JK   20      1
5      MK   21      0
6      FP   41      1
7      DG   52      1
Note that the response variable we're trying to predict (signs) takes on only one of two possible values: either the subject showed blackout signs (y = 1) or did not (y = 0).
Aside: the numerical values 0 and 1 are used by convention and for mathematical convenience. Think of them as coding the presence of something (1) and the absence of something (0).
What happens if we fit an ordinary (Gaussian error) linear regression to these data?
fit_lm = smf.glm('signs ~ age', gf).fit()
print(fit_lm.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  signs   No. Observations:                    8
Model:                            GLM   Df Residuals:                        6
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                  0.234334625323
Method:                          IRLS   Log-Likelihood:                -4.3968
Date:                Tue, 24 Feb 2015   Deviance:                       1.4060
Time:                        20:39:09   Pearson chi2:                     1.41
No. Iterations:                     4
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.0998      0.540     -0.185      0.853        -1.159     0.959
age            0.0213      0.015      1.415      0.157        -0.008     0.051
==============================================================================
# Let's have a look at the model predictions
sns.regplot('age', 'signs', gf)
sns.despine(offset=10);
Notice how the data are only 0s or 1s, yet our model happily predicts values outside this range. Moreover, our model expects the errors to be Gaussian, and they clearly won't be. This linear model is therefore not appropriate for these data.
What we want instead is to replace the identity link function $f()$ (under which our prediction is just $\mathbf{X} \beta$ with Gaussian errors) with a link that transforms the linear predictor into a range appropriate for the data we have.
Our linear predictor can theoretically range from $-\infty$ to $+\infty$. We want something that maps this range into the range 0--1. There are a number of options, but the most common is the logistic function:
$y = \frac{1}{1 + e^{-x}}$
Let's plot what this does:
def logit(x):
    # note: despite its name, this computes the logistic (inverse-logit) function
    return 1. / (1. + np.exp(-x))
x = np.linspace(-5, 5) # imagine this is the linear predictor (inner product of X and beta)
y = logit(x)
plt.plot(x, y)
sns.despine(offset=10)
plt.vlines(0, 0, 1, linestyles='--')
plt.hlines(0.5, -5, 5, linestyles='--')
plt.yticks([0, 0.5, 1])
plt.xlabel('Linear predictor value')
plt.ylabel('Transformed value');
Notice that as the linear predictor approaches negative infinity, its transformed value approaches zero but never drops below it. Similarly, as the linear predictor approaches positive infinity, its transformed value approaches one but never exceeds it:
logit(-np.inf)
0.0
logit(np.inf)
1.0
The logit transform allows us to map our linear predictor into a bounded range. Note also that when the linear predictor is zero, the transformed value is 0.5.
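A note on naming: strictly speaking, the S-shaped function plotted above is the logistic function. Its inverse, the log-odds or logit, is the link function proper -- it maps the bounded mean back onto the unbounded linear predictor. The two are inverses of one another:

$\mathrm{logit}(p) = \ln\frac{p}{1 - p}, \qquad \mathrm{logistic}(x) = \frac{1}{1 + e^{-x}}, \qquad \mathrm{logit}\big(\mathrm{logistic}(x)\big) = x$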
How can we now use this transformed value to predict dichotomous responses?
Since the response variable is dichotomous (either "shows signs" or "does not show signs"), an appropriate probability distribution to describe the data is the binomial distribution. Just as the Normal distribution returns a probability density over a continuous dimension given a mean and standard deviation, the binomial distribution gives a probability mass over a discrete dimension (number of "successes"), given a number of "trials" and a probability of success, $p$ (note since $p$ is a probability, it can have any value from 0 to 1).
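For reference, the binomial probability mass function -- the probability of observing exactly $k$ successes in $n$ independent trials, each with success probability $p$ -- is

$P(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}$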
The most common example used to demonstrate the binomial distribution is coin flipping. Imagine we have a fair coin ($p = 0.5$) and we flip it 50 times. How many heads ("successes") would we expect? The binomial distribution can tell us.
# code hacked from here: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom.html
n, p = 50, 0.5
fig, ax = plt.subplots(1, 1)
x = np.arange(sp.stats.distributions.binom.ppf(0.001, n, p),
sp.stats.distributions.binom.ppf(0.999, n, p))
ax.plot(x, sp.stats.distributions.binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
ax.vlines(x, 0, sp.stats.distributions.binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
sns.despine(offset=10);
plt.xlabel('Number of successes')
plt.xlim(0, 50)
plt.ylabel('Probability mass');
Now imagine we have a trick coin, which comes up heads 90% of the time. How many heads do we expect now?
n, p = 50, 0.9
fig, ax = plt.subplots(1, 1)
x = np.arange(sp.stats.distributions.binom.ppf(0.001, n, p),
sp.stats.distributions.binom.ppf(0.999, n, p))
ax.plot(x, sp.stats.distributions.binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
ax.vlines(x, 0, sp.stats.distributions.binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
sns.despine(offset=10);
plt.xlabel('Number of successes')
plt.xlim(0, 50)
plt.ylabel('Probability mass');
So you can see that changing $p$ changes the number of successes we expect from a given number of trials (equivalently, the chance of a success on any one trial).
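Since the mean of a binomial distribution is simply $np$, we can confirm the expected head counts for both coins directly (a quick sketch using scipy.stats):

```python
from scipy.stats import binom

# the mean of a binomial distribution is n * p
print(binom.mean(50, 0.5))  # 25 heads expected from the fair coin
print(binom.mean(50, 0.9))  # 45 heads expected from the trick coin
```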
Now, by assuming that our response distribution is binomially distributed, we can estimate the expected probability of showing blackout signs as a function of age. To do this, we use the logit transformed linear predictor as our estimate of $p$ in the binomial distribution.
We can use statsmodels' glm function to find the regression weights that maximise the likelihood of the observed data, assuming that the data come from a binomial distribution:
fit = smf.glm('signs ~ age',
data=gf,
family=sm.families.Binomial(link=sm.families.links.logit)).fit()
# note we don't have to specify the logit link above, since it's the default for the binomial distribution.
print(fit.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  signs   No. Observations:                    8
Model:                            GLM   Df Residuals:                        6
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -4.2173
Date:                Tue, 24 Feb 2015   Deviance:                       8.4345
Time:                        20:39:10   Pearson chi2:                     7.42
No. Iterations:                     6
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -2.9209      2.647     -1.104      0.270        -8.108     2.266
age            0.1057      0.080      1.315      0.188        -0.052     0.263
==============================================================================
We get a summary table with regression coefficients as before. Note that a few things have changed: in particular, the Model Family is now "Binomial", and the Link Function is the logit. This is also called a logistic regression, because it uses the logit link function.
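To make the "maximise the likelihood" idea concrete, here is a minimal hand-rolled version of that estimation, using the g-forces data values printed above and a generic optimiser. This is a sketch for intuition only; statsmodels uses IRLS rather than a general-purpose optimiser.

```python
import numpy as np
from scipy.optimize import minimize

# the g-forces data, transcribed from the data frame above
age = np.array([39, 42, 20, 37, 20, 21, 41, 52], dtype=float)
signs = np.array([0, 1, 0, 1, 1, 0, 1, 1], dtype=float)

def neg_log_likelihood(beta):
    # linear predictor, squashed through the logistic to give p
    eta = beta[0] + beta[1] * age
    p = 1.0 / (1.0 + np.exp(-eta))
    p = np.clip(p, 1e-10, 1 - 1e-10)  # numerical safety
    # Bernoulli (binomial with n = 1) log-likelihood, negated for minimisation
    return -np.sum(signs * np.log(p) + (1 - signs) * np.log(1 - p))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(res.x)  # close to the IRLS estimates in the summary table: [-2.92, 0.106]
```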
How do the regression coefficients above compare to the linear, Gaussian model we fit before?
print(fit_lm.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  signs   No. Observations:                    8
Model:                            GLM   Df Residuals:                        6
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                  0.234334625323
Method:                          IRLS   Log-Likelihood:                -4.3968
Date:                Tue, 24 Feb 2015   Deviance:                       1.4060
Time:                        20:39:10   Pearson chi2:                     1.41
No. Iterations:                     4
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.0998      0.540     -0.185      0.853        -1.159     0.959
age            0.0213      0.015      1.415      0.157        -0.008     0.051
==============================================================================
You can see that the regression coefficients are totally different numbers. However, our design matrix $\mathbf{X}$ didn't change: it's still a column of ones and a column of age. The coefficients differ because in the logistic regression the linear predictor $\mathbf{X} \beta$ passes through the link function before giving a prediction for the response variable.
Let's run through what the regression coefficients mean in the logistic regression (fit).
The Intercept is the value of the linear predictor (note: not the response!) when age = 0. In this case, we can't observe an age of zero, so the intercept is not very meaningful.
Aside: to make the intercept more interpretable, we could centre the age variable by subtracting its mean -- this would make the intercept equal the linear predictor value when age is at its mean in the data. We could do this within the Patsy formula by writing signs ~ center(age). We will avoid doing this here for simplicity.
The age coefficient tells us the change in the linear predictor for an increase in age of one year.
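Because the linear predictor in logistic regression is the log-odds, exponentiating a coefficient gives an odds ratio. For example, using the age coefficient from the summary table above:

```python
import numpy as np

# age coefficient copied from the fitted summary table above
beta_age = 0.1057
odds_ratio = np.exp(beta_age)
# each additional year of age multiplies the odds of showing
# blackout signs by about 1.11, i.e. roughly an 11% increase
print(round(odds_ratio, 3))
```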
To work out what these mean for the response value, we need to pass them through the logit function. As an example, let's compute the predicted chance that a 28-year-old would show blackout signs.
age = 28
linear_predictor = fit.params[0] + fit.params[1]*age
linear_predictor
0.038508192779235006
pred = logit(linear_predictor)
pred
0.50962585872402622
So our prediction for a 28 year old is that they would have about a 51% chance to black out. We can also compute this using the predict method of the fit object. Let's do that for a 65 year old:
fit.predict(exog={'age':65})
array([ 0.9810925])
Our model predicts that a 65 year old would have a 98.1% chance of blacking out. What would happen if we tried making similar predictions from the (inappropriate) linear model?
fit_lm.predict(exog={'age': 65})
array([ 1.28585271])
The linear model predicts a value of about 1.29 for a 65 year old -- which makes no sense for a 0/1 outcome.
Let's plot our model predictions for the chance of blacking out against the data.
# define expand grid function:
def expand_grid(data_dict):
""" A port of R's expand.grid function for use with Pandas dataframes.
from http://pandas.pydata.org/pandas-docs/stable/cookbook.html?highlight=expand%20grid
"""
rows = itertools.product(*data_dict.values())
return pd.DataFrame.from_records(rows, columns=data_dict.keys())
# build a new matrix with expand grid:
preds = expand_grid({'age': np.linspace(10, 80, num=100)}) # put a line of ages in, to plot a smooth curve.
preds['yhat'] = fit.predict(preds)
# add to data:
merged = gf.append(preds)
merged.tail()
age | signs | subject | yhat | |
---|---|---|---|---|
95 | 77.171717 | NaN | NaN | 0.994704 |
96 | 77.878788 | NaN | NaN | 0.995084 |
97 | 78.585859 | NaN | NaN | 0.995436 |
98 | 79.292929 | NaN | NaN | 0.995763 |
99 | 80.000000 | NaN | NaN | 0.996067 |
g = sns.FacetGrid(merged, size=4)
g.map(plt.plot, 'age', 'yhat') # plot model predictions
g.map(plt.scatter, 'age', 'signs')
sns.despine(offset=10);
plt.ylim([-.1, 1.1])
plt.yticks([0, 0.5, 1]);
We can extend the design matrix to include any combination of continuous and categorical predictors as in the previous lectures, but using the logistic link function to predict 0--1 data.
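For instance, Patsy builds such mixed design matrices for us. A minimal sketch with a small hypothetical data frame (not the experiment's data):

```python
import numpy as np
import pandas as pd
import patsy

# hypothetical data frame mixing a continuous and a categorical predictor
df = pd.DataFrame({'x': [0.1, 0.5, 0.9, 1.3],
                   'g': ['a', 'b', 'a', 'b']})

# the design matrix has an intercept column, a dummy-coded column
# for g, and the continuous column x -- just as in earlier lectures
X = patsy.dmatrix('x + C(g)', df)
print(np.asarray(X))
```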
To demonstrate this I will use a dataset from a psychophysical experiment (Wallis & Bex, 2012). In this experiment, three observers detected the location of a target embedded in a natural image in a forced-choice paradigm. The target (a circular patch of texture) could appear above, right, below or left of the observer's fixation point. The stimuli looked a bit like this:
The image above contains 12 patches of texture (three on each cardinal meridian relative to the fixation point). Note that in the actual experiment, only one patch was shown at one of the four possible target locations.
After each trial, the observer responded with one of four buttons. This is a four-alternative forced-choice (4AFC) task: if the observer could not see the target at all, average performance should be 25% correct.
We varied, among other things, the size of the patch and the eccentricity (distance from fixation). Let's import the data and have a look:
nat_ims = pd.read_csv('../Datasets/psychophysics.txt', sep='\t')
# columns to lowercase:
nat_ims.columns = [c.lower() for c in nat_ims.columns]
nat_ims.head()
observer | trialnum | blocknum | eccent | patchsize | correct | targetloc | responseloc | imageori | |
---|---|---|---|---|---|---|---|---|---|
0 | TW | 1 | 1 | 2 | -0.391207 | 1 | u | u | 0CW |
1 | TW | 2 | 1 | 2 | -0.391207 | 1 | d | d | 270CW |
2 | TW | 3 | 1 | 2 | -0.505150 | 0 | d | r | 180CW |
3 | TW | 4 | 1 | 2 | -0.505150 | 1 | l | l | 270CW |
4 | TW | 5 | 1 | 2 | -0.505150 | 1 | u | u | 0CW |
Each row in the dataset corresponds to one trial. We are told the observer, trial and block numbers, the eccentricity, the log patch size (log 10 pixels), the target location (relative to fixation), the response location (relative to fixation) and the orientation of the image relative to its veridical orientation.
Finally, the column correct tells us whether the observer got the trial correct (1) or incorrect (0).
How many trials did each observer do?
nat_ims.groupby(['observer']).size()
observer
N1    17200
PB    16918
TW    16932
dtype: int64
So each observer did about 17000 trials. Because ~50,000 observations is quite a lot, the plotting takes a while. For this reason I'm going to take a random subsample of the data first.
np.random.seed(12345) # seed rng so these results are reproducible.
rows = np.random.choice(nat_ims.index, size=5000, replace=False)
dat = nat_ims.ix[rows].copy()
dat.groupby(['observer', 'eccent']).size()
observer  eccent
N1        2         558
          4         597
          8         510
PB        2         527
          4         547
          8         571
TW        2         538
          4         570
          8         582
dtype: int64
How can we predict performance across all trials? This is another example of a binomial dataset, which we can model using logistic regression.
We can plot some of the data and fit models in one step using Seaborn:
g = sns.lmplot('patchsize', 'correct', data=dat,
hue='eccent',
col='observer',
logistic=True,
y_jitter=.05,
size=4)
sns.despine(offset=10)
g.fig.subplots_adjust(wspace=0.5)
plt.xlabel('log patch size')
plt.ylabel('Proportion correct')
plt.yticks([0, 0.5, 1]);
# the data get easier to see if we bin trials along the x-axis:
g = sns.lmplot('patchsize', 'correct', data=dat,
hue='eccent',
col='observer',
logistic=True,
x_bins=6,
size=4)
sns.despine(offset=10)
g.fig.subplots_adjust(wspace=0.5)
plt.xlabel('log patch size')
plt.ylabel('Proportion correct')
plt.yticks([0, 0.5, 1]);
Now let's fit a logistic regression using Statsmodels that considers the influence of patchsize and eccentricity, where here we will code eccentricity as a categorical variable with 3 levels.
Note that for simplicity here, I will just fit to all the data (pooled across observers). This should be analysed as a repeated measures design (the conditions are nested within observers), but this is beyond the scope of this course.
fit = smf.glm('correct ~ patchsize + C(eccent)',
data=nat_ims,
family=sm.families.Binomial(link=sm.families.links.logit)).fit()
# note we don't have to specify the logit link above, since it's the default for the binomial distribution.
print(fit.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                correct   No. Observations:                51050
Model:                            GLM   Df Residuals:                    51046
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -33392.
Date:                Tue, 24 Feb 2015   Deviance:                       66784.
Time:                        20:40:48   Pearson chi2:                 5.10e+04
No. Iterations:                     7
====================================================================================
                       coef    std err          z      P>|z|  [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            1.2857      0.025     51.341      0.000      1.237     1.335
C(eccent)[T.4.0]    -0.5116      0.024    -21.134      0.000     -0.559    -0.464
C(eccent)[T.8.0]    -1.2084      0.031    -39.076      0.000     -1.269    -1.148
patchsize            2.0228      0.036     55.684      0.000      1.952     2.094
====================================================================================
What do the coefficients mean?
The intercept is the value of the linear predictor when the target is at 2 degrees from fixation and log patch size is zero. In this case, the model predicts performance will be logit(1.2857), about 78% correct.
The two coefficients for eccent show the offset applied to the linear predictor when the eccentricity is 4 or 8 degrees, relative to the baseline eccentricity of 2 degrees. So when log patch size is zero, we predict performance at 4 degrees to be logit(1.2857 - 0.5116), about 68% correct. Similarly, a patch of log size zero presented 8 degrees from fixation would be expected to yield logit(1.2857 - 1.2084), about 52% correct.
The same-sized patch is harder to detect the farther from fixation it is presented.
Finally, the patchsize coefficient shows the change in the linear predictor for a one-unit (i.e. 1 log unit) change in the size of the patch. At 2 degrees, a patch of log size 0 yields 78% correct (see above); if the patch were log size 1 it would be expected to yield logit(1.2857 + 2.0228*1), or about 96% correct, on average.
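These percent-correct figures can be verified by passing the relevant coefficient sums through the logistic squashing function (coefficient values copied from the summary table above):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# coefficient values from the fitted summary table above
intercept, ecc4, ecc8, patchsize = 1.2857, -0.5116, -1.2084, 2.0228

print(logistic(intercept))              # ~0.78: 2 deg, log size 0
print(logistic(intercept + ecc4))       # ~0.68: 4 deg, log size 0
print(logistic(intercept + ecc8))       # ~0.52: 8 deg, log size 0
print(logistic(intercept + patchsize))  # ~0.96: 2 deg, log size 1
```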
The GLM can be used to model a dichotomous response variable by assuming that variable is distributed according to a binomial distribution. Dichotomous response variables are found often in neuroscience and psychology: whenever we talk about "percent correct" or similar variables (like the psychophysical dataset above).
To do this, we "squash" the linear predictor through the logit function, which bounds it between 0 and 1. This squashing, together with the binomial error family, is an example of changing the identity link function and the error model -- it's what makes the GLM "generalised".
Note that nothing changed about how we construct or interpret the design matrix for our model: everything we've learned about design matrices still applies.
It's common in neuroscience to measure the spiking activity of neurons. Spikes over time can be thought of as "events", and thus the number of spikes in a time interval as event counts.
Counts are discrete and non-negative, so modelling their distribution again requires a different type of GLM. Luckily, we can achieve this by once more changing the link function and the error family.
As a very brief example, we consider a neuron recorded from V1 while gratings of two different orientations were shown at two different contrasts. What type of design do we have here?
spk = pd.read_csv('../Datasets/neuron2.txt', sep='\t',index_col=False)
spk.head()
spikes | contrast | orientation | |
---|---|---|---|
0 | 3 | 1 | 0 |
1 | 6 | 1 | 0 |
2 | 3 | 1 | 0 |
3 | 3 | 1 | 0 |
4 | 6 | 1 | 0 |
sns.factorplot("contrast","spikes","orientation",spk,kind='point')
sns.despine(offset=10);
What type of effects would you assume when we fit a GLM with appropriate link function?
The correct model family for count data is the Poisson family, with the log link function
$X\beta = \ln(\mu)$
or, conversely,
$\mu = \exp(X\beta)$
This takes care of two things: the predicted mean count is always non-negative, and additive effects on the linear predictor become multiplicative effects on the expected count.
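The log link guarantees a non-negative predicted mean, and it turns additive changes on the linear predictor into multiplicative changes in the expected count. A minimal numerical illustration (hypothetical linear-predictor values, not fitted ones):

```python
import numpy as np

# hypothetical linear-predictor values, including a very negative one
eta = np.array([-2.0, 0.0, 1.5])

# the inverse link exp() always yields a positive expected count
mu = np.exp(eta)
print(mu)

# adding 0.5 to the linear predictor multiplies mu by exp(0.5)
print(np.allclose(np.exp(eta + 0.5), mu * np.exp(0.5)))  # True
```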
fit = smf.glm('spikes ~ contrast * orientation',
data=spk,
family=sm.families.Poisson(link=sm.families.links.log)).fit()
print(fit.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 spikes   No. Observations:                  156
Model:                            GLM   Df Residuals:                      152
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -374.91
Date:                Tue, 24 Feb 2015   Deviance:                       194.83
Time:                        20:40:49   Pearson chi2:                     184.
No. Iterations:                     8
========================================================================================
                           coef    std err          z      P>|z|  [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept                1.5515      0.083     18.677      0.000      1.389     1.714
contrast                -0.0165      0.012     -1.361      0.174     -0.040     0.007
orientation             -0.0018      0.001     -1.349      0.177     -0.004     0.001
contrast:orientation     0.0019      0.000     11.078      0.000      0.002     0.002
========================================================================================
# build a new matrix with expand grid:
preds = expand_grid({'contrast': [1, 10], 'orientation': [0, 90]})
preds_2 = preds.copy()
preds['spikes'] = fit.predict(preds)
# also fit a model with no interaction for comparison:
#fit_2 = smf.glm('spikes ~ contrast * orientation', spk).fit()
fit_2 = smf.glm('spikes ~ contrast + orientation', data=spk,
family=sm.families.Poisson(link=sm.families.links.log)).fit()
preds_2['spikes'] = fit_2.predict(preds_2)
# to plot model predictions against data, we should merge them together:
# see answer (https://stackoverflow.com/questions/28239359/showing-data-and-model-predictions-in-one-plot-using-seaborn-and-statsmodels)
merged = pd.concat(dict(data=spk,
linear=preds_2,
interact=preds),
names=['type']).reset_index()
g = sns.factorplot('contrast', 'spikes', 'orientation',
data=merged,
col='type',
kind='point',
size=3.5)
g.fig.subplots_adjust(wspace=0.3)
sns.despine(offset=10);
Note that this is particularly important when the counts are low, because then their non-negativity matters.
print(fit_2.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 spikes   No. Observations:                  156
Model:                            GLM   Df Residuals:                      153
Model Family:                 Poisson   Df Model:                            2
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -437.65
Date:                Tue, 24 Feb 2015   Deviance:                       320.31
Time:                        20:40:50   Pearson chi2:                     328.
No. Iterations:                     8
===============================================================================
                  coef    std err          z      P>|z|   [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       0.8332      0.075     11.168      0.000       0.687     0.979
contrast        0.0976      0.007     14.083      0.000       0.084     0.111
orientation     0.0110      0.001     15.441      0.000       0.010     0.012
===============================================================================
# Do an expanded prediction for many contrasts, compare to linear model:
preds_3 = expand_grid({'contrast': np.linspace(0.01, 20), 'orientation': [0, 90]})
preds_4 = preds_3.copy()
preds_3['yhat'] = fit.predict(preds_3)
fit_lm = smf.glm('spikes ~ contrast + orientation', data=spk).fit()
preds_4['yhat'] = fit_lm.predict(preds_4)
# to plot model predictions against data, we should merge them together:
# see answer (https://stackoverflow.com/questions/28239359/showing-data-and-model-predictions-in-one-plot-using-seaborn-and-statsmodels)
merged = pd.concat(dict(data=spk,
poisson=preds_3,
linear=preds_4),
names=['type']).reset_index()
merged.tail()
type | level_1 | contrast | orientation | spikes | yhat | |
---|---|---|---|---|---|---|
351 | poisson | 95 | 18.368163 | 90 | NaN | 65.976032 |
352 | poisson | 96 | 18.776122 | 90 | NaN | 70.202318 |
353 | poisson | 97 | 19.184082 | 90 | NaN | 74.699331 |
354 | poisson | 98 | 19.592041 | 90 | NaN | 79.484413 |
355 | poisson | 99 | 20.000000 | 90 | NaN | 84.576017 |
g = sns.FacetGrid(merged,
hue='type',
col='orientation')
g.map(plt.scatter, 'contrast', 'spikes')
g.map(plt.plot, 'contrast', 'yhat')
g.fig.subplots_adjust(wspace=0.3)
sns.despine(offset=10)
g.add_legend();
You can see that the linear and the Poisson models give very different predictions outside the range of these data. Furthermore, the linear model can in principle predict negative spike counts, which makes no sense for count data.
The GLM allows us to specify different distributions for the response variable $y$ and for the link function $f()$. In this notebook we've considered two examples from the context of behavioural / neural science: percent-correct data (binomial regression) and spiking data (Poisson regression). The following table (adapted from John Kruschke's book, Doing Bayesian Data Analysis) shows, for many types of response data, the appropriate error model and the most common link functions.
Scale type of $y$ | Typical noise distribution of $y$ | Typical link function |
---|---|---|
metric (continuous) | normal | identity |
dichotomous (0 or 1) | bernoulli | logistic |
nominal (unordered categories) | categorical | softmax |
ordinal (ordered categories) | categorical | thresholded logistic |
count | poisson | log |
In this course, we've looked extensively at the first row, and touched on dichotomous and count data here. The GLM offers a powerful framework for modelling a wide variety of data.