The University of Tübingen

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import patsy
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt
import itertools

%matplotlib inline

sns.set_style("white")


Recall from last week's lecture that the general form of the GLM is

$y = f(\mathbf{X} \beta) + \epsilon$

All the models we've considered up until now are linear models: the link function $f()$ is the identity --- that is, it doesn't change $\mathbf{X} \beta$ at all, and the noise $\epsilon$ is assumed to be Gaussian.

But for many response variables, linear predictions don't make a lot of sense. Consider one of the example datasets we've given you: "g-forces" (from here). From the description:

Military pilots sometimes black out when their brains are deprived of oxygen due to G-forces during violent maneuvers.

Glaister and Miller (1990) produced similar symptoms by exposing volunteers' lower bodies to negative air pressure, likewise decreasing oxygen to the brain. The data lists the subjects' ages and whether they showed syncopal blackout related signs (pallor, sweating, slow heartbeat, unconsciousness) during an 18 minute period.

So Glaister and Miller (1990) wanted to know whether the propensity to black out was related to the subject's age. They tested this in 8 people. Let's import the data:

In [2]:
gf = pd.read_csv('../Datasets/gforces.txt', sep='\t')
# make all columns lower case:
gf.columns = [c.lower() for c in gf.columns]
print(gf)

  subject  age  signs
0      JW   39      0
1      JM   42      1
2      DT   20      0
3      LK   37      1
4      JK   20      1
5      MK   21      0
6      FP   41      1
7      DG   52      1


Note that the response variable we're trying to predict (signs) takes on only one of two possible values: either the subject showed blackout signs (y = 1) or did not (y = 0).

Aside: the numerical values 0 and 1 are used by convention and for mathematical convenience. Think of it as the presence of something (1) versus the absence of something (0).

What happens if we fit an ordinary (Gaussian error) linear regression to these data?

In [3]:
fit_lm = smf.glm('signs ~ age', gf).fit()
print(fit_lm.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  signs   No. Observations:                    8
Model:                            GLM   Df Residuals:                        6
Model Family:                Gaussian   Df Model:                            1
Method:                          IRLS   Log-Likelihood:                -4.3968
Date:                Tue, 24 Feb 2015   Deviance:                       1.4060
Time:                        20:39:09   Pearson chi2:                     1.41
No. Iterations:                     4
==============================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.0998      0.540     -0.185      0.853        -1.159     0.959
age            0.0213      0.015      1.415      0.157        -0.008     0.051
==============================================================================

In [4]:
# Let's have a look at the model predictions
sns.regplot('age', 'signs', gf)
sns.despine(offset=10);


Notice how the data are only 0s or 1s, but that our model can happily predict values outside this range. Moreover, our model expects the errors to be Gaussian, and they clearly won't be. Therefore, this linear model is not appropriate for these data.

What we want instead is to change the link function $f()$ from the identity (i.e. our prediction is just $\mathbf{X} \beta$ with Gaussian errors) to something more appropriate for the data we have: a function that transforms the linear predictor into a more appropriate range.

Our linear predictor can theoretically range from -inf to +inf. We want something that maps this range into the range 0--1. There are a number of options, but the most common is the logistic function:

$y = \frac{1}{1 + e^{-x}}$

Let's plot what this does:

In [5]:
def logit(x):
    return 1. / (1. + np.exp(-x))

x = np.linspace(-5, 5)  # imagine this is the linear predictor (inner product of X and beta)
y = logit(x)
plt.plot(x, y)
sns.despine(offset=10)
plt.vlines(0, 0, 1, linestyles='--')
plt.hlines(0.5, -5, 5, linestyles='--')
plt.yticks([0, 0.5, 1])
plt.xlabel('Linear predictor value')
plt.ylabel('Transformed value');


Notice that as the linear predictor approaches negative infinity, its transformed value approaches zero but never goes below it. Similarly, as the linear predictor approaches positive infinity, its transformed value approaches one but never goes above it:

In [6]:
logit(-np.inf)

Out[6]:
0.0
In [7]:
logit(np.inf)

Out[7]:
1.0

The logit transform allows us to map our linear predictor into a bounded range. Note also that when the linear predictor is zero, the transformed value is 0.5.
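Aside: strictly speaking, the squashing function we defined above computes the logistic (inverse-logit); the logit proper is its inverse, mapping a probability back onto the linear-predictor (log-odds) scale. A minimal sketch of this round trip (the function names here are our own, not from the lecture code):

```python
import numpy as np

def logistic(x):
    """Squash a real-valued linear predictor into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def log_odds(p):
    """The logit proper: map a probability back to the linear-predictor scale."""
    return np.log(p / (1.0 - p))

# round trip: squashing and then inverting recovers the linear predictor
x = np.array([-2.0, 0.0, 2.0])
print(np.allclose(log_odds(logistic(x)), x))  # True
```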

How can we now use this transformed value to predict dichotomous responses?

### Binomial distribution

Since the response variable is dichotomous (either "shows signs" or "does not show signs"), an appropriate probability distribution to describe the data is the binomial distribution. Just as the Normal distribution returns a probability density over a continuous dimension given a mean and standard deviation, the binomial distribution gives a probability mass over a discrete dimension (number of "successes"), given a number of "trials" and a probability of success, $p$ (note since $p$ is a probability, it can have any value from 0 to 1).

The most common example used to demonstrate the binomial distribution is coin flipping. Imagine we have a fair coin ($p = 0.5$) and we flip it 50 times. How many heads ("successes") would we expect? The binomial distribution can tell us.

In [8]:
# code hacked from here: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom.html
n, p = 50, 0.5
fig, ax = plt.subplots(1, 1)
x = np.arange(sp.stats.distributions.binom.ppf(0.001, n, p),
              sp.stats.distributions.binom.ppf(0.999, n, p))
ax.plot(x, sp.stats.distributions.binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
ax.vlines(x, 0, sp.stats.distributions.binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
sns.despine(offset=10);
plt.xlabel('Number of successes')
plt.xlim(0, 50)
plt.ylabel('Probability mass');


Now imagine we have a trick coin, which comes up heads 90% of the time. How many heads do we expect now?

In [9]:
n, p = 50, 0.9
fig, ax = plt.subplots(1, 1)
x = np.arange(sp.stats.distributions.binom.ppf(0.001, n, p),
              sp.stats.distributions.binom.ppf(0.999, n, p))
ax.plot(x, sp.stats.distributions.binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
ax.vlines(x, 0, sp.stats.distributions.binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
sns.despine(offset=10);
plt.xlabel('Number of successes')
plt.xlim(0, 50)
plt.ylabel('Probability mass');


So you can see that changing $p$ changes the number of successes we expect from a given number of trials (equivalently, the chance of a success on any one trial).
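As a quick sanity check on the two plots above, the expected number of successes is $np$ and the variance is $np(1 - p)$; scipy can confirm this directly:

```python
from scipy import stats

n = 50
for p in (0.5, 0.9):
    mean, var = stats.binom.stats(n, p, moments='mv')
    # mean = n*p, variance = n*p*(1-p)
    print(p, float(mean), float(var))
```

So the fair coin is expected to land heads 25 times out of 50, the trick coin 45 times, and the trick coin's counts are also much less variable.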

Now, by assuming that our response distribution is binomially distributed, we can estimate the expected probability of showing blackout signs as a function of age. To do this, we use the logit transformed linear predictor as our estimate of $p$ in the binomial distribution.

We can use statsmodel's glm function to find the regression weights that maximise the likelihood of the observed data, assuming that the data come from a binomial distribution:

In [10]:
fit = smf.glm('signs ~ age',
              data=gf,
              family=sm.families.Binomial()).fit()
# note we don't have to specify the logit link above, since it's the default for the binomial distribution.

print(fit.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  signs   No. Observations:                    8
Model:                            GLM   Df Residuals:                        6
Model Family:                Binomial   Df Model:                            1
Method:                          IRLS   Log-Likelihood:                -4.2173
Date:                Tue, 24 Feb 2015   Deviance:                       8.4345
Time:                        20:39:10   Pearson chi2:                     7.42
No. Iterations:                     6
==============================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -2.9209      2.647     -1.104      0.270        -8.108     2.266
age            0.1057      0.080      1.315      0.188        -0.052     0.263
==============================================================================


We get a summary table with regression coefficients as before. Note that a few things have changed: in particular, see that the Model Family is now "Binomial", and the Link Function is the logit. This is also called a logistic regression, because it uses the logit link function.

How do the regression coefficients above compare to the linear, Gaussian model we fit before?

In [11]:
print(fit_lm.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  signs   No. Observations:                    8
Model:                            GLM   Df Residuals:                        6
Model Family:                Gaussian   Df Model:                            1
Method:                          IRLS   Log-Likelihood:                -4.3968
Date:                Tue, 24 Feb 2015   Deviance:                       1.4060
Time:                        20:39:10   Pearson chi2:                     1.41
No. Iterations:                     4
==============================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.0998      0.540     -0.185      0.853        -1.159     0.959
age            0.0213      0.015      1.415      0.157        -0.008     0.051
==============================================================================


You can see the regression coefficients are totally different numbers. However, our design matrix $\mathbf{X}$ didn't change: it's still a column of ones and a column of age. The reason the coefficients are different is because in the logistic regression, the beta weights pass through the link function to give a prediction for the response variable.

Let's run through what the regression coefficients mean in the logistic regression (fit).

The Intercept is the value of the linear predictor (note: not the response!) when age = 0. In this case, we can't observe an age of zero, so the intercept is not very meaningful.

Aside: To make the intercept more interpretable, we could centre the age variable by subtracting its mean -- this would make the intercept equal the linear predictor value when age is at its mean in the data. We could do this within the Patsy formula by writing signs ~ center(age). We will avoid doing this here for simplicity.

The age coefficient tells us the change in the linear predictor for an increase in age of one year.
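Since the linear predictor lives on the log-odds scale, exponentiating a coefficient gives an odds ratio. Using the age estimate from the summary table above (0.1057), a sketch:

```python
import numpy as np

b_age = 0.1057  # age coefficient from the logistic fit above
odds_ratio = np.exp(b_age)
print(round(odds_ratio, 3))  # ~1.111: each extra year raises the odds of blackout signs by about 11%
```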

To work out what these mean for the response value, we need to pass them through the logit function. As an example, let's consider the prediction for the chance a 28 year old person would show blackout signs.

In [12]:
age = 28
linear_predictor = fit.params['Intercept'] + fit.params['age'] * age
linear_predictor

Out[12]:
0.038508192779235006
In [13]:
pred = logit(linear_predictor)
pred

Out[13]:
0.50962585872402622

So our prediction for a 28 year old is that they would have about a 51% chance to black out. We can also compute this using the predict method of the fit object. Let's do that for a 65 year old:

In [14]:
fit.predict(exog={'age':65})

Out[14]:
array([ 0.9810925])

Our model predicts that a 65 year old would have a 98.1% chance of blacking out. What would happen if we tried making similar predictions from the (inappropriate) linear model?

In [15]:
fit_lm.predict(exog={'age': 65})

Out[15]:
array([ 1.28585271])

The linear model predicts a value of 1.28 for a 65 year old -- which makes no sense.
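We can reproduce this nonsensical value by hand from the (rounded) Gaussian coefficients in the summary table above:

```python
# rounded coefficients from the Gaussian fit above
intercept, b_age = -0.0998, 0.0213

pred = intercept + b_age * 65
print(round(pred, 2))  # ~1.28, outside the [0, 1] range of the response
```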

## Plotting model predictions

Let's plot our model predictions for the chance of blacking out against the data.

In [16]:
# define expand grid function:
def expand_grid(data_dict):
    """ A port of R's expand.grid function for use with Pandas dataframes.

    from http://pandas.pydata.org/pandas-docs/stable/cookbook.html?highlight=expand%20grid

    """
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())

# build a new matrix with expand grid:
preds = expand_grid({'age': np.linspace(10, 80, num=100)})  # put a line of ages in, to plot a smooth curve.
preds['yhat'] = fit.predict(preds)

merged = pd.concat([gf, preds])  # stack the prediction grid below the data
merged.tail()

Out[16]:
          age  signs subject      yhat
95  77.171717    NaN     NaN  0.994704
96  77.878788    NaN     NaN  0.995084
97  78.585859    NaN     NaN  0.995436
98  79.292929    NaN     NaN  0.995763
99  80.000000    NaN     NaN  0.996067
In [17]:
g = sns.FacetGrid(merged, size=4)
g.map(plt.plot, 'age', 'yhat')  # plot model predictions
g.map(plt.scatter, 'age', 'signs')
sns.despine(offset=10);
plt.ylim([-.1, 1.1])
plt.yticks([0, 0.5, 1]);


# Logistic regression with categorical and continuous predictors

We can extend the design matrix to include any combination of continuous and categorical predictors as in the previous lectures, but using the logistic link function to predict 0--1 data.

To demonstrate this I will use a dataset from a psychophysical experiment (Wallis & Bex, 2012). In this experiment, three observers detected the location of a target embedded in a natural image in a forced-choice paradigm. The target (a circular patch of texture) could appear above, right, below or left of the observer's fixation point. The stimuli looked a bit like this:

The image above contains 12 patches of texture (three on each cardinal meridian relative to the fixation point). Note that in the actual experiment, only one patch was shown at one of the four possible target locations.

After each trial, the observer responded with one of four buttons. This is a four-alternative forced-choice (4AFC) task: if the observer could not see the target at all, average performance should be 25% correct.

We varied, among other things, the size of the patch and the eccentricity (distance from fixation). Let's import the data and have a look:

In [18]:
nat_ims = pd.read_csv('../Datasets/psychophysics.txt', sep='\t')
# columns to lowercase:
nat_ims.columns = [c.lower() for c in nat_ims.columns]
nat_ims.head()

Out[18]:
  observer  trialnum  blocknum  eccent  patchsize  correct targetloc responseloc imageori
0       TW         1         1       2  -0.391207        1         u           u      0CW
1       TW         2         1       2  -0.391207        1         d           d    270CW
2       TW         3         1       2  -0.505150        0         d           r    180CW
3       TW         4         1       2  -0.505150        1         l           l    270CW
4       TW         5         1       2  -0.505150        1         u           u      0CW

Each row in the dataset corresponds to one trial. We are told the observer, trial and block numbers, the eccentricity, the log patch size (log 10 pixels), the target location (relative to fixation), the response location (relative to fixation) and the orientation of the image relative to its veridical orientation.

Finally, the column correct tells us whether the observer got the trial correct (1) or incorrect (0).

How many trials did each observer do?

In [19]:
nat_ims.groupby(['observer']).size()

Out[19]:
observer
N1          17200
PB          16918
TW          16932
dtype: int64

So each observer did about 17000 trials. Because ~50,000 observations is quite a lot, the plotting takes a while. For this reason I'm going to take a random subsample of the data first.

In [20]:
np.random.seed(12345)  # seed rng so these results are reproducible.
rows = np.random.choice(nat_ims.index, size=5000, replace=False)
dat = nat_ims.loc[rows].copy()
dat.groupby(['observer', 'eccent']).size()

Out[20]:
observer  eccent
N1        2         558
          4         597
          8         510
PB        2         527
          4         547
          8         571
TW        2         538
          4         570
          8         582
dtype: int64

How can we predict performance across all trials? This is another example of a binomial dataset, which we can model using logistic regression.

We can plot some of the data and fit models in one step using Seaborn:

In [21]:
sns.lmplot('patchsize', 'correct', data=dat,
           hue='eccent',
           col='observer',
           logistic=True,
           y_jitter=.05,
           size=4)
sns.despine(offset=10)
plt.xlabel('log patch size')
plt.ylabel('Proportion correct')
plt.yticks([0, 0.5, 1]);

In [22]:
# the data get easier to see if we bin trials along the x-axis:
sns.lmplot('patchsize', 'correct', data=dat,
           hue='eccent',
           col='observer',
           logistic=True,
           x_bins=6,
           size=4)
sns.despine(offset=10)
plt.xlabel('log patch size')
plt.ylabel('Proportion correct')
plt.yticks([0, 0.5, 1]);


Now let's fit a logistic regression using Statsmodels that considers the influence of patchsize and eccentricity, where here we will code eccentricity as a categorical variable with 3 levels.

Note that for simplicity here, I will just fit to all the data (pooled across observers). This should be analysed as a repeated measures design (the conditions are nested within observers), but this is beyond the scope of this course.

In [23]:
fit = smf.glm('correct ~ patchsize + C(eccent)',
              data=nat_ims,
              family=sm.families.Binomial()).fit()
# note we don't have to specify the logit link above, since it's the default for the binomial distribution.

print(fit.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                correct   No. Observations:                51050
Model:                            GLM   Df Residuals:                    51046
Model Family:                Binomial   Df Model:                            3
Method:                          IRLS   Log-Likelihood:                -33392.
Date:                Tue, 24 Feb 2015   Deviance:                       66784.
Time:                        20:40:48   Pearson chi2:                 5.10e+04
No. Iterations:                     7
====================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            1.2857      0.025     51.341      0.000         1.237     1.335
C(eccent)[T.4.0]    -0.5116      0.024    -21.134      0.000        -0.559    -0.464
C(eccent)[T.8.0]    -1.2084      0.031    -39.076      0.000        -1.269    -1.148
patchsize            2.0228      0.036     55.684      0.000         1.952     2.094
====================================================================================


What do the coefficients mean?

The intercept is the value of the linear predictor when the target is at 2 degrees from fixation and log patch size is zero. In this case, the model predicts performance will be logit(1.2857), about 78% correct.

The two coefficients for eccent show the offset applied to the linear predictor when the eccentricity is 4 or 8 degrees, relative to an eccentricity of 2. So when log patch size is zero, we predict the performance at 4 degrees to be logit(1.2857 - 0.5116), about 68% correct. Similarly, a patch of log size zero presented 8 degrees from fixation would be expected to yield about 52% correct.

That is, the same-sized patch is harder to detect the further from fixation it is presented.

Finally, the patchsize coefficient shows the change in the linear predictor for a one-unit (i.e. one log unit) change in the size of the patch. At 2 degrees, a patch of log size 0 yields 78% correct (see above); a patch of log size 1 would be expected to yield logit(1.2857 + 2.0228), about 96% correct, on average.
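These percentages can be verified by passing the summary-table coefficients through the logistic function (re-defined here so the snippet stands alone):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# coefficients from the summary table above
intercept, b_ecc4, b_ecc8, b_size = 1.2857, -0.5116, -1.2084, 2.0228

print(logistic(intercept))           # ~0.78: 2 degrees, log patch size 0
print(logistic(intercept + b_ecc4))  # ~0.68: 4 degrees, log patch size 0
print(logistic(intercept + b_ecc8))  # ~0.52: 8 degrees, log patch size 0
print(logistic(intercept + b_size))  # ~0.96: 2 degrees, log patch size 1
```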

## Conclusion: Binomial models

The GLM can be used to model a dichotomous response variable by assuming that variable is distributed according to a binomial distribution. Dichotomous response variables are found often in neuroscience and psychology: whenever we talk about "percent correct" or similar variables (like the psychophysical dataset above).

To do this, we "squash" the linear predictor through the logit function, which bounds it between 0 and 1. This squashing procedure and binomial family is an example of changing the identity link function and the error model. It's what makes the GLM "generalised".

Note that nothing changed about how we construct or interpret the design matrix for our model: everything we've learned about design matrices still applies.

# Poisson regression

It's common in neuroscience to measure the spiking activity of neurons. Spikes over time can be thought of as "events", and thus the number of spikes in a time interval as event counts.

Counts are discrete and non-negative - thus modeling their distribution requires again a different type of GLM. Luckily, we can again achieve that by changing the link function.

As a very brief example, we consider a neuron recorded from V1 while showing a grating with two different orientations at two different contrasts. What type of design do we have here?

In [24]:
spk = pd.read_csv('../Datasets/neuron2.txt', sep='\t',index_col=False)

In [25]:
spk.head()

Out[25]:
   spikes  contrast  orientation
0       3         1            0
1       6         1            0
2       3         1            0
3       3         1            0
4       6         1            0
In [26]:
sns.factorplot('contrast', 'spikes', 'orientation', data=spk, kind='point')
sns.despine(offset=10);


What type of effects would you assume when we fit a GLM with an appropriate link function?

The correct model family for count data is the Poisson family, with the log link function

$\mathbf{X}\beta = \ln(\mu)$

or conversely

$\mu = \exp(\mathbf{X}\beta)$

This takes care of two things:

• The output of the right-hand side is non-negative
• The variance of the output increases with its mean, as is true of count data
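Both points can be checked numerically; a small sketch using scipy (nothing here comes from the dataset above — the linear predictor values are made up):

```python
import numpy as np
from scipy import stats

# 1) the exponential inverse link maps any real linear predictor to a positive rate
eta = np.array([-3.0, 0.0, 3.0])  # hypothetical linear predictor values
mu = np.exp(eta)
print(np.all(mu > 0))  # True: predicted rates can never go negative

# 2) for a Poisson distribution the variance equals the mean,
#    so the spread grows with the predicted rate
for rate in mu:
    m, v = stats.poisson.stats(rate, moments='mv')
    print(float(m), float(v))  # mean and variance are identical
```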
In [27]:
fit = smf.glm('spikes ~ contrast * orientation',
              data=spk,
              family=sm.families.Poisson()).fit()
print(fit.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 spikes   No. Observations:                  156
Model:                            GLM   Df Residuals:                      152
Model Family:                 Poisson   Df Model:                            3
Method:                          IRLS   Log-Likelihood:                -374.91
Date:                Tue, 24 Feb 2015   Deviance:                       194.83
Time:                        20:40:49   Pearson chi2:                     184.
No. Iterations:                     8
========================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept                1.5515      0.083     18.677      0.000         1.389     1.714
contrast                -0.0165      0.012     -1.361      0.174        -0.040     0.007
orientation             -0.0018      0.001     -1.349      0.177        -0.004     0.001
contrast:orientation     0.0019      0.000     11.078      0.000         0.002     0.002
========================================================================================

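Because the effects are additive on the log scale, exponentiating a coefficient gives a multiplicative change in the predicted firing rate. For example, using the intercept from the summary table above, a sketch of the baseline rate (at contrast 0 and orientation 0):

```python
import numpy as np

intercept = 1.5515  # Intercept coefficient from the summary table above
baseline_rate = np.exp(intercept)
print(round(baseline_rate, 2))  # ~4.72 spikes
```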
In [28]:
# build a new matrix with expand grid:
preds = expand_grid({'contrast': [1, 10], 'orientation': [0, 90]})
preds_2 = preds.copy()

preds['spikes'] = fit.predict(preds)

# also fit a model with no interaction for comparison:
fit_2 = smf.glm('spikes ~ contrast + orientation',
                data=spk,
                family=sm.families.Poisson()).fit()
preds_2['spikes'] = fit_2.predict(preds_2)

# to plot model predictions against data, we should merge them together:
merged = pd.concat(dict(data=spk,
                        linear=preds_2,
                        interact=preds),
                   names=['type']).reset_index()

g = sns.factorplot('contrast', 'spikes', 'orientation',
                   data=merged,
                   col='type',
                   kind='point',
                   size=3.5)
sns.despine(offset=10);


Note that this is particularly important if the counts are low, because then their non-negativity plays a role.

In [29]:
print(fit_2.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 spikes   No. Observations:                  156
Model:                            GLM   Df Residuals:                      153
Model Family:                 Poisson   Df Model:                            2
Method:                          IRLS   Log-Likelihood:                -437.65
Date:                Tue, 24 Feb 2015   Deviance:                       320.31
Time:                        20:40:50   Pearson chi2:                     328.
No. Iterations:                     8
===============================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       0.8332      0.075     11.168      0.000         0.687     0.979
contrast        0.0976      0.007     14.083      0.000         0.084     0.111
orientation     0.0110      0.001     15.441      0.000         0.010     0.012
===============================================================================

In [30]:
# Do an expanded prediction for many contrasts, compare to linear model:

preds_3 = expand_grid({'contrast': np.linspace(0.01, 20), 'orientation': [0, 90]})
preds_4 = preds_3.copy()

preds_3['yhat'] = fit.predict(preds_3)

fit_lm = smf.glm('spikes ~ contrast + orientation', data=spk).fit()
preds_4['yhat'] = fit_lm.predict(preds_4)

# to plot model predictions against data, we should merge them together:
merged = pd.concat(dict(data=spk,
                        poisson=preds_3,
                        linear=preds_4),
                   names=['type']).reset_index()

merged.tail()

Out[30]:
        type  level_1   contrast  orientation  spikes       yhat
351  poisson       95  18.368163           90     NaN  65.976032
352  poisson       96  18.776122           90     NaN  70.202318
353  poisson       97  19.184082           90     NaN  74.699331
354  poisson       98  19.592041           90     NaN  79.484413
355  poisson       99  20.000000           90     NaN  84.576017
In [31]:
g = sns.FacetGrid(merged,
                  hue='type',
                  col='orientation')
g.map(plt.scatter, 'contrast', 'spikes')
g.map(plt.plot, 'contrast', 'yhat')
sns.despine(offset=10)


You can see that the linear and the Poisson models give very different predictions outside the range of these data. Furthermore, the linear model

• can predict negative rates
• assumes the variance is homogeneous, whereas for count distributions (i.e. Poisson) the variance scales with the mean

# Conclusion

The GLM allows us to specify different distributions for the response variable $y$ and different link functions $f()$. In this notebook we've considered two examples from the context of behavioural / neural science: percent correct data (binomial regression) and spiking data (Poisson regression). The following table (adapted from John Kruschke's book, Doing Bayesian Data Analysis) shows, for many types of response data, the appropriate error model and the most common link functions.

| Scale type of $y$ | Typical noise distribution of $y$ | Typical link function |
|---|---|---|
| metric (continuous) | normal | identity |
| dichotomous (0 or 1) | bernoulli | logistic |
| nominal (unordered categories) | categorical | softmax |
| ordinal (ordered categories) | categorical | thresholded logistic |
| count | poisson | log |

In this course, we've looked extensively at the first row, and touched on dichotomous and count data here. The GLM offers a powerful framework for modelling a wide variety of data.
