The Generalised Linear Model

A practical introduction using the Python environment

Tom Wallis and Philipp Berens

This course provides an overview of an extremely flexible statistical framework for describing and performing inference with a wide variety of data types: the Generalised Linear Model (GLM). Many common statistical procedures are special cases of the GLM. In the course, we focus on the construction and understanding of design matrices and the interpretation of regression weights. We mostly concentrate on the linear Gaussian model, before discussing more general cases. We also touch on how this framework relates to ANOVA-style model comparison.
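As a minimal sketch of what a design matrix and regression weights look like in the linear Gaussian case, the example below fits a two-group comparison by ordinary least squares using NumPy. The data values and variable names are invented for illustration and are not taken from the course materials.

```python
import numpy as np

# Hypothetical data: one measurement per subject in two groups.
control = np.array([4.0, 5.0, 6.0])
treatment = np.array([7.0, 8.0, 9.0])
y = np.concatenate([control, treatment])

# Design matrix: a column of ones (intercept) and a dummy variable
# that is 0 for the control group and 1 for the treatment group.
is_treatment = np.concatenate([np.zeros(len(control)), np.ones(len(treatment))])
X = np.column_stack([np.ones(len(y)), is_treatment])

# Ordinary least squares estimate of the regression weights.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

intercept, difference = beta
print(intercept)   # estimate of the control-group mean
print(difference)  # estimate of the treatment mean minus the control mean
```

With this dummy coding, the first weight estimates the mean of the control group and the second estimates the difference between the group means, which is the quantity a two-sample t-test examines.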

The course was designed and presented as a six-week elective statistics course for graduate students in the neuroscience program at the University of Tübingen in January 2015. Lectures were presented as a collection of IPython Notebooks. While the notebooks are (we hope) well documented, they are lecture materials rather than a textbook, so some content might not be self-explanatory.

We chose to do the course in Python because

  1. It is a general-purpose programming language and thus more versatile than, for example, R. Neuroscientists can use Python not only to analyse data but also to interface with hardware, conduct behavioural experiments, and so on.
  2. Its popularity as a scientific computing environment is rapidly growing.
  3. The scientific computing environment of Python has many similarities to MATLAB™, which is the historically dominant environment in our field.
  4. It is free and open source, so we expect it to keep benefiting students after they leave the university environment.

Nevertheless, the main statistical module we use here (Statsmodels) is considerably less mature than its R counterparts, which is no surprise given that R is much older. Thankfully, creating and interpreting design matrices using Patsy formula notation is a skill that transfers easily to R's glm routines.
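To give a flavour of the Patsy formula notation mentioned above, here is a small sketch that builds a design matrix from a formula string. The variable names `contrast` and `condition` and their values are hypothetical, chosen only for illustration.

```python
from patsy import dmatrix

# Hypothetical data: a continuous predictor and a two-level categorical one.
data = {
    "contrast": [0.1, 0.2, 0.4, 0.1, 0.2, 0.4],
    "condition": ["a", "a", "a", "b", "b", "b"],
}

# The formula names the predictors; Patsy adds an intercept column
# automatically and dummy-codes the categorical variable.
X = dmatrix("contrast + C(condition)", data)

print(X.design_info.column_names)
print(X)
```

R's formula syntax is closely related, although details such as how a variable is marked as categorical differ between the two.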

Note two things:

  1. This is not a programming course. If you do not have enough experience with programming (or Python) to follow the materials here, seek out an introduction to programming in Python. There are many available for free on the internet.
  2. This is not a basic statistics course. You should be reasonably familiar with things like t-tests and ANOVA before proceeding.

Where content is erroneous, unclear or buggy, please tell us at our GitHub repository.

Lectures

Datasets

To demonstrate the ideas in the course, we used several datasets obtained from the OzDASL database as well as from our own research. They are provided in the git repository to facilitate self-study.

License

Authors: Tom Wallis and Philipp Berens

Year: 2015

Copyright: This work is licensed under a CC BY 4.0 license. You may reuse, modify and redistribute these materials provided you give appropriate credit to the authors. All images embedded in the lecture materials were obtained from the internet and are used under "fair use" for educational purposes. The copyright for all images remains with their respective holders.

Further reading

Here we provide some references for further reading. These reflect our own backgrounds in neuroscience and psychology.

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. New York: Springer.

  • A thorough overview of model comparison, which is how we like to think of ANOVA

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.

  • Introduction to regression using multilevel models. Random effects models, pooling and shrinkage...

Knoblauch, K., & Maloney, L. T. (2012). Modeling Psychophysical Data in R. New York: Springer.

  • This book presents some clear examples of applying GLMs to modelling psychophysical data, using the R environment

Kruschke, J. K. (2011). Doing Bayesian Data Analysis. Academic Press / Elsevier.

  • The last half of this book is dedicated to a clear and thorough practical introduction to GLMs. We recommend the Bayesian inference material too, but it is not part of our course.

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), ??–??

  • An article about how to arrange data so that analysis environments like Pandas can work with it.

Here are some notes on how we set up a Python environment for the course (packages, versions, etc.).

In [1]:
from IPython.display import HTML


def css_styling():
    # Read the course stylesheet and wrap it so the notebook renders it.
    with open("custom_style.css", "r") as f:
        styles = f.read()
    return HTML(styles)


css_styling()