Poisson regression

  • Contingency tables.
  • Log-linear regression.
  • Log-linear regression as a generalized linear model.

Count data

Afterlife

Men and women were asked whether they believed in the afterlife (1991 General Social Survey).

         Y      N or U   Total
M        435    147      582
F        375    134      509
Total    810    281      1091

Question: is belief in the afterlife independent of sex?

Poisson counts

Definition

  • A random variable $Y$ is a Poisson random variable with parameter $\lambda$ if $$P(Y=j) = e^{-\lambda} \frac{\lambda^j}{j!}, \qquad \forall j \geq 0.$$
  • Some simple calculations show that $E(Y)=\text{Var}(Y)=\lambda$ (a quick simulation check appears after this list).
  • Poisson models play the same role for count data that Gaussian models play for continuous outcomes -- they appear in many common models.
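
As a quick sanity check (not part of the original notes), we can simulate Poisson draws and verify that the sample mean and variance are both close to $\lambda$; the value of $\lambda$ below is arbitrary.

In [ ]:
# simulate Poisson counts and compare sample mean and variance to lambda
set.seed(1)
lambda = 3.5          # arbitrary rate, for illustration only
y = rpois(100000, lambda)
c(mean(y), var(y), lambda)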

Contingency table

  • Model: $Y_{ij} \sim Poisson(\lambda_{ij} )$.
  • Null (independence): $H_0 :\lambda_{ij} = \lambda \alpha_i \cdot \beta_j , \sum_i \alpha_i = 1, \sum_j \beta_j = 1.$
  • Alternative: $H_a : \lambda_{ij} > 0$ arbitrary.
  • Test statistic: Pearson’s $X^2$ : $X^2 = \sum_{ij} \frac{(Y_{ij}-E_{ij})^2}{E_{ij}} \overset{H_0}{\approx} \chi^2_1$
  • Here $E_{ij}$ is the estimated expected value under independence.
  • Why 1 df? The independence model has 5 parameters ($\lambda, \alpha_1, \alpha_2, \beta_1, \beta_2$) with two constraints, hence 3 free parameters; the unrestricted model has 4, and $4 - 3 = 1$.
  • This is actually a regression model for the count data.
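Concretely, the expected counts $E_{ij}$ under independence are estimated from the table margins: with $N$ the grand total, $$E_{ij} = N\,\hat\pi^{S}_i\,\hat\pi^{B}_j = \frac{(\text{row $i$ total})(\text{column $j$ total})}{N},$$ so, for instance, $E_{M,Y} = 582 \times 810 / 1091 \approx 432.1$ for the afterlife table. This is exactly what the code below computes.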
In [ ]:
Y = c(435,147,375,134)            # counts in order (M,Y), (M,N), (F,Y), (F,N)
S = factor(c('M','M','F','F'))    # sex
B = factor(c('Y','N','Y','N'))    # belief in the afterlife

N = sum(Y)
piS = c((435+147)/N,(375+134)/N)  # marginal proportions for sex
piB = c((435+375)/N,(147+134)/N)  # marginal proportions for belief

# expected counts under independence
E = N*c(piS[1]*piB[1], piS[1]*piB[2], piS[2]*piB[1], piS[2]*piB[2])
# Pearson's X^2 and its p-value on 1 df
X2 = sum((Y - E)^2/E)
c(X2, 1-pchisq(X2,1))

The independence test is implemented as chisq.test in R. Depending on whether one applies the continuity correction (the correct argument), we get Pearson's $X^2$ or a corrected version.

In [ ]:
chisq.test(matrix(Y,2,2), correct=FALSE)
In [ ]:
chisq.test(matrix(Y,2,2))

Contingency table as regression model

  • Under independence $\log(E(Y_{ij})) = \log \lambda_{ij} = \log \lambda + \log \alpha_i + \log \beta_j.$
  • OR, the model has a log link.
  • What about the variance? Because of the Poisson assumption, $\text{Var}(Y_{ij}) = E(Y_{ij})$.
  • OR, the variance function is $V (\mu) = \mu.$
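
To connect this with the hand computation above (this derivation is not in the original notes, but follows directly from it): under independence $E(Y_{ij}) = N\,\pi^S_i\,\pi^B_j$, so $$\log E(Y_{ij}) = \log N + \log \pi^S_i + \log \pi^B_j,$$ which is the log-linear form with $\lambda = N$, $\alpha_i = \pi^S_i$ and $\beta_j = \pi^B_j$.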

The goodness of fit test can also be found using a glm.

In [ ]:
summary(glm(Y ~ S + B, family=poisson()))

This model has the same fitted values as the expected counts we computed by hand above.

In [ ]:
fitted(glm(Y ~ S+B, family=poisson()))
E
  • Here is the deviance test statistic, computed by hand.
  • It is numerically close to, but not identical to, Pearson's $X^2$ for these data.

In [ ]:
# Poisson deviance: 2 * sum(Y*log(Y/E) - (Y - E)); the (Y - E) term sums to zero here
DEV = sum(2*(Y*log(Y/E) - (Y - E)))
c(X2, DEV)

Contingency table $(k \times m)$

  • Suppose we have $k$ categories on one axis and $m$ on the other (in the previous example, $k = m = 2$). We call this a $k \times m$ contingency table.
  • Independence model $(H_0)$: $\log(E (Y_{ij} )) = \log \lambda_{ij} = \log \lambda + \log \alpha_i + \log \beta_j$
  • Test for independence: Pearson’s $$X^2 = \sum_{ij} \frac{(Y_{ij}-E_{ij})^2}{E_{ij}} \overset{H_0}{\approx} \chi^2_{(k-1)(m-1)}$$
  • Alternative test statistic $G = 2\sum_{ij} Y_{ij} \log \left(\frac{Y_{ij}}{E_{ij}}\right)$
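
As an illustration (not from the original notes), we can simulate a $3 \times 2$ table under independence and check that both chisq.test and the Poisson regression goodness of fit use $(k-1)(m-1) = 2$ degrees of freedom; the rates below are made up purely for the simulation.

In [ ]:
set.seed(2)
# simulate a 3 x 2 table under independence: lambda_ij = N * alpha_i * beta_j
alpha = c(0.2, 0.3, 0.5)
beta = c(0.6, 0.4)
N = 1000
tab = matrix(rpois(6, N * outer(alpha, beta)), 3, 2)
chisq.test(tab)                        # Pearson X^2 on (3-1)*(2-1) = 2 df
# the same independence model fit as a Poisson regression
rowf = factor(rep(1:3, 2))
colf = factor(rep(1:2, each=3))
fit = glm(as.vector(tab) ~ rowf + colf, family=poisson())
c(deviance(fit), fit$df.residual)      # residual deviance on 2 df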

Independence tests

  • Unlike most testing situations we have seen, here the full model has as many parameters as observations (i.e. it is saturated).
  • This test is known as a goodness of fit test.
  • It asks: "How well does the independence model fit these data?"
  • Unlike other tests we've seen, the deviance itself is the test statistic, not a difference of deviances (a short check with the afterlife data follows).
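
A minimal check using the afterlife data from earlier (assuming Y, S and B are still defined): the residual deviance of the independence fit, referred to a $\chi^2_1$ distribution, gives the goodness of fit p-value.

In [ ]:
# residual deviance of the independence model is the goodness-of-fit statistic
af.glm = glm(Y ~ S + B, family=poisson())
c(deviance(af.glm), af.glm$df.residual, 1 - pchisq(deviance(af.glm), af.glm$df.residual))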

Lumber company example

  • $Y$ : number of customers visiting the store from a region;
  • $X_1$ : number of housing units in region;
  • $X_2$ : average household income;
  • $X_3$ : average housing unit age in region;
  • $X_4$ : distance to nearest competitor;
  • $X_5$ : distance to store in miles.

Poisson (log-linear) regression model

  • Given observations and covariates $Y_i , X_{ij} , 1 \leq i \leq n, 1 \leq j \leq p$.
  • Model: $$Y_{i} \sim Poisson \left(\exp\left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij} \right)\right)$$
  • Poisson assumption implies the variance function is $V (\mu) = \mu.$
In [ ]:
url = 'http://stats191.stanford.edu/data/lumber.table'
lumber.table = read.table(url, header=T)
lumber.glm = glm(Customers ~ Housing + Income + Age + Competitor + Store, 
                 family=poisson(), data=lumber.table)
summary(lumber.glm)
In [ ]:
par(mfrow=c(2,2))
plot(lumber.glm)

Interpretation of coefficients

  • The log-linear model means covariates have a multiplicative effect on the expected count.
  • Log-linear model: $\frac{E(Y|\dots, X_j=x_j+1, \dots)}{E(Y|\dots, X_j=x_j, \dots)} = e^{\beta_j}$
  • So, a one-unit increase in variable $j$ multiplies the expected count by $e^{\beta_j}$, all other covariates held equal (see the sketch after this list).
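
A minimal sketch (not in the original notes) of reading the lumber fit on the multiplicative scale, assuming lumber.glm from above is still in the workspace:

In [ ]:
# multiplicative effect of a one-unit increase in each covariate
exp(coef(lumber.glm))
# Wald-based confidence intervals on the same multiplicative scale
exp(confint.default(lumber.glm))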

Generalized linear models

  • Logistic model: ${\text{logit}}(\pi) = \beta_0 + \sum_j \beta_j X_j \qquad V(\pi)=\pi(1-\pi)$
  • Poisson log-linear model: $\log(\mu) = \beta_0 + \sum_j \beta_j X_j, \qquad V(\mu) = \mu$
  • These are the ingredients of a GLM: a link function and a variance function (illustrated in R below).
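
One way to see both ingredients in R (a small illustration, not from the original notes) is through the family objects used by glm, which store the link and the variance function:

In [ ]:
# variance functions stored in R's family objects
poisson()$variance     # function(mu) mu
binomial()$variance    # function(mu) mu * (1 - mu)
# link functions
c(poisson()$link, binomial()$link)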

Deviance tests

  • To test $H_0:{\cal M}={\cal M}_R$ vs. $H_a: {\cal M}={\cal M}_F$, we use $$DEV({\cal M}_R) - DEV({\cal M}_F) \sim \chi^2_{df_R-df_F}$$
  • In the contingency table example, ${\cal M}_R$ is the independence model $$\log(E(Y_{ij})) = \log\lambda + \log\alpha_i + \log\beta_j$$ with ${\cal M}_F$ being the "saturated model."
In [ ]:
# reduced model: drop Competitor and Store
lumber.R.glm = glm(Customers ~ Housing + Income + Age, 
                   family=poisson, data=lumber.table)
# compare deviances of reduced and full models (add test='Chisq' for the p-value directly)
anova(lumber.R.glm, lumber.glm)
In [ ]:
# p-value for the deviance difference reported above (263.45 on 2 df)
1 - pchisq(263.45, 2)

Model selection

Since this is a likelihood-based model, step (AIC-based selection) can also be used for model selection.

In [ ]:
step(lumber.glm)
In [ ]:
step(glm(Customers ~ 1, data=lumber.table, family=poisson()),
     scope=list(upper=lumber.glm), direction='forward')