# Poisson regression¶

• Contingency tables.
• Log-linear regression.
• Log-linear regression as a generalized linear model.

# Count data¶

## Afterlife¶

Men and women were asked whether they believed in the afterlife (1991 General Social Survey).

|       | Y   | N or U | Total |
|-------|-----|--------|-------|
| M     | 435 | 147    | 582   |
| F     | 375 | 134    | 509   |
| Total | 810 | 281    | 1091  |

Question: is belief in the afterlife independent of sex?

## Poisson counts¶

### Definition¶

• A random variable $Y$ is a Poisson random variable with parameter $\lambda$ if $$P(Y=j) = e^{-\lambda} \frac{\lambda^j}{j!}, \qquad \forall j \geq 0.$$
• Some simple calculations show that $E(Y)=\text{Var}(Y)=\lambda.$
• Poisson models play the same role for count outcomes that Gaussian models play for continuous outcomes -- they appear in many common models.
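As a quick sanity check of the mean-variance identity, one can simulate (the choice $\lambda = 4$ here is arbitrary):

```r
# Simulation sketch: for Poisson draws, the sample mean and sample variance
# should both be close to lambda (lambda = 4 is an arbitrary choice).
set.seed(1)
lambda = 4
Y = rpois(100000, lambda)
c(mean(Y), var(Y))  # both approximately 4
```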

## Contingency table¶

• Model: $Y_{ij} \sim Poisson(\lambda_{ij} )$.
• Null (independence): $H_0 :\lambda_{ij} = \lambda \alpha_i \cdot \beta_j , \sum_i \alpha_i = 1, \sum_j \beta_j = 1.$
• Alternative: $H_a : \lambda_{ij} \in \mathbb{R}$
• Test statistic: Pearson’s $X^2$ : $X^2 = \sum_{ij} \frac{(Y_{ij}-E_{ij})^2}{E_{ij}} \overset{H_0}{\approx} \chi^2_1$
• Here $E_{ij}$ is the estimated expected value under independence.
• Why 1 df? The independence model has 5 parameters ($\lambda, \alpha_1, \alpha_2, \beta_1, \beta_2$) with two constraints, leaving 3 free parameters; the unrestricted model has 4 (one per cell). The difference is $4 - 3 = 1$.
• This is actually a regression model for the count data.
In [ ]:
# Counts from the afterlife table, with sex and belief factors
Y = c(435,147,375,134)
S = factor(c('M','M','F','F'))
B = factor(c('Y','N','Y','N'))

# Estimated marginal probabilities for sex and belief
N = sum(Y)
piS = c((435+147)/N, (375+134)/N)
piB = c((435+375)/N, (147+134)/N)

# Expected counts under independence
E = N*c(piS[1]*piB[1], piS[1]*piB[2], piS[2]*piB[1], piS[2]*piB[2])
# Pearson's X^2 and its p-value on 1 df
X2 = sum((Y - E)^2/E)
c(X2, 1-pchisq(X2,1))


The independence test is implemented as `chisq.test` in R. Depending on whether one applies the continuity correction, we get the $X^2$ statistic above or a corrected version.

In [ ]:
chisq.test(matrix(Y,2,2), correct=FALSE)

In [ ]:
chisq.test(matrix(Y,2,2))


## Contingency table as regression model¶

• Under independence \begin{aligned}  \log(E (Y_{ij} )) &= \log \lambda_{ij} = \log \lambda + \log \alpha_i + \log \beta_j  \end{aligned}
• OR, the model has a log link.
• What about the variance? Because of Poisson assumption $Var(Y_{ij} ) = E (Y_{ij})$
• OR, the variance function is $V (\mu) = \mu.$

The goodness of fit test can also be found using a glm.

In [ ]:
summary(glm(Y ~ S + B, family=poisson()))


This model has the same fitted values as we had computed by hand above.

In [ ]:
fitted(glm(Y ~ S+B, family=poisson()))
E

• Here is the deviance test statistic.

• It is numerically close, but not identical to Pearson's $X^2$ for this data.

In [ ]:
# Poisson deviance: 2*sum(Y*log(Y/E) - (Y - E)); here sum(Y - E) = 0,
# so the last two terms do not change the value
DEV = sum(2*(Y*log(Y/E) - (Y - E)))
c(X2, DEV)


## Contingency table $(k \times m)$¶

• Suppose we had $k$ categories on one axis and $m$ on the other (the previous example had $k = m = 2$). We call this a $k \times m$ contingency table.
• Independence model $(H_0)$: $\log(E (Y_{ij} )) = \log \lambda_{ij} = \log \lambda + \log \alpha_i + \log \beta_j$
• Test for independence: Pearson’s $$X^2 = \sum_{ij} \frac{(Y_{ij}-E_{ij})^2}{E_{ij}} \overset{H_0}{\approx} \chi^2_{(k-1)(m-1)}$$
• Alternative test statistic $G = 2\sum_{ij} Y_{ij} \log \left(\frac{Y_{ij}}{E_{ij}}\right)$
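The same computation as in the $2 \times 2$ case carries over; here is a sketch with made-up counts for a $2 \times 3$ table:

```r
# Sketch with made-up counts: Pearson's X^2 and the likelihood-ratio
# statistic G for a k x m table, compared to chi^2 on (k-1)(m-1) df.
tab = matrix(c(20, 30, 25,
               15, 40, 35), nrow=2, byrow=TRUE)  # k = 2, m = 3
N = sum(tab)
E = outer(rowSums(tab), colSums(tab)) / N        # expected counts under independence
X2 = sum((tab - E)^2 / E)
G = 2 * sum(tab * log(tab / E))
df = (nrow(tab) - 1) * (ncol(tab) - 1)
c(X2, G, 1 - pchisq(X2, df))
```

The $X^2$ value agrees with `chisq.test(tab, correct=FALSE)`.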

## Independence tests¶

• Unlike in other cases we have seen, here the full model has as many parameters as observations (i.e. it is saturated).
• This test is known as a goodness of fit test.
• It tests: "how well does the independence model fit this data"?

• Unlike other tests we've seen, the deviance itself is the test statistic, not a difference of deviances.
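This can be checked directly on the afterlife table: the residual deviance of the independence fit is itself the goodness-of-fit statistic, because the saturated model (which adds the interaction) has deviance zero.

```r
# Afterlife counts again: the residual deviance of the independence model is
# the goodness-of-fit statistic; the saturated model has deviance essentially 0.
Y = c(435, 147, 375, 134)
S = factor(c('M', 'M', 'F', 'F'))
B = factor(c('Y', 'N', 'Y', 'N'))
indep = glm(Y ~ S + B, family=poisson())
sat = glm(Y ~ S * B, family=poisson())   # one parameter per cell
c(deviance(indep), deviance(sat))
```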

## Lumber company example¶

• $Y$ : number of customers visiting the store from a region;
• $X_1$ : number of housing units in region;
• $X_2$ : average household income;
• $X_3$ : average housing unit age in region;
• $X_4$ : distance to nearest competitor;
• $X_5$ : distance to store in miles.

## Poisson (log-linear) regression model¶

• Given observations and covariates $Y_i , X_{ij} , 1 \leq i \leq n, 1 \leq j \leq p$.
• Model: $$Y_{i} \sim Poisson \left(\exp\left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij} \right)\right)$$
• Poisson assumption implies the variance function is $V (\mu) = \mu.$
In [ ]:
url = 'http://stats191.stanford.edu/data/lumber.table'
lumber.table = read.table(url, header=TRUE)
lumber.glm = glm(Customers ~ Housing + Income + Age + Competitor + Store,
                 family=poisson(), data=lumber.table)
summary(lumber.glm)

In [ ]:
par(mfrow=c(2,2))
plot(lumber.glm)


## Interpretation of coefficients¶

• The log-linear model means covariates have a multiplicative effect on the mean.
• Log-linear model: $\frac{E(Y|\dots, X_j=x_j+1, \dots)}{E(Y|\dots, X_j=x_j, \dots)} = e^{\beta_j}$
• So, a one unit increase in variable $j$ results in an $e^{\beta_j}$-fold (multiplicative) increase in the expected count, all other variables held equal.
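A numerical check on simulated data (the coefficients 0.5 and 0.3 are arbitrary): the ratio of fitted means at $x+1$ versus $x$ equals $e^{\hat{\beta}}$ exactly.

```r
# Simulated sketch: in a Poisson log-linear fit, a one-unit increase in x
# multiplies the fitted mean by exp(beta), exactly.
set.seed(2)
x = rep(0:3, each=50)
y = rpois(200, exp(0.5 + 0.3 * x))
fit = glm(y ~ x, family=poisson())
b = coef(fit)["x"]
mu = predict(fit, newdata=data.frame(x=c(1, 2)), type="response")
c(mu[[2]] / mu[[1]], exp(b))  # identical
```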

## Generalized linear models¶

• Logistic model: ${\text{logit}}(\pi) = \beta_0 + \sum_j \beta_j X_j \qquad V(\pi)=\pi(1-\pi)$
• Poisson log-linear model: $\log(\mu) = \beta_0 + \sum_j \beta_j X_j, \qquad V(\mu) = \mu$
• These are the ingredients to a GLM …

## Deviance tests¶

• To test $H_0:{\cal M}={\cal M}_R$ vs. $H_a: {\cal M}={\cal M}_F$, we use $$DEV({\cal M}_R) - DEV({\cal M}_F) \overset{H_0}{\approx} \chi^2_{df_R-df_F}$$
• In contingency example ${\cal M}_R$ is the independence model $$\log(E(Y_{ij})) = \lambda + \alpha_i + \beta_j$$ with ${\cal M}_F$ being the "saturated model."
In [ ]:
lumber.R.glm = glm(Customers ~ Housing + Income + Age,
                   family=poisson(), data=lumber.table)
anova(lumber.R.glm, lumber.glm, test='Chisq')

In [ ]:
1 - pchisq(263.45, 2)


## Model selection¶

Because this is a likelihood-based model, `step` can also be used for model selection (by AIC).

In [ ]:
step(lumber.glm)

In [ ]:
step(glm(Customers ~ 1, data=lumber.table, family=poisson()),
     scope=list(upper=lumber.glm), direction='forward')