# Poisson regression¶

• Contingency tables.
• Poisson regression.
• Generalized linear model.

# Count data¶

## Afterlife¶

Men and women were asked whether they believed in the after life (1991 General Social Survey).

* Y N or U Total
M 435 147 582
F 375 134 509
Total 810 281 1091

Question: is belief in the afterlife independent of gender?

## Poisson counts¶

### Definition¶

• A random variable $Y$ is a Poisson random variable with parameter $\lambda$ if $P(Y=j) = e^{-\lambda} \frac{\lambda^j}{j!}, \qquad \forall j \geq 0.$
• Some simple calculations show that $E(Y)=\text{Var}(Y)=\lambda.$
• Poisson models for counts are analogous to Gaussian for continuous outcomes.

## Contingency table¶

• Model: $Y_{ij} \sim Poisson(\lambda_{ij} )$.
• Null (independence): $H_0 :\lambda_{ij} = \lambda \alpha_i \cdot \beta_j , \sum_i \alpha_i = 1, \sum_j \beta_j = 1.$
• Alternative: $H_a : \lambda_{ij} \in \mathbb{R}$
• Test statistic: Pearson’s $X^2$ : $X^2 = \sum_{ij} \frac{(Y_{ij}-E_{ij})^2}{E_{ij}} \overset{H_0}{\approx} \chi^2_1$
• Here $E_{ij}$ is the estimated expected value under independence ($H_0$).
• Why 1 df ? Independence model has 5 parameters, two constraints = 3 df. Unrestricted has 4 parameters.
• This is actually a regression model for the count data.
In [1]:
%%R
Y = c(435,147,375,134)
G = factor(c('M','M','F','F'))
B = factor(c('Y','N','Y','N'))

N = sum(Y)
piG = c((435+147)/N,(375+134)/N)
piB = c((435+375)/N,(147+134)/N)

E = N*c(piG[1]*piB[1], piG[1]*piB[2], piG[2]*piB[1], piG[2]*piB[2])
# Pearson's X^2
X2 = sum((Y - E)^2/E)
c(X2, 1-pchisq(X2,1))

[1] 0.1620840 0.6872451


The independence test is called chisq.test in R. Depending on whether one corrects or not, we get the $X^2$ or a corrected version.

In [2]:
%%R
chisq.test(matrix(Y,2,2), correct=F)

	Pearson's Chi-squared test

data:  matrix(Y, 2, 2)
X-squared = 0.1621, df = 1, p-value = 0.6872


In [3]:
%%R
chisq.test(matrix(Y,2,2))

	Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(Y, 2, 2)
X-squared = 0.111, df = 1, p-value = 0.739



## Contingency table as regression model¶

• Under independence \begin{aligned}  \log(E (Y_{ij} )) &= \log \lambda_{ij} = \log \lambda + \log \alpha_i + \log \beta_j  \end{aligned}
• OR, the model has a log link.
• What about the variance? Because of Poisson assumption $Var(Y_{ij} ) = E (Y_{ij})$
• OR, the variance function is $V (\mu) = \mu.$

The goodness of fit test can also be found using a glm.

In [4]:
%%R
summary(glm(Y ~ G+B, family=poisson()))

Call:
glm(formula = Y ~ G + B, family = poisson())

Deviance Residuals:
1        2        3        4
0.1394  -0.2377  -0.1494   0.2524

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.87595    0.06787  71.839   <2e-16 ***
GM           0.13402    0.06069   2.208   0.0272 *
BY           1.05868    0.06923  15.291   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 272.685  on 3  degrees of freedom
Residual deviance:   0.162  on 1  degrees of freedom
AIC: 35.407

Number of Fisher Scoring iterations: 3



This model has the same fitted values as we had computed by hand above.

In [5]:
%%R
print(E)
print(fitted(glm(Y ~ G+B, family=poisson())))

[1] 432.099 149.901 377.901 131.099
1       2       3       4
432.099 149.901 377.901 131.099


Here is the deviance test statistic. It is numerically close, but not identical to Pearson's $X^2$.

In [6]:
%%R
DEV = sum(2*(Y*log(Y/E)+Y-E))
c(X2,DEV)

[1] 0.1620840 0.1619951


## Contingency table $(k \times m)$¶

• Suppose we had $k$ categories on one axis, $m$ on the other (i.e. previous example $k = m = 2$). We call this as $k \times m$ contingency table.
• Independence model $(H_0)$: $\log(E (Y_{ij} )) = \log \lambda_{ij} = \log \lambda + \log \alpha_i + \log \beta_j$
• Test for independence: Pearson’s $$X^2 = \sum_{ij} \frac{(Y_{ij}-E_{ij})^2}{E_{ij}} \overset{H_0}{\approx} \chi^2_{(k-1)(m-1)}$$
• Alternative test statistic $G = 2\sum_{ij} Y_{ij} \log \left(\frac{Y_{ij}}{E_{ij}}\right)$

## Independence tests¶

• Unlike in other cases, in this case the full model has as many parameters as observations (i.e. it’s saturated).
• This test is known as a goodness of fit test.
• It tests: "how well does the independence model fit this data"?

• Unlike other tests we've seen, the deviance is the test statistic, not a difference of deviance.

## Lumber company example¶

• $Y$ : number of customers visting store from region;
• $X_1$ : number of housing units in region;
• $X_2$ : average household income;
• $X_3$ : average housing unit age in region;
• $X_4$ : distance to nearest competitor;
• $X_5$ : distance to store in miles.

## Poisson (log-linear) regression model¶

• Given observations and covariates $Y_i , X_{ij} , 1 \leq i \leq n, 1 \leq j \leq p$.
• Model: $$Y_{i} \sim Poisson \left(\exp\left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij} \right)\right)$$
• Poisson assumption implies the variance function is $V (\mu) = \mu.$
In [7]:
%%R -h 600 -w 600
url = 'http://stats191.stanford.edu/data/lumber.table'
lumber.table = read.table(url, header=T)
lumber.glm = glm(Customers ~ Housing + Income + Age + Competitor + Store, family=poisson(), data=lumber.table)
summary(lumber.glm)

Call:
glm(formula = Customers ~ Housing + Income + Age + Competitor +
Store, family = poisson(), data = lumber.table)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.93195  -0.58868  -0.00009   0.59269   2.23441

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.942e+00  2.072e-01  14.198  < 2e-16 ***
Housing      6.058e-04  1.421e-04   4.262 2.02e-05 ***
Income      -1.169e-05  2.112e-06  -5.534 3.13e-08 ***
Age         -3.726e-03  1.782e-03  -2.091   0.0365 *
Competitor   1.684e-01  2.577e-02   6.534 6.39e-11 ***
Store       -1.288e-01  1.620e-02  -7.948 1.89e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 422.22  on 109  degrees of freedom
Residual deviance: 114.99  on 104  degrees of freedom
AIC: 571.02

Number of Fisher Scoring iterations: 4


In [8]:
%%R
par(mfrow=c(2,2))
plot(lumber.glm)


## Interpretation of coefficients¶

• The log-linear model means covariates have multiplicative effect.
• Log-linear model model: $\frac{E(Y|\dots, X_j=x_j+1, \dots)}{E(Y|\dots, X_j=x_j, \dots)} = e^{\beta_j}$
• So, one unit increase in variable $j$ results in $e^{\beta_j}$ (multiplicative) increase the expected count, all other parameters being equal.

## Generalized linear models¶

• Logistic model: ${\text{logit}}(\pi) = \beta_0 + \sum_j \beta_j X_j \qquad V(\pi)=\pi(1-\pi)$
• Poisson log-linear model: $\log(\mu) = \beta_0 + \sum_j \beta_j X_j, \qquad V(\mu) = \mu$
• These are the ingredients to a GLM …

## Deviance tests¶

• To test $H_0:{\cal M}={\cal M}_R$ vs. $H_a: {\cal M}={\cal M}_F$, we use $$DEV({\cal M}_R) - DEV({\cal M}_F) \sim \chi^2_{df_R-df_F}$$
• In contingency example ${\cal M}_R$ is the independence model $$\log(E(Y_{ij})) = \lambda + \alpha_i + \beta_j$$ with ${\cal M}_F$ being the "saturated model."
In [9]:
%%R
lumber.R.glm = glm(Customers ~ Housing + Income + Age, family=poisson(link='log'), data=lumber.table)
anova(lumber.R.glm, lumber.glm)

Analysis of Deviance Table

Model 1: Customers ~ Housing + Income + Age
Model 2: Customers ~ Housing + Income + Age + Competitor + Store
Resid. Df Resid. Dev Df Deviance
1       106     378.43
2       104     114.99  2   263.45

In [10]:
%%R
1 - pchisq(263.45,2)

[1] 0


## Model selection¶

As it is a likelihood model, step can also be used for model selection.

In [11]:
%%R
step(lumber.glm)

Start:  AIC=571.02
Customers ~ Housing + Income + Age + Competitor + Store

Df Deviance    AIC
<none>            114.98 571.02
- Age         1   119.36 573.40
- Housing     1   133.19 587.23
- Income      1   146.78 600.82
- Competitor  1   156.65 610.68
- Store       1   182.49 636.52

Call:  glm(formula = Customers ~ Housing + Income + Age + Competitor +
Store, family = poisson(), data = lumber.table)

Coefficients:
(Intercept)      Housing       Income          Age   Competitor        Store
2.942e+00    6.058e-04   -1.169e-05   -3.726e-03    1.684e-01   -1.288e-01

Degrees of Freedom: 109 Total (i.e. Null);  104 Residual
Null Deviance:	    422.2
Residual Deviance: 115 	AIC: 571

In [12]:
%%R
step(glm(Customers ~ 1, data=lumber.table, family=poisson()), scope=list(upper=lumber.glm), direction='forward')

Start:  AIC=868.26
Customers ~ 1

Df Deviance    AIC
+ Store       1   184.41 632.45
+ Competitor  1   201.90 649.93
+ Housing     1   379.56 827.60
+ Income      1   399.15 847.19
<none>            422.22 868.26
+ Age         1   422.20 870.24

Step:  AIC=632.45
Customers ~ Store

Df Deviance    AIC
+ Competitor  1   149.33 599.37
+ Income      1   177.45 627.49
+ Housing     1   181.29 631.33
<none>            184.41 632.45
+ Age         1   183.19 633.23

Step:  AIC=599.37
Customers ~ Store + Competitor

Df Deviance    AIC
+ Income   1   134.98 587.02
<none>         149.33 599.37
+ Age      1   148.00 600.04
+ Housing  1   148.37 600.41

Step:  AIC=587.02
Customers ~ Store + Competitor + Income

Df Deviance    AIC
+ Housing  1   119.36 573.40
<none>         134.98 587.02
+ Age      1   133.19 587.23

Step:  AIC=573.4
Customers ~ Store + Competitor + Income + Housing

Df Deviance    AIC
+ Age   1   114.98 571.02
<none>      119.36 573.40

Step:  AIC=571.02
Customers ~ Store + Competitor + Income + Housing + Age

Call:  glm(formula = Customers ~ Store + Competitor + Income + Housing +
Age, family = poisson(), data = lumber.table)

Coefficients:
(Intercept)        Store   Competitor       Income      Housing          Age
2.942e+00   -1.288e-01    1.684e-01   -1.169e-05    6.058e-04   -3.726e-03

Degrees of Freedom: 109 Total (i.e. Null);  104 Residual
Null Deviance:	    422.2
Residual Deviance: 115 	AIC: 571