Gamma generalized linear models (GLMs)

This notebook contains simulation studies and examples for a generalized linear model (GLM) with a gamma distribution.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.genmod.families import family, links

For the gamma generalized linear model (GLM), we can use several link functions. First we explore the log link, for which the expected response is exp(params * exog). To relate this to the shape parameter alpha and scale parameter beta of the gamma distribution, we use:

alpha = 1/scale
beta = e_endog * scale

where e_endog is the expected value of endog (e_endog = exp(params * exog)).
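As a quick sanity check on this parameterization, we can draw from the gamma distribution with these shape and scale values and confirm that the sample mean and variance match e_endog and scale * e_endog**2. This is a numpy-only sketch; the shape/scale argument order follows numpy's gamma sampler, and the values of scale and e_endog are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

scale = 2.0      # GLM scale (dispersion) parameter
e_endog = 3.0    # target mean of the response

alpha = 1 / scale        # gamma shape
beta = e_endog * scale   # gamma scale
draws = rng.gamma(alpha, beta, size=1_000_000)

# mean = alpha * beta = e_endog; variance = alpha * beta**2 = scale * e_endog**2
print(draws.mean())  # close to 3.0
print(draws.var())   # close to 18.0
```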

In the next cell, we define a function that generates data from this distribution.

In [2]:
def gendat(n, params, scale, link_inv):
    """
    Generate endog, exog that follow a Gamma GLM.
    
    Parameters
    ----------
    n : int
        Sample size
    params : array-like
        Regression slopes
    scale : float
        Scale parameter
    link_inv : function
        Inverse of the link function, mapping the linear predictors 
        to the mean values.
        
    Returns
    -------
    endog : array-like
        The response variable values
    exog : array-like
        The matrix of covariate values
    """
    p = len(params)
    exog = np.random.normal(size=(n, p))
    exog[:, 0] = 1.
    
    if p >= 3: # Create some collinearity
        r = 0.8
        exog[:, 2] = r*exog[:, 1] + np.sqrt(1 - r**2)*exog[:, 2]
    
    linpred = np.dot(exog, params)
    e_endog = link_inv(linpred)

    alpha = 1 / float(scale)
    beta = e_endog * scale
    endog = np.random.gamma(alpha, beta, size=n)
    
    return endog, exog

Next we set up the GLM family, changing the link from the default (the reciprocal link) to the log link.

In [3]:
fam = family.Gamma()
fam.link = links.log()
print fam.link
<statsmodels.genmod.families.links.log object at 0x7f50a208ee90>

Now we run the simulations.

In [4]:
params = np.r_[1, -0.1, 0.1, 0.2]

nrep = 200
all_params = []
all_se = []
all_scales = []
scale = 2
n = 1000

for k in range(nrep):
    endog, exog = gendat(n, params, scale, np.exp)
    mod = sm.GLM(endog, exog, family=fam)
    rslt = mod.fit()
    all_params.append(rslt.params)
    all_se.append(rslt.bse)
    all_scales.append(rslt.scale)
    
all_params = np.asarray(all_params)
all_se = np.asarray(all_se)
all_scales = np.asarray(all_scales)

First we check that the average of the parameter estimates over the replicates is approximately equal to the true parameter values (i.e. that the estimates are approximately unbiased).

In [5]:
print all_params.mean(0)
print params
[ 0.99803569 -0.10918289  0.10599923  0.19678639]
[ 1.  -0.1  0.1  0.2]
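One way to judge whether these deviations from the truth are consistent with sampling noise is the Monte Carlo standard error of each averaged estimate, the standard deviation over replicates divided by sqrt(nrep). A self-contained numpy sketch of this check, using simulated stand-in estimates rather than the actual replicates above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the (nrep, p) array of per-replicate estimates,
# simulated as unbiased noise around the true parameters.
truth = np.r_[1, -0.1, 0.1, 0.2]
nrep = 200
sim_params = truth + 0.05 * rng.standard_normal((nrep, truth.size))

mc_se = sim_params.std(0) / np.sqrt(nrep)  # Monte Carlo SE of each mean
z = (sim_params.mean(0) - truth) / mc_se   # roughly N(0, 1) under unbiasedness
print(z)
```

Z-scores of magnitude much larger than 2 or 3 would suggest real bias rather than simulation noise.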

Next we check that the estimated standard errors (averaged over the replicates) are approximately equal to the empirical standard deviations of the parameter estimates over the replicates.

In [6]:
print all_se.mean(0)
print all_params.std(0)
[ 0.04455334  0.07439816  0.07437255  0.04468223]
[ 0.04398817  0.07955131  0.07623943  0.0453856 ]

Finally we check that the average of the estimated scale parameters is close to the true value (2):

In [7]:
print all_scales.mean()
1.98214979127

The canonical link for the gamma GLM is the reciprocal link function. Proceeding as above, we can generate data from this distribution.
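Note that the reciprocal link requires a positive linear predictor, since the mean of a gamma variate must be positive. A quick numpy check, assuming the settings used below (intercept 2, small slopes, standard normal covariates):

```python
import numpy as np

rng = np.random.default_rng(2)

params = np.r_[2, -0.1, 0.1, 0.2]
exog = rng.standard_normal((1000, 4))
exog[:, 0] = 1.0

linpred = exog @ params
e_endog = 1 / linpred  # inverse of the reciprocal link

# With a large intercept and small slopes, the linear predictor
# stays safely away from zero.
print(linpred.min())
```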

Here we use the gamma GLM family with the default link.

In [8]:
fam = family.Gamma()
print fam.link
<statsmodels.genmod.families.links.inverse_power object at 0x7f50a208ebd0>

Run the simulations.

In [9]:
params = np.r_[2, -0.1, 0.1, 0.2]

nrep = 200
all_params = []
all_se = []
scale = 2
n = 1000

for k in range(nrep):
    endog, exog = gendat(n, params, scale, lambda x: 1/x)
    mod = sm.GLM(endog, exog, family=fam)
    rslt = mod.fit()
    all_params.append(rslt.params)
    all_se.append(rslt.bse)
    
all_params = np.asarray(all_params)
all_se = np.asarray(all_se)

First we check the parameter estimates:

In [10]:
print all_params.mean(0)
print params
[ 2.01383018 -0.06799102  0.08238276  0.18991785]
[ 2.  -0.1  0.1  0.2]

Next we check the standard errors:

In [11]:
print all_se.mean(0)
print all_params.std(0)
[ 0.09045778  0.14616257  0.14611313  0.08623909]
[ 0.09285009  0.15691908  0.15388371  0.08619842]

Ocean chlorophyll data

The data are available from the R gamair package.

The primary source is:

http://oceancolor.gsfc.nasa.gov/

In [12]:
import pandas as pd

data = pd.read_csv("chl.csv")
data = data.rename(columns={"chl.sw": "chl_sw", "jul.day": "jul_day"})
print data.columns

plt.hist(np.asarray(data["chl_sw"]))
plt.xlabel("Chlorophyll", size=15)
_ = plt.ylabel("Frequency", size=15)
Index([u'Unnamed: 0', u'lon', u'lat', u'jul_day', u'bath', u'chl', u'chl_sw'], dtype='object')
In [13]:
for vn in "lat", "lon", "jul_day":
    plt.figure()
    plt.plot(data[vn], data["chl_sw"], 'o')
    plt.ylabel("Chlorophyll", size=15)
    plt.xlabel(vn, size=15)

Here we fit a Gamma GLM with the log link function. We use splines to capture the nonlinearities in the main effects.

In [14]:
fam = family.Gamma()
fam.link = links.log()

mod1 = sm.GLM.from_formula("chl_sw ~ bs(lon, df=8) + bs(lat, df=8) + bs(jul_day, df=8)", family=fam, data=data)
rslt1 = mod1.fit()
print rslt1.summary()
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                 chl_sw   No. Observations:                13840
Model:                            GLM   Df Residuals:                    13815
Model Family:                   Gamma   Df Model:                           24
Link Function:                    log   Scale:                  0.528456762718
Method:                          IRLS   Log-Likelihood:                   -inf
Date:                Mon, 25 Aug 2014   Deviance:                       22594.
Time:                        04:56:40   Pearson chi2:                 7.30e+03
No. Iterations:                    31                                         
========================================================================================
                           coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept                0.2178      0.200      1.089      0.276        -0.174     0.610
bs(lon, df=8)[0]         1.5175      0.256      5.931      0.000         1.016     2.019
bs(lon, df=8)[1]        -2.3556      0.149    -15.835      0.000        -2.647    -2.064
bs(lon, df=8)[2]        -3.0975      0.182    -17.049      0.000        -3.454    -2.741
bs(lon, df=8)[3]        -3.3736      0.160    -21.026      0.000        -3.688    -3.059
bs(lon, df=8)[4]        -2.8571      0.160    -17.810      0.000        -3.172    -2.543
bs(lon, df=8)[5]        -0.5748      0.170     -3.386      0.001        -0.907    -0.242
bs(lon, df=8)[6]        -2.5091      0.285     -8.803      0.000        -3.068    -1.950
bs(lon, df=8)[7]         0.6984      0.314      2.223      0.026         0.083     1.314
bs(lat, df=8)[0]         0.5914      0.197      3.009      0.003         0.206     0.977
bs(lat, df=8)[1]         1.0401      0.079     13.240      0.000         0.886     1.194
bs(lat, df=8)[2]         2.6767      0.098     27.257      0.000         2.484     2.869
bs(lat, df=8)[3]         2.4291      0.084     28.870      0.000         2.264     2.594
bs(lat, df=8)[4]         2.7516      0.092     29.867      0.000         2.571     2.932
bs(lat, df=8)[5]         3.0222      0.103     29.230      0.000         2.820     3.225
bs(lat, df=8)[6]         0.2502      0.150      1.666      0.096        -0.044     0.545
bs(lat, df=8)[7]         2.5293      0.124     20.355      0.000         2.286     2.773
bs(jul_day, df=8)[0]    -0.5171      0.126     -4.110      0.000        -0.764    -0.270
bs(jul_day, df=8)[1]     0.5592      0.072      7.799      0.000         0.419     0.700
bs(jul_day, df=8)[2]     0.2040      0.084      2.421      0.015         0.039     0.369
bs(jul_day, df=8)[3]     0.2827      0.071      3.967      0.000         0.143     0.422
bs(jul_day, df=8)[4]    -0.4265      0.080     -5.356      0.000        -0.583    -0.270
bs(jul_day, df=8)[5]     1.5566      0.087     17.830      0.000         1.385     1.728
bs(jul_day, df=8)[6]    -1.3498      0.106    -12.774      0.000        -1.557    -1.143
bs(jul_day, df=8)[7]     0.2744      0.098      2.806      0.005         0.083     0.466
========================================================================================

We can plot the fitted values for each day of the year at a given location. We choose 18E, 57N, which based on the scatterplots above should have fairly high chlorophyll levels. This point lies in the Baltic Sea, as shown in the following map:

https://www.google.com/maps/place/57%C2%B000'00.0%22N+18%C2%B000'00.0%22E/@56.9994919,18,3124503m/data=!3m1!1e3!4m2!3m1!1s0x0:0x0

In [15]:
mat = np.zeros((100, 3), dtype=np.float64)
mat[:, 2] = np.linspace(1, 365, 100) 
mat[:, 1] = 18
mat[:, 0] = 57
df = pd.DataFrame(mat, columns=("lat", "lon", "jul_day"))
e_endog1 = rslt1.predict(exog=df)

Here is the fitted annual trend:

In [16]:
plt.plot(mat[:, 2], e_endog1, '-', lw=4)
plt.xlabel("Day", size=15)
plt.ylabel("Chlorophyll", size=15)
Out[16]:
<matplotlib.text.Text at 0x7f50a1daa6d0>

In a gamma GLM, the variance is proportional to the square of the mean, with the scale parameter as the constant of proportionality. Since the estimated scale here is about 0.53, under this model the variance is around half of the squared mean. Here is a plot of the variance against the mean.
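This mean/variance relationship can also be checked directly by simulation. A numpy sketch with hypothetical mean values, and the scale fixed near the fitted value of roughly 0.53:

```python
import numpy as np

rng = np.random.default_rng(3)

scale = 0.53
means = [1.0, 2.0, 4.0]

# For each mean, the sample variance divided by the squared mean
# should recover the scale parameter.
ratios = [rng.gamma(1 / scale, mu * scale, size=500_000).var() / mu**2
          for mu in means]
print(ratios)
```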

In [17]:
alpha = 1 / rslt1.scale
beta = e_endog1 * rslt1.scale

plt.plot(alpha*beta, alpha*beta**2, '-')
plt.xlabel("Mean", size=15)
plt.ylabel("Variance", size=15)
Out[17]:
<matplotlib.text.Text at 0x7f50a2037e90>

Here is a plot that shows the mean and the variance as functions of the date.

In [18]:
plt.plot(mat[:, 2], e_endog1, '-', lw=4, label="Mean", color='purple')
plt.plot(mat[:, 2], alpha*beta**2, '-', lw=4, label="Variance", color='orange')
leg = plt.legend()
leg.draw_frame(False)
plt.xlabel("Day", size=15)
_ = plt.ylabel("Chlorophyll", size=15)

We can use a linear model to explore the relationship between conditional mean and conditional variance without fixing the relationship in the model. The plot of residuals on fitted values strongly suggests that the variance increases with the mean.

In [19]:
mod2 = sm.OLS.from_formula("chl_sw ~ bs(lon, df=8) + bs(lat, df=8) + bs(jul_day, df=8)", data=data)
rslt2 = mod2.fit()
plt.plot(rslt2.fittedvalues, rslt2.resid, 'o', alpha=0.6)
plt.xlabel("Fitted value", size=15)
plt.ylabel("Residuals", size=15)
Out[19]:
<matplotlib.text.Text at 0x7f50a1e25a10>

We can directly estimate the conditional variance by smoothing the squared residuals against the fitted mean. The relationship is roughly parabolic, although the vertex of the parabola appears slightly shifted to the right of the origin. The points with negative fitted mean can be ignored, since the linear model is not constrained to produce positive fitted values.

In [20]:
from statsmodels.nonparametric.smoothers_lowess import lowess

fit = lowess(rslt2.resid**2, rslt2.fittedvalues)

plt.plot(fit[:, 0], fit[:, 1], '-', lw=4)
plt.xlabel("Fitted value", size=15)
plt.ylabel("Squared residual", size=15)
Out[20]:
<matplotlib.text.Text at 0x7f50a1e74310>
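The smoothing step above can also be approximated without lowess by binning the fitted values and computing the residual variance within each bin. A synthetic numpy sketch (not the chlorophyll data) with variance proportional to the squared mean:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic fitted values and residuals with variance 0.5 * fitted**2
fitted = np.linspace(0.5, 5, 2000)
resid = np.sqrt(0.5) * fitted * rng.standard_normal(fitted.size)

# Bin the fitted values and compute the within-bin residual variance
bins = np.linspace(fitted.min(), fitted.max(), 11)
idx = np.digitize(fitted, bins[1:-1])
centers = 0.5 * (bins[:-1] + bins[1:])
binned_var = np.array([resid[idx == j].var() for j in range(10)])

# binned_var should track 0.5 * centers**2
print(np.c_[centers, binned_var])
```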