# StatsModels

http://statsmodels.sourceforge.net

StatsModels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive set of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator. Researchers across fields may find that StatsModels fully meets their needs for statistical computing and data analysis in Python. Features include:

• Linear regression models
• Generalized linear models
• Kernel density estimation
• Analysis of Variance (ANOVA)
• A wide range of statistical tests
• A collection of datasets for examples
• Input-output tools for producing tables in a number of formats (text, LaTeX, HTML)
• Plotting functions
In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(9876789) # for replicability


## Ordinary Least Squares

### OLS estimation

Suppose the data consists of $n$ observations $\{\mathbf{x}_i, y_i\}_{i=1}^n$. Each observation includes a scalar response $y_i$ and a vector of $p$ predictors (or regressors) $\mathbf{x}_i$. In a linear regression model the response variable is a linear function of the regressors:

$y_i = \mathbf{x}_i^T \beta + \varepsilon_i,$

where $\beta$ is a $p \times 1$ vector of unknown parameters; the $\varepsilon_i$ are unobserved scalar random variables (errors) which account for the discrepancy between the actually observed responses $y_i$ and the "predicted outcomes" $\mathbf{x}_i^T \beta$; and $^T$ denotes the transpose, so that $\mathbf{x}_i^T \beta$ is the dot product between the vectors $\mathbf{x}_i$ and $\beta$. This model can also be written in matrix notation as

$\mathbf{y} = \mathbf{X}\beta + \boldsymbol{\varepsilon},$

where $\mathbf{y}$ and $\boldsymbol{\varepsilon}$ are $n \times 1$ vectors, and $\mathbf{X}$ is an $n \times p$ matrix of regressors, which is also sometimes called the design matrix.

As a rule, the constant term is always included in the set of regressors, say, by taking $x_{i,1} = 1$ for all $i = 1, \ldots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the intercept.
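
The minimizer of the sum of squared residuals has the well-known closed form $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ (the solution of the normal equations). Before reaching for statsmodels, here is a minimal NumPy sketch of that formula on synthetic data (all names and values here are invented for illustration), checked against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form OLS: solve the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The numerically preferred solver gives the same answer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

In practice one uses `lstsq` (or statsmodels, below) rather than inverting $\mathbf{X}^T \mathbf{X}$, which can be numerically ill-conditioned.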

In [2]:
# Make some artificial data
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))
X = sm.add_constant(X) # constant term to accommodate the intercept
X[:10]

Out[2]:
array([[ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.1010101 ,  0.01020304],
       [ 1.        ,  0.2020202 ,  0.04081216],
       [ 1.        ,  0.3030303 ,  0.09182736],
       [ 1.        ,  0.4040404 ,  0.16324865],
       [ 1.        ,  0.50505051,  0.25507601],
       [ 1.        ,  0.60606061,  0.36730946],
       [ 1.        ,  0.70707071,  0.49994898],
       [ 1.        ,  0.80808081,  0.65299459],
       [ 1.        ,  0.90909091,  0.82644628]])
In [3]:
beta = np.array([1, 0.1, 10])
y_true = np.dot(X, beta)
e = np.random.normal(size=nsample)
y = y_true + e
# graph this
plt.subplots(figsize=(8,6))
plt.plot(y_true[:10]); plt.plot(y[:10], '.')

Out[3]:
[<matplotlib.lines.Line2D at 0x10c287fd0>]
In [4]:
model = sm.OLS(y, X)
results = model.fit()
results.summary()

Out[4]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.020e+06
Date:                Fri, 21 Nov 2014   Prob (F-statistic):          2.83e-239
Time:                        18:35:47   Log-Likelihood:                -146.51
No. Observations:                 100   AIC:                             299.0
Df Residuals:                      97   BIC:                             306.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.3423      0.313      4.292      0.000         0.722     1.963
x1            -0.0402      0.145     -0.278      0.781        -0.327     0.247
x2            10.0103      0.014    715.745      0.000         9.982    10.038
==============================================================================
Omnibus:                        2.042   Durbin-Watson:                   2.274
Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.875
Skew:                           0.234   Prob(JB):                        0.392
Kurtosis:                       2.519   Cond. No.                         144.
==============================================================================
In [5]:
results.params # compare to beta = np.array([1, 0.1, 10])

Out[5]:
array([  1.34233516,  -0.04024948,  10.01025357])
In [6]:
results.rsquared

Out[6]:
0.9999879365025871

### OLS non-linear curve but linear in parameters

In [7]:
nsample = 50
sig = 0.5
x = np.linspace(0, 20, nsample)
X = np.column_stack((x, np.sin(x), (x-5)**2))
X = sm.add_constant(X) # constant term to accommodate the intercept
beta = [5., 0.5, 0.5, -0.02]

y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
res = sm.OLS(y, X).fit()
res.summary()

Out[7]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.933
Model:                            OLS   Adj. R-squared:                  0.928
Method:                 Least Squares   F-statistic:                     211.8
Date:                Fri, 21 Nov 2014   Prob (F-statistic):           6.30e-27
Time:                        18:35:47   Log-Likelihood:                -34.438
No. Observations:                  50   AIC:                             76.88
Df Residuals:                      46   BIC:                             84.52
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          5.2058      0.171     30.405      0.000         4.861     5.550
x1             0.4687      0.026     17.751      0.000         0.416     0.522
x2             0.4836      0.104      4.659      0.000         0.275     0.693
x3            -0.0174      0.002     -7.507      0.000        -0.022    -0.013
==============================================================================
Omnibus:                        0.655   Durbin-Watson:                   2.896
Prob(Omnibus):                  0.721   Jarque-Bera (JB):                0.360
Skew:                           0.207   Prob(JB):                        0.835
Kurtosis:                       3.026   Cond. No.                         221.
==============================================================================
In [8]:
from statsmodels.sandbox.regression.predstd import wls_prediction_std
prstd, iv_l, iv_u = wls_prediction_std(res) # prediction standard errors and interval bounds

fig, ax = plt.subplots(figsize=(8,6))

ax.plot(x, y, 'bo', label="data")
ax.plot(x, y_true, 'b-', label="True")
ax.plot(x, res.fittedvalues, 'ro', label="OLS")
ax.plot(x, iv_u, 'r--')
ax.plot(x, iv_l, 'r--')
ax.legend(loc='best');


## Kernel Density Estimation

In [9]:
from scipy import stats
from statsmodels.distributions.mixture_rvs import mixture_rvs
np.random.seed(12345)

In [10]:
obs_dist1 = mixture_rvs([.25, .75], size=10000, dist=[stats.norm, stats.norm],
                        kwargs=(dict(loc=-1, scale=.5), dict(loc=1, scale=.5)))
kde = sm.nonparametric.KDEUnivariate(obs_dist1)
kde.fit()

In [11]:
plt.figure(figsize=(12,8))
plt.hist(obs_dist1, bins=50, density=True, color="red") # density replaces the removed normed argument
plt.plot(kde.support, kde.density, lw=2, color="black")

Out[11]:
[<matplotlib.lines.Line2D at 0x10c13a250>]
In [12]:
obs_dist2 = mixture_rvs([.25, .75], size=10000, dist=[stats.norm, stats.beta],
                        kwargs=(dict(loc=-1, scale=.5), dict(loc=1, scale=1, args=(1, .5))))
kde2 = sm.nonparametric.KDEUnivariate(obs_dist2)
kde2.fit()

In [13]:
plt.figure(figsize=(12,8))
plt.hist(obs_dist2, bins=50, density=True, color="red") # density replaces the removed normed argument
plt.plot(kde2.support, kde2.density, lw=2, color="black")

Out[13]:
[<matplotlib.lines.Line2D at 0x10c564c50>]
In [14]:
kde2.entropy

Out[14]:
0.8622614485205428
In [15]:
kde2.evaluate(1.2)

Out[15]:
array([ 0.41632476])

## Graphics

In [16]:
data = sm.datasets.longley.load()
mod_fit = sm.OLS(data.endog, data.exog).fit()
res = mod_fit.resid # residuals
fig = sm.qqplot(res)

In [17]:
hie_data = sm.datasets.randhie.load_pandas()
corr_matrix = np.corrcoef(hie_data.data.T)
sm.graphics.plot_corr(corr_matrix, xnames=hie_data.names);

In [18]:
from statsmodels.graphics.mosaicplot import mosaic
data = {"a": 10, "b": 25, "c": 16}
mosaic(data, title="basic dictionary")

Out[18]:
(<matplotlib.figure.Figure at 0x10cbac350>,
OrderedDict([(('a',), (0.0, 0.0, 0.1941370607649, 1.0)), (('c',), (0.19908755581440496, 0.0, 0.31061929722384002, 1.0)), (('b',), (0.51465734808774988, 0.0, 0.48534265191225007, 1.0))]))
In [19]:
gender = ["male", "male", "male", "female", "female", "female"]
pet = ["cat", "dog", "dog", "cat", "dog", "cat"]
data = pd.DataFrame({"gender": gender, "pet": pet})
mosaic(data, ["pet", "gender"])

Out[19]:
(<matplotlib.figure.Figure at 0x10d5e2250>,
OrderedDict([(('cat', 'male'), (0.0, 0.0, 0.49751243781094534, 0.33222591362126241)), (('cat', 'female'), (0.0, 0.33554817275747506, 0.49751243781094534, 0.66445182724252494)), (('dog', 'male'), (0.5024875621890548, 0.0, 0.49751243781094534, 0.66445182724252483)), (('dog', 'female'), (0.5024875621890548, 0.66777408637873747, 0.49751243781094534, 0.33222591362126247))]))

## ANOVA

Analysis of Variance models

In [20]:
from statsmodels.formula.api import ols # supports R-style formulas

In [21]:
moore = sm.datasets.get_rdataset("Moore", "car", cache=True) # load data
data = moore.data
data = data.rename(columns={"partner.status" : "partner_status"}) # make name pythonic
data.loc[:10]

Out[21]:
partner_status conformity fcategory fscore
0 low 8 low 37
1 low 4 high 57
2 low 8 high 65
3 low 7 low 20
4 low 10 low 36
5 low 6 low 18
6 low 12 medium 51
7 low 4 medium 44
8 low 13 low 31
9 low 12 low 36
10 low 4 medium 42
In [22]:
moore_lm = ols("conformity ~ C(fcategory, Sum) * C(partner_status, Sum)", data=data).fit() # using R-style formula
sm.stats.anova_lm(moore_lm, typ=2) # Type 2 ANOVA DataFrame

Out[22]:
sum_sq df F PR(>F)
C(fcategory, Sum) 11.614700 2 0.276958 0.759564
C(partner_status, Sum) 212.213778 1 10.120692 0.002874
C(fcategory, Sum):C(partner_status, Sum) 175.488928 2 4.184623 0.022572
Residual 817.763961 39 NaN NaN

## Statistics

Basic functionality is already provided by scipy:

In [23]:
from scipy import stats

In [24]:
x = np.random.randn(1000)
(stats.skew(x), stats.kurtosis(x), stats.moment(x, 5))

Out[24]:
(-0.062791792404477, -0.14708239201514184, -0.49693063365391255)
In [25]:
(stats.skewtest(x), stats.kurtosistest(x), stats.normaltest(x))

Out[25]:
((-0.81582646992721908, 0.41459939171158222),
(-0.92010750059579682, 0.3575165851406058),
(1.5121706416865519, 0.46950077913145671))

This is a two-sided test of the null hypothesis that the expected value (mean) of a sample of independent observations is equal to the given population mean (here, 0).

In [26]:
stats.ttest_1samp(x, 0)

Out[26]:
(0.28456485903503731, 0.77603651233845949)
In [27]:
stats.wilcoxon(x)

Out[27]:
(246097.0, 0.64939888950280666)
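
scipy also covers two-sample comparisons. A quick sketch contrasting the parametric t-test with a nonparametric counterpart on synthetic samples with genuinely different means (data invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
a = rng.normal(loc=0.0, size=200)
b = rng.normal(loc=0.5, size=200)

# Two-sample t-test: null hypothesis of equal means
t, p = stats.ttest_ind(a, b)
print(t, p)

# Mann-Whitney U: rank-based alternative without the normality assumption
u, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")
print(u, p_u)
```

With a true mean shift of 0.5 and 200 observations per group, both tests reject the null at any conventional level.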

Many more sophisticated functions are provided by statsmodels.stats: http://statsmodels.sourceforge.net/stable/stats.html