StatsModels

http://statsmodels.sourceforge.net

StatsModels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive set of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and for each estimator. Researchers across fields may find that StatsModels fully meets their needs for statistical computing and data analysis in Python. Features include:

  • Linear regression models
  • Generalized linear models
  • Kernel density estimation
  • Analysis of Variance (ANOVA)
  • A wide range of statistical tests
  • A collection of datasets for examples
  • Input-output tools for producing tables in a number of formats (text, LaTeX, HTML)
  • Plotting functions
In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(9876789) # for reproducibility

Ordinary Least Squares

OLS estimation

Suppose the data consists of $n$ observations $\{\mathbf{x}_i, y_i\}_{i=1}^n$. Each observation includes a scalar response $y_i$ and a vector of $p$ predictors (or regressors) $\mathbf{x}_i$. In a linear regression model the response variable is a linear function of the regressors plus an error term:

$y_i = \mathbf{x}_i^T \beta + \varepsilon_i,$

where $\beta$ is a $p×1$ vector of unknown parameters; the $\varepsilon_i$ are unobserved scalar random variables (errors) which account for the discrepancy between the actually observed responses $y_i$ and the "predicted outcomes" $\mathbf{x}_i^T \beta$; and $^T$ denotes the transpose, so that $\mathbf{x}_i^T \beta$ is the dot product between the vectors $\mathbf{x}_i$ and $\beta$. This model can also be written in matrix notation as

$\mathbf{y} = \mathbf{X}\beta + \mathbf\varepsilon,$

where $\mathbf{y}$ and $\mathbf\varepsilon$ are $n×1$ vectors, and $\mathbf{X}$ is an $n×p$ matrix of regressors, which is also sometimes called the design matrix.

As a rule, a constant term is included in the set of regressors, say, by taking $x_{i,1} = 1$ for all $i = 1, \ldots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the intercept.
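The OLS estimator chooses $\beta$ to minimize the sum of squared residuals $\sum_i (y_i - \mathbf{x}_i^T \beta)^2$, which for a full-rank design matrix has the closed form $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. As a minimal sketch (using the imports from In [1] and a small made-up design matrix, separate from the example below), the same estimate can be computed directly with NumPy and checked against sm.OLS:

rng = np.random.RandomState(0)              # made-up illustrative data
x_demo = rng.uniform(0, 10, size=20)
X_demo = sm.add_constant(x_demo)            # intercept column plus one regressor
y_demo = 2.0 + 3.0 * x_demo + rng.normal(size=20)

beta_hat, _, _, _ = np.linalg.lstsq(X_demo, y_demo, rcond=None)  # least-squares solution
print(beta_hat)                             # close to (2, 3)
print(sm.OLS(y_demo, X_demo).fit().params)  # same estimates from statsmodels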

In [2]:
# Make some artificial data
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))
X = sm.add_constant(X) # constant term to accommodate the intercept
X[:10]
Out[2]:
array([[ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.1010101 ,  0.01020304],
       [ 1.        ,  0.2020202 ,  0.04081216],
       [ 1.        ,  0.3030303 ,  0.09182736],
       [ 1.        ,  0.4040404 ,  0.16324865],
       [ 1.        ,  0.50505051,  0.25507601],
       [ 1.        ,  0.60606061,  0.36730946],
       [ 1.        ,  0.70707071,  0.49994898],
       [ 1.        ,  0.80808081,  0.65299459],
       [ 1.        ,  0.90909091,  0.82644628]])
In [3]:
beta = np.array([1, 0.1, 10])
y_true = np.dot(X, beta)
e = np.random.normal(size=nsample)
y = y_true + e
# quick look: plot the first ten true values and noisy observations against their index
plt.figure(figsize=(8,6))
plt.plot(y_true[:10]); plt.plot(y[:10], '.')
Out[3]:
[<matplotlib.lines.Line2D at 0x10c287fd0>]
In [4]:
model = sm.OLS(y, X)
results = model.fit()
results.summary()
Out[4]:
OLS Regression Results
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.020e+06
Date: Fri, 21 Nov 2014 Prob (F-statistic): 2.83e-239
Time: 18:35:47 Log-Likelihood: -146.51
No. Observations: 100 AIC: 299.0
Df Residuals: 97 BIC: 306.8
Df Model: 2
Covariance Type: nonrobust
            coef   std err         t    P>|t|   [95.0% Conf. Int.]
const     1.3423     0.313     4.292    0.000     0.722     1.963
x1       -0.0402     0.145    -0.278    0.781    -0.327     0.247
x2       10.0103     0.014   715.745    0.000     9.982    10.038
Omnibus: 2.042 Durbin-Watson: 2.274
Prob(Omnibus): 0.360 Jarque-Bera (JB): 1.875
Skew: 0.234 Prob(JB): 0.392
Kurtosis: 2.519 Cond. No. 144.
In [5]:
results.params # compare to beta = np.array([1, 0.1, 10])
Out[5]:
array([  1.34233516,  -0.04024948,  10.01025357])
In [6]:
results.rsquared
Out[6]:
0.9999879365025871
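Beyond the point estimates and $R^2$, the fitted results object exposes the other quantities shown in the summary table. A short sketch, reusing the results object from In [4] (conf_int, bse, and predict are standard methods/attributes of OLS results):

results.conf_int()       # 95% confidence intervals for const, x1, x2
results.bse              # standard errors of the coefficient estimates
results.predict(X)[:5]   # model evaluated on the first five rows of the design matrix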

OLS non-linear curve but linear in parameters

In [7]:
nsample = 50
sig = 0.5
x = np.linspace(0, 20, nsample)
X = np.column_stack((x, np.sin(x), (x-5)**2))
X = sm.add_constant(X) # constant term to accommodate the intercept
beta = [5., 0.5, 0.5, -0.02]

y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
res = sm.OLS(y, X).fit()
res.summary()
Out[7]:
OLS Regression Results
Dep. Variable: y R-squared: 0.933
Model: OLS Adj. R-squared: 0.928
Method: Least Squares F-statistic: 211.8
Date: Fri, 21 Nov 2014 Prob (F-statistic): 6.30e-27
Time: 18:35:47 Log-Likelihood: -34.438
No. Observations: 50 AIC: 76.88
Df Residuals: 46 BIC: 84.52
Df Model: 3
Covariance Type: nonrobust
            coef   std err         t    P>|t|   [95.0% Conf. Int.]
const     5.2058     0.171    30.405    0.000     4.861     5.550
x1        0.4687     0.026    17.751    0.000     0.416     0.522
x2        0.4836     0.104     4.659    0.000     0.275     0.693
x3       -0.0174     0.002    -7.507    0.000    -0.022    -0.013
Omnibus: 0.655 Durbin-Watson: 2.896
Prob(Omnibus): 0.721 Jarque-Bera (JB): 0.360
Skew: 0.207 Prob(JB): 0.835
Kurtosis: 3.026 Cond. No. 221.
In [8]:
from statsmodels.sandbox.regression.predstd import wls_prediction_std
prstd, iv_l, iv_u = wls_prediction_std(res) # prediction standard errors and lower/upper prediction-interval bounds

fig, ax = plt.subplots(figsize=(8,6))

ax.plot(x, y, 'bo', label="data")
ax.plot(x, y_true, 'b-', label="True")
ax.plot(x, res.fittedvalues, 'ro', label="OLS")
ax.plot(x, iv_u, 'r--')
ax.plot(x, iv_l, 'r--')
ax.legend(loc='best');
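The sandbox function used above is the original notebook's approach; in newer statsmodels releases the fitted results object offers get_prediction, which (assuming such a version is installed) should return the same bounds without the sandbox import:

pred = res.get_prediction()
frame = pred.summary_frame(alpha=0.05)            # per-observation interval bounds
frame[["obs_ci_lower", "obs_ci_upper"]].head()    # should match iv_l / iv_u above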

Kernel Density Estimation
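A kernel density estimator approximates an unknown density from a sample $x_1, \ldots, x_n$ as $\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$, where $K$ is a kernel function (Gaussian by default in KDEUnivariate) and $h$ is the bandwidth controlling the amount of smoothing.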

In [9]:
from scipy import stats
from statsmodels.distributions.mixture_rvs import mixture_rvs
np.random.seed(12345)
In [10]:
obs_dist1 = mixture_rvs([.25, .75], size=10000, dist=[stats.norm, stats.norm],
                        kwargs=(dict(loc=-1, scale=.5), dict(loc=1, scale=.5)))
kde = sm.nonparametric.KDEUnivariate(obs_dist1)
kde.fit()
In [11]:
plt.figure(figsize=(12,8))
plt.hist(obs_dist1, bins=50, density=True, color="red") # normalized histogram, comparable to the density estimate
plt.plot(kde.support, kde.density, lw=2, color="black")
Out[11]:
[<matplotlib.lines.Line2D at 0x10c13a250>]
In [12]:
obs_dist2 = mixture_rvs([.25, .75], size=10000, dist=[stats.norm, stats.beta],
                        kwargs=(dict(loc=-1, scale=.5), dict(loc=1, scale=1, args=(1, .5))))
kde2 = sm.nonparametric.KDEUnivariate(obs_dist2)
kde2.fit()
In [13]:
plt.figure(figsize=(12,8))
plt.hist(obs_dist2, bins=50, density=True, color="red") # normalized histogram, comparable to the density estimate
plt.plot(kde2.support, kde2.density, lw=2, color="black")
Out[13]:
[<matplotlib.lines.Line2D at 0x10c564c50>]
In [14]:
kde2.entropy
Out[14]:
0.8622614485205428
In [15]:
kde2.evaluate(1.2)
Out[15]:
array([ 0.41632476])
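fit() used the default Gaussian kernel and a rule-of-thumb bandwidth. As a sketch of how the amount of smoothing can be controlled via the bw argument (the values here are illustrative, not tuned):

kde_narrow = sm.nonparametric.KDEUnivariate(obs_dist2)
kde_narrow.fit(bw=0.1)                 # small bandwidth: wiggly, closely follows the sample
kde_wide = sm.nonparametric.KDEUnivariate(obs_dist2)
kde_wide.fit(bw="silverman")           # Silverman rule-of-thumb bandwidth
plt.plot(kde_narrow.support, kde_narrow.density, label="bw=0.1")
plt.plot(kde_wide.support, kde_wide.density, label="silverman")
plt.legend()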

Graphics

In [16]:
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
mod_fit = sm.OLS(data.endog, data.exog).fit()
res = mod_fit.resid # residuals
fig = sm.qqplot(res)
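qqplot also accepts a line argument that overlays a reference line, which makes deviations from normality easier to judge; for example:

fig = sm.qqplot(res, line="q")   # reference line fit through the quartiles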
In [17]:
hie_data = sm.datasets.randhie.load_pandas()
corr_matrix = np.corrcoef(hie_data.data.T)
sm.graphics.plot_corr(corr_matrix, xnames=hie_data.names);
In [18]:
from statsmodels.graphics.mosaicplot import mosaic
data = {"a": 10, "b": 25, "c": 16}
mosaic(data, title="basic dictionary")
Out[18]:
(<matplotlib.figure.Figure at 0x10cbac350>,
 OrderedDict([(('a',), (0.0, 0.0, 0.1941370607649, 1.0)), (('c',), (0.19908755581440496, 0.0, 0.31061929722384002, 1.0)), (('b',), (0.51465734808774988, 0.0, 0.48534265191225007, 1.0))]))
In [19]:
gender = ["male", "male", "male", "female", "female", "female"]
pet = ["cat", "dog", "dog", "cat", "dog", "cat"]
data = pd.DataFrame({"gender": gender, "pet": pet})
mosaic(data, ["pet", "gender"])
Out[19]:
(<matplotlib.figure.Figure at 0x10d5e2250>,
 OrderedDict([(('cat', 'male'), (0.0, 0.0, 0.49751243781094534, 0.33222591362126241)), (('cat', 'female'), (0.0, 0.33554817275747506, 0.49751243781094534, 0.66445182724252494)), (('dog', 'male'), (0.5024875621890548, 0.0, 0.49751243781094534, 0.66445182724252483)), (('dog', 'female'), (0.5024875621890548, 0.66777408637873747, 0.49751243781094534, 0.33222591362126247))]))

ANOVA

Analysis of Variance models

In [20]:
from statsmodels.formula.api import ols # supports R-style formulas
In [21]:
moore = sm.datasets.get_rdataset("Moore", "car", cache=True) # load data
data = moore.data
data = data.rename(columns={"partner.status" : "partner_status"}) # make name pythonic
data.loc[:10] # label-based slicing, so this shows rows 0 through 10 inclusive
Out[21]:
partner_status conformity fcategory fscore
0 low 8 low 37
1 low 4 high 57
2 low 8 high 65
3 low 7 low 20
4 low 10 low 36
5 low 6 low 18
6 low 12 medium 51
7 low 4 medium 44
8 low 13 low 31
9 low 12 low 36
10 low 4 medium 42
In [22]:
moore_lm = ols("conformity ~ C(fcategory, Sum) * C(partner_status, Sum)", data=data).fit() # using R-style formula
sm.stats.anova_lm(moore_lm, typ=2) # Type 2 ANOVA DataFrame
Out[22]:
sum_sq df F PR(>F)
C(fcategory, Sum) 11.614700 2 0.276958 0.759564
C(partner_status, Sum) 212.213778 1 10.120692 0.002874
C(fcategory, Sum):C(partner_status, Sum) 175.488928 2 4.184623 0.022572
Residual 817.763961 39 NaN NaN
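The interaction term is the interesting result here; a quick way to see it is to look at the cell means directly. A minimal pandas sketch (not part of the original notebook) using the Moore DataFrame from In [21]:

cell_means = data.groupby(["fcategory", "partner_status"])["conformity"].mean().unstack()
cell_means                                   # mean conformity for each combination of factors
cell_means.plot(kind="bar", figsize=(8,6));  # unequal differences across levels reflect the interaction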

Statistics

Basic statistical functionality is already provided by scipy.stats:

In [23]:
from scipy import stats
In [24]:
x = np.random.randn(1000)
(stats.skew(x), stats.kurtosis(x), stats.moment(x, 5))
Out[24]:
(-0.062791792404477, -0.14708239201514184, -0.49693063365391255)
In [25]:
(stats.skewtest(x), stats.kurtosistest(x), stats.normaltest(x))
Out[25]:
((-0.81582646992721908, 0.41459939171158222),
 (-0.92010750059579682, 0.3575165851406058),
 (1.5121706416865519, 0.46950077913145671))

This is a two-sided test of the null hypothesis that the expected value (mean) of a sample of independent observations is equal to the given population mean (0 in the call below):

In [26]:
stats.ttest_1samp(x, 0)
Out[26]:
(0.28456485903503731, 0.77603651233845949)
In [27]:
stats.wilcoxon(x)
Out[27]:
(246097.0, 0.64939888950280666)

Many more sophisticated functions are provided by statsmodels.stats: http://statsmodels.sourceforge.net/stable/stats.html
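For instance, DescrStatsW from statsmodels.stats.weightstats bundles descriptive statistics, confidence intervals, and t-tests for an (optionally weighted) sample; a short sketch using the x drawn above:

from statsmodels.stats.weightstats import DescrStatsW
d = DescrStatsW(x)
d.tconfint_mean()   # 95% confidence interval for the mean
d.ttest_mean(0)     # (t statistic, p-value, degrees of freedom) for H0: mean == 0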