# Statistics in Python¶

There are two major modules for doing statistical analyses in Python:

• Scipy - basic statistics and distribution fitting
• Statsmodels - advanced statistical modeling focused on linear models (including ANOVA, multiple regression, generalized linear models, etc.)

To see the full functionality of these modules you'll need to look through their pages, but here are a few examples to show you that a lot of the standard statistical tests and models you need to perform can be easily done using Python.

## Imports¶

You'll want the stats module from Scipy and the api and formula.api modulesfrom statsmodels. We'll also go ahead and import Numpy for use in generating data and Matplotlib for some graphing.

In [1]:
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline


## Descriptive Statistics¶

stats.describe() gives basic descriptive statistics on any list of numbers. It returns (in the following order) the size of the data, it's min and max, the mean, the variance, skewness, and kurtosis.

In [2]:
x = [1, 2, 3, 4, 5]
stats.describe(x)

Out[2]:
(5, (1, 5), 3.0, 2.5, 0.0, -1.3)

We can also get this kind of information for columns in a DataFrame using the .describe() method.

In [3]:
data = pd.DataFrame([[1, 2, 3.5], [2, 2.4, 3.1], [3, 1.8, 2.5]], columns=['a', 'b', 'c'])
print data
data.describe()

   a    b    c
0  1  2.0  3.5
1  2  2.4  3.1
2  3  1.8  2.5

[3 rows x 3 columns]

Out[3]:
a b c
count 3.0 3.000000 3.000000
mean 2.0 2.066667 3.033333
std 1.0 0.305505 0.503322
min 1.0 1.800000 2.500000
25% 1.5 1.900000 2.800000
50% 2.0 2.000000 3.100000
75% 2.5 2.200000 3.300000
max 3.0 2.400000 3.500000

8 rows × 3 columns

## T-tests¶

Standard 1-sample and 2-sample t-tests (as well as many other basic statistical tests) are available in scipy.stats. T-tests return two values, the t-statistic and the p-value. First, let's generate some data that is normally distributed around zero.

In [4]:
x1 = np.random.randn(100, 1)
x2 = np.random.randn(100, 1)


### One-sample t-test¶

To determine if the mean of x1 is different from some number use a one-sample t-test. We'll compare the mean of x1 to both 0 (which it shouldn't be different from) and to 1 (which it should be different from).

In [5]:
tstat, pval = stats.ttest_1samp(x1, 0)
print "Comparison of the mean of x1 to 0.\nT-statistic = %s; P-value = %s." % (tstat, pval)

tstat, pval = stats.ttest_1samp(x1, 1)
print "Comparison of the mean of x1 to 5.\nT-statistic = %s; P-value = %s." % (tstat, pval)

Comparison of the mean of x1 to 0.
T-statistic = [ 1.17526772]; P-value = [ 0.24270646].
Comparison of the mean of x1 to 5.
T-statistic = [-9.46674519]; P-value = [  1.59375523e-15].


### Two-sample t-test¶

To determine if the mean of two different sets of numbers are different from one another wuse a two-sample t-test. We'll compare the means of x1 and x2, which should be different from one another since they should both be about zero.

In [6]:
tstat, pval = stats.ttest_ind(x1, x2)
print "Comparison of the means of x1 and x2.\nT-statistic = %s; P-value = %s." % (tstat, pval)

Comparison of the means of x1 and x2.
T-statistic = [ 1.54942032]; P-value = [ 0.12287759].


## Distribution fitting and analysis¶

scipy.stats also includes a powerful general system for working with statistical distribution. Let's generate some random normally distributed numbers to analyze.

In [7]:
u, sigma = 4, 2
random_numbers = stats.norm.rvs(u, sigma, size=50)
random_numbers

Out[7]:
array([ -2.69438161e-02,   2.92683231e+00,   6.59328450e+00,
5.09170230e+00,   4.99668460e+00,   5.50528668e+00,
7.50782290e+00,   4.63995490e+00,   3.17923549e+00,
7.95144616e+00,   2.19070256e+00,   6.40680973e+00,
4.65721099e+00,   1.77901732e+00,   2.63237191e+00,
4.49198740e+00,   5.56909943e+00,  -1.83542925e-01,
6.42398538e+00,   6.55107469e+00,   3.45866471e+00,
3.66296653e+00,   4.46982893e+00,   1.11183439e-01,
3.14154367e+00,   3.35583976e+00,   4.99528367e+00,
3.56179539e+00,   5.21900454e+00,   2.35695456e+00,
4.38756723e+00,   5.83026447e+00,   1.46130374e+00,
1.21498908e+00,   7.73665200e+00,  -2.39077287e-03,
3.45132547e+00,   4.62012917e+00,   8.59613520e-01,
3.73481950e+00,   7.26835158e+00,   3.26567510e+00,
-6.84996675e-01,   3.19824323e+00,   6.74505956e+00,
3.50754549e+00,   4.60034178e+00,   4.63561359e+00,
6.11115427e+00,   5.12909660e+00])

We can then fit distributions to this data.

In [8]:
stats.norm.fit(random_numbers)

Out[8]:
(4.0057489130411792, 2.1683842014426378)

## Regression¶

You can do simple univariate OLS regression in Scipy, but for anything more complicated you'll need statsmodels, so we'll just do the basics in statsmodels as well.

First, we'll generate some data and plot it.

In [9]:
#generate some data
x = np.array(range(20))
y = 3 + 0.5 * x + 2 * np.random.randn(20)
data = pd.DataFrame(zip(x, y), columns=['x', 'y'])

#plot the data
plt.plot(data['x'], data['y'], 'bo')
plt.show()


Now let's do a regression. We'll use formulas to specify regression models. This allows us to write our code in simple and intuitive ways, and means the we don't have to remember how to create design matrices for more complicated models. To do this we'll need to be using Pandas to manage the data. The regression function is name ols for "ordinary least-squares" the standard approach to regression. It takes two arguments. The first is a formula describing the regression we want to fit. In this case we just want to model the effect of x on y so we use y ~ x, where x and y are the names of columns in a data frame. The second argument is the name of the data frame that contains the columns x and y.

In [10]:
results = smf.ols('y ~ x', data).fit()
print results.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.785
Method:                 Least Squares   F-statistic:                     65.84
Date:                Sat, 04 Oct 2014   Prob (F-statistic):           2.00e-07
Time:                        11:41:12   Log-Likelihood:                -40.471
No. Observations:                  20   AIC:                             84.94
Df Residuals:                      18   BIC:                             86.93
Df Model:                           1
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.2481      0.832      2.704      0.015         0.501     3.995
x              0.6071      0.075      8.114      0.000         0.450     0.764
==============================================================================
Omnibus:                        1.724   Durbin-Watson:                   2.053
Prob(Omnibus):                  0.422   Jarque-Bera (JB):                1.008
Skew:                          -0.549   Prob(JB):                        0.604
Kurtosis:                       2.940   Cond. No.                         21.5
==============================================================================


So, .summary() presents us with most of the information we would want about the regression. We can also pull this information out in individual pieces. For example,

In [11]:
intercept, slope = results.params
r2 = results.rsquared
print slope, intercept, r2

0.607133890043 2.24811810288 0.78529268393


This makes it easy to store the key results of large numbers of analyses, or present the results in alternative ways, like graphs.

In [12]:
plt.plot(data['x'], data['y'], 'bo')
plt.hold(True)
x = np.array([min(x), max(x)])
y = intercept + slope * x
plt.plot(x, y, 'r-')
plt.show()


You'll notice that in order to plot the regression line what we actually do is plot a line with the appropriate slope and intercept by:

1. taking the minimum and maximum values of x
2. calculating their values of y based on the regression results
3. and plotting those two points with a straight line connecting them and no symbols

### Multiple-regression¶

Multiple-regression works the same way, but with additional terms in the formula.

In [13]:
import pandas as pd

#generate some data
x = np.random.randn(50)
z = np.random.randn(50)
noise = np.random.randn(50)
y = 3 + 0.5 * x + 1.5 * z + noise

data = pd.DataFrame(zip(x, y, z), columns=['x', 'y', 'z'])
results = smf.ols('y ~ x + z', data).fit()
print results.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.765
Method:                 Least Squares   F-statistic:                     76.36
Date:                Sat, 04 Oct 2014   Prob (F-statistic):           1.72e-15
Time:                        11:41:12   Log-Likelihood:                -65.717
No. Observations:                  50   AIC:                             137.4
Df Residuals:                      47   BIC:                             143.2
Df Model:                           2
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.8254      0.133     21.294      0.000         2.558     3.092
x              0.5812      0.124      4.699      0.000         0.332     0.830
z              1.4573      0.123     11.868      0.000         1.210     1.704
==============================================================================
Omnibus:                        2.028   Durbin-Watson:                   1.739
Prob(Omnibus):                  0.363   Jarque-Bera (JB):                1.936
Skew:                           0.452   Prob(JB):                        0.380
Kurtosis:                       2.663   Cond. No.                         1.20
==============================================================================


This makes it easy to do more complicated designs, including interactions.

In [14]:
results = smf.ols('y ~ x + z + x*z', data).fit()
print results.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.766
Method:                 Least Squares   F-statistic:                     50.17
Date:                Sat, 04 Oct 2014   Prob (F-statistic):           1.52e-14
Time:                        11:41:12   Log-Likelihood:                -65.585
No. Observations:                  50   AIC:                             139.2
Df Residuals:                      46   BIC:                             146.8
Df Model:                           3
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.8336      0.135     21.021      0.000         2.562     3.105
x              0.5671      0.128      4.432      0.000         0.310     0.825
z              1.4522      0.124     11.690      0.000         1.202     1.702
x:z            0.0541      0.110      0.493      0.624        -0.167     0.275
==============================================================================
Omnibus:                        1.898   Durbin-Watson:                   1.788
Prob(Omnibus):                  0.387   Jarque-Bera (JB):                1.844
Skew:                           0.425   Prob(JB):                        0.398
Kurtosis:                       2.596   Cond. No.                         1.46
==============================================================================


We can also include transformations in formulas using functions from Numpy.

In [15]:
results = smf.ols('y ~ x + np.log10(z)', data).fit()
print results.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.336
Method:                 Least Squares   F-statistic:                     5.825
Date:                Sat, 04 Oct 2014   Prob (F-statistic):            0.00898
Time:                        11:41:12   Log-Likelihood:                -41.444
No. Observations:                  26   AIC:                             88.89
Df Residuals:                      23   BIC:                             92.66
Df Model:                           2
===============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       4.5062      0.261     17.288      0.000         3.967     5.045
x               0.6071      0.232      2.616      0.015         0.127     1.087
np.log10(z)     2.2221      0.786      2.826      0.010         0.596     3.849
==============================================================================
Omnibus:                        2.703   Durbin-Watson:                   1.307
Prob(Omnibus):                  0.259   Jarque-Bera (JB):                1.915
Skew:                           0.664   Prob(JB):                        0.384
Kurtosis:                       2.960   Cond. No.                         3.64
==============================================================================


### ANOVA¶

Using formulas also makes it easy to conduct statistical tests that are based on linear models. ANOVA is simply a linear model with appropriate dummy variables set up for each factor. To do this in statsmodels we simply use C() to tell the module that the variable of interest is categorical.

This time, let's start by grabbing some data from the web.

In [16]:
url = 'http://stats191.stanford.edu/data/rehab.csv'
rehab_table

Out[16]:
Fitness Time
0 1 29
1 1 42
2 1 38
3 1 40
4 1 43
5 1 40
6 1 30
7 1 42
8 2 30
9 2 35
10 2 39
11 2 28
12 2 31
13 2 31
14 2 29
15 2 35
16 2 29
17 2 33
18 3 26
19 3 32
20 3 21
21 3 20
22 3 23
23 3 22

24 rows × 2 columns

To see if the time to rehabilitate an injury is related to the fitness category we do the same as above, but wrapping the predictor in C().

In [17]:
results = smf.ols('Time ~ C(Fitness)', rehab_table).fit()
print results.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                   Time   R-squared:                       0.618
Method:                 Least Squares   F-statistic:                     16.96
Date:                Sat, 04 Oct 2014   Prob (F-statistic):           4.13e-05
Time:                        11:41:13   Log-Likelihood:                -68.286
No. Observations:                  24   AIC:                             142.6
Df Residuals:                      21   BIC:                             146.1
Df Model:                           2
===================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept          38.0000      1.574     24.149      0.000        34.728    41.272
C(Fitness)[T.2]    -6.0000      2.111     -2.842      0.010       -10.390    -1.610
C(Fitness)[T.3]   -14.0000      2.404     -5.824      0.000       -18.999    -9.001
==============================================================================
Omnibus:                        0.163   Durbin-Watson:                   2.209
Prob(Omnibus):                  0.922   Jarque-Bera (JB):                0.211
Skew:                          -0.163   Prob(JB):                        0.900
Kurtosis:                       2.675   Cond. No.                         3.80
==============================================================================


While all of the information that we want is technically present in this table, we typically want it presented in more standard fashion for ANOVA. We can do this using the anova_lm function.

In [18]:
from statsmodels.stats.anova import anova_lm

anova_lm(results)

Out[18]:
df sum_sq mean_sq F PR(>F)
C(Fitness) 2 672 336.000000 16.961538 0.000041
Residual 21 416 19.809524 NaN NaN

2 rows × 5 columns

### And lots more¶

Statsmodels includes much more advanced functionality including:

• Generalized Least Squares (i.e., correlated error structures such as for spatial and comparative analysis)
• Generalized Linear Models (i.e., non-normal error)
• Robust Linear Models
• Regression with Discrete Dependent Variable (e.g., logistic regression)
• Time Series analysis