Introduction to scientific computing with Python

Maxime Sangnier

September, 2020

Part 4: Statistics

Sampling and testing

Random sampling

Simple random data

Numpy offers several routines to easily generate uniform and normal random samples:

In [1]:
import numpy as np
import numpy.random as rdm
np.set_printoptions(precision=2)

rdm.rand(4, 4)  # Uniform sampling
Out[1]:
array([[0.65, 0.82, 0.23, 0.23],
       [0.44, 0.44, 0.32, 0.41],
       [0.43, 0.96, 0.48, 0.85],
       [0.93, 0.28, 0.98, 0.05]])
In [2]:
rdm.randn(4, 4)  # Standard normal sampling
Out[2]:
array([[ 0.53,  0.6 ,  0.98,  0.55],
       [-1.14,  1.24,  0.81,  0.43],
       [-0.16, -1.42,  1.14, -0.17],
       [ 1.4 , -0.08,  1.94,  1.55]])
In [3]:
rdm.randint(0, 10, size=(2, 3))  # Discrete uniform sampling (10 exclusive)
Out[3]:
array([[3, 6, 7],
       [9, 1, 1]])

In the case where the sampled integers are supposed to index an array, one can replace:

In [4]:
a = np.arange(10)*10

ind = rdm.randint(0, 5, size=7)
a[ind]
Out[4]:
array([20,  0, 10,  0, 40,  0,  0])

by:

In [5]:
rdm.choice(a, size=7)
Out[5]:
array([10, 80,  0, 70, 10,  0,  0])

The previous routines sample integers with replacement. To sample without replacement, one can use:

In [6]:
rdm.permutation(10)[:7]  # First 7 items of a random permutation of [0, …, 9]
Out[6]:
array([1, 8, 3, 0, 2, 9, 7])

Note that one can also directly permute an array, either with a copy (permutation) or in place (shuffle), instead of generating random indices.

In [7]:
rdm.permutation(a)
Out[7]:
array([40, 10, 20, 50, 80, 70, 30,  0, 60, 90])
In [8]:
rdm.shuffle(a)
a
Out[8]:
array([30, 50, 10, 60, 80, 40, 90, 70,  0, 20])
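
Sampling without replacement can also be done directly with choice by setting its replace parameter to False (a minimal sketch, reusing the array a defined above):

In [ ]:
rdm.choice(a, size=7, replace=False)  # 7 distinct items drawn from a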

Question

Draw a sample of size $100$ from a normal distribution with mean $10$ and standard deviation $2$. Print the usual estimators of these two parameters to validate your operation.

In [ ]:
# Answer

Random generator

It is obvious that random generators used in scientific computing are in fact pseudo-random generators. As a consequence, the practitioner is able to control them to a certain extent. In a way, this is good news for reproducible science!

Both examples below illustrate how to replay a random sampling.

In [10]:
for it in range(3):
    rdm.seed(it)  # Seed the generator to the current iteration number
    print(rdm.randint(100, size=3))
[44 47 64]
[37 12 72]
[40 15 72]
In [11]:
for it in range(3):
    rdm.seed(it)  # Seed the generator to the current iteration number
    print(rdm.randint(100, size=3))  # Same as before!
[44 47 64]
[37 12 72]
[40 15 72]
In [12]:
s = rdm.get_state()  # Get the internal state of the generator
print(np.array([rdm.randn() for it in range(3)]))
print(rdm.rand(3, 3))
[-2.14  1.64 -1.79]
[[0.2  0.62 0.3 ]
 [0.27 0.62 0.53]
 [0.13 0.51 0.18]]
In [13]:
rdm.set_state(s)  # Set the internal state of the generator to its previous value
for it in range(3):
    print(rdm.randn(1))  # Same as before!
[-2.14]
[1.64]
[-1.79]
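
Note that recent NumPy versions (1.17 and later) also provide a dedicated Generator API, numpy.random.default_rng, which encapsulates its own state and is now the recommended way to control reproducibility. A minimal sketch of its counterparts to the routines above:

In [ ]:
rng = np.random.default_rng(seed=0)  # A generator with its own internal state
print(rng.random(3))                 # Uniform sampling on [0, 1)
print(rng.standard_normal(3))        # Standard normal sampling
print(rng.integers(0, 10, size=3))   # Discrete uniform sampling (10 exclusive)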

Distributions

Besides the previous routines, Numpy offers the possibility to draw samples from numerous distributions.

In [14]:
x = rdm.poisson(lam=4, size=500)
print(x[:30])
[3 4 5 2 2 2 4 6 6 3 2 1 6 2 5 3 3 5 3 6 7 6 5 3 2 3 2 4 4 3]
In [15]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

fig, ax = plt.subplots()
ax.stem(np.bincount(x), use_line_collection=True);

Question

Draw a sample of size $1000$ from an exponential distribution with scale $2$ and plot a density histogram of this sample.

In [ ]:
# Answer

Advanced distributions

Special functions

Many low-level statistical routines (cumulative distribution, survival and inverse functions) are available in the scipy.special module.

In [17]:
from scipy import special

plt.stem(special.pdtr(range(15), 4), use_line_collection=True);  # Poisson cumulative distribution function
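
Other special functions follow the same pattern. For instance, a short sketch with the standard normal cumulative distribution function (ndtr) and its inverse (ndtri):

In [ ]:
print(special.ndtr(1.96))    # Standard normal cumulative distribution function
print(special.ndtri(0.975))  # Its inverse (quantile function)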

Random variables

The scipy.stats module implements random variables with two generic classes: continuous random variables and discrete random variables. As an example, we focus here on the gamma distribution and illustrate the main methods available.

In [18]:
from scipy.stats import gamma

# gamma is a r.v. object corresponding to the standard gamma distribution
print("Distribution support: [{}, {}]".format(gamma.a, gamma.b))
print("Number of shape parameters: {} (name: {})".format(gamma.numargs, gamma.shapes))
Distribution support: [0.0, inf]
Number of shape parameters: 1 (name: a)

The shape parameter a appears in the probability density function: $$f(x; a) = \frac{x^{a-1}\exp(-x)}{\Gamma(a)}.$$

Since the shape parameter a is required, one has to specify it for each method.

In [19]:
print("Mean:", gamma.mean(a=4))
print("Median:", gamma.median(a=4))
print("Variance:", gamma.var(a=4))
Mean: 4.0
Median: 3.672060748850897
Variance: 4.0

Two other parameters can be passed to the methods: loc and scale. They correspond to shifting and rescaling the standard random variable with $x \mapsto \mathrm{scale} \cdot x + \mathrm{loc}$.

In [20]:
print("Mean:", gamma.mean(a=4, loc=2, scale=0.1))
Mean: 2.4

Indeed, the standard mean $4$ becomes $0.1 \cdot 4 + 2 = 2.4$. Since passing those parameters time and again can become quite bothersome, one can freeze them:

In [21]:
rv = gamma(a=4, loc=2, scale=2)
print("Mean:", rv.mean())
Mean: 10.0

Now, let us have a look at the available methods.

In [22]:
print(rv.rvs(size=10))  # Draw a random sample
[ 6.56 17.46 24.21  7.11  9.63  5.71  8.27  7.46  9.37 14.17]
In [23]:
for n in range(4):
    print("Moment {}:".format(n), rv.moment(n))
Moment 0: 1.0
Moment 1: 10.0
Moment 2: 116.0
Moment 3: 1544.0
In [24]:
x = np.linspace(0, 30, num=50)

fig, ax = plt.subplots()
ax.plot(x, rv.pdf(x), label='pdf')  # Probability density function
ax.plot(x, rv.cdf(x)/10, label='cdf / 10')  # Cumulative distribution function
ax.legend(loc="best");
In [25]:
fig, ax = plt.subplots()
ax.plot(x, rv.logpdf(x), label='log pdf');  # Log of the pdf
In [26]:
x = np.linspace(0, 1, num=50)

fig, ax = plt.subplots()
ax.plot(x, rv.ppf(x), label='ppf')  # Percent point function (inverse of cdf, also called quantile function) 
ax.legend(loc="upper left");
In [27]:
m = rv.expect(lambda x: np.exp(-x))  # Expectation of a function
print("First exponential moment: {0:0.2e}".format(m))
First exponential moment: 1.67e-03

Note that an unfrozen random variable (for instance gamma, not rv) benefits from two other methods: fit and fit_loc_scale for estimating parameters respectively by likelihood maximization and the moment method.
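
As a minimal sketch, the parameters of the frozen variable rv could be re-estimated by maximum likelihood from a fresh sample:

In [ ]:
sample = rv.rvs(size=1000)                     # Sample from gamma(a=4, loc=2, scale=2)
a_hat, loc_hat, scale_hat = gamma.fit(sample)  # Maximum likelihood estimates
print(a_hat, loc_hat, scale_hat)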

Question

Draw a sample of size $1000$ from a Laplace distribution (with location $1$ and scale $2$) and plot a density histogram of this sample. Add the curve of the probability density function.

In [ ]:
# Answer

Descriptive statistics

Order statistics, moments and correlation

Basic descriptive statistics such as min, max, mean, median, std, variance and percentiles can be computed with array methods or routines from Numpy. Other empirical statistics such as mode and moments can be obtained with Scipy statistical functions.
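
For illustration, here is a minimal sketch on a sample drawn from the frozen gamma variable rv defined above:

In [ ]:
from scipy import stats

x = rv.rvs(size=1000)
print("Min / max:", x.min(), x.max())
print("Mean / median / std:", np.mean(x), np.median(x), x.std())
print("90th percentile:", np.percentile(x, 90))
print("Skewness / kurtosis:", stats.skew(x), stats.kurtosis(x))
print("Third central moment:", stats.moment(x, moment=3))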

Histograms

A common task in statistics is to estimate the probability density function of a random variable, which is called density estimation. As a first approach, this task can be achieved by computing a histogram.

In [29]:
x = rv.rvs(size=1000)  # Draw a random sample

# Compute the histogram
hist, bins = np.histogram(x, bins=20, density=True)

# Plot the histogram
fig, ax = plt.subplots()
ax.bar(bins[:-1], hist, width=bins[1]-bins[0])
ax.set_xticks(bins[::3])
ax.set_xticklabels(["{0:0.2f}".format(t) for t in bins[::3]]);

x_pdf = np.linspace(0, 30, num=100)
ax.plot(x_pdf, rv.pdf(x_pdf), color="red", linewidth=2, label="pdf")
ax.legend();

Alternatively, one can use Matplotlib to produce fancy plots:

In [30]:
fig, ax = plt.subplots()
hist_plt, bins_plt, patches = ax.hist(x, bins=20, density=True)
ax.plot(x_pdf, rv.pdf(x_pdf), color="red", linewidth=2, label="pdf")
ax.legend();

Kernel density estimation

Kernel density estimation is a more efficient tool than a histogram for density estimation. In Python, kernel density estimation can be performed with the gaussian_kde function.

In [31]:
from scipy.stats import gaussian_kde

kde = gaussian_kde(x)

fig, ax = plt.subplots()
(hist_plt, bins_plt, patches) = ax.hist(x, bins=20, density=True)
ax.plot(x_pdf, rv.pdf(x_pdf), color="red", linewidth=2, label="pdf")
ax.plot(x_pdf, kde(x_pdf), color="magenta", linewidth=2, label="kde")
ax.legend();

Question

Add to the previous figure the curves of kernel density estimates with the parameter bw_method (of gaussian_kde) varying according to bw_values.

In [32]:
bw_values = [0.1, 0.5, 1]
In [ ]:
# Answer

Hypothesis testing

Analyzing one sample

As in R, many tests can be performed in Python. For instance, a single sample may be analyzed with:

  • ttest_1samp: T-test for the mean of one group of scores;
  • kstest: Kolmogorov-Smirnov test for goodness of fit;
  • ksone: general Kolmogorov-Smirnov one-sided test;
  • chisquare: one-way chi square test;
  • anderson: Anderson-Darling test for data coming from a particular distribution.
In [34]:
from scipy import stats

for rv in [stats.expon, stats.norm]:
    tt, pval = stats.ttest_1samp(rv.rvs(size=100), 0)
    print("{} sample: ".format(rv.name), end="")
    print("p-value = {0:0.2f} (statistics = {1:0.2f})".format(pval, tt))
expon sample: p-value = 0.00 (statistics = 10.52)
norm sample: p-value = 0.08 (statistics = -1.77)

On the one hand, the p-value for the exponential sample is small enough that we can reject the null hypothesis that the mean is $0$. On the other hand, the p-value for the normal sample is too large to reject the null hypothesis at the usual 5% significance level (this is comforting!).

Let us now test for a given distribution:

In [35]:
for rv in [stats.expon, stats.norm]:
    tt, pval = stats.kstest(rv.rvs(size=100), 'expon')
    print("{} sample: ".format(rv.name), end="")
    print("p-value = {0:0.2f} (statistics = {1:0.2f})".format(pval, tt))
expon sample: p-value = 0.14 (statistics = 0.11)
norm sample: p-value = 0.00 (statistics = 0.56)

Again, the result is as expected.

Testing normality

As a special case, several tests exist for assessing the normality of a sample:

  • kurtosistest: tests whether a dataset has normal kurtosis;
  • skewtest: tests whether the skew is different from the normal distribution;
  • normaltest: tests whether a sample differs from a normal distribution (D'Agostino and Pearson's test);
  • jarque_bera: Jarque-Bera goodness of fit test on sample data;
  • shapiro: Shapiro-Wilk test for normality.
In [36]:
for rv in [stats.expon, stats.norm]:
    print("{} sample:".format(rv.name))
    for name, test in [('skew', stats.skewtest), ('kurtosis', stats.kurtosistest)]:
        tt, pval = test(rv.rvs(size=100))
        print("   {} test: ".format(name), end="")
        print("   p-value = {0:0.2f} (statistics = {1:0.2f})".format(pval, tt))
expon sample:
   skew test:    p-value = 0.00 (statistics = 5.06)
   kurtosis test:    p-value = 0.05 (statistics = 1.99)
norm sample:
   skew test:    p-value = 0.46 (statistics = -0.74)
   kurtosis test:    p-value = 0.93 (statistics = 0.09)

Note that these two tests are combined in the normality test:

In [37]:
for rv in [stats.expon, stats.norm]:
    tt, pval = stats.normaltest(rv.rvs(size=100))
    print("{} sample: ".format(rv.name), end="")
    print("p-value = {0:0.2f} (statistics = {1:0.2f})".format(pval, tt))
expon sample: p-value = 0.00 (statistics = 37.58)
norm sample: p-value = 0.51 (statistics = 1.36)
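
The Shapiro-Wilk test is called in the same way (a minimal sketch):

In [ ]:
for rv in [stats.expon, stats.norm]:
    tt, pval = stats.shapiro(rv.rvs(size=100))
    print("{} sample: p-value = {:0.2f} (statistics = {:0.2f})".format(rv.name, pval, tt))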

Question

Download this data file. Load the random sample stored in the variable x and test its normality.

In [38]:
from scipy.io import loadmat

x = loadmat('data/Gaussian_sample.mat')['x'].ravel()
In [ ]:
# Answer

Comparing two samples

Again, many tests for two samples are available in the scipy.stats module:

  • ks_2samp: Kolmogorov-Smirnov test for 2 samples;
  • ttest_ind: T-test for the means of two independent samples of scores;
  • kstwobign: Kolmogorov-Smirnov two-sided test for large N;
  • ttest_ind_from_stats: T-test for means of two independent samples from descriptive statistics;
  • ttest_rel: T-test on TWO RELATED samples of scores, a and b;
  • mannwhitneyu: Mann-Whitney rank test on samples x and y;
  • wilcoxon: Wilcoxon signed-rank test;
  • kruskal: Kruskal-Wallis H-test for independent samples;
  • ansari: Ansari-Bradley test for equal scale parameters;
  • bartlett: Bartlett's test for equal variances;
  • levene: Levene test for equal variances;
  • anderson_ksamp: Anderson-Darling test for k-samples;
  • fligner: Fligner-Killeen test for equality of variances;
  • median_test: Mood's median test;
  • mood: Mood's test for equal scale parameters.

As an example, one can test that two independent samples have identical means, or that two independent samples are drawn from the same continuous distribution:

In [40]:
rvs1 = stats.norm.rvs(size=100, loc=0., scale=1)
rvs2 = stats.norm.rvs(size=200, loc=0.1, scale=2)

tt, pval = stats.ttest_ind(rvs1,rvs2)
print("p-value = {0:0.2f} (statistics = {1:0.2f})".format(pval, tt))
p-value = 0.34 (statistics = -0.95)

Here, the p-value is high enough that we cannot reject the null hypothesis that the two samples have identical means. On the other hand, the following statistical test makes it possible to state that the two samples are drawn from different distributions (we reject the null hypothesis that they come from the same distribution):

In [41]:
tt, pval = stats.ks_2samp(rvs1, rvs2)
print("p-value = {0:0.2f} (statistics = {1:0.2f})".format(pval, tt))
p-value = 0.00 (statistics = 0.26)

Question

Apply both T-tests ttest_ind and ttest_rel on the samples rvs1 and rvs1_coupled. Can you explain the difference in the results?

In [42]:
rvs1_coupled = rvs1 + rdm.randn(rvs1.size) * 0.1 + 0.1
print(rvs1.mean(), rvs1_coupled.mean())
-0.054561153736931295 0.05255814305890341
In [ ]:
# Answer

Other tests

  • pearsonr: Pearson correlation coefficient and the p-value for testing non-correlation;
  • spearmanr: Spearman rank-order correlation coefficient and the p-value to test for non-correlation;
  • power_divergence: Cressie-Read power divergence statistic and goodness of fit test;
  • friedmanchisquare: Friedman test for repeated measurements;
  • chi2_contingency: Chi-square test of independence of variables in a contingency table;
  • fisher_exact: Fisher exact test on a 2x2 contingency table.
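
For instance, a minimal sketch of the Pearson correlation test on two artificially correlated samples:

In [ ]:
x = stats.norm.rvs(size=100)
y = 0.5 * x + stats.norm.rvs(size=100)  # Correlated with x by construction
r, pval = stats.pearsonr(x, y)
print("correlation = {:0.2f}, p-value = {:0.2e}".format(r, pval))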

Data representation and manipulation

Following the example of R, Python comes with a package for handling data as a table: the Pandas package provides a container for tables, called a dataframe. A dataframe is a two-dimensional table, in which each column contains measurements on one variable, and each row contains one individual.

The main features of Pandas and its dataframe are:

  • reading data from csv and Excel files;
  • giving names to variables;
  • storing large amounts of data in a clever manner;
  • providing methods for descriptive statistics.

Reading and creating a dataframe

A dataframe may be either read from a file or created from raw data (the file is available here):

In [44]:
!cat data/defra_consumption.csv
;England;Wales;Scotland;N Ireland
Cheese;105;103;103;66
Carcass meat;245;227;242;267
Other meat;685;803;750;586
Fish;147;160;122;93
Fats and oils;193;235;184;209
Sugars;156;175;147;139
Fresh potatoes;720;874;566;1033
Fresh Veg;253;265;171;143
Other Veg;488;570;418;355
Processed potatoes;198;203;220;187
Processed Veg;360;365;337;334
Fresh fruit;1102;1137;957;674
Cereals;1472;1582;1462;1494
Beverages;57;73;53;47
Soft drinks;1374;1256;1572;1506
Alcoholic drinks;375;475;458;135
Confectionery;54;64;62;41
In [45]:
import numpy as np
import pandas as pd

consumption = pd.read_csv('data/defra_consumption.csv', sep=';', index_col=0)
consumption
Out[45]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
Fish 147 160 122 93
Fats and oils 193 235 184 209
Sugars 156 175 147 139
Fresh potatoes 720 874 566 1033
Fresh Veg 253 265 171 143
Other Veg 488 570 418 355
Processed potatoes 198 203 220 187
Processed Veg 360 365 337 334
Fresh fruit 1102 1137 957 674
Cereals 1472 1582 1462 1494
Beverages 57 73 53 47
Soft drinks 1374 1256 1572 1506
Alcoholic drinks 375 475 458 135
Confectionery 54 64 62 41
In [46]:
timeseries = pd.DataFrame(data=np.random.randn(6,4),
                          index=pd.date_range('20130101', periods=6),
                          columns=list('ABCD'))
timeseries
Out[46]:
A B C D
2013-01-01 -0.429630 2.257868 -0.765199 1.047689
2013-01-02 -0.897261 -1.181121 1.017547 1.058445
2013-01-03 1.075539 -0.549968 -1.587368 0.389060
2013-01-04 -1.595424 0.257599 0.202211 -2.483426
2013-01-05 -1.305641 0.480583 -0.135586 0.421599
2013-01-06 0.308346 -0.164241 0.446200 0.744765
In [47]:
df = pd.DataFrame({'A' : 1.,  # Single item
                    'B' : pd.Timestamp('20130102'),  # Single item
                    'C' : np.random.randn(4),  # Multiple item
                    'D' : pd.Categorical(["test", "train", "test", "train"])})
df
Out[47]:
A B C D
0 1.0 2013-01-02 -0.138763 test
1 1.0 2013-01-02 0.060227 train
2 1.0 2013-01-02 -0.429151 test
3 1.0 2013-01-02 1.244043 train

Question

Create a dataframe whose columns are named normal, exponential and laplace and contain random samples of size $10$ from each distribution.

In [ ]:
# Answer

Viewing data

Since it is unrealistic to view a large table in its entirety, Pandas provides several methods to take a quick look at it.

In [49]:
timeseries.index  # Index of the table
Out[49]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [50]:
consumption.index  # Index of the table
Out[50]:
Index(['Cheese', 'Carcass meat', 'Other meat', 'Fish', 'Fats and oils',
       'Sugars', 'Fresh potatoes', 'Fresh Veg', 'Other Veg',
       'Processed potatoes', 'Processed Veg', 'Fresh fruit', 'Cereals',
       'Beverages', 'Soft drinks', 'Alcoholic drinks', 'Confectionery'],
      dtype='object')
In [51]:
consumption.columns  # Columns of the table
Out[51]:
Index(['England', 'Wales', 'Scotland', 'N Ireland'], dtype='object')
In [52]:
consumption.head(n=3)
Out[52]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
In [53]:
consumption.tail(n=3)
Out[53]:
England Wales Scotland N Ireland
Soft drinks 1374 1256 1572 1506
Alcoholic drinks 375 475 458 135
Confectionery 54 64 62 41
In [54]:
consumption.values  # Values are in a Numpy array
Out[54]:
array([[ 105,  103,  103,   66],
       [ 245,  227,  242,  267],
       [ 685,  803,  750,  586],
       [ 147,  160,  122,   93],
       [ 193,  235,  184,  209],
       [ 156,  175,  147,  139],
       [ 720,  874,  566, 1033],
       [ 253,  265,  171,  143],
       [ 488,  570,  418,  355],
       [ 198,  203,  220,  187],
       [ 360,  365,  337,  334],
       [1102, 1137,  957,  674],
       [1472, 1582, 1462, 1494],
       [  57,   73,   53,   47],
       [1374, 1256, 1572, 1506],
       [ 375,  475,  458,  135],
       [  54,   64,   62,   41]])

The methods info and describe give general and quantitative information about the dataframe, respectively. In particular, info indicates the categorical variables (which are not treated by describe).

In [55]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null float64
B    4 non-null datetime64[ns]
C    4 non-null float64
D    4 non-null category
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 324.0 bytes
In [56]:
df.describe()
Out[56]:
A C
count 4.0 4.000000
mean 1.0 0.184089
std 0.0 0.734652
min 1.0 -0.429151
25% 1.0 -0.211360
50% 1.0 -0.039268
75% 1.0 0.356181
max 1.0 1.244043

If you suspect a variable to be categorical, you can declare it as such.

In [57]:
df['E'] = np.random.randint(0, high=2, size=df.shape[0])
df['E'] = df['E'].astype('category')
df
Out[57]:
A B C D E
0 1.0 2013-01-02 -0.138763 test 0
1 1.0 2013-01-02 0.060227 train 1
2 1.0 2013-01-02 -0.429151 test 1
3 1.0 2013-01-02 1.244043 train 0
In [58]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
A    4 non-null float64
B    4 non-null datetime64[ns]
C    4 non-null float64
D    4 non-null category
E    4 non-null category
dtypes: category(2), datetime64[ns](1), float64(2)
memory usage: 424.0 bytes
In [59]:
timeseries.sort_index(ascending=False)
Out[59]:
A B C D
2013-01-06 0.308346 -0.164241 0.446200 0.744765
2013-01-05 -1.305641 0.480583 -0.135586 0.421599
2013-01-04 -1.595424 0.257599 0.202211 -2.483426
2013-01-03 1.075539 -0.549968 -1.587368 0.389060
2013-01-02 -0.897261 -1.181121 1.017547 1.058445
2013-01-01 -0.429630 2.257868 -0.765199 1.047689
In [60]:
timeseries.sort_values(by='A')
Out[60]:
A B C D
2013-01-04 -1.595424 0.257599 0.202211 -2.483426
2013-01-05 -1.305641 0.480583 -0.135586 0.421599
2013-01-02 -0.897261 -1.181121 1.017547 1.058445
2013-01-01 -0.429630 2.257868 -0.765199 1.047689
2013-01-06 0.308346 -0.164241 0.446200 0.744765
2013-01-03 1.075539 -0.549968 -1.587368 0.389060

Question

Print the first 10 rows of the dataframe consumption.

In [ ]:
# Answer

Indexing a table

Natural indexing

Explanations are provided with the consumption dataframe.

In [62]:
consumption.head()
Out[62]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
Fish 147 160 122 93
Fats and oils 193 235 184 209

Natural indexing is performed with []. This indexes the columns of dataframes and the rows of series.

In [63]:
consumption['England']
Out[63]:
Cheese                 105
Carcass meat           245
Other meat             685
Fish                   147
Fats and oils          193
Sugars                 156
Fresh potatoes         720
Fresh Veg              253
Other Veg              488
Processed potatoes     198
Processed Veg          360
Fresh fruit           1102
Cereals               1472
Beverages               57
Soft drinks           1374
Alcoholic drinks       375
Confectionery           54
Name: England, dtype: int64
In [64]:
s = consumption['England']  # A series
s['Cheese']
Out[64]:
105

You may want to extract several columns or several rows.

In [65]:
consumption[['England', 'Scotland']].head()
Out[65]:
England Scotland
Cheese 105 103
Carcass meat 245 242
Other meat 685 750
Fish 147 122
Fats and oils 193 184
In [66]:
s[['Cheese', 'Other meat']]
Out[66]:
Cheese        105
Other meat    685
Name: England, dtype: int64

Remark: selecting with [[]] always returns a dataframe.

In [67]:
consumption[['England']].head()
Out[67]:
England
Cheese 105
Carcass meat 245
Other meat 685
Fish 147
Fats and oils 193

Question

Print together columns B and D of the dataframe df.

In [ ]:
# Answer

Label based indexing

Label based indexing is an enhancement of natural indexing, accessible with .loc[]. Indexing has to be thought of as matrix indexing, but with labels instead of positions. Hence, the rows are indexed first (instead of the columns with []).

In [69]:
consumption.loc['Cheese']  # Single row
Out[69]:
England      105
Wales        103
Scotland     103
N Ireland     66
Name: Cheese, dtype: int64
In [70]:
consumption.loc[:, 'England'].head()  # Single column
Out[70]:
Cheese           105
Carcass meat     245
Other meat       685
Fish             147
Fats and oils    193
Name: England, dtype: int64
In [71]:
consumption.loc[['Cheese', 'Fresh potatoes']]  # Multiple rows
Out[71]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Fresh potatoes 720 874 566 1033

Slicing on rows and columns is possible but endpoints are included.

In [72]:
consumption.loc['Cheese':'Fish']  # Row slicing
Out[72]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
Fish 147 160 122 93
In [73]:
consumption.loc['Cheese':'Cereals':4]  # Row slicing (from Cheese to Cereals with step 4)
Out[73]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Fats and oils 193 235 184 209
Other Veg 488 570 418 355
Cereals 1472 1582 1462 1494
In [74]:
consumption.loc['Cheese':'Cereals':4, :'Wales']  # Row and column slicing
Out[74]:
England Wales
Cheese 105 103
Fats and oils 193 235
Other Veg 488 570
Cereals 1472 1582

Question

What is the value in the dataframe timeseries at row 2013-01-04 and column B?

In [ ]:
# Answer

Integer position based indexing

Integer location (or position) based indexing is done with .iloc[]. It is similar to .loc[] but considers only integer positions instead of labels.

Remark: endpoints are not included (similarly to Numpy).

In [76]:
consumption.iloc[:2]
Out[76]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
In [77]:
consumption.iloc[10::4, ::2]
Out[77]:
England Scotland
Processed Veg 360 337
Soft drinks 1374 1572

Boolean indexing

Similarly to Numpy arrays, dataframes can be indexed with Boolean variables thanks to .loc[].

In [78]:
consumption.loc[consumption['England'] > 500]  # Row selection
Out[78]:
England Wales Scotland N Ireland
Other meat 685 803 750 586
Fresh potatoes 720 874 566 1033
Fresh fruit 1102 1137 957 674
Cereals 1472 1582 1462 1494
Soft drinks 1374 1256 1572 1506
In [79]:
consumption.loc[consumption['England'] > 500, consumption.loc['Other meat'] < 800]  # Row and column selection
Out[79]:
England Scotland N Ireland
Other meat 685 750 586
Fresh potatoes 720 566 1033
Fresh fruit 1102 957 674
Cereals 1472 1462 1494
Soft drinks 1374 1572 1506

Indexing the whole dataframe with a Boolean condition puts NaN where the condition is not satisfied. Beyond that, it is useful for assignments.

In [80]:
timeseries[timeseries > 0]  # Set NaN where condition is not satisfied
Out[80]:
A B C D
2013-01-01 NaN 2.257868 NaN 1.047689
2013-01-02 NaN NaN 1.017547 1.058445
2013-01-03 1.075539 NaN NaN 0.389060
2013-01-04 NaN 0.257599 0.202211 NaN
2013-01-05 NaN 0.480583 NaN 0.421599
2013-01-06 0.308346 NaN 0.446200 0.744765
In [81]:
tt = timeseries.copy()
tt[tt < 0] = 0
tt
Out[81]:
A B C D
2013-01-01 0.000000 2.257868 0.000000 1.047689
2013-01-02 0.000000 0.000000 1.017547 1.058445
2013-01-03 1.075539 0.000000 0.000000 0.389060
2013-01-04 0.000000 0.257599 0.202211 0.000000
2013-01-05 0.000000 0.480583 0.000000 0.421599
2013-01-06 0.308346 0.000000 0.446200 0.744765

The isin method enables selecting with an OR condition.

In [82]:
df.loc[df['D'].isin(['test', 'train'])]
Out[82]:
A B C D E
0 1.0 2013-01-02 -0.138763 test 0
1 1.0 2013-01-02 0.060227 train 1
2 1.0 2013-01-02 -0.429151 test 1
3 1.0 2013-01-02 1.244043 train 0

Question

Select rows of df with $1$ in column E.

In [ ]:
# Answer

Strange behaviors of natural indexing I do not recommend

Here are some strange and confusing behaviors of Pandas objects.

I advise you not to use those features, but to adopt .loc[] and .iloc[] instead.

Series (not dataframes) can be indexed with integer numbers.

In [84]:
s[0]
Out[84]:
105

And sliced… with endpoints excluded.

In [85]:
s[0:2]  # 0: Cheese, 1: Carcass meat, 2: Other meat
Out[85]:
Cheese          105
Carcass meat    245
Name: England, dtype: int64

Slicing is also possible with labels… with endpoints included.

In [86]:
s['Cheese':'Other meat']
Out[86]:
Cheese          105
Carcass meat    245
Other meat      685
Name: England, dtype: int64

You can also slice rows of a dataframe using the same notation (while one would expect it to select columns). Even stranger, slicing also works with integer positions (for example consumption[2:5]), with endpoints included for labels and excluded for integer positions; single indexing, however, does not work (consumption[2] raises an error).

In [87]:
consumption['Cheese':'Cereals':2]
Out[87]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Other meat 685 803 750 586
Fats and oils 193 235 184 209
Fresh potatoes 720 874 566 1033
Other Veg 488 570 418 355
Processed Veg 360 365 337 334
Cereals 1472 1582 1462 1494

Similarly to slicing, Boolean indexing operates on dataframe rows.

In [88]:
consumption[consumption['England'] > 500]  # Similar to consumption.loc[consumption['England'] > 500]
Out[88]:
England Wales Scotland N Ireland
Other meat 685 803 750 586
Fresh potatoes 720 874 566 1033
Fresh fruit 1102 1137 957 674
Cereals 1472 1582 1462 1494
Soft drinks 1374 1256 1572 1506

Selecting random samples

The sample method makes it possible to randomly select rows (individuals) from a dataframe (without replacement).

In [89]:
consumption.sample(n=3)
Out[89]:
England Wales Scotland N Ireland
Confectionery 54 64 62 41
Fresh Veg 253 265 171 143
Fresh fruit 1102 1137 957 674

Adding and deleting items

Let us consider a copy of the first 5 rows of consumption.

In [90]:
cons = consumption.iloc[:5].copy()
cons
Out[90]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
Fish 147 160 122 93
Fats and oils 193 235 184 209

We successively add a column and a row to cons.

In [91]:
cons['all'] = cons.sum(axis=1)
cons.loc['Chocolate'] = [70, 60, 60, 150, 340]
cons
Out[91]:
England Wales Scotland N Ireland all
Cheese 105 103 103 66 377
Carcass meat 245 227 242 267 981
Other meat 685 803 750 586 2824
Fish 147 160 122 93 522
Fats and oils 193 235 184 209 821
Chocolate 70 60 60 150 340

These can now be dropped.

In [92]:
cons = cons.drop('Chocolate')
cons = cons.drop('all', axis=1)
cons
Out[92]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
Fish 147 160 122 93
Fats and oils 193 235 184 209

Dataframes can be concatenated along columns and rows.

In [93]:
pd.concat((cons[['England']], cons[['Scotland']]), axis=1)
Out[93]:
England Scotland
Cheese 105 103
Carcass meat 245 242
Other meat 685 750
Fish 147 122
Fats and oils 193 184
In [94]:
pd.concat((cons.iloc[:2], cons.iloc[-2:]))
Out[94]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Fish 147 160 122 93
Fats and oils 193 235 184 209

This last operation is similar to:

In [95]:
cons.iloc[:2].append(cons.iloc[-2:])
Out[95]:
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Fish 147 160 122 93
Fats and oils 193 235 184 209

Managing missing data

Missing data are generally represented by NaN in the table. Pandas offers several possibilities for handling them.

In [96]:
cons_md = consumption[consumption>200]
cons_md.iloc[:, -1] = consumption.iloc[:, -1]
cons_md.head()  # A table with missing data
Out[96]:
England Wales Scotland N Ireland
Cheese NaN NaN NaN 66
Carcass meat 245.0 227.0 242.0 267
Other meat 685.0 803.0 750.0 586
Fish NaN NaN NaN 93
Fats and oils NaN 235.0 NaN 209
In [97]:
cons_md.dropna()  # Drop any row with missing data
Out[97]:
England Wales Scotland N Ireland
Carcass meat 245.0 227.0 242.0 267
Other meat 685.0 803.0 750.0 586
Fresh potatoes 720.0 874.0 566.0 1033
Other Veg 488.0 570.0 418.0 355
Processed Veg 360.0 365.0 337.0 334
Fresh fruit 1102.0 1137.0 957.0 674
Cereals 1472.0 1582.0 1462.0 1494
Soft drinks 1374.0 1256.0 1572.0 1506
Alcoholic drinks 375.0 475.0 458.0 135
In [98]:
cons_md.fillna(method='ffill').head()  # Fill missing data with previous real values
# This does not work for the first row
Out[98]:
England Wales Scotland N Ireland
Cheese NaN NaN NaN 66
Carcass meat 245.0 227.0 242.0 267
Other meat 685.0 803.0 750.0 586
Fish 685.0 803.0 750.0 93
Fats and oils 685.0 235.0 750.0 209
In [99]:
cons_md.fillna(method='ffill').fillna(method='bfill').head()
Out[99]:
England Wales Scotland N Ireland
Cheese 245.0 227.0 242.0 66
Carcass meat 245.0 227.0 242.0 267
Other meat 685.0 803.0 750.0 586
Fish 685.0 803.0 750.0 93
Fats and oils 685.0 235.0 750.0 209
In [100]:
cons_md.isnull().any()  # Test for missing data
Out[100]:
England       True
Wales         True
Scotland      True
N Ireland    False
dtype: bool

Question

By changing the axis parameter of dropna, drop the columns of cons_md that contain missing values.

In [ ]:
# Answer

Descriptive statistics

A dataframe comes with many methods for descriptive statistics:

  • count: Number of non-null observations;
  • sum: Sum of values;
  • mean: Mean of values;
  • mad: Mean absolute deviation;
  • median: Arithmetic median of values;
  • min: Minimum;
  • max: Maximum;
  • mode: Mode;
  • abs: Absolute Value;
  • prod: Product of values;
  • std: Bessel-corrected sample standard deviation;
  • var: Unbiased variance;
  • sem: Standard error of the mean;
  • skew: Sample skewness (3rd moment);
  • kurt: Sample kurtosis (4th moment);
  • quantile: Sample quantile (value at %);
  • cumsum: Cumulative sum;
  • cumprod: Cumulative product;
  • cummax: Cumulative maximum;
  • cummin: Cumulative minimum.
In [102]:
df.median()  # Median of numeric columns
Out[102]:
A    1.000000
C   -0.039268
E    0.500000
dtype: float64
In [103]:
df.median(axis=1)  # Median of rows (numeric objects only)
Out[103]:
0    0.430619
1    0.530114
2    0.285425
3    1.122021
dtype: float64
In [104]:
df['D'].value_counts()  # Histogramming
Out[104]:
train    2
test     2
Name: D, dtype: int64

Question

Compute the cumulative sum of the consumption data.

In [ ]:
# Answer

Pivot table

A pivot table aggregates the values of a column according to the values of other columns (which become the row index and column labels of the aggregated table).

In [106]:
df['E'] = ['success', 'failure', 'failure', 'success']
df
Out[106]:
A B C D E
0 1.0 2013-01-02 -0.138763 test success
1 1.0 2013-01-02 0.060227 train failure
2 1.0 2013-01-02 -0.429151 test failure
3 1.0 2013-01-02 1.244043 train success
In [107]:
# Pivot table:
# values: label of the column whose values will be reorganized and displayed in the table
# index: indexes of the final table
# columns: columns of the final table
pd.pivot_table(df, values='C', index='D', columns='E')
Out[107]:
E failure success
D
test -0.429151 -0.138763
train 0.060227 1.244043

Plotting

With pandas

Pandas provides a rich collection of techniques to visualize dataframes. Here, we illustrate just a few of them.

In [108]:
consumption.plot(figsize=(10, 6));  # Columns vs index
In [109]:
consumption.plot(subplots=True, figsize=(10, 8));  # Columns vs index
In [110]:
consumption.plot.barh(stacked=True, figsize=(10, 6));

Question

Plot a pie chart (parameter kind='pie' of plot) of the average consumption of foodstuff.

In [ ]:
# Answer

With Seaborn

Besides its aesthetic purpose, Seaborn provides many routines for producing beautiful plots from dataframes, particularly for statistical data.

In [112]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(x='England', y='Scotland', data=consumption);
In [113]:
tips = sns.load_dataset('tips')
tips.head()
Out[113]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [114]:
import warnings
warnings.filterwarnings("ignore")

fig, ax = plt.subplots(figsize=(10, 6))
sns.violinplot(x='day', y='total_bill', hue='smoker', split=True, data=tips);
In [115]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='day', y='tip', hue='time', data=tips)
Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb45d0f8dd8>

Linear models

StatsModels is a scientific module based on Pandas for performing statistical analyses in Python. It provides tools for data exploration, statistical tests and the estimation of several statistical models. As a statistical package, StatsModels provides an extensive list of result statistics for each estimator.

Linear regression

We illustrate here a major feature of StatsModels, which is linear regression.

Simple example

In [116]:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

# Generate data
nsample = 50
sig = 0.5
x = np.linspace(0, 20, nsample)
X = np.column_stack((x, np.sin(x), (x-5)**2, np.ones(nsample)))
beta = [0.5, 0.5, -0.02, 5.]

y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

y[np.random.permutation(y.size)[:3]] += 4 * (-1)**np.random.randint(0, 2, size=3)  # 3 outliers
In [117]:
# Fit the model
model = sm.OLS(y, X)
res = model.fit()
res.summary()
Out[117]:
OLS Regression Results
Dep. Variable: y R-squared: 0.751
Model: OLS Adj. R-squared: 0.734
Method: Least Squares F-statistic: 46.15
Date: Tue, 22 Sep 2020 Prob (F-statistic): 6.45e-14
Time: 16:59:26 Log-Likelihood: -73.064
No. Observations: 50 AIC: 154.1
Df Residuals: 46 BIC: 161.8
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 0.4958 0.057 8.672 0.000 0.381 0.611
x2 0.7785 0.225 3.464 0.001 0.326 1.231
x3 -0.0205 0.005 -4.093 0.000 -0.031 -0.010
const 5.1553 0.371 13.906 0.000 4.409 5.902
Omnibus: 41.161 Durbin-Watson: 2.011
Prob(Omnibus): 0.000 Jarque-Bera (JB): 345.917
Skew: 1.658 Prob(JB): 7.67e-76
Kurtosis: 15.452 Cond. No. 221.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Main attributes of the fitted model are:

In [118]:
print('Parameters: ', res.params)
print('Standard errors: ', res.bse)
print('Predicted values: ', res.fittedvalues)
Parameters:  [ 0.5   0.78 -0.02  5.16]
Standard errors:  [0.06 0.22 0.01 0.37]
Predicted values:  [ 4.64  5.23  5.77  6.2   6.51  6.68  6.73  6.7   6.62  6.55  6.53  6.62
  6.82  7.14  7.56  8.04  8.54  8.99  9.37  9.62  9.74  9.73  9.62  9.44
  9.26  9.11  9.05  9.1   9.27  9.55  9.9  10.29 10.67 10.97 11.17 11.24
 11.17 10.99 10.73 10.44 10.16  9.96  9.85  9.87 10.   10.23 10.51 10.79
 11.02 11.16]

A method called predict is also available for prediction with the estimator:

In [119]:
res.predict(X)  # Same as res.fittedvalues
Out[119]:
array([ 4.64,  5.23,  5.77,  6.2 ,  6.51,  6.68,  6.73,  6.7 ,  6.62,
        6.55,  6.53,  6.62,  6.82,  7.14,  7.56,  8.04,  8.54,  8.99,
        9.37,  9.62,  9.74,  9.73,  9.62,  9.44,  9.26,  9.11,  9.05,
        9.1 ,  9.27,  9.55,  9.9 , 10.29, 10.67, 10.97, 11.17, 11.24,
       11.17, 10.99, 10.73, 10.44, 10.16,  9.96,  9.85,  9.87, 10.  ,
       10.23, 10.51, 10.79, 11.02, 11.16])
In [120]:
# Plot the regression
prstd, iv_l, iv_u = wls_prediction_std(res)  # Curves for standard deviation

fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="data")
ax.plot(x, y_true, 'b-', label="True")
ax.plot(x, res.fittedvalues, 'r--.', label="OLS")
ax.plot(x, iv_u, 'r--')
ax.plot(x, iv_l, 'r--')
ax.legend(loc='best');
In [121]:
residuals = y - res.predict(X)  # Same as res.resid
print(residuals)
[-9.56e-02  3.60e-04 -5.25e-02 -2.06e-01 -2.61e-02  2.10e-01  7.19e-01
 -2.71e-01  2.62e-01  5.74e-01 -1.89e-01 -3.74e+00  1.43e-01 -3.85e-02
  5.10e+00  1.58e-01 -8.59e-01 -5.07e-01 -7.50e-01 -1.20e-01 -3.58e-01
 -1.74e-01 -9.44e-01  4.75e-01 -3.43e-01  7.70e-01 -2.83e-01  6.08e-01
  5.87e-01 -1.29e-01 -4.30e-01 -1.33e-01 -4.77e-01 -1.08e-01  2.57e-01
 -6.15e-01  2.47e+00 -7.23e-02 -2.46e-01 -6.36e-01 -8.28e-01  1.71e-01
  1.68e-01 -3.14e-02 -4.50e-01 -3.48e-01  5.32e-01 -2.21e-02  1.70e-01
  9.64e-02]

The sum of squared residuals (or residual sum of squares) is:

In [122]:
print(np.sum(res.resid**2), res.ssr)
54.4182341449693 54.41823414496929

while an unbiased estimate of the noise variance $\sigma^2$ is:

In [123]:
print(res.ssr / res.df_resid, res.scale)
1.183005090108028 1.183005090108028

The hat matrix (or projection matrix) is:

In [124]:
H = X.dot(np.linalg.solve(X.T.dot(X), X.T))

Then, the studentized residuals are:

In [125]:
t = res.resid / np.sqrt(res.scale*(1-np.diag(H)))  # Standardized residuals
ts = t * np.sqrt( (res.df_resid-1) / (res.df_resid-t**2))  # Studentized residuals
print(ts)
[-9.57e-02  3.53e-04 -5.11e-02 -2.00e-01 -2.51e-02  2.00e-01  6.81e-01
 -2.53e-01  2.45e-01  5.41e-01 -1.79e-01 -4.20e+00  1.36e-01 -3.63e-02
  6.77e+00  1.47e-01 -8.03e-01 -4.76e-01 -7.12e-01 -1.14e-01 -3.40e-01
 -1.64e-01 -8.91e-01  4.43e-01 -3.20e-01  7.26e-01 -2.68e-01  5.78e-01
  5.56e-01 -1.20e-01 -4.00e-01 -1.23e-01 -4.46e-01 -1.02e-01  2.45e-01
 -5.87e-01  2.49e+00 -6.76e-02 -2.28e-01 -5.93e-01 -7.80e-01  1.62e-01
  1.60e-01 -3.02e-02 -4.33e-01 -3.33e-01  5.12e-01 -2.14e-02  1.68e-01
  9.77e-02]

The studentized residuals can be directly obtained by:

In [126]:
print(res.outlier_test()[:, 0])
[-9.57e-02  3.53e-04 -5.11e-02 -2.00e-01 -2.51e-02  2.00e-01  6.81e-01
 -2.53e-01  2.45e-01  5.41e-01 -1.79e-01 -4.20e+00  1.36e-01 -3.63e-02
  6.77e+00  1.47e-01 -8.03e-01 -4.76e-01 -7.12e-01 -1.14e-01 -3.40e-01
 -1.64e-01 -8.91e-01  4.43e-01 -3.20e-01  7.26e-01 -2.68e-01  5.78e-01
  5.56e-01 -1.20e-01 -4.00e-01 -1.23e-01 -4.46e-01 -1.02e-01  2.45e-01
 -5.87e-01  2.49e+00 -6.76e-02 -2.28e-01 -5.93e-01 -7.80e-01  1.62e-01
  1.60e-01 -3.02e-02 -4.33e-01 -3.33e-01  5.12e-01 -2.14e-02  1.68e-01
  9.77e-02]
In [127]:
fig, ax = plt.subplots()
ax.plot(ts, '.')
ax.plot([0, X.shape[0]], [-2]*2, 'r')
ax.plot([0, X.shape[0]], [2]*2, 'r')

outliers = np.where(res.outlier_test()[:, -1]<0.1)
ax.plot(outliers, res.outlier_test()[outliers, 0], 'ro')
Out[127]:
[<matplotlib.lines.Line2D at 0x7fb45d028710>,
 <matplotlib.lines.Line2D at 0x7fb45cfbfcf8>]

Note that studentized residuals are only asymptotically normal.

In [128]:
from scipy.stats import normaltest, probplot

print(normaltest(ts))
NormaltestResult(statistic=57.329631472186776, pvalue=3.5565482180656016e-13)
In [129]:
from scipy.stats import t as student_dist

probplot(ts, dist=student_dist(df=res.df_resid-1), plot=plt);
In [130]:
probplot(ts, dist='norm', plot=plt);

Another example

In [131]:
df = sm.datasets.get_rdataset('iris').data
df.columns = [name.replace('.', '_').lower() for name in df.columns]  # Make names pythonic

df.describe()
Out[131]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [132]:
df.isnull().any()  # Check if there is any missing value
Out[132]:
sepal_length    False
sepal_width     False
petal_length    False
petal_width     False
species         False
dtype: bool
In [133]:
import statsmodels.formula.api as smf

model =  smf.ols('sepal_length ~ petal_length + petal_width', data=df)
res = model.fit()
res.summary()
Out[133]:
OLS Regression Results
Dep. Variable: sepal_length R-squared: 0.766
Model: OLS Adj. R-squared: 0.763
Method: Least Squares F-statistic: 241.0
Date: Tue, 22 Sep 2020 Prob (F-statistic): 4.00e-47
Time: 16:59:29 Log-Likelihood: -75.023
No. Observations: 150 AIC: 156.0
Df Residuals: 147 BIC: 165.1
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.1906 0.097 43.181 0.000 3.999 4.382
petal_length 0.5418 0.069 7.820 0.000 0.405 0.679
petal_width -0.3196 0.160 -1.992 0.048 -0.637 -0.002
Omnibus: 0.383 Durbin-Watson: 1.826
Prob(Omnibus): 0.826 Jarque-Bera (JB): 0.540
Skew: 0.060 Prob(JB): 0.763
Kurtosis: 2.732 Cond. No. 25.3


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

StatsModels accepts categorical variables:

In [134]:
model = smf.ols('sepal_length ~ petal_length + petal_width + C(species)', data=df)
res = model.fit()
res.summary()
Out[134]:
OLS Regression Results
Dep. Variable: sepal_length R-squared: 0.837
Model: OLS Adj. R-squared: 0.832
Method: Least Squares F-statistic: 185.8
Date: Tue, 22 Sep 2020 Prob (F-statistic): 5.33e-56
Time: 16:59:29 Log-Likelihood: -48.116
No. Observations: 150 AIC: 106.2
Df Residuals: 145 BIC: 121.3
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.6830 0.107 34.291 0.000 3.471 3.895
C(species)[T.versicolor] -1.5984 0.206 -7.770 0.000 -2.005 -1.192
C(species)[T.virginica] -2.1126 0.304 -6.949 0.000 -2.714 -1.512
petal_length 0.9059 0.074 12.191 0.000 0.759 1.053
petal_width -0.0060 0.156 -0.038 0.969 -0.315 0.303
Omnibus: 0.587 Durbin-Watson: 1.802
Prob(Omnibus): 0.746 Jarque-Bera (JB): 0.679
Skew: 0.142 Prob(JB): 0.712
Kurtosis: 2.832 Cond. No. 61.5


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To regress a variable y on all others, there is no notation equivalent to y ~ . in R. Instead, you can use the following workaround.

In [135]:
def all_predictors(df, outcome):
    return outcome + " ~ " + " + ".join(df.columns.difference([outcome]))

print(all_predictors(df, "sepal_length"))
sepal_length ~ petal_length + petal_width + sepal_width + species
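
The resulting formula string can be passed directly to smf.ols, as in the short sketch below (the model analyzed in the next section remains the one fitted above with C(species)):

In [ ]:
res_all = smf.ols(all_predictors(df, "sepal_length"), data=df).fit()
print(res_all.params)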

ANOVA

The last linear model can be formalized as: $$Y = \mu \mathbb 1 + X\beta + A \alpha + \epsilon, \qquad s.t.~ \alpha_1 = 0,$$ where

  • $Y \in \mathbb R^n$ is the vector of observations (sepal_length);
  • $\mu \in \mathbb R$ is the intercept;
  • $X \in \mathbb R^{n \times 2}$ is the design matrix;
  • $\beta \in \mathbb R^2$ is the vector of coefficients (petal_length and petal_width);
  • $A \in \mathbb R^{n \times 3}$ is the indicator matrix corresponding to the categorical variable (species);
  • $\alpha \in \mathbb R^3$ is the vector of effects of the categorical values setosa, versicolor and virginica;
  • $\epsilon \sim \mathcal N(0, \sigma^2 I_n)$ is a random noise.

An analysis of variance (in its generalized version) tests the significance of continuous and categorical variables (based on a Fisher test), that is the null hypotheses (see the ANOVA table below):

  • $\alpha_1 = \alpha_2 = \alpha_3 = 0$ (significance of species);
  • $\beta_1 = 0$ (significance of petal_length);
  • $\beta_2 = 0$ (significance of petal_width).
In [136]:
sm.stats.anova_lm(res)  # ANOVA table
Out[136]:
df sum_sq mean_sq F PR(>F)
C(species) 2.0 63.212133 31.606067 274.728446 4.785160e-50
petal_length 1.0 22.274541 22.274541 193.616313 1.718161e-28
petal_width 1.0 0.000169 0.000169 0.001472 9.694469e-01
Residual 145.0 16.681489 0.115045 NaN NaN

We read the p-values in the last column. For the species, we observe a p-value that is small enough to reject the null hypothesis that the species does not impact the model. In the same manner, we can conclude that petal_length is an important factor (its coefficient is non-zero). However, the p-value for petal_width is large enough to conclude that, given the covariates species and petal_length, the variable petal_width does not provide additional information explaining the outcome sepal_length (it has no significant effect on the response variable).

Assume now that you know $\beta$ and consider the model: $$\tilde Y = Y - X\beta = \mu \mathbb 1 + A \alpha + \epsilon, \qquad s.t.~ \alpha_1 = 0.$$ The impact of the categorical variable species can be illustrated on the following graph, showing that the means are different according to the modalities of species (setosa, versicolor, virginica).

In [137]:
# Observations \tilde Y
sepal_beta = (0.9059, -0.006)
observations = df['sepal_length'] - sepal_beta[0]*df['petal_length'] - sepal_beta[1]*df['petal_width']

plt.plot(observations[df['species']=='setosa'], label="setosa")
plt.plot(observations[df['species']=='versicolor'], label="versicolor")
plt.plot(observations[df['species']=='virginica'], label="virginica")
plt.legend(loc='best');

Assume now that you would like to test the Gaussian assumption of the model, that is for each modality indexed by $j \in \{1, 2, 3\}$, observations are distributed according to $\mathcal N(\mu + \alpha_j, \sigma^2)$. You should test if the groups:

  1. are normally distributed;
  2. have equal variances.
In [138]:
from scipy.stats import levene, bartlett, normaltest

for modality in ['setosa', 'versicolor', 'virginica']:
    print(normaltest(observations[df['species']==modality]))
NormaltestResult(statistic=4.053279083071296, pvalue=0.13177761110864925)
NormaltestResult(statistic=0.7548077096205367, pvalue=0.6856391193186202)
NormaltestResult(statistic=0.11510513766630709, pvalue=0.944072260304799)

Samples can be considered normally distributed. What about the variances?

In [139]:
print(levene(observations[df['species']=='setosa'], observations[df['species']=='versicolor'],
             observations[df['species']=='virginica']))
print(bartlett(observations[df['species']=='setosa'], observations[df['species']=='versicolor'],
               observations[df['species']=='virginica']))
LeveneResult(statistic=0.14161247465879312, pvalue=0.8680758119020389)
BartlettResult(statistic=0.22506978073272027, pvalue=0.8935661697135951)

Variances can be considered equal.

Exercises

Exercise 1

Draw a sample from a chi-squared distribution with 5 degrees of freedom. Plot the histogram and the probability density function on the same figure.

In [ ]:
# Answer

Exercise 2

Generate a sample from a standard normal distribution and another from a gamma distribution with shape $a=\frac 14$ and scale $\theta=2$. Test for the null hypothesis that both samples have equal variances.

In [ ]:
# Answer

Exercise 3

Illustrate the central limit theorem with a binomial distribution with parameters $n=10$ and $p=0.7$. Perform a test against a standard normal distribution to support the result.

In [ ]:
# Answer

Exercise 4

This exercise proposes to analyse a dataset.

  • Load the R dataset airquality with StatsModels utilities.
  • Make column names pythonic (remove "." from Solar.R).
  • Drop rows with missing data.
  • Display a summary of the dataset.
  • Perform a linear regression to explain the ozone variable with solar_r , wind, temp and month (categorical variable).
  • Display a summary of the linear regression.
  • Are the studentized residuals normally distributed?
  • Perform an ANOVA.
  • Predict the air quality for today (suppose that solar_r = 207.0, convert wind to miles per hour and temp to degrees Fahrenheit by hitting Show nonmetric). Use res.predict(dict(solar_r=207, …)).
In [ ]:
# Answer


Exercise 5

Implement a Metropolis-Hastings algorithm. Use it to sample from a beta distribution with parameters $\alpha=2$ and $\beta=5$. Compare with what is obtained with Scipy routines.

In [ ]:
# Answer