%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
from utils import decorate
from thinkstats2 import Pmf, Cdf
import thinkstats2
import thinkplot
Here are some of the functions from Chapter 5.
def MakeNormalModel(values, label=''):
"""Plots a CDF with a Normal model.
values: sequence
"""
cdf = thinkstats2.Cdf(values, label=label)
mean, var = thinkstats2.TrimmedMeanVar(values)
std = np.sqrt(var)
print('n, mean, std', len(values), mean, std)
xmin = mean - 4 * std
xmax = mean + 4 * std
xs, ps = thinkstats2.RenderNormalCdf(mean, std, xmin, xmax)
thinkplot.Plot(xs, ps, label='model', linewidth=4, color='0.8')
thinkplot.Cdf(cdf)
def MakeNormalPlot(values, label=''):
"""Generates a normal probability plot.
values: sequence
"""
mean, var = thinkstats2.TrimmedMeanVar(values, p=0.01)
std = np.sqrt(var)
xs = [-5, 5]
xs, ys = thinkstats2.FitLine(xs, mean, std)
thinkplot.Plot(xs, ys, color='0.8', label='model')
xs, ys = thinkstats2.NormalProbability(values)
thinkplot.Plot(xs, ys, '+', alpha=0.3, label=label)
Read the GSS data again.
%time gss = pd.read_hdf('gss.hdf5', 'gss')
gss.shape
gss.head()
Most variables use special codes to indicate missing data. We have to be careful not to use these codes as numerical data; one way to manage that is to replace them with NaN
, which Pandas recognizes as a missing value.
def replace_invalid(df):
df.realinc.replace([0], np.nan, inplace=True)
df.educ.replace([98,99], np.nan, inplace=True)
# 89 means 89 or older
df.age.replace([98, 99], np.nan, inplace=True)
df.cohort.replace([9999], np.nan, inplace=True)
df.adults.replace([9], np.nan, inplace=True)
replace_invalid(gss)
Here's the CDF of ages.
cdf_age = Cdf(gss.age)
thinkplot.cdf(cdf_age, label='age')
decorate(title='Distribution of age',
xlabel='Age (years)',
ylabel='CDF')
Exercise: Each of the following cells shows the distribution of ages under various transforms, compared to various models. In each text cell, add a sentence or two that interprets the result. What can we say about the distribution of ages based on each figure?
Interpretation:
MakeNormalModel(gss.age.dropna(), label='')
decorate(title='Distribution of age',
xlabel='Age (years)',
ylabel='CDF')
Interpretation:
MakeNormalPlot(gss.age.dropna(), label='')
decorate(title='Normal probability plot',
xlabel='Standard normal sample',
ylabel='Age (years)')
Interpretation:
thinkplot.cdf(cdf_age, label='age', complement=True)
decorate(title='Distribution of age',
xlabel='Age (years)',
ylabel='Complementary CDF, log scale',
yscale='log')
Interpretation:
thinkplot.cdf(cdf_age, label='age')
decorate(title='Distribution of age',
xlabel='Age (years)',
ylabel='CDF',
xscale='log')
Interpretation:
values = np.log10(gss.age.dropna())
MakeNormalModel(values, label='')
decorate(title='Distribution of age',
xlabel='Age (log10 years)',
ylabel='CDF')
Interpretation:
MakeNormalPlot(values, label='')
decorate(title='Distribution of age',
xlabel='Standard normal sample',
ylabel='Age (log10 years)')
Interpretation:
thinkplot.cdf(cdf_age, label='age', complement=True)
decorate(title='Distribution of age',
xlabel='Age (years)',
ylabel='Complementary CDF, log scale',
xscale='log',
yscale='log')
Interpretation:
thinkplot.cdf(cdf_age, label='age', transform='Weibull')
decorate(title='Distribution of age',
xlabel='Age (years)',
ylabel='log Complementary CDF, log scale',
xscale='log',
yscale='log')
Here's the CDF of realinc
.
cdf_realinc = Cdf(gss.realinc)
thinkplot.cdf(cdf_realinc, label='income')
decorate(title='Distribution of income',
xlabel='Income (1986 $)',
ylabel='CDF')
Exercise: Use visualizations like the ones in the previous exercise to see whether there is an analytic model that describes the distribution of gss.realinc
well.
# Solution goes here
# Solution goes here
# Solution goes here
# Solution goes here
# Solution goes here
# Solution goes here
# Solution goes here
Interpretation:
# Solution goes here
%time brfss = pd.read_hdf('brfss.hdf5', 'brfss')
brfss.head()
Let's look at the distribution of height in the BRFSS dataset. Here's the CDF.
heights = brfss.HTM4
cdf_heights = Cdf(heights)
thinkplot.Cdf(cdf_heights)
decorate(xlabel='Height (cm)', ylabel='CDF')
To see whether a normal model describes this data well, we can use KDE to estimate the PDF.
from scipy.stats import gaussian_kde
Here's an example using the default bandwidth method.
kde = gaussian_kde(heights.dropna())
xs = np.linspace(heights.min(), heights.max())
ds = kde.evaluate(xs)
ds /= ds.sum()
plt.plot(xs, ds, label='KDE heights')
decorate(xlabel='Height (cm)', ylabel='PDF')
It doesn't work very well; we can improve it by overriding the bandwidth with a constant.
kde = gaussian_kde(heights.dropna(), bw_method=0.3)
ds = kde.evaluate(xs)
ds /= ds.sum()
plt.plot(xs, ds, label='KDE heights')
decorate(xlabel='Height (cm)', ylabel='PDF')
Now we can generate a normal model with the same mean and standard deviation.
mean = heights.mean()
std = heights.std()
mean, std
Here's the model compared to the estimated PDF.
normal_pdf = thinkstats2.NormalPdf(mean, std)
ps = normal_pdf.Density(xs)
ps /= ps.sum()
plt.plot(xs, ps, color='gray', label='Normal model')
plt.plot(xs, ds, label='KDE heights')
decorate(xlabel='Height (cm)', ylabel='PDF')
The data don't fit the model particularly well, possibly because the distribution of heights is a mixture of two distributions, for men and women.
Exercise: Generate a similar figure for just women's heights and see if the normal model does any better.
# Solution goes here
# Solution goes here
Exercise: Generate a similar figure for men's weights, brfss.WTKG3
. How well does the normal model fit?
# Solution goes here
# Solution goes here
Exercise: Try it one more time with the log of men's weights. How well does the normal model fit? What does that imply about the distribution of weight?
# Solution goes here
# Solution goes here
Let's look at the skewness of the distribution of weights for men and women.
male = (brfss.SEX == 1)
male_weights = brfss.loc[male, 'WTKG3']
female = (brfss.SEX == 2)
female_weights = brfss.loc[female, 'WTKG3']
As we've seen, these distributions are skewed to the right, so we expect the mean to be higher than the median.
male_weights.mean(), male_weights.median()
We can compute the moment-based sample skewness using Pandas or thinkstats2
. The results are almost the same.
male_weights.skew(), thinkstats2.Skewness(male_weights.dropna())
But moment-based sample skewness is a terrible statistic! A more robust alternative is Pearson's median skewness:
thinkstats2.PearsonMedianSkewness(male_weights.dropna())
Exercise: Compute the same statistics for women. Which distribution is more skewed?
# Solution goes here
# Solution goes here
# Solution goes here
Exercise: Explore the GSS or BRFSS dataset and find something interesting!