This file contains example code that demonstrates several pandas features.

In [2]:

from __future__ import print_function, division

%matplotlib inline

For the first example, I'll work with data from the BRFSS (Behavioral Risk Factor Surveillance System).

In [3]:

import brfss
df = brfss.ReadBrfss(nrows=5000)
df['height'] = df.htm3

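Of the first 5000 respondents, 42 have invalid heights, which are stored as NaN. Note that the most obvious ways of checking for null, such as comparing with == numpy.nan, don't work because NaN is not equal to itself. Use isnull instead.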
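
As a small illustration of why the equality test fails, here is a sketch on a made-up Series (the values are invented and are not BRFSS data). Comparing with == counts nothing, while isnull finds both missing entries.

```python
import numpy as np
import pandas as pd

# Made-up heights with two missing values (not BRFSS data).
heights = pd.Series([170.0, np.nan, 165.0, np.nan])

# NaN never compares equal to anything, including itself,
# so this comparison is False for every element.
naive_count = int((heights == np.nan).sum())

# isnull checks for missing values explicitly.
null_count = int(heights.isnull().sum())

print(naive_count, null_count)
```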

In [4]:

sum(df.height.isnull())

Out[4]:

42

Use dropna to select valid heights.

In [5]:

valid_heights = df.height.dropna()
len(valid_heights)

Out[5]:

EstimatedPdf is an interface to scipy.stats.gaussian_kde, which performs kernel density estimation.

In [6]:

import thinkstats2
pdf = thinkstats2.EstimatedPdf(valid_heights)

The kde object provides resample:

In [7]:

fillable = pdf.kde.resample(len(df)).flatten()
fillable.shape

Out[7]:

(5000,)
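
For readers without thinkstats2, the same resampling can be sketched with scipy.stats.gaussian_kde directly. The data here are synthetic (drawn from a normal distribution chosen only for illustration). For one-dimensional input, resample returns an array with one row, which is why the cell above calls flatten.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the valid heights (illustrative values only).
rng = np.random.default_rng(17)
heights = rng.normal(170, 8, size=500)

# Fit a kernel density estimate to the data.
kde = gaussian_kde(heights)

# Draw 1000 new values from the estimated density.
sample = kde.resample(1000)   # 2-D array: one row per dimension
flat = sample.flatten()       # 1-D array of resampled heights

print(sample.shape, flat.shape)
```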

Or you can use thinkstats2 objects instead. First convert from EstimatedPdf to Pmf:

In [8]:

import thinkplot
pmf = pdf.MakePmf()
thinkplot.Pdf(pmf)

You can use the Pmf to generate a random sample, but it is faster to convert to Cdf:

In [9]:

cdf = pmf.MakeCdf()
fillable = cdf.Sample(len(df))
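
Sampling from a Cdf is fast because it can be done by inverse transform sampling: draw uniform random numbers and look each one up in the cumulative probabilities with a binary search. The sketch below shows the idea with plain NumPy on a made-up four-value distribution. It is an illustration of the technique, not the thinkstats2 implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up discrete distribution: values and their probabilities.
xs = np.array([150.0, 160.0, 170.0, 180.0])
ps = np.array([0.1, 0.3, 0.4, 0.2])

# Cumulative probabilities; normalize so the last entry is exactly 1.
cum = np.cumsum(ps)
cum /= cum[-1]

# Draw uniform numbers in [0, 1) and find, for each one, the first
# value whose cumulative probability is at least that large.
u = rng.random(10000)
idx = np.searchsorted(cum, u, side='left')
sample = xs[idx]

print(sample[:5])
```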

Then we can use fillna to replace the NaNs:

In [10]:

import pandas
series = pandas.Series(fillable)
# assign the result back; in-place fillna through attribute access may not update df
df['height'] = df.height.fillna(series)
sum(df.height.isnull())

Out[10]:

In [11]:

cdf = thinkstats2.Cdf(df.height)
thinkplot.Cdf(cdf)

Out[11]:

{'xscale': 'linear', 'yscale': 'linear'}
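
When fillna is given a Series, it matches replacement values to missing values by index label, so only the positions that are NaN are taken from the replacement Series. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up heights with missing values at index labels 1 and 3.
heights = pd.Series([170.0, np.nan, 165.0, np.nan])

# Replacement values sharing the same default index (0, 1, 2, 3).
fill = pd.Series([1.0, 2.0, 3.0, 4.0])

# Only the NaN positions are filled, using the values at matching labels.
filled = heights.fillna(fill)

print(filled.tolist())
```

Because the alignment is by label, a replacement Series whose index does not cover the missing rows would leave those rows as NaN.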

In [14]:

import brfss
resp = brfss.ReadBrfss(nrows=5000).dropna(subset=['sex', 'htm3'])
grouped = resp.groupby('sex')

In [30]:

for i, group in grouped:
    print(i, group.shape)

1 (1611, 6)
2 (3347, 6)

In [26]:

grouped.get_group(1).mean()

Out[26]:

age         54.868077
sex          1.000000
wtyrago     92.561772
finalwt    782.980373
wtkg2       91.694793
htm3       179.120422
dtype: float64

In [23]:

grouped.mean()

Out[23]:

           age    wtyrago     finalwt      wtkg2        htm3
sex
1    54.868077  92.561772  782.980373  91.694793  179.120422
2    54.891468  76.803614  418.628225  76.282271  163.973110

In [21]:

grouped['htm3'].mean()

Out[21]:

sex
1      179.120422
2      163.973110
Name: htm3, dtype: float64

In [27]:

grouped.htm3.std()

Out[27]:

sex
1      7.643153
2      7.013427
Name: htm3, dtype: float64

In [18]:

import numpy
grouped.aggregate(numpy.mean)

Out[18]:

           age    wtyrago     finalwt      wtkg2        htm3
sex
1    54.868077  92.561772  782.980373  91.694793  179.120422
2    54.891468  76.803614  418.628225  76.282271  163.973110

In [19]:

grouped.aggregate(numpy.std)

Out[19]:

           age    wtyrago     finalwt      wtkg2      htm3
sex
1    15.979595  20.082096  891.896249  19.124774  7.643153
2    16.318462  20.818984  480.059627  19.615797  7.013427
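
The separate mean and std calls above can also be combined into one call by passing a list of function names to agg. This is a sketch on a tiny made-up DataFrame, not the BRFSS data. The string names select pandas' own implementations, so 'std' is the sample standard deviation (ddof=1).

```python
import pandas as pd

# Tiny made-up dataset with the same column names as the example above.
resp = pd.DataFrame({
    'sex': [1, 1, 2, 2, 2],
    'htm3': [180.0, 170.0, 160.0, 165.0, 170.0],
})

# One row per group, one column per statistic.
summary = resp.groupby('sex')['htm3'].agg(['mean', 'std', 'count'])

print(summary)
```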

In [13]:

d = {}
for name, group in grouped:
    d[name] = group.htm3.values

In [13]:

d

Out[13]:

{1: array([ 170.,  185.,  183., ...,  178.,  175.,  170.]),
 2: array([ 157.,  163.,  165., ...,  168.,  157.,  173.])}

In [1]:

import brfss
resp = brfss.ReadBrfss().dropna(subset=['sex', 'wtkg2'])

In [4]:

groups = resp.groupby('sex')
d = {}
for name, group in groups:
    d[name] = group.wtkg2

In [5]:

d

Out[5]:

{1: 3      73.64
 4      88.64
 5     109.09
 8      90.00
 9      77.27
 10     63.64
 13    127.27
 20     76.36
 23     78.18
 26     77.27
 35     81.82
 39     90.00
 42     90.91
 45     93.18
 50     81.82
 ...
 414468     63.64
 414470     81.82
 414474     87.73
 414475    113.64
 414477    100.91
 414480     89.09
 414481     76.36
 414488    102.27
 414490     71.82
 414498     75.00
 414501     86.36
 414503     78.18
 414504     88.64
 414506     90.91
 414508     75.00
 Name: wtkg2, Length: 153900, dtype: float64, 2: 0      70.91
 1      72.73
 6      50.00
 7     122.73
 11     78.18
 12     62.73
 14     95.45
 15     88.64
 16     90.91
 17     50.00
 18    100.00
 19     72.73
 21     63.64
 22     55.45
 24     90.91
 ...
 414483     68.18
 414484     87.27
 414485     77.27
 414486     61.36
 414489     86.36
 414492     70.45
 414493     56.82
 414494     68.18
 414495     72.73
 414496     56.82
 414497     65.91
 414499    129.55
 414500     75.00
 414505     72.73
 414507     89.09
 Name: wtkg2, Length: 244584, dtype: float64}

In [9]:

import numpy

for sex, weights in d.items():
    print(sex, numpy.log(weights).mean(), numpy.log(weights).std())

(1, 4.4693001977146656, 0.19557721757853172)
(2, 4.2596856357921178, 0.22599757494674719)

In [10]:

import scipy.stats

In [28]:

# fix loc at 0 so that only shape and scale are estimated
shape, loc, scale = scipy.stats.lognorm.fit(d[1], floc=0)
shape, loc, scale

Out[28]:

(0.19557677265342968, 0, 87.295636995626353)

In [27]:

shape, loc, scale = scipy.stats.lognorm.fit(d[2], floc=0)
shape, loc, scale

Out[27]:

(0.22599714638037718, 0, 70.787767991419116)
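
With loc fixed at 0, the fitted shape is close to the standard deviation of the log weights and the fitted scale is close to exp of their mean. That matches the numbers above: exp(4.4693) is about 87.3 and exp(4.2597) is about 70.8. The sketch below checks the relationship on synthetic lognormal data (the parameters 4.4 and 0.2 are arbitrary choices for illustration).

```python
import numpy as np
from scipy import stats

# Synthetic lognormal sample; mean and sigma refer to log(weights).
rng = np.random.default_rng(3)
weights = rng.lognormal(mean=4.4, sigma=0.2, size=5000)

# Fit with loc fixed at 0, as in the cells above.
shape, loc, scale = stats.lognorm.fit(weights, floc=0)

# Compare with the moments of the log-transformed data.
logs = np.log(weights)
print(shape, logs.std())
print(scale, np.exp(logs.mean()))
```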

In [12]:

import thinkstats2
cdf = thinkstats2.Cdf(d[2])

In [14]:

import thinkplot
%matplotlib inline

thinkplot.Cdf(cdf)

Out[14]:

{'xscale': 'linear', 'yscale': 'linear'}

In [25]:

# frozen lognormal with the rounded parameters fitted for group 2: shape, loc, scale
rv = scipy.stats.lognorm(0.23, 0, 70.8)

In [26]:

import matplotlib.pyplot as pyplot
xs = numpy.linspace(20, 200, 100)
ys = rv.cdf(xs)
thinkplot.Cdf(cdf)
pyplot.plot(xs, ys)

Out[26]:

[<matplotlib.lines.Line2D at 0x7f6a0b53e150>]
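
The plot compares the model and the data by eye. One way to put a number on the comparison, sketched here on synthetic data rather than the BRFSS weights, is to compute the empirical CDF with NumPy and take the largest vertical gap between it and the model CDF.

```python
import numpy as np
from scipy import stats

# Synthetic weights drawn from the same lognormal used for the model.
rng = np.random.default_rng(5)
weights = rng.lognormal(mean=np.log(70.8), sigma=0.23, size=20000)

# Frozen model: shape, loc, scale.
rv = stats.lognorm(0.23, 0, 70.8)

# Empirical CDF: fraction of values less than or equal to each sorted value.
xs = np.sort(weights)
ecdf = np.arange(1, len(xs) + 1) / len(xs)

# Largest vertical distance between the empirical and model CDFs.
max_gap = np.max(np.abs(ecdf - rv.cdf(xs)))
print(max_gap)
```

A small gap means the model tracks the data closely. For real data, the gap also reflects any mismatch between the lognormal model and the true distribution.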
