Baby Names

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)
Out[1]:
In [2]:
import numpy as np
import pandas as pd
import addutils
from IPython.display import display
import bokeh.plotting as bk
bk.output_notebook()
Loading BokehJS ...
In [3]:
addutils.css_notebook()
Out[3]:

1 Load and prepare the data

We downloaded statistics about baby names choosen over years in the U.S. from: http://www.babycenter.com/baby-names and we stored them on our example data folder.

In [4]:
dataFolder = 'temp/baby_names/'
columnNames = ['name', 'sex', 'births']

names1880 = pd.read_csv(dataFolder+'yob1880.txt', names=columnNames)
names1880.head()
Out[4]:
name sex births
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746

This shows some of the names choosen during 1880.

Now we want to read all the files in the years that spaces from 1880 to 2011.

In [5]:
years = range(1880, 2012)
parts = []
for year in years:
    path = '{0}yob{1}.txt'.format(dataFolder, year)
    frame = pd.read_csv(path, names=columnNames)
    frame['year'] = year  
    parts.append(frame)

Now parts is a python list containing pandas.DataFrame(s). Let's create a single DataFrame containing all the names.

In [6]:
names = pd.concat(parts, ignore_index=True)
names[::10**5]
Out[6]:
name sex births year
0 Mary F 7065 1880
100000 Ernie F 13 1912
200000 Lemoyne M 9 1922
300000 Derrell M 43 1932
400000 Valentine M 56 1943
500000 Neal M 968 1953
600000 Konni F 14 1962
700000 Howard F 18 1970
800000 Tomecca F 6 1976
900000 Martrell M 8 1981
1000000 Clearence M 6 1986
1100000 Indiana F 15 1991
1200000 Aminata F 30 1995
1300000 Yutaro M 8 1998
1400000 Mc F 14 2002
1500000 Marshayla F 7 2005
1600000 Treonna F 9 2008
1700000 Kellsie F 13 2011

pandas.concat concatenates pandas objects along a particular axis. If the optional parameter ignore_insex is True, concat won't use index values on the concatenation. Values from 0 to n-1 will be used instead.

2 Pivoting

DataFrame.pivot_table creates a spreadsheet-style pivot table as a DataFrame. aggfunc parameter specifies a list of aggregation functions to use on elements, margins tells if grandtotal/subtotals are to be added to all columns/rows.

In [7]:
from bokeh.models.ranges import Range1d
totalBirths = names.pivot_table('births', index='year', columns='sex',
                                aggfunc=sum, margins=False)
#display(totalBirths.head())
#totalBirths[['F', 'M']][:-1].plot(title='Total births by sex and year')
fig = bk.figure(plot_width=750, plot_height=300, title=None)
fig.line(x=totalBirths.index, y=totalBirths['F'], legend='F', line_color='magenta')
fig.line(x=totalBirths.index, y=totalBirths['M'], legend='M', line_color='royalblue')
fig.legend.location = 'bottom_right'
fig.xaxis.axis_label = 'Year'
fig.yaxis.axis_label = 'Total births'
fig.yaxis[0].formatter.use_scientific = False
bk.show(fig)

3 Splitting

Let's see a couple of example about splitting data.

In the first example we are going to view the number of births grouped by year and sex.

In [8]:
names.groupby(['year', 'sex'])['births'].sum().head()
Out[8]:
year  sex
1880  F       90994
      M      110492
1881  F       91955
      M      100747
1882  F      107851
Name: births, dtype: int64

The second example shows how to split the names in two groups: Boys and Girls.

In [9]:
boys = names[names.sex == 'M']
girls = names[names.sex == 'F']
display(boys[:2000:100])
name sex births year
942 John M 9655 1880
1042 Perry M 134 1880
1142 Clayton M 60 1880
1242 Judson M 31 1880
1342 Wilmer M 19 1880
1442 Rubin M 14 1880
1542 Alois M 10 1880
1642 Fayette M 8 1880
1742 Toney M 7 1880
1842 Titus M 6 1880
1942 Leonidas M 5 1880
2980 Tom M 349 1881
3080 Lester M 87 1881
3180 Elias M 41 1881
3280 Hans M 23 1881
3380 Aubrey M 15 1881
3480 Ford M 11 1881
3580 Rafael M 9 1881
3680 Handy M 7 1881
3780 Orla M 6 1881

We can see how many boys with a specific name were born each year.

In [10]:
boys[boys['name']=='Jayden']
Out[10]:
name sex births year
824344 Jayden M 7 1977
843205 Jayden M 6 1978
899512 Jayden M 9 1981
940021 Jayden M 6 1983
960469 Jayden M 5 1984
977725 Jayden M 10 1985
997266 Jayden M 14 1986
1019074 Jayden M 11 1987
1040265 Jayden M 16 1988
1062845 Jayden M 22 1989
1087121 Jayden M 25 1990
1111407 Jayden M 38 1991
1136484 Jayden M 45 1992
1161561 Jayden M 77 1993
1187019 Jayden M 159 1994
1212851 Jayden M 239 1995
1238981 Jayden M 294 1996
1265586 Jayden M 387 1997
1292859 Jayden M 620 1998
1320950 Jayden M 1230 1999
1350145 Jayden M 1821 2000
1380163 Jayden M 2833 2001
1410493 Jayden M 3853 2002
1441364 Jayden M 5542 2003
1472909 Jayden M 6920 2004
1505279 Jayden M 8244 2005
1538659 Jayden M 9610 2006
1573173 Jayden M 15206 2007
1607962 Jayden M 17105 2008
1642694 Jayden M 17217 2009
1676954 Jayden M 17101 2010
1710637 Jayden M 16861 2011
In [11]:
bBirths = boys.pivot_table('births', index='year', columns='name',
                           aggfunc=sum, margins=False)
subset = bBirths[['Ray', 'Elvis', 'Sam', 'John', 'Marvin', 'Bob']]

plots = []
for name in subset.columns:
    fig = bk.figure(plot_height=200, plot_width=700, title=None)
    fig.line(x=np.asarray(subset.index), y=np.asarray(subset[name]),
             line_color='black', legend=name)
    plots.append([fig])
bk.show(bk.gridplot(plots))

# Or directly using Pandas (which uses Matplotlib, not Bokeh): 
#subset.plot(subplots=True, figsize=(12, 10), grid=False,
#            title="Number of births per year")

4 Using 'groupby'

Now we are going to add a column named 'prop` that shows the ratio: $\frac{\text{children with a specific name}}{\text{total children}}$

In [12]:
def add_prop(group):
    births = group['births']
    group['prop'] = births/float(births.sum())
    return group
In [13]:
names = names.groupby(['year', 'sex']).apply(add_prop)
display(names.head())
name sex births year prop
0 Mary F 7065 1880 0.077642
1 Anna F 2604 1880 0.028617
2 Emma F 2003 1880 0.022012
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188

Let's check our calculations by verifying that the sum of all porportions by sex must be equal (or at least close) to 1.

In [14]:
np.allclose(names.groupby(['year', 'sex'])['prop'].sum(), 1)
Out[14]:
True

Now we want to extract the top names for each sex/year combination.

In [15]:
def get_top(group, topNumber):
    return group.sort_values(by='births', ascending=False)[:topNumber]

grouped = names.groupby(['year', 'sex'])
topNames = grouped.apply(get_top, topNumber=10)
# rename indexes to avoid warning; index and columns should have different names
topNames.index.rename(['year_', 'sex_', None], inplace=True)
topNames[:50]
Out[15]:
name sex births year prop
year_ sex_
1880 F 0 Mary F 7065 1880 0.077642
1 Anna F 2604 1880 0.028617
2 Emma F 2003 1880 0.022012
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188
5 Margaret F 1578 1880 0.017342
6 Ida F 1472 1880 0.016177
7 Alice F 1414 1880 0.015539
8 Bertha F 1320 1880 0.014506
9 Sarah F 1288 1880 0.014155
M 942 John M 9655 1880 0.087382
943 William M 9533 1880 0.086278
944 James M 5927 1880 0.053642
945 Charles M 5348 1880 0.048402
946 George M 5126 1880 0.046392
947 Frank M 3242 1880 0.029341
948 Joseph M 2632 1880 0.023821
949 Thomas M 2534 1880 0.022934
950 Henry M 2444 1880 0.022119
951 Robert M 2415 1880 0.021857
1881 F 2000 Mary F 6919 1881 0.075243
2001 Anna F 2698 1881 0.029340
2002 Emma F 2034 1881 0.022120
2003 Elizabeth F 1852 1881 0.020140
2004 Margaret F 1658 1881 0.018031
2005 Minnie F 1653 1881 0.017976
2006 Ida F 1439 1881 0.015649
2007 Annie F 1326 1881 0.014420
2008 Bertha F 1324 1881 0.014398
2009 Alice F 1308 1881 0.014224
M 2938 John M 8769 1881 0.087040
2939 William M 8524 1881 0.084608
2940 James M 5442 1881 0.054016
2941 George M 4664 1881 0.046294
2942 Charles M 4636 1881 0.046016
2943 Frank M 2834 1881 0.028130
2944 Joseph M 2456 1881 0.024378
2945 Henry M 2339 1881 0.023217
2946 Thomas M 2282 1881 0.022651
2947 Edward M 2177 1881 0.021609
1882 F 3935 Mary F 8149 1882 0.075558
3936 Anna F 3143 1882 0.029142
3937 Emma F 2303 1882 0.021354
3938 Elizabeth F 2187 1882 0.020278
3939 Minnie F 2004 1882 0.018581
3940 Margaret F 1821 1882 0.016884
3941 Ida F 1673 1882 0.015512
3942 Alice F 1542 1882 0.014298
3943 Bertha F 1508 1882 0.013982
3944 Annie F 1492 1882 0.013834

This is our concluding example and we want to measure the increasing in name diversity.

In [16]:
from bokeh.models.ranges import Range1d

diversity = topNames.pivot_table('prop', index='year', columns='sex', aggfunc=sum)

fig = bk.figure(plot_width=750, plot_height=300, title=None)
fig.line(x=diversity.index, y=diversity['F'], line_color='green', legend='F')
fig.line(x=diversity.index, y=diversity['M'], line_color='blue',  legend='M')
fig.y_range = Range1d(0, 1.2)
bk.show(fig)

# Or, using directly Pandas' "plot" method (which calls Matplotlib, not Bokeh)
# diversity.plot(title='Sum of diversity.prop by year and sex',
#                yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

Visit www.add-for.com for more tutorials and updates.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.