Relating Age to Name¶

This notebook is inspired by Nate Silver's recent article on How to Tell Someone's Age When All you Know is Her Name. It allows one to (almost) replicate the analysis done in the article, and provides more extensive features. I have done similar work using R, and you can find it here.

Data¶

We will uses four primary datasets.

We will download each of these datasets and process them to get them analysis ready.

In [1]:

%matplotlib inline
import re
import urllib
from zipfile import ZipFile
from path import path
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import seaborn as sns
from ggplot import *

/Users/ramnathv/anaconda/lib/python2.7/site-packages/pytz/__init__.py:29: UserWarning: Module argparse was already imported from /Users/ramnathv/anaconda/lib/python2.7/argparse.pyc, but /Users/ramnathv/anaconda/lib/python2.7/site-packages is being added to sys.path
  from pkg_resources import resource_stream

Baby Names¶

The first dataset we will be downloading is the bnames dataset provided by SSA

In [2]:

urllib.urlretrieve("http://www.ssa.gov/oact/babynames/names.zip", "names.zip")
zf = ZipFile("names.zip")
def read_names(f):
  data = pd.read_csv(zf.open(f), header = None, names = ['name', 'sex', 'n'])
  data['year'] = int(re.findall(r'\d+', f)[0])
  return data
  
bnames = pd.concat([read_names(f) for f in zf.namelist() if f.endswith('.txt')])
bnames.head()

Out[2]:

	name	sex	n	year
0	Mary	F	9217	1884
1	Anna	F	3860	1884
2	Emma	F	2587	1884
3	Elizabeth	F	2549	1884
4	Minnie	F	2243	1884

5 rows × 4 columns

Baby Names by State¶

The second dataset we will be downloading is the bnames_by_state dataset, also provided by SSA

In [3]:

urllib.urlretrieve("http://www.ssa.gov/oact/babynames/state/namesbystate.zip", "namesbystate.zip")
zf = ZipFile("namesbystate.zip")
def read_names2(f):
  return pd.read_csv(zf.open(f), header = None, names = ['state', 'sex', 'year', 'name', 'n'])
  
bnames_by_state = pd.concat([read_names2(f) for f in zf.namelist() if f.endswith('.TXT')])
bnames_by_state.head()

Out[3]:

	state	sex	year	name	n
0	MA	F	1910	Mary	988
1	MA	F	1910	Helen	473
2	MA	F	1910	Margaret	374
3	MA	F	1910	Dorothy	331
4	MA	F	1910	Alice	313

5 rows × 5 columns

Cohort Life Tables¶

The next dataset is actuarial cohort life tables provided by SSA. I was unable to figure out how to scrape this data using BeautifulSoup. So I used the R package XML to scrape these lifetables and saved it to lifetables.csv. If you are interested in the R code, you can find it here.

In [4]:

lifetables = pd.read_csv('lifetables.csv')
lifetables.head()

Out[4]:

	x	qx	lx	dx	Lx	Tx	ex	sex	year
0	0	0.14596	100000	14596	90026	5151511	51.52	M	1900
1	1	0.03282	85404	2803	84003	5061484	59.26	M	1900
2	2	0.01634	82601	1350	81926	4977482	60.26	M	1900
3	3	0.01052	81251	855	80824	4895556	60.25	M	1900
4	4	0.00875	80397	703	80045	4814732	59.89	M	1900

5 rows × 9 columns

You can read the documentation for the lifetables to understand the various parameters. The key column of interest to us is lx, which provides the number of people born in the year year who live upto the age x. Since we are in the year 2014, we are only interested in a subset of the data.

In [5]:

lifetables_2014 = lifetables[lifetables['year'] + lifetables['x'] == 2014]
lifetables_2014.head()

Out[5]:

	x	qx	lx	dx	Lx	Tx	ex	sex	year
114	114	0.76365	0	0	0	0	0.79	M	1900
234	114	0.75783	0	0	0	0	0.80	F	1900
344	104	0.46882	25	12	19	38	1.54	M	1910
464	104	0.42317	191	81	150	328	1.72	F	1910
574	94	0.26858	3122	838	2703	8656	2.77	M	1920

5 rows × 9 columns

The cohort life tables are provided only for every decade. Since we need the data by year, we will use spline interpolation to fill out the gaps.

In [6]:

def process(d, kind = 'slinear'):
  f = interp1d(d.year, d.lx, kind)
  year = np.arange(1900, 2011)
  lx = f(year)
  return pd.DataFrame({"year": year, "lx": lx, "sex": d.sex.iloc[1]})
lifetable_2014 = lifetables_2014.\
  groupby('sex', as_index = False).\
  apply(process)
lifetable_2014.head()

Out[6]:

	lx	sex	year
0	0.0	F	1900
1	19.1	F	1901
2	38.2	F	1902
3	57.3	F	1903
4	76.4	F	1904

5 rows × 3 columns

Census Live Births Data¶

Finally, we need live births data from the census to extrapolate the birth data to account for the correct that not all births were recorded by SSA till around 1930, since it wasn't mandatory.

In [7]:

urllib.urlretrieve("http://www.census.gov/statab/hist/02HS0013.xls", "02HS0013.xls")
dat = pd.read_excel('02HS0013.xls', sheetname = 'HS-13', skiprows = range(14))
tot_births = dat.ix[9:101,:2].reset_index(drop = True)
tot_births.columns = ['year', 'births']
tot_births = tot_births.convert_objects(convert_numeric = True)
tot_births.head()

Out[7]:

	year	births
0	1909	2718
1	1910	2777
2	1911	2809
3	1912	2840
4	1913	2869

5 rows × 2 columns

We will now merge this data with the bnames aggregated by year to compute a set of correction factors, which will be used to scale the number of births. I have taken a very naive approach to do this, and there might be better ways to accomplish the same.

In [8]:

cor_factors = bnames.\
  groupby('year', as_index = False).\
  sum().\
  merge(tot_births)
cor_factors['cor'] = cor_factors['births']*1000/cor_factors['n']
cor_factors = cor_factors[['year', 'cor']]
cor_factors.head()

Out[8]:

	year	cor
0	1909	5.316662
1	1910	4.701138
2	1911	4.360075
3	1912	2.874415
4	1913	2.523176

5 rows × 2 columns

We expand the correction factors data for the years 2002 to 2014 using the correction factor for the year 2001.

In [9]:

cor_new = pd.DataFrame({
  'year': range(2002, 2014),
  'cor': cor_factors.cor.iloc[-1]
})
cor_factors = pd.concat([cor_factors, cor_new])[['year', 'cor']]
cor_factors.head()

Out[9]:

	year	cor
0	1909	5.316662
1	1910	4.701138
2	1911	4.360075
3	1912	2.874415
4	1913	2.523176

5 rows × 2 columns

Analysis¶

Now that we have all the required data, we need a few helper functions to help us with our analysis. The first function we will write is get_data, which takes name, sex and state and returns a data frame with the distribution of number of births and number of people alive by year.

In [10]:

def get_data(name, sex, state = None):
  if state is None:
    dat = bnames
  else:
    dat = bnames_by_state[(bnames_by_state["state"] == state)]
  data = dat[(dat['name'] == name) & (dat['sex'] == sex)].\
    merge(cor_factors).\
    merge(lifetable_2014)
  data['n_cor'] = data['n']*data['cor']
  data['n_alive'] = data['lx']/(10**5)*data['n_cor']
  return data
get_data('Violet', 'F').head()

Out[10]:

	name	sex	n	year	cor	lx	n_cor	n_alive
0	Violet	F	945	1909	5.316662	171.9	5024.245779	8.636678
1	Violet	F	1037	1910	4.701138	191.0	4875.080412	9.311404
2	Violet	F	1183	1911	4.360075	1084.2	5157.968506	55.922695
3	Violet	F	1535	1912	2.874415	1977.4	4412.227601	87.247389
4	Violet	F	1751	1913	2.523176	2870.6	4418.081208	126.825439

5 rows × 8 columns

Our next helper function is plot_name which accepts the same arguments as get_data, but returns a plot of the distribution of number of births and number alive by year.

In [11]:

def plot_name(name, sex, state = None):
  data = get_data(name, sex, state)
  return ggplot(data, aes('year', 'n_cor')) +\
    geom_line() +\
    geom_area(aes(ymin = 0, ymax = 'n_alive'), alpha = 0.5)
plot_name("Joseph", "F")

Out[11]:

<ggplot: (279099421)>

In [12]:

plot_name("Violet", "F", "MA")

Out[12]:

<ggplot: (279096937)>

Estimate Age¶

We will now write a function that will help us figure out the probability that a person with a certain name is alive, as well as the quantiles of their age distribution. Since we are dealing with weighted data, I will use some code copied from the wquantiles module. The quantile function accepts a data array and a weights array to return a specific quantile

In [13]:

# from the module wquantiles https://github.com/nudomarinero/wquantiles/blob/master/weighted.py
import numpy as np
def quantile_1D(data, weights, quantile):
    """
    Compute the weighted quantile of a 1D numpy array.

    Parameters
    ----------
    data : ndarray
        Input array (one dimension).
    weights : ndarray
        Array with the weights of the same size of `data`.
    quantile : float
        Quantile to compute. It must have a value between 0 and 1.

    Returns
    -------
    quantile_1D : float
        The output value.
    """
    # Check the data
    if not isinstance(data, np.matrix) :
        data = np.asarray(data)
    if not isinstance(weights, np.matrix) :
        weights = np.asarray(weights)
    nd = data.ndim
    if nd != 1:
        raise TypeError("data must be a one dimensional array")
    ndw = weights.ndim
    if ndw != 1:
        raise TypeError("weights must be a one dimensional array")
    if data.shape != weights.shape:
        raise TypeError("the length of data and weights must be the same")
    if ((quantile > 1.) or (quantile < 0.)):
        raise ValueError("quantile must have a value between 0. and 1.")
    # Sort the data
    ind_sorted = np.argsort(data)
    sorted_data = data[ind_sorted]
    sorted_weights = weights[ind_sorted]
    # Compute the auxiliary arrays
    Sn = np.cumsum(sorted_weights)
    # TODO: Check that the weights do not sum zero
    Pn = (Sn-0.5*sorted_weights)/np.sum(sorted_weights)
    # Get the value of the weighted median
    return np.interp(quantile, Pn, sorted_data)


def quantile(data, weights, quantile):
    """
    Weighted quantile of an array with respect to the last axis.

    Parameters
    ----------
    data : ndarray
        Input array.
    weights : ndarray
        Array with the weights. It must have the same size of the last 
        axis of `data`.
    quantile : float
        Quantile to compute. It must have a value between 0 and 1.

    Returns
    -------
    quantile : float
        The output value.
    """
    # TODO: Allow to specify the axis
    nd = data.ndim
    if nd == 0:
        TypeError("data must have at least one dimension")
    elif nd == 1:
        return quantile_1D(data, weights, quantile)
    elif nd > 1:
        n = data.shape
        imr = data.reshape((np.prod(n[:-1]), n[-1]))
        result = np.apply_along_axis(quantile_1D, -1, imr, weights, quantile)
        return result.reshape(n[:-1])

We will use the quantile function to write an estimate_age function that will return the living probabilities and quantiles for a given name, sex and state.

In [14]:

def estimate_age(name, sex, state = None):
  data = get_data(name, sex, state)
  qs = [1, 0.75, 0.5, 0.25, 0]
  quantiles = [2014 - int(quantile(data.year, data.n_alive, q)) for q in qs]
  result = dict(zip(['q0', 'q25', 'q50', 'q75', 'q100'], quantiles))
  result['p_alive'] = round(data.n_alive.sum()/data.n_cor.sum()*100, 2)
  result['sex'] = sex
  result['name'] = name
  return pd.Series(result)
  
estimate_age('Gertrude', 'F')

Out[14]:

name       Gertrude
p_alive       18.17
q0                4
q100            105
q25              71
q50              82
q75              90
sex               F
dtype: object

In [15]:

estimate_age('Ava', 'F')

Out[15]:

name         Ava
p_alive    95.24
q0             4
q100         105
q25            6
q50            8
q75           10
sex            F
dtype: object

We can now use the estimate_age function to compute the quantiles for the most common names and replicate the plots in the Nate Silver article.

In [16]:

top_100_names = bnames.\
  groupby(['name', 'sex'], as_index = False).\
  sum().\
  sort('n', ascending = False)
top_25_females = top_100_names[(top_100_names["sex"] == "F")]
estimates = pd.concat([estimate_age(name, 'F') for name in top_25_females["name"].iloc[:25].tolist()], axis = 1)
estimates.T.sort('q50').reset_index()

Out[16]:

	index	name	p_alive	q0	q100	q25	q50	q75	sex
0	24	Emily	90.62	4	105	12	19	28	F
1	18	Ashley	98.5	4	97	18	24	28	F
2	10	Jessica	97.9	4	105	22	27	32	F
3	9	Sarah	83.63	4	105	19	28	37	F
4	17	Anna	51.81	4	105	17	34	65	F
5	3	Jennifer	96.63	4	98	29	36	42	F
6	1	Elizabeth	70.05	4	105	23	39	58	F
7	23	Michelle	95.92	4	99	29	40	47	F
8	20	Kimberly	95.87	4	81	30	41	48	F
9	15	Lisa	94.48	4	104	40	47	51	F
10	14	Karen	89.12	4	105	48	55	62	F
11	7	Susan	86.52	4	105	50	57	63	F
12	19	Donna	81.85	4	105	52	58	66	F
13	16	Sandra	85.46	4	105	50	59	67	F
14	12	Nancy	76.02	4	105	53	62	70	F
15	2	Patricia	77.8	4	105	53	62	70	F
16	4	Linda	84.87	4	105	55	62	67	F
17	22	Carol	77.49	4	105	56	64	71	F
18	0	Mary	51.35	4	105	53	64	75	F
19	5	Barbara	71.56	4	105	57	65	73	F
20	6	Margaret	45.63	4	105	51	65	76	F
21	21	Ruth	34.23	4	105	57	70	83	F
22	13	Betty	50.7	4	105	65	74	82	F
23	11	Helen	29.38	4	105	61	74	86	F
24	8	Dorothy	34.24	4	105	64	76	85	F

25 rows × 9 columns

We can go a step beyond the article and also use the state parameter to get more specific quantiles, if we know the place of birth of a person.

The final step for me is to convert all of this into an interactive webapp. I have done it with R using the shiny package. I am still getting the hang of web frameworks in python, so it will be a while before I get to doing this.