This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.
Download the Babies dataset from the book's website (http://ipython-books.github.io).
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
We load the data into a dictionary holding one DataFrame per year.
files = [file for file in os.listdir('data/')
         if file.startswith('yob')]
years = np.array(sorted([int(file[3:7])
                         for file in files]))
data = {year:
        pd.read_csv('data/yob{y:d}.txt'.format(y=year),
                    index_col=0, header=None,
                    names=['First name', 'Gender', 'Number'])
        for year in years}
data[2012].head()
def get_value(name, gender, year):
    """Return the number of babies born in a given year, with a
    given gender and a given name."""
    try:
        return data[year][data[year]['Gender'] == gender] \
                   ['Number'][name]
    except KeyError:
        return 0
def get_evolution(name, gender):
    """Return the evolution of a baby name over the years."""
    return np.array([get_value(name, gender, year)
                     for year in years])
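The lookup performed by get_value can be tried in isolation. The following is a minimal sketch on a toy table: the names and counts are made up for illustration and do not come from the Babies dataset, and the helper `lookup` is hypothetical. It filters the rows by gender first, then looks the name up in the index, and falls back to 0 when that name and gender pair is absent.

```python
import pandas as pd

# Toy table with made-up counts, indexed by first name like the per-year tables.
toy = pd.DataFrame(
    {'Gender': ['F', 'M', 'F'], 'Number': [10, 7, 3]},
    index=pd.Index(['Alex', 'Alex', 'Sam'], name='First name'))

def lookup(df, name, gender):
    """Return the count for (name, gender) in one table, or 0 if absent."""
    # Keep only the rows of the requested gender, then select the counts.
    counts = df.loc[df['Gender'] == gender, 'Number']
    try:
        return int(counts[name])
    except KeyError:
        # The name does not occur with this gender in this table.
        return 0

alex_f = lookup(toy, 'Alex', 'F')
alex_m = lookup(toy, 'Alex', 'M')
sam_m = lookup(toy, 'Sam', 'M')
```

Filtering by gender before indexing by name matters because the same first name can appear once per gender in a given year.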
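We compute the autocorrelation with NumPy's correlate function.
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    # Integer division: a float index raises an error in Python 3.
    return result[result.size // 2:]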
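To see what the slicing does, here is a small sketch on a three-sample signal chosen purely for illustration (the function name `autocorr_sketch` is hypothetical). With mode='full', np.correlate returns every lag from -(n-1) to n-1, so the middle element is lag 0 and the second half holds the non-negative lags.

```python
import numpy as np

def autocorr_sketch(x):
    """Unnormalized autocorrelation of a 1D sequence for lags 0..n-1."""
    x = np.asarray(x, dtype=float)
    full = np.correlate(x, x, mode='full')  # lags -(n-1) .. (n-1)
    return full[full.size // 2:]            # keep lags 0 .. n-1

signal = [1, 2, 3]
acf = autocorr_sketch(signal)
# lag 0: 1*1 + 2*2 + 3*3 = 14
# lag 1: 1*2 + 2*3       = 8
# lag 2: 1*3             = 3
```

The value at lag 0 is always the largest, which is why dividing by the maximum normalizes the curve to start at 1.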
def autocorr_name(name, gender, color):
    x = get_evolution(name, gender)
    z = autocorr(x)
    # Evolution of the name.
    plt.subplot(121)
    plt.plot(years, x, '-o'+color, label=name)
    plt.title("Baby names")
    # Autocorrelation.
    plt.subplot(122)
    plt.plot(z / float(z.max()), '-'+color, label=name)
    plt.legend()
    plt.title("Autocorrelation")
plt.figure(figsize=(12,4));
autocorr_name('Olivia', 'F', 'k');
autocorr_name('Maria', 'F', 'y');
Olivia's autocorrelation decays much faster than Maria's. This is mainly because of the steep increase in the popularity of the name Olivia at the end of the twentieth century. By contrast, the popularity of the name Maria changes more slowly overall, so its autocorrelation decays more slowly.
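The link between a steep rise and a fast-decaying autocorrelation can be checked on synthetic series. This is only a sketch under stated assumptions: the two series below are artificial (an exponential rise and a gentle linear ramp), not the actual name counts, and the names `steep`, `slow`, and `normalized_autocorr` are hypothetical.

```python
import numpy as np

def normalized_autocorr(x):
    """Autocorrelation for lags 0..n-1, scaled so that lag 0 equals 1."""
    x = np.asarray(x, dtype=float)
    full = np.correlate(x, x, mode='full')
    acf = full[full.size // 2:]
    return acf / acf.max()

t = np.arange(100)
steep = np.exp(t / 5.0)          # most of the mass sits in the last few samples
slow = 1.0 + t / 99.0            # gentle linear ramp from 1 to 2

acf_steep = normalized_autocorr(steep)
acf_slow = normalized_autocorr(slow)

# Compare the two curves at a lag of 10 samples.
lag = 10
steep_at_lag = acf_steep[lag]
slow_at_lag = acf_slow[lag]
```

For the exponential series the normalized autocorrelation falls off roughly as exp(-lag / 5), whereas the ramp loses only a small fraction of its lag 0 value over the same 10 samples.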
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).