This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

10.3. Computing the autocorrelation of a time series

Download the Babies dataset from the book's website (http://ipython-books.github.io).

  1. We import the packages.
In [ ]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
  2. We read the data with pandas. The dataset contains one CSV file per year. Each file lists all baby names given that year, with their respective frequencies. We load the data into a dictionary containing one DataFrame per year.
In [ ]:
files = [file for file in os.listdir('data/') 
         if file.startswith('yob')]
In [ ]:
years = np.array(sorted([int(file[3:7]) 
                         for file in files]))
In [ ]:
data = {year: 
        pd.read_csv('data/yob{y:d}.txt'.format(y=year), 
                    index_col=0, header=None, 
                    names=['First name', 'Gender', 'Number']) 
        for year in years}
In [ ]:
data[2012].head()
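To make the effect of the read_csv arguments concrete, here is a minimal sketch that parses a few made-up rows in the same three-column layout (name, gender, count). The names and counts below are invented for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the yobYYYY.txt files (illustrative only).
sample = io.StringIO("Alice,F,120\nAlex,F,15\nAlex,M,300\n")
df = pd.read_csv(sample, index_col=0, header=None,
                 names=['First name', 'Gender', 'Number'])

# The first column becomes the index, so rows are addressed by name.
print(df.index.name)       # First name
print(list(df.columns))    # ['Gender', 'Number']

# The year is encoded in characters 3 to 6 of the file name.
print(int('yob2012.txt'[3:7]))    # 2012
```

Because the name is the index, the same name can appear twice (once per gender), which is why the lookup functions filter on the Gender column first.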
  3. We write functions to retrieve the frequencies of baby names as a function of the name, gender, and birth year.
In [ ]:
def get_value(name, gender, year):
    """Return the number of babies born a given year, with a 
    given gender and a given name."""
    try:
        return data[year][data[year]['Gender'] == gender] \
               ['Number'][name]
    except KeyError:
        return 0
In [ ]:
def get_evolution(name, gender):
    """Return the evolution of a baby name over the years."""
    return np.array([get_value(name, gender, year) 
                     for year in years])
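The lookup pattern used above (filter by gender, select the Number column, index by name, fall back to 0 on a KeyError) can be tried in isolation on a toy table. This is only a sketch: the helper name lookup and the rows are hypothetical and not part of the recipe.

```python
import pandas as pd

# Toy table with made-up counts, indexed by first name like the real DataFrames.
toy = pd.DataFrame({'Gender': ['F', 'F', 'M'],
                    'Number': [120, 15, 300]},
                   index=pd.Index(['Alice', 'Alex', 'Alex'], name='First name'))

def lookup(df, name, gender):
    """Hypothetical helper mirroring get_value for a single DataFrame."""
    try:
        # Filter on gender first so that each name appears at most once.
        return df[df['Gender'] == gender]['Number'][name]
    except KeyError:
        # The name was not given to any baby of that gender.
        return 0

print(lookup(toy, 'Alex', 'F'))    # 15
print(lookup(toy, 'Alex', 'M'))    # 300
print(lookup(toy, 'Zoe', 'F'))     # 0
```

Returning 0 for missing names is what lets get_evolution build an array of the same length as years, even for names that are absent in some years.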
  4. Let's define a function that computes the autocorrelation of a signal. It relies on NumPy's correlate function and keeps only the non-negative lags.
In [ ]:
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    # Integer division: a float index raises an error in Python 3.
    return result[result.size // 2:]
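A small worked example may help to see why only the second half of the result is kept. With mode='full', np.correlate returns one value per lag from -(n-1) to n-1, and the output is symmetric around lag 0 when a real signal is correlated with itself. The three-sample signal below is arbitrary and chosen only so that the sums are easy to check by hand.

```python
import numpy as np

x = np.array([1, 2, 3])
full = np.correlate(x, x, mode='full')

# Lags -2, -1, 0, 1, 2: symmetric, with the maximum at lag 0.
print(full)                     # [ 3  8 14  8  3]

# Lag 0 is the sum of squares: 1*1 + 2*2 + 3*3 = 14.
# Lag 1 is 1*2 + 2*3 = 8, and lag 2 is 1*3 = 3.
half = full[full.size // 2:]    # keep the non-negative lags only
print(half)                     # [14  8  3]
```

Note that this is the raw autocorrelation: the signal is neither mean-centered nor normalized, which is why the plotting function below divides by the maximum value.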
  5. Now, we create a function that displays the evolution of a baby name, as well as its autocorrelation.
In [ ]:
def autocorr_name(name, gender, color):
    x = get_evolution(name, gender)
    z = autocorr(x)
    # Evolution of the name.
    plt.subplot(121);
    plt.plot(years, x, '-o'+color, label=name);
    plt.title("Baby names");
    # Autocorrelation.
    plt.subplot(122);
    plt.plot(z / float(z.max()), '-'+color, label=name);
    plt.legend();
    plt.title("Autocorrelation");
  6. Let's take a look at two female names.
In [ ]:
plt.figure(figsize=(12,4));
autocorr_name('Olivia', 'F', 'k');
autocorr_name('Maria', 'F', 'y');

The autocorrelation of Olivia decays much faster than that of Maria. This is mainly because of the steep increase in the popularity of the name Olivia at the end of the twentieth century. By contrast, the popularity of the name Maria varies more slowly overall, and its autocorrelation decays more slowly.
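The link between a late, steep rise and a fast-decaying autocorrelation can be illustrated on synthetic signals. The sketch below does not use the baby names data: the two arrays are artificial stand-ins, and the length of 130 samples and the 20-sample ramp are arbitrary assumptions.

```python
import numpy as np

def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size // 2:]

n = 130
# Artificial "late steep rise": zero almost everywhere, ramp in the last 20 samples.
late = np.concatenate([np.zeros(n - 20), np.arange(1, 21)])
# Artificial "gradual change": a slow ramp over the whole range.
slow = np.arange(1, n + 1, dtype=float)

# Normalize by the maximum, as in autocorr_name.
late_ac = autocorr(late) / autocorr(late).max()
slow_ac = autocorr(slow) / autocorr(slow).max()

# At lag 20 the shifted copies of the late ramp no longer overlap,
# whereas the slow ramp still overlaps with itself over most samples.
print(late_ac[20], slow_ac[20])
```

The first value is zero while the second stays well above one half: energy concentrated in a short window makes the autocorrelation vanish beyond the width of that window.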

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).