This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

7.6. Estimating a probability distribution nonparametrically with a Kernel Density Estimation

You need to download the Storms dataset on the book's website, and extract it in the current directory. (http://ipython-books.github.io)

You also need matplotlib's toolkit basemap. (http://matplotlib.org/basemap/)

  1. Let's import the usual packages. The kernel density estimation with a Gaussian kernel is implemented in SciPy.stats.
In [ ]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
%matplotlib inline
  1. Let's open the data with Pandas.
In [ ]:
# http://www.ncdc.noaa.gov/ibtracs/index.php?name=wmo-data
df = pd.read_csv("data/Allstorms.ibtracs_wmo.v03r05.csv")
  1. The dataset contains information about most storms since 1848. A single storm may appear multiple times across several consecutive days.
In [ ]:
df[df.columns[[0,1,3,8,9]]].head()
  1. We use Pandas' groupby function to obtain the average location of every storm.
In [ ]:
dfs = df.groupby('Serial_Num')
pos = dfs[['Latitude', 'Longitude']].mean()
y, x = pos.values.T
pos.head()
  1. We display the storms on a map with basemap. This toolkit allows us to easily project the geographical coordinates on the map.
In [ ]:
m = Basemap(projection='mill', llcrnrlat=-65 ,urcrnrlat=85,
            llcrnrlon=-180, urcrnrlon=180)
x0, y0 = m(-180, -65)
x1, y1 = m(180, 85)
plt.figure(figsize=(10,6))
m.drawcoastlines()
m.fillcontinents(color='#dbc8b2')
xm, ym = m(x, y)
m.plot(xm, ym, '.r', alpha=.1);
  1. To perform the Kernel Density Estimation, we need to stack the x and y coordinates of the storms into a 2xN array.
In [ ]:
h = np.vstack((xm, ym))
In [ ]:
kde = st.gaussian_kde(h)
  1. The gaussian_kde routine returned a Python function. To see the results on a map, we need to evaluate this function on a 2D grid spanning the entire map. We create this grid with meshgrid, and we pass the x, y values to the kde function. We need to arrange the shape of the array since kde accepts a 2xN array as input.
In [ ]:
k = 50
tx, ty = np.meshgrid(np.linspace(x0, x1, 2*k),
                     np.linspace(y0, y1, k))
v = kde(np.vstack((tx.ravel(), ty.ravel()))).reshape((k, 2*k))
  1. Finally, we display the estimated density with imshow.
In [ ]:
plt.figure(figsize=(10,6))
m.drawcoastlines()
m.fillcontinents(color='#dbc8b2')
xm, ym = m(x, y)
m.imshow(v, origin='lower', extent=[x0,x1,y0,y1],
         cmap=plt.get_cmap('Reds'));

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).