In this notebook we carry out a few simple reanalyses of the geographic locations of the births and deaths of notable individuals. The data were originally compiled and discussed in the following paper:

Schich et al. A Network Framework of Cultural History. Science, 1 August 2014: Vol. 345, no. 6196, pp. 558-562. DOI: 10.1126/science.1240064

The data can be obtained from the journal's web site.

Here we will only work with the Freebase ("SchichDataS1_FB") data set. We have taken the Freebase Excel spreadsheet and converted it to a compressed CSV file. To run the code below, you will need to do this conversion yourself and place the resulting data file in the working directory. If you use a compression method other than gzip, you will also need to change the file name and compression argument in the code below. Note that it is also possible to use the Pandas `read_excel` function to load the Excel file directly.
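As a rough illustration of the conversion step, the sketch below writes a small data frame to a gzip-compressed CSV file and reads it back. The helper name `save_compressed_csv`, the toy data, and the Excel file name mentioned in the comment are assumptions made for the example, not part of the original analysis.

```python
import os
import tempfile

import pandas as pd


def save_compressed_csv(frame, path):
    """Write a data frame to a gzip-compressed CSV file without the row index."""
    frame.to_csv(path, index=False, compression="gzip", encoding="utf-8")


# With the real data, the frame would come from something like
# pd.read_excel("SchichDataS1_FB.xlsx"); that file name is an assumption.
toy = pd.DataFrame({"PrsLabel": ["A", "B"], "BYear": [1900, 1950]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "toy.csv.gz")
    save_compressed_csv(toy, out_path)
    # Read the file back the same way the notebook loads the real data.
    roundtrip = pd.read_csv(out_path, compression="gzip", encoding="utf-8")
```

Writing without the row index keeps the CSV columns identical to the spreadsheet columns, so the column names used later in the notebook are unaffected.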

This notebook works with geospatial data and reviews some techniques in this area.

We begin by importing our standard libraries, and also a couple of libraries for handling geographic data.

In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # used for all plots below
import statsmodels.api as sm
from mpl_toolkits.basemap import Basemap
import pyproj
```

Next we load the Freebase data (change the file name and compression method here if needed).

In [2]:

```
data = pd.read_csv("SchichDataS1_FB.csv.gz", compression="gzip", encoding='utf-8')
```

We check the dimensions of the data frame:

In [3]:

```
print(data.shape)
```

We check the variable (column) names:

In [4]:

```
print(data.columns)
```

Here is a peek at the top few rows of the data set. Each record corresponds to one person. The first record is for David, the second king of Israel.

In [5]:

```
print(data.head())
```

In the next cell, we calculate the distance between the location where a person was born and the location where she or he died. This involves calculating geodesic distances, which can be handled using the `pyproj` bindings for the `proj` library. See here for more discussion about calculating geodesic distances in Python.

First we extract the relevant columns from the Pandas data frame and convert them to a numpy array. In general it is not necessary to do this conversion, but in this case we need to feed these data into a library that doesn't handle Pandas objects.

In [6]:

```
df = data.loc[:, ["BLocLat", "BLocLong", "DLocLat", "DLocLong"]].dropna()
dfa = np.asarray(df)
```
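The index of `df` still carries the original row labels after `dropna`, which matters when results computed on the numpy array are attached back to the full data frame. A minimal sketch on toy data (the column names `lat`, `lon`, and `total` are made up for the example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"lat": [10.0, np.nan, 30.0], "lon": [1.0, 2.0, 3.0]})
complete = toy.dropna()            # keeps index labels 0 and 2
values = np.asarray(complete)      # plain 2-D float array, labels are gone

# Attach a per-row result back to the full frame using the retained index.
# Rows that were dropped receive NaN.
toy["total"] = pd.Series(values.sum(axis=1), index=complete.index)
```

Here the row with the missing latitude ends up with a missing `total`, while the other rows are matched by label rather than by position.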

Next we do the distance calculations.

In [7]:

```
g = pyproj.Geod(ellps='WGS84')
_, _, dists = g.inv(dfa[:, 1], dfa[:, 0], dfa[:, 3], dfa[:, 2])
data["rdist"] = pd.Series(dists / 1000, df.index) # Convert distances to kilometers
```
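For readers without `pyproj`, a haversine formula gives a rough cross-check on these distances. This is a different technique from the one used above: it treats the earth as a sphere with an assumed mean radius of 6371 km, whereas `Geod(ellps='WGS84')` works on an ellipsoid, so the two will differ slightly. The function name `haversine_km` is hypothetical, and note that it takes latitude before longitude, while `g.inv` takes longitude first.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of the spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating point excursions outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0, 1)))


# Two toy pairs: identical points, and two antipodal points on the equator.
d = haversine_km(np.array([0.0, 0.0]), np.array([0.0, 0.0]),
                 np.array([0.0, 0.0]), np.array([0.0, 180.0]))
```

The first distance is zero and the second is half the circumference of the assumed sphere.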

A simple starting point is to look at the distribution of the distances between birth locations and death locations. There are two components to the distribution: many people were born and died in the same location, and among the remainder there is a wide distribution of distances centered at around 10^2.5, or roughly 320 km.

In [8]:

```
rdist = data["rdist"].dropna()
plt.clf()
plt.hist(np.log(1 + rdist) / np.log(10), bins=50, alpha=0.5)
plt.xlabel("Log10 distance (km)", size=15)
plt.ylabel("Frequency", size=15)
```

Out[8]:

We can calculate the proportion of people who were born and died in the same location:

In [9]:

```
print((data.rdist == 0).mean())
```
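One detail worth keeping in mind is the denominator. Comparing a missing value to zero yields `False`, so the mean of the comparison is taken over all rows, including people whose birth or death location is unknown. The toy sketch below (made-up values) shows how the two possible denominators differ.

```python
import numpy as np
import pandas as pd

rd = pd.Series([0.0, 0.0, 250.0, np.nan])  # toy distances; NaN = location missing

prop_all_rows = (rd == 0).mean()            # NaN == 0 is False, denominator is 4
prop_known = (rd.dropna() == 0).mean()      # denominator is 3
```

Which version is appropriate depends on whether the proportion is meant to describe everyone in the data set or only those with both locations recorded.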

We can also look at how the distance between birth locations and death locations has changed over time. First we define a new variable that holds the midpoint of a person's lifespan.

In [10]:

```
data["AYear"] = (data["BYear"] + data["DYear"]) / 2
```

Next we plot the distance between birth and death locations against this midpoint.

In [11]:

```
plt.clf()
plt.plot(data["AYear"], data["rdist"], 'o', alpha=0.2)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Distance (km)", size=15)
```

The distance is quite skewed, so we may get a different impression if we apply a log transform to it. We use a shifted log transform since many of the distances are exactly zero.

In [12]:

```
data["log_rdist"] = np.log(1 + data.rdist) / np.log(10)
```
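To make the shifted transform concrete, here is a small sketch on made-up distances. A distance of zero maps to zero rather than to negative infinity, and `np.log10(1 + x)` is equivalent to the `np.log(1 + x) / np.log(10)` form used above.

```python
import numpy as np

dist_km = np.array([0.0, 9.0, 99.0, 999.0])
shifted = np.log10(1 + dist_km)     # same as np.log(1 + dist_km) / np.log(10)
recovered = 10 ** shifted - 1       # inverse transform back to kilometers
```

For large distances the shift by 1 km is negligible, so the transformed values can be read as ordinary base-10 logs of the distance.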

Now we make a scatterplot of the log distance (between birth and death locations) against the midpoint of a person's life.

In [13]:

```
plt.clf()
plt.plot(data["AYear"], data["log_rdist"], 'o', alpha=0.2)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance (km)", size=15)
```

There are far more data records from recent history, so this plot may be misleading due to overplotting. We can fit the conditional mean curve using scatterplot smoothing to see how the average distance has changed over time.

In [14]:

```
lfit = sm.nonparametric.lowess(data.log_rdist, data.AYear, frac=0.1)
```

There is a rapid increase in the mean around 1700, as travel became much easier. There may also be a decrease in the distance traveled in recent decades, which could reflect the large numbers of people who were displaced in the middle of the 20th century due to World War II.

In [15]:

```
plt.clf()
plt.plot(data.AYear, data.log_rdist, 'o', color='grey', alpha=0.05)
plt.plot(lfit[:,0], lfit[:,1], '-', color='lime', lw=3, alpha=0.9)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance", size=15)
```
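As a rough cross-check on a smoothed curve like this one, the mean can also be computed within fixed time bins. This is a cruder technique than lowess, not the method used in this notebook; the sketch below uses made-up values and century-wide bins, and drops missing values first because the integer conversion cannot handle them.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "AYear": [1510.0, 1590.0, 1620.0, 1680.0, 1710.0, np.nan],
    "log_rdist": [1.0, 2.0, 2.0, 3.0, 3.5, 1.0],
})
toy = toy.dropna(subset=["AYear", "log_rdist"])

# Label each row by the first year of its century, then average within bins.
century = (np.floor(toy["AYear"] / 100) * 100).astype(int)
binned = toy.groupby(century)["log_rdist"].mean()
```

Bins containing few people give noisy means, which is one reason a smoother is often preferred for the early centuries.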

I was curious about some of the people who lived long ago and yet travelled great distances during their lives:

In [16]:

```
ii = (data["rdist"] > 3000) & (data["AYear"] < 1000)
data.loc[ii, ["PrsLabel", "AYear", "BLocLabel", "DLocLabel", "rdist"]]
```

Out[16]:

Wikipedia says that Kumarajiva was born and died in China and that his father was from Kashmir.

One more drill-down check -- who has the greatest distance between their birth and death locations? The circumference of the earth is 40,075km, so the maximum possible distance is half of this value. This person was born and died at nearly antipodal points on the earth's surface.

In [17]:

```
ii = data["rdist"].idxmax()  # index label of the largest distance (missing values are skipped)
print(data.loc[ii, :])
print(40075. / 2)
```
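When looking up a row like this, note that `.loc` expects an index label, not a position. Once rows with missing values have been dropped, the two no longer coincide. A minimal sketch on a toy frame (the values and index labels are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"rdist": [np.nan, 5.0, np.nan, 9.0]}, index=[10, 11, 12, 13])
kept = toy["rdist"].dropna()

label = kept.idxmax()                        # index label of the largest value
position = int(np.argmax(kept.to_numpy()))   # position within the shortened array

row = toy.loc[label]                         # label-based lookup in the full frame
```

Here the label is 13 while the position is 1, so passing the position to `.loc` would either fail or return the wrong row.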

We repeat the same analysis using only the data from 1600 to the present.

In [18]:

```
df = data[data.AYear >= 1600]
lfit = sm.nonparametric.lowess(df.log_rdist, df.AYear, frac=0.1)
```

In [19]:

```
plt.clf()
plt.plot(df.AYear, df.log_rdist, 'o', color='grey', alpha=0.05)
plt.plot(lfit[:,0], lfit[:,1], '-', color='lime', lw=3, alpha=0.9)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance", size=15)
```

In this section we will make some simple maps showing the birth locations.

In [20]:

```
df = data.loc[:, ["BLocLat", "BLocLong"]].dropna()
latit = np.asarray(df["BLocLat"])
longit = np.asarray(df["BLocLong"])
```

The following cell shows the natural way to make a map of the birth locations. However, it currently doesn't work due to a bug in Basemap. See the next cell for a workaround.

In [21]:

```
#mp = Basemap()
#plt.figure(figsize=(16, 12))
#mp.drawcoastlines()
#mp.plot(longit[0:10000], latit[0:10000], '.', latlon=True, color='blue')
```

In [22]:

```
mp = Basemap()
plt.figure(figsize=(16, 12))
mp.drawcoastlines()
x, y = mp(longit, latit)
mp.plot(x, y, 'o', color='blue', alpha=0.1, ms=4, latlon=False)
```

Out[22]:

In [23]:

```
data.columns
```

Out[23]:

Someone was supposedly born in the Pacific Ocean about one third of the way from Vancouver to Honolulu. Who was this? Is the point in the correct place?

Make a histogram showing the distribution of birth years.

For the people in the dataset who were born in each century, determine the proportion who were born in the southern hemisphere. Make a graph plotting this proportion against time. Do the same for the western hemisphere.