Birth and death locations of notable individuals

In this notebook we do a few simple reanalyses of the geographic locations of the births and deaths of notable individuals. The data were originally compiled and discussed in the paper cited below.

Schich et al.
A Network Framework of Cultural History
Science 1 August 2014:
Vol. 345 no. 6196 pp. 558-562
DOI: 10.1126/science.1240064

The data can be obtained from the journal's web site.

Here we will only work with the Freebase ("SchichDataS1_FB") data set. We have taken the Freebase Excel spreadsheet and converted it to a compressed CSV file. To run the code below, you will need to do this conversion yourself and place the resulting data file in the working directory. If you use a compression method other than gzip, you will also need to change the file name and compression argument in the code below. Note that it is also possible to use the Pandas read_excel function to load the Excel file directly.

This notebook involves working with geospatial data and reviews some techniques in this area.

We begin by importing our standard libraries, and also a couple of libraries for handling geographic data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plt.* calls below
import statsmodels.api as sm
from mpl_toolkits.basemap import Basemap
import pyproj

Next we load the Freebase data (change the file name and compression method here if needed).

In [2]:
data = pd.read_csv("SchichDataS1_FB.csv.gz", compression="gzip", encoding='utf-8')

We check the dimensions of the data frame:

In [3]:
print(data.shape)
(120211, 20)

We check the variable (column) names:

In [4]:
print(data.columns)
Index([u'Unnamed: 0', u'PrsID', u'PrsLabel', u'BYear', u'BLocLabel', u'BLocID', u'BLocLat', u'BLocLong', u'DYear', u'DLocLabel', u'DLocID', u'DLocLat', u'DLocLong', u'Gender', u'PerformingArts', u'Creative', u'Gov/Law/Mil/Act/Rel', u'Academic/Edu/Health', u'Sports', u'Business/Industry/Travel'], dtype='object')

Here is a peek at the top few rows of the data set. Each record corresponds to one person. The first record is for David, the second king of Israel.

In [5]:
print(data.head())
   Unnamed: 0      PrsID         PrsLabel  BYear         BLocLabel  \
0           0   /m/02cvn            David  -1039         Bethlehem   
1           1  /m/02xr84           Josiah   -648         Jerusalem   
2           2   /m/02pqx          Ezekiel   -621         Jerusalem   
3           3   /m/0fyrw            Solon   -637  Classical Athens   
4           4  /m/03d0c1  Cyrus the Great   -575            Anshan   

       BLocID    BLocLat   BLocLong  DYear  DLocLabel    DLocID    DLocLat  \
0    /m/01cy_  31.703100  35.195600   -969  Jerusalem  /m/0430_  31.783333   
1    /m/0430_  31.783333  35.216667   -608  Jerusalem  /m/0430_  31.783333   
2    /m/0430_  31.783333  35.216667   -569    Babylon  /m/01cyh  32.541667   
3  /m/03nmzqx  37.966667  23.716667   -557     Cyprus  /m/01ppq  35.225120   
4   /m/05h83n  29.900000  52.400000   -528  Syr Darya  /m/0k8fk  40.850000   

    DLocLong Gender  PerformingArts  Creative  Gov/Law/Mil/Act/Rel  \
0  35.216667   Male               0         0                    1   
1  35.216667   Male               0         0                    0   
2  44.423333   Male               0         0                    1   
3  33.612441   Male               0         0                    0   
4  68.666667   Male               0         0                    0   

   Academic/Edu/Health  Sports  Business/Industry/Travel  
0                    0       0                         0  
1                    0       0                         0  
2                    0       0                         0  
3                    0       0                         0  
4                    0       0                         0  

Distance from birth location to death location

In the next cell, we calculate the distance between the location where a person was born and the location where she or he died. This involves calculating geodesic distances, which can be handled using the pyproj bindings for the proj library. See here for more discussion about calculating geodesic distances in Python.

First we extract the relevant columns from the Pandas data frame, drop the records with missing coordinates, and convert the result to a numpy array. In general it is not necessary to do this conversion, but in this case we need to feed these data into a library that doesn't handle Pandas objects.

In [6]:
df = data.loc[:, ["BLocLat", "BLocLong", "DLocLat", "DLocLong"]].dropna()
dfa = np.asarray(df)

Next we do the distance calculations.

In [7]:
g = pyproj.Geod(ellps='WGS84')
_, _, dists = g.inv(dfa[:, 1], dfa[:, 0], dfa[:, 3], dfa[:, 2])
data["rdist"] = pd.Series(dists / 1000, df.index) # Convert distances to kilometers
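As a rough cross-check on the pyproj result, the same kind of distance can be approximated with the haversine formula, which treats the earth as a sphere rather than the WGS84 ellipsoid. This is only a sketch: the function name and the mean radius of 6371 km are assumptions made here, and the result will differ slightly from the ellipsoidal distance computed above. Note also that this helper takes latitude first, whereas the pyproj call above passes longitudes first.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius for the spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Bethlehem to Jerusalem, using the coordinates of the first record above
print(haversine_km(31.7031, 35.1956, 31.783333, 35.216667))
```

For short distances like this one the two methods should agree closely; the spherical error grows for long routes.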

A simple starting point is to look at the distribution of the distances between birth locations and death locations. There are two components to the distribution: many people were born and died in the same location; of the remainder, there is a wide distribution of distances centered at around 10^2.5, or roughly 320 km.

In [8]:
rdist = data["rdist"].dropna()
plt.clf()
plt.hist(np.log(1 + rdist) / np.log(10), bins=50, alpha=0.5)
plt.xlabel("Log10 distance (km)", size=15)
plt.ylabel("Frequency", size=15)
Out[8]:
<matplotlib.text.Text at 0x7f7fdcd5dc90>

We can calculate the proportion of people who were born and died in the same location:

In [9]:
print((data.rdist == 0).mean())
0.144753807888
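One detail worth keeping in mind when reading this proportion: a comparison of a missing value (NaN) with zero evaluates to False, so any records without a computed distance are still counted in the denominator. The following small sketch, using made-up distances rather than the real data, illustrates the difference between the two possible denominators.

```python
import math

# Hypothetical distances in km; nan marks a record with a missing location
dists = [0.0, 12.5, math.nan, 0.0, 830.2]

# nan == 0 is False, so missing records stay in the denominator
share_all = sum(d == 0 for d in dists) / len(dists)

# Restrict to records where the distance is known
known = [d for d in dists if not math.isnan(d)]
share_known = sum(d == 0 for d in known) / len(known)

print(share_all, share_known)
```

In the notebook, the analogous restriction would be `(rdist == 0).mean()`, using the `rdist` series from In [8], from which missing values have already been dropped.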

We can also look at how the distance between birth locations and death locations has changed over time. First we define a new variable that holds the midpoint of a person's lifespan.

In [10]:
data["AYear"] = (data["BYear"] + data["DYear"]) / 2

Next we plot the distance between birth and death locations against this midpoint.

In [11]:
plt.clf()
plt.plot(data["AYear"], data["rdist"], 'o', alpha=0.2)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Distance (km)", size=15)

The distance is quite skewed, so we may get a different impression if we apply a log transform to it. We use a shifted log transform, log10(1 + distance), since many of the distances are exactly zero.

In [12]:
data["log_rdist"] = np.log(1 + data.rdist) / np.log(10)
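To make the transformed scale easier to read, here is a small standalone sketch of the shifted log transform and its inverse. The function names are introduced here for illustration only and are not part of the notebook.

```python
import math

def shifted_log10(dist_km):
    """log10(1 + distance); well defined when the distance is exactly zero."""
    return math.log10(1 + dist_km)

def inverse_shifted_log10(value):
    """Map a value on the shifted log scale back to kilometers."""
    return 10 ** value - 1

# A zero distance maps to zero; 9, 99 and 999 km map to about 1, 2 and 3
for d in [0, 9, 99, 999]:
    print(d, shifted_log10(d))

# A value of 2.5 on the log scale corresponds to roughly 315 km
print(inverse_shifted_log10(2.5))
```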

Now we make a scatterplot of the log distance (between birth and death locations) against the midpoint of a person's life.

In [13]:
plt.clf()
plt.plot(data["AYear"], data["log_rdist"], 'o', alpha=0.2)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance (km)", size=15)

There are far more data records from recent history, so this plot may be misleading due to overplotting. We can fit the conditional mean curve using scatterplot smoothing to see how the average distance has changed over time.

In [14]:
lfit = sm.nonparametric.lowess(data.log_rdist, data.AYear, frac=0.1)
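For intuition about what a scatterplot smoother does, the sketch below computes a tricube-weighted local mean over the nearest fraction of points. This is a simplified relative of lowess, not the statsmodels algorithm: it fits a local constant rather than a local line and has no robustness iterations. The function name and the toy data are assumptions made for illustration.

```python
def local_mean_smooth(x, y, frac=0.5):
    """Tricube-weighted local mean of y at each x, using the nearest frac of points."""
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))  # neighbours per window
    fitted = []
    for x0 in x:
        # Bandwidth: distance from x0 to its k-th nearest point
        h = sorted(abs(xi - x0) for xi in x)[k - 1]
        num = den = 0.0
        for xi, yi in zip(x, y):
            if h > 0:
                u = abs(xi - x0) / h
            else:
                u = 0.0 if xi == x0 else 1.0  # degenerate window of tied x values
            if u < 1:
                w = (1 - u ** 3) ** 3  # tricube weight
                num += w * yi
                den += w
        fitted.append(num / den)
    return fitted

# Toy data: a step from 0 to 1 is smoothed into a gradual rise
xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(local_mean_smooth(xs, ys, frac=0.5))
```

As with the `frac` argument of lowess, a larger fraction averages over more neighbours and gives a smoother curve.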

There is a rapid increase in the mean around 1700, as travel became much easier. There may also be a decrease in the distance traveled in recent decades, which could reflect the fact that large numbers of people were displaced in the middle of the 20th century by World War II, inflating the distances for that period.

In [15]:
plt.clf()
plt.plot(data.AYear, data.log_rdist, 'o', color='grey', alpha=0.05)
plt.plot(lfit[:,0], lfit[:,1], '-', color='lime', lw=3, alpha=0.9)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance", size=15)

I was curious about some of the people who lived long ago and yet travelled great distances during their lives:

In [16]:
ii = (data["rdist"] > 3000) & (data["AYear"] < 1000)
data.loc[ii, ["PrsLabel", "AYear", "BLocLabel", "DLocLabel", "rdist"]]
Out[16]:
PrsLabel AYear BLocLabel DLocLabel rdist
147 Caracalla 202.5 Lugdunum Harran 3003.425021
241 Kumarajiva 378.5 Kashmir Chang'an 3012.301066
340 Uqba ibn Nafi 652.5 Mecca Sidi Okba 3623.312353

Wikipedia says that Kumarajiva was born and died in China and that his father was from Kashmir.

One more drill-down check -- who has the greatest distance between their birth and death locations? The circumference of the earth is about 40,075 km, so the maximum possible distance is roughly half of this value. This person was born and died at nearly antipodal points on the earth's surface.

In [17]:
ii = rdist.idxmax()  # index label (not position) of the largest distance
print(data.loc[ii, :])
print(40075. / 2)
Unnamed: 0                          69457
PrsID                           /m/08wrn6
PrsLabel                    Michael Miles
BYear                                1919
BLocLabel                      Wellington
BLocID                           /m/0853g
BLocLat                          -41.2889
BLocLong                         174.7772
DYear                                1971
DLocLabel                           Spain
DLocID                           /m/06mkj
DLocLat                          40.69858
DLocLong                        -3.294946
Gender                               Male
PerformingArts                          1
Creative                                0
Gov/Law/Mil/Act/Rel                     0
Academic/Edu/Health                     0
Sports                                  0
Business/Industry/Travel                0
rdist                            19844.86
AYear                                1945
log_rdist                         4.29767
Name: 69457, dtype: object
20037.5
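To see why this record comes so close to the maximum, we can compute the antipode of the birth location directly. The helper below is a small sketch introduced here, not part of the original analysis; it flips the latitude and shifts the longitude by 180 degrees.

```python
def antipode(lat, lon):
    """Return the point diametrically opposite (lat, lon), in degrees."""
    anti_lon = lon - 180 if lon > 0 else lon + 180
    return -lat, anti_lon

# Wellington, using the birth coordinates from the record above
print(antipode(-41.2889, 174.7772))
```

The antipode of Wellington lies at about 41.3 degrees north and 5.2 degrees west, within a couple of degrees of the coordinates recorded for the death location "Spain" (40.69858, -3.294946).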

We repeat the same analysis using only the data from 1600 to the present.

In [18]:
df = data[data.AYear >= 1600]
lfit = sm.nonparametric.lowess(df.log_rdist, df.AYear, frac=0.1)
In [19]:
plt.clf()
plt.plot(df.AYear, df.log_rdist, 'o', color='grey', alpha=0.05)
plt.plot(lfit[:,0], lfit[:,1], '-', color='lime', lw=3, alpha=0.9)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance", size=15)

Maps

In this section we will make some simple maps showing the birth locations.

In [20]:
df = data.loc[:, ["BLocLat", "BLocLong"]].dropna()
latit = np.asarray(df["BLocLat"])
longit = np.asarray(df["BLocLong"])

The following cell shows the natural way to make a map of the birth locations. However, it currently doesn't work due to a bug in Basemap. See the cell after it for a workaround, which projects the coordinates explicitly before plotting.

In [21]:
#mp = Basemap()
#plt.figure(figsize=(16, 12))
#mp.drawcoastlines()
#mp.plot(longit[0:10000], latit[0:10000], '.', latlon=True, color='blue')
In [22]:
mp = Basemap()
plt.figure(figsize=(16, 12))
mp.drawcoastlines() 
x, y = mp(longit, latit)
mp.plot(x, y, 'o', color='blue', alpha=0.1, ms=4, latlon=False)
Out[22]:
[<matplotlib.lines.Line2D at 0x7f7fdb72b9d0>]
In [23]:
data.columns
Out[23]:
Index([u'Unnamed: 0', u'PrsID', u'PrsLabel', u'BYear', u'BLocLabel', u'BLocID', u'BLocLat', u'BLocLong', u'DYear', u'DLocLabel', u'DLocID', u'DLocLat', u'DLocLong', u'Gender', u'PerformingArts', u'Creative', u'Gov/Law/Mil/Act/Rel', u'Academic/Edu/Health', u'Sports', u'Business/Industry/Travel', u'rdist', u'AYear', u'log_rdist'], dtype='object')

Exercises

  • Someone was supposedly born in the Pacific Ocean about one third of the way from Vancouver to Honolulu. Who was this? Is the point in the correct place?

  • Make a histogram showing the distribution of birth years.

  • For the people in the dataset who were born in each century, determine the proportion who were born in the southern hemisphere. Make a graph plotting this proportion against time. Do the same for the western hemisphere.