In this notebook we do a few simple reanalyses of the geographic locations of the births and deaths of notable individuals. The data were originally compiled and discussed in the following paper:
Schich et al.
A Network Framework of Cultural History
Science 1 August 2014:
Vol. 345 no. 6196 pp. 558-562
The data can be obtained from the journal's web site.
Here we will only work with the Freebase ("SchichDataS1_FB") data set. We have taken the Freebase Excel spreadsheet and converted it to a compressed csv file. To run the code below, you will need to do this conversion yourself and place the resulting data file in the working directory. If you use a compression method other than gzip, you may also need to change the file name in the code below to match. Note that it is also possible to use the Pandas read_excel function to load the Excel file directly.
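The loading options described here can be sketched as a small helper. The name load_table is hypothetical and not part of the original notebook. The sketch assumes that pandas can infer the compression method from the file extension (the compression="infer" setting of read_csv) and that an Excel engine such as openpyxl is installed if an Excel file is read directly. The toy table is made up purely to exercise the csv branch.

```python
import os
import tempfile

import pandas as pd


def load_table(path):
    """Load a table from an Excel file or a (possibly compressed) csv file.

    Hypothetical helper, not part of the original notebook.
    """
    if path.lower().endswith((".xls", ".xlsx")):
        # Reading Excel directly requires an engine such as openpyxl.
        return pd.read_excel(path)
    # compression="infer" picks the method (gzip, bz2, zip, xz) from the extension.
    return pd.read_csv(path, compression="infer", encoding="utf-8")


# Round-trip a tiny made-up table through a gzip-compressed csv file.
toy = pd.DataFrame({"PrsLabel": ["A", "B"], "BYear": [-100, 1900]})
with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "toy.csv.gz")
    toy.to_csv(fname, index=False, compression="gzip")
    loaded = load_table(fname)
```

With the real data, the call would be load_table("SchichDataS1_FB.csv.gz"), or the same call with the path to the Excel file.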
This notebook involves working with geospatial data, and reviews some basic techniques in this area.
We begin by importing our standard libraries, and also a couple of libraries for handling geographic data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # used for all plots below
import statsmodels.api as sm
from mpl_toolkits.basemap import Basemap
import pyproj
Next we load the Freebase data (change the file name and compression method here if needed).
data = pd.read_csv("SchichDataS1_FB.csv.gz", compression="gzip", encoding='utf-8')
We check the dimensions of the data frame (data.shape):
We check the variable (column) names (data.columns):
Index([u'Unnamed: 0', u'PrsID', u'PrsLabel', u'BYear', u'BLocLabel', u'BLocID', u'BLocLat', u'BLocLong', u'DYear', u'DLocLabel', u'DLocID', u'DLocLat', u'DLocLong', u'Gender', u'PerformingArts', u'Creative', u'Gov/Law/Mil/Act/Rel', u'Academic/Edu/Health', u'Sports', u'Business/Industry/Travel'], dtype='object')
Here is a peek at the top few rows of the data set. Each record corresponds to one person. The first record is for David, the second king of Israel.
   Unnamed: 0      PrsID         PrsLabel  BYear         BLocLabel  \
0           0   /m/02cvn            David  -1039         Bethlehem
1           1  /m/02xr84           Josiah   -648         Jerusalem
2           2   /m/02pqx          Ezekiel   -621         Jerusalem
3           3   /m/0fyrw            Solon   -637  Classical Athens
4           4  /m/03d0c1  Cyrus the Great   -575            Anshan

       BLocID    BLocLat   BLocLong  DYear  DLocLabel    DLocID    DLocLat  \
0    /m/01cy_  31.703100  35.195600   -969  Jerusalem  /m/0430_  31.783333
1    /m/0430_  31.783333  35.216667   -608  Jerusalem  /m/0430_  31.783333
2    /m/0430_  31.783333  35.216667   -569    Babylon  /m/01cyh  32.541667
3  /m/03nmzqx  37.966667  23.716667   -557     Cyprus  /m/01ppq  35.225120
4   /m/05h83n  29.900000  52.400000   -528  Syr Darya  /m/0k8fk  40.850000

    DLocLong Gender  PerformingArts  Creative  Gov/Law/Mil/Act/Rel  \
0  35.216667   Male               0         0                    1
1  35.216667   Male               0         0                    0
2  44.423333   Male               0         0                    1
3  33.612441   Male               0         0                    0
4  68.666667   Male               0         0                    0

   Academic/Edu/Health  Sports  Business/Industry/Travel
0                    0       0                         0
1                    0       0                         0
2                    0       0                         0
3                    0       0                         0
4                    0       0                         0
In the next cell, we calculate the distance between the location where a person was born and the location where she or he died. This involves calculating geodesic distances, which can be handled using the pyproj bindings for the proj library. See here for more discussion about calculating geodesic distances in Python.
First we extract the relevant columns from the Pandas data frame and convert them to a NumPy array. In general it is not necessary to do this conversion, but in this case we need to feed these data into a library that doesn't handle Pandas objects.
df = data.loc[:, ["BLocLat", "BLocLong", "DLocLat", "DLocLong"]].dropna()
dfa = np.asarray(df)
Next we do the distance calculations.
g = pyproj.Geod(ellps='WGS84')
# Geod.inv takes longitudes before latitudes and returns distances in meters
_, _, dists = g.inv(dfa[:, 1], dfa[:, 0], dfa[:, 3], dfa[:, 2])
data["rdist"] = pd.Series(dists / 1000, df.index)  # Convert distances to kilometers
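As a rough cross-check on the geodesic calculation, the distance can also be approximated with the haversine formula, which treats the earth as a sphere rather than the WGS84 ellipsoid. This is a swapped-in approximation, not the method used in this notebook. The function name haversine_km and the mean radius of 6371 km are assumptions made for this sketch, and the results will differ slightly from the pyproj values.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Bethlehem to Jerusalem, using the coordinates from the first data row.
d_david = haversine_km(31.7031, 35.1956, 31.783333, 35.216667)

# The function also works elementwise on arrays: a zero distance and an
# antipodal pair on the equator.
d_many = haversine_km(np.array([0.0, 0.0]), np.array([0.0, 0.0]),
                      np.array([0.0, 0.0]), np.array([0.0, 180.0]))
```

On the real data, the equivalent call would be haversine_km(dfa[:, 0], dfa[:, 2] ... ) with latitudes and longitudes in the argument order shown in the function signature, which is the reverse of the order pyproj expects.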
A simple starting point is to look at the distribution of the distances between birth locations and death locations. There are two components to the distribution: many people were born and died in the same location; among the remainder, the distances are widely spread, centered at around 10^2.5, or roughly 320 km.
rdist = data["rdist"].dropna()
plt.clf()
plt.hist(np.log(1 + rdist) / np.log(10), bins=50, alpha=0.5)
plt.xlabel("Log10 distance (km)", size=15)
plt.ylabel("Frequency", size=15)
We can calculate the proportion of people who were born and died in the same location:
print((data.rdist == 0).mean())
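One detail worth keeping in mind when reading this proportion: a comparison of a missing value with zero evaluates to False, so records with a missing distance stay in the denominator. The sketch below uses made-up distances (the name rdist_toy is hypothetical) to show how the proportion changes when missing values are dropped first.

```python
import numpy as np
import pandas as pd

# Made-up distances: two zeros, two positive values, one missing value.
rdist_toy = pd.Series([0.0, 0.0, 12.5, 300.0, np.nan])

# NaN == 0 is False, so the missing record is counted in the denominator.
prop_all = (rdist_toy == 0).mean()

# Dropping missing values first restricts the denominator to known distances.
prop_known = (rdist_toy.dropna() == 0).mean()
```

Here prop_all is 2 out of 5 and prop_known is 2 out of 4. Which denominator is appropriate depends on whether the question is about all records or only those with both locations known.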
We can also look at how the distance between birth locations and death locations has changed over time. First we define a new variable that holds the midpoint of a person's lifespan.
data["AYear"] = (data["BYear"] + data["DYear"]) / 2
Next we plot the distance between birth and death locations against this midpoint.
plt.clf()
plt.plot(data["AYear"], data["rdist"], 'o', alpha=0.2)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Distance (km)", size=15)
The distance is quite skewed, so we may get a different impression if we apply a log transform to it. We use a shifted log transform since many of the distances are exactly zero.
data["log_rdist"] = np.log(1 + data.rdist) / np.log(10)
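A minimal sketch of the shifted log transform on made-up distances may help when reading the transformed axis. The variable names are hypothetical. It shows that a distance of zero maps to zero, that the expression used here agrees with NumPy's built-in base-10 logarithm, and how to convert a transformed value back to kilometers.

```python
import numpy as np

dist_km = np.array([0.0, 9.0, 99.0, 999.0])  # made-up distances in km

# The form used in this notebook, and the equivalent built-in base-10 logarithm.
shifted_a = np.log(1 + dist_km) / np.log(10)
shifted_b = np.log10(1 + dist_km)

# Undo the transform to get back to kilometers.
recovered = 10 ** shifted_b - 1
```

The transformed values for these four distances are 0, 1, 2 and 3, so a value of 2 on the log scale corresponds to just under 100 km.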
Now we make a scatterplot of the log distance (between birth and death locations) against the midpoint of a person's life.
plt.clf()
plt.plot(data["AYear"], data["log_rdist"], 'o', alpha=0.2)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance (km)", size=15)  # the y axis is on the log scale
There are far more data records from recent history, so this plot may be misleading due to overplotting. We can fit the conditional mean curve using scatterplot smoothing to see how the average distance has changed over time.
lfit = sm.nonparametric.lowess(data.log_rdist, data.AYear, frac=0.1)
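A cruder alternative to scatterplot smoothing, useful as a sanity check on the fitted curve, is to average the log distance within fixed time bins. The sketch below is not the lowess method used in this notebook. It uses synthetic data with a built-in jump in the mean at 1700, and the names toy, century and century_mean are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: the mean steps up from 1 to 2 at 1700.
year = rng.uniform(1000, 2000, size=5000)
log_dist = np.where(year < 1700, 1.0, 2.0) + rng.normal(0, 0.5, size=5000)
toy = pd.DataFrame({"AYear": year, "log_rdist": log_dist})

# Assign each record to a century and average the log distance within it.
toy["century"] = (np.floor(toy["AYear"] / 100) * 100).astype(int)
century_mean = toy.groupby("century")["log_rdist"].mean()
```

Binned means are easy to interpret but depend on where the bin edges fall, whereas lowess produces a smooth curve controlled by the frac parameter.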
There is a rapid increase in the mean around 1700, as travel became much easier. There may also be a decrease in the distance traveled in recent decades, which could reflect the fact that large numbers of people were displaced in the middle of the 20th century by World War II, raising the mid-century average relative to later decades.
plt.clf()
plt.plot(data.AYear, data.log_rdist, 'o', color='grey', alpha=0.05)
plt.plot(lfit[:,0], lfit[:,1], '-', color='lime', lw=3, alpha=0.9)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance", size=15)
I was curious about some of the people who lived long ago and yet travelled great distances during their lives:
ii = (data["rdist"] > 3000) & (data["AYear"] < 1000)
data.loc[ii, ["PrsLabel", "AYear", "BLocLabel", "DLocLabel", "rdist"]]
          PrsLabel  AYear BLocLabel  DLocLabel        rdist
340  Uqba ibn Nafi  652.5     Mecca  Sidi Okba  3623.312353
One more drill-down check -- who has the greatest distance between their birth and death locations? The circumference of the earth at the equator is about 40,075 km, so the maximum possible distance is roughly half of this value. This person was born and died at nearly antipodal points on the earth's surface.
ii = rdist.idxmax()  # index label (not position) of the largest distance
print(data.loc[ii, :])
print(40075. / 2)
Unnamed: 0                          69457
PrsID                           /m/08wrn6
PrsLabel                    Michael Miles
BYear                                1919
BLocLabel                      Wellington
BLocID                           /m/0853g
BLocLat                          -41.2889
BLocLong                         174.7772
DYear                                1971
DLocLabel                           Spain
DLocID                           /m/06mkj
DLocLat                          40.69858
DLocLong                        -3.294946
Gender                               Male
PerformingArts                          1
Creative                                0
Gov/Law/Mil/Act/Rel                     0
Academic/Edu/Health                     0
Sports                                  0
Business/Industry/Travel                0
rdist                            19844.86
AYear                                1945
log_rdist                         4.29767
Name: 69457, dtype: object
20037.5
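When locating the record with the largest distance, it helps to remember that the rdist Series keeps the original row labels after dropna(), so the position of a value and its label can differ. The sketch below uses a made-up Series (rdist_toy is a hypothetical name) to show the difference, together with the half-circumference arithmetic.

```python
import numpy as np
import pandas as pd

# Made-up distances with a non-contiguous index, as left behind by dropna().
rdist_toy = pd.Series([10.0, 2500.0, 40.0], index=[3, 7, 12])

pos = int(np.argmax(rdist_toy.to_numpy()))  # integer position: 1
label = rdist_toy.idxmax()                  # index label: 7

# Label-based lookup with .loc needs the label, not the position.
largest = rdist_toy.loc[label]

# Half the equatorial circumference, the rough upper bound on the distance.
half_circumference_km = 40075.0 / 2
```

Passing the position to .loc would look up the wrong row, or raise a KeyError when no row carries that label, which is why the label returned by idxmax is the safer choice for label-based lookup.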
We repeat the same analysis using only the data from 1600 to the present.
df = data[data.AYear >= 1600]
lfit = sm.nonparametric.lowess(df.log_rdist, df.AYear, frac=0.1)
plt.clf()
plt.plot(df.AYear, df.log_rdist, 'o', color='grey', alpha=0.05)
plt.plot(lfit[:,0], lfit[:,1], '-', color='lime', lw=3, alpha=0.9)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance", size=15)
# Extract the birth coordinates, dropping records with a missing location
df = data.loc[:, ["BLocLat", "BLocLong"]].dropna()
latit = np.asarray(df["BLocLat"])
longit = np.asarray(df["BLocLong"])
The following cell shows the natural way to make a map of the birth locations. However, it currently doesn't work due to a bug in Basemap. See the next cell for a workaround.
#mp = Basemap()
#plt.figure(figsize=(16, 12))
#mp.drawcoastlines()
#mp.plot(longit[0:10000], latit[0:10000], '.', latlon=True, color='blue')
mp = Basemap()
plt.figure(figsize=(16, 12))
mp.drawcoastlines()
# Workaround: convert to map coordinates ourselves, then plot with latlon=False
x, y = mp(longit, latit)
mp.plot(x, y, 'o', color='blue', alpha=0.1, ms=4, latlon=False)
Index([u'Unnamed: 0', u'PrsID', u'PrsLabel', u'BYear', u'BLocLabel', u'BLocID', u'BLocLat', u'BLocLong', u'DYear', u'DLocLabel', u'DLocID', u'DLocLat', u'DLocLong', u'Gender', u'PerformingArts', u'Creative', u'Gov/Law/Mil/Act/Rel', u'Academic/Edu/Health', u'Sports', u'Business/Industry/Travel', u'rdist', u'AYear', u'log_rdist'], dtype='object')
Someone was supposedly born in the Pacific Ocean about one third of the way from Vancouver to Honolulu. Who was this? Is the point in the correct place?
Make a histogram showing the distribution of birth years.
For the people in the dataset who were born in each century, determine the proportion who were born in the southern hemisphere. Make a graph plotting this proportion against time. Do the same for the western hemisphere.