Genealogy, part II

This is the IPython notebook complementing the article http://www.mathiasbernhard.ch/genealogy-part-ii/
To load the wiki entries from a pickle dump and start directly with the NLP/ML part, jump to the section "start of the NLP/ML part" below.

import required libraries

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import urllib
In [2]:
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

load table (CSV) with people into a pandas DataFrame; the column names are German (Vorname = first name, weitere Vornamen = further given names, Beruf = profession, geboren = born, gestorben = died, Alter = age)

In [3]:
path = r'data/genealogy.csv'
df = pd.read_csv(path)
df.head()
Out[3]:
   Prefix  Name         Vorname         weitere Vornamen  Beruf                 geboren  gestorben  Alter  URL
0  NaN     Pythagoras   von Samos       NaN               Philosoph/Mathematik  -570     -495       75     http://en.wikipedia.org/wiki/Pythagoras
1  NaN     Platon       NaN             NaN               Philosoph/Mathematik  -427     -347       80     http://en.wikipedia.org/wiki/Plato
2  NaN     Aristoteles  NaN             NaN               Philosoph/Logik       -384     -322       62     http://en.wikipedia.org/wiki/Aristotle
3  NaN     Epicurus     NaN             NaN               Philosoph             -341     -270       71     http://en.wikipedia.org/wiki/Epicurus
4  NaN     Euklid       von Alexandria  NaN               Mathematik            -325     -265       60     http://en.wikipedia.org/wiki/Euclid

define a function that extracts the article text from a URL

In [41]:
def extract_article(url):
    site = urllib.urlopen(url)  # Python 2: urlopen lives directly in urllib
    soup = BeautifulSoup(site, 'html.parser')  # explicit parser silences bs4's warning
    article = soup.find("div", "mw-body-content").get_text()  # main content div of a Wikipedia page
    return article
In [42]:
pythagoras = extract_article(df.URL[0])
In [44]:
print pythagoras[760:1400]
Pythagoras of Samos (US /pɪˈθæɡərəs/;[1] UK /paɪˈθæɡərəs/;[2] Greek: Πυθαγόρας ὁ Σάμιος Pythagóras ho Sámios "Pythagoras the Samian", or simply Πυθαγόρας; Πυθαγόρης in Ionian Greek; c. 570 – c. 495 BC)[3][4] was an Ionian Greek philosopher, mathematician, and founder of the religious movement called Pythagoreanism. Most of the information about Pythagoras was written down centuries after he lived, so very little reliable information is known about him. He was born on the island of Samos, and might have travelled widely in his youth, visiting Egypt and other places seeking knowledge. Around 530 BC, he moved to Croton, in Magna Graec
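
This notebook runs on Python 2 (note the urllib.urlopen call and the print statements). For readers on Python 3, a hedged sketch of the same function, untested here, since urlopen moved to urllib.request:

# Python 3 variant (my addition): urlopen moved to urllib.request,
# the scraping logic itself is unchanged
from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract_article(url):
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    return soup.find("div", "mw-body-content").get_text()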

extract names as the last part of each URL, for labelling in the visuals

In [7]:
names = map(lambda x : x.split("/")[-1], df.URL)
names[:25]
Out[7]:
['Pythagoras',
 'Plato',
 'Aristotle',
 'Epicurus',
 'Euclid',
 'Archimedes',
 'Lucretius',
 'Vitruvius',
 'Fibonacci',
 'Dante_Alighieri',
 'Filippo_Brunelleschi',
 'Johannes_Gutenberg',
 'Leon_Battista_Alberti',
 'Donato_Bramante',
 'Leonardo_da_Vinci',
 'Albrecht_D%C3%BCrer',
 'Sebastiano_Serlio',
 'Michelangelo',
 'Parmigianino',
 'Giacomo_Barozzi_da_Vignola',
 'Andrea_Palladio',
 'Philibert_de_l%27Orme',
 'Rafael_Bombelli',
 'John_Napier',
 'Francis_Bacon']
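
Some labels still carry percent-escapes from the URLs, e.g. 'Albrecht_D%C3%BCrer'. An optional cleanup sketch (my addition) using urllib.unquote from the Python 2 standard library:

# decode the percent-escapes, then interpret the raw bytes as UTF-8,
# so the plot labels read e.g. u'Albrecht_Dürer'
names = [urllib.unquote(n).decode('utf-8') for n in names]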

apply the function to ALL URLs in the table and store the results in a list (of unicode strings)

In [8]:
wiki_entries = []
for url in df.URL:
    wiki_entries.append(extract_article(url))
len(wiki_entries)
Out[8]:
262
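
Crawling all 262 pages in one go takes a while; if you ever re-crawl, a short pause between requests is kinder to Wikipedia's servers. A variant sketch (my addition; the half-second delay is an arbitrary choice):

import time

wiki_entries = []
for url in df.URL:
    wiki_entries.append(extract_article(url))
    time.sleep(0.5)  # pause briefly between requests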

dump the list to a pickle file, so there is no need to crawl again

In [9]:
import pickle
In [10]:
pickle.dump(wiki_entries, open('data/wiki_entries.pkl', 'wb'))  # binary mode for pickle

load pickled list to continue

The file loaded in the next step can be downloaded from here: http://www.mathiasbernhard.ch/notebooks/wiki_entries.pkl.zip

In [11]:
pkl_entries = pickle.load(open('data/wiki_entries.pkl', 'rb'))

start of the NLP/ML part

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(pkl_entries)
X_train_counts.shape
Out[12]:
(262, 68851)
In [13]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
Out[13]:
(262, 68851)
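
As a quick sanity check of the weighting (my addition, assuming the get_feature_names() accessor of this scikit-learn version), the highest-weighted tf-idf terms of the first article, Pythagoras, can be listed:

import numpy as np
terms = np.array(count_vect.get_feature_names())
row = X_train_tfidf[0].toarray().ravel()  # tf-idf weights of document 0
top = row.argsort()[::-1][:10]            # indices of the ten largest weights
for t, w in zip(terms[top], row[top]):
    print t, w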

or directly using TfidfVectorizer

as used, for example, by Andrej Karpathy and Alexander Fabisch

rkeisler on Guardian articles: https://github.com/rkeisler/tsne_guardian/blob/master/tsne_guardian.py

# vecs: the output of CountVectorizer
# force vectors to have unit length:
norm = np.sqrt(vecs.multiply(vecs).sum(1))
vecs = vecs.multiply(1./norm)
distance_matrix = sklearn.metrics.pairwise.pairwise_distances(vecs, metric='cosine')
model = TSNE(early_exaggeration=4)
pos = model.fit_transform(distance_matrix)

additional parameters for TfidfVectorizer proposed by A. Karpathy

min_df=2, stop_words = 'english',\
strip_accents = 'unicode', lowercase=True, ngram_range=(1,2),\
norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True
In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectors = TfidfVectorizer().fit_transform(pkl_entries)
vectors.shape
Out[14]:
(262, 68851)
In [15]:
from sklearn.manifold import TSNE

define a function for easy plotting of embeddings

In [16]:
def plot_embedding(pos):
    # scatter the 2D positions and label each point with the wiki page name
    fig = plt.figure(figsize=(10, 10))
    ax = plt.axes(frameon=False)
    plt.setp(ax, xticks=(), yticks=())  # hide the (meaningless) t-SNE axes
    plt.scatter(pos[:, 0], pos[:, 1], s=5, color='r')
    for i, txt in enumerate(names):  # 'names' comes from the URL-splitting step above
        plt.annotate(txt, (pos[i, 0], pos[i, 1]), fontsize=6)

Alexander Fabisch method

In [17]:
from sklearn.decomposition import TruncatedSVD
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(vectors)
X_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_reduced)
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 262 / 262
[t-SNE] Mean sigma: 0.181085
[t-SNE] Iteration 10: error = 19.5840120, gradient norm = 0.0329279
[t-SNE] Iteration 20: error = 17.0377197, gradient norm = 0.0752239
[t-SNE] Iteration 30: error = 16.1263235, gradient norm = 0.0850579
[t-SNE] Iteration 32: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Iteration 40: error = 15.6700067, gradient norm = 0.0858924
[t-SNE] Iteration 50: error = 16.0684620, gradient norm = 0.0771126
[t-SNE] Iteration 60: error = 16.0344312, gradient norm = 0.0775531
[t-SNE] Iteration 64: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 64 iterations with early exaggeration: 16.027074
[t-SNE] Iteration 70: error = 1.9491604, gradient norm = 0.0199896
[t-SNE] Iteration 80: error = 1.7117975, gradient norm = 0.0209856
[t-SNE] Iteration 90: error = 1.8939256, gradient norm = 0.0254609
[t-SNE] Iteration 100: error = 2.2310724, gradient norm = 0.0347741
[t-SNE] Iteration 110: error = 2.4389540, gradient norm = 0.0383329
[t-SNE] Iteration 111: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 111 iterations: 2.467875
In [18]:
plot_embedding(X_embedded)
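
t-SNE is stochastic, so the map looks a little different on every run. For a repeatable layout, TSNE accepts a random_state seed (a minimal sketch; the seed value 0 is arbitrary):

X_embedded = TSNE(n_components=2, perplexity=40, verbose=2,
                  random_state=0).fit_transform(X_reduced)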

Andrej Karpathy method

In [19]:
vectorizer = TfidfVectorizer(min_df=2, stop_words = 'english',\
strip_accents = 'unicode', lowercase=True, ngram_range=(1,2),\
norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
X = vectorizer.fit_transform(pkl_entries)
D = -(X * X.T).todense()  # 'distance' matrix: negative dot products between tf-idf vectors, so similar documents get small values
ak_embed = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(D)
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 262 / 262
[t-SNE] Mean sigma: 0.298632
[t-SNE] Iteration 10: error = 18.3174214, gradient norm = 0.0362003
[t-SNE] Iteration 20: error = 16.8947560, gradient norm = 0.0595788
[t-SNE] Iteration 30: error = 16.2785143, gradient norm = 0.1064113
[t-SNE] Iteration 32: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Iteration 40: error = 17.4717802, gradient norm = 0.0666907
[t-SNE] Iteration 50: error = 16.8543602, gradient norm = 0.0710531
[t-SNE] Iteration 60: error = 17.3436406, gradient norm = 0.0682378
[t-SNE] Iteration 65: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 65 iterations with early exaggeration: 17.397422
[t-SNE] Iteration 70: error = 2.5222497, gradient norm = 0.0224481
[t-SNE] Iteration 80: error = 2.1122603, gradient norm = 0.0244114
[t-SNE] Iteration 90: error = 2.2152029, gradient norm = 0.0314139
[t-SNE] Iteration 100: error = 2.5744293, gradient norm = 0.0421679
[t-SNE] Iteration 110: error = 2.7956470, gradient norm = 0.0458885
[t-SNE] Iteration 113: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 113 iterations: 2.842739
In [20]:
plot_embedding(ak_embed)

Ryan Keisler method

In [21]:
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
vecs = X_train_counts
#force vectors to have unit length:
norm = np.sqrt(vecs.multiply(vecs).sum(1))
vecs = vecs.multiply(1./norm)
distance_matrix = pairwise_distances(vecs, metric='cosine')
model = TSNE(early_exaggeration=4)
rk_embed = model.fit_transform(distance_matrix)
In [22]:
plot_embedding(rk_embed)
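
To keep any of the three maps as an image, matplotlib's savefig can be called in the same cell as the plot (a minimal sketch; filename and dpi are my choices):

plot_embedding(rk_embed)
plt.savefig('rk_embed.png', dpi=300, bbox_inches='tight')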