Genealogy, part II

This is the IPython notebook complementing the article http://www.mathiasbernhard.ch/genealogy-part-ii/
To load the wiki entries from a pickle dump and start directly with the NLP/ML part, jump to the section "start of the NLP/ML part" below.

import required libraries

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import urllib
In [2]:
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

load table (CSV) with people into a pandas DataFrame; the column names are German (Vorname = first name, weitere Vornamen = further given names, Beruf = profession, geboren = born, gestorben = died, Alter = age)

In [3]:
path = r'data/genealogy.csv'
df = pd.read_csv(path)
df.head()
Out[3]:
   Prefix  Name         Vorname         weitere Vornamen  Beruf                 geboren  gestorben  Alter  URL
0  NaN     Pythagoras   von Samos       NaN               Philosoph/Mathematik  -570     -495       75     http://en.wikipedia.org/wiki/Pythagoras
1  NaN     Platon       NaN             NaN               Philosoph/Mathematik  -427     -347       80     http://en.wikipedia.org/wiki/Plato
2  NaN     Aristoteles  NaN             NaN               Philosoph/Logik       -384     -322       62     http://en.wikipedia.org/wiki/Aristotle
3  NaN     Epicurus     NaN             NaN               Philosoph             -341     -270       71     http://en.wikipedia.org/wiki/Epicurus
4  NaN     Euklid       von Alexandria  NaN               Mathematik            -325     -265       60     http://en.wikipedia.org/wiki/Euclid

define a function that extracts the article text from a URL

In [41]:
def extract_article(url):
    site = urllib.urlopen(url)  # Python 2: urlopen lives directly in urllib
    soup = BeautifulSoup(site, 'html.parser')  # explicit parser silences bs4's warning
    article = soup.find("div", "mw-body-content").get_text()  # main content div of a Wikipedia page
    return article
In [42]:
pythagoras = extract_article(df.URL[0])
In [44]:
print pythagoras[760:1400]
Pythagoras of Samos (US /pɪˈθæɡərəs/;[1] UK /paɪˈθæɡərəs/;[2] Greek: Πυθαγόρας ὁ Σάμιος Pythagóras ho Sámios "Pythagoras the Samian", or simply Πυθαγόρας; Πυθαγόρης in Ionian Greek; c. 570 – c. 495 BC)[3][4] was an Ionian Greek philosopher, mathematician, and founder of the religious movement called Pythagoreanism. Most of the information about Pythagoras was written down centuries after he lived, so very little reliable information is known about him. He was born on the island of Samos, and might have travelled widely in his youth, visiting Egypt and other places seeking knowledge. Around 530 BC, he moved to Croton, in Magna Graec
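
This notebook runs on Python 2 (note the urllib.urlopen call and the print statements). For readers on Python 3, a hedged sketch of the same function, untested here, since urlopen moved to urllib.request:

# Python 3 variant (my addition): urlopen moved to urllib.request,
# the scraping logic itself is unchanged
from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract_article(url):
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    return soup.find("div", "mw-body-content").get_text()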

extract names as the last part of each URL, for labelling in the visuals

In [7]:
names = map(lambda x : x.split("/")[-1], df.URL)
names[:25]
Out[7]:
['Pythagoras',
 'Plato',
 'Aristotle',
 'Epicurus',
 'Euclid',
 'Archimedes',
 'Lucretius',
 'Vitruvius',
 'Fibonacci',
 'Dante_Alighieri',
 'Filippo_Brunelleschi',
 'Johannes_Gutenberg',
 'Leon_Battista_Alberti',
 'Donato_Bramante',
 'Leonardo_da_Vinci',
 'Albrecht_D%C3%BCrer',
 'Sebastiano_Serlio',
 'Michelangelo',
 'Parmigianino',
 'Giacomo_Barozzi_da_Vignola',
 'Andrea_Palladio',
 'Philibert_de_l%27Orme',
 'Rafael_Bombelli',
 'John_Napier',
 'Francis_Bacon']
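
Some labels still carry percent-escapes from the URLs, e.g. 'Albrecht_D%C3%BCrer'. An optional cleanup sketch (my addition) using urllib.unquote from the Python 2 standard library:

# decode the percent-escapes, then interpret the raw bytes as UTF-8,
# so the plot labels read e.g. u'Albrecht_Dürer'
names = [urllib.unquote(n).decode('utf-8') for n in names]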

apply the function to ALL URLs in the table and store the results in a list (of unicode strings)

In [8]:
wiki_entries = []
for url in df.URL:
    wiki_entries.append(extract_article(url))
len(wiki_entries)
Out[8]:
262
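
Crawling all 262 pages in one go takes a while; if you ever re-crawl, a short pause between requests is kinder to Wikipedia's servers. A variant sketch (my addition; the half-second delay is an arbitrary choice):

import time

wiki_entries = []
for url in df.URL:
    wiki_entries.append(extract_article(url))
    time.sleep(0.5)  # pause briefly between requests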

dump the list to a pickle file, so there is no need to crawl again

In [9]:
import pickle
In [10]:
pickle.dump(wiki_entries, open('data/wiki_entries.pkl', 'wb'))  # binary mode for pickle

load pickled list to continue

The file loaded in the next step can be downloaded from here: http://www.mathiasbernhard.ch/notebooks/wiki_entries.pkl.zip

In [11]:
pkl_entries = pickle.load(open('data/wiki_entries.pkl', 'rb'))

start of the NLP/ML part

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(pkl_entries)
X_train_counts.shape
Out[12]:
(262, 68851)
In [13]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
Out[13]:
(262, 68851)
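
As a quick sanity check of the weighting (my addition, assuming the get_feature_names() accessor of this scikit-learn version), the highest-weighted tf-idf terms of the first article, Pythagoras, can be listed:

import numpy as np
terms = np.array(count_vect.get_feature_names())
row = X_train_tfidf[0].toarray().ravel()  # tf-idf weights of document 0
top = row.argsort()[::-1][:10]            # indices of the ten largest weights
for t, w in zip(terms[top], row[top]):
    print t, w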

or directly using TfidfVectorizer

as used, for example, by Andrej Karpathy and Alexander Fabisch

rkeisler on Guardian articles: https://github.com/rkeisler/tsne_guardian/blob/master/tsne_guardian.py

# vecs: the output of CountVectorizer
# force vectors to have unit length:
norm = np.sqrt(vecs.multiply(vecs).sum(1))
vecs = vecs.multiply(1./norm)
distance_matrix = sklearn.metrics.pairwise.pairwise_distances(vecs, metric='cosine')
model = TSNE(early_exaggeration=4)
pos = model.fit_transform(distance_matrix)

additional parameters for TfidfVectorizer proposed by A. Karpathy

min_df=2, stop_words = 'english',\
strip_accents = 'unicode', lowercase=True, ngram_range=(1,2),\
norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True
In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectors = TfidfVectorizer().fit_transform(pkl_entries)
vectors.shape
Out[14]:
(262, 68851)
In [15]:
from sklearn.manifold import TSNE

define a function for easy plotting of embeddings

In [16]:
def plot_embedding(pos):
    # scatter the 2D positions and label each point with the wiki page name
    fig = plt.figure(figsize=(10, 10))
    ax = plt.axes(frameon=False)
    plt.setp(ax, xticks=(), yticks=())  # hide the (meaningless) t-SNE axes
    plt.scatter(pos[:, 0], pos[:, 1], s=5, color='r')
    for i, txt in enumerate(names):  # 'names' comes from the URL-splitting step above
        plt.annotate(txt, (pos[i, 0], pos[i, 1]), fontsize=6)

Alexander Fabisch method

In [17]:
from sklearn.decomposition import TruncatedSVD
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(vectors)
X_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_reduced)
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 262 / 262
[t-SNE] Mean sigma: 0.181085
[t-SNE] Iteration 10: error = 19.5840120, gradient norm = 0.0329279
[t-SNE] Iteration 20: error = 17.0377197, gradient norm = 0.0752239
[t-SNE] Iteration 30: error = 16.1263235, gradient norm = 0.0850579
[t-SNE] Iteration 32: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Iteration 40: error = 15.6700067, gradient norm = 0.0858924
[t-SNE] Iteration 50: error = 16.0684620, gradient norm = 0.0771126
[t-SNE] Iteration 60: error = 16.0344312, gradient norm = 0.0775531
[t-SNE] Iteration 64: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 64 iterations with early exaggeration: 16.027074
[t-SNE] Iteration 70: error = 1.9491604, gradient norm = 0.0199896
[t-SNE] Iteration 80: error = 1.7117975, gradient norm = 0.0209856
[t-SNE] Iteration 90: error = 1.8939256, gradient norm = 0.0254609
[t-SNE] Iteration 100: error = 2.2310724, gradient norm = 0.0347741
[t-SNE] Iteration 110: error = 2.4389540, gradient norm = 0.0383329
[t-SNE] Iteration 111: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 111 iterations: 2.467875
In [18]:
plot_embedding(X_embedded)
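
t-SNE is stochastic, so the map looks a little different on every run. For a repeatable layout, TSNE accepts a random_state seed (a minimal sketch; the seed value 0 is arbitrary):

X_embedded = TSNE(n_components=2, perplexity=40, verbose=2,
                  random_state=0).fit_transform(X_reduced)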

Andrej Karpathy method

In [19]:
vectorizer = TfidfVectorizer(min_df=2, stop_words = 'english',\
strip_accents = 'unicode', lowercase=True, ngram_range=(1,2),\
norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
X = vectorizer.fit_transform(pkl_entries)
D = -(X * X.T).todense()  # 'distance' matrix: negative dot products between tf-idf vectors, so similar documents get small values
ak_embed = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(D)
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 262 / 262
[t-SNE] Mean sigma: 0.298632
[t-SNE] Iteration 10: error = 18.3174214, gradient norm = 0.0362003
[t-SNE] Iteration 20: error = 16.8947560, gradient norm = 0.0595788
[t-SNE] Iteration 30: error = 16.2785143, gradient norm = 0.1064113
[t-SNE] Iteration 32: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Iteration 40: error = 17.4717802, gradient norm = 0.0666907
[t-SNE] Iteration 50: error = 16.8543602, gradient norm = 0.0710531
[t-SNE] Iteration 60: error = 17.3436406, gradient norm = 0.0682378
[t-SNE] Iteration 65: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 65 iterations with early exaggeration: 17.397422
[t-SNE] Iteration 70: error = 2.5222497, gradient norm = 0.0224481
[t-SNE] Iteration 80: error = 2.1122603, gradient norm = 0.0244114
[t-SNE] Iteration 90: error = 2.2152029, gradient norm = 0.0314139
[t-SNE] Iteration 100: error = 2.5744293, gradient norm = 0.0421679
[t-SNE] Iteration 110: error = 2.7956470, gradient norm = 0.0458885
[t-SNE] Iteration 113: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 113 iterations: 2.842739
In [20]:
plot_embedding(ak_embed)

Ryan Keisler method

In [21]:
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
vecs = X_train_counts
#force vectors to have unit length:
norm = np.sqrt(vecs.multiply(vecs).sum(1))
vecs = vecs.multiply(1./norm)
distance_matrix = pairwise_distances(vecs, metric='cosine')
model = TSNE(early_exaggeration=4)
rk_embed = model.fit_transform(distance_matrix)
In [22]:
plot_embedding(rk_embed)
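
To keep any of the three maps as an image, matplotlib's savefig can be called in the same cell as the plot (a minimal sketch; filename and dpi are my choices):

plot_embedding(rk_embed)
plt.savefig('rk_embed.png', dpi=300, bbox_inches='tight')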