Python tools for NLP

by Françoise Provencher (demo for PyLadies Montreal, July 17th 2014)

Fantasia festival starts today!

So many films, so little time. Which ones to choose?

Check it out! 3 weeks of genre films! Can we find similar movies by using their synopses?

1 - Let's build a corpus of film synopses. We need to scrape the web.

In which we profusely use Pattern for its web-friendly features.

In [ ]:
from pattern.web import URL, download, plaintext, Element, abs, Text
import re
In [ ]:
# The main page of Fantasia fest listing all the films
url = URL('http://fantasiafest.com/2014/en/films-schedule/films')
html = url.download(unicode=True)
In [ ]:
#List of links to all the films
element = Element(html)
links=[]
for link in element('h4 a'):
    formatted_link = abs(link.attributes.get('href',''), base=url.redirect or url.string)
    links.append(formatted_link)
In [ ]:
#List of durations
element = Element(html)
duration_pat = re.compile(r'[0-9]* min')
durations=[]
for e in element('div.info ul'):
    specs = plaintext(e.content)
    duration = int(duration_pat.search(specs).group()[:-4])
    durations.append(duration)
    print duration
In [ ]:
#List only films with duration over 45 minutes
feature_films = [link for (link, duration) in zip(links,durations) if duration>45]
print feature_films

For each feature film, get the synopsis

In [ ]:
# Demo for only one film
link = feature_films[165]
html = download(link)
element = Element(html)
title = plaintext(element('h1')[1].content)
synopsis = "\n".join([plaintext(e.content) for e in (element('div.synopsis p'))])
print title
print synopsis
In [ ]:
# Use a loop to get the title and synopsis of every feature film
fantasia2014={}
for link in feature_films:
    html = download(link)
    element = Element(html)
    title = plaintext(element('h1')[1].content)
    synopsis = "\n".join([plaintext(e.content) for e in (element('div.synopsis p'))])
    fantasia2014[title]=synopsis
    #print title

Done! This kind of web scraping can also be done with other modules, such as Requests (for URL requests) and BeautifulSoup (for parsing HTML).
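
For comparison, here is a rough sketch of the same scraping done with Requests and BeautifulSoup. It is only an illustration: it assumes the same page structure as above (film links under h4 a, specs under div.info ul), and the selectors may need adjusting if the site changes.

In [ ]:
# Sketch: the same scraping with Requests + BeautifulSoup (not used in the rest of this demo)
import re
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

base = 'http://fantasiafest.com/2014/en/films-schedule/films'
soup = BeautifulSoup(requests.get(base).text, 'html.parser')

# same CSS selectors as the Pattern version above
links = [urljoin(base, a.get('href', '')) for a in soup.select('h4 a')]

duration_pat = re.compile(r'[0-9]+ min')
durations = []
for ul in soup.select('div.info ul'):
    match = duration_pat.search(ul.get_text())
    if match:
        durations.append(int(match.group()[:-4]))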

2 - Now that we have the raw text, let's clean it!

In which we profusely use NLTK for its classic tokenizer, stemmer and lemmatizer. Check out the awesome free book Natural Language Processing with Python.

Splitting the text into words : Tokenization

In [ ]:
#Fast and dirty : split on whitespace, remove preceding/trailing punctuation
punctuation = u",.;:'()\u201c\u2026\u201d\u2013\u2019\u2014"
splitted_text = fantasia2014["The Zero Theorem"].split()
clean_text = [w.strip(punctuation) for w in splitted_text]
print clean_text
In [ ]:
# More sophisticated : using a tokenizer
import nltk
synopsis = fantasia2014["The Zero Theorem"]
tokens = [word for sent in nltk.sent_tokenize(synopsis) for word in nltk.word_tokenize(sent)]
print tokens
In [ ]:
# I prefer my fast and dirty way for this corpus, so let's use a loop to apply it
# to all the texts

punctuation = u",.;:'()\u201c\u2026\u201d\u2013\u2019\u2014"
fantasia2014_tokenized = dict()

for title in fantasia2014:
    splitted_text = fantasia2014[title].split()
    fantasia2014_tokenized[title] = [w.strip(punctuation) for w in splitted_text
                                     if w.strip(punctuation) != ""]
    
#print fantasia2014_tokenized["The Zero Theorem"]
    

Getting the root of the words : stemming and lemmatization

In [ ]:
# Stemming : uses rules to chop off the ends of words
stemmer  = nltk.stem.porter.PorterStemmer()
singular = stemmer.stem("zombie")
plural   = stemmer.stem("zombies")

print singular, plural
print (singular==plural)
In [ ]:
# Lemmatizing : uses a dictionary
from nltk import WordNetLemmatizer as wnl
singular = wnl().lemmatize("zombie")
plural   = wnl().lemmatize("zombies")

print singular, plural
print (singular==plural)
In [ ]:
# I like the lemmatization better.
# Let's lemmatize all the texts
fantasia2014_lemma = dict()

for title in fantasia2014_tokenized:
    synopsis = []
    for word in fantasia2014_tokenized[title]:
        lemma = wnl().lemmatize(word.lower()) # lowercasing the text is another normalization step
        synopsis.append(lemma)
    fantasia2014_lemma[title] = synopsis
    
print fantasia2014_lemma["The Zero Theorem"]
    

Just for fun : stopwords and collocations

In [ ]:
# Collocations are bigrams (pairs of words) that occur together more often than expected by chance
# Get the collocations
all_texts = []
for title in fantasia2014_lemma:
    all_texts.extend(fantasia2014_lemma[title])
                     
bigrams = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(all_texts)
scored = finder.score_ngrams(bigrams.likelihood_ratio)

print scored
In [ ]:
#Let's remove the stopwords (a, the, in, into, on ...) and try again

stop = nltk.corpus.stopwords.words('english') #list of stopwords from NLTK
fantasia2014_stop=dict()

for title in fantasia2014_lemma:
    fantasia2014_stop[title] = [w for w in fantasia2014_lemma[title] if w not in stop]
In [ ]:
#This is the same as above, but with the stopwords removed
all_texts = []
for title in fantasia2014_stop:
    all_texts.extend(fantasia2014_stop[title])
                     
bigrams = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(all_texts)
scored = finder.score_ngrams(bigrams.likelihood_ratio)

print scored

NLTK has a lot more to offer : part-of-speech tagging, etc. Have a look to see if it's the right fit for you!
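
For instance, part-of-speech tagging takes only one call. A minimal sketch on a made-up sentence (it assumes the relevant NLTK tagger data has been downloaded, e.g. with nltk.download()):

In [ ]:
# Tag each token with its part of speech (noun, verb, adjective, ...)
sentence = "A reclusive genius crunches numbers for a shadowy corporation."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print tagged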

3 - Now that we have a clean corpus, let's train a linguistic model to find similarity between documents

In which we profusely use Gensim, which is great for topic modeling. Also check out the awesome blog of the developer (that guy took Google's word2vec C code and made it faster in Python). The following is an adaptation of the tutorials found here; please refer to them for more explanations.

In [ ]:
from gensim import corpora, models, similarities

#put the text in the right format : lists
titles=[]
texts=[]
for title in fantasia2014_stop:
    titles.append(title)
    texts.append(fantasia2014_stop[title])
    
#remove words that occur only once to reduce the size
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

Build a model (TF-IDF)

Term frequency–inverse document frequency (TF-IDF) measures how important a word is to a document: a word scores high when it is frequent in that document but rare across the collection as a whole.
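
To make the idea concrete, here is a tiny hand-rolled version on a toy corpus. Gensim's actual weighting and normalization differ in the details, so treat this as an illustration only.

In [ ]:
# Toy TF-IDF: a word weighs more when it is frequent in a document but rare in the collection
from math import log

toy_docs = [["zombie", "comedy", "zombie"],
            ["comedy", "romance"],
            ["zombie", "horror"]]

def toy_tfidf(word, doc, docs):
    tf = doc.count(word) / float(len(doc))   # term frequency in this document
    df = sum(1 for d in docs if word in d)   # number of documents containing the word
    idf = log(len(docs) / float(df))         # rare words get a higher idf
    return tf * idf

print toy_tfidf("zombie", toy_docs[0], toy_docs)   # frequent here, but also common elsewhere
print toy_tfidf("romance", toy_docs[1], toy_docs)  # appears in only one document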

In [ ]:
# Build a model
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
corpus_tfidf = tfidf[corpus]      # step 2 -- apply the transformation to the corpus
In [ ]:
# What does it look like?
for doc in corpus_tfidf:
    print(doc)

Topic modeling : Latent semantic indexing

TF-IDF is fine, but what if we have 2 documents talking about the same thing but with different words, e.g. "Funny zombie movie" and "comedy of the undead"? Well, if all these words sometimes appear together in other documents, they can be assigned to the same topic, and we can use these topics to find the similarity between documents. Latent semantic indexing uses singular value decomposition (SVD).

In [ ]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=15)
lsi.print_topics(5)
In [ ]:
# What does this look like?
corpus_lsi = lsi[corpus_tfidf] # apply the LSI transformation to the TF-IDF corpus
for doc in corpus_lsi:
    print doc
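
To illustrate the "Funny zombie movie" / "comedy of the undead" point above, we can map any two short descriptions into the LSI space and compare them there. A minimal sketch (it assumes these words survived the preprocessing and are in the dictionary; doc2bow silently ignores unknown words):

In [ ]:
# Compare two short descriptions in topic space rather than by raw word overlap
from gensim import matutils

doc_a = ["funny", "zombie", "movie"]
doc_b = ["comedy", "undead"]

vec_a = lsi[dictionary.doc2bow(doc_a)]
vec_b = lsi[dictionary.doc2bow(doc_b)]

print matutils.cossim(vec_a, vec_b)  # cosine similarity between the two LSI vectors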

Now let's find a film similar to The Zero Theorem

In [ ]:
#Which titles can we play with?
#print titles
In [ ]:
#Get the index of the film we wish to query
ind = titles.index("The Zero Theorem")

#Transform film synopsis to LSI space
doc = texts[ind]
vec_bow = dictionary.doc2bow(doc)
vec_lsi = lsi[vec_bow] # convert the query to LSI space
    
print(vec_lsi)
In [ ]:
#transform corpus to LSI space and index it IN RAM!
index = similarities.MatrixSimilarity(lsi[corpus]) 

# perform a similarity query against the corpus and sort the results
sims = index[vec_lsi] 
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# print out nicely the 10 most similar films
for document_num, sim in sims[:10]: # sorted (document number, similarity score) 2-tuples
    print titles[document_num], str(sim)

Fun with Word2Vec

In [ ]:
#Train the model with our corpus
model_w2v = models.Word2Vec(texts, min_count=3)
In [ ]:
#Query to find the most similar word
model_w2v.most_similar(positive=['horror'], topn=5)
In [ ]:
#Query the model
this = "light"
is_to = "dark"
what = "angel"
is_to2= model_w2v.most_similar(positive=[is_to, what], negative=[this], topn=3)

print this+' is to '+is_to+' as '+what+' is to : '
print is_to2

Our corpus is too small to get an accurate model. Let's use Google news instead.

In [ ]:
# Load the model, downloaded from : https://code.google.com/p/word2vec/
model_GN = models.Word2Vec.load_word2vec_format('/Users/francoiseprovencher/Documents/Word2VecBinaries/GoogleNews-vectors-negative300.bin.gz', binary=True)
In [ ]:
#Query to find the most similar word
model_GN.most_similar(positive=['zombie'], topn=5)
In [ ]:
#Query the model
this = "light"
is_to = "dark"
what = "angel"
is_to2= model_GN.most_similar(positive=[is_to, what], negative=[this], topn=3)

print this+' is to '+is_to+' as '+what+' is to : '
print is_to2