Text mining (NLP) with Python

Author: Ties de Kok (Personal Website)
Last updated: 18 May 2018
Python version: Python 3.6
License: MIT License

Note: Some features (like the ToC) will only work if you run the notebook or if you use nbviewer by clicking this link:
https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb

Introduction

This notebook contains code examples to get you started with Natural Language Processing (NLP) / Text Mining for Research and Data Science purposes.

In the large scheme of things there are roughly 4 steps:

  1. Identify a data source
  2. Gather the data
  3. Process the data
  4. Analyze the data

This notebook only discusses steps 3 and 4. If you want to learn more about step 2, see my Python tutorial.

Note: companion slides

This notebook was designed to accompany a PhD course session on NLP techniques in Accounting Research.
The slides of this session are publicly available here: Slides

Elements / topics that are discussed in this notebook:

Table of Contents

Primer on NLP tools (to top)

There are many tools available for NLP purposes.
The code examples below are based on what I personally like to use; they are not intended to be a comprehensive overview.

Besides built-in Python functionality I will use / demonstrate the following packages:

Standard NLP libraries:

  1. Spacy and the higher-level wrapper Textacy
  2. NLTK and the higher-level wrapper TextBlob

Note: besides installing the above packages you also often have to download (model) data. Make sure to check the documentation!

Standard machine learning library:

  1. scikit learn

Specific task libraries:

There are many, just a couple of examples:

  1. pyLDAvis for visualizing LDA
  2. langdetect for detecting languages
  3. fuzzywuzzy for fuzzy text matching
  4. textstat to calculate readability statistics
  5. Gensim for topic modelling

Get some example data (to top)

There are many example datasets available to play around with, see for example this great repository:
https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table

The data that I will use for most of the examples is the "Reuter_50_50 Data Set" that is used for author identification experiments.

See the details here: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50

Download and load the data

Can't follow what I am doing here? Please see my Python tutorial (although the zipfile and io operations are not very relevant).

In [2]:
import requests, zipfile, io, os

Download and extract the zip file with the data

In [3]:
if not os.path.exists('C50test'):
    r = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip")
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()

Load the data into memory

In [4]:
folder_dict = {'test' : 'C50test'}
text_dict = {'test' : {}}
In [5]:
for label, folder in folder_dict.items():
    authors = os.listdir(folder)
    for author in authors:
        text_files = os.listdir(os.path.join(folder, author))
        for file in text_files:
            with open(os.path.join(folder, author, file), 'r') as text_file:
                text_dict[label].setdefault(author, []).append(' '.join(text_file.readlines()))

Note: the text comes pre-split per sentence; for the sake of example I undo this through ' '.join(text_file.readlines())

In [6]:
text_dict['test']['TimFarrand'][0]
Out[6]:
'Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997. The shares fell 6p to 781p on the news.\n "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  \n Dermott Carr, an analyst at Nikko said, "the market is going to hang onto them for the moment but until we get a decision they will be held back."\n Whatever the MMC decides many analysts expect Lang to defer a decision until after the next general election which will be called by May 22.\n "They will probably try to defer the decision until after the election. I don\'t think they want the negative PR of having a large number of people fired," said Wakley.  \n If the deal does not go through, analysts calculate the maximum loss to Bass of 60 million, with most sums centred on the 30-40 million range.\n "It\'s a maxiumum loss of 60 million for Bass if they fail and, unlike Allied, you would have to compare it to the perceived upside of doing the deal," said Wakley.\n Bass said at the time of the deal it would take a one-off charge of 75 million stg for restructuring the combined business, resulting in expected annual cost savings of 90 million stg within three years.  \n Under the terms of the complex deal, if Bass cannot combine C-T with its own brewing business within 16 months, it has the option to put its whole shareholding to Carlsberg for 110 million stg and Carlsberg has an option to put 15 percent of C-T to Allied Domecq, which would reimburse Bass 30 million stg.\n Bass is also entitled to receive 50 percent of all profits earnied by C-T until the merger is complete, which should give it some 30-35 million stg in a full year. 
Carlsberg has agreed to contribute its interests and 20 million stg in exchange for a 20 percent share in the combined Bass Breweries and Carlsberg-Tetley business.\n C-T was a joint venture between Allied Domecq and Carlsberg formed in 1992 by the merger of their UK brewing and wholesaleing businesses.\n -- London Newsroom +44 171 542 6437\n'

Process + Clean text (to top)

Convert the text into an NLP representation

We can use the text directly, but if we want to use packages like spacy and textblob we first have to convert the text into a corresponding object.

Spacy

Note: depending on the way that you installed the language models you will need to import it differently:

from spacy.en import English
parser = English()

OR

import en_core_web_sm
parser = en_core_web_sm.load()
In [8]:
import en_core_web_sm
parser = en_core_web_sm.load()

Convert all text in the "test" sample to a spacy doc object using parser():

In [9]:
spacy_text = {}
for author, text_list in text_dict['test'].items():
    spacy_text[author] = [parser(text) for text in text_list]
In [10]:
type(spacy_text['TimFarrand'][0])
Out[10]:
spacy.tokens.doc.Doc

NLTK

In [11]:
import nltk

We can apply basic nltk operations directly to the text so we don't need to convert first.

TextBlob

In [13]:
from textblob import TextBlob

Convert all text in the "test" sample to a TextBlob object using TextBlob():

In [14]:
textblob_text = {}
for author, text_list in text_dict['test'].items():
    textblob_text[author] = [TextBlob(text) for text in text_list]
In [15]:
type(textblob_text['TimFarrand'][0])
Out[15]:
textblob.blob.TextBlob

Normalization (to top)

Text normalization describes the task of transforming the text into a different (more comparable) form.

This can mean many things; I will show a couple of examples below:

Deal with unwanted characters (to top)

You will often notice that there are characters that you don't want in your text.

Let's look at this sentence for example:

"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

You notice that there are some \ and \n characters in there. These define how a string should be displayed; if we print this text we get:

In [16]:
text_dict['test']['TimFarrand'][0][:298]
Out[16]:
"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"
In [17]:
print(text_dict['test']['TimFarrand'][0][:298])
Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers

If we want to analyze text we often don't care about the visual representation, and these characters might actually cause problems!

So how do we remove them?

In many cases it is sufficient to simply use the .replace() function:

In [18]:
text_dict['test']['TimFarrand'][0][:298].replace('\n', '').replace('\\', '')
Out[18]:
"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts. Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

Sometimes, however, the problem arises from encoding / decoding issues.

In those cases you can usually do something like:

In [19]:
problem_sentence = 'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
print(problem_sentence.encode().decode('unicode_escape').encode('ascii','ignore'))
b"This is some  text that has to be cleaned! it's annoying!"
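Wrapped into a reusable helper (a sketch; `clean_unicode_escapes` is a hypothetical name, and the final `.decode('ascii')` returns a str instead of the bytes object printed above):

```python
def clean_unicode_escapes(text):
    # Interpret literal \uXXXX escape sequences, then drop any
    # characters that cannot be represented in ASCII
    interpreted = text.encode().decode('unicode_escape')
    return interpreted.encode('ascii', 'ignore').decode('ascii')

problem_sentence = 'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
print(clean_unicode_escapes(problem_sentence))
# This is some  text that has to be cleaned! it's annoying!
```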

Sentence segmentation (to top)

Sentence segmentation means the task of splitting up the piece of text by sentence.

You could do this by splitting on the . symbol, but dots are used in many other cases as well so it is not very robust:

In [20]:
text_dict['test']['TimFarrand'][0][:550].split('.')
Out[20]:
["Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts",
 '\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997',
 ' The shares fell 6p to 781p on the news',
 '\n "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers',
 '  \n Dermott Carr, an analyst at Nikko said, "the mark']

It is better to use a more sophisticated implementation such as the one by Spacy:

In [21]:
example_paragraph = spacy_text['TimFarrand'][0]
In [22]:
sentence_list = [s for s in example_paragraph.sents]
sentence_list[:5]
Out[22]:
[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
  ,
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,
 The shares fell 6p to 781p on the news.
  ,
 "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  
  ,
 Dermott Carr, an analyst at Nikko said, "the market is going to hang onto them for the moment but until we get a decision they will be held back."
  ]

Notice that the returned object is still a spacy object:

In [23]:
type(sentence_list[0])
Out[23]:
spacy.tokens.span.Span

Apply to all texts (for use later on):

In [24]:
spacy_sentences = {}
for author, text_list in spacy_text.items():
    spacy_sentences[author] = [list(text.sents) for text in text_list]
In [25]:
spacy_sentences['TimFarrand'][0][:3]
Out[25]:
[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
  ,
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,
 The shares fell 6p to 781p on the news.
  ]

Word tokenization (to top)

Word tokenization means to split the sentence (or text) up into words.

In [26]:
example_sentence = spacy_sentences['TimFarrand'][0][0]
example_sentence
Out[26]:
Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
 

A word is called a token in this context (hence "tokenization"). Using spacy:

In [27]:
token_list = [token for token in example_sentence]
token_list[0:15]
Out[27]:
[Shares,
 in,
 brewing,
 -,
 to,
 -,
 leisure,
 group,
 Bass,
 Plc,
 are,
 likely,
 to,
 be,
 held]

Lemmatization & Stemming (to top)

In some cases you want to convert a word (i.e. token) into a more general representation.

For example: convert "car", "cars", "car's", "cars'" all into the word car.

This is generally done through lemmatization / stemming (different approaches trying to achieve a similar goal).

Spacy

Spacy offers built-in functionality for lemmatization:

In [28]:
lemmatized = [token.lemma_ for token in example_sentence]
lemmatized[0:15]
Out[28]:
['share',
 'in',
 'brewing',
 '-',
 'to',
 '-',
 'leisure',
 'group',
 'bass',
 'plc',
 'be',
 'likely',
 'to',
 'be',
 'hold']

NLTK

Using the NLTK library we can also use the more aggressive Porter Stemmer.

In [29]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
In [30]:
stemmed = [stemmer.stem(token.text) for token in example_sentence]
stemmed[0:15]
Out[30]:
['share',
 'in',
 'brew',
 '-',
 'to',
 '-',
 'leisur',
 'group',
 'bass',
 'plc',
 'are',
 'like',
 'to',
 'be',
 'held']

Compare:

In [31]:
for original, lemma, stem in zip(token_list[:15], lemmatized[:15], stemmed[:15]):
    print(original, ' | ', lemma, ' | ', stem)
Shares  |  share  |  share
in  |  in  |  in
brewing  |  brewing  |  brew
-  |  -  |  -
to  |  to  |  to
-  |  -  |  -
leisure  |  leisure  |  leisur
group  |  group  |  group
Bass  |  bass  |  bass
Plc  |  plc  |  plc
are  |  be  |  are
likely  |  likely  |  like
to  |  to  |  to
be  |  be  |  be
held  |  hold  |  held

In my experience it is usually best to use lemmatization instead of a stemmer.

Language modeling (to top)

Text is inherently structured in complex ways; we can often exploit some of this underlying structure.

Part-of-Speech tagging (to top)

Part of speech tagging refers to the identification of words as nouns, verbs, adjectives, etc.

Using Spacy:

In [32]:
pos_list = [(token, token.pos_) for token in example_sentence]
pos_list[0:10]
Out[32]:
[(Shares, 'NOUN'),
 (in, 'ADP'),
 (brewing, 'NOUN'),
 (-, 'PUNCT'),
 (to, 'ADP'),
 (-, 'PUNCT'),
 (leisure, 'NOUN'),
 (group, 'NOUN'),
 (Bass, 'PROPN'),
 (Plc, 'PROPN')]

Uni-Gram & N-Grams (to top)

Obviously a sentence is not a random collection of words; the sequence of words carries information value.

A simple way to incorporate some of this sequence is by using what are called n-grams.
An n-gram is nothing more than a combination of n words into one token (a uni-gram token is just one word).

So we can convert "Sentence about flying cars" into a list of bigrams:

Sentence-about, about-flying, flying-cars

See my slide on N-Grams for a more comprehensive example: click here
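The idea is easy to sketch in plain Python before reaching for a library (a minimal sketch; `ngrams` is a hypothetical helper, not part of NLTK):

```python
def ngrams(tokens, n):
    # Join each window of n consecutive tokens into a single string
    return ['-'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Sentence about flying cars".split()
print(ngrams(tokens, 2))  # ['Sentence-about', 'about-flying', 'flying-cars']
```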

Using NLTK:

In [33]:
bigram_list = ['-'.join(x) for x in nltk.bigrams([token.text for token in example_sentence])]
bigram_list[10:15]
Out[33]:
['are-likely', 'likely-to', 'to-be', 'be-held', 'held-back']

Stop words (to top)

Depending on what you are trying to do it is possible that there are many words that don't add any information value to the sentence.

The primary example is stop words.

Sometimes you can improve the accuracy of your model by removing stop words.

Using Spacy:

In [34]:
no_stop_words = [token for token in example_sentence if not token.is_stop]
In [35]:
no_stop_words[:10]
Out[35]:
[Shares, brewing, -, -, leisure, group, Bass, Plc, likely, held]
In [36]:
token_list[:10]
Out[36]:
[Shares, in, brewing, -, to, -, leisure, group, Bass, Plc]

Note that we can also remove punctuation in the same way:

In [37]:
[token for token in example_sentence if not token.is_punct][:5]
Out[37]:
[Shares, in, brewing, to, leisure]

Wrap everything into one function

Below I will primarily use SpaCy directly. However, I also recommend checking out the high-level wrapper Textacy.

See their GitHub page for details: https://github.com/chartbeat-labs/textacy

Quick Textacy example

In [38]:
import textacy
In [39]:
example_text = text_dict['test']['TimFarrand'][0]
In [40]:
cleaned_text = textacy.preprocess_text(example_text, lowercase=True, fix_unicode=True, no_punct=True)

Basic SpaCy text processing function

  1. Split into sentences
  2. Apply the lemmatizer and remove stop words
  3. Clean up the sentence using textacy
In [41]:
def process_text_custom(text):
    sentences = list(parser(text).sents)
    lemmatized_sentences = []
    for sentence in sentences:
        lemmatized_sentences.append([token.lemma_ for token in sentence if not token.is_stop | token.is_punct | token.is_space])
    return [parser(' '.join(sentence)) for sentence in lemmatized_sentences]
In [42]:
%%time
spacy_text_clean = {}
for author, text_list in text_dict['test'].items():
    lst = []
    for text in text_list:
        lst.append(process_text_custom(text))
    spacy_text_clean[author] = lst
Wall time: 14min 45s

Note that there are quite a lot of sentences (~56K), so this takes a while to run.

In [43]:
count = 0
for author, texts in spacy_text_clean.items():
    for text in texts:
        count += len(text)
print('Number of sentences:', count)
Number of sentences: 56125

Result

In [44]:
spacy_text_clean['TimFarrand'][0][:3]
Out[44]:
[share brewing leisure group bass plc likely hold britain 's trade industry secretary ian lang decide allow propose merge brewer carlsberg tetley say analyst,
 early lang announce bass deal refer monoplies mergers commission report march 24 1997,
 the share fall 6p 781p news]

Direct feature extraction (to top)

We have now pre-processed the text into something we can use for direct feature extraction, or convert into a numerical representation.

Feature search (to top)

Entity recognition (to top)

It is often useful / relevant to extract entities that are mentioned in a piece of text.

SpaCy is quite powerful at extracting entities; however, it doesn't work very well on lowercase text.

Given that "token.lemma_" removes capitalization I will use spacy_sentences for this example.

In [46]:
example_sentence = spacy_sentences['TimFarrand'][0][3]
example_sentence
Out[46]:
"The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  
 
In [47]:
[(i, i.label_) for i in parser(example_sentence.text).ents]
Out[47]:
[(March, 'DATE'), (John Wakley, 'PERSON'), (Lehman Brothers, 'ORG')]
In [48]:
example_sentence = spacy_sentences['TimFarrand'][4][0]
example_sentence
Out[48]:
British pub-to-hotel group Greenalls Plc on Thursday reported a 48 percent rise in profits before exceptional items to 148.7 million pounds ($246.4 million), driven by its acquisition of brewer Boddington in November 1995.
 
In [49]:
[(i, i.label_) for i in parser(example_sentence.text).ents]
Out[49]:
[(British, 'NORP'),
 (Greenalls Plc, 'PERSON'),
 (Thursday, 'DATE'),
 (48 percent, 'PERCENT'),
 (148.7 million pounds, 'MONEY'),
 ($246.4 million, 'MONEY'),
 (Boddington, 'GPE'),
 (November 1995, 'DATE')]

Pattern search (to top)

Using the built-in re (regular expression) library you can pattern match nearly anything you want.

I will not go into details about regular expressions but see here for a tutorial:
https://regexone.com/references/python

In [50]:
import re

TIP: Use Pythex.org to try out your regular expression

Example on Pythex: click here

Example 1:

In [51]:
string_1 = 'Ties de Kok (#IDNUMBER: 123-AZ). Rest of text...'
string_2 = 'Philip Joos (#IDNUMBER: 663-BY). Rest of text...'
In [52]:
pattern = r'#IDNUMBER: (\d\d\d-\w\w)'
In [53]:
print(re.findall(pattern, string_1)[0])
print(re.findall(pattern, string_2)[0])
123-AZ
663-BY

Example 2:

If a sentence contains the word 'million' return True, otherwise return False

In [54]:
TERM = 'million'
for sen in spacy_text_clean['TimFarrand'][2]:
    contains = True if re.search(TERM, sen.text) else False
    if contains:
        print(sen)
analyst forecast pretax profit range 218 232 million stg restructuring cost 206 million time
a restructure cost 35 million anticipate bulk 25 million stem closure small production plant france
cadbury 's u.s. drink business turn 112 million stg trading profit 59 million half 1995 entirely contribution dr pepper
campbell estimate uk beverage contribute 47 million stg operating profit 50 million time
broadly analyst expect pretty flat performance group 's confectionery business consensus forecast 110 million stg operating profit
on average analyst calculate beverage chip trading profit 150 million
after sale 51 percent stake coca cola amp schweppes beverages ccsb operation coca cola enterprises june 620 million stg analyst want clear statement strategy company
but far analyst company say shareholder expect return investment emerge market large far 75 million russian plant
cadbury announce investment 20 million stg build new plant wrocoaw poland 1993 joint venture china cost 20 million
net debt 1.34 billion end 1995 fall 510 million end 1996 result ccsb sale provide acquisition

Text evaluation (to top)

Besides feature search there are also many ways to analyze the text as a whole.

Let's, for example, evaluate the following paragraph:

In [55]:
example_paragraph = ' '.join([x.text for x in spacy_text_clean['TimFarrand'][2]])
example_paragraph[:500]
Out[55]:
"soft drink confectionery group cadbury schweppes plc expect report solid percent rise half profit wednesday face question performance 7up soft drink u.s. one main question success relaunch 7up brand say mark duffy food manufacturing analyst sbc warburg competitor sprite own coca cola see agressive marketing push rank fast grow brand u.s. cadbury 's dr pepper analyst forecast pretax profit range 218 232 million stg restructuring cost 206 million time a dividend 5.1 penny expect 4.9p a restructure"

Language (to top)

Using the langdetect package it is easy to detect the language of a piece of text

In [57]:
from langdetect import detect
In [58]:
detect(example_paragraph)
Out[58]:
'en'

Readability (to top)

Using the textstat package we can compute various readability metrics

In [59]:
from textstat.textstat import textstat
In [60]:
print(textstat.flesch_reading_ease(example_paragraph))
print(textstat.smog_index(example_paragraph))
print(textstat.flesch_kincaid_grade(example_paragraph))
print(textstat.coleman_liau_index(example_paragraph))
print(textstat.automated_readability_index(example_paragraph))
print(textstat.dale_chall_readability_score(example_paragraph))
print(textstat.difficult_words(example_paragraph))
print(textstat.linsear_write_formula(example_paragraph))
print(textstat.gunning_fog(example_paragraph))
print(textstat.text_standard(example_paragraph))
21.2
18.9
20.5
16.67
26.0
9.09
89
8.6
27.120776699029125
8th and 9th grade
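For reference, the Flesch reading ease score printed first above is a fixed formula over word, sentence, and syllable counts; the hard part that textstat implements is the syllable counting. A sketch with hand-supplied counts (`flesch_reading_ease` here is a hypothetical helper, not the textstat function):

```python
def flesch_reading_ease(words, sentences, syllables):
    # Higher scores indicate easier text; 60-70 is roughly plain English
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# e.g. 100 words over 5 sentences with 130 syllables
print(flesch_reading_ease(100, 5, 130))
```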

Text similarity

In [61]:
from fuzzywuzzy import fuzz
In [62]:
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
Out[62]:
91
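fuzz.ratio is, to my understanding, based on the same sequence matching as Python's built-in difflib, so a rough equivalent can be sketched without the package (`similarity` is a hypothetical helper):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Share of matching characters over both strings, scaled to 0-100
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

print(similarity("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 91
```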

Term (dictionary) counting (to top)

One of the most common techniques that researchers currently use (at least in Accounting research) is computing simple metrics based on counting words from a dictionary.
This technique is, for example, very prevalent in sentiment analysis (counting positive and negative words).

In essence this technique is very simple to program:

Example 1:

In [63]:
word_dictionary = ['soft', 'first', 'most', 'be']
In [64]:
for word in word_dictionary:
    print(word, example_paragraph.count(word))
soft 3
first 0
most 1
be 8
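One caveat: str.count matches substrings, so the count of 8 for 'be' above also includes occurrences inside longer words (e.g. 'beverage'). A word-boundary regex avoids this (a sketch; `count_term` is a hypothetical helper):

```python
import re

def count_term(text, term):
    # \b restricts matches to whole words, so 'be' does not match inside 'bear'
    return len(re.findall(r'\b' + re.escape(term) + r'\b', text.lower()))

sentence = "To be or not to bear"
print(sentence.count('be'))        # 2 (also matches inside 'bear')
print(count_term(sentence, 'be'))  # 1 (whole words only)
```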

Example 2:

In [65]:
pos = ['great', 'increase']
neg = ['bad', 'decrease']

sentence = '''According to Trump everything is great, great, 
and great even though his popularity is seeing a decrease.'''

pos_count = 0
for word in pos:
    pos_count += sentence.lower().count(word)
print(pos_count)

neg_count = 0
for word in neg:
    neg_count += sentence.lower().count(word)
print(neg_count)

pos_count / (neg_count + pos_count)
3
1
Out[65]:
0.75

Getting the total number of words is also easy:

In [70]:
len(parser(example_paragraph))
Out[70]:
419

Example 3:

We can also save the count per word

In [71]:
pos_count_dict = {}
for word in pos:
    pos_count_dict[word] = sentence.lower().count(word)
In [72]:
pos_count_dict
Out[72]:
{'great': 3, 'increase': 0}

Represent text numerically (to top)

Bag of Words (to top)

Sklearn includes the CountVectorizer and TfidfVectorizer function.

For details, see the documentation:
TF
TFIDF

Note 1: these functions already include a lot of preprocessing options (e.g. n-grams, stop-word removal, accent stripping).

Note 2: example based on the following website: click here

In [73]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Simple example:

In [74]:
doc_1 = "The sky is blue."
doc_2 = "The sun is bright today."
doc_3 = "The sun in the sky is bright."
doc_4 = "We can see the shining sun, the bright sun."

Calculate term frequency:

In [75]:
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform([doc_1, doc_2, doc_3, doc_4])
In [76]:
print(vectorizer.get_feature_names())
for doc_tf_vector in tf.toarray():
    print(doc_tf_vector)
['blue', 'bright', 'shining', 'sky', 'sun', 'today']
[1 0 0 1 0 0]
[0 1 0 0 1 1]
[0 1 0 1 1 0]
[0 1 1 0 2 0]

TF-IDF (to top)

In [77]:
transformer = TfidfVectorizer(stop_words='english')
tfidf = transformer.fit_transform([doc_1, doc_2, doc_3, doc_4])
In [78]:
for doc_vector in tfidf.toarray():
    print(doc_vector)
[0.78528828 0.         0.         0.6191303  0.         0.        ]
[0.         0.47380449 0.         0.         0.47380449 0.74230628]
[0.         0.53256952 0.         0.65782931 0.53256952 0.        ]
[0.         0.36626037 0.57381765 0.         0.73252075 0.        ]
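To see where these weights come from, here is a hand-rolled sketch of the basic tf-idf idea over the same four documents (sklearn additionally applies idf smoothing and L2-normalizes each row, so its numbers differ; `tf_idf` is a hypothetical helper):

```python
import math

def tf_idf(term, doc, docs):
    tf = doc.count(term)                    # raw term frequency in this document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / df)          # down-weight terms common across documents
    return tf * idf

docs = [["sky", "blue"],
        ["sun", "bright", "today"],
        ["sun", "sky", "bright"],
        ["shining", "sun", "bright", "sun"]]

print(tf_idf("sun", docs[3], docs))   # appears twice, but in 3 of 4 docs
print(tf_idf("blue", docs[0], docs))  # appears once, in only 1 of 4 docs
```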

More elaborate example:

In [79]:
clean_paragraphs = []
for author, value in spacy_text_clean.items():
    for article in value:
        clean_paragraphs.append(' '.join([x.text for x in article]))
In [80]:
len(clean_paragraphs)
Out[80]:
2500
In [81]:
transformer = TfidfVectorizer(stop_words='english')
tfidf_large = transformer.fit_transform(clean_paragraphs)
In [82]:
print('Number of vectors:', len(tfidf_large.toarray()))
print('Number of words in dictionary:', len(tfidf_large.toarray()[0]))
Number of vectors: 2500
Number of words in dictionary: 24092
In [83]:
tfidf_large
Out[83]:
<2500x24092 sparse matrix of type '<class 'numpy.float64'>'
	with 446676 stored elements in Compressed Sparse Row format>

Word Embeddings (to top)

Word2Vec (to top)

In [79]:
import gensim
from nltk.corpus import brown
In [80]:
sentences = brown.sents()
model = gensim.models.Word2Vec(sentences, min_count=1)

Save model

In [81]:
model.save('brown_model')

Load model

In [82]:
model = gensim.models.Word2Vec.load('brown_model')

Find words most similar to 'mother':

In [83]:
print(model.most_similar("mother"))
[('father', 0.9751995801925659), ('husband', 0.9580737352371216), ('wife', 0.9531918168067932), ('son', 0.9327206611633301), ('voice', 0.9201934337615967), ('boy', 0.9148358106613159), ('friend', 0.9091534614562988), ('ache', 0.8969202041625977), ('parents', 0.8916678428649902), ('maid', 0.8907431960105896)]

Find the odd one out:

In [84]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
In [85]:
print(model.doesnt_match("pizza pasta garden fries".split()))
garden

Retrieve vector representation of the word "human"

In [86]:
model['human']
Out[86]:
array([ -1.24233282e+00,   1.73863158e-01,  -6.12607360e-01,
        -5.31224608e-01,   2.60317028e-01,  -3.56246114e-01,
        -5.06218493e-01,  -1.15960002e-01,   3.04902345e-01,
        -1.21463396e-01,   4.93332326e-01,  -9.75235820e-01,
         3.44474047e-01,   9.77401040e-04,   5.05028106e-02,
        -3.87262404e-01,   4.93362129e-01,   7.00087488e-01,
        -6.27336025e-01,  -4.63026613e-01,   2.79739406e-02,
         1.45691419e+00,   4.07162786e-01,  -4.19379741e-01,
        -8.41612220e-01,   8.46711546e-02,   3.03834379e-01,
        -5.89724183e-01,  -5.50288737e-01,   1.14675418e-01,
        -9.28169414e-02,  -5.00818849e-01,   1.74140222e-02,
         2.45587930e-01,   9.37732458e-02,  -6.30766377e-02,
         3.79322648e-01,  -9.30945396e-01,   1.81099728e-01,
         4.46529061e-01,   5.09826422e-01,  -4.00113940e-01,
        -3.06686193e-01,   5.83700202e-02,  -1.30845475e+00,
        -8.19562197e-01,  -1.43999264e-01,  -1.79302439e-01,
        -9.88642037e-01,   6.19562924e-01,  -5.98924696e-01,
        -3.26148927e-01,  -2.68154591e-01,   1.32927846e-03,
         4.15733218e-01,   4.20322359e-01,   2.24591553e-01,
        -8.06021392e-02,   1.66282967e-01,  -5.05886197e-01,
         4.11779553e-01,   2.37013131e-01,   9.44843650e-01,
         1.08043969e+00,   1.97366968e-01,  -2.09960312e-01,
        -2.96899788e-02,   6.43389523e-01,  -9.92119789e-01,
        -5.22915125e-02,  -4.15121198e-01,   6.58638895e-01,
        -4.61580336e-01,  -1.06919587e+00,  -3.75133425e-01,
        -2.02061430e-01,   1.24140203e+00,   2.50428259e-01,
         6.01192236e-01,   4.85432506e-01,   1.26407454e-02,
         7.29153931e-01,  -1.80993602e-01,  -9.56271172e-01,
         1.91430658e-01,   5.62396646e-01,  -1.07690930e+00,
        -7.97812045e-01,  -8.85272324e-01,  -1.71307661e-02,
        -4.96744901e-01,   7.43289739e-02,  -8.02996099e-01,
        -3.25119644e-01,   1.37371510e-01,  -6.58412039e-01,
        -4.42930400e-01,  -6.37149692e-01,  -3.13979797e-02,
        -1.62613422e-01], dtype=float32)
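The most_similar ranking shown earlier is based on cosine similarity between vectors like the one above; the measure itself is simple to sketch on toy vectors (`cosine_similarity` is a hypothetical helper, not the gensim API):

```python
import math

def cosine_similarity(u, v):
    # Dot product of the vectors divided by the product of their lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```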

Statistical models (to top)

"Traditional" machine learning (to top)

The library to use for machine learning is scikit-learn ("sklearn").

Supervised (to top)

In [84]:
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.externals import joblib
In [85]:
import pandas as pd
import numpy as np

Convert the data into a pandas dataframe (so that it is easier to work with)

In [86]:
article_list = []
for author, value in spacy_text_clean.items():
    for article in value:
        article_list.append((author, ' '.join([x.text for x in article])))
In [87]:
article_df = pd.DataFrame(article_list, columns=['author', 'text'])
In [88]:
article_df.sample(5)
Out[88]:
author text
384 DavidLawder chrysler corp. report record earning 1996 base...
377 DavidLawder tease automotive world glimpse possible produc...
2114 SarahDavison a widespread shake hit highly competitive fund...
737 JanLopatka engineering group skoda a.s say tuesday win or...
2070 SamuelPerry intel corp. executive say late tuesday company...

Split the sample into a training and test sample

In [89]:
X_train, X_test, y_train, y_test = train_test_split(article_df.text, article_df.author, test_size=0.20, random_state=3561)
In [90]:
print(len(X_train), len(X_test))
2000 500

Train and evaluate function

Simple function to train (i.e. fit) and evaluate the model

In [91]:
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
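A single train/test split can be noisy; `cross_val_score` (imported above) averages performance over several folds and gives a more robust estimate. A self-contained sketch on a hypothetical toy corpus:

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-class toy corpus
texts = ["stocks rose sharply", "shares fell on weak profit",
         "the striker scored twice", "the match ended in a draw"] * 5
labels = ["finance", "finance", "sports", "sports"] * 5

pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])

# 5-fold cross-validated accuracy: each fold is trained on 16 texts
# and scored on the held-out 4
scores = cross_val_score(pipe, texts, labels, cv=5)
print(scores.mean(), scores.std())
```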

Naïve Bayes estimator (to top)

In [92]:
from sklearn.naive_bayes import MultinomialNB

Define pipeline

In [93]:
clf = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode',
                             lowercase = True,
                            max_features = 1500,
                            stop_words='english'
                            )),
        
    ('clf', MultinomialNB(alpha = 1,
                          fit_prior = True
                          )
    ),
])

Train and show evaluation stats

In [94]:
train_and_evaluate(clf, X_train, X_test, y_train, y_test)
Accuracy on training set:
0.8345
Accuracy on testing set:
0.714
Classification Report:
                   precision    recall  f1-score   support

    AaronPressman       0.90      1.00      0.95         9
       AlanCrosby       0.55      0.92      0.69        12
   AlexanderSmith       0.86      0.60      0.71        10
  BenjaminKangLim       0.75      0.27      0.40        11
    BernardHickey       0.75      0.30      0.43        10
      BradDorfman       0.80      1.00      0.89         8
 DarrenSchuettler       0.58      0.78      0.67         9
      DavidLawder       1.00      0.60      0.75        10
    EdnaFernandes       1.00      0.67      0.80         9
      EricAuchard       0.86      0.67      0.75         9
   FumikoFujisaki       1.00      1.00      1.00        10
   GrahamEarnshaw       0.59      1.00      0.74        10
 HeatherScoffield       0.83      0.56      0.67         9
       JanLopatka       0.33      0.33      0.33         9
    JaneMacartney       0.35      0.60      0.44        10
     JimGilchrist       0.73      1.00      0.84         8
   JoWinterbottom       0.89      0.80      0.84        10
         JoeOrtiz       0.80      0.89      0.84         9
     JohnMastrini       0.80      0.24      0.36        17
     JonathanBirt       0.47      1.00      0.64         8
      KarlPenhaul       0.87      1.00      0.93        13
        KeithWeir       0.69      0.90      0.78        10
   KevinDrawbaugh       0.88      0.70      0.78        10
    KevinMorrison       0.33      1.00      0.50         3
    KirstinRidley       0.86      0.67      0.75         9
KouroshKarimkhany       0.58      0.88      0.70         8
        LydiaZajc       0.82      0.90      0.86        10
   LynneO'Donnell       0.80      0.73      0.76        11
  LynnleyBrowning       0.93      1.00      0.96        13
  MarcelMichelson       1.00      0.50      0.67        12
     MarkBendeich       0.83      0.45      0.59        11
       MartinWolk       0.57      0.80      0.67         5
     MatthewBunce       1.00      0.86      0.92        14
    MichaelConnor       0.83      0.77      0.80        13
       MureDickie       0.44      0.40      0.42        10
        NickLouth       0.83      1.00      0.91        10
  PatriciaCommins       0.80      0.89      0.84         9
    PeterHumphrey       0.38      0.89      0.53         9
       PierreTran       0.56      0.83      0.67         6
       RobinSidel       1.00      1.00      1.00        12
     RogerFillion       1.00      1.00      1.00         8
      SamuelPerry       0.78      0.50      0.61        14
     SarahDavison       1.00      0.29      0.44        14
      ScottHillis       0.36      0.44      0.40         9
      SimonCowell       0.90      0.90      0.90        10
         TanEeLyn       0.67      0.57      0.62         7
   TheresePoletti       0.73      0.73      0.73        11
       TimFarrand       1.00      0.77      0.87        13
       ToddNissen       0.60      1.00      0.75         9
     WilliamKazer       0.00      0.00      0.00        10

      avg / total       0.76      0.71      0.70       500

Save the fitted model

In [95]:
joblib.dump(clf, 'naive_bayes_results.pkl')
Out[95]:
['naive_bayes_results.pkl']

Predict the author of an individual article (note: index 33 comes from the training sample, so strictly speaking this is an in-sample prediction):

In [96]:
example_y, example_X = y_train[33], X_train[33]
In [97]:
print('Actual author:', example_y)
print('Predicted author:', clf.predict([example_X])[0])
Actual author: AaronPressman
Predicted author: AaronPressman
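Besides the hard class prediction, `MultinomialNB` can also report a probability per class via `predict_proba`, which is useful when you care about prediction confidence. A self-contained sketch with a made-up two-author style corpus:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus
texts = ["oil prices climbed", "crude futures slipped",
         "the senate passed the bill", "lawmakers debated the budget"]
labels = ["markets", "markets", "politics", "politics"]

pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe.fit(texts, labels)

# One probability per class, ordered as in pipe.classes_; they sum to 1
probs = pipe.predict_proba(["crude oil prices"])[0]
for label, p in zip(pipe.classes_, probs):
    print(label, round(p, 3))
```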

Support Vector Machines (SVM) (to top)

In [98]:
from sklearn.svm import SVC

Define pipeline

In [99]:
clf_svm = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode',
                             lowercase = True,
                            max_features = 1500,
                            stop_words='english'
                            )),
        
    ('clf', SVC(kernel='rbf' ,
                C=10, gamma=0.3)
    ),
])

Note: The SVC estimator is very sensitive to the hyperparameters!

Train and show evaluation stats

In [100]:
train_and_evaluate(clf_svm, X_train, X_test, y_train, y_test)
Accuracy on training set:
0.9965
Accuracy on testing set:
0.83
Classification Report:
                   precision    recall  f1-score   support

    AaronPressman       0.89      0.89      0.89         9
       AlanCrosby       0.79      0.92      0.85        12
   AlexanderSmith       1.00      0.70      0.82        10
  BenjaminKangLim       0.67      0.36      0.47        11
    BernardHickey       1.00      0.50      0.67        10
      BradDorfman       0.70      0.88      0.78         8
 DarrenSchuettler       1.00      0.89      0.94         9
      DavidLawder       1.00      0.70      0.82        10
    EdnaFernandes       0.73      0.89      0.80         9
      EricAuchard       0.69      1.00      0.82         9
   FumikoFujisaki       1.00      1.00      1.00        10
   GrahamEarnshaw       0.77      1.00      0.87        10
 HeatherScoffield       0.75      1.00      0.86         9
       JanLopatka       0.43      0.33      0.38         9
    JaneMacartney       0.36      0.50      0.42        10
     JimGilchrist       0.89      1.00      0.94         8
   JoWinterbottom       1.00      0.90      0.95        10
         JoeOrtiz       0.82      1.00      0.90         9
     JohnMastrini       0.76      0.76      0.76        17
     JonathanBirt       0.80      1.00      0.89         8
      KarlPenhaul       1.00      1.00      1.00        13
        KeithWeir       0.83      1.00      0.91        10
   KevinDrawbaugh       0.90      0.90      0.90        10
    KevinMorrison       0.50      1.00      0.67         3
    KirstinRidley       1.00      0.56      0.71         9
KouroshKarimkhany       0.88      0.88      0.88         8
        LydiaZajc       1.00      1.00      1.00        10
   LynneO'Donnell       0.82      0.82      0.82        11
  LynnleyBrowning       1.00      1.00      1.00        13
  MarcelMichelson       1.00      0.75      0.86        12
     MarkBendeich       0.85      1.00      0.92        11
       MartinWolk       0.71      1.00      0.83         5
     MatthewBunce       1.00      0.86      0.92        14
    MichaelConnor       1.00      0.85      0.92        13
       MureDickie       0.60      0.60      0.60        10
        NickLouth       1.00      0.90      0.95        10
  PatriciaCommins       1.00      1.00      1.00         9
    PeterHumphrey       0.62      0.89      0.73         9
       PierreTran       0.67      1.00      0.80         6
       RobinSidel       1.00      1.00      1.00        12
     RogerFillion       1.00      1.00      1.00         8
      SamuelPerry       0.92      0.86      0.89        14
     SarahDavison       1.00      0.64      0.78        14
      ScottHillis       0.67      0.44      0.53         9
      SimonCowell       1.00      0.90      0.95        10
         TanEeLyn       0.71      0.71      0.71         7
   TheresePoletti       0.92      1.00      0.96        11
       TimFarrand       0.92      0.85      0.88        13
       ToddNissen       0.90      1.00      0.95         9
     WilliamKazer       0.20      0.20      0.20        10

      avg / total       0.85      0.83      0.83       500

Save the fitted model

In [101]:
joblib.dump(clf_svm, 'svm_results.pkl')
Out[101]:
['svm_results.pkl']

Predict the author of an individual article (note: index 33 comes from the training sample, so strictly speaking this is an in-sample prediction):

In [102]:
example_y, example_X = y_train[33], X_train[33]
In [103]:
print('Actual author:', example_y)
print('Predicted author:', clf_svm.predict([example_X])[0])
Actual author: AaronPressman
Predicted author: AaronPressman

Model Selection and Evaluation (to top)

Both the TfidfVectorizer and the SVC() estimator take many hyperparameters.

It can be difficult to figure out which combination of parameters works best.

We can use GridSearchCV to search over a grid of candidate values.

In [104]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score

First we define the options that should be tried out:

In [105]:
clf_search = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SVC())
])
parameters = { 'vect__stop_words': ['english'],
                'vect__strip_accents': ['unicode'],
              'vect__max_features' : [1500],
              'vect__ngram_range': [(1,1), (2,2) ],
             'clf__gamma' : [0.2, 0.3, 0.4], 
             'clf__C' : [8, 10, 12],
              'clf__kernel' : ['rbf']
             }
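Before running a grid search it is worth checking how many candidates the grid implies, since GridSearchCV fits every combination once per cross-validation fold. scikit-learn's `ParameterGrid` enumerates the combinations (the dictionary below repeats the one defined above):

```python
from sklearn.model_selection import ParameterGrid

parameters = {'vect__stop_words': ['english'],
              'vect__strip_accents': ['unicode'],
              'vect__max_features': [1500],
              'vect__ngram_range': [(1, 1), (2, 2)],
              'clf__gamma': [0.2, 0.3, 0.4],
              'clf__C': [8, 10, 12],
              'clf__kernel': ['rbf']}

# every combination GridSearchCV will try: 2 * 3 * 3 = 18 candidates,
# each fitted once per cross-validation fold
n_candidates = len(list(ParameterGrid(parameters)))
print(n_candidates)  # 18
```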

Run everything:

In [106]:
grid = GridSearchCV(clf_search, param_grid=parameters, scoring=make_scorer(f1_score, average='micro'), n_jobs=1)
grid.fit(X_train, y_train)    
Out[106]:
GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__stop_words': ['english'], 'vect__strip_accents': ['unicode'], 'vect__max_features': [1500], 'vect__ngram_range': [(1, 1), (2, 2)], 'clf__gamma': [0.2, 0.3, 0.4], 'clf__C': [8, 10, 12], 'clf__kernel': ['rbf']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(f1_score, average=micro), verbose=0)

Note: on a multi-core system you can set n_jobs to the number of available cores (or -1 to use all of them) to speed up the calculation

In [107]:
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
y_true, y_pred = y_test, grid.predict(X_test)
print(metrics.classification_report(y_true, y_pred))
The best parameters are {'clf__C': 12, 'clf__gamma': 0.4, 'clf__kernel': 'rbf', 'vect__max_features': 1500, 'vect__ngram_range': (1, 1), 'vect__stop_words': 'english', 'vect__strip_accents': 'unicode'} with a score of 0.77
                   precision    recall  f1-score   support

    AaronPressman       0.89      0.89      0.89         9
       AlanCrosby       0.79      0.92      0.85        12
   AlexanderSmith       1.00      0.70      0.82        10
  BenjaminKangLim       0.67      0.36      0.47        11
    BernardHickey       1.00      0.50      0.67        10
      BradDorfman       0.70      0.88      0.78         8
 DarrenSchuettler       1.00      0.89      0.94         9
      DavidLawder       1.00      0.70      0.82        10
    EdnaFernandes       0.80      0.89      0.84         9
      EricAuchard       0.75      1.00      0.86         9
   FumikoFujisaki       1.00      1.00      1.00        10
   GrahamEarnshaw       0.77      1.00      0.87        10
 HeatherScoffield       0.75      1.00      0.86         9
       JanLopatka       0.43      0.33      0.38         9
    JaneMacartney       0.36      0.50      0.42        10
     JimGilchrist       0.89      1.00      0.94         8
   JoWinterbottom       1.00      0.90      0.95        10
         JoeOrtiz       0.75      1.00      0.86         9
     JohnMastrini       0.76      0.76      0.76        17
     JonathanBirt       0.80      1.00      0.89         8
      KarlPenhaul       1.00      1.00      1.00        13
        KeithWeir       0.83      1.00      0.91        10
   KevinDrawbaugh       0.90      0.90      0.90        10
    KevinMorrison       0.50      1.00      0.67         3
    KirstinRidley       1.00      0.56      0.71         9
KouroshKarimkhany       0.88      0.88      0.88         8
        LydiaZajc       1.00      1.00      1.00        10
   LynneO'Donnell       0.82      0.82      0.82        11
  LynnleyBrowning       1.00      1.00      1.00        13
  MarcelMichelson       1.00      0.75      0.86        12
     MarkBendeich       0.85      1.00      0.92        11
       MartinWolk       0.62      1.00      0.77         5
     MatthewBunce       1.00      0.86      0.92        14
    MichaelConnor       1.00      0.85      0.92        13
       MureDickie       0.60      0.60      0.60        10
        NickLouth       1.00      0.90      0.95        10
  PatriciaCommins       1.00      1.00      1.00         9
    PeterHumphrey       0.57      0.89      0.70         9
       PierreTran       0.67      1.00      0.80         6
       RobinSidel       1.00      1.00      1.00        12
     RogerFillion       1.00      1.00      1.00         8
      SamuelPerry       0.92      0.86      0.89        14
     SarahDavison       1.00      0.64      0.78        14
      ScottHillis       0.67      0.44      0.53         9
      SimonCowell       1.00      0.90      0.95        10
         TanEeLyn       0.67      0.57      0.62         7
   TheresePoletti       0.92      1.00      0.96        11
       TimFarrand       0.92      0.85      0.88        13
       ToddNissen       0.90      1.00      0.95         9
     WilliamKazer       0.20      0.20      0.20        10

      avg / total       0.85      0.83      0.83       500

Unsupervised (to top)

Latent Dirichlet Allocation (LDA) (to top)

In [108]:
from sklearn.decomposition import LatentDirichletAllocation

Vectorizer (using CountVectorizer for the sake of example, as LDA expects raw term counts rather than TF-IDF weights)

In [109]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(strip_accents='unicode',
                             lowercase=True,
                             max_features=1500,
                             stop_words='english', max_df=0.8)
tf_large = vectorizer.fit_transform(clean_paragraphs)

Run the LDA model

In [110]:
n_topics = 10
n_top_words = 25
In [111]:
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=10,
                                learning_method='online',
                                n_jobs=1)
lda_fitted = lda.fit_transform(tf_large)
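`fit_transform` returns the document-topic distribution: one row per document, one normalized weight per topic. A self-contained sketch with a hypothetical four-document toy corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: two finance-ish and two China-related documents
docs = ["bank market stock share", "share price market profit",
        "china beijing chinese official", "hong kong china trade"]

tf = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, max_iter=10,
                                learning_method='online', random_state=0)
doc_topics = lda.fit_transform(tf)

# one row per document, one column per topic; each row sums to 1
print(doc_topics.shape)  # (4, 2)
```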

Visualize top words

In [112]:
def save_top_words(model, feature_names, n_top_words):
    out_list = []
    for topic_idx, topic in enumerate(model.components_):
        out_list.append((topic_idx+1, " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])))
    out_df = pd.DataFrame(out_list, columns=['topic_id', 'top_words'])
    return out_df
In [113]:
result_df = save_top_words(lda, vectorizer.get_feature_names(), n_top_words)  # note: renamed to get_feature_names_out() in scikit-learn >= 1.0
In [114]:
result_df
Out[114]:
topic_id top_words
0 1 company service new corp computer internet net...
1 2 percent million analyst share profit quarter s...
2 3 bank financial market company banking business...
3 4 tonne 000 ford new production plant car gm exp...
4 5 company pound million share group percent bill...
5 6 thomson nomura conrail csf north south korean ...
6 7 percent market price czech oil state billion t...
7 8 government state boeing plan tobacco union new...
8 9 stock gold bre company share canada toronto ex...
9 10 china kong hong chinese beijing people deng of...

pyLDAvis (to top)

In [116]:
%matplotlib inline
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
C:\Users\kokti\Anaconda3\lib\site-packages\sklearn\manifold\t_sne.py:420: DeprecationWarning: invalid escape sequence \s
  """
In [117]:
pyLDAvis.sklearn.prepare(lda, tf_large, vectorizer, n_jobs=1)
C:\Users\kokti\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py:387: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]
Out[117]:

Neural Networks (to top)

Interested? Check out the Stanford course CS224n (Syllabus)!