Author: Ties de Kok (Personal Website)
Last updated: June 2020
Python version: Python 3.7
License: MIT License
Note: Some features (like the ToC) will only work if you run the notebook or if you use nbviewer by clicking this link:
https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb
This notebook contains code examples to get you started with Python for Natural Language Processing (NLP) / Text Mining.
In the grand scheme of things there are roughly 4 steps:
This notebook only discusses steps 3 and 4. If you want to learn more about step 2, see my Python tutorial.
This notebook was designed to accompany a PhD course session on NLP techniques in Accounting Research.
The slides for this session are publicly available here: Slides
There are many tools available for NLP purposes.
The code examples below are based on what I personally like to use; they are not intended to be a comprehensive overview.
Besides built-in Python functionality I will use / demonstrate the following packages:
Standard NLP libraries:
Spacy
NLTK
and the higher-level wrapper TextBlob
Note: besides installing the above packages you also often have to download (model) data. Make sure to check the documentation! A short sketch of the typical downloads follows below.
Standard machine learning library:
scikit learn
Specific task libraries:
There are many, just a couple of examples:
pyLDAvis (for visualizing LDA)
langdetect (for detecting languages)
fuzzywuzzy (for fuzzy text matching)
Gensim (for topic modelling)
There are many example datasets available to play around with, see for example this great repository:
https://archive.ics.uci.edu/ml/datasets.php
The data that I will use for most of the examples is the "Reuter_50_50 Data Set" that is used for author identification experiments.
See the details here: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50
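As noted above, most of these packages need extra (model) data that is downloaded separately from the package itself. As a rough sketch (the exact model and corpus names depend on what you plan to use, so treat these purely as examples):
# Run once from the command line (or prefix with ! in a notebook cell) to fetch a spaCy model:
# python -m spacy download en_core_web_md
# NLTK corpora / tokenizer data can be fetched from within Python:
import nltk
nltk.download('punkt')
nltk.download('brown')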
Can't follow what I am doing here? Please see my Python tutorial (although the zipfile and io operations are not very relevant).
import requests, zipfile, io, os
from tqdm.notebook import tqdm
Note: for tqdm to work in JupyterLab you need to install the @jupyter-widgets/jupyterlab-manager extension using the puzzle icon in the left sidebar.
Download and extract the zip file with the data
if not os.path.exists('C50test'):
    r = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip")
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()
Load the data into memory
folder_dict = {'test' : 'C50test'}
text_dict = {'test' : {}}
for label, folder in tqdm(folder_dict.items()):
    authors = os.listdir(folder)
    for author in authors:
        text_files = os.listdir(os.path.join(folder, author))
        for file in text_files:
            with open(os.path.join(folder, author, file), 'r') as text_file:
                text_dict[label].setdefault(author, []).append(' '.join(text_file.readlines()))
Note: the text comes pre-split per sentence; for the sake of example I undo this through ' '.join(text_file.readlines())
text_dict['test']['TimFarrand'][0]
We can use the text directly, but if we want to use packages like spacy and textblob we first have to convert the text into a corresponding object.
Note: depending on the way that you installed the language models you will need to import them differently:
from spacy.en import English
nlp = English()
OR
import en_core_web_sm
nlp = en_core_web_sm.load()
import en_core_web_md
nlp = en_core_web_md.load()
import en_core_web_lg
nlp = en_core_web_lg.load()
import spacy
import en_core_web_md
nlp = en_core_web_md.load()
Convert all text in the "test" sample to a spacy doc object using nlp.pipe():
spacy_text = {}
for author, text_list in tqdm(text_dict['test'].items()):
    spacy_text[author] = list(nlp.pipe(text_list))
A note on speed: This is slow because we didn't disable any components; see this note from the documentation:
Only apply the pipeline components you need. Getting predictions from the model that you don’t actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable components you don’t need – either when loading a model, or during processing with nlp.pipe. See the section on disabling pipeline components for more details and examples. link
type(spacy_text['TimFarrand'][0])
import nltk
We can apply basic nltk operations directly to the text so we don't need to convert first.
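For example, a quick sketch of tokenizing one of the raw texts with nltk directly (this assumes the punkt tokenizer data has been downloaded, e.g. via nltk.download('punkt')):
# Split a raw string into sentences, and the first sentence into word tokens
raw_text = text_dict['test']['TimFarrand'][0]
nltk_sentences = nltk.sent_tokenize(raw_text)
nltk_tokens = nltk.word_tokenize(nltk_sentences[0])
nltk_tokens[:10]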
from textblob import TextBlob
Convert all text in the "test" sample to a TextBlob object using TextBlob():
textblob_text = {}
for author, text_list in text_dict['test'].items():
    textblob_text[author] = [TextBlob(text) for text in text_list]
type(textblob_text['TimFarrand'][0])
Text normalization describes the task of transforming the text into a different (more comparable) form.
This can imply many things; I will show a couple of options below:
You will often notice that there are characters that you don't want in your text.
Let's look at this sentence for example:
"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"
You notice that there are some \ and \n in there. These are used to define how a string should be displayed; if we print this text we get:
text_dict['test']['TimFarrand'][0][:298]
print(text_dict['test']['TimFarrand'][0][:298])
These special characters can cause problems in our analyses (and can be hard to debug if you are using print
statements to inspect the data).
So how do we remove them?
In many cases it is sufficient to simply use the .replace()
function:
text_dict['test']['TimFarrand'][0][:298].replace('\n', '').replace('\\', '')
Sometimes, however, the problem arises because of encoding / decoding issues.
In those cases you can usually do something like:
problem_sentence = 'This is some \u03c0 text that has to be cleaned\u2026! it\u0027s difficult to deal with!'
print(problem_sentence)
print(problem_sentence.encode().decode('unicode_escape').encode('ascii','ignore'))
An alternative that is better at preserving the information in unicode characters (by transliterating them to their closest ASCII representation) is unidecode
import unidecode
print('\u738b\u7389')
unidecode.unidecode(u"\u738b\u7389")
unidecode.unidecode(problem_sentence)
Sentence segmentation refers to the task of splitting up the text by sentence.
You could do this by splitting on the .
symbol, but dots are used in many other cases as well so it is not very robust:
text_dict['test']['TimFarrand'][0][:550].split('.')
It is better to use a more sophisticated implementation such as the one by Spacy:
example_paragraph = spacy_text['TimFarrand'][0]
sentence_list = [s for s in example_paragraph.sents]
sentence_list[:5]
Notice that the returned object is still a spacy
object:
type(sentence_list[0])
Note: spacy
sentence segmentation relies on the text being capitalized, so make sure you didn't convert it to all lower case before running this operation.
Apply to all texts (for use later on):
spacy_sentences = {}
for author, text_list in tqdm(spacy_text.items()):
    spacy_sentences[author] = [list(text.sents) for text in text_list]
spacy_sentences['TimFarrand'][0][:3]
Word tokenization means to split the sentence (or text) up into words.
example_sentence = spacy_sentences['TimFarrand'][0][0]
example_sentence
A word is called a token in this context (hence tokenization). Using spacy:
token_list = [token for token in example_sentence]
token_list[0:15]
In some cases you want to convert a word (i.e. token) into a more general representation.
For example: convert "car", "cars", "car's", "cars'" all into the word car.
This is generally done through lemmatization / stemming (different approaches trying to achieve a similar goal).
Spacy
Spacy offers built-in functionality for lemmatization:
lemmatized = [token.lemma_ for token in example_sentence]
lemmatized[0:15]
NLTK
Using the NLTK library we can also use the more aggressive Porter Stemmer:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token.text) for token in example_sentence]
stemmed[0:15]
Compare:
print(' Original | Spacy Lemma | NLTK Stemmer')
print('-' * 41)
for original, lemma, stem in zip(token_list[:15], lemmatized[:15], stemmed[:15]):
    print(str(original).rjust(10, ' '), ' | ', str(lemma).rjust(10, ' '), ' | ', str(stem).rjust(10, ' '))
In my experience it is usually best to use lemmatization instead of a stemmer.
Text is inherently structured in complex ways; we can often use some of this underlying structure.
Part of speech tagging refers to the identification of words as nouns, verbs, adjectives, etc.
Using Spacy:
pos_list = [(token, token.pos_) for token in example_sentence]
pos_list[0:10]
Obviously a sentence is not a random collection of words; the sequence of words has information value.
A simple way to incorporate some of this sequence is by using what are called n-grams.
An n-gram is nothing more than a combination of N words into one token (a uni-gram token is just one word).
So we can convert "Sentence about flying cars"
into a list of bigrams:
Sentence-about, about-flying, flying-cars
See my slide on N-Grams for a more comprehensive example: click here
Using NLTK:
bigram_list = ['-'.join(x) for x in nltk.bigrams([token.text for token in example_sentence])]
bigram_list[10:15]
Using spacy:
def tokenize_without_punctuation(sen_obj):
    return [token.text for token in sen_obj if token.is_alpha]

def create_ngram(sen_obj, n, sep='-'):
    token_list = tokenize_without_punctuation(sen_obj)
    number_of_tokens = len(token_list)
    ngram_list = []
    # loop over every valid n-gram starting position (this also works correctly for n = 1)
    for i in range(number_of_tokens - n + 1):
        ngram_item = [token_list[i + ii] for ii in range(n)]
        ngram_list.append(sep.join(ngram_item))
    return ngram_list
create_ngram(example_sentence, 2)[:5]
create_ngram(example_sentence, 3)[:5]
Depending on what you are trying to do it is possible that there are many words that don't add any information value to the sentence.
The primary example is stop words.
Sometimes you can improve the accuracy of your model by removing stop words.
Using Spacy:
no_stop_words = [token for token in example_sentence if not token.is_stop]
no_stop_words[:10]
token_list[:10]
Note we can also remove punctuation in the same way:
[token for token in example_sentence if not token.is_stop and token.is_alpha][:10]
Basic SpaCy text processing function (see also the textacy package):
def process_text_custom(text):
    sentences = list(nlp(text, disable=['tagger', 'ner', 'entity_linker', 'textcat', 'entity_ruler']).sents)
    lemmatized_sentences = []
    for sentence in sentences:
        lemmatized_sentences.append([token.lemma_ for token in sentence if not token.is_stop and token.is_alpha])
    return [' '.join(sentence) for sentence in lemmatized_sentences]
spacy_text_clean = {}
for author, text_list in tqdm(text_dict['test'].items()):
    lst = []
    for text in text_list:
        lst.append(process_text_custom(text))
    spacy_text_clean[author] = lst
Note: this would take quite a long time if we didn't disable some of the components.
count = 0
for author, texts in spacy_text_clean.items():
    for text in texts:
        count += len(text)
print('Number of sentences:', count)
Result
spacy_text_clean['TimFarrand'][0][:3]
Note: the quality of the input text is not great, so the sentence segmentation is also not great (without further tweaking).
We now have pre-processed our text into something that we can use for direct feature extraction or to convert it to a numerical representation.
It is often useful / relevant to extract entities that are mentioned in a piece of text.
SpaCy is quite powerful at extracting entities; however, it doesn't work very well on lowercase text.
Given that token.lemma_ removes capitalization, I will use spacy_sentences for this example.
example_sentence = spacy_sentences['TimFarrand'][0][3]
example_sentence
[(i, i.label_) for i in nlp(example_sentence.text).ents]
example_sentence = spacy_sentences['TimFarrand'][4][0]
example_sentence
[(i, i.label_) for i in nlp(example_sentence.text).ents]
Using the built-in re (regular expression) library you can pattern match nearly anything you want.
I will not go into details about regular expressions but see here for a tutorial:
https://regexone.com/references/python
import re
TIP: Use Pythex.org to try out your regular expression
Example on Pythex: click here
Example 1:
string_1 = 'Ties de Kok (#IDNUMBER: 123-AZ). Rest of text...'
string_2 = 'Philip Joos (#IDNUMBER: 663-BY). Rest of text...'
pattern = r'#IDNUMBER: (\d\d\d-\w\w)'
print(re.findall(pattern, string_1)[0])
print(re.findall(pattern, string_2)[0])
If a sentence contains the word 'million', print the sentence:
TERM = 'million'
for sen in spacy_text_clean['TimFarrand'][2]:
    if re.search(TERM, sen, flags=re.IGNORECASE):
        print(sen)
Besides feature search there are also many ways to analyze the text as a whole.
Let's, for example, evaluate the following paragraph:
example_paragraph = ' '.join([x for x in spacy_text_clean['TimFarrand'][2]])
example_paragraph[:500]
Using the spacy-langdetect package it is easy to detect the language of a piece of text:
from spacy_langdetect import LanguageDetector
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
print(nlp(example_paragraph)._.language)
Generally I'd recommend calculating readability metrics yourself as they don't tend to be that difficult to compute. However, there are packages out there that can help, such as spacy_readability:
from spacy_readability import Readability
nlp.add_pipe(Readability(), name='readability', last=True)
doc = nlp("I am some really difficult text to read because I use obnoxiously large words.")
print(doc._.flesch_kincaid_grade_level)
print(doc._.smog)
Manual example: FOG index
import syllapy
def calculate_fog(document):
    doc = nlp(document, disable=['tagger', 'ner', 'entity_linker', 'textcat', 'entity_ruler'])
    sen_list = list(doc.sents)
    num_sen = len(sen_list)
    num_words = 0
    num_complex_words = 0
    for sen_obj in sen_list:
        words_in_sen = [token.text for token in sen_obj if token.is_alpha]
        num_words += len(words_in_sen)
        num_complex = 0
        for word in words_in_sen:
            num_syl = syllapy.count(word.lower())
            if num_syl > 2:
                num_complex += 1
        num_complex_words += num_complex
    fog = 0.4 * ((num_words / num_sen) + ((num_complex_words / num_words) * 100))
    return {'fog' : fog,
            'num_sen' : num_sen,
            'num_words' : num_words,
            'num_complex_words' : num_complex_words}
calculate_fog(example_paragraph)
fuzzywuzzy
from fuzzywuzzy import fuzz
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
spacy
Spacy can provide a similarity score based on semantic similarity (link)
tokens_1 = nlp("fuzzy wuzzy was a bear")
tokens_2 = nlp("wuzzy fuzzy was a bear")
tokens_1.similarity(tokens_2)
tokens_1 = nlp("Tom believes German cars are the best.")
tokens_2 = nlp("Sarah recently mentioned that she would like to go on holiday to Germany.")
tokens_1.similarity(tokens_2)
A common technique for basic NLP insights is to create simple metrics based on term counts.
These are relatively easy to implement.
word_dictionary = ['soft', 'first', 'most', 'be']
for word in word_dictionary:
    print(word, example_paragraph.count(word))
pos = ['great', 'agree', 'increase']
neg = ['bad', 'disagree', 'decrease']
sentence = '''According to the president everything is great, great,
and great even though some people might disagree with those statements.'''
pos_count = 0
for word in pos:
    pos_count += sentence.lower().count(word)
print(pos_count)
neg_count = 0
for word in neg:
    neg_count += sentence.lower().count(word)
print(neg_count)
pos_count / (neg_count + pos_count)
Getting the total number of words is also easy:
num_tokens = len([token for token in nlp(sentence) if token.is_alpha])
num_tokens
We can also save the count per word
pos_count_dict = {}
for word in pos:
    pos_count_dict[word] = sentence.lower().count(word)
pos_count_dict
Note: .lower() is actually quite slow; if you have a lot of words / sentences it is recommended to minimize the number of .lower() operations that you have to make.
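For example, a small sketch that lowercases the sentence once and reuses the result, instead of calling .lower() inside the loop:
# Lowercase once, reuse the result for every dictionary word
sentence_lower = sentence.lower()
pos_count_dict = {word: sentence_lower.count(word) for word in pos}
pos_count_dict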
Sklearn includes the CountVectorizer and TfidfVectorizer functions.
For details, see the documentation:
TF
TFIDF
Note 1: these functions also provide a variety of built-in preprocessing options (e.g. n-grams, stop word removal, accent stripping); a short example follows after the TF-IDF output below.
Note 2: example based on the following website click here
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
doc_1 = "The sky is blue."
doc_2 = "The sun is bright today."
doc_3 = "The sun in the sky is bright."
doc_4 = "We can see the shining sun, the bright sun."
Calculate term frequency:
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform([doc_1, doc_2, doc_3, doc_4])
print(vectorizer.get_feature_names(), '\n')
for doc_tf_vector in tf.toarray():
    print(doc_tf_vector)
transformer = TfidfVectorizer(stop_words='english')
tfidf = transformer.fit_transform([doc_1, doc_2, doc_3, doc_4])
for doc_vector in tfidf.toarray():
    print(doc_vector)
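As mentioned in Note 1 above, these vectorizers also come with more built-in preprocessing options. A minimal sketch using ngram_range to count uni-grams and bi-grams in one pass (the settings are purely for illustration):
# Same example documents, but now the vocabulary also contains bi-grams
ngram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
ngram_tf = ngram_vectorizer.fit_transform([doc_1, doc_2, doc_3, doc_4])
print(ngram_vectorizer.get_feature_names())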
clean_paragraphs = []
for author, value in spacy_text_clean.items():
    for article in value:
        clean_paragraphs.append(' '.join([x for x in article]))
len(clean_paragraphs)
transformer = TfidfVectorizer(stop_words='english')
tfidf_large = transformer.fit_transform(clean_paragraphs)
print('Number of vectors:', len(tfidf_large.toarray()))
print('Number of words in dictionary:', len(tfidf_large.toarray()[0]))
tfidf_large
The en_core_web_lg
language model comes with GloVe vectors trained on the Common Crawl dataset (link)
tokens = nlp("The Dutch word for peanut butter is 'pindakaas', did you know that? This is a typpo.")
for token in tokens:
    if token.is_alpha:
        print(token.text, token.has_vector, token.vector_norm, token.is_oov)
token = nlp('Car')
print('The token: "{}" has the following vector (dimension: {})'.format(token.text, len(token.vector)))
token.vector
Simple example below is from: https://medium.com/@mishra.thedeepak/word2vec-in-minutes-gensim-nlp-python-6940f4e00980
Note: you might have to run nltk.download('brown')
to install the NLTK corpus files
import gensim
from nltk.corpus import brown
sentences = brown.sents()
model = gensim.models.Word2Vec(sentences, min_count=1)
Save model
model.save('brown_model')
Load model
model = gensim.models.Word2Vec.load('brown_model')
Find words most similar to 'mother':
print(model.wv.most_similar("mother"))
Find the odd one out:
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))
print(model.wv.doesnt_match("pizza pasta garden fries".split()))
Retrieve vector representation of the word "human"
model.wv['human']
The library to use for machine learning is scikit-learn ("sklearn").
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import joblib
import pandas as pd
import numpy as np
article_list = []
for author, value in spacy_text_clean.items():
    for article in value:
        article_list.append((author, ' '.join([x for x in article])))
article_df = pd.DataFrame(article_list, columns=['author', 'text'])
article_df.sample(5)
X_train, X_test, y_train, y_test = train_test_split(article_df.text, article_df.author, test_size=0.20, random_state=3561)
print(len(X_train), len(X_test))
Simple function to train (i.e. fit) and evaluate the model
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    y_pred = clf.predict(X_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
from sklearn.naive_bayes import MultinomialNB
Define pipeline
clf = Pipeline([
('vect', TfidfVectorizer(strip_accents='unicode',
lowercase = True,
max_features = 1500,
stop_words='english'
)),
('clf', MultinomialNB(alpha = 1,
fit_prior = True
)
),
])
Train and show evaluation stats
train_and_evaluate(clf, X_train, X_test, y_train, y_test)
Save results
joblib.dump(clf, 'naive_bayes_results.pkl')
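To reuse the saved model later (e.g. in another notebook or script) you can load it back with joblib; a minimal sketch:
# Load the previously saved pipeline from disk and verify it still works
clf_loaded = joblib.load('naive_bayes_results.pkl')
clf_loaded.score(X_test, y_test)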
Predict the author of an individual article:
example_y, example_X = y_train[33], X_train[33]
print('Actual author:', example_y)
print('Predicted author:', clf.predict([example_X])[0])
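To predict for articles that the model did not see during training, you can draw from the held-out test set instead; a small sketch:
# Predict the author of the first article in the held-out test set
print('Actual author:', y_test.iloc[0])
print('Predicted author:', clf.predict([X_test.iloc[0]])[0])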
from sklearn.svm import SVC
Define pipeline
clf_svm = Pipeline([
('vect', TfidfVectorizer(strip_accents='unicode',
lowercase = True,
max_features = 1500,
stop_words='english'
)),
('clf', SVC(kernel='rbf' ,
C=10, gamma=0.3)
),
])
Note: The SVC estimator is very sensitive to the hyperparameters!
Train and show evaluation stats
train_and_evaluate(clf_svm, X_train, X_test, y_train, y_test)
Save results
joblib.dump(clf_svm, 'svm_results.pkl')
Predict the author of an individual article:
example_y, example_X = y_train[33], X_train[33]
print('Actual author:', example_y)
print('Predicted author:', clf_svm.predict([example_X])[0])
Both the TfidfVectorizer and SVC() estimators take a lot of hyperparameters.
It can be difficult to figure out what the best parameters are.
We can use GridSearchCV to help figure this out.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
First we define the options that should be tried out:
clf_search = Pipeline([
('vect', TfidfVectorizer()),
('clf', SVC())
])
parameters = { 'vect__stop_words': ['english'],
'vect__strip_accents': ['unicode'],
'vect__max_features' : [1500],
'vect__ngram_range': [(1,1), (2,2) ],
'clf__gamma' : [0.2, 0.3, 0.4],
'clf__C' : [8, 10, 12],
'clf__kernel' : ['rbf']
}
Run everything:
grid = GridSearchCV(clf_search,
param_grid=parameters,
scoring=make_scorer(f1_score, average='micro'),
n_jobs=-1
)
grid.fit(X_train, y_train)
Note: if you are on a powerful machine (preferably a Unix system) you can set n_jobs to the number of available threads to speed up the calculation.
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
y_true, y_pred = y_test, grid.predict(X_test)
print(metrics.classification_report(y_true, y_pred))
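If you want to inspect how every parameter combination performed (not just the best one), the fitted grid object exposes cv_results_, which is convenient to view as a DataFrame; a quick sketch:
# Mean cross-validated score per parameter combination, best first
cv_results_df = pd.DataFrame(grid.cv_results_)
cv_results_df[['params', 'mean_test_score', 'std_test_score']].sort_values('mean_test_score', ascending=False).head()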
from sklearn.decomposition import LatentDirichletAllocation
Vectorizer (using CountVectorizer for the sake of example)
vectorizer = CountVectorizer(strip_accents='unicode',
lowercase = True,
max_features = 1500,
stop_words='english', max_df=0.8)
tf_large = vectorizer.fit_transform(clean_paragraphs)
Run the LDA model
n_topics = 10
n_top_words = 25
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=10,
learning_method='online',
n_jobs=-1)
lda_fitted = lda.fit_transform(tf_large)
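The lda_fitted object returned by fit_transform is a document-topic matrix (one row per document, one column per topic), so you can for example look up the dominant topic of each document; a small sketch:
# argmax over the topic dimension gives the most likely topic per document
# (+1 so the numbering matches the topic_id column created below)
dominant_topics = np.argmax(lda_fitted, axis=1) + 1
print(dominant_topics[:10])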
Visualize top words
def save_top_words(model, feature_names, n_top_words):
    out_list = []
    for topic_idx, topic in enumerate(model.components_):
        out_list.append((topic_idx + 1, " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])))
    out_df = pd.DataFrame(out_list, columns=['topic_id', 'top_words'])
    return out_df
result_df = save_top_words(lda, vectorizer.get_feature_names(), n_top_words)
result_df
%matplotlib inline
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, tf_large, vectorizer, n_jobs=-1)
Warning: there is a small bug where showing the pyLDAvis visualization hides some of the JupyterLab icons.
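One way to work around this is to write the visualization to a standalone HTML file and open it in a separate browser tab; a minimal sketch:
# Prepare the visualization and save it to an HTML file instead of rendering it inline
panel = pyLDAvis.sklearn.prepare(lda, tf_large, vectorizer, n_jobs=-1)
pyLDAvis.save_html(panel, 'lda_visualization.html')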
Interested? Check out the Stanford course CS224n (Page)!