This notebook shows how to analyse text corpora to identify topic trends over time.
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import pickle
import re
import os
from pathlib import Path
import requests
from collections import Counter
import matplotlib.pyplot as plt
from numpy import mean, ones
from scipy.sparse import csr_matrix
from nltk.corpus import stopwords
Two CountVectorizer parameters are worth noting:
- max_df: when building the vocabulary, ignore terms with a document frequency strictly higher than the given threshold.
- ngram_range: (1, 2) includes n-grams of 1 and 2 words; (2, 2) includes only n-grams of 2 words.
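A minimal sketch of how these two parameters change the vocabulary, using a made-up three-document corpus (the default token pattern already drops one-letter words such as 'a'):

from sklearn.feature_extraction.text import CountVectorizer

toy = ['this is a cat', 'this is a dog', 'that is a bird']
# 'is' occurs in every document (df = 1.0 > 0.9), so max_df=0.9 removes it
v1 = CountVectorizer(ngram_range=(1, 2), max_df=0.9)
v1.fit(toy)
print(v1.get_feature_names_out())
# only 2-word n-grams
v2 = CountVectorizer(ngram_range=(2, 2))
v2.fit(toy)
print(v2.get_feature_names_out())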
By default, each row of the matrix corresponds to a document and each column to an n-gram:
| | and | and this | document | document is | more terms... |
|---|---|---|---|---|---|
| doc0 | 0 | 0 | 1 | ... | ... |
| doc1 | 0 | 1 | 0 | ... | ... |
| doc2 | 1 | 1 | 0 | ... | ... |
Transposing the matrix turns each row into the frequency of one n-gram across all the documents:
| | doc0 | doc1 | doc2 |
|---|---|---|---|
| and | 0 | 0 | 1 |
| and this | 0 | 1 | 1 |
| document | 1 | 0 | 0 |
| more terms... | | | |
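To see both layouts concretely, the matrices can be wrapped in pandas DataFrames; a minimal sketch with a two-document toy corpus (the demo_ names are ours):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

demo_docs = ['this is the first document', 'and this is the second document']
demo_v = CountVectorizer(analyzer='word', ngram_range=(1, 2))
demo_X = demo_v.fit_transform(demo_docs)
# default layout: one row per document, one column per n-gram
print(pd.DataFrame(demo_X.toarray(), columns=demo_v.get_feature_names_out(), index=['doc0', 'doc1']))
# transposed layout: one row per n-gram, one column per document
print(pd.DataFrame(demo_X.T.toarray(), index=demo_v.get_feature_names_out(), columns=['doc0', 'doc1']))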
The corpus below is adapted from the scikit-learn documentation, extended with a year label for each document:
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
'Is this the second document?',
'A third document is useful for testing purposes',
'Is this the third document?',
]
year = [2000,2001,2002,2002,2002,2002,2000]
v = CountVectorizer(analyzer='word', ngram_range=(1, 2))
Once the CountVectorizer object is defined, the fit_transform method learns the vocabulary dictionary and returns the document-term matrix.
X = v.fit_transform(corpus)
The method get_feature_names_out (named get_feature_names in scikit-learn versions before 1.0) returns the feature names as an array mapping feature integer indices to feature names.
terms = v.get_feature_names_out()
terms
print(X.toarray())
matrix = X.transpose()  # rows are now terms, columns are documents
print(matrix.toarray())
doc_frequencies = matrix.getnnz(axis=1)  # number of documents containing each term
print(doc_frequencies)
frequencies = matrix.sum(axis=1).A1  # total frequency of each term in the corpus
frequencies
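As a sanity check, a quick sketch that recounts the unigrams by hand with the already-imported Counter (the regex only approximates CountVectorizer's default token pattern) and compares the totals for a few terms:

manual = Counter()
for doc in corpus:
    manual.update(re.findall(r'\b\w\w+\b', doc.lower()))
for t in ['document', 'third', 'this']:
    print(t, manual[t], frequencies[list(terms).index(t)])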
Terms that occur only once in the corpus (hapax legomena) say little about trends, so we remove them in order to target our efforts in the most effective way. First, we define a class to store the terms.
class MPHash(object):
    """Minimal perfect hash: bidirectional mapping between terms and integer codes."""
    # create from iterable
    def __init__(self, terms):
        self.term = list(terms)
        self.code = {t: n for n, t in enumerate(self.term)}
    def __len__(self):
        return len(self.term)
    def get_code(self, term):
        # term -> integer code (None if unknown)
        return self.code.get(term)
    def get_term(self, code):
        # integer code -> term
        return self.term[code]
selected = [m for m, f in enumerate(frequencies) if f > 1]
hapax_rate = 1 - len(selected) / len(frequencies)
print('Removing hapax legomena ({:.1f}%)'.format(100 * hapax_rate))
matrix = matrix[selected, :]
term_codes = MPHash([terms[m] for m in selected])
term_codes.get_code("document")
term_codes.get_term(0)
term_codes.get_term(1)
term_codes.get_code("document is")
To recover the most frequent capitalisation of each term, we refit the vectorizer with lowercasing disabled so the tokens keep their original case:
v.lowercase = False
matrix2 = v.fit_transform(corpus).transpose()
terms2 = v.get_feature_names_out()
frequencies2 = matrix2.sum(axis=1).A1
forms = dict()
# for each lowercase form, keep the surface form with the highest frequency
for t, f in zip(terms2, frequencies2):
    low = t.lower()
    if forms.get(low, (None, 0))[1] < f:
        forms[low] = (t, f)
capitals = {k: tf[0] for k, tf in forms.items()}
capitals
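A quick usage sketch: look up the dominant surface form of a few lowercase terms (the term list here is arbitrary; .get falls back to the lowercase form for unknown terms):

for t in ['document', 'this', 'and']:
    print(t, '->', capitals.get(t, t))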
We now choose a period of time, described by a pair of years, and identify the documents that fall within it.
The built-in enumerate() function adds a counter to an iterable and returns an enumerate object, which can be used directly in for loops or converted into a list of tuples with list().
enumerate(year) pairs each document id with its year, as shown below:
print(list(enumerate(year)))
Let's filter the documents by the provided period:
period = (2000, 2001)
docs = [n for n, y in enumerate(year)
        if period[0] <= y <= period[1]]
# only documents in the period
print(docs)
Now we slice the matrix to keep only the columns of the documents in the chosen period; each row still corresponds to a term and each entry holds its frequency in that document.
tf_matrix = matrix[:, docs]
print(tf_matrix.toarray())
tf_sum = tf_matrix.sum(axis=1).A1  # total frequency of each term within the period
df_sum = tf_matrix.getnnz(axis=1)  # number of period documents containing each term
print(tf_sum)
print(df_sum)
terms = [m for m, tf in enumerate(tf_sum)]  # keep every term (no frequency threshold applied)
Note: we could now apply term- and document-frequency thresholds, discarding terms and documents whose frequency falls below them.
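A sketch of such a filter; min_tf and min_df are hypothetical names and the values are arbitrary:

min_tf, min_df = 2, 1  # hypothetical thresholds, tune per corpus
# keep terms that are frequent enough in the period, overall and across documents
kept = [m for m, (tf, df) in enumerate(zip(tf_sum, df_sum)) if tf >= min_tf and df >= min_df]
tf_matrix_filtered = tf_matrix[kept, :]
print(tf_matrix_filtered.shape)

We continue below with the unfiltered matrix.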
rows, cols = tf_matrix.nonzero()
print(rows)
print(cols)
We create a Compressed Sparse Row matrix with the constructor csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)]), where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
The CSR format is often used to represent sparse matrices in machine learning because of the efficient row access and matrix multiplication it supports.
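A tiny sketch of this constructor, placing two explicit values in a 2x3 matrix (the values are arbitrary):

# a[0, 2] = 5 and a[1, 0] = 7; every other entry is implicitly zero
a = csr_matrix(([5, 7], ([0, 1], [2, 0])), shape=(2, 3))
print(a.toarray())
# [[0 0 5]
#  [7 0 0]]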
df_matrix = csr_matrix((ones(len(rows)), (rows, cols)), shape=tf_matrix.shape)  # binary incidence matrix; shape keeps rows for terms absent from the period
print(df_matrix.toarray())
year2 = [year[n] for n in docs]
print(year2)
First, we multiply the term-document matrix by the year vector using the @ operator (matrix multiplication); for each term, this sums the years of the documents in which it appears.
res = df_matrix @ year2
print(res)
Finally, we compute the average by dividing that sum by the document frequency: a term appearing only in documents from 2000 and 2001, for instance, sums to 4001 and averages 2000.5.
res = df_matrix @ year2 / df_matrix.getnnz(axis=1) # @ operator = matrix multiplication
print(res)
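As a cross-check, a sketch that recomputes the mean year of one term directly from its row, using the already-imported mean (k = 0 is an arbitrary choice):

k = 0  # index of the first surviving term
term_years = [year2[d] for d in df_matrix[k].nonzero()[1]]  # years of the documents containing it
print(term_years, mean(term_years), res[k])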
Lastly, we map each code back to its term and pair it with its mean year:
result = {term_codes.get_term(terms[m]):res[m] for m in range(len(res))}
result
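Finally, one possible way to visualise the trends with the already-imported matplotlib (the chart style is our choice): a horizontal bar per term, extending to its mean document year.

items = sorted(result.items(), key=lambda kv: kv[1])
plt.figure(figsize=(8, 4))
plt.barh([t for t, _ in items], [y for _, y in items])
plt.xlabel('mean document year')
plt.xlim(min(year) - 1, max(year) + 1)
plt.tight_layout()
plt.show()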