This notebook shows how to analyse text corpora to identify topic trends over time.
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import pickle
import re
import os
from pathlib import Path
import requests
from collections import Counter
import matplotlib.pyplot as plt
from numpy import mean, ones
from scipy.sparse import csr_matrix
from nltk.corpus import stopwords
Two CountVectorizer parameters are worth noting:
- max_df: when building the vocabulary, ignore terms with a document frequency strictly higher than the given threshold.
- ngram_range: (1, 2) includes n-grams of 1 and 2 words; (2, 2) includes only n-grams of 2 words.
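A minimal sketch of how these two parameters change the vocabulary, using a made-up three-document corpus (the default token pattern already drops one-letter words such as 'a'):

from sklearn.feature_extraction.text import CountVectorizer

toy = ['this is a cat', 'this is a dog', 'that is a bird']
# 'is' occurs in every document (df = 1.0 > 0.9), so max_df=0.9 removes it
v1 = CountVectorizer(ngram_range=(1, 2), max_df=0.9)
v1.fit(toy)
print(v1.get_feature_names_out())
# only 2-word n-grams
v2 = CountVectorizer(ngram_range=(2, 2))
v2.fit(toy)
print(v2.get_feature_names_out())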
By default, each row of the matrix corresponds to a document and each column to an n-gram:
| | and | and this | document | document is | more terms... |
|---|---|---|---|---|---|
| doc0 | 0 | 0 | 1 | ... | ... |
| doc1 | 0 | 1 | 0 | ... | ... |
| doc2 | 1 | 1 | 0 | ... | ... |
Transposing the matrix turns each row into the frequency of one n-gram across all the documents:
| | doc0 | doc1 | doc2 |
|---|---|---|---|
| and | 0 | 0 | 1 |
| and this | 0 | 1 | 1 |
| document | 1 | 0 | 0 |
| more terms... | | | |
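To see both layouts concretely, the matrices can be wrapped in pandas DataFrames; a minimal sketch with a two-document toy corpus (the demo_ names are ours):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

demo_docs = ['this is the first document', 'and this is the second document']
demo_v = CountVectorizer(analyzer='word', ngram_range=(1, 2))
demo_X = demo_v.fit_transform(demo_docs)
# default layout: one row per document, one column per n-gram
print(pd.DataFrame(demo_X.toarray(), columns=demo_v.get_feature_names_out(), index=['doc0', 'doc1']))
# transposed layout: one row per n-gram, one column per document
print(pd.DataFrame(demo_X.T.toarray(), index=demo_v.get_feature_names_out(), columns=['doc0', 'doc1']))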
The corpus below is adapted from the scikit-learn documentation, extended with a year label for each document:
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
'Is this the second document?',
'A third document is useful for testing purposes',
'Is this the third document?',
]
year = [2000,2001,2002,2002,2002,2002,2000]
v = CountVectorizer(analyzer='word', ngram_range=(1, 2))
Once the CountVectorizer object is defined, the fit_transform method learns the vocabulary dictionary and returns the document-term matrix.
X = v.fit_transform(corpus)
The method get_feature_names_out (named get_feature_names in scikit-learn versions before 1.0) returns the feature names as an array mapping feature integer indices to feature names.
terms = v.get_feature_names_out()
terms
print(X.toarray())
matrix = X.transpose()  # rows are now terms, columns are documents
print(matrix.toarray())
doc_frequencies = matrix.getnnz(axis=1)  # number of documents containing each term
print(doc_frequencies)
frequencies = matrix.sum(axis=1).A1  # total frequency of each term in the corpus
frequencies
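As a sanity check, a quick sketch that recounts the unigrams by hand with the already-imported Counter (the regex only approximates CountVectorizer's default token pattern) and compares the totals for a few terms:

manual = Counter()
for doc in corpus:
    manual.update(re.findall(r'\b\w\w+\b', doc.lower()))
for t in ['document', 'third', 'this']:
    print(t, manual[t], frequencies[list(terms).index(t)])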
Terms that occur only once in the corpus (hapax legomena) say little about trends, so we remove them in order to target our efforts in the most effective way. First, we define a class to store the terms.
class MPHash(object):
    """Minimal perfect hash: bidirectional mapping between terms and integer codes."""
    # create from iterable
    def __init__(self, terms):
        self.term = list(terms)
        self.code = {t: n for n, t in enumerate(self.term)}
    def __len__(self):
        return len(self.term)
    def get_code(self, term):
        # term -> integer code (None if unknown)
        return self.code.get(term)
    def get_term(self, code):
        # integer code -> term
        return self.term[code]
selected = [m for m, f in enumerate(frequencies) if f > 1]
hapax_rate = 1 - len(selected) / len(frequencies)
print('Removing hapax legomena ({:.1f}%)'.format(100 * hapax_rate))
matrix = matrix[selected, :]
term_codes = MPHash([terms[m] for m in selected])
term_codes.get_code("document")
term_codes.get_term(0)
term_codes.get_term(1)
term_codes.get_code("document is")
To recover the most frequent capitalisation of each term, we refit the vectorizer with lowercasing disabled so the tokens keep their original case:
v.lowercase = False
matrix2 = v.fit_transform(corpus).transpose()
terms2 = v.get_feature_names_out()
frequencies2 = matrix2.sum(axis=1).A1
forms = dict()
# for each lowercase form, keep the surface form with the highest frequency
for t, f in zip(terms2, frequencies2):
    low = t.lower()
    if forms.get(low, (None, 0))[1] < f:
        forms[low] = (t, f)
capitals = {k: tf[0] for k, tf in forms.items()}
capitals
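A quick usage sketch: look up the dominant surface form of a few lowercase terms (the term list here is arbitrary; .get falls back to the lowercase form for unknown terms):

for t in ['document', 'this', 'and']:
    print(t, '->', capitals.get(t, t))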
We now choose a period of time, described by a pair of years, and identify the documents that fall within it.
The built-in enumerate() function adds a counter to an iterable and returns an enumerate object, which can be used directly in for loops or converted into a list of tuples with list().
enumerate(year) pairs each document id with its year, as shown below:
print(list(enumerate(year)))
Let's filter the documents by the provided period:
period = (2000, 2001)
docs = [n for n, y in enumerate(year)
        if period[0] <= y <= period[1]]
# only documents in the period
print(docs)
Now we slice the matrix to keep only the columns of the documents in the chosen period; each row still corresponds to a term and each entry holds its frequency in that document.
tf_matrix = matrix[:, docs]
print(tf_matrix.toarray())
tf_sum = tf_matrix.sum(axis=1).A1  # total frequency of each term within the period
df_sum = tf_matrix.getnnz(axis=1)  # number of period documents containing each term
print(tf_sum)
print(df_sum)
terms = [m for m, tf in enumerate(tf_sum)]  # keep every term (no frequency threshold applied)
Note: we could now apply term- and document-frequency thresholds, discarding terms and documents whose frequency falls below them.
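A sketch of such a filter; min_tf and min_df are hypothetical names and the values are arbitrary:

min_tf, min_df = 2, 1  # hypothetical thresholds, tune per corpus
# keep terms that are frequent enough in the period, overall and across documents
kept = [m for m, (tf, df) in enumerate(zip(tf_sum, df_sum)) if tf >= min_tf and df >= min_df]
tf_matrix_filtered = tf_matrix[kept, :]
print(tf_matrix_filtered.shape)

We continue below with the unfiltered matrix.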
rows, cols = tf_matrix.nonzero()
print(rows)
print(cols)
We create a Compressed Sparse Row matrix with the constructor csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)]), where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
The CSR format is often used to represent sparse matrices in machine learning because of the efficient row access and matrix multiplication it supports.
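A tiny sketch of this constructor, placing two explicit values in a 2x3 matrix (the values are arbitrary):

# a[0, 2] = 5 and a[1, 0] = 7; every other entry is implicitly zero
a = csr_matrix(([5, 7], ([0, 1], [2, 0])), shape=(2, 3))
print(a.toarray())
# [[0 0 5]
#  [7 0 0]]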
df_matrix = csr_matrix((ones(len(rows)), (rows, cols)), shape=tf_matrix.shape)  # binary incidence matrix; shape keeps rows for terms absent from the period
print(df_matrix.toarray())
year2 = [year[n] for n in docs]
print(year2)
First, we multiply the term-document matrix by the year vector using the @ operator (matrix multiplication); for each term, this sums the years of the documents in which it appears.
res = df_matrix @ year2
print(res)
Finally, we compute the average by dividing that sum by the document frequency: a term appearing only in documents from 2000 and 2001, for instance, sums to 4001 and averages 2000.5.
res = df_matrix @ year2 / df_matrix.getnnz(axis=1) # @ operator = matrix multiplication
print(res)
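As a cross-check, a sketch that recomputes the mean year of one term directly from its row, using the already-imported mean (k = 0 is an arbitrary choice):

k = 0  # index of the first surviving term
term_years = [year2[d] for d in df_matrix[k].nonzero()[1]]  # years of the documents containing it
print(term_years, mean(term_years), res[k])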
Lastly, we map each code back to its term and pair it with its mean year:
result = {term_codes.get_term(terms[m]):res[m] for m in range(len(res))}
result
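Finally, one possible way to visualise the trends with the already-imported matplotlib (the chart style is our choice): a horizontal bar per term, extending to its mean document year.

items = sorted(result.items(), key=lambda kv: kv[1])
plt.figure(figsize=(8, 4))
plt.barh([t for t, _ in items], [y for _, y in items])
plt.xlabel('mean document year')
plt.xlim(min(year) - 1, max(year) + 1)
plt.tight_layout()
plt.show()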