Unsupervised methods¶

In this lesson, we'll cover unsupervised computational text analysis approaches. The central methods covered are TF-IDF and Topic Modeling. Both of these are common approaches in the social sciences and humanities.

DTM/TF-IDF

Topic modeling

Today you will¶

Understand the DTM and why it's important to text analysis
Learn how to create a DTM in Python
Learn basic functionality of Python's package scikit-learn
Understand tf-idf scores
Learn a simple way to identify distinctive words
Implement a basic topic modeling algorithm and learn how to tweak it
In the process, gain more familiarity and comfort with the Pandas package and manipulating data

Time¶

Teaching: 30 minutes
Exercises: 30 minutes

Key Jargon¶

Document Term Matrix:
- a matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
TF-IDF Scores:
- short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Topic Modeling:
- A general class of statistical models that uncover abstract topics within a text. It uses the co-occurrence of words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words, which indicate the subject of each topic, and a weight distribution across topics for each document.
LDA:
- Latent Dirichlet Allocation. A particular model for topic modeling. It does not take document order into account, unlike some other topic modeling algorithms.
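
To make the document term matrix definition concrete, here is a minimal sketch that builds one by hand using only the standard library. The three toy "documents" are made up for illustration and are not part of the lesson's data.

```python
from collections import Counter

# Hypothetical toy corpus: three very short "documents".
docs = [
    "the beat is good",
    "the melody is good good",
    "the beat drops",
]

# The vocabulary (the columns) is every distinct term in the corpus, sorted.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per term; each cell is a raw count.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Each row describes one document and each column one term, which is the same layout scikit-learn produces later in the lesson.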

DTM/TF-IDF ¶

In this lesson we will use Python's scikit-learn package to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word weighting technique called tf-idf (term frequency inverse document frequency) to identify important and discriminating words within this dataset (utilizing the Pandas package). The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?

In [ ]:

import os
import numpy as np
import pandas as pd

DATA_DIR = '../data'
music_fname = 'music_reviews.csv'
music_fname = os.path.join(DATA_DIR, music_fname)

First attempt at reading in file¶

In [ ]:

reviews = pd.read_csv(music_fname)
reviews.head()

Challenge¶

Our first attempt at reading in the csv file failed. Why?

In [ ]:

Print the text of the first review.

In [ ]:

print(reviews['body'][0])

Explore the Data using Pandas¶

Let's first look at some descriptive statistics about this dataset, to get a feel for what's in it. We'll do this using the Pandas package.

Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. <3 your data!

First, what genres are in this dataset, and how many reviews in each genre?

In [ ]:

#We can count this using the value_counts() function
reviews['genre'].value_counts()

The first thing most people do is to describe their data. (This is the summary command in R, or the sum command in Stata).

In [ ]:

#There's only one numeric column in our data so we only get one column for output.
reviews.describe()

This only gets us numerical summaries. To get summaries of some of the other columns, we can explicitly ask for it.

In [ ]:

reviews.describe(include=['O'])

Who were the reviewers?

In [ ]:

reviews['critic'].value_counts().head(10)

And the artists?

In [ ]:

reviews['artist'].value_counts().head(10)

We can get the average score as follows:

In [ ]:

reviews['score'].mean()

What's the distribution of scores?

In [ ]:

reviews['score'].plot(kind='hist');

Now we want to know the average score for each genre. To do this, we use the Pandas groupby function. You'll want to get very familiar with the groupby function. It's quite powerful.

In [ ]:

reviews_grouped_by_genre = reviews.groupby("genre")
reviews_grouped_by_genre['score'].mean().sort_values(ascending=False)
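
As a rough illustration of what groupby is doing, the sketch below applies the same split-then-summarize pattern to a tiny DataFrame. The genres and scores here are made up, and the `agg` call is just one of several ways to get more than one summary per group.

```python
import pandas as pd

# Hypothetical toy data; the genre labels and scores are invented.
toy = pd.DataFrame({
    "genre": ["Rap", "Jazz", "Rap", "Jazz", "Indie"],
    "score": [80, 70, 60, 90, 75],
})

# Split the rows by genre, then compute the mean and the row count of
# 'score' within each group.
summary = toy.groupby("genre")["score"].agg(["mean", "count"])

print(summary.sort_values("mean", ascending=False))
```

Checking the count next to the mean is a useful habit: an average based on a handful of reviews deserves less weight than one based on hundreds.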

Creating the DTM using scikit-learn¶

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column. First, a preprocessing step to remove numbers.

In [ ]:

def remove_digits(comment):
    return ''.join([ch for ch in comment if not ch.isdigit()])

reviews['body_without_digits'] = reviews['body'].apply(remove_digits)

In [ ]:

reviews['body_without_digits'].head()
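
To see what this preprocessing step does to a single string, here is a small sketch on a made-up sentence. It also shows a regular-expression version that behaves the same way for ordinary digits; the sample text is hypothetical.

```python
import re

def remove_digits(comment):
    # Keep every character that is not a digit.
    return ''.join(ch for ch in comment if not ch.isdigit())

# Hypothetical sample text, not taken from the reviews dataset.
sample = "Released in 2009, track 7 runs 4:33."

print(remove_digits(sample))
# For ordinary digits 0-9, removing every \d match gives the same result.
print(re.sub(r"\d", "", sample))
```

Note that only the digits are removed, so the surrounding punctuation and spaces stay where they were.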

CountVectorizer Function¶

Our next step is to turn the text into a document term matrix using the scikit-learn class called CountVectorizer.

In [ ]:

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
sparse_dtm = countvec.fit_transform(reviews['body_without_digits'])

Great! We made a DTM! Let's look at it.

In [ ]:

sparse_dtm

This format is called Compressed Sparse Row (CSR) format. It saves a lot of memory to store the DTM in this format, but it is difficult for a human to read. To illustrate the techniques in this lesson we will first convert this matrix back to a Pandas DataFrame, a format we're more familiar with. For larger datasets, you will have to use the sparse format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!

In [ ]:

# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
dtm = pd.DataFrame(sparse_dtm.toarray(), columns=countvec.get_feature_names_out(), index=reviews.index)
dtm.head()
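
The full DTM is too wide to read comfortably, so here is a minimal sketch on three made-up sentences that shows what CountVectorizer produces with its default settings. The toy sentences and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents.
toy_docs = ["A great album", "Not a great album at all", "Great great beats"]

toy_vec = CountVectorizer()
toy_dtm = toy_vec.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort terms by that index.
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

print(vocab)
print(toy_dtm.toarray())
```

Two defaults are visible here: text is lowercased, and one-character tokens such as "a" are dropped, so they never become columns.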

Challenge¶

Read in all the Jane Austen books from day 2 and turn them into a DTM. What will be the rows and columns?

In [ ]:

import glob
DAY2_DATA_DIR = '../../day-2/data'
AUSTEN_DIR = os.path.join(DAY2_DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(AUSTEN_DIR)
books = []
for fname in fnames:
    with open(fname) as f:
        text = f.read()
    books.append(text)

In [ ]:

What can we do with a DTM?¶

We can quickly identify the most frequent words

In [ ]:

dtm.sum().sort_values(ascending=False).head(10)

Challenge¶

Print out the most infrequent words rather than the most frequent words. You can look at the Pandas documentation for more information.

Gold star challenge:¶

Print the average number of times each word is used in a review.
Print this out sorted from lowest to highest.

In [ ]:

TF-IDF scores¶

How to find distinctive words in a corpus is a long-standing question in text analysis. Today, we'll learn one simple approach to this: TF-IDF. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is the tf-idf score. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will in theory filter out common terms such as 'the', 'of', and 'and'.

Traditionally, the inverse document frequency is calculated as such:

number_of_documents / number_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_documents_with_word1)

You can, and often should, normalize the term frequency by the length of the document:

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_documents_with_word1)
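
To make the arithmetic concrete, here is a small worked sketch of the length-normalized formula in plain Python. The toy corpus and the helper name `tfidf` are made up for illustration.

```python
# Hypothetical toy corpus, already split into words.
docs = [
    ["the", "beat", "is", "good"],
    ["the", "melody", "is", "good"],
    ["the", "beat", "drops"],
    ["the", "solo", "is", "long"],
]

def tfidf(word, doc, docs):
    # Term frequency, normalized by the length of the document.
    tf = doc.count(word) / len(doc)
    # Inverse document frequency: total documents / documents containing the word.
    n_docs_with_word = sum(1 for d in docs if word in d)
    idf = len(docs) / n_docs_with_word
    return tf * idf

print(tfidf("the", docs[0], docs))   # (1/4) * (4/4)
print(tfidf("beat", docs[0], docs))  # (1/4) * (4/2)
```

Both words appear once in the first document, but "beat" scores twice as high as "the" because it occurs in only half of the documents.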

We can calculate this manually, but scikit-learn has a built-in function to do so. This function uses a smoothed, log-scaled inverse document frequency and normalizes each document's scores, so the numbers will not correspond exactly to the calculations above. We'll use the scikit-learn calculation, but a challenge for you: use Pandas to calculate this manually.
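
For reference, the sketch below checks how scikit-learn's numbers differ from the raw ratio. It assumes the default TfidfVectorizer settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is ln((1 + n) / (1 + df)) + 1 and each document's row is scaled to unit length. The toy documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents.
toy_docs = ["the beat is good", "the melody is good", "the beat drops"]

vec = TfidfVectorizer()  # defaults assumed: smooth_idf=True, norm='l2'
weights = vec.fit_transform(toy_docs).toarray()

n_docs = len(toy_docs)
# Document frequency: in how many documents each term has a nonzero weight.
doc_freq = (weights > 0).sum(axis=0)
# Smoothed, log-scaled idf as computed under the default settings.
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(manual_idf, vec.idf_))
# Each row has Euclidean length 1 because of the l2 normalization.
print(np.linalg.norm(weights, axis=1))
```

Because of the row normalization, scores are comparable across documents of different lengths, but they are no longer simple products of a frequency and a ratio.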

TF-IDFVectorizer Function¶

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [ ]:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()
sparse_tfidf = tfidfvec.fit_transform(reviews['body_without_digits'])
sparse_tfidf

In [ ]:

# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
tfidf = pd.DataFrame(sparse_tfidf.toarray(), columns=tfidfvec.get_feature_names_out(), index=reviews.index)
tfidf.head()

Let's look at the 20 words with highest tf-idf weights.

In [ ]:

tfidf.max().sort_values(ascending=False).head(20)

Ok! We have successfully identified content words, without removing stop words. What else do you notice about this list?

Identifying Distinctive Words¶

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we add in a column of genre.

In [ ]:

tfidf['genre_'] = reviews['genre']
tfidf.head()

Now let's compare the words with the highest tf-idf weight for each genre.

In [ ]:

rap = tfidf[tfidf['genre_']=='Rap']
indie = tfidf[tfidf['genre_']=='Indie']
jazz = tfidf[tfidf['genre_']=='Jazz']

rap.max(numeric_only=True).sort_values(ascending=False).head()

In [ ]:

indie.max(numeric_only=True).sort_values(ascending=False).head()

In [ ]:

jazz.max(numeric_only=True).sort_values(ascending=False).head()
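
The three cells above repeat the same filter-then-max step once per genre. As a possible alternative, the sketch below does the comparison in one pass with groupby. The toy tf-idf values, words, and genre labels are invented for illustration.

```python
import pandas as pd

# Hypothetical tf-idf scores: one row per review, one column per word.
toy_tfidf = pd.DataFrame({
    "flow":  [0.9, 0.7, 0.0, 0.1],
    "sax":   [0.0, 0.1, 0.8, 0.6],
    "album": [0.2, 0.3, 0.2, 0.3],
})
toy_tfidf["genre_"] = ["Rap", "Rap", "Jazz", "Jazz"]

# Highest score each word reaches within each genre.
max_by_genre = toy_tfidf.groupby("genre_").max()

# For each genre, the word whose maximum score is largest.
top_word = max_by_genre.idxmax(axis=1)

print(max_by_genre)
print(top_word)
```

This scales to any number of genres without writing a new filter for each one.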

There we go! A method of identifying distinctive words. You notice there are some proper nouns in there. How might we remove those if we're not interested in them?

Challenge¶

Instead of outputting the highest weighted words, output the lowest weighted words. How should we interpret these words?

In [ ]:

Topic modeling ¶

There are many topic modeling algorithms, but we'll use LDA. This is a standard model to use. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.

We will run Latent Dirichlet Allocation, one of the earliest and most basic topic models. We will run this in one big chunk of code. Our challenge: use the knowledge of scikit-learn that we gained above to walk through the code and understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.

Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short: one-word or one-sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century.

The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora

That page has additional corpora, for those interested in exploring text analysis further.

I did some minimal cleaning to get the children's literature data in .csv format for our use.

In [ ]:

literature_fname = os.path.join(DATA_DIR, 'childrens_lit.csv.bz2')
df_lit = pd.read_csv(literature_fname, sep='\t', encoding = 'utf-8', compression = 'bz2', index_col=0)

#drop rows where the text is missing
df_lit = df_lit.dropna(subset=['text'])
df_lit.head()

Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn function LatentDirichletAllocation.

See here for more information about this function.

First, we have to import it from sklearn. We tell the model how many topics we expect to find.

In [ ]:

from sklearn.decomposition import LatentDirichletAllocation
n_topics = 5

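In sklearn, the input to LDA is a DTM. It will accept either counts or TF-IDF scores, but the model itself is defined over word counts, and the fit below uses the count matrix.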

In [ ]:

tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=5000,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(df_lit['text'])

In [ ]:

tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=5000,
                                stop_words='english'
                                )
tf = tf_vectorizer.fit_transform(df_lit['text'])
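
The `max_df` and `min_df` arguments prune the vocabulary before the model sees it. A float is read as a proportion of documents and an integer as an absolute number of documents. Here is a minimal sketch on a made-up corpus, with thresholds chosen to suit four toy documents rather than the values used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents.
toy_docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew",
]

# max_df=0.8 drops terms found in more than 80% of documents ("the": 4 of 4).
# min_df=2 drops terms found in fewer than 2 documents ("dog", "bird").
vec = CountVectorizer(max_df=0.8, min_df=2)
vec.fit(toy_docs)

kept = sorted(vec.vocabulary_)
print(kept)
```

Terms that appear nearly everywhere or almost nowhere carry little information about topics, which is why both ends are trimmed.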

This is where we fit the model.

In [ ]:

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=20, random_state=0)
lda = lda.fit(tf)
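
Besides the word weights for each topic, a fitted model can also give a weight distribution across topics for each document. The sketch below shows this on a made-up corpus with two topics. The documents and variable names are hypothetical, and a corpus this small will not yield meaningful topics; the point is only the shape of the output.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents.
toy_docs = [
    "the king and the queen ruled the castle",
    "the queen left the castle at night",
    "the rabbit ran across the garden",
    "a garden full of rabbits and flowers",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)

# fit_transform returns one row per document and one column per topic.
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)
# Each row is a distribution over topics, so it sums to 1.
print(doc_topics.sum(axis=1))
```

On the real data, the same result would come from calling `transform` on the count matrix with the fitted model.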

This is a function to print out the top words for each topic in a pretty way. Don't worry too much about understanding every line of this code.

In [ ]:

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
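
The slice `[:-n_top_words - 1:-1]` is the least obvious part of that function. The sketch below unpacks it on a made-up weight vector; the words and weights are invented.

```python
import numpy as np

# Hypothetical topic weights, one per word.
topic = np.array([0.1, 0.7, 0.3, 0.9, 0.2])
feature_names = ["apple", "boat", "cat", "dog", "egg"]
n_top_words = 2

# argsort gives the indices that would sort the weights from low to high.
order = topic.argsort()

# Walking backwards from the end takes the n_top_words largest, biggest first.
top = order[:-n_top_words - 1:-1]

print(order)
print([feature_names[i] for i in top])
```

An equivalent and perhaps more readable form is `np.argsort(-topic)[:n_top_words]`, which sorts by descending weight and keeps the first few indices.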

In [ ]:

# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, 20)

Further resources¶

This blog post goes through finding distinctive words using Python in more detail

Paper: Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict, Burt Monroe, Michael Colaresi, Kevin Quinn

More detailed description of implementing LDA using scikit-learn.

Topic modeling with Textacy