In this lesson, we'll cover unsupervised computational text analysis approaches. The central methods covered are TF-IDF and Topic Modeling. Both of these are common approaches in the social sciences and humanities.
Document Term Matrix:
TF-IDF Scores:
Topic Modeling:
LDA:
In this lesson we will use Python's scikit-learn package to make a document term matrix (DTM) from a .csv Music Reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word weighting technique called tf-idf (term frequency-inverse document frequency) to identify important and discriminating words within this dataset (using the Pandas package). The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?
import os
import numpy as np
import pandas as pd
DATA_DIR = '../data'
music_fname = 'music_reviews.csv'
music_fname = os.path.join(DATA_DIR, music_fname)
reviews = pd.read_csv(music_fname)
reviews.head()
Our first attempt at reading in the csv file failed. Why?
Print the text of the first review.
print(reviews['body'][0])
Let's first look at some descriptive statistics about this dataset, to get a feel for what's in it. We'll do this using the Pandas package.
Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. <3 your data!
First, what genres are in this dataset, and how many reviews in each genre?
#We can count this using the value_counts() function
reviews['genre'].value_counts()
The first thing most people do is to describe their data. (This is the summary command in R, or the sum command in Stata.)
#There's only one numeric column in our data so we only get one column for output.
reviews.describe()
This only gets us numerical summaries. To get summaries of some of the other columns, we can explicitly ask for them.
reviews.describe(include=['O'])
Who were the reviewers?
reviews['critic'].value_counts().head(10)
And the artists?
reviews['artist'].value_counts().head(10)
We can get the average score as follows:
reviews['score'].mean()
What's the distribution of scores?
reviews['score'].plot(kind='hist');
Now we want to know the average score for each genre. To do this, we use the Pandas groupby function. You'll want to get very familiar with groupby. It's quite powerful.
reviews_grouped_by_genre = reviews.groupby("genre")
reviews_grouped_by_genre['score'].mean().sort_values(ascending=False)
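To give a feel for what else groupby can do, here is a minimal sketch that computes several summaries per genre in one pass. The toy_reviews table and its values are made up for illustration; only the column names mirror the ones used in this lesson.

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews table
toy_reviews = pd.DataFrame({
    "genre": ["Rap", "Rap", "Jazz", "Jazz", "Indie"],
    "score": [70.0, 80.0, 90.0, 70.0, 60.0],
})

# Group by genre, then compute count, mean and max of the score column
genre_summary = (
    toy_reviews.groupby("genre")["score"]
    .agg(["count", "mean", "max"])
    .sort_values("mean", ascending=False)
)
print(genre_summary)
```

The same pattern should work on the real reviews DataFrame by swapping toy_reviews for reviews.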
Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column. First, a preprocessing step to remove numbers.
def remove_digits(comment):
    #keep every character that is not a digit
    return ''.join([ch for ch in comment if not ch.isdigit()])
reviews['body_without_digits'] = reviews['body'].apply(remove_digits)
reviews['body_without_digits'].head()
Our next step is to turn the text into a document term matrix using the scikit-learn class called CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
sparse_dtm = countvec.fit_transform(reviews['body_without_digits'])
Great! We made a DTM! Let's look at it.
sparse_dtm
This format is called Compressed Sparse Row (CSR) format. It saves a lot of memory to store the DTM in this format, because only the non-zero counts are kept, but it is difficult for a human to read. To illustrate the techniques in this lesson we will first convert this matrix back to a Pandas DataFrame, a format we're more familiar with. For larger datasets, you will have to use the sparse format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dtm = pd.DataFrame(sparse_dtm.toarray(), columns=countvec.get_feature_names_out(), index=reviews.index)
dtm.head()
Read in all the Jane Austen books from day 2 and turn them into a DTM. What will be the rows and columns?
import glob
DAY2_DATA_DIR = '../../day-2/data'
AUSTEN_DIR = os.path.join(DAY2_DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(AUSTEN_DIR)
books = []
for fname in fnames:
    with open(fname) as f:
        text = f.read()
    books.append(text)
We can quickly identify the most frequent words
dtm.sum().sort_values(ascending=False).head(10)
Print out the most infrequent words rather than the most frequent words. You can look at the Pandas documentation for more information.
How to find distinctive words in a corpus is a long-standing question in text analysis. Today, we'll learn one simple approach to this: TF-IDF. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.
One of the most popular ways to weight words (beyond frequency counts) is the tf-idf score. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.
Traditionally, the inverse document frequency is calculated as such:
number_of_documents / number_of_documents_with_term
so:
tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_of_documents_with_word1)
You can, and often should, normalize the term frequency by the length of the document:
tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_of_documents_with_word1)
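To make the arithmetic concrete, here is a small worked sketch of the normalized formula using Pandas. The toy_counts table is invented for illustration, and this is the plain ratio version of the formula, not what scikit-learn computes.

```python
import pandas as pd

# Hypothetical toy counts: rows are documents, columns are words
toy_counts = pd.DataFrame(
    {"the": [2, 1, 3], "guitar": [1, 0, 0], "beat": [0, 2, 0]},
    index=["doc1", "doc2", "doc3"],
)

n_docs = len(toy_counts)
# Number of documents in which each word appears at least once
doc_freq = (toy_counts > 0).sum(axis=0)
idf = n_docs / doc_freq

# Term frequency normalized by the total word count of each document
term_freq = toy_counts.div(toy_counts.sum(axis=1), axis=0)

# Multiply each column by its idf (Pandas aligns on column names)
toy_tfidf = term_freq * idf
print(toy_tfidf.round(2))
```

In this toy table 'the' appears in every document, so its idf is 1 and its score never exceeds its plain term frequency. 'guitar' and 'beat' each appear in one document out of three, so their term frequencies are tripled.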
We can calculate this manually, but scikit-learn can do it for us. Its implementation takes the log of the inverse document frequency (and, by default, smooths it and normalizes each document's scores), so the numbers will not correspond exactly to the calculations above. We'll use the scikit-learn calculation, but a challenge for you: use Pandas to calculate this manually.
To do so, we do the same thing we did above with CountVectorizer, but use the TfidfVectorizer class instead.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer()
sparse_tfidf = tfidfvec.fit_transform(reviews['body_without_digits'])
sparse_tfidf
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf = pd.DataFrame(sparse_tfidf.toarray(), columns=tfidfvec.get_feature_names_out(), index=reviews.index)
tfidf.head()
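For readers curious about why these numbers differ from the simple ratio formula, the sketch below rebuilds the scores by hand on a made-up corpus and compares them with TfidfVectorizer. It assumes the default settings (smooth_idf=True, sublinear_tf=False, norm='l2') as described in the scikit-learn documentation; toy_docs is not part of the lesson data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up documents, for illustration only
toy_docs = ["the cat sat", "the cat saw the dog", "the dogs bark"]

counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed, logged inverse document frequency
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale each row to unit length (L2 norm)
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, sk))
```

Because of the row normalization, scores are comparable across documents of different lengths, and a one-word document will give its single word a score of 1.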
Let's look at the 20 words with highest tf-idf weights.
tfidf.max().sort_values(ascending=False).head(20)
Ok! We have successfully identified content words, without removing stop words. What else do you notice about this list?
What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.
First we add in a column of genre.
tfidf['genre_'] = reviews['genre']
tfidf.head()
Now let's compare the words with the highest tf-idf weight for each genre.
rap = tfidf[tfidf['genre_']=='Rap']
indie = tfidf[tfidf['genre_']=='Indie']
jazz = tfidf[tfidf['genre_']=='Jazz']
rap.max(numeric_only=True).sort_values(ascending=False).head()
indie.max(numeric_only=True).sort_values(ascending=False).head()
jazz.max(numeric_only=True).sort_values(ascending=False).head()
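With more than a handful of genres, repeating the filter for each one gets tedious. One possible way to loop over groups is sketched below. The helper top_words_by_group and the toy data are hypothetical, introduced only for illustration, and the helper uses the same max-then-sort logic as the cells above.

```python
import pandas as pd

# Hypothetical toy tf-idf scores (rows are reviews, columns are words)
toy_tfidf = pd.DataFrame({
    "beats": [0.9, 0.1, 0.0],
    "sax": [0.0, 0.0, 0.8],
    "album": [0.2, 0.3, 0.2],
})
# Hypothetical genre label for each row of toy_tfidf
toy_genres = pd.Series(["Rap", "Rap", "Jazz"])


def top_words_by_group(weights, groups, n=2):
    """Return the n words with the highest maximum weight in each group."""
    result = {}
    for label, rows in weights.groupby(groups):
        top = rows.max().sort_values(ascending=False).head(n)
        result[label] = top.index.tolist()
    return result


top_words = top_words_by_group(toy_tfidf, toy_genres)
print(top_words)
```

Keeping the genre labels in a separate Series, rather than as an extra column, means every column passed to max() is numeric.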
There we go! A method of identifying distinctive words. You notice there are some proper nouns in there. How might we remove those if we're not interested in them?
Instead of outputting the highest weighted words, output the lowest weighted words. How should we interpret these words?
There are many topic modeling algorithms, but we'll use LDA. This is a standard model to use. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.
We will run Latent Dirichlet Allocation, one of the most basic and widely used topic models. We will run this in one big chunk of code. Our challenge: use the knowledge of scikit-learn that we gained above to walk through the code and understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.
Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short, one word or one sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century.
The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage
Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora
That page has additional corpora, for those interested in exploring text analysis further.
I did some minimal cleaning to get the children's literature data in .csv format for our use.
literature_fname = os.path.join(DATA_DIR, 'childrens_lit.csv.bz2')
df_lit = pd.read_csv(literature_fname, sep='\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
#drop rows where the text is missing
df_lit = df_lit.dropna(subset=['text'])
df_lit.head()
Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn class LatentDirichletAllocation.
See here for more information about this function.
First, we have to import it from sklearn. We tell the model how many topics we expect to find.
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 5
In sklearn, the input to LDA is a DTM (with either counts or TF-IDF scores).
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=5000,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(df_lit['text'])
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=5000,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(df_lit['text'])
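The max_df and min_df arguments used above prune the vocabulary before the model sees it. A float such as 0.80 is read as a proportion of documents, and an integer such as 50 is read as an absolute number of documents. The sketch below shows the effect on a made-up corpus, with min_df lowered to 2 so that something survives in four documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up documents, for illustration only
toy_docs = [
    "the king and the queen",
    "the king rode a horse",
    "the queen fed the horse",
    "the dragon slept",
]

unfiltered = CountVectorizer().fit(toy_docs)
# Drop words found in more than 80% of documents or in fewer than 2 documents
filtered = CountVectorizer(max_df=0.80, min_df=2).fit(toy_docs)

print(sorted(unfiltered.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Here 'the' is dropped for being in every document, and words such as 'dragon' are dropped for appearing in only one. On your own data, these thresholds are worth tuning to the size of the corpus.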
This is where we fit the model.
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=20, random_state=0)
lda = lda.fit(tf)
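A fitted model can also tell us how much of each topic it sees in each document, via its transform method. This is a minimal sketch on a made-up corpus with two topics. The toy documents and variable names are assumptions for illustration, and with so little text the topics themselves will not be meaningful.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up documents, for illustration only
toy_docs = [
    "the king and queen ruled the castle",
    "the queen lived in the castle",
    "the ship sailed the stormy sea",
    "sailors on the ship crossed the sea",
]

toy_tf = CountVectorizer(stop_words="english").fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_tf)

# One row per document, one column per topic; each row sums to 1
doc_topics = toy_lda.transform(toy_tf)
print(np.round(doc_topics, 2))
```

On the lesson data, the equivalent call would be lda.transform(tf), which returns one row of topic proportions per text.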
This is a function to print out the top words for each topic in a pretty way. Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    #each row of model.components_ holds the word weights for one topic
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, 20)
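The least obvious line in print_top_words is the slice applied to argsort. The sketch below isolates it on a made-up array of four word weights, to show that it returns the positions of the largest values, from largest to smallest. The topic_weights and vocab values are invented for illustration.

```python
import numpy as np

# Hypothetical weights of four vocabulary words within one topic
topic_weights = np.array([0.1, 0.7, 0.05, 0.4])
vocab = ["ship", "king", "garden", "queen"]
n_top_words = 2

# argsort() gives positions from smallest to largest weight;
# the slice walks backwards from the end and keeps n_top_words positions
top_idx = topic_weights.argsort()[:-n_top_words - 1:-1]
top_terms = [vocab[i] for i in top_idx]
print(top_terms)
```

In print_top_words the same slice runs once per row of model.components_, so each topic gets its own ranked list of words.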
This blog post goes through finding distinctive words using Python in more detail
Paper: Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict, Burt Monroe, Michael Colaresi, Kevin Quinn
More detailed description of implementing LDA using scikit-learn.