#!/usr/bin/env python
# coding: utf-8

# # Document Similarity

# This notebook introduces techniques for exploring document similarity. It's part of [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) (and assumes that you've already worked through previous notebooks – see the table of contents). In this notebook we'll look in particular at
# 
# * document term matrix
# * TF-IDF
# * cosine similarity
# * visualizing document similarity with scatterplots
# * visualizing document similarity with dendrograms

# A common task when working with a larger corpus is to try to determine how documents relate to one another – how similar or different they are, or how they might cluster together. For instance, if we take the 18 texts from the NLTK Gutenberg corpus, are there any texts that group together? There are several authors with multiple texts (Shakespeare, Austen, Chesterton) – do these cluster together?
# 
# More importantly in some ways, how do we determine similarity? For instance, we could plot the ratio of sentences to total number of words and see if that's helpful in determining stylistic similarity.

# In[1]:

import nltk
get_ipython().run_line_magic('matplotlib', 'inline')

# ratio of sentence count to word count for each text
sentenceLengths = {}
for fileid in nltk.corpus.gutenberg.fileids():
    sentenceLengths[fileid] = len(nltk.corpus.gutenberg.sents(fileid)) / len(nltk.corpus.gutenberg.words(fileid))

nltk.FreqDist(sentenceLengths).plot()

# We do indeed get a certain amount of grouping: for instance, the first three texts are from Shakespeare, where there's a high number of sentences per total number of tokens. But this is perhaps more an indication of genre and even formatting than of vocabulary (indicated character names in the plain text plays are counted as sentences, and the lines of text tend to be formatted shorter than in prose).

# ## Document Term Matrix

# Another way to consider document similarity is to compare the frequency of terms in each document. The first step is to create what's called a document term matrix, where frequencies are indicated for each document – it might look something like this:

# In[2]:

import pandas as pd

documents = [
    "I love jazz music and I love to read good books",
    "What good books have you read recently",
    "It's music to my ears when you say I love you"
]

# the set of unique alphabetic tokens across all documents (for reference)
allWords = set([word.lower() for word in nltk.word_tokenize(" ".join(documents)) if any([c for c in word if c.isalpha()])])

# raw term counts for each document
allRawFreqs = [nltk.FreqDist(nltk.word_tokenize(document.lower())) for document in documents]
pd.DataFrame(allRawFreqs).fillna(0)

# The top row indicates each term in the corpus and each row represents a document ("0" is the first document "I love jazz…", etc.). This would be helpful (with real data), but it's not ideal to use raw frequencies (counts), since document length isn't taken into consideration.
# 
# We could create a similar table using relative frequencies instead – in other words, where each value is relative to the total number of tokens in the document (see the short sketch below). That would be better for variable-length documents, but an even more powerful (and commonly used) technique is to calculate a value that better represents the significance of a term within the corpus. Just as the relative frequency dampens counts based on document length, the [TF-IDF](http://en.wikipedia.org/wiki/Tf–idf) value dampens relative frequencies based on the distribution of the term in the corpus.
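# Before moving on to TF-IDF, here's a minimal sketch (not part of the original notebook) of what the relative frequency version of the table above might look like: each document's raw counts are divided by its total number of tokens, so document length is taken into account. The variable name relativeFreqs is just illustrative.

# In[ ]:

# a sketch (not from the original notebook): relative frequencies instead of raw counts
relativeFreqs = []
for document in documents:
    tokens = nltk.word_tokenize(document.lower())
    counts = nltk.FreqDist(tokens)
    # divide each raw count by the total number of tokens in the document
    relativeFreqs.append({term: count / len(tokens) for term, count in counts.items()})
pd.DataFrame(relativeFreqs).fillna(0)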
# ## TF-IDF

# We already have the "TF" part of the TF-IDF value: it's simply the term's raw or relative frequency in a given document. We need to multiply that by IDF, the inverse document frequency, which we can calculate like this:
# 
# IDF = log of the total number of documents divided by the number of documents that contain the term
# 
# So, imagine we want to calculate the TF-IDF score of the word "boot" for Austen's _Emma_ within the Gutenberg corpus:
# 
# * 4 occurrences of "boot" in _Emma_
# * 161,975 words in _Emma_
# * 6 documents that contain "boot" (at least once)
# * 18 documents in total
# 
# TF = 4 / 161975 = 0.00002469516901
# 
# IDF = log(18 / 6) = 1.0986122886681098
# 
# TF-IDF (for "boot" in _Emma_) = 0.00002469516901 \* 1.0986122886681098 = **0.00002713041615**

# That seems like a lot of work for a single TF-IDF value for one term in one document. But good news! We don't need to calculate TF-IDF scores ourselves; there are several libraries that do it for us. In fact, we've already seen this with the gensim library in the topic modelling notebook, but here we're going to use [scikit-learn](http://scikit-learn.org/stable/), a machine learning library for Python, and in particular its [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

# In[3]:

from sklearn.feature_extraction.text import TfidfVectorizer

simpledtm = TfidfVectorizer().fit_transform(documents) # takes a list of strings (and tokenizes them for us)
pd.DataFrame(simpledtm.toarray())

# When we use this convenience method, we lose the term labels (they appear in the top header row as numbers from 0 to 16), but for now that's not as important as seeing that we have a document term matrix with TF-IDF values for each document and each term (with several values of zero).
# 
# We can generate a similar – though much larger – table for all the texts in the Gutenberg corpus.

# In[4]:

names = nltk.corpus.gutenberg.fileids() # we'll keep track of filenames (for labels)
texts = [nltk.corpus.gutenberg.raw(fileid) for fileid in names] # a list of strings, one per document
documentTermMatrix = TfidfVectorizer().fit_transform(texts)
documentTermMatrix

# Our document term matrix is stored in something called a sparse matrix, which is a useful and efficient way to store and process a very large table of data where many values are absent or not set. There are typically a lot of zero values in a sparse matrix (imagine a term that only occurs in one of the 18 documents – a full matrix would need to store 18 cells/values, but a sparse matrix can store just one).
# 
# This document term matrix is conceptually enormous. Depending on the arguments we provide to fit_transform() we could have a huge table of 18 rows (the number of documents) by the total number of unique words in all the documents (each term is a column and each column needs a value for each one of the documents).
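# If we do want to see which column corresponds to which term, we can keep a reference to the vectorizer and ask it for its feature names. The sketch below isn't part of the original notebook: the method is called get_feature_names_out() in recent versions of scikit-learn (older versions use get_feature_names()), and the result is a very wide table with one column per unique term. Note also that TfidfVectorizer uses a smoothed variant of the IDF formula by default, so its values won't exactly match a hand calculation like the one above.

# In[ ]:

# a sketch (not from the original notebook): recover the term labels for the columns
vectorizer = TfidfVectorizer()
labelledMatrix = vectorizer.fit_transform(texts)
# get_feature_names_out() in scikit-learn >= 1.0; get_feature_names() in older versions
terms = vectorizer.get_feature_names_out()
pd.DataFrame(labelledMatrix.toarray(), index=names, columns=terms)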
# ## Cosine Similarity

# Even though we have a table of TF-IDF values for every term in every document, in order to make any use of it we need a way of comparing documents – in other words, of determining the distance between the frequency values (the columns) for any of the documents (the rows). In effect, we want a way of expressing how different two rows (documents) are.
# 
# We can do this by converting the TF-IDF values for each document into geometric space. Imagine each term frequency (as TF-IDF) as a point on a cartesian graph, and a value that expresses the aggregate of these points: each document could thus be expressed as a vector, or a line with a certain magnitude (length) and angle from the origin (0, 0).
# 
# Because the length of the line (the total magnitude of the TF-IDF values) would be sensitive to the length of the document, we can't simply compare the lines for each document, but we can compare the angle between the documents – that's the essence of cosine similarity.
# 
# * cosine distance between document 1 and document 2 (d1, d2)
# * cosine distance between document 2 and document 3 (d2, d3)
# * cosine distance between document 1 and document 3 (d1, d3)
# 
# ![Cosine Similarity](images/cosine-similarity.png)
# 
# We'll use scikit-learn's [cosine_similarity](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) function (and subtract the similarity from 1 to get a distance).

# In[5]:

from sklearn.metrics.pairwise import cosine_similarity

distances = 1 - cosine_similarity(documentTermMatrix)

# This creates a new matrix, not of individual terms but of distance measures between documents. Think of a table of distances between cities on a map: if you present it in a table there will be duplication (each distance appears twice), and identity values where a city meets itself.
# 
# |           | Calgary | Montreal | Toronto | Vancouver |
# |-----------|---------|----------|---------|-----------|
# | Calgary   |         | 3750     | 3450    | 1050      |
# | Montreal  | 3750    |          | 550     | 4800      |
# | Toronto   | 3450    | 550      |         | 4500      |
# | Vancouver | 1050    | 4800     | 4500    |           |
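# We can lay out our own matrix of distances between the Gutenberg texts in the same way. As a quick sketch (not part of the original notebook), we simply label the rows and columns of the distances array with the file names:

# In[ ]:

# a sketch (not from the original notebook): the document distances as a labelled table,
# analogous to the city distance table above
pd.DataFrame(distances, index=names, columns=names).round(3)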
# The distance values that we get from cosine_similarity() on our documentTermMatrix from Gutenberg are similar: we have an 18 by 18 grid. But the grid is still difficult to read; we want a way of visualizing the distances.

# ## Visualizing Document Distances with a Scatterplot

# An 18 by 18 grid can be thought of as an 18 dimensional dataset. Coordinates on a cartesian graph are usually two dimensional, since we have x and y dimensions. So the question becomes: how do we convert a multidimensional grid into a two dimensional graph? Why, with [multidimensional scaling](https://en.wikipedia.org/wiki/Multidimensional_scaling) of course :). The math to accomplish this isn't as important for now as the purpose of multidimensional scaling to two dimensions, which is to represent as well as possible the distances that objects have in their original multiple dimensions.

# In[6]:

from sklearn.manifold import MDS

# reduce our n-dimensional distances matrix to a two dimensional matrix (for x and y coordinates)
mds = MDS(dissimilarity="precomputed", random_state=1)
positions = mds.fit_transform(distances)
positions

# We now have an array of 18 rows (one for each Gutenberg text) where each text has an x and y value. Now all we need to do is plot the values.

# In[11]:

import matplotlib.pyplot as plt

xvalues = positions[:, 0] # the left column (x axis) for all rows
yvalues = positions[:, 1] # the right column (y axis) for all rows

plt.figure(figsize=(10,10)) # make the graph easier to see

for x, y, name in zip(xvalues, yvalues, names):
    plt.scatter(x, y)
    plt.text(x, y, name.replace(".txt", ""))

plt.show()

# We can see a few encouraging clusters, including the authors Shakespeare (bottom right), Austen (middle left) and Chesterton (middle top).

# ## Visualizing Document Clusters with a Dendrogram

# Another strategy for showing clusters is to create a hierarchical tree that forms a [dendrogram](http://en.wikipedia.org/wiki/Dendrogram). At its simplest, a dendrogram starts with each document in its own "cluster" and then tries to merge it with the closest document that hasn't yet been merged. Those merged clusters are then merged again, and the process is repeated as many times as possible. Different strategies exist for merging clusters based on a distance matrix, but in this case we'll demonstrate [Ward](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.ward.html#scipy.cluster.hierarchy.ward)'s approach.

# In[8]:

from scipy.cluster.hierarchy import ward, dendrogram

linkage_matrix = ward(distances)

dendrogram(linkage_matrix, labels=names, orientation="right");

plt.show() # fixes margins

# The clustering of documents is arguably even easier to see here than in the scatterplot. We started the notebook by asking if we could find similarities and clusters of documents in the Gutenberg corpus, and the answer seems to be a resounding yes.

# ## See Also
# 
# * [Working with texts](https://de.dariah.eu/tatom/working_with_text.html) by Allen Riddell (to which this notebook is particularly indebted)

# ---
# 
# From [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com), [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/)
# Created March 24, 2015 (run with Python 3.4)