https://github.com/JasonKessler/scattertext
Cite as: Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations. 2017.
Link to preprint: https://arxiv.org/abs/1703.00565
@inproceedings{kessler2017scattertext, author = {Kessler, Jason S.}, title = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ}, booktitle = {ACL System Demonstrations}, year = {2017}, }
# %matplotlib inline
# NOTE(review): the line above is IPython magic — a SyntaxError in a plain
# Python script. Re-enable it only when running inside a Jupyter notebook.
import scattertext as st
from gensim.models import word2vec
import re, io, itertools
from pprint import pprint
import pandas as pd
import numpy as np
import spacy
import os, pkgutil, json, urllib
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.core.display import display, HTML

# Widen the notebook container so the interactive scatterplots have room.
display(HTML("<style>.container { width:98% !important; }</style>"))

# NOTE(review): 'en' is the spaCy 1.x/2.x shortcut; spaCy 3+ requires the full
# package name, e.g. spacy.load('en_core_web_sm') — confirm installed version.
nlp = spacy.load('en')
# If this doesn't work, please uncomment the following line and use a regex-based parser instead
#nlp = st.whitespace_nlp_with_sentences

# Load the 2012 political convention speeches sample corpus and parse each text
# with spaCy, storing the parsed Doc objects alongside the raw text.
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parsed'] = convention_df.text.apply(nlp)
# Build a term-document corpus from the parsed speeches, keyed by party,
# then restrict it to unigrams (word2vec operates on single tokens).
corpus = (st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed')
          .build()
          .get_unigram_corpus())

# Train word2vec embeddings on the convention corpus.
# NOTE(review): `size` is the gensim 3.x keyword; gensim 4+ renamed it to
# `vector_size` — confirm the installed version before running.
model = word2vec.Word2Vec(size=100, window=5, min_count=10, workers=4)
model = st.Word2VecFromParsedCorpus(corpus, model).train(epochs=10000)

# Inspect the terms closest to 'jobs' in the trained embedding space.
# (In the original notebook the bare expression displayed its value; in a
# script we must print it explicitly or the result is silently discarded.)
pprint(model.wv.most_similar('jobs'))
# Example output from one run (embeddings are not deterministic):
# [('create', 0.919), ('businesses', 0.881), ('million', 0.840), ('taxes', 0.830), ...]

# Total sentence count across the corpus — was 9677 in the original run.
print(corpus._df[corpus._parsed_col].apply(lambda x: len(list(x.sents))).sum())
#model.corpus_count
target_term = 'jobs'

# Render an interactive scatterplot coloring each term by its embedding
# similarity to `target_term`, contrasting Democratic vs. Republican usage.
html = st.word_similarity_explorer_gensim(corpus,
                                          category='democrat',
                                          category_name='Democratic',
                                          not_category_name='Republican',
                                          target_term=target_term,
                                          minimum_term_frequency=5,
                                          width_in_pixels=1000,
                                          word2vec=model,
                                          metadata=convention_df['speaker'])

file_name = 'output/demo_similarity_gensim.html'
# Create the output directory if needed (the original write failed when
# 'output/' was absent) and close the handle deterministically via `with`
# instead of leaking it from a bare open(...).write(...).
os.makedirs(os.path.dirname(file_name), exist_ok=True)
with open(file_name, 'wb') as out_file:
    out_file.write(html.encode('utf-8'))
IFrame(src=file_name, width=1200, height=700)
# Note: this will fail if you did not use spaCy as your parser.
# Same visualization as above, but similarity comes from spaCy's built-in
# word vectors rather than the gensim model trained on this corpus.
html = st.word_similarity_explorer(corpus,
                                   category='democrat',
                                   category_name='Democratic',
                                   not_category_name='Republican',
                                   target_term='jobs',
                                   minimum_term_frequency=5,
                                   width_in_pixels=1000,
                                   metadata=convention_df['speaker'])

file_name = 'output/demo_similarity.html'
# Ensure the output directory exists and close the file deterministically
# (the original bare open(...).write(...) leaked the handle).
os.makedirs(os.path.dirname(file_name), exist_ok=True)
with open(file_name, 'wb') as out_file:
    out_file.write(html.encode('utf-8'))
IFrame(src=file_name, width=1200, height=700)