Jason S. Kessler @jasonkessler
This notebook shows a quick-and-dirty analysis of PyCon abstracts. It makes heavy use of the library Scattertext (https://github.com/JasonKessler/scattertext) for language processing and visualizations.
If you have any questions, feel free to reach out on Twitter.
import scattertext as st
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import spacy
import umap
from gensim.models.word2vec import Word2Vec
from IPython.display import IFrame, display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))
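# Compare version components numerically; a plain string comparison would sort '9' after '27'.
assert tuple(int(part) for part in st.__version__.split('.')) >= (0, 0, 2, 27, 1)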
raw_2017 = BeautifulSoup(requests.get('https://us.pycon.org/2017/schedule/talks/list/').text, 'lxml')
# Locate the div that wraps the talk listing.
for div in raw_2017.findAll('div'):
    if 'class' in div.attrs and div.attrs['class'][0].strip() == 'box-content':
        content_div = div
df_2017 = pd.DataFrame({
    'title': [a for a in raw_2017.find_all('a')
              if 'id' in a.attrs and a.attrs['id'].startswith('presentation-')],
    'headers_raw': content_div.find_all('p'),
    'content_raw': content_div.find_all('div', attrs={'class': 'presentation-description'}),
    'year': '2017'})
raw_2018 = BeautifulSoup(requests.get('https://us.pycon.org/2018/schedule/talks/list/').text, 'lxml')
# Same extraction for the 2018 talk listing.
for div in raw_2018.findAll('div'):
    if 'class' in div.attrs and div.attrs['class'][0].strip() == 'box-content':
        content_div = div
df_2018 = pd.DataFrame({
    'title': [a for a in raw_2018.find_all('a')
              if 'id' in a.attrs and a.attrs['id'].startswith('presentation-')],
    'headers_raw': content_div.find_all('p'),
    'content_raw': content_div.find_all('div', attrs={'class': 'presentation-description'}),
    'year': '2018'})
df = pd.concat([df_2017, df_2018])
# Strip bracketed and parenthesized URLs, normalize apostrophes, and unescape angle brackets.
df['content_text'] = df['content_raw'].apply(
    lambda x: (re.sub(r'\[http[^\]]+\]', '',
                      re.sub(r'\(http[^\)]+\)', '', x.text))
               .replace('()', '').replace('â', "'").replace("'", chr(8217))
               .replace('&lt;', '<').replace('&gt;', '>').strip()))
df['headers_text'] = df['headers_raw'].apply(lambda x: ' '.join((re.sub(r'\[http[^\]]+\]', '',
re.sub(r'\(http[^\)]+\)', '', x.text))
.replace('()','').split())).strip())
df['headers_text'] = df['headers_text'].apply(lambda x: ''.join(c for c in x if ord(c) < 128).strip())
df.to_csv('pycon2017-2018.csv', index=False)
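# Reuse an already-loaded pipeline when the cell is re-run.
# Note: spaCy 3+ needs a full model name such as 'en_core_web_sm' instead of 'en'.
try:
    nlp
except NameError:
    nlp = spacy.load('en')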
df['parse'] = df['content_text'].apply(nlp)
pycon_corpus = (st.CorpusFromParsedDocuments(df, parsed_col='parse', category_col='year')
.build()
.compact(st.ClassPercentageCompactor(term_count=1)))
pycon_corpus = pycon_corpus.remove_terms([t for t in pycon_corpus.get_terms() if not re.match('^[a-z ]+$', t)])
pycon_phrase_corpus = (st.CorpusFromParsedDocuments(df,
parsed_col='parse',
category_col='year',
feats_from_spacy_doc=st.PhraseMachinePhrases())
.build()
.compact(st.ClassPercentageCompactor(term_count=2))
.compact(st.CompactTerms(slack=6)))
pycon_phrase_corpus = pycon_phrase_corpus.remove_terms([t for t in pycon_phrase_corpus.get_terms() if not re.match('^[a-z ]+$', t)])
def get_metadata_from_corpus(corpus):
    # Use each talk's title text as the label shown for its abstract.
    df = corpus.get_df()
    return df.title.apply(lambda x: x.text.strip())
In this chart, noun phrases occurring in 2017 and 2018 PyCon abstracts are plotted. A phrase's x-axis position is proportional to how frequently it occurred across the abstracts, while its y-axis position is higher if the phrase was more associated with 2018 and lower if it was more associated with 2017.
The following two charts use the difference in dense ranks as the metric of term-class association.
2018 was the year of best practices. The phrase "real world" was used in 2018 abstracts, but not nearly as much as in 2017. Still, dominated both years as a topic, but gained steam in 2018.
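To make the dense-rank-difference idea concrete, here is a minimal sketch in plain Python. It assumes the score is each term's dense rank in one class, scaled by that class's maximum rank, minus the same quantity for the other class. The term names and counts are made-up toy values, and the helper functions are hypothetical illustrations, not Scattertext's own code.

```python
def dense_ranks(values):
    """Map each value to its dense rank (1 = smallest; ties share a rank, no gaps)."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


def rank_difference(freq_a, freq_b):
    """Scaled dense rank in class A minus scaled dense rank in class B, per term."""
    ranks_a, ranks_b = dense_ranks(freq_a), dense_ranks(freq_b)
    max_a, max_b = max(ranks_a), max(ranks_b)
    return [ra / max_a - rb / max_b for ra, rb in zip(ranks_a, ranks_b)]


# Toy document counts (hypothetical, not taken from the PyCon data).
terms = ['term_a', 'term_b', 'term_c', 'term_d']
docs_2018 = [5, 1, 3, 3]
docs_2017 = [1, 4, 3, 0]

scores = dict(zip(terms, rank_difference(docs_2018, docs_2017)))
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f'{term}: {score:+.3f}')
```

Scores fall between -1 and 1. A positive score means the term ranks relatively higher among 2018 terms than among 2017 terms, which corresponds to a higher position on the y-axis.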
html = st.produce_frequency_explorer(pycon_phrase_corpus,
category='2018',
minimum_term_frequency=0,
pmi_filter_thresold=0,
use_full_doc = True,
term_scorer = st.RankDifference(),
term_ranker=st.OncePerDocFrequencyRanker,
metadata = get_metadata_from_corpus(pycon_phrase_corpus),
grey_threshold=0,
width_in_pixels=1200)
file_name = 'phrase_rankdiff.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1400, height=700)
Here, unigrams are plotted. Instead of frequency, the x-axis refers to characteristicness, while the y-axis refers to year association.
Clearly, the word "python" is highly characteristic, to the point it distorts the rest of the plot.
Let's see how unigram frequencies differ between 2018 and 2017. Many of the 2018 differences are stylistic. First-person plural pronouns ("our", "we", "us") dominate. This, along with other function words like "just", suggests 2018 abstracts had a more conversational style.
Machine learning terms like "learning", "learn", and "features" (these are all polysemous) also dominate 2018. Words related to software deployment ("deploy", "deployment", "production") were also trending in 2018.
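The explorer calls in this notebook pass `st.OncePerDocFrequencyRanker`. As far as I understand it, this counts a term at most once per abstract, so one abstract that repeats a word many times cannot dominate the comparison. The sketch below illustrates that counting scheme in plain Python. The function name and toy abstracts are hypothetical, and this is not Scattertext's implementation.

```python
from collections import Counter


def once_per_doc_counts(tokenized_docs):
    """Count, for each term, the number of documents it appears in at least once."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(set(tokens))  # set() collapses repeats within one document
    return counts


# Toy abstracts (made up for illustration).
toy_abstracts = [
    'we deploy our models and we test our models'.split(),
    'we learn python'.split(),
]

raw_counts = Counter(tok for doc in toy_abstracts for tok in doc)
per_doc_counts = once_per_doc_counts(toy_abstracts)

print(raw_counts['we'], per_doc_counts['we'])
print(raw_counts['models'], per_doc_counts['models'])
```

The raw count rewards repetition inside a single abstract, while the per-document count only asks how many abstracts mention the term.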
html = st.produce_characteristic_explorer(pycon_corpus,
category='2018',
not_category_name='2017',
term_ranker=st.OncePerDocFrequencyRanker,
term_scorer=st.RankDifference(),
metadata=get_metadata_from_corpus(pycon_corpus))
file_name = 'pycon_characteristic_raw.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1400, height=700)
After removing function words and the word "python", we can start to see that parts of the scientific computing stack ("numpy", "scipy", "pandas") were discussed less in 2018.
Interestingly, the word "probably" hardly appeared in 2017, but was very popular in 2018.
Words relating to ease of use were hot in 2018: "easy", "easier", "humans", and "intuitive" were all much more associated with this year's talks. The exception is the word "simple". The word "introduction" only appeared in one 2018 abstract, while appearing in multiple 2017 abstracts.
stoplist_corpus = pycon_corpus.get_stoplisted_unigram_corpus().remove_terms(['python', 'just'])
html = st.produce_characteristic_explorer(stoplist_corpus,
category='2018',
not_category_name='2017',
term_ranker=st.OncePerDocFrequencyRanker,
term_scorer=st.ScaledFScorePresets(beta=1, one_to_neg_one=True),
metadata=get_metadata_from_corpus(stoplist_corpus))
file_name = 'pycon_characteristic.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1400, height=700)
Given the small size of the corpus, the word embeddings aren't ideal, but they are still interesting to explore. A UMAP projection is shown below. Words more associated with 2018 are colored in blue, while those more associated with 2017 are colored in red.
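The projection in this notebook is configured with `metric='cosine'`, so word vectors are compared by the angle between them and not by their length. A minimal sketch of cosine distance in plain Python follows. The helper is hypothetical and only illustrates the metric, not UMAP's internals.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Same direction, different length: distance is 0.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors: distance is 1.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Because only direction matters, a frequent word and a rare word can still land close together if they appear in similar contexts.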
html = st.produce_projection_explorer(stoplist_corpus,
category='2018',
not_category_name='2017',
term_scorer = st.RankDifference(),
term_ranker=st.OncePerDocFrequencyRanker,
width_in_pixels=1000,
use_full_doc=True,
projection_model = umap.UMAP(metric='cosine'),
metadata=get_metadata_from_corpus(stoplist_corpus))
file_name = 'umap_projection.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)