Topic modelling with Spacy, Gensim and Textacy

By Max Munnecke, @maxmunnecke

This notebook consists of the following sections:

  • Initialize. Setting up the environment and loading data.
  • Text extraction. Phrase and token extraction with Gensim and Spacy.
  • Topic modelling. Using Textacy's LDA model.
  • Data processing. Calculating data for visualization and export.
  • Model evaluation. A collection of visualizations of the resulting topics.
  • Export data. The data can be used for creating more visualizations or for import into a graph.

General concept:
The emphasis in this notebook is on facilitating an iterative process where you can easily adjust stopwords and the number of topics. Furthermore, it contains features for re-focusing on sub-topics and thereby creating a hierarchy of topics.

INITIALIZE

Load environment

In [1]:
# Load packages
import os, re, sys
import json # write to disk
import pandas as pd
import matplotlib.pyplot as plt
import spacy
import textacy # 0.5.0, does not work with 0.6.0.
import textacy.datasets
import textacy.fileio

nlp = spacy.load("en") # Download spacy english vocabulary first: `python -m spacy download en`

import warnings
warnings.filterwarnings('ignore')  # Let's not pay heed to them right now
%matplotlib inline
In [2]:
# Log environment
print("cwd : " + os.getcwd())
print("sys : " + str(sys.version_info))
print("spacy : "+ spacy.__version__)
print("textacy : "+ textacy.__version__)
cwd : /home/jovyan/work/prj-bib/tb/nlp
sys : sys.version_info(major=3, minor=6, micro=3, releaselevel='final', serial=0)
spacy : 2.0.8
textacy : 0.5.0

Set global variables and load data

SET Change 'outroot' to reflect the current investigation
In [3]:
# infolder ='' # win64 py36
infolder = 'data-in/' # docker py35mini
infile = 'tb_data.tsv'
outfolder = 'data-out/'
outroot = 'tb_main_'
START if 'MAIN TOPIC' investigation
In [4]:
data_org = pd.read_csv(infolder + infile, index_col=0, sep='\t')
print('Length : ' + str(len(data_org)))
data_org.describe()
Length : 2101
Out[4]:
pub-full abstract key-au key-pub
count 2101 2093 1445 1962
unique 2098 2092 1442 1888
top Should tuberculosis programmes invest in secon... Introduction: In tuberculosis (Tb), the great ... Antitubercular; Benzoxazole; Interaction energ... MYCOBACTERIUM-TUBERCULOSIS
freq 2 2 2 21
In [5]:
# Transforming the incoming dataframe to standard template.
columns_extract = {'pub-full':'title','abstract':'abstract', 'key-au':'keywords'} # {'old':'new'}
data_org = data_org[[name for name in columns_extract.keys()]]
data_org.rename(columns=columns_extract, inplace=True)
data = data_org # 'data_org' keeps the full original data set for when sub-slices are being explored
END if 'MAIN TOPIC' investigation
START if 'SUB TOPIC' investigation
In [ ]:
# START HERE if 'data' has been manipulated elsewhere
# Load external data frame
data_topic = pd.read_csv(outfolder +'tb_mdrtb_data-topic-df.tsv', index_col=0, sep='\t')
In [ ]:
data_topic.describe()
SET `sub_topic` and adjust `cutoff`
In [ ]:
sub_topic = '3'
cutoff = 0.7
data_tmp = data_topic[data_topic[sub_topic]>cutoff] 
In [ ]:
print('Number of articles: %s' % len(data_tmp))
In [ ]:
data = data_tmp
END if 'SUB TOPIC' investigation
In [6]:
docs = [ str(a_) + ". " + str(b_) for a_,b_ in zip(data['title'], data['abstract'])]
# Converting '-' to '_' to make sure that terms are not split up during subsequent Gensim and Textacy manipulation.
docs = [re.sub(r'\b-\b', '_', text) for text in docs] # Should not be touched as it is referenced later.

TEXT EXTRACTION

Find phrases

Concept: Identify frequent phrases and glue them together with an underscore "_".

Inspiration: Phrase model for bi- and tri-grams with Gensim: https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb. Other source: https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

Train phrase model

In [7]:
import re
import gensim

# Split paragraph into sentences
grap_sentence = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?)\s')
# Remove text enclosed in '()' or '[]'
grap_enclosed = re.compile(r'[\(\[].*?[\)\]]')

model_sents = []
for text in docs:
    # In a single line, sentences are extracted and text enclosed in brackets is removed.
    stripped = [grap_enclosed.sub("", sent) for sent in grap_sentence.split(text)]
    # Exclude the final character of each sentence so the trailing punctuation does not become part of the last word
    model_sents += [sent[:-1].split() for sent in stripped]
    
common_terms = ["of", "with", "without", "and", "or", "the", "a", "not", "be", "to","this","who","in"]
bigram = gensim.models.Phrases(model_sents, min_count=50, threshold=5, max_vocab_size=50000, common_terms=common_terms)
trigram = gensim.models.Phrases(bigram[model_sents], min_count=50, threshold=5, max_vocab_size=50000, common_terms=common_terms)

Apply phrase model

In [8]:
docs_phrased = [" ".join(trigram[bigram[doc.split()]]) for doc in docs]
print(str(len(docs_phrased)) +' : '+ docs_phrased[0][:400])
2101 : Culture and Next_generation sequencing_based drug_susceptibility_testing unveil high levels of drug_resistant_TB in Djibouti: results from the first national survey. Djibouti is a small country in the Horn of Africa with a high TB incidence (378/100,000 in 2015). Multidrug_resistant TB (MDR_TB) and resistance to second_line agents have_been previously identified in the country but the extent of th

Filter and tokenize text

SpaCy is assigned the task of turning each document into a list of tokens. It is important to filter away non-topical tokens, as they may otherwise become determining for the topic modelling. Textacy is not used here, as I prefer to work with the underlying spaCy pipeline.

Define filters

When adding stopwords to spaCy, only the exact spelling of a word is stopped. It is therefore necessary to include uppercase versions and possibly plural forms of a word to cover all variations (see the short check below). Alternatively, we can create custom external lists outside the nlp object and control in detail how each word is evaluated. Below are versions for exact spelling, lowercase, and lemma.

NB: The keywords that were used to generate the corpus are good candidates for stopwords. Likewise, you may filter out the most frequent terms of a sub-topic if you choose to dig into it.
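
As a quick illustration of the exact-spelling behaviour described above, the small check below marks a single spelling as a stopword and inspects its variants. It is only a sketch against the `nlp` object loaded earlier; 'objective' is used because it is stopped (in lowercase form) in the cell further down anyway.

In [ ]:
# Marking one exact spelling as a stopword ...
nlp.vocab['objective'].is_stop = True
# ... leaves other spellings of the same word untouched in this spaCy version (2.0.x)
print(nlp.vocab['objective'].is_stop)   # True
print(nlp.vocab['Objective'].is_stop)   # False - the capitalised form is a separate lexeme
print(nlp.vocab['objectives'].is_stop)  # False - and so is the plural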

SET Stopwords: topic- and sub-topic-specific words, as well as artifacts not captured by the standard `is_stop` property.
In [72]:
# EXACT nlp object. Names, organisations etc.
exact_stop = 'someword'.split()
for w in exact_stop:
    lexeme = nlp.vocab[w]
    lexeme.is_stop = True

# LOWERCASE in list. Safe choice.
lower_stop = ('the to a an background objective').lower().split()

# LEMMA in list. Powerful
topic_stop = 'tb mdr mdr_tb tuberculosis'.split()
subtopic_stop = ''.split('|')
artifact = ['-PRON-','=', '+', 'in']

lemma_stop = [item.strip() for item in (topic_stop + subtopic_stop + artifact)]

Filter text

Overview of all token attributes: https://github.com/explosion/spaCy/blob/master/spacy/attrs.pyx

In [73]:
docs_tokens, tokens_tmp = [], []
for item in docs_phrased:
    doc = nlp(item)
    for w in doc:
        # Filter away line endings, spaCy stopwords, punctuation, numbers and words in the custom lists
        if not (w.text == '\n' or w.is_stop or w.is_punct or w.like_num or w.lemma_ in lemma_stop or w.text.lower() in lower_stop):
            tokens_tmp.append(w.lemma_)
    docs_tokens.append(tokens_tmp)
    tokens_tmp = []
# check
print(docs_tokens[1])
['repurpos', 'revival', 'drug', 'new', 'approach', 'combat', 'drug_resistant', 'emergence', 'drug_resistant', 'like', 'multi_drug_resistant', 'extensively_drug_resistant', 'xdr_tb', 'totally', 'drug_resistant', 'tdr_tb', 'create', 'new', 'challenge', 'fight', 'bad', 'bug', 'mycobacterium', 'repurposing', 'revival', 'drug', 'new', 'trend', 'option', 'combat', 'worsen', 'situation', 'antibiotic', 'resistance', 'era', 'situation', 'global', 'emergency', 'bactericidal', 'synergistic', 'effect', 'repurpos', 'revive', 'drug', 'late', 'drug', 'bedaquiline', 'delamanid', 'treatment', 'xdr_tb', 'tdr_tb', 'choice', 'future', 'promising', 'combinatorial', 'chemotherapy', 'bad', 'bug']

TOPIC MODELLING

We use Textacy because it has a nice suite of functions for topic modelling and connects with both termite plots and pyLDAvis.

In [74]:
vectorizer = textacy.Vectorizer(
     weighting='tf', normalize=False, smooth_idf=True,
     min_df=3, max_df=0.95, max_n_terms=10000)
In [75]:
# Document-Term Matrix
doc_term_matrix = vectorizer.fit_transform(docs_tokens)
SET Number of topics to model. Start with a larger number (10-20) and narrow down in subsequent iterations (a small comparison sketch follows at the end of this section).
In [89]:
# SET
no_topics=6
model = textacy.tm.TopicModel('lda', n_topics=no_topics) # `n_components=x` is not registered in this version (scikit-learn 0.19.1), hence the deprecation warning below
model.fit(doc_term_matrix)
/opt/conda/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py:294: DeprecationWarning: n_topics has been renamed to n_components in version 0.19 and will be removed in 0.21
  DeprecationWarning)
In [90]:
# Document-Topic Matrix
doc_topic_matrix = model.transform(doc_term_matrix)
# check
doc_topic_matrix.shape
Out[90]:
(2101, 6)
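
To support the narrowing-down, one option is to fit a few candidate models and eyeball their top terms before settling on `no_topics`. This is only a sketch reusing the Textacy calls already shown in this notebook; the candidate counts are arbitrary examples.

In [ ]:
# Fit a few candidate topic counts and print their top terms for comparison
for k in (6, 10, 15):
    candidate = textacy.tm.TopicModel('lda', n_topics=k)
    candidate.fit(doc_term_matrix)
    print('--- %d topics ---' % k)
    for topic_idx, terms in candidate.top_topic_terms(vectorizer.id_to_term, topics=-1, top_n=8):
        print(topic_idx, '|', ' | '.join(terms))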

DATA PROCESSING

Topic/Term distribution

In [91]:
topic_weight_serie = pd.Series(model.topic_weights(doc_topic_matrix))
In [92]:
# convert list of terms to text string
topic_term_list = ["| "+" | ".join(x[1])+" |" for x in model.top_topic_terms(vectorizer.id_to_term, topics=-1)]
In [93]:
# Creating a dataframe from a list of series so the order is preserved. Dictionaries (including OrderedDicts) move columns around.
series_list_tmp = [ pd.Series(range(len(topic_weight_serie)),name='topic_id'), pd.Series(topic_term_list, name='terms'), topic_weight_serie.rename('weight')]
topic_term_df = pd.concat(series_list_tmp, axis=1)
In [94]:
# Insert a column at a particular position. NB: 'rank' returns the dtype of the column being ranked; the <float> is in this case converted to integer.
topic_term_df.insert(0, 'rank', topic_term_df['weight'].rank(ascending=False).astype(int))

Topic Weight Chart

In [95]:
fig = plt.figure()
ax1 = fig.add_subplot(111)
bars = ax1.bar(range(len(topic_weight_serie)),topic_weight_serie, color='c', edgecolor='black')
plt.savefig(outfolder + outroot +"topic-weight.png")

Find top topics (= top_topic_list)

SET Number of top topics to focus on. The 'Topic Weight Chart' above may help in deciding how many to include.
In [96]:
number_top_topics = 4 # max 6
top_list_tmp = topic_weight_serie.nlargest(n=number_top_topics)
top_topic_list = list(top_list_tmp.index)
print(top_topic_list)
[5, 1, 2, 3]

Prepare Topic-Term visualization with Termite plot & pyLDAvis

In [97]:
# NB: `termite_plot` saves an image with the option `save='filename.png'`
termite_file = outfolder + outroot +"termite.png"
In [98]:
# Prepare pyLDAvis
import pyLDAvis
pyLDAvis.enable_notebook()
top_term_matrix = model.model.components_
doc_lengths = [len(d) for d in docs_tokens]
vocab = list(vectorizer.id_to_term.values())
term_frequency = textacy.vsm.get_term_freqs(doc_term_matrix)
vis_data = pyLDAvis.prepare(top_term_matrix,doc_topic_matrix,doc_lengths,vocab,term_frequency)

MODEL EVALUATION

A number of visualizations are gathered in the following to give a comprehensive overview of the resulting topic model.

In [99]:
for index, row in topic_term_df.iterrows():
    print('%02d' % (row['weight']*100) +'%','#'+'%02d' % row['rank'],"@"+('%02d' % index), row['terms'])
05% #05 @00 | isolate | strain | cluster | beijing | pza | transmission | linezolid | patient | genotype | mycobacterium |
23% #02 @01 | isolate | resistance | drug | mutation | strain | mycobacterium | m. | study | resistant | gene |
17% #03 @02 | case | control | new | drug | child | disease | multidrug_resistant | infection | treatment | country |
12% #04 @03 | resistance | assay | detection | result | method | dst | test | rapid | rif | culture |
04% #06 @04 | patient | response | cell | pulmonary | therapy | level | lung | sputum | culture | blood |
36% #01 @05 | patient | treatment | drug | case | outcome | study | multidrug_resistant | resistance | result | regimen |
In [100]:
grid = model.termite_plot(doc_term_matrix, vectorizer.id_to_term, highlight_topics=top_topic_list,
                   topics=-1,  n_terms=30, sort_terms_by='seriation', save=termite_file)
NOTE: The "Termite Plot" above assigns arbitrary numbers to the topics, starting with (@00). "pyLDAvis" below ranks the topics by their weight in the corpus, starting with (#01).
In [101]:
pyLDAvis.display(vis_data)
Out[101]:
In [102]:
for index, row in topic_term_df.sort_values(by='rank').iterrows():
    print('%02d' % (row['weight']*100) +'%','@'+('%02d' % index),'#'+'%02d' % row['rank'], row['terms'])
36% @05 #01 | patient | treatment | drug | case | outcome | study | multidrug_resistant | resistance | result | regimen |
23% @01 #02 | isolate | resistance | drug | mutation | strain | mycobacterium | m. | study | resistant | gene |
17% @02 #03 | case | control | new | drug | child | disease | multidrug_resistant | infection | treatment | country |
12% @03 #04 | resistance | assay | detection | result | method | dst | test | rapid | rif | culture |
05% @00 #05 | isolate | strain | cluster | beijing | pza | transmission | linezolid | patient | genotype | mycobacterium |
04% @04 #06 | patient | response | cell | pulmonary | therapy | level | lung | sputum | culture | blood |
NOTE: The above visualizations may inspire you to iteratively adjust the stopwords, the number of topics, or the number of topics in focus. Once you have a satisfying result, you can proceed with the following export of the data.

EXPORT DATA

NB: The termite plot is saved every time it is generated.
NB: Textacy has functions to save and load the trained model (`model.save` and `textacy.tm.TopicModel.load`).
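
A minimal sketch of that round trip, assuming these functions take a plain file path in this textacy version; the filename is just an example:

In [ ]:
# Persist the fitted topic model next to the other outputs ...
model_file = outfolder + outroot + 'lda-model.pkl'  # example filename
model.save(model_file)
# ... and restore it in a later session without refitting
model_reloaded = textacy.tm.TopicModel.load(model_file)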

In [103]:
# Export topic_term_df
topic_term_df.to_csv(outfolder + outroot +'topic-term-df.tsv', encoding='UTF-8', header=True, index=False, sep='\t')
In [104]:
# pyLDAvis visualization
pyldavis_file = outfolder + outroot +"pyldavis.html"
pyLDAvis.save_html(vis_data,pyldavis_file)

Document/Topic all-weights (..doc-topic-weight.tsv)

Merging the document ids with the generated doc_topic_matrix makes it possible to search the content. These are standard functions in Textacy, but we might as well practice our skills in manipulating pandas. Get the topics for a uid with `doc_topic_df.loc['10.1038_emi.2017.83'].nlargest(n=3)`. Load the file again with `loaded_df = pd.read_csv(open(r'...doc-topic-weight.tsv',encoding='UTF-8'),sep='\t', index_col=0)` (see the short sketch after the next cells).

In [105]:
# Write Document-Topic Matrix
doc_topic_df = pd.DataFrame(data=doc_topic_matrix,    # values
             index=data.index,    # document uids as index
             columns=list(range(doc_topic_matrix.shape[1])))  # topic ids as column names
In [106]:
doc_topic_df.to_csv(outfolder + outroot +'doc-topic-weight.tsv', encoding='UTF-8', header=True, index=True, sep='\t')
# Check df
doc_topic_df.iloc[:2,:5]
Out[106]:
0 1 2 3 4
pub-uid
10.1038_s41598_017_17705_3 0.072527 0.098516 0.166856 0.355275 0.001497
10.3389_fmicb.2017.02452 0.061662 0.328028 0.600453 0.003277 0.003282
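
Putting the two snippets from the introduction of this section together (a short usage sketch; the uid is the example mentioned above and the file is the one written by the cell just before):

In [ ]:
# Top 3 topics for a single document, looked up by its uid
print(doc_topic_df.loc['10.1038_emi.2017.83'].nlargest(n=3))

# Reload the exported matrix in a later session
loaded_df = pd.read_csv(outfolder + outroot + 'doc-topic-weight.tsv',
                        encoding='UTF-8', sep='\t', index_col=0)
loaded_df.iloc[:2, :5]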

Document/Topic top-weights (..doc-topic-top3.json)

Comments: `model.top_doc_topics` yields a row number and a tuple containing pairs of topic:weight. The row number is used to look up the 'uid'. The tuple is turned into a list of lists, and the values are converted from numpy to plain Python objects.

In [107]:
generator = model.top_doc_topics(doc_topic_matrix, docs=-1, top_n=3, weights=True)
doc_topic_top3 = [[data.index[doc_idx],[[x.item(),round(y.item(),2)] for x,y in topics]] for doc_idx, topics in generator]
with open(outfolder + outroot +'doc-topic-top3.json', 'w') as outfile:
    json.dump(doc_topic_top3, outfile)
doc_topic_top3[:2]
Out[107]:
[['10.1038_s41598_017_17705_3', [[3, 0.36], [5, 0.31], [2, 0.17]]],
 ['10.3389_fmicb.2017.02452', [[2, 0.6], [1, 0.33], [0, 0.06]]]]

Topic-Term Matrix with weights

In [108]:
topic_term_list = list(model.top_topic_terms(vectorizer.id_to_term, topics=-1, top_n=10, weights=True))
with open(outfolder + outroot +'topic-term-weight.json', 'w') as outfile:
    json.dump(topic_term_list, outfile)
In [109]:
# Topic Aggregated Weight (list with one number per topic)
with open(outfolder + outroot + 'topic-aggregated-weight.tsv', 'w') as f:
    f.write("\n".join([str(x) for x in topic_weight_serie]))

Word cloud (..wordcloud.txt)

No satisfactory packages exist for making word clouds in Python from a topic-term frequency list. Below, a text file is generated with the data needed for creating word clouds at https://worditout.com/word-cloud/create. I recommend the following settings: font: sans-serif, colours: #ff8000 - #40bfbf, background: #242424, colour blending: rainbow, vary-word-colour: frequency, aspect ratio: 16/9, differences: big, vary-word-size: frequency.

In [110]:
with open(str(outfolder + outroot + 'wordcloud.txt'), 'w') as f:    
    for topic_tmp in topic_term_list:
        f.write("==== #" + str(topic_tmp[0])+"\n")
        for x in topic_tmp[1]:
            (a,b) = x
            f.write(str(a) + ":" + str(int(b))+"\n")

Top articles for each topic (..doc-top.html)

In [111]:
cutoff = 15 # How many top articles should be displayed for each topic
top_overview = [] # list of topics, each containing a list of top articles
for topic in range(no_topics):
    top_overview.append([topic,[[x,'%.2f' % y] for (x,y) in doc_topic_df[topic].nlargest(n=cutoff).iteritems()]])
In [112]:
with open(str(outfolder + outroot + 'doc-top.html'), 'w') as f:  
    f.write('<html>\n<head><title>Top Documents</title></head><body style="font-family:verdana;">')
    for topic in top_overview:
        # print top topic titles
        f.write('<h1>Topic number %s</h1>' % (topic[0]))
        for item in topic[1]:
            # look up title
            uid = item[0]
            url = uid.replace('_','/')
            weight = item[1]
            # Find the row: `data.loc[uid]` returns it as a Series with the column names as index.
            row = data.loc[uid]
            f.write('<p><b>%s</b>   : <a href="http://dx.doi.org/%s">%s</a> | %s</p>' % (row['title'],url,uid,weight))
            #Output title, doi, weight. 
        # print top topic abstracts
    f.write('</body></html>')
START if 'MAIN TOPIC' investigation

Save dataframe for exploring sub-themes

When a main theme has been explored, a dataframe may be exported containing the combined data and the doc_topic_df, so that it is possible to drill into the individual topics and explore sub-topics.

In [113]:
# Execute only for main
sub_topics = top_topic_list
data_topic = pd.concat([data,doc_topic_df], axis=1)
In [114]:
data_topic.head()
Out[114]:
title abstract keywords 0 1 2 3 4 5
pub-uid
10.1038_s41598_017_17705_3 Culture and Next-generation sequencing-based d... Djibouti is a small country in the Horn of Afr... NaN 0.072527 0.098516 0.166856 0.355275 0.001497 0.305329
10.3389_fmicb.2017.02452 Repurposing and Revival of the Drugs: A New Ap... Emergence of drug resistant tuberculosis like ... drug resistance tuberculosis; repurposing; rev... 0.061662 0.328028 0.600453 0.003277 0.003282 0.003298
10.1007_s15010_017_1054_8 Outcomes of multidrug-resistant tuberculosis i... The purpose of this study was to establish a b... Zambia; Tuberculosis; MDR-TB; Drug resistance;... 0.001741 0.001742 0.001745 0.001759 0.001741 0.991272
10.1016_j.jgar.2017.07.002 Extensively drug-resistant tuberculosis (XDR-T... Objectives: Extensively drug-resistant tubercu... Mycobacterium tuberculosis; MDR-TB; XDR-TB; Ge... 0.001449 0.619068 0.188984 0.187598 0.001444 0.001456
10.1016_j.ijid.2017.09.019 Trends and characteristics of drug-resistant t... Objectives: The aim of this study was to descr... Multidrug-resistant tuberculosis; Primary tran... 0.001476 0.001476 0.293617 0.001471 0.001468 0.700492
In [115]:
data_topic.to_csv(outfolder + outroot +'data-topic-df.tsv', encoding='UTF-8', header=True, index=True, sep='\t')
END if 'MAIN TOPIC' investigation