By Max Munnecke, @maxmunnecke
This notebook consists of the following sections:
General concept:
The emphasis in this notebook is on facilitating an iterative process where you can easily adjust stopwords and the number of topics. Furthermore, it contains features to re-focus on sub-topics and thereby create a hierarchy of topics.
# Load packages
import spacy
nlp = spacy.load("en")
import textacy # 0.5.0, does not work with 0.6.0.
import textacy.datasets
import textacy.fileio
import matplotlib.pyplot as plt
import json # write to disk
import pandas as pd
%matplotlib inline
# nlp = spacy.load("en") # Download spacy english vocabulary: `python -m spacy download en`
import os, re, sys
import warnings
warnings.filterwarnings('ignore') # Let's not pay heed to them right now
# Log environment
print("cwd : " + os.getcwd())
print("sys : " + str(sys.version_info))
print("spacy : "+ spacy.__version__)
print("textacy : "+ textacy.__version__)
cwd : /home/jovyan/work/prj-bib/tb/nlp
sys : sys.version_info(major=3, minor=6, micro=3, releaselevel='final', serial=0)
spacy : 2.0.8
textacy : 0.5.0
# infolder ='' # win64 py36
infolder = 'data-in/' # docker py35mini
infile = 'tb_data.tsv'
outfolder = 'data-out/'
outroot = 'tb_main_'
data_org = pd.read_csv(infolder + infile, index_col=0, sep='\t')
print('Length : ' + str(len(data_org)))
data_org.describe()
Length : 2101
| | pub-full | abstract | key-au | key-pub |
|---|---|---|---|---|
| count | 2101 | 2093 | 1445 | 1962 |
| unique | 2098 | 2092 | 1442 | 1888 |
| top | Should tuberculosis programmes invest in secon... | Introduction: In tuberculosis (Tb), the great ... | Antitubercular; Benzoxazole; Interaction energ... | MYCOBACTERIUM-TUBERCULOSIS |
| freq | 2 | 2 | 2 | 21 |
# Transforming the incoming dataframe to standard template.
columns_extract = {'pub-full':'title','abstract':'abstract', 'key-au':'keywords'} # {'old':'new'}
data_org = data_org[[name for name in columns_extract.keys()]]
data_org.rename(columns=columns_extract, inplace=True)
data = data_org # Keep a copy of the original data set for when sub-slices are explored
# START HERE if 'data' has been manipulated elsewhere
# Load external data frame
data_topic = pd.read_csv(outfolder +'tb_mdrtb_data-topic-df.tsv', index_col=0, sep='\t')
data_topic.describe()
sub_topic = '3'
cutoff = 0.7
data_tmp = data_topic[data_topic[sub_topic]>cutoff]
print('Number of articles: %s' % len(data_tmp))
data = data_tmp
docs = [ str(a_) + ". " + str(b_) for a_,b_ in zip(data['title'], data['abstract'])]
# Converting '-' to '_' to make sure that terms are not split up during subsequent Gensim and Textacy manipulation.
docs = [re.sub(r'\b-\b', '_', text) for text in docs] # Should not be touched as it is referenced later.
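Note that `\b-\b` only fires when the hyphen sits between word characters, so standalone dashes survive. A quick check of the substitution (with a made-up snippet):

```python
import re

# `\b-\b` matches a hyphen flanked by word characters on both sides,
# so "drug-resistant" is glued while the standalone " - " is untouched.
text = "drug-resistant TB - a first-line regimen"
glued = re.sub(r'\b-\b', '_', text)
# glued == "drug_resistant TB - a first_line regimen"
```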
Concept: Identify frequent phrases and glue them together with an underscore "_".
Inspiration:
* Phrase model for bi- and tri-grams with Gensim: https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
* Other source: https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb
import re
import gensim
# Split paragraphs into sentences
grap_sentence = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?)\s')
# Remove text enclosed in '()' or '[]'
grap_enclosed = re.compile(r'[\(\[].*?[\)\]]')
model_sents = []
for text in docs:
    # Extract sentences and strip enclosed text in a single pass
    stripped = [grap_enclosed.sub("", sent) for sent in grap_sentence.split(text)]
    # Drop the final character of each sentence so the trailing punctuation is not attached to the last word
    model_sents += [text[:-1].split() for text in stripped]
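The two patterns can be exercised in isolation; a minimal check with a made-up snippet of what the sentence splitter and the bracket stripper each do:

```python
import re

# Same patterns as in the cell above
grap_sentence = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?)\s')
grap_enclosed = re.compile(r'[\(\[].*?[\)\]]')

text = "MDR_TB (multidrug_resistant) is rising. Incidence was high in 2015."
sents = grap_sentence.split(text)                   # two sentences
clean = [grap_enclosed.sub("", s) for s in sents]   # parenthetical removed
```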
common_terms = ["of", "with", "without", "and", "or", "the", "a", "not", "be", "to","this","who","in"]
bigram = gensim.models.Phrases(model_sents, min_count=50, threshold=5, max_vocab_size=50000, common_terms=common_terms)
trigram = gensim.models.Phrases(bigram[model_sents], min_count=50, threshold=5, max_vocab_size=50000, common_terms=common_terms)
docs_phrased = [" ".join(trigram[bigram[doc.split()]]) for doc in docs]
print(str(len(docs_phrased)) +' : '+ docs_phrased[0][:400])
2101 : Culture and Next_generation sequencing_based drug_susceptibility_testing unveil high levels of drug_resistant_TB in Djibouti: results from the first national survey. Djibouti is a small country in the Horn of Africa with a high TB incidence (378/100,000 in 2015). Multidrug_resistant TB (MDR_TB) and resistance to second_line agents have_been previously identified in the country but the extent of th
SpaCy is assigned the task of turning each document into a list of tokens. It is important to filter away non-topical tokens, as they may otherwise dominate the topic modelling. Textacy is not used here, as I prefer to learn the underlying spaCy pipeline.
When adding stopwords to spaCy, only the exact spelling of a word is stopped. It is therefore necessary to include uppercase versions and possibly plural forms of a word to catch all variations. Alternatively, we can keep custom lists outside the nlp object and control in detail how a word is evaluated. Below are versions for exact spelling, lowercase, and lemma.
NB: The keywords that were used to generate the corpus are good candidates for stopwords. Likewise, you may filter out the most frequent terms of a sub-topic if you choose to dig into it.
# EXACT matches via the nlp object. Names, organisations etc.
exact_stop = 'someword'.split()
for w in exact_stop:
    lexeme = nlp.vocab[w]
    lexeme.is_stop = True
# LOWERCASE in list. Safe choice.
lower_stop = ('the to a an background objective').lower().split()
# LEMMA in list. Powerful
topic_stop = 'tb mdr mdr_tb tuberculosis'.split()
subtopic_stop = ''.split('|')
artifact = ['-PRON-','=', '+', 'in']
lemma_stop = [item.strip() for item in (topic_stop + subtopic_stop + artifact)]
Overview of all token attributes: https://github.com/explosion/spaCy/blob/master/spacy/attrs.pyx
docs_tokens, tokens_tmp = [], []
for item in docs_phrased:
    doc = nlp(item)
    for w in doc:
        # Filter away line endings, nlp stopwords, punctuation, numbers, and words in the stop lists
        if not (w.text == '\n' or w.is_stop or w.is_punct or w.like_num or w.lemma_ in lemma_stop or w.text.lower() in lower_stop):
            tokens_tmp.append(w.lemma_)
    docs_tokens.append(tokens_tmp)
    tokens_tmp = []
# check
print(docs_tokens[1])
['repurpos', 'revival', 'drug', 'new', 'approach', 'combat', 'drug_resistant', 'emergence', 'drug_resistant', 'like', 'multi_drug_resistant', 'extensively_drug_resistant', 'xdr_tb', 'totally', 'drug_resistant', 'tdr_tb', 'create', 'new', 'challenge', 'fight', 'bad', 'bug', 'mycobacterium', 'repurposing', 'revival', 'drug', 'new', 'trend', 'option', 'combat', 'worsen', 'situation', 'antibiotic', 'resistance', 'era', 'situation', 'global', 'emergency', 'bactericidal', 'synergistic', 'effect', 'repurpos', 'revive', 'drug', 'late', 'drug', 'bedaquiline', 'delamanid', 'treatment', 'xdr_tb', 'tdr_tb', 'choice', 'future', 'promising', 'combinatorial', 'chemotherapy', 'bad', 'bug']
We use Textacy because it has a nice suite of functions for topic modelling and connects with both termite plots and pyLDAvis.
vectorizer = textacy.Vectorizer(
weighting='tf', normalize=False, smooth_idf=True,
min_df=3, max_df=0.95, max_n_terms=10000)
# Document-Term Matrix
doc_term_matrix = vectorizer.fit_transform(docs_tokens)
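Conceptually, the vectorizer maps each token list to a row of raw term counts ('tf'), dropping terms that occur in fewer than `min_df` documents. A pure-Python sketch of the same idea on toy data (not the textacy API, just an illustration):

```python
from collections import Counter

def toy_doc_term_matrix(docs_tokens, min_df=2):
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in docs_tokens:
        df.update(set(toks))
    # Keep only terms seen in at least `min_df` documents
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for toks in docs_tokens:
        row = [0] * len(vocab)
        for t in toks:
            if t in index:
                row[index[t]] += 1  # raw term frequency
        rows.append(row)
    return vocab, rows

toy_docs = [["drug", "resistant", "tb"], ["drug", "treatment"], ["tb", "treatment"]]
vocab, matrix = toy_doc_term_matrix(toy_docs, min_df=2)
# "resistant" appears in only one document, so it is filtered out
```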
# SET
no_topics=6
model = textacy.tm.TopicModel('lda', n_topics=no_topics) # `n_components=x` is not registered with scikit-learn 0.19.1; `n_topics` still works but triggers the deprecation warning below
model.fit(doc_term_matrix)
/opt/conda/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py:294: DeprecationWarning: n_topics has been renamed to n_components in version 0.19 and will be removed in 0.21
  DeprecationWarning)
# Document-Topic Matrix
doc_topic_matrix = model.transform(doc_term_matrix)
# check
doc_topic_matrix.shape
(2101, 6)
topic_weight_serie = pd.Series(model.topic_weights(doc_topic_matrix))
# convert list of terms to text string
topic_term_list = ["| "+" | ".join(x[1])+" |" for x in model.top_topic_terms(vectorizer.id_to_term, topics=-1)]
# Creating the dataframe from a list of series so the column order is preserved. Dictionaries (including OrderedDicts) move columns around.
series_list_tmp = [ pd.Series(range(len(topic_weight_serie)),name='topic_id'), pd.Series(topic_term_list, name='terms'), topic_weight_serie.rename('weight')]
topic_term_df = pd.concat(series_list_tmp, axis=1)
# Insert a column at a particular position. NB: 'rank' returns floats, which are converted to integers here.
topic_term_df.insert(0, 'rank', topic_term_df['weight'].rank(ascending=False).astype(int))
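For reference, `rank(ascending=False)` assigns rank 1 to the heaviest topic. A stdlib sketch of the same descending ranking, assuming no ties, with illustrative weights:

```python
def rank_descending(weights):
    # Rank 1 = largest weight, mirroring rank(ascending=False); assumes no ties
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    ranks = [0] * len(weights)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

weights = [0.05, 0.23, 0.17, 0.12, 0.04, 0.36]
ranks = rank_descending(weights)
# ranks == [5, 2, 3, 4, 6, 1]  (the last topic is heaviest, so it ranks 1)
```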
fig = plt.figure()
ax1 = fig.add_subplot(111)
bars = ax1.bar(range(len(topic_weight_serie)),topic_weight_serie, color='c', edgecolor='black')
plt.savefig(outfolder + outroot +"topic-weight.png")
number_top_topics = 4 # max 6
top_list_tmp = topic_weight_serie.nlargest(n=number_top_topics)
top_topic_list = list(top_list_tmp.index)
print(top_topic_list)
[5, 1, 2, 3]
`draw_termite_plot` is also available but not needed so far: https://www.pydoc.io/pypi/textacy-0.5.0/autoapi/viz/termite/index.html#module-viz.termite
NB: `termite_plot` saves an image with the option `save='filename.png'`.
termite_file = outfolder + outroot +"termite.png"
# Prepare pyLDAvis
import pyLDAvis
pyLDAvis.enable_notebook()
top_term_matrix = model.model.components_
doc_lengths = [len(d) for d in docs_tokens]
vocab = list(vectorizer.id_to_term.values())
term_frequency = textacy.vsm.get_term_freqs(doc_term_matrix)
vis_data = pyLDAvis.prepare(top_term_matrix,doc_topic_matrix,doc_lengths,vocab,term_frequency)
A number of visualizations are gathered in the following to give a comprehensive overview of the resulting topic model.
for index, row in topic_term_df.iterrows():
    print('%02d' % (row['weight']*100) + '%', '#' + '%02d' % row['rank'], "@" + ('%02d' % index), row['terms'])
05% #05 @00 | isolate | strain | cluster | beijing | pza | transmission | linezolid | patient | genotype | mycobacterium |
23% #02 @01 | isolate | resistance | drug | mutation | strain | mycobacterium | m. | study | resistant | gene |
17% #03 @02 | case | control | new | drug | child | disease | multidrug_resistant | infection | treatment | country |
12% #04 @03 | resistance | assay | detection | result | method | dst | test | rapid | rif | culture |
04% #06 @04 | patient | response | cell | pulmonary | therapy | level | lung | sputum | culture | blood |
36% #01 @05 | patient | treatment | drug | case | outcome | study | multidrug_resistant | resistance | result | regimen |
grid = model.termite_plot(doc_term_matrix, vectorizer.id_to_term, highlight_topics=top_topic_list,
topics=-1, n_terms=30, sort_terms_by='seriation', save=termite_file)
pyLDAvis.display(vis_data)
for index, row in topic_term_df.sort_values(by='rank').iterrows():
    print('%02d' % (row['weight']*100) + '%', '@' + ('%02d' % index), '#' + '%02d' % row['rank'], row['terms'])
36% @05 #01 | patient | treatment | drug | case | outcome | study | multidrug_resistant | resistance | result | regimen |
23% @01 #02 | isolate | resistance | drug | mutation | strain | mycobacterium | m. | study | resistant | gene |
17% @02 #03 | case | control | new | drug | child | disease | multidrug_resistant | infection | treatment | country |
12% @03 #04 | resistance | assay | detection | result | method | dst | test | rapid | rif | culture |
05% @00 #05 | isolate | strain | cluster | beijing | pza | transmission | linezolid | patient | genotype | mycobacterium |
04% @04 #06 | patient | response | cell | pulmonary | therapy | level | lung | sputum | culture | blood |
NB: Termite plot is saved every time it is generated.
NB: Textacy has functions to save and load the trained model (`model.save` and `textacy.tm.TopicModel.load`).
# Export topic_term_df
topic_term_df.to_csv(outfolder + outroot +'topic-term-df.tsv', encoding='UTF-8', header=True, index=False, sep='\t')
# pyLDAvis visualization
pyldavis_file = outfolder + outroot +"pyldavis.html"
pyLDAvis.save_html(vis_data,pyldavis_file)
Merging the docs id with the generated doc_topic_matrix makes it possible to search the content. These are standard functions in Textacy, but we might as well practice our skills in manipulating pandas.
Get topics for a uid with `doc_topic_df.loc['10.1038_emi.2017.83'].nlargest(n=3)`
Load again with `loaded_df = pd.read_csv(open(r'...doc-topic-weight.tsv', encoding='UTF-8'), sep='\t', index_col=0)`
# Write Document-Topic Matrix
doc_topic_df = pd.DataFrame(data=doc_topic_matrix, # values
index=data.index, # 1st column as index
columns=list(range(doc_topic_matrix.shape[1]))) # 1st row as the column names
doc_topic_df.to_csv(outfolder + outroot +'doc-topic-weight.tsv', encoding='UTF-8', header=True, index=True, sep='\t')
# Check df
doc_topic_df.iloc[:2,:5]
| pub-uid | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 10.1038_s41598_017_17705_3 | 0.072527 | 0.098516 | 0.166856 | 0.355275 | 0.001497 |
| 10.3389_fmicb.2017.02452 | 0.061662 | 0.328028 | 0.600453 | 0.003277 | 0.003282 |
Comments: `model.top_doc_topics` generates a row number and a tuple containing pairs of topic:weight. The row number is used to look up the 'uid'. The tuple is made into a list of lists, and the values are converted from numpy to native Python objects.
generator = model.top_doc_topics(doc_topic_matrix, docs=-1, top_n=3, weights=True)
doc_topic_top3 = [[data.index[doc_idx],[[x.item(),round(y.item(),2)] for x,y in topics]] for doc_idx, topics in generator]
with open(outfolder + outroot + 'doc-topic-top3.json', 'w') as outfile:
    json.dump(doc_topic_top3, outfile)
doc_topic_top3[:2]
[['10.1038_s41598_017_17705_3', [[3, 0.36], [5, 0.31], [2, 0.17]]], ['10.3389_fmicb.2017.02452', [[2, 0.6], [1, 0.33], [0, 0.06]]]]
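The exported JSON can be read back with the stdlib alone; a small round-trip sketch with a made-up uid:

```python
import json

# Shape of each entry: [uid, [[topic_id, weight], ...]] (uid here is invented)
entry = ["10.1000_example.1", [[3, 0.36], [5, 0.31], [2, 0.17]]]
blob = json.dumps([entry])
restored = json.loads(blob)

uid, topics = restored[0]
top_topic, top_weight = topics[0]  # the highest-weighted topic comes first
```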
topic_term_list = list(model.top_topic_terms(vectorizer.id_to_term, topics=-1, top_n=10, weights=True))
with open(outfolder + outroot + 'topic-term-weight.json', 'w') as outfile:
    json.dump(topic_term_list, outfile)
# Topic Aggregated Weight (list with one number per topic)
with open(outfolder + outroot + 'topic-aggregated-weight.tsv', 'w') as f:
    f.write("\n".join([str(x) for x in topic_weight_serie]))
No satisfactory packages exist for making word clouds in Python from a topic-term frequency list. Below, a text file is generated with the data needed for creating word clouds at https://worditout.com/word-cloud/create. I recommend the following settings: font: sans-serif, colours: #ff8000 - #40bfbf, background: #242424, colour blending: rainbow, vary word colour: frequency, aspect ratio: 16/9, differences: big, vary word size: frequency.
with open(str(outfolder + outroot + 'wordcloud.txt'), 'w') as f:
    for topic_tmp in topic_term_list:
        f.write("==== #" + str(topic_tmp[0]) + "\n")
        for x in topic_tmp[1]:
            (a, b) = x
            f.write(str(a) + ":" + str(int(b)) + "\n")
cutoff = 15 # How many top articles should be displayed for each topic
top_overview = [] # list of topics, each containing a list of top articles
for topic in range(no_topics):
    top_overview.append([topic, [[x, '%.2f' % y] for (x, y) in doc_topic_df[topic].nlargest(n=cutoff).iteritems()]])
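`nlargest` on a pandas column behaves like stdlib `heapq.nlargest` over (uid, weight) pairs; a toy sketch with invented uids and weights:

```python
import heapq

# Hypothetical per-document weights for one topic
doc_weights = {"uid_a": 0.91, "uid_b": 0.12, "uid_c": 0.55}
top2 = heapq.nlargest(2, doc_weights.items(), key=lambda kv: kv[1])
# top2 == [("uid_a", 0.91), ("uid_c", 0.55)]
```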
with open(str(outfolder + outroot + 'doc-top.html'), 'w') as f:
    f.write('<html>\n<head><title>Top Documents</title></head><body style="font-family:verdana;">')
    for topic in top_overview:
        # Print top topic titles
        f.write('<h1>Topic number %s</h1>' % (topic[0]))
        for item in topic[1]:
            # Look up title, DOI and weight
            uid = item[0]
            url = uid.replace('_', '/')
            weight = item[1]
            # Find the row; .loc returns it as a Series with the column names as index.
            row = data.loc[uid]
            # Output title, doi, weight.
            f.write('<p><b>%s</b> : <a href="http://dx.doi.org/%s">%s</a> | %s</p>' % (row['title'], url, uid, weight))
    # print top topic abstracts
    f.write('</body></html>')
When a main theme has been explored, a dataframe may be exported containing the combined data and `doc_topic_df`, making it possible to drill into the individual topics and explore sub-topics.
# Execute only for main
sub_topics = top_topic_list
data_topic = pd.concat([data,doc_topic_df], axis=1)
data_topic.head()
| pub-uid | title | abstract | keywords | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|
| 10.1038_s41598_017_17705_3 | Culture and Next-generation sequencing-based d... | Djibouti is a small country in the Horn of Afr... | NaN | 0.072527 | 0.098516 | 0.166856 | 0.355275 | 0.001497 | 0.305329 |
| 10.3389_fmicb.2017.02452 | Repurposing and Revival of the Drugs: A New Ap... | Emergence of drug resistant tuberculosis like ... | drug resistance tuberculosis; repurposing; rev... | 0.061662 | 0.328028 | 0.600453 | 0.003277 | 0.003282 | 0.003298 |
| 10.1007_s15010_017_1054_8 | Outcomes of multidrug-resistant tuberculosis i... | The purpose of this study was to establish a b... | Zambia; Tuberculosis; MDR-TB; Drug resistance;... | 0.001741 | 0.001742 | 0.001745 | 0.001759 | 0.001741 | 0.991272 |
| 10.1016_j.jgar.2017.07.002 | Extensively drug-resistant tuberculosis (XDR-T... | Objectives: Extensively drug-resistant tubercu... | Mycobacterium tuberculosis; MDR-TB; XDR-TB; Ge... | 0.001449 | 0.619068 | 0.188984 | 0.187598 | 0.001444 | 0.001456 |
| 10.1016_j.ijid.2017.09.019 | Trends and characteristics of drug-resistant t... | Objectives: The aim of this study was to descr... | Multidrug-resistant tuberculosis; Primary tran... | 0.001476 | 0.001476 | 0.293617 | 0.001471 | 0.001468 | 0.700492 |
data_topic.to_csv(outfolder + outroot +'data-topic-df.tsv', encoding='UTF-8', header=True, index=True, sep='\t')