This notebook uses topic modeling to analyze the EGU conference through the abstracts submitted between 2011 and 2018. It can be used to detect trends and visualize topics across those years, filtered by one or more categories. The abstracts were parsed into plain text and ingested into Apache Solr.
We parsed the PDFs using PDFMiner's pdf2txt utility and ingested the resulting text files into Solr:
ls *.pdf | xargs -n1 -P8 bash -c 'pdf2txt.py -o output/$0.txt -t text $0'
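The text files were then posted to Solr. A minimal ingestion sketch, assuming a local Solr instance with a core named egu (the actual pipeline also extracted entities, sessions, and bigrams):
import pathlib
import requests

# Solr's JSON update handler accepts a list of documents in a single POST
solr_update_url = 'http://localhost:8983/solr/egu/update?commit=true'

docs = []
for txt in pathlib.Path('output').glob('*.txt'):
    docs.append({
        'id': txt.stem,  # e.g. "EGU2018-9778"
        'abstract': [txt.read_text(errors='ignore')],
    })

response = requests.post(solr_update_url, json=docs)
response.raise_for_status()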
Example Document in Solr
doc = {
    "entities": [
        "Jeffrey Obelcz and Warren T. Wood",
        "NRC Postdoctoral Fellow",
        "Naval Research Lab",
        "Seafloor Sciences",
        "United States jbobelcz@gmail.com",
        "Naval",
        "Research Lab",
        "Seafloor Sciences",
        "United States"],
    "id": "EGU2018-9778",
    "sessions": ["ESSI4.3"],
    "file": ["EGU2018-9778"],
    "presentation": ["Posters"],
    "year": [2018],
    "title": ["Towards a Quantitative Understanding of Parameters Driving Submarine Slope Failure: A Machine Learning Approach"],
    "category": ["ESSI"],
    "abstract": ["Submarine slope failure is a ubiquitous process and dominant pathway for sediment and organic carbon flux from continental margins to the deep sea. Slope failure occurs over a wide range of temporal and spatial scales ..."]
}
This notebook can be used to analyze what a corpus of scientific text talks about; here we used EGU abstracts, but it works with any text corpus.
The topics are examined using LDAvis, which displays the topics on an X-Y plot (the intertopic distance map). Topics are represented by circles whose areas are proportional to their relative prevalence in the corpus. In the display the user can enter a topic number (note the 1-based numbering here vs. gensim's 0-based numbering in the Topic List cells below); the topic's terms are displayed on the right, ranked by significance (weight). Hovering over a circle previews that topic on the fly; clicking selects it. A user can also click on a term in the right-hand panel to show the topics in which that term occurs.
The slider at the top of the right-hand panel allows the user to vary the "saliency", i.e. the uniqueness of the terms to the selected topic. A value of 0.6 is optimal, according to the authors of the algorithm. Blue bars represent overall term frequency, while red bars show term frequency within the topic; the two differ when the saliency is set to less than one.
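For reference, the quantity this slider varies is called relevance ($\lambda$) in the LDAvis paper (Sievert & Shirley, 2014):

$$ r(w, k \mid \lambda) = \lambda \log \phi_{kw} + (1 - \lambda) \log \frac{\phi_{kw}}{p_w} $$

where $\phi_{kw}$ is the probability of term $w$ in topic $k$ and $p_w$ is the term's overall probability in the corpus.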
Please see the annotated image of the LDAVis display.
The final cell of this notebook lists the titles and IDs of abstracts belonging to a specified topic (a minimal sketch of that cell appears at the end of this section).
Atmospheric Sciences (AS) - Biogeosciences (BG) - Climate: Past, Present, Future (CL) - Cryospheric Sciences (CR) - Earth Magnetism & Rock Physics (EMRP) - Energy, Resources and the Environment (ERE) - Earth & Space Science Informatics (ESSI) - Geodesy (G) - Geodynamics (GD) - Geosciences Instrumentation & Data Systems (GI) - Geomorphology (GM) - Geochemistry, Mineralogy, Petrology & Volcanology (GMPV) - Hydrological Sciences (HS) - Natural Hazards (NH) - Nonlinear Processes in Geosciences (NP) - Ocean Sciences (OS) - Planetary & Solar System Sciences (PS) - Seismology (SM) - Stratigraphy, Sedimentology & Palaeontology (SSP) - Soil System Sciences (SSS) - Solar-Terrestrial Sciences (ST) - Tectonics & Structural Geology (TS)
Union Symposia (US) - Great Debates (GDB) - Medal Lectures (ML) - Short courses (SC) - Educational and Outreach Symposia (EOS) - EGU Plenary, Ceremonies and Networking (PCN) - Feedback and administrative meetings (FAM) - Townhall and splinter meetings (TSM) - Side events (SEV) - Press conferences (PC)
# Cell 1: Import requirements
import urllib.request
import json
import string
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# NLP libraries
import spacy
from gensim import corpora, models
from gensim.models.ldamulticore import LdaMulticore, LdaModel
# Cell 2: Querying Solr and building our corpus out of the matching documents
# terms = ['ice', 'climate'] to include only abstracts with specified terms
terms = ['*']
years = ['2018']
entities = ['*']
sessions = ['*']
# Return "page_size" documents with each Solr query until complete
page_size = 5000
cursorMark = '*'
solr_documents = []
solr_root = 'http://qa.pdi-solr.apps.nsidc.org/solr/egu/select?indent=on&'
more_results = True
if terms[0] != '*':
    terms_wildcard = ['*' + t + '*' for t in terms]
else:
    terms_wildcard = ['*']

if sessions[0] != '*':
    sessions_wildcard = ['*' + s + '*' for s in sessions]
else:
    sessions_wildcard = ['*']

if entities[0] != '*':
    entities_wildcard = ['*' + e + '*' for e in entities]
else:
    entities_wildcard = ['*']

terms_query = '%20OR%20abstract:'.join(terms_wildcard)
years_query = '%20OR%20year:'.join(years)
entities_query = '%20OR%20entities:'.join(entities_wildcard)
sessions_query = '%20OR%20sessions:'.join(sessions_wildcard)

query_string = 'q=(abstract:{})%20AND%20(year:{})' + \
    '%20AND%20(entities:{})%20AND%20(sessions:{})&wt=json&rows={}&cursorMark={}&sort=id+asc'
while more_results:
    solr_query = query_string.format(terms_query,
                                     years_query,
                                     entities_query,
                                     sessions_query,
                                     page_size,
                                     cursorMark)
    solr_url = solr_root + solr_query
    print('Querying: \n' + solr_url)
    req = urllib.request.Request(solr_url)
    # Parse the JSON response and accumulate the matching documents
    r = urllib.request.urlopen(req).read()
    json_response = json.loads(r.decode('utf-8'))
    solr_documents.extend(json_response['response']['docs'])
    # Solr deep paging: when the cursor stops advancing, all results have been fetched
    nextCursorMark = json_response['nextCursorMark']
    if nextCursorMark == cursorMark:
        more_results = False
        break
    cursorMark = nextCursorMark

total_found = json_response['response']['numFound']
print("Processing {0} out of {1} total. \n".format(len(solr_documents), total_found))
# Cell 3: Remove stop words and create an array of documents
my_stop_words = {'et_al', 'change', 'different'}

def remove_stop_words(text):
    # Drop custom stop words and very short tokens
    cleaned_text = [w for w in text if w not in my_stop_words]
    cleaned_text = [w for w in cleaned_text if len(w) > 2]
    return cleaned_text
document_list = []
# bigram corpus will contain an array of documents and their tokens, with bigram tokens included
bigram_corpus = []
for doc in solr_documents:
    bigrams = remove_stop_words(doc['bigrams'][0].split())
    if 'sessions' in doc:
        sessions = doc['sessions'][0]
    else:
        sessions = 'NAN'
    if 'category' in doc:
        category = doc['category'][0]
    else:
        category = 'NAN'
    document_list.append({'id': doc['id'],
                          'text': bigrams,
                          'year': str(doc['year'][0]),
                          'title': doc['title'][0],
                          'category': category.replace('<', ''),
                          'sessions': sessions})
    bigram_corpus.append(bigrams)
df = pd.DataFrame.from_dict(document_list)
axis_category = pd.DataFrame(df.groupby(['category', 'year'])['category'].count()).rename(columns={'category': 'count'})
print(axis_category)
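As a quick check, the grouped counts can also be plotted; a small sketch, not part of the original notebook:
# Pivot years into columns and plot abstract counts per category
axis_category['count'].unstack('year').plot(kind='bar', figsize=(14, 6))
plt.ylabel('number of abstracts')
plt.show()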
# Cell 4: Using GENSIM to do topic modelling
# num_passes should be adjusted; 2 is just a guesstimate of when convergence will be achieved.
num_passes = 2
num_topics = 20
words_per_topic = 9
print_topics = False
dictionary = corpora.Dictionary(bigram_corpus)
lda_corpus = [dictionary.doc2bow(text) for text in bigram_corpus]
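# doc2bow turns each tokenized document into sparse (token_id, count) pairs, e.g.
# dictionary.doc2bow(['ice', 'sheet', 'ice']) -> [(ice_id, 2), (sheet_id, 1)]  (ids depend on the dictionary)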
lda_model = LdaMulticore(lda_corpus,
                         num_topics=num_topics,
                         id2word=dictionary,
                         passes=num_passes,
                         workers=2)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
if print_topics:
    print("Topic List: \n")
    for topic in topics:
        # Shift to 1-based numbering to match the LDAvis display
        t = str(int(topic[0]) + 1)
        print('Topic ' + t + ': ', topic[1:])
import warnings
warnings.filterwarnings('ignore')
import pyLDAvis.gensim
print ("\nPyLDAVis: \n")
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(corpus=lda_corpus,
topic_model=lda_model,
dictionary=dictionary,
sort_topics=False)
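The interactive view can also be exported as a standalone HTML page; a minimal sketch using pyLDAvis.save_html (the prepare call is repeated here so the snippet stands alone):
vis = pyLDAvis.gensim.prepare(corpus=lda_corpus,
                              topic_model=lda_model,
                              dictionary=dictionary,
                              sort_topics=False)
pyLDAvis.save_html(vis, 'egu_topics.html')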
# Cell 5: Listing papers containing a particular n-gram
# use unigrams or bigrams
terms = set(['carbon_cycle'])
top_n = 10
def createLink(doc):
    # Abstract PDFs are hosted on the EGU meeting organizer site
    return 'https://meetingorganizer.copernicus.org/EGU' + str(doc['year']) + '/' + doc['id'] + '.pdf'
# A list comprehension keeps only the documents that contain every term
matches = [doc for doc in document_list if terms.issubset(doc['text'])]
from IPython.core.display import display, HTML
# Display links to the first top_n matching documents
for doc in matches[0:top_n]:
    display(HTML('<br>Abstract <a href="{}" target="_blank">{}</a> '.format(
        createLink(doc),
        doc['id'])))
# Cell 6: Classifying an unseen document using our gensim model
# For practical purposes we use a mocked-up document, but we could easily query Solr or another store for the content we want to classify
# Eventually all this could be served as a web service
# Taken from https://meetingorganizer.copernicus.org/EGU2018/EGU2014-2415.pdf
unseen_document = """
Waves in the Southern Ocean are the largest in the planet. In the Southern Hemisphere, the absence of large
landmasses at high latitudes allows the wind to feed energy into the ocean over a virtually unlimited fetch. The
enormous amount of air-sea momentum exchanged over the Southern Ocean plays a substantial role in the global
climate. However, large biases affect the estimation of wave regime around the Antarctic continent making climate
prediction susceptible to uncertainty.
"""
# Apply the same minimal cleaning used for the training corpus
parsed_doc = remove_stop_words(unseen_document.lower().split())
vec = dictionary.doc2bow(parsed_doc)
predicted_topics = lda_model[vec]
# Shift to 1-based topic numbering to match the LDAvis display
predicted_topics = [(p[0] + 1, p[1]) for p in predicted_topics]
print(predicted_topics)
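To surface the single most likely topic for the unseen abstract (a small addition, not in the original cell):
# The entry with the highest probability is the dominant topic
dominant = max(predicted_topics, key=lambda p: p[1])
print('Dominant topic: {} (p={:.2f})'.format(dominant[0], dominant[1]))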
# Cell 7: Plotting model coherence; this takes some time depending on iterations and the model used.
# lda_model.log_perplexity(lda_corpus)
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim.models import CoherenceModel
initial_topics = 4
max_topics = 50
def compute_coherence_values(dictionary, corpus, texts, limit=10, start=2, step=2):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of tokenized input texts
    limit : Max number of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : c_v coherence of the LDA model with the respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        print("Processing {0} topics \n".format(num_topics))
        # model = LdaModel(corpus=corpus,
        #                  id2word=dictionary,
        #                  num_topics=num_topics,
        #                  random_state=100,
        #                  update_every=0,
        #                  chunksize=100000,
        #                  passes=1,
        #                  alpha='auto',
        #                  per_word_topics=False)
        model = LdaMulticore(corpus=corpus,
                             num_topics=num_topics,
                             id2word=dictionary,
                             passes=5,
                             workers=8)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, corpus=corpus, texts=texts, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
model_list, coherence_values = compute_coherence_values(dictionary=dictionary,
                                                        corpus=lda_corpus,
                                                        texts=bigram_corpus,
                                                        start=initial_topics,
                                                        limit=max_topics,
                                                        step=2)
limit = max_topics; start = initial_topics; step = 2
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(('coherence_values',), loc='best')
plt.show()
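The final cell described in the introduction, which lists the titles and IDs of abstracts belonging to a specified topic, is not reproduced above; a minimal sketch, assuming the models and corpus built earlier (topic_number is a hypothetical choice):
# Pick the model with the best coherence score, then list abstracts whose dominant topic matches
best_model = model_list[coherence_values.index(max(coherence_values))]
topic_number = 3  # hypothetical topic of interest, 1-based to match LDAvis

for doc, bow in zip(document_list, lda_corpus):
    doc_topics = best_model.get_document_topics(bow)
    dominant = max(doc_topics, key=lambda t: t[1])[0] + 1  # back to 1-based numbering
    if dominant == topic_number:
        print(doc['id'], '-', doc['title'])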