"Topic discovery in scientific articles with Python"

"In this article we apply latent Dirichlet allocation (LDA) to discover topic clusters in academic papers."

  • categories: [python, scikit-learn, nlp]

PLOS Biology Topics

Ever wonder what topics are discussed in PLOS Biology articles? Here I apply an implementation of Latent Dirichlet Allocation (LDA) to a set of 1,754 PLOS Biology articles to work out a plausible collection of underlying topics.

I first read about LDA in Building Machine Learning Systems with Python co-authored by Luis Coelho.

LDA seems to have been first described by Blei et al., and I will use the implementation provided by gensim, which was written by Radim Řehůřek.
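
For orientation, here is a brief sketch of the generative model that LDA assumes. The notation is the conventional textbook one and is an assumption of this summary, not something taken from the gensim code: there are K topics and D documents, each topic k has a word distribution φ_k, each document d has a topic mixture θ_d, and α and β are Dirichlet hyperparameters.

```latex
\begin{aligned}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),  & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Here w_{d,n} is the n-th token of document d and z_{d,n} is the hidden topic it was drawn from. Fitting the model means inferring the φ_k (the topics printed further below) and the θ_d from the observed tokens alone.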

In [ ]:
import gensim
In [ ]:
import plospy
import os
In [ ]:
import nltk
In [ ]:
import pickle  # cPickle exists only in Python 2; in Python 3 its fast implementation is part of pickle

With the following lines of code we open, parse, and tokenize all 1,754 PLOS Biology articles in our collection.

As this takes a bit of time and memory, I carried out all of these steps once and stored the resulting data structures on my hard disk for later reuse (see further below).
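
This compute-once, reload-later pattern can be wrapped in a small helper. The following is only a sketch and not part of the original notebook; the name `load_or_compute` and its arguments are hypothetical.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the object pickled at `path`, computing and caching it if absent.

    `compute` is any zero-argument callable that builds the object.
    """
    if os.path.exists(path):
        with open(path, 'rb') as f:  # pickle files are binary
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

With such a helper, a slow step would run only when its cache file is missing, for example `load_or_compute('plos_biology_articles_tokenized.list', tokenize_all)`, where `tokenize_all` stands for a hypothetical function that wraps the tokenization cells below.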

In [ ]:
all_names = [name for name in os.listdir('../plos/plos_biology/plos_biology_data') if '.dat' in name]
In [ ]:
article_bodies = []

for name in all_names:
    docs = plospy.PlosXml(os.path.join('../plos/plos_biology/plos_biology_data', name))
    for article in docs.docs:
        article_bodies.append(article['body'])

We have 1,754 PLOS Biology articles in our collection:

In [ ]:
len(article_bodies)
In [ ]:
punkt_param = nltk.tokenize.punkt.PunktParameters()
punkt_param.abbrev_types = set(['et al', 'i.e', 'e.g', 'ref', 'c.f',
                                'fig', 'Fig', 'Eq', 'eq', 'eqn', 'Eqn',
                                'dr', 'Dr'])
sentence_splitter = nltk.tokenize.punkt.PunktSentenceTokenizer(punkt_param)
In [ ]:
sentences = []
for body in article_bodies:
    sentences.append(sentence_splitter.tokenize(body))
In [ ]:
articles = []
for body in sentences:
    this_article = []
    for sentence in body:
        this_article.append(nltk.tokenize.word_tokenize(sentence))
    articles.append(this_article)
In [ ]:
with open('plos_biology_articles_tokenized.list', 'wb') as f:  # pickle needs binary mode
    pickle.dump(articles, f)
In [ ]:
with open('plos_biology_articles_tokenized.list', 'rb') as f:
    articles = pickle.load(f)
In [ ]:
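stopwords = set(nltk.corpus.stopwords.words('english'))  # build the set once, not once per token
is_stopword = lambda w: len(w) < 4 or w.lower() in stopwords  # the NLTK list is lowercase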

Flatten each article into a single list of lowercase tokens and filter out stopwords:

In [ ]:
articles_unfurled = []
for article in articles:
    this_article = []
    for sentence in article:
        # keep tokens as text; gensim and NLTK accept them without byte-encoding
        this_article.extend(token.lower() for token in sentence if not is_stopword(token))
    articles_unfurled.append(this_article)
In [ ]:
with open('plos_biology_articles_unfurled.list', 'wb') as f:  # pickle needs binary mode
    pickle.dump(articles_unfurled, f)
In [ ]:
with open('plos_biology_articles_unfurled.list', 'rb') as f:
    articles_unfurled = pickle.load(f)
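
To make the flattening and filtering step concrete, here is a minimal, self-contained sketch on a toy article. The function name `flatten_and_filter` and the tiny `STOPWORDS` set are hypothetical stand-ins; the real run uses NLTK's English stopword list. Note that this sketch lowercases each token before the stopword test.

```python
# A tiny stand-in for nltk.corpus.stopwords.words('english'); illustration only.
STOPWORDS = {'this', 'that', 'with', 'from', 'were', 'have'}


def flatten_and_filter(article, stopwords=STOPWORDS, min_length=4):
    """Flatten a list of tokenized sentences into one list of lowercase tokens,
    dropping stopwords and tokens shorter than `min_length` characters."""
    tokens = []
    for sentence in article:
        for token in sentence:
            token = token.lower()
            if len(token) >= min_length and token not in stopwords:
                tokens.append(token)
    return tokens


article = [['This', 'gene', 'regulates', 'cell', 'growth', '.'],
           ['Cells', 'were', 'imaged', 'with', 'GFP', '.']]
print(flatten_and_filter(article))
# ['gene', 'regulates', 'cell', 'growth', 'cells', 'imaged']
```

The length threshold also removes punctuation tokens and short abbreviations such as "GFP", which is worth keeping in mind when reading the topics later on.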

Dictionary and Corpus Creation

Create a dictionary of all words (tokens) that appear in our collection of PLOS Biology articles and create a bag of words object for each article (doc2bow).

In [ ]:
dictionary = gensim.corpora.Dictionary(articles_unfurled)
In [ ]:
dictionary.save('plos_biology.dict')
In [ ]:
dictionary = gensim.corpora.Dictionary.load('plos_biology.dict')
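
What a dictionary and a bag-of-words representation amount to can be sketched in a few lines of plain Python. This is an illustration of the idea only, not gensim's implementation; `build_token_ids` and `to_bow` are hypothetical names, and gensim may assign different ids to the same tokens.

```python
from collections import Counter


def build_token_ids(documents):
    """Assign a consecutive integer id to each distinct token, in first-seen order."""
    token2id = {}
    for document in documents:
        for token in document:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(document, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in `document`."""
    counts = Counter(token2id[t] for t in document if t in token2id)
    return sorted(counts.items())


docs = [['gene', 'cell', 'gene'], ['cell', 'protein']]
token2id = build_token_ids(docs)
print(token2id)                   # {'gene': 0, 'cell': 1, 'protein': 2}
print(to_bow(docs[0], token2id))  # [(0, 2), (1, 1)]
```

Word order is discarded: each article is reduced to how often each known token occurs, which is all that LDA looks at.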

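I noticed that the word "figure" occurs rather frequently in these articles, so let us exclude it along with any other word that appears in more than half of the articles in this data set (thanks to Radim for pointing this out to me). With its default arguments, filter_extremes() also drops tokens that appear in fewer than five articles and keeps at most the 100,000 most frequent of the remaining tokens.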

In [ ]:
dictionary.filter_extremes(no_below=5, no_above=0.5)  # gensim's default thresholds, spelled out
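
The filtering rule itself is a simple document-frequency test. The sketch below mirrors the idea in plain Python; it is not gensim's code, the name `surviving_tokens` is hypothetical, and the cap on vocabulary size (`keep_n` in gensim) is left out.

```python
from collections import Counter


def surviving_tokens(documents, no_below=5, no_above=0.5):
    """Return tokens whose document frequency is at least `no_below` documents
    and at most the fraction `no_above` of all documents."""
    doc_freq = Counter()
    for document in documents:
        doc_freq.update(set(document))  # count each token once per document
    max_docs = no_above * len(documents)
    return {token for token, df in doc_freq.items()
            if no_below <= df <= max_docs}


docs = [['figure', 'gene'], ['figure', 'cell'],
        ['figure', 'gene'], ['figure', 'rare']]
print(sorted(surviving_tokens(docs, no_below=2, no_above=0.5)))  # ['gene']
```

In this toy corpus "figure" is dropped because it appears in every document, while "cell" and "rare" are dropped because they appear in only one.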
In [ ]:
corpus = [dictionary.doc2bow(article) for article in articles_unfurled]
In [ ]:
gensim.corpora.MmCorpus.serialize('plos_biology_corpus.mm', corpus)
In [ ]:
corpus = gensim.corpora.MmCorpus('plos_biology_corpus.mm')
In [ ]:
model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, update_every=1, chunksize=100, passes=2, num_topics=20)
In [ ]:
model.save('plos_biology.lda_model')
In [ ]:
model = gensim.models.ldamodel.LdaModel.load('plos_biology.lda_model')

And these are the twenty topics we find in 1,754 PLOS Biology articles:

In [ ]:
# Recent gensim versions return (topic_id, description) pairs from print_topics().
for topic_id, topic in model.print_topics(20):
    print('topic # %d: %s\n' % (topic_id + 1, topic))

Topics with Lemmatized Tokens

As we can see, some of the tokens in the topics above are just singular and plural forms of the same word.

Let us see what topics we find after lemmatizing all of our tokens.

In [ ]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()  # lemmatize() treats each token as a noun unless a pos argument is given
articles_lemmatized = []
for article in articles_unfurled:
    articles_lemmatized.append([wnl.lemmatize(token) for token in article])
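
A quick way to see how much a normalization step such as lemmatization shrinks the vocabulary is to count distinct tokens before and after. The sketch below is an illustration under stated assumptions: `vocabulary_sizes` is a hypothetical helper, and `strip_plural_s` is a deliberately crude toy stand-in for a lemmatizer, not what WordNetLemmatizer does.

```python
def vocabulary_sizes(documents, normalize):
    """Return (distinct tokens before, distinct tokens after) applying `normalize`."""
    before = {token for document in documents for token in document}
    after = {normalize(token) for token in before}
    return len(before), len(after)


def strip_plural_s(token):
    """Toy stand-in for a lemmatizer: drop one trailing 's'. Illustration only."""
    return token[:-1] if token.endswith('s') and len(token) > 3 else token


docs = [['gene', 'genes', 'cell'], ['cells', 'protein']]
print(vocabulary_sizes(docs, strip_plural_s))  # (5, 3)
```

On the real data one would pass `wnl.lemmatize` as `normalize` and `articles_unfurled` as `documents`.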
In [ ]:
with open('plos_biology_articles_lemmatized.list', 'wb') as f:  # pickle needs binary mode
    pickle.dump(articles_lemmatized, f)
In [ ]:
dictionary_lemmatized = gensim.corpora.Dictionary(articles_lemmatized)
In [ ]:
dictionary_lemmatized.save('plos_biology_lemmatized.dict')
In [ ]:
dictionary_lemmatized.filter_extremes()
In [ ]:
corpus_lemmatized = [dictionary_lemmatized.doc2bow(article) for article in articles_lemmatized]
In [ ]:
gensim.corpora.MmCorpus.serialize('plos_biology_corpus_lemmatized.mm', corpus_lemmatized)
In [ ]:
model_lemmatized = gensim.models.ldamodel.LdaModel(corpus_lemmatized, id2word=dictionary_lemmatized, update_every=1, chunksize=100, passes=2, num_topics=20)
In [ ]:
# Recent gensim versions return (topic_id, description) pairs from print_topics().
for topic_id, topic in model_lemmatized.print_topics(20):
    print('topic # %d: %s\n' % (topic_id + 1, topic))