For the best viewing experience use nbviewer.
In this notebook, I will use Python and its libraries for topic modeling. In topic modeling, statistical models are used to identify topics or categories in a document or a set of documents. I will use one specific method called Latent Dirichlet Allocation (LDA). Roughly, LDA assumes that each document is a mixture of topics and each topic is a distribution over words; starting from a random assignment of words to topics, it iteratively reassigns each word to a topic in proportion to how prevalent that topic is in the document and how prevalent the word is in that topic, until the assignments stabilize.
This notebook uses the following packages:

- spacy
- nltk
- random
- gensim
- pickle
- pandas
- sklearn
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # see the value of multiple statements at once.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
spaCy

In this project I will use the spaCy library (see this [link](https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb)). spaCy is:

An industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.
import spacy
spacy.load('en')  # note: spaCy 3.x removed the 'en' shortcut; use spacy.load('en_core_web_sm') there
from spacy.lang.en import English
parser = English()
<spacy.lang.en.English at 0x111f27748>
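As a quick sanity check, the parser created above splits a raw string into spaCy tokens. This is a minimal sketch; the sample string is the first article title from the dataset:

```python
from spacy.lang.en import English

parser = English()  # tokenizer-only pipeline, as above

# tokenize one of the article titles into spaCy Token objects
doc = parser("A novel digitally controlled low noise ring oscillator.")
tokens = [token.text for token in doc]
print(tokens)
```

Note that spaCy separates the trailing period into its own token, unlike a naive `str.split`.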
df = pd.read_csv('articles.csv',header=None)
df.columns = ['titles']
df.shape
df.head()
(100, 1)
| | titles |
|---|---|
| 0 | A novel digitally controlled low noise ring os... |
| 1 | A motion compensation system with a high effic... |
| 2 | A reconfigurable MAC architecture implemented ... |
| 3 | Why is 3-D interaction so hard and what can we... |
| 4 | Automatic colorization of grayscale images usi... |
From df I will build a list doc_set containing the row entries:
doc_set = df.values.T.tolist()[0]
print(doc_set[0:10])
['A novel digitally controlled low noise ring oscillator.', 'A motion compensation system with a high efficiency reference frame pre-fetch scheme for QFHD H.264/AVC decoding.', 'A reconfigurable MAC architecture implemented with mixed-Vt standard cell library.', 'Why is 3-D interaction so hard and what can we really do about it?', 'Automatic colorization of grayscale images using multiple images on the web.', 'Automatic Profile Generation in eRACE.', 'A QoS aware multicore hash scheduler for network applications.', 'Incremental Bloom Filters.', 'Hardware Organization for Nonnumeric Processing', 'Retrieval of motion capture data based on short-term feature extraction.']
Before applying natural language processing tools to our problem, I will provide a quick review of some basic procedures using Python. We first import nltk and the necessary classes for lemmatization and stemming:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
We then create objects of the classes PorterStemmer and WordNetLemmatizer:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
To use lemmatization and/or stemming in a given string text we must first tokenize it. The code below matches word characters until it reaches a non-word character, like a space.
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenized_docs = []
for doc in doc_set:
    tokens = tokenizer.tokenize(doc.lower())
    tokenized_docs.append(tokens)
print(tokenized_docs[0:3])
[['a', 'novel', 'digitally', 'controlled', 'low', 'noise', 'ring', 'oscillator'], ['a', 'motion', 'compensation', 'system', 'with', 'a', 'high', 'efficiency', 'reference', 'frame', 'pre', 'fetch', 'scheme', 'for', 'qfhd', 'h', '264', 'avc', 'decoding'], ['a', 'reconfigurable', 'mac', 'architecture', 'implemented', 'with', 'mixed', 'vt', 'standard', 'cell', 'library']]
lemmatized_tokens = []
for lst in tokenized_docs:
    tokens_lemma = [lemmatizer.lemmatize(i) for i in lst]
    lemmatized_tokens.append(tokens_lemma)
print(lemmatized_tokens[0:3])
[['a', 'novel', 'digitally', 'controlled', 'low', 'noise', 'ring', 'oscillator'], ['a', 'motion', 'compensation', 'system', 'with', 'a', 'high', 'efficiency', 'reference', 'frame', 'pre', 'fetch', 'scheme', 'for', 'qfhd', 'h', '264', 'avc', 'decoding'], ['a', 'reconfigurable', 'mac', 'architecture', 'implemented', 'with', 'mixed', 'vt', 'standard', 'cell', 'library']]
Next, I remove English stop words and keep only words longer than n characters:

from stop_words import get_stop_words
en_stop_words = get_stop_words('en')
n = 4
tokens = []
for lst in lemmatized_tokens:
    tokens.append([i for i in lst if i not in en_stop_words and len(i) > n])
print(tokens[0:3])
[['novel', 'digitally', 'controlled', 'noise', 'oscillator'], ['motion', 'compensation', 'system', 'efficiency', 'reference', 'frame', 'fetch', 'scheme', 'decoding'], ['reconfigurable', 'architecture', 'implemented', 'mixed', 'standard', 'library']]
I will now generate an LDA model. For that, we need to know how frequently each term occurs within each document, which is captured by a document-term matrix: for a corpus of $n$ documents and a vocabulary of $m$ words, cell $ij$ counts the frequency of word $j$ in document $i$.
| | word_1 | word_2 | ... | word_m |
|---|---|---|---|---|
| doc_1 | 1 | 3 | ... | 2 |
| doc_2 | 2 | 3 | ... | 3 |
| ... | ... | 2 | ... | 1 |
| doc_n | 1 | 1 | ... | 1 |
What LDA does is convert this matrix into two matrices with lower dimensions. The first is a document-topic matrix:
| | topic_1 | topic_2 | ... | topic_T |
|---|---|---|---|---|
| doc_1 | 0 | 1 | ... | 1 |
| doc_2 | 0 | 1 | ... | 1 |
| ... | ... | ... | ... | 1 |
| doc_n | 1 | 0 | ... | 0 |
and a topic-word matrix:
| | word_1 | word_2 | ... | word_m |
|---|---|---|---|---|
| topic_1 | 1 | 0 | ... | 1 |
| topic_2 | 1 | 0 | ... | 1 |
| ... | ... | ... | ... | 1 |
| topic_T | 1 | 1 | ... | 1 |
from gensim import corpora, models
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(text) for text in tokens]
import pickle
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')
corpus[0]
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
import gensim
ldamodel_3 = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)
ldamodel_3.save('model3.gensim')
ldamodel_4 = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word = dictionary, passes=20)
ldamodel_4.save('model4.gensim')
for el in ldamodel_3.print_topics(num_topics=3, num_words=3):
    print(el,'\n')
(0, '0.017*"system" + 0.017*"image" + 0.017*"based"') (1, '0.035*"network" + 0.015*"multi" + 0.012*"based"') (2, '0.016*"based" + 0.013*"using" + 0.012*"system"')
for el in ldamodel_4.print_topics(num_topics=3, num_words=3):
    print(el,'\n')
(0, '0.028*"based" + 0.023*"system" + 0.015*"database"') (1, '0.015*"search" + 0.012*"multi" + 0.012*"analysis"') (2, '0.025*"network" + 0.016*"image" + 0.013*"using"')
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model3.gensim')
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)