Best viewed via Jupyter nbviewer: https://nbviewer.jupyter.org/github/Josh-Been/LSI-Topic-Model/blob/master/Baylor-Libraries-LSI-Topic-Model.ipynb?flush_cache=true
Generates related keywords from a corpus, bundled together into topic areas. Ten keywords are generated per topic. A single text file is uploaded, and each line is treated as a separate document.
Baylor University Libraries: LSI Topic Model
Implements Latent Semantic Indexing (LSI)
From Wikipedia https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing "Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings."
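To make the SVD idea concrete, here is a minimal sketch. It is not the notebook's or Gensim's implementation: it builds a tiny term-document count matrix and approximates only the leading singular triplet with power iteration, a simple stand-in for a full SVD. The sample documents and the helper names `matvec`, `transpose` and `leading_singular_triplet` are hypothetical.

```python
import math

# Hypothetical one-line documents: three about pets, one about finance.
docs = ["cat dog pet", "dog cat", "pet dog cat animal", "stock market"]
vocab = sorted(set(word for doc in docs for word in doc.split()))

# Term-document count matrix: one row per term, one column per document.
A = [[doc.split().count(term) for doc in docs] for term in vocab]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def leading_singular_triplet(M, iterations=100):
    """Approximate the largest singular value and its vectors by power iteration."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])
    for _ in range(iterations):
        v = matvec(Mt, matvec(M, v))          # v <- (M^T M) v
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    Mv = matvec(M, v)
    sigma = math.sqrt(sum(x * x for x in Mv))
    u = [x / sigma for x in Mv]
    return u, sigma, v

term_vec, sigma, doc_vec = leading_singular_triplet(A)
term_weights = dict(zip(vocab, term_vec))
for term in vocab:
    print('%-7s %.3f' % (term, term_weights[term]))
```

In this toy corpus the strongest "concept" loads on the pet-related terms and documents, while `stock` and `market` receive weights near zero because they never share a document with the other terms.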
This Python application was built by the Baylor University Libraries to help researchers run unsupervised topic modelling on one-line documents, such as Twitter posts.
**First**, ensure the Python 2.7 version of Anaconda is installed on your system. If it is not, download it from https://www.anaconda.com/download/ and install it. Then continue with the next step.
Fourth, download the Jupyter Notebook file https://raw.githubusercontent.com/Josh-Been/LSI-Topic-Model/master/Baylor-Libraries-LSI-Topic-Model.ipynb to your computer. In the Jupyter browser tab that opened in the previous step, click the Upload button and browse for the saved Jupyter Notebook file.
Up to this point you have been reading an HTML version of this Notebook.
Now switch to the interactive version in Jupyter.
For background on Gensim, see https://radimrehurek.com/gensim/
Steps:
(1) Open Anaconda Navigator
(2) On the left-hand menu, click Environments
(3) In the drop-down menu at the top center of the page, select 'All'
(4) In the Search Packages to the right, type gensim
(5) In the result below, check the box to the left of gensim and then click the green Apply button on the bottom right. Click Apply on any popups.
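Once the steps above are done, a quick check along these lines can confirm that the package is importable from the notebook's environment. This is only a sketch; the variable names `gensim_available` and `gensim_version` are illustrative.

```python
# Check whether gensim can be imported in the active environment.
try:
    import gensim
    gensim_available = True
    gensim_version = getattr(gensim, '__version__', 'unknown')
except ImportError:
    gensim_available = False
    gensim_version = None

if gensim_available:
    print('gensim %s is installed' % gensim_version)
else:
    print('gensim is not installed in this environment')
```

If the second message appears, repeat the installation steps and make sure Jupyter was launched from the same environment.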
Sixth, browse for the text file containing lines of documents. Put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.
import warnings, string
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim import corpora, models
from Tkinter import *
from tkFileDialog import askopenfilename
def strip_non_ascii(cleanme):
    # Remove punctuation, then drop any character outside the ASCII range 1-126.
    stripped = cleanme.translate(None, string.punctuation)
    stripped = (c for c in stripped if 0 < ord(c) < 127)
    return ''.join(stripped)
root_stop = Tk()
txt_file = askopenfilename()
print txt_file
root_stop.update()
root_stop.destroy()
# Read the chosen file; each line becomes one document.
documents = []
with open(txt_file, 'r') as f:
    for line in f:
        documents.append(strip_non_ascii(line))
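The `translate(None, ...)` call used by `strip_non_ascii` only works on Python 2 byte strings. As a sketch of the same cleaning rule that runs on both Python 2 and Python 3, assuming the intent is "drop ASCII punctuation and anything outside the code range 1 to 126", a hypothetical `strip_non_ascii_portable` could look like this:

```python
import string

def strip_non_ascii_portable(text):
    """Drop ASCII punctuation and any character outside the 1-126 code range."""
    return ''.join(c for c in text
                   if c not in string.punctuation and 0 < ord(c) < 127)

# Hypothetical sample line containing an accented character and punctuation.
sample = u"Caf\u00e9, anyone?! #LSI\n"
cleaned = strip_non_ascii_portable(sample)
print(repr(cleaned))
```

Note that the trailing newline survives, just as it does in the notebook's version; the later `split()` call discards it.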
Seventh, browse for a stop word list. This list must be a text file with one word per line. Put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.
There are numerous stop word lists on the internet. One collection covering many languages is https://github.com/Alir3z4/stop-words. It is advisable to adapt the list to your corpus.
root_lines = Tk()
stoptxt = askopenfilename()
print stoptxt
root_lines.update()
root_lines.destroy()
# Read the stop word list, one word per line, dropping line endings.
stoplist = []
with open(stoptxt, 'r') as f1:
    for line in f1:
        line = line.replace('\n','')
        line = line.replace('\r','')
        stoplist.append(line)
stoplist.append('rt')
stoplist.append('>')
stoplist.append('sho')
stoplist.append('&:)')
stopset = set(stoplist)
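For scripted runs where a file dialog is inconvenient, the same stop list can be loaded from a known path. The sketch below is illustrative only: `load_stop_words` and its `extra_words` parameter are hypothetical, the temporary file stands in for a real stop list, and lowercasing the entries is an added assumption (the notebook lowercases documents before filtering, but not the stop list).

```python
import os
import tempfile

def load_stop_words(path, extra_words=()):
    """Read one stop word per line and merge in corpus-specific extras."""
    with open(path, 'r') as handle:
        words = set(line.strip().lower() for line in handle if line.strip())
    words.update(extra_words)
    return words

# Write a tiny hypothetical stop list so the sketch runs on its own.
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False)
tmp.write('the\nand\r\n\nOf\n')
tmp.close()

stop_words = load_stop_words(tmp.name, extra_words=['rt'])
os.remove(tmp.name)
print(sorted(stop_words))
```

Blank lines and stray carriage returns are ignored, and corpus-specific tokens such as `rt` are added in one place.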
If you set limit_proper_english_words to 'true', a browse dialog will appear. Browse for a list of English words. There are numerous ways to obtain lists of words, including https://github.com/dwyl/english-words
number_of_topics = 5
limit_proper_english_words = 'true'
remove_urls = 'true'
if limit_proper_english_words.lower() == 'true':
    # Ask for an English word list, one word per line.
    root_dict = Tk()
    dicttxt = askopenfilename()
    print dicttxt
    root_dict.update()
    root_dict.destroy()
    dictlist = []
    with open(dicttxt, 'r') as f2:
        for line in f2:
            line = line.replace('\n','')
            line = line.replace('\r','')
            dictlist.append(line)
    dictset = set(dictlist)
print 'Options Entered!'
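try:
    # lpe: set of permitted English words (empty when the option is off).
    if limit_proper_english_words.lower() == 'true':
        lpe = dictset
    else:
        lpe = set()
    # ru: substring that marks a token as a URL; ' ' never matches a token.
    if remove_urls.lower() == 'true':
        ru = 'http'
    else:
        ru = ' '
except Exception:
    # Fall back to defaults: five topics, no word-list filter, URLs removed.
    print 'Error With Options - Applying Defaults!'
    number_of_topics = 5
    limit_proper_english_words = 'false'
    lpe = set()
    ru = 'http'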
# Tokenize each document and keep only the words that pass every filter.
if limit_proper_english_words.lower() == 'true':
    texts = [[word for word in document.lower().split()
              if (word not in stopset and ru not in word and not word.isdigit()
                  and word.islower() and word in lpe and len(word) > 1)]
             for document in documents]
else:
    texts = [[word for word in document.lower().split()
              if (word not in stopset and ru not in word and not word.isdigit()
                  and word.islower() and len(word) > 1)]
             for document in documents]
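To see what the filter conditions do to a single line, the sketch below applies the same tests to one sample tweet. The stop list, word list, sample text and the helper `keep_token` are all hypothetical; the URL appears as `httpstcoabc` because punctuation has already been stripped by the time the notebook tokenizes.

```python
stop_words = set(['the', 'rt', 'is'])                     # hypothetical stop list
english_words = set(['topic', 'model', 'great', 'fun'])   # hypothetical word list
url_marker = 'http'

def keep_token(word, restrict_to_english=True):
    """Mirror the notebook's filter conditions for a single lowercased token."""
    if word in stop_words or url_marker in word:
        return False
    if word.isdigit() or not word.islower() or len(word) <= 1:
        return False
    return (word in english_words) if restrict_to_english else True

tweet = 'RT the topic model is great fun httpstcoabc 2018 lol a'
tokens = tweet.lower().split()
strict = [w for w in tokens if keep_token(w)]
loose = [w for w in tokens if keep_token(w, restrict_to_english=False)]
print(strict)
print(loose)
```

With the word-list restriction on, slang such as `lol` is dropped; with it off, only stop words, URL fragments, numbers and single characters are removed.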
# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
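The effect of dropping tokens that occur only once across the whole corpus can be checked on a toy example. This sketch uses `collections.Counter` instead of `defaultdict`, which is a stylistic substitution and not the notebook's code; the sample token lists are made up.

```python
from collections import Counter

# Hypothetical tokenized documents.
texts = [['topic', 'model', 'fun'],
         ['topic', 'model'],
         ['rare', 'topic']]

# Count every token across the whole corpus, then keep tokens seen more than once.
frequency = Counter(token for text in texts for token in text)
pruned = [[token for token in text if frequency[token] > 1] for text in texts]
print(pruned)
```

Here `fun` and `rare` each occur once and are removed, so the third document is left with a single token.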
dictionary = corpora.Dictionary(texts)
# dictionary.save('/tmp/twitter.dict')
corpus = [dictionary.doc2bow(text) for text in texts]
# corpora.MmCorpus.serialize('/tmp/twitter.mm', corpus)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
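As a rough illustration of what the TF-IDF step does to the bag-of-words counts, the sketch below weights each term by its count times log2(N / document frequency) and then scales each document vector to unit length. That formula is assumed here to match Gensim's default `TfidfModel` settings; check the Gensim documentation before relying on it. The sample documents and `tfidf_vector` are hypothetical.

```python
import math

# Hypothetical tokenized documents.
texts = [['topic', 'model'],
         ['topic', 'tweet'],
         ['topic', 'model', 'model']]

n_docs = len(texts)

# Document frequency: in how many documents each token appears.
doc_freq = {}
for text in texts:
    for token in set(text):
        doc_freq[token] = doc_freq.get(token, 0) + 1

def tfidf_vector(text):
    """Weight = term count * log2(N / document frequency), then L2-normalise."""
    weights = {}
    for token in set(text):
        idf = math.log(float(n_docs) / doc_freq[token], 2)
        weights[token] = text.count(token) * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return dict((t, w / norm) for t, w in weights.items())

vectors = [tfidf_vector(text) for text in texts]
for vec in vectors:
    print(sorted(vec.items()))
```

A token that appears in every document, like `topic` here, gets a weight of zero, so it cannot dominate the topics that LSI extracts afterwards.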
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=number_of_topics)
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(number_of_topics)
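The notebook computes `corpus_lsi` but does not use it further. One possible follow-up, sketched here on made-up data, is to label each document with its strongest topic. The sample vectors only imitate the list of (topic id, weight) pairs that iterating over `corpus_lsi` yields, the empty vector for a fully filtered document is an assumption, and `dominant_topic` is a hypothetical helper.

```python
# Hypothetical LSI output: one list of (topic_id, weight) pairs per document.
sample_corpus_lsi = [
    [(0, 0.82), (1, -0.10)],
    [(0, 0.15), (1, -0.67)],
    [],  # assumed shape for a document whose tokens were all filtered out
]

def dominant_topic(doc_vector):
    """Return the topic id with the largest absolute weight, or None if empty."""
    if not doc_vector:
        return None
    topic_id, _weight = max(doc_vector, key=lambda pair: abs(pair[1]))
    return topic_id

assignments = [dominant_topic(vec) for vec in sample_corpus_lsi]
print(assignments)
```

Absolute values are compared because LSI weights can be negative; the sign only indicates direction along the topic axis.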