LSI-Topic-Model

Generates related keywords from a corpus, bundled together into topic areas. Ten keywords are generated per topic. A single text file is uploaded, and each line is treated as a separate document.

Baylor University Libraries: LSI Topic Model

Implements Latent Semantic Indexing (LSI)

From Wikipedia (https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing): "Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings."
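
To make the SVD step in the quotation concrete, here is a sketch in conventional notation. The symbols are the standard textbook ones and are not taken from this notebook. LSI approximates the term-document matrix by keeping only the k largest singular values:

```latex
X \;\approx\; X_k \;=\; U_k \, \Sigma_k \, V_k^{\top},
\qquad
X \in \mathbb{R}^{m \times n},\quad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{n \times k}
```

Here X has one row per term (m terms) and one column per document (n documents), and Σ_k is diagonal. Each column of U_k can be read as one topic: the terms with the largest absolute weights in that column are the topic's keywords. In this notebook, k corresponds to the number_of_topics option set in a later cell.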

This Python application was built by the Baylor University Libraries to help researchers run unsupervised topic modelling on one-line documents, such as Twitter posts.

**First**, ensure Anaconda with Python 2.7 is installed on your system. If it is not, download it from https://www.anaconda.com/download/ and install it. Then continue with the next step.

Second, launch Anaconda Navigator. Once Anaconda is installed, Navigator appears in the Programs menu on Windows and in the Applications directory on Mac.

Third, launch the Jupyter Notebook application. Anaconda Navigator has a link directly to Jupyter Notebook.

Fourth, download the Jupyter Notebook file https://raw.githubusercontent.com/Josh-Been/LSI-Topic-Model/master/Baylor-Libraries-LSI-Topic-Model.ipynb to your computer. In the Jupyter browser tab that opened in the previous step, click the Upload button and browse for the saved Jupyter Notebook file.

Up to this point you have been reading an HTML version of this Notebook.

Now switch to the interactive version in Jupyter.

Fifth, ensure the Gensim library ("Topic Modelling for Humans") is installed. If you are confident Gensim is already installed, skip this step.

For background on Gensim, see https://radimrehurek.com/gensim/

Steps:

(1) Open Anaconda Navigator

(2) On the left-hand menu, click Environments

(3) In the drop-down menu at the top center of the page, select 'All'

(4) In the Search Packages box to the right, type gensim

(5) In the results below, check the box to the left of gensim and then click the green Apply button on the bottom right. Click Apply on any popups.

NOTE: Step 5 above may take a few minutes to complete, depending on the speed of the computer and the network connection. Please be patient before moving on to the next step. Now is a good time to go and grab that coffee.
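
Once the install finishes, a quick check from a notebook cell can confirm that the package is importable. This is a minimal sketch; the helper name `is_installed` is hypothetical and not part of the notebook.

```python
import importlib


def is_installed(package_name):
    """Return True if the named package can be imported in this environment."""
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        return False


if is_installed('gensim'):
    print('gensim is available')
else:
    print('gensim was not found; repeat the Anaconda Navigator steps above')
```

If the second message appears, make sure the package was installed into the same environment that Jupyter Notebook was launched from.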

Sixth, browse for the text file containing your documents, one per line. Put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.

In [ ]:
import warnings, string
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim import corpora, models
from Tkinter import *
from tkFileDialog import askopenfilename

def strip_non_ascii(cleanme):
    # Remove punctuation, then drop NUL, DEL, and any non-ASCII bytes
    stripped = cleanme.translate(None, string.punctuation)
    stripped = (c for c in stripped if 0 < ord(c) < 127)
    return ''.join(stripped)

# Open a file dialog to choose the corpus file
root_stop = Tk()
txt_file = askopenfilename()
print txt_file
root_stop.update()
root_stop.destroy()

# Each line of the file becomes one cleaned document
documents = []
with open(txt_file, 'r') as f:
    for line in f:
        documents.append(strip_non_ascii(line))
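
The cell above relies on the Python 2 form of `str.translate`, which Python 3 does not accept. For readers experimenting outside the Anaconda 2.7 setup this notebook assumes, the same cleaning rule can be sketched in a way that runs on either version. The name `strip_punct_and_non_ascii` is hypothetical, and the sample tweet is made up.

```python
import string

PUNCTUATION = set(string.punctuation)


def strip_punct_and_non_ascii(text):
    """Drop ASCII punctuation and any character outside the range 1-126."""
    return ''.join(c for c in text
                   if c not in PUNCTUATION and 0 < ord(c) < 127)


print(strip_punct_and_non_ascii(u'RT @user: Caf\u00e9 time!! #yum'))
# prints: RT user Caf time yum
```

Note that accented letters are removed rather than transliterated, so "Café" becomes "Caf". That matches the behaviour of the notebook's own function.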

Seventh, browse for a stop word list. This list must be a text file with one word per line. Put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.

There are numerous stop word lists on the internet. One collection covering many languages is https://github.com/Alir3z4/stop-words. It is advisable to adapt whichever list you choose to your corpus.

In [ ]:
# Open a file dialog to choose the stop word list
root_lines = Tk()
stoptxt = askopenfilename()
print stoptxt
root_lines.update()
root_lines.destroy()

# Read one stop word per line, dropping line endings
stoplist = []
with open(stoptxt, 'r') as f1:
    for line in f1:
        stoplist.append(line.replace('\n', '').replace('\r', ''))

# Additional tokens to exclude, such as the retweet marker 'rt'
stoplist.extend(['rt', '&gt;', 'sho', '&amp;:)'])
stopset = set(stoplist)
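
If you would rather skip the file dialog, for instance when running the notebook on a machine without a display, a word list can be loaded directly from a path. This is a minimal sketch, assuming a hypothetical helper `load_word_set` and a UTF-8 encoded file with one word per line.

```python
import io


def load_word_set(path):
    """Read one word per line and return the non-empty entries as a set."""
    words = set()
    with io.open(path, 'r', encoding='utf-8') as handle:
        for line in handle:
            word = line.strip()  # removes line endings and surrounding spaces
            if word:
                words.add(word)
    return words
```

It could be used as `stopset = load_word_set('stopwords.txt')`, where the file name is only a placeholder. Because the first cell strips punctuation from every document, an entry such as don't will only match if it is also stored without the apostrophe (dont).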

Eighth, specify the following options. Then, put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.

If you set limit_proper_english_words to 'true', a browse dialog will appear. Browse for a list of English words, a text file with one word per line; only words on that list are kept. There are numerous ways to obtain such lists, including https://github.com/dwyl/english-words

In [ ]:
number_of_topics = 5
limit_proper_english_words = 'true'
remove_urls = 'true'

# Compare case-insensitively, matching the check in the next cell
if limit_proper_english_words.lower() == 'true':
    # Open a file dialog to choose the English word list
    root_dict = Tk()
    dicttxt = askopenfilename()
    print dicttxt
    root_dict.update()
    root_dict.destroy()

    # Read one word per line, dropping line endings
    dictlist = []
    with open(dicttxt, 'r') as f2:
        for line in f2:
            dictlist.append(line.replace('\n', '').replace('\r', ''))
    dictset = set(dictlist)

print 'Options Entered!'

Ninth, calculate the LSI topics for the corpus. Put the cursor in the box below and click the 'run cell, select below' button at the top of this notebook.

In [ ]:
try:
    use_dict = limit_proper_english_words.lower() == 'true'
    # Keep only words found in the English word list, if requested
    lpe = dictset if use_dict else set()
    # Tokens containing 'http' are treated as URLs. A space never occurs
    # inside a token, so ' ' effectively disables the URL filter.
    ru = 'http' if remove_urls.lower() == 'true' else ' '
except (AttributeError, NameError):
    # An option is missing or is not a string: fall back to safe defaults
    print 'Error With Options - Applying Defaults!'
    number_of_topics = 5
    use_dict = False
    lpe = set()
    ru = 'http'

# Tokenize each document and drop stop words, URLs, numbers,
# tokens without letters, and single-character tokens
texts = [[word for word in document.lower().split()
          if (word not in stopset and ru not in word and not word.isdigit()
              and word.islower() and len(word) > 1
              and (not use_dict or word in lpe))]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)
# dictionary.save('/tmp/twitter.dict')

corpus = [dictionary.doc2bow(text) for text in texts]
# corpora.MmCorpus.serialize('/tmp/twitter.mm', corpus)

tfidf = models.TfidfModel(corpus)

corpus_tfidf = tfidf[corpus]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=number_of_topics)
corpus_lsi = lsi[corpus_tfidf]

lsi.print_topics(number_of_topics)
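
To see what the filtering in this cell does before any model is built, the same rules can be run on a toy corpus using only the standard library. This is an illustrative sketch, not the notebook's code: the function names `filter_tokens` and `drop_singletons` and the sample documents are made up for the example, and the optional English word list check is left out.

```python
from collections import Counter


def filter_tokens(document, stopset, url_marker='http'):
    """Lowercase and split a document, keeping only usable word tokens."""
    return [word for word in document.lower().split()
            if word not in stopset
            and url_marker not in word   # drops URLs, even with punctuation stripped
            and not word.isdigit()
            and word.islower()           # needs at least one letter
            and len(word) > 1]


def drop_singletons(texts):
    """Remove tokens that occur only once across the whole corpus."""
    frequency = Counter(token for text in texts for token in text)
    return [[token for token in text if frequency[token] > 1]
            for text in texts]


docs = ['RT coffee time httptcoabc 2024',
        'Coffee and tea time',
        'tea a break']
stopset = {'rt', 'and'}
texts = drop_singletons([filter_tokens(d, stopset) for d in docs])
print(texts)
# prints: [['coffee', 'time'], ['coffee', 'tea', 'time'], ['tea']]
```

Token lists of this shape are what `corpora.Dictionary` receives in the cell above. On very short documents the singleton rule can empty a document entirely, which is worth keeping in mind when a small corpus yields sparse topics.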