主题模型



王成军

[email protected]

计算传播网 http://computational-communication.com

2014年高考前夕,百度“基于海量作文范文和搜索数据,利用概率主题模型,预测2014年高考作文的命题方向”。如上图所示,共分为了六个主题:时间、生命、民族、教育、心灵、发展。而每个主题下面又包括了一些具体的关键词。比如,生命的主题对应:平凡、自由、美丽、梦想、奋斗、青春、快乐、孤独。

Read more

latent Dirichlet allocation (LDA)

The simplest topic model (on which all others are based) is latent Dirichlet allocation (LDA).

  • LDA is a generative model that infers unobserved meanings from a large set of observations.

Reference

  • Blei DM, Ng J, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003; 3: 993–1022.
  • Blei DM, Lafferty JD. Correction: a correlated topic model of science. Ann Appl Stat. 2007; 1: 634.
  • Blei DM. Probabilistic topic models. Commun ACM. 2012; 55: 55–65.
  • Chandra Y, Jiang LC, Wang C-J (2016) Mining Social Entrepreneurship Strategies Using Topic Modeling. PLoS ONE 11(3): e0151342. doi:10.1371/journal.pone.0151342

  • Topic models assume that each document contains a mixture of topics
    • Topics are considered latent/unobserved variables that stand between the documents and terms

It is impossible to directly assess the relationships between topics and documents and between topics and terms.

  • What can be directly observed is the distribution of terms over documents, which is known as the document term matrix (DTM).

Topic models algorithmically identify the best set of latent variables (topics) that can best explain the observed distribution of terms in the documents.

The DTM is further decomposed into two matrices:

  • a term-topic matrix (TTM)
  • a topic-document matrix (TDM)

Each document can be assigned to a primary topic that demonstrates the highest topic-document probability and can then be linked to other topics with declining probabilities.

Assume K topics are in D documents, and each topic is denoted with $\phi_{1:K}$.

Each topic $\phi_K$ is a distribution of fixed words in the given documents.

The topic proportion in the document is denoted as $\theta_d$.

  • e.g., the kth topic's proportion in document d is $\theta_{d, k}$.

Let $w_{d,n}$ denote the nth term in document d.

Further, topic models assign topics to a document and its terms.

  • For example, the topic assigned to document d is denoted as $z_d$,
    • and the topic assigned to the nth term in document d is denoted as $z_{d,n}$.

According to Blei et al. the joint distribution of $\phi_{1:K}$,$\theta_{1:D}$, $z_{1:D}$ and $w_{d, n}$ plus the generative process for LDA can be expressed as:

$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{d, n}) $ =

$\prod_{i=1}^{K} p(\phi_i) \prod_{d =1}^D p(\theta_d)(\prod_{n=1}^N p(z_{d,n} \mid \theta_d) \times p(w_{d, n} \mid \phi_{1:K}, Z_{d, n}) ) $

Note that $\phi_{1:k},\theta_{1:D},and z_{1:D}$ are latent, unobservable variables. Thus, the computational challenge of LDA is to compute the conditional distribution of them given the observable specific words in the documents $w_{d, n}$.

Accordingly, the posterior distribution of LDA can be expressed as:

$p(\phi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{d, n}) = \frac{p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{d, n})}{p(w_{1:D})}$

Because the number of possible topic structures is exponentially large, it is impossible to compute the posterior of LDA. Topic models aim to develop efficient algorithms to approximate the posterior of LDA.

  • There are two categories of algorithms:
    • sampling-based algorithms
    • variational algorithms

Using the Gibbs sampling method, we can build a Markov chain for the sequence of random variables (see Eq 1). The sampling algorithm is applied to the chain to sample from the limited distribution, and it approximates the posterior.

Gensim

Unfortunately, scikit-learn does not support latent Dirichlet allocation.

Therefore, we are going to use the gensim package in Python.

Gensim is developed by Radim Řehůřek,who is a machine learning researcher and consultant in the Czech Republic. We must start by installing it. We can achieve this by running one of the following commands:

pip install gensim

In [21]:
%matplotlib inline
from __future__ import print_function
from wordcloud import WordCloud
from gensim import corpora, models, similarities,  matutils
import matplotlib.pyplot as plt
import numpy as np

Download data

http://www.cs.princeton.edu/~blei/lda-c/ap.tgz

Unzip the data and put them into /Users/chengjun/bigdata/ap/

In [22]:
# Load the data
corpus = corpora.BleiCorpus('/Users/chengjun/bigdata/ap/ap.dat', '/Users/chengjun/bigdata/ap/vocab.txt')
In [23]:
' '.join(dir(corpus))
Out[23]:
'__class__ __delattr__ __dict__ __doc__ __format__ __getattribute__ __getitem__ __hash__ __init__ __iter__ __len__ __module__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _smart_save docbyoffset fname id2word index length line2doc load save save_corpus serialize'
In [24]:
corpus.id2word.items()[:3]
Out[24]:
[(0, u'i'), (1, u'new'), (2, u'percent')]

Build the topic model

In [25]:
NUM_TOPICS = 100
In [26]:
model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=None)
/Users/chengjun/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py:379: VisibleDeprecationWarning: non integer (and non boolean) array-likes will not be accepted as indices in the future
  expElogbetad = self.expElogbeta[:, ids]
/Users/chengjun/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py:404: VisibleDeprecationWarning: non integer (and non boolean) array-likes will not be accepted as indices in the future
  sstats[:, ids] += numpy.outer(expElogthetad.T, cts / phinorm)
/Users/chengjun/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py:659: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  score += numpy.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, id]) for id, cnt in doc)
In [27]:
' '.join(dir(model))
Out[27]:
'__class__ __delattr__ __dict__ __doc__ __format__ __getattribute__ __getitem__ __hash__ __init__ __module__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _apply _smart_save alpha bound chunksize clear decay dispatcher distributed do_estep do_mstep eta eval_every expElogbeta gamma_threshold id2word inference iterations load log_perplexity num_terms num_topics num_updates numworkers offset optimize_alpha passes print_topic print_topics save show_topic show_topics state sync_state top_topics update update_alpha update_every'

We can see the list of topics a document refers to

by using the model[doc] syntax:

In [28]:
document_topics = [model[c] for c in corpus]
In [29]:
# how many topics does one document cover?
document_topics[2]
Out[29]:
[(1, 0.012173218478465178),
 (5, 0.010844598121070434),
 (9, 0.017341043056838871),
 (11, 0.024468311850992307),
 (13, 0.011713551338282521),
 (17, 0.12008191700644692),
 (26, 0.2653896038945065),
 (36, 0.016205103283848322),
 (40, 0.063642155983975865),
 (42, 0.10912551497030909),
 (43, 0.18969307606619615),
 (54, 0.029152810319599372),
 (70, 0.027382335898959255),
 (73, 0.079544001401190376)]
In [30]:
# The first topic
# format: weight, term
model.show_topic(0, 10)
Out[30]:
[(0.010029947066157719, u'cents'),
 (0.0084518751921951272, u'sheet'),
 (0.006971130612136031, u'people'),
 (0.0069681226790944068, u'videos'),
 (0.0061501028198266746, u'designs'),
 (0.0059801669571577327, u'threeway'),
 (0.0051079365521994723, u'mca'),
 (0.0042408541343766145, u'new'),
 (0.0038503471959112092, u'today'),
 (0.0034815721467352559, u'year')]
In [31]:
# The 100 topic
# format: weight, term
model.show_topic(99, 10)
Out[31]:
[(0.012503720479598517, u'billion'),
 (0.0093416080644276624, u'bomber'),
 (0.0080440160210163321, u'fiscal'),
 (0.0073821380604033784, u'defense'),
 (0.0059336615162439077, u'air'),
 (0.0057387426153885654, u'safety'),
 (0.005239704830984784, u'year'),
 (0.0047342102096198735, u'nuclear'),
 (0.0046810109846632626, u'north'),
 (0.004498783524243005, u'million')]
In [32]:
words = model.show_topic(0, 5)
words
Out[32]:
[(0.010029947066157719, u'cents'),
 (0.0084518751921951272, u'sheet'),
 (0.006971130612136031, u'people'),
 (0.0069681226790944068, u'videos'),
 (0.0061501028198266746, u'designs')]
In [33]:
model.show_topics(4)
Out[33]:
[u'0.011*government + 0.010*hughes + 0.009*communist + 0.008*party + 0.007*fellows + 0.006*leader + 0.006*cordon + 0.005*people + 0.005*president + 0.005*fair',
 u'0.132*cdy + 0.118*clr + 0.042*rn + 0.011*m + 0.010*rome + 0.008*frankfurt + 0.005*new + 0.004*paris + 0.004*tokyo + 0.003*i',
 u'0.033*percent + 0.012*year + 0.011*billion + 0.009*states + 0.006*last + 0.005*ec + 0.005*years + 0.005*new + 0.005*tax + 0.005*orders',
 u'0.014*percent + 0.011*million + 0.007*africa + 0.007*south + 0.007*sales + 0.006*share + 0.006*year + 0.005*wednesdays + 0.005*new + 0.005*mandela']
In [34]:
for f, w in words[:10]:
    print(f, w)
0.0100299470662 cents
0.0084518751922 sheet
0.00697113061214 people
0.00696812267909 videos
0.00615010281983 designs
In [39]:
# write out topcis with 10 terms with weights
for ti in range(model.num_topics):
    words = model.show_topic(ti, 10)
    tf = sum(f for f, w in words)
    with open('/Users/chengjun/github/cjc2016/data/topics_term_weight.txt', 'a') as output:
        for f, w in words:
            line = str(ti) + '\t' +  w + '\t' + str(f/tf) 
            output.write(line + '\n')
In [40]:
# We first identify the most discussed topic, i.e., the one with the
# highest total weight
topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
weight = topics.sum(1)
max_topic = weight.argmax()
In [216]:
# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
words = model.show_topic(max_topic, 64)
words = np.array(words).T
words_freq=[float(i)*10000000 for i in words[0]]
words = zip(words[1], words_freq)
In [219]:
wordcloud = WordCloud().generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
In [104]:
num_topics_used = [len(model[doc]) for doc in corpus]

fig,ax = plt.subplots()
ax.hist(num_topics_used, np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
fig.tight_layout()
#fig.savefig('Figure_04_01.png')

We can see that about 150 documents have 5 topics,

  • while the majority deal with around 10 to 12 of them.
    • No document talks about more than 20 topics.
In [109]:
# Now, repeat the same exercise using alpha=1.0
# You can edit the constant below to play around with this parameter
ALPHA = 1.0
model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=ALPHA)

num_topics_used1 = [len(model1[doc]) for doc in corpus]
In [108]:
fig,ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
# The coordinates below were fit by trial and error to look good
plt.text(9, 223, r'default alpha')
plt.text(26, 156, 'alpha=1.0')
fig.tight_layout()

读取并清洗数据

In [1]:
with open('/Users/chengjun/bigdata/ap/ap.txt', 'r') as f:
    dat = f.readlines()
In [2]:
dat[:6]
Out[2]:
['<DOC>\n',
 '<DOCNO> AP881218-0003 </DOCNO>\n',
 '<TEXT>\n',
 " A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered on the floor crying ``Jesus save us! God save us!'' Friends and family said the boy apparently was troubled by his grandmother's death and the divorce of his parents and had been tormented by classmates. Nicholas' grandfather, Clarence Elliott Sr., said Saturday that the boy's parents separated about four years ago and his maternal grandmother, Channey Williams, died last year after a long illness. The grandfather also said his grandson was fascinated with guns. ``The boy was always talking about guns,'' he said. ``He knew a lot about them. He knew all the names of them _ none of those little guns like a .32 or a .22 or nothing like that. He liked the big ones.'' The slain teacher was identified as Karen H. Farley, 40. The wounded teacher, 37-year-old Sam Marino, was in serious condition Saturday with gunshot wounds in the shoulder. Police said the boy also shot at a third teacher, Susan Allen, 31, as she fled from the room where Marino was shot. He then shot Marino again before running to a third classroom where a Bible class was meeting. The youngster shot the glass out of a locked door before opening fire, police spokesman Lewis Thurston said. When the youth's pistol jammed, he was tackled by teacher Maurice Matteson, 24, and other students, Thurston said. ``Once you see what went on in there, it's a miracle that we didn't have more people killed,'' Police Chief Charles R. Wall said. Police didn't have a motive, Detective Tom Zucaro said, but believe the boy's primary target was not a teacher but a classmate. Officers found what appeared to be three Molotov cocktails in the boy's locker and confiscated the gun and several spent shell casings. Fourteen rounds were fired before the gun jammed, Thurston said. The gun, which the boy carried to school in his knapsack, was purchased by an adult at the youngster's request, Thurston said, adding that authorities have interviewed the adult, whose name is being withheld pending an investigation by the federal Bureau of Alcohol, Tobacco and Firearms. The shootings occurred in a complex of four portable classrooms for junior and senior high school students outside the main building of the 4-year-old school. The school has 500 students in kindergarten through 12th grade. Police said they were trying to reconstruct the sequence of events and had not resolved who was shot first. The body of Ms. Farley was found about an hour after the shootings behind a classroom door.\n",
 ' </TEXT>\n',
 '</DOC>\n']
In [3]:
dat[4].strip()[0]
Out[3]:
'<'
In [4]:
docs = []
for i in dat[:100]:
    if i.strip()[0] != '<':
        docs.append(i)
In [5]:
def clean_doc(doc):
    doc = doc.replace('.', '').replace(',', '')
    doc = doc.replace('``', '').replace('"', '')
    doc = doc.replace('_', '').replace("'", '')
    doc = doc.replace('!', '')
    return doc
docs = [clean_doc(doc) for doc in docs]
In [6]:
texts = [[i for i in doc.lower().split()] for doc in docs]
In [ ]:
import nltk
nltk.download()
# 会打开一个窗口,选择book,download,待下载完毕就可以使用了。
In [7]:
from nltk.corpus import stopwords
stop = stopwords.words('english') # 如果此处出错,请执行上一个block的代码
In [8]:
' '.join(stop)
Out[8]:
u'i me my myself we our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom this that these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don should now d ll m o re ve y ain aren couldn didn doesn hadn hasn haven isn ma mightn mustn needn shan shouldn wasn weren won wouldn'
In [9]:
stop.append('said')
In [10]:
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1 and token not in stop]
        for text in texts]
In [11]:
docs[8]
Out[11]:
' Here is a summary of developments in forest and brush fires in Western states:\n'
In [12]:
' '.join(texts[9])
Out[12]:
'stirbois 2 man extreme-right national front party le pen died saturday automobile police 43 stirbois political meeting friday city dreux miles west paris traveling toward capital car ran police stirbois national front member party since born paris law headed business stirbois several extreme-right political joining national front 1977 percent vote local elections west paris highest vote percentage candidate year half later deputy dreux stirbois deputy national 1986 lost elections last national front founded le pen frances government death priority first years presidential elections le pen percent vote national front could'
In [15]:
dictionary = corpora.Dictionary(texts)
lda_corpus = [dictionary.doc2bow(text) for text in texts]
#The function doc2bow() simply counts the number of occurences of each distinct word, 
# converts the word to its integer word id and returns the result as a sparse vector. 
In [18]:
lda_model = models.ldamodel.LdaModel(
    lda_corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha=None)
In [19]:
import pyLDAvis.gensim

ap_data = pyLDAvis.gensim.prepare(lda_model, lda_corpus, dictionary)
/Users/chengjun/anaconda/lib/python2.7/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:102: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -0.199959128474 and the largest is 0.763806504314.
  RuntimeWarning
In [20]:
pyLDAvis.enable_notebook()
pyLDAvis.display(ap_data)
Out[20]:
In [220]:
pyLDAvis.save_html(ap_data, '/Users/chengjun/github/cjc2016/vis/ap_ldavis.html')

阅读材料

Willi Richert, Luis Pedro Coelho, 2013, Building Machine Learning Systems with Python. Chapter 4. Packt Publishing.

东风夜放花千树:对宋词进行主题分析初探 http://chengjun.github.io/cn/2013/09/topic-modeling-of-song-peom/

Chandra Y, Jiang LC, Wang C-J (2016) Mining Social Entrepreneurship Strategies Using Topic Modeling. PLoS ONE 11(3): e0151342. doi:10.1371/journal.pone.0151342