Topic Models



王成军


Computational Communication (计算传播网) http://computational-communication.com

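On the eve of the 2014 college entrance examination (gaokao), Baidu "used a probabilistic topic model, based on a massive collection of sample essays and search data, to predict the direction of the 2014 gaokao essay prompts". In Baidu's chart, the predictions fall into six topics: time, life, nation, education, mind, and development. Each topic in turn contains specific keywords. For example, the topic "life" corresponds to: ordinary, freedom, beauty, dream, struggle, youth, happiness, loneliness.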


Latent Dirichlet allocation (LDA)

The simplest topic model, on which many others are based, is latent Dirichlet allocation (LDA).

  • LDA is a generative model that infers unobserved meanings from a large set of observations.

References

  • Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003; 3: 993–1022.
  • Blei DM, Lafferty JD. Correction: a correlated topic model of science. Ann Appl Stat. 2007; 1: 634.
  • Blei DM. Probabilistic topic models. Commun ACM. 2012; 55: 55–65.
  • Chandra Y, Jiang LC, Wang C-J (2016) Mining Social Entrepreneurship Strategies Using Topic Modeling. PLoS ONE 11(3): e0151342. doi:10.1371/journal.pone.0151342

Reading

Blei DM. Probabilistic topic models. Commun ACM. 2012; 55: 55–65.

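LDA (latent Dirichlet allocation) is a generative model of document topics

  • It is a three-level Bayesian probabilistic model with a three-layer structure of words, topics, and documents.

A generative model here means that we assume every word in a document is produced by the following process:

a topic is chosen with some probability, and then a word is chosen from that topic with some probability.

  • Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.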

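The multinomial distribution is a generalization of the binomial distribution.

  • The classic example of the binomial distribution is coin tossing: if a coin lands heads with probability p and is tossed n times, the probability of getting exactly k heads is a binomial probability. (For a strict definition, see the definition of Bernoulli trials.)
  • Generalizing the binomial formula to more than two outcomes gives the multinomial distribution.
    • For example, in an experiment with three possible outcomes repeated n times, it gives the probability that outcome 1 occurs k1 times, outcome 2 occurs k2 times, and outcome 3 occurs k3 times.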
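The relationship between the two distributions can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `binomial_pmf` and `multinomial_pmf` are illustrative and do not come from the original notebook.

```python
from math import comb, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def multinomial_pmf(counts, probs):
    """Probability that outcome i occurs counts[i] times in sum(counts) trials."""
    coef = factorial(sum(counts))
    for k in counts:
        coef //= factorial(k)          # multinomial coefficient n! / (k1! k2! ...)
    prob = float(coef)
    for k, p in zip(counts, probs):
        prob *= p**k
    return prob

# With two outcomes the multinomial reduces to the binomial.
print(binomial_pmf(2, 4, 0.5))
print(multinomial_pmf([2, 2], [0.5, 0.5]))
# Three equally likely outcomes, each observed once in three trials.
print(multinomial_pmf([1, 1, 1], [1/3, 1/3, 1/3]))
```

The first two printed values agree, which is the sense in which the multinomial distribution generalizes the binomial one.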

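LDA is an unsupervised machine learning technique

It can be used to identify latent topic information in a large document collection or corpus.

  • It adopts the bag-of-words approach, treating each document as a word-frequency vector, which turns text into numerical information that is easy to model.
  • However, the bag-of-words approach ignores the order of words. This simplifies the problem, and it also leaves room for improving the model.
  • Each document represents a probability distribution over topics, and each topic in turn represents a probability distribution over many words.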
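To make the bag-of-words idea concrete, here is a small sketch with two made-up sentences (the documents and the helper `bow_vector` are assumptions for illustration only). Both sentences contain the same words in a different order, so they map to the same word-frequency vector.

```python
from collections import Counter

# Two toy documents that differ only in word order (made up for illustration).
docs = ["the cat sat on the mat", "the mat sat on the cat"]

# The vocabulary is the sorted set of all distinct words.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Return the word-frequency vector of doc over the given vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Because word order is discarded, the model cannot tell these two documents apart; that is the simplification the bullet above refers to.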

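The parameters of the multinomial distribution follow a Dirichlet distribution

  • The Dirichlet distribution is a distribution over the parameters of a multinomial distribution, and it is therefore described as a "distribution over distributions".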
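One way to see what "distribution over distributions" means is to draw a few samples: every sample is itself a valid probability vector. The sketch below uses the standard construction of normalizing independent Gamma draws; the helper name `sample_dirichlet` and the parameter values are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

rng = random.Random(0)
for _ in range(3):
    theta = sample_dirichlet([1.0, 1.0, 1.0], rng)
    # Each theta is non-negative and sums to 1, so it can serve as
    # the parameter vector of a three-outcome multinomial distribution.
    print([round(t, 3) for t in theta], "sum =", round(sum(theta), 6))
```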

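Where the name LDA comes from

There are two latent Dirichlet distributions.

  • Each document has its own distribution over topics, which is a multinomial distribution
    • The parameters of this topic multinomial follow a Dirichlet distribution.
  • Each topic has a multinomial distribution over terms
    • The parameters of this term multinomial follow a Dirichlet distribution.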
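The two Dirichlet distributions and the two multinomial draws can be put together into a small simulation of the generative story. This is only a sketch under stated assumptions: the vocabulary, the symmetric hyperparameters `alpha` and `eta`, and the fixed document length are all made up, and the function name `generate_corpus` is hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha, eta, seed=0):
    """Simulate the LDA generative process with symmetric priors (assumed)."""
    rng = random.Random(seed)
    # One term distribution per topic, drawn from a Dirichlet.
    phi = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # One topic distribution per document, drawn from a Dirichlet.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]   # choose a topic
            w = rng.choices(vocab, weights=phi[z])[0]            # choose a word from it
            words.append(w)
        docs.append(words)
    return phi, docs

vocab = ["stock", "market", "space", "earth", "party", "vote"]
phi, docs = generate_corpus(n_docs=3, doc_len=8, vocab=vocab,
                            n_topics=2, alpha=0.5, eta=0.5)
for doc in docs:
    print(" ".join(doc))
```

Fitting LDA runs this story in reverse: given only the generated words, it tries to recover plausible values of the topic and document distributions.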

Topic models assume that each document contains a mixture of topics.

It is impossible to directly assess the relationships between topics and documents and between topics and terms.

  • Topics are considered latent/unobserved variables that stand between the documents and terms

  • What can be directly observed is the distribution of terms over documents, which is known as the document term matrix (DTM).

Topic models algorithmically identify the best set of latent variables (topics) that can best explain the observed distribution of terms in the documents.

The DTM is further decomposed into two matrices:

  • a term-topic matrix (TTM)
  • a topic-document matrix (TDM)

Each document can be assigned to a primary topic that demonstrates the highest topic-document probability and can then be linked to other topics with declining probabilities.
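A toy numerical example may help with the decomposition. In the sketch below all numbers are made up, and the orientation (terms in rows of the term-topic matrix, topics in rows of the topic-document matrix) is an assumption chosen so that the two matrices can be multiplied; the product gives the expected term probabilities of each document.

```python
terms = ["stock", "market", "space", "earth"]

# Term-topic matrix: rows = terms, columns = topics; each column sums to 1.
ttm = [[0.5, 0.0],
       [0.4, 0.1],
       [0.1, 0.4],
       [0.0, 0.5]]

# Topic-document matrix: rows = topics, columns = documents; each column sums to 1.
tdm = [[0.9, 0.2, 0.6],
       [0.1, 0.8, 0.4]]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Expected term probabilities: rows = terms, columns = documents.
expected = matmul(ttm, tdm)

# Primary topic of each document = the topic with the highest proportion.
n_docs = len(tdm[0])
primary = [max(range(len(tdm)), key=lambda k: tdm[k][d]) for d in range(n_docs)]

for term, row in zip(terms, expected):
    print(term, [round(x, 2) for x in row])
print("primary topics:", primary)
```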

Assume K topics are in D documents.

Distribution of topics over terms

The topics are denoted $\phi_{1:K}$,

  • $\phi_k$ is the kth topic, which is expressed as a series of terms.
  • Each topic is a distribution over a fixed vocabulary.

Distribution of topics over documents

The topic proportions in document d are denoted $\theta_d$

  • e.g., the kth topic's proportion in document d is $\theta_{d, k}$.

Assignment of topics to documents and terms

Topic models assign topics to a document and its terms.

  • The topic assignments for document d are denoted $z_d$,
  • The topic assigned to the nth term in document d is denoted as $z_{d,n}$.

What can be observed?

The positions of the words in the documents, that is, the document-term matrix


Let $w_{d,n}$ denote the nth term in document d.

Joint probability distribution

According to Blei et al., the generative process for LDA corresponds to the following joint distribution of $\phi_{1:K}$, $\theta_{1:D}$, $z_{1:D}$, and $w_{1:D}$:

$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) $ =

$\prod_{i=1}^{K} p(\phi_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \phi_{1:K}, z_{d,n}) \right) $

Posterior distribution

Note that $\phi_{1:K}$, $\theta_{1:D}$, and $z_{1:D}$ are latent, unobservable variables. Thus, the computational challenge of LDA is to compute their conditional distribution given the observed words in the documents, $w_{1:D}$.

Accordingly, the posterior distribution of LDA can be expressed as:

$p(\phi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}$

Because the number of possible topic structures is exponentially large, the posterior of LDA cannot be computed exactly in practice: the denominator $p(w_{1:D})$ would require summing over every possible topic structure.

Topic models aim to develop efficient algorithms to approximate the posterior of LDA. There are two categories of algorithms:

  • sampling-based algorithms
  • variational algorithms

Gibbs sampling

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult.

Using the Gibbs sampling method, we can build a Markov chain over the latent variables, conditioned on the observed words.

The sampling algorithm is run on this chain to draw samples from its limiting distribution, which approximates the posterior.
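As an illustration of the sampling-based approach, here is a minimal sketch of a collapsed Gibbs sampler for LDA, a standard textbook variant in which $\theta$ and $\phi$ are integrated out and only the topic assignments are resampled. It is not the algorithm used later in this notebook (gensim's `LdaModel` relies on variational inference), and the function name `lda_gibbs`, the toy documents, and the hyperparameter values are all assumptions.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns (document-topic counts, topic-word counts).
    """
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]                # topic counts per document
    nkw = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    nk = [0] * n_topics                                 # total tokens per topic
    z = []                                              # topic of every token
    # Random initial assignment of a topic to every token.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][n]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + eta)
                           / (nk[j] + vocab_size * eta)
                           for j in range(n_topics)]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return ndk, nkw

# Toy corpus: word ids 0-2 and 3-5 never co-occur (made up for illustration).
toy_docs = [[0, 1, 0, 1, 2], [0, 0, 1, 1], [3, 4, 3, 4, 5], [4, 4, 3, 5]]
ndk, nkw = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
print(ndk)
print(nkw)
```

After enough iterations the count tables can be normalized (with the priors added) to obtain estimates of the topic proportions per document and the term distribution per topic.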

Gensim: Topic modelling for humans

Gensim is developed by Radim Řehůřek, a machine learning researcher and consultant in the Czech Republic. Start by installing it with the following command:

pip install gensim

In [2]:
%matplotlib inline
from gensim import corpora, models, similarities,  matutils
import matplotlib.pyplot as plt
import numpy as np

Download data

http://www.cs.princeton.edu/~blei/lda-c/ap.tgz

http://www.cs.columbia.edu/~blei/lda-c/

Unzip the data and put it in a folder of your own, e.g., /Users/datalab/bigdata/ap/

In [3]:
# Load the data
corpus = corpora.BleiCorpus('/Users/datalab/bigdata/ap/ap.dat',
                            '/Users/datalab/bigdata/ap/vocab.txt')

Use help to understand the corpora.BleiCorpus class

help(corpora.BleiCorpus)

class BleiCorpus(gensim.corpora.indexedcorpus.IndexedCorpus)
 |  Corpus in Blei's LDA-C format.
 |
 |  The corpus is represented as two files:
 |  one describing the documents,
 |  and another describing the mapping between words and their ids.
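To see what the two files contain, it can help to parse one line by hand. In the LDA-C format each line of the data file describes one document as `M id:count id:count ...`, where M is the number of distinct terms, and the vocabulary file lists one word per line so that the line position serves as the word id. The sketch below is a hand-written illustration, not gensim's reader; the function name `parse_ldac_line`, the sample line, and the tiny vocabulary are made up.

```python
def parse_ldac_line(line):
    """Parse one LDA-C document line into a list of (term_id, count) pairs."""
    fields = line.split()
    n_unique = int(fields[0])
    pairs = []
    for field in fields[1:]:
        term_id, count = field.split(":")
        pairs.append((int(term_id), int(count)))
    if len(pairs) != n_unique:
        raise ValueError("declared %d terms but found %d" % (n_unique, len(pairs)))
    return pairs

# Made-up vocabulary (position = word id) and a made-up document line.
vocab = ["i", "new", "percent", "people"]
doc = parse_ldac_line("3 0:2 2:1 3:4")
print(doc)
print([(vocab[term_id], count) for term_id, count in doc])
```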
In [14]:
# Use dir to see which attributes and methods corpus has
dir(corpus)[-10:]
Out[14]:
['docbyoffset',
 'fname',
 'id2word',
 'index',
 'length',
 'line2doc',
 'load',
 'save',
 'save_corpus',
 'serialize']
In [1]:
# corpus.id2word is a dict which has keys and values, e.g., 
{0: u'i', 1: u'new', 2: u'percent', 3: u'people', 4: u'year', 5: u'two'}
Out[1]:
{0: 'i', 1: 'new', 2: 'percent', 3: 'people', 4: 'year', 5: 'two'}
In [7]:
# transform the dict to list using items()
corpusList = corpus.id2word.items()
list(corpusList)[:3]
Out[7]:
[(0, 'i'), (1, 'new'), (2, 'percent')]
In [9]:
# show the first 5 elements of the list
list(corpusList)[:5]
Out[9]:
[(0, 'i'), (1, 'new'), (2, 'percent'), (3, 'people'), (4, 'year')]

Build the topic model

In [10]:
# Set the number of topics
NUM_TOPICS = 100
In [11]:
model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, 
    id2word=corpus.id2word, 
    alpha=None) 
/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)

help(models.ldamodel.LdaModel)

Help on class LdaModel in module gensim.models.ldamodel:

class LdaModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)

  • The constructor estimates Latent Dirichlet Allocation model parameters based on a training corpus:

lda = LdaModel(corpus, num_topics=10)

  • You can then infer topic distributions on new, unseen documents, with

doc_lda = lda[doc_bow]

  • The model can be updated (trained) with new documents via

lda.update(other_corpus)

In [12]:
# Which attributes and methods does the trained model have?
' '.join(dir(model))
Out[12]:
'__class__ __delattr__ __dict__ __dir__ __doc__ __eq__ __format__ __ge__ __getattribute__ __getitem__ __gt__ __hash__ __init__ __le__ __lt__ __module__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _adapt_by_suffix _apply _load_specials _save_specials _smart_save alpha bound callbacks chunksize clear decay diff dispatcher distributed do_estep do_mstep dtype eta eval_every expElogbeta gamma_threshold get_document_topics get_term_topics get_topic_terms get_topics id2word inference init_dir_prior iterations load log_perplexity minimum_phi_value minimum_probability num_terms num_topics num_updates numworkers offset optimize_alpha optimize_eta passes per_word_topics print_topic print_topics random_state save show_topic show_topics state sync_state top_topics update update_alpha update_eta update_every'

We can see the list of topics a document refers to

by using the model[doc] syntax:

In [13]:
document_topics = [model[c] for c in corpus]
In [14]:
# how many topics does one document cover?
# For example, document 2 covers the following topics in these proportions:
document_topics[2] 
Out[14]:
[(13, 0.04079367),
 (32, 0.49417603),
 (38, 0.050167914),
 (41, 0.028905964),
 (42, 0.015476955),
 (68, 0.013462535),
 (71, 0.16602227),
 (74, 0.041889388),
 (87, 0.12986045),
 (95, 0.015352134)]
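The (topic, proportion) pairs can be ranked to find the primary topic of a document. The sketch below copies the output shown above for document 2 into a literal list so that it runs without the trained model; the variable names are illustrative.

```python
# Output of document_topics[2] copied from the cell above.
doc_topics = [(13, 0.04079367), (32, 0.49417603), (38, 0.050167914),
              (41, 0.028905964), (42, 0.015476955), (68, 0.013462535),
              (71, 0.16602227), (74, 0.041889388), (87, 0.12986045),
              (95, 0.015352134)]

# Sort by proportion, largest first.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
primary_topic, primary_weight = ranked[0]
print("primary topic:", primary_topic, primary_weight)
print("top three:", ranked[:3])

# The listed proportions add up to slightly less than 1.
total = sum(weight for _, weight in doc_topics)
print("total listed weight:", round(total, 4))
```

Only 10 of the 100 topics are listed and their proportions do not quite sum to 1, which suggests that topics with very small proportions are omitted from the output.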
In [18]:
# The first topic
# For topic 0, the top 10 words and their weights are:
model.show_topic(0, 10)
Out[18]:
[('earth', 0.031918947),
 ('genes', 0.015866058),
 ('atmosphere', 0.013967291),
 ('encounter', 0.011340389),
 ('gravity', 0.010003102),
 ('study', 0.0068522394),
 ('scientists', 0.005652591),
 ('time', 0.0052818577),
 ('make', 0.004897477),
 ('two', 0.004616615)]
In [17]:
# For topic 0, the top 20 words and their weights are:
words = model.show_topic(0, 20)
words
Out[17]:
[('earth', 0.031918947),
 ('genes', 0.015866058),
 ('atmosphere', 0.013967291),
 ('encounter', 0.011340389),
 ('gravity', 0.010003102),
 ('study', 0.0068522394),
 ('scientists', 0.005652591),
 ('time', 0.0052818577),
 ('make', 0.004897477),
 ('two', 0.004616615),
 ('sun', 0.004591544),
 ('soviet', 0.0045909504),
 ('produce', 0.004448547),
 ('secretaries', 0.0042900774),
 ('space', 0.0042664623),
 ('i', 0.0042191655),
 ('solar', 0.0042040357),
 ('dukakis', 0.004171154),
 ('day', 0.003936676),
 ('crew', 0.003838289)]
In [19]:
# For topic 99, the top 10 words and their weights are:

model.show_topic(99, 10)
Out[19]:
[('states', 0.010365464),
 ('electoral', 0.009793981),
 ('south', 0.0069398303),
 ('communist', 0.0066758585),
 ('party', 0.006626982),
 ('national', 0.006572403),
 ('united', 0.006457128),
 ('military', 0.00630188),
 ('people', 0.005895602),
 ('soviet', 0.0056286734)]
In [20]:
# Show 4 of the topics the model has learned
model.show_topics(4)
Out[20]:
[(38,
  '0.012*"vernon" + 0.012*"squarefoot" + 0.010*"tickets" + 0.008*"curtis" + 0.008*"dixon" + 0.007*"farrell" + 0.006*"chief" + 0.006*"sony" + 0.006*"i" + 0.006*"carol"'),
 (78,
  '0.018*"bass" + 0.012*"sadat" + 0.010*"mulroney" + 0.009*"air" + 0.008*"canal" + 0.008*"planes" + 0.006*"thurmond" + 0.005*"news" + 0.005*"minimal" + 0.005*"carolinas"'),
 (61,
  '0.027*"committees" + 0.012*"billion" + 0.012*"discount" + 0.010*"million" + 0.010*"kennedy" + 0.008*"nominate" + 0.007*"i" + 0.007*"mca" + 0.007*"defense" + 0.006*"last"'),
 (10,
  '0.021*"stock" + 0.019*"market" + 0.012*"index" + 0.011*"american" + 0.009*"trading" + 0.009*"exchange" + 0.009*"shares" + 0.008*"stocks" + 0.008*"unchanged" + 0.008*"today"')]
In [21]:
for w, f in words:
    print(w, f)
earth 0.031918947
genes 0.015866058
atmosphere 0.013967291
encounter 0.011340389
gravity 0.010003102
study 0.0068522394
scientists 0.005652591
time 0.0052818577
make 0.004897477
two 0.004616615
sun 0.004591544
soviet 0.0045909504
produce 0.004448547
secretaries 0.0042900774
space 0.0042664623
i 0.0042191655
solar 0.0042040357
dukakis 0.004171154
day 0.003936676
crew 0.003838289

Find the most discussed topic

i.e., the one with the highest total weight

In [49]:
## Convert corpus into a dense np array 
help(matutils.corpus2dense) 
Help on function corpus2dense in module gensim.matutils:

corpus2dense(corpus, num_terms, num_docs=None, dtype=<type 'numpy.float32'>)
    Convert corpus into a dense np array (documents will be columns). You
    must supply the number of features `num_terms`, because dimensionality
    cannot be deduced from the sparse vectors alone.
    
    You can optionally supply `num_docs` (=the corpus length) as well, so that
    a more memory-efficient code path is taken.
    
    This is the mirror function to `Dense2Corpus`.

In [25]:
topics = matutils.corpus2dense(model[corpus], 
                               num_terms=model.num_topics)
topics 
Out[25]:
array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.01733483],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.07737275, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01533518, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]], dtype=float32)
In [65]:
# Return the sum of the array elements 
help(topics.sum)
Help on built-in function sum:

sum(...)
    a.sum(axis=None, dtype=None, out=None, keepdims=False)
    
    Return the sum of the array elements over the given axis.
    
    Refer to `numpy.sum` for full documentation.
    
    See Also
    --------
    numpy.sum : equivalent function

In [58]:
# Total weight of the first topic, summed over all documents
topics[0].sum() 
Out[58]:
15.244399
In [26]:
# Compute the total weight of every topic (rows are topics, columns are documents)
weight = topics.sum(1)
weight
Out[26]:
array([  7.016551  ,   9.330058  ,   4.092443  ,  14.70797   ,
        18.816536  ,  13.84574   ,  25.188553  ,   8.681887  ,
        83.210266  ,  65.948326  ,  48.294598  ,   9.822699  ,
        14.046081  ,  16.374575  ,  12.474185  ,   3.1544456 ,
        15.028612  ,  29.481203  ,  14.048639  ,   5.1747265 ,
        33.916435  ,  14.628782  ,   5.587079  ,  16.012156  ,
        31.394032  ,  22.309895  ,  15.289978  ,  28.24406   ,
        38.468887  ,  37.16824   ,   7.6589146 ,   6.3032856 ,
        45.218246  ,   0.73920536,   8.27928   ,  14.282246  ,
         7.3205595 ,  16.605833  ,  17.57109   ,   4.3485236 ,
        14.937658  ,  18.280533  ,  31.116217  ,   8.81229   ,
        99.35021   ,  15.629625  ,  27.235668  ,   3.6531925 ,
         3.9520464 ,  36.610214  ,   6.665674  ,  17.697351  ,
         8.585338  ,   6.9967422 ,   2.7082832 ,  25.98687   ,
        13.173248  ,  13.529184  ,   6.96301   ,   9.336487  ,
        13.240212  ,  10.740415  ,   4.688093  ,   1.7932636 ,
        17.099922  ,   8.932779  ,  26.018682  ,  18.67585   ,
        18.156227  ,   7.1015162 ,   3.4652896 ,  70.66327   ,
         8.223337  ,   2.9610934 ,  25.065014  ,   1.6192663 ,
        43.515667  ,   9.007435  ,   9.945666  ,   7.46577   ,
        59.848038  ,  22.138157  ,   1.750703  ,   3.57628   ,
         6.4887547 ,   7.9155493 ,  28.058126  ,  34.570457  ,
         6.2709646 ,  16.616875  ,  33.435608  ,   7.1481905 ,
        29.8868    ,  75.7148    ,  51.13218   , 145.18645   ,
        11.686968  , 138.8191    ,   7.7476287 ,  40.25295   ],
      dtype=float32)
In [27]:
# Find where the maximum is

help(weight.argmax)
Help on built-in function argmax:

argmax(...) method of numpy.ndarray instance
    a.argmax(axis=None, out=None)
    
    Return indices of the maximum values along the given axis.
    
    Refer to `numpy.argmax` for full documentation.
    
    See Also
    --------
    numpy.argmax : equivalent function

In [28]:
# Find which topic has the largest weight
max_topic = weight.argmax()
print(max_topic)
95
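Since corpus2dense puts documents in columns, each row of the matrix holds one topic's proportions across all documents. The following sketch spells out in plain Python what `topics.sum(1)` and `weight.argmax()` compute, using a made-up 3 x 3 matrix instead of the real 100 x 2246 one.

```python
# Toy matrix: rows = topics, columns = documents (values are made up).
topics = [[0.0, 0.2, 0.1],
          [0.9, 0.1, 0.6],
          [0.1, 0.7, 0.3]]

# Equivalent of topics.sum(1): add up each row, i.e. sum over documents.
weight = [sum(row) for row in topics]

# Equivalent of weight.argmax(): index of the largest total.
max_topic = max(range(len(weight)), key=lambda k: weight[k])

print([round(w, 2) for w in weight])
print(max_topic)
```

With this orientation, the "most discussed" topic is the one whose proportions add up to the largest total over the whole corpus, not the topic with the most words.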
In [44]:
# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
words = model.show_topic(max_topic, 64)
# Map each word to its weight, scaled up for the word cloud
words_dic = {word: float(weight) * 10000000 for word, weight in words}

Topic word cloud

In [46]:
from wordcloud import WordCloud

fig = plt.figure(figsize=(15, 8),facecolor='white')

wordcloud = WordCloud().generate_from_frequencies(words_dic)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

How many topics does each document have?

In [47]:
# Number of topics assigned to each document
num_topics_used = [len(model[doc]) for doc in corpus]
In [48]:
# Plot a histogram of the number of topics per document

fig,ax = plt.subplots()
ax.hist(num_topics_used, np.arange(27))
ax.set_ylabel(r'$Number \;of\; documents$', fontsize = 20)
ax.set_xlabel(r'$Number \;of \;topics$', fontsize = 20)
fig.tight_layout()
#fig.savefig('Figure_04_01.png')

We can see that about 150 documents have 5 topics,

  • while the majority deal with around 10 to 12 of them.
    • No document talks about more than 30 topics.

Changing the hyperparameter alpha

In [49]:
# Now, repeat the same exercise using alpha=1.0
# You can edit the constant below to play around with this parameter
ALPHA = 1.0
model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, 
    alpha=ALPHA)

num_topics_used1 = [len(model1[doc]) for doc in corpus]
/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)
In [50]:
fig,ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel(r'$Number \;of\; documents$', fontsize = 20)
ax.set_xlabel(r'$Number \;of \;topics$', fontsize = 20)
# The coordinates below were fit by trial and error to look good
plt.text(9, 223, r'default alpha')
plt.text(26, 156, 'alpha=1.0')
fig.tight_layout() 

Question: what does the change in the distribution of the number of topics caused by $\alpha$ mean?
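One way to build intuition for this question is to look at the Dirichlet prior alone, before any data is seen. The sketch below draws topic proportions for 100 topics under a small and a large symmetric alpha and counts how many topics exceed a proportion of 0.01. The value 0.01 for the small alpha, the threshold, and the helper names are assumptions for illustration; the sketch describes the prior, not the fitted gensim models above.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0:   # guard against numerical underflow for tiny alpha
            return [x / total for x in draws]

def mean_active_topics(alpha, n_topics=100, threshold=0.01, n_samples=200, seed=0):
    """Average number of topics whose sampled proportion exceeds threshold."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        theta = sample_dirichlet([alpha] * n_topics, rng)
        total += sum(1 for t in theta if t > threshold)
    return total / n_samples

for alpha in (0.01, 1.0):
    print("alpha =", alpha, "mean active topics =", mean_active_topics(alpha))
```

A small alpha concentrates each sampled distribution on a handful of topics, while a larger alpha spreads the mass over many topics, which is consistent with the shift of the histogram toward more topics per document.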

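From raw text to a topic model: a complete example

The previous example used a corpus that had already been processed: the corpus and the dictionary were already built, and the data had already been cleaned.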

In [53]:
with open('../data/ap.txt', 'r') as f:
    dat = f.readlines()
In [54]:
# The text needs cleaning
dat[:6]
Out[54]:
['<DOC>\n',
 '<DOCNO> AP881218-0003 </DOCNO>\n',
 '<TEXT>\n',
 " A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered on the floor crying ``Jesus save us! God save us!'' Friends and family said the boy apparently was troubled by his grandmother's death and the divorce of his parents and had been tormented by classmates. Nicholas' grandfather, Clarence Elliott Sr., said Saturday that the boy's parents separated about four years ago and his maternal grandmother, Channey Williams, died last year after a long illness. The grandfather also said his grandson was fascinated with guns. ``The boy was always talking about guns,'' he said. ``He knew a lot about them. He knew all the names of them _ none of those little guns like a .32 or a .22 or nothing like that. He liked the big ones.'' The slain teacher was identified as Karen H. Farley, 40. The wounded teacher, 37-year-old Sam Marino, was in serious condition Saturday with gunshot wounds in the shoulder. Police said the boy also shot at a third teacher, Susan Allen, 31, as she fled from the room where Marino was shot. He then shot Marino again before running to a third classroom where a Bible class was meeting. 
The youngster shot the glass out of a locked door before opening fire, police spokesman Lewis Thurston said. When the youth's pistol jammed, he was tackled by teacher Maurice Matteson, 24, and other students, Thurston said. ``Once you see what went on in there, it's a miracle that we didn't have more people killed,'' Police Chief Charles R. Wall said. Police didn't have a motive, Detective Tom Zucaro said, but believe the boy's primary target was not a teacher but a classmate. Officers found what appeared to be three Molotov cocktails in the boy's locker and confiscated the gun and several spent shell casings. Fourteen rounds were fired before the gun jammed, Thurston said. The gun, which the boy carried to school in his knapsack, was purchased by an adult at the youngster's request, Thurston said, adding that authorities have interviewed the adult, whose name is being withheld pending an investigation by the federal Bureau of Alcohol, Tobacco and Firearms. The shootings occurred in a complex of four portable classrooms for junior and senior high school students outside the main building of the 4-year-old school. The school has 500 students in kindergarten through 12th grade. Police said they were trying to reconstruct the sequence of events and had not resolved who was shot first. The body of Ms. Farley was found about an hour after the shootings behind a classroom door.\n",
 ' </TEXT>\n',
 '</DOC>\n']
In [55]:
# If a line starts with '<', drop it
dat[4].strip()[0]
Out[55]:
'<'
In [59]:
# Keep only the text lines: drop empty lines and lines that start with '<'
docs = []
for i in dat:
    line = i.strip()
    if line and not line.startswith('<'):
        docs.append(i)
len(docs)
Out[59]:
2248
In [60]:
# Define a function for further cleaning
def clean_doc(doc):
    doc = doc.replace('.', '').replace(',', '')
    doc = doc.replace('``', '').replace('"', '')
    doc = doc.replace('_', '').replace("'", '')
    doc = doc.replace('!', '')
    return doc
docs = [clean_doc(doc) for doc in docs]
In [61]:
texts = [doc.lower().split() for doc in docs]

Stop words

In [65]:
import nltk
#nltk.download()
# This opens a window: select 'book' and click download; the data can be used once the download finishes.
In [66]:
from nltk.corpus import stopwords
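stop = stopwords.words('english') # if this fails, run the code in the previous block
# Stop words: English has many very frequent words such as a, the, and or; they are usually articles, prepositions, adverbs, or conjunctions.
# Human languages contain many function words, which carry little meaning of their own compared with other words.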
In [67]:
' '.join(stop)
Out[67]:
"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't"
In [68]:
from gensim.parsing.preprocessing import STOPWORDS

' '.join(STOPWORDS)
Out[68]:
'fire four amoungst when even much sixty elsewhere she already thick amount un except that for latter nothing do therefore seem various yet nine never wherever thru rather above give mill go out thence re inc her first onto those found seems me fify ourselves along side third whom meanwhile whatever themselves top himself these noone twenty because almost sometime have in must whenever if anyway during hereafter de it still although how formerly con least he anyone so several among once someone often everything behind at used would alone eleven take cant keep only another whither made amongst who beside while two the him thereafter towards below enough an hundred few of should somehow toward thin therein sometimes ours doing many latterly becoming up nevertheless call twelve hers not mostly any others there about every couldnt too bottom whereafter again some system show this here down but why wherein we anyhow bill across your everywhere them really didn co until say afterwards from kg and nor since indeed before ten off nowhere had also former forty my under could well together other hereupon around regarding am thereby is can a please both either are herein neither no see mine ltd put done now yourself itself whole anywhere us such seeming something ie did move does computer one whereas each eg between find less don etc own moreover cry with next may beforehand over sincere you quite using else cannot into whether however though against interest part detail been front very get per further ever to full fifteen describe serious herself by most or without three thereupon yours via six becomes throughout was yourselves besides unless i whence same whereby be became which anything were then doesn otherwise thus myself their they due has after make fill nobody five always back km everyone hence whereupon perhaps hereby upon whose beyond more as become last whoever where just empty seemed namely will what its somewhere being all through his might than none name on 
hasnt our eight within'
In [69]:
stop.append('said')
In [70]:
# Count the frequency of each word
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
In [71]:
# Remove words that appear only once, as well as stop words
texts = [[token for token in text
          if frequency[token] > 1 and token not in stop]
        for text in texts]
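The effect of the two filters is easier to see on a tiny input. The sketch below repeats the same counting and filtering steps on two made-up token lists, with a one-word list standing in for the NLTK stop words; all sample data are assumptions.

```python
from collections import defaultdict

# Made-up tokenized documents and a tiny stand-in for the stop word list.
texts = [["the", "market", "fell"], ["the", "market", "rose", "sharply"]]
stop = ["the"]

# Count how often each token occurs in the whole corpus.
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep tokens that occur more than once and are not stop words.
filtered = [[token for token in text
             if frequency[token] > 1 and token not in stop]
            for text in texts]
print(filtered)
```

"the" is removed because it is a stop word, and "fell", "rose", and "sharply" are removed because each occurs only once, so only "market" survives in both documents. For a long stop word list, converting it to a set first makes the membership test faster.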
In [72]:
docs[8]
Out[72]:
' Here is a summary of developments in forest and brush fires in Western states:\n'
In [73]:
' '.join(texts[9])
Out[73]:
'stirbois 2 man extreme-right national front party leader jean-marie le pen died saturday automobile accident police 43 stirbois attended political meeting friday city dreux 60 miles west paris traveling toward capital car ran road smashed tree 2:40 police stirbois secretary-general national front member party leadership since 1981 born jan 30 1945 paris held degrees law marketing headed printing business stirbois active several extreme-right political movements joining national front 1977 1982 126 percent vote local elections district west paris highest vote percentage france right-wing candidate year half later election deputy mayor dreux stirbois elected deputy national assembly 1986 lost seat legislative elections last summer national front founded le pen 1972 strongly opposed frances highly centralized bureaucratic government personal taxes favors death penalty priority french citizens jobs stopping immigration first round years presidential elections le pen surprising 144 percent vote worrying many feared national front could awaken racist sentiments'

help(corpora.Dictionary)

Help on class Dictionary in module gensim.corpora.dictionary:

class Dictionary(gensim.utils.SaveLoad, _abcoll.Mapping)

  • Dictionary encapsulates the mapping between normalized words and their integer ids.

  • The main function is doc2bow

    • which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.
In [74]:
dictionary = corpora.Dictionary(texts)
lda_corpus = [dictionary.doc2bow(text) for text in texts]
# The function doc2bow() simply counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result as a sparse vector.
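The sparse vector returned by doc2bow can be imitated in a few lines of plain Python. The sketch below is not gensim's implementation: the name `doc2bow_sketch`, the toy documents, and the way ids are assigned (in order of first appearance) are assumptions, and gensim may assign different ids to the same words.

```python
from collections import Counter

# Made-up tokenized documents.
texts = [["market", "stock", "market"], ["stock", "space"]]

# Assign an integer id to each distinct token in order of first appearance.
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

def doc2bow_sketch(text, token2id):
    """Return a sorted list of (word_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

bows = [doc2bow_sketch(text, token2id) for text in texts]
print(token2id)
print(bows)
```

Only words that actually occur in a document appear in its list, which is why the representation is called sparse.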
In [76]:
NUM_TOPICS = 100
lda_model = models.ldamodel.LdaModel(
    lda_corpus, num_topics=NUM_TOPICS, 
    id2word=dictionary, alpha=None)
/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)
In [77]:
import pyLDAvis.gensim  # in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models

ap_data = pyLDAvis.gensim.prepare(lda_model, lda_corpus, dictionary)

pyLDAvis.enable_notebook()
pyLDAvis.display(ap_data)
Out[77]: