主题模型¶

王成军

wangchengjun@nju.edu.cn

计算传播网 http://computational-communication.com

2014年高考前夕，百度“基于海量作文范文和搜索数据，利用概率主题模型，预测2014年高考作文的命题方向”。如上图所示，共分为了六个主题：时间、生命、民族、教育、心灵、发展。而每个主题下面又包括了一些具体的关键词。比如，生命的主题对应：平凡、自由、美丽、梦想、奋斗、青春、快乐、孤独。

latent Dirichlet allocation (LDA)¶

潜在狄利克雷分配

The simplest topic model (on which all others are based) is latent Dirichlet allocation (LDA).

LDA is a generative model that infers unobserved meanings from a large set of observations.

Reference¶

Blei DM, Ng J, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003; 3: 993–1022.
Blei DM, Lafferty JD. Correction: a correlated topic model of science. Ann Appl Stat. 2007; 1: 634.
Blei DM. Probabilistic topic models. Commun ACM. 2012; 55: 55–65.
Chandra Y, Jiang LC, Wang C-J (2016) Mining Social Entrepreneurship Strategies Using Topic Modeling. PLoS ONE 11(3): e0151342. doi:10.1371/journal.pone.0151342

阅读文献¶

Blei DM. Probabilistic topic models. Commun ACM. 2012; 55: 55–65.¶

LDA（Latent Dirichlet Allocation）是一种文档主题生成模型¶

三层贝叶斯概率模型，包含词、主题和文档三层结构。

所谓生成模型，就是说，我们认为一篇文章的每个词都是通过这样一个过程得到:

以一定概率选择了某个主题，并从这个主题中以一定概率选择某个词语

文档到主题服从多项式分布，主题到词服从多项式分布。

多项式分布（Multinomial Distribution）是二项式分布的推广。¶

二项分布的典型例子是扔硬币，硬币正面朝上概率为p, 重复扔n次硬币，k次为正面的概率即为一个二项分布概率。（严格定义见伯努利实验定义）。
把二项分布公式推广至多种状态，就得到了多项分布。
- 例如在上面例子中1出现k1次，2出现k2次，3出现k3次的概率分布情况。

LDA是一种非监督机器学习技术¶

可以用来识别大规模文档集（document collection）或语料库（corpus）中潜藏的主题信息。

采用了词袋（bag of words）的方法，将每一篇文档视为一个词频向量，从而将文本信息转化为了易于建模的数字信息。
但是词袋方法没有考虑词与词之间的顺序，这简化了问题的复杂性，同时也为模型的改进提供了契机。
每一篇文档代表了一些主题所构成的一个概率分布，而每一个主题又代表了很多单词所构成的一个概率分布。

多项分布的参数服从Dirichlet分布¶

Dirichlet分布是多项分布的参数的分布，被认为是“分布上的分布”。

LDA的名字由来¶

存在两个隐含的Dirichlet分布。

每篇文档对应一个不同的topic分布，服从多项分布
- topic多项分布的参数服从一个Dirichlet分布。
每个topic下存在一个term的多项分布
- term多项分布的参数服从一个Dirichlet分布。

Topic models assume that each document contains a mixture of topics.¶

It is impossible to directly assess the relationships between topics and documents and between topics and terms.

Topics are considered latent/unobserved variables that stand between the documents and terms
What can be directly observed is the distribution of terms over documents, which is known as the document term matrix (DTM).

Topic models algorithmically identify the best set of latent variables (topics) that can best explain the observed distribution of terms in the documents.

The DTM is further decomposed into two matrices：

a term-topic matrix (TTM)
a topic-document matrix (TDM)

Each document can be assigned to a primary topic that demonstrates the highest topic-document probability and can then be linked to other topics with declining probabilities.

Assume K topics are in D documents.

主题在词语上的分布¶

Each topic is denoted with $\phi_{1:K}$，

主题$\phi_K$ 是第k个主题，这个主题表达为一系列的terms。
Each topic is a distribution of fixed words.

主题在文本上的分布¶

The topics proportion in the document d is denoted as $\theta_d$

e.g., the kth topic's proportion in document d is $\theta_{d, k}$.

主题在文本和词上的分配¶

topic models assign topics to a document and its terms.

The topic assigned to document d is denoted as $z_d$,
The topic assigned to the nth term in document d is denoted as $z_{d,n}$.

可以观察到的是？¶

词在文档中的位置，也就是文档-词矩阵（document-term matrix）

Let $w_{d,n}$ denote the nth term in document d.

联合概率分布¶

According to Blei et al. the joint distribution of $\phi_{1:K}$,$\theta_{1:D}$, $z_{1:D}$ and $w_{d, n}$ plus the generative process for LDA can be expressed as:

$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{d, n}) $ =

$\prod_{i=1}^{K} p(\phi_i) \prod_{d =1}^D p(\theta_d)(\prod_{n=1}^N p(z_{d,n} \mid \theta_d) \times p(w_{d, n} \mid \phi_{1:K}, Z_{d, n}) ) $

后验分布¶

Note that $\phi_{1:k},\theta_{1:D},and z_{1:D}$ are latent, unobservable variables. Thus, the computational challenge of LDA is to compute the conditional distribution of them given the observable specific words in the documents $w_{d, n}$.

Accordingly, the posterior distribution of LDA can be expressed as:

$p(\phi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{d, n}) = \frac{p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{d, n})}{p(w_{1:D})}$¶

Because the number of possible topic structures is exponentially large, it is impossible to compute the posterior of LDA.

Topic models aim to develop efficient algorithms to approximate the posterior of LDA. There are two categories of algorithms:

sampling-based algorithms
variational algorithms

Gibbs sampling¶

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult.

Using the Gibbs sampling method, we can build a Markov chain for the sequence of random variables (see Eq 1).

The sampling algorithm is applied to the chain to sample from the limited distribution, and it approximates the posterior.

Gensim: Topic modelling for humans¶

Gensim is developed by Radim Řehůřek,who is a machine learning researcher and consultant in the Czech Republic. We must start by installing it. We can achieve this by running the following command:

pip install gensim¶

In [11]:

%matplotlib inline
from gensim import corpora, models, similarities,  matutils
import matplotlib.pyplot as plt
import numpy as np

Download data¶

~~http://www.cs.princeton.edu/~blei/lda-c/ap.tgz~~

http://www.cs.columbia.edu/~blei/lda-c/

Unzip the data and put them into your folder, e.g., /Users/chengjun/bigdata/ap/

In [3]:

# Load the data
corpus = corpora.BleiCorpus('/Users/datalab/bigdata/ap/ap.dat',\
                            '/Users/datalab/bigdata/ap/vocab.txt')

使用help命令理解corpora.BleiCorpus函数¶

help(corpora.BleiCorpus)

class BleiCorpus(gensim.corpora.indexedcorpus.IndexedCorpus) | Corpus in Blei's LDA-C format. | | The corpus is represented as two files: | one describing the documents, | and another describing the mapping between words and their ids.

In [14]:

# 使用dir看一下有corpus有哪些子函数？
dir(corpus)[-10:]

Out[14]:

['docbyoffset',
 'fname',
 'id2word',
 'index',
 'length',
 'line2doc',
 'load',
 'save',
 'save_corpus',
 'serialize']

In [1]:

# corpus.id2word is a dict which has keys and values, e.g., 
{0: u'i', 1: u'new', 2: u'percent', 3: u'people', 4: u'year', 5: u'two'}

Out[1]:

{0: 'i', 1: 'new', 2: 'percent', 3: 'people', 4: 'year', 5: 'two'}

In [7]:

# transform the dict to list using items()
corpusList = corpus.id2word.items()
list(corpusList)[:3]

Out[7]:

[(0, 'i'), (1, 'new'), (2, 'percent')]

In [9]:

# show the first 5 elements of the list
list(corpusList)[:5]

Out[9]:

[(0, 'i'), (1, 'new'), (2, 'percent'), (3, 'people'), (4, 'year')]

Build the topic model¶

In [10]:

# 设置主题数量
NUM_TOPICS = 100

In [11]:

model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, 
    id2word=corpus.id2word, 
    alpha=None)

/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)

help(models.ldamodel.LdaModel)¶

Help on class LdaModel in module gensim.models.ldamodel:

class LdaModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)

The constructor estimates Latent Dirichlet Allocation model parameters based on a training corpus:

lda = LdaModel(corpus, num_topics=10)

You can then infer topic distributions on new, unseen documents, with

doc_lda = lda[doc_bow]

The model can be updated (trained) with new documents via

lda.update(other_corpus)

In [12]:

# 看一下训练出来的模型有哪些函数？
' '.join(dir(model))

Out[12]:

'__class__ __delattr__ __dict__ __dir__ __doc__ __eq__ __format__ __ge__ __getattribute__ __getitem__ __gt__ __hash__ __init__ __le__ __lt__ __module__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _adapt_by_suffix _apply _load_specials _save_specials _smart_save alpha bound callbacks chunksize clear decay diff dispatcher distributed do_estep do_mstep dtype eta eval_every expElogbeta gamma_threshold get_document_topics get_term_topics get_topic_terms get_topics id2word inference init_dir_prior iterations load log_perplexity minimum_phi_value minimum_probability num_terms num_topics num_updates numworkers offset optimize_alpha optimize_eta passes per_word_topics print_topic print_topics random_state save show_topic show_topics state sync_state top_topics update update_alpha update_eta update_every'

We can see the list of topics a document refers to¶

by using the model[doc] syntax:

In [13]:

document_topics = [model[c] for c in corpus]

In [14]:

# how many topics does one document cover?
# 例如，对于文档2来说，他所覆盖的主题和比例如下：
document_topics[2]

Out[14]:

[(13, 0.04079367),
 (32, 0.49417603),
 (38, 0.050167914),
 (41, 0.028905964),
 (42, 0.015476955),
 (68, 0.013462535),
 (71, 0.16602227),
 (74, 0.041889388),
 (87, 0.12986045),
 (95, 0.015352134)]

In [18]:

# The first topic
# 对于主题0而言，它所对应10个词语和比重如下：
model.show_topic(0, 10)

Out[18]:

[('earth', 0.031918947),
 ('genes', 0.015866058),
 ('atmosphere', 0.013967291),
 ('encounter', 0.011340389),
 ('gravity', 0.010003102),
 ('study', 0.0068522394),
 ('scientists', 0.005652591),
 ('time', 0.0052818577),
 ('make', 0.004897477),
 ('two', 0.004616615)]

In [17]:

# 对于主题0而言，它所对应5个词语和比重如下：
words = model.show_topic(0, 20)
words

Out[17]:

[('earth', 0.031918947),
 ('genes', 0.015866058),
 ('atmosphere', 0.013967291),
 ('encounter', 0.011340389),
 ('gravity', 0.010003102),
 ('study', 0.0068522394),
 ('scientists', 0.005652591),
 ('time', 0.0052818577),
 ('make', 0.004897477),
 ('two', 0.004616615),
 ('sun', 0.004591544),
 ('soviet', 0.0045909504),
 ('produce', 0.004448547),
 ('secretaries', 0.0042900774),
 ('space', 0.0042664623),
 ('i', 0.0042191655),
 ('solar', 0.0042040357),
 ('dukakis', 0.004171154),
 ('day', 0.003936676),
 ('crew', 0.003838289)]

In [19]:

# 对于主题99而言，它所对应10个词语和比重如下：

model.show_topic(99, 10)

Out[19]:

[('states', 0.010365464),
 ('electoral', 0.009793981),
 ('south', 0.0069398303),
 ('communist', 0.0066758585),
 ('party', 0.006626982),
 ('national', 0.006572403),
 ('united', 0.006457128),
 ('military', 0.00630188),
 ('people', 0.005895602),
 ('soviet', 0.0056286734)]

In [20]:

# 模型计算出来的所有的主题当中的第5个是？
model.show_topics(4)

Out[20]:

[(38,
  '0.012*"vernon" + 0.012*"squarefoot" + 0.010*"tickets" + 0.008*"curtis" + 0.008*"dixon" + 0.007*"farrell" + 0.006*"chief" + 0.006*"sony" + 0.006*"i" + 0.006*"carol"'),
 (78,
  '0.018*"bass" + 0.012*"sadat" + 0.010*"mulroney" + 0.009*"air" + 0.008*"canal" + 0.008*"planes" + 0.006*"thurmond" + 0.005*"news" + 0.005*"minimal" + 0.005*"carolinas"'),
 (61,
  '0.027*"committees" + 0.012*"billion" + 0.012*"discount" + 0.010*"million" + 0.010*"kennedy" + 0.008*"nominate" + 0.007*"i" + 0.007*"mca" + 0.007*"defense" + 0.006*"last"'),
 (10,
  '0.021*"stock" + 0.019*"market" + 0.012*"index" + 0.011*"american" + 0.009*"trading" + 0.009*"exchange" + 0.009*"shares" + 0.008*"stocks" + 0.008*"unchanged" + 0.008*"today"')]

In [21]:

for w, f in words:
    print(w, f)

earth 0.031918947
genes 0.015866058
atmosphere 0.013967291
encounter 0.011340389
gravity 0.010003102
study 0.0068522394
scientists 0.005652591
time 0.0052818577
make 0.004897477
two 0.004616615
sun 0.004591544
soviet 0.0045909504
produce 0.004448547
secretaries 0.0042900774
space 0.0042664623
i 0.0042191655
solar 0.0042040357
dukakis 0.004171154
day 0.003936676
crew 0.003838289

Find the most discussed topic¶

i.e., the one with the highest total weight

In [49]:

## Convert corpus into a dense np array 
help(matutils.corpus2dense)

Help on function corpus2dense in module gensim.matutils:

corpus2dense(corpus, num_terms, num_docs=None, dtype=<type 'numpy.float32'>)
    Convert corpus into a dense np array (documents will be columns). You
    must supply the number of features `num_terms`, because dimensionality
    cannot be deduced from the sparse vectors alone.
    
    You can optionally supply `num_docs` (=the corpus length) as well, so that
    a more memory-efficient code path is taken.
    
    This is the mirror function to `Dense2Corpus`.

In [25]:

topics = matutils.corpus2dense(model[corpus], 
                               num_terms=model.num_topics)
topics

Out[25]:

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.01733483],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.07737275, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01533518, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]], dtype=float32)

In [65]:

# Return the sum of the array elements 
help(topics.sum)

Help on built-in function sum:

sum(...)
    a.sum(axis=None, dtype=None, out=None, keepdims=False)
    
    Return the sum of the array elements over the given axis.
    
    Refer to `numpy.sum` for full documentation.
    
    See Also
    --------
    numpy.sum : equivalent function

In [58]:

# 第一个主题的词语总权重
topics[0].sum()

Out[58]:

15.244399

In [26]:

# 将每一个主题的词语总权重算出来
weight = topics.sum(1)
weight

Out[26]:

array([  7.016551  ,   9.330058  ,   4.092443  ,  14.70797   ,
        18.816536  ,  13.84574   ,  25.188553  ,   8.681887  ,
        83.210266  ,  65.948326  ,  48.294598  ,   9.822699  ,
        14.046081  ,  16.374575  ,  12.474185  ,   3.1544456 ,
        15.028612  ,  29.481203  ,  14.048639  ,   5.1747265 ,
        33.916435  ,  14.628782  ,   5.587079  ,  16.012156  ,
        31.394032  ,  22.309895  ,  15.289978  ,  28.24406   ,
        38.468887  ,  37.16824   ,   7.6589146 ,   6.3032856 ,
        45.218246  ,   0.73920536,   8.27928   ,  14.282246  ,
         7.3205595 ,  16.605833  ,  17.57109   ,   4.3485236 ,
        14.937658  ,  18.280533  ,  31.116217  ,   8.81229   ,
        99.35021   ,  15.629625  ,  27.235668  ,   3.6531925 ,
         3.9520464 ,  36.610214  ,   6.665674  ,  17.697351  ,
         8.585338  ,   6.9967422 ,   2.7082832 ,  25.98687   ,
        13.173248  ,  13.529184  ,   6.96301   ,   9.336487  ,
        13.240212  ,  10.740415  ,   4.688093  ,   1.7932636 ,
        17.099922  ,   8.932779  ,  26.018682  ,  18.67585   ,
        18.156227  ,   7.1015162 ,   3.4652896 ,  70.66327   ,
         8.223337  ,   2.9610934 ,  25.065014  ,   1.6192663 ,
        43.515667  ,   9.007435  ,   9.945666  ,   7.46577   ,
        59.848038  ,  22.138157  ,   1.750703  ,   3.57628   ,
         6.4887547 ,   7.9155493 ,  28.058126  ,  34.570457  ,
         6.2709646 ,  16.616875  ,  33.435608  ,   7.1481905 ,
        29.8868    ,  75.7148    ,  51.13218   , 145.18645   ,
        11.686968  , 138.8191    ,   7.7476287 ,  40.25295   ],
      dtype=float32)

In [27]:

# 找到最大值在哪里

help(weight.argmax)

Help on built-in function argmax:

argmax(...) method of numpy.ndarray instance
    a.argmax(axis=None, out=None)
    
    Return indices of the maximum values along the given axis.
    
    Refer to `numpy.argmax` for full documentation.
    
    See Also
    --------
    numpy.argmax : equivalent function

In [28]:

# 找出具有最大权重的主题是哪一个
max_topic = weight.argmax()
print(max_topic)

In [44]:

# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
words = model.show_topic(max_topic, 64)
words = np.array(words).T
words_freq=[float(i)*10000000 for i in words[1]]
words = zip(words[0], words_freq)
# words_dic = {}
# for i, j in words:
#     words_dic[i] = j
words_dic = {i:j for i,j in words}

主题词云¶

In [46]:

from wordcloud import WordCloud

fig = plt.figure(figsize=(15, 8),facecolor='white')

wordcloud = WordCloud().generate_from_frequencies(words_dic)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

每个文档有多少主题¶

In [47]:

# 每个文档有多少主题
num_topics_used = [len(model[doc]) for doc in corpus]

In [48]:

# 画出来每个文档主题数量的直方图

fig,ax = plt.subplots()
ax.hist(num_topics_used, np.arange(27))
ax.set_ylabel('$Number \;of\; documents$', fontsize = 20)
ax.set_xlabel('$Number \;of \;topics$', fontsize = 20)
fig.tight_layout()
#fig.savefig('Figure_04_01.png')

We can see that about 150 documents have 5 topics,¶

while the majority deal with around 10 to 12 of them.
- No document talks about more than 30 topics.

改变超级参数alpha¶

In [49]:

# Now, repeat the same exercise using alpha=1.0
# You can edit the constant below to play around with this parameter
ALPHA = 1.0
model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, 
    alpha=ALPHA)

num_topics_used1 = [len(model1[doc]) for doc in corpus]

/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)

In [50]:

fig,ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('$Number \;of\; documents$', fontsize = 20)
ax.set_xlabel('$Number \;of \;topics$', fontsize = 20)
# The coordinates below were fit by trial and error to look good
plt.text(9, 223, r'default alpha')
plt.text(26, 156, 'alpha=1.0')
fig.tight_layout()

问题：$\alpha$引起主题数量分布的变化意味着什么？¶

从原始文本到主题模型：一个完整的例子¶

刚才的例子使用的是一个已经处理好的语料库，已经构建完整的语料和字典，并清洗好了数据。

In [53]:

with open('../data/ap.txt', 'r') as f:
    dat = f.readlines()

In [54]:

# 需要进行文本清洗
dat[:6]

Out[54]:

['<DOC>\n',
'<DOCNO> AP881218-0003 </DOCNO>\n',
'<TEXT>\n',
" A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered on the floor crying ``Jesus save us! God save us!'' Friends and family said the boy apparently was troubled by his grandmother's death and the divorce of his parents and had been tormented by classmates. Nicholas' grandfather, Clarence Elliott Sr., said Saturday that the boy's parents separated about four years ago and his maternal grandmother, Channey Williams, died last year after a long illness. The grandfather also said his grandson was fascinated with guns. ``The boy was always talking about guns,'' he said. ``He knew a lot about them. He knew all the names of them _ none of those little guns like a .32 or a .22 or nothing like that. He liked the big ones.'' The slain teacher was identified as Karen H. Farley, 40. The wounded teacher, 37-year-old Sam Marino, was in serious condition Saturday with gunshot wounds in the shoulder. Police said the boy also shot at a third teacher, Susan Allen, 31, as she fled from the room where Marino was shot. He then shot Marino again before running to a third classroom where a Bible class was meeting. The youngster shot the glass out of a locked door before opening fire, police spokesman Lewis Thurston said. When the youth's pistol jammed, he was tackled by teacher Maurice Matteson, 24, and other students, Thurston said. ``Once you see what went on in there, it's a miracle that we didn't have more people killed,'' Police Chief Charles R. Wall said. Police didn't have a motive, Detective Tom Zucaro said, but believe the boy's primary target was not a teacher but a classmate. Officers found what appeared to be three Molotov cocktails in the boy's locker and confiscated the gun and several spent shell casings. Fourteen rounds were fired before the gun jammed, Thurston said. The gun, which the boy carried to school in his knapsack, was purchased by an adult at the youngster's request, Thurston said, adding that authorities have interviewed the adult, whose name is being withheld pending an investigation by the federal Bureau of Alcohol, Tobacco and Firearms. The shootings occurred in a complex of four portable classrooms for junior and senior high school students outside the main building of the 4-year-old school. The school has 500 students in kindergarten through 12th grade. Police said they were trying to reconstruct the sequence of events and had not resolved who was shot first. The body of Ms. Farley was found about an hour after the shootings behind a classroom door.\n",
' </TEXT>\n',
'</DOC>\n']

In [55]:

# 如果包含'<'就去掉这一行
dat[4].strip()[0]

Out[55]:

'<'

In [59]:

# 选取前100篇文档
docs = []
for i in dat:
    try:
        if i.strip()[0] != '<':
            docs.append(i)
    except:
        pass
len(docs)

Out[59]:

In [60]:

# 定义一个函数，进一步清洗
def clean_doc(doc):
    doc = doc.replace('.', '').replace(',', '')
    doc = doc.replace('``', '').replace('"', '')
    doc = doc.replace('_', '').replace("'", '')
    doc = doc.replace('!', '')
    return doc
docs = [clean_doc(doc) for doc in docs]

In [61]:

texts = [[i for i in doc.lower().split()] for doc in docs]

停用词¶

In [65]:

import nltk
#nltk.download()
# 会打开一个窗口，选择book，download，待下载完毕就可以使用了。

In [66]:

from nltk.corpus import stopwords
stop = stopwords.words('english') # 如果此处出错，请执行上一个block的代码
# 停用词stopword：在英语里面会遇到很多a,the,or等使用频率很多的字或词,常为冠词、介词、副词或连词等。
# 人类语言包含很多功能词。与其他词相比，功能词没有什么实际含义。

In [67]:

' '.join(stop)

Out[67]:

"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't"

In [68]:

from gensim.parsing.preprocessing import STOPWORDS

' '.join(STOPWORDS)

Out[68]:

'fire four amoungst when even much sixty elsewhere she already thick amount un except that for latter nothing do therefore seem various yet nine never wherever thru rather above give mill go out thence re inc her first onto those found seems me fify ourselves along side third whom meanwhile whatever themselves top himself these noone twenty because almost sometime have in must whenever if anyway during hereafter de it still although how formerly con least he anyone so several among once someone often everything behind at used would alone eleven take cant keep only another whither made amongst who beside while two the him thereafter towards below enough an hundred few of should somehow toward thin therein sometimes ours doing many latterly becoming up nevertheless call twelve hers not mostly any others there about every couldnt too bottom whereafter again some system show this here down but why wherein we anyhow bill across your everywhere them really didn co until say afterwards from kg and nor since indeed before ten off nowhere had also former forty my under could well together other hereupon around regarding am thereby is can a please both either are herein neither no see mine ltd put done now yourself itself whole anywhere us such seeming something ie did move does computer one whereas each eg between find less don etc own moreover cry with next may beforehand over sincere you quite using else cannot into whether however though against interest part detail been front very get per further ever to full fifteen describe serious herself by most or without three thereupon yours via six becomes throughout was yourselves besides unless i whence same whereby be became which anything were then doesn otherwise thus myself their they due has after make fill nobody five always back km everyone hence whereupon perhaps hereby upon whose beyond more as become last whoever where just empty seemed namely will what its somewhere being all through his might than none name on hasnt our eight within'

In [69]:

stop.append('said')

In [70]:

# 计算每一个词的频数
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

In [71]:

# 去掉只出现一次的词和
texts = [[token for token in text \
          if frequency[token] > 1 and token not in stop]
        for text in texts]

In [72]:

docs[8]

Out[72]:

' Here is a summary of developments in forest and brush fires in Western states:\n'

In [73]:

' '.join(texts[9])

Out[73]:

'stirbois 2 man extreme-right national front party leader jean-marie le pen died saturday automobile accident police 43 stirbois attended political meeting friday city dreux 60 miles west paris traveling toward capital car ran road smashed tree 2:40 police stirbois secretary-general national front member party leadership since 1981 born jan 30 1945 paris held degrees law marketing headed printing business stirbois active several extreme-right political movements joining national front 1977 1982 126 percent vote local elections district west paris highest vote percentage france right-wing candidate year half later election deputy mayor dreux stirbois elected deputy national assembly 1986 lost seat legislative elections last summer national front founded le pen 1972 strongly opposed frances highly centralized bureaucratic government personal taxes favors death penalty priority french citizens jobs stopping immigration first round years presidential elections le pen surprising 144 percent vote worrying many feared national front could awaken racist sentiments'

help(corpora.Dictionary)¶

Help on class Dictionary in module gensim.corpora.dictionary:

class Dictionary(gensim.utils.SaveLoad, _abcoll.Mapping)

Dictionary encapsulates the mapping between normalized words and their integer ids.
The main function is doc2bow
- which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.

In [74]:

dictionary = corpora.Dictionary(texts)
lda_corpus = [dictionary.doc2bow(text) for text in texts]
# The function doc2bow() simply counts the number of occurences of each distinct word, 
# converts the word to its integer word id and returns the result as a sparse vector.

In [76]:

NUM_TOPICS = 100
lda_model = models.ldamodel.LdaModel(
    lda_corpus, num_topics=NUM_TOPICS, 
    id2word=dictionary, alpha=None)

/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)

使用pyLDAvis可视化主题模型¶

http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb

pip install pyldavis¶

In [77]:

import pyLDAvis.gensim

ap_data = pyLDAvis.gensim.prepare(lda_model, lda_corpus, dictionary)

pyLDAvis.enable_notebook()
pyLDAvis.display(ap_data)

Out[77]:

In [102]:

#pyLDAvis.show(ap_data)

In [79]:

pyLDAvis.save_html(ap_data, '../vis/ap_ldavis.html')

对2016年政府工作报告建立主题模型¶

pip install jieba¶

https://github.com/fxsjy/jieba

pip install wordcloud¶

https://github.com/amueller/word_cloud

pip install gensim¶

In [80]:

import gensim
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import matplotlib
matplotlib.rc("savefig", dpi=400)
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei'] #指定默认字体

In [113]:

from selenium import webdriver
# import selenium.webdriver.support.ui as ui
browser = webdriver.Chrome()
browser.get("http://news.xinhuanet.com/fortune/2016-03/05/c_128775704.htm") #需要翻墙打开网址
ilis = browser.find_elements_by_tag_name("p")
gov_report_2016  = [i.text for i in ilis]

In [114]:

gov_report_2016[0]

Out[114]:

'政府工作报告'

In [1]:

import requests
from bs4 import BeautifulSoup
import sys

url2016 = 'http://news.xinhuanet.com/fortune/2016-03/05/c_128775704.htm'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
content = requests.get(url2016, headers = headers)
content.encoding = 'utf8'
soup = BeautifulSoup(content.text, 'lxml')

In [2]:

gov_report_2016 = [s.text for s in soup('p')]
for i in gov_report_2016[:10]:
    print(i)

政府工作报告
——2016年3月5日在第十二届全国人民代表大会第四次会议上
国务院总理 李克强
各位代表：
　　现在，我代表国务院，向大会报告政府工作，请予审议，并请全国政协各位委员提出意见。
　　一、2015年工作回顾
　　过去一年，我国发展面临多重困难和严峻挑战。在以习近平同志为总书记的党中央坚强领导下，全国各族人民以坚定的信心和非凡的勇气，攻坚克难，开拓进取，经济社会发展稳中有进、稳中有好，完成了全年主要目标任务，改革开放和社会主义现代化建设取得新的重大成就。
　　——经济运行保持在合理区间。国内生产总值达到67.7万亿元，增长6.9%，在世界主要经济体中位居前列。粮食产量实现"十二连增"，居民消费价格涨幅保持较低水平。特别是就业形势总体稳定，城镇新增就业1312万人，超过全年预期目标，成为经济运行的一大亮点。
　　——结构调整取得积极进展。服务业在国内生产总值中的比重上升到50.5%，首次占据"半壁江山"。消费对经济增长的贡献率达到66.4%。高技术产业和装备制造业增速快于一般工业。单位国内生产总值能耗下降5.6%。
　　——发展新动能加快成长。创新驱动发展战略持续推进，互联网与各行业加速融合，新兴产业快速增长。大众创业、万众创新蓬勃发展，全年新登记注册企业增长21.6%，平均每天新增1.2万户。新动能对稳就业、促升级发挥了突出作用，正在推动经济社会发生深刻变革。

In [3]:

def clean_txt(txt):
    for i in [u'、', u'，', u'—', u'！', u'。', u'《', u'》', u'（', u'）']:
        txt = txt.replace(i, ' ')
    return txt

In [4]:

gov_report_2016 = [clean_txt(i) for i in gov_report_2016]

In [5]:

len(gov_report_2016)

Out[5]:

In [6]:

for i in gov_report_2016[:10]:
    print(i)

政府工作报告
  2016年3月5日在第十二届全国人民代表大会第四次会议上
国务院总理 李克强
各位代表：
　　现在 我代表国务院 向大会报告政府工作 请予审议 并请全国政协各位委员提出意见 
　　一 2015年工作回顾
　　过去一年 我国发展面临多重困难和严峻挑战 在以习近平同志为总书记的党中央坚强领导下 全国各族人民以坚定的信心和非凡的勇气 攻坚克难 开拓进取 经济社会发展稳中有进 稳中有好 完成了全年主要目标任务 改革开放和社会主义现代化建设取得新的重大成就 
　　  经济运行保持在合理区间 国内生产总值达到67.7万亿元 增长6.9% 在世界主要经济体中位居前列 粮食产量实现"十二连增" 居民消费价格涨幅保持较低水平 特别是就业形势总体稳定 城镇新增就业1312万人 超过全年预期目标 成为经济运行的一大亮点 
　　  结构调整取得积极进展 服务业在国内生产总值中的比重上升到50.5% 首次占据"半壁江山" 消费对经济增长的贡献率达到66.4% 高技术产业和装备制造业增速快于一般工业 单位国内生产总值能耗下降5.6% 
　　  发展新动能加快成长 创新驱动发展战略持续推进 互联网与各行业加速融合 新兴产业快速增长 大众创业 万众创新蓬勃发展 全年新登记注册企业增长21.6% 平均每天新增1.2万户 新动能对稳就业 促升级发挥了突出作用 正在推动经济社会发生深刻变革

In [7]:

len(gov_report_2016[5:-1])

Out[7]:

In [8]:

filename = '../data/stopwords.txt'
stopwords = {}
f = open(filename, 'r')
line = f.readline().rstrip()
while line:
    stopwords.setdefault(line, 0)
    stopwords[line] = 1
    line = f.readline().rstrip()
f.close()

In [131]:

stopwords.items()

Out[131]:

dict_items([('的', 1), ('#', 1), ('+', 1), ('也', 1), (')', 1), ('说', 1), ('年', 1), ('6', 1), ('了', 1), ('//', 1), ('不', 1), ('./', 1), ('等', 1), ('9', 1), ('&', 1), ('!', 1), ('月', 1), ('(', 1), ('.数', 1), ('.', 1), ('%', 1), ('.一', 1), ('：', 1), ('8', 1), ('..', 1), (':', 1), ('2', 1), ('在', 1), ('$', 1), ('.日', 1), ('中', 1), ('3', 1), ('5', 1), ('-', 1), ('1', 1), ('......', 1), ('7', 1), ('日', 1), ('是', 1), ('"', 1), ('--', 1), ('时', 1), ('0', 1), ('...', 1), (',', 1), ('和', 1), ('...................', 1), ('/', 1), ('4', 1), ('名', 1), ('*', 1), ("'", 1)])

In [9]:

adding_stopwords = [u'我们', u'要', u'地', u'有', u'这', u'人',
                    u'发展',u'建设',u'加强',u'继续',u'对',u'等',
                    u'推进',u'工作',u'增加']
for s in adding_stopwords: stopwords[s]=10

In [10]:

import jieba.analyse

def cleancntxt(txt, stopwords):
    tfidf1000= jieba.analyse.extract_tags(txt, topK=1000, withWeight=False)
    seg_generator = jieba.cut(txt, cut_all=False)
    seg_list = [i for i in seg_generator if i not in stopwords]
    seg_list = [i for i in seg_list if i != u' ']
    seg_list = [i for i in seg_list if i in tfidf1000]
    return(seg_list)

In [12]:

def getCorpus(data):
    processed_docs = [tokenize(doc) for doc in data]
    word_count_dict = gensim.corpora.Dictionary(processed_docs)
    print ("In the corpus there are", len(word_count_dict), "unique tokens")
    word_count_dict.filter_extremes(no_below=5, no_above=0.2) # word must appear >5 times, and no more than 10% documents
    print ("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")
    bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
    return bag_of_words_corpus, word_count_dict


def getCnCorpus(data):
    processed_docs = [cleancntxt(doc) for doc in data]
    word_count_dict = gensim.corpora.Dictionary(processed_docs)
    print ("In the corpus there are", len(word_count_dict), "unique tokens")
    #word_count_dict.filter_extremes(no_below=5, no_above=0.2) 
    # word must appear >5 times, and no more than 10% documents
    print ("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")
    bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
    return bag_of_words_corpus, word_count_dict

In [13]:

def inferTopicNumber(bag_of_words_corpus, num, word_count_dict):
    lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=num, id2word=word_count_dict, passes=10)
    _ = lda_model.print_topics(-1) #use _ for throwaway variables.
    logperplexity = lda_model.log_perplexity(bag_of_words_corpus)
    return logperplexity

def fastInferTopicNumber(bag_of_words_corpus, num, word_count_dict):
    lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=bag_of_words_corpus, num_topics=num, \
                                                        id2word=word_count_dict,\
                                                        workers=None, chunksize=2000, passes=2, \
                                                        batch=False, alpha='symmetric', eta=None, \
                                                        decay=0.5, offset=1.0, eval_every=10, \
                                                        iterations=50, gamma_threshold=0.001, random_state=None)
    _ = lda_model.print_topics(-1) #use _ for throwaway variables.
    logperplexity = lda_model.log_perplexity(bag_of_words_corpus)
    return logperplexity

In [136]:

import jieba.analyse

jieba.add_word(u'屠呦呦', freq=None, tag=None)
#del_word(word) 

print (' '.join(cleancntxt(u'屠呦呦获得了诺贝尔医学奖。', stopwords)))

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8b/hhnbt0nd4zsg2qhxc28q23w80000gn/T/jieba.cache
Loading model cost 0.716 seconds.
Prefix dict has been built succesfully.

屠呦呦 获得 诺贝尔 医学奖

In [14]:

import gensim

processed_docs = [cleancntxt(doc, stopwords) for doc in gov_report_2016[5:-1]]
word_count_dict = gensim.corpora.Dictionary(processed_docs)
print ("In the corpus there are", len(word_count_dict), "unique tokens")
# word_count_dict.filter_extremes(no_below=5, no_above=0.2) # word must appear >5 times, and no more than 10% documents
# print "After filtering, in the corpus there are only", len(word_count_dict), "unique tokens"
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/8b/hhnbt0nd4zsg2qhxc28q23w80000gn/T/jieba.cache
Loading model cost 1.347 seconds.
Prefix dict has been built succesfully.

In the corpus there are 2632 unique tokens

In [15]:

tfidf = models.TfidfModel(bag_of_words_corpus )
corpus_tfidf = tfidf[bag_of_words_corpus ]
#lda_model = gensim.models.LdaModel(corpus_tfidf, num_topics=20, id2word=word_count_dict, passes=10)
lda_model = gensim.models.LdaMulticore(corpus_tfidf, 
                                       num_topics=20, 
                                       id2word=word_count_dict, 
                                       passes=10)

In [141]:

perplexity_list = [inferTopicNumber(bag_of_words_corpus, num, word_count_dict) for num in [5, 10, 15, 20, 25, 30,40, 50, 60 ]]

In [142]:

plt.plot([5, 10, 15, 20, 25, 30,40,50,60 ], perplexity_list)
plt.show()

In [16]:

lda_model.print_topics(3)

Out[16]:

[(8,
  '0.007*"改革" + 0.006*"创新" + 0.005*"企业" + 0.005*"实施" + 0.005*"就业" + 0.005*"政策" + 0.005*"加快" + 0.005*"促进" + 0.005*"扩大" + 0.005*"农业"'),
 (1,
  '0.005*"八个" + 0.004*"财政赤字" + 0.003*"今年" + 0.003*"范围" + 0.003*"做好" + 0.003*"重点" + 0.003*"方面" + 0.002*"财政支出" + 0.002*"安排" + 0.002*"干事"'),
 (6,
  '0.006*"以下" + 0.005*"回顾" + 0.005*"2015" + 0.004*"主要" + 0.004*"农村" + 0.004*"脱贫" + 0.004*"一年" + 0.003*"扶贫" + 0.002*"万公里" + 0.002*"支撑"')]

In [17]:

topictermlist = lda_model.print_topics(-1)
top_words = [[j.split('*')[1] for j in i[1].split(' + ')] for i in topictermlist] 
for i in top_words: 
    print (" ".join(i) )

"需求" "领导人" "论坛" "自贸" "峰会" "住房" "亿美元" "合作" "国际" "一手"
"八个" "财政赤字" "今年" "范围" "做好" "重点" "方面" "财政支出" "安排" "干事"
"非公有制" "动能" "强国" "各种" "城镇化" "经济" "中心" "创新型" "动力" "战略"
"经济运行" "向心力" "中华儿女" "侨眷" "侨务" "海内外" "归侨" "不断" "事故" "维护"
"宗教" "合作" "竞争" "动能" "产能" "对外开放" "深刻" "审批" "经济" "关系"
"随机" "行政" "清单" "事项" "公布" "监管" "审批" "收费" "价格" "政府"
"以下" "回顾" "2015" "主要" "农村" "脱贫" "一年" "扶贫" "万公里" "支撑"
"各位" "代表" "建成" "基本" "党风廉政" "万元" "现代" "贸易" "阶段" "实现"
"改革" "创新" "企业" "实施" "就业" "政策" "加快" "促进" "扩大" "农业"
"两岸" "环保" "节能" "我国" "取得" "亿多" "经济" "国内" "调控" "生产总值"
"地方" "接受" "项目" "合作" "政府" "支付" "存款" "产能" "充分发挥" "人民币"
"增收" "农民" "渠道" "毫不放松" "富农" "农惠农" "无穷" "征程" "优越性" "显示"
"合作" "经济" "增长" "经贸合作" "居民" "强力" "下决心" "环境治理" "双赢" "事关"
"民生" "感谢" "表示" "诚挚" "一些" "存在" "同胞" "为政之道" "之忧" "一件"
"十三" "消费" "融资" "港澳" "时期" "地方" "目标" "金融" "货币政策" "海洋"
"民族" "人民" "世界反法西斯战争" "充满" "救助" "更加" "特色" "美好" "前景" "将会"
"军队" "协定" "谈判" "自贸区" "国防" "战略" "强军" "商签" "各方" "亚太"
"消除" "电信" "合法权益" "土地" "壁垒" "天然气" "市政" "核准" "民营企业" "其长"
"2016" "重点" "国民经济" "第十三个" "政府" "规划" "五年" "城乡" "城镇化" "全民"
"国有企业" "出口" "国有资产" "贸易" "进口" "外贸" "职能" "优化" "退税" "整合"

In [155]:

top_words_shares = [[j.split('*')[0] for j in i[1].split(' + ')] for i in topictermlist] 
top_words_shares = [[float(j) for j in i] for i in top_words_shares]
def weightvalue(x):
    return (x - np.min(top_words_shares))*40/(np.max(top_words_shares) -np.min(top_words_shares)) + 10
 
top_words_shares = [[weightvalue(j) for j in i] for i in top_words_shares]  

def plotTopics(mintopics, maxtopics):
    num_top_words = 10
    plt.rcParams['figure.figsize'] = (20.0, 8.0)  
    n = 0
    for t in range(mintopics , maxtopics):
        plt.subplot(2, 15, n + 1)  # plot numbering starts with 1
        plt.ylim(0, num_top_words)  # stretch the y-axis to accommodate the words
        plt.xticks([])  # remove x-axis markings ('ticks')
        plt.yticks([]) # remove y-axis markings ('ticks')
        plt.title(u'主题 #{}'.format(t+1), size = 15)
        words = top_words[t][0:num_top_words ]
        words_shares = top_words_shares[t][0:num_top_words ]
        for i, (word, share) in enumerate(zip(words, words_shares)):
            plt.text(0.05, num_top_words-i-0.9, word, fontsize= np.log(share*1000))
        n += 1

In [156]:

top_words_shares

Out[156]:

[[18.0, 16.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
 [20.0, 18.0, 18.0, 18.0, 16.0, 16.0, 16.0, 14.0, 14.0, 14.0],
 [22.0, 18.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0],
 [18.0, 16.0, 16.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
 [18.0, 16.0, 16.0, 16.0, 16.0, 16.0, 14.0, 14.0, 14.0, 14.0],
 [16.0, 16.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
 [16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 14.0, 14.0, 14.0, 14.0],
 [18.0, 16.0, 16.0, 16.0, 16.0, 16.0, 14.0, 14.0, 14.0, 14.0],
 [20.0, 18.0, 16.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
 [16.0, 16.0, 14.0, 14.0, 14.0, 14.0, 14.0, 12.0, 12.0, 12.0],
 [50.0, 50.0, 24.0, 20.0, 20.0, 20.0, 20.0, 20.0, 18.0, 18.0],
 [14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
 [14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
 [18.0, 18.0, 16.0, 16.0, 16.0, 16.0, 16.0, 14.0, 14.0, 14.0],
 [16.0, 16.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
 [18.0, 16.0, 16.0, 16.0, 16.0, 16.0, 14.0, 14.0, 14.0, 14.0],
 [18.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
 [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0],
 [22.0, 18.0, 18.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0],
 [20.0, 16.0, 16.0, 16.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0]]

In [157]:

plotTopics(0, 10)

In [158]:

plotTopics(10, 20)

对宋词进行主题分析初探¶

宋词数据下载 http://cos.name/wp-content/uploads/2011/03/SongPoem.tar.gz

In [165]:

import pandas as pd

In [166]:

pdf = pd.read_csv('data/SongPoem.csv', encoding = 'gb18030')

pdf[:3]

Out[166]:

	Page	Author	Title	Title2	Sentence
0	0001.1	和岘	导引	导引	气和玉烛，叡化着鸿明。缇管一阳生。郊禋盛礼燔柴毕，旋轸凤凰城。森罗仪卫振华缨。载路溢欢声。皇...
1	0001.2	和岘	六州	六州	严夜警，铜莲漏迟迟。清禁肃，森陛戟，羽卫俨皇闱。角声励，钲鼓攸宜。金管成雅奏，逐吹逶迤。荐苍...
2	0001.3	和岘	十二时	忆少年	承宝运，驯致隆平。鸿庆被寰瀛。时清俗阜，治定功成。遐迩咏由庚。严郊祀，文物声明。会天正、星拱...

In [167]:

len(pdf)

Out[167]:

In [168]:

poems = pdf.Sentence

In [178]:

import gensim

processed_docs = [cleancntxt(doc, stopwords) for doc in poems]
word_count_dict = gensim.corpora.Dictionary(processed_docs)
print ("In the corpus there are", len(word_count_dict), "unique tokens")
# word_count_dict.filter_extremes(no_below=5, no_above=0.2) # word must appear >5 times, and no more than 10% documents
# print "After filtering, in the corpus there are only", len(word_count_dict), "unique tokens"
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]

In the corpus there are 147177 unique tokens

In [126]:

tfidf = models.TfidfModel(bag_of_words_corpus )
corpus_tfidf = tfidf[bag_of_words_corpus ]
lda_model = gensim.models.LdaModel(corpus_tfidf, num_topics=20, id2word=word_count_dict, passes=10)

In [170]:

# 使用并行LDA加快处理速度。 
lda_model2 = gensim.models.ldamulticore.LdaMulticore(corpus=None, num_topics=20, id2word=word_count_dict,\
                                        workers=None, chunksize=2000, passes=1, \
                                        batch=False, alpha='symmetric', eta=None, \
                                        decay=0.5, offset=1.0, eval_every=10, \
                                        iterations=50, gamma_threshold=0.001, random_state=None)

In [171]:

lda_model2.print_topics(3)

Out[171]:

[(11,
  '0.000*"正元" + 0.000*"任翁" + 0.000*"尝似" + 0.000*"穆满" + 0.000*"奠酒" + 0.000*"无鹤驭" + 0.000*"歌始断" + 0.000*"张翰" + 0.000*"子建" + 0.000*"止酒"'),
 (15,
  '0.000*"簪萸" + 0.000*"偷眼" + 0.000*"耳边" + 0.000*"怨女" + 0.000*"安车" + 0.000*"野芳" + 0.000*"寿仙堂" + 0.000*"多会少" + 0.000*"催角" + 0.000*"长步障"'),
 (9,
  '0.000*"好满" + 0.000*"共吟" + 0.000*"亡赖" + 0.000*"圆处" + 0.000*"移琼步" + 0.000*"履恐" + 0.000*"横冰" + 0.000*"侵坐" + 0.000*"更乱山" + 0.000*"讲清"')]

In [172]:

topictermlist = lda_model2.print_topics(-1)
top_words = [[j.split('*')[1] for j in i[1].split(' + ')] for i in topictermlist] 
for k, i in enumerate(top_words): 
    print (k+1, " ".join(i) )

1 "冲凌翠麓" "伸剖" "隐几" "听夷甫" "酝酒" "欲雨能" "河沙" "双麟" "旋买村" "风透幕"
2 "北畔" "侵云里" "撩绿鬓" "遐算" "棱棱霜气" "寿卉荐" "夜落" "期腊" "肩久" "春锁碧湾"
3 "壮伟" "凤板" "春近" "收云卷" "通流" "呈芳" "畜犬防" "就妆镜" "折露" "生面"
4 "女霜娥" "画饼充饥" "虽乐" "权相" "既玉" "遗一老" "雁稀" "战退夜" "君侯当" "西园路"
5 "华省" "挥豪闲" "鹏运" "折翼" "真诗伯" "厨人独" "几所" "光接" "最为" "老苍"
6 "休折" "困无语" "催风恼" "花台" "蛾黛浅" "动离" "晚凉" "云迷章" "趁凉" "似讶"
7 "说好" "夜雪春霏" "何年" "紫笛" "碎玉" "已闻赐玺" "社舞" "前楚" "事幽" "昏人静"
8 "自随" "生影" "枕冷衾寒" "橘丸" "怀公处" "忻忻" "岳生贤" "戒格" "同饮" "嗟暑"
9 "如宾处" "茅亭" "水凿" "寻坎" "宫髻" "趁水" "紧寒侵" "窗秋晓" "修竹泪" "雨踪云迹"
10 "好满" "共吟" "亡赖" "圆处" "移琼步" "履恐" "横冰" "侵坐" "更乱山" "讲清"
11 "终难近" "乌爰止" "游意" "甚大" "此木生" "定缘" "严玉" "傍檐" "病休" "种槐"
12 "正元" "任翁" "尝似" "穆满" "奠酒" "无鹤驭" "歌始断" "张翰" "子建" "止酒"
13 "出清冰" "足变" "竹东" "鸣线" "一觉" "临淮夜" "耐寒" "散银原" "招牌" "玉流翠凝"
14 "旧欢" "晓寒闲" "四无" "睫中争" "散子" "夜寒重" "盘空擎" "头如漆" "宵景" "碧圆"
15 "背银" "独输" "还阁" "老妇" "笑拟" "弗见" "遥障" "两带" "懒约" "感泪"
16 "簪萸" "偷眼" "耳边" "怨女" "安车" "野芳" "寿仙堂" "多会少" "催角" "长步障"
17 "文星" "念柳外" "宿酒" "于春懒" "晋时" "何郎晕" "青红" "斗顿" "灵梭" "弹古瑟"
18 "墨争" "花情" "仙韶" "昼锦如" "水减" "火怯" "市上" "则甚" "青女肃" "早瘦"
19 "何肯" "秋灯吟" "梅生" "未若" "池间" "寻欢" "正危" "长淡净" "心萦" "看霞"
20 "犹未绝" "红素" "垂白" "秀异" "早谙" "正令此" "春市" "筵酒" "菜传" "如幂"

In [175]:

perplexity_list = [fastInferTopicNumber(bag_of_words_corpus, num, word_count_dict) for num in [5, 15, 20, 25, 30, 35, 40 ]]

In [176]:

plt.plot([5, 15, 20, 25, 30, 35, 40], perplexity_list)
plt.show()

In [180]:

# import pyLDAvis.gensim

# song_data = pyLDAvis.gensim.prepare(lda_model, bag_of_words_corpus, word_count_dict)

In [141]:

pyLDAvis.enable_notebook()
pyLDAvis.show(song_data)

Note: if you're in the IPython notebook, pyLDAvis.show() is not the best command
      to use. Consider using pyLDAvis.display(), or pyLDAvis.enable_notebook().
      See more information at http://pyLDAvis.github.io/quickstart.html .

You must interrupt the kernel to end this command

Serving to http://127.0.0.1:8889/    [Ctrl-C to exit]

127.0.0.1 - - [21/Sep/2017 22:29:23] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [21/Sep/2017 22:29:23] "GET /LDAvis.css HTTP/1.1" 200 -
127.0.0.1 - - [21/Sep/2017 22:29:23] "GET /d3.js HTTP/1.1" 200 -
127.0.0.1 - - [21/Sep/2017 22:29:23] "GET /LDAvis.js HTTP/1.1" 200 -

stopping Server...

阅读材料¶

Willi Richert, Luis Pedro Coelho, 2013, Building Machine Learning Systems with Python. Chapter 4. Packt Publishing.

LDA Experiments on the English Wikipedia https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation

东风夜放花千树：对宋词进行主题分析初探 https://chengjunwang.com/zh/post/cn/2013-09-27-topic-modeling-of-song-peom/

Chandra Y, Jiang LC, Wang C-J (2016) Mining Social Entrepreneurship Strategies Using Topic Modeling. PLoS ONE 11(3): e0151342. doi:10.1371/journal.pone.0151342

https://rare-technologies.com/tutorial-on-mallet-in-python/