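On the eve of the 2014 gaokao (China's national college entrance examination), Baidu "used massive collections of model essays and search data, together with a probabilistic topic model, to predict the direction of the 2014 gaokao essay prompts." As shown in the figure above, the predictions fell into six themes: time, life, nation, education, mind, and development. Each theme in turn covers a set of specific keywords. For example, the theme of life corresponds to: ordinary, freedom, beauty, dreams, struggle, youth, happiness, and loneliness.
Latent Dirichlet Allocation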
The simplest topic model (on which all others are based) is latent Dirichlet allocation (LDA).
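LDA is a generative model: we assume that every word in a document is produced by the following process:
With some probability a topic is chosen, and then with some probability a word is chosen from that topic.
LDA can be used to uncover the latent topic information in a large document collection or corpus.
Two Dirichlet distributions are implicit in the model: one over each document's topic proportions and one over each topic's word probabilities.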
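The generative story above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the vocabulary, the number of topics, the document length, and the hyperparameter names `alpha` and `eta` are all hypothetical choices, not values taken from the data used later.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) with `size` components."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(probs, rng):
    """Pick an index with probability proportional to probs."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def generate_document(vocab, num_topics, doc_length, alpha, eta, rng):
    # One word distribution per topic (first Dirichlet, parameter eta).
    topics = [sample_dirichlet(eta, len(vocab), rng) for _ in range(num_topics)]
    # One topic mixture for this document (second Dirichlet, parameter alpha).
    theta = sample_dirichlet(alpha, num_topics, rng)
    words = []
    for _ in range(doc_length):
        z = sample_index(theta, rng)      # choose a topic for this position
        w = sample_index(topics[z], rng)  # choose a word from that topic
        words.append(vocab[w])
    return theta, words

rng = random.Random(42)
vocab = ['time', 'life', 'nation', 'education', 'mind', 'development']  # hypothetical vocabulary
theta, words = generate_document(vocab, num_topics=3, doc_length=8,
                                 alpha=0.5, eta=0.5, rng=rng)
print(theta)
print(words)
```

Fitting LDA runs this process in reverse: only `words` is observed, and the topic mixture and the topics themselves have to be inferred.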
It is impossible to directly assess the relationships between topics and documents and between topics and terms.
Topics are considered latent (unobserved) variables that stand between the documents and the terms.
What can be directly observed is the distribution of terms over documents, which is known as the document term matrix (DTM).
Topic models algorithmically identify the best set of latent variables (topics) that can best explain the observed distribution of terms in the documents.
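As a small illustration of what the DTM looks like, the sketch below builds one from three tiny tokenized documents. The documents and the helper name `build_dtm` are hypothetical; gensim stores the same information sparsely rather than as dense rows.

```python
from collections import Counter

def build_dtm(tokenized_docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

# Hypothetical toy documents, already tokenized.
toy_docs = [['police', 'school', 'police'],
            ['percent', 'year', 'percent', 'tax'],
            ['school', 'year']]
vocab, dtm = build_dtm(toy_docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row is a document and each column a term, so the matrix records exactly the observed distribution of terms over documents.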
The DTM is further decomposed into two matrices: a document-topic matrix and a topic-term matrix.
Each document can be assigned to a primary topic that demonstrates the highest topic-document probability and can then be linked to other topics with declining probabilities.
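Assume there are K topics across D documents.
Each topic is denoted by $\phi_k$ ($k = 1, \dots, K$), a distribution over the vocabulary; together the topics are written $\phi_{1:K}$.
The topic proportions in document d are denoted $\theta_d$.
Topic models assign topics to a document and its terms: $z_{d,n}$ is the topic assigned to the nth term of document d, and $w_{d,n}$ is that observed term.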
According to Blei et al., the generative process of LDA corresponds to the following joint distribution of $\phi_{1:K}$, $\theta_{1:D}$, $z_{1:D}$, and $w_{d,n}$:
$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{d,n}) = \prod_{i=1}^{K} p(\phi_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \phi_{1:K}, z_{d,n}) \right) $
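To make the factorization concrete, the sketch below evaluates the logarithm of this joint distribution for a tiny, fully specified configuration. It assumes symmetric Dirichlet priors with hypothetical parameters `alpha` (for $\theta_d$) and `eta` (for $\phi_i$); the function names are illustrative only.

```python
import math

def log_dirichlet_pdf(x, alpha):
    """Log density of a symmetric Dirichlet(alpha) at the probability vector x."""
    k = len(x)
    log_norm = math.lgamma(k * alpha) - k * math.lgamma(alpha)
    return log_norm + sum((alpha - 1.0) * math.log(v) for v in x)

def log_joint(phi, theta, z, w, alpha, eta):
    """log p(phi, theta, z, w), following the factorization term by term."""
    total = sum(log_dirichlet_pdf(p, eta) for p in phi)  # prod_i p(phi_i)
    for d in range(len(w)):
        total += log_dirichlet_pdf(theta[d], alpha)      # p(theta_d)
        for n in range(len(w[d])):
            k = z[d][n]
            total += math.log(theta[d][k])               # p(z_dn | theta_d)
            total += math.log(phi[k][w[d][n]])           # p(w_dn | phi, z_dn)
    return total

# Two topics over a two-word vocabulary, one document with two words.
phi = [[0.5, 0.5], [0.5, 0.5]]
theta = [[0.5, 0.5]]
z = [[0, 1]]   # topic assigned to each word position
w = [[0, 1]]   # word ids observed at each position
value = log_joint(phi, theta, z, w, alpha=1.0, eta=1.0)
print(value)
```

With `alpha = eta = 1` both priors are uniform and contribute nothing to the log joint, so only the four factors $p(z_{d,n} \mid \theta_d)$ and $p(w_{d,n} \mid \phi_{1:K}, z_{d,n})$ remain.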
Note that $\phi_{1:K}$, $\theta_{1:D}$, and $z_{1:D}$ are latent, unobservable variables. Thus, the computational challenge of LDA is to compute their conditional distribution given the observed words $w_{d,n}$ in the documents.
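Accordingly, the posterior distribution of LDA can be expressed as:
$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{d,n}) = \frac{p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{d,n})}{p(w_{d,n})} $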
Because the number of possible topic structures is exponentially large, the posterior of LDA cannot be computed exactly.
Topic models therefore rely on efficient algorithms to approximate the posterior of LDA. There are two categories of algorithms: sampling-based algorithms (such as Gibbs sampling) and variational algorithms.
In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult.
Using the Gibbs sampling method, we can build a Markov chain for the sequence of random variables (see Eq 1).
The sampling algorithm is applied to the chain to draw samples from its limiting distribution, which approximates the posterior.
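A minimal sketch of one common variant, the collapsed Gibbs sampler for LDA, is given below. It is an illustration under stated assumptions rather than the algorithm used later in this notebook (gensim's LdaModel relies on online variational Bayes instead); the function name, the toy documents, and the hyperparameters `alpha` and `eta` are hypothetical.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    ndk = [[0] * num_topics for _ in docs]               # document-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # tokens per topic
    z = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment of this token from the counts.
                k = z[d][n]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + eta) / (nk[j] + vocab_size * eta)
                           for j in range(num_topics)]
                # Sample a new topic in proportion to the weights.
                r = rng.random() * sum(weights)
                k = 0
                acc = weights[0]
                while r > acc and k < num_topics - 1:
                    k += 1
                    acc += weights[k]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return z, ndk, nkw

# Hypothetical toy corpus: word ids 0-1 and 2-3 tend to co-occur.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2]]
z, ndk, nkw = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
print(ndk)
print(nkw)
```

After enough sweeps, the count tables `ndk` and `nkw` (smoothed by `alpha` and `eta` and normalized) give estimates of the document-topic and topic-term distributions.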
%matplotlib inline
from gensim import corpora, models, similarities, matutils
import matplotlib.pyplot as plt
import numpy as np
http://www.cs.princeton.edu/~blei/lda-c/ap.tgz
http://www.cs.columbia.edu/~blei/lda-c/
Unzip the data and put them into your folder, e.g., /Users/chengjun/bigdata/ap/
# Load the data
corpus = corpora.BleiCorpus('/Users/chengjun/bigdata/ap/ap.dat',\
'/Users/chengjun/bigdata/ap/vocab.txt')
help(corpora.BleiCorpus)
# Use dir() to see which attributes and methods the corpus object has
dir(corpus)[-10:]
['docbyoffset', 'fname', 'id2word', 'index', 'length', 'line2doc', 'load', 'save', 'save_corpus', 'serialize']
# corpus.id2word is a dict which has keys and values, e.g.,
print({0: u'i', 1: u'new', 2: u'percent', 3: u'people', 4: u'year', 5: u'two'})
{0: u'i', 1: u'new', 2: u'percent', 3: u'people', 4: u'year', 5: u'two'} [(0, u'i'), (1, u'new'), (2, u'percent'), (3, u'people'), (4, u'year')]
# transform the dict to a list of (id, word) pairs using items()
corpusList = list(corpus.id2word.items())
# show the first 5 elements of the list
print(corpusList[:5])
[(0, u'i'), (1, u'new'), (2, u'percent'), (3, u'people'), (4, u'year')]
# Set the number of topics
NUM_TOPICS = 100
model = models.ldamodel.LdaModel(
corpus, num_topics=NUM_TOPICS,
id2word=corpus.id2word,
alpha=None)
Help on class LdaModel in module gensim.models.ldamodel:
class LdaModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)
lda = LdaModel(corpus, num_topics=10)
doc_lda = lda[doc_bow]
lda.update(other_corpus)
# Which attributes and methods does the trained model have?
' '.join(dir(model))
'__class__ __delattr__ __dict__ __doc__ __format__ __getattribute__ __getitem__ __hash__ __init__ __module__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _adapt_by_suffix _apply _load_specials _save_specials _smart_save alpha bound chunksize clear decay diff dispatcher distributed do_estep do_mstep eta eval_every expElogbeta gamma_threshold get_document_topics get_term_topics get_topic_terms id2word inference init_dir_prior iterations load log_perplexity minimum_phi_value minimum_probability num_terms num_topics num_updates numworkers offset optimize_alpha optimize_eta passes per_word_topics print_topic print_topics random_state save show_topic show_topics state sync_state top_topics update update_alpha update_eta update_every'
We can see which topics a document refers to by using the model[doc] syntax:
document_topics = [model[c] for c in corpus]
# how many topics does one document cover?
# For example, document 2 covers the following topics, with these proportions:
document_topics[2]
[(1, 0.012309431364318347), (8, 0.012202830226942012), (55, 0.91962972149779032), (62, 0.010698472086080716), (94, 0.013764605944173381), (99, 0.010788322268905609)]
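The sparse list above can be ranked to find the document's primary topic and its secondary topics in declining order of probability. The sketch below copies the output shown above into a literal so that it runs on its own; the variable names are illustrative.

```python
# Output of document_topics[2], copied from above.
doc_topics = [(1, 0.012309431364318347), (8, 0.012202830226942012),
              (55, 0.91962972149779032), (62, 0.010698472086080716),
              (94, 0.013764605944173381), (99, 0.010788322268905609)]

# Sort by probability, highest first.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
primary_topic, primary_prob = ranked[0]
print(primary_topic, primary_prob)
print([topic for topic, prob in ranked])
```

Here topic 55 dominates the document with a probability of about 0.92, while the remaining five topics each contribute only about 1%.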
# The first topic
# For topic 0, the top 10 words and their weights are:
model.show_topic(0, 10)
[(u'two', 0.0097638048666660541), (u'county', 0.0095685029771191091), (u'people', 0.0070668043002524118), (u'hutton', 0.0068716234883852093), (u'raid', 0.0062069076129358421), (u'miles', 0.005796006118837134), (u'police', 0.0056406047915654907), (u'firing', 0.0052135720850514291), (u'killed', 0.0050783339770887016), (u'fines', 0.0049427052359337512)]
# For topic 0, the top 5 words and their weights are:
words = model.show_topic(0, 5)
words
[(u'two', 0.0097638048666660541), (u'county', 0.0095685029771191091), (u'people', 0.0070668043002524118), (u'hutton', 0.0068716234883852093), (u'raid', 0.0062069076129358421)]
# each element is a (word, weight) pair
for w, f in words[:10]:
    print(w, f)
(u'two', 0.0097638048666660541) (u'county', 0.0095685029771191091) (u'people', 0.0070668043002524118) (u'hutton', 0.0068716234883852093) (u'raid', 0.0062069076129358421)
# For topic 99, the top 10 words and their weights are:
model.show_topic(99, 10)
[(u'endowment', 0.014332613767409458), (u'vegas', 0.014187123477681848), (u'las', 0.014003089471363507), (u'study', 0.0066308632883444862), (u'imperial', 0.0058305581953050183), (u'north', 0.0056318801116002114), (u'educate', 0.0055984402867634911), (u'air', 0.0055789377622512838), (u'cents', 0.0048116501665708775), (u'dwva', 0.0047560660227365147)]
# Show 4 of the topics learned by the model, each with its top words
model.show_topics(4)
[u'0.011*government + 0.010*hughes + 0.009*communist + 0.008*party + 0.007*fellows + 0.006*leader + 0.006*cordon + 0.005*people + 0.005*president + 0.005*fair', u'0.132*cdy + 0.118*clr + 0.042*rn + 0.011*m + 0.010*rome + 0.008*frankfurt + 0.005*new + 0.004*paris + 0.004*tokyo + 0.003*i', u'0.033*percent + 0.012*year + 0.011*billion + 0.009*states + 0.006*last + 0.005*ec + 0.005*years + 0.005*new + 0.005*tax + 0.005*orders', u'0.014*percent + 0.011*million + 0.007*africa + 0.007*south + 0.007*sales + 0.006*share + 0.006*year + 0.005*wednesdays + 0.005*new + 0.005*mandela']
for w, f in words:
    print(w, f)
(u'two', 0.0097638048666660541) (u'county', 0.0095685029771191091) (u'people', 0.0070668043002524118) (u'hutton', 0.0068716234883852093) (u'raid', 0.0062069076129358421) (u'miles', 0.005796006118837134) (u'police', 0.0056406047915654907) (u'firing', 0.0052135720850514291) (u'killed', 0.0050783339770887016) (u'fines', 0.0049427052359337512)
# write out the topics with their top 10 terms and normalized weights
# (the file is opened once in write mode, so rerunning the cell does not duplicate rows)
with open('/Users/chengjun/github/workshop/data/topics_term_weight.txt', 'w') as output:
    for ti in range(model.num_topics):
        words = model.show_topic(ti, 10)
        tf = sum(f for w, f in words)  # total weight of the top 10 terms
        for w, f in words:
            line = str(ti) + '\t' + w + '\t' + str(f / tf)
            output.write(line + '\n')
Next, we identify the most discussed topic, i.e., the one with the highest total weight across all documents.
## Convert corpus into a dense np array
help(matutils.corpus2dense)
Help on function corpus2dense in module gensim.matutils: corpus2dense(corpus, num_terms, num_docs=None, dtype=<type 'numpy.float32'>) Convert corpus into a dense np array (documents will be columns). You must supply the number of features `num_terms`, because dimensionality cannot be deduced from the sparse vectors alone. You can optionally supply `num_docs` (=the corpus length) as well, so that a more memory-efficient code path is taken. This is the mirror function to `Dense2Corpus`.
topics = matutils.corpus2dense(model[corpus],
num_terms=model.num_topics)
topics
array([[ 0.02923143, 0. , 0. , ..., 0. , 0. , 0. ], [ 0. , 0. , 0.0123087 , ..., 0. , 0. , 0. ], [ 0. , 0. , 0. , ..., 0. , 0. , 0. ], ..., [ 0. , 0. , 0. , ..., 0. , 0. , 0. ], [ 0. , 0. , 0. , ..., 0. , 0. , 0. ], [ 0. , 0. , 0.01078893, ..., 0. , 0. , 0. ]], dtype=float32)
# Return the sum of the array elements
help(topics.sum)
Help on built-in function sum: sum(...) a.sum(axis=None, dtype=None, out=None, keepdims=False) Return the sum of the array elements over the given axis. Refer to `numpy.sum` for full documentation. See Also -------- numpy.sum : equivalent function
# Total weight of the first topic (topic 0), summed over all documents
topics[0].sum()
15.244399
# Compute the total weight of every topic
weight = topics.sum(1)
weight
array([ 15.24439907, 27.13823318, 14.16592598, 5.5000782 , 18.20896721, 1.79626262, 75.51860809, 19.70967484, 14.99353981, 11.98105526, 25.99263763, 2.91408682, 9.63045311, 24.01381683, 11.86772251, 9.51368904, 16.58557129, 11.69364357, 11.40436935, 13.31723404, 30.41246033, 10.41183758, 39.43877029, 3.98492718, 12.05030251, 37.49634552, 2.7084589 , 59.74034119, 2.64890122, 30.77053833, 2.42677784, 35.34954453, 52.81302261, 19.11531067, 13.65035915, 94.64816284, 30.57484818, 33.32247925, 32.47951126, 6.58638334, 9.9152956 , 6.37622595, 13.77993202, 5.28659534, 2.25559521, 8.77668667, 4.45420837, 42.87486267, 6.22911644, 6.28044605, 14.09706116, 9.15421677, 4.51118183, 4.84333706, 35.31595612, 136.42349243, 4.38191986, 2.02930832, 74.41523743, 5.79068279, 2.44928861, 8.03872108, 31.14110374, 2.59122419, 35.65231323, 20.56199837, 125.37039185, 17.28886414, 10.74967766, 5.2983923 , 26.05622864, 34.46240997, 31.72027206, 4.5250349 , 10.54576683, 10.32496643, 18.12685013, 8.09305954, 20.49374962, 16.45386124, 10.95189095, 3.06417084, 5.86327457, 10.25288486, 203.87794495, 12.18054962, 25.36583138, 21.87836838, 4.00764465, 16.40171623, 3.38701963, 5.51084995, 18.67506981, 10.49090958, 12.44936657, 18.38467789, 13.54945087, 57.47393036, 5.71446228, 10.86363983], dtype=float32)
# Find where the maximum is
help(weight.argmax)
Help on built-in function argmax: argmax(...) a.argmax(axis=None, out=None) Return indices of the maximum values along the given axis. Refer to `numpy.argmax` for full documentation. See Also -------- numpy.argmax : equivalent function
# Find which topic has the largest total weight
max_topic = weight.argmax()
print(max_topic)
84
# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
# Recent versions of wordcloud expect a dict mapping each word to its weight
words = {w: float(f) for w, f in model.show_topic(max_topic, 64)}
from wordcloud import WordCloud
fig = plt.figure(figsize=(15, 8),facecolor='white')
wordcloud = WordCloud().generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
# How many topics does each document use?
num_topics_used = [len(model[doc]) for doc in corpus]
# Plot a histogram of the number of topics per document
fig,ax = plt.subplots()
ax.hist(num_topics_used, np.arange(27))
ax.set_ylabel('$Number \;of\; documents$', fontsize = 20)
ax.set_xlabel('$Number \;of \;topics$', fontsize = 20)
fig.tight_layout()
#fig.savefig('Figure_04_01.png')
# Now, repeat the same exercise using alpha=1.0
# You can edit the constant below to play around with this parameter
ALPHA = 1.0
model1 = models.ldamodel.LdaModel(
corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word,
alpha=ALPHA)
num_topics_used1 = [len(model1[doc]) for doc in corpus]
fig,ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('$Number \;of\; documents$', fontsize = 20)
ax.set_xlabel('$Number \;of \;topics$', fontsize = 20)
# The coordinates below were fit by trial and error to look good
plt.text(9, 223, r'default alpha')
plt.text(26, 156, 'alpha=1.0')
fig.tight_layout()
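Why a larger alpha leads to more topics per document can be seen from the Dirichlet prior alone. The sketch below draws topic mixtures from a symmetric Dirichlet with a small and a large alpha and counts how many components exceed a threshold. It is only an illustration of the prior, not of the fitted models above; the threshold of 0.01 is chosen to echo gensim's `minimum_probability` default and, like the function names, should be read as an assumption.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) with `size` components."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0:  # retry in the rare case that every draw underflows to 0
            return [x / total for x in draws]

def mean_active_topics(alpha, num_topics=100, threshold=0.01, trials=200, seed=0):
    """Average number of topics whose proportion exceeds the threshold."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        theta = sample_dirichlet(alpha, num_topics, rng)
        counts.append(sum(1 for p in theta if p > threshold))
    return sum(counts) / float(len(counts))

sparse = mean_active_topics(alpha=0.01)  # 1 / num_topics
dense = mean_active_topics(alpha=1.0)
print(sparse, dense)
```

A small alpha concentrates almost all of the mass on a handful of topics, while alpha=1.0 spreads it over many, which matches the direction of the shift between the two histograms.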
The example above used a ready-made corpus: the documents and the dictionary had already been built, and the data had already been cleaned.
with open('./data/ap.txt', 'r') as f:
dat = f.readlines()
# The text needs to be cleaned
dat[:6]
['<DOC>\n', '<DOCNO> AP881218-0003 </DOCNO>\n', '<TEXT>\n', " A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered on the floor crying ``Jesus save us! God save us!'' Friends and family said the boy apparently was troubled by his grandmother's death and the divorce of his parents and had been tormented by classmates. Nicholas' grandfather, Clarence Elliott Sr., said Saturday that the boy's parents separated about four years ago and his maternal grandmother, Channey Williams, died last year after a long illness. The grandfather also said his grandson was fascinated with guns. ``The boy was always talking about guns,'' he said. ``He knew a lot about them. He knew all the names of them _ none of those little guns like a .32 or a .22 or nothing like that. He liked the big ones.'' The slain teacher was identified as Karen H. Farley, 40. The wounded teacher, 37-year-old Sam Marino, was in serious condition Saturday with gunshot wounds in the shoulder. Police said the boy also shot at a third teacher, Susan Allen, 31, as she fled from the room where Marino was shot. He then shot Marino again before running to a third classroom where a Bible class was meeting. 
The youngster shot the glass out of a locked door before opening fire, police spokesman Lewis Thurston said. When the youth's pistol jammed, he was tackled by teacher Maurice Matteson, 24, and other students, Thurston said. ``Once you see what went on in there, it's a miracle that we didn't have more people killed,'' Police Chief Charles R. Wall said. Police didn't have a motive, Detective Tom Zucaro said, but believe the boy's primary target was not a teacher but a classmate. Officers found what appeared to be three Molotov cocktails in the boy's locker and confiscated the gun and several spent shell casings. Fourteen rounds were fired before the gun jammed, Thurston said. The gun, which the boy carried to school in his knapsack, was purchased by an adult at the youngster's request, Thurston said, adding that authorities have interviewed the adult, whose name is being withheld pending an investigation by the federal Bureau of Alcohol, Tobacco and Firearms. The shootings occurred in a complex of four portable classrooms for junior and senior high school students outside the main building of the 4-year-old school. The school has 500 students in kindergarten through 12th grade. Police said they were trying to reconstruct the sequence of events and had not resolved who was shot first. The body of Ms. Farley was found about an hour after the shootings behind a classroom door.\n", ' </TEXT>\n', '</DOC>\n']
# Lines that start with '<' are markup and will be dropped
dat[4].strip()[0]
'<'
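# Take the first 100 lines of the file and keep only the text lines
docs = []
for i in dat[:100]:
    line = i.strip()
    # skip blank lines and markup lines such as <DOC>, <DOCNO> and <TEXT>
    if line and not line.startswith('<'):
        docs.append(i)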
# Define a function for further cleaning: strip punctuation marks
def clean_doc(doc):
    doc = doc.replace('.', '').replace(',', '')
    doc = doc.replace('``', '').replace('"', '')
    doc = doc.replace('_', '').replace("'", '')
    doc = doc.replace('!', '')
    return doc
docs = [clean_doc(doc) for doc in docs]
texts = [[i for i in doc.lower().split()] for doc in docs]
import nltk
#nltk.download()
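# A window opens; select "book", click download, and wait until the download has finished.
from nltk.corpus import stopwords
stop = stopwords.words('english') # if this raises an error, run nltk.download() in the previous block first
# Stop words: English text contains many very frequent words such as "a", "the" and "or"; they are mostly articles, prepositions, adverbs or conjunctions.
# Human languages contain many function words, which carry little meaning of their own compared with other words.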
' '.join(stop)
u'i me my myself we our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom this that these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don should now d ll m o re ve y ain aren couldn didn doesn hadn hasn haven isn ma mightn mustn needn shan shouldn wasn weren won wouldn'
from gensim.parsing.preprocessing import STOPWORDS
' '.join(STOPWORDS)
'all six just less being indeed over move anyway four not own through using fify where mill only find before one whose system how somewhere much thick show had enough should to must whom seeming yourselves under ours two has might thereafter latterly do them his around than get very de none cannot every un they front during thus now him nor name regarding several hereafter did always who didn whither this someone either each become thereupon sometime side towards therein twelve because often ten our doing km eg some back used up go namely computer are further beyond ourselves yet out even will what still for bottom mine since please forty per its everything behind does various above between it neither seemed ever across she somehow be we full never sixty however here otherwise were whereupon nowhere although found alone re along quite fifteen by both about last would anything via many could thence put against keep etc amount became ltd hence onto or con among already co afterwards formerly within seems into others while whatever except down hers everyone done least another whoever moreover couldnt throughout anyhow yourself three from her few together top there due been next anyone eleven cry call therefore interest then thru themselves hundred really sincere empty more himself elsewhere mostly on fire am becoming hereby amongst else part everywhere too kg herself former those he me myself made twenty these was bill cant us until besides nevertheless below anywhere nine can whether of your toward my say something and whereafter whenever give almost wherever is describe beforehand herein doesn an as itself at have in seem whence ie any fill again hasnt inc thereby thin no perhaps latter meanwhile when detail same wherein beside also that other take which becomes you if nobody unless whereas see though may after upon most hereupon eight but serious nothing such why off a don whereby third i whole noone sometimes well amoungst yours their rather without so five the 
first with make once'
stop.append('said')
# Count the frequency of each word
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
# Remove stop words and words that occur only once
texts = [[token for token in text
          if frequency[token] > 1 and token not in stop]
         for text in texts]
docs[8]
' Here is a summary of developments in forest and brush fires in Western states:\n'
' '.join(texts[9])
'stirbois 2 man extreme-right national front party le pen died saturday automobile police said 43 stirbois political meeting friday city dreux miles west paris traveling toward capital car ran police said stirbois national front member party since born paris law headed business stirbois several extreme-right political joining national front 1977 percent vote local elections west paris highest vote percentage candidate year half later deputy dreux stirbois deputy national 1986 lost elections last national front founded le pen frances government death priority first years presidential elections le pen percent vote national front could'
Help on class Dictionary in module gensim.corpora.dictionary:
class Dictionary(gensim.utils.SaveLoad, _abcoll.Mapping)
Dictionary encapsulates the mapping between normalized words and their integer ids.
The main function is doc2bow
dictionary = corpora.Dictionary(texts)
lda_corpus = [dictionary.doc2bow(text) for text in texts]
# The function doc2bow() simply counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result as a sparse vector.
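The bag-of-words conversion described in the comment above can be sketched in plain Python. The helpers below are hypothetical stand-ins that only mimic the idea; in particular, ids are assigned here in order of first appearance, which need not match the numbering gensim's Dictionary produces.

```python
def build_token2id(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow_sketch(text, token2id):
    """Return a sparse vector: sorted (word_id, count) pairs for known tokens."""
    counts = {}
    for token in text:
        if token in token2id:  # tokens outside the dictionary are ignored
            idx = token2id[token]
            counts[idx] = counts.get(idx, 0) + 1
    return sorted(counts.items())

toy_texts = [['police', 'school', 'police'], ['school', 'year']]
token2id = build_token2id(toy_texts)
print(token2id)
print(doc2bow_sketch(toy_texts[0], token2id))
print(doc2bow_sketch(['year', 'unknown', 'year'], token2id))
```

Only words that actually occur in a document appear in its vector, which is why the representation stays small even when the dictionary is large.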
NUM_TOPICS = 10
lda_model = models.ldamodel.LdaModel(
lda_corpus, num_topics=NUM_TOPICS,
id2word=dictionary, alpha=None)
http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
pip install pyldavis
import pyLDAvis.gensim
ap_data = pyLDAvis.gensim.prepare(lda_model, lda_corpus, dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(ap_data)
/Users/chengjun/anaconda/lib/python2.7/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:102: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -0.00393193849591 and the largest is 0.0358364993473. RuntimeWarning
#pyLDAvis.show(ap_data)
pyLDAvis.save_html(ap_data, './vis/ap_ldavis.html')
import gensim
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import matplotlib
matplotlib.rc("savefig", dpi=400)
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei'] # set the default font so that Chinese characters render
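try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3
from bs4 import BeautifulSoup
import sys
url2016 = 'http://news.xinhuanet.com/fortune/2016-03/05/c_128775704.htm'
content = urlopen(url2016).read()
# name the parser explicitly so that BeautifulSoup does not have to guess one
soup = BeautifulSoup(content, 'html.parser')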
gov_report_2016 = [s.text for s in soup('p')]
for i in gov_report_2016[:10]:
    print(i)
政府工作报告 ——2016年3月5日在第十二届全国人民代表大会第四次会议上 国务院总理 李克强 各位代表: 现在,我代表国务院,向大会报告政府工作,请予审议,并请全国政协各位委员提出意见。 一、2015年工作回顾 过去一年,我国发展面临多重困难和严峻挑战。在以习近平同志为总书记的党中央坚强领导下,全国各族人民以坚定的信心和非凡的勇气,攻坚克难,开拓进取,经济社会发展稳中有进、稳中有好,完成了全年主要目标任务,改革开放和社会主义现代化建设取得新的重大成就。 ——经济运行保持在合理区间。国内生产总值达到67.7万亿元,增长6.9%,在世界主要经济体中位居前列。粮食产量实现"十二连增",居民消费价格涨幅保持较低水平。特别是就业形势总体稳定,城镇新增就业1312万人,超过全年预期目标,成为经济运行的一大亮点。 ——结构调整取得积极进展。服务业在国内生产总值中的比重上升到50.5%,首次占据"半壁江山"。消费对经济增长的贡献率达到66.4%。高技术产业和装备制造业增速快于一般工业。单位国内生产总值能耗下降5.6%。 ——发展新动能加快成长。创新驱动发展战略持续推进,互联网与各行业加速融合,新兴产业快速增长。大众创业、万众创新蓬勃发展,全年新登记注册企业增长21.6%,平均每天新增1.2万户。新动能对稳就业、促升级发挥了突出作用,正在推动经济社会发生深刻变革。
def clean_txt(txt):
    # replace Chinese punctuation marks with spaces
    for i in [u'、', u',', u'—', u'!', u'。', u'《', u'》', u'(', u')']:
        txt = txt.replace(i, ' ')
    return txt
gov_report_2016 = [clean_txt(i) for i in gov_report_2016]
len(gov_report_2016)
109
for i in gov_report_2016[:10]:
    print(i)
政府工作报告 2016年3月5日在第十二届全国人民代表大会第四次会议上 国务院总理 李克强 各位代表: 现在 我代表国务院 向大会报告政府工作 请予审议 并请全国政协各位委员提出意见 一 2015年工作回顾 过去一年 我国发展面临多重困难和严峻挑战 在以习近平同志为总书记的党中央坚强领导下 全国各族人民以坚定的信心和非凡的勇气 攻坚克难 开拓进取 经济社会发展稳中有进 稳中有好 完成了全年主要目标任务 改革开放和社会主义现代化建设取得新的重大成就 经济运行保持在合理区间 国内生产总值达到67.7万亿元 增长6.9% 在世界主要经济体中位居前列 粮食产量实现"十二连增" 居民消费价格涨幅保持较低水平 特别是就业形势总体稳定 城镇新增就业1312万人 超过全年预期目标 成为经济运行的一大亮点 结构调整取得积极进展 服务业在国内生产总值中的比重上升到50.5% 首次占据"半壁江山" 消费对经济增长的贡献率达到66.4% 高技术产业和装备制造业增速快于一般工业 单位国内生产总值能耗下降5.6% 发展新动能加快成长 创新驱动发展战略持续推进 互联网与各行业加速融合 新兴产业快速增长 大众创业 万众创新蓬勃发展 全年新登记注册企业增长21.6% 平均每天新增1.2万户 新动能对稳就业 促升级发挥了突出作用 正在推动经济社会发生深刻变革
len(gov_report_2016[5:-1])
103
# Set the Working Directory
import os
os.getcwd()
os.chdir('/Users/chengjun/github/cjc/')
os.getcwd()
'/Users/chengjun/GitHub/cjc'
import io
filename = 'data/stopwords.txt'
stopwords = {}
# read the stop word list as UTF-8 text, one word per line
with io.open(filename, 'r', encoding='utf-8') as f:
    for line in f:
        word = line.rstrip()
        if word:
            stopwords[word] = 1
adding_stopwords = [u'我们', u'要', u'地', u'有', u'这', u'人',
u'发展',u'建设',u'加强',u'继续',u'对',u'等',
u'推进',u'工作',u'增加']
for s in adding_stopwords: stopwords[s]=10
import jieba.analyse
def cleancntxt(txt, stopwords):
    # keep only the (up to) 1000 keywords with the highest TF-IDF weight
    tfidf1000 = set(jieba.analyse.extract_tags(txt, topK=1000, withWeight=False))
    seg_generator = jieba.cut(txt, cut_all=False)
    # drop stop words, blanks, and tokens outside the keyword set
    seg_list = [i for i in seg_generator if i not in stopwords]
    seg_list = [i for i in seg_list if i != u' ']
    seg_list = [i for i in seg_list if i in tfidf1000]
    return seg_list
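def tokenize(text):
    # Assumed helper: getCorpus() calls tokenize(), which is not defined in this notebook.
    # This version tokenizes with simple_preprocess and drops gensim's STOPWORDS.
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]
def getCorpus(data):
    processed_docs = [tokenize(doc) for doc in data]
    word_count_dict = gensim.corpora.Dictionary(processed_docs)
    print("In the corpus there are", len(word_count_dict), "unique tokens")
    # keep words that appear in at least 5 documents and in no more than 20% of all documents
    word_count_dict.filter_extremes(no_below=5, no_above=0.2)
    print("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")
    bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
    return bag_of_words_corpus, word_count_dict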
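The effect of `filter_extremes(no_below, no_above)` is easy to misread: `no_below` is an absolute number of documents, while `no_above` is a fraction of the corpus. The sketch below approximates that rule in plain Python on a hypothetical toy corpus; the function name is illustrative, and gensim's own implementation additionally caps the vocabulary size with `keep_n`.

```python
def filter_by_document_frequency(tokenized_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and
    in no more than a fraction `no_above` of all documents."""
    num_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for token in set(doc):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(t for t, df in doc_freq.items()
                  if df >= no_below and df <= no_above * num_docs)

# Hypothetical corpus of five documents.
toy_docs = [['a', 'b'], ['a', 'c'], ['a', 'b'], ['a', 'd'], ['a', 'b', 'c']]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.7)
print(kept)
```

In this toy corpus 'a' occurs in every document and is dropped as too common, 'd' occurs in only one document and is dropped as too rare, and 'b' and 'c' survive.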
def getCnCorpus(data, stopwords):
    # cleancntxt() needs the stop word dictionary as its second argument
    processed_docs = [cleancntxt(doc, stopwords) for doc in data]
    word_count_dict = gensim.corpora.Dictionary(processed_docs)
    print("In the corpus there are", len(word_count_dict), "unique tokens")
    # Optional filtering (disabled): keep words that appear in at least 5 documents
    # and in no more than 20% of all documents
    #word_count_dict.filter_extremes(no_below=5, no_above=0.2)
    #print("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")
    bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
    return bag_of_words_corpus, word_count_dict
def inferTopicNumber(bag_of_words_corpus, num, word_count_dict):
    lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=num, id2word=word_count_dict, passes=10)
    _ = lda_model.print_topics(-1) #use _ for throwaway variables.
    # log_perplexity() returns the per-word likelihood bound (higher is better)
    logperplexity = lda_model.log_perplexity(bag_of_words_corpus)
    return logperplexity
def fastInferTopicNumber(bag_of_words_corpus, num, word_count_dict):
    lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=bag_of_words_corpus, num_topics=num,
                                                        id2word=word_count_dict,
                                                        workers=None, chunksize=2000, passes=2,
                                                        batch=False, alpha='symmetric', eta=None,
                                                        decay=0.5, offset=1.0, eval_every=10,
                                                        iterations=50, gamma_threshold=0.001, random_state=None)
    _ = lda_model.print_topics(-1) #use _ for throwaway variables.
    logperplexity = lda_model.log_perplexity(bag_of_words_corpus)
    return logperplexity
import jieba.analyse
jieba.add_word(u'屠呦呦', freq=None, tag=None)
#del_word(word)
print (' '.join(cleancntxt(u'屠呦呦获得了诺贝尔医学奖。', stopwords)))
屠呦呦 获得 诺贝尔 医学奖
import gensim
processed_docs = [cleancntxt(doc, stopwords) for doc in gov_report_2016[5:-1]]
word_count_dict = gensim.corpora.Dictionary(processed_docs)
print ("In the corpus there are", len(word_count_dict), "unique tokens")
# Optional filtering (disabled): keep words that appear in at least 5 documents and in no more than 20% of all documents
# word_count_dict.filter_extremes(no_below=5, no_above=0.2)
# print("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
In the corpus there are 2622 unique tokens
tfidf = models.TfidfModel(bag_of_words_corpus )
corpus_tfidf = tfidf[bag_of_words_corpus ]
#lda_model = gensim.models.LdaModel(corpus_tfidf, num_topics=20, id2word=word_count_dict, passes=10)
lda_model = gensim.models.LdaMulticore(corpus_tfidf, num_topics=20, id2word=word_count_dict, passes=10)
perplexity_list = [inferTopicNumber(bag_of_words_corpus, num, word_count_dict) for num in [5, 10, 15, 20, 25, 30 ]]
plt.plot([5, 10, 15, 20, 25, 30 ], perplexity_list)
plt.show()
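The values plotted above are per-word likelihood bounds as returned by `log_perplexity()`, so higher is better; gensim reports the corresponding perplexity as 2 raised to the negative bound. The sketch below shows the conversion and how one might pick a topic number from it. The bound values are made up for illustration and are not results from this corpus.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)

topic_numbers = [5, 10, 15, 20, 25, 30]
# Hypothetical bounds, one per candidate number of topics (not real results).
hypothetical_bounds = [-7.9, -7.6, -7.4, -7.5, -7.8, -8.1]

perplexities = [bound_to_perplexity(b) for b in hypothetical_bounds]
# The lowest perplexity corresponds to the highest bound.
best_num = topic_numbers[perplexities.index(min(perplexities))]
print(best_num)
```

Because the bound here is computed on the same documents the model was trained on, it tends to favor more complex models; evaluating on held-out documents is usually the safer comparison.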
lda_model.print_topics(3)
[(15, u'0.005*"\u5c31\u4e1a" + 0.003*"\u56fd\u6709\u4f01\u4e1a" + 0.002*"\u8c08\u5224" + 0.002*"\u81ea\u8d38\u533a" + 0.002*"\u534f\u5b9a" + 0.002*"\u91d1\u878d" + 0.002*"\u56fd\u6709\u8d44\u4ea7" + 0.002*"\u5e02\u573a\u5316" + 0.002*"\u521b\u4e1a" + 0.002*"\u5747\u8861"'), (8, u'0.005*"\u4ee5\u4e0b" + 0.004*"\u4e00\u5e74" + 0.003*"\u4e3b\u8981" + 0.003*"\u533b\u4fdd" + 0.002*"\u6559\u80b2" + 0.002*"\u533b\u7597" + 0.002*"\u5b66\u6821" + 0.002*"\u514d\u9664" + 0.002*"\u5b66\u6742\u8d39" + 0.002*"C919"'), (10, u'0.004*"\u4e24\u5cb8" + 0.004*"\u9700\u6c42" + 0.003*"\u6295\u8d44" + 0.002*"\u6210\u7ee9" + 0.002*"\u767e\u5206\u70b9" + 0.002*"\u6709\u6548" + 0.002*"\u8fd9\u4e9b" + 0.002*"\u9532\u800c\u4e0d\u820d" + 0.002*"\u8150\u8d25\u5206\u5b50" + 0.002*"\u89c4\u5b9a"')]
topictermlist = lda_model.print_topics(-1)
top_words = [[j.split('*')[1] for j in i[1].split(' + ')] for i in topictermlist]
for i in top_words:
    print(" ".join(i))
"创新" "企业" "创业" "党风廉政" "财政支出" "就业" "经济运行" "全年" "范围" "所有" "更加" "支付" "取消" "审批" "存款" "如期" "光明" "美好" "前景" "将会" "海洋" "合作" "地区" "产能" "支持" "金融" "存在" "一些" "去年" "基金" "财政赤字" "领导人" "论坛" "安排" "亿元" "峰会" "地方" "债券" "联合国" "万亿元" "万公里" "农村" "城乡" "救助" "里程" "协调" "重大" "覆盖" "主要" "突出" "农业" "提高" "加快" "改革" "实施" "保护" "政策" "人民" "基本" "民族" "住房" "考虑" "就业" "6.5" "文化" "预期" "相衔接" "有利于" "全民" "补贴" "节能" "环保" "消费" "国民经济" "第十三个" "增长" "国内" "规划" "生产总值" "五年" "以下" "一年" "主要" "医保" "教育" "医疗" "学校" "免除" "学杂费" "C919" "军队" "国防" "强军" "政治" "领导" "一年" "领域" "鱼水情深" "战备" "武装警察" "两岸" "需求" "投资" "成绩" "百分点" "有效" "这些" "锲而不舍" "腐败分子" "规定" "2016" "重点" "回顾" "八个" "2015" "脱贫" "扶贫" "做好" "接受" "今年" "民生" "非公有制" "时期" "十三" "举措" "竞争" "任务" "对外开放" "重大" "主要" "各位" "代表" "安全" "伟大" "民主" "作出" "富强" "聚力" "紧密" "复兴" "合作" "依法" "地方" "作用" "宗教" "大国" "产能" "关系" "政府" "维护" "就业" "国有企业" "谈判" "自贸区" "协定" "金融" "国有资产" "市场化" "创业" "均衡" "2020" "万元" "外商投资" "强国" "环境治理" "强力" "下决心" "事关" "双赢" "一批" "13" "意识" "干事" "服务业" "辉煌成就" "左右" "货币政策" "制造" "广大干部" "品质" "港澳" "居民" "自贸" "香港" "实际" "亿美元" "勇气" "重大成就" "信心" "稳中有" "调控" "供给" "结构性" "政府" "城镇化" "货币政策" "出口" "新型" "区域" "宏观调控"
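The string splitting used above keeps the quotation marks around each word and leaves the weights as text. A slightly more explicit parser is sketched below; it assumes the `weight*"word" + ...` format shown in the `print_topics` output, and the function name is hypothetical.

```python
def parse_topic_string(topic_string):
    """Split a print_topics-style string into (word, weight) pairs."""
    pairs = []
    for part in topic_string.split(' + '):
        weight, word = part.split('*', 1)
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

# First three terms of topic 15, copied from the print_topics output above.
example = u'0.005*"\u5c31\u4e1a" + 0.003*"\u56fd\u6709\u4f01\u4e1a" + 0.002*"\u8c08\u5224"'
parsed = parse_topic_string(example)
for word, weight in parsed:
    print(word, weight)
```

When only the numbers are needed, calling `show_topic()` on the model avoids string parsing altogether, since it returns the words and weights directly.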
top_words_shares = [[float(j.split('*')[0]) for j in i[1].split(' + ')] for i in topictermlist]
# compute the range once, before the list is rescaled
min_share = np.min(top_words_shares)
max_share = np.max(top_words_shares)
def weightvalue(x):
    # rescale a weight linearly to the interval [10, 50]
    return (x - min_share) * 40 / (max_share - min_share) + 10
top_words_shares = [[weightvalue(x) for x in i] for i in top_words_shares]
def plotTopics(mintopics, maxtopics):
    num_top_words = 10
    plt.rcParams['figure.figsize'] = (20.0, 8.0)
    n = 0
    for t in range(mintopics, maxtopics):
        plt.subplot(2, 15, n + 1) # plot numbering starts with 1
        plt.ylim(0, num_top_words) # stretch the y-axis to accommodate the words
        plt.xticks([]) # remove x-axis markings ('ticks')
        plt.yticks([]) # remove y-axis markings ('ticks')
        plt.title(u'Topic #{}'.format(t+1), size = 15)
        words = top_words[t][0:num_top_words]
        words_shares = top_words_shares[t][0:num_top_words]
        # write the words from top to bottom, sized by their rescaled weight
        for i, (word, share) in enumerate(zip(words, words_shares)):
            plt.text(0.05, num_top_words-i-0.9, word, fontsize= np.log(share*1000))
        n += 1
plotTopics(0, 10)
plotTopics(10, 20)
import pandas as pd
pdf = pd.read_csv('./data/SongPoem.csv', encoding = 'gb18030')
pdf[:3]
 | Page | Author | Title | Title2 | Sentence |
---|---|---|---|---|---|
0 | 0001.1 | 和岘 | 导引 | 导引 | 气和玉烛,叡化着鸿明。缇管一阳生。郊禋盛礼燔柴毕,旋轸凤凰城。森罗仪卫振华缨。载路溢欢声。皇... |
1 | 0001.2 | 和岘 | 六州 | 六州 | 严夜警,铜莲漏迟迟。清禁肃,森陛戟,羽卫俨皇闱。角声励,钲鼓攸宜。金管成雅奏,逐吹逶迤。荐苍... |
2 | 0001.3 | 和岘 | 十二时 | 忆少年 | 承宝运,驯致隆平。鸿庆被寰瀛。时清俗阜,治定功成。遐迩咏由庚。严郊祀,文物声明。会天正、星拱... |
len(pdf)
20692
poems = pdf.Sentence
import gensim
processed_docs = [cleancntxt(doc, stopwords) for doc in poems]
word_count_dict = gensim.corpora.Dictionary(processed_docs)
print ("In the corpus there are", len(word_count_dict), "unique tokens")
# Optional filtering (disabled): keep words that appear in at least 5 documents and in no more than 20% of all documents
# word_count_dict.filter_extremes(no_below=5, no_above=0.2)
# print("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
In the corpus there are 147177 unique tokens
tfidf = models.TfidfModel(bag_of_words_corpus )
corpus_tfidf = tfidf[bag_of_words_corpus ]
lda_model = gensim.models.LdaModel(corpus_tfidf, num_topics=20, id2word=word_count_dict, passes=10)
# Use the parallel LDA implementation to speed up training.
# Note: with corpus=None the model is never trained, so the corpus is passed in here.
lda_model2 = gensim.models.ldamulticore.LdaMulticore(corpus=corpus_tfidf, num_topics=20, id2word=word_count_dict,
                                                     workers=None, chunksize=2000, passes=1,
                                                     batch=False, alpha='symmetric', eta=None,
                                                     decay=0.5, offset=1.0, eval_every=10,
                                                     iterations=50, gamma_threshold=0.001, random_state=None)
lda_model2.print_topics(3)
[(14, u'0.000*"\u5bab\u58f6" + 0.000*"\u5a25\u7709" + 0.000*"\u5c11\u7384" + 0.000*"\u7ea2\u8865\u7fe0" + 0.000*"\u5bd2\u7981" + 0.000*"\u6069\u6ce2\u6e3a" + 0.000*"\u6210\u7b11" + 0.000*"\u4e00\u65b9" + 0.000*"\u632f\u4f69" + 0.000*"\u5343\u6761"'), (3, u'0.000*"\u7b11\u6885" + 0.000*"\u98de\u7fe5" + 0.000*"\u751a\u5904\u5e02" + 0.000*"\u7076\u59d4\u5ca9" + 0.000*"\u58f0\u4e91\u5916" + 0.000*"\u8bd7\u9b13\u7a7a" + 0.000*"\u9999\u5e15" + 0.000*"\u4e00\u5411" + 0.000*"\u559c\u8fd1" + 0.000*"\u8349\u5e26"'), (9, u'0.000*"\u9189\u5f52\u82b1" + 0.000*"\u96e8\u9701\u9ad8\u70df" + 0.000*"\u79c1\u81ea" + 0.000*"\u5c1a\u4e8e" + 0.000*"\u5ba2\u4e91" + 0.000*"\u4ea4\u8a89" + 0.000*"\u7f18\u529b" + 0.000*"\u9ad8\u4eba\u53f3" + 0.000*"\u7814\u971c" + 0.000*"\u751f\u60b2"')]
topictermlist = lda_model2.print_topics(-1)
top_words = [[j.split('*')[1] for j in i[1].split(' + ')] for i in topictermlist]
for k, i in enumerate(top_words):
    print(k+1, " ".join(i))
1 "杯面" "衡任" "鹊声" "东瓯" "毛遂" "狂胡" "金横带" "为民" "贪欢适" "女骋" 2 "事关" "长不昧" "扑鼻" "印曲花" "千亿" "悲似" "成絮" "绿须" "辗柔茵" "中眉" 3 "此等" "半疑" "菜传" "羞郎觑" "工艺" "翠翘花" "苦自" "闲发" "正梅粉" "愁梦欲" 4 "笑梅" "飞翥" "甚处市" "灶委岩" "声云外" "诗鬓空" "香帕" "一向" "喜近" "草带" 5 "合姓" "入户" "生青雾" "千掌" "佳兆" "平镜" "脉脉" "几间" "留春语" "先递" 6 "松路" "菊荒" "苎萝" "涌大" "波平岸" "任城" "题桐叶" "三三五五" "望仙官" "景疏" 7 "幽欢整" "虔祈" "步鸯" "开口笑" "怨深" "列郡" "风拂罗衣" "似途" "恨苦" "情忠武" 8 "粟粟" "报临" "摩孩罗" "半嗔" "虚野" "倚定" "欲语" "夷夏高仰" "看君行" "未伊瘦损" 9 "刺萦" "适忘鱼" "困流霞" "犀隐" "只弹" "幼稚" "花阴淡" "恐山深" "盘山" "今底" 10 "醉归花" "雨霁高烟" "私自" "尚于" "客云" "交誉" "缘力" "高人右" "研霜" "生悲" 11 "休为" "迷舞凤" "惬邻" "愁味" "解禁" "一物" "亭北" "催庭树" "梦翠翘" "披蕊" 12 "老此" "横塘处" "这闲福" "初不悟" "花满碧蹊归" "纵巧" "放荡" "歌者" "要称" "先泪" 13 "筹密边" "任碧罗" "犹闻" "秋霁碧" "瘦千崖" "翠如葱" "休争" "爱此" "辅盈成" "蒸民" 14 "送日眺" "事皆非" "顶头" "储秀降" "济水" "良日" "辜伊" "岫边" "若耶溪" "空歇" 15 "宫壶" "娥眉" "少玄" "红补翠" "寒禁" "恩波渺" "成笑" "一方" "振佩" "千条" 16 "开景运" "休治" "争映" "明时" "念羁" "天末家" "点墨" "春权" "丝弦" "盈畴" 17 "情念骤" "上林" "侵染" "香高烛" "心许" "裂石" "兽烟" "麦光" "符梦" "尘想" 18 "别郎" "庐中" "不待禁" "整冠落" "同摘" "穿线" "细草芳" "村姑" "季真非" "待取" 19 "云里认" "一枕松风" "雨惜" "西瑶" "调冰荐" "已生些" "谢郎池" "散场" "锅汤" "赋里" 20 "访隐" "劳君" "尹字" "火力" "入轻" "衷肠" "绝境" "地来" "清颍咽" "心与秋空"
perplexity_list = [fastInferTopicNumber(bag_of_words_corpus, num, word_count_dict) for num in [5, 15, 20, 25, 30, 35, 40 ]]
plt.plot([5, 15, 20, 25, 30, 35, 40], perplexity_list)
plt.show()
import pyLDAvis.gensim
song_data = pyLDAvis.gensim.prepare(lda_model, bag_of_words_corpus, word_count_dict)
pyLDAvis.enable_notebook()
# inside a notebook, display() renders inline; show() starts a local server and blocks the kernel
pyLDAvis.display(song_data)
Willi Richert, Luis Pedro Coelho, 2013, Building Machine Learning Systems with Python. Chapter 4. Packt Publishing.
LDA Experiments on the English Wikipedia https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation
东风夜放花千树:对宋词进行主题分析初探 https://chengjunwang.com/zh/post/cn/2013-09-27-topic-modeling-of-song-peom/
Chandra Y, Jiang LC, Wang C-J (2016) Mining Social Entrepreneurship Strategies Using Topic Modeling. PLoS ONE 11(3): e0151342. doi:10.1371/journal.pone.0151342