Introduction to Text Mining



王成军

wangchengjun@nju.edu.cn

Computational Communication (计算传播网) http://computational-communication.com

What can be learned from 5 million books

http://v.youku.com/v_show/id_XMzA3OTA5MjUy.html

This talk by Jean-Baptiste Michel and Erez Lieberman Aiden is phenomenal. The associated article is also well worth checking out: Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331, 176–182.

Try the Google Books data: https://books.google.com/ngrams/

Data download: http://www.culturomics.org/home

Bag-of-words model

Represent text as numerical feature vectors

  • We create a vocabulary of unique tokens—for example, words—from the entire set of documents.
  • We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors consist mostly of zeros, which is why we call them sparse.

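Bag of words (词袋): in information retrieval, the bag-of-words model assumes that a text can be treated purely as a collection of words, ignoring word order, grammar, and syntax. Each word occurrence is taken to be independent of whether any other word appears; in other words, the author is assumed to choose the word at any position independently of the preceding sentences. This assumption simplifies natural language and makes it easier to model, but it is unreasonable in some cases.

For example, in personalized news recommendation the bag-of-words model runs into trouble. Suppose user A is very interested in the phrase "Nanjing drunk-driving accident" (南京醉酒驾车事故). Because bag of words ignores order and syntax, the model concludes that user A is interested in "Nanjing", "drunk", "driving", and "accident", and may therefore recommend news related to "Nanjing", "bus", and "accident", which is clearly unreasonable.

One remedy is to use the SCPCD method to extract whole phrases, or to use a higher-order (order 2 or above) statistical language model such as bigram or trigram to preserve word order. This amounts to a bag of bigrams or a bag of trigrams, and it alleviates the problem to some extent. In short, whether the bag-of-words model is appropriate depends on the situation; it should not be used where word order, grammar, and syntax cannot be ignored.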
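
A minimal sketch of the word-order point, using only the Python standard library. The two example sentences and the helper name `bag_of_ngrams` are illustrative assumptions, not part of the original material. Two sentences with the same words in a different order get identical bags of words, while their bags of bigrams differ.

```python
from collections import Counter

def bag_of_ngrams(tokens, n):
    """Count the n-grams (tuples of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical example sentences: same words, different order.
s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()

same_unigrams = bag_of_ngrams(s1, 1) == bag_of_ngrams(s2, 1)
same_bigrams = bag_of_ngrams(s1, 2) == bag_of_ngrams(s2, 2)

print(same_unigrams)  # True: word order is lost
print(same_bigrams)   # False: bigrams keep local word order
```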

Transforming words into feature vectors

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.

In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Document-term matrices are widely used in natural language processing.

D1 = "I like databases"

D2 = "I hate databases"

     I   like   hate   databases
D1   1   1      0      1
D2   1   0      1      1
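
The small table above can be rebuilt by hand in a few lines of plain Python. This is only a sketch: the column order in `vocab` is fixed manually to match the table, and whitespace splitting stands in for a real tokenizer.

```python
docs = {"D1": "I like databases", "D2": "I hate databases"}

# Column order chosen by hand so that it matches the table.
vocab = ["I", "like", "hate", "databases"]

# One row of term counts per document.
matrix = {name: [text.split().count(term) for term in vocab]
          for name, text in docs.items()}

for name, row in matrix.items():
    print(name, row)
# D1 [1, 1, 0, 1]
# D2 [1, 0, 1, 1]
```
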
In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
In [6]:
' '.join(dir(count)) 
Out[6]:
'__class__ __delattr__ __dict__ __doc__ __format__ __getattribute__ __hash__ __init__ __module__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _char_ngrams _char_wb_ngrams _check_vocabulary _count_vocab _get_param_names _limit_features _sort_features _validate_vocabulary _white_spaces _word_ngrams analyzer binary build_analyzer build_preprocessor build_tokenizer decode decode_error dtype encoding fit fit_transform fixed_vocabulary fixed_vocabulary_ get_feature_names get_params get_stop_words input inverse_transform lowercase max_df max_features min_df ngram_range preprocessor set_params stop_words stop_words_ strip_accents token_pattern tokenizer transform vocabulary vocabulary_'
In [2]:
count.get_feature_names()
Out[2]:
[u'and', u'is', u'shining', u'sun', u'sweet', u'the', u'weather']
In [3]:
print(count.vocabulary_)
{u'and': 0, u'weather': 6, u'sweet': 4, u'sun': 3, u'is': 1, u'the': 5, u'shining': 2}
In [5]:
type(bag)
Out[5]:
scipy.sparse.csr.csr_matrix
In [3]:
print(bag.toarray())
[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
In [12]:
import pandas as pd
pd.DataFrame(bag.toarray(), columns = count.get_feature_names())
Out[12]:
and is shining sun sweet the weather
0 0 1 1 1 0 1 0
1 0 1 0 0 1 1 1
2 1 2 1 1 1 2 1

1-gram

The sequence of items in the bag-of-words model that we just created is also called the 1-gram or unigram model:

  • each item or token in the vocabulary represents a single word.

n-gram

The choice of the number n in the n-gram model depends on the particular application

  • 1-gram: "the", "sun", "is", "shining"
  • 2-gram: "the sun", "sun is", "is shining"

The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter.

While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer instance with ngram_range=(2,2).
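
To make the effect of `ngram_range` concrete, here is a plain-Python sketch of word n-gram extraction. The helper `word_ngrams` is a hypothetical name used for illustration; it is not scikit-learn's implementation, only an approximation of the idea on an already lowercased, whitespace-tokenized sentence.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "the sun is shining".split()

unigrams = word_ngrams(tokens, 1, 1)   # analogous to ngram_range=(1, 1)
bigrams = word_ngrams(tokens, 2, 2)    # analogous to ngram_range=(2, 2)

print(unigrams)  # ['the', 'sun', 'is', 'shining']
print(bigrams)   # ['the sun', 'sun is', 'is shining']
```

Passing `(1, 2)` to the same helper would return the unigrams followed by the bigrams.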

Assessing word relevancy via term frequency-inverse document frequency

$\text{tf-idf}(t, d) = tf(t, d) \times idf(t)$

$tf(t, d)$ is the term frequency of term t in document d.

The inverse document frequency $idf(t)$ can be calculated as:

$idf(t) = \log \frac{n_d}{1 + df(d, t)}$

where $n_d$ is the total number of documents, and $df(d, t)$ is the number of documents $d$ that contain the term $t$.
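
The definitions above can be written out directly in plain Python. This is a sketch under two assumptions: the natural logarithm is used (the formula does not state a base), and the three example sentences are lowercased and split on whitespace by hand. The function names are illustrative only.

```python
import math

docs = ["the sun is shining",
        "the weather is sweet",
        "the sun is shining and the weather is sweet"]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """Raw count of the term in one document."""
    return doc_tokens.count(term)

def df(term):
    """Number of documents that contain the term."""
    return sum(1 for doc_tokens in tokenized if term in doc_tokens)

def idf(term):
    """idf(t) = log(n_d / (1 + df(d, t)))."""
    n_d = len(tokenized)
    return math.log(n_d / (1 + df(term)))

def tf_idf(term, doc_index):
    return tf(term, tokenized[doc_index]) * idf(term)

and_score = tf_idf("and", 2)      # tf = 1, df = 1, so log(3 / 2)
sweet_score = tf_idf("sweet", 1)  # tf = 1, df = 2, so log(3 / 3) = 0
print(round(and_score, 3), sweet_score)
```

Under this definition, a term that appears in two of the three documents already gets a weight of zero.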

Question: Why do we add the constant 1 to the denominator?

In-class exercise: using the formula above, compute the tf-idf value of the word 'is' in document 2.

TfidfTransformer

Scikit-learn implements yet another transformer, the TfidfTransformer, that takes the raw term frequencies from CountVectorizer as input and transforms them into tf-idfs:

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2)

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]
In [17]:
bag = tfidf.fit_transform(count.fit_transform(docs))
pd.DataFrame(bag.toarray(), columns = count.get_feature_names())
Out[17]:
and is shining sun sweet the weather
0 0.000000 0.433708 0.558478 0.558478 0.000000 0.433708 0.000000
1 0.000000 0.433708 0.000000 0.000000 0.558478 0.433708 0.558478
2 0.404748 0.478102 0.307822 0.307822 0.307822 0.478102 0.307822
In [18]:
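# tf-idf value of a single term ("is" in the last document)
tf_is = 2      # "is" occurs twice in the last document
n_docs = 3     # total number of documents
df_is = 3      # "is" appears in all three documents
idf_is = np.log((n_docs + 1.0) / (df_is + 1.0))  # smoothed idf; floats avoid integer division
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)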
tf-idf of term "is" = 2.00
In [19]:
# raw (unnormalized) tf-idf values of the terms in the last document
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 
Out[19]:
array([ 1.69,  2.  ,  1.29,  1.29,  1.29,  2.  ,  1.29])

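The tf-idf equation implemented in scikit-learn differs slightly from the textbook definition:

$\text{tf-idf}(t, d) = tf(t, d) \times (idf(t) + 1)$

With smooth_idf=True, scikit-learn also computes $idf(t) = \log \frac{1 + n_d}{1 + df(d, t)}$, which is the form used in the calculation above.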

L2-normalization

$v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\left(\sum_{i = 1}^n v_i^2\right)^{1/2}}$

In [20]:
# tf-idf values after L2 normalization
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf
Out[20]:
array([ 0.4 ,  0.48,  0.31,  0.31,  0.31,  0.48,  0.31])
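
As a cross-check, the normalized values for the last document can be reproduced with the standard library alone. This is a sketch, not scikit-learn's code: the term counts and document frequencies are copied by hand from the count matrix above (columns: and, is, shining, sun, sweet, the, weather), and the smoothed idf is assumed to be log((1 + n_d) / (1 + df)).

```python
import math

counts = [1, 2, 1, 1, 1, 2, 1]   # term counts in the last document
dfs = [1, 3, 2, 2, 2, 3, 2]      # number of documents containing each term
n_docs = 3

# Raw tf-idf: tf * (smoothed idf + 1)
raw = [tf * (math.log((1 + n_docs) / (1 + df)) + 1)
       for tf, df in zip(counts, dfs)]

# L2 normalization: divide by the square root of the sum of squares.
norm = math.sqrt(sum(v * v for v in raw))
l2 = [v / norm for v in raw]

print([round(v, 2) for v in raw])  # [1.69, 2.0, 1.29, 1.29, 1.29, 2.0, 1.29]
print([round(v, 2) for v in l2])   # [0.4, 0.48, 0.31, 0.31, 0.31, 0.48, 0.31]
```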

Text Mining of Government Work Reports

0. Reading the data

In [1]:
with open('/Users/chengjun/github/cjc2016/data/gov_reports1954-2016.txt', 'r') as f:
    reports = f.readlines()
In [2]:
len(reports)
Out[2]:
47
In [3]:
print(reports[32][:1000])
2002	2002年政府工作报告  ——2002年3月5日在第九届全国人民代表大会第五次会议上		                   国务院总理朱镕基   各位代表:  现在,我代表国务院向大会作政府工作报告,请予审议,并请全国政协各位委员提出意见。  首先报告2001年的工作。  新世纪第一年,全国各族人民在中国共产党领导下,面对复杂多变的国际形势,克服困难,阔步前进,改革开放和社会主义现代化建设取得了新的重大成就。  国民经济保持良好发展势头。在世界经济增长明显减速的情况下,我们坚持扩大内需的方针,坚定地实施积极的财政政策和稳健的货币政策,实现了经济较快增长。2001年国内生产总值达到95933亿元,比上年增长7.3%。经济结构调整取得积极进展。农业结构有所优化,优质、专

pip install jieba

https://github.com/fxsjy/jieba

pip install wordcloud

https://github.com/amueller/word_cloud

pip install gensim

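A third-party package installs successfully in the terminal, but it cannot be imported in the notebook.

This problem mostly affects Mac users. macOS ships with its own system Python, so the packages end up installed into that system Python. We therefore need to make sure that we use conda's own pip by giving its full path. For example, my pip is located at /Users/chengjun/anaconda/bin/pip, so in the terminal I type: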

/Users/chengjun/anaconda/bin/pip install package_name
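
One way to check which interpreter the notebook is actually running, sketched with the standard library only. Running pip as a module of that interpreter (`python -m pip`) installs into the same environment; `package_name` is a placeholder, as in the command above.

```python
import sys

# Path of the Python interpreter running this notebook or script.
print(sys.executable)

# Command that would install into this same interpreter's environment.
# It is only printed here, not executed.
install_cmd = [sys.executable, "-m", "pip", "install", "package_name"]
print(" ".join(install_cmd))
```

If the printed path is not the Anaconda one, the notebook and the terminal are using different Python installations.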

In [7]:
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import sys 
import numpy as np
from collections import defaultdict
import statsmodels.api as sm
from wordcloud import WordCloud
import jieba
import matplotlib
import gensim
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the default font
matplotlib.rc("savefig", dpi=400)
In [10]:
# Make sure Chinese text displays correctly in matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the default font
# Microsoft YaHei must be installed on the system
In [9]:
matplotlib.rcParams
Out[9]:
RcParams({u'agg.path.chunksize': 0,
          u'animation.avconv_args': [],
          u'animation.avconv_path': u'avconv',
          u'animation.bitrate': -1,
          u'animation.codec': u'mpeg4',
          u'animation.convert_args': [],
          u'animation.convert_path': u'convert',
          u'animation.ffmpeg_args': [],
          u'animation.ffmpeg_path': u'ffmpeg',
          u'animation.frame_format': u'png',
          u'animation.html': u'none',
          u'animation.mencoder_args': [],
          u'animation.mencoder_path': u'mencoder',
          u'animation.writer': u'ffmpeg',
          u'axes.axisbelow': False,
          u'axes.edgecolor': u'k',
          u'axes.facecolor': u'w',
          u'axes.formatter.limits': [-7, 7],
          u'axes.formatter.use_locale': False,
          u'axes.formatter.use_mathtext': False,
          u'axes.formatter.useoffset': True,
          u'axes.grid': False,
          u'axes.grid.axis': u'both',
          u'axes.grid.which': u'major',
          u'axes.hold': True,
          u'axes.labelcolor': u'k',
          u'axes.labelpad': 5.0,
          u'axes.labelsize': u'medium',
          u'axes.labelweight': u'normal',
          u'axes.linewidth': 1.0,
          u'axes.prop_cycle': cycler(u'color', [u'b', u'g', u'r', u'c', u'm', u'y', u'k']),
          u'axes.spines.bottom': True,
          u'axes.spines.left': True,
          u'axes.spines.right': True,
          u'axes.spines.top': True,
          u'axes.titlesize': u'large',
          u'axes.titleweight': u'normal',
          u'axes.unicode_minus': True,
          u'axes.xmargin': 0.0,
          u'axes.ymargin': 0.0,
          u'axes3d.grid': True,
          u'backend': 'module://ipykernel.pylab.backend_inline',
          u'backend.qt4': u'PyQt4',
          u'backend.qt5': u'PyQt5',
          u'backend_fallback': True,
          u'boxplot.bootstrap': None,
          u'boxplot.boxprops.color': u'b',
          u'boxplot.boxprops.linestyle': u'-',
          u'boxplot.boxprops.linewidth': 1.0,
          u'boxplot.capprops.color': u'k',
          u'boxplot.capprops.linestyle': u'-',
          u'boxplot.capprops.linewidth': 1.0,
          u'boxplot.flierprops.color': u'b',
          u'boxplot.flierprops.linestyle': u'none',
          u'boxplot.flierprops.linewidth': 1.0,
          u'boxplot.flierprops.marker': u'+',
          u'boxplot.flierprops.markeredgecolor': u'k',
          u'boxplot.flierprops.markerfacecolor': u'b',
          u'boxplot.flierprops.markersize': 6.0,
          u'boxplot.meanline': False,
          u'boxplot.meanprops.color': u'r',
          u'boxplot.meanprops.linestyle': u'-',
          u'boxplot.meanprops.linewidth': 1.0,
          u'boxplot.medianprops.color': u'r',
          u'boxplot.medianprops.linestyle': u'-',
          u'boxplot.medianprops.linewidth': 1.0,
          u'boxplot.notch': False,
          u'boxplot.patchartist': False,
          u'boxplot.showbox': True,
          u'boxplot.showcaps': True,
          u'boxplot.showfliers': True,
          u'boxplot.showmeans': False,
          u'boxplot.vertical': True,
          u'boxplot.whiskerprops.color': u'b',
          u'boxplot.whiskerprops.linestyle': u'--',
          u'boxplot.whiskerprops.linewidth': 1.0,
          u'boxplot.whiskers': 1.5,
          u'contour.corner_mask': True,
          u'contour.negative_linestyle': u'dashed',
          u'datapath': u'/Users/chengjun/anaconda/lib/python2.7/site-packages/matplotlib/mpl-data',
          u'docstring.hardcopy': False,
          u'errorbar.capsize': 3.0,
          u'examples.directory': u'',
          u'figure.autolayout': False,
          u'figure.dpi': 80.0,
          u'figure.edgecolor': (1, 1, 1, 0),
          u'figure.facecolor': (1, 1, 1, 0),
          u'figure.figsize': [6.0, 4.0],
          u'figure.frameon': True,
          u'figure.max_open_warning': 20,
          u'figure.subplot.bottom': 0.125,
          u'figure.subplot.hspace': 0.2,
          u'figure.subplot.left': 0.125,
          u'figure.subplot.right': 0.9,
          u'figure.subplot.top': 0.9,
          u'figure.subplot.wspace': 0.2,
          u'figure.titlesize': u'medium',
          u'figure.titleweight': u'normal',
          u'font.cursive': [u'Apple Chancery',
                            u'Textile',
                            u'Zapf Chancery',
                            u'Sand',
                            u'Script MT',
                            u'Felipa',
                            u'cursive'],
          u'font.family': [u'sans-serif'],
          u'font.fantasy': [u'Comic Sans MS',
                            u'Chicago',
                            u'Charcoal',
                            u'ImpactWestern',
                            u'Humor Sans',
                            u'fantasy'],
          u'font.monospace': [u'Bitstream Vera Sans Mono',
                              u'DejaVu Sans Mono',
                              u'Andale Mono',
                              u'Nimbus Mono L',
                              u'Courier New',
                              u'Courier',
                              u'Fixed',
                              u'Terminal',
                              u'monospace'],
          u'font.sans-serif': [u'Microsoft YaHei'],
          u'font.serif': [u'Bitstream Vera Serif',
                          u'DejaVu Serif',
                          u'New Century Schoolbook',
                          u'Century Schoolbook L',
                          u'Utopia',
                          u'ITC Bookman',
                          u'Bookman',
                          u'Nimbus Roman No9 L',
                          u'Times New Roman',
                          u'Times',
                          u'Palatino',
                          u'Charter',
                          u'serif'],
          u'font.size': 10.0,
          u'font.stretch': u'normal',
          u'font.style': u'normal',
          u'font.variant': u'normal',
          u'font.weight': u'normal',
          u'grid.alpha': 1.0,
          u'grid.color': u'k',
          u'grid.linestyle': u':',
          u'grid.linewidth': 0.5,
          u'image.aspect': u'equal',
          u'image.cmap': u'jet',
          u'image.composite_image': True,
          u'image.interpolation': u'bilinear',
          u'image.lut': 256,
          u'image.origin': u'upper',
          u'image.resample': False,
          u'interactive': True,
          u'keymap.all_axes': [u'a'],
          u'keymap.back': [u'left', u'c', u'backspace'],
          u'keymap.forward': [u'right', u'v'],
          u'keymap.fullscreen': [u'f', u'ctrl+f'],
          u'keymap.grid': [u'g'],
          u'keymap.home': [u'h', u'r', u'home'],
          u'keymap.pan': [u'p'],
          u'keymap.quit': [u'ctrl+w', u'cmd+w'],
          u'keymap.save': [u's', u'ctrl+s'],
          u'keymap.xscale': [u'k', u'L'],
          u'keymap.yscale': [u'l'],
          u'keymap.zoom': [u'o'],
          u'legend.borderaxespad': 0.5,
          u'legend.borderpad': 0.4,
          u'legend.columnspacing': 2.0,
          u'legend.edgecolor': u'inherit',
          u'legend.facecolor': u'inherit',
          u'legend.fancybox': False,
          u'legend.fontsize': u'large',
          u'legend.framealpha': None,
          u'legend.frameon': True,
          u'legend.handleheight': 0.7,
          u'legend.handlelength': 2.0,
          u'legend.handletextpad': 0.8,
          u'legend.isaxes': True,
          u'legend.labelspacing': 0.5,
          u'legend.loc': u'upper right',
          u'legend.markerscale': 1.0,
          u'legend.numpoints': 2,
          u'legend.scatterpoints': 3,
          u'legend.shadow': False,
          u'lines.antialiased': True,
          u'lines.color': u'b',
          u'lines.dash_capstyle': u'butt',
          u'lines.dash_joinstyle': u'round',
          u'lines.linestyle': u'-',
          u'lines.linewidth': 1.0,
          u'lines.marker': u'None',
          u'lines.markeredgewidth': 0.5,
          u'lines.markersize': 6.0,
          u'lines.solid_capstyle': u'projecting',
          u'lines.solid_joinstyle': u'round',
          u'markers.fillstyle': u'full',
          u'mathtext.bf': u'serif:bold',
          u'mathtext.cal': u'cursive',
          u'mathtext.default': u'it',
          u'mathtext.fallback_to_cm': True,
          u'mathtext.fontset': u'cm',
          u'mathtext.it': u'serif:italic',
          u'mathtext.rm': u'serif',
          u'mathtext.sf': u'sans\\-serif',
          u'mathtext.tt': u'monospace',
          u'nbagg.transparent': True,
          u'patch.antialiased': True,
          u'patch.edgecolor': u'k',
          u'patch.facecolor': u'b',
          u'patch.linewidth': 1.0,
          u'path.effects': [],
          u'path.simplify': True,
          u'path.simplify_threshold': 0.1111111111111111,
          u'path.sketch': None,
          u'path.snap': True,
          u'pdf.compression': 6,
          u'pdf.fonttype': 3,
          u'pdf.inheritcolor': False,
          u'pdf.use14corefonts': False,
          u'pgf.debug': False,
          u'pgf.preamble': [],
          u'pgf.rcfonts': True,
          u'pgf.texsystem': u'xelatex',
          u'plugins.directory': u'.matplotlib_plugins',
          u'polaraxes.grid': True,
          u'ps.distiller.res': 6000,
          u'ps.fonttype': 3,
          u'ps.papersize': u'letter',
          u'ps.useafm': False,
          u'ps.usedistiller': False,
          u'savefig.bbox': None,
          u'savefig.directory': u'~',
          u'savefig.dpi': 400.0,
          u'savefig.edgecolor': u'w',
          u'savefig.facecolor': u'w',
          u'savefig.format': u'png',
          u'savefig.frameon': True,
          u'savefig.jpeg_quality': 95,
          u'savefig.orientation': u'portrait',
          u'savefig.pad_inches': 0.1,
          u'savefig.transparent': False,
          u'svg.fonttype': u'path',
          u'svg.image_inline': True,
          u'svg.image_noscale': False,
          u'text.antialiased': True,
          u'text.color': u'k',
          u'text.dvipnghack': None,
          u'text.hinting': u'auto',
          u'text.hinting_factor': 8,
          u'text.latex.preamble': [],
          u'text.latex.preview': False,
          u'text.latex.unicode': False,
          u'text.usetex': False,
          u'timezone': u'UTC',
          u'tk.window_focus': False,
          u'toolbar': u'toolbar2',
          u'verbose.fileo': u'sys.stdout',
          u'verbose.level': u'silent',
          u'webagg.open_in_browser': True,
          u'webagg.port': 8988,
          u'webagg.port_retries': 50,
          u'xtick.color': u'k',
          u'xtick.direction': u'in',
          u'xtick.labelsize': u'medium',
          u'xtick.major.pad': 4.0,
          u'xtick.major.size': 4.0,
          u'xtick.major.width': 0.5,
          u'xtick.minor.pad': 4.0,
          u'xtick.minor.size': 2.0,
          u'xtick.minor.visible': False,
          u'xtick.minor.width': 0.5,
          u'ytick.color': u'k',
          u'ytick.direction': u'in',
          u'ytick.labelsize': u'medium',
          u'ytick.major.pad': 4.0,
          u'ytick.major.size': 4.0,
          u'ytick.major.width': 0.5,
          u'ytick.minor.pad': 4.0,
          u'ytick.minor.size': 2.0,
          u'ytick.minor.visible': False,
          u'ytick.minor.width': 0.5})

1. Word segmentation

In [6]:
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))
Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
Default Mode: 我/ 来到/ 北京/ 清华大学
他, 来到, 了, 网易, 杭研, 大厦
小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ,, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造

2. Stop words

In [7]:
import io

filename = '/Users/chengjun/github/cjc2016/data/stopwords.txt'
stopwords = {}
# Read one stop word per line as unicode text and skip blank lines
with io.open(filename, 'r', encoding='utf-8') as f:
    for line in f:
        word = line.strip()
        if word:
            stopwords[word] = 1
In [8]:
adding_stopwords = [u'我们', u'要', u'地', u'有', u'这', u'人',
                    u'发展',u'建设',u'加强',u'继续',u'对',u'等',u'推进',u'工作',u'增加']
for s in adding_stopwords: stopwords[s]=10
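
A short sketch of how such a stop-word dictionary is typically applied to segmented text. The `stopwords` dictionary and the `tokens` list below are small hypothetical stand-ins (the tokens play the role of `jieba.cut` output), so that the example runs on its own.

```python
# Hypothetical stand-ins for the stop-word dictionary and a segmented sentence.
stopwords = {u'我们': 10, u'要': 10, u'发展': 10}
tokens = [u'我们', u'要', u'深化', u'改革', u'发展', u'经济']

# Keep only the tokens that are not stop words.
filtered = [w for w in tokens if w not in stopwords]
print(u' '.join(filtered))  # 深化 改革 经济
```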

3. Keyword extraction

Keyword extraction based on the TF-IDF algorithm

In [14]:
import jieba.analyse
txt = reports[-1]
tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)
In [262]:
print(u"、".join([i[0] for i in tf[:50]]))
发展、推进、改革、建设、创新、加快、经济、加强、促进、实施、政府、推动、完善、政策、全面、增长、社会、就业、企业、提高、创业、扩大、制度、坚持、一批、深化、人民、落实、支持、农村、试点、实现、安全、合作、工作、我国、动能、机制、加大、服务业、城镇、我们、服务、取得、依法、积极、中国、深入、结构性、民生
In [267]:
plt.hist([i[1] for i in tf])
plt.show()

Keyword extraction based on the TextRank algorithm

In [264]:
tr = jieba.analyse.textrank(txt,topK=200, withWeight=True)
print(u"、".join([i[0] for i in tr[:50]]))
发展、建设、经济、改革、推进、创新、加强、加快、政府、推动、促进、实施、企业、政策、社会、制度、中国、提高、完善、全面、增长、扩大、支持、实现、工作、机制、创业、人民、服务、农村、试点、地方、坚持、国家、国际、继续、就业、合作、基本、加大、农业、投资、保护、问题、地区、依法、工程、取得、鼓励、建立
In [268]:
plt.hist([i[1] for i in tr])
plt.show()
In [75]:
import pandas as pd

def keywords(index):
    txt = reports[-index]
    tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)
    tr = jieba.analyse.textrank(txt,topK=200, withWeight=True)
    tfdata = pd.DataFrame(tf, columns=['word', 'tfidf'])
    trdata = pd.DataFrame(tr, columns=['word', 'textrank'])
    worddata = pd.merge(tfdata, trdata, on='word')
    plt.plot(worddata.tfidf, worddata.textrank, linestyle='',marker='.')
    for i in range(len(worddata.word)):
        plt.text(worddata.tfidf[i], worddata.textrank[i], worddata.word[i], 
                 fontsize = worddata.textrank[i]*15, color = 'red', rotation = 0)
    plt.title(txt[:4])
    plt.xlabel('Tf-Idf')
    plt.ylabel('TextRank')
    plt.show()
In [80]:
keywords(1)
In [269]:
keywords(2)
In [270]:
keywords(3)

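Algorithm paper:

TextRank: Bringing Order into Texts

Basic idea:

  • Segment the text from which keywords are to be extracted into words
  • Build a graph from the co-occurrence relations between words within a fixed-size window (5 by default, adjustable via the span attribute)
  • Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted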
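
The three steps can be sketched in plain Python as follows. This is an illustrative approximation, not jieba's implementation: the function name `textrank`, the damping factor 0.85, the iteration count, and the English token list are all assumptions made for the example.

```python
from collections import defaultdict

def textrank(tokens, window=5, damping=0.85, iterations=50):
    """Rank tokens by weighted PageRank on an undirected co-occurrence graph."""
    # Edge weight = number of times two distinct words co-occur in a window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1:i + window]:
            if w1 != w2:
                weights[w1][w2] += 1.0
                weights[w2][w1] += 1.0

    nodes = list(weights)
    scores = {w: 1.0 for w in nodes}
    # Total edge weight attached to each node.
    strength = {w: sum(weights[w].values()) for w in nodes}
    for _ in range(iterations):
        new_scores = {}
        for w in nodes:
            # Each neighbour v passes on its score in proportion to edge weight.
            incoming = sum(weights[v][w] / strength[v] * scores[v]
                           for v in weights[w])
            new_scores[w] = (1 - damping) + damping * incoming
        scores = new_scores
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical, already segmented text.
tokens = "reform economy reform development economy innovation reform".split()
ranked = textrank(tokens)
print(ranked[0][0])  # reform
```

Words that co-occur with many other words, and with highly ranked words, end up with the highest scores.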

4. Word cloud

In [59]:
def wordcloudplot(txt, year):
    wordcloud = WordCloud(font_path='/Users/chengjun/github/cjc2016/data/msyh.ttf').generate(txt)
    # Open a plot of the generated image.
    plt.imshow(wordcloud)
    plt.title(year)
    plt.axis("off")
    #plt.show()

Word cloud filtered by tf-idf

In [326]:
txt = reports[-1]
tfidf200= jieba.analyse.extract_tags(txt, topK=200, withWeight=False)
seg_list = jieba.cut(txt, cut_all=False)
seg_list = [i for i in seg_list if i in tfidf200]
txt200 = r' '.join(seg_list)
wordcloudplot(txt200, txt[:4])