http://v.youku.com/v_show/id_XMzA3OTA5MjUy.html
This talk by Jean-Baptiste Michel and Erez Lieberman Aiden is phenomenal. The associated article is also well worth checking out: Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331, 176–182.
Try the Google Books Ngram data: https://books.google.com/ngrams/
Represent text as numerical feature vectors
Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors consist mostly of zeros, which is why we call them sparse.
The bag-of-words model, as used in information retrieval, assumes that a text can be treated as a simple collection of its words, ignoring word order, grammar, and syntax. Each word in the text is assumed to occur independently of the others; equivalently, the author is assumed to pick each word at any position without being influenced by the preceding sentence. This assumption greatly simplifies natural language and makes it easy to model.
In some settings, however, the assumption is unreasonable. In personalized news recommendation, for example, a bag-of-words model runs into trouble: if user A is interested in the phrase "Nanjing drunk-driving accident", discarding order and syntax reduces this to an interest in "Nanjing", "drunk", "driving", and "accident", so the system might recommend news about a "Nanjing" "bus" "accident", which is clearly not what the user wanted.
One remedy is to extract whole phrases (for example with the SCPCD method), or to use a higher-order (order 2 and above) statistical language model such as bigrams or trigrams, which preserves word order; this amounts to a bag of bigrams or a bag of trigrams and alleviates the problem to some extent. In short, whether the bag-of-words model is appropriate depends on the application: wherever word order, grammar, and syntax cannot be ignored, bag of words should not be used.
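A minimal sketch of this order-loss problem, using two hypothetical English sentences: both contain exactly the same words, so their bags of words are identical, while a bag of bigrams still tells them apart.
from collections import Counter
s1 = 'the dog bites the man'.split()   # hypothetical example sentences
s2 = 'the man bites the dog'.split()
# bag of words: identical counts, so the two sentences are indistinguishable
print(Counter(s1) == Counter(s2))      # True
# bag of bigrams: adjacent word pairs preserve enough order to distinguish them
bigrams = lambda ws: Counter(zip(ws, ws[1:]))
print(bigrams(s1) == bigrams(s2))      # False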
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.
In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
There are various schemes for determining the value that each entry in the matrix should take; one such scheme is tf-idf. Document-term matrices are widely used in natural language processing.
D1 = "I like databases"
D2 = "I hate databases"
| | I | like | hate | databases |
|---|---|---|---|---|
| D1 | 1 | 1 | 0 | 1 |
| D2 | 1 | 0 | 1 | 1 |
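This matrix can be reproduced with scikit-learn's CountVectorizer; a sketch under the assumption that single-character tokens should be kept (the default token_pattern requires at least two word characters and would drop "I"):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ['I like databases', 'I hate databases']
count = CountVectorizer(token_pattern=r'(?u)\b\w+\b', lowercase=False)  # keep the token 'I'
dtm = count.fit_transform(docs)
print(pd.DataFrame(dtm.toarray(), columns=count.get_feature_names(), index=['D1', 'D2']))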
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
'The sun is shining',
'The weather is sweet',
'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
' '.join(dir(count))
'__class__ __delattr__ __dict__ __doc__ __format__ __getattribute__ __hash__ __init__ __module__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _char_ngrams _char_wb_ngrams _check_vocabulary _count_vocab _get_param_names _limit_features _sort_features _validate_vocabulary _white_spaces _word_ngrams analyzer binary build_analyzer build_preprocessor build_tokenizer decode decode_error dtype encoding fit fit_transform fixed_vocabulary fixed_vocabulary_ get_feature_names get_params get_stop_words input inverse_transform lowercase max_df max_features min_df ngram_range preprocessor set_params stop_words stop_words_ strip_accents token_pattern tokenizer transform vocabulary vocabulary_'
count.get_feature_names()
[u'and', u'is', u'shining', u'sun', u'sweet', u'the', u'weather']
print(count.vocabulary_)
{u'and': 0, u'weather': 6, u'sweet': 4, u'sun': 3, u'is': 1, u'the': 5, u'shining': 2}
type(bag)
scipy.sparse.csr.csr_matrix
print(bag.toarray())
[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
import pandas as pd
pd.DataFrame(bag.toarray(), columns = count.get_feature_names())
| | and | is | shining | sun | sweet | the | weather |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 |
The sequence of items in the bag-of-words model that we just created is also called the 1-gram or unigram model. The choice of the number n in the n-gram model depends on the particular application. The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter. While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer instance with ngram_range=(2,2).
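For instance, a short sketch of the 2-gram variant on the same three documents (the vocabulary now consists of adjacent word pairs):
count_2gram = CountVectorizer(ngram_range=(2, 2))  # reuse the docs array defined above
bag_2gram = count_2gram.fit_transform(docs)
print(count_2gram.get_feature_names())  # e.g. u'is shining', u'sun is', u'the sun', ...
print(bag_2gram.toarray())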
Scikit-learn implements yet another transformer, the TfidfTransformer, that takes the raw term frequencies from CountVectorizer as input and transforms them into tf-idfs:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2)
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]
bag = tfidf.fit_transform(count.fit_transform(docs))
pd.DataFrame(bag.toarray(), columns = count.get_feature_names())
| | and | is | shining | sun | sweet | the | weather |
|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.433708 | 0.558478 | 0.558478 | 0.000000 | 0.433708 | 0.000000 |
| 1 | 0.000000 | 0.433708 | 0.000000 | 0.000000 | 0.558478 | 0.433708 | 0.558478 |
| 2 | 0.404748 | 0.478102 | 0.307822 | 0.307822 | 0.307822 | 0.478102 | 0.307822 |
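With smooth_idf=True, scikit-learn computes the inverse document frequency as idf(t) = log((1 + n_docs) / (1 + df(t))) and the unnormalized tf-idf as tf(t, d) * (idf(t) + 1); with norm='l2', each document vector is then divided by its Euclidean length. The cells below work through this by hand for the term "is" in the third document, where tf = 2 and df = 3.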
# tf-idf value of a single term
tf_is = 2                              # "is" occurs twice in the third document
n_docs = 3                             # total number of documents
idf_is = np.log((n_docs+1) / (3+1))    # df("is") = 3: "is" occurs in every document
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)
tf-idf of term "is" = 2.00
# raw (unnormalized) tf-idf values for the terms in the last document
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf
array([ 1.69, 2. , 1.29, 1.29, 1.29, 2. , 1.29])
# tf-idf values after L2 normalization
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf
array([ 0.4 , 0.48, 0.31, 0.31, 0.31, 0.48, 0.31])
with open('/Users/chengjun/github/cjc2016/data/gov_reports1954-2016.txt', 'r') as f:
reports = f.readlines()
len(reports)
47
print reports[32][:1000]
2002 2002年政府工作报告 ——2002年3月5日在第九届全国人民代表大会第五次会议上 国务院总理朱镕基 各位代表: 现在,我代表国务院向大会作政府工作报告,请予审议,并请全国政协各位委员提出意见。 首先报告2001年的工作。 新世纪第一年,全国各族人民在中国共产党领导下,面对复杂多变的国际形势,克服困难,阔步前进,改革开放和社会主义现代化建设取得了新的重大成就。 国民经济保持良好发展势头。在世界经济增长明显减速的情况下,我们坚持扩大内需的方针,坚定地实施积极的财政政策和稳健的货币政策,实现了经济较快增长。2001年国内生产总值达到95933亿元,比上年增长7.3%。经济结构调整取得积极进展。农业结构有所优化,优质、专
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import sys
import numpy as np
from collections import defaultdict
import statsmodels.api as sm
from wordcloud import WordCloud
import jieba
import matplotlib
import gensim
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
# Ensure Chinese characters display correctly in matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the default font
# Microsoft YaHei must be installed on the system
matplotlib.rc("savefig", dpi=400)
matplotlib.rcParams
RcParams({..., u'font.sans-serif': [u'Microsoft YaHei'], ..., u'savefig.dpi': 400.0, ...})
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode
seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search-engine mode
print(", ".join(seg_list))
Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
Default Mode: 我/ 来到/ 北京/ 清华大学
他, 来到, 了, 网易, 杭研, 大厦
小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ,, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
filename = '/Users/chengjun/github/cjc2016/data/stopwords.txt'
stopwords = {}
f = open(filename, 'r')
line = f.readline().rstrip()
while line:
    # key on the decoded unicode string, matching the unicode words added below
    stopwords[line.decode('utf-8')] = 1
    line = f.readline().rstrip()
f.close()
adding_stopwords = [u'我们', u'要', u'地', u'有', u'这', u'人',
u'发展',u'建设',u'加强',u'继续',u'对',u'等',u'推进',u'工作',u'增加']
for s in adding_stopwords: stopwords[s]=10
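A quick sketch of how this stopword dictionary could be applied when segmenting a report by hand (hypothetical usage; the cells below use jieba.analyse, which applies its own built-in stop-word list):
# hypothetical filtering step: drop stopwords and whitespace-only tokens after segmentation
words = [w for w in jieba.cut(reports[-1]) if w not in stopwords and w.strip()]
print(len(words))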
import jieba.analyse
txt = reports[-1]
tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)
print u"、".join([i[0] for i in tf[:50]])
发展、推进、改革、建设、创新、加快、经济、加强、促进、实施、政府、推动、完善、政策、全面、增长、社会、就业、企业、提高、创业、扩大、制度、坚持、一批、深化、人民、落实、支持、农村、试点、实现、安全、合作、工作、我国、动能、机制、加大、服务业、城镇、我们、服务、取得、依法、积极、中国、深入、结构性、民生
plt.hist([i[1] for i in tf])
plt.show()
tr = jieba.analyse.textrank(txt,topK=200, withWeight=True)
print u"、".join([i[0] for i in tr[:50]])
发展、建设、经济、改革、推进、创新、加强、加快、政府、推动、促进、实施、企业、政策、社会、制度、中国、提高、完善、全面、增长、扩大、支持、实现、工作、机制、创业、人民、服务、农村、试点、地方、坚持、国家、国际、继续、就业、合作、基本、加大、农业、投资、保护、问题、地区、依法、工程、取得、鼓励、建立
plt.hist([i[1] for i in tr])
plt.show()
import pandas as pd
def keywords(index):
    # compare tf-idf and TextRank weights for the report `index` positions from the end
    txt = reports[-index]
    tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)
    tr = jieba.analyse.textrank(txt, topK=200, withWeight=True)
    tfdata = pd.DataFrame(tf, columns=['word', 'tfidf'])
    trdata = pd.DataFrame(tr, columns=['word', 'textrank'])
    worddata = pd.merge(tfdata, trdata, on='word')
    plt.plot(worddata.tfidf, worddata.textrank, linestyle='', marker='.')
    for i in range(len(worddata.word)):
        plt.text(worddata.tfidf[i], worddata.textrank[i], worddata.word[i],
                 fontsize=worddata.textrank[i]*15, color='red', rotation=0)
    plt.title(txt[:4])  # each report begins with its year
    plt.xlabel('Tf-Idf')
    plt.ylabel('TextRank')
    plt.show()
keywords(1)
keywords(2)
keywords(3)
def wordcloudplot(txt, year):
    # generate and display a word cloud from a whitespace-joined string of terms
    wordcloud = WordCloud(font_path='/Users/chengjun/github/cjc2016/data/msyh.ttf').generate(txt)
    # Open a plot of the generated image.
    plt.imshow(wordcloud)
    plt.title(year)
    plt.axis("off")
    #plt.show()
txt = reports[-1]
tfidf200 = jieba.analyse.extract_tags(txt, topK=200, withWeight=False)
seg_list = jieba.cut(txt, cut_all=False)
# keep only the 200 highest tf-idf terms, then join them into a space-separated string
seg_list = [i for i in seg_list if i in tfidf200]
txt200 = ' '.join(seg_list)
wordcloudplot(txt200, txt[:4])
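As an alternative sketch (not part of the original workflow), the tf-idf weights themselves can be passed to the word cloud; note that generate_from_frequencies expects a dict in recent wordcloud versions and a list of (word, weight) tuples in older ones:
# hypothetical variant: build the word cloud directly from the tf-idf weights
weights = dict(jieba.analyse.extract_tags(txt, topK=200, withWeight=True))
wc = WordCloud(font_path='/Users/chengjun/github/cjc2016/data/msyh.ttf')
wc.generate_from_frequencies(weights)
plt.imshow(wc)
plt.title(txt[:4])
plt.axis('off')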