Introduction to Text Mining



王成军

[email protected]

计算传播网 http://computational-communication.com

What can be learned from 5 million books

http://v.youku.com/v_show/id_XMzA3OTA5MjUy.html

This talk by Jean-Baptiste Michel and Erez Lieberman Aiden is phenomenal. The associated article is also well worth checking out: Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331, 176–182.

Try the Google Books Ngram data: https://books.google.com/ngrams/

Data download: http://www.culturomics.org/home

Bag-of-words model (BOW)

Represent text as numerical feature vectors

  • We create a vocabulary of unique tokens—for example, words—from the entire set of documents.
  • We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will consist of mostly zeros, which is why we call them sparse.

The bag-of-words model (literally a "bag" of words) is widely used in information retrieval. It assumes that a text can be treated simply as a collection of its words, ignoring word order, grammar, and syntax: every word occurs independently of the others, as if the author had chosen each word at each position without being influenced by the preceding sentences. Although this assumption simplifies natural language considerably, it makes modeling convenient.

The assumption is unreasonable in some situations. In personalized news recommendation, for example, a bag-of-words model runs into trouble: if a user is interested in the phrase "南京醉酒驾车事故" (a drunk-driving accident in Nanjing), the model ignores order and syntax and concludes that the user is interested in "南京" (Nanjing), "醉酒" (drunk), "驾车" (driving), and "事故" (accident) separately, so it may recommend news about, say, Nanjing bus accidents, which is clearly not what the user wants.

One remedy is to extract the whole phrase (e.g., with methods such as SCPCD), or to use a higher-order (order ≥ 2) statistical language model such as bigrams or trigrams to retain word order, i.e., a bag of bigrams or a bag of trigrams, which alleviates the problem to some extent. In short, whether the bag-of-words model is appropriate depends on the application: it should not be used when word order, grammar, or syntax cannot be ignored. The sketch below illustrates the difference.
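A minimal sketch of this point, assuming scikit-learn's CountVectorizer (introduced in the next section) and two made-up English sentences: a bag of words cannot tell the sentences apart, while a bag of bigrams can.

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the dog bites the man', 'the man bites the dog']  # same words, different meaning

unigram = CountVectorizer(ngram_range=(1, 1))
print(unigram.fit_transform(docs).toarray())  # identical rows: word order is lost

bigram = CountVectorizer(ngram_range=(2, 2))
print(bigram.fit_transform(docs).toarray())   # different rows: bigrams keep local word order
print(bigram.get_feature_names())             # bigram vocabulary, e.g. 'the dog', 'dog bites', ...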

Transforming words into feature vectors

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.

In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

There are various schemes for determining the value that each entry in the matrix should take; one common scheme is tf-idf. Document-term matrices are widely used in natural language processing.

D1 = "I like databases"

D2 = "I hate databases"

      I  like  hate  databases
D1    1  1     0     1
D2    1  0     1     1
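As a quick check, the table above can be reproduced with scikit-learn's CountVectorizer (a sketch, not an original notebook cell; the default token_pattern drops one-character tokens such as "I", so it is relaxed here, lowercase=False keeps "I" capitalized, and the columns come out in alphabetical order rather than the order shown above):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["I like databases", "I hate databases"]
vec = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
dtm = vec.fit_transform(docs)
print(pd.DataFrame(dtm.toarray(), columns=vec.get_feature_names(), index=["D1", "D2"]))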
In [66]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
In [2]:
' '.join(dir(count))  
Out[2]:
'__class__ __delattr__ __dict__ __dir__ __doc__ __eq__ __format__ __ge__ __getattribute__ __gt__ __hash__ __init__ __le__ __lt__ __module__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _char_ngrams _char_wb_ngrams _check_vocabulary _count_vocab _get_param_names _limit_features _sort_features _validate_vocabulary _white_spaces _word_ngrams analyzer binary build_analyzer build_preprocessor build_tokenizer decode decode_error dtype encoding fit fit_transform fixed_vocabulary fixed_vocabulary_ get_feature_names get_params get_stop_words input inverse_transform lowercase max_df max_features min_df ngram_range preprocessor set_params stop_words stop_words_ strip_accents token_pattern tokenizer transform vocabulary vocabulary_'
In [3]:
count.get_feature_names()
Out[3]:
['and', 'is', 'shining', 'sun', 'sweet', 'the', 'weather']
In [67]:
print(count.vocabulary_)
{'sun': 3, 'is': 1, 'weather': 6, 'the': 5, 'and': 0, 'shining': 2, 'sweet': 4}
In [68]:
type(bag)
Out[68]:
scipy.sparse.csr.csr_matrix
In [69]:
print(bag.toarray())
[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
In [70]:
import pandas as pd
pd.DataFrame(bag.toarray(), columns = count.get_feature_names())
Out[70]:
   and  is  shining  sun  sweet  the  weather
0    0   1        1    1      0    1        0
1    0   1        0    0      1    1        1
2    1   2        1    1      1    2        1

1-gram

The sequence of items in the bag-of-words model that we just created is also called the 1-gram or unigram model

  • each item or token in the vocabulary represents a single word.

n-gram

The choice of the number n in the n-gram model depends on the particular application. For example, for the sentence "the sun is shining":

  • 1-gram: "the", "sun", "is", "shining"
  • 2-gram: "the sun", "sun is", "is shining"

The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter.

While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer instance with ngram_range=(2,2).
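For example, a minimal sketch reusing the docs array and the CountVectorizer import from the cells above:

count_2gram = CountVectorizer(ngram_range=(2, 2))
bag_2gram = count_2gram.fit_transform(docs)
print(count_2gram.get_feature_names())  # bigram vocabulary: 'the sun', 'sun is', 'is shining', ...
print(bag_2gram.toarray())              # counts of each bigram per document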

Assessing word relevancy via term frequency-inverse document frequency

$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$

$\text{tf}(t, d)$ is the term frequency of term $t$ in document $d$.

The inverse document frequency $\text{idf}(t)$ can be calculated as:

$\text{idf}(t) = \log \frac{n_d}{1 + \text{df}(d, t)}$

In scikit-learn, TfidfTransformer(use_idf=True, smooth_idf=True) uses the smoothed variant

$\text{idf}(t) = \log \frac{1 + n_d}{1 + \text{df}(d, t)} + 1$

(with smooth_idf=False the two 1s inside the logarithm are dropped, giving $\text{idf}(t) = \log \frac{n_d}{\text{df}(d, t)} + 1$, the setting used in the raw-value examples below), where $n_d$ is the total number of documents and $\text{df}(d, t)$ is the number of documents that contain the term $t$.

Question: why do we add the constant 1 to the denominator?

In-class exercise: using the formula above, compute the tf-idf value of the term 'is' in document 2 (i.e. docs[2], the third document).

TfidfTransformer

Scikit-learn implements yet another transformer, the TfidfTransformer, that takes the raw term frequencies from CountVectorizer as input and transforms them into tf-idfs:

In [76]:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2)

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]
In [85]:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2)

tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=False)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
[[ 0.    1.    1.41  1.41  0.    1.    0.  ]
 [ 0.    1.    0.    0.    1.41  1.    1.41]
 [ 2.1   2.    1.41  1.41  1.41  2.    1.41]]
In [80]:
bag = tfidf.fit_transform(count.fit_transform(docs))
pd.DataFrame(bag.toarray(), columns = count.get_feature_names())
Out[80]:
        and   is   shining       sun     sweet  the   weather
0  0.000000  1.0  1.287682  1.287682  0.000000  1.0  0.000000
1  0.000000  1.0  0.000000  0.000000  1.287682  1.0  1.287682
2  1.693147  2.0  1.287682  1.287682  1.287682  2.0  1.287682
In [83]:
# tf-idf of a single term: 'is' in the last document (raw values: smooth_idf=False, norm=None)
import numpy as np
tf_is = 2.0   # 'is' occurs twice in the last document
n_docs = 3.0  # total number of documents
df_is = 3.0   # 'is' occurs in all three documents
idf_is = np.log(n_docs / df_is) + 1  # = log(1) + 1 = 1

tfidf_is = tf_is * idf_is
print('tf-idf of term "is" = %.2f' % tfidf_is)
tf-idf of term "is" = 2.00
In [86]:
# raw (unnormalized) tf-idf values of the terms in the last document
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=False)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf, count.get_feature_names()
Out[86]:
(array([ 2.1 ,  2.  ,  1.41,  1.41,  1.41,  2.  ,  1.41]),
 ['and', 'is', 'shining', 'sun', 'sweet', 'the', 'weather'])

The raw tf-idf values computed by scikit-learn above (smooth_idf=False, norm=None) therefore correspond to:

$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \log \frac{n_d}{\text{df}(d, t)} + 1 \right)$

L2-normalization

$x_{\text{norm}} = \frac{x}{\|x\|_2} = \frac{x}{\sqrt{\sum_i x_i^2}}$

In [14]:
# tf-idf values after L2 normalization (matches the norm='l2' output above)
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf 
Out[14]:
array([ 0.4 ,  0.48,  0.31,  0.31,  0.31,  0.48,  0.31])

Text Mining of the Government Work Reports (政府工作报告)

0. Reading the data

In [87]:
with open('../data/gov_reports1954-2017.txt', 'r') as f:
    reports = f.readlines()
In [88]:
len(reports)
Out[88]:
48
In [89]:
print(reports[33][:1000])
2003	2003年政府工作报告		——2003年3月5日在第十届全国人民代表大会第一次会议上		                               国务院总理 朱镕基   各位代表:  本届政府1998年3月就职,任期即将结束。现在,我代表国务院,向第十届全国人民代表大会第一次会议报告过去五年的工作,对今年的工作提出建议,请予审议,并请全国政协各位委员提出意见。  一、过去五年政府工作的回顾  第九届全国人民代表大会第一次会议以来的五年,是很不平凡的五年。本届政府初期,亚洲金融危机冲击,世界经济增长放慢;国内产业结构矛盾十分突出,国有企业职工大量下岗;1998、1999年连续遭受特大洪涝灾害。全国各族人民在中国共产党领导下,团结奋进,顽强拼搏,战胜种种困难,改革开放和经济社会发展取得举世公认的伟大成就。我们胜利实现了现代化建设第二步战略目标,开始向第三步战略目标迈进。  五年来,国民经济保持良好发展势头,经济结构战略性调整迈出重要步伐。  ——经济持续较快增长。国内生产总值从1997年的7.4万亿元增加到2002年的10.2万亿元,按可比价格计算,平均每年增长7.7%。产业结构调整成效明显。粮食等主要农产品供给实现了由长期短缺到总量平衡、丰年有余的历史性转变。以信息产业为代表的高新技术产业迅速崛起。传统工业改造步伐加快。现代服务业快速发展。经济增长质量和效益不断提高。国家税收连年大幅度增长。全国财政收入从1997年的8651亿元增加到2002年的18914亿元,平均每年增加2053亿元;国家外汇储备从1399亿美元增加到2864亿美元。五年全社会固定资产投资累计完成17.2万亿元,特别是发行6600亿元长期建设国债,带动银行贷款和其他社会资金形成3.28万亿元的投资规模,办成不少多年想办而没有力量办的大事。社会生产力跃上新台阶,国家的经济实力、抗风险能力和国际竞争力明显增强。  ——基础设施建设成就显著。我们集中力量建成了一批关系全局的重大基础设施项目。进行了新中国成立以来规模最大的水利建设。五年全国水利建设投资3562亿元,扣除价格变动因素,相当于1950年到1997年全国水利建设投资的总和。一批重大水利设施项目相继开工和竣工。江河堤防加固工程开工3.5万公里,完成了长达3500多公里的长江干堤和近千公里的黄河堤防加固工程,防洪能力大大增强。举世瞩目的

pip install jieba

https://github.com/fxsjy/jieba

pip install wordcloud

https://github.com/amueller/word_cloud

pip install gensim

If a third-party package installs successfully in the terminal but cannot be imported in the notebook:

This problem mostly affects macOS users, because macOS ships with a system Python and the package gets installed into that system Python rather than the one the notebook uses. Make sure you call conda's own pip by giving its full path; for example, if your pip lives at /Users/chengjun/anaconda/bin/pip, run in the terminal:

/Users/chengjun/anaconda/bin/pip install package_name
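To check which interpreter the notebook is actually running (and therefore which pip it needs), a quick sketch:

import sys
print(sys.executable)  # install into this interpreter with: <this path> -m pip install package_name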

In [90]:
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import sys 
import numpy as np
from collections import defaultdict
import statsmodels.api as sm
from wordcloud import WordCloud
import jieba
import matplotlib
import gensim
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
#matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the default font (for Chinese labels)
matplotlib.rc("savefig", dpi=400)
In [23]:
# To make sure Chinese characters display correctly in matplotlib:
#matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the default font
# requires the Microsoft YaHei font to be installed on the system
In [23]:
# import matplotlib
# my_font = matplotlib.font_manager.FontProperties(
#     fname='/Users/chengjun/github/cjc/data/msyh.ttf')

1. Word segmentation

In [25]:
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # 全模式

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # 精确模式

seg_list = jieba.cut("他来到了网易杭研大厦")  # 默认是精确模式
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # 搜索引擎模式
print(", ".join(seg_list))
Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
Default Mode: 我/ 来到/ 北京/ 清华大学
他, 来到, 了, 网易, 杭研, 大厦
小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ,, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造

2. Stopwords

In [91]:
# Read the stopword list: one word per line, stored as keys of a dict
filename = '../data/stopwords.txt'
with open(filename, 'r') as f:
    stopwords = {line.rstrip(): 1 for line in f if line.rstrip()}
In [92]:
# Additional, domain-specific stopwords (e.g. 我们 'we', 发展 'development', 建设 'construction')
adding_stopwords = [u'我们', u'要', u'地', u'有', u'这', u'人',
                    u'发展', u'建设', u'加强', u'继续', u'对', u'等',
                    u'推进', u'工作', u'增加']
for s in adding_stopwords:
    stopwords[s] = 10
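These stopwords can then be used to filter segmented text. A minimal sketch of the intended usage (assumed, not an original notebook cell):

# drop stopwords and single-character tokens from a jieba-segmented report
words = [w for w in jieba.cut(reports[-1]) if w not in stopwords and len(w.strip()) > 1]
print(words[:20])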

3. Keyword extraction

Keyword extraction based on the TF-IDF algorithm

In [93]:
import jieba.analyse
txt = reports[-1]
tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)
In [94]:
u"、".join([i[0] for i in tf[:50]])
Out[94]:
'发展、改革、推进、建设、加强、加快、推动、深化、完善、创新、就业、全面、促进、经济、政府、深入、实施、提高、企业、支持、群众、服务、坚持、人民、坚决、制度、治理、政策、农村、试点、扩大、机制、社会、落实、工作、保障、增长、国家、生态、安全、今年、稳定、继续、地区、保护、中国、合作、产能、维护、重点'
In [95]:
plt.hist([i[1] for i in tf])
plt.show()

Keyword extraction based on the TextRank algorithm

In [96]:
tr = jieba.analyse.textrank(txt,topK=200, withWeight=True)
u"、".join([i[0] for i in tr[:50]])
Out[96]:
'发展、改革、推进、建设、经济、加强、推动、加快、政府、完善、创新、企业、全面、实施、促进、提高、支持、服务、政策、深入、中国、就业、国家、制度、群众、社会、人民、地区、坚持、扩大、农村、地方、保护、继续、增长、机制、工作、保障、治理、试点、合作、综合、重点、市场、投资、领域、加大、消费、制定、维护'
In [97]:
plt.hist([i[1] for i in tr])
plt.show()
In [98]:
import pandas as pd

def keywords(index):
    # scatter the tf-idf weight against the TextRank weight for the top
    # keywords of the report `index` positions from the end of the corpus
    txt = reports[-index]
    tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)
    tr = jieba.analyse.textrank(txt,topK=200, withWeight=True)
    tfdata = pd.DataFrame(tf, columns=['word', 'tfidf'])
    trdata = pd.DataFrame(tr, columns=['word', 'textrank'])
    worddata = pd.merge(tfdata, trdata, on='word')
    fig = plt.figure(figsize=(16, 6),facecolor='white')
    plt.plot(worddata.tfidf, worddata.textrank, linestyle='',marker='.')
    for i in range(len(worddata.word)):
        plt.text(worddata.tfidf[i], worddata.textrank[i], worddata.word[i], 
                 fontsize = worddata.textrank[i]*30, 
                 color = 'red', rotation = 0
                )
    plt.title(txt[:4])
    plt.xlabel('Tf-Idf')
    plt.ylabel('TextRank')
    plt.show()
In [99]:
keywords(1)
In [52]:
keywords(2)
In [53]:
keywords(3)

Algorithm paper:

TextRank: Bringing Order into Texts

Basic idea:

  • Segment the text from which keywords are to be extracted.
  • Build a graph from the co-occurrence of words within a fixed window (default size 5, adjustable via the span attribute).
  • Compute PageRank over the resulting undirected, weighted graph (a simplified sketch follows this list).
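A simplified sketch of this idea (not jieba's exact implementation), assuming networkx is installed and using an assumed toy sentence: build an undirected weighted co-occurrence graph over a window of five tokens, then rank words by PageRank.

import jieba
import networkx as nx

text = '全国各族人民团结奋进,改革开放和经济社会发展取得伟大成就'  # assumed toy sentence
words = [w for w in jieba.cut(text) if len(w) > 1]  # crude filter: keep multi-character words

G = nx.Graph()
window = 5
for i, w in enumerate(words):
    for v in words[i + 1:i + window]:  # words co-occurring with w inside the window
        if w != v:
            old = G.get_edge_data(w, v, default={'weight': 0})['weight']
            G.add_edge(w, v, weight=old + 1)  # accumulate co-occurrence counts as edge weights

rank = nx.pagerank(G, weight='weight')
print(sorted(rank.items(), key=lambda x: x[1], reverse=True)[:10])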

4. Word clouds

In [100]:
def wordcloudplot(txt, year):
    wordcloud = WordCloud(font_path='../data/msyh.ttf').generate(txt)
    # Display the generated word-cloud image
    fig = plt.figure(figsize=(16, 6),facecolor='white')
    plt.imshow(wordcloud)
    plt.title(year)
    plt.axis("off")
    #plt.show()

Word clouds filtered by tf-idf

In [57]:
txt = reports[-1]
# keep only the top-200 tf-idf keywords, then rebuild a space-separated text from them
tfidf200 = jieba.analyse.extract_tags(txt, topK=200, withWeight=False)
seg_list = jieba.cut(txt, cut_all=False)
seg_list = [i for i in seg_list if i in tfidf200]
txt200 = ' '.join(seg_list)
wordcloudplot(txt200, txt[:4])  # the first four characters of the report are its year
In [60]:
txt = reports[-2]
tfidf200= jieba.analyse.extract_tags(txt, topK=200, withWeight=False)
seg_list = jieba.cut(txt, cut_all=False)
seg_list = [i for i in seg_list if i in tfidf200]
txt200 = r' '.join(seg_list)
wordcloudplot(txt200, txt[:4]) 