Document Clustering and Topic Modeling in Python

易红发 [email protected]

Document clustering and text classification are fundamental text-mining tasks. This article focuses on unsupervised clustering algorithms, including K-means clustering, hierarchical clustering, and LDA topic modeling.
Working with text in Python mainly involves the following modules: nltk, pandas, sklearn, gensim, and so on.
For anyone who wants to mine text with Python, this article should be a useful reference.

The main task in this article is to cluster films by their synopses; the data can be downloaded here. It consists of three parts: title, synopses, and genres.

In [1]:
# import the modules we will use
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup
from sklearn import feature_extraction

Data preprocessing

In [2]:
# load the data: the list of film titles and the synopses, keeping only the first 100 films
titles = open('title_list.txt').read().split('\n')
# make sure only the first 100 entries are kept
titles = titles[:100]

synopses = open('synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses = synopses[:100]
In [3]:
# clean the synopses
synopses_clean = []
for text in synopses:
    text = BeautifulSoup(text, 'html.parser').getText()
    # convert the HTML-formatted text to plain (unicode) text
    synopses_clean.append(text)

synopses = synopses_clean
In [4]:
titles[:5]  # view the titles of the first 5 films
Out[4]:
['The Godfather',
 'The Shawshank Redemption',
 "Schindler's List",
 'Raging Bull',
 'Casablanca']
In [5]:
synopses[0][:200]  # view the first 200 characters of the first synopsis
Out[5]:
u" Plot  [edit]  [  [  edit  edit  ]  ]  \n  On the day of his only daughter's wedding, Vito Corleone hears requests in his role as the Godfather, the Don of a New York crime family. Vito's youngest son,"
In [6]:
# load the genre data
genres = open('genres_list.txt').read().split('\n')
genres = genres[:100]
In [8]:
# overview of all the data
print(str(len(titles)) + ' titles')
print(str(len(synopses)) + ' synopses')
print(str(len(genres)) + ' genres')
100 titles
100 synopses
100 genres
In [9]:
# generate a rank index for each film
ranks = []
for i in range(0,len(titles)):
    ranks.append(i)

Cleaning the data with nltk

In [10]:
# load nltk's English stopword list
stopwords = nltk.corpus.stopwords.words('english')
In [11]:
# import the SnowballStemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
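
As a quick, illustrative check (not one of the original cells), you can feed a few words straight to the stemmer; inflected forms such as "running" and "runs" collapse to the same stem:

# illustrative only: see what the Snowball stemmer does to a few words
for w in ['running', 'runs', 'families', 'family']:
    print(w, '->', stemmer.stem(w))
# typically prints the stems: run, run, famili, famili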

Two functions are defined below:

  • tokenize_and_stem: tokenizes the text and stems each token. Stemming reduces inflected forms of a word to a common stem; for example, "running" and "runs" both become "run".
  • tokenize_only: tokenizes only.
In [12]:
def tokenize_and_stem(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # keep only tokens that contain letters, dropping numbers, punctuation, etc.
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
        

Build a DataFrame from both the stemmed and the unstemmed tokens; this makes the later analysis more precise. Although we are clustering documents, the basic unit is still the word, and only if the words are handled accurately will the clusters be accurate.

In [13]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
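
With vocab_frame in place you can map a stem back to the surface forms it came from, which helps when reading the clustering output later. A minimal sketch, assuming the stem 'famili' occurs somewhere in the synopses:

# illustrative only: which original words were collapsed into the stem 'famili'?
print(vocab_frame.loc['famili'].head())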

TF-IDF and document similarity

In this section the raw documents are mapped into a term vector space with TF-IDF weighting, and document similarities (or distances) are computed.

In [14]:
# vectorize the documents with the TF-IDF model from scikit-learn's text feature-extraction module.
# max_df=0.8 and min_df=0.2 filter out terms that appear in more than 80% or fewer than 20% of the documents.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                  min_df=0.2, stop_words='english',
                                  use_idf=True, tokenizer=tokenize_and_stem,
                                  ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)
In [15]:
terms = tfidf_vectorizer.get_feature_names()  # get the term (feature) names
In [16]:
terms[:5]
Out[16]:
[u'accept', u'agre', u'allow', u'alon', u'american']
In [17]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
In [18]:
dist[1,9]  # distance between two particular documents
Out[18]:
0.82470069219035791
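
The dist matrix can also be queried directly for nearest neighbours. A small sketch (not part of the original notebook) that prints, for each of the first five films, the film with the smallest cosine distance to it:

# illustrative only: nearest neighbour of each of the first 5 films
for i in range(5):
    row = dist[i].copy()
    row[i] = np.inf  # ignore the film itself
    j = row.argmin()
    print(titles[i], '->', titles[j], round(dist[i, j], 3))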

K-means clustering

K-means is run directly on the TF-IDF vector space.

In [19]:
# cluster with scikit-learn's KMeans model, with the number of clusters set to 5
from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters)

km.fit(tfidf_matrix)

clusters = km.labels_.tolist()
In [20]:
clusters[:10]  # cluster labels of the first 10 documents
Out[20]:
[4, 0, 3, 1, 0, 3, 3, 1, 1, 3]
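
The labels alone do not say what each cluster is about. One common follow-up, sketched here rather than taken from the original cells, is to sort the TF-IDF features by each cluster centroid and print the top stems:

# illustrative only: terms closest to each cluster centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for c in range(num_clusters):
    top = [terms[ind] for ind in order_centroids[c, :6]]
    print('cluster %d: %s' % (c, ', '.join(top)))
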
In [21]:
# build a DataFrame of the results
import pandas as pd

films = {'title':titles, 'rank':ranks, 'synopses':synopses, 'cluster':clusters, 'genre':genres}

frame = pd.DataFrame(films, index=[clusters], columns=['rank', 'title', 'cluster', 'genre'])
In [22]:
frame['rank'] += 1 
In [23]:
frame.to_excel('cluster.xlsx')  # write the results to a file
In [24]:
frame  # clustering results
Out[24]:
rank title cluster genre
4 1 The Godfather 4 [u' Crime', u' Drama']
0 2 The Shawshank Redemption 0 [u' Crime', u' Drama']
3 3 Schindler's List 3 [u' Biography', u' Drama', u' History']
1 4 Raging Bull 1 [u' Biography', u' Drama', u' Sport']
0 5 Casablanca 0 [u' Drama', u' Romance', u' War']
3 6 One Flew Over the Cuckoo's Nest 3 [u' Drama']
3 7 Gone with the Wind 3 [u' Drama', u' Romance', u' War']
1 8 Citizen Kane 1 [u' Drama', u' Mystery']
1 9 The Wizard of Oz 1 [u' Adventure', u' Family', u' Fantasy', u' Mu...
3 10 Titanic 3 [u' Drama', u' Romance']
2 11 Lawrence of Arabia 2 [u' Adventure', u' Biography', u' Drama', u' H...
4 12 The Godfather: Part II 4 [u' Crime', u' Drama']
0 13 Psycho 0 [u' Horror', u' Mystery', u' Thriller']
3 14 Sunset Blvd. 3 [u' Drama', u' Film-Noir']
1 15 Vertigo 1 [u' Mystery', u' Romance', u' Thriller']
0 16 On the Waterfront 0 [u' Crime', u' Drama']
1 17 Forrest Gump 1 [u' Drama', u' Romance']
1 18 The Sound of Music 1 [u' Biography', u' Drama', u' Family', u' Musi...
0 19 West Side Story 0 [u' Crime', u' Drama', u' Musical', u' Romance...
2 20 Star Wars 2 [u' Action', u' Adventure', u' Fantasy', u' Sc...
0 21 E.T. the Extra-Terrestrial 0 [u' Adventure', u' Family', u' Sci-Fi']
0 22 2001: A Space Odyssey 0 [u' Mystery', u' Sci-Fi']
0 23 The Silence of the Lambs 0 [u' Crime', u' Drama', u' Thriller']
0 24 Chinatown 0 [u' Drama', u' Mystery', u' Thriller']
2 25 The Bridge on the River Kwai 2 [u' Adventure', u' Drama', u' War']
3 26 Singin' in the Rain 3 [u' Comedy', u' Musical', u' Romance']
3 27 It's a Wonderful Life 3 [u' Drama', u' Family', u' Fantasy']
3 28 Some Like It Hot 3 [u' Comedy']
1 29 12 Angry Men 1 [u' Drama']
2 30 Dr. Strangelove or: How I Learned to Stop Worr... 2 [u' Comedy', u' War']
... ... ... ... ...
4 71 Rain Man 4 [u' Drama']
1 72 Annie Hall 1 [u' Comedy', u' Drama', u' Romance']
1 73 Out of Africa 1 [u' Biography', u' Drama', u' Romance']
3 74 Good Will Hunting 3 [u' Drama']
1 75 Terms of Endearment 1 [u' Comedy', u' Drama']
1 76 Tootsie 1 [u' Comedy', u' Drama', u' Romance']
0 77 Fargo 0 [u' Crime', u' Drama', u' Thriller']
4 78 Giant 4 [u' Drama', u' Romance']
4 79 The Grapes of Wrath 4 [u' Drama']
0 80 Shane 0 [u' Drama', u' Romance', u' Western']
1 81 The Green Mile 1 [u' Crime', u' Drama', u' Fantasy', u' Mystery']
1 82 Close Encounters of the Third Kind 1 [u' Drama', u' Sci-Fi']
1 83 Network 1 [u' Drama']
3 84 Nashville 3 [u' Drama', u' Music']
1 85 The Graduate 1 [u' Comedy', u' Drama', u' Romance']
0 86 American Graffiti 0 [u' Comedy', u' Drama']
0 87 Pulp Fiction 0 [u' Crime', u' Drama', u' Thriller']
2 88 The African Queen 2 [u' Adventure', u' Romance', u' War']
2 89 Stagecoach 2 [u' Adventure', u' Western']
2 90 Mutiny on the Bounty 2 [u' Adventure', u' Drama', u' History']
0 91 The Maltese Falcon 0 [u' Drama', u' Film-Noir', u' Mystery']
3 92 A Clockwork Orange 3 [u' Crime', u' Drama', u' Sci-Fi']
0 93 Taxi Driver 0 [u' Crime', u' Drama']
1 94 Wuthering Heights 1 [u' Drama', u' Romance']
0 95 Double Indemnity 0 [u' Crime', u' Drama', u' Film-Noir', u' Thril...
0 96 Rebel Without a Cause 0 [u' Drama']
3 97 Rear Window 3 [u' Mystery', u' Thriller']
0 98 The Third Man 0 [u' Film-Noir', u' Mystery', u' Thriller']
0 99 North by Northwest 0 [u' Mystery', u' Thriller']
3 100 Yankee Doodle Dandy 3 [u' Biography', u' Drama', u' Musical']

100 rows × 4 columns
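
A quick check of how the 100 films spread across the five clusters, using the frame built above (a small illustrative sketch):

# illustrative only: number of films per cluster
print(frame['cluster'].value_counts())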

Hierarchical clustering

In [25]:
from scipy.cluster.hierarchy import ward, dendrogram
import matplotlib.pyplot as plt
%matplotlib inline

linkage_matrix = ward(dist)  # build the linkage matrix with Ward's method

fig, ax = plt.subplots(figsize=(15,20))
ax = dendrogram(linkage_matrix, orientation='right', labels=titles)

plt.tick_params(axis= 'x',          
                which='both',      
                bottom='off',      
                top='off',         
                labelbottom='off')

plt.tight_layout()  # tight layout

# save the figure
plt.savefig('ward_clusters.png', dpi=200) 
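
The dendrogram is mainly for visual inspection; if you also want flat cluster labels from the same Ward tree, scipy's fcluster can cut it into a fixed number of groups. A sketch (not in the original notebook), cutting into 5 clusters to match the K-means setup:

# illustrative only: cut the Ward tree into 5 flat clusters
from scipy.cluster.hierarchy import fcluster
ward_labels = fcluster(linkage_matrix, t=5, criterion='maxclust')
print(ward_labels[:10])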

LDA topic modeling

LDA topic modeling does not rely on the TF-IDF model, so the data has to be preprocessed again.

In [26]:
# tokenization helper that keeps only lower-case words (drops proper nouns)
import string
def strip_proppers(text):
    # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word.islower()]
    return "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
In [27]:
# prepare the texts
from gensim import corpora, models, similarities

#remove proper names
preprocess = [strip_proppers(doc) for doc in synopses]

tokenized_text = [tokenize_and_stem(text) for text in preprocess]

texts = [[word for word in text if word not in stopwords] for text in tokenized_text]
In [31]:
# prepare the dictionary and the corpus
dictionary = corpora.Dictionary(texts)  # build the dictionary
dictionary.filter_extremes(no_below=1, no_above=0.8)  # drop terms appearing in more than 80% of the documents

corpus = [dictionary.doc2bow(text) for text in texts]  # build the bag-of-words corpus
In [32]:
len(corpus)
Out[32]:
100
In [33]:
# train an LDA model
%time lda = models.LdaModel(corpus, num_topics=5, id2word=dictionary, update_every=5, chunksize=10000, passes=100)
Wall time: 2min 55s
In [34]:
print(lda[corpus[0]])  # print the topic-modeling result for the first document
[(3, 0.99856435945125999)]
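
The same lookup can be run over every document to assign each film a dominant topic. A minimal sketch (the variable names are illustrative, not from the original notebook):

# illustrative only: dominant topic for each film
dominant = []
for bow in corpus:
    doc_topics = lda[bow]                       # list of (topic_id, probability) pairs
    best = max(doc_topics, key=lambda x: x[1])  # highest-probability topic
    dominant.append(best[0])
print(dominant[:10])
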
In [35]:
topics = lda.print_topics(5, num_words=20)  # the topics found by the model
In [36]:
topics  # the topics
Out[36]:
[u'0.006*kill + 0.006*take + 0.005*return + 0.005*back + 0.005*shark + 0.004*home + 0.004*befor + 0.004*polic + 0.004*arriv + 0.004*tri + 0.004*find + 0.004*famili + 0.004*discov + 0.003*order + 0.003*one + 0.003*father + 0.003*attack + 0.003*make + 0.003*command + 0.003*learn',
 u'0.007*kill + 0.005*tell + 0.005*famili + 0.005*man + 0.005*leav + 0.005*find + 0.004*friend + 0.004*return + 0.004*life + 0.004*back + 0.004*order + 0.004*fight + 0.004*take + 0.003*arriv + 0.003*first + 0.003*make + 0.003*prison + 0.003*way + 0.003*meet + 0.003*ask',
 u'0.006*kill + 0.005*find + 0.005*take + 0.005*return + 0.004*man + 0.004*time + 0.004*say + 0.004*tell + 0.004*vote + 0.004*ask + 0.004*home + 0.004*befor + 0.004*friend + 0.004*one + 0.003*make + 0.003*night + 0.003*onli + 0.003*begin + 0.003*love + 0.003*murder',
 u'0.006*leav + 0.005*get + 0.005*tell + 0.005*make + 0.004*home + 0.004*take + 0.004*love + 0.004*tri + 0.004*kill + 0.004*arriv + 0.004*find + 0.004*go + 0.004*back + 0.004*fight + 0.003*day + 0.003*call + 0.003*father + 0.003*see + 0.003*end + 0.003*run',
 u'0.007*leav + 0.006*return + 0.006*take + 0.005*tell + 0.005*find + 0.005*apart + 0.005*film + 0.004*kill + 0.004*night + 0.004*becom + 0.004*tri + 0.004*friend + 0.004*make + 0.004*man + 0.004*get + 0.003*ask + 0.003*see + 0.003*two + 0.003*father + 0.003*discov']
In [37]:
topics_matrix = lda.show_topics(formatted=False, num_words=20)
In [38]:
topics_matrix = np.array(topics_matrix)
In [39]:
topics_matrix  # topic matrix
Out[39]:
array([[[u'0.00593728051897', u'kill'],
        [u'0.0055825257252', u'take'],
        [u'0.00496062793487', u'return'],
        [u'0.00483481306431', u'back'],
        [u'0.00480556317119', u'shark'],
        [u'0.0044004127217', u'home'],
        [u'0.00419618701223', u'befor'],
        [u'0.0041661085672', u'polic'],
        [u'0.00394552565961', u'arriv'],
        [u'0.00388931191125', u'tri'],
        [u'0.00384788046479', u'find'],
        [u'0.00367382009423', u'famili'],
        [u'0.00358706714583', u'discov'],
        [u'0.00349972303922', u'order'],
        [u'0.00346444201506', u'one'],
        [u'0.00345468490397', u'father'],
        [u'0.00332564867337', u'attack'],
        [u'0.00332556499353', u'make'],
        [u'0.00315155482136', u'command'],
        [u'0.00311890645064', u'learn']],

       [[u'0.00715837199668', u'kill'],
        [u'0.00535375220484', u'tell'],
        [u'0.00499557447521', u'famili'],
        [u'0.00499304368397', u'man'],
        [u'0.00493849070503', u'leav'],
        [u'0.00493790942029', u'find'],
        [u'0.0044049685848', u'friend'],
        [u'0.00413480100702', u'return'],
        [u'0.00388676137211', u'life'],
        [u'0.00375299605718', u'back'],
        [u'0.0037346082665', u'order'],
        [u'0.00362266336055', u'fight'],
        [u'0.00360034092823', u'take'],
        [u'0.00342864925198', u'arriv'],
        [u'0.00340961506156', u'first'],
        [u'0.00337238361536', u'make'],
        [u'0.00331432820091', u'prison'],
        [u'0.00330795658758', u'way'],
        [u'0.00322158691844', u'meet'],
        [u'0.0032070297038', u'ask']],

       [[u'0.00582119174495', u'kill'],
        [u'0.00487970771401', u'find'],
        [u'0.00469944393744', u'take'],
        [u'0.00455596223554', u'return'],
        [u'0.00431860545204', u'man'],
        [u'0.00423258411918', u'time'],
        [u'0.00405579924199', u'say'],
        [u'0.00392344772473', u'tell'],
        [u'0.00380786413858', u'vote'],
        [u'0.00378255335825', u'ask'],
        [u'0.00365104324901', u'home'],
        [u'0.00362688414882', u'befor'],
        [u'0.00359782953736', u'friend'],
        [u'0.00353371850773', u'one'],
        [u'0.003279171247', u'make'],
        [u'0.00317625654581', u'night'],
        [u'0.00317587040334', u'onli'],
        [u'0.00315461809077', u'begin'],
        [u'0.00314843692965', u'love'],
        [u'0.003077782692', u'murder']],

       [[u'0.00593016542098', u'leav'],
        [u'0.00497057803933', u'get'],
        [u'0.00465146672243', u'tell'],
        [u'0.0046508929421', u'make'],
        [u'0.00449204369148', u'home'],
        [u'0.00449174202671', u'take'],
        [u'0.00433208452955', u'love'],
        [u'0.00433200196021', u'tri'],
        [u'0.00433111570668', u'kill'],
        [u'0.00401206970286', u'arriv'],
        [u'0.00401152264537', u'find'],
        [u'0.00393269577179', u'go'],
        [u'0.00393205581092', u'back'],
        [u'0.00369195142786', u'fight'],
        [u'0.00337326997872', u'day'],
        [u'0.00337307669527', u'call'],
        [u'0.00337248152121', u'father'],
        [u'0.00321271752906', u'see'],
        [u'0.00321247537524', u'end'],
        [u'0.00305322431739', u'run']],

       [[u'0.00660000285972', u'leav'],
        [u'0.00612034115546', u'return'],
        [u'0.00551177277551', u'take'],
        [u'0.00502148111266', u'tell'],
        [u'0.00476244014681', u'find'],
        [u'0.00452672790397', u'apart'],
        [u'0.0045097138479', u'film'],
        [u'0.00449296522362', u'kill'],
        [u'0.00375846391583', u'night'],
        [u'0.00375102222556', u'becom'],
        [u'0.00372537320405', u'tri'],
        [u'0.00370219814976', u'friend'],
        [u'0.00357074557682', u'make'],
        [u'0.00355615429764', u'man'],
        [u'0.00350753536498', u'get'],
        [u'0.00344019524237', u'ask'],
        [u'0.00341920302192', u'see'],
        [u'0.00338876374863', u'two'],
        [u'0.00336807121248', u'father'],
        [u'0.00335008691386', u'discov']]], 
      dtype='<U32')