Document clustering and text classification are fundamental tasks in text mining. This article focuses on unsupervised clustering algorithms: K-means clustering, hierarchical clustering, and LDA topic modeling.
Text processing in Python mainly relies on the following modules: nltk, pandas, sklearn, gensim, and so on.
For anyone looking to mine text with Python, this article should serve as a useful reference.
The task is to cluster movies by their plot synopses. The data can be downloaded here and consists of three parts: title, synopses, and genres.
# Import the required modules
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup
from sklearn import feature_extraction
# Load the data: the movie title list and the synopses (genres are loaded below); take only the first 100 movies
titles = open('title_list.txt').read().split('\n')
# make sure only the first 100 entries are kept
titles = titles[:100]
synopses = open('synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses = synopses[:100]
# Clean the movie synopses
synopses_clean = []
for text in synopses:
    # strip HTML tags, leaving plain (unicode) text
    text = BeautifulSoup(text, 'html.parser').getText()
    synopses_clean.append(text)
synopses = synopses_clean
titles[:5]  # view the first 5 movie titles
['The Godfather', 'The Shawshank Redemption', "Schindler's List", 'Raging Bull', 'Casablanca']
synopses[0][:200]  # view the first 200 characters of the first synopsis
u" Plot [edit] [ [ edit edit ] ] \n On the day of his only daughter's wedding, Vito Corleone hears requests in his role as the Godfather, the Don of a New York crime family. Vito's youngest son,"
# Load the genre data
genres = open('genres_list.txt').read().split('\n')
genres = genres[:100]
# Overview of all the data
print(str(len(titles)) + ' titles')
print(str(len(synopses)) + ' synopses')
print(str(len(genres)) + ' genres')
100 titles
100 synopses
100 genres
# Generate a rank index for each movie
ranks = []
for i in range(0, len(titles)):
    ranks.append(i)
Cleaning the data with nltk
# Load nltk's English stopword list
stopwords = nltk.corpus.stopwords.words('english')
# Import the SnowballStemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
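A quick sanity check of both resources (outputs are illustrative; the exact stopword list depends on the NLTK version):
print(stopwords[:10])
# e.g. ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']
print(stemmer.stem('running'), stemmer.stem('runs'), stemmer.stem('ran'))
# run run ran -- the stemmer is crude: irregular forms like 'ran' pass through unchanged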
Below, two tokenizers are defined: tokenize_and_stem, which tokenizes and stems, and tokenize_only, which only tokenizes:
def tokenize_and_stem(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out tokens without letters (e.g., numbers and punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems
def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # same letter filter as above, but without stemming
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
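The difference between the two functions is easy to see on a toy sentence (outputs shown as comments; exact tokenization may vary slightly across NLTK versions):
sample = "The dogs were running quickly through the fields."
print(tokenize_and_stem(sample))
# ['the', 'dog', 'were', 'run', 'quick', 'through', 'the', 'field']
print(tokenize_only(sample))
# ['the', 'dogs', 'were', 'running', 'quickly', 'through', 'the', 'fields']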
We use both the stemmed and the unstemmed results to build a DataFrame, which makes the later analysis more precise. Although the task is document clustering, the smallest unit is still the word: accurate words make for accurate clusters.
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
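vocab_frame acts as a lookup table from each stem to the surface forms that produced it; for example, assuming the stem 'run' occurs in the corpus:
# all distinct surface forms that were collapsed into the stem 'run'
vocab_frame.loc['run'].drop_duplicates()
# e.g. words such as 'run', 'running', 'runs'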
This section maps the raw documents into a term vector space, builds the TF-IDF representation, and computes document similarities (distances).
# Vectorize the documents with the TF-IDF vectorizer from scikit-learn's text feature extraction module.
# max_df=0.8 and min_df=0.2 filter out terms whose document frequency is above 80% or below 20%, respectively.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)
terms = tfidf_vectorizer.get_feature_names()  # get the term (feature) names
terms[:5]
[u'accept', u'agre', u'allow', u'alon', u'american']
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
dist[1, 9]  # distance between two documents
0.82470069219035791
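dist is a symmetric 100x100 matrix with a (near-)zero diagonal, so it can already answer simple questions, such as which synopsis is closest to a given film. A small hypothetical helper:
def nearest_film(i):
    # position 0 of the argsort is the film itself (distance ~0), so take position 1
    j = np.argsort(dist[i])[1]
    return titles[j], dist[i, j]

print(nearest_film(0))  # the film whose synopsis is most similar to titles[0]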
Clustering on the TF-IDF vector space and document distances
# Cluster with scikit-learn's KMeans model into 5 clusters
from sklearn.cluster import KMeans
num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
clusters[:10]  # cluster labels of the first 10 documents
[4, 0, 3, 1, 0, 3, 3, 1, 1, 3]
# Build a DataFrame of the clustering results
films = {'title': titles, 'rank': ranks, 'synopses': synopses, 'cluster': clusters, 'genre': genres}
frame = pd.DataFrame(films, index=clusters, columns=['rank', 'title', 'cluster', 'genre'])
frame['rank'] += 1
frame.to_excel('cluster.xlsx')  # write the results to a file
frame  # the clustering results
(cluster index) | rank | title | cluster | genre |
---|---|---|---|---|
4 | 1 | The Godfather | 4 | [u' Crime', u' Drama'] |
0 | 2 | The Shawshank Redemption | 0 | [u' Crime', u' Drama'] |
3 | 3 | Schindler's List | 3 | [u' Biography', u' Drama', u' History'] |
1 | 4 | Raging Bull | 1 | [u' Biography', u' Drama', u' Sport'] |
0 | 5 | Casablanca | 0 | [u' Drama', u' Romance', u' War'] |
3 | 6 | One Flew Over the Cuckoo's Nest | 3 | [u' Drama'] |
3 | 7 | Gone with the Wind | 3 | [u' Drama', u' Romance', u' War'] |
1 | 8 | Citizen Kane | 1 | [u' Drama', u' Mystery'] |
1 | 9 | The Wizard of Oz | 1 | [u' Adventure', u' Family', u' Fantasy', u' Mu... |
3 | 10 | Titanic | 3 | [u' Drama', u' Romance'] |
2 | 11 | Lawrence of Arabia | 2 | [u' Adventure', u' Biography', u' Drama', u' H... |
4 | 12 | The Godfather: Part II | 4 | [u' Crime', u' Drama'] |
0 | 13 | Psycho | 0 | [u' Horror', u' Mystery', u' Thriller'] |
3 | 14 | Sunset Blvd. | 3 | [u' Drama', u' Film-Noir'] |
1 | 15 | Vertigo | 1 | [u' Mystery', u' Romance', u' Thriller'] |
0 | 16 | On the Waterfront | 0 | [u' Crime', u' Drama'] |
1 | 17 | Forrest Gump | 1 | [u' Drama', u' Romance'] |
1 | 18 | The Sound of Music | 1 | [u' Biography', u' Drama', u' Family', u' Musi... |
0 | 19 | West Side Story | 0 | [u' Crime', u' Drama', u' Musical', u' Romance... |
2 | 20 | Star Wars | 2 | [u' Action', u' Adventure', u' Fantasy', u' Sc... |
0 | 21 | E.T. the Extra-Terrestrial | 0 | [u' Adventure', u' Family', u' Sci-Fi'] |
0 | 22 | 2001: A Space Odyssey | 0 | [u' Mystery', u' Sci-Fi'] |
0 | 23 | The Silence of the Lambs | 0 | [u' Crime', u' Drama', u' Thriller'] |
0 | 24 | Chinatown | 0 | [u' Drama', u' Mystery', u' Thriller'] |
2 | 25 | The Bridge on the River Kwai | 2 | [u' Adventure', u' Drama', u' War'] |
3 | 26 | Singin' in the Rain | 3 | [u' Comedy', u' Musical', u' Romance'] |
3 | 27 | It's a Wonderful Life | 3 | [u' Drama', u' Family', u' Fantasy'] |
3 | 28 | Some Like It Hot | 3 | [u' Comedy'] |
1 | 29 | 12 Angry Men | 1 | [u' Drama'] |
2 | 30 | Dr. Strangelove or: How I Learned to Stop Worr... | 2 | [u' Comedy', u' War'] |
... | ... | ... | ... | ... |
4 | 71 | Rain Man | 4 | [u' Drama'] |
1 | 72 | Annie Hall | 1 | [u' Comedy', u' Drama', u' Romance'] |
1 | 73 | Out of Africa | 1 | [u' Biography', u' Drama', u' Romance'] |
3 | 74 | Good Will Hunting | 3 | [u' Drama'] |
1 | 75 | Terms of Endearment | 1 | [u' Comedy', u' Drama'] |
1 | 76 | Tootsie | 1 | [u' Comedy', u' Drama', u' Romance'] |
0 | 77 | Fargo | 0 | [u' Crime', u' Drama', u' Thriller'] |
4 | 78 | Giant | 4 | [u' Drama', u' Romance'] |
4 | 79 | The Grapes of Wrath | 4 | [u' Drama'] |
0 | 80 | Shane | 0 | [u' Drama', u' Romance', u' Western'] |
1 | 81 | The Green Mile | 1 | [u' Crime', u' Drama', u' Fantasy', u' Mystery'] |
1 | 82 | Close Encounters of the Third Kind | 1 | [u' Drama', u' Sci-Fi'] |
1 | 83 | Network | 1 | [u' Drama'] |
3 | 84 | Nashville | 3 | [u' Drama', u' Music'] |
1 | 85 | The Graduate | 1 | [u' Comedy', u' Drama', u' Romance'] |
0 | 86 | American Graffiti | 0 | [u' Comedy', u' Drama'] |
0 | 87 | Pulp Fiction | 0 | [u' Crime', u' Drama', u' Thriller'] |
2 | 88 | The African Queen | 2 | [u' Adventure', u' Romance', u' War'] |
2 | 89 | Stagecoach | 2 | [u' Adventure', u' Western'] |
2 | 90 | Mutiny on the Bounty | 2 | [u' Adventure', u' Drama', u' History'] |
0 | 91 | The Maltese Falcon | 0 | [u' Drama', u' Film-Noir', u' Mystery'] |
3 | 92 | A Clockwork Orange | 3 | [u' Crime', u' Drama', u' Sci-Fi'] |
0 | 93 | Taxi Driver | 0 | [u' Crime', u' Drama'] |
1 | 94 | Wuthering Heights | 1 | [u' Drama', u' Romance'] |
0 | 95 | Double Indemnity | 0 | [u' Crime', u' Drama', u' Film-Noir', u' Thril... |
0 | 96 | Rebel Without a Cause | 0 | [u' Drama'] |
3 | 97 | Rear Window | 3 | [u' Mystery', u' Thriller'] |
0 | 98 | The Third Man | 0 | [u' Film-Noir', u' Mystery', u' Thriller'] |
0 | 99 | North by Northwest | 0 | [u' Mystery', u' Thriller'] |
3 | 100 | Yankee Doodle Dandy | 3 | [u' Biography', u' Drama', u' Musical'] |
100 rows × 4 columns
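To interpret the K-means clusters, a common follow-up (not shown in the original output) is to list the terms closest to each cluster centroid; note these terms are stems, since tokenize_and_stem was used as the tokenizer:
# indices of the vocabulary terms nearest each centroid, in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    top_terms = [terms[ind] for ind in order_centroids[i, :6]]
    print('Cluster %d: %s' % (i, ', '.join(top_terms)))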
Hierarchical clustering: we build a dendrogram from the same cosine distance matrix, using Ward's method.
from scipy.cluster.hierarchy import ward, dendrogram
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt
%matplotlib inline
# build the linkage matrix with Ward's method; ward() expects a condensed
# distance matrix, so convert the square matrix first
linkage_matrix = ward(squareform(dist, checks=False))
fig, ax = plt.subplots(figsize=(15,20))
ax = dendrogram(linkage_matrix, orientation='right', labels=titles)
plt.tick_params(axis='x',        # apply changes to the x-axis
                which='both',    # affect both major and minor ticks
                bottom='off',
                top='off',
                labelbottom='off')
plt.tight_layout()  # tighten the layout
# save the figure
plt.savefig('ward_clusters.png', dpi=200)
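Besides reading clusters off the dendrogram visually, scipy's fcluster can cut the tree into a flat partition; the distance threshold of 9 below is an arbitrary value chosen to illustrate the call, not a tuned one:
from scipy.cluster.hierarchy import fcluster

# cut the dendrogram at an (arbitrary) cophenetic distance of 9
flat_clusters = fcluster(linkage_matrix, t=9, criterion='distance')
print(len(set(flat_clusters)), 'clusters at this cut')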
LDA topic modeling does not rely on the TF-IDF model, so the data has to be preprocessed again.
# Tokenization
import string
def strip_proppers(text):
    # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token;
    # keeping only lowercase words drops capitalized proper nouns
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word.islower()]
    return "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
# Text preparation
from gensim import corpora, models, similarities
# remove proper names with strip_proppers
preprocess = [strip_proppers(doc) for doc in synopses]
tokenized_text = [tokenize_and_stem(text) for text in preprocess]
texts = [[word for word in text if word not in stopwords] for text in tokenized_text]
# Prepare the dictionary and the corpus
dictionary = corpora.Dictionary(texts)  # build the dictionary
dictionary.filter_extremes(no_below=1, no_above=0.8)  # drop terms that appear in more than 80% of documents
corpus = [dictionary.doc2bow(text) for text in texts]  # build the corpus
len(corpus)
100
# Train an LDA model
%time lda = models.LdaModel(corpus, num_topics=5, id2word=dictionary, update_every=5, chunksize=10000, passes=100)
Wall time: 2min 55s
print(lda[corpus[0]])  # print the topic distribution of the first document
[(3, 0.99856435945125999)]
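lda[doc] returns the same kind of (topic, probability) list for any bag-of-words vector, so each film can be assigned its highest-probability topic; a minimal sketch:
# dominant topic (highest probability) for every film
dominant = [max(lda[doc], key=lambda pair: pair[1])[0] for doc in corpus]
for title, topic in zip(titles[:5], dominant[:5]):
    print(title, '-> topic', topic)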
topics = lda.print_topics(5, num_words=20)  # the top 20 words of each learned topic
topics  # the topics
[u'0.006*kill + 0.006*take + 0.005*return + 0.005*back + 0.005*shark + 0.004*home + 0.004*befor + 0.004*polic + 0.004*arriv + 0.004*tri + 0.004*find + 0.004*famili + 0.004*discov + 0.003*order + 0.003*one + 0.003*father + 0.003*attack + 0.003*make + 0.003*command + 0.003*learn',
 u'0.007*kill + 0.005*tell + 0.005*famili + 0.005*man + 0.005*leav + 0.005*find + 0.004*friend + 0.004*return + 0.004*life + 0.004*back + 0.004*order + 0.004*fight + 0.004*take + 0.003*arriv + 0.003*first + 0.003*make + 0.003*prison + 0.003*way + 0.003*meet + 0.003*ask',
 u'0.006*kill + 0.005*find + 0.005*take + 0.005*return + 0.004*man + 0.004*time + 0.004*say + 0.004*tell + 0.004*vote + 0.004*ask + 0.004*home + 0.004*befor + 0.004*friend + 0.004*one + 0.003*make + 0.003*night + 0.003*onli + 0.003*begin + 0.003*love + 0.003*murder',
 u'0.006*leav + 0.005*get + 0.005*tell + 0.005*make + 0.004*home + 0.004*take + 0.004*love + 0.004*tri + 0.004*kill + 0.004*arriv + 0.004*find + 0.004*go + 0.004*back + 0.004*fight + 0.003*day + 0.003*call + 0.003*father + 0.003*see + 0.003*end + 0.003*run',
 u'0.007*leav + 0.006*return + 0.006*take + 0.005*tell + 0.005*find + 0.005*apart + 0.005*film + 0.004*kill + 0.004*night + 0.004*becom + 0.004*tri + 0.004*friend + 0.004*make + 0.004*man + 0.004*get + 0.003*ask + 0.003*see + 0.003*two + 0.003*father + 0.003*discov']
topics_matrix = lda.show_topics(formatted=False, num_words=20)
topics_matrix = np.array(topics_matrix)
topics_matrix  # topic matrix: (probability, word) pairs for each topic
array([[[u'0.00593728051897', u'kill'], [u'0.0055825257252', u'take'], [u'0.00496062793487', u'return'], [u'0.00483481306431', u'back'], [u'0.00480556317119', u'shark'], [u'0.0044004127217', u'home'], [u'0.00419618701223', u'befor'], [u'0.0041661085672', u'polic'], [u'0.00394552565961', u'arriv'], [u'0.00388931191125', u'tri'], [u'0.00384788046479', u'find'], [u'0.00367382009423', u'famili'], [u'0.00358706714583', u'discov'], [u'0.00349972303922', u'order'], [u'0.00346444201506', u'one'], [u'0.00345468490397', u'father'], [u'0.00332564867337', u'attack'], [u'0.00332556499353', u'make'], [u'0.00315155482136', u'command'], [u'0.00311890645064', u'learn']], [[u'0.00715837199668', u'kill'], [u'0.00535375220484', u'tell'], [u'0.00499557447521', u'famili'], [u'0.00499304368397', u'man'], [u'0.00493849070503', u'leav'], [u'0.00493790942029', u'find'], [u'0.0044049685848', u'friend'], [u'0.00413480100702', u'return'], [u'0.00388676137211', u'life'], [u'0.00375299605718', u'back'], [u'0.0037346082665', u'order'], [u'0.00362266336055', u'fight'], [u'0.00360034092823', u'take'], [u'0.00342864925198', u'arriv'], [u'0.00340961506156', u'first'], [u'0.00337238361536', u'make'], [u'0.00331432820091', u'prison'], [u'0.00330795658758', u'way'], [u'0.00322158691844', u'meet'], [u'0.0032070297038', u'ask']], [[u'0.00582119174495', u'kill'], [u'0.00487970771401', u'find'], [u'0.00469944393744', u'take'], [u'0.00455596223554', u'return'], [u'0.00431860545204', u'man'], [u'0.00423258411918', u'time'], [u'0.00405579924199', u'say'], [u'0.00392344772473', u'tell'], [u'0.00380786413858', u'vote'], [u'0.00378255335825', u'ask'], [u'0.00365104324901', u'home'], [u'0.00362688414882', u'befor'], [u'0.00359782953736', u'friend'], [u'0.00353371850773', u'one'], [u'0.003279171247', u'make'], [u'0.00317625654581', u'night'], [u'0.00317587040334', u'onli'], [u'0.00315461809077', u'begin'], [u'0.00314843692965', u'love'], [u'0.003077782692', u'murder']], [[u'0.00593016542098', u'leav'], [u'0.00497057803933', u'get'], [u'0.00465146672243', u'tell'], [u'0.0046508929421', u'make'], [u'0.00449204369148', u'home'], [u'0.00449174202671', u'take'], [u'0.00433208452955', u'love'], [u'0.00433200196021', u'tri'], [u'0.00433111570668', u'kill'], [u'0.00401206970286', u'arriv'], [u'0.00401152264537', u'find'], [u'0.00393269577179', u'go'], [u'0.00393205581092', u'back'], [u'0.00369195142786', u'fight'], [u'0.00337326997872', u'day'], [u'0.00337307669527', u'call'], [u'0.00337248152121', u'father'], [u'0.00321271752906', u'see'], [u'0.00321247537524', u'end'], [u'0.00305322431739', u'run']], [[u'0.00660000285972', u'leav'], [u'0.00612034115546', u'return'], [u'0.00551177277551', u'take'], [u'0.00502148111266', u'tell'], [u'0.00476244014681', u'find'], [u'0.00452672790397', u'apart'], [u'0.0045097138479', u'film'], [u'0.00449296522362', u'kill'], [u'0.00375846391583', u'night'], [u'0.00375102222556', u'becom'], [u'0.00372537320405', u'tri'], [u'0.00370219814976', u'friend'], [u'0.00357074557682', u'make'], [u'0.00355615429764', u'man'], [u'0.00350753536498', u'get'], [u'0.00344019524237', u'ask'], [u'0.00341920302192', u'see'], [u'0.00338876374863', u'two'], [u'0.00336807121248', u'father'], [u'0.00335008691386', u'discov']]], dtype='<U32')
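Because the model was trained on stems, the topic words above are truncated ('befor', 'famili'). The vocab_frame built earlier can map each stem back to a readable surface form; a sketch for the first topic, assuming every stem also occurs in vocab_frame:
# replace each stem in the first topic with its most common surface form
for prob, stem in topics_matrix[0]:
    surface_forms = vocab_frame.loc[[stem], 'words']
    print(stem, '->', surface_forms.value_counts().idxmax())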