This notebook is an expanded version of the topic-extraction example from the scikit-learn documentation.
from __future__ import print_function
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups
n_samples = 8000
n_features = 1000
n_components = 10
n_top_words = 20
def kl_loss(x, y, eps=1e-10):
    # Per-sample cross-entropy between the sparse data x and its dense
    # reconstruction y; eps guards against log(0). This keeps only the
    # -x * log(y) term of the generalized Kullback-Leibler divergence.
    return -(x.toarray() * np.log(y + eps)).sum() / x.shape[0]

def frobenius_loss(x, y):
    # Per-sample squared Frobenius norm of the reconstruction error.
    return np.square(x - y).sum() / x.shape[0]
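As a quick sanity check of the two helpers (on a toy 2×2 matrix, not the newsgroups data): a perfect reconstruction gives zero Frobenius loss, while `kl_loss` stays nonzero because it is only the cross-entropy part of the generalized KL divergence. The helpers are repeated here so the cell runs standalone.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same helpers as above, repeated so this cell is self-contained.
def kl_loss(x, y, eps=1e-10):
    return -(x.toarray() * np.log(y + eps)).sum() / x.shape[0]

def frobenius_loss(x, y):
    return np.square(x - y).sum() / x.shape[0]

x = csr_matrix(np.array([[1.0, 0.0],
                         [0.0, 2.0]]))
y = x.toarray()  # a perfect reconstruction

print(frobenius_loss(x, y))  # 0.0
print(kl_loss(x, y))         # about -0.693 = -(2 * log 2) / 2, not zero
```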
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort()[:-n_top_words - 1:-1] picks the indices of the
        # n_top_words largest weights, in descending order.
        topic_words = " ".join([feature_names[i]
                                for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(topic_words)
        print()
def score_model(model, data):
    if model.beta_loss == 'kullback-leibler':
        loss_function = kl_loss
    elif model.beta_loss == 'frobenius':
        loss_function = frobenius_loss
    else:
        raise ValueError("unsupported beta_loss: %r" % model.beta_loss)
    reduced_data = model.transform(data)
    reconstructed_data = model.inverse_transform(reduced_data)
    return loss_function(data, reconstructed_data)
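`score_model` relies on `NMF.inverse_transform`, which simply multiplies the reduced representation W by the learned components H. A minimal sketch on random data (toy shapes, not the newsgroups matrices):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
data = csr_matrix(rng.rand(20, 6))

model = NMF(n_components=3, random_state=0, max_iter=500).fit(data)
W = model.transform(data)  # reduced representation, shape (20, 3)
H = model.components_      # learned components,      shape (3, 6)

# inverse_transform(W) is just the matrix product W @ H.
reconstructed = model.inverse_transform(W)
print(np.allclose(reconstructed, W @ H))  # True
```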
%%time
print("Loading dataset...")
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_train = dataset.data[:n_samples]
data_test = dataset.data[n_samples:]
Loading dataset...
CPU times: user 1.86 s, sys: 69.9 ms, total: 1.93 s
Wall time: 1.97 s
%%time
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf_train = tfidf_vectorizer.fit_transform(data_train)
tfidf_test = tfidf_vectorizer.transform(data_test)
Extracting tf-idf features for NMF...
CPU times: user 2.42 s, sys: 8.18 ms, total: 2.43 s
Wall time: 2.43 s
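Note that the vectorizer is fit on the training split only, so test documents are projected onto the training vocabulary and out-of-vocabulary words are silently dropped. A tiny illustration (hypothetical sentences, not from the corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "the dog sat on the log"]
test_docs = ["the cat chased the dog"]  # "chased" is out of vocabulary

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)  # same columns as X_train

print(X_train.shape[1] == X_test.shape[1])  # True: shared vocabulary
print("chased" in vec.vocabulary_)          # False: unseen word dropped
```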
%%time
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# Note: in scikit-learn >= 1.2 the `alpha` parameter is replaced by
# `alpha_W` and `alpha_H`.
frobenius_nmf = NMF(n_components=n_components, random_state=1,
                    alpha=.1, l1_ratio=.5).fit(tfidf_train)
Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=8000 and n_features=1000...
CPU times: user 1.39 s, sys: 40.4 ms, total: 1.43 s
Wall time: 828 ms
print('train reconstruction error:', score_model(frobenius_nmf, tfidf_train))
print('test reconstruction error:', score_model(frobenius_nmf, tfidf_test))
train reconstruction error: 0.890941957403
test reconstruction error: 0.892431321223
# Note: renamed get_feature_names_out() in newer scikit-learn.
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(frobenius_nmf, tfidf_feature_names, n_top_words)
Topic #0: just don people think like know good time right ve make say did way really want going said ll thing
Topic #1: card video monitor drivers cards vga bus driver color ram graphics mode bit board memory pc 16 speed performance controller
Topic #2: god jesus bible christ faith believe christians christian church sin lord does life man hell truth belief say love father
Topic #3: key chip clipper encryption keys government escrow use algorithm public nsa security phone secure law chips des data bit enforcement
Topic #4: new 00 car sale 10 price shipping offer 50 20 15 condition 12 interested 11 used 30 25 sell old
Topic #5: thanks does know mail advance hi info looking help anybody address appreciated email information post interested reply send like need
Topic #6: windows file use dos files program using window problem running run version pc server application screen software ms ftp help
Topic #7: edu soon cs university com internet ftp article pub send email mit david mail address ibm apr reply available export
Topic #8: game team games year season play players hockey win league player nhl teams best played runs better hit think good
Topic #9: drive scsi hard drives disk ide floppy controller mac cd power rom internal mb cable problem tape bus computer format
%%time
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
kl_nmf = NMF(n_components=n_components, random_state=1,
             beta_loss='kullback-leibler', solver='mu', max_iter=1000,
             alpha=.1, l1_ratio=0.9).fit(tfidf_train)
Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=8000 and n_features=1000...
CPU times: user 12.1 s, sys: 380 ms, total: 12.5 s
Wall time: 6.25 s
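The KL loss requires the multiplicative-update solver (`solver='mu'`); the default coordinate-descent solver (`'cd'`) only supports the Frobenius loss and raises an error otherwise. A minimal sketch on random data (toy shapes, not the tf-idf matrix):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(1)
data = rng.rand(30, 8)

# 'mu' supports both beta losses; 'cd' (the default) is Frobenius-only.
kl = NMF(n_components=4, solver='mu', beta_loss='kullback-leibler',
         max_iter=1000, random_state=1).fit(data)
fro = NMF(n_components=4, solver='mu', beta_loss='frobenius',
          max_iter=1000, random_state=1).fit(data)

cd_rejected = False
try:
    NMF(n_components=4, solver='cd',
        beta_loss='kullback-leibler').fit(data)
except ValueError:
    cd_rejected = True
print("solver='cd' rejects KL:", cd_rejected)  # True
```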
print('train reconstruction error:', score_model(kl_nmf, tfidf_train))
print('test reconstruction error:', score_model(kl_nmf, tfidf_test))
train reconstruction error: 18.355714861
test reconstruction error: 18.2931233004
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(kl_nmf, tfidf_feature_names, n_top_words)
Topic #0: time like way right really did years good said make just think don long thing going new say want know
Topic #1: use thanks need used using software work help does card hi drive video pc mac computer problem new like speed
Topic #2: god question does say people believe true read word jesus says point religion bible life christian claim christians mean faith
Topic #3: use government people public make state law used key number fact chip using rights note case legal war keys large
Topic #4: new sale 10 year 20 15 shipping offer 12 50 following 16 1993 11 price years 30 00 condition 25
Topic #5: thanks know mail post does information looking like com send interested email list address info reply net group advance help
Topic #6: windows program file problem using run use version running files like window sun ftp try look available code image server
Topic #7: just edu like don want try ve soon thing think things stuff sure oh case car deleted tell people bike
Topic #8: good just does team ve game ll doesn better sure heard probably really thought got season mean isn play way
Topic #9: think don know people year make win world let second won wouldn did actually mr come drive local hard said