This notebook is an expanded version of the topic-extraction example from the scikit-learn documentation.
from __future__ import print_function
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups
n_samples = 8000
n_features = 1000
n_components = 10
n_top_words = 20
def kl_loss(x, y, eps=1e-10):
    # Per-sample cross-entropy between the sparse data x and its dense
    # reconstruction y; eps guards against log(0). This keeps only the
    # -x * log(y) term of the generalized Kullback-Leibler divergence.
    return -(x.toarray() * np.log(y + eps)).sum() / x.shape[0]

def frobenius_loss(x, y):
    # Per-sample squared Frobenius norm of the reconstruction error.
    return np.square(x - y).sum() / x.shape[0]
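As a quick sanity check of the two helpers (on a toy 2×2 matrix, not the newsgroups data): a perfect reconstruction gives zero Frobenius loss, while `kl_loss` stays nonzero because it is only the cross-entropy part of the generalized KL divergence. The helpers are repeated here so the cell runs standalone.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same helpers as above, repeated so this cell is self-contained.
def kl_loss(x, y, eps=1e-10):
    return -(x.toarray() * np.log(y + eps)).sum() / x.shape[0]

def frobenius_loss(x, y):
    return np.square(x - y).sum() / x.shape[0]

x = csr_matrix(np.array([[1.0, 0.0],
                         [0.0, 2.0]]))
y = x.toarray()  # a perfect reconstruction

print(frobenius_loss(x, y))  # 0.0
print(kl_loss(x, y))         # about -0.693 = -(2 * log 2) / 2, not zero
```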
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort()[:-n_top_words - 1:-1] picks the indices of the
        # n_top_words largest weights, in descending order.
        topic_words = " ".join([feature_names[i]
                                for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(topic_words)
        print()
def score_model(model, data):
    if model.beta_loss == 'kullback-leibler':
        loss_function = kl_loss
    elif model.beta_loss == 'frobenius':
        loss_function = frobenius_loss
    else:
        raise ValueError("unsupported beta_loss: %r" % model.beta_loss)
    reduced_data = model.transform(data)
    reconstructed_data = model.inverse_transform(reduced_data)
    return loss_function(data, reconstructed_data)
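`score_model` relies on `NMF.inverse_transform`, which simply multiplies the reduced representation W by the learned components H. A minimal sketch on random data (toy shapes, not the newsgroups matrices):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
data = csr_matrix(rng.rand(20, 6))

model = NMF(n_components=3, random_state=0, max_iter=500).fit(data)
W = model.transform(data)  # reduced representation, shape (20, 3)
H = model.components_      # learned components,      shape (3, 6)

# inverse_transform(W) is just the matrix product W @ H.
reconstructed = model.inverse_transform(W)
print(np.allclose(reconstructed, W @ H))  # True
```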
%%time
print("Loading dataset...")
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_train = dataset.data[:n_samples]
data_test = dataset.data[n_samples:]
Loading dataset...
CPU times: user 1.86 s, sys: 69.9 ms, total: 1.93 s
Wall time: 1.97 s
%%time
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf_train = tfidf_vectorizer.fit_transform(data_train)
tfidf_test = tfidf_vectorizer.transform(data_test)
Extracting tf-idf features for NMF...
CPU times: user 2.42 s, sys: 8.18 ms, total: 2.43 s
Wall time: 2.43 s
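Note that the vectorizer is fit on the training split only, so test documents are projected onto the training vocabulary and out-of-vocabulary words are silently dropped. A tiny illustration (hypothetical sentences, not from the corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "the dog sat on the log"]
test_docs = ["the cat chased the dog"]  # "chased" is out of vocabulary

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)  # same columns as X_train

print(X_train.shape[1] == X_test.shape[1])  # True: shared vocabulary
print("chased" in vec.vocabulary_)          # False: unseen word dropped
```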
%%time
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# Note: in scikit-learn >= 1.2 the `alpha` parameter is replaced by
# `alpha_W` and `alpha_H`.
frobenius_nmf = NMF(n_components=n_components, random_state=1,
                    alpha=.1, l1_ratio=.5).fit(tfidf_train)
Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=8000 and n_features=1000...
CPU times: user 1.39 s, sys: 40.4 ms, total: 1.43 s
Wall time: 828 ms
print('train reconstruction error:', score_model(frobenius_nmf, tfidf_train))
print('test reconstruction error:', score_model(frobenius_nmf, tfidf_test))
train reconstruction error: 0.890941957403
test reconstruction error: 0.892431321223
# Note: renamed get_feature_names_out() in newer scikit-learn.
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(frobenius_nmf, tfidf_feature_names, n_top_words)
Topic #0: just don people think like know good time right ve make say did way really want going said ll thing
Topic #1: card video monitor drivers cards vga bus driver color ram graphics mode bit board memory pc 16 speed performance controller
Topic #2: god jesus bible christ faith believe christians christian church sin lord does life man hell truth belief say love father
Topic #3: key chip clipper encryption keys government escrow use algorithm public nsa security phone secure law chips des data bit enforcement
Topic #4: new 00 car sale 10 price shipping offer 50 20 15 condition 12 interested 11 used 30 25 sell old
Topic #5: thanks does know mail advance hi info looking help anybody address appreciated email information post interested reply send like need
Topic #6: windows file use dos files program using window problem running run version pc server application screen software ms ftp help
Topic #7: edu soon cs university com internet ftp article pub send email mit david mail address ibm apr reply available export
Topic #8: game team games year season play players hockey win league player nhl teams best played runs better hit think good
Topic #9: drive scsi hard drives disk ide floppy controller mac cd power rom internal mb cable problem tape bus computer format
%%time
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
kl_nmf = NMF(n_components=n_components, random_state=1,
             beta_loss='kullback-leibler', solver='mu', max_iter=1000,
             alpha=.1, l1_ratio=0.9).fit(tfidf_train)
Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=8000 and n_features=1000...
CPU times: user 12.1 s, sys: 380 ms, total: 12.5 s
Wall time: 6.25 s
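The KL loss requires the multiplicative-update solver (`solver='mu'`); the default coordinate-descent solver (`'cd'`) only supports the Frobenius loss and raises an error otherwise. A minimal sketch on random data (toy shapes, not the tf-idf matrix):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(1)
data = rng.rand(30, 8)

# 'mu' supports both beta losses; 'cd' (the default) is Frobenius-only.
kl = NMF(n_components=4, solver='mu', beta_loss='kullback-leibler',
         max_iter=1000, random_state=1).fit(data)
fro = NMF(n_components=4, solver='mu', beta_loss='frobenius',
          max_iter=1000, random_state=1).fit(data)

cd_rejected = False
try:
    NMF(n_components=4, solver='cd',
        beta_loss='kullback-leibler').fit(data)
except ValueError:
    cd_rejected = True
print("solver='cd' rejects KL:", cd_rejected)  # True
```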
print('train reconstruction error:', score_model(kl_nmf, tfidf_train))
print('test reconstruction error:', score_model(kl_nmf, tfidf_test))
train reconstruction error: 18.355714861
test reconstruction error: 18.2931233004
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(kl_nmf, tfidf_feature_names, n_top_words)
Topic #0: time like way right really did years good said make just think don long thing going new say want know
Topic #1: use thanks need used using software work help does card hi drive video pc mac computer problem new like speed
Topic #2: god question does say people believe true read word jesus says point religion bible life christian claim christians mean faith
Topic #3: use government people public make state law used key number fact chip using rights note case legal war keys large
Topic #4: new sale 10 year 20 15 shipping offer 12 50 following 16 1993 11 price years 30 00 condition 25
Topic #5: thanks know mail post does information looking like com send interested email list address info reply net group advance help
Topic #6: windows program file problem using run use version running files like window sun ftp try look available code image server
Topic #7: just edu like don want try ve soon thing think things stuff sure oh case car deleted tell people bike
Topic #8: good just does team ve game ll doesn better sure heard probably really thought got season mean isn play way
Topic #9: think don know people year make win world let second won wouldn did actually mr come drive local hard said