This is a proof-of-concept application of Non-negative Matrix Factorization (NMF): we factorize the term frequency matrix of a corpus of documents so as to extract an additive model of the topic structure of the corpus. NMF approximates the (documents x terms) matrix as the product of two non-negative matrices, one mapping documents to topics and the other mapping topics to words.
from sklearn import datasets
dataset = datasets.fetch_20newsgroups(shuffle=True, random_state=1)
print(dataset.target_names[dataset.target[0]])
print(dataset.data[0])
To keep computation times short, restrict the number of documents and the vocabulary size.
n_samples = 1000
n_features = 1000
Restrict the vocabulary to the most common words and use TF-IDF weighting (words occurring in more than 95% of the documents are treated as stop words).
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer(max_df=0.95, max_features=n_features)
counts = vectorizer.fit_transform(dataset.data[:n_samples])
tfidf = text.TfidfTransformer().fit_transform(counts)
tfidf
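As an aside, scikit-learn's TfidfVectorizer combines the counting and weighting steps in one estimator; here is a minimal equivalent sketch (tfidf_vectorizer and tfidf_alt are illustrative names, not part of the original walkthrough):
from sklearn.feature_extraction.text import TfidfVectorizer
# One-step alternative to CountVectorizer + TfidfTransformer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, max_features=n_features)
tfidf_alt = tfidf_vectorizer.fit_transform(dataset.data[:n_samples])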
Convert from the scipy.sparse.csr_matrix representation to a dense numpy array to inspect the values; TF-IDF weights are already non-negative, as NMF requires.
tfidf.toarray()
from sklearn import decomposition
n_topics = 5
nmf = decomposition.NMF(n_components=n_topics).fit(tfidf)
print(nmf)
print(nmf.components_)
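To make the additive structure concrete, the fitted model splits the matrix into per-document topic activations and per-topic word weights whose product approximates the TF-IDF matrix; a minimal sanity check (W, H and reconstruction are illustrative names):
import numpy as np
W = nmf.transform(tfidf)       # per-document topic activations
H = nmf.components_            # per-topic word weights
print(W.shape, H.shape)        # (n_samples, n_topics), (n_topics, n_features)
reconstruction = np.dot(W, H)  # approximates the TF-IDF matrix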
Reuse the vocabulary of the vectorizer to map the matrix column positions back to the corresponding words.
n_top_words = 12
inverse_vocabulary = dict((v, k) for k, v in vectorizer.vocabulary_.items())
for topic_idx, topic in enumerate(nmf.components_):
    top_words = " ".join(inverse_vocabulary[i]
                         for i in topic.argsort()[:-(n_top_words + 1):-1])
    print("Topic #%d: %s" % (topic_idx, top_words))
    print()
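With recent scikit-learn releases the manual vocabulary inversion above can be replaced by the vectorizer's get_feature_names_out method; a sketch assuming such a release is installed:
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    # Same top-words extraction, indexing directly into the feature name array
    top = [feature_names[i] for i in topic.argsort()[:-(n_top_words + 1):-1]]
    print("Topic #%d: %s" % (topic_idx, " ".join(top)))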