!wget -nc -O wikihowAll.csv https://query.data.world/s/lult233wfonljfadtexn2t5x5rb7is
File ‘wikihowAll.csv’ already there; not retrieving.
!pip install git+https://github.com/lambdaofgod/mlutil
!pip install tqdm
from __future__ import print_function
from time import time
import numpy as np
import pandas as pd
import seaborn as sns
import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from IPython.display import display, Image
import nltk
nltk.download('wordnet')
nltk.download('wordnet_ic')
import mlutil
from mlutil.textmining import get_wordnet_similarity
import pyLDAvis
import pyLDAvis.sklearn
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Package wordnet_ic is already up-to-date!
pyLDAvis.enable_notebook()
def plot_correlations(m):
    # Normalize the Gram matrix by the outer product of row norms,
    # so each entry is the cosine similarity between two rows.
    norms = np.linalg.norm(m, axis=1)
    m_corr = (m @ m.T) / np.outer(norms, norms)
    sns.heatmap(m_corr)
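The row normalization in `plot_correlations` amounts to cosine similarity between rows; a quick check of that normalization on a hypothetical 3×2 matrix:

```python
import numpy as np

# Hypothetical small matrix standing in for e.g. a model's components_
m = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

# Divide the Gram matrix by the outer product of row norms
# so each entry is the cosine similarity between two rows.
norms = np.linalg.norm(m, axis=1)
m_corr = (m @ m.T) / np.outer(norms, norms)

print(np.round(m_corr, 3))
```

The diagonal is all ones (every row has cosine similarity 1 with itself), and orthogonal rows get 0.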
n_features = 5000
n_components = 10
n_top_words = 10
wikihow_df = pd.read_csv('wikihowAll.csv')
print('wikihow size', wikihow_df.shape)
wikihow_df = wikihow_df[~wikihow_df['text'].isna()]
print('valid wikihow size (removed empty text)', wikihow_df.shape)
wikihow size (215365, 3) valid wikihow size (removed empty text) (214294, 3)
data_samples = wikihow_df['text']
n_samples = len(data_samples)
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=5,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=5,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()
Extracting tf-idf features for NMF... done in 92.840s. Extracting tf features for LDA... done in 90.236s.
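To make the tf vs. tf-idf distinction concrete, here is a toy sketch on a hypothetical three-document corpus: raw counts treat every term equally, while tf-idf down-weights terms that occur in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]  # hypothetical mini-corpus

tf_toy = CountVectorizer().fit_transform(docs).toarray()
tfidf_vec = TfidfVectorizer()
tfidf_toy = tfidf_vec.fit_transform(docs).toarray()

vocab = tfidf_vec.vocabulary_  # term -> column index

# In document 0 the raw counts of 'cat' and 'the' are equal,
# but tf-idf gives 'the' (present in all docs) a lower weight.
print(tf_toy[0, vocab['cat']], tf_toy[0, vocab['the']])
print(tfidf_toy[0, vocab['cat']], tfidf_toy[0, vocab['the']])
```

This is why NMF on tf-idf tends to surface more distinctive topic keywords, while LDA's generative model expects raw counts.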
# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
Fitting the NMF model (Frobenius norm) with tf-idf features n_samples=214294 and n_features=5000... done in 193.698s. Topics in NMF model (Frobenius norm):
nmf_keywords_per_topic = mlutil.topic_modeling.top_topic_words(nmf, tfidf_feature_names, 100)
display(nmf_keywords_per_topic.iloc[:,:10])
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---|---|---|---
topic_0 | people | don | person | like | feel | time | make | things | say | know |
topic_1 | add | water | mixture | minutes | oil | heat | stir | bowl | pan | mix |
topic_2 | click | screen | button | select | menu | tap | app | icon | file | page |
topic_3 | hair | shampoo | comb | dry | conditioner | look | skin | scalp | brush | oil |
topic_4 | dog | dogs | vet | pet | puppy | food | training | leash | treat | breed |
topic_5 | skin | doctor | body | help | blood | foods | pain | symptoms | day | exercise |
topic_6 | use | make | water | paper | paint | cut | sure | color | place | glue |
topic_7 | business | information | need | state | company | card | number | credit | money | online |
topic_8 | cat | cats | vet | food | pet | litter | veterinarian | toys | kitten | box |
topic_9 | child | children | kids | parents | parent | baby | school | help | behavior | toddler |
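mlutil's `top_topic_words` is presumably implemented along these lines (a sketch, not the actual mlutil code): each row of a fitted model's `components_` matrix scores every vocabulary term for one topic, and a descending argsort yields the top keywords.

```python
import numpy as np

def top_words(components, feature_names, n_top):
    # For each topic row, take the names of the n_top largest weights.
    return [[feature_names[i] for i in np.argsort(row)[::-1][:n_top]]
            for row in components]

# Hypothetical 2-topic, 4-term weight matrix
components = np.array([[0.1, 0.9, 0.5, 0.0],
                       [0.7, 0.0, 0.2, 0.6]])
feature_names = ['add', 'click', 'screen', 'water']
print(top_words(components, feature_names, 2))
```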
In the following, we use the average pairwise Resnik similarity of the top topic keywords as a coherence measure.
nmf_mean_coherence = mlutil.topic_modeling.get_topic_coherences(nmf_keywords_per_topic)
print('NMF-based topic model mean coherence:', nmf_mean_coherence)
100%|██████████| 10/10 [03:01<00:00, 18.13s/it]
NMF-based topic model mean coherence:
0    1.037505
1    1.003931
2    0.958536
3    1.273350
4    1.448881
5    0.864943
6    1.378219
7    0.715856
8    1.833986
9    1.831830
dtype: float64
# Fit the KL divergence NMF model
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
kl_nmf = NMF(n_components=n_components, random_state=1,
             beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
             l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=214294 and n_features=5000... done in 1760.113s.
kl_nmf_keywords_per_topic = mlutil.topic_modeling.top_topic_words(kl_nmf, tfidf_feature_names, 100)
display(kl_nmf_keywords_per_topic.iloc[:,:10])
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---|---|---|---
topic_0 | time | try | make | like | way | want | don | people | help | just |
topic_1 | water | use | using | remove | sure | make | warm | dry | small | minutes |
topic_2 | click | select | screen | right | open | use | want | type | menu | window |
topic_3 | look | wear | hair | like | try | want | don | just | style | make |
topic_4 | pet | need | dog | sure | possible | prevent | safe | provide | likely | vet |
topic_5 | help | weight | include | reduce | doctor | body | health | treatment | need | increase |
topic_6 | use | need | work | way | sure | make | want | right | start | using |
topic_7 | use | information | online | number | website | need | year | example | provide | work |
topic_8 | stir | minutes | mix | add | mixture | serve | sugar | place | salt | time |
topic_9 | use | make | sure | place | small | want | using | paper | cut | shape |
kl_nmf_mean_coherence = mlutil.topic_modeling.get_topic_coherences(kl_nmf_keywords_per_topic)
print('KL-NMF-based topic model mean coherence:', kl_nmf_mean_coherence)
100%|██████████| 10/10 [02:57<00:00, 19.10s/it]
KL-NMF-based topic model mean coherence:
0    1.218143
1    0.635964
2    0.715969
3    0.829468
4    0.707291
5    0.734174
6    1.303825
7    0.778983
8    1.028046
9    0.870015
dtype: float64
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0,
                                n_jobs=-1)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
Fitting LDA models with tf features, n_samples=214294 and n_features=5000... done in 840.556s. Topics in LDA model:
lda_keywords_per_topic = mlutil.topic_modeling.top_topic_words(lda, tf_feature_names, 100)
display(lda_keywords_per_topic.iloc[:,:10])
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---|---|---|---
topic_0 | food | foods | blood | skin | eat | help | like | doctor | day | meat |
topic_1 | use | make | cut | place | end | need | right | hand | sure | paper |
topic_2 | time | make | work | need | good | like | want | help | ll | people |
topic_3 | help | child | time | feel | try | body | children | exercise | day | sleep |
topic_4 | click | button | screen | select | right | open | computer | use | tap | window |
topic_5 | don | make | people | like | person | want | time | just | know | try |
topic_6 | water | use | add | dry | remove | make | oil | place | minutes | clean |
topic_7 | information | need | business | state | number | file | example | use | court | credit |
topic_8 | paint | look | color | hair | make | use | like | want | colors | wear |
topic_9 | dog | cat | water | make | need | sure | soil | plant | plants | home |
Warning: the LDA results may be somewhat misleading. I don't know whether extracting topic keywords from LDA uses exactly the same mechanism as for NMF; in any case the NMF keywords correspond to tf-idf features, while the LDA keywords correspond to raw tf features, so the two keyword lists are not directly comparable.
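For what it's worth, scikit-learn's `NMF` and `LatentDirichletAllocation` both expose a `components_` matrix of shape (n_topics, n_terms), so the same argsort-based keyword extraction applies to both; the caveat remains that LDA's rows are fit on tf counts rather than tf-idf. A toy check (hypothetical four-document corpus) that LDA's rows normalize to topic-word distributions:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny corpus with two obvious themes
docs = ["cat dog pet vet", "dog pet vet food",
        "click screen button app", "button screen app menu"]
tf_toy = CountVectorizer().fit_transform(docs)

lda_toy = LatentDirichletAllocation(n_components=2, random_state=0).fit(tf_toy)

# components_ holds unnormalized topic-word weights; each row
# normalizes to a probability distribution over the vocabulary.
topic_word = lda_toy.components_ / lda_toy.components_.sum(axis=1, keepdims=True)
print(topic_word.shape)
```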
lda_mean_coherence = mlutil.topic_modeling.get_topic_coherences(lda_keywords_per_topic)
print('LDA-based topic model mean coherence:', lda_mean_coherence)
100%|██████████| 10/10 [02:57<00:00, 18.68s/it]
LDA-based topic model mean coherence:
0    0.963005
1    0.679626
2    1.135926
3    0.927911
4    1.046612
5    1.159201
6    0.965963
7    0.917089
8    1.009249
9    0.886062
dtype: float64
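Collecting the per-topic coherences printed above into one frame makes the comparison explicit (values copied verbatim from the outputs above): plain NMF scores highest on this measure, followed by LDA, with KL-NMF last.

```python
import pandas as pd

# Per-topic coherence values copied from the three outputs above
coherences = pd.DataFrame({
    'nmf':    [1.037505, 1.003931, 0.958536, 1.273350, 1.448881,
               0.864943, 1.378219, 0.715856, 1.833986, 1.831830],
    'kl_nmf': [1.218143, 0.635964, 0.715969, 0.829468, 0.707291,
               0.734174, 1.303825, 0.778983, 1.028046, 0.870015],
    'lda':    [0.963005, 0.679626, 1.135926, 0.927911, 1.046612,
               1.159201, 0.965963, 0.917089, 1.009249, 0.886062],
})
print(coherences.mean().round(3))
```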
%%time
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
CPU times: user 1min 45s, sys: 1.17 s, total: 1min 46s Wall time: 7min 53s