!wget -nc -O wikihowAll.csv https://query.data.world/s/lult233wfonljfadtexn2t5x5rb7is
File ‘wikihowAll.csv’ already there; not retrieving.
!pip install git+https://github.com/lambdaofgod/mlutil
!pip install tqdm
from __future__ import print_function
from time import time
import numpy as np
import pandas as pd
import seaborn as sns
import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from IPython.display import display, Image
import nltk
nltk.download('wordnet')
nltk.download('wordnet_ic')
import mlutil
from mlutil.textmining import get_wordnet_similarity
import pyLDAvis
import pyLDAvis.sklearn
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Package wordnet_ic is already up-to-date!
pyLDAvis.enable_notebook()
def plot_correlations(m):
    # Normalize the Gram matrix by the outer product of row norms,
    # so each entry is the cosine similarity between two rows.
    norms = np.linalg.norm(m, axis=1)
    m_corr = (m @ m.T) / np.outer(norms, norms)
    sns.heatmap(m_corr)
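The row normalization in `plot_correlations` amounts to cosine similarity between rows; a quick check of that normalization on a hypothetical 3×2 matrix:

```python
import numpy as np

# Hypothetical small matrix standing in for e.g. a model's components_
m = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

# Divide the Gram matrix by the outer product of row norms
# so each entry is the cosine similarity between two rows.
norms = np.linalg.norm(m, axis=1)
m_corr = (m @ m.T) / np.outer(norms, norms)

print(np.round(m_corr, 3))
```

The diagonal is all ones (every row has cosine similarity 1 with itself), and orthogonal rows get 0.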
n_features = 5000
n_components = 10
n_top_words = 10
wikihow_df = pd.read_csv('wikihowAll.csv')
print('wikihow size', wikihow_df.shape)
wikihow_df = wikihow_df[~wikihow_df['text'].isna()]
print('valid wikihow size (removed empty text)', wikihow_df.shape)
wikihow size (215365, 3) valid wikihow size (removed empty text) (214294, 3)
data_samples = wikihow_df['text']
n_samples = len(data_samples)
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=5,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=5,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()
Extracting tf-idf features for NMF... done in 92.840s. Extracting tf features for LDA... done in 90.236s.
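To make the tf vs. tf-idf distinction concrete, here is a toy sketch on a hypothetical three-document corpus: raw counts treat every term equally, while tf-idf down-weights terms that occur in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]  # hypothetical mini-corpus

tf_toy = CountVectorizer().fit_transform(docs).toarray()
tfidf_vec = TfidfVectorizer()
tfidf_toy = tfidf_vec.fit_transform(docs).toarray()

vocab = tfidf_vec.vocabulary_  # term -> column index

# In document 0 the raw counts of 'cat' and 'the' are equal,
# but tf-idf gives 'the' (present in all docs) a lower weight.
print(tf_toy[0, vocab['cat']], tf_toy[0, vocab['the']])
print(tfidf_toy[0, vocab['cat']], tfidf_toy[0, vocab['the']])
```

This is why NMF on tf-idf tends to surface more distinctive topic keywords, while LDA's generative model expects raw counts.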
# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
Fitting the NMF model (Frobenius norm) with tf-idf features n_samples=214294 and n_features=5000... done in 193.698s. Topics in NMF model (Frobenius norm):
nmf_keywords_per_topic = mlutil.topic_modeling.top_topic_words(nmf, tfidf_feature_names, 100)
display(nmf_keywords_per_topic.iloc[:,:10])
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---|---|---|---
topic_0 | people | don | person | like | feel | time | make | things | say | know |
topic_1 | add | water | mixture | minutes | oil | heat | stir | bowl | pan | mix |
topic_2 | click | screen | button | select | menu | tap | app | icon | file | page |
topic_3 | hair | shampoo | comb | dry | conditioner | look | skin | scalp | brush | oil |
topic_4 | dog | dogs | vet | pet | puppy | food | training | leash | treat | breed |
topic_5 | skin | doctor | body | help | blood | foods | pain | symptoms | day | exercise |
topic_6 | use | make | water | paper | paint | cut | sure | color | place | glue |
topic_7 | business | information | need | state | company | card | number | credit | money | online |
topic_8 | cat | cats | vet | food | pet | litter | veterinarian | toys | kitten | box |
topic_9 | child | children | kids | parents | parent | baby | school | help | behavior | toddler |
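mlutil's `top_topic_words` is presumably implemented along these lines (a sketch, not the actual mlutil code): each row of a fitted model's `components_` matrix scores every vocabulary term for one topic, and a descending argsort yields the top keywords.

```python
import numpy as np

def top_words(components, feature_names, n_top):
    # For each topic row, take the names of the n_top largest weights.
    return [[feature_names[i] for i in np.argsort(row)[::-1][:n_top]]
            for row in components]

# Hypothetical 2-topic, 4-term weight matrix
components = np.array([[0.1, 0.9, 0.5, 0.0],
                       [0.7, 0.0, 0.2, 0.6]])
feature_names = ['add', 'click', 'screen', 'water']
print(top_words(components, feature_names, 2))
```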
In the following, we use the average pairwise Resnik similarity of the top topic keywords as a coherence measure.
nmf_mean_coherence = mlutil.topic_modeling.get_topic_coherences(nmf_keywords_per_topic)
print('NMF-based topic model mean coherence:', nmf_mean_coherence)
100%|██████████| 10/10 [03:01<00:00, 18.13s/it]
NMF-based topic model mean coherence:
0    1.037505
1    1.003931
2    0.958536
3    1.273350
4    1.448881
5    0.864943
6    1.378219
7    0.715856
8    1.833986
9    1.831830
dtype: float64
# Fit the KL divergence NMF model
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
kl_nmf = NMF(n_components=n_components, random_state=1,
             beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
             l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=214294 and n_features=5000... done in 1760.113s.
kl_nmf_keywords_per_topic = mlutil.topic_modeling.top_topic_words(kl_nmf, tfidf_feature_names, 100)
display(kl_nmf_keywords_per_topic.iloc[:,:10])
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---|---|---|---
topic_0 | time | try | make | like | way | want | don | people | help | just |
topic_1 | water | use | using | remove | sure | make | warm | dry | small | minutes |
topic_2 | click | select | screen | right | open | use | want | type | menu | window |
topic_3 | look | wear | hair | like | try | want | don | just | style | make |
topic_4 | pet | need | dog | sure | possible | prevent | safe | provide | likely | vet |
topic_5 | help | weight | include | reduce | doctor | body | health | treatment | need | increase |
topic_6 | use | need | work | way | sure | make | want | right | start | using |
topic_7 | use | information | online | number | website | need | year | example | provide | work |
topic_8 | stir | minutes | mix | add | mixture | serve | sugar | place | salt | time |
topic_9 | use | make | sure | place | small | want | using | paper | cut | shape |
kl_nmf_mean_coherence = mlutil.topic_modeling.get_topic_coherences(kl_nmf_keywords_per_topic)
print('KL-NMF-based topic model mean coherence:', kl_nmf_mean_coherence)
100%|██████████| 10/10 [02:57<00:00, 19.10s/it]
KL-NMF-based topic model mean coherence:
0    1.218143
1    0.635964
2    0.715969
3    0.829468
4    0.707291
5    0.734174
6    1.303825
7    0.778983
8    1.028046
9    0.870015
dtype: float64
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0,
                                n_jobs=-1)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
Fitting LDA models with tf features, n_samples=214294 and n_features=5000... done in 840.556s. Topics in LDA model:
lda_keywords_per_topic = mlutil.topic_modeling.top_topic_words(lda, tf_feature_names, 100)
display(lda_keywords_per_topic.iloc[:,:10])
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---|---|---|---
topic_0 | food | foods | blood | skin | eat | help | like | doctor | day | meat |
topic_1 | use | make | cut | place | end | need | right | hand | sure | paper |
topic_2 | time | make | work | need | good | like | want | help | ll | people |
topic_3 | help | child | time | feel | try | body | children | exercise | day | sleep |
topic_4 | click | button | screen | select | right | open | computer | use | tap | window |
topic_5 | don | make | people | like | person | want | time | just | know | try |
topic_6 | water | use | add | dry | remove | make | oil | place | minutes | clean |
topic_7 | information | need | business | state | number | file | example | use | court | credit |
topic_8 | paint | look | color | hair | make | use | like | want | colors | wear |
topic_9 | dog | cat | water | make | need | sure | soil | plant | plants | home |
Warning: the LDA results may be somewhat misleading. I don't know whether extracting topic keywords from LDA uses exactly the same mechanism as for NMF; in any case the NMF keywords correspond to tf-idf features, while the LDA keywords correspond to raw tf features, so the two keyword lists are not directly comparable.
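For what it's worth, scikit-learn's `NMF` and `LatentDirichletAllocation` both expose a `components_` matrix of shape (n_topics, n_terms), so the same argsort-based keyword extraction applies to both; the caveat remains that LDA's rows are fit on tf counts rather than tf-idf. A toy check (hypothetical four-document corpus) that LDA's rows normalize to topic-word distributions:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny corpus with two obvious themes
docs = ["cat dog pet vet", "dog pet vet food",
        "click screen button app", "button screen app menu"]
tf_toy = CountVectorizer().fit_transform(docs)

lda_toy = LatentDirichletAllocation(n_components=2, random_state=0).fit(tf_toy)

# components_ holds unnormalized topic-word weights; each row
# normalizes to a probability distribution over the vocabulary.
topic_word = lda_toy.components_ / lda_toy.components_.sum(axis=1, keepdims=True)
print(topic_word.shape)
```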
lda_mean_coherence = mlutil.topic_modeling.get_topic_coherences(lda_keywords_per_topic)
print('LDA-based topic model mean coherence:', lda_mean_coherence)
100%|██████████| 10/10 [02:57<00:00, 18.68s/it]
LDA-based topic model mean coherence:
0    0.963005
1    0.679626
2    1.135926
3    0.927911
4    1.046612
5    1.159201
6    0.965963
7    0.917089
8    1.009249
9    0.886062
dtype: float64
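Collecting the per-topic coherences printed above into one frame makes the comparison explicit (values copied verbatim from the outputs above): plain NMF scores highest on this measure, followed by LDA, with KL-NMF last.

```python
import pandas as pd

# Per-topic coherence values copied from the three outputs above
coherences = pd.DataFrame({
    'nmf':    [1.037505, 1.003931, 0.958536, 1.273350, 1.448881,
               0.864943, 1.378219, 0.715856, 1.833986, 1.831830],
    'kl_nmf': [1.218143, 0.635964, 0.715969, 0.829468, 0.707291,
               0.734174, 1.303825, 0.778983, 1.028046, 0.870015],
    'lda':    [0.963005, 0.679626, 1.135926, 0.927911, 1.046612,
               1.159201, 0.965963, 0.917089, 1.009249, 0.886062],
})
print(coherences.mean().round(3))
```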
%%time
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
CPU times: user 1min 45s, sys: 1.17 s, total: 1min 46s Wall time: 7min 53s