This demo walks you through using bnpy from within Python to train a Latent Dirichlet Allocation (LDA) topic model on bag-of-words data. We'll use the full-dataset variational inference algorithm (VB).
We can use the following import statements to load bnpy.
import bnpy
from matplotlib import pylab
%pylab inline

imshowArgs = dict(interpolation='nearest',
                  cmap='bone_r',
                  vmin=0.0,
                  vmax=10./900,
                  )
Populating the interactive namespace from numpy and matplotlib
BarsK10V900

We'll use a simple "toy bars" dataset called BarsK10V900, generated from the LDA topic model.
Our intended task here is to assign exactly one topic to every token; the data was generated this way. Be aware that this is different from assigning a single cluster/topic to the entire document.
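To make the distinction concrete, here is a tiny sketch in plain numpy (purely illustrative, not the bnpy API):

import numpy as np

K = 10              # number of topics
nTokensInDoc = 100  # tokens in one document

# Token-level assignments (the LDA task): each token gets its own
# distribution over the K topics, one row per token.
tokenResp = np.random.dirichlet(np.ones(K), size=nTokensInDoc)  # (100, 10)

# Document-level assignment (a mixture model) would instead pick a
# single topic label for the whole document.
docLabel = np.random.randint(K)  # one integer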
import BarsK10V900
Data = BarsK10V900.get_data(nDocTotal=1000, nWordsPerDoc=100)
Data.name = 'BarsK10V900'
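Before training, it is worth a quick sanity check on the loaded dataset. This sketch assumes the Data object exposes attributes named nDoc and vocab_size, which may differ across bnpy versions:

# Sanity-check the dataset (attribute names are assumptions).
print(Data.nDoc)        # expect 1000 documents
print(Data.vocab_size)  # expect 900 vocabulary words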
First, we can visualize the 10 "true" bar topics. Each one is a distribution over 900 vocabulary "words", which are represented here as individual pixels in a 30x30 image.
bnpy.viz.BarsViz.showTopicsAsSquareImages(BarsK10V900.Defaults['topics'], **imshowArgs);
Next, we can visualize some example documents. Here are 16 documents:
bnpy.viz.BarsViz.plotExampleBarsDocs(Data, nDocToPlot=16, **imshowArgs);
The initial number of topics (aka clusters) is determined by the integer keyword argument K. For finite models like LDA, once this is set, the algorithm will represent exactly this many topics in memory for its entire run. We cannot add more or remove any. Some topics' frequency parameters may be pushed toward zero, but those topics are still represented in the computer.
The string keyword argument initname specifies the initialization procedure for the topic-word parameters. Here are the options:
* kmeansplusplus [Recommended]
Runs kmeans on the empirical document word-count vectors, using the specified number of clusters $K$.
* randexamples [Default]
Selects $K$ documents at random without replacement. Can be bad because it does not encourage the chosen documents to be distinct from one another, so the initial topics may be redundant.
* randomlikewang
Draws each entry $\tau_{kv}$ of the topic-word variational parameters from a Gamma distribution. This is the same procedure used in Wang et al.'s original Python code for HDP topic models; a rough sketch follows the equation below.
$$ \qquad \qquad \tau_{kv} \sim \mbox{Gamma}(100), \quad \forall k,v $$
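Roughly, this initialization looks like the following (a minimal sketch: the text only specifies $\mbox{Gamma}(100)$, so the unit scale used below is an assumption, and this is not bnpy's internal code):

import numpy as np

K, V = 10, 900  # number of topics x vocabulary size

# Draw every topic-word variational parameter independently from a
# Gamma distribution (shape 100; the scale of 1.0 is an assumption).
tau = np.random.gamma(shape=100., scale=1.0, size=(K, V))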
The positive scalar keyword argument alpha controls the sparsity of the Dirichlet prior on the document-topic distributions. Set this value small (like 0.1 or smaller) to encourage sparser document-topic distributions. Setting it large (like 10 or 100) will push every document to use all the topics nearly uniformly, which is usually undesirable; a quick numerical illustration follows.
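Drawing document-topic vectors directly from the Dirichlet prior (plain numpy, independent of bnpy) shows the effect:

import numpy as np

K = 10
for alpha in [0.1, 1.0, 100.0]:
    theta = np.random.dirichlet(alpha * np.ones(K))
    # Small alpha concentrates mass on a few topics; large alpha
    # spreads mass nearly uniformly across all K topics.
    print(alpha, np.round(theta, 2))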
The integer keyword argument nCoordAscentItersLP specifies how many iterations to run the local "E" step algorithm, which alternately updates free variational parameters for (1) every token in the document, and (2) the document-specific topic distribution. Larger values let the algorithm converge further, at the cost of more computation. Setting this to 10 is usually too small, while 200 would likely be too large unless there are many, many topics.
The local "E" step is run at each document, alternating the two steps above until the estimated document-topic counts for the document $[N_{d1} \ldots N_{dk} \ldots N_{dK}]$. We halt the iterations if all values in this vector change by less than convThrLP
.
This value defaults to 0.01, which should be fine for most purposes.
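Here is a minimal sketch of that per-document loop, assuming simple dense numpy inputs. This is illustrative Python, not bnpy's actual implementation, and the helper name local_step_for_doc is hypothetical:

import numpy as np
from scipy.special import digamma

def local_step_for_doc(wordIds, wordCts, ElogBeta, alpha,
                       nIters=25, convThr=0.01):
    # Schematic per-document "E" step for LDA (NOT bnpy's real code).
    # wordIds : (U,) unique word ids appearing in this document
    # wordCts : (U,) count of each unique word
    # ElogBeta: (K, V) expected log topic-word probabilities
    K = ElogBeta.shape[0]
    Nd = np.ones(K)  # running document-topic counts N_d1 ... N_dK
    for _ in range(nIters):
        # (1) Update per-token free parameters (responsibilities).
        logResp = ElogBeta[:, wordIds].T + digamma(alpha + Nd)
        logResp -= logResp.max(axis=1, keepdims=True)
        resp = np.exp(logResp)
        resp /= resp.sum(axis=1, keepdims=True)
        # (2) Update the document-topic counts from the responsibilities.
        NdNew = np.dot(wordCts, resp)
        # Halt when every count changes by less than convThr (convThrLP).
        if np.max(np.abs(NdNew - Nd)) < convThr:
            return NdNew
        Nd = NdNew
    return Nd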
hmodel, RInfo = bnpy.run(Data, 'FiniteTopicModel', 'Mult', 'VB',
                         K=5, alpha=0.1, lam=0.1, initname='kmeansplusplus',
                         nLap=500, printEvery=25, nCoordAscentItersLP=25,
                         nTask=2, jobname='demobarsVB-Kinit=5-kmeans')
Toy Bars Data with 10 true topics. Each doc uses 1-3 bars.
  size: 1000 units (documents)
  vocab size: 900
    min    5%   50%   95%   max
     68    76    85    92    99   nUniqueTokensPerDoc
    100   100   100   100   100   nTotalTokensPerDoc
  Hist of word_count across tokens
      1     2     3   <10  <100 >=100
   0.84  0.14  0.02   200     0     0
  Hist of unique docs per word type
     <1   <10  <100 <0.20 <0.50 >=0.50
      0     0  0.75  0.25     0     0
Allocation Model:  Finite LDA model with K=5 comps. alpha=0.10
Obs. Data  Model:  Multinomial over finite vocabulary.
Obs. Data  Prior:  Dirichlet over finite vocabulary
  lam = [ 0.1  0.1] ...
Learn Alg: VB
Trial  1/2 | alg. seed: 2497280 | data order seed: 8541952
savepath: /results/BarsK10V900/demobarsVB-Kinit=5-kmeans/1
    1/500 after    1 sec. | K    5 | ev -6.497725357e+00 |
    2/500 after    1 sec. | K    5 | ev -6.474361847e+00 | Ndiff  256.402
   25/500 after    3 sec. | K    5 | ev -6.459785851e+00 | Ndiff    0.705
   37/500 after    5 sec. | K    5 | ev -6.459734310e+00 | Ndiff    0.048
... done. converged.
Trial  2/2 | alg. seed: 1128064 | data order seed: 7673856
savepath: /results/BarsK10V900/demobarsVB-Kinit=5-kmeans/2
    1/500 after    0 sec. | K    5 | ev -6.605946952e+00 |
    2/500 after    1 sec. | K    5 | ev -6.580567563e+00 | Ndiff 1810.242
   25/500 after    4 sec. | K    5 | ev -6.548944331e+00 | Ndiff    1.797
   50/500 after    6 sec. | K    5 | ev -6.548918037e+00 | Ndiff    0.192
   75/500 after    8 sec. | K    5 | ev -6.548903603e+00 | Ndiff    0.164
   98/500 after   10 sec. | K    5 | ev -6.548881828e+00 | Ndiff    0.013
... done. converged.
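bnpy.run returns the trained model (hmodel) and a dictionary of run information (RInfo). A minimal, hedged inspection; the exact keys and attributes vary across bnpy versions:

# Peek at the returned objects (exact contents vary by bnpy version).
print(type(hmodel))          # the trained bnpy model object
print(sorted(RInfo.keys()))  # run metadata, e.g. save path, ELBO trace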
hmodel, RInfo = bnpy.run(Data, 'FiniteTopicModel', 'Mult', 'VB',
                         K=10, alpha=0.1, lam=0.1, initname='kmeansplusplus',
                         nLap=500, printEvery=25, nCoordAscentItersLP=25,
                         nTask=2, jobname='demobarsVB-Kinit=10-kmeans')
Toy Bars Data with 10 true topics. Each doc uses 1-3 bars.
  size: 1000 units (documents)
  vocab size: 900
    min    5%   50%   95%   max
     68    76    85    92    99   nUniqueTokensPerDoc
    100   100   100   100   100   nTotalTokensPerDoc
  Hist of word_count across tokens
      1     2     3   <10  <100 >=100
   0.84  0.14  0.02   200     0     0
  Hist of unique docs per word type
     <1   <10  <100 <0.20 <0.50 >=0.50
      0     0  0.75  0.25     0     0
Allocation Model:  Finite LDA model with K=10 comps. alpha=0.10
Obs. Data  Model:  Multinomial over finite vocabulary.
Obs. Data  Prior:  Dirichlet over finite vocabulary
  lam = [ 0.1  0.1] ...
Learn Alg: VB
Trial  1/2 | alg. seed: 2497280 | data order seed: 8541952
savepath: /results/BarsK10V900/demobarsVB-Kinit=10-kmeans/1
    1/500 after    1 sec. | K   10 | ev -6.384955954e+00 |
    2/500 after    1 sec. | K   10 | ev -6.340769182e+00 | Ndiff  487.173
   25/500 after    5 sec. | K   10 | ev -6.294641929e+00 | Ndiff    2.642
   50/500 after    8 sec. | K   10 | ev -6.294527751e+00 | Ndiff    0.117
   54/500 after    9 sec. | K   10 | ev -6.294523143e+00 | Ndiff    0.049
... done. converged.
Trial  2/2 | alg. seed: 1128064 | data order seed: 7673856
savepath: /results/BarsK10V900/demobarsVB-Kinit=10-kmeans/2
    1/500 after    1 sec. | K   10 | ev -6.308175604e+00 |
    2/500 after    1 sec. | K   10 | ev -6.262134141e+00 | Ndiff  467.564
   25/500 after    5 sec. | K   10 | ev -6.234179645e+00 | Ndiff    2.107
   50/500 after    8 sec. | K   10 | ev -6.234039377e+00 | Ndiff    0.527
   75/500 after   11 sec. | K   10 | ev -6.234010173e+00 | Ndiff    0.155
   98/500 after   14 sec. | K   10 | ev -6.233997917e+00 | Ndiff    0.044
... done. converged.
hmodel, RInfo = bnpy.run(Data, 'FiniteTopicModel', 'Mult', 'VB',
                         K=20, alpha=0.1, lam=0.1, initname='kmeansplusplus',
                         nLap=500, printEvery=25, nCoordAscentItersLP=25,
                         nTask=2, jobname='demobarsVB-Kinit=20-kmeans')
Toy Bars Data with 10 true topics. Each doc uses 1-3 bars.
  size: 1000 units (documents)
  vocab size: 900
    min    5%   50%   95%   max
     68    76    85    92    99   nUniqueTokensPerDoc
    100   100   100   100   100   nTotalTokensPerDoc
  Hist of word_count across tokens
      1     2     3   <10  <100 >=100
   0.84  0.14  0.02   200     0     0
  Hist of unique docs per word type
     <1   <10  <100 <0.20 <0.50 >=0.50
      0     0  0.75  0.25     0     0
Allocation Model:  Finite LDA model with K=20 comps. alpha=0.10
Obs. Data  Model:  Multinomial over finite vocabulary.
Obs. Data  Prior:  Dirichlet over finite vocabulary
  lam = [ 0.1  0.1] ...
Learn Alg: VB
Trial  1/2 | alg. seed: 2497280 | data order seed: 8541952
savepath: /results/BarsK10V900/demobarsVB-Kinit=20-kmeans/1
    1/500 after    1 sec. | K   20 | ev -6.259942731e+00 |
    2/500 after    1 sec. | K   20 | ev -6.188580434e+00 | Ndiff  553.015
   25/500 after    7 sec. | K   20 | ev -6.146453495e+00 | Ndiff    4.339
   50/500 after   11 sec. | K   20 | ev -6.146234090e+00 | Ndiff    0.115
   75/500 after   16 sec. | K   20 | ev -6.146224009e+00 | Ndiff    0.073
   99/500 after   20 sec. | K   20 | ev -6.146212419e+00 | Ndiff    0.049
... done. converged.
Trial  2/2 | alg. seed: 1128064 | data order seed: 7673856
savepath: /results/BarsK10V900/demobarsVB-Kinit=20-kmeans/2
    1/500 after    1 sec. | K   20 | ev -6.251949641e+00 |
    2/500 after    1 sec. | K   20 | ev -6.183576703e+00 | Ndiff  719.873
   25/500 after    7 sec. | K   20 | ev -6.142806661e+00 | Ndiff    5.266
   50/500 after   11 sec. | K   20 | ev -6.142594683e+00 | Ndiff    0.627
   75/500 after   16 sec. | K   20 | ev -6.142519221e+00 | Ndiff    0.359
   96/500 after   19 sec. | K   20 | ev -6.142512753e+00 | Ndiff    0.049
... done. converged.
Here we plot the log evidence (sometimes called the evidence lower bound, or ELBO) for each run. Larger objective scores indicate better model quality.
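For reference, the ELBO is the standard variational lower bound on the marginal log evidence (a textbook definition, not specific to bnpy):
$$ \mathcal{L}(q) = \mathbb{E}_q \big[ \log p(x, z) \big] - \mathbb{E}_q \big[ \log q(z) \big] \leq \log p(x) $$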
from matplotlib import pylab
%pylab inline
bnpy.viz.PlotELBO.plotJobsThatMatchKeywords('BarsK10V900/demobarsVB-*');
pylab.legend(loc='lower right');
Populating the interactive namespace from numpy and matplotlib
Conclusion: The initial number of topics matters! If we have too few topics, we cannot represent all 10 true bars, and performance suffers.
We can use the plotCompsForTask function from bnpy's visualization tools to show the learned topic-word parameters.
bnpy.viz.PlotComps.plotCompsForTask('BarsK10V900/demobarsVB-Kinit=5-kmeans/1/', **imshowArgs)
**Conclusion:** With only 5 topics, this run cannot recover all 10 true bars; the learned topics must blend several bars together.
bnpy.viz.PlotComps.plotCompsForTask('BarsK10V900/demobarsVB-Kinit=10-kmeans/1/', **imshowArgs)
Remarks: We've found some of the true topics, but some learned topics (like the lower left) are blends of several bars, while others (far right) are redundant copies.
bnpy.viz.PlotComps.plotCompsForTask('BarsK10V900/demobarsVB-Kinit=20-kmeans/1/', **imshowArgs)
Remarks: We've found all 10 true topics, plus some extra "junk" or redundant topics.