import logging
from gensim.models import EnsembleLda, LdaMulticore
from gensim.models.ensemblelda import rank_masking
from gensim.corpora import OpinosisCorpus
import os
Enable the ensemble logger to show what it is currently doing:
elda_logger = logging.getLogger(EnsembleLda.__module__)
elda_logger.setLevel(logging.INFO)
elda_logger.addHandler(logging.StreamHandler())
def pretty_print_topics():
    # Note that the words are stemmed, so they appear chopped off.
    for t in elda.print_topics(num_words=7):
        print('-', t[1].replace('*', ' ').replace('"', '').replace(' +', ','), '\n')
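The string cleanup in pretty_print_topics can be checked on a hand-written topic string in the format that print_topics returns (the terms and weights below are made up for illustration):

```python
# A hypothetical topic string, shaped like gensim's print_topics() output.
raw = '0.041*"button" + 0.038*"easi" + 0.030*"batteri"'

# The same chain of replacements used in pretty_print_topics().
cleaned = raw.replace('*', ' ').replace('"', '').replace(' +', ',')
print('-', cleaned)  # prints: - 0.041 button, 0.038 easi, 0.030 batteri
```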
Opinosis [1] is a small (but redundant) corpus that contains 289 product reviews for 51 products. Since it's so small, the results are rather unstable.
[1] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions [online], Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 340–348. Available from: https://kavita-ganesan.com/opinosis/
First, download the Opinosis dataset. On Linux it can be done like this, for example:
!mkdir ~/opinosis
!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip
!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis
path = os.path.expanduser('~/opinosis/')
The corpus and id2word mapping can be created using the OpinosisCorpus class provided in the package. It preprocesses the data using the PorterStemmer and stopwords from the nltk package. The class takes the path to the folder into which the zip file was extracted before; that folder contains a 'summaries-gold' subfolder.
opinosis = OpinosisCorpus(path)
Parameters

topic_model_class: 'ldamulticore' is highly recommended for EnsembleLda. ensemble_workers and distance_workers are used to reduce the time needed to train the models, as is the rank_masking masking_method. ldamulticore is not able to fully utilize all cores on this small corpus, so setting ensemble_workers to 3 yields 95 - 100% CPU usage on my i5 3470.
Since the corpus is so small, a high num_models is needed to extract stable topics. The Opinosis corpus contains 51 categories; however, some of them are quite similar. For example, there are three categories about the batteries of portable products, and there are also multiple categories about cars. So I chose 20 for num_topics, which is smaller than the number of categories.
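As a rough intuition for what rank masking does before topic distances are computed (a simplified sketch, not gensim's actual implementation; the threshold value here is an assumption): each topic's term distribution is reduced to its highest-ranked terms, and everything else is zeroed out, so the distance computation focuses on each topic's most characteristic words.

```python
import numpy as np

def rank_mask(topic, threshold=0.11):
    """Keep only the top `threshold` fraction of terms by probability,
    zeroing out the rest (a simplified sketch of rank masking)."""
    n_keep = max(1, int(len(topic) * threshold))
    top = np.argsort(topic)[::-1][:n_keep]  # indices of the largest terms
    masked = np.zeros_like(topic)
    masked[top] = topic[top]
    return masked

# A hypothetical 10-term topic distribution.
topic = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01])
print(rank_mask(topic))  # at threshold 0.11, only the single largest term survives
```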
elda = EnsembleLda(
    corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,
    passes=20, iterations=100, ensemble_workers=3, distance_workers=4,
    topic_model_class='ldamulticore', masking_method=rank_masking,
)
pretty_print_topics()
The default for min_samples would be 64 (half the number of models) and for eps 0.1. You basically play around with them until you find a sweet spot that fits your needs.
elda.recluster(min_samples=55, eps=0.14)
pretty_print_topics()