In this tutorial, you will learn how to use the author-topic model in Gensim for authorship prediction, based on topic distributions and a measure of their similarity. We will train the author-topic model on a Reuters dataset, which contains 50 authors, each with 50 documents for training and another 50 documents for testing: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50 .
If you wish to learn more about the Author-topic model and LDA and how to train them, you should check out these tutorials beforehand. A lot of the preprocessing and configuration here has been done using their example:
NOTE:
To run this tutorial on your own, install Jupyter, Gensim, SpaCy, Scikit-Learn, Bokeh and Pandas, e.g. using pip:
pip install jupyter gensim spacy scikit-learn bokeh pandas
Note that you need to download some data for SpaCy using
python -m spacy.en.download
Download the notebook at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_prediction_tutorial.ipynb.
Predicting the author of a document is a difficult task, where current approaches usually turn to neural networks. These base a lot of their predictions on learning the stylistic and syntactic preferences of the authors, along with other features that help identify the author.
In our case, we first model the domain knowledge of a certain author, based on what the author writes about. We do this by calculating the topic distributions for each author using the author-topic model. After that, we perform new-author inference on the held-out subset. This again calculates a topic distribution for the new, unknown author. In order to perform the prediction, we find, out of all known authors, the one most similar to the new unknown author. Mathematically speaking, we find the author whose topic distribution is closest to the topic distribution of the new author, according to a certain distance function or metric. Here we explore the Hellinger distance for measuring the distance between two discrete multinomial topic distributions.
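To make the distance concrete, here is a minimal sketch of the Hellinger distance between two discrete distributions, written in plain Python. The function name `hellinger_distance` and the toy distributions are illustrative assumptions and not part of Gensim; the tutorial code itself relies on `gensim.matutils.hellinger`.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    # H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1].
    squared_diffs = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(0.5 * squared_diffs)

# Toy topic distributions over three topics (made-up numbers).
known_author = [0.7, 0.2, 0.1]
similar_author = [0.6, 0.3, 0.1]
different_author = [0.1, 0.1, 0.8]

print(hellinger_distance(known_author, known_author))      # identical distributions
print(hellinger_distance(known_author, similar_author))    # small distance
print(hellinger_distance(known_author, different_author))  # larger distance
```

Identical distributions are at distance 0, and distributions that put their mass on disjoint topics are at distance 1, so a smaller value means the two authors write about more similar topics.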
We start off by downloading the dataset. You can do it manually using the aforementioned link, or run the following code cell.
!wget -O - "https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip" > /tmp/C50.zip
--2018-03-25 17:24:26-- https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip Resolving archive.ics.uci.edu... 128.195.10.249 Connecting to archive.ics.uci.edu|128.195.10.249|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 8194031 (7.8M) [application/zip] Saving to: 'STDOUT' - 100%[===================>] 7.81M 2.30MB/s in 3.4s 2018-03-25 17:24:31 (2.30 MB/s) - written to stdout [8194031/8194031]
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
import zipfile
filename = '/tmp/C50.zip'
zip_ref = zipfile.ZipFile(filename, 'r')
zip_ref.extractall("/tmp/")
zip_ref.close()
We wrap all the preprocessing steps, which you can read more about in the author-topic notebook, in one function so that we are able to iterate over different preprocessing parameters.
import os, re, io
def preprocess_docs(data_dir):
    doc_ids = []
    author2doc = {}
    docs = []
    folders = os.listdir(data_dir)  # List of author folder names.
    for authorname in folders:
        files = os.listdir(data_dir + '/' + authorname)
        for filen in files:
            (idx1, idx2) = re.search('[0-9]+', filen).span()  # Indexes of the start and end of the ID.
            if not author2doc.get(authorname):
                # This is a new author.
                author2doc[authorname] = []
            doc_id = str(int(filen[idx1:idx2]))
            doc_ids.append(doc_id)
            author2doc[authorname].extend([doc_id])
            # Read document text.
            # Note: ignoring characters that cause encoding errors.
            with io.open(data_dir + '/' + authorname + '/' + filen, errors='ignore', encoding='utf-8') as fid:
                txt = fid.read()
            # Replace any whitespace (newline, tabs, etc.) by a single space.
            txt = re.sub(r'\s', ' ', txt)
            docs.append(txt)
    doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))
    # Replace dataset IDs by integer IDs.
    for a, a_doc_ids in author2doc.items():
        for i, doc_id in enumerate(a_doc_ids):
            author2doc[a][i] = doc_id_dict[doc_id]
    import spacy
    nlp = spacy.load('en')
    # IPython magic: only works when run in a notebook.
    %%time
    processed_docs = []
    for doc in nlp.pipe(docs, n_threads=4, batch_size=100):
        # Process document using Spacy NLP pipeline.
        ents = doc.ents  # Named entities.
        # Keep only words (no numbers, no punctuation).
        # Lemmatize tokens, remove punctuation and remove stopwords.
        doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
        # Remove common words from a stopword list.
        #doc = [token for token in doc if token not in STOPWORDS]
        # Add named entities, but only if they are a compound of more than one word.
        doc.extend([str(entity) for entity in ents if len(entity) > 1])
        processed_docs.append(doc)
    docs = processed_docs
    del processed_docs
    # Compute bigrams.
    from gensim.models import Phrases
    # Add bigrams and trigrams to docs (only ones that appear 20 times or more).
    bigram = Phrases(docs, min_count=20)
    for idx in range(len(docs)):
        for token in bigram[docs[idx]]:
            if '_' in token:
                # Token is a bigram, add to document.
                docs[idx].append(token)
    return docs, author2doc
We create the corpus of the train and test data using two separate functions, since each corpus is tied to a certain dictionary which maps the words to their ids. To create the test corpus, we use the dictionary from the train data, since the trained model has to have the same id2word reference as the new test data. Otherwise the token with id 1 from the test data won't mean the same as the token with id 1 that the model was trained on.
def create_corpus_dictionary(docs, max_freq=0.5, min_wordcount=20):
    # Create a dictionary representation of the documents, and filter out frequent and rare words.
    from gensim.corpora import Dictionary
    dictionary = Dictionary(docs)
    # Remove rare and common tokens.
    # Filter out words that occur too frequently or too rarely.
    dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)
    _ = dictionary[0]  # This sort of "initializes" dictionary.id2token.
    # Vectorize data.
    # Bag-of-words representation of the documents.
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    return corpus, dictionary
def create_test_corpus(train_dictionary, docs):
    # Create test corpus using the dictionary from the train data.
    return [train_dictionary.doc2bow(doc) for doc in docs]
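The reason for reusing the training dictionary can be seen in a small pure-Python sketch. The helpers `build_token2id` and `to_bow` are hypothetical stand-ins for what a dictionary and its `doc2bow` method do; they are not Gensim's implementation, and the toy documents are made up.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(token2id, doc):
    """Bag-of-words as sorted (id, count) pairs; tokens missing from the mapping are dropped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_train_docs = [["bank", "loan", "bank"], ["gold", "mine"]]
toy_test_docs = [["gold", "bank", "yen"]]

train_ids = build_token2id(toy_train_docs)
test_ids = build_token2id(toy_test_docs)  # a separate mapping built from the test data alone

# With the training mapping, ids keep the meaning the model was trained on.
print(to_bow(train_ids, toy_test_docs[0]))
# With a separate test mapping, id 0 now means "gold" instead of "bank".
print(to_bow(test_ids, toy_test_docs[0]))
```

In this toy setup the unseen token "yen" is simply dropped when the training mapping is used, while the separate test mapping silently reassigns ids, which is exactly the mismatch the shared dictionary avoids.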
For our first training, we specify that we want the parameters max_freq and min_wordcount to be 0.5 (50%) and 20, as proposed by the original notebook tutorial. We will find out if this configuration is good enough for us.
traindata_dir = "/tmp/C50train"
train_docs, train_author2doc = preprocess_docs(traindata_dir)
train_corpus_50_20, train_dictionary_50_20 = create_corpus_dictionary(train_docs, 0.5, 20)
05:24:36 DEBUG:Registered VCS backend: git 05:24:36 DEBUG:Registered VCS backend: hg 05:24:36 DEBUG:Registered VCS backend: svn 05:24:36 DEBUG:Registered VCS backend: bzr
CPU times: user 3 µs, sys: 0 ns, total: 3 µs Wall time: 7.15 µs
05:26:17 INFO:'pattern' package not found; tag filters are not available for English 05:26:17 INFO:collecting all words and their counts 05:26:17 INFO:PROGRESS: at sentence #0, processed 0 words and 0 word types 05:26:19 INFO:collected 437598 word types from a corpus of 746622 words (unigram + bigrams) and 2500 sentences 05:26:19 INFO:using 437598 counts as vocab in Phrases<0 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000> /Users/martin/Projects/bachelor/gensim/gensim/models/phrases.py:490: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class") 05:26:24 INFO:adding document #0 to Dictionary(0 unique tokens: []) 05:26:25 INFO:built Dictionary(46905 unique tokens: ['$83.4 million', 'boarder', '$2.72 billion', 'checking', 'suzuki']...) from 2500 documents (total 786032 corpus positions) 05:26:25 INFO:discarding 42991 tokens: [('$1.4 billion', 11), ('$15', 3), ('$17.25', 1), ('$380 million', 2), ('12.5 cents', 7), ('Big B', 3), ('Big B Inc.', 2), ("Big B's", 3), ('Big B. I', 1), ('Dwayne Hoven', 1)]... 05:26:25 INFO:keeping 3914 tokens which were in no less than 20 and no more than 1250 (=50.0%) documents 05:26:25 DEBUG:rebuilding dictionary, shrinking gaps 05:26:25 INFO:resulting dictionary: Dictionary(3914 unique tokens: ['chris_patten', 'online', 'loss', 'hub', 'sound']...)
print('Number of unique tokens: %d' % len(train_dictionary_50_20))
Number of unique tokens: 3914
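As a rough illustration of what the two thresholds mean, the sketch below filters a toy vocabulary by document frequency. The helper `filter_by_doc_freq` is a hypothetical simplification and not Gensim's `filter_extremes`; it only mirrors the rule reported in the log above, where tokens are kept if they occur in no fewer than 20 documents and in no more than 50% of all documents.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (a simplified, hypothetical helper)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["share", "analyst", "gold"],
    ["share", "bank", "gold"],
    ["share", "bank"],
    ["share", "apple"],
]

# "share" is in all 4 documents (too common); "analyst" and "apple" are in 1 (too rare).
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))

# For the 2500 training documents, no_above=0.5 corresponds to 1250 documents.
print(0.5 * 2500)
```

Loosening either threshold keeps more tokens, which is why the tutorial treats them as parameters worth iterating over.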
testdata_dir = "/tmp/C50test"
test_docs, test_author2doc = preprocess_docs(testdata_dir)
test_corpus_50_20 = create_test_corpus(train_dictionary_50_20, test_docs)
CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs Wall time: 15 µs
05:28:06 INFO:collecting all words and their counts 05:28:06 INFO:PROGRESS: at sentence #0, processed 0 words and 0 word types 05:28:08 INFO:collected 448895 word types from a corpus of 758070 words (unigram + bigrams) and 2500 sentences 05:28:08 INFO:using 448895 counts as vocab in Phrases<0 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000> /Users/martin/Projects/bachelor/gensim/gensim/models/phrases.py:490: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
We also wrap the model training in a function so that, again, we can iterate over different parametrizations.
def train_model(corpus, author2doc, dictionary, num_topics=20, eval_every=0, iterations=50, passes=20):
    from gensim.models import AuthorTopicModel
    model = AuthorTopicModel(corpus=corpus, num_topics=num_topics, id2word=dictionary.id2token,
                             author2doc=author2doc, chunksize=2500, passes=passes,
                             eval_every=eval_every, iterations=iterations, random_state=1)
    # Print the average topic coherence over all topics.
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    print(tc / num_topics)
    return model
# NOTE: The logic of this function was written by Olavur Mortensen and is taken from his notebook tutorial.
def predict_author(new_doc, atmodel, top_n=10, smallest_author=1):
    from gensim import matutils
    import pandas as pd
    def similarity(vec1, vec2):
        '''Get similarity between two vectors'''
        dist = matutils.hellinger(matutils.sparse2full(vec1, atmodel.num_topics),
                                  matutils.sparse2full(vec2, atmodel.num_topics))
        sim = 1.0 / (1.0 + dist)
        return sim
    def get_sims(vec):
        '''Get similarity of vector to all authors.'''
        sims = [similarity(vec, vec2) for vec2 in author_vecs]
        return sims
    author_vecs = [atmodel.get_author_topics(author) for author in atmodel.id2author.values()]
    new_doc_topics = atmodel.get_new_author_topics(new_doc)
    # Get similarities.
    sims = get_sims(new_doc_topics)
    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for elem in enumerate(sims):
        author_name = atmodel.id2author[elem[0]]
        sim = elem[1]
        author_size = len(atmodel.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))
    # Make dataframe and retrieve top authors.
    df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df = df.sort_values('Score', ascending=False)[:top_n]
    return df
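The scoring step in `predict_author` turns a distance into a similarity with `1 / (1 + dist)` and then sorts authors by that score. The sketch below isolates this step; the helper name `distance_to_similarity`, the author names and the distances are made up for illustration.

```python
def distance_to_similarity(dist):
    """Map a non-negative distance to a similarity in (0, 1], as in sim = 1 / (1 + dist)."""
    return 1.0 / (1.0 + dist)

# Made-up Hellinger distances between a new document and three known authors.
toy_distances = {"author_a": 0.10, "author_b": 0.45, "author_c": 0.80}

scores = {name: distance_to_similarity(d) for name, d in toy_distances.items()}
# Most similar author first, which is the same order as smallest distance first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the transform is strictly decreasing in the distance, ranking by similarity gives the same order as ranking by distance; the transform only makes the scores easier to read.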
We define a custom function which measures the prediction accuracy, following the precision-at-k principle. We parametrize the accuracy by a parameter k: k=1 means we need an exact match in order to be accurate, while k=5 means the true author has to be among the top 5 results, ordered by similarity.
def prediction_accuracy(test_author2doc, test_corpus, model, k=5):
    print("Precision@k: top_n={}".format(k))
    matches = 0
    tries = 0
    for author in test_author2doc:
        for doc_id in test_author2doc[author]:
            predicted_authors = predict_author(test_corpus[doc_id:doc_id+1], atmodel=model, top_n=k)
            tries = tries + 1
            # Compare author names against the column values; `in` on a Series would test its index instead.
            if author in predicted_authors["Author"].values:
                matches = matches + 1
    accuracy = matches / tries
    print("Prediction accuracy: {}".format(accuracy))
    return accuracy, k
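The precision-at-k idea can be illustrated in isolation with a few made-up rankings. The function `top_k_accuracy` and the author names below are assumptions for illustration only; they do not use the trained model.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of cases in which the true label is among the first k ranked predictions."""
    hits = sum(1 for ranking, truth in zip(ranked_predictions, true_labels)
               if truth in ranking[:k])
    return hits / len(true_labels)

# Each inner list is a ranking of candidate authors, most similar first (made-up data).
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "alice", "carol"],
    ["carol", "bob", "alice"],
    ["bob", "carol", "alice"],
]
truths = ["alice", "alice", "alice", "carol"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, truths, k))
```

The score can only stay the same or grow as k increases, since a true author found in the top k is also in the top k+1.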
def plot_accuracy(scores1, label1, scores2=None, label2=None):
    import matplotlib.pyplot as plt
    s = [score*100 for score in scores1.values()]
    t = list(scores1.keys())
    plt.plot(t, s, "b-", label=label1)
    plt.plot(t, s, "r^", label=label1+" data points")
    if scores2 is not None:
        s2 = [score*100 for score in scores2.values()]
        plt.plot(t, s2, label=label2)
        plt.plot(t, s2, "o", label=label2+" data points")
    plt.legend(loc="lower right")
    plt.xlabel('parameter k')
    plt.ylabel('prediction accuracy')
    plt.title('Precision at k')
    plt.xticks(t)
    plt.grid(True)
    plt.yticks([30,40,50,60,70,80,90,100])
    plt.axis([0, 11, 30, 100])
    plt.show()
We calculate the accuracy for a range of values for k=[1,2,3,4,5,6,8,10] and plot how the prediction accuracy rises with higher k.
atmodel_standard = train_model(train_corpus_50_20, train_author2doc, train_dictionary_50_20)
05:28:14 INFO:Vocabulary consists of 3914 words. 05:28:14 INFO:using symmetric alpha at 0.05 05:28:14 INFO:using symmetric eta at 0.05 05:28:14 INFO:running online author-topic training, 20 topics, 50 authors, 20 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000 05:28:14 INFO:PROGRESS: pass 0, at document #2500/2500 05:28:14 DEBUG:performing inference on a chunk of 2500 documents 05:28:22 DEBUG:3/2500 documents converged within 50 iterations 05:28:22 DEBUG:updating topics 05:28:22 INFO:topic #11 (0.050): 0.028*"gm" + 0.013*"plant" + 0.012*"strike" + 0.009*"worker" + 0.009*"uaw" + 0.007*"automaker" + 0.007*"share" + 0.007*"union" + 0.006*"truck" + 0.006*"analyst" 05:28:22 INFO:topic #17 (0.050): 0.018*"apple" + 0.008*"computer" + 0.008*"software" + 0.008*"share" + 0.008*"analyst" + 0.007*"quarter" + 0.006*"microsoft" + 0.006*"service" + 0.006*"base" + 0.005*"plan" 05:28:22 INFO:topic #15 (0.050): 0.009*"analyst" + 0.008*"computer" + 0.008*"stock" + 0.007*"billion" + 0.007*"quarter" + 0.007*"share" + 0.006*"industry" + 0.005*"software" + 0.005*"oil" + 0.005*"sale" 05:28:22 INFO:topic #9 (0.050): 0.009*"analyst" + 0.006*"share" + 0.006*"china" + 0.006*"gold" + 0.006*"chinese" + 0.005*"price" + 0.005*"government" + 0.005*"stock" + 0.004*"base" + 0.004*"drug" 05:28:22 INFO:topic #14 (0.050): 0.010*"pound" + 0.009*"share" + 0.008*"profit" + 0.007*"billion" + 0.007*"analyst" + 0.007*"group" + 0.007*"bank" + 0.006*"business" + 0.005*"million_pound" + 0.005*"price" 05:28:22 INFO:topic diff=2.864277, rho=1.000000 05:28:22 INFO:PROGRESS: pass 1, at document #2500/2500 05:28:22 DEBUG:performing inference on a chunk of 2500 documents 05:28:25 DEBUG:2491/2500 documents converged within 50 iterations 05:28:25 DEBUG:updating topics 05:28:25 INFO:topic #0 (0.050): 0.011*"bank" + 0.009*"analyst" + 0.005*"share" + 0.005*"billion" + 
0.005*"government" + 0.004*"news" + 0.004*"business" + 0.004*"rule" + 0.004*"profit" + 0.004*"group" 05:28:25 INFO:topic #14 (0.050): 0.011*"pound" + 0.010*"share" + 0.009*"profit" + 0.008*"group" + 0.008*"analyst" + 0.008*"billion" + 0.007*"bank" + 0.006*"business" + 0.006*"million_pound" + 0.005*"penny" 05:28:25 INFO:topic #15 (0.050): 0.010*"analyst" + 0.009*"stock" + 0.008*"share" + 0.008*"billion" + 0.007*"quarter" + 0.007*"computer" + 0.006*"oil" + 0.006*"bank" + 0.005*"industry" + 0.005*"high" 05:28:25 INFO:topic #1 (0.050): 0.014*"bank" + 0.010*"china" + 0.009*"hong_kong" + 0.009*"kong" + 0.008*"hong" + 0.008*"billion" + 0.007*"Hong Kong" + 0.006*"analyst" + 0.006*"stock" + 0.006*"fund" 05:28:25 INFO:topic #7 (0.050): 0.007*"analyst" + 0.007*"sale" + 0.007*"share" + 0.006*"group" + 0.005*"business" + 0.005*"price" + 0.004*"profit" + 0.004*"industry" + 0.004*"pound" + 0.004*"billion" 05:28:25 INFO:topic diff=1.147566, rho=0.577350 05:28:25 INFO:PROGRESS: pass 2, at document #2500/2500 05:28:25 DEBUG:performing inference on a chunk of 2500 documents 05:28:27 DEBUG:2498/2500 documents converged within 50 iterations 05:28:27 DEBUG:updating topics 05:28:27 INFO:topic #9 (0.050): 0.011*"drug" + 0.009*"colombia" + 0.007*"analyst" + 0.006*"government" + 0.006*"sale" + 0.005*"share" + 0.005*"base" + 0.004*"price" + 0.004*"stock" + 0.004*"united" 05:28:27 INFO:topic #15 (0.050): 0.010*"stock" + 0.009*"analyst" + 0.008*"share" + 0.008*"billion" + 0.007*"bank" + 0.007*"oil" + 0.006*"quarter" + 0.006*"canada" + 0.006*"toronto" + 0.005*"high" 05:28:27 INFO:topic #8 (0.050): 0.026*"bre" + 0.024*"gold" + 0.024*"bre_x" + 0.024*"x" + 0.018*"Bre-X" + 0.015*"barrick" + 0.011*"analyst" + 0.010*"busang" + 0.010*"indonesian" + 0.008*"government" 05:28:27 INFO:topic #19 (0.050): 0.019*"hong" + 0.018*"kong" + 0.018*"hong_kong" + 0.014*"china" + 0.012*"Hong Kong" + 0.006*"chinese" + 0.005*"price" + 0.005*"british" + 0.005*"tell" + 0.004*"tung" 05:28:27 INFO:topic #10 (0.050): 
0.009*"billion" + 0.008*"bank" + 0.005*"loan" + 0.005*"tonne" + 0.005*"yen" + 0.005*"price" + 0.005*"exporter" + 0.004*"real_estate" + 0.004*"analyst" + 0.004*"real" 05:28:27 INFO:topic diff=1.010061, rho=0.500000 05:28:27 INFO:PROGRESS: pass 3, at document #2500/2500 05:28:27 DEBUG:performing inference on a chunk of 2500 documents 05:28:29 DEBUG:2500/2500 documents converged within 50 iterations 05:28:29 DEBUG:updating topics 05:28:29 INFO:topic #13 (0.050): 0.017*"china" + 0.014*"wang" + 0.012*"beijing" + 0.011*"taiwan" + 0.009*"court" + 0.009*"party" + 0.008*"chinese" + 0.008*"government" + 0.007*"official" + 0.007*"communist" 05:28:29 INFO:topic #19 (0.050): 0.022*"hong" + 0.021*"kong" + 0.021*"hong_kong" + 0.015*"china" + 0.014*"Hong Kong" + 0.006*"chinese" + 0.005*"airbus" + 0.005*"tung" + 0.005*"british" + 0.005*"Hong Kong's" 05:28:29 INFO:topic #12 (0.050): 0.012*"czech" + 0.007*"bank" + 0.007*"crown" + 0.006*"government" + 0.006*"klaus" + 0.005*"billion" + 0.005*"price" + 0.005*"party" + 0.005*"prague" + 0.005*"foreign" 05:28:29 INFO:topic #17 (0.050): 0.041*"apple" + 0.026*"computer" + 0.022*"software" + 0.020*"quarter" + 0.013*"microsoft" + 0.013*"analyst" + 0.010*"share" + 0.009*"sale" + 0.009*"macintosh" + 0.008*"pc" 05:28:29 INFO:topic #6 (0.050): 0.020*"share" + 0.017*"analyst" + 0.010*"bank" + 0.010*"shanghai" + 0.009*"stock" + 0.007*"sale" + 0.007*"b" + 0.006*"quarter" + 0.006*"base" + 0.005*"business" 05:28:29 INFO:topic diff=0.877566, rho=0.447214 05:28:29 INFO:PROGRESS: pass 4, at document #2500/2500 05:28:29 DEBUG:performing inference on a chunk of 2500 documents 05:28:31 DEBUG:2500/2500 documents converged within 50 iterations 05:28:31 DEBUG:updating topics 05:28:31 INFO:topic #14 (0.050): 0.012*"pound" + 0.010*"profit" + 0.010*"share" + 0.009*"group" + 0.008*"analyst" + 0.008*"billion" + 0.007*"business" + 0.007*"bank" + 0.006*"million_pound" + 0.005*"british" 05:28:31 INFO:topic #9 (0.050): 0.014*"drug" + 0.011*"colombia" + 
0.006*"government" + 0.005*"analyst" + 0.005*"sale" + 0.005*"united" + 0.005*"colombian" + 0.004*"guerrilla" + 0.004*"base" + 0.004*"force" 05:28:31 INFO:topic #13 (0.050): 0.018*"china" + 0.015*"wang" + 0.013*"beijing" + 0.012*"taiwan" + 0.009*"court" + 0.009*"party" + 0.009*"chinese" + 0.008*"government" + 0.007*"communist" + 0.007*"official" 05:28:31 INFO:topic #2 (0.050): 0.010*"share" + 0.009*"analyst" + 0.008*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business" 05:28:31 INFO:topic #15 (0.050): 0.010*"stock" + 0.009*"analyst" + 0.008*"billion" + 0.008*"share" + 0.008*"bank" + 0.007*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"russia" + 0.005*"high" 05:28:31 INFO:topic diff=0.761073, rho=0.408248 05:28:31 INFO:PROGRESS: pass 5, at document #2500/2500 05:28:31 DEBUG:performing inference on a chunk of 2500 documents 05:28:33 DEBUG:2500/2500 documents converged within 50 iterations 05:28:33 DEBUG:updating topics 05:28:33 INFO:topic #15 (0.050): 0.010*"stock" + 0.009*"analyst" + 0.009*"bank" + 0.008*"billion" + 0.008*"share" + 0.007*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"russia" + 0.006*"tonne" 05:28:33 INFO:topic #2 (0.050): 0.011*"share" + 0.009*"analyst" + 0.008*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business" 05:28:33 INFO:topic #13 (0.050): 0.018*"china" + 0.015*"wang" + 0.013*"beijing" + 0.012*"taiwan" + 0.009*"court" + 0.009*"chinese" + 0.009*"party" + 0.008*"government" + 0.007*"communist" + 0.007*"official" 05:28:33 INFO:topic #8 (0.050): 0.027*"gold" + 0.026*"bre" + 0.026*"bre_x" + 0.026*"x" + 0.019*"Bre-X" + 0.015*"barrick" + 0.012*"analyst" + 0.011*"busang" + 0.010*"indonesian" + 0.009*"government" 05:28:33 INFO:topic #1 (0.050): 0.017*"bank" + 0.010*"fund" + 0.010*"china" + 0.009*"billion" + 0.008*"hong_kong" + 0.008*"kong" + 0.008*"hong" + 0.007*"financial" + 0.007*"japan" + 
0.006*"Hong Kong" 05:28:33 INFO:topic diff=0.658823, rho=0.377964 05:28:33 INFO:PROGRESS: pass 6, at document #2500/2500 05:28:33 DEBUG:performing inference on a chunk of 2500 documents 05:28:35 DEBUG:2500/2500 documents converged within 50 iterations 05:28:35 DEBUG:updating topics 05:28:35 INFO:topic #17 (0.050): 0.038*"apple" + 0.027*"computer" + 0.022*"software" + 0.021*"quarter" + 0.014*"analyst" + 0.013*"microsoft" + 0.010*"sale" + 0.010*"share" + 0.008*"pc" + 0.008*"macintosh" 05:28:35 INFO:topic #9 (0.050): 0.015*"drug" + 0.012*"colombia" + 0.006*"government" + 0.005*"united" + 0.005*"sale" + 0.005*"colombian" + 0.005*"guerrilla" + 0.005*"analyst" + 0.004*"force" + 0.004*"week" 05:28:35 INFO:topic #15 (0.050): 0.010*"stock" + 0.009*"bank" + 0.009*"analyst" + 0.008*"billion" + 0.008*"share" + 0.007*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"russia" + 0.006*"tonne" 05:28:35 INFO:topic #13 (0.050): 0.018*"china" + 0.015*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"court" + 0.009*"chinese" + 0.009*"party" + 0.008*"government" + 0.007*"communist" + 0.007*"official" 05:28:35 INFO:topic #16 (0.050): 0.016*"franc" + 0.015*"french" + 0.015*"air" + 0.014*"france" + 0.011*"thomson" + 0.010*"billion" + 0.009*"group" + 0.007*"government" + 0.007*"plan" + 0.007*"bid" 05:28:35 INFO:topic diff=0.568497, rho=0.353553 05:28:35 INFO:PROGRESS: pass 7, at document #2500/2500 05:28:35 DEBUG:performing inference on a chunk of 2500 documents 05:28:36 DEBUG:2500/2500 documents converged within 50 iterations 05:28:36 DEBUG:updating topics 05:28:37 INFO:topic #17 (0.050): 0.037*"apple" + 0.027*"computer" + 0.022*"software" + 0.021*"quarter" + 0.014*"analyst" + 0.013*"microsoft" + 0.010*"sale" + 0.010*"share" + 0.008*"pc" + 0.008*"macintosh" 05:28:37 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business" 05:28:37 INFO:topic #3 (0.050): 
0.027*"bt" + 0.017*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"billion" + 0.011*"analyst" + 0.011*"deal" + 0.010*"british" + 0.010*"share" + 0.010*"group" 05:28:37 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"court" + 0.009*"chinese" + 0.009*"party" + 0.008*"government" + 0.007*"communist" + 0.007*"official" 05:28:37 INFO:topic #4 (0.050): 0.018*"china" + 0.011*"official" + 0.009*"state" + 0.008*"beijing" + 0.008*"tibet" + 0.007*"chinese" + 0.007*"government" + 0.007*"wang" + 0.006*"people" + 0.005*"dissident" 05:28:37 INFO:topic diff=0.488932, rho=0.333333 05:28:37 INFO:PROGRESS: pass 8, at document #2500/2500 05:28:37 DEBUG:performing inference on a chunk of 2500 documents 05:28:38 DEBUG:2500/2500 documents converged within 50 iterations 05:28:38 DEBUG:updating topics 05:28:38 INFO:topic #17 (0.050): 0.037*"apple" + 0.027*"computer" + 0.022*"software" + 0.021*"quarter" + 0.014*"analyst" + 0.013*"microsoft" + 0.010*"sale" + 0.010*"share" + 0.008*"pc" + 0.008*"macintosh" 05:28:38 INFO:topic #5 (0.050): 0.032*"china" + 0.016*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.009*"tonne" + 0.007*"hong" + 0.007*"hong_kong" + 0.007*"kong" + 0.007*"trade" + 0.006*"state" 05:28:38 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"chinese" + 0.009*"court" + 0.009*"party" + 0.008*"government" + 0.007*"communist" + 0.007*"official" 05:28:38 INFO:topic #14 (0.050): 0.012*"pound" + 0.011*"profit" + 0.010*"share" + 0.009*"analyst" + 0.009*"group" + 0.008*"billion" + 0.007*"bank" + 0.007*"business" + 0.006*"million_pound" + 0.005*"british" 05:28:38 INFO:topic #6 (0.050): 0.019*"share" + 0.016*"analyst" + 0.012*"shanghai" + 0.011*"bank" + 0.009*"stock" + 0.007*"b" + 0.007*"sale" + 0.006*"exchange" + 0.006*"base" + 0.006*"quarter" 05:28:38 INFO:topic diff=0.419457, rho=0.316228 05:28:38 INFO:PROGRESS: pass 9, at document #2500/2500 05:28:38 DEBUG:performing inference on a chunk 
of 2500 documents 05:28:40 DEBUG:2500/2500 documents converged within 50 iterations 05:28:40 DEBUG:updating topics 05:28:40 INFO:topic #3 (0.050): 0.027*"bt" + 0.017*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"billion" + 0.011*"analyst" + 0.011*"deal" + 0.010*"british" + 0.010*"share" + 0.010*"group" 05:28:40 INFO:topic #1 (0.050): 0.018*"bank" + 0.011*"fund" + 0.009*"billion" + 0.008*"china" + 0.008*"financial" + 0.007*"japan" + 0.007*"kong" + 0.007*"hong_kong" + 0.007*"hong" + 0.006*"analyst" 05:28:40 INFO:topic #15 (0.050): 0.010*"bank" + 0.009*"stock" + 0.008*"analyst" + 0.008*"billion" + 0.008*"share" + 0.008*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"tonne" + 0.006*"russia" 05:28:40 INFO:topic #19 (0.050): 0.029*"hong" + 0.028*"kong" + 0.028*"hong_kong" + 0.019*"Hong Kong" + 0.019*"china" + 0.008*"chinese" + 0.007*"tung" + 0.007*"Hong Kong's" + 0.006*"beijing" + 0.006*"airbus" 05:28:40 INFO:topic #8 (0.050): 0.028*"gold" + 0.026*"bre" + 0.026*"bre_x" + 0.026*"x" + 0.019*"Bre-X" + 0.015*"barrick" + 0.012*"analyst" + 0.011*"busang" + 0.010*"indonesian" + 0.009*"government" 05:28:40 INFO:topic diff=0.359320, rho=0.301511 05:28:40 INFO:PROGRESS: pass 10, at document #2500/2500 05:28:40 DEBUG:performing inference on a chunk of 2500 documents 05:28:41 DEBUG:2500/2500 documents converged within 50 iterations 05:28:41 DEBUG:updating topics 05:28:41 INFO:topic #6 (0.050): 0.019*"share" + 0.016*"analyst" + 0.013*"shanghai" + 0.011*"bank" + 0.009*"stock" + 0.008*"b" + 0.007*"sale" + 0.007*"exchange" + 0.006*"china" + 0.006*"base" 05:28:41 INFO:topic #11 (0.050): 0.042*"gm" + 0.028*"plant" + 0.016*"uaw" + 0.016*"strike" + 0.015*"worker" + 0.011*"automaker" + 0.010*"local" + 0.010*"truck" + 0.009*"part" + 0.008*"ford" 05:28:41 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business" 05:28:41 INFO:topic #1 (0.050): 0.019*"bank" + 
0.011*"fund" + 0.009*"billion" + 0.008*"financial" + 0.008*"china" + 0.008*"japan" + 0.007*"kong" + 0.007*"hong_kong" + 0.007*"hong" + 0.006*"analyst" 05:28:41 INFO:topic #12 (0.050): 0.014*"czech" + 0.008*"crown" + 0.008*"bank" + 0.007*"klaus" + 0.007*"government" + 0.006*"billion" + 0.006*"prague" + 0.005*"price" + 0.005*"foreign" + 0.005*"party" 05:28:41 INFO:topic diff=0.307661, rho=0.288675 05:28:41 INFO:PROGRESS: pass 11, at document #2500/2500 05:28:41 DEBUG:performing inference on a chunk of 2500 documents 05:28:43 DEBUG:2500/2500 documents converged within 50 iterations 05:28:43 DEBUG:updating topics 05:28:43 INFO:topic #15 (0.050): 0.011*"bank" + 0.009*"stock" + 0.008*"billion" + 0.008*"analyst" + 0.008*"share" + 0.008*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"russia" + 0.006*"tonne" 05:28:43 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business" 05:28:43 INFO:topic #5 (0.050): 0.032*"china" + 0.016*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.009*"tonne" + 0.007*"hong" + 0.007*"hong_kong" + 0.007*"kong" + 0.007*"trade" + 0.006*"state" 05:28:43 INFO:topic #7 (0.050): 0.009*"sale" + 0.009*"analyst" + 0.007*"share" + 0.007*"group" + 0.006*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive" 05:28:43 INFO:topic #19 (0.050): 0.029*"hong" + 0.029*"kong" + 0.029*"hong_kong" + 0.019*"Hong Kong" + 0.019*"china" + 0.008*"chinese" + 0.008*"tung" + 0.007*"Hong Kong's" + 0.006*"beijing" + 0.006*"airbus" 05:28:43 INFO:topic diff=0.263525, rho=0.277350 05:28:43 INFO:PROGRESS: pass 12, at document #2500/2500 05:28:43 DEBUG:performing inference on a chunk of 2500 documents 05:28:44 DEBUG:2500/2500 documents converged within 50 iterations 05:28:44 DEBUG:updating topics 05:28:45 INFO:topic #11 (0.050): 0.042*"gm" + 0.028*"plant" + 0.016*"uaw" + 0.016*"strike" + 0.015*"worker" + 
0.011*"automaker" + 0.010*"local" + 0.010*"truck" + 0.009*"part" + 0.008*"ford" 05:28:45 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"chinese" + 0.009*"court" + 0.008*"party" + 0.008*"government" + 0.007*"official" + 0.007*"communist" 05:28:45 INFO:topic #4 (0.050): 0.021*"china" + 0.012*"official" + 0.010*"beijing" + 0.009*"chinese" + 0.009*"wang" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.006*"people" + 0.006*"dissident" 05:28:45 INFO:topic #16 (0.050): 0.020*"franc" + 0.018*"french" + 0.017*"air" + 0.017*"france" + 0.014*"thomson" + 0.012*"billion" + 0.010*"group" + 0.008*"billion_franc" + 0.008*"telecom" + 0.007*"plan" 05:28:45 INFO:topic #18 (0.050): 0.014*"analyst" + 0.011*"computer" + 0.010*"quarter" + 0.010*"internet" + 0.008*"share" + 0.008*"business" + 0.008*"service" + 0.008*"stock" + 0.007*"industry" + 0.007*"software" 05:28:45 INFO:topic diff=0.226015, rho=0.267261 05:28:45 INFO:PROGRESS: pass 13, at document #2500/2500 05:28:45 DEBUG:performing inference on a chunk of 2500 documents 05:28:46 DEBUG:2500/2500 documents converged within 50 iterations 05:28:46 DEBUG:updating topics 05:28:46 INFO:topic #4 (0.050): 0.021*"china" + 0.012*"official" + 0.010*"beijing" + 0.009*"wang" + 0.009*"chinese" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.006*"people" + 0.006*"dissident" 05:28:46 INFO:topic #3 (0.050): 0.027*"bt" + 0.017*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"billion" + 0.011*"analyst" + 0.011*"deal" + 0.010*"british" + 0.010*"share" + 0.010*"group" 05:28:46 INFO:topic #12 (0.050): 0.015*"czech" + 0.009*"crown" + 0.008*"bank" + 0.007*"klaus" + 0.007*"government" + 0.006*"prague" + 0.006*"billion" + 0.005*"foreign" + 0.005*"party" + 0.005*"price" 05:28:46 INFO:topic #19 (0.050): 0.030*"hong" + 0.030*"kong" + 0.030*"hong_kong" + 0.020*"Hong Kong" + 0.020*"china" + 0.008*"chinese" + 0.008*"tung" + 0.007*"Hong Kong's" + 0.007*"beijing" + 0.006*"airbus" 05:28:46 INFO:topic 
#5 (0.050): 0.032*"china" + 0.016*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.010*"tonne" + 0.007*"hong" + 0.007*"kong" + 0.007*"hong_kong" + 0.007*"trade" + 0.006*"state" 05:28:46 INFO:topic diff=0.194260, rho=0.258199 05:28:46 INFO:PROGRESS: pass 14, at document #2500/2500 05:28:46 DEBUG:performing inference on a chunk of 2500 documents 05:28:48 DEBUG:2500/2500 documents converged within 50 iterations 05:28:48 DEBUG:updating topics 05:28:48 INFO:topic #5 (0.050): 0.033*"china" + 0.016*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.010*"tonne" + 0.008*"hong" + 0.007*"kong" + 0.007*"hong_kong" + 0.007*"trade" + 0.007*"state" 05:28:48 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"chinese" + 0.009*"court" + 0.008*"party" + 0.008*"government" + 0.007*"official" + 0.007*"communist" 05:28:48 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.006*"business" 05:28:48 INFO:topic #0 (0.050): 0.011*"bank" + 0.009*"internet" + 0.009*"fcc" + 0.008*"service" + 0.008*"phone" + 0.007*"rule" + 0.006*"local" + 0.006*"tv" + 0.006*"court" + 0.006*"law" 05:28:48 INFO:topic #16 (0.050): 0.020*"franc" + 0.019*"french" + 0.018*"air" + 0.017*"france" + 0.014*"thomson" + 0.013*"billion" + 0.010*"group" + 0.008*"billion_franc" + 0.008*"telecom" + 0.007*"plan" 05:28:48 INFO:topic diff=0.167433, rho=0.250000 05:28:48 INFO:PROGRESS: pass 15, at document #2500/2500 05:28:48 DEBUG:performing inference on a chunk of 2500 documents 05:28:49 DEBUG:2500/2500 documents converged within 50 iterations 05:28:49 DEBUG:updating topics 05:28:49 INFO:topic #3 (0.050): 0.027*"bt" + 0.017*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"analyst" + 0.011*"billion" + 0.011*"deal" + 0.010*"british" + 0.010*"share" + 0.010*"group" 05:28:49 INFO:topic #10 (0.050): 0.001*"billion" + 0.000*"bank" + 0.000*"loan" + 0.000*"tonne" + 
0.000*"yen" + 0.000*"price" + 0.000*"exporter" + 0.000*"real_estate" + 0.000*"analyst" + 0.000*"real" 05:28:49 INFO:topic #5 (0.050): 0.033*"china" + 0.017*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.010*"tonne" + 0.008*"hong" + 0.008*"kong" + 0.007*"hong_kong" + 0.007*"trade" + 0.007*"state" 05:28:49 INFO:topic #4 (0.050): 0.021*"china" + 0.012*"official" + 0.011*"beijing" + 0.009*"wang" + 0.009*"chinese" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.006*"people" + 0.006*"dissident" 05:28:49 INFO:topic #0 (0.050): 0.010*"bank" + 0.009*"internet" + 0.009*"fcc" + 0.008*"service" + 0.008*"phone" + 0.007*"rule" + 0.007*"local" + 0.007*"tv" + 0.006*"court" + 0.006*"law" 05:28:49 INFO:topic diff=0.144777, rho=0.242536 05:28:49 INFO:PROGRESS: pass 16, at document #2500/2500 05:28:49 DEBUG:performing inference on a chunk of 2500 documents 05:28:51 DEBUG:2500/2500 documents converged within 50 iterations 05:28:51 DEBUG:updating topics 05:28:51 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.006*"business" 05:28:51 INFO:topic #9 (0.050): 0.016*"drug" + 0.013*"colombia" + 0.006*"government" + 0.006*"united" + 0.005*"colombian" + 0.005*"guerrilla" + 0.005*"force" + 0.004*"oil" + 0.004*"country" + 0.004*"police" 05:28:51 INFO:topic #7 (0.050): 0.009*"sale" + 0.008*"analyst" + 0.007*"share" + 0.007*"group" + 0.007*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive" 05:28:51 INFO:topic #4 (0.050): 0.021*"china" + 0.012*"official" + 0.011*"beijing" + 0.009*"wang" + 0.009*"chinese" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.006*"people" + 0.006*"dissident" 05:28:51 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"chinese" + 0.009*"court" + 0.008*"party" + 0.008*"government" + 0.007*"official" + 0.007*"communist" 05:28:51 INFO:topic 
diff=0.125646, rho=0.235702 05:28:51 INFO:PROGRESS: pass 17, at document #2500/2500 05:28:51 DEBUG:performing inference on a chunk of 2500 documents 05:28:52 DEBUG:2500/2500 documents converged within 50 iterations 05:28:52 DEBUG:updating topics 05:28:52 INFO:topic #6 (0.050): 0.019*"share" + 0.016*"analyst" + 0.013*"shanghai" + 0.012*"bank" + 0.009*"stock" + 0.008*"china" + 0.007*"b" + 0.007*"sale" + 0.007*"exchange" + 0.006*"base" 05:28:52 INFO:topic #14 (0.050): 0.011*"pound" + 0.011*"profit" + 0.010*"share" + 0.009*"analyst" + 0.009*"group" + 0.008*"billion" + 0.007*"bank" + 0.007*"business" + 0.006*"million_pound" + 0.005*"british" 05:28:52 INFO:topic #17 (0.050): 0.036*"apple" + 0.026*"computer" + 0.021*"software" + 0.021*"quarter" + 0.014*"analyst" + 0.013*"microsoft" + 0.010*"sale" + 0.010*"share" + 0.008*"pc" + 0.008*"technology" 05:28:52 INFO:topic #7 (0.050): 0.009*"sale" + 0.008*"analyst" + 0.007*"share" + 0.007*"group" + 0.007*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive" 05:28:52 INFO:topic #8 (0.050): 0.028*"gold" + 0.026*"bre" + 0.026*"bre_x" + 0.026*"x" + 0.019*"Bre-X" + 0.016*"barrick" + 0.012*"analyst" + 0.011*"busang" + 0.010*"indonesian" + 0.009*"government" 05:28:52 INFO:topic diff=0.109484, rho=0.229416 05:28:52 INFO:PROGRESS: pass 18, at document #2500/2500 05:28:52 DEBUG:performing inference on a chunk of 2500 documents 05:28:54 DEBUG:2500/2500 documents converged within 50 iterations 05:28:54 DEBUG:updating topics 05:28:54 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.006*"business" 05:28:54 INFO:topic #15 (0.050): 0.012*"bank" + 0.009*"stock" + 0.008*"billion" + 0.008*"analyst" + 0.008*"oil" + 0.007*"canada" + 0.007*"share" + 0.007*"toronto" + 0.007*"russia" + 0.006*"tonne" 05:28:54 INFO:topic #0 (0.050): 0.010*"bank" + 0.009*"internet" + 0.009*"fcc" + 
0.009*"service" + 0.008*"phone" + 0.007*"rule" + 0.007*"local" + 0.007*"tv" + 0.007*"court" + 0.006*"law" 05:28:54 INFO:topic #7 (0.050): 0.009*"sale" + 0.008*"analyst" + 0.007*"share" + 0.007*"group" + 0.007*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive" 05:28:54 INFO:topic #18 (0.050): 0.014*"analyst" + 0.011*"computer" + 0.010*"quarter" + 0.010*"internet" + 0.008*"share" + 0.008*"business" + 0.008*"stock" + 0.008*"service" + 0.007*"industry" + 0.007*"software" 05:28:54 INFO:topic diff=0.095805, rho=0.223607 05:28:54 INFO:PROGRESS: pass 19, at document #2500/2500 05:28:54 DEBUG:performing inference on a chunk of 2500 documents 05:28:55 DEBUG:2500/2500 documents converged within 50 iterations 05:28:55 DEBUG:updating topics 05:28:55 INFO:topic #7 (0.050): 0.009*"sale" + 0.008*"analyst" + 0.007*"share" + 0.007*"group" + 0.007*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive" 05:28:55 INFO:topic #4 (0.050): 0.022*"china" + 0.012*"official" + 0.011*"beijing" + 0.010*"wang" + 0.009*"chinese" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.007*"people" + 0.006*"dissident" 05:28:55 INFO:topic #19 (0.050): 0.032*"hong" + 0.031*"kong" + 0.031*"hong_kong" + 0.021*"Hong Kong" + 0.021*"china" + 0.009*"chinese" + 0.008*"tung" + 0.008*"Hong Kong's" + 0.007*"beijing" + 0.007*"airbus" 05:28:55 INFO:topic #3 (0.050): 0.027*"bt" + 0.018*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"deal" + 0.011*"analyst" + 0.011*"billion" + 0.011*"british" + 0.010*"share" + 0.010*"group" 05:28:55 INFO:topic #8 (0.050): 0.028*"gold" + 0.026*"bre" + 0.026*"bre_x" + 0.026*"x" + 0.019*"Bre-X" + 0.016*"barrick" + 0.012*"analyst" + 0.011*"busang" + 0.010*"indonesian" + 0.009*"government" 05:28:55 INFO:topic diff=0.084200, rho=0.218218 05:28:55 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=3914, num_topics=20, num_authors=50, decay=0.5, chunksize=2500) 05:28:55 
INFO:CorpusAccumulator accumulated stats from 1000 documents 05:28:55 INFO:CorpusAccumulator accumulated stats from 2000 documents
-1.50354141347
We run our first training and observe that the passes and iterations parameters are set high enough, so that the model converges.
07:47:24 INFO:PROGRESS: pass 15, at document #2500/2500
07:47:24 DEBUG:performing inference on a chunk of 2500 documents
07:47:27 DEBUG:2500/2500 documents converged within 50 iterations
This tells us that the model indeed converges well.
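The same check can be done programmatically instead of by reading the log. The following is a minimal sketch, not part of the original notebook: it assumes the log has been captured as text, and the helper name `parse_convergence` and the two regular expressions are illustrative choices. The sample lines are copied from the 20-topic training log above.

```python
import re

# Sample lines copied from the 20-topic training log shown above.
log_text = """\
05:28:45 INFO:topic diff=0.226015, rho=0.267261
05:28:46 DEBUG:2500/2500 documents converged within 50 iterations
05:28:46 INFO:topic diff=0.194260, rho=0.258199
05:28:48 DEBUG:2500/2500 documents converged within 50 iterations
05:28:48 INFO:topic diff=0.167433, rho=0.250000
"""

# Patterns for the two kinds of log message we care about (illustrative).
DIFF_RE = re.compile(r"topic diff=([0-9.]+), rho=([0-9.]+)")
CONV_RE = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")


def parse_convergence(text):
    """Return the per-pass topic diffs and (converged, total) document counts."""
    diffs = [float(m.group(1)) for m in DIFF_RE.finditer(text)]
    converged = [(int(m.group(1)), int(m.group(2))) for m in CONV_RE.finditer(text)]
    return diffs, converged


diffs, converged = parse_convergence(log_text)

# All documents converged within the iteration budget in every pass.
all_converged = all(done == total for done, total in converged)
# The topic diff shrinks from one pass to the next.
diff_decreasing = all(a > b for a, b in zip(diffs, diffs[1:]))

print(diffs, all_converged, diff_decreasing)
```

If `all_converged` is false in the later passes, raising `iterations` is the usual remedy; if the topic diff is still large at the end, more `passes` may help.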
accuracy_scores_20topic = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_50_20, atmodel_standard, k=i)
    accuracy_scores_20topic[k] = accuracy

plot_accuracy(scores1=accuracy_scores_20topic, label1="20 topics")
Precision@k: top_n=1
Prediction accuracy: 0.3548
Precision@k: top_n=2
Prediction accuracy: 0.5228
Precision@k: top_n=3
Prediction accuracy: 0.6456
Precision@k: top_n=4
Prediction accuracy: 0.7208
Precision@k: top_n=5
Prediction accuracy: 0.7748
Precision@k: top_n=6
Prediction accuracy: 0.8188
Precision@k: top_n=8
Prediction accuracy: 0.8576
Precision@k: top_n=10
Prediction accuracy: 0.8936
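To make the top-k numbers above concrete, here is a minimal, self-contained sketch of how such a score can be computed from topic distributions using the Hellinger distance. It is not the notebook's `prediction_accuracy` function, whose internals may differ; the names `hellinger`, `top_k_authors`, `top_k_accuracy` and the toy distributions are all hypothetical and chosen only for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def top_k_authors(unknown_dist, author_dists, k):
    """Return the k known authors whose topic distributions are closest."""
    ranked = sorted(author_dists, key=lambda name: hellinger(unknown_dist, author_dists[name]))
    return ranked[:k]


def top_k_accuracy(test_items, author_dists, k):
    """Fraction of test items whose true author is among the k nearest authors."""
    hits = sum(
        true_author in top_k_authors(dist, author_dists, k)
        for true_author, dist in test_items
    )
    return hits / len(test_items)


# Toy topic distributions for three known authors (made-up values).
author_dists = {
    "alice": [0.8, 0.1, 0.1],
    "bob": [0.1, 0.8, 0.1],
    "carol": [0.1, 0.1, 0.8],
}

# (true author, inferred topic distribution) pairs for held-out documents.
test_items = [
    ("alice", [0.7, 0.2, 0.1]),
    ("bob", [0.5, 0.4, 0.1]),  # closer to alice than to bob: a miss at k=1
    ("carol", [0.2, 0.1, 0.7]),
]

for k in (1, 2):
    print(k, top_k_accuracy(test_items, author_dists, k))
```

In this toy setup the second document is attributed to the wrong author at k=1 but is recovered at k=2, which mirrors how the scores above rise as k grows.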
This is rather poor accuracy: with 20 topics, the correct author is ranked first in only about 35% of cases. We therefore increase the number of topics to 100.
atmodel_100topics = train_model(train_corpus_50_20, train_author2doc, train_dictionary_50_20, num_topics=100, eval_every=0, iterations=50, passes=10)
05:31:51 INFO:Vocabulary consists of 3914 words. 05:31:51 INFO:using symmetric alpha at 0.01 05:31:51 INFO:using symmetric eta at 0.01 05:31:53 INFO:running online author-topic training, 100 topics, 50 authors, 10 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000 05:31:53 INFO:PROGRESS: pass 0, at document #2500/2500 05:31:53 DEBUG:performing inference on a chunk of 2500 documents 05:32:05 DEBUG:5/2500 documents converged within 50 iterations 05:32:05 DEBUG:updating topics 05:32:05 INFO:topic #18 (0.010): 0.007*"analyst" + 0.007*"business" + 0.005*"billion" + 0.005*"stock" + 0.005*"boeing" + 0.004*"quarter" + 0.004*"industry" + 0.004*"share" + 0.004*"shareholder" + 0.004*"sale" 05:32:05 INFO:topic #71 (0.010): 0.015*"fcc" + 0.015*"phone" + 0.011*"local" + 0.011*"carrier" + 0.010*"service" + 0.009*"rule" + 0.008*"court" + 0.008*"distance" + 0.008*"long" + 0.007*"tv" 05:32:05 INFO:topic #79 (0.010): 0.011*"china" + 0.010*"beijing" + 0.007*"official" + 0.006*"chinese" + 0.006*"lama" + 0.006*"tibet" + 0.006*"share" + 0.005*"analyst" + 0.005*"region" + 0.005*"billion" 05:32:05 INFO:topic #93 (0.010): 0.015*"ibm" + 0.011*"analyst" + 0.011*"computer" + 0.010*"pc" + 0.009*"sale" + 0.009*"quarter" + 0.008*"industry" + 0.008*"price" + 0.007*"consumer" + 0.007*"service" 05:32:05 INFO:topic #99 (0.010): 0.008*"world" + 0.008*"czech" + 0.007*"analyst" + 0.006*"stock" + 0.006*"win" + 0.005*"billion" + 0.005*"team" + 0.005*"game" + 0.005*"bank" + 0.005*"second" 05:32:05 INFO:topic diff=25.070898, rho=1.000000 05:32:05 INFO:PROGRESS: pass 1, at document #2500/2500 05:32:05 DEBUG:performing inference on a chunk of 2500 documents 05:32:12 DEBUG:2492/2500 documents converged within 50 iterations 05:32:12 DEBUG:updating topics 05:32:12 INFO:topic #70 (0.010): 0.019*"shanghai" + 0.018*"share" + 0.017*"china" + 0.011*"stock" + 0.011*"beijing" + 
0.010*"b" + 0.010*"foreign" + 0.010*"exchange" + 0.009*"analyst" + 0.008*"investor" 05:32:12 INFO:topic #2 (0.010): 0.020*"mci" + 0.012*"long" + 0.012*"service" + 0.010*"distance" + 0.010*"analyst" + 0.010*"sprint" + 0.010*"billion" + 0.010*"corp" + 0.008*"local" + 0.008*"deal" 05:32:12 INFO:topic #57 (0.010): 0.024*"china" + 0.013*"beijing" + 0.011*"chinese" + 0.010*"wang" + 0.010*"hong_kong" + 0.009*"hong" + 0.008*"kong" + 0.008*"official" + 0.007*"Hong Kong" + 0.006*"people" 05:32:12 INFO:topic #45 (0.010): 0.021*"time" + 0.016*"executive" + 0.013*"cable" + 0.011*"rise" + 0.011*"sale" + 0.010*"billion" + 0.009*"quarter" + 0.008*"share" + 0.008*"group" + 0.007*"analyst" 05:32:12 INFO:topic #18 (0.010): 0.007*"analyst" + 0.006*"business" + 0.005*"billion" + 0.004*"stock" + 0.004*"boeing" + 0.004*"quarter" + 0.004*"industry" + 0.004*"share" + 0.004*"shareholder" + 0.003*"sale" 05:32:12 INFO:topic diff=7.998665, rho=0.577350 05:32:12 INFO:PROGRESS: pass 2, at document #2500/2500 05:32:12 DEBUG:performing inference on a chunk of 2500 documents 05:32:19 DEBUG:2500/2500 documents converged within 50 iterations 05:32:19 DEBUG:updating topics 05:32:19 INFO:topic #70 (0.010): 0.021*"share" + 0.021*"shanghai" + 0.019*"china" + 0.012*"b" + 0.011*"foreign" + 0.011*"stock" + 0.011*"bank" + 0.011*"analyst" + 0.010*"beijing" + 0.010*"exchange" 05:32:19 INFO:topic #71 (0.010): 0.020*"fcc" + 0.015*"phone" + 0.013*"carrier" + 0.013*"tv" + 0.012*"local" + 0.010*"service" + 0.010*"rule" + 0.010*"long" + 0.009*"distance" + 0.008*"long_distance" 05:32:19 INFO:topic #22 (0.010): 0.033*"bank" + 0.010*"rate" + 0.010*"cut" + 0.009*"analyst" + 0.008*"day" + 0.008*"merger" + 0.007*"profit" + 0.007*"australia" + 0.007*"financial" + 0.006*"ltd" 05:32:19 INFO:topic #18 (0.010): 0.006*"analyst" + 0.005*"business" + 0.004*"billion" + 0.004*"stock" + 0.004*"boeing" + 0.003*"quarter" + 0.003*"industry" + 0.003*"share" + 0.003*"shareholder" + 0.003*"sale" 05:32:19 INFO:topic #5 (0.010): 
0.018*"china" + 0.009*"beijing" + 0.008*"tonne" + 0.007*"chinese" + 0.006*"official" + 0.005*"trade" + 0.005*"price" + 0.005*"chen" + 0.005*"trader" + 0.004*"million_tonne" 05:32:19 INFO:topic diff=7.090922, rho=0.500000 05:32:19 INFO:PROGRESS: pass 3, at document #2500/2500 05:32:19 DEBUG:performing inference on a chunk of 2500 documents 05:32:25 DEBUG:2500/2500 documents converged within 50 iterations 05:32:25 DEBUG:updating topics 05:32:26 INFO:topic #9 (0.010): 0.004*"analyst" + 0.003*"government" + 0.002*"share" + 0.002*"china" + 0.002*"cost" + 0.002*"sale" + 0.002*"right" + 0.002*"stock" + 0.002*"big" + 0.002*"end" 05:32:26 INFO:topic #20 (0.010): 0.016*"gold" + 0.016*"bre" + 0.015*"x" + 0.015*"bre_x" + 0.010*"barrick" + 0.009*"Bre-X" + 0.008*"gm" + 0.008*"price" + 0.008*"analyst" + 0.008*"plant" 05:32:26 INFO:topic #4 (0.010): 0.015*"franc" + 0.015*"thomson" + 0.014*"french" + 0.009*"group" + 0.009*"share" + 0.008*"france" + 0.008*"government" + 0.008*"plan" + 0.008*"lagardere" + 0.007*"billion" 05:32:26 INFO:topic #87 (0.010): 0.017*"analyst" + 0.011*"sale" + 0.010*"share" + 0.009*"business" + 0.008*"quarter" + 0.008*"price" + 0.007*"add" + 0.007*"chemical" + 0.006*"stock" + 0.006*"earning" 05:32:26 INFO:topic #62 (0.010): 0.022*"profit" + 0.014*"pound" + 0.011*"sale" + 0.011*"rise" + 0.010*"analyst" + 0.010*"stg" + 0.010*"group" + 0.009*"business" + 0.009*"half" + 0.009*"million_stg" 05:32:26 INFO:topic diff=6.178695, rho=0.447214 05:32:26 INFO:PROGRESS: pass 4, at document #2500/2500 05:32:26 DEBUG:performing inference on a chunk of 2500 documents 05:32:31 DEBUG:2500/2500 documents converged within 50 iterations 05:32:31 DEBUG:updating topics 05:32:32 INFO:topic #47 (0.010): 0.012*"gold" + 0.008*"oil" + 0.008*"share" + 0.007*"stock" + 0.006*"analyst" + 0.006*"government" + 0.006*"price" + 0.006*"colombia" + 0.005*"rise" + 0.005*"issue" 05:32:32 INFO:topic #86 (0.010): 0.014*"cargo" + 0.010*"service" + 0.010*"kong" + 0.009*"hong" + 0.009*"air" + 
0.009*"airline" + 0.008*"hong_kong" + 0.007*"Hong Kong" + 0.007*"route" + 0.006*"rate" 05:32:32 INFO:topic #25 (0.010): 0.010*"boeing" + 0.009*"share" + 0.009*"analyst" + 0.007*"billion" + 0.006*"service" + 0.006*"mci" + 0.006*"business" + 0.005*"stock" + 0.005*"jet" + 0.005*"growth" 05:32:32 INFO:topic #74 (0.010): 0.043*"china" + 0.020*"chinese" + 0.014*"official" + 0.014*"beijing" + 0.010*"trade" + 0.009*"state" + 0.007*"states" + 0.006*"united" + 0.006*"united_states" + 0.006*"import" 05:32:32 INFO:topic #53 (0.010): 0.032*"fund" + 0.012*"investment" + 0.011*"hong_kong" + 0.011*"hong" + 0.010*"stock" + 0.010*"management" + 0.010*"week" + 0.009*"manager" + 0.009*"billion" + 0.009*"kong" 05:32:32 INFO:topic diff=5.327576, rho=0.408248 05:32:32 INFO:PROGRESS: pass 5, at document #2500/2500 05:32:32 DEBUG:performing inference on a chunk of 2500 documents 05:32:36 DEBUG:2500/2500 documents converged within 50 iterations 05:32:36 DEBUG:updating topics 05:32:37 INFO:topic #60 (0.010): 0.002*"financial" + 0.002*"official" + 0.002*"stock" + 0.002*"policy" + 0.002*"group" + 0.002*"share" + 0.002*"china" + 0.001*"chinese" + 0.001*"beijing" + 0.001*"bank" 05:32:37 INFO:topic #77 (0.010): 0.006*"computer" + 0.005*"internet" + 0.005*"quarter" + 0.005*"analyst" + 0.004*"business" + 0.004*"share" + 0.004*"service" + 0.003*"profit" + 0.003*"industry" + 0.003*"system" 05:32:37 INFO:topic #43 (0.010): 0.003*"bre_x" + 0.003*"bre" + 0.002*"analyst" + 0.002*"gold" + 0.002*"barrick" + 0.002*"government" + 0.002*"Bre-X" + 0.002*"x" + 0.002*"share" + 0.002*"stock" 05:32:37 INFO:topic #10 (0.010): 0.002*"billion" + 0.001*"investment" + 0.001*"tonne" + 0.001*"quarter" + 0.001*"venture" + 0.001*"industry" + 0.001*"price" + 0.001*"cocoa" + 0.001*"coast" + 0.001*"month" 05:32:37 INFO:topic #99 (0.010): 0.005*"world" + 0.005*"czech" + 0.004*"analyst" + 0.004*"stock" + 0.004*"win" + 0.003*"billion" + 0.003*"team" + 0.003*"game" + 0.003*"bank" + 0.003*"second" 05:32:37 INFO:topic 
diff=4.560862, rho=0.377964 05:32:37 INFO:PROGRESS: pass 6, at document #2500/2500 05:32:37 DEBUG:performing inference on a chunk of 2500 documents 05:32:41 DEBUG:2500/2500 documents converged within 50 iterations 05:32:41 DEBUG:updating topics 05:32:41 INFO:topic #38 (0.010): 0.015*"analyst" + 0.014*"australian" + 0.013*"ltd" + 0.012*"share" + 0.011*"australia" + 0.011*"profit" + 0.010*"sydney" + 0.009*"group" + 0.009*"news" + 0.008*"corp" 05:32:41 INFO:topic #46 (0.010): 0.004*"hong" + 0.004*"kong" + 0.003*"china" + 0.002*"Hong Kong" + 0.002*"official" + 0.002*"hong_kong" + 0.002*"chinese" + 0.002*"united" + 0.002*"singapore" + 0.001*"month" 05:32:41 INFO:topic #97 (0.010): 0.021*"internet" + 0.017*"bank" + 0.008*"law" + 0.008*"court" + 0.008*"congress" + 0.007*"service" + 0.007*"credit" + 0.007*"allow" + 0.007*"bill" + 0.006*"policy" 05:32:41 INFO:topic #75 (0.010): 0.028*"bank" + 0.016*"japan" + 0.015*"billion" + 0.014*"yen" + 0.014*"financial" + 0.012*"loan" + 0.011*"japanese" + 0.010*"problem" + 0.010*"analyst" + 0.009*"firm" 05:32:41 INFO:topic #11 (0.010): 0.063*"gm" + 0.032*"plant" + 0.024*"strike" + 0.021*"automaker" + 0.021*"worker" + 0.017*"uaw" + 0.013*"truck" + 0.013*"local" + 0.013*"union" + 0.012*"chrysler" 05:32:41 INFO:topic diff=3.882969, rho=0.353553 05:32:41 INFO:PROGRESS: pass 7, at document #2500/2500 05:32:41 DEBUG:performing inference on a chunk of 2500 documents 05:32:46 DEBUG:2500/2500 documents converged within 50 iterations 05:32:46 DEBUG:updating topics 05:32:46 INFO:topic #82 (0.010): 0.003*"quarter" + 0.003*"executive" + 0.003*"internet" + 0.003*"high" + 0.003*"share" + 0.002*"loss" + 0.002*"technology" + 0.002*"high_tech" + 0.002*"stock" + 0.002*"software" 05:32:46 INFO:topic #38 (0.010): 0.015*"analyst" + 0.014*"australian" + 0.013*"ltd" + 0.012*"share" + 0.011*"australia" + 0.011*"profit" + 0.010*"sydney" + 0.009*"group" + 0.009*"news" + 0.008*"corp" 05:32:46 INFO:topic #9 (0.010): 0.001*"analyst" + 0.001*"government" + 
0.001*"share" + 0.001*"china" + 0.001*"cost" + 0.001*"sale" + 0.001*"right" + 0.001*"stock" + 0.001*"big" + 0.001*"end" 05:32:46 INFO:topic #13 (0.010): 0.001*"china" + 0.001*"share" + 0.001*"official" + 0.001*"analyst" + 0.001*"group" + 0.001*"sale" + 0.001*"beijing" + 0.001*"party" + 0.001*"month" + 0.001*"billion" 05:32:46 INFO:topic #76 (0.010): 0.024*"cocoa" + 0.019*"exporter" + 0.019*"tonne" + 0.012*"ivory" + 0.012*"coast" + 0.012*"ivory_coast" + 0.011*"crop" + 0.011*"price" + 0.010*"buyer" + 0.009*"export" 05:32:46 INFO:topic diff=3.291750, rho=0.333333 05:32:46 INFO:PROGRESS: pass 8, at document #2500/2500 05:32:46 DEBUG:performing inference on a chunk of 2500 documents 05:32:50 DEBUG:2500/2500 documents converged within 50 iterations 05:32:50 DEBUG:updating topics 05:32:50 INFO:topic #18 (0.010): 0.001*"analyst" + 0.001*"business" + 0.001*"billion" + 0.001*"stock" + 0.001*"boeing" + 0.001*"quarter" + 0.001*"industry" + 0.001*"share" + 0.001*"shareholder" + 0.001*"sale" 05:32:50 INFO:topic #61 (0.010): 0.014*"analyst" + 0.014*"microsoft" + 0.010*"share" + 0.009*"software" + 0.009*"quarter" + 0.009*"boeing" + 0.009*"office" + 0.008*"computer" + 0.008*"worker" + 0.008*"fiscal" 05:32:50 INFO:topic #2 (0.010): 0.019*"mci" + 0.013*"analyst" + 0.011*"long" + 0.011*"share" + 0.011*"service" + 0.010*"distance" + 0.010*"long_distance" + 0.010*"billion" + 0.010*"corp" + 0.008*"local" 05:32:50 INFO:topic #98 (0.010): 0.031*"tonne" + 0.030*"china" + 0.019*"trader" + 0.018*"chinese" + 0.018*"price" + 0.016*"hong_kong" + 0.016*"hong" + 0.016*"kong" + 0.013*"source" + 0.013*"import" 05:32:50 INFO:topic #71 (0.010): 0.019*"fcc" + 0.014*"tv" + 0.014*"phone" + 0.013*"carrier" + 0.011*"local" + 0.010*"service" + 0.010*"long" + 0.009*"rule" + 0.009*"distance" + 0.009*"long_distance" 05:32:50 INFO:topic diff=2.781235, rho=0.316228 05:32:50 INFO:PROGRESS: pass 9, at document #2500/2500 05:32:50 DEBUG:performing inference on a chunk of 2500 documents 05:32:54 DEBUG:2500/2500 
documents converged within 50 iterations 05:32:54 DEBUG:updating topics 05:32:55 INFO:topic #99 (0.010): 0.002*"world" + 0.002*"czech" + 0.002*"analyst" + 0.002*"stock" + 0.002*"win" + 0.001*"billion" + 0.001*"team" + 0.001*"game" + 0.001*"bank" + 0.001*"second" 05:32:55 INFO:topic #26 (0.010): 0.003*"business" + 0.002*"analyst" + 0.002*"gm" + 0.001*"share" + 0.001*"internet" + 0.001*"billion" + 0.001*"stock" + 0.001*"access" + 0.001*"chemical" + 0.001*"service" 05:32:55 INFO:topic #80 (0.010): 0.012*"analyst" + 0.012*"computer" + 0.009*"stock" + 0.009*"internet" + 0.008*"quarter" + 0.008*"technology" + 0.008*"service" + 0.007*"software" + 0.007*"share" + 0.007*"business" 05:32:55 INFO:topic #67 (0.010): 0.045*"gm" + 0.033*"plant" + 0.021*"uaw" + 0.017*"strike" + 0.017*"worker" + 0.012*"part" + 0.011*"local" + 0.010*"truck" + 0.010*"automaker" + 0.010*"contract" 05:32:55 INFO:topic #97 (0.010): 0.021*"internet" + 0.017*"bank" + 0.008*"law" + 0.008*"court" + 0.008*"congress" + 0.007*"service" + 0.007*"credit" + 0.007*"allow" + 0.007*"bill" + 0.006*"policy" 05:32:55 INFO:topic diff=2.344407, rho=0.301511 05:32:55 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=3914, num_topics=100, num_authors=50, decay=0.5, chunksize=2500) 05:32:55 INFO:CorpusAccumulator accumulated stats from 1000 documents 05:32:55 INFO:CorpusAccumulator accumulated stats from 2000 documents
-1.89056657258
accuracy_scores_100topic = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_50_20, atmodel_100topics, k=i)
    accuracy_scores_100topic[k] = accuracy

plot_accuracy(scores1=accuracy_scores_20topic, label1="20 topics", scores2=accuracy_scores_100topic, label2="100 topics")
Precision@k: top_n=1
Prediction accuracy: 0.5808
Precision@k: top_n=2
Prediction accuracy: 0.7472
Precision@k: top_n=3
Prediction accuracy: 0.8252
Precision@k: top_n=4
Prediction accuracy: 0.8732
Precision@k: top_n=5
Prediction accuracy: 0.8956
Precision@k: top_n=6
Prediction accuracy: 0.9072
Precision@k: top_n=8
Prediction accuracy: 0.9276
Precision@k: top_n=10
Prediction accuracy: 0.9412
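The difference between the two models can also be read off numerically. The short sketch below simply copies the accuracies printed for the 20-topic and 100-topic models into two dictionaries and computes the gain at each k; the variable names `acc_20`, `acc_100` and `gains` are illustrative and do not appear in the notebook.

```python
# Accuracies copied from the printed results of the two models (k -> accuracy).
acc_20 = {1: 0.3548, 2: 0.5228, 3: 0.6456, 4: 0.7208,
          5: 0.7748, 6: 0.8188, 8: 0.8576, 10: 0.8936}
acc_100 = {1: 0.5808, 2: 0.7472, 3: 0.8252, 4: 0.8732,
           5: 0.8956, 6: 0.9072, 8: 0.9276, 10: 0.9412}

# Absolute improvement of the 100-topic model at each k.
gains = {k: round(acc_100[k] - acc_20[k], 4) for k in acc_20}

for k, gain in gains.items():
    print(f"k={k:>2}: 20 topics={acc_20[k]:.4f}, 100 topics={acc_100[k]:.4f}, gain={gain:+.4f}")
```

The gain is largest at k=1 and shrinks as k grows, since both models approach the same ceiling when more candidate authors are allowed.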
The 100-topic model is much more accurate than the 20-topic model. We continue to increase the number of topics until the accuracy stops improving.
atmodel_150topics = train_model(train_corpus_50_20, train_author2doc, train_dictionary_50_20, num_topics=150, eval_every=0, iterations=50, passes=15)
05:36:37 INFO:Vocabulary consists of 3914 words. 05:36:37 INFO:using symmetric alpha at 0.006666666666666667 05:36:37 INFO:using symmetric eta at 0.006666666666666667 05:36:40 INFO:running online author-topic training, 150 topics, 50 authors, 15 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000 05:36:40 INFO:PROGRESS: pass 0, at document #2500/2500 05:36:40 DEBUG:performing inference on a chunk of 2500 documents 05:36:55 DEBUG:15/2500 documents converged within 50 iterations 05:36:55 DEBUG:updating topics 05:36:56 INFO:topic #51 (0.007): 0.015*"profit" + 0.012*"price" + 0.012*"group" + 0.012*"analyst" + 0.009*"share" + 0.009*"steel" + 0.008*"tell" + 0.008*"australian" + 0.007*"month" + 0.007*"forecast" 05:36:56 INFO:topic #86 (0.007): 0.011*"china" + 0.008*"kong" + 0.007*"hong" + 0.007*"cargo" + 0.006*"hong_kong" + 0.006*"service" + 0.006*"Hong Kong" + 0.005*"profit" + 0.005*"analyst" + 0.005*"month" 05:36:56 INFO:topic #125 (0.007): 0.009*"analyst" + 0.007*"share" + 0.007*"bank" + 0.005*"problem" + 0.004*"billion" + 0.004*"sale" + 0.004*"loan" + 0.004*"plant" + 0.004*"gm" + 0.004*"corp" 05:36:56 INFO:topic #4 (0.007): 0.020*"franc" + 0.018*"thomson" + 0.016*"french" + 0.011*"group" + 0.011*"share" + 0.010*"government" + 0.009*"plan" + 0.009*"france" + 0.009*"lagardere" + 0.009*"billion" 05:36:56 INFO:topic #114 (0.007): 0.006*"analyst" + 0.006*"sale" + 0.005*"chairman" + 0.005*"business" + 0.004*"social" + 0.004*"party" + 0.004*"month" + 0.004*"industry" + 0.003*"share" + 0.003*"government" 05:36:56 INFO:topic diff=43.566047, rho=1.000000 05:36:56 INFO:PROGRESS: pass 1, at document #2500/2500 05:36:56 DEBUG:performing inference on a chunk of 2500 documents 05:37:04 DEBUG:2493/2500 documents converged within 50 iterations 05:37:04 DEBUG:updating topics 05:37:04 INFO:topic #72 (0.007): 0.024*"gold" + 0.024*"bre_x" + 0.023*"x" + 
0.023*"bre" + 0.020*"Bre-X" + 0.013*"barrick" + 0.010*"government" + 0.010*"indonesian" + 0.010*"busang" + 0.010*"analyst" 05:37:04 INFO:topic #133 (0.007): 0.013*"group" + 0.012*"pound" + 0.011*"share" + 0.009*"billion" + 0.006*"bt" + 0.006*"business" + 0.006*"analyst" + 0.005*"british" + 0.005*"profit" + 0.005*"britain" 05:37:04 INFO:topic #19 (0.007): 0.007*"billion" + 0.006*"group" + 0.006*"airbus" + 0.005*"state" + 0.005*"profit" + 0.005*"industry" + 0.005*"tobacco" + 0.005*"tell" + 0.004*"price" + 0.004*"cost" 05:37:04 INFO:topic #90 (0.007): 0.029*"bank" + 0.017*"canadian" + 0.016*"billion" + 0.014*"canada" + 0.010*"toronto" + 0.009*"analyst" + 0.008*"stock" + 0.008*"fund" + 0.008*"share" + 0.007*"high" 05:37:04 INFO:topic #91 (0.007): 0.008*"analyst" + 0.007*"bre" + 0.005*"bre_x" + 0.005*"x" + 0.005*"gm" + 0.005*"billion" + 0.005*"Bre-X" + 0.005*"stock" + 0.004*"sale" + 0.004*"share" 05:37:04 INFO:topic diff=12.489199, rho=0.577350 05:37:04 INFO:PROGRESS: pass 2, at document #2500/2500 05:37:04 DEBUG:performing inference on a chunk of 2500 documents 05:37:12 DEBUG:2497/2500 documents converged within 50 iterations 05:37:12 DEBUG:updating topics 05:37:12 INFO:topic #58 (0.007): 0.011*"shanghai" + 0.010*"china" + 0.005*"bank" + 0.005*"chinese" + 0.004*"city" + 0.004*"stock" + 0.004*"chen" + 0.003*"beijing" + 0.003*"analyst" + 0.003*"modern" 05:37:12 INFO:topic #26 (0.007): 0.010*"business" + 0.005*"analyst" + 0.005*"share" + 0.004*"billion" + 0.004*"stock" + 0.004*"continue" + 0.003*"states" + 0.003*"chemical" + 0.003*"united" + 0.003*"internet" 05:37:12 INFO:topic #61 (0.007): 0.027*"boeing" + 0.014*"analyst" + 0.013*"billion" + 0.012*"microsoft" + 0.012*"jet" + 0.010*"airbus" + 0.009*"share" + 0.009*"order" + 0.008*"mcdonnell" + 0.007*"revenue" 05:37:12 INFO:topic #149 (0.007): 0.018*"china" + 0.010*"official" + 0.010*"chinese" + 0.008*"beijing" + 0.006*"trade" + 0.006*"world" + 0.005*"foreign" + 0.005*"united_states" + 0.005*"drug" + 0.005*"metre" 05:37:12 
INFO:topic #74 (0.007): 0.044*"china" + 0.022*"chinese" + 0.012*"official" + 0.012*"tonne" + 0.011*"beijing" + 0.008*"trade" + 0.008*"import" + 0.008*"trader" + 0.007*"price" + 0.007*"state" 05:37:12 INFO:topic diff=10.945011, rho=0.500000 05:37:12 INFO:PROGRESS: pass 3, at document #2500/2500 05:37:12 DEBUG:performing inference on a chunk of 2500 documents 05:37:19 DEBUG:2499/2500 documents converged within 50 iterations 05:37:19 DEBUG:updating topics 05:37:19 INFO:topic #125 (0.007): 0.004*"analyst" + 0.003*"share" + 0.003*"bank" + 0.002*"problem" + 0.002*"billion" + 0.002*"sale" + 0.002*"loan" + 0.002*"plant" + 0.002*"gm" + 0.002*"corp" 05:37:19 INFO:topic #95 (0.007): 0.038*"bank" + 0.018*"billion" + 0.016*"society" + 0.010*"analyst" + 0.009*"debt" + 0.009*"eurotunnel" + 0.008*"banking" + 0.008*"pound" + 0.008*"member" + 0.007*"convert" 05:37:19 INFO:topic #19 (0.007): 0.007*"billion" + 0.006*"state" + 0.005*"group" + 0.005*"airbus" + 0.005*"loss" + 0.005*"cost" + 0.005*"profit" + 0.005*"industry" + 0.005*"sale" + 0.004*"executive" 05:37:19 INFO:topic #115 (0.007): 0.003*"share" + 0.002*"stock" + 0.002*"billion" + 0.002*"analyst" + 0.002*"china" + 0.002*"industry" + 0.001*"month" + 0.001*"rise" + 0.001*"big" + 0.001*"deal" 05:37:20 INFO:topic #29 (0.007): 0.018*"czech" + 0.008*"klaus" + 0.007*"government" + 0.007*"crown" + 0.007*"party" + 0.007*"bank" + 0.007*"prague" + 0.005*"country" + 0.005*"foreign" + 0.005*"election" 05:37:20 INFO:topic diff=9.415271, rho=0.447214 05:37:20 INFO:PROGRESS: pass 4, at document #2500/2500 05:37:20 DEBUG:performing inference on a chunk of 2500 documents 05:37:26 DEBUG:2499/2500 documents converged within 50 iterations 05:37:26 DEBUG:updating topics 05:37:27 INFO:topic #31 (0.007): 0.010*"franc" + 0.009*"french" + 0.008*"china" + 0.008*"billion" + 0.006*"shanghai" + 0.006*"analyst" + 0.006*"share" + 0.005*"government" + 0.005*"plan" + 0.005*"exchange" 05:37:27 INFO:topic #76 (0.007): 0.007*"china" + 0.006*"hong_kong" + 
0.006*"price" + 0.005*"kong" + 0.005*"hong" + 0.004*"tonne" + 0.004*"world" + 0.004*"analyst" + 0.003*"chinese" + 0.003*"Hong Kong" 05:37:27 INFO:topic #36 (0.007): 0.024*"bid" + 0.021*"penny" + 0.020*"analyst" + 0.017*"share" + 0.015*"electric" + 0.013*"electricity" + 0.012*"offer" + 0.012*"price" + 0.011*"northern" + 0.010*"water" 05:37:27 INFO:topic #144 (0.007): 0.008*"computer" + 0.007*"software" + 0.006*"technology" + 0.006*"internet" + 0.005*"web" + 0.004*"site" + 0.004*"people" + 0.004*"quarter" + 0.004*"industry" + 0.004*"base" 05:37:27 INFO:topic #96 (0.007): 0.013*"tv" + 0.011*"industry" + 0.010*"group" + 0.010*"system" + 0.008*"television" + 0.008*"plan" + 0.008*"service" + 0.007*"rating" + 0.006*"american" + 0.006*"long" 05:37:27 INFO:topic diff=8.020445, rho=0.408248 05:37:27 INFO:PROGRESS: pass 5, at document #2500/2500 05:37:27 DEBUG:performing inference on a chunk of 2500 documents 05:37:33 DEBUG:2500/2500 documents converged within 50 iterations 05:37:33 DEBUG:updating topics 05:37:33 INFO:topic #128 (0.007): 0.019*"ford" + 0.017*"gm" + 0.015*"sale" + 0.015*"plant" + 0.011*"car" + 0.011*"vehicle" + 0.008*"chrysler" + 0.008*"worker" + 0.008*"automaker" + 0.007*"truck" 05:37:33 INFO:topic #82 (0.007): 0.006*"china" + 0.004*"tonne" + 0.004*"chinese" + 0.003*"trader" + 0.003*"copper" + 0.002*"price" + 0.002*"source" + 0.002*"kong" + 0.002*"shanghai" + 0.002*"metal" 05:37:33 INFO:topic #96 (0.007): 0.013*"tv" + 0.011*"industry" + 0.010*"group" + 0.010*"system" + 0.008*"plan" + 0.008*"television" + 0.008*"service" + 0.007*"rating" + 0.007*"american" + 0.006*"long" 05:37:33 INFO:topic #41 (0.007): 0.016*"australian" + 0.014*"bank" + 0.013*"profit" + 0.013*"share" + 0.013*"news" + 0.013*"sydney" + 0.013*"australia" + 0.011*"ltd" + 0.011*"analyst" + 0.011*"corp" 05:37:33 INFO:topic #79 (0.007): 0.006*"china" + 0.006*"beijing" + 0.004*"official" + 0.004*"lama" + 0.004*"tibet" + 0.004*"chinese" + 0.003*"region" + 0.003*"dalai_lama" + 0.003*"share" + 
0.003*"analyst" 05:37:33 INFO:topic diff=6.797042, rho=0.377964 05:37:33 INFO:PROGRESS: pass 6, at document #2500/2500 05:37:33 DEBUG:performing inference on a chunk of 2500 documents 05:37:40 DEBUG:2500/2500 documents converged within 50 iterations 05:37:40 DEBUG:updating topics 05:37:40 INFO:topic #76 (0.007): 0.005*"china" + 0.004*"hong_kong" + 0.004*"price" + 0.003*"kong" + 0.003*"hong" + 0.003*"tonne" + 0.003*"world" + 0.003*"analyst" + 0.002*"chinese" + 0.002*"Hong Kong" 05:37:40 INFO:topic #31 (0.007): 0.008*"franc" + 0.007*"french" + 0.006*"china" + 0.006*"billion" + 0.005*"shanghai" + 0.005*"analyst" + 0.004*"share" + 0.004*"government" + 0.004*"plan" + 0.004*"exchange" 05:37:40 INFO:topic #140 (0.007): 0.026*"china" + 0.020*"beijing" + 0.016*"chinese" + 0.013*"official" + 0.010*"wang" + 0.006*"foreign" + 0.005*"right" + 0.005*"human" + 0.005*"washington" + 0.005*"state" 05:37:40 INFO:topic #46 (0.007): 0.004*"hong" + 0.004*"kong" + 0.003*"china" + 0.003*"Hong Kong" + 0.002*"official" + 0.002*"hong_kong" + 0.002*"chinese" + 0.002*"singapore" + 0.002*"united" + 0.002*"plan" 05:37:40 INFO:topic #148 (0.007): 0.001*"network" + 0.001*"analyst" + 0.001*"stock" + 0.001*"share" + 0.001*"price" + 0.001*"remote" + 0.001*"recent" + 0.001*"industry" + 0.001*"chinese" + 0.001*"billion" 05:37:40 INFO:topic diff=5.743502, rho=0.353553 05:37:40 INFO:PROGRESS: pass 7, at document #2500/2500 05:37:40 DEBUG:performing inference on a chunk of 2500 documents 05:37:46 DEBUG:2500/2500 documents converged within 50 iterations 05:37:46 DEBUG:updating topics 05:37:47 INFO:topic #99 (0.007): 0.003*"czech" + 0.003*"world" + 0.003*"team" + 0.002*"win" + 0.002*"game" + 0.002*"play" + 0.002*"second" + 0.002*"stock" + 0.002*"billion" + 0.002*"end" 05:37:47 INFO:topic #81 (0.007): 0.003*"pound" + 0.002*"share" + 0.002*"profit" + 0.002*"million_pound" + 0.002*"sale" + 0.001*"business" + 0.001*"analyst" + 0.001*"rise" + 0.001*"group" + 0.001*"fall" 05:37:47 INFO:topic #97 (0.007): 
0.001*"analyst" + 0.001*"business" + 0.001*"china" + 0.001*"internet" + 0.001*"stock" + 0.001*"sale" + 0.001*"service" + 0.001*"chairman" + 0.001*"continue" + 0.001*"base" 05:37:47 INFO:topic #28 (0.007): 0.000*"large" + 0.000*"share" + 0.000*"stock" + 0.000*"property" + 0.000*"analyst" + 0.000*"taiwan" + 0.000*"china" + 0.000*"bank" + 0.000*"news" + 0.000*"billion" 05:37:47 INFO:topic #93 (0.007): 0.020*"ibm" + 0.018*"internet" + 0.016*"computer" + 0.013*"pc" + 0.012*"service" + 0.011*"analyst" + 0.009*"industry" + 0.009*"software" + 0.009*"quarter" + 0.009*"consumer" 05:37:47 INFO:topic diff=4.844379, rho=0.333333 05:37:47 INFO:PROGRESS: pass 8, at document #2500/2500 05:37:47 DEBUG:performing inference on a chunk of 2500 documents 05:37:53 DEBUG:2500/2500 documents converged within 50 iterations 05:37:53 DEBUG:updating topics 05:37:54 INFO:topic #97 (0.007): 0.001*"analyst" + 0.001*"business" + 0.001*"china" + 0.001*"internet" + 0.001*"stock" + 0.001*"sale" + 0.001*"service" + 0.001*"chairman" + 0.001*"continue" + 0.001*"base" 05:37:54 INFO:topic #77 (0.007): 0.002*"computer" + 0.002*"internet" + 0.002*"quarter" + 0.002*"business" + 0.002*"service" + 0.002*"analyst" + 0.001*"share" + 0.001*"system" + 0.001*"cost" + 0.001*"industry" 05:37:54 INFO:topic #70 (0.007): 0.024*"share" + 0.023*"shanghai" + 0.021*"china" + 0.015*"bank" + 0.013*"b" + 0.013*"analyst" + 0.012*"foreign" + 0.010*"exchange" + 0.010*"investor" + 0.010*"stock" 05:37:54 INFO:topic #26 (0.007): 0.003*"business" + 0.002*"analyst" + 0.001*"share" + 0.001*"billion" + 0.001*"stock" + 0.001*"continue" + 0.001*"states" + 0.001*"chemical" + 0.001*"united" + 0.001*"internet" 05:37:54 INFO:topic #145 (0.007): 0.019*"analyst" + 0.015*"sale" + 0.013*"share" + 0.012*"quarter" + 0.009*"business" + 0.008*"base" + 0.007*"earning" + 0.007*"stock" + 0.007*"drug" + 0.006*"amp" 05:37:54 INFO:topic diff=4.081138, rho=0.316228 05:37:54 INFO:PROGRESS: pass 9, at document #2500/2500 05:37:54 DEBUG:performing inference 
on a chunk of 2500 documents 05:38:00 DEBUG:2500/2500 documents converged within 50 iterations 05:38:00 DEBUG:updating topics 05:38:00 INFO:topic #116 (0.007): 0.003*"x" + 0.002*"bre" + 0.002*"analyst" + 0.002*"Bre-X" + 0.002*"bre_x" + 0.001*"government" + 0.001*"barrick" + 0.001*"gold" + 0.001*"mining" + 0.001*"indonesian" 05:38:00 INFO:topic #67 (0.007): 0.002*"hong" + 0.002*"china" + 0.002*"kong" + 0.001*"hong_kong" + 0.001*"Hong Kong" + 0.001*"beijing" + 0.001*"legislature" + 0.001*"rule" + 0.001*"chinese" + 0.001*"plan" 05:38:00 INFO:topic #69 (0.007): 0.001*"tibet" + 0.001*"chen" + 0.001*"dalai_lama" + 0.001*"china" + 0.001*"beijing" + 0.001*"group" + 0.001*"dalai" + 0.000*"lama" + 0.000*"billion" + 0.000*"region" 05:38:00 INFO:topic #146 (0.007): 0.001*"hong_kong" + 0.001*"analyst" + 0.000*"share" + 0.000*"hong" + 0.000*"kong" + 0.000*"china" + 0.000*"news" + 0.000*"Hong Kong" + 0.000*"billion" + 0.000*"price" 05:38:00 INFO:topic #66 (0.007): 0.001*"bank" + 0.001*"china" + 0.000*"hong" + 0.000*"government" + 0.000*"hong_kong" + 0.000*"plan" + 0.000*"bre" + 0.000*"x" + 0.000*"kong" + 0.000*"financial" 05:38:00 INFO:topic diff=3.435844, rho=0.301511 05:38:00 INFO:PROGRESS: pass 10, at document #2500/2500 05:38:00 DEBUG:performing inference on a chunk of 2500 documents 05:38:06 DEBUG:2500/2500 documents converged within 50 iterations 05:38:06 DEBUG:updating topics 05:38:06 INFO:topic #80 (0.007): 0.019*"analyst" + 0.015*"microsoft" + 0.013*"quarter" + 0.011*"business" + 0.010*"computer" + 0.009*"sale" + 0.008*"revenue" + 0.008*"windows" + 0.008*"share" + 0.008*"system" 05:38:06 INFO:topic #44 (0.007): 0.001*"sale" + 0.001*"china" + 0.001*"analyst" + 0.001*"share" + 0.001*"service" + 0.000*"plan" + 0.000*"bank" + 0.000*"deal" + 0.000*"billion" + 0.000*"world" 05:38:06 INFO:topic #58 (0.007): 0.001*"shanghai" + 0.001*"china" + 0.001*"bank" + 0.001*"chinese" + 0.000*"city" + 0.000*"stock" + 0.000*"chen" + 0.000*"beijing" + 0.000*"analyst" + 0.000*"modern" 05:38:06 
INFO:topic #36 (0.007): 0.022*"penny" + 0.022*"bid" + 0.021*"analyst" + 0.018*"share" + 0.014*"electric" + 0.012*"price" + 0.012*"electricity" + 0.012*"offer" + 0.012*"pound" + 0.011*"northern" 05:38:06 INFO:topic #143 (0.007): 0.018*"mci" + 0.012*"analyst" + 0.012*"allen" + 0.011*"long" + 0.011*"billion" + 0.011*"distance" + 0.010*"long_distance" + 0.009*"stock" + 0.009*"share" + 0.009*"executive" 05:38:06 INFO:topic diff=2.892181, rho=0.288675 05:38:06 INFO:PROGRESS: pass 11, at document #2500/2500 05:38:06 DEBUG:performing inference on a chunk of 2500 documents 05:38:12 DEBUG:2500/2500 documents converged within 50 iterations 05:38:12 DEBUG:updating topics 05:38:13 INFO:topic #141 (0.007): 0.006*"internet" + 0.005*"bank" + 0.003*"law" + 0.003*"congress" + 0.003*"court" + 0.002*"service" + 0.002*"export" + 0.002*"security" + 0.002*"member" + 0.002*"credit" 05:38:13 INFO:topic #33 (0.007): 0.000*"billion" + 0.000*"service" + 0.000*"plan" + 0.000*"industry" + 0.000*"china" + 0.000*"internet" + 0.000*"tonne" + 0.000*"price" + 0.000*"chinese" + 0.000*"share" 05:38:13 INFO:topic #49 (0.007): 0.001*"eurotunnel" + 0.001*"service" + 0.000*"billion" + 0.000*"share" + 0.000*"fire" + 0.000*"pound" + 0.000*"tunnel" + 0.000*"debt" + 0.000*"group" + 0.000*"financial" 05:38:13 INFO:topic #116 (0.007): 0.002*"x" + 0.001*"bre" + 0.001*"analyst" + 0.001*"Bre-X" + 0.001*"bre_x" + 0.001*"government" + 0.001*"barrick" + 0.001*"gold" + 0.001*"mining" + 0.001*"indonesian" 05:38:13 INFO:topic #83 (0.007): 0.001*"beijing" + 0.001*"chinese" + 0.001*"billion" + 0.001*"profit" + 0.001*"tell" + 0.001*"china" + 0.001*"bank" + 0.001*"analyst" + 0.001*"australian" + 0.001*"share" 05:38:13 INFO:topic diff=2.435570, rho=0.277350 05:38:13 INFO:PROGRESS: pass 12, at document #2500/2500 05:38:13 DEBUG:performing inference on a chunk of 2500 documents 05:38:19 DEBUG:2500/2500 documents converged within 50 iterations 05:38:19 DEBUG:updating topics 05:38:19 INFO:topic #138 (0.007): 0.000*"china" + 
0.000*"beijing" + 0.000*"share" + 0.000*"states" + 0.000*"news" + 0.000*"analyst" + 0.000*"long" + 0.000*"chinese" + 0.000*"trade" + 0.000*"the United States" 05:38:19 INFO:topic #125 (0.007): 0.000*"analyst" + 0.000*"share" + 0.000*"bank" + 0.000*"problem" + 0.000*"billion" + 0.000*"sale" + 0.000*"loan" + 0.000*"plant" + 0.000*"gm" + 0.000*"corp" 05:38:19 INFO:topic #21 (0.007): 0.024*"stock" + 0.021*"toronto" + 0.018*"share" + 0.017*"bank" + 0.016*"canada" + 0.013*"gold" + 0.012*"billion" + 0.012*"index" + 0.011*"close" + 0.010*"point" 05:38:19 INFO:topic #5 (0.007): 0.002*"china" + 0.001*"beijing" + 0.001*"chen" + 0.001*"official" + 0.001*"trade" + 0.001*"economic" + 0.001*"chinese" + 0.001*"party" + 0.001*"survey" + 0.001*"month" 05:38:19 INFO:topic #31 (0.007): 0.002*"franc" + 0.002*"french" + 0.002*"china" + 0.002*"billion" + 0.001*"shanghai" + 0.001*"analyst" + 0.001*"share" + 0.001*"government" + 0.001*"plan" + 0.001*"exchange" 05:38:19 INFO:topic diff=2.053082, rho=0.267261 05:38:19 INFO:PROGRESS: pass 13, at document #2500/2500 05:38:19 DEBUG:performing inference on a chunk of 2500 documents 05:38:26 DEBUG:2500/2500 documents converged within 50 iterations 05:38:26 DEBUG:updating topics 05:38:26 INFO:topic #3 (0.007): 0.000*"pound" + 0.000*"billion" + 0.000*"share" + 0.000*"group" + 0.000*"british" + 0.000*"deal" + 0.000*"bank" + 0.000*"sale" + 0.000*"mci" + 0.000*"service" 05:38:26 INFO:topic #42 (0.007): 0.000*"technology" + 0.000*"russia" + 0.000*"computer" + 0.000*"industry" + 0.000*"russian" + 0.000*"internet" + 0.000*"analyst" + 0.000*"world" + 0.000*"price" + 0.000*"software" 05:38:26 INFO:topic #104 (0.007): 0.044*"cent" + 0.043*"bank" + 0.028*"cent_share" + 0.022*"league" + 0.019*"football" + 0.016*"share" + 0.014*"card" + 0.013*"earning" + 0.012*"cos" + 0.011*"canada" 05:38:26 INFO:topic #69 (0.007): 0.000*"tibet" + 0.000*"chen" + 0.000*"dalai_lama" + 0.000*"china" + 0.000*"beijing" + 0.000*"group" + 0.000*"dalai" + 0.000*"lama" + 
0.000*"billion" + 0.000*"region" 05:38:26 INFO:topic #115 (0.007): 0.000*"share" + 0.000*"stock" + 0.000*"billion" + 0.000*"analyst" + 0.000*"china" + 0.000*"industry" + 0.000*"month" + 0.000*"rise" + 0.000*"big" + 0.000*"deal" 05:38:26 INFO:topic diff=1.733322, rho=0.258199 05:38:26 INFO:PROGRESS: pass 14, at document #2500/2500 05:38:26 DEBUG:performing inference on a chunk of 2500 documents 05:38:32 DEBUG:2500/2500 documents converged within 50 iterations 05:38:32 DEBUG:updating topics 05:38:33 INFO:topic #82 (0.007): 0.001*"china" + 0.000*"tonne" + 0.000*"chinese" + 0.000*"trader" + 0.000*"copper" + 0.000*"price" + 0.000*"source" + 0.000*"kong" + 0.000*"shanghai" + 0.000*"metal" 05:38:33 INFO:topic #122 (0.007): 0.000*"share" + 0.000*"analyst" + 0.000*"quarter" + 0.000*"pc" + 0.000*"computer" + 0.000*"ibm" + 0.000*"profit" + 0.000*"service" + 0.000*"industry" + 0.000*"compaq" 05:38:33 INFO:topic #126 (0.007): 0.000*"group" + 0.000*"europe" + 0.000*"plan" + 0.000*"air" + 0.000*"model" + 0.000*"pound" + 0.000*"hong" + 0.000*"month" + 0.000*"japan" + 0.000*"Hong Kong" 05:38:33 INFO:topic #1 (0.007): 0.005*"bank" + 0.004*"stock" + 0.004*"billion" + 0.004*"japan" + 0.003*"analyst" + 0.003*"financial" + 0.003*"asset" + 0.003*"japanese" + 0.002*"big" + 0.002*"yen" 05:38:33 INFO:topic #120 (0.007): 0.022*"stiff" + 0.019*"court" + 0.018*"rating" + 0.018*"frequently" + 0.012*"mercury" + 0.012*"williams" + 0.012*"remove" + 0.011*"judge" + 0.011*"armed" + 0.009*"ford" 05:38:33 INFO:topic diff=1.466340, rho=0.250000 05:38:33 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=3914, num_topics=150, num_authors=50, decay=0.5, chunksize=2500) 05:38:33 INFO:CorpusAccumulator accumulated stats from 1000 documents 05:38:33 INFO:CorpusAccumulator accumulated stats from 2000 documents
-1.90810257282
accuracy_scores_150topic = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_50_20, atmodel_150topics, k=i)
    accuracy_scores_150topic[k] = accuracy
plot_accuracy(scores1=accuracy_scores_100topic, label1="100 topics", scores2=accuracy_scores_150topic, label2="150 topics")
Precision@k: top_n=1 Prediction accuracy: 0.6004
Precision@k: top_n=2 Prediction accuracy: 0.7632
Precision@k: top_n=3 Prediction accuracy: 0.8452
Precision@k: top_n=4 Prediction accuracy: 0.8796
Precision@k: top_n=5 Prediction accuracy: 0.8988
Precision@k: top_n=6 Prediction accuracy: 0.914
Precision@k: top_n=8 Prediction accuracy: 0.9324
Precision@k: top_n=10 Prediction accuracy: 0.9464
The 150-topic model is also slightly better, especially at the lower end of k, but the gains are clearly converging. We try 200 topics to be sure.
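As a reminder of what the accuracy at k measures, here is a minimal sketch of a top-k check based on the Hellinger distance. It is not the tutorial's `prediction_accuracy` function: the helper names (`hellinger`, `top_k_authors`, `top_k_accuracy`) and the toy distributions are assumptions made up for illustration, and plain Python lists stand in for the model's topic vectors.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def top_k_authors(doc_dist, author_dists, k):
    """Return the k author names whose topic distributions are closest to doc_dist."""
    ranked = sorted(author_dists, key=lambda name: hellinger(doc_dist, author_dists[name]))
    return ranked[:k]


def top_k_accuracy(samples, author_dists, k):
    """Fraction of (true_author, doc_dist) pairs whose true author is among the k closest."""
    hits = sum(true_author in top_k_authors(dist, author_dists, k)
               for true_author, dist in samples)
    return hits / len(samples)


# Toy data, invented for illustration only (3 topics, 3 authors).
author_dists = {
    "author_a": [0.8, 0.1, 0.1],
    "author_b": [0.1, 0.8, 0.1],
    "author_c": [0.4, 0.5, 0.1],
}
samples = [
    ("author_a", [0.7, 0.2, 0.1]),
    ("author_b", [0.3, 0.6, 0.1]),  # closer to author_c than to author_b
]

for k in (1, 2):
    print(k, top_k_accuracy(samples, author_dists, k))
```

In this toy case the second sample is missed at k=1 but recovered at k=2, which is the same effect that makes the accuracy curves rise with k.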
atmodel_200topics = train_model(train_corpus_50_20, train_author2doc, train_dictionary_50_20, num_topics=200, eval_every=0, iterations=50, passes=15)
05:43:01 INFO:Vocabulary consists of 3914 words. 05:43:01 INFO:using symmetric alpha at 0.005 05:43:01 INFO:using symmetric eta at 0.005 05:43:05 INFO:running online author-topic training, 200 topics, 50 authors, 15 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000 05:43:05 INFO:PROGRESS: pass 0, at document #2500/2500 05:43:05 DEBUG:performing inference on a chunk of 2500 documents 05:43:25 DEBUG:2/2500 documents converged within 50 iterations 05:43:25 DEBUG:updating topics 05:43:26 INFO:topic #198 (0.005): 0.006*"plant" + 0.006*"analyst" + 0.006*"gm" + 0.005*"sale" + 0.005*"group" + 0.005*"service" + 0.004*"share" + 0.004*"internet" + 0.004*"plan" + 0.004*"uaw" 05:43:26 INFO:topic #186 (0.005): 0.009*"pound" + 0.007*"quarter" + 0.007*"analyst" + 0.007*"share" + 0.006*"group" + 0.006*"business" + 0.005*"million_pound" + 0.005*"software" + 0.005*"sale" + 0.005*"industry" 05:43:26 INFO:topic #188 (0.005): 0.011*"share" + 0.010*"analyst" + 0.008*"billion" + 0.007*"sale" + 0.007*"stock" + 0.006*"mci" + 0.006*"business" + 0.005*"british" + 0.005*"quarter" + 0.005*"deal" 05:43:26 INFO:topic #162 (0.005): 0.012*"analyst" + 0.008*"business" + 0.007*"quarter" + 0.006*"share" + 0.006*"industry" + 0.006*"sale" + 0.005*"base" + 0.005*"billion" + 0.005*"high" + 0.005*"price" 05:43:26 INFO:topic #22 (0.005): 0.039*"bank" + 0.011*"rate" + 0.011*"day" + 0.011*"cut" + 0.010*"analyst" + 0.008*"australia" + 0.008*"profit" + 0.008*"financial" + 0.007*"ltd" + 0.007*"merger" 05:43:26 INFO:topic diff=65.500588, rho=1.000000 05:43:26 INFO:PROGRESS: pass 1, at document #2500/2500 05:43:26 DEBUG:performing inference on a chunk of 2500 documents 05:43:36 DEBUG:2494/2500 documents converged within 50 iterations 05:43:36 DEBUG:updating topics 05:43:37 INFO:topic #77 (0.005): 0.013*"internet" + 0.011*"computer" + 0.009*"business" + 0.009*"quarter" + 
0.009*"service" + 0.008*"revenue" + 0.007*"analyst" + 0.007*"cost" + 0.007*"industry" + 0.006*"compaq" 05:43:37 INFO:topic #25 (0.005): 0.010*"share" + 0.007*"analyst" + 0.007*"service" + 0.007*"business" + 0.006*"growth" + 0.006*"mci" + 0.006*"billion" + 0.006*"long" + 0.005*"distance" + 0.005*"stock" 05:43:37 INFO:topic #133 (0.005): 0.011*"group" + 0.011*"share" + 0.010*"pound" + 0.009*"billion" + 0.007*"profit" + 0.007*"business" + 0.006*"sale" + 0.005*"good" + 0.005*"bank" + 0.005*"analyst" 05:43:37 INFO:topic #180 (0.005): 0.011*"billion" + 0.006*"venture" + 0.006*"quarter" + 0.005*"investment" + 0.005*"industry" + 0.005*"analyst" + 0.004*"price" + 0.003*"group" + 0.003*"high" + 0.003*"rise" 05:43:37 INFO:topic #1 (0.005): 0.018*"japan" + 0.014*"japanese" + 0.013*"billion" + 0.011*"yen" + 0.011*"stock" + 0.011*"bank" + 0.010*"life" + 0.010*"financial" + 0.010*"big" + 0.008*"profit" 05:43:37 INFO:topic diff=17.080447, rho=0.577350 05:43:37 INFO:PROGRESS: pass 2, at document #2500/2500 05:43:37 DEBUG:performing inference on a chunk of 2500 documents 05:43:46 DEBUG:2499/2500 documents converged within 50 iterations 05:43:46 DEBUG:updating topics 05:43:47 INFO:topic #92 (0.005): 0.013*"analyst" + 0.011*"share" + 0.007*"sale" + 0.006*"profit" + 0.005*"pound" + 0.004*"high" + 0.004*"revenue" + 0.004*"quarter" + 0.004*"billion" + 0.004*"cent" 05:43:47 INFO:topic #117 (0.005): 0.015*"access" + 0.012*"local" + 0.011*"internet" + 0.011*"fee" + 0.010*"distance" + 0.010*"long" + 0.008*"long_distance" + 0.008*"service" + 0.007*"issue" + 0.006*"provider" 05:43:47 INFO:topic #81 (0.005): 0.011*"pound" + 0.010*"profit" + 0.009*"share" + 0.008*"sale" + 0.007*"million_pound" + 0.006*"analyst" + 0.006*"business" + 0.006*"rise" + 0.005*"group" + 0.005*"fall" 05:43:47 INFO:topic #97 (0.005): 0.025*"internet" + 0.018*"bill" + 0.014*"administration" + 0.014*"product" + 0.012*"key" + 0.011*"policy" + 0.011*"export" + 0.010*"law" + 0.008*"access" + 0.008*"bank" 05:43:47 INFO:topic 
#24 (0.005): 0.005*"crop" + 0.005*"price" + 0.005*"share" + 0.004*"tonne" + 0.004*"analyst" + 0.004*"exporter" + 0.004*"cocoa" + 0.003*"ivory_coast" + 0.003*"government" + 0.003*"reuters" 05:43:47 INFO:topic diff=14.773285, rho=0.500000 05:43:47 INFO:PROGRESS: pass 3, at document #2500/2500 05:43:47 DEBUG:performing inference on a chunk of 2500 documents 05:43:56 DEBUG:2499/2500 documents converged within 50 iterations 05:43:56 DEBUG:updating topics 05:43:56 INFO:topic #16 (0.005): 0.003*"group" + 0.003*"billion" + 0.002*"gm" + 0.002*"hong_kong" + 0.002*"pound" + 0.002*"kong" + 0.002*"china" + 0.002*"bid" + 0.002*"analyst" + 0.002*"hong" 05:43:56 INFO:topic #133 (0.005): 0.012*"group" + 0.011*"pound" + 0.010*"share" + 0.008*"billion" + 0.007*"business" + 0.006*"profit" + 0.005*"good" + 0.005*"sale" + 0.005*"add" + 0.005*"cost" 05:43:56 INFO:topic #23 (0.005): 0.017*"boeing" + 0.010*"billion" + 0.009*"analyst" + 0.006*"share" + 0.006*"microsoft" + 0.006*"industry" + 0.005*"quarter" + 0.005*"jet" + 0.005*"windows" + 0.005*"mcdonnell" 05:43:56 INFO:topic #156 (0.005): 0.005*"analyst" + 0.004*"bank" + 0.003*"share" + 0.003*"service" + 0.003*"internet" + 0.002*"china" + 0.002*"plan" + 0.002*"profit" + 0.002*"billion" + 0.002*"cost" 05:43:56 INFO:topic #131 (0.005): 0.024*"analyst" + 0.017*"share" + 0.013*"price" + 0.011*"business" + 0.011*"penny" + 0.009*"bid" + 0.007*"electric" + 0.006*"offer" + 0.006*"add" + 0.006*"northern" 05:43:56 INFO:topic diff=12.542799, rho=0.447214 05:43:56 INFO:PROGRESS: pass 4, at document #2500/2500 05:43:56 DEBUG:performing inference on a chunk of 2500 documents 05:44:05 DEBUG:2500/2500 documents converged within 50 iterations 05:44:05 DEBUG:updating topics 05:44:05 INFO:topic #92 (0.005): 0.009*"analyst" + 0.008*"share" + 0.005*"sale" + 0.004*"profit" + 0.003*"pound" + 0.003*"high" + 0.003*"revenue" + 0.003*"quarter" + 0.003*"billion" + 0.003*"cent" 05:44:05 INFO:topic #86 (0.005): 0.029*"cargo" + 0.021*"kong" + 0.020*"hong" + 
0.020*"hong_kong" + 0.016*"air" + 0.015*"Hong Kong" + 0.015*"airline" + 0.009*"service" + 0.009*"route" + 0.009*"airport" 05:44:05 INFO:topic #80 (0.005): 0.017*"analyst" + 0.014*"microsoft" + 0.013*"quarter" + 0.010*"computer" + 0.010*"business" + 0.009*"windows" + 0.008*"revenue" + 0.008*"internet" + 0.007*"system" + 0.007*"sale" 05:44:05 INFO:topic #23 (0.005): 0.015*"boeing" + 0.009*"billion" + 0.008*"analyst" + 0.006*"share" + 0.005*"microsoft" + 0.005*"industry" + 0.005*"quarter" + 0.005*"jet" + 0.004*"windows" + 0.004*"mcdonnell" 05:44:05 INFO:topic #177 (0.005): 0.011*"investment" + 0.010*"pound" + 0.010*"group" + 0.008*"cable" + 0.008*"british" + 0.007*"fleming" + 0.007*"management" + 0.006*"fund" + 0.006*"share" + 0.006*"merger" 05:44:05 INFO:topic diff=10.561281, rho=0.408248 05:44:05 INFO:PROGRESS: pass 5, at document #2500/2500 05:44:05 DEBUG:performing inference on a chunk of 2500 documents 05:44:14 DEBUG:2500/2500 documents converged within 50 iterations 05:44:14 DEBUG:updating topics 05:44:15 INFO:topic #61 (0.005): 0.038*"boeing" + 0.017*"billion" + 0.016*"jet" + 0.013*"analyst" + 0.012*"mcdonnell" + 0.011*"microsoft" + 0.010*"airbus" + 0.010*"order" + 0.010*"douglas" + 0.009*"share" 05:44:15 INFO:topic #73 (0.005): 0.001*"bank" + 0.001*"china" + 0.001*"group" + 0.001*"big" + 0.001*"analyst" + 0.001*"sale" + 0.001*"shanghai" + 0.001*"deal" + 0.001*"gm" + 0.001*"pound" 05:44:15 INFO:topic #147 (0.005): 0.012*"czech" + 0.011*"crown" + 0.010*"week" + 0.009*"analyst" + 0.009*"point" + 0.008*"investor" + 0.007*"round" + 0.007*"prague" + 0.007*"billion" + 0.006*"second" 05:44:15 INFO:topic #50 (0.005): 0.006*"british" + 0.006*"telecom" + 0.006*"deal" + 0.005*"analyst" + 0.005*"drug" + 0.004*"share" + 0.004*"mci" + 0.004*"billion" + 0.004*"group" + 0.003*"sale" 05:44:15 INFO:topic #99 (0.005): 0.003*"stock" + 0.002*"business" + 0.002*"share" + 0.002*"analyst" + 0.002*"end" + 0.002*"day" + 0.002*"sale" + 0.002*"world" + 0.002*"billion" + 0.001*"quarter" 
05:44:15 INFO:topic diff=8.863923, rho=0.377964 05:44:15 INFO:PROGRESS: pass 6, at document #2500/2500 05:44:15 DEBUG:performing inference on a chunk of 2500 documents 05:44:23 DEBUG:2500/2500 documents converged within 50 iterations 05:44:23 DEBUG:updating topics 05:44:24 INFO:topic #70 (0.005): 0.011*"stock" + 0.010*"shanghai" + 0.008*"share" + 0.008*"exchange" + 0.007*"trading" + 0.007*"china" + 0.007*"bank" + 0.006*"future" + 0.006*"beijing" + 0.005*"index" 05:44:24 INFO:topic #177 (0.005): 0.011*"investment" + 0.010*"pound" + 0.010*"group" + 0.008*"cable" + 0.008*"british" + 0.007*"fleming" + 0.007*"management" + 0.007*"share" + 0.006*"fund" + 0.006*"merger" 05:44:24 INFO:topic #37 (0.005): 0.015*"bank" + 0.012*"czech" + 0.008*"crown" + 0.006*"prague" + 0.005*"foreign" + 0.005*"billion" + 0.004*"state" + 0.004*"deficit" + 0.004*"communist" + 0.004*"central" 05:44:24 INFO:topic #56 (0.005): 0.032*"kong" + 0.031*"hong" + 0.031*"hong_kong" + 0.022*"Hong Kong" + 0.016*"china" + 0.007*"fund" + 0.007*"Hong Kong's" + 0.006*"chinese" + 0.005*"tung" + 0.005*"british" 05:44:24 INFO:topic #32 (0.005): 0.004*"china" + 0.003*"beijing" + 0.002*"taiwan" + 0.002*"bre" + 0.002*"bre_x" + 0.002*"share" + 0.002*"x" + 0.002*"chinese" + 0.002*"analyst" + 0.002*"party" 05:44:24 INFO:topic diff=7.433677, rho=0.353553 05:44:24 INFO:PROGRESS: pass 7, at document #2500/2500 05:44:24 DEBUG:performing inference on a chunk of 2500 documents 05:44:32 DEBUG:2500/2500 documents converged within 50 iterations 05:44:32 DEBUG:updating topics 05:44:33 INFO:topic #194 (0.005): 0.024*"cocoa" + 0.020*"tonne" + 0.019*"exporter" + 0.012*"ivory" + 0.012*"ivory_coast" + 0.012*"coast" + 0.011*"crop" + 0.011*"price" + 0.010*"buyer" + 0.009*"export" 05:44:33 INFO:topic #116 (0.005): 0.004*"x" + 0.003*"analyst" + 0.003*"bre" + 0.003*"Bre-X" + 0.002*"bre_x" + 0.002*"share" + 0.002*"government" + 0.002*"bank" + 0.002*"billion" + 0.002*"sale" 05:44:33 INFO:topic #174 (0.005): 0.001*"quarter" + 0.001*"venture" 
+ 0.001*"china" + 0.001*"billion" + 0.001*"beijing" + 0.000*"investment" + 0.000*"share" + 0.000*"level" + 0.000*"chinese" + 0.000*"official" 05:44:33 INFO:topic #169 (0.005): 0.001*"china" + 0.001*"tell" + 0.001*"service" + 0.001*"hong_kong" + 0.001*"billion" + 0.001*"share" + 0.001*"beijing" + 0.001*"group" + 0.001*"kong" + 0.001*"analyst" 05:44:33 INFO:topic #88 (0.005): 0.024*"franc" + 0.023*"french" + 0.022*"air" + 0.021*"france" + 0.017*"thomson" + 0.014*"billion" + 0.011*"group" + 0.010*"telecom" + 0.010*"billion_franc" + 0.009*"government" 05:44:33 INFO:topic diff=6.234229, rho=0.333333 05:44:33 INFO:PROGRESS: pass 8, at document #2500/2500 05:44:33 DEBUG:performing inference on a chunk of 2500 documents 05:44:41 DEBUG:2500/2500 documents converged within 50 iterations 05:44:41 DEBUG:updating topics 05:44:42 INFO:topic #149 (0.005): 0.010*"china" + 0.005*"chinese" + 0.005*"official" + 0.004*"beijing" + 0.004*"metre" + 0.003*"world" + 0.003*"trade" + 0.003*"foreign" + 0.003*"united_states" + 0.003*"united" 05:44:42 INFO:topic #62 (0.005): 0.002*"property" + 0.002*"increase" + 0.002*"month" + 0.002*"klaus" + 0.001*"social" + 0.001*"commission" + 0.001*"pound" + 0.001*"analyst" + 0.001*"large" + 0.001*"party" 05:44:42 INFO:topic #38 (0.005): 0.016*"analyst" + 0.014*"australian" + 0.014*"ltd" + 0.013*"share" + 0.012*"australia" + 0.011*"profit" + 0.011*"sydney" + 0.009*"news" + 0.009*"group" + 0.009*"corp" 05:44:42 INFO:topic #155 (0.005): 0.002*"china" + 0.001*"fund" + 0.001*"stock" + 0.001*"billion" + 0.001*"economic" + 0.001*"hong" + 0.001*"bank" + 0.001*"group" + 0.001*"kong" + 0.001*"canada" 05:44:42 INFO:topic #130 (0.005): 0.019*"mci" + 0.012*"analyst" + 0.011*"service" + 0.011*"share" + 0.011*"long" + 0.010*"billion" + 0.009*"long_distance" + 0.009*"distance" + 0.009*"corp" + 0.008*"deal" 05:44:42 INFO:topic diff=5.231049, rho=0.316228 05:44:42 INFO:PROGRESS: pass 9, at document #2500/2500 05:44:42 DEBUG:performing inference on a chunk of 2500 documents 
05:44:50 DEBUG:2500/2500 documents converged within 50 iterations 05:44:51 DEBUG:updating topics 05:44:51 INFO:topic #166 (0.005): 0.001*"oil" + 0.001*"russian" + 0.001*"russia" + 0.001*"internet" + 0.001*"export" + 0.001*"world" + 0.001*"service" + 0.001*"tonne" + 0.001*"analyst" + 0.001*"output" 05:44:51 INFO:topic #187 (0.005): 0.033*"china" + 0.011*"beijing" + 0.011*"official" + 0.010*"chinese" + 0.008*"state" + 0.008*"foreign" + 0.008*"trade" + 0.006*"united" + 0.005*"united_states" + 0.005*"states" 05:44:51 INFO:topic #139 (0.005): 0.013*"drug" + 0.012*"group" + 0.010*"pound" + 0.010*"sale" + 0.009*"plc" + 0.009*"british" + 0.009*"share" + 0.008*"product" + 0.008*"profit" + 0.008*"analyst" 05:44:51 INFO:topic #43 (0.005): 0.001*"tonne" + 0.001*"cocoa" + 0.001*"china" + 0.000*"share" + 0.000*"bank" + 0.000*"government" + 0.000*"exporter" + 0.000*"stock" + 0.000*"plan" + 0.000*"close" 05:44:51 INFO:topic #71 (0.005): 0.025*"fcc" + 0.018*"phone" + 0.016*"carrier" + 0.015*"local" + 0.012*"rule" + 0.011*"long" + 0.011*"service" + 0.011*"distance" + 0.010*"tv" + 0.010*"long_distance" 05:44:51 INFO:topic diff=4.391602, rho=0.301511 05:44:51 INFO:PROGRESS: pass 10, at document #2500/2500 05:44:51 DEBUG:performing inference on a chunk of 2500 documents 05:44:59 DEBUG:2500/2500 documents converged within 50 iterations 05:44:59 DEBUG:updating topics 05:45:00 INFO:topic #156 (0.005): 0.001*"analyst" + 0.001*"bank" + 0.001*"share" + 0.001*"service" + 0.000*"internet" + 0.000*"china" + 0.000*"plan" + 0.000*"profit" + 0.000*"billion" + 0.000*"cost" 05:45:00 INFO:topic #63 (0.005): 0.037*"oil" + 0.029*"russia" + 0.026*"russian" + 0.016*"tonne" + 0.016*"aluminium" + 0.015*"smelter" + 0.014*"output" + 0.012*"world" + 0.012*"export" + 0.010*"western" 05:45:00 INFO:topic #83 (0.005): 0.001*"profit" + 0.001*"bank" + 0.001*"australian" + 0.001*"analyst" + 0.001*"billion" + 0.001*"australia" + 0.001*"share" + 0.001*"tell" + 0.001*"ltd" + 0.001*"beijing" 05:45:00 INFO:topic #144 
(0.005): 0.001*"computer" + 0.001*"software" + 0.001*"site" + 0.001*"quarter" + 0.001*"technology" + 0.001*"internet" + 0.001*"industry" + 0.001*"web" + 0.001*"product" + 0.001*"high" 05:45:00 INFO:topic #84 (0.005): 0.015*"klaus" + 0.014*"czech" + 0.014*"bank" + 0.011*"billion" + 0.011*"crown" + 0.009*"state" + 0.009*"price" + 0.008*"minister" + 0.007*"tell" + 0.007*"low" 05:45:00 INFO:topic diff=3.689287, rho=0.288675 05:45:00 INFO:PROGRESS: pass 11, at document #2500/2500 05:45:00 DEBUG:performing inference on a chunk of 2500 documents 05:45:08 DEBUG:2500/2500 documents converged within 50 iterations 05:45:08 DEBUG:updating topics 05:45:09 INFO:topic #50 (0.005): 0.001*"british" + 0.001*"telecom" + 0.001*"deal" + 0.001*"analyst" + 0.001*"drug" + 0.001*"share" + 0.001*"mci" + 0.001*"billion" + 0.001*"group" + 0.001*"sale" 05:45:09 INFO:topic #194 (0.005): 0.024*"cocoa" + 0.020*"tonne" + 0.020*"exporter" + 0.012*"ivory" + 0.012*"coast" + 0.012*"ivory_coast" + 0.011*"crop" + 0.011*"price" + 0.010*"buyer" + 0.009*"export" 05:45:09 INFO:topic #138 (0.005): 0.001*"china" + 0.000*"beijing" + 0.000*"share" + 0.000*"news" + 0.000*"states" + 0.000*"chinese" + 0.000*"analyst" + 0.000*"month" + 0.000*"long" + 0.000*"the United States" 05:45:09 INFO:topic #100 (0.005): 0.001*"chinese" + 0.001*"china" + 0.001*"beijing" + 0.000*"hong" + 0.000*"hong_kong" + 0.000*"official" + 0.000*"tibet" + 0.000*"magazine" + 0.000*"kong" + 0.000*"lama" 05:45:09 INFO:topic #94 (0.005): 0.000*"share" + 0.000*"stock" + 0.000*"election" + 0.000*"analyst" + 0.000*"low" + 0.000*"bank" + 0.000*"havel" + 0.000*"government" + 0.000*"high" + 0.000*"large" 05:45:09 INFO:topic diff=3.102101, rho=0.277350 05:45:09 INFO:PROGRESS: pass 12, at document #2500/2500 05:45:09 DEBUG:performing inference on a chunk of 2500 documents 05:45:17 DEBUG:2500/2500 documents converged within 50 iterations 05:45:17 DEBUG:updating topics 05:45:18 INFO:topic #122 (0.005): 0.000*"share" + 0.000*"analyst" + 0.000*"plant" + 
0.000*"gm" + 0.000*"industry" + 0.000*"quarter" + 0.000*"service" + 0.000*"law" + 0.000*"month" + 0.000*"large" 05:45:18 INFO:topic #20 (0.005): 0.029*"gold" + 0.019*"bre" + 0.019*"x" + 0.018*"bre_x" + 0.014*"price" + 0.014*"analyst" + 0.011*"Bre-X" + 0.010*"busang" + 0.010*"barrick" + 0.010*"toronto" 05:45:18 INFO:topic #107 (0.005): 0.031*"russia" + 0.027*"oil" + 0.016*"russian" + 0.014*"export" + 0.012*"output" + 0.010*"moscow" + 0.009*"tonne" + 0.009*"domestic" + 0.009*"world" + 0.008*"western" 05:45:18 INFO:topic #130 (0.005): 0.019*"mci" + 0.012*"analyst" + 0.011*"service" + 0.011*"share" + 0.010*"long" + 0.010*"billion" + 0.009*"long_distance" + 0.009*"distance" + 0.009*"corp" + 0.008*"deal" 05:45:18 INFO:topic #151 (0.005): 0.012*"billion" + 0.008*"sale" + 0.007*"computer" + 0.007*"industry" + 0.006*"good" + 0.006*"analyst" + 0.006*"product" + 0.006*"quarter" + 0.005*"forecast" + 0.005*"internet" 05:45:18 INFO:topic diff=2.611657, rho=0.267261 05:45:18 INFO:PROGRESS: pass 13, at document #2500/2500 05:45:18 DEBUG:performing inference on a chunk of 2500 documents 05:45:28 DEBUG:2500/2500 documents converged within 50 iterations 05:45:28 DEBUG:updating topics 05:45:28 INFO:topic #74 (0.005): 0.036*"china" + 0.029*"tonne" + 0.021*"chinese" + 0.021*"trader" + 0.018*"price" + 0.014*"import" + 0.013*"source" + 0.011*"copper" + 0.010*"official" + 0.010*"million_tonne" 05:45:28 INFO:topic #138 (0.005): 0.000*"china" + 0.000*"beijing" + 0.000*"share" + 0.000*"news" + 0.000*"states" + 0.000*"chinese" + 0.000*"analyst" + 0.000*"month" + 0.000*"long" + 0.000*"the United States" 05:45:28 INFO:topic #60 (0.005): 0.000*"half" + 0.000*"financial" + 0.000*"northern" + 0.000*"policy" + 0.000*"official" + 0.000*"group" + 0.000*"product" + 0.000*"draft" + 0.000*"administration" + 0.000*"stock" 05:45:28 INFO:topic #42 (0.005): 0.000*"russia" + 0.000*"russian" + 0.000*"industry" + 0.000*"technology" + 0.000*"oil" + 0.000*"world" + 0.000*"export" + 0.000*"price" + 0.000*"diamond" 
+ 0.000*"analyst" 05:45:28 INFO:topic #36 (0.005): 0.026*"bid" + 0.025*"penny" + 0.016*"analyst" + 0.015*"electric" + 0.015*"share" + 0.014*"electricity" + 0.013*"pound" + 0.013*"offer" + 0.011*"northern" + 0.011*"british" 05:45:28 INFO:topic diff=2.202422, rho=0.258199 05:45:28 INFO:PROGRESS: pass 14, at document #2500/2500 05:45:28 DEBUG:performing inference on a chunk of 2500 documents 05:45:36 DEBUG:2500/2500 documents converged within 50 iterations 05:45:36 DEBUG:updating topics 05:45:36 INFO:topic #89 (0.005): 0.001*"bre_x" + 0.001*"x" + 0.001*"bre" + 0.001*"analyst" + 0.001*"barrick" + 0.001*"Bre-X" + 0.001*"government" + 0.001*"gold" + 0.001*"indonesian" + 0.001*"billion" 05:45:36 INFO:topic #121 (0.005): 0.000*"time" + 0.000*"share" + 0.000*"second" + 0.000*"tobacco" + 0.000*"group" + 0.000*"industry" + 0.000*"action" + 0.000*"month" + 0.000*"plan" + 0.000*"hand" 05:45:36 INFO:topic #188 (0.005): 0.015*"sale" + 0.013*"analyst" + 0.011*"share" + 0.008*"mercury" + 0.008*"bank" + 0.007*"stock" + 0.007*"billion" + 0.006*"amp" + 0.006*"think" + 0.006*"base" 05:45:36 INFO:topic #80 (0.005): 0.016*"microsoft" + 0.015*"analyst" + 0.013*"quarter" + 0.010*"windows" + 0.010*"computer" + 0.009*"business" + 0.009*"revenue" + 0.009*"sale" + 0.008*"system" + 0.008*"software" 05:45:36 INFO:topic #56 (0.005): 0.033*"hong" + 0.033*"kong" + 0.032*"hong_kong" + 0.023*"Hong Kong" + 0.017*"china" + 0.007*"Hong Kong's" + 0.006*"chinese" + 0.006*"tung" + 0.006*"british" + 0.004*"government" 05:45:36 INFO:topic diff=1.861192, rho=0.250000 05:45:36 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=3914, num_topics=200, num_authors=50, decay=0.5, chunksize=2500) 05:45:37 INFO:CorpusAccumulator accumulated stats from 1000 documents 05:45:37 INFO:CorpusAccumulator accumulated stats from 2000 documents
-1.93149366596
accuracy_scores_200topic = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_50_20, atmodel_200topics, k=i)
    accuracy_scores_200topic[k] = accuracy
plot_accuracy(scores1=accuracy_scores_150topic, label1="150 topics", scores2=accuracy_scores_200topic, label2="200 topics")
Precision@k: top_n=1 Prediction accuracy: 0.6232
Precision@k: top_n=2 Prediction accuracy: 0.7664
Precision@k: top_n=3 Prediction accuracy: 0.8456
Precision@k: top_n=4 Prediction accuracy: 0.8816
Precision@k: top_n=5 Prediction accuracy: 0.9032
Precision@k: top_n=6 Prediction accuracy: 0.9164
Precision@k: top_n=8 Prediction accuracy: 0.9368
Precision@k: top_n=10 Prediction accuracy: 0.9464
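To put numbers on the comparison, the accuracies printed for the 150-topic and 200-topic runs can be differenced per k. The snippet below is only a sketch: the two dictionaries are typed in by hand from the printed output, and `accuracy_gain` is a hypothetical helper, not part of the tutorial.

```python
# Accuracies copied by hand from the Precision@k output of the two runs.
acc_150 = {1: 0.6004, 2: 0.7632, 3: 0.8452, 4: 0.8796,
           5: 0.8988, 6: 0.914, 8: 0.9324, 10: 0.9464}
acc_200 = {1: 0.6232, 2: 0.7664, 3: 0.8456, 4: 0.8816,
           5: 0.9032, 6: 0.9164, 8: 0.9368, 10: 0.9464}


def accuracy_gain(baseline, candidate):
    """Per-k difference candidate - baseline, rounded to 4 decimals."""
    return {k: round(candidate[k] - baseline[k], 4) for k in sorted(baseline)}


gain = accuracy_gain(acc_150, acc_200)
for k, delta in gain.items():
    print("k=%d: %+.4f" % (k, delta))
```

The largest gain is at k=1 (about 2.3 percentage points); for every other k the difference stays below half a point, and at k=10 the two models tie.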
The 200-topic model seems to perform a bit better for lower k, which might be due to a slight over-representation at a high topic number. So let us stop increasing the number of topics here and focus some more on the dictionary. We can choose either one of the models. Currently we are filtering out tokens that appear in more than 50% of all documents or in fewer than 20 documents, which drastically decreases the size of our dictionary. We know that the underlying topics of this dataset are not very diverse and are structured around a corporate/industrial topic class. Thus it makes sense to enlarge the dictionary by filtering out fewer tokens.
We set the parameters to max_freq=25%, min_wordcount=10.
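The effect of these two thresholds can be sketched on a toy corpus. This is an illustration only, not the tutorial's `create_corpus_dictionary`: the function `filter_vocabulary`, its parameter names and the four toy documents are assumptions, and the sketch assumes both thresholds are applied to document frequencies.

```python
from collections import Counter


def filter_vocabulary(docs, max_freq, min_doc_count):
    """Keep tokens found in at least min_doc_count documents and in at most
    max_freq (a fraction between 0 and 1) of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = max_freq * len(docs)
    return {tok for tok, df in doc_freq.items() if min_doc_count <= df <= max_docs}


# Toy corpus, invented for illustration: 4 tokenized documents.
docs = [
    ["bank", "share", "china"],
    ["bank", "share", "gold"],
    ["bank", "china", "cocoa"],
    ["bank", "gold", "tonne"],
]

strict = filter_vocabulary(docs, max_freq=0.5, min_doc_count=2)
loose = filter_vocabulary(docs, max_freq=0.75, min_doc_count=1)
print(sorted(strict))
print(sorted(loose))
```

Lowering the minimum document count lets rare tokens such as "cocoa" back in, while a token that occurs in every document ("bank") is still dropped under both settings.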
train_corpus_25_10, train_dictionary_25_10 = create_corpus_dictionary(train_docs, 0.25, 10)
06:18:50 INFO:adding document #0 to Dictionary(0 unique tokens: [])
06:18:51 INFO:built Dictionary(46905 unique tokens: ['$83.4 million', 'boarder', '$2.72 billion', 'checking', 'suzuki']...) from 2500 documents (total 786032 corpus positions)
06:18:51 INFO:discarding 40690 tokens: [('$15', 3), ('$17.25', 1), ('$380 million', 2), ('12.5 cents', 7), ('Big B', 3), ('Big B Inc.', 2), ("Big B's", 3), ('Big B. I', 1), ('Dwayne Hoven', 1), ('Eckerd Corp.', 1)]...
06:18:51 INFO:keeping 6215 tokens which were in no less than 10 and no more than 625 (=25.0%) documents
06:18:51 DEBUG:rebuilding dictionary, shrinking gaps
06:18:51 INFO:resulting dictionary: Dictionary(6215 unique tokens: ['offshoot', 'shore', 'loss', 'merger', 'disappointing']...)
test_corpus_25_10 = create_test_corpus(train_dictionary_25_10, test_docs)
print('Number of unique tokens: %d' % len(train_dictionary_25_10))
Number of unique tokens: 6215
We have now increased the dictionary from 3914 to 6215 tokens, roughly 60% more. Let's train and evaluate.
atmodel_150topics_25_10 = train_model(train_corpus_25_10, train_author2doc, train_dictionary_25_10, num_topics=150, eval_every=0, iterations=50, passes=15)
06:18:53 INFO:Vocabulary consists of 6215 words. 06:18:53 INFO:using symmetric alpha at 0.006666666666666667 06:18:53 INFO:using symmetric eta at 0.006666666666666667 06:18:57 INFO:running online author-topic training, 150 topics, 50 authors, 15 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000 06:18:57 INFO:PROGRESS: pass 0, at document #2500/2500 06:18:57 DEBUG:performing inference on a chunk of 2500 documents 06:19:11 DEBUG:17/2500 documents converged within 50 iterations 06:19:11 DEBUG:updating topics 06:19:12 INFO:topic #141 (0.007): 0.031*"gm" + 0.016*"plant" + 0.011*"worker" + 0.010*"uaw" + 0.009*"strike" + 0.009*"truck" + 0.008*"local" + 0.007*"automaker" + 0.006*"part" + 0.005*"contract" 06:19:12 INFO:topic #105 (0.007): 0.013*"china" + 0.010*"tonne" + 0.009*"chinese" + 0.008*"trader" + 0.007*"copper" + 0.007*"product" + 0.005*"drug" + 0.005*"hong_kong" + 0.004*"soybean" + 0.004*"hong" 06:19:12 INFO:topic #15 (0.007): 0.006*"china" + 0.004*"network" + 0.003*"drug" + 0.003*"trade" + 0.003*"united" + 0.003*"states" + 0.003*"boeing" + 0.003*"chinese" + 0.003*"beijing" + 0.002*"product" 06:19:12 INFO:topic #30 (0.007): 0.010*"amp" + 0.009*"bank" + 0.005*"ernst" + 0.005*"claim" + 0.005*"bre" + 0.004*"bre_x" + 0.004*"gold" + 0.003*"rate" + 0.003*"x" + 0.003*"pay" 06:19:12 INFO:topic #114 (0.007): 0.019*"bank" + 0.010*"japan" + 0.009*"pound" + 0.008*"problem" + 0.008*"loan" + 0.007*"financial" + 0.006*"yen" + 0.005*"bt" + 0.005*"million_pound" + 0.005*"japanese" 06:19:12 INFO:topic diff=61.971494, rho=1.000000 06:19:12 INFO:PROGRESS: pass 1, at document #2500/2500 06:19:12 DEBUG:performing inference on a chunk of 2500 documents 06:19:19 DEBUG:2491/2500 documents converged within 50 iterations 06:19:19 DEBUG:updating topics 06:19:19 INFO:topic #45 (0.007): 0.006*"property" + 0.004*"china" + 0.003*"holding" + 0.003*"survey" + 
0.003*"sector" + 0.002*"bank" + 0.002*"gold" + 0.002*"fall" + 0.002*"debt" + 0.002*"air" 06:19:19 INFO:topic #139 (0.007): 0.024*"colombia" + 0.021*"drug" + 0.008*"guerrilla" + 0.008*"colombian" + 0.007*"police" + 0.006*"extradition" + 0.005*"late" + 0.005*"anti" + 0.005*"congress" + 0.005*"contract" 06:19:19 INFO:topic #15 (0.007): 0.005*"china" + 0.003*"network" + 0.002*"drug" + 0.002*"trade" + 0.002*"united" + 0.002*"states" + 0.002*"boeing" + 0.002*"chinese" + 0.002*"beijing" + 0.002*"product" 06:19:19 INFO:topic #2 (0.007): 0.004*"bre_x" + 0.004*"x" + 0.003*"bid" + 0.003*"bre" + 0.003*"product" + 0.003*"Bre-X" + 0.003*"drug" + 0.003*"gold" + 0.002*"mining" + 0.002*"pound" 06:19:19 INFO:topic #116 (0.007): 0.007*"china" + 0.006*"bank" + 0.004*"tonne" + 0.004*"problem" + 0.004*"hong_kong" + 0.003*"trader" + 0.003*"chinese" + 0.003*"loan" + 0.003*"kong" + 0.003*"hong" 06:19:19 INFO:topic diff=11.411593, rho=0.577350 06:19:19 INFO:PROGRESS: pass 2, at document #2500/2500 06:19:19 DEBUG:performing inference on a chunk of 2500 documents 06:19:26 DEBUG:2499/2500 documents converged within 50 iterations 06:19:26 DEBUG:updating topics 06:19:26 INFO:topic #116 (0.007): 0.005*"china" + 0.005*"bank" + 0.003*"tonne" + 0.003*"problem" + 0.003*"hong_kong" + 0.002*"trader" + 0.002*"chinese" + 0.002*"loan" + 0.002*"kong" + 0.002*"hong" 06:19:26 INFO:topic #79 (0.007): 0.030*"china" + 0.020*"beijing" + 0.014*"chinese" + 0.009*"taiwan" + 0.008*"trade" + 0.008*"wang" + 0.007*"foreign" + 0.006*"united" + 0.006*"washington" + 0.006*"states" 06:19:26 INFO:topic #58 (0.007): 0.006*"pound" + 0.004*"million_pound" + 0.003*"hong_kong" + 0.002*"hong" + 0.002*"kong" + 0.002*"pay" + 0.002*"china" + 0.002*"Hong Kong" + 0.002*"shareholder" + 0.002*"service" 06:19:26 INFO:topic #19 (0.007): 0.036*"bre" + 0.035*"x" + 0.033*"bre_x" + 0.031*"gold" + 0.026*"Bre-X" + 0.019*"barrick" + 0.015*"busang" + 0.013*"indonesian" + 0.011*"mining" + 0.009*"deposit" 06:19:26 INFO:topic #112 (0.007): 
0.005*"bank" + 0.003*"russia" + 0.002*"x" + 0.002*"diamond" + 0.002*"bre" + 0.002*"bre_x" + 0.002*"canada" + 0.002*"export" + 0.002*"canadian" + 0.002*"Bre-X" 06:19:26 INFO:topic diff=9.522079, rho=0.500000 06:19:26 INFO:PROGRESS: pass 3, at document #2500/2500 06:19:26 DEBUG:performing inference on a chunk of 2500 documents 06:19:32 DEBUG:2500/2500 documents converged within 50 iterations 06:19:32 DEBUG:updating topics 06:19:33 INFO:topic #38 (0.007): 0.003*"block" + 0.002*"quarter" + 0.002*"service" + 0.002*"compuserve" + 0.002*"china" + 0.002*"pound" + 0.002*"loss" + 0.001*"chinese" + 0.001*"time_warner" + 0.001*"cent" 06:19:33 INFO:topic #148 (0.007): 0.018*"franc" + 0.015*"french" + 0.014*"airbus" + 0.014*"france" + 0.013*"thomson" + 0.009*"air" + 0.009*"billion_franc" + 0.007*"boeing" + 0.007*"state" + 0.007*"air_france" 06:19:33 INFO:topic #9 (0.007): 0.023*"shanghai" + 0.021*"china" + 0.018*"bank" + 0.014*"b" + 0.014*"foreign" + 0.011*"investor" + 0.011*"exchange" + 0.011*"b_share" + 0.010*"beijing" + 0.010*"shenzhen" 06:19:33 INFO:topic #11 (0.007): 0.013*"tobacco" + 0.010*"florida" + 0.009*"quick" + 0.008*"state" + 0.007*"car" + 0.007*"amp" + 0.007*"trial" + 0.006*"cigarette" + 0.006*"television" + 0.006*"maker" 06:19:33 INFO:topic #116 (0.007): 0.004*"china" + 0.003*"bank" + 0.002*"tonne" + 0.002*"problem" + 0.002*"hong_kong" + 0.002*"trader" + 0.002*"chinese" + 0.002*"loan" + 0.001*"kong" + 0.001*"hong" 06:19:33 INFO:topic diff=7.935955, rho=0.447214 06:19:33 INFO:PROGRESS: pass 4, at document #2500/2500 06:19:33 DEBUG:performing inference on a chunk of 2500 documents 06:19:39 DEBUG:2500/2500 documents converged within 50 iterations 06:19:39 DEBUG:updating topics 06:19:39 INFO:topic #118 (0.007): 0.003*"tonne" + 0.002*"cocoa" + 0.002*"exporter" + 0.001*"chad" + 0.001*"bank" + 0.001*"coast" + 0.001*"ivory" + 0.001*"crop" + 0.001*"ivory_coast" + 0.001*"cable" 06:19:39 INFO:topic #134 (0.007): 0.038*"bank" + 0.020*"canada" + 0.017*"canadian" + 
0.011*"toronto" + 0.009*"fund" + 0.008*"cent" + 0.007*"molson" + 0.006*"earning" + 0.005*"royal_bank" + 0.005*"royal" 06:19:39 INFO:topic #7 (0.007): 0.002*"soybean" + 0.002*"china" + 0.002*"monsanto" + 0.002*"director" + 0.002*"adm" + 0.002*"hong" + 0.002*"crop" + 0.001*"hong_kong" + 0.001*"united" + 0.001*"equipment" 06:19:39 INFO:topic #93 (0.007): 0.005*"earning" + 0.005*"point" + 0.004*"quarter" + 0.004*"investor" + 0.004*"fund" + 0.004*"growth" + 0.003*"exchange" + 0.003*"investment" + 0.003*"strong" + 0.003*"trade" 06:19:39 INFO:topic #26 (0.007): 0.012*"bank" + 0.010*"yen" + 0.008*"billion_yen" + 0.005*"financial" + 0.005*"affiliate" + 0.004*"daiwa" + 0.004*"non" + 0.004*"non_bank" + 0.004*"half" + 0.004*"post" 06:19:39 INFO:topic diff=6.627219, rho=0.408248 06:19:39 INFO:PROGRESS: pass 5, at document #2500/2500 06:19:39 DEBUG:performing inference on a chunk of 2500 documents 06:19:45 DEBUG:2500/2500 documents converged within 50 iterations 06:19:45 DEBUG:updating topics 06:19:46 INFO:topic #16 (0.007): 0.030*"toronto" + 0.020*"index" + 0.019*"bank" + 0.018*"canada" + 0.016*"point" + 0.015*"gold" + 0.012*"canadian" + 0.011*"toronto_stock" + 0.011*"fall" + 0.010*"gain" 06:19:46 INFO:topic #114 (0.007): 0.010*"bank" + 0.005*"japan" + 0.005*"pound" + 0.004*"problem" + 0.004*"loan" + 0.003*"financial" + 0.003*"yen" + 0.003*"bt" + 0.003*"million_pound" + 0.003*"japanese" 06:19:46 INFO:topic #52 (0.007): 0.019*"bank" + 0.017*"airbus" + 0.008*"canada" + 0.006*"fund" + 0.006*"canadian" + 0.006*"service" + 0.005*"boeing" + 0.004*"aircraft" + 0.004*"aerospace" + 0.004*"office" 06:19:46 INFO:topic #146 (0.007): 0.007*"china" + 0.005*"party" + 0.004*"pound" + 0.003*"british" + 0.003*"plc" + 0.003*"stg" + 0.003*"drug" + 0.002*"million_pound" + 0.002*"country" + 0.002*"technology" 06:19:46 INFO:topic #47 (0.007): 0.009*"tonne" + 0.008*"smelter" + 0.007*"oil" + 0.007*"aluminium" + 0.006*"state" + 0.006*"plant" + 0.006*"russia" + 0.006*"trader" + 0.006*"source" + 
0.005*"metal" 06:19:46 INFO:topic diff=5.554374, rho=0.377964 06:19:46 INFO:PROGRESS: pass 6, at document #2500/2500 06:19:46 DEBUG:performing inference on a chunk of 2500 documents 06:19:51 DEBUG:2500/2500 documents converged within 50 iterations 06:19:51 DEBUG:updating topics 06:19:52 INFO:topic #44 (0.007): 0.007*"internet" + 0.004*"committee" + 0.003*"proposal" + 0.003*"address" + 0.003*"trade" + 0.003*"china" + 0.003*"congress" + 0.002*"member" + 0.002*"financial" + 0.002*"name" 06:19:52 INFO:topic #131 (0.007): 0.009*"bank" + 0.008*"internet" + 0.007*"court" + 0.005*"exchange" + 0.004*"foreign" + 0.004*"currency" + 0.004*"trading" + 0.004*"policy" + 0.003*"law" + 0.003*"security" 06:19:52 INFO:topic #112 (0.007): 0.001*"bank" + 0.001*"russia" + 0.001*"x" + 0.001*"diamond" + 0.001*"bre" + 0.001*"bre_x" + 0.001*"canada" + 0.001*"export" + 0.001*"canadian" + 0.001*"Bre-X" 06:19:52 INFO:topic #49 (0.007): 0.008*"bid" + 0.008*"penny" + 0.005*"pound" + 0.004*"northern" + 0.004*"electric" + 0.003*"midlands" + 0.003*"offer" + 0.003*"sector" + 0.003*"electricity" + 0.003*"east" 06:19:52 INFO:topic #61 (0.007): 0.008*"china" + 0.008*"tibet" + 0.005*"chinese" + 0.005*"beijing" + 0.005*"foreign" + 0.004*"wang" + 0.004*"hong_kong" + 0.004*"kong" + 0.003*"hong" + 0.003*"region" 06:19:52 INFO:topic diff=4.666072, rho=0.353553 06:19:52 INFO:PROGRESS: pass 7, at document #2500/2500 06:19:52 DEBUG:performing inference on a chunk of 2500 documents 06:19:58 DEBUG:2500/2500 documents converged within 50 iterations 06:19:58 DEBUG:updating topics 06:19:58 INFO:topic #8 (0.007): 0.001*"french" + 0.001*"bank" + 0.001*"service" + 0.001*"financial" + 0.001*"internet" + 0.001*"china" + 0.000*"mfs" + 0.000*"sell" + 0.000*"state" + 0.000*"product" 06:19:58 INFO:topic #45 (0.007): 0.001*"property" + 0.000*"china" + 0.000*"holding" + 0.000*"survey" + 0.000*"sector" + 0.000*"bank" + 0.000*"gold" + 0.000*"fall" + 0.000*"debt" + 0.000*"air" 06:19:58 INFO:topic #22 (0.007): 0.016*"pound" + 
0.012*"drug" + 0.011*"plc" + 0.011*"british" + 0.011*"million_pound" + 0.008*"product" + 0.007*"penny" + 0.006*"cancer" + 0.006*"stg" + 0.005*"biotech" 06:19:58 INFO:topic #10 (0.007): 0.023*"bank" + 0.015*"pound" + 0.008*"society" + 0.006*"banking" + 0.006*"fund" + 0.006*"shareholder" + 0.005*"investment" + 0.005*"eurotunnel" + 0.005*"lloyds" + 0.005*"debt" 06:19:58 INFO:topic #11 (0.007): 0.013*"tobacco" + 0.012*"florida" + 0.009*"quick" + 0.009*"state" + 0.008*"car" + 0.007*"amp" + 0.007*"trial" + 0.007*"television" + 0.006*"news" + 0.006*"maker" 06:19:58 INFO:topic diff=3.925478, rho=0.333333 06:19:58 INFO:PROGRESS: pass 8, at document #2500/2500 06:19:58 DEBUG:performing inference on a chunk of 2500 documents 06:20:04 DEBUG:2500/2500 documents converged within 50 iterations 06:20:04 DEBUG:updating topics 06:20:04 INFO:topic #9 (0.007): 0.024*"shanghai" + 0.022*"china" + 0.018*"bank" + 0.014*"b" + 0.014*"foreign" + 0.011*"investor" + 0.011*"exchange" + 0.011*"b_share" + 0.010*"beijing" + 0.010*"shenzhen" 06:20:04 INFO:topic #42 (0.007): 0.001*"news" + 0.000*"china" + 0.000*"corp" + 0.000*"net" + 0.000*"property" + 0.000*"news_corp" + 0.000*"value" + 0.000*"shareholder" + 0.000*"bre_x" + 0.000*"x" 06:20:04 INFO:topic #131 (0.007): 0.006*"bank" + 0.005*"internet" + 0.004*"court" + 0.003*"exchange" + 0.003*"foreign" + 0.003*"currency" + 0.002*"trading" + 0.002*"policy" + 0.002*"law" + 0.002*"security" 06:20:04 INFO:topic #99 (0.007): 0.011*"mci" + 0.007*"digital" + 0.007*"camera" + 0.006*"rockwell" + 0.005*"technology" + 0.005*"kong" + 0.005*"hand" + 0.005*"hong_kong" + 0.005*"system" + 0.004*"trade" 06:20:04 INFO:topic #5 (0.007): 0.018*"bt" + 0.013*"telecom" + 0.011*"pound" + 0.010*"british" + 0.008*"mci" + 0.007*"service" + 0.006*"merger" + 0.005*"penny" + 0.005*"britain" + 0.005*"ntt" 06:20:04 INFO:topic diff=3.304076, rho=0.316228 06:20:04 INFO:PROGRESS: pass 9, at document #2500/2500 06:20:04 DEBUG:performing inference on a chunk of 2500 documents 06:20:10 
DEBUG:2500/2500 documents converged within 50 iterations 06:20:10 DEBUG:updating topics 06:20:10 INFO:topic #108 (0.007): 0.005*"gm" + 0.004*"computer" + 0.004*"quarter" + 0.004*"ibm" + 0.004*"car" + 0.004*"technology" + 0.004*"france" + 0.003*"thomson" + 0.003*"plant" + 0.003*"service" 06:20:11 INFO:topic #111 (0.007): 0.008*"computer" + 0.006*"software" + 0.005*"apple" + 0.005*"quarter" + 0.004*"microsoft" + 0.003*"technology" + 0.003*"design" + 0.003*"pc" + 0.002*"oracle" + 0.002*"financial" 06:20:11 INFO:topic #82 (0.007): 0.003*"china" + 0.002*"shanghai" + 0.002*"future" + 0.002*"exchange" + 0.001*"b" + 0.001*"index" + 0.001*"authority" + 0.001*"investor" + 0.001*"trading" + 0.001*"foreign" 06:20:11 INFO:topic #89 (0.007): 0.016*"internet" + 0.015*"computer" + 0.014*"technology" + 0.010*"quarter" + 0.010*"software" + 0.009*"product" + 0.008*"microsoft" + 0.008*"sun" + 0.007*"netscape" + 0.007*"web" 06:20:11 INFO:topic #101 (0.007): 0.001*"china" + 0.001*"kong" + 0.001*"hong" + 0.001*"hong_kong" + 0.000*"Hong Kong" + 0.000*"macau" + 0.000*"tung" + 0.000*"chinese" + 0.000*"formula" + 0.000*"beijing" 06:20:11 INFO:topic diff=2.781140, rho=0.301511 06:20:11 INFO:PROGRESS: pass 10, at document #2500/2500 06:20:11 DEBUG:performing inference on a chunk of 2500 documents 06:20:16 DEBUG:2500/2500 documents converged within 50 iterations 06:20:16 DEBUG:updating topics 06:20:17 INFO:topic #60 (0.007): 0.011*"oil" + 0.009*"colombia" + 0.008*"colombian" + 0.008*"paramilitary" + 0.008*"country" + 0.008*"drug" + 0.008*"police" + 0.007*"attack" + 0.007*"force" + 0.007*"medellin" 06:20:17 INFO:topic #99 (0.007): 0.010*"mci" + 0.008*"digital" + 0.007*"camera" + 0.007*"rockwell" + 0.005*"technology" + 0.005*"hand" + 0.005*"system" + 0.005*"agreement" + 0.005*"personal" + 0.005*"trade" 06:20:17 INFO:topic #109 (0.007): 0.025*"pound" + 0.016*"million_pound" + 0.012*"life" + 0.011*"insurance" + 0.011*"scotam" + 0.009*"offer" + 0.009*"abbey" + 0.007*"policyholder" + 0.007*"british" 
+ 0.006*"scottish" 06:20:17 INFO:topic #55 (0.007): 0.028*"internet" + 0.027*"court" + 0.019*"foreign" + 0.017*"exchange" + 0.017*"currency" + 0.014*"case" + 0.014*"foreign_currency" + 0.014*"trading" + 0.012*"amendment" + 0.012*"address" 06:20:17 INFO:topic #9 (0.007): 0.024*"shanghai" + 0.022*"china" + 0.018*"bank" + 0.014*"b" + 0.014*"foreign" + 0.011*"investor" + 0.011*"exchange" + 0.011*"b_share" + 0.010*"beijing" + 0.010*"shenzhen" 06:20:17 INFO:topic diff=2.340896, rho=0.288675 06:20:17 INFO:PROGRESS: pass 11, at document #2500/2500 06:20:17 DEBUG:performing inference on a chunk of 2500 documents 06:20:22 DEBUG:2500/2500 documents converged within 50 iterations 06:20:22 DEBUG:updating topics 06:20:23 INFO:topic #20 (0.007): 0.000*"china" + 0.000*"de" + 0.000*"russia" + 0.000*"chinese" + 0.000*"beijing" + 0.000*"diamond" + 0.000*"kong" + 0.000*"export" + 0.000*"oil" + 0.000*"service" 06:20:23 INFO:topic #7 (0.007): 0.000*"soybean" + 0.000*"china" + 0.000*"monsanto" + 0.000*"director" + 0.000*"adm" + 0.000*"hong" + 0.000*"crop" + 0.000*"hong_kong" + 0.000*"united" + 0.000*"equipment" 06:20:23 INFO:topic #24 (0.007): 0.024*"czech" + 0.011*"crown" + 0.011*"bank" + 0.010*"prague" + 0.010*"klaus" + 0.008*"party" + 0.006*"havel" + 0.006*"foreign" + 0.006*"country" + 0.006*"election" 06:20:23 INFO:topic #80 (0.007): 0.025*"king" + 0.021*"silver" + 0.013*"network" + 0.012*"station" + 0.012*"shopping" + 0.012*"home_shopping" + 0.012*"television" + 0.009*"latin" + 0.009*"news" + 0.009*"home" 06:20:23 INFO:topic #82 (0.007): 0.001*"china" + 0.001*"shanghai" + 0.001*"future" + 0.001*"exchange" + 0.001*"b" + 0.001*"index" + 0.001*"authority" + 0.001*"investor" + 0.001*"trading" + 0.001*"foreign" 06:20:23 INFO:topic diff=1.970765, rho=0.277350 06:20:23 INFO:PROGRESS: pass 12, at document #2500/2500 06:20:23 DEBUG:performing inference on a chunk of 2500 documents 06:20:29 DEBUG:2500/2500 documents converged within 50 iterations 06:20:29 DEBUG:updating topics 06:20:30 
INFO:topic #123 (0.007): 0.021*"china" + 0.013*"wang" + 0.011*"beijing" + 0.010*"chinese" + 0.007*"tibet" + 0.006*"dissident" + 0.006*"state" + 0.006*"party" + 0.005*"communist" + 0.005*"court" 06:20:30 INFO:topic #86 (0.007): 0.015*"internet" + 0.014*"computer" + 0.014*"ibm" + 0.012*"quarter" + 0.011*"service" + 0.009*"pc" + 0.008*"software" + 0.007*"system" + 0.007*"consumer" + 0.006*"network" 06:20:30 INFO:topic #50 (0.007): 0.034*"hong_kong" + 0.033*"kong" + 0.033*"hong" + 0.029*"china" + 0.021*"Hong Kong" + 0.018*"tung" + 0.013*"beijing" + 0.012*"chinese" + 0.012*"Hong Kong's" + 0.010*"britain" 06:20:30 INFO:topic #9 (0.007): 0.025*"shanghai" + 0.022*"china" + 0.018*"bank" + 0.014*"b" + 0.014*"foreign" + 0.011*"investor" + 0.011*"exchange" + 0.011*"b_share" + 0.010*"beijing" + 0.010*"shenzhen" 06:20:30 INFO:topic #24 (0.007): 0.024*"czech" + 0.011*"crown" + 0.011*"bank" + 0.010*"prague" + 0.009*"klaus" + 0.008*"party" + 0.006*"havel" + 0.006*"foreign" + 0.006*"country" + 0.006*"election" 06:20:30 INFO:topic diff=1.660230, rho=0.267261 06:20:30 INFO:PROGRESS: pass 13, at document #2500/2500 06:20:30 DEBUG:performing inference on a chunk of 2500 documents 06:20:37 DEBUG:2500/2500 documents converged within 50 iterations 06:20:37 DEBUG:updating topics 06:20:38 INFO:topic #108 (0.007): 0.003*"gm" + 0.002*"computer" + 0.002*"quarter" + 0.002*"ibm" + 0.002*"car" + 0.002*"technology" + 0.002*"france" + 0.002*"thomson" + 0.002*"plant" + 0.002*"service" 06:20:38 INFO:topic #116 (0.007): 0.000*"china" + 0.000*"bank" + 0.000*"tonne" + 0.000*"problem" + 0.000*"hong_kong" + 0.000*"trader" + 0.000*"chinese" + 0.000*"loan" + 0.000*"kong" + 0.000*"hong" 06:20:38 INFO:topic #59 (0.007): 0.000*"pound" + 0.000*"lloyds" + 0.000*"bank" + 0.000*"pension" + 0.000*"insurance" + 0.000*"amp" + 0.000*"bhp" + 0.000*"claim" + 0.000*"million_pound" + 0.000*"scottish" 06:20:38 INFO:topic #19 (0.007): 0.034*"bre" + 0.033*"x" + 0.032*"bre_x" + 0.031*"gold" + 0.025*"Bre-X" + 0.018*"barrick" + 
0.015*"busang" + 0.013*"indonesian" + 0.011*"mining" + 0.008*"exploration" 06:20:38 INFO:topic #112 (0.007): 0.000*"bank" + 0.000*"russia" + 0.000*"x" + 0.000*"diamond" + 0.000*"bre" + 0.000*"bre_x" + 0.000*"canada" + 0.000*"export" + 0.000*"canadian" + 0.000*"Bre-X" 06:20:38 INFO:topic diff=1.400248, rho=0.258199 06:20:38 INFO:PROGRESS: pass 14, at document #2500/2500 06:20:38 DEBUG:performing inference on a chunk of 2500 documents 06:20:47 DEBUG:2500/2500 documents converged within 50 iterations 06:20:47 DEBUG:updating topics 06:20:47 INFO:topic #43 (0.007): 0.001*"czech" + 0.001*"party" + 0.001*"klaus" + 0.001*"coalition" + 0.001*"election" + 0.001*"havel" + 0.000*"house" + 0.000*"crown" + 0.000*"prague" + 0.000*"parliament" 06:20:47 INFO:topic #49 (0.007): 0.001*"bid" + 0.001*"penny" + 0.001*"pound" + 0.001*"northern" + 0.001*"electric" + 0.000*"midlands" + 0.000*"offer" + 0.000*"sector" + 0.000*"electricity" + 0.000*"east" 06:20:47 INFO:topic #122 (0.007): 0.000*"wang" + 0.000*"china" + 0.000*"beijing" + 0.000*"law" + 0.000*"trial" + 0.000*"death" + 0.000*"dissident" + 0.000*"pound" + 0.000*"hong" + 0.000*"sentence" 06:20:47 INFO:topic #132 (0.007): 0.001*"bank" + 0.000*"crown" + 0.000*"klaus" + 0.000*"czech" + 0.000*"social" + 0.000*"banka" + 0.000*"party" + 0.000*"minister" + 0.000*"state" + 0.000*"billion_crown" 06:20:47 INFO:topic #16 (0.007): 0.030*"toronto" + 0.020*"index" + 0.019*"bank" + 0.019*"canada" + 0.017*"gold" + 0.016*"point" + 0.012*"canadian" + 0.011*"toronto_stock" + 0.011*"fall" + 0.010*"gain" 06:20:47 INFO:topic diff=1.182972, rho=0.250000 06:20:47 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=6215, num_topics=150, num_authors=50, decay=0.5, chunksize=2500) 06:20:47 INFO:CorpusAccumulator accumulated stats from 1000 documents 06:20:48 INFO:CorpusAccumulator accumulated stats from 2000 documents
-2.83261288295
accuracy_scores_150topic_25_10 = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_25_10, atmodel_150topics_25_10, k=i)
    accuracy_scores_150topic_25_10[k] = accuracy

plot_accuracy(scores1=accuracy_scores_150topic_25_10, label1="150 topics, max_freq=25%, min_wordcount=10", scores2=accuracy_scores_150topic, label2="150 topics, standard")
Precision@k: top_n=1 Prediction accuracy: 0.6176
Precision@k: top_n=2 Prediction accuracy: 0.7712
Precision@k: top_n=3 Prediction accuracy: 0.8268
Precision@k: top_n=4 Prediction accuracy: 0.8656
Precision@k: top_n=5 Prediction accuracy: 0.8916
Precision@k: top_n=6 Prediction accuracy: 0.9112
Precision@k: top_n=8 Prediction accuracy: 0.9308
Precision@k: top_n=10 Prediction accuracy: 0.9408
The results seem rather ambiguous and do not show a clear trend, which is why we stop iterating over the preprocessing settings here.
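Besides reading the plot, one way to check for a trend is to compare the two runs numerically at each k. The following is a minimal sketch, assuming both score dictionaries map k to accuracy as they do in the loop above. The helper compare_scores is hypothetical and not part of the notebook, and the toy numbers are made-up placeholders, not results from this tutorial.

```python
def compare_scores(scores_a, scores_b):
    """Return {k: scores_a[k] - scores_b[k]} for every k present in both runs.

    Hypothetical helper: a positive value means run A was more accurate at that k.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    return {k: scores_a[k] - scores_b[k] for k in shared}


# Made-up placeholder values, only to show the call; not tutorial results.
toy_a = {1: 0.50, 2: 0.75}
toy_b = {1: 0.25, 2: 0.75, 3: 0.90}

diffs = compare_scores(toy_a, toy_b)
for k, d in diffs.items():
    print("k=%d: %+.4f" % (k, d))
```

In the notebook itself, the call would presumably be compare_scores(accuracy_scores_150topic_25_10, accuracy_scores_150topic). Differences that change sign across k, or stay very small, would support the reading that neither setting is clearly better.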