gensim, a topic modelling library in Python.
$ \min_{\beta} -\sum_n \log p(y_n \mid X_n, \beta) + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 $
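As a rough illustration of the regularized objective above, the sketch below evaluates it for a binary logistic model. It assumes the objective is read as a penalized negative log-likelihood and that the L2 term is squared, as in the usual elastic-net convention. The function names and the toy data are hypothetical and do not come from any particular library.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def elastic_net_objective(beta, X, y, lambda1, lambda2):
    """Negative log-likelihood plus L1 and squared-L2 penalties (assumed form)."""
    nll = 0.0
    for x_n, y_n in zip(X, y):
        # p(y_n = 1 | x_n, beta) under a logistic model
        p = sigmoid(sum(b * x for b, x in zip(beta, x_n)))
        nll -= math.log(p if y_n == 1 else 1.0 - p)
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return nll + lambda1 * l1 + lambda2 * l2_squared

# Toy data: two examples with two features each (hypothetical values).
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1, 0]

# At beta = 0 every prediction is 0.5, so the value is 2 * log(2).
value_at_zero = elastic_net_objective([0.0, 0.0], X, y, 0.5, 0.25)
value_at_beta = elastic_net_objective([1.0, -1.0], X, y, 0.5, 0.25)
```

A larger lambda_1 pushes more coefficients to exactly zero, while lambda_2 shrinks all of them smoothly.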
Use the chain rule of probability to chain classifiers.
p(ROOT, electronics, ... | X) = p(ROOT | X) * p(electronics | ROOT, X) * ...
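The factorization above can be sketched as a product of per-node conditional probabilities along a path. In this minimal example the `conditionals` table stands in for the outputs of the per-node classifiers on one item X. Its values, and the `cameras` category, are made up for illustration.

```python
def path_probability(path, conditionals):
    """Multiply p(node | parent, X) along a root-to-leaf path."""
    prob = 1.0
    parent = None
    for node in path:
        prob *= conditionals[(parent, node)]
        parent = node
    return prob

# Hypothetical classifier outputs for a single item X.
conditionals = {
    (None, "ROOT"): 1.0,
    ("ROOT", "electronics"): 0.7,
    ("electronics", "cameras"): 0.6,
}

p = path_probability(["ROOT", "electronics", "cameras"], conditionals)
```

For deep trees it is usually safer to sum log-probabilities instead of multiplying raw probabilities, which avoids underflow.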
Use a greedy algorithm to traverse all paths.
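One way to read the greedy step is: start at the root and, at each node, descend into the child with the highest conditional probability. The sketch below assumes that reading. The `tree` layout, the `score` callback, and the category names are hypothetical stand-ins for the trained per-node classifiers.

```python
def greedy_path(tree, score, root="ROOT"):
    """Follow the most probable child at each level until a leaf is reached."""
    path, prob, node = [root], 1.0, root
    while tree.get(node):  # stop when the node has no children
        best = max(tree[node], key=lambda child: score(node, child))
        prob *= score(node, best)
        path.append(best)
        node = best
    return path, prob

# Hypothetical category tree and conditional probabilities for one item.
tree = {
    "ROOT": ["electronics", "books"],
    "electronics": ["cameras", "phones"],
    "books": [],
}
scores = {
    ("ROOT", "electronics"): 0.7,
    ("ROOT", "books"): 0.3,
    ("electronics", "cameras"): 0.6,
    ("electronics", "phones"): 0.4,
}

def score(parent, child):
    return scores[(parent, child)]

path, prob = greedy_path(tree, score)
```

Greedy descent evaluates only one classifier per level, at the cost of possibly missing a path whose overall product is higher.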
Simple MapReduce task for data cleaning and feature extraction.
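def mapper(id, category, features):
    # Emit one record per (fold, classifier, hyperparameter) combination.
    # `hyperparameters` is an assumed name for the grid of settings to try.
    for subcat in lineage(category):
        for hyper in hyperparameters:
            for fold in range(num_folds):
                # The key is a tuple so that the reducer can unpack it.
                yield (fold, subcat, hyper), (id, category, features)

def reducer(key, data):
    fold, subcat, hyper = key
    model = wapiti.train( d in data if d['fold'] ...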
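The mapper and reducer outlined above can be exercised locally, without Hadoop, by simulating the shuffle step in memory. The following is only a sketch. `NUM_FOLDS`, `HYPERPARAMETERS`, the slash-separated category format used by `lineage`, and `run_local` are assumptions made for illustration, and the reducer counts records as a stand-in for the model training step, which is elided in the notes.

```python
from collections import defaultdict

NUM_FOLDS = 2                  # hypothetical number of cross-validation folds
HYPERPARAMETERS = [0.1, 1.0]   # hypothetical regularization settings

def lineage(category):
    """Yield each ancestor of a slash-separated category path (assumed format)."""
    parts = category.split("/")
    for i in range(1, len(parts) + 1):
        yield "/".join(parts[:i])

def mapper(item_id, category, features):
    """Emit the item once per (fold, classifier, hyperparameter) combination."""
    for subcat in lineage(category):
        for hyper in HYPERPARAMETERS:
            for fold in range(NUM_FOLDS):
                yield (fold, subcat, hyper), (item_id, category, features)

def reducer(key, data):
    """Stand-in for training: report how many records reached this key."""
    yield key, len(list(data))

def run_local(records):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(*record):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

# Two hypothetical items with toy features.
records = [
    (1, "ROOT/electronics", {"word:camera": 1.0}),
    (2, "ROOT/books", {"word:novel": 1.0}),
]
results = run_local(records)
```

Each reducer call sees all the data for one classifier under one fold and one hyperparameter setting, so every model in the grid can be trained independently and in parallel.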
Use Dumbo to run the job on Hadoop.