This notebook is an experiment to see if a pure scikit-learn implementation of the fastText model can work better than a linear model on a small text classification problem: 20 newsgroups.
http://arxiv.org/abs/1607.01759
Those models are very similar to Deep Averaging Networks (with only one hidden layer and a linear activation function):
https://www.cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf
Note that scikit-learn does not provide a hierarchical softmax implementation (but we don't need it on 20 newsgroups anyway).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
twentyng_train = fetch_20newsgroups(
subset='train',
#remove=('headers', 'footers'),
)
docs_train, target_train = twentyng_train.data, twentyng_train.target
twentyng_test = fetch_20newsgroups(
subset='test',
#remove=('headers', 'footers'),
)
docs_test, target_test = twentyng_test.data, twentyng_test.target
2 ** 18
262144
The following uses the hashing trick on unigrams and bigrams. binary=True
means that repeated tokens in a document are counted only once. The l1
normalization ensures that we "average" the embeddings of the tokens in the document instead of summing them.
%%time
vec = HashingVectorizer(
encoding='latin-1', binary=True, ngram_range=(1, 2),
norm='l1', n_features=2 ** 18)
X_train = vec.transform(docs_train)
X_test = vec.transform(docs_test)
CPU times: user 16.8 s, sys: 116 ms, total: 16.9 s
Wall time: 16.9 s
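As a quick sanity check (an aside, not part of the original experiment), the effect of binary=True and norm='l1' can be verified on a toy document; alternate_sign=False is used here so the hashed values stay non-negative:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Toy vectorizer mirroring the settings above, on a much smaller hash space.
toy_vec = HashingVectorizer(binary=True, norm='l1', alternate_sign=False,
                            n_features=2 ** 10)
a = toy_vec.transform(["spam spam eggs"]).toarray()[0]
b = toy_vec.transform(["spam eggs"]).toarray()[0]

# binary=True: the repeated token counts only once, so both rows match...
print(np.array_equal(a, b))  # True
# ...and norm='l1' scales each row to unit l1 norm.
print(np.abs(a).sum())       # 1.0
```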
first_doc_vectors = X_train[:3].toarray()
first_doc_vectors
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
first_doc_vectors.min(axis=1)
array([ 0., 0., 0.])
first_doc_vectors.max(axis=1)
array([ 0.0049505 , 0.00469484, 0.00200401])
first_doc_vectors.sum(axis=1)
array([ 1., 1., 1.])
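Each row sums to one. To see why that makes the network's first layer compute an *average* of token embeddings, here is a small numpy sketch with a made-up embedding matrix (the sizes and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical tiny vocabulary of 5 hashed features with 3 embedding dims.
n_features, n_dims = 5, 3
embeddings = rng.randn(n_features, n_dims)

# A document whose tokens hash to features 0, 2 and 2 (a repeated token).
# With binary=True the repeat is ignored, so the active features are {0, 2}.
x = np.zeros(n_features)
x[[0, 2]] = 1.0
x /= x.sum()  # l1 normalization: each active feature gets weight 1/2

# Multiplying by the embedding matrix (the first layer of an MLP)
# therefore yields the average of the active token embeddings.
doc_embedding = x @ embeddings
expected = embeddings[[0, 2]].mean(axis=0)
print(np.allclose(doc_embedding, expected))  # True
```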
Baseline: OvR logistic regression (SGDClassifier does not implement the multinomial logistic regression loss, only its one-vs-rest reduction). In practice, the OvR reduction seems to work well enough.
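The OvR reduction can be made explicit with OneVsRestClassifier; the sketch below uses iris and plain LogisticRegression rather than 20 newsgroups and SGD (both substitutions are assumptions made to keep the aside small):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr.fit(X, y)
print(len(ovr.estimators_))  # 3

# The multiclass prediction is the argmax of the per-class scores.
scores = ovr.decision_function(X)
print(np.array_equal(scores.argmax(axis=1), ovr.predict(X)))
```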
%%time
from sklearn.linear_model import SGDClassifier
lr = SGDClassifier(loss='log', alpha=1e-10, n_iter=50, n_jobs=-1)
lr.fit(X_train, target_train)
CPU times: user 1min 46s, sys: 6.69 s, total: 1min 53s
Wall time: 11.1 s
%%time
print("train score: %0.3f" % lr.score(X_train, target_train))
print("test score: %0.3f" % lr.score(X_test, target_test))
train score: 1.000
test score: 0.827
CPU times: user 588 ms, sys: 289 ms, total: 877 ms
Wall time: 602 ms
Let's now use scikit-learn's MLPClassifier to add a single hidden layer with a small number of hidden units.
Note: instead of tanh or relu we would rather use a linear / identity activation function for the hidden layer, but this is not (yet) implemented in scikit-learn.
In that respect the following model is closer to a Deep Averaging Network (without dropout) than to fastText.
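For intuition, the whole architecture can be sketched as a plain numpy forward pass: averaged embeddings, one small hidden layer, softmax output. The weights are random, biases are omitted, and the hash space is shrunk from 2 ** 18 to 2 ** 10 (all assumptions for the sketch); passing `activation=lambda h: h` would give the fastText-style identity activation discussed above.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical shapes: hashed bag-of-words input, 10 hidden units,
# 20 output classes (one per newsgroup).
n_features, n_hidden, n_classes = 2 ** 10, 10, 20
W1 = rng.randn(n_features, n_hidden) * 0.01   # "embedding" layer
W2 = rng.randn(n_hidden, n_classes) * 0.01    # output layer

def forward(x, activation=np.tanh):
    """Averaged embeddings -> activation -> softmax (biases omitted)."""
    h = activation(x @ W1)        # identity here would mimic fastText
    z = h @ W2
    e = np.exp(z - z.max())       # stable softmax
    return e / e.sum()

# An l1-normalized bag-of-words vector with two active hashed features:
x = np.zeros(n_features)
x[[0, 7]] = 0.5

proba = forward(x)
print(proba.shape, np.isclose(proba.sum(), 1.0))  # (20,) True
```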
%%time
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(algorithm='adam', learning_rate_init=0.01,
hidden_layer_sizes=10, max_iter=100, activation='tanh', verbose=100,
early_stopping=True, validation_fraction=0.05, alpha=1e-10)
mlp.fit(X_train, target_train)
Iteration 1, loss = 2.94108225
Validation score: 0.464664
Iteration 2, loss = 2.49072336
Validation score: 0.639576
Iteration 3, loss = 1.63266821
Validation score: 0.810954
Iteration 4, loss = 0.90327443
Validation score: 0.869258
Iteration 5, loss = 0.48531751
Validation score: 0.893993
Iteration 6, loss = 0.27329257
Validation score: 0.909894
Iteration 7, loss = 0.16704835
Validation score: 0.911661
Iteration 8, loss = 0.11122343
Validation score: 0.918728
Iteration 9, loss = 0.07885910
Validation score: 0.918728
Iteration 10, loss = 0.05876991
Validation score: 0.924028
Iteration 11, loss = 0.04566916
Validation score: 0.920495
Iteration 12, loss = 0.03644058
Validation score: 0.915194
Iteration 13, loss = 0.02982519
Validation score: 0.922261
Validation score did not improve more than tol=0.000100 for two consecutive epochs. Stopping.
CPU times: user 1min 21s, sys: 187 ms, total: 1min 21s
Wall time: 1min 21s
%%time
print("train score: %0.3f" % mlp.score(X_train, target_train))
print("test score: %0.3f" % mlp.score(X_test, target_test))
train score: 0.996
test score: 0.801
CPU times: user 304 ms, sys: 54 µs, total: 304 ms
Wall time: 303 ms