Notebook

Comparison of different Regularizers for Linear Models¶

Summary

Linear model (with regularization) is usually useful because
- in large scale (online) learning, linear model is cheap to build
- in high dimensional data, linear model is less likely to overfit
- coefficients learned by linear model can be used as feature selection indicators
- linear combination of some features can be used as new features in feature transformation
Ridge penalizes l2 and thus does shrinkage. it supports Ridge and BayesianRidge (for regression) and RidgeClassifier (for classification). parameter: alpha, more parameters for BayesianRidge
Lasso and Lars penalizes l1 norm and thus achieves sparsity.
- Difference between the two is lasso uses coordinate-decent and thus scales with large scale data, whereas lars tends to give better (sparse) results for highly sparse solution (few rows and many columns).
- Both of them are implemented as regression in sklearn, including Lasso, Lars, LassoLars, and MultiTaskLasso (penalized with combined l1 and l2 norm)
- Difference between Lars and LassoLars implementations is that Lars accpet n_nonzero_coefs as parameter whereas LassoLars accept the traditional alpha
- MultiTaskLasso is specially useful to fit multiple regression problems jointly enforcing the selected features to be the same accross tasks. In other words, the tasks are NOT independent of each other any more -- this is usually useful to model time-series data with time-varying linear models, the coeffs of models could change along time, but their importances will not. Regression ONLY
- RandomizedLasso works by resampling the train data and computing a Lasso on each resampling. In short, the features selected more often are good features. It is also known as stability selection. Regression ONLY
ElasticNet penalizes both l1 and l2 norms at the same time, by specifing params alpha and l1_ratio, so it achieves both shrinkage and sparsity
- MultiTaskElasticNet is the mutlitask version of it. It is trained with trained with L1/L2 mixed-norm as regularizer so it jointly enforcing selected features across tasks
- both of them are for regression only currently
- compared to LASSO, advantages of ElasticNet and SGD (p1 and p2 combined) are
  - As for Lasso vs ElasticNet, ElasticNet will tend to select more variables hence lead to larger models (also more expensive to train) but also be more accurate in general. In particular Lasso is very sensitive to correlation between features and might select randomly one out of 2 very correlated informative features while ElasticNet will be more likely to select both which should lead to a more stable model (in terms of generalization ability so new samples). source http://stackoverflow.com/questions/12283184/how-is-elastic-net-used
  - For example, in the "large p, small n problem" case, the LASSO selects at most n variables before it saturates. Also if there is a group of highly correlated variables, then the LASSO tends to select one variable from a group and ignore the others. source http://en.wikipedia.org/wiki/Elastic_net_regularization
Logistic Regression is like the implementation of Ridge/Lasso for classification problems. It penalizes l1 or l2, but not at the same time.
- Its parameter C plays the oppsite role as of alpha in Ridge/Lasso. C is the weight on the data-fitting term, alpha is the weight on the regulariztaion term
- Its parameter dual specifies implementing as a dual or primal formulation for l2 norm only usually Prefer dual=False when n_samples > n_features.
- As many classifiers, it is able to predict_proba.
- RandomizedLogisticRegression is similiar to RandomizedLasso, which does stability selection by resampling the data several times. For large data, use n_jobs=-1 and memory='memory_cache_path' for computation efficiency
ARDRegression - Bayesian version of linear Regression with ARD prior, see also BayesianRidge
OrthogonalMatchingPursuit - For regression tasks, it essentially penalized l0 norm, i.e., number of features used
PassiveAggressive - PassiveAggressiveClassifier and PasssiveAggressiveRegressor for both classification and regression.
- One of the key features is that those models support partial_fit, which means they can be used in ONLINE Learning mode, which is specially useful for LARGE scale learning
- The parameter C controls regularization. Performance of algorithm is SENSITIVE to C values
- n_jobs implements parallelism, but only for multi-class learning
- reference: http://jmlr.csail.mit.edu/papers/volume7/crammer06a/crammer06a.pdf
- by default PassiveAggresive cannot generate sparse solutions
SGD
- SGD supports ONLINE mode partial_fit, similiar to PassiveAggressive
- SGD supports l1, l2 and elasticnet penalties, similiar to ElasticNet
- It supports Classification and Regression, and Feature Selection
- It supports different type of loss function such as ‘hinge’ or ‘log’ or ‘modified_huber’. The loss function to be used. Defaults to ‘hinge’. The hinge loss is a margin loss used by standard linear SVM models. The ‘log’ loss is the loss of logistic regression models and can be used for probability estimation in binary classifiers. ‘modified_huber’ is another smooth loss that brings tolerance to outliers.
- Main complexity control parameters are alpha and l1_ratio. Parameter n_iter and learning_rate may also influence accuracies
- results are usually sensitive to normalization
- class_weights can be used to handle imbalanced classification problem, e.g., with value auto
Perceptron
LinearSVM
Tricks
- Reguarlization is seldom useful is the feature is not representative enogugh (like in the case of blackbox)
- Reguarlizatin (sparsity) is specially useful for a lot of features but we need to find a subset
- When cross-validated properly, there is no significante difference between the performances of different linear models (regularized based on cv), yet their computation cost could be very different. Online MOdels are faster and more memory-efficient in general
- Feature engineering by decision-tree, clustering or deep-learning will come in handy when underfitting is a main issue
- normalizaion (StandScaler) is important to SGD and clustering, but will make a sparse matrix to be dense again

In [82]:

import cPickle
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import KFold
import matplotlib.pylab as plt
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing
from sklearn import decomposition
from sklearn import cluster
from IPython.parallel import Client
from scipy import sparse

def make_cv(n_samples, n_folds = 3):
    return KFold(n_samples, n_folds=n_folds, random_state=0)

client = Client()
print len(client), 'cores running...'
dv = client[:]
lb = client.load_balanced_view()

4 cores running...

In [7]:

## Some Linear Regualarizers specially Ridge, LASSO, LARS and linearSVM
## the data used is icml2013 blackbox representation learning and MNIST (large scale?)
## BLACKBOX data first
## data source: http://www.kaggle.com/c/challenges-in-representation-learning-the-black-box-learning-challenge
X, y = cPickle.load(open('data/blackbox.pkl', 'rb'))
print X.shape, y.shape
print np.unique(y), y.dtype ## classificaiton 

(1000, 1875) (1000,)
[1 2 3 4 5 6 7 8 9] int64

In [3]:

## Ridge Regression - penalizes on l2 norm, alpha parameter controls penalty weight
## (as contrast to C in SVM), testing alpha usually follow a logspace grid
## Ridge regression gives shrinkage, but not sparsity
alphas = np.logspace(-10, 5, 15)
ridge = linear_model.RidgeClassifier()
scores = [cross_val_score(ridge.set_params(alpha=alpha), X, y, cv = make_cv(X.shape[0]), n_jobs=-1) 
            for alpha in alphas]
cv_scores = map(np.mean, scores)

In [4]:

## To guarantee the optimial selection, the cv score curve should
## go up first and go down again
best_alpha, best_score = max(zip(alphas, cv_scores), key = lambda (a, s): s)
print 'best alpha and best cv_score:', best_alpha, best_score
plt.semilogx(alphas, cv_scores)
plt.vlines(best_alpha, 0, best_score, colors='r')
best_ridge = linear_model.RidgeClassifier(alpha=best_alpha).fit(X, y)
#print 'optimal coefficients for ridge', best_ridge.coef_
print 'coeff sparse rates for different classes', np.sum(abs(best_ridge.coef_) > 0, axis = 1) *1. / X.shape[1]

best alpha and best cv_score: 5.17947467923 0.221985458512
coeff sparse rates for different classes [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]

In [3]:

## Use LogisticRegression on L1 and L2 norm
penalties = ['l1', 'l2']
Cs = np.logspace(0, 5, 6)
lr = linear_model.LogisticRegression(dual = True) # more features than samples
scores = [cross_val_score(lr.set_params(penalty=penalty, C=C, dual = True if penalty =='l2' else False), 
                            X, y, cv=make_cv(X.shape[0]), n_jobs=-1) 
            for penalty in penalties
            for C in Cs]
cv_scores = map(np.mean, scores)

In [10]:

sorted_scores = sorted(zip([(p,C) for p in penalties for C in Cs], cv_scores), key = lambda (params, s): s, reverse=True)
print sorted_scores
best_params, best_score = sorted_scores[0]
print 'best alpha and best cv_score:', best_params, best_score
plt.plot(cv_scores)
l1_lr = linear_model.LogisticRegression(penalty='l1', C=1.0).fit(X, y)
print 'coeff sparse rates for different classes', np.sum(abs(l1_lr.coef_) > 0, axis = 1) *1. / X.shape[1]

[(('l2', 1.0), 0.20499241756726785), (('l1', 1.0), 0.20099440758123391), (('l2', 10.0), 0.18500836165506826), (('l1', 10.0), 0.18199936463409516), (('l2', 100.0), 0.17201632770495046), (('l1', 100000.0), 0.16600133067198933), (('l1', 10000.0), 0.16599533665401928), (('l1', 1000.0), 0.15901230571889255), (('l2', 10000.0), 0.1590063117009225), (('l2', 1000.0), 0.15801430172687658), (('l2', 100000.0), 0.15800231369093645), (('l1', 100.0), 0.15101328873783962)]
best alpha and best cv_score: ('l2', 1.0) 0.204992417567
coeff sparse rates for different classes [ 0.03733333  0.04053333  0.02986667  0.02986667  0.02666667  0.02293333
  0.02933333  0.02293333  0.024     ]

In [15]:

## stochastic Logistic Regression for stability selection
## ITS a feature selection method without making predictions and scoring on y
Cs = np.logspace(0, 5, 6)
rlr = linear_model.RandomizedLogisticRegression(sample_fraction=0.75, n_resampling=200, 
                                selection_threshold=0.25, n_jobs=-1,
                                memory= './data/tmp/')
rlr.set_params(C=1.0).fit(X, y)

________________________________________________________________________________
[Memory] Calling sklearn.linear_model.randomized_l1._resample_model...
_resample_model(<function _randomized_logistic at 0x111b1cc80>, array([[ 0.010795, ...,  0.011825],
       ..., 
       [ 0.006166, ..., -0.009114]]), 
array([1, ..., 4]), C=1.0, n_jobs=-1, verbose=False, fit_intercept=True, scaling=0.5, n_resampling=200, random_state=None, sample_fraction=0.75, tol=0.001, pre_dispatch='3*n_jobs')
__________________________________________________resample_model - 39.3s, 0.7min

Out[15]:

RandomizedLogisticRegression(C=1.0, fit_intercept=True, memory='./data/tmp/',
               n_jobs=-1, n_resampling=200, normalize=True,
               pre_dispatch='3*n_jobs', random_state=None,
               sample_fraction=0.75, scaling=0.5, selection_threshold=0.25,
               tol=0.001, verbose=False)

In [23]:

print 'sparseness rate for randomizedlogistic regression', np.sum(abs(rlr.scores_) > 0) * 1./ X.shape[1]

sparseness rate for randomizedlogistic regression 0.1328

In [7]:

## Passive Aggressive Classifier
Cs = np.logspace(-5, 5, 11)
pa = linear_model.PassiveAggressiveClassifier(random_state = 0, n_iter = 50, n_jobs = 1, loss = 'hinge')
scores = [cross_val_score(pa.set_params(C=C), X, y, n_jobs=-1, cv=make_cv(X.shape[0])) for C in Cs]
cv_scores = map(np.mean, scores)

In [8]:

plt.semilogx(Cs, cv_scores)

Out[8]:

[<matplotlib.lines.Line2D at 0x109287f10>]

In [12]:

pa = linear_model.PassiveAggressiveClassifier(random_state = 0, n_iter = 50, n_jobs = 1, loss = 'hinge', C=1e-3)
pa.fit(X, y)

Out[12]:

PassiveAggressiveClassifier(C=0.001, fit_intercept=True, loss='hinge',
              n_iter=50, n_jobs=1, random_state=0, shuffle=False,
              verbose=0, warm_start=False)

In [17]:

np.sum(abs(pa.coef_) > 1e-6, axis = 1) * 1. / X.shape[1]

Out[17]:

array([ 1.        ,  0.99946667,  1.        ,  1.        ,  1.        ,
        1.        ,  1.        ,  1.        ,  1.        ])

In [3]:

## Stochastic Gradient Decent
alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.1)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), X, y, n_jobs=-1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)

In [4]:

sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]

[({'alpha': 0.10000000000000001, 'l1_ratio': 1.0}, 0.21998045950141756), ({'alpha': 0.01, 'l1_ratio': 1.0}, 0.21000041958125792), ({'alpha': 0.01, 'l1_ratio': 0.90000000000000002}, 0.19596542650434867), ({'alpha': 1.0, 'l1_ratio': 1.0}, 0.19498540456624289), ({'alpha': 0.001, 'l1_ratio': 0.10000000000000001}, 0.18899438360516208), ({'alpha': 0.01, 'l1_ratio': 0.80000000000000004}, 0.18498738259217298), ({'alpha': 0.001, 'l1_ratio': 0.20000000000000001}, 0.18299736862611116), ({'alpha': 1.0000000000000001e-05, 'l1_ratio': 1.0}, 0.18000935066803328), ({'alpha': 0.001, 'l1_ratio': 0.90000000000000002}, 0.17997938057818297), ({'alpha': 0.001, 'l1_ratio': 0.30000000000000004}, 0.17899336462210713)]

In [12]:

sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=-1, n_iter = 20, alpha=0.1, l1_ratio=1)
sgd.fit(X, y)
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / X.shape[1]

sparse rate: [ 0.86666667  0.86133333  0.912       0.9552      0.94986667  0.95093333
  0.9456      0.9568      0.95786667]

In [10]:

## SGD and SVM are generally senstive to normalization

XX = preprocessing.Normalizer().fit_transform(X)
alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.1)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), XX, y, n_jobs=-1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)

In [14]:

## different alpha values with similar accuracy meaning redunancies in the original feature space
sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=-1, n_iter = 20, alpha=0.1, l1_ratio=1)
sgd.fit(XX, y)
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / X.shape[1]

[({'alpha': 0.001, 'l1_ratio': 1.0}, 0.21799643955332579), ({'alpha': 1.0, 'l1_ratio': 1.0}, 0.21798445151738566), ({'alpha': 0.01, 'l1_ratio': 1.0}, 0.21498744253235269), ({'alpha': 0.0001, 'l1_ratio': 0.59999999999999998}, 0.20999142855430283), ({'alpha': 0.0001, 'l1_ratio': 0.20000000000000001}, 0.2099854345363327), ({'alpha': 0.0001, 'l1_ratio': 0.40000000000000002}, 0.20997044949140756), ({'alpha': 0.0001, 'l1_ratio': 0.80000000000000004}, 0.20898443353533178), ({'alpha': 0.001, 'l1_ratio': 0.90000000000000002}, 0.20498942055828284), ({'alpha': 0.0001, 'l1_ratio': 0.70000000000000007}, 0.20496844149538762), ({'alpha': 0.0001, 'l1_ratio': 0.10000000000000001}, 0.20397942853032672)]
sparse rate: [ 0.2176      0.192       0.13546667  0.0816      0.0448      0.02453333
  0.07306667  0.02613333  0.06826667]

In [118]:

## TO see how the variances are distributed among features, using PCA
from sklearn import decomposition

In [119]:

pca = decomposition.PCA(whiten = True)
pca.fit(X, y)

Out[119]:

PCA(copy=True, n_components=None, whiten=True)

In [120]:

## only the first pc is important (60% variance)
plt.plot(pca.explained_variance_ratio_)
print pca.explained_variance_ratio_[:20]

[ 0.58323866  0.06736495  0.05158571  0.03981327  0.0228172   0.01756079
  0.01616188  0.0141026   0.01152615  0.01020264  0.00950273  0.00843913
  0.0074265   0.00593941  0.00568563  0.00500661  0.00489166  0.00441606
  0.00407469  0.00393579]

In [167]:

## Use PCA results to fit SGD
PCA_X = decomposition.PCA(n_components=10, whiten=True).fit_transform(X)
## draw on first 2 PC
colors = ['r', 'g', 'b', 'm', 'y', 'k', 'c']
for (cls,c) in zip(np.unique(y), colors):
    plt.scatter(PCA_X[y==cls,0], PCA_X[y==cls, 1], color=c, label=str(cls))
plt.legend()

Out[167]:

<matplotlib.legend.Legend at 0x10a186350>

In [4]:

alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.1)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), PCA_X, y, n_jobs=1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)

In [5]:

## different alpha values with similar accuracy meaning redunancies in the original feature space
sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=-1, n_iter = 20, alpha=0.1, l1_ratio=1)
sgd.fit(PCA_X, y)
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / X.shape[1]

[({'alpha': 0.10000000000000001, 'l1_ratio': 0.80000000000000004}, 0.16497635359910809), ({'alpha': 0.01, 'l1_ratio': 0.59999999999999998}, 0.16095736455017892), ({'alpha': 0.10000000000000001, 'l1_ratio': 0.70000000000000007}, 0.15795136453819089), ({'alpha': 0.001, 'l1_ratio': 0.30000000000000004}, 0.15696834559110007), ({'alpha': 0.01, 'l1_ratio': 0.80000000000000004}, 0.15299730868593145), ({'alpha': 9.9999999999999995e-07, 'l1_ratio': 0.10000000000000001}, 0.15198431964899031), ({'alpha': 0.001, 'l1_ratio': 0.40000000000000002}, 0.15098331864798931), ({'alpha': 0.001, 'l1_ratio': 0.10000000000000001}, 0.14999430568292846), ({'alpha': 0.01, 'l1_ratio': 0.5}, 0.14996133858409308), ({'alpha': 1.0, 'l1_ratio': 1.0}, 0.14995234755713796)]
sparse rate: [ 0.00426667  0.0016      0.0032      0.00266667  0.0032      0.00213333
  0.00533333  0.00106667  0.0048    ]

In [34]:

## based on experiments above, the linear models consistenly give UNDERFITTING results.
## try some feature engineering methods, such as tri-kmeans
def tri_kmeans(n_clusters, data):
    from sklearn import cluster
    import numpy as np
    kmeans = cluster.KMeans(n_clusters=n_clusters, random_state = 0).fit(data)
    dists_to_clusters = kmeans.transform(data)
    meandist_per_clusters = np.mean(dists_to_clusters, axis = 0)
    return np.apply_along_axis(lambda row: np.maximum(0, meandist_per_clusters-row), 
                                    1, dists_to_clusters)

In [35]:

r = tri_kmeans(9, X[:,:1])
r.shape

Out[35]:

(1000, 9)

In [124]:

batch_size = 2
stride = 1
n_samples, n_features = X.shape
dv['X'] = preprocessing.StandardScaler().fit_transform(X) ## standization for KMeANS
batches = [range(start, min(start+batch_size, n_features)) for start in range(n_features)[::stride]]
dv.scatter('batches', batches)

Out[124]:

<AsyncResult: scatter>

In [125]:

%px print len(batches), len(batches[0]), batches[0]

[stdout:0] 469 2 [0, 1]
[stdout:1] 469 2 [469, 470]
[stdout:2] 469 2 [938, 939]
[stdout:3] 468 2 [1407, 1408]

In [126]:

%%px
import numpy as np
def tri_kmeans(n_clusters, data):
    from sklearn import cluster
    import numpy as np
    kmeans = cluster.KMeans(n_clusters=n_clusters, random_state = 0).fit(data)
    dists_to_clusters = kmeans.transform(data)
    meandist_per_clusters = np.mean(dists_to_clusters, axis = 0)
    return np.apply_along_axis(lambda row: np.maximum(0, meandist_per_clusters-row), 
                                    1, dists_to_clusters)
kmeans_feats = [tri_kmeans(9, X[:,batch]) for batch in batches]
print len(kmeans_feats)

[stdout:0] 469
[stdout:1] 469
[stdout:2] 469
[stdout:3] 468

In [127]:

result = dv.gather('kmeans_feats')
result.ready()

Out[127]:

False

In [128]:

result.ready()

Out[128]:

True

In [129]:

kmeans_feats_from_batches = result.get()

In [130]:

#kmeans_feats = sparse.coo_matrix(np.hstack(kmeans_feats_from_batches+[X]))
kmeans_feats = sparse.coo_matrix(np.hstack(kmeans_feats_from_batches))
print kmeans_feats.shape

(1000, 16875)

In [131]:

## FIT a sgd on new features
alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.2)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), kmeans_feats, y, n_jobs=-1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)

In [132]:

sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]

[({'alpha': 0.0001, 'l1_ratio': 0.90000000000000013}, 0.32699465932998867), ({'alpha': 9.9999999999999995e-07, 'l1_ratio': 0.50000000000000011}, 0.32698866531201859), ({'alpha': 9.9999999999999995e-07, 'l1_ratio': 0.10000000000000001}, 0.32497167826509143), ({'alpha': 0.0001, 'l1_ratio': 0.70000000000000007}, 0.32398266530003056), ({'alpha': 0.001, 'l1_ratio': 0.90000000000000013}, 0.32297567028105956), ({'alpha': 0.10000000000000001, 'l1_ratio': 0.90000000000000013}, 0.31999064933196669), ({'alpha': 0.0001, 'l1_ratio': 0.10000000000000001}, 0.31896567225908545), ({'alpha': 1.0000000000000001e-05, 'l1_ratio': 0.10000000000000001}, 0.31800662938387486), ({'alpha': 0.0001, 'l1_ratio': 0.50000000000000011}, 0.31499163834493177), ({'alpha': 1.0000000000000001e-05, 'l1_ratio': 0.30000000000000004}, 0.31498864133594678)]

In [134]:

sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', 
                n_jobs=-1, n_iter = 20, alpha=1e-4, l1_ratio=0.9)
sgd.fit(kmeans_feats, y)
print 'score:', np.mean(cross_val_score(sgd, kmeans_feats, y, n_jobs=-1, cv = make_cv(X.shape[0])))
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / kmeans_feats.shape[1]

score: 0.32699465933
sparse rate: [ 0.97522963  0.97528889  0.97131852  0.97078519  0.97511111  0.97108148
  0.97125926  0.97214815  0.97137778]

In [136]:

## how about supervised features such as decision tree results?
batch_size = 2
stride = 1
n_samples, n_features = X.shape
dv['X'] = preprocessing.StandardScaler().fit_transform(X) ## standization for KMeANS
dv['y'] = y
batches = [range(start, min(start+batch_size, n_features)) for start in range(n_features)[::stride]]
dv.scatter('batches', batches)

Out[136]:

<AsyncResult: scatter>

In [160]:

%%px
import numpy as np
def tree_model(n_clusters, X, y):
    from sklearn import tree
    import numpy as np
    n_samples = X.shape[0]
    ## use 1/3 data to train model
    index = range(n_samples)
    np.random.shuffle(index)
    selected = index[:n_samples*1/3]
    treemodel = tree.DecisionTreeClassifier(max_depth=5).fit(X[selected, :], y[selected])
    probs = treemodel.predict_proba(X)
    return probs
tree_feats = [tree_model(9, X[:,batch], y) for batch in batches]
print len(tree_feats)

[stdout:0] 469
[stdout:1] 469
[stdout:2] 469
[stdout:3] 468

In [161]:

result = dv.gather('tree_feats')
result.ready()

Out[161]:

False

In [162]:

print result.ready()
tree_feats_from_batches = result.get()

True

In [163]:

tree_feats = sparse.coo_matrix(np.hstack(tree_feats_from_batches))
print tree_feats.shape

(1000, 16875)

In [164]:

alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.2)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), tree_feats, y, n_jobs=-1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)

In [168]:

sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]

[({'alpha': 0.001, 'l1_ratio': 0.90000000000000013}, 0.94298789807771843), ({'alpha': 0.001, 'l1_ratio': 0.70000000000000007}, 0.93100885316454185), ({'alpha': 0.01, 'l1_ratio': 0.90000000000000013}, 0.92500884117650584), ({'alpha': 0.001, 'l1_ratio': 0.50000000000000011}, 0.91599683515851182), ({'alpha': 0.001, 'l1_ratio': 0.10000000000000001}, 0.91599084114054163), ({'alpha': 0.001, 'l1_ratio': 0.30000000000000004}, 0.90899282516048974), ({'alpha': 0.0001, 'l1_ratio': 0.10000000000000001}, 0.84600768433103768), ({'alpha': 0.0001, 'l1_ratio': 0.30000000000000004}, 0.84502766239293192), ({'alpha': 0.0001, 'l1_ratio': 0.50000000000000011}, 0.83000365635096174), ({'alpha': 9.9999999999999995e-07, 'l1_ratio': 0.70000000000000007}, 0.82294570019120916)]

In [170]:

sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', 
                n_jobs=-1, n_iter = 20, alpha=1e-3, l1_ratio=0.9)
sgd.fit(kmeans_feats, y)
print 'score:', np.mean(cross_val_score(sgd, tree_feats, y, n_jobs=-1, cv = make_cv(X.shape[0])))
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / tree_feats.shape[1]

score: 0.942987898078
sparse rate: [ 0.81167407  0.82524444  0.80379259  0.79288889  0.78222222  0.77943704
  0.77866667  0.77386667  0.76248889]

*As a comparison, try SVM-RBF and RandomForest*

In [174]:

from sklearn import ensemble
forest = ensemble.RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
scores = cross_val_score(forest, X,y, cv=make_cv(X.shape[0]))
print 'cv score:', np.mean(scores)

cv score: 0.313981646317

In [177]:

from sklearn import svm
scaled_X = preprocessing.StandardScaler().fit_transform(X)
svc = svm.SVC()
Cs = np.logspace(-5, -1, 5)
gammas = np.logspace(-5, -1, 5)
params = [{'C': C, 'gamma': gamma} for C in Cs for gamma in gammas]
scores = [cross_val_score(svc.set_params(**param), scaled_X, y, cv=make_cv(scaled_X.shape[0], n_jobs=-1))
            for param in params]
cv_scores = map(np.mean, scores)

In [179]:

sorted_scores = sorted(zip(params, cv_scores), key=lambda (p,s):s, reverse=True)
print sorted_scores[:10]

[({'C': 1.0000000000000001e-05, 'gamma': 1.0000000000000001e-05}, 0.19799140457823092), ({'C': 1.0000000000000001e-05, 'gamma': 0.0001}, 0.19799140457823092), ({'C': 1.0000000000000001e-05, 'gamma': 0.001}, 0.19799140457823092), ({'C': 1.0000000000000001e-05, 'gamma': 0.01}, 0.19799140457823092), ({'C': 1.0000000000000001e-05, 'gamma': 0.10000000000000001}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 1.0000000000000001e-05}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 0.0001}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 0.001}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 0.01}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 0.10000000000000001}, 0.19799140457823092)]

*It seems that good representations of features go a longer way than fancy classifiers*

*Another promising things are ensemble, which will be covered in other notebooks*

In [ ]: