Summary
Linear model (with regularization) is usually useful because
Ridge penalizes l2
and thus does shrinkage. it supports Ridge
and BayesianRidge
(for regression) and RidgeClassifier
(for classification). parameter: alpha, more parameters for BayesianRidge
Lasso and Lars penalizes l1
norm and thus achieves sparsity.
Lasso
, Lars
, LassoLars
, and MultiTaskLasso
(penalized with combined l1 and l2 norm)Lars
and LassoLars
implementations is that Lars
accpet n_nonzero_coefs
as parameter whereas LassoLars
accept the traditional alpha
MultiTaskLasso
is specially useful to fit multiple regression problems jointly enforcing the selected features to be the same accross tasks. In other words, the tasks are NOT independent of each other any more -- this is usually useful to model time-series data with time-varying linear models, the coeffs of models could change along time, but their importances will not. Regression ONLYRandomizedLasso
works by resampling the train data and computing a Lasso on each resampling. In short, the features selected more often are good features. It is also known as stability selection. Regression ONLYElasticNet penalizes both l1
and l2
norms at the same time, by specifing params alpha
and l1_ratio
, so it achieves both shrinkage and sparsity
MultiTaskElasticNet
is the mutlitask version of it. It is trained with trained with L1/L2 mixed-norm as regularizer so it jointly enforcing selected features across tasksLogistic Regression is like the implementation of Ridge/Lasso for classification problems. It penalizes l1
or l2
, but not at the same time.
C
plays the oppsite role as of alpha
in Ridge/Lasso. C
is the weight on the data-fitting term, alpha
is the weight on the regulariztaion termdual
specifies implementing as a dual or primal formulation for l2 norm only usually Prefer dual=False when n_samples > n_features.RandomizedLogisticRegression
is similiar to RandomizedLasso
, which does stability selection by resampling the data several times. For large data, use n_jobs=-1 and memory='memory_cache_path' for computation efficiencyARDRegression - Bayesian version of linear Regression with ARD prior, see also BayesianRidge
OrthogonalMatchingPursuit - For regression tasks, it essentially penalized l0
norm, i.e., number of features used
PassiveAggressive - PassiveAggressiveClassifier
and PasssiveAggressiveRegressor
for both classification and regression.
partial_fit
, which means they can be used in ONLINE Learning mode, which is specially useful for LARGE scale learningC
controls regularization. Performance of algorithm is SENSITIVE to C valuesn_jobs
implements parallelism, but only for multi-class learningSGD
partial_fit
, similiar to PassiveAggressivel1
, l2
and elasticnet
penalties, similiar to ElasticNetalpha
and l1_ratio
. Parameter n_iter
and learning_rate
may also influence accuraciesclass_weights
can be used to handle imbalanced classification problem, e.g., with value auto
Perceptron
LinearSVM
Tricks
import cPickle
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import KFold
import matplotlib.pylab as plt
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing
from sklearn import decomposition
from sklearn import cluster
from IPython.parallel import Client
from scipy import sparse
def make_cv(n_samples, n_folds = 3):
return KFold(n_samples, n_folds=n_folds, random_state=0)
client = Client()
print len(client), 'cores running...'
dv = client[:]
lb = client.load_balanced_view()
4 cores running...
## Some Linear Regualarizers specially Ridge, LASSO, LARS and linearSVM
## the data used is icml2013 blackbox representation learning and MNIST (large scale?)
## BLACKBOX data first
## data source: http://www.kaggle.com/c/challenges-in-representation-learning-the-black-box-learning-challenge
X, y = cPickle.load(open('data/blackbox.pkl', 'rb'))
print X.shape, y.shape
print np.unique(y), y.dtype ## classificaiton
(1000, 1875) (1000,) [1 2 3 4 5 6 7 8 9] int64
## Ridge Regression - penalizes on l2 norm, alpha parameter controls penalty weight
## (as contrast to C in SVM), testing alpha usually follow a logspace grid
## Ridge regression gives shrinkage, but not sparsity
alphas = np.logspace(-10, 5, 15)
ridge = linear_model.RidgeClassifier()
scores = [cross_val_score(ridge.set_params(alpha=alpha), X, y, cv = make_cv(X.shape[0]), n_jobs=-1)
for alpha in alphas]
cv_scores = map(np.mean, scores)
## To guarantee the optimial selection, the cv score curve should
## go up first and go down again
best_alpha, best_score = max(zip(alphas, cv_scores), key = lambda (a, s): s)
print 'best alpha and best cv_score:', best_alpha, best_score
plt.semilogx(alphas, cv_scores)
plt.vlines(best_alpha, 0, best_score, colors='r')
best_ridge = linear_model.RidgeClassifier(alpha=best_alpha).fit(X, y)
#print 'optimal coefficients for ridge', best_ridge.coef_
print 'coeff sparse rates for different classes', np.sum(abs(best_ridge.coef_) > 0, axis = 1) *1. / X.shape[1]
best alpha and best cv_score: 5.17947467923 0.221985458512 coeff sparse rates for different classes [ 1. 1. 1. 1. 1. 1. 1. 1. 1.]
## Use LogisticRegression on L1 and L2 norm
penalties = ['l1', 'l2']
Cs = np.logspace(0, 5, 6)
lr = linear_model.LogisticRegression(dual = True) # more features than samples
scores = [cross_val_score(lr.set_params(penalty=penalty, C=C, dual = True if penalty =='l2' else False),
X, y, cv=make_cv(X.shape[0]), n_jobs=-1)
for penalty in penalties
for C in Cs]
cv_scores = map(np.mean, scores)
sorted_scores = sorted(zip([(p,C) for p in penalties for C in Cs], cv_scores), key = lambda (params, s): s, reverse=True)
print sorted_scores
best_params, best_score = sorted_scores[0]
print 'best alpha and best cv_score:', best_params, best_score
plt.plot(cv_scores)
l1_lr = linear_model.LogisticRegression(penalty='l1', C=1.0).fit(X, y)
print 'coeff sparse rates for different classes', np.sum(abs(l1_lr.coef_) > 0, axis = 1) *1. / X.shape[1]
[(('l2', 1.0), 0.20499241756726785), (('l1', 1.0), 0.20099440758123391), (('l2', 10.0), 0.18500836165506826), (('l1', 10.0), 0.18199936463409516), (('l2', 100.0), 0.17201632770495046), (('l1', 100000.0), 0.16600133067198933), (('l1', 10000.0), 0.16599533665401928), (('l1', 1000.0), 0.15901230571889255), (('l2', 10000.0), 0.1590063117009225), (('l2', 1000.0), 0.15801430172687658), (('l2', 100000.0), 0.15800231369093645), (('l1', 100.0), 0.15101328873783962)] best alpha and best cv_score: ('l2', 1.0) 0.204992417567 coeff sparse rates for different classes [ 0.03733333 0.04053333 0.02986667 0.02986667 0.02666667 0.02293333 0.02933333 0.02293333 0.024 ]
## stochastic Logistic Regression for stability selection
## ITS a feature selection method without making predictions and scoring on y
Cs = np.logspace(0, 5, 6)
rlr = linear_model.RandomizedLogisticRegression(sample_fraction=0.75, n_resampling=200,
selection_threshold=0.25, n_jobs=-1,
memory= './data/tmp/')
rlr.set_params(C=1.0).fit(X, y)
________________________________________________________________________________ [Memory] Calling sklearn.linear_model.randomized_l1._resample_model... _resample_model(<function _randomized_logistic at 0x111b1cc80>, array([[ 0.010795, ..., 0.011825], ..., [ 0.006166, ..., -0.009114]]), array([1, ..., 4]), C=1.0, n_jobs=-1, verbose=False, fit_intercept=True, scaling=0.5, n_resampling=200, random_state=None, sample_fraction=0.75, tol=0.001, pre_dispatch='3*n_jobs') __________________________________________________resample_model - 39.3s, 0.7min
RandomizedLogisticRegression(C=1.0, fit_intercept=True, memory='./data/tmp/', n_jobs=-1, n_resampling=200, normalize=True, pre_dispatch='3*n_jobs', random_state=None, sample_fraction=0.75, scaling=0.5, selection_threshold=0.25, tol=0.001, verbose=False)
print 'sparseness rate for randomizedlogistic regression', np.sum(abs(rlr.scores_) > 0) * 1./ X.shape[1]
sparseness rate for randomizedlogistic regression 0.1328
## Passive Aggressive Classifier
Cs = np.logspace(-5, 5, 11)
pa = linear_model.PassiveAggressiveClassifier(random_state = 0, n_iter = 50, n_jobs = 1, loss = 'hinge')
scores = [cross_val_score(pa.set_params(C=C), X, y, n_jobs=-1, cv=make_cv(X.shape[0])) for C in Cs]
cv_scores = map(np.mean, scores)
plt.semilogx(Cs, cv_scores)
[<matplotlib.lines.Line2D at 0x109287f10>]
pa = linear_model.PassiveAggressiveClassifier(random_state = 0, n_iter = 50, n_jobs = 1, loss = 'hinge', C=1e-3)
pa.fit(X, y)
PassiveAggressiveClassifier(C=0.001, fit_intercept=True, loss='hinge', n_iter=50, n_jobs=1, random_state=0, shuffle=False, verbose=0, warm_start=False)
np.sum(abs(pa.coef_) > 1e-6, axis = 1) * 1. / X.shape[1]
array([ 1. , 0.99946667, 1. , 1. , 1. , 1. , 1. , 1. , 1. ])
## Stochastic Gradient Decent
alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.1)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), X, y, n_jobs=-1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)
sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]
[({'alpha': 0.10000000000000001, 'l1_ratio': 1.0}, 0.21998045950141756), ({'alpha': 0.01, 'l1_ratio': 1.0}, 0.21000041958125792), ({'alpha': 0.01, 'l1_ratio': 0.90000000000000002}, 0.19596542650434867), ({'alpha': 1.0, 'l1_ratio': 1.0}, 0.19498540456624289), ({'alpha': 0.001, 'l1_ratio': 0.10000000000000001}, 0.18899438360516208), ({'alpha': 0.01, 'l1_ratio': 0.80000000000000004}, 0.18498738259217298), ({'alpha': 0.001, 'l1_ratio': 0.20000000000000001}, 0.18299736862611116), ({'alpha': 1.0000000000000001e-05, 'l1_ratio': 1.0}, 0.18000935066803328), ({'alpha': 0.001, 'l1_ratio': 0.90000000000000002}, 0.17997938057818297), ({'alpha': 0.001, 'l1_ratio': 0.30000000000000004}, 0.17899336462210713)]
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=-1, n_iter = 20, alpha=0.1, l1_ratio=1)
sgd.fit(X, y)
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / X.shape[1]
sparse rate: [ 0.86666667 0.86133333 0.912 0.9552 0.94986667 0.95093333 0.9456 0.9568 0.95786667]
## SGD and SVM are generally senstive to normalization
XX = preprocessing.Normalizer().fit_transform(X)
alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.1)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), XX, y, n_jobs=-1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)
## different alpha values with similar accuracy meaning redunancies in the original feature space
sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=-1, n_iter = 20, alpha=0.1, l1_ratio=1)
sgd.fit(XX, y)
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / X.shape[1]
[({'alpha': 0.001, 'l1_ratio': 1.0}, 0.21799643955332579), ({'alpha': 1.0, 'l1_ratio': 1.0}, 0.21798445151738566), ({'alpha': 0.01, 'l1_ratio': 1.0}, 0.21498744253235269), ({'alpha': 0.0001, 'l1_ratio': 0.59999999999999998}, 0.20999142855430283), ({'alpha': 0.0001, 'l1_ratio': 0.20000000000000001}, 0.2099854345363327), ({'alpha': 0.0001, 'l1_ratio': 0.40000000000000002}, 0.20997044949140756), ({'alpha': 0.0001, 'l1_ratio': 0.80000000000000004}, 0.20898443353533178), ({'alpha': 0.001, 'l1_ratio': 0.90000000000000002}, 0.20498942055828284), ({'alpha': 0.0001, 'l1_ratio': 0.70000000000000007}, 0.20496844149538762), ({'alpha': 0.0001, 'l1_ratio': 0.10000000000000001}, 0.20397942853032672)] sparse rate: [ 0.2176 0.192 0.13546667 0.0816 0.0448 0.02453333 0.07306667 0.02613333 0.06826667]
## TO see how the variances are distributed among features, using PCA
from sklearn import decomposition
pca = decomposition.PCA(whiten = True)
pca.fit(X, y)
PCA(copy=True, n_components=None, whiten=True)
## only the first pc is important (60% variance)
plt.plot(pca.explained_variance_ratio_)
print pca.explained_variance_ratio_[:20]
[ 0.58323866 0.06736495 0.05158571 0.03981327 0.0228172 0.01756079 0.01616188 0.0141026 0.01152615 0.01020264 0.00950273 0.00843913 0.0074265 0.00593941 0.00568563 0.00500661 0.00489166 0.00441606 0.00407469 0.00393579]
## Use PCA results to fit SGD
PCA_X = decomposition.PCA(n_components=10, whiten=True).fit_transform(X)
## draw on first 2 PC
colors = ['r', 'g', 'b', 'm', 'y', 'k', 'c']
for (cls,c) in zip(np.unique(y), colors):
plt.scatter(PCA_X[y==cls,0], PCA_X[y==cls, 1], color=c, label=str(cls))
plt.legend()
<matplotlib.legend.Legend at 0x10a186350>
alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.1)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), PCA_X, y, n_jobs=1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)
## different alpha values with similar accuracy meaning redunancies in the original feature space
sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=-1, n_iter = 20, alpha=0.1, l1_ratio=1)
sgd.fit(PCA_X, y)
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / X.shape[1]
[({'alpha': 0.10000000000000001, 'l1_ratio': 0.80000000000000004}, 0.16497635359910809), ({'alpha': 0.01, 'l1_ratio': 0.59999999999999998}, 0.16095736455017892), ({'alpha': 0.10000000000000001, 'l1_ratio': 0.70000000000000007}, 0.15795136453819089), ({'alpha': 0.001, 'l1_ratio': 0.30000000000000004}, 0.15696834559110007), ({'alpha': 0.01, 'l1_ratio': 0.80000000000000004}, 0.15299730868593145), ({'alpha': 9.9999999999999995e-07, 'l1_ratio': 0.10000000000000001}, 0.15198431964899031), ({'alpha': 0.001, 'l1_ratio': 0.40000000000000002}, 0.15098331864798931), ({'alpha': 0.001, 'l1_ratio': 0.10000000000000001}, 0.14999430568292846), ({'alpha': 0.01, 'l1_ratio': 0.5}, 0.14996133858409308), ({'alpha': 1.0, 'l1_ratio': 1.0}, 0.14995234755713796)] sparse rate: [ 0.00426667 0.0016 0.0032 0.00266667 0.0032 0.00213333 0.00533333 0.00106667 0.0048 ]
## based on experiments above, the linear models consistenly give UNDERFITTING results.
## try some feature engineering methods, such as tri-kmeans
def tri_kmeans(n_clusters, data):
from sklearn import cluster
import numpy as np
kmeans = cluster.KMeans(n_clusters=n_clusters, random_state = 0).fit(data)
dists_to_clusters = kmeans.transform(data)
meandist_per_clusters = np.mean(dists_to_clusters, axis = 0)
return np.apply_along_axis(lambda row: np.maximum(0, meandist_per_clusters-row),
1, dists_to_clusters)
r = tri_kmeans(9, X[:,:1])
r.shape
(1000, 9)
batch_size = 2
stride = 1
n_samples, n_features = X.shape
dv['X'] = preprocessing.StandardScaler().fit_transform(X) ## standization for KMeANS
batches = [range(start, min(start+batch_size, n_features)) for start in range(n_features)[::stride]]
dv.scatter('batches', batches)
<AsyncResult: scatter>
%px print len(batches), len(batches[0]), batches[0]
[stdout:0] 469 2 [0, 1] [stdout:1] 469 2 [469, 470] [stdout:2] 469 2 [938, 939] [stdout:3] 468 2 [1407, 1408]
%%px
import numpy as np
def tri_kmeans(n_clusters, data):
from sklearn import cluster
import numpy as np
kmeans = cluster.KMeans(n_clusters=n_clusters, random_state = 0).fit(data)
dists_to_clusters = kmeans.transform(data)
meandist_per_clusters = np.mean(dists_to_clusters, axis = 0)
return np.apply_along_axis(lambda row: np.maximum(0, meandist_per_clusters-row),
1, dists_to_clusters)
kmeans_feats = [tri_kmeans(9, X[:,batch]) for batch in batches]
print len(kmeans_feats)
[stdout:0] 469 [stdout:1] 469 [stdout:2] 469 [stdout:3] 468
result = dv.gather('kmeans_feats')
result.ready()
False
result.ready()
True
kmeans_feats_from_batches = result.get()
#kmeans_feats = sparse.coo_matrix(np.hstack(kmeans_feats_from_batches+[X]))
kmeans_feats = sparse.coo_matrix(np.hstack(kmeans_feats_from_batches))
print kmeans_feats.shape
(1000, 16875)
## FIT a sgd on new features
alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.2)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), kmeans_feats, y, n_jobs=-1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)
sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]
[({'alpha': 0.0001, 'l1_ratio': 0.90000000000000013}, 0.32699465932998867), ({'alpha': 9.9999999999999995e-07, 'l1_ratio': 0.50000000000000011}, 0.32698866531201859), ({'alpha': 9.9999999999999995e-07, 'l1_ratio': 0.10000000000000001}, 0.32497167826509143), ({'alpha': 0.0001, 'l1_ratio': 0.70000000000000007}, 0.32398266530003056), ({'alpha': 0.001, 'l1_ratio': 0.90000000000000013}, 0.32297567028105956), ({'alpha': 0.10000000000000001, 'l1_ratio': 0.90000000000000013}, 0.31999064933196669), ({'alpha': 0.0001, 'l1_ratio': 0.10000000000000001}, 0.31896567225908545), ({'alpha': 1.0000000000000001e-05, 'l1_ratio': 0.10000000000000001}, 0.31800662938387486), ({'alpha': 0.0001, 'l1_ratio': 0.50000000000000011}, 0.31499163834493177), ({'alpha': 1.0000000000000001e-05, 'l1_ratio': 0.30000000000000004}, 0.31498864133594678)]
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet',
n_jobs=-1, n_iter = 20, alpha=1e-4, l1_ratio=0.9)
sgd.fit(kmeans_feats, y)
print 'score:', np.mean(cross_val_score(sgd, kmeans_feats, y, n_jobs=-1, cv = make_cv(X.shape[0])))
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / kmeans_feats.shape[1]
score: 0.32699465933 sparse rate: [ 0.97522963 0.97528889 0.97131852 0.97078519 0.97511111 0.97108148 0.97125926 0.97214815 0.97137778]
## how about supervised features such as decision tree results?
batch_size = 2
stride = 1
n_samples, n_features = X.shape
dv['X'] = preprocessing.StandardScaler().fit_transform(X) ## standization for KMeANS
dv['y'] = y
batches = [range(start, min(start+batch_size, n_features)) for start in range(n_features)[::stride]]
dv.scatter('batches', batches)
<AsyncResult: scatter>
%%px
import numpy as np
def tree_model(n_clusters, X, y):
from sklearn import tree
import numpy as np
n_samples = X.shape[0]
## use 1/3 data to train model
index = range(n_samples)
np.random.shuffle(index)
selected = index[:n_samples*1/3]
treemodel = tree.DecisionTreeClassifier(max_depth=5).fit(X[selected, :], y[selected])
probs = treemodel.predict_proba(X)
return probs
tree_feats = [tree_model(9, X[:,batch], y) for batch in batches]
print len(tree_feats)
[stdout:0] 469 [stdout:1] 469 [stdout:2] 469 [stdout:3] 468
result = dv.gather('tree_feats')
result.ready()
False
print result.ready()
tree_feats_from_batches = result.get()
True
tree_feats = sparse.coo_matrix(np.hstack(tree_feats_from_batches))
print tree_feats.shape
(1000, 16875)
alphas = np.logspace(-6, 0, 7)
l1_ratios = np.arange(0.1, 1.1, 0.2)
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet', n_jobs=1, n_iter = 20)
params = [{'alpha': alpha, 'l1_ratio': l1_ratio} for alpha in alphas for l1_ratio in l1_ratios]
scores = [cross_val_score(sgd.set_params(**param), tree_feats, y, n_jobs=-1, cv=make_cv(X.shape[0])) for param in params]
cv_scores = map(np.mean, scores)
sorted_scores = sorted(zip(params, cv_scores), key = lambda (p,s): s, reverse=True)
print sorted_scores[:10]
[({'alpha': 0.001, 'l1_ratio': 0.90000000000000013}, 0.94298789807771843), ({'alpha': 0.001, 'l1_ratio': 0.70000000000000007}, 0.93100885316454185), ({'alpha': 0.01, 'l1_ratio': 0.90000000000000013}, 0.92500884117650584), ({'alpha': 0.001, 'l1_ratio': 0.50000000000000011}, 0.91599683515851182), ({'alpha': 0.001, 'l1_ratio': 0.10000000000000001}, 0.91599084114054163), ({'alpha': 0.001, 'l1_ratio': 0.30000000000000004}, 0.90899282516048974), ({'alpha': 0.0001, 'l1_ratio': 0.10000000000000001}, 0.84600768433103768), ({'alpha': 0.0001, 'l1_ratio': 0.30000000000000004}, 0.84502766239293192), ({'alpha': 0.0001, 'l1_ratio': 0.50000000000000011}, 0.83000365635096174), ({'alpha': 9.9999999999999995e-07, 'l1_ratio': 0.70000000000000007}, 0.82294570019120916)]
sgd = linear_model.SGDClassifier(loss = 'hinge', penalty = 'elasticnet',
n_jobs=-1, n_iter = 20, alpha=1e-3, l1_ratio=0.9)
sgd.fit(kmeans_feats, y)
print 'score:', np.mean(cross_val_score(sgd, tree_feats, y, n_jobs=-1, cv = make_cv(X.shape[0])))
print 'sparse rate:', np.sum(abs(sgd.coef_) > 1e-3, axis = 1) * 1. / tree_feats.shape[1]
score: 0.942987898078 sparse rate: [ 0.81167407 0.82524444 0.80379259 0.79288889 0.78222222 0.77943704 0.77866667 0.77386667 0.76248889]
*As a comparison, try SVM-RBF and RandomForest*
from sklearn import ensemble
forest = ensemble.RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
scores = cross_val_score(forest, X,y, cv=make_cv(X.shape[0]))
print 'cv score:', np.mean(scores)
cv score: 0.313981646317
from sklearn import svm
scaled_X = preprocessing.StandardScaler().fit_transform(X)
svc = svm.SVC()
Cs = np.logspace(-5, -1, 5)
gammas = np.logspace(-5, -1, 5)
params = [{'C': C, 'gamma': gamma} for C in Cs for gamma in gammas]
scores = [cross_val_score(svc.set_params(**param), scaled_X, y, cv=make_cv(scaled_X.shape[0], n_jobs=-1))
for param in params]
cv_scores = map(np.mean, scores)
sorted_scores = sorted(zip(params, cv_scores), key=lambda (p,s):s, reverse=True)
print sorted_scores[:10]
[({'C': 1.0000000000000001e-05, 'gamma': 1.0000000000000001e-05}, 0.19799140457823092), ({'C': 1.0000000000000001e-05, 'gamma': 0.0001}, 0.19799140457823092), ({'C': 1.0000000000000001e-05, 'gamma': 0.001}, 0.19799140457823092), ({'C': 1.0000000000000001e-05, 'gamma': 0.01}, 0.19799140457823092), ({'C': 1.0000000000000001e-05, 'gamma': 0.10000000000000001}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 1.0000000000000001e-05}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 0.0001}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 0.001}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 0.01}, 0.19799140457823092), ({'C': 0.0001, 'gamma': 0.10000000000000001}, 0.19799140457823092)]
*It seems that good representations of features go a longer way than fancy classifiers*
*Another promising things are ensemble, which will be covered in other notebooks*