Comparison of dask_glm and scikit-learn on the SUSY dataset.
import numpy as np
import pandas as pd
import dask
from distributed import Client
import dask.array as da
from sklearn import linear_model
from dask_glm.estimators import LogisticRegression
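The data is the SUSY dataset from the UCI Machine Learning Repository. If you don't already have it locally, here's a minimal sketch to fetch it (the URL below is the repository's standard path for this dataset; adjust if it has moved):

import urllib.request

# Standard UCI location for SUSY (an assumption; verify before running)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00279/SUSY.csv.gz"
urllib.request.urlretrieve(url, "SUSY.csv.gz")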
df = pd.read_csv("SUSY.csv.gz", header=None)
df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.972861 | 0.653855 | 1.176225 | 1.157156 | -1.739873 | -0.874309 | 0.567765 | -0.175000 | 0.810061 | -0.252552 | 1.921887 | 0.889637 | 0.410772 | 1.145621 | 1.932632 | 0.994464 | 1.367815 | 0.040714 |
| 1 | 1.0 | 1.667973 | 0.064191 | -1.225171 | 0.506102 | -0.338939 | 1.672543 | 3.475464 | -1.219136 | 0.012955 | 3.775174 | 1.045977 | 0.568051 | 0.481928 | 0.000000 | 0.448410 | 0.205356 | 1.321893 | 0.377584 |
| 2 | 1.0 | 0.444840 | -0.134298 | -0.709972 | 0.451719 | -1.613871 | -0.768661 | 1.219918 | 0.504026 | 1.831248 | -0.431385 | 0.526283 | 0.941514 | 1.587535 | 2.024308 | 0.603498 | 1.562374 | 1.135454 | 0.180910 |
| 3 | 1.0 | 0.381256 | -0.976145 | 0.693152 | 0.448959 | 0.891753 | -0.677328 | 2.033060 | 1.533041 | 3.046260 | -1.005285 | 0.569386 | 1.015211 | 1.582217 | 1.551914 | 0.761215 | 1.715464 | 1.492257 | 0.090719 |
| 4 | 1.0 | 1.309996 | -0.690089 | -0.676259 | 1.589283 | -0.693326 | 0.622907 | 1.087562 | -0.381742 | 0.589204 | 1.365479 | 1.179295 | 0.968218 | 0.728563 | 0.000000 | 1.083158 | 0.043429 | 1.154854 | 0.094859 |
len(df)
5000000
We have 5,000,000 rows of all-numeric data: the target in column 0 and 18 features in the rest. We'll skip any feature engineering and preprocessing.
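A quick sanity check that every column really is numeric (illustrative; this output isn't part of the original run):

# Expect a single dtype, float64, across all 19 columns
df.dtypes.value_counts()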
y = df[0].values
X = df.drop(0, axis=1).values
C = 10      # scikit-learn's inverse regularization strength
λ = 1 / C   # the equivalent dask_glm regularization strength
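The two libraries put the regularization knob in different places. Schematically (ignoring each library's exact scaling and intercept conventions), the L1-penalized objectives are

$$
\min_\beta \; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i x_i^\top \beta}\right) + \lVert\beta\rVert_1
\qquad\text{and}\qquad
\min_\beta \; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i x_i^\top \beta}\right) + \lambda \lVert\beta\rVert_1 ,
$$

which agree up to an overall factor of $C$ when $\lambda = 1/C$.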
First, we run scikit-learn's LogisticRegression on the full dataset.
%%time
lm = linear_model.LogisticRegression(penalty='l1', C=C)
lm.fit(X, y)
CPU times: user 26min 16s, sys: 39.6 s, total: 26min 56s
Wall time: 25min 1s
%%time
lm.score(X, y)
CPU times: user 692 ms, sys: 765 ms, total: 1.46 s
Wall time: 1.24 s
0.78830140000000004
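For scikit-learn classifiers, `.score` is mean accuracy. If you want to double-check what's being reported, the same number can be recomputed explicitly:

from sklearn.metrics import accuracy_score

# Should match lm.score(X, y) ≈ 0.7883
accuracy_score(y, lm.predict(X))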
Now for the dask-glm version.
client = Client()  # start a local scheduler and worker processes

# Chunk the NumPy arrays into dask arrays: 50 chunks of 100,000 rows each
K = 100000
dX = da.from_array(X, chunks=(K, X.shape[-1]))
dy = da.from_array(y, chunks=(K,))
dX, dy = dask.persist(dX, dy)   # materialize the chunks in worker memory
client.rebalance([dX, dy])      # spread them evenly across the workers
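Persisting materializes the chunks in distributed memory, so the solver's repeated passes over the data don't recompute the array construction. A quick look at the resulting layout (illustrative):

dX.numblocks       # (50, 1): 5,000,000 rows / 100,000 rows per chunk
dX.nbytes / 1e9    # ~0.72 GB of float64 features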
%%time
dk = LogisticRegression()
dk.fit(dX, dy)
Converged! 6
CPU times: user 58.8 s, sys: 42.2 s, total: 1min 40s
Wall time: 9min 25s
%%time
dk.score(dX, dy)
CPU times: user 550 ms, sys: 349 ms, total: 899 ms
Wall time: 705 ms
0.78832579999999997
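The two models agree to four decimal places on accuracy. Assuming dask_glm follows the scikit-learn convention of exposing fitted weights as `coef_` (an assumption; check your dask_glm version), you can also compare the coefficients directly:

# coef_ on the dask_glm estimator is assumed, by analogy with scikit-learn
np.abs(lm.coef_.ravel() - dk.coef_.ravel()).max()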
| Library | Training time (wall, mm:ss) | Accuracy |
|---|---|---|
| dask-glm | 9:25 | 0.788 |
| scikit-learn | 25:01 | 0.788 |

Both models reach the same accuracy, but dask-glm trains in well under half the wall-clock time.