JustML – Client library usage example

The following example demonstrates the usage of the JustML Python client library. JustML provides automatic machine learning model selection, training and deployment in the cloud.

JustML finds the right scikit-learn or xgboost estimator for a given supervised machine learning problem, along with the optimal hyperparameters. It also selects data and feature preprocessors in order to build a complete machine learning pipeline. The selected models are fitted and deployed on JustML computing infrastructure and can be used to generate new predictions. For more information, check https://justml.io.

To reproduce this example, if you haven't done so yet, you will need to request your JustML API key here.

The JustML Python library is installed with pip install justml.

In [1]:
import justml

justml.api_key = "key-xxxxxxxxx"
justml.activate_logging()

Example: Select a classifier and fit it to given arrays

Create a new classifier, or retrieve an existing one, named "classifier1":

In [2]:
clf = justml.Classifier(name="classifier1")
print(clf.status)
uninitialized

If the classifier exists and is already fitted, its status is "trained". That is not the case here, so we fit it using the digits dataset from scikit-learn, which we split into a training set and a test set.

The fit() method tests scikit-learn and xgboost machine learning classifiers and chooses the one that performs best on the data provided. The selected model is fitted and deployed in the cloud.
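
As a rough local illustration of what model selection involves, the sketch below scores a few scikit-learn classifiers by cross-validation and keeps the best one. It is not JustML's actual search procedure, which runs on JustML servers and also selects preprocessors and hyperparameters; the candidate list, the 3-fold setting, and the hyperparameter values are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical candidate list, chosen only for illustration.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=4),
    "qda": QuadraticDiscriminantAnalysis(reg_param=0.5),
    "tree": DecisionTreeClassifier(random_state=0),
}

# Score every candidate by 3-fold cross-validated accuracy on the training set.
scores = {
    name: cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Keep the best-scoring candidate and refit it on all the training data.
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X_train, y_train)
print(best_name, round(scores[best_name], 3))
```

The held-out X_test and y_test are not touched during selection, so they remain available for an unbiased final check of the chosen model.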

Training data is used to build the estimator model and then is immediately and permanently deleted.

In [3]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
In [4]:
%%timeit -n1 -r1 # measure the execution time

if clf.status != "trained":
    clf.fit(X=X_train, y=y_train)
INFO:justml:Sending data to train estimator classifier1...
INFO:justml:Waiting for training to complete on JustML servers...
INFO:justml:Training is finished, and data has been deleted.
3min 17s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

It took 3min 17s to select and build a machine learning pipeline.

The show_pipeline() method reveals the pipeline that was built during fit. In this case, two data preprocessors were selected: one-hot encoding and mean imputation. For feature preprocessing, the select_rates method was chosen. Finally, the classifier that performed best is QDA (sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis). All hyperparameters are displayed along with the selected model.

In [5]:
clf.show_pipeline()
{
    "data_preprocessor": [
        {
            "step": "categorical_encoding",
            "class": "one_hot_encoding",
            "args": {
                "use_minimum_fraction": false
            }
        },
        {
            "step": "imputation",
            "args": {
                "strategy": "mean"
            }
        },
        {
            "step": "rescaling",
            "class": "none"
        },
        {
            "step": "balancing",
            "args": {
                "strategy": "none"
            }
        }
    ],
    "feature_preprocessor": {
        "class": "select_rates",
        "args": {
            "alpha": 0.06544340428506021,
            "mode": "fwe",
            "score_func": "f_classif"
        }
    },
    "classifier": {
        "class": "sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis",
        "args": {
            "reg_param": 0.6396026761675004
        }
    }
}

The show_automl_info() method is useful to see information and statistics about the model selection.

It displays the best validation score found, the number of target algorithm runs that were attempted, the number of successful runs, and the number of runs that failed because they exceeded the memory or runtime limit. The memory and runtime limits can be configured and increased to fit the needs of each problem or dataset (contact JustML at [email protected]).

In this case, 12 algorithms were successfully tested and compared.

In [6]:
clf.show_automl_info()
{
    "Best validation score": "0.991011",
    "Metric": "accuracy",
    "Number of crashed target algorithm runs": "1",
    "Number of successful target algorithm runs": "12",
    "Number of target algorithm runs": "15",
    "Number of target algorithms that exceeded the memory limit": "2",
    "Number of target algorithms that exceeded the time limit": "0"
}
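
The counters in this output add up: successful, crashed, memory-limited, and time-limited runs together account for every run. The short check below uses the values copied from the output above and makes no call to JustML; the variable names are chosen for this example only.

```python
# Values copied from the show_automl_info() output above (reported as strings).
info = {
    "Number of crashed target algorithm runs": "1",
    "Number of successful target algorithm runs": "12",
    "Number of target algorithm runs": "15",
    "Number of target algorithms that exceeded the memory limit": "2",
    "Number of target algorithms that exceeded the time limit": "0",
}

total = int(info["Number of target algorithm runs"])
successful = int(info["Number of successful target algorithm runs"])
failed = (
    int(info["Number of crashed target algorithm runs"])
    + int(info["Number of target algorithms that exceeded the memory limit"])
    + int(info["Number of target algorithms that exceeded the time limit"])
)

# Every run either succeeded or failed for one of the three listed reasons.
print(successful + failed == total)
print("success rate: %.0f%%" % (100 * successful / total))
```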

Now that the pipeline has been selected, fitted to the training data, and deployed, we can use it to classify X_test:

In [7]:
%%timeit -n1 -r1 # measure the execution time

predictions = clf.predict(X=X_test)
2.38 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Predict took 2.38 seconds.

As with training, prediction data is not stored. As soon as the results are computed, the input data is permanently deleted.

Let's compare the results with y_test, the true y values for the test dataset:

In [8]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
             precision    recall  f1-score   support

          0       1.00      0.98      0.99        53
          1       1.00      1.00      1.00        42
          2       1.00      0.98      0.99        41
          3       1.00      1.00      1.00        52
          4       0.98      1.00      0.99        47
          5       1.00      1.00      1.00        39
          6       1.00      1.00      1.00        43
          7       1.00      1.00      1.00        48
          8       0.97      1.00      0.99        37
          9       1.00      1.00      1.00        48

avg / total       1.00      1.00      1.00       450

Let's take another look at the pipeline and rebuild it using the corresponding sklearn classes and functions:

In [9]:
clf.show_pipeline()
{
    "data_preprocessor": [
        {
            "step": "categorical_encoding",
            "class": "one_hot_encoding",
            "args": {
                "use_minimum_fraction": false
            }
        },
        {
            "step": "imputation",
            "args": {
                "strategy": "mean"
            }
        },
        {
            "step": "rescaling",
            "class": "none"
        },
        {
            "step": "balancing",
            "args": {
                "strategy": "none"
            }
        }
    ],
    "feature_preprocessor": {
        "class": "select_rates",
        "args": {
            "alpha": 0.06544340428506021,
            "mode": "fwe",
            "score_func": "f_classif"
        }
    },
    "classifier": {
        "class": "sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis",
        "args": {
            "reg_param": 0.6396026761675004
        }
    }
}

Let's build and fit the pipeline, step by step:

In [10]:
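# Import each sklearn submodule used below explicitly
import sklearn.discriminant_analysis
import sklearn.feature_selection
import sklearn.preprocessing

# One hot encoding (categorical_features=[] marks no column as categorical, so X_train passes through unchanged)
onehotencoder = sklearn.preprocessing.OneHotEncoder(categorical_features=[], sparse=False)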
X_train_preprocessed = onehotencoder.fit_transform(X_train)

# Imputation using the mean
imputer = sklearn.preprocessing.Imputer(strategy="mean")
X_train_preprocessed = imputer.fit_transform(X_train_preprocessed)

# Feature preprocessing (select_rates corresponds to sklearn.feature_selection.GenericUnivariateSelect)
feature_preprocessor = sklearn.feature_selection.GenericUnivariateSelect(param=0.06544340428506021, mode="fwe", score_func=sklearn.feature_selection.f_classif)
X_train_preprocessed = feature_preprocessor.fit_transform(X_train_preprocessed, y_train)

# Classifier (QDA)
classifier = sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis(reg_param=0.6396026761675004)
classifier.fit(X_train_preprocessed, y_train)
Out[10]:
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.6396026761675004,
               store_covariance=False, store_covariances=None, tol=0.0001)

We now apply the fitted pipeline to X_test, and compare the predicted outputs to y_test:

In [11]:
X_test_preprocessed = onehotencoder.transform(X_test)
X_test_preprocessed = imputer.transform(X_test_preprocessed)
X_test_preprocessed = feature_preprocessor.transform(X_test_preprocessed)

predictions = classifier.predict(X_test_preprocessed)

print(classification_report(y_test, predictions))
             precision    recall  f1-score   support

          0       1.00      0.98      0.99        53
          1       1.00      1.00      1.00        42
          2       1.00      0.98      0.99        41
          3       1.00      1.00      1.00        52
          4       0.98      1.00      0.99        47
          5       1.00      1.00      1.00        39
          6       1.00      1.00      1.00        43
          7       1.00      1.00      1.00        48
          8       0.97      1.00      0.99        37
          9       1.00      1.00      1.00        48

avg / total       1.00      1.00      1.00       450

We obtain the same results.
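
The step-by-step code above targets the scikit-learn version used when this notebook was run. As a sketch for more recent releases, assuming scikit-learn 0.22 or later (where Imputer was replaced by sklearn.impute.SimpleImputer), the same steps can be chained in a sklearn.pipeline.Pipeline. The one-hot encoding step is left out because no column is categorical here; the random_state value and the step names are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    # Imputation using the mean
    ("imputation", SimpleImputer(strategy="mean")),
    # Feature preprocessing, with the arguments reported by show_pipeline()
    ("select_rates", GenericUnivariateSelect(
        score_func=f_classif, mode="fwe", param=0.06544340428506021)),
    # Classifier (QDA)
    ("classifier", QuadraticDiscriminantAnalysis(reg_param=0.6396026761675004)),
])

# Fitting the pipeline fits each step in order on the training data.
pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print("accuracy: %.3f" % accuracy)
```

Calling pipeline.predict applies the fitted transformers to X_test before classifying, so the test data never needs to be preprocessed by hand.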

Example: Select a classifier and fit it to CSV data

Create a new classifier, or retrieve an existing one, named "classifier2":

In [12]:
clf = justml.Classifier(name="classifier2")
print(clf.status)
uninitialized

If the classifier exists and is already fitted, its status is "trained". That is not the case here, so we fit it using a dataset in the form of a CSV file.

The CSV needs to contain both the features (predictors) and the response (outcome variable). It needs to have a first row with column names, where the response column has the name indicated by the col_y argument ("y" by default). If there are columns with non-numerical values (text), JustML will consider them as categorical variables. You don't need to encode these columns as integers – JustML can handle this for you.
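
A minimal sketch of a file in this layout, written with Python's standard csv module. The column names age, income, and city and all values are hypothetical; only the header row and the response column named "y" come from the description above. The city column holds text, which is the kind of column the description says is treated as categorical.

```python
import csv
import os
import tempfile

# Hypothetical toy rows: two numeric features, one text feature, and the response "y".
rows = [
    {"age": 34, "income": 52000.0, "city": "Paris", "y": 1},
    {"age": 51, "income": 38000.0, "city": "Lyon", "y": 0},
    {"age": 27, "income": 61000.0, "city": "Paris", "y": 1},
]

# Written to a temporary directory here; any path can be used instead.
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["age", "income", "city", "y"])
    writer.writeheader()    # first row: column names
    writer.writerows(rows)  # one row per observation

# Read the header back to confirm the layout.
with open(csv_path, newline="") as f:
    header = next(csv.reader(f))
print(header)
```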

In [13]:
%%timeit -n1 -r1 # measure the execution time

if clf.status != "trained":
    clf.fit(csvpath="data.csv", col_y="y")
INFO:justml:Sending data to train estimator classifier2...
INFO:justml:Waiting for training to complete on JustML servers...
INFO:justml:Training is finished, and data has been deleted.
3min 39s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

It took 3min 39s to select and build a machine learning pipeline.

Display the pipeline:

In [14]:
clf.show_pipeline()
{
    "data_preprocessor": [
        {
            "step": "categorical_encoding",
            "class": "no_encoding"
        },
        {
            "step": "imputation",
            "args": {
                "strategy": "mean"
            }
        },
        {
            "step": "rescaling",
            "class": "normalize"
        },
        {
            "step": "balancing",
            "args": {
                "strategy": "none"
            }
        }
    ],
    "feature_preprocessor": {
        "class": "select_rates",
        "args": {
            "alpha": 0.1,
            "mode": "fpr",
            "score_func": "chi2"
        }
    },
    "classifier": {
        "class": "sklearn.neighbors.KNeighborsClassifier",
        "args": {
            "n_neighbors": 4,
            "p": 2,
            "weights": "uniform"
        }
    }
}

Show model selection statistics:

In [15]:
clf.show_automl_info()
{
    "Best validation score": "0.986532",
    "Metric": "accuracy",
    "Number of crashed target algorithm runs": "1",
    "Number of successful target algorithm runs": "14",
    "Number of target algorithm runs": "17",
    "Number of target algorithms that exceeded the memory limit": "2",
    "Number of target algorithms that exceeded the time limit": "0"
}

Output the pipeline's predicted values:

In [16]:
%%timeit -n1 -r1 # measure the execution time

predictions = clf.predict(csvpath="data.csv", col_y="y")
print(predictions[:30])
[0, 1, 2, 3, 4, 9, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
6.76 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Predict took 6.76 seconds.

Example: Select a regressor and fit it to given arrays

This example will show how to select and build a regressor. The steps are pretty much the same as those used to fit a classifier.

But instead of using justml.Classifier, we now use the justml.Regressor class.

Create a new regressor, or retrieve an existing one, named "regressor1":

In [17]:
reg = justml.Regressor(name="regressor1")
print(reg.status)
uninitialized

If the regressor exists and is already fitted, its status is "trained". That is not the case here, so we fit it using the boston dataset from scikit-learn, which we split into a training set and a test set.

In [18]:
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

if reg.status != "trained":
    reg.fit(X=X_train, y=y_train)
INFO:justml:Sending data to train estimator regressor1...
INFO:justml:Waiting for training to complete on JustML servers...
INFO:justml:Training is finished, and data has been deleted.

Show the machine learning pipeline that was built during fit:

In [19]:
reg.show_pipeline()
{
    "data_preprocessor": [
        {
            "step": "categorical_encoding",
            "class": "no_encoding"
        },
        {
            "step": "imputation",
            "args": {
                "strategy": "mean"
            }
        },
        {
            "step": "rescaling",
            "class": "quantile_transformer",
            "args": {
                "n_quantiles": 42152,
                "output_distribution": "normal"
            }
        }
    ],
    "feature_preprocessor": {
        "class": "no_preprocessing"
    },
    "regressor": {
        "class": "sklearn.ensemble.GradientBoostingRegressor",
        "args": {
            "alpha": 0.9575021330927016,
            "learning_rate": 0.1616604426098248,
            "loss": "huber",
            "max_depth": 4,
            "max_features": 0.15922214934134588,
            "max_leaf_nodes": "None",
            "min_impurity_decrease": 0.0,
            "min_samples_leaf": 8,
            "min_samples_split": 20,
            "min_weight_fraction_leaf": 0.0,
            "n_estimators": 213,
            "subsample": 0.6969886475405643
        }
    }
}
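
As with the classifier, this pipeline can be approximated locally. The sketch below rests on several assumptions: that "quantile_transformer" corresponds to sklearn.preprocessing.QuantileTransformer, that the string "None" stands for Python's None, and that scikit-learn 0.22 or later is installed. It uses load_diabetes as a stand-in dataset because load_boston was removed in scikit-learn 1.2, so the scores will not match the ones in this notebook. n_quantiles is capped at the number of training samples, since recent scikit-learn versions do not accept more quantiles than samples.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

# Stand-in dataset (assumption): not the data used in this notebook.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("rescaling", QuantileTransformer(
        n_quantiles=min(42152, len(X_train)),  # capped at the training set size
        output_distribution="normal",
        random_state=0)),
    # Hyperparameters copied from the show_pipeline() output above.
    ("regressor", GradientBoostingRegressor(
        alpha=0.9575021330927016,
        learning_rate=0.1616604426098248,
        loss="huber",
        max_depth=4,
        max_features=0.15922214934134588,
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        min_samples_leaf=8,
        min_samples_split=20,
        min_weight_fraction_leaf=0.0,
        n_estimators=213,
        subsample=0.6969886475405643,
        random_state=0)),
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print("R2: %.6f" % r2_score(y_test, predictions))
```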

Show model selection statistics:

In [20]:
reg.show_automl_info()
{
    "Best validation score": "0.892810",
    "Metric": "r2",
    "Number of crashed target algorithm runs": "2",
    "Number of successful target algorithm runs": "111",
    "Number of target algorithm runs": "113",
    "Number of target algorithms that exceeded the memory limit": "0",
    "Number of target algorithms that exceeded the time limit": "0"
}

Use the regressor to predict the outcome of X_test, and compare the results with y_test:

In [21]:
predictions = reg.predict(X=X_test)

from sklearn.metrics import r2_score, mean_squared_error, explained_variance_score
print('R2: %.6f' % r2_score(y_test, predictions))
print('MSE: %.6f' % mean_squared_error(y_test, predictions))
print('Explained variance: %.6f' % explained_variance_score(y_test, predictions))
R2: 0.805831
MSE: 15.309361
Explained variance: 0.805832
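
R2 and explained variance are almost identical above. The two differ only by a term proportional to the squared mean of the residuals, so they coincide when the prediction errors average out to zero. The sketch below computes the three metrics by hand with NumPy and checks them against scikit-learn; the toy values are hypothetical and unrelated to the notebook's data.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Hypothetical toy values, not the notebook's data.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
residuals = y_true - y_pred

# Mean squared error: average squared residual.
mse = np.mean(residuals ** 2)

# R2: one minus the residual sum of squares over the total sum of squares.
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: same idea, but the mean residual is subtracted first.
explained_variance = 1.0 - np.var(residuals) / np.var(y_true)

print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
print(np.isclose(explained_variance, explained_variance_score(y_true, y_pred)))
```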

That's it! If you have any questions, comments or suggestions, drop us a line at [email protected]