Wrapping New Individual Operators

Lale comes with several library operators, so you do not need to write your own. But if you want to contribute new operators, this tutorial is for you. First let us review some basic concepts in Lale from the point of view of adding new operators (estimators and transformers). Lale is a library for semi-automated data science, and is designed for the following goals:

  • Automation: easy search and tuning of pipelines
  • Usability: scikit-learn compatible, plus types
  • Interoperability: support for Python building blocks and beyond

To enable the above properties for your operators with Lale, you need to:

  1. Write an operator implementation class with methods __init__, fit, and predict or transform. If you have a custom estimator or transformer as per scikit-learn, you can skip this step as that is already a valid Lale operator.
  2. Register the operator implementation from Step 1 via lale.operators.make_operator. This step automatically creates a JSON schema skeleton for your operator.
  3. Customize the hyperparameter JSON schema to indicate what hyperparameters are expected by an operator, to specify the types, default values, and recommended minimum/maximum values for automatic tuning. The hyperparameter schema can also encode constraints indicating dependencies between hyperparameter values such as solver abc only supports penalty xyz.
  4. Optionally, customize the schemas for input and output datasets.
  5. Test and use the new operator, for instance, for training or hyperparameter optimization.
  6. Consider contributing your new operator to the Lale open-source project.

The next sections illustrate these six steps using an example. After the example-driven sections, this document concludes with a reference covering features from the example and beyond. This document focuses on individual operators. Pipelines that compose multiple operators are documented elsewhere.

1. Create a New Operator

This section can be skipped if you already have a scikit-learn compatible estimator or transformer class with methods __init__, fit, and predict or transform. Any other compatibility with scikit-learn such as get_params or set_params is optional, and so is extending from sklearn.base.BaseEstimator.

This section illustrates how to implement this class with the help of an example. The running example in this document is a simple custom operator that just wraps the LogisticRegression estimator from scikit-learn. Of course you can write a similar class to wrap your own operators, which do not need to come from scikit-learn. The following code defines a class _MyLRImpl.

In [1]:
import sklearn.linear_model

class _MyLRImpl:
    def __init__(self, **hyperparams):
        self._wrapped_model = sklearn.linear_model.LogisticRegression(
            **hyperparams)

    def fit(self, X, y):
        self._wrapped_model.fit(X, y)
        return self

    def predict(self, X, **kwargs):
        return self._wrapped_model.predict(X, **kwargs)

This code first imports the relevant scikit-learn package. Then, it declares a new class for wrapping it. Currently, Lale only supports Python, but eventually, it will also support other programming languages. Therefore, the Lale approach for wrapping new operators carefully avoids depending too much on the Python language or any particular Python library. Hence, the _MyLRImpl class does not need to inherit from anything, but it does need to follow certain conventions:

  • It has a constructor, __init__, whose arguments are the hyperparameters.

  • It has a training method, fit, with an argument X containing the training examples and, in the case of supervised models, an argument y containing labels. In the example, the constructor has already created an instance of the scikit-learn LogisticRegression operator; the fit method trains that instance and returns the wrapper object.

  • It has a prediction method, predict for an estimator or transform for a transformer. The method has an argument X containing the test examples and returns the labels for predict or the transformed data for transform.

These conventions are designed to be similar to those of scikit-learn. However, they avoid a code dependency upon scikit-learn.

Note that in a simple example like this, the underlying sklearn.linear_model.LogisticRegression class could be used directly, without needing the _MyLRImpl wrapper. However, creating such a wrapper is useful for more complicated examples.

2. Register a New Lale Operator

We can now register _MyLRImpl as a new Lale operator MyLR.

In [2]:
import lale.operators
MyLR = lale.operators.make_operator(_MyLRImpl)

The call to make_operator automatically creates a skeleton JSON schema for the hyperparameters and the operator methods and attaches it to MyLR.

In [3]:
from lale.pretty_print import ipython_display
ipython_display(MyLR._schemas)
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Schema for <class 'type'> auto-generated by lale.type_checking.get_default_schema().",
    "type": "object",
    "tags": {"pre": [], "op": ["estimator"], "post": []},
    "properties": {
        "hyperparams": {
            "allOf": [
                {
                    "type": "object",
                    "properties": {},
                    "relevantToOptimizer": [],
                }
            ]
        },
        "input_fit": {
            "type": "object",
            "properties": {
                "X": {"laleType": "Any"},
                "y": {"laleType": "Any"},
            },
            "additionalProperties": false,
            "required": ["X", "y"],
        },
        "input_predict": {
            "type": "object",
            "properties": {"X": {"laleType": "Any"}},
            "additionalProperties": false,
            "required": ["X"],
        },
        "output_predict": {"laleType": "Any"},
    },
}

3. Customize Hyperparameter Schema

Lale requires schemas both for error-checking and for generating search spaces for hyperparameter optimization. The schemas of a Lale operator specify the space of valid values for hyperparameters, for the arguments to fit and predict or transform, and for the output of predict or transform. To keep the schemas independent of the Python programming language, they are expressed as JSON Schema. JSON Schema is currently a draft standard and is already being widely adopted and implemented, for instance, as part of specifying Swagger APIs.

The schema of a Lale operator can be incrementally customized using the customize_schema method, which returns a copy of the operator with the customized schema. The customize_schema method also validates the new schema for early error reporting.

Instead of manually writing the schemas -- which can be error prone -- we provide a dedicated API to help the authoring of operator schemas.

The running example chooses hyperparameters of scikit-learn LogisticRegression that illustrate all the interesting cases. More complete and elaborate examples can be found in the Lale standard library. The following specifies each hyperparameter one at a time, omitting cross-cutting constraints.

In [4]:
from lale.schemas import Null, Enum, Int, Float, Object, Array, Not, AnyOf

MyLR = MyLR.customize_schema(
    relevantToOptimizer=['solver', 'penalty', 'C'],
    solver=Enum(desc='Algorithm for optimization problem.',
                values=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                default='liblinear'),
    penalty=Enum(desc='Norm used in the penalization.',
                 values=['l1', 'l2'],
                 default='l2'),
    C=Float(desc='Inverse regularization strength. '
                 'Smaller values specify stronger regularization.',
            minimum=0.0, exclusiveMinimum=True,
            minimumForOptimizer=0.03125, maximumForOptimizer=32768, 
            distribution='loguniform',
            default=1.0))

Here, solver and penalty are categorical hyperparameters and C is a continuous hyperparameter. For all three hyperparameters, the schema includes a description, used for interactive documentation, and a default value, used when no explicit value is specified. The categorical hyperparameters are specified as enumerations of their legal values. In contrast, the continuous hyperparameter is a number, and the schema includes additional information such as its distribution, minimum, and maximum. In the example, C has 'minimum': 0.0 together with 'exclusiveMinimum': True, indicating that only strictly positive values are valid. Furthermore, C has 'minimumForOptimizer': 0.03125 and 'maximumForOptimizer': 32768, guiding the optimizer to limit its search space.
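
To give an intuition for what a 'loguniform' distribution between the two optimizer bounds means, here is a minimal sketch using only the Python standard library. The helper sample_loguniform is hypothetical; it is neither Lale's nor hyperopt's implementation, only an illustration of the idea that the logarithm of the value is drawn uniformly.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


# Bounds copied from the schema of C above.
c_bounds = {"minimumForOptimizer": 0.03125, "maximumForOptimizer": 32768}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [
    sample_loguniform(c_bounds["minimumForOptimizer"],
                      c_bounds["maximumForOptimizer"], rng)
    for _ in range(1000)
]

# 0.03125 is 2**-5 and 32768 is 2**15, so about 5/20 of the draws
# are expected to fall below 1.0.
fraction_below_one = sum(s < 1.0 for s in samples) / len(samples)
```

With a uniform distribution over the same range, almost no draws would fall below 1.0, which is why a log scale is a natural fit for a regularization strength such as C.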

Constraints

Besides specifying hyperparameters one at a time, users may also want to specify cross-cutting constraints to further restrict the hyperparameter schema. This part is an advanced use case and can be skipped by novice users.

In [5]:
MyLR = MyLR.customize_schema(
    constraint=AnyOf([Object(solver=Not(Enum(['newton-cg', 'sag', 'lbfgs']))),
                      Object(penalty=Enum(['l2']))]))

In JSON schema, allOf is a logical "and", anyOf is a logical "or", and not is a logical negation. Thus, the anyOf part of the example can be read as

assert not (solver in ['newton-cg', 'sag', 'lbfgs']) or penalty == 'l2'

By standard Boolean rules, this is equivalent to a logical implication:

if solver in ['newton-cg', 'sag', 'lbfgs']:
    assert penalty == 'l2'
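
The equivalence can be checked exhaustively, since both hyperparameters are enumerations. The following sketch uses only the standard library; the function names any_of_form and implication_form are illustrative and not part of Lale.

```python
from itertools import product

solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
penalties = ['l1', 'l2']
l2_only = ['newton-cg', 'sag', 'lbfgs']


def any_of_form(solver, penalty):
    # Mirrors the anyOf in the constraint: not(solver in ...) or penalty == 'l2'.
    return (solver not in l2_only) or (penalty == 'l2')


def implication_form(solver, penalty):
    # Mirrors the if/assert reading of the same constraint.
    if solver in l2_only:
        return penalty == 'l2'
    return True


combos = list(product(solvers, penalties))
same = all(any_of_form(s, p) == implication_form(s, p) for s, p in combos)
rejected = [(s, p) for s, p in combos if not any_of_form(s, p)]
```

Of the ten combinations, the constraint rejects exactly the three that pair an l2-only solver with the l1 penalty.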

The complete hyperparameters schema simply combines the ranges with the constraints:

In [6]:
ipython_display(MyLR.hyperparam_schema())
{
    "allOf": [
        {
            "type": "object",
            "properties": {
                "solver": {
                    "default": "liblinear",
                    "description": "Algorithm for optimization problem.",
                    "enum": [
                        "newton-cg", "lbfgs", "liblinear", "sag", "saga",
                    ],
                },
                "penalty": {
                    "default": "l2",
                    "description": "Norm used in the penalization.",
                    "enum": ["l1", "l2"],
                },
                "C": {
                    "default": 1.0,
                    "description": "Inverse regularization strength. Smaller values specify stronger regularization.",
                    "type": "number",
                    "minimum": 0.0,
                    "exclusiveMinimum": true,
                    "minimumForOptimizer": 0.03125,
                    "maximumForOptimizer": 32768,
                    "distribution": "loguniform",
                },
            },
            "relevantToOptimizer": ["solver", "penalty", "C"],
        },
        {
            "anyOf": [
                {
                    "type": "object",
                    "properties": {
                        "solver": {
                            "not": {"enum": ["newton-cg", "sag", "lbfgs"]}
                        }
                    },
                },
                {
                    "type": "object",
                    "properties": {"penalty": {"enum": ["l2"]}},
                },
            ]
        },
    ]
}

4. Customize Fit and Predict Schemas

The next step is to specify the expected input and output type of the methods fit, and predict or transform.

The fit method of MyLR takes two arguments, X and y. The X argument is an array of arrays of numbers. The outer array is over samples (rows) of a dataset. The inner array is over features (columns) of a sample. The y argument is an array of numbers; in this example they are non-negative class labels, although the schema below does not enforce that. Each element of y is a label for the corresponding sample in X.

In [7]:
MyLR = MyLR.customize_schema(
    input_fit=Object(
        required=['X', 'y'],
        additionalProperties=False,
        X=Array(items=Array(items=Float())),
        y=Array(items=Float())))

The schema for the arguments of the predict method is similar, just omitting y:

In [8]:
MyLR = MyLR.customize_schema(
    input_predict=Object(
        required=['X'],
        X=Array(items=Array(items=Float())), 
        additionalProperties=False))

The output schema indicates that the predict method returns an array of labels with the same schema as y:

In [9]:
MyLR = MyLR.customize_schema(output_predict=Array(items=Float()))
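
To make these logical schemas concrete, the following sketch checks plain Python lists against the plain JSON Schema form of X and y. The helper conforms is hypothetical and covers only the number and array keywords; it is not Lale's validation code.

```python
def conforms(value, schema):
    """Check a value against a tiny subset of JSON Schema."""
    kind = schema.get("type")
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "array":
        if not isinstance(value, (list, tuple)):
            return False
        items = schema.get("items", {})
        return all(conforms(v, items) for v in value)
    return True  # no type given: accept anything, like an empty schema


x_schema = {"type": "array",
            "items": {"type": "array", "items": {"type": "number"}}}
y_schema = {"type": "array", "items": {"type": "number"}}

X_ok = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
y_ok = [0, 0]
```

Here X_ok conforms to x_schema and y_ok conforms to y_schema, whereas passing y_ok where X is expected fails because its elements are numbers rather than arrays.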

Tags

Finally, we can add tags for discovery and documentation.

In [10]:
MyLR = MyLR.customize_schema(
    tags={'pre': ['~categoricals'],
          'op': ['estimator', 'classifier', 'interpretable'],
          'post': ['probabilities']})

We now have a complete set of JSON schemas for our MyLR operator:

In [11]:
ipython_display(MyLR._schemas)
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Schema for <class 'type'> auto-generated by lale.type_checking.get_default_schema().",
    "type": "object",
    "tags": {
        "pre": ["~categoricals"],
        "op": ["estimator", "classifier", "interpretable"],
        "post": ["probabilities"],
    },
    "properties": {
        "hyperparams": {
            "allOf": [
                {
                    "type": "object",
                    "properties": {
                        "solver": {
                            "default": "liblinear",
                            "description": "Algorithm for optimization problem.",
                            "enum": [
                                "newton-cg", "lbfgs", "liblinear", "sag",
                                "saga",
                            ],
                        },
                        "penalty": {
                            "default": "l2",
                            "description": "Norm used in the penalization.",
                            "enum": ["l1", "l2"],
                        },
                        "C": {
                            "default": 1.0,
                            "description": "Inverse regularization strength. Smaller values specify stronger regularization.",
                            "type": "number",
                            "minimum": 0.0,
                            "exclusiveMinimum": true,
                            "minimumForOptimizer": 0.03125,
                            "maximumForOptimizer": 32768,
                            "distribution": "loguniform",
                        },
                    },
                    "relevantToOptimizer": ["solver", "penalty", "C"],
                },
                {
                    "anyOf": [
                        {
                            "type": "object",
                            "properties": {
                                "solver": {
                                    "not": {
                                        "enum": ["newton-cg", "sag", "lbfgs"]
                                    }
                                }
                            },
                        },
                        {
                            "type": "object",
                            "properties": {"penalty": {"enum": ["l2"]}},
                        },
                    ]
                },
            ]
        },
        "input_fit": {
            "type": "object",
            "required": ["X", "y"],
            "additionalProperties": false,
            "properties": {
                "X": {
                    "type": "array",
                    "items": {"type": "array", "items": {"type": "number"}},
                },
                "y": {"type": "array", "items": {"type": "number"}},
            },
        },
        "input_predict": {
            "type": "object",
            "required": ["X"],
            "additionalProperties": false,
            "properties": {
                "X": {
                    "type": "array",
                    "items": {"type": "array", "items": {"type": "number"}},
                }
            },
        },
        "output_predict": {"type": "array", "items": {"type": "number"}},
    },
}

5. Testing and Using the New Operator

Once your operator implementation and schema definitions are ready, you can test it with Lale as follows. First, you will need to install Lale, as described in the installation instructions.

Use the New Operator

Before demonstrating the new MyLR operator, the following code loads the Iris dataset, which comes out-of-the-box with scikit-learn.

In [12]:
import sklearn.datasets
import sklearn.utils
iris = sklearn.datasets.load_iris()
X_all, y_all = sklearn.utils.shuffle(iris.data, iris.target, random_state=42)
holdout_size = 30
X_train, y_train = X_all[holdout_size:], y_all[holdout_size:]
X_test, y_test = X_all[:holdout_size], y_all[:holdout_size]
print('expected {}'.format(y_test))
expected [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]

Now that the data is in place, the following code sets the hyperparameters, calls fit to train, and calls predict to make predictions. This code looks almost like what people would usually write with scikit-learn, except that it uses an enumeration MyLR.enum.solver that is implicitly defined by Lale so users do not have to pass in error-prone strings for categorical hyperparameters.

In [13]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

trainable = MyLR(MyLR.enum.solver.lbfgs, C=0.1)
trained = trainable.fit(X_train, y_train)
predictions = trained.predict(X_test)
print('actual {}'.format(predictions))
actual [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]

To illustrate interactive documentation, the following code retrieves the specification of the C hyperparameter.

In [14]:
MyLR.hyperparam_schema('C')
Out[14]:
{'default': 1.0,
 'description': 'Inverse regularization strength. Smaller values specify stronger regularization.',
 'type': 'number',
 'minimum': 0.0,
 'exclusiveMinimum': True,
 'minimumForOptimizer': 0.03125,
 'maximumForOptimizer': 32768,
 'distribution': 'loguniform'}

Similarly, operator tags are reflected via Python methods on the operator:

In [15]:
print(MyLR.has_tag('interpretable'))
print(MyLR.get_tags())
True
{'pre': ['~categoricals'], 'op': ['estimator', 'classifier', 'interpretable'], 'post': ['probabilities']}

To illustrate error-checking, the following code showcases an invalid hyperparameter caught by JSON schema validation.

In [16]:
import jsonschema, sys
try:
    MyLR(solver='adam')
except jsonschema.ValidationError as e:
    print(e.message, file=sys.stderr)
Invalid configuration for MyLR(solver='adam') due to invalid value solver=adam.
Schema of argument solver: {
    "default": "liblinear",
    "description": "Algorithm for optimization problem.",
    "enum": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
}
Value: adam

Finally, to illustrate hyperparameter optimization, the following code uses hyperopt. We will document the hyperparameter optimization use-case in more detail elsewhere. Here we only demonstrate that Lale with MyLR supports it.

In [17]:
from lale.search.op2hp import hyperopt_search_space
from hyperopt import STATUS_OK, Trials, fmin, tpe, space_eval
from sklearn.metrics import accuracy_score
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

def objective(hyperparams):
    del hyperparams['name']
    trainable = MyLR(**hyperparams)
    trained = trainable.fit(X_train, y_train)
    predictions = trained.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    return {'loss': -accuracy, 'status': STATUS_OK}

# The following line is enabled by the hyperparameter schema.
search_space = hyperopt_search_space(MyLR)

trials = Trials()
fmin(objective, search_space, algo=tpe.suggest, max_evals=10, trials=trials)
best_hyperparams = space_eval(search_space, trials.argmin)
print('best hyperparameter combination {}'.format(best_hyperparams))
100%|██████████| 10/10 [00:00<00:00, 54.54trial/s, best loss: -1.0]
best hyperparameter combination {'C': 18098.51542502289, 'name': '__main__.MyLR', 'penalty': 'l2', 'solver': 'saga'}

This concludes the running example. To summarize, we have learned how to write an operator implementation class and JSON schemas; how to register the Lale operator; and how to use the Lale operator for manual as well as automated machine-learning.

Additional Wrapper Class Features

Besides X and y, the fit method in scikit-learn sometimes has additional arguments. Lale also supports such additional arguments.

In addition to the __init__, fit, and predict methods, many scikit-learn estimators also have a predict_proba method. Lale will support that with its own metadata schema.
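
As an illustration of both features on the implementation side, the sketch below shows a hypothetical pure-Python class _MyPriorImpl whose fit method accepts an additional sample_weight argument and which offers predict_proba next to predict. The class and its behavior are assumptions made for this example; the Lale-side schemas for the extra argument and method are not shown.

```python
from collections import defaultdict


class _MyPriorImpl:
    """Hypothetical classifier that always predicts the weighted class prior."""

    def __init__(self, **hyperparams):
        self._hyperparams = hyperparams
        self._classes = []
        self._probs = []

    def fit(self, X, y, sample_weight=None):
        # sample_weight is an additional, optional argument to fit.
        weights = sample_weight if sample_weight is not None else [1.0] * len(y)
        totals = defaultdict(float)
        for label, weight in zip(y, weights):
            totals[label] += weight
        norm = sum(totals.values())
        self._classes = sorted(totals)
        self._probs = [totals[c] / norm for c in self._classes]
        return self

    def predict_proba(self, X):
        # One row of class probabilities per sample.
        return [list(self._probs) for _ in X]

    def predict(self, X):
        best = self._classes[self._probs.index(max(self._probs))]
        return [best for _ in X]


model = _MyPriorImpl().fit([[0], [1], [2]], [0, 1, 1],
                           sample_weight=[3.0, 1.0, 1.0])
```

With these weights, class 0 carries a total weight of 3 out of 5, so the model predicts class 0 for every sample even though class 1 occurs more often.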

6. Consider Contributing to Lale Open-Source Project

We encourage you to add your new operator to Lale. Take a look at the "Developer's Certificate of Origin", which can be found in DCO1.1.txt. Can you agree to the terms in the DCO, as well as to the license? If yes, check out the "For Developers" part at the end of the Installation instructions, and follow the steps to create a pull request.

7. Reference

This section documents features of JSON Schema that Lale uses, as well as extensions that Lale adds to JSON schema for information specific to the machine-learning domain. For a more comprehensive introduction to JSON Schema, refer to its Reference.

The following table lists kinds of schemas in JSON Schema:

Kind of schema    | Corresponding type in Python/Lale
null              | NoneType, value None
boolean           | bool, values True or False
string            | str
enum              | See discussion below.
number            | float, e.g., 0.1
integer           | int, e.g., 42
array             | See discussion below.
object            | dict with string keys
anyOf, allOf, not | See discussion below.

The use of null, boolean, and string is fairly straightforward. The following paragraphs discuss the other kinds of schemas one by one.

7.1. enum

In JSON Schema, an enum can contain assorted values including strings, numbers, or even null. Lale uses enums of strings for categorical hyperparameters, such as 'penalty': {'enum': ['l1', 'l2']} in the earlier example. In that case, Lale also automatically declares a corresponding Python enum. When Lale uses enums of other types, it is usually to restrict a hyperparameter to a single value, such as 'enum': [None].
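
As a rough sketch of the idea, a Python enum can be derived from a string enum schema with the standard enum module. This only illustrates the correspondence; how Lale builds the enums it exposes (as in MyLR.enum.solver.lbfgs) is not shown here and may differ.

```python
import enum

penalty_schema = {'enum': ['l1', 'l2']}

# Functional Enum API: member names and values are both the schema strings.
Penalty = enum.Enum('penalty', {v: v for v in penalty_schema['enum']})

member_names = [m.name for m in Penalty]
chosen = Penalty.l2
```

A misspelled member such as Penalty.l3 raises an AttributeError as soon as the expression is evaluated, whereas a misspelled string is only caught when the hyperparameters are validated.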

7.2. number, integer

In schemas with type set to number or integer, JSON schema lets users specify minimum, maximum, exclusiveMinimum, and exclusiveMaximum. Lale further extends JSON schema with minimumForOptimizer, maximumForOptimizer, and distribution. Possible values for the distribution are 'uniform' (the default) and 'loguniform'. In the case of integers, Lale quantizes the distributions accordingly.
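
The schemas in this document declare draft-04 of JSON Schema, in which exclusiveMinimum and exclusiveMaximum are Boolean flags that modify minimum and maximum. The following sketch spells out that reading; the helper in_range is hypothetical and handles only these four keywords.

```python
def in_range(value, schema):
    """Apply minimum/maximum with draft-04 style Boolean exclusive flags."""
    if "minimum" in schema:
        if schema.get("exclusiveMinimum", False):
            if value <= schema["minimum"]:
                return False
        elif value < schema["minimum"]:
            return False
    if "maximum" in schema:
        if schema.get("exclusiveMaximum", False):
            if value >= schema["maximum"]:
                return False
        elif value > schema["maximum"]:
            return False
    return True


# Range keywords of C from the running example.
c_schema = {"type": "number", "minimum": 0.0, "exclusiveMinimum": True}
# A half-open interval [0, 1) for comparison.
unit_schema = {"type": "number", "minimum": 0.0,
               "maximum": 1.0, "exclusiveMaximum": True}
```

Under c_schema, the value 0.0 is rejected while any strictly positive value is accepted; under unit_schema, 0.0 is accepted and 1.0 is rejected.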

7.3. array

Lale schemas for input and output data make heavy use of the JSON Schema array type. In this case, Lale schemas are intended to capture logical schemas, not physical representations, similarly to how relational databases hide physical representations behind a well-formalized abstraction layer. Therefore, Lale uses arrays from JSON Schema for several types in Python. The most obvious one is a Python list. Another common one is a numpy ndarray, where Lale uses nested arrays to represent each of the dimensions of a multi-dimensional array. Lale also has support for pandas.DataFrame and pandas.Series, for which it again uses JSON Schema arrays.

For arrays, JSON schema lets users specify items, minItems, and maxItems. Lale further extends JSON schema with minItemsForOptimizer and maxItemsForOptimizer. Furthermore, Lale supports a laleType, which can be 'Any' to locally disable a subschema check, or 'tuple' to support cases where the Python code requires a tuple instead of a list.

7.4. object

For objects, JSON schema lets users specify a list required of properties that must be present, a dictionary properties of sub-schemas, and a flag additionalProperties to indicate whether the object can have additional properties beyond the keys of the properties dictionary. Lale further extends JSON schema with a relevantToOptimizer list of properties that hyperparameter optimizers should search over.

For individual properties, Lale supports a default, which is inspired by and consistent with web API specification practice. It also supports a forOptimizer flag which defaults to True but can be set to False to hide a particular subschema from the hyperparameter optimizer. For example, the number of components for PCA in scikit-learn can be specified as an integer or a floating point number, but an optimizer should only explore one of these choices. Lale supports a Boolean flag transient that, if true, elides a hyperparameter during pretty-printing, visualization, or in JSON.
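
Because these schemas are plain dictionaries, the information in properties, default, and relevantToOptimizer can be read with ordinary Python. The sketch below uses an abbreviated copy of the hyperparameter schema from the running example; the variable names are illustrative.

```python
hyperparam_schema = {
    "type": "object",
    "properties": {
        "solver": {
            "enum": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
            "default": "liblinear",
        },
        "penalty": {"enum": ["l1", "l2"], "default": "l2"},
        "C": {
            "type": "number",
            "default": 1.0,
            "minimumForOptimizer": 0.03125,
            "maximumForOptimizer": 32768,
            "distribution": "loguniform",
        },
    },
    "relevantToOptimizer": ["solver", "penalty", "C"],
}

properties = hyperparam_schema["properties"]

# Default configuration: one entry per property that declares a default.
defaults = {name: sub["default"]
            for name, sub in properties.items() if "default" in sub}

# Sub-schemas an optimizer is asked to search over.
tunable = {name: properties[name]
           for name in hyperparam_schema["relevantToOptimizer"]}
```

Here defaults holds the configuration used when no hyperparameters are passed, and tunable holds the three sub-schemas listed in relevantToOptimizer.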

7.5. allOf, anyOf, not

As discussed before, in JSON schema, allOf is a logical "and", anyOf is a logical "or", and not is a logical negation. The running example from earlier already illustrated how to use these for implementing cross-cutting constraints. Another use-case that takes advantage of anyOf is for expressing union types, which arise frequently in scikit-learn. For example, here is the schema for n_components from PCA:

In [18]:
n_components_sch = AnyOf(
    [Enum([None], desc="If not set, keep all components."),
     Enum(['mle'], desc="Use Minka's MLE to guess the dimension."),
     Float(minimum=0.0, exclusiveMinimum=True, 
           maximum=1.0, exclusiveMaximum=True, 
           desc='Select the number of components such that the amount of variance '
                'that needs to be explained is greater than the specified percentage.'),
     Int(minimum=1, forOptimizer=False, desc='Number of components to keep.')],
    default=None)

ipython_display(n_components_sch.schema)
{
    "default": null,
    "anyOf": [
        {"description": "If not set, keep all components.", "enum": [null]},
        {
            "description": "Use Minka's MLE to guess the dimension.",
            "enum": ["mle"],
        },
        {
            "description": "Select the number of components such that the amount of variance that needs to be explained is greater than the specified percentage.",
            "type": "number",
            "minimum": 0.0,
            "exclusiveMinimum": true,
            "maximum": 1.0,
            "exclusiveMaximum": true,
        },
        {
            "description": "Number of components to keep.",
            "forOptimizer": false,
            "type": "integer",
            "minimum": 1,
        },
    ],
}
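
To show how such a union type is read, the following sketch reports which branch of the anyOf a given Python value falls into. The function n_components_branch is hypothetical and simplified: it treats Python int and float values separately and is not a substitute for JSON Schema validation.

```python
def n_components_branch(value):
    """Return the name of the anyOf branch a value matches, or None."""
    if value is None:
        return "null"
    if isinstance(value, str):
        return "mle" if value == "mle" else None
    if isinstance(value, bool):
        return None  # bool is a subclass of int, but not a valid choice here
    if isinstance(value, int):
        return "integer" if value >= 1 else None
    if isinstance(value, float):
        # Both bounds are exclusive in the schema above.
        return "number" if 0.0 < value < 1.0 else None
    return None


branches = [n_components_branch(v) for v in (None, "mle", 0.95, 3)]
```

Values such as 0, 1.5, or 'auto' match none of the four branches and would therefore be rejected by the schema.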

7.6. Schema Metadata

We encourage users to make their schemas more readable by also including common JSON schema metadata such as $schema and description. As seen in the examples in this document, Lale also extends JSON schema with tags and documentation_url. Finally, in some cases, schema-internal duplication can be avoided by cross-references and linking. This is supported by off-the-shelf features of JSON schema without requiring Lale-specific extensions.