import pathlib
import category_encoders
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing
%load_ext autoreload
%autoreload 2
Welcome to the Fiddler notebook experience! This notebook will demonstrate how to effectively get started with the Fiddler platform by uploading your models and data. The notebook is organized into two sections:

1. Importing data and training a model using only Pandas and Scikit-Learn
2. Uploading the dataset and model to Fiddler

Section 1 does not use any Fiddler code, so if you are familiar with Pandas and Scikit-Learn, you should feel comfortable skimming through it and jumping into section 2.
Being an effective data scientist involves using the right tool for the job. When it comes to importing, cleaning, and exploring your data in Jupyter, we don't want to interrupt your normal workflow, so we integrate our tools with the popular Pandas DataFrame object. Thus, as long as your data can be dumped into a DataFrame object, there is nothing else you need to do to get it ready to upload to Fiddler.
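For instance, whether your data comes from a CSV file, a database query, or in-memory records, it only needs to end up in a DataFrame. The toy sketch below (hypothetical data, not part of this notebook's dataset) shows the in-memory route; the rest of the notebook uses the CSV route.

# toy illustration: any route into a DataFrame works,
# e.g. constructing one from in-memory records
toy_df = pd.DataFrame.from_records(
    [{'temp': 0.56, 'cnt': 7}, {'temp': 0.44, 'cnt': 5}])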
# let's import some data from disk
data_dir = pathlib.Path('example_datasets')
bikeshare_csv_file = data_dir / 'bikeshare_hour.csv'
# here we pre-configure the datatypes for our dataframe
# so it doesn't require any datatype modification after import
bikeshare_dtypes = dict(season='category', holiday='bool',
workingday='bool', weathersit='category')
bikeshare_datetime_columns = ['dteday']
bikeshare_index_column = 'instant'
df = pd.read_csv(bikeshare_csv_file,
dtype=bikeshare_dtypes,
parse_dates=bikeshare_datetime_columns,
index_col=bikeshare_index_column)
# train-test split by time: 2011 as train, 2012 as test
is_2011 = df['yr'] == 0
df_2011 = df[is_2011]
df_2012 = df[~is_2011]
# peek at the data
display(df.sample(3, random_state=0).T)
# print info about train-test split
print(f'Train set (bikeshare rentals in 2011) has {df_2011.shape[0]} rows,'
f' test set (bikeshare rentals in 2012) has {df_2012.shape[0]} rows')
| instant | 3440 | 6543 | 15471 |
|---|---|---|---|
| dteday | 2011-05-28 00:00:00 | 2011-10-05 00:00:00 | 2012-10-11 00:00:00 |
| season | 2 | 4 | 4 |
| yr | 0 | 0 | 1 |
| mnth | 5 | 10 | 10 |
| hr | 5 | 4 | 19 |
| holiday | False | False | False |
| weekday | 6 | 3 | 4 |
| workingday | False | True | True |
| weathersit | 1 | 1 | 1 |
| temp | 0.56 | 0.44 | 0.44 |
| atemp | 0.5303 | 0.4394 | 0.4394 |
| hum | 0.88 | 0.88 | 0.51 |
| windspeed | 0.2239 | 0 | 0.1343 |
| casual | 4 | 1 | 81 |
| registered | 3 | 4 | 662 |
| cnt | 7 | 5 | 743 |
Train set (bikeshare rentals in 2011) has 8645 rows, test set (bikeshare rentals in 2012) has 8734 rows
Just like with data work, we believe in the right tools for the job. We currently integrate tightly with models supporting the sklearn API, including non-sklearn packages that support the sklearn API, like xgboost and LightGBM. Since encoding categorical variables can be a pain in sklearn, we also support the category_encoders package. Please note that if you introduce any custom classes or transformation functions into your modeling, it may become difficult to get your models running in Fiddler. We therefore recommend using the Transformer objects provided by sklearn (and the category_encoders package) and combining preprocessing and inference steps using the sklearn Pipeline API.
# specify which columns are features and which are not
target = 'cnt'
not_used_as_features = ['dteday', 'yr', 'casual', 'registered']
non_feature_columns = [target] + not_used_as_features
feature_columns = list(set(df_2011.columns) - set(non_feature_columns))
# split our data into features and targets
x_train = df_2011.drop(columns=non_feature_columns)
x_test = df_2012.drop(columns=non_feature_columns)
y_train = df_2011[target]
y_test = df_2012[target]
# modeling approach:
# 1) onehot encode categorical variables
# 2) standard scale all variables
# 3) fit a k-Nearest-Neighbors model with k=10 and l1 distance as the distance metric
onehot = category_encoders.OneHotEncoder(cols=df.select_dtypes('category').columns.tolist())
standard_scaler = sklearn.preprocessing.StandardScaler()
knn = sklearn.neighbors.KNeighborsRegressor(
n_neighbors=10,
weights='distance', metric='l1',
n_jobs=-1)
model = sklearn.pipeline.make_pipeline(onehot, standard_scaler, knn)
# fit the model
model.fit(x_train, y_train)
# score the model
train_r2 = sklearn.metrics.r2_score(y_train, model.predict(x_train))
test_r2 = sklearn.metrics.r2_score(y_test, model.predict(x_test))
print(f'r2 scores: {train_r2:.2f} Train | {test_r2:.2f} Test')
r2 scores: 1.00 Train | 0.38 Test

(The perfect train score is an artifact of distance-weighted kNN: each training point is its own nearest neighbor at distance zero, so training predictions reproduce the training labels exactly. The test score is the meaningful measure of generalization here.)
Up until now, we haven't done anything Fiddler-specific. Now we'll go ahead and change that. Let's begin by importing the Fiddler package.
import fiddler as fdl
Before you can start working with a Fiddler-integrated Jupyter environment, you should set up access to a running instance of Fiddler.

In onebox, this means running the start.sh script to launch onebox locally. For the cloud version of our product, this means looking up your authentication token in the Fiddler settings dashboard.

In order to get your data and models into the Fiddler Engine, you'll need to connect using the API. The FiddlerApi object handles most of the nitty-gritty for you, so all you have to do is specify some details about the Fiddler system you're connecting to.
import getpass
# NOTE: use 'http://host.docker.internal:4100' as our URL only if Jupyter is running in a docker VM on the same macOS machine as onebox
url = input('Enter the API url for your running instance of Fiddler (usually "https://api.fiddler.ai" or "http://localhost:4100"):')
token = getpass.getpass('Enter Fiddler access token (see <Fiddler URL>/settings/credentials to find/create/change this token):')
fiddler_api = fdl.FiddlerApi(url=url, org_id='onebox', auth_token=token)
Now that we have our dataset in working order, let's upload it to the Fiddler platform. As mentioned above, our Dataset class directly integrates with Pandas to make this a snap.
fiddler_api.list_datasets()
['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality']
# to upload, we just pass our train/test DataFrames
# to the FiddlerApi along with a dataset_id
upload_result = fiddler_api.upload_dataset(
dataset={'train': df_2011, 'test': df_2012},
dataset_id='bikeshare')
upload_result
Heads up! We are inferring the details of your dataset from the dataframe(s) provided. Please take a second to check our work.

If the following DatasetInfo is an incorrect representation of your data, you can construct a DatasetInfo with the DatasetInfo.from_dataframe() method and modify that object to reflect the correct details of your dataset. After constructing a corrected DatasetInfo, please re-upload your dataset with that DatasetInfo object explicitly passed via the `info` parameter of FiddlerApi.upload_dataset(). You may need to delete the initially uploaded version via FiddlerApi.delete_dataset('bikeshare').

Inferred DatasetInfo to check:

DatasetInfo:
display_name: bikeshare
files: []
columns:

| | column | dtype | count(possible_values) |
|---|---|---|---|
| 0 | instant | INTEGER | - |
| 1 | dteday | STRING | - |
| 2 | season | CATEGORY | 4 |
| 3 | yr | INTEGER | - |
| 4 | mnth | INTEGER | - |
| 5 | hr | INTEGER | - |
| 6 | holiday | BOOLEAN | - |
| 7 | weekday | INTEGER | - |
| 8 | workingday | BOOLEAN | - |
| 9 | weathersit | CATEGORY | 4 |
| 10 | temp | FLOAT | - |
| 11 | atemp | FLOAT | - |
| 12 | hum | FLOAT | - |
| 13 | windspeed | FLOAT | - |
| 14 | casual | INTEGER | - |
| 15 | registered | INTEGER | - |
| 16 | cnt | INTEGER | - |
{'row_count': 17379, 'col_count': 17, 'log': ['Importing dataset bikeshare', 'Found old data. Deleting it', 'Creating table for bikeshare', 'Importing data file: test.csv', 'Importing data file: train.csv']}
fiddler_api.delete_dataset('bikeshare')
'Dataset deleted bikeshare'
# we see that the 'bikeshare' dataset no longer shows up in the list of all datasets
fiddler_api.list_datasets()
['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality']
# Upload example with custom DatasetInfo
bikeshare_info = fdl.DatasetInfo.from_dataframe(df_2011, display_name='Bikeshare Dataset')
bikeshare_info['weathersit'].possible_values.extend([123, 456, 789])
print('We customized the DatasetInfo for this dataset '
'with a custom display_name and more `weathersit` possible-values.')
print(bikeshare_info)
# upload
upload_result = fiddler_api.upload_dataset(
dataset={'train': df_2011, 'test': df_2012},
dataset_id='bikeshare',
info=bikeshare_info
)
upload_result
We customized the DatasetInfo for this dataset with a custom display_name and more `weathersit` possible-values.

DatasetInfo:
display_name: Bikeshare Dataset
files: []
columns:

| | column | dtype | count(possible_values) |
|---|---|---|---|
| 0 | instant | INTEGER | - |
| 1 | dteday | STRING | - |
| 2 | season | CATEGORY | 4 |
| 3 | yr | INTEGER | - |
| 4 | mnth | INTEGER | - |
| 5 | hr | INTEGER | - |
| 6 | holiday | BOOLEAN | - |
| 7 | weekday | INTEGER | - |
| 8 | workingday | BOOLEAN | - |
| 9 | weathersit | CATEGORY | 7 |
| 10 | temp | FLOAT | - |
| 11 | atemp | FLOAT | - |
| 12 | hum | FLOAT | - |
| 13 | windspeed | FLOAT | - |
| 14 | casual | INTEGER | - |
| 15 | registered | INTEGER | - |
| 16 | cnt | INTEGER | - |
{'row_count': 17379, 'col_count': 17, 'log': ['Importing dataset bikeshare', 'Creating table for bikeshare', 'Importing data file: test.csv', 'Importing data file: train.csv']}
# we see that the 'bikeshare' dataset now shows up in the list of all datasets
fiddler_api.list_datasets()
['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality', 'bikeshare']
We can also verify everything worked by looking at the web UI (onebox, or the cloud UI if you used cloud instead of onebox).
We currently support the upload of scikit-learn models directly through the fiddler package. While custom code is tricky to deploy to Fiddler, we support a number of additional packages beyond sklearn that enable the deployment of powerful black-box models. These include:

- xgboost (as long as the scikit-learn API is used)
- lightgbm (as long as the scikit-learn API is used)
- category_encoders

For best explainability results, we recommend organizing your modeling pipeline using the scikit-learn Pipeline API so that your feature transformations are integrated with your model. This is because pre-transforming your data can have a negative effect on explanation interpretability, as shown in the sketch below.
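To make the recommendation concrete, here is a minimal sketch (an illustration, not part of this notebook's workflow) that swaps the kNN regressor above for an xgboost model while keeping the categorical encoding inside the pipeline. It assumes the xgboost package is installed and uses only its scikit-learn-compatible XGBRegressor interface.

# a minimal sketch, assuming `xgboost` is installed: the same pipeline
# organization as before, with XGBRegressor in place of the kNN regressor
import xgboost
xgb_model = sklearn.pipeline.make_pipeline(
    # keeping the encoder inside the pipeline means feature transformations
    # travel with the model, which helps explanation interpretability
    category_encoders.OneHotEncoder(cols=['season', 'weathersit']),
    xgboost.XGBRegressor(n_estimators=100))
xgb_model.fit(x_train, y_train)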
# To organize our models, let's first create a project on Fiddler.
fiddler_api.create_project('bikeshare_forecasting')
{'project_name': 'bikeshare_forecasting'}
# we see that the 'bikeshare_forecasting' project now shows up in the list of all projects
fiddler_api.list_projects()
['imdb_rnn', 'bank_churn', 'newsgroup_text_topics', 'lending', 'bikeshare_forecasting', 'iris_classification', 'wine_quality']
For Fiddler to properly run and explain your model, you need to provide some information about model inputs and outputs that is not captured by the sklearn object itself. Luckily, the dataset we created above has a DatasetInfo component that can help us infer the ModelInfo of models trained on that dataset.
model_info = fdl.ModelInfo.from_dataset_info(
dataset_info=fiddler_api.get_dataset_info('bikeshare'),
target=target,
features=feature_columns,
display_name='Bikeshare kNN',
    description='A kNN model trained to predict the `cnt` feature of the bikeshare dataset.'
)
model_info
ModelInfo:
display_name: Bikeshare kNN
description: A kNN model trained to predict the `cnt` feature of the bikeshare dataset.
input_type: ModelInputType.TABULAR
model_task: ModelTask.REGRESSION
inputs and outputs:

| | column | column_type | dtype | count(possible_values) |
|---|---|---|---|---|
| 0 | season | input | CATEGORY | 4 |
| 1 | mnth | input | INTEGER | - |
| 2 | hr | input | INTEGER | - |
| 3 | holiday | input | BOOLEAN | - |
| 4 | weekday | input | INTEGER | - |
| 5 | workingday | input | BOOLEAN | - |
| 6 | weathersit | input | CATEGORY | 7 |
| 7 | temp | input | FLOAT | - |
| 8 | atemp | input | FLOAT | - |
| 9 | hum | input | FLOAT | - |
| 10 | windspeed | input | FLOAT | - |
| 11 | predicted_cnt | output | FLOAT | - |

misc: {}
fiddler_api.upload_model_sklearn(
model=model,
info=model_info,
project_id='bikeshare_forecasting',
model_id='knn_model',
associated_dataset_ids=['bikeshare'])
You are uploading a scikit-learn model using the Fiddler API. If this model uses any custom (non-sklearn) code, it will not run properly on the Fiddler Engine. The Fiddler engine may not be able to detect this in advance.
{'model': {'display name': 'Bikeshare kNN',
  'input-type': 'structured',
  'model-task': 'regression',
  'inputs': [{'column-name': 'season', 'data-type': 'category', 'possible-values': ['1', '2', '3', '4']},
   {'column-name': 'mnth', 'data-type': 'int'},
   {'column-name': 'hr', 'data-type': 'int'},
   {'column-name': 'holiday', 'data-type': 'bool'},
   {'column-name': 'weekday', 'data-type': 'int'},
   {'column-name': 'workingday', 'data-type': 'bool'},
   {'column-name': 'weathersit', 'data-type': 'category', 'possible-values': ['1', '2', '3', '4', '123', '456', '789']},
   {'column-name': 'temp', 'data-type': 'float'},
   {'column-name': 'atemp', 'data-type': 'float'},
   {'column-name': 'hum', 'data-type': 'float'},
   {'column-name': 'windspeed', 'data-type': 'float'}],
  'outputs': [{'column-name': 'predicted_cnt', 'data-type': 'float'}],
  'description': 'A kNN model trained to predict the `cnt` feature of the bikeshare dataset.',
  'datasets': ['bikeshare']}}
We can now look at explanations in the web UI (onebox, or the cloud UI if you used cloud instead of onebox)!
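As a hypothetical sketch of a next step (the exact method name and signature depend on your fiddler client version, so verify against `help(fiddler_api)` or the Fiddler docs before running), requesting an explanation for a single row might look like this:

# hypothetical sketch -- verify the explanation API against your client version
explanation = fiddler_api.run_explanation(
    project_id='bikeshare_forecasting',
    model_id='knn_model',
    dataset_id='bikeshare',
    df=x_test.head(1))  # explain the model's prediction on one test row
print(explanation)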