import pathlib
import category_encoders
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing
%load_ext autoreload
%autoreload 2
Welcome to the Fiddler notebook experience! This notebook will demonstrate how to effectively get started with the Fiddler platform by uploading your models and data. The notebook is organized into two sections:

1. Importing data and training a model using only Pandas and Scikit-Learn
2. Uploading the dataset and model to Fiddler

Section 1 does not use any Fiddler code, so if you are familiar with Pandas and Scikit-Learn, you should feel comfortable skimming through it and jumping into section 2.
Being an effective data scientist involves using the right tool for the job. When it comes to importing, cleaning, and exploring your data in Jupyter, we don't want to interrupt your normal workflow, so we integrate our tools with the popular Pandas DataFrame object. Thus, as long as your data can be dumped into a DataFrame object, there is nothing else you need to do to get it ready to upload to Fiddler.
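For instance, whether your data comes from a CSV file, a database query, or in-memory records, it only needs to end up in a DataFrame. The toy sketch below (hypothetical data, not part of this notebook's dataset) shows the in-memory route; the rest of the notebook uses the CSV route.

# toy illustration: any route into a DataFrame works,
# e.g. constructing one from in-memory records
toy_df = pd.DataFrame.from_records(
    [{'temp': 0.56, 'cnt': 7}, {'temp': 0.44, 'cnt': 5}])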
# let's import some data from disk
data_dir = pathlib.Path('example_datasets')
bikeshare_csv_file = data_dir / 'bikeshare_hour.csv'
# here we pre-configure the datatypes for our dataframe
# so it doesn't require any datatype modification after import
bikeshare_dtypes = dict(season='category', holiday='bool',
workingday='bool', weathersit='category')
bikeshare_datetime_columns = ['dteday']
bikeshare_index_column = 'instant'
df = pd.read_csv(bikeshare_csv_file,
dtype=bikeshare_dtypes,
parse_dates=bikeshare_datetime_columns,
index_col=bikeshare_index_column)
# train-test split by time: 2011 as train, 2012 as test
is_2011 = df['yr'] == 0
df_2011 = df[is_2011]
df_2012 = df[~is_2011]
# peek at the data
display(df.sample(3, random_state=0).T)
# print info about train-test split
print(f'Train set (bikeshare rentals in 2011) has {df_2011.shape[0]} rows,'
f' test set (bikeshare rentals in 2012) has {df_2012.shape[0]} rows')
| instant | 3440 | 6543 | 15471 |
|---|---|---|---|
| dteday | 2011-05-28 00:00:00 | 2011-10-05 00:00:00 | 2012-10-11 00:00:00 |
| season | 2 | 4 | 4 |
| yr | 0 | 0 | 1 |
| mnth | 5 | 10 | 10 |
| hr | 5 | 4 | 19 |
| holiday | False | False | False |
| weekday | 6 | 3 | 4 |
| workingday | False | True | True |
| weathersit | 1 | 1 | 1 |
| temp | 0.56 | 0.44 | 0.44 |
| atemp | 0.5303 | 0.4394 | 0.4394 |
| hum | 0.88 | 0.88 | 0.51 |
| windspeed | 0.2239 | 0 | 0.1343 |
| casual | 4 | 1 | 81 |
| registered | 3 | 4 | 662 |
| cnt | 7 | 5 | 743 |
Train set (bikeshare rentals in 2011) has 8645 rows, test set (bikeshare rentals in 2012) has 8734 rows
Just like with data work, we believe in the right tools for the job. We currently integrate tightly with models supporting the sklearn API, including non-sklearn packages that support the sklearn API, like xgboost and LightGBM. Since encoding categorical variables can be a pain in sklearn, we also support the category_encoders package. Please note that if you introduce any custom classes or transformation functions into your modeling, it may become difficult to get your models running in Fiddler. We therefore recommend using the Transformer objects provided by sklearn (and the category_encoders package) and combining preprocessing and inference steps using the sklearn Pipeline API.
# specify which columns are features and which are not
target = 'cnt'
not_used_as_features = ['dteday', 'yr', 'casual', 'registered']
non_feature_columns = [target] + not_used_as_features
feature_columns = list(set(df_2011.columns) - set(non_feature_columns))
# split our data into features and targets
x_train = df_2011.drop(columns=non_feature_columns)
x_test = df_2012.drop(columns=non_feature_columns)
y_train = df_2011[target]
y_test = df_2012[target]
# modeling approach:
# 1) onehot encode categorical variables
# 2) standard scale all variables
# 3) fit a k-Nearest-Neighbors model with k=10 and l1 distance as the distance metric
onehot = category_encoders.OneHotEncoder(cols=df.select_dtypes('category').columns.tolist())
standard_scaler = sklearn.preprocessing.StandardScaler()
knn = sklearn.neighbors.KNeighborsRegressor(
n_neighbors=10,
weights='distance', metric='l1',
n_jobs=-1)
model = sklearn.pipeline.make_pipeline(onehot, standard_scaler, knn)
# fit the model
model.fit(x_train, y_train)
# score the model
train_r2 = sklearn.metrics.r2_score(y_train, model.predict(x_train))
test_r2 = sklearn.metrics.r2_score(y_test, model.predict(x_test))
print(f'r2 scores: {train_r2:.2f} Train | {test_r2:.2f} Test')
r2 scores: 1.00 Train | 0.38 Test

(The perfect train score is an artifact of distance-weighted kNN: each training point is its own nearest neighbor at distance zero, so training predictions reproduce the training labels exactly. The test score is the meaningful measure of generalization here.)
Up until now, we haven't done anything Fiddler-specific. Now we'll go ahead and change that. Let's begin by importing the Fiddler package.
import fiddler as fdl
Before you can start working with a Fiddler-integrated Jupyter environment, you should set up access to a running instance of Fiddler.

In onebox, this means running the start.sh script to launch onebox locally. For the cloud version of our product, this means looking up your authentication token in the Fiddler settings dashboard.

In order to get your data and models into the Fiddler Engine, you'll need to connect using the API. The FiddlerApi object handles most of the nitty-gritty for you, so all you have to do is specify some details about the Fiddler system you're connecting to.
import getpass
# NOTE: use 'http://host.docker.internal:4100' as our URL only if Jupyter is running in a docker VM on the same macOS machine as onebox
url = input('Enter the API url for your running instance of Fiddler (usually "https://api.fiddler.ai" or "http://localhost:4100"):')
token = getpass.getpass('Enter Fiddler access token (see <Fiddler URL>/settings/credentials to find/create/change this token):')
fiddler_api = fdl.FiddlerApi(url=url, org_id='onebox', auth_token=token)
Now that we have our dataset in working order, let's upload it to the Fiddler platform. As mentioned above, our Dataset class directly integrates with Pandas to make this a snap.
fiddler_api.list_datasets()
['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality']
# to upload, we just pass our train/test DataFrames
# to the FiddlerApi along with a dataset_id
upload_result = fiddler_api.upload_dataset(
dataset={'train': df_2011, 'test': df_2012},
dataset_id='bikeshare')
upload_result
Heads up! We are inferring the details of your dataset from the dataframe(s) provided. Please take a second to check our work.

If the following DatasetInfo is an incorrect representation of your data, you can construct a DatasetInfo with the DatasetInfo.from_dataframe() method and modify that object to reflect the correct details of your dataset. After constructing a corrected DatasetInfo, please re-upload your dataset with that DatasetInfo object explicitly passed via the `info` parameter of FiddlerApi.upload_dataset(). You may need to delete the initially uploaded version via FiddlerApi.delete_dataset('bikeshare').

Inferred DatasetInfo to check:

DatasetInfo:
display_name: bikeshare
files: []
columns:

| | column | dtype | count(possible_values) |
|---|---|---|---|
| 0 | instant | INTEGER | - |
| 1 | dteday | STRING | - |
| 2 | season | CATEGORY | 4 |
| 3 | yr | INTEGER | - |
| 4 | mnth | INTEGER | - |
| 5 | hr | INTEGER | - |
| 6 | holiday | BOOLEAN | - |
| 7 | weekday | INTEGER | - |
| 8 | workingday | BOOLEAN | - |
| 9 | weathersit | CATEGORY | 4 |
| 10 | temp | FLOAT | - |
| 11 | atemp | FLOAT | - |
| 12 | hum | FLOAT | - |
| 13 | windspeed | FLOAT | - |
| 14 | casual | INTEGER | - |
| 15 | registered | INTEGER | - |
| 16 | cnt | INTEGER | - |
{'row_count': 17379, 'col_count': 17, 'log': ['Importing dataset bikeshare', 'Found old data. Deleting it', 'Creating table for bikeshare', 'Importing data file: test.csv', 'Importing data file: train.csv']}
fiddler_api.delete_dataset('bikeshare')
'Dataset deleted bikeshare'
# we see that the 'bikeshare' dataset no longer shows up in the list of all datasets
fiddler_api.list_datasets()
['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality']
# Upload example with custom DatasetInfo
bikeshare_info = fdl.DatasetInfo.from_dataframe(df_2011, display_name='Bikeshare Dataset')
bikeshare_info['weathersit'].possible_values.extend([123, 456, 789])
print('We customized the DatasetInfo for this dataset '
'with a custom display_name and more `weathersit` possible-values.')
print(bikeshare_info)
# upload
upload_result = fiddler_api.upload_dataset(
dataset={'train': df_2011, 'test': df_2012},
dataset_id='bikeshare',
info=bikeshare_info
)
upload_result
We customized the DatasetInfo for this dataset with a custom display_name and more `weathersit` possible-values.

DatasetInfo:
display_name: Bikeshare Dataset
files: []
columns:

| | column | dtype | count(possible_values) |
|---|---|---|---|
| 0 | instant | INTEGER | - |
| 1 | dteday | STRING | - |
| 2 | season | CATEGORY | 4 |
| 3 | yr | INTEGER | - |
| 4 | mnth | INTEGER | - |
| 5 | hr | INTEGER | - |
| 6 | holiday | BOOLEAN | - |
| 7 | weekday | INTEGER | - |
| 8 | workingday | BOOLEAN | - |
| 9 | weathersit | CATEGORY | 7 |
| 10 | temp | FLOAT | - |
| 11 | atemp | FLOAT | - |
| 12 | hum | FLOAT | - |
| 13 | windspeed | FLOAT | - |
| 14 | casual | INTEGER | - |
| 15 | registered | INTEGER | - |
| 16 | cnt | INTEGER | - |
{'row_count': 17379, 'col_count': 17, 'log': ['Importing dataset bikeshare', 'Creating table for bikeshare', 'Importing data file: test.csv', 'Importing data file: train.csv']}
# we see that the 'bikeshare' dataset now shows up in the list of all datasets
fiddler_api.list_datasets()
['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality', 'bikeshare']
We can also verify everything worked by looking at the web UI (onebox, or the cloud UI if you used cloud instead of onebox).
We currently support the upload of scikit-learn models directly through the fiddler package. While custom code is tricky to deploy to Fiddler, we support a number of additional packages beyond sklearn that enable the deployment of powerful black-box models. These include:

- xgboost (as long as the scikit-learn API is used)
- lightgbm (as long as the scikit-learn API is used)
- category_encoders

For best explainability results, we recommend organizing your modeling pipeline using the scikit-learn Pipeline API so that your feature transformations are integrated with your model. This is because pre-transforming your data can have a negative effect on explanation interpretability, as shown in the sketch below.
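To make the recommendation concrete, here is a minimal sketch (an illustration, not part of this notebook's workflow) that swaps the kNN regressor above for an xgboost model while keeping the categorical encoding inside the pipeline. It assumes the xgboost package is installed and uses only its scikit-learn-compatible XGBRegressor interface.

# a minimal sketch, assuming `xgboost` is installed: the same pipeline
# organization as before, with XGBRegressor in place of the kNN regressor
import xgboost
xgb_model = sklearn.pipeline.make_pipeline(
    # keeping the encoder inside the pipeline means feature transformations
    # travel with the model, which helps explanation interpretability
    category_encoders.OneHotEncoder(cols=['season', 'weathersit']),
    xgboost.XGBRegressor(n_estimators=100))
xgb_model.fit(x_train, y_train)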
# To organize our models, let's first create a project on Fiddler.
fiddler_api.create_project('bikeshare_forecasting')
{'project_name': 'bikeshare_forecasting'}
# we see that the 'bikeshare_forecasting' project now shows up in the list of all projects
fiddler_api.list_projects()
['imdb_rnn', 'bank_churn', 'newsgroup_text_topics', 'lending', 'bikeshare_forecasting', 'iris_classification', 'wine_quality']
For Fiddler to properly run and explain your model, you need to provide some information about model inputs and outputs that is not captured by the sklearn object itself. Luckily, the dataset we created above has a DatasetInfo component that can help us infer the ModelInfo of models trained on that dataset.
model_info = fdl.ModelInfo.from_dataset_info(
dataset_info=fiddler_api.get_dataset_info('bikeshare'),
target=target,
features=feature_columns,
display_name='Bikeshare kNN',
    description='A kNN model trained to predict the `cnt` feature of the bikeshare dataset.'
)
model_info
ModelInfo:
display_name: Bikeshare kNN
description: A kNN model trained to predict the `cnt` feature of the bikeshare dataset.
input_type: ModelInputType.TABULAR
model_task: ModelTask.REGRESSION
inputs and outputs:

| | column | column_type | dtype | count(possible_values) |
|---|---|---|---|---|
| 0 | season | input | CATEGORY | 4 |
| 1 | mnth | input | INTEGER | - |
| 2 | hr | input | INTEGER | - |
| 3 | holiday | input | BOOLEAN | - |
| 4 | weekday | input | INTEGER | - |
| 5 | workingday | input | BOOLEAN | - |
| 6 | weathersit | input | CATEGORY | 7 |
| 7 | temp | input | FLOAT | - |
| 8 | atemp | input | FLOAT | - |
| 9 | hum | input | FLOAT | - |
| 10 | windspeed | input | FLOAT | - |
| 11 | predicted_cnt | output | FLOAT | - |

misc: {}
fiddler_api.upload_model_sklearn(
model=model,
info=model_info,
project_id='bikeshare_forecasting',
model_id='knn_model',
associated_dataset_ids=['bikeshare'])
You are uploading a scikit-learn model using the Fiddler API. If this model uses any custom (non-sklearn) code, it will not run properly on the Fiddler Engine. The Fiddler engine may not be able to detect this in advance.
{'model': {'display name': 'Bikeshare kNN',
  'input-type': 'structured',
  'model-task': 'regression',
  'inputs': [{'column-name': 'season', 'data-type': 'category', 'possible-values': ['1', '2', '3', '4']},
   {'column-name': 'mnth', 'data-type': 'int'},
   {'column-name': 'hr', 'data-type': 'int'},
   {'column-name': 'holiday', 'data-type': 'bool'},
   {'column-name': 'weekday', 'data-type': 'int'},
   {'column-name': 'workingday', 'data-type': 'bool'},
   {'column-name': 'weathersit', 'data-type': 'category', 'possible-values': ['1', '2', '3', '4', '123', '456', '789']},
   {'column-name': 'temp', 'data-type': 'float'},
   {'column-name': 'atemp', 'data-type': 'float'},
   {'column-name': 'hum', 'data-type': 'float'},
   {'column-name': 'windspeed', 'data-type': 'float'}],
  'outputs': [{'column-name': 'predicted_cnt', 'data-type': 'float'}],
  'description': 'A kNN model trained to predict the `cnt` feature of the bikeshare dataset.',
  'datasets': ['bikeshare']}}
We can now look at explanations in the web UI (onebox, or the cloud UI if you used cloud instead of onebox)!
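As a hypothetical sketch of a next step (the exact method name and signature depend on your fiddler client version, so verify against `help(fiddler_api)` or the Fiddler docs before running), requesting an explanation for a single row might look like this:

# hypothetical sketch -- verify the explanation API against your client version
explanation = fiddler_api.run_explanation(
    project_id='bikeshare_forecasting',
    model_id='knn_model',
    dataset_id='bikeshare',
    df=x_test.head(1))  # explain the model's prediction on one test row
print(explanation)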