import pathlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import fiddler as fdl
%matplotlib inline
# use the nicer plotting styles from seaborn
sns.set()
%load_ext autoreload
%autoreload 2
This notebook assumes you already have the models and data you're interested in using uploaded to Fiddler. Please refer to the previous notebook in this series for more information on uploading to Fiddler. You will also need to have run notebook 1 in order to upload the bikesharing example used in this notebook.
In this notebook we run through a number of other Fiddler functionalities that have been integrated into the Python package. Unlike the previous notebook, there is not as much of a sequential flow to the steps demonstrated here.
Before you can start working with a Fiddler-integrated Jupyter environment, you need access to a running instance of Fiddler.
For onebox, this means running the start.sh script to launch onebox locally.
For the cloud version of our product, this means looking up your authentication token in the Fiddler settings dashboard.
In order to get your data and models into the Fiddler Engine, you'll need to connect using the API. The FiddlerApi
object handles most of the nitty-gritty for you, so all you have to do is specify a few details about the Fiddler system you're connecting to.
import getpass
# NOTE: use 'http://host.docker.internal:4100' as our URL only if Jupyter is running in a docker VM on the same macOS machine as onebox
url = input('Enter the API url for your running instance of Fiddler (usually "https://api.fiddler.ai" or "http://localhost:4100"):')
token = getpass.getpass('Enter Fiddler access token (see <Fiddler URL>/settings/credentials to find/create/change this token):')
fiddler_api = fdl.FiddlerApi(url=url, org_id='onebox', auth_token=token)
# let's see which datasets we have on Fiddler
fiddler_api.list_datasets()
['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality', 'bikeshare']
# the info for any dataset can quickly and easily be fetched with the `dataset_info` method
bikeshare_dataset_info = fiddler_api.get_dataset_info('bikeshare')
bikeshare_dataset_info
DatasetInfo:
  display_name: Bikeshare Dataset
  files: ['train.csv', 'test.csv']
  columns:

| | column | dtype | count(possible_values) |
|---|---|---|---|
| 0 | instant | INTEGER | - |
| 1 | dteday | STRING | - |
| 2 | season | CATEGORY | 4 |
| 3 | yr | INTEGER | - |
| 4 | mnth | INTEGER | - |
| 5 | hr | INTEGER | - |
| 6 | holiday | BOOLEAN | - |
| 7 | weekday | INTEGER | - |
| 8 | workingday | BOOLEAN | - |
| 9 | weathersit | CATEGORY | 7 |
| 10 | temp | FLOAT | - |
| 11 | atemp | FLOAT | - |
| 12 | hum | FLOAT | - |
| 13 | windspeed | FLOAT | - |
| 14 | casual | INTEGER | - |
| 15 | registered | INTEGER | - |
| 16 | cnt | INTEGER | - |
# we can also pull data from the dataset directly into Pandas
bikeshare_dataset = fiddler_api.get_dataset('bikeshare', max_rows=999_999)
print(f'The bikeshare_dataset object is a {type(bikeshare_dataset)} with keys ({list(bikeshare_dataset.keys())})')
df_train = bikeshare_dataset['train']
df_test = bikeshare_dataset['test']
# demo the data
df_train.sample(3, random_state=0).T
The bikeshare_dataset object is a <class 'dict'> with keys (['train', 'test'])
| | 8326 | 6451 | 6429 |
|---|---|---|---|
| instant | 8328 | 6453 | 6431 |
| dteday | 2011-12-18 | 2011-10-01 | 2011-09-30 |
| season | 4 | 4 | 4 |
| yr | 0 | 0 | 0 |
| mnth | 12 | 10 | 9 |
| hr | 15 | 10 | 12 |
| holiday | False | False | False |
| weekday | 0 | 6 | 5 |
| workingday | False | False | True |
| weathersit | 1 | 3 | 2 |
| temp | 0.32 | 0.4 | 0.64 |
| atemp | 0.303 | 0.4091 | 0.6212 |
| hum | 0.45 | 0.76 | 0.57 |
| windspeed | 0.2836 | 0.3582 | 0.194 |
| casual | 23 | 21 | 59 |
| registered | 184 | 100 | 195 |
| cnt | 207 | 121 | 254 |
# for example, let's plot the target (cnt) against the temperature feature (temp) with a regression fit
sns.regplot(x=df_train['temp'], y=df_train['cnt'], marker='.', scatter_kws=dict(alpha=0.1))
<matplotlib.axes._subplots.AxesSubplot at 0x7f965b8712e8>
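The upward trend in the plot can also be quantified directly with pandas. A minimal sketch; the inline values here are a hypothetical stand-in for the full df_train fetched above:

```python
import pandas as pd

# hypothetical stand-in for the df_train DataFrame pulled from Fiddler above
df_train = pd.DataFrame({
    'temp': [0.32, 0.40, 0.64, 0.20, 0.80],
    'cnt':  [207, 121, 254, 40, 310],
})

# Pearson correlation between temperature and hourly ride count
corr = df_train['temp'].corr(df_train['cnt'])
print(f'corr(temp, cnt) = {corr:.3f}')
```

A positive coefficient confirms what the regression line suggests: ridership rises with temperature.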
If you have data but haven't built a model yet, you can take advantage of the model-builder feature to whip up a model instantly so you can dive right into running explanations.
bikeshare_dataset_info.get_column_names()
['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
# fiddler_api.delete_model('bikeshare_forecasting', 'generated_bikeshare_model')
# NOTE: to avoid training on the whole dataset, we should
# pass a `source` parameter.
features = list(set(bikeshare_dataset_info.get_column_names()) - {'casual', 'registered', 'cnt', 'dteday'})
fiddler_api.create_model(project_id='bikeshare_forecasting',
dataset_id='bikeshare',
target='cnt',
features=features,
model_id='generated_bikeshare_model',
train_splits=['train'])
{'created_files': {'package.py': 'Wrapper code to run the model on the Fiddler engine', 'model.yaml': 'Model metadata and configuration', 'data_processor.py': 'Data cleaning and feature engineering code', '__init__.py': 'Empty file. Makes this model directory a python package so the Fiddler engine can run it properly.', 'model.pkl': 'Serialized model artifact.', 'processor.pkl': 'Serialized model artifact.', 'training_features.pkl': 'Serialized training data', 'train.py': 'Model training script', 'training_targets.pkl': 'Serialized training data'}, 'project_name': 'bikeshare_forecasting', 'model_name': 'generated_bikeshare_model'}
# the new model shows up when we list the models in the bikeshare_forecasting project
fiddler_api.list_models(project_id='bikeshare_forecasting')
['knn_model', 'generated_bikeshare_model']
We also support basic integration of our explanation and prediction functionality in Jupyter. The FiddlerApi
object is your friend here.
# running some predictions on the generated model
fiddler_api.run_model(project_id='bikeshare_forecasting', model_id='generated_bikeshare_model', df=df_test.head(10))
| | predicted_cnt |
|---|---|
| 0 | 27.546072 |
| 1 | 18.193697 |
| 2 | -3.626518 |
| 3 | -5.807865 |
| 4 | 0.780283 |
| 5 | 1.055152 |
| 6 | -27.742528 |
| 7 | 40.625284 |
| 8 | 61.031091 |
| 9 | 96.152091 |
# compare against predictions on our kNN model
fiddler_api.run_model(project_id='bikeshare_forecasting', model_id='knn_model', df=df_test.head(10))
| | predicted_cnt |
|---|---|
| 0 | 29.084173 |
| 1 | 21.888947 |
| 2 | 17.315574 |
| 3 | 25.716166 |
| 4 | 26.971802 |
| 5 | 25.599356 |
| 6 | 28.150124 |
| 7 | 28.376281 |
| 8 | 34.132442 |
| 9 | 68.728265 |
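Since both run_model calls return DataFrames with a predicted_cnt column, lining the two up makes their disagreement easy to spot. A minimal sketch, using the first three predictions printed above as hypothetical stand-ins for the full results:

```python
import pandas as pd

# stand-ins for the DataFrames returned by fiddler_api.run_model above
generated = pd.DataFrame({'predicted_cnt': [27.546072, 18.193697, -3.626518]})
knn = pd.DataFrame({'predicted_cnt': [29.084173, 21.888947, 17.315574]})

# side-by-side comparison with the absolute difference per row
comparison = pd.DataFrame({
    'generated': generated['predicted_cnt'],
    'knn': knn['predicted_cnt'],
})
comparison['abs_diff'] = (comparison['generated'] - comparison['knn']).abs()
print(comparison)
```

Note that the generated model can produce negative counts on some rows, where the kNN model cannot; the abs_diff column surfaces exactly those rows.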
# Run explanations on both models
selected_point = df_test.head(1)
ex_generated = fiddler_api.run_explanation(
project_id='bikeshare_forecasting',
model_id='generated_bikeshare_model',
df=selected_point,
dataset_id='bikeshare')
ex_knn = fiddler_api.run_explanation(
project_id='bikeshare_forecasting',
model_id='knn_model',
df=selected_point,
dataset_id='bikeshare')
# Create a plot comparing attributions
fig = plt.figure(figsize=(12, 6))
comparison_table = pd.DataFrame({
'Generated Model': pd.Series(ex_generated.attributions, index=ex_generated.inputs),
'kNN Model': pd.Series(ex_knn.attributions, index=ex_knn.inputs)
})
comparison_table = comparison_table.loc[comparison_table['kNN Model'].abs().sort_values(ascending=False).index, :]
melted_table = (comparison_table
.reset_index()
.rename(columns={'index': 'Feature'})
.melt(id_vars='Feature',
var_name='Model',
value_name='Attribution'))
sns.barplot(x='Attribution', y='Feature', hue='Model', data=melted_table)
plt.title('Top SHAP attributions on first row of bikeshare: generated vs. kNN model')
Text(0.5, 1.0, 'Top SHAP attributions on first row of bikeshare: generated vs. kNN model')
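Beyond plotting, the same attribution data can answer a pointed question: on which feature do the two explanations disagree most? A sketch below, with hypothetical attribution values standing in for the `.inputs` and `.attributions` fields used in the plotting cell above:

```python
import pandas as pd

# hypothetical stand-ins for ex_generated / ex_knn attribution results
inputs = ['temp', 'hr', 'workingday']
gen_attr = pd.Series([12.0, -3.5, 1.2], index=inputs)
knn_attr = pd.Series([9.5, 4.0, 0.8], index=inputs)

# feature with the largest absolute disagreement between the two explanations
disagreement = (gen_attr - knn_attr).abs()
print(disagreement.idxmax(), disagreement.max())
```

Large disagreements like this are often the most useful starting point when deciding which model's reasoning to trust.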
As we have seen in this notebook, once data and models have been deployed to Fiddler, it becomes very easy to share the data, automatically train a model on Fiddler, and run explanations, all without leaving your Jupyter environment.