* This notebook is still a work in progress; not everything works yet. Expect an update from the BentoML team soon.

BentoML Example: H2O Loan Default Prediction

BentoML is an open source platform for machine learning model serving and deployment.

This notebook demonstrates how to use BentoML to turn an H2O model into a Docker image containing a REST API server serving the model, and how to distribute the model as a command-line tool or a pip-installable PyPI package.

This notebook is based on: https://github.com/kguruswamy/H2O3-Driverless-AI-Code-Examples/blob/master/Lending%20Club%20Data%20-%20H2O3%20Auto%20ML%20-%20Python%20Tutorial.ipynb

Setup

In [ ]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
In [ ]:
!pip install bentoml
!pip install h2o xlrd scikit-learn pandas numpy
In [ ]:
import h2o
import bentoml
import numpy as np
import pandas as pd

import requests
import math
from sklearn import model_selection

h2o.init()

Prepare Dataset

In [ ]:
%%bash

# Download training dataset
if [ ! -f ./LoanStats3c.csv.zip ]; then
    curl -O https://resources.lendingclub.com/LoanStats3c.csv.zip
fi
In [ ]:
pd.set_option('display.expand_frame_repr', True)
pd.set_option('display.max_colwidth', 9999)
pd.set_option('display.max_columns', 9999)
pd.set_option('display.max_rows', 9999)

data_dictionary = pd.read_excel("https://resources.lendingclub.com/LCDataDictionary.xlsx")
data_dictionary
In [ ]:
# The very first row contains non-header data, hence skip it. Read into a data frame.
# Parse the Mon-Year strings in the issue_d column (e.g. "Dec-2014") into proper dates.

from datetime import datetime

def parse_issue_date(x):
    return datetime.strptime(x, "%b-%Y")

lc = pd.read_csv("LoanStats3c.csv.zip", skiprows=1, verbose=False,
                 parse_dates=['issue_d'], date_parser=parse_issue_date, low_memory=False)
lc.shape
In [ ]:
lc.loan_status.unique()
In [ ]:
# Keep just "Fully Paid" and "Charged Off" to make it a simple 'Yes' or 'No' - binary classification problem

lc = lc[lc.loan_status.isin(['Fully Paid','Charged Off'])]
lc.loan_status.unique()
In [ ]:
# Drop the target-leakage columns from the data frame.
# Target-leakage columns are generally created in hindsight by analysts/data engineers/operations after an outcome
# was detected in historical data. If we don't remove them now, they will climb to the top of the feature list after
# a model is built and falsely inflate the accuracy to 95% :)
#
# In a production or real-life scoring environment, don't expect these columns to be available at scoring time,
# that is, when someone applies for a loan. So we don't train on those columns ...

ignored_cols = [ 
                'out_prncp',                 # Remaining outstanding principal for total amount funded
                'out_prncp_inv',             # Remaining outstanding principal for portion of total amount 
                                             # funded by investors
                'total_pymnt',               # Payments received to date for total amount funded
                'total_pymnt_inv',           # Payments received to date for portion of total amount 
                                             # funded by investors
                'total_rec_prncp',           # Principal received to date 
                'total_rec_int',             # Interest received to date
                'total_rec_late_fee',        # Late fees received to date
                'recoveries',                # post charge off gross recovery
                'collection_recovery_fee',   # post charge off collection fee
                'last_pymnt_d',              # Last month payment was received
                'last_pymnt_amnt',           # Last total payment amount received
                'next_pymnt_d',              # Next scheduled payment date
                'last_credit_pull_d',        # The most recent month LC pulled credit for this loan
                'settlement_term',           # The number of months that the borrower will be on the settlement plan
                'settlement_date',           # The date that the borrower agrees to the settlement plan
                'settlement_amount',         # The loan amount that the borrower has agreed to settle for
                'settlement_percentage',     # The settlement amount as a percentage of the payoff balance amount on the loan
                'settlement_status',         # The status of the borrower’s settlement plan. Possible values are: 
                                             # COMPLETE, ACTIVE, BROKEN, CANCELLED, DENIED, DRAFT
                'debt_settlement_flag',      # Flags whether or not the borrower, who has charged-off, is working with 
                                             # a debt-settlement company.
                'debt_settlement_flag_date'  # The most recent date that the Debt_Settlement_Flag has been set
                ]

lc = lc.drop(columns=ignored_cols)
In [ ]:
# After dropping Target Leakage columns, we have 223K rows and 125 columns
lc.shape
In [ ]:
import os

train_path = os.getcwd() + "/train_lc.csv.zip"
test_path = os.getcwd() + "/test_lc.csv.zip"

# Stratified 80/20 train/test split, written out as zip-compressed CSVs
train_lc, test_lc = model_selection.train_test_split(lc, test_size=0.2, random_state=10, stratify=lc['loan_status'])
train_lc.to_csv(train_path, index=False, compression="zip")
test_lc.to_csv(test_path, index=False, compression="zip")

# Load the two CSV files created above into H2O frames
train = h2o.import_file(train_path)
test = h2o.import_file(test_path)

Model Training

In [ ]:
from h2o.automl import H2OAutoML

# Identify predictors and response
x = train.columns
y = "loan_status"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML 
aml = H2OAutoML(project_name='LC', 
                max_models=10,         # 10 base models *FOR DEMO PURPOSE
                balance_classes=True,  # Doing smart Class imbalance sampling
                max_runtime_secs=60,   # 1 Minute *FOR DEMO PURPOSE
                seed=1234)             # Set a seed for reproducibility
aml.train(x=x, y=y, training_frame=train)

View the AutoML Leaderboard

In [ ]:
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)
In [ ]:
test_pc = aml.predict(test)
test_pc
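
Before wrapping the model for serving, it is worth a quick sanity check of the AutoML leader on the held-out test frame. The cell below is a small sketch (not part of the original notebook) using H2O's standard model_performance() API; AUC and LogLoss apply here because loan_status is a binomial target.

In [ ]:
# Sketch: evaluate the AutoML leader model on the held-out test frame
perf = aml.leader.model_performance(test)
print("AUC:    ", perf.auc())
print("LogLoss:", perf.logloss())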

Define BentoService for model serving

In [ ]:
%%writefile loan_prediction.py

import h2o

from bentoml import api, env, artifacts, BentoService
from bentoml.artifact import H2oModelArtifact
from bentoml.handlers import DataframeHandler

@env(pip_dependencies = ['h2o', 'pandas'])
@artifacts([H2oModelArtifact('model')])
class LoanPrediction(BentoService):
    
    @api(DataframeHandler)
    def predict(self, df):
        h2o_frame = h2o.H2OFrame(df, na_strings=['NaN'])
        predictions = self.artifacts.model.predict(h2o_frame)
        return predictions.as_data_frame()

Save BentoService to file archive

In [ ]:
# 1) import the custom BentoService defined above
from loan_prediction import LoanPrediction

# 2) `pack` it with required artifacts
bentoml_svc = LoanPrediction.pack(model=aml.leader)

# 3) save your BentoService
saved_path = bentoml_svc.save()
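
The saved archive at saved_path is self-contained; archives from this era of BentoML also include a generated Dockerfile at the archive root. Assuming that layout and a local Docker daemon, a serving image can typically be built straight from the archive, as in the sketch below (the "loan-prediction" tag is just an example).

In [ ]:
# Sketch (assumes the saved archive contains a Dockerfile, as BentoML 0.x archives did,
# and that Docker is installed locally); the image tag is arbitrary.
!docker build -t loan-prediction {saved_path}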

Load BentoService from archive

In [ ]:
loaded_bentoml_svc = bentoml.load(saved_path)

# The predict API expects a pandas DataFrame, so convert the H2O test frame first
print(loaded_bentoml_svc.predict(test.as_data_frame()))

Model Serving via REST API

In your terminal, run the following command to start the REST API server (note that it starts a long-running process and will block the notebook cell until interrupted):

In [ ]:
!bentoml serve {saved_path}

Open http://127.0.0.1:5000 in your browser to see more information about the REST API server.
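
Once the server is up, predictions can be requested over HTTP. The sketch below assumes the default port 5000 and that the DataframeHandler endpoint is exposed at /predict (named after the API method); it sends one row of the held-out pandas test set as JSON.

In [ ]:
import requests

# Sketch: POST a single test row to the running BentoML REST API server.
# Assumes the server from the previous cell is listening on 127.0.0.1:5000
# and exposes the predict API at /predict.
sample = test_lc.head(1).to_json(orient='records')
response = requests.post(
    "http://127.0.0.1:5000/predict",
    data=sample,
    headers={"Content-Type": "application/json"},
)
print(response.status_code)
print(response.text)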