Getting Started with BentoML

BentoML is an open-source framework for high-performance machine learning model serving. It makes it easy to build production API endpoints for trained ML models and supports all major machine learning frameworks, including TensorFlow, Keras, PyTorch, XGBoost, scikit-learn, and fastai.

BentoML comes with a high-performance API model server with adaptive micro-batching support, bringing the advantage of batch processing to online serving workloads. It also provides batch serving, model management and model deployment functionality, which gives ML teams an end-to-end model serving solution with baked-in DevOps best practices.
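To make the idea of micro-batching concrete, here is a minimal sketch in plain Python. It is not BentoML's implementation and leaves out the adaptive part (tuning batch size and wait time at runtime); micro_batch, serve_batched and fake_predict_batch are hypothetical names used only for illustration. It shows how individual requests can be grouped so that the model is called once per batch instead of once per request:

```python
from typing import Callable, List, Sequence


def micro_batch(requests: Sequence[list], max_batch_size: int) -> List[List[list]]:
    """Group individual requests into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(requests[i:i + max_batch_size])
        for i in range(0, len(requests), max_batch_size)
    ]


def serve_batched(requests: Sequence[list],
                  predict_batch: Callable[[List[list]], list],
                  max_batch_size: int = 4) -> list:
    """Call the model once per batch and return one result per request."""
    results = []
    for batch in micro_batch(requests, max_batch_size):
        results.extend(predict_batch(batch))
    return results


# Hypothetical stand-in for a model: "predicts" the sum of each row
def fake_predict_batch(rows: List[list]) -> list:
    return [sum(row) for row in rows]


pending = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
batches = micro_batch(pending, max_batch_size=2)
outputs = serve_batched(pending, fake_predict_batch, max_batch_size=2)
print(len(batches), outputs)
```

Five requests with a batch size of two result in three model calls rather than five, which is where the throughput gain of batching comes from.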

This is a quick tutorial on how to use BentoML to serve a scikit-learn model via a REST API server, containerize the API model server with Docker, and deploy it to AWS Lambda as a serverless endpoint.


In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

BentoML requires Python 3.6 or above. Install the dependencies via pip:

In [2]:
# Install PyPI packages required in this guide, including BentoML
!pip install -q bentoml pandas scikit-learn

Creating a Prediction Service with BentoML

A minimal prediction service in BentoML looks something like this:

In [3]:
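%%writefile iris_classifier.py
from bentoml import env, artifacts, api, BentoService
from bentoml.adapters import DataframeInput
from bentoml.artifact import SklearnModelArtifact

# Declare the pip dependencies and the model artifact bundled with the service
@env(auto_pip_dependencies=True)
@artifacts([SklearnModelArtifact('model')])
class IrisClassifier(BentoService):

    # Expose predict as an API endpoint that takes a pandas.DataFrame
    @api(input=DataframeInput())
    def predict(self, df):
        # Optional pre-processing, post-processing code goes here
        return self.artifacts.model.predict(df)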

This code defines a prediction service that bundles a scikit-learn model and provides an API that expects input data in the form of a pandas.DataFrame. The user-defined API function predict defines how the input dataframe will be processed and used for inference with the bundled scikit-learn model. BentoML also supports other API input types such as ImageInput, JsonInput and more.
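As an illustration of the kind of optional pre-processing and post-processing that could go inside predict, here is a small sketch in plain Python. The helper names validate_rows and labels_to_species are hypothetical and not part of BentoML; the species names are the three classes of the Iris dataset in label order:

```python
# Class names of the Iris dataset, in label order (0, 1, 2)
IRIS_SPECIES = ["setosa", "versicolor", "virginica"]


def validate_rows(rows):
    """Pre-processing: check that every row has the four iris features."""
    for row in rows:
        if len(row) != 4:
            raise ValueError(f"expected 4 features per row, got {len(row)}")
    return [[float(value) for value in row] for row in rows]


def labels_to_species(labels):
    """Post-processing: map integer class labels to species names."""
    return [IRIS_SPECIES[int(label)] for label in labels]


clean_rows = validate_rows([[5.1, 3.5, 1.4, 0.2]])
species = labels_to_species([0, 2, 1])
print(clean_rows, species)
```

Inside a service, helpers like these would wrap the call to self.artifacts.model.predict, so that clients receive readable labels instead of raw class indices.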

The following code trains a scikit-learn model and packages the trained model with the IrisClassifier class defined above. It then saves the IrisClassifier instance to disk in the BentoML SavedBundle format:

In [4]:
from sklearn import svm
from sklearn import datasets

# import the custom BentoService defined above
from iris_classifier import IrisClassifier

# Load training data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Model Training
clf = svm.SVC(gamma='scale')
clf.fit(X, y)

# Create an iris classifier service instance
iris_classifier_service = IrisClassifier()

# Pack the newly trained model artifact
iris_classifier_service.pack('model', clf)

# Save the prediction service to disk for model serving
saved_path = iris_classifier_service.save()
[2020-05-06 00:35:30,716] INFO - BentoService bundle 'IrisClassifier:20200506003514_CBCF1F' saved to: /Users/chaoyu/bentoml/repository/IrisClassifier/20200506003514_CBCF1F

By default, BentoML stores SavedBundle files under the ~/bentoml directory. Users can also customize BentoML to use a different directory or cloud storage like AWS S3 and MinIO, via BentoML's model management component YataiService, which provides advanced model management features including a dashboard web UI:

[Screenshot: BentoML YataiService Bento Repository Page]

[Screenshot: BentoML YataiService Bento Details Page]

Learn more about using YataiService for model management and try out the Web UI here.

In [5]:
# Where the SavedBundle directory is saved to
print("saved_path:", saved_path)

# Print the auto-generated service version
print("version:", iris_classifier_service.version)
saved_path: /Users/chaoyu/bentoml/repository/IrisClassifier/20200506003514_CBCF1F
version: 20200506003514_CBCF1F
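The output above suggests that a bundle tag of the form Name:Version corresponds to a directory under the repository folder. The sketch below reproduces that mapping. It is inferred only from the paths printed in this notebook, not from BentoML's source, and parse_bento_tag and default_bundle_path are hypothetical helper names:

```python
from pathlib import PurePosixPath


def parse_bento_tag(tag):
    """Split a 'Name:Version' tag into its two parts."""
    name, sep, version = tag.partition(":")
    if not sep or not name or not version:
        raise ValueError(f"expected 'Name:Version', got {tag!r}")
    return name, version


def default_bundle_path(tag, home="~/bentoml"):
    """Build <home>/repository/<Name>/<Version>, the layout seen in the log above."""
    name, version = parse_bento_tag(tag)
    return str(PurePosixPath(home) / "repository" / name / version)


tag = "IrisClassifier:20200506003514_CBCF1F"
print(parse_bento_tag(tag))
print(default_bundle_path(tag, home="/Users/chaoyu/bentoml"))
```

Note that a tag such as IrisClassifier:latest is resolved to a concrete version by BentoML itself; this sketch does not attempt that lookup.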

REST API Model Serving

The BentoML SavedBundle directory contains all the code, data and configs required to deploy the model.

To start a REST API model server with the IrisClassifier SavedBundle, use the bentoml serve command:

In [6]:
# Note that REST API serving **does not work in Google Colab** because Colab's VM cannot be accessed from outside
!bentoml serve IrisClassifier:latest
[2020-05-06 00:35:32,943] INFO - Getting latest version IrisClassifier:20200506003514_CBCF1F
 * Serving Flask app "IrisClassifier" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on (Press CTRL+C to quit) - - [06/May/2020 00:35:40] "POST /predict HTTP/1.1" 200 -

The IrisClassifier model is now served at localhost:5000. Use the curl command to send a prediction request:

curl -i \
--header "Content-Type: application/json" \
--request POST \
--data '[[5.1, 3.5, 1.4, 0.2]]' \
localhost:5000/predict

Or with Python and the requests library:

import requests
response = requests.post("http://localhost:5000/predict", json=[[5.1, 3.5, 1.4, 0.2]])
print(response.text)
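If the requests library is not available, the same POST request can be put together with the Python standard library. This is a sketch that assumes the server runs at localhost:5000 and exposes the predict API at /predict, as in the server log above; build_predict_request is a hypothetical helper name. The request is only constructed here, so the snippet runs without a live server:

```python
import json
import urllib.request


def build_predict_request(base_url, rows):
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("http://localhost:5000", [[5.1, 3.5, 1.4, 0.2]])
print(request.get_method(), request.full_url)

# To actually send it while the API server is running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```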

The BentoML API server also provides a web UI for accessing predictions and debugging the server. Visit http://localhost:5000 in the browser and use the Web UI to send prediction requests:

[Screenshot: BentoML API Server Web UI]

Containerize model server with Docker

BentoML provides a convenient way to containerize the model API server with Docker. Simply run docker build with the SavedBundle directory which contains a generated Dockerfile:

In [7]:
!docker build -q -t iris-classifier {saved_path}

Note that Docker is not available in Google Colab. To try this step, download the notebook, make sure Docker is installed, and run it locally.

Run the generated docker image to start a docker container serving the model:

In [8]:
!docker run -p 5000:5000 -e BENTOML_ENABLE_MICROBATCH=True iris-classifier:latest
[2020-05-06 07:37:07,693] INFO - get_gunicorn_num_of_workers: 3, calculated by cpu count
[2020-05-06 07:37:07,704] INFO - Running micro batch service on :5000
[2020-05-06 07:37:07 +0000] [9] [INFO] Starting gunicorn 20.0.4
[2020-05-06 07:37:07 +0000] [9] [INFO] Listening at: (9)
[2020-05-06 07:37:07 +0000] [9] [INFO] Using worker: aiohttp.worker.GunicornWebWorker
[2020-05-06 07:37:07 +0000] [10] [INFO] Booting worker with pid: 10
[2020-05-06 07:37:08,008] INFO - Micro batch enabled for API `predict`
[2020-05-06 07:37:08,009] INFO - Your system nofile limit is 1048576, which means each instance of microbatch service is able to hold this number of connections at same time. You can increase the number of file descriptors for the server process, or launch more microbatch instances to accept more concurrent connection.
[2020-05-06 07:37:08 +0000] [1] [INFO] Starting gunicorn 20.0.4
[2020-05-06 07:37:08 +0000] [1] [INFO] Listening at: (1)
[2020-05-06 07:37:08 +0000] [1] [INFO] Using worker: sync
[2020-05-06 07:37:08 +0000] [11] [INFO] Booting worker with pid: 11
[2020-05-06 07:37:08 +0000] [12] [INFO] Booting worker with pid: 12
[2020-05-06 07:37:08 +0000] [13] [INFO] Booting worker with pid: 13
[2020-05-06 07:37:16 +0000] [1] [INFO] Handling signal: int
/opt/conda/lib/python3.7/site-packages/sklearn/ UserWarning: Trying to unpickle estimator SVC from version 0.22 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
[2020-05-06 07:37:16 +0000] [12] [INFO] Worker exiting (pid: 12)
/opt/conda/lib/python3.7/site-packages/sklearn/ UserWarning: Trying to unpickle estimator SVC from version 0.22 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
[2020-05-06 07:37:16 +0000] [13] [INFO] Worker exiting (pid: 13)
/opt/conda/lib/python3.7/site-packages/sklearn/ UserWarning: Trying to unpickle estimator SVC from version 0.22 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
[2020-05-06 07:37:16 +0000] [11] [INFO] Worker exiting (pid: 11)

This makes it possible to deploy BentoML-bundled ML models on platforms such as Kubeflow, Knative, and Kubernetes, which provide advanced model deployment features such as auto-scaling, A/B testing, scale-to-zero, canary rollout and multi-armed bandit.

Load saved BentoService

bentoml.load is the essential API for loading a Bento into your Python application:

In [9]:
import bentoml
import pandas as pd

bento_svc = bentoml.load(saved_path)

# Test loaded bentoml service:
bento_svc.predict(pd.DataFrame([[5.1, 3.5, 1.4, 0.2]]))
[2020-05-06 00:37:20,492] WARNING - Module `iris_classifier` already loaded, using existing imported module.

This can be useful for building a test pipeline for your prediction service or for using the same prediction service for offline batch serving.
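A test pipeline along these lines could look roughly like the following sketch. StubService is a hypothetical stand-in so that the example runs without BentoML installed; in practice the object returned by bentoml.load(saved_path) would take its place. check_predictions is likewise a hypothetical helper, not a BentoML API:

```python
def check_predictions(service, cases):
    """Return (inputs, expected, actual) for every case the service gets wrong."""
    failures = []
    for inputs, expected in cases:
        actual = list(service.predict([inputs]))
        if actual != [expected]:
            failures.append((inputs, expected, actual))
    return failures


class StubService:
    """Hypothetical stand-in exposing the same predict() shape as a loaded service."""

    def predict(self, rows):
        # Toy rule: petal length (third feature) below 2.5 means class 0, else class 1
        return [0 if row[2] < 2.5 else 1 for row in rows]


cases = [
    ([5.1, 3.5, 1.4, 0.2], 0),
    ([6.0, 2.9, 4.5, 1.5], 1),
]
failures = check_predictions(StubService(), cases)
print("failures:", failures)
```

An empty failure list means every case matched; a CI job could fail the build whenever the list is non-empty.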

Distribute BentoML SavedBundle as PyPI package

The BentoML SavedBundle is pip-installable and can be directly distributed as a PyPI package for use in python applications:

In [10]:
!pip install -q {saved_path}
In [11]:
# The BentoService class name becomes the package name
import IrisClassifier

installed_svc = IrisClassifier.load()

This also allows users to upload their BentoService to PyPI as a public Python package, or to their organization's private PyPI index, to share with other developers.

cd {saved_path} && python setup.py sdist upload

You will have to configure the ".pypirc" file before uploading to a PyPI index. You can find more information about distributing Python packages at:

Batch Offline Serving via CLI

pip install {saved_path} also installs a CLI tool for accessing the BentoML service. Print the CLI help document with --help:

In [12]:
!IrisClassifier --help
Usage: IrisClassifier [OPTIONS] COMMAND [ARGS]...

  BentoML CLI tool

  --version  Show the version and exit.
  --help     Show this message and exit.

  info                List APIs
  install-completion  Install shell command completion
  open-api-spec       Display OpenAPI/Swagger JSON specs
  run                 Run API function
  serve               Start local rest server
  serve-gunicorn      Start local gunicorn server

View the help manual for the run command:

In [13]:
!IrisClassifier run predict --help
Usage: IrisClassifier run [OPTIONS] API_NAME [RUN_ARGS]...

  Run a API defined in saved BentoService bundle from command line

  --with-conda        Run API server in a BentoML managed Conda environment
  -q, --quiet         Hide all warnings and info logs
  --verbose, --debug  Show debug logs when running the command
  --help              Show this message and exit.

Run prediction job from CLI:

In [14]:
!IrisClassifier run predict --input='[[5.1, 3.5, 1.4, 0.2]]'

The BentoML CLI also supports reading input data from CSV or JSON files, either on the local machine or at a remote HTTP/S3 location:

In [15]:
!IrisClassifier run predict --input=""
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
 2 2]
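To show how rows in a CSV file relate to the inline JSON passed with --input earlier, here is a small standard-library sketch. It is not how BentoML parses input files; csv_to_payload and the column names in the sample are hypothetical:

```python
import csv
import io
import json


def csv_to_payload(csv_text, has_header=True):
    """Convert CSV text into the JSON list-of-rows form used with --input."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = [row for row in reader if row]  # skip blank lines
    if has_header and rows:
        rows = rows[1:]
    return json.dumps([[float(value) for value in row] for row in rows])


# Hypothetical sample with a header row and two flowers
sample = (
    "sepal_length,sepal_width,petal_length,petal_width\n"
    "5.1,3.5,1.4,0.2\n"
    "6.2,3.4,5.4,2.3\n"
)
payload = csv_to_payload(sample)
print(payload)
```

Each CSV row becomes one inner list, so a file with 150 rows produces 150 predictions, matching the length of the array printed above.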

The same command is also available via the bentoml CLI, by specifying the BentoService name and version:

In [16]:
!bentoml run IrisClassifier:latest predict --input='[[5.1, 3.5, 1.4, 0.2]]'
[2020-05-06 00:37:51,843] INFO - Getting latest version IrisClassifier:20200506003514_CBCF1F

Deploy API model server to cloud services

BentoML can deploy SavedBundle directly to cloud services such as AWS Lambda or AWS SageMaker, with the bentoml CLI command. Check out the deployment guides and other deployment options with BentoML here.

The following part of the notebook demonstrates how to deploy the IrisClassifier model server built in the previous steps to AWS Lambda as a serverless endpoint.

Before starting, install the aws-sam-cli package, which BentoML requires to create an AWS Lambda deployment:

In [19]:
!pip install -q -U aws-sam-cli==0.33.1

Make sure an AWS account and credentials are configured, either via environment variables or the aws configure command. (Install the aws CLI via pip install awscli and follow the instructions here)

To create a BentoML deployment on AWS Lambda, use the bentoml lambda deploy command:

In [20]:
!bentoml lambda deploy quick-start-guide-deployment -b IrisClassifier:{iris_classifier_service.version} 
Deploying "IrisClassifier:20200506003514_CBCF1F" to AWS Lambda -[2020-05-06 00:44:50,537] INFO - Building lambda project
|[2020-05-06 00:47:07,298] INFO - Packaging AWS Lambda project at /private/var/folders/7p/y_934t3s4yg8fx595vr28gym0000gn/T/bentoml-temp-truw0360 ...
/[2020-05-06 00:49:06,140] INFO - Deploying lambda project
\[2020-05-06 00:49:56,374] INFO - ApplyDeployment (quick-start-guide-deployment, namespace dev) succeeded
Successfully created AWS Lambda deployment quick-start-guide-deployment
  "namespace": "dev",
  "name": "quick-start-guide-deployment",
  "spec": {
    "bentoName": "IrisClassifier",
    "bentoVersion": "20200506003514_CBCF1F",
    "operator": "AWS_LAMBDA",
    "awsLambdaOperatorConfig": {
      "region": "us-west-2",
      "memorySize": 1024,
      "timeout": 3
  "state": {
    "state": "RUNNING",
    "infoJson": {
      "endpoints": [
      "s3_bucket": "btml-dev-quick-start-guide-deployment-b15326"
    "timestamp": "2020-05-06T07:49:56.600224Z"
  "createdAt": "2020-05-06T07:44:45.627817Z",
  "lastUpdatedAt": "2020-05-06T07:44:45.627842Z"

The 'quick-start-guide-deployment' here is the deployment name, which can be used to query the current deployment status:

In [21]:
!bentoml lambda get quick-start-guide-deployment
  "namespace": "dev",
  "name": "quick-start-guide-deployment",
  "spec": {
    "bentoName": "IrisClassifier",
    "bentoVersion": "20200506003514_CBCF1F",
    "operator": "AWS_LAMBDA",
    "awsLambdaOperatorConfig": {
      "region": "us-west-2",
      "memorySize": 1024,
      "timeout": 3
  "state": {
    "state": "RUNNING",
    "infoJson": {
      "endpoints": [
      "s3_bucket": "btml-dev-quick-start-guide-deployment-b15326"
    "timestamp": "2020-05-06T07:50:03.839368Z"
  "createdAt": "2020-05-06T07:44:45.627817Z",
  "lastUpdatedAt": "2020-05-06T07:44:45.627842Z"
In [28]:
!endpoint=$(bentoml lambda get quick-start-guide-deployment | jq -r ".state.infoJson.endpoints[0]") && \
    echo $endpoint
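The jq expression above, .state.infoJson.endpoints[0], can also be written in Python. This sketch assumes the deployment description is valid JSON with the nesting implied by that expression; first_endpoint is a hypothetical helper and the endpoint URL in the sample is a placeholder, not a real deployment:

```python
import json


def first_endpoint(deployment_json):
    """Return the first endpoint URL from a deployment description, or None."""
    info = json.loads(deployment_json)
    endpoints = info.get("state", {}).get("infoJson", {}).get("endpoints", [])
    return endpoints[0] if endpoints else None


# Sample shaped like the deployment output; the endpoint is a hypothetical placeholder
sample_output = json.dumps({
    "namespace": "dev",
    "name": "quick-start-guide-deployment",
    "state": {
        "state": "RUNNING",
        "infoJson": {
            "endpoints": ["https://example.invalid/predict"],
        },
    },
})
endpoint = first_endpoint(sample_output)
print(endpoint)
```

Returning None when no endpoint is listed makes it easy to detect a deployment that has not finished starting.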

To send a request to your AWS Lambda deployment, grab the endpoint URL from the JSON output above:

In [29]:
!curl -i \
--header "Content-Type: application/json" \
--request POST \
--data '[[5.1, 3.5, 1.4, 0.2]]' \
$(bentoml lambda get quick-start-guide-deployment | jq -r ".state.infoJson.endpoints[0]")


To list all the deployments you've created:

In [30]:
!bentoml deployment list
NAME                          NAMESPACE    PLATFORM    BENTO_SERVICE                         STATUS    AGE
quick-start-guide-deployment  dev          aws-lambda  IrisClassifier:20200506003514_CBCF1F  running   8 minutes and 31.06 seconds

And to delete an active deployment:

In [31]:
!bentoml lambda delete quick-start-guide-deployment
Successfully deleted AWS Lambda deployment "quick-start-guide-deployment"

By default, BentoML stores the deployment metadata on the local machine. For team settings, we recommend hosting a shared BentoML YataiService so that a data science team can track all of its BentoML SavedBundles and model serving deployments. See related documentation here.


This is what it looks like when using BentoML to serve and deploy a model in the cloud. BentoML also supports many other Machine Learning frameworks, as well as many other deployment platforms. The BentoML core concepts doc is also recommended for anyone looking to get a deeper understanding of BentoML.

Join the BentoML Slack to follow the latest development updates and roadmap discussions.