We are into the 3rd week of the Fundamentals of MLOps: A Hands-On Approach, where we will learn how to efficiently create & experiment with ML pipelines using a low-code framework called PyCaret.
This tutorial is intended to familiarize you with some of the functionality offered by PyCaret using a regression example (we will use the `pycaret.regression` module in this tutorial). Most of the functions (with a few parameter tweaks) can be extended to other modules as well (you will explore the `pycaret.classification` module in this week's assignment).
The first step is to install PyCaret, which can be done as follows:
!pip install pycaret
(Output truncated: pip downloads pycaret-2.3.2 & its dependencies, builds the required wheels, flags a few dependency-version conflicts, & finishes with "Successfully installed ... pycaret-2.3.2" along with scikit-learn-0.23.2, lightgbm-3.2.1, mlflow-1.18.0, pandas-profiling-3.0.0, etc.)
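Note: the log above shows pycaret 2.3.2 being installed. Since PyCaret's API evolves across releases, pinning this version (an optional tweak to the command above) makes it easier to reproduce the tutorial environment later:
!pip install pycaret==2.3.2 # optional: pin the exact version used in this tutorial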
For this tutorial, we will use a dataset of superconductive materials. You can download the data (`material_superconductivity.csv`) for this tutorial from here and load it using `pandas`.
Note: It is a slightly modified version of the original Superconductivity Data Set.
The goal of this tutorial is to develop an ML model which can predict the critical temperature (`critical_temp`) of a superconductor, given a set of various properties of the superconducting material.
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Datasets/Superconductivity/material_superconductivity.csv")
Since we do not have an explicit test dataset for our trained model to predict on, we will use 10% of the total datapoints as our unseen test dataset, while the remaining 90% will be used for training & validation.
data_unseen = df.sample(frac=0.1, random_state=42) # Sample 10% of the data to become the unseen test set
df = df.drop(data_unseen.index) # Use the remaining 90% as the training (& validation) data
df.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)
print('Data for Model Training & Validation: ' + str(df.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Model Training & Validation: (19137, 83)
Unseen Data For Predictions: (2126, 83)
df.head()
material | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | wtd_std_atomic_mass | mean_fie | wtd_mean_fie | gmean_fie | wtd_gmean_fie | entropy_fie | wtd_entropy_fie | range_fie | wtd_range_fie | std_fie | wtd_std_fie | mean_atomic_radius | wtd_mean_atomic_radius | gmean_atomic_radius | wtd_gmean_atomic_radius | entropy_atomic_radius | wtd_entropy_atomic_radius | range_atomic_radius | wtd_range_atomic_radius | std_atomic_radius | wtd_std_atomic_radius | mean_Density | wtd_mean_Density | gmean_Density | wtd_gmean_Density | entropy_Density | wtd_entropy_Density | range_Density | wtd_range_Density | ... | wtd_mean_ElectronAffinity | gmean_ElectronAffinity | wtd_gmean_ElectronAffinity | entropy_ElectronAffinity | wtd_entropy_ElectronAffinity | range_ElectronAffinity | wtd_range_ElectronAffinity | std_ElectronAffinity | wtd_std_ElectronAffinity | mean_FusionHeat | wtd_mean_FusionHeat | gmean_FusionHeat | wtd_gmean_FusionHeat | entropy_FusionHeat | wtd_entropy_FusionHeat | range_FusionHeat | wtd_range_FusionHeat | std_FusionHeat | wtd_std_FusionHeat | mean_ThermalConductivity | wtd_mean_ThermalConductivity | gmean_ThermalConductivity | wtd_gmean_ThermalConductivity | entropy_ThermalConductivity | wtd_entropy_ThermalConductivity | range_ThermalConductivity | wtd_range_ThermalConductivity | std_ThermalConductivity | wtd_std_ThermalConductivity | mean_Valence | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ba0.2La1.8Cu1O4 | 4.0 | 88.944468 | 57.862692 | 66.361592 | 36.116612 | 1.181795 | 1.062396 | 122.90607 | 31.794921 | 51.968828 | 53.622535 | 775.425 | 1010.268571 | 718.152900 | 938.016780 | 1.305967 | 0.791488 | 810.6 | 735.985714 | 323.811808 | 355.562967 | 160.25 | 105.514286 | 136.126003 | 84.528423 | 1.259244 | 1.207040 | 205.0 | 42.914286 | 75.237540 | 69.235569 | 4654.35725 | 2961.502286 | 724.953211 | 53.543811 | 1.033129 | 0.814598 | 8958.571 | 1579.583429 | ... | 111.727143 | 60.123179 | 99.414682 | 1.159687 | 0.787382 | 127.05 | 80.987143 | 51.433712 | 42.558396 | 6.9055 | 3.846857 | 3.479475 | 1.040986 | 1.088575 | 0.994998 | 12.878 | 1.744571 | 4.599064 | 4.666920 | 107.756645 | 61.015189 | 7.062488 | 0.621979 | 0.308148 | 0.262848 | 399.97342 | 57.127669 | 168.854244 | 138.517163 | 2.25 | 2.257143 | 2.213364 | 2.219783 | 1.368922 | 1.066221 | 1.0 | 1.085714 | 0.433013 | 0.437059 | 29.0 |
1 | Ba0.1La1.9Ag0.1Cu0.9O4 | 5.0 | 92.729214 | 58.518416 | 73.132787 | 36.396602 | 1.449309 | 1.057755 | 122.90607 | 36.161939 | 47.094633 | 53.979870 | 766.440 | 1010.612857 | 720.605511 | 938.745413 | 1.544145 | 0.807078 | 810.6 | 743.164286 | 290.183029 | 354.963511 | 161.20 | 104.971429 | 141.465215 | 84.370167 | 1.508328 | 1.204115 | 205.0 | 50.571429 | 67.321319 | 68.008817 | 5821.48580 | 3021.016571 | 1237.095080 | 54.095718 | 1.314442 | 0.914802 | 10488.571 | 1667.383429 | ... | 112.316429 | 69.833315 | 101.166398 | 1.427997 | 0.838666 | 127.05 | 81.207857 | 49.438167 | 41.667621 | 7.7844 | 3.796857 | 4.403790 | 1.035251 | 1.374977 | 1.073094 | 12.878 | 1.595714 | 4.473363 | 4.603000 | 172.205316 | 61.372331 | 16.064228 | 0.619735 | 0.847404 | 0.567706 | 429.97342 | 51.413383 | 198.554600 | 139.630922 | 2.00 | 2.257143 | 1.888175 | 2.210679 | 1.557113 | 1.047221 | 2.0 | 1.128571 | 0.632456 | 0.468606 | 26.0 |
2 | Ba0.1La1.9Cu1O4 | 4.0 | 88.944468 | 57.885242 | 66.361592 | 36.122509 | 1.181795 | 0.975980 | 122.90607 | 35.741099 | 51.968828 | 53.656268 | 775.425 | 1010.820000 | 718.152900 | 939.009036 | 1.305967 | 0.773620 | 810.6 | 743.164286 | 323.811808 | 354.804183 | 160.25 | 104.685714 | 136.126003 | 84.214573 | 1.259244 | 1.132547 | 205.0 | 49.314286 | 75.237540 | 67.797712 | 4654.35725 | 2999.159429 | 724.953211 | 53.974022 | 1.033129 | 0.760305 | 8958.571 | 1667.383429 | ... | 112.213571 | 60.123179 | 101.082152 | 1.159687 | 0.786007 | 127.05 | 81.207857 | 51.433712 | 41.639878 | NaN | 3.822571 | 3.479475 | 1.037439 | 1.088575 | 0.927479 | 12.878 | 1.757143 | 4.599064 | 4.649635 | 107.756645 | 60.943760 | 7.062488 | 0.619095 | 0.308148 | 0.250477 | 399.97342 | 57.127669 | 168.854244 | 138.540613 | 2.25 | 2.271429 | 2.213364 | 2.232679 | 1.368922 | 1.029175 | 1.0 | 1.114286 | 0.433013 | 0.444697 | 19.0 |
3 | Ba0.15La1.85Cu1O4 | 4.0 | 88.944468 | 57.873967 | 66.361592 | 36.119560 | 1.181795 | 1.022291 | 122.90607 | 33.768010 | 51.968828 | 53.639405 | 775.425 | 1010.544286 | 718.152900 | 938.512777 | 1.305967 | 0.783207 | 810.6 | 739.575000 | 323.811808 | 355.183884 | 160.25 | 105.100000 | 136.126003 | 84.371352 | 1.259244 | 1.173033 | 205.0 | 46.114286 | 75.237540 | 68.521665 | 4654.35725 | 2980.330857 | 724.953211 | 53.758486 | 1.033129 | 0.788889 | 8958.571 | 1623.483429 | ... | 111.970357 | 60.123179 | 100.244950 | 1.159687 | NaN | 127.05 | 81.097500 | 51.433712 | 42.102344 | 6.9055 | 3.834714 | 3.479475 | 1.039211 | 1.088575 | 0.964031 | 12.878 | 1.744571 | 4.599064 | 4.658301 | 107.756645 | 60.979474 | 7.062488 | 0.620535 | 0.308148 | 0.257045 | 399.97342 | 57.127669 | 168.854244 | 138.528893 | 2.25 | 2.264286 | 2.213364 | 2.226222 | 1.368922 | 1.048834 | 1.0 | 1.100000 | 0.433013 | 0.440952 | 22.0 |
4 | Ba0.3La1.7Cu1O4 | 4.0 | 88.944468 | 57.840143 | 66.361592 | 36.110716 | 1.181795 | 1.129224 | 122.90607 | 27.848743 | 51.968828 | 53.588771 | 775.425 | 1009.717143 | 718.152900 | 937.025573 | 1.305967 | 0.805230 | 810.6 | 728.807143 | 323.811808 | 356.319281 | 160.25 | 106.342857 | 136.126003 | 84.843442 | 1.259244 | 1.261194 | 205.0 | 36.514286 | 75.237540 | 70.634448 | 4654.35725 | 2923.845143 | 724.953211 | 53.117029 | 1.033129 | 0.859811 | 8958.571 | 1491.783429 | ... | 111.240714 | 60.123179 | 97.774719 | 1.159687 | 0.787396 | 127.05 | 80.766429 | 51.433712 | 43.452059 | 6.9055 | 3.871143 | 3.479475 | 1.044545 | 1.088575 | 1.044970 | 12.878 | 1.744571 | 4.599064 | 4.684014 | 107.756645 | 61.086617 | 7.062488 | 0.624878 | 0.308148 | 0.272820 | 399.97342 | 57.127669 | 168.854244 | 138.493671 | 2.25 | 2.242857 | 2.213364 | 2.206963 | 1.368922 | 1.096052 | 1.0 | 1.057143 | 0.433013 | 0.428809 | 23.0 |
5 rows × 83 columns
Note: Here, we are using an external dataset & have loaded it using `pandas`. PyCaret also has a repository of datasets that can be used for model training & experimentation across various tasks. These datasets can be loaded as a `pandas` DataFrame using:
from pycaret.datasets import get_data # PyCaret's built-in dataset loader
dataset_name = "name of pycaret dataset" # Eg: "ipl", "bike", etc.
data = get_data(dataset_name)
This section intends to familiarize you with the building blocks of PyCaret that are used to develop a simple end-to-end ML pipeline. In this section, we will look at the following: setting up the environment, comparing models, creating & tuning a model, analyzing its performance, making predictions on held-out & unseen data, & saving/loading the trained pipeline.
We need to configure the environment before we begin any machine learning experiment in PyCaret. Depending on the sort of experiment we wish to run, one of the six presently supported modules must be loaded into our Python environment.
To begin with, we import all the functions from the `pycaret.regression` module.
from pycaret.regression import *
Now, we use the `setup(...)` function to initialize the environment of our ML experiment. It is the first & only mandatory step to begin any ML experiment, and is common to all 6 modules.
Apart from defining the dataset (using `data`) & the target variable to be predicted (using `target`; this holds for all modules except clustering & anomaly detection), the `setup(...)` function offers a wide range of functionalities. In this Basic PyCaret section, we will look at the following functionalities:
- `session_id` controls the randomness of the experiment. If set to a non-`None` value (say, `42`, as in our case), it can be used for later reproducibility of the entire experiment.
- `train_size` determines the fraction of the dataset used for training our models during the experiment. By default, 0.7 (or 70%) of the dataset is used for training, while the remainder is held out for validation in the end.
- PyCaret automatically infers the data type of each feature. A table of the inferred types is displayed when `setup(...)` is executed, & allows you to confirm them. However, one can override these by typing "quit" in the textbox & later specifying the respective column names as arguments for `categorical_features` & `numeric_features` (we don't need these for our case, though a sketch follows after this list). Columns that need not be considered for the ML task can be excluded by specifying them as `ignore_features` (here, since the `material` column contains unique values for each instance & doesn't aid in the task, we can ignore it).
- The imputation strategy for missing values in categorical features is specified using `categorical_imputation` (`'constant'` by default), while that for numeric features is specified using `numeric_imputation` (`'mean'` by default).

There are a bunch of other data preprocessing & transformation steps that can be orchestrated into our pipeline using the `setup(...)` function. We will explore them in the Intermediate PyCaret section.
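For reference, a hedged sketch of overriding the inferred types (the column names col_a & col_b here are hypothetical; our dataset doesn't need this override):
expt_override = setup(
    data = df,
    target = 'critical_temp',
    categorical_features = ['col_a'], # hypothetical column to force as categorical
    numeric_features = ['col_b'],     # hypothetical column to force as numeric
)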
expt_basic = setup(
data = df,
target = 'critical_temp',
session_id=42, # Random seed to ensure reproducibility of the experiment with the same data
train_size=0.8, # 80% training data & 20% held-out validation data
ignore_features=["material"],
numeric_imputation="median", # "mean" by default
categorical_imputation="mode", # "constant" (not_available) by default
)
Description | Value | |
---|---|---|
0 | session_id | 42 |
1 | Target | critical_temp |
2 | Original Data | (21263, 83) |
3 | Missing Values | True |
4 | Numeric Features | 79 |
5 | Categorical Features | 2 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (17010, 95) |
10 | Transformed Test Set | (4253, 95) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | a8ad |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | median |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | mode |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | False |
28 | Normalize Method | None |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Remove Perfect Collinearity | True |
43 | Clustering | False |
44 | Clustering Iteration | None |
45 | Polynomial Features | False |
46 | Polynomial Degree | None |
47 | Trignometry Features | False |
48 | Polynomial Threshold | None |
49 | Group Features | False |
50 | Feature Selection | False |
51 | Feature Selection Method | classic |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Transform Target | False |
57 | Transform Target Method | box-cox |
This completes the environment setup for our experiment, with all the necessary preprocessing, transformation & feature engineering steps piped together, ready to be applied to our dataset for training our ML models. All the details of our environment are displayed in the table above.
We are now set to experience the true power of low-code machine learning, where we train & compare a multitude of models with just a single line of code.
But prior to that, let us inspect all the algorithms that PyCaret's `regression` module offers us. This can be done using `models()`, as shown below. The output is a table of the models available in the model library for the particular module, along with a reference to the actual underlying implementation.
The Turbo column indicates the algorithms that usually run in a shorter duration (Turbo is `True` for these) & are chosen by default for the comparison.
Note: `models()` can be used with the other modules as well to inspect the various algorithms available for those tasks.
models()
Name | Reference | Turbo | |
---|---|---|---|
ID | |||
lr | Linear Regression | sklearn.linear_model._base.LinearRegression | True |
lasso | Lasso Regression | sklearn.linear_model._coordinate_descent.Lasso | True |
ridge | Ridge Regression | sklearn.linear_model._ridge.Ridge | True |
en | Elastic Net | sklearn.linear_model._coordinate_descent.Elast... | True |
lar | Least Angle Regression | sklearn.linear_model._least_angle.Lars | True |
llar | Lasso Least Angle Regression | sklearn.linear_model._least_angle.LassoLars | True |
omp | Orthogonal Matching Pursuit | sklearn.linear_model._omp.OrthogonalMatchingPu... | True |
br | Bayesian Ridge | sklearn.linear_model._bayes.BayesianRidge | True |
ard | Automatic Relevance Determination | sklearn.linear_model._bayes.ARDRegression | False |
par | Passive Aggressive Regressor | sklearn.linear_model._passive_aggressive.Passi... | True |
ransac | Random Sample Consensus | sklearn.linear_model._ransac.RANSACRegressor | False |
tr | TheilSen Regressor | sklearn.linear_model._theil_sen.TheilSenRegressor | False |
huber | Huber Regressor | sklearn.linear_model._huber.HuberRegressor | True |
kr | Kernel Ridge | sklearn.kernel_ridge.KernelRidge | False |
svm | Support Vector Regression | sklearn.svm._classes.SVR | False |
knn | K Neighbors Regressor | sklearn.neighbors._regression.KNeighborsRegressor | True |
dt | Decision Tree Regressor | sklearn.tree._classes.DecisionTreeRegressor | True |
rf | Random Forest Regressor | sklearn.ensemble._forest.RandomForestRegressor | True |
et | Extra Trees Regressor | sklearn.ensemble._forest.ExtraTreesRegressor | True |
ada | AdaBoost Regressor | sklearn.ensemble._weight_boosting.AdaBoostRegr... | True |
gbr | Gradient Boosting Regressor | sklearn.ensemble._gb.GradientBoostingRegressor | True |
mlp | MLP Regressor | sklearn.neural_network._multilayer_perceptron.... | False |
lightgbm | Light Gradient Boosting Machine | lightgbm.sklearn.LGBMRegressor | True |
With this background, we can proceed to compare the various algorithms & evaluate their performance before zeroing in on the final model that we will eventually use.
The `compare_models()` function (just 2 words!) trains and evaluates the performance of all estimators available in the model library using k-fold cross-validation (on the training split of the dataset). The output is a score grid that shows the average metric scores across the `k` folds of validation, along with the training time.
In our case below, this simple function (with some parameters, of course, which we will discuss next) is able to train & evaluate 10+ algorithms without the user having to write code for any specific algorithm!
Following is an explanation of the parameters that `compare_models()` can take:
- `sort`: By default, the output grid is sorted in descending order of the R2 metric (higher R2 is better). This can be changed to any other metric by specifying it as `sort`. In our case, we sort the grid by RMSE (lower RMSE is better).
- `include`: This is used to explicitly list the algorithms that we wish to train & compare.
- `exclude`: This is used to list the algorithms that we do not wish to train or compare. Here, we have excluded 5 algorithms because they took a lot of time & did not give good results.
- `fold`: This is basically the `k` in k-fold cross-validation. By default, `k=10`, but it can be changed as per convenience.
- `turbo`: We saw previously that some algorithms have Turbo = `True` (these are usually the faster ones). By default, `turbo=True`, which includes only these faster algorithms in the comparison & evaluation. However, it can be set to `False` to include all the other models as well.

`compare_models()` returns the single best-performing model based on the `sort` order, unless we specify the number of top-performing models to be returned using the `n_select` parameter (we will use this in the Intermediate PyCaret section; a quick sketch follows below).
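A minimal sketch of `n_select` (the value 3 is purely illustrative):
top3 = compare_models(sort="RMSE", n_select=3) # returns a list of the 3 best models (lowest RMSE first)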
best = compare_models(sort="RMSE", exclude=["lar", "rf", "et", "gbr", "ada"], fold=5)
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
lightgbm | Light Gradient Boosting Machine | 6.7053 | 113.4606 | 10.6463 | 0.9037 | 0.4989 | 9.0818 | 1.208 |
knn | K Neighbors Regressor | 7.5609 | 175.0485 | 13.2177 | 0.8516 | 0.5174 | 8.3847 | 0.452 |
dt | Decision Tree Regressor | 7.1429 | 187.9618 | 13.6975 | 0.8405 | 0.5165 | 7.6383 | 0.906 |
ridge | Ridge Regression | 14.2936 | 347.1825 | 18.6294 | 0.7052 | 0.8754 | 14.6673 | 0.026 |
br | Bayesian Ridge | 14.2983 | 347.3928 | 18.6350 | 0.7050 | 0.8738 | 14.2653 | 0.164 |
lr | Linear Regression | 14.3182 | 348.1571 | 18.6556 | 0.7043 | 0.8793 | 14.6963 | 0.474 |
en | Elastic Net | 14.9925 | 376.1511 | 19.3914 | 0.6806 | 0.9120 | 20.0117 | 0.274 |
lasso | Lasso Regression | 15.0233 | 377.6577 | 19.4303 | 0.6793 | 0.9110 | 20.1200 | 0.268 |
omp | Orthogonal Matching Pursuit | 16.5008 | 443.5972 | 21.0579 | 0.6234 | 0.9436 | 26.2129 | 0.028 |
huber | Huber Regressor | 17.4946 | 519.1853 | 22.7831 | 0.5592 | 0.9496 | 24.5486 | 0.782 |
llar | Lasso Least Angle Regression | 29.4259 | 1179.1006 | 34.3350 | -0.0007 | 1.4446 | 51.3135 | 0.046 |
par | Passive Aggressive Regressor | 29.7797 | 1391.9471 | 35.9650 | -0.1920 | 1.4002 | 36.7929 | 0.184 |
Once we have performed a comparative analysis of the various available models, we can choose the best-performing algorithm & train it on our dataset. This can be done using `create_model(...)`, which is perhaps the most granular function in PyCaret.
Given the ID of the model to be created (here, `"lightgbm"`), this function trains & evaluates the corresponding model for us. By default, it uses 10-fold cross-validation. The number of folds can be changed with the `fold` parameter, & cross-validation can be avoided altogether by setting `cross_validation=False`.
If needed, other model-specific parameters can also be set while calling this function. These parameters can be found in the reference of the corresponding model (for example: `max_depth` for xgboost, `learning_rate` for lightgbm, etc.), as sketched below.
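For instance, a minimal sketch (the hyperparameter values below are illustrative, not recommendations):
# Model-specific hyperparameters are passed as keyword arguments to create_model()
lgbm_custom = create_model("lightgbm", fold=5, learning_rate=0.05, num_leaves=50)
# dt_plain = create_model("dt", cross_validation=False) # train once on the full training split, no CV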
lgbm = create_model("lightgbm", fold=10)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 6.8344 | 126.3321 | 11.2398 | 0.8940 | 0.5226 | 13.8306 |
1 | 6.7972 | 113.1294 | 10.6362 | 0.9056 | 0.4874 | 1.2830 |
2 | 6.5400 | 104.4271 | 10.2190 | 0.9117 | 0.4969 | 10.6841 |
3 | 6.6852 | 121.6496 | 11.0295 | 0.8923 | 0.5055 | 12.6763 |
4 | 6.3071 | 100.5768 | 10.0288 | 0.9085 | 0.4885 | 23.3895 |
5 | 6.7048 | 105.6652 | 10.2794 | 0.9087 | 0.4920 | 1.1607 |
6 | 6.8755 | 129.7889 | 11.3925 | 0.8944 | 0.4816 | 1.3610 |
7 | 6.7249 | 106.4520 | 10.3176 | 0.9115 | 0.5337 | 23.7443 |
8 | 6.7539 | 115.1639 | 10.7314 | 0.9001 | 0.4960 | 0.9899 |
9 | 6.6179 | 101.2943 | 10.0645 | 0.9178 | 0.4927 | 1.7183 |
Mean | 6.6841 | 112.4479 | 10.5939 | 0.9045 | 0.4997 | 9.0838 |
SD | 0.1569 | 9.9880 | 0.4670 | 0.0083 | 0.0157 | 8.7213 |
When a model is built with the `create_model()` function, it is trained using the default hyperparameters. The `tune_model()` function is then used to tune the hyperparameters of this created model so that a particular metric is specifically optimized. It performs a random grid search over pre-defined (but fully configurable) grids for the model provided as the estimator (for now, we will use the random grid & explore the customization in the next section; a quick preview follows below).
Again, tuning is done using k-fold cross-validation, where `k` can be set using the `fold` parameter. The metric with respect to which the hyperparameters are optimized is specified using `optimize`.
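Beyond `fold` & `optimize`, `tune_model` also accepts `n_iter` (the number of random grid iterations, 10 by default) & `custom_grid` (a user-supplied search space). A hedged sketch with illustrative values:
tuned_sketch = tune_model(
    lgbm,
    optimize="RMSE",
    n_iter=20, # more random grid iterations than the default 10
    custom_grid={"num_leaves": [20, 30, 40], "learning_rate": [0.05, 0.1, 0.2]}, # illustrative search space
)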
tuned_lgbm = tune_model(lgbm, fold=5, optimize="RMSE")
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 6.4884 | 113.3398 | 10.6461 | 0.9052 | 0.4981 | 6.6217 |
1 | 6.3793 | 107.5357 | 10.3699 | 0.9070 | 0.4996 | 10.4792 |
2 | 6.2171 | 96.1899 | 9.8076 | 0.9148 | 0.4796 | 11.1850 |
3 | 6.5189 | 112.1177 | 10.5886 | 0.9079 | 0.4837 | 13.3632 |
4 | 6.4482 | 104.1701 | 10.2064 | 0.9127 | 0.4836 | 1.3184 |
Mean | 6.4104 | 106.6707 | 10.3237 | 0.9095 | 0.4889 | 8.5935 |
SD | 0.1074 | 6.1805 | 0.3021 | 0.0036 | 0.0082 | 4.2388 |
As you can see, the RMSE value has improved from 10.5939 to 10.3237 after fine-tuning.
We can print out both the untuned & tuned LightGBM models to check the difference in hyperparameters.
print(lgbm, "\n")
print(tuned_lgbm)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

LGBMRegressor(bagging_fraction=0.8, bagging_freq=3, boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, feature_fraction=0.8, importance_type='split', learning_rate=0.2, max_depth=-1, min_child_samples=6, min_child_weight=0.001, min_split_gain=0.6, n_estimators=100, n_jobs=-1, num_leaves=30, objective=None, random_state=42, reg_alpha=0.001, reg_lambda=5, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
Analyzing the performance of a trained ML model is an essential part of any ML workflow. In PyCaret, this can simply be done with the `plot_model(...)` function. The function accepts a trained model object and the kind of plot (the `plot` parameter) as a string. There are several kinds of plots one can generate, depending on the PyCaret module they are working with. Details about all these plot types can be found in the documentation.
For our regression task, we plot the residuals, the learning curve (to understand how the performance has evolved with training) & the relative feature importances in our trained model.
plot_model(tuned_lgbm, plot="residuals")
plot_model(tuned_lgbm, plot="learning")
plot_model(tuned_lgbm, plot="feature") # Top 10 most important features
# plot_model(tuned_lgbm, plot="feature_all") # All features used in the model
Another technique for evaluating model performance is the `evaluate_model()` function, which provides a user interface for all the plots available for a particular model (it makes use of the `plot_model()` function internally).
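For example, calling it on our tuned model (this renders an interactive widget in the notebook):
evaluate_model(tuned_lgbm) # UI with one tab per available plot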
If you recall, while setting up our environment, we explicitly split our data 80-20 into training & validation data. So far, the training & fine-tuning has been done on the training portion of that data (17010 samples).
Now that we've fine-tuned our model, it is a good idea to predict results on our held-out validation dataset (4253 samples) & check the performance metrics on this set.
For this, we use the `predict_model(...)` function & pass it our trained model object. Since we do not specify any dataset explicitly, the predictions are made on the validation dataset that was held out at the time of setting up our environment.
predict_model(tuned_lgbm);
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|---|
0 | Light Gradient Boosting Machine | 6.2684 | 96.4005 | 9.8184 | 0.9163 | 0.4909 | 2.1346 |
We can compare these values with the cross-validation results of our fine-tuned model. Note that, while comparing, we should take into account the standard deviation of the metrics across the folds of our fine-tuned model.
Once we have trained our model in this experiment, we need to save it so that it can be used in the future for making predictions.
A model can be saved easily using the `save_model(...)` function, which takes in the model object & the file name. The model, along with its entire transformation pipeline (preprocessing, etc., to be applied to the raw dataset), is saved as a `.pkl` (pickle) file.
save_model(tuned_lgbm,'lgbm_expt1')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=['material'], id_columns=[], ml_usecase='regression', numerical_features=[], target='critical_temp', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='most frequent', fill_value_categorical=None, fill_value_numerical=None, n... boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, feature_fraction=0.8, importance_type='split', learning_rate=0.2, max_depth=-1, min_child_samples=6, min_child_weight=0.001, min_split_gain=0.6, n_estimators=100, n_jobs=-1, num_leaves=30, objective=None, random_state=42, reg_alpha=0.001, reg_lambda=5, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)]], verbose=False), 'lgbm_expt1.pkl')
Any saved model can be loaded in the same or an alternate environment using the `load_model(...)` function by passing in the file name.
saved_model = load_model('lgbm_expt1')
Transformation Pipeline and Model Successfully Loaded
Now, we will use this loaded model to make predictions on our actual unseen test data (10% of the original dataset), which we had set aside initially.
Again, we use the `predict_model(...)` function, but this time we also pass in the unseen data via the `data` parameter.
Since our loaded model contains the entire transformation pipeline, all the necessary preprocessing steps are automatically applied to the passed dataset before predictions are actually made.
unseen_predictions = predict_model(saved_model, data=data_unseen)
unseen_predictions.head()
material | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | wtd_std_atomic_mass | mean_fie | wtd_mean_fie | gmean_fie | wtd_gmean_fie | entropy_fie | wtd_entropy_fie | range_fie | wtd_range_fie | std_fie | wtd_std_fie | mean_atomic_radius | wtd_mean_atomic_radius | gmean_atomic_radius | wtd_gmean_atomic_radius | entropy_atomic_radius | wtd_entropy_atomic_radius | range_atomic_radius | wtd_range_atomic_radius | std_atomic_radius | wtd_std_atomic_radius | mean_Density | wtd_mean_Density | gmean_Density | wtd_gmean_Density | entropy_Density | wtd_entropy_Density | range_Density | wtd_range_Density | ... | gmean_ElectronAffinity | wtd_gmean_ElectronAffinity | entropy_ElectronAffinity | wtd_entropy_ElectronAffinity | range_ElectronAffinity | wtd_range_ElectronAffinity | std_ElectronAffinity | wtd_std_ElectronAffinity | mean_FusionHeat | wtd_mean_FusionHeat | gmean_FusionHeat | wtd_gmean_FusionHeat | entropy_FusionHeat | wtd_entropy_FusionHeat | range_FusionHeat | wtd_range_FusionHeat | std_FusionHeat | wtd_std_FusionHeat | mean_ThermalConductivity | wtd_mean_ThermalConductivity | gmean_ThermalConductivity | wtd_gmean_ThermalConductivity | entropy_ThermalConductivity | wtd_entropy_ThermalConductivity | range_ThermalConductivity | wtd_range_ThermalConductivity | std_ThermalConductivity | wtd_std_ThermalConductivity | mean_Valence | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ge1Nb3 | 2.0 | 82.768190 | 87.837285 | 82.144935 | 87.360109 | 0.685627 | 0.509575 | 20.27638 | 51.522285 | 10.138190 | 8.779930 | 711.80 | 687.700000 | 710.166178 | 686.488365 | 0.690853 | 0.589409 | 96.4 | 307.700000 | 48.200000 | 41.742424 | 161.500000 | 179.750000 | 157.321327 | 176.492557 | 0.667386 | 0.461943 | 73.0 | 117.250000 | 36.500000 | 31.609927 | 6946.500000 | 7758.250000 | 6754.118003 | 7608.074085 | 0.665582 | NaN | 3247.000 | 5096.750000 | ... | 102.741423 | 94.869113 | 0.680597 | 0.622557 | 32.90 | 35.575000 | 16.450000 | 14.246118 | 29.3000 | 28.050000 | 29.193150 | 27.970992 | 0.689503 | 0.596157 | 5.000 | 12.150000 | 2.500000 | 2.165064 | 57.000000 | 55.500000 | 56.920998 | 55.441265 | 0.691761 | 0.583527 | 6.00000 | 25.500000 | 3.000000 | 2.598076 | 4.50 | 4.750000 | 4.472136 | 4.728708 | 0.686962 | 0.514653 | 1.0 | 2.750000 | 0.500000 | 0.433013 | 6.4 | 10.102330 |
1 | Y1Ba2Cu3O | 4.0 | 76.444563 | 81.456750 | 59.356672 | 68.229617 | 1.199541 | 1.108189 | 121.32760 | 36.950657 | 43.823354 | 40.612293 | 794.00 | 738.357143 | 741.629349 | 702.424197 | 1.315004 | 1.282439 | 810.6 | 231.371429 | 311.743492 | 255.465227 | 164.500000 | 171.571429 | 139.000514 | 153.255987 | 1.256701 | 1.166832 | 205.0 | 65.428571 | 77.525802 | 67.911407 | 4235.857250 | 5481.918429 | 669.556588 | 1780.077447 | 1.015407 | 0.810985 | 8958.571 | 3839.795857 | ... | 53.527965 | 56.435929 | 1.105182 | 0.953371 | 127.05 | 46.971429 | 54.830755 | 52.650295 | 8.1805 | 9.560286 | 4.035569 | 6.229663 | 1.112098 | 0.975148 | 12.878 | 5.582571 | 4.948155 | 4.359650 | 108.756645 | 179.003797 | 7.552385 | 26.578636 | 0.336262 | 0.201966 | 399.97342 | 171.424774 | 168.301047 | NaN | 2.25 | 2.142857 | 2.213364 | 2.119268 | 1.368922 | 1.309526 | 1.0 | 0.571429 | 0.433013 | 0.349927 | 91.2 | 81.822642 |
2 | Y1Ba1.5La0.5Cu3O7.08 | 5.0 | 88.936744 | 51.090431 | 70.358975 | 34.783991 | 1.445824 | 1.525092 | 122.90607 | 10.438667 | 46.482335 | 44.261233 | 743.42 | 1006.991437 | 696.313849 | 942.154532 | 1.538527 | 0.933542 | 810.6 | 690.076300 | 296.615154 | 340.103711 | 170.600000 | 111.914373 | 148.737352 | 88.458069 | 1.506998 | 1.509873 | 205.0 | 25.802752 | 70.406250 | 75.991199 | 4617.885800 | 3035.177165 | NaN | 66.205693 | 1.324548 | 0.978095 | 8958.571 | 2054.272376 | ... | 52.696967 | 90.754610 | 1.354052 | 0.815880 | 127.05 | 75.361239 | 50.281492 | 47.273696 | 7.8044 | 5.154569 | 4.411557 | 1.310299 | 1.374615 | 1.153108 | 12.878 | 2.884422 | 4.489231 | 5.622243 | 89.605316 | 95.618363 | 8.418912 | 1.058389 | 0.457810 | 0.209583 | 399.97342 | 91.728732 | 155.329609 | 166.192923 | 2.40 | 2.114679 | 2.352158 | 2.095193 | 1.589027 | 1.314189 | 1.0 | 0.967890 | 0.489898 | 0.318634 | 38.0 | 34.782055 |
3 | La1.76Sr0.24Cu1O4 | 4.0 | 76.517718 | 56.149432 | 59.310096 | 35.562124 | 1.197273 | 1.042132 | 122.90607 | 31.920690 | 44.289459 | 51.815571 | 787.05 | 1011.642286 | 734.219624 | 940.469590 | 1.313008 | 0.802473 | NaN | 731.520000 | 314.505966 | 353.685876 | 151.750000 | 104.680000 | 131.302197 | 84.236452 | 1.275274 | 1.215773 | 171.0 | 41.520000 | 65.579627 | 67.580981 | 4434.357250 | 2916.268000 | 674.484751 | 52.847119 | 0.995983 | 0.807732 | 8958.571 | 1544.463429 | ... | 48.477265 | 95.882106 | 1.096672 | 0.773969 | 135.97 | 81.204686 | 54.373097 | 43.628244 | 6.9055 | 3.856571 | 3.479475 | 1.042408 | 1.088575 | 1.016669 | 12.878 | 1.744571 | 4.599064 | 4.673780 | 112.006645 | 61.626617 | 8.339818 | 0.637507 | 0.403693 | 0.304547 | 399.97342 | 57.127669 | 166.742351 | 138.361103 | 2.25 | 2.251429 | 2.213364 | NaN | 1.368922 | 1.078855 | 1.0 | 1.074286 | 0.433013 | 0.433834 | 19.0 | 21.520392 |
4 | La0.94Mo6Se8 | 3.0 | 104.608490 | 89.558979 | 101.719818 | 88.481210 | 1.070258 | 0.944284 | 59.94547 | 33.541423 | 25.225148 | 15.159829 | 722.10 | 812.626104 | 703.695276 | 799.630186 | 1.072790 | 0.796138 | 399.3 | 469.515797 | 165.133461 | 141.219887 | 162.666667 | 143.728246 | 156.269832 | 137.110827 | 1.062146 | 0.913764 | 92.0 | 64.036145 | 42.240055 | 43.743683 | 7081.666667 | 7095.665328 | 6727.404359 | 6633.555886 | 1.046587 | 0.841524 | 5461.000 | 3741.817938 | ... | 89.376074 | 121.321040 | 0.925747 | 0.621554 | 147.00 | 102.106426 | 64.406539 | 63.255476 | 15.9000 | 17.745783 | 10.699060 | 11.681349 | 0.726384 | 0.547497 | 30.600 | 14.061446 | 14.217595 | 14.955962 | 50.840000 | 56.919679 | 9.794610 | 6.007072 | 0.313841 | 0.106068 | 138.48000 | 55.544846 | 62.546392 | 67.307993 | 5.00 | 5.811245 | 4.762203 | 5.743954 | 1.054920 | 0.803990 | 3.0 | 3.024096 | 1.414214 | 0.728448 | 11.0 | 5.981907 |
5 rows × 84 columns
The predictions on this dataset are appended to the test dataframe as a new column named "Label".
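For instance, to eyeball the actual vs. predicted values side by side, you can slice these two columns out of the returned dataframe (a small illustrative snippet using standard pandas):
# View actual vs. predicted critical temperatures side by side
unseen_predictions[['critical_temp', 'Label']].head()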
In this case, since we also have the actual values (the critical_temp column) in the unseen test dataset, we can use the check_metric(...) function from pycaret.utils to get the performance scores of our model on this data. The check_metric(...) function takes in the columns containing the true values & the predicted values, along with the name of the metric to compute.
from pycaret.utils import check_metric
import numpy as np
print("R2:\t", check_metric(unseen_predictions.critical_temp, unseen_predictions.Label, 'R2'))
print("RMSE:\t", np.round(np.sqrt(check_metric(unseen_predictions.critical_temp, unseen_predictions.Label, 'MSE')), 4))
R2: 0.9157 RMSE: 9.8366
Note: At the time of writing this tutorial, there was a bug in the PyCaret code due to which specifying "RMSE" as the metric returned the "MSE" value instead. Hence, the extra step of performing the square root manually is needed here.
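Alternatively, the same RMSE can be cross-checked without check_metric(...) — a minimal sketch using scikit-learn directly (assuming a scikit-learn version recent enough to support the squared argument of mean_squared_error):
from sklearn.metrics import mean_squared_error
# RMSE computed directly from the true & predicted columns; squared=False returns the root
rmse = mean_squared_error(unseen_predictions.critical_temp, unseen_predictions.Label, squared=False)
print("RMSE:\t", round(rmse, 4))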
With this, we have covered most of the basic functionalities of PyCaret. In this section of the tutorial, we explored how to prepare an end-to-end ML pipeline for a regression task (data ingestion, basic pre-processing, model selection & training, hyperparameter tuning, analysis, prediction & saving the model for later use) in less than 10 commands. This truly demonstrates the power & efficiency of using tools like PyCaret.
Having familiarized ourselves with the fundamentals of using PyCaret, it is time to dive into some more cool features that PyCaret offers. In this section, we will explore advanced preprocessing & feature engineering options, selecting & fine-tuning multiple models with custom hyperparameter grids, and model ensembling & blending.
In the previous section, we saw how setup(...) can be used to initialize the environment for our experiment & perform some basic preprocessing. Now, we look at some more options available to us for preprocessing our data further. Like before, all this can be done easily by setting the corresponding parameters while calling the setup function.
Following are some of the parameters that can be used to perform scaling & transformation:

- normalize: If set to True, the entire feature space is rescaled based on the normalize_method.
- normalize_method: Defines the method used to normalize the data. The available methods can be seen in the documentation.

PyCaret also allows you to create new features based on the original features simply by adding a few parameters. Some of these are:

- feature_interaction: Setting this to True creates new features of the form a * b for all pairs of numeric features a & b.
- feature_ratio: Setting this to True creates new features of the form a / b for all pairs of numeric features a & b.
- polynomial_features: Setting this to True creates new features based on polynomial combinations that exist within the numeric features.
- polynomial_degree: Specifies the degree of the polynomial features (default: 2).
- trigonometry_features: Setting this to True creates new features that are trigonometric combinations of the numeric features in the dataset, up to the degree defined in polynomial_degree.

With the knowledge of these parameters, we can set up a new experiment where we use some of these preprocessing steps as follows:
expt_intermediate = setup(
data = df,
target = 'critical_temp',
session_id=42, # Random seed to ensure reproducibility of the experiment with the same data
train_size=0.8, # 80% training data & 20% held-out validation data
ignore_features=["material"],
normalize=True,
normalize_method="minmax",
polynomial_features=True,
trigonometry_features=True
)
Description | Value | |
---|---|---|
0 | session_id | 42 |
1 | Target | critical_temp |
2 | Original Data | (21263, 83) |
3 | Missing Values | True |
4 | Numeric Features | 79 |
5 | Categorical Features | 2 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (17010, 154) |
10 | Transformed Test Set | (4253, 154) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 7f5e |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | mean |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | constant |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | True |
28 | Normalize Method | minmax |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Remove Perfect Collinearity | True |
43 | Clustering | False |
44 | Clustering Iteration | None |
45 | Polynomial Features | True |
46 | Polynomial Degree | 2 |
47 | Trignometry Features | True |
48 | Polynomial Threshold | 0.1 |
49 | Group Features | False |
50 | Feature Selection | False |
51 | Feature Selection Method | classic |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Transform Target | False |
57 | Transform Target Method | box-cox |
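As the summary above shows, the transformed train set now contains 154 features, thanks to the engineered polynomial & trigonometric features. To inspect this transformed feature space directly, we can pull out PyCaret's internal objects — a minimal sketch, assuming the get_config(...) utility exposed by pycaret.regression in this version:
from pycaret.regression import get_config
# Retrieve the transformed training features produced by setup(...)
X_train_transformed = get_config('X_train')
print(X_train_transformed.shape)  # should match the (17010, 154) reported above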
Now, similar to the previous section, we compare the various models available. The only change is that this time, while comparing, we select the top 3 models (based on RMSE). This can be done by setting the n_select parameter to 3 as follows:
top3 = compare_models(sort="RMSE", exclude=["lar", "rf", "et", "gbr", "ada"], fold=5, n_select=3)
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
lightgbm | Light Gradient Boosting Machine | 6.7801 | 116.7506 | 10.7993 | 0.9010 | 0.4985 | 9.4966 | 2.126 |
knn | K Neighbors Regressor | 7.4934 | 171.2640 | 13.0827 | 0.8547 | 0.4829 | 9.7451 | 3.210 |
dt | Decision Tree Regressor | 7.3110 | 194.2151 | 13.9218 | 0.8353 | 0.5210 | 6.8010 | 1.514 |
lr | Linear Regression | 12.9294 | 298.8115 | 17.2820 | 0.7463 | 0.8402 | 21.8982 | 0.528 |
br | Bayesian Ridge | 13.0206 | 300.7426 | 17.3384 | 0.7446 | 0.8478 | 19.7532 | 0.298 |
ridge | Ridge Regression | 13.1375 | 305.2784 | 17.4690 | 0.7408 | 0.8489 | 19.0016 | 0.052 |
huber | Huber Regressor | 13.3101 | 322.8834 | 17.9655 | 0.7258 | 0.8514 | 18.4292 | 1.310 |
omp | Orthogonal Matching Pursuit | 15.0008 | 373.2262 | 19.3157 | 0.6831 | 0.9263 | 23.1458 | 0.056 |
par | Passive Aggressive Regressor | 16.2831 | 450.3890 | 21.0913 | 0.6152 | 0.9943 | 26.3691 | 0.168 |
lasso | Lasso Regression | 17.1647 | 503.5471 | 22.4379 | 0.5726 | 0.9311 | 23.0181 | 0.082 |
en | Elastic Net | 18.9225 | 573.1404 | 23.9374 | 0.5136 | 1.0144 | 20.2454 | 0.060 |
llar | Lasso Least Angle Regression | 29.4259 | 1179.1006 | 34.3350 | -0.0007 | 1.4446 | 51.3135 | 0.058 |
for model in top3:
print(model)
print()
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2, weights='uniform')
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=42, splitter='best')
Previously, we were using a Random Grid Search to find the optimal hyperparameters for a model. This time, we will learn how to create a custom grid to search for the optimal parameters.
First, we look at the various hyperparameters of each model & then create a dict containing the ranges of the grid that we wish to fine-tune over. Later, we pass this dict into the tune_model(...) function via its custom_grid parameter.
Following are examples of some custom fine-tuning applied to each of our top 3 models.
# Fine-Tuning the Light GBM model
lgbm = top3[0]
lgbm_params = {
'num_leaves': np.arange(10,200,10),
'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
'learning_rate': np.arange(0.1,1,0.1)
}
tuned_lgbm = tune_model(lgbm, custom_grid = lgbm_params, fold=5)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 6.1500 | 111.8845 | 10.5775 | 0.9064 | 0.4487 | 5.9283 |
1 | 5.9389 | 98.9598 | 9.9479 | 0.9144 | 0.4426 | 12.5269 |
2 | 5.7831 | 90.1870 | 9.4967 | 0.9201 | 0.4397 | 10.1177 |
3 | 6.1927 | 110.0694 | 10.4914 | 0.9096 | 0.4556 | 12.1821 |
4 | 6.1456 | 104.1636 | 10.2061 | 0.9127 | 0.4446 | 1.1379 |
Mean | 6.0420 | 103.0529 | 10.1439 | 0.9126 | 0.4462 | 8.3786 |
SD | 0.1567 | 7.8835 | 0.3924 | 0.0046 | 0.0056 | 4.3158 |
# Fine-Tuning the K-Nearest Neighbors model
knn = top3[1]
knn_params = {
'n_neighbors': np.arange(2,6),
'p': np.arange(1,2),
'leaf_size': np.arange(10,60,10)
}
tuned_knn = tune_model(knn, custom_grid = knn_params, fold=5)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 6.2757 | 136.2753 | 11.6737 | 0.8860 | 0.4228 | 3.5938 |
1 | 6.2235 | 125.1762 | 11.1882 | 0.8918 | 0.4225 | 10.8804 |
2 | 6.1148 | 120.6007 | 10.9818 | 0.8931 | 0.4153 | 7.3990 |
3 | 6.2538 | 135.0941 | 11.6230 | 0.8890 | 0.4224 | 11.8967 |
4 | 6.2056 | 127.5158 | 11.2923 | 0.8932 | 0.4154 | 0.9532 |
Mean | 6.2147 | 128.9324 | 11.3518 | 0.8906 | 0.4197 | 6.9446 |
SD | 0.0555 | 5.9568 | 0.2624 | 0.0028 | 0.0036 | 4.1796 |
# Fine-Tuning the Decision Tree model
dt = top3[2]
dt_params = {
'min_samples_split': np.arange(2,12,1),
'max_features': ["auto", "sqrt", "log2"],
}
tuned_dt = tune_model(dt, custom_grid = dt_params, fold=5)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 7.6052 | 204.5219 | 14.3011 | 0.8289 | 0.5250 | 6.1723 |
1 | 6.9846 | 168.8165 | 12.9929 | 0.8540 | 0.4830 | 13.6954 |
2 | 7.1957 | 172.3882 | 13.1297 | 0.8473 | 0.5026 | 8.5512 |
3 | 7.3743 | 194.9335 | 13.9619 | 0.8399 | 0.5004 | 13.0748 |
4 | 7.0655 | 167.3674 | 12.9371 | 0.8598 | 0.4667 | 0.9862 |
Mean | 7.2451 | 181.6055 | 13.4645 | 0.8460 | 0.4956 | 8.4960 |
SD | 0.2231 | 15.1923 | 0.5586 | 0.0108 | 0.0197 | 4.6861 |
Now, we have the top 3 models fine-tuned using a custom grid. The optimal hyperparameters can be seen by printing the models:
print(tuned_lgbm, "\n")
print(tuned_knn, "\n")
print(tuned_dt)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=70, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=80, objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
KNeighborsRegressor(algorithm='auto', leaf_size=10, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=3, p=1, weights='uniform')
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='log2', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=10, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=42, splitter='best')
The following table compares the RMSE values of the 3 models before & after fine-tuning.
Model | Initial RMSE | Fine-Tuned RMSE |
---|---|---|
lightgbm | 10.7993 | 10.1439 |
knn | 13.0827 | 11.3518 |
dt | 13.9218 | 13.4645 |
Model ensembling is a machine learning technique that combines multiple models in the prediction process. You may be familiar with the terms bagging & boosting, which are typically used to improve the performance of decision-tree-based algorithms.
PyCaret enables bagging & boosting through the ensemble_model(...) function, which takes in a single trained model object along with the method ("Bagging" or "Boosting"). The number of estimators can be controlled using the n_estimators parameter (default: 10).
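For instance, a minimal sketch of bagging & boosting our fine-tuned decision tree could look as follows (illustrative only; this step is not part of our experiment's pipeline):
# Bagging & boosting the tuned decision tree with 10 estimators each (illustrative)
bagged_dt = ensemble_model(tuned_dt, method='Bagging', n_estimators=10)
boosted_dt = ensemble_model(tuned_dt, method='Boosting', n_estimators=10)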
However, PyCaret also allows us to combine multiple types of models to create custom ensembles. This technique is called Blending, which we will implement next.
Blending is an ensemble ML approach that combines the predictions of several contributing member models; for regression, PyCaret does this by averaging their predictions (note the Voting Regressor in the output further below). Blending in PyCaret is achieved using the blend_models(...) function, which is available only for the pycaret.regression & pycaret.classification modules. The model objects passed in the estimator_list parameter are used as the ensemble members.
We will try to blend our top 3 fine-tuned models to produce our final model.
blended_model = blend_models(estimator_list=[tuned_lgbm, tuned_knn, tuned_dt])
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 5.8042 | 112.3234 | 10.5983 | 0.9058 | 0.4150 | 8.6553 |
1 | 5.7807 | 105.5312 | 10.2728 | 0.9119 | 0.3966 | 0.9125 |
2 | 5.5910 | 95.3437 | 9.7644 | 0.9194 | 0.4005 | 6.5337 |
3 | 5.8788 | 106.0757 | 10.2993 | 0.9061 | 0.4224 | 8.2497 |
4 | 5.4651 | 90.7305 | 9.5253 | 0.9174 | 0.4192 | 15.6773 |
5 | 5.8054 | 100.1237 | 10.0062 | 0.9135 | 0.4042 | 0.7391 |
6 | 6.0748 | 120.6465 | 10.9839 | 0.9019 | 0.4022 | 1.0963 |
7 | 6.0080 | 109.8304 | 10.4800 | 0.9087 | 0.4321 | 27.5085 |
8 | 5.8795 | 109.2790 | 10.4537 | 0.9052 | 0.4121 | 0.7415 |
9 | 5.7296 | 92.2783 | 9.6062 | 0.9251 | 0.3888 | 1.2944 |
Mean | 5.8017 | 104.2163 | 10.1990 | 0.9115 | 0.4093 | 7.1408 |
SD | 0.1710 | 9.0436 | 0.4434 | 0.0070 | 0.0125 | 8.2647 |
We can now check the performance of this blended model on our held-out validation set to verify the performance metric values.
predict_model(blended_model);
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|---|
0 | Voting Regressor | 5.5852 | 92.7436 | 9.6303 | 0.9194 | 0.4059 | 1.2702 |
Now, we will predict the values for our unseen test dataset:
unseen_predictions = predict_model(blended_model, data=data_unseen)
unseen_predictions.head()
material | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | wtd_std_atomic_mass | mean_fie | wtd_mean_fie | gmean_fie | wtd_gmean_fie | entropy_fie | wtd_entropy_fie | range_fie | wtd_range_fie | std_fie | wtd_std_fie | mean_atomic_radius | wtd_mean_atomic_radius | gmean_atomic_radius | wtd_gmean_atomic_radius | entropy_atomic_radius | wtd_entropy_atomic_radius | range_atomic_radius | wtd_range_atomic_radius | std_atomic_radius | wtd_std_atomic_radius | mean_Density | wtd_mean_Density | gmean_Density | wtd_gmean_Density | entropy_Density | wtd_entropy_Density | range_Density | wtd_range_Density | ... | gmean_ElectronAffinity | wtd_gmean_ElectronAffinity | entropy_ElectronAffinity | wtd_entropy_ElectronAffinity | range_ElectronAffinity | wtd_range_ElectronAffinity | std_ElectronAffinity | wtd_std_ElectronAffinity | mean_FusionHeat | wtd_mean_FusionHeat | gmean_FusionHeat | wtd_gmean_FusionHeat | entropy_FusionHeat | wtd_entropy_FusionHeat | range_FusionHeat | wtd_range_FusionHeat | std_FusionHeat | wtd_std_FusionHeat | mean_ThermalConductivity | wtd_mean_ThermalConductivity | gmean_ThermalConductivity | wtd_gmean_ThermalConductivity | entropy_ThermalConductivity | wtd_entropy_ThermalConductivity | range_ThermalConductivity | wtd_range_ThermalConductivity | std_ThermalConductivity | wtd_std_ThermalConductivity | mean_Valence | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ge1Nb3 | 2.0 | 82.768190 | 87.837285 | 82.144935 | 87.360109 | 0.685627 | 0.509575 | 20.27638 | 51.522285 | 10.138190 | 8.779930 | 711.80 | 687.700000 | 710.166178 | 686.488365 | 0.690853 | 0.589409 | 96.4 | 307.700000 | 48.200000 | 41.742424 | 161.500000 | 179.750000 | 157.321327 | 176.492557 | 0.667386 | 0.461943 | 73.0 | 117.250000 | 36.500000 | 31.609927 | 6946.500000 | 7758.250000 | 6754.118003 | 7608.074085 | 0.665582 | NaN | 3247.000 | 5096.750000 | ... | 102.741423 | 94.869113 | 0.680597 | 0.622557 | 32.90 | 35.575000 | 16.450000 | 14.246118 | 29.3000 | 28.050000 | 29.193150 | 27.970992 | 0.689503 | 0.596157 | 5.000 | 12.150000 | 2.500000 | 2.165064 | 57.000000 | 55.500000 | 56.920998 | 55.441265 | 0.691761 | 0.583527 | 6.00000 | 25.500000 | 3.000000 | 2.598076 | 4.50 | 4.750000 | 4.472136 | 4.728708 | 0.686962 | 0.514653 | 1.0 | 2.750000 | 0.500000 | 0.433013 | 6.4 | 12.279531 |
1 | Y1Ba2Cu3O | 4.0 | 76.444563 | 81.456750 | 59.356672 | 68.229617 | 1.199541 | 1.108189 | 121.32760 | 36.950657 | 43.823354 | 40.612293 | 794.00 | 738.357143 | 741.629349 | 702.424197 | 1.315004 | 1.282439 | 810.6 | 231.371429 | 311.743492 | 255.465227 | 164.500000 | 171.571429 | 139.000514 | 153.255987 | 1.256701 | 1.166832 | 205.0 | 65.428571 | 77.525802 | 67.911407 | 4235.857250 | 5481.918429 | 669.556588 | 1780.077447 | 1.015407 | 0.810985 | 8958.571 | 3839.795857 | ... | 53.527965 | 56.435929 | 1.105182 | 0.953371 | 127.05 | 46.971429 | 54.830755 | 52.650295 | 8.1805 | 9.560286 | 4.035569 | 6.229663 | 1.112098 | 0.975148 | 12.878 | 5.582571 | 4.948155 | 4.359650 | 108.756645 | 179.003797 | 7.552385 | 26.578636 | 0.336262 | 0.201966 | 399.97342 | 171.424774 | 168.301047 | NaN | 2.25 | 2.142857 | 2.213364 | 2.119268 | 1.368922 | 1.309526 | 1.0 | 0.571429 | 0.433013 | 0.349927 | 91.2 | 77.990958 |
2 | Y1Ba1.5La0.5Cu3O7.08 | 5.0 | 88.936744 | 51.090431 | 70.358975 | 34.783991 | 1.445824 | 1.525092 | 122.90607 | 10.438667 | 46.482335 | 44.261233 | 743.42 | 1006.991437 | 696.313849 | 942.154532 | 1.538527 | 0.933542 | 810.6 | 690.076300 | 296.615154 | 340.103711 | 170.600000 | 111.914373 | 148.737352 | 88.458069 | 1.506998 | 1.509873 | 205.0 | 25.802752 | 70.406250 | 75.991199 | 4617.885800 | 3035.177165 | NaN | 66.205693 | 1.324548 | 0.978095 | 8958.571 | 2054.272376 | ... | 52.696967 | 90.754610 | 1.354052 | 0.815880 | 127.05 | 75.361239 | 50.281492 | 47.273696 | 7.8044 | 5.154569 | 4.411557 | 1.310299 | 1.374615 | 1.153108 | 12.878 | 2.884422 | 4.489231 | 5.622243 | 89.605316 | 95.618363 | 8.418912 | 1.058389 | 0.457810 | 0.209583 | 399.97342 | 91.728732 | 155.329609 | 166.192923 | 2.40 | 2.114679 | 2.352158 | 2.095193 | 1.589027 | 1.314189 | 1.0 | 0.967890 | 0.489898 | 0.318634 | 38.0 | 33.854053 |
3 | La1.76Sr0.24Cu1O4 | 4.0 | 76.517718 | 56.149432 | 59.310096 | 35.562124 | 1.197273 | 1.042132 | 122.90607 | 31.920690 | 44.289459 | 51.815571 | 787.05 | 1011.642286 | 734.219624 | 940.469590 | 1.313008 | 0.802473 | NaN | 731.520000 | 314.505966 | 353.685876 | 151.750000 | 104.680000 | 131.302197 | 84.236452 | 1.275274 | 1.215773 | 171.0 | 41.520000 | 65.579627 | 67.580981 | 4434.357250 | 2916.268000 | 674.484751 | 52.847119 | 0.995983 | 0.807732 | 8958.571 | 1544.463429 | ... | 48.477265 | 95.882106 | 1.096672 | 0.773969 | 135.97 | 81.204686 | 54.373097 | 43.628244 | 6.9055 | 3.856571 | 3.479475 | 1.042408 | 1.088575 | 1.016669 | 12.878 | 1.744571 | 4.599064 | 4.673780 | 112.006645 | 61.626617 | 8.339818 | 0.637507 | 0.403693 | 0.304547 | 399.97342 | 57.127669 | 166.742351 | 138.361103 | 2.25 | 2.251429 | 2.213364 | NaN | 1.368922 | 1.078855 | 1.0 | 1.074286 | 0.433013 | 0.433834 | 19.0 | 22.574612 |
4 | La0.94Mo6Se8 | 3.0 | 104.608490 | 89.558979 | 101.719818 | 88.481210 | 1.070258 | 0.944284 | 59.94547 | 33.541423 | 25.225148 | 15.159829 | 722.10 | 812.626104 | 703.695276 | 799.630186 | 1.072790 | 0.796138 | 399.3 | 469.515797 | 165.133461 | 141.219887 | 162.666667 | 143.728246 | 156.269832 | 137.110827 | 1.062146 | 0.913764 | 92.0 | 64.036145 | 42.240055 | 43.743683 | 7081.666667 | 7095.665328 | 6727.404359 | 6633.555886 | 1.046587 | 0.841524 | 5461.000 | 3741.817938 | ... | 89.376074 | 121.321040 | 0.925747 | 0.621554 | 147.00 | 102.106426 | 64.406539 | 63.255476 | 15.9000 | 17.745783 | 10.699060 | 11.681349 | 0.726384 | 0.547497 | 30.600 | 14.061446 | 14.217595 | 14.955962 | 50.840000 | 56.919679 | 9.794610 | 6.007072 | 0.313841 | 0.106068 | 138.48000 | 55.544846 | 62.546392 | 67.307993 | 5.00 | 5.811245 | 4.762203 | 5.743954 | 1.054920 | 0.803990 | 3.0 | 3.024096 | 1.414214 | 0.728448 | 11.0 | 8.882301 |
5 rows × 84 columns
print("R2:\t", check_metric(unseen_predictions.critical_temp, unseen_predictions.Label, 'R2'))
print("RMSE:\t", np.round(np.sqrt(check_metric(unseen_predictions.critical_temp, unseen_predictions.Label, 'MSE')), 4))
R2: 0.9225 RMSE: 9.428
Clearly, this blended model, trained with some extra preprocessing & feature engineering on our initial dataset, performs better than the model we trained in the Basic PyCaret section (R2 = 0.9157, RMSE = 9.8366).
We can now save our blended regression model:
save_model(blended_model, "blended_expt2")
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=['material'], id_columns=[], ml_usecase='regression', numerical_features=[], target='critical_temp', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None,... weights='uniform')), ('dt', DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='log2', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=10, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=42, splitter='best'))], n_jobs=-1, verbose=False, weights=None)]], verbose=False), 'blended_expt2.pkl')
This concludes our tutorial on PyCaret. Over the course of this tutorial, we have seen how easy it is to develop end-to-end ML pipelines using a low-code framework like PyCaret.