We are into the 3rd week of the Fundamentals of MLOps: A Hands-On Approach, where we will learn how to efficiently create & experiment with ML pipelines using a low-code framework called PyCaret.
This tutorial is intended to familiarize you with some of the functionality offered by PyCaret using a regression example (we will use the `pycaret.regression` module in this tutorial). Most of the functions (with a few parameter tweaks) can be extended to other modules as well (you will explore the `pycaret.classification` module in this week's assignment).
The first step is to install PyCaret, which can be done as follows:
!pip install pycaret
(Output truncated: pip downloads pycaret-2.3.2 & its dependencies, builds the required wheels, flags a few dependency-version conflicts, & finishes with "Successfully installed ... pycaret-2.3.2" along with scikit-learn-0.23.2, lightgbm-3.2.1, mlflow-1.18.0, pandas-profiling-3.0.0, etc.)
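Note: the log above shows pycaret 2.3.2 being installed. Since PyCaret's API evolves across releases, pinning this version (an optional tweak to the command above) makes it easier to reproduce the tutorial environment later:
!pip install pycaret==2.3.2 # optional: pin the exact version used in this tutorial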
For this tutorial, we will use a dataset of superconductive materials. You can download the data (`material_superconductivity.csv`) for this tutorial from here and load it using `pandas`.
Note: It is a slightly modified version of the original Superconductivity Data Set.
The goal of this tutorial is to develop an ML model which can predict the critical temperature (`critical_temp`) of a superconductor, given a set of various properties of the superconducting material.
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Datasets/Superconductivity/material_superconductivity.csv")
Since we do not have an explicit test dataset for our trained model to predict on, we will use 10% of the total datapoints as our unseen test dataset, while the remaining 90% will be used for training & validation.
data_unseen = df.sample(frac=0.1, random_state=42) # Sample 10% of the data to become the unseen test set
df = df.drop(data_unseen.index) # Use the remaining 90% as the training (& validation) data
df.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)
print('Data for Model Training & Validation: ' + str(df.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Model Training & Validation: (19137, 83)
Unseen Data For Predictions: (2126, 83)
df.head()
material | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | wtd_std_atomic_mass | mean_fie | wtd_mean_fie | gmean_fie | wtd_gmean_fie | entropy_fie | wtd_entropy_fie | range_fie | wtd_range_fie | std_fie | wtd_std_fie | mean_atomic_radius | wtd_mean_atomic_radius | gmean_atomic_radius | wtd_gmean_atomic_radius | entropy_atomic_radius | wtd_entropy_atomic_radius | range_atomic_radius | wtd_range_atomic_radius | std_atomic_radius | wtd_std_atomic_radius | mean_Density | wtd_mean_Density | gmean_Density | wtd_gmean_Density | entropy_Density | wtd_entropy_Density | range_Density | wtd_range_Density | ... | wtd_mean_ElectronAffinity | gmean_ElectronAffinity | wtd_gmean_ElectronAffinity | entropy_ElectronAffinity | wtd_entropy_ElectronAffinity | range_ElectronAffinity | wtd_range_ElectronAffinity | std_ElectronAffinity | wtd_std_ElectronAffinity | mean_FusionHeat | wtd_mean_FusionHeat | gmean_FusionHeat | wtd_gmean_FusionHeat | entropy_FusionHeat | wtd_entropy_FusionHeat | range_FusionHeat | wtd_range_FusionHeat | std_FusionHeat | wtd_std_FusionHeat | mean_ThermalConductivity | wtd_mean_ThermalConductivity | gmean_ThermalConductivity | wtd_gmean_ThermalConductivity | entropy_ThermalConductivity | wtd_entropy_ThermalConductivity | range_ThermalConductivity | wtd_range_ThermalConductivity | std_ThermalConductivity | wtd_std_ThermalConductivity | mean_Valence | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ba0.2La1.8Cu1O4 | 4.0 | 88.944468 | 57.862692 | 66.361592 | 36.116612 | 1.181795 | 1.062396 | 122.90607 | 31.794921 | 51.968828 | 53.622535 | 775.425 | 1010.268571 | 718.152900 | 938.016780 | 1.305967 | 0.791488 | 810.6 | 735.985714 | 323.811808 | 355.562967 | 160.25 | 105.514286 | 136.126003 | 84.528423 | 1.259244 | 1.207040 | 205.0 | 42.914286 | 75.237540 | 69.235569 | 4654.35725 | 2961.502286 | 724.953211 | 53.543811 | 1.033129 | 0.814598 | 8958.571 | 1579.583429 | ... | 111.727143 | 60.123179 | 99.414682 | 1.159687 | 0.787382 | 127.05 | 80.987143 | 51.433712 | 42.558396 | 6.9055 | 3.846857 | 3.479475 | 1.040986 | 1.088575 | 0.994998 | 12.878 | 1.744571 | 4.599064 | 4.666920 | 107.756645 | 61.015189 | 7.062488 | 0.621979 | 0.308148 | 0.262848 | 399.97342 | 57.127669 | 168.854244 | 138.517163 | 2.25 | 2.257143 | 2.213364 | 2.219783 | 1.368922 | 1.066221 | 1.0 | 1.085714 | 0.433013 | 0.437059 | 29.0 |
1 | Ba0.1La1.9Ag0.1Cu0.9O4 | 5.0 | 92.729214 | 58.518416 | 73.132787 | 36.396602 | 1.449309 | 1.057755 | 122.90607 | 36.161939 | 47.094633 | 53.979870 | 766.440 | 1010.612857 | 720.605511 | 938.745413 | 1.544145 | 0.807078 | 810.6 | 743.164286 | 290.183029 | 354.963511 | 161.20 | 104.971429 | 141.465215 | 84.370167 | 1.508328 | 1.204115 | 205.0 | 50.571429 | 67.321319 | 68.008817 | 5821.48580 | 3021.016571 | 1237.095080 | 54.095718 | 1.314442 | 0.914802 | 10488.571 | 1667.383429 | ... | 112.316429 | 69.833315 | 101.166398 | 1.427997 | 0.838666 | 127.05 | 81.207857 | 49.438167 | 41.667621 | 7.7844 | 3.796857 | 4.403790 | 1.035251 | 1.374977 | 1.073094 | 12.878 | 1.595714 | 4.473363 | 4.603000 | 172.205316 | 61.372331 | 16.064228 | 0.619735 | 0.847404 | 0.567706 | 429.97342 | 51.413383 | 198.554600 | 139.630922 | 2.00 | 2.257143 | 1.888175 | 2.210679 | 1.557113 | 1.047221 | 2.0 | 1.128571 | 0.632456 | 0.468606 | 26.0 |
2 | Ba0.1La1.9Cu1O4 | 4.0 | 88.944468 | 57.885242 | 66.361592 | 36.122509 | 1.181795 | 0.975980 | 122.90607 | 35.741099 | 51.968828 | 53.656268 | 775.425 | 1010.820000 | 718.152900 | 939.009036 | 1.305967 | 0.773620 | 810.6 | 743.164286 | 323.811808 | 354.804183 | 160.25 | 104.685714 | 136.126003 | 84.214573 | 1.259244 | 1.132547 | 205.0 | 49.314286 | 75.237540 | 67.797712 | 4654.35725 | 2999.159429 | 724.953211 | 53.974022 | 1.033129 | 0.760305 | 8958.571 | 1667.383429 | ... | 112.213571 | 60.123179 | 101.082152 | 1.159687 | 0.786007 | 127.05 | 81.207857 | 51.433712 | 41.639878 | NaN | 3.822571 | 3.479475 | 1.037439 | 1.088575 | 0.927479 | 12.878 | 1.757143 | 4.599064 | 4.649635 | 107.756645 | 60.943760 | 7.062488 | 0.619095 | 0.308148 | 0.250477 | 399.97342 | 57.127669 | 168.854244 | 138.540613 | 2.25 | 2.271429 | 2.213364 | 2.232679 | 1.368922 | 1.029175 | 1.0 | 1.114286 | 0.433013 | 0.444697 | 19.0 |
3 | Ba0.15La1.85Cu1O4 | 4.0 | 88.944468 | 57.873967 | 66.361592 | 36.119560 | 1.181795 | 1.022291 | 122.90607 | 33.768010 | 51.968828 | 53.639405 | 775.425 | 1010.544286 | 718.152900 | 938.512777 | 1.305967 | 0.783207 | 810.6 | 739.575000 | 323.811808 | 355.183884 | 160.25 | 105.100000 | 136.126003 | 84.371352 | 1.259244 | 1.173033 | 205.0 | 46.114286 | 75.237540 | 68.521665 | 4654.35725 | 2980.330857 | 724.953211 | 53.758486 | 1.033129 | 0.788889 | 8958.571 | 1623.483429 | ... | 111.970357 | 60.123179 | 100.244950 | 1.159687 | NaN | 127.05 | 81.097500 | 51.433712 | 42.102344 | 6.9055 | 3.834714 | 3.479475 | 1.039211 | 1.088575 | 0.964031 | 12.878 | 1.744571 | 4.599064 | 4.658301 | 107.756645 | 60.979474 | 7.062488 | 0.620535 | 0.308148 | 0.257045 | 399.97342 | 57.127669 | 168.854244 | 138.528893 | 2.25 | 2.264286 | 2.213364 | 2.226222 | 1.368922 | 1.048834 | 1.0 | 1.100000 | 0.433013 | 0.440952 | 22.0 |
4 | Ba0.3La1.7Cu1O4 | 4.0 | 88.944468 | 57.840143 | 66.361592 | 36.110716 | 1.181795 | 1.129224 | 122.90607 | 27.848743 | 51.968828 | 53.588771 | 775.425 | 1009.717143 | 718.152900 | 937.025573 | 1.305967 | 0.805230 | 810.6 | 728.807143 | 323.811808 | 356.319281 | 160.25 | 106.342857 | 136.126003 | 84.843442 | 1.259244 | 1.261194 | 205.0 | 36.514286 | 75.237540 | 70.634448 | 4654.35725 | 2923.845143 | 724.953211 | 53.117029 | 1.033129 | 0.859811 | 8958.571 | 1491.783429 | ... | 111.240714 | 60.123179 | 97.774719 | 1.159687 | 0.787396 | 127.05 | 80.766429 | 51.433712 | 43.452059 | 6.9055 | 3.871143 | 3.479475 | 1.044545 | 1.088575 | 1.044970 | 12.878 | 1.744571 | 4.599064 | 4.684014 | 107.756645 | 61.086617 | 7.062488 | 0.624878 | 0.308148 | 0.272820 | 399.97342 | 57.127669 | 168.854244 | 138.493671 | 2.25 | 2.242857 | 2.213364 | 2.206963 | 1.368922 | 1.096052 | 1.0 | 1.057143 | 0.433013 | 0.428809 | 23.0 |
5 rows × 83 columns
Note: Here, we are using an external dataset & have loaded it using `pandas`. PyCaret also has a repository of datasets that can be used for model training & experimentation across various tasks. These datasets can be loaded as a `pandas` DataFrame using:
from pycaret.datasets import get_data # PyCaret's built-in dataset loader
dataset_name = "name of pycaret dataset" # Eg: "ipl", "bike", etc.
data = get_data(dataset_name)
This section intends to familiarize you with the building blocks of PyCaret that are used to develop a simple end-to-end ML pipeline. In this section, we will look at the following: setting up the environment, comparing models, creating & tuning a model, analyzing its performance, making predictions on held-out & unseen data, & saving/loading the trained pipeline.
We need to configure the environment before we begin any machine learning experiment in PyCaret. Depending on the sort of experiment we wish to run, one of the six presently supported modules must be loaded into our Python environment.
To begin with, we import all the functions from the `pycaret.regression` module.
from pycaret.regression import *
Now, we use the `setup(...)` function to initialize the environment of our ML experiment. It is the first & only mandatory step to begin any ML experiment, and is common to all 6 modules.
Apart from defining the dataset (using `data`) & the target variable to be predicted (using `target`; this holds for all modules except clustering & anomaly detection), the `setup(...)` function offers a wide range of functionalities. In this Basic PyCaret section, we will look at the following functionalities:
- `session_id` controls the randomness of the experiment. If set to a non-`None` value (say, `42`, as in our case), it can be used for later reproducibility of the entire experiment.
- `train_size` determines the fraction of the dataset used for training our models during the experiment. By default, 0.7 (or 70%) of the dataset is used for training, while the remainder is held out for validation in the end.
- PyCaret automatically infers the data type of each feature. A table of the inferred types is displayed when `setup(...)` is executed, & allows you to confirm them. However, one can override these by typing "quit" in the textbox & later specifying the respective column names as arguments for `categorical_features` & `numeric_features` (we don't need these for our case, though a sketch follows after this list). Columns that need not be considered for the ML task can be excluded by specifying them as `ignore_features` (here, since the `material` column contains unique values for each instance & doesn't aid in the task, we can ignore it).
- The imputation strategy for missing values in categorical features is specified using `categorical_imputation` (`'constant'` by default), while that for numeric features is specified using `numeric_imputation` (`'mean'` by default).

There are a bunch of other data preprocessing & transformation steps that can be orchestrated into our pipeline using the `setup(...)` function. We will explore them in the Intermediate PyCaret section.
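For reference, a hedged sketch of overriding the inferred types (the column names col_a & col_b here are hypothetical; our dataset doesn't need this override):
expt_override = setup(
    data = df,
    target = 'critical_temp',
    categorical_features = ['col_a'], # hypothetical column to force as categorical
    numeric_features = ['col_b'],     # hypothetical column to force as numeric
)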
expt_basic = setup(
data = df,
target = 'critical_temp',
session_id=42, # Random seed to ensure reproducibility of the experiment with the same data
train_size=0.8, # 80% training data & 20% held-out validation data
ignore_features=["material"],
numeric_imputation="median", # "mean" by default
categorical_imputation="mode", # "constant" (not_available) by default
)
Description | Value | |
---|---|---|
0 | session_id | 42 |
1 | Target | critical_temp |
2 | Original Data | (21263, 83) |
3 | Missing Values | True |
4 | Numeric Features | 79 |
5 | Categorical Features | 2 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (17010, 95) |
10 | Transformed Test Set | (4253, 95) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | a8ad |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | median |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | mode |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | False |
28 | Normalize Method | None |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Remove Perfect Collinearity | True |
43 | Clustering | False |
44 | Clustering Iteration | None |
45 | Polynomial Features | False |
46 | Polynomial Degree | None |
47 | Trignometry Features | False |
48 | Polynomial Threshold | None |
49 | Group Features | False |
50 | Feature Selection | False |
51 | Feature Selection Method | classic |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Transform Target | False |
57 | Transform Target Method | box-cox |
This completes the environment setup for our experiment, with all the necessary preprocessing, transformation & feature engineering steps piped together, ready to be applied to our dataset for training our ML models. All the details of our environment are displayed in the table above.
We are now set to experience the true power of low-code machine learning, where we train & compare a multitude of models with just a single line of code.
But prior to that, let us inspect all the algorithms that PyCaret's `regression` module offers us. This can be done using `models()`, as shown below. The output is a table of the models available in the model library for the particular module, along with a reference to the actual underlying implementation.
The Turbo column indicates the algorithms that usually run in a shorter duration (Turbo is `True` for these) & are chosen by default for the comparison.
Note: `models()` can be used with the other modules as well to inspect the various algorithms available for those tasks.
models()
Name | Reference | Turbo | |
---|---|---|---|
ID | |||
lr | Linear Regression | sklearn.linear_model._base.LinearRegression | True |
lasso | Lasso Regression | sklearn.linear_model._coordinate_descent.Lasso | True |
ridge | Ridge Regression | sklearn.linear_model._ridge.Ridge | True |
en | Elastic Net | sklearn.linear_model._coordinate_descent.Elast... | True |
lar | Least Angle Regression | sklearn.linear_model._least_angle.Lars | True |
llar | Lasso Least Angle Regression | sklearn.linear_model._least_angle.LassoLars | True |
omp | Orthogonal Matching Pursuit | sklearn.linear_model._omp.OrthogonalMatchingPu... | True |
br | Bayesian Ridge | sklearn.linear_model._bayes.BayesianRidge | True |
ard | Automatic Relevance Determination | sklearn.linear_model._bayes.ARDRegression | False |
par | Passive Aggressive Regressor | sklearn.linear_model._passive_aggressive.Passi... | True |
ransac | Random Sample Consensus | sklearn.linear_model._ransac.RANSACRegressor | False |
tr | TheilSen Regressor | sklearn.linear_model._theil_sen.TheilSenRegressor | False |
huber | Huber Regressor | sklearn.linear_model._huber.HuberRegressor | True |
kr | Kernel Ridge | sklearn.kernel_ridge.KernelRidge | False |
svm | Support Vector Regression | sklearn.svm._classes.SVR | False |
knn | K Neighbors Regressor | sklearn.neighbors._regression.KNeighborsRegressor | True |
dt | Decision Tree Regressor | sklearn.tree._classes.DecisionTreeRegressor | True |
rf | Random Forest Regressor | sklearn.ensemble._forest.RandomForestRegressor | True |
et | Extra Trees Regressor | sklearn.ensemble._forest.ExtraTreesRegressor | True |
ada | AdaBoost Regressor | sklearn.ensemble._weight_boosting.AdaBoostRegr... | True |
gbr | Gradient Boosting Regressor | sklearn.ensemble._gb.GradientBoostingRegressor | True |
mlp | MLP Regressor | sklearn.neural_network._multilayer_perceptron.... | False |
lightgbm | Light Gradient Boosting Machine | lightgbm.sklearn.LGBMRegressor | True |
With this background, we can proceed to compare the various algorithms & evaluate their performance before zeroing in on the final model that we will eventually use.
The `compare_models()` function (just 2 words!) trains and evaluates the performance of all estimators available in the model library using k-fold cross-validation (on the training split of the dataset). The output is a score grid that shows the average metric scores across the `k` folds of validation, along with the training time.
In our case below, this simple function (with some parameters, of course, which we will discuss next) is able to train & evaluate 10+ algorithms without the user having to write code for any specific algorithm!
Following is an explanation of the parameters that `compare_models()` can take:
- `sort`: By default, the output grid is sorted in descending order of the R2 metric (higher R2 is better). This can be changed to any other metric by specifying it as `sort`. In our case, we sort the grid by RMSE (lower RMSE is better).
- `include`: This is used to explicitly list the algorithms that we wish to train & compare.
- `exclude`: This is used to list the algorithms that we do not wish to train or compare. Here, we have excluded 5 algorithms because they took a lot of time & did not give good results.
- `fold`: This is basically the `k` in k-fold cross-validation. By default, `k=10`, but it can be changed as per convenience.
- `turbo`: We saw previously that some algorithms have Turbo = `True` (these are usually the faster ones). By default, `turbo=True`, which includes only these faster algorithms in the comparison & evaluation. However, it can be set to `False` to include all the other models as well.

`compare_models()` returns the single best-performing model based on the `sort` order, unless we specify the number of top-performing models to be returned using the `n_select` parameter (we will use this in the Intermediate PyCaret section; a quick sketch follows below).
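A minimal sketch of `n_select` (the value 3 is purely illustrative):
top3 = compare_models(sort="RMSE", n_select=3) # returns a list of the 3 best models (lowest RMSE first)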
best = compare_models(sort="RMSE", exclude=["lar", "rf", "et", "gbr", "ada"], fold=5)
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
lightgbm | Light Gradient Boosting Machine | 6.7053 | 113.4606 | 10.6463 | 0.9037 | 0.4989 | 9.0818 | 1.208 |
knn | K Neighbors Regressor | 7.5609 | 175.0485 | 13.2177 | 0.8516 | 0.5174 | 8.3847 | 0.452 |
dt | Decision Tree Regressor | 7.1429 | 187.9618 | 13.6975 | 0.8405 | 0.5165 | 7.6383 | 0.906 |
ridge | Ridge Regression | 14.2936 | 347.1825 | 18.6294 | 0.7052 | 0.8754 | 14.6673 | 0.026 |
br | Bayesian Ridge | 14.2983 | 347.3928 | 18.6350 | 0.7050 | 0.8738 | 14.2653 | 0.164 |
lr | Linear Regression | 14.3182 | 348.1571 | 18.6556 | 0.7043 | 0.8793 | 14.6963 | 0.474 |
en | Elastic Net | 14.9925 | 376.1511 | 19.3914 | 0.6806 | 0.9120 | 20.0117 | 0.274 |
lasso | Lasso Regression | 15.0233 | 377.6577 | 19.4303 | 0.6793 | 0.9110 | 20.1200 | 0.268 |
omp | Orthogonal Matching Pursuit | 16.5008 | 443.5972 | 21.0579 | 0.6234 | 0.9436 | 26.2129 | 0.028 |
huber | Huber Regressor | 17.4946 | 519.1853 | 22.7831 | 0.5592 | 0.9496 | 24.5486 | 0.782 |
llar | Lasso Least Angle Regression | 29.4259 | 1179.1006 | 34.3350 | -0.0007 | 1.4446 | 51.3135 | 0.046 |
par | Passive Aggressive Regressor | 29.7797 | 1391.9471 | 35.9650 | -0.1920 | 1.4002 | 36.7929 | 0.184 |
Once we have performed a comparative analysis of the various available models, we can choose the best-performing algorithm & train it on our dataset. This can be done using `create_model(...)`, which is perhaps the most granular function in PyCaret.
Given the ID of the model to be created (here, `"lightgbm"`), this function trains & evaluates the corresponding model for us. By default, it uses 10-fold cross-validation. The number of folds can be changed with the `fold` parameter, & cross-validation can be avoided altogether by setting `cross_validation=False`.
If needed, other model-specific parameters can also be set while calling this function. These parameters can be found in the reference of the corresponding model (for example: `max_depth` for xgboost, `learning_rate` for lightgbm, etc.), as sketched below.
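For instance, a minimal sketch (the hyperparameter values below are illustrative, not recommendations):
# Model-specific hyperparameters are passed as keyword arguments to create_model()
lgbm_custom = create_model("lightgbm", fold=5, learning_rate=0.05, num_leaves=50)
# dt_plain = create_model("dt", cross_validation=False) # train once on the full training split, no CV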
lgbm = create_model("lightgbm", fold=10)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 6.8344 | 126.3321 | 11.2398 | 0.8940 | 0.5226 | 13.8306 |
1 | 6.7972 | 113.1294 | 10.6362 | 0.9056 | 0.4874 | 1.2830 |
2 | 6.5400 | 104.4271 | 10.2190 | 0.9117 | 0.4969 | 10.6841 |
3 | 6.6852 | 121.6496 | 11.0295 | 0.8923 | 0.5055 | 12.6763 |
4 | 6.3071 | 100.5768 | 10.0288 | 0.9085 | 0.4885 | 23.3895 |
5 | 6.7048 | 105.6652 | 10.2794 | 0.9087 | 0.4920 | 1.1607 |
6 | 6.8755 | 129.7889 | 11.3925 | 0.8944 | 0.4816 | 1.3610 |
7 | 6.7249 | 106.4520 | 10.3176 | 0.9115 | 0.5337 | 23.7443 |
8 | 6.7539 | 115.1639 | 10.7314 | 0.9001 | 0.4960 | 0.9899 |
9 | 6.6179 | 101.2943 | 10.0645 | 0.9178 | 0.4927 | 1.7183 |
Mean | 6.6841 | 112.4479 | 10.5939 | 0.9045 | 0.4997 | 9.0838 |
SD | 0.1569 | 9.9880 | 0.4670 | 0.0083 | 0.0157 | 8.7213 |
When a model is built with the `create_model()` function, it is trained using the default hyperparameters. The `tune_model()` function is then used to tune the hyperparameters of this created model so that a particular metric is specifically optimized. It performs a random grid search over pre-defined (but fully configurable) grids for the model provided as the estimator (for now, we will use the random grid & explore the customization in the next section; a quick preview follows below).
Again, tuning is done using k-fold cross-validation, where `k` can be set using the `fold` parameter. The metric with respect to which the hyperparameters are optimized is specified using `optimize`.
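Beyond `fold` & `optimize`, `tune_model` also accepts `n_iter` (the number of random grid iterations, 10 by default) & `custom_grid` (a user-supplied search space). A hedged sketch with illustrative values:
tuned_sketch = tune_model(
    lgbm,
    optimize="RMSE",
    n_iter=20, # more random grid iterations than the default 10
    custom_grid={"num_leaves": [20, 30, 40], "learning_rate": [0.05, 0.1, 0.2]}, # illustrative search space
)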
tuned_lgbm = tune_model(lgbm, fold=5, optimize="RMSE")
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 6.4884 | 113.3398 | 10.6461 | 0.9052 | 0.4981 | 6.6217 |
1 | 6.3793 | 107.5357 | 10.3699 | 0.9070 | 0.4996 | 10.4792 |
2 | 6.2171 | 96.1899 | 9.8076 | 0.9148 | 0.4796 | 11.1850 |
3 | 6.5189 | 112.1177 | 10.5886 | 0.9079 | 0.4837 | 13.3632 |
4 | 6.4482 | 104.1701 | 10.2064 | 0.9127 | 0.4836 | 1.3184 |
Mean | 6.4104 | 106.6707 | 10.3237 | 0.9095 | 0.4889 | 8.5935 |
SD | 0.1074 | 6.1805 | 0.3021 | 0.0036 | 0.0082 | 4.2388 |
As you can see, the RMSE value has improved from 10.5939 to 10.3237 after fine-tuning.
We can print out both the untuned & tuned LightGBM models to check the difference in hyperparameters.
print(lgbm, "\n")
print(tuned_lgbm)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

LGBMRegressor(bagging_fraction=0.8, bagging_freq=3, boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, feature_fraction=0.8, importance_type='split', learning_rate=0.2, max_depth=-1, min_child_samples=6, min_child_weight=0.001, min_split_gain=0.6, n_estimators=100, n_jobs=-1, num_leaves=30, objective=None, random_state=42, reg_alpha=0.001, reg_lambda=5, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
Analyzing the performance of a trained ML model is an essential part of any ML workflow. In PyCaret, this can simply be done with the `plot_model(...)` function. The function accepts a trained model object and the kind of plot (the `plot` parameter) as a string. There are several kinds of plots one can generate, depending on the PyCaret module they are working with. Details about all these plot types can be found in the documentation.
For our regression task, we plot the residuals, the learning curve (to understand how the performance has evolved with training) & the relative feature importances in our trained model.
plot_model(tuned_lgbm, plot="residuals")
plot_model(tuned_lgbm, plot="learning")
plot_model(tuned_lgbm, plot="feature") # Top 10 most important features
# plot_model(tuned_lgbm, plot="feature_all") # All features used in the model
Another technique for evaluating model performance is the `evaluate_model()` function, which provides a user interface for all the plots available for a particular model (it makes use of the `plot_model()` function internally).
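For example, calling it on our tuned model (this renders an interactive widget in the notebook):
evaluate_model(tuned_lgbm) # UI with one tab per available plot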
If you recall, while setting up our environment, we explicitly split our data 80-20 into training & validation data. So far, the training & fine-tuning has been done on the training portion of that data (17010 samples).
Now that we've fine-tuned our model, it is a good idea to predict results on our held-out validation dataset (4253 samples) & check the performance metrics on this set.
For this, we use the `predict_model(...)` function & pass it our trained model object. Since we do not specify any dataset explicitly, the predictions are made on the validation dataset that was held out at the time of setting up our environment.
predict_model(tuned_lgbm);
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|---|
0 | Light Gradient Boosting Machine | 6.2684 | 96.4005 | 9.8184 | 0.9163 | 0.4909 | 2.1346 |
We can compare these values with the cross-validation results of our fine-tuned model. Note that, while comparing, we should take into account the standard deviation of the metrics across the folds of our fine-tuned model.
Once we have trained our model in this experiment, we need to save it so that it can be used in the future for making predictions.
A model can be saved easily using the `save_model(...)` function, which takes in the model object & the file name. The model, along with its entire transformation pipeline (preprocessing, etc., to be applied to the raw dataset), is saved as a `.pkl` (pickle) file.
save_model(tuned_lgbm,'lgbm_expt1')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=['material'], id_columns=[], ml_usecase='regression', numerical_features=[], target='critical_temp', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='most frequent', fill_value_categorical=None, fill_value_numerical=None, n... boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, feature_fraction=0.8, importance_type='split', learning_rate=0.2, max_depth=-1, min_child_samples=6, min_child_weight=0.001, min_split_gain=0.6, n_estimators=100, n_jobs=-1, num_leaves=30, objective=None, random_state=42, reg_alpha=0.001, reg_lambda=5, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)]], verbose=False), 'lgbm_expt1.pkl')
Any saved model can be loaded in the same or an alternate environment using the `load_model(...)` function by passing in the file name.
saved_model = load_model('lgbm_expt1')
Transformation Pipeline and Model Successfully Loaded
Now, we will use this loaded model to make predictions on our actual unseen test data (10% of the original dataset), which we had set aside initially.
Again, we use the `predict_model(...)` function, but this time we also pass in the unseen data via the `data` parameter.
Since our loaded model contains the entire transformation pipeline, all the necessary preprocessing steps are automatically applied to the passed dataset before predictions are actually made.
unseen_predictions = predict_model(saved_model, data=data_unseen)
unseen_predictions.head()
material | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | wtd_std_atomic_mass | mean_fie | wtd_mean_fie | gmean_fie | wtd_gmean_fie | entropy_fie | wtd_entropy_fie | range_fie | wtd_range_fie | std_fie | wtd_std_fie | mean_atomic_radius | wtd_mean_atomic_radius | gmean_atomic_radius | wtd_gmean_atomic_radius | entropy_atomic_radius | wtd_entropy_atomic_radius | range_atomic_radius | wtd_range_atomic_radius | std_atomic_radius | wtd_std_atomic_radius | mean_Density | wtd_mean_Density | gmean_Density | wtd_gmean_Density | entropy_Density | wtd_entropy_Density | range_Density | wtd_range_Density | ... | gmean_ElectronAffinity | wtd_gmean_ElectronAffinity | entropy_ElectronAffinity | wtd_entropy_ElectronAffinity | range_ElectronAffinity | wtd_range_ElectronAffinity | std_ElectronAffinity | wtd_std_ElectronAffinity | mean_FusionHeat | wtd_mean_FusionHeat | gmean_FusionHeat | wtd_gmean_FusionHeat | entropy_FusionHeat | wtd_entropy_FusionHeat | range_FusionHeat | wtd_range_FusionHeat | std_FusionHeat | wtd_std_FusionHeat | mean_ThermalConductivity | wtd_mean_ThermalConductivity | gmean_ThermalConductivity | wtd_gmean_ThermalConductivity | entropy_ThermalConductivity | wtd_entropy_ThermalConductivity | range_ThermalConductivity | wtd_range_ThermalConductivity | std_ThermalConductivity | wtd_std_ThermalConductivity | mean_Valence | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ge1Nb3 | 2.0 | 82.768190 | 87.837285 | 82.144935 | 87.360109 | 0.685627 | 0.509575 | 20.27638 | 51.522285 | 10.138190 | 8.779930 | 711.80 | 687.700000 | 710.166178 | 686.488365 | 0.690853 | 0.589409 | 96.4 | 307.700000 | 48.200000 | 41.742424 | 161.500000 | 179.750000 | 157.321327 | 176.492557 | 0.667386 | 0.461943 | 73.0 | 117.250000 | 36.500000 | 31.609927 | 6946.500000 | 7758.250000 | 6754.118003 | 7608.074085 | 0.665582 | NaN | 3247.000 | 5096.750000 | ... | 102.741423 | 94.869113 | 0.680597 | 0.622557 | 32.90 | 35.575000 | 16.450000 | 14.246118 | 29.3000 | 28.050000 | 29.193150 | 27.970992 | 0.689503 | 0.596157 | 5.000 | 12.150000 | 2.500000 | 2.165064 | 57.000000 | 55.500000 | 56.920998 | 55.441265 | 0.691761 | 0.583527 | 6.00000 | 25.500000 | 3.000000 | 2.598076 | 4.50 | 4.750000 | 4.472136 | 4.728708 | 0.686962 | 0.514653 | 1.0 | 2.750000 | 0.500000 | 0.433013 | 6.4 | 10.102330 |
1 | Y1Ba2Cu3O | 4.0 | 76.444563 | 81.456750 | 59.356672 | 68.229617 | 1.199541 | 1.108189 | 121.32760 | 36.950657 | 43.823354 | 40.612293 | 794.00 | 738.357143 | 741.629349 | 702.424197 | 1.315004 | 1.282439 | 810.6 | 231.371429 | 311.743492 | 255.465227 | 164.500000 | 171.571429 | 139.000514 | 153.255987 | 1.256701 | 1.166832 | 205.0 | 65.428571 | 77.525802 | 67.911407 | 4235.857250 | 5481.918429 | 669.556588 | 1780.077447 | 1.015407 | 0.810985 | 8958.571 | 3839.795857 | ... | 53.527965 | 56.435929 | 1.105182 | 0.953371 | 127.05 | 46.971429 | 54.830755 | 52.650295 | 8.1805 | 9.560286 | 4.035569 | 6.229663 | 1.112098 | 0.975148 | 12.878 | 5.582571 | 4.948155 | 4.359650 | 108.756645 | 179.003797 | 7.552385 | 26.578636 | 0.336262 | 0.201966 | 399.97342 | 171.424774 | 168.301047 | NaN | 2.25 | 2.142857 | 2.213364 | 2.119268 | 1.368922 | 1.309526 | 1.0 | 0.571429 | 0.433013 | 0.349927 | 91.2 | 81.822642 |
2 | Y1Ba1.5La0.5Cu3O7.08 | 5.0 | 88.936744 | 51.090431 | 70.358975 | 34.783991 | 1.445824 | 1.525092 | 122.90607 | 10.438667 | 46.482335 | 44.261233 | 743.42 | 1006.991437 | 696.313849 | 942.154532 | 1.538527 | 0.933542 | 810.6 | 690.076300 | 296.615154 | 340.103711 | 170.600000 | 111.914373 | 148.737352 | 88.458069 | 1.506998 | 1.509873 | 205.0 | 25.802752 | 70.406250 | 75.991199 | 4617.885800 | 3035.177165 | NaN | 66.205693 | 1.324548 | 0.978095 | 8958.571 | 2054.272376 | ... | 52.696967 | 90.754610 | 1.354052 | 0.815880 | 127.05 | 75.361239 | 50.281492 | 47.273696 | 7.8044 | 5.154569 | 4.411557 | 1.310299 | 1.374615 | 1.153108 | 12.878 | 2.884422 | 4.489231 | 5.622243 | 89.605316 | 95.618363 | 8.418912 | 1.058389 | 0.457810 | 0.209583 | 399.97342 | 91.728732 | 155.329609 | 166.192923 | 2.40 | 2.114679 | 2.352158 | 2.095193 | 1.589027 | 1.314189 | 1.0 | 0.967890 | 0.489898 | 0.318634 | 38.0 | 34.782055 |
3 | La1.76Sr0.24Cu1O4 | 4.0 | 76.517718 | 56.149432 | 59.310096 | 35.562124 | 1.197273 | 1.042132 | 122.90607 | 31.920690 | 44.289459 | 51.815571 | 787.05 | 1011.642286 | 734.219624 | 940.469590 | 1.313008 | 0.802473 | NaN | 731.520000 | 314.505966 | 353.685876 | 151.750000 | 104.680000 | 131.302197 | 84.236452 | 1.275274 | 1.215773 | 171.0 | 41.520000 | 65.579627 | 67.580981 | 4434.357250 | 2916.268000 | 674.484751 | 52.847119 | 0.995983 | 0.807732 | 8958.571 | 1544.463429 | ... | 48.477265 | 95.882106 | 1.096672 | 0.773969 | 135.97 | 81.204686 | 54.373097 | 43.628244 | 6.9055 | 3.856571 | 3.479475 | 1.042408 | 1.088575 | 1.016669 | 12.878 | 1.744571 | 4.599064 | 4.673780 | 112.006645 | 61.626617 | 8.339818 | 0.637507 | 0.403693 | 0.304547 | 399.97342 | 57.127669 | 166.742351 | 138.361103 | 2.25 | 2.251429 | 2.213364 | NaN | 1.368922 | 1.078855 | 1.0 | 1.074286 | 0.433013 | 0.433834 | 19.0 | 21.520392 |
4 | La0.94Mo6Se8 | 3.0 | 104.608490 | 89.558979 | 101.719818 | 88.481210 | 1.070258 | 0.944284 | 59.94547 | 33.541423 | 25.225148 | 15.159829 | 722.10 | 812.626104 | 703.695276 | 799.630186 | 1.072790 | 0.796138 | 399.3 | 469.515797 | 165.133461 | 141.219887 | 162.666667 | 143.728246 | 156.269832 | 137.110827 | 1.062146 | 0.913764 | 92.0 | 64.036145 | 42.240055 | 43.743683 | 7081.666667 | 7095.665328 | 6727.404359 | 6633.555886 | 1.046587 | 0.841524 | 5461.000 | 3741.817938 | ... | 89.376074 | 121.321040 | 0.925747 | 0.621554 | 147.00 | 102.106426 | 64.406539 | 63.255476 | 15.9000 | 17.745783 | 10.699060 | 11.681349 | 0.726384 | 0.547497 | 30.600 | 14.061446 | 14.217595 | 14.955962 | 50.840000 | 56.919679 | 9.794610 | 6.007072 | 0.313841 | 0.106068 | 138.48000 | 55.544846 | 62.546392 | 67.307993 | 5.00 | 5.811245 | 4.762203 | 5.743954 | 1.054920 | 0.803990 | 3.0 | 3.024096 | 1.414214 | 0.728448 | 11.0 | 5.981907 |
5 rows × 84 columns
The predictions on this dataset are appended to the test dataframe as a new column named "Label".
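For instance, to eyeball the actual vs. predicted values side by side, you can slice these two columns out of the returned dataframe (a small illustrative snippet using standard pandas):
# View actual vs. predicted critical temperatures side by side
unseen_predictions[['critical_temp', 'Label']].head()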
In this case, since we also have the actual values (the critical_temp column) in the unseen test dataset, we can use the check_metric(...) function from pycaret.utils to get the performance scores of our model on this data. The check_metric(...) function takes in the columns containing the true values & the predicted values, along with the name of the metric to compute.
from pycaret.utils import check_metric
import numpy as np
print("R2:\t", check_metric(unseen_predictions.critical_temp, unseen_predictions.Label, 'R2'))
print("RMSE:\t", np.round(np.sqrt(check_metric(unseen_predictions.critical_temp, unseen_predictions.Label, 'MSE')), 4))
R2: 0.9157 RMSE: 9.8366
Note: At the time of writing this tutorial, there was a bug in the PyCaret code due to which specifying "RMSE" as the metric returned the "MSE" value instead. Hence, the extra step of performing the square root manually is needed here.
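Alternatively, the same RMSE can be cross-checked without check_metric(...) — a minimal sketch using scikit-learn directly (assuming a scikit-learn version recent enough to support the squared argument of mean_squared_error):
from sklearn.metrics import mean_squared_error
# RMSE computed directly from the true & predicted columns; squared=False returns the root
rmse = mean_squared_error(unseen_predictions.critical_temp, unseen_predictions.Label, squared=False)
print("RMSE:\t", round(rmse, 4))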
With this, we have covered most of the basic functionalities of PyCaret. In this section of the tutorial, we explored how to prepare an end-to-end ML pipeline for a regression task (data ingestion, basic pre-processing, model selection & training, hyperparameter tuning, analysis, prediction & saving the model for later use) in less than 10 commands. This truly demonstrates the power & efficiency of using tools like PyCaret.
Having familiarized ourselves with the fundamentals of using PyCaret, it is time to dive into some more cool features that PyCaret offers. In this section, we will explore advanced preprocessing & feature engineering options, selecting & fine-tuning multiple models with custom hyperparameter grids, and model ensembling & blending.
In the previous section, we saw how setup(...) can be used to initialize the environment for our experiment & perform some basic preprocessing. Now, we look at some more options available to us for preprocessing our data further. Like before, all this can be done easily by setting the corresponding parameters while calling the setup function.
Following are some of the parameters that can be used to perform scaling & transformation:

- normalize: If set to True, the entire feature space is rescaled based on the normalize_method.
- normalize_method: Defines the method used to normalize the data. The available methods can be seen in the documentation.

PyCaret also allows you to create new features based on the original features simply by adding a few parameters. Some of these are:

- feature_interaction: Setting this to True creates new features of the form a * b for all pairs of numeric features a & b.
- feature_ratio: Setting this to True creates new features of the form a / b for all pairs of numeric features a & b.
- polynomial_features: Setting this to True creates new features based on polynomial combinations that exist within the numeric features.
- polynomial_degree: Specifies the degree of the polynomial features (default: 2).
- trigonometry_features: Setting this to True creates new features that are trigonometric combinations of the numeric features in the dataset, up to the degree defined in polynomial_degree.

With the knowledge of these parameters, we can set up a new experiment where we use some of these preprocessing steps as follows:
expt_intermediate = setup(
data = df,
target = 'critical_temp',
session_id=42, # Random seed to ensure reproducibility of the experiment with the same data
train_size=0.8, # 80% training data & 20% held-out validation data
ignore_features=["material"],
normalize=True,
normalize_method="minmax",
polynomial_features=True,
trigonometry_features=True
)
Description | Value | |
---|---|---|
0 | session_id | 42 |
1 | Target | critical_temp |
2 | Original Data | (21263, 83) |
3 | Missing Values | True |
4 | Numeric Features | 79 |
5 | Categorical Features | 2 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (17010, 154) |
10 | Transformed Test Set | (4253, 154) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 7f5e |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | mean |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | constant |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | True |
28 | Normalize Method | minmax |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Remove Perfect Collinearity | True |
43 | Clustering | False |
44 | Clustering Iteration | None |
45 | Polynomial Features | True |
46 | Polynomial Degree | 2 |
47 | Trignometry Features | True |
48 | Polynomial Threshold | 0.1 |
49 | Group Features | False |
50 | Feature Selection | False |
51 | Feature Selection Method | classic |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Transform Target | False |
57 | Transform Target Method | box-cox |
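As the summary above shows, the transformed train set now contains 154 features, thanks to the engineered polynomial & trigonometric features. To inspect this transformed feature space directly, we can pull out PyCaret's internal objects — a minimal sketch, assuming the get_config(...) utility exposed by pycaret.regression in this version:
from pycaret.regression import get_config
# Retrieve the transformed training features produced by setup(...)
X_train_transformed = get_config('X_train')
print(X_train_transformed.shape)  # should match the (17010, 154) reported above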
Now, similar to the previous section, we compare the various models available. The only change is that this time, while comparing, we select the top 3 models (based on RMSE). This can be done by setting the n_select parameter to 3 as follows:
top3 = compare_models(sort="RMSE", exclude=["lar", "rf", "et", "gbr", "ada"], fold=5, n_select=3)
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
lightgbm | Light Gradient Boosting Machine | 6.7801 | 116.7506 | 10.7993 | 0.9010 | 0.4985 | 9.4966 | 2.126 |
knn | K Neighbors Regressor | 7.4934 | 171.2640 | 13.0827 | 0.8547 | 0.4829 | 9.7451 | 3.210 |
dt | Decision Tree Regressor | 7.3110 | 194.2151 | 13.9218 | 0.8353 | 0.5210 | 6.8010 | 1.514 |
lr | Linear Regression | 12.9294 | 298.8115 | 17.2820 | 0.7463 | 0.8402 | 21.8982 | 0.528 |
br | Bayesian Ridge | 13.0206 | 300.7426 | 17.3384 | 0.7446 | 0.8478 | 19.7532 | 0.298 |
ridge | Ridge Regression | 13.1375 | 305.2784 | 17.4690 | 0.7408 | 0.8489 | 19.0016 | 0.052 |
huber | Huber Regressor | 13.3101 | 322.8834 | 17.9655 | 0.7258 | 0.8514 | 18.4292 | 1.310 |
omp | Orthogonal Matching Pursuit | 15.0008 | 373.2262 | 19.3157 | 0.6831 | 0.9263 | 23.1458 | 0.056 |
par | Passive Aggressive Regressor | 16.2831 | 450.3890 | 21.0913 | 0.6152 | 0.9943 | 26.3691 | 0.168 |
lasso | Lasso Regression | 17.1647 | 503.5471 | 22.4379 | 0.5726 | 0.9311 | 23.0181 | 0.082 |
en | Elastic Net | 18.9225 | 573.1404 | 23.9374 | 0.5136 | 1.0144 | 20.2454 | 0.060 |
llar | Lasso Least Angle Regression | 29.4259 | 1179.1006 | 34.3350 | -0.0007 | 1.4446 | 51.3135 | 0.058 |
for model in top3:
print(model)
print()
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2, weights='uniform')
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=42, splitter='best')
Previously, we were using a Random Grid Search to find the optimal hyperparameters for a model. This time, we will learn how to create a custom grid to search for the optimal parameters.
First, we look at the various hyperparameters of each model & then create a dict containing the ranges of the grid that we wish to fine-tune over. Later, we pass this dict into the tune_model(...) function via its custom_grid parameter.
Following are examples of some custom fine-tuning applied to each of our top 3 models.
# Fine-Tuning the Light GBM model
lgbm = top3[0]
lgbm_params = {
'num_leaves': np.arange(10,200,10),
'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
'learning_rate': np.arange(0.1,1,0.1)
}
tuned_lgbm = tune_model(lgbm, custom_grid = lgbm_params, fold=5)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 6.1500 | 111.8845 | 10.5775 | 0.9064 | 0.4487 | 5.9283 |
1 | 5.9389 | 98.9598 | 9.9479 | 0.9144 | 0.4426 | 12.5269 |
2 | 5.7831 | 90.1870 | 9.4967 | 0.9201 | 0.4397 | 10.1177 |
3 | 6.1927 | 110.0694 | 10.4914 | 0.9096 | 0.4556 | 12.1821 |
4 | 6.1456 | 104.1636 | 10.2061 | 0.9127 | 0.4446 | 1.1379 |
Mean | 6.0420 | 103.0529 | 10.1439 | 0.9126 | 0.4462 | 8.3786 |
SD | 0.1567 | 7.8835 | 0.3924 | 0.0046 | 0.0056 | 4.3158 |
# Fine-Tuning the K-Nearest Neighbors model
knn = top3[1]
knn_params = {
'n_neighbors': np.arange(2,6),
'p': np.arange(1,2),
'leaf_size': np.arange(10,60,10)
}
tuned_knn = tune_model(knn, custom_grid = knn_params, fold=5)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 6.2757 | 136.2753 | 11.6737 | 0.8860 | 0.4228 | 3.5938 |
1 | 6.2235 | 125.1762 | 11.1882 | 0.8918 | 0.4225 | 10.8804 |
2 | 6.1148 | 120.6007 | 10.9818 | 0.8931 | 0.4153 | 7.3990 |
3 | 6.2538 | 135.0941 | 11.6230 | 0.8890 | 0.4224 | 11.8967 |
4 | 6.2056 | 127.5158 | 11.2923 | 0.8932 | 0.4154 | 0.9532 |
Mean | 6.2147 | 128.9324 | 11.3518 | 0.8906 | 0.4197 | 6.9446 |
SD | 0.0555 | 5.9568 | 0.2624 | 0.0028 | 0.0036 | 4.1796 |
# Fine-Tuning the Decision Tree model
dt = top3[2]
dt_params = {
'min_samples_split': np.arange(2,12,1),
'max_features': ["auto", "sqrt", "log2"],
}
tuned_dt = tune_model(dt, custom_grid = dt_params, fold=5)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 7.6052 | 204.5219 | 14.3011 | 0.8289 | 0.5250 | 6.1723 |
1 | 6.9846 | 168.8165 | 12.9929 | 0.8540 | 0.4830 | 13.6954 |
2 | 7.1957 | 172.3882 | 13.1297 | 0.8473 | 0.5026 | 8.5512 |
3 | 7.3743 | 194.9335 | 13.9619 | 0.8399 | 0.5004 | 13.0748 |
4 | 7.0655 | 167.3674 | 12.9371 | 0.8598 | 0.4667 | 0.9862 |
Mean | 7.2451 | 181.6055 | 13.4645 | 0.8460 | 0.4956 | 8.4960 |
SD | 0.2231 | 15.1923 | 0.5586 | 0.0108 | 0.0197 | 4.6861 |
Now, we have the top 3 models fine-tuned using a custom grid. The optimal hyperparameters can be seen by printing the models:
print(tuned_lgbm, "\n")
print(tuned_knn, "\n")
print(tuned_dt)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=70, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=80, objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
KNeighborsRegressor(algorithm='auto', leaf_size=10, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=3, p=1, weights='uniform')
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='log2', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=10, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=42, splitter='best')
The following table compares the RMSE values of the 3 models before & after fine-tuning.
Model | Initial RMSE | Fine-Tuned RMSE |
---|---|---|
lightgbm | 10.7993 | 10.1439 |
knn | 13.0827 | 11.3518 |
dt | 13.9218 | 13.4645 |
Model ensembling is a machine learning technique that combines multiple models in the prediction process. You may be familiar with the terms bagging & boosting, which are typically used to improve the performance of decision-tree-based algorithms.
PyCaret enables bagging & boosting through the ensemble_model(...) function, which takes in a single trained model object along with the method ("Bagging" or "Boosting"). The number of estimators can be controlled using the n_estimators parameter (default: 10).
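For instance, a minimal sketch of bagging & boosting our fine-tuned decision tree could look as follows (illustrative only; this step is not part of our experiment's pipeline):
# Bagging & boosting the tuned decision tree with 10 estimators each (illustrative)
bagged_dt = ensemble_model(tuned_dt, method='Bagging', n_estimators=10)
boosted_dt = ensemble_model(tuned_dt, method='Boosting', n_estimators=10)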
However, PyCaret also allows us to combine multiple types of models to create custom ensembles. This technique is called Blending, which we will implement next.
Blending is an ensemble ML approach that combines the predictions of several contributing member models; for regression, PyCaret does this by averaging their predictions (note the Voting Regressor in the output further below). Blending in PyCaret is achieved using the blend_models(...) function, which is available only for the pycaret.regression & pycaret.classification modules. The model objects passed in the estimator_list parameter are used as the ensemble members.
We will try to blend our top 3 fine-tuned models to produce our final model.
blended_model = blend_models(estimator_list=[tuned_lgbm, tuned_knn, tuned_dt])
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 5.8042 | 112.3234 | 10.5983 | 0.9058 | 0.4150 | 8.6553 |
1 | 5.7807 | 105.5312 | 10.2728 | 0.9119 | 0.3966 | 0.9125 |
2 | 5.5910 | 95.3437 | 9.7644 | 0.9194 | 0.4005 | 6.5337 |
3 | 5.8788 | 106.0757 | 10.2993 | 0.9061 | 0.4224 | 8.2497 |
4 | 5.4651 | 90.7305 | 9.5253 | 0.9174 | 0.4192 | 15.6773 |
5 | 5.8054 | 100.1237 | 10.0062 | 0.9135 | 0.4042 | 0.7391 |
6 | 6.0748 | 120.6465 | 10.9839 | 0.9019 | 0.4022 | 1.0963 |
7 | 6.0080 | 109.8304 | 10.4800 | 0.9087 | 0.4321 | 27.5085 |
8 | 5.8795 | 109.2790 | 10.4537 | 0.9052 | 0.4121 | 0.7415 |
9 | 5.7296 | 92.2783 | 9.6062 | 0.9251 | 0.3888 | 1.2944 |
Mean | 5.8017 | 104.2163 | 10.1990 | 0.9115 | 0.4093 | 7.1408 |
SD | 0.1710 | 9.0436 | 0.4434 | 0.0070 | 0.0125 | 8.2647 |
We can now check the performance of this blended model on our held-out validation set to verify the performance metric values.
predict_model(blended_model);
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|---|
0 | Voting Regressor | 5.5852 | 92.7436 | 9.6303 | 0.9194 | 0.4059 | 1.2702 |
Now, we will predict the values for our unseen test dataset:
unseen_predictions = predict_model(blended_model, data=data_unseen)
unseen_predictions.head()
material | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | wtd_std_atomic_mass | mean_fie | wtd_mean_fie | gmean_fie | wtd_gmean_fie | entropy_fie | wtd_entropy_fie | range_fie | wtd_range_fie | std_fie | wtd_std_fie | mean_atomic_radius | wtd_mean_atomic_radius | gmean_atomic_radius | wtd_gmean_atomic_radius | entropy_atomic_radius | wtd_entropy_atomic_radius | range_atomic_radius | wtd_range_atomic_radius | std_atomic_radius | wtd_std_atomic_radius | mean_Density | wtd_mean_Density | gmean_Density | wtd_gmean_Density | entropy_Density | wtd_entropy_Density | range_Density | wtd_range_Density | ... | gmean_ElectronAffinity | wtd_gmean_ElectronAffinity | entropy_ElectronAffinity | wtd_entropy_ElectronAffinity | range_ElectronAffinity | wtd_range_ElectronAffinity | std_ElectronAffinity | wtd_std_ElectronAffinity | mean_FusionHeat | wtd_mean_FusionHeat | gmean_FusionHeat | wtd_gmean_FusionHeat | entropy_FusionHeat | wtd_entropy_FusionHeat | range_FusionHeat | wtd_range_FusionHeat | std_FusionHeat | wtd_std_FusionHeat | mean_ThermalConductivity | wtd_mean_ThermalConductivity | gmean_ThermalConductivity | wtd_gmean_ThermalConductivity | entropy_ThermalConductivity | wtd_entropy_ThermalConductivity | range_ThermalConductivity | wtd_range_ThermalConductivity | std_ThermalConductivity | wtd_std_ThermalConductivity | mean_Valence | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ge1Nb3 | 2.0 | 82.768190 | 87.837285 | 82.144935 | 87.360109 | 0.685627 | 0.509575 | 20.27638 | 51.522285 | 10.138190 | 8.779930 | 711.80 | 687.700000 | 710.166178 | 686.488365 | 0.690853 | 0.589409 | 96.4 | 307.700000 | 48.200000 | 41.742424 | 161.500000 | 179.750000 | 157.321327 | 176.492557 | 0.667386 | 0.461943 | 73.0 | 117.250000 | 36.500000 | 31.609927 | 6946.500000 | 7758.250000 | 6754.118003 | 7608.074085 | 0.665582 | NaN | 3247.000 | 5096.750000 | ... | 102.741423 | 94.869113 | 0.680597 | 0.622557 | 32.90 | 35.575000 | 16.450000 | 14.246118 | 29.3000 | 28.050000 | 29.193150 | 27.970992 | 0.689503 | 0.596157 | 5.000 | 12.150000 | 2.500000 | 2.165064 | 57.000000 | 55.500000 | 56.920998 | 55.441265 | 0.691761 | 0.583527 | 6.00000 | 25.500000 | 3.000000 | 2.598076 | 4.50 | 4.750000 | 4.472136 | 4.728708 | 0.686962 | 0.514653 | 1.0 | 2.750000 | 0.500000 | 0.433013 | 6.4 | 12.279531 |
1 | Y1Ba2Cu3O | 4.0 | 76.444563 | 81.456750 | 59.356672 | 68.229617 | 1.199541 | 1.108189 | 121.32760 | 36.950657 | 43.823354 | 40.612293 | 794.00 | 738.357143 | 741.629349 | 702.424197 | 1.315004 | 1.282439 | 810.6 | 231.371429 | 311.743492 | 255.465227 | 164.500000 | 171.571429 | 139.000514 | 153.255987 | 1.256701 | 1.166832 | 205.0 | 65.428571 | 77.525802 | 67.911407 | 4235.857250 | 5481.918429 | 669.556588 | 1780.077447 | 1.015407 | 0.810985 | 8958.571 | 3839.795857 | ... | 53.527965 | 56.435929 | 1.105182 | 0.953371 | 127.05 | 46.971429 | 54.830755 | 52.650295 | 8.1805 | 9.560286 | 4.035569 | 6.229663 | 1.112098 | 0.975148 | 12.878 | 5.582571 | 4.948155 | 4.359650 | 108.756645 | 179.003797 | 7.552385 | 26.578636 | 0.336262 | 0.201966 | 399.97342 | 171.424774 | 168.301047 | NaN | 2.25 | 2.142857 | 2.213364 | 2.119268 | 1.368922 | 1.309526 | 1.0 | 0.571429 | 0.433013 | 0.349927 | 91.2 | 77.990958 |
2 | Y1Ba1.5La0.5Cu3O7.08 | 5.0 | 88.936744 | 51.090431 | 70.358975 | 34.783991 | 1.445824 | 1.525092 | 122.90607 | 10.438667 | 46.482335 | 44.261233 | 743.42 | 1006.991437 | 696.313849 | 942.154532 | 1.538527 | 0.933542 | 810.6 | 690.076300 | 296.615154 | 340.103711 | 170.600000 | 111.914373 | 148.737352 | 88.458069 | 1.506998 | 1.509873 | 205.0 | 25.802752 | 70.406250 | 75.991199 | 4617.885800 | 3035.177165 | NaN | 66.205693 | 1.324548 | 0.978095 | 8958.571 | 2054.272376 | ... | 52.696967 | 90.754610 | 1.354052 | 0.815880 | 127.05 | 75.361239 | 50.281492 | 47.273696 | 7.8044 | 5.154569 | 4.411557 | 1.310299 | 1.374615 | 1.153108 | 12.878 | 2.884422 | 4.489231 | 5.622243 | 89.605316 | 95.618363 | 8.418912 | 1.058389 | 0.457810 | 0.209583 | 399.97342 | 91.728732 | 155.329609 | 166.192923 | 2.40 | 2.114679 | 2.352158 | 2.095193 | 1.589027 | 1.314189 | 1.0 | 0.967890 | 0.489898 | 0.318634 | 38.0 | 33.854053 |
3 | La1.76Sr0.24Cu1O4 | 4.0 | 76.517718 | 56.149432 | 59.310096 | 35.562124 | 1.197273 | 1.042132 | 122.90607 | 31.920690 | 44.289459 | 51.815571 | 787.05 | 1011.642286 | 734.219624 | 940.469590 | 1.313008 | 0.802473 | NaN | 731.520000 | 314.505966 | 353.685876 | 151.750000 | 104.680000 | 131.302197 | 84.236452 | 1.275274 | 1.215773 | 171.0 | 41.520000 | 65.579627 | 67.580981 | 4434.357250 | 2916.268000 | 674.484751 | 52.847119 | 0.995983 | 0.807732 | 8958.571 | 1544.463429 | ... | 48.477265 | 95.882106 | 1.096672 | 0.773969 | 135.97 | 81.204686 | 54.373097 | 43.628244 | 6.9055 | 3.856571 | 3.479475 | 1.042408 | 1.088575 | 1.016669 | 12.878 | 1.744571 | 4.599064 | 4.673780 | 112.006645 | 61.626617 | 8.339818 | 0.637507 | 0.403693 | 0.304547 | 399.97342 | 57.127669 | 166.742351 | 138.361103 | 2.25 | 2.251429 | 2.213364 | NaN | 1.368922 | 1.078855 | 1.0 | 1.074286 | 0.433013 | 0.433834 | 19.0 | 22.574612 |
4 | La0.94Mo6Se8 | 3.0 | 104.608490 | 89.558979 | 101.719818 | 88.481210 | 1.070258 | 0.944284 | 59.94547 | 33.541423 | 25.225148 | 15.159829 | 722.10 | 812.626104 | 703.695276 | 799.630186 | 1.072790 | 0.796138 | 399.3 | 469.515797 | 165.133461 | 141.219887 | 162.666667 | 143.728246 | 156.269832 | 137.110827 | 1.062146 | 0.913764 | 92.0 | 64.036145 | 42.240055 | 43.743683 | 7081.666667 | 7095.665328 | 6727.404359 | 6633.555886 | 1.046587 | 0.841524 | 5461.000 | 3741.817938 | ... | 89.376074 | 121.321040 | 0.925747 | 0.621554 | 147.00 | 102.106426 | 64.406539 | 63.255476 | 15.9000 | 17.745783 | 10.699060 | 11.681349 | 0.726384 | 0.547497 | 30.600 | 14.061446 | 14.217595 | 14.955962 | 50.840000 | 56.919679 | 9.794610 | 6.007072 | 0.313841 | 0.106068 | 138.48000 | 55.544846 | 62.546392 | 67.307993 | 5.00 | 5.811245 | 4.762203 | 5.743954 | 1.054920 | 0.803990 | 3.0 | 3.024096 | 1.414214 | 0.728448 | 11.0 | 8.882301 |
5 rows × 84 columns
print("R2:\t", check_metric(unseen_predictions.critical_temp, unseen_predictions.Label, 'R2'))
print("RMSE:\t", np.round(np.sqrt(check_metric(unseen_predictions.critical_temp, unseen_predictions.Label, 'MSE')), 4))
R2: 0.9225 RMSE: 9.428
Clearly, this blended model, trained with some extra preprocessing & feature engineering on our initial dataset, performs better than the model we trained in the Basic PyCaret section (R2 = 0.9157, RMSE = 9.8366).
We can now save our blended regression model:
save_model(blended_model, "blended_expt2")
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=['material'], id_columns=[], ml_usecase='regression', numerical_features=[], target='critical_temp', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None,... weights='uniform')), ('dt', DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='log2', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=10, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=42, splitter='best'))], n_jobs=-1, verbose=False, weights=None)]], verbose=False), 'blended_expt2.pkl')
This concludes our tutorial on PyCaret. Over the course of this tutorial, we have seen how easy it is to develop end-to-end ML pipelines using a low-code framework like PyCaret.