In this example, we will predict the math score in the ENEM 2016 Brazilian National Exam. The dataset was obtained from INEP, a department of the Brazilian Ministry of Education, and can be downloaded from INEP's site. It contains data on the applicants for the 2016 National High School Exam: not only the exam results, but also the social and economic context of the applicants. See the Enem 2016 Microdata documentation for a description of each field.
You will have two datasets - train.csv and test.csv - to use for predicting math scores (NU_NOTA_MT). The train file contains a substantial sample of ENEM 2016 data and most of the original columns, for those who would like to do some EDA.
This notebook was created using PyCaret 2.0 by Simone Perazzoli. Last updated : 04-08-2020
#!pip3 install pycaret==2.0
import numpy as np
import pandas as pd
import pycaret
import seaborn as sns
import matplotlib.pyplot as plt
from pycaret.regression import *
from scipy.stats import kurtosis, skew
pd.set_option('display.max_columns',200)
# checking pycaret version
from pycaret.utils import version
version()
2.0
# Train dataset
df_train = pd.read_csv('train.csv')
# Test dataset
df_test = pd.read_csv('test.csv')
# Creating answer dataframe
answer = pd.DataFrame()
# Saving the registration number:
answer['NU_INSCRICAO'] = df_test['NU_INSCRICAO']
# Dropping the registration number from train and test dataframes:
df_train.drop(['NU_INSCRICAO'], axis=1, inplace=True)
df_test.drop(['NU_INSCRICAO'], axis=1, inplace=True)
# Checking dataframe shape
df_train.shape, df_test.shape
((13730, 166), (4576, 46))
# By checking the shapes we can see the training data has more columns than the test data,
# so we keep only the features that also exist in the test dataframe (plus the target)
# and use those to decide which features to include in the prediction.
cols = list(df_test)
cols.append('NU_NOTA_MT')
train = df_train[cols].copy()  # .copy() avoids SettingWithCopyWarning on later in-place edits
test = df_test.copy()
# Viewing training data:
train.head()
CO_UF_RESIDENCIA | SG_UF_RESIDENCIA | NU_IDADE | TP_SEXO | TP_COR_RACA | TP_NACIONALIDADE | TP_ST_CONCLUSAO | TP_ANO_CONCLUIU | TP_ESCOLA | TP_ENSINO | IN_TREINEIRO | TP_DEPENDENCIA_ADM_ESC | IN_BAIXA_VISAO | IN_CEGUEIRA | IN_SURDEZ | IN_DISLEXIA | IN_DISCALCULIA | IN_SABATISTA | IN_GESTANTE | IN_IDOSO | TP_PRESENCA_CN | TP_PRESENCA_CH | TP_PRESENCA_LC | CO_PROVA_CN | CO_PROVA_CH | CO_PROVA_LC | CO_PROVA_MT | NU_NOTA_CN | NU_NOTA_CH | NU_NOTA_LC | TP_LINGUA | TP_STATUS_REDACAO | NU_NOTA_COMP1 | NU_NOTA_COMP2 | NU_NOTA_COMP3 | NU_NOTA_COMP4 | NU_NOTA_COMP5 | NU_NOTA_REDACAO | Q001 | Q002 | Q006 | Q024 | Q025 | Q026 | Q027 | Q047 | NU_NOTA_MT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 43 | RS | 24 | M | 1 | 1 | 1 | 4 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 16f84b7b3d2aeaff7d2f01297e6b3d0e25c77bb2 | 9cd70f1b922e02bd33453b3f607f5a644fb9b1b8 | 01af53cd161a420fff1767129c10de560cc264dd | 97caab1e1533dba217deb7ef41490f52e459ab01 | 436.3 | 495.4 | 581.2 | 1 | 1.0 | 120.0 | 120.0 | 120.0 | 80.0 | 80.0 | 520.0 | D | D | C | A | A | C | H | A | 399.4 |
1 | 23 | CE | 17 | F | 3 | 1 | 2 | 0 | 2 | 1.0 | 0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | b9b06ce8c319a3df2158ea3d0aef0f7d3eecaed7 | 909237ab0d84688e10c0470e2997348aff585273 | 01af53cd161a420fff1767129c10de560cc264dd | 97caab1e1533dba217deb7ef41490f52e459ab01 | 474.5 | 544.1 | 599.0 | 1 | 1.0 | 140.0 | 120.0 | 120.0 | 120.0 | 80.0 | 580.0 | A | A | B | A | A | A | NaN | A | 459.8 |
2 | 23 | CE | 21 | F | 3 | 1 | 3 | 0 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | D | D | C | A | A | A | NaN | A | NaN |
3 | 33 | RJ | 25 | F | 0 | 1 | 1 | 9 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | H | E | E | C | B | C | F | D | NaN |
4 | 13 | AM | 28 | M | 2 | 1 | 1 | 4 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | D | C | A | A | B | F | A | NaN |
# Viewing test data:
test.head()
CO_UF_RESIDENCIA | SG_UF_RESIDENCIA | NU_IDADE | TP_SEXO | TP_COR_RACA | TP_NACIONALIDADE | TP_ST_CONCLUSAO | TP_ANO_CONCLUIU | TP_ESCOLA | TP_ENSINO | IN_TREINEIRO | TP_DEPENDENCIA_ADM_ESC | IN_BAIXA_VISAO | IN_CEGUEIRA | IN_SURDEZ | IN_DISLEXIA | IN_DISCALCULIA | IN_SABATISTA | IN_GESTANTE | IN_IDOSO | TP_PRESENCA_CN | TP_PRESENCA_CH | TP_PRESENCA_LC | CO_PROVA_CN | CO_PROVA_CH | CO_PROVA_LC | CO_PROVA_MT | NU_NOTA_CN | NU_NOTA_CH | NU_NOTA_LC | TP_LINGUA | TP_STATUS_REDACAO | NU_NOTA_COMP1 | NU_NOTA_COMP2 | NU_NOTA_COMP3 | NU_NOTA_COMP4 | NU_NOTA_COMP5 | NU_NOTA_REDACAO | Q001 | Q002 | Q006 | Q024 | Q025 | Q026 | Q027 | Q047 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | PR | 22 | F | 3 | 1 | 1 | 5 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 16f84b7b3d2aeaff7d2f01297e6b3d0e25c77bb2 | 9cd70f1b922e02bd33453b3f607f5a644fb9b1b8 | 01abbb7f1a90505385f44eec9905f82ca2a42cfd | 81d0ee00ef42a7c23eb04496458c03d4c5b9c31a | 464.8 | 443.5 | 431.8 | 0 | 1.0 | 120.0 | 80.0 | 80.0 | 100.0 | 40.0 | 420.0 | B | A | C | A | A | C | C | A |
1 | 21 | MA | 26 | F | 3 | 1 | 1 | 8 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | c8328ebc6f3238e06076c481bc1b82b8301e7a3f | f48d390ab6a2428e659c37fb8a9d00afde621889 | 72f80e4b3150c627c7ffc93cfe0fa13a9989b610 | 577f8968d95046f5eb5cc158608e12fa9ba34c85 | 391.1 | 491.1 | 548.0 | 1 | 1.0 | 120.0 | 120.0 | 120.0 | 120.0 | 100.0 | 580.0 | E | B | C | B | B | B | F | A |
2 | 23 | CE | 21 | M | 1 | 1 | 2 | 0 | 2 | 3.0 | 0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 16f84b7b3d2aeaff7d2f01297e6b3d0e25c77bb2 | 9cd70f1b922e02bd33453b3f607f5a644fb9b1b8 | 01af53cd161a420fff1767129c10de560cc264dd | 97caab1e1533dba217deb7ef41490f52e459ab01 | 595.9 | 622.7 | 613.6 | 0 | 1.0 | 80.0 | 40.0 | 40.0 | 80.0 | 80.0 | 320.0 | E | E | D | B | B | A | NaN | A |
3 | 15 | PA | 27 | F | 3 | 1 | 1 | 8 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | H | E | G | B | B | A | NaN | A |
4 | 41 | PR | 18 | M | 1 | 1 | 2 | 0 | 2 | 1.0 | 0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 66b1dad288e13be0992bae01e81f71eca1c6e8a6 | 942ab3dc020af4cf53740b6b07e9dd7060b24164 | 5aebe5cad7fabc1545ac7fba07a4e6177f98483c | 767a32545304ed293242d528f54d4edb1369f910 | 592.9 | 492.6 | 571.4 | 1 | 1.0 | 100.0 | 80.0 | 60.0 | 80.0 | 0.0 | 320.0 | D | H | H | C | B | A | NaN | A |
# Checking dataframe shape after transformation
train.shape, test.shape
((13730, 47), (4576, 46))
# Creating a function to summarize dataframe information
def data_summary(df):
    '''Summarize column types, missing values and cardinality'''
    summary = pd.DataFrame({'type': df.dtypes,
                            'amount': df.isna().sum(),
                            'null_values (%)': (df.isna().sum() / df.shape[0]) * 100,
                            'unique': df.nunique()})
    return summary
# Train summary:
data_summary(train)
type | amount | null_values (%) | unique | |
---|---|---|---|---|
CO_UF_RESIDENCIA | int64 | 0 | 0.000000 | 27 |
SG_UF_RESIDENCIA | object | 0 | 0.000000 | 27 |
NU_IDADE | int64 | 0 | 0.000000 | 55 |
TP_SEXO | object | 0 | 0.000000 | 2 |
TP_COR_RACA | int64 | 0 | 0.000000 | 6 |
TP_NACIONALIDADE | int64 | 0 | 0.000000 | 5 |
TP_ST_CONCLUSAO | int64 | 0 | 0.000000 | 4 |
TP_ANO_CONCLUIU | int64 | 0 | 0.000000 | 11 |
TP_ESCOLA | int64 | 0 | 0.000000 | 4 |
TP_ENSINO | float64 | 9448 | 68.812819 | 3 |
IN_TREINEIRO | int64 | 0 | 0.000000 | 2 |
TP_DEPENDENCIA_ADM_ESC | float64 | 9448 | 68.812819 | 4 |
IN_BAIXA_VISAO | int64 | 0 | 0.000000 | 2 |
IN_CEGUEIRA | int64 | 0 | 0.000000 | 1 |
IN_SURDEZ | int64 | 0 | 0.000000 | 2 |
IN_DISLEXIA | int64 | 0 | 0.000000 | 2 |
IN_DISCALCULIA | int64 | 0 | 0.000000 | 2 |
IN_SABATISTA | int64 | 0 | 0.000000 | 2 |
IN_GESTANTE | int64 | 0 | 0.000000 | 2 |
IN_IDOSO | int64 | 0 | 0.000000 | 2 |
TP_PRESENCA_CN | int64 | 0 | 0.000000 | 3 |
TP_PRESENCA_CH | int64 | 0 | 0.000000 | 3 |
TP_PRESENCA_LC | int64 | 0 | 0.000000 | 3 |
CO_PROVA_CN | object | 0 | 0.000000 | 10 |
CO_PROVA_CH | object | 0 | 0.000000 | 10 |
CO_PROVA_LC | object | 0 | 0.000000 | 9 |
CO_PROVA_MT | object | 0 | 0.000000 | 9 |
NU_NOTA_CN | float64 | 3389 | 24.683176 | 2692 |
NU_NOTA_CH | float64 | 3389 | 24.683176 | 2978 |
NU_NOTA_LC | float64 | 3597 | 26.198106 | 2774 |
TP_LINGUA | int64 | 0 | 0.000000 | 2 |
TP_STATUS_REDACAO | float64 | 3597 | 26.198106 | 9 |
NU_NOTA_COMP1 | float64 | 3597 | 26.198106 | 15 |
NU_NOTA_COMP2 | float64 | 3597 | 26.198106 | 13 |
NU_NOTA_COMP3 | float64 | 3597 | 26.198106 | 12 |
NU_NOTA_COMP4 | float64 | 3597 | 26.198106 | 14 |
NU_NOTA_COMP5 | float64 | 3597 | 26.198106 | 14 |
NU_NOTA_REDACAO | float64 | 3597 | 26.198106 | 53 |
Q001 | object | 0 | 0.000000 | 8 |
Q002 | object | 0 | 0.000000 | 8 |
Q006 | object | 0 | 0.000000 | 17 |
Q024 | object | 0 | 0.000000 | 5 |
Q025 | object | 0 | 0.000000 | 2 |
Q026 | object | 0 | 0.000000 | 3 |
Q027 | object | 7373 | 53.699927 | 13 |
Q047 | object | 0 | 0.000000 | 5 |
NU_NOTA_MT | float64 | 3597 | 26.198106 | 3406 |
# Test summary:
data_summary(test)
type | amount | null_values (%) | unique | |
---|---|---|---|---|
CO_UF_RESIDENCIA | int64 | 0 | 0.000000 | 27 |
SG_UF_RESIDENCIA | object | 0 | 0.000000 | 27 |
NU_IDADE | int64 | 0 | 0.000000 | 46 |
TP_SEXO | object | 0 | 0.000000 | 2 |
TP_COR_RACA | int64 | 0 | 0.000000 | 6 |
TP_NACIONALIDADE | int64 | 0 | 0.000000 | 5 |
TP_ST_CONCLUSAO | int64 | 0 | 0.000000 | 4 |
TP_ANO_CONCLUIU | int64 | 0 | 0.000000 | 11 |
TP_ESCOLA | int64 | 0 | 0.000000 | 3 |
TP_ENSINO | float64 | 3096 | 67.657343 | 3 |
IN_TREINEIRO | int64 | 0 | 0.000000 | 2 |
TP_DEPENDENCIA_ADM_ESC | float64 | 3096 | 67.657343 | 4 |
IN_BAIXA_VISAO | int64 | 0 | 0.000000 | 2 |
IN_CEGUEIRA | int64 | 0 | 0.000000 | 1 |
IN_SURDEZ | int64 | 0 | 0.000000 | 2 |
IN_DISLEXIA | int64 | 0 | 0.000000 | 1 |
IN_DISCALCULIA | int64 | 0 | 0.000000 | 1 |
IN_SABATISTA | int64 | 0 | 0.000000 | 2 |
IN_GESTANTE | int64 | 0 | 0.000000 | 2 |
IN_IDOSO | int64 | 0 | 0.000000 | 1 |
TP_PRESENCA_CN | int64 | 0 | 0.000000 | 2 |
TP_PRESENCA_CH | int64 | 0 | 0.000000 | 2 |
TP_PRESENCA_LC | int64 | 0 | 0.000000 | 3 |
CO_PROVA_CN | object | 0 | 0.000000 | 9 |
CO_PROVA_CH | object | 0 | 0.000000 | 9 |
CO_PROVA_LC | object | 0 | 0.000000 | 9 |
CO_PROVA_MT | object | 0 | 0.000000 | 9 |
NU_NOTA_CN | float64 | 1134 | 24.781469 | 1823 |
NU_NOTA_CH | float64 | 1134 | 24.781469 | 1969 |
NU_NOTA_LC | float64 | 1199 | 26.201923 | 1839 |
TP_LINGUA | int64 | 0 | 0.000000 | 2 |
TP_STATUS_REDACAO | float64 | 1199 | 26.201923 | 9 |
NU_NOTA_COMP1 | float64 | 1199 | 26.201923 | 10 |
NU_NOTA_COMP2 | float64 | 1199 | 26.201923 | 10 |
NU_NOTA_COMP3 | float64 | 1199 | 26.201923 | 11 |
NU_NOTA_COMP4 | float64 | 1199 | 26.201923 | 11 |
NU_NOTA_COMP5 | float64 | 1199 | 26.201923 | 11 |
NU_NOTA_REDACAO | float64 | 1199 | 26.201923 | 44 |
Q001 | object | 0 | 0.000000 | 8 |
Q002 | object | 0 | 0.000000 | 8 |
Q006 | object | 0 | 0.000000 | 17 |
Q024 | object | 0 | 0.000000 | 5 |
Q025 | object | 0 | 0.000000 | 2 |
Q026 | object | 0 | 0.000000 | 3 |
Q027 | object | 2488 | 54.370629 | 13 |
Q047 | object | 0 | 0.000000 | 5 |
# Checking the distribution of the variable:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,6))
sns.distplot(train.NU_NOTA_MT, bins=25)
plt.xlabel('Score')
plt.title('Distribution of math scores');
# Descriptive statistics for target:
train['NU_NOTA_MT'].describe()
count    10133.000000
mean       482.497928
std         99.826323
min          0.000000
25%        408.900000
50%        461.200000
75%        537.600000
max        952.000000
Name: NU_NOTA_MT, dtype: float64
print(f'Kurtosis: {train.NU_NOTA_MT.kurt()}')
print(f'Asymmetry: {train.NU_NOTA_MT.skew()}')
Kurtosis: 1.4225025820577502
Asymmetry: 0.9206896733932955
Kurtosis helps identify the weight of the distribution tails and the presence of outliers. Pandas reports excess kurtosis (a normal distribution scores 0), so the value of about 1.42 is positive: the tails are somewhat heavier than those of a normal distribution, meaning more extreme scores than a normal curve would produce.
The positive asymmetry (skewness) means the distribution has a longer tail on the right side. With a skewness value between 0.5 and 1, the data is moderately skewed.
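To make these conventions concrete, here is a small sketch on toy data (my own example, not from the notebook): pandas `.kurt()` returns excess kurtosis, and `.skew()` is positive when the right tail is longer.

```python
import pandas as pd

# Flat, symmetric values: excess kurtosis comes out negative (flatter than normal).
flat = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# A few large values stretch the right tail: skewness comes out positive.
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 4, 8, 15, 30])

print(flat.kurt())          # negative: lighter tails than a normal curve
print(right_tailed.skew())  # positive: longer tail on the right
```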
# Creating a function to remove irrelevant features
def data_cleaning(df):
    '''Remove features that are mostly missing or unlikely to help the prediction'''
    df.drop(['TP_DEPENDENCIA_ADM_ESC',
             'TP_ENSINO',
             'CO_PROVA_CN',
             'CO_PROVA_CH',
             'CO_PROVA_LC',
             'CO_PROVA_MT',
             'SG_UF_RESIDENCIA',
             'CO_UF_RESIDENCIA',
             'TP_NACIONALIDADE',
             'IN_BAIXA_VISAO',
             'IN_CEGUEIRA',
             'IN_SURDEZ',
             'IN_DISLEXIA',
             'IN_DISCALCULIA',
             'IN_SABATISTA',
             'IN_GESTANTE',
             'IN_IDOSO',
             'TP_ANO_CONCLUIU', 'TP_PRESENCA_CN',
             'TP_LINGUA', 'TP_PRESENCA_CH',
             'IN_TREINEIRO', 'TP_PRESENCA_LC',
             'TP_ST_CONCLUSAO',
             'TP_STATUS_REDACAO',
             'NU_IDADE',
             'Q027'], axis=1, inplace=True)
    return df
# Cleaning data:
data_cleaning(train)
data_cleaning(test)
train.shape, test.shape
((13730, 20), (4576, 19))
# Creating a function to impute missing values:
def data_imputation(df):
    '''Fill missing numeric values with the sentinel -100'''
    # df.dtypes.replace maps every float64 column to -100, so fillna
    # fills only the numeric columns (all remaining NaNs here are numeric)
    df.fillna(df.dtypes.replace({'float64': -100}), inplace=True)
    return df
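The `df.dtypes.replace` idiom above is compact but easy to misread. An equivalent, more explicit sketch (my own variant, not part of the notebook) selects the float columns first and fills only those:

```python
import numpy as np
import pandas as pd

def impute_floats(df, sentinel=-100):
    '''Fill NaNs in float64 columns with a sentinel, leaving object columns untouched.'''
    float_cols = df.select_dtypes(include='float64').columns
    df[float_cols] = df[float_cols].fillna(sentinel)
    return df

# Toy frame: one numeric score column with a gap, one categorical answer column.
toy = pd.DataFrame({'NU_NOTA_CN': [464.8, np.nan], 'Q001': ['B', 'E']})
impute_floats(toy)
print(toy)  # the missing score becomes -100; Q001 is unchanged
```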
data_imputation(train)
train.head();
data_imputation(test)
test.head()
TP_SEXO | TP_COR_RACA | TP_ESCOLA | NU_NOTA_CN | NU_NOTA_CH | NU_NOTA_LC | NU_NOTA_COMP1 | NU_NOTA_COMP2 | NU_NOTA_COMP3 | NU_NOTA_COMP4 | NU_NOTA_COMP5 | NU_NOTA_REDACAO | Q001 | Q002 | Q006 | Q024 | Q025 | Q026 | Q047 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | F | 3 | 1 | 464.8 | 443.5 | 431.8 | 120.0 | 80.0 | 80.0 | 100.0 | 40.0 | 420.0 | B | A | C | A | A | C | A |
1 | F | 3 | 1 | 391.1 | 491.1 | 548.0 | 120.0 | 120.0 | 120.0 | 120.0 | 100.0 | 580.0 | E | B | C | B | B | B | A |
2 | M | 1 | 2 | 595.9 | 622.7 | 613.6 | 80.0 | 40.0 | 40.0 | 80.0 | 80.0 | 320.0 | E | E | D | B | B | A | A |
3 | F | 3 | 1 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | H | E | G | B | B | A | A |
4 | M | 1 | 2 | 592.9 | 492.6 | 571.4 | 100.0 | 80.0 | 60.0 | 80.0 | 0.0 | 320.0 | D | H | H | C | B | A | A |
setup()
function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. It must be called before executing any other function and takes two mandatory parameters: the dataframe (array-like, sparse matrix) and the name of the target column. All other parameters are optional.
data: train data.
target: target feature.
remove_multicollinearity: When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.
multicollinearity_threshold: Threshold used for dropping the correlated features.
normalize: When set to True, the feature space is transformed using the normalized_method param. Generally, linear algorithms perform better with normalized data however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.
normalize_method: Defines the method to be used for normalization.
transform_target: When set to True, target variable is transformed using the method defined in transform_target_method param. Target transformation is applied separately from feature transformations.
session_id: If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
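For intuition on normalize_method='robust': robust scaling centers each feature on its median and divides by the interquartile range, so outlying scores barely influence the scale. A minimal numpy sketch of the idea (not PyCaret's internal implementation):

```python
import numpy as np

def robust_scale(x):
    '''Center on the median and scale by the IQR (75th minus 25th percentile).'''
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

scores = np.array([400.0, 450.0, 500.0, 550.0, 952.0])  # one outlying score
print(robust_scale(scores))  # the median maps to 0; the outlier stays extreme
                             # but does not inflate the scale of the rest
```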
# Creating a pipeline to setup the model
pipeline = setup(data=train, target='NU_NOTA_MT',
remove_multicollinearity=True,
normalize_method='robust',
multicollinearity_threshold=0.95,
normalize=True,
transform_target=True,
session_id=1991)
Setup Succesfully Completed.
Description | Value | |
---|---|---|
0 | session_id | 1991 |
1 | Transform Target | True |
2 | Transform Target Method | yeo-johnson |
3 | Original Data | (13730, 20) |
4 | Missing Values | False |
5 | Numeric Features | 9 |
6 | Categorical Features | 10 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (13730, 20) |
11 | Transformed Train Set | (9610, 60) |
12 | Transformed Test Set | (4120, 60) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | True |
16 | Normalize Method | robust |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | True |
29 | Multicollinearity Threshold | 0.950000 |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
compare_models()
function trains all models in the model library and scores them using K-fold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 folds) for all the available models in the model library.
compare_models(fold=5)
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
0 | Extreme Gradient Boosting | 43.3216 | 4029.5616 | 63.4558 | 0.9449 | 0.1686 | 0.0862 | 1.6997 |
1 | Gradient Boosting Regressor | 43.3722 | 4042.4929 | 63.5589 | 0.9447 | 0.1671 | 0.0860 | 1.1562 |
2 | CatBoost Regressor | 43.9795 | 4145.4626 | 64.3591 | 0.9433 | 0.2114 | 0.0840 | 4.0103 |
3 | Random Forest | 44.3396 | 4258.4459 | 65.2399 | 0.9417 | 0.1696 | 0.0927 | 1.1249 |
4 | Light Gradient Boosting Machine | 43.9950 | 4259.0092 | 65.2418 | 0.9417 | 0.2036 | 0.0892 | 0.8384 |
5 | Extra Trees Regressor | 45.4451 | 4474.6227 | 66.8702 | 0.9388 | 0.1862 | 0.0950 | 1.0607 |
6 | Support Vector Machine | 49.9354 | 4923.2115 | 70.1513 | 0.9327 | 0.2652 | 0.0576 | 5.0518 |
7 | Huber Regressor | 50.4408 | 5102.6068 | 71.4191 | 0.9302 | 0.3316 | 0.0684 | 0.4597 |
8 | TheilSen Regressor | 50.5112 | 5107.2200 | 71.4531 | 0.9301 | 0.3766 | 0.0634 | 4.1116 |
9 | Ridge Regression | 50.9306 | 5112.9920 | 71.4930 | 0.9300 | 0.2772 | 0.0673 | 0.0315 |
10 | Bayesian Ridge | 50.9366 | 5114.1109 | 71.5008 | 0.9300 | 0.2757 | 0.0674 | 0.0742 |
11 | Orthogonal Matching Pursuit | 51.2477 | 5160.5718 | 71.8217 | 0.9294 | 0.2784 | 0.0676 | 0.0114 |
12 | AdaBoost Regressor | 57.2904 | 6362.2942 | 79.6530 | 0.9129 | 0.1963 | 0.1279 | 0.7479 |
13 | Passive Aggressive Regressor | 68.4823 | 7590.3098 | 87.0755 | 0.8961 | 0.5244 | -0.0036 | 0.0356 |
14 | Decision Tree | 61.6425 | 8456.2910 | 91.9125 | 0.8843 | 0.2385 | 0.1285 | 0.0831 |
15 | K Neighbors Regressor | 62.1216 | 11308.1117 | 106.3199 | 0.8454 | 0.3329 | 0.0158 | 0.0564 |
16 | Elastic Net | 101.4546 | 17137.7881 | 130.9072 | 0.7656 | 0.8277 | -0.2622 | 0.0220 |
17 | Lasso Regression | 135.9752 | 27060.6285 | 164.4996 | 0.6300 | 0.3364 | -0.4303 | 0.0160 |
18 | Lasso Least Angle Regression | 234.8026 | 73591.9650 | 271.2742 | -0.0062 | 0.7179 | -0.8333 | 0.0157 |
19 | Linear Regression | 516782043.6764 | 2566481464996111843328.0000 | 22656043246.8041 | -34828091114500956.0000 | 0.3362 | 1439051.0276 | 0.0350 |
20 | Random Sample Consensus | 10256633745.3887 | 1010957918896013998817280.0000 | 449657184787.5454 | -13719068301274478592.0000 | 0.3512 | 28561015.6825 | 2.5894 |
21 | Least Angle Regression | 4149691457749054.5000 | 165483615650858787290755350403219456.0000 | 181925048110950880.0000 | -2245673121919038750883225534464.0000 | 1.6141 | 11555390008799.6914 | 0.0374 |
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1, nthread=None, objective='reg:linear', random_state=1991, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=0)
create_model()
function creates a model and scores it using K-fold Cross Validation (default = 10 folds). The output prints a score grid that shows MAE, MSE, RMSE, RMSLE, R2 and MAPE. This function returns a trained model object.
model = create_model('xgboost', fold=5, round=2)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 42.33 | 3799.14 | 61.64 | 0.95 | 0.14 | 0.08 |
1 | 42.70 | 3919.92 | 62.61 | 0.95 | 0.19 | 0.09 |
2 | 42.90 | 4076.33 | 63.85 | 0.95 | 0.20 | 0.09 |
3 | 42.65 | 3920.33 | 62.61 | 0.95 | 0.13 | 0.08 |
4 | 46.03 | 4432.09 | 66.57 | 0.94 | 0.20 | 0.09 |
Mean | 43.32 | 4029.56 | 63.46 | 0.94 | 0.17 | 0.09 |
SD | 1.37 | 219.67 | 1.71 | 0.00 | 0.03 | 0.00 |
tune_model()
function tunes the hyperparameters of a model and scores it using K-fold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 folds). This function returns a trained model object.
model = tune_model(model, fold=5)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 45.6315 | 4401.0212 | 66.3402 | 0.9403 | 0.1628 | 0.0870 |
1 | 45.2794 | 4449.0194 | 66.7010 | 0.9402 | 0.2079 | 0.0867 |
2 | 45.4789 | 4493.2549 | 67.0317 | 0.9394 | 0.2133 | 0.0871 |
3 | 45.8112 | 4512.8476 | 67.1777 | 0.9374 | 0.1372 | 0.0875 |
4 | 48.7189 | 4907.9689 | 70.0569 | 0.9313 | 0.2172 | 0.0952 |
Mean | 46.1840 | 4552.8224 | 67.4615 | 0.9377 | 0.1877 | 0.0887 |
SD | 1.2795 | 181.7107 | 1.3294 | 0.0034 | 0.0320 | 0.0032 |
# Checking score after cross-validation:
predict_model(model);
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|---|
0 | Extreme Gradient Boosting Regressor | 44.9987 | 4317.789 | 65.7099 | 0.9403 | 0.1652 | 0.0895 |
# Checking model parameters:
print(model)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.9, gamma=0, importance_type='gain', learning_rate=0.02, max_delta_step=0, max_depth=30, min_child_weight=1, missing=None, n_estimators=700, n_jobs=-1, nthread=None, objective='reg:linear', random_state=1991, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=0.1, verbosity=0)
# Residuals Plot
plot_model(model, plot='residuals')
# Prediction Error
plot_model(model, plot='error')
# Cooks Distance Plot
plot_model(model, plot='cooks')
# Learning Curve
plot_model(model, plot='learning')
# Validation Curve
plot_model(model, plot='vc')
# Manifold Learning
plot_model(model, plot='manifold')
# Feature Importance
plot_model(model, plot='feature')
# Model Hyperparameter
plot_model(model, plot='parameter')
Parameters | |
---|---|
base_score | 0.5 |
booster | gbtree |
colsample_bylevel | 1 |
colsample_bynode | 1 |
colsample_bytree | 0.9 |
gamma | 0 |
importance_type | gain |
learning_rate | 0.02 |
max_delta_step | 0 |
max_depth | 30 |
min_child_weight | 1 |
missing | None |
n_estimators | 700 |
n_jobs | -1 |
nthread | None |
objective | reg:linear |
random_state | 1991 |
reg_alpha | 0 |
reg_lambda | 1 |
scale_pos_weight | 1 |
seed | None |
silent | None |
subsample | 0.1 |
verbosity | 0 |
predict_model()
is used to predict new data using a trained estimator.
predictions = predict_model(model, data=test, round=2)
predictions
TP_SEXO | TP_COR_RACA | TP_ESCOLA | NU_NOTA_CN | NU_NOTA_CH | NU_NOTA_LC | NU_NOTA_COMP1 | NU_NOTA_COMP2 | NU_NOTA_COMP3 | NU_NOTA_COMP4 | NU_NOTA_COMP5 | NU_NOTA_REDACAO | Q001 | Q002 | Q006 | Q024 | Q025 | Q026 | Q047 | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | F | 3 | 1 | 464.8 | 443.5 | 431.8 | 120.0 | 80.0 | 80.0 | 100.0 | 40.0 | 420.0 | B | A | C | A | A | C | A | 450.679993 |
1 | F | 3 | 1 | 391.1 | 491.1 | 548.0 | 120.0 | 120.0 | 120.0 | 120.0 | 100.0 | 580.0 | E | B | C | B | B | B | A | 499.760010 |
2 | M | 1 | 2 | 595.9 | 622.7 | 613.6 | 80.0 | 40.0 | 40.0 | 80.0 | 80.0 | 320.0 | E | E | D | B | B | A | A | 619.380005 |
3 | F | 3 | 1 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | H | E | G | B | B | A | A | -97.839996 |
4 | M | 1 | 2 | 592.9 | 492.6 | 571.4 | 100.0 | 80.0 | 60.0 | 80.0 | 0.0 | 320.0 | D | H | H | C | B | A | A | 605.760010 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4571 | F | 1 | 2 | 398.3 | 558.2 | 511.6 | 120.0 | 120.0 | 120.0 | 100.0 | 40.0 | 500.0 | E | E | D | A | B | A | A | 421.130005 |
4572 | M | 2 | 2 | 427.6 | 579.7 | 471.1 | 100.0 | 100.0 | 100.0 | 120.0 | 100.0 | 520.0 | C | C | C | A | A | A | A | 431.380005 |
4573 | M | 1 | 1 | 639.2 | 643.8 | 604.9 | 160.0 | 140.0 | 120.0 | 140.0 | 80.0 | 640.0 | D | F | D | B | B | A | D | 660.619995 |
4574 | M | 2 | 1 | 427.1 | 467.9 | 540.2 | 140.0 | 80.0 | 80.0 | 140.0 | 80.0 | 520.0 | C | E | C | A | A | A | A | 471.299988 |
4575 | M | 1 | 1 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | C | C | A | B | B | B | A | -98.129997 |
4576 rows × 20 columns
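The predicted scores can now be copied into the `answer` dataframe built at the start of the notebook; in PyCaret 2.0, predict_model writes the predictions to a column named Label. A sketch with toy stand-ins for the notebook's frames (the registration numbers below are placeholders):

```python
import pandas as pd

# Toy stand-ins for the notebook's `answer` and `predictions` dataframes.
answer = pd.DataFrame({'NU_INSCRICAO': ['reg_0001', 'reg_0002']})
predictions = pd.DataFrame({'Label': [450.68, 499.76]})

# PyCaret 2.0 stores predictions in the 'Label' column; align by position.
answer['NU_NOTA_MT'] = predictions['Label'].values
answer.to_csv('answer.csv', index=False)
print(answer)
```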
finalize_model()
function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare the model for final deployment after experimentation.
final_model = finalize_model(model)
save_model()
function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.
save_model(model, 'xgboost_model_04082020')
Transformation Pipeline and Model Succesfully Saved