In this example, we will predict the math score in the ENEM 2016 Brazilian National Exam. The dataset was obtained from INEP, a department of the Brazilian Ministry of Education, and can be downloaded from INEP's site. It contains data on the applicants for the 2016 National High School Exam: not only the exam results, but also the social and economic context of the applicants. See the Enem 2016 Microdata documentation for a description of each field.
You will have two datasets - train.csv and test.csv - to use for predicting math scores (NU_NOTA_MT). The train file contains a substantial sample of ENEM 2016 data and most of the original columns, for those who would like to do some EDA.
This notebook was created using PyCaret 2.0 by Simone Perazzoli. Last updated : 04-08-2020
#!pip3 install pycaret==2.0
import numpy as np
import pandas as pd
import pycaret
import seaborn as sns
import matplotlib.pyplot as plt
from pycaret.regression import *
from scipy.stats import kurtosis, skew
pd.set_option('display.max_columns',200)
# checking pycaret version
from pycaret.utils import version
version()
2.0
# Train dataset
df_train = pd.read_csv('train.csv')
# Test dataset
df_test = pd.read_csv('test.csv')
# Creating answer dataframe
answer = pd.DataFrame()
# Saving the registration number:
answer['NU_INSCRICAO'] = df_test['NU_INSCRICAO']
# Dropping the registration number from train and test dataframes:
df_train.drop(['NU_INSCRICAO'], axis=1, inplace=True)
df_test.drop(['NU_INSCRICAO'], axis=1, inplace=True)
# Checking dataframe shape
df_train.shape, df_test.shape
((13730, 166), (4576, 46))
# By checking the shapes we can see the training data has more columns than the test data,
# so we keep only the features that also exist in the test dataframe (plus the target)
# and use those to decide which features to include in the prediction.
cols = list(df_test)
cols.append('NU_NOTA_MT')
train = df_train[cols].copy()  # .copy() avoids SettingWithCopyWarning on later in-place edits
test = df_test.copy()
# Viewing training data:
train.head()
CO_UF_RESIDENCIA | SG_UF_RESIDENCIA | NU_IDADE | TP_SEXO | TP_COR_RACA | TP_NACIONALIDADE | TP_ST_CONCLUSAO | TP_ANO_CONCLUIU | TP_ESCOLA | TP_ENSINO | IN_TREINEIRO | TP_DEPENDENCIA_ADM_ESC | IN_BAIXA_VISAO | IN_CEGUEIRA | IN_SURDEZ | IN_DISLEXIA | IN_DISCALCULIA | IN_SABATISTA | IN_GESTANTE | IN_IDOSO | TP_PRESENCA_CN | TP_PRESENCA_CH | TP_PRESENCA_LC | CO_PROVA_CN | CO_PROVA_CH | CO_PROVA_LC | CO_PROVA_MT | NU_NOTA_CN | NU_NOTA_CH | NU_NOTA_LC | TP_LINGUA | TP_STATUS_REDACAO | NU_NOTA_COMP1 | NU_NOTA_COMP2 | NU_NOTA_COMP3 | NU_NOTA_COMP4 | NU_NOTA_COMP5 | NU_NOTA_REDACAO | Q001 | Q002 | Q006 | Q024 | Q025 | Q026 | Q027 | Q047 | NU_NOTA_MT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 43 | RS | 24 | M | 1 | 1 | 1 | 4 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 16f84b7b3d2aeaff7d2f01297e6b3d0e25c77bb2 | 9cd70f1b922e02bd33453b3f607f5a644fb9b1b8 | 01af53cd161a420fff1767129c10de560cc264dd | 97caab1e1533dba217deb7ef41490f52e459ab01 | 436.3 | 495.4 | 581.2 | 1 | 1.0 | 120.0 | 120.0 | 120.0 | 80.0 | 80.0 | 520.0 | D | D | C | A | A | C | H | A | 399.4 |
1 | 23 | CE | 17 | F | 3 | 1 | 2 | 0 | 2 | 1.0 | 0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | b9b06ce8c319a3df2158ea3d0aef0f7d3eecaed7 | 909237ab0d84688e10c0470e2997348aff585273 | 01af53cd161a420fff1767129c10de560cc264dd | 97caab1e1533dba217deb7ef41490f52e459ab01 | 474.5 | 544.1 | 599.0 | 1 | 1.0 | 140.0 | 120.0 | 120.0 | 120.0 | 80.0 | 580.0 | A | A | B | A | A | A | NaN | A | 459.8 |
2 | 23 | CE | 21 | F | 3 | 1 | 3 | 0 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | D | D | C | A | A | A | NaN | A | NaN |
3 | 33 | RJ | 25 | F | 0 | 1 | 1 | 9 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | H | E | E | C | B | C | F | D | NaN |
4 | 13 | AM | 28 | M | 2 | 1 | 1 | 4 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | D | C | A | A | B | F | A | NaN |
# Viewing test data:
test.head()
CO_UF_RESIDENCIA | SG_UF_RESIDENCIA | NU_IDADE | TP_SEXO | TP_COR_RACA | TP_NACIONALIDADE | TP_ST_CONCLUSAO | TP_ANO_CONCLUIU | TP_ESCOLA | TP_ENSINO | IN_TREINEIRO | TP_DEPENDENCIA_ADM_ESC | IN_BAIXA_VISAO | IN_CEGUEIRA | IN_SURDEZ | IN_DISLEXIA | IN_DISCALCULIA | IN_SABATISTA | IN_GESTANTE | IN_IDOSO | TP_PRESENCA_CN | TP_PRESENCA_CH | TP_PRESENCA_LC | CO_PROVA_CN | CO_PROVA_CH | CO_PROVA_LC | CO_PROVA_MT | NU_NOTA_CN | NU_NOTA_CH | NU_NOTA_LC | TP_LINGUA | TP_STATUS_REDACAO | NU_NOTA_COMP1 | NU_NOTA_COMP2 | NU_NOTA_COMP3 | NU_NOTA_COMP4 | NU_NOTA_COMP5 | NU_NOTA_REDACAO | Q001 | Q002 | Q006 | Q024 | Q025 | Q026 | Q027 | Q047 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | PR | 22 | F | 3 | 1 | 1 | 5 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 16f84b7b3d2aeaff7d2f01297e6b3d0e25c77bb2 | 9cd70f1b922e02bd33453b3f607f5a644fb9b1b8 | 01abbb7f1a90505385f44eec9905f82ca2a42cfd | 81d0ee00ef42a7c23eb04496458c03d4c5b9c31a | 464.8 | 443.5 | 431.8 | 0 | 1.0 | 120.0 | 80.0 | 80.0 | 100.0 | 40.0 | 420.0 | B | A | C | A | A | C | C | A |
1 | 21 | MA | 26 | F | 3 | 1 | 1 | 8 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | c8328ebc6f3238e06076c481bc1b82b8301e7a3f | f48d390ab6a2428e659c37fb8a9d00afde621889 | 72f80e4b3150c627c7ffc93cfe0fa13a9989b610 | 577f8968d95046f5eb5cc158608e12fa9ba34c85 | 391.1 | 491.1 | 548.0 | 1 | 1.0 | 120.0 | 120.0 | 120.0 | 120.0 | 100.0 | 580.0 | E | B | C | B | B | B | F | A |
2 | 23 | CE | 21 | M | 1 | 1 | 2 | 0 | 2 | 3.0 | 0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 16f84b7b3d2aeaff7d2f01297e6b3d0e25c77bb2 | 9cd70f1b922e02bd33453b3f607f5a644fb9b1b8 | 01af53cd161a420fff1767129c10de560cc264dd | 97caab1e1533dba217deb7ef41490f52e459ab01 | 595.9 | 622.7 | 613.6 | 0 | 1.0 | 80.0 | 40.0 | 40.0 | 80.0 | 80.0 | 320.0 | E | E | D | B | B | A | NaN | A |
3 | 15 | PA | 27 | F | 3 | 1 | 1 | 8 | 1 | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | 2d22ac1d42e6187f09ee6c578df187a760123ccf | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | H | E | G | B | B | A | NaN | A |
4 | 41 | PR | 18 | M | 1 | 1 | 2 | 0 | 2 | 1.0 | 0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 66b1dad288e13be0992bae01e81f71eca1c6e8a6 | 942ab3dc020af4cf53740b6b07e9dd7060b24164 | 5aebe5cad7fabc1545ac7fba07a4e6177f98483c | 767a32545304ed293242d528f54d4edb1369f910 | 592.9 | 492.6 | 571.4 | 1 | 1.0 | 100.0 | 80.0 | 60.0 | 80.0 | 0.0 | 320.0 | D | H | H | C | B | A | NaN | A |
# Checking dataframe shape after transformation
train.shape, test.shape
((13730, 47), (4576, 46))
# Creating a function to summarize dataframe information
def data_summary(df):
    '''Summarize column types, missing values and cardinality'''
    summary = pd.DataFrame({'type': df.dtypes,
                            'amount': df.isna().sum(),
                            'null_values (%)': (df.isna().sum() / df.shape[0]) * 100,
                            'unique': df.nunique()})
    return summary
# Train summary:
data_summary(train)
type | amount | null_values (%) | unique | |
---|---|---|---|---|
CO_UF_RESIDENCIA | int64 | 0 | 0.000000 | 27 |
SG_UF_RESIDENCIA | object | 0 | 0.000000 | 27 |
NU_IDADE | int64 | 0 | 0.000000 | 55 |
TP_SEXO | object | 0 | 0.000000 | 2 |
TP_COR_RACA | int64 | 0 | 0.000000 | 6 |
TP_NACIONALIDADE | int64 | 0 | 0.000000 | 5 |
TP_ST_CONCLUSAO | int64 | 0 | 0.000000 | 4 |
TP_ANO_CONCLUIU | int64 | 0 | 0.000000 | 11 |
TP_ESCOLA | int64 | 0 | 0.000000 | 4 |
TP_ENSINO | float64 | 9448 | 68.812819 | 3 |
IN_TREINEIRO | int64 | 0 | 0.000000 | 2 |
TP_DEPENDENCIA_ADM_ESC | float64 | 9448 | 68.812819 | 4 |
IN_BAIXA_VISAO | int64 | 0 | 0.000000 | 2 |
IN_CEGUEIRA | int64 | 0 | 0.000000 | 1 |
IN_SURDEZ | int64 | 0 | 0.000000 | 2 |
IN_DISLEXIA | int64 | 0 | 0.000000 | 2 |
IN_DISCALCULIA | int64 | 0 | 0.000000 | 2 |
IN_SABATISTA | int64 | 0 | 0.000000 | 2 |
IN_GESTANTE | int64 | 0 | 0.000000 | 2 |
IN_IDOSO | int64 | 0 | 0.000000 | 2 |
TP_PRESENCA_CN | int64 | 0 | 0.000000 | 3 |
TP_PRESENCA_CH | int64 | 0 | 0.000000 | 3 |
TP_PRESENCA_LC | int64 | 0 | 0.000000 | 3 |
CO_PROVA_CN | object | 0 | 0.000000 | 10 |
CO_PROVA_CH | object | 0 | 0.000000 | 10 |
CO_PROVA_LC | object | 0 | 0.000000 | 9 |
CO_PROVA_MT | object | 0 | 0.000000 | 9 |
NU_NOTA_CN | float64 | 3389 | 24.683176 | 2692 |
NU_NOTA_CH | float64 | 3389 | 24.683176 | 2978 |
NU_NOTA_LC | float64 | 3597 | 26.198106 | 2774 |
TP_LINGUA | int64 | 0 | 0.000000 | 2 |
TP_STATUS_REDACAO | float64 | 3597 | 26.198106 | 9 |
NU_NOTA_COMP1 | float64 | 3597 | 26.198106 | 15 |
NU_NOTA_COMP2 | float64 | 3597 | 26.198106 | 13 |
NU_NOTA_COMP3 | float64 | 3597 | 26.198106 | 12 |
NU_NOTA_COMP4 | float64 | 3597 | 26.198106 | 14 |
NU_NOTA_COMP5 | float64 | 3597 | 26.198106 | 14 |
NU_NOTA_REDACAO | float64 | 3597 | 26.198106 | 53 |
Q001 | object | 0 | 0.000000 | 8 |
Q002 | object | 0 | 0.000000 | 8 |
Q006 | object | 0 | 0.000000 | 17 |
Q024 | object | 0 | 0.000000 | 5 |
Q025 | object | 0 | 0.000000 | 2 |
Q026 | object | 0 | 0.000000 | 3 |
Q027 | object | 7373 | 53.699927 | 13 |
Q047 | object | 0 | 0.000000 | 5 |
NU_NOTA_MT | float64 | 3597 | 26.198106 | 3406 |
# Test summary:
data_summary(test)
type | amount | null_values (%) | unique | |
---|---|---|---|---|
CO_UF_RESIDENCIA | int64 | 0 | 0.000000 | 27 |
SG_UF_RESIDENCIA | object | 0 | 0.000000 | 27 |
NU_IDADE | int64 | 0 | 0.000000 | 46 |
TP_SEXO | object | 0 | 0.000000 | 2 |
TP_COR_RACA | int64 | 0 | 0.000000 | 6 |
TP_NACIONALIDADE | int64 | 0 | 0.000000 | 5 |
TP_ST_CONCLUSAO | int64 | 0 | 0.000000 | 4 |
TP_ANO_CONCLUIU | int64 | 0 | 0.000000 | 11 |
TP_ESCOLA | int64 | 0 | 0.000000 | 3 |
TP_ENSINO | float64 | 3096 | 67.657343 | 3 |
IN_TREINEIRO | int64 | 0 | 0.000000 | 2 |
TP_DEPENDENCIA_ADM_ESC | float64 | 3096 | 67.657343 | 4 |
IN_BAIXA_VISAO | int64 | 0 | 0.000000 | 2 |
IN_CEGUEIRA | int64 | 0 | 0.000000 | 1 |
IN_SURDEZ | int64 | 0 | 0.000000 | 2 |
IN_DISLEXIA | int64 | 0 | 0.000000 | 1 |
IN_DISCALCULIA | int64 | 0 | 0.000000 | 1 |
IN_SABATISTA | int64 | 0 | 0.000000 | 2 |
IN_GESTANTE | int64 | 0 | 0.000000 | 2 |
IN_IDOSO | int64 | 0 | 0.000000 | 1 |
TP_PRESENCA_CN | int64 | 0 | 0.000000 | 2 |
TP_PRESENCA_CH | int64 | 0 | 0.000000 | 2 |
TP_PRESENCA_LC | int64 | 0 | 0.000000 | 3 |
CO_PROVA_CN | object | 0 | 0.000000 | 9 |
CO_PROVA_CH | object | 0 | 0.000000 | 9 |
CO_PROVA_LC | object | 0 | 0.000000 | 9 |
CO_PROVA_MT | object | 0 | 0.000000 | 9 |
NU_NOTA_CN | float64 | 1134 | 24.781469 | 1823 |
NU_NOTA_CH | float64 | 1134 | 24.781469 | 1969 |
NU_NOTA_LC | float64 | 1199 | 26.201923 | 1839 |
TP_LINGUA | int64 | 0 | 0.000000 | 2 |
TP_STATUS_REDACAO | float64 | 1199 | 26.201923 | 9 |
NU_NOTA_COMP1 | float64 | 1199 | 26.201923 | 10 |
NU_NOTA_COMP2 | float64 | 1199 | 26.201923 | 10 |
NU_NOTA_COMP3 | float64 | 1199 | 26.201923 | 11 |
NU_NOTA_COMP4 | float64 | 1199 | 26.201923 | 11 |
NU_NOTA_COMP5 | float64 | 1199 | 26.201923 | 11 |
NU_NOTA_REDACAO | float64 | 1199 | 26.201923 | 44 |
Q001 | object | 0 | 0.000000 | 8 |
Q002 | object | 0 | 0.000000 | 8 |
Q006 | object | 0 | 0.000000 | 17 |
Q024 | object | 0 | 0.000000 | 5 |
Q025 | object | 0 | 0.000000 | 2 |
Q026 | object | 0 | 0.000000 | 3 |
Q027 | object | 2488 | 54.370629 | 13 |
Q047 | object | 0 | 0.000000 | 5 |
# Checking the distribution of the variable:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,6))
sns.distplot(train.NU_NOTA_MT, bins=25)
plt.xlabel('Score')
plt.title('Distribution of math scores');
# Descriptive statistics for target:
train['NU_NOTA_MT'].describe()
count    10133.000000
mean       482.497928
std         99.826323
min          0.000000
25%        408.900000
50%        461.200000
75%        537.600000
max        952.000000
Name: NU_NOTA_MT, dtype: float64
print(f'Kurtosis: {train.NU_NOTA_MT.kurt()}')
print(f'Asymmetry: {train.NU_NOTA_MT.skew()}')
Kurtosis: 1.4225025820577502
Asymmetry: 0.9206896733932955
Kurtosis helps identify the weight of the distribution tails and the presence of outliers. Pandas reports excess kurtosis (a normal distribution scores 0), so the value of about 1.42 is positive: the tails are somewhat heavier than those of a normal distribution, meaning more extreme scores than a normal curve would produce.
The positive asymmetry (skewness) means the distribution has a longer tail on the right side. With a skewness value between 0.5 and 1, the data is moderately skewed.
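To make these conventions concrete, here is a small sketch on toy data (my own example, not from the notebook): pandas `.kurt()` returns excess kurtosis, and `.skew()` is positive when the right tail is longer.

```python
import pandas as pd

# Flat, symmetric values: excess kurtosis comes out negative (flatter than normal).
flat = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# A few large values stretch the right tail: skewness comes out positive.
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 4, 8, 15, 30])

print(flat.kurt())          # negative: lighter tails than a normal curve
print(right_tailed.skew())  # positive: longer tail on the right
```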
# Creating a function to remove irrelevant features
def data_cleaning(df):
    '''Remove features that are mostly missing or unlikely to help the prediction'''
    df.drop(['TP_DEPENDENCIA_ADM_ESC',
             'TP_ENSINO',
             'CO_PROVA_CN',
             'CO_PROVA_CH',
             'CO_PROVA_LC',
             'CO_PROVA_MT',
             'SG_UF_RESIDENCIA',
             'CO_UF_RESIDENCIA',
             'TP_NACIONALIDADE',
             'IN_BAIXA_VISAO',
             'IN_CEGUEIRA',
             'IN_SURDEZ',
             'IN_DISLEXIA',
             'IN_DISCALCULIA',
             'IN_SABATISTA',
             'IN_GESTANTE',
             'IN_IDOSO',
             'TP_ANO_CONCLUIU', 'TP_PRESENCA_CN',
             'TP_LINGUA', 'TP_PRESENCA_CH',
             'IN_TREINEIRO', 'TP_PRESENCA_LC',
             'TP_ST_CONCLUSAO',
             'TP_STATUS_REDACAO',
             'NU_IDADE',
             'Q027'], axis=1, inplace=True)
    return df
# Cleaning data:
data_cleaning(train)
data_cleaning(test)
train.shape, test.shape
((13730, 20), (4576, 19))
# Creating a function to impute missing values:
def data_imputation(df):
    '''Fill missing numeric values with the sentinel -100'''
    # df.dtypes.replace maps every float64 column to -100, so fillna
    # fills only the numeric columns (all remaining NaNs here are numeric)
    df.fillna(df.dtypes.replace({'float64': -100}), inplace=True)
    return df
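The `df.dtypes.replace` idiom above is compact but easy to misread. An equivalent, more explicit sketch (my own variant, not part of the notebook) selects the float columns first and fills only those:

```python
import numpy as np
import pandas as pd

def impute_floats(df, sentinel=-100):
    '''Fill NaNs in float64 columns with a sentinel, leaving object columns untouched.'''
    float_cols = df.select_dtypes(include='float64').columns
    df[float_cols] = df[float_cols].fillna(sentinel)
    return df

# Toy frame: one numeric score column with a gap, one categorical answer column.
toy = pd.DataFrame({'NU_NOTA_CN': [464.8, np.nan], 'Q001': ['B', 'E']})
impute_floats(toy)
print(toy)  # the missing score becomes -100; Q001 is unchanged
```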
data_imputation(train)
train.head();
data_imputation(test)
test.head()
TP_SEXO | TP_COR_RACA | TP_ESCOLA | NU_NOTA_CN | NU_NOTA_CH | NU_NOTA_LC | NU_NOTA_COMP1 | NU_NOTA_COMP2 | NU_NOTA_COMP3 | NU_NOTA_COMP4 | NU_NOTA_COMP5 | NU_NOTA_REDACAO | Q001 | Q002 | Q006 | Q024 | Q025 | Q026 | Q047 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | F | 3 | 1 | 464.8 | 443.5 | 431.8 | 120.0 | 80.0 | 80.0 | 100.0 | 40.0 | 420.0 | B | A | C | A | A | C | A |
1 | F | 3 | 1 | 391.1 | 491.1 | 548.0 | 120.0 | 120.0 | 120.0 | 120.0 | 100.0 | 580.0 | E | B | C | B | B | B | A |
2 | M | 1 | 2 | 595.9 | 622.7 | 613.6 | 80.0 | 40.0 | 40.0 | 80.0 | 80.0 | 320.0 | E | E | D | B | B | A | A |
3 | F | 3 | 1 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | H | E | G | B | B | A | A |
4 | M | 1 | 2 | 592.9 | 492.6 | 571.4 | 100.0 | 80.0 | 60.0 | 80.0 | 0.0 | 320.0 | D | H | H | C | B | A | A |
setup()
function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. It must be called before executing any other function and takes two mandatory parameters: the dataframe (array-like, sparse matrix) and the name of the target column. All other parameters are optional.
data: train data.
target: target feature.
remove_multicollinearity: When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.
multicollinearity_threshold: Threshold used for dropping the correlated features.
normalize: When set to True, the feature space is transformed using the normalized_method param. Generally, linear algorithms perform better with normalized data however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.
normalize_method: Defines the method to be used for normalization.
transform_target: When set to True, target variable is transformed using the method defined in transform_target_method param. Target transformation is applied separately from feature transformations.
session_id: If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
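For intuition on normalize_method='robust': robust scaling centers each feature on its median and divides by the interquartile range, so outlying scores barely influence the scale. A minimal numpy sketch of the idea (not PyCaret's internal implementation):

```python
import numpy as np

def robust_scale(x):
    '''Center on the median and scale by the IQR (75th minus 25th percentile).'''
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

scores = np.array([400.0, 450.0, 500.0, 550.0, 952.0])  # one outlying score
print(robust_scale(scores))  # the median maps to 0; the outlier stays extreme
                             # but does not inflate the scale of the rest
```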
# Creating a pipeline to setup the model
pipeline = setup(data=train, target='NU_NOTA_MT',
remove_multicollinearity=True,
normalize_method='robust',
multicollinearity_threshold=0.95,
normalize=True,
transform_target=True,
session_id=1991)
Setup Succesfully Completed.
Description | Value | |
---|---|---|
0 | session_id | 1991 |
1 | Transform Target | True |
2 | Transform Target Method | yeo-johnson |
3 | Original Data | (13730, 20) |
4 | Missing Values | False |
5 | Numeric Features | 9 |
6 | Categorical Features | 10 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (13730, 20) |
11 | Transformed Train Set | (9610, 60) |
12 | Transformed Test Set | (4120, 60) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | True |
16 | Normalize Method | robust |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | True |
29 | Multicollinearity Threshold | 0.950000 |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
compare_models()
function trains all models in the model library and scores them using K-fold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 folds) for all the available models in the model library.
compare_models(fold=5)
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
0 | Extreme Gradient Boosting | 43.3216 | 4029.5616 | 63.4558 | 0.9449 | 0.1686 | 0.0862 | 1.6997 |
1 | Gradient Boosting Regressor | 43.3722 | 4042.4929 | 63.5589 | 0.9447 | 0.1671 | 0.0860 | 1.1562 |
2 | CatBoost Regressor | 43.9795 | 4145.4626 | 64.3591 | 0.9433 | 0.2114 | 0.0840 | 4.0103 |
3 | Random Forest | 44.3396 | 4258.4459 | 65.2399 | 0.9417 | 0.1696 | 0.0927 | 1.1249 |
4 | Light Gradient Boosting Machine | 43.9950 | 4259.0092 | 65.2418 | 0.9417 | 0.2036 | 0.0892 | 0.8384 |
5 | Extra Trees Regressor | 45.4451 | 4474.6227 | 66.8702 | 0.9388 | 0.1862 | 0.0950 | 1.0607 |
6 | Support Vector Machine | 49.9354 | 4923.2115 | 70.1513 | 0.9327 | 0.2652 | 0.0576 | 5.0518 |
7 | Huber Regressor | 50.4408 | 5102.6068 | 71.4191 | 0.9302 | 0.3316 | 0.0684 | 0.4597 |
8 | TheilSen Regressor | 50.5112 | 5107.2200 | 71.4531 | 0.9301 | 0.3766 | 0.0634 | 4.1116 |
9 | Ridge Regression | 50.9306 | 5112.9920 | 71.4930 | 0.9300 | 0.2772 | 0.0673 | 0.0315 |
10 | Bayesian Ridge | 50.9366 | 5114.1109 | 71.5008 | 0.9300 | 0.2757 | 0.0674 | 0.0742 |
11 | Orthogonal Matching Pursuit | 51.2477 | 5160.5718 | 71.8217 | 0.9294 | 0.2784 | 0.0676 | 0.0114 |
12 | AdaBoost Regressor | 57.2904 | 6362.2942 | 79.6530 | 0.9129 | 0.1963 | 0.1279 | 0.7479 |
13 | Passive Aggressive Regressor | 68.4823 | 7590.3098 | 87.0755 | 0.8961 | 0.5244 | -0.0036 | 0.0356 |
14 | Decision Tree | 61.6425 | 8456.2910 | 91.9125 | 0.8843 | 0.2385 | 0.1285 | 0.0831 |
15 | K Neighbors Regressor | 62.1216 | 11308.1117 | 106.3199 | 0.8454 | 0.3329 | 0.0158 | 0.0564 |
16 | Elastic Net | 101.4546 | 17137.7881 | 130.9072 | 0.7656 | 0.8277 | -0.2622 | 0.0220 |
17 | Lasso Regression | 135.9752 | 27060.6285 | 164.4996 | 0.6300 | 0.3364 | -0.4303 | 0.0160 |
18 | Lasso Least Angle Regression | 234.8026 | 73591.9650 | 271.2742 | -0.0062 | 0.7179 | -0.8333 | 0.0157 |
19 | Linear Regression | 516782043.6764 | 2566481464996111843328.0000 | 22656043246.8041 | -34828091114500956.0000 | 0.3362 | 1439051.0276 | 0.0350 |
20 | Random Sample Consensus | 10256633745.3887 | 1010957918896013998817280.0000 | 449657184787.5454 | -13719068301274478592.0000 | 0.3512 | 28561015.6825 | 2.5894 |
21 | Least Angle Regression | 4149691457749054.5000 | 165483615650858787290755350403219456.0000 | 181925048110950880.0000 | -2245673121919038750883225534464.0000 | 1.6141 | 11555390008799.6914 | 0.0374 |
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1, nthread=None, objective='reg:linear', random_state=1991, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=0)
create_model()
function creates a model and scores it using K-fold Cross Validation (default = 10 folds). The output prints a score grid that shows MAE, MSE, RMSE, RMSLE, R2 and MAPE. This function returns a trained model object.
model = create_model('xgboost', fold=5, round=2)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 42.33 | 3799.14 | 61.64 | 0.95 | 0.14 | 0.08 |
1 | 42.70 | 3919.92 | 62.61 | 0.95 | 0.19 | 0.09 |
2 | 42.90 | 4076.33 | 63.85 | 0.95 | 0.20 | 0.09 |
3 | 42.65 | 3920.33 | 62.61 | 0.95 | 0.13 | 0.08 |
4 | 46.03 | 4432.09 | 66.57 | 0.94 | 0.20 | 0.09 |
Mean | 43.32 | 4029.56 | 63.46 | 0.94 | 0.17 | 0.09 |
SD | 1.37 | 219.67 | 1.71 | 0.00 | 0.03 | 0.00 |
tune_model()
function tunes the hyperparameters of a model and scores it using K-fold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 folds). This function returns a trained model object.
model = tune_model(model, fold=5)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 45.6315 | 4401.0212 | 66.3402 | 0.9403 | 0.1628 | 0.0870 |
1 | 45.2794 | 4449.0194 | 66.7010 | 0.9402 | 0.2079 | 0.0867 |
2 | 45.4789 | 4493.2549 | 67.0317 | 0.9394 | 0.2133 | 0.0871 |
3 | 45.8112 | 4512.8476 | 67.1777 | 0.9374 | 0.1372 | 0.0875 |
4 | 48.7189 | 4907.9689 | 70.0569 | 0.9313 | 0.2172 | 0.0952 |
Mean | 46.1840 | 4552.8224 | 67.4615 | 0.9377 | 0.1877 | 0.0887 |
SD | 1.2795 | 181.7107 | 1.3294 | 0.0034 | 0.0320 | 0.0032 |
# Checking score after cross-validation:
predict_model(model);
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|---|
0 | Extreme Gradient Boosting Regressor | 44.9987 | 4317.789 | 65.7099 | 0.9403 | 0.1652 | 0.0895 |
# Checking model parameters:
print(model)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.9, gamma=0, importance_type='gain', learning_rate=0.02, max_delta_step=0, max_depth=30, min_child_weight=1, missing=None, n_estimators=700, n_jobs=-1, nthread=None, objective='reg:linear', random_state=1991, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=0.1, verbosity=0)
# Residuals Plot
plot_model(model, plot='residuals')
# Prediction Error
plot_model(model, plot='error')
# Cooks Distance Plot
plot_model(model, plot='cooks')
# Learning Curve
plot_model(model, plot='learning')
# Validation Curve
plot_model(model, plot='vc')
# Manifold Learning
plot_model(model, plot='manifold')
# Feature Importance
plot_model(model, plot='feature')
# Model Hyperparameter
plot_model(model, plot='parameter')
Parameters | |
---|---|
base_score | 0.5 |
booster | gbtree |
colsample_bylevel | 1 |
colsample_bynode | 1 |
colsample_bytree | 0.9 |
gamma | 0 |
importance_type | gain |
learning_rate | 0.02 |
max_delta_step | 0 |
max_depth | 30 |
min_child_weight | 1 |
missing | None |
n_estimators | 700 |
n_jobs | -1 |
nthread | None |
objective | reg:linear |
random_state | 1991 |
reg_alpha | 0 |
reg_lambda | 1 |
scale_pos_weight | 1 |
seed | None |
silent | None |
subsample | 0.1 |
verbosity | 0 |
predict_model()
is used to predict new data using a trained estimator.
predictions = predict_model(model, data=test, round=2)
predictions
TP_SEXO | TP_COR_RACA | TP_ESCOLA | NU_NOTA_CN | NU_NOTA_CH | NU_NOTA_LC | NU_NOTA_COMP1 | NU_NOTA_COMP2 | NU_NOTA_COMP3 | NU_NOTA_COMP4 | NU_NOTA_COMP5 | NU_NOTA_REDACAO | Q001 | Q002 | Q006 | Q024 | Q025 | Q026 | Q047 | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | F | 3 | 1 | 464.8 | 443.5 | 431.8 | 120.0 | 80.0 | 80.0 | 100.0 | 40.0 | 420.0 | B | A | C | A | A | C | A | 450.679993 |
1 | F | 3 | 1 | 391.1 | 491.1 | 548.0 | 120.0 | 120.0 | 120.0 | 120.0 | 100.0 | 580.0 | E | B | C | B | B | B | A | 499.760010 |
2 | M | 1 | 2 | 595.9 | 622.7 | 613.6 | 80.0 | 40.0 | 40.0 | 80.0 | 80.0 | 320.0 | E | E | D | B | B | A | A | 619.380005 |
3 | F | 3 | 1 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | H | E | G | B | B | A | A | -97.839996 |
4 | M | 1 | 2 | 592.9 | 492.6 | 571.4 | 100.0 | 80.0 | 60.0 | 80.0 | 0.0 | 320.0 | D | H | H | C | B | A | A | 605.760010 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4571 | F | 1 | 2 | 398.3 | 558.2 | 511.6 | 120.0 | 120.0 | 120.0 | 100.0 | 40.0 | 500.0 | E | E | D | A | B | A | A | 421.130005 |
4572 | M | 2 | 2 | 427.6 | 579.7 | 471.1 | 100.0 | 100.0 | 100.0 | 120.0 | 100.0 | 520.0 | C | C | C | A | A | A | A | 431.380005 |
4573 | M | 1 | 1 | 639.2 | 643.8 | 604.9 | 160.0 | 140.0 | 120.0 | 140.0 | 80.0 | 640.0 | D | F | D | B | B | A | D | 660.619995 |
4574 | M | 2 | 1 | 427.1 | 467.9 | 540.2 | 140.0 | 80.0 | 80.0 | 140.0 | 80.0 | 520.0 | C | E | C | A | A | A | A | 471.299988 |
4575 | M | 1 | 1 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | -100.0 | C | C | A | B | B | B | A | -98.129997 |
4576 rows × 20 columns
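The predicted scores can now be copied into the `answer` dataframe built at the start of the notebook; in PyCaret 2.0, predict_model writes the predictions to a column named Label. A sketch with toy stand-ins for the notebook's frames (the registration numbers below are placeholders):

```python
import pandas as pd

# Toy stand-ins for the notebook's `answer` and `predictions` dataframes.
answer = pd.DataFrame({'NU_INSCRICAO': ['reg_0001', 'reg_0002']})
predictions = pd.DataFrame({'Label': [450.68, 499.76]})

# PyCaret 2.0 stores predictions in the 'Label' column; align by position.
answer['NU_NOTA_MT'] = predictions['Label'].values
answer.to_csv('answer.csv', index=False)
print(answer)
```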
finalize_model()
function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare the model for final deployment after experimentation.
final_model = finalize_model(model)
save_model()
function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.
save_model(model, 'xgboost_model_04082020')
Transformation Pipeline and Model Succesfully Saved