You have been hired as a data scientist at a used car dealership in the UK. The sales team have been having problems with pricing used cars that arrive at the dealership and would like your help. Before they take any company-wide action, they would like you to work with the Toyota specialist to test your idea. They have already collected data from other retailers on the prices that a range of Toyota cars were listed at. It is known that cars priced more than £1,500 above the estimated price will not sell. The sales team wants to know whether you can make predictions within this range.
The presentation of your findings should be targeted at the Head of Sales, who has no technical data science background.
The data you will use for this analysis can be accessed here: "data/toyota.csv"
Let's start by loading the libraries we will need in our analysis.
# import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
import pickle
# import pipeline and preprocessing libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# import model selection and metrics libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error
# import the different models we are going to use
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter(action='ignore', category=ConvergenceWarning)
df = pd.read_csv('data/toyota.csv')
We can see our features are: model, year, transmission, mileage, fuelType, tax, mpg, and engineSize.
And our target: price.
df.head(10)
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
0 | GT86 | 2016 | 16000 | Manual | 24089 | Petrol | 265 | 36.2 | 2.0 |
1 | GT86 | 2017 | 15995 | Manual | 18615 | Petrol | 145 | 36.2 | 2.0 |
2 | GT86 | 2015 | 13998 | Manual | 27469 | Petrol | 265 | 36.2 | 2.0 |
3 | GT86 | 2017 | 18998 | Manual | 14736 | Petrol | 150 | 36.2 | 2.0 |
4 | GT86 | 2017 | 17498 | Manual | 36284 | Petrol | 145 | 36.2 | 2.0 |
5 | GT86 | 2017 | 15998 | Manual | 26919 | Petrol | 260 | 36.2 | 2.0 |
6 | GT86 | 2017 | 18522 | Manual | 10456 | Petrol | 145 | 36.2 | 2.0 |
7 | GT86 | 2017 | 18995 | Manual | 12340 | Petrol | 145 | 36.2 | 2.0 |
8 | GT86 | 2020 | 27998 | Manual | 516 | Petrol | 150 | 33.2 | 2.0 |
9 | GT86 | 2016 | 13990 | Manual | 37999 | Petrol | 265 | 36.2 | 2.0 |
df.tail(10)
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
6728 | IQ | 2009 | 2699 | Manual | 74095 | Petrol | 0 | 65.7 | 1.0 |
6729 | IQ | 2013 | 4950 | Manual | 72913 | Petrol | 30 | 54.3 | 1.3 |
6730 | IQ | 2013 | 4849 | Manual | 41142 | Petrol | 0 | 64.2 | 1.0 |
6731 | IQ | 2011 | 5995 | Automatic | 29000 | Petrol | 20 | 58.9 | 1.0 |
6732 | IQ | 2012 | 2495 | Manual | 72498 | Petrol | 0 | 64.2 | 1.0 |
6733 | IQ | 2011 | 5500 | Automatic | 30000 | Petrol | 20 | 58.9 | 1.0 |
6734 | Urban Cruiser | 2011 | 4985 | Manual | 36154 | Petrol | 125 | 50.4 | 1.3 |
6735 | Urban Cruiser | 2012 | 4995 | Manual | 46000 | Diesel | 125 | 57.6 | 1.4 |
6736 | Urban Cruiser | 2011 | 3995 | Manual | 60700 | Petrol | 125 | 50.4 | 1.3 |
6737 | Urban Cruiser | 2011 | 4495 | Manual | 45128 | Petrol | 125 | 50.4 | 1.3 |
We have 6,738 cars in our dataset.
df.shape
(6738, 9)
The data types look ok
df.dtypes
model            object
year              int64
price             int64
transmission     object
mileage           int64
fuelType         object
tax               int64
mpg             float64
engineSize      float64
dtype: object
We can see that some engineSize values are 0; we should look into this at the Data Cleaning stage. Also, at least one car supposedly does 235 miles per gallon, which we will investigate.
df.describe()
| | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
count | 6738.000000 | 6738.000000 | 6738.000000 | 6738.000000 | 6738.000000 | 6738.000000 |
mean | 2016.748145 | 12522.391066 | 22857.413921 | 94.697240 | 63.042223 | 1.471297 |
std | 2.204062 | 6345.017587 | 19125.464147 | 73.880776 | 15.836710 | 0.436159 |
min | 1998.000000 | 850.000000 | 2.000000 | 0.000000 | 2.800000 | 0.000000 |
25% | 2016.000000 | 8290.000000 | 9446.000000 | 0.000000 | 55.400000 | 1.000000 |
50% | 2017.000000 | 10795.000000 | 18513.000000 | 135.000000 | 62.800000 | 1.500000 |
75% | 2018.000000 | 14995.000000 | 31063.750000 | 145.000000 | 69.000000 | 1.800000 |
max | 2020.000000 | 59995.000000 | 174419.000000 | 565.000000 | 235.000000 | 4.500000 |
Most of the cars use petrol and have a manual transmission. About a third of the cars are Yaris.
df.describe(include=['object'])
| | model | transmission | fuelType |
|---|---|---|---|
count | 6738 | 6738 | 6738 |
unique | 18 | 4 | 4 |
top | Yaris | Manual | Petrol |
freq | 2122 | 3826 | 4087 |
There seem to be no missing values:
df.isnull().sum()
model           0
year            0
price           0
transmission    0
mileage         0
fuelType        0
tax             0
mpg             0
engineSize      0
dtype: int64
Let's look at the unique values in each column, just to make sure there are no hidden problems:
for col in df.columns:
    print(f'Column Name: {col}')
    print(df[col].unique())
    print(f'Unique values: {df[col].nunique()}')
    print()
Column Name: model
[' GT86' ' Corolla' ' RAV4' ' Yaris' ' Auris' ' Aygo' ' C-HR' ' Prius' ' Avensis' ' Verso' ' Hilux' ' PROACE VERSO' ' Land Cruiser' ' Supra' ' Camry' ' Verso-S' ' IQ' ' Urban Cruiser']
Unique values: 18

Column Name: year
[2016 2017 2015 2020 2013 2019 2018 2014 2012 2005 2003 2004 2001 2008 2007 2010 2011 2006 2009 2002 1999 2000 1998]
Unique values: 23

Column Name: price
[16000 15995 13998 ... 4950 4849 4985]
Unique values: 2114

Column Name: transmission
['Manual' 'Automatic' 'Semi-Auto' 'Other']
Unique values: 4

Column Name: mileage
[24089 18615 27469 ... 36154 60700 45128]
Unique values: 5699

Column Name: fuelType
['Petrol' 'Other' 'Hybrid' 'Diesel']
Unique values: 4

Column Name: tax
[265 145 150 260 200 250 140 135 235 300 125 20 160 165 205 240 330 325 0 30 120 155 115 190 10 305 565 555 540]
Unique values: 29

Column Name: mpg
[ 36.2 33.2 32.8 39.8 85.6 36.7 72.4 39.2 78.5 58.9 53.3 62.8 56.5 57.7 37.2 34.5 35.3 40.4 51.1 57.6 60.1 55.4 42.2 43.5 53. 49.6 52.3 48.7 42.8 47.1 38.2 31.4 32.1 30.4 47.9 78. 76.3 58. 65.7 55. 86. 66. 80.7 51.4 67.3 74.3 42.9 54.3 50.4 37.7 44.8 76.4 70.6 61.4 46.3 65.6 83.1 68.8 51.3 40.9 68.9 69. 67. 41.5 6. 94.1 64.2 235. 217.3 134.5 44.1 45.6 30.7 38.7 2.8 30.1 34.9 31. 27.2 23.9 29.7]
Unique values: 81

Column Name: engineSize
[2. 1.8 1.2 1.6 1.4 2.5 2.2 1.5 1. 1.3 0. 2.4 3. 2.8 4.2 4.5]
Unique values: 16
First, we can look into the distribution of our target variable, price:
# distribution of the target variable, price
sns.displot(x='price', kde=True, data=df)
plt.show()
I like to plot the variables with few categories against the target to see if there is anything interesting there. In this case those are fuelType, transmission, engineSize, model, and year.
I wonder what 'Other' means for fuelType:
sns.countplot(y='fuelType', data=df)
plt.show()
sns.countplot(y='transmission', data=df)
plt.show()
About a third of our cars have the 1.0 litre engine
sns.countplot(y='engineSize', data=df)
plt.show()
Yaris is our most popular model, followed closely by Aygo
sns.countplot(y='model', data=df)
plt.show()
We can see that most cars are from 2013 onwards, with the largest number coming from 2017
sns.countplot(y='year', data=df)
plt.show()
For the same columns (fuelType, transmission, engineSize, model, and year), I will now plot boxplots to look at the distribution of price across those variables. The red dashed line indicates the average price of the whole dataset:
for col in ['fuelType', 'transmission', 'engineSize']:
    sns.boxplot(x=col, y='price', data=df)
    plt.axhline(df.price.mean(), color='r', linestyle='dashed', linewidth=2)
    plt.show()
The Land Cruiser has the highest average price. The cars below the average are Yaris, Aygo, Avensis, Camry, Verso-S, IQ, and Urban Cruiser
ax = sns.boxplot(x='model', y='price', data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 45)
plt.axhline(df.price.mean(),color='r',linestyle='dashed',linewidth=2)
plt.show()
The average price is brought up mainly by cars from 2019 and 2020. This graph shows how used cars lose their value fairly quickly, within about 3-5 years.
ax = sns.boxplot(x='year', y='price', data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 45)
plt.axhline(df.price.mean(),color='r',linestyle='dashed',linewidth=2)
plt.show()
These tables show the mean of the remaining variables grouped by fuelType, transmission, and engineSize. I chose these three because they have few categories and are easy to take in at a glance.
df.groupby('fuelType').mean()
| fuelType | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
Diesel | 2015.546720 | 15697.807157 | 41742.294235 | 149.174950 | 52.066799 | 2.077734 |
Hybrid | 2016.795888 | 17185.472834 | 24452.267744 | 60.127264 | 75.282183 | 1.806853 |
Other | 2017.057143 | 14121.161905 | 17159.904762 | 100.523810 | 71.184762 | 1.349524 |
Petrol | 2016.864204 | 9759.537803 | 19882.336188 | 105.123563 | 58.065329 | 1.232053 |
df.groupby('transmission').mean()
| transmission | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
Automatic | 2016.706436 | 16582.828754 | 23977.222431 | 78.219797 | 70.137486 | 1.760369 |
Manual | 2016.759801 | 9551.496864 | 22472.350497 | 104.218505 | 58.629665 | 1.273393 |
Other | 2015.000000 | 12795.000000 | 16733.000000 | 0.000000 | 78.000000 | 1.500000 |
Semi-Auto | 2017.015748 | 14797.137795 | 16967.830709 | 124.015748 | 55.228740 | 1.428346 |
df.groupby('engineSize').mean()
| engineSize | year | price | mileage | tax | mpg |
|---|---|---|---|---|---|
0.0 | 2017.666667 | 10566.666667 | 18358.000000 | 97.500000 | 64.766667 |
1.0 | 2016.946576 | 7864.862555 | 18683.023798 | 89.927149 | 65.044973 |
1.2 | 2017.627063 | 14999.620462 | 17085.171617 | 123.399340 | 48.065347 |
1.3 | 2015.158070 | 8191.329451 | 29611.928453 | 50.307820 | 55.689517 |
1.4 | 2014.175000 | 6722.625000 | 51062.650000 | 14.625000 | 74.522500 |
1.5 | 2017.438510 | 11751.345748 | 17267.450457 | 104.121574 | 63.545748 |
1.6 | 2015.118343 | 9573.508876 | 37542.497041 | 112.810651 | 55.526627 |
1.8 | 2016.635376 | 17445.613654 | 26293.993018 | 72.152832 | 75.836074 |
2.0 | 2016.535176 | 17666.537688 | 29531.874372 | 149.497487 | 52.432412 |
2.2 | 2012.603448 | 9994.724138 | 60341.155172 | 220.775862 | 45.246552 |
2.4 | 2018.085714 | 22986.757143 | 20498.628571 | 261.357143 | 33.400000 |
2.5 | 2017.076305 | 22909.895582 | 26997.899598 | 67.971888 | 54.745382 |
2.8 | 2018.305556 | 42767.083333 | 12889.111111 | 180.555556 | 33.025000 |
3.0 | 2014.800000 | 29292.800000 | 44988.457143 | 258.571429 | 33.754286 |
4.2 | 1998.000000 | 19990.000000 | 100000.000000 | 265.000000 | 23.900000 |
4.5 | 2014.000000 | 44990.000000 | 60000.000000 | 540.000000 | 29.700000 |
What are the average and median prices in our dataset?
print(round(df.price.mean(),2), round(df.price.median(),2))
12522.39 10795.0
I will now plot the distribution of all the numeric variables. I find the distribution of mileage, year, engineSize, and mpg particularly interesting.
df.hist(figsize=(10,10), column=sorted(df.columns))
plt.show()
It is valuable to look at the correlation matrix. We can see that the strongest correlations are between year and mileage, and between price and engineSize. We will explore those two pairs with their own plots.
# heatmap of correlations
correlations = df.corr()
mask = np.zeros_like(correlations)
mask[np.triu_indices_from(mask)] = 1
sns.set_style('white')
plt.figure(figsize=(9,8))
sns.heatmap(correlations * 100,
cmap='RdBu_r',
annot=True,
fmt='.0f',
mask=mask,
cbar=False)
plt.show()
The pair plot is very valuable because we can see, in just one graph, the scatter plots of some of the more highly correlated variables. I find the year vs price and mileage vs price plots very interesting; the relationships do not look exactly linear.
sns.pairplot(df[['price', 'engineSize', 'year', 'mileage']], diag_kind="kde")
plt.show()
# engineSize v price
sns.scatterplot(x='engineSize', y='price', data=df)
plt.show()
# year v mileage
sns.scatterplot(x='year', y='mileage', data=df)
plt.show()
# year v mpg
sns.scatterplot(x='year', y='mpg', data=df)
plt.show()
The violin plots help us get a better sense of the distribution of the data in each category:
# transmission v mpg
sns.violinplot(x='transmission', y='mpg', data=df)
plt.show()
# fuelType v mpg
sns.violinplot(x='fuelType', y='mpg', data=df)
plt.show()
There are 39 duplicate rows. Given the size of our dataset, I believe it is fine to drop them.
# Check for duplicates
df.duplicated().sum()
df[df.duplicated()]
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
178 | Corolla | 2019 | 17960 | Manual | 4500 | Petrol | 145 | 39.2 | 1.2 |
286 | Corolla | 2019 | 15991 | Manual | 10 | Petrol | 145 | 39.2 | 1.2 |
397 | RAV4 | 2016 | 17495 | Automatic | 58100 | Hybrid | 20 | 55.4 | 2.5 |
398 | RAV4 | 2017 | 16295 | Manual | 34633 | Diesel | 125 | 60.1 | 2.0 |
446 | RAV4 | 2014 | 10999 | Manual | 57942 | Diesel | 145 | 53.0 | 2.0 |
459 | RAV4 | 2015 | 13500 | Manual | 45757 | Diesel | 125 | 57.6 | 2.0 |
460 | RAV4 | 2015 | 13500 | Manual | 45757 | Diesel | 125 | 57.6 | 2.0 |
764 | RAV4 | 2016 | 18795 | Automatic | 57000 | Hybrid | 20 | 55.4 | 2.5 |
1178 | Yaris | 2016 | 10495 | Automatic | 33705 | Hybrid | 0 | 86.0 | 1.5 |
1414 | Yaris | 2016 | 9495 | Automatic | 26191 | Petrol | 30 | 58.0 | 1.3 |
1587 | Yaris | 2020 | 14769 | Manual | 25 | Petrol | 150 | 47.9 | 1.5 |
1964 | Yaris | 2019 | 12495 | Automatic | 6000 | Petrol | 145 | 47.9 | 1.5 |
2161 | Yaris | 2019 | 14495 | Manual | 100 | Petrol | 145 | 47.9 | 1.5 |
2945 | Auris | 2016 | 9985 | Manual | 27600 | Petrol | 30 | 58.9 | 1.2 |
3177 | Auris | 2016 | 8992 | Manual | 40323 | Diesel | 20 | 67.3 | 1.6 |
3230 | Auris | 2013 | 6622 | Manual | 41959 | Petrol | 145 | 47.9 | 1.6 |
3778 | Aygo | 2019 | 9495 | Manual | 1360 | Petrol | 145 | 56.5 | 1.0 |
3836 | Aygo | 2016 | 8450 | Manual | 12935 | Petrol | 0 | 69.0 | 1.0 |
3902 | Aygo | 2019 | 10000 | Manual | 1000 | Petrol | 145 | 57.7 | 1.0 |
3948 | Aygo | 2019 | 9295 | Manual | 5000 | Petrol | 145 | 57.7 | 1.0 |
4059 | Aygo | 2019 | 8495 | Manual | 5519 | Petrol | 145 | 56.5 | 1.0 |
4283 | Aygo | 2019 | 9995 | Manual | 25 | Petrol | 145 | 56.5 | 1.0 |
4284 | Aygo | 2019 | 9995 | Manual | 25 | Petrol | 145 | 56.5 | 1.0 |
4301 | Aygo | 2019 | 10495 | Manual | 3500 | Petrol | 145 | 56.5 | 1.0 |
4330 | Aygo | 2017 | 6550 | Manual | 17047 | Petrol | 0 | 69.0 | 1.0 |
4452 | Aygo | 2017 | 6850 | Manual | 18898 | Petrol | 150 | 68.9 | 1.0 |
4516 | Aygo | 2019 | 9999 | Manual | 1500 | Petrol | 145 | 56.5 | 1.0 |
4517 | Aygo | 2019 | 9999 | Manual | 1500 | Petrol | 145 | 56.5 | 1.0 |
4603 | Aygo | 2019 | 9990 | Manual | 5000 | Petrol | 145 | 57.7 | 1.0 |
5084 | Aygo | 2018 | 9831 | Manual | 10742 | Petrol | 145 | 56.5 | 1.0 |
5194 | Aygo | 2019 | 10995 | Manual | 3 | Petrol | 145 | 57.7 | 1.0 |
5212 | Aygo | 2019 | 9391 | Manual | 3 | Petrol | 145 | 56.5 | 1.0 |
5342 | Aygo | 2019 | 9999 | Manual | 20 | Petrol | 145 | 57.7 | 1.0 |
5488 | Aygo | 2019 | 10350 | Manual | 2000 | Petrol | 145 | 57.7 | 1.0 |
5489 | Aygo | 2019 | 10350 | Manual | 2000 | Petrol | 145 | 57.7 | 1.0 |
5493 | Aygo | 2019 | 10000 | Manual | 150 | Petrol | 145 | 57.7 | 1.0 |
5560 | Aygo | 2019 | 10750 | Manual | 32 | Petrol | 145 | 57.7 | 1.0 |
6357 | Avensis | 2017 | 10595 | Manual | 35939 | Diesel | 145 | 67.3 | 1.6 |
6570 | Hilux | 2015 | 14995 | Automatic | 72100 | Diesel | 260 | 32.8 | 3.0 |
# Drop Duplicates
df.drop_duplicates(inplace=True)
df.shape
(6699, 9)
If I had more time, I would fix these values; it would not be hard to look up the engine sizes for these vehicles online. A rough imputation sketch follows the table below:
# Cars with engine size = 0
df[df.engineSize == 0]
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
2535 | Yaris | 2016 | 12300 | Manual | 6148 | Hybrid | 0 | 86.0 | 0.0 |
2545 | Yaris | 2016 | 11000 | Automatic | 39909 | Hybrid | 0 | 86.0 | 0.0 |
5126 | Aygo | 2019 | 9800 | Manual | 3635 | Petrol | 150 | 56.5 | 0.0 |
5233 | Aygo | 2019 | 8000 | Manual | 8531 | Petrol | 145 | 56.5 | 0.0 |
5257 | Aygo | 2019 | 8000 | Manual | 5354 | Petrol | 145 | 56.5 | 0.0 |
5960 | C-HR | 2017 | 14300 | Manual | 46571 | Petrol | 145 | 47.1 | 0.0 |
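A purely illustrative sketch of the imputation mentioned above (not applied in this analysis): fill the zero engine sizes with the most common engine size for the same model.
# Hypothetical sketch, not run as part of this analysis:
# impute engineSize == 0 with the most common engine size for the same model.
modal_engine = (
    df[df.engineSize > 0]
    .groupby('model')['engineSize']
    .agg(lambda s: s.mode().iloc[0])
)
zero_mask = df.engineSize == 0
df.loc[zero_mask, 'engineSize'] = df.loc[zero_mask, 'model'].map(modal_engine)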
We can see that some Prius entries have an mpg of 235 while others are around 134 or 217. If we had more time, we should look into this. Also, some have 'Other' as their fuelType, but the Prius is a hybrid rather than an electric car, so it would be worth talking to the people who put the dataset together. A small sketch for flagging these rows appears after the table below.
df[df.mpg > 100]
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
6098 | Prius | 2017 | 20795 | Automatic | 3881 | Hybrid | 135 | 235.0 | 1.8 |
6103 | Prius | 2018 | 20495 | Automatic | 16347 | Hybrid | 140 | 235.0 | 1.8 |
6116 | Prius | 2017 | 19998 | Automatic | 13476 | Hybrid | 140 | 235.0 | 1.8 |
6118 | Prius | 2019 | 24987 | Automatic | 6552 | Hybrid | 135 | 235.0 | 1.8 |
6124 | Prius | 2020 | 27990 | Automatic | 1500 | Hybrid | 135 | 235.0 | 1.8 |
6134 | Prius | 2017 | 20998 | Automatic | 17681 | Hybrid | 135 | 235.0 | 1.8 |
6151 | Prius | 2019 | 27999 | Automatic | 3650 | Hybrid | 140 | 217.3 | 1.8 |
6160 | Prius | 2013 | 13522 | Automatic | 52217 | Hybrid | 0 | 134.5 | 1.8 |
6176 | Prius | 2019 | 26995 | Automatic | 12998 | Hybrid | 135 | 217.3 | 1.8 |
6177 | Prius | 2015 | 15995 | Automatic | 51896 | Hybrid | 0 | 134.5 | 1.8 |
6180 | Prius | 2015 | 17495 | Automatic | 19350 | Hybrid | 0 | 134.5 | 1.8 |
6188 | Prius | 2014 | 15995 | Automatic | 47221 | Hybrid | 0 | 134.5 | 1.8 |
6195 | Prius | 2014 | 15362 | Automatic | 50647 | Hybrid | 0 | 134.5 | 1.8 |
6198 | Prius | 2019 | 29995 | Automatic | 3754 | Hybrid | 135 | 217.3 | 1.8 |
6200 | Prius | 2018 | 19072 | Automatic | 43584 | Hybrid | 135 | 235.0 | 1.8 |
6207 | Prius | 2019 | 22495 | Automatic | 13772 | Hybrid | 135 | 235.0 | 1.8 |
6216 | Prius | 2018 | 22890 | Automatic | 31786 | Hybrid | 135 | 235.0 | 1.8 |
6223 | Prius | 2018 | 18699 | Automatic | 34502 | Hybrid | 140 | 235.0 | 1.8 |
6236 | Prius | 2019 | 21500 | Automatic | 4000 | Hybrid | 135 | 235.0 | 1.8 |
6240 | Prius | 2013 | 13495 | Automatic | 35726 | Hybrid | 0 | 134.5 | 1.8 |
6244 | Prius | 2020 | 31995 | Automatic | 100 | Hybrid | 135 | 217.3 | 1.8 |
6250 | Prius | 2012 | 13199 | Automatic | 37870 | Hybrid | 0 | 134.5 | 1.8 |
6253 | Prius | 2019 | 27999 | Automatic | 9975 | Hybrid | 140 | 217.3 | 1.8 |
6256 | Prius | 2018 | 18150 | Automatic | 59191 | Other | 135 | 235.0 | 1.8 |
6267 | Prius | 2018 | 19699 | Automatic | 43678 | Hybrid | 135 | 235.0 | 1.8 |
6286 | Prius | 2019 | 26700 | Automatic | 6053 | Other | 135 | 217.3 | 1.8 |
6287 | Prius | 2019 | 28590 | Automatic | 5000 | Other | 135 | 217.3 | 1.8 |
6289 | Prius | 2014 | 15990 | Automatic | 23523 | Other | 0 | 134.5 | 1.8 |
6291 | Prius | 2019 | 26990 | Automatic | 7000 | Other | 135 | 217.3 | 1.8 |
6314 | Prius | 2016 | 15795 | Automatic | 47000 | Other | 0 | 134.5 | 1.8 |
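As a small sketch of how we might flag these rows for follow-up with the data owners (the 150 mpg cutoff and the file name are assumptions for illustration):
# Export the implausibly high mpg rows so they can be queried with the data provider.
# The 150 mpg cutoff and the file name are illustrative assumptions.
suspect_mpg = df[df.mpg > 150]
print(f'{len(suspect_mpg)} rows have mpg above 150')
suspect_mpg.to_csv('suspect_mpg_rows.csv', index=False)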
Let's create a version of the dataframe that encodes our three non-numeric columns as 0/1 indicator variables. We will use pd.get_dummies for this.
abt = pd.get_dummies(df, columns=['model', 'transmission', 'fuelType'])
We now have 32 columns
abt.head()
| | year | price | mileage | tax | mpg | engineSize | model_ Auris | model_ Avensis | model_ Aygo | model_ C-HR | ... | model_ Verso-S | model_ Yaris | transmission_Automatic | transmission_Manual | transmission_Other | transmission_Semi-Auto | fuelType_Diesel | fuelType_Hybrid | fuelType_Other | fuelType_Petrol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016 | 16000 | 24089 | 265 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2017 | 15995 | 18615 | 145 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 2015 | 13998 | 27469 | 265 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 2017 | 18998 | 14736 | 150 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 2017 | 17498 | 36284 | 145 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 32 columns
We will save the analytical base table (ABT) to a CSV file as a checkpoint in our analysis.
# Create ABT
abt.to_csv('analytical_base_table.csv', index=None)
This is a regression problem. We will try to estimate price (continuous variable) using the other features.
We will use the following algorithms and see which one performs best: Lasso, Ridge, ElasticNet, Random Forest Regressor, and Gradient Boosting Regressor.
I chose the first three because the regularization they provide normally improves on plain linear regression, especially in a case like this where the regression has many variables to fit.
Random Forest and Gradient Boosting Regressors are normally good choices in regression problems.
For Lasso and Ridge we will tune the alpha parameter.
For ElasticNet we will focus on the alpha and the l1 ratio.
For the Random Forest Regressor we will focus on the number of trees and the number of features to use per tree.
For the Gradient Boosting Regressor we will tune the number of estimators, the learning rate and the maximum depth.
First we load our table.
df = pd.read_csv('analytical_base_table.csv')
df.head()
| | year | price | mileage | tax | mpg | engineSize | model_ Auris | model_ Avensis | model_ Aygo | model_ C-HR | ... | model_ Verso-S | model_ Yaris | transmission_Automatic | transmission_Manual | transmission_Other | transmission_Semi-Auto | fuelType_Diesel | fuelType_Hybrid | fuelType_Other | fuelType_Petrol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016 | 16000 | 24089 | 265 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2017 | 15995 | 18615 | 145 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 2015 | 13998 | 27469 | 265 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 2017 | 18998 | 14736 | 150 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 2017 | 17498 | 36284 | 145 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 32 columns
df.shape
(6699, 32)
Now we create the X and y tables. X is everything except our target; y is our target variable, price.
X = df.drop('price', axis=1)
y = df.price
We can now split the data into training and testing sets. I will keep 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=77)
By creating pipelines, we can run cross-validation and apply our preprocessing within each fold. In this case, we will use StandardScaler to subtract the mean and divide by the standard deviation of each column.
# dictionary of all our pipelines
pipelines = {
'lasso' : make_pipeline(StandardScaler(), Lasso(random_state=77)),
'ridge' : make_pipeline(StandardScaler(), Ridge(random_state=77)),
'enet' : make_pipeline(StandardScaler(), ElasticNet(random_state=77)),
'rf' : make_pipeline(StandardScaler(), RandomForestRegressor(random_state=77)),
'gb' : make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=77))
}
These are the parameters we will search over in our hyperparameter tuning. If we had more time, we could use an expanded grid.
# hyperparameter grids
lasso_hyperparameters = {
'lasso__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}
ridge_hyperparameters = {
'ridge__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}
enet_hyperparameters = {
'elasticnet__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
'elasticnet__l1_ratio' : [0.1, 0.3, 0.5, 0.7, 0.9]
}
rf_hyperparameters = {
'randomforestregressor__n_estimators' : [100, 200],
'randomforestregressor__max_features' : ['auto', 'sqrt', 0.33]
}
gb_hyperparameters = {
'gradientboostingregressor__n_estimators' : [100, 200],
'gradientboostingregressor__learning_rate' : [0.05, 0.1, 0.2],
'gradientboostingregressor__max_depth' : [1, 3, 5]
}
By grouping everything into a dictionary, it is easy to loop over each model and its hyperparameter grid
# dictionary of hyperparameters
hyperparameters = {
'lasso' : lasso_hyperparameters,
'ridge' : ridge_hyperparameters,
'enet' : enet_hyperparameters,
'rf' : rf_hyperparameters,
'gb' : gb_hyperparameters
}
This step takes a few minutes
# loop through all our models and fit them using our pipelines and hyperparameters
fitted_models = {}
for name, pipeline in pipelines.items():
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10)
    model.fit(X_train, y_train)
    fitted_models[name] = model
    print(name, 'has been fitted')
lasso has been fitted
ridge has been fitted
enet has been fitted
rf has been fitted
gb has been fitted
It seems gradient boosting is the best model. We will confirm this using the test set.
# print the score of each model
for name, model in fitted_models.items():
    print(name, model.best_score_)
lasso 0.9244539982458957
ridge 0.9244573958197139
enet 0.9244568643352086
rf 0.9604453777122133
gb 0.9668996100757182
Since we need an average error lower than £1,500, we use mean absolute error (MAE) as the metric to confirm that gradient boosting is our best model, and to check whether we can meet the accuracy required by the end user. It seems we did!
# print the R^2 and MAE for each model (regression)
for name, model in fitted_models.items():
    pred = model.predict(X_test)
    print(name)
    print('-'*8)
    print(f'R^2: {r2_score(y_test, pred)}')
    print(f'MAE: {mean_absolute_error(y_test, pred)}')
    print()
lasso
--------
R^2: 0.9290397270899275
MAE: 1033.9090834078168

ridge
--------
R^2: 0.9290610512361053
MAE: 1033.7925054464924

enet
--------
R^2: 0.9290603223847058
MAE: 1033.818063999813

rf
--------
R^2: 0.961070619542453
MAE: 793.9879979328197

gb
--------
R^2: 0.9637539267332421
MAE: 758.1641555273221
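Beyond the average error, we can check directly how often we stay inside the £1,500 band. This is a small added sketch using the objects defined above; it was not part of the original run:
# Share of gradient boosting test-set predictions within £1,500 of the actual listed price.
# Added as an illustrative check; not part of the original pipeline.
gb_pred = fitted_models['gb'].predict(X_test)
within_band = (abs(gb_pred - y_test) <= 1500).mean()
print(f'{within_band:.1%} of test predictions are within £1,500 of the listed price')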
Gradient Boosting was the best model, with 200 trees, 0.05 learning rate and maximum depth of 5. With more time, we would try a bigger grid.
# print the pipeline of the best model
fitted_models['gb'].best_estimator_
Pipeline(steps=[('standardscaler', StandardScaler()), ('gradientboostingregressor', GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=200, random_state=77))])
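Since pickle was imported at the start, here is a minimal sketch of how we might save the winning pipeline for the Toyota specialist to reuse (the file name is an assumption for illustration):
# Save the best pipeline to disk so it can be reused without re-fitting.
# 'toyota_price_model.pkl' is an illustrative file name, not from the original analysis.
with open('toyota_price_model.pkl', 'wb') as f:
    pickle.dump(fitted_models['gb'].best_estimator_, f)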
Gradient boosting performed better than random forest and the regularized regressions.
Our best model had a mean absolute error (MAE) lower than £1,500, in this case about £758.
This MAE means that our model meets the requirement to be useful to our clients. The user stated that a car priced more than £1,500 above its true market value would not sell. Since our mean absolute error is about half that, we can be confident that the model will be useful and will be within the permitted error range most of the time.
We performed 10-fold cross-validation on the training data, which gives us good confidence that our evaluation holds for out-of-sample data.
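As a hedged sketch (not part of the original run), we could also inspect the spread of the 10 fold scores for the winning hyperparameters, to gauge how stable that estimate is:
# Mean and standard deviation of the 10 cross-validation R^2 scores for the best gradient boosting setup.
cv_results = fitted_models['gb'].cv_results_
best_idx = fitted_models['gb'].best_index_
print('Mean CV R^2:', cv_results['mean_test_score'][best_idx])
print('Std of CV R^2:', cv_results['std_test_score'][best_idx])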
This model could be further improved in several ways: