You have been hired as a data scientist at a used car dealership in the UK. The sales team have been having problems with pricing used cars that arrive at the dealership and would like your help. Before they take any company-wide action, they would like you to work with the Toyota specialist to test your idea. They have already collected data from other retailers on the prices that a range of Toyota cars were listed at. It is known that cars priced more than £1,500 above the estimated price will not sell. The sales team wants to know whether you can make predictions within this range.
The presentation of your findings should be targeted at the Head of Sales, who has no technical data science background.
The data you will use for this analysis can be accessed here: "data/toyota.csv"
Let's start by loading the libraries we will need in our analysis.
# import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
import pickle
# import pipeline and preprocessing libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# import model selection and metrics libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error
# import the different models we are going to use
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter(action='ignore', category=ConvergenceWarning)
df = pd.read_csv('data/toyota.csv')
We can see our features are: model, year, transmission, mileage, fuelType, tax, mpg, and engineSize.
And our target: price.
df.head(10)
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
0 | GT86 | 2016 | 16000 | Manual | 24089 | Petrol | 265 | 36.2 | 2.0 |
1 | GT86 | 2017 | 15995 | Manual | 18615 | Petrol | 145 | 36.2 | 2.0 |
2 | GT86 | 2015 | 13998 | Manual | 27469 | Petrol | 265 | 36.2 | 2.0 |
3 | GT86 | 2017 | 18998 | Manual | 14736 | Petrol | 150 | 36.2 | 2.0 |
4 | GT86 | 2017 | 17498 | Manual | 36284 | Petrol | 145 | 36.2 | 2.0 |
5 | GT86 | 2017 | 15998 | Manual | 26919 | Petrol | 260 | 36.2 | 2.0 |
6 | GT86 | 2017 | 18522 | Manual | 10456 | Petrol | 145 | 36.2 | 2.0 |
7 | GT86 | 2017 | 18995 | Manual | 12340 | Petrol | 145 | 36.2 | 2.0 |
8 | GT86 | 2020 | 27998 | Manual | 516 | Petrol | 150 | 33.2 | 2.0 |
9 | GT86 | 2016 | 13990 | Manual | 37999 | Petrol | 265 | 36.2 | 2.0 |
df.tail(10)
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
6728 | IQ | 2009 | 2699 | Manual | 74095 | Petrol | 0 | 65.7 | 1.0 |
6729 | IQ | 2013 | 4950 | Manual | 72913 | Petrol | 30 | 54.3 | 1.3 |
6730 | IQ | 2013 | 4849 | Manual | 41142 | Petrol | 0 | 64.2 | 1.0 |
6731 | IQ | 2011 | 5995 | Automatic | 29000 | Petrol | 20 | 58.9 | 1.0 |
6732 | IQ | 2012 | 2495 | Manual | 72498 | Petrol | 0 | 64.2 | 1.0 |
6733 | IQ | 2011 | 5500 | Automatic | 30000 | Petrol | 20 | 58.9 | 1.0 |
6734 | Urban Cruiser | 2011 | 4985 | Manual | 36154 | Petrol | 125 | 50.4 | 1.3 |
6735 | Urban Cruiser | 2012 | 4995 | Manual | 46000 | Diesel | 125 | 57.6 | 1.4 |
6736 | Urban Cruiser | 2011 | 3995 | Manual | 60700 | Petrol | 125 | 50.4 | 1.3 |
6737 | Urban Cruiser | 2011 | 4495 | Manual | 45128 | Petrol | 125 | 50.4 | 1.3 |
We have 6,738 cars in our dataset.
df.shape
(6738, 9)
The data types look ok
df.dtypes
model            object
year              int64
price             int64
transmission     object
mileage           int64
fuelType         object
tax               int64
mpg             float64
engineSize      float64
dtype: object
We can see that some engineSize values are 0; we should look into this at the Data Cleaning stage. Also, at least one car supposedly does 235 miles per gallon, which we will investigate.
df.describe()
| | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
count | 6738.000000 | 6738.000000 | 6738.000000 | 6738.000000 | 6738.000000 | 6738.000000 |
mean | 2016.748145 | 12522.391066 | 22857.413921 | 94.697240 | 63.042223 | 1.471297 |
std | 2.204062 | 6345.017587 | 19125.464147 | 73.880776 | 15.836710 | 0.436159 |
min | 1998.000000 | 850.000000 | 2.000000 | 0.000000 | 2.800000 | 0.000000 |
25% | 2016.000000 | 8290.000000 | 9446.000000 | 0.000000 | 55.400000 | 1.000000 |
50% | 2017.000000 | 10795.000000 | 18513.000000 | 135.000000 | 62.800000 | 1.500000 |
75% | 2018.000000 | 14995.000000 | 31063.750000 | 145.000000 | 69.000000 | 1.800000 |
max | 2020.000000 | 59995.000000 | 174419.000000 | 565.000000 | 235.000000 | 4.500000 |
Most of the cars use petrol and have a manual transmission. About a third of the cars are Yaris.
df.describe(include=['object'])
| | model | transmission | fuelType |
|---|---|---|---|
count | 6738 | 6738 | 6738 |
unique | 18 | 4 | 4 |
top | Yaris | Manual | Petrol |
freq | 2122 | 3826 | 4087 |
There seem to be no missing values:
df.isnull().sum()
model           0
year            0
price           0
transmission    0
mileage         0
fuelType        0
tax             0
mpg             0
engineSize      0
dtype: int64
Let's look at the unique values in each column, just to make sure there are no hidden problems:
for col in df.columns:
    print(f'Column Name: {col}')
    print(df[col].unique())
    print(f'Unique values: {df[col].nunique()}')
    print()
Column Name: model
[' GT86' ' Corolla' ' RAV4' ' Yaris' ' Auris' ' Aygo' ' C-HR' ' Prius' ' Avensis' ' Verso' ' Hilux' ' PROACE VERSO' ' Land Cruiser' ' Supra' ' Camry' ' Verso-S' ' IQ' ' Urban Cruiser']
Unique values: 18

Column Name: year
[2016 2017 2015 2020 2013 2019 2018 2014 2012 2005 2003 2004 2001 2008 2007 2010 2011 2006 2009 2002 1999 2000 1998]
Unique values: 23

Column Name: price
[16000 15995 13998 ... 4950 4849 4985]
Unique values: 2114

Column Name: transmission
['Manual' 'Automatic' 'Semi-Auto' 'Other']
Unique values: 4

Column Name: mileage
[24089 18615 27469 ... 36154 60700 45128]
Unique values: 5699

Column Name: fuelType
['Petrol' 'Other' 'Hybrid' 'Diesel']
Unique values: 4

Column Name: tax
[265 145 150 260 200 250 140 135 235 300 125 20 160 165 205 240 330 325 0 30 120 155 115 190 10 305 565 555 540]
Unique values: 29

Column Name: mpg
[ 36.2 33.2 32.8 39.8 85.6 36.7 72.4 39.2 78.5 58.9 53.3 62.8 56.5 57.7 37.2 34.5 35.3 40.4 51.1 57.6 60.1 55.4 42.2 43.5 53. 49.6 52.3 48.7 42.8 47.1 38.2 31.4 32.1 30.4 47.9 78. 76.3 58. 65.7 55. 86. 66. 80.7 51.4 67.3 74.3 42.9 54.3 50.4 37.7 44.8 76.4 70.6 61.4 46.3 65.6 83.1 68.8 51.3 40.9 68.9 69. 67. 41.5 6. 94.1 64.2 235. 217.3 134.5 44.1 45.6 30.7 38.7 2.8 30.1 34.9 31. 27.2 23.9 29.7]
Unique values: 81

Column Name: engineSize
[2. 1.8 1.2 1.6 1.4 2.5 2.2 1.5 1. 1.3 0. 2.4 3. 2.8 4.2 4.5]
Unique values: 16
First, we can look into the distribution of our target variable, price:
# distribution of the target variable, price
sns.displot(x='price', kde=True, data=df)
plt.show()
I like to plot the variables with few categories against the target to see if there is anything interesting there. In this case those are fuelType, transmission, engineSize, model, and year.
I wonder what 'Other' means for fuelType:
sns.countplot(y='fuelType', data=df)
plt.show()
sns.countplot(y='transmission', data=df)
plt.show()
About a third of our cars have the 1.0 litre engine
sns.countplot(y='engineSize', data=df)
plt.show()
Yaris is our most popular model, followed closely by Aygo
sns.countplot(y='model', data=df)
plt.show()
We can see that most cars are from 2013 onwards, with the largest number coming from 2017
sns.countplot(y='year', data=df)
plt.show()
For the same columns (fuelType, transmission, engineSize, model, and year), I will now plot boxplots to look at the distribution of price across those variables. The red dashed line indicates the average price of the whole dataset:
for col in ['fuelType', 'transmission', 'engineSize']:
    sns.boxplot(x=col, y='price', data=df)
    plt.axhline(df.price.mean(), color='r', linestyle='dashed', linewidth=2)
    plt.show()
The Land Cruiser has the highest average price. The cars below the average are Yaris, Aygo, Avensis, Camry, Verso-S, IQ, and Urban Cruiser
ax = sns.boxplot(x='model', y='price', data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 45)
plt.axhline(df.price.mean(),color='r',linestyle='dashed',linewidth=2)
plt.show()
The average price is brought up mainly by cars from 2019 and 2020. This graph shows how used cars lose their value fairly quickly, within about 3-5 years.
ax = sns.boxplot(x='year', y='price', data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 45)
plt.axhline(df.price.mean(),color='r',linestyle='dashed',linewidth=2)
plt.show()
These tables show the mean of the remaining variables grouped by fuelType, transmission, and engineSize. I chose these three because they have few categories and are easy to take in at a glance.
df.groupby('fuelType').mean()
| fuelType | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
Diesel | 2015.546720 | 15697.807157 | 41742.294235 | 149.174950 | 52.066799 | 2.077734 |
Hybrid | 2016.795888 | 17185.472834 | 24452.267744 | 60.127264 | 75.282183 | 1.806853 |
Other | 2017.057143 | 14121.161905 | 17159.904762 | 100.523810 | 71.184762 | 1.349524 |
Petrol | 2016.864204 | 9759.537803 | 19882.336188 | 105.123563 | 58.065329 | 1.232053 |
df.groupby('transmission').mean()
| transmission | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
Automatic | 2016.706436 | 16582.828754 | 23977.222431 | 78.219797 | 70.137486 | 1.760369 |
Manual | 2016.759801 | 9551.496864 | 22472.350497 | 104.218505 | 58.629665 | 1.273393 |
Other | 2015.000000 | 12795.000000 | 16733.000000 | 0.000000 | 78.000000 | 1.500000 |
Semi-Auto | 2017.015748 | 14797.137795 | 16967.830709 | 124.015748 | 55.228740 | 1.428346 |
df.groupby('engineSize').mean()
| engineSize | year | price | mileage | tax | mpg |
|---|---|---|---|---|---|
0.0 | 2017.666667 | 10566.666667 | 18358.000000 | 97.500000 | 64.766667 |
1.0 | 2016.946576 | 7864.862555 | 18683.023798 | 89.927149 | 65.044973 |
1.2 | 2017.627063 | 14999.620462 | 17085.171617 | 123.399340 | 48.065347 |
1.3 | 2015.158070 | 8191.329451 | 29611.928453 | 50.307820 | 55.689517 |
1.4 | 2014.175000 | 6722.625000 | 51062.650000 | 14.625000 | 74.522500 |
1.5 | 2017.438510 | 11751.345748 | 17267.450457 | 104.121574 | 63.545748 |
1.6 | 2015.118343 | 9573.508876 | 37542.497041 | 112.810651 | 55.526627 |
1.8 | 2016.635376 | 17445.613654 | 26293.993018 | 72.152832 | 75.836074 |
2.0 | 2016.535176 | 17666.537688 | 29531.874372 | 149.497487 | 52.432412 |
2.2 | 2012.603448 | 9994.724138 | 60341.155172 | 220.775862 | 45.246552 |
2.4 | 2018.085714 | 22986.757143 | 20498.628571 | 261.357143 | 33.400000 |
2.5 | 2017.076305 | 22909.895582 | 26997.899598 | 67.971888 | 54.745382 |
2.8 | 2018.305556 | 42767.083333 | 12889.111111 | 180.555556 | 33.025000 |
3.0 | 2014.800000 | 29292.800000 | 44988.457143 | 258.571429 | 33.754286 |
4.2 | 1998.000000 | 19990.000000 | 100000.000000 | 265.000000 | 23.900000 |
4.5 | 2014.000000 | 44990.000000 | 60000.000000 | 540.000000 | 29.700000 |
What are the average and median prices in our dataset?
print(round(df.price.mean(),2), round(df.price.median(),2))
12522.39 10795.0
I will now plot the distribution of all the numeric variables. I find the distribution of mileage, year, engineSize, and mpg particularly interesting.
df.hist(figsize=(10,10), column=sorted(df.columns))
plt.show()
It is valuable to look at the correlation matrix. We can see that the strongest correlations are between year and mileage, and between price and engineSize. We will explore those two pairs with their own plots.
# heatmap of correlations
correlations = df.corr()
mask = np.zeros_like(correlations)
mask[np.triu_indices_from(mask)] = 1
sns.set_style('white')
plt.figure(figsize=(9,8))
sns.heatmap(correlations * 100,
cmap='RdBu_r',
annot=True,
fmt='.0f',
mask=mask,
cbar=False)
plt.show()
The pair plot is very valuable because we can see, in just one graph, the scatter plots of some of the more highly correlated variables. I find the year vs price and mileage vs price plots very interesting; the relationships do not look exactly linear.
sns.pairplot(df[['price', 'engineSize', 'year', 'mileage']], diag_kind="kde")
plt.show()
# engineSize v price
sns.scatterplot(x='engineSize', y='price', data=df)
plt.show()
# year v mileage
sns.scatterplot(x='year', y='mileage', data=df)
plt.show()
# year v mpg
sns.scatterplot(x='year', y='mpg', data=df)
plt.show()
The violin plots help us get a better sense of the distribution of the data in each category:
# transmission v mpg
sns.violinplot(x='transmission', y='mpg', data=df)
plt.show()
# fuelType v mpg
sns.violinplot(x='fuelType', y='mpg', data=df)
plt.show()
There are 39 duplicate rows. Given the size of our dataset, I believe it is fine to drop them.
# Check for duplicates
df.duplicated().sum()
df[df.duplicated()]
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
178 | Corolla | 2019 | 17960 | Manual | 4500 | Petrol | 145 | 39.2 | 1.2 |
286 | Corolla | 2019 | 15991 | Manual | 10 | Petrol | 145 | 39.2 | 1.2 |
397 | RAV4 | 2016 | 17495 | Automatic | 58100 | Hybrid | 20 | 55.4 | 2.5 |
398 | RAV4 | 2017 | 16295 | Manual | 34633 | Diesel | 125 | 60.1 | 2.0 |
446 | RAV4 | 2014 | 10999 | Manual | 57942 | Diesel | 145 | 53.0 | 2.0 |
459 | RAV4 | 2015 | 13500 | Manual | 45757 | Diesel | 125 | 57.6 | 2.0 |
460 | RAV4 | 2015 | 13500 | Manual | 45757 | Diesel | 125 | 57.6 | 2.0 |
764 | RAV4 | 2016 | 18795 | Automatic | 57000 | Hybrid | 20 | 55.4 | 2.5 |
1178 | Yaris | 2016 | 10495 | Automatic | 33705 | Hybrid | 0 | 86.0 | 1.5 |
1414 | Yaris | 2016 | 9495 | Automatic | 26191 | Petrol | 30 | 58.0 | 1.3 |
1587 | Yaris | 2020 | 14769 | Manual | 25 | Petrol | 150 | 47.9 | 1.5 |
1964 | Yaris | 2019 | 12495 | Automatic | 6000 | Petrol | 145 | 47.9 | 1.5 |
2161 | Yaris | 2019 | 14495 | Manual | 100 | Petrol | 145 | 47.9 | 1.5 |
2945 | Auris | 2016 | 9985 | Manual | 27600 | Petrol | 30 | 58.9 | 1.2 |
3177 | Auris | 2016 | 8992 | Manual | 40323 | Diesel | 20 | 67.3 | 1.6 |
3230 | Auris | 2013 | 6622 | Manual | 41959 | Petrol | 145 | 47.9 | 1.6 |
3778 | Aygo | 2019 | 9495 | Manual | 1360 | Petrol | 145 | 56.5 | 1.0 |
3836 | Aygo | 2016 | 8450 | Manual | 12935 | Petrol | 0 | 69.0 | 1.0 |
3902 | Aygo | 2019 | 10000 | Manual | 1000 | Petrol | 145 | 57.7 | 1.0 |
3948 | Aygo | 2019 | 9295 | Manual | 5000 | Petrol | 145 | 57.7 | 1.0 |
4059 | Aygo | 2019 | 8495 | Manual | 5519 | Petrol | 145 | 56.5 | 1.0 |
4283 | Aygo | 2019 | 9995 | Manual | 25 | Petrol | 145 | 56.5 | 1.0 |
4284 | Aygo | 2019 | 9995 | Manual | 25 | Petrol | 145 | 56.5 | 1.0 |
4301 | Aygo | 2019 | 10495 | Manual | 3500 | Petrol | 145 | 56.5 | 1.0 |
4330 | Aygo | 2017 | 6550 | Manual | 17047 | Petrol | 0 | 69.0 | 1.0 |
4452 | Aygo | 2017 | 6850 | Manual | 18898 | Petrol | 150 | 68.9 | 1.0 |
4516 | Aygo | 2019 | 9999 | Manual | 1500 | Petrol | 145 | 56.5 | 1.0 |
4517 | Aygo | 2019 | 9999 | Manual | 1500 | Petrol | 145 | 56.5 | 1.0 |
4603 | Aygo | 2019 | 9990 | Manual | 5000 | Petrol | 145 | 57.7 | 1.0 |
5084 | Aygo | 2018 | 9831 | Manual | 10742 | Petrol | 145 | 56.5 | 1.0 |
5194 | Aygo | 2019 | 10995 | Manual | 3 | Petrol | 145 | 57.7 | 1.0 |
5212 | Aygo | 2019 | 9391 | Manual | 3 | Petrol | 145 | 56.5 | 1.0 |
5342 | Aygo | 2019 | 9999 | Manual | 20 | Petrol | 145 | 57.7 | 1.0 |
5488 | Aygo | 2019 | 10350 | Manual | 2000 | Petrol | 145 | 57.7 | 1.0 |
5489 | Aygo | 2019 | 10350 | Manual | 2000 | Petrol | 145 | 57.7 | 1.0 |
5493 | Aygo | 2019 | 10000 | Manual | 150 | Petrol | 145 | 57.7 | 1.0 |
5560 | Aygo | 2019 | 10750 | Manual | 32 | Petrol | 145 | 57.7 | 1.0 |
6357 | Avensis | 2017 | 10595 | Manual | 35939 | Diesel | 145 | 67.3 | 1.6 |
6570 | Hilux | 2015 | 14995 | Automatic | 72100 | Diesel | 260 | 32.8 | 3.0 |
# Drop Duplicates
df.drop_duplicates(inplace=True)
df.shape
(6699, 9)
If I had more time, I would fix these values; it would not be hard to look up the engine sizes for these vehicles online. A rough imputation sketch follows the table below:
# Cars with engine size = 0
df[df.engineSize == 0]
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
2535 | Yaris | 2016 | 12300 | Manual | 6148 | Hybrid | 0 | 86.0 | 0.0 |
2545 | Yaris | 2016 | 11000 | Automatic | 39909 | Hybrid | 0 | 86.0 | 0.0 |
5126 | Aygo | 2019 | 9800 | Manual | 3635 | Petrol | 150 | 56.5 | 0.0 |
5233 | Aygo | 2019 | 8000 | Manual | 8531 | Petrol | 145 | 56.5 | 0.0 |
5257 | Aygo | 2019 | 8000 | Manual | 5354 | Petrol | 145 | 56.5 | 0.0 |
5960 | C-HR | 2017 | 14300 | Manual | 46571 | Petrol | 145 | 47.1 | 0.0 |
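A purely illustrative sketch of the imputation mentioned above (not applied in this analysis): fill the zero engine sizes with the most common engine size for the same model.
# Hypothetical sketch, not run as part of this analysis:
# impute engineSize == 0 with the most common engine size for the same model.
modal_engine = (
    df[df.engineSize > 0]
    .groupby('model')['engineSize']
    .agg(lambda s: s.mode().iloc[0])
)
zero_mask = df.engineSize == 0
df.loc[zero_mask, 'engineSize'] = df.loc[zero_mask, 'model'].map(modal_engine)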
We can see that some Prius entries have an mpg of 235 while others are around 134 or 217. If we had more time, we should look into this. Also, some have 'Other' as their fuelType, but the Prius is a hybrid rather than an electric car, so it would be worth talking to the people who put the dataset together. A small sketch for flagging these rows appears after the table below.
df[df.mpg > 100]
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
6098 | Prius | 2017 | 20795 | Automatic | 3881 | Hybrid | 135 | 235.0 | 1.8 |
6103 | Prius | 2018 | 20495 | Automatic | 16347 | Hybrid | 140 | 235.0 | 1.8 |
6116 | Prius | 2017 | 19998 | Automatic | 13476 | Hybrid | 140 | 235.0 | 1.8 |
6118 | Prius | 2019 | 24987 | Automatic | 6552 | Hybrid | 135 | 235.0 | 1.8 |
6124 | Prius | 2020 | 27990 | Automatic | 1500 | Hybrid | 135 | 235.0 | 1.8 |
6134 | Prius | 2017 | 20998 | Automatic | 17681 | Hybrid | 135 | 235.0 | 1.8 |
6151 | Prius | 2019 | 27999 | Automatic | 3650 | Hybrid | 140 | 217.3 | 1.8 |
6160 | Prius | 2013 | 13522 | Automatic | 52217 | Hybrid | 0 | 134.5 | 1.8 |
6176 | Prius | 2019 | 26995 | Automatic | 12998 | Hybrid | 135 | 217.3 | 1.8 |
6177 | Prius | 2015 | 15995 | Automatic | 51896 | Hybrid | 0 | 134.5 | 1.8 |
6180 | Prius | 2015 | 17495 | Automatic | 19350 | Hybrid | 0 | 134.5 | 1.8 |
6188 | Prius | 2014 | 15995 | Automatic | 47221 | Hybrid | 0 | 134.5 | 1.8 |
6195 | Prius | 2014 | 15362 | Automatic | 50647 | Hybrid | 0 | 134.5 | 1.8 |
6198 | Prius | 2019 | 29995 | Automatic | 3754 | Hybrid | 135 | 217.3 | 1.8 |
6200 | Prius | 2018 | 19072 | Automatic | 43584 | Hybrid | 135 | 235.0 | 1.8 |
6207 | Prius | 2019 | 22495 | Automatic | 13772 | Hybrid | 135 | 235.0 | 1.8 |
6216 | Prius | 2018 | 22890 | Automatic | 31786 | Hybrid | 135 | 235.0 | 1.8 |
6223 | Prius | 2018 | 18699 | Automatic | 34502 | Hybrid | 140 | 235.0 | 1.8 |
6236 | Prius | 2019 | 21500 | Automatic | 4000 | Hybrid | 135 | 235.0 | 1.8 |
6240 | Prius | 2013 | 13495 | Automatic | 35726 | Hybrid | 0 | 134.5 | 1.8 |
6244 | Prius | 2020 | 31995 | Automatic | 100 | Hybrid | 135 | 217.3 | 1.8 |
6250 | Prius | 2012 | 13199 | Automatic | 37870 | Hybrid | 0 | 134.5 | 1.8 |
6253 | Prius | 2019 | 27999 | Automatic | 9975 | Hybrid | 140 | 217.3 | 1.8 |
6256 | Prius | 2018 | 18150 | Automatic | 59191 | Other | 135 | 235.0 | 1.8 |
6267 | Prius | 2018 | 19699 | Automatic | 43678 | Hybrid | 135 | 235.0 | 1.8 |
6286 | Prius | 2019 | 26700 | Automatic | 6053 | Other | 135 | 217.3 | 1.8 |
6287 | Prius | 2019 | 28590 | Automatic | 5000 | Other | 135 | 217.3 | 1.8 |
6289 | Prius | 2014 | 15990 | Automatic | 23523 | Other | 0 | 134.5 | 1.8 |
6291 | Prius | 2019 | 26990 | Automatic | 7000 | Other | 135 | 217.3 | 1.8 |
6314 | Prius | 2016 | 15795 | Automatic | 47000 | Other | 0 | 134.5 | 1.8 |
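As a small sketch of how we might flag these rows for follow-up with the data owners (the 150 mpg cutoff and the file name are assumptions for illustration):
# Export the implausibly high mpg rows so they can be queried with the data provider.
# The 150 mpg cutoff and the file name are illustrative assumptions.
suspect_mpg = df[df.mpg > 150]
print(f'{len(suspect_mpg)} rows have mpg above 150')
suspect_mpg.to_csv('suspect_mpg_rows.csv', index=False)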
Let's create a version of the dataframe that encodes our three non-numeric columns as 0/1 indicator variables. We will use pd.get_dummies for this.
abt = pd.get_dummies(df, columns=['model', 'transmission', 'fuelType'])
We now have 32 columns
abt.head()
| | year | price | mileage | tax | mpg | engineSize | model_ Auris | model_ Avensis | model_ Aygo | model_ C-HR | ... | model_ Verso-S | model_ Yaris | transmission_Automatic | transmission_Manual | transmission_Other | transmission_Semi-Auto | fuelType_Diesel | fuelType_Hybrid | fuelType_Other | fuelType_Petrol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016 | 16000 | 24089 | 265 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2017 | 15995 | 18615 | 145 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 2015 | 13998 | 27469 | 265 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 2017 | 18998 | 14736 | 150 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 2017 | 17498 | 36284 | 145 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 32 columns
We will save the analytical base table (ABT) to a CSV file as a checkpoint in our analysis.
# Create ABT
abt.to_csv('analytical_base_table.csv', index=None)
This is a regression problem. We will try to estimate price (continuous variable) using the other features.
We will use the following algorithms and see which one performs best: Lasso, Ridge, ElasticNet, Random Forest Regressor, and Gradient Boosting Regressor.
I chose the first three because the regularization they provide normally improves on plain linear regression, especially in a case like this where the regression has many variables to fit.
Random Forest and Gradient Boosting Regressors are normally good choices in regression problems.
For Lasso and Ridge we will tune the alpha parameter.
For ElasticNet we will focus on the alpha and the l1 ratio.
For the Random Forest Regressor we will focus on the number of trees and the number of features to use per tree.
For the Gradient Boosting Regressor we will tune the number of estimators, the learning rate and the maximum depth.
First we load our table.
df = pd.read_csv('analytical_base_table.csv')
df.head()
| | year | price | mileage | tax | mpg | engineSize | model_ Auris | model_ Avensis | model_ Aygo | model_ C-HR | ... | model_ Verso-S | model_ Yaris | transmission_Automatic | transmission_Manual | transmission_Other | transmission_Semi-Auto | fuelType_Diesel | fuelType_Hybrid | fuelType_Other | fuelType_Petrol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016 | 16000 | 24089 | 265 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2017 | 15995 | 18615 | 145 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 2015 | 13998 | 27469 | 265 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 2017 | 18998 | 14736 | 150 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 2017 | 17498 | 36284 | 145 | 36.2 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 32 columns
df.shape
(6699, 32)
Now we create the X and y tables. X is everything except our target; y is our target variable, price.
X = df.drop('price', axis=1)
y = df.price
We can now split the data into training and testing sets. I will keep 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=77)
By creating pipelines, we can run cross-validation and apply our preprocessing within each fold. In this case, we will use StandardScaler to subtract the mean and divide by the standard deviation of each column.
# dictionary of all our pipelines
pipelines = {
'lasso' : make_pipeline(StandardScaler(), Lasso(random_state=77)),
'ridge' : make_pipeline(StandardScaler(), Ridge(random_state=77)),
'enet' : make_pipeline(StandardScaler(), ElasticNet(random_state=77)),
'rf' : make_pipeline(StandardScaler(), RandomForestRegressor(random_state=77)),
'gb' : make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=77))
}
These are the parameters we will search over in our hyperparameter tuning. If we had more time, we could use an expanded grid.
# hyperparameter grids
lasso_hyperparameters = {
'lasso__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}
ridge_hyperparameters = {
'ridge__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}
enet_hyperparameters = {
'elasticnet__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
'elasticnet__l1_ratio' : [0.1, 0.3, 0.5, 0.7, 0.9]
}
rf_hyperparameters = {
'randomforestregressor__n_estimators' : [100, 200],
'randomforestregressor__max_features' : ['auto', 'sqrt', 0.33]
}
gb_hyperparameters = {
'gradientboostingregressor__n_estimators' : [100, 200],
'gradientboostingregressor__learning_rate' : [0.05, 0.1, 0.2],
'gradientboostingregressor__max_depth' : [1, 3, 5]
}
By grouping everything into a dictionary, it is easy to loop over each model and its hyperparameter grid
# dictionary of hyperparameters
hyperparameters = {
'lasso' : lasso_hyperparameters,
'ridge' : ridge_hyperparameters,
'enet' : enet_hyperparameters,
'rf' : rf_hyperparameters,
'gb' : gb_hyperparameters
}
This step takes a few minutes
# loop through all our models and fit them using our pipelines and hyperparameters
fitted_models = {}
for name, pipeline in pipelines.items():
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10)
    model.fit(X_train, y_train)
    fitted_models[name] = model
    print(name, 'has been fitted')
lasso has been fitted
ridge has been fitted
enet has been fitted
rf has been fitted
gb has been fitted
It seems gradient boosting is the best model. We will confirm this using the test set.
# print the score of each model
for name, model in fitted_models.items():
    print(name, model.best_score_)
lasso 0.9244539982458957
ridge 0.9244573958197139
enet 0.9244568643352086
rf 0.9604453777122133
gb 0.9668996100757182
Since we need an average error lower than £1,500, we use mean absolute error (MAE) as the metric to confirm that gradient boosting is our best model, and to check whether we can meet the accuracy required by the end user. It seems we did!
# print the R^2 and MAE for each model (regression)
for name, model in fitted_models.items():
    pred = model.predict(X_test)
    print(name)
    print('-'*8)
    print(f'R^2: {r2_score(y_test, pred)}')
    print(f'MAE: {mean_absolute_error(y_test, pred)}')
    print()
lasso
--------
R^2: 0.9290397270899275
MAE: 1033.9090834078168

ridge
--------
R^2: 0.9290610512361053
MAE: 1033.7925054464924

enet
--------
R^2: 0.9290603223847058
MAE: 1033.818063999813

rf
--------
R^2: 0.961070619542453
MAE: 793.9879979328197

gb
--------
R^2: 0.9637539267332421
MAE: 758.1641555273221
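Beyond the average error, we can check directly how often we stay inside the £1,500 band. This is a small added sketch using the objects defined above; it was not part of the original run:
# Share of gradient boosting test-set predictions within £1,500 of the actual listed price.
# Added as an illustrative check; not part of the original pipeline.
gb_pred = fitted_models['gb'].predict(X_test)
within_band = (abs(gb_pred - y_test) <= 1500).mean()
print(f'{within_band:.1%} of test predictions are within £1,500 of the listed price')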
Gradient Boosting was the best model, with 200 trees, 0.05 learning rate and maximum depth of 5. With more time, we would try a bigger grid.
# print the pipeline of the best model
fitted_models['gb'].best_estimator_
Pipeline(steps=[('standardscaler', StandardScaler()), ('gradientboostingregressor', GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=200, random_state=77))])
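Since pickle was imported at the start, here is a minimal sketch of how we might save the winning pipeline for the Toyota specialist to reuse (the file name is an assumption for illustration):
# Save the best pipeline to disk so it can be reused without re-fitting.
# 'toyota_price_model.pkl' is an illustrative file name, not from the original analysis.
with open('toyota_price_model.pkl', 'wb') as f:
    pickle.dump(fitted_models['gb'].best_estimator_, f)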
Gradient boosting performed better than random forest and the regularized regressions.
Our best model had a mean absolute error (MAE) lower than £1,500, in this case about £758.
This MAE means that our model meets the requirement to be useful to our clients. The user stated that a car priced more than £1,500 above its true market value would not sell. Since our mean absolute error is about half that, we can be confident that the model will be useful and will be within the permitted error range most of the time.
We performed 10-fold cross-validation on the training data, which gives us good confidence that our evaluation holds for out-of-sample data.
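As a hedged sketch (not part of the original run), we could also inspect the spread of the 10 fold scores for the winning hyperparameters, to gauge how stable that estimate is:
# Mean and standard deviation of the 10 cross-validation R^2 scores for the best gradient boosting setup.
cv_results = fitted_models['gb'].cv_results_
best_idx = fitted_models['gb'].best_index_
print('Mean CV R^2:', cv_results['mean_test_score'][best_idx])
print('Std of CV R^2:', cv_results['std_test_score'][best_idx])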
This model could be further improved in several ways: