Background and Introduction: In this project, we apply the K-Nearest Neighbours (KNN) machine learning algorithm to a practical problem: predicting car prices. KNN is a simple algorithm that stores all available cases and handles new cases based on a similarity measure (e.g., a distance function). It works by computing the distances between a query and all the examples in the data, selecting the K examples closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).
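As a minimal sketch of the idea (with made-up numbers; `X_train`, `y_train` and `knn_predict` are illustrative names, not part of the project code), KNN regression can be written directly with NumPy:

```python
import numpy as np

# Toy training data: one feature and its known target values (made-up numbers)
X_train = np.array([1.0, 2.0, 3.0, 10.0])
y_train = np.array([100.0, 110.0, 120.0, 400.0])

def knn_predict(x_query, k=2):
    # Distance from the query to every stored example
    distances = np.abs(X_train - x_query)
    # Indices of the k closest examples
    nearest = np.argsort(distances)[:k]
    # Regression: average the targets of the k nearest neighbours
    return y_train[nearest].mean()

print(knn_predict(2.5, k=2))  # → 115.0 (average of 110.0 and 120.0)
```

For classification, the final step would instead take a majority vote over the neighbours' labels.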
Dataset: The dataset is taken from the UCI Machine Learning Repository and has 205 entries containing information on various attributes of imported cars. It consists of three kinds of information: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use compared with other cars. The risk rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk-factor symbol associated with their price; if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky; -3 that it is probably quite safe. The normalized-losses factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc.) and represents the average loss per car per year.
Prediction Strategy: Since we are interested in predicting car prices from the available data, we will follow the strategy below.
We will use the Python machine learning library scikit-learn for model training and predictions.
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
# Setting pandas display options for large data
pd.options.display.max_rows = 200
pd.options.display.max_columns = 150
The dataset comes without column names; these are contained in a separate file and have to be supplied while reading the data.
# The data set comes with an accompanying name file from which column names can be extracted
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
cars = pd.read_csv('imports-85.data', names = cols)
cars.head()
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  205 non-null    object
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       205 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               205 non-null    object
 19  stroke             205 non-null    object
 20  compression-rate   205 non-null    float64
 21  horsepower         205 non-null    object
 22  peak-rpm           205 non-null    object
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              205 non-null    object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
According to the accompanying file imports-85.names, which contains information on the respective attributes, the following columns contain missing entries (attribute number as given in that file, followed by the count of missing values):

- attribute 2 (normalized-losses): 41
- attribute 6 (num-of-doors): 2
- attribute 19 (bore): 4
- attribute 20 (stroke): 4
- attribute 22 (horsepower): 2
- attribute 23 (peak-rpm): 2
- attribute 26 (price): 4
These missing values are encoded as the string '?' in the dataframe. First we need to replace them with NaN in order to convert the numeric columns to the appropriate types.
cars = cars.replace('?', np.nan).copy()
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  164 non-null    object
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       203 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               201 non-null    object
 19  stroke             201 non-null    object
 20  compression-rate   205 non-null    float64
 21  horsepower         203 non-null    object
 22  peak-rpm           203 non-null    object
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              201 non-null    object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
Since we can only work with numeric columns holding continuous values, we will keep only the required columns and remove the rest from our dataset. The accompanying file imports-85.names tells us which variables are continuous. We will retain the numeric columns, convert them to dtype float, drop any rows with NaN in the price column, fill the remaining missing values with the mean of the respective variable, and min-max normalize all columns to a 0-1 scale, except the price column.
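As a toy illustration of this min-max scaling step (the sample values and the two-column frame below are made up for the example), each column is shifted by its minimum and divided by its range, and `price` is restored afterwards:

```python
import pandas as pd

# Made-up sample values for two of the dataset's columns
df = pd.DataFrame({'horsepower': [48.0, 111.0, 288.0],
                   'price': [5118.0, 13495.0, 45400.0]})

# Min-max scaling: every column mapped onto the 0-1 range
scaled = (df - df.min()) / (df.max() - df.min())
# Keep the target on its original scale
scaled['price'] = df['price']
```

After this, `scaled['horsepower']` runs from 0 to 1 while `scaled['price']` keeps its original dollar values.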
continuous_cols = ['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-rate',
'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
continuous_cars = cars[continuous_cols].astype('float')
continuous_cars = continuous_cars.dropna(subset = ['price']).copy()
continuous_cars = continuous_cars.fillna(continuous_cars.mean()).copy()
normalized_cars = (continuous_cars - continuous_cars.min()) / (continuous_cars.max() - continuous_cars.min())
normalized_cars['price'] = continuous_cars['price']
normalized_cars.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   normalized-losses  201 non-null    float64
 1   wheel-base         201 non-null    float64
 2   length             201 non-null    float64
 3   width              201 non-null    float64
 4   height             201 non-null    float64
 5   curb-weight        201 non-null    float64
 6   engine-size        201 non-null    float64
 7   bore               201 non-null    float64
 8   stroke             201 non-null    float64
 9   compression-rate   201 non-null    float64
 10  horsepower         201 non-null    float64
 11  peak-rpm           201 non-null    float64
 12  city-mpg           201 non-null    float64
 13  highway-mpg        201 non-null    float64
 14  price              201 non-null    float64
dtypes: float64(15)
memory usage: 25.1 KB
normalized_cars.describe()
| | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 |
| mean | 0.298429 | 0.355598 | 0.494045 | 0.477697 | 0.497222 | 0.414145 | 0.248587 | 0.564793 | 0.565192 | 0.197767 | 0.258864 | 0.394934 | 0.338308 | 0.386489 | 13207.129353 |
| std | 0.167520 | 0.176862 | 0.183913 | 0.179613 | 0.203985 | 0.200658 | 0.156781 | 0.191480 | 0.150499 | 0.250310 | 0.174606 | 0.195148 | 0.178423 | 0.179346 | 7947.066342 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5118.000000 |
| 25% | 0.188482 | 0.230321 | 0.383582 | 0.324786 | 0.350000 | 0.264158 | 0.139623 | 0.435714 | 0.495238 | 0.100000 | 0.102804 | 0.265306 | 0.166667 | 0.236842 | 7775.000000 |
| 50% | 0.298429 | 0.303207 | 0.479104 | 0.444444 | 0.525000 | 0.359193 | 0.222642 | 0.550000 | 0.580952 | 0.125000 | 0.219626 | 0.394934 | 0.305556 | 0.368421 | 10295.000000 |
| 75% | 0.376963 | 0.460641 | 0.632836 | 0.538462 | 0.641667 | 0.557797 | 0.301887 | 0.742857 | 0.638095 | 0.150000 | 0.317757 | 0.551020 | 0.472222 | 0.473684 | 16500.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 45400.000000 |
We will now write a general-purpose function that splits the above data into training and test sets and predicts prices on the test set, for both the univariate and multivariate cases and for any range of k values. First, we will randomize the dataset.
# randomize data once
# Don't rerun the complete kernel after randomization as it will change results
normalized_cars = normalized_cars.loc[np.random.permutation(normalized_cars.index)]
normalized_cars = normalized_cars.reset_index(drop = True)
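If reproducible results are preferred over the "don't rerun" caveat above, the permutation can be seeded; a sketch (the seed value and the stand-in dataframe below are arbitrary, for illustration only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [13495, 16500, 13950, 17450]})  # stand-in data

rng = np.random.RandomState(1)  # fixed seed: same permutation on every run
shuffled = df.loc[rng.permutation(df.index)].reset_index(drop=True)
```

Rerunning the cell then reshuffles the rows into exactly the same order every time.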
# Function for univariate and multivariate cases with varying k
def knn_train_test(k, df, features_train, feature_target, n_splits = 0.75):
    """ Trains the model and computes root mean squared error (RMSE) values for the target feature
    Args:
        k (int): evaluate k-nearest-neighbors models for k = 1 to k
        df (DataFrame): dataframe containing features and target
        features_train (list): df columns to train the model on
        feature_target (list): target df column
        n_splits (float): fraction of df used as the training set
    Returns:
        list: RMSE values, one per k value
    """
    train_df = df.head(int(len(df.index) * n_splits) + 1)
    test_df = df.tail(int(len(df.index) * (1 - n_splits)))
    rmse_values = []
    for param in range(1, k + 1):
        knn = KNeighborsRegressor(n_neighbors = param, algorithm = 'brute')
        knn.fit(train_df[features_train], train_df[feature_target])
        predictions = knn.predict(test_df[features_train])
        mse = mean_squared_error(test_df[feature_target], predictions)
        rmse_values.append(mse ** 0.5)
    return rmse_values
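A quick sanity check of the same fit/predict core on synthetic data (the column names and the noiseless linear relation below are invented for the test) confirms that the RMSE stays small when the feature fully determines the target:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
df = pd.DataFrame({'feature': rng.rand(100)})
df['price'] = 10000 + 5000 * df['feature']  # noiseless linear relation

# Same 75/25 head/tail split as in knn_train_test
train_df, test_df = df.head(75), df.tail(25)
rmse_values = []
for k in range(1, 6):
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    knn.fit(train_df[['feature']], train_df['price'])
    preds = knn.predict(test_df[['feature']])
    rmse_values.append(mean_squared_error(test_df['price'], preds) ** 0.5)
```

With prices spanning roughly 10,000-15,000 here, RMSE values well below that spread indicate the pipeline behaves sensibly.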
train = normalized_cars.columns[0:5]
target = ['price']
fig = plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
for col in train:
    plt.plot(range(1, 51), knn_train_test(50, normalized_cars, [col], target), label = col, linewidth = 1.5)
plt.xlabel("k")
plt.ylabel('rmse')
plt.legend()
plt.title('Root Mean Squared Error for Univariate Cases-I (k = 1-50)')
plt.show()
train = normalized_cars.columns[5:10]
target = ['price']
fig = plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
for col in train:
    plt.plot(range(1, 51), knn_train_test(50, normalized_cars, [col], target), label = col, linewidth = 1.5)
plt.xlabel("k")
plt.ylabel('rmse')
plt.legend()
plt.title('Root Mean Squared Error for Univariate Cases-II (k = 1-50)')
plt.show()
train = normalized_cars.columns[10:-1]
target = ['price']
fig = plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
for col in train:
    plt.plot(range(1, 51), knn_train_test(50, normalized_cars, [col], target), label = col, linewidth = 1.5)
plt.xlabel("k")
plt.ylabel('rmse')
plt.legend()
plt.title('Root Mean Squared Error for Univariate Cases-III (k = 1-50)')
plt.show()
It is quite clear from the above plots that, of the 14 features analyzed individually for k between 1 and 50, roughly half converge to a root mean squared error between 5,000 and 6,000 once k > 20. For all of these features the minima occur at k values below 10, at roughly 3,000 to 5,000. These error values are considerably high given that, as we saw above from normalized_cars.describe(), the mean price is approximately 13,000, the minimum price is close to 5,000, and the maximum price is about 45,400. We will drop the 8 high-RMSE features and consider only the 6 low-RMSE ("low-price") cases. These low-price features are plotted together below for k values of 1-5, 1-10 and 1-20 to highlight the differences.
# low-price features plotted together k = 1-5
low_price = ['horsepower', 'city-mpg', 'highway-mpg', 'curb-weight', 'engine-size', 'width']
target = ['price']
fig = plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
for col in low_price:
    plt.plot(range(1, 6), knn_train_test(5, normalized_cars, [col], target), label = col, linewidth = 1.5)
plt.xlabel("k")
plt.ylabel('rmse')
plt.legend(loc = 'lower left', bbox_to_anchor=(1, 0.5))
plt.title('Root Mean Squared Error for Low-Price Univariate Cases (k = 1-5)')
plt.xlim(1, 5)
plt.show()
# low-price features plotted together k = 1-10
low_price = ['horsepower', 'city-mpg', 'highway-mpg', 'curb-weight', 'engine-size', 'width']
target = ['price']
fig = plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
for col in low_price:
    plt.plot(range(1, 11), knn_train_test(10, normalized_cars, [col], target), label = col, linewidth = 1.5)
plt.xlabel("k")
plt.ylabel('rmse')
plt.legend(loc = 'lower left', bbox_to_anchor=(1, 0.5))
plt.title('Root Mean Squared Error for Low-Price Univariate Cases (k = 1-10)')
plt.xlim(1, 10)
plt.show()
# low-price features plotted together k = 1-20
low_price = ['horsepower', 'city-mpg', 'highway-mpg', 'curb-weight', 'engine-size', 'width']
target = ['price']
fig = plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
for col in low_price:
    plt.plot(range(1, 21), knn_train_test(20, normalized_cars, [col], target), label = col, linewidth = 1.5)
plt.xlabel("k")
plt.ylabel('rmse')
plt.legend(loc = 'lower left', bbox_to_anchor=(1, 0.5))
plt.title('Root Mean Squared Error for Low-Price Univariate Cases (k = 1-20)')
plt.xlim(1, 20)
plt.show()
It is evident from the above plots that the lowest RMSE values for all the low-price features occur at a k value of around 5. We will now consider these features together, combined in order from lowest to highest RMSE. First, we will define a function to get the k value yielding the minimum RMSE for each of the six features isolated above.
def knn_rmse_min(k, df, features_train, feature_target, n_splits = 0.75):
    """ Trains the model, computes root mean squared error (RMSE) values for the target feature
    and determines the k-nearest-neighbors value for which RMSE is minimum
    Args:
        k (int): evaluate k-nearest-neighbors models for k = 1 to k
        df (DataFrame): dataframe containing features and target
        features_train (list): df columns to train the model on
        feature_target (list): target df column
        n_splits (float): fraction of df used as the training set
    Returns:
        dict: minimum RMSE value keyed by the k value that attains it
    """
    train_df = df.head(int(len(df.index) * n_splits) + 1)
    test_df = df.tail(int(len(df.index) * (1 - n_splits)))
    rmse_by_k = {}
    for param in range(1, k + 1):
        knn = KNeighborsRegressor(n_neighbors = param, algorithm = 'brute')
        knn.fit(train_df[features_train], train_df[feature_target])
        predictions = knn.predict(test_df[features_train])
        mse = mean_squared_error(test_df[feature_target], predictions)
        rmse_by_k[param] = mse ** 0.5
    min_rmse_key = min(rmse_by_k, key=rmse_by_k.get)
    return {min_rmse_key: rmse_by_k[min_rmse_key]}
# low-price features min knn values k = 1-50
low_price = ['horsepower', 'city-mpg', 'highway-mpg', 'curb-weight', 'engine-size', 'width']
target = ['price']
for col in low_price:
    min_k = knn_rmse_min(50, normalized_cars, [col], target, n_splits = 0.75)
    print(min_k)
{2: 4368.918995586895}
{3: 4872.768931293354}
{5: 5611.89955526647}
{6: 4965.435532928272}
{2: 3308.477266054582}
{3: 5033.73003271155}
From the above, it can be seen that the RMSE-minimizing k value for every selected feature is 6 or below. We are not yet bound to a particular k value, as our functions are flexible enough to carry out hyperparameter optimization on the fly. However, we will consider multivariate cases in the following order, from the first two to all six features:
low_price = ['engine-size', 'horsepower', 'city-mpg', 'curb-weight', 'width', 'highway-mpg']
# All multivariate combinations, plotted together
low_price = ['engine-size', 'horsepower', 'city-mpg', 'curb-weight', 'width', 'highway-mpg']
first_two_features = low_price[0:2]
first_three_features = low_price[0:3]
first_four_features = low_price[0:4]
first_five_features = low_price[0:5]
six_features = low_price
low_price_features = [first_two_features, first_three_features, first_four_features, first_five_features, six_features]
target = ['price']
fig = plt.figure(figsize=(8, 8))
plt.style.use('fivethirtyeight')
for col in low_price_features:
    plt.plot(range(1, 11), knn_train_test(10, normalized_cars, col, target), label = col, linewidth = 1.5)
plt.xlabel("k")
plt.ylabel('rmse')
plt.legend(loc = 'lower left', bbox_to_anchor=(1, 0.5))
plt.title('Root Mean Squared Error for Low-Price Multivariate Cases (k = 1-10)')
plt.xlim(1, 10)
plt.show()
We can see above that for the multivariate cases RMSE increases as k increases, which means we will have to look at the distribution of training and test sets more closely through a k-fold analysis. First, however, let us determine the RMSE-minimizing k values for all five models above.
# Determine minimum knn for all five models
low_price_features = [first_two_features, first_three_features, first_four_features, first_five_features, six_features]
target = ['price']
for col in low_price_features:
    min_k = knn_rmse_min(50, normalized_cars, col, target, n_splits = 0.75)
    print(min_k)
{1: 3253.8129018122722} {1: 3055.242167161222} {1: 2671.9910890569977} {1: 3970.2814534992353} {1: 3972.552886998485}
The above values show that, for all five models, the minimum RMSE occurs at k = 1, i.e. RMSE rises as k is increased. Let's also determine the k values at which RMSE is maximal; we will write a function for that.
def knn_rmse_max(k, df, features_train, feature_target, n_splits = 0.75):
    """ Trains the model, computes root mean squared error (RMSE) values for the target feature
    and determines the k-nearest-neighbors value for which RMSE is maximum
    Args:
        k (int): evaluate k-nearest-neighbors models for k = 1 to k
        df (DataFrame): dataframe containing features and target
        features_train (list): df columns to train the model on
        feature_target (list): target df column
        n_splits (float): fraction of df used as the training set
    Returns:
        dict: maximum RMSE value keyed by the k value that attains it
    """
    train_df = df.head(int(len(df.index) * n_splits) + 1)
    test_df = df.tail(int(len(df.index) * (1 - n_splits)))
    rmse_by_k = {}
    for param in range(1, k + 1):
        knn = KNeighborsRegressor(n_neighbors = param, algorithm = 'brute')
        knn.fit(train_df[features_train], train_df[feature_target])
        predictions = knn.predict(test_df[features_train])
        mse = mean_squared_error(test_df[feature_target], predictions)
        rmse_by_k[param] = mse ** 0.5
    max_rmse_key = max(rmse_by_k, key=rmse_by_k.get)
    return {max_rmse_key: rmse_by_k[max_rmse_key]}
low_price_features = [first_two_features, first_three_features, first_four_features, first_five_features, six_features]
target = ['price']
for col in low_price_features:
    max_k = knn_rmse_max(50, normalized_cars, col, target, n_splits = 0.75)
    print(max_k)
{50: 6535.6889687842395}
{50: 6600.717875863504}
{50: 6355.602234909922}
{50: 6362.965983904047}
{50: 6381.51419476287}
The above confirms that RMSE rises steadily for all five models as k is increased from 1 to 50, which is not the expected behaviour. This is most likely due to an unlucky randomization. So, we will conduct k-fold cross-validation for all five models with 2 to 10 folds to see which distribution best optimizes the split between training and test sets. We will write a function for that.
def kfold_avg_rmse(k, df, features_train, feature_target, n_splits = 5):
    """ For each number of folds from 2 to n_splits, cross-validates a KNN model
    and computes the average and standard deviation of the RMSE across folds
    Args:
        k (int): number of nearest neighbors (fixed for all folds)
        df (DataFrame): dataframe containing features and target
        features_train (list): df columns to train the model on
        feature_target (list): target df column
        n_splits (int): maximum number of folds to evaluate
    Returns:
        list: [average RMSE per fold count, standard deviation of RMSE per fold count]
    """
    average_rmse_allfolds = []
    average_sd_allfolds = []
    for fold in range(2, n_splits + 1):
        kf = KFold(fold, shuffle=True, random_state=1)
        model = KNeighborsRegressor(n_neighbors = k, algorithm = 'brute')
        mses = cross_val_score(model, df[features_train], df[feature_target],
                               scoring="neg_mean_squared_error", cv=kf)
        rmses = np.sqrt(np.absolute(mses))
        average_rmse_allfolds.append(np.mean(rmses))
        average_sd_allfolds.append(np.std(rmses))
    return [average_rmse_allfolds, average_sd_allfolds]
low_price_features = [first_two_features, first_three_features, first_four_features, first_five_features, six_features]
target = ['price']
models_rmse =[]
models_sd = []
for col in low_price_features:
    # Run the cross-validation once per model and unpack both results
    rmse, sd = kfold_avg_rmse(5, normalized_cars, col, target, n_splits = 10)
    models_rmse.append(rmse)
    models_sd.append(sd)
print(models_rmse)
print(models_sd)
[[3788.904006124423, 3326.638308611929, 3096.1215598920217, 3040.666217642692, 3013.585206258978, 2960.321774043016, 2986.7604104406473, 2935.4167840226296, 2811.2469318111425], [3604.0274445155505, 3154.807111529546, 3056.620682088128, 2982.4725104493455, 2867.0890519502177, 2877.811571060391, 2888.847154500831, 2748.547519325456, 2747.9488264714882], [3431.8122958028794, 3520.7412147865575, 3364.885368941768, 3209.8036404800337, 3085.5992829732845, 3155.1508664953794, 3011.585411398726, 2813.922094261042, 2847.7896416796384], [3565.4246318099586, 3617.4252825042054, 3308.0079275031444, 3156.181479790488, 3116.644647116997, 3064.167697694363, 2930.8600499628888, 2821.801133848072, 2862.6543388769956], [3510.283758277111, 3631.251637847103, 3319.3285192043236, 3163.035592551541, 3122.051544854827, 3123.439672486437, 3002.436456302922, 2932.8198424211687, 2916.736181344633]] [[311.0426040157615, 343.00096673049444, 344.68948807789394, 544.947781238466, 660.9759117129839, 445.3186364731295, 749.06802418931, 850.6725280355512, 959.6483466559935], [331.42412581281496, 247.5101358551991, 338.3692685133275, 545.9625334811914, 721.0448349019874, 518.7045549378647, 929.8423989134221, 1032.2358746921836, 1003.6269457411858], [267.8581056793878, 505.30121669931253, 83.45573660221137, 715.9603385085095, 1021.7977197809018, 712.6443158696204, 1034.1009938944276, 1270.5110486742299, 1143.8388422614296], [422.0657663286404, 1018.414988735174, 513.9539166568715, 740.6936449217887, 1013.4479747087628, 957.8080301297065, 1203.0299712364078, 1244.2482574685648, 1082.9422283866295], [147.48744157875922, 971.7720179504709, 478.7010150895989, 725.4653443057812, 924.8758144212836, 912.5944759020289, 1081.518526308157, 1136.5294406708356, 1080.1030799559662]]
fig = plt.figure(figsize=(6, 6))
plt.style.use('fivethirtyeight')
models = ['model_2features', 'model_3features', 'model_4features', 'model_5features', 'model_6features']
for rmse, model in zip(models_rmse, models):
    plt.plot(rmse, label = model, linewidth = 1.5)
plt.xlabel("folds+2")
plt.ylabel('rmse')
plt.title('Root Mean Squared Error - Variation with Folds', fontsize = 12)
plt.xlim(0, 10)
plt.ylim(2500, 3800)
plt.legend(loc = 'lower left', bbox_to_anchor=(1, 0.5))
plt.show()
fig = plt.figure(figsize=(6, 6))
plt.style.use('fivethirtyeight')
models = ['model_2features', 'model_3features', 'model_4features', 'model_5features', 'model_6features']
for sd, model in zip(models_sd, models):
    plt.plot(sd, label = model, linewidth = 1.5)
plt.xlabel("folds+2")
plt.ylabel('sd')
plt.title('Standard Deviation - Variation with Folds', fontsize = 12)
plt.xlim(0, 10)
plt.ylim(200, 1400)
plt.legend(loc = 'lower left', bbox_to_anchor=(1, 0.5))
plt.show()
Note: In the above cross-validation plots, a value of 0 on the x-axis means 2 folds, due to a quirk of our function: add 2 to each x value, so 8 means 10 folds.
We can see that all five models tend to converge to an RMSE of approximately 2,800-3,200, with the minimum variance occurring at a 7-fold split. This is therefore the optimum splitting strategy between training and test data for this dataset, i.e. roughly 85% : 15% (training set : test set).
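The fold-to-ratio arithmetic can be checked directly: with k folds, each cross-validation round trains on (k − 1)/k of the rows. A one-liner (the function name is illustrative):

```python
def split_ratio(folds):
    # Each CV iteration trains on (folds - 1)/folds of the data
    train = (folds - 1) / folds
    return round(100 * train, 1), round(100 * (1 - train), 1)

print(split_ratio(7))  # → (85.7, 14.3)
```

So a 7-fold split corresponds to roughly 86% : 14%, close to the 85% : 15% quoted above.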
In this project we tried to predict car prices from a relatively small dataset (about 200 entries) containing various attributes. We can conclude that: