In this project we will use a dataset of car listings to build a model that predicts a car's market price from its attributes.
The data set we will be working with contains information on various cars. For each car we have technical details such as the engine's displacement, the weight of the car, its miles per gallon, how quickly it accelerates, and more. You can read more about the data set here and download it directly from here.
Whilst importing the data I notice that there isn't a header row, so I will use the column names from the documentation.
import pandas as pd
pd.set_option('display.max_columns', None)
# As there isn't a header row, use the column names from the documentation
cols = ['symboling','normalized-losses','make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels','engine-location','wheel-base','length','width','height','curb-weight','engine-type','num-of-cylinders','engine-size','fuel-system','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
cars = pd.read_csv("imports-85.data", header = None, names = cols)
cars.head()
 | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
From the table above I have identified the columns that are numerical; these will be used as the feature columns. They will also need to be cleaned.
feature_cols = ["normalized-losses", "num-of-doors", "wheel-base", "length", "width", "height", "curb-weight", "num-of-cylinders", "engine-size", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg"]
The target column will be "price".
cars.isnull().sum()
symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
From the above it appears there are no null values, but we know from the earlier table that missing entries are encoded as "?".
# Replace "?" with NaN
import numpy as np
df_data = cars.replace("?",np.nan)
df_data.isnull().sum()
symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64
Now we can see which columns need further work.
change "num-of-doors" values to numbers
# change "num-of-doors" values to numbers
cars['num-of-doors'] = cars['num-of-doors'].replace(to_replace={'four':4, 'two':2})
cars["num-of-doors"].value_counts()
4    114
2     89
?      2
Name: num-of-doors, dtype: int64
change "num-of-cylinders" values to numbers
# change "num-of-cylinders" values to numbers
cars['num-of-cylinders'] = cars['num-of-cylinders'].replace(to_replace={'four':4, 'six':6, 'five':5, 'eight':8, 'two':2, 'twelve':12, 'three':3})
cars['num-of-cylinders'].value_counts()
4     159
6      24
5      11
8       5
2       4
12      1
3       1
Name: num-of-cylinders, dtype: int64
replace "?" values with mean values for that column
#replace "?" values with mean values for that column
nvcol = ['normalized-losses', 'bore', 'stroke', 'horsepower', 'peak-rpm', 'price', 'num-of-doors']
for a in nvcol:
df_temp = cars[cars[a]!='?']
normalised_mean = np.mean(df_temp[a].astype(float))
cars[a] = cars[a].replace('?',normalised_mean).astype(float)
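For reference, the same imputation can be written more idiomatically with pd.to_numeric; a minimal sketch, assuming the "?" strings are still present in the columns:

for col in nvcol:
    numeric = pd.to_numeric(cars[col], errors="coerce")   # "?" becomes NaN
    cars[col] = numeric.fillna(numeric.mean())            # fill NaN with the column mean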
I will now min-max normalize the feature columns to the range 0 to 1, as they currently have differing scales.
# Normalize the feature columns
result = cars.copy()
for feature_name in feature_cols:
max_value = cars[feature_name].max()
min_value = cars[feature_name].min()
result[feature_name] = (cars[feature_name] - min_value) / (max_value - min_value)
result.head()
 | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 0.298429 | alfa-romero | gas | std | 0.0 | convertible | rwd | front | 0.058309 | 0.413433 | 0.316667 | 0.083333 | 0.411171 | dohc | 0.222222 | 0.260377 | mpfi | 0.664286 | 0.290476 | 0.1250 | 0.262500 | 0.346939 | 0.222222 | 0.289474 | 13495.0 |
1 | 3 | 0.298429 | alfa-romero | gas | std | 0.0 | convertible | rwd | front | 0.058309 | 0.413433 | 0.316667 | 0.083333 | 0.411171 | dohc | 0.222222 | 0.260377 | mpfi | 0.664286 | 0.290476 | 0.1250 | 0.262500 | 0.346939 | 0.222222 | 0.289474 | 16500.0 |
2 | 1 | 0.298429 | alfa-romero | gas | std | 0.0 | hatchback | rwd | front | 0.230321 | 0.449254 | 0.433333 | 0.383333 | 0.517843 | ohcv | 0.444444 | 0.343396 | mpfi | 0.100000 | 0.666667 | 0.1250 | 0.441667 | 0.346939 | 0.166667 | 0.263158 | 16500.0 |
3 | 2 | 0.518325 | audi | gas | std | 1.0 | sedan | fwd | front | 0.384840 | 0.529851 | 0.491667 | 0.541667 | 0.329325 | ohc | 0.222222 | 0.181132 | mpfi | 0.464286 | 0.633333 | 0.1875 | 0.225000 | 0.551020 | 0.305556 | 0.368421 | 13950.0 |
4 | 2 | 0.518325 | audi | gas | std | 1.0 | sedan | 4wd | front | 0.373178 | 0.529851 | 0.508333 | 0.541667 | 0.518231 | ohc | 0.333333 | 0.283019 | mpfi | 0.464286 | 0.633333 | 0.0625 | 0.279167 | 0.551020 | 0.138889 | 0.157895 | 17450.0 |
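For reference, the same min-max scaling can be done in one vectorized step, since pandas aligns the subtraction and division on column labels; this sketch produces the same "result" frame:

mins = cars[feature_cols].min()
maxs = cars[feature_cols].max()
result = cars.copy()
result[feature_cols] = (cars[feature_cols] - mins) / (maxs - mins)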
I will now train and test a univariate k-nearest neighbors model on each feature to identify the feature with the lowest RMSE.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_train_test(data, cols):
    # Shuffle the row indices, then split 75/25 into train and test sets
    rows = len(data)
    np.random.seed(1)
    indexs = np.random.permutation(rows)
    train_data = data.iloc[indexs[: round(rows * 0.75)]]
    test_data = data.iloc[indexs[round(rows * 0.75):]]
    # Fit a univariate model on each column and record its RMSE
    rmses = {}
    for a in cols:
        knn = KNeighborsRegressor()
        knn.fit(train_data[[a]], train_data["price"])
        predictions = knn.predict(test_data[[a]])
        mse = mean_squared_error(test_data["price"], predictions)
        rmses[a] = mse ** 0.5
    return rmses
cols_rmse = knn_train_test(result,feature_cols)
cols_rmse
{'normalized-losses': 8321.741649654248, 'num-of-doors': 13693.688198054244, 'wheel-base': 6490.955739274032, 'length': 4784.906578979326, 'width': 5068.972256031248, 'height': 7792.0195878192135, 'curb-weight': 4003.636111832969, 'num-of-cylinders': 5157.447800986909, 'engine-size': 4363.560152931148, 'bore': 9048.861283343298, 'stroke': 8268.729608497286, 'compression-ratio': 7935.054256508373, 'horsepower': 5406.424272782797, 'peak-rpm': 8876.974709964941, 'city-mpg': 4449.98254875967, 'highway-mpg': 4270.796260700418}
I will use a scatter plot to present the results.
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(cols_rmse.keys(),cols_rmse.values())
plt.xticks(rotation=90)
plt.show()
For the default k = 5, curb-weight has the lowest RMSE.
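The same answer can be read off programmatically from the dictionary:

best_feature = min(cols_rmse, key=cols_rmse.get)
print(best_feature, cols_rmse[best_feature])   # curb-weight 4003.636...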
I will now amend the function to use a 50/50 split and to vary k, the number of neighbors, presenting the results for each feature in individual scatter plots.
def knn_train_test(data, cols):
rows = len(data)
np.random.seed(1)
indexs = np.random.permutation(rows)
    train_data = data.iloc[indexs[: round(rows / 2)]]   # note: a 50/50 split this time
    test_data = data.iloc[indexs[round(rows / 2):]]
k = [1,3,5,7,9]
df_rmses = pd.DataFrame()
for a in cols:
for val in k:
knn = KNeighborsRegressor(n_neighbors=val)
knn.fit(train_data[[a]], train_data["price"])
predictions = knn.predict(test_data[[a]])
mse = mean_squared_error(test_data["price"],predictions)
val = str(val)
df_rmses.loc[a,val]= mse ** (0.5)
plt.scatter(k,df_rmses.loc[a,:])
plt.xlabel("Number of Folds")
plt.title(a + "RMSE Values and Folds")
plt.show()
return df_rmses
data_rmses = knn_train_test(result,feature_cols)
I will now amend the function to work with multiple columns.
#use all the feature columns
def knn_train_test3(data, cols):
rows = len(data)
np.random.seed(1)
indexs = np.random.permutation(rows)
train_data = data.iloc[indexs[: round(rows * 0.75)]]
test_data = data.iloc[indexs[round(rows * 0.75):]]
knn = KNeighborsRegressor(algorithm='brute')
knn.fit(train_data[cols], train_data["price"])
predictions = knn.predict(test_data[cols])
mse = mean_squared_error(test_data["price"],predictions)
rmse = mse ** (0.5)
return rmse
cols_rmse = knn_train_test3(result,feature_cols)
cols_rmse
4210.154820792211
Using all of the feature columns at the default k = 5 produces an RMSE at the lower end of the results, although still higher than curb-weight alone. That would indicate that some features add noise and we are using too many of them.
I will now identify the best 2, 3, 4, and 5 features and obtain the RMSE for each combination.
# First, rank the results from the previous exercise
data_rmses
feature | k=1 | k=3 | k=5 | k=7 | k=9
---|---|---|---|---|---|
normalized-losses | 6998.742471 | 7038.210876 | 7374.513274 | 7722.670399 | 7507.512183 |
num-of-doors | 9305.990265 | 9418.765296 | 8631.127463 | 8471.518375 | 8443.603301 |
wheel-base | 5400.514190 | 5629.538535 | 5743.877086 | 6048.744598 | 6360.529323 |
length | 6853.715820 | 5113.209768 | 5418.778301 | 5618.149069 | 5641.426310 |
width | 5739.245757 | 5291.949722 | 5030.139338 | 4933.639635 | 4951.829909 |
height | 10653.050778 | 8176.934988 | 7832.152833 | 7743.870133 | 7632.665665 |
curb-weight | 5842.169654 | 4527.849193 | 4436.523561 | 4320.057971 | 4181.086221 |
num-of-cylinders | 7268.475153 | 5838.243342 | 5346.389179 | 5651.315717 | 5809.659051 |
engine-size | 4775.980595 | 4211.354845 | 4244.103421 | 4028.227456 | 4125.116174 |
bore | 7930.548591 | 6353.890994 | 6746.031651 | 7246.705124 | 7184.417704 |
stroke | 7345.451364 | 7455.212314 | 8096.653898 | 7935.188914 | 8066.594604 |
compression-ratio | 7255.772135 | 7485.226184 | 7177.202061 | 7333.967328 | 7426.772346 |
horsepower | 5261.091646 | 4995.302906 | 5092.272401 | 5025.308887 | 4989.986569 |
peak-rpm | 9386.461613 | 8096.114657 | 7965.541705 | 8256.807774 | 8101.582870 |
city-mpg | 6449.848972 | 4808.843642 | 5085.047194 | 4958.158653 | 5021.788908 |
highway-mpg | 4445.090471 | 4442.016316 | 4245.734567 | 4266.445996 | 4522.647594 |
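As a quick cross-check, the single best feature and k combination can be pulled straight from this table (the k column labels are stored as strings by the function above):

best_feat, best_k = data_rmses.stack().idxmin()
print(best_feat, best_k, data_rmses.loc[best_feat, best_k])   # engine-size, k=7, ~4028.2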
To rank the features I will calculate the mean RMSE of each row across the five k values and sort the table by it.
# Calculate the average RMSE of each row
data_rmses["average"] = data_rmses.mean(axis=1)
# Sort the dataframe by the average column
data_rmses.sort_values(by='average')
feature | k=1 | k=3 | k=5 | k=7 | k=9 | average
---|---|---|---|---|---|---|
engine-size | 4775.980595 | 4211.354845 | 4244.103421 | 4028.227456 | 4125.116174 | 4276.956498 |
highway-mpg | 4445.090471 | 4442.016316 | 4245.734567 | 4266.445996 | 4522.647594 | 4384.386989 |
curb-weight | 5842.169654 | 4527.849193 | 4436.523561 | 4320.057971 | 4181.086221 | 4661.537320 |
horsepower | 5261.091646 | 4995.302906 | 5092.272401 | 5025.308887 | 4989.986569 | 5072.792482 |
width | 5739.245757 | 5291.949722 | 5030.139338 | 4933.639635 | 4951.829909 | 5189.360872 |
city-mpg | 6449.848972 | 4808.843642 | 5085.047194 | 4958.158653 | 5021.788908 | 5264.737474 |
length | 6853.715820 | 5113.209768 | 5418.778301 | 5618.149069 | 5641.426310 | 5729.055853 |
wheel-base | 5400.514190 | 5629.538535 | 5743.877086 | 6048.744598 | 6360.529323 | 5836.640746 |
num-of-cylinders | 7268.475153 | 5838.243342 | 5346.389179 | 5651.315717 | 5809.659051 | 5982.816488 |
bore | 7930.548591 | 6353.890994 | 6746.031651 | 7246.705124 | 7184.417704 | 7092.318813 |
normalized-losses | 6998.742471 | 7038.210876 | 7374.513274 | 7722.670399 | 7507.512183 | 7328.329841 |
compression-ratio | 7255.772135 | 7485.226184 | 7177.202061 | 7333.967328 | 7426.772346 | 7335.788011 |
stroke | 7345.451364 | 7455.212314 | 8096.653898 | 7935.188914 | 8066.594604 | 7779.820219 |
peak-rpm | 9386.461613 | 8096.114657 | 7965.541705 | 8256.807774 | 8101.582870 | 8361.301724 |
height | 10653.050778 | 8176.934988 | 7832.152833 | 7743.870133 | 7632.665665 | 8407.734879 |
num-of-doors | 9305.990265 | 9418.765296 | 8631.127463 | 8471.518375 | 8443.603301 | 8854.200940 |
rmses_dataframe = pd.DataFrame()
rows = len(result)
np.random.seed(1)
indexs = np.random.permutation(rows)
train_data = result.iloc[indexs[: round(rows * 0.75)]]
test_data = result.iloc[indexs[round(rows * 0.75):]]
#2 best features
kn_two_bf = KNeighborsRegressor()
kn_two_bf.fit(train_data[["engine-size", "highway-mpg"]], train_data["price"])
predictions = kn_two_bf.predict(test_data[["engine-size", "highway-mpg"]])
mse = mean_squared_error(test_data["price"], predictions)
print(mse ** 0.5)
rmses_dataframe.loc["rmse", "engine-size, highway-mpg"] = (mse ** 0.5)
3680.7880754786975
#3 best features
kn_three_bf = KNeighborsRegressor()
kn_three_bf.fit(train_data[["engine-size", "highway-mpg", "curb-weight"]], train_data["price"])
predictions = kn_three_bf.predict(test_data[["engine-size", "highway-mpg", "curb-weight"]])
mse = mean_squared_error(test_data["price"], predictions)
rmses_dataframe.loc["rmse", "engine-size, highway-mpg, curb-weight"] = (mse ** 0.5)
#4 best features
kn_four_bf = KNeighborsRegressor()
kn_four_bf.fit(train_data[["engine-size", "highway-mpg", "curb-weight","horsepower"]], train_data["price"])
predictions = kn_four_bf.predict(test_data[["engine-size", "highway-mpg", "curb-weight","horsepower"]])
mse = mean_squared_error(test_data["price"], predictions)
rmses_dataframe.loc["rmse", "engine-size, highway-mpg, curb-weight, horsepower"] = (mse ** 0.5)
#5 best features
kn_five_bf = KNeighborsRegressor()
kn_five_bf.fit(train_data[["engine-size", "highway-mpg", "curb-weight","horsepower", "width"]], train_data["price"])
predictions = kn_five_bf.predict(test_data[["engine-size", "highway-mpg", "curb-weight","horsepower", "width"]])
mse = mean_squared_error(test_data["price"], predictions)
rmses_dataframe.loc["rmse", "engine-size, highway-mpg, curb-weight, horsepower, width"] = (mse ** 0.5)
rmses_dataframe
 | engine-size, highway-mpg | engine-size, highway-mpg, curb-weight | engine-size, highway-mpg, curb-weight, horsepower | engine-size, highway-mpg, curb-weight, horsepower, width
---|---|---|---|---|
rmse | 3680.788075 | 3634.493956 | 3589.907892 | 3373.951664 |
From the above results we can see that the five-feature model has the lowest RMSE.
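For reference, the four cells that produced this table could be collapsed into a single loop over the ranked features; a sketch reusing the train/test split and names already defined:

best_features = ["engine-size", "highway-mpg", "curb-weight", "horsepower", "width"]
for n in range(2, 6):
    feats = best_features[:n]
    knn = KNeighborsRegressor()
    knn.fit(train_data[feats], train_data["price"])
    predictions = knn.predict(test_data[feats])
    mse = mean_squared_error(test_data["price"], predictions)
    rmses_dataframe.loc["rmse", ", ".join(feats)] = mse ** 0.5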
I will now tune the three best models, varying the hyperparameter k from 1 to 25.
# 3 best features, varying the k hyperparameter
hp_df = pd.DataFrame()
hp = list(range(1, 26))   # k values 1 to 25
for val in hp:
kn_three_bf = KNeighborsRegressor(n_neighbors=val, algorithm ="brute")
kn_three_bf.fit(train_data[["engine-size", "highway-mpg", "curb-weight"]], train_data["price"])
predictions = kn_three_bf.predict(test_data[["engine-size", "highway-mpg", "curb-weight"]])
mse = mean_squared_error(test_data["price"], predictions)
hp_df.loc[val, "engine-size, highway-mpg, curb-weight"] = (mse ** 0.5)
# 4 best features, varying the k hyperparameter
for val in hp:
kn_four_bf = KNeighborsRegressor(n_neighbors= val, algorithm= "brute")
kn_four_bf.fit(train_data[["engine-size", "highway-mpg", "curb-weight","horsepower"]], train_data["price"])
predictions = kn_four_bf.predict(test_data[["engine-size", "highway-mpg", "curb-weight","horsepower"]])
mse = mean_squared_error(test_data["price"], predictions)
hp_df.loc[val, "engine-size, highway-mpg, curb-weight, horsepower"] = (mse ** 0.5)
# 5 best features, varying the k hyperparameter
for val in hp:
kn_five_bf = KNeighborsRegressor(n_neighbors= val, algorithm= "brute")
kn_five_bf.fit(train_data[["engine-size", "highway-mpg", "curb-weight","horsepower", "width"]], train_data["price"])
predictions = kn_five_bf.predict(test_data[["engine-size", "highway-mpg", "curb-weight","horsepower", "width"]])
mse = mean_squared_error(test_data["price"], predictions)
hp_df.loc[val, "engine-size, highway-mpg, curb-weight, horsepower, width"] = (mse ** 0.5)
for a in hp_df.columns:
    plt.scatter(hp, hp_df[a])
    plt.xlabel("k (n_neighbors)")
    plt.title(a)
    plt.show()
In summary:
For the 3 best features ("engine-size, highway-mpg, curb-weight"), a k value of 3 had the lowest RMSE.
For the 4 best features ("engine-size, highway-mpg, curb-weight, horsepower"), a k value of 1 had the lowest RMSE.
For the 5 best features ("engine-size, highway-mpg, curb-weight, horsepower, width"), a k value of 2 had the lowest RMSE.
The best RMSE came from the 4-feature model. In each case the RMSE rose as k increased, which would indicate a weak relationship between price and the features: predictions degrade as soon as more distant neighbors are averaged in.
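As an aside, scikit-learn can automate this kind of k search. A minimal sketch using GridSearchCV on the 4-feature model; note it uses 5-fold cross-validation internally rather than our fixed split, so the numbers will differ slightly:

from sklearn.model_selection import GridSearchCV

four_best = ["engine-size", "highway-mpg", "curb-weight", "horsepower"]
grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={"n_neighbors": list(range(1, 26))},
                    scoring="neg_mean_squared_error",
                    cv=5)
grid.fit(result[four_best], result["price"])
print(grid.best_params_, (-grid.best_score_) ** 0.5)   # best k and the corresponding RMSE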
I will now perform cross-validation using the 4-best-features model, as it had the lowest RMSE.
# Cross-validation using the 4-best-features model, as it had the lowest RMSE
from sklearn.model_selection import cross_val_score, KFold
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]
cross_dataframe = pd.DataFrame()
for fold in num_folds:
kf = KFold(fold, shuffle=True, random_state=1)
model = KNeighborsRegressor()
mses = cross_val_score(model, result[["engine-size", "highway-mpg", "curb-weight","horsepower"]], result["price"], scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)
std_rmse = np.std(rmses)
print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))
cross_dataframe.loc[fold, "avg RMSE"] = avg_rmse
cross_dataframe.loc[fold, "std RMSE"] = std_rmse
3 folds: avg RMSE: 3510.0526332782865 std RMSE: 332.58816521741426
5 folds: avg RMSE: 3476.0273977627003 std RMSE: 321.0543995170088
7 folds: avg RMSE: 3403.043271982663 std RMSE: 828.958208984916
9 folds: avg RMSE: 3427.657612558447 std RMSE: 801.5174522094895
10 folds: avg RMSE: 3407.6692824307006 std RMSE: 921.73920671967
11 folds: avg RMSE: 3379.882825794062 std RMSE: 952.6167633412856
13 folds: avg RMSE: 3225.1455289784644 std RMSE: 1280.6632710242682
15 folds: avg RMSE: 3306.121933699179 std RMSE: 1230.0766308579587
17 folds: avg RMSE: 3249.5024220447585 std RMSE: 1224.681518335031
19 folds: avg RMSE: 3153.4932735959846 std RMSE: 1420.4064444480312
21 folds: avg RMSE: 3095.6458450327077 std RMSE: 1474.7560127054214
23 folds: avg RMSE: 3165.340543897426 std RMSE: 1419.3809096175714
cross_dataframe
folds | avg RMSE | std RMSE
---|---|---|
3 | 3510.052633 | 332.588165 |
5 | 3476.027398 | 321.054400 |
7 | 3403.043272 | 828.958209 |
9 | 3427.657613 | 801.517452 |
10 | 3407.669282 | 921.739207 |
11 | 3379.882826 | 952.616763 |
13 | 3225.145529 | 1280.663271 |
15 | 3306.121934 | 1230.076631 |
17 | 3249.502422 | 1224.681518 |
19 | 3153.493274 | 1420.406444 |
21 | 3095.645845 | 1474.756013 |
23 | 3165.340544 | 1419.380910 |
I will now present the results visually for clarity:
cross_dataframe.plot(kind="bar")
plt.legend(bbox_to_anchor=(1,1), loc="upper left")
plt.xlabel("Number of Folds")
plt.ylabel("RMSE")
plt.xticks(rotation=0)
plt.show()
From the chart above we can see that 5 folds gives the lowest variance in RMSE for "engine-size", "highway-mpg", "curb-weight" and "horsepower": increasing the number of folds lowers the average RMSE somewhat, but the standard deviation grows sharply. The size of the RMSE itself suggests the relationship to price is not strong; these features are simply less inaccurate than the others.