In this project i will use the K-Nearest Neighbors algorithm to predict a car's market price using its attributes. The dataset contains information on various cars. For each car we have information about the technical aspects of the vehicle such as the motor's displacement, the weight of the car, the miles per gallon, how fast the car accelerates and more. Information about the data can be found here and can be downloaded from here
import pandas as pd
pd.options.display.max_columns = 100
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
cars = pd.read_csv('imports-85.data', names=['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price'])
cars.head()
symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
Now let's select the columns with numeric and constant values
num_cars = cars[['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']]
Lets transform the ? character into NaN
num_cars = num_cars.replace('?', np.nan)
num_cars.head()
normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | NaN | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 164 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 164 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
Some columns are recognized as strings, let's transform the entire dataset into float
num_cars = num_cars.astype('float')
num_cars.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 205 entries, 0 to 204 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 normalized-losses 164 non-null float64 1 wheel-base 205 non-null float64 2 length 205 non-null float64 3 width 205 non-null float64 4 height 205 non-null float64 5 curb-weight 205 non-null float64 6 engine-size 205 non-null float64 7 bore 201 non-null float64 8 stroke 201 non-null float64 9 compression-rate 205 non-null float64 10 horsepower 203 non-null float64 11 peak-rpm 203 non-null float64 12 city-mpg 205 non-null float64 13 highway-mpg 205 non-null float64 14 price 201 non-null float64 dtypes: float64(15) memory usage: 24.1 KB
I'm going to drop those 4 rows with NaN values in the price column
num_cars.dropna(subset=['price'], inplace=True)
num_cars.isnull().sum()
normalized-losses 37 wheel-base 0 length 0 width 0 height 0 curb-weight 0 engine-size 0 bore 4 stroke 4 compression-rate 0 horsepower 2 peak-rpm 2 city-mpg 0 highway-mpg 0 price 0 dtype: int64
After dropping the NaN values in the price column we find NaN values in the columns num_doors, bore, stroke, horsepower and peak_rpm
The 'normalized-losses' is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door, small, station wagons, sports/specialty, etc…), and represents the average loss per car per year. So it's not ok to drop the entirely column.
There are 37 rows with NaN values in the normalized_losses and 205 rows in total, that means that the NaN valures in this column represent ~20% of the dataset. So dropping them is not a good idea becase we would loss a considerable amount of data. According to the documentation this column goes from 65 to 256, so it's ok to fill the values with the mean value of the column, the same will be applied in the num_doors, bore, stroke, horsepower and peak_rpm columns
num_cars = num_cars.fillna(num_cars.mean())
num_cars.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 201 entries, 0 to 204 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 normalized-losses 201 non-null float64 1 wheel-base 201 non-null float64 2 length 201 non-null float64 3 width 201 non-null float64 4 height 201 non-null float64 5 curb-weight 201 non-null float64 6 engine-size 201 non-null float64 7 bore 201 non-null float64 8 stroke 201 non-null float64 9 compression-rate 201 non-null float64 10 horsepower 201 non-null float64 11 peak-rpm 201 non-null float64 12 city-mpg 201 non-null float64 13 highway-mpg 201 non-null float64 14 price 201 non-null float64 dtypes: float64(15) memory usage: 25.1 KB
Now let's see how the dataframe looks
num_cars.head()
normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 122.0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548.0 | 130.0 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21.0 | 27.0 | 13495.0 |
1 | 122.0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548.0 | 130.0 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21.0 | 27.0 | 16500.0 |
2 | 122.0 | 94.5 | 171.2 | 65.5 | 52.4 | 2823.0 | 152.0 | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19.0 | 26.0 | 16500.0 |
3 | 164.0 | 99.8 | 176.6 | 66.2 | 54.3 | 2337.0 | 109.0 | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24.0 | 30.0 | 13950.0 |
4 | 164.0 | 99.4 | 176.6 | 66.4 | 54.3 | 2824.0 | 136.0 | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18.0 | 22.0 | 17450.0 |
After cleaning and transforming the data, let's normalize the dataframe except the price column
price_col = num_cars['price']
num_cars = (num_cars - num_cars.min()) / (num_cars.max() - num_cars.min())
num_cars['price'] = price_col
num_cars.head()
normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.298429 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 13495.0 |
1 | 0.298429 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 16500.0 |
2 | 0.298429 | 0.230321 | 0.449254 | 0.444444 | 0.383333 | 0.517843 | 0.343396 | 0.100000 | 0.666667 | 0.1250 | 0.495327 | 0.346939 | 0.166667 | 0.263158 | 16500.0 |
3 | 0.518325 | 0.384840 | 0.529851 | 0.504274 | 0.541667 | 0.329325 | 0.181132 | 0.464286 | 0.633333 | 0.1875 | 0.252336 | 0.551020 | 0.305556 | 0.368421 | 13950.0 |
4 | 0.518325 | 0.373178 | 0.529851 | 0.521368 | 0.541667 | 0.518231 | 0.283019 | 0.464286 | 0.633333 | 0.0625 | 0.313084 | 0.551020 | 0.138889 | 0.157895 | 17450.0 |
Let's start with a simple univariate model
def knn_train_test(train_col, target_col, df):
knn = KNeighborsRegressor()
#Divide the dataset into training and test set
np.random.seed(1)
shuffled_index = np.random.permutation(df.index)
rand_df = df.reindex(shuffled_index)
last_train_row = int((len(rand_df) / 2))
train_df = rand_df.iloc[0:last_train_row]
test_df = rand_df.iloc[last_train_row:]
#Fit, train and predict knn
knn.fit(train_df[[train_col]], train_df[target_col])
prediction = knn.predict(test_df[[target_col]])
#Calculate MSE, RMSE and R2
mse = mean_squared_error(test_df[target_col], prediction)
rmse = np.sqrt(mse)
return rmse
train_col = num_cars.columns.drop('price')
rmse_results = {}
for col in train_col:
rmse = knn_train_test(col, 'price', num_cars)
rmse_results[col] = rmse
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
compression-rate 8211.362084 height 8563.716277 normalized-losses 9049.200281 peak-rpm 9274.562418 highway-mpg 10306.945989 city-mpg 10790.811235 stroke 11392.931393 bore 11648.715892 wheel-base 12113.028618 length 15280.245008 horsepower 19499.466023 width 21957.031122 engine-size 22558.988688 curb-weight 22715.744609 dtype: float64
def knn_train_test(train_col, target_col, df):
#Divide the dataset into training and test set
np.random.seed(1)
shuffled_index = np.random.permutation(df.index)
rand_df = df.reindex(shuffled_index)
last_train_row = int((len(rand_df) / 2))
train_df = rand_df.iloc[0:last_train_row]
test_df = rand_df.iloc[last_train_row:]
#Fit, train and predict knn
k = [1, 3, 5, 7, 9]
k_rmse = {}
for val in k:
knn = KNeighborsRegressor(n_neighbors=val)
knn.fit(train_df[[train_col]], train_df[target_col])
prediction = knn.predict(test_df[[target_col]])
#Calculate RMSE
mse = mean_squared_error(test_df[target_col], prediction)
rmse = np.sqrt(mse)
k_rmse[val] = rmse
return k_rmse
rmse_results = {}
train_cols = num_cars.columns.drop('price')
for col in train_cols:
rmse_val = knn_train_test(col, 'price', num_cars)
rmse_results[col] = rmse_val
rmse_results
{'normalized-losses': {1: 8705.821566544702, 3: 8324.478981637923, 5: 9049.200281409916, 7: 9687.642855117061, 9: 8725.16770421584}, 'wheel-base': {1: 20248.36501992687, 3: 10858.639280305888, 5: 12113.028618265982, 7: 14959.821637815054, 9: 14289.269224809848}, 'length': {1: 20248.36501992687, 3: 21661.037439741816, 5: 15280.24500769047, 7: 14959.821637815054, 9: 15344.326750559727}, 'width': {1: 20248.36501992687, 3: 22944.445289083855, 5: 21957.031122365224, 7: 20350.698362284606, 9: 18404.053154200246}, 'height': {1: 9134.452438785569, 3: 9293.973051882005, 5: 8563.716277133866, 7: 8356.05606856574, 9: 8179.371987342417}, 'curb-weight': {1: 20845.13315091098, 3: 21661.037439741816, 5: 22715.744608774967, 7: 19229.739888900913, 9: 16283.869938367277}, 'engine-size': {1: 23917.722163804614, 3: 22732.673900117097, 5: 22558.988688413323, 7: 22340.337129615156, 9: 21519.849668067345}, 'bore': {1: 12104.886855405886, 3: 11026.30721555867, 5: 11648.715892277994, 7: 12180.980480625842, 9: 12281.207697921911}, 'stroke': {1: 20845.13315091098, 3: 15201.911122002919, 5: 11392.931393466635, 7: 13286.897711190293, 9: 11307.596666201045}, 'compression-rate': {1: 8180.095347310391, 3: 8319.86256550617, 5: 8211.362084186992, 7: 9093.760019973523, 9: 8768.46582925998}, 'horsepower': {1: 22492.603018206974, 3: 17677.62400907318, 5: 19499.466022899647, 7: 20313.129758313873, 9: 17468.68421053942}, 'peak-rpm': {1: 8871.878937473624, 3: 9593.778472621629, 5: 9274.562418201138, 7: 8148.230785298795, 9: 8282.472226728549}, 'city-mpg': {1: 10470.005103216814, 3: 10732.921925012071, 5: 10790.811234812078, 7: 10440.52270608314, 9: 10424.19913295505}, 'highway-mpg': {1: 10470.005103216814, 3: 10473.987688297588, 5: 10306.945989059326, 7: 10440.52270608314, 9: 10444.279273581038}}
for k,v in rmse_results.items():
x = list(v.keys())
y = list(v.values())
plt.plot(x,y)
plt.xlabel('k value')
plt.ylabel('RMSE')
feature_avg_rmse = {}
for k,v in rmse_results.items():
avg_rmse = np.mean(list(v.values()))
feature_avg_rmse[k] = avg_rmse
series_avg_rmse = pd.Series(feature_avg_rmse)
sorted_series_avg_rmse = series_avg_rmse.sort_values()
print(sorted_series_avg_rmse)
sorted_features = sorted_series_avg_rmse.index
compression-rate 8514.709169 height 8705.513965 peak-rpm 8834.184568 normalized-losses 8898.462278 highway-mpg 10427.148152 city-mpg 10571.692020 bore 11848.419628 stroke 14406.894009 wheel-base 14493.824756 length 17498.759171 horsepower 19490.301404 curb-weight 20147.105005 width 20780.918590 engine-size 22613.914310 dtype: float64
def knn_train_test(train_cols, target_col, df):
np.random.seed(1)
# Randomize order of rows in data frame.
shuffled_index = np.random.permutation(df.index)
rand_df = df.reindex(shuffled_index)
# Divide number of rows in half and round.
last_train_row = int(len(rand_df) / 2)
# Select the first half and set as training set.
# Select the second half and set as test set.
train_df = rand_df.iloc[0:last_train_row]
test_df = rand_df.iloc[last_train_row:]
k_values = [5]
k_rmses = {}
for k in k_values:
# Fit model using k nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(train_df[train_cols], train_df[target_col])
# Make predictions using model.
predicted_labels = knn.predict(test_df[train_cols])
# Calculate and return RMSE.
mse = mean_squared_error(test_df[target_col], predicted_labels)
rmse = np.sqrt(mse)
k_rmses[k] = rmse
return k_rmses
k_rmse_results = {}
for nr_best_feats in range(2,7):
k_rmse_results['{} best features'.format(nr_best_feats)] = knn_train_test(
sorted_features[:nr_best_feats],
'price',
num_cars
)
k_rmse_results
{'2 best features': {5: 7136.447543101927}, '3 best features': {5: 8201.21523663559}, '4 best features': {5: 7772.148262116261}, '5 best features': {5: 5789.710881784466}, '6 best features': {5: 5282.8180788156415}}
def knn_train_test(train_cols, target_col, df):
np.random.seed(1)
# Randomize order of rows in data frame.
shuffled_index = np.random.permutation(df.index)
rand_df = df.reindex(shuffled_index)
# Divide number of rows in half and round.
last_train_row = int(len(rand_df) / 2)
# Select the first half and set as training set.
# Select the second half and set as test set.
train_df = rand_df.iloc[0:last_train_row]
test_df = rand_df.iloc[last_train_row:]
k_values = [i for i in range(1, 25)]
k_rmses = {}
for k in k_values:
# Fit model using k nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(train_df[train_cols], train_df[target_col])
# Make predictions using model.
predicted_labels = knn.predict(test_df[train_cols])
# Calculate and return RMSE.
mse = mean_squared_error(test_df[target_col], predicted_labels)
rmse = np.sqrt(mse)
k_rmses[k] = rmse
return k_rmses
k_rmse_results = {}
for nr_best_feats in range(2,6):
k_rmse_results['{} best features'.format(nr_best_feats)] = knn_train_test(
sorted_features[:nr_best_feats],
'price',
num_cars
)
k_rmse_results
{'2 best features': {1: 7824.371484580271, 2: 7923.5793046782055, 3: 7409.480215717343, 4: 7191.72558039019, 5: 7136.447543101927, 6: 7101.178357995046, 7: 7120.22843764463, 8: 7229.050268071916, 9: 7550.580281246418, 10: 7810.511086470667, 11: 7872.903806451249, 12: 7908.748330143412, 13: 7895.739497547708, 14: 7732.749613920275, 15: 7655.855679069836, 16: 7595.185600922556, 17: 7600.665298684744, 18: 7573.805136000839, 19: 7617.015965753199, 20: 7659.880760708181, 21: 7686.574591298704, 22: 7663.80855510408, 23: 7691.38160279881, 24: 7703.726920011501}, '3 best features': {1: 7630.501830038298, 2: 7594.311448135843, 3: 7597.887055304949, 4: 7854.866777141014, 5: 8201.21523663559, 6: 7928.465743678274, 7: 7635.300173215064, 8: 7639.530305987501, 9: 7776.463598317299, 10: 7721.387383581188, 11: 7758.564953660474, 12: 7811.027964817545, 13: 7810.3711518777245, 14: 7818.263133922312, 15: 7861.619243639901, 16: 7946.566688887372, 17: 8030.648320008287, 18: 8053.664872584696, 19: 7995.558107782373, 20: 7959.3144421122, 21: 7968.587797537081, 22: 8021.918427314005, 23: 8012.446625358399, 24: 8018.699678961325}, '4 best features': {1: 7103.944014028788, 2: 7671.840641057541, 3: 7470.419473276629, 4: 7758.290877728199, 5: 7772.148262116261, 6: 7710.962853417703, 7: 7666.462852991477, 8: 7537.720863726756, 9: 7608.251961267052, 10: 7693.168832229546, 11: 7715.6418050312805, 12: 7711.217071652267, 13: 7771.022024546921, 14: 7820.645843746911, 15: 7907.148201761521, 16: 7976.899590599832, 17: 8006.901644331514, 18: 8060.700076244394, 19: 8057.829795644624, 20: 8049.534061486397, 21: 8024.783612194227, 22: 7989.26833954141, 23: 7983.751819484669, 24: 7938.933565212882}, '5 best features': {1: 5156.866394420258, 2: 4982.051535758185, 3: 5386.38415366237, 4: 5981.469734523845, 5: 5789.710881784466, 6: 5744.280490369113, 7: 5788.903505790488, 8: 5991.557058413138, 9: 6103.75523981958, 10: 5996.512301701336, 11: 5800.065480324994, 12: 5905.169281286444, 13: 5919.139906404429, 14: 5971.188410847201, 15: 5898.433524011584, 16: 5946.871385883877, 17: 5970.36112332897, 18: 5956.169386693718, 19: 5926.263559815366, 20: 5951.929944489342, 21: 6033.483514401977, 22: 6042.8042884712795, 23: 6083.648354516337, 24: 6099.767711137054}}
for k,v in k_rmse_results.items():
x = list(v.keys())
y = list(v.values())
plt.plot(x,y, label="{}".format(k))
plt.xlabel('k value')
plt.ylabel('RMSE')
plt.legend()
<matplotlib.legend.Legend at 0x2d5815c2a48>