Predicting Car Prices with K-Nearest Neighbors¶

In this project i will use the K-Nearest Neighbors algorithm to predict a car's market price using its attributes. The dataset contains information on various cars. For each car we have information about the technical aspects of the vehicle such as the motor's displacement, the weight of the car, the miles per gallon, how fast the car accelerates and more. Information about the data can be found here and can be downloaded from here

In [1]:

import pandas as pd
pd.options.display.max_columns = 100
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [2]:

cars = pd.read_csv('imports-85.data', names=['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price'])
cars.head()

Out[2]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	length	width	height	curb-weight	engine-type	num-of-cylinders	engine-size	fuel-system	bore	stroke	compression-rate	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495
1	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
2	1	?	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	171.2	65.5	52.4	2823	ohcv	six	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	176.6	66.2	54.3	2337	ohc	four	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	176.6	66.4	54.3	2824	ohc	five	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450

Now let's select the columns with numeric and constant values

In [3]:

num_cars = cars[['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']]

Lets transform the ? character into NaN

In [4]:

num_cars = num_cars.replace('?', np.nan)
num_cars.head()

Out[4]:

	normalized-losses	wheel-base	length	width	height	curb-weight	engine-size	bore	stroke	compression-rate	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	NaN	88.6	168.8	64.1	48.8	2548	130	3.47	2.68	9.0	111	5000	21	27	13495
1	NaN	88.6	168.8	64.1	48.8	2548	130	3.47	2.68	9.0	111	5000	21	27	16500
2	NaN	94.5	171.2	65.5	52.4	2823	152	2.68	3.47	9.0	154	5000	19	26	16500
3	164	99.8	176.6	66.2	54.3	2337	109	3.19	3.40	10.0	102	5500	24	30	13950
4	164	99.4	176.6	66.4	54.3	2824	136	3.19	3.40	8.0	115	5500	18	22	17450

Some columns are recognized as strings, let's transform the entire dataset into float

In [5]:

num_cars = num_cars.astype('float')
num_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   normalized-losses  164 non-null    float64
 1   wheel-base         205 non-null    float64
 2   length             205 non-null    float64
 3   width              205 non-null    float64
 4   height             205 non-null    float64
 5   curb-weight        205 non-null    float64
 6   engine-size        205 non-null    float64
 7   bore               201 non-null    float64
 8   stroke             201 non-null    float64
 9   compression-rate   205 non-null    float64
 10  horsepower         203 non-null    float64
 11  peak-rpm           203 non-null    float64
 12  city-mpg           205 non-null    float64
 13  highway-mpg        205 non-null    float64
 14  price              201 non-null    float64
dtypes: float64(15)
memory usage: 24.1 KB

I'm going to drop those 4 rows with NaN values in the price column

In [6]:

num_cars.dropna(subset=['price'], inplace=True)
num_cars.isnull().sum()

Out[6]:

normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-size           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64

After dropping the NaN values in the price column we find NaN values in the columns num_doors, bore, stroke, horsepower and peak_rpm

The 'normalized-losses' is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door, small, station wagons, sports/specialty, etc…), and represents the average loss per car per year. So it's not ok to drop the entirely column.

There are 37 rows with NaN values in the normalized_losses and 205 rows in total, that means that the NaN valures in this column represent ~20% of the dataset. So dropping them is not a good idea becase we would loss a considerable amount of data. According to the documentation this column goes from 65 to 256, so it's ok to fill the values with the mean value of the column, the same will be applied in the num_doors, bore, stroke, horsepower and peak_rpm columns

In [7]:

num_cars = num_cars.fillna(num_cars.mean())
num_cars.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   normalized-losses  201 non-null    float64
 1   wheel-base         201 non-null    float64
 2   length             201 non-null    float64
 3   width              201 non-null    float64
 4   height             201 non-null    float64
 5   curb-weight        201 non-null    float64
 6   engine-size        201 non-null    float64
 7   bore               201 non-null    float64
 8   stroke             201 non-null    float64
 9   compression-rate   201 non-null    float64
 10  horsepower         201 non-null    float64
 11  peak-rpm           201 non-null    float64
 12  city-mpg           201 non-null    float64
 13  highway-mpg        201 non-null    float64
 14  price              201 non-null    float64
dtypes: float64(15)
memory usage: 25.1 KB

Now let's see how the dataframe looks

In [8]:

num_cars.head()

Out[8]:

	normalized-losses	wheel-base	length	width	height	curb-weight	engine-size	bore	stroke	compression-rate	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	122.0	88.6	168.8	64.1	48.8	2548.0	130.0	3.47	2.68	9.0	111.0	5000.0	21.0	27.0	13495.0
1	122.0	88.6	168.8	64.1	48.8	2548.0	130.0	3.47	2.68	9.0	111.0	5000.0	21.0	27.0	16500.0
2	122.0	94.5	171.2	65.5	52.4	2823.0	152.0	2.68	3.47	9.0	154.0	5000.0	19.0	26.0	16500.0
3	164.0	99.8	176.6	66.2	54.3	2337.0	109.0	3.19	3.40	10.0	102.0	5500.0	24.0	30.0	13950.0
4	164.0	99.4	176.6	66.4	54.3	2824.0	136.0	3.19	3.40	8.0	115.0	5500.0	18.0	22.0	17450.0

After cleaning and transforming the data, let's normalize the dataframe except the price column

In [9]:

price_col = num_cars['price']
num_cars = (num_cars - num_cars.min()) / (num_cars.max() - num_cars.min())
num_cars['price'] = price_col
num_cars.head()

Out[9]:

	normalized-losses	wheel-base	length	width	height	curb-weight	engine-size	bore	stroke	compression-rate	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	0.298429	0.058309	0.413433	0.324786	0.083333	0.411171	0.260377	0.664286	0.290476	0.1250	0.294393	0.346939	0.222222	0.289474	13495.0
1	0.298429	0.058309	0.413433	0.324786	0.083333	0.411171	0.260377	0.664286	0.290476	0.1250	0.294393	0.346939	0.222222	0.289474	16500.0
2	0.298429	0.230321	0.449254	0.444444	0.383333	0.517843	0.343396	0.100000	0.666667	0.1250	0.495327	0.346939	0.166667	0.263158	16500.0
3	0.518325	0.384840	0.529851	0.504274	0.541667	0.329325	0.181132	0.464286	0.633333	0.1875	0.252336	0.551020	0.305556	0.368421	13950.0
4	0.518325	0.373178	0.529851	0.521368	0.541667	0.518231	0.283019	0.464286	0.633333	0.0625	0.313084	0.551020	0.138889	0.157895	17450.0

Univariate model¶

Let's start with a simple univariate model

In [10]:

def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    
    #Divide the dataset into training and test set
    np.random.seed(1)
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    last_train_row = int((len(rand_df) / 2))
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    #Fit, train and predict knn
    knn.fit(train_df[[train_col]], train_df[target_col])
    prediction = knn.predict(test_df[[target_col]])
    
    #Calculate MSE, RMSE and R2
    mse = mean_squared_error(test_df[target_col], prediction)
    rmse = np.sqrt(mse)
    
    return rmse

In [11]:

train_col = num_cars.columns.drop('price')
rmse_results = {}


for col in train_col:
    rmse = knn_train_test(col, 'price', num_cars)
    rmse_results[col] = rmse
    
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()

Out[11]:

compression-rate      8211.362084
height                8563.716277
normalized-losses     9049.200281
peak-rpm              9274.562418
highway-mpg          10306.945989
city-mpg             10790.811235
stroke               11392.931393
bore                 11648.715892
wheel-base           12113.028618
length               15280.245008
horsepower           19499.466023
width                21957.031122
engine-size          22558.988688
curb-weight          22715.744609
dtype: float64

In [12]:

def knn_train_test(train_col, target_col, df): 
    #Divide the dataset into training and test set
    np.random.seed(1)
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    last_train_row = int((len(rand_df) / 2))
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    #Fit, train and predict knn
    k = [1, 3, 5, 7, 9]
    k_rmse = {}
    
    for val in k:
        knn = KNeighborsRegressor(n_neighbors=val)
        knn.fit(train_df[[train_col]], train_df[target_col])
        prediction = knn.predict(test_df[[target_col]])
        
        #Calculate RMSE
        mse = mean_squared_error(test_df[target_col], prediction)
        rmse = np.sqrt(mse)
        
        k_rmse[val] = rmse
    
    return k_rmse

In [13]:

rmse_results = {}
train_cols = num_cars.columns.drop('price')
for col in train_cols:
    rmse_val = knn_train_test(col, 'price', num_cars)
    rmse_results[col] = rmse_val
    
rmse_results

Out[13]:

{'normalized-losses': {1: 8705.821566544702,
  3: 8324.478981637923,
  5: 9049.200281409916,
  7: 9687.642855117061,
  9: 8725.16770421584},
 'wheel-base': {1: 20248.36501992687,
  3: 10858.639280305888,
  5: 12113.028618265982,
  7: 14959.821637815054,
  9: 14289.269224809848},
 'length': {1: 20248.36501992687,
  3: 21661.037439741816,
  5: 15280.24500769047,
  7: 14959.821637815054,
  9: 15344.326750559727},
 'width': {1: 20248.36501992687,
  3: 22944.445289083855,
  5: 21957.031122365224,
  7: 20350.698362284606,
  9: 18404.053154200246},
 'height': {1: 9134.452438785569,
  3: 9293.973051882005,
  5: 8563.716277133866,
  7: 8356.05606856574,
  9: 8179.371987342417},
 'curb-weight': {1: 20845.13315091098,
  3: 21661.037439741816,
  5: 22715.744608774967,
  7: 19229.739888900913,
  9: 16283.869938367277},
 'engine-size': {1: 23917.722163804614,
  3: 22732.673900117097,
  5: 22558.988688413323,
  7: 22340.337129615156,
  9: 21519.849668067345},
 'bore': {1: 12104.886855405886,
  3: 11026.30721555867,
  5: 11648.715892277994,
  7: 12180.980480625842,
  9: 12281.207697921911},
 'stroke': {1: 20845.13315091098,
  3: 15201.911122002919,
  5: 11392.931393466635,
  7: 13286.897711190293,
  9: 11307.596666201045},
 'compression-rate': {1: 8180.095347310391,
  3: 8319.86256550617,
  5: 8211.362084186992,
  7: 9093.760019973523,
  9: 8768.46582925998},
 'horsepower': {1: 22492.603018206974,
  3: 17677.62400907318,
  5: 19499.466022899647,
  7: 20313.129758313873,
  9: 17468.68421053942},
 'peak-rpm': {1: 8871.878937473624,
  3: 9593.778472621629,
  5: 9274.562418201138,
  7: 8148.230785298795,
  9: 8282.472226728549},
 'city-mpg': {1: 10470.005103216814,
  3: 10732.921925012071,
  5: 10790.811234812078,
  7: 10440.52270608314,
  9: 10424.19913295505},
 'highway-mpg': {1: 10470.005103216814,
  3: 10473.987688297588,
  5: 10306.945989059326,
  7: 10440.52270608314,
  9: 10444.279273581038}}

In [14]:

for k,v in rmse_results.items():
    x = list(v.keys())
    y = list(v.values())
    plt.plot(x,y)
    plt.xlabel('k value')
    plt.ylabel('RMSE')

Multivariate model¶

In [15]:

feature_avg_rmse = {}
for k,v in rmse_results.items():
    avg_rmse = np.mean(list(v.values()))
    feature_avg_rmse[k] = avg_rmse
series_avg_rmse = pd.Series(feature_avg_rmse)
sorted_series_avg_rmse = series_avg_rmse.sort_values()
print(sorted_series_avg_rmse)

sorted_features = sorted_series_avg_rmse.index

compression-rate      8514.709169
height                8705.513965
peak-rpm              8834.184568
normalized-losses     8898.462278
highway-mpg          10427.148152
city-mpg             10571.692020
bore                 11848.419628
stroke               14406.894009
wheel-base           14493.824756
length               17498.759171
horsepower           19490.301404
curb-weight          20147.105005
width                20780.918590
engine-size          22613.914310
dtype: float64

In [16]:

def knn_train_test(train_cols, target_col, df):
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    k_values = [5]
    k_rmses = {}
    
    for k in k_values:
        # Fit model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[train_cols], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_cols])

        # Calculate and return RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        
        k_rmses[k] = rmse
    return k_rmses

k_rmse_results = {}

for nr_best_feats in range(2,7):
    k_rmse_results['{} best features'.format(nr_best_feats)] = knn_train_test(
        sorted_features[:nr_best_feats],
        'price',
        num_cars
    )

k_rmse_results

Out[16]:

{'2 best features': {5: 7136.447543101927},
 '3 best features': {5: 8201.21523663559},
 '4 best features': {5: 7772.148262116261},
 '5 best features': {5: 5789.710881784466},
 '6 best features': {5: 5282.8180788156415}}

Hyperparameter tuning¶

In [17]:

def knn_train_test(train_cols, target_col, df):
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    k_values = [i for i in range(1, 25)]
    k_rmses = {}
    
    for k in k_values:
        # Fit model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[train_cols], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_cols])

        # Calculate and return RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        
        k_rmses[k] = rmse
    return k_rmses

k_rmse_results = {}

for nr_best_feats in range(2,6):
    k_rmse_results['{} best features'.format(nr_best_feats)] = knn_train_test(
        sorted_features[:nr_best_feats],
        'price',
        num_cars
    )

k_rmse_results

Out[17]:

{'2 best features': {1: 7824.371484580271,
  2: 7923.5793046782055,
  3: 7409.480215717343,
  4: 7191.72558039019,
  5: 7136.447543101927,
  6: 7101.178357995046,
  7: 7120.22843764463,
  8: 7229.050268071916,
  9: 7550.580281246418,
  10: 7810.511086470667,
  11: 7872.903806451249,
  12: 7908.748330143412,
  13: 7895.739497547708,
  14: 7732.749613920275,
  15: 7655.855679069836,
  16: 7595.185600922556,
  17: 7600.665298684744,
  18: 7573.805136000839,
  19: 7617.015965753199,
  20: 7659.880760708181,
  21: 7686.574591298704,
  22: 7663.80855510408,
  23: 7691.38160279881,
  24: 7703.726920011501},
 '3 best features': {1: 7630.501830038298,
  2: 7594.311448135843,
  3: 7597.887055304949,
  4: 7854.866777141014,
  5: 8201.21523663559,
  6: 7928.465743678274,
  7: 7635.300173215064,
  8: 7639.530305987501,
  9: 7776.463598317299,
  10: 7721.387383581188,
  11: 7758.564953660474,
  12: 7811.027964817545,
  13: 7810.3711518777245,
  14: 7818.263133922312,
  15: 7861.619243639901,
  16: 7946.566688887372,
  17: 8030.648320008287,
  18: 8053.664872584696,
  19: 7995.558107782373,
  20: 7959.3144421122,
  21: 7968.587797537081,
  22: 8021.918427314005,
  23: 8012.446625358399,
  24: 8018.699678961325},
 '4 best features': {1: 7103.944014028788,
  2: 7671.840641057541,
  3: 7470.419473276629,
  4: 7758.290877728199,
  5: 7772.148262116261,
  6: 7710.962853417703,
  7: 7666.462852991477,
  8: 7537.720863726756,
  9: 7608.251961267052,
  10: 7693.168832229546,
  11: 7715.6418050312805,
  12: 7711.217071652267,
  13: 7771.022024546921,
  14: 7820.645843746911,
  15: 7907.148201761521,
  16: 7976.899590599832,
  17: 8006.901644331514,
  18: 8060.700076244394,
  19: 8057.829795644624,
  20: 8049.534061486397,
  21: 8024.783612194227,
  22: 7989.26833954141,
  23: 7983.751819484669,
  24: 7938.933565212882},
 '5 best features': {1: 5156.866394420258,
  2: 4982.051535758185,
  3: 5386.38415366237,
  4: 5981.469734523845,
  5: 5789.710881784466,
  6: 5744.280490369113,
  7: 5788.903505790488,
  8: 5991.557058413138,
  9: 6103.75523981958,
  10: 5996.512301701336,
  11: 5800.065480324994,
  12: 5905.169281286444,
  13: 5919.139906404429,
  14: 5971.188410847201,
  15: 5898.433524011584,
  16: 5946.871385883877,
  17: 5970.36112332897,
  18: 5956.169386693718,
  19: 5926.263559815366,
  20: 5951.929944489342,
  21: 6033.483514401977,
  22: 6042.8042884712795,
  23: 6083.648354516337,
  24: 6099.767711137054}}

In [18]:

for k,v in k_rmse_results.items():
    x = list(v.keys())
    y = list(v.values())  
    plt.plot(x,y, label="{}".format(k))
    
plt.xlabel('k value')
plt.ylabel('RMSE')
plt.legend()

Out[18]:

<matplotlib.legend.Legend at 0x2d5815c2a48>

In [ ]: