Predicting Car Prices with K-Nearest Neighbors

In this project i will use the K-Nearest Neighbors algorithm to predict a car's market price using its attributes. The dataset contains information on various cars. For each car we have information about the technical aspects of the vehicle such as the motor's displacement, the weight of the car, the miles per gallon, how fast the car accelerates and more. Information about the data can be found here and can be downloaded from here

In [1]:
import pandas as pd
pd.options.display.max_columns = 100
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
In [2]:
cars = pd.read_csv('imports-85.data', names=['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price'])
cars.head()
Out[2]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

Now let's select the columns with numeric and constant values

In [3]:
num_cars = cars[['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']]

Lets transform the ? character into NaN

In [4]:
num_cars = num_cars.replace('?', np.nan)
num_cars.head()
Out[4]:
normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 NaN 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111 5000 21 27 13495
1 NaN 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111 5000 21 27 16500
2 NaN 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154 5000 19 26 16500
3 164 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10.0 102 5500 24 30 13950
4 164 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8.0 115 5500 18 22 17450

Some columns are recognized as strings, let's transform the entire dataset into float

In [5]:
num_cars = num_cars.astype('float')
num_cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   normalized-losses  164 non-null    float64
 1   wheel-base         205 non-null    float64
 2   length             205 non-null    float64
 3   width              205 non-null    float64
 4   height             205 non-null    float64
 5   curb-weight        205 non-null    float64
 6   engine-size        205 non-null    float64
 7   bore               201 non-null    float64
 8   stroke             201 non-null    float64
 9   compression-rate   205 non-null    float64
 10  horsepower         203 non-null    float64
 11  peak-rpm           203 non-null    float64
 12  city-mpg           205 non-null    float64
 13  highway-mpg        205 non-null    float64
 14  price              201 non-null    float64
dtypes: float64(15)
memory usage: 24.1 KB

I'm going to drop those 4 rows with NaN values in the price column

In [6]:
num_cars.dropna(subset=['price'], inplace=True)
num_cars.isnull().sum()
Out[6]:
normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-size           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64

After dropping the NaN values in the price column we find NaN values in the columns num_doors, bore, stroke, horsepower and peak_rpm

The 'normalized-losses' is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door, small, station wagons, sports/specialty, etc…), and represents the average loss per car per year. So it's not ok to drop the entirely column.

There are 37 rows with NaN values in the normalized_losses and 205 rows in total, that means that the NaN valures in this column represent ~20% of the dataset. So dropping them is not a good idea becase we would loss a considerable amount of data. According to the documentation this column goes from 65 to 256, so it's ok to fill the values with the mean value of the column, the same will be applied in the num_doors, bore, stroke, horsepower and peak_rpm columns

In [7]:
num_cars = num_cars.fillna(num_cars.mean())
num_cars.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   normalized-losses  201 non-null    float64
 1   wheel-base         201 non-null    float64
 2   length             201 non-null    float64
 3   width              201 non-null    float64
 4   height             201 non-null    float64
 5   curb-weight        201 non-null    float64
 6   engine-size        201 non-null    float64
 7   bore               201 non-null    float64
 8   stroke             201 non-null    float64
 9   compression-rate   201 non-null    float64
 10  horsepower         201 non-null    float64
 11  peak-rpm           201 non-null    float64
 12  city-mpg           201 non-null    float64
 13  highway-mpg        201 non-null    float64
 14  price              201 non-null    float64
dtypes: float64(15)
memory usage: 25.1 KB

Now let's see how the dataframe looks

In [8]:
num_cars.head()
Out[8]:
normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 122.0 88.6 168.8 64.1 48.8 2548.0 130.0 3.47 2.68 9.0 111.0 5000.0 21.0 27.0 13495.0
1 122.0 88.6 168.8 64.1 48.8 2548.0 130.0 3.47 2.68 9.0 111.0 5000.0 21.0 27.0 16500.0
2 122.0 94.5 171.2 65.5 52.4 2823.0 152.0 2.68 3.47 9.0 154.0 5000.0 19.0 26.0 16500.0
3 164.0 99.8 176.6 66.2 54.3 2337.0 109.0 3.19 3.40 10.0 102.0 5500.0 24.0 30.0 13950.0
4 164.0 99.4 176.6 66.4 54.3 2824.0 136.0 3.19 3.40 8.0 115.0 5500.0 18.0 22.0 17450.0

After cleaning and transforming the data, let's normalize the dataframe except the price column

In [9]:
price_col = num_cars['price']
num_cars = (num_cars - num_cars.min()) / (num_cars.max() - num_cars.min())
num_cars['price'] = price_col
num_cars.head()
Out[9]:
normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 0.298429 0.058309 0.413433 0.324786 0.083333 0.411171 0.260377 0.664286 0.290476 0.1250 0.294393 0.346939 0.222222 0.289474 13495.0
1 0.298429 0.058309 0.413433 0.324786 0.083333 0.411171 0.260377 0.664286 0.290476 0.1250 0.294393 0.346939 0.222222 0.289474 16500.0
2 0.298429 0.230321 0.449254 0.444444 0.383333 0.517843 0.343396 0.100000 0.666667 0.1250 0.495327 0.346939 0.166667 0.263158 16500.0
3 0.518325 0.384840 0.529851 0.504274 0.541667 0.329325 0.181132 0.464286 0.633333 0.1875 0.252336 0.551020 0.305556 0.368421 13950.0
4 0.518325 0.373178 0.529851 0.521368 0.541667 0.518231 0.283019 0.464286 0.633333 0.0625 0.313084 0.551020 0.138889 0.157895 17450.0

Univariate model

Let's start with a simple univariate model

In [10]:
def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    
    #Divide the dataset into training and test set
    np.random.seed(1)
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    last_train_row = int((len(rand_df) / 2))
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    #Fit, train and predict knn
    knn.fit(train_df[[train_col]], train_df[target_col])
    prediction = knn.predict(test_df[[target_col]])
    
    #Calculate MSE, RMSE and R2
    mse = mean_squared_error(test_df[target_col], prediction)
    rmse = np.sqrt(mse)
    
    return rmse
In [11]:
train_col = num_cars.columns.drop('price')
rmse_results = {}


for col in train_col:
    rmse = knn_train_test(col, 'price', num_cars)
    rmse_results[col] = rmse
    
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
Out[11]:
compression-rate      8211.362084
height                8563.716277
normalized-losses     9049.200281
peak-rpm              9274.562418
highway-mpg          10306.945989
city-mpg             10790.811235
stroke               11392.931393
bore                 11648.715892
wheel-base           12113.028618
length               15280.245008
horsepower           19499.466023
width                21957.031122
engine-size          22558.988688
curb-weight          22715.744609
dtype: float64
In [12]:
def knn_train_test(train_col, target_col, df): 
    #Divide the dataset into training and test set
    np.random.seed(1)
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    last_train_row = int((len(rand_df) / 2))
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    #Fit, train and predict knn
    k = [1, 3, 5, 7, 9]
    k_rmse = {}
    
    for val in k:
        knn = KNeighborsRegressor(n_neighbors=val)
        knn.fit(train_df[[train_col]], train_df[target_col])
        prediction = knn.predict(test_df[[target_col]])
        
        #Calculate RMSE
        mse = mean_squared_error(test_df[target_col], prediction)
        rmse = np.sqrt(mse)
        
        k_rmse[val] = rmse
    
    return k_rmse
In [13]:
rmse_results = {}
train_cols = num_cars.columns.drop('price')
for col in train_cols:
    rmse_val = knn_train_test(col, 'price', num_cars)
    rmse_results[col] = rmse_val
    
rmse_results
Out[13]:
{'normalized-losses': {1: 8705.821566544702,
  3: 8324.478981637923,
  5: 9049.200281409916,
  7: 9687.642855117061,
  9: 8725.16770421584},
 'wheel-base': {1: 20248.36501992687,
  3: 10858.639280305888,
  5: 12113.028618265982,
  7: 14959.821637815054,
  9: 14289.269224809848},
 'length': {1: 20248.36501992687,
  3: 21661.037439741816,
  5: 15280.24500769047,
  7: 14959.821637815054,
  9: 15344.326750559727},
 'width': {1: 20248.36501992687,
  3: 22944.445289083855,
  5: 21957.031122365224,
  7: 20350.698362284606,
  9: 18404.053154200246},
 'height': {1: 9134.452438785569,
  3: 9293.973051882005,
  5: 8563.716277133866,
  7: 8356.05606856574,
  9: 8179.371987342417},
 'curb-weight': {1: 20845.13315091098,
  3: 21661.037439741816,
  5: 22715.744608774967,
  7: 19229.739888900913,
  9: 16283.869938367277},
 'engine-size': {1: 23917.722163804614,
  3: 22732.673900117097,
  5: 22558.988688413323,
  7: 22340.337129615156,
  9: 21519.849668067345},
 'bore': {1: 12104.886855405886,
  3: 11026.30721555867,
  5: 11648.715892277994,
  7: 12180.980480625842,
  9: 12281.207697921911},
 'stroke': {1: 20845.13315091098,
  3: 15201.911122002919,
  5: 11392.931393466635,
  7: 13286.897711190293,
  9: 11307.596666201045},
 'compression-rate': {1: 8180.095347310391,
  3: 8319.86256550617,
  5: 8211.362084186992,
  7: 9093.760019973523,
  9: 8768.46582925998},
 'horsepower': {1: 22492.603018206974,
  3: 17677.62400907318,
  5: 19499.466022899647,
  7: 20313.129758313873,
  9: 17468.68421053942},
 'peak-rpm': {1: 8871.878937473624,
  3: 9593.778472621629,
  5: 9274.562418201138,
  7: 8148.230785298795,
  9: 8282.472226728549},
 'city-mpg': {1: 10470.005103216814,
  3: 10732.921925012071,
  5: 10790.811234812078,
  7: 10440.52270608314,
  9: 10424.19913295505},
 'highway-mpg': {1: 10470.005103216814,
  3: 10473.987688297588,
  5: 10306.945989059326,
  7: 10440.52270608314,
  9: 10444.279273581038}}
In [14]:
for k,v in rmse_results.items():
    x = list(v.keys())
    y = list(v.values())
    plt.plot(x,y)
    plt.xlabel('k value')
    plt.ylabel('RMSE')

Multivariate model

In [15]:
feature_avg_rmse = {}
for k,v in rmse_results.items():
    avg_rmse = np.mean(list(v.values()))
    feature_avg_rmse[k] = avg_rmse
series_avg_rmse = pd.Series(feature_avg_rmse)
sorted_series_avg_rmse = series_avg_rmse.sort_values()
print(sorted_series_avg_rmse)

sorted_features = sorted_series_avg_rmse.index
compression-rate      8514.709169
height                8705.513965
peak-rpm              8834.184568
normalized-losses     8898.462278
highway-mpg          10427.148152
city-mpg             10571.692020
bore                 11848.419628
stroke               14406.894009
wheel-base           14493.824756
length               17498.759171
horsepower           19490.301404
curb-weight          20147.105005
width                20780.918590
engine-size          22613.914310
dtype: float64
In [16]:
def knn_train_test(train_cols, target_col, df):
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    k_values = [5]
    k_rmses = {}
    
    for k in k_values:
        # Fit model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[train_cols], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_cols])

        # Calculate and return RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        
        k_rmses[k] = rmse
    return k_rmses

k_rmse_results = {}

for nr_best_feats in range(2,7):
    k_rmse_results['{} best features'.format(nr_best_feats)] = knn_train_test(
        sorted_features[:nr_best_feats],
        'price',
        num_cars
    )

k_rmse_results
Out[16]:
{'2 best features': {5: 7136.447543101927},
 '3 best features': {5: 8201.21523663559},
 '4 best features': {5: 7772.148262116261},
 '5 best features': {5: 5789.710881784466},
 '6 best features': {5: 5282.8180788156415}}

Hyperparameter tuning

In [17]:
def knn_train_test(train_cols, target_col, df):
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    k_values = [i for i in range(1, 25)]
    k_rmses = {}
    
    for k in k_values:
        # Fit model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[train_cols], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_cols])

        # Calculate and return RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        
        k_rmses[k] = rmse
    return k_rmses

k_rmse_results = {}

for nr_best_feats in range(2,6):
    k_rmse_results['{} best features'.format(nr_best_feats)] = knn_train_test(
        sorted_features[:nr_best_feats],
        'price',
        num_cars
    )

k_rmse_results
Out[17]:
{'2 best features': {1: 7824.371484580271,
  2: 7923.5793046782055,
  3: 7409.480215717343,
  4: 7191.72558039019,
  5: 7136.447543101927,
  6: 7101.178357995046,
  7: 7120.22843764463,
  8: 7229.050268071916,
  9: 7550.580281246418,
  10: 7810.511086470667,
  11: 7872.903806451249,
  12: 7908.748330143412,
  13: 7895.739497547708,
  14: 7732.749613920275,
  15: 7655.855679069836,
  16: 7595.185600922556,
  17: 7600.665298684744,
  18: 7573.805136000839,
  19: 7617.015965753199,
  20: 7659.880760708181,
  21: 7686.574591298704,
  22: 7663.80855510408,
  23: 7691.38160279881,
  24: 7703.726920011501},
 '3 best features': {1: 7630.501830038298,
  2: 7594.311448135843,
  3: 7597.887055304949,
  4: 7854.866777141014,
  5: 8201.21523663559,
  6: 7928.465743678274,
  7: 7635.300173215064,
  8: 7639.530305987501,
  9: 7776.463598317299,
  10: 7721.387383581188,
  11: 7758.564953660474,
  12: 7811.027964817545,
  13: 7810.3711518777245,
  14: 7818.263133922312,
  15: 7861.619243639901,
  16: 7946.566688887372,
  17: 8030.648320008287,
  18: 8053.664872584696,
  19: 7995.558107782373,
  20: 7959.3144421122,
  21: 7968.587797537081,
  22: 8021.918427314005,
  23: 8012.446625358399,
  24: 8018.699678961325},
 '4 best features': {1: 7103.944014028788,
  2: 7671.840641057541,
  3: 7470.419473276629,
  4: 7758.290877728199,
  5: 7772.148262116261,
  6: 7710.962853417703,
  7: 7666.462852991477,
  8: 7537.720863726756,
  9: 7608.251961267052,
  10: 7693.168832229546,
  11: 7715.6418050312805,
  12: 7711.217071652267,
  13: 7771.022024546921,
  14: 7820.645843746911,
  15: 7907.148201761521,
  16: 7976.899590599832,
  17: 8006.901644331514,
  18: 8060.700076244394,
  19: 8057.829795644624,
  20: 8049.534061486397,
  21: 8024.783612194227,
  22: 7989.26833954141,
  23: 7983.751819484669,
  24: 7938.933565212882},
 '5 best features': {1: 5156.866394420258,
  2: 4982.051535758185,
  3: 5386.38415366237,
  4: 5981.469734523845,
  5: 5789.710881784466,
  6: 5744.280490369113,
  7: 5788.903505790488,
  8: 5991.557058413138,
  9: 6103.75523981958,
  10: 5996.512301701336,
  11: 5800.065480324994,
  12: 5905.169281286444,
  13: 5919.139906404429,
  14: 5971.188410847201,
  15: 5898.433524011584,
  16: 5946.871385883877,
  17: 5970.36112332897,
  18: 5956.169386693718,
  19: 5926.263559815366,
  20: 5951.929944489342,
  21: 6033.483514401977,
  22: 6042.8042884712795,
  23: 6083.648354516337,
  24: 6099.767711137054}}
In [18]:
for k,v in k_rmse_results.items():
    x = list(v.keys())
    y = list(v.values())  
    plt.plot(x,y, label="{}".format(k))
    
plt.xlabel('k value')
plt.ylabel('RMSE')
plt.legend()
Out[18]:
<matplotlib.legend.Legend at 0x2d5815c2a48>
In [ ]: