Guided Project: Predicting Car Prices

In this guided project, we will predict a car's market price using its attributes. The data set we will be working with contains information on various cars. For each car we have information about the technical aspects of the vehicle such as the motor's displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more. You can read more about the data set here and can download it directly from here.

Let's start by reading in the dataset.

In [364]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use('fivethirtyeight')
In [365]:
cars = pd.read_csv('imports-85.data', names = ['symboling', 'normalized_losses','make', 'fuel_type', 'aspiration',
                                              'num_doors', 'body_style', 'drive_wheels', 'engine_location',
                                              'wheel_base', 'length', 'width', 'height', 'curb_weight',
                                              'engine_type', 'num_cylinders', 'engine_size', 'fuel_system',
                                              'bore', 'stroke', 'compression_ratio', 'horsepower',
                                              'peak_rpm', 'city_mpg', 'highway_mpg', 'price'])
In [366]:
cars.head()
Out[366]:
symboling normalized_losses make fuel_type aspiration num_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

In [367]:
cars.shape
Out[367]:
(205, 26)
In [368]:
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized_losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel_type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num_doors          205 non-null    object 
 6   body_style         205 non-null    object 
 7   drive_wheels       205 non-null    object 
 8   engine_location    205 non-null    object 
 9   wheel_base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb_weight        205 non-null    int64  
 14  engine_type        205 non-null    object 
 15  num_cylinders      205 non-null    object 
 16  engine_size        205 non-null    int64  
 17  fuel_system        205 non-null    object 
 18  bore               205 non-null    object 
 19  stroke             205 non-null    object 
 20  compression_ratio  205 non-null    float64
 21  horsepower         205 non-null    object 
 22  peak_rpm           205 non-null    object 
 23  city_mpg           205 non-null    int64  
 24  highway_mpg        205 non-null    int64  
 25  price              205 non-null    object 
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB

Since we would like to predict car prices, the target column is price. Currently the price column is not in numeric format. We need to clean it up.

Data cleaning

We usually can't have any missing values if we want to use them for predictive modeling. Based on the data set preview from the last step, we can tell that the normalized_losses column contains missing values represented using "?". Let's replace these values and look for the presence of missing values in other numeric columns. Let's also rescale the values in the numeric columns so they all range from 0 to 1.

In [369]:
cars = cars.replace('?', np.NaN)

Because ? is a string value, columns containing this value were cast to the pandas object data type (instead of a numeric type like int or float). Let's determine which columns should be converted back to numeric values and convert them.

In [370]:
numeric_cols = ['normalized_losses', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
       'highway_mpg', 'price']

cars[numeric_cols] = cars[numeric_cols].astype('float')

Let's see how many rows in the normalized_losses column have missing values.

In [371]:
cars['normalized_losses'].isnull().sum()
Out[371]:
41

There are a few ways we could handle columns with missing values:

  • Replace the missing values using the average values from that column.
  • Drop the rows entirely (especially if other columns in those rows have missing values).
  • Drop the column entirely.

When it comes to the normalized_losses column, dropping the rows with missing values would result in too much data loss - 20% of the rows.

Let's see how many rows with missing data there are in the rest of the numeric columns.

In [372]:
cars[numeric_cols].isnull().sum()
Out[372]:
normalized_losses    41
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_size           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64

The price column is our target column. Let's drop the four rows with missing values in that column.

In [373]:
cars = cars.dropna(subset=['price'])
cars[numeric_cols].isnull().sum()
Out[373]:
normalized_losses    37
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_size           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 0
dtype: int64

Let's fill in the missing values of the remaining columns with the mean value for each column.

In [374]:
for col in numeric_cols:
    cars[col] = cars[col].fillna(cars[col].mean())
In [412]:
cars.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          201 non-null    int64  
 1   normalized_losses  201 non-null    float64
 2   make               201 non-null    object 
 3   fuel_type          201 non-null    object 
 4   aspiration         201 non-null    object 
 5   num_doors          199 non-null    object 
 6   body_style         201 non-null    object 
 7   drive_wheels       201 non-null    object 
 8   engine_location    201 non-null    object 
 9   wheel_base         201 non-null    float64
 10  length             201 non-null    float64
 11  width              201 non-null    float64
 12  height             201 non-null    float64
 13  curb_weight        201 non-null    float64
 14  engine_type        201 non-null    object 
 15  num_cylinders      201 non-null    object 
 16  engine_size        201 non-null    float64
 17  fuel_system        201 non-null    object 
 18  bore               201 non-null    float64
 19  stroke             201 non-null    float64
 20  compression_ratio  201 non-null    float64
 21  horsepower         201 non-null    float64
 22  peak_rpm           201 non-null    float64
 23  city_mpg           201 non-null    float64
 24  highway_mpg        201 non-null    float64
 25  price              201 non-null    float64
dtypes: float64(15), int64(1), object(10)
memory usage: 52.4+ KB

Now, let's normalize the numeric columns to range from 0 to 1, excluding the 'price' column.

In [376]:
for col in numeric_cols[:-1]:
    cars[col] = (cars[col]-cars[col].min())/(cars[col].max()-cars[col].min())
In [377]:
cars[numeric_cols].describe().T
Out[377]:
count mean std min 25% 50% 75% max
normalized_losses 201.0 0.298429 0.167520 0.0 0.188482 0.298429 0.376963 1.0
wheel_base 201.0 0.355598 0.176862 0.0 0.230321 0.303207 0.460641 1.0
length 201.0 0.494045 0.183913 0.0 0.383582 0.479104 0.632836 1.0
width 201.0 0.477697 0.179613 0.0 0.324786 0.444444 0.538462 1.0
height 201.0 0.497222 0.203985 0.0 0.350000 0.525000 0.641667 1.0
curb_weight 201.0 0.414145 0.200658 0.0 0.264158 0.359193 0.557797 1.0
engine_size 201.0 0.248587 0.156781 0.0 0.139623 0.222642 0.301887 1.0
bore 201.0 0.564793 0.191480 0.0 0.435714 0.550000 0.742857 1.0
stroke 201.0 0.565192 0.150499 0.0 0.495238 0.580952 0.638095 1.0
compression_ratio 201.0 0.197767 0.250310 0.0 0.100000 0.125000 0.150000 1.0
horsepower 201.0 0.258864 0.174606 0.0 0.102804 0.219626 0.317757 1.0
peak_rpm 201.0 0.394934 0.195148 0.0 0.265306 0.394934 0.551020 1.0
city_mpg 201.0 0.338308 0.178423 0.0 0.166667 0.305556 0.472222 1.0
highway_mpg 201.0 0.386489 0.179346 0.0 0.236842 0.368421 0.473684 1.0
price 201.0 13207.129353 7947.066342 5118.0 7775.000000 10295.000000 16500.000000 45400.0

Univariate Model

Let's start with some univariate k-nearest neighbors models. Starting with simple models before moving to more complex models helps us structure our code workflow and understand the features better.

In [378]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
In [379]:
def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(1)
    
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    #Split the dataset in two (75:25):
    train_set = rand_df[:int(len(rand_df)*0.75)]
    test_set = rand_df[int(len(rand_df)*0.75):]
    
    knn.fit(train_set[[train_col]], train_set[target_col])
    predictions = knn.predict(test_set[[train_col]])
    
    #Calculate the RMSE
    rmse = mean_squared_error(test_set[target_col], predictions, squared=False)
    return rmse
In [380]:
rmses = {}

for col in numeric_cols[:-1]:
    rmses[col] = knn_train_test(col, 'price', cars)

rmses_series = pd.Series(rmses)
rmses_series = rmses_series.sort_values()
rmses_series
Out[380]:
engine_size          3051.434222
city_mpg             3684.803554
width                3917.227670
curb_weight          4011.450036
wheel_base           4161.947972
highway_mpg          4323.502530
horsepower           4756.983755
length               5416.294064
compression_ratio    5958.572328
normalized_losses    6231.311124
peak_rpm             6326.471744
bore                 6507.421953
height               6666.667678
stroke               6939.691440
dtype: float64

Now let's modify the knn_train_test() function to accept a parameter for the k value. For each numeric column, we will create, train, and test a univariate model using the following k values (1, 3, 5, 7, and 9). We will then visualize the results using a scatter plot.

In [381]:
def knn_train_test_k(train_col, target_col, df, k_values=5):
    np.random.seed(1)
    
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    #Split the dataset in two:
    train_set = rand_df[:int(len(rand_df)*0.75)]
    test_set = rand_df[int(len(rand_df)*0.75):]
    
    k_rmses = {}
    
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors = k)
        knn.fit(train_set[[train_col]], train_set[target_col])
        predictions = knn.predict(test_set[[train_col]])
    
        #Calculate the RMSE
        rmse = mean_squared_error(test_set[target_col], predictions, squared=False)
        k_rmses[k] = rmse
    return k_rmses
In [382]:
k_rmse_result = {}

for col in numeric_cols[:-1]:
    k_rmse_result[col] = knn_train_test_k(col, 'price', cars, k_values= range(1,10,2))

k_rmse_result
Out[382]:
{'normalized_losses': {1: 6499.876425672999,
  3: 6373.107123484384,
  5: 6231.311123986794,
  7: 6434.331952881507,
  9: 6534.518371010953},
 'wheel_base': {1: 2740.62572288541,
  3: 3417.1446705321687,
  5: 4161.947971897813,
  7: 4572.080496976003,
  9: 4488.496411259577},
 'length': {1: 5297.150036557764,
  3: 5507.625440160827,
  5: 5416.294063685402,
  7: 5030.583828449936,
  9: 4407.827047652169},
 'width': {1: 2791.733012966368,
  3: 4083.105499312811,
  5: 3917.2276702270137,
  7: 3722.033525295731,
  9: 3665.167444110463},
 'height': {1: 8302.179086735921,
  3: 7159.369749588925,
  5: 6666.667677627373,
  7: 6570.390279532118,
  9: 6484.278955313533},
 'curb_weight': {1: 3906.6625644092765,
  3: 4156.281651639599,
  5: 4011.4500359082417,
  7: 3811.5247071791855,
  9: 3914.8184777366964},
 'engine_size': {1: 2830.4032415548004,
  3: 2726.617989002506,
  5: 3051.4342223507515,
  7: 3009.8809966252047,
  9: 2951.5863108544913},
 'bore': {1: 8034.701162128302,
  3: 6084.362542715404,
  5: 6507.421953431328,
  7: 7053.551668716783,
  9: 7347.122881701951},
 'stroke': {1: 10305.511168834877,
  3: 8888.38657792322,
  5: 6939.6914399959405,
  7: 7280.853272752566,
  9: 6975.780622041914},
 'compression_ratio': {1: 6357.71976018286,
  3: 5732.933869082801,
  5: 5958.572328216569,
  7: 5674.840084089771,
  9: 6218.363072593153},
 'horsepower': {1: 3554.321250115078,
  3: 4351.006073525047,
  5: 4756.983755346181,
  7: 4701.79668565425,
  9: 4717.573083571473},
 'peak_rpm': {1: 7434.7067165278495,
  3: 7452.9880963456635,
  5: 6326.47174389284,
  7: 6746.235594602308,
  9: 6817.364517123306},
 'city_mpg': {1: 4403.50104473479,
  3: 4066.0079553304445,
  5: 3684.803553760217,
  7: 3751.3880442982118,
  9: 4114.050552842478},
 'highway_mpg': {1: 5033.224263848779,
  3: 4465.395966701448,
  5: 4323.5025303792845,
  7: 3902.7625529383477,
  9: 4249.080449511941}}
In [383]:
for k, v in k_rmse_result.items():
    x = list(v.keys())
    y = list(v.values())
    plt.scatter(x, y, label=k)
    plt.xlabel('K-value')
    plt.ylabel('RMSE')
    plt.xticks(x)
    plt.title('RMSE per k-value', y=1.06)
    plt.legend(bbox_to_anchor=(1.05, 1))

It seems that a k-value of 5 is optimal for the Univariate model.

Multivariate Model

Now, let's calculate the average RMSE accross different k values for each feature.

Afterwards, we will modify the knn_test_train() function to work with multiple columns.

Then we will use the modified function to calculate the RMSE when using the top 2, 3, 4, and 5 best features.

In [384]:
#Calculate the average RMSE across different `k` values for each feature.

avg_rmse = {}

for k, v in k_rmse_result.items():
    key = k
    value = np.mean(list(v.values()))
    avg_rmse[key] = value
    
avg_rmse_series = pd.Series(avg_rmse)
avg_rmse_series = avg_rmse_series.sort_values()
avg_rmse_series
Out[384]:
engine_size          2913.984552
width                3635.853430
wheel_base           3876.059055
curb_weight          3960.147487
city_mpg             4003.950230
highway_mpg          4394.793153
horsepower           4416.336170
length               5131.896083
compression_ratio    5988.485823
normalized_losses    6414.628999
peak_rpm             6955.553334
bore                 7005.432042
height               7036.577150
stroke               8078.044616
dtype: float64
In [385]:
def knn_train_test_list(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(1)
    
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    #Split the dataset in two:
    train_set = rand_df[:int(len(rand_df)*0.75)]
    test_set = rand_df[int(len(rand_df)*0.75):]
    
    knn.fit(train_set[train_col], train_set[target_col])
    predictions = knn.predict(test_set[train_col])
    
    #Calculate the RMSE
    rmse = mean_squared_error(test_set[target_col], predictions, squared=False)
    return rmse
In [386]:
#Now, let's use the 2, 3, 4, and 5 best features from the previous step with default k value of 5.

best_features = {}

for x in range(2,6):
    rmse = knn_train_test_list(avg_rmse_series.index[:x], 'price', cars)
    best_features['RMSE for Features: {}'.format(list(avg_rmse_series.index[:x]))] = rmse
In [387]:
best_features
Out[387]:
{"RMSE for Features: ['engine_size', 'width']": 2638.287274194467,
 "RMSE for Features: ['engine_size', 'width', 'wheel_base']": 2634.3882310997487,
 "RMSE for Features: ['engine_size', 'width', 'wheel_base', 'curb_weight']": 2399.3737483588766,
 "RMSE for Features: ['engine_size', 'width', 'wheel_base', 'curb_weight', 'city_mpg']": 2543.664238540805}

It looks like the Multivariate model preforms best with four features.

Hyperparameter Tuning

Let's now optimize the model that performed the best in the previous step.

For the top three models from the previous step, we will vary the hyperparameter value from 1 to 25 and plot the resulting values.

In [388]:
def knn_train_test_k(train_col_list, target_col, df):
    np.random.seed(1)
    
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    #Split the dataset in two:
    train_set = rand_df[:int(len(rand_df)*0.75)]
    test_set = rand_df[int(len(rand_df)*0.75):]
    
    k_values = range(1,26)
    k_rmses = {}
    
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors = k)
        knn.fit(train_set[train_col_list], train_set[target_col])
        predictions = knn.predict(test_set[train_col_list])
    
        #Calculate the RMSE
        rmse = mean_squared_error(test_set[target_col], predictions, squared=False)
        k_rmses[k] = rmse
    return k_rmses
In [389]:
top3 = [3,4,5]

top3_rmse = {}

for x in top3:
    rmse = knn_train_test_k(avg_rmse_series.index[:x], 'price', cars)
    top3_rmse['Best {} features'.format(x)] = rmse
In [390]:
top3_rmse
Out[390]:
{'Best 3 features': {1: 2037.493938400703,
  2: 2420.528728094218,
  3: 2466.335154811804,
  4: 2500.1180896129295,
  5: 2634.3882310997487,
  6: 2638.3042600762296,
  7: 2707.0835451696375,
  8: 2722.4567546507747,
  9: 2818.4424513137938,
  10: 2896.475263346206,
  11: 2848.7049142909195,
  12: 2976.516912023164,
  13: 3018.674738954696,
  14: 3151.1076747917155,
  15: 3308.8805984260784,
  16: 3401.759092567612,
  17: 3505.29174118991,
  18: 3610.00223375181,
  19: 3616.158541761227,
  20: 3680.22666115825,
  21: 3725.1163223927992,
  22: 3787.534502731638,
  23: 3808.838793741088,
  24: 3815.4423107223315,
  25: 3883.5938729773807},
 'Best 4 features': {1: 2054.308886219402,
  2: 2312.1377161898145,
  3: 2342.6297026519737,
  4: 2297.287468833795,
  5: 2399.3737483588766,
  6: 2483.1128692111097,
  7: 2655.597307504255,
  8: 2753.481913222979,
  9: 2869.211518068187,
  10: 2876.2827368589724,
  11: 2790.066817576287,
  12: 2854.342162940999,
  13: 2847.3193772453174,
  14: 2946.704084275195,
  15: 3085.1710150725858,
  16: 3163.7708779795853,
  17: 3275.6480420821645,
  18: 3280.424683662215,
  19: 3316.979218424875,
  20: 3350.9084160485645,
  21: 3436.497390281892,
  22: 3472.0400792710466,
  23: 3535.9167710145994,
  24: 3569.44674285931,
  25: 3595.109623732075},
 'Best 5 features': {1: 2198.1608310621145,
  2: 2402.0308360333142,
  3: 2516.6052483524027,
  4: 2588.640862205569,
  5: 2543.664238540805,
  6: 2545.7763357011786,
  7: 2606.406501591917,
  8: 2815.318660567435,
  9: 2785.46102051866,
  10: 2771.8811967722586,
  11: 2729.3198095110683,
  12: 2794.67787774539,
  13: 2832.9760174697294,
  14: 2880.493075993126,
  15: 3014.7527522086934,
  16: 3101.6329287454873,
  17: 3164.9325512141627,
  18: 3240.472823936355,
  19: 3262.111511430676,
  20: 3320.7830101852924,
  21: 3399.330393442633,
  22: 3435.38692788332,
  23: 3475.615751819248,
  24: 3553.77592319236,
  25: 3640.9329688473003}}
In [391]:
for k, v in top3_rmse.items():
    x = list(v.keys())
    y = list(v.values())
    
    plt.plot(x, y, label=k)
    plt.legend(loc='lower right')
    plt.xlabel('K-value')
    plt.ylabel('RMSE')
    plt.title('RMSE for k-values 1 to 25')

It seems like using a k-value 1 is best for the model when using the best 3, 4 and 5 features.

K-fold cross validation

Now, let's modify the knn_train_test() function to use k-fold cross validation. We will use the default n_neighbors of 5 and will loop through folds 2 to 10.

In [392]:
from sklearn.model_selection import KFold, cross_val_score

def knn_train_test_kfold(train_col, target_col, df, folds=range(2,12,2)):
    avg_rmses = {}
    
    for f in folds:
        kf = KFold(f, shuffle=True, random_state=1)
        knn = KNeighborsRegressor()
    
        mse = cross_val_score(knn, df[[train_col]], df[target_col], scoring='neg_mean_squared_error', cv=kf)
        rmse = np.sqrt(np.absolute(mse))
        avg_rmse = np.mean(rmse)
        avg_rmses[f] = avg_rmse

    return avg_rmses
In [393]:
diff_folds = {}

for col in numeric_cols[:-1]:
    diff_folds[col] = knn_train_test_kfold(col, 'price', cars, folds=range(2,12,2))
    
diff_folds_df = pd.DataFrame(diff_folds)
diff_folds_df
Out[393]:
normalized_losses wheel_base length width height curb_weight engine_size bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg
2 7482.632289 5709.567683 5645.803155 4480.755962 7735.811797 4130.915690 3364.285473 6561.161516 7768.258096 6875.609561 3983.289877 7721.986452 4788.022045 4336.427993
4 6910.432610 5601.049008 5217.793716 4305.112854 7723.547222 4331.335448 3287.796580 6475.031311 7407.897334 6455.122963 3802.623017 7564.959829 4157.873935 4402.370669
6 7495.827796 5945.565378 5314.853031 4207.797327 7704.673968 3965.294847 3219.961874 6854.558582 7722.122646 6195.073611 3839.065975 7569.317860 4269.760675 4283.341086
8 6900.152199 5744.523778 5320.497538 4319.527199 7667.115403 4059.026943 3141.847651 6556.757544 7517.688928 6182.827390 3843.473451 7533.791225 4369.852715 4075.701847
10 6671.869661 5725.179143 5344.389417 4209.420018 7702.821601 4234.188078 3043.502927 6914.254048 7297.139641 6082.597131 3801.849871 7510.544452 4305.536968 4180.874800

We see that for 6 out of the 14 columns (~43%), 10 folds give the best result. Because of that, we will use 10 folds for the rest of the tests.

Varying nearest neighbours

Now let's modify the knn_train_test_kfold() so that the k nearest neighbors changes - 1, 3, 5, 7, 9.

In [413]:
def knn_train_test_kfold(train_col, target_col, df, folds=10):
    avg_rmses = {}
    kf = KFold(folds, shuffle=True, random_state=1)
    k_values = range(1,10,2)
    
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        mse = cross_val_score(knn, df[[train_col]], df[target_col], scoring='neg_mean_squared_error', cv=kf)
        rmse = np.sqrt(np.absolute(mse))
        avg_rmse = np.mean(rmse)
        avg_rmses[k] = avg_rmse
    
    return avg_rmses
    
In [414]:
cols_k = {}

for col in numeric_cols[:-1]:
    cols_k[col] = knn_train_test_kfold(col, 'price', cars, folds=10)
    
cols_k_df=pd.DataFrame(cols_k)

cols_k_df
Out[414]:
normalized_losses wheel_base length width height curb_weight engine_size bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg
1 7579.057448 4127.664023 4898.478019 4066.868124 8392.337279 5157.859842 3601.796747 8057.920723 8225.659301 6741.420713 3991.682326 8725.320610 5756.990928 5108.460253
3 7148.990331 4791.446221 5055.359831 4014.256862 7867.419571 4495.974906 3054.441896 6514.966390 7273.057903 5821.947865 3860.005144 7721.733869 4388.057423 4648.040609
5 6671.869661 5725.179143 5344.389417 4209.420018 7702.821601 4234.188078 3043.502927 6914.254048 7297.139641 6082.597131 3801.849871 7510.544452 4305.536968 4180.874800
7 7076.278964 5949.354619 5407.596946 4362.409681 7663.933333 4065.269237 3346.551409 6940.746752 7332.833386 6251.986521 3799.826781 7487.471490 4247.140019 4136.835757
9 7376.047238 5847.402960 5461.100326 4316.124588 7491.422748 3961.532240 3496.675613 6752.362175 7299.626100 6330.739278 3843.219943 7564.611083 4414.995561 4168.970779

Now, let's use the above results for k-values [1,3,5,9] and average the results for each feature. We will then use the top features to calculate the RMSE for more than one column over 10 folds.

In [416]:
#Calculate the average RMSE across different `k` values for each feature.

avg_rmse_k = {}

for k, v in cols_k.items():
    key = k
    value = np.mean(list(v.values()))
    avg_rmse_k[key] = value
    
avg_rmse_k_series = pd.Series(avg_rmse_k)
avg_rmse_k_series = avg_rmse_k_series.sort_values()
avg_rmse_k_series
Out[416]:
engine_size          3308.593718
horsepower           3859.316813
width                4193.815855
curb_weight          4382.964861
highway_mpg          4448.636439
city_mpg             4622.544180
length               5233.384908
wheel_base           5288.209393
compression_ratio    6245.738302
bore                 7036.050018
normalized_losses    7170.448729
stroke               7485.663266
peak_rpm             7801.936301
height               7823.586906
dtype: float64
In [418]:
def knn_train_test_kfold(train_col, target_col, df, folds=10):
    kf = KFold(folds, shuffle=True, random_state=1)
    knn = KNeighborsRegressor()
    
    mse = cross_val_score(knn, df[train_col], df[target_col], scoring='neg_mean_squared_error', cv=kf)
    rmse = np.sqrt(np.absolute(mse))
    avg_rmse = np.mean(rmse)
    return avg_rmse
In [419]:
features = {}

for x in range(2,7):
    result = knn_train_test_kfold(list(avg_rmse_k_series.index[:x]), 'price', cars, folds=10)
    features["{} best features".format(x)] = result
    
features
Out[419]:
{'2 best features': 2808.4983984193304,
 '3 best features': 3031.767371260333,
 '4 best features': 3010.436378520319,
 '5 best features': 3031.8000284536897,
 '6 best features': 3097.0129847108556}

It looks like the 10 fold model and k value of 5 preforms best with 2 features.

Let's use the top 3 features combinations from above (2, 3, 4, 5) and calculate the RMSE with varying nearest neighbors from 1 to 25, using 10 folds.

In [420]:
def knn_train_test_kfold(train_col, target_col, df, folds=10):
    kf = KFold(folds, shuffle=True, random_state=1)
    
    avg_rmses = {}
    for k in range(1,26):
        knn = KNeighborsRegressor(n_neighbors=k)
    
        mse = cross_val_score(knn, df[train_col], df[target_col], scoring='neg_mean_squared_error', cv=kf)
        rmse = np.sqrt(np.absolute(mse))
        avg_rmse = np.mean(rmse)
        avg_rmses[k] = avg_rmse
    return avg_rmses
In [421]:
features_25 = {}

features = [2,3,4,5]

for x in features:
    result = knn_train_test_kfold(list(ten_folds_s.index[:x]), 'price', cars, folds=10)
    features_25["{} best features".format(x)] = result
    
features_25
Out[421]:
{'2 best features': {1: 2844.6738023575053,
  2: 2601.2320598273477,
  3: 2636.5883949955796,
  4: 2787.516162198912,
  5: 2808.4983984193304,
  6: 2845.3008673634185,
  7: 2934.275309828578,
  8: 3137.167365122338,
  9: 3205.8876254263178,
  10: 3285.3850135510047,
  11: 3422.1375743796843,
  12: 3550.5516722965376,
  13: 3691.4335807073912,
  14: 3793.5952330652267,
  15: 3868.994116314313,
  16: 3897.4576330716213,
  17: 3933.2465918697517,
  18: 4025.3490798025623,
  19: 4085.6682300076645,
  20: 4140.402487825685,
  21: 4166.786411146952,
  22: 4205.597041611041,
  23: 4278.854191048603,
  24: 4336.7415031100345,
  25: 4396.694250351881},
 '3 best features': {1: 2835.7499519350376,
  2: 2713.0939323616276,
  3: 2881.9217672696113,
  4: 2889.4832987303425,
  5: 2841.263201908723,
  6: 2842.0707944371597,
  7: 2861.897569604688,
  8: 2944.1857947890157,
  9: 3081.2734956302334,
  10: 3236.5861706288274,
  11: 3370.8617217698957,
  12: 3461.715962779535,
  13: 3559.758194194258,
  14: 3661.5208599146586,
  15: 3722.097232839941,
  16: 3783.4163394060292,
  17: 3794.859672251304,
  18: 3848.883401454411,
  19: 3907.1342111481376,
  20: 3996.464958332951,
  21: 4069.230213629584,
  22: 4146.315861378528,
  23: 4229.862620322901,
  24: 4277.620323178494,
  25: 4324.470535981728},
 '4 best features': {1: 2758.415899280747,
  2: 2738.8111435615356,
  3: 2826.3147326898097,
  4: 2907.9974816814656,
  5: 3109.51836966299,
  6: 3162.825655139368,
  7: 3255.6467031203138,
  8: 3337.042766498323,
  9: 3395.531012433022,
  10: 3391.265338725209,
  11: 3365.259927631688,
  12: 3354.954668157832,
  13: 3431.8157579549284,
  14: 3527.281024218571,
  15: 3635.1275508295344,
  16: 3755.3392126172284,
  17: 3821.765436292949,
  18: 3894.5873570708745,
  19: 3936.600608368157,
  20: 3998.661745284531,
  21: 4054.2532431517056,
  22: 4103.673942749257,
  23: 4149.211421266717,
  24: 4200.632572416727,
  25: 4231.87605444434},
 '5 best features': {1: 2387.989701944437,
  2: 2605.8129571699683,
  3: 2617.023387868039,
  4: 2850.7310028232587,
  5: 3031.8000284536897,
  6: 3179.9099856765306,
  7: 3259.3832024265203,
  8: 3291.025360053838,
  9: 3330.080256269958,
  10: 3406.4218137052712,
  11: 3427.670710137292,
  12: 3482.1389813977826,
  13: 3533.599857429084,
  14: 3584.617451945532,
  15: 3672.4055813523287,
  16: 3738.9448772518017,
  17: 3815.0789878312376,
  18: 3880.5662892780733,
  19: 3977.743003897579,
  20: 3988.9994628618674,
  21: 4023.846213626975,
  22: 4058.430814240882,
  23: 4090.076222229394,
  24: 4110.878305650327,
  25: 4123.132673656963}}
In [422]:
for k, v in features_25.items():
    x = list(v.keys())
    y = list(v.values())
    
    plt.plot(x, y, label=k)
    plt.legend(loc='lower right', frameon=False)
    plt.xlabel('K-value')
    plt.ylabel('RMSE')
    plt.title('RMSE for k-values 1 to 25\n10 Folds')

It looks like 2, 3, and 4 best features produce most accurate results with k value of 2 whereas 5 features work best with k value of 1.

In [ ]: