Predicting Car Prices

In this project, I use the k-nearest neighbors algorithm to predict a car's market price from its attributes. The dataset describes technical aspects of each vehicle, such as the engine's displacement, the car's weight, its fuel economy in miles per gallon, its horsepower, and more. You can read more about the data set here and download it directly from here. Here's a preview of the attributes:

Attribute Information:

1. symboling: -3, -2, -1, 0, 1, 2, 3.
2. normalized-losses: continuous from 65 to 256.
3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo.
4. fuel-type: diesel, gas.
5. aspiration: std, turbo.
6. num-of-doors: four, two.
7. body-style: hardtop, wagon, sedan, hatchback, convertible.
8. drive-wheels: 4wd, fwd, rwd.
9. engine-location: front, rear.
10. wheel-base: continuous from 86.6 to 120.9.
11. length: continuous from 141.1 to 208.1.
12. width: continuous from 60.3 to 72.3.
13. height: continuous from 47.8 to 59.8.
14. curb-weight: continuous from 1488 to 4066.
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. num-of-cylinders: eight, five, four, six, three, twelve, two.
17. engine-size: continuous from 61 to 326.
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. bore: continuous from 2.54 to 3.94.
20. stroke: continuous from 2.07 to 4.17.
21. compression-ratio: continuous from 7 to 23.
22. horsepower: continuous from 48 to 288.
23. peak-rpm: continuous from 4150 to 6600.
24. city-mpg: continuous from 13 to 49.
25. highway-mpg: continuous from 16 to 54.
26. price: continuous from 5118 to 45400.

In [1]:

# importing libraries

import pandas as pd
import numpy as np

pd.options.display.max_columns = 99

In [2]:

# reading the dataset

cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm',
        'city-mpg', 'highway-mpg', 'price']
cars = pd.read_csv('imports-85.data', names=cols)

In [3]:

cars.head(10)

Out[3]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	length	width	height	curb-weight	engine-type	num-of-cylinders	engine-size	fuel-system	bore	stroke	compression-rate	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495
1	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
2	1	?	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	171.2	65.5	52.4	2823	ohcv	six	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	176.6	66.2	54.3	2337	ohc	four	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	176.6	66.4	54.3	2824	ohc	five	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450
5	2	?	audi	gas	std	two	sedan	fwd	front	99.8	177.3	66.3	53.1	2507	ohc	five	136	mpfi	3.19	3.40	8.5	110	5500	19	25	15250
6	1	158	audi	gas	std	four	sedan	fwd	front	105.8	192.7	71.4	55.7	2844	ohc	five	136	mpfi	3.19	3.40	8.5	110	5500	19	25	17710
7	1	?	audi	gas	std	four	wagon	fwd	front	105.8	192.7	71.4	55.7	2954	ohc	five	136	mpfi	3.19	3.40	8.5	110	5500	19	25	18920
8	1	158	audi	gas	turbo	four	sedan	fwd	front	105.8	192.7	71.4	55.9	3086	ohc	five	131	mpfi	3.13	3.40	8.3	140	5500	17	20	23875
9	0	?	audi	gas	turbo	two	hatchback	4wd	front	99.5	178.2	67.9	52.0	3053	ohc	five	131	mpfi	3.13	3.40	7.0	160	5500	16	22	?

In [4]:

cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               205 non-null    object 
 19  stroke             205 non-null    object 
 20  compression-rate   205 non-null    float64
 21  horsepower         205 non-null    object 
 22  peak-rpm           205 non-null    object 
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              205 non-null    object 
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
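
The `object` dtype reported for numeric columns such as `bore`, `horsepower`, and `price` comes from the `?` placeholders visible in the preview. As one possible alternative to replacing them after loading, `pd.read_csv` accepts an `na_values` argument that parses such markers as `NaN` while reading. The sketch below uses a tiny inline sample (made up for illustration, not the real `imports-85.data` file) with only three of the columns.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated style as the data file.
sample = io.StringIO("3,?,13495\n1,164,?\n2,158,17710\n")

demo = pd.read_csv(sample,
                   names=['symboling', 'normalized-losses', 'price'],
                   na_values='?')

print(demo.dtypes)          # the two columns that contained '?' are parsed as float64
print(demo.isnull().sum())  # one missing value in each of those columns
```

With this approach the later `replace('?', np.nan)` and `astype('float')` steps would likely become unnecessary for the affected columns, though the notebook's explicit version makes each cleaning step easier to follow.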

In [5]:

# creating a new dataframe with columns having continuous values
# from https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names

continuous_values_cols = ['normalized-losses', 'wheel-base',
                          'length', 'width', 'height','curb-weight',
                          'bore', 'stroke', 'compression-rate',
                          'horsepower', 'peak-rpm', 'city-mpg',
                          'highway-mpg', 'price']

cars_numeric = cars[continuous_values_cols]
cars_numeric.sample(10)

Out[5]:

	normalized-losses	wheel-base	length	width	height	curb-weight	bore	stroke	compression-rate	horsepower	peak-rpm	city-mpg	highway-mpg	price
79	161	93.0	157.3	63.8	50.8	2145	3.03	3.39	7.6	102	5500	24	30	7689
60	115	98.8	177.8	66.5	55.5	2410	3.39	3.39	8.6	84	4800	26	32	8495
23	118	93.7	157.3	63.8	50.8	2128	3.03	3.39	7.6	102	5500	24	30	7957
87	125	96.3	172.4	65.4	51.6	2403	3.17	3.46	7.5	116	5500	23	30	9279
107	161	107.9	186.7	68.4	56.7	3020	3.46	3.19	8.4	97	5000	19	24	11900
0	?	88.6	168.8	64.1	48.8	2548	3.47	2.68	9.0	111	5000	21	27	13495
7	?	105.8	192.7	71.4	55.7	2954	3.19	3.40	8.5	110	5500	19	25	18920
39	85	96.5	175.4	65.2	54.1	2304	3.15	3.58	9.0	86	5800	27	33	8845
130	?	96.1	181.5	66.5	55.2	2579	3.46	3.90	8.7	?	?	23	31	9295
141	102	97.2	172.0	65.4	52.5	2145	3.62	2.64	9.5	82	4800	32	37	7126

In [6]:

# cleaning the data

# replacing '?' symbols in the columns
cars_numeric = cars_numeric.replace('?', np.nan)
cars_numeric

Out[6]:

	normalized-losses	wheel-base	length	width	height	curb-weight	bore	stroke	compression-rate	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	NaN	88.6	168.8	64.1	48.8	2548	3.47	2.68	9.0	111	5000	21	27	13495
1	NaN	88.6	168.8	64.1	48.8	2548	3.47	2.68	9.0	111	5000	21	27	16500
2	NaN	94.5	171.2	65.5	52.4	2823	2.68	3.47	9.0	154	5000	19	26	16500
3	164	99.8	176.6	66.2	54.3	2337	3.19	3.40	10.0	102	5500	24	30	13950
4	164	99.4	176.6	66.4	54.3	2824	3.19	3.40	8.0	115	5500	18	22	17450
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
200	95	109.1	188.8	68.9	55.5	2952	3.78	3.15	9.5	114	5400	23	28	16845
201	95	109.1	188.8	68.8	55.5	3049	3.78	3.15	8.7	160	5300	19	25	19045
202	95	109.1	188.8	68.9	55.5	3012	3.58	2.87	8.8	134	5500	18	23	21485
203	95	109.1	188.8	68.9	55.5	3217	3.01	3.40	23.0	106	4800	26	27	22470
204	95	109.1	188.8	68.9	55.5	3062	3.78	3.15	9.5	114	5400	19	25	22625

205 rows × 14 columns

In [7]:

# converting all columns to float dtype
cars_numeric = cars_numeric.astype('float')

# null values count
cars_numeric.isnull().sum()

Out[7]:

normalized-losses    41
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

In [8]:

# removing null values in the price column as 'price' is the column we want to predict
cars_numeric = cars_numeric.dropna(subset=['price'])
cars_numeric.isnull().sum()

Out[8]:

normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64

In [9]:

# replacing missing values in other columns with column means.
cars_numeric = cars_numeric.fillna(cars_numeric.mean())

# confirming that there are no more missing values
cars_numeric.isnull().sum()

Out[9]:

normalized-losses    0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
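
To make the imputation step concrete, here is a small sketch on a made-up two-column frame (the values are illustrative, not taken from the dataset). `DataFrame.mean()` skips `NaN` by default, so each gap is filled with the mean of the observed values in its own column.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative numbers only).
toy = pd.DataFrame({'bore': [3.0, np.nan, 3.5],
                    'horsepower': [100.0, 120.0, np.nan]})

# mean() ignores NaN, so the fill values are 3.25 for 'bore'
# and 110.0 for 'horsepower'.
filled = toy.fillna(toy.mean())
print(filled)
```

Mean imputation keeps every row but pulls the filled values toward the center of the column. With 37 of the 201 remaining `normalized-losses` values missing, that column is the one most affected by this choice.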

In [10]:

# normalizing all columns to range from 0 to 1 except the target column.
price_col = cars_numeric['price']
cars_numeric = ((cars_numeric - cars_numeric.min()) /
                (cars_numeric.max() - cars_numeric.min())
               )
cars_numeric['price'] = price_col
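
The rescaling above is min-max normalization: each value x becomes (x - min) / (max - min), computed column by column, so the smallest value maps to 0 and the largest to 1. A toy sketch with made-up numbers shows the effect on one feature while the target is left in its original units.

```python
import pandas as pd

# Illustrative values only, not rows from the dataset.
toy = pd.DataFrame({'curb-weight': [2000.0, 2500.0, 3000.0],
                    'price': [8000.0, 12000.0, 20000.0]})

# Scale the feature columns only; the target is re-attached unchanged.
features = toy.drop(columns='price')
scaled = (features - features.min()) / (features.max() - features.min())
scaled['price'] = toy['price']
print(scaled)
```

Putting the features on a common scale matters for k-nearest neighbors because distances would otherwise be dominated by columns with large raw values, such as `curb-weight`. Note that the notebook takes the minimum and maximum from all rows, including those later used for testing; computing them on the training half only would be a stricter setup.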

In [11]:

# creating a univariate (single column/variable) model

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(1)
        
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Use half of the rows (rounded down) for training.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    # Fit a KNN model using default k value.
    knn.fit(train_df[[train_col]], train_df[target_col])
    
    # Make predictions using model.
    predicted_labels = knn.predict(test_df[[train_col]])

    # Calculate and return RMSE.
    mse = mean_squared_error(test_df[target_col], predicted_labels)
    rmse = np.sqrt(mse)
    return rmse

rmse_results = {}
train_cols = cars_numeric.columns.drop('price')

# For each column (minus `price`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test(col, 'price', cars_numeric)
    rmse_results[col] = rmse_val

# Create a Series object from the dictionary so 
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()

Out[11]:

horsepower           4037.037713
curb-weight          4401.118255
highway-mpg          4630.026799
width                4704.482590
city-mpg             4766.422505
length               5427.200961
wheel-base           5461.553998
compression-rate     6610.812153
bore                 6780.627785
normalized-losses    7330.197653
peak-rpm             7697.459696
stroke               8006.529545
height               8144.441043
dtype: float64
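
For intuition about what `KNeighborsRegressor` does with a single feature, the following NumPy sketch predicts a value as the unweighted average of the targets of the k closest training points. It is a simplified illustration with made-up numbers and a hypothetical helper name, `knn_predict_1d`; it is not scikit-learn's implementation.

```python
import numpy as np

def knn_predict_1d(train_x, train_y, query_x, k=5):
    """Average the targets of the k training points nearest to query_x."""
    distances = np.abs(train_x - query_x)               # 1-D distance to every training point
    nearest = np.argsort(distances, kind='stable')[:k]  # indices of the k smallest distances
    return train_y[nearest].mean()

# Illustrative normalized feature values and prices (not from the dataset).
train_x = np.array([0.10, 0.20, 0.40, 0.80, 0.90])
train_y = np.array([6000.0, 7000.0, 9000.0, 20000.0, 30000.0])

# The three nearest neighbours of 0.25 are the points at 0.10, 0.20 and 0.40,
# so the prediction is the mean of 6000, 7000 and 9000.
pred = knn_predict_1d(train_x, train_y, query_x=0.25, k=3)
print(pred)
```

The RMSE values in the table above measure how far such averaged predictions land from the true prices on the held-out half, in the same units as `price`.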

In [12]:

# modifying the function to evaluate several 'k' values for each feature

def knn_train_test(train_col, target_col, df):
    np.random.seed(1)
        
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Use half of the rows (rounded down) for training.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    k_values = [1,3,5,7,9]
    k_rmses = {}
    
    for k in k_values:
        # Fit model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[[train_col]], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[[train_col]])

        # Calculate RMSE and store it under this k.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        
        k_rmses[k] = rmse
    return k_rmses

k_rmse_results = {}

# For each column (minus `price`), train one model per k value, collect the
# RMSE for each k and add them to the dictionary `k_rmse_results`.
train_cols = cars_numeric.columns.drop('price')
for col in train_cols:
    rmse_val = knn_train_test(col, 'price', cars_numeric)
    k_rmse_results[col] = rmse_val

k_rmse_results

Out[12]:

{'normalized-losses': {1: 7846.750605148984,
  3: 7500.5698123109905,
  5: 7330.197653434445,
  7: 7756.421586234123,
  9: 7688.096096891432},
 'wheel-base': {1: 4493.734068810494,
  3: 5120.161506064513,
  5: 5461.553997873057,
  7: 5448.1070513823315,
  9: 5738.405685192312},
 'length': {1: 4628.45550121557,
  3: 5129.8358210721635,
  5: 5427.2009608367125,
  7: 5313.427720847974,
  9: 5383.054514833446},
 'width': {1: 4559.257297950061,
  3: 4606.413692169901,
  5: 4704.482589704386,
  7: 4571.485046194653,
  9: 4652.914172067787},
 'height': {1: 8904.04645636071,
  3: 8277.609643045525,
  5: 8144.441042663747,
  7: 7679.598124393773,
  9: 7811.03606291223},
 'curb-weight': {1: 5264.290230758878,
  3: 5022.318011757233,
  5: 4401.118254793124,
  7: 4330.608104418053,
  9: 4632.044474454401},
 'bore': {1: 8602.58848450066,
  3: 6984.239489480916,
  5: 6780.627784685976,
  7: 6878.097965921532,
  9: 6866.808502038413},
 'stroke': {1: 9116.495955406906,
  3: 7338.68466990294,
  5: 8006.529544647101,
  7: 7803.937796804327,
  9: 7735.554366079291},
 'compression-rate': {1: 8087.205346523092,
  3: 7375.063685578359,
  5: 6610.812153159129,
  7: 6732.801282941515,
  9: 7024.485525463435},
 'horsepower': {1: 4170.054848037801,
  3: 4020.8492630885394,
  5: 4037.0377131537603,
  7: 4353.811860277134,
  9: 4515.135617419103},
 'peak-rpm': {1: 9511.480067750124,
  3: 8537.550899973421,
  5: 7697.4596964334805,
  7: 7510.294160083481,
  9: 7340.041341263401},
 'city-mpg': {1: 5901.143574354764,
  3: 4646.746408727155,
  5: 4766.422505090134,
  7: 5232.523034167316,
  9: 5465.209492527533},
 'highway-mpg': {1: 6025.594966720739,
  3: 4617.305019788554,
  5: 4630.026798588056,
  7: 4796.061440186946,
  9: 5278.358056953987}}

In [13]:

# importing library for plotting

import matplotlib.pyplot as plt
%matplotlib inline

for key, value in k_rmse_results.items():
    x = list(value.keys())
    y = list(value.values())

    # One line per feature, labelled so the curves can be told apart.
    plt.plot(x, y, label=key)

plt.xlabel('k value')
plt.ylabel('RMSE')
plt.legend()

In [14]:

# computing average RMSE across different `k` values for each feature.

feature_avg_rmse = {}
for key, val in k_rmse_results.items():
    avg_rmse = np.mean(list(val.values()))
    feature_avg_rmse[key] = avg_rmse
    
avg_rmse_series = pd.Series(feature_avg_rmse)
avg_rmse_series.sort_values()

Out[14]:

horsepower           4219.377860
width                4618.910560
curb-weight          4730.075815
highway-mpg          5069.469256
length               5176.394904
city-mpg             5202.409003
wheel-base           5252.392462
compression-rate     7166.073599
bore                 7222.472445
normalized-losses    7624.407151
stroke               8000.240467
peak-rpm             8119.365233
height               8163.346266
dtype: float64
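
The nested dictionary of per-k RMSE values can also be averaged by loading it into a DataFrame, where each outer key becomes a column and each k becomes a row. The sketch below is one possible shortcut, shown on two features and the first three k values copied from the output of cell 12.

```python
import pandas as pd

# Subset of the k_rmse_results dictionary shown earlier (k = 1, 3, 5 only).
subset = {
    'horsepower': {1: 4170.054848037801, 3: 4020.8492630885394, 5: 4037.0377131537603},
    'width': {1: 4559.257297950061, 3: 4606.413692169901, 5: 4704.482589704386},
}

# Columns are features, rows are k values; mean() averages down each column.
avg = pd.DataFrame(subset).mean().sort_values()
print(avg)
```

Because only three of the five k values are included here, the averages differ from those listed above; applied to the full dictionary, the same expression should give the same ranking as the loop.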

In [15]:

# creating a multivariate (multiple columns/variables) model

def knn_train_test(train_cols, target_col, df):
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Use half of the rows (rounded down) for training.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    k_values = [5]
    k_rmses = {}
    
    for k in k_values:
        # Fit model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[train_cols], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_cols])

        # Calculate RMSE and store it under this k.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        
        k_rmses[k] = rmse
    return k_rmses

k_rmse_results = {}

two_best_features = list(avg_rmse_series.sort_values().keys()[:2])
rmse_val = knn_train_test(two_best_features, 'price', cars_numeric)
k_rmse_results['two best features'] = rmse_val

three_best_features = list(avg_rmse_series.sort_values().keys()[:3])
rmse_val = knn_train_test(three_best_features, 'price', cars_numeric)
k_rmse_results['three best features'] = rmse_val

four_best_features = list(avg_rmse_series.sort_values().keys()[:4])
rmse_val = knn_train_test(four_best_features, 'price', cars_numeric)
k_rmse_results['four best features'] = rmse_val

five_best_features = list(avg_rmse_series.sort_values().keys()[:5])
rmse_val = knn_train_test(five_best_features, 'price', cars_numeric)
k_rmse_results['five best features'] = rmse_val

k_rmse_results

Out[15]:

{'two best features': {5: 3589.3132622073304},
 'three best features': {5: 3305.9401397969677},
 'four best features': {5: 3358.6915801682458},
 'five best features': {5: 3665.546673045813}}
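
Each RMSE above comes from a single 50/50 split, so the ranking of the feature sets may partly reflect that one shuffle. One common way to reduce this dependence is k-fold cross-validation. The sketch below uses scikit-learn's `KFold` and `cross_val_score` on synthetic data (randomly generated here, not the car dataset); the fold count of 5 and the shape of the synthetic target are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: three features in [0, 1) and a noisy linear target.
rng = np.random.default_rng(1)
X = rng.random((100, 3))
y = 5000 + 20000 * X[:, 0] + rng.normal(0, 500, size=100)

# Shuffle once, then rotate through 5 train/test folds.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
neg_mse = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, y,
                          scoring='neg_mean_squared_error', cv=kf)

# scikit-learn reports negated MSE, so flip the sign before taking the root.
fold_rmses = np.sqrt(-neg_mse)
print(fold_rmses.mean(), fold_rmses.std())
```

Comparing the mean RMSE across folds, along with its spread, gives a steadier basis for choosing between feature sets whose single-split scores are as close as the three- and four-feature models here.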

In [16]:

# tuning the 'k' value for each of the models from the previous step

def knn_train_test(train_cols, target_col, df):
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Use half of the rows (rounded down) for training.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    
    k_values = list(range(1, 25))
    k_rmses = {}
    
    for k in k_values:
        # Fit model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[train_cols], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_cols])

        # Calculate RMSE and store it under this k.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        
        k_rmses[k] = rmse
    return k_rmses

k_rmse_results = {}

two_best_features = list(avg_rmse_series.sort_values().keys()[:2])
rmse_val = knn_train_test(two_best_features, 'price', cars_numeric)
k_rmse_results["two best features"] = rmse_val

three_best_features = list(avg_rmse_series.sort_values().keys()[:3])
rmse_val = knn_train_test(three_best_features, 'price', cars_numeric)
k_rmse_results["three best features"] = rmse_val

four_best_features = list(avg_rmse_series.sort_values().keys()[:4])
rmse_val = knn_train_test(four_best_features, 'price', cars_numeric)
k_rmse_results["four best features"] = rmse_val

five_best_features = list(avg_rmse_series.sort_values().keys()[:5])
rmse_val = knn_train_test(five_best_features, 'price', cars_numeric)
k_rmse_results["five best features"] = rmse_val

k_rmse_results

Out[16]:

{'two best features': {1: 4061.9613050304106,
  2: 3497.49936199118,
  3: 3402.8692636542114,
  4: 3587.0044198356923,
  5: 3589.3132622073304,
  6: 3680.062981095498,
  7: 3756.92796407086,
  8: 3937.770418264052,
  9: 4078.3485919700097,
  10: 4163.828373808731,
  11: 4297.135962941241,
  12: 4370.753019740529,
  13: 4500.462028689254,
  14: 4604.156707686779,
  15: 4595.345097101211,
  16: 4605.433669910023,
  17: 4611.2845838376215,
  18: 4598.88218482117,
  19: 4579.964891966457,
  20: 4653.966845712387,
  21: 4759.076059393234,
  22: 4807.805949321809,
  23: 4865.320887129985,
  24: 4910.715769042787},
 'three best features': {1: 3013.0109985241875,
  2: 2813.285969825997,
  3: 3171.585284478674,
  4: 3182.3137417981943,
  5: 3305.9401397969677,
  6: 3522.506848900376,
  7: 3774.3772094554106,
  8: 3978.969124021116,
  9: 3992.923680588881,
  10: 4076.2381473803043,
  11: 4156.388331131807,
  12: 4201.10713385948,
  13: 4303.62676861325,
  14: 4359.693296989702,
  15: 4371.771103372868,
  16: 4394.4846551644205,
  17: 4510.399710057406,
  18: 4584.310961865486,
  19: 4636.62620477063,
  20: 4664.465847866811,
  21: 4724.096637428273,
  22: 4752.535484102914,
  23: 4808.703310452101,
  24: 4858.9452710176065},
 'four best features': {1: 2600.746383728188,
  2: 2725.4325072335123,
  3: 3108.8580314362966,
  4: 3217.3135209486827,
  5: 3358.6915801682458,
  6: 3633.1687033129465,
  7: 3896.127441396644,
  8: 4002.8383900652543,
  9: 4055.5309369929582,
  10: 4128.67807741542,
  11: 4249.827289347268,
  12: 4344.035898237492,
  13: 4402.995293166156,
  14: 4424.314365328619,
  15: 4442.943179452285,
  16: 4528.57927503009,
  17: 4572.28806185627,
  18: 4604.034045947238,
  19: 4660.524954508328,
  20: 4735.352015758023,
  21: 4742.329532242572,
  22: 4763.606459864159,
  23: 4807.076030845482,
  24: 4848.127192424658},
 'five best features': {1: 2773.8991269216394,
  2: 2936.079965592973,
  3: 3152.3415515178144,
  4: 3488.57822210674,
  5: 3665.546673045813,
  6: 3563.9910249785435,
  7: 3714.642677357888,
  8: 3927.6655582704293,
  9: 4074.724411578548,
  10: 4202.692919892065,
  11: 4228.8377103033245,
  12: 4280.7222580306225,
  13: 4323.694733441248,
  14: 4341.598003940922,
  15: 4381.910642108479,
  16: 4462.210967318207,
  17: 4512.666161759793,
  18: 4549.02427742861,
  19: 4625.542238703432,
  20: 4680.4075341436155,
  21: 4769.300287838951,
  22: 4813.1714929806085,
  23: 4871.956026848068,
  24: 4922.889655107399}}

In [17]:

# plotting the results

for key, val in k_rmse_results.items():
    x = list(val.keys())
    y = list(val.values())

    # One line per model, labelled so the curves can be told apart.
    plt.plot(x, y, label=key)

plt.xlabel('k value')
plt.ylabel('RMSE')
plt.legend()

In [18]:

# extracting the lowest RMSE (reached at the optimal k) for each model

import operator

for model_name, k_rmses in k_rmse_results.items():
    # min() over (k, rmse) pairs, compared by rmse; [1] keeps the RMSE itself.
    best_rmse = min(k_rmses.items(), key=operator.itemgetter(1))[1]
    print('{}: {}'.format(model_name, best_rmse))

two best features: 3402.8692636542114
three best features: 2813.285969825997
four best features: 2600.746383728188
five best features: 2773.8991269216394
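
The printed values are the lowest RMSE per model; the k that produces each one can be recovered from the same dictionaries. A minimal sketch, using the first three entries of the 'four best features' results from cell 16:

```python
# First three (k: RMSE) entries of the 'four best features' results above.
four_feat_rmses = {1: 2600.746383728188,
                   2: 2725.4325072335123,
                   3: 3108.8580314362966}

# min() over the keys, compared by their RMSE, returns the best k.
best_k = min(four_feat_rmses, key=four_feat_rmses.get)
print(best_k, four_feat_rmses[best_k])  # prints: 1 2600.746383728188
```

In the full results above, the minima fall at k = 3, 2, 1, and 1 for the two-, three-, four-, and five-feature models respectively, so the lowest overall RMSE (about 2601) belongs to the four-feature model with k = 1.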