In this course, we explored the fundamentals of machine learning using the k-nearest neighbors algorithm. In this guided project, you'll practice the machine learning workflow you've learned so far to predict a car's market price from its attributes. The dataset we will be working with contains information on various cars. For each car, we have technical specifications such as the engine's displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more. You can read more about the data set here and can download it directly from here.
In conclusion, the multivariate k-nearest neighbors regression model using the horsepower, engine-size, and city-mpg columns with k = 3 turned out to be the most accurate and reliable one.
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
pd.options.display.max_columns = 99
# Assign column names (the raw data file has no header row)
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
cars = pd.read_csv('imports-85.data', names=cols)
cars.head()
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  205 non-null    object
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       205 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               205 non-null    object
 19  stroke             205 non-null    object
 20  compression-rate   205 non-null    float64
 21  horsepower         205 non-null    object
 22  peak-rpm           205 non-null    object
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              205 non-null    object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
# Summarising the dataset using Pandas Profiling
profile = ProfileReport(cars, title="Pandas Profiling Report")
# Take a look at the summary (commented out to avoid regenerating the long report on subsequent kernel runs)
#profile
The dataset marks missing values with '?'. We need to deal with these before our machine learning algorithm can work with the data.
cars['normalized-losses'].value_counts().head()
?      41
161    11
91      8
150     7
134     6
Name: normalized-losses, dtype: int64
cars.replace('?', np.nan, inplace=True)
cars['normalized-losses'].value_counts(dropna=False).head()
NaN    41
161    11
91      8
150     7
104     6
Name: normalized-losses, dtype: int64
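As an aside, the same cleanup could be done at load time by telling pandas which marker denotes missing values. A hypothetical alternative to the replace call above, not used here:
# Alternative: treat '?' as NaN while reading the file
cars = pd.read_csv('imports-85.data', names=cols, na_values='?')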
# Explore columns with nulled normalized losses
cars[cars['normalized-losses'].isnull()]
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
5 | 2 | NaN | audi | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
7 | 1 | NaN | audi | gas | std | four | wagon | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920 |
9 | 0 | NaN | audi | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | NaN |
14 | 1 | NaN | bmw | gas | std | four | sedan | rwd | front | 103.5 | 189.0 | 66.9 | 55.7 | 3055 | ohc | six | 164 | mpfi | 3.31 | 3.19 | 9.0 | 121 | 4250 | 20 | 25 | 24565 |
15 | 0 | NaN | bmw | gas | std | four | sedan | rwd | front | 103.5 | 189.0 | 66.9 | 55.7 | 3230 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 30760 |
16 | 0 | NaN | bmw | gas | std | two | sedan | rwd | front | 103.5 | 193.8 | 67.9 | 53.7 | 3380 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 41315 |
17 | 0 | NaN | bmw | gas | std | four | sedan | rwd | front | 110.0 | 197.0 | 70.9 | 56.3 | 3505 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 15 | 20 | 36880 |
43 | 0 | NaN | isuzu | gas | std | four | sedan | rwd | front | 94.3 | 170.7 | 61.8 | 53.5 | 2337 | ohc | four | 111 | 2bbl | 3.31 | 3.23 | 8.5 | 78 | 4800 | 24 | 29 | 6785 |
44 | 1 | NaN | isuzu | gas | std | two | sedan | fwd | front | 94.5 | 155.9 | 63.6 | 52.0 | 1874 | ohc | four | 90 | 2bbl | 3.03 | 3.11 | 9.6 | 70 | 5400 | 38 | 43 | NaN |
45 | 0 | NaN | isuzu | gas | std | four | sedan | fwd | front | 94.5 | 155.9 | 63.6 | 52.0 | 1909 | ohc | four | 90 | 2bbl | 3.03 | 3.11 | 9.6 | 70 | 5400 | 38 | 43 | NaN |
46 | 2 | NaN | isuzu | gas | std | two | hatchback | rwd | front | 96.0 | 172.6 | 65.2 | 51.4 | 2734 | ohc | four | 119 | spfi | 3.43 | 3.23 | 9.2 | 90 | 5000 | 24 | 29 | 11048 |
48 | 0 | NaN | jaguar | gas | std | four | sedan | rwd | front | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | dohc | six | 258 | mpfi | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 35550 |
49 | 0 | NaN | jaguar | gas | std | two | sedan | rwd | front | 102.0 | 191.7 | 70.6 | 47.8 | 3950 | ohcv | twelve | 326 | mpfi | 3.54 | 2.76 | 11.5 | 262 | 5000 | 13 | 17 | 36000 |
63 | 0 | NaN | mazda | diesel | std | NaN | sedan | fwd | front | 98.8 | 177.8 | 66.5 | 55.5 | 2443 | ohc | four | 122 | idi | 3.39 | 3.39 | 22.7 | 64 | 4650 | 36 | 42 | 10795 |
66 | 0 | NaN | mazda | diesel | std | four | sedan | rwd | front | 104.9 | 175.0 | 66.1 | 54.4 | 2700 | ohc | four | 134 | idi | 3.43 | 3.64 | 22.0 | 72 | 4200 | 31 | 39 | 18344 |
71 | -1 | NaN | mercedes-benz | gas | std | four | sedan | rwd | front | 115.6 | 202.6 | 71.7 | 56.5 | 3740 | ohcv | eight | 234 | mpfi | 3.46 | 3.10 | 8.3 | 155 | 4750 | 16 | 18 | 34184 |
73 | 0 | NaN | mercedes-benz | gas | std | four | sedan | rwd | front | 120.9 | 208.1 | 71.7 | 56.7 | 3900 | ohcv | eight | 308 | mpfi | 3.80 | 3.35 | 8.0 | 184 | 4500 | 14 | 16 | 40960 |
74 | 1 | NaN | mercedes-benz | gas | std | two | hardtop | rwd | front | 112.0 | 199.2 | 72.0 | 55.4 | 3715 | ohcv | eight | 304 | mpfi | 3.80 | 3.35 | 8.0 | 184 | 4500 | 14 | 16 | 45400 |
75 | 1 | NaN | mercury | gas | turbo | two | hatchback | rwd | front | 102.7 | 178.4 | 68.0 | 54.8 | 2910 | ohc | four | 140 | mpfi | 3.78 | 3.12 | 8.0 | 175 | 5000 | 19 | 24 | 16503 |
82 | 3 | NaN | mitsubishi | gas | turbo | two | hatchback | fwd | front | 95.9 | 173.2 | 66.3 | 50.2 | 2833 | ohc | four | 156 | spdi | 3.58 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 12629 |
83 | 3 | NaN | mitsubishi | gas | turbo | two | hatchback | fwd | front | 95.9 | 173.2 | 66.3 | 50.2 | 2921 | ohc | four | 156 | spdi | 3.59 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 14869 |
84 | 3 | NaN | mitsubishi | gas | turbo | two | hatchback | fwd | front | 95.9 | 173.2 | 66.3 | 50.2 | 2926 | ohc | four | 156 | spdi | 3.59 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 14489 |
109 | 0 | NaN | peugot | gas | std | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 58.7 | 3230 | l | four | 120 | mpfi | 3.46 | 3.19 | 8.4 | 97 | 5000 | 19 | 24 | 12440 |
110 | 0 | NaN | peugot | diesel | turbo | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 58.7 | 3430 | l | four | 152 | idi | 3.70 | 3.52 | 21.0 | 95 | 4150 | 25 | 25 | 13860 |
113 | 0 | NaN | peugot | gas | std | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 56.7 | 3285 | l | four | 120 | mpfi | 3.46 | 2.19 | 8.4 | 95 | 5000 | 19 | 24 | 16695 |
114 | 0 | NaN | peugot | diesel | turbo | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 58.7 | 3485 | l | four | 152 | idi | 3.70 | 3.52 | 21.0 | 95 | 4150 | 25 | 25 | 17075 |
124 | 3 | NaN | plymouth | gas | turbo | two | hatchback | rwd | front | 95.9 | 173.2 | 66.3 | 50.2 | 2818 | ohc | four | 156 | spdi | 3.59 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 12764 |
126 | 3 | NaN | porsche | gas | std | two | hardtop | rwd | rear | 89.5 | 168.9 | 65.0 | 51.6 | 2756 | ohcf | six | 194 | mpfi | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 32528 |
127 | 3 | NaN | porsche | gas | std | two | hardtop | rwd | rear | 89.5 | 168.9 | 65.0 | 51.6 | 2756 | ohcf | six | 194 | mpfi | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 34028 |
128 | 3 | NaN | porsche | gas | std | two | convertible | rwd | rear | 89.5 | 168.9 | 65.0 | 51.6 | 2800 | ohcf | six | 194 | mpfi | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 37028 |
129 | 1 | NaN | porsche | gas | std | two | hatchback | rwd | front | 98.4 | 175.7 | 72.3 | 50.5 | 3366 | dohcv | eight | 203 | mpfi | 3.94 | 3.11 | 10.0 | 288 | 5750 | 17 | 28 | NaN |
130 | 0 | NaN | renault | gas | std | four | wagon | fwd | front | 96.1 | 181.5 | 66.5 | 55.2 | 2579 | ohc | four | 132 | mpfi | 3.46 | 3.90 | 8.7 | NaN | NaN | 23 | 31 | 9295 |
131 | 2 | NaN | renault | gas | std | two | hatchback | fwd | front | 96.1 | 176.8 | 66.6 | 50.5 | 2460 | ohc | four | 132 | mpfi | 3.46 | 3.90 | 8.7 | NaN | NaN | 23 | 31 | 9895 |
181 | -1 | NaN | toyota | gas | std | four | wagon | rwd | front | 104.5 | 187.8 | 66.5 | 54.1 | 3151 | dohc | six | 161 | mpfi | 3.27 | 3.35 | 9.2 | 156 | 5200 | 19 | 24 | 15750 |
189 | 3 | NaN | volkswagen | gas | std | two | convertible | fwd | front | 94.5 | 159.3 | 64.2 | 55.6 | 2254 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 8.5 | 90 | 5500 | 24 | 29 | 11595 |
191 | 0 | NaN | volkswagen | gas | std | four | sedan | fwd | front | 100.4 | 180.2 | 66.9 | 55.1 | 2661 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 24 | 13295 |
192 | 0 | NaN | volkswagen | diesel | turbo | four | sedan | fwd | front | 100.4 | 180.2 | 66.9 | 55.1 | 2579 | ohc | four | 97 | idi | 3.01 | 3.40 | 23.0 | 68 | 4500 | 33 | 38 | 13845 |
193 | 0 | NaN | volkswagen | gas | std | four | wagon | fwd | front | 100.4 | 183.1 | 66.9 | 55.1 | 2563 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 9.0 | 88 | 5500 | 25 | 31 | 12290 |
We cannot simply drop all these rows, since we would lose a lot of valuable information for our algorithm's training. Let's first convert some object columns to numeric ones and then fill the missing values.
cars['bore'] = cars['bore'].astype(float)
cars['stroke'] = cars['stroke'].astype(float)
cars['horsepower'] = cars['horsepower'].astype(float)
cars['peak-rpm'] = pd.to_numeric(cars['peak-rpm'], errors='coerce', downcast='integer')
cars['normalized-losses'] = pd.to_numeric(cars['normalized-losses'], errors='coerce', downcast='integer')
cars['price'] = cars['price'].astype(float)
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  164 non-null    float64
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       203 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               201 non-null    float64
 19  stroke             201 non-null    float64
 20  compression-rate   205 non-null    float64
 21  horsepower         203 non-null    float64
 22  peak-rpm           203 non-null    float64
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              201 non-null    float64
dtypes: float64(11), int64(5), object(10)
memory usage: 41.8+ KB
cars.isnull().sum()
symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64
It is better to drop the rows with missing values in the price column: there are just four of them, and since price is the prediction target, imputing an average there would bias the results. Regarding the num-of-doors column, we can't impute an average either, since every value has to be an integer, either 2 or 4, and an average would be some float in between.
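If we preferred to keep those two rows instead of dropping them, filling num-of-doors with its most frequent value would respect the two-or-four constraint. A sketch of that alternative, which we don't apply here:
# Hypothetical alternative: impute the mode instead of dropping the rows
cars['num-of-doors'] = cars['num-of-doors'].fillna(cars['num-of-doors'].mode()[0])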
# Delete rows with missing values
cars.dropna(subset=['price', 'num-of-doors'], inplace=True)
# Fill missing values in the remaining numeric columns with the column mean
cars.fillna(cars.mean(numeric_only=True), inplace=True)  # numeric_only keeps this working on newer pandas versions
cars.isnull().sum()
symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
Now we need to convert the num-of-doors and num-of-cylinders columns to a numeric type. The simplest way is to define dictionaries and map them onto these columns.
cars['num-of-doors'].value_counts(dropna=False)
four    113
two      86
Name: num-of-doors, dtype: int64
cars['num-of-cylinders'].value_counts(dropna=False)
four      155
six        24
five       10
eight       4
two         4
three       1
twelve      1
Name: num-of-cylinders, dtype: int64
# Convert categorical columns into numeric
mapping_dict_doors = {
    'two': '2',
    'four': '4'
}
mapping_dict_cyl = {
    'two': '2',
    'three': '3',
    'four': '4',
    'five': '5',
    'six': '6',
    'eight': '8',
    'twelve': '12',
}
cars['num-of-doors'] = cars['num-of-doors'].map(mapping_dict_doors).astype(int)
cars['num-of-cylinders'] = cars['num-of-cylinders'].map(mapping_dict_cyl).astype(int)
cars['num-of-doors'].value_counts(dropna=False)
4    113
2     86
Name: num-of-doors, dtype: int64
cars['num-of-cylinders'].value_counts(dropna=False)
4     155
6      24
5      10
2       4
8       4
3       1
12      1
Name: num-of-cylinders, dtype: int64
# Create a dataset with only numeric columns from cars
cars_numeric = cars.select_dtypes(include=['float64', 'int64'])
cars_numeric.head()
| symboling | normalized-losses | num-of-doors | wheel-base | length | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 121.840491 | 2 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 4 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
1 | 3 | 121.840491 | 2 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 4 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
2 | 1 | 121.840491 | 2 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 6 | 152 | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
3 | 2 | 164.000000 | 4 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 4 | 109 | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
4 | 2 | 164.000000 | 4 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 5 | 136 | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
We have to normalize all columns to the range from 0 to 1, except the target column. The k-nearest neighbors algorithm relies on distances between feature vectors, so features with larger absolute scales would otherwise dominate the prediction. We will not use MinMaxScaler from sklearn.preprocessing for this particular task, since we want a pandas DataFrame in return rather than a NumPy array.
# Normalize all columns to range from 0 to 1 except the target column
price_col = cars_numeric['price']
cars_numeric = (cars_numeric - cars_numeric.min())/(cars_numeric.max() - cars_numeric.min())
cars_numeric['price'] = price_col
cars_numeric.head()
| symboling | normalized-losses | num-of-doors | wheel-base | length | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.297594 | 0.0 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.2 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 13495.0 |
1 | 1.0 | 0.297594 | 0.0 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.2 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 16500.0 |
2 | 0.6 | 0.297594 | 0.0 | 0.230321 | 0.449254 | 0.444444 | 0.383333 | 0.517843 | 0.4 | 0.343396 | 0.100000 | 0.666667 | 0.1250 | 0.495327 | 0.346939 | 0.166667 | 0.263158 | 16500.0 |
3 | 0.8 | 0.518325 | 1.0 | 0.384840 | 0.529851 | 0.504274 | 0.541667 | 0.329325 | 0.2 | 0.181132 | 0.464286 | 0.633333 | 0.1875 | 0.252336 | 0.551020 | 0.305556 | 0.368421 | 13950.0 |
4 | 0.8 | 0.518325 | 1.0 | 0.373178 | 0.529851 | 0.521368 | 0.541667 | 0.518231 | 0.3 | 0.283019 | 0.464286 | 0.633333 | 0.0625 | 0.313084 | 0.551020 | 0.138889 | 0.157895 | 17450.0 |
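For reference, the same scaling could be achieved with MinMaxScaler by wrapping its NumPy output back into a DataFrame. A minimal sketch, assuming the unscaled cars_numeric from before the normalization step:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
feature_cols = cars_numeric.columns.drop('price')
scaled = pd.DataFrame(scaler.fit_transform(cars_numeric[feature_cols]),
                      columns=feature_cols, index=cars_numeric.index)
scaled['price'] = cars_numeric['price']  # keep the target unscaled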
Now, let's define functions for predicting the car's price. We'll start with a univariate model and the k parameter fixed to 5.
from sklearn.model_selection import train_test_split
def knn_train_test(training_col, target_col, df, k=5):
    knn = KNeighborsRegressor(n_neighbors=k)
    # Divide the dataset
    train, test = train_test_split(df, test_size=0.25, random_state=1)
    # Fit the model and predict the outcomes
    knn.fit(train[[training_col]], train[target_col])
    predictions = knn.predict(test[[training_col]])
    # Accuracy
    mse = mean_squared_error(test[target_col], predictions)
    return np.sqrt(mse)
train_cols = cars_numeric.columns.drop('price')
for column in train_cols:
    print(f"The RMSE for {column} column is {knn_train_test(column, 'price', cars_numeric):.2f}")
The RMSE for symboling column is 8634.29
The RMSE for normalized-losses column is 6407.50
The RMSE for num-of-doors column is 9002.39
The RMSE for wheel-base column is 6063.64
The RMSE for length column is 5249.87
The RMSE for width column is 5107.75
The RMSE for height column is 8416.81
The RMSE for curb-weight column is 4815.19
The RMSE for num-of-cylinders column is 4551.73
The RMSE for engine-size column is 2488.25
The RMSE for bore column is 6997.09
The RMSE for stroke column is 7614.35
The RMSE for compression-rate column is 6148.07
The RMSE for horsepower column is 1938.98
The RMSE for peak-rpm column is 7433.45
The RMSE for city-mpg column is 4119.17
The RMSE for highway-mpg column is 4157.67
# Compute the RMSE for each column, then sort in ascending order
rmse_results = {}
for col in train_cols:
    rmse_val = knn_train_test(col, 'price', cars_numeric)
    rmse_results[col] = rmse_val
rmse_results
rmses = pd.Series(rmse_results)
rmses.sort_values()
horsepower           1938.978302
engine-size          2488.250189
city-mpg             4119.171196
highway-mpg          4157.670044
num-of-cylinders     4551.734937
curb-weight          4815.191305
width                5107.750375
length               5249.874563
wheel-base           6063.636722
compression-rate     6148.069944
normalized-losses    6407.498964
bore                 6997.094822
peak-rpm             7433.452480
stroke               7614.348266
height               8416.809826
symboling            8634.293963
num-of-doors         9002.391392
dtype: float64
Now, let's see the RMSE score for each feature while varying the k value. We need to modify our function first.
# Write a function to take different k values
def knn_train_test_k(training_col, target_col, df):
    # Divide the dataset
    train, test = train_test_split(df, test_size=0.25, random_state=1)
    k_values = [1, 3, 5, 7, 9]
    k_rmse = {}
    # Fit the model and predict the outcomes
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train[[training_col]], train[target_col])
        predictions = knn.predict(test[[training_col]])
        # Accuracy
        mse = mean_squared_error(test[target_col], predictions)
        k_rmse[k] = np.sqrt(mse)
    return k_rmse
rmse_results = {}
for col in train_cols:
    rmse_val = knn_train_test_k(col, 'price', cars_numeric)
    rmse_results[col] = rmse_val
rmse_results
{'symboling': {1: 8288.25263369789, 3: 8305.934701163982, 5: 8634.293962519461, 7: 8371.858930844402, 9: 8435.082080085398},
 'normalized-losses': {1: 8554.284874844887, 3: 5985.453731060551, 5: 6407.4989636831, 7: 6200.620020972886, 9: 6853.3640433632445},
 'num-of-doors': {1: 11837.155974303963, 3: 10792.378695686653, 5: 9002.391391891379, 7: 8473.477248160996, 9: 9060.208311834069},
 'wheel-base': {1: 2918.6992171171046, 3: 4651.839485622865, 5: 6063.6367219021295, 7: 6213.650480209205, 9: 6348.772902413486},
 'length': {1: 5187.118271256209, 3: 4783.065690770118, 5: 5249.874563377681, 7: 5593.228506955567, 9: 5903.658388022561},
 'width': {1: 4242.631461722783, 3: 4604.697718634742, 5: 5107.750374793192, 7: 5351.0667742850255, 9: 5570.581466225214},
 'height': {1: 11444.853880238052, 3: 8446.713568272837, 5: 8416.809826341569, 7: 8779.818509815914, 9: 8758.916856214248},
 'curb-weight': {1: 5992.175453038737, 3: 5053.441928692429, 5: 4815.191304693926, 7: 4699.675190426484, 9: 4569.049352186519},
 'num-of-cylinders': {1: 7762.639512433899, 3: 6568.338993653993, 5: 4551.734937361796, 7: 4977.937308048097, 9: 5085.9553468714585},
 'engine-size': {1: 3304.597246261638, 3: 2689.138964220497, 5: 2488.2501893499375, 7: 2940.8036887065755, 9: 3435.147852335822},
 'bore': {1: 6462.969052997237, 3: 6722.980876400323, 5: 6997.094821595603, 7: 7161.108966109216, 9: 7255.013790476211},
 'stroke': {1: 10209.457264712948, 3: 7775.179014016332, 5: 7614.348266476914, 7: 7689.89451744221, 9: 7848.738630372477},
 'compression-rate': {1: 7393.651555219518, 3: 6362.959106334795, 5: 6148.069944429715, 7: 6132.473911356606, 9: 6044.765077586546},
 'horsepower': {1: 3437.133200793941, 3: 3408.2626793002787, 5: 1938.978301889941, 7: 2301.780060505081, 9: 2859.3648759950297},
 'peak-rpm': {1: 5992.39433782524, 3: 6535.491600314224, 5: 7433.452479541387, 7: 7591.715532535369, 9: 7689.1649958453445},
 'city-mpg': {1: 4916.991643271321, 3: 4240.1692012885, 5: 4119.171196150993, 7: 4031.134467504407, 9: 3911.8713402559715},
 'highway-mpg': {1: 5681.289170602039, 3: 4967.1497165544215, 5: 4157.670044147323, 7: 4674.508578034963, 9: 4683.754499791912}}
# Define a function for plotting this kind of graph
def plot_k_graph(dictionary, Multivar=False, kfold=False):
    min_val = []
    plt.figure(figsize=(10, 8))
    if Multivar == False:
        for key in dictionary:
            plt.plot(dictionary[key].keys(), dictionary[key].values(), label=key)
            min_val.append(min(dictionary[key].values()))
        plt.legend(bbox_to_anchor=(1.05, 1))  # Place the legend just outside the axes
        plt.xticks(np.arange(1, 10))
        plt.yticks(np.arange(1500, 12001, 500))
    else:
        for key in dictionary:
            plt.plot(dictionary[key].keys(), dictionary[key].values(), label=(f"{key} best features"))
            min_val.append(min(dictionary[key].values()))
        plt.legend(loc='lower right')
        plt.xticks(np.arange(1, 26))
        plt.yticks(np.arange(1500, 5001, 500))
    plt.xlabel('k neighbors')
    plt.ylabel('RMSE')
    plt.suptitle(f'RMSE vs k neighbors for each model, the lowest RMSE = {np.min(min_val):.2f}', fontsize=16, y=0.95)
    if kfold == True:
        plt.title('With k-fold cross validation')
    else:
        plt.title('Without k-fold cross validation')
    plt.show()
plot_k_graph(rmse_results)
Now, let's test a multivariate model and see the difference in the best RMSE score.
# Write a function for multivariate model
def knn_train_test_multi(training_cols, target_col, df, k=5):
    knn = KNeighborsRegressor(n_neighbors=k)
    # Divide the dataset
    train, test = train_test_split(df, test_size=0.25, random_state=1)
    # Fit the model and predict the outcomes
    knn.fit(train[training_cols], train[target_col])  # Single brackets: training_cols is already a list of columns
    predictions = knn.predict(test[training_cols])
    # Accuracy
    mse = mean_squared_error(test[target_col], predictions)
    return np.sqrt(mse)
for features in range(2, 6):
    print(f"The RMSE for {features} best features is {knn_train_test_multi(rmses.sort_values().index[:features], 'price', cars_numeric):.2f}")
The RMSE for 2 best features is 3200.47
The RMSE for 3 best features is 2580.70
The RMSE for 4 best features is 2739.47
The RMSE for 5 best features is 3279.35
# Plot the results
plt.figure(figsize=(10, 8))
x = range(2, rmses.shape[0]+1)
y = []
for features in x:
    y.append(knn_train_test_multi(rmses.sort_values().index[:features], 'price', cars_numeric))
plt.plot(x, y)
plt.xlabel('Number of best features')
plt.ylabel('RMSE score')
plt.title('RMSE score vs number of best features with k=5', fontsize=16)
plt.xticks(np.arange(2, rmses.shape[0]+1))
plt.show()
Now, let's choose the three best multivariate models and perform hyperparameter tuning on them, varying the k value from 1 to 25.
models = pd.Series(y,x)
top_three_models = models.sort_values().head(3)
top_three_models
3    2580.704611
4    2739.472916
2    3200.467446
dtype: float64
# Write a function trying different k values for the multivariate model
def knn_train_test_multi_k(training_col, target_col, df):
    # Divide the dataset
    train, test = train_test_split(df, test_size=0.25, random_state=1)
    k_values = np.arange(1, 26)
    k_rmse = {}
    # Fit the model and predict the outcomes
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train[training_col], train[target_col])
        predictions = knn.predict(test[training_col])
        # Accuracy
        mse = mean_squared_error(test[target_col], predictions)
        k_rmse[k] = np.sqrt(mse)
    return k_rmse
rmse_results_multi = {}
for n_cols in top_three_models.index:
    rmse_val = knn_train_test_multi_k(rmses.sort_values().index[:n_cols], 'price', cars_numeric)
    rmse_results_multi[n_cols] = rmse_val
rmse_results_multi
{3: {1: 2366.837801793777, 2: 2742.075025414148, 3: 1958.3608412252438, 4: 2207.236688146516, 5: 2580.7046112253915, 6: 3015.2918615874573, 7: 3329.950428570175, 8: 3508.8582299392633, 9: 3784.83907817153, 10: 3881.661548847349, 11: 4045.2975382402424, 12: 4229.747234748325, 13: 4198.694406541947, 14: 4133.471306558551, 15: 4069.718254217497, 16: 4187.786262538135, 17: 4260.570686720242, 18: 4363.375258456589, 19: 4425.637643852176, 20: 4532.178023914109, 21: 4612.125610318032, 22: 4669.975677684552, 23: 4691.00330087382, 24: 4717.770623203265, 25: 4754.362109322343},
 4: {1: 2436.246112362214, 2: 2731.9911978262303, 3: 1943.710574019588, 4: 2272.8742849088685, 5: 2739.4729161647138, 6: 3179.6248496394105, 7: 3506.78029001661, 8: 3568.2953833027896, 9: 3743.8209482879265, 10: 3920.6434714725083, 11: 4043.474482412045, 12: 4276.825392222081, 13: 4437.612641163912, 14: 4523.469940493804, 15: 4467.32513264929, 16: 4589.837862553671, 17: 4683.462641337365, 18: 4820.557785108264, 19: 4884.041794811199, 20: 4949.438850657718, 21: 4854.25786757337, 22: 4757.023542745997, 23: 4747.55103993484, 24: 4728.563402639091, 25: 4736.136634822944},
 2: {1: 2976.5597121509254, 2: 3077.0307075165824, 3: 2431.1651399643297, 4: 2799.7221792974387, 5: 3200.467445983477, 6: 3344.363038839467, 7: 3444.157055841984, 8: 3668.323265854919, 9: 3881.0770168684103, 10: 4103.602890899654, 11: 4338.775550603234, 12: 4580.254625230396, 13: 4399.939501991298, 14: 4347.159296985486, 15: 4289.052871848412, 16: 4373.1228476127335, 17: 4448.38477387033, 18: 4514.628269488714, 19: 4587.550520847661, 20: 4709.713384458761, 21: 4743.039814626191, 22: 4787.719658249815, 23: 4750.56381066564, 24: 4750.233994565426, 25: 4741.452336907754}}
# Plot the results
plot_k_graph(rmse_results_multi, Multivar=True)
Wow! The RMSE score of the best univariate model is better than that of the best multivariate one. This is definitely a surprising result. However, a more reliable error evaluation can be obtained using k-fold cross validation.
We will use 4 folds for k-fold cross validation, so that 25% of the data is used for testing in each fold, consistent with our original models.
# Rewrite the function to evaluate different k values with k-fold cross validation
from sklearn.model_selection import cross_val_score, KFold
def knn_train_test_k_kfold(training_col, target_col, df):
    # Create folds
    kf = KFold(n_splits=4, shuffle=True, random_state=1)
    k_values = [1, 3, 5, 7, 9]
    k_rmse = {}
    # Fit the model and evaluate with cross validation
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        # Accuracy: average RMSE across the folds
        mses = cross_val_score(knn, df[[training_col]], df[target_col], scoring='neg_mean_squared_error', cv=kf)
        k_rmse[k] = np.mean(abs(mses) ** (1/2))
    return k_rmse
rmse_results_kfold = {}
for col in train_cols:
    rmse_val = knn_train_test_k_kfold(col, 'price', cars_numeric)
    rmse_results_kfold[col] = rmse_val
# Plot the results' comparison
plot_k_graph(rmse_results)
plot_k_graph(rmse_results_kfold, kfold=True)
We can clearly see that with k-fold cross validation the error is much bigger. This is because in the first case we relied on a single train/test split, which turned out to be overly optimistic and underestimated the RMSE.
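To illustrate this, here is a small sketch (assuming cars_numeric from above) that prints the per-fold RMSEs for the horsepower model; the spread between folds shows how misleading any single split can be:
# Per-fold RMSEs for the univariate horsepower model with k=5
kf = KFold(n_splits=4, shuffle=True, random_state=1)
knn = KNeighborsRegressor(n_neighbors=5)
mses = cross_val_score(knn, cars_numeric[['horsepower']], cars_numeric['price'],
                       scoring='neg_mean_squared_error', cv=kf)
fold_rmses = np.sqrt(-mses)  # one RMSE per fold
print(fold_rmses, fold_rmses.mean(), fold_rmses.std())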
# Write a function trying different k values for the multivariate model, with k-fold cross validation
def knn_train_test_multi_k_kfold(training_col, target_col, df):
    # Create folds
    kf = KFold(n_splits=4, shuffle=True, random_state=1)
    k_values = np.arange(1, 26)
    k_rmse = {}
    # Fit the model and evaluate with cross validation
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        # Accuracy: average RMSE across the folds
        mses = cross_val_score(knn, df[training_col], df[target_col], scoring='neg_mean_squared_error', cv=kf)
        k_rmse[k] = np.mean(abs(mses) ** (1/2))
    return k_rmse
rmse_results_multi_kfold = {}
for n_cols in top_three_models.index:
    rmse_val = knn_train_test_multi_k_kfold(rmses.sort_values().index[:n_cols], 'price', cars_numeric)
    rmse_results_multi_kfold[n_cols] = rmse_val
# Plot the results' comparison
plot_k_graph(rmse_results_multi, Multivar=True)
plot_k_graph(rmse_results_multi_kfold, Multivar=True, kfold=True)
We trained a k-nearest neighbors regressor to predict a car's price based on the available data. After several iterations, the following conclusions were reached.

Without k-fold cross validation:
- engine-size and horsepower are the two best features.
- The horsepower column with k=5 showed the best RMSE score overall (1938).

However, the conclusions changed after applying k-fold cross validation:
- The engine-size column with k=3 showed the best RMSE score among the univariate models, but its error is much bigger than that of the optimal multivariate model (3469 vs 2732).
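As a closing check, here is a minimal sketch (assuming cars_numeric and rmses from above) that re-fits the chosen model, the three best features with k=3, under 4-fold cross validation:
best_features = rmses.sort_values().index[:3]  # horsepower, engine-size, city-mpg
kf = KFold(n_splits=4, shuffle=True, random_state=1)
knn = KNeighborsRegressor(n_neighbors=3)
mses = cross_val_score(knn, cars_numeric[best_features], cars_numeric['price'],
                       scoring='neg_mean_squared_error', cv=kf)
print(f"Final model RMSE: {np.mean(np.sqrt(-mses)):.0f}")  # ~2732 per the results above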