In this course, we explored the fundamentals of machine learning using the k-nearest neighbors algorithm. In this guided project, you'll practice the machine learning workflow you've learned so far to predict a car's market price from its attributes. The dataset we will be working with contains information on various cars. For each car, we have technical specifications such as the engine's displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more. You can read more about the data set here and can download it directly from here.
In conclusion, the multivariate k-nearest neighbors regression model using the horsepower, engine-size, and city-mpg columns with k = 3 turned out to be the most accurate and reliable one.
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
pd.options.display.max_columns = 99
# Assign column names (the raw data file has no header row)
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
cars = pd.read_csv('imports-85.data', names=cols)
cars.head()
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  205 non-null    object
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       205 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               205 non-null    object
 19  stroke             205 non-null    object
 20  compression-rate   205 non-null    float64
 21  horsepower         205 non-null    object
 22  peak-rpm           205 non-null    object
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              205 non-null    object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
# Summarising the dataset using Pandas Profiling
profile = ProfileReport(cars, title="Pandas Profiling Report")
# Take a look at the summary (commented out to avoid regenerating the long report on subsequent kernel runs)
#profile
The dataset marks missing values with '?'. We need to deal with these before our machine learning algorithm can work with the data.
cars['normalized-losses'].value_counts().head()
?      41
161    11
91      8
150     7
134     6
Name: normalized-losses, dtype: int64
cars.replace('?', np.nan, inplace=True)
cars['normalized-losses'].value_counts(dropna=False).head()
NaN    41
161    11
91      8
150     7
104     6
Name: normalized-losses, dtype: int64
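As an aside, the same cleanup could be done at load time by telling pandas which marker denotes missing values. A hypothetical alternative to the replace call above, not used here:
# Alternative: treat '?' as NaN while reading the file
cars = pd.read_csv('imports-85.data', names=cols, na_values='?')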
# Explore columns with nulled normalized losses
cars[cars['normalized-losses'].isnull()]
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
5 | 2 | NaN | audi | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
7 | 1 | NaN | audi | gas | std | four | wagon | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920 |
9 | 0 | NaN | audi | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | NaN |
14 | 1 | NaN | bmw | gas | std | four | sedan | rwd | front | 103.5 | 189.0 | 66.9 | 55.7 | 3055 | ohc | six | 164 | mpfi | 3.31 | 3.19 | 9.0 | 121 | 4250 | 20 | 25 | 24565 |
15 | 0 | NaN | bmw | gas | std | four | sedan | rwd | front | 103.5 | 189.0 | 66.9 | 55.7 | 3230 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 30760 |
16 | 0 | NaN | bmw | gas | std | two | sedan | rwd | front | 103.5 | 193.8 | 67.9 | 53.7 | 3380 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 41315 |
17 | 0 | NaN | bmw | gas | std | four | sedan | rwd | front | 110.0 | 197.0 | 70.9 | 56.3 | 3505 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 15 | 20 | 36880 |
43 | 0 | NaN | isuzu | gas | std | four | sedan | rwd | front | 94.3 | 170.7 | 61.8 | 53.5 | 2337 | ohc | four | 111 | 2bbl | 3.31 | 3.23 | 8.5 | 78 | 4800 | 24 | 29 | 6785 |
44 | 1 | NaN | isuzu | gas | std | two | sedan | fwd | front | 94.5 | 155.9 | 63.6 | 52.0 | 1874 | ohc | four | 90 | 2bbl | 3.03 | 3.11 | 9.6 | 70 | 5400 | 38 | 43 | NaN |
45 | 0 | NaN | isuzu | gas | std | four | sedan | fwd | front | 94.5 | 155.9 | 63.6 | 52.0 | 1909 | ohc | four | 90 | 2bbl | 3.03 | 3.11 | 9.6 | 70 | 5400 | 38 | 43 | NaN |
46 | 2 | NaN | isuzu | gas | std | two | hatchback | rwd | front | 96.0 | 172.6 | 65.2 | 51.4 | 2734 | ohc | four | 119 | spfi | 3.43 | 3.23 | 9.2 | 90 | 5000 | 24 | 29 | 11048 |
48 | 0 | NaN | jaguar | gas | std | four | sedan | rwd | front | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | dohc | six | 258 | mpfi | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 35550 |
49 | 0 | NaN | jaguar | gas | std | two | sedan | rwd | front | 102.0 | 191.7 | 70.6 | 47.8 | 3950 | ohcv | twelve | 326 | mpfi | 3.54 | 2.76 | 11.5 | 262 | 5000 | 13 | 17 | 36000 |
63 | 0 | NaN | mazda | diesel | std | NaN | sedan | fwd | front | 98.8 | 177.8 | 66.5 | 55.5 | 2443 | ohc | four | 122 | idi | 3.39 | 3.39 | 22.7 | 64 | 4650 | 36 | 42 | 10795 |
66 | 0 | NaN | mazda | diesel | std | four | sedan | rwd | front | 104.9 | 175.0 | 66.1 | 54.4 | 2700 | ohc | four | 134 | idi | 3.43 | 3.64 | 22.0 | 72 | 4200 | 31 | 39 | 18344 |
71 | -1 | NaN | mercedes-benz | gas | std | four | sedan | rwd | front | 115.6 | 202.6 | 71.7 | 56.5 | 3740 | ohcv | eight | 234 | mpfi | 3.46 | 3.10 | 8.3 | 155 | 4750 | 16 | 18 | 34184 |
73 | 0 | NaN | mercedes-benz | gas | std | four | sedan | rwd | front | 120.9 | 208.1 | 71.7 | 56.7 | 3900 | ohcv | eight | 308 | mpfi | 3.80 | 3.35 | 8.0 | 184 | 4500 | 14 | 16 | 40960 |
74 | 1 | NaN | mercedes-benz | gas | std | two | hardtop | rwd | front | 112.0 | 199.2 | 72.0 | 55.4 | 3715 | ohcv | eight | 304 | mpfi | 3.80 | 3.35 | 8.0 | 184 | 4500 | 14 | 16 | 45400 |
75 | 1 | NaN | mercury | gas | turbo | two | hatchback | rwd | front | 102.7 | 178.4 | 68.0 | 54.8 | 2910 | ohc | four | 140 | mpfi | 3.78 | 3.12 | 8.0 | 175 | 5000 | 19 | 24 | 16503 |
82 | 3 | NaN | mitsubishi | gas | turbo | two | hatchback | fwd | front | 95.9 | 173.2 | 66.3 | 50.2 | 2833 | ohc | four | 156 | spdi | 3.58 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 12629 |
83 | 3 | NaN | mitsubishi | gas | turbo | two | hatchback | fwd | front | 95.9 | 173.2 | 66.3 | 50.2 | 2921 | ohc | four | 156 | spdi | 3.59 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 14869 |
84 | 3 | NaN | mitsubishi | gas | turbo | two | hatchback | fwd | front | 95.9 | 173.2 | 66.3 | 50.2 | 2926 | ohc | four | 156 | spdi | 3.59 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 14489 |
109 | 0 | NaN | peugot | gas | std | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 58.7 | 3230 | l | four | 120 | mpfi | 3.46 | 3.19 | 8.4 | 97 | 5000 | 19 | 24 | 12440 |
110 | 0 | NaN | peugot | diesel | turbo | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 58.7 | 3430 | l | four | 152 | idi | 3.70 | 3.52 | 21.0 | 95 | 4150 | 25 | 25 | 13860 |
113 | 0 | NaN | peugot | gas | std | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 56.7 | 3285 | l | four | 120 | mpfi | 3.46 | 2.19 | 8.4 | 95 | 5000 | 19 | 24 | 16695 |
114 | 0 | NaN | peugot | diesel | turbo | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 58.7 | 3485 | l | four | 152 | idi | 3.70 | 3.52 | 21.0 | 95 | 4150 | 25 | 25 | 17075 |
124 | 3 | NaN | plymouth | gas | turbo | two | hatchback | rwd | front | 95.9 | 173.2 | 66.3 | 50.2 | 2818 | ohc | four | 156 | spdi | 3.59 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 12764 |
126 | 3 | NaN | porsche | gas | std | two | hardtop | rwd | rear | 89.5 | 168.9 | 65.0 | 51.6 | 2756 | ohcf | six | 194 | mpfi | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 32528 |
127 | 3 | NaN | porsche | gas | std | two | hardtop | rwd | rear | 89.5 | 168.9 | 65.0 | 51.6 | 2756 | ohcf | six | 194 | mpfi | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 34028 |
128 | 3 | NaN | porsche | gas | std | two | convertible | rwd | rear | 89.5 | 168.9 | 65.0 | 51.6 | 2800 | ohcf | six | 194 | mpfi | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 37028 |
129 | 1 | NaN | porsche | gas | std | two | hatchback | rwd | front | 98.4 | 175.7 | 72.3 | 50.5 | 3366 | dohcv | eight | 203 | mpfi | 3.94 | 3.11 | 10.0 | 288 | 5750 | 17 | 28 | NaN |
130 | 0 | NaN | renault | gas | std | four | wagon | fwd | front | 96.1 | 181.5 | 66.5 | 55.2 | 2579 | ohc | four | 132 | mpfi | 3.46 | 3.90 | 8.7 | NaN | NaN | 23 | 31 | 9295 |
131 | 2 | NaN | renault | gas | std | two | hatchback | fwd | front | 96.1 | 176.8 | 66.6 | 50.5 | 2460 | ohc | four | 132 | mpfi | 3.46 | 3.90 | 8.7 | NaN | NaN | 23 | 31 | 9895 |
181 | -1 | NaN | toyota | gas | std | four | wagon | rwd | front | 104.5 | 187.8 | 66.5 | 54.1 | 3151 | dohc | six | 161 | mpfi | 3.27 | 3.35 | 9.2 | 156 | 5200 | 19 | 24 | 15750 |
189 | 3 | NaN | volkswagen | gas | std | two | convertible | fwd | front | 94.5 | 159.3 | 64.2 | 55.6 | 2254 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 8.5 | 90 | 5500 | 24 | 29 | 11595 |
191 | 0 | NaN | volkswagen | gas | std | four | sedan | fwd | front | 100.4 | 180.2 | 66.9 | 55.1 | 2661 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 24 | 13295 |
192 | 0 | NaN | volkswagen | diesel | turbo | four | sedan | fwd | front | 100.4 | 180.2 | 66.9 | 55.1 | 2579 | ohc | four | 97 | idi | 3.01 | 3.40 | 23.0 | 68 | 4500 | 33 | 38 | 13845 |
193 | 0 | NaN | volkswagen | gas | std | four | wagon | fwd | front | 100.4 | 183.1 | 66.9 | 55.1 | 2563 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 9.0 | 88 | 5500 | 25 | 31 | 12290 |
We cannot simply drop all these rows, since we would lose a lot of valuable information for our algorithm's training. Let's first convert some object columns to numeric ones and then fill the missing values.
cars['bore'] = cars['bore'].astype(float)
cars['stroke'] = cars['stroke'].astype(float)
cars['horsepower'] = cars['horsepower'].astype(float)
cars['peak-rpm'] = pd.to_numeric(cars['peak-rpm'], errors='coerce', downcast='integer')
cars['normalized-losses'] = pd.to_numeric(cars['normalized-losses'], errors='coerce', downcast='integer')
cars['price'] = cars['price'].astype(float)
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  164 non-null    float64
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       203 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               201 non-null    float64
 19  stroke             201 non-null    float64
 20  compression-rate   205 non-null    float64
 21  horsepower         203 non-null    float64
 22  peak-rpm           203 non-null    float64
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              201 non-null    float64
dtypes: float64(11), int64(5), object(10)
memory usage: 41.8+ KB
cars.isnull().sum()
symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64
It is better to drop the rows with missing values in the price column: there are just four of them, and since price is the prediction target, imputing an average there would bias the results. Regarding the num-of-doors column, we can't impute an average either, since every value has to be an integer, either 2 or 4, and an average would be some float in between.
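If we preferred to keep those two rows instead of dropping them, filling num-of-doors with its most frequent value would respect the two-or-four constraint. A sketch of that alternative, which we don't apply here:
# Hypothetical alternative: impute the mode instead of dropping the rows
cars['num-of-doors'] = cars['num-of-doors'].fillna(cars['num-of-doors'].mode()[0])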
# Delete rows with missing values
cars.dropna(subset=['price', 'num-of-doors'], inplace=True)
# Fill missing values in the remaining numeric columns with the column mean
cars.fillna(cars.mean(numeric_only=True), inplace=True)  # numeric_only keeps this working on newer pandas versions
cars.isnull().sum()
symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
Now we need to convert the num-of-doors and num-of-cylinders columns to a numeric type. The simplest way is to define dictionaries and map them onto these columns.
cars['num-of-doors'].value_counts(dropna=False)
four    113
two      86
Name: num-of-doors, dtype: int64
cars['num-of-cylinders'].value_counts(dropna=False)
four      155
six        24
five       10
eight       4
two         4
three       1
twelve      1
Name: num-of-cylinders, dtype: int64
# Convert categorical columns into numeric
mapping_dict_doors = {
    'two': '2',
    'four': '4'
}
mapping_dict_cyl = {
    'two': '2',
    'three': '3',
    'four': '4',
    'five': '5',
    'six': '6',
    'eight': '8',
    'twelve': '12',
}
cars['num-of-doors'] = cars['num-of-doors'].map(mapping_dict_doors).astype(int)
cars['num-of-cylinders'] = cars['num-of-cylinders'].map(mapping_dict_cyl).astype(int)
cars['num-of-doors'].value_counts(dropna=False)
4    113
2     86
Name: num-of-doors, dtype: int64
cars['num-of-cylinders'].value_counts(dropna=False)
4     155
6      24
5      10
2       4
8       4
3       1
12      1
Name: num-of-cylinders, dtype: int64
# Create a dataset with only numeric columns from cars
cars_numeric = cars.select_dtypes(include=['float64', 'int64'])
cars_numeric.head()
| symboling | normalized-losses | num-of-doors | wheel-base | length | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 121.840491 | 2 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 4 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
1 | 3 | 121.840491 | 2 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 4 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
2 | 1 | 121.840491 | 2 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 6 | 152 | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
3 | 2 | 164.000000 | 4 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 4 | 109 | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
4 | 2 | 164.000000 | 4 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 5 | 136 | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
We have to normalize all columns to the range from 0 to 1, except the target column. The k-nearest neighbors algorithm relies on distances between feature vectors, so features with larger absolute scales would otherwise dominate the prediction. We will not use MinMaxScaler from sklearn.preprocessing for this particular task, since we want a pandas DataFrame in return rather than a NumPy array.
# Normalize all columns to range from 0 to 1 except the target column
price_col = cars_numeric['price']
cars_numeric = (cars_numeric - cars_numeric.min())/(cars_numeric.max() - cars_numeric.min())
cars_numeric['price'] = price_col
cars_numeric.head()
| symboling | normalized-losses | num-of-doors | wheel-base | length | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.297594 | 0.0 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.2 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 13495.0 |
1 | 1.0 | 0.297594 | 0.0 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.2 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 16500.0 |
2 | 0.6 | 0.297594 | 0.0 | 0.230321 | 0.449254 | 0.444444 | 0.383333 | 0.517843 | 0.4 | 0.343396 | 0.100000 | 0.666667 | 0.1250 | 0.495327 | 0.346939 | 0.166667 | 0.263158 | 16500.0 |
3 | 0.8 | 0.518325 | 1.0 | 0.384840 | 0.529851 | 0.504274 | 0.541667 | 0.329325 | 0.2 | 0.181132 | 0.464286 | 0.633333 | 0.1875 | 0.252336 | 0.551020 | 0.305556 | 0.368421 | 13950.0 |
4 | 0.8 | 0.518325 | 1.0 | 0.373178 | 0.529851 | 0.521368 | 0.541667 | 0.518231 | 0.3 | 0.283019 | 0.464286 | 0.633333 | 0.0625 | 0.313084 | 0.551020 | 0.138889 | 0.157895 | 17450.0 |
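For reference, the same scaling could be achieved with MinMaxScaler by wrapping its NumPy output back into a DataFrame. A minimal sketch, assuming the unscaled cars_numeric from before the normalization step:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
feature_cols = cars_numeric.columns.drop('price')
scaled = pd.DataFrame(scaler.fit_transform(cars_numeric[feature_cols]),
                      columns=feature_cols, index=cars_numeric.index)
scaled['price'] = cars_numeric['price']  # keep the target unscaled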
Now, let's define functions for predicting the car's price. We'll start with a univariate model and the k parameter fixed to 5.
from sklearn.model_selection import train_test_split
def knn_train_test(training_col, target_col, df, k=5):
    knn = KNeighborsRegressor(n_neighbors=k)
    # Divide the dataset
    train, test = train_test_split(df, test_size=0.25, random_state=1)
    # Fit the model and predict the outcomes
    knn.fit(train[[training_col]], train[target_col])
    predictions = knn.predict(test[[training_col]])
    # Accuracy
    mse = mean_squared_error(test[target_col], predictions)
    return np.sqrt(mse)
train_cols = cars_numeric.columns.drop('price')
for column in train_cols:
    print(f"The RMSE for {column} column is {knn_train_test(column, 'price', cars_numeric):.2f}")
The RMSE for symboling column is 8634.29
The RMSE for normalized-losses column is 6407.50
The RMSE for num-of-doors column is 9002.39
The RMSE for wheel-base column is 6063.64
The RMSE for length column is 5249.87
The RMSE for width column is 5107.75
The RMSE for height column is 8416.81
The RMSE for curb-weight column is 4815.19
The RMSE for num-of-cylinders column is 4551.73
The RMSE for engine-size column is 2488.25
The RMSE for bore column is 6997.09
The RMSE for stroke column is 7614.35
The RMSE for compression-rate column is 6148.07
The RMSE for horsepower column is 1938.98
The RMSE for peak-rpm column is 7433.45
The RMSE for city-mpg column is 4119.17
The RMSE for highway-mpg column is 4157.67
# Compute the RMSE for each column, then sort in ascending order
rmse_results = {}
for col in train_cols:
    rmse_val = knn_train_test(col, 'price', cars_numeric)
    rmse_results[col] = rmse_val
rmse_results
rmses = pd.Series(rmse_results)
rmses.sort_values()
horsepower           1938.978302
engine-size          2488.250189
city-mpg             4119.171196
highway-mpg          4157.670044
num-of-cylinders     4551.734937
curb-weight          4815.191305
width                5107.750375
length               5249.874563
wheel-base           6063.636722
compression-rate     6148.069944
normalized-losses    6407.498964
bore                 6997.094822
peak-rpm             7433.452480
stroke               7614.348266
height               8416.809826
symboling            8634.293963
num-of-doors         9002.391392
dtype: float64
Now, let's see the RMSE score for each feature while varying the k value. We need to modify our function first.
# Write a function to take different k values
def knn_train_test_k(training_col, target_col, df):
    # Divide the dataset
    train, test = train_test_split(df, test_size=0.25, random_state=1)
    k_values = [1, 3, 5, 7, 9]
    k_rmse = {}
    # Fit the model and predict the outcomes
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train[[training_col]], train[target_col])
        predictions = knn.predict(test[[training_col]])
        # Accuracy
        mse = mean_squared_error(test[target_col], predictions)
        k_rmse[k] = np.sqrt(mse)
    return k_rmse
rmse_results = {}
for col in train_cols:
    rmse_val = knn_train_test_k(col, 'price', cars_numeric)
    rmse_results[col] = rmse_val
rmse_results
{'symboling': {1: 8288.25263369789, 3: 8305.934701163982, 5: 8634.293962519461, 7: 8371.858930844402, 9: 8435.082080085398},
 'normalized-losses': {1: 8554.284874844887, 3: 5985.453731060551, 5: 6407.4989636831, 7: 6200.620020972886, 9: 6853.3640433632445},
 'num-of-doors': {1: 11837.155974303963, 3: 10792.378695686653, 5: 9002.391391891379, 7: 8473.477248160996, 9: 9060.208311834069},
 'wheel-base': {1: 2918.6992171171046, 3: 4651.839485622865, 5: 6063.6367219021295, 7: 6213.650480209205, 9: 6348.772902413486},
 'length': {1: 5187.118271256209, 3: 4783.065690770118, 5: 5249.874563377681, 7: 5593.228506955567, 9: 5903.658388022561},
 'width': {1: 4242.631461722783, 3: 4604.697718634742, 5: 5107.750374793192, 7: 5351.0667742850255, 9: 5570.581466225214},
 'height': {1: 11444.853880238052, 3: 8446.713568272837, 5: 8416.809826341569, 7: 8779.818509815914, 9: 8758.916856214248},
 'curb-weight': {1: 5992.175453038737, 3: 5053.441928692429, 5: 4815.191304693926, 7: 4699.675190426484, 9: 4569.049352186519},
 'num-of-cylinders': {1: 7762.639512433899, 3: 6568.338993653993, 5: 4551.734937361796, 7: 4977.937308048097, 9: 5085.9553468714585},
 'engine-size': {1: 3304.597246261638, 3: 2689.138964220497, 5: 2488.2501893499375, 7: 2940.8036887065755, 9: 3435.147852335822},
 'bore': {1: 6462.969052997237, 3: 6722.980876400323, 5: 6997.094821595603, 7: 7161.108966109216, 9: 7255.013790476211},
 'stroke': {1: 10209.457264712948, 3: 7775.179014016332, 5: 7614.348266476914, 7: 7689.89451744221, 9: 7848.738630372477},
 'compression-rate': {1: 7393.651555219518, 3: 6362.959106334795, 5: 6148.069944429715, 7: 6132.473911356606, 9: 6044.765077586546},
 'horsepower': {1: 3437.133200793941, 3: 3408.2626793002787, 5: 1938.978301889941, 7: 2301.780060505081, 9: 2859.3648759950297},
 'peak-rpm': {1: 5992.39433782524, 3: 6535.491600314224, 5: 7433.452479541387, 7: 7591.715532535369, 9: 7689.1649958453445},
 'city-mpg': {1: 4916.991643271321, 3: 4240.1692012885, 5: 4119.171196150993, 7: 4031.134467504407, 9: 3911.8713402559715},
 'highway-mpg': {1: 5681.289170602039, 3: 4967.1497165544215, 5: 4157.670044147323, 7: 4674.508578034963, 9: 4683.754499791912}}
# Define a function for plotting this kind of graph
def plot_k_graph(dictionary, Multivar=False, kfold=False):
    min_val = []
    plt.figure(figsize=(10, 8))
    if Multivar == False:
        for key in dictionary:
            plt.plot(dictionary[key].keys(), dictionary[key].values(), label=key)
            min_val.append(min(dictionary[key].values()))
        plt.legend(bbox_to_anchor=(1.05, 1))  # Place the legend just outside the axes
        plt.xticks(np.arange(1, 10))
        plt.yticks(np.arange(1500, 12001, 500))
    else:
        for key in dictionary:
            plt.plot(dictionary[key].keys(), dictionary[key].values(), label=(f"{key} best features"))
            min_val.append(min(dictionary[key].values()))
        plt.legend(loc='lower right')
        plt.xticks(np.arange(1, 26))
        plt.yticks(np.arange(1500, 5001, 500))
    plt.xlabel('k neighbors')
    plt.ylabel('RMSE')
    plt.suptitle(f'RMSE vs k neighbors for each model, the lowest RMSE = {np.min(min_val):.2f}', fontsize=16, y=0.95)
    if kfold == True:
        plt.title('With k-fold cross validation')
    else:
        plt.title('Without k-fold cross validation')
    plt.show()
plot_k_graph(rmse_results)
Now, let's test a multivariate model and see the difference in the best RMSE score.
# Write a function for multivariate model
def knn_train_test_multi(training_cols, target_col, df, k=5):
    knn = KNeighborsRegressor(n_neighbors=k)
    # Divide the dataset
    train, test = train_test_split(df, test_size=0.25, random_state=1)
    # Fit the model and predict the outcomes
    knn.fit(train[training_cols], train[target_col])  # Single brackets: training_cols is already a list of columns
    predictions = knn.predict(test[training_cols])
    # Accuracy
    mse = mean_squared_error(test[target_col], predictions)
    return np.sqrt(mse)
for features in range(2, 6):
    print(f"The RMSE for {features} best features is {knn_train_test_multi(rmses.sort_values().index[:features], 'price', cars_numeric):.2f}")
The RMSE for 2 best features is 3200.47
The RMSE for 3 best features is 2580.70
The RMSE for 4 best features is 2739.47
The RMSE for 5 best features is 3279.35
# Plot the results
plt.figure(figsize=(10, 8))
x = range(2, rmses.shape[0]+1)
y = []
for features in x:
    y.append(knn_train_test_multi(rmses.sort_values().index[:features], 'price', cars_numeric))
plt.plot(x, y)
plt.xlabel('Number of best features')
plt.ylabel('RMSE score')
plt.title('RMSE score vs number of best features with k=5', fontsize=16)
plt.xticks(np.arange(2, rmses.shape[0]+1))
plt.show()
Now, let's choose the three best multivariate models and perform hyperparameter tuning on them, varying the k value from 1 to 25.
models = pd.Series(y,x)
top_three_models = models.sort_values().head(3)
top_three_models
3    2580.704611
4    2739.472916
2    3200.467446
dtype: float64
# Write a function trying different k values for the multivariate model
def knn_train_test_multi_k(training_col, target_col, df):
    # Divide the dataset
    train, test = train_test_split(df, test_size=0.25, random_state=1)
    k_values = np.arange(1, 26)
    k_rmse = {}
    # Fit the model and predict the outcomes
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train[training_col], train[target_col])
        predictions = knn.predict(test[training_col])
        # Accuracy
        mse = mean_squared_error(test[target_col], predictions)
        k_rmse[k] = np.sqrt(mse)
    return k_rmse
rmse_results_multi = {}
for n_cols in top_three_models.index:
    rmse_val = knn_train_test_multi_k(rmses.sort_values().index[:n_cols], 'price', cars_numeric)
    rmse_results_multi[n_cols] = rmse_val
rmse_results_multi
{3: {1: 2366.837801793777, 2: 2742.075025414148, 3: 1958.3608412252438, 4: 2207.236688146516, 5: 2580.7046112253915, 6: 3015.2918615874573, 7: 3329.950428570175, 8: 3508.8582299392633, 9: 3784.83907817153, 10: 3881.661548847349, 11: 4045.2975382402424, 12: 4229.747234748325, 13: 4198.694406541947, 14: 4133.471306558551, 15: 4069.718254217497, 16: 4187.786262538135, 17: 4260.570686720242, 18: 4363.375258456589, 19: 4425.637643852176, 20: 4532.178023914109, 21: 4612.125610318032, 22: 4669.975677684552, 23: 4691.00330087382, 24: 4717.770623203265, 25: 4754.362109322343},
 4: {1: 2436.246112362214, 2: 2731.9911978262303, 3: 1943.710574019588, 4: 2272.8742849088685, 5: 2739.4729161647138, 6: 3179.6248496394105, 7: 3506.78029001661, 8: 3568.2953833027896, 9: 3743.8209482879265, 10: 3920.6434714725083, 11: 4043.474482412045, 12: 4276.825392222081, 13: 4437.612641163912, 14: 4523.469940493804, 15: 4467.32513264929, 16: 4589.837862553671, 17: 4683.462641337365, 18: 4820.557785108264, 19: 4884.041794811199, 20: 4949.438850657718, 21: 4854.25786757337, 22: 4757.023542745997, 23: 4747.55103993484, 24: 4728.563402639091, 25: 4736.136634822944},
 2: {1: 2976.5597121509254, 2: 3077.0307075165824, 3: 2431.1651399643297, 4: 2799.7221792974387, 5: 3200.467445983477, 6: 3344.363038839467, 7: 3444.157055841984, 8: 3668.323265854919, 9: 3881.0770168684103, 10: 4103.602890899654, 11: 4338.775550603234, 12: 4580.254625230396, 13: 4399.939501991298, 14: 4347.159296985486, 15: 4289.052871848412, 16: 4373.1228476127335, 17: 4448.38477387033, 18: 4514.628269488714, 19: 4587.550520847661, 20: 4709.713384458761, 21: 4743.039814626191, 22: 4787.719658249815, 23: 4750.56381066564, 24: 4750.233994565426, 25: 4741.452336907754}}
# Plot the results
plot_k_graph(rmse_results_multi, Multivar=True)
Wow! The RMSE score of the best univariate model is better than that of the best multivariate one. This is definitely a surprising result. However, a more reliable error evaluation can be obtained using k-fold cross validation.
We will use 4 folds for k-fold cross validation, so that 25% of the data is used for testing in each fold, consistent with our original models.
# Rewrite the function to evaluate different k values with k-fold cross validation
from sklearn.model_selection import cross_val_score, KFold
def knn_train_test_k_kfold(training_col, target_col, df):
    # Create folds
    kf = KFold(n_splits=4, shuffle=True, random_state=1)
    k_values = [1, 3, 5, 7, 9]
    k_rmse = {}
    # Fit the model and evaluate with cross validation
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        # Accuracy: average RMSE across the folds
        mses = cross_val_score(knn, df[[training_col]], df[target_col], scoring='neg_mean_squared_error', cv=kf)
        k_rmse[k] = np.mean(abs(mses) ** (1/2))
    return k_rmse
rmse_results_kfold = {}
for col in train_cols:
    rmse_val = knn_train_test_k_kfold(col, 'price', cars_numeric)
    rmse_results_kfold[col] = rmse_val
# Plot the results' comparison
plot_k_graph(rmse_results)
plot_k_graph(rmse_results_kfold, kfold=True)
We can clearly see that with k-fold cross validation the error is much bigger. This is because in the first case we relied on a single train/test split, which turned out to be overly optimistic and underestimated the RMSE.
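To illustrate this, here is a small sketch (assuming cars_numeric from above) that prints the per-fold RMSEs for the horsepower model; the spread between folds shows how misleading any single split can be:
# Per-fold RMSEs for the univariate horsepower model with k=5
kf = KFold(n_splits=4, shuffle=True, random_state=1)
knn = KNeighborsRegressor(n_neighbors=5)
mses = cross_val_score(knn, cars_numeric[['horsepower']], cars_numeric['price'],
                       scoring='neg_mean_squared_error', cv=kf)
fold_rmses = np.sqrt(-mses)  # one RMSE per fold
print(fold_rmses, fold_rmses.mean(), fold_rmses.std())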
# Write a function trying different k values for the multivariate model, with k-fold cross validation
def knn_train_test_multi_k_kfold(training_col, target_col, df):
    # Create folds
    kf = KFold(n_splits=4, shuffle=True, random_state=1)
    k_values = np.arange(1, 26)
    k_rmse = {}
    # Fit the model and evaluate with cross validation
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        # Accuracy: average RMSE across the folds
        mses = cross_val_score(knn, df[training_col], df[target_col], scoring='neg_mean_squared_error', cv=kf)
        k_rmse[k] = np.mean(abs(mses) ** (1/2))
    return k_rmse
rmse_results_multi_kfold = {}
for n_cols in top_three_models.index:
    rmse_val = knn_train_test_multi_k_kfold(rmses.sort_values().index[:n_cols], 'price', cars_numeric)
    rmse_results_multi_kfold[n_cols] = rmse_val
# Plot the results' comparison
plot_k_graph(rmse_results_multi, Multivar=True)
plot_k_graph(rmse_results_multi_kfold, Multivar=True, kfold=True)
We trained a k-nearest neighbors regressor to predict a car's price based on the available data. After several iterations, the following conclusions were reached.

Without k-fold cross validation:
- engine-size and horsepower are the two best features.
- The horsepower column with k=5 showed the best RMSE score overall (1938).

However, the conclusions changed after applying k-fold cross validation:
- The engine-size column with k=3 showed the best RMSE score among the univariate models, but its error is much bigger than that of the optimal multivariate model (3469 vs 2732).
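As a closing check, here is a minimal sketch (assuming cars_numeric and rmses from above) that re-fits the chosen model, the three best features with k=3, under 4-fold cross validation:
best_features = rmses.sort_values().index[:3]  # horsepower, engine-size, city-mpg
kf = KFold(n_splits=4, shuffle=True, random_state=1)
knn = KNeighborsRegressor(n_neighbors=3)
mses = cross_val_score(knn, cars_numeric[best_features], cars_numeric['price'],
                       scoring='neg_mean_squared_error', cv=kf)
print(f"Final model RMSE: {np.mean(np.sqrt(-mses)):.0f}")  # ~2732 per the results above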