Project: Predicting car prices using K nearest neighbours algorithm based on
1980's data
Data Source: https://archive.ics.uci.edu/ml/datasets/automobile
Variables:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from itertools import combinations
%matplotlib inline
# Parameters for project
rd_state = 1 # random seed number - used for results reproductibility
dataset_columns = ['symboling', 'normalized_losses', 'brand', 'fuel_type',
'aspirations', 'num_of_doors', 'body_style', 'drive_wheels',
'engine_location', 'wheel_base', 'length', 'width', 'height',
'curb_weight', 'engine_type', 'num_of_cylinders',
'engine_size', 'fuel_system', 'bore', 'stroke',
'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
'highway_mpg', 'price']
car_dataset = pd.read_csv('##\\GP22_PredictingCarPrices\\imports-85.data',
names = dataset_columns)
car_dataset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 205 entries, 0 to 204 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 symboling 205 non-null int64 1 normalized_losses 205 non-null object 2 brand 205 non-null object 3 fuel_type 205 non-null object 4 aspirations 205 non-null object 5 num_of_doors 205 non-null object 6 body_style 205 non-null object 7 drive_wheels 205 non-null object 8 engine_location 205 non-null object 9 wheel_base 205 non-null float64 10 length 205 non-null float64 11 width 205 non-null float64 12 height 205 non-null float64 13 curb_weight 205 non-null int64 14 engine_type 205 non-null object 15 num_of_cylinders 205 non-null object 16 engine_size 205 non-null int64 17 fuel_system 205 non-null object 18 bore 205 non-null object 19 stroke 205 non-null object 20 compression_ratio 205 non-null float64 21 horsepower 205 non-null object 22 peak_rpm 205 non-null object 23 city_mpg 205 non-null int64 24 highway_mpg 205 non-null int64 25 price 205 non-null object dtypes: float64(5), int64(5), object(16) memory usage: 41.8+ KB
car_dataset.head()
symboling | normalized_losses | brand | fuel_type | aspirations | num_of_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
Target variable: price
Variables to drop: symboling, normalized_losses (actuarial variables)
Numeric explanatory variables: wheel_base, engine_size, bore, stroke,
compression_ratio, horsepowerk, peak_rpm, city_mpg, highway_mpg
Categoric explanatory variables: brand, fuel_type, aspirations,
num_of_doors, body_style, drive_wheels, engine_location, fuel_system
For analytical purposes I will want to create dummy variables out of categorical
ones to explore their predictive power.
#Dataset has missing values marked as "?". Replacing them to NaN and assigning proper variable types.
car_dataset.replace('?', np.nan, inplace=True)
car_dataset = car_dataset.convert_dtypes()
print(car_dataset.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 205 entries, 0 to 204 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 symboling 205 non-null Int64 1 normalized_losses 164 non-null string 2 brand 205 non-null string 3 fuel_type 205 non-null string 4 aspirations 205 non-null string 5 num_of_doors 203 non-null string 6 body_style 205 non-null string 7 drive_wheels 205 non-null string 8 engine_location 205 non-null string 9 wheel_base 205 non-null Float64 10 length 205 non-null Float64 11 width 205 non-null Float64 12 height 205 non-null Float64 13 curb_weight 205 non-null Int64 14 engine_type 205 non-null string 15 num_of_cylinders 205 non-null string 16 engine_size 205 non-null Int64 17 fuel_system 205 non-null string 18 bore 201 non-null string 19 stroke 201 non-null string 20 compression_ratio 205 non-null Float64 21 horsepower 203 non-null string 22 peak_rpm 203 non-null string 23 city_mpg 205 non-null Int64 24 highway_mpg 205 non-null Int64 25 price 201 non-null string dtypes: Float64(5), Int64(5), string(16) memory usage: 43.8 KB None
# Deleting rows with missing values for 'price'
car_dataset.dropna(subset=['price'], inplace=True)
car_dataset.reset_index(inplace=True)
# Deleting actuarial variables 'symboling' and 'normalized_losses', which computation methodology
# we do not know (they are representing average damages for car type
# (normalized_losses)and indicating if car is overpriced/underpriced (symboling)), hence we
# would not be able to reproduce this variable for purpose of scoring new datasets
car_dataset.drop(['symboling', 'normalized_losses'], axis=1, inplace=True)
# 4/201 values remain missing for 'bore' and 'stroke' variables
# 2/201 for 'num_of_doors', 'horsepower', 'peak_rpm'
print(car_dataset.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 201 entries, 0 to 200 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 201 non-null int64 1 brand 201 non-null string 2 fuel_type 201 non-null string 3 aspirations 201 non-null string 4 num_of_doors 199 non-null string 5 body_style 201 non-null string 6 drive_wheels 201 non-null string 7 engine_location 201 non-null string 8 wheel_base 201 non-null Float64 9 length 201 non-null Float64 10 width 201 non-null Float64 11 height 201 non-null Float64 12 curb_weight 201 non-null Int64 13 engine_type 201 non-null string 14 num_of_cylinders 201 non-null string 15 engine_size 201 non-null Int64 16 fuel_system 201 non-null string 17 bore 197 non-null string 18 stroke 197 non-null string 19 compression_ratio 201 non-null Float64 20 horsepower 199 non-null string 21 peak_rpm 199 non-null string 22 city_mpg 201 non-null Int64 23 highway_mpg 201 non-null Int64 24 price 201 non-null string dtypes: Float64(5), Int64(4), int64(1), string(15) memory usage: 41.1 KB None
# Investigating remaining missing values (8 observations total)
cols_with_na = ['num_of_doors','bore','stroke','horsepower','peak_rpm']
print(car_dataset[cols_with_na][car_dataset['bore'].isna()])
print(car_dataset[cols_with_na][car_dataset['num_of_doors'].isna()])
print(car_dataset[cols_with_na][car_dataset['horsepower'].isna()])
# Little overlap between variables
num_of_doors bore stroke horsepower peak_rpm 52 two <NA> <NA> 101 6000 53 two <NA> <NA> 101 6000 54 two <NA> <NA> 101 6000 55 two <NA> <NA> 135 6000 num_of_doors bore stroke horsepower peak_rpm 26 <NA> 3.03 3.39 102 5500 60 <NA> 3.39 3.39 64 4650 num_of_doors bore stroke horsepower peak_rpm 126 four 3.46 3.90 <NA> <NA> 127 two 3.46 3.90 <NA> <NA>
# Setting pandas display to display all rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# 'Num_of_doors'
# We can see that both dodge and mazda are sedan type - according to google typical sedan has 4 doors
print(car_dataset[car_dataset['num_of_doors'].isna()])
# In our dataset we can see that this is not always true
print(car_dataset[car_dataset['body_style']=='sedan']['num_of_doors'].value_counts())
# We can't really use average or mode to fill missing values and we can't define accurate number of doors based
# on presented data, hence best solution will be to drop this variable
car_dataset.drop(['num_of_doors'], axis=1, inplace=True)
cols_with_na.remove('num_of_doors')
index brand fuel_type aspirations num_of_doors body_style drive_wheels \ 26 27 dodge gas turbo <NA> sedan fwd 60 63 mazda diesel std <NA> sedan fwd engine_location wheel_base length width height curb_weight \ 26 front 93.7 157.3 63.8 50.6 2191 60 front 98.8 177.8 66.5 55.5 2443 engine_type num_of_cylinders engine_size fuel_system bore stroke \ 26 ohc four 98 mpfi 3.03 3.39 60 ohc four 122 idi 3.39 3.39 compression_ratio horsepower peak_rpm city_mpg highway_mpg price 26 7.6 102 5500 24 30 8558 60 22.7 64 4650 36 42 10795 four 78 two 14 Name: num_of_doors, dtype: Int64
# 'bore' and 'stroke'
print(car_dataset[car_dataset['bore'].isna()])
# 3/4 observations are seemingly describing same mazda model, as their remanining variables (except price) have same values
# We will investigate if we can find similar observations in the dataset with 'bore' and 'stroke' values present
index brand fuel_type aspirations body_style drive_wheels \ 52 55 mazda gas std hatchback rwd 53 56 mazda gas std hatchback rwd 54 57 mazda gas std hatchback rwd 55 58 mazda gas std hatchback rwd engine_location wheel_base length width height curb_weight \ 52 front 95.3 169.0 65.7 49.6 2380 53 front 95.3 169.0 65.7 49.6 2380 54 front 95.3 169.0 65.7 49.6 2385 55 front 95.3 169.0 65.7 49.6 2500 engine_type num_of_cylinders engine_size fuel_system bore stroke \ 52 rotor two 70 4bbl <NA> <NA> 53 rotor two 70 4bbl <NA> <NA> 54 rotor two 70 4bbl <NA> <NA> 55 rotor two 80 mpfi <NA> <NA> compression_ratio horsepower peak_rpm city_mpg highway_mpg price 52 9.4 101 6000 17 23 10945 53 9.4 101 6000 17 23 11845 54 9.4 101 6000 17 23 13645 55 9.4 135 6000 16 23 15645
# Trying to find observations with non-missing values
print(car_dataset[(car_dataset['brand'] == 'mazda') & (car_dataset['fuel_system'] == '4bbl')])
print(car_dataset[(car_dataset['brand'] == 'mazda') & (car_dataset['fuel_system'] == 'mpfi')])
# No such observations
index brand fuel_type aspirations body_style drive_wheels \ 52 55 mazda gas std hatchback rwd 53 56 mazda gas std hatchback rwd 54 57 mazda gas std hatchback rwd engine_location wheel_base length width height curb_weight \ 52 front 95.3 169.0 65.7 49.6 2380 53 front 95.3 169.0 65.7 49.6 2380 54 front 95.3 169.0 65.7 49.6 2385 engine_type num_of_cylinders engine_size fuel_system bore stroke \ 52 rotor two 70 4bbl <NA> <NA> 53 rotor two 70 4bbl <NA> <NA> 54 rotor two 70 4bbl <NA> <NA> compression_ratio horsepower peak_rpm city_mpg highway_mpg price 52 9.4 101 6000 17 23 10945 53 9.4 101 6000 17 23 11845 54 9.4 101 6000 17 23 13645 index brand fuel_type aspirations body_style drive_wheels \ 55 58 mazda gas std hatchback rwd 62 65 mazda gas std sedan rwd engine_location wheel_base length width height curb_weight \ 55 front 95.3 169.0 65.7 49.6 2500 62 front 104.9 175.0 66.1 54.4 2670 engine_type num_of_cylinders engine_size fuel_system bore stroke \ 55 rotor two 80 mpfi <NA> <NA> 62 ohc four 140 mpfi 3.76 3.16 compression_ratio horsepower peak_rpm city_mpg highway_mpg price 55 9.4 135 6000 16 23 15645 62 8.0 120 5000 19 27 18280
# Finding frequent values for mazda brand
print(car_dataset[(car_dataset['brand'] == 'mazda')]['bore'].value_counts())
print(car_dataset[(car_dataset['brand'] == 'mazda')]['stroke'].value_counts())
# No definitive way to fill NA values for 'bore' and 'stroke'.
3.39 6 3.03 4 3.08 1 3.76 1 3.43 1 Name: bore, dtype: Int64 3.39 6 3.15 5 3.16 1 3.64 1 Name: stroke, dtype: Int64
# Correcting data types
car_dataset['bore'] = pd.to_numeric(car_dataset['bore'], errors='coerce')
car_dataset['stroke'] = pd.to_numeric(car_dataset['stroke'], errors='coerce')
car_dataset['horsepower'] = pd.to_numeric(car_dataset['horsepower'], errors='coerce')
car_dataset['peak_rpm'] = pd.to_numeric(car_dataset['peak_rpm'], errors='coerce')
car_dataset['price'] = car_dataset['price'].astype(int)
print(car_dataset.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 201 entries, 0 to 200 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 201 non-null int64 1 brand 201 non-null string 2 fuel_type 201 non-null string 3 aspirations 201 non-null string 4 body_style 201 non-null string 5 drive_wheels 201 non-null string 6 engine_location 201 non-null string 7 wheel_base 201 non-null Float64 8 length 201 non-null Float64 9 width 201 non-null Float64 10 height 201 non-null Float64 11 curb_weight 201 non-null Int64 12 engine_type 201 non-null string 13 num_of_cylinders 201 non-null string 14 engine_size 201 non-null Int64 15 fuel_system 201 non-null string 16 bore 197 non-null float64 17 stroke 197 non-null float64 18 compression_ratio 201 non-null Float64 19 horsepower 199 non-null float64 20 peak_rpm 199 non-null float64 21 city_mpg 201 non-null Int64 22 highway_mpg 201 non-null Int64 23 price 201 non-null int32 dtypes: Float64(5), Int64(4), float64(4), int32(1), int64(1), string(9) memory usage: 38.8 KB None
plt.hist(car_dataset[car_dataset['bore'] != np.nan]['bore'], range=(2,4))
plt.title('Bore distribution in dataset')
plt.show()
# Conclusion: based on the distribution of 'Bore' variable we will input average value for training dataset in place of missing values
plt.hist(car_dataset[car_dataset['stroke'] != np.nan]['stroke'], range=(1.5,4.5))
plt.title('Stroke distribution in dataset')
plt.show()
# Conclusion: based on the distribution of 'Stroke' variable we will input average value for training dataset in place of missing values
# 'horsepower' and 'peak_rpm'
print(car_dataset[car_dataset['horsepower'].isna()])
index brand fuel_type aspirations body_style drive_wheels \ 126 130 renault gas std wagon fwd 127 131 renault gas std hatchback fwd engine_location wheel_base length width height curb_weight \ 126 front 96.1 181.5 66.5 55.2 2579 127 front 96.1 176.8 66.6 50.5 2460 engine_type num_of_cylinders engine_size fuel_system bore stroke \ 126 ohc four 132 mpfi 3.46 3.9 127 ohc four 132 mpfi 3.46 3.9 compression_ratio horsepower peak_rpm city_mpg highway_mpg price 126 8.7 NaN NaN 23 31 9295 127 8.7 NaN NaN 23 31 9895
# Trying to find observations with non-missing values
print(car_dataset[(car_dataset['brand'] == 'renault')])
# No definitive way to fill NA values for 'horsepower' and 'peak_rpm'.
index brand fuel_type aspirations body_style drive_wheels \ 126 130 renault gas std wagon fwd 127 131 renault gas std hatchback fwd engine_location wheel_base length width height curb_weight \ 126 front 96.1 181.5 66.5 55.2 2579 127 front 96.1 176.8 66.6 50.5 2460 engine_type num_of_cylinders engine_size fuel_system bore stroke \ 126 ohc four 132 mpfi 3.46 3.9 127 ohc four 132 mpfi 3.46 3.9 compression_ratio horsepower peak_rpm city_mpg highway_mpg price 126 8.7 NaN NaN 23 31 9295 127 8.7 NaN NaN 23 31 9895
plt.hist(car_dataset[car_dataset['horsepower'] != np.nan]['horsepower'])
plt.title('Horsepower distribution in dataset')
plt.show()
# Conclusion: based on the distribution of 'Horsepower' variable we will input average value for training dataset in place of missing values
plt.hist(car_dataset[car_dataset['peak_rpm'] != np.nan]['peak_rpm'])
plt.title('Peak rpm distribution in dataset')
plt.show()
# Conclusion: based on the distribution of 'Peak_rpm' variable we will input average value for training dataset in place of missing values
car_dataset.head()
index | brand | fuel_type | aspirations | body_style | drive_wheels | engine_location | wheel_base | length | width | height | curb_weight | engine_type | num_of_cylinders | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | alfa-romero | gas | std | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495 |
1 | 1 | alfa-romero | gas | std | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500 |
2 | 2 | alfa-romero | gas | std | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500 |
3 | 3 | audi | gas | std | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950 |
4 | 4 | audi | gas | std | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450 |
# Deciding which variables to use for dummy variables creation
for name in car_dataset.select_dtypes(include='string').columns:
print(name)
print(car_dataset[name].value_counts())
print('\n')
# brand - group 'isuzu', 'renault', 'mercury' into 'other'
# number of cylinders - group into "three or less", "seven or more"
# fuel_system - group '4bbl', 'spfi', 'mfi' into 'other'
brand toyota 32 nissan 18 mazda 17 mitsubishi 13 honda 13 volkswagen 12 subaru 12 peugot 11 volvo 11 dodge 9 bmw 8 mercedes-benz 8 plymouth 7 audi 6 saab 6 porsche 4 jaguar 3 chevrolet 3 alfa-romero 3 isuzu 2 renault 2 mercury 1 Name: brand, dtype: Int64 fuel_type gas 181 diesel 20 Name: fuel_type, dtype: Int64 aspirations std 165 turbo 36 Name: aspirations, dtype: Int64 body_style sedan 94 hatchback 68 wagon 25 hardtop 8 convertible 6 Name: body_style, dtype: Int64 drive_wheels fwd 118 rwd 75 4wd 8 Name: drive_wheels, dtype: Int64 engine_location front 198 rear 3 Name: engine_location, dtype: Int64 engine_type ohc 145 ohcf 15 ohcv 13 dohc 12 l 12 rotor 4 Name: engine_type, dtype: Int64 num_of_cylinders four 157 six 24 five 10 two 4 eight 4 three 1 twelve 1 Name: num_of_cylinders, dtype: Int64 fuel_system mpfi 92 2bbl 64 idi 20 1bbl 11 spdi 9 4bbl 3 mfi 1 spfi 1 Name: fuel_system, dtype: Int64
# Function grouping values from 'fuel_system', 'brand', 'num_of_cylinders' and 'engine_type' variables
# to more frequent groups
def fother(var):
if var in ['4bbl','spfi','mfi','isuzu','renault','mercury']:
return 'other'
elif var in ['two','three']:
return 'three_or_less'
elif var in ['eight','twelve']:
return 'seven_or_more'
else:
return var
# Updating values
car_dataset['brand']=car_dataset['brand'].apply(fother)
car_dataset['fuel_system']=car_dataset['fuel_system'].apply(fother)
car_dataset['num_of_cylinders']=car_dataset['num_of_cylinders'].apply(fother)
car_dataset['engine_type']=car_dataset['engine_type'].apply(fother)
# After using apply affected variable types got changed to 'object'
# Swtiching them back to 'string'
car_dataset = car_dataset.convert_dtypes()
# car_dataset.info()
# Creating dummy variables based on categorical variables
# This is done to broaden variable base for KNN algorithm
car_dataset2=car_dataset
for name in car_dataset2.select_dtypes(include='string').columns:
print(name)
car_dataset2=pd.get_dummies(car_dataset2, prefix=name, prefix_sep='_', columns=[name])
brand fuel_type aspirations body_style drive_wheels engine_location engine_type num_of_cylinders fuel_system
# Converting some variables dtypes to float, as Float, Int32 and Int64 is not properly handled by KNN algorithm
# (Float and Int data types got created because of using .convert_dtypes() function)
Float_to_float = ['wheel_base', 'length', 'width', 'height', 'bore', 'stroke', 'compression_ratio', 'curb_weight', 'engine_size',
'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg']
for var in Float_to_float:
car_dataset2[var] = car_dataset2[var].astype(float)
# Checking variable types of our new dataset - now 66 variables
# All variables are numeric type, so we can use them in KNN algorithm
# Still there are missing values present in 'bore', 'stroke' (4) and 'horsepower', 'peak_rpm' (2)
car_dataset2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 201 entries, 0 to 200 Data columns (total 66 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 201 non-null Int64 1 wheel_base 201 non-null float64 2 length 201 non-null float64 3 width 201 non-null float64 4 height 201 non-null float64 5 curb_weight 201 non-null float64 6 engine_size 201 non-null float64 7 bore 197 non-null float64 8 stroke 197 non-null float64 9 compression_ratio 201 non-null float64 10 horsepower 199 non-null float64 11 peak_rpm 199 non-null float64 12 city_mpg 201 non-null float64 13 highway_mpg 201 non-null float64 14 price 201 non-null Int32 15 brand_alfa-romero 201 non-null uint8 16 brand_audi 201 non-null uint8 17 brand_bmw 201 non-null uint8 18 brand_chevrolet 201 non-null uint8 19 brand_dodge 201 non-null uint8 20 brand_honda 201 non-null uint8 21 brand_jaguar 201 non-null uint8 22 brand_mazda 201 non-null uint8 23 brand_mercedes-benz 201 non-null uint8 24 brand_mitsubishi 201 non-null uint8 25 brand_nissan 201 non-null uint8 26 brand_other 201 non-null uint8 27 brand_peugot 201 non-null uint8 28 brand_plymouth 201 non-null uint8 29 brand_porsche 201 non-null uint8 30 brand_saab 201 non-null uint8 31 brand_subaru 201 non-null uint8 32 brand_toyota 201 non-null uint8 33 brand_volkswagen 201 non-null uint8 34 brand_volvo 201 non-null uint8 35 fuel_type_diesel 201 non-null uint8 36 fuel_type_gas 201 non-null uint8 37 aspirations_std 201 non-null uint8 38 aspirations_turbo 201 non-null uint8 39 body_style_convertible 201 non-null uint8 40 body_style_hardtop 201 non-null uint8 41 body_style_hatchback 201 non-null uint8 42 body_style_sedan 201 non-null uint8 43 body_style_wagon 201 non-null uint8 44 drive_wheels_4wd 201 non-null uint8 45 drive_wheels_fwd 201 non-null uint8 46 drive_wheels_rwd 201 non-null uint8 47 engine_location_front 201 non-null uint8 48 engine_location_rear 201 non-null uint8 49 engine_type_dohc 201 non-null uint8 50 engine_type_l 201 non-null uint8 51 engine_type_ohc 201 non-null uint8 52 engine_type_ohcf 201 non-null uint8 53 engine_type_ohcv 201 non-null uint8 54 engine_type_rotor 201 non-null uint8 55 num_of_cylinders_five 201 non-null uint8 56 num_of_cylinders_four 201 non-null uint8 57 num_of_cylinders_seven_or_more 201 non-null uint8 58 num_of_cylinders_six 201 non-null uint8 59 num_of_cylinders_three_or_less 201 non-null uint8 60 fuel_system_1bbl 201 non-null uint8 61 fuel_system_2bbl 201 non-null uint8 62 fuel_system_idi 201 non-null uint8 63 fuel_system_mpfi 201 non-null uint8 64 fuel_system_other 201 non-null uint8 65 fuel_system_spdi 201 non-null uint8 dtypes: Int32(1), Int64(1), float64(13), uint8(51) memory usage: 33.3 KB
# We have low number of observations (201)
# Arbitrarly deciding that we will use 20% holdout test dataset
# Preparing list of explanatory variables
features = car_dataset2.columns.to_list()
features.remove('index')
X = car_dataset2.loc[:, features]
train, test = train_test_split(X, random_state= rd_state, train_size = .8)
print('train size: ' + str(train.shape[0]))
print('test size: ' + str(test.shape[0]))
# Sample sizes are correct
train size: 160 test size: 41
plt.hist(train['price'])
plt.title('Price distribution in train dataset')
plt.show()
plt.hist(test['price'])
plt.title('Price distribution in test dataset')
plt.show()
# Distribution plot of target variable
plt.hist(car_dataset['price'])
plt.title('Price distribution car dataset')
plt.show()
train.isna().sum()
# All missing values are in the train dataset
wheel_base 0 length 0 width 0 height 0 curb_weight 0 engine_size 0 bore 4 stroke 4 compression_ratio 0 horsepower 2 peak_rpm 2 city_mpg 0 highway_mpg 0 price 0 brand_alfa-romero 0 brand_audi 0 brand_bmw 0 brand_chevrolet 0 brand_dodge 0 brand_honda 0 brand_jaguar 0 brand_mazda 0 brand_mercedes-benz 0 brand_mitsubishi 0 brand_nissan 0 brand_other 0 brand_peugot 0 brand_plymouth 0 brand_porsche 0 brand_saab 0 brand_subaru 0 brand_toyota 0 brand_volkswagen 0 brand_volvo 0 fuel_type_diesel 0 fuel_type_gas 0 aspirations_std 0 aspirations_turbo 0 body_style_convertible 0 body_style_hardtop 0 body_style_hatchback 0 body_style_sedan 0 body_style_wagon 0 drive_wheels_4wd 0 drive_wheels_fwd 0 drive_wheels_rwd 0 engine_location_front 0 engine_location_rear 0 engine_type_dohc 0 engine_type_l 0 engine_type_ohc 0 engine_type_ohcf 0 engine_type_ohcv 0 engine_type_rotor 0 num_of_cylinders_five 0 num_of_cylinders_four 0 num_of_cylinders_seven_or_more 0 num_of_cylinders_six 0 num_of_cylinders_three_or_less 0 fuel_system_1bbl 0 fuel_system_2bbl 0 fuel_system_idi 0 fuel_system_mpfi 0 fuel_system_other 0 fuel_system_spdi 0 dtype: int64
# Filling missing values with variable averages:
# We do this based on train data, not on test data to prevent information leakage about variable distribution to test dataset
for name in ["bore","stroke"]:
train[name].fillna(train[name].mean(), inplace=True)
for name in ["horsepower","peak_rpm"]:
train[name].fillna(round(train[name].mean()), inplace=True)
# Check
# train.isna().sum()
# No missing values left
def knn_train_test(X, Y, DS, k, seed, train_pct = 0.8):
"""
Generate multiple single-feature KNN regressions on given train dataset.
Function will randomly split DS DataFrame into train/valid subsets and train multiple single-feature KNN models.
Parameters
----------
X : str
List of features (independent variables).
Y : str
Dependent variable (explained).
DS : pandas.DataFrame
DataFrame containing X and Y variables.
k : int
Number of neighbors to use by default for kneighbors queries (KNeighborsRegressor n_neighbors parameter).
seed : int
Random number seed for results reproductibility.
train_pct : float64, default 0.8
Percentage of DS dataset to be used as train.
Returns
-------
predictions : numpy.ndarray
Numpy array with the predicted resluts of target column.
RMSE : float64
Root mean squared error of predictions and Y variable on valid dataset.
"""
model = KNeighborsRegressor(n_neighbors = k)
ds_train, ds_valid = train_test_split(DS, random_state = seed, train_size = train_pct)
model.fit(ds_train[[X]], ds_train[Y])
predictions = model.predict(ds_valid[[X]])
mse = mean_squared_error(ds_valid[Y].astype(float), predictions)
rmse = np.sqrt(mse)
return rmse, predictions
features.remove('price') # deleting dependent variable from list of variables
# Checking single-feature KNN model RMSE to find best predictors
# Because we are estimating single-feature only scaling is not required here
rmse_dict = dict()
n_neighbours = [1,2,3,5,7,9]
for var in features:
values_dict = {}
for n in n_neighbours:
rmse, predictions = knn_train_test(X = var, Y = 'price', DS = train, k = n, seed = rd_state)
values_dict[n] = rmse
rmse_dict[var] = values_dict
# Finding average RMSE for each feature from 6 KNN regressors
feature_avg_rmse = {}
for k,v in rmse_dict.items():
avg_rmse = np.mean(list(v.values()))
feature_avg_rmse[k] = avg_rmse
series_avg_rmse = pd.Series(feature_avg_rmse)
series_avg_rmse.sort_values(ascending = False).plot.barh(figsize=(15, 12))
plt.title('Avergae RMSE Value Per Feature')
plt.show()
Top 5 features for further analysis:
# Now we will re-train the model on full train dataset to find best multivariate suited model to test dataset
features_fin = ['horsepower','engine_size','city_mpg','highway_mpg','length','price']
train2=train[features_fin]
test2=test[features_fin]
# Normalizing train data based on train dataset descriptive statistics
normalized_train = (train2 - train2.min()) / (train2.max() - train2.min())
normalized_train['price']=train2['price']
normalized_train.head()
horsepower | engine_size | city_mpg | highway_mpg | length | price | |
---|---|---|---|---|---|---|
110 | 0.219626 | 0.222642 | 0.176471 | 0.216216 | 0.862687 | 16695 |
122 | 0.443925 | 0.339623 | 0.176471 | 0.297297 | 0.414925 | 22018 |
58 | 0.168224 | 0.230189 | 0.382353 | 0.432432 | 0.547761 | 10595 |
14 | 0.626168 | 0.558491 | 0.088235 | 0.162162 | 0.714925 | 30760 |
56 | 0.168224 | 0.230189 | 0.382353 | 0.432432 | 0.547761 | 8845 |
# Normalizing test data based on train dataset descriptive statistics
# This is to prevent information leakage (assumption that test data is "out of sample")
normalized_test=(test2-train2.min())/(train2.max() - train2.min())
normalized_test['price']=test2['price']
normalized_test.head()
horsepower | engine_size | city_mpg | highway_mpg | length | price | |
---|---|---|---|---|---|---|
97 | 0.228972 | 0.222642 | 0.411765 | 0.486486 | 0.482090 | 9549 |
117 | 0.093458 | 0.109434 | 0.529412 | 0.594595 | 0.241791 | 6229 |
59 | 0.168224 | 0.230189 | 0.382353 | 0.432432 | 0.547761 | 10245 |
35 | 0.130841 | 0.116981 | 0.500000 | 0.486486 | 0.238806 | 7295 |
165 | 0.317757 | 0.320755 | 0.323529 | 0.378378 | 0.523881 | 9989 |
normalized_train.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 160 entries, 110 to 37 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 horsepower 160 non-null float64 1 engine_size 160 non-null float64 2 city_mpg 160 non-null float64 3 highway_mpg 160 non-null float64 4 length 160 non-null float64 5 price 160 non-null Int32 dtypes: Int32(1), float64(5) memory usage: 8.3 KB
# Possible combinations of 5 final features
list_combinations = [[]]
for n in range (1,6):
list_combinations += list(combinations(normalized_train.columns.drop('price'), n))
print(len(list_combinations))
# Number of combinations - 32
print(list_combinations)
32 [[], ('horsepower',), ('engine_size',), ('city_mpg',), ('highway_mpg',), ('length',), ('horsepower', 'engine_size'), ('horsepower', 'city_mpg'), ('horsepower', 'highway_mpg'), ('horsepower', 'length'), ('engine_size', 'city_mpg'), ('engine_size', 'highway_mpg'), ('engine_size', 'length'), ('city_mpg', 'highway_mpg'), ('city_mpg', 'length'), ('highway_mpg', 'length'), ('horsepower', 'engine_size', 'city_mpg'), ('horsepower', 'engine_size', 'highway_mpg'), ('horsepower', 'engine_size', 'length'), ('horsepower', 'city_mpg', 'highway_mpg'), ('horsepower', 'city_mpg', 'length'), ('horsepower', 'highway_mpg', 'length'), ('engine_size', 'city_mpg', 'highway_mpg'), ('engine_size', 'city_mpg', 'length'), ('engine_size', 'highway_mpg', 'length'), ('city_mpg', 'highway_mpg', 'length'), ('horsepower', 'engine_size', 'city_mpg', 'highway_mpg'), ('horsepower', 'engine_size', 'city_mpg', 'length'), ('horsepower', 'engine_size', 'highway_mpg', 'length'), ('horsepower', 'city_mpg', 'highway_mpg', 'length'), ('engine_size', 'city_mpg', 'highway_mpg', 'length'), ('horsepower', 'engine_size', 'city_mpg', 'highway_mpg', 'length')]
# Finding best combination of KNN algorithm
KNN_results = pd.DataFrame(columns=['algorithm','n_neighbours','predictors','mae','mape','rmse','r2'])
algorithms=['ball_tree', 'kd_tree', 'brute']
hyper_params=list(range(1,26))
for n in range(1,len(list_combinations)):
for k in hyper_params:
for a in algorithms:
knn = KNeighborsRegressor(n_neighbors = k, algorithm = a, n_jobs=-1)
knn.fit(normalized_train[list(list_combinations[n])], normalized_train['price'].astype(float))
predictions=knn.predict(normalized_test[list(list_combinations[n])])
mae = mean_absolute_error(predictions, normalized_test['price'].astype(float))
mape = mean_absolute_percentage_error(predictions, normalized_test['price'].astype(float))
mse=mean_squared_error(predictions,normalized_test['price'].astype(float))
rmse=np.sqrt(mse)
r2 = knn.score(normalized_test[list(list_combinations[n])], normalized_test['price'].astype(float))
df = pd.DataFrame([[a, k, str(list(list_combinations[n])), mae, mape, rmse, r2]], columns = ['algorithm','n_neighbours','predictors','mae','mape','rmse','r2'])
KNN_results = pd.concat([KNN_results, df], ignore_index = True)
# Cleaning up "predictors" variable from object type (list) to string type
def list_to_str(list1):
mystring=""
for i in list1:
mystring += i + ", "
return mystring.rstrip(", ")
KNN_results["predictors"] = KNN_results["predictors"].apply(eval)
KNN_results["predictors_str"] = KNN_results["predictors"].apply(list_to_str)
KNN_results.drop(columns = "predictors", inplace = True)
# Best models per n_neighbours class
KNN_results.loc[KNN_results.groupby('n_neighbours')['rmse'].idxmin()]
algorithm | n_neighbours | mae | mape | rmse | r2 | predictors_str | |
---|---|---|---|---|---|---|---|
1950 | ball_tree | 1 | 1538.146341 | 0.122926 | 2248.391132 | 0.904610 | horsepower, engine_size, city_mpg, length |
1953 | ball_tree | 2 | 1436.182927 | 0.113492 | 2045.222030 | 0.921070 | horsepower, engine_size, city_mpg, length |
1132 | kd_tree | 3 | 1440.512195 | 0.114722 | 2160.316383 | 0.911937 | horsepower, engine_size, city_mpg |
1135 | kd_tree | 4 | 1597.335366 | 0.118807 | 2460.924004 | 0.885724 | horsepower, engine_size, city_mpg |
1138 | kd_tree | 5 | 1723.307317 | 0.131069 | 2560.586052 | 0.876281 | horsepower, engine_size, city_mpg |
1892 | brute | 6 | 1768.276423 | 0.138887 | 2686.857165 | 0.863778 | horsepower, engine_size, city_mpg, highway_mpg |
1744 | kd_tree | 7 | 2100.195122 | 0.146102 | 2837.456123 | 0.848079 | engine_size, highway_mpg, length |
1747 | kd_tree | 8 | 2000.759146 | 0.141080 | 2701.033240 | 0.862337 | engine_size, highway_mpg, length |
1075 | kd_tree | 9 | 1796.013550 | 0.132229 | 2777.143117 | 0.854469 | highway_mpg, length |
1077 | ball_tree | 10 | 1864.095122 | 0.133916 | 2836.179695 | 0.848216 | highway_mpg, length |
1082 | brute | 11 | 1900.082040 | 0.133944 | 2856.829504 | 0.845998 | highway_mpg, length |
1084 | kd_tree | 12 | 1929.843496 | 0.136536 | 2924.844466 | 0.838577 | highway_mpg, length |
1086 | ball_tree | 13 | 1965.103189 | 0.136828 | 3016.089897 | 0.828349 | highway_mpg, length |
1014 | ball_tree | 14 | 1908.567944 | 0.131135 | 3095.336816 | 0.819210 | city_mpg, length |
1017 | ball_tree | 15 | 1899.923577 | 0.131902 | 3061.358896 | 0.823157 | city_mpg, length |
1021 | kd_tree | 16 | 1954.992378 | 0.135537 | 3191.145774 | 0.807845 | city_mpg, length |
1024 | kd_tree | 17 | 1989.466284 | 0.137111 | 3193.576712 | 0.807552 | city_mpg, length |
1028 | brute | 18 | 2041.205962 | 0.140049 | 3281.423504 | 0.796819 | city_mpg, length |
1029 | ball_tree | 19 | 2098.623877 | 0.142447 | 3429.770748 | 0.778033 | city_mpg, length |
1032 | ball_tree | 20 | 2123.553659 | 0.145282 | 3414.622316 | 0.779989 | city_mpg, length |
1037 | brute | 21 | 2202.925668 | 0.151183 | 3548.205017 | 0.762439 | city_mpg, length |
2090 | brute | 22 | 2240.689579 | 0.156557 | 3601.631836 | 0.755231 | horsepower, engine_size, highway_mpg, length |
2091 | ball_tree | 23 | 2231.308590 | 0.155018 | 3625.929202 | 0.751917 | horsepower, engine_size, highway_mpg, length |
1719 | ball_tree | 24 | 2256.380081 | 0.159893 | 3660.891996 | 0.747110 | engine_size, city_mpg, length |
2099 | brute | 25 | 2234.507317 | 0.152879 | 3588.199167 | 0.757053 | horsepower, engine_size, highway_mpg, length |
# 15 best models
# We see that 3 predictor sets dominated top 15 models
# 4 features: (horsepower, engine_size, city_mpg, length)
# 3 features: (horsepower, engine_size, city_mpg)
# 2 features: (engine_size, city_mpg)
KNN_results.sort_values(by=['rmse']).head(15)
algorithm | n_neighbours | mae | mape | rmse | r2 | predictors_str | |
---|---|---|---|---|---|---|---|
1953 | ball_tree | 2 | 1436.182927 | 0.113492 | 2045.222030 | 0.921070 | horsepower, engine_size, city_mpg, length |
1954 | kd_tree | 2 | 1436.182927 | 0.113492 | 2045.222030 | 0.921070 | horsepower, engine_size, city_mpg, length |
1132 | kd_tree | 3 | 1440.512195 | 0.114722 | 2160.316383 | 0.911937 | horsepower, engine_size, city_mpg |
1131 | ball_tree | 3 | 1444.821138 | 0.115114 | 2173.721645 | 0.910841 | horsepower, engine_size, city_mpg |
1133 | brute | 3 | 1492.016260 | 0.119106 | 2182.952497 | 0.910082 | horsepower, engine_size, city_mpg |
1955 | brute | 2 | 1489.426829 | 0.113977 | 2183.447929 | 0.910041 | horsepower, engine_size, city_mpg, length |
1280 | brute | 2 | 1515.902439 | 0.113260 | 2234.513408 | 0.905784 | horsepower, engine_size, length |
1950 | ball_tree | 1 | 1538.146341 | 0.122926 | 2248.391132 | 0.904610 | horsepower, engine_size, city_mpg, length |
1951 | kd_tree | 1 | 1538.146341 | 0.122926 | 2248.391132 | 0.904610 | horsepower, engine_size, city_mpg, length |
1278 | ball_tree | 2 | 1532.439024 | 0.115533 | 2252.396948 | 0.904270 | horsepower, engine_size, length |
1279 | kd_tree | 2 | 1532.439024 | 0.115533 | 2252.396948 | 0.904270 | horsepower, engine_size, length |
682 | kd_tree | 3 | 1541.902439 | 0.124176 | 2263.351605 | 0.903336 | engine_size, city_mpg |
2254 | kd_tree | 2 | 1542.414634 | 0.128853 | 2275.819847 | 0.902269 | horsepower, engine_size, city_mpg, highway_mpg... |
2253 | ball_tree | 2 | 1542.414634 | 0.128853 | 2275.819847 | 0.902269 | horsepower, engine_size, city_mpg, highway_mpg... |
756 | ball_tree | 3 | 1718.065041 | 0.145574 | 2290.978510 | 0.900962 | engine_size, highway_mpg |
# Plotting best 3 models (feature-wise) to see how they are fitting to test data base on N neighbours parameter
# Selecting best 2,3,4 feature sets
top_pred=['engine_size, city_mpg','horsepower, engine_size, city_mpg', 'horsepower, engine_size, city_mpg, length']
features_rmse=KNN_results.loc[KNN_results['predictors_str'].isin(top_pred)]
features_rmse=features_rmse.loc[features_rmse.groupby(['predictors_str','n_neighbours'])['rmse'].idxmin()]
# Creating 'features' variable
features_rmse.loc[features_rmse['predictors_str'] =='engine_size, city_mpg', 'features'] = 2
features_rmse.loc[features_rmse['predictors_str'] =='horsepower, engine_size, city_mpg', 'features'] = 3
features_rmse.loc[features_rmse['predictors_str'] =='horsepower, engine_size, city_mpg, length', 'features'] = 4
features_rmse["features"] = features_rmse["features"].astype(int)
# Plotting RMSE
colors = {2:'red', 3:'green', 4:'blue', 5:'yellow'}
features_rmse.plot.scatter(x = 'n_neighbours', y = 'rmse', c=features_rmse['features'].map(colors))
plt.title('Models best fit to test data')
plt.xlabel('N neighbours')
plt.ylabel('RMSE')
plt.yticks(range(0,5500,500))
textstr = '\n'.join((
"red - 2 features",
"green - 3 features",
"blue - 4 features"))
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
plt.text(18, 500, textstr, fontsize=10, horizontalalignment='left',
verticalalignment='bottom', bbox=props)
plt.show()
features_rmse["n_neighbours"]=features_rmse["n_neighbours"].astype(int)
features_rmse[["n_neighbours","rmse"]].corr()
n_neighbours | rmse | |
---|---|---|
n_neighbours | 1.000000 | 0.886306 |
rmse | 0.886306 | 1.000000 |
We can see that 4 feature model is best fitted to test data for n neighbours values of 2,3 and 16+ (compared to 2 and 3 features models for same n neighbours parameter value). 3 features model is dominant in <3;15> range of n neighbours parameter. Having said that, for 2 feature model for n neighbours = 3, we have 4th best model in analyzed dataset. Overall we can observe a statistically significant positive correlation between n neighbours parameter and RMSE, hence we recommend using low value of n neighbours.
# Check for price outliers in train dataset - because of small dataset size we expect a significant amount of outliers
sn.boxplot(x = train['price'])
print(np.where(train['price']>30000)) # 11 outliers in train / 6,875% of dataset / 3 outliers above 40 000
print(np.where(test['price']>30000)) # 3 observations meeting same criteria in test dataset - 7,5% of dataset
(array([ 3, 20, 40, 42, 50, 66, 80, 94, 105, 138, 150], dtype=int64),) (array([22, 25, 39], dtype=int64),)
# Checking if extreme values deletion from train dataset will improve overall performance
train3 = train2[train2.price < 45000] # Deleting 1 observation from train dataset with highest price
normalized_train_ex = (train3 - train3.min()) / (train3.max() - train3.min())
normalized_train_ex['price']=train3['price']
normalized_test_ex=(test2-train3.min())/(train3.max() - train3.min())
normalized_test_ex['price']=test2['price']
# Finding best combination of KNN algorithm
KNN_results_ex = pd.DataFrame(columns=['algorithm','n_neighbours','predictors','mae','mape','rmse','r2'])
algorithms=['ball_tree', 'kd_tree', 'brute']
hyper_params=list(range(1,26))
for n in range(1,len(list_combinations)):
for k in hyper_params:
for a in algorithms:
knn = KNeighborsRegressor(n_neighbors = k, algorithm = a, n_jobs=-1)
knn.fit(normalized_train_ex[list(list_combinations[n])], normalized_train_ex['price'].astype(float))
predictions=knn.predict(normalized_test_ex[list(list_combinations[n])])
mae = mean_absolute_error(predictions, normalized_test_ex['price'].astype(float))
mape = mean_absolute_percentage_error(predictions, normalized_test['price'].astype(float))
mse=mean_squared_error(predictions,normalized_test_ex['price'].astype(float))
rmse=np.sqrt(mse)
r2 = knn.score(normalized_test_ex[list(list_combinations[n])], normalized_test_ex['price'].astype(float))
df = pd.DataFrame([[a, k, list_combinations[n], mae, mape, rmse, r2]], columns = ['algorithm','n_neighbours','predictors','mae','mape','rmse','r2'])
KNN_results_ex = pd.concat([KNN_results_ex, df], ignore_index = True)
# Best models after deleting 1 observation with highest price in train dataset
KNN_results_ex.sort_values(by=['rmse']).head(3)
# Deleting extreme observation decreases prediction power by at least 1% (looking at R^2 metric)
algorithm | n_neighbours | predictors | mae | mape | rmse | r2 | |
---|---|---|---|---|---|---|---|
1131 | ball_tree | 3 | (horsepower, engine_size, city_mpg) | 1471.243902 | 0.118446 | 2164.652306 | 0.911583 |
1132 | kd_tree | 3 | (horsepower, engine_size, city_mpg) | 1475.552846 | 0.118837 | 2178.030882 | 0.910487 |
1133 | brute | 3 | (horsepower, engine_size, city_mpg) | 1492.016260 | 0.119106 | 2182.952497 | 0.910082 |
Using KNN algorithm we were able to estimate pretty good model predicting car prices from 1980's (dataset : https://archive.ics.uci.edu/ml/datasets/automobile ). Best model used 4 features and 2 nearest neighbours with evaluation metrics on holdout dataset : MAPE 11,8%, RMSE 2 164,65 and R^2 91,16%. Because of small data size (205 observations total) further improvement using same ML technique will be limited, even considering cross-validation techniques. As further steps we could compare KNN performance vs other algorithms (for example logistic regression and tree-based techniques).