This project predicts the market price of a car from attributes such as its weight, acceleration, and fuel economy (miles per gallon), among others.
The automobile dataset used is the Automobile Data Set from the UCI Machine Learning Repository.
This dataset consists of three entity types:
(a) Specification: the various characteristics of the auto
(b) Assigned insurance risk rating: the degree to which the auto is more (or less) risky than its price indicates
(c) Normalized losses in use: the relative average loss payment per insured vehicle year, compared with other cars.
import pandas as pd
# Note: imports-85.data has no header row, so the first record is consumed
# as column names here; passing header=None would keep all 205 records.
cars = pd.read_csv('imports-85.data')
# First few rows of the data
cars.head(3)
3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.60 | ... | 130 | mpfi | 3.47 | 2.68 | 9.00 | 111 | 5000 | 21 | 27 | 13495 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
1 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
2 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
3 rows × 26 columns
The dataset has 26 columns, each describing some attribute of the autos.
However, the column names as read in are not informative, so we rename them.
# Renaming the columns
new_cols = ['Symbol', 'Normalized_loss', 'Make', 'Fuel_type', 'Aspiration', 'No_of_doors', 'Body_style', 'Drive_wheels',
'Engine_loc', 'Wheel_base', 'Length', 'Width', 'Height', 'Curb_weight', 'Engine_type', 'No_of_cylinders',
'Engine_size', 'Fuel_system', 'Bore', 'Stroke', 'Compression_ratio', 'Horse_power', 'Peak_rpm', 'City_mpg',
'Highway_mpg', 'Price']
cars.columns = new_cols
cars.head(2)
Symbol | Normalized_loss | Make | Fuel_type | Aspiration | No_of_doors | Body_style | Drive_wheels | Engine_loc | Wheel_base | ... | Engine_size | Fuel_system | Bore | Stroke | Compression_ratio | Horse_power | Peak_rpm | City_mpg | Highway_mpg | Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
1 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
2 rows × 26 columns
print('The dataset has {} records and {} columns'.format(cars.shape[0], cars.shape[1]))
print(' ')
print('Info on number of non-null values and the datatype of each column: ')
print(' ')
cars.info()
The dataset has 204 records and 26 columns

Info on number of non-null values and the datatype of each column:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 26 columns):
Symbol               204 non-null int64
Normalized_loss      204 non-null object
Make                 204 non-null object
Fuel_type            204 non-null object
Aspiration           204 non-null object
No_of_doors          204 non-null object
Body_style           204 non-null object
Drive_wheels         204 non-null object
Engine_loc           204 non-null object
Wheel_base           204 non-null float64
Length               204 non-null float64
Width                204 non-null float64
Height               204 non-null float64
Curb_weight          204 non-null int64
Engine_type          204 non-null object
No_of_cylinders      204 non-null object
Engine_size          204 non-null int64
Fuel_system          204 non-null object
Bore                 204 non-null object
Stroke               204 non-null object
Compression_ratio    204 non-null float64
Horse_power          204 non-null object
Peak_rpm             204 non-null object
City_mpg             204 non-null int64
Highway_mpg          204 non-null int64
Price                204 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.5+ KB
Some columns have the object datatype even though their values are (or should be) int or float.
These columns need to be cleaned before continuing with the analysis.
import numpy as np

# Columns that should be numeric but were read in as strings
numeric_cols = ['Normalized_loss', 'Bore', 'Stroke', 'Horse_power', 'Peak_rpm', 'Price']

# Strip whitespace and replace the '?' placeholders with NaN
def strip_cols(df):
    for col in numeric_cols:
        df[col] = df[col].str.strip().replace('?', np.nan)
    return df

cars = strip_cols(cars)
cars[numeric_cols] = cars[numeric_cols].apply(pd.to_numeric)
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 26 columns):
Symbol               204 non-null int64
Normalized_loss      164 non-null float64
Make                 204 non-null object
Fuel_type            204 non-null object
Aspiration           204 non-null object
No_of_doors          204 non-null object
Body_style           204 non-null object
Drive_wheels         204 non-null object
Engine_loc           204 non-null object
Wheel_base           204 non-null float64
Length               204 non-null float64
Width                204 non-null float64
Height               204 non-null float64
Curb_weight          204 non-null int64
Engine_type          204 non-null object
No_of_cylinders      204 non-null object
Engine_size          204 non-null int64
Fuel_system          204 non-null object
Bore                 200 non-null float64
Stroke               200 non-null float64
Compression_ratio    204 non-null float64
Horse_power          202 non-null float64
Peak_rpm             202 non-null float64
City_mpg             204 non-null int64
Highway_mpg          204 non-null int64
Price                200 non-null float64
dtypes: float64(11), int64(5), object(10)
memory usage: 41.5+ KB
In this mini-project, we'll use only the numeric columns for the prediction and ignore the string-valued columns. Alternatively, some of the object columns could be encoded as numbers, which might improve the results.
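As a sketch of the encoding alternative mentioned above (not used in this project), a categorical column such as Fuel_type could be one-hot encoded with pandas; the toy data here is illustrative, not from the dataset:

```python
import pandas as pd

# Toy stand-in for a categorical column like Fuel_type
fuel = pd.DataFrame({'Fuel_type': ['gas', 'diesel', 'gas', 'gas']})

# One-hot encode: each category becomes its own indicator column
encoded = pd.get_dummies(fuel, columns=['Fuel_type'])
print(encoded.columns.tolist())  # ['Fuel_type_diesel', 'Fuel_type_gas']
```

The resulting indicator columns could then be fed to the regressor alongside the numeric features.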
cars = cars.select_dtypes(exclude = ['object']).copy()
cars.head()
Symbol | Normalized_loss | Wheel_base | Length | Width | Height | Curb_weight | Engine_size | Bore | Stroke | Compression_ratio | Horse_power | Peak_rpm | City_mpg | Highway_mpg | Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
1 | 1 | NaN | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
2 | 2 | 164.0 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
3 | 2 | 164.0 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
4 | 2 | NaN | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110.0 | 5500.0 | 19 | 25 | 15250.0 |
missing_vals = cars.columns[cars.isna().any()]
cars[missing_vals].isna().sum()
Normalized_loss    40
Bore                4
Stroke              4
Horse_power         2
Peak_rpm            2
Price               4
dtype: int64
# An overview of the rows with missing values
null_data = cars[cars.isnull().any(axis=1)]
null_data
Symbol | Normalized_loss | Wheel_base | Length | Width | Height | Curb_weight | Engine_size | Bore | Stroke | Compression_ratio | Horse_power | Peak_rpm | City_mpg | Highway_mpg | Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
1 | 1 | NaN | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
4 | 2 | NaN | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110.0 | 5500.0 | 19 | 25 | 15250.0 |
6 | 1 | NaN | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | 136 | 3.19 | 3.40 | 8.5 | 110.0 | 5500.0 | 19 | 25 | 18920.0 |
8 | 0 | NaN | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160.0 | 5500.0 | 16 | 22 | NaN |
13 | 1 | NaN | 103.5 | 189.0 | 66.9 | 55.7 | 3055 | 164 | 3.31 | 3.19 | 9.0 | 121.0 | 4250.0 | 20 | 25 | 24565.0 |
14 | 0 | NaN | 103.5 | 189.0 | 66.9 | 55.7 | 3230 | 209 | 3.62 | 3.39 | 8.0 | 182.0 | 5400.0 | 16 | 22 | 30760.0 |
15 | 0 | NaN | 103.5 | 193.8 | 67.9 | 53.7 | 3380 | 209 | 3.62 | 3.39 | 8.0 | 182.0 | 5400.0 | 16 | 22 | 41315.0 |
16 | 0 | NaN | 110.0 | 197.0 | 70.9 | 56.3 | 3505 | 209 | 3.62 | 3.39 | 8.0 | 182.0 | 5400.0 | 15 | 20 | 36880.0 |
42 | 0 | NaN | 94.3 | 170.7 | 61.8 | 53.5 | 2337 | 111 | 3.31 | 3.23 | 8.5 | 78.0 | 4800.0 | 24 | 29 | 6785.0 |
43 | 1 | NaN | 94.5 | 155.9 | 63.6 | 52.0 | 1874 | 90 | 3.03 | 3.11 | 9.6 | 70.0 | 5400.0 | 38 | 43 | NaN |
44 | 0 | NaN | 94.5 | 155.9 | 63.6 | 52.0 | 1909 | 90 | 3.03 | 3.11 | 9.6 | 70.0 | 5400.0 | 38 | 43 | NaN |
45 | 2 | NaN | 96.0 | 172.6 | 65.2 | 51.4 | 2734 | 119 | 3.43 | 3.23 | 9.2 | 90.0 | 5000.0 | 24 | 29 | 11048.0 |
47 | 0 | NaN | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | 258 | 3.63 | 4.17 | 8.1 | 176.0 | 4750.0 | 15 | 19 | 35550.0 |
48 | 0 | NaN | 102.0 | 191.7 | 70.6 | 47.8 | 3950 | 326 | 3.54 | 2.76 | 11.5 | 262.0 | 5000.0 | 13 | 17 | 36000.0 |
54 | 3 | 150.0 | 95.3 | 169.0 | 65.7 | 49.6 | 2380 | 70 | NaN | NaN | 9.4 | 101.0 | 6000.0 | 17 | 23 | 10945.0 |
55 | 3 | 150.0 | 95.3 | 169.0 | 65.7 | 49.6 | 2380 | 70 | NaN | NaN | 9.4 | 101.0 | 6000.0 | 17 | 23 | 11845.0 |
56 | 3 | 150.0 | 95.3 | 169.0 | 65.7 | 49.6 | 2385 | 70 | NaN | NaN | 9.4 | 101.0 | 6000.0 | 17 | 23 | 13645.0 |
57 | 3 | 150.0 | 95.3 | 169.0 | 65.7 | 49.6 | 2500 | 80 | NaN | NaN | 9.4 | 135.0 | 6000.0 | 16 | 23 | 15645.0 |
62 | 0 | NaN | 98.8 | 177.8 | 66.5 | 55.5 | 2443 | 122 | 3.39 | 3.39 | 22.7 | 64.0 | 4650.0 | 36 | 42 | 10795.0 |
65 | 0 | NaN | 104.9 | 175.0 | 66.1 | 54.4 | 2700 | 134 | 3.43 | 3.64 | 22.0 | 72.0 | 4200.0 | 31 | 39 | 18344.0 |
70 | -1 | NaN | 115.6 | 202.6 | 71.7 | 56.5 | 3740 | 234 | 3.46 | 3.10 | 8.3 | 155.0 | 4750.0 | 16 | 18 | 34184.0 |
72 | 0 | NaN | 120.9 | 208.1 | 71.7 | 56.7 | 3900 | 308 | 3.80 | 3.35 | 8.0 | 184.0 | 4500.0 | 14 | 16 | 40960.0 |
73 | 1 | NaN | 112.0 | 199.2 | 72.0 | 55.4 | 3715 | 304 | 3.80 | 3.35 | 8.0 | 184.0 | 4500.0 | 14 | 16 | 45400.0 |
74 | 1 | NaN | 102.7 | 178.4 | 68.0 | 54.8 | 2910 | 140 | 3.78 | 3.12 | 8.0 | 175.0 | 5000.0 | 19 | 24 | 16503.0 |
81 | 3 | NaN | 95.9 | 173.2 | 66.3 | 50.2 | 2833 | 156 | 3.58 | 3.86 | 7.0 | 145.0 | 5000.0 | 19 | 24 | 12629.0 |
82 | 3 | NaN | 95.9 | 173.2 | 66.3 | 50.2 | 2921 | 156 | 3.59 | 3.86 | 7.0 | 145.0 | 5000.0 | 19 | 24 | 14869.0 |
83 | 3 | NaN | 95.9 | 173.2 | 66.3 | 50.2 | 2926 | 156 | 3.59 | 3.86 | 7.0 | 145.0 | 5000.0 | 19 | 24 | 14489.0 |
108 | 0 | NaN | 114.2 | 198.9 | 68.4 | 58.7 | 3230 | 120 | 3.46 | 3.19 | 8.4 | 97.0 | 5000.0 | 19 | 24 | 12440.0 |
109 | 0 | NaN | 114.2 | 198.9 | 68.4 | 58.7 | 3430 | 152 | 3.70 | 3.52 | 21.0 | 95.0 | 4150.0 | 25 | 25 | 13860.0 |
112 | 0 | NaN | 114.2 | 198.9 | 68.4 | 56.7 | 3285 | 120 | 3.46 | 2.19 | 8.4 | 95.0 | 5000.0 | 19 | 24 | 16695.0 |
113 | 0 | NaN | 114.2 | 198.9 | 68.4 | 58.7 | 3485 | 152 | 3.70 | 3.52 | 21.0 | 95.0 | 4150.0 | 25 | 25 | 17075.0 |
123 | 3 | NaN | 95.9 | 173.2 | 66.3 | 50.2 | 2818 | 156 | 3.59 | 3.86 | 7.0 | 145.0 | 5000.0 | 19 | 24 | 12764.0 |
125 | 3 | NaN | 89.5 | 168.9 | 65.0 | 51.6 | 2756 | 194 | 3.74 | 2.90 | 9.5 | 207.0 | 5900.0 | 17 | 25 | 32528.0 |
126 | 3 | NaN | 89.5 | 168.9 | 65.0 | 51.6 | 2756 | 194 | 3.74 | 2.90 | 9.5 | 207.0 | 5900.0 | 17 | 25 | 34028.0 |
127 | 3 | NaN | 89.5 | 168.9 | 65.0 | 51.6 | 2800 | 194 | 3.74 | 2.90 | 9.5 | 207.0 | 5900.0 | 17 | 25 | 37028.0 |
128 | 1 | NaN | 98.4 | 175.7 | 72.3 | 50.5 | 3366 | 203 | 3.94 | 3.11 | 10.0 | 288.0 | 5750.0 | 17 | 28 | NaN |
129 | 0 | NaN | 96.1 | 181.5 | 66.5 | 55.2 | 2579 | 132 | 3.46 | 3.90 | 8.7 | NaN | NaN | 23 | 31 | 9295.0 |
130 | 2 | NaN | 96.1 | 176.8 | 66.6 | 50.5 | 2460 | 132 | 3.46 | 3.90 | 8.7 | NaN | NaN | 23 | 31 | 9895.0 |
180 | -1 | NaN | 104.5 | 187.8 | 66.5 | 54.1 | 3151 | 161 | 3.27 | 3.35 | 9.2 | 156.0 | 5200.0 | 19 | 24 | 15750.0 |
188 | 3 | NaN | 94.5 | 159.3 | 64.2 | 55.6 | 2254 | 109 | 3.19 | 3.40 | 8.5 | 90.0 | 5500.0 | 24 | 29 | 11595.0 |
190 | 0 | NaN | 100.4 | 180.2 | 66.9 | 55.1 | 2661 | 136 | 3.19 | 3.40 | 8.5 | 110.0 | 5500.0 | 19 | 24 | 13295.0 |
191 | 0 | NaN | 100.4 | 180.2 | 66.9 | 55.1 | 2579 | 97 | 3.01 | 3.40 | 23.0 | 68.0 | 4500.0 | 33 | 38 | 13845.0 |
192 | 0 | NaN | 100.4 | 183.1 | 66.9 | 55.1 | 2563 | 109 | 3.19 | 3.40 | 9.0 | 88.0 | 5500.0 | 25 | 31 | 12290.0 |
Here's how the missing values will be handled: rows missing any of Bore, Stroke, Horse_power, Peak_rpm, or Price are dropped, while the 40 missing Normalized_loss values are filled with the column mean.
cars = cars.dropna(subset = ['Bore', 'Stroke', 'Horse_power', 'Peak_rpm', 'Price'])
avg_loss = cars['Normalized_loss'].mean()
cars['Normalized_loss'] = cars['Normalized_loss'].fillna(value = avg_loss)
cars.isna().sum()
Symbol               0
Normalized_loss      0
Wheel_base           0
Length               0
Width                0
Height               0
Curb_weight          0
Engine_size          0
Bore                 0
Stroke               0
Compression_ratio    0
Horse_power          0
Peak_rpm             0
City_mpg             0
Highway_mpg          0
Price                0
dtype: int64
# imputing price column
# from sklearn.impute import KNNImputer
# imputer = KNNImputer(n_neighbors=5)
# cars = pd.DataFrame(imputer.fit_transform(cars),columns = cars.columns)
# cars.isna().sum()
1. Univariate Model
We'll use the holdout validation and K-Fold cross-validation methods to build the predictive model.
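The two schemes differ only in how the rows are partitioned: holdout sets aside one fixed test half, while k-fold holds out each row exactly once. A minimal numpy-only sketch of the index bookkeeping (illustrative, not the project's code):

```python
import numpy as np

n = 10  # toy number of rows
rng = np.random.RandomState(0)
indices = rng.permutation(n)

# Holdout validation: one 50/50 split
train_idx, test_idx = indices[:n // 2], indices[n // 2:]

# K-fold cross-validation: partition the rows into k disjoint folds,
# so every row is held out exactly once across the k rounds
k = 5
folds = np.array_split(indices, k)
held_out = np.concatenate(folds)
assert sorted(held_out) == list(range(n))  # every row appears once
```
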
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
def knn_train_test_univariate(df, feature_col, target_col):
    # Shuffle and split 50/50 into train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index=shuffled_index)
    split_loc = int(0.5 * len(df))
    train_set = df.iloc[:split_loc].copy()
    test_set = df.iloc[split_loc:].copy()
    # Model building
    model = KNeighborsRegressor()
    model.fit(train_set[[feature_col]], train_set[target_col])
    predictions = model.predict(test_set[[feature_col]])
    rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))
    return rmse
all_features = cars.columns.tolist()
all_features.remove('Price')
rmse_dict = {}
for col in all_features:
    rmse_dict[col] = knn_train_test_univariate(cars, col, 'Price')
rmse_dict = sorted(rmse_dict.items(), key=lambda x: x[1])
print('The following are the rmse values for each feature column:')
print(' ')
rmse_dict
The following are the rmse values for each feature column:
[('Engine_size', 2768.7024577193997), ('City_mpg', 3801.9522491944367), ('Horse_power', 3840.8349432699265), ('Highway_mpg', 3961.0130844072405), ('Curb_weight', 4017.1563713857277), ('Width', 4540.626474420263), ('Length', 5332.7291227377045), ('Wheel_base', 5930.3202334562175), ('Compression_ratio', 6406.467140561386), ('Bore', 6773.123772768155), ('Stroke', 6827.417884190247), ('Peak_rpm', 7014.6680399077395), ('Symbol', 7354.647765392282), ('Height', 7610.692804694108), ('Normalized_loss', 7987.618494882803)]
Using the default number of neighbors (k = 5), engine size gave the best prediction of car prices.
Modifying the function to use various values of k:
def knn_train_test_univariate(df, feature_col, target_col, k_values):
    # Shuffle and split 50/50 into train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index=shuffled_index)
    split_loc = int(0.5 * len(df))
    train_set = df.iloc[:split_loc].copy()
    test_set = df.iloc[split_loc:].copy()
    # Model building: one RMSE per value of k
    k_rmse = {}
    for k in k_values:
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(train_set[[feature_col]], train_set[target_col])
        predictions = model.predict(test_set[[feature_col]])
        k_rmse[k] = np.sqrt(mean_squared_error(test_set[target_col], predictions))
    return k_rmse
# which k gives the best model for each feature column?
hyper_params = [1,3,5,7,9]
rmses_dict = {}
for col in all_features:
    rmses_dict[col] = knn_train_test_univariate(cars, col, 'Price', hyper_params)
print('Features and their error metrics for each k-value: ')
rmses_dict
Features and their error metrics for each k-value:
{'Bore': {1: 6525.949489711673, 3: 6541.776528330883, 5: 6773.123772768155, 7: 6729.9829252536565, 9: 6357.036118920397}, 'City_mpg': {1: 4664.349620412614, 3: 3720.345109563052, 5: 3801.9522491944367, 7: 4180.392991380315, 9: 4296.643528372117}, 'Compression_ratio': {1: 8429.72226860252, 3: 7020.903704767295, 5: 6406.467140561386, 7: 6184.377890549715, 9: 6411.835343433226}, 'Curb_weight': {1: 5496.564537175945, 3: 4695.644900063398, 5: 4017.1563713857277, 7: 3992.3539935803296, 9: 4117.954136455427}, 'Engine_size': {1: 3135.512141860174, 3: 2765.713844538874, 5: 2768.7024577193997, 7: 3129.136821023901, 9: 3428.5532010324764}, 'Height': {1: 9213.420979437002, 3: 8154.015954739659, 5: 7610.692804694108, 7: 7434.132286129642, 9: 7225.565525315033}, 'Highway_mpg': {1: 4912.499136171588, 3: 4013.776142072605, 5: 3961.0130844072405, 7: 4132.843966396334, 9: 4141.612716576377}, 'Horse_power': {1: 4025.957210890042, 3: 4071.865575599259, 5: 3840.8349432699265, 7: 3759.8430295951794, 9: 3855.726260320768}, 'Length': {1: 4729.385459522438, 3: 4641.256909859858, 5: 5332.7291227377045, 7: 5699.182523221176, 9: 5906.604384918108}, 'Normalized_loss': {1: 6575.726293624134, 3: 7150.762204024888, 5: 7987.618494882803, 7: 7959.284364487943, 9: 7799.910037272096}, 'Peak_rpm': {1: 8427.859651731176, 3: 7314.371808190612, 5: 7014.6680399077395, 7: 7115.832016066374, 9: 7396.66892661851}, 'Stroke': {1: 6192.3391013729015, 3: 6581.129120229773, 5: 6827.417884190247, 7: 6995.072405390737, 9: 7340.819986586927}, 'Symbol': {1: 6760.109693796761, 3: 7667.881427369129, 5: 7354.647765392282, 7: 7272.019699866985, 9: 7201.320970947845}, 'Wheel_base': {1: 4315.016501981323, 3: 5749.194776628336, 5: 5930.3202334562175, 7: 6209.168849787841, 9: 6301.256464775368}, 'Width': {1: 3336.9678748735573, 3: 4697.335149820219, 5: 4540.626474420263, 7: 4829.883257740291, 9: 4944.217366473074}}
# Creating a dataframe to hold each feature and its error metrics for each value of k
data = pd.DataFrame.from_dict(rmses_dict)
# data.insert(loc = 0, column = 'N_neighbors', value = [1,3,5,7,9])
data
Bore | City_mpg | Compression_ratio | Curb_weight | Engine_size | Height | Highway_mpg | Horse_power | Length | Normalized_loss | Peak_rpm | Stroke | Symbol | Wheel_base | Width | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 6525.949490 | 4664.349620 | 8429.722269 | 5496.564537 | 3135.512142 | 9213.420979 | 4912.499136 | 4025.957211 | 4729.385460 | 6575.726294 | 8427.859652 | 6192.339101 | 6760.109694 | 4315.016502 | 3336.967875 |
3 | 6541.776528 | 3720.345110 | 7020.903705 | 4695.644900 | 2765.713845 | 8154.015955 | 4013.776142 | 4071.865576 | 4641.256910 | 7150.762204 | 7314.371808 | 6581.129120 | 7667.881427 | 5749.194777 | 4697.335150 |
5 | 6773.123773 | 3801.952249 | 6406.467141 | 4017.156371 | 2768.702458 | 7610.692805 | 3961.013084 | 3840.834943 | 5332.729123 | 7987.618495 | 7014.668040 | 6827.417884 | 7354.647765 | 5930.320233 | 4540.626474 |
7 | 6729.982925 | 4180.392991 | 6184.377891 | 3992.353994 | 3129.136821 | 7434.132286 | 4132.843966 | 3759.843030 | 5699.182523 | 7959.284364 | 7115.832016 | 6995.072405 | 7272.019700 | 6209.168850 | 4829.883258 |
9 | 6357.036119 | 4296.643528 | 6411.835343 | 4117.954136 | 3428.553201 | 7225.565525 | 4141.612717 | 3855.726260 | 5906.604385 | 7799.910037 | 7396.668927 | 7340.819987 | 7201.320971 | 6301.256465 | 4944.217366 |
# Visualizing on line chart
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Plotting all features' RMSE curves on one chart
for col in list(data.columns):
    data[col].plot(figsize=(13, 8))
plt.title('Error metrics for each k value')
plt.xlabel('K value')
plt.ylabel('RMSE values')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
For 'Bore', the model performs best at k = 9 (RMSE of about 6357).
The error metrics for all the other features can be read off the same line chart.
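The best k per feature can also be read off the table programmatically rather than from the chart, using `DataFrame.idxmin`. A small sketch on a toy RMSE table shaped like `data` above (values rounded for illustration):

```python
import pandas as pd

# Toy RMSE table: rows are k values, columns are features
rmse_table = pd.DataFrame(
    {'Bore': [6526.0, 6542.0, 6773.0, 6730.0, 6357.0],
     'Engine_size': [3136.0, 2766.0, 2769.0, 3129.0, 3429.0]},
    index=[1, 3, 5, 7, 9])

# idxmin returns, per column, the index label (here: k) of the lowest RMSE
best_k = rmse_table.idxmin()
print(best_k['Bore'])         # 9
print(best_k['Engine_size'])  # 3
```
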
2. Multivariate Model
# Modifying the function to accept list of feature columns
def knn_train_test_multivariate(df, feature_cols, target_col):
    # Shuffle and split 50/50 into train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index=shuffled_index)
    split_loc = int(0.5 * len(df))
    train_set = df.iloc[:split_loc].copy()
    test_set = df.iloc[split_loc:].copy()
    # Model building with the default k = 5
    model = KNeighborsRegressor(n_neighbors=5)
    model.fit(train_set[feature_cols], train_set[target_col])
    predictions = model.predict(test_set[feature_cols])
    rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))
    return rmse
# Best features: rank columns by their average RMSE across the k values
mean_rmses = data.mean().sort_values()
mean_rmses
best_features = list(mean_rmses.index)
two_features = best_features[:2]
three_features = best_features[:3]
four_features = best_features[:4]
five_features = best_features[:5]
select_lst = [two_features, three_features, four_features,five_features]
select_output = []
for item in select_lst:
    select_output.append(knn_train_test_multivariate(cars, item, 'Price'))
select_dict = {'two_features': select_output[0],
'three_features': select_output[1],
'four_features': select_output[2],
'five_features': select_output[3]
}
select_dict
{'five_features': 3948.5887004549895, 'four_features': 3083.271094949073, 'three_features': 3078.7980786456533, 'two_features': 2996.8873214940795}
At the default k = 5, the best model is the one with two independent variables (Engine_size and Horse_power), with an RMSE of about 2997.
Hyperparameter Optimization
def knn_train_test_multivariate(df, feature_cols, target_col, k_values):
    # Shuffle and split 50/50 into train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index=shuffled_index)
    split_loc = int(0.5 * len(df))
    train_set = df.iloc[:split_loc].copy()
    test_set = df.iloc[split_loc:].copy()
    # Model building: one RMSE per value of k
    k_rmse = {}
    for k in k_values:
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(train_set[feature_cols], train_set[target_col])
        predictions = model.predict(test_set[feature_cols])
        k_rmse[k] = np.sqrt(mean_squared_error(test_set[target_col], predictions))
    return k_rmse
hyp_params = list(range(1, 26))
top_three = [two_features, three_features, four_features]
selected = []
for item in top_three:
    selected.append(knn_train_test_multivariate(cars, item, 'Price', hyp_params))
print('Features and their error metrics for each k-value: ')
best_dict = {'two_features': selected[0],
'three_features': selected[1],
'four_features': selected[2],
}
best_dict
Features and their error metrics for each k-value:
{'four_features': {1: 2610.6627234612483, 2: 2674.6854266330647, 3: 2495.415808033366, 4: 2737.609650630784, 5: 3083.271094949073, 6: 3130.173688680205, 7: 3384.5351687709153, 8: 3560.6258772476385, 9: 3489.3617381363333, 10: 3574.2210022302074, 11: 3642.6804175077923, 12: 3675.8881501828423, 13: 3841.6091062898613, 14: 3925.294206504302, 15: 3882.231349399912, 16: 3946.6363421326932, 17: 3878.929278856831, 18: 3893.0936284329778, 19: 3911.6103208045697, 20: 3961.8887642849104, 21: 3999.9386784900307, 22: 4070.518779726736, 23: 4162.771545655721, 24: 4222.27321782126, 25: 4282.307729786075}, 'three_features': {1: 2534.588538965579, 2: 2680.5878404005143, 3: 2464.771925999155, 4: 2729.525458407154, 5: 3078.7980786456533, 6: 3148.583470886329, 7: 3383.9058919272175, 8: 3577.165165323855, 9: 3494.9357692498993, 10: 3597.8367148899783, 11: 3678.316396079907, 12: 3711.0517621578206, 13: 3860.557996264031, 14: 3930.9075907490123, 15: 3882.5919982742203, 16: 3944.00229722676, 17: 3905.7462793347754, 18: 3905.309340629744, 19: 3916.592517113872, 20: 3942.3319701492446, 21: 3995.3507403242816, 22: 4059.7959116530124, 23: 4156.537589932155, 24: 4237.941348741846, 25: 4287.16475174757}, 'two_features': {1: 2807.9143358595848, 2: 2759.9046982276345, 3: 2354.72887298986, 4: 2587.626322555174, 5: 2996.8873214940795, 6: 3095.941706834793, 7: 3351.095825242845, 8: 3557.2028865114526, 9: 3524.1670038672614, 10: 3599.2079940642616, 11: 3682.8374655370367, 12: 3733.954887479565, 13: 3818.342821734111, 14: 3810.6160085975616, 15: 3805.6453609757978, 16: 3869.710323645115, 17: 3850.6126908392766, 18: 3891.1983452584423, 19: 3926.9249455122713, 20: 3963.2914529408017, 21: 4004.4869239008185, 22: 4079.0805793509444, 23: 4175.556938053969, 24: 4240.634363607186, 25: 4278.664903559106}}
best_df = pd.DataFrame.from_dict(best_dict)
best_df
four_features | three_features | two_features | |
---|---|---|---|
1 | 2610.662723 | 2534.588539 | 2807.914336 |
2 | 2674.685427 | 2680.587840 | 2759.904698 |
3 | 2495.415808 | 2464.771926 | 2354.728873 |
4 | 2737.609651 | 2729.525458 | 2587.626323 |
5 | 3083.271095 | 3078.798079 | 2996.887321 |
6 | 3130.173689 | 3148.583471 | 3095.941707 |
7 | 3384.535169 | 3383.905892 | 3351.095825 |
8 | 3560.625877 | 3577.165165 | 3557.202887 |
9 | 3489.361738 | 3494.935769 | 3524.167004 |
10 | 3574.221002 | 3597.836715 | 3599.207994 |
11 | 3642.680418 | 3678.316396 | 3682.837466 |
12 | 3675.888150 | 3711.051762 | 3733.954887 |
13 | 3841.609106 | 3860.557996 | 3818.342822 |
14 | 3925.294207 | 3930.907591 | 3810.616009 |
15 | 3882.231349 | 3882.591998 | 3805.645361 |
16 | 3946.636342 | 3944.002297 | 3869.710324 |
17 | 3878.929279 | 3905.746279 | 3850.612691 |
18 | 3893.093628 | 3905.309341 | 3891.198345 |
19 | 3911.610321 | 3916.592517 | 3926.924946 |
20 | 3961.888764 | 3942.331970 | 3963.291453 |
21 | 3999.938678 | 3995.350740 | 4004.486924 |
22 | 4070.518780 | 4059.795912 | 4079.080579 |
23 | 4162.771546 | 4156.537590 | 4175.556938 |
24 | 4222.273218 | 4237.941349 | 4240.634364 |
25 | 4282.307730 | 4287.164752 | 4278.664904 |
# Plotting the RMSE curve for each feature set
for col in list(best_df.columns):
    best_df[col].plot(figsize=(13, 8))
plt.title('Error metrics for each k value')
plt.xlabel('K value')
plt.ylabel('RMSE values')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
# Minimum RMSE for each feature set across the k values
best_df.min()
four_features     2495.415808
three_features    2464.771926
two_features      2354.728873
dtype: float64
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
def knn_train_test_kfold(df, feature_cols, target_col, n_folds):
    '''
    n_folds: a list of fold counts to try
    '''
    # Shuffle and split 50/50; cross-validation runs on the training half
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index=shuffled_index)
    split_loc = int(0.5 * len(df))
    train_set = df.iloc[:split_loc].copy()
    test_set = df.iloc[split_loc:].copy()
    # Cross-validate for each fold count; as written, only the RMSE for the
    # last fold count in the list survives the loop and is returned
    for fold in n_folds:
        kf = KFold(fold, shuffle=True, random_state=1)
        model = KNeighborsRegressor()
        mses = cross_val_score(model, train_set[feature_cols], train_set[target_col],
                               scoring="neg_mean_squared_error", cv=kf)
        mean_mse = np.mean(np.abs(mses))
        kfold_rmse = np.sqrt(mean_mse)
    return kfold_rmse
n_folds = [5,9,13,15,19]
kfold_rmses = []
for lst in select_lst:
    kfold_rmses.append(knn_train_test_kfold(cars, lst, 'Price', n_folds))
kfold_dict = {'Two best features': kfold_rmses[0], 'Three best features': kfold_rmses[1],
'Four best features': kfold_rmses[2],'Five best features': kfold_rmses[3]}
kfold_dict
{'Five best features': 4403.893418376636, 'Four best features': 3410.8686574207804, 'Three best features': 3556.772181805177, 'Two best features': 3645.2146005798754}
Summary of Results

1. Univariate model
data
print('The minimum rmse for each predictor variable:')
print(' ')
print(data.min())
The minimum rmse for each predictor variable:

Bore                 6357.036119
City_mpg             3720.345110
Compression_ratio    6184.377891
Curb_weight          3992.353994
Engine_size          2765.713845
Height               7225.565525
Highway_mpg          3961.013084
Horse_power          3759.843030
Length               4641.256910
Normalized_loss      6575.726294
Peak_rpm             7014.668040
Stroke               6192.339101
Symbol               6760.109694
Wheel_base           4315.016502
Width                3336.967875
dtype: float64
Engine size is the best single predictor, with an RMSE of about 2766.
2. Multivariate model
best_df
print('The minimum rmse for each set of predictor variables:')
print(' ')
print(best_df.min())
The minimum rmse for each set of predictor variables:

four_features     2495.415808
three_features    2464.771926
two_features      2354.728873
dtype: float64
The best model here uses the two best features at k = 3, with an RMSE of about 2355.
3. Kfold Validation Technique
kfold_dict
{'Five best features': 4403.893418376636, 'Four best features': 3410.8686574207804, 'Three best features': 3556.772181805177, 'Two best features': 3645.2146005798754}
Under k-fold validation, the best feature set is the four best features, with an RMSE of about 3411.
np.random.seed(0)
shuffled_index = np.random.permutation(cars.index)
cars = cars.reindex(index = shuffled_index)
split_loc = int(0.5*len(cars))
train_set = cars.iloc[:split_loc].copy()
test_set = cars.iloc[split_loc:].copy()
knn_model = KNeighborsRegressor(n_neighbors = 3)
knn_model.fit(train_set[two_features], train_set['Price'])
test_set['Predicted_price'] = knn_model.predict(test_set[two_features])
test_set.head()
Symbol | Normalized_loss | Wheel_base | Length | Width | Height | Curb_weight | Engine_size | Bore | Stroke | Compression_ratio | Horse_power | Peak_rpm | City_mpg | Highway_mpg | Price | Predicted_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
101 | 0 | 108.0 | 100.4 | 184.6 | 66.5 | 56.1 | 3296 | 181 | 3.43 | 3.27 | 9.0 | 152.0 | 5200.0 | 17 | 22 | 14399.0 | 16365.666667 |
199 | -1 | 95.0 | 109.1 | 188.8 | 68.9 | 55.5 | 2952 | 141 | 3.78 | 3.15 | 9.5 | 114.0 | 5400.0 | 23 | 28 | 16845.0 | 17518.333333 |
102 | 0 | 108.0 | 100.4 | 184.6 | 66.5 | 55.1 | 3060 | 181 | 3.43 | 3.27 | 9.0 | 152.0 | 5200.0 | 19 | 25 | 13499.0 | 16365.666667 |
71 | 3 | 142.0 | 96.6 | 180.3 | 70.5 | 50.8 | 3685 | 234 | 3.46 | 3.10 | 8.3 | 155.0 | 4750.0 | 16 | 18 | 35056.0 | 33994.666667 |
147 | 0 | 85.0 | 96.9 | 173.6 | 65.4 | 54.9 | 2420 | 108 | 3.62 | 2.64 | 9.0 | 82.0 | 4800.0 | 23 | 29 | 8013.0 | 7911.000000 |
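The final model's holdout error can be computed directly from the Price and Predicted_price columns. A self-contained sketch of the RMSE calculation, using a few toy values in place of the full test set:

```python
import numpy as np

# Toy actual and predicted prices (stand-ins for the columns above)
actual = np.array([14399.0, 16845.0, 13499.0])
predicted = np.array([16365.67, 17518.33, 16365.67])

# RMSE: root of the mean squared difference between actual and predicted
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 2))
```

On the real test set this would be `np.sqrt(mean_squared_error(test_set['Price'], test_set['Predicted_price']))`, matching the metric used throughout the notebook.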