In this project, we'll practice the machine learning workflow to predict a car's market price using its attributes.
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars.

The second entity, the insurance risk rating, corresponds to the degree to which the auto is more risky than its price indicates:
Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky than its price suggests, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky; -3 that it is probably quite safe.
The third factor is the relative average loss payment per insured vehicle year. This value is normalized across all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc.) and represents the average loss per car per year.
You can read more about the data set [here](https://archive.ics.uci.edu/ml/datasets/automobile).
import pandas as pd
import numpy as np
pd.options.display.max_columns = 99
column_names = ['symboling', 'normalized_losses', 'make', 'fuel_type',
                'aspiration', 'num_doors', 'body_style', 'drive_wheels',
                'engine_location', 'wheel_base', 'length', 'width', 'height',
                'curb_weight', 'engine_type', 'num_cylinders', 'engine_size',
                'fuel_system', 'bore', 'stroke', 'compression_ratio', 'horsepower',
                'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
cars = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data',
                   names=column_names)
cars.head()
symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | length | width | height | curb_weight | engine_type | num_cylinders | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
# Select only the columns with continuous values from
#https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
continuous_values_cols = ['normalized_losses', 'wheel_base', 'length', 'width',
                          'height', 'curb_weight', 'engine_size', 'bore', 'stroke',
                          'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
                          'highway_mpg', 'price']
numeric_cars = cars[continuous_values_cols]
# Select feature columns
features = numeric_cars.columns.tolist()
features.remove('price')
numeric_cars.head()
normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ? | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | ? | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | ? | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 164 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 164 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
# Replace all of the ? values with np.nan in the dataframe
numeric_cars = numeric_cars.replace('?', np.nan).astype('float')
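As a quick sketch of why the order matters here (toy frame, hypothetical values): the `'?'` placeholders must become `np.nan` first, because `astype('float')` cannot parse `'?'` as a number.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the '?' placeholders in the raw UCI file (hypothetical values)
toy = pd.DataFrame({'normalized_losses': ['?', '164', '158'],
                    'price': ['13495', '?', '16500']})

# '?' -> NaN first; only then can the remaining strings be cast to floats
cleaned = toy.replace('?', np.nan).astype('float')
print(cleaned.isnull().sum().sum())  # 2 missing cells in total
```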
# Check missing values
numeric_cars.isnull().sum()
normalized_losses    41
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_size           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64
# Because `price` is the column we want to predict, let's remove any rows with missing `price` values.
numeric_cars = numeric_cars.dropna(subset=['price'])
numeric_cars.isnull().sum()
normalized_losses    37
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_size           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 0
dtype: int64
# Replace missing values in other columns using column means.
numeric_cars = numeric_cars.fillna(numeric_cars.mean())
# Confirm that there's no more missing values!
numeric_cars.isnull().sum()
normalized_losses    0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_size          0
bore                 0
stroke               0
compression_ratio    0
horsepower           0
peak_rpm             0
city_mpg             0
highway_mpg          0
price                0
dtype: int64
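A toy sketch of the column-mean imputation used above (hypothetical values): `fillna(df.mean())` fills each missing cell with the mean of the non-missing values in its own column.

```python
import numpy as np
import pandas as pd

# One column with a single missing value (hypothetical)
toy = pd.DataFrame({'bore': [3.0, np.nan, 4.0]})

# The NaN is replaced with the column mean, (3.0 + 4.0) / 2 = 3.5
filled = toy.fillna(toy.mean())
print(filled['bore'].tolist())  # [3.0, 3.5, 4.0]
```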
def knn_train_test(df, features, target, k=[5]):
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Normalize all columns to range from 0 to 1 except the target column
    target_col = df[target]
    df = (df - df.min()) / (df.max() - df.min())
    df[target] = target_col

    # Split the full dataset into random train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(df[features], df[target],
                                                        test_size=0.25, random_state=1)
    k_rmses = dict()
    for k_value in k:
        # Instantiate the model
        model = KNeighborsRegressor(n_neighbors=k_value, algorithm='brute')
        # Fit the model to the training data
        model.fit(X_train, y_train)
        # Make predictions on the test set
        predictions = model.predict(X_test)
        # Calculate RMSE
        mse = mean_squared_error(y_test, predictions)
        k_rmses[k_value] = np.sqrt(mse)
    return k_rmses
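`knn_train_test` bundles three steps: min-max normalization, a 75/25 train/test split, and a brute-force k-nearest neighbors fit scored with RMSE. A minimal standalone sketch of the same pattern on synthetic data (all values hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)
X = rng.rand(200, 1) * 10            # one synthetic feature
y = 3 * X.ravel() + rng.randn(200)   # noisy linear target

# Min-max normalize the feature, as knn_train_test does for every column
X = (X - X.min()) / (X.max() - X.min())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)
model = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
model.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
```

With noise of standard deviation 1, the hold-out RMSE lands close to the noise level, which is the best any regressor could do here.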
k_rmse_univariate = dict()
# For each feature, train a model and add its RMSE values to `k_rmse_univariate`.
for f in features:
    k_rmse_univariate[f] = knn_train_test(numeric_cars, [f], 'price',
                                          k=list(range(1, 26)))
k_min_rmse = dict()
# For each feature, find the k value in `k_rmse_univariate` that minimizes RMSE.
for k, v in k_rmse_univariate.items():
    k_min = min(v, key=v.get)
    k_min_rmse['{}, k={}'.format(k, k_min)] = v[k_min]

# Create a Series object from the dictionary so we can easily view the results
pd.Series(k_min_rmse).sort_values()
city_mpg, k=3             2924.835567
curb_weight, k=13         3215.747970
horsepower, k=6           3299.004162
engine_size, k=3          3395.244497
width, k=12               3527.358721
highway_mpg, k=8          3537.608742
wheel_base, k=2           3662.498026
length, k=6               3836.714751
compression_ratio, k=2    5276.134084
peak_rpm, k=3             5981.174124
bore, k=3                 6017.613253
normalized_losses, k=5    6227.674033
stroke, k=6               6596.325642
height, k=15              6838.981068
dtype: float64
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(12, 5))
# Visualize the results using a line plot
for k, v in k_rmse_univariate.items():
    x = list(v.keys())
    y = list(v.values())
    plt.plot(x, y)
plt.xlabel('k value')
plt.ylabel('RMSE')
#plt.legend(k_rmse_univariate.keys())
# Compute the average RMSE across different `k` values for each feature.
feature_avg_rmse = dict()
for k, v in k_rmse_univariate.items():
    feature_avg_rmse[k] = np.mean(list(v.values()))
sorted_avg_rmse = pd.Series(feature_avg_rmse).sort_values()
print(sorted_avg_rmse)
sorted_features = list(sorted_avg_rmse.index)
curb_weight          3626.986018
city_mpg             3745.587829
width                3869.814154
engine_size          3912.575642
horsepower           3939.670381
highway_mpg          4057.281638
length               4295.034784
wheel_base           4947.761896
compression_ratio    6090.360368
bore                 6558.435372
normalized_losses    6937.020426
peak_rpm             7186.755973
height               7190.287096
stroke               7350.891992
dtype: float64
k_rmse_multivariate = dict()
for nr_best_feats in range(2, len(features)):
    # Use a separate name so the outer `features` list is not overwritten
    best_feats = sorted_features[:nr_best_feats]
    k_rmse_multivariate['{} features {}'.format(nr_best_feats, best_feats)] = \
        knn_train_test(numeric_cars, best_feats, 'price', list(range(1, 26)))
k_min_rmse = dict()
# For each feature group, find the k value in `k_rmse_multivariate` that minimizes RMSE.
for k, v in k_rmse_multivariate.items():
    k_min = min(v, key=v.get)
    k_min_rmse['{}, k={}'.format(k, k_min)] = v[k_min]

# Create a Series object from the dictionary so we can easily view the results
pd.Series(k_min_rmse).sort_values()
5 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower'], k=2    2020.421278
4 features ['curb_weight', 'city_mpg', 'width', 'engine_size'], k=2    2033.850822
11 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base', 'compression_ratio', 'bore', 'normalized_losses'], k=2    2071.718283
6 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg'], k=3    2099.756135
9 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base', 'compression_ratio'], k=2    2170.566203
7 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length'], k=2    2183.376718
12 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base', 'compression_ratio', 'bore', 'normalized_losses', 'peak_rpm'], k=2    2250.232600
10 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base', 'compression_ratio', 'bore'], k=2    2296.830525
13 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base', 'compression_ratio', 'bore', 'normalized_losses', 'peak_rpm', 'height'], k=2    2367.526830
3 features ['curb_weight', 'city_mpg', 'width'], k=10    2441.199189
8 features ['curb_weight', 'city_mpg', 'width', 'engine_size', 'horsepower', 'highway_mpg', 'length', 'wheel_base'], k=2    2462.637202
2 features ['curb_weight', 'city_mpg'], k=6    2907.042683
dtype: float64
plt.figure(figsize=(12, 5))
# Visualize the results using a line plot
for k, v in k_rmse_multivariate.items():
    x = list(v.keys())
    y = list(v.values())
    plt.plot(x, y)
plt.xlabel('k value')
plt.ylabel('RMSE')
#plt.legend(k_rmse_multivariate.keys())
To build a better k-nearest neighbors model, we can change the features it uses or tweak the number of neighbors (a hyperparameter). To accurately understand a model's performance, we can perform k-fold cross validation and select the proper number of folds.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=1)
knn_model = KNeighborsRegressor(n_neighbors=5)
mses = cross_val_score(knn_model, numeric_cars[features], numeric_cars['price'],
                       scoring='neg_mean_squared_error', cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]
for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    knn_model = KNeighborsRegressor(n_neighbors=5)
    mses = cross_val_score(knn_model, numeric_cars[features], numeric_cars['price'],
                           scoring='neg_mean_squared_error', cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))
3 folds:  avg RMSE:  3585.1705141779657  std RMSE:  285.68226373226037
5 folds:  avg RMSE:  3351.753259983704  std RMSE:  649.1333560711732
7 folds:  avg RMSE:  3420.7190682826645  std RMSE:  621.4931409045394
9 folds:  avg RMSE:  3321.930266658537  std RMSE:  1063.806444897892
10 folds:  avg RMSE:  3312.480907694746  std RMSE:  920.594563206525
11 folds:  avg RMSE:  3300.6748284059954  std RMSE:  1168.0914570412504
13 folds:  avg RMSE:  3290.836162203311  std RMSE:  1135.2193763125706
15 folds:  avg RMSE:  3233.735625360849  std RMSE:  1201.766093749223
17 folds:  avg RMSE:  3135.4364936060138  std RMSE:  1378.5750347626836
19 folds:  avg RMSE:  3166.1693774141836  std RMSE:  1433.4631163267445
21 folds:  avg RMSE:  3109.8444036695387  std RMSE:  1452.0442590555754
23 folds:  avg RMSE:  3074.8357538573023  std RMSE:  1464.4334018546076
As you increase the number of folds, the number of observations in each fold decreases, and the variance of the fold-by-fold errors increases.

The standard deviation of the RMSE values can serve as a proxy for a model's variance, while the average RMSE serves as a proxy for its bias. Bias and variance are the two observable sources of error in a model that we can indirectly control.
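As a toy illustration (all numbers hypothetical): two models can have comparable average RMSE (the bias proxy) while differing sharply in the spread of their fold-by-fold RMSEs (the variance proxy).

```python
import numpy as np

# Hypothetical fold-by-fold RMSEs for two models
stable = np.array([3300., 3350., 3320., 3340., 3310.])
erratic = np.array([2500., 4200., 3000., 4500., 2600.])

# Comparable average error (bias proxy) ...
print(np.mean(stable), np.mean(erratic))
# ... but a far larger spread for the second model (variance proxy)
print(np.std(stable), np.std(erratic))
```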
def knn_train_validate(df, features, target, k_neighbors=[5], n_folds=[10]):
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import KFold, cross_val_score

    # Normalize all columns to range from 0 to 1 except the target column
    target_col = df[target]
    df = (df - df.min()) / (df.max() - df.min())
    df[target] = target_col

    k_folds_rmses = dict()
    for fold in n_folds:
        kf = KFold(fold, shuffle=True, random_state=1)
        # Create a fresh dict for every fold count; a single dict created
        # outside this loop would be shared by all entries of k_folds_rmses
        k_rmses = dict()
        for k in k_neighbors:
            model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
            mses = cross_val_score(model, df[features], df[target],
                                   scoring='neg_mean_squared_error', cv=kf)
            rmses = np.sqrt(np.absolute(mses))
            k_rmses[k] = np.mean(rmses)
        k_folds_rmses[fold] = k_rmses
    return k_folds_rmses
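When collecting per-fold results in nested dicts, note that Python dictionaries are stored by reference: if the inner dict is created once outside the fold loop, every entry of the outer dict aliases the same object, so all fold entries end up showing identical (last-iteration) values. A minimal illustration:

```python
results = dict()
shared = dict()              # created once, OUTSIDE the loop
for fold in [5, 7]:
    shared[fold] = fold * 10
    results[fold] = shared   # every entry references the SAME dict

# Both entries alias one object and reflect all mutations
print(results[5] is results[7])  # True
print(results[5])                # {5: 50, 7: 70}
```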
k_folds_univariate = dict()
for f in features:
    k_folds_univariate[f] = knn_train_validate(numeric_cars, [f], 'price',
                                               k_neighbors=list(range(1, 6)),
                                               n_folds=[5, 7, 9, 10])
for k, v in k_folds_univariate.items():
    print(k, v)
curb_weight {5: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 7: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 9: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}, 10: {1: 5131.016591427542, 2: 4520.280316493375, 3: 4585.946768732755, 4: 4381.7272850452155, 5: 4248.131164348986}}
city_mpg {5: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 7: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 9: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}, 10: {1: 4456.365948231014, 2: 4400.334452969808, 3: 4331.565256334018, 4: 4204.670525954754, 5: 4212.517293004572}}
width {5: {1: 4990.5188305466945, 2: 4383.158956155274, 3: 4217.846360101605, 4: 4144.0211767376195, 5: 4274.563982170558}, 7: {1: 4990.5188305466945, 2: 4383.158956155274, 3: 4217.846360101605, 4: 4144.0211767376195, 5: 4274.563982170558}, 9: {1: 4990.5188305466945, 2: 4383.158956155274, 3: 4217.846360101605, 4: 4144.0211767376195, 5: 4274.563982170558}, 10: {1: 4990.5188305466945, 2: 4383.158956155274, 3: 4217.846360101605, 4: 4144.0211767376195, 5: 4274.563982170558}}
engine_size {5: {1: 3851.652772164489, 2: 3634.990367306872, 3: 3264.0359997322735, 4: 2950.275310463868, 5: 3022.066973616294}, 7: {1: 3851.652772164489, 2: 3634.990367306872, 3: 3264.0359997322735, 4: 2950.275310463868, 5: 3022.066973616294}, 9: {1: 3851.652772164489, 2: 3634.990367306872, 3: 3264.0359997322735, 4: 2950.275310463868, 5: 3022.066973616294}, 10: {1: 3851.652772164489, 2: 3634.990367306872, 3: 3264.0359997322735, 4: 2950.275310463868, 5: 3022.066973616294}}
horsepower {5: {1: 3606.4020812330878, 2: 3951.2604703770944, 3: 3806.7749777442577, 4: 3838.889193011517, 5: 3846.259271724578}, 7: {1: 3606.4020812330878, 2: 3951.2604703770944, 3: 3806.7749777442577, 4: 3838.889193011517, 5: 3846.259271724578}, 9: {1: 3606.4020812330878, 2: 3951.2604703770944, 3: 3806.7749777442577, 4: 3838.889193011517, 5: 3846.259271724578}, 10: {1: 3606.4020812330878, 2: 3951.2604703770944, 3: 3806.7749777442577, 4: 3838.889193011517, 5: 3846.259271724578}}
highway_mpg {5: {1: 5430.30488809365, 2: 4971.605816188208, 3: 4665.947558293721, 4: 4361.242615243273, 5: 4272.058028231421}, 7: {1: 5430.30488809365, 2: 4971.605816188208, 3: 4665.947558293721, 4: 4361.242615243273, 5: 4272.058028231421}, 9: {1: 5430.30488809365, 2: 4971.605816188208, 3: 4665.947558293721, 4: 4361.242615243273, 5: 4272.058028231421}, 10: {1: 5430.30488809365, 2: 4971.605816188208, 3: 4665.947558293721, 4: 4361.242615243273, 5: 4272.058028231421}}
length {5: {1: 5092.264401956441, 2: 5147.060254596485, 3: 5230.033935409053, 4: 5172.147017452602, 5: 5049.551484658347}, 7: {1: 5092.264401956441, 2: 5147.060254596485, 3: 5230.033935409053, 4: 5172.147017452602, 5: 5049.551484658347}, 9: {1: 5092.264401956441, 2: 5147.060254596485, 3: 5230.033935409053, 4: 5172.147017452602, 5: 5049.551484658347}, 10: {1: 5092.264401956441, 2: 5147.060254596485, 3: 5230.033935409053, 4: 5172.147017452602, 5: 5049.551484658347}}
wheel_base {5: {1: 4838.811735426543, 2: 4464.687413319423, 3: 4788.918931896196, 4: 5334.325476326249, 5: 5676.480374517227}, 7: {1: 4838.811735426543, 2: 4464.687413319423, 3: 4788.918931896196, 4: 5334.325476326249, 5: 5676.480374517227}, 9: {1: 4838.811735426543, 2: 4464.687413319423, 3: 4788.918931896196, 4: 5334.325476326249, 5: 5676.480374517227}, 10: {1: 4838.811735426543, 2: 4464.687413319423, 3: 4788.918931896196, 4: 5334.325476326249, 5: 5676.480374517227}}
compression_ratio {5: {1: 7461.267154949103, 2: 6564.209225260229, 3: 6491.506885202596, 4: 6599.814783017793, 5: 6511.23288304368}, 7: {1: 7461.267154949103, 2: 6564.209225260229, 3: 6491.506885202596, 4: 6599.814783017793, 5: 6511.23288304368}, 9: {1: 7461.267154949103, 2: 6564.209225260229, 3: 6491.506885202596, 4: 6599.814783017793, 5: 6511.23288304368}, 10: {1: 7461.267154949103, 2: 6564.209225260229, 3: 6491.506885202596, 4: 6599.814783017793, 5: 6511.23288304368}}
bore {5: {1: 9362.960123667412, 2: 10160.346090504572, 3: 9266.74556313814, 4: 6673.215005041127, 5: 6835.570995420826}, 7: {1: 9362.960123667412, 2: 10160.346090504572, 3: 9266.74556313814, 4: 6673.215005041127, 5: 6835.570995420826}, 9: {1: 9362.960123667412, 2: 10160.346090504572, 3: 9266.74556313814, 4: 6673.215005041127, 5: 6835.570995420826}, 10: {1: 9362.960123667412, 2: 10160.346090504572, 3: 9266.74556313814, 4: 6673.215005041127, 5: 6835.570995420826}}
normalized_losses {5: {1: 7106.493869468242, 2: 6632.965725360233, 3: 6651.486193575065, 4: 7629.8840283747295, 5: 7651.707666484876}, 7: {1: 7106.493869468242, 2: 6632.965725360233, 3: 6651.486193575065, 4: 7629.8840283747295, 5: 7651.707666484876}, 9: {1: 7106.493869468242, 2: 6632.965725360233, 3: 6651.486193575065, 4: 7629.8840283747295, 5: 7651.707666484876}, 10: {1: 7106.493869468242, 2: 6632.965725360233, 3: 6651.486193575065, 4: 7629.8840283747295, 5: 7651.707666484876}}
peak_rpm {5: {1: 8564.51186629666, 2: 8867.308045596194, 3: 8160.038603897234, 4: 7190.544845980863, 5: 7123.500203097938}, 7: {1: 8564.51186629666, 2: 8867.308045596194, 3: 8160.038603897234, 4: 7190.544845980863, 5: 7123.500203097938}, 9: {1: 8564.51186629666, 2: 8867.308045596194, 3: 8160.038603897234, 4: 7190.544845980863, 5: 7123.500203097938}, 10: {1: 8564.51186629666, 2: 8867.308045596194, 3: 8160.038603897234, 4: 7190.544845980863, 5: 7123.500203097938}}
height {5: {1: 10559.769233683894, 2: 8626.85043904975, 3: 8122.506562495744, 4: 7649.157681521623, 5: 7617.330945525906}, 7: {1: 10559.769233683894, 2: 8626.85043904975, 3: 8122.506562495744, 4: 7649.157681521623, 5: 7617.330945525906}, 9: {1: 10559.769233683894, 2: 8626.85043904975, 3: 8122.506562495744, 4: 7649.157681521623, 5: 7617.330945525906}, 10: {1: 10559.769233683894, 2: 8626.85043904975, 3: 8122.506562495744, 4: 7649.157681521623, 5: 7617.330945525906}}