The goal of this project is to predict the market price of a used car from its attributes. The data set contains information on various cars. Each car is described by technical aspects of the vehicle such as the engine's displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more. You can read more about the data set here and can download it directly from here. The k-nearest neighbors algorithm will be used to predict the price of a used car.
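As a minimal illustration of the idea, with hypothetical toy numbers: a k-NN regressor predicts the price of a car as the average price of the k most similar cars in the training data.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: horsepower -> price
X = np.array([[48], [100], [111], [154], [262]])
y = np.array([5118, 9500, 13495, 16500, 45400])
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)
# The prediction for a 120 hp car is the mean price of its two
# nearest neighbors (111 hp and 100 hp): (13495 + 9500) / 2
print(knn.predict([[120]]))  # [11497.5]

Now let's download the dataset, open it and explore: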
# Import required modules
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import warnings
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)
# Open dataset and get info
used_cars = pd.read_csv("data/imports-85.data")
display(used_cars.head())
3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.60 | 168.80 | 64.10 | 48.80 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.00 | 111 | 5000 | 21 | 27 | 13495 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 | 48.800 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.000 | 111 | 5000 | 21 | 27 | 16500 |
1 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.500 | 171.200 | 65.500 | 52.400 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.000 | 154 | 5000 | 19 | 26 | 16500 |
2 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.800 | 176.600 | 66.200 | 54.300 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.000 | 102 | 5500 | 24 | 30 | 13950 |
3 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.400 | 176.600 | 66.400 | 54.300 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.000 | 115 | 5500 | 18 | 22 | 17450 |
4 | 2 | ? | audi | gas | std | two | sedan | fwd | front | 99.800 | 177.300 | 66.300 | 53.100 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.500 | 110 | 5500 | 19 | 25 | 15250 |
The first row shows that the dataset has no header with the names of the technical parameters. Using the attribute dictionary located here, reload the dataset, adding column names. I also see that some columns contain the symbol `?`. Replace `?` with `NaN` and view the dataset.
# Reload the dataset, adding column names
columns_name = ["symboling", "normalized-losses", "make", "fuel-type",
"aspiration", "num-of-doors", "body-style", "drive-wheels",
"engine-location", "wheel-base", "length", "width", "height",
"curb-weight", "engine-type", "num-of-cylinders", "engine-size",
"fuel-system", "bore", "stroke", "compression-ratio",
"horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
used_cars = pd.read_csv("data/imports-85.data", names = columns_name)
used_cars.head()
symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 | 48.800 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.000 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 | 48.800 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.000 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.500 | 171.200 | 65.500 | 52.400 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.000 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.800 | 176.600 | 66.200 | 54.300 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.000 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.400 | 176.600 | 66.400 | 54.300 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.000 | 115 | 5500 | 18 | 22 | 17450 |
Of the available columns, in my opinion, only those listed in `actual_columns` below have practical value for potential buyers; all remaining columns will be deleted.
# Replace ? with NaN and display the NaN counts per column
used_cars = used_cars.replace("?", np.nan)
print(used_cars.isna().sum())
# Define the list of columns retained for further analysis
actual_columns = ["make", "fuel-type", "num-of-doors", "body-style",
                  "drive-wheels", "engine-type", "engine-size", "horsepower",
                  "city-mpg", "highway-mpg", "price"]
used_cars = used_cars[actual_columns].copy()
used_cars.reset_index(drop=True, inplace=True)
symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64
As seen above, the dataset has NaN values in `normalized-losses`, `num-of-doors`, `bore`, `stroke`, `horsepower`, `peak-rpm` and `price`; of the retained columns this affects `num-of-doors`, `horsepower` and `price`. Remove those rows.
# Drop NaN values
used_cars = used_cars.dropna().copy()
used_cars.reset_index(drop = True, inplace = True)
print("\n")
used_cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   make          197 non-null    object
 1   fuel-type     197 non-null    object
 2   num-of-doors  197 non-null    object
 3   body-style    197 non-null    object
 4   drive-wheels  197 non-null    object
 5   engine-type   197 non-null    object
 6   engine-size   197 non-null    int64
 7   horsepower    197 non-null    object
 8   city-mpg      197 non-null    int64
 9   highway-mpg   197 non-null    int64
 10  price         197 non-null    object
dtypes: int64(3), object(8)
memory usage: 17.1+ KB
As seen above, the dataset lost 8 rows, approximately 4% of the total. To use the dataset for further ML: encode the string columns to numeric values, convert all numeric values to float, reorder the columns in a more logical sequence and choose the optimal scaling method. I don't permute the index, because `train_test_split` shuffles the rows by default when splitting the dataset.
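To make the encoding step concrete, here is a tiny sketch on a toy column (my own example): OrdinalEncoder sorts the observed categories and maps them to 0, 1, 2, and so on. This is also why every encoded column has a minimum of 0, which in turn produces the inf max/min ratios printed further down.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})
enc = OrdinalEncoder()
# Categories are sorted alphabetically: diesel -> 0.0, gas -> 1.0
print(enc.fit_transform(toy[["fuel-type"]]).ravel())  # [1. 0. 1.]
print(enc.categories_)  # [array(['diesel', 'gas'], dtype=object)]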
# Encode string columns
enc = OrdinalEncoder()
# List columns for encode
encode_columns = ["make", "fuel-type", "num-of-doors", "body-style",
"drive-wheels", "engine-type"]
# Add encoded columns
for feature in encode_columns:
    used_cars[feature + "_enc"] = enc.fit_transform(used_cars[[feature]])
# Convert the numeric and encoded columns (from engine-size onward) to float
used_cars.iloc[:, 6:] = used_cars.iloc[:, 6:].astype(float)
# Reorder columns for more clear understanding
used_cars = used_cars.reindex(columns=['make', 'fuel-type', 'num-of-doors',
'body-style', 'drive-wheels',
'engine-type', 'make_enc',
'body-style_enc', 'num-of-doors_enc',
'drive-wheels_enc', 'fuel-type_enc',
'engine-type_enc', 'engine-size',
'horsepower', 'city-mpg', 'highway-mpg',
'price'])
# Make copy dataset for further normalization
used_cars_normal = used_cars.copy()
# Define columns for estimating the ratio of max to min values
features_min_max = ['make_enc', 'body-style_enc', 'num-of-doors_enc',
'drive-wheels_enc', 'fuel-type_enc', 'engine-type_enc',
'engine-size', 'horsepower', 'city-mpg', 'highway-mpg',
'price']
# Print min/max to choose the optimal scaling method
for i, feature in enumerate(features_min_max):
    print(f"{i+1}. {feature}: max = {used_cars_normal[feature].max()}",
          f", min = {used_cars_normal[feature].min()}",
          f", ratio max/min = {used_cars_normal[feature].max()/used_cars_normal[feature].min():.3f}")
# Maximum ratio max/min = 8.871, hence normalize the numeric values with MinMaxScaler
scaler = MinMaxScaler()
used_cars_normal.iloc[:, 6:] = scaler.fit_transform(used_cars.iloc[:, 6:])
# Check for inf values after conversion in the transformed columns
print("\nInf values in the transformed columns:")
print(used_cars_normal.iloc[:, :].isin([np.inf, -np.inf]).sum(),"\n")
# View normalized dataset
display(used_cars_normal.head())
1. make_enc: max = 20.0 , min = 0.0 , ratio max/min = inf
2. body-style_enc: max = 4.0 , min = 0.0 , ratio max/min = inf
3. num-of-doors_enc: max = 1.0 , min = 0.0 , ratio max/min = inf
4. drive-wheels_enc: max = 2.0 , min = 0.0 , ratio max/min = inf
5. fuel-type_enc: max = 1.0 , min = 0.0 , ratio max/min = inf
6. engine-type_enc: max = 5.0 , min = 0.0 , ratio max/min = inf
7. engine-size: max = 326.0 , min = 61.0 , ratio max/min = 5.344
8. horsepower: max = 262.0 , min = 48.0 , ratio max/min = 5.458
9. city-mpg: max = 49.0 , min = 13.0 , ratio max/min = 3.769
10. highway-mpg: max = 54.0 , min = 16.0 , ratio max/min = 3.375
11. price: max = 45400.0 , min = 5118.0 , ratio max/min = 8.871

Inf values in the transformed columns:
make                0
fuel-type           0
num-of-doors        0
body-style          0
drive-wheels        0
engine-type         0
make_enc            0
body-style_enc      0
num-of-doors_enc    0
drive-wheels_enc    0
fuel-type_enc       0
engine-type_enc     0
engine-size         0
horsepower          0
city-mpg            0
highway-mpg         0
price               0
dtype: int64
make | fuel-type | num-of-doors | body-style | drive-wheels | engine-type | make_enc | body-style_enc | num-of-doors_enc | drive-wheels_enc | fuel-type_enc | engine-type_enc | engine-size | horsepower | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | alfa-romero | gas | two | convertible | rwd | dohc | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.000 | 260.377m | 294.393m | 222.222m | 289.474m | 207.959m |
1 | alfa-romero | gas | two | convertible | rwd | dohc | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.000 | 260.377m | 294.393m | 222.222m | 289.474m | 282.558m |
2 | alfa-romero | gas | two | hatchback | rwd | ohcv | 0.000 | 500.000m | 1.000 | 1.000 | 1.000 | 800.000m | 343.396m | 495.327m | 166.667m | 263.158m | 282.558m |
3 | audi | gas | four | sedan | fwd | ohc | 50.000m | 750.000m | 0.000 | 500.000m | 1.000 | 400.000m | 181.132m | 252.336m | 305.556m | 368.421m | 219.254m |
4 | audi | gas | four | sedan | 4wd | ohc | 50.000m | 750.000m | 0.000 | 0.000 | 1.000 | 400.000m | 283.019m | 313.084m | 138.889m | 157.895m | 306.142m |
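As a quick sanity check, the MinMaxScaler transform is x' = (x - min) / (max - min). Applying it by hand to the first car's price, using the min and max printed above, reproduces the table value (pd.set_eng_float_format renders 0.207959 as 207.959m):

# Price of the first car and the price min/max printed above
price_0, price_min, price_max = 13495.0, 5118.0, 45400.0
print((price_0 - price_min) / (price_max - price_min))  # 0.207959...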
The dataset `used_cars_normal` is now ready for machine learning. In real life, columns such as `make_year`, `number_owners` and `mileage` would also be required before you could earn money by predicting the prices of used cars.
As you know, the size of the training set is determined empirically in each case. I split the dataset according to the MS recommendation of a 70:30 train-to-test ratio, which is typical for most cases. Let's investigate the impact of individual features and hyperparameters on the accuracy of predicting the price of used cars: collect detailed information and summary statistics on RMSE for k-nn from 1 to 30, with the training-set size equal to 70%.
# Set range k-nn
k_val = np.array(range(1, 31))
# Define dataset for further filling RMSE and K-nn values
df_detail = pd.DataFrame(k_val, columns =["k_nn"])
# Split to train and test dataset with permutation using standard sklearn train_test_split method
ts = 0.7
train, test = train_test_split(used_cars_normal.iloc[:, 6:].copy(),
test_size = (1 - ts))
# Extract k-values and RMSE to df_detail
for i, feature in enumerate(features_min_max[:-1]):
    rmse = []
    for k in k_val:
        # Call KNeighborsRegressor object
        knn = KNeighborsRegressor(n_neighbors=k)
        # Fit model
        knn.fit(train[[feature]], train.price)
        # Extract prediction
        prediction = knn.predict(test[[feature]])
        rmse.append(mean_squared_error(test.price, prediction) ** 0.5)
    # Add values to df_detail
    df_detail.insert(i + 1, feature, rmse)
# Create pivot table for rmse
# First column
df_optimal_first_column = ["optimal_k", "minimal_rmse", "numbers_of_unique",
"train_size %"]
# Create dataframe with one column
df_optimal = pd.DataFrame(df_optimal_first_column, columns = ["optimal"])
# Insert minimal RMSE, optimal k and the number of unique values
for i, feature in enumerate(features_min_max[:-1]):
    min_rmse = df_detail[feature].min()
    optimal_k = df_detail.k_nn[df_detail[feature] == min_rmse].values[0]
    # Number of unique values taken from the train set
    numbers_unique = len(train[feature].unique())
    rmse_pvt_val = []
    rmse_pvt_val.append(optimal_k)
    rmse_pvt_val.append(min_rmse)
    rmse_pvt_val.append(numbers_unique)
    rmse_pvt_val.append(ts)
    df_optimal.insert(i + 1, feature, rmse_pvt_val)
# Using default pandas plot
fig, ax = plt.subplots(figsize = (18, 12))
ax = df_detail.set_index("k_nn").plot.line(ax = ax, marker='x')
handles, labels = ax.get_legend_handles_labels()
for i, feature in enumerate(labels[:]):
    # Take min RMSE and optimal k from df_optimal (rows 1 and 0 respectively),
    # not from df_detail
    add_values = ", min RMSE = " + \
                 str(round(df_optimal.loc[1, feature], 3)) + \
                 ", optimal K-nn = " + \
                 str(df_optimal.loc[0, feature]) + ", uniq val = " + \
                 str(df_optimal.loc[2, feature])
    labels[i] += add_values
ax.legend(labels = labels, fontsize=12)
ax.set_xlabel('K - nearest neighbors', weight = "bold", fontsize = 14)
ax.set_ylabel('RMSE', weight = "bold", fontsize = 12, rotation=0)
ax.set_title("Dependence RMSE as numbers of K-Nearest neighbors for each\
future, train size = 70%.", weight = "bold",
fontsize = 12)
ax.spines.top.set_visible(False)
ax.spines.right.set_visible(False)
plt.show()
display(df_optimal)
optimal | make_enc | body-style_enc | num-of-doors_enc | drive-wheels_enc | fuel-type_enc | engine-type_enc | engine-size | horsepower | city-mpg | highway-mpg | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | optimal_k | 6.000 | 11.000 | 28.000 | 18.000 | 7.000 | 9.000 | 2.000 | 2.000 | 4.000 | 2.000 |
1 | minimal_rmse | 123.428m | 193.565m | 208.000m | 162.394m | 202.299m | 187.631m | 61.812m | 91.273m | 84.949m | 116.612m |
2 | numbers_of_unique | 20.000 | 5.000 | 2.000 | 3.000 | 2.000 | 6.000 | 36.000 | 50.000 | 26.000 | 28.000 |
3 | train_size % | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m |
The plot and tables show that the more distinct values a column contains, the higher the accuracy of the prediction. The optimal value of k-nn is in the range from 3 to 10. Notice that the train_test_split method shuffles the rows on every call, so each time the cell is re-run, both the plot and the table of values will change.
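If reproducible results are preferred, train_test_split accepts a random_state parameter. A minimal sketch of the same split with a pinned seed (the value 1 here is arbitrary):

# Pin the seed so the split, plot and table are identical between runs
train, test = train_test_split(used_cars_normal.iloc[:, 6:].copy(),
                               test_size=(1 - ts), random_state=1)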
As described in the scikit-learn regression metrics documentation (section 3.3), there are several metrics for evaluating prediction accuracy. Let's investigate the dependence for RMSE and the R² score.
I will write the three functions necessary for analyzing the dependence of optimal k-nn values on the number of features.
The first function, `knn_rmse_multy`, calculates RMSE.
def knn_rmse_multy(feature_list, ratio):
    """
    Take a list of features and the ratio of train size to total size,
    calculate RMSE for k-nn from 1 to 30 and return two dataframes:
    df_detail with detailed RMSE and k-nn data, and df_optimal with the
    optimal k-nn value and minimal RMSE.
    """
    feature = feature_list[:]
    # Set range of k-nn
    k_val = np.array(range(1, 31))
    # Define dataframe for further filling with RMSE and k-nn values
    df_detail = pd.DataFrame(k_val, columns=["k_nn"])
    # Split into train and test datasets with permutation
    part = ratio
    train, test = train_test_split(used_cars_normal.copy().iloc[:, 6:],
                                   test_size=(1 - part))
    # Extract k-values and RMSE to df_detail
    rmse = []
    for k in k_val:
        # Call KNeighborsRegressor object
        knn = KNeighborsRegressor(n_neighbors=k)
        # Fit model
        knn.fit(train[feature], train.price)
        # Extract prediction
        prediction = knn.predict(test[feature])
        rmse.append(mean_squared_error(test.price, prediction) ** 0.5)
    # Add values to df_detail
    feature_label = str(len(feature)) + " features"
    df_detail.insert(1, feature_label, rmse)
    # Create pivot table for minimal RMSE and optimal k-nn
    # First column
    df_optimal_first_column = ["optimal_k", "minimal_rmse",
                               "train_size %"]
    # Create dataframe with one column
    df_optimal = pd.DataFrame(df_optimal_first_column, columns=["optimal"])
    # Insert optimal k, minimal RMSE and train size
    min_rmse = df_detail[feature_label].min()
    optimal_k = df_detail.k_nn[df_detail[feature_label] == min_rmse].values[0]
    rmse_pvt_val = []
    rmse_pvt_val.append(optimal_k)
    rmse_pvt_val.append(min_rmse)
    rmse_pvt_val.append(part)
    df_optimal.insert(1, feature_label, rmse_pvt_val)
    return df_detail, df_optimal
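For example, a hypothetical call for a pair of features might look like this:

# Detailed RMSE per k and the pivot with optimal k / minimal RMSE
detail, optimal = knn_rmse_multy(["horsepower", "engine-size"], 0.7)
display(optimal)  # the column label will be "2 features"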
The second function, `knn_r2_score_multy`, calculates the R² score.
def knn_r2_score_multy(feature_list, ratio):
    """
    Take a list of features and the ratio of train size to total size,
    calculate the R² score for k-nn from 1 to 30 and return two dataframes:
    df_detail with detailed score and k-nn data, and df_optimal with the
    optimal k-nn value and highest score.
    """
    feature = feature_list[:]
    # Set range of k-nn
    k_val = np.array(range(1, 31))
    # Define dataframe for further filling with scores and k-nn values
    df_detail = pd.DataFrame(k_val, columns=["k_nn"])
    # Split into train and test datasets with permutation
    part = ratio
    train, test = train_test_split(used_cars_normal.copy().iloc[:, 6:],
                                   test_size=(1 - part))
    # Extract k-values and scores to df_detail
    scores = []
    for k in k_val:
        # Call KNeighborsRegressor object
        knn = KNeighborsRegressor(n_neighbors=k)
        # Fit model
        knn.fit(train[feature], train.price)
        # Extract prediction
        prediction = knn.predict(test[feature])
        scores.append(r2_score(test.price, prediction))
    # Add values to df_detail
    feature_label = str(len(feature)) + " features"
    df_detail.insert(1, feature_label, scores)
    # Create pivot table for highest score and optimal k-nn
    # First column
    df_optimal_first_column = ["optimal_k", "highest_r2_score",
                               "train_size %"]
    # Create dataframe with one column
    df_optimal = pd.DataFrame(df_optimal_first_column, columns=["optimal"])
    # Insert optimal k, maximal R² and train size
    max_score = df_detail[feature_label].max()
    optimal_k = df_detail.k_nn[df_detail[feature_label] == max_score].values[0]
    r2_pvt_val = []
    r2_pvt_val.append(optimal_k)
    r2_pvt_val.append(max_score)
    r2_pvt_val.append(part)
    df_optimal.insert(1, feature_label, r2_pvt_val)
    return df_detail, df_optimal
The third function, `rmse_r2_concat`, concatenates the dataframes with detailed information and the pivot tables with optimal k-nn values.
def rmse_r2_concat(list_of_features):
    """
    Parameters
    ----------
    list_of_features : list of lists of feature names.

    Returns
    -------
    Four dataframes for RMSE and R² values: df_rmse and df_r2 with
    detailed data for further plotting, and df_rmse_opt and df_r2_opt
    with the optimal parameters.
    """
    features = list_of_features[:]
    df_rmse, df_rmse_opt = knn_rmse_multy(features[0], 0.7)
    df_r2, df_r2_opt = knn_r2_score_multy(features[0], 0.7)
    for value in features[1:]:
        # Concatenate RMSE
        rmse, rmse_opt = knn_rmse_multy(value, 0.7)
        df_rmse = df_rmse.merge(rmse, how="inner")
        df_rmse_opt = df_rmse_opt.merge(rmse_opt, how="inner")
        # Concatenate R²
        r2, r2_opt = knn_r2_score_multy(value, 0.7)
        df_r2 = df_r2.merge(r2, how="inner")
        df_r2_opt = df_r2_opt.merge(r2_opt, how="inner")
    return df_rmse, df_rmse_opt, df_r2, df_r2_opt
Randomly define a list of feature lists, with sizes from 2 to 10.
features_random = []
features_l = features_min_max[:-1]
for i in range(2, len(features_l) + 1):
    # Sample without replacement so a feature is not repeated within a list
    add = random.sample(features_l, k=i)
    features_random.append(add)
# Train models and output summary values to dataframes
rmse_det, rmse_min, r2_det, r2_max = rmse_r2_concat(features_random)
Plot RMSE values and display optimal k-nn and lowest RMSE.
fig, ax = plt.subplots(figsize = (18, 12))
ax = rmse_det.set_index("k_nn").plot.line(ax = ax, marker='x')
handles, labels = ax.get_legend_handles_labels()
for i, value in enumerate(labels[:]):
    add_values = ", min RMSE = " + \
                 str(round(rmse_min.loc[1, value], 3)) + \
                 ", optimal K-nn = " + \
                 str(rmse_min.loc[0, value]) + "."
    labels[i] += add_values
ax.legend(labels = labels)
ax.set_xlabel('K - nearest neighbors', weight = "bold", fontsize = 14)
ax.set_ylabel('RMSE', weight = "bold", fontsize = 12, rotation=0)
ax.set_title("Dependence RMSE as numbers of K-Nearest neighbors as numbers of \
futures, train size = 70%.", weight = "bold",
fontsize = 12)
ax.spines.top.set_visible(False)
ax.spines.right.set_visible(False)
plt.show()
display(rmse_min)
optimal | 2 features | 3 features | 4 features | 5 features | 6 features | 7 features | 8 features | 9 features | 10 features | |
---|---|---|---|---|---|---|---|---|---|---|
0 | optimal_k | 2.000 | 10.000 | 17.000 | 1.000 | 1.000 | 1.000 | 1.000 | 2.000 | 5.000 |
1 | minimal_rmse | 73.925m | 78.448m | 153.652m | 81.173m | 95.997m | 66.756m | 120.357m | 75.641m | 65.140m |
2 | train_size % | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m |
Plot R² values and display the optimal k-nn and the highest R².
fig, ax = plt.subplots(figsize = (18, 12))
ax = r2_det.set_index("k_nn").plot.line(ax = ax, marker='x')
handles, labels = ax.get_legend_handles_labels()
for i, value in enumerate(labels[:]):
    # Take the optimal k from r2_max, not from rmse_min
    add_values = ", max R2 = " + \
                 str(round(r2_max.loc[1, value], 3)) + \
                 ", optimal K-nn = " + \
                 str(r2_max.loc[0, value]) + "."
    labels[i] += add_values
ax.legend(labels = labels)
ax.set_xlabel('K - nearest neighbors', weight = "bold", fontsize = 14)
ax.set_ylabel('R2', weight = "bold", fontsize = 12, rotation=0)
ax.set_title("Dependence R2 as numbers of K-Nearest neighbors as numbers of \
futures, train size = 70%.", weight = "bold",
fontsize = 12)
ax.spines.top.set_visible(False)
ax.spines.right.set_visible(False)
plt.show()
display(r2_max)
optimal | 2 features | 3 features | 4 features | 5 features | 6 features | 7 features | 8 features | 9 features | 10 features | |
---|---|---|---|---|---|---|---|---|---|---|
0 | optimal_k | 1.000 | 5.000 | 13.000 | 2.000 | 2.000 | 1.000 | 4.000 | 3.000 | 2.000 |
1 | highest_r2_score | 863.549m | 809.144m | 487.823m | 682.212m | 718.525m | 843.550m | 687.036m | 865.545m | 867.150m |
2 | train_size % | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m | 700.000m |
It can be seen that the accuracy of prediction can decrease as the number of features grows. The optimal k-nn values are in the range from 1 to 9. Since the train sets and the feature samples are random, the optimal k-nn, minimal RMSE and maximal R² change from run to run. The maximal R² in most cases lies in the range 60% - 90%, but you can also get a negative R² for some feature sets, depending on the quality of your train set and the random choice of features.
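To see why R² can be negative, recall that R² = 1 - SS_res / SS_tot: any model whose squared error exceeds that of simply predicting the mean of the test targets scores below zero. A toy example with made-up numbers:

from sklearn.metrics import r2_score

# Mean of y_true is 2, so SS_tot = 2; the constant prediction 3 gives
# SS_res = 4 + 1 + 0 = 5, hence R2 = 1 - 5/2 = -1.5
print(r2_score([1, 2, 3], [3, 3, 3]))  # -1.5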
Cross-validation is a statistical method used to estimate the skill of machine learning models. It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.
© Jason Brownlee.
It is known that the commonly recommended number of folds is k = 10, but this is largely of a theoretical nature:
The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller
— Page 70, Applied Predictive Modeling, 2013.
Let's investigate the influence of the number of k-folds on RMSE and R² and print the results.
# 5. k-fold cross validation
# Validation with different k-fold for default value knn = 5
folds_k = np.array(range(2,21))
fold_ds = used_cars_normal.copy().iloc[:, 6:]
fold_features = ['make_enc', 'body-style_enc', 'num-of-doors_enc',
'drive-wheels_enc', 'fuel-type_enc', 'engine-type_enc',
'engine-size', 'horsepower', 'city-mpg', 'highway-mpg']
for fold in folds_k:
    model_knr = KNeighborsRegressor()
    kf = KFold(fold, shuffle=True, random_state=1)
    # Extract RMSE
    mses = cross_val_score(model_knr, fold_ds[fold_features].copy(),
                           fold_ds["price"].copy(),
                           scoring="neg_mean_squared_error",
                           cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    rmse_mean = np.mean(rmses)
    std_rmse = np.std(rmses)
    # Extract R2
    r2es = cross_val_score(model_knr, fold_ds[fold_features].copy(),
                           fold_ds["price"].copy(),
                           scoring="r2",
                           cv=kf)
    r2es_mean = np.mean(r2es)
    r2es_std = np.std(r2es)
    print(str(fold), "k-folds,", f"train size = {(1 - 1.0/fold):.2%},",
          " mean RMSE: ", str(rmse_mean), "std RMSE:", str(std_rmse))
    print(str(fold), "k-folds,", f"train size = {(1 - 1.0/fold):.2%},",
          " mean R2: ", str(r2es_mean), "std R2: ", str(r2es_std))
2 k-folds, train size = 50.00%, mean RMSE: 0.13089148960141994 std RMSE: 0.000719857384639852
2 k-folds, train size = 50.00%, mean R2: 0.5565805637587018 std R2: 0.05125063863095347
3 k-folds, train size = 66.67%, mean RMSE: 0.11243482283121416 std RMSE: 0.015538177542357269
3 k-folds, train size = 66.67%, mean R2: 0.6760281422783841 std R2: 0.06381523849864348
4 k-folds, train size = 75.00%, mean RMSE: 0.11155850491321122 std RMSE: 0.019386293881982854
4 k-folds, train size = 75.00%, mean R2: 0.6582370928565034 std R2: 0.15624236826361704
5 k-folds, train size = 80.00%, mean RMSE: 0.11633561941243338 std RMSE: 0.02281156137037081
5 k-folds, train size = 80.00%, mean R2: 0.6392568126328487 std R2: 0.10511402435147534
6 k-folds, train size = 83.33%, mean RMSE: 0.10695266252661678 std RMSE: 0.028836808358473118
6 k-folds, train size = 83.33%, mean R2: 0.6942104676511262 std R2: 0.10999646723955493
7 k-folds, train size = 85.71%, mean RMSE: 0.10298788891435087 std RMSE: 0.028765961382694154
7 k-folds, train size = 85.71%, mean R2: 0.711706180903027 std R2: 0.10196387054753021
8 k-folds, train size = 87.50%, mean RMSE: 0.10654198538696047 std RMSE: 0.027099498668648282
8 k-folds, train size = 87.50%, mean R2: 0.6773076303692642 std R2: 0.18385720014987433
9 k-folds, train size = 88.89%, mean RMSE: 0.10523957613642426 std RMSE: 0.029959366206118082
9 k-folds, train size = 88.89%, mean R2: 0.6798909580165289 std R2: 0.14448260998313528
10 k-folds, train size = 90.00%, mean RMSE: 0.10522822882383409 std RMSE: 0.03867942904619742
10 k-folds, train size = 90.00%, mean R2: 0.6493600953217988 std R2: 0.1345234690550561
11 k-folds, train size = 90.91%, mean RMSE: 0.09681931131103597 std RMSE: 0.03694622951293173
11 k-folds, train size = 90.91%, mean R2: 0.7047897573409708 std R2: 0.17089031670534569
12 k-folds, train size = 91.67%, mean RMSE: 0.09632514521868567 std RMSE: 0.04270234597180818
12 k-folds, train size = 91.67%, mean R2: 0.7112340425131559 std R2: 0.2098827111196913
13 k-folds, train size = 92.31%, mean RMSE: 0.10028455086689438 std RMSE: 0.0426421773022934
13 k-folds, train size = 92.31%, mean R2: 0.7061037719313487 std R2: 0.13662939841106442
14 k-folds, train size = 92.86%, mean RMSE: 0.09673422158906257 std RMSE: 0.04157373289568755
14 k-folds, train size = 92.86%, mean R2: 0.7217363481169973 std R2: 0.1389218340462481
15 k-folds, train size = 93.33%, mean RMSE: 0.09682928308838966 std RMSE: 0.04503238511806834
15 k-folds, train size = 93.33%, mean R2: 0.6961937471890659 std R2: 0.21387216340326473
16 k-folds, train size = 93.75%, mean RMSE: 0.09596627053398545 std RMSE: 0.045415409382783364
16 k-folds, train size = 93.75%, mean R2: 0.6753952823175247 std R2: 0.21677908708378274
17 k-folds, train size = 94.12%, mean RMSE: 0.0966896289824184 std RMSE: 0.04896693488251336
17 k-folds, train size = 94.12%, mean R2: 0.7013454572838578 std R2: 0.1580954055136116
18 k-folds, train size = 94.44%, mean RMSE: 0.09599674885453761 std RMSE: 0.04341095052479746
18 k-folds, train size = 94.44%, mean R2: 0.6922606545447201 std R2: 0.17480102879151543
19 k-folds, train size = 94.74%, mean RMSE: 0.09319223904808788 std RMSE: 0.050094847858043946
19 k-folds, train size = 94.74%, mean R2: 0.6963603745474026 std R2: 0.1848592400342005
20 k-folds, train size = 95.00%, mean RMSE: 0.0950442231412687 std RMSE: 0.051876827256537944
20 k-folds, train size = 95.00%, mean R2: 0.6481775003418562 std R2: 0.24785570986373792
It can be seen that, as a rule, the prediction accuracy of the model increases with the train size, which in practice lets you roughly estimate the required train size. Keep in mind that in this case the k-nn value used is the default one, not the optimal value for these features.
In my opinion, the main value of cross-validation is the selection of the optimal k-nn value before training the model. Let's check this for a fixed k-fold = 10.
k_nn = np.array(range(1,16))
for k in k_nn:
    model_knr = KNeighborsRegressor(n_neighbors=k)
    kf = KFold(10, shuffle=True, random_state=1)
    mses = cross_val_score(model_knr, fold_ds[fold_features].copy(),
                           fold_ds["price"].copy(),
                           scoring="neg_mean_squared_error",
                           cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    rmse_mean = np.mean(rmses)
    std_rmse = np.std(rmses)
    r2es = cross_val_score(model_knr, fold_ds[fold_features].copy(),
                           fold_ds["price"].copy(),
                           scoring="r2",
                           cv=kf)
    r2es_mean = np.mean(r2es)
    r2es_std = np.std(r2es)
    print(str(k), "k-nn, mean RMSE: ", str(rmse_mean), "std RMSE: ", str(std_rmse))
    print(str(k), "k-nn, mean R2: ", str(r2es_mean), "std R2: ", str(r2es_std))
1 k-nn, mean RMSE: 0.09712291337061893 std RMSE: 0.03490225049845181
1 k-nn, mean R2: 0.693978792764743 std R2: 0.1257138091530604
2 k-nn, mean RMSE: 0.08653753827212116 std RMSE: 0.0352251141933749
2 k-nn, mean R2: 0.748365958881571 std R2: 0.1513118549866339
3 k-nn, mean RMSE: 0.09251051422117426 std RMSE: 0.035022486688649296
3 k-nn, mean R2: 0.7136022506042058 std R2: 0.14669605736707222
4 k-nn, mean RMSE: 0.09902182495782333 std RMSE: 0.035285296204673404
4 k-nn, mean R2: 0.6695446293437334 std R2: 0.15325580089768884
5 k-nn, mean RMSE: 0.10522822882383409 std RMSE: 0.03867942904619742
5 k-nn, mean R2: 0.6493600953217988 std R2: 0.1345234690550561
6 k-nn, mean RMSE: 0.10649191750779483 std RMSE: 0.04023088557258272
6 k-nn, mean R2: 0.6464298572129444 std R2: 0.1366971070654578
7 k-nn, mean RMSE: 0.1100277976447781 std RMSE: 0.04269968076416515
7 k-nn, mean R2: 0.6279944010949343 std R2: 0.14446565555155896
8 k-nn, mean RMSE: 0.10838952590073272 std RMSE: 0.04251489632743167
8 k-nn, mean R2: 0.6444848641983755 std R2: 0.13075582656564638
9 k-nn, mean RMSE: 0.1086102074007429 std RMSE: 0.04323741505146017
9 k-nn, mean R2: 0.6464194246020623 std R2: 0.11550372464454059
10 k-nn, mean RMSE: 0.11088789270693618 std RMSE: 0.043908019793763214
10 k-nn, mean R2: 0.6380891841653615 std R2: 0.09747726365480958
11 k-nn, mean RMSE: 0.11331774010260823 std RMSE: 0.04429130946367318
11 k-nn, mean R2: 0.6247960984527863 std R2: 0.08916398891774455
12 k-nn, mean RMSE: 0.11531819673053864 std RMSE: 0.04494945487794248
12 k-nn, mean R2: 0.6135644778101136 std R2: 0.08112710266046536
13 k-nn, mean RMSE: 0.1151129543186397 std RMSE: 0.045746145323041484
13 k-nn, mean R2: 0.6146848759682085 std R2: 0.08252653121566782
14 k-nn, mean RMSE: 0.11623040601208519 std RMSE: 0.046774229494997266
14 k-nn, mean R2: 0.6080269664897694 std R2: 0.08495156181195233
15 k-nn, mean RMSE: 0.11675020746736864 std RMSE: 0.0464711842152894
15 k-nn, mean R2: 0.6027211022422753 std R2: 0.08165379472054539
It can be seen that the optimal k-nn values for the full feature set are close to the k-nn values shown in the plots above. Keep in mind that the plots were built on a different train/test split than the cross-validation here. In this case, I conclude that for the full feature set the optimal k-nn values lie in the range 2 - 4.
K-fold cross-validation allows you to quickly select optimal hyperparameters for further model training, and these are not always equal to the default hyperparameter values for a given set of features.
A k-fold value of 10 is quite enough for practical use.
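As a sketch of that workflow (assuming the fold_ds and fold_features variables defined above), scikit-learn can run the same search over k in one call with GridSearchCV:

from sklearn.model_selection import GridSearchCV

# 10-fold search over k = 1..15, scored by negated RMSE
grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={"n_neighbors": list(range(1, 16))},
                    scoring="neg_root_mean_squared_error",
                    cv=KFold(10, shuffle=True, random_state=1))
grid.fit(fold_ds[fold_features], fold_ds["price"])
print(grid.best_params_, -grid.best_score_)  # e.g. {'n_neighbors': 2}, ...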
Created on July 10, 2021
@author: Vadim Maklakov, used some ideas from public Internet resources.
© 3-clause BSD License
Software environment: Debian 10, Python 3.8.7
See the required installed and imported Python modules in cell No. 1.