In this project, we will use the machine learning algorithm, K-Nearest Neighbors, to perform regressions. More precisely, we will apply the algorithm to predict the price of cars using The Automobile Data Set. Before we dive further into the project, let's take a closer look at the algorithm we will be using.
The K-Nearest Neighbors (KNN) is a fundamental machine learning algorithm that can be used for classification and regression problems based on feature similarity.
Maybe you are asking how the algortheme work? the answer is for a given data point a prediction is made by looking at the k nearest neighbors data points, it depind in the the problem and the nature of data the mesure of similary is choisen.
Once we select the k neighbors we are almost donne in case of classification the data piont will be assigned to the class most common among its k nearest neighbors. In the case of regression like in this project where we want to estimate to price of car, the K-nearest neighbors will compute the averge price of k similar to predic the price.
In this project, we will follow these steps:
import pandas as pd
import numpy as np
cars = pd.read_csv('imports-85.data')
pd.set_option('display.max_columns', None) # to display all the columns
cars.head() # print the first five rows
3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.60 | 168.80 | 64.10 | 48.80 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.00 | 111 | 5000 | 21 | 27 | 13495 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
1 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
2 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
3 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
4 | 2 | ? | audi | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
As we can notice that the data has no header, we will use the documentation on the data source website to generate a list of column names and pass it as an argument to the read_csv
method. We can also note the use of ?
to represent missing values, we can also pass ?
to the na_values
argument to replace it with np.nan
. See pd.read_csv.
columns_name = ['symboling','normalized-losses','make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels','engine-location','wheel-base','lenght','width','height','curb-weight','engine-type','num-of-cylinders','engine-size','fuel-system','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
cars = pd.read_csv('imports-85.data',header=None,names=columns_name,na_values='?')
pd.set_option('display.max_columns', None)
cars.head()
symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | lenght | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
Let's took a close look at the data.
cars.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 205 entries, 0 to 204 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 symboling 205 non-null int64 1 normalized-losses 164 non-null float64 2 make 205 non-null object 3 fuel-type 205 non-null object 4 aspiration 205 non-null object 5 num-of-doors 203 non-null object 6 body-style 205 non-null object 7 drive-wheels 205 non-null object 8 engine-location 205 non-null object 9 wheel-base 205 non-null float64 10 lenght 205 non-null float64 11 width 205 non-null float64 12 height 205 non-null float64 13 curb-weight 205 non-null int64 14 engine-type 205 non-null object 15 num-of-cylinders 205 non-null object 16 engine-size 205 non-null int64 17 fuel-system 205 non-null object 18 bore 201 non-null float64 19 stroke 201 non-null float64 20 compression-ratio 205 non-null float64 21 horsepower 203 non-null float64 22 peak-rpm 203 non-null float64 23 city-mpg 205 non-null int64 24 highway-mpg 205 non-null int64 25 price 201 non-null float64 dtypes: float64(11), int64(5), object(10) memory usage: 41.8+ KB
cars.describe()
symboling | normalized-losses | wheel-base | lenght | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 205.000000 | 164.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 201.000000 | 201.000000 | 205.000000 | 203.000000 | 203.000000 | 205.000000 | 205.000000 | 201.000000 |
mean | 0.834146 | 122.000000 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.329751 | 3.255423 | 10.142537 | 104.256158 | 5125.369458 | 25.219512 | 30.751220 | 13207.129353 |
std | 1.245307 | 35.442168 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.273539 | 0.316717 | 3.972040 | 39.714369 | 479.334560 | 6.542142 | 6.886443 | 7947.066342 |
min | -2.000000 | 65.000000 | 86.600000 | 141.100000 | 60.300000 | 47.800000 | 1488.000000 | 61.000000 | 2.540000 | 2.070000 | 7.000000 | 48.000000 | 4150.000000 | 13.000000 | 16.000000 | 5118.000000 |
25% | 0.000000 | 94.000000 | 94.500000 | 166.300000 | 64.100000 | 52.000000 | 2145.000000 | 97.000000 | 3.150000 | 3.110000 | 8.600000 | 70.000000 | 4800.000000 | 19.000000 | 25.000000 | 7775.000000 |
50% | 1.000000 | 115.000000 | 97.000000 | 173.200000 | 65.500000 | 54.100000 | 2414.000000 | 120.000000 | 3.310000 | 3.290000 | 9.000000 | 95.000000 | 5200.000000 | 24.000000 | 30.000000 | 10295.000000 |
75% | 2.000000 | 150.000000 | 102.400000 | 183.100000 | 66.900000 | 55.500000 | 2935.000000 | 141.000000 | 3.590000 | 3.410000 | 9.400000 | 116.000000 | 5500.000000 | 30.000000 | 34.000000 | 16500.000000 |
max | 3.000000 | 256.000000 | 120.900000 | 208.100000 | 72.300000 | 59.800000 | 4066.000000 | 326.000000 | 3.940000 | 4.170000 | 23.000000 | 288.000000 | 6600.000000 | 49.000000 | 54.000000 | 45400.000000 |
20% of the values in the column normalized-losses
are missing, if we delete all the rows with missing values, we lose a lot of data, the alternative is to replace this value with the average of the columns or to delete the column entirely. For now we will juste replace the missing values using the average.
cars['normalized-losses'] = cars['normalized-losses'].fillna(cars['normalized-losses'].mean())
cars['num-of-doors'].value_counts(dropna=False)
four 114 two 89 NaN 2 Name: num-of-doors, dtype: int64
cars['num-of-cylinders'].value_counts(dropna=False)
four 159 six 24 five 11 eight 5 two 4 three 1 twelve 1 Name: num-of-cylinders, dtype: int64
the database contine 15 continuous attributes out of 26 and one integer attribute symboling
, and some nominal attributes can be transformed to numerica like num-of-doors
and num-of-cylinders
lets statr by tronsforming the num-of-doors
and num-of-cylinders
to numeric columns
cars['num-of-cylinders'] = cars['num-of-cylinders'].replace(to_replace={'four':4,'six':6,'five':5,'eight':8, 'two':2,'twelve':11,'three':3})
cars['num-of-doors'] = cars['num-of-doors'].replace(to_replace={'four':4, 'two':2})
cars['num-of-doors'].value_counts(dropna=False)
4.0 114 2.0 89 NaN 2 Name: num-of-doors, dtype: int64
cars['num-of-cylinders'].value_counts(dropna=False)
4 159 6 24 5 11 8 5 2 4 11 1 3 1 Name: num-of-cylinders, dtype: int64
cars.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 205 entries, 0 to 204 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 symboling 205 non-null int64 1 normalized-losses 205 non-null float64 2 make 205 non-null object 3 fuel-type 205 non-null object 4 aspiration 205 non-null object 5 num-of-doors 203 non-null float64 6 body-style 205 non-null object 7 drive-wheels 205 non-null object 8 engine-location 205 non-null object 9 wheel-base 205 non-null float64 10 lenght 205 non-null float64 11 width 205 non-null float64 12 height 205 non-null float64 13 curb-weight 205 non-null int64 14 engine-type 205 non-null object 15 num-of-cylinders 205 non-null int64 16 engine-size 205 non-null int64 17 fuel-system 205 non-null object 18 bore 201 non-null float64 19 stroke 201 non-null float64 20 compression-ratio 205 non-null float64 21 horsepower 203 non-null float64 22 peak-rpm 203 non-null float64 23 city-mpg 205 non-null int64 24 highway-mpg 205 non-null int64 25 price 201 non-null float64 dtypes: float64(12), int64(6), object(8) memory usage: 41.8+ KB
we still have some columns with missing data, we will just drop the rows
cars.dropna(inplace=True)
cars.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 193 entries, 0 to 204 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 symboling 193 non-null int64 1 normalized-losses 193 non-null float64 2 make 193 non-null object 3 fuel-type 193 non-null object 4 aspiration 193 non-null object 5 num-of-doors 193 non-null float64 6 body-style 193 non-null object 7 drive-wheels 193 non-null object 8 engine-location 193 non-null object 9 wheel-base 193 non-null float64 10 lenght 193 non-null float64 11 width 193 non-null float64 12 height 193 non-null float64 13 curb-weight 193 non-null int64 14 engine-type 193 non-null object 15 num-of-cylinders 193 non-null int64 16 engine-size 193 non-null int64 17 fuel-system 193 non-null object 18 bore 193 non-null float64 19 stroke 193 non-null float64 20 compression-ratio 193 non-null float64 21 horsepower 193 non-null float64 22 peak-rpm 193 non-null float64 23 city-mpg 193 non-null int64 24 highway-mpg 193 non-null int64 25 price 193 non-null float64 dtypes: float64(12), int64(6), object(8) memory usage: 40.7+ KB
the list of numeric columns to keep
numeric_columns = ['symboling','normalized-losses','num-of-doors','wheel-base','lenght','width','height',
'curb-weight','num-of-cylinders','engine-size','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
len(numeric_columns)
18
numeric_cars = cars[numeric_columns]
numeric_cars.head()
symboling | normalized-losses | num-of-doors | wheel-base | lenght | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 122.0 | 2.0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 4 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
1 | 3 | 122.0 | 2.0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 4 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
2 | 1 | 122.0 | 2.0 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 6 | 152 | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
3 | 2 | 164.0 | 4.0 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 4 | 109 | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
4 | 2 | 164.0 | 4.0 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 5 | 136 | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
Depending on the algorithm you are using, you may or may not need to standardize the data, since we assume that all features have the same considerations and we are computing distance we should apply standardization. To normalize all columns in a range of 0 to 1, we will use Min-Max normalization.
price_col = numeric_cars['price']
numeric_cars = (numeric_cars - numeric_cars.min())/(numeric_cars.max() - numeric_cars.min())
numeric_cars['price'] = price_col
numeric_cars.head()
symboling | normalized-losses | num-of-doors | wheel-base | lenght | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.298429 | 0.0 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.125 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 13495.0 |
1 | 1.0 | 0.298429 | 0.0 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.125 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 16500.0 |
2 | 0.6 | 0.298429 | 0.0 | 0.230321 | 0.449254 | 0.444444 | 0.383333 | 0.517843 | 0.375 | 0.343396 | 0.100000 | 0.666667 | 0.1250 | 0.495327 | 0.346939 | 0.166667 | 0.263158 | 16500.0 |
3 | 0.8 | 0.518325 | 1.0 | 0.384840 | 0.529851 | 0.504274 | 0.541667 | 0.329325 | 0.125 | 0.181132 | 0.464286 | 0.633333 | 0.1875 | 0.252336 | 0.551020 | 0.305556 | 0.368421 | 13950.0 |
4 | 0.8 | 0.518325 | 1.0 | 0.373178 | 0.529851 | 0.521368 | 0.541667 | 0.518231 | 0.250 | 0.283019 | 0.464286 | 0.633333 | 0.0625 | 0.313084 | 0.551020 | 0.138889 | 0.157895 | 17450.0 |
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
def knn_train_test(df,features='',target='price',k=5):
#first split the data set train/test
nbr_rows = len(df)
np.random.seed(1)
indexs = np.random.permutation(nbr_rows) #shuffle the data
train = df.iloc[indexs[0:round(nbr_rows * .75)]] # 75% for the training
test = df.iloc[indexs[round(nbr_rows * .75):]]
#train th model
model = KNeighborsRegressor(n_neighbors=k)
model.fit(train[features],train[target])
predictions = model.predict(test[features])
mse = mean_squared_error(test[target],predictions)
rmse = mse ** (1/2)
return rmse
numeric_columns.remove('price')
len(numeric_columns)
17
For each numeric column, we will create, train, and test a univariate model using default k
value from scikit-learn
.
rmse_values = list()
for feature in numeric_columns:
rmse = knn_train_test(numeric_cars,[feature])
rmse_values.append(rmse)
rmse_values
[7391.45857809314, 6691.161998798614, 7355.852500673643, 5771.642749902318, 5260.713472920443, 3709.0194335340625, 7982.664949230092, 3084.2734194350105, 4086.3104265951206, 3171.8674012060046, 5995.115904425868, 6234.768973733778, 5096.504099053259, 4550.010748522103, 6514.248899080129, 3675.0418801468554, 4131.277200818168]
import matplotlib.pyplot as plt
from matplotlib import style
style.use('bmh')
%matplotlib inline
plt.scatter(numeric_columns,rmse_values,)
plt.xticks(rotation=90)
plt.title('The Root Mean Square Error For Univariate Model')
plt.ylabel('RMSE')
plt.xlabel('Numeric attribute')
plt.show()
For each numeric column, we will create, train, and test a univariate model using the following k
values (1, 3, 5, 7, and 9). Visualize the results using a scatter plot.
neighbors = [1,3,5,7,9]
rmse_k = {}
for feature in numeric_columns:
rmse_values = list()
for k in neighbors:
rmse = knn_train_test(numeric_cars,[feature],k=k)
rmse_values.append(rmse)
rmse_k[feature] = rmse_values
results = pd.DataFrame(rmse_k, index=neighbors)
results
symboling | normalized-losses | num-of-doors | wheel-base | lenght | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 9936.839864 | 6581.090924 | 9473.439976 | 4665.530244 | 4571.303106 | 2062.156330 | 9660.569786 | 5103.185798 | 6469.261154 | 2569.223914 | 6024.216180 | 3697.468140 | 4765.611197 | 3349.458156 | 6178.049318 | 4805.300574 | 4528.756243 |
3 | 7639.444338 | 6207.707240 | 8001.398262 | 5542.296773 | 5127.738422 | 3130.699015 | 8324.506117 | 3554.133091 | 3929.319132 | 2768.704046 | 5976.832174 | 5377.241925 | 4589.625354 | 4001.242160 | 5700.254306 | 3891.116610 | 4491.859959 |
5 | 7391.458578 | 6691.161999 | 7355.852501 | 5771.642750 | 5260.713473 | 3709.019434 | 7982.664949 | 3084.273419 | 4086.310427 | 3171.867401 | 5995.115904 | 6234.768974 | 5096.504099 | 4550.010749 | 6514.248899 | 3675.041880 | 4131.277201 |
7 | 6923.249562 | 7090.636344 | 7691.424289 | 5427.876232 | 5562.721854 | 3509.697064 | 7865.607051 | 2928.318411 | 4447.210949 | 3194.923231 | 6116.079189 | 6440.383994 | 5660.042521 | 4849.468468 | 6427.913607 | 3623.745989 | 3756.331451 |
9 | 7017.689447 | 7195.266961 | 7521.324168 | 5325.721120 | 5577.753263 | 3313.365517 | 7625.072018 | 2744.663489 | 4905.965071 | 2927.365390 | 5973.632081 | 6542.294028 | 6254.368059 | 5010.198400 | 6980.497340 | 4055.921880 | 3349.029400 |
fig = plt.figure(figsize=(12, 3*6))
<Figure size 864x1296 with 0 Axes>
for sp in range(0,6):
ax = fig.add_subplot(6,3,sp*3+1)
ax.scatter(x=neighbors,y=results[numeric_columns[sp]])
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.set_xlim(0, 10)
ax.set_ylim(0,10000)
#ax.set_yticks([0,50,100])
#ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.3)
ax.set_title(numeric_columns[sp])
ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
#if sp == 5:
# ax.tick_params(labelbottom='on')
fig
for sp in range(0,5):
ax = fig.add_subplot(6,3,sp*3+2)
ax.scatter(x=neighbors,y=results[numeric_columns[sp+6]])
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.set_xlim(0, 10)
ax.set_ylim(0,10000)
ax.set_title(numeric_columns[sp+6])
ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
#if sp == 5:
# ax.tick_params(labelbottom='on')
fig
for sp in range(0,6):
ax = fig.add_subplot(6,3,sp*3+3)
ax.scatter(x=neighbors,y=results[numeric_columns[sp+11]])
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.set_xlim(0, 10)
ax.set_ylim(0,10000)
ax.set_title(numeric_columns[sp+11])
ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
#if sp == 5:
# ax.tick_params(labelbottom='on')
fig
We will sort the features based in result from the previous step.
# Compute average RMSE across different `k` values for each feature.
results.loc['avg'] = results.apply(np.mean)
results
symboling | normalized-losses | num-of-doors | wheel-base | lenght | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 9936.839864 | 6581.090924 | 9473.439976 | 4665.530244 | 4571.303106 | 2062.156330 | 9660.569786 | 5103.185798 | 6469.261154 | 2569.223914 | 6024.216180 | 3697.468140 | 4765.611197 | 3349.458156 | 6178.049318 | 4805.300574 | 4528.756243 |
3 | 7639.444338 | 6207.707240 | 8001.398262 | 5542.296773 | 5127.738422 | 3130.699015 | 8324.506117 | 3554.133091 | 3929.319132 | 2768.704046 | 5976.832174 | 5377.241925 | 4589.625354 | 4001.242160 | 5700.254306 | 3891.116610 | 4491.859959 |
5 | 7391.458578 | 6691.161999 | 7355.852501 | 5771.642750 | 5260.713473 | 3709.019434 | 7982.664949 | 3084.273419 | 4086.310427 | 3171.867401 | 5995.115904 | 6234.768974 | 5096.504099 | 4550.010749 | 6514.248899 | 3675.041880 | 4131.277201 |
7 | 6923.249562 | 7090.636344 | 7691.424289 | 5427.876232 | 5562.721854 | 3509.697064 | 7865.607051 | 2928.318411 | 4447.210949 | 3194.923231 | 6116.079189 | 6440.383994 | 5660.042521 | 4849.468468 | 6427.913607 | 3623.745989 | 3756.331451 |
9 | 7017.689447 | 7195.266961 | 7521.324168 | 5325.721120 | 5577.753263 | 3313.365517 | 7625.072018 | 2744.663489 | 4905.965071 | 2927.365390 | 5973.632081 | 6542.294028 | 6254.368059 | 5010.198400 | 6980.497340 | 4055.921880 | 3349.029400 |
avg | 7781.736358 | 6753.172694 | 8008.687839 | 5346.613424 | 5220.046023 | 3144.987472 | 8291.683984 | 3482.914842 | 4767.613347 | 2926.416796 | 6017.175106 | 5658.431412 | 5273.230246 | 4352.075586 | 6360.192694 | 4010.225386 | 4051.450851 |
best_features = results.loc['avg'].sort_values().index
best_features
Index(['engine-size', 'width', 'curb-weight', 'city-mpg', 'highway-mpg', 'horsepower', 'num-of-cylinders', 'lenght', 'compression-ratio', 'wheel-base', 'stroke', 'bore', 'peak-rpm', 'normalized-losses', 'symboling', 'num-of-doors', 'height'], dtype='object')
best_features
Index(['engine-size', 'width', 'curb-weight', 'city-mpg', 'highway-mpg', 'horsepower', 'num-of-cylinders', 'lenght', 'compression-ratio', 'wheel-base', 'stroke', 'bore', 'peak-rpm', 'normalized-losses', 'symboling', 'num-of-doors', 'height'], dtype='object')
#the best 2 features from the previous step
best_features[:2]
Index(['engine-size', 'width'], dtype='object')
nbr_features = [2,3,4,5,6,7]
rmse_values ={}
for nbr_feature in nbr_features:
rmse = knn_train_test(numeric_cars,best_features[:nbr_feature])
rmse_values[nbr_feature] = rmse
rmse_values
{2: 2244.634766133086, 3: 2199.4665137255447, 4: 2346.6186852930896, 5: 2350.786971420422, 6: 2537.1719108093566, 7: 2402.188727147252}
A good a chose of k can imporve accuracy, we will use searsh grid to covere a range of value between 1
and 25
. For the features we will select the three top models from the last step.
top3 = [2,3,4,5]
rmse_k = {}
for nbr_feature in top3:
rmse_values = list()
for k in range(1,25):
rmse = knn_train_test(numeric_cars,best_features[:nbr_feature],k=k)
rmse_values.append(rmse)
rmse_k[nbr_feature] = rmse_values
rmse_k
{2: [2384.8323010573863, 2030.8784608923959, 1964.0343194227464, 2201.423501053258, 2244.634766133086, 2467.51531056453, 2579.4367959108126, 2503.2739671653003, 2487.9080667005983, 2559.959226359865, 2689.18357091232, 2716.31906933967, 2765.3286698631077, 2834.7934028985865, 2911.3369994315767, 2909.5235766556757, 2926.8478955706305, 2936.1936513744995, 2977.316330053173, 3008.711678185186, 3087.4809134656584, 3181.9239792949256, 3243.0021632731014, 3303.6151531104692], 3: [1959.2384734465923, 1902.5755131750575, 1746.8690921188113, 1767.3586418750488, 2199.4665137255447, 2208.9384608545784, 2214.8778506460153, 2407.3242039485804, 2436.9999119266404, 2223.685969192443, 2190.0789485687924, 2300.616854176771, 2306.17457684166, 2398.464915856294, 2486.1731734773657, 2558.965994025218, 2667.128930620863, 2783.3294599200517, 2817.6408636433353, 2886.588850570788, 2979.7812577769514, 3015.7859197633666, 3094.5277518437797, 3101.193287848042], 4: [2131.390679618982, 2206.456985096998, 2271.0984987894526, 2145.8483795993525, 2346.6186852930896, 2407.1171822433002, 2458.6987468244365, 2505.775662199398, 2540.68372219656, 2499.9748365400264, 2452.8085856456073, 2407.9226779476408, 2353.7413061113307, 2443.2431227676734, 2479.9547049141697, 2615.546035247707, 2726.81420267943, 2798.017348142094, 2909.634763853679, 2997.496469130184, 3049.461809572687, 3095.515781923601, 3147.9935104031965, 3187.1392490403423], 5: [1933.0795810485058, 2155.764972018827, 2205.8025865715917, 2209.161862067264, 2350.786971420422, 2397.6095645126616, 2532.9836826822816, 2537.307925625131, 2500.662772434606, 2295.518860737154, 2305.323067136258, 2299.7159552489065, 2417.0951768720624, 2549.317541292039, 2572.7566502467525, 2683.8710127117415, 2768.240263311874, 2867.431478622222, 2950.5906760713688, 3047.6063083898325, 3129.0189045129937, 3181.3859005391196, 3189.2883411802713, 3254.3498733317087]}
final_result = pd.DataFrame(rmse_k,index=range(1,25))
final_result
2 | 3 | 4 | 5 | |
---|---|---|---|---|
1 | 2384.832301 | 1959.238473 | 2131.390680 | 1933.079581 |
2 | 2030.878461 | 1902.575513 | 2206.456985 | 2155.764972 |
3 | 1964.034319 | 1746.869092 | 2271.098499 | 2205.802587 |
4 | 2201.423501 | 1767.358642 | 2145.848380 | 2209.161862 |
5 | 2244.634766 | 2199.466514 | 2346.618685 | 2350.786971 |
6 | 2467.515311 | 2208.938461 | 2407.117182 | 2397.609565 |
7 | 2579.436796 | 2214.877851 | 2458.698747 | 2532.983683 |
8 | 2503.273967 | 2407.324204 | 2505.775662 | 2537.307926 |
9 | 2487.908067 | 2436.999912 | 2540.683722 | 2500.662772 |
10 | 2559.959226 | 2223.685969 | 2499.974837 | 2295.518861 |
11 | 2689.183571 | 2190.078949 | 2452.808586 | 2305.323067 |
12 | 2716.319069 | 2300.616854 | 2407.922678 | 2299.715955 |
13 | 2765.328670 | 2306.174577 | 2353.741306 | 2417.095177 |
14 | 2834.793403 | 2398.464916 | 2443.243123 | 2549.317541 |
15 | 2911.336999 | 2486.173173 | 2479.954705 | 2572.756650 |
16 | 2909.523577 | 2558.965994 | 2615.546035 | 2683.871013 |
17 | 2926.847896 | 2667.128931 | 2726.814203 | 2768.240263 |
18 | 2936.193651 | 2783.329460 | 2798.017348 | 2867.431479 |
19 | 2977.316330 | 2817.640864 | 2909.634764 | 2950.590676 |
20 | 3008.711678 | 2886.588851 | 2997.496469 | 3047.606308 |
21 | 3087.480913 | 2979.781258 | 3049.461810 | 3129.018905 |
22 | 3181.923979 | 3015.785920 | 3095.515782 | 3181.385901 |
23 | 3243.002163 | 3094.527752 | 3147.993510 | 3189.288341 |
24 | 3303.615153 | 3101.193288 | 3187.139249 | 3254.349873 |
final_result.loc['top_k'] = final_result.apply(lambda col: col.sort_values().index[0])
the best value of k
for each groupe of features is:
final_result.loc['top_k']
2 3.0 3 3.0 4 1.0 5 1.0 Name: top_k, dtype: float64
We will update the knn_train_test
function to add cross validation
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold
def knn_train_test2(df,features='',target='price',k=5,fold=5):
#first split the data set train/test
kf = KFold(fold,shuffle=True, random_state=1)
#train th model
model = KNeighborsRegressor(n_neighbors=k)
mses= cross_val_score(model, df[features], df[target], scoring='neg_mean_squared_error',cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)
return avg_rmse
rmse_values = list()
for feature in numeric_columns:
rmse = knn_train_test2(numeric_cars,[feature])
rmse_values.append(rmse)
style.use('bmh')
%matplotlib inline
plt.scatter(numeric_columns,rmse_values)
plt.xticks(rotation=90)
plt.title('The Root Mean Square Error For Univariate Model With Cross Validation')
plt.ylabel('RMSE')
plt.xlabel('Numeric attribute')
plt.show()