Guided Project: Predictiong Car Prices¶

In this project, we will use the machine learning algorithm, K-Nearest Neighbors, to perform regressions. More precisely, we will apply the algorithm to predict the price of cars using The Automobile Data Set. Before we dive further into the project, let's take a closer look at the algorithm we will be using.

The K-Nearest Neighbors (KNN) is a fundamental machine learning algorithm that can be used for classification and regression problems based on feature similarity.

Maybe you are asking how the algortheme work? the answer is for a given data point a prediction is made by looking at the k nearest neighbors data points, it depind in the the problem and the nature of data the mesure of similary is choisen.

Once we select the k neighbors we are almost donne in case of classification the data piont will be assigned to the class most common among its k nearest neighbors. In the case of regression like in this project where we want to estimate to price of car, the K-nearest neighbors will compute the averge price of k similar to predic the price.

In this project, we will follow these steps:

Introduction to the data set
Data Cleaning
Univariate Model
Multivariate Model
Hyperparameter Tuning
K-fold cross validation
Conclution

Introduction to the data set¶

In [36]:

import pandas as pd
import numpy as np

cars = pd.read_csv('imports-85.data')
pd.set_option('display.max_columns', None) # to display all the columns
cars.head()                                # print the first five rows

Out[36]:

	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.60	168.80	64.10	48.80	2548	dohc	four	130	mpfi	3.47	2.68	9.00	111	5000	21	27	13495
0	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
1	1	?	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	171.2	65.5	52.4	2823	ohcv	six	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
2	2	164	audi	gas	std	four	sedan	fwd	front	99.8	176.6	66.2	54.3	2337	ohc	four	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
3	2	164	audi	gas	std	four	sedan	4wd	front	99.4	176.6	66.4	54.3	2824	ohc	five	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450
4	2	?	audi	gas	std	two	sedan	fwd	front	99.8	177.3	66.3	53.1	2507	ohc	five	136	mpfi	3.19	3.40	8.5	110	5500	19	25	15250

As we can notice that the data has no header, we will use the documentation on the data source website to generate a list of column names and pass it as an argument to the read_csv method. We can also note the use of ? to represent missing values, we can also pass ? to the na_values argument to replace it with np.nan. See pd.read_csv.

In [37]:

columns_name = ['symboling','normalized-losses','make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels','engine-location','wheel-base','lenght','width','height','curb-weight','engine-type','num-of-cylinders','engine-size','fuel-system','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
cars = pd.read_csv('imports-85.data',header=None,names=columns_name,na_values='?')
pd.set_option('display.max_columns', None)
cars.head()

Out[37]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	lenght	width	height	curb-weight	engine-type	num-of-cylinders	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111.0	5000.0	21	27	13495.0
1	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111.0	5000.0	21	27	16500.0
2	1	NaN	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	171.2	65.5	52.4	2823	ohcv	six	152	mpfi	2.68	3.47	9.0	154.0	5000.0	19	26	16500.0
3	2	164.0	audi	gas	std	four	sedan	fwd	front	99.8	176.6	66.2	54.3	2337	ohc	four	109	mpfi	3.19	3.40	10.0	102.0	5500.0	24	30	13950.0
4	2	164.0	audi	gas	std	four	sedan	4wd	front	99.4	176.6	66.4	54.3	2824	ohc	five	136	mpfi	3.19	3.40	8.0	115.0	5500.0	18	22	17450.0

Data Cleaning¶

Let's took a close look at the data.

In [38]:

cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       203 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  lenght             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               201 non-null    float64
 19  stroke             201 non-null    float64
 20  compression-ratio  205 non-null    float64
 21  horsepower         203 non-null    float64
 22  peak-rpm           203 non-null    float64
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              201 non-null    float64
dtypes: float64(11), int64(5), object(10)
memory usage: 41.8+ KB

In [39]:

cars.describe()

Out[39]:

	symboling	normalized-losses	wheel-base	lenght	width	height	curb-weight	engine-size	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
count	205.000000	164.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	201.000000	201.000000	205.000000	203.000000	203.000000	205.000000	205.000000	201.000000
mean	0.834146	122.000000	98.756585	174.049268	65.907805	53.724878	2555.565854	126.907317	3.329751	3.255423	10.142537	104.256158	5125.369458	25.219512	30.751220	13207.129353
std	1.245307	35.442168	6.021776	12.337289	2.145204	2.443522	520.680204	41.642693	0.273539	0.316717	3.972040	39.714369	479.334560	6.542142	6.886443	7947.066342
min	-2.000000	65.000000	86.600000	141.100000	60.300000	47.800000	1488.000000	61.000000	2.540000	2.070000	7.000000	48.000000	4150.000000	13.000000	16.000000	5118.000000
25%	0.000000	94.000000	94.500000	166.300000	64.100000	52.000000	2145.000000	97.000000	3.150000	3.110000	8.600000	70.000000	4800.000000	19.000000	25.000000	7775.000000
50%	1.000000	115.000000	97.000000	173.200000	65.500000	54.100000	2414.000000	120.000000	3.310000	3.290000	9.000000	95.000000	5200.000000	24.000000	30.000000	10295.000000
75%	2.000000	150.000000	102.400000	183.100000	66.900000	55.500000	2935.000000	141.000000	3.590000	3.410000	9.400000	116.000000	5500.000000	30.000000	34.000000	16500.000000
max	3.000000	256.000000	120.900000	208.100000	72.300000	59.800000	4066.000000	326.000000	3.940000	4.170000	23.000000	288.000000	6600.000000	49.000000	54.000000	45400.000000

20% of the values in the column normalized-losses are missing, if we delete all the rows with missing values, we lose a lot of data, the alternative is to replace this value with the average of the columns or to delete the column entirely. For now we will juste replace the missing values using the average.

In [40]:

cars['normalized-losses'] = cars['normalized-losses'].fillna(cars['normalized-losses'].mean())

In [41]:

cars['num-of-doors'].value_counts(dropna=False)

Out[41]:

four    114
two      89
NaN       2
Name: num-of-doors, dtype: int64

In [42]:

cars['num-of-cylinders'].value_counts(dropna=False)

Out[42]:

four      159
six        24
five       11
eight       5
two         4
three       1
twelve      1
Name: num-of-cylinders, dtype: int64

the database contine 15 continuous attributes out of 26 and one integer attribute symboling , and some nominal attributes can be transformed to numerica like num-of-doors and num-of-cylinders

lets statr by tronsforming the num-of-doors and num-of-cylinders to numeric columns

In [43]:

cars['num-of-cylinders'] = cars['num-of-cylinders'].replace(to_replace={'four':4,'six':6,'five':5,'eight':8, 'two':2,'twelve':11,'three':3})

In [44]:

cars['num-of-doors'] = cars['num-of-doors'].replace(to_replace={'four':4, 'two':2})

In [45]:

cars['num-of-doors'].value_counts(dropna=False)

Out[45]:

4.0    114
2.0     89
NaN      2
Name: num-of-doors, dtype: int64

In [46]:

cars['num-of-cylinders'].value_counts(dropna=False)

Out[46]:

4     159
6      24
5      11
8       5
2       4
11      1
3       1
Name: num-of-cylinders, dtype: int64

In [47]:

cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    float64
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       203 non-null    float64
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  lenght             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    int64  
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               201 non-null    float64
 19  stroke             201 non-null    float64
 20  compression-ratio  205 non-null    float64
 21  horsepower         203 non-null    float64
 22  peak-rpm           203 non-null    float64
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              201 non-null    float64
dtypes: float64(12), int64(6), object(8)
memory usage: 41.8+ KB

we still have some columns with missing data, we will just drop the rows

In [48]:

cars.dropna(inplace=True)

In [49]:

cars.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          193 non-null    int64  
 1   normalized-losses  193 non-null    float64
 2   make               193 non-null    object 
 3   fuel-type          193 non-null    object 
 4   aspiration         193 non-null    object 
 5   num-of-doors       193 non-null    float64
 6   body-style         193 non-null    object 
 7   drive-wheels       193 non-null    object 
 8   engine-location    193 non-null    object 
 9   wheel-base         193 non-null    float64
 10  lenght             193 non-null    float64
 11  width              193 non-null    float64
 12  height             193 non-null    float64
 13  curb-weight        193 non-null    int64  
 14  engine-type        193 non-null    object 
 15  num-of-cylinders   193 non-null    int64  
 16  engine-size        193 non-null    int64  
 17  fuel-system        193 non-null    object 
 18  bore               193 non-null    float64
 19  stroke             193 non-null    float64
 20  compression-ratio  193 non-null    float64
 21  horsepower         193 non-null    float64
 22  peak-rpm           193 non-null    float64
 23  city-mpg           193 non-null    int64  
 24  highway-mpg        193 non-null    int64  
 25  price              193 non-null    float64
dtypes: float64(12), int64(6), object(8)
memory usage: 40.7+ KB

the list of numeric columns to keep

In [50]:

numeric_columns = ['symboling','normalized-losses','num-of-doors','wheel-base','lenght','width','height',
                   'curb-weight','num-of-cylinders','engine-size','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']

In [51]:

len(numeric_columns)

Out[51]:

In [52]:

numeric_cars = cars[numeric_columns]
numeric_cars.head()

Out[52]:

	symboling	normalized-losses	num-of-doors	wheel-base	lenght	width	height	curb-weight	num-of-cylinders	engine-size	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	122.0	2.0	88.6	168.8	64.1	48.8	2548	4	130	3.47	2.68	9.0	111.0	5000.0	21	27	13495.0
1	3	122.0	2.0	88.6	168.8	64.1	48.8	2548	4	130	3.47	2.68	9.0	111.0	5000.0	21	27	16500.0
2	1	122.0	2.0	94.5	171.2	65.5	52.4	2823	6	152	2.68	3.47	9.0	154.0	5000.0	19	26	16500.0
3	2	164.0	4.0	99.8	176.6	66.2	54.3	2337	4	109	3.19	3.40	10.0	102.0	5500.0	24	30	13950.0
4	2	164.0	4.0	99.4	176.6	66.4	54.3	2824	5	136	3.19	3.40	8.0	115.0	5500.0	18	22	17450.0

Feature scaling¶

Depending on the algorithm you are using, you may or may not need to standardize the data, since we assume that all features have the same considerations and we are computing distance we should apply standardization. To normalize all columns in a range of 0 to 1, we will use Min-Max normalization.

Min-Max Normalization:¶

$$ X = \frac{X - X.min}{X.max - X.min}$$

In [53]:

price_col = numeric_cars['price']
numeric_cars = (numeric_cars - numeric_cars.min())/(numeric_cars.max() - numeric_cars.min())
numeric_cars['price'] = price_col

In [54]:

numeric_cars.head()

Out[54]:

	symboling	normalized-losses	num-of-doors	wheel-base	lenght	width	height	curb-weight	num-of-cylinders	engine-size	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	1.0	0.298429	0.0	0.058309	0.413433	0.324786	0.083333	0.411171	0.125	0.260377	0.664286	0.290476	0.1250	0.294393	0.346939	0.222222	0.289474	13495.0
1	1.0	0.298429	0.0	0.058309	0.413433	0.324786	0.083333	0.411171	0.125	0.260377	0.664286	0.290476	0.1250	0.294393	0.346939	0.222222	0.289474	16500.0
2	0.6	0.298429	0.0	0.230321	0.449254	0.444444	0.383333	0.517843	0.375	0.343396	0.100000	0.666667	0.1250	0.495327	0.346939	0.166667	0.263158	16500.0
3	0.8	0.518325	1.0	0.384840	0.529851	0.504274	0.541667	0.329325	0.125	0.181132	0.464286	0.633333	0.1875	0.252336	0.551020	0.305556	0.368421	13950.0
4	0.8	0.518325	1.0	0.373178	0.529851	0.521368	0.541667	0.518231	0.250	0.283019	0.464286	0.633333	0.0625	0.313084	0.551020	0.138889	0.157895	17450.0

Univariate Model¶

In [55]:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_train_test(df,features='',target='price',k=5):
    
    #first split the data set train/test
    nbr_rows = len(df)
    np.random.seed(1)
    indexs = np.random.permutation(nbr_rows) #shuffle the data
    train = df.iloc[indexs[0:round(nbr_rows * .75)]] # 75% for the training
    test = df.iloc[indexs[round(nbr_rows * .75):]]
    
    #train th model
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(train[features],train[target])
    predictions = model.predict(test[features])
    mse = mean_squared_error(test[target],predictions)
    rmse = mse ** (1/2)
    return rmse
    

In [56]:

numeric_columns.remove('price')

In [57]:

len(numeric_columns)

Out[57]:

For each numeric column, we will create, train, and test a univariate model using default k value from scikit-learn.

In [58]:

rmse_values = list()
for feature in numeric_columns:
    rmse = knn_train_test(numeric_cars,[feature])
    rmse_values.append(rmse)

In [59]:

rmse_values

Out[59]:

[7391.45857809314,
 6691.161998798614,
 7355.852500673643,
 5771.642749902318,
 5260.713472920443,
 3709.0194335340625,
 7982.664949230092,
 3084.2734194350105,
 4086.3104265951206,
 3171.8674012060046,
 5995.115904425868,
 6234.768973733778,
 5096.504099053259,
 4550.010748522103,
 6514.248899080129,
 3675.0418801468554,
 4131.277200818168]

In [60]:

import  matplotlib.pyplot as plt
from matplotlib import style


style.use('bmh')

%matplotlib inline 
plt.scatter(numeric_columns,rmse_values,)
plt.xticks(rotation=90)
plt.title('The  Root Mean Square Error  For Univariate Model')
plt.ylabel('RMSE')
plt.xlabel('Numeric attribute')
plt.show()

For each numeric column, we will create, train, and test a univariate model using the following k values (1, 3, 5, 7, and 9). Visualize the results using a scatter plot.

In [61]:

neighbors = [1,3,5,7,9]
rmse_k = {}
for feature in numeric_columns:
    rmse_values = list()
    for k in neighbors:      
        rmse = knn_train_test(numeric_cars,[feature],k=k)
        rmse_values.append(rmse)
    rmse_k[feature] = rmse_values 

In [62]:

results = pd.DataFrame(rmse_k, index=neighbors)

In [63]:

results

Out[63]:

	symboling	normalized-losses	num-of-doors	wheel-base	lenght	width	height	curb-weight	num-of-cylinders	engine-size	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg
1	9936.839864	6581.090924	9473.439976	4665.530244	4571.303106	2062.156330	9660.569786	5103.185798	6469.261154	2569.223914	6024.216180	3697.468140	4765.611197	3349.458156	6178.049318	4805.300574	4528.756243
3	7639.444338	6207.707240	8001.398262	5542.296773	5127.738422	3130.699015	8324.506117	3554.133091	3929.319132	2768.704046	5976.832174	5377.241925	4589.625354	4001.242160	5700.254306	3891.116610	4491.859959
5	7391.458578	6691.161999	7355.852501	5771.642750	5260.713473	3709.019434	7982.664949	3084.273419	4086.310427	3171.867401	5995.115904	6234.768974	5096.504099	4550.010749	6514.248899	3675.041880	4131.277201
7	6923.249562	7090.636344	7691.424289	5427.876232	5562.721854	3509.697064	7865.607051	2928.318411	4447.210949	3194.923231	6116.079189	6440.383994	5660.042521	4849.468468	6427.913607	3623.745989	3756.331451
9	7017.689447	7195.266961	7521.324168	5325.721120	5577.753263	3313.365517	7625.072018	2744.663489	4905.965071	2927.365390	5973.632081	6542.294028	6254.368059	5010.198400	6980.497340	4055.921880	3349.029400

In [64]:

fig = plt.figure(figsize=(12, 3*6))

<Figure size 864x1296 with 0 Axes>

In [65]:

for sp in range(0,6):
    ax = fig.add_subplot(6,3,sp*3+1)
    ax.scatter(x=neighbors,y=results[numeric_columns[sp]])
    ax.spines["right"].set_visible(False)    
    ax.spines["left"].set_visible(False)
    ax.spines["top"].set_visible(False)    
    ax.spines["bottom"].set_visible(False)
    ax.set_xlim(0, 10)
    ax.set_ylim(0,10000)
    #ax.set_yticks([0,50,100])
    #ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.3)
    ax.set_title(numeric_columns[sp])
    ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
    #if sp == 5:
    #   ax.tick_params(labelbottom='on')
fig

Out[65]:

In [66]:

for sp in range(0,5):
    ax = fig.add_subplot(6,3,sp*3+2)
    ax.scatter(x=neighbors,y=results[numeric_columns[sp+6]])
    ax.spines["right"].set_visible(False)    
    ax.spines["left"].set_visible(False)
    ax.spines["top"].set_visible(False)    
    ax.spines["bottom"].set_visible(False)
    ax.set_xlim(0, 10)
    ax.set_ylim(0,10000)
    ax.set_title(numeric_columns[sp+6])
    ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
    #if sp == 5:
    #   ax.tick_params(labelbottom='on')
fig

Out[66]:

In [67]:

for sp in range(0,6):
    ax = fig.add_subplot(6,3,sp*3+3)
    ax.scatter(x=neighbors,y=results[numeric_columns[sp+11]])
    ax.spines["right"].set_visible(False)    
    ax.spines["left"].set_visible(False)
    ax.spines["top"].set_visible(False)    
    ax.spines["bottom"].set_visible(False)
    ax.set_xlim(0, 10)
    ax.set_ylim(0,10000)
    ax.set_title(numeric_columns[sp+11])
    ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
    #if sp == 5:
    #   ax.tick_params(labelbottom='on')
fig

Out[67]:

Multivariate Model¶

We will sort the features based in result from the previous step.

In [68]:

# Compute average RMSE across different `k` values for each feature.
results.loc['avg'] = results.apply(np.mean)

In [69]:

results

Out[69]:

	symboling	normalized-losses	num-of-doors	wheel-base	lenght	width	height	curb-weight	num-of-cylinders	engine-size	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg
1	9936.839864	6581.090924	9473.439976	4665.530244	4571.303106	2062.156330	9660.569786	5103.185798	6469.261154	2569.223914	6024.216180	3697.468140	4765.611197	3349.458156	6178.049318	4805.300574	4528.756243
3	7639.444338	6207.707240	8001.398262	5542.296773	5127.738422	3130.699015	8324.506117	3554.133091	3929.319132	2768.704046	5976.832174	5377.241925	4589.625354	4001.242160	5700.254306	3891.116610	4491.859959
5	7391.458578	6691.161999	7355.852501	5771.642750	5260.713473	3709.019434	7982.664949	3084.273419	4086.310427	3171.867401	5995.115904	6234.768974	5096.504099	4550.010749	6514.248899	3675.041880	4131.277201
7	6923.249562	7090.636344	7691.424289	5427.876232	5562.721854	3509.697064	7865.607051	2928.318411	4447.210949	3194.923231	6116.079189	6440.383994	5660.042521	4849.468468	6427.913607	3623.745989	3756.331451
9	7017.689447	7195.266961	7521.324168	5325.721120	5577.753263	3313.365517	7625.072018	2744.663489	4905.965071	2927.365390	5973.632081	6542.294028	6254.368059	5010.198400	6980.497340	4055.921880	3349.029400
avg	7781.736358	6753.172694	8008.687839	5346.613424	5220.046023	3144.987472	8291.683984	3482.914842	4767.613347	2926.416796	6017.175106	5658.431412	5273.230246	4352.075586	6360.192694	4010.225386	4051.450851

In [70]:

best_features = results.loc['avg'].sort_values().index

In [71]:

best_features

Out[71]:

Index(['engine-size', 'width', 'curb-weight', 'city-mpg', 'highway-mpg',
       'horsepower', 'num-of-cylinders', 'lenght', 'compression-ratio',
       'wheel-base', 'stroke', 'bore', 'peak-rpm', 'normalized-losses',
       'symboling', 'num-of-doors', 'height'],
      dtype='object')

In [72]:

best_features

Out[72]:

Index(['engine-size', 'width', 'curb-weight', 'city-mpg', 'highway-mpg',
       'horsepower', 'num-of-cylinders', 'lenght', 'compression-ratio',
       'wheel-base', 'stroke', 'bore', 'peak-rpm', 'normalized-losses',
       'symboling', 'num-of-doors', 'height'],
      dtype='object')

In [73]:

#the best 2 features from the previous step
best_features[:2]

Out[73]:

Index(['engine-size', 'width'], dtype='object')

In [74]:

nbr_features = [2,3,4,5,6,7]
rmse_values ={}
for nbr_feature in nbr_features:
        rmse = knn_train_test(numeric_cars,best_features[:nbr_feature])
        rmse_values[nbr_feature] = rmse

In [75]:

rmse_values

Out[75]:

{2: 2244.634766133086,
 3: 2199.4665137255447,
 4: 2346.6186852930896,
 5: 2350.786971420422,
 6: 2537.1719108093566,
 7: 2402.188727147252}

Hyperparameter Tuning¶

A good a chose of k can imporve accuracy, we will use searsh grid to covere a range of value between 1 and 25. For the features we will select the three top models from the last step.

In [76]:

top3 = [2,3,4,5]
rmse_k = {}
for nbr_feature in top3:
        rmse_values = list()
        for k in range(1,25):
            rmse = knn_train_test(numeric_cars,best_features[:nbr_feature],k=k)
            rmse_values.append(rmse)
        rmse_k[nbr_feature] = rmse_values

In [77]:

rmse_k

Out[77]:

{2: [2384.8323010573863,
  2030.8784608923959,
  1964.0343194227464,
  2201.423501053258,
  2244.634766133086,
  2467.51531056453,
  2579.4367959108126,
  2503.2739671653003,
  2487.9080667005983,
  2559.959226359865,
  2689.18357091232,
  2716.31906933967,
  2765.3286698631077,
  2834.7934028985865,
  2911.3369994315767,
  2909.5235766556757,
  2926.8478955706305,
  2936.1936513744995,
  2977.316330053173,
  3008.711678185186,
  3087.4809134656584,
  3181.9239792949256,
  3243.0021632731014,
  3303.6151531104692],
 3: [1959.2384734465923,
  1902.5755131750575,
  1746.8690921188113,
  1767.3586418750488,
  2199.4665137255447,
  2208.9384608545784,
  2214.8778506460153,
  2407.3242039485804,
  2436.9999119266404,
  2223.685969192443,
  2190.0789485687924,
  2300.616854176771,
  2306.17457684166,
  2398.464915856294,
  2486.1731734773657,
  2558.965994025218,
  2667.128930620863,
  2783.3294599200517,
  2817.6408636433353,
  2886.588850570788,
  2979.7812577769514,
  3015.7859197633666,
  3094.5277518437797,
  3101.193287848042],
 4: [2131.390679618982,
  2206.456985096998,
  2271.0984987894526,
  2145.8483795993525,
  2346.6186852930896,
  2407.1171822433002,
  2458.6987468244365,
  2505.775662199398,
  2540.68372219656,
  2499.9748365400264,
  2452.8085856456073,
  2407.9226779476408,
  2353.7413061113307,
  2443.2431227676734,
  2479.9547049141697,
  2615.546035247707,
  2726.81420267943,
  2798.017348142094,
  2909.634763853679,
  2997.496469130184,
  3049.461809572687,
  3095.515781923601,
  3147.9935104031965,
  3187.1392490403423],
 5: [1933.0795810485058,
  2155.764972018827,
  2205.8025865715917,
  2209.161862067264,
  2350.786971420422,
  2397.6095645126616,
  2532.9836826822816,
  2537.307925625131,
  2500.662772434606,
  2295.518860737154,
  2305.323067136258,
  2299.7159552489065,
  2417.0951768720624,
  2549.317541292039,
  2572.7566502467525,
  2683.8710127117415,
  2768.240263311874,
  2867.431478622222,
  2950.5906760713688,
  3047.6063083898325,
  3129.0189045129937,
  3181.3859005391196,
  3189.2883411802713,
  3254.3498733317087]}

In [78]:

final_result = pd.DataFrame(rmse_k,index=range(1,25))

In [79]:

final_result

Out[79]:

	2	3	4	5
1	2384.832301	1959.238473	2131.390680	1933.079581
2	2030.878461	1902.575513	2206.456985	2155.764972
3	1964.034319	1746.869092	2271.098499	2205.802587
4	2201.423501	1767.358642	2145.848380	2209.161862
5	2244.634766	2199.466514	2346.618685	2350.786971
6	2467.515311	2208.938461	2407.117182	2397.609565
7	2579.436796	2214.877851	2458.698747	2532.983683
8	2503.273967	2407.324204	2505.775662	2537.307926
9	2487.908067	2436.999912	2540.683722	2500.662772
10	2559.959226	2223.685969	2499.974837	2295.518861
11	2689.183571	2190.078949	2452.808586	2305.323067
12	2716.319069	2300.616854	2407.922678	2299.715955
13	2765.328670	2306.174577	2353.741306	2417.095177
14	2834.793403	2398.464916	2443.243123	2549.317541
15	2911.336999	2486.173173	2479.954705	2572.756650
16	2909.523577	2558.965994	2615.546035	2683.871013
17	2926.847896	2667.128931	2726.814203	2768.240263
18	2936.193651	2783.329460	2798.017348	2867.431479
19	2977.316330	2817.640864	2909.634764	2950.590676
20	3008.711678	2886.588851	2997.496469	3047.606308
21	3087.480913	2979.781258	3049.461810	3129.018905
22	3181.923979	3015.785920	3095.515782	3181.385901
23	3243.002163	3094.527752	3147.993510	3189.288341
24	3303.615153	3101.193288	3187.139249	3254.349873

In [80]:

final_result.loc['top_k'] = final_result.apply(lambda col: col.sort_values().index[0])

the best value of k for each groupe of features is:

In [81]:

final_result.loc['top_k']

Out[81]:

2    3.0
3    3.0
4    1.0
5    1.0
Name: top_k, dtype: float64

K-fold cross validation¶

We will update the knn_train_test function to add cross validation

In [82]:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold

def knn_train_test2(df,features='',target='price',k=5,fold=5):
    
    #first split the data set train/test
    kf = KFold(fold,shuffle=True, random_state=1)
    
    #train th model
    model = KNeighborsRegressor(n_neighbors=k)
    mses= cross_val_score(model, df[features], df[target], scoring='neg_mean_squared_error',cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    return avg_rmse

In [82]:

In [83]:

rmse_values = list()
for feature in numeric_columns:
    rmse = knn_train_test2(numeric_cars,[feature])
    rmse_values.append(rmse)

In [84]:

style.use('bmh')

%matplotlib inline 
plt.scatter(numeric_columns,rmse_values)
plt.xticks(rotation=90)
plt.title('The  Root Mean Square Error  For Univariate Model With Cross Validation')
plt.ylabel('RMSE')
plt.xlabel('Numeric attribute')
plt.show()