Predicting Car Prices

In this project, we will predict a car's market price using its attributes. The data set we will be working with contains information on various cars: for each car we have technical attributes such as the engine's displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more.

The data set and its column documentation are available from the UCI Machine Learning Repository (the Automobile Data Set).

We will use the k-nearest neighbors (KNN) algorithm to predict a car's price as accurately as possible.

Exploring Data

We will read the data into a DataFrame. Since the data file doesn't come with a header, we need to add proper column names.
We will also drop non-numerical columns, which can't be used as features for our model.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 100)

cars = pd.read_csv('imports-85.csv')
print(cars.shape)
cars.head()
(204, 26)
Out[1]:
3 ? alfa-romero gas std two convertible rwd front 88.60 168.80 64.10 48.80 2548 dohc four 130 mpfi 3.47 2.68 9.00 111 5000 21 27 13495
0 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
3 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
4 2 ? audi gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250

It looks like this dataset does not include the column names. We'll have to add in the column names manually using the documentation.

In [2]:
columns  = ['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_of_doors', 'body_style', 
            'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 
            'engine_type', 'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 
            'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']

cars = pd.read_csv('imports-85.csv', names=columns)
cars.head()
Out[2]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base length width height curb_weight engine_type num_of_cylinders engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

Data Cleaning and Preparation

The k-nearest neighbors algorithm uses a distance metric to determine the nearest neighbors, which means we can only use numeric columns as features. We'll also need to do a bit of data cleaning. We will perform the following steps:

  1. Replace missing and meaningless values like ? with np.nan
  2. Convert string columns (which are actually numeric) to a numeric datatype
  3. Drop rows where the target column (price) is missing
  4. Replace remaining missing values with the average value of each column
  5. Normalize the DataFrame, except the price column

We can also separate out the continuous numerical columns listed in the documentation as follows:

In [3]:
continuous_values_cols = ['normalized_losses', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 
                          'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 
                          'highway_mpg', 'price']

numeric_cars = cars[continuous_values_cols].copy()
numeric_cars.head()
Out[3]:
normalized_losses wheel_base length width height curb_weight bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 ? 88.6 168.8 64.1 48.8 2548 3.47 2.68 9.0 111 5000 21 27 13495
1 ? 88.6 168.8 64.1 48.8 2548 3.47 2.68 9.0 111 5000 21 27 16500
2 ? 94.5 171.2 65.5 52.4 2823 2.68 3.47 9.0 154 5000 19 26 16500
3 164 99.8 176.6 66.2 54.3 2337 3.19 3.40 10.0 102 5500 24 30 13950
4 164 99.4 176.6 66.4 54.3 2824 3.19 3.40 8.0 115 5500 18 22 17450
  1. Replace missing and meaningless values like ? with np.nan
In [4]:
numeric_cars.replace('?', np.nan, inplace=True)
  2. Convert string columns (which are actually numeric) to a numeric datatype
In [5]:
# Check columns which are of object type
text_cols = numeric_cars.select_dtypes(include=['object']).columns
print(text_cols)

numeric_cars[text_cols] = numeric_cars[text_cols].astype('float')

# Checking if any non-numerical column is left
numeric_cars.dtypes.value_counts()
Index(['normalized_losses', 'bore', 'stroke', 'horsepower', 'peak_rpm',
       'price'],
      dtype='object')
Out[5]:
float64    11
int64       3
dtype: int64
  3. Drop rows where the target column (price) is missing
In [6]:
# Because `price` is the column we want to predict, let's remove any rows with missing `price` values.
numeric_cars.dropna(subset=['price'], inplace=True)

# Checking if there is any null value
numeric_cars['price'].isnull().sum() 
Out[6]:
0
  4. Replace remaining missing values with the average value of each column
In [7]:
# Replace missing values in other columns using their respective column means.
numeric_cars.fillna(numeric_cars.mean(), inplace=True)
numeric_cars.isnull().sum().value_counts()
Out[7]:
0    14
dtype: int64

The k-nearest neighbors algorithm uses the Euclidean distance to determine the closest neighbors:

$$ Distance = \sqrt{{(q_1-p_1)}^2+{(q_2-p_2)}^2+...{(q_n-p_n)}^2} $$

Where q and p represent two rows and the subscripts index the columns. However, the columns have very different scales. For example, if we take rows 2 and 3, their peak RPM values differ by 500, while their widths differ by only 0.7. The algorithm would therefore give far more weight to the difference in peak RPM.
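As a quick illustration, here is a minimal sketch of that effect using rows 2 and 3 of the numeric_cars frame prepared above (printed values are approximate and depend on the cleaning steps):

# Sketch: compare raw differences between rows 2 and 3 before any scaling
row_a, row_b = numeric_cars.loc[2], numeric_cars.loc[3]

print(abs(row_a['peak_rpm'] - row_b['peak_rpm']))   # ~500
print(abs(row_a['width'] - row_b['width']))         # ~0.7

# The squared peak_rpm term dominates the Euclidean distance between the rows
diff = row_a.drop('price') - row_b.drop('price')
print(np.sqrt((diff ** 2).sum()))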

That is why it is important to normalize the features so that they share a common scale. After mean normalization, the values fall roughly between -1 and 1 (see any reference on feature scaling for other options).

$$ x' = \frac{x - \text{mean}(x)}{\max(x) - \min(x)} $$

In pandas this would be:

$$ df' = \frac{df - df.mean()}{df.max() - df.min()}$$

Where df is any dataframe.
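For reference, the mean normalization written above is a one-liner in pandas. This is a sketch only; the cell below instead uses a max-based scaling, (max - x) / max, which also brings every feature into a comparable range:

# Sketch: mean normalization as in the formula above (not the scaling used below)
mean_normalised = (numeric_cars - numeric_cars.mean()) / (numeric_cars.max() - numeric_cars.min())
mean_normalised['price'] = numeric_cars['price']   # keep the target on its original scale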


  5. Normalize the DataFrame, except the price column
In [8]:
# Normalising the DataFrame.
# Note: this scales each value as (max - x) / max, which differs slightly from the
# mean normalization formula above, but it also maps every feature into a
# comparable 0-1 range.
normalised_cars = (numeric_cars.max() - numeric_cars)/numeric_cars.max()
normalised_cars['price'] = numeric_cars['price']
normalised_cars.head()
Out[8]:
normalized_losses wheel_base length width height curb_weight bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 0.523438 0.267163 0.188852 0.109722 0.183946 0.373340 0.119289 0.357314 0.608696 0.576336 0.242424 0.571429 0.500000 13495.0
1 0.523438 0.267163 0.188852 0.109722 0.183946 0.373340 0.119289 0.357314 0.608696 0.576336 0.242424 0.571429 0.500000 16500.0
2 0.523438 0.218362 0.177319 0.090278 0.123746 0.305706 0.319797 0.167866 0.608696 0.412214 0.242424 0.612245 0.518519 16500.0
3 0.359375 0.174524 0.151370 0.080556 0.091973 0.425234 0.190355 0.184652 0.565217 0.610687 0.166667 0.510204 0.444444 13950.0
4 0.359375 0.177833 0.151370 0.077778 0.091973 0.305460 0.190355 0.184652 0.652174 0.561069 0.166667 0.632653 0.592593 17450.0

Applying Machine Learning

K-Nearest Neighbors
Suppose we have a dataframe named 'train' and a row named 'test'. The idea behind k-nearest neighbors is to find the k rows from 'train' with the lowest distance to 'test'. Then we average the target column of those k rows and return the result as the prediction for 'test'.
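The idea can be sketched directly in pandas before turning to scikit-learn. This is a minimal, hypothetical illustration; train, test, and predict_price are not used elsewhere in this notebook:

# Minimal sketch of the KNN idea for a single feature.
# `train` is a DataFrame with the feature and 'price'; `test` is one row (a Series).
def predict_price(train, test, feature, k=5):
    distances = (train[feature] - test[feature]).abs()   # 1-D Euclidean distance
    nearest = distances.sort_values().index[:k]          # labels of the k closest rows
    return train.loc[nearest, 'price'].mean()            # average their prices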

We will create a knn_train_test function which uses the KNeighborsRegressor class from scikit-learn.

In [9]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_train_test(feature, target, df):
    
    # Randomizing the Dataset
    np.random.seed(1)
    new_df = df.iloc[np.random.permutation(len(df))].copy()
    
    # Divide the data in half
    half_point = int(len(df)/2)
    train_df = new_df[:half_point]
    test_df = new_df[half_point:]
    
    # Fit a KNN Model using default K value
    knn = KNeighborsRegressor()
    knn.fit(train_df[[feature]], train_df[target])
    
    # Making predictions using the model
    predictions = knn.predict(test_df[[feature]])
    
    # Calculate and return RMSE Value
    rmse = np.sqrt(mean_squared_error(test_df[target], predictions))
    return rmse

This function will train and test univariate models.

First, we will evaluate which features give us the most accurate prediction.

In [10]:
# Extracting all feature names except price 
columns  = normalised_cars.columns.tolist()
columns.remove('price')

# Create a dictionary of RMSE values along with features
rmse_results = {}

for col in columns:
    rmse_results[col] = knn_train_test(col, 'price', normalised_cars)

# Converting dictionary into Series and sorting it to display results
rmse_results = pd.Series(rmse_results)    
rmse_results.sort_values()
Out[10]:
horsepower           4007.472352
curb_weight          4437.934395
highway_mpg          4579.037250
width                4644.898429
city_mpg             4729.673421
length               5382.671155
wheel_base           5527.682489
compression_ratio    6736.676353
bore                 6816.853712
height               7487.652519
peak_rpm             7498.746475
normalized_losses    7635.170416
stroke               8078.491289
dtype: float64

It looks like the horsepower feature gives us the lowest error. We should keep this ranking in mind when using the function with multiple features.

But we need to explore further. Let's modify the function to accept the k value (the number of neighbors) as a parameter. Then we can loop through a list of k values and features to determine which combination works best for our machine learning model.

Modifying the knn_train_test() function to accept a list of k values as a parameter.

In [11]:
def knn_train_test2(feature, target, df, k_value):
    # Randomizing the Dataset
    np.random.seed(1)
    new_df = df.iloc[np.random.permutation(len(df))].copy()
    
    # Divide the data in half
    half_point = int(len(df)/2)
    train_df = new_df[:half_point]
    test_df = new_df[half_point:]
    
    k_results = []
    
    # Fitting the model with k neighbors
    for k in k_value:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[[feature]], train_df[target])
    
        # Making predictions using the model
        predictions = knn.predict(test_df[[feature]])

        # Calculate and return RMSE Value
        rmse = np.sqrt(mean_squared_error(test_df[target], predictions))
        k_results.append(rmse)
    
    return k_results

Training and testing univariate models using the following k values: 1, 3, 5, 7, and 9.

In [12]:
# K Nearest Neighbors
k_values = [1, 3, 5, 7, 9]

# Create a dictionary of RMSE Values along with Features
k_rmse_results = {}

# Looping through all the features
for col in columns:
    k_rmse_results[col] = knn_train_test2(col, 'price', normalised_cars, k_values)
    
k_rmse_results
Out[12]:
{'normalized_losses': [7906.594141025014,
  6712.873355379836,
  7635.170416092379,
  7870.651003239241,
  8221.578465544319],
 'wheel_base': [5964.682235317891,
  5246.472910232148,
  5527.682488732292,
  5485.683033525724,
  5734.4339857054465],
 'length': [5291.785164547288,
  5267.216777678541,
  5382.671155138166,
  5396.362242025737,
  5420.547916432259],
 'width': [4453.161424568767,
  4697.287114550659,
  4644.898428543422,
  4562.1341847495605,
  4643.882339393336],
 'height': [9108.471836593655,
  8049.98714728832,
  7487.652518884965,
  7753.797418084058,
  7695.632426557866],
 'curb_weight': [5518.883237405808,
  5048.607726036669,
  4437.934394635539,
  4369.349089851214,
  4632.205545221074],
 'bore': [7496.149231240644,
  6936.9888741632,
  6816.8537123691885,
  7062.061305053834,
  6869.727437364902],
 'stroke': [7282.34885878108,
  7664.984030806539,
  8078.491288735677,
  7754.483859461689,
  7723.913153845065],
 'compression_ratio': [9024.902677953633,
  7033.552922995039,
  6736.676353123451,
  7459.113194422072,
  7219.385481303907],
 'horsepower': [3749.5962185254293,
  3964.9503610053594,
  4007.4723516831596,
  4391.481673529705,
  4505.188632005311],
 'peak_rpm': [9825.559283202294,
  8025.172980050709,
  7498.746474941366,
  7296.5172664110205,
  7239.47816887947],
 'city_mpg': [4540.361003224739,
  4662.468376743848,
  4729.673420999269,
  5099.274289469859,
  4999.291723774096],
 'highway_mpg': [5270.360471073066,
  4618.186622340838,
  4579.0372499290315,
  4914.26000287261,
  5181.912418963636]}

Visualising RMSEs for various K and Features

In [96]:
import matplotlib.pyplot as plt
%matplotlib inline

import matplotlib.style as style
style.use('fivethirtyeight')

plt.figure(figsize=(10, 12))

for k,v in k_rmse_results.items():
    x = [1, 3, 5, 7, 9]
    y = v
    
    plt.plot(x, y, label=k)
    plt.xlabel('k value')
    plt.ylabel('RMSE')

plt.legend(bbox_to_anchor=(1.3, 1), borderaxespad=0)
# plt.legend()
plt.show()

The visualisation isn't very helpful. Let's instead sort the features by their average RMSE (root mean squared error).

Finding best features (with lowest RMSEs)

In [14]:
# Getting average RMSE across different `k` values for each feature.
feature_rmse = {}

for k,v in k_rmse_results.items():
    avg_rmse = np.mean(v)
    feature_rmse[k] = avg_rmse
    
top_features = pd.Series(feature_rmse).sort_values()
top_features
Out[14]:
horsepower           4123.737847
width                4600.272698
curb_weight          4801.395999
city_mpg             4806.213763
highway_mpg          4912.751353
length               5351.716651
wheel_base           5591.790931
bore                 7036.356112
compression_ratio    7494.726126
normalized_losses    7669.373476
stroke               7700.844238
peak_rpm             7977.094835
height               8019.108269
dtype: float64

The table above reiterates our earlier finding that horsepower gives the lowest error.

Multivariate Model

Now, we will adapt the knn_train_test function to work with multiple features at once.

In [15]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_train_test3(features, target, df):
    # Randomizing the Dataset
    np.random.seed(1)
    shuffled_index = np.random.permutation(df.index)
    new_df = df.reindex(shuffled_index)
    
    # Divide the data in half
    half_point = int(len(df)/2)
    train_df = new_df[:half_point]
    test_df = new_df[half_point:]
    
    # Fit the model with k = 5 neighbors
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(train_df[features], train_df[target])
    
    # Making predictions using the model
    predictions = knn.predict(test_df[features])

    # Calculate and return RMSE Value
    rmse = np.sqrt(mean_squared_error(test_df[target], predictions))
    return rmse

Applying this function to the top features (those with the lowest errors) should further improve the accuracy of our model.

In [16]:
rmse_results = {}

rmse_results['Top Two Features'] = knn_train_test3(top_features[:2].index, 'price', normalised_cars)
rmse_results['Top Three Features']  = knn_train_test3(top_features[:3].index, 'price', normalised_cars)
rmse_results['Top Four Features'] = knn_train_test3(top_features[:4].index, 'price', normalised_cars)
rmse_results['Top Five Features'] = knn_train_test3(top_features[:5].index, 'price', normalised_cars)

# Displaying results sorted as per the RMSEs
pd.Series(rmse_results).sort_values()
Out[16]:
Top Three Features    3212.559631
Top Four Features     3232.103629
Top Five Features     3346.673710
Top Two Features      3681.398092
dtype: float64

We got the lowest error from the top three features, followed by the top four and top five features.

Hyperparameter Tuning

Now, let's try varying the K values. We can further tune our machine learning model by finding the optimal K value to use.

In [17]:
def knn_train_test_hyp(feature, target, df, k_value):
    # Randomizing the Dataset
    np.random.seed(1)
    new_df = df.iloc[np.random.permutation(len(df))].copy()
    
    # Divide the data in half
    half_point = int(len(df)/2)
    train_df = new_df[:half_point]
    test_df = new_df[half_point:]
    
    k_results = []
    
    # Fitting the model with k neighbors
    for k in k_value:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[feature], train_df[target])
    
        # Making predictions using the model
        predictions = knn.predict(test_df[feature])

        # Calculate and return RMSE Value
        rmse = np.sqrt(mean_squared_error(test_df[target], predictions))
        k_results.append(rmse)
    
    return k_results
In [18]:
# Training and testing on the top three, four, and five features
col_names = ['Top Three', 'Top Four', 'Top Five']
k_values = [x for x in range(1, 26)]

rmse_results = {}

for i in range(3):
    rmse = knn_train_test_hyp(top_features[:i+3].index, 'price', normalised_cars, k_values)    
    rmse_results['{} Features'.format(col_names[i])] = rmse
    
rmse_results
Out[18]:
{'Top Three Features': [3308.749941929402,
  3044.812909435545,
  3042.2117028741623,
  2958.964739955848,
  3212.559630605792,
  3542.300773674804,
  3801.5597829031262,
  4007.750148478564,
  4074.3452185932656,
  4225.049450691918,
  4338.899164938664,
  4428.084138858935,
  4496.362136550291,
  4540.135725202859,
  4614.027297973717,
  4654.474275823789,
  4714.058094964864,
  4645.9886513064885,
  4628.211244787356,
  4665.099200570483,
  4648.5009310888045,
  4610.013405029357,
  4642.836735468625,
  4669.567677732765,
  4719.453932620881],
 'Top Four Features': [3135.5489073677436,
  2514.1812009849527,
  2788.551941742018,
  2917.4679936225316,
  3232.103629232672,
  3566.725419074407,
  3834.980480987282,
  3927.395248759061,
  4078.9765839753827,
  4199.8376270003955,
  4345.006990461182,
  4451.387011302762,
  4550.163468300828,
  4591.534016042883,
  4630.39964268281,
  4711.911798285828,
  4692.337273008159,
  4709.187223643583,
  4698.1962740829795,
  4738.548781458035,
  4727.351846481681,
  4719.336959934102,
  4707.956340126882,
  4753.4193738951,
  4822.351168583702],
 'Top Five Features': [2561.7319037195625,
  2567.2749455482176,
  2949.9007889192553,
  3074.609110629889,
  3346.6737097607775,
  3686.4646211770864,
  3907.195998257802,
  4104.033987317772,
  4335.71419742586,
  4463.6007084810435,
  4444.025988909045,
  4534.547516044051,
  4638.525701454197,
  4686.768062739389,
  4676.617231827435,
  4706.48899163734,
  4714.757468354599,
  4724.017926210877,
  4780.036456967258,
  4790.865401485259,
  4788.442914205118,
  4820.25603556537,
  4823.624611651547,
  4830.771512289382,
  4878.281251020225]}
In [85]:
from numpy import arange

labels = ['{} features'.format(x) for x in col_names]
plt.figure(figsize=(8, 4))

for k,v in rmse_results.items():
    x = np.arange(1, 26, 1)
    y = v
    
    plt.plot(x, y)
    plt.xlabel('K Value')
    plt.ylabel('RMSE')

font = {'family': 'serif', 'color':  'gray', 'weight': 'bold', 'size': 14}

# plt.legend(labels=labels, bbox_to_anchor=(1.05, 1), borderaxespad=0)
plt.legend(labels=labels, loc='lower right')
plt.tight_layout()
plt.title('RMSE Scores for Various K and Feature Combinations', fontdict= font)
plt.savefig('rmse.png', dpi=300)
plt.show()
In [20]:
# Getting Min. RMSE across different `k` values for each feature combination.
k_rmse = {}


for k,v in rmse_results.items():
    min_rmse = min(v)
    k_rmse[min_rmse] = [k, v.index(min_rmse)+1]
    
pd.Series(k_rmse).sort_index()
Out[20]:
2514.181201     [Top Four Features, 2]
2561.731904     [Top Five Features, 1]
2958.964740    [Top Three Features, 4]
dtype: object

From the last two cells, we can observe that choosing the best four features with a k value of 2 gives us the lowest RMSE of about 2514.

K-Fold Cross Validation

We can improve our model by splitting the data into more than two folds. Now we can use cross-validation with KFold and check how many splits help us predict price best.

In [21]:
from sklearn.model_selection import cross_val_score, KFold

num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]
rmse_scores = {}

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    knn = KNeighborsRegressor(n_neighbors=2)
    
    mses = cross_val_score(knn, normalised_cars[top_features[:4].index], normalised_cars["price"], 
                           scoring="neg_mean_squared_error", cv=kf)
    
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    
    rmse_scores[avg_rmse] = [str(fold) + ' folds']
    
pd.Series(rmse_scores).sort_index()
#     print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))
Out[21]:
2416.119787    [21 folds]
2460.825451     [9 folds]
2478.254376    [15 folds]
2489.796459    [23 folds]
2496.384436    [17 folds]
2551.093087    [19 folds]
2586.700720    [11 folds]
2601.038381    [10 folds]
2605.380154     [7 folds]
2650.294806     [5 folds]
2670.410517    [13 folds]
2721.148255     [3 folds]
dtype: object

Here, we can observe that the lowest average RMSE score of 2416.12 occurs with 21 folds, k = 2, and the top four features: horsepower, width, curb_weight, and city_mpg.
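As a closing sketch (illustrative only; best_features and final_knn are hypothetical names), the chosen configuration could be fit on the full normalised data like this:

# Sketch: fit a final KNN regressor with the settings found above
best_features = top_features[:4].index            # horsepower, width, curb_weight, city_mpg
final_knn = KNeighborsRegressor(n_neighbors=2)
final_knn.fit(normalised_cars[best_features], normalised_cars['price'])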

That is it for now; the goal of this project was to explore the fundamentals of k-nearest neighbors.