The goal of this project is to use K-Nearest Neighbors to predict the market price of cars based on a number of their attributes. K-Nearest Neighbors computes the Euclidean distance to find the most similar known examples and averages their values in order to predict unknown values.
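To make that concrete, here is a minimal sketch of the idea on made-up numbers (the `knn_predict` helper and the toy data are hypothetical, not part of this project's code): compute the Euclidean distance from a query point to every known car, keep the k closest, and average their known prices.

```python
import numpy as np

# Hypothetical helper: predict a target by averaging the targets of the
# k rows whose features are closest in Euclidean distance to the query.
def knn_predict(X, y, query, k=2):
    distances = np.sqrt(((X - query) ** 2).sum(axis=1))  # Euclidean distance to each row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    return y[nearest].mean()                             # average their known targets

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0]])  # made-up features (two per car)
y = np.array([10000.0, 12000.0, 30000.0])           # made-up known prices
print(knn_predict(X, y, np.array([1.2, 2.0]), k=2))  # averages the two nearby cars -> 11000.0
```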
The data set we will be working with contains information on various cars. For each car we have information about the technical aspects of the vehicle, such as the motor's displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more. You can read more about the data set [here](https://archive.ics.uci.edu/ml/datasets/automobile).
Throughout the project I will be looking to achieve the lowest root mean squared error (RMSE), which indicates the smallest gap between the actual price and the predicted price.
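As a reminder of what RMSE measures, here is a quick sketch with made-up prices (not taken from the dataset): square each prediction error, average them, and take the square root.

```python
import numpy as np

# RMSE = sqrt(mean((actual - predicted)^2)); smaller means better predictions
actual = np.array([13495.0, 16500.0, 13950.0])     # made-up actual prices
predicted = np.array([13000.0, 17000.0, 14500.0])  # made-up predictions
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 2))  # roughly 515.6
```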
To start with we need to clean the data. Only numeric values will be useful for this project.
import pandas as pd
import numpy as np
pd.options.display.max_columns = None
# Assign column names
col_names = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels"
,"engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size"
,"fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
file = pd.read_csv("imports-85.data", names = col_names)
file.head(5)
 | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | height | curb-weight | engine-type | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
# Convert word values to numbers and replace missing '?' markers with NaN
file["num-of-doors"] = file["num-of-doors"].replace({"four": 4, "two": 2, "?": np.nan})
file["num-of-cylinders"] = file["num-of-cylinders"].replace(
    {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "eight": 8, "twelve": 12})
file = file.replace("?", np.nan)
# Remove non numeric columns
numeric_cols = list(file.columns)
cols_to_drop = ["make", "fuel-type", "aspiration", "body-style", "drive-wheels", "engine-location", "engine-type", "fuel-system"]
for cols in cols_to_drop:
    if cols in numeric_cols:
        numeric_cols.remove(cols)
We also need to normalize the columns so that attributes with larger scales do not dominate the distance calculation. Below we rescale every column to lie between 0 and 1 while preserving the relative spacing of values. The only column left unchanged is price, as this is the target column for our analysis.
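As a tiny sketch of that min-max rescaling on made-up values (not from the dataset):

```python
import pandas as pd

# Min-max scaling maps each value to (x - min) / (max - min),
# so the smallest value becomes 0 and the largest becomes 1.
s = pd.Series([2000, 2500, 4000])
scaled = (s - s.min()) / (s.max() - s.min())
print(scaled.tolist())  # [0.0, 0.25, 1.0]
```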
normalized_cars = file[numeric_cols].copy()
# Drop rows with missing values
normalized_cars.dropna(inplace=True)
# convert columns to numeric values
normalized_cars["price"] = normalized_cars["price"].astype(int)
normalized_cars["peak-rpm"] = normalized_cars["peak-rpm"].astype(int)
normalized_cars["horsepower"] = normalized_cars["horsepower"].astype(int)
normalized_cars["stroke"] = normalized_cars["stroke"].astype(float)
normalized_cars["bore"] = normalized_cars["bore"].astype(float)
normalized_cars["normalized-losses"] = normalized_cars["normalized-losses"].astype(int)
# Normalize columns between 0 and 1 except for target column price
saved_price = normalized_cars["price"]
normalized_cars = (normalized_cars - normalized_cars.min()) / (normalized_cars.max() - normalized_cars.min())
normalized_cars["price"] = saved_price
normalized_cars.head(5)
 | symboling | normalized-losses | num-of-doors | wheel-base | length | width | height | curb-weight | num-of-cylinders | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 0.8 | 0.518325 | 1.0 | 0.455172 | 0.577236 | 0.517544 | 0.471154 | 0.329325 | 0.2 | 0.243655 | 0.464286 | 0.633333 | 0.18750 | 0.355263 | 0.551020 | 0.264706 | 0.333333 | 13950 |
4 | 0.8 | 0.518325 | 1.0 | 0.441379 | 0.577236 | 0.535088 | 0.471154 | 0.518231 | 0.4 | 0.380711 | 0.464286 | 0.633333 | 0.06250 | 0.440789 | 0.551020 | 0.088235 | 0.111111 | 17450 |
6 | 0.6 | 0.486911 | 1.0 | 0.662069 | 0.839024 | 0.973684 | 0.605769 | 0.525989 | 0.4 | 0.380711 | 0.464286 | 0.633333 | 0.09375 | 0.407895 | 0.551020 | 0.117647 | 0.194444 | 17710 |
8 | 0.6 | 0.486911 | 1.0 | 0.662069 | 0.839024 | 0.973684 | 0.625000 | 0.619860 | 0.4 | 0.355330 | 0.421429 | 0.633333 | 0.08125 | 0.605263 | 0.551020 | 0.058824 | 0.055556 | 23875 |
10 | 0.8 | 0.664921 | 0.0 | 0.503448 | 0.580488 | 0.394737 | 0.471154 | 0.351823 | 0.2 | 0.238579 | 0.685714 | 0.347619 | 0.11250 | 0.348684 | 0.673469 | 0.235294 | 0.305556 | 16430 |
Now that the dataset is ready, let's begin. We will start with a univariate model, which uses only a single attribute to make predictions. For example, can we predict a car's price from the average price of the 5 cars with the most similar engine size?
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
def knn_train_test(train_col, train_target, df):
    np.random.seed(1)
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Assign train (75%) and test (25%) datasets
    train_df = rand_df[:int(len(df) * 0.75)]
    test_df = rand_df[int(len(df) * 0.75):]
    # Fit the model and calculate the RMSE
    knn = KNeighborsRegressor()
    knn.fit(train_df[[train_col]], train_df[train_target])
    prediction = knn.predict(test_df[[train_col]])
    mse = mean_squared_error(test_df[train_target], prediction)
    return mse ** (1/2)
rmses = {}
# list of columns
columns = list(normalized_cars.columns)
columns.remove('price')
# Use function on each column
for col in columns:
    rmse = knn_train_test(col, "price", normalized_cars)
    rmses[col] = rmse
rmses
{'symboling': 4728.749369759409, 'normalized-losses': 4417.589877750084, 'num-of-doors': 4695.416567249385, 'wheel-base': 2681.3032586785102, 'length': 2222.3324409277748, 'width': 2718.763254312519, 'height': 4193.069881244528, 'curb-weight': 2135.015488702599, 'num-of-cylinders': 4390.61656194207, 'engine-size': 2624.5118045076497, 'bore': 3903.751335574539, 'stroke': 4218.941348845703, 'compression-ratio': 6224.4305330688685, 'horsepower': 2211.671943801792, 'peak-rpm': 5212.425816066834, 'city-mpg': 2774.1447757822593, 'highway-mpg': 2249.4344375864794}
# Visualize RMSEs
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.bar(list(rmses.keys()), rmses.values())
plt.xticks(rotation=90, size=15)
plt.ylabel("RMSE", size=15)
plt.xlabel("Car attributes", size=15)
plt.title("RMSE for each car attribute", size=20)
A few attributes predict the price more accurately than the rest. The 5 attributes with the lowest error in predicting the price are curb-weight, horsepower, length, highway-mpg and engine-size. For each of these we looked at the 5 nearest neighbors. Let's see whether we can get any more accurate by trying the 1-10 closest neighbors.
def knn_train_test(train_col, train_target, df, k_num):
    np.random.seed(1)
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Assign train (75%) and test (25%) datasets
    train_df = rand_df[:int(len(df) * 0.75)]
    test_df = rand_df[int(len(df) * 0.75):]
    # Fit the model with k neighbors and calculate the RMSE
    knn = KNeighborsRegressor(n_neighbors=k_num)
    knn.fit(train_df[[train_col]], train_df[train_target])
    prediction = knn.predict(test_df[[train_col]])
    mse = mean_squared_error(test_df[train_target], prediction)
    return mse ** (1/2)
# Test function on k neighbors 1-10
k_nums = list(range(1,11))
# Top 5 attributes
curb_weight = {}
for k in k_nums:
    curb_weight[k] = knn_train_test("curb-weight", "price", normalized_cars, k)
horsepower = {}
for k in k_nums:
    horsepower[k] = knn_train_test("horsepower", "price", normalized_cars, k)
length = {}
for k in k_nums:
    length[k] = knn_train_test("length", "price", normalized_cars, k)
highway_mpg = {}
for k in k_nums:
    highway_mpg[k] = knn_train_test("highway-mpg", "price", normalized_cars, k)
engine_size = {}
for k in k_nums:
    engine_size[k] = knn_train_test("engine-size", "price", normalized_cars, k)
# Visualize k values 1-10
plt.figure(figsize=(12, 6))
plt.plot(list(engine_size.keys()), list(engine_size.values()), label='Engine Size')
plt.plot(list(highway_mpg.keys()), list(highway_mpg.values()), label='Highway-mpg')
plt.plot(list(length.keys()), list(length.values()), label='Length')
plt.plot(list(horsepower.keys()), list(horsepower.values()), label='Horsepower')
plt.plot(list(curb_weight.keys()), list(curb_weight.values()), label='Curb-weight')
plt.ylim(1500,5000)
plt.ylabel("RMSE ($)", size=15)
plt.xlabel("K Number", size=15)
plt.xticks(range(1,11))
plt.title("RMSE for K Numbers 1-10", size=20)
plt.legend()
Engine size is a bit of an outlier here. For the other four attributes, a k value of 7-8 gives the most accurate predictions, whereas for engine size the most accurate k value is 4.
Let's now increase the number of attributes we use to make our predictions. For example, do we get a smaller margin of error if we look at the cars with the most similar engine size, horsepower and highway-mpg, rather than just one of these attributes?
def knn_train_test(train_cols, train_target, df):
    np.random.seed(1)
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Assign train (75%) and test (25%) datasets
    train_df = rand_df[:int(len(df) * 0.75)]
    test_df = rand_df[int(len(df) * 0.75):]
    # Fit the model on multiple columns and calculate the RMSE
    knn = KNeighborsRegressor()
    knn.fit(train_df[train_cols], train_df[train_target])
    prediction = knn.predict(test_df[train_cols])
    mse = mean_squared_error(test_df[train_target], prediction)
    return mse ** (1/2)
# 2,3,4,5 best attributes
two_best = ["curb-weight", "horsepower"]
three_best = ["curb-weight", "horsepower", "length"]
four_best = ["curb-weight", "horsepower", "length", "highway-mpg"]
five_best = ["curb-weight", "horsepower", "length", "highway-mpg", "engine-size"]
# Run the function on the sets of the 2, 3, 4 and 5 best features
rmse_two_best = knn_train_test(two_best, 'price', normalized_cars)
rmse_three_best = knn_train_test(three_best, 'price', normalized_cars)
rmse_four_best = knn_train_test(four_best, 'price', normalized_cars)
rmse_five_best = knn_train_test(five_best, 'price', normalized_cars)
print("RMSE with 2 best features: {}".format(rmse_two_best))
print("RMSE with 3 best features: {}".format(rmse_three_best))
print("RMSE with 4 best features: {}".format(rmse_four_best))
print("RMSE with 5 best features: {}".format(rmse_five_best))
RMSE with 2 best features: 2114.83344237791
RMSE with 3 best features: 1987.7541040078372
RMSE with 4 best features: 1970.0850007042845
RMSE with 5 best features: 2088.677455712107
These numbers are coming down. Let's continue the search for the lowest RMSE with hyperparameter optimization, which looks for the optimal hyperparameter value (i.e. the best number of k neighbors). We are going to loop through 1-25 k neighbors for each of the three best multivariate attribute lists found above.
def knn_train_test(train_cols, train_target, df, k_num):
    np.random.seed(1)
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Assign train (75%) and test (25%) datasets
    train_df = rand_df[:int(len(df) * 0.75)]
    test_df = rand_df[int(len(df) * 0.75):]
    # Fit the model on multiple columns with k neighbors and calculate the RMSE
    knn = KNeighborsRegressor(n_neighbors=k_num)
    knn.fit(train_df[train_cols], train_df[train_target])
    prediction = knn.predict(test_df[train_cols])
    mse = mean_squared_error(test_df[train_target], prediction)
    return mse ** (1/2)
k_nums = list(range(1, 26))
with_top_five = {}
with_top_three = {}
with_top_four = {}
# Run the function on the top 3, 4 and 5 feature sets
for k in k_nums:
    with_top_five[k] = knn_train_test(five_best, 'price', normalized_cars, k)
for k in k_nums:
    with_top_three[k] = knn_train_test(three_best, 'price', normalized_cars, k)
for k in k_nums:
    with_top_four[k] = knn_train_test(four_best, 'price', normalized_cars, k)
print(with_top_five)
print("\n------------------------------------------------------------------------------------")
print(with_top_three)
print("\n------------------------------------------------------------------------------------")
print(with_top_four)
{1: 1908.7894986090007, 2: 1969.4308822601517, 3: 1689.1839145641372, 4: 1699.0605547625428, 5: 2088.677455712107, 6: 2216.58653050636, 7: 2196.2768958082734, 8: 2237.6074520288407, 9: 2291.708656180976, 10: 2276.801462139376, 11: 2216.5546668457564, 12: 2304.1030901039103, 13: 2377.847499850312, 14: 2417.2069982096855, 15: 2470.4970267809135, 16: 2459.711975643693, 17: 2530.581472647321, 18: 2498.490186808875, 19: 2451.4004451862365, 20: 2386.7422938233194, 21: 2366.631875259268, 22: 2364.885785390959, 23: 2407.2605812715133, 24: 2397.9911035271766, 25: 2371.8352317056097}

------------------------------------------------------------------------------------

{1: 1862.019622614112, 2: 2322.114474191572, 3: 2145.8102879021403, 4: 1870.5371383475656, 5: 1987.7541040078372, 6: 2158.100359691365, 7: 2336.875224699925, 8: 2407.477395078197, 9: 2530.6221487761745, 10: 2645.6187748804628, 11: 2654.205292767879, 12: 2605.759850159141, 13: 2617.5227623160417, 14: 2572.4166082951365, 15: 2565.5870205254955, 16: 2535.0711033852185, 17: 2495.0474085371106, 18: 2508.04797495501, 19: 2545.3008172875684, 20: 2546.2789382008013, 21: 2501.2579576808303, 22: 2508.576377445437, 23: 2482.7204069323047, 24: 2455.0596943144924, 25: 2446.7601700452788}

------------------------------------------------------------------------------------

{1: 1948.5735872683895, 2: 1694.3284049292215, 3: 1686.665338438212, 4: 1860.036953177947, 5: 1970.0850007042845, 6: 2070.5084265572177, 7: 2199.5210471224627, 8: 2249.4048259692495, 9: 2318.8411322479465, 10: 2410.4803709530597, 11: 2444.998685628946, 12: 2388.432660181614, 13: 2526.4479851945284, 14: 2626.2851186347307, 15: 2570.8897807525273, 16: 2502.68424556671, 17: 2525.2735766179285, 18: 2509.3489598836463, 19: 2451.3118809331213, 20: 2462.4268248391663, 21: 2425.8419942264136, 22: 2418.3242936283154, 23: 2415.7340875416676, 24: 2393.064093354901, 25: 2353.697030129409}
# Visualize Hyperparameter Optimization
plt.figure(figsize=(12,6))
plt.plot(list(with_top_five.keys()), list(with_top_five.values()), label = "Five Features")
plt.plot(list(with_top_three.keys()), list(with_top_three.values()), label = "Three Features")
plt.plot(list(with_top_four.keys()), list(with_top_four.values()), label = "Four Features")
plt.legend()
plt.xticks(range(1, 26))
plt.xlim(0.5,25.5)
plt.ylabel("RMSE", size=15)
plt.xlabel("Number of K Neighbors", size=15)
plt.title("Hyperparameter Optimization", size=20)
The lowest RMSEs are found at the lower k values (from 2 to 4). The single lowest score, 1686.67, came from using the attributes "curb-weight", "horsepower", "length" and "highway-mpg" together with a k value of 3 (the five-attribute set was close behind at 1689.18, also with k = 3). Throughout this project we have used the 75-25 train/test method, in which 75% of the dataset is used to train the model and the remaining 25% to test it. Let's finish with a different technique, k-fold cross-validation, which involves splitting the dataset into k equal-sized folds, training on all but one fold, testing on the held-out fold, and rotating until every fold has served as the test set; the resulting errors are then averaged.
Below we are going to loop through 2-25 k folds to see which number of folds gives the lowest RMSE.
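To see concretely how KFold partitions the rows, here is a toy example on six row indices (illustrative only, separate from the project's pipeline):

```python
import numpy as np
from sklearn.model_selection import KFold

# Each fold serves once as the test set; the remaining folds form the training set.
kf = KFold(n_splits=3, shuffle=False)
for train_idx, test_idx in kf.split(np.arange(6)):
    print("train:", train_idx, "test:", test_idx)
# Every row index lands in exactly one test fold.
```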
from sklearn.model_selection import cross_val_score, KFold
# K fold function
def knn_train_test_kfold(train_cols, train_target, df, folds):
    rmses = []
    kf = KFold(folds, shuffle=True, random_state=1)
    model = KNeighborsRegressor(n_neighbors=3)
    # cross_val_score returns negative MSEs, so take the absolute value
    mses = cross_val_score(model, df[train_cols], df[train_target],
                           scoring="neg_mean_squared_error", cv=kf)
    for v in mses:
        rmses.append(abs(v) ** (1/2))
    return rmses, np.mean(rmses), np.std(rmses)
k_folds = list(range(2,26))
k_folds_means = {}
k_folds_std = {}
# Loop through the number of k folds
for n in k_folds:
    s = knn_train_test_kfold(five_best, "price", normalized_cars, n)
    k_folds_means[n] = s[1]
    k_folds_std[n] = s[2]
print(k_folds_means)
print("\n------------------------------------------------------------------")
print(k_folds_std)
{2: 2927.3460555374886, 3: 2618.0553343482184, 4: 2295.8348274216028, 5: 2296.8260632154092, 6: 2273.234462030219, 7: 2171.456088833527, 8: 2267.954964088096, 9: 2242.140196500448, 10: 2093.263396540541, 11: 2135.7549883282413, 12: 2189.9510900938294, 13: 2102.414340454889, 14: 2072.8020339088766, 15: 2120.654092397812, 16: 2013.2140599786032, 17: 1995.7138397821452, 18: 2066.769798812635, 19: 2054.7957734427146, 20: 1988.3368466236711, 21: 1978.008211081012, 22: 1998.0681329217625, 23: 2001.0309155202178, 24: 1963.5786094089592, 25: 1958.4501941005008}

------------------------------------------------------------------

{2: 745.3664165705284, 3: 920.0574269069089, 4: 776.4880430314498, 5: 930.6223722413541, 6: 933.8154774994462, 7: 856.6044312913886, 8: 934.8659556664095, 9: 862.7296727202488, 10: 1099.8567127719336, 11: 995.2862605124948, 12: 987.744682988567, 13: 1129.6650747844253, 14: 1101.5545453651685, 15: 1114.2675399403688, 16: 1258.9184665180949, 17: 1265.8878755038347, 18: 1179.319161838164, 19: 1178.6384549055665, 20: 1270.2883183902757, 21: 1282.5755421191684, 22: 1367.9638554287683, 23: 1256.0214344967953, 24: 1328.09119919684, 25: 1303.4944182840031}
# Visualize K folds
plt.figure(figsize=(12,6))
plt.plot(list(k_folds_means.keys()), list(k_folds_means.values()), label='Mean')
plt.plot(list(k_folds_std.keys()), list(k_folds_std.values()), label='STD')
plt.legend()
plt.xticks(range(1, 26))
plt.xlim(0.5,25.5)
plt.ylabel("RMSE", size=15)
plt.xlabel("Number of K Folds", size=15)
plt.title("K Folds", size=20)
In this graph we have the mean RMSE along with the standard deviation. A low mean RMSE is good, but not if it comes with high variance. Likewise, low variance is desirable, but only if it comes with a low mean; there is usually a trade-off between the two. From the graph above I would recommend 7 k folds, as the variance is still low and the mean has begun to drop towards its lowest point.
Throughout this project the closest we got to predicting the price was by using the attributes "curb-weight", "horsepower", "length" and "highway-mpg" together with a k value of 3. This predicted the market price of each car with an RMSE of 1686.67.