#!/usr/bin/env python
# coding: utf-8

# # Car Price Prediction Using K Nearest Neighbors

# ## Introduction

# In this project we are going to work with a dataset from the UCI Machine Learning Repository to make predictions on car prices.
# The dataset contains the following columns:
#
# 1. `symboling`: -3, -2, -1, 0, 1, 2, 3.
# 2. `normalized-losses`: continuous from 65 to 256.
# 3. `make`: car brands such as alfa-romero, audi, bmw, chevrolet, dodge, honda etc.
# 4. `fuel-type`: diesel, gas.
# 5. `aspiration`: std, turbo.
# 6. `num-of-doors`: four, two.
# 7. `body-style`: hardtop, wagon, sedan, hatchback, convertible.
# 8. `drive-wheels`: 4wd, fwd, rwd.
# 9. `engine-location`: front, rear.
# 10. `wheel-base`: continuous from 86.6 to 120.9.
# 11. `length`: continuous from 141.1 to 208.1.
# 12. `width`: continuous from 60.3 to 72.3.
# 13. `height`: continuous from 47.8 to 59.8.
# 14. `curb-weight`: continuous from 1488 to 4066.
# 15. `engine-type`: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
# 16. `num-of-cylinders`: eight, five, four, six, three, twelve, two.
# 17. `engine-size`: continuous from 61 to 326.
# 18. `fuel-system`: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
# 19. `bore`: continuous from 2.54 to 3.94.
# 20. `stroke`: continuous from 2.07 to 4.17.
# 21. `compression-ratio`: continuous from 7 to 23.
# 22. `horsepower`: continuous from 48 to 288.
# 23. `peak-rpm`: continuous from 4150 to 6600.
# 24. `city-mpg`: continuous from 13 to 49.
# 25. `highway-mpg`: continuous from 16 to 54.
# 26. `price`: continuous from 5118 to 45400.
#
# Our goal is to demonstrate a proper machine learning workflow by computing the RMSE (root mean squared error) of our predictions using different individual features, multiple features and different hyperparameter values. The model we will be working with is scikit-learn's `KNeighborsRegressor` and the error metric is `mean_squared_error` from `sklearn.metrics`. We will also use scikit-learn's `cross_val_score` function to cross-validate our model and the `KFold` class to split and randomize our dataset.

# ## Data Exploration

# In[1]:


import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold
import matplotlib.pyplot as plt
import matplotlib.style as style

style.use('seaborn')
get_ipython().run_line_magic('matplotlib', 'inline')


# In[2]:


cars = pd.read_csv('imports-85.data')
pd.set_option('display.max_columns', 50)
cars.head()


# The data we read in did not have the expected column names, so we assign the actual column names ourselves.

# In[3]:


cars.columns = ['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
                'num_of_doors', 'body_style', 'drive_wheels', 'engine_location',
                'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
                'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
                'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg',
                'price']


# In[4]:


cars.info()


# ## Data Cleaning

# Although `info()` reports no null values, some columns contain the value `'?'`, which is in fact a missing-value marker. We are going to replace this value with the `numpy.nan` float value.
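# As a side note (not one of the original cleaning steps): the raw `imports-85.data` file ships
# without a header row and uses `'?'` as its missing-value marker, so the same result could be
# obtained in one step by passing `header=None`, `names=...` and `na_values='?'` to
# `pandas.read_csv`. A minimal, illustrative sketch, assuming the file path used above;
# the `cars_alt` variable is hypothetical and not used again in this notebook.

cars_alt = pd.read_csv('imports-85.data', header=None,
                       names=cars.columns,   # reuse the column names assigned earlier
                       na_values='?')        # parse the '?' markers as NaN on load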
# In[5]:


cars = cars.replace('?', np.nan)
cars


# In[6]:


cars.isnull().sum()


# After replacing the `'?'` markers with `numpy.nan` we can see that several columns contain null values. Since we want to predict the `price` column, we drop all rows with a null value in that column, which make up less than 2% of the column. For the other columns we will be working with, we first cast them to a float type and then replace the `numpy.nan` values with the column means.

# In[7]:


cars.dropna(subset=['price'], inplace=True)


# In[8]:


continuous_variable_columns = ['normalized_losses', 'wheel_base', 'length', 'width', 'height',
                               'curb_weight', 'engine_size', 'bore', 'stroke',
                               'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
                               'highway_mpg', 'price']
num_cars = cars[continuous_variable_columns]
num_cars.head()


# In[9]:


num_cars = num_cars.astype(float)
num_cars.info()


# In[10]:


num_cars.fillna(num_cars.mean(), inplace=True)  # replace all remaining null values with the column mean
num_cars.isnull().sum()


# In[11]:


price = num_cars['price']
num_cars = (num_cars - num_cars.min()) / (num_cars.max() - num_cars.min())  # min-max normalization: values fall between 0 and 1
num_cars['price'] = price
num_cars


# We normalized the values in every column by subtracting the column minimum from each value and dividing by the column's range, then restored the `price` column to its original scale. The reason for doing this is to ensure that features with extremely large values do not dominate the distance calculation and hurt the performance of our model.
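# For reference, the same rescaling can be expressed with scikit-learn's `MinMaxScaler`; the
# manual formula above is what this notebook actually uses, so the sketch below is illustrative
# only and the `scaled_features` variable is not reused.

from sklearn.preprocessing import MinMaxScaler

feature_cols = num_cars.columns.drop('price')   # leave the target on its original scale
scaler = MinMaxScaler()                         # rescales each column to the 0 - 1 range
scaled_features = pd.DataFrame(scaler.fit_transform(num_cars[feature_cols]),
                               columns=feature_cols, index=num_cars.index)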
# ## Univariate Model Testing

# We are going to test the model's performance using just one feature at a time. We first create a function that takes in a list of the column(s) we want to train on, the name of the column we want to predict and a DataFrame.

# In[12]:


def knn_train_test(cols, col2, df):
    '''
    Predict a variable and calculate the RMSE value.

    This function takes in a DataFrame, randomizes it and then uses the
    `sklearn.neighbors.KNeighborsRegressor` class to predict a variable,
    and uses `sklearn.metrics.mean_squared_error` to calculate the RMSE
    value by taking the square root of the mean squared error.

    Parameters
    ----------
    cols : list
        list of columns in DataFrame to train on.
    col2 : str
        name of target column in DataFrame.
    df : DataFrame

    Returns
    -------
    rmse : float
        error metric used for evaluation of the prediction.
    predictions : numpy.ndarray
        numpy array with the predicted results for the target column.
    '''
    np.random.seed(1)  # random seed set to one so the shuffling can be recreated
    shuffle_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffle_index)

    split_index = int(df.shape[0] / 2)  # splits the DataFrame index in 2
    train_df = rand_df.iloc[:split_index]
    test_df = rand_df.iloc[split_index:]

    model = KNeighborsRegressor()
    model.fit(train_df[cols], train_df[col2])
    predictions = model.predict(test_df[cols])

    mse = mean_squared_error(test_df[col2], predictions)
    rmse = np.sqrt(mse)
    return rmse, predictions


# In[13]:


all_features = num_cars.columns.drop('price')  # drop the price column as it is our target column

rmse_dict = dict()
for col in all_features:
    rmse, predictions = knn_train_test([col], 'price', num_cars)
    rmse_dict[col] = rmse
rmse_dict


# In[14]:


uni_rmse = pd.Series(rmse_dict)  # convert rmse_dict to a pandas Series
uni_rmse.sort_values()           # sort the values in the Series in ascending order


# In[15]:


uni_rmse.sort_values().plot.barh(figsize=(8, 6))  # horizontal bar plot of the uni_rmse Series
plt.title('RMSE Value Per Feature')
plt.show()


# After computing the RMSE value for each of the individual features, we can see that the top 5 features were:
# 1. engine_size
# 2. curb_weight
# 3. highway_mpg
# 4. width
# 5. city_mpg

# ## Hyperparameter Optimisation

# We are going to vary the k value of our model between 1 and 9 and see how it performs at each k value. We will then compute the average RMSE value of each feature across the different k values.

# In[16]:


def knn_train_test(cols, col2, df, k=5):
    '''
    Predict a variable and calculate the RMSE value.

    This function takes in a list of columns to train on, the target column
    to predict and a DataFrame, randomizes the index and then uses the
    `sklearn.neighbors.KNeighborsRegressor` class to predict a variable,
    and uses `sklearn.metrics.mean_squared_error` to calculate the RMSE
    value by taking the square root of the mean squared error.

    Parameters
    ----------
    cols : list
        list of columns in DataFrame to train on.
    col2 : str
        name of target column in DataFrame.
    df : DataFrame
    k : int, default 5
        number of neighbors (the `n_neighbors` hyperparameter).

    Returns
    -------
    rmse : float
        square root of the mean squared error.
    predictions : numpy.ndarray
        numpy array with the predicted values for the target column.
    '''
    np.random.seed(1)  # random seed set to one so the shuffling can be recreated
    shuffle_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffle_index)

    split_index = int(df.shape[0] / 2)  # split the DataFrame index in 2
    train_df = rand_df.iloc[:split_index]
    test_df = rand_df.iloc[split_index:]

    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(train_df[cols], train_df[col2])
    predictions = model.predict(test_df[cols])

    mse = mean_squared_error(test_df[col2], predictions)
    rmse = np.sqrt(mse)
    return rmse, predictions


# In[17]:


n_neighbors = [1, 3, 5, 7, 9]

rmse_values = dict()
for col in all_features:
    values_dict = {}
    for k in n_neighbors:
        rmse, predictions = knn_train_test([col], 'price', num_cars, k)
        values_dict[k] = rmse
    rmse_values[col] = values_dict
rmse_values


# In[18]:


for k, v in rmse_values.items():
    x = list(v.keys())
    y = list(v.values())
    # scatter plot of the RMSE values of the different features for each n_neighbors value
    plt.scatter(x, y)
    plt.xlabel('n_neighbors value')
    plt.ylabel('RMSE')


# There is no clear pattern to how the model behaves as we vary k. For most features the RMSE decreased as k went from 1 to 3 (the first two values tested), while for some features it increased over the same step.
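# As an aside, this kind of k sweep can also be automated with scikit-learn's `GridSearchCV`.
# The sketch below is illustrative only: it scores each k with cross-validation rather than the
# single train/test split used above, and none of its variables are reused later.

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={'n_neighbors': [1, 3, 5, 7, 9]},  # same k values as the manual loop
                    scoring='neg_mean_squared_error',
                    cv=5)
grid.fit(num_cars[['engine_size']], num_cars['price'])  # best-performing univariate feature
best_k = grid.best_params_['n_neighbors']               # k with the lowest cross-validated MSE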
# In[19]:


feature_avg_rmse = {}
for k, v in rmse_values.items():
    avg_rmse = np.mean(list(v.values()))  # compute the mean of the RMSE values for this feature
    feature_avg_rmse[k] = avg_rmse

series_avg_rmse = pd.Series(feature_avg_rmse)            # convert feature_avg_rmse to a pandas Series
sorted_series_avg_rmse = series_avg_rmse.sort_values()   # sort the values in the Series in ascending order
sorted_series_avg_rmse


# In[20]:


sorted_series_avg_rmse.plot.barh(figsize=(8, 6))  # horizontal bar plot of sorted_series_avg_rmse
plt.title('Average RMSE Value Per Feature')
plt.show()


# After averaging the RMSE of each feature across the different k values, there is not much change to the top 5 performing features, just a slight difference in how they are ordered. The top 5 features are:
# 1. engine_size
# 2. curb_weight
# 3. city_mpg
# 4. highway_mpg
# 5. width

# ## Multivariate Model Testing

# In our previous tests we used just one feature at a time. Here we are going to use multiple features, drawn from the 5 best performing features in the last test. We are going to see how the model performs when we train it on the 2 best features, the 3 best features, and so on up to the 5 best features.

# In[21]:


best_features = list(sorted_series_avg_rmse.index)

best_rmse = {}
for i in range(2, 6):
    rmse, predictions = knn_train_test(best_features[:i], 'price', num_cars)
    best_rmse[f'best {i} features'] = rmse
best_rmse


# The model performed best when we used the best 2 features, the best 3 features and the best 5 features. In fact we got the lowest RMSE score when we used the best 5 features.

# ## Multivariate Hyperparameter Optimisation

# We are going to vary the k value between 1 and 25 for the best 2, best 3 and best 5 feature sets.

# In[22]:


top3_features = [2, 3, 5]

top3_rmse = {}
for i in top3_features:
    top_rmse = {}
    for k in range(1, 26):
        rmse, predictions = knn_train_test(best_features[:i], 'price', num_cars, k)
        top_rmse[k] = rmse
    top3_rmse[f'best {i} features'] = top_rmse
top3_rmse


# In[23]:


for k, v in top3_rmse.items():
    # scatter plot of the RMSE values against the n_neighbors values for the top 3 performing feature sets
    x = list(v.keys())
    y = list(v.values())
    plt.scatter(x, y, label=f'{k}')
    plt.xlabel('n_neighbors value')
    plt.ylabel('RMSE')
plt.legend()


# The RMSE values dropped as the n_neighbors value varied from 1 to 2, then fluctuated a little before climbing steadily as n_neighbors increased further.

# ## Cross Validation

# Cross validation is important because it gives a more reliable estimate of how the model generalizes and helps us detect overfitting. We have chosen to work with a maximum of 6 folds so that the sample in each fold remains representative, as there are not many rows in our dataset.
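# To make the fold sizes concrete: `KFold` simply partitions the shuffled row indices into n
# roughly equal groups, and each group takes one turn as the test set. A small illustrative
# sketch (the `kf_demo` variable is not used in the function below):

kf_demo = KFold(n_splits=4, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(num_cars), start=1):
    # each fold holds out roughly len(num_cars) / 4 rows for testing
    print(f'fold {fold}: {len(train_idx)} train rows, {len(test_idx)} test rows')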
# In[24]:


def crossval_train_test(cols, col2, df, n=2):
    '''
    Calculate the average and standard deviation of the RMSE values for n folds.

    This function takes in a list of columns to train on, the target column
    to predict and a DataFrame. The DataFrame is split and randomized using the
    `sklearn.model_selection.KFold` class, while the average and standard
    deviation of the RMSE values are calculated by taking the mean and standard
    deviation of the square roots of the mean squared error values returned by
    the `sklearn.model_selection.cross_val_score` function.

    Parameters
    ----------
    cols : list
        list of columns in DataFrame to train on.
    col2 : str
        name of target column in DataFrame.
    df : DataFrame
    n : int, default 2
        number of splits (folds).

    Returns
    -------
    avg_rmse : float
        average RMSE value.
    std_rmse : float
        standard deviation of the RMSE values.
    '''
    kf = KFold(n, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, df[cols], df[col2],
                           scoring='neg_mean_squared_error', cv=kf)
    rmses = np.sqrt(np.abs(mses))  # the scores are negative MSE values, so take the absolute value first
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    return avg_rmse, std_rmse


# In[25]:


five_features = {}
for i in range(2, 7):
    avg_rmse, std_rmse = crossval_train_test(best_features[:5], 'price', num_cars, i)
    five_features[f'{i} folds'] = (avg_rmse, std_rmse)
five_features


# In[26]:


averages = []
stds = []
for k, v in five_features.items():
    avg = v[0]
    std = v[1]
    averages.append(avg)
    stds.append(std)

plt.figure(figsize=(8, 6))
x = np.arange(5)
width = 0.4
plt.xticks(x, ['2 folds', '3 folds', '4 folds', '5 folds', '6 folds'])
bar1 = plt.bar(x - width/2, averages, label='avg rmse', width=width)
bar2 = plt.bar(x + width/2, stds, label='std rmse', width=width)
plt.xlabel('number of folds')
plt.legend()
plt.title('Average RMSE and STD RMSE For Different Folds')
plt.show()


# Both the average RMSE and the standard deviation of the RMSE values decreased as we varied the number of folds from 2 to 4. From 4 folds up to 6 folds, the average RMSE kept decreasing but the standard deviation of the RMSE values increased. Ideally we want both a low bias (average RMSE) and a low variance (standard deviation of the RMSE), but there is usually a trade-off between the two.

# ## Conclusion

# While our dataset was not large enough to make concrete predictions, this was a valuable exercise in learning a machine learning workflow. One of the things we learned is that we can improve the performance of the model by increasing the number of features we train on or by tuning the hyperparameters. It is also important to note that training the model on more features or increasing the hyperparameter value (`n_neighbors`) does not necessarily result in better performance.