Project 1

Used Vehicle Price Prediction

Introduction

  • 1.2 million listings scraped from TrueCar.com (the "Price, Mileage, Make, Model" dataset from Kaggle)
  • Each observation represents the price of a used car
In [1]:
%matplotlib inline
import pandas as pd
In [2]:
data = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/dataTrain_carListings.zip')
In [3]:
data.head()
Out[3]:
Price Year Mileage State Make Model
0 21490 2014 31909 MD Nissan MuranoAWD
1 21250 2016 25741 KY Chevrolet CamaroCoupe
2 20925 2016 24633 SC Hyundai Santa
3 14500 2012 84026 OK Jeep Grand
4 32488 2013 22816 TN Jeep Wrangler
In [4]:
data.shape
Out[4]:
(500000, 6)
In [5]:
data.Price.describe()
Out[5]:
count    500000.000000
mean      21144.186304
std       10753.259704
min        5001.000000
25%       13499.000000
50%       18450.000000
75%       26998.000000
max       79999.000000
Name: Price, dtype: float64
In [6]:
data.plot(kind='scatter', y='Price', x='Year')
Out[6]:
[Scatter plot: Price (y-axis) vs. Year (x-axis)]
In [7]:
data.plot(kind='scatter', y='Price', x='Mileage')
Out[7]:
[Scatter plot: Price (y-axis) vs. Mileage (x-axis)]
In [8]:
data.columns
Out[8]:
Index(['Price', 'Year', 'Mileage', 'State', 'Make', 'Model'], dtype='object')

Exercise P1.1 (50%)

Develop a machine learning model that predicts the price of a car, using ['Year', 'Mileage', 'State', 'Make', 'Model'] as input.

Submit the predictions for the test set to Kaggle: https://www.kaggle.com/c/miia4200-20191-p1-usedcarpriceprediction

Evaluation:

  • 25% - Performance of the model in the Kaggle Private Leaderboard
  • 25% - Notebook explaining the modeling process
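Before trying more complex models, it can help to have a simple reference point. The sketch below is one possible baseline, not a prescribed solution: it predicts the mean price of each (Make, Model) group and falls back to the global mean for groups not seen in training. The function names (fit_group_mean_baseline, predict_group_mean, rmse) and the toy rows are hypothetical, and RMSE is only an assumed validation metric; check the competition page for the official one.

```python
import numpy as np
import pandas as pd


def fit_group_mean_baseline(train, keys=("Make", "Model")):
    """Learn the mean Price per group, plus a global mean as fallback."""
    keys = list(keys)
    group_means = (
        train.groupby(keys)["Price"].mean().rename("Pred").reset_index()
    )
    return {
        "keys": keys,
        "group_means": group_means,
        "global_mean": float(train["Price"].mean()),
    }


def predict_group_mean(model, frame):
    """Predict the group mean; unseen groups receive the global mean."""
    merged = frame[model["keys"]].merge(
        model["group_means"], on=model["keys"], how="left"
    )
    preds = merged["Pred"].fillna(model["global_mean"]).to_numpy()
    # The merge resets the index, so restore the caller's index (e.g. ID).
    return pd.Series(preds, index=frame.index, name="Price")


def rmse(y_true, y_pred):
    """Root mean squared error (assumed metric, for local validation only)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy rows made up for illustration; they are not taken from the dataset.
toy = pd.DataFrame({
    "Price": [20000, 22000, 15000, 30000],
    "Make": ["Nissan", "Nissan", "Jeep", "Jeep"],
    "Model": ["MuranoAWD", "MuranoAWD", "Grand", "Wrangler"],
})
toy_new = pd.DataFrame(
    {"Make": ["Nissan", "Ford"], "Model": ["MuranoAWD", "EscapeFWD"]},
    index=[10, 11],
)
toy_model = fit_group_mean_baseline(toy)
toy_preds = predict_group_mean(toy_model, toy_new)
```

On the real data, one option is to hold out part of the training set, for example train = data.sample(frac=0.8, random_state=42) and valid = data.drop(train.index), fit on train, and compare rmse(valid.Price, predict_group_mean(model, valid)) against any later model.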
In [2]:
data_test = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/dataTest_carListings.zip', index_col=0)
In [3]:
data_test.head()
Out[3]:
Year Mileage State Make Model
ID
0 2015 23388 OH Ford EscapeFWD
1 2014 45061 PA Ford EscapeSE
2 2007 101033 WI Toyota Camry4dr
3 2015 13590 HI Jeep Wrangler
4 2009 118916 CO Dodge Charger4dr
In [4]:
data_test.shape
Out[4]:
(250000, 5)

Submission example

In [6]:
import numpy as np
In [7]:
np.random.seed(42)
y_pred = pd.DataFrame(np.random.rand(data_test.shape[0]) * 75000 + 5000, index=data_test.index, columns=['Price'])
In [8]:
y_pred.to_csv('test_submission.csv', index_label='ID')
In [9]:
y_pred.head()
Out[9]:
Price
ID
0 33090.508914
1 76303.572981
2 59899.545636
3 49899.386315
4 16701.398033
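A quick sanity check before uploading can catch formatting mistakes early. The sketch below assumes the file should mirror the example above: an index matching the test set IDs and a single Price column with positive, non-missing values. The helper name check_submission is hypothetical, and the exact rules Kaggle enforces are an assumption.

```python
import pandas as pd


def check_submission(submission, test_index):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if list(submission.columns) != ["Price"]:
        # Without a Price column the remaining checks cannot run.
        return ["expected exactly one column named 'Price'"]
    if not submission.index.equals(test_index):
        problems.append("index does not match the test set IDs")
    if submission["Price"].isna().any():
        problems.append("Price contains missing values")
    if (submission["Price"] <= 0).any():
        problems.append("Price contains non-positive values")
    return problems
```

With the objects defined in this notebook, the call would be check_submission(y_pred, data_test.index).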

Exercise P1.2 (50%)

Create an API that serves the model's predictions.

Example:

Evaluation:

  • 40% - API hosted on a cloud service
  • 10% - Screenshots of the model making predictions on the local machine
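As a starting point for Exercise P1.2, the sketch below shows one possible shape for a prediction API using only the Python standard library. Everything here is an assumption made for illustration: the /predict route, the JSON field names (taken from the input columns), the port, and placeholder_model, which returns an arbitrary constant and must be replaced by the trained model. A web framework may be a better fit for cloud hosting.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

FEATURES = ["Year", "Mileage", "State", "Make", "Model"]


def placeholder_model(features):
    """Hypothetical stand-in: returns an arbitrary constant, not a real prediction."""
    return 20000.0


def handle_payload(payload, model=placeholder_model):
    """Validate a decoded JSON payload and return (status_code, response_dict)."""
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}
    features = {name: payload[name] for name in FEATURES}
    return 200, {"Price": float(model(features))}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self._send(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:  # covers invalid JSON and undecodable bytes
            self._send(400, {"error": "body must be valid JSON"})
            return
        status, body = handle_payload(payload)
        self._send(status, body)

    def _send(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def run(host="127.0.0.1", port=8000):
    """Serve until interrupted; host and port are assumed defaults."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling run() starts the server locally; a POST to /predict with a JSON body such as {"Year": 2015, "Mileage": 23388, "State": "OH", "Make": "Ford", "Model": "EscapeFWD"} would then return a JSON object with a Price field. Keeping the validation in handle_payload, separate from the HTTP handler, makes it possible to test the logic without starting a server.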