Project 1

Used Vehicle Price Prediction

Introduction

The data come from a Kaggle dataset of 1.2 million listings scraped from TrueCar.com, with price, mileage, make, and model.
Each observation represents the price of a used car.

In [1]:

%matplotlib inline
import pandas as pd

In [2]:

data = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/dataTrain_carListings.zip')

In [3]:

data.head()

Out[3]:

	Price	Year	Mileage	State	Make	Model
0	21490	2014	31909	MD	Nissan	MuranoAWD
1	21250	2016	25741	KY	Chevrolet	CamaroCoupe
2	20925	2016	24633	SC	Hyundai	Santa
3	14500	2012	84026	OK	Jeep	Grand
4	32488	2013	22816	TN	Jeep	Wrangler

In [4]:

data.shape

Out[4]:

(500000, 6)

In [5]:

data.Price.describe()

Out[5]:

count    500000.000000
mean      21144.186304
std       10753.259704
min        5001.000000
25%       13499.000000
50%       18450.000000
75%       26998.000000
max       79999.000000
Name: Price, dtype: float64

In [6]:

data.plot(kind='scatter', y='Price', x='Year')

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a3b24a5ef0>

In [7]:

data.plot(kind='scatter', y='Price', x='Mileage')

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a3b2d3cd68>

In [8]:

data.columns

Out[8]:

Index(['Price', 'Year', 'Mileage', 'State', 'Make', 'Model'], dtype='object')

Exercise P1.1 (50%)

Develop a machine learning model that predicts the price of a car, using ['Year', 'Mileage', 'State', 'Make', 'Model'] as input.

Submit the predictions for the test set to Kaggle: https://www.kaggle.com/c/miia4200-20191-p1-usedcarpriceprediction

Evaluation:

25% - Performance of the model in the Kaggle Private Leaderboard
25% - Notebook explaining the modeling process
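
As a starting point for this exercise, here is a minimal baseline sketch, not a recommended solution. It assumes scikit-learn is available and that one-hot encoding the three text columns is acceptable; the choice of `RandomForestRegressor`, its settings, and the names `build_model`, `FEATURES`, and `CATEGORICAL` are illustrative assumptions. To stay self-contained it fits on the five rows shown by `data.head()` rather than on the full training set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

FEATURES = ['Year', 'Mileage', 'State', 'Make', 'Model']
CATEGORICAL = ['State', 'Make', 'Model']


def build_model(random_state=42):
    """Baseline pipeline: one-hot encode text columns, keep numeric ones as-is."""
    preprocess = ColumnTransformer(
        # handle_unknown='ignore' encodes unseen categories as all zeros
        transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), CATEGORICAL)],
        remainder='passthrough',  # Year and Mileage pass through unchanged
    )
    return Pipeline([
        ('preprocess', preprocess),
        ('regressor', RandomForestRegressor(n_estimators=50, random_state=random_state)),
    ])


# Tiny stand-in for `data`: the five rows printed by data.head().
sample = pd.DataFrame({
    'Price': [21490, 21250, 20925, 14500, 32488],
    'Year': [2014, 2016, 2016, 2012, 2013],
    'Mileage': [31909, 25741, 24633, 84026, 22816],
    'State': ['MD', 'KY', 'SC', 'OK', 'TN'],
    'Make': ['Nissan', 'Chevrolet', 'Hyundai', 'Jeep', 'Jeep'],
    'Model': ['MuranoAWD', 'CamaroCoupe', 'Santa', 'Grand', 'Wrangler'],
})

model = build_model()
model.fit(sample[FEATURES], sample['Price'])

# Two listings with the same columns as the test set.
new_rows = pd.DataFrame({
    'Year': [2015, 2014],
    'Mileage': [23388, 45061],
    'State': ['OH', 'PA'],
    'Make': ['Ford', 'Ford'],
    'Model': ['EscapeFWD', 'EscapeSE'],
})
preds = model.predict(new_rows[FEATURES])
```

With the real data, the same pipeline would be fitted on `data[FEATURES]` and `data['Price']`, ideally with part of the training set held out to estimate error before submitting.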

In [2]:

data_test = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/dataTest_carListings.zip', index_col=0)

In [3]:

data_test.head()

Out[3]:

	Year	Mileage	State	Make	Model
ID
0	2015	23388	OH	Ford	EscapeFWD
1	2014	45061	PA	Ford	EscapeSE
2	2007	101033	WI	Toyota	Camry4dr
3	2015	13590	HI	Jeep	Wrangler
4	2009	118916	CO	Dodge	Charger4dr

In [4]:

data_test.shape

Out[4]:

(250000, 5)

Submission example

In [6]:

import numpy as np

In [7]:

np.random.seed(42)
y_pred = pd.DataFrame(np.random.rand(data_test.shape[0]) * 75000 + 5000, index=data_test.index, columns=['Price'])

In [8]:

y_pred.to_csv('test_submission.csv', index_label='ID')

In [9]:

y_pred.head()

Out[9]:

	Price
ID
0	33090.508914
1	76303.572981
2	59899.545636
3	49899.386315
4	16701.398033
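
The random submission above fixes the file layout: an `ID` index and a single `Price` column. A small helper can wrap any vector of predictions in that layout and catch length mismatches or missing values before uploading. This is only a sketch; `make_submission` is a hypothetical name, and the three prices below are made-up values for illustration.

```python
import numpy as np
import pandas as pd


def make_submission(predictions, index):
    """Wrap predictions in the ID/Price layout used by the submission example."""
    predictions = np.asarray(predictions, dtype=float)
    if predictions.shape != (len(index),):
        raise ValueError('expected exactly one prediction per test row')
    if np.isnan(predictions).any():
        raise ValueError('predictions contain NaN')
    return pd.DataFrame({'Price': predictions}, index=pd.Index(index, name='ID'))


# Illustrative values only; in the notebook, `index` would be data_test.index.
submission = make_submission([21000.0, 18500.0, 9500.0], index=[0, 1, 2])
csv_text = submission.to_csv()  # pass a filename instead to write to disk
```

Because the index is named `ID`, `to_csv` writes the same header as `y_pred.to_csv('test_submission.csv', index_label='ID')` does above.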

Exercise P1.2 (50%)

Create an API for the model.

Example:

Evaluation:

40% - API hosted on a cloud service
10% - Screenshots of the model making predictions on the local machine
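
One possible shape for such an API is sketched below using only the Python standard library. Everything here is an assumption made for illustration: the `/predict` endpoint, the JSON request format, and the names `make_handler`, `create_server`, and `placeholder_predict`. The placeholder simply returns the median training price (18450) and would be replaced by a call to the trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

FEATURES = ['Year', 'Mileage', 'State', 'Make', 'Model']


def placeholder_predict(features):
    """Stand-in for a trained model: always returns the median training price."""
    return 18450.0


def make_handler(predict_fn):
    """Build a handler class that answers POST /predict using predict_fn."""

    class PredictHandler(BaseHTTPRequestHandler):
        def _send_json(self, status, payload):
            body = json.dumps(payload).encode('utf-8')
            self.send_response(status)
            self.send_header('Content-Type', 'application/json')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_POST(self):
            if self.path != '/predict':
                self._send_json(404, {'error': 'unknown endpoint'})
                return
            length = int(self.headers.get('Content-Length', 0))
            try:
                features = json.loads(self.rfile.read(length))
            except ValueError:  # invalid JSON or invalid UTF-8
                self._send_json(400, {'error': 'body must be valid JSON'})
                return
            if not isinstance(features, dict):
                self._send_json(400, {'error': 'body must be a JSON object'})
                return
            missing = [name for name in FEATURES if name not in features]
            if missing:
                self._send_json(400, {'error': 'missing fields', 'fields': missing})
                return
            self._send_json(200, {'Price': float(predict_fn(features))})

        def log_message(self, format, *args):
            pass  # silence per-request logging

    return PredictHandler


def create_server(predict_fn=placeholder_predict, host='127.0.0.1', port=8000):
    """Return an HTTPServer; call .serve_forever() on it to start serving."""
    return HTTPServer((host, port), make_handler(predict_fn))


# To run locally:
#     create_server().serve_forever()
```

A request would then POST a JSON object with the five input fields to `http://127.0.0.1:8000/predict` and receive `{"Price": ...}` in response. A web framework would do the same job with less code; the standard library is used here only to keep the sketch dependency-free.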