Car Price - Prediction

download dataset from here Car price prediction dataset

Problem statement:

A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.
They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:
Which variables are significant in predicting the price of a car How well those variables describe the price of a car Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the America market.

Business goal:¶

We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

Attribute information:¶

Car_ID: Unique ID of each observation (Integer).
Symboling: Its assigned insurance risk rating, A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe (Categorical).
Car_Company: Name of Car company (Categorical). The following are the list of car companies available in the data set. 'alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'isuzu', 'jaguar', 'mazda', 'buick', 'mercury', 'mitsubishi', 'nissan', 'peugeot', 'plymouth', 'porsche', 'renault', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'
Fueltype: Car fuel type i.e. gas or diesel (Categorical).
Aspiration: Aspiration used in a car i.e. Std or turbo (Categorical). Mode of air intake for the internal combustion engine i.e. natural (standard) or turbocharger.
DoorNumber: Number of doors in a car i.e. 2 or 4 (Categorical).
CarBody: Body of a car sedan, hatchback, wagon, hardtop, convertible (Categorical).
DriverWheel: Type of driver wheel 4wd, fwd, rwd (Categorical).
4wd: Four-wheel drive, also called 4×4 or 4wd, refers to a two-axled vehicle drivetrain capable of providing torque to all of its wheels simultaneously.
fwd: Front wheel drive is called as fwd. A fwd can only supply power to the front wheels.
rwd: Rear wheel drive is called as rwd. A rwd can only supply power to the rear wheels.
EngineLocation: Location of car engine Front or Rear (Categorical). Out of 205 only 3 cars engine location is rear. All the 3 cars belong to Porsche company.
WheelBase: WheelBase is the distance between the front and rear wheels. Continuous from 86.6 to 120.9 (Float).
CarLength: Length of car continuous from 141.1 to 208.1 (Float).
CarWidth: Width of car continuous from 60.3 to 72.3 (Float).
CarHeight: Height of car continuous from 47.8 to 59.8 (Float).
CurbWeight: The weight of car continuous from 1488 to 4066 (Int).
EngineType: Type of engine dohc, ohcv, ohc, l, rotor, ohcf, dohcv (Categorical).
CylinderNumber: Number of cylinders continuous from 2 to 8 (Categorical). A cylinder is the power unit of an engine; it's the chamber where the gasoline is burned and turned into power.
EngineSize: The Size of engine continuous from 61 to 326 (Int).
FuelSystem: Fuel system is to inject a precise amount of automized and pressurized fuel in to each cylinder at a proper time. Types of fuel system are mpfi, 2bbl, mfi, 1bbl, spfi, 4bbl, idi, spdi (Categorical).
BoreRatio: Bore-Stroke Ratio is the ratio between the dimensions of the engine cylinder bore diameter to its piston stroke-length. Bore ratio continuous from 2.54 to 3.94 (Float).
Stroke: stroke or volume inside the engine continuous from 2.07 to 4.17 (Float).
CompressionRatio: Compression ratio, in an internal-combustion engine, degree to which the fuel mixture is compressed before ignition. Compression ratio continuous from 7.0 to 23.0 (Float).
HorsePower: Horsepower is a unit of power used to measure the forcefulness of a vehicle's engine. Horse power continuous from 48 to 288 (Int).
PeakRPM: RPM stands for revolutions per minute, and it's used as a measure of how fast any machine is operating at a given time. how many times each piston goes up and down in its cylinder. PeakRPM continuous from 4150 to 6600 (Int).
CityMPG: CityMPG refers to driving with occasional stopping and braking. Mileage in city continuous from 13 to 49 (Float).
HighwayMPG: It is based on continuous acceleration. Mileage on highway continuous from 16 to 54 (Float).
Price: Price of car continuous from 5118 to 45400 (Float).

Importing Libraries¶

In [1]:

#### Importing libraries
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#### Ignore all warnings
warnings.filterwarnings('ignore')
#### allow plots to appear directly in the notebook
%matplotlib inline
#### set the maximum number of dataFrame columns display to unlimited
pd.set_option("display.max_columns", None)

In [2]:

#### loading dataSet as pandas dataFrame
data = pd.read_csv('./Data/CarPrice.csv')

#### Let’s take a look at the top five rows using the DataFrame’s head() method
data.head()

Out[2]:

	car_ID	symboling	CarName	fueltype	aspiration	doornumber	carbody	drivewheel	enginelocation	wheelbase	carlength	carwidth	carheight	curbweight	enginetype	cylindernumber	enginesize	fuelsystem	boreratio	stroke	compressionratio	horsepower	peakrpm	citympg	highwaympg	price
0	1	3	alfa-romero giulia	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495.0
1	2	3	alfa-romero stelvio	gas	std	two	convertible	rwd	front	88.6	168.8	64.1	48.8	2548	dohc	four	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500.0
2	3	1	alfa-romero Quadrifoglio	gas	std	two	hatchback	rwd	front	94.5	171.2	65.5	52.4	2823	ohcv	six	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500.0
3	4	2	audi 100 ls	gas	std	four	sedan	fwd	front	99.8	176.6	66.2	54.3	2337	ohc	four	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950.0
4	5	2	audi 100ls	gas	std	four	sedan	4wd	front	99.4	176.6	66.4	54.3	2824	ohc	five	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450.0

In [3]:

#### checking dimensionality of the Data
rows, cols = data.shape
print(f'The no.of rows = {rows}')
print(f'The no.of columns = {cols}')

The no.of rows = 205
The no.of columns = 26

In [4]:

'''
The info() method is useful to get a quick description of the data, in particular the total number of rows, 
and each attribute’s type and number of non-null values
'''
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 17  fuelsystem        205 non-null    object 
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64  
 22  peakrpm           205 non-null    int64  
 23  citympg           205 non-null    int64  
 24  highwaympg        205 non-null    int64  
 25  price             205 non-null    float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB

There are 205 instances in the dataset.Notice that there are no null values present in the dataset.

In [5]:

#### The describe() method shows a summary of the nummerical attributes.
data.describe().T.round(decimals = 2)

Out[5]:

	count	mean	std	min	25%	50%	75%	max
car_ID	205.0	103.00	59.32	1.00	52.00	103.00	154.00	205.00
symboling	205.0	0.83	1.25	-2.00	0.00	1.00	2.00	3.00
wheelbase	205.0	98.76	6.02	86.60	94.50	97.00	102.40	120.90
carlength	205.0	174.05	12.34	141.10	166.30	173.20	183.10	208.10
carwidth	205.0	65.91	2.15	60.30	64.10	65.50	66.90	72.30
carheight	205.0	53.72	2.44	47.80	52.00	54.10	55.50	59.80
curbweight	205.0	2555.57	520.68	1488.00	2145.00	2414.00	2935.00	4066.00
enginesize	205.0	126.91	41.64	61.00	97.00	120.00	141.00	326.00
boreratio	205.0	3.33	0.27	2.54	3.15	3.31	3.58	3.94
stroke	205.0	3.26	0.31	2.07	3.11	3.29	3.41	4.17
compressionratio	205.0	10.14	3.97	7.00	8.60	9.00	9.40	23.00
horsepower	205.0	104.12	39.54	48.00	70.00	95.00	116.00	288.00
peakrpm	205.0	5125.12	476.99	4150.00	4800.00	5200.00	5500.00	6600.00
citympg	205.0	25.22	6.54	13.00	19.00	24.00	30.00	49.00
highwaympg	205.0	30.75	6.89	16.00	25.00	30.00	34.00	54.00
price	205.0	13276.71	7988.85	5118.00	7788.00	10295.00	16503.00	45400.00

From the above it is observed the price column ranges from 5118 to 45400 with a standard deviation of 7988.85.

In [6]:

#### The describe(include=['object']) method shows a summary of the object attributes.
data.describe(include = ['object']).T

Out[6]:

	count	unique	top	freq
CarName	205	147	peugeot 504	6
fueltype	205	2	gas	185
aspiration	205	2	std	168
doornumber	205	2	four	115
carbody	205	5	sedan	96
drivewheel	205	3	fwd	120
enginelocation	205	2	front	202
enginetype	205	7	ohc	148
cylindernumber	205	7	four	159
fuelsystem	205	8	mpfi	94

from the above it is observed that

98% of the cars have enginelocation as front.
90% of the cars have fueltype as gas.
75% carName are distinct.

In [7]:

#Checking for the unique values present in the categorical features.
for i in data.select_dtypes(include='object').columns:
        print(i, ':\n', data[i].unique(), end=print())

CarName :
 ['alfa-romero giulia' 'alfa-romero stelvio' 'alfa-romero Quadrifoglio'
 'audi 100 ls' 'audi 100ls' 'audi fox' 'audi 5000' 'audi 4000'
 'audi 5000s (diesel)' 'bmw 320i' 'bmw x1' 'bmw x3' 'bmw z4' 'bmw x4'
 'bmw x5' 'chevrolet impala' 'chevrolet monte carlo' 'chevrolet vega 2300'
 'dodge rampage' 'dodge challenger se' 'dodge d200' 'dodge monaco (sw)'
 'dodge colt hardtop' 'dodge colt (sw)' 'dodge coronet custom'
 'dodge dart custom' 'dodge coronet custom (sw)' 'honda civic'
 'honda civic cvcc' 'honda accord cvcc' 'honda accord lx'
 'honda civic 1500 gl' 'honda accord' 'honda civic 1300' 'honda prelude'
 'honda civic (auto)' 'isuzu MU-X' 'isuzu D-Max ' 'isuzu D-Max V-Cross'
 'jaguar xj' 'jaguar xf' 'jaguar xk' 'maxda rx3' 'maxda glc deluxe'
 'mazda rx2 coupe' 'mazda rx-4' 'mazda glc deluxe' 'mazda 626' 'mazda glc'
 'mazda rx-7 gs' 'mazda glc 4' 'mazda glc custom l' 'mazda glc custom'
 'buick electra 225 custom' 'buick century luxus (sw)' 'buick century'
 'buick skyhawk' 'buick opel isuzu deluxe' 'buick skylark'
 'buick century special' 'buick regal sport coupe (turbo)'
 'mercury cougar' 'mitsubishi mirage' 'mitsubishi lancer'
 'mitsubishi outlander' 'mitsubishi g4' 'mitsubishi mirage g4'
 'mitsubishi montero' 'mitsubishi pajero' 'Nissan versa' 'nissan gt-r'
 'nissan rogue' 'nissan latio' 'nissan titan' 'nissan leaf' 'nissan juke'
 'nissan note' 'nissan clipper' 'nissan nv200' 'nissan dayz' 'nissan fuga'
 'nissan otti' 'nissan teana' 'nissan kicks' 'peugeot 504' 'peugeot 304'
 'peugeot 504 (sw)' 'peugeot 604sl' 'peugeot 505s turbo diesel'
 'plymouth fury iii' 'plymouth cricket' 'plymouth satellite custom (sw)'
 'plymouth fury gran sedan' 'plymouth valiant' 'plymouth duster'
 'porsche macan' 'porcshce panamera' 'porsche cayenne' 'porsche boxter'
 'renault 12tl' 'renault 5 gtl' 'saab 99e' 'saab 99le' 'saab 99gle'
 'subaru' 'subaru dl' 'subaru brz' 'subaru baja' 'subaru r1' 'subaru r2'
 'subaru trezia' 'subaru tribeca' 'toyota corona mark ii' 'toyota corona'
 'toyota corolla 1200' 'toyota corona hardtop' 'toyota corolla 1600 (sw)'
 'toyota carina' 'toyota mark ii' 'toyota corolla'
 'toyota corolla liftback' 'toyota celica gt liftback'
 'toyota corolla tercel' 'toyota corona liftback' 'toyota starlet'
 'toyota tercel' 'toyota cressida' 'toyota celica gt' 'toyouta tercel'
 'vokswagen rabbit' 'volkswagen 1131 deluxe sedan' 'volkswagen model 111'
 'volkswagen type 3' 'volkswagen 411 (sw)' 'volkswagen super beetle'
 'volkswagen dasher' 'vw dasher' 'vw rabbit' 'volkswagen rabbit'
 'volkswagen rabbit custom' 'volvo 145e (sw)' 'volvo 144ea' 'volvo 244dl'
 'volvo 245' 'volvo 264gl' 'volvo diesel' 'volvo 246']

fueltype :
 ['gas' 'diesel']

aspiration :
 ['std' 'turbo']

doornumber :
 ['two' 'four']

carbody :
 ['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']

drivewheel :
 ['rwd' 'fwd' '4wd']

enginelocation :
 ['front' 'rear']

enginetype :
 ['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf' 'dohcv']

cylindernumber :
 ['four' 'six' 'five' 'three' 'twelve' 'two' 'eight']

fuelsystem :
 ['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' '4bbl' 'idi' 'spdi']

In [8]:

#### spliting car brand from carName column
data['car_brand']=data['CarName'].str.split(" ", n = 1, expand = True)[0]

In [9]:

#Priting all kinds of brands availabel  in the dataset
data['car_brand'].unique()

Out[9]:

array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'maxda', 'mazda', 'buick', 'mercury',
       'mitsubishi', 'Nissan', 'nissan', 'peugeot', 'plymouth', 'porsche',
       'porcshce', 'renault', 'saab', 'subaru', 'toyota', 'toyouta',
       'vokswagen', 'volkswagen', 'vw', 'volvo'], dtype=object)

In [10]:

#### There seems to be some duplicate values in in the car_brand column.
data['car_brand'].replace({'maxda': 'mazda',
                           'nissan': 'Nissan',
                           'toyouta': 'toyota',
                           'porcshce': 'porsche',
                           'vw': 'volkswagen',
                           'vokswagen': 'volkswagen'
                          }, inplace=True)

data['car_brand'].unique()

Out[10]:

array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'mazda', 'buick', 'mercury', 'mitsubishi',
       'Nissan', 'peugeot', 'plymouth', 'porsche', 'renault', 'saab',
       'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)

In [11]:

#### Printing no of outliers in each column
outliers_columns_list = []
for i in data.select_dtypes(exclude='object').columns:
    q1 = data[i].quantile(0.25)
    q3 = data[i].quantile(0.75)
    IQR = q3-q1
    ub = q3+(1.5*IQR)
    lb = q1-(1.5*IQR)
    outliers_count=data[(data[i]>ub) | (data[i]<lb)][i].count()
    if(outliers_count>0):
        print(f"no of outliers in '{i}' is {outliers_count}")
        outliers_columns_list.append(i)

no of outliers in 'wheelbase' is 3
no of outliers in 'carlength' is 1
no of outliers in 'carwidth' is 8
no of outliers in 'enginesize' is 10
no of outliers in 'stroke' is 20
no of outliers in 'compressionratio' is 28
no of outliers in 'horsepower' is 6
no of outliers in 'peakrpm' is 2
no of outliers in 'citympg' is 2
no of outliers in 'highwaympg' is 3
no of outliers in 'price' is 15

In [12]:

plt.figure(figsize=(15, 7))
for i in range(len(outliers_columns_list)-1):
    plt.subplot(2, 5, i+1)
    sns.boxplot(data[outliers_columns_list[i]], orient="v")

plt.tight_layout()

From the above boxplots it is observed that all the outliers are nearer to Upper and lower boundaries. So there is no need of outliers treatment

In [13]:

# visualizing our dependent variable for outliers and skewnwss
plt.figure(figsize=(15,4))

plt.subplot(1,2,1)
sns.boxplot(data["price"])
plt.title("Boxplot for outliers detection", fontweight="bold")

plt.subplot(1,2,2)
sns.distplot(data["price"])
plt.title("Distribution plot for skewness", fontweight="bold")

plt.show()

From the above plot :

There are few outliers towards the higher price range, suggesting that there are few high price models.
Price distribution plot is right skewed
maximum number of cars are in range of 5000 to 20000

In [14]:

#### Count plot for all categorical columns
object_columns_list = data.select_dtypes(include='object').columns[1:-1]
plt.figure(figsize = (15, 10))
for i in range(len(object_columns_list)):
    plt.subplot(3, 3, i+1)
    sns.countplot(data[object_columns_list[i]], order = data[object_columns_list[i]].value_counts().index)
    plt.xlabel(object_columns_list[i], fontweight="bold")

From the above visuals:

fueltype: Majority of the automobiles are gas fuel type.
aspiration: Majority of the automobiles use standard aspiration.
doornumber: Majority of the automobiles are 4 door models.
carbody: Sedan is the most common model, convertible is the least common model.
drivewheel: Forward wheel drive is the most common model, 4 wheel drive is the least common model.
enginelocation: Almost all the models are having engine location as front.
enginetype: Majority (almost all) of the models are having 'ohc' engine type.
cylindernumber: Majority (almost all) of the models are 4 cylinder models.
fuelsystem: Majority of the models are having 'mpfi' and '2bbl' fuel systems.

In [15]:

#### violin plot for all categorical columns v/s price
object_columns_list = data.select_dtypes(include='object').columns[1:-1]
plt.figure(figsize = (15, 10))
for i in range(len(object_columns_list)):
    plt.subplot(3, 3, i+1)
    sns.violinplot(x=data[object_columns_list[i]], y=data['price'])
    plt.xlabel(object_columns_list[i], fontweight="bold")

From the above visuals:

The price doesn't show more effect on fueltype, aspiration, doornumber.
There is high chances that the enginelocation at rear as high price compared to front.
The higher the number of cylinders makes higher the price.

In [16]:

#### fitting the regression line between all numeric columns v/s price 
num_columns_list = data.select_dtypes(exclude='object').columns[1:-1]
plt.figure(figsize = (15, 10))
for i in range(0,len(num_columns_list),5):
    sns.pairplot(data, x_vars=num_columns_list[i:i+5], y_vars='price', kind='reg')

<Figure size 1080x720 with 0 Axes>

From the above visuals:

The higher price cars will have high wheelbase, carlength, carwidth, carheight, curbweight, enginesize, boreratio and horsepower
The high price cars have low mpg in both city and highway

In [17]:

#### heatmap to visualize the pearson's correlation matrix between the numeric variables
plt.figure(figsize=(12,8))
sns.heatmap(data.corr(), annot=True, cmap="YlGnBu", mask=np.triu(data.corr(), k=1))

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x19696959588>

From the above heatmap:

wheelbase has high positive correlation with carlength,carwidth and curbweight
carlength has high postive correlation with curbweight
carlength has negative correlation with highwaympg
carwidth has high postive correlation with curbweight and engine size
enginesize has high positive correlation with horsepower
curbweight has high positive correlation with engine size and horse power, negative correlation with highwaympg
horsepower has negative correlation with citympg and highwaympg
citympg and highwaympg are highly correlated

In [18]:

#### average price for each brand car
(data.groupby('car_brand')['price'].mean()
    .sort_values(ascending=False).reset_index().round(2) 
    .style.background_gradient(cmap='Blues'))

Out[18]:

	car_brand	price
0	jaguar	34600.000000
1	buick	33647.000000
2	porsche	31400.500000
3	bmw	26118.750000
4	volvo	18063.180000
5	audi	17859.170000
6	mercury	16503.000000
7	alfa-romero	15498.330000
8	peugeot	15489.090000
9	saab	15223.330000
10	mazda	10652.880000
11	Nissan	10415.670000
12	volkswagen	10077.500000
13	toyota	9885.810000
14	renault	9595.000000
15	mitsubishi	9239.770000
16	isuzu	8916.500000
17	subaru	8541.250000
18	honda	8184.690000
19	plymouth	7963.430000
20	dodge	7875.440000
21	chevrolet	6007.000000

From the above dataFrame:

jaguar make is having highest average price.
chevrolet make is having least average price.

In [19]:

#### Based on above visuals Dropping Few columns
data.drop(['car_ID', 'CarName', 'highwaympg', 'fuelsystem', 'aspiration', 'doornumber'], axis=1, inplace=True)

In [20]:

#### converting 'cylindernumber' column to numeric
data['cylindernumber'].replace({"two": 2,
                                "three":3,
                                "four": 4,
                                "five": 5,
                                "six": 6,  
                                "eight": 8,
                                "twelve": 12
                               }, inplace=True)

In [21]:

# Define the columns to perform one hot encoding
cols_to_ohe=data.select_dtypes(include='object').columns

# perform one hot encoding on the following columns
import category_encoders as ce

ce_ohe=ce.OneHotEncoder(cols=cols_to_ohe)
data=ce_ohe.fit_transform(data)

data.head()

Out[21]:

	symboling	fueltype_1	carbody_1	carbody_2	carbody_3	drivewheel_1	drivewheel_2	drivewheel_3	enginelocation_1	wheelbase	carlength	carwidth	carheight	curbweight	enginetype_1	enginetype_2	enginetype_3	cylindernumber	enginesize	boreratio	stroke	compressionratio	horsepower	peakrpm	citympg	price	car_brand_1	car_brand_2
0	3	1	1	0	0	1	0	0	1	88.6	168.8	64.1	48.8	2548	1	0	0	4	130	3.47	2.68	9.0	111	5000	21	13495.0	1	0
1	3	1	1	0	0	1	0	0	1	88.6	168.8	64.1	48.8	2548	1	0	0	4	130	3.47	2.68	9.0	111	5000	21	16500.0	1	0
2	1	1	0	1	0	1	0	0	1	94.5	171.2	65.5	52.4	2823	0	1	0	6	152	2.68	3.47	9.0	154	5000	19	16500.0	1	0
3	2	1	0	0	1	0	1	0	1	99.8	176.6	66.2	54.3	2337	0	0	1	4	109	3.19	3.40	10.0	102	5500	24	13950.0	0	1
4	2	1	0	0	1	0	0	1	1	99.4	176.6	66.4	54.3	2824	0	0	1	5	136	3.19	3.40	8.0	115	5500	18	17450.0	0	1

In [22]:

#### Creating dependent and independent variables
# independent variables
x = data.drop(columns="price")

# dependent variable
y = data["price"]

In [23]:

# Save numeric columns in a variable
cnames = ['wheelbase', 'carlength', 'carwidth', 'carheight',
          'curbweight', 'enginesize', 'boreratio', 'stroke', 
          'compressionratio', 'horsepower', 'peakrpm', 'citympg',]

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
for col in cnames:
    x[col]=sc.fit_transform(x[col].values.reshape(-1,1))

x.head(5)

Out[23]:

	symboling	fueltype_1	carbody_1	carbody_2	carbody_3	drivewheel_1	drivewheel_2	drivewheel_3	enginelocation_1	wheelbase	carlength	carwidth	carheight	curbweight	enginetype_1	enginetype_2	enginetype_3	cylindernumber	enginesize	boreratio	stroke	compressionratio	horsepower	peakrpm	citympg	car_brand_1	car_brand_2
0	3	1	1	0	0	1	0	0	1	-1.690772	-0.426521	-0.844782	-2.020417	-0.014566	1	0	0	4	0.074449	0.519071	-1.839377	-0.288349	0.174483	-0.262960	-0.646553	1	0
1	3	1	1	0	0	1	0	0	1	-1.690772	-0.426521	-0.844782	-2.020417	-0.014566	1	0	0	4	0.074449	0.519071	-1.839377	-0.288349	0.174483	-0.262960	-0.646553	1	0
2	1	1	0	1	0	1	0	0	1	-0.708596	-0.231513	-0.190566	-0.543527	0.514882	0	1	0	6	0.604046	-2.404880	0.685946	-0.288349	1.264536	-0.262960	-0.953012	1	0
3	2	1	0	0	1	0	1	0	1	0.173698	0.207256	0.136542	0.235942	-0.420797	0	0	1	4	-0.431076	-0.517266	0.462183	-0.035973	-0.053668	0.787855	-0.186865	0	1
4	2	1	0	0	1	0	0	1	1	0.107110	0.207256	0.230001	0.235942	0.516807	0	0	1	5	0.218885	-0.517266	0.462183	-0.540725	0.275883	0.787855	-1.106241	0	1

In [24]:

#### Splitting train and test dataset
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)

Linear Regression¶

In [25]:

from sklearn.linear_model import LinearRegression
#### instantiate and fit
model_LR = LinearRegression()
model_LR.fit(x_train, y_train)

Out[25]:

LinearRegression()

In [26]:

LR_score_train = model_LR.score(x_train, y_train)
print("Train Accuracy :",LR_score_train)

LR_score_test = model_LR.score(x_test, y_test)
print("Test Accuracy  :",LR_score_test)

Train Accuracy : 0.9706000959758017
Test Accuracy  : 0.8920483520083704

In [27]:

#### predicting 'price' using lin_reg
predictions_LR = model_LR.predict(x_test)

#### calculating RMSE, MSE, and MAE
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_LR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_LR))
LR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_LR))
print('RMSE:', LR_RMSE)

MAE: 1602.070849236117
MSE: 7128842.973392517
RMSE: 2669.989320838665

In [28]:

final_results = []
dict_LR = {'MODEL':'Linear Regression',
           'Train_ACCURACY':LR_score_train,
           'Test_ACCURACY':LR_score_test,
           'RMSE':LR_RMSE
          }
final_results.append(dict_LR)

Ridge Regression¶

In [29]:

from sklearn import linear_model
model_LR_Ridge = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
model_LR_Ridge.fit(x_train,y_train)
model_LR_Ridge.alpha_

Out[29]:

0.1

In [30]:

LR_Ridge_score_train = model_LR_Ridge.score(x_train,y_train)
print("Train Accuracy :",LR_Ridge_score_train)

LR_Ridge_score_test = model_LR_Ridge.score(x_test,y_test)
print("Test Accuracy  :",LR_Ridge_score_test)

Train Accuracy : 0.9703273024088487
Test Accuracy  : 0.8885484567286488

In [31]:

#### predicting 'price' using Ridge Regression
predictions_LR_Ridge = model_LR_Ridge.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_LR_Ridge))
print('MSE:', metrics.mean_squared_error(y_test, predictions_LR_Ridge))
LR_Ridge_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_LR_Ridge))
print('RMSE:', LR_Ridge_RMSE)

MAE: 1632.8053066911716
MSE: 7359966.854654508
RMSE: 2712.9258844750084

In [32]:

dict_LR_Ridge = {'MODEL':'Ridge Regression',
                 'Train_ACCURACY':LR_Ridge_score_train,
                 'Test_ACCURACY':LR_Ridge_score_test,
                 'RMSE':LR_Ridge_RMSE
                }
final_results.append(dict_LR_Ridge)

Lasso Regressor¶

In [33]:

from sklearn import linear_model
model_LR_Lasso = linear_model.LassoCV(alphas=np.logspace(-6, 6, 13))
model_LR_Lasso.fit(x_train,y_train)

Out[33]:

LassoCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]))

In [34]:

LR_Lasso_score_train = model_LR_Lasso.score(x_train,y_train)
print("Train Accuracy :",LR_Lasso_score_train)

LR_Lasso_score_test = model_LR_Lasso.score(x_test,y_test)
print("Test Accuracy  :",LR_Lasso_score_test)

Train Accuracy : 0.9670849805859022
Test Accuracy  : 0.8901462183776644

In [35]:

#### predicting 'price' using Lasso Regression
predictions_LR_Lasso = model_LR_Lasso.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_LR_Lasso))
print('MSE:', metrics.mean_squared_error(y_test, predictions_LR_Lasso))
LR_Lasso_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_LR_Lasso))
print('RMSE:', LR_Lasso_RMSE)

MAE: 1704.319461370362
MSE: 7254454.876684296
RMSE: 2693.4095263595354

In [36]:

dict_LR_Lasso = {'MODEL':'Lasso Regression',
                 'Train_ACCURACY':LR_Lasso_score_train,
                 'Test_ACCURACY':LR_Lasso_score_test,
                 'RMSE':LR_Lasso_RMSE
                }
final_results.append(dict_LR_Lasso)

KNN Regressor¶

In [37]:

from sklearn import neighbors as nb
model_KNN=nb.KNeighborsRegressor(n_neighbors=4,n_jobs=-1)
model_KNN.fit(x_train,y_train)

Out[37]:

KNeighborsRegressor(n_jobs=-1, n_neighbors=4)

In [38]:

KNN_score_train = model_KNN.score(x_train,y_train)
print("Train Accuracy :",KNN_score_train)

KNN_score_test = model_KNN.score(x_test,y_test)
print("Test Accuracy  :",KNN_score_test)

Train Accuracy : 0.9124677696162664
Test Accuracy  : 0.8259113078399183

In [39]:

#Parameter tuning/Grid search
import sklearn.model_selection as model_selection
model_KNN=model_selection.GridSearchCV(model_KNN,param_grid={'n_neighbors':[3,5,7,9,11],'weights':['uniform','distance']})
model_KNN.fit(x_train,y_train)

Out[39]:

GridSearchCV(estimator=KNeighborsRegressor(n_jobs=-1, n_neighbors=4),
             param_grid={'n_neighbors': [3, 5, 7, 9, 11],
                         'weights': ['uniform', 'distance']})

In [40]:

model_KNN.best_params_

Out[40]:

{'n_neighbors': 3, 'weights': 'distance'}

In [41]:

KNN_score_train = model_KNN.score(x_train,y_train)
print("Train Accuracy :",KNN_score_train)

KNN_score_test = model_KNN.score(x_test,y_test)
print("Test Accuracy  :",KNN_score_test)

Train Accuracy : 0.9985246605173365
Test Accuracy  : 0.8783507393636467

In [42]:

#### predicting 'price' using lin_reg
predictions_KNN = model_KNN.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_KNN))
print('MSE:', metrics.mean_squared_error(y_test, predictions_KNN))
KNN_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_KNN))
print('RMSE:', KNN_RMSE)

MAE: 1832.775503909773
MSE: 8033397.294435981
RMSE: 2834.3248392581927

In [43]:

dict_KNN = {'MODEL':'KNN Regressor',
            'Train_ACCURACY':KNN_score_train,
            'Test_ACCURACY':KNN_score_test,
            'RMSE':KNN_RMSE}
final_results.append(dict_KNN)

DT_Regression_Tree¶

In [44]:

import sklearn.tree as tree
model_DT=tree.DecisionTreeRegressor(max_depth=3)
model_DT.fit(x_train,y_train)

Out[44]:

DecisionTreeRegressor(max_depth=3)

In [45]:

DT_score_train = model_DT.score(x_train,y_train)
print("Train Accuracy :",DT_score_train)

DT_score_test = model_DT.score(x_test,y_test)
print("Test Accuracy  :",DT_score_test)

Train Accuracy : 0.9216374578674762
Test Accuracy  : 0.8679424987382512

In [46]:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text

features  = x.columns
plt.figure(figsize=(15,8))
plot_tree(model_DT, feature_names=features, filled = True)

Out[46]:

[Text(418.5, 380.52, 'enginesize <= 1.326\nmse = 62413097.511\nsamples = 143\nvalue = 13299.679'),
 Text(209.25, 271.8, 'curbweight <= -0.022\nmse = 19869258.311\nsamples = 129\nvalue = 11141.381'),
 Text(104.625, 163.07999999999998, 'curbweight <= -0.508\nmse = 5353442.13\nsamples = 84\nvalue = 8564.036'),
 Text(52.3125, 54.360000000000014, 'mse = 1072432.475\nsamples = 49\nvalue = 7284.122'),
 Text(156.9375, 54.360000000000014, 'mse = 5842577.678\nsamples = 35\nvalue = 10355.914'),
 Text(313.875, 163.07999999999998, 'carwidth <= 1.258\nmse = 11419572.218\nsamples = 45\nvalue = 15952.426'),
 Text(261.5625, 54.360000000000014, 'mse = 9422357.639\nsamples = 39\nvalue = 15325.107'),
 Text(366.1875, 54.360000000000014, 'mse = 5216916.667\nsamples = 6\nvalue = 20030.0'),
 Text(627.75, 271.8, 'compressionratio <= 1.604\nmse = 16001889.551\nsamples = 14\nvalue = 33186.857'),
 Text(523.125, 163.07999999999998, 'carwidth <= 2.52\nmse = 7696192.8\nsamples = 10\nvalue = 35104.0'),
 Text(470.8125, 54.360000000000014, 'mse = 4317654.222\nsamples = 9\nvalue = 34453.333'),
 Text(575.4375, 54.360000000000014, 'mse = 0.0\nsamples = 1\nvalue = 40960.0'),
 Text(732.375, 163.07999999999998, 'curbweight <= 2.319\nmse = 4606060.0\nsamples = 4\nvalue = 28394.0'),
 Text(680.0625, 54.360000000000014, 'mse = 1573219.556\nsamples = 3\nvalue = 27325.333'),
 Text(784.6875, 54.360000000000014, 'mse = 0.0\nsamples = 1\nvalue = 31600.0')]

In [47]:

#### predicting 'price' using lin_reg
predictions_DT = model_DT.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_DT))
print('MSE:', metrics.mean_squared_error(y_test, predictions_DT))
DT_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_DT))
print('RMSE:', DT_RMSE)

MAE: 2076.496298188754
MSE: 8720730.136760743
RMSE: 2953.0882372121464

In [48]:

dict_DT = {'MODEL':'DT Regressor',
           'Train_ACCURACY':DT_score_train,
           'Test_ACCURACY':DT_score_test,
           'RMSE':DT_RMSE}
final_results.append(dict_DT)

Random Forest Regressor¶

In [49]:

from sklearn.ensemble import RandomForestRegressor

list_RFR=[]
#Tune number of trees
for i in range(10,200,10):
    model_RFR=RandomForestRegressor(n_estimators=i,random_state=10)
    model_RFR.fit(x_train,y_train)
    dict_RFR={}
    dict_RFR["Number of trees"] = str(i)
    dict_RFR["ACCURACY"]=model_RFR.score(x_test,y_test)
    list_RFR.append(dict_RFR)
    
(pd.DataFrame(list_RFR)
     .sort_values(by=['ACCURACY'],ascending=False)
     .reset_index(drop=True)
     .style.background_gradient(cmap='Blues'))

Out[49]:

	Number of trees	ACCURACY
0	50	0.912008
1	20	0.911499
2	60	0.911267
3	40	0.911264
4	70	0.910807
5	90	0.910469
6	100	0.910244
7	10	0.910193
8	110	0.910033
9	80	0.910015
10	130	0.909085
11	120	0.908817
12	140	0.908712
13	150	0.908355
14	160	0.908344
15	190	0.908114
16	170	0.907999
17	180	0.907784
18	30	0.907609

In [50]:

model_RFR = RandomForestRegressor(n_estimators=50,random_state=10)
model_RFR.fit(x_train, y_train)

Out[50]:

RandomForestRegressor(n_estimators=50, random_state=10)

In [51]:

RFR_score_train = model_RFR.score(x_train, y_train)
print("Train Accuracy :",RFR_score_train)

RFR_score_test = model_RFR.score(x_test,y_test)
print("Test Accuracy  :",RFR_score_test)

Train Accuracy : 0.9874907945466967
Test Accuracy  : 0.9120081880470541

In [52]:

#### predicting 'price' using lin_reg
predictions_RFR = model_RFR.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_RFR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_RFR))
RFR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_RFR))
print('RMSE:', RFR_RMSE)

MAE: 1603.7387358064516
MSE: 5810747.8859931175
RMSE: 2410.5492913427674

In [53]:

dict_RFR = {'MODEL':'Random Forest Regressor',
            'Train_ACCURACY':RFR_score_train,
            'Test_ACCURACY':RFR_score_test,
            'RMSE':RFR_RMSE
           }
final_results.append(dict_RFR)

BaggingRegressor¶

In [54]:

from sklearn.ensemble import BaggingRegressor

list_BR=[]
#Tune number of trees
for i in range(10,200,10):
    model_BR=BaggingRegressor(n_estimators=i,oob_score=True,random_state=200)
    model_BR.fit(x_train,y_train)
    dict_BR={}
    dict_BR["Number of trees"] = str(i)
    dict_BR["ACCURACY"]=model_BR.score(x_test,y_test)
    list_BR.append(dict_BR)
    
(pd.DataFrame(list_BR)
     .sort_values(by=['ACCURACY'],ascending=False)
     .reset_index(drop=True)
     .style.background_gradient(cmap='Blues'))

Out[54]:

	Number of trees	ACCURACY
0	40	0.917846
1	30	0.914527
2	50	0.914173
3	20	0.913007
4	60	0.912395
5	80	0.912235
6	70	0.912139
7	90	0.911879
8	100	0.910822
9	110	0.910629
10	150	0.910575
11	140	0.910301
12	130	0.910254
13	120	0.909993
14	180	0.909279
15	160	0.909222
16	190	0.908950
17	170	0.908668
18	10	0.907015

In [55]:

model_BR=BaggingRegressor(n_estimators=40,oob_score=True,random_state=200)
model_BR.fit(x_train,y_train)

Out[55]:

BaggingRegressor(n_estimators=40, oob_score=True, random_state=200)

In [56]:

BR_score_train = model_BR.score(x_train,y_train)
print("Train Accuracy :",BR_score_train)

BR_score_test = model_BR.score(x_test,y_test)
print("Test Accuracy :",BR_score_test)

Train Accuracy : 0.9886259453086359
Test Accuracy : 0.9178461822917738

In [57]:

#### predicting 'price' using lin_reg
predictions_BR = model_BR.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_BR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_BR))
BR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_BR))
print('RMSE:', BR_RMSE)

MAE: 1561.133266532258
MSE: 5425222.097138057
RMSE: 2329.2106167408

In [58]:

dict_BR = {'MODEL':'Bagging Regressor',
           'Train_ACCURACY':BR_score_train,
           'Test_ACCURACY':BR_score_test,
           'RMSE':BR_RMSE
          }
final_results.append(dict_BR)

ElasticNet Regressor¶

In [59]:

from sklearn.linear_model import ElasticNet
model_ENR = ElasticNet(random_state=0)
model_ENR.fit(x_train,y_train)

Out[59]:

ElasticNet(random_state=0)

In [60]:

ENR_score_train = model_ENR.score(x_train,y_train)
print("Train Accuracy :",ENR_score_train)

ENR_score_test = model_ENR.score(x_test,y_test)
print("Test Accuracy  :",ENR_score_test)

Train Accuracy : 0.8563475978333301
Test Accuracy  : 0.8235110288331771

In [61]:

#### predicting 'price' using ElasticNet Regressor
predictions_ENR = model_ENR.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_ENR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_ENR))
ENR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_ENR))
print('RMSE:', ENR_RMSE)

MAE: 2177.5442315710934
MSE: 11654867.57628227
RMSE: 3413.922608420154

In [62]:

dict_ENR = {'MODEL':'ElasticNet Regressor',
            'Train_ACCURACY':ENR_score_train,
            'Test_ACCURACY':ENR_score_test,
            'RMSE':ENR_RMSE
           }
final_results.append(dict_ENR)

Gradient Boosting Regressor¶

In [63]:

from sklearn.ensemble import GradientBoostingRegressor
model_GBR = GradientBoostingRegressor()
model_GBR.fit(x_train,y_train)

Out[63]:

GradientBoostingRegressor()

In [64]:

GBR_score_train = model_GBR.score(x_train,y_train)
print("Train Accuracy :",GBR_score_train)

GBR_score_test = model_GBR.score(x_test,y_test)
print("Test Accuracy  :",GBR_score_test)

Train Accuracy : 0.9942508585108231
Test Accuracy  : 0.907653503437977

In [65]:

#### predicting 'price' using ElasticNet Regressor
predictions_GBR = model_GBR.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_GBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_GBR))
GBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_GBR))
print('RMSE:', GBR_RMSE)

MAE: 1589.7051600620339
MSE: 6098319.806888355
RMSE: 2469.4776384669603

In [66]:

dict_GBR = {'MODEL':'Gradient Boosting Regressor',
            'Train_ACCURACY':GBR_score_train,
            'Test_ACCURACY':GBR_score_test,
            'RMSE':GBR_RMSE
           }
final_results.append(dict_GBR)

Histogram-Based Gradient Boosting Regressor¶

In [67]:

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
model_HGBR = HistGradientBoostingRegressor()
model_HGBR.fit(x_train,y_train)

Out[67]:

HistGradientBoostingRegressor()

In [68]:

HGBR_score_train = model_HGBR.score(x_train,y_train)
print("Train Accuracy :",HGBR_score_train)

HGBR_score_test = model_HGBR.score(x_test,y_test)
print("Test Accuracy  :",HGBR_score_test)

Train Accuracy : 0.9636892964789147
Test Accuracy  : 0.889279745320581

In [69]:

#### predicting 'price' using Histogram-Based Gradient Boosting Regressor
predictions_HGBR = model_HGBR.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_HGBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_HGBR))
HGBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_HGBR))
print('RMSE:', HGBR_RMSE)

MAE: 1833.2560112069134
MSE: 7311674.4789742185
RMSE: 2704.010813398167

In [70]:

dict_HGBR = {'MODEL':'Histogram Based GBR',
             'Train_ACCURACY':HGBR_score_train,
             'Test_ACCURACY':HGBR_score_test,
             'RMSE':HGBR_RMSE
            }
final_results.append(dict_HGBR)

Gradient Boosting With XGBoost Regressor¶

In [71]:

from xgboost import XGBRegressor
model_XGBR = XGBRegressor()
model_XGBR.fit(x_train,y_train)

Out[71]:

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [72]:

XGBR_score_train = model_XGBR.score(x_train,y_train)
print("Train Accuracy :",XGBR_score_train)

XGBR_score_test = model_XGBR.score(x_test,y_test)
print("Train Accuracy :",XGBR_score_test)

Train Accuracy : 0.998523693850276
Train Accuracy : 0.9055217582551045

In [73]:

#### predicting 'price' using XGBRegressor
predictions_XGBR = model_XGBR.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_XGBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_XGBR))
XGBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_XGBR))
print('RMSE:', XGBR_RMSE)

MAE: 1696.7096537928428
MSE: 6239094.653319247
RMSE: 2497.8179784202143

In [74]:

dict_XGBR = {'MODEL':'XGBoost Regressor',
             'Train_ACCURACY':XGBR_score_train,
             'Test_ACCURACY':XGBR_score_test,
             'RMSE':XGBR_RMSE
            }
final_results.append(dict_XGBR)

Gradient Boosting With LightGBM Regressor¶

In [75]:

from lightgbm import LGBMRegressor
model_LGBR = LGBMRegressor()
model_LGBR.fit(x_train,y_train)

Out[75]:

LGBMRegressor()

In [76]:

LGBR_score_train = model_LGBR.score(x_train,y_train)
print("Train Accuracy :",LGBR_score_train)

LGBR_score_test = model_LGBR.score(x_test,y_test)
print("Test Accuracy  :",LGBR_score_test)

Train Accuracy : 0.9609605280364496
Test Accuracy  : 0.8780216588918559

In [77]:

#### predicting 'price' using XGBRegressor
predictions_LGBR = model_LGBR.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_LGBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_LGBR))
LGBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_LGBR))
print('RMSE:', LGBR_RMSE)

MAE: 1876.6765463516463
MSE: 8055128.903472544
RMSE: 2838.155898373545

In [78]:

dict_LGBR = {'MODEL':'LightGBM Regressor',
             'Train_ACCURACY':LGBR_score_train,
             'Test_ACCURACY':LGBR_score_test,
             'RMSE':LGBR_RMSE
            }
final_results.append(dict_LGBR)

Gradient Boosting with CatBoost¶

In [79]:

from catboost import CatBoostRegressor
model_CGBR = CatBoostRegressor(verbose=0, n_estimators=100)
model_CGBR.fit(x_train,y_train)

Out[79]:

<catboost.core.CatBoostRegressor at 0x1969cd39d48>

In [80]:

CGBR_score_train = model_CGBR.score(x_train,y_train)
print("Train Accuracy :",CGBR_score_train)

CGBR_score_test = model_CGBR.score(x_test,y_test)
print("Test Accuracy  :",CGBR_score_test)

Train Accuracy : 0.992900343375794
Test Accuracy  : 0.9192606254631139

In [81]:

#### predicting 'price' using CatBoost
predictions_CGBR = model_CGBR.predict(x_test)

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions_CGBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_CGBR))
CGBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_CGBR))
print('RMSE:', CGBR_RMSE)

MAE: 1509.6383901542376
MSE: 5331815.989395706
RMSE: 2309.0725387903485

In [82]:

dict_CGBR = {'MODEL':'CatBoost Regressor',
             'Train_ACCURACY':CGBR_score_train,
             'Test_ACCURACY':LGBR_score_test,
             'RMSE':CGBR_RMSE
            }
final_results.append(dict_CGBR)

Comparing ACCURACY and RMSE for all models¶

In [83]:

df_results = pd.DataFrame(final_results)
df_results['ACCURACY'] = ((df_results['Train_ACCURACY']+df_results['Test_ACCURACY'])/2)*100

(df_results.sort_values(by=['ACCURACY','RMSE'],ascending=False)
    .reset_index(drop=True)
    .style.background_gradient(cmap='Greens'))

Out[83]:

	MODEL	Train_ACCURACY	Test_ACCURACY	RMSE	ACCURACY
0	Bagging Regressor	0.988626	0.917846	2329.210617	95.323606
1	XGBoost Regressor	0.998524	0.905522	2497.817978	95.202273
2	Gradient Boosting Regressor	0.994251	0.907654	2469.477638	95.095218
3	Random Forest Regressor	0.987491	0.912008	2410.549291	94.974949
4	KNN Regressor	0.998525	0.878351	2834.324839	93.843770
5	CatBoost Regressor	0.992900	0.878022	2309.072539	93.546100
6	Linear Regression	0.970600	0.892048	2669.989321	93.132422
7	Ridge Regression	0.970327	0.888548	2712.925884	92.943788
8	Lasso Regression	0.967085	0.890146	2693.409526	92.861560
9	Histogram Based GBR	0.963689	0.889280	2704.010813	92.648452
10	LightGBM Regressor	0.960961	0.878022	2838.155898	91.949109
11	DT Regressor	0.921637	0.867942	2953.088237	89.478998
12	ElasticNet Regressor	0.856348	0.823511	3413.922608	83.992931

For Predicting car price *Bagging Regressor* provides high Accuracy of 95% with Root Mean square error of 2329.21.

Cross Validation for Bagging Regressor¶

In [84]:

from sklearn.model_selection import cross_val_score,KFold

kfold=KFold(n_splits=5)
scores=cross_val_score(model_BR,x,y,cv=kfold,scoring="neg_root_mean_squared_error")
print("scores        : ", list(scores))
print("mean          : ", scores.mean())
print("std deviation : ", scores.std())

scores        :  [-3388.330717460857, -3867.33524921731, -5878.525518966552, -2762.0386476277286, -2428.067541938594]
mean          :  -3664.8595350422083
std deviation :  1213.4678964953684

Conclusion :¶

For Predicting car price Bagging Regressor provides high Accuracy of 95% with Root Mean square error of 2329.21.

The varaiables which are significant in predicting car price are symboling, fueltype, carbody, drivewheel, enginelocation, wheelbase, carlength, carwidth, carheight, curbweight, enginetype, boreratio, stroke, compressionratio, horsepower, peakrpm, citympg, car_brand.

A model like this is very helpful in predicting car prices based on above varaiables.

In [85]:

import pickle
# Saving model to disk
pickle.dump(model_BR, open('CarPricePrediction_model.pkl','wb'))