download dataset from here Car price prediction dataset
A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.
They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:
Which variables are significant in predicting the price of a car How well those variables describe the price of a car Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the America market.
We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.
Car_ID: Unique ID of each observation (Integer).
Symboling: Its assigned insurance risk rating, A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe (Categorical).
Car_Company: Name of Car company (Categorical). The following are the list of car companies available in the data set. 'alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'isuzu', 'jaguar', 'mazda', 'buick', 'mercury', 'mitsubishi', 'nissan', 'peugeot', 'plymouth', 'porsche', 'renault', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'
Fueltype: Car fuel type i.e. gas or diesel (Categorical).
Aspiration: Aspiration used in a car i.e. Std or turbo (Categorical). Mode of air intake for the internal combustion engine i.e. natural (standard) or turbocharger.
DoorNumber: Number of doors in a car i.e. 2 or 4 (Categorical).
CarBody: Body of a car sedan, hatchback, wagon, hardtop, convertible (Categorical).
DriverWheel: Type of driver wheel 4wd, fwd, rwd (Categorical).
4wd: Four-wheel drive, also called 4×4 or 4wd, refers to a two-axled vehicle drivetrain capable of providing torque to all of its wheels simultaneously.
fwd: Front wheel drive is called as fwd. A fwd can only supply power to the front wheels.
rwd: Rear wheel drive is called as rwd. A rwd can only supply power to the rear wheels.
EngineLocation: Location of car engine Front or Rear (Categorical). Out of 205 only 3 cars engine location is rear. All the 3 cars belong to Porsche company.
WheelBase: WheelBase is the distance between the front and rear wheels. Continuous from 86.6 to 120.9 (Float).
CarLength: Length of car continuous from 141.1 to 208.1 (Float).
CarWidth: Width of car continuous from 60.3 to 72.3 (Float).
CarHeight: Height of car continuous from 47.8 to 59.8 (Float).
CurbWeight: The weight of car continuous from 1488 to 4066 (Int).
EngineType: Type of engine dohc, ohcv, ohc, l, rotor, ohcf, dohcv (Categorical).
CylinderNumber: Number of cylinders continuous from 2 to 8 (Categorical). A cylinder is the power unit of an engine; it's the chamber where the gasoline is burned and turned into power.
EngineSize: The Size of engine continuous from 61 to 326 (Int).
FuelSystem: Fuel system is to inject a precise amount of automized and pressurized fuel in to each cylinder at a proper time. Types of fuel system are mpfi, 2bbl, mfi, 1bbl, spfi, 4bbl, idi, spdi (Categorical).
BoreRatio: Bore-Stroke Ratio is the ratio between the dimensions of the engine cylinder bore diameter to its piston stroke-length. Bore ratio continuous from 2.54 to 3.94 (Float).
Stroke: stroke or volume inside the engine continuous from 2.07 to 4.17 (Float).
CompressionRatio: Compression ratio, in an internal-combustion engine, degree to which the fuel mixture is compressed before ignition. Compression ratio continuous from 7.0 to 23.0 (Float).
HorsePower: Horsepower is a unit of power used to measure the forcefulness of a vehicle's engine. Horse power continuous from 48 to 288 (Int).
PeakRPM: RPM stands for revolutions per minute, and it's used as a measure of how fast any machine is operating at a given time. how many times each piston goes up and down in its cylinder. PeakRPM continuous from 4150 to 6600 (Int).
CityMPG: CityMPG refers to driving with occasional stopping and braking. Mileage in city continuous from 13 to 49 (Float).
HighwayMPG: It is based on continuous acceleration. Mileage on highway continuous from 16 to 54 (Float).
Price: Price of car continuous from 5118 to 45400 (Float).
#### Importing libraries
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#### Ignore all warnings
warnings.filterwarnings('ignore')
#### allow plots to appear directly in the notebook
%matplotlib inline
#### set the maximum number of dataFrame columns display to unlimited
pd.set_option("display.max_columns", None)
#### loading dataSet as pandas dataFrame
data = pd.read_csv('./Data/CarPrice.csv')
#### Let’s take a look at the top five rows using the DataFrame’s head() method
data.head()
car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
#### checking dimensionality of the Data
rows, cols = data.shape
print(f'The no.of rows = {rows}')
print(f'The no.of columns = {cols}')
The no.of rows = 205 The no.of columns = 26
'''
The info() method is useful to get a quick description of the data, in particular the total number of rows,
and each attribute’s type and number of non-null values
'''
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 205 entries, 0 to 204 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 car_ID 205 non-null int64 1 symboling 205 non-null int64 2 CarName 205 non-null object 3 fueltype 205 non-null object 4 aspiration 205 non-null object 5 doornumber 205 non-null object 6 carbody 205 non-null object 7 drivewheel 205 non-null object 8 enginelocation 205 non-null object 9 wheelbase 205 non-null float64 10 carlength 205 non-null float64 11 carwidth 205 non-null float64 12 carheight 205 non-null float64 13 curbweight 205 non-null int64 14 enginetype 205 non-null object 15 cylindernumber 205 non-null object 16 enginesize 205 non-null int64 17 fuelsystem 205 non-null object 18 boreratio 205 non-null float64 19 stroke 205 non-null float64 20 compressionratio 205 non-null float64 21 horsepower 205 non-null int64 22 peakrpm 205 non-null int64 23 citympg 205 non-null int64 24 highwaympg 205 non-null int64 25 price 205 non-null float64 dtypes: float64(8), int64(8), object(10) memory usage: 41.8+ KB
There are 205 instances in the dataset.Notice that there are no null values present in the dataset.
#### The describe() method shows a summary of the nummerical attributes.
data.describe().T.round(decimals = 2)
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
car_ID | 205.0 | 103.00 | 59.32 | 1.00 | 52.00 | 103.00 | 154.00 | 205.00 |
symboling | 205.0 | 0.83 | 1.25 | -2.00 | 0.00 | 1.00 | 2.00 | 3.00 |
wheelbase | 205.0 | 98.76 | 6.02 | 86.60 | 94.50 | 97.00 | 102.40 | 120.90 |
carlength | 205.0 | 174.05 | 12.34 | 141.10 | 166.30 | 173.20 | 183.10 | 208.10 |
carwidth | 205.0 | 65.91 | 2.15 | 60.30 | 64.10 | 65.50 | 66.90 | 72.30 |
carheight | 205.0 | 53.72 | 2.44 | 47.80 | 52.00 | 54.10 | 55.50 | 59.80 |
curbweight | 205.0 | 2555.57 | 520.68 | 1488.00 | 2145.00 | 2414.00 | 2935.00 | 4066.00 |
enginesize | 205.0 | 126.91 | 41.64 | 61.00 | 97.00 | 120.00 | 141.00 | 326.00 |
boreratio | 205.0 | 3.33 | 0.27 | 2.54 | 3.15 | 3.31 | 3.58 | 3.94 |
stroke | 205.0 | 3.26 | 0.31 | 2.07 | 3.11 | 3.29 | 3.41 | 4.17 |
compressionratio | 205.0 | 10.14 | 3.97 | 7.00 | 8.60 | 9.00 | 9.40 | 23.00 |
horsepower | 205.0 | 104.12 | 39.54 | 48.00 | 70.00 | 95.00 | 116.00 | 288.00 |
peakrpm | 205.0 | 5125.12 | 476.99 | 4150.00 | 4800.00 | 5200.00 | 5500.00 | 6600.00 |
citympg | 205.0 | 25.22 | 6.54 | 13.00 | 19.00 | 24.00 | 30.00 | 49.00 |
highwaympg | 205.0 | 30.75 | 6.89 | 16.00 | 25.00 | 30.00 | 34.00 | 54.00 |
price | 205.0 | 13276.71 | 7988.85 | 5118.00 | 7788.00 | 10295.00 | 16503.00 | 45400.00 |
From the above it is observed the price column ranges from 5118 to 45400 with a standard deviation of 7988.85.
#### The describe(include=['object']) method shows a summary of the object attributes.
data.describe(include = ['object']).T
count | unique | top | freq | |
---|---|---|---|---|
CarName | 205 | 147 | peugeot 504 | 6 |
fueltype | 205 | 2 | gas | 185 |
aspiration | 205 | 2 | std | 168 |
doornumber | 205 | 2 | four | 115 |
carbody | 205 | 5 | sedan | 96 |
drivewheel | 205 | 3 | fwd | 120 |
enginelocation | 205 | 2 | front | 202 |
enginetype | 205 | 7 | ohc | 148 |
cylindernumber | 205 | 7 | four | 159 |
fuelsystem | 205 | 8 | mpfi | 94 |
from the above it is observed that
#Checking for the unique values present in the categorical features.
for i in data.select_dtypes(include='object').columns:
print(i, ':\n', data[i].unique(), end=print())
CarName : ['alfa-romero giulia' 'alfa-romero stelvio' 'alfa-romero Quadrifoglio' 'audi 100 ls' 'audi 100ls' 'audi fox' 'audi 5000' 'audi 4000' 'audi 5000s (diesel)' 'bmw 320i' 'bmw x1' 'bmw x3' 'bmw z4' 'bmw x4' 'bmw x5' 'chevrolet impala' 'chevrolet monte carlo' 'chevrolet vega 2300' 'dodge rampage' 'dodge challenger se' 'dodge d200' 'dodge monaco (sw)' 'dodge colt hardtop' 'dodge colt (sw)' 'dodge coronet custom' 'dodge dart custom' 'dodge coronet custom (sw)' 'honda civic' 'honda civic cvcc' 'honda accord cvcc' 'honda accord lx' 'honda civic 1500 gl' 'honda accord' 'honda civic 1300' 'honda prelude' 'honda civic (auto)' 'isuzu MU-X' 'isuzu D-Max ' 'isuzu D-Max V-Cross' 'jaguar xj' 'jaguar xf' 'jaguar xk' 'maxda rx3' 'maxda glc deluxe' 'mazda rx2 coupe' 'mazda rx-4' 'mazda glc deluxe' 'mazda 626' 'mazda glc' 'mazda rx-7 gs' 'mazda glc 4' 'mazda glc custom l' 'mazda glc custom' 'buick electra 225 custom' 'buick century luxus (sw)' 'buick century' 'buick skyhawk' 'buick opel isuzu deluxe' 'buick skylark' 'buick century special' 'buick regal sport coupe (turbo)' 'mercury cougar' 'mitsubishi mirage' 'mitsubishi lancer' 'mitsubishi outlander' 'mitsubishi g4' 'mitsubishi mirage g4' 'mitsubishi montero' 'mitsubishi pajero' 'Nissan versa' 'nissan gt-r' 'nissan rogue' 'nissan latio' 'nissan titan' 'nissan leaf' 'nissan juke' 'nissan note' 'nissan clipper' 'nissan nv200' 'nissan dayz' 'nissan fuga' 'nissan otti' 'nissan teana' 'nissan kicks' 'peugeot 504' 'peugeot 304' 'peugeot 504 (sw)' 'peugeot 604sl' 'peugeot 505s turbo diesel' 'plymouth fury iii' 'plymouth cricket' 'plymouth satellite custom (sw)' 'plymouth fury gran sedan' 'plymouth valiant' 'plymouth duster' 'porsche macan' 'porcshce panamera' 'porsche cayenne' 'porsche boxter' 'renault 12tl' 'renault 5 gtl' 'saab 99e' 'saab 99le' 'saab 99gle' 'subaru' 'subaru dl' 'subaru brz' 'subaru baja' 'subaru r1' 'subaru r2' 'subaru trezia' 'subaru tribeca' 'toyota corona mark ii' 'toyota corona' 'toyota corolla 1200' 'toyota corona hardtop' 'toyota corolla 1600 (sw)' 'toyota carina' 'toyota mark ii' 'toyota corolla' 'toyota corolla liftback' 'toyota celica gt liftback' 'toyota corolla tercel' 'toyota corona liftback' 'toyota starlet' 'toyota tercel' 'toyota cressida' 'toyota celica gt' 'toyouta tercel' 'vokswagen rabbit' 'volkswagen 1131 deluxe sedan' 'volkswagen model 111' 'volkswagen type 3' 'volkswagen 411 (sw)' 'volkswagen super beetle' 'volkswagen dasher' 'vw dasher' 'vw rabbit' 'volkswagen rabbit' 'volkswagen rabbit custom' 'volvo 145e (sw)' 'volvo 144ea' 'volvo 244dl' 'volvo 245' 'volvo 264gl' 'volvo diesel' 'volvo 246'] fueltype : ['gas' 'diesel'] aspiration : ['std' 'turbo'] doornumber : ['two' 'four'] carbody : ['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop'] drivewheel : ['rwd' 'fwd' '4wd'] enginelocation : ['front' 'rear'] enginetype : ['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf' 'dohcv'] cylindernumber : ['four' 'six' 'five' 'three' 'twelve' 'two' 'eight'] fuelsystem : ['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' '4bbl' 'idi' 'spdi']
#### spliting car brand from carName column
data['car_brand']=data['CarName'].str.split(" ", n = 1, expand = True)[0]
#Priting all kinds of brands availabel in the dataset
data['car_brand'].unique()
array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'isuzu', 'jaguar', 'maxda', 'mazda', 'buick', 'mercury', 'mitsubishi', 'Nissan', 'nissan', 'peugeot', 'plymouth', 'porsche', 'porcshce', 'renault', 'saab', 'subaru', 'toyota', 'toyouta', 'vokswagen', 'volkswagen', 'vw', 'volvo'], dtype=object)
#### There seems to be some duplicate values in in the car_brand column.
data['car_brand'].replace({'maxda': 'mazda',
'nissan': 'Nissan',
'toyouta': 'toyota',
'porcshce': 'porsche',
'vw': 'volkswagen',
'vokswagen': 'volkswagen'
}, inplace=True)
data['car_brand'].unique()
array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'isuzu', 'jaguar', 'mazda', 'buick', 'mercury', 'mitsubishi', 'Nissan', 'peugeot', 'plymouth', 'porsche', 'renault', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)
#### Printing no of outliers in each column
outliers_columns_list = []
for i in data.select_dtypes(exclude='object').columns:
q1 = data[i].quantile(0.25)
q3 = data[i].quantile(0.75)
IQR = q3-q1
ub = q3+(1.5*IQR)
lb = q1-(1.5*IQR)
outliers_count=data[(data[i]>ub) | (data[i]<lb)][i].count()
if(outliers_count>0):
print(f"no of outliers in '{i}' is {outliers_count}")
outliers_columns_list.append(i)
no of outliers in 'wheelbase' is 3 no of outliers in 'carlength' is 1 no of outliers in 'carwidth' is 8 no of outliers in 'enginesize' is 10 no of outliers in 'stroke' is 20 no of outliers in 'compressionratio' is 28 no of outliers in 'horsepower' is 6 no of outliers in 'peakrpm' is 2 no of outliers in 'citympg' is 2 no of outliers in 'highwaympg' is 3 no of outliers in 'price' is 15
plt.figure(figsize=(15, 7))
for i in range(len(outliers_columns_list)-1):
plt.subplot(2, 5, i+1)
sns.boxplot(data[outliers_columns_list[i]], orient="v")
plt.tight_layout()
From the above boxplots it is observed that all the outliers are nearer to Upper and lower boundaries. So there is no need of outliers treatment
# visualizing our dependent variable for outliers and skewnwss
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
sns.boxplot(data["price"])
plt.title("Boxplot for outliers detection", fontweight="bold")
plt.subplot(1,2,2)
sns.distplot(data["price"])
plt.title("Distribution plot for skewness", fontweight="bold")
plt.show()
From the above plot :
#### Count plot for all categorical columns
object_columns_list = data.select_dtypes(include='object').columns[1:-1]
plt.figure(figsize = (15, 10))
for i in range(len(object_columns_list)):
plt.subplot(3, 3, i+1)
sns.countplot(data[object_columns_list[i]], order = data[object_columns_list[i]].value_counts().index)
plt.xlabel(object_columns_list[i], fontweight="bold")
From the above visuals:
#### violin plot for all categorical columns v/s price
object_columns_list = data.select_dtypes(include='object').columns[1:-1]
plt.figure(figsize = (15, 10))
for i in range(len(object_columns_list)):
plt.subplot(3, 3, i+1)
sns.violinplot(x=data[object_columns_list[i]], y=data['price'])
plt.xlabel(object_columns_list[i], fontweight="bold")
From the above visuals:
#### fitting the regression line between all numeric columns v/s price
num_columns_list = data.select_dtypes(exclude='object').columns[1:-1]
plt.figure(figsize = (15, 10))
for i in range(0,len(num_columns_list),5):
sns.pairplot(data, x_vars=num_columns_list[i:i+5], y_vars='price', kind='reg')
<Figure size 1080x720 with 0 Axes>
From the above visuals:
#### heatmap to visualize the pearson's correlation matrix between the numeric variables
plt.figure(figsize=(12,8))
sns.heatmap(data.corr(), annot=True, cmap="YlGnBu", mask=np.triu(data.corr(), k=1))
<matplotlib.axes._subplots.AxesSubplot at 0x19696959588>
From the above heatmap:
#### average price for each brand car
(data.groupby('car_brand')['price'].mean()
.sort_values(ascending=False).reset_index().round(2)
.style.background_gradient(cmap='Blues'))
car_brand | price | |
---|---|---|
0 | jaguar | 34600.000000 |
1 | buick | 33647.000000 |
2 | porsche | 31400.500000 |
3 | bmw | 26118.750000 |
4 | volvo | 18063.180000 |
5 | audi | 17859.170000 |
6 | mercury | 16503.000000 |
7 | alfa-romero | 15498.330000 |
8 | peugeot | 15489.090000 |
9 | saab | 15223.330000 |
10 | mazda | 10652.880000 |
11 | Nissan | 10415.670000 |
12 | volkswagen | 10077.500000 |
13 | toyota | 9885.810000 |
14 | renault | 9595.000000 |
15 | mitsubishi | 9239.770000 |
16 | isuzu | 8916.500000 |
17 | subaru | 8541.250000 |
18 | honda | 8184.690000 |
19 | plymouth | 7963.430000 |
20 | dodge | 7875.440000 |
21 | chevrolet | 6007.000000 |
From the above dataFrame:
#### Based on above visuals Dropping Few columns
data.drop(['car_ID', 'CarName', 'highwaympg', 'fuelsystem', 'aspiration', 'doornumber'], axis=1, inplace=True)
#### converting 'cylindernumber' column to numeric
data['cylindernumber'].replace({"two": 2,
"three":3,
"four": 4,
"five": 5,
"six": 6,
"eight": 8,
"twelve": 12
}, inplace=True)
# Define the columns to perform one hot encoding
cols_to_ohe=data.select_dtypes(include='object').columns
# perform one hot encoding on the following columns
import category_encoders as ce
ce_ohe=ce.OneHotEncoder(cols=cols_to_ohe)
data=ce_ohe.fit_transform(data)
data.head()
symboling | fueltype_1 | fueltype_2 | carbody_1 | carbody_2 | carbody_3 | carbody_4 | carbody_5 | drivewheel_1 | drivewheel_2 | drivewheel_3 | enginelocation_1 | enginelocation_2 | wheelbase | carlength | carwidth | carheight | curbweight | enginetype_1 | enginetype_2 | enginetype_3 | enginetype_4 | enginetype_5 | enginetype_6 | enginetype_7 | cylindernumber | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | price | car_brand_1 | car_brand_2 | car_brand_3 | car_brand_4 | car_brand_5 | car_brand_6 | car_brand_7 | car_brand_8 | car_brand_9 | car_brand_10 | car_brand_11 | car_brand_12 | car_brand_13 | car_brand_14 | car_brand_15 | car_brand_16 | car_brand_17 | car_brand_18 | car_brand_19 | car_brand_20 | car_brand_21 | car_brand_22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 13495.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 16500.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 16500.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 13950.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 17450.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
#### Creating dependent and independent variables
# independent variables
x = data.drop(columns="price")
# dependent variable
y = data["price"]
# Save numeric columns in a variable
cnames = ['wheelbase', 'carlength', 'carwidth', 'carheight',
'curbweight', 'enginesize', 'boreratio', 'stroke',
'compressionratio', 'horsepower', 'peakrpm', 'citympg',]
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
for col in cnames:
x[col]=sc.fit_transform(x[col].values.reshape(-1,1))
x.head(5)
symboling | fueltype_1 | fueltype_2 | carbody_1 | carbody_2 | carbody_3 | carbody_4 | carbody_5 | drivewheel_1 | drivewheel_2 | drivewheel_3 | enginelocation_1 | enginelocation_2 | wheelbase | carlength | carwidth | carheight | curbweight | enginetype_1 | enginetype_2 | enginetype_3 | enginetype_4 | enginetype_5 | enginetype_6 | enginetype_7 | cylindernumber | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | car_brand_1 | car_brand_2 | car_brand_3 | car_brand_4 | car_brand_5 | car_brand_6 | car_brand_7 | car_brand_8 | car_brand_9 | car_brand_10 | car_brand_11 | car_brand_12 | car_brand_13 | car_brand_14 | car_brand_15 | car_brand_16 | car_brand_17 | car_brand_18 | car_brand_19 | car_brand_20 | car_brand_21 | car_brand_22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | -1.690772 | -0.426521 | -0.844782 | -2.020417 | -0.014566 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.074449 | 0.519071 | -1.839377 | -0.288349 | 0.174483 | -0.262960 | -0.646553 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | -1.690772 | -0.426521 | -0.844782 | -2.020417 | -0.014566 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.074449 | 0.519071 | -1.839377 | -0.288349 | 0.174483 | -0.262960 | -0.646553 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | -0.708596 | -0.231513 | -0.190566 | -0.543527 | 0.514882 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 0.604046 | -2.404880 | 0.685946 | -0.288349 | 1.264536 | -0.262960 | -0.953012 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0.173698 | 0.207256 | 0.136542 | 0.235942 | -0.420797 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | -0.431076 | -0.517266 | 0.462183 | -0.035973 | -0.053668 | 0.787855 | -0.186865 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0.107110 | 0.207256 | 0.230001 | 0.235942 | 0.516807 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 0.218885 | -0.517266 | 0.462183 | -0.540725 | 0.275883 | 0.787855 | -1.106241 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
#### Splitting train and test dataset
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
from sklearn.linear_model import LinearRegression
#### instantiate and fit
model_LR = LinearRegression()
model_LR.fit(x_train, y_train)
LinearRegression()
LR_score_train = model_LR.score(x_train, y_train)
print("Train Accuracy :",LR_score_train)
LR_score_test = model_LR.score(x_test, y_test)
print("Test Accuracy :",LR_score_test)
Train Accuracy : 0.9706000959758017 Test Accuracy : 0.8920483520083704
#### predicting 'price' using lin_reg
predictions_LR = model_LR.predict(x_test)
#### calculating RMSE, MSE, and MAE
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_LR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_LR))
LR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_LR))
print('RMSE:', LR_RMSE)
MAE: 1602.070849236117 MSE: 7128842.973392517 RMSE: 2669.989320838665
final_results = []
dict_LR = {'MODEL':'Linear Regression',
'Train_ACCURACY':LR_score_train,
'Test_ACCURACY':LR_score_test,
'RMSE':LR_RMSE
}
final_results.append(dict_LR)
from sklearn import linear_model
model_LR_Ridge = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
model_LR_Ridge.fit(x_train,y_train)
model_LR_Ridge.alpha_
0.1
LR_Ridge_score_train = model_LR_Ridge.score(x_train,y_train)
print("Train Accuracy :",LR_Ridge_score_train)
LR_Ridge_score_test = model_LR_Ridge.score(x_test,y_test)
print("Test Accuracy :",LR_Ridge_score_test)
Train Accuracy : 0.9703273024088487 Test Accuracy : 0.8885484567286488
#### predicting 'price' using Ridge Regression
predictions_LR_Ridge = model_LR_Ridge.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_LR_Ridge))
print('MSE:', metrics.mean_squared_error(y_test, predictions_LR_Ridge))
LR_Ridge_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_LR_Ridge))
print('RMSE:', LR_Ridge_RMSE)
MAE: 1632.8053066911716 MSE: 7359966.854654508 RMSE: 2712.9258844750084
dict_LR_Ridge = {'MODEL':'Ridge Regression',
'Train_ACCURACY':LR_Ridge_score_train,
'Test_ACCURACY':LR_Ridge_score_test,
'RMSE':LR_Ridge_RMSE
}
final_results.append(dict_LR_Ridge)
from sklearn import linear_model
model_LR_Lasso = linear_model.LassoCV(alphas=np.logspace(-6, 6, 13))
model_LR_Lasso.fit(x_train,y_train)
LassoCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]))
LR_Lasso_score_train = model_LR_Lasso.score(x_train,y_train)
print("Train Accuracy :",LR_Lasso_score_train)
LR_Lasso_score_test = model_LR_Lasso.score(x_test,y_test)
print("Test Accuracy :",LR_Lasso_score_test)
Train Accuracy : 0.9670849805859022 Test Accuracy : 0.8901462183776644
#### predicting 'price' using Lasso Regression
predictions_LR_Lasso = model_LR_Lasso.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_LR_Lasso))
print('MSE:', metrics.mean_squared_error(y_test, predictions_LR_Lasso))
LR_Lasso_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_LR_Lasso))
print('RMSE:', LR_Lasso_RMSE)
MAE: 1704.319461370362 MSE: 7254454.876684296 RMSE: 2693.4095263595354
dict_LR_Lasso = {'MODEL':'Lasso Regression',
'Train_ACCURACY':LR_Lasso_score_train,
'Test_ACCURACY':LR_Lasso_score_test,
'RMSE':LR_Lasso_RMSE
}
final_results.append(dict_LR_Lasso)
from sklearn import neighbors as nb
model_KNN=nb.KNeighborsRegressor(n_neighbors=4,n_jobs=-1)
model_KNN.fit(x_train,y_train)
KNeighborsRegressor(n_jobs=-1, n_neighbors=4)
KNN_score_train = model_KNN.score(x_train,y_train)
print("Train Accuracy :",KNN_score_train)
KNN_score_test = model_KNN.score(x_test,y_test)
print("Test Accuracy :",KNN_score_test)
Train Accuracy : 0.9124677696162664 Test Accuracy : 0.8259113078399183
#Parameter tuning/Grid search
import sklearn.model_selection as model_selection
model_KNN=model_selection.GridSearchCV(model_KNN,param_grid={'n_neighbors':[3,5,7,9,11],'weights':['uniform','distance']})
model_KNN.fit(x_train,y_train)
GridSearchCV(estimator=KNeighborsRegressor(n_jobs=-1, n_neighbors=4), param_grid={'n_neighbors': [3, 5, 7, 9, 11], 'weights': ['uniform', 'distance']})
model_KNN.best_params_
{'n_neighbors': 3, 'weights': 'distance'}
KNN_score_train = model_KNN.score(x_train,y_train)
print("Train Accuracy :",KNN_score_train)
KNN_score_test = model_KNN.score(x_test,y_test)
print("Test Accuracy :",KNN_score_test)
Train Accuracy : 0.9985246605173365 Test Accuracy : 0.8783507393636467
#### predicting 'price' using lin_reg
predictions_KNN = model_KNN.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_KNN))
print('MSE:', metrics.mean_squared_error(y_test, predictions_KNN))
KNN_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_KNN))
print('RMSE:', KNN_RMSE)
MAE: 1832.775503909773 MSE: 8033397.294435981 RMSE: 2834.3248392581927
dict_KNN = {'MODEL':'KNN Regressor',
'Train_ACCURACY':KNN_score_train,
'Test_ACCURACY':KNN_score_test,
'RMSE':KNN_RMSE}
final_results.append(dict_KNN)
import sklearn.tree as tree
model_DT=tree.DecisionTreeRegressor(max_depth=3)
model_DT.fit(x_train,y_train)
DecisionTreeRegressor(max_depth=3)
DT_score_train = model_DT.score(x_train,y_train)
print("Train Accuracy :",DT_score_train)
DT_score_test = model_DT.score(x_test,y_test)
print("Test Accuracy :",DT_score_test)
Train Accuracy : 0.9216374578674762 Test Accuracy : 0.8679424987382512
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text
features = x.columns
plt.figure(figsize=(15,8))
plot_tree(model_DT, feature_names=features, filled = True)
[Text(418.5, 380.52, 'enginesize <= 1.326\nmse = 62413097.511\nsamples = 143\nvalue = 13299.679'), Text(209.25, 271.8, 'curbweight <= -0.022\nmse = 19869258.311\nsamples = 129\nvalue = 11141.381'), Text(104.625, 163.07999999999998, 'curbweight <= -0.508\nmse = 5353442.13\nsamples = 84\nvalue = 8564.036'), Text(52.3125, 54.360000000000014, 'mse = 1072432.475\nsamples = 49\nvalue = 7284.122'), Text(156.9375, 54.360000000000014, 'mse = 5842577.678\nsamples = 35\nvalue = 10355.914'), Text(313.875, 163.07999999999998, 'carwidth <= 1.258\nmse = 11419572.218\nsamples = 45\nvalue = 15952.426'), Text(261.5625, 54.360000000000014, 'mse = 9422357.639\nsamples = 39\nvalue = 15325.107'), Text(366.1875, 54.360000000000014, 'mse = 5216916.667\nsamples = 6\nvalue = 20030.0'), Text(627.75, 271.8, 'compressionratio <= 1.604\nmse = 16001889.551\nsamples = 14\nvalue = 33186.857'), Text(523.125, 163.07999999999998, 'carwidth <= 2.52\nmse = 7696192.8\nsamples = 10\nvalue = 35104.0'), Text(470.8125, 54.360000000000014, 'mse = 4317654.222\nsamples = 9\nvalue = 34453.333'), Text(575.4375, 54.360000000000014, 'mse = 0.0\nsamples = 1\nvalue = 40960.0'), Text(732.375, 163.07999999999998, 'curbweight <= 2.319\nmse = 4606060.0\nsamples = 4\nvalue = 28394.0'), Text(680.0625, 54.360000000000014, 'mse = 1573219.556\nsamples = 3\nvalue = 27325.333'), Text(784.6875, 54.360000000000014, 'mse = 0.0\nsamples = 1\nvalue = 31600.0')]
#### predicting 'price' using lin_reg
predictions_DT = model_DT.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_DT))
print('MSE:', metrics.mean_squared_error(y_test, predictions_DT))
DT_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_DT))
print('RMSE:', DT_RMSE)
MAE: 2076.496298188754 MSE: 8720730.136760743 RMSE: 2953.0882372121464
dict_DT = {'MODEL':'DT Regressor',
'Train_ACCURACY':DT_score_train,
'Test_ACCURACY':DT_score_test,
'RMSE':DT_RMSE}
final_results.append(dict_DT)
from sklearn.ensemble import RandomForestRegressor
list_RFR=[]
#Tune number of trees
for i in range(10,200,10):
model_RFR=RandomForestRegressor(n_estimators=i,random_state=10)
model_RFR.fit(x_train,y_train)
dict_RFR={}
dict_RFR["Number of trees"] = str(i)
dict_RFR["ACCURACY"]=model_RFR.score(x_test,y_test)
list_RFR.append(dict_RFR)
(pd.DataFrame(list_RFR)
.sort_values(by=['ACCURACY'],ascending=False)
.reset_index(drop=True)
.style.background_gradient(cmap='Blues'))
Number of trees | ACCURACY | |
---|---|---|
0 | 50 | 0.912008 |
1 | 20 | 0.911499 |
2 | 60 | 0.911267 |
3 | 40 | 0.911264 |
4 | 70 | 0.910807 |
5 | 90 | 0.910469 |
6 | 100 | 0.910244 |
7 | 10 | 0.910193 |
8 | 110 | 0.910033 |
9 | 80 | 0.910015 |
10 | 130 | 0.909085 |
11 | 120 | 0.908817 |
12 | 140 | 0.908712 |
13 | 150 | 0.908355 |
14 | 160 | 0.908344 |
15 | 190 | 0.908114 |
16 | 170 | 0.907999 |
17 | 180 | 0.907784 |
18 | 30 | 0.907609 |
model_RFR = RandomForestRegressor(n_estimators=50,random_state=10)
model_RFR.fit(x_train, y_train)
RandomForestRegressor(n_estimators=50, random_state=10)
RFR_score_train = model_RFR.score(x_train, y_train)
print("Train Accuracy :",RFR_score_train)
RFR_score_test = model_RFR.score(x_test,y_test)
print("Test Accuracy :",RFR_score_test)
Train Accuracy : 0.9874907945466967 Test Accuracy : 0.9120081880470541
#### predicting 'price' using lin_reg
predictions_RFR = model_RFR.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_RFR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_RFR))
RFR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_RFR))
print('RMSE:', RFR_RMSE)
MAE: 1603.7387358064516 MSE: 5810747.8859931175 RMSE: 2410.5492913427674
dict_RFR = {'MODEL':'Random Forest Regressor',
'Train_ACCURACY':RFR_score_train,
'Test_ACCURACY':RFR_score_test,
'RMSE':RFR_RMSE
}
final_results.append(dict_RFR)
from sklearn.ensemble import BaggingRegressor
list_BR=[]
#Tune number of trees
for i in range(10,200,10):
model_BR=BaggingRegressor(n_estimators=i,oob_score=True,random_state=200)
model_BR.fit(x_train,y_train)
dict_BR={}
dict_BR["Number of trees"] = str(i)
dict_BR["ACCURACY"]=model_BR.score(x_test,y_test)
list_BR.append(dict_BR)
(pd.DataFrame(list_BR)
.sort_values(by=['ACCURACY'],ascending=False)
.reset_index(drop=True)
.style.background_gradient(cmap='Blues'))
Number of trees | ACCURACY | |
---|---|---|
0 | 40 | 0.917846 |
1 | 30 | 0.914527 |
2 | 50 | 0.914173 |
3 | 20 | 0.913007 |
4 | 60 | 0.912395 |
5 | 80 | 0.912235 |
6 | 70 | 0.912139 |
7 | 90 | 0.911879 |
8 | 100 | 0.910822 |
9 | 110 | 0.910629 |
10 | 150 | 0.910575 |
11 | 140 | 0.910301 |
12 | 130 | 0.910254 |
13 | 120 | 0.909993 |
14 | 180 | 0.909279 |
15 | 160 | 0.909222 |
16 | 190 | 0.908950 |
17 | 170 | 0.908668 |
18 | 10 | 0.907015 |
model_BR=BaggingRegressor(n_estimators=40,oob_score=True,random_state=200)
model_BR.fit(x_train,y_train)
BaggingRegressor(n_estimators=40, oob_score=True, random_state=200)
BR_score_train = model_BR.score(x_train,y_train)
print("Train Accuracy :",BR_score_train)
BR_score_test = model_BR.score(x_test,y_test)
print("Test Accuracy :",BR_score_test)
Train Accuracy : 0.9886259453086359 Test Accuracy : 0.9178461822917738
#### predicting 'price' using lin_reg
predictions_BR = model_BR.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_BR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_BR))
BR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_BR))
print('RMSE:', BR_RMSE)
MAE: 1561.133266532258 MSE: 5425222.097138057 RMSE: 2329.2106167408
dict_BR = {'MODEL':'Bagging Regressor',
'Train_ACCURACY':BR_score_train,
'Test_ACCURACY':BR_score_test,
'RMSE':BR_RMSE
}
final_results.append(dict_BR)
from sklearn.linear_model import ElasticNet
model_ENR = ElasticNet(random_state=0)
model_ENR.fit(x_train,y_train)
ElasticNet(random_state=0)
ENR_score_train = model_ENR.score(x_train,y_train)
print("Train Accuracy :",ENR_score_train)
ENR_score_test = model_ENR.score(x_test,y_test)
print("Test Accuracy :",ENR_score_test)
Train Accuracy : 0.8563475978333301 Test Accuracy : 0.8235110288331771
#### predicting 'price' using ElasticNet Regressor
predictions_ENR = model_ENR.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_ENR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_ENR))
ENR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_ENR))
print('RMSE:', ENR_RMSE)
MAE: 2177.5442315710934 MSE: 11654867.57628227 RMSE: 3413.922608420154
dict_ENR = {'MODEL':'ElasticNet Regressor',
'Train_ACCURACY':ENR_score_train,
'Test_ACCURACY':ENR_score_test,
'RMSE':ENR_RMSE
}
final_results.append(dict_ENR)
from sklearn.ensemble import GradientBoostingRegressor
model_GBR = GradientBoostingRegressor()
model_GBR.fit(x_train,y_train)
GradientBoostingRegressor()
GBR_score_train = model_GBR.score(x_train,y_train)
print("Train Accuracy :",GBR_score_train)
GBR_score_test = model_GBR.score(x_test,y_test)
print("Test Accuracy :",GBR_score_test)
Train Accuracy : 0.9942508585108231 Test Accuracy : 0.907653503437977
#### predicting 'price' using ElasticNet Regressor
predictions_GBR = model_GBR.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_GBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_GBR))
GBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_GBR))
print('RMSE:', GBR_RMSE)
MAE: 1589.7051600620339 MSE: 6098319.806888355 RMSE: 2469.4776384669603
dict_GBR = {'MODEL':'Gradient Boosting Regressor',
'Train_ACCURACY':GBR_score_train,
'Test_ACCURACY':GBR_score_test,
'RMSE':GBR_RMSE
}
final_results.append(dict_GBR)
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
model_HGBR = HistGradientBoostingRegressor()
model_HGBR.fit(x_train,y_train)
HistGradientBoostingRegressor()
HGBR_score_train = model_HGBR.score(x_train,y_train)
print("Train Accuracy :",HGBR_score_train)
HGBR_score_test = model_HGBR.score(x_test,y_test)
print("Test Accuracy :",HGBR_score_test)
Train Accuracy : 0.9636892964789147 Test Accuracy : 0.889279745320581
#### predicting 'price' using Histogram-Based Gradient Boosting Regressor
predictions_HGBR = model_HGBR.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_HGBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_HGBR))
HGBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_HGBR))
print('RMSE:', HGBR_RMSE)
MAE: 1833.2560112069134 MSE: 7311674.4789742185 RMSE: 2704.010813398167
dict_HGBR = {'MODEL':'Histogram Based GBR',
'Train_ACCURACY':HGBR_score_train,
'Test_ACCURACY':HGBR_score_test,
'RMSE':HGBR_RMSE
}
final_results.append(dict_HGBR)
from xgboost import XGBRegressor
model_XGBR = XGBRegressor()
model_XGBR.fit(x_train,y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', validate_parameters=1, verbosity=None)
XGBR_score_train = model_XGBR.score(x_train,y_train)
print("Train Accuracy :",XGBR_score_train)
XGBR_score_test = model_XGBR.score(x_test,y_test)
print("Train Accuracy :",XGBR_score_test)
Train Accuracy : 0.998523693850276 Train Accuracy : 0.9055217582551045
#### predicting 'price' using XGBRegressor
predictions_XGBR = model_XGBR.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_XGBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_XGBR))
XGBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_XGBR))
print('RMSE:', XGBR_RMSE)
MAE: 1696.7096537928428 MSE: 6239094.653319247 RMSE: 2497.8179784202143
dict_XGBR = {'MODEL':'XGBoost Regressor',
'Train_ACCURACY':XGBR_score_train,
'Test_ACCURACY':XGBR_score_test,
'RMSE':XGBR_RMSE
}
final_results.append(dict_XGBR)
from lightgbm import LGBMRegressor
model_LGBR = LGBMRegressor()
model_LGBR.fit(x_train,y_train)
LGBMRegressor()
LGBR_score_train = model_LGBR.score(x_train,y_train)
print("Train Accuracy :",LGBR_score_train)
LGBR_score_test = model_LGBR.score(x_test,y_test)
print("Test Accuracy :",LGBR_score_test)
Train Accuracy : 0.9609605280364496 Test Accuracy : 0.8780216588918559
#### predicting 'price' using XGBRegressor
predictions_LGBR = model_LGBR.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_LGBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_LGBR))
LGBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_LGBR))
print('RMSE:', LGBR_RMSE)
MAE: 1876.6765463516463 MSE: 8055128.903472544 RMSE: 2838.155898373545
dict_LGBR = {'MODEL':'LightGBM Regressor',
'Train_ACCURACY':LGBR_score_train,
'Test_ACCURACY':LGBR_score_test,
'RMSE':LGBR_RMSE
}
final_results.append(dict_LGBR)
from catboost import CatBoostRegressor
model_CGBR = CatBoostRegressor(verbose=0, n_estimators=100)
model_CGBR.fit(x_train,y_train)
<catboost.core.CatBoostRegressor at 0x1969cd39d48>
CGBR_score_train = model_CGBR.score(x_train,y_train)
print("Train Accuracy :",CGBR_score_train)
CGBR_score_test = model_CGBR.score(x_test,y_test)
print("Test Accuracy :",CGBR_score_test)
Train Accuracy : 0.992900343375794 Test Accuracy : 0.9192606254631139
#### predicting 'price' using CatBoost
predictions_CGBR = model_CGBR.predict(x_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions_CGBR))
print('MSE:', metrics.mean_squared_error(y_test, predictions_CGBR))
CGBR_RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions_CGBR))
print('RMSE:', CGBR_RMSE)
MAE: 1509.6383901542376 MSE: 5331815.989395706 RMSE: 2309.0725387903485
dict_CGBR = {'MODEL':'CatBoost Regressor',
'Train_ACCURACY':CGBR_score_train,
'Test_ACCURACY':LGBR_score_test,
'RMSE':CGBR_RMSE
}
final_results.append(dict_CGBR)
df_results = pd.DataFrame(final_results)
df_results['ACCURACY'] = ((df_results['Train_ACCURACY']+df_results['Test_ACCURACY'])/2)*100
(df_results.sort_values(by=['ACCURACY','RMSE'],ascending=False)
.reset_index(drop=True)
.style.background_gradient(cmap='Greens'))
MODEL | Train_ACCURACY | Test_ACCURACY | RMSE | ACCURACY | |
---|---|---|---|---|---|
0 | Bagging Regressor | 0.988626 | 0.917846 | 2329.210617 | 95.323606 |
1 | XGBoost Regressor | 0.998524 | 0.905522 | 2497.817978 | 95.202273 |
2 | Gradient Boosting Regressor | 0.994251 | 0.907654 | 2469.477638 | 95.095218 |
3 | Random Forest Regressor | 0.987491 | 0.912008 | 2410.549291 | 94.974949 |
4 | KNN Regressor | 0.998525 | 0.878351 | 2834.324839 | 93.843770 |
5 | CatBoost Regressor | 0.992900 | 0.878022 | 2309.072539 | 93.546100 |
6 | Linear Regression | 0.970600 | 0.892048 | 2669.989321 | 93.132422 |
7 | Ridge Regression | 0.970327 | 0.888548 | 2712.925884 | 92.943788 |
8 | Lasso Regression | 0.967085 | 0.890146 | 2693.409526 | 92.861560 |
9 | Histogram Based GBR | 0.963689 | 0.889280 | 2704.010813 | 92.648452 |
10 | LightGBM Regressor | 0.960961 | 0.878022 | 2838.155898 | 91.949109 |
11 | DT Regressor | 0.921637 | 0.867942 | 2953.088237 | 89.478998 |
12 | ElasticNet Regressor | 0.856348 | 0.823511 | 3413.922608 | 83.992931 |
For Predicting car price *Bagging Regressor* provides high Accuracy of 95% with Root Mean square error of 2329.21.
from sklearn.model_selection import cross_val_score,KFold
kfold=KFold(n_splits=5)
scores=cross_val_score(model_BR,x,y,cv=kfold,scoring="neg_root_mean_squared_error")
print("scores : ", list(scores))
print("mean : ", scores.mean())
print("std deviation : ", scores.std())
scores : [-3388.330717460857, -3867.33524921731, -5878.525518966552, -2762.0386476277286, -2428.067541938594] mean : -3664.8595350422083 std deviation : 1213.4678964953684
For Predicting car price Bagging Regressor provides high Accuracy of 95% with Root Mean square error of 2329.21.
The varaiables which are significant in predicting car price are symboling, fueltype, carbody, drivewheel, enginelocation, wheelbase, carlength, carwidth, carheight, curbweight, enginetype, boreratio, stroke, compressionratio, horsepower, peakrpm, citympg, car_brand.
A model like this is very helpful in predicting car prices based on above varaiables.
import pickle
# Saving model to disk
pickle.dump(model_BR, open('CarPricePrediction_model.pkl','wb'))