Bike rentals have become very popular in many western countries. The concept is simple: a person rents a bicycle by the hour or by the day to travel around conveniently, without the hassle of traffic, parking, narrow roads and so on.
Many American cities have communal bike sharing stations where you can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day. Hadi Fanaee-T at the University of Porto compiled this data into a CSV file.
The aim of this project is to analyze the Washington, D.C. bike rentals dataset and ultimately predict the number of bike rentals received per hour.
The dataset used can be found at this link.
The columns are as described below:-
* instant - A unique sequential ID number for each row
* dteday - The date of the rentals
* season - The season in which the rentals occurred
* 1 - Winter
* 2 - Spring
* 3 - Summer
* 4 - Fall
* yr - The year the rentals occurred
* 0 - 2011
* 1 - 2012
* mnth - The month the rentals occurred
* hr - The hour the rentals occurred
* holiday - Whether or not the day was a holiday
* weekday - The day of the week (as a number, 0 to 6)
* workingday - Whether or not the day was a working day
* weathersit - The weather (as a categorical variable)
* 1 - Clear, Few clouds, Partly cloudy
* 2 - Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
* 3 - Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
* 4 - Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp - Normalized temperature, on a 0-1 scale (actual)
* atemp - Normalized adjusted ("feels like") temperature, on a 0-1 scale
* hum - Normalized humidity, on a 0-1 scale
* windspeed - Normalized wind speed, on a 0-1 scale
* casual - Number of casual riders (people who had not previously signed up with the bike sharing program)
* registered - The number of registered riders (people who had already signed up)
* cnt - The total number of bike rentals (casual + registered)
The target is the cnt column, which gives the total number of bike rentals received from casual and registered customers.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn versions
from IPython.display import Image
import pydotplus
import plotly.figure_factory as ff
df = pd.read_csv('/home/hp/Downloads/Bikeshare/hour.csv', parse_dates=['dteday'])
df.head(5)
instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
df.isna().sum()
instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64
There are no missing values in the dataset. The dataset looks pretty clean and ready for descriptive and predictive analysis.
For the analysis, each column is examined one by one to discover which columns influence the cnt column and which can serve as potential predictors for the model. Before starting, the cnt column's distribution is shown below to understand the range of values the target can take.
plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,8))
plt.subplot(1,2,1)
sns.distplot(df.cnt, bins=20)
plt.title('Distribution of Number of Bike rentals (Histogram)')
plt.xlabel('number of bike rentals (per hour)')
plt.yticks([])
plt.subplot(1,2,2)
sns.boxplot(df.cnt)
plt.title('Distribution of Number of Bike rentals (Boxplot)')
plt.xlabel('number of bike rentals (per hour)')
plt.yticks([])
The distribution leads to the following conclusions: the hourly rental counts are heavily right-skewed, with most hours seeing modest counts, and the boxplot flags a group of high-count hours as outliers.
It is essential to take a closer look at these outliers to decide whether they should be removed or manipulated.
high_cnt = df[df.cnt > 600]
high_cnt.head(10)
instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3019 | 3020 | 2011-05-10 | 2 | 0 | 5 | 17 | 0 | 2 | 1 | 1 | 0.64 | 0.6212 | 0.33 | 0.0000 | 79 | 532 | 611 |
3187 | 3188 | 2011-05-17 | 2 | 0 | 5 | 17 | 0 | 2 | 1 | 1 | 0.62 | 0.6061 | 0.65 | 0.4179 | 83 | 521 | 604 |
3379 | 3380 | 2011-05-25 | 2 | 0 | 5 | 17 | 0 | 3 | 1 | 1 | 0.74 | 0.6667 | 0.51 | 0.2239 | 77 | 524 | 601 |
3835 | 3836 | 2011-06-13 | 2 | 0 | 6 | 17 | 0 | 1 | 1 | 1 | 0.70 | 0.6364 | 0.39 | 0.3284 | 72 | 529 | 601 |
3883 | 3884 | 2011-06-15 | 2 | 0 | 6 | 17 | 0 | 3 | 1 | 1 | 0.74 | 0.6515 | 0.28 | 0.1045 | 83 | 555 | 638 |
3884 | 3885 | 2011-06-15 | 2 | 0 | 6 | 18 | 0 | 3 | 1 | 1 | 0.72 | 0.6515 | 0.32 | 0.1343 | 80 | 527 | 607 |
4171 | 4172 | 2011-06-27 | 3 | 0 | 6 | 17 | 0 | 1 | 1 | 1 | 0.74 | 0.6818 | 0.55 | 0.1343 | 90 | 514 | 604 |
5516 | 5517 | 2011-08-22 | 3 | 0 | 8 | 18 | 0 | 1 | 1 | 1 | 0.72 | 0.6515 | 0.28 | 0.2985 | 72 | 537 | 609 |
5536 | 5537 | 2011-08-23 | 3 | 0 | 8 | 14 | 0 | 2 | 1 | 1 | 0.72 | 0.6515 | 0.30 | 0.0896 | 149 | 502 | 651 |
5537 | 5538 | 2011-08-23 | 3 | 0 | 8 | 15 | 0 | 2 | 1 | 1 | 0.72 | 0.6515 | 0.34 | 0.2239 | 178 | 423 | 601 |
These outlier hours have a high number of registered rentals compared to casual rentals. Intuitively, high rental counts can be expected on holidays, on days with clear weather and around office closing times. For now, assuming this theory is correct, the data is left as it is.
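As a quick, optional check of this intuition (not part of the original write-up), the hour-of-day and weather breakdown of the outlier rows defined above can be inspected; if the theory holds, most of them should fall around the evening commute hours and under clear weather.
# Hour-of-day and weather breakdown of the high-count (cnt > 600) hours defined above
print(high_cnt.hr.value_counts().head())
print(high_cnt.weathersit.value_counts())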
The next step is to analyze every column in the dataset and examine its relationship with the number of bike rentals received. The season column identifies the season as a categorical variable:
1 - Winter
2 - Spring
3 - Summer
4 - Fall
Logically, the number of bikes rented depends on the season: Winter should see fewer rentals than Spring and Fall because of the climate extremes, while Summer is not as harsh and can therefore see more rentals than Winter.
The metric chosen for the comparison is the "percentage of bike rentals". For every category, the share of total bike rentals amassed by that category is calculated; for example, when comparing seasons, the percentage of total rentals each season receives is the differentiator.
total_rentals = df.cnt.sum()
registered_rentals = df.registered.sum()
casual_rentals = df.casual.sum()
These variables will be used throughout the analysis.
seasons = ['Winter','Spring','Summer','Fall']
grouped = df[['season','casual','registered','cnt']].groupby(by='season',)
season_cnts = grouped.sum().reset_index()
season_cnts['season'] = seasons
season_cnts['cnt_perc'] = season_cnts.cnt / total_rentals * 100
season_cnts['registered_perc'] = season_cnts.registered / total_rentals * 100
season_cnts['casual_perc'] = season_cnts.casual / total_rentals * 100
season_cnts
season | casual | registered | cnt | cnt_perc | registered_perc | casual_perc | |
---|---|---|---|---|---|---|---|
0 | Winter | 60622 | 410726 | 471348 | 14.315030 | 12.473916 | 1.841115 |
1 | Spring | 203522 | 715067 | 918589 | 27.897921 | 21.716876 | 6.181046 |
2 | Summer | 226091 | 835038 | 1061129 | 32.226919 | 25.360444 | 6.866476 |
3 | Fall | 129782 | 711831 | 841613 | 25.560129 | 21.618597 | 3.941532 |
plt.style.use('seaborn')
sns.set_style('white')
plt.figure(figsize=(12,7))
bplt = sns.barplot(x= season_cnts.season, y= season_cnts.cnt_perc)
plt.title('Percentage of Bike rentals received per Season', y= 1.01, fontdict={'size':18})
plt.ylabel('percentage of bike rentals')
plt.xlabel('seasons')
plt.yticks([])
for patch in bplt.patches:
bplt.annotate(
"{:.2f}%".format(patch.get_height()),
(patch.get_x() + patch.get_width()/2, patch.get_height()),
ha='center',
va='baseline'
)
As expected, Spring and Summer receive the largest shares of bike rentals. Fall receives relatively fewer rentals as it nears Winter, when the climate turns cold, and Winter sees the least of the lot. The number of bike rentals clearly depends on the season column. The plot below is a more granular representation of the share of bike rentals per season, split into casual and registered rentals.
layout = go.Layout(
title= {
        'text':'<b>Percentage of Bike rentals received per Season</b><br>'+
        'Distribution of Bike rentals (percentages) received per season for casual and registered customers',
},
yaxis= go.layout.YAxis(
title= 'Seasons'
),
xaxis= go.layout.XAxis(
title= 'percentage of Bike rentals',
range=[-8,26],
showticklabels=False
),
barmode= 'overlay',
bargap= 0.1
)
data = [
go.Bar(
y= season_cnts.season,
x= season_cnts.registered_perc,
orientation= 'h',
name= 'Registered rentals',
hovertemplate='%{x:.2f}%<extra></extra>',
marker=dict(color='skyblue')
),
go.Bar(
y= season_cnts.season,
x= [-1*x for x in season_cnts.casual_perc],
orientation= 'h',
name= 'Casual rentals',
text= season_cnts.casual_perc,
hovertemplate='%{text:.2f}%<extra></extra>',
marker=dict(color='#FC4040')
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
The following conclusions can be drawn from the granular information plotted above: registered rentals make up the bulk of rentals in every season, and the casual share grows noticeably in Spring and Summer while shrinking in Winter.
The mnth column, which gives the month, is also covered by this analysis, since each season spans a fixed group of months that does not change.
For this reason, the mnth column is expected to mimic the results obtained here.
months = ['Jan','Feb','Mar','Apr','May',
'Jun','Jul','Aug','Sept',
'Oct','Nov','Dec']
grouped = df[['mnth','casual','registered','cnt']].groupby(by= 'mnth')
mnth_cnts = grouped.sum().reset_index()
mnth_cnts.mnth = months
mnth_cnts['cnt_perc'] = mnth_cnts.cnt / total_rentals * 100
mnth_cnts
mnth | casual | registered | cnt | cnt_perc | |
---|---|---|---|---|---|
0 | Jan | 12042 | 122891 | 134933 | 4.097970 |
1 | Feb | 14963 | 136389 | 151352 | 4.596622 |
2 | Mar | 44444 | 184476 | 228920 | 6.952393 |
3 | Apr | 60802 | 208292 | 269094 | 8.172494 |
4 | May | 75285 | 256401 | 331686 | 10.073439 |
5 | Jun | 73906 | 272436 | 346342 | 10.518547 |
6 | Jul | 78157 | 266791 | 344948 | 10.476211 |
7 | Aug | 72039 | 279155 | 351194 | 10.665905 |
8 | Sept | 70323 | 275668 | 345991 | 10.507887 |
9 | Oct | 59760 | 262592 | 322352 | 9.789961 |
10 | Nov | 36603 | 218228 | 254831 | 7.739321 |
11 | Dec | 21693 | 189343 | 211036 | 6.409249 |
# plt.style.use('seaborn')
# sns.set_style('white')
# plt.figure(figsize=(12,7))
# bplt = sns.barplot(x= mnth_cnts.mnth, y= mnth_cnts.cnt)
# plt.title('Average Number of Bike rentals per Month', y= 1.01, fontdict={'size':18})
# plt.ylabel('average number of bike rentals')
# plt.xlabel('month')
# plt.yticks([])
colors = [
'#009999','#009999','#009999','#009999',
'#ff9933','#ff9933','#ff9933','#ff9933',
'#ff9933','#009999','#009999','#009999'
]
layout = go.Layout(
title= {
        'text': '<b>Percentage of Bike rentals received per Month</b>',
'x':0.5
},
yaxis = go.layout.YAxis(
title= 'percentage of bike rentals',
showticklabels= False
),
xaxis = go.layout.XAxis(
title= 'month'
)
)
data = [
go.Bar(
y= mnth_cnts.cnt_perc,
x= mnth_cnts.mnth,
hovertemplate='%{y:.2f}%<extra></extra>',
marker_color=colors
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
As suspected earlier, the mnth column leads to the same conclusions as the season column. The highlighted months, May to September, represent the peak share of bike rentals, corresponding to the Spring and Summer seasons. The Fall and Winter months, October through April, receive smaller shares (in that order).
The hr column gives the time of day. The dataset has per-hour counts of the number of bikes rented. hr is a discrete variable with values from 0 to 23. For ease of analysis, these values are grouped into 4 categories which are intuitively easier to understand (a vectorized alternative is sketched after the preview below).
def encode(row):
    # Map the hour of day to a coarser time-of-day label:
    # 1 - Morning (6-11), 2 - Afternoon (12-16), 3 - Evening (17-20), 4 - Night (21-5)
    if row >= 6 and row < 12:
        return 1
    elif row >= 12 and row < 17:
        return 2
    elif row >= 17 and row < 21:
        return 3
    else:
        return 4
df['time_label'] = df.hr.apply(encode)
df.head(5)
instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | time_label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 | 4 |
1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 | 4 |
2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 | 4 |
3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 | 4 |
4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 | 4 |
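A vectorized alternative (a sketch, not from the original notebook) produces the same labels with np.select; the result is kept in a standalone variable purely for comparison, so the dataframe and the downstream analysis are left untouched.
# 1 - Morning (6-11), 2 - Afternoon (12-16), 3 - Evening (17-20), 4 - Night (otherwise)
conditions = [df.hr.between(6, 11), df.hr.between(12, 16), df.hr.between(17, 20)]
time_label_alt = np.select(conditions, [1, 2, 3], default=4)
print((time_label_alt == df.time_label).all())  # True - both encodings agree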
time = ['Morning','Afternoon','Evening','Night']
grouped = df[['time_label','casual','registered','cnt']].groupby(by= 'time_label')
time_cnts = grouped.sum().reset_index()
time_cnts.time_label = time
time_cnts['cnt_perc'] = time_cnts.cnt / total_rentals * 100
time_cnts
time_label | casual | registered | cnt | cnt_perc | |
---|---|---|---|---|---|
0 | Morning | 126348 | 780971 | 907319 | 27.555647 |
1 | Afternoon | 265960 | 689922 | 955882 | 29.030525 |
2 | Evening | 160599 | 877372 | 1037971 | 31.523601 |
3 | Night | 67110 | 324397 | 391507 | 11.890227 |
layout = go.Layout(
title= {
'text':'<b>Percentage of Bike rentals received per Time of the Day</b><br>'+
        'Percentage of bike rentals received at different times of the day',
'x':0.5
},
yaxis= go.layout.YAxis(
        title= 'percentage of bike rentals',
showticklabels= False
),
xaxis= go.layout.XAxis(
title= 'time of the day'
)
)
data = [
go.Bar(
y= time_cnts.cnt_perc,
x= time_cnts.time_label,
hovertemplate= '%{y:.2f}%<extra></extra>',
marker_color= '#009999'
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
The plot shows the percentage of bike rentals received at different times of the day. The share increases from Morning to Evening, peaks in the Evening, and is lowest at Night. This gives a clue as to how these bikes are used: since rentals peak in the Evening, it can be said that the bikes are mostly rented for recreational purposes, while the high Morning and Afternoon shares suggest use by tourists travelling around and by office goers. The latter assumption is based on intuition rather than data.
Different seasons have different daytime and nighttime temperatures. In Summer, for example, the Mornings and Evenings are still pleasant compared to the Afternoon, when it is scorching hot. This should affect the number of bikes rented. The plot above gives an all-year estimate, but it makes sense to analyze estimates per season to check whether they are in line with the year-long trend.
For this, a similar plot (as above) is made for every season below.
def decode_season(row):
if row == 1:
return "Winter"
elif row == 2:
return "Spring"
elif row == 3:
return "Summer"
else:
return "Fall"
def decode_time(row):
if row == 1:
return "Morning"
elif row == 2:
return "Afternoon"
elif row == 3:
return "Evening"
else:
return "Night"
grouped = df[['season','time_label','casual','registered','cnt']].groupby(by= ['season','time_label'])
season_time_cnts = grouped.sum().reset_index()
season_time_cnts.season = season_time_cnts.season.apply(decode_season)
season_time_cnts.time_label = season_time_cnts.time_label.apply(decode_time)
season_time_cnts['cnt_perc'] = 0
for season in seasons:
total = season_time_cnts[season_time_cnts.season == season].cnt.sum()
season_time_cnts.loc[season_time_cnts.season == season,'cnt_perc'] = season_time_cnts[season_time_cnts.season == season].cnt / total * 100
season_time_cnts.set_index(['season','time_label'],inplace=True)
season_time_cnts
casual | registered | cnt | cnt_perc | ||
---|---|---|---|---|---|
season | time_label | ||||
Winter | Morning | 11885 | 123956 | 135841 | 28.819683 |
Afternoon | 30238 | 112690 | 142928 | 30.323243 | |
Evening | 13339 | 128599 | 141938 | 30.113207 | |
Night | 5160 | 45481 | 50641 | 10.743867 | |
Spring | Morning | 40978 | 204826 | 245804 | 26.758866 |
Afternoon | 85852 | 181538 | 267390 | 29.108774 | |
Evening | 55437 | 242214 | 297651 | 32.403066 | |
Night | 21255 | 86489 | 107744 | 11.729294 | |
Summer | Morning | 46428 | 239373 | 285801 | 26.933672 |
Afternoon | 87238 | 201673 | 288911 | 27.226756 | |
Evening | 63554 | 283226 | 346780 | 32.680287 | |
Night | 28871 | 110766 | 139637 | 13.159286 | |
Fall | Morning | 27057 | 212816 | 239873 | 28.501580 |
Afternoon | 62632 | 194021 | 256653 | 30.495370 | |
Evening | 28269 | 223333 | 251602 | 29.895213 | |
Night | 11824 | 81661 | 93485 | 11.107837 |
fig = make_subplots(
rows= 2,
cols= 2,
subplot_titles=[
        'Percentage of bike rentals per time of day (Winter)',
        'Percentage of bike rentals per time of day (Spring)',
        'Percentage of bike rentals per time of day (Summer)',
        'Percentage of bike rentals per time of day (Fall)'
]
)
pos = [(1,1),(1,2),(2,1),(2,2)]
for ind,season in enumerate(seasons):
fig.add_trace(
go.Bar(
x= season_time_cnts.loc[season].index,
y= season_time_cnts.loc[season].cnt_perc,
hovertemplate= '%{y:.2f}%',
name= season
),
row= pos[ind][0],
col= pos[ind][1]
)
fig.update_yaxes(title= 'percentage of bike rentals', showticklabels= False)
fig.update_xaxes(title= 'time of the day')
fig.update_layout(height= 800, width= 950)
fig.show()
The following conclusions can be drawn: the time-of-day pattern is broadly consistent across the seasons, with Night always receiving the smallest share; in Spring and Summer the Evening share is the highest, while in Winter and Fall the Afternoon share edges slightly ahead of the Evening.
The days of the week are identified by the weekday column. This column takes values from 0 to 6 which are mapped as follows :-
1 - Monday
2 - Tuesday
3 - Wednesday
4 - Thursday
5 - Friday
6 - Saturday
0 - Sunday
The next analysis is performed for the days of the week.
days = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
grouped = df[['weekday','registered','casual','cnt']].groupby(by='weekday')
day_cnts = grouped.mean().reset_index()
day_cnts.weekday = days
day_cnts
weekday | registered | casual | cnt | |
---|---|---|---|---|
0 | Sunday | 121.305356 | 56.163469 | 177.468825 |
1 | Monday | 155.191206 | 28.553449 | 183.744655 |
2 | Tuesday | 167.658377 | 23.580514 | 191.238891 |
3 | Wednesday | 167.971313 | 23.159192 | 191.130505 |
4 | Thursday | 171.564144 | 24.872521 | 196.436665 |
5 | Friday | 164.677121 | 31.458786 | 196.135907 |
6 | Saturday | 128.962978 | 61.246815 | 190.209793 |
data= [
go.Bar(
x= day_cnts.weekday,
y= day_cnts.cnt,
hovertemplate= '%{y:.2f}<extra></extra>',
marker=dict(color='#009999')
)
]
layout = go.Layout(
title= {
'text':'<b>Average number of Bike rentals per Day of the Week</b><br>'+
        'Average number of Bike rentals received per Day of the Week (all year)'
},
yaxis= go.layout.YAxis(
title= 'average number of bike rentals',
showticklabels= False
),
xaxis= go.layout.XAxis(
title= 'day of the week'
)
)
fig = go.Figure(data= data, layout= layout)
fig.show()
The days of the week all receive fairly similar average rentals. A slight upward trend can be seen from Monday through Thursday and Friday, which peak at around 196 rentals per hour, presumably when people are free after work. Sunday has the lowest average overall, and Monday, the first working day, receives the least among the weekdays.
The workingday column specifies whether the day is a working day or not. Since a day of the week can also be a holiday, this column is relevant: it holds 0 for Saturdays, Sundays and holidays (important dates). The holiday column specifies which days are holidays and does not account for weekends. The two columns give similar information, with workingday effectively subsuming holiday. Still, before any conclusions are reached, both columns are analyzed with respect to cnt, the number of bikes rented.
is_working = ['No','Yes']
grouped = df[['workingday','registered','casual','cnt']].groupby(by= 'workingday')
working_cnts = grouped.mean().reset_index()
working_cnts.workingday = is_working
working_cnts
workingday | registered | casual | cnt | |
---|---|---|---|---|
0 | No | 123.963910 | 57.441422 | 181.405332 |
1 | Yes | 167.646439 | 25.561315 | 193.207754 |
data= [
go.Bar(
x= working_cnts.workingday,
y= working_cnts.cnt,
hovertemplate= '%{y:.2f}<extra></extra>',
marker=dict(color='#009999'),
width= 0.6
)
]
layout = go.Layout(
title= {
'text':'<b>Average number of Bike rentals on Working and Non-working days</b>',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'average number of bike rentals',
showticklabels= False
),
xaxis= go.layout.XAxis(
title= 'work day?'
)
)
fig = go.Figure(data= data, layout= layout)
fig.show()
is_holiday = ['No','Yes']
grouped = df[['holiday','cnt']].groupby(by= 'holiday')
holiday_cnts = grouped.mean().reset_index()
holiday_cnts.holiday = is_holiday
holiday_cnts
holiday | cnt | |
---|---|---|
0 | No | 190.42858 |
1 | Yes | 156.87000 |
data= [
go.Bar(
x= holiday_cnts.holiday,
y= holiday_cnts.cnt,
hovertemplate= '%{y:.2f}<extra></extra>',
marker=dict(color='#009999'),
width= 0.6
)
]
layout = go.Layout(
title= {
'text':'<b>Average number of Bike rentals on Holidays and Non-Holidays</b>',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'average number of bike rentals',
showticklabels= False
),
xaxis= go.layout.XAxis(
title= 'is holiday?'
)
)
fig = go.Figure(data= data, layout= layout)
fig.show()
The two plots show that working days and non-holidays receive the highest average number of rentals. The second plot shows that the average on holidays, about 157 rentals, is lower than on non-holidays. Compared with the first plot, where non-working days average around 181 rentals, it can be concluded that most of the rentals on non-working days come from weekends rather than holidays.
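A quick breakdown (a sketch, not part of the original analysis) of the non-working days supports this: within workingday == 0, holidays can be separated from ordinary weekend days.
# Non-working days split into weekend days (holiday == 0) and holidays (holiday == 1)
non_working = df[df.workingday == 0]
print(non_working.groupby('holiday').cnt.agg(['mean', 'sum']))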
Analyzing more granular information for the workingday column, i.e. comparing the average number of casual and registered rentals on working and non-working days:
data= [
go.Bar(
x= working_cnts.workingday,
y= working_cnts.registered,
hovertemplate= 'Registered rentals: %{y:.2f}<extra></extra>',
marker=dict(color='#009999'),
name= 'Registered rentals'
),
go.Bar(
x= working_cnts.workingday,
y= working_cnts.casual,
hovertemplate= 'Casual rentals: %{y:.2f}<extra></extra>',
marker=dict(color='#ff9933'),
name= 'Casual rentals'
)
]
layout = go.Layout(
title= {
        'text':'<b>Average number of Bike rentals received on Working and Non-working days</b><br>'+
'Average number of bike rentals per rental type on Working and Non-working days (all year)',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'average number of bike rentals',
showticklabels= False
),
xaxis= go.layout.XAxis(
title= 'work day?'
),
barmode= 'group'
)
fig = go.Figure(data= data, layout= layout)
fig.show()
It is noticed that on non-working days the average number of registered rentals is lower than on working days, suggesting the service is used heavily by working people (office goers etc.). The average number of casual rentals, on the other hand, more than doubles on non-working days, suggesting that casual rentals are driven mainly by recreational and other leisure activities.
Weather has a huge impact on the number of rentals. A sunny, pleasant day is bound to receive more rentals than a rainy or snowy one. This makes it important to analyze the weathersit column, whose categories describe the weather of the day as follows:
* 1 - Clear, Few clouds, Partly cloudy
* 2 - Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
* 3 - Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
* 4 - Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
For ease of analysis, these categories are relabelled as follows:
* 1 - Clear
* 2 - Hazy
* 3 - Rainy
* 4 - Stormy
NOTE:- The labels provided are intuitive and are subject to one's perception.
weather = ['Clear','Hazy','Rainy','Stormy']
grouped = df[['weathersit','registered','casual','cnt']].groupby(by= 'weathersit')
weather_cnts = grouped.mean().reset_index()
weather_cnts.weathersit = weather
weather_cnts
weathersit | registered | casual | cnt | |
---|---|---|---|---|
0 | Clear | 164.323841 | 40.545431 | 204.869272 |
1 | Hazy | 145.570202 | 29.595290 | 175.165493 |
2 | Rainy | 95.523608 | 16.055673 | 111.579281 |
3 | Stormy | 71.666667 | 2.666667 | 74.333333 |
layout= go.Layout(
title= {
'text':'<b>Average number of bike rentals in different Weathers</b><br>'+
'Average number of bikes rented under different weather conditions',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'average number of bike rentals',
showticklabels= False
),
xaxis= go.layout.XAxis(
title= 'weather condition'
)
)
data= [
go.Bar(
x= weather_cnts.weathersit,
y= weather_cnts.cnt,
hovertemplate= '%{y:.2f}<extra></extra>',
marker= dict(color='#FC4040'),
opacity= 0.8
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
The plot above shows the average number of bikes rented under different weather conditions. A clear day has the maximum average rentals, and the average decreases as the weather worsens. A stormy day still receives about 74 rentals per hour on average, which is comparatively high and presents a compelling reason to analyze the rentals further.
data= [
go.Bar(
x= weather_cnts.weathersit,
y= weather_cnts.registered,
hovertemplate= 'Registered rentals: %{y:.2f}<extra></extra>',
marker=dict(color='#009999'),
name= 'Registered rentals'
),
go.Bar(
x= weather_cnts.weathersit,
y= weather_cnts.casual,
hovertemplate= 'Casual rentals: %{y:.2f}<extra></extra>',
marker=dict(color='#ff9933'),
name= 'Casual rentals'
)
]
layout = go.Layout(
title= {
'text':'<b>Average number of Bike rentals in different Weathers</b><br>'+
        'Average number of bikes rented under different weather conditions (all year)',
'x':0.5
},
yaxis= go.layout.YAxis(
title= 'average number of bike rentals',
showticklabels= False
),
xaxis= go.layout.XAxis(
title= 'weather condition'
),
barmode= 'group'
)
fig = go.Figure(data= data, layout= layout)
fig.show()
The plot leads to the following conclusions: registered rentals dominate under every weather condition, and casual rentals fall away much more sharply than registered rentals as the weather worsens, dropping from about 41 per hour on clear days to under 3 on stormy ones.
The temp, atemp, hum and windspeed columns are numerical variables that have already been normalized, and they describe the weather characteristics of each hour. Because of the large size of the hourly dataset, and to make the trends easier to visualize, these columns are grouped by date, with the weather columns aggregated by their mean and the rental count cnt summed; in other words, the hourly data is converted to daily data (an equivalent resample-based sketch follows the preview below).
A regression plot of each of these columns against cnt is drawn to visualize the trend and the relationship between the two. A LOWESS (Locally Weighted Scatterplot Smoothing) trendline draws a smooth line through the scatter plot to show the relationship between the variables. It is a non-parametric strategy that does not assume a distribution for the data. LOWESS curves work well when the data is noisy or sparse, or when ordinary least squares is not a good fit.
grouped = df[['dteday','temp','atemp','hum','windspeed','cnt']].groupby('dteday')
daily_df = grouped.agg({
'temp':'mean',
'atemp':'mean',
'hum':'mean',
'windspeed':'mean',
'cnt':'sum'
})
daily_df.head(5)
temp | atemp | hum | windspeed | cnt | |
---|---|---|---|---|---|
dteday | |||||
2011-01-01 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 985 |
2011-01-02 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 801 |
2011-01-03 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 1349 |
2011-01-04 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 1562 |
2011-01-05 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 1600 |
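Since dteday was parsed as a date, the same hourly-to-daily conversion can also be expressed with resample; the sketch below (not part of the original notebook) is equivalent to the groupby used above.
daily_alt = (df.set_index('dteday')
               .resample('D')
               .agg({'temp': 'mean', 'atemp': 'mean', 'hum': 'mean',
                     'windspeed': 'mean', 'cnt': 'sum'}))
print(daily_alt.head())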
columns = ['temp','atemp','hum','windspeed']
titles = [
'Number of Bike rentals vs Temperature of the Day',
'Number of Bike rentals vs Adjusted Temperature of the Day',
'Number of Bike rentals vs Humidity of the Day',
'Number of Bike rentals vs Wind Speed of the Day'
]
labels = [
'temperature',
'adjusted temperature',
'humidity',
'wind speed'
]
for col,title,label in zip(columns,titles,labels):
fig = px.scatter(
daily_df,
x= col,
y= 'cnt',
labels= {
col:label,
'cnt':'number of bikes rented'
},
trendline= 'lowess',
trendline_color_override='#ff9933',
title= {
'text':'<b>'+title+'</b>',
'x':0.5
},
color_discrete_sequence= ['#009999']
)
fig.update_xaxes(showticklabels= False)
fig.update_yaxes(showticklabels= False)
fig.show()
The following conclusions are drawn from the regression plots: the number of rentals rises with temperature and adjusted temperature, falls as humidity increases, and shows only a weak negative relationship with wind speed.
The final column left to analyze is the yr column. It takes the following values:
* 0 - 2011
* 1 - 2012
The analysis performed for this column checks the percentage of business (rentals) each year received.
years = ['2011','2012']
grouped = df[['yr','registered','casual','cnt']].groupby(by= 'yr')
year_cnts = grouped.sum().reset_index()
year_cnts.yr = years
year_cnts['cnt_perc'] = year_cnts.cnt / total_rentals * 100
year_cnts['registered_perc'] = year_cnts.registered / total_rentals * 100
year_cnts['casual_perc'] = year_cnts.casual / total_rentals * 100
year_cnts
yr | registered | casual | cnt | cnt_perc | registered_perc | casual_perc | |
---|---|---|---|---|---|---|---|
0 | 2011 | 995851 | 247252 | 1243103 | 37.753544 | 30.244400 | 7.509144 |
1 | 2012 | 1676811 | 372765 | 2049576 | 62.246456 | 50.925432 | 11.321025 |
grouped = df[['yr','registered','casual','cnt']].groupby(by= 'yr')
year_means = grouped.mean().reset_index()
year_means.yr = years
year_means
yr | registered | casual | cnt | |
---|---|---|---|---|
0 | 2011 | 115.193869 | 28.600578 | 143.794448 |
1 | 2012 | 191.986604 | 42.679757 | 234.666361 |
fig = make_subplots(
rows= 1,
cols= 2,
subplot_titles=[
'<b>Percentage of Bike rentals per year</b><br>'+
'Percentage of bikes rented in years 2011 and 2012',
'<b>Average number of Bike rented per day</b><br>'+
'Average number of bikes rented in a day for the years 2011 and 2012'
]
)
fig.add_trace(
go.Bar(
y= year_cnts.cnt_perc,
x= year_cnts.yr,
marker= dict(color='#009999'),
hovertemplate= '%{y:.2f}%<extra></extra>',
name= 'percentage'
),
row= 1,
col= 1
)
fig.add_trace(
go.Bar(
y= year_means.cnt,
x= year_means.yr,
marker= dict(color='#ff9933'),
hovertemplate= '%{y:.2f}<extra></extra>',
name= 'average'
),
row= 1,
col= 2
)
fig.update_layout(width=1000, height=500, showlegend= False)
fig.update_yaxes(title= 'percentage of bike rentals',row=1,col=1, showticklabels=False)
fig.update_yaxes(title= 'average number of bike rentals',row=1,col=2, showticklabels=False)
fig.update_xaxes(title= 'year', row=1, col=1, tickvals=[2011,2012])
fig.update_xaxes(title= 'year', row=1, col=2, tickvals=[2011,2012])
fig.show()
The year 2012 sees roughly a 65% increase in business compared to 2011, which translates to about 90 additional rentals per day on average. The next plot analyzes more granular information for the number of rentals, i.e. the registered and casual rentals.
fig = make_subplots(
rows= 2,
cols= 1,
subplot_titles=[
'<b>Percentage of Bike rentals per year</b><br>'+
'Percentage of bikes rented in years 2011 and 2012',
'<b>Average number of Bike rentals per Day</b><br>'+
'Average number of bikes rented per day for registered and casual customers'
]
)
fig.add_trace(
go.Bar(
y= year_cnts.registered_perc,
x= year_cnts.yr,
marker= dict(color='#009999'),
name= 'Registered rentals',
hovertemplate= '%{y:.2f}%<extra></extra>'
),
row= 1,
col=1
)
fig.add_trace(
go.Bar(
y= year_cnts.casual_perc,
x= year_cnts.yr,
marker= dict(color='#ff9933'),
name= 'Casual rentals',
hovertemplate= '%{y:.2f}%<extra></extra>'
),
row= 1,
col=1
)
fig.add_trace(
go.Bar(
y= year_means.registered,
x= year_means.yr,
marker= dict(color='#009999'),
name= 'Registered rentals',
hovertemplate= '%{y:.2f}<extra></extra>'
),
row= 2,
col=1
)
fig.add_trace(
go.Bar(
y= year_means.casual,
x= year_means.yr,
marker= dict(color='#ff9933'),
name= 'Casual rentals',
hovertemplate= '%{y:.2f}<extra></extra>'
),
row= 2,
col=1
)
fig.update_layout(width= 1000, height= 1000, showlegend=False)
fig.update_xaxes(title= 'year')
fig.update_yaxes(title= 'percentage of bike rentals', row=1, col=1, showticklabels=False)
fig.update_yaxes(title= 'average number of bike rentals', row=2, col=1,showticklabels=False)
As observed in the previous plot, there was roughly a 65% increase in the number of rentals in 2012 compared to 2011. The plot above shows more granular information and leads to the following conclusions: both registered and casual rentals grew from 2011 to 2012, and registered customers account for the large majority of rentals in both years.
The bullet charts below are a nice way to visualize the increase in the daily averages.
fig = go.Figure()
fig.add_trace(
go.Indicator(
mode= 'number+gauge+delta',
value= year_means[year_means.yr == '2012'].registered.values[0],
delta= {'reference':year_means[year_means.yr == '2011'].registered.values[0],'relative':True},
domain = {'x': [0.25, 1], 'y': [0.08, 0.35]},
title= {
'text':'<b>Daily average</b><br>'+
'Registered rentals (2012)'
},
gauge={
'shape':'bullet',
'axis': {'range':[None,200]},
'threshold':{
'line':{
'color':'black',
'width':2
},
'thickness':0.75,
'value':year_means[year_means.yr == '2011'].registered.values[0]
}
}
)
)
fig.add_trace(
go.Indicator(
mode= 'number+gauge+delta',
value= year_means[year_means.yr == '2012'].casual.values[0],
delta= {'reference':year_means[year_means.yr == '2011'].casual.values[0],'relative':True},
domain = {'x': [0.25, 1], 'y': [0.5, 0.77]},
title= {
'text':'<b>Daily average</b><br>'+
'Casual rentals (2012)'
},
gauge={
'shape':'bullet',
'axis': {'range':[None,200]},
'threshold':{
'line':{
'color':'black',
'width':2
},
'thickness':0.75,
'value':year_means[year_means.yr == '2011'].casual.values[0]
}
}
)
)
fig.update_layout(height = 400, title= {
'text':'<b>Increase in Daily average of Bike rentals from 2011 to 2012</b>',
'x':0.52,
'y':0.8
})
fig.show()
The detailed analysis of the relationships between the independent variables and the number of rentals leads to the following columns being retained as good predictors: season, yr, mnth, hr (along with the derived time_label), holiday, weekday, workingday, weathersit, temp, atemp, hum and windspeed.
The following columns are removed, for the reasons below:
* cnt - the target variable itself
* casual and registered - they sum to cnt and would leak the target
* instant - a sequential ID that carries no information
* dteday - the raw date, whose information is already captured by yr, mnth, hr and the other time columns
For building the model, a number of models are tried:
* DecisionTreeRegressor is chosen first.
* RandomForestRegressor provides the advantages of ensemble techniques, by fitting multiple trees on various sub-samples of the input data.
* GradientBoostingRegressor is another ensemble method that builds an additive model in a forward stage-wise fashion.
* LinearRegression is also fit to the data.
The dataset is split into train and test sets before building the models. All the models trained are compared on the following metrics (a short numpy sketch of the error metrics follows this list):
* r2_score - the coefficient of determination (R²)
* MAE - the mean absolute error
* RMSE - the root mean squared error
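For reference, the two error metrics can be written out directly with numpy. The helper below is only illustrative (y_true and y_pred are placeholder names for the actual and predicted rental counts); the sklearn functions are used in the actual evaluation.
def manual_metrics(y_true, y_pred):
    # Mean absolute error and root mean squared error, as used throughout the comparison
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return mae, rmse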
cols = [
'registered',
'casual',
'cnt',
'instant',
'dteday'
]
X_train, X_test, y_train, y_test = train_test_split(df.drop(cols,axis=1),df.cnt,random_state=1)
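train_test_split keeps 75% of the rows for training and 25% for testing by default; a quick look at the resulting shapes (not in the original notebook) confirms the split.
print(X_train.shape, X_test.shape)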
The first model trained is the DecisionTreeRegressor with only the random_state parameter set. The hyper-parameter tuning follows later.
dtree = DecisionTreeRegressor(random_state=1)
dtree.fit(X_train,y_train)
pred = dtree.predict(X_train)
print("Train score: ",np.round(r2_score(pred,y_train),6))
pred = dtree.predict(X_test)
print("Test score: ",np.round(r2_score(pred,y_test),4))
print("MAE: ",np.round(mean_absolute_error(pred,y_test),4))
print("RMSE: ",np.round(np.sqrt(mean_squared_error(pred,y_test)),4))
Train score:  0.999994
Test score:  0.8922
MAE:  36.1542
RMSE:  61.1785
The DecisionTreeRegressor gives an r2_score of about 89.2% on the test set, which indicates how well the model fits. The MAE and RMSE seem reasonable at around 36.15 and 61.18 respectively. The train set r2_score, however, is a whopping 99.99%, while the test set score is around 89.2%; this gap shows that the model overfits the data. Ensemble techniques are therefore tried next to reduce the overfitting.
NOTE: The code below visualizes the tree formed by the model. Due to its complexity, the code is not run here. You are free to try it out.
dot_data = StringIO()
export_graphviz(dtree,out_file=dot_data,filled=True,rounded=True,special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
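Since the same evaluation block is repeated for every model, a small helper (a sketch, not part of the original notebook) could reduce the duplication; it mirrors the exact calls used above.
def evaluate(model):
    # Print train/test R^2, MAE and RMSE using the same argument order as the cells above
    print("Train score: ", np.round(r2_score(model.predict(X_train), y_train), 4))
    pred = model.predict(X_test)
    print("Test score: ", np.round(r2_score(pred, y_test), 4))
    print("MAE: ", np.round(mean_absolute_error(pred, y_test), 4))
    print("RMSE: ", np.round(np.sqrt(mean_squared_error(pred, y_test)), 4))
evaluate(dtree)  # reproduces the numbers reported for the decision tree above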
The RandomForestRegressor is trained next, again with only the random_state parameter set.
rf_mod = RandomForestRegressor(random_state=1)
rf_mod.fit(X_train,y_train)
pred = rf_mod.predict(X_train)
print("Train score: ",np.round(r2_score(pred,y_train),4))
pred = rf_mod.predict(X_test)
print("Test score: ",np.round(r2_score(pred,y_test),4))
print("MAE: ",np.round(mean_absolute_error(pred,y_test),4))
print("RMSE: ",np.round(np.sqrt(mean_squared_error(pred,y_test)),4))
Train score:  0.9912
Test score:  0.9354
MAE:  27.335
RMSE:  45.3333
The RandomForestRegressor outperforms the DecisionTreeRegressor with a better fit: an r2_score of about 99.1% on the train set and about 93.5% on the test set. Compared to the previous model, the gap between the train and test r2_scores is smaller, indicating that the RandomForestRegressor alleviates the overfitting encountered earlier. The model also produces lower MAE and RMSE values, at about 27.34 and 45.33 respectively. It can still be argued that some degree of overfitting remains, given the very high train set r2_score.
The strength of a model lies in its hyper-parameters. The metrics achieved above are already strong, but to get the best out of a model, its parameters need to be tuned. A grid search is performed to find the best-fitting parameters.
param_grid= {
'min_samples_split': np.arange(15,30,1),
'max_depth': np.arange(10,25,1)
}
grid_search = GridSearchCV(
RandomForestRegressor(random_state=1),
param_grid= param_grid,
scoring='r2',
cv= 3,
n_jobs= 1
)
grid_search.fit(df.drop(cols,axis=1),df.cnt)
print(grid_search.best_score_)
print(grid_search.best_params_)
0.7371198022509945
{'max_depth': 17, 'min_samples_split': 15}
The GridSearchCV returns the above parameters as the best fit for the model; the score reported is the average r2_score after cross-validation on 3 folds. Using these parameters, the following metrics are achieved.
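As an aside, the grid search above is fit on the full dataset; a variant (a sketch, not from the original notebook) restricted to the training split would keep the held-out test set completely unseen during hyper-parameter selection.
grid_search_train = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid= param_grid,  # same grid as defined above
    scoring='r2',
    cv= 3,
    n_jobs= 1
)
grid_search_train.fit(X_train, y_train)  # search on the training split only
print(grid_search_train.best_params_)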
rf_mod = RandomForestRegressor(max_depth=17,min_samples_split=15,random_state=1)
rf_mod.fit(X_train,y_train)
pred = rf_mod.predict(X_train)
print("Train score: ",np.round(r2_score(pred,y_train),4))
pred = rf_mod.predict(X_test)
print("Test score: ",np.round(r2_score(pred,y_test),4))
print("MAE: ",np.round(mean_absolute_error(pred,y_test),4))
print("RMSE: ",np.round(np.sqrt(mean_squared_error(pred,y_test)),4))
Train score:  0.9601
Test score:  0.9276
MAE:  29.0423
RMSE:  47.458
After hyper-parameter tuning, the model's metrics have not really improved. The model achieves an r2_score of 96.01% on the train set and 92.76% on the test set, and the MAE and RMSE come out at 29.04 and 47.46 respectively. The grid search does address the earlier concern about overfitting, as the train score has decreased; nonetheless the model performs substantially well, and better than the DecisionTreeRegressor.
Another model tested below is the GradientBoostingRegressor, with the n_estimators parameter set to 500 (which specifies how many trees are trained) and the random_state parameter. It is one of the strongest standard models and often achieves good results, hence it is chosen here.
gb_mod = GradientBoostingRegressor(n_estimators=500,random_state=1)
gb_mod.fit(X_train,y_train)
pred = gb_mod.predict(X_train)
print("Train score: ",np.round(r2_score(pred,y_train),4))
pred = gb_mod.predict(X_test)
print("Test score: ",np.round(r2_score(pred,y_test),4))
print("MAE: ",np.round(mean_absolute_error(pred,y_test),4))
print("RMSE: ",np.round(np.sqrt(mean_squared_error(pred,y_test)),4))
Train score:  0.9124
Test score:  0.9037
MAE:  35.353
RMSE:  52.8143
The GradientBoostingRegressor does not outperform the RandomForestRegressor, but it does achieve a lower train set r2_score of about 91.24%, much closer to its test set score of 90.37%; the possible overfitting seen with the RandomForestRegressor is not present here. The model does, however, record higher MAE and RMSE values than the random forest, at 35.35 and 52.81 respectively.
Similar to the previous case, hyper-parameter tuning is performed via GridSearchCV as below.
param_grid= {
'max_depth': np.arange(15,25,1)
}
grid_search = GridSearchCV(
GradientBoostingRegressor(n_estimators=500, learning_rate=0.01),
param_grid= param_grid,
cv= 5,
n_jobs= 1
)
grid_search.fit(df.drop(cols,axis=1),df.cnt)
print(grid_search.best_score_)
print(grid_search.best_params_)
0.7324224465836198
{'max_depth': 15}
The GridSearchCV suggests the above hyper-parameters for the GradientBoostingRegressor.
gb_mod = GradientBoostingRegressor(max_depth=15,n_estimators=500,learning_rate=0.01, random_state=1)
gb_mod.fit(X_train,y_train)
pred = gb_mod.predict(X_train)
print("Train score: ",np.round(r2_score(pred,y_train),4))
pred = gb_mod.predict(X_test)
print("Test score: ",np.round(r2_score(pred,y_test),4))
print("MAE: ",np.round(mean_absolute_error(pred,y_test),4))
print("RMSE: ",np.round(np.sqrt(mean_squared_error(pred,y_test)),4))
Train score:  0.9992
Test score:  0.9199
MAE:  29.6883
RMSE:  50.7765
The hyper-parameter tuning changes the picture substantially. The train set r2_score is now 99.92%, whereas the test set score is about 92%. Compared to the GradientBoostingRegressor trained before the tuning, this model improves the MAE and RMSE to 29.69 and 50.78 respectively. One can argue, however, that with a near-100% train set score there are clear signs of overfitting.
After training the three models, it is clear that the RandomForestRegressor gives the best fit (r2_score) along with the lowest MAE and RMSE values, which roughly translates to its predictions having the lowest average error and spread about the fit. The plot below visualizes the residuals.
predictions = rf_mod.predict(X_test)
residuals = predictions - y_test
data= [
go.Scatter(
x= predictions,
y= residuals,
mode='markers',
marker= dict(color='#009999')
)
]
layout = go.Layout(
title={
'text':'<b>Residual plot</b><br>'+
'Error in prediction vs the predicted values',
'x':0.5
},
shapes= [dict(
type='line',
x0=-2,
y0=0.4,
y1=0.4,
x1=850,
line= dict(
color= '#ff9933',
width=3
)
)],
yaxis= go.layout.YAxis(
title='residual',
showticklabels= False
),
xaxis= go.layout.XAxis(
title='predicted values',
showticklabels= False
)
)
fig= go.Figure(data= data, layout= layout)
fig.show()
fig = ff.create_distplot([residuals],['residual'], bin_size=[30], show_rug= False)
fig.update_layout(
showlegend=False,
title={
'text':'<b>Residual distribution</b><br>'+
'Distribution of the errors in prediction',
'x':0.5
},
xaxis = go.layout.XAxis(range= [-200,200], tickvals=[-200,-100,0,100,200]),
yaxis= go.layout.YAxis(showticklabels=False)
)
fig.show()
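A short numerical summary of the residuals (not part of the original write-up) can back up the visual impression, by checking whether the mean error is near zero and the spread is in line with the RMSE reported earlier.
print("Mean residual:", np.round(residuals.mean(), 2))
print("Std. of residuals:", np.round(residuals.std(), 2))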
The residual scatter plot and histogram suggest that the variance of the residuals (errors) is fairly evenly distributed along the entire range of predictions, which suggests homoskedasticity. The RandomForestRegressor has proved to be the best among the models tried, but for reference a LinearRegression model is also fit to the data, to show why linear regression was not going to be a good fit.
Before the LinearRegression is fit, the categorical variables have to be converted to dummy variables; if this is not done, the model treats a label-encoded column as if it were numerical, which is incorrect. The following columns are converted: season, time_label, weathersit, mnth, hr and weekday.
df_dummied = pd.get_dummies(df,columns=['season','time_label','weathersit','mnth','hr','weekday'])
cols = [
'registered',
'casual',
'cnt',
'instant',
'dteday'
]
X_train, X_test, y_train, y_test = train_test_split(df_dummied.drop(cols,axis=1),df_dummied.cnt,random_state=1)
model= LinearRegression()
model.fit(X_train,y_train)
pred = model.predict(X_train)
print("Train score: ",np.round(r2_score(pred,y_train),4))
pred = model.predict(X_test)
print("Test score: ",np.round(r2_score(pred,y_test),4))
print("MAE: ",np.round(mean_absolute_error(pred,y_test),4))
print("RMSE: ",np.round(np.sqrt(mean_squared_error(pred,y_test)),4))
Train score:  0.5421
Test score:  0.5264
MAE:  76.8048
RMSE:  103.7956
As suspected, the LinearRegression model performs the worst, with an r2_score of 54.21% on the train set and 52.64% on the test set. Out of all the models it records the worst errors, with an MAE of 76.80 and an RMSE of 103.80; clearly LinearRegression is the worst fit of all.
The descriptive and predictive analysis results in the following conclusions: bike rentals depend strongly on the season, the time of day, whether the day is a working day, and the weather (condition, temperature, humidity and wind speed); demand grew substantially from 2011 to 2012; and among the models tried, a tuned RandomForestRegressor gives the best predictions of the hourly rental count, with an r2_score of about 93% on the test set.
This concludes the analysis.