Exercise 7

Capital Bikeshare data

Introduction

Capital Bikeshare dataset from Kaggle: data, data dictionary
Each observation represents the bikeshare rentals initiated during a given hour of a given day

In [1]:

%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz

In [2]:

# read the data and set "datetime" as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)

In [3]:

# "count" is a method, so it's best to rename that column
bikes.rename(columns={'count':'total'}, inplace=True)

In [4]:

# create "hour" as its own feature
bikes['hour'] = bikes.index.hour

In [5]:

bikes.head()

Out[5]:

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	total	hour
datetime
2011-01-01 00:00:00	1	0	0	1	9.84	14.395	81	0.0	3	13	16	0
2011-01-01 01:00:00	1	0	0	1	9.02	13.635	80	0.0	8	32	40	1
2011-01-01 02:00:00	1	0	0	1	9.02	13.635	80	0.0	5	27	32	2
2011-01-01 03:00:00	1	0	0	1	9.84	14.395	75	0.0	3	10	13	3
2011-01-01 04:00:00	1	0	0	1	9.84	14.395	75	0.0	0	1	1	4

In [6]:

bikes.tail()

Out[6]:

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	total	hour
datetime
2012-12-19 19:00:00	4	0	1	1	15.58	19.695	50	26.0027	7	329	336	19
2012-12-19 20:00:00	4	0	1	1	14.76	17.425	57	15.0013	10	231	241	20
2012-12-19 21:00:00	4	0	1	1	13.94	15.910	61	15.0013	4	164	168	21
2012-12-19 22:00:00	4	0	1	1	13.94	17.425	61	6.0032	12	117	129	22
2012-12-19 23:00:00	4	0	1	1	13.12	16.665	66	8.9981	4	84	88	23

hour ranges from 0 (midnight) through 23 (11pm)
workingday is either 0 (weekend or holiday) or 1 (non-holiday weekday)

Exercise 7.1

Run these two groupby statements and figure out what they tell you about the data.

In [7]:

# mean rentals for each value of "workingday"
bikes.groupby('workingday').total.mean()

Out[7]:

workingday
0    188.506621
1    193.011873
Name: total, dtype: float64

In [8]:

# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean()

Out[8]:

hour
0      55.138462
1      33.859031
2      22.899554
3      11.757506
4       6.407240
5      19.767699
6      76.259341
7     213.116484
8     362.769231
9     221.780220
10    175.092308
11    210.674725
12    256.508772
13    257.787281
14    243.442982
15    254.298246
16    316.372807
17    468.765351
18    430.859649
19    315.278509
20    228.517544
21    173.370614
22    133.576754
23     89.508772
Name: total, dtype: float64

Exercise 7.2

Run this plotting code, and make sure you understand the output. Then split this plot into two plots conditioned on "workingday". (In other words, one plot should display the hourly trend for "workingday=0", and the other should display the hourly trend for "workingday=1".)

In [9]:

# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean().plot()

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x1c8fbac82b0>

Plot for workingday == 0 and workingday == 1

In [10]:

# hourly rental trend for "workingday=0"
bikes[bikes.workingday==0].groupby('hour').total.mean().plot()

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x1c8fa8976a0>

In [11]:

# hourly rental trend for "workingday=1"
bikes[bikes.workingday==1].groupby('hour').total.mean().plot()

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x1c8fb9ca7f0>

In [12]:

# combine the two plots
bikes.groupby(['hour', 'workingday']).total.mean().unstack().plot()

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x1c8fbce9ac8>

Write about your findings

Exercise 7.3

Fit a linear regression model to the entire dataset, using "total" as the response and "hour" and "workingday" as the only features. Then, print the coefficients and interpret them. What are the limitations of linear regression in this instance?
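
One possible way to set this up is sketched below. The helper name fit_hour_workingday and the toy DataFrame are assumptions made for illustration; they are not part of the exercise. The toy "total" is exactly linear in both features, so the recovered coefficients are easy to check by eye. On the real data the call would be fit_hour_workingday(bikes).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_hour_workingday(df):
    """Fit total ~ hour + workingday; return the model and a labeled coefficient Series.

    Hypothetical helper, not part of the original notebook.
    """
    feature_cols = ['hour', 'workingday']
    X = df[feature_cols]
    y = df['total']
    linreg = LinearRegression()
    linreg.fit(X, y)
    coefs = pd.Series(linreg.coef_, index=feature_cols)
    return linreg, coefs


# Toy frame (made up): every hour appears once for each workingday value,
# and "total" is exactly 20 + 10*hour + 5*workingday.
toy = pd.DataFrame({
    'hour': np.tile(np.arange(24), 2),
    'workingday': np.repeat([0, 1], 24),
})
toy['total'] = 20 + 10 * toy['hour'] + 5 * toy['workingday']

toy_model, toy_coefs = fit_hour_workingday(toy)
print(toy_model.intercept_)
print(toy_coefs)
```

The "hour" coefficient is the change in predicted rentals for each one-hour step with "workingday" held fixed, and the "workingday" coefficient is the shift between the two day types at any given hour. The hourly means shown earlier peak at hour 8, dip, and peak again at hour 17, so a single slope for "hour" is unlikely to describe the real data well.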

In [ ]:

Exercise 7.4

Build a Decision Tree by hand to predict "total": iterate manually over the features "hour" and "workingday" to choose each split. The tree must have at least 6 end nodes (leaves).
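
A minimal sketch of one way to do the manual search follows. It assumes that "best split" means the feature and threshold that minimize the summed squared error of the two resulting groups, and it grows the tree greedily by always splitting the leaf with the largest error reduction. The function names (sse, best_split, grow_leaves) and the toy data are hypothetical, and this is a simplification rather than the procedure scikit-learn uses internally.

```python
import numpy as np
import pandas as pd


def sse(values):
    """Sum of squared errors around the mean of a Series."""
    return ((values - values.mean()) ** 2).sum()


def best_split(df, features, target='total'):
    """Try every feature and threshold; return (feature, threshold, sse) or None."""
    best = None
    for feature in features:
        values = np.sort(df[feature].unique())
        # candidate thresholds are midpoints between consecutive distinct values
        for threshold in (values[:-1] + values[1:]) / 2:
            left = df.loc[df[feature] < threshold, target]
            right = df.loc[df[feature] >= threshold, target]
            split_sse = sse(left) + sse(right)
            if best is None or split_sse < best[2]:
                best = (feature, threshold, split_sse)
    return best


def grow_leaves(df, features, n_leaves=6, target='total'):
    """Greedily split the leaf with the largest error reduction until n_leaves exist."""
    leaves = [df]
    while len(leaves) < n_leaves:
        best_gain, choice = 0.0, None
        for i, leaf in enumerate(leaves):
            split = best_split(leaf, features, target)
            if split is None:
                continue  # leaf has a single value for every feature
            gain = sse(leaf[target]) - split[2]
            if gain > best_gain:
                best_gain, choice = gain, (i, split[0], split[1])
        if choice is None:
            break  # no remaining split reduces the error
        i, feature, threshold = choice
        leaf = leaves.pop(i)
        leaves.append(leaf[leaf[feature] < threshold])
        leaves.append(leaf[leaf[feature] >= threshold])
    return leaves


# Toy data (made up): a daytime plateau plus extra rentals at commute hours
# on working days.
toy = pd.DataFrame({
    'hour': np.tile(np.arange(24), 2),
    'workingday': np.repeat([0, 1], 24),
})
toy['total'] = (50
                + 200 * toy['hour'].between(7, 19)
                + 150 * toy['workingday'] * toy['hour'].isin([8, 17, 18]))

leaves = grow_leaves(toy, ['hour', 'workingday'])
for leaf in leaves:
    # each leaf predicts the mean "total" of the rows it contains
    print(len(leaf), 'rows, predicted total =', leaf['total'].mean())
```

On the real data the call would be grow_leaves(bikes, ['hour', 'workingday']); each returned leaf is a subset of rows, and its prediction is the mean of "total" within that subset.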

In [ ]:

Exercise 7.5

Train a Decision Tree using scikit-learn. Comment on the performance of the models.
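
The comparison could be set up roughly as follows. This sketch assumes cross-validated RMSE as the performance measure and uses a made-up toy dataset with a non-linear hourly pattern; the helper name cv_rmse, the fold settings, and the random_state values are illustrative choices, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor


def cv_rmse(model, X, y, cv):
    """Mean cross-validated RMSE (lower is better). Hypothetical helper."""
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_mean_squared_error')
    # scores are negated MSE values, so flip the sign before the square root
    return np.mean(np.sqrt(-scores))


# Toy data (made up): 10 copies of every hour/workingday combination, with a
# daytime plateau and extra rentals at commute hours on working days.
toy = pd.DataFrame({
    'hour': np.tile(np.arange(24), 20),
    'workingday': np.tile(np.repeat([0, 1], 24), 10),
})
toy['total'] = (50
                + 200 * toy['hour'].between(7, 19)
                + 150 * toy['workingday'] * toy['hour'].isin([8, 17, 18]))

X = toy[['hour', 'workingday']]
y = toy['total']

# shuffle the rows into folds so each fold sees a mix of hours and day types
cv = KFold(n_splits=5, shuffle=True, random_state=1)

tree_rmse = cv_rmse(DecisionTreeRegressor(random_state=1), X, y, cv)
linreg_rmse = cv_rmse(LinearRegression(), X, y, cv)
print('tree RMSE:  ', tree_rmse)
print('linreg RMSE:', linreg_rmse)
```

On the real data, X and y would come from bikes instead of toy. An unrestricted tree can overfit noisy data, so it may be worth repeating the comparison for a few values of max_depth before commenting on which model performs better.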

In [ ]: