%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz
# read the data and set "datetime" as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)
# "count" is a method, so it's best to rename that column
bikes.rename(columns={'count':'total'}, inplace=True)
# create "hour" as its own feature
bikes['hour'] = bikes.index.hour
bikes.head()
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | total | hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 |
2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 |
2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 |
2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 |
2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 |
bikes.tail()
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | total | hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2012-12-19 19:00:00 | 4 | 0 | 1 | 1 | 15.58 | 19.695 | 50 | 26.0027 | 7 | 329 | 336 | 19 |
2012-12-19 20:00:00 | 4 | 0 | 1 | 1 | 14.76 | 17.425 | 57 | 15.0013 | 10 | 231 | 241 | 20 |
2012-12-19 21:00:00 | 4 | 0 | 1 | 1 | 13.94 | 15.910 | 61 | 15.0013 | 4 | 164 | 168 | 21 |
2012-12-19 22:00:00 | 4 | 0 | 1 | 1 | 13.94 | 17.425 | 61 | 6.0032 | 12 | 117 | 129 | 22 |
2012-12-19 23:00:00 | 4 | 0 | 1 | 1 | 13.12 | 16.665 | 66 | 8.9981 | 4 | 84 | 88 | 23 |
Run these two groupby statements and figure out what they tell you about the data.
# mean rentals for each value of "workingday"
bikes.groupby('workingday').total.mean()
workingday
0    188.506621
1    193.011873
Name: total, dtype: float64
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean()
hour
0      55.138462
1      33.859031
2      22.899554
3      11.757506
4       6.407240
5      19.767699
6      76.259341
7     213.116484
8     362.769231
9     221.780220
10    175.092308
11    210.674725
12    256.508772
13    257.787281
14    243.442982
15    254.298246
16    316.372807
17    468.765351
18    430.859649
19    315.278509
20    228.517544
21    173.370614
22    133.576754
23     89.508772
Name: total, dtype: float64
Run this plotting code, and make sure you understand the output. Then split this plot into two plots conditioned on "workingday". (In other words, one plot should display the hourly trend for "workingday=0", and the other should display the hourly trend for "workingday=1".)
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean().plot()
Plot for workingday == 0 and workingday == 1
# hourly rental trend for "workingday=0"
bikes[bikes.workingday==0].groupby('hour').total.mean().plot()
# hourly rental trend for "workingday=1"
bikes[bikes.workingday==1].groupby('hour').total.mean().plot()
# combine the two plots
bikes.groupby(['hour', 'workingday']).total.mean().unstack().plot()
Write about your findings
Fit a linear regression model to the entire dataset, using "total" as the response and "hour" and "workingday" as the only features. Then, print the coefficients and interpret them. What are the limitations of linear regression in this instance?
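One way to approach the linear regression exercise is sketched below. To stay self-contained it does not download the bikeshare file; instead it builds a small synthetic frame named `demo` (an assumption made purely for illustration, with a made-up demand curve) that has the same three columns. With the real data, `demo` would simply be replaced by `bikes`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the bikeshare data (an assumption, NOT the real dataset):
# one row per (hour, workingday) pair with a made-up demand curve.
hours = np.tile(np.arange(24), 2)
workingday = np.repeat([0, 1], 24)
commute = np.exp(-(hours - 8) ** 2 / 4) + np.exp(-(hours - 17) ** 2 / 4)
leisure = np.exp(-(hours - 14) ** 2 / 30)
demo = pd.DataFrame({
    'hour': hours,
    'workingday': workingday,
    'total': 20 + 400 * np.where(workingday == 1, commute, leisure),
})

# use "hour" and "workingday" as the only features, "total" as the response
feature_cols = ['hour', 'workingday']
X = demo[feature_cols]
y = demo['total']

# fit on the entire dataset, as the exercise asks
linreg = LinearRegression()
linreg.fit(X, y)

# pair each coefficient with its feature name
coefficients = dict(zip(feature_cols, linreg.coef_))
print('intercept:', linreg.intercept_)
print(coefficients)
```

Each coefficient is the change in predicted rentals for a one-unit increase in that feature with the other held fixed. Because the model is a single straight line in "hour", it cannot represent a morning peak and an evening peak at the same time, and it cannot let the hourly shape differ between working and non-working days unless an interaction term is added.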
Create a decision tree to predict "total" by manually choosing splits on the features "hour" and "workingday". The tree must have at least 6 leaf (end) nodes.
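A hand-built tree can be written as nested if statements that route each row to a leaf, with each leaf predicting the mean of "total" for the rows that land in it. The sketch below is one possible shape, not a tuned solution: the function names, the leaf labels, the split thresholds (7, 10, 16, 21), and the tiny `demo` frame with made-up values are all assumptions for illustration.

```python
import pandas as pd

def assign_leaf(hour, workingday):
    """Route one observation to a leaf using hand-picked (illustrative) splits."""
    if hour < 7:
        return 'night'
    if hour >= 21:
        return 'late_evening'
    if workingday == 1:
        if hour < 10:
            return 'workday_morning_peak'
        if hour < 16:
            return 'workday_midday'
        return 'workday_evening_peak'
    if hour < 10:
        return 'nonworkday_morning'
    return 'nonworkday_daytime'

def fit_leaf_means(df):
    """Compute the prediction for each leaf: the mean "total" of its rows."""
    leaves = [assign_leaf(h, w) for h, w in zip(df['hour'], df['workingday'])]
    return df.assign(leaf=leaves).groupby('leaf')['total'].mean()

def predict_total(hour, workingday, leaf_means):
    """Predict by looking up the mean of the leaf the observation falls into."""
    return leaf_means[assign_leaf(hour, workingday)]

# Tiny made-up frame (hypothetical values) just to exercise the functions;
# with the real data, pass `bikes` instead.
demo = pd.DataFrame({
    'hour':       [3, 8, 8, 12, 17, 22, 9, 14],
    'workingday': [1, 1, 1, 1, 1, 0, 0, 0],
    'total':      [10, 300, 400, 200, 450, 100, 80, 250],
})
leaf_means = fit_leaf_means(demo)
print(leaf_means)
print(predict_total(8, 1, leaf_means))
```

This tree has seven leaves, which satisfies the six-leaf minimum. Choosing thresholds by eye from the hourly plots is the manual analogue of what a tree learner does when it searches for the split that most reduces error.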
Train a decision tree using scikit-learn. Comment on the performance of the models.
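A possible way to compare the models is cross-validated RMSE, using `cross_val_score` from `sklearn.model_selection`. The sketch below again runs on a synthetic stand-in frame (an assumption, so that it works without downloading the data); the helper `cv_rmse`, the value `max_depth=4`, the fold count, and the random seeds are illustrative choices rather than recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bikeshare data (an assumption, NOT the real dataset).
rng = np.random.default_rng(0)
n_rows = 480
hour = rng.integers(0, 24, n_rows)
workingday = rng.integers(0, 2, n_rows)
commute = np.exp(-(hour - 8) ** 2 / 4) + np.exp(-(hour - 17) ** 2 / 4)
leisure = np.exp(-(hour - 14) ** 2 / 30)
total = 20 + 400 * np.where(workingday == 1, commute, leisure) + rng.normal(0, 10, n_rows)
demo = pd.DataFrame({'hour': hour, 'workingday': workingday, 'total': total})

X = demo[['hour', 'workingday']]
y = demo['total']

def cv_rmse(model, X, y, folds=5):
    """Mean RMSE across folds (scikit-learn reports MSE as a negative score)."""
    scores = cross_val_score(model, X, y, cv=folds, scoring='neg_mean_squared_error')
    return np.sqrt(-scores).mean()

# max_depth limits tree size; 4 is an arbitrary illustrative value
treereg = DecisionTreeRegressor(max_depth=4, random_state=1)
tree_rmse = cv_rmse(treereg, X, y)
linreg_rmse = cv_rmse(LinearRegression(), X, y)

print('decision tree RMSE:', tree_rmse)
print('linear regression RMSE:', linreg_rmse)
```

On this synthetic curve the tree should come out ahead because the relationship with "hour" is strongly non-linear; whether the same holds on the real data, and at which depth, has to be checked by running the comparison on `bikes`.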