On-demand bike rentals have become a major trend in U.S. cities: within a mile or two of wherever a resident might be, and especially in city centers, they will likely find a self-service bike rental kiosk. There are steps a company or a city can take to ensure that bicycle demand is met where and when bikes are needed most. In this project we will attempt to predict this demand using linear regression and decision trees.
Our dataset focuses on Washington D.C., which collected detailed data on bike rentals from 2011 to 2012. The dataset was compiled in CSV format by Hadi Fanaee-T and can be downloaded from the University of California, Irvine's Machine Learning Repository.
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
# read in the data
rentals = pd.read_csv("bike_rental_hour.csv")
rentals.head(2)
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
rentals.tail(2)
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17377 | 17378 | 2012-12-31 | 1 | 1 | 12 | 22 | 0 | 1 | 1 | 1 | 0.26 | 0.2727 | 0.56 | 0.1343 | 13 | 48 | 61 |
| 17378 | 17379 | 2012-12-31 | 1 | 1 | 12 | 23 | 0 | 1 | 1 | 1 | 0.26 | 0.2727 | 0.65 | 0.1343 | 12 | 37 | 49 |
rentals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   instant     17379 non-null  int64
 1   dteday      17379 non-null  object
 2   season      17379 non-null  int64
 3   yr          17379 non-null  int64
 4   mnth        17379 non-null  int64
 5   hr          17379 non-null  int64
 6   holiday     17379 non-null  int64
 7   weekday     17379 non-null  int64
 8   workingday  17379 non-null  int64
 9   weathersit  17379 non-null  int64
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64
 15  registered  17379 non-null  int64
 16  cnt         17379 non-null  int64
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB
Our data consists of 17 columns and 17,379 rows. There are 16 numeric columns and 1 string (object) column, with no apparent missing values. Each row in the dataset is a one-hour snapshot, and the data spans two years, starting 1/1/2011 and ending 12/31/2012.
The columns follow these descriptions (per the dataset's documentation):

* instant: record index
* dteday: date
* season: season code (1-4)
* yr: year (0: 2011, 1: 2012)
* mnth: month (1-12)
* hr: hour (0-23)
* holiday: whether the day is a holiday
* weekday: day of the week
* workingday: 1 if the day is neither a weekend nor a holiday
* weathersit: weather situation code (1: clear, through 4: heavy rain/snow)
* temp: normalized temperature
* atemp: normalized "feels like" temperature
* hum: normalized humidity
* windspeed: normalized wind speed
* casual: count of casual (unregistered) user rentals
* registered: count of registered user rentals
* cnt: total rentals (casual + registered)
The primary target for our predictive models will be the `cnt` column, which is the total number of bike rentals in a given hour. The `casual` and `registered` columns will also be considered.
Let's do some clean-up on dates so that we can more easily slice and run time-series analysis on our data. The related columns are `dteday`, `yr`, `mnth`, and `hr`. Rather than converting each of these to datetime, we can build a single datetime column from `dteday` and `hr` so that we can easily access the underlying time data with datetime logic. Let's combine the `dteday` and `hr` columns now.
# convert the hr column to string, combine it with dteday in a format datetime understands
rentals['hr'] = rentals['hr'].astype(str)
rentals['datetime'] = rentals['dteday'] + ' ' + rentals['hr']+':00:00'
rentals['hr'] = rentals['hr'].astype(int) # return to int
# transform the new column to datetime
rentals['datetime'] = pd.to_datetime(rentals['datetime'])
rentals['datetime']
0        2011-01-01 00:00:00
1        2011-01-01 01:00:00
2        2011-01-01 02:00:00
3        2011-01-01 03:00:00
4        2011-01-01 04:00:00
                 ...
17374    2012-12-31 19:00:00
17375    2012-12-31 20:00:00
17376    2012-12-31 21:00:00
17377    2012-12-31 22:00:00
17378    2012-12-31 23:00:00
Name: datetime, Length: 17379, dtype: datetime64[ns]
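As an aside, the same column can be built without the string round-trip, by parsing the date once and adding the hour as a timedelta. A minimal sketch (the helper name is ours), assuming a frame with the `dteday` and `hr` columns described above:

```python
import pandas as pd

def build_datetime(df):
    # Parse the date string once and add the hour as a timedelta,
    # avoiding the str -> concatenate -> parse round-trip
    return pd.to_datetime(df['dteday']) + pd.to_timedelta(df['hr'], unit='h')

# e.g. rentals['datetime'] = build_datetime(rentals)
```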
Let's check the completeness of our dataset. There should be two years of hour by hour data.
two_yr_hr = 2*365*24
length_rentals = len(rentals)
print('dataset:', length_rentals)
print('expected:', two_yr_hr)
print('difference:', two_yr_hr - length_rentals)
print('missing days:', (two_yr_hr-length_rentals)/24)
dataset: 17379
expected: 17520
difference: 141
missing days: 5.875
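If we wanted to see exactly which hours are absent, we could compare our timestamps against a full hourly index. A sketch (the function name is ours), assuming the `datetime` column built above:

```python
import pandas as pd

def missing_hours(present, start, end):
    # Every hour the dataset *should* contain, minus the hours it does
    expected = pd.date_range(start, end, freq='h')
    return expected.difference(pd.DatetimeIndex(present))

# e.g. missing_hours(rentals['datetime'], '2011-01-01 00:00', '2012-12-31 23:00')
```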
We're missing just under six days of data. We will accept this discrepancy, as it is not pervasive enough to have a strong adverse effect on predictions.
Let's complete some basic analysis to understand our time series; we're specifically interested in our primary target, `cnt`.
plt.figure(figsize = (12, 4))
plt.plot(rentals['datetime'], rentals['cnt'])
plt.title('Count of Rentals over Time')
plt.show()
A first impression of the distribution is that there are more rentals on average in 2012 than in 2011. Some research shows that Washington expanded its bikeshare program at the end of 2011, which makes sense: with more locations and bikes in service, there are more potential users for the program. This may introduce an undesirable influence in our predictions; however, we may be able to overcome it by min-max scaling the count values within their respective years.
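The per-year min-max scaling mentioned above could look something like this sketch (the helper name is ours; it assumes the `yr` and `cnt` columns):

```python
import pandas as pd

def minmax_by_year(df, col='cnt'):
    # Rescale col to [0, 1] separately within each year, so the larger
    # 2012 fleet does not dwarf the 2011 counts
    g = df.groupby('yr')[col]
    return (df[col] - g.transform('min')) / (g.transform('max') - g.transform('min'))

# e.g. rentals['cnt_scaled'] = minmax_by_year(rentals)
```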
Further, there are indications that seasonality plays a role, but at the hour-day-year granularity it is difficult to see exactly what is going on. Let's generate some bar charts bucketed by month and day of week.
plt.bar(x = rentals['mnth'], height = rentals['cnt'])
plt.title('Bikeshare Usage by Month')
plt.show()
plt.bar(x = rentals['weekday'], height = rentals['cnt'])
plt.title('Bikeshare Usage by Day of Week')
plt.show()
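Note that calling `plt.bar` with one bar per row overplots bars at each x position, so the visible height reflects the tallest hourly value rather than total usage. An aggregated sketch (assuming the same columns) sums within each month first:

```python
import pandas as pd
import matplotlib.pyplot as plt

def monthly_totals(df):
    # Sum the hourly counts within each month so each bar shows
    # total usage instead of the tallest overplotted hourly bar
    return df.groupby('mnth')['cnt'].sum()

# e.g.
# monthly_totals(rentals).plot(kind='bar', title='Total Rentals by Month')
# plt.show()
```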
Our distributions indicate that there is a seasonal effect on bikeshare usage. The spring, summer, and fall months have relatively stable usage, but the winter months, November through February, show a drastic drop-off.
The day of week also has an impact on usage: weekdays have high, stable usage, while the weekend days, Saturday and Sunday, see less use.
Let's recreate the above bar charts, further split by `registered` and `casual`. This will give us an indication of whether casual or registered users are more sensitive to seasonality.
fig, ax = plt.subplots()
ax.bar(rentals['mnth'], rentals['cnt'], label = 'total')
ax.bar(rentals['mnth'], rentals['registered'], label = 'registered')
ax.bar(rentals['mnth'], rentals['casual'], label = 'casual')
plt.ylim(0,1000)
plt.title('Bikeshare Usage by Month and Usergroup')
plt.legend(loc = 'best')
plt.show()
plt.bar(x = rentals['mnth'], height = rentals['cnt'], label = 'total')
plt.bar(x = rentals['mnth'], height = rentals['registered'], label = 'registered')
plt.bar(x = rentals['mnth'], height = rentals['casual'],label = 'casual')
plt.title('Bikeshare Usage by Month and Usergroup')
plt.legend(loc = 'best')
plt.show()
plt.bar(x = rentals['weekday'], height = rentals['cnt'], label = 'total')
plt.bar(x = rentals['weekday'], height = rentals['registered'], label = 'registered')
plt.bar(x = rentals['weekday'], height = rentals['casual'],label = 'casual')
plt.title('Bikeshare Usage by Day of Week and Usergroup')
plt.legend()
plt.show()
plt.bar(x = rentals['hr'], height = rentals['cnt'], label = 'total')
plt.bar(x = rentals['hr'], height = rentals['registered'], label = 'registered')
plt.bar(x = rentals['hr'], height = rentals['casual'],label = 'casual')
plt.title('Bikeshare Usage by Time of Day and Usergroup')
plt.legend()
plt.show()
The behavior of the bike rental user groups differs when cut along time of day and seasonality. Because of this, we may get more accurate results by predicting the usage of these cohorts separately rather than only aiming at total usage.
`cnt` is supposed to be the sum of `registered` and `casual` users, but looking at the bar charts there may be an underlying issue with the `cnt` column: `registered` plus `casual` appears to be less than the total. Let's validate that the `cnt` column really is the sum of these two values.
Our validation process will be to add the `registered` and `casual` columns into a new column and subtract the result from `cnt`. If we get any value other than zero, we know there is a discrepancy.
rentals['cnt_validation'] = (rentals['registered'] + rentals['casual']) - rentals['cnt']
rentals['cnt_validation'].value_counts()
0    17379
Name: cnt_validation, dtype: int64
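The same check can also be expressed as a single boolean (the helper name is ours):

```python
import pandas as pd

def cnt_is_consistent(df):
    # True only if cnt equals registered + casual on every row
    return (df['registered'] + df['casual']).equals(df['cnt'])

# e.g. cnt_is_consistent(rentals)
```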
rentals
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | datetime | cnt_validation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0000 | 3 | 13 | 16 | 2011-01-01 00:00:00 | 0 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 8 | 32 | 40 | 2011-01-01 01:00:00 | 0 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 5 | 27 | 32 | 2011-01-01 02:00:00 | 0 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 3 | 10 | 13 | 2011-01-01 03:00:00 | 0 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 0 | 1 | 1 | 2011-01-01 04:00:00 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17374 | 17375 | 2012-12-31 | 1 | 1 | 12 | 19 | 0 | 1 | 1 | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 | 11 | 108 | 119 | 2012-12-31 19:00:00 | 0 |
| 17375 | 17376 | 2012-12-31 | 1 | 1 | 12 | 20 | 0 | 1 | 1 | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 | 8 | 81 | 89 | 2012-12-31 20:00:00 | 0 |
| 17376 | 17377 | 2012-12-31 | 1 | 1 | 12 | 21 | 0 | 1 | 1 | 1 | 0.26 | 0.2576 | 0.60 | 0.1642 | 7 | 83 | 90 | 2012-12-31 21:00:00 | 0 |
| 17377 | 17378 | 2012-12-31 | 1 | 1 | 12 | 22 | 0 | 1 | 1 | 1 | 0.26 | 0.2727 | 0.56 | 0.1343 | 13 | 48 | 61 | 2012-12-31 22:00:00 | 0 |
| 17378 | 17379 | 2012-12-31 | 1 | 1 | 12 | 23 | 0 | 1 | 1 | 1 | 0.26 | 0.2727 | 0.65 | 0.1343 | 12 | 37 | 49 | 2012-12-31 23:00:00 | 0 |
17379 rows × 19 columns
We have confirmed that the bar chart discrepancy is an illusion: `cnt` always equals `registered` plus `casual`.
We will move forward with the plan to generate separate predictions for `cnt`, `registered`, and `casual`. We will need to repeat the feature selection process for each target.
One feature worth engineering is sunrise/sunset (is the sun out). Sunrise and sunset occur at different hours of the day depending on the season, which causes inconsistencies when considering the hour of day on its own. Sunrise and sunset times for Washington D.C. can be obtained from the U.S. Naval Observatory: aa.usno.navy.mil/data/docs/RS_OneYear.php
washington_dc_latlong = [38.895,-77.0366]
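As a rough stand-in for the USNO lookup tables, day length can be approximated from latitude alone with the standard sunrise equation. This sketch ignores atmospheric refraction and longitude, so treat it as an approximation only:

```python
import math

def daylight_hours(lat_deg, day_of_year):
    # Approximate solar declination in degrees for the given day of year
    decl = -23.44 * math.cos(2 * math.pi / 365 * (day_of_year + 10))
    # Cosine of the hour angle at sunrise/sunset; clamp for polar edge cases
    x = -math.tan(math.radians(lat_deg)) * math.tan(math.radians(decl))
    x = max(-1.0, min(1.0, x))
    # Twice the hour angle (degrees) divided by 15 deg/hour gives day length
    return 2 * math.degrees(math.acos(x)) / 15

# e.g. daylight_hours(washington_dc_latlong[0], 172)  # near the June solstice
```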
One basic way to identify good feature candidates for our machine learning algorithms is to look at correlations with the target metric. Too many features can result in an overfit model, so fewer, well-correlated features are usually desirable.
cnt_feature_correlations = abs(rentals.corr(numeric_only = True)['cnt'].drop(['cnt', 'instant', 'registered', 'casual', 'cnt_validation'])).sort_values(ascending = False)
cnt_feature_correlations
cas_feature_correlations = abs(rentals.corr(numeric_only = True)['casual'].drop(['cnt', 'instant', 'registered', 'casual', 'cnt_validation'])).sort_values(ascending = False)
cas_feature_correlations
reg_feature_correlations = abs(rentals.corr(numeric_only = True)['registered'].drop(['cnt', 'instant', 'registered', 'casual', 'cnt_validation'])).sort_values(ascending = False)
reg_feature_correlations
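To scan the three sets side by side, the correlation series could be assembled into one frame. A sketch (the helper name is ours; `numeric_only` assumes pandas 1.5 or later):

```python
import pandas as pd

def correlation_table(df, targets=('cnt', 'casual', 'registered')):
    # Absolute correlation of every feature with each target, side by side
    corr = df.corr(numeric_only=True)
    drop = ['instant', *targets]
    return pd.concat(
        {t: corr[t].drop(labels=drop).abs() for t in targets}, axis=1
    ).sort_values('cnt', ascending=False)

# e.g. correlation_table(rentals)
```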
We removed columns from consideration that aren't viable features: `instant` is a code similar to an index, and `cnt`, `registered`, and `casual` are predictive targets.
Our hypothesis is somewhat confirmed: the behavioral differences between the `registered` and `casual` sub-groups are significant. As one might guess, `registered` users are less affected by enjoyment-driven factors such as weather and are most correlated with `hr`. Thinking about what it means to be a `registered` user, these people are likely habitual users who rely on this form of transportation for things like commuting to work. Conversely, a `casual` user might shy away from riding a bicycle for enjoyment if the weather is too hot or too cold, and may habitually use other forms of transportation to get to work during the week.
There are no hard rules for determining the cut-off for correlations, but a rule of thumb is that a correlation should be at least 0.3 to be considered. Thinking about the values above for a moment, it makes sense that temperature and humidity matter more than the month or season: just because it's a winter month doesn't mean a person won't take advantage of a nice day. Therefore we can favor the weather measurements over the calendar columns they proxy for.
Let's create a new dataframe that contains only the features with correlations above 0.3.
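That cut-off can be sketched as a small helper (the function name is ours), applied to any of the correlation series computed above:

```python
import pandas as pd

def select_features(corr_series, threshold=0.3):
    # Names of the features whose absolute correlation clears the threshold
    return corr_series[corr_series > threshold].index.tolist()

# e.g. rentals_cnt = rentals[select_features(cnt_feature_correlations) + ['cnt']]
```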
Collinearity occurs when feature candidates are linearly correlated with each other, essentially creating a redundancy in the features that can hurt the accuracy of the model.
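One way to spot such redundancies (the `temp`/`atemp` pair is a likely candidate, since both measure temperature) is to scan pairwise correlations among the chosen features. A sketch with a hypothetical helper:

```python
import pandas as pd

def collinear_pairs(df, features, threshold=0.7):
    # Report feature pairs whose mutual |correlation| is high enough
    # to suggest one of them is redundant
    corr = df[features].corr().abs()
    pairs = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, round(corr.loc[a, b], 3)))
    return pairs

# e.g. collinear_pairs(rentals, ['temp', 'atemp', 'hum', 'windspeed'])
```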