In this case study, we'll use regression techniques to study some of the factors that drive the usage of bikesharing systems. The data set for this case study was collected from the Capital Bikeshare system in Washington DC. We use the aggregated version graciously provided by the authors of the following paper:
Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
This data set includes information about the season and time of year; the weather; and the count of bicycle users on each day for two years. This level of information gives us considerable ability to model phenomena in the data.
In this case study, our primary aim will be explanatory modeling rather than predictive modeling. That is, we are more interested in learning something about the data we have than we are about predicting new, unseen data. To this end, we'll be prioritizing using a simple model with interpretable parameters.
Some people would consider the emphasis on explanation vs. prediction to be the dividing line between statistics and machine learning, but in practice this is very fuzzy indeed.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import urllib.request
def retrieve_data(url):
    """
    Retrieve the file at the specified url and save it in a local file
    called data.csv.
    """
    # grab the raw bytes from the remote file
    filedata = urllib.request.urlopen(url)
    to_write = filedata.read()
    # write the bytes to a local file
    with open("data.csv", "wb") as f:
        f.write(to_write)
retrieve_data("https://philchodrow.github.io/PIC16A/datasets/Bike-Sharing-Dataset/day.csv")
bikeshare = pd.read_csv("data.csv")
bikeshare
instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 82 | 1518 | 1600 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
726 | 727 | 2012-12-27 | 1 | 1 | 12 | 0 | 4 | 1 | 2 | 0.254167 | 0.226642 | 0.652917 | 0.350133 | 247 | 1867 | 2114 |
727 | 728 | 2012-12-28 | 1 | 1 | 12 | 0 | 5 | 1 | 2 | 0.253333 | 0.255046 | 0.590000 | 0.155471 | 644 | 2451 | 3095 |
728 | 729 | 2012-12-29 | 1 | 1 | 12 | 0 | 6 | 0 | 2 | 0.253333 | 0.242400 | 0.752917 | 0.124383 | 159 | 1182 | 1341 |
729 | 730 | 2012-12-30 | 1 | 1 | 12 | 0 | 0 | 0 | 1 | 0.255833 | 0.231700 | 0.483333 | 0.350754 | 364 | 1432 | 1796 |
730 | 731 | 2012-12-31 | 1 | 1 | 12 | 0 | 1 | 1 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
731 rows × 16 columns
There are actually three columns we could aim to predict. Casual users don't sign up for the bikeshare service -- they just pay each time they decide to take a ride. In contrast, registered users pay a regular subscription fee. The `cnt` column is simply the sum of the casual and registered rides for each day. For this case study, we will focus on modeling the behavior of casual riders.
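That claim about `cnt` is easy to sanity-check. Here's a minimal sketch on a toy frame whose column names match the real data set (the numbers are copied from the first few rows above):

```python
import pandas as pd

# toy rows mimicking the bikeshare schema
toy = pd.DataFrame({
    "casual":     [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt":        [985, 801, 1349],
})

# cnt should equal casual + registered on every row
assert (toy["casual"] + toy["registered"]).equals(toy["cnt"])
```

On the full data set, the same assertion (with `bikeshare` in place of `toy`) should also pass.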
Let's start by simply visualizing the total counts over time. For this, it helps to convert the `dteday` column into a proper datetime column.
bikeshare['dteday'] = pd.to_datetime(bikeshare['dteday'])
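Once the column holds proper datetimes, the `.dt` accessor gives vectorized access to components like the day of the week. A small sketch with the first few dates from the data set:

```python
import pandas as pd

# date strings in the same format as the dteday column
dates = pd.to_datetime(pd.Series(["2011-01-01", "2011-01-02", "2011-01-03"]))

# pandas codes Monday as 0, so Saturday is 5
print(dates.dt.dayofweek.tolist())  # → [5, 6, 0]
```

This kind of conversion is also what lets matplotlib place the observations correctly along a time axis.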
Now we can make a simple plot.
fig, ax = plt.subplots(1, figsize = (10, 7))
ax.plot(bikeshare['dteday'], bikeshare['casual'])
A few observations: ridership is much higher in the warmer middle of each year, and there's a lot of day-to-day variation. I wonder whether there's much of a difference between weekend and weekday ridership.
bikeshare.groupby('workingday')[["casual"]].mean()
casual | |
---|---|
workingday | |
0 | 1371.134199 |
1 | 606.570000 |
Interesting! There are, on average, more than twice as many casual riders on non-working days (mostly weekends) as on working days. Let's take a look at how this breaks down by month.
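Mechanically, that `groupby` call splits the rows by the value of `workingday` and averages `casual` within each group. Here's the same pattern on a toy frame with invented numbers:

```python
import pandas as pd

# invented rows: a workingday flag and casual rider counts
toy = pd.DataFrame({
    "workingday": [0, 0, 1, 1],
    "casual":     [1000, 1400, 500, 700],
})

# split rows by workingday, then average casual within each group
means = toy.groupby("workingday")[["casual"]].mean()
print(means["casual"].tolist())  # → [1200.0, 600.0]
```

Using the double brackets `[["casual"]]` keeps the result as a data frame rather than a series, which makes the later multi-column groupbys look uniform.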
m = bikeshare.groupby(['workingday', 'mnth'])[['casual']].mean()
fig, ax = plt.subplots(1)
ax.plot(m.loc[0], label = "Weekends")
ax.plot(m.loc[1], label = "Weekdays")
ax.set(xlabel = 'Month',
ylabel = 'Number of casual riders')
ax.legend()
Ok, we're beginning to see some patterns now. There's a clear increase in casual ridership on the weekends, and in warmer months in the middle of the year. So, we'd expect these variables to play major roles in our modeling.
We can break things up further -- what's the dependence on the weather?
m = bikeshare.groupby(['workingday', 'weathersit', 'mnth'])[['casual']].mean()
fig, ax = plt.subplots(1, 3, figsize = (10, 3), sharey = True)
weather_codes = {1 : "Clear",
2 : "Cloudy",
3 : "Light precipitation"}
for i in weather_codes.keys():
ax[i-1].plot(m.loc[0].loc[i], label = "Weekends")
ax[i-1].plot(m.loc[1].loc[i], label = "Weekdays")
ax[i-1].set(title = weather_codes[i])
ax[0].legend()
Ok, this makes sense: cloudy weather doesn't seem to be much of a deterrent, but precipitation makes a major difference.
We've now learned quite a lot about what we're looking for, so let's go ahead and begin preparing our data for modeling. Let's first grab only the variables that we want to include as a new data frame.
Note that we've excluded, for example, `season`, because all the information in the `season` column is actually already contained in the `mnth` column.
cols = ["casual",
"mnth",
"weathersit",
"workingday",
"yr",
"temp",
"hum",
"windspeed",
"holiday"]
model_df = bikeshare[cols]
Next, we need to encode the `mnth` column in a way that an algorithm can understand. The easiest way is to use `pd.get_dummies()`, which will create a 0-1 column associated to each possible value.
model_df = pd.get_dummies(model_df, columns = ['mnth'])
model_df
casual | weathersit | workingday | yr | temp | hum | windspeed | holiday | mnth_1 | mnth_2 | mnth_3 | mnth_4 | mnth_5 | mnth_6 | mnth_7 | mnth_8 | mnth_9 | mnth_10 | mnth_11 | mnth_12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 331 | 2 | 0 | 0 | 0.344167 | 0.805833 | 0.160446 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 131 | 2 | 0 | 0 | 0.363478 | 0.696087 | 0.248539 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 120 | 1 | 1 | 0 | 0.196364 | 0.437273 | 0.248309 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 108 | 1 | 1 | 0 | 0.200000 | 0.590435 | 0.160296 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 82 | 1 | 1 | 0 | 0.226957 | 0.436957 | 0.186900 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
726 | 247 | 2 | 1 | 1 | 0.254167 | 0.652917 | 0.350133 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
727 | 644 | 2 | 1 | 1 | 0.253333 | 0.590000 | 0.155471 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
728 | 159 | 2 | 0 | 1 | 0.253333 | 0.752917 | 0.124383 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
729 | 364 | 1 | 0 | 1 | 0.255833 | 0.483333 | 0.350754 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
730 | 439 | 2 | 1 | 1 | 0.215833 | 0.577500 | 0.154846 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
731 rows × 20 columns
Now we can separate our data into predictor and target variables.
y = model_df['casual']
X = model_df.drop(['casual'], axis = 1)
Model time! Because our interest is explanatory rather than predictive, we are just going to train and evaluate the model on the same set of data. There are still potential problems with doing this! In this particular case, however, we're not too worried about overfitting because the number of data points far outnumbers the number of model parameters.
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(X, y)
LR.score(X, y)
0.7307281905763694
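The value reported by `score` is the coefficient of determination, R² = 1 − SSres/SStot. We can verify that formula by hand on synthetic data (the data-generating line below is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic regression data: a known linear signal plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

LR = LinearRegression().fit(X, y)

# recompute R^2 from the residuals by hand
resid = y - LR.predict(X)
r2_manual = 1 - (resid**2).sum() / ((y - y.mean())**2).sum()

assert np.isclose(r2_manual, LR.score(X, y))
```

So a score of about 0.73 means the model accounts for roughly 73% of the variance in daily casual ridership.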
Let's visualize how our model did at replicating the casual usage of the bikeshare system. For convenience, let's add the model predictions as a column of the `bikeshare` data frame.
bikeshare['preds'] = LR.predict(X)
Now we can visualize the modeled values against the true ones.
fig, ax = plt.subplots(1, figsize = (10, 7))
ax.plot(bikeshare['dteday'], bikeshare['casual'], label = 'observed')
ax.plot(bikeshare['dteday'], bikeshare['preds'], label = 'modeled')
ax.legend()
Not bad! The model does a fair job of replicating the true data, though the fit is clearly not perfect.
So, this model has room for improvement. Still, considering how simple it was to put together, it's a fair start! Recall that one of our main aims was to learn something about which conditions make a difference for ridership. Linear models are great for this, because we can just investigate the coefficients directly.
pd.DataFrame({"variable" : X.columns,
"coefficient" : LR.coef_.round(2)})
variable | coefficient | |
---|---|---|
0 | weathersit | -113.62 |
1 | workingday | -822.77 |
2 | yr | 287.47 |
3 | temp | 1615.43 |
4 | hum | -550.08 |
5 | windspeed | -1164.64 |
6 | holiday | -299.52 |
7 | mnth_1 | -262.63 |
8 | mnth_2 | -275.11 |
9 | mnth_3 | 87.13 |
10 | mnth_4 | 226.14 |
11 | mnth_5 | 241.25 |
12 | mnth_6 | 45.13 |
13 | mnth_7 | -92.17 |
14 | mnth_8 | -15.67 |
15 | mnth_9 | 147.51 |
16 | mnth_10 | 166.83 |
17 | mnth_11 | -50.74 |
18 | mnth_12 | -217.65 |
Here's what we observe:

- Higher values of the `weathersit` variable correspond to more severe weather. The negative coefficient indicates that, for each step up the severity scale, roughly 100 riders are lost.
- The `yr` coefficient accounts for the slight overall increase in riders over time: a day in year 1 might have nearly 300 more riders than a comparable day in year 0.

The various month results deserve special attention. Higher values indicate that, all else being equal, the model predicts more riders that month. However, we should be very cautious in interpreting these results, since certain months are likely to have lower temperatures or higher humidities, for example. So, it's not usually the case that "all else is equal." Still, these results do give us suggestive and potentially useful quantitative insights, such as riders' preference for warmer months over colder ones, especially in the spring as the snow melts.
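These readings rely on a basic property of linear models: changing one input by one unit, holding the others fixed, changes the prediction by exactly that input's coefficient. A sketch on synthetic data with known effects (the numbers below are invented, loosely echoing the fitted coefficients above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
workingday = rng.integers(0, 2, size=n)
temp = rng.uniform(size=n)

# synthetic riders: a -800 working-day effect and a +1600 temperature effect
casual = 1000 - 800*workingday + 1600*temp + rng.normal(scale=50, size=n)

X = np.column_stack([workingday, temp])
LR = LinearRegression().fit(X, casual)

# flipping workingday from 0 to 1 shifts the prediction by exactly coef_[0]
a, b = LR.predict(np.array([[0, 0.5], [1, 0.5]]))
assert np.isclose(b - a, LR.coef_[0])
```

With enough data and modest noise, the fitted coefficients also land close to the true effects we built in, which is exactly the sense in which coefficients "explain" the data.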
Depending on our goals, we have many possible next steps. We might aim to improve our model accuracy by using more sophisticated models or collecting more data. We might aim to deploy more complex models in order to handle dependencies between the month variables and the weather variables, for example. Rather than the explanatory angle we've used here, we might instead choose to look at this problem through a predictive lens, and aim to predict the bikeshare usage in unseen, coming months. As usual in data science, each model opens at least as many doors as it closes.