In this post, we'll do a simple exploration and calculate the average number of Uber rides for each hour of the day and each day of the week. Thanks to the folks at FiveThirtyEight, I was able to download data on over 4.5 million Uber pickups in New York City from April to September 2014 from their GitHub page. Let's get to work.
import pandas as pd
import datetime
import calendar
from IPython.display import IFrame
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import cufflinks as cf
init_notebook_mode()
# Load the six monthly files and stack them into one DataFrame
data = pd.DataFrame()
for i in ['apr', 'may', 'jun', 'jul', 'aug', 'sep']:
    month_data = pd.read_csv('uber-raw-data-{}14.csv'.format(i))
    data = pd.concat([data, month_data])
data.head()
| | Date/Time | Lat | Lon | Base |
---|---|---|---|---|
0 | 4/1/2014 0:11:00 | 40.7690 | -73.9549 | B02512 |
1 | 4/1/2014 0:17:00 | 40.7267 | -74.0345 | B02512 |
2 | 4/1/2014 0:21:00 | 40.7316 | -73.9873 | B02512 |
3 | 4/1/2014 0:28:00 | 40.7588 | -73.9776 | B02512 |
4 | 4/1/2014 0:33:00 | 40.7594 | -73.9722 | B02512 |
Function to convert Date/Time from a string to a Python datetime object:
def convert_datetime(x):
    # Timestamps normally include seconds; fall back to hour:minute if they do not
    try:
        return datetime.datetime.strptime(x, '%m/%d/%Y %H:%M:%S')
    except ValueError:
        return datetime.datetime.strptime(x, '%m/%d/%Y %H:%M')
data['Date/Time'] = data['Date/Time'].map(convert_datetime)
data = data.sort_values(by='Date/Time')
data.index = data['Date/Time']
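As an aside, the same conversion can be sketched in vectorized form with `pd.to_datetime`, which parses the whole column in one call instead of row by row. This is only a minimal sketch on a hypothetical three-row `sample` frame, and it assumes every timestamp includes seconds, as the rows shown above do.

```python
import pandas as pd

# Hypothetical sample in the same format as the Date/Time column
sample = pd.DataFrame({
    'Date/Time': ['9/30/2014 22:58:00', '4/1/2014 0:11:00', '4/1/2014 0:17:00']
})

# Parse the whole column at once using an explicit format
sample['Date/Time'] = pd.to_datetime(sample['Date/Time'],
                                     format='%m/%d/%Y %H:%M:%S')

# Sort chronologically and use the timestamps as the index
sample = sample.sort_values(by='Date/Time')
sample.index = sample['Date/Time']
```

On the full dataset this would replace the `map(convert_datetime)` step; rows without seconds would need the fallback format handled separately.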
Count the number of rides for each hour.
data_hour = pd.DataFrame(data['Date/Time'].resample('H').count())
Calculate the average number of rides for each specific hour of the day.
avg_per_hour = [data_hour.loc[data_hour.index.hour == i, 'Date/Time'].mean() for i in range(24)]
data_avg_per_hour = pd.DataFrame({'average rides': avg_per_hour}, index=[datetime.time(i, 0) for i in range(24)])
data_avg_per_hour.iplot(kind='bar', dimensions=(950,400),
                        layout={'yaxis': {'ticksuffix':' rides/h'},
                                'title': 'Average number of rides during specific hours from April 2014 to September 2014'})
IFrame(src="//plot.ly/~ahmedas91/37.embed", width=950, height=450)
As expected, the average number of rides peaks in the morning from 6am to 9am, when people are heading to work and school, and again from 3pm to 9pm.
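The hourly averaging can also be sketched with a `groupby` on the hour of day, which avoids looping over the 24 hours by hand. The `pickups` timestamps below are hypothetical and exist only to make the arithmetic easy to check; the `'60min'` frequency string is used here as an equivalent of hourly bins.

```python
import pandas as pd

# Hypothetical pickups: three on day one between 08:00 and 09:00,
# then one at 08:10 and one at 17:20 on day two
pickups = pd.to_datetime([
    '2014-04-01 08:05', '2014-04-01 08:30', '2014-04-01 08:45',
    '2014-04-02 08:10', '2014-04-02 17:20',
])
rides = pd.Series(1, index=pickups)

# Count rides in every hourly bin; hours with no pickups count as 0
rides_per_hour = rides.resample('60min').count()

# Average the hourly counts by hour of day (0-23)
avg_by_hour = rides_per_hour.groupby(rides_per_hour.index.hour).mean()
```

Here the 08:00 hour averages (3 + 1) / 2 = 2 rides, and the 17:00 hour averages (0 + 1) / 2 = 0.5, because empty hours inside the observed range still count toward the mean.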
Now let's check out the average number of rides on each day of the week. Just like we did above, we'll count the number of rides per day and then calculate the average for each specific day of the week.
data_daily = pd.DataFrame(data['Date/Time'].resample('D').count())
data_daily['weekday'] = data_daily.index.map(lambda x: calendar.day_name[x.weekday()])
data_daily_avg_weekday = data_daily.groupby('weekday').mean()
data_daily_avg_weekday = data_daily_avg_weekday.reindex(['Monday', 'Tuesday', 'Wednesday',
                                                         'Thursday', 'Friday', 'Saturday', 'Sunday'])
data_daily_avg_weekday.iplot(kind='bar',
                             dimensions=(950,400),
                             layout={'yaxis': {'ticksuffix':' rides/d'},
                                     'title': 'Average number of rides during specific weekdays from April 2014 to September 2014'})
IFrame(src="//plot.ly/~ahmedas91/39.embed", width=950, height=450)
Again as expected, the average number of rides peaks on Thursday and Friday and slumps on Sunday.
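For reference, the weekday averaging can be sketched without the `calendar` lookup by grouping on `DatetimeIndex.day_name()`. The `daily` counts below are made up purely for illustration and are not taken from the Uber data.

```python
import pandas as pd

# Hypothetical daily ride counts for two full weeks, starting Monday 2014-04-07
days = pd.date_range('2014-04-07', periods=14, freq='D')
daily = pd.Series([10, 12, 14, 16, 18, 8, 6,
                   20, 12, 14, 16, 18, 8, 6], index=days)

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Group by weekday name, average, then restore calendar order
# (groupby alone would sort the names alphabetically)
avg_by_weekday = daily.groupby(daily.index.day_name()).mean().reindex(weekday_order)
```

The `reindex` step plays the same role as in the code above: it puts the bars in Monday-to-Sunday order rather than alphabetical order.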