New York Subway and Weather Data¶

Working with 2D data with numpy and pandas.

Questions¶

What variables area related to subway ridership?
- What stations have the most riders?
- What are the ridership patterns over time? On weekends Holidays?
- How does the weather affect ridership?
What patterns can I find in the weather?
- How does the weather vary across the city?
- How does the temperature rise throughout the month?

In [1]:

# Libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%pylab inline

Populating the interactive namespace from numpy and matplotlib

Storing 2D Data¶

Pure python: Lists of lists
NumPy: 2D array (more memory efficent, accessing elements differ in syntax)
Pandas : Dataframe

Storing 2D Data : NumPy¶

In [2]:

# Sample Numpy 2D array: Subway ridership for 5 stations and 10 days
# Columns represent train stations, and rows represent days

ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

# Check type of 2D array
ridership.dtype

Out[2]:

dtype('int64')

Working with Numpy 2D arrays:

Station with max number of riders on day 1 and the mean riders per day (mean_for_max)
For all days and stations what is the mean ridership overall? (overall_mean)

In [3]:

def mean_riders_for_max_station(ridership):
    # Max riders on day 1
    max_station = ridership[0,:].argmax()            # argmax returns position
    # Mean riders per day for max_station
    mean_for_max = ridership[:, max_station].mean()
    overall_mean = ridership.mean()
    return (overall_mean, mean_for_max)

mean_riders_for_max_station(ridership)

Out[3]:

(2342.5999999999999, 3239.9000000000001)

Station with higher ridership on day1 had a higher ridership than average. This is interesting but not useful.

Use NumPy's axis function to find more practical results:¶

Mean ridership per day for each subway station
Maximum and minimum ridership for each day for each subway station

In [4]:

def min_and_max_riders_per_day(ridership):
    # Mean per day for each station 
    station_riders = ridership.mean(axis = 0)      #selects columns 
    
    max_daily_ridership = station_riders.max()
    min_daily_ridership = station_riders.min() 
    
    return (max_daily_ridership, min_daily_ridership)

min_and_max_riders_per_day(ridership)

Out[4]:

(3239.9000000000001, 1071.2)

The station with the max daily ridership had 3239 riders. The station with the min daily ridership had 1071 riders.

Storing 2D Data : Pandas Dataframes¶

Recreate the sample 2D numpy array using a pandas dataframe. Advantages of a pandas dataframe over a 2D NumPy array includes the following:

Indexs for rows
Name for columns
Different data types for each column

In [5]:

# Create a pandas df

ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)

# Print df
ridership_df

Out[5]:

	R003	R004	R005	R006	R007
05-01-11	0	0	2	5	0
05-02-11	1478	3877	3674	2328	2539
05-03-11	1613	4088	3991	6461	2691
05-04-11	1560	3392	3826	4787	2613
05-05-11	1608	4802	3932	4477	2705
05-06-11	1576	3933	3909	4979	2685
05-07-11	95	229	255	496	201
05-08-11	2	0	1	27	0
05-09-11	1438	3785	3589	4174	2215
05-10-11	1342	4043	4009	4665	3033

Mean ridership for each station (column).

In [6]:

# Mean of each column 
ridership_df.mean()

Out[6]:

R003    1071.2
R004    2814.9
R005    2718.8
R006    3239.9
R007    1868.2
dtype: float64

Mean ridership for each day (row).

In [7]:

# Mean of each row 
ridership_df.mean(axis = 1)

Out[7]:

05-01-11       1.4
05-02-11    2779.2
05-03-11    3768.8
05-04-11    3235.6
05-05-11    3504.8
05-06-11    3416.4
05-07-11     255.2
05-08-11       6.0
05-09-11    3040.2
05-10-11    3418.4
dtype: float64

Accessing a station (column) of the dataframe using "iloc"

In [8]:

#ridership_df.iloc['R007'] --> This is wrong
print ridership_df.iloc[4]

print ridership_df['R007'] # prints beginning of column

R003    1608
R004    4802
R005    3932
R006    4477
R007    2705
Name: 05-05-11, dtype: int64
05-01-11       0
05-02-11    2539
05-03-11    2691
05-04-11    2613
05-05-11    2705
05-06-11    2685
05-07-11     201
05-08-11       0
05-09-11    2215
05-10-11    3033
Name: R007, dtype: int64

Accessing a day (row/index) of the dataframe using "loc"

In [9]:

ridership_df.loc['05-09-11']

Out[9]:

R003    1438
R004    3785
R005    3589
R006    4174
R007    2215
Name: 05-09-11, dtype: int64

Accessing a day/station (row/column) of the dataframe using "iloc" or "loc"

In [10]:

# Two methods
print ridership_df.iloc[0,3]
print ridership_df.loc['05-01-11', 'R006']

5
5

Get a 2D Numpy array from a Pandas Dataframe:

This will output a 2D array with no rows or columns. Useful for taking the mean of all the values. Make sure data types are consistent.

In [11]:

print ridership_df.values
print ridership_df.values.mean()

[[   0    0    2    5    0]
 [1478 3877 3674 2328 2539]
 [1613 4088 3991 6461 2691]
 [1560 3392 3826 4787 2613]
 [1608 4802 3932 4477 2705]
 [1576 3933 3909 4979 2685]
 [  95  229  255  496  201]
 [   2    0    1   27    0]
 [1438 3785 3589 4174 2215]
 [1342 4043 4009 4665 3033]]
2342.6

In [12]:

# Maxmimum number of riders on the first day 
ridership_df.iloc[0].argmax()

Out[12]:

'R006'

We rewrite the mean_riders_for_max_station function, but this time the input is a Pandas DataFrame rather than a 2D NumPy array. The main differences:

iloc[0] instead of [0,:] to access a row
[max_station] instead of [:, max_station] to access a column
.values.mean() instead of .mean() to get the overall mean instead of a series of means

In [13]:

def mean_riders_for_max_station(ridership):
    # Max riders on day 1 station
    max_station = ridership.iloc[0].argmax()     
    # Mean riders per day for max station
    mean_for_max = ridership[max_station].mean()
    overall_mean = ridership.values.mean()
    return (overall_mean, mean_for_max)

mean_riders_for_max_station(ridership_df)

Out[13]:

(2342.5999999999999, 3239.9000000000001)

Same result regardless of whether we used numpy arrays or pandas dataframe.

Loading Data into a Dataframe

In [14]:

# Read in file
subway_df = pd.read_csv('nyc_subway_weather.csv')

Instead of printing all the rows of the data, print the first 4 rows only using .head().

In [15]:

subway_df.head()

Out[15]:

	UNIT	DATEn	TIMEn	ENTRIESn	EXITSn	datetime	hour	day_week	...	pressurei	tempi	wspdi	meanpressurei	meantempi	meanwspdi	weather_lat	weather_lon
0	R003	05-01-11	00:00:00	4388333	2911002	2011-05-01 00:00:00	0	6	...	30.22	55.9	3.5	30.258	55.98	7.86	40.700348	-73.887177
1	R003	05-01-11	04:00:00	4388333	2911002	2011-05-01 04:00:00	4	6	...	30.25	52.0	3.5	30.258	55.98	7.86	40.700348	-73.887177
2	R003	05-01-11	12:00:00	4388333	2911002	2011-05-01 12:00:00	12	6	...	30.28	62.1	6.9	30.258	55.98	7.86	40.700348	-73.887177
3	R003	05-01-11	16:00:00	4388333	2911002	2011-05-01 16:00:00	16	6	...	30.26	57.9	15.0	30.258	55.98	7.86	40.700348	-73.887177
4	R003	05-01-11	20:00:00	4388333	2911002	2011-05-01 20:00:00	20	6	...	30.28	52.0	10.4	30.258	55.98	7.86	40.700348	-73.887177

5 rows × 27 columns

Number of rows in our dataframe:

In [16]:

len(subway_df)

Out[16]:

Use the describe() function to see some statistics about each column.

In [17]:

subway_df.describe()

Out[17]:

	ENTRIESn	EXITSn	ENTRIESn_hourly	EXITSn_hourly	hour	day_week	weekday	latitude	longitude	fog	...	pressurei	rain	tempi	wspdi	meanprecipi	meanpressurei	meantempi	meanwspdi	weather_lat	weather_lon
count	4.264900e+04	4.264900e+04	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	...	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000	42649.000000
mean	2.812486e+07	1.986993e+07	1886.589955	1361.487866	10.046754	2.905719	0.714436	40.724647	-73.940364	0.009824	...	29.971096	0.224741	63.103780	6.927872	0.004618	29.971096	63.103780	6.927872	40.728555	-73.938693
std	3.043607e+07	2.028986e+07	2952.385585	2183.845409	6.938928	2.079231	0.451688	0.071650	0.059713	0.098631	...	0.137942	0.417417	8.455597	4.510178	0.016344	0.131158	6.939011	3.179832	0.065420	0.059582
min	0.000000e+00	0.000000e+00	0.000000	0.000000	0.000000	0.000000	0.000000	40.576152	-74.073622	0.000000	...	29.550000	0.000000	46.900000	0.000000	0.000000	29.590000	49.400000	0.000000	40.600204	-74.014870
25%	1.039762e+07	7.613712e+06	274.000000	237.000000	4.000000	1.000000	0.000000	40.677107	-73.987342	0.000000	...	29.890000	0.000000	57.000000	4.600000	0.000000	29.913333	58.283333	4.816667	40.688591	-73.985130
50%	1.818389e+07	1.331609e+07	905.000000	664.000000	12.000000	3.000000	1.000000	40.717241	-73.953459	0.000000	...	29.960000	0.000000	61.000000	6.900000	0.000000	29.958000	60.950000	6.166667	40.720570	-73.949150
75%	3.263049e+07	2.393771e+07	2255.000000	1537.000000	16.000000	5.000000	1.000000	40.759123	-73.907733	0.000000	...	30.060000	0.000000	69.100000	9.200000	0.000000	30.060000	67.466667	8.850000	40.755226	-73.912033
max	2.357746e+08	1.493782e+08	32814.000000	34828.000000	20.000000	6.000000	1.000000	40.889185	-73.755383	1.000000	...	30.320000	1.000000	86.000000	23.000000	0.157500	30.293333	79.800000	17.083333	40.862064	-73.694176

8 rows × 21 columns

Linear correlation between two variables using pearson's r

Standardize each variable by converting it to a number of standard deviations above or below the mean. This will help scale the data. The further a data point is from the x,y axis, the larger the contribution to the pearson's r value.
Multiply each pair of values and take the average of the products. This will split the data into four quadrants where the data is either both above/below.
Value should be from -1 to +1

Note: Numpy has a corrcoef() function as awell.

In [18]:

# Function to compute linear correlation between 2 variables
def correlation(x, y):       # input can be a numpy 2d array or panda series 
    # Standardize data
    std_x = (x - x.mean()) / x.std(ddof=0)           #"ddof=0" for panda's uncorrected std function 
    std_y = (y - y.mean()) / y.std(ddof=0)
    return (std_x * std_y).mean()

In [19]:

# Get columns of data
entries = subway_df['ENTRIESn_hourly']
cum_entries = subway_df['ENTRIESn']
rain = subway_df['meanprecipi']
temp = subway_df['meantempi']

How to interpret pearson's r results:

Postive "r" value. As one value increases, the other value increases:
- Both values above the mean, product is positive from (+,+)
- Both values below the mean, product is positive from (-,-)
Negative "r" value. As one value increases, the other value decreases:
- One value above the mean, other value below the mean, product is negative from (-, +)
- One value above the mean, other value below the mean, product is negative from (+, -)
If value is near 0, then poor correlation between two variables

Applying pearson's r to subway data:

In [20]:

correlation(entries, rain)

Out[20]:

0.035648515772244052

Subway entries and rain

small positive correlation
more rain correlated with more subway entries

In [21]:

correlation(entries, temp)

Out[21]:

-0.026693348321570803

Subway entries and temperature

small negative correlation
higher temperature correlated with lower subway entries
lower temperature correlated with higher subway entries

In [22]:

correlation(rain, temp)

Out[22]:

-0.22903432340843488

Rain and temperature

medium negative correlation
higher rain correlated with lower temperature
lower rain correlated with higher temperature

In [23]:

correlation(entries, cum_entries)

Out[23]:

0.58589547076620752

Subway daily entries and cummulative subway entries

high positive correlation
high cummulative subway entries correlated with high daily entries

Pandas Means and Axis Names

For axis = 0 or axis = 'index', the mean of each column is taken
For axis = 1 or axis = 'columns', the mean of each row is taken

Pandas and Vectorized Operations

Match up elements by index and columns instead of position ( as in numpy 2d arrays)

Convert Total Entries and Exits to Hourly Entries and Exits¶

The subway data included the cummulative number of entries and exits by riders. We do not want the cummulative data, instead we want the hourly number of entries and exits. This can be done using two methods.

In [24]:

# Sample data: DF with total entries (column1) and total exits (column2)
entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})

In [25]:

# Method 1 : Write a function 

# Use shift(1) which shifts index by period of 1
def get_hourly_entries_and_exits(entries_and_exits):
    # Subtract value from each previous row 
    return entries_and_exits - entries_and_exits.shift(1)
get_hourly_entries_and_exits(entries_and_exits)

Out[25]:

	ENTRIESn	EXITSn
0	NaN	NaN
1	23.0	8.0
2	18.0	18.0
3	71.0	54.0
4	170.0	44.0
5	214.0	42.0
6	87.0	11.0
7	10.0	3.0
8	36.0	89.0
9	153.0	333.0

In [26]:

# Method 2: Builtin .diff() function
print entries_and_exits.diff() 

   ENTRIESn  EXITSn
0       NaN     NaN
1      23.0     8.0
2      18.0    18.0
3      71.0    54.0
4     170.0    44.0
5     214.0    42.0
6      87.0    11.0
7      10.0     3.0
8      36.0    89.0
9     153.0   333.0

Pandas and non built-in functions

write a helper function
applymap() function maps a function element to element

For dataframes the applymap() and apply() functions do different things.

apply(): applies helper on each column/series.
- Can get max value from each column using df.apply(np.max)
applymap(): applies helper to each element

Example: Get the second largest value in each column

In [27]:

# Helper for single column
def second_largest_in_column(column):
    # Sort column in descending order. Find second value.
    sorted_column = column.sort_values(ascending=False)
    return sorted_column.iloc(1) # Second value at position 1

# Apply function to a df 
def second_largest(df):
    return df.apply(second_largest_in_column)

# second_largest(df)

Pandas addition of dataframes and series

dataframe + dataframe = dataframe
series + series = series
dataframe + series = ???

In [28]:

# Series + Square dataFrame 0,1,2,3 column/index names 
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    0: [10, 20, 30, 40],
    1: [50, 60, 70, 80],
    2: [90, 100, 110, 120],
    3: [130, 140, 150, 160]
})

# Each value of the series was added to each column of the dataframe
print df + s

    0   1    2    3
0  11  52   93  134
1  21  62  103  144
2  31  72  113  154
3  41  82  123  164

In [29]:

# Series + one-row DataFrame 
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})

# Each value of the series was added to each column of the dataframe
# Same as above 
print df + s

    0   1   2   3
0  11  22  33  44

In [30]:

# Series + one-column DataFrame
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]}) # one column

# First column had 1 added, the rest were NaNs
print df + s

# df.add(s, axis = 'columns') 
df.add(s, axis = 'rows') # Each value of series added to one row 

    0   1   2   3
0  11 NaN NaN NaN
1  21 NaN NaN NaN
2  31 NaN NaN NaN
3  41 NaN NaN NaN

Out[30]:

	0
0	11
1	22
2	33
3	44

In [31]:

# DataFrame column with name + Series index with name ( matching names)

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [50, 60, 70, 80],
    'c': [90, 100, 110, 120],
    'd': [130, 140, 150, 160]
})
print df + s

    a   b    c    d
0  11  52   93  134
1  21  62  103  144
2  31  72  113  154
3  41  82  123  164

In [32]:

# DataFrame column with name + Series index with name ( non- matching names)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [50, 60, 70, 80],
    'c': [90, 100, 110, 120],
    'd': [130, 140, 150, 160]
})

# No match between indexes --> NaN
print df + s

    0   1   2   3   a   b   c   d
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN

Standardizing rows/columns using vector operations without apply()

In [33]:

# Example dataset using grades and names of students 
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
grades_df

Out[33]:

	exam1	exam2
Andre	43	24
Barry	81	63
Chris	78	56
Dan	75	56
Emilio	89	67
Fred	70	51
Greta	91	79
Humbert	65	46
Ivan	98	72
James	87	60

In [34]:

grades_df.mean()

Out[34]:

exam1    77.7
exam2    57.4
dtype: float64

In [35]:

# Subtract mean1 from column1 and mean2 from column2
# Divide each column by it's standard deviation 

(grades_df - grades_df.mean() )/ grades_df.std()

Out[35]:

	exam1	exam2
Andre	-2.196525	-2.186335
Barry	0.208891	0.366571
Chris	0.018990	-0.091643
Dan	-0.170911	-0.091643
Emilio	0.715295	0.628408
Fred	-0.487413	-0.418938
Greta	0.841896	1.413917
Humbert	-0.803916	-0.746234
Ivan	1.284999	0.955703
James	0.588694	0.170194

In [36]:

# Standardize by rows, set axis = columns to get mean of each row 
# Subtract means from rows using sub() and axis = 'index'
mean_diffs = grades_df.sub(grades_df.mean(axis='columns'), axis='index')

# Divide by standard deviation - same values since only 2 columns 
mean_diffs.div(grades_df.std(axis='columns'), axis ='index')

Out[36]:

	exam1	exam2
Andre	0.707107	-0.707107
Barry	0.707107	-0.707107
Chris	0.707107	-0.707107
Dan	0.707107	-0.707107
Emilio	0.707107	-0.707107
Fred	0.707107	-0.707107
Greta	0.707107	-0.707107
Humbert	0.707107	-0.707107
Ivan	0.707107	-0.707107
James	0.707107	-0.707107

Pandas Groupby() function

General idea: Group data then aggregate data by taking an average within the group. Pandas has the following built-in functions:

group_data()
sum_grouped_items()

In [37]:

# Example dataframe
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3 
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

example_df

Out[37]:

	above_three	even	value
a	False	False	1
b	False	False	3
c	False	True	2
d	True	True	4
e	False	False	1
f	True	True	6
g	True	True	4

In [38]:

# Groups
grouped_data = example_df.groupby('even')
# The groups attribute is a dictionary mapping keys to lists of row indexes
print grouped_data.groups

{False: Index([u'a', u'b', u'e'], dtype='object'), True: Index([u'c', u'd', u'f', u'g'], dtype='object')}

In [39]:

# Group by multiple columns
grouped_data = example_df.groupby(['even', 'above_three'])
print grouped_data.groups

{(True, False): Index([u'c'], dtype='object'), (False, False): Index([u'a', u'b', u'e'], dtype='object'), (True, True): Index([u'd', u'f', u'g'], dtype='object')}

In [40]:

# Get sum of each group
grouped_data = example_df.groupby('even')
print grouped_data.sum()

       above_three  value
even                     
False          0.0      5
True           3.0     16

In [41]:

# Limit columns in result
grouped_data = example_df.groupby('even')
print grouped_data.sum()['value']

even
False     5
True     16
Name: value, dtype: int64

In [42]:

# Subset of columns from the grouped data before collapsing to a dataframe
# Same results
print grouped_data['value'].sum()

even
False     5
True     16
Name: value, dtype: int64

Pandas groupby() function applied to NYC subway data

Question1: How does ridership vary by hour of the day?

For each hour of the day, group the data
For each hour of the day, find average ridership

Question2 : How does rain affect subway ridership?

Break data into two groups 1) raining 2) non-raining
Compare ridership between two groups

Question3: How does ridership vary by day of week?

For each day of the week, group the data
For each day of the week, find average ridership

Question3 Solution: How does ridership vary by day of week?

In [43]:

# Check headers first
subway_df.head()

Out[43]:

	UNIT	DATEn	TIMEn	ENTRIESn	EXITSn	datetime	hour	day_week	...	pressurei	tempi	wspdi	meanpressurei	meantempi	meanwspdi	weather_lat	weather_lon
0	R003	05-01-11	00:00:00	4388333	2911002	2011-05-01 00:00:00	0	6	...	30.22	55.9	3.5	30.258	55.98	7.86	40.700348	-73.887177
1	R003	05-01-11	04:00:00	4388333	2911002	2011-05-01 04:00:00	4	6	...	30.25	52.0	3.5	30.258	55.98	7.86	40.700348	-73.887177
2	R003	05-01-11	12:00:00	4388333	2911002	2011-05-01 12:00:00	12	6	...	30.28	62.1	6.9	30.258	55.98	7.86	40.700348	-73.887177
3	R003	05-01-11	16:00:00	4388333	2911002	2011-05-01 16:00:00	16	6	...	30.26	57.9	15.0	30.258	55.98	7.86	40.700348	-73.887177
4	R003	05-01-11	20:00:00	4388333	2911002	2011-05-01 20:00:00	20	6	...	30.28	52.0	10.4	30.258	55.98	7.86	40.700348	-73.887177

5 rows × 27 columns

The hourly data of the dataframe is taken in four hour intervals.

In [44]:

# Question 1
ridership_by_hour = subway_df.groupby('hour').mean()['ENTRIESn_hourly']
print ridership_by_hour

hour
0     1403.248441
4      311.717653
8      879.029929
12    3010.949693
16    2306.527816
20    3226.240093
Name: ENTRIESn_hourly, dtype: float64

In [45]:

# Plot data in four hour intervals 
# ridership_by_hour.plot()
ridership_by_hour.plot.bar()
title('4-Hour Intervals vs Subway Ridership')
ylabel('Number of Subway Riders')
xlabel('Time (4hr intervals)')

Out[45]:

<matplotlib.text.Text at 0x11643f790>

Above is a plot of the ridership data vs four hour interval units. Highest ridership between the following intervals:

12pm - 4pm
4pm - 8pm
8pm - 12am

More useful to plot hourly entries and exits for each station (UNIT)

In [66]:

ex_ridership_df2 = pd.DataFrame({
    'UNIT': ['R051', 'R079', 'R051', 'R079', 'R051', 'R079', 'R051', 'R079', 'R051'],
    'TIMEn': ['00:00:00', '02:00:00', '04:00:00', '06:00:00', '08:00:00', '10:00:00', '12:00:00', '14:00:00', '16:00:00'],
    'ENTRIESn': [3144312, 8936644, 3144335, 8936658, 3144353, 8936687, 3144424, 8936819, 3144594],
    'EXITSn': [1088151, 13755385,  1088159, 13755393,  1088177, 13755598, 1088231, 13756191,  1088275]
})

In [67]:

# Hourly entries and exits for each station (UNIT)
def hourly_for_group(entries_and_exits):
    return entries_and_exits - entries_and_exits.shift(1)

ridership_by_station = ex_ridership_df2.groupby('UNIT')['ENTRIESn', 'EXITSn'].apply(hourly_for_group)
print ridership_by_station

   ENTRIESn  EXITSn
0       NaN     NaN
1       NaN     NaN
2      23.0     8.0
3      14.0     8.0
4      18.0    18.0
5      29.0   205.0
6      71.0    54.0
7     132.0   593.0
8     170.0    44.0

In [68]:

# By station:
# ridership_by_station.plot()
ridership_by_station.plot.bar()
title('Stations vs Ridership')
xlabel('Station Number')
ylabel('Number of Subway Riders')

Out[68]:

<matplotlib.text.Text at 0x11bc7f250>

Joining multiple Pandas Dataframes

Note: this is unrelated to the NY subway dataset

Pandas merge() function joins multiple tables
Types of joins: inner, left, right

In [49]:

# This is unrelated to the NY subway dataset
ex_subway_df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
             'R004', 'R004'],
    'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
              '05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'ENTRIESn': [ 4388333,  4388348,  4389885,  4391507,  4393043, 14656120,
                 14656174, 14660126, 14664247, 14668301],
    'EXITSn': [ 2911002,  2911036,  2912127,  2913223,  2914284, 14451774,
               14451851, 14454734, 14457780, 14460818],
    'latitude': [ 40.689945,  40.689945,  40.689945,  40.689945,  40.689945,
                  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ],
    'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
                  -73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})

ex_weather_df = pd.DataFrame({
    'DATEn': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
              '05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'latitude': [ 40.689945,  40.69132 ,  40.689945,  40.69132 ,  40.689945,
                  40.69132 ,  40.689945,  40.69132 ,  40.689945,  40.69132 ],
    'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
                  -73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
    'pressurei': [ 30.24,  30.24,  30.32,  30.32,  30.14,  30.14,  29.98,  29.98,
                   30.01,  30.01],
    'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'tempi': [ 52. ,  52. ,  48.9,  48.9,  54. ,  54. ,  57.2,  57.2,  48.9,  48.9],
    'wspdi': [  8.1,   8.1,   6.9,   6.9,   3.5,   3.5,  15. ,  15. ,  15. ,  15. ]
})

In [50]:

# This is unrelated to the NY subway dataset
# JOIN two tables
ex_subway_df.merge(ex_weather_df, on =['DATEn', 'hour', 'latitude', 'longitude'], how = 'inner' )

Out[50]:

	DATEn	ENTRIESn	EXITSn	UNIT	latitude	longitude	pressurei	tempi	wspdi
0	05-01-11	4388333	2911002	R003	40.689945	-73.872564	30.24	52.0	8.1
1	05-02-11	4388348	2911036	R003	40.689945	-73.872564	30.32	48.9	6.9
2	05-03-11	4389885	2912127	R003	40.689945	-73.872564	30.14	54.0	3.5
3	05-04-11	4391507	2913223	R003	40.689945	-73.872564	29.98	57.2	15.0
4	05-05-11	4393043	2914284	R003	40.689945	-73.872564	30.01	48.9	15.0
5	05-01-11	14656120	14451774	R004	40.691320	-73.867135	30.24	52.0	8.1
6	05-02-11	14656174	14451851	R004	40.691320	-73.867135	30.32	48.9	6.9
7	05-03-11	14660126	14454734	R004	40.691320	-73.867135	30.14	54.0	3.5
8	05-04-11	14664247	14457780	R004	40.691320	-73.867135	29.98	57.2	15.0
9	05-05-11	14668301	14460818	R004	40.691320	-73.867135	30.01	48.9	15.0

Visualizing Data

Data1: A scatterplot of subway stations with latitude and longitude as the x and y axes and ridership as the bubble size
Data2: Histograms of subway ridership for days with rain and days without rain
Data3: A scatterplot with subway ridership on one axis and precipitation or temperature on the other

In [51]:

# Data1 : Group data by longitude and latitude. Then find the mean
data_by_location = subway_df.groupby(['latitude', 'longitude']).mean()
data_by_location.head()

Out[51]:

		ENTRIESn	EXITSn	ENTRIESn_hourly	EXITSn_hourly	hour	day_week	weekday	fog	precipi	pressurei	rain	tempi	wspdi	meanprecipi	meanpressurei	meantempi	meanwspdi	weather_lat	weather_lon
latitude	longitude
40.576152	-73.975925	9.659049e+06	8.641132e+06	403.896175	325.956284	10.032787	2.907104	0.715847	0.010929	0.006284	29.972568	0.229508	63.383607	5.553005	0.006284	29.972568	63.383607	5.553005	40.603489	-73.958763
40.576298	-73.968523	8.306897e+06	6.646823e+06	526.697297	419.562162	9.989189	2.951351	0.708108	0.010811	0.006216	29.973297	0.227027	63.375135	5.517838	0.006216	29.973297	63.375135	5.517838	40.603489	-73.958763
40.577961	-73.961806	4.552910e+07	4.612408e+07	1950.295699	1930.483871	10.000000	2.935484	0.709677	0.010753	0.006183	29.973118	0.225806	63.394086	5.531720	0.006183	29.973118	63.394086	5.531720	40.603489	-73.958763
40.589547	-73.974295	7.268214e+06	7.961334e+06	485.382353	362.941176	10.164706	2.905882	0.705882	0.011765	0.006118	29.971176	0.200000	63.650588	5.630588	0.006118	29.971176	63.650588	5.630588	40.603489	-73.958763
40.590867	-73.797011	6.477945e+06	5.994957e+06	500.725610	374.628049	10.097561	2.951220	0.719512	0.024390	0.002744	29.981098	0.195122	61.721341	9.945122	0.002744	29.981098	61.721341	9.945122	40.660004	-73.844849

In [52]:

# Only want longitude and latitude (row indexes) --> as_index=False
data_by_location = subway_df.groupby(['latitude', 'longitude'], as_index=False).mean()
print data_by_location.head()['latitude']
print data_by_location.head()['longitude']

0    40.576152
1    40.576298
2    40.577961
3    40.589547
4    40.590867
Name: latitude, dtype: float64
0   -73.975925
1   -73.968523
2   -73.961806
3   -73.974295
4   -73.797011
Name: longitude, dtype: float64

In [53]:

# Scatterplot with (x,y) as latitude,longitude and bubblesize = ridership
# Scale bubble markers by dividing by std
scaled_hourly_entries = (10 * data_by_location['ENTRIESn_hourly'] / data_by_location['ENTRIESn_hourly'].std() )
plt.scatter(data_by_location['latitude'], data_by_location['longitude'], s=scaled_hourly_entries )
title('Location and Ridership as Markers')
xlabel('Latitude')
ylabel('Longitude')

Out[53]:

<matplotlib.text.Text at 0x10b9ab850>

Can compare to a NYC subway map to get exact stations. High ridership in the following clusters.

longitude from - 73.95 to -74.03
latitude from 40.70 to 40.79