This data set comes from https://www.lyft.com/bikes/bay-wheels/system-data. It covers bike-share rentals from January to December 2018.
Each trip is anonymized and includes:
Trip Duration (seconds)
Start Time and Date
End Time and Date
Start Station ID
Start Station Name
Start Station Latitude
Start Station Longitude
End Station ID
End Station Name
End Station Latitude
End Station Longitude
Bike ID
User Type (Subscriber or Customer – “Subscriber” = Member or “Customer” = Casual)
Member Year of Birth
Member Gender
Trip Duration is the most important feature because it determines the rental fee. I'll explore how the other features relate to Trip Duration.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# load all of 2018 trip data
bike_rental_mon1 = pd.read_csv('./201801-fordgobike-tripdata.csv')
bike_rental_mon2 = pd.read_csv('./201802-fordgobike-tripdata.csv')
bike_rental_mon3 = pd.read_csv('./201803-fordgobike-tripdata.csv')
bike_rental_mon4 = pd.read_csv('./201804-fordgobike-tripdata.csv')
bike_rental_mon5 = pd.read_csv('./201805-fordgobike-tripdata.csv')
bike_rental_mon6 = pd.read_csv('./201806-fordgobike-tripdata.csv')
bike_rental_mon7 = pd.read_csv('./201807-fordgobike-tripdata.csv')
bike_rental_mon8 = pd.read_csv('./201808-fordgobike-tripdata.csv')
bike_rental_mon9 = pd.read_csv('./201809-fordgobike-tripdata.csv')
bike_rental_mon10 = pd.read_csv('./201810-fordgobike-tripdata.csv')
bike_rental_mon11 = pd.read_csv('./201811-fordgobike-tripdata.csv')
bike_rental_mon12 = pd.read_csv('./201812-fordgobike-tripdata.csv')
bike_rental_mon12.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 68529 | 2018-12-31 20:03:11.7350 | 2019-01-01 15:05:21.5580 | 217.0 | 27th St at MLK Jr Way | 37.817015 | -122.271761 | 217.0 | 27th St at MLK Jr Way | 37.817015 | -122.271761 | 3305 | Customer | NaN | NaN | No |
1 | 63587 | 2018-12-31 19:00:32.1210 | 2019-01-01 12:40:19.3660 | NaN | NaN | 37.400000 | -121.940000 | NaN | NaN | 37.400000 | -121.940000 | 4281 | Customer | 1995.0 | Male | No |
2 | 64169 | 2018-12-31 15:09:01.0820 | 2019-01-01 08:58:30.0910 | NaN | NaN | 37.400000 | -121.940000 | NaN | NaN | 37.400000 | -121.940000 | 4267 | Customer | 1988.0 | Male | No |
3 | 30550 | 2018-12-31 19:26:20.7750 | 2019-01-01 03:55:30.7930 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 19.0 | Post St at Kearny St | 37.788975 | -122.403452 | 5422 | Subscriber | 1986.0 | Male | Yes |
4 | 2150 | 2018-12-31 23:59:12.0970 | 2019-01-01 00:35:02.1530 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 368.0 | Myrtle St at Polk St | 37.785434 | -122.419622 | 4820 | Customer | NaN | NaN | No |
This data frame has 16 features. Because rental duration drives the rental fee, I'll examine the relationship between duration_sec and the other features.
bike_rental = pd.concat([bike_rental_mon1, bike_rental_mon2, bike_rental_mon3, bike_rental_mon4, bike_rental_mon5, bike_rental_mon6,
bike_rental_mon7, bike_rental_mon8, bike_rental_mon9, bike_rental_mon10, bike_rental_mon11, bike_rental_mon12])
bike_rental.reset_index(drop=True, inplace=True)
bike_rental.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75284 | 2018-01-31 22:52:35.2390 | 2018-02-01 19:47:19.8240 | 120.0 | Mission Dolores Park | 37.761420 | -122.426435 | 285.0 | Webster St at O'Farrell St | 37.783521 | -122.431158 | 2765 | Subscriber | 1986.0 | Male | No |
1 | 85422 | 2018-01-31 16:13:34.3510 | 2018-02-01 15:57:17.3100 | 15.0 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 15.0 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 2815 | Customer | NaN | NaN | No |
2 | 71576 | 2018-01-31 14:23:55.8890 | 2018-02-01 10:16:52.1160 | 304.0 | Jackson St at 5th St | 37.348759 | -121.894798 | 296.0 | 5th St at Virginia St | 37.325998 | -121.877120 | 3039 | Customer | 1996.0 | Male | No |
3 | 61076 | 2018-01-31 14:53:23.5620 | 2018-02-01 07:51:20.5000 | 75.0 | Market St at Franklin St | 37.773793 | -122.421239 | 47.0 | 4th St at Harrison St | 37.780955 | -122.399749 | 321 | Customer | NaN | NaN | No |
4 | 39966 | 2018-01-31 19:52:24.6670 | 2018-02-01 06:58:31.0530 | 74.0 | Laguna St at Hayes St | 37.776435 | -122.426244 | 19.0 | Post St at Kearny St | 37.788975 | -122.403452 | 617 | Subscriber | 1991.0 | Male | No |
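The twelve monthly reads and the concatenation could also be written as a loop. The following is a minimal sketch, assuming the files follow the `YYYYMM-fordgobike-tripdata.csv` naming used above; the helper names `monthly_paths` and `load_trips` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def monthly_paths(year=2018, folder="."):
    """Build the 12 monthly file paths, e.g. ./201801-fordgobike-tripdata.csv."""
    return [Path(folder) / f"{year}{month:02d}-fordgobike-tripdata.csv"
            for month in range(1, 13)]


def load_trips(paths):
    """Read each CSV and stack them into one frame with a fresh 0..n-1 index."""
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True replaces the separate reset_index(drop=True) step
    return pd.concat(frames, ignore_index=True)


# Usage (requires the monthly files on disk):
# bike_rental = load_trips(monthly_paths())
```

Passing `ignore_index=True` to `pd.concat` gives the same continuous index as calling `reset_index(drop=True)` afterwards.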
print(bike_rental.shape)
print(bike_rental.dtypes)
(1863721, 16)
duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object
bike_rental[bike_rental.duplicated()]
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
---|
There are no duplicated rows in the data frame.
# drop null values
bike_rental.dropna(inplace=True)
bike_rental.reset_index(drop=True, inplace=True)
bike_rental
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75284 | 2018-01-31 22:52:35.2390 | 2018-02-01 19:47:19.8240 | 120.0 | Mission Dolores Park | 37.761420 | -122.426435 | 285.0 | Webster St at O'Farrell St | 37.783521 | -122.431158 | 2765 | Subscriber | 1986.0 | Male | No |
1 | 71576 | 2018-01-31 14:23:55.8890 | 2018-02-01 10:16:52.1160 | 304.0 | Jackson St at 5th St | 37.348759 | -121.894798 | 296.0 | 5th St at Virginia St | 37.325998 | -121.877120 | 3039 | Customer | 1996.0 | Male | No |
2 | 39966 | 2018-01-31 19:52:24.6670 | 2018-02-01 06:58:31.0530 | 74.0 | Laguna St at Hayes St | 37.776435 | -122.426244 | 19.0 | Post St at Kearny St | 37.788975 | -122.403452 | 617 | Subscriber | 1991.0 | Male | No |
3 | 453 | 2018-01-31 23:53:53.6320 | 2018-02-01 00:01:26.8050 | 110.0 | 17th & Folsom Street Park (17th St at Folsom St) | 37.763708 | -122.415204 | 134.0 | Valencia St at 24th St | 37.752428 | -122.420628 | 3571 | Subscriber | 1988.0 | Male | No |
4 | 180 | 2018-01-31 23:52:09.9030 | 2018-01-31 23:55:10.8070 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 1403 | Subscriber | 1980.0 | Male | No |
5 | 996 | 2018-01-31 23:34:56.0040 | 2018-01-31 23:51:32.6740 | 134.0 | Valencia St at 24th St | 37.752428 | -122.420628 | 4.0 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 3675 | Subscriber | 1987.0 | Male | Yes |
6 | 825 | 2018-01-31 23:34:14.0270 | 2018-01-31 23:47:59.8090 | 305.0 | Ryland Park | 37.342725 | -121.895617 | 317.0 | San Salvador St at 9th St | 37.333955 | -121.877349 | 1453 | Subscriber | 1994.0 | Female | Yes |
7 | 432 | 2018-01-31 23:34:26.4840 | 2018-01-31 23:41:39.2970 | 89.0 | Division St at Potrero Ave | 37.769218 | -122.407646 | 43.0 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 2928 | Subscriber | 1993.0 | Male | No |
8 | 601 | 2018-01-31 23:29:46.8320 | 2018-01-31 23:39:48.0000 | 223.0 | 16th St Mission BART Station 2 | 37.764765 | -122.420091 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3016 | Subscriber | 1957.0 | Male | No |
9 | 887 | 2018-01-31 23:24:16.3570 | 2018-01-31 23:39:04.1230 | 308.0 | San Pedro Square | 37.336802 | -121.894090 | 297.0 | Locust St at Grant St | 37.322980 | -121.887931 | 55 | Subscriber | 1976.0 | Female | Yes |
10 | 210 | 2018-01-31 23:33:03.0460 | 2018-01-31 23:36:33.7040 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 186.0 | Lakeside Dr at 14th St | 37.801319 | -122.262642 | 2602 | Subscriber | 1976.0 | Male | No |
11 | 188 | 2018-01-31 23:30:58.1360 | 2018-01-31 23:34:06.3910 | 98.0 | Valencia St at 16th St | 37.765052 | -122.421866 | 76.0 | McCoppin St at Valencia St | 37.771662 | -122.422423 | 2556 | Subscriber | 1964.0 | Female | No |
12 | 808 | 2018-01-31 23:19:58.6030 | 2018-01-31 23:33:27.5310 | 67.0 | San Francisco Caltrain Station 2 (Townsend St... | 37.776639 | -122.395526 | 98.0 | Valencia St at 16th St | 37.765052 | -122.421866 | 3041 | Subscriber | 1976.0 | Male | Yes |
13 | 378 | 2018-01-31 23:23:23.0680 | 2018-01-31 23:29:42.0440 | 80.0 | Townsend St at 5th St | 37.775306 | -122.397380 | 78.0 | Folsom St at 9th St | 37.773717 | -122.411647 | 546 | Subscriber | 1995.0 | Female | No |
14 | 686 | 2018-01-31 23:07:15.3130 | 2018-01-31 23:18:41.5580 | 312.0 | San Jose Diridon Station | 37.329732 | -121.901782 | 317.0 | San Salvador St at 9th St | 37.333955 | -121.877349 | 1886 | Subscriber | 1997.0 | Female | No |
15 | 450 | 2018-01-31 23:07:13.0630 | 2018-01-31 23:14:43.8140 | 241.0 | Ashby BART Station | 37.852477 | -122.270213 | 157.0 | 65th St at Hollis St | 37.846784 | -122.291376 | 3583 | Subscriber | 1994.0 | Male | No |
16 | 294 | 2018-01-31 23:08:12.0000 | 2018-01-31 23:13:06.6360 | 239.0 | Bancroft Way at Telegraph Ave | 37.868813 | -122.258764 | 244.0 | Shattuck Ave at Hearst Ave | 37.873792 | -122.268618 | 2144 | Subscriber | 1983.0 | Male | No |
17 | 150 | 2018-01-31 23:10:09.5860 | 2018-01-31 23:12:40.3330 | 182.0 | 19th Street BART Station | 37.809013 | -122.268247 | 183.0 | Telegraph Ave at 19th St | 37.808702 | -122.269927 | 3468 | Subscriber | 1945.0 | Male | Yes |
18 | 462 | 2018-01-31 23:03:48.9400 | 2018-01-31 23:11:31.0290 | 119.0 | 18th St at Noe St | 37.761047 | -122.432642 | 134.0 | Valencia St at 24th St | 37.752428 | -122.420628 | 1432 | Subscriber | 1971.0 | Male | Yes |
19 | 379 | 2018-01-31 23:04:27.7010 | 2018-01-31 23:10:46.9760 | 176.0 | MacArthur BART Station | 37.828410 | -122.266315 | 189.0 | Genoa St at 55th St | 37.839649 | -122.271756 | 997 | Subscriber | 1975.0 | Male | No |
20 | 880 | 2018-01-31 22:53:41.3020 | 2018-01-31 23:08:21.4430 | 123.0 | Folsom St at 19th St | 37.760594 | -122.414817 | 145.0 | 29th St at Church St | 37.743684 | -122.426806 | 3725 | Subscriber | 1986.0 | Male | No |
21 | 1210 | 2018-01-31 22:45:37.1250 | 2018-01-31 23:05:47.5760 | 285.0 | Webster St at O'Farrell St | 37.783521 | -122.431158 | 133.0 | Valencia St at 22nd St | 37.755213 | -122.420975 | 1059 | Subscriber | 1991.0 | Male | No |
22 | 259 | 2018-01-31 23:01:12.7920 | 2018-01-31 23:05:32.1570 | 239.0 | Bancroft Way at Telegraph Ave | 37.868813 | -122.258764 | 266.0 | Parker St at Fulton St | 37.862464 | -122.264791 | 1208 | Subscriber | 1994.0 | Male | No |
23 | 592 | 2018-01-31 22:53:27.7790 | 2018-01-31 23:03:20.2900 | 202.0 | Washington St at 8th St | 37.800754 | -122.274894 | 195.0 | Bay Pl at Vernon St | 37.812314 | -122.260779 | 1834 | Customer | 1978.0 | Male | No |
24 | 1059 | 2018-01-31 22:45:16.5700 | 2018-01-31 23:02:56.2850 | 141.0 | Valencia St at Cesar Chavez St | 37.747998 | -122.420219 | 79.0 | 7th St at Brannan St | 37.773492 | -122.403673 | 1248 | Subscriber | 1988.0 | Male | No |
25 | 375 | 2018-01-31 22:56:31.0820 | 2018-01-31 23:02:46.5830 | 15.0 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 6.0 | The Embarcadero at Sansome St | 37.804770 | -122.403234 | 3401 | Subscriber | 1988.0 | Male | No |
26 | 300 | 2018-01-31 22:57:24.0420 | 2018-01-31 23:02:24.2720 | 114.0 | Rhode Island St at 17th St | 37.764478 | -122.402570 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 3224 | Subscriber | 1981.0 | Male | No |
27 | 2219 | 2018-01-31 22:24:39.9430 | 2018-01-31 23:01:39.5710 | 30.0 | San Francisco Caltrain (Townsend St at 4th St) | 37.776598 | -122.395282 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 1757 | Subscriber | 1991.0 | Female | No |
28 | 330 | 2018-01-31 22:55:48.0730 | 2018-01-31 23:01:18.5060 | 99.0 | Folsom St at 15th St | 37.767037 | -122.415443 | 124.0 | 19th St at Florida St | 37.760447 | -122.410807 | 3379 | Subscriber | 1983.0 | Male | No |
29 | 870 | 2018-01-31 22:45:38.2350 | 2018-01-31 23:00:09.0340 | 285.0 | Webster St at O'Farrell St | 37.783521 | -122.431158 | 106.0 | Sanchez St at 17th St | 37.763242 | -122.430675 | 1503 | Customer | 1990.0 | Female | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1741526 | 331 | 2018-12-01 00:48:27.5290 | 2018-12-01 00:53:59.4950 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 39.0 | Scott St at Golden Gate Ave | 37.778999 | -122.436861 | 968 | Subscriber | 1994.0 | Male | No |
1741527 | 310 | 2018-12-01 00:45:44.8680 | 2018-12-01 00:50:55.7970 | 252.0 | Channing Way at Shattuck Ave | 37.865847 | -122.267443 | 238.0 | MLK Jr Way at University Ave | 37.871719 | -122.273068 | 367 | Subscriber | 1991.0 | Male | No |
1741528 | 1338 | 2018-12-01 00:27:24.8750 | 2018-12-01 00:49:43.3490 | 371.0 | Lombard St at Columbus Ave | 37.802746 | -122.413579 | 104.0 | 4th St at 16th St | 37.767045 | -122.390833 | 1985 | Subscriber | 1986.0 | Male | No |
1741529 | 154 | 2018-12-01 00:44:24.8380 | 2018-12-01 00:46:59.3350 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 52.0 | McAllister St at Baker St | 37.777416 | -122.441838 | 449 | Subscriber | 1994.0 | Male | No |
1741530 | 862 | 2018-12-01 00:32:11.0630 | 2018-12-01 00:46:33.7900 | 4.0 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 4408 | Customer | 1975.0 | Male | No |
1741531 | 1310 | 2018-12-01 00:23:53.3420 | 2018-12-01 00:45:43.5880 | 85.0 | Church St at Duboce Ave | 37.770083 | -122.429156 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 1525 | Subscriber | 1994.0 | Male | Yes |
1741532 | 230 | 2018-12-01 00:41:31.9550 | 2018-12-01 00:45:22.4160 | 312.0 | San Jose Diridon Station | 37.329732 | -121.901782 | 314.0 | Santa Clara St at Almaden Blvd | 37.333988 | -121.894902 | 1379 | Subscriber | 1957.0 | Male | No |
1741533 | 2071 | 2018-12-01 00:09:55.5800 | 2018-12-01 00:44:26.6290 | 36.0 | Folsom St at 3rd St | 37.783830 | -122.398870 | 97.0 | 14th St at Mission St | 37.768265 | -122.420110 | 316 | Customer | 1991.0 | Male | No |
1741534 | 1958 | 2018-12-01 00:11:35.0220 | 2018-12-01 00:44:13.6320 | 44.0 | Civic Center/UN Plaza BART Station (Market St ... | 37.781074 | -122.411738 | 5.0 | Powell St BART Station (Market St at 5th St) | 37.783899 | -122.408445 | 4395 | Subscriber | 1993.0 | Male | No |
1741535 | 268 | 2018-12-01 00:37:23.1340 | 2018-12-01 00:41:51.9670 | 147.0 | 29th St at Tiffany Ave | 37.744067 | -122.421472 | 121.0 | Mission Playground | 37.759210 | -122.421339 | 4451 | Subscriber | 1989.0 | Male | No |
1741536 | 1679 | 2018-12-01 00:13:35.1430 | 2018-12-01 00:41:34.8600 | 240.0 | Haste St at Telegraph Ave | 37.866043 | -122.258804 | 245.0 | Downtown Berkeley BART | 37.870139 | -122.268422 | 3701 | Customer | 1998.0 | Female | No |
1741537 | 1658 | 2018-12-01 00:13:33.1240 | 2018-12-01 00:41:11.7630 | 240.0 | Haste St at Telegraph Ave | 37.866043 | -122.258804 | 245.0 | Downtown Berkeley BART | 37.870139 | -122.268422 | 3625 | Customer | 1998.0 | Female | No |
1741538 | 293 | 2018-12-01 00:36:01.5250 | 2018-12-01 00:40:55.2850 | 95.0 | Sanchez St at 15th St | 37.766219 | -122.431060 | 72.0 | Page St at Scott St | 37.772406 | -122.435650 | 1043 | Subscriber | 1981.0 | Male | No |
1741539 | 426 | 2018-12-01 00:32:13.1250 | 2018-12-01 00:39:19.8710 | 50.0 | 2nd St at Townsend St | 37.780526 | -122.390288 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 3154 | Subscriber | 1986.0 | Male | No |
1741540 | 447 | 2018-12-01 00:31:36.1480 | 2018-12-01 00:39:03.3910 | 368.0 | Myrtle St at Polk St | 37.785434 | -122.419622 | 17.0 | Embarcadero BART Station (Beale St at Market St) | 37.792251 | -122.397086 | 1677 | Subscriber | 1980.0 | Other | Yes |
1741541 | 1694 | 2018-12-01 00:09:17.1590 | 2018-12-01 00:37:31.2400 | 14.0 | Clay St at Battery St | 37.795001 | -122.399970 | 147.0 | 29th St at Tiffany Ave | 37.744067 | -122.421472 | 209 | Subscriber | 1993.0 | Male | No |
1741542 | 269 | 2018-12-01 00:31:00.0910 | 2018-12-01 00:35:29.8710 | 98.0 | Valencia St at 16th St | 37.765052 | -122.421866 | 121.0 | Mission Playground | 37.759210 | -122.421339 | 3147 | Subscriber | 1993.0 | Male | No |
1741543 | 685 | 2018-12-01 00:21:16.2400 | 2018-12-01 00:32:42.0000 | 109.0 | 17th St at Valencia St | 37.763316 | -122.421904 | 60.0 | 8th St at Ringold St | 37.774520 | -122.409449 | 3247 | Subscriber | 1987.0 | Female | No |
1741544 | 681 | 2018-12-01 00:19:41.3830 | 2018-12-01 00:31:02.6050 | 15.0 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 3460 | Subscriber | 1959.0 | Male | No |
1741545 | 1166 | 2018-12-01 00:11:04.8640 | 2018-12-01 00:30:31.3480 | 160.0 | West Oakland BART Station | 37.805318 | -122.294837 | 155.0 | Emeryville Public Market | 37.840521 | -122.293528 | 1579 | Subscriber | 1997.0 | Male | No |
1741546 | 763 | 2018-12-01 00:17:34.4970 | 2018-12-01 00:30:18.1070 | 73.0 | Pierce St at Haight St | 37.771793 | -122.433708 | 121.0 | Mission Playground | 37.759210 | -122.421339 | 90 | Subscriber | 1991.0 | Female | No |
1741547 | 760 | 2018-12-01 00:17:29.9600 | 2018-12-01 00:30:10.1780 | 73.0 | Pierce St at Haight St | 37.771793 | -122.433708 | 121.0 | Mission Playground | 37.759210 | -122.421339 | 2758 | Subscriber | 1991.0 | Male | No |
1741548 | 538 | 2018-12-01 00:16:34.0850 | 2018-12-01 00:25:32.4550 | 5.0 | Powell St BART Station (Market St at 5th St) | 37.783899 | -122.408445 | 39.0 | Scott St at Golden Gate Ave | 37.778999 | -122.436861 | 4384 | Subscriber | 1991.0 | Male | No |
1741549 | 671 | 2018-12-01 00:12:49.6400 | 2018-12-01 00:24:01.5120 | 34.0 | Father Alfred E Boeddeker Park | 37.783988 | -122.412408 | 92.0 | Mission Bay Kids Park | 37.772301 | -122.393028 | 4377 | Subscriber | 1972.0 | Male | Yes |
1741550 | 498 | 2018-12-01 00:14:41.7250 | 2018-12-01 00:23:00.4080 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 214.0 | Market St at Brockhurst St | 37.823321 | -122.275732 | 2236 | Subscriber | 1992.0 | Male | No |
1741551 | 1137 | 2018-12-01 00:01:49.6930 | 2018-12-01 00:20:47.5190 | 73.0 | Pierce St at Haight St | 37.771793 | -122.433708 | 50.0 | 2nd St at Townsend St | 37.780526 | -122.390288 | 273 | Subscriber | 1990.0 | Male | No |
1741552 | 473 | 2018-12-01 00:11:54.8110 | 2018-12-01 00:19:48.5470 | 345.0 | Hubbell St at 16th St | 37.766474 | -122.398295 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 3035 | Subscriber | 1982.0 | Female | No |
1741553 | 841 | 2018-12-01 00:02:48.7260 | 2018-12-01 00:16:49.7660 | 10.0 | Washington St at Kearny St | 37.795393 | -122.404770 | 58.0 | Market St at 10th St | 37.776619 | -122.417385 | 2034 | Subscriber | 1999.0 | Female | No |
1741554 | 260 | 2018-12-01 00:05:27.6150 | 2018-12-01 00:09:47.9560 | 245.0 | Downtown Berkeley BART | 37.870139 | -122.268422 | 255.0 | Virginia St at Shattuck Ave | 37.876573 | -122.269528 | 2243 | Subscriber | 1991.0 | Male | No |
1741555 | 292 | 2018-12-01 00:03:06.5490 | 2018-12-01 00:07:59.0800 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 126.0 | Esprit Park | 37.761634 | -122.390648 | 545 | Subscriber | 1963.0 | Male | No |
1741556 rows × 16 columns
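Before dropping rows, it can help to see which columns hold the nulls and how much data `dropna()` removes. The sketch below uses a toy frame built from a few values in the December preview above; `missing_summary` is a hypothetical helper, not part of the original notebook. The row counts at the end are the ones reported by this notebook (1,863,721 before and 1,741,556 after).

```python
import pandas as pd


def missing_summary(df):
    """Return per-column null counts and the share of rows dropna() would remove."""
    null_counts = df.isna().sum()
    rows_with_null = int(df.isna().any(axis=1).sum())
    return null_counts, rows_with_null / len(df)


# Toy frame with a few values taken from the December preview
toy = pd.DataFrame({
    "duration_sec": [68529, 63587, 30550, 2150],
    "start_station_id": [217.0, None, 13.0, 3.0],
    "member_birth_year": [None, 1995.0, 1986.0, None],
})
null_counts, dropped_share = missing_summary(toy)

# Row counts reported in this notebook before and after dropna()
rows_before, rows_after = 1863721, 1741556
rows_dropped = rows_before - rows_after        # 122165 rows
share_dropped = rows_dropped / rows_before     # roughly 6.6% of all trips
```

On the real data, `missing_summary(bike_rental)` would have to be called before `dropna()` to be informative.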
I'll convert the columns to appropriate data types to make the analysis easier:
start_time: object -> datetime
end_time: object -> datetime
start_station_id: float -> int
end_station_id: float -> int
user_type: object -> CategoricalDtype
member_birth_year: float -> int
member_gender: object -> CategoricalDtype
bike_share_for_all_trip: object -> CategoricalDtype
# Target dtypes for the datetime and integer columns
numeric_column_dtype = {'start_time': 'datetime64[ns]',
                        'end_time': 'datetime64[ns]',
                        'start_station_id': 'int64',
                        'end_station_id': 'int64',
                        'member_birth_year': 'int64'}
# Allowed levels for each categorical column; any value not listed here
# (for example 'Other' in member_gender) becomes NaN after the conversion
categorical_column_dtype = {'user_type': ['Subscriber', 'Customer'],
                            'member_gender': ['Male', 'Female'],
                            'bike_share_for_all_trip': ['Yes', 'No']}

def change_into_dtype(dataframe, column, dtype):
    '''change the column type of dataframe into dtype'''
    dataframe[column] = dataframe[column].astype(dtype)

def iterate_dict_and_change_numeric_dtype(dataframe):
    '''change numeric columns type of dataframe into dtype'''
    for column in numeric_column_dtype:
        change_into_dtype(dataframe, column, numeric_column_dtype[column])

def iterate_dict_and_change_categorical_dtype(dataframe):
    '''change categorical columns type of dataframe into dtype'''
    for column in categorical_column_dtype:
        target_dtype = pd.api.types.CategoricalDtype(ordered=True,
                                                     categories=categorical_column_dtype[column])
        dataframe[column] = dataframe[column].astype(target_dtype)

iterate_dict_and_change_numeric_dtype(bike_rental)
iterate_dict_and_change_categorical_dtype(bike_rental)
bike_rental.dtypes
duration_sec                        int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id                    int64
start_station_name                 object
start_station_latitude            float64
start_station_longitude           float64
end_station_id                      int64
end_station_name                   object
end_station_latitude              float64
end_station_longitude             float64
bike_id                             int64
user_type                        category
member_birth_year                   int64
member_gender                    category
bike_share_for_all_trip          category
dtype: object
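One thing to keep in mind with the categorical conversion: values that are not in the category list are silently turned into NaN. The December rows shown earlier include a trip with `member_gender` equal to `Other`, so a category list with only `Male` and `Female` would drop it from later counts. A small sketch on toy data (the series below is illustrative, not the real column):

```python
import pandas as pd

# Toy gender column including a value outside the two-level category list
gender = pd.Series(["Male", "Female", "Other", "Male"])

two_levels = pd.api.types.CategoricalDtype(categories=["Male", "Female"])
three_levels = pd.api.types.CategoricalDtype(categories=["Male", "Female", "Other"])

# Values missing from the category list become NaN after astype()
lost = int(gender.astype(two_levels).isna().sum())
kept = int(gender.astype(three_levels).isna().sum())
```

Whether to keep `Other` as a third level is a modelling choice; the sketch only shows what happens under each option.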
After cleaning, this data set has 1,741,556 rows × 16 columns.
The main focus is duration_sec, because it is closely tied to the rental fee.
I expect start hour, month, user type, gender, and birth year to be the features most related to it.
I'll start by looking at the distribution of duration_sec.
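Start hour and month are not columns in the raw data, so they would need to be derived from start_time. A minimal sketch on two timestamps taken from the tables above; the new column names `start_hour`, `start_month`, and `member_age` are assumptions made for illustration.

```python
import pandas as pd

# Two example trips using start times and birth years from the tables above
trips = pd.DataFrame({
    "start_time": pd.to_datetime(["2018-01-31 22:52:35.239",
                                  "2018-12-01 00:48:27.529"]),
    "member_birth_year": [1986, 1994],
})

# The .dt accessor exposes datetime components once the column is datetime64
trips["start_hour"] = trips["start_time"].dt.hour
trips["start_month"] = trips["start_time"].dt.month

# Approximate age during the 2018 data year (ignores the exact birthday)
trips["member_age"] = 2018 - trips["member_birth_year"]
```

The same three assignments should work on `bike_rental` after the dtype conversion.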
# create duration_sec bins
duration_bins = np.arange(50, bike_rental.duration_sec.max(), 200)
duration_bins
array([   50,   250,   450, ..., 85850, 86050, 86250], dtype=int64)
# plot histogram of duration_sec
base_color = sns.color_palette()[0]
plt.hist(data=bike_rental, x='duration_sec', color=base_color, bins=duration_bins)
plt.xlim(0, 4000)
plt.title('Distribution of Rental Duration')
plt.xlabel('Rental Duration (sec)')
plt.ylabel('Counts');
Many trips last about 500 seconds. The distribution is highly right-skewed, so I'll apply a log transform.
np.log10(bike_rental.duration_sec.describe())
count    6.240937
mean     2.888102
std      3.288484
min      1.785330
25%      2.536558
50%      2.734800
75%      2.923762
max      4.935915
Name: duration_sec, dtype: float64
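The values above are base-10 logarithms of the summary statistics, so raising 10 to each one recovers seconds. A short sketch using the quantile rows copied from that output:

```python
# log10 values copied from the describe() output above
log10_summary = {
    "min": 1.785330,
    "25%": 2.536558,
    "50%": 2.734800,
    "75%": 2.923762,
    "max": 4.935915,
}

# Undo the log10 transform to get seconds, then convert to minutes
seconds = {name: 10 ** value for name, value in log10_summary.items()}
minutes = {name: value / 60 for name, value in seconds.items()}
```

The median works out to about 543 seconds (roughly 9 minutes), and the middle half of trips falls between about 344 and 839 seconds.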
# plot histogram of duration_sec log transformation
log_duration_bins = 10 ** np.arange(1, 5.0+0.1, 0.1)
ticks = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000]
labels = ['{}'.format(val) for val in ticks]
plt.hist(data=bike_rental, x='duration_sec', bins=log_duration_bins)
plt.xscale('log')
plt.xticks(ticks, labels)
plt.title('Distribution of Rental Duration (log scale)')
plt.xlabel('Log of Rental Duration (sec)')
plt.ylabel('Counts');
On the log scale, the distribution looks approximately normal, centered near the median of roughly 540 seconds.
sns.countplot(data=bike_rental, x='member_gender', color=base_color);
plt.title('Distribution of member gender');
There are about three times as many trips by male members as by female members.
sns.countplot(data=bike_rental, x='user_type', color=base_color)
plt.title('Distribution of user type');
There are about seven times as many trips by subscribers as by customers.
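The ratios read off the count plots can be checked numerically with `value_counts()`. A minimal sketch on toy data; `category_ratio` is a hypothetical helper, and the toy series is built to have a 7:1 ratio rather than taken from the real column.

```python
import pandas as pd


def category_ratio(series, numerator, denominator):
    """Ratio of the counts of two categories in a column."""
    counts = series.value_counts()
    return counts[numerator] / counts[denominator]


# Toy column constructed with seven subscribers per customer
toy_user_type = pd.Series(["Subscriber"] * 7 + ["Customer"])
ratio = category_ratio(toy_user_type, "Subscriber", "Customer")
```

On the real data this would be called as `category_ratio(bike_rental['user_type'], 'Subscriber', 'Customer')`, and likewise for `member_gender` with `'Male'` and `'Female'`.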
bike_rental_cut_duration_outliers = bike_rental.query('duration_sec < 2500')
bike_rental_cut_duration_outliers.reset_index(drop=True, inplace=True)
bike_rental_cut_duration_outliers
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 453 | 2018-01-31 23:53:53.632 | 2018-02-01 00:01:26.805 | 110 | 17th & Folsom Street Park (17th St at Folsom St) | 37.763708 | -122.415204 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 3571 | Subscriber | 1988 | Male | No |
1 | 180 | 2018-01-31 23:52:09.903 | 2018-01-31 23:55:10.807 | 81 | Berry St at 4th St | 37.775880 | -122.393170 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 1403 | Subscriber | 1980 | Male | No |
2 | 996 | 2018-01-31 23:34:56.004 | 2018-01-31 23:51:32.674 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 4 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 3675 | Subscriber | 1987 | Male | Yes |
3 | 825 | 2018-01-31 23:34:14.027 | 2018-01-31 23:47:59.809 | 305 | Ryland Park | 37.342725 | -121.895617 | 317 | San Salvador St at 9th St | 37.333955 | -121.877349 | 1453 | Subscriber | 1994 | Female | Yes |
4 | 432 | 2018-01-31 23:34:26.484 | 2018-01-31 23:41:39.297 | 89 | Division St at Potrero Ave | 37.769218 | -122.407646 | 43 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 2928 | Subscriber | 1993 | Male | No |
5 | 601 | 2018-01-31 23:29:46.832 | 2018-01-31 23:39:48.000 | 223 | 16th St Mission BART Station 2 | 37.764765 | -122.420091 | 86 | Market St at Dolores St | 37.769305 | -122.426826 | 3016 | Subscriber | 1957 | Male | No |
6 | 887 | 2018-01-31 23:24:16.357 | 2018-01-31 23:39:04.123 | 308 | San Pedro Square | 37.336802 | -121.894090 | 297 | Locust St at Grant St | 37.322980 | -121.887931 | 55 | Subscriber | 1976 | Female | Yes |
7 | 210 | 2018-01-31 23:33:03.046 | 2018-01-31 23:36:33.704 | 7 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 186 | Lakeside Dr at 14th St | 37.801319 | -122.262642 | 2602 | Subscriber | 1976 | Male | No |
8 | 188 | 2018-01-31 23:30:58.136 | 2018-01-31 23:34:06.391 | 98 | Valencia St at 16th St | 37.765052 | -122.421866 | 76 | McCoppin St at Valencia St | 37.771662 | -122.422423 | 2556 | Subscriber | 1964 | Female | No |
9 | 808 | 2018-01-31 23:19:58.603 | 2018-01-31 23:33:27.531 | 67 | San Francisco Caltrain Station 2 (Townsend St... | 37.776639 | -122.395526 | 98 | Valencia St at 16th St | 37.765052 | -122.421866 | 3041 | Subscriber | 1976 | Male | Yes |
10 | 378 | 2018-01-31 23:23:23.068 | 2018-01-31 23:29:42.044 | 80 | Townsend St at 5th St | 37.775306 | -122.397380 | 78 | Folsom St at 9th St | 37.773717 | -122.411647 | 546 | Subscriber | 1995 | Female | No |
11 | 686 | 2018-01-31 23:07:15.313 | 2018-01-31 23:18:41.558 | 312 | San Jose Diridon Station | 37.329732 | -121.901782 | 317 | San Salvador St at 9th St | 37.333955 | -121.877349 | 1886 | Subscriber | 1997 | Female | No |
12 | 450 | 2018-01-31 23:07:13.063 | 2018-01-31 23:14:43.814 | 241 | Ashby BART Station | 37.852477 | -122.270213 | 157 | 65th St at Hollis St | 37.846784 | -122.291376 | 3583 | Subscriber | 1994 | Male | No |
13 | 294 | 2018-01-31 23:08:12.000 | 2018-01-31 23:13:06.636 | 239 | Bancroft Way at Telegraph Ave | 37.868813 | -122.258764 | 244 | Shattuck Ave at Hearst Ave | 37.873792 | -122.268618 | 2144 | Subscriber | 1983 | Male | No |
14 | 150 | 2018-01-31 23:10:09.586 | 2018-01-31 23:12:40.333 | 182 | 19th Street BART Station | 37.809013 | -122.268247 | 183 | Telegraph Ave at 19th St | 37.808702 | -122.269927 | 3468 | Subscriber | 1945 | Male | Yes |
15 | 462 | 2018-01-31 23:03:48.940 | 2018-01-31 23:11:31.029 | 119 | 18th St at Noe St | 37.761047 | -122.432642 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 1432 | Subscriber | 1971 | Male | Yes |
16 | 379 | 2018-01-31 23:04:27.701 | 2018-01-31 23:10:46.976 | 176 | MacArthur BART Station | 37.828410 | -122.266315 | 189 | Genoa St at 55th St | 37.839649 | -122.271756 | 997 | Subscriber | 1975 | Male | No |
17 | 880 | 2018-01-31 22:53:41.302 | 2018-01-31 23:08:21.443 | 123 | Folsom St at 19th St | 37.760594 | -122.414817 | 145 | 29th St at Church St | 37.743684 | -122.426806 | 3725 | Subscriber | 1986 | Male | No |
18 | 1210 | 2018-01-31 22:45:37.125 | 2018-01-31 23:05:47.576 | 285 | Webster St at O'Farrell St | 37.783521 | -122.431158 | 133 | Valencia St at 22nd St | 37.755213 | -122.420975 | 1059 | Subscriber | 1991 | Male | No |
19 | 259 | 2018-01-31 23:01:12.792 | 2018-01-31 23:05:32.157 | 239 | Bancroft Way at Telegraph Ave | 37.868813 | -122.258764 | 266 | Parker St at Fulton St | 37.862464 | -122.264791 | 1208 | Subscriber | 1994 | Male | No |
20 | 592 | 2018-01-31 22:53:27.779 | 2018-01-31 23:03:20.290 | 202 | Washington St at 8th St | 37.800754 | -122.274894 | 195 | Bay Pl at Vernon St | 37.812314 | -122.260779 | 1834 | Customer | 1978 | Male | No |
21 | 1059 | 2018-01-31 22:45:16.570 | 2018-01-31 23:02:56.285 | 141 | Valencia St at Cesar Chavez St | 37.747998 | -122.420219 | 79 | 7th St at Brannan St | 37.773492 | -122.403673 | 1248 | Subscriber | 1988 | Male | No |
22 | 375 | 2018-01-31 22:56:31.082 | 2018-01-31 23:02:46.583 | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 6 | The Embarcadero at Sansome St | 37.804770 | -122.403234 | 3401 | Subscriber | 1988 | Male | No |
23 | 300 | 2018-01-31 22:57:24.042 | 2018-01-31 23:02:24.272 | 114 | Rhode Island St at 17th St | 37.764478 | -122.402570 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 3224 | Subscriber | 1981 | Male | No |
24 | 2219 | 2018-01-31 22:24:39.943 | 2018-01-31 23:01:39.571 | 30 | San Francisco Caltrain (Townsend St at 4th St) | 37.776598 | -122.395282 | 81 | Berry St at 4th St | 37.775880 | -122.393170 | 1757 | Subscriber | 1991 | Female | No |
25 | 330 | 2018-01-31 22:55:48.073 | 2018-01-31 23:01:18.506 | 99 | Folsom St at 15th St | 37.767037 | -122.415443 | 124 | 19th St at Florida St | 37.760447 | -122.410807 | 3379 | Subscriber | 1983 | Male | No |
26 | 870 | 2018-01-31 22:45:38.235 | 2018-01-31 23:00:09.034 | 285 | Webster St at O'Farrell St | 37.783521 | -122.431158 | 106 | Sanchez St at 17th St | 37.763242 | -122.430675 | 1503 | Customer | 1990 | Female | No |
27 | 2032 | 2018-01-31 22:25:00.185 | 2018-01-31 22:58:53.071 | 182 | 19th Street BART Station | 37.809013 | -122.268247 | 266 | Parker St at Fulton St | 37.862464 | -122.264791 | 2169 | Subscriber | 1990 | Male | No |
28 | 196 | 2018-01-31 22:55:25.413 | 2018-01-31 22:58:42.142 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 81 | Berry St at 4th St | 37.775880 | -122.393170 | 1403 | Subscriber | 1980 | Male | No |
29 | 1499 | 2018-01-31 22:33:06.643 | 2018-01-31 22:58:06.304 | 44 | Civic Center/UN Plaza BART Station (Market St ... | 37.781074 | -122.411738 | 144 | Precita Park | 37.747300 | -122.411403 | 2528 | Subscriber | 1970 | Female | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1706901 | 331 | 2018-12-01 00:48:27.529 | 2018-12-01 00:53:59.495 | 70 | Central Ave at Fell St | 37.773311 | -122.444293 | 39 | Scott St at Golden Gate Ave | 37.778999 | -122.436861 | 968 | Subscriber | 1994 | Male | No |
1706902 | 310 | 2018-12-01 00:45:44.868 | 2018-12-01 00:50:55.797 | 252 | Channing Way at Shattuck Ave | 37.865847 | -122.267443 | 238 | MLK Jr Way at University Ave | 37.871719 | -122.273068 | 367 | Subscriber | 1991 | Male | No |
1706903 | 1338 | 2018-12-01 00:27:24.875 | 2018-12-01 00:49:43.349 | 371 | Lombard St at Columbus Ave | 37.802746 | -122.413579 | 104 | 4th St at 16th St | 37.767045 | -122.390833 | 1985 | Subscriber | 1986 | Male | No |
1706904 | 154 | 2018-12-01 00:44:24.838 | 2018-12-01 00:46:59.335 | 70 | Central Ave at Fell St | 37.773311 | -122.444293 | 52 | McAllister St at Baker St | 37.777416 | -122.441838 | 449 | Subscriber | 1994 | Male | No |
1706905 | 862 | 2018-12-01 00:32:11.063 | 2018-12-01 00:46:33.790 | 4 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 86 | Market St at Dolores St | 37.769305 | -122.426826 | 4408 | Customer | 1975 | Male | No |
1706906 | 1310 | 2018-12-01 00:23:53.342 | 2018-12-01 00:45:43.588 | 85 | Church St at Duboce Ave | 37.770083 | -122.429156 | 13 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 1525 | Subscriber | 1994 | Male | Yes |
1706907 | 230 | 2018-12-01 00:41:31.955 | 2018-12-01 00:45:22.416 | 312 | San Jose Diridon Station | 37.329732 | -121.901782 | 314 | Santa Clara St at Almaden Blvd | 37.333988 | -121.894902 | 1379 | Subscriber | 1957 | Male | No |
1706908 | 2071 | 2018-12-01 00:09:55.580 | 2018-12-01 00:44:26.629 | 36 | Folsom St at 3rd St | 37.783830 | -122.398870 | 97 | 14th St at Mission St | 37.768265 | -122.420110 | 316 | Customer | 1991 | Male | No |
1706909 | 1958 | 2018-12-01 00:11:35.022 | 2018-12-01 00:44:13.632 | 44 | Civic Center/UN Plaza BART Station (Market St ... | 37.781074 | -122.411738 | 5 | Powell St BART Station (Market St at 5th St) | 37.783899 | -122.408445 | 4395 | Subscriber | 1993 | Male | No |
1706910 | 268 | 2018-12-01 00:37:23.134 | 2018-12-01 00:41:51.967 | 147 | 29th St at Tiffany Ave | 37.744067 | -122.421472 | 121 | Mission Playground | 37.759210 | -122.421339 | 4451 | Subscriber | 1989 | Male | No |
1706911 | 1679 | 2018-12-01 00:13:35.143 | 2018-12-01 00:41:34.860 | 240 | Haste St at Telegraph Ave | 37.866043 | -122.258804 | 245 | Downtown Berkeley BART | 37.870139 | -122.268422 | 3701 | Customer | 1998 | Female | No |
1706912 | 1658 | 2018-12-01 00:13:33.124 | 2018-12-01 00:41:11.763 | 240 | Haste St at Telegraph Ave | 37.866043 | -122.258804 | 245 | Downtown Berkeley BART | 37.870139 | -122.268422 | 3625 | Customer | 1998 | Female | No |
1706913 | 293 | 2018-12-01 00:36:01.525 | 2018-12-01 00:40:55.285 | 95 | Sanchez St at 15th St | 37.766219 | -122.431060 | 72 | Page St at Scott St | 37.772406 | -122.435650 | 1043 | Subscriber | 1981 | Male | No |
1706914 | 426 | 2018-12-01 00:32:13.125 | 2018-12-01 00:39:19.871 | 50 | 2nd St at Townsend St | 37.780526 | -122.390288 | 21 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 3154 | Subscriber | 1986 | Male | No |
1706915 | 447 | 2018-12-01 00:31:36.148 | 2018-12-01 00:39:03.391 | 368 | Myrtle St at Polk St | 37.785434 | -122.419622 | 17 | Embarcadero BART Station (Beale St at Market St) | 37.792251 | -122.397086 | 1677 | Subscriber | 1980 | NaN | Yes |
1706916 | 1694 | 2018-12-01 00:09:17.159 | 2018-12-01 00:37:31.240 | 14 | Clay St at Battery St | 37.795001 | -122.399970 | 147 | 29th St at Tiffany Ave | 37.744067 | -122.421472 | 209 | Subscriber | 1993 | Male | No |
1706917 | 269 | 2018-12-01 00:31:00.091 | 2018-12-01 00:35:29.871 | 98 | Valencia St at 16th St | 37.765052 | -122.421866 | 121 | Mission Playground | 37.759210 | -122.421339 | 3147 | Subscriber | 1993 | Male | No |
1706918 | 685 | 2018-12-01 00:21:16.240 | 2018-12-01 00:32:42.000 | 109 | 17th St at Valencia St | 37.763316 | -122.421904 | 60 | 8th St at Ringold St | 37.774520 | -122.409449 | 3247 | Subscriber | 1987 | Female | No |
1706919 | 681 | 2018-12-01 00:19:41.383 | 2018-12-01 00:31:02.605 | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 323 | Broadway at Kearny | 37.798014 | -122.405950 | 3460 | Subscriber | 1959 | Male | No |
1706920 | 1166 | 2018-12-01 00:11:04.864 | 2018-12-01 00:30:31.348 | 160 | West Oakland BART Station | 37.805318 | -122.294837 | 155 | Emeryville Public Market | 37.840521 | -122.293528 | 1579 | Subscriber | 1997 | Male | No |
1706921 | 763 | 2018-12-01 00:17:34.497 | 2018-12-01 00:30:18.107 | 73 | Pierce St at Haight St | 37.771793 | -122.433708 | 121 | Mission Playground | 37.759210 | -122.421339 | 90 | Subscriber | 1991 | Female | No |
1706922 | 760 | 2018-12-01 00:17:29.960 | 2018-12-01 00:30:10.178 | 73 | Pierce St at Haight St | 37.771793 | -122.433708 | 121 | Mission Playground | 37.759210 | -122.421339 | 2758 | Subscriber | 1991 | Male | No |
1706923 | 538 | 2018-12-01 00:16:34.085 | 2018-12-01 00:25:32.455 | 5 | Powell St BART Station (Market St at 5th St) | 37.783899 | -122.408445 | 39 | Scott St at Golden Gate Ave | 37.778999 | -122.436861 | 4384 | Subscriber | 1991 | Male | No |
1706924 | 671 | 2018-12-01 00:12:49.640 | 2018-12-01 00:24:01.512 | 34 | Father Alfred E Boeddeker Park | 37.783988 | -122.412408 | 92 | Mission Bay Kids Park | 37.772301 | -122.393028 | 4377 | Subscriber | 1972 | Male | Yes |
1706925 | 498 | 2018-12-01 00:14:41.725 | 2018-12-01 00:23:00.408 | 7 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 214 | Market St at Brockhurst St | 37.823321 | -122.275732 | 2236 | Subscriber | 1992 | Male | No |
1706926 | 1137 | 2018-12-01 00:01:49.693 | 2018-12-01 00:20:47.519 | 73 | Pierce St at Haight St | 37.771793 | -122.433708 | 50 | 2nd St at Townsend St | 37.780526 | -122.390288 | 273 | Subscriber | 1990 | Male | No |
1706927 | 473 | 2018-12-01 00:11:54.811 | 2018-12-01 00:19:48.547 | 345 | Hubbell St at 16th St | 37.766474 | -122.398295 | 81 | Berry St at 4th St | 37.775880 | -122.393170 | 3035 | Subscriber | 1982 | Female | No |
1706928 | 841 | 2018-12-01 00:02:48.726 | 2018-12-01 00:16:49.766 | 10 | Washington St at Kearny St | 37.795393 | -122.404770 | 58 | Market St at 10th St | 37.776619 | -122.417385 | 2034 | Subscriber | 1999 | Female | No |
1706929 | 260 | 2018-12-01 00:05:27.615 | 2018-12-01 00:09:47.956 | 245 | Downtown Berkeley BART | 37.870139 | -122.268422 | 255 | Virginia St at Shattuck Ave | 37.876573 | -122.269528 | 2243 | Subscriber | 1991 | Male | No |
1706930 | 292 | 2018-12-01 00:03:06.549 | 2018-12-01 00:07:59.080 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 126 | Esprit Park | 37.761634 | -122.390648 | 545 | Subscriber | 1963 | Male | No |
1706931 rows × 16 columns
birth_bins = np.arange(bike_rental_cut_duration_outliers.member_birth_year.min(), bike_rental_cut_duration_outliers.member_birth_year.max(), 5)
birth_bins
array([1881, 1886, 1891, 1896, 1901, 1906, 1911, 1916, 1921, 1926, 1931, 1936, 1941, 1946, 1951, 1956, 1961, 1966, 1971, 1976, 1981, 1986, 1991, 1996], dtype=int64)
plt.hist(data=bike_rental_cut_duration_outliers, x='member_birth_year', bins=birth_bins);
plt.title('Distribution of user\'s birth year')
plt.xlabel('User\'s birth year')
plt.ylabel('Count');
The histogram is strongly left-skewed, which means most riders are relatively young.
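One detail in the binning above is worth a quick check: `np.arange` excludes its stop value, so edges built from `min()` to `max()` can stop short of the newest birth years (the printed edges end at 1996). A minimal sketch with made-up birth years, assuming one extra step is added to the stop value so the last edge covers the maximum:

```python
import numpy as np

# Hypothetical birth years, for illustration only
years = np.array([1881, 1950, 1988, 1995, 1999, 2000])
step = 5

# The stop value is exclusive, so the last edge can fall below the maximum
edges_short = np.arange(years.min(), years.max(), step)

# Adding one step to the stop value guarantees the last edge reaches the maximum
edges_full = np.arange(years.min(), years.max() + step, step)

# np.histogram silently drops values outside the bin range
counts_short, _ = np.histogram(years, bins=edges_short)
counts_full, _ = np.histogram(years, bins=edges_full)

print(edges_short[-1], counts_short.sum())
print(edges_full[-1], counts_full.sum())
```

Values beyond the last edge are dropped without any warning, so comparing the histogram total with the row count is a cheap sanity check.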
# extract start hour from start_time; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
bike_rental_cut_duration_outliers = bike_rental_cut_duration_outliers.assign(start_hour=bike_rental_cut_duration_outliers.start_time.dt.hour)
bike_rental_cut_duration_outliers.head()
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | start_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 453 | 2018-01-31 23:53:53.632 | 2018-02-01 00:01:26.805 | 110 | 17th & Folsom Street Park (17th St at Folsom St) | 37.763708 | -122.415204 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 3571 | Subscriber | 1988 | Male | No | 23 |
1 | 180 | 2018-01-31 23:52:09.903 | 2018-01-31 23:55:10.807 | 81 | Berry St at 4th St | 37.775880 | -122.393170 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 1403 | Subscriber | 1980 | Male | No | 23 |
2 | 996 | 2018-01-31 23:34:56.004 | 2018-01-31 23:51:32.674 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 4 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 3675 | Subscriber | 1987 | Male | Yes | 23 |
3 | 825 | 2018-01-31 23:34:14.027 | 2018-01-31 23:47:59.809 | 305 | Ryland Park | 37.342725 | -121.895617 | 317 | San Salvador St at 9th St | 37.333955 | -121.877349 | 1453 | Subscriber | 1994 | Female | Yes | 23 |
4 | 432 | 2018-01-31 23:34:26.484 | 2018-01-31 23:41:39.297 | 89 | Division St at Potrero Ave | 37.769218 | -122.407646 | 43 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 2928 | Subscriber | 1993 | Male | No | 23 |
hour_bins = np.arange(0, 24+1, 1)  # 25 edges, so each hour 0-23 gets its own bin
plt.hist(data=bike_rental_cut_duration_outliers, x='start_hour', color=base_color, bins=hour_bins)
plt.title('Distribution of start hours')
plt.xlabel('Start hours')
plt.ylabel('Counts');
Trips start most often between 8:00 and 9:00 and between 17:00 and 18:00, that is, in the morning and in the evening.
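The hourly peaks can also be checked numerically rather than read off the plot. A minimal sketch on made-up timestamps (the `trips` frame and its values are hypothetical), using the `.dt.hour` accessor and `value_counts`:

```python
import pandas as pd

# Hypothetical start times, for illustration only
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2018-01-31 08:05:00", "2018-01-31 08:40:00", "2018-01-31 17:15:00",
        "2018-01-31 17:50:00", "2018-01-31 17:55:00", "2018-01-31 23:10:00",
    ])
})

# .dt.hour is the vectorized equivalent of apply(lambda x: x.time().hour)
trips["start_hour"] = trips["start_time"].dt.hour

# Count trips per hour, keeping every hour 0-23 even when it has no trips
trips_per_hour = (
    trips["start_hour"].value_counts().reindex(range(24), fill_value=0).sort_index()
)

busiest_hour = trips_per_hour.idxmax()
print(trips_per_hour[trips_per_hour > 0])
print("busiest hour:", busiest_hour)
```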
# extract end hour from end_time; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
bike_rental_cut_duration_outliers = bike_rental_cut_duration_outliers.assign(end_hour=bike_rental_cut_duration_outliers.end_time.dt.hour)
bike_rental_cut_duration_outliers.head()
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | start_hour | end_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 453 | 2018-01-31 23:53:53.632 | 2018-02-01 00:01:26.805 | 110 | 17th & Folsom Street Park (17th St at Folsom St) | 37.763708 | -122.415204 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 3571 | Subscriber | 1988 | Male | No | 23 | 0 |
1 | 180 | 2018-01-31 23:52:09.903 | 2018-01-31 23:55:10.807 | 81 | Berry St at 4th St | 37.775880 | -122.393170 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 1403 | Subscriber | 1980 | Male | No | 23 | 23 |
2 | 996 | 2018-01-31 23:34:56.004 | 2018-01-31 23:51:32.674 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 4 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 3675 | Subscriber | 1987 | Male | Yes | 23 | 23 |
3 | 825 | 2018-01-31 23:34:14.027 | 2018-01-31 23:47:59.809 | 305 | Ryland Park | 37.342725 | -121.895617 | 317 | San Salvador St at 9th St | 37.333955 | -121.877349 | 1453 | Subscriber | 1994 | Female | Yes | 23 | 23 |
4 | 432 | 2018-01-31 23:34:26.484 | 2018-01-31 23:41:39.297 | 89 | Division St at Potrero Ave | 37.769218 | -122.407646 | 43 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 2928 | Subscriber | 1993 | Male | No | 23 | 23 |
plt.hist(data=bike_rental_cut_duration_outliers, x='end_hour', color=base_color, bins=hour_bins)
plt.title('Distribution of end hours')
plt.xlabel('End hours')
plt.ylabel('Counts');
Trip end times show the same two peaks: between 8:00 and 9:00 and between 17:00 and 18:00.
# extract month from start_time; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
bike_rental_cut_duration_outliers = bike_rental_cut_duration_outliers.assign(month=bike_rental_cut_duration_outliers.start_time.dt.month)
bike_rental_cut_duration_outliers.head()
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | start_hour | end_hour | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 453 | 2018-01-31 23:53:53.632 | 2018-02-01 00:01:26.805 | 110 | 17th & Folsom Street Park (17th St at Folsom St) | 37.763708 | -122.415204 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 3571 | Subscriber | 1988 | Male | No | 23 | 0 | 1 |
1 | 180 | 2018-01-31 23:52:09.903 | 2018-01-31 23:55:10.807 | 81 | Berry St at 4th St | 37.775880 | -122.393170 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 1403 | Subscriber | 1980 | Male | No | 23 | 23 | 1 |
2 | 996 | 2018-01-31 23:34:56.004 | 2018-01-31 23:51:32.674 | 134 | Valencia St at 24th St | 37.752428 | -122.420628 | 4 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 3675 | Subscriber | 1987 | Male | Yes | 23 | 23 | 1 |
3 | 825 | 2018-01-31 23:34:14.027 | 2018-01-31 23:47:59.809 | 305 | Ryland Park | 37.342725 | -121.895617 | 317 | San Salvador St at 9th St | 37.333955 | -121.877349 | 1453 | Subscriber | 1994 | Female | Yes | 23 | 23 | 1 |
4 | 432 | 2018-01-31 23:34:26.484 | 2018-01-31 23:41:39.297 | 89 | Division St at Potrero Ave | 37.769218 | -122.407646 | 43 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 2928 | Subscriber | 1993 | Male | No | 23 | 23 | 1 |
month_bins = np.arange(1, 13+1, 1)
plt.hist(data=bike_rental_cut_duration_outliers, x='month', color=base_color, bins=month_bins)
plt.title('Distribution of month')
plt.xlabel('Month')
plt.ylabel('Counts');
There are more rentals from May to October than from November to April.
sns.countplot(data=bike_rental_cut_duration_outliers, x='bike_share_for_all_trip', color=base_color)
plt.title('Distribution of bike share for all trip');
To start off, I want to look at the pairwise correlations between the features in the data.
numeric_vars = ['duration_sec', 'start_hour', 'end_hour', 'member_birth_year', 'month']
categoric_vars = ['user_type', 'member_gender', 'bike_share_for_all_trip']
# correlation heatmap plot between numeric variables
plt.figure(figsize = [8, 5])
sns.heatmap(bike_rental_cut_duration_outliers[numeric_vars].corr(), annot = True, fmt = '.2f',
cmap = 'vlag_r', center = 0)
plt.show()
As expected for short trips, start_hour and end_hour are strongly correlated, while the other numeric variables show almost no correlation with each other. To see these relationships visually, let's draw scatter plots.
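The numbers in the heatmap come from `DataFrame.corr()`, which computes Pearson correlation by default. The sketch below uses synthetic trips (not the real data) to illustrate why start and end hours correlate strongly when trips are short, while duration is essentially unrelated to the start hour; the column names mirror the notebook's but the values are randomly generated assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
seconds_per_day = 24 * 3600

# Synthetic trips: random start second of the day, durations of 1 to 30 minutes
start_sec = rng.integers(0, seconds_per_day, size=1000)
duration_sec = rng.integers(60, 1800, size=1000)
end_sec = (start_sec + duration_sec) % seconds_per_day  # wrap past midnight

df = pd.DataFrame({
    "start_hour": start_sec // 3600,
    "end_hour": end_sec // 3600,
    "duration_sec": duration_sec,
})

# Pearson correlation matrix, the same call the heatmap is built from
corr = df.corr()
print(corr.round(2))
```

Trips that cross midnight (start hour 23, end hour 0) pull the start/end correlation slightly below 1.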
# correlation scatter plot between numeric variables
samples = np.random.choice(bike_rental_cut_duration_outliers.shape[0], 500, replace=False)
bike_samples = bike_rental_cut_duration_outliers.iloc[samples, :]  # samples holds positions, so select with iloc
g = sns.PairGrid(data=bike_samples, vars=numeric_vars)
g = g.map_diag(plt.hist)
g.map_offdiag(plt.scatter);
The scatter plots confirm what the heatmap showed: start_hour and end_hour are strongly related.
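Random row positions and index labels are easy to mix up once rows have been filtered out, because the remaining labels have gaps. A minimal sketch on a hypothetical frame, showing position-based selection with `.iloc` and the one-call alternative `DataFrame.sample`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after filtering rows
df = pd.DataFrame({"duration_sec": [100, 200, 300, 400, 500, 600]})
filtered = df[df["duration_sec"] != 300]  # index labels are now 0, 1, 3, 4, 5

rng = np.random.default_rng(42)
positions = rng.choice(filtered.shape[0], size=3, replace=False)

# .iloc selects by position, so gaps in the index labels do not matter
by_position = filtered.iloc[positions]

# DataFrame.sample draws rows without replacement in a single call
by_sample = filtered.sample(n=3, replace=False, random_state=42)

print(by_position)
print(by_sample)
```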
Let's move on to how duration_sec relates to the categorical variables.
# plot matrix of numeric features against categorical features
samples = np.random.choice(bike_rental_cut_duration_outliers.shape[0], 1000, replace=False)
bike_samples = bike_rental_cut_duration_outliers.iloc[samples, :]  # samples holds positions, so select with iloc
def boxgrid(x, y, **kwargs):
    """Draw a single-color box plot; used as the PairGrid mapping function."""
    default_color = sns.color_palette()[0]
    sns.boxplot(x=x, y=y, color=default_color)

# PairGrid creates its own figure, so a separate plt.figure() call is not needed
g = sns.PairGrid(data=bike_samples, y_vars='duration_sec', x_vars=categoric_vars, height=3, aspect=1.5)
g.map(boxgrid);
The box plots show that Customers tend to ride longer than Subscribers, and that trips by female riders are slightly longer than trips by male riders. Likewise, trips outside the Bike Share for All program (No) are slightly longer than trips within it (Yes).
Let's look at each relationship in more detail.
sns.violinplot(data=bike_rental, x='member_gender', y='duration_sec', color=base_color)
plt.title('duration seconds vs member gender');
There are so many outliers above the upper limit that the distribution is hard to read, so I use the data with the duration outliers removed.
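The cutoff used to build `bike_rental_cut_duration_outliers` is not shown in this section. As an illustration only, the sketch below assumes the common 1.5 × IQR upper fence on made-up durations; the actual rule used earlier in the notebook may differ.

```python
import pandas as pd

# Hypothetical trip durations in seconds, with one extreme value
durations = pd.Series([150, 300, 450, 600, 750, 900, 86000], name="duration_sec")

# Assumed rule: drop values above Q3 + 1.5 * IQR
q1, q3 = durations.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

trimmed = durations[durations <= upper_fence]
print(upper_fence, len(trimmed))
```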
sns.violinplot(data=bike_rental_cut_duration_outliers, x='member_gender', y='duration_sec', color=base_color)
plt.title('duration seconds vs member gender except outliers');
Female riders take longer trips than male riders.
sns.violinplot(data=bike_rental_cut_duration_outliers, x='user_type', y='duration_sec', color=base_color)
plt.title('duration seconds vs user type');
Customers take longer trips than subscribers.
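A grouped summary gives the same comparison as numbers, which is useful next to a violin plot. A minimal sketch on a hypothetical `trips` frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical trips, for illustration only
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Customer"],
    "duration_sec": [300, 450, 600, 700, 900, 1500],
})

# Count, median and mean duration per user type
summary = trips.groupby("user_type")["duration_sec"].agg(["count", "median", "mean"])
print(summary)
```

The median is usually the safer summary here because trip durations are right-skewed.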
Let's look at relationships between the three categorical features.
plt.figure(figsize = [8, 10])
plt.subplot(3, 1, 1)
sns.countplot(data=bike_rental_cut_duration_outliers, x='user_type', hue='member_gender', palette='Blues')
ax = plt.subplot(3, 1, 2)
sns.countplot(data=bike_rental_cut_duration_outliers, x='user_type', hue='bike_share_for_all_trip', palette='Blues')
ax = plt.subplot(3, 1, 3)
sns.countplot(data=bike_rental_cut_duration_outliers, x='bike_share_for_all_trip', hue='member_gender', palette='Blues')
ax.legend(ncol=2);
There are more subscribers than customers, and more male riders than female riders. Trips outside the Bike Share for All program outnumber trips within it.
regplot
sns.regplot(data=bike_rental, x='member_birth_year', y='duration_sec',
fit_reg=False, scatter_kws={'alpha': 0.2});
plt.title('duration seconds vs member birth year');
The scatter plot makes it hard to see any relationship between member birth year and trip duration, so I'll plot a heatmap instead.
heatmap
birth_bins_x = np.arange(1900, bike_rental_cut_duration_outliers.member_birth_year.max()+5, 5)
birth_bins_y = np.arange(0, bike_rental_cut_duration_outliers.duration_sec.max()+200, 200)
plt.hist2d(data=bike_rental_cut_duration_outliers, x='member_birth_year', y='duration_sec', bins=[birth_bins_x, birth_bins_y],
cmap = 'viridis_r')
plt.colorbar()
plt.title('duration seconds vs member birth year');
plt.xlabel('member birth year')
plt.ylabel('duration seconds');
By far the densest region is members born in the 1990s riding for about 500 seconds.
hour_bins_x = np.arange(0, bike_rental_cut_duration_outliers.start_hour.max()+2, 1)  # +2 so hour 23 gets its own bin
hour_bins_y = np.arange(0, bike_rental_cut_duration_outliers.duration_sec.max()+200, 200)
plt.hist2d(data=bike_rental_cut_duration_outliers, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
cmap='viridis_r')
plt.colorbar()
plt.title('duration seconds vs start hours');
plt.xlabel('start hours')
plt.ylabel('duration seconds');
Trips are heavily concentrated at 8:00 in the morning and 17:00 in the afternoon, with durations of about 300 to 500 seconds.
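`plt.hist2d` colors each cell by the number of points that fall in it. The same counts can be computed with `np.histogram2d`, which makes it possible to look up the densest cell directly. A minimal sketch on made-up values, using hourly edges and 200-second duration bins as in the plots above:

```python
import numpy as np

# Hypothetical trips: start hour and duration in seconds
start_hour = np.array([8, 8, 8, 17, 17, 23])
duration_sec = np.array([310, 450, 390, 480, 1250, 700])

# 25 hour edges give one bin per hour; duration bins are 200 seconds wide
hour_edges = np.arange(0, 24 + 1, 1)
duration_edges = np.arange(0, duration_sec.max() + 200, 200)

counts, _, _ = np.histogram2d(start_hour, duration_sec, bins=[hour_edges, duration_edges])

# Locate the densest cell as (hour bin index, duration bin index)
hour_idx, dur_idx = np.unravel_index(counts.argmax(), counts.shape)
print(hour_idx, duration_edges[dur_idx], duration_edges[dur_idx + 1])
```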
# use separate names here so hour_bins_x / hour_bins_y keep their hourly edges for the later plots
month_bins_x = np.arange(0, bike_rental_cut_duration_outliers.month.max()+2, 1)
duration_bins_y = np.arange(0, bike_rental_cut_duration_outliers.duration_sec.max()+200, 200)
plt.hist2d(data=bike_rental_cut_duration_outliers, x='month', y='duration_sec', bins=[month_bins_x, duration_bins_y],
           cmap='viridis_r', cmin=1000)
plt.colorbar()
plt.title('duration seconds vs month');
plt.xlabel('month')
plt.ylabel('duration seconds');
As expected, trips are most numerous from May to October.
type_counts = bike_rental_cut_duration_outliers.groupby(['user_type', 'member_gender']).size()
type_counts = type_counts.reset_index(name='count')
type_counts = type_counts.pivot(index='member_gender', columns='user_type', values='count')
sns.heatmap(type_counts, annot = True, fmt = 'd', cmap='viridis_r')
plt.title('member gender vs user type');
There are about three times as many male subscribers as female subscribers.
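The groupby, reset_index and pivot sequence above builds a contingency table; `pd.crosstab` produces the same kind of table in one call, and the ratio quoted above can be read from it. A minimal sketch on a hypothetical `trips` frame:

```python
import pandas as pd

# Hypothetical trips, for illustration only
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 2,
    "member_gender": ["Male", "Male", "Male", "Female", "Male", "Female"],
})

# Rows: gender, columns: user type, values: trip counts
table = pd.crosstab(trips["member_gender"], trips["user_type"])

# Ratio of male to female subscribers
ratio = table.loc["Male", "Subscriber"] / table.loc["Female", "Subscriber"]
print(table)
print(ratio)
```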
The data contains numeric and categorical variables. The numeric variables are duration_sec, member_birth_year, month, start_hour and end_hour; the categorical variables are user_type, member_gender and bike_share_for_all_trip. The variable of interest is duration_sec because it is closely tied to rental fees.
member_birth_year: The densest region is members born in the 1990s riding for about 500 seconds.
month: Trips are most numerous from May to October.
start_hour: Trips are concentrated at 8:00 in the morning and 17:00 in the afternoon, with durations of about 300 to 500 seconds.
user_type: Customers tend to ride longer than Subscribers.
member_gender: Trips by female riders are slightly longer than trips by male riders.
bike_share_for_all_trip: Trips marked No are slightly longer than trips marked Yes.
g = sns.FacetGrid(data=bike_rental_cut_duration_outliers, hue='member_gender', height=7)
g.map(plt.scatter, 'start_hour', 'duration_sec')
g.add_legend();
plt.title('')
The overlapping points make this scatter plot hard to read, so I'll draw a separate heatmap for each gender.
bike_rental_cut_duration_outliers_male = bike_rental_cut_duration_outliers.query('member_gender == "Male"')
bike_rental_cut_duration_outliers_female = bike_rental_cut_duration_outliers.query('member_gender == "Female"')
plt.figure(figsize = [12, 5])
plt.subplot(1, 2, 1)
plt.hist2d(data=bike_rental_cut_duration_outliers_male, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
cmap='viridis_r', cmin=50)
plt.colorbar()
plt.xlabel('start hours of male')
plt.ylabel('duration seconds');
plt.subplot(1, 2, 2)
plt.hist2d(data=bike_rental_cut_duration_outliers_female, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
cmap='viridis_r', cmin=50)
plt.colorbar()
plt.xlabel('start hours of female')
plt.ylabel('duration seconds');
The bivariate plots showed that trips by female riders are slightly longer than trips by male riders, and these heatmaps are consistent with that. They also reveal something new: male riders use the bikes around dawn more than female riders do.
bike_rental_cut_duration_outliers_sub = bike_rental_cut_duration_outliers.query('user_type == "Subscriber"')
bike_rental_cut_duration_outliers_cus = bike_rental_cut_duration_outliers.query('user_type == "Customer"')
plt.figure(figsize = [12, 5])
plt.subplot(1, 2, 1)
plt.hist2d(data=bike_rental_cut_duration_outliers_sub, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
cmap='viridis_r', cmin=50)
plt.colorbar()
plt.xlabel('start hours of Subscriber')
plt.ylabel('duration seconds');
plt.subplot(1, 2, 2)
plt.hist2d(data=bike_rental_cut_duration_outliers_cus, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
cmap='viridis_r', cmin=50)
plt.colorbar()
plt.xlabel('start hours of Customer')
plt.ylabel('duration seconds');
Subscribers ride much more regularly around 8:00 and 17:00 than Customers do. In the Customer heatmap this pattern disappears.
# point plot of 2 numeric variables and 1 categorical variable
sns.pointplot(data=bike_rental_cut_duration_outliers, x='month', y='duration_sec', hue='member_gender',
palette = 'Blues', linestyles='', dodge=0.4)
plt.title('duration seconds vs month based on gender');
Female riders tend to take longer trips than male riders, especially in summer.
# point plot of 2 numeric variables and 1 categorical variable
sns.pointplot(data=bike_rental_cut_duration_outliers, x='month', y='duration_sec', hue='user_type',
              palette = 'Blues', linestyles='', dodge=0.4)
plt.title('duration seconds vs month based on user type');
Customers tend to take longer trips than Subscribers, especially in summer.
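By default, `sns.pointplot` draws the mean of `y` for each combination of `x` and `hue`. Those means can be computed directly with a groupby, which makes the seasonal gap easy to quantify. A minimal sketch on a hypothetical `trips` frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical trips, for illustration only
trips = pd.DataFrame({
    "month": [1, 1, 7, 7, 1, 7],
    "user_type": ["Subscriber", "Customer", "Subscriber",
                  "Customer", "Subscriber", "Customer"],
    "duration_sec": [400, 800, 500, 1200, 600, 1400],
})

# Mean duration per month and user type, one column per user type
means = trips.groupby(["month", "user_type"])["duration_sec"].mean().unstack("user_type")

# Gap between customers and subscribers in each month
gap = means["Customer"] - means["Subscriber"]
print(means)
print(gap)
```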
The main findings are much the same as in the bivariate plots, which showed that trips by female riders are slightly longer than trips by male riders. The multivariate plots add one new observation: male riders use the bikes around dawn more than female riders do.
Also, subscribers ride more regularly around 8:00 and 17:00 than customers do. Finally, female riders take longer trips than male riders, and customers take longer trips than subscribers, especially in summer.