from google.colab import drive
drive.mount('/content/drive')
# Importing libraries
# pip3 install graphviz
# pip3 install dask
# pip3 install toolz
# pip3 install cloudpickle
# https://www.youtube.com/watch?v=ieW3G7ZzRZ0
# https://github.com/dask/dask-tutorial
# please go through this notebook: https://github.com/dask/dask-tutorial/blob/master/07_dataframe.ipynb
import dask.dataframe as dd  # similar to pandas

import pandas as pd  # pandas, to create small dataframes

# pip3 install folium
# if this doesn't work, refer to install_folium.JPG in the drive
import folium  # OpenStreetMap

# unix time: https://www.unixtimestamp.com/
import datetime  # convert to unix time
import time  # convert to unix time

# if numpy is not installed already: pip3 install numpy
import numpy as np  # arithmetic operations on arrays

# matplotlib: used to plot graphs
import matplotlib
# matplotlib.use('nbagg'): this backend makes plots interactive (zoom in and zoom out)
matplotlib.use('nbagg')
import matplotlib.pylab as plt
import seaborn as sns  # plots
from matplotlib import rcParams  # size of plots

!pip3 install gpxpy
# this library is used to calculate the straight-line distance between two (lat, lon) pairs in miles
import gpxpy.geo  # get the haversine distance

from sklearn.cluster import MiniBatchKMeans, KMeans  # clustering
import math
import pickle
import os

# download MinGW: https://mingw-w64.org/doku.php/download/mingw-builds
# install it on your system and keep the path, mingw_path = 'installed path'
mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\mingw64\\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']

# to install xgboost: pip3 install xgboost
# if that fails, check install_xgboost.JPG
import xgboost as xgb
%matplotlib inline

# to install sklearn: pip install -U scikit-learn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings("ignore")
Get the data from: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (2016 data). The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC).
These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.
FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.
The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
We have collected all yellow taxi trip data from Jan 2015 to Dec 2016 (we will be using only the 2015 data).
|file name||file size||number of records||number of features|
# Looking at the features
# dask dataframe: https://github.com/dask/dask-tutorial/blob/master/07_dataframe.ipynb
month = dd.read_csv('drive/My Drive/NYTaxi/Data_Notebooks/yellow_tripdata_2015-01.csv')
print(month.columns)
Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount'], dtype='object')
# However, unlike pandas, operations on dask.dataframes don't trigger immediate computation;
# instead they add key-value pairs to an underlying Dask graph. Recall that in the diagram below,
# circles are operations and rectangles are results.
# To see the visualization you need to install graphviz:
# pip3 install graphviz (if this doesn't work, please check install_graphviz.jpg in the drive)
month.visualize()
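The comment above says that Dask operations add key-value pairs to a task graph instead of computing right away. The following is a minimal pure-Python sketch of that idea, not Dask's actual scheduler. The graph layout (a dict whose values are either plain data or tuples of a function followed by its arguments) follows the convention described in the Dask documentation; the helper name `toy_get` and the keys `'a'`, `'b'`, and `'total'` are made up for illustration.

```python
from operator import add, mul

def toy_get(graph, key):
    """Recursively evaluate `key` in a dict-based task graph (toy evaluator)."""
    task = graph[key]
    # A task is a tuple whose first element is callable; anything else is data.
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that name other keys are evaluated first; literals pass through.
        resolved = [toy_get(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return task

# Building the graph only records the work; nothing is computed yet.
graph = {
    'a': 2,
    'b': (add, 'a', 3),       # b = a + 3
    'total': (mul, 'b', 10),  # total = b * 10
}

result = toy_get(graph, 'total')  # computation happens only here
print(result)
```

With a real Dask dataframe, the analogous step is calling `.compute()` (or `.head()`), which triggers evaluation of the recorded graph.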
|VendorID||A code indicating the TPEP provider that provided the record.|
|tpep_pickup_datetime||The date and time when the meter was engaged.|
|tpep_dropoff_datetime||The date and time when the meter was disengaged.|
|Passenger_count||The number of passengers in the vehicle. This is a driver-entered value.|
|Trip_distance||The elapsed trip distance in miles reported by the taximeter.|
|Pickup_longitude||Longitude where the meter was engaged.|
|Pickup_latitude||Latitude where the meter was engaged.|
|RateCodeID||The final rate code in effect at the end of the trip.|
|Store_and_fwd_flag||This flag indicates whether the trip record was held in vehicle memory before being sent to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip.|
|Dropoff_longitude||Longitude where the meter was disengaged.|
|Dropoff_latitude||Latitude where the meter was disengaged.|
|Payment_type||A numeric code signifying how the passenger paid for the trip.|
|Fare_amount||The time-and-distance fare calculated by the meter.|
|Extra||Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.|
|MTA_tax||$0.50 MTA tax that is automatically triggered based on the metered rate in use.|
|Improvement_surcharge||$0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.|
|Tip_amount||Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.|
|Tolls_amount||Total amount of all tolls paid in trip.|
|Total_amount||The total amount charged to passengers. Does not include cash tips.|
Time-series forecasting and Regression
To solve the above, we will use data collected in Jan - Mar 2015 to predict the pickups in Jan - Mar 2016.
In this section we perform univariate analysis and remove outlier or illegitimate values, which may have been caused by recording errors.
# the table below shows a few data points along with all our features
month.head(5)
| ||VendorID||tpep_pickup_datetime||tpep_dropoff_datetime||passenger_count||trip_distance||pickup_longitude||pickup_latitude||RateCodeID||store_and_fwd_flag||dropoff_longitude||dropoff_latitude||payment_type||fare_amount||extra||mta_tax||tip_amount||tolls_amount||improvement_surcharge||total_amount|
|0||2||2015-01-15 19:05:39||2015-01-15 19:23:42||1||1.59||-73.993896||40.750111||1||N||-73.974785||40.750618||1||12.0||1.0||0.5||3.25||0.0||0.3||17.05|
|1||1||2015-01-10 20:33:38||2015-01-10 20:53:28||1||3.30||-74.001648||40.724243||1||N||-73.994415||40.759109||1||14.5||0.5||0.5||2.00||0.0||0.3||17.80|
|2||1||2015-01-10 20:33:38||2015-01-10 20:43:41||1||1.80||-73.963341||40.802788||1||N||-73.951820||40.824413||2||9.5||0.5||0.5||0.00||0.0||0.3||10.80|
|3||1||2015-01-10 20:33:39||2015-01-10 20:35:31||1||0.50||-74.009087||40.713818||1||N||-74.004326||40.719986||2||3.5||0.5||0.5||0.00||0.0||0.3||4.80|
|4||1||2015-01-10 20:33:39||2015-01-10 20:52:58||1||3.00||-73.971176||40.762428||1||N||-74.004181||40.742653||2||15.0||0.5||0.5||0.00||0.0||0.3||16.30|
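As a quick sanity check on the rows above, the charge components appear to add up to `total_amount`. The sketch below verifies this for the five displayed rows only; whether the same identity holds across the whole dataset is an assumption that would need checking, and the tuple layout and the name `components_match` are illustrative.

```python
import math

# (fare_amount, extra, mta_tax, tip_amount, tolls_amount,
#  improvement_surcharge, total_amount) copied from the five rows shown above
rows = [
    (12.0, 1.0, 0.5, 3.25, 0.0, 0.3, 17.05),
    (14.5, 0.5, 0.5, 2.00, 0.0, 0.3, 17.80),
    (9.5, 0.5, 0.5, 0.00, 0.0, 0.3, 10.80),
    (3.5, 0.5, 0.5, 0.00, 0.0, 0.3, 4.80),
    (15.0, 0.5, 0.5, 0.00, 0.0, 0.3, 16.30),
]

def components_match(row, tol=0.005):
    """Return True if the charge components sum to total_amount within `tol`."""
    *components, total = row
    # Compare with a half-cent tolerance to absorb floating-point rounding.
    return math.isclose(sum(components), total, abs_tol=tol)

checks = [components_match(r) for r in rows]
print(checks)
```

Rows that fail such a check on the full data would be candidates for closer inspection during outlier removal.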
According to https://www.flickr.com/places/info/2459115, New York is bounded by the (lat, long) coordinates (40.5774, -74.15) and (40.9176, -73.7004). Since we are only concerned with pickups that originate within New York, we do not consider any pickup whose coordinates fall outside this bounding box.
# Plotting pickup coordinates which are outside the bounding box of New York
# we collect all the points outside the bounding box of New York City into outlier_locations
outlier_locations = month[((month.pickup_longitude <= -74.15) | (month.pickup_latitude <= 40.5774) |
                           (month.pickup_longitude >= -73.7004) | (month.pickup_latitude >= 40.9176))]

# creating a map with a base location
# read more about folium here: http://folium.readthedocs.io/en/latest/quickstart.html
# note: you don't need to remember any of this; in-depth knowledge of these maps and plots is not required
map_osm = folium.Map(location=[40.734695, -73.990372], tiles='Stamen Toner')

# we plot only the first 10000 outliers on the map; plotting all the outliers would take more time
sample_locations = outlier_locations.head(10000)
for i, j in sample_locations.iterrows():
    # skip points whose pickup latitude truncates to 0
    if int(j['pickup_latitude']) != 0:
        folium.Marker(list((j['pickup_latitude'], j['pickup_longitude']))).add_to(map_osm)
map_osm
Observation: As the map above shows, some points lie just outside the boundary, but a few are as far away as South America, Mexico, or Canada.
According to https://www.flickr.com/places/info/2459115, New York is bounded by the (lat, long) coordinates (40.5774, -74.15) and (40.9176, -73.7004). Since we are only concerned with dropoffs that fall within New York, we do not consider any dropoff whose coordinates lie outside this bounding box.
# Plotting dropoff coordinates which are outside the bounding box of New York
# we collect all the points outside the bounding box of New York City into outlier_locations
outlier_locations = month[((month.dropoff_longitude <= -74.15) | (month.dropoff_latitude <= 40.5774) |
                           (month.dropoff_longitude >= -73.7004) | (month.dropoff_latitude >= 40.9176))]

# creating a map with a base location
# read more about folium here: http://folium.readthedocs.io/en/latest/quickstart.html
# note: you don't need to remember any of this; in-depth knowledge of these maps and plots is not required
map_osm = folium.Map(location=[40.734695, -73.990372], tiles='Stamen Toner')

# we plot only the first 10000 outliers on the map; plotting all the outliers would take more time
sample_locations = outlier_locations.head(10000)
for i, j in sample_locations.iterrows():
    # skip points whose dropoff latitude truncates to 0
    # (the original checked pickup_latitude here, which does not match the dropoff markers being plotted)
    if int(j['dropoff_latitude']) != 0:
        folium.Marker(list((j['dropoff_latitude'], j['dropoff_longitude']))).add_to(map_osm)
map_osm
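Both cells above apply the same bounding-box test, once to pickups and once to dropoffs. A small helper can make that test explicit. This is a sketch using the bounds quoted in the text; the function name `in_nyc_bbox` and the constant names are hypothetical, and treating points exactly on the boundary as outside mirrors the `<=` and `>=` comparisons used to collect the outliers.

```python
# Bounding box quoted in the text: (lat, lon) corners
# (40.5774, -74.15) and (40.9176, -73.7004)
LAT_MIN, LAT_MAX = 40.5774, 40.9176
LON_MIN, LON_MAX = -74.15, -73.7004

def in_nyc_bbox(lat, lon):
    """Return True if (lat, lon) lies strictly inside the bounding box."""
    return LAT_MIN < lat < LAT_MAX and LON_MIN < lon < LON_MAX

# First pickup in the sample rows printed earlier
inside = in_nyc_bbox(40.750111, -73.993896)
# A (0, 0) coordinate, the kind of point the plotting loops skip
outside = in_nyc_bbox(0.0, 0.0)
print(inside, outside)
```

On the dataframe itself, the rows to keep are those selected by the negation of the outlier mask used above.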