A Machine Learning study for segmenting private-transport pickup and dropoff areas in the city of Bogotá, based on Uber data from 2016-2017

Urban mobility is a rich subject for analysis. Here we analyze the spatial distribution of Uber pickup and dropoff areas in Bogotá, using the trip records of the Taxímetro EC app available on Kaggle. The idea is to build heatmaps of where passengers are picked up and dropped off, and then to group those locations into zones with K-means, an unsupervised clustering algorithm.

With this simple idea, we will program the algorithm that lets us discover where passenger pickups and dropoffs occur in Bogotá.

Libraries used

In [91]:

# %pip installs into the environment of the running kernel (run once)
%pip install graphviz dask toolz cloudpickle folium gpxpy
%pip install -U scikit-learn

# Render matplotlib figures inside the notebook
%matplotlib inline

import datetime
import math
import os
import pickle
import time
import warnings

import dask.dataframe as dd
import folium
import gpxpy.geo
import matplotlib
import matplotlib.pylab as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams
from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

warnings.filterwarnings("ignore")

Data source

The data are the datasets collected and published by the Taxímetro EC app, available on Kaggle. Taxímetro EC is a tool developed to compare fares, based on the GPS traces of routes requested on Uber, and to calculate the cost of the same trip by taxi.

The data include the pickup and dropoff variables, trip duration, waiting time, location, and distance. In this notebook I skip data cleaning and focus on grouping the records by distance. A more thorough analysis is possible, but for practical purposes this example concentrates only on how to use the algorithm.
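Because cleaning is skipped, rows such as trip 3 in the preview of the dataset (a 2,655 m trip with a recorded duration of 3,632,095 seconds) stay in the data. The following is only a minimal sketch of a filter one might apply. The 3-hour cap, the bounding box, and the function name `drop_implausible_trips` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical thresholds, chosen only for illustration.
MAX_DURATION_SEC = 3 * 60 * 60   # discard trips longer than 3 hours
LAT_RANGE = (4.43, 4.85)         # rough latitude range around Bogotá
LON_RANGE = (-74.75, -73.4)      # rough longitude range around Bogotá


def drop_implausible_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Keep trips with a plausible duration whose pickup lies inside the box."""
    ok_duration = trips["trip_duration"].between(1, MAX_DURATION_SEC)
    ok_lat = trips["pickup_latitude"].between(*LAT_RANGE)
    ok_lon = trips["pickup_longitude"].between(*LON_RANGE)
    return trips[ok_duration & ok_lat & ok_lon]


# Three rows copied from the dataset preview (ids 1, 2 and 3).
sample = pd.DataFrame({
    "trip_duration": [1419, 782, 3632095],
    "pickup_latitude": [4.622699, 4.604075, 4.646176],
    "pickup_longitude": [-74.170353, -74.123542, -74.178643],
})
cleaned = drop_implausible_trips(sample)
print(len(cleaned), "of", len(sample), "trips kept")
```

Filtering on duration alone would already remove the two multi-day records visible in the preview; the bounding box is a separate guard against GPS points far outside the city.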

Processing

Here we convert the data types and add date columns (month, day, and hour) so that the trips can be grouped by month.

In [2]:

!ls ~/library
month = pd.read_csv("~/library/bog_clean.csv", index_col=0)

bog_2019.csv   bog_uber2018-2019.ipynb	taxi_bog.ipynb	  Untitled.ipynb
bog_clean.csv  prueba.ipynb		Untitled 1.ipynb

In [3]:

month.head()

Out[3]:

	vendor_id	pickup_datetime	dropoff_datetime	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration	dist_meters	wait_sec
id
1	Bogotá	2016-09-18 01:54:11	2016-09-18 02:17:49	-74.170353	4.622699	-74.119259	4.572322	N	1419	11935	293
2	Bogotá	2016-09-18 03:31:05	2016-09-18 03:44:06	-74.123542	4.604075	-74.116125	4.572578	N	782	7101	139
3	Bogotá	2016-08-07 03:35:36	2016-09-18 04:30:31	-74.178643	4.646176	-74.178711	4.646367	N	3632095	2655	2534
4	Bogotá	2016-09-18 04:31:13	2016-09-18 04:32:19	-74.163398	4.641949	-74.165813	4.640649	N	66	318	52
5	Bogotá	2016-09-13 12:07:04	2016-09-18 05:00:44	-74.137539	4.596347	-74.125364	4.576745	N	449620	3228	211

In [4]:

month.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3063 entries, 1 to 3063
Data columns (total 11 columns):
vendor_id             3063 non-null object
pickup_datetime       3063 non-null object
dropoff_datetime      3063 non-null object
pickup_longitude      3063 non-null float64
pickup_latitude       3063 non-null float64
dropoff_longitude     3063 non-null float64
dropoff_latitude      3063 non-null float64
store_and_fwd_flag    3063 non-null object
trip_duration         3063 non-null int64
dist_meters           3063 non-null int64
wait_sec              3063 non-null int64
dtypes: float64(4), int64(3), object(4)
memory usage: 287.2+ KB

In [5]:

month.pickup_datetime = pd.to_datetime(month.pickup_datetime, format='%Y-%m-%d %H:%M:%S')
# Derive calendar features from the pickup timestamp
month['month'] = month.pickup_datetime.dt.month
month['day'] = month.pickup_datetime.dt.day
month['hour'] = month.pickup_datetime.dt.hour

In [6]:

month.head()

Out[6]:

	vendor_id	pickup_datetime	dropoff_datetime	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration	dist_meters	wait_sec	month	day	hour
id
1	Bogotá	2016-09-18 01:54:11	2016-09-18 02:17:49	-74.170353	4.622699	-74.119259	4.572322	N	1419	11935	293	9	18	1
2	Bogotá	2016-09-18 03:31:05	2016-09-18 03:44:06	-74.123542	4.604075	-74.116125	4.572578	N	782	7101	139	9	18	3
3	Bogotá	2016-08-07 03:35:36	2016-09-18 04:30:31	-74.178643	4.646176	-74.178711	4.646367	N	3632095	2655	2534	8	7	3
4	Bogotá	2016-09-18 04:31:13	2016-09-18 04:32:19	-74.163398	4.641949	-74.165813	4.640649	N	66	318	52	9	18	4
5	Bogotá	2016-09-13 12:07:04	2016-09-18 05:00:44	-74.137539	4.596347	-74.125364	4.576745	N	449620	3228	211	9	13	12

In [7]:

month.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3063 entries, 1 to 3063
Data columns (total 14 columns):
vendor_id             3063 non-null object
pickup_datetime       3063 non-null datetime64[ns]
dropoff_datetime      3063 non-null object
pickup_longitude      3063 non-null float64
pickup_latitude       3063 non-null float64
dropoff_longitude     3063 non-null float64
dropoff_latitude      3063 non-null float64
store_and_fwd_flag    3063 non-null object
trip_duration         3063 non-null int64
dist_meters           3063 non-null int64
wait_sec              3063 non-null int64
month                 3063 non-null int64
day                   3063 non-null int64
hour                  3063 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(6), object(3)
memory usage: 358.9+ KB

In [43]:

def generateBaseMap(default_location=[4.693943, -73.985880], default_zoom_start=11):
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map
base_map = generateBaseMap()
base_map

Out[43]:

In [44]:

type(base_map)

Out[44]:

folium.folium.Map

In [45]:

from folium.plugins import HeatMap

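Once the trips are labeled by month, we build a heatmap. Note that the filter used in the next cell, `month > 3`, keeps the trips from April onward, that is, everything after the first quarter.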

In [46]:

df_copy = month[month.month>3].copy()
df_copy['count'] = 1

In [47]:

df_copy[['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().sort_values('count', ascending=False).head(10)

Out[47]:

		count
pickup_latitude	pickup_longitude
4.704125	-74.073603	3
4.657017	-74.129252	3
4.574614	-74.093426	2
4.752091	-74.050850	2
4.706581	-74.051700	2
4.615209	-74.159510	2
4.668245	-74.105174	2
4.645623	-74.064229	2
4.706558	-74.051733	2
4.763551	-74.027494	2
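The counts in this table top out at 3 because raw GPS coordinates rarely repeat exactly. As a sketch only, one could round the coordinates to a grid before counting so that nearby pickups share a cell. Rounding to three decimals (roughly 110 m in latitude) and the helper name `binned_heat_rows` are assumptions of this example; the notebook itself groups on the raw coordinates. The helper returns `[lat, lon, weight]` rows, the shape that folium's `HeatMap` accepts.

```python
import pandas as pd


def binned_heat_rows(trips, lat_col, lon_col, decimals=3):
    """Round coordinates to a grid and return [lat, lon, count] rows."""
    binned = trips[[lat_col, lon_col]].round(decimals)
    counts = (
        binned.groupby([lat_col, lon_col])
        .size()
        .reset_index(name="count")
    )
    return counts.values.tolist()


# Three pickup points taken from the table above; the first two are
# only a few meters apart and fall into the same grid cell.
sample = pd.DataFrame({
    "pickup_latitude": [4.706581, 4.706558, 4.645623],
    "pickup_longitude": [-74.051700, -74.051733, -74.064229],
})
rows = binned_heat_rows(sample, "pickup_latitude", "pickup_longitude")
for lat, lon, count in rows:
    print(lat, lon, count)
```

The resulting list can be passed as the `data` argument of `HeatMap` in place of the ungrouped list.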

Pickup map for Bogotá

In [49]:

base_map = generateBaseMap()
HeatMap(data=df_copy[['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
base_map

Out[49]:

Dropoff map for Bogotá

In [79]:

base_map = generateBaseMap()
# df_copy is the filtered frame (with its 'count' column) built above
HeatMap(data=df_copy[['dropoff_latitude', 'dropoff_longitude', 'count']].groupby(['dropoff_latitude', 'dropoff_longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
base_map

Out[79]:

Pickup clustering

In [65]:

!pip install gpxpy


In [71]:

import gpxpy.geo
from sklearn.cluster import MiniBatchKMeans

coords = month[['pickup_latitude', 'pickup_longitude']].values
neighbours = []
METERS_PER_MILE = 1.60934 * 1000

def find_min_distance(cluster_centers, cluster_len):
    # For every center, count the other centers within 2 miles (less2) and
    # farther than 2 miles (more2), and track the smallest distance seen.
    less2 = []
    more2 = []
    min_dist = 1000
    for i in range(0, cluster_len):
        nice_points = 0
        wrong_points = 0
        for j in range(0, cluster_len):
            if j != i:
                # haversine_distance returns meters; convert to miles
                distance = gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1], cluster_centers[j][0], cluster_centers[j][1])
                miles = distance / METERS_PER_MILE
                min_dist = min(min_dist, miles)
                if miles <= 2:
                    nice_points += 1
                else:
                    wrong_points += 1
        less2.append(nice_points)
        more2.append(wrong_points)
    neighbours.append(less2)
    print ("On choosing a cluster size of ",cluster_len,"\nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2):", np.ceil(sum(less2)/len(less2)), "\nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2):", np.ceil(sum(more2)/len(more2)),"\nMin inter-cluster distance = ",min_dist,"\n---")

def find_clusters(increment):
    # Fit MiniBatchKMeans with `increment` clusters and label every trip
    kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000, random_state=42).fit(coords)
    month['pickup_cluster'] = kmeans.predict(coords)
    cluster_centers = kmeans.cluster_centers_
    cluster_len = len(cluster_centers)
    return cluster_centers, cluster_len

# Try 10, 20, ..., 90 clusters and report how close the centers end up
for increment in range(10, 100, 10):
    cluster_centers, cluster_len = find_clusters(increment)
    find_min_distance(cluster_centers, cluster_len)

On choosing a cluster size of  10 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 9.0 
Min inter-cluster distance =  3.4288314414508263 
---
On choosing a cluster size of  20 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 19.0 
Min inter-cluster distance =  1.4708481498272303 
---
On choosing a cluster size of  30 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 29.0 
Min inter-cluster distance =  1.3874150405639702 
---
On choosing a cluster size of  40 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 38.0 
Min inter-cluster distance =  1.0377335582174685 
---
On choosing a cluster size of  50 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 47.0 
Min inter-cluster distance =  1.0263199117409905 
---
On choosing a cluster size of  60 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 57.0 
Min inter-cluster distance =  0.8036276599740347 
---
On choosing a cluster size of  70 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 4.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 66.0 
Min inter-cluster distance =  0.7600835906262101 
---
On choosing a cluster size of  80 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 75.0 
Min inter-cluster distance =  0.7284120065008872 
---
On choosing a cluster size of  90 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 85.0 
Min inter-cluster distance =  0.5734437200523472 
---
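The distances reported above come from `gpxpy.geo.haversine_distance`, which returns meters; the loop divides by 1609.34, so both the threshold of 2 and the "Min inter-cluster distance" values are in miles. For readers who want to see what that call computes, here is a minimal plain-Python sketch of the haversine formula. The mean Earth radius of 6,371 km is an assumption of this sketch, so results may differ slightly from gpxpy's.

```python
import math

EARTH_RADIUS_M = 6_371_000   # mean Earth radius in meters (assumed here)
METERS_PER_MILE = 1609.34


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Pickup and dropoff coordinates of trip 1 from the dataset preview.
d = haversine_m(4.622699, -74.170353, 4.572322, -74.119259)
print(round(d), "m =", round(d / METERS_PER_MILE, 2), "miles")
```

This is the straight-line distance; the `dist_meters` column of the same trip is larger because it follows the route actually driven.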

In [74]:

kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000,random_state=0).fit(coords)
month['pickup_cluster'] = kmeans.predict(month[['pickup_latitude', 'pickup_longitude']])

In [77]:

cluster_centers = kmeans.cluster_centers_
cluster_len = len(cluster_centers)
# Add one marker per pickup cluster center; the popup shows "lat, lon"
for i in range(cluster_len):
    lat, lon = cluster_centers[i]
    folium.Marker([lat, lon], popup="{}, {}".format(lat, lon)).add_to(base_map)
base_map

Out[77]:

In [80]:

def plot_clusters(frame):
    city_long_border = (-74.75, -73.4)  # (west, east), so that east is on the right
    city_lat_border = (4.43, 4.85)
    fig, ax = plt.subplots(ncols=1, nrows=1)
    ax.scatter(frame.pickup_longitude.values[:100000], frame.pickup_latitude.values[:100000], s=10, lw=0,
               c=frame.pickup_cluster.values[:100000], cmap='tab20', alpha=0.2)
    ax.set_xlim(city_long_border)
    ax.set_ylim(city_lat_border)
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    plt.show()

plot_clusters(month)


Dropoff clustering

In [88]:

import gpxpy.geo
from sklearn.cluster import MiniBatchKMeans

coords = month[['dropoff_latitude', 'dropoff_longitude']].values
neighbours = []
METERS_PER_MILE = 1.60934 * 1000

def find_min_distance(cluster_centers, cluster_len):
    # For every center, count the other centers within 2 miles (less2) and
    # farther than 2 miles (more2), and track the smallest distance seen.
    less2 = []
    more2 = []
    min_dist = 1000
    for i in range(0, cluster_len):
        nice_points = 0
        wrong_points = 0
        for j in range(0, cluster_len):
            if j != i:
                # haversine_distance returns meters; convert to miles
                distance = gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1], cluster_centers[j][0], cluster_centers[j][1])
                miles = distance / METERS_PER_MILE
                min_dist = min(min_dist, miles)
                if miles <= 2:
                    nice_points += 1
                else:
                    wrong_points += 1
        less2.append(nice_points)
        more2.append(wrong_points)
    neighbours.append(less2)
    print ("On choosing a cluster size of ",cluster_len,"\nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2):", np.ceil(sum(less2)/len(less2)), "\nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2):", np.ceil(sum(more2)/len(more2)),"\nMin inter-cluster distance = ",min_dist,"\n---")

def find_clusters(increment):
    # Fit MiniBatchKMeans with `increment` clusters and label every trip
    kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000, random_state=42).fit(coords)
    month['dropoff_cluster'] = kmeans.predict(coords)
    cluster_centers = kmeans.cluster_centers_
    cluster_len = len(cluster_centers)
    return cluster_centers, cluster_len

# Try 10, 20, ..., 90 clusters and report how close the centers end up
for increment in range(10, 100, 10):
    cluster_centers, cluster_len = find_clusters(increment)
    find_min_distance(cluster_centers, cluster_len)

On choosing a cluster size of  10 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 9.0 
Min inter-cluster distance =  3.607630937864671 
---
On choosing a cluster size of  20 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 19.0 
Min inter-cluster distance =  2.175964109237607 
---
On choosing a cluster size of  30 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 29.0 
Min inter-cluster distance =  1.49497236697701 
---
On choosing a cluster size of  40 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 39.0 
Min inter-cluster distance =  1.4556027847289026 
---
On choosing a cluster size of  50 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 2.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 48.0 
Min inter-cluster distance =  1.0462398195822333 
---
On choosing a cluster size of  60 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 57.0 
Min inter-cluster distance =  0.9321457034384402 
---
On choosing a cluster size of  70 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 4.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 66.0 
Min inter-cluster distance =  0.8639740818306517 
---
On choosing a cluster size of  80 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 75.0 
Min inter-cluster distance =  0.6352320765136559 
---
On choosing a cluster size of  90 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 85.0 
Min inter-cluster distance =  0.6172604310939596 
---
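The pickup and dropoff sections run the same clustering code with different column names. As a sketch of how the duplication could be avoided, the helper below takes the column names as parameters. The name `cluster_points`, the explicit `n_init=3`, and the synthetic coordinates are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans


def cluster_points(frame, lat_col, lon_col, n_clusters, random_state=0):
    """Fit MiniBatchKMeans on two coordinate columns; return labels and centers."""
    coords = frame[[lat_col, lon_col]].to_numpy()
    kmeans = MiniBatchKMeans(
        n_clusters=n_clusters,
        batch_size=10000,
        random_state=random_state,
        n_init=3,
    ).fit(coords)
    return kmeans.predict(coords), kmeans.cluster_centers_


# Synthetic points scattered around Bogotá, only to make the sketch runnable.
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "pickup_latitude": rng.normal(4.65, 0.03, 200),
    "pickup_longitude": rng.normal(-74.08, 0.03, 200),
    "dropoff_latitude": rng.normal(4.70, 0.03, 200),
    "dropoff_longitude": rng.normal(-74.05, 0.03, 200),
})

centers_by_kind = {}
for kind in ("pickup", "dropoff"):
    labels, centers = cluster_points(
        trips, f"{kind}_latitude", f"{kind}_longitude", n_clusters=5
    )
    trips[f"{kind}_cluster"] = labels
    centers_by_kind[kind] = centers
    print(kind, "centers:", centers.shape)
```

With the real data, the same call would replace both copies of `find_clusters`, passing `month` as the frame and 40 as `n_clusters`.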

In [89]:

kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000,random_state=0).fit(coords)
month['dropoff_cluster'] = kmeans.predict(month[['dropoff_latitude', 'dropoff_longitude']])

In [90]:

dropoff_cluster = kmeans.cluster_centers_
cluster_len = len(dropoff_cluster)
# Add one marker per dropoff cluster center; the popup shows "lat, lon"
for i in range(cluster_len):
    lat, lon = dropoff_cluster[i]
    folium.Marker([lat, lon], popup="{}, {}".format(lat, lon)).add_to(base_map)
base_map

Out[90]:

In [97]:

def plot_clusters(frame):
    city_long_border = (-74.75, -73.4)  # (west, east), so that east is on the right
    city_lat_border = (4.43, 4.85)
    fig, ax = plt.subplots(ncols=1, nrows=1)
    ax.scatter(frame.dropoff_longitude.values[:100000], frame.dropoff_latitude.values[:100000], s=10, lw=0,
               c=frame.dropoff_cluster.values[:100000], cmap='tab20', alpha=0.2)
    ax.set_xlim(city_long_border)
    ax.set_ylim(city_lat_border)
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    plt.show()

plot_clusters(month)


Results

The pickup and dropoff zones let us discover which locations need more taxis than others at a given time, for example because of nearby schools, hospitals, or offices. This becomes useful if the zones can be passed on to drivers through a smartphone app, so that they can move to the locations where the expected number of pickups is highest. Another use is to identify the areas with the largest number of users and place audiovisual media aimed at that audience there: a BTL (below-the-line) campaign can be more effective if its requirements take this information into account. The same zones can also guide where to locate an operations hub for private transport.
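To make "more taxis at a given time" concrete, one could count pickups per cluster and hour, using the `pickup_cluster` and `hour` columns created earlier in the notebook. The following is a sketch only: the function name `busiest_zones` and the toy rows are invented for illustration and are not results from the dataset.

```python
import pandas as pd


def busiest_zones(trips, top_n=3):
    """Return the top_n pickup clusters by trip count for every hour."""
    counts = (
        trips.groupby(["hour", "pickup_cluster"])
        .size()
        .reset_index(name="pickups")
        .sort_values(["hour", "pickups"], ascending=[True, False])
    )
    return counts.groupby("hour").head(top_n)


# Made-up rows: four morning trips and three evening trips.
toy = pd.DataFrame({
    "hour":           [8, 8, 8, 8, 18, 18, 18],
    "pickup_cluster": [3, 3, 7, 3, 12, 12, 5],
})
result = busiest_zones(toy, top_n=1)
print(result)
```

Applied to `month`, the resulting table would list, for each hour, the zones a driver app could recommend first.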



👍👍 I invite you to write to me with your ideas and comments and, above all, to share your opinions 🌍