
A Machine Learning study for segmenting private-transport pickup and dropoff areas in the city of Bogotá, based on Uber data 2016-2017

Urban mobility is an interesting subject to analyze. In this post we will spatially analyze the pickup and dropoff areas of the Uber service in Bogotá, based on the records of the Taxímetro EC app available on Kaggle. The idea is to build heat zones over the passenger pickup and dropoff areas of the city, and then use the unsupervised K-means clustering algorithm to partition the city into groups.

With this simple idea we will write the code that lets us discover where passenger pickups and dropoffs occur in the city of Bogotá.

Libraries used

In [91]:
# Install dependencies (safe to re-run; use --user if permissions are restricted)
!pip3 install graphviz dask toolz cloudpickle folium gpxpy xgboost
!pip3 install -U scikit-learn

import dask.dataframe as dd
import pandas as pd
import folium
import datetime
import time
import numpy as np
import matplotlib
matplotlib.use('nbagg')
import matplotlib.pylab as plt
import seaborn as sns
from matplotlib import rcParams
import gpxpy.geo
from sklearn.cluster import MiniBatchKMeans, KMeans
import math
import pickle
import os

# On Windows, xgboost needs the MinGW runtime on PATH before it is imported
mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\mingw64\\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
import xgboost as xgb

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings("ignore")

Data source

The data comes from the datasets collected and published by the Taxímetro EC app, available on Kaggle. Taxímetro EC is a tool built to compare fares based on the GPS traces of routes requested through Uber and to estimate the cost of the same trip by taxi.

The data includes the pickup and dropoff variables: duration, waiting time, location and distance. In this post I skip data cleaning and focus on grouping the records by distance; a more thorough analysis is certainly possible, but for practical purposes this example concentrates on the use of the algorithm.
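Even though cleaning is skipped here, note that the set contains corrupted records (one sample trip lasts over 3.6 million seconds, roughly 42 days). A minimal sanity filter might look like this sketch on a toy frame; the column names follow the dataset, the thresholds are assumptions:

```python
import pandas as pd

# Toy frame mimicking the dataset's trip_duration (seconds) and dist_meters columns
trips = pd.DataFrame({
    "trip_duration": [1419, 782, 3632095, 66, 449620],
    "dist_meters":   [11935, 7101, 2655, 318, 3228],
})

# Keep trips between 1 minute and 3 hours -- thresholds are illustrative assumptions
clean = trips[(trips.trip_duration >= 60) & (trips.trip_duration <= 3 * 3600)]
print(len(clean))  # → 3: the two multi-day "trips" are dropped
```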

Processing

Here we convert the column types and add date columns that allow better grouping by month.

In [2]:
!ls ~/library
month = pd.read_csv("~/library/bog_clean.csv", index_col=0)
bog_2019.csv   bog_uber2018-2019.ipynb	taxi_bog.ipynb	  Untitled.ipynb
bog_clean.csv  prueba.ipynb		Untitled 1.ipynb
In [3]:
month.head()
Out[3]:
vendor_id pickup_datetime dropoff_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration dist_meters wait_sec
id
1 Bogotá 2016-09-18 01:54:11 2016-09-18 02:17:49 -74.170353 4.622699 -74.119259 4.572322 N 1419 11935 293
2 Bogotá 2016-09-18 03:31:05 2016-09-18 03:44:06 -74.123542 4.604075 -74.116125 4.572578 N 782 7101 139
3 Bogotá 2016-08-07 03:35:36 2016-09-18 04:30:31 -74.178643 4.646176 -74.178711 4.646367 N 3632095 2655 2534
4 Bogotá 2016-09-18 04:31:13 2016-09-18 04:32:19 -74.163398 4.641949 -74.165813 4.640649 N 66 318 52
5 Bogotá 2016-09-13 12:07:04 2016-09-18 05:00:44 -74.137539 4.596347 -74.125364 4.576745 N 449620 3228 211
In [4]:
month.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3063 entries, 1 to 3063
Data columns (total 11 columns):
vendor_id             3063 non-null object
pickup_datetime       3063 non-null object
dropoff_datetime      3063 non-null object
pickup_longitude      3063 non-null float64
pickup_latitude       3063 non-null float64
dropoff_longitude     3063 non-null float64
dropoff_latitude      3063 non-null float64
store_and_fwd_flag    3063 non-null object
trip_duration         3063 non-null int64
dist_meters           3063 non-null int64
wait_sec              3063 non-null int64
dtypes: float64(4), int64(3), object(4)
memory usage: 287.2+ KB
In [5]:
month.pickup_datetime = pd.to_datetime(month.pickup_datetime, format='%Y-%m-%d %H:%M:%S')
# Extract month, day and hour via the .dt accessor for later grouping
month['month'] = month.pickup_datetime.dt.month
month['day'] = month.pickup_datetime.dt.day
month['hour'] = month.pickup_datetime.dt.hour
In [6]:
month.head()
Out[6]:
vendor_id pickup_datetime dropoff_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration dist_meters wait_sec month day hour
id
1 Bogotá 2016-09-18 01:54:11 2016-09-18 02:17:49 -74.170353 4.622699 -74.119259 4.572322 N 1419 11935 293 9 18 1
2 Bogotá 2016-09-18 03:31:05 2016-09-18 03:44:06 -74.123542 4.604075 -74.116125 4.572578 N 782 7101 139 9 18 3
3 Bogotá 2016-08-07 03:35:36 2016-09-18 04:30:31 -74.178643 4.646176 -74.178711 4.646367 N 3632095 2655 2534 8 7 3
4 Bogotá 2016-09-18 04:31:13 2016-09-18 04:32:19 -74.163398 4.641949 -74.165813 4.640649 N 66 318 52 9 18 4
5 Bogotá 2016-09-13 12:07:04 2016-09-18 05:00:44 -74.137539 4.596347 -74.125364 4.576745 N 449620 3228 211 9 13 12
In [7]:
month.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3063 entries, 1 to 3063
Data columns (total 14 columns):
vendor_id             3063 non-null object
pickup_datetime       3063 non-null datetime64[ns]
dropoff_datetime      3063 non-null object
pickup_longitude      3063 non-null float64
pickup_latitude       3063 non-null float64
dropoff_longitude     3063 non-null float64
dropoff_latitude      3063 non-null float64
store_and_fwd_flag    3063 non-null object
trip_duration         3063 non-null int64
dist_meters           3063 non-null int64
wait_sec              3063 non-null int64
month                 3063 non-null int64
day                   3063 non-null int64
hour                  3063 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(6), object(3)
memory usage: 358.9+ KB
In [43]:
def generateBaseMap(default_location=[4.693943, -73.985880], default_zoom_start=11):
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map
base_map = generateBaseMap()
base_map
Out[43]:
In [44]:
type(base_map)
Out[44]:
folium.folium.Map
In [45]:
from folium.plugins import HeatMap

Once the month column is in place, we build a heatmap for the trips from April onwards (month > 3)

In [46]:
df_copy = month[month.month>3].copy()
df_copy['count'] = 1
In [47]:
df_copy[['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().sort_values('count', ascending=False).head(10)
Out[47]:
count
pickup_latitude pickup_longitude
4.704125 -74.073603 3
4.657017 -74.129252 3
4.574614 -74.093426 2
4.752091 -74.050850 2
4.706581 -74.051700 2
4.615209 -74.159510 2
4.668245 -74.105174 2
4.645623 -74.064229 2
4.706558 -74.051733 2
4.763551 -74.027494 2
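Grouping by exact GPS floats mostly yields counts of 1-3, because raw coordinates rarely repeat. Rounding to about 3 decimal places (~110 m cells) before grouping aggregates nearby fixes into real hotspots; a sketch on a toy frame (column names follow the dataset, the rounding precision is an assumption):

```python
import pandas as pd

# Toy pickups: three GPS fixes near one corner, two near another
pickups = pd.DataFrame({
    "pickup_latitude":  [4.70412, 4.70418, 4.70421, 4.65701, 4.65698],
    "pickup_longitude": [-74.07360, -74.07356, -74.07362, -74.12926, -74.12931],
})
pickups["count"] = 1

# Round to 3 decimals (~110 m) so nearby GPS fixes fall into the same bin
binned = (pickups.assign(lat=pickups.pickup_latitude.round(3),
                         lon=pickups.pickup_longitude.round(3))
                 .groupby(["lat", "lon"])["count"].sum()
                 .sort_values(ascending=False))
print(binned)
```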

Pickup map for Bogotá

In [49]:
base_map = generateBaseMap()
HeatMap(data=df_copy[['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
base_map
Out[49]:

Dropoff map for Bogotá

In [79]:
base_map = generateBaseMap()
HeatMap(data=df_copy[['dropoff_latitude', 'dropoff_longitude', 'count']].groupby(['dropoff_latitude', 'dropoff_longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
base_map
Out[79]:

Clustering Pickup

In [65]:
!pip install gpxpy
In [71]:
import gpxpy
import gpxpy.gpx
from sklearn.cluster import MiniBatchKMeans
coords = month[['pickup_latitude', 'pickup_longitude']].values
neighbours = []

def find_min_distance(cluster_centers, cluster_len):
    # For each cluster center, count how many other centers lie within a
    # 2-mile vicinity (less2) versus outside it (more2)
    less2 = []
    more2 = []
    min_dist = 1000
    for i in range(0, cluster_len):
        nice_points = 0
        wrong_points = 0
        for j in range(0, cluster_len):
            if j != i:
                # haversine_distance returns meters; dividing by 1.60934*1000 converts to miles
                distance = gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1], cluster_centers[j][0], cluster_centers[j][1])
                min_dist = min(min_dist, distance/(1.60934*1000))
                if (distance/(1.60934*1000)) <= 2:
                    nice_points += 1
                else:
                    wrong_points += 1
        less2.append(nice_points)
        more2.append(wrong_points)
    neighbours.append(less2)
    print("On choosing a cluster size of ", cluster_len,
          "\nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2):", np.ceil(sum(less2)/len(less2)),
          "\nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2):", np.ceil(sum(more2)/len(more2)),
          "\nMin inter-cluster distance = ", min_dist, "\n---")

def find_clusters(increment):
    # Fit MiniBatchKMeans with `increment` clusters and label every pickup point
    kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000, random_state=42).fit(coords)
    month['pickup_cluster'] = kmeans.predict(month[['pickup_latitude', 'pickup_longitude']])
    cluster_centers = kmeans.cluster_centers_
    cluster_len = len(cluster_centers)
    return cluster_centers, cluster_len

# Sweep cluster counts from 10 to 90 in steps of 10
for increment in range(10, 100, 10):
    cluster_centers, cluster_len = find_clusters(increment)
    find_min_distance(cluster_centers, cluster_len)
On choosing a cluster size of  10 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 9.0 
Min inter-cluster distance =  3.4288314414508263 
---
On choosing a cluster size of  20 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 19.0 
Min inter-cluster distance =  1.4708481498272303 
---
On choosing a cluster size of  30 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 29.0 
Min inter-cluster distance =  1.3874150405639702 
---
On choosing a cluster size of  40 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 38.0 
Min inter-cluster distance =  1.0377335582174685 
---
On choosing a cluster size of  50 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 47.0 
Min inter-cluster distance =  1.0263199117409905 
---
On choosing a cluster size of  60 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 57.0 
Min inter-cluster distance =  0.8036276599740347 
---
On choosing a cluster size of  70 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 4.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 66.0 
Min inter-cluster distance =  0.7600835906262101 
---
On choosing a cluster size of  80 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 75.0 
Min inter-cluster distance =  0.7284120065008872 
---
On choosing a cluster size of  90 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 85.0 
Min inter-cluster distance =  0.5734437200523472 
---
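Besides the inter-cluster-distance sweep, the inertia "elbow" is a common complementary way to pick k. A minimal sketch on synthetic coordinates (an illustrative stand-in for the notebook's real `coords` array):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(42)
# Synthetic lat/lon-like points scattered around three hotspots (illustrative only)
centers = np.array([[4.60, -74.08], [4.70, -74.05], [4.65, -74.15]])
pts = np.vstack([c + 0.01 * rng.randn(200, 2) for c in centers])

# Inertia = sum of squared distances of points to their nearest center;
# the "elbow" where it stops dropping sharply suggests a reasonable k
inertias = {}
for k in range(2, 8):
    km = MiniBatchKMeans(n_clusters=k, batch_size=100, random_state=42, n_init=3).fit(pts)
    inertias[k] = km.inertia_

for k, v in sorted(inertias.items()):
    print(k, v)
```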
In [74]:
kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000,random_state=0).fit(coords)
month['pickup_cluster'] = kmeans.predict(month[['pickup_latitude', 'pickup_longitude']])
In [77]:
cluster_centers = kmeans.cluster_centers_
cluster_len = len(cluster_centers)
for i in range(cluster_len):
    folium.Marker(list((cluster_centers[i][0],cluster_centers[i][1])), popup=(str(cluster_centers[i][0])+str(cluster_centers[i][1]))).add_to(base_map)
base_map
Out[77]:
In [80]:
def plot_clusters(frame):
    city_long_border = (-74.75, -73.4)
    city_lat_border = (4.43, 4.85)
    fig, ax = plt.subplots(ncols=1, nrows=1)
    ax.scatter(frame.pickup_longitude.values[:100000], frame.pickup_latitude.values[:100000], s=10, lw=0,
               c=frame.pickup_cluster.values[:100000], cmap='tab20', alpha=0.2)
    ax.set_xlim(city_long_border)
    ax.set_ylim(city_lat_border)
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    plt.show()

plot_clusters(month)

[Figure: pickup clusters plotted over Bogotá]

Clustering Dropoff

In [88]:
import gpxpy
import gpxpy.gpx
from sklearn.cluster import MiniBatchKMeans
coords = month[['dropoff_latitude', 'dropoff_longitude']].values
neighbours=[]

def find_min_distance(cluster_centers, cluster_len):
    nice_points = 0
    wrong_points = 0
    less2 = []
    more2 = []
    min_dist=1000
    for i in range(0, cluster_len):
        nice_points = 0
        wrong_points = 0
        for j in range(0, cluster_len):
            if j!=i:
                # haversine_distance returns meters; dividing by 1.60934*1000 converts to miles
                distance = gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1],cluster_centers[j][0], cluster_centers[j][1])
                min_dist = min(min_dist,distance/(1.60934*1000))
                if (distance/(1.60934*1000)) <= 2:
                    nice_points +=1
                else:
                    wrong_points += 1
        less2.append(nice_points)
        more2.append(wrong_points)
    neighbours.append(less2)
    print ("On choosing a cluster size of ",cluster_len,"\nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2):", np.ceil(sum(less2)/len(less2)), "\nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2):", np.ceil(sum(more2)/len(more2)),"\nMin inter-cluster distance = ",min_dist,"\n---")

def find_clusters(increment):
    kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000,random_state=42).fit(coords)
    month['dropoff_cluster'] = kmeans.predict(month[['dropoff_latitude', 'dropoff_longitude']])
    cluster_centers = kmeans.cluster_centers_
    cluster_len = len(cluster_centers)
    return cluster_centers, cluster_len

for increment in range(10, 100, 10):
    cluster_centers, cluster_len = find_clusters(increment)
    find_min_distance(cluster_centers, cluster_len)
On choosing a cluster size of  10 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 9.0 
Min inter-cluster distance =  3.607630937864671 
---
On choosing a cluster size of  20 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 19.0 
Min inter-cluster distance =  2.175964109237607 
---
On choosing a cluster size of  30 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 29.0 
Min inter-cluster distance =  1.49497236697701 
---
On choosing a cluster size of  40 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 39.0 
Min inter-cluster distance =  1.4556027847289026 
---
On choosing a cluster size of  50 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 2.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 48.0 
Min inter-cluster distance =  1.0462398195822333 
---
On choosing a cluster size of  60 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 57.0 
Min inter-cluster distance =  0.9321457034384402 
---
On choosing a cluster size of  70 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 4.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 66.0 
Min inter-cluster distance =  0.8639740818306517 
---
On choosing a cluster size of  80 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 75.0 
Min inter-cluster distance =  0.6352320765136559 
---
On choosing a cluster size of  90 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 85.0 
Min inter-cluster distance =  0.6172604310939596 
---
In [89]:
kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000,random_state=0).fit(coords)
month['dropoff_cluster'] = kmeans.predict(month[['dropoff_latitude', 'dropoff_longitude']])
In [90]:
dropoff_cluster = kmeans.cluster_centers_
cluster_len = len(dropoff_cluster)
for i in range(cluster_len):
    folium.Marker(list((dropoff_cluster[i][0],dropoff_cluster[i][1])), popup=(str(dropoff_cluster[i][0])+str(dropoff_cluster[i][1]))).add_to(base_map)
base_map
Out[90]:
In [97]:
def plot_clusters(frame):
    city_long_border = (-74.75, -73.4)
    city_lat_border = (4.43, 4.85)
    fig, ax = plt.subplots(ncols=1, nrows=1)
    ax.scatter(frame.dropoff_longitude.values[:100000], frame.dropoff_latitude.values[:100000], s=10, lw=0,
               c=frame.dropoff_cluster.values[:100000], cmap='tab20', alpha=0.2)
    ax.set_xlim(city_long_border)
    ax.set_ylim(city_lat_border)
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    plt.show()

plot_clusters(month)

[Figure: dropoff clusters plotted over Bogotá]

Results

The resulting pickup and dropoff areas let us discover which locations require more vehicles at a given time than others, due to the presence of schools, hospitals, offices, and so on. This becomes interesting if the zones can be pushed to drivers through the smartphone app, so they can move to the locations where expected pickups are highest. Another interesting use is identifying the areas with the largest concentration of users and placing attractive audiovisual media for that audience: a BTL campaign can be more effective if the brief includes this insight, as can choosing where to locate a private-transport operations hub.
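To act on these zones (for example, routing a driver to a request), a fitted model can assign any new coordinate to its cluster with predict. A minimal sketch on synthetic data (the notebook's fitted kmeans would be used instead; the coordinates here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Illustrative pickup coordinates around two hotspots in Bogotá's lat/lon range
pts = np.vstack([
    [4.70, -74.05] + 0.01 * rng.randn(100, 2),
    [4.60, -74.12] + 0.01 * rng.randn(100, 2),
])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(pts)

# A new pickup request is assigned to its zone with predict()
new_pickup = np.array([[4.701, -74.049]])
zone = int(kmeans.predict(new_pickup)[0])
print("zone:", zone)
```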

References