Who is being left out by public transport?

An analysis of public transport quality and accessibility in Santiago, Chile
Iacopo Garizio


This notebook uses census information and GTFS data to analyze public transport quality and accessibility to find underserved communities in Santiago, Chile. The findings show different results depending on the metric used. However, no matter the metric, the outskirts are always worse off than other areas.


  1. Simulating the population in each city block
  2. Processing GTFS data
  3. Finding the nearest stops/stations to each simulated person
  4. Calculating accessibility and quality metrics
  5. Plotting the results
  6. Findings
  7. Discussion
  8. Conclusion


In [1]:
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, box
from scipy.spatial import cKDTree
import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
import matplotlib.colors as colors

import random
import zipfile

Global definitions

In [2]:
# Defining a bounding box around Santiago. This will be used throughout the code.
BOX_POLY = box(-70.82, -33.645, -70.47, -33.3)
box_gdf = gpd.GeoDataFrame({'geometry': [BOX_POLY]}, crs="EPSG:4326")

1. Simulating the population in each city block

1.1 Loading census data

In [3]:
blocks = gpd.read_file("zip://./data/chile_city_blocks.zip")
blocks.to_crs("EPSG:4326", inplace=True)

blocks = blocks[blocks.REGION == 'REGIÓN METROPOLITANA DE SANTIAGO'].copy()
blocks = gpd.overlay(blocks, box_gdf)

1.2 Performing the simulation

Because the census data has only information at a city block level, I simulate the position of each person inside each city block (keeping a margin from the border of the block). The points are uniformly distributed.

In [4]:
def generate_random_points(polygon, n):
    points = []
    min_x, min_y, max_x, max_y = polygon.bounds
    while len(points) < n:
        point = Point(random.uniform(min_x, max_x), random.uniform(min_y, max_y))
        if polygon.contains(point):
    return points

def concat_gdf_list(gdf_list):
    crs = gdf_list[0].crs
    concat_df = pd.concat(random_points_list, ignore_index=True)
    return gpd.GeoDataFrame(concat_df, crs=crs)
In [5]:
SCALE_FACTOR = 0.9 # This is used to scale each block to generate the outer gap

blocks['scaled_geometry'] = blocks.scale(SCALE_FACTOR, SCALE_FACTOR, SCALE_FACTOR)
random_points_list = []

for block in blocks.itertuples():
    random_points = generate_random_points(block.scaled_geometry, block.TOTAL_PERS)
    gdf_random_points = gpd.GeoDataFrame(geometry=random_points, crs="EPSG:4326")
    gdf_random_points['FID'] = block.FID

random_population = concat_gdf_list(random_points_list)
random_population.index.name = 'person_id'
In [6]:
# FID corresponds to the ID of each city block.
geometry FID
4207599 POINT (-70.63573 -33.38863) 28366
698293 POINT (-70.72513 -33.41886) 3864
1467173 POINT (-70.58471 -33.51653) 9299
2397640 POINT (-70.55137 -33.42735) 15369
465182 POINT (-70.65245 -33.45636) 1887

1.3 Understanding the randomization

Here we can see the randomization inside a selected city block and a selected commune.

In [7]:

block_example = blocks[blocks.FID == SEL_BLOCK_FID]
random_population_block_example = random_population[random_population.FID == SEL_BLOCK_FID]
commune_example = blocks[blocks.COMUNA == SEL_COMMUNE]
random_population_commune_example = random_population[random_population.FID.isin(commune_example.FID)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

random_population_block_example.plot(ax=ax1, markersize=1, color='r')

random_population_commune_example.plot(ax=ax2, markersize=0.001, color='r')

ax1.set_title("Randomized population in one city block")
ax2.set_title("Randomized population in an entire commune")


2. Processing GTFS data

2.1 Loading Santiago's GTFS data

In [8]:
zf = zipfile.ZipFile("./data/santiago_gtfs.zip")

stops_df = pd.read_csv(zf.open('stops.txt'))
trips_df = pd.read_csv(zf.open('trips.txt'))
stop_times_df = pd.read_csv(zf.open('stop_times.txt'))

2.2 Processing the data

The code keeps the trips that occur only in working days (by doing "service_id == 'L'"). It then uses this information to keep the matching stop_times (this means that we will only get stop_times that happen on working days). The code then counts the number of buses (or trains) and the number of unique routes that pass through each stop (or station). These two measures are going to be used as a proxy for quality, where a higher number of buses, trains, or routes imply better quality.

In [9]:
trips_working_day = trips_df[trips_df.service_id == 'L'].copy()
trips_working_day_id = trips_working_day.trip_id.unique()

stop_times_df_working_day = stop_times_df[stop_times_df.trip_id.isin(trips_working_day_id)].copy()
stop_n_buses = stop_times_df_working_day.groupby('stop_id').size().reset_index(name="n_buses")

stop_trips = pd.merge(trips_working_day, stop_times_df_working_day, on='trip_id', how='left')
stop_routes = stop_trips.groupby('stop_id').route_id.apply(set).reset_index(name="routes")

stops_df = pd.merge(stops_df, stop_n_buses, on='stop_id', how='left')
stops_df = pd.merge(stops_df, stop_routes, on='stop_id', how='left')

stops_df.n_buses.fillna(0, inplace=True)  # Some stops in the GTFS don't have buses running through.
stops_df['routes'] = stops_df.routes.apply(lambda d: d if pd.notnull(d) else set())  # Can't do fillna(set()) directly.

2.3 Converting the stops data into a GeoDataFrame and filtering it

In [10]:
stops_geometry = gpd.points_from_xy(stops_df.stop_lon, stops_df.stop_lat)
stops = gpd.GeoDataFrame(stops_df, geometry=stops_geometry, crs='EPSG:4326')
stops.drop(columns=['stop_code', 'stop_lat', 'stop_lon', 'stop_url'], inplace=True)

stgo = blocks.to_crs("EPSG:3857").buffer(100).to_crs("EPSG:4326").unary_union
stops = stops[stops.within(stgo)].copy()  # Keeps all stops inside the parts of the city we're interested in.

3. Finding the nearest stops/stations to each simulated person

For this process, I load the stops' locations into a kd-tree and find the k nearest stops for each simulated point. I then calculate a haversine distance between each point and each of the nearest stops.

3.1 Creating functions to find nearest points and calculate distances

In [11]:
def haversine_distance(point_1, point_2):
    Reference: https://stackoverflow.com/a/29546836/7657658
    lon1, lat1 = point_1.x.values, point_1.y.values
    lon2, lat2 = point_2.x.values, point_2.y.values
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c
    return km

def get_nearest_stops(points, stops, k):
    points = points.copy()
    stops = stops.copy()
    points_coordinates = list(zip(points.geometry.x, points.geometry.y))
    stops_coordinates = list(zip(stops.geometry.x, stops.geometry.y))
    btree = cKDTree(stops_coordinates)
    if k != 1:
        dist, idx = btree.query(points_coordinates, n_jobs=-1, k=k)
        dist, idx = btree.query(points_coordinates, n_jobs=-1, k=[1])  # Prevents dimentional squeeze

    points_nearest_stops_dfs = []

    for k_i in range(k):
        stops_close_ki = stops.iloc[idx[:,k_i]].reset_index(drop=True).copy()
        stops_close_ki.rename(columns={'geometry': 'stop_geometry'}, inplace=True)
        points_nearest_stops = pd.concat([points.reset_index(), stops_close_ki], axis=1)

        points_nearest_stops['distance_rank'] = k_i
        points_nearest_stops['distance_km'] = haversine_distance(points_nearest_stops.stop_geometry, points_nearest_stops.geometry)

    return pd.concat(points_nearest_stops_dfs).reset_index(drop=True)

3.2 Obtaining the nearest 10 stops for each person

In [12]:
pop_nearests_k_stops = get_nearest_stops(random_population, stops, k=10)
In [13]:
person_id geometry FID stop_id stop_name n_buses routes stop_geometry distance_rank distance_km
7430965 1280324 POINT (-70.71861 -33.47129) 7743 PI7 PI7-Avenida 5 De Abril / Esq. El Altarcillo 81.0 {506v, 481, I14, I10, I13, 506, 509, 385, 101} POINT (-70.71792 -33.47039) 1 0.117982
29282315 4679751 POINT (-70.62982 -33.47751) 31528 PH791 PH791-Pintor Cicarelli / Esq. Artemio Gutiérrez 27.0 {E04, 385, D07} POINT (-70.63238 -33.47628) 4 0.273662
32424595 1671390 POINT (-70.62925 -33.54955) 10879 PE617 PE617-Sofía Eastman De H. / Esq. San Martín 9.0 {E09} POINT (-70.62968 -33.54745) 5 0.236735
32712129 1958924 POINT (-70.64424 -33.57747) 12932 PG168 PG168-Lo Martínez / Esq. Punta Horcón 9.0 {F20} POINT (-70.64188 -33.57608) 5 0.268174
31914177 1160972 POINT (-70.65192 -33.52426) 6967 PG441 PG441-General Freire / Esq. Iquique 9.0 {E02} POINT (-70.65041 -33.52242) 5 0.248343

4. Calculating accessibility and quality metrics

For each city block, this code calculates:

  • Average distance to the nearest stop/station.
  • Summed distance to the nearest stop/station.
  • Average distance to the 10 nearest stops/stations.
  • Average number of unique bus routes/Metro lines that run through stops within 900m of the person.
  • Average number of buses/trains that run through stops within 900m of the person on a working day.

This information is then merged into a single DataFrame.

4.1 Ad-hoc function to merge sets of routes

In [14]:
def combine_unique_routes(routes):
     return set().union(*routes.values)

4.2 Computing the metrics

In [15]:
DIST_MAX = 0.9 # Maximum distance in Kilometers

pop_nearest_stop = pop_nearests_k_stops[pop_nearests_k_stops.distance_rank == 0]
mean_dist_nearest = pop_nearest_stop.groupby('FID').distance_km.agg(mean_distance_nearest='mean', summed_distance_nearest='sum')

mean_dist_k_nearests = pop_nearests_k_stops.groupby('FID').distance_km.mean().to_frame(name="mean_distance_k_nearest")

pop_nearests_max_dist = pop_nearests_k_stops.copy()
farther_than_max = (pop_nearests_max_dist.distance_km > DIST_MAX)
pop_nearests_max_dist.loc[farther_than_max, 'n_buses'] = 0
pop_nearests_max_dist.loc[farther_than_max, 'routes'] = [set() for __ in range(farther_than_max.sum())]

n_buses_close = pop_nearests_max_dist.groupby(['FID', 'person_id'], as_index=False).n_buses.sum()
mean_n_buses_close = n_buses_close.groupby('FID').n_buses.mean().to_frame(name="mean_n_buses_close")

routes_close = pop_nearests_max_dist.groupby(['FID', 'person_id']).routes.apply(combine_unique_routes).reset_index(name="routes_close")
routes_close['n_routes'] = routes_close.routes_close.apply(len)
mean_n_routes_close = routes_close.groupby('FID').n_routes.mean().to_frame(name="mean_n_routes_close")

blocks_transit = blocks.copy()
blocks_transit = pd.merge(blocks_transit, mean_dist_nearest.reset_index(), on='FID', how='left')
blocks_transit = pd.merge(blocks_transit, mean_dist_k_nearests.reset_index(), on='FID', how='left')
blocks_transit = pd.merge(blocks_transit, mean_n_buses_close.reset_index(), on='FID', how='left')
blocks_transit = pd.merge(blocks_transit, mean_n_routes_close.reset_index(), on='FID', how='left')

blocks_transit['density'] = blocks_transit.TOTAL_PERS / blocks_transit.SHAPE_Area  # Useful extra metric.

Filtering out city blocks

In [16]:
# blocks_transit_f corresponds to the *filtered* version of blocks_transit. Only keeps blocks with 10 or more people
blocks_transit_f = blocks_transit[blocks_transit.TOTAL_PERS >= 10].copy()

5. Plotting the results

5.1 Loading communes shapefiles

This will be used as a reference in the graphs.

In [17]:
communes = gpd.read_file("zip://./data/chile_communes.zip")
communes.to_crs("EPSG:4326", inplace=True)

communes = communes[communes.Region.isin(['Región Metropolitana de Santiago'])].copy()

5.2 Generating ad-hoc plots

In [18]:
def annotate_record(gdf_record, annotation_column, ax, annotation_outline_width):
    text = gdf_record[annotation_column]
    xy = gdf_record.geometry.representative_point().coords[0]
    path_effects = [pe.withStroke(linewidth=annotation_outline_width, foreground="white")]
    ax.annotate(text=text, xy=xy, path_effects=path_effects, ha='center')

def annotate_map(gdf, annotation_column, ax, annotation_outline_width):
    gdf.apply(lambda r: annotate_record(r, annotation_column, ax, annotation_outline_width=annotation_outline_width), axis=1)

def plot_blocks(gdf, color_column=None, vmin=None, vmax=None,
                plot_only_borders=False, linewidth=0.5, cmap='coolwarm',
                background_color=None, color=None, cbar_colors='black',
                figsize=(16, 16),
    fig, ax = plt.subplots(1, figsize=figsize, facecolor=background_color)
    clipped_gdf = gdf.copy()
    if plot_only_borders:
        clipped_gdf['geometry'] = clipped_gdf.boundary
    clipped_gdf = gpd.overlay(clipped_gdf, box_gdf)
    clipped_gdf.plot(ax=ax, column=color_column, color=color, cmap=cmap, legend=True, vmin=vmin, vmax=vmax, linewidth=linewidth)
    if color_column is not None:
        cb_ax = fig.axes[1]

    return ax

def plot_communes(communes_gdf, annotate_communes=True, annotation_column='Comuna', linewidth=0.5, background_color=None,
                  annotation_outline_width=3, color="#0f0f0f", ax=None, figsize=(16, 16)):
    if ax is None:
        fig, ax = plt.subplots(1, figsize=figsize, facecolor=background_color)
    clipped_communes = communes_gdf.copy()
    clipped_communes['geometry'] = gpd.clip(clipped_communes.boundary, box_gdf)
    clipped_communes.plot(ax=ax, color=color, linewidth=linewidth)
    if annotate_communes:
        overlayed_communes = gpd.overlay(communes_gdf, box_gdf, how='intersection')
        annotate_map(overlayed_communes, annotation_column, ax, annotation_outline_width)
    return ax

def add_author_text(ax, pos_x=0.5, pos_y=0, fontsize=12, color='black'):
    fontdict = {'fontsize': fontsize, 'color': color}
    plt.text(pos_x, pos_y, 'Author: Iacopo Garizio',

6. Findings

6.1 Average distance to the nearest bus stop/Metro station

The average person has to walk 188 meters to get to the nearest bus stop or Metro station, which approximately translates into 3 minutes walking (assuming a conservative speed of 1 m/s)[^1]. However, this number surpasses the 765 meters for the city blocks above the 98th percentile. Most of these are located on the outskirts.
From the map, it is also interesting to see that in almost every commune there are patches of low-accessibility, which highlights the importance of high-granularity analyses.

* It is important to notice that this would only hold if the path taken was a straight line. For most people, this path would not be so direct. To obtain an approximation for the real distance traveled, some studies have obtained a correction factor of around 1.35.

In [19]:
var_name = 'mean_distance_nearest'
vmin = blocks_transit_f[var_name].quantile(0.05)
vmax = blocks_transit_f[var_name].quantile(0.90)

ax = plot_blocks(blocks_transit_f, var_name, vmin, vmax, cmap='coolwarm')
ax = plot_communes(communes, annotate_communes=True, linewidth=0.3, ax=ax)
title = f"Average distance to the nearest bus stop/Metro station"
ax.set_title(title, {'fontsize': 20})