The purpose of this notebook is to create junction tables between "checks" and the various pieces of information the checks are performed on for the purposes of transit data assessments.
Assessment checks are performed at the following levels:
feed_info.txt
)In order to make sure that each relevant record is assessed with each relevant "check", a cartesian join is made between the records and the checks to form a "junction table", which allows us to add attributes about the junction of the record and the check (i.e. a "grade" or score).
This notebook will create three delimited files (sep="|") corresponding to each of these levels, which can be appended to the records in the following Airtable tables:
data_cross.csv
--> GTFS Dataset Check Dataservices_cross.csv
--> gtfs-dataset Check Dataprovider_cross.csv
--> Provider Check DataThis notebook makes use of the pyairtable
library to access the California Transit Airtable Database. You will need read access to this database with an API key stored in a .env file in the base directory of this notebook with the following format:
AIRTABLE_API_KEY='your_api_key_info`
import os
import pandas as pd
from dotenv import dotenv_values
from pyairtable import Table
API_KEY = dotenv_values(".env")['AIRTABLE_API_KEY']
BASE_ID = 'appjPsudTDcbLUWM5'
GTFS_SERVICES_TABLE = 'Assessed GTFS-Services'
ASSESSED_SERVICES_TABLE = 'Assessed Services'
ASSESSED_PROVIDERS_TABLE = 'Assessed Transit Providers'
CHECKS_TABLE = 'gtfs checks'
def airtable_to_df(table_id:str,base_id:str=BASE_ID,api_key:str=API_KEY)->pd.DataFrame:
records = Table(api_key, base_id, table_id)
airtable_rows = []
airtable_index = []
for record in records.all():
airtable_rows.append(record["fields"])
airtable_index.append(record["id"])
return pd.DataFrame(airtable_rows, index=airtable_index)
gtfs_services_df = airtable_to_df(GTFS_SERVICES_TABLE)
gtfs_checks_df = airtable_to_df(CHECKS_TABLE)
services_df = airtable_to_df(ASSESSED_SERVICES_TABLE)
gtfs_providers_df = airtable_to_df(ASSESSED_PROVIDERS_TABLE)
gtfs_services_df
Name | Services | GTFS Dataset | Dataset Type | Category | Service Type (from Services) | Provider | Operator | Dataset Producers (from GTFS Dataset) | Dataset Publisher (from GTFS Dataset) | ... | Flex Status | agency_id | route_id | Fares v2 Status | ITP Activities (from GTFS Dataset) | ITP Schedule TODO (from GTFS Dataset) | network_id | Fares Notes (from GTFS Dataset) | Schedule Comments (from GTFS Dataset) | gtfs check data | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rec05rHUgCVFqsXgN | Tulare Intermodal Express – Tulare Alerts | Tulare Intermodal Express | Tulare Alerts | [GTFS Alerts] | primary | [fixed-route] | Tulare County Regional Transit Agency | Tulare County Regional Transit Agency | GMV Syncromatics Inc | GMV Syncromatics Inc | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
rec09fVK7ls8fl5sp | Monterey-Salinas Transit – Monterey Salinas Tr... | Monterey-Salinas Transit | Monterey Salinas TripUpdates | [GTFS TripUpdates] | primary | [ADA paratransit] | Monterey-Salinas Transit | Monterey-Salinas Transit | Monterey-Salinas Transit | NaN | ... | [Needed - Existing GTFS] | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
rec0WcjVurdKkHx7S | Healdsburg Shuttle – Bay Area 511 TripUpdates | Healdsburg Shuttle | Bay Area 511 TripUpdates | [GTFS TripUpdates] | primary | [fixed-route] | Sonoma County | Sonoma County | Metropolitan Transportation Commission | Metropolitan Transportation Commission | ... | NaN | SO | SO:67 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
rec0XvgWtnmoUl8Lz | Glenn Ride – Glenn Schedule | Glenn Ride | Glenn Schedule | [GTFS Schedule] | primary | [deviated fixed-route, fixed-route, ADA paratr... | Glenn County | Paratransit Services Inc. | Trillium Inc. | Trillium Inc. | ... | [In Progress - Trillium] | NaN | NaN | [Vendor published] | [GTFS Into Google, GTFS Schedule Creation] | [Train transit provider to own] | NaN | NaN | NaN | NaN |
rec0YWNONFHEO0pUl | Valley Express – VCTC Alerts | Valley Express | VCTC Alerts | [GTFS Alerts] | primary | [fixed-route] | Ventura County Transportation Commission | Ventura County Transportation Commission | GMV Syncromatics Inc | GMV Syncromatics Inc | ... | NaN | 149 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
reczSG6bO6il6QMLZ | Healdsburg Shuttle – Bay Area 511 Schedule | Healdsburg Shuttle | Bay Area 511 Schedule | [GTFS Schedule] | primary | [fixed-route] | Sonoma County | Sonoma County | Metropolitan Transportation Commission | Metropolitan Transportation Commission | ... | NaN | SO | SO:67 | [Vendor published] | NaN | NaN | NaN | MTC | NaN | NaN |
reczZKKwKsVQuU0dF | Placer County Transit – Placer Schedule | Placer County Transit | Placer Schedule | [GTFS Schedule] | primary | [fixed-route] | Placer County | Placer County | Placer County, Trillium Inc. | Trillium Inc. | ... | NaN | NaN | NaN | [Vendor published] | NaN | NaN | NaN | Published by Trillium | NaN | NaN |
reczf9FxlvLopZybb | Taft Area Transit – Taft Schedule | Taft Area Transit | Taft Schedule | [GTFS Schedule] | primary | [fixed-route] | City of Taft | City of Taft | Trillium Inc. | Trillium Inc. | ... | NaN | NaN | NaN | [Needs GTFS Schedule] | [GTFS Schedule Creation] | [Get into google, Ask Kern to own] | NaN | NaN | NaN | NaN |
recziP7dvY7apKbPB | Kern Transit – Kern Schedule | Kern Transit | Kern Schedule | [GTFS Schedule] | primary | [fixed-route] | Kern County | Kern County | Kern County, Trillium Inc. | Trillium Inc. | ... | NaN | 194 | NaN | [Vendor published] | NaN | NaN | NaN | Published by Trillium | NaN | NaN |
reczmEIm3KnSr2jF3 | Monterey-Salinas Transit – Monterey Salinas Ve... | Monterey-Salinas Transit | Monterey Salinas VehiclePositions | [GTFS VehiclePositions] | primary | [ADA paratransit] | Monterey-Salinas Transit | Monterey-Salinas Transit | Monterey-Salinas Transit | NaN | ... | [Needed - Existing GTFS] | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
507 rows × 28 columns
gtfs_checks_df.head()
Check # | Description | Scope | Source | Source: medium-term | Source: goal | Max Score | Score Type | gtfs-service check data copy | Scoring Criteria | gtfs check data | gtfs-dataset check data copy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
rec4Ah8sgs8F243XH | 16 | TripUpdates Published | TripUpdates Dataset | [human] | [auto] | [auto] | 5.0 | Nominal | [reczJXpgSxkKqzrFg, recAarP7F7LPwtlcn, reczYzu... | NaN | NaN | NaN |
rec5GIXSm4aAcpuZj | 20 | Publish to OpenMobilityData | TripUpdates Dataset | [human] | [human] | [human] | 1.0 | Boolean | [recjjCy0uuLGk6RSq, recXNqjW35Dq3omAX, reclalk... | There or not.\n\n | NaN | NaN |
rec5ahg0uEPYRoMEX | 26 | Publish to TransitLand | VehiclePositions Dataset | [human] | [human] | [human] | 1.0 | Boolean | [recw95VXxi1w2rmOC, recWbsiIF3zlT2u5d, recuuLV... | There or not.\n\n | NaN | NaN |
rec6i4tZoKfMLAWiK | 22 | Reasonable API Key Process | TripUpdates Dataset | [human] | [human] | [human] | 2.0 | Nominal | [rec5b8Xd4UbUciyH7, recNp2w2UUdwfEB6j, recskZe... | NaN | NaN | NaN |
recBWkZKdzqEIDE46 | 13 | GTFS Grading Scheme Score | Service within GTFS Schedule Dataset | [human] | [gtfs-trained human] | [human] | 3.0 | Continuous | NaN | 3\* Grading scheme score / Max grading scheme ... | [recuql0JQWFcn7KFs, recxItR7wCdB51vjA, rec4ivy... | NaN |
services_df.head()
Name | service_id | Provider | Operator | Currently Operating | Mode | Service Type | Notes | Funding Sources | Funding Category | ... | Service Area Type | Service Area Buffer (miles) | Service Availability Category | Paratransit For | organization stack components | ITP Schedule TODO (from GTFS Dataset) (from GTFS Services Association) | ITP Activities (from GTFS Dataset) (from GTFS Services Association) | Schedule Comments (from GTFS Dataset) (from GTFS Services Association) | Season Start | Season End | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rec0OCKo3fwRLpPIh | Watsonville Circulator | rec6lgJrPslFjSXdk | Santa Cruz Metropolitan Transit District | Santa Cruz Metropolitan Transit District | True | [bus] | [fixed-route] | ETRO’s new Watsonville Circulator Route is des... | Caltrans | [public] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
rec0ZVZt8PTzF2zuY | Topanga Beach Bus | rec00Di9RibzcrxHS | Los Angeles County | Los Angeles County | True | [bus] | [fixed-route] | Connects Metro Orange Line to Topanga Beach an... | Caltrans | [public] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
rec0ikYugqTjhLBkg | Imperial Valley Transit | recMR1zH6QMLPSriV | Imperial County Transportation Commission | Imperial County Transportation Commission | True | [bus] | [fixed-route, deviated fixed-route] | NaN | Caltrans | [public] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
rec0n7D4vtGfWZVEc | Cityline | recaTB1mtROhjUHNR | City of West Hollywood | City of West Hollywood | True | [bus] | [fixed-route] | West Hollywood’s own free shuttle\n\nCityline ... | Caltrans | [public] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
rec14ZBuXpyw4OLY2 | Avocado Heights/Bassett/West Valinda Shuttle | recHJivXoZGV1KhZe | Los Angeles County | Los Angeles County | True | [bus] | [fixed-route] | The Avocado Heights/Bassett/West Valinda Shutt... | Caltrans | [public] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 41 columns
gtfs_providers_df.head()
Name | Alias' | organization_id | ITP ID | Tracking Cat - og | Reporting Cat - og | Assist Cat - og | Caltrans District | MPO/RTPA | Planning Authority | ... | Count of Services with Complete Realtime Status | At least one GTFS feed for any service (1=yes) | At least on complete RT set (1=yes) | Complete static GTFS coverage (1=yes) | Complete RT coverage (1=yes) | provider check data | Provider Assessments | Details | Website | Contracts Held | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rec0ZHctuUYh5wtLS | San Luis Obispo Regional Transit Authority | [SLORTA] | reciakGBN1DP9dK9N | 289.0 | Active | Core | White Glove | 05 - San Luis Obispo | San Luis Obispo Council of Governments | San Luis Obispo Council of Governments | ... | 0 | 1 | 0 | 0 | 0 | [recJpdz8eCeRsUQSj, recK3xMG6XsnQpYwD, recX8ke... | [recIUkhYnikNkX12h] | NaN | NaN | NaN |
rec0dtZWydv7z5afX | City of Taft | [TAT] | recgTxL1xyvgC8e9k | 330.0 | Active | Core | White Glove | 06 - Fresno | Kern Council of Governments | Kern Council of Governments | ... | 0 | 1 | 0 | 1 | 0 | [rec9P4eMWAQ2GaF5A, recYKDCjjLJbzcJTM, recLVPL... | [recp38jhkZVctheDK] | NaN | NaN | NaN |
rec0qKtbrBvE1AyRe | City of Cerritos | NaN | recXYM27Lts8XF98x | 63.0 | Active | Core | White Glove | 07 - Los Angeles | Southern California Association of Governments | Southern California Association of Governments | ... | 0 | 0 | 0 | 0 | 0 | [receMU82jZycJkGFe, recwhLFXaVN5DcqQ1, recXbUF... | [recJMNAE1xFtv9B60] | NaN | NaN | NaN |
rec0yf9AiAqnwpaIT | City of Delano | NaN | recROsnN85RlZziSj | 91.0 | Active | Core | White Glove | 06 - Fresno | Kern Council of Governments | Kern Council of Governments | ... | 0 | 1 | 0 | 1 | 0 | [recxPm0wCZJj0jtR0, recyeyK9Brn5UV85M, recHjg7... | [rec90UiwBWjFXvdt9] | NaN | NaN | NaN |
rec1gD38VBhjtMssT | Imperial County Transportation Commission | NaN | rec38PbjPbEy2Tvdu | 138.0 | Active | Core | White Glove | 11 - San Diego | Southern California Association of Governments | Southern California Association of Governments | ... | 0 | 0 | 0 | 0 | 0 | [recT2TIZ0r3FHdCwQ, recgcfRSpfLffxTDf, recSF6B... | [reczToe4t1ynsYs5T] | NaN | NaN | NaN |
5 rows × 33 columns
Airtable doesn't have a good way of auto-generating a junction table (a table associating records from two other tables) based on two other tables and a set of conditions.
The following codes createsa "cartesian product" (every record to every record) junction table based on exports of two airtable tables and then selecting which association records are relevant (i.e. only checks that apply to GTFS Schedule should be associated with GTFS Schedule).
The resulting delimeted file can be pasted into a spreadsheet which can then be based into Airtable.
services_df = services_df.rename(columns={"Name":"Services"})
service_checks = [
"Service within GTFS Schedule Dataset",
"Service within TripUpdates Dataset",
"Service within VehiclePositions Dataset",
"Service within Alerts Dataset"
]
service_checks_df = gtfs_checks_df.loc[gtfs_checks_df["Scope"].isin(service_checks),["Check #","Scope"]]
# cartesian product of service checks and services
service_checks_cross_df = service_checks_df.merge(services_df["Services"], how='cross')
service_checks_cross_df = service_checks_cross_df.sort_values(["Services","Scope"])
scope_data_mapping = {
"Service within GTFS Schedule Dataset":"GTFS Schedule",
"Service within TripUpdates Dataset":"GTFS TripUpdates",
"Service within VehiclePositions Dataset":"GTFS VehiclePositions",
"Service within Alerts Dataset":"GTFS Alerts",
}
service_checks_cross_df["Dataset Type"] = service_checks_cross_df["Scope"].map(scope_data_mapping)
service_checks_cross_df = service_checks_cross_df.drop(columns=["Scope"])
service_checks_cross_df["Dataset Type"].value_counts()
service_checks_cross_df
Check # | Services | Dataset Type | |
---|---|---|---|
1382 | 32 | AC Transit | GTFS Alerts |
246 | 13 | AC Transit | GTFS Schedule |
530 | 10 | AC Transit | GTFS Schedule |
814 | 8 | AC Transit | GTFS Schedule |
1098 | 7 | AC Transit | GTFS Schedule |
... | ... | ... | ... |
2454 | 9 | the Link-Baldwin Hills Parklands | GTFS Schedule |
2738 | 14 | the Link-Baldwin Hills Parklands | GTFS Schedule |
3022 | 11 | the Link-Baldwin Hills Parklands | GTFS Schedule |
2170 | 18 | the Link-Baldwin Hills Parklands | GTFS TripUpdates |
1886 | 25 | the Link-Baldwin Hills Parklands | GTFS VehiclePositions |
3124 rows × 3 columns
services_gtfs_df = gtfs_services_df[["Name","Dataset Type","GTFS Dataset","Services"]].rename(columns={"Name":"gtfs-service record"})
# make services a list if it isn't already
services_gtfs_df["Services"]=services_gtfs_df["Services"].apply(lambda x: x.split(","))
# if dataset type is a list, then take first value
services_gtfs_df["Dataset Type"]=services_gtfs_df["Dataset Type"].apply(lambda x: x[0])
all_services_gtfs_df = services_gtfs_df.explode("Services")
all_services_gtfs_df["Services"]=all_services_gtfs_df["Services"].apply(lambda x: x.strip())
all_services_gtfs_df
gtfs-service record | Dataset Type | GTFS Dataset | Services | |
---|---|---|---|---|
rec05rHUgCVFqsXgN | Tulare Intermodal Express – Tulare Alerts | GTFS Alerts | Tulare Alerts | Tulare Intermodal Express |
rec09fVK7ls8fl5sp | Monterey-Salinas Transit – Monterey Salinas Tr... | GTFS TripUpdates | Monterey Salinas TripUpdates | Monterey-Salinas Transit |
rec0WcjVurdKkHx7S | Healdsburg Shuttle – Bay Area 511 TripUpdates | GTFS TripUpdates | Bay Area 511 TripUpdates | Healdsburg Shuttle |
rec0XvgWtnmoUl8Lz | Glenn Ride – Glenn Schedule | GTFS Schedule | Glenn Schedule | Glenn Ride |
rec0YWNONFHEO0pUl | Valley Express – VCTC Alerts | GTFS Alerts | VCTC Alerts | Valley Express |
... | ... | ... | ... | ... |
reczSG6bO6il6QMLZ | Healdsburg Shuttle – Bay Area 511 Schedule | GTFS Schedule | Bay Area 511 Schedule | Healdsburg Shuttle |
reczZKKwKsVQuU0dF | Placer County Transit – Placer Schedule | GTFS Schedule | Placer Schedule | Placer County Transit |
reczf9FxlvLopZybb | Taft Area Transit – Taft Schedule | GTFS Schedule | Taft Schedule | Taft Area Transit |
recziP7dvY7apKbPB | Kern Transit – Kern Schedule | GTFS Schedule | Kern Schedule | Kern Transit |
reczmEIm3KnSr2jF3 | Monterey-Salinas Transit – Monterey Salinas Ve... | GTFS VehiclePositions | Monterey Salinas VehiclePositions | Monterey-Salinas Transit |
543 rows × 4 columns
# attach information about gtfs datasets for each service to evaluate
service_checkdata_df = service_checks_cross_df.merge(
all_services_gtfs_df,
on=["Services","Dataset Type"],how="left")
keep_cols = ["Check #","gtfs-service record","Services"]
keep_service_checkdata_df = service_checkdata_df[keep_cols] #.drop_duplicates()
keep_service_checkdata_df.to_csv("service_cross.csv",index=False,sep="|")
#service_checkdata_df.loc[service_checkdata_df["gtfs-service record"].isna()]
keep_service_checkdata_df
#service_checkdata_df
Check # | gtfs-service record | Services | |
---|---|---|---|
0 | 32 | NaN | AC Transit |
1 | 13 | AC Transit – Bay Area 511 Schedule | AC Transit |
2 | 10 | AC Transit – Bay Area 511 Schedule | AC Transit |
3 | 8 | AC Transit – Bay Area 511 Schedule | AC Transit |
4 | 7 | AC Transit – Bay Area 511 Schedule | AC Transit |
... | ... | ... | ... |
3135 | 9 | the Link-Athens, the Link Florence-Firestone/W... | the Link-Baldwin Hills Parklands |
3136 | 14 | the Link-Athens, the Link Florence-Firestone/W... | the Link-Baldwin Hills Parklands |
3137 | 11 | the Link-Athens, the Link Florence-Firestone/W... | the Link-Baldwin Hills Parklands |
3138 | 18 | NaN | the Link-Baldwin Hills Parklands |
3139 | 25 | NaN | the Link-Baldwin Hills Parklands |
3140 rows × 3 columns
service_checkdata_df.head()
Check # | Services | Dataset Type | gtfs-service record | GTFS Dataset | |
---|---|---|---|---|---|
0 | 32 | AC Transit | GTFS Alerts | NaN | NaN |
1 | 13 | AC Transit | GTFS Schedule | AC Transit – Bay Area 511 Schedule | Bay Area 511 Schedule |
2 | 10 | AC Transit | GTFS Schedule | AC Transit – Bay Area 511 Schedule | Bay Area 511 Schedule |
3 | 8 | AC Transit | GTFS Schedule | AC Transit – Bay Area 511 Schedule | Bay Area 511 Schedule |
4 | 7 | AC Transit | GTFS Schedule | AC Transit – Bay Area 511 Schedule | Bay Area 511 Schedule |
dataset_checks_df = pd.read_csv(os.path.join("data","gtfs checks-Dataset Scope.csv"), usecols = ["Check #","Scope"])
datasets_df = pd.read_csv(os.path.join("data","Assessed Feeds.csv"), usecols = ["gtfs_dataset_id","Data"])
# cartesian product
dataset_cross_df = datasets_df.merge(dataset_checks_df, how='cross')
# select applicable checks
dataset_cross_df["data_match"]=dataset_cross_df["Data"]+" Dataset"
dataset_cross_df["scope_match"] = dataset_cross_df["Scope"]
dataset_cross_df.loc[
dataset_cross_df["Scope"] != "GTFS Schedule Dataset",
"scope_match"
]="GTFS " + dataset_cross_df["Scope"]
dataset_checkdata_df = dataset_cross_df[dataset_cross_df["data_match"]==dataset_cross_df["scope_match"]]
#checks "GTFS Schedule Dataset"
#dataset "GTFS Schedule"
dataset_checkdata_df[["gtfs_dataset_id","Check #"]].to_csv("data_cross.csv",index=False,sep="|")
dataset_checkdata_df.head()
gtfs_dataset_id | Data | Check # | Scope | data_match | scope_match | |
---|---|---|---|---|---|---|
0 | Get Around Town Express Schedule | GTFS Schedule | 1 | GTFS Schedule Dataset | GTFS Schedule Dataset | GTFS Schedule Dataset |
1 | Get Around Town Express Schedule | GTFS Schedule | 2 | GTFS Schedule Dataset | GTFS Schedule Dataset | GTFS Schedule Dataset |
2 | Get Around Town Express Schedule | GTFS Schedule | 3 | GTFS Schedule Dataset | GTFS Schedule Dataset | GTFS Schedule Dataset |
3 | Get Around Town Express Schedule | GTFS Schedule | 4 | GTFS Schedule Dataset | GTFS Schedule Dataset | GTFS Schedule Dataset |
4 | Get Around Town Express Schedule | GTFS Schedule | 5 | GTFS Schedule Dataset | GTFS Schedule Dataset | GTFS Schedule Dataset |
provider_checks_df = pd.read_csv(os.path.join("data","provider_checks.csv"))
providers_df = pd.read_csv(os.path.join("data","Assessed Transit Providers-Grid view.csv"), usecols = ["Name"])
# cartesian product
providers_cross_df = providers_df.merge(provider_checks_df, how='cross')
providers_cross_df.to_csv("provider_cross.csv",index=False,sep="|")
providers_cross_df.head()
Name | checks | |
---|---|---|
0 | Alameda-Contra Costa Transit District | 37 |
1 | Alameda-Contra Costa Transit District | 39 |
2 | Alameda-Contra Costa Transit District | 40 |
3 | Amador Regional Transit System | 37 |
4 | Amador Regional Transit System | 39 |