A machine-readable copy of the database is published online in Microsoft Access format. Let's download it.
import os
import pypyodbc
Configure the data directories.
%store -r
Remove the zip file if it already exists.
zip_path = os.path.join(input_dir, "avall.zip")
!rm -f $zip_path
Download the file.
!wget -O $input_dir/avall.zip -nc https://app.ntsb.gov/avdata/Access/avall.zip
--2018-10-17 18:43:43-- https://app.ntsb.gov/avdata/Access/avall.zip Resolving app.ntsb.gov (app.ntsb.gov)... 199.173.155.8 Connecting to app.ntsb.gov (app.ntsb.gov)|199.173.155.8|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 242648102 (231M) [application/x-zip-compressed] Saving to: ‘/home/palewire/Code/helicopter-accident-analysis/src/../input/avall.zip’ /home/palewire/Code 100%[===================>] 231.41M 1.07MB/s in 2m 43s 2018-10-17 18:46:28 (1.42 MB/s) - ‘/home/palewire/Code/helicopter-accident-analysis/src/../input/avall.zip’ saved [242648102/242648102]
Unzip the file.
!unzip -n $input_dir/\*.zip -d $input_dir/
Archive: /home/palewire/Code/helicopter-accident-analysis/src/../input/avall.zip
Use the mdbtools command-line tool to get a list of all the tables in the database
mdb_path = os.path.join(input_dir, 'avall.mdb')
tables = !mdb-tables $mdb_path
Split them into a Python list
tables = tables[0].split()
Extract the tables from the Microsoft Access database and write them out as CSVs.
def export_table(name):
print(f"Exporting {name}")
export_path = os.path.join(input_dir, f"{name.lower()}.csv")
!mdb-export $mdb_path $name > $export_path
for t in tables: export_table(t)
Exporting aircraft Exporting Country Exporting ct_seqevt Exporting dt_aircraft Exporting dt_events Exporting dt_Flight_Crew Exporting eADMSPUB_DataDictionary Exporting engines Exporting events Exporting Events_Sequence Exporting Findings Exporting Flight_Crew Exporting flight_time Exporting injury Exporting narratives Exporting NTSB_Admin Exporting Occurrences Exporting seq_of_events Exporting states Exporting ct_iaids