Tidy NTSB accidentis database

Only a subset of the accidents data is necessary for this analysis. The source tables need to be tidied up and filtered down to only records involving helicopters. That's what this notebook does.

Each step below was vetted by database experts at the NTSB.

In [1]:
import os
import pandas as pd

Open

Configure the data directories.

In [2]:
%store -r

Read in the files we want to keep.

In [3]:
read_df = lambda name, types: pd.read_csv(os.path.join(input_dir, name), dtype=types, low_memory=False)

A list of all the aircraft involved with accidents.

In [4]:
raw_aircraft = read_df("aircraft.csv", dict(ev_id=str, Aircraft_Key=str, ntsb_no=str))

A list of all the accidents.

In [5]:
raw_events = read_df("events.csv", dict(ev_id=str, ntsb_no=str))

A list of all the flight crew on the aircraft involved in accidents.

In [6]:
raw_flight_crew = read_df("flight_crew.csv", dict(ev_id=str, Aircraft_Key=str, crew_no=str))

Data about the previous experience levels of flight crew involved in accidents.

In [7]:
raw_flight_time = read_df("flight_time.csv", dict(ev_id=str, Aircraft_Key=str, crew_no=str))

Narratives from investigators

In [8]:
raw_narratives = read_df("narratives.csv", str)

Strip

The NTSB frequently leaves dangling spaces at the end of its string columns. Let's strip them all.

In [9]:
strip_df = lambda df: df.applymap(lambda x: x.strip().upper() if type(x) is str else x)
In [10]:
stripped_aircraft = strip_df(raw_aircraft)
In [11]:
stripped_events = strip_df(raw_events)
In [12]:
stripped_flight_crew = strip_df(raw_flight_crew)
In [13]:
stripped_flight_time = strip_df(raw_flight_time)
In [14]:
stripped_narratives = strip_df(raw_narratives)

Trim

Limit each table to the columns we want to keep. Rename the columns to make them more readable.

In [15]:
trim_df = lambda df, cols: df[list(cols)].rename(columns=cols) 
In [16]:
aircraft_columns = dict(
    ev_id="event_id",
    Aircraft_Key="aircraft_id",
    acft_make="ntsb_make",
    acft_model="ntsb_model",
    acft_category="category"
)
In [17]:
trimmed_aircraft = trim_df(stripped_aircraft, aircraft_columns)
In [18]:
events_columns = dict(
    ev_id="event_id",
    ntsb_no="ntsb_number",
    ev_year="year",
    ev_date="date",
    ev_city="city",
    ev_state="state",
    ev_country="country",
    inj_tot_f="total_fatalities"
)
In [19]:
trimmed_events = trim_df(stripped_events, events_columns)
In [20]:
flight_crew_columns = dict(
    ev_id= "event_id",
    Aircraft_Key="aircraft_id",
    crew_no="crew_id",
    crew_age="age",
    crew_category="category"
)
In [21]:
trimmed_flight_crew = trim_df(stripped_flight_crew, flight_crew_columns)
In [22]:
flight_time_columns = dict(
    ev_id="event_id",
    Aircraft_Key="aircraft_id",
    crew_no="crew_id",
    flight_hours="flight_hours",
    flight_type="flight_type",
    flight_craft="flight_craft"
)
In [23]:
trimmed_flight_time = trim_df(stripped_flight_time, flight_time_columns)
In [24]:
narrative_columns = dict(
    ev_id="event_id",
    Aircraft_Key="aircraft_id",
    narr_accp="factual_narrative",
    narr_accf="final_narrative",
    narr_cause="probable_cause_narrative"
)
In [25]:
trimmed_narratives = trim_df(stripped_narratives, narrative_columns)

Filter

Each table needs to be filtered so that they only contain data related to fatal helicopter accidents in the United States.

First we trim down to the helicopters linked to accident events.

In [26]:
helicopters = trimmed_aircraft[trimmed_aircraft.category == 'HELI'].drop("category", axis=1)

Then we trim down the accidents to those involving helicopters

In [27]:
heli_events = trimmed_events[trimmed_events.event_id.isin(helicopters.event_id)]

Those also need to be limited to accidents that involved fatalities

In [28]:
fatal_heli_events = heli_events[heli_events.total_fatalities > 0]

The flight crew data should be limited to pilots. We know the codes to look for thanks to the NTSB's database documentation.

In [29]:
pilots = trimmed_flight_crew[trimmed_flight_crew.category.isin(["PLT", "DSTU", "FLTI", "CPLT", "KPLT"])]

Those pilots then also need to be limited to events involving helicopters

In [30]:
heli_pilots = pilots[pilots.event_id.isin(fatal_heli_events.event_id)]

Only total flight time in all flights.

In [31]:
total_flight_hours = trimmed_flight_time[
    (trimmed_flight_time.flight_type == 'TOTL') &
    (trimmed_flight_time.flight_craft == 'ALL')
]

Merge

The datasets needs to be merged with each other for analysis.

Helicopters should be joined to our fatal accident list.

In [32]:
helicopters_by_accident = pd.merge(
    helicopters,
    fatal_heli_events,
    on="event_id",
    how="inner"
)

Pilots should be joined to their flight times.

In [35]:
merged_pilots = pd.merge(
    heli_pilots,
    total_flight_hours,
    on=["event_id", "aircraft_id", "crew_id"],
    how="left"
)

Annotate

Mark the accidents that happened in the United States.

In [36]:
helicopters_by_accident['in_usa'] = helicopters_by_accident.country == 'USA'

We will add a pilot count to each accident for use in the analysis.

In [37]:
pilot_counts = merged_pilots.groupby(["event_id", "aircraft_id"]).size().reset_index()
In [38]:
pilot_counts.columns = ["event_id", "aircraft_id", "pilot_total"]
In [39]:
helicopters_by_accident_annotated = helicopters_by_accident.merge(
    pilot_counts,
    on=["event_id", "aircraft_id"],
    how="left"
)

Export

Write out the files we'll use in the analysis

In [41]:
write_df = lambda df, name: df.to_csv(os.path.join(output_dir, name), index=False, encoding="utf-8") 
In [42]:
write_df(helicopters_by_accident_annotated, 'helicopters-by-accident.csv')
In [43]:
write_df(fatal_heli_events, 'fatal-accidents.csv')
In [44]:
write_df(merged_pilots, 'pilots.csv')
In [45]:
write_df(trimmed_narratives, 'narratives.csv')