Standardize NTSB

The NTSB has inconsistently logged the make and model of helicopters involved in accidents. This notebook cleans up variations based on Times research.

In [17]:
import os
import pandas as pd

Retrieve the data directory paths stored by earlier notebooks.

In [18]:
%store -r

Read in the NTSB's list of helicopters involved in fatal accidents.

In [19]:
helicopters = pd.read_csv(os.path.join(output_dir, "helicopters-by-accident.csv"))

Read in The Times' crosswalk of standardizations and corrections.

In [20]:
crosswalk = pd.read_csv(os.path.join(input_dir, "crosswalk.csv"))

Join The Times crosswalk to the NTSB database. The standardized make and model values are stored in fields with a "latimes" prefix, which will be appended to the raw data.

In [21]:
merged = helicopters.merge(
    crosswalk,
    on=["ntsb_make", "ntsb_model"],
    how="left"
)
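A quick way to audit a join like this is pandas' `indicator` option, which tags each row with its match status. A self-contained sketch with toy data (the frames below are stand-ins for `helicopters` and `crosswalk`, not the real files):

```python
import pandas as pd

# Toy stand-ins for the NTSB list and the Times crosswalk
helicopters = pd.DataFrame({
    "ntsb_make": ["BELL", "Bell", "ROBINSON"],
    "ntsb_model": ["206B", "206B", "R44"],
})
crosswalk = pd.DataFrame({
    "ntsb_make": ["BELL", "ROBINSON"],
    "ntsb_model": ["206B", "R44"],
    "latimes_make": ["Bell", "Robinson"],
    "latimes_model": ["206B", "R44"],
})

# indicator=True adds a "_merge" column labeling each row
# "both" (matched) or "left_only" (no crosswalk entry yet)
audited = helicopters.merge(
    crosswalk,
    on=["ntsb_make", "ntsb_model"],
    how="left",
    indicator=True,
)
print(audited["_merge"].value_counts())
```

Note that the join is case sensitive, so "Bell" and "BELL" count as different makes here — exactly the kind of variation the crosswalk exists to resolve.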

After they are joined, fill in any records that don't yet have a match.

In [22]:
merged["latimes_make"] = merged["latimes_make"].fillna("")
merged["latimes_model"] = merged["latimes_model"].fillna("")

Recreate the crosswalk with any new makes and models from the raw data now included.

In [23]:
new_crosswalk = merged.groupby([
    "ntsb_make",
    "ntsb_model",
    "latimes_make",
    "latimes_model"
]).size().rename("count").reset_index()
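The `groupby(...).size()` pattern collapses the merged rows down to one row per unique make/model pairing, with a count of how many accident records carried it. A toy illustration of the same chain:

```python
import pandas as pd

# Toy merged frame: two matched rows, one still unstandardized
merged = pd.DataFrame({
    "ntsb_make": ["BELL", "BELL", "ROBINSON"],
    "ntsb_model": ["206B", "206B", "R44"],
    "latimes_make": ["Bell", "Bell", ""],
    "latimes_model": ["206B", "206B", ""],
})

# One row per unique pairing, plus how often each appears
toy_crosswalk = merged.groupby([
    "ntsb_make", "ntsb_model", "latimes_make", "latimes_model"
]).size().rename("count").reset_index()
print(toy_crosswalk)
```

The duplicate BELL 206B rows collapse into a single crosswalk row with a count of 2, while the unmatched Robinson row survives with empty "latimes" fields, ready for manual standardization.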

Audit how many makes and models have been standardized.

In [24]:
uniques = pd.DataFrame([
    ('all unique makes', len(new_crosswalk)),
    ('standardized uniques', len(new_crosswalk[new_crosswalk.latimes_make != ''])),
    ('unstandardized uniques', len(new_crosswalk[new_crosswalk.latimes_make == ''])),
])
In [25]:
counts = pd.DataFrame([
    ('all helicopters', new_crosswalk['count'].sum()),
    ('standardized count', new_crosswalk[new_crosswalk.latimes_make != '']['count'].sum()),
    ('unstandardized count', new_crosswalk[new_crosswalk.latimes_make == '']['count'].sum()),
])
In [26]:
uniques['percent'] = uniques[1] / len(new_crosswalk)
In [27]:
counts['percent'] = counts[1] / new_crosswalk['count'].sum()
In [28]:
uniques
Out[28]:
                        0    1   percent
0        all unique makes  543  1.000000
1    standardized uniques  539  0.992634
2  unstandardized uniques    4  0.007366
In [29]:
counts
Out[29]:
                      0     1   percent
0       all helicopters  1545  1.000000
1    standardized count  1541  0.997411
2  unstandardized count     4  0.002589

Write the crosswalk out for further standardization work.

In [30]:
new_crosswalk.sort_values(["ntsb_make", "ntsb_model"]).to_csv(
    os.path.join(input_dir, "crosswalk.csv"),
    index=False
)

On the NTSB data, combine the two standardized columns into a single field.

In [31]:
merged['latimes_make_and_model'] = merged.latimes_make + " " + merged.latimes_model
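One caveat: for rows without a crosswalk match, both standardized columns are empty strings, so the concatenation above yields a lone space rather than an empty value. A defensive variant, sketched here with toy values, strips it:

```python
import pandas as pd

# Toy frame with one matched row and one unmatched row
merged = pd.DataFrame({
    "latimes_make": ["Bell", ""],
    "latimes_model": ["206B", ""],
})

# str.strip() turns the lone " " on unmatched rows into ""
merged["latimes_make_and_model"] = (
    merged.latimes_make + " " + merged.latimes_model
).str.strip()
print(merged.latimes_make_and_model.tolist())  # ['Bell 206B', '']
```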

Write out the standardized file for analysis.

In [32]:
merged.to_csv(os.path.join(output_dir, "standardized-helicopters-by-accident.csv"), index=False)