Standardize NTSB

The NTSB has inconsistently logged the make and model of helicopters involved in accidents. This notebook cleans up variations based on Times research.

In [17]:
import os
import pandas as pd

Retrieve the data directory paths stored by earlier notebooks.

In [18]:
%store -r

Read in the NTSB's list of helicopters involved in fatal accidents.

In [19]:
helicopters = pd.read_csv(os.path.join(output_dir, "helicopters-by-accident.csv"))

Read in The Times' crosswalk of standardizations and corrections.

In [20]:
crosswalk = pd.read_csv(os.path.join(input_dir, "crosswalk.csv"))

Join The Times crosswalk to the NTSB database. The standardized make and model values are stored in fields with a "latimes" prefix, which will be appended to the raw data.

In [21]:
merged = helicopters.merge(
    crosswalk,
    on=["ntsb_make", "ntsb_model"],
    how="left"
)
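A quick way to audit a join like this is pandas' `indicator` option, which tags each row with its match status. A self-contained sketch with toy data (the frames below are stand-ins for `helicopters` and `crosswalk`, not the real files):

```python
import pandas as pd

# Toy stand-ins for the NTSB list and the Times crosswalk
helicopters = pd.DataFrame({
    "ntsb_make": ["BELL", "Bell", "ROBINSON"],
    "ntsb_model": ["206B", "206B", "R44"],
})
crosswalk = pd.DataFrame({
    "ntsb_make": ["BELL", "ROBINSON"],
    "ntsb_model": ["206B", "R44"],
    "latimes_make": ["Bell", "Robinson"],
    "latimes_model": ["206B", "R44"],
})

# indicator=True adds a "_merge" column labeling each row
# "both" (matched) or "left_only" (no crosswalk entry yet)
audited = helicopters.merge(
    crosswalk,
    on=["ntsb_make", "ntsb_model"],
    how="left",
    indicator=True,
)
print(audited["_merge"].value_counts())
```

Note that the join is case sensitive, so "Bell" and "BELL" count as different makes here — exactly the kind of variation the crosswalk exists to resolve.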

After they are joined, fill in any records that don't yet have a match.

In [22]:
merged["latimes_make"] = merged["latimes_make"].fillna("")
merged["latimes_model"] = merged["latimes_model"].fillna("")

Recreate the crosswalk with any new makes and models from the raw data now included.

In [23]:
new_crosswalk = merged.groupby([
    "ntsb_make",
    "ntsb_model",
    "latimes_make",
    "latimes_model"
]).size().rename("count").reset_index()
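The `groupby(...).size()` pattern collapses the merged rows down to one row per unique make/model pairing, with a count of how many accident records carried it. A toy illustration of the same chain:

```python
import pandas as pd

# Toy merged frame: two matched rows, one still unstandardized
merged = pd.DataFrame({
    "ntsb_make": ["BELL", "BELL", "ROBINSON"],
    "ntsb_model": ["206B", "206B", "R44"],
    "latimes_make": ["Bell", "Bell", ""],
    "latimes_model": ["206B", "206B", ""],
})

# One row per unique pairing, plus how often each appears
toy_crosswalk = merged.groupby([
    "ntsb_make", "ntsb_model", "latimes_make", "latimes_model"
]).size().rename("count").reset_index()
print(toy_crosswalk)
```

The duplicate BELL 206B rows collapse into a single crosswalk row with a count of 2, while the unmatched Robinson row survives with empty "latimes" fields, ready for manual standardization.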

Audit how many makes and models have been standardized.

In [24]:
uniques = pd.DataFrame([
    ('all unique makes', len(new_crosswalk)),
    ('standardized uniques', len(new_crosswalk[new_crosswalk.latimes_make != ''])),
    ('unstandardized uniques', len(new_crosswalk[new_crosswalk.latimes_make == ''])),
])
In [25]:
counts = pd.DataFrame([
    ('all helicopters', new_crosswalk['count'].sum()),
    ('standardized count', new_crosswalk[new_crosswalk.latimes_make != '']['count'].sum()),
    ('unstandardized count', new_crosswalk[new_crosswalk.latimes_make == '']['count'].sum()),
])
In [26]:
uniques['percent'] = uniques[1] / len(new_crosswalk)
In [27]:
counts['percent'] = counts[1] / new_crosswalk['count'].sum()
In [28]:
uniques
Out[28]:
                        0    1   percent
0        all unique makes  543  1.000000
1    standardized uniques  539  0.992634
2  unstandardized uniques    4  0.007366
In [29]:
counts
Out[29]:
                      0     1   percent
0       all helicopters  1545  1.000000
1    standardized count  1541  0.997411
2  unstandardized count     4  0.002589

Write the crosswalk out for further standardization work.

In [30]:
new_crosswalk.sort_values(["ntsb_make", "ntsb_model"]).to_csv(
    os.path.join(input_dir, "crosswalk.csv"),
    index=False
)

On the NTSB data, combine the two standardized columns into a single field.

In [31]:
merged['latimes_make_and_model'] = merged.latimes_make + " " + merged.latimes_model
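One caveat: for rows without a crosswalk match, both standardized columns are empty strings, so the concatenation above yields a lone space rather than an empty value. A defensive variant, sketched here with toy values, strips it:

```python
import pandas as pd

# Toy frame with one matched row and one unmatched row
merged = pd.DataFrame({
    "latimes_make": ["Bell", ""],
    "latimes_model": ["206B", ""],
})

# str.strip() turns the lone " " on unmatched rows into ""
merged["latimes_make_and_model"] = (
    merged.latimes_make + " " + merged.latimes_model
).str.strip()
print(merged.latimes_make_and_model.tolist())  # ['Bell 206B', '']
```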

Write out the standardized file for analysis.

In [32]:
merged.to_csv(os.path.join(output_dir, "standardized-helicopters-by-accident.csv"), index=False)