The NTSB has inconsistently logged the make and model of helicopters involved in accidents. This notebook cleans up variations based on Times research.
import os
import pandas as pd
Retrieve the stored data directory paths.
%store -r
Read in the NTSB's list of helicopters involved in fatal accidents.
helicopters = pd.read_csv(os.path.join(output_dir, "helicopters-by-accident.csv"))
Read in The Times' crosswalk of standardizations and corrections.
crosswalk = pd.read_csv(os.path.join(input_dir, "crosswalk.csv"))
Join The Times crosswalk to the NTSB database. The standardized make and model values are stored in fields with a "latimes" prefix, which will be appended to the raw data.
merged = helicopters.merge(
crosswalk,
on=["ntsb_make", "ntsb_model"],
how="left"
)
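As a sanity check on a join like the one above, pandas' `indicator=True` option flags rows that found no match in the crosswalk. A minimal sketch with made-up makes and models (not the real NTSB data):

```python
import pandas as pd

# Toy stand-ins for the NTSB records and the Times crosswalk.
# Column names match the notebook; the values are invented.
helicopters = pd.DataFrame({
    "ntsb_make": ["BELL", "Bell", "ROBINSON"],
    "ntsb_model": ["206B", "206B", "R44"],
})
crosswalk = pd.DataFrame({
    "ntsb_make": ["BELL", "Bell"],
    "ntsb_model": ["206B", "206B"],
    "latimes_make": ["Bell", "Bell"],
    "latimes_model": ["206B", "206B"],
})

# indicator=True adds a _merge column; "left_only" marks
# records the crosswalk has not yet standardized.
merged = helicopters.merge(
    crosswalk,
    on=["ntsb_make", "ntsb_model"],
    how="left",
    indicator=True,
)
print(merged[merged._merge == "left_only"])
```

Here the ROBINSON/R44 row surfaces as unmatched, which is the same condition the empty-string audit below counts.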
After they are joined, fill in any records that don't yet have a match.
merged["latimes_make"] = merged.latimes_make.fillna("")
merged["latimes_model"] = merged.latimes_model.fillna("")
Recreate the crosswalk with any new makes and models from the raw data now included.
new_crosswalk = merged.groupby([
"ntsb_make",
"ntsb_model",
"latimes_make",
"latimes_model"
]).size().rename("count").reset_index()
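On toy data (invented values, same column names as above), the groupby/size/reset pattern collapses the merged records back down to one row per raw-to-standardized pairing, with a count of how often each appears:

```python
import pandas as pd

# Two raw spellings of the same helicopter, both mapped to "Bell 206B".
merged = pd.DataFrame({
    "ntsb_make": ["BELL", "BELL", "Bell"],
    "ntsb_model": ["206B", "206B", "206B"],
    "latimes_make": ["Bell", "Bell", "Bell"],
    "latimes_model": ["206B", "206B", "206B"],
})

# size() counts rows per group; reset_index() flattens the
# grouped keys back into ordinary columns.
new_crosswalk = merged.groupby([
    "ntsb_make",
    "ntsb_model",
    "latimes_make",
    "latimes_model"
]).size().rename("count").reset_index()
print(new_crosswalk)
```

The result has one row per unique raw spelling, so any make/model pair new to the raw data automatically gains a row in the regenerated crosswalk.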
Audit how many makes and models have been standardized.
uniques = pd.DataFrame([
('all unique makes', len(new_crosswalk)),
('standardized uniques', len(new_crosswalk[new_crosswalk.latimes_make != ''])),
('unstandardized uniques', len(new_crosswalk[new_crosswalk.latimes_make == ''])),
])
counts = pd.DataFrame([
('all helicopters', new_crosswalk['count'].sum()),
('standardized count', new_crosswalk[new_crosswalk.latimes_make != '']['count'].sum()),
('unstandardized count', new_crosswalk[new_crosswalk.latimes_make == '']['count'].sum()),
])
uniques['percent'] = uniques[1] / len(new_crosswalk)
counts['percent'] = counts[1] / new_crosswalk['count'].sum()
uniques
| | 0 | 1 | percent |
|---|---|---|---|
| 0 | all unique makes | 543 | 1.000000 |
| 1 | standardized uniques | 539 | 0.992634 |
| 2 | unstandardized uniques | 4 | 0.007366 |
counts
| | 0 | 1 | percent |
|---|---|---|---|
| 0 | all helicopters | 1545 | 1.000000 |
| 1 | standardized count | 1541 | 0.997411 |
| 2 | unstandardized count | 4 | 0.002589 |
Write the crosswalk out for further standardization work.
new_crosswalk.sort_values(["ntsb_make", "ntsb_model"]).to_csv(
os.path.join(input_dir, "crosswalk.csv"),
index=False
)
On the NTSB data, create a single field that combines the two standardized columns.
merged['latimes_make_and_model'] = merged.latimes_make + " " + merged.latimes_model
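One caveat with the concatenation above: records whose standardized model (or make) is still an empty string pick up a stray space. A hedged sketch, on made-up values, of trimming it with `str.strip`:

```python
import pandas as pd

# Invented values: the last two rows simulate records with an
# empty model or an entirely unstandardized make/model.
merged = pd.DataFrame({
    "latimes_make": ["Bell", "Robinson", ""],
    "latimes_model": ["206B", "", ""],
})

# Concatenate, then strip the leading/trailing whitespace
# left behind by empty components.
merged["latimes_make_and_model"] = (
    merged.latimes_make + " " + merged.latimes_model
).str.strip()
print(merged.latimes_make_and_model.tolist())
```

Without the `str.strip()`, "Robinson" would come out as "Robinson " and the unmatched row as a lone space, which can trip up later grouping on the combined field.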
Write out the standardized file for analysis.
merged.to_csv(os.path.join(output_dir, "standardized-helicopters-by-accident.csv"), index=False)