By Ben Welsh
A geospatial analysis of Houston property records conducted for the Nov. 8, 2017 Los Angeles Times story "How Houston’s newest homes survived Hurricane Harvey".
Nearly a quarter of new Houston homes in the past decade were built within a federal flood zone. To look at damage there, the Times chose five neighborhoods built inside a flood zone over the last decade and asked a sample of homeowners in each area how they fared. In case after case, owners of the newer homes rode out the hurricane safely.
The brunt of the damage appears to be borne by older houses — sometimes just doors away from newer, unscathed real estate — that predate federal and local anti-flooding regulations and account for 79% of the nearly 129,000 houses located in the federal flood zones.
To learn more about the survey, the context and our findings, read the story. Below you can find the Python code that analyzed where, and when, homes were built inside federal flood zones.
Our sources — property records from the Harris County Appraisal District and federal flood zone maps — are retrieved by the download notebook.
%run ./01_download.ipynb
Some of the source files must be downloaded and unpacked manually. To recreate the analysis, refer to the documentation in the download notebook and unpack them yourself.
The next notebook converts the source data files into comma-delimited format, standardizes geospatial reference systems, computes additional columns, filters out superfluous records and trims off unneeded metadata.
%run ./02_transform.ipynb
The next notebook merges the buildings data with the parcel maps, minting files ready for analysis.
%run ./03_merge.ipynb
Now we begin the analysis by importing our Python tools.
import os
import geopandas as gpd
import matplotlib.pyplot as plt
%matplotlib inline
Prepare utilities for opening files.
get_output_path = lambda x: os.path.join("./output", x)
get_gdf = lambda x: gpd.read_file(get_output_path(x))
Because the source data files are so large, a single spatial join is difficult to execute. To work around the limitations of our available hardware, the following code splits up the shapefiles into chunks based on the year each building was built.
year_list = gpd.pd.np.arange(1800, 2017)
def split_year(df_name, func):
    """
    Split the provided dataframe into separate files using all the years in our dataset.
    """
    # Open the dataframe
    df = get_gdf(df_name)
    # Loop through all the years we're watching
    for year in year_list:
        # For each year pass the dataframe to the provided function (for filtering or analysis)
        year_df = func(df, year)
        # Write out the results
        if len(year_df) > 0:
            year_name = "{}-{}".format(year, df_name)
            year_path = get_output_path(year_name)
            year_df.to_file(year_path)
    # Pass back out the dataframe since it's already open. We might reuse it.
    return df
The master shapefile is read in and split into many pieces.
buildings = split_year("buildings-and-parcels.shp", lambda x, y: x[x.erected == y])
Each year's file is then spatially joined with the flood zones, and the matches are output to a file.
def read_year(year, name_template="{}-buildings-and-parcels.shp"):
    """
    Returns the split slice of residential parcels for the given year.
    """
    try:
        # Go grab the file for this year
        return get_gdf(name_template.format(year))
    except IOError:
        # If there's no file, just return None
        return None
def sjoin_year(parcels, flood_zones):
    """
    Conducts a spatial join between the provided parcels and flood zones.
    Returns parcels that intersect with the zones.
    """
    if parcels is None:
        return gpd.GeoDataFrame()
    sjoin = gpd.sjoin(parcels, flood_zones, how="inner", op='intersects')
    return parcels[parcels.account.isin(sjoin.account)]
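The final `isin` filter acts as a semi-join: `gpd.sjoin` returns one row per matching zone polygon, so a parcel straddling two zones would otherwise appear twice. A toy pandas illustration of the pattern, with hypothetical account IDs:

```python
import pandas as pd

# Parcels under study (hypothetical account IDs)
parcels = pd.DataFrame({"account": ["A1", "A2", "A3"], "value": [100, 200, 300]})

# A raw spatial join can match the same parcel to several zone polygons,
# yielding duplicate rows ("A1" appears twice here).
sjoin = pd.DataFrame({"account": ["A1", "A1", "A3"], "zone_id": [10, 11, 12]})

# Filtering the original frame by the join's accounts keeps one row per parcel.
matched = parcels[parcels.account.isin(sjoin.account)]
print(matched.account.tolist())  # → ['A1', 'A3']
```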
zones_500 = split_year("five-hundred-year-flood-zones.shp", lambda shp, y: sjoin_year(read_year(y), shp))
zones_100 = split_year("one-hundred-year-flood-zones.shp", lambda shp, y: sjoin_year(read_year(y), shp))
All the matches are read in and combined into a single shapefile with all of the buildings found within a flood zone.
def concat_years(df_name):
    """
    Combine all files split by year with the provided pattern into a single dataframe.
    """
    # Pull all the annual files
    df_list = [
        read_year(year, "{}-%s" % df_name)
        for year in year_list
    ]
    # Lump them together into one big dataframe
    df = gpd.pd.concat([i for i in df_list if isinstance(i, gpd.GeoDataFrame)])
    # Convert the dataframe into a geodataframe
    geometry = df['geometry']
    df = df.drop('geometry', axis=1)
    crs = {'init': 'epsg:4269'}
    gdf = gpd.GeoDataFrame(df, crs=crs, geometry=geometry)
    # Return the result
    return gdf
in_zone_500 = concat_years("five-hundred-year-flood-zones.shp")
in_zone_100 = concat_years("one-hundred-year-flood-zones.shp")
Some buildings partially intersect both a 100-year and a 500-year flood zone. We remove the duplicates using a conservative methodology that places buildings found in both within the less risky 500-year zone.
in_zone_100_only = in_zone_100[~(in_zone_100.account.isin(in_zone_500.account))]
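The `~isin` above is an anti-join: it drops any 100-year account that also appears in the 500-year set. With hypothetical account IDs:

```python
import pandas as pd

# Hypothetical account IDs for buildings matched to each zone type
in_500 = pd.DataFrame({"account": ["A1", "A2"]})
in_100 = pd.DataFrame({"account": ["A2", "A3"]})

# "A2" touches both zones; the anti-join assigns it to the less risky
# 500-year zone by dropping it from the 100-year set.
in_100_only = in_100[~in_100.account.isin(in_500.account)]
print(in_100_only.account.tolist())  # → ['A3']
```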
Create a combined dataframe with both types after the duplicates have been excised.
in_any_zone = gpd.pd.concat([in_zone_500, in_zone_100_only])
Reset indexes to tidy up the data.
in_zone_500 = in_zone_500.reset_index()
in_zone_100 = in_zone_100.reset_index()
in_zone_100_only = in_zone_100_only.reset_index()
in_any_zone = in_any_zone.reset_index()
Write out shapefiles of the results.
in_zone_500.to_file(get_output_path("./buildings-in-five-hundred-year-flood-zones.shp"))
in_zone_100_only.to_file(get_output_path("./buildings-in-one-hundred-year-flood-zones.shp"))
in_any_zone.to_file(get_output_path("./buildings-in-any-flood-zone.shp"))
Filter down to just single-family homes, as instructed by officials at the Harris County Appraisal District.
homes_in_zone_500 = in_zone_500[in_zone_500.use == 'A1']
homes_in_zone_100_only = in_zone_100_only[in_zone_100_only.use == 'A1']
homes_in_any_zone = in_any_zone[in_any_zone.use == 'A1']
Write out shapefiles of those.
homes_in_zone_500.to_file(get_output_path("./homes-in-five-hundred-year-flood-zones.shp"))
homes_in_zone_100_only.to_file(get_output_path("./homes-in-one-hundred-year-flood-zones.shp"))
homes_in_any_zone.to_file(get_output_path("./homes-in-any-flood-zone.shp"))
Filter down again to just homes that fall within the Houston city limits, as instructed by officials at the Harris County Appraisal District.
houston_homes_in_zone_500 = homes_in_zone_500[homes_in_zone_500.houston == 'True']
houston_homes_in_zone_100_only = homes_in_zone_100_only[homes_in_zone_100_only.houston == 'True']
houston_homes_in_any_zone = homes_in_any_zone[homes_in_any_zone.houston == 'True']
Write out shapefiles of those.
houston_homes_in_zone_500.to_file(get_output_path("./houston-homes-in-five-hundred-year-flood-zones.shp"))
houston_homes_in_zone_100_only.to_file(get_output_path("./houston-homes-in-one-hundred-year-flood-zones.shp"))
houston_homes_in_any_zone.to_file(get_output_path("./houston-homes-in-any-flood-zone.shp"))
Clear out split files generated above.
!rm ./output/18*-*
!rm ./output/19*-*
!rm ./output/20*-*
Prepare functions for running the numbers.
def groupby_year(df, value_name):
    """
    Returns a regrouped dataframe using the 'erected' field that contains the year of construction.
    """
    grouped = df.groupby("erected").size().reset_index()
    grouped.columns = ['year', value_name]
    return grouped
def crosstab(universe_df, in_zone_500_df, in_zone_100_df, func=groupby_year, on="year"):
    """
    Returns a dataframe crosstab with counts and percentages of buildings constructed within flood zones.
    """
    # Group all three sets by year
    universe_by_func = func(universe_df, "total")
    in_zone_500_by_func = func(in_zone_500_df, "in_zone_500")
    in_zone_100_by_func = func(in_zone_100_df, "in_zone_100")
    # Create a crosstab
    crosstab = gpd.pd.merge(
        universe_by_func,
        gpd.pd.merge(in_zone_100_by_func, in_zone_500_by_func, how="outer", on=on),
        how="left",
        on=on
    )
    # Clean it up
    crosstab.fillna(0, inplace=True)
    # Add total and percentage columns
    crosstab['in_zone'] = crosstab['in_zone_100'] + crosstab['in_zone_500']
    crosstab['in_zone_percent'] = crosstab['in_zone'] / crosstab['total']
    # Trim it
    trimmed = crosstab[((crosstab[on] > 1500) & (crosstab[on] < 2017))]
    # Pass it back
    return trimmed
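A toy run of the same outer-then-left merge pattern, with made-up counts, shows how years missing from one zone frame are zero-filled:

```python
import pandas as pd

total = pd.DataFrame({"year": [2000, 2001, 2002], "total": [10, 20, 30]})
z100 = pd.DataFrame({"year": [2000, 2002], "in_zone_100": [1, 3]})
z500 = pd.DataFrame({"year": [2001], "in_zone_500": [2]})

# Outer-join the two zone counts, then left-join them onto the universe of years
xtab = pd.merge(total, pd.merge(z100, z500, how="outer", on="year"), how="left", on="year")
xtab = xtab.fillna(0)
xtab["in_zone"] = xtab.in_zone_100 + xtab.in_zone_500
xtab["in_zone_percent"] = xtab.in_zone / xtab.total
print(xtab.in_zone.tolist())  # → [1.0, 2.0, 3.0]
```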
Run the totals for all homes in Harris County.
homes = buildings[buildings.use == 'A1']
homes_by_year = crosstab(homes, homes_in_zone_500, homes_in_zone_100_only)
Run the totals for homes in the Houston city limits.
houston_homes = homes[homes.houston == 'True']
houston_homes_by_year = crosstab(houston_homes, houston_homes_in_zone_500, houston_homes_in_zone_100_only)
def compare(numerator, denominator):
    """
    Prints the numerator, denominator and percentage in a readable format.
    """
    print("{:,} / {:,} ({:.2%})".format(numerator, denominator, numerator / denominator))
Overall
compare(homes_by_year['in_zone'].sum(), homes_by_year['total'].sum())
compare(houston_homes_by_year['in_zone'].sum(), houston_homes_by_year['total'].sum())
264,411.0 / 1,038,244 (25.47%)
128,864.0 / 429,387 (30.01%)
100-year zone
compare(homes_by_year['in_zone_100'].sum(), homes_by_year['total'].sum())
compare(houston_homes_by_year['in_zone_100'].sum(), houston_homes_by_year['total'].sum())
87,978.0 / 1,038,244 (8.47%)
48,569.0 / 429,387 (11.31%)
500-year zone
compare(homes_by_year['in_zone_500'].sum(), homes_by_year['total'].sum())
compare(houston_homes_by_year['in_zone_500'].sum(), houston_homes_by_year['total'].sum())
176,433.0 / 1,038,244 (16.99%)
80,295.0 / 429,387 (18.70%)
Filter down to homes built in the past decade, since 2007.
homes_since_2007 = homes_by_year[homes_by_year.year >= 2007]
houston_homes_since_2007 = houston_homes_by_year[houston_homes_by_year.year >= 2007]
compare(homes_since_2007['in_zone'].sum(), homes_since_2007['total'].sum())
compare(houston_homes_since_2007['in_zone'].sum(), houston_homes_since_2007['total'].sum())
30,687.0 / 145,500 (21.09%)
8,744.0 / 38,862 (22.50%)
compare(homes_since_2007['in_zone_100'].sum(), homes_since_2007['total'].sum())
compare(houston_homes_since_2007['in_zone_100'].sum(), houston_homes_since_2007['total'].sum())
9,198.0 / 145,500 (6.32%)
3,215.0 / 38,862 (8.27%)
compare(homes_since_2007['in_zone_500'].sum(), homes_since_2007['total'].sum())
compare(houston_homes_since_2007['in_zone_500'].sum(), houston_homes_since_2007['total'].sum())
21,489.0 / 145,500 (14.77%)
5,529.0 / 38,862 (14.23%)
Next, filter down to homes built before 1985. That's the year when Houston's local ordinances regulating flood zone elevations went into effect.
homes_before_rules = homes_by_year[homes_by_year.year < 1985]
houston_homes_before_rules = houston_homes_by_year[houston_homes_by_year.year < 1985]
compare(homes_before_rules['in_zone'].sum(), homes_by_year['in_zone'].sum())
compare(houston_homes_before_rules['in_zone'].sum(), houston_homes_by_year['in_zone'].sum())
169,217.0 / 264,411.0 (64.00%)
101,320.0 / 128,864.0 (78.63%)
compare(homes_before_rules['in_zone_100'].sum(), homes_by_year['in_zone_100'].sum())
compare(houston_homes_before_rules['in_zone_100'].sum(), houston_homes_by_year['in_zone_100'].sum())
63,395.0 / 87,978.0 (72.06%)
41,602.0 / 48,569.0 (85.66%)
compare(homes_before_rules['in_zone_500'].sum(), homes_by_year['in_zone_500'].sum())
compare(houston_homes_before_rules['in_zone_500'].sum(), houston_homes_by_year['in_zone_500'].sum())
105,822.0 / 176,433.0 (59.98%)
59,718.0 / 80,295.0 (74.37%)
Filter down the data to the time range we want to chart.
start_year = 1930
Total buildings countywide by year
chart_df = homes_by_year[(homes_by_year.year > start_year)]
fig = plt.figure(figsize=(20, 10))
ax = plt.bar(gpd.pd.np.arange(start_year, 2016), chart_df.total)
ax = plt.bar(gpd.pd.np.arange(start_year, 2016), chart_df.in_zone)
Total buildings in Houston by year
houston_chart_df = houston_homes_by_year[(houston_homes_by_year.year > start_year)]
fig = plt.figure(figsize=(20, 10))
ax = plt.bar(gpd.pd.np.arange(start_year, 2016), houston_chart_df.total)
ax = plt.bar(gpd.pd.np.arange(start_year, 2016), houston_chart_df.in_zone)
Percentage of buildings in flood zones countywide by year
fig = plt.figure(figsize=(20, 10))
ax = plt.bar(gpd.pd.np.arange(start_year, 2016), [1 for i in gpd.pd.np.arange(start_year, 2016)])
ax = plt.bar(gpd.pd.np.arange(start_year, 2016), chart_df.in_zone_percent)
Percentage of Houston buildings in flood zones by year
fig = plt.figure(figsize=(20, 10))
ax = plt.bar(gpd.pd.np.arange(start_year, 2016), [1 for i in gpd.pd.np.arange(start_year, 2016)])
ax = plt.bar(gpd.pd.np.arange(start_year, 2016), houston_chart_df.in_zone_percent)
Ultimately, we smoothed the data before 1975 into five-year averages due to imprecision in the records described by officials at the Harris County Appraisal District.
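The smoothing itself isn't part of this notebook. A minimal sketch of five-year trailing averages with pandas `rolling`, assuming a frame shaped like the exported CSV (`year` plus a count column) — the story's exact smoothing window may differ:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [1970, 1971, 1972, 1973, 1974, 1975, 1976],
    "in_zone": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
})

# Replace pre-1975 counts with a five-year trailing mean; later years unchanged
smoothed = df.in_zone.rolling(5, min_periods=1).mean()
df["in_zone_smoothed"] = df.in_zone.where(df.year >= 1975, smoothed)
print(df.in_zone_smoothed.tolist())  # → [10.0, 15.0, 20.0, 25.0, 30.0, 60.0, 70.0]
```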
homes_by_year.to_csv("./output/homes-by-year.csv")
houston_homes_by_year.to_csv("./output/houston-homes-by-year.csv")