Background¶

This script combines building geometries and energy consumption information from publicly available New York City data, and then exports data to GeoJSON and TopoJSON formats for interactive visualization with d3.

Most of the heavy lifting is done by Pandas. GeoPandas is used to help manage the geometry data. You will need both of these installed in order to run this code. If you wish to perform the GeoJSON to TopoJSON included at the very end of this script then you will also need to have TopoJSON installed (notes are included for this below).

In [1]:

import os, re
from pandas import *
from geopandas import *
pylab.rcParams['savefig.dpi'] = 100

Data Collection and Tidying¶

Building Geometry¶

Data Source¶

The data for building footprints, heights and elevations was downloaded from NYC Open Data:

https://data.cityofnewyork.us/Housing-Development/Building-Footprints/tb92-6tj8

https://data.cityofnewyork.us/download/tb92-6tj8/application/zip

The zip file contains a metadata file with some explanation of the data. The data is contained in a Shapefile.

Cleaning and Formatting¶

Note that the building_1214_metadata.htm file that makes part of the downloaded zip describes a different map projection specification (in its 'well-known-text' field) than what is found in the building_1214.prj file. The metadata looks to be correct and the .prj file incorrect - the geometry data in the shape file is in feet rather than in longitude-latitude. So the 'well-known-text' field from the metadata file must be copied into the .prj file before reading the shape file with GeoPandas (else all variety of headaches arise when trying to translate and/or export to GeoJSON, as I discovered the hard way).

The shape file is 216 MB, so be prepared to wait for the following command - it takes about 4 minutes on my mac air.

In [2]:

shpfile = 'building_footprints_shape_webmercator_12-14/building_1214.shp'
bldgs = GeoDataFrame.from_file(shpfile)

In [3]:

print 'number of rows:\t' + str(len(bldgs))
print 'columns:\t' + str(bldgs.columns)
print 'crs:\t\t' + str(bldgs.crs) # projection info for the geometry data

number of rows:	1082381
columns:	Index([u'BBL', u'BIN', u'BUILDING_T', u'DOITT_ID', u'FEATURE_CO', u'GROUND_ELE', u'HEIGHT_ROO', u'LAST_MODIF', u'NAME', u'OBJECTID', u'SUB_FEATUR', u'Shape_Area', u'Shape_Leng', u'geometry'], dtype='object')
crs:		{'init': u'epsg:2263'}

There are 1,082,381 buildings in this dataset. Note that the geometry values are in feet, using the EPSG 2263 projection (as specified in the source metadata file).

For clarity, we revise the table to only include the columns we are interested in. Note that 'BBL' stands for 'Borough - Block - Lot number'. This column is shared with the energy data described below, and will be used as a key for merging the tables, so we make it an index here. For convenience we add a new column 'boro' (keeping with traditional NY spelling) which contains the borough as a single-character string: '1'=Manhattan, '2'=Bronx, etc.

In [4]:

def boronum(bbl):
    return int(bbl[0])
boro = bldgs['BBL'].apply(boronum)
bldgs['boro'] = boro
bldgs = bldgs[['BBL','boro','GROUND_ELE','HEIGHT_ROO','Shape_Area','Shape_Leng','geometry']]
bldgs.set_index('BBL', inplace=True)
bldgs.sort(inplace=True)
bldgs.head()

Out[4]:

	boro	GROUND_ELE	HEIGHT_ROO	Shape_Area	Shape_Leng	geometry
BBL
1000010010	1	9	10.29	277.908100	68.170905	POLYGON ((-8239895.129312261 4966979.423426831...
1000010010	1	10	21.52	3198.675255	365.577623	POLYGON ((-8240318.942692845 4966178.186055238...
1000010010	1	25	29.67	3863.966910	256.501983	POLYGON ((-8239093.847102767 4966770.272564691...
1000010010	1	7	26.66	3138.298489	240.395131	POLYGON ((-8239010.691476121 4966850.24966739,...
1000010010	1	11	28.60	3509.856484	301.591446	POLYGON ((-8239472.205369915 4966449.2249446, ...

Although Cartesian geometry in feet is useful for many purposes, it is not great for browser-based mapping with standard js libraries which expect geometries in longitude-latitude. We thus add a new column with the geometry projected to EPSG 4326, which is the standard geo description used in browser-based mapping.

In [5]:

bldgs['geometry_latlon'] = bldgs['geometry'].to_crs(epsg=4326)
bldgs.head()

Out[5]:

	boro	GROUND_ELE	HEIGHT_ROO	Shape_Area	Shape_Leng	geometry	geometry_latlon
BBL
1000010010	1	9	10.29	277.908100	68.170905	POLYGON ((-8239895.129312261 4966979.423426831...	POLYGON ((-112.6757750753299 48.09341020820187...
1000010010	1	10	21.52	3198.675255	365.577623	POLYGON ((-8240318.942692845 4966178.186055238...	POLYGON ((-112.6759406036311 48.09094854809784...
1000010010	1	25	29.67	3863.966910	256.501983	POLYGON ((-8239093.847102767 4966770.272564691...	POLYGON ((-112.6724726056846 48.09382693018913...
1000010010	1	7	26.66	3138.298489	240.395131	POLYGON ((-8239010.691476121 4966850.24966739,...	POLYGON ((-112.6723061851977 48.09412009655221...
1000010010	1	11	28.60	3509.856484	301.591446	POLYGON ((-8239472.205369915 4966449.2249446, ...	POLYGON ((-112.6733040928157 48.09259833372172...

Note the problem here though. These first few buildings in the dataset are located on Governor's Island (just south of the island of Manhattan), and we know that the longitude and latitude of the center of the island is 40.6914° N, 74.0161° W (ref: google search), but our conversion notes them as approximately 48.1° N, 112.7° W. Further investigation of the graphs (below) suggests that the result is also rotated. Something is incorrect about this conversion - if I figure out what, I will post a revision to this document. But for the immediate purposes of data visualization we can use the conversion as it stands, since the resulting maps look to be generally correct, just translated and rotated (or possibly scaled improperly).

In [6]:

# notes for further investigation of the conversion issue: 
#  GeoPandas uses pyproj; the code below provides an initial suggestion that something might be amiss with the units handling
#
#import pyproj
#x1 = -8239895.129312261
#y1 = 4966979.423426831
#wgs84 = pyproj.Proj("+init=EPSG:4326") # LatLon with WGS84 datum used by GPS units and Google Earth
#epsg2263 = pyproj.Proj("+init=EPSG:2263",preserve_units=True) # local NYC coordinate system
#print pyproj.transform(epsg2263, wgs84, x1, y1)
#conv = 0.3048
#print pyproj.transform(epsg2263, wgs84, x1*conv, y1*conv)
#
# outputs:
# (-112.67577507532987, 48.093410208201874)
# (-87.23011424552195, 43.594327921961934)

Quick Verification¶

A couple of quick graphs to ensure that the data format is roughly as expected.

In [7]:

bldgs.head(500).plot()

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x201891f90>

In [8]:

bldgs[bldgs.boro == 1].plot()

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x202ba9250>

Energy Consumption Data¶

Data Source¶

Energy consumption data is available for a subset of the buildings via NYC Open Data, and was downloaded in csv form:

https://data.cityofnewyork.us/Environment/Energy-and-Water-Data-Disclosure-for-Local-Law-84-/5gde-fmj3

This dataset contains annual energy and water consumption data for buildings in New York City that have greater than 50,000 square feet of floor area, as per the disclosure specifications in NYC Local Law 84.

Cleaning and Formatting¶

The csv file presents one major challenge before it can be read by Pandas. The last data column in the csv occasionally has newline characters within it - in these cases, the data entry is held within double quotes. In this cleaning script, these intra-entry newline characters are replaced by single space (' ') characters. The script also removes some unnecessary white space from the csv. A new 'cleaned' csv file is produced.

In [9]:

filename = 'Energy_and_Water_Data_Disclosure_for_Local_Law_84__2012_.csv'
cleanedcsv = ''
multilinekeeper = ''
with open(filename,'rU') as f:
    for line in f:
        line = re.sub(' +',' ',line)
        line = re.sub(', ',',',line)
        if multilinekeeper == '':
            if len(re.findall('"',line)) == 1:
                multilinekeeper = line.strip()
            else:
                cleanedcsv += line
        else:
            if '"' in line:
                cleanedcsv += multilinekeeper + line
                multilinekeeper = ''
            else:
                multilinekeeper += ' ' + line.strip()
                
with open('cleaned_' + filename,'w') as f:
    f.write(cleanedcsv)

The cleaned csv can now be read into a dataframe by Pandas.

In [10]:

energy = read_csv('cleaned_' + filename)
print 'number of rows:\t' + str(len(energy))
print 'columns:'
print energy.columns

number of rows:	14112
columns:
Index([u'BBL', u'Street Number', u'Street Name', u'Borough', u'Zip', u'Benchmarking Submission', u'Entry Number', u'Site EUI(kBtu/ft2)', u'Weather Normalized Source EUI(kBtu/ft2)', u'Indoor Water Intensity (All Water Sources)(gal/ft2)', u'Reported Water Method', u'ENERGY STAR Score', u'Total GHG Emissions(MtCO2e)', u'Property Floor Area (Buildngs and Parking)(ft2)', u'Primary Property Type - Self Selected', u'Number of Buildings', u'Reported BINs'], dtype='object')

Note that the number of entries in this table (14,112) is much smaller than the number of entries in the buildings table (1,082,381), since it only contains entries for buildings with more than 50,000 ft2 of floor area.

The BBL column was read in as integers, but want it to be strings in order to to compare with the bldgs BBL key. For clarity, we also take only the subset of data columns we are interested in, and we set the 'BBL' column to be the index.

In [11]:

energy['BBL'] = energy['BBL'].astype(str)
energy = energy.ix[:,[0,1,2,7,8,11,12,13,14,15]]
energy.set_index('BBL', inplace=True)
energy.sort(inplace=True)
energy.head()

Out[11]:

	Street Number	Street Name	Site EUI(kBtu/ft2)	Weather Normalized Source EUI(kBtu/ft2)	ENERGY STAR Score	Total GHG Emissions(MtCO2e)	Property Floor Area (Buildngs and Parking)(ft2)	Primary Property Type - Self Selected	Number of Buildings
BBL
1000010010	1	GOVERNORS ISLAND	NaN	NaN	NaN	NaN	NaN	NaN	1
1000020002	NaN	MARGINAL STREET	NaN	NaN	NaN	NaN	NaN	NaN	0
1000047501	1	WATER STREET	102.0	287.9	75	25932.68	2428325	Office	1
1000057501	125	BROAD STREET	119.6	261.5	70	11637.42	1338000	Office	1
1000087501	39	WHITEHALL STREET	59.0	163.2	NaN	848.55	169055	Multifamily Housing	1

Quick Verification¶

In [12]:

energy.plot(x=2,y=5,kind='scatter')

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x28be14610>

Merging Data Tables¶

An outer merge brings the two data tables together. For convenience, we add two new columns to denote which original tables had which entries (which will let us subset data when desired, akin to taking an inner merge but without duplication). Note that the merge results in a Pandas DataFrame object, rather than a GeoPandas GeoDataFrame object.

In [13]:

bldgsenergy = merge(bldgs, energy, left_index=True, right_index=True, how='outer')

In [14]:

print 'number of rows:\t' + str(len(bldgsenergy))
print 'columns:'
print bldgsenergy.columns
bldgsenergy.ix[:,['HEIGHT_ROO','geometry','geometry_latlon','Site EUI(kBtu/ft2)','Total GHG Emissions(MtCO2e)',
                  'Number of Buildings']].head()

number of rows:	1100100
columns:
Index([u'boro', u'GROUND_ELE', u'HEIGHT_ROO', u'Shape_Area', u'Shape_Leng', u'geometry', u'geometry_latlon', u'Street Number', u'Street Name', u'Site EUI(kBtu/ft2)', u'Weather Normalized Source EUI(kBtu/ft2)', u'ENERGY STAR Score', u'Total GHG Emissions(MtCO2e)', u'Property Floor Area (Buildngs and Parking)(ft2)', u'Primary Property Type - Self Selected', u'Number of Buildings'], dtype='object')

Out[14]:

	HEIGHT_ROO	geometry	geometry_latlon	Site EUI(kBtu/ft2)	Total GHG Emissions(MtCO2e)	Number of Buildings
BBL
1000010010	10.29	POLYGON ((-8239895.129312261 4966979.423426831...	POLYGON ((-112.6757750753299 48.09341020820187...	NaN	NaN	1
1000010010	21.52	POLYGON ((-8240318.942692845 4966178.186055238...	POLYGON ((-112.6759406036311 48.09094854809784...	NaN	NaN	1
1000010010	29.67	POLYGON ((-8239093.847102767 4966770.272564691...	POLYGON ((-112.6724726056846 48.09382693018913...	NaN	NaN	1
1000010010	26.66	POLYGON ((-8239010.691476121 4966850.24966739,...	POLYGON ((-112.6723061851977 48.09412009655221...	NaN	NaN	1
1000010010	28.60	POLYGON ((-8239472.205369915 4966449.2249446, ...	POLYGON ((-112.6733040928157 48.09259833372172...	NaN	NaN	1

In [15]:

def inbldgs(i):
    return i in bldgs.index
def inenergy(i):
    return i in energy.index
bldgsenergy['geometry_avail'] = Series(bldgsenergy.index,index=bldgsenergy.index).apply(inbldgs)
bldgsenergy['energy_avail'] = Series(bldgsenergy.index,index=bldgsenergy.index).apply(inenergy)
bldgsenergy.ix[:,['geometry_avail','energy_avail','HEIGHT_ROO','geometry','geometry_latlon','Site EUI(kBtu/ft2)',
                  'Total GHG Emissions(MtCO2e)','Number of Buildings']].head()

Out[15]:

	geometry_avail	energy_avail	HEIGHT_ROO	geometry	geometry_latlon	Site EUI(kBtu/ft2)	Total GHG Emissions(MtCO2e)	Number of Buildings
BBL
1000010010	True	True	10.29	POLYGON ((-8239895.129312261 4966979.423426831...	POLYGON ((-112.6757750753299 48.09341020820187...	NaN	NaN	1
1000010010	True	True	21.52	POLYGON ((-8240318.942692845 4966178.186055238...	POLYGON ((-112.6759406036311 48.09094854809784...	NaN	NaN	1
1000010010	True	True	29.67	POLYGON ((-8239093.847102767 4966770.272564691...	POLYGON ((-112.6724726056846 48.09382693018913...	NaN	NaN	1
1000010010	True	True	26.66	POLYGON ((-8239010.691476121 4966850.24966739,...	POLYGON ((-112.6723061851977 48.09412009655221...	NaN	NaN	1
1000010010	True	True	28.60	POLYGON ((-8239472.205369915 4966449.2249446, ...	POLYGON ((-112.6733040928157 48.09259833372172...	NaN	NaN	1

In [16]:

print 'number of rows in bldgs:\t\t\t\t\t' + str(len(bldgs))
print 'number of rows in energy:\t\t\t\t\t' + str(len(energy))
print 'number of rows in bldgsenergy:\t\t\t\t\t' + str(len(bldgsenergy))
print 'number of rows in bldgsenergy with geometry:\t\t\t' + str(len(bldgsenergy[bldgsenergy['geometry_avail']]))
print 'number of rows in bldgsenergy with energy:\t\t\t' + str(len(bldgsenergy[bldgsenergy['energy_avail']]))
print 'number of rows in bldgsenergy with both gometry and energy:\t' + \
                            str(len(bldgsenergy[bldgsenergy['energy_avail'] & bldgsenergy['geometry_avail']]))

number of rows in bldgs:					1082381
number of rows in energy:					14112
number of rows in bldgsenergy:					1100100
number of rows in bldgsenergy with geometry:			1099761
number of rows in bldgsenergy with energy:			45620
number of rows in bldgsenergy with both gometry and energy:	45281

Summary and Notes¶

We now have a tidy data table in the variable 'bldgsenergy'. A note for further investigation later, if this data is to be used for further analysis: there look to be a lot of buildings that share a BBL identifier, which will likely cause some hassles.

GeoJSON Export and TopoJSON Conversion¶

For the d3 visualization, we will use just a small section of the city, and will only consider those buildings for which we have both geometry data and energy data.

In [17]:

bldgsenergygeo = GeoDataFrame(bldgsenergy[bldgsenergy['geometry_avail'] & bldgsenergy['energy_avail']])

In [18]:

bldgsenergygeo.head(500).plot()

Out[18]:

<matplotlib.axes._subplots.AxesSubplot at 0x28c0c5750>

For the GeoJSON export, we will use the latitude and longitude as the geometry.

In [19]:

bldgsenergygeo.rename(columns={'geometry':'geometryfeet'}, inplace=True)
bldgsenergygeo.rename(columns={'geometry_latlon':'geometry'}, inplace=True)

In [20]:

bldgsenergygeo.head(500).plot()

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x27e55a510>

In the interactive graph, we want to have the address variables reformatted to a single column of strings.

In [21]:

bldgsenergygeo['Address'] = bldgsenergygeo['Street Number'] + ' ' + bldgsenergygeo['Street Name']

We want to keep the file size to a minimum in order to keep the graph interaction smooth, so we will only include the columns we will use.

In [22]:

bldgsenergygeo.columns

Out[22]:

Index([u'boro', u'GROUND_ELE', u'HEIGHT_ROO', u'Shape_Area', u'Shape_Leng', u'geometryfeet', u'geometry', u'Street Number', u'Street Name', u'Site EUI(kBtu/ft2)', u'Weather Normalized Source EUI(kBtu/ft2)', u'ENERGY STAR Score', u'Total GHG Emissions(MtCO2e)', u'Property Floor Area (Buildngs and Parking)(ft2)', u'Primary Property Type - Self Selected', u'Number of Buildings', u'geometry_avail', u'energy_avail', u'Address'], dtype='object')

In [23]:

bldgsenergygeo = bldgsenergygeo.ix[:,[0,2,6,9,10,11,12,13,14,18]]
bldgsenergygeo.head()

Out[23]:

	boro	HEIGHT_ROO	geometry	Site EUI(kBtu/ft2)	Weather Normalized Source EUI(kBtu/ft2)	ENERGY STAR Score	Total GHG Emissions(MtCO2e)	Property Floor Area (Buildngs and Parking)(ft2)	Primary Property Type - Self Selected	Address
BBL
1000010010	1	10.29	POLYGON ((-112.6757750753299 48.09341020820187...	NaN	NaN	NaN	NaN	NaN	NaN	1 GOVERNORS ISLAND
1000010010	1	21.52	POLYGON ((-112.6759406036311 48.09094854809784...	NaN	NaN	NaN	NaN	NaN	NaN	1 GOVERNORS ISLAND
1000010010	1	29.67	POLYGON ((-112.6724726056846 48.09382693018913...	NaN	NaN	NaN	NaN	NaN	NaN	1 GOVERNORS ISLAND
1000010010	1	26.66	POLYGON ((-112.6723061851977 48.09412009655221...	NaN	NaN	NaN	NaN	NaN	NaN	1 GOVERNORS ISLAND
1000010010	1	28.60	POLYGON ((-112.6733040928157 48.09259833372172...	NaN	NaN	NaN	NaN	NaN	NaN	1 GOVERNORS ISLAND

For the graph, we are only interested in a region roughly correlated with Midtown in Manhattan - we will select only those rows for which boro = 1 and they have points between 48.117 and 48.129 degrees latitude (noting that these are not the proper latitude numbers, given our earlier problem).

In [24]:

bldgsenergygeomidtown = bldgsenergygeo[(bldgsenergygeo.boro == 1) & 
                                       (bldgsenergygeo.geometry.bounds['miny'] < 48.129) & 
                                       (bldgsenergygeo.geometry.bounds['maxy'] > 48.117) ]
print len(bldgsenergygeomidtown)
bldgsenergygeomidtown.head()

Out[24]:

	boro	HEIGHT_ROO	geometry	Site EUI(kBtu/ft2)	Weather Normalized Source EUI(kBtu/ft2)	ENERGY STAR Score	Total GHG Emissions(MtCO2e)	Property Floor Area (Buildngs and Parking)(ft2)	Primary Property Type - Self Selected	Address
BBL
1006650010	1	37.756688	POLYGON ((-112.6865457039683 48.11973501029388...	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1006800001	1	12.110000	POLYGON ((-112.6858515626195 48.1203140388577,...	NaN	NaN	NaN	NaN	NaN	NaN	360 WEST STREET
1006800001	1	32.826755	POLYGON ((-112.6848804688391 48.12020064719905...	NaN	NaN	NaN	NaN	NaN	NaN	360 WEST STREET
1006800001	1	12.467149	POLYGON ((-112.6853330443247 48.12031129386528...	NaN	NaN	NaN	NaN	NaN	NaN	360 WEST STREET
1006800001	1	11.625806	POLYGON ((-112.6852809913021 48.12031449853709...	NaN	NaN	NaN	NaN	NaN	NaN	360 WEST STREET

In [25]:

with open('bldgsenergygeo.json','w') as f:
    f.write(bldgsenergygeomidtown.to_json())

This writes a GeoJSON file for Manhattan to 'bldgsenergygeo.json' that can be used for browser-based mapping. For the d3.js interactive graph, the TopoJSON format is used because it is a smaller file format and thus allows for faster uploading and interaction. Converting between the GeoJSON file and the TopoJSON file is carried out using this js library, which requires node.js and is installed via 'sudo npm install topojson'. The conversion is carried out with the following command line call. The resulting file 'bldgsenergytopo.json' is 1.2 MB, compared with the 2.9 MB GeoJSON file.

In [26]:

os.system('topojson -p -o bldgsenergytopo.json bldgsenergygeo.json')

Out[26]:

The html and js code for the graph is included in this github repo, and the graph can be viewed in this blog post.