Manhattan is a pretty interesting place, and like all major cities it has an abundance of office buildings. Sometimes you might find yourself in need of your own office, at which point you will either start one in your home or garage, or look for a rental.
However, many office rental listings don't include the price, but many years of research and all of human history suggests that price is one of the most important factors in any purchasing decision. Fortunately, there many listings that do include the price, so we can try to construct a model that estimates the missing data.
The data was obtained from a major commercial real estate site in December 2017, and will not be made available here.
Contents:
Of the 1369 rental rates in $/sf/yr, 830 are "Not Disclosed", for a split of 39.4% disclosed, 60.6% not disclosed. This is not great but some of the not disclosed listings are actually in the same building as disclosed listings, so let's claim to have an even 40% of the price data.
The target variable is rate, while the features are:
This is not a huge number of features
from geopy.geocoders import GoogleV3
import pandas as pd
import numpy as np
import sys
import re
import json
from time import sleep
import unicodedata
geolocator = GoogleV3()
ny_office_lease = pd.read_csv('new-york-ny-lease-office_2017-12-23.csv', index_col=0)
addr_list=np.ndarray.tolist(ny_office_lease.name.unique())
len(ny_office_lease)
1369
ny_office_lease['rate'].value_counts().head(3)
Not Disclosed 830 $65.00 /SF/Yr 30 $55.00 /SF/Yr 24 Name: rate, dtype: int64
Only 450 buildings, an average of over 3 listings per building.
len (addr_list)
450
len(np.ndarray.tolist(ny_office_lease[ny_office_lease['rate']!='Not Disclosed'].name.unique()))
225
Again, the data is not ideal, but it is fairly complete with very few missing values in the features. We can certainly obtain more, by using multiple sources and websites and time periods, and compiling them all together. But let's see how information we can extract from these observations with these features.
We have the addresses of each property, which means we can obtain the geographic coordinates. With those, we can label each building with a location classification. It is reasonable to assume that location has a major effect on price.
For the other variables, it is mainly a matter of extracting useful numeric information from strings, and filling in a few missing walk and transit scores that are available from walkscore.com, but for some reason not included in the listings.
The number and street name of an address by themselves generally do not mean much (limited exceptions may include Wall Street, and other famous streets), but the address gives us information about the geographic position of a property through its coordinates. By extracting the coordinates, we can classify by geography via clustering methods such as K Means.
# Getting coordinates for each address (A list of uniques)
# The GoogleV3 engine seems to time out much more than Nominatim but it's much better at handling ambiguity (eg. Seventh = 7th)
# Exception wrapping is used to retry the query in case of failure, and there is constant saving to make sure no
# progress is lost.
# The functions coords_wrapper and collect_coords should be combined. There is really no reason for them to be separate
# and the constant reloading can be avoided, though for smallish files it's not a big deal.
# NOTE: some listings are actually in multiple buildings with an address range such as 191-199. The Google engine returns
# a single set of geographic coordinates for the first or last building in that range. This is fine, since they should be
# within the same cluster anyway.
def get_coord(address):
print ('Getting' , address)
geoloc = geolocator.geocode(address)
sleep(0.5)
print ('Got ', geoloc, '\n')
return (geoloc.latitude, geoloc.longitude)
def coords_wrapper(addr_list):
# Start at the beginning of addr_list
break_point = 0
# Open the file if it exists, otherwise create a new empty file
try:
with open('nyc_lease_coordinates.json', 'r') as infile:
uniques = json.load(infile)
except:
with open('nyc_lease_coordinates.json', 'w') as outfile:
uniques={}
json.dump(uniques,outfile)
# do while there are still addresses to process
while(break_point < len(addr_list)):
# run collect_coords starting with the current breakpoint, and gets the
# address that it's stuck on
addr = addr_list[break_point]
try:
try:
uniques[addr] = get_coord(addr + ' Manhattan')
# If it works, get to the next one. Once break_point gets to the length of the list, the conditions of
# the loop will no longer be satisfied, and the loop ends.
break_point +=1
sleep(0.5)
except:
# If it doesn't work, don't increase the break_point and try again
print ('Could not get ', addr)
sleep(0.5)
except:
with open('nyc_lease_coordinates.json','w') as outfile:
json.dump(uniques,outfile)
sleep(0.5)
with open('nyc_lease_coordinates.json','w') as outfile:
json.dump(uniques,outfile)
ny_office_lease = pd.read_csv('new-york-ny-lease-office_2017-12-23.csv', index_col=0)
addr_list=np.ndarray.tolist(ny_office_lease.name.unique())
#coords_wrapper(addr_list)
with open('nyc_lease_coordinates.json', 'r') as infile:
uniques = json.load(infile)
ny_office_lease['coords'] = ny_office_lease['name'].map(uniques)
coords=pd.DataFrame(addr_list)[0].map(uniques)
coords = coords.values.tolist()
lats = [i[0] for i in coords]
longs = [i[1] for i in coords]
With this list of unique coordinates, we can use K Means to cluster them around a yet to be determined number of centroids.
from scipy import linalg
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.spatial import ConvexHull
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Let's do an elbow analysis to see the rate of decline of the sum of squared errors from a centroid as the number of centroids increases.
SSE = []
for k in range(1,25):
km = KMeans(n_clusters=k)
km.fit(coords)
labels = km.labels_
centroids = km.cluster_centers_
# Get the SSE
SSE.append(sum(np.min(cdist(coords, centroids, 'euclidean'),axis=1))/len(coords))
plt.plot(range(1,25),SSE, 'bo--')
plt.xlabel('k clusters')
plt.ylabel('SSE')
plt.title('SSE vs k')
Text(0.5,1,'SSE vs k')
There is no standout elbow, but let's try 7, as the returns diminish noticeably after that (well, sort of noticeably).
k = 7
km = KMeans(n_clusters=k)
km.fit(coords)
labels = km.labels_
centroids = km.cluster_centers_
colors = ['r', 'g', 'b', 'y', 'c', 'm','C0','k']
plt.figure(figsize=(8,8))
plt.title('Clustered Locations')
for j in range(k):
current_coords = np.asarray(coords)[np.where(labels==j)]
hull = ConvexHull(current_coords)
j_longs = current_coords[:,1]
j_lats = current_coords[:,0]
plt.plot(j_longs, j_lats, '%so' % colors[j], label=j)
for simplex in hull.simplices:
#plt.plot(current_coords[simplex,1],current_coords[simplex,0], '%s--'%colors[j],alpha=0.3)
plt.plot(current_coords[simplex,1],current_coords[simplex,0], '--', color=colors[j],alpha=0.3)
plt.legend()
<matplotlib.legend.Legend at 0x293771b15f8>
ny_office_lease['location_cluster'] = ny_office_lease.name.map(dict(zip(addr_list,labels)))
If we knew all the prices, that would be a very useful third dimension in the clustering, and the 2 dimensional geographic representation will not necessary have contingious clusters (such as in this gas prices example). Since the whole point is to estimate unknown rates, we can't do that.
Neighbourhood census data may also give useful information for groupings, as well as for the overall analysis.
ny_office_lease.groupby('location_cluster')['building_sqft'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
location_cluster | ||||||||
0 | 215.0 | 357844.204651 | 328220.972457 | 1333.0 | 102150.0 | 231647.0 | 564196.00 | 2038201.0 |
1 | 123.0 | 811344.552846 | 588606.701623 | 10836.0 | 417068.0 | 647924.0 | 984000.00 | 3600000.0 |
2 | 27.0 | 65448.777778 | 102196.190915 | 2498.0 | 17987.5 | 39500.0 | 46865.00 | 345000.0 |
3 | 278.0 | 175994.208633 | 308523.701281 | 4614.0 | 57152.0 | 107343.0 | 212000.00 | 2800118.0 |
4 | 38.0 | 91648.578947 | 136895.518663 | 3200.0 | 10500.0 | 26000.0 | 80508.25 | 520000.0 |
5 | 547.0 | 485963.897623 | 488987.146443 | 5504.0 | 133700.0 | 293878.0 | 765911.00 | 3143291.0 |
6 | 141.0 | 142820.546099 | 214999.377540 | 3875.0 | 19847.0 | 42000.0 | 189389.00 | 1163051.0 |
# cluster 2 roughly corresponds to the area around Central Park (the empty space is noticeable), with the largest buildings
# next up is cluster 3, which contains the World Trade Center and Wall Street.
# There are some bugs with gmplot: the proper colors only show up when zoomed in.
Clear up the undisclosed rates to make them empty strings, and convert the string rates into floats. When there is a price range, always use the lowest (and for consistency do the same when there is an area range for each rental space).
ny_office_lease['rate'] = ny_office_lease['rate'].replace('Not Disclosed','')
# Get the highest rate from the range of rates if it exists, otherwise keep it at ''
def get_float_rate (string_rate):
if string_rate != '':
float_rate = float(string_rate.strip().split()[0].replace('$',''))
else:
float_rate = ''
return float_rate
ny_office_lease['rate'] = ny_office_lease['rate'].apply(lambda r: get_float_rate(r))
# Now convert the empty cells to null to make the rate column consistently float64
ny_office_lease['rate'] = ny_office_lease['rate'].replace('',pd.NaT)
ny_office_lease['rate']=pd.to_numeric(ny_office_lease['rate'], errors='coerce')
fig,ax=plt.subplots(ncols=1, figsize=(8,8))
sns.boxplot(x='location_cluster',y='rate',data=ny_office_lease)
plt.title('Average rates by cluster, and descriptive')
Text(0.5,1,'Average rates by cluster, and descriptive')
With the rate column appropriately convert to float, we can get an idea of the prices in each region.
It seems like there are enough differences in the prices to warrant having these classes.
BOMA defines building classes as follows:
Class A
Most prestigious buildings competing for premier office users with rents above average for the area. Buildings have high quality standard finishes, state of the art systems, exceptional accessibility and a definite market presence.
Class B
Buildings competing for a wide range of users with rents in the average range for the area. Building finishes are fair to good >for the area. Building finishes are fair to good for the area and systems are adequate, but the building does not compete with >Class A at the same price.
Class C
Buildings competing for tenants requiring functional space at rents below the average for the area.
ny_office_lease.groupby('location_cluster')['building_class'].describe()
count | unique | top | freq | |
---|---|---|---|---|
location_cluster | ||||
0 | 181 | 3 | A | 139 |
1 | 121 | 3 | A | 75 |
2 | 14 | 2 | B | 8 |
3 | 250 | 3 | B | 144 |
4 | 13 | 2 | C | 12 |
5 | 530 | 3 | A | 269 |
6 | 102 | 3 | B | 60 |
There are also many buildings with no classification, but it can be see that the unclassified buildings are not pure office properties, and so assigning them our own classification 'D' might actually serve some purpose.
ny_office_lease.loc[ny_office_lease['building_class'].isnull(),'building_class'] = 'D'
We can see that on average, classes A, B and C are appropriately priced based on their rated quality, but there are huge major outliers for classes B and C.
fig,ax=plt.subplots(ncols=2, figsize=(12,8))
plt.subplot(1,2,1)
sns.boxplot(x='building_class',y='rate',data=ny_office_lease)
plt.title('Average rates by building class, and descriptive')
plt.subplot(1,2,2)
sns.swarmplot(x='building_class',y='rate',data=ny_office_lease)
plt.title('All rates by building class')
Text(0.5,1,'All rates by building class')
Walk and transit scores are composites of features such as distance from public transportation, number of intersections, and traffic on a scale from 0 (worst) to 100 (best).
Conventiently, there are no buildings with one but not the other. They are also all in building_class D, so they are not pure office buildings. The scores actually seem to be available from walkscores.com, so we can input them manually.
ny_office_lease.loc[671,'walk_score'] = 97
ny_office_lease.loc[671,'transit_score'] = 100
ny_office_lease.loc[678,'walk_score'] = 99
ny_office_lease.loc[678,'transit_score'] = 100
ny_office_lease.loc[679,'walk_score'] = 99
ny_office_lease.loc[679,'transit_score'] = 100
ny_office_lease.loc[680,'walk_score'] = 99
ny_office_lease.loc[680,'transit_score'] = 100
ny_office_lease.loc[973,'walk_score'] = 98
ny_office_lease.loc[973,'transit_score'] = 100
ny_office_lease.loc[1043,'walk_score'] = 99
ny_office_lease.loc[1043,'transit_score'] = 100
ny_office_lease.loc[1044,'walk_score'] = 99
ny_office_lease.loc[1044,'transit_score'] = 100
ny_office_lease.loc[1045,'walk_score'] = 99
ny_office_lease.loc[1045,'transit_score'] = 100
ny_office_lease.loc[1046,'walk_score'] = 99
ny_office_lease.loc[1046,'transit_score'] = 100
ny_office_lease.loc[1320,'walk_score'] = 46
ny_office_lease.loc[1320,'transit_score'] = 0
ny_office_lease.groupby('location_cluster')['walk_score'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
location_cluster | ||||||||
0 | 215.0 | 99.800000 | 0.612944 | 95.0 | 100.0 | 100.0 | 100.0 | 100.0 |
1 | 123.0 | 99.796748 | 0.572016 | 97.0 | 100.0 | 100.0 | 100.0 | 100.0 |
2 | 27.0 | 97.074074 | 2.368478 | 92.0 | 96.0 | 98.0 | 99.0 | 100.0 |
3 | 278.0 | 98.967626 | 3.487320 | 46.0 | 99.0 | 100.0 | 100.0 | 100.0 |
4 | 38.0 | 97.315789 | 2.302822 | 92.0 | 97.0 | 98.0 | 99.0 | 100.0 |
5 | 547.0 | 99.475320 | 0.974413 | 88.0 | 99.0 | 100.0 | 100.0 | 100.0 |
6 | 141.0 | 99.836879 | 0.407503 | 98.0 | 100.0 | 100.0 | 100.0 | 100.0 |
ny_office_lease.groupby('location_cluster')['transit_score'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
location_cluster | ||||||||
0 | 215.0 | 99.944186 | 0.330181 | 98.0 | 100.0 | 100.0 | 100.0 | 100.0 |
1 | 123.0 | 100.000000 | 0.000000 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
2 | 27.0 | 96.777778 | 3.522819 | 90.0 | 92.5 | 99.0 | 99.0 | 99.0 |
3 | 278.0 | 99.640288 | 5.997601 | 0.0 | 100.0 | 100.0 | 100.0 | 100.0 |
4 | 38.0 | 99.763158 | 0.998220 | 94.0 | 100.0 | 100.0 | 100.0 | 100.0 |
5 | 547.0 | 99.983547 | 0.311124 | 93.0 | 100.0 | 100.0 | 100.0 | 100.0 |
6 | 141.0 | 100.000000 | 0.000000 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
In general these scores in Manhattan are extremely high.
The data does not contain the number of floors for each building, but in this region there should be a high correlation between total floor area and height. There is in nothing in Manhattan like the Pentagon, which has over 6.6 million square feet of floor space and only 7 above-ground floors.
We can immediately see a correlation between average size and building class for classes A, B and C.
If this is included as a feature for prices, we can either use arbitrary size classes, or use a combination of measuring by 1000 square feet and normalization.
ny_office_lease.groupby('building_class')['rate'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
building_class | ||||||||
A | 93.0 | 83.246129 | 57.135397 | 40.00 | 54.00 | 62.00 | 86.0 | 330.61 |
B | 229.0 | 72.030131 | 65.104618 | 6.00 | 50.00 | 60.00 | 70.0 | 720.00 |
C | 139.0 | 54.680647 | 47.420259 | 21.33 | 44.00 | 49.00 | 58.0 | 584.28 |
D | 78.0 | 69.633333 | 38.699070 | 11.96 | 44.55 | 59.03 | 78.0 | 240.00 |
We should log linearize building size, since this effectively removes outliers and smooths the data.
ny_office_lease['log_building_sqft'] = np.log(ny_office_lease['building_sqft'])
ny_office_lease.groupby('building_class')['log_building_sqft'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
building_class | ||||||||
A | 528.0 | 13.312452 | 0.709457 | 9.903488 | 12.886641 | 13.350614 | 13.696522 | 15.096444 |
B | 490.0 | 11.748973 | 0.972249 | 8.579604 | 11.230775 | 11.859030 | 12.495004 | 13.966557 |
C | 193.0 | 10.782073 | 1.058689 | 7.823246 | 9.865266 | 11.068153 | 11.617285 | 12.873715 |
D | 158.0 | 10.332363 | 1.238189 | 7.195187 | 9.423839 | 10.277442 | 10.995277 | 13.243157 |
ny_office_lease['log_building_sqft'].hist(bins=100)
<matplotlib.axes._subplots.AxesSubplot at 0x2937a0d6eb8>
fig,ax=plt.subplots(ncols=2,figsize=(12,8))
plt.subplot(1,2,1)
sns.boxplot(x='location_cluster',y='log_building_sqft',data=ny_office_lease)
plt.title('Log building size by cluster')
plt.subplot(1,2,2)
sns.boxplot(x='building_class',y='log_building_sqft',data=ny_office_lease)
plt.title('Log building size by cluster and building class')
Text(0.5,1,'Log building size by cluster and building class')
The largest building, 55 Water St, was the largest office building in the world when it was completed in 1972. The smallest, at 1005 Second Ave, is a narrow 4-storey storefront with an Italian restaurant on the ground floor, a second floor spa, and an apartment occupying the top 2 floors.
Like with rental rates, I'm simply going to use the highest unit area when there is a range. Almost all of them have an undisclosed price, so we're estimating anyway.
ny_office_lease['space_sqft'] = ny_office_lease['space_sqft'].apply(lambda a:int(a.strip().split()[0].replace(',','')))
ny_office_lease.groupby('building_class')['space_sqft'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
building_class | ||||||||
A | 528.0 | 10821.517045 | 15985.617475 | 40.0 | 1853.75 | 5879.0 | 16445.25 | 260000.0 |
B | 490.0 | 3894.195918 | 4723.911066 | 40.0 | 909.50 | 2332.0 | 5082.75 | 30216.0 |
C | 193.0 | 2854.839378 | 2682.508900 | 70.0 | 1000.00 | 2200.0 | 3750.00 | 16909.0 |
D | 158.0 | 3102.310127 | 3051.524720 | 144.0 | 833.25 | 1814.0 | 4300.00 | 14290.0 |
ny_office_lease.groupby('location_cluster')['space_sqft'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
location_cluster | ||||||||
0 | 215.0 | 4928.986047 | 6196.903267 | 100.0 | 972.00 | 2500.0 | 6078.50 | 29802.0 |
1 | 123.0 | 10046.349593 | 14194.631795 | 40.0 | 614.00 | 3186.0 | 11103.50 | 57399.0 |
2 | 27.0 | 4009.111111 | 2849.915172 | 195.0 | 1464.50 | 5000.0 | 5202.50 | 8750.0 |
3 | 278.0 | 3768.176259 | 4332.201566 | 50.0 | 902.25 | 2400.0 | 5625.00 | 26782.0 |
4 | 38.0 | 10474.342105 | 41694.292211 | 400.0 | 1037.50 | 3625.0 | 6034.75 | 260000.0 |
5 | 547.0 | 7058.681901 | 8902.770520 | 40.0 | 1171.00 | 3600.0 | 8997.50 | 45245.0 |
6 | 141.0 | 6756.808511 | 10383.836907 | 100.0 | 1500.00 | 2500.0 | 6400.00 | 51576.0 |
The largest rental spaces tend to be entire floors or sections of the building. Regardless of location, class A buildings tend to have the largest rental spaces.
Interestingly, some of the smallest offices are also class A. These are often executive offices. It seems like the relationship between unit area and rate is not monotonic.
sns.lmplot(x='space_sqft',y='rate',data=ny_office_lease)
plt.title('Rate by size of unit space')
Text(0.5,1,'Rate by size of unit space')
Generally (very, very generally), larger units are cheaper per square footage.
This is usually the floor number, which might be useful information. But we need to extract it from the text, which might be an ordeal since there is a lack of consistency. However, it can be seen that the first word of each string contains the proper information.
ny_office_lease['space_floor']=ny_office_lease['space_name']
ny_office_lease['space_floor'] = ny_office_lease['space_floor'].apply(lambda w: w.strip().split()[0])
ny_office_lease.loc[ny_office_lease['space_floor'] == 'Ground','space_floor'] = '1'
# Change "Ground" and "Lower" to "1"
ny_office_lease.loc[ny_office_lease['space_floor'] == 'Lower','space_floor'] = '1'
# Now get rid of "th", "rd", "st", "nd"
ny_office_lease['space_floor']=ny_office_lease['space_floor'].apply(lambda w:w.split("th")[0])
ny_office_lease['space_floor']=ny_office_lease['space_floor'].apply(lambda w:w.split("rd")[0])
ny_office_lease['space_floor']=ny_office_lease['space_floor'].apply(lambda w:w.split("nd")[0])
ny_office_lease['space_floor']=ny_office_lease['space_floor'].apply(lambda w:w.split("st")[0])
ny_office_lease.groupby('space_floor')['rate'].describe().head()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
space_floor | ||||||||
1 | 75.0 | 87.796133 | 61.637306 | 11.96 | 48.50 | 66.0 | 100.0 | 300.0 |
10 | 21.0 | 56.884286 | 11.167416 | 36.00 | 47.00 | 59.0 | 65.0 | 82.0 |
11 | 5.0 | 53.240000 | 8.371858 | 42.00 | 47.20 | 56.0 | 59.0 | 62.0 |
12 | 14.0 | 57.285714 | 9.522674 | 44.00 | 50.25 | 55.5 | 65.0 | 75.0 |
13 | 6.0 | 59.500000 | 11.022704 | 45.00 | 51.75 | 61.0 | 65.0 | 75.0 |
How can these be classified? It's going to be somewhat arbitrary:
"base": basement
"lower": Meazzaine, Concourse, Floors 1-5
"mid": 5-14
"high": 15-24
"upper": 25+, Pent
def floor_class (floor):
if floor == 'Basement':
return 'base'
elif floor == 'Concourse':
return 'lower'
elif floor == 'Mezzanine':
return 'lower'
elif floor == 'Pen':
return 'upper'
else:
floor = int(floor)
if floor < 5:
return 'lower'
elif floor < 15:
return 'mid'
elif floor < 25:
return 'high'
else:
return 'upper'
return None
ny_office_lease['floor_class'] = ny_office_lease['space_floor'].apply(lambda fl:floor_class(fl))
fig,ax=plt.subplots(ncols=2,figsize=(12,8))
plt.subplot(1,2,1)
sns.boxplot(x='floor_class',y='log_building_sqft',data=ny_office_lease)
plt.title('Log building size by floor of rental unit')
plt.subplot(1,2,2)
sns.swarmplot(x='floor_class',y='rate',data=ny_office_lease)
plt.title('Rental rate by floor of rental unit')
Text(0.5,1,'Rental rate by floor of rental unit')
Utilities, etc.
"Negotiable" and "to be determined" might be combinable. On the other hand, it might be determined to not be negotiable, so we can only do some limited groupings.
# these are fine
ny_office_lease.loc[ny_office_lease['service_type'] == 'Modified Gross','service_type'] = 'mod_gross'
ny_office_lease.loc[ny_office_lease['service_type'] == 'Plus Electric','service_type'] = 'elec'
# put these together
ny_office_lease.loc[ny_office_lease['service_type'] == 'Tenant Electric','service_type'] = 'elec'
ny_office_lease.loc[ny_office_lease['service_type'] == 'Plus Electric And Cleaning','service_type'] = 'elec'
ny_office_lease.loc[ny_office_lease['service_type'] == 'Plus All Utilities','service_type'] = 'elec'
# Put net, triple net in net
ny_office_lease.loc[ny_office_lease['service_type'] == 'Net','service_type'] = 'net'
ny_office_lease.loc[ny_office_lease['service_type'] == 'Triple Net','service_type'] = 'net'
# Put tbd, neg in tbd
ny_office_lease.loc[ny_office_lease['service_type'] == 'To Be Determined','service_type'] = 'tbd'
ny_office_lease.loc[ny_office_lease['service_type'] == 'Negotiable','service_type'] = 'tbd'
ny_office_lease.loc[ny_office_lease['service_type'] == 'Full Service Gross','service_type'] = 'fsg'
ny_office_lease.groupby('service_type')['rate'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
service_type | ||||||||
elec | 165.0 | 62.591576 | 46.017399 | 21.33 | 48.00 | 58.0 | 65.0000 | 584.28 |
fsg | 76.0 | 98.268289 | 111.225982 | 6.00 | 50.00 | 63.0 | 91.4775 | 720.00 |
mod_gross | 246.0 | 64.455610 | 36.406544 | 11.96 | 47.05 | 56.0 | 70.0000 | 300.00 |
net | 43.0 | 65.988605 | 34.812040 | 25.00 | 45.00 | 56.0 | 80.0000 | 186.00 |
tbd | 9.0 | 86.578889 | 37.058616 | 42.00 | 60.00 | 70.0 | 117.0000 | 150.00 |
"Executive": Office(Executive Suite), Office/Medical(Executive Suite)
"Office": keep
"Retail": keep
"Medical": keep
"Mixed": Office/Medical, Office/Retail
ny_office_lease.loc[ny_office_lease['space_use'] == 'Office(Executive Suite)','space_use'] = 'executive'
ny_office_lease.loc[ny_office_lease['space_use'] == 'Office/Medical(Executive Suite)','space_use'] = 'executive'
ny_office_lease.loc[ny_office_lease['space_use'] == 'Office/Medical','space_use'] = 'mixed'
ny_office_lease.loc[ny_office_lease['space_use'] == 'Office/Retail','space_use'] = 'mixed'
ny_office_lease.loc[ny_office_lease['space_use'] == 'Medical','space_use'] = 'mixed'
ny_office_lease.loc[ny_office_lease['space_use'] == 'Office','space_use'] = 'office'
ny_office_lease.loc[ny_office_lease['space_use'] == 'Retail','space_use'] = 'retail'
ny_office_lease['space_use'].unique()
array(['office', 'mixed', 'executive', 'retail'], dtype=object)
sns.barplot(x='building_class', y='rate', hue='space_use', data=ny_office_lease)
plt.title('Rate by type by building class')
Text(0.5,1,'Rate by type by building class')
Retail tends to be the most expensive in all classes but B, but we saw that class B had some major executive office outliers above.
This seems to be all over the place. There are many old historical buildings that are well maintained and clearly not falling apart. This doesn't seem to be too big of a deal for Manhattan.
Within each location cluster, there doesn't seem to be much of a relationship between building age and rental rate either.
plt.plot(ny_office_lease['building_age'], ny_office_lease['rate'], 'o', markersize=2)
plt.title("Doesn't seem to be anything there")
Text(0.5,1,"Doesn't seem to be anything there")
With some significant outliers, overall the rate does not seem to depend too much on unit size, and so on average lease rates increase linearly with square footage. Getting the monthly rate is useful for visualization, because the relationship between unit size and price is generally pretty linear. However, we will not use it in the model or there will significant leakage, and anyway it does not really affect our results based on a mean absolute log deviation metric.
ny_office_lease['log_monthly_rate'] = np.log(ny_office_lease['rate'] * ny_office_lease['space_sqft'] / 12)
ny_office_lease['log_space_sqft'] = np.log(ny_office_lease['space_sqft'])
# let's sneak this in here too
ny_office_lease['log_rate'] = np.log(ny_office_lease['rate'])
sns.lmplot(x='log_space_sqft',y='log_monthly_rate',hue='space_use',data=ny_office_lease, size=10,aspect=1)
plt.title('Rate vs unit size by unit type')
plt.xlabel('Unit Size')
plt.ylabel('Rate')
Text(27.0568,0.5,'Rate')
The size of a building compared to its neighbours could have an effect on price. If building height data or highest floor were known, we could also find the relative height of a building. However, there should be high correlation between total building floor space and height within a cluster (e.g. in a cluster with mostly skyscrapers, it's unlikely there will be a building like the Pentagon, with only 5 above ground storeys but millions of square feet in floors). Size/height past a certain point could have prestige attached to it: a 55 storey building might be more prestigious than a 50 storey one, but 10 and 15 aren't too different.
The area variables have already been log transformed, so this variable is just log(building_sqft) - log(cluster mean).
cluster_mean_sqft = dict(zip(range(k),ny_office_lease.groupby('location_cluster')['building_sqft'].describe()['mean']))
ny_office_lease['cluster_mean_sqft'] = np.log(ny_office_lease['location_cluster'].map(cluster_mean_sqft))
ny_office_lease['building_rel_size'] = ny_office_lease['log_building_sqft'] - ny_office_lease['cluster_mean_sqft']
The information given by this new measure is similar, but changes some of the orderings.
fig,(ax1,ax2)=plt.subplots(ncols=2,figsize=(12,8))
sns.regplot(x='building_rel_size',y='log_rate',data=ny_office_lease, ax=ax1)
sns.regplot(x='log_building_sqft',y='log_rate',data=ny_office_lease, ax=ax2)
<matplotlib.axes._subplots.AxesSubplot at 0x2937b87eac8>
The fit is still not great, but gives us different information than just the size of the building (the orderings are different).
My class A building is bigger/better than your class A building! Or is it?
building_class_mean_sqft = dict(ny_office_lease.groupby('building_class')['building_sqft'].describe()['mean'])
ny_office_lease['building_class_mean_sqft'] = np.log(ny_office_lease['building_class'].map(building_class_mean_sqft))
ny_office_lease['building_rel_to_class'] = ny_office_lease['log_building_sqft'] - ny_office_lease['building_class_mean_sqft']
fig,(ax1,ax2)=plt.subplots(ncols=2,figsize=(12,8))
sns.regplot(x='building_rel_to_class',y='log_monthly_rate',data=ny_office_lease, ax=ax1)
sns.regplot(x='building_rel_to_class',y='rate',data=ny_office_lease, ax=ax2)
<matplotlib.axes._subplots.AxesSubplot at 0x2937bce8cf8>
This does not seem to be a particularly informative measure, so we are not going to include it in the model.
Is it not reasonable to assume that, for example, a 200 square foot executive office in midtown will probably cost more than a 100 square foot executive office next door?
space_use_mean_sqft = dict(ny_office_lease.groupby('space_use')['space_sqft'].describe()['mean'])
ny_office_lease['space_use_mean_sqft'] = np.log(ny_office_lease['space_use'].map(space_use_mean_sqft))
ny_office_lease['space_rel_size'] = ny_office_lease['log_space_sqft'] - ny_office_lease['space_use_mean_sqft']
fig,(ax1,ax2)=plt.subplots(ncols=2,figsize=(12,8))
sns.regplot(x='space_rel_size',y='log_monthly_rate',data=ny_office_lease, ax=ax1)
sns.regplot(x='space_rel_size',y='rate',data=ny_office_lease, ax=ax2)
<matplotlib.axes._subplots.AxesSubplot at 0x2937bdb3a20>
Of course, the general trends are still the same: as floor space increases, monthly rate increases, and $/sqft/year decreases slightly. But it is hoped that this can give us information about size differences specific to a type.
We will include this measure and the relative building size to cluster mean.
def dummify(df, col):
dummies=pd.get_dummies(df[col], prefix=col, drop_first=False)
df = pd.concat((df,dummies),axis=1)
return df
ny_office_lease = dummify(ny_office_lease,'building_class')
ny_office_lease = dummify(ny_office_lease,'location_cluster')
ny_office_lease = dummify(ny_office_lease,'floor_class')
ny_office_lease = dummify(ny_office_lease,'service_type')
ny_office_lease = dummify(ny_office_lease,'space_use')
We decide not to use building or the walk/transit scores. Age may be more useful in some clusters than in others. The walk/transit scores tend to all be very high, with averages close to 100, making them relatively less informative.
columns2 = ['name', 'log_rate', 'log_building_sqft',
'log_space_sqft','building_class_A','building_class_B', 'building_class_C',
'building_class_D','space_use_executive',
'space_use_mixed', 'space_use_office',
'space_use_retail',
'location_cluster_0',
'location_cluster_1', 'location_cluster_2', 'location_cluster_3',
'location_cluster_4', 'location_cluster_5', 'location_cluster_6',
'building_rel_size', 'space_rel_size',
'service_type_elec', 'service_type_fsg', 'service_type_mod_gross',
'service_type_net', 'service_type_tbd'
]
data=ny_office_lease[columns2].copy()
Our loss function will be mean absolute deviation of log monthly rent: $MAD = mean(\left | rent - predictedrent \right |)$, which gives us an approximation of the percent deviation.
This is a much more informative measure than something like $R^{2}$ for price prediction, and gives us an approximation of the average percentage deviation from the true price.
from sklearn.model_selection import train_test_split
holdout = data[data.isnull()['log_rate']==True]
train = data[data.isnull()['log_rate']==False]
X=train.drop(['name','log_rate'],axis=1)
y=train['log_rate']
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, BaggingRegressor
from pandas.plotting import scatter_matrix
%matplotlib inline
def mald (val, target):
return np.mean(abs(np.log(target)-np.log(val)))
def mad (val,target):
return np.mean(abs(target-val))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
grad_test = GradientBoostingRegressor(learning_rate=0.05,n_estimators=350,loss='lad')
hyper_params = {'n_estimators': randint(400, 800),
'max_depth': randint(1,10),
'subsample': uniform(0,1),
'min_samples_split': randint(2,5),
'min_samples_leaf': randint(1,5)}
search = RandomizedSearchCV(grad_test, param_distributions=hyper_params ,n_jobs=4, verbose=4, cv=10)
search.fit(X_train,y_train)
search.best_params_, search.best_score_
Fitting 10 folds for each of 10 candidates, totalling 100 fits
[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 7.1s [Parallel(n_jobs=4)]: Done 90 tasks | elapsed: 40.7s [Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 42.6s finished
({'max_depth': 3, 'min_samples_leaf': 4, 'min_samples_split': 4, 'n_estimators': 772, 'subsample': 0.79214170481007606}, 0.34784996955787872)
xg = xgb.XGBRegressor(n_estimators=1000, max_depth=2, learning_rate=0.01, n_jobs=4, eval_metric='mae')
hyper_params1 = {'n_estimators': randint(1000, 2000),
'max_depth': randint(1,12),
'subsample': uniform(0,1),
'min_child_weight': randint(1,10),
'max_delta_step': randint(1,10)}
search1 = RandomizedSearchCV(xg, param_distributions=hyper_params1 ,n_jobs=4, verbose=4, cv=10)
search1.fit(X_train,y_train)
search1.best_params_, search.best_score_
Fitting 10 folds for each of 10 candidates, totalling 100 fits
[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 12.7s [Parallel(n_jobs=4)]: Done 90 tasks | elapsed: 37.4s [Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 39.4s finished
({'max_delta_step': 9, 'max_depth': 8, 'min_child_weight': 4, 'n_estimators': 1871, 'subsample': 0.47517782019874788}, 0.34784996955787872)
# We can just use good old fashioned OLS as an example. Variants like ridge and lasso don't perform much differently here.
lm=LinearRegression()
lm.fit(X_train,y_train)
lm_est = lm.predict(X_test)
print('LINEAR REGRESSION')
print('Training MAD: ',mad(y_train,lm.predict(X_train)))
print('Test MAD: ', mad(y_test,lm_est))
est = search.best_estimator_.predict(X_test)
print('GRADIENT BOOSTED')
print('Training MAD: ',mad(y_train,search.best_estimator_.predict(X_train)))
print('Test MAD: ', mad(y_test,est))
est1 = search1.best_estimator_.predict(X_test)
print('XGBOOST')
print('Training MAD: ',mad(y_train,search1.best_estimator_.predict(X_train)))
print('Test MAD: ', mad(y_test,est1))
LINEAR REGRESSION Training MAD: 0.2418320139890109 Test MAD: 0.27474306943340693 GRADIENT BOOSTED Training MAD: 0.11080319517455038 Test MAD: 0.20708264188231537 XGBOOST Training MAD: 0.04704005075546215 Test MAD: 0.21513588320397473
print('Feature importance under both gradient boosting methods')
imp_df = pd.DataFrame(list(zip(X.columns,search.best_estimator_.feature_importances_,search1.best_estimator_.feature_importances_)))
imp_df.columns=['feature','gb','xgboost']
imp_df
Feature importance under both gradient boosting methods
feature | gb | xgboost | |
---|---|---|---|
0 | log_building_sqft | 0.183381 | 0.201601 |
1 | log_space_sqft | 0.235754 | 0.252305 |
2 | building_class_A | 0.008107 | 0.004774 |
3 | building_class_B | 0.010445 | 0.018384 |
4 | building_class_C | 0.011188 | 0.014144 |
5 | building_class_D | 0.007741 | 0.017583 |
6 | space_use_executive | 0.001348 | 0.002194 |
7 | space_use_mixed | 0.009603 | 0.013491 |
8 | space_use_office | 0.009531 | 0.021646 |
9 | space_use_retail | 0.005681 | 0.012424 |
10 | location_cluster_0 | 0.015787 | 0.006938 |
11 | location_cluster_1 | 0.022209 | 0.010704 |
12 | location_cluster_2 | 0.025761 | 0.008510 |
13 | location_cluster_3 | 0.003907 | 0.015004 |
14 | location_cluster_4 | 0.007882 | 0.011297 |
15 | location_cluster_5 | 0.008822 | 0.015063 |
16 | location_cluster_6 | 0.014751 | 0.009281 |
17 | building_rel_size | 0.157884 | 0.191964 |
18 | space_rel_size | 0.230346 | 0.119881 |
19 | service_type_elec | 0.010595 | 0.012394 |
20 | service_type_fsg | 0.007911 | 0.018443 |
21 | service_type_mod_gross | 0.007527 | 0.016694 |
22 | service_type_net | 0.003839 | 0.005278 |
23 | service_type_tbd | 0.000000 | 0.000000 |
Let's try a weighted average of the two gradient boosted results. This can smooth out the errors in the individual models.
ensemble_est = 0.5 * est + 0.5 * est1
print('ENSEMBLE')
print('Test MAD: ', mad(y_test,ensemble_est))
ENSEMBLE Test MAD: 0.20738144371569303
The ensemble MAD seems to have a lower variance compared to the individual methods.
On an open-ended PhD econometrics exam question, a friend answered simply, "I will bootstrap." And maybe so should we. We have been speculating about the sample size being small, so let's see what happens if we manufacture a larger training set by drawing with replacement from our real sample. We need to take care not to contaminate the training set with observations from the test set, or the MAD is going to be ineligibly good.
X_train_bs, X_test_bs, y_train_bs, y_test_bs = train_test_split(X,y,test_size=0.3)
train_bs = pd.concat((pd.Series(y_train_bs),X_train_bs),axis=1).sample(n=25000,replace=True)
X_train_bs = train_bs.drop(['log_rate'],axis=1)
y_train_bs = train_bs['log_rate']
search = RandomizedSearchCV(grad_test, param_distributions=hyper_params ,n_jobs=4, verbose=4, cv=10)
search.fit(X_train_bs,y_train_bs)
search.best_params_, search.best_score_
Fitting 10 folds for each of 10 candidates, totalling 100 fits
[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 17.0min [Parallel(n_jobs=4)]: Done 90 tasks | elapsed: 71.6min [Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 74.6min finished
({'max_depth': 9, 'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 647, 'subsample': 0.2990008276272238}, 0.98745846063131515)
est = search.best_estimator_.predict(X_test_bs)
print('BOOTSTRAPPED GRADIENT BOOSTED')
print('Training MAD: ',mad(y_train_bs,search.best_estimator_.predict(X_train_bs)))
print('Test MAD: ', mad(y_test_bs,est))
BOOTSTRAPPED GRADIENT BOOSTED Training MAD: 0.010200217959639886 Test MAD: 0.2100096286359437
est
array([ 4.18965469, 4.04311682, 5.22643646, ..., 4.1589247 , 4.38224071, 3.91185968])
Ultimately, our mean average deviation tends to be in the 0.15-0.20 range, occassionally spiking to 0.25 depending on the validation split. With only 539 observations there is a fair amount of variability in our mean absolute deviation results, which are highly dependent on how the training and validation sets are split. Aside from obtaining more data from different sources or continuously adding new observations from the same source (which may add trend issues to the analysis), or researching more features, there are likely still ways to extract more information from the existing data.
The relative building size metric, for example, ended up being a feature of great significance, and its inclusion single handedly reduced MAD from the 0.2-0.3 range to the 0.15-0.25 range. The type and class comparison metrics did not significantly reduce MAD, but the inclusion of the type comparison may have reduced some of the variations in the score due to sample splitting issues.
There are also variations that would be impossible or extremely difficult to explain with our current data. For example, there are 3 50 square foot executive office suites at 1214-1216 Broadway, on the 2nd, 3rd and 4th floors. All other features are identical. The third office is listed at 352. But the first two are 455.69, which would be a 0.27 absolute log difference if we accurately predict one of the prices. Is the 4th floor that bad? Or is it because of the specific building in the 1214-1216 range? Or there is something specific about the third office for which we need some sort of up-close, possibly even subjective rating. Maybe it's next to a loud air vent, or the view isn't as good (although we're comparing the fourth floor with second and third). Or it gives people a bad feeling when they walk in. We just don't know, and this is something that can not be fixed by adding more observations.
It would be interesting to see how this model fares with a larger dataset, especially if we could increase the unique addresses from 450 (of which only 225 unique addresses had at least one disclosed rate). A larger set would both help alleviate the validation issues and at the same time give us more information with which to predict test prices. Of course, more 'up close' features would also be helpful.