Airbnb prides itself on making travel more than it seems: its mission is to help people feel like they belong anywhere they go through more authentic cultural experiences. But what if Airbnb's efforts to bridge differences are actually creating more of them?
This project can also be found in the GitHub repository at https://github.com/Lubaina97/323-Final-Project/tree/master.
Gentrification is a general term for the arrival of wealthier people in an existing urban district, a related increase in rents and property values, and changes in the district's character and culture. The term is often used negatively, suggesting the displacement of poor communities by rich outsiders.
The economics of gentrification explicitly state that neighborhood property values increase, decreasing the supply of affordable housing available to lower-income residents who are then displaced, as the cost of living in the neighborhood increases.
A report titled “The High Cost of Short-Term Rentals in New York City,” authored by a research group from the McGill University School of Urban Planning, found that in the study period of September 2014 through August 2017 Airbnb has potentially removed between 7,000 and 13,500 units of housing from New York’s long-term rental market, putting extra pressure on a city already squeezed for housing.
import pandas as pd
import numpy as np
import qeds
qeds.themes.mpl_style()  # activate plot theme
import matplotlib.pyplot as plt
import matplotlib.colors as mplc
import matplotlib.patches as patches
from sklearn.cluster import KMeans
from sklearn import datasets
import descartes
import folium
from folium import plugins
from folium.plugins import HeatMap
import statsmodels.formula.api as sm
import geopandas as gpd
from pandas_datareader import DataReader
from shapely.geometry import Point, Polygon
%matplotlib inline
df = pd.read_csv('./occupied_units2017 copy.csv') # Housing data for 2017
df.head()
As shown above, the NYCHVS survey has many columns, so for the purposes of this project I will work with a subset of the table containing only the variables needed for my analysis.
subset = df.loc[:,['Borough','Sub-borough area','Race and Ethnicity of householder','Monthly gross rent','Household income recode']]
subset.head()
print('This dataset has', subset.shape[0], 'observations and', subset.shape[1], 'columns.')
subset.isnull().sum()
subset.loc[(subset['Monthly gross rent'] == 99999) | (subset['Household income recode'] == 9999999), :]
invalid_subset = subset.loc[(subset['Monthly gross rent'] == 99999) | (subset['Household income recode'] == 9999999), :]
print('The number of observations to be dropped is', invalid_subset.shape[0])
The codes 99999 (rent) and 9999999 (income) are survey placeholders for missing values. Having counted those observations above, I now drop them from the subset.
subset.drop(subset.loc[(subset['Monthly gross rent']==99999) | (subset['Household income recode']==9999999)].index, inplace=True)
subset.shape
subset = subset.sort_values(by=['Borough'])
# Mean and median rent in 2017 for each sub-borough
subset_grouped_rent2017 = subset.groupby(['Borough','Sub-borough area']).agg({'Monthly gross rent': ['mean', 'median']})
subset_grouped_rent2017.columns = ['Rent_Mean', 'Rent_Median']
subset_grouped_rent2017 = subset_grouped_rent2017.reset_index()
# Mean and median income in 2017 for each sub-borough
subset_grouped_income2017 = subset.groupby(['Borough','Sub-borough area']).agg({'Household income recode': ['mean', 'median']})
subset_grouped_income2017.columns = ['Income_Mean', 'Income_Median']
subset_grouped_income2017 = subset_grouped_income2017.reset_index()
# Most frequent racial code in 2017 for each sub-borough
subset_grouped_race2017 = subset.groupby(['Borough','Sub-borough area'], as_index=False)['Race and Ethnicity of householder'].apply(lambda x: x.value_counts(dropna=False).idxmax())
df_1 = pd.read_csv('./occupied_units2014 copy.csv') # Housing data for 2014
df_1.shape
subset_1 = df_1.loc[:,['Borough','Sub-borough area', 'Race and Ethnicity of householder','Monthly gross rent','Household income recode']]
subset_1.isnull().sum()
subset_1.drop(subset_1.loc[(subset_1['Monthly gross rent']==99999) | (subset_1['Household income recode']==9999999)].index, inplace=True)
subset_1 = subset_1.sort_values(by=['Borough'])
# Mean and median rent in 2014 for each sub-borough
subset_grouped_rent2014 = subset_1.groupby(['Borough','Sub-borough area']).agg({'Monthly gross rent': ['mean', 'median']})
subset_grouped_rent2014.columns = ['Rent_Mean', 'Rent_Median']
subset_grouped_rent2014 = subset_grouped_rent2014.reset_index()
# Mean and median income in 2014 for each sub-borough
subset_grouped_income2014 = subset_1.groupby(['Borough','Sub-borough area']).agg({'Household income recode': ['mean', 'median']})
subset_grouped_income2014.columns = ['Income_Mean', 'Income_Median']
subset_grouped_income2014 = subset_grouped_income2014.reset_index()
# Most frequent racial code in 2014 for each sub-borough
subset_grouped_race2014 = subset_1.groupby(['Borough','Sub-borough area'], as_index=False)['Race and Ethnicity of householder'].apply(lambda x: x.value_counts(dropna=False).idxmax())
# Loading income mean and medians for each sub-borough in 2014 and 2017
df1 = pd.read_csv('./subset_grouped_income2014.csv')
df2 = pd.read_csv('./subset_grouped_income2017.csv')
df1.head()
income_pct = pd.merge(df1, df2, on=["Borough", "Sub-borough area"])
income_pct = income_pct.rename(columns={"Income_Mean_x": "Income_mean2014", "Income_Median_x": "Income_median2014", "Income_Mean_y": "Income_mean2017", "Income_Median_y": "Income_median2017"})
income_median_pct = income_pct.loc[:,['Borough','Sub-borough area','Income_median2014','Income_median2017']]
income_median_pct['pct_change_income'] = income_median_pct[['Income_median2014','Income_median2017']].pct_change(axis=1)['Income_median2017']
income_median_pct.head()
# Loading rent mean and medians for each sub-borough in 2014 and 2017
ef1 = pd.read_csv('./subset_grouped_rent2014.csv')
ef2 = pd.read_csv('./subset_grouped_rent2017.csv')
ef1.head()
rent_pct = pd.merge(ef1, ef2, on=["Borough", "Sub-borough area"])
rent_pct = rent_pct.rename(columns={"Rent_Mean_x": "Rent_mean2014", "Rent_Median_x": "Rent_median2014", "Rent_Mean_y": "Rent_mean2017", "Rent_Median_y": "Rent_median2017"})
rent_pct.head()
rent_median_pct = rent_pct.loc[:,['Borough','Sub-borough area','Rent_median2014','Rent_median2017']]
rent_median_pct['pct_change_rent'] = rent_median_pct[['Rent_median2014','Rent_median2017']].pct_change(axis=1)['Rent_median2017']
rent_median_pct.head()
final_rent_pctchange = rent_median_pct.loc[:,['Borough','Sub-borough area','pct_change_rent']]
final_income_pctchange = income_median_pct.loc[:,['Borough','Sub-borough area','pct_change_income']]
pct_changes = pd.merge(final_rent_pctchange, final_income_pctchange, on=["Borough", "Sub-borough area"])
pct_changes.head()
By looking exclusively at percent change, we can see how neighborhoods evolve regardless of their starting socio-economic or demographic composition. Rent values are one of the best metrics for capturing fluctuations in neighborhood investment, while household income captures who can afford to stay; because gentrification affects renters and residents differently, both median rent and median income were included.
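As a toy illustration of how `pct_change(axis=1)` produces the period-over-period change used throughout this analysis (the numbers below are made up, not the NYCHVS data):

```python
import pandas as pd

# Hypothetical sub-borough medians (illustrative values only)
toy = pd.DataFrame({
    "median2014": [1200.0, 1500.0],
    "median2017": [1500.0, 1650.0],
})
# pct_change along axis=1 compares each column with the one before it,
# so the 2017 column holds (2017 - 2014) / 2014
toy["pct_change"] = toy[["median2014", "median2017"]].pct_change(axis=1)["median2017"]
print(toy["pct_change"].tolist())  # [0.25, 0.1]
```

The same pattern is applied to both the rent and income tables, giving one growth rate per sub-borough.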
K-means clustering is a type of unsupervised clustering algorithm that partitions observations into K number of user-specified groupings. The k-means objective function iteratively assigns observations to a cluster that satisfies the minimum within-cluster sum of squares (MacQueen, 1967). Below, I perform a K-means clustering analysis on the percent change between 2014 and 2017 of median rent and income in New York.
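To make the objective concrete, the sketch below (on made-up 2-D points, not the project data) checks that the within-cluster sum of squares computed by hand matches what scikit-learn reports as `inertia_`, the quantity K-means minimizes:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy clusters (illustrative points only)
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [2.0, 2.0], [2.1, 2.2], [1.9, 2.1]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

# Within-cluster sum of squares: squared distance of each point
# to the centroid of the cluster it was assigned to
wcss = sum(np.sum((pts[km.labels_ == k] - c) ** 2)
           for k, c in enumerate(km.cluster_centers_))
print(np.isclose(wcss, km.inertia_))  # True
```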
pct = pd.read_csv('./pct_changes.csv')
# Exploring the choice of cluster count, starting with k = 5
x = pct.loc[:, ['pct_change_rent','pct_change_income']].values
kmeans5 = KMeans(n_clusters=5)
y_kmeans5 = kmeans5.fit_predict(x)
print(y_kmeans5)
kmeans5.cluster_centers_
# Elbow method: within-cluster error (inertia) for k = 1..10
Error = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i).fit(x)
    Error.append(kmeans.inertia_)
plt.plot(range(1, 11), Error)
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Error (inertia)')
plt.show()
# The elbow suggests k = 3; fit the final clustering
kmeans3 = KMeans(n_clusters=3)
y_kmeans3 = kmeans3.fit_predict(x)
print(y_kmeans3)
kmeans3.cluster_centers_
plt.scatter(x[:, 0], x[:, 1], c=y_kmeans3, cmap='cool')
kmeans3.labels_
# Finding which sub-borough codes lie in which cluster to classify sub-boroughs
# as gentrified status or not
mydict = {i: np.where(kmeans3.labels_ == i)[0] for i in range(kmeans3.n_clusters)}
mydict
pct_changes.head()
# Colour coding the sub-boroughs based on K-Means results
pct_color = pd.read_excel('./pct_changes.xlsx')
def highlight_cluster(s, column):
    # Flag rows whose value in `column` equals 0
    is_max = pd.Series(data=False, index=s.index)
    is_max[column] = s.loc[column] == 0
    return ['background-color: purple' if is_max.any() else '' for v in is_max]

def highlight_cluster0(s):
    # One background colour per K-means cluster label
    if s.cluster == 0.0:
        return ['background-color: deeppink'] * 5
    elif s.cluster == 1.0:
        return ['background-color: deepskyblue'] * 5
    elif s.cluster == 2.0:
        return ['background-color: darkblue'] * 5
pct_color_v = pct_color.style.apply(highlight_cluster0, axis=1).set_properties(**{'color': 'white','border-color': 'white'})
pct_color_v
In order to mark the sub-boroughs by cluster colour on a map, I manually added each sub-borough's latitude and longitude in Excel.
gent_map = pd.read_csv('./pct_changes.csv')
gent_map = gent_map.drop(['Unnamed: 5'], axis=1)
gent_map.head()
# Using latitude longitude, mapping gentrification using folium
folium_map = folium.Map(location=[40.738, -73.98],
zoom_start=11,
tiles="CartoDB dark_matter",
width='90%')
for index, row in gent_map.iterrows():
    gentrification_factor = row["pct_change_income"] + row["pct_change_rent"]
    radius = gentrification_factor * 2
    popup_text = """{}<br>
                 Change in Income: {}<br>
                 Change in Rent: {}<br>
                 gentrification value: {}"""
    popup_text = popup_text.format(row["Borough"],
                                   row["pct_change_income"],
                                   row["pct_change_rent"],
                                   gentrification_factor)
    if 0.357005 < gentrification_factor < 0.63:
        color = "#ADD8E6"  # light blue
    elif 0.18 < gentrification_factor < 0.44:
        color = "#0000FF"  # blue
    else:
        color = "#FF1493"  # deep pink
    folium.CircleMarker(location=(row["Latitude"], row["Longitude"]),
                        radius=radius,
                        color=color,
                        popup=popup_text,
                        fill=True).add_to(folium_map)
folium_map
race2014 = pd.read_csv('./race2014.csv')
pct_change = pd.read_csv('./pct_changes.csv')
merge1 = pd.merge(race2014, pct_change, on=["Borough", "Sub-borough area"])
race2017 = pd.read_csv('./race2017.csv')
merge2 = pd.merge(race2017, pct_change, on=["Borough", "Sub-borough area"])
race_2014 = merge1[['Sub-borough Name', 'Race_count']]
race_2017 = merge2[['Sub-borough Name', 'Race_count']]
race_2014 = race_2014.rename(columns={"Race_count": "Race_count2014"})
race_2017 = race_2017.rename(columns={"Race_count": "Race_count2017"})
merge3 = pd.merge(race_2014, race_2017, on=["Sub-borough Name"])
race_bar = merge3.set_index('Sub-borough Name')
ax = race_bar.plot.barh(fontsize=50, figsize=(30, 50))
The Bronx, Queens, and Manhattan all saw a change in the majority race of some neighbourhood populations. Parkchester and Pelham Parkway in the Bronx were majority Black or African American in 2014, but by 2017 the majority of householders were American Indian or Chinese, suggesting cultural displacement.
# Loading AirBnb 2014 and 2017 data
airbnb2014 = pd.read_csv("./AirBnb2014.csv")
print('We have', airbnb2014.room_id.nunique(), 'unique listings in New York in 2014.')
airbnb2017 = pd.read_csv("./AirBnb2017.csv")
print('We have', airbnb2017.room_id.nunique(), 'unique listings in New York in 2017.')
# Mean price per listing location (grouped by borough, neighborhood, and coordinates) in 2014 and 2017
subset_2014 = airbnb2014.groupby(['borough','neighborhood','latitude','longitude','reviews']).agg({'price': ['mean']})
subset_2014.columns = ['mean_price']
subset_2014 = subset_2014.reset_index()
subset_2014['year'] = '2014'
subset_2017 = airbnb2017.groupby(['borough','neighborhood','latitude','longitude','reviews']).agg({'price': ['mean']})
subset_2017.columns = ['mean_price']
subset_2017 = subset_2017.reset_index()
subset_2017['year'] = '2017'
# Creating a final listings dataset with longitude and latitude
locations = subset_2014.groupby("neighborhood").first()
# selecting only the three columns I am interested in
locations = locations.loc[:, ["borough",
"latitude",
"longitude"]]
borough_2014 = subset_2014.groupby("neighborhood").count()
borough_2014 = borough_2014.iloc[:,[0]]
borough_2014.columns= ["Listings2014"]
borough_2017 = subset_2017.groupby("neighborhood").count()
borough_2017 = borough_2017.iloc[:,[0]]
borough_2017.columns= ["Listings2017"]
listing_counts = borough_2014.join(locations).join(borough_2017)
listing_counts.head()
folium_map = folium.Map(location=[40.738, -73.98],
zoom_start=10,
tiles="CartoDB dark_matter",
width='100%')
marker = folium.CircleMarker(location=[40.738, -73.98])
marker.add_to(folium_map)
for index, row in listing_counts.iterrows():
    net_listings = row["Listings2017"] - row["Listings2014"]
    radius = net_listings / 20
    popup_text = """{}<br>
                 2014 Listings: {}<br>
                 2017 Listings: {}<br>
                 net listings: {}"""
    popup_text = popup_text.format(row["borough"],
                                   row["Listings2014"],
                                   row["Listings2017"],
                                   net_listings)
    if net_listings > 0:
        color = "#E37222"  # tangerine: listings grew
    else:
        color = "#0A8A9F"  # teal: listings shrank or held steady
    folium.CircleMarker(location=(row["latitude"], row["longitude"]),
                        radius=radius,
                        color=color,
                        popup=popup_text,
                        fill=True).add_to(folium_map)
folium_map