In less than 10 years since its official launch, airbnb has now found its place in many of our lives and continues to prove itself as an affordable and convenient alternative to conventional hotel services. This notebook cleans and prepares the data before importing into a Tableau workbook that will visualize airbnb's presence in the City of Vancouver based on publicly available data from airbnb-owned website, Inside-airbnb.
The Tableau visualization is available in here.
The data is downloaded from Inside-airbnb on April 29, 2019 and reflects the detailed listing data, calendar data and listing reviews for 4,929 uniquely identified homes in Vancouver, BC, that were active on March 8, 2019.
The data is dowloaded from Inside Airbnb website where information is available under a Creative Commons CC0 1.0 Universal (CC0 1.0) "Public Domain Dedication" license.
The datasets are specific to the City of Vancouver and are provided in 5 separate CSV and 1 GeoJSON files:
File Name | Description |
---|---|
calendar.csv |
Detailed Calendar Data for listings in Vancouver |
listings.csv |
Summary information and metrics for listings in Vancouver |
listings-2.csv |
Detailed Listings data for Vancouver |
reviews.csv |
Summary Review data and Listing ID |
reviews-2.csv |
Detailed Review Data for listings in Vancouver |
neighbourhoods.csv |
Neighbourhood list for geo filter (ourced from city or open source GIS files) |
neighbourhoods.geojson |
GeoJSON file of neighbourhoods of the city |
# Importing libraries:
import pandas as pd
import numpy as np
# Importing raw datasets:
df_cal = pd.read_csv('Data/calendar.csv')
df_listings = pd.read_csv('Data/listings.csv')
df_listings_more = pd.read_csv('Data/listings-2.csv')
df_neighb = pd.read_csv('Data/neighbourhoods.csv')
df_reviews = pd.read_csv('Data/reviews.csv')
df_reviews_more = pd.read_csv('Data/reviews-2.csv')
df_listings
and df_listings_more
datasets:¶We start by looking at the two listing datasets: df_listings
and df_listings_more
print('Shape of df_listings:\t\t', df_listings.shape)
print('Shape of df_listings_more:\t', df_listings_more.shape)
Shape of df_listings: (4938, 16) Shape of df_listings_more: (4938, 106)
df_listings.head(3).T
0 | 1 | 2 | |
---|---|---|---|
id | 10080 | 11400 | 13188 |
name | D1 - Million Dollar View 2 BR | Central Lovely Rm in Victorian Home | Garden level studio in ideal loc. |
host_id | 30899 | 42305 | 51466 |
host_name | Rami | Jaynne | Family Guns |
neighbourhood_group | NaN | NaN | NaN |
neighbourhood | Downtown | Riley Park | Riley Park |
latitude | 49.2877 | 49.2538 | 49.2458 |
longitude | -123.121 | -123.113 | -123.105 |
room_type | Entire home/apt | Private room | Entire home/apt |
price | 296 | 60 | 120 |
minimum_nights | 60 | 364 | 2 |
number_of_reviews | 16 | 70 | 171 |
last_review | 2017-02-26 | 2016-06-14 | 2019-02-24 |
reviews_per_month | 0.18 | 0.64 | 1.55 |
calculated_host_listings_count | 33 | 1 | 1 |
availability_365 | 191 | 365 | 214 |
We drop the columns that we do not need in the analysis (not required or repeated on other datasets):
df_listings.drop(['neighbourhood_group', 'last_review', 'reviews_per_month', 'availability_365', 'neighbourhood'],
axis = 1, inplace = True)
#Checking for NaNs:
df_listings.isna().sum()
id 0 name 1 host_id 0 host_name 0 latitude 0 longitude 0 room_type 0 price 0 minimum_nights 0 number_of_reviews 0 calculated_host_listings_count 0 dtype: int64
We'll deal with that one missing value later.
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
df_listings_more.head(2).T
0 | 1 | |
---|---|---|
id | 10080 | 11400 |
listing_url | https://www.airbnb.com/rooms/10080 | https://www.airbnb.com/rooms/11400 |
scrape_id | 20190308052700 | 20190308052700 |
last_scraped | 2019-03-08 | 2019-03-08 |
name | D1 - Million Dollar View 2 BR | Central Lovely Rm in Victorian Home |
summary | Stunning two bedroom, two bathroom apartment. ... | Well-appointed room with a view of the garden ... |
space | Bed setup: 2 x queen, I can add up to 2 twin s... | Centrally-located lovely, quiet home on tree-l... |
description | Stunning two bedroom, two bathroom apartment. ... | Well-appointed room with a view of the garden ... |
experiences_offered | none | none |
neighborhood_overview | NaN | Very quiet residential area, yet only 1-1/2 bl... |
notes | 1. CHECK-IN TIME IS AFTER 3PM PST AND CHECK-OU... | NaN |
transit | NaN | Bus stop 1-1/2 blocks away. 15 minutes to down... |
access | There is no access to the building ammenities. | Access to my home and garden. Shared bathroom.... |
interaction | NaN | I live here so will interact with the guest wh... |
house_rules | 1. CHECK-IN TIME IS AFTER 3 PM PST AND CHECK-O... | -No smoking indoors. -Recycling, composting ... |
thumbnail_url | NaN | NaN |
medium_url | NaN | NaN |
picture_url | https://a0.muscache.com/im/pictures/55778229/c... | https://a0.muscache.com/im/pictures/74189163/a... |
xl_picture_url | NaN | NaN |
host_id | 30899 | 42305 |
host_url | https://www.airbnb.com/users/show/30899 | https://www.airbnb.com/users/show/42305 |
host_name | Rami | Jaynne |
host_since | 2009-08-10 | 2009-09-30 |
host_location | Vancouver, British Columbia, Canada | Vancouver, British Columbia, Canada |
host_about | I will be happy to host you. | I am a self-employed graphic designer, AutoCAD... |
host_response_time | within a few hours | NaN |
host_response_rate | 100% | NaN |
host_acceptance_rate | NaN | NaN |
host_is_superhost | f | f |
host_thumbnail_url | https://a0.muscache.com/im/pictures/user/55302... | https://a0.muscache.com/im/users/42305/profile... |
host_picture_url | https://a0.muscache.com/im/pictures/user/55302... | https://a0.muscache.com/im/users/42305/profile... |
host_neighbourhood | Coal Harbour | Riley Park |
host_listings_count | 33 | 1 |
host_total_listings_count | 33 | 1 |
host_verifications | ['email', 'phone', 'reviews', 'jumio', 'govern... | ['email', 'phone', 'reviews', 'jumio', 'govern... |
host_has_profile_pic | t | t |
host_identity_verified | t | t |
street | Vancouver, BC, Canada | Vancouver, BC, Canada |
neighbourhood | Coal Harbour | Riley Park |
neighbourhood_cleansed | Downtown | Riley Park |
neighbourhood_group_cleansed | NaN | NaN |
city | Vancouver | Vancouver |
state | BC | BC |
zipcode | V6E 2P4 | V5Y |
market | Vancouver | Vancouver |
smart_location | Vancouver, Canada | Vancouver, Canada |
country_code | CA | CA |
country | Canada | Canada |
latitude | 49.2877 | 49.2538 |
longitude | -123.121 | -123.113 |
is_location_exact | t | t |
property_type | Condominium | House |
room_type | Entire home/apt | Private room |
accommodates | 6 | 1 |
bathrooms | 2 | 1 |
bedrooms | 2 | 1 |
beds | 4 | 1 |
bed_type | Real Bed | Real Bed |
amenities | {TV,"Cable TV",Internet,Wifi,"Air conditioning... | {TV,"Cable TV",Internet,Wifi,"Free parking on ... |
square_feet | 1200 | 0 |
price | $296.00 | $60.00 |
weekly_price | NaN | $385.00 |
monthly_price | $4,750.00 | $1,400.00 |
security_deposit | $0.00 | NaN |
cleaning_fee | $195.00 | NaN |
guests_included | 1 | 1 |
extra_people | $0.00 | $0.00 |
minimum_nights | 60 | 364 |
maximum_nights | 1124 | 364 |
minimum_minimum_nights | 60 | 364 |
maximum_minimum_nights | 60 | 364 |
minimum_maximum_nights | 1124 | 364 |
maximum_maximum_nights | 1124 | 364 |
minimum_nights_avg_ntm | 60 | 364 |
maximum_nights_avg_ntm | 1124 | 364 |
calendar_updated | today | 11 months ago |
has_availability | t | t |
availability_30 | 0 | 30 |
availability_60 | 0 | 60 |
availability_90 | 0 | 90 |
availability_365 | 191 | 365 |
calendar_last_scraped | 2019-03-08 | 2019-03-08 |
number_of_reviews | 16 | 70 |
number_of_reviews_ltm | 0 | 0 |
first_review | 2011-11-15 | 2010-02-18 |
last_review | 2017-02-26 | 2016-06-14 |
review_scores_rating | 93 | 92 |
review_scores_accuracy | 9 | 9 |
review_scores_cleanliness | 9 | 10 |
review_scores_checkin | 10 | 10 |
review_scores_communication | 9 | 10 |
review_scores_location | 10 | 10 |
review_scores_value | 9 | 9 |
requires_license | t | t |
license | 18-476608 | NaN |
jurisdiction_names | {"British Columbia"," Canada"," Vancouver"," B... | {"British Columbia"," Canada"," Vancouver"," B... |
instant_bookable | f | f |
is_business_travel_ready | f | f |
cancellation_policy | strict_14_with_grace_period | strict_14_with_grace_period |
require_guest_profile_picture | f | t |
require_guest_phone_verification | f | t |
calculated_host_listings_count | 33 | 1 |
calculated_host_listings_count_entire_homes | 33 | 0 |
calculated_host_listings_count_private_rooms | 0 | 1 |
calculated_host_listings_count_shared_rooms | 0 | 0 |
reviews_per_month | 0.18 | 0.64 |
Not all columns will be required for visualization so let's do some clean-up:
columns_to_keep = ['id', 'summary', 'space', 'description', 'host_listings_count', 'host_total_listings_count',
'neighbourhood_cleansed', 'is_location_exact', 'property_type', 'accommodates', 'bedrooms',
'bathrooms', 'beds', 'amenities', 'availability_30', 'availability_60', 'availability_90',
'availability_365']
df_listings_more = df_listings_more.loc[:,columns_to_keep]
df_listings_more.dtypes
id int64 summary object space object description object host_listings_count int64 host_total_listings_count int64 neighbourhood_cleansed object is_location_exact object property_type object accommodates int64 bedrooms float64 bathrooms float64 beds float64 amenities object availability_30 int64 availability_60 int64 availability_90 int64 availability_365 int64 dtype: object
df_listings_more.isna().sum()
id 0 summary 167 space 1057 description 81 host_listings_count 0 host_total_listings_count 0 neighbourhood_cleansed 0 is_location_exact 0 property_type 0 accommodates 0 bedrooms 3 bathrooms 0 beds 5 amenities 0 availability_30 0 availability_60 0 availability_90 0 availability_365 0 dtype: int64
In case of missing data in summary
, space
and description
columns we will replace NaN's with empty strings:
df_listings_more['summary'] = df_listings_more['summary'].fillna('')
df_listings_more['space'] = df_listings_more['space'].fillna('')
df_listings_more['description'] = df_listings_more['description'].fillna('')
df_listings_more.isna().sum()
id 0 summary 0 space 0 description 0 host_listings_count 0 host_total_listings_count 0 neighbourhood_cleansed 0 is_location_exact 0 property_type 0 accommodates 0 bedrooms 3 bathrooms 0 beds 5 amenities 0 availability_30 0 availability_60 0 availability_90 0 availability_365 0 dtype: int64
We will deal with the remaining NaN's in the following sections.
We now merge the two datasets, df_listings
and df_listings_more
into one single dataset and call it df_listings_clean
:
df_listings_clean = df_listings.merge(df_listings_more, on = 'id')
df_listings_clean.isna().sum()
id 0 name 1 host_id 0 host_name 0 latitude 0 longitude 0 room_type 0 price 0 minimum_nights 0 number_of_reviews 0 calculated_host_listings_count 0 summary 0 space 0 description 0 host_listings_count 0 host_total_listings_count 0 neighbourhood_cleansed 0 is_location_exact 0 property_type 0 accommodates 0 bedrooms 3 bathrooms 0 beds 5 amenities 0 availability_30 0 availability_60 0 availability_90 0 availability_365 0 dtype: int64
We can now drop the 9 remaining rows with missing values in name
, bedrooms
and beds
columns.
df_listings_clean.dropna(axis = 0, inplace = True)
df_listings_clean.isna().sum().any()
False
df_listings_clean.dtypes
id int64 name object host_id int64 host_name object latitude float64 longitude float64 room_type object price int64 minimum_nights int64 number_of_reviews int64 calculated_host_listings_count int64 summary object space object description object host_listings_count int64 host_total_listings_count int64 neighbourhood_cleansed object is_location_exact object property_type object accommodates int64 bedrooms float64 bathrooms float64 beds float64 amenities object availability_30 int64 availability_60 int64 availability_90 int64 availability_365 int64 dtype: object
As an optional help for visulaization down the road let's change data type in some columns to appropriate integer type:
df_listings_clean['bedrooms'] = df_listings_clean['bedrooms'].astype('int64')
df_listings_clean['bathrooms'] = df_listings_clean['bathrooms'].astype('int64')
df_listings_clean['beds'] = df_listings_clean['beds'].astype('int64')
Although Tableau will be used down the road to visualize different measures in different neighbourhoods we can also have a quick look at different features across neighbourhoods in here to get an understanding that will help us choose the best visualizations in Tableau:
# Creating an empty dataframe to store different measures for different neighbourhoods:
df_neighbourhood = pd.DataFrame(index = df_listings_clean['neighbourhood_cleansed'].unique())
# Average nighly price in different neighbourhoods:
df_neighbourhood['Avg_price'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['price'].mean().round()
# Total number of listings in different neighbourhoods:
df_neighbourhood['Listing_count'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['id'].count()
# Average 30-Day availability in different neighbourhoods:
df_neighbourhood['Avg_30_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_30']\
.mean().round().astype(int)
# Average 60-Day availability in different neighbourhoods:
df_neighbourhood['Avg_60_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_60']\
.mean().round().astype(int)
# Average 90-Day availability in different neighbourhoods:
df_neighbourhood['Avg_90_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_90']\
.mean().round().astype(int)
# Average 365-Day availability in different neighbourhoods:
df_neighbourhood['Avg_365_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_365']\
.mean().round().astype(int)
df_neighbourhood
Avg_price | Listing_count | Avg_30_availability | Avg_60_availability | Avg_90_availability | Avg_365_availability | |
---|---|---|---|---|---|---|
Downtown | 196.0 | 1064 | 11 | 25 | 41 | 147 |
Riley Park | 145.0 | 275 | 10 | 21 | 32 | 117 |
Downtown Eastside | 176.0 | 271 | 12 | 25 | 38 | 129 |
West End | 168.0 | 362 | 10 | 22 | 36 | 117 |
Kensington-Cedar Cottage | 134.0 | 319 | 10 | 22 | 35 | 136 |
Hastings-Sunrise | 120.0 | 247 | 12 | 26 | 42 | 138 |
Renfrew-Collingwood | 102.0 | 200 | 12 | 27 | 43 | 153 |
Mount Pleasant | 150.0 | 294 | 10 | 22 | 35 | 126 |
Grandview-Woodland | 135.0 | 247 | 8 | 17 | 28 | 116 |
Fairview | 144.0 | 151 | 7 | 18 | 31 | 132 |
Kitsilano | 190.0 | 424 | 10 | 22 | 34 | 133 |
Kerrisdale | 173.0 | 69 | 15 | 31 | 47 | 145 |
Arbutus Ridge | 198.0 | 84 | 13 | 29 | 47 | 156 |
Sunset | 172.0 | 106 | 13 | 30 | 48 | 142 |
Dunbar Southlands | 190.0 | 138 | 13 | 30 | 48 | 170 |
Marpole | 129.0 | 166 | 17 | 38 | 60 | 167 |
Killarney | 91.0 | 67 | 15 | 34 | 53 | 140 |
Shaughnessy | 216.0 | 87 | 12 | 26 | 39 | 107 |
South Cambie | 185.0 | 59 | 10 | 22 | 35 | 125 |
Strathcona | 128.0 | 22 | 8 | 20 | 32 | 129 |
West Point Grey | 216.0 | 88 | 13 | 27 | 40 | 133 |
Oakridge | 141.0 | 88 | 15 | 34 | 52 | 136 |
Victoria-Fraserview | 90.0 | 101 | 19 | 41 | 65 | 228 |
With the aim of making word clouds of most common words in property summary (description) and reviews we go through all reviews and listing summaries in each neghbourhood and will feed Tableau with tables that list the top 50 words used in reviews and listing summaries, in each neighbourhood:
# Funtion to return the top 50 words:
from operator import itemgetter
def top_word_counter(df, filter_col, filter_match, lookup_col):
words = list(entry.lower().strip() for entry in df[df[filter_col] == filter_match][lookup_col].astype(str))
wordcount = {}
# Words that will not be counted:
stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
"yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself",
"it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which",
"who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be",
"been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an",
"the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for",
"with", "about", "against", "between", "into", "through", "during", "before", "after", "above",
"below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further",
"then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
"few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so",
"than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", ""]
symbols_to_replace = [".", ",", ":", ";", "\"", "/", "!", "?", "“", "‘", "*", ")", "(", "|",
">", "<", "{", "}", "&", "^", "~", "+"]
seperator = ' '
text = seperator.join(words)
for word in text.split():
for symbol in symbols_to_replace:
word = word.replace(symbol,"")
if word not in stopwords:
if word not in wordcount:
wordcount[word] = 1
else:
wordcount[word] += 1
top_words = sorted(wordcount.items(), key=itemgetter(1), reverse = True)[0:50]
return [(filter_match, i[0], i[1]) for i in top_words]
# Finding the top 50 words in summaries for each neighbouthood:
summary_words = []
for neighb in df_listings_clean['neighbourhood_cleansed'].unique():
summary_words += top_word_counter(df_listings_clean, 'neighbourhood_cleansed', neighb, 'summary')
summary_words = pd.DataFrame(summary_words)
summary_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']
# Cleaning the reviews dataframe before checking for most used words:
reviews_clean = df_reviews_more.drop(['id', 'date', 'reviewer_id', 'reviewer_name'] , axis = 1)\
.rename(columns = {'listing_id': 'id', 'comments': 'review'})\
.merge(df_listings_more.loc[:, ['id', 'neighbourhood_cleansed']], on = 'id')
# Finding the top 50 words in reviews for each neighbouthood:
review_words = []
for neighb in reviews_clean['neighbourhood_cleansed'].unique():
review_words += top_word_counter(reviews_clean, 'neighbourhood_cleansed', neighb, 'review')
review_words = pd.DataFrame(review_words)
review_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']
# Exporting new, clean dataframes as csv files that will feed into Tableau:
df_listings_clean.to_csv('Data/clean_data.csv')
review_words.to_csv('Data/review_words.csv')
summary_words.to_csv('Data/summary_words.csv')
Data visualization to continue on Tableau.