Airbnb Homes in Vancouver - Data Cleaning

Arash Tavassoli - April 2019


Introduction

In less than 10 years since its official launch, airbnb has now found its place in many of our lives and continues to prove itself as an affordable and convenient alternative to conventional hotel services. This notebook cleans and prepares the data before importing into a Tableau workbook that will visualize airbnb's presence in the City of Vancouver based on publicly available data from airbnb-owned website, Inside-airbnb.

The Tableau visualization is available in here.

Data Source

The data is downloaded from Inside-airbnb on April 29, 2019 and reflects the detailed listing data, calendar data and listing reviews for 4,929 uniquely identified homes in Vancouver, BC, that were active on March 8, 2019.

The data is dowloaded from Inside Airbnb website where information is available under a Creative Commons CC0 1.0 Universal (CC0 1.0) "Public Domain Dedication" license.

The datasets are specific to the City of Vancouver and are provided in 5 separate CSV and 1 GeoJSON files:

File Name Description
calendar.csv Detailed Calendar Data for listings in Vancouver
listings.csv Summary information and metrics for listings in Vancouver
listings-2.csv Detailed Listings data for Vancouver
reviews.csv Summary Review data and Listing ID
reviews-2.csv Detailed Review Data for listings in Vancouver
neighbourhoods.csv Neighbourhood list for geo filter (ourced from city or open source GIS files)
neighbourhoods.geojson GeoJSON file of neighbourhoods of the city
In [1]:
# Importing libraries:
import pandas as pd
import numpy as np
In [2]:
# Importing raw datasets:
df_cal = pd.read_csv('Data/calendar.csv')
df_listings = pd.read_csv('Data/listings.csv')
df_listings_more = pd.read_csv('Data/listings-2.csv')
df_neighb = pd.read_csv('Data/neighbourhoods.csv')
df_reviews = pd.read_csv('Data/reviews.csv')
df_reviews_more = pd.read_csv('Data/reviews-2.csv')

Cleaning the df_listings and df_listings_more datasets:

We start by looking at the two listing datasets: df_listings and df_listings_more

In [3]:
print('Shape of df_listings:\t\t', df_listings.shape)
print('Shape of df_listings_more:\t', df_listings_more.shape)
Shape of df_listings:		 (4938, 16)
Shape of df_listings_more:	 (4938, 106)
In [4]:
df_listings.head(3).T
Out[4]:
0 1 2
id 10080 11400 13188
name D1 - Million Dollar View 2 BR Central Lovely Rm in Victorian Home Garden level studio in ideal loc.
host_id 30899 42305 51466
host_name Rami Jaynne Family Guns
neighbourhood_group NaN NaN NaN
neighbourhood Downtown Riley Park Riley Park
latitude 49.2877 49.2538 49.2458
longitude -123.121 -123.113 -123.105
room_type Entire home/apt Private room Entire home/apt
price 296 60 120
minimum_nights 60 364 2
number_of_reviews 16 70 171
last_review 2017-02-26 2016-06-14 2019-02-24
reviews_per_month 0.18 0.64 1.55
calculated_host_listings_count 33 1 1
availability_365 191 365 214

We drop the columns that we do not need in the analysis (not required or repeated on other datasets):

In [5]:
df_listings.drop(['neighbourhood_group', 'last_review', 'reviews_per_month', 'availability_365', 'neighbourhood'], 
                 axis = 1, inplace = True)
In [6]:
#Checking for NaNs:
df_listings.isna().sum()
Out[6]:
id                                0
name                              1
host_id                           0
host_name                         0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
calculated_host_listings_count    0
dtype: int64

We'll deal with that one missing value later.

In [7]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

df_listings_more.head(2).T
Out[7]:
0 1
id 10080 11400
listing_url https://www.airbnb.com/rooms/10080 https://www.airbnb.com/rooms/11400
scrape_id 20190308052700 20190308052700
last_scraped 2019-03-08 2019-03-08
name D1 - Million Dollar View 2 BR Central Lovely Rm in Victorian Home
summary Stunning two bedroom, two bathroom apartment. ... Well-appointed room with a view of the garden ...
space Bed setup: 2 x queen, I can add up to 2 twin s... Centrally-located lovely, quiet home on tree-l...
description Stunning two bedroom, two bathroom apartment. ... Well-appointed room with a view of the garden ...
experiences_offered none none
neighborhood_overview NaN Very quiet residential area, yet only 1-1/2 bl...
notes 1. CHECK-IN TIME IS AFTER 3PM PST AND CHECK-OU... NaN
transit NaN Bus stop 1-1/2 blocks away. 15 minutes to down...
access There is no access to the building ammenities. Access to my home and garden. Shared bathroom....
interaction NaN I live here so will interact with the guest wh...
house_rules 1. CHECK-IN TIME IS AFTER 3 PM PST AND CHECK-O... -No smoking indoors. -Recycling, composting ...
thumbnail_url NaN NaN
medium_url NaN NaN
picture_url https://a0.muscache.com/im/pictures/55778229/c... https://a0.muscache.com/im/pictures/74189163/a...
xl_picture_url NaN NaN
host_id 30899 42305
host_url https://www.airbnb.com/users/show/30899 https://www.airbnb.com/users/show/42305
host_name Rami Jaynne
host_since 2009-08-10 2009-09-30
host_location Vancouver, British Columbia, Canada Vancouver, British Columbia, Canada
host_about I will be happy to host you. I am a self-employed graphic designer, AutoCAD...
host_response_time within a few hours NaN
host_response_rate 100% NaN
host_acceptance_rate NaN NaN
host_is_superhost f f
host_thumbnail_url https://a0.muscache.com/im/pictures/user/55302... https://a0.muscache.com/im/users/42305/profile...
host_picture_url https://a0.muscache.com/im/pictures/user/55302... https://a0.muscache.com/im/users/42305/profile...
host_neighbourhood Coal Harbour Riley Park
host_listings_count 33 1
host_total_listings_count 33 1
host_verifications ['email', 'phone', 'reviews', 'jumio', 'govern... ['email', 'phone', 'reviews', 'jumio', 'govern...
host_has_profile_pic t t
host_identity_verified t t
street Vancouver, BC, Canada Vancouver, BC, Canada
neighbourhood Coal Harbour Riley Park
neighbourhood_cleansed Downtown Riley Park
neighbourhood_group_cleansed NaN NaN
city Vancouver Vancouver
state BC BC
zipcode V6E 2P4 V5Y
market Vancouver Vancouver
smart_location Vancouver, Canada Vancouver, Canada
country_code CA CA
country Canada Canada
latitude 49.2877 49.2538
longitude -123.121 -123.113
is_location_exact t t
property_type Condominium House
room_type Entire home/apt Private room
accommodates 6 1
bathrooms 2 1
bedrooms 2 1
beds 4 1
bed_type Real Bed Real Bed
amenities {TV,"Cable TV",Internet,Wifi,"Air conditioning... {TV,"Cable TV",Internet,Wifi,"Free parking on ...
square_feet 1200 0
price $296.00 $60.00
weekly_price NaN $385.00
monthly_price $4,750.00 $1,400.00
security_deposit $0.00 NaN
cleaning_fee $195.00 NaN
guests_included 1 1
extra_people $0.00 $0.00
minimum_nights 60 364
maximum_nights 1124 364
minimum_minimum_nights 60 364
maximum_minimum_nights 60 364
minimum_maximum_nights 1124 364
maximum_maximum_nights 1124 364
minimum_nights_avg_ntm 60 364
maximum_nights_avg_ntm 1124 364
calendar_updated today 11 months ago
has_availability t t
availability_30 0 30
availability_60 0 60
availability_90 0 90
availability_365 191 365
calendar_last_scraped 2019-03-08 2019-03-08
number_of_reviews 16 70
number_of_reviews_ltm 0 0
first_review 2011-11-15 2010-02-18
last_review 2017-02-26 2016-06-14
review_scores_rating 93 92
review_scores_accuracy 9 9
review_scores_cleanliness 9 10
review_scores_checkin 10 10
review_scores_communication 9 10
review_scores_location 10 10
review_scores_value 9 9
requires_license t t
license 18-476608 NaN
jurisdiction_names {"British Columbia"," Canada"," Vancouver"," B... {"British Columbia"," Canada"," Vancouver"," B...
instant_bookable f f
is_business_travel_ready f f
cancellation_policy strict_14_with_grace_period strict_14_with_grace_period
require_guest_profile_picture f t
require_guest_phone_verification f t
calculated_host_listings_count 33 1
calculated_host_listings_count_entire_homes 33 0
calculated_host_listings_count_private_rooms 0 1
calculated_host_listings_count_shared_rooms 0 0
reviews_per_month 0.18 0.64

Not all columns will be required for visualization so let's do some clean-up:

In [8]:
columns_to_keep = ['id', 'summary', 'space', 'description', 'host_listings_count', 'host_total_listings_count',
                  'neighbourhood_cleansed', 'is_location_exact', 'property_type', 'accommodates', 'bedrooms',
                  'bathrooms', 'beds', 'amenities', 'availability_30', 'availability_60', 'availability_90',
                  'availability_365']

df_listings_more = df_listings_more.loc[:,columns_to_keep]
In [9]:
df_listings_more.dtypes
Out[9]:
id                             int64
summary                       object
space                         object
description                   object
host_listings_count            int64
host_total_listings_count      int64
neighbourhood_cleansed        object
is_location_exact             object
property_type                 object
accommodates                   int64
bedrooms                     float64
bathrooms                    float64
beds                         float64
amenities                     object
availability_30                int64
availability_60                int64
availability_90                int64
availability_365               int64
dtype: object
In [10]:
df_listings_more.isna().sum()
Out[10]:
id                              0
summary                       167
space                        1057
description                    81
host_listings_count             0
host_total_listings_count       0
neighbourhood_cleansed          0
is_location_exact               0
property_type                   0
accommodates                    0
bedrooms                        3
bathrooms                       0
beds                            5
amenities                       0
availability_30                 0
availability_60                 0
availability_90                 0
availability_365                0
dtype: int64

In case of missing data in summary, space and description columns we will replace NaN's with empty strings:

In [11]:
df_listings_more['summary'] = df_listings_more['summary'].fillna('')
df_listings_more['space'] = df_listings_more['space'].fillna('')
df_listings_more['description'] = df_listings_more['description'].fillna('')
In [12]:
df_listings_more.isna().sum()
Out[12]:
id                           0
summary                      0
space                        0
description                  0
host_listings_count          0
host_total_listings_count    0
neighbourhood_cleansed       0
is_location_exact            0
property_type                0
accommodates                 0
bedrooms                     3
bathrooms                    0
beds                         5
amenities                    0
availability_30              0
availability_60              0
availability_90              0
availability_365             0
dtype: int64

We will deal with the remaining NaN's in the following sections.

We now merge the two datasets, df_listings and df_listings_more into one single dataset and call it df_listings_clean:

In [13]:
df_listings_clean = df_listings.merge(df_listings_more, on = 'id')
In [14]:
df_listings_clean.isna().sum()
Out[14]:
id                                0
name                              1
host_id                           0
host_name                         0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
calculated_host_listings_count    0
summary                           0
space                             0
description                       0
host_listings_count               0
host_total_listings_count         0
neighbourhood_cleansed            0
is_location_exact                 0
property_type                     0
accommodates                      0
bedrooms                          3
bathrooms                         0
beds                              5
amenities                         0
availability_30                   0
availability_60                   0
availability_90                   0
availability_365                  0
dtype: int64

We can now drop the 9 remaining rows with missing values in name, bedrooms and beds columns.

In [15]:
df_listings_clean.dropna(axis = 0, inplace = True)
In [16]:
df_listings_clean.isna().sum().any()
Out[16]:
False
In [17]:
df_listings_clean.dtypes
Out[17]:
id                                  int64
name                               object
host_id                             int64
host_name                          object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
calculated_host_listings_count      int64
summary                            object
space                              object
description                        object
host_listings_count                 int64
host_total_listings_count           int64
neighbourhood_cleansed             object
is_location_exact                  object
property_type                      object
accommodates                        int64
bedrooms                          float64
bathrooms                         float64
beds                              float64
amenities                          object
availability_30                     int64
availability_60                     int64
availability_90                     int64
availability_365                    int64
dtype: object

As an optional help for visulaization down the road let's change data type in some columns to appropriate integer type:

In [18]:
df_listings_clean['bedrooms'] = df_listings_clean['bedrooms'].astype('int64')
df_listings_clean['bathrooms'] = df_listings_clean['bathrooms'].astype('int64')
df_listings_clean['beds'] = df_listings_clean['beds'].astype('int64')

EDA on different neighbourhoods:

Although Tableau will be used down the road to visualize different measures in different neighbourhoods we can also have a quick look at different features across neighbourhoods in here to get an understanding that will help us choose the best visualizations in Tableau:

In [19]:
# Creating an empty dataframe to store different measures for different neighbourhoods:
df_neighbourhood = pd.DataFrame(index = df_listings_clean['neighbourhood_cleansed'].unique())

# Average nighly price in different neighbourhoods:
df_neighbourhood['Avg_price'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['price'].mean().round()

# Total number of listings in different neighbourhoods:
df_neighbourhood['Listing_count'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['id'].count()

# Average 30-Day availability in different neighbourhoods:
df_neighbourhood['Avg_30_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_30']\
                                            .mean().round().astype(int)

# Average 60-Day availability in different neighbourhoods:
df_neighbourhood['Avg_60_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_60']\
                                            .mean().round().astype(int)

# Average 90-Day availability in different neighbourhoods:
df_neighbourhood['Avg_90_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_90']\
                                            .mean().round().astype(int)

# Average 365-Day availability in different neighbourhoods:
df_neighbourhood['Avg_365_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_365']\
                                            .mean().round().astype(int)
In [20]:
df_neighbourhood
Out[20]:
Avg_price Listing_count Avg_30_availability Avg_60_availability Avg_90_availability Avg_365_availability
Downtown 196.0 1064 11 25 41 147
Riley Park 145.0 275 10 21 32 117
Downtown Eastside 176.0 271 12 25 38 129
West End 168.0 362 10 22 36 117
Kensington-Cedar Cottage 134.0 319 10 22 35 136
Hastings-Sunrise 120.0 247 12 26 42 138
Renfrew-Collingwood 102.0 200 12 27 43 153
Mount Pleasant 150.0 294 10 22 35 126
Grandview-Woodland 135.0 247 8 17 28 116
Fairview 144.0 151 7 18 31 132
Kitsilano 190.0 424 10 22 34 133
Kerrisdale 173.0 69 15 31 47 145
Arbutus Ridge 198.0 84 13 29 47 156
Sunset 172.0 106 13 30 48 142
Dunbar Southlands 190.0 138 13 30 48 170
Marpole 129.0 166 17 38 60 167
Killarney 91.0 67 15 34 53 140
Shaughnessy 216.0 87 12 26 39 107
South Cambie 185.0 59 10 22 35 125
Strathcona 128.0 22 8 20 32 129
West Point Grey 216.0 88 13 27 40 133
Oakridge 141.0 88 15 34 52 136
Victoria-Fraserview 90.0 101 19 41 65 228

Finding the most common words in Reviews and listing Summaries:

With the aim of making word clouds of most common words in property summary (description) and reviews we go through all reviews and listing summaries in each neghbourhood and will feed Tableau with tables that list the top 50 words used in reviews and listing summaries, in each neighbourhood:

In [21]:
# Funtion to return the top 50 words:

from operator import itemgetter

def top_word_counter(df, filter_col, filter_match, lookup_col):

    words = list(entry.lower().strip() for entry in df[df[filter_col] == filter_match][lookup_col].astype(str))
    wordcount = {}
    
    # Words that will not be counted:
    stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", 
                 "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", 
                 "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", 
                 "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", 
                 "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", 
                 "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", 
                 "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", 
                 "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", 
                 "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", 
                 "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", 
                 "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", ""]

    symbols_to_replace = [".", ",", ":", ";", "\"", "/", "!", "?", "“", "‘", "*", ")", "(", "|", 
                          ">", "<", "{", "}", "&", "^", "~", "+"]
    
    seperator = ' '
    text = seperator.join(words)

    for word in text.split():
        for symbol in symbols_to_replace:
            word = word.replace(symbol,"")

        if word not in stopwords:
            if word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1

    top_words = sorted(wordcount.items(), key=itemgetter(1), reverse = True)[0:50]
    
    return [(filter_match, i[0], i[1]) for i in top_words]
In [22]:
# Finding the top 50 words in summaries for each neighbouthood:
summary_words = []

for neighb in df_listings_clean['neighbourhood_cleansed'].unique():
    summary_words += top_word_counter(df_listings_clean, 'neighbourhood_cleansed', neighb, 'summary')

summary_words = pd.DataFrame(summary_words)
summary_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']
In [23]:
# Cleaning the reviews dataframe before checking for most used words:
reviews_clean = df_reviews_more.drop(['id', 'date', 'reviewer_id', 'reviewer_name'] , axis = 1)\
    .rename(columns = {'listing_id': 'id', 'comments': 'review'})\
        .merge(df_listings_more.loc[:, ['id', 'neighbourhood_cleansed']], on = 'id')

# Finding the top 50 words in reviews for each neighbouthood:
review_words = []

for neighb in reviews_clean['neighbourhood_cleansed'].unique():
    review_words += top_word_counter(reviews_clean, 'neighbourhood_cleansed', neighb, 'review')

review_words = pd.DataFrame(review_words)
review_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']
In [24]:
# Exporting new, clean dataframes as csv files that will feed into Tableau:

df_listings_clean.to_csv('Data/clean_data.csv')
review_words.to_csv('Data/review_words.csv')
summary_words.to_csv('Data/summary_words.csv')

Data visualization to continue on Tableau.