Airbnb Homes in Vancouver - Data Cleaning¶

Arash Tavassoli - April 2019¶

Introduction¶

In less than 10 years since its official launch, airbnb has now found its place in many of our lives and continues to prove itself as an affordable and convenient alternative to conventional hotel services. This notebook cleans and prepares the data before importing into a Tableau workbook that will visualize airbnb's presence in the City of Vancouver based on publicly available data from airbnb-owned website, Inside-airbnb.

The Tableau visualization is available in here.

Data Source¶

The data is downloaded from Inside-airbnb on April 29, 2019 and reflects the detailed listing data, calendar data and listing reviews for 4,929 uniquely identified homes in Vancouver, BC, that were active on March 8, 2019.

The data is dowloaded from Inside Airbnb website where information is available under a Creative Commons CC0 1.0 Universal (CC0 1.0) "Public Domain Dedication" license.

The datasets are specific to the City of Vancouver and are provided in 5 separate CSV and 1 GeoJSON files:

File Name	Description
`calendar.csv`	Detailed Calendar Data for listings in Vancouver
`listings.csv`	Summary information and metrics for listings in Vancouver
`listings-2.csv`	Detailed Listings data for Vancouver
`reviews.csv`	Summary Review data and Listing ID
`reviews-2.csv`	Detailed Review Data for listings in Vancouver
`neighbourhoods.csv`	Neighbourhood list for geo filter (ourced from city or open source GIS files)
`neighbourhoods.geojson`	GeoJSON file of neighbourhoods of the city

In [1]:

# Importing libraries:
import pandas as pd
import numpy as np

In [2]:

# Importing raw datasets:
df_cal = pd.read_csv('Data/calendar.csv')
df_listings = pd.read_csv('Data/listings.csv')
df_listings_more = pd.read_csv('Data/listings-2.csv')
df_neighb = pd.read_csv('Data/neighbourhoods.csv')
df_reviews = pd.read_csv('Data/reviews.csv')
df_reviews_more = pd.read_csv('Data/reviews-2.csv')

Cleaning the `df_listings` and `df_listings_more` datasets:¶

We start by looking at the two listing datasets: df_listings and df_listings_more

In [3]:

print('Shape of df_listings:\t\t', df_listings.shape)
print('Shape of df_listings_more:\t', df_listings_more.shape)

Shape of df_listings:		 (4938, 16)
Shape of df_listings_more:	 (4938, 106)

In [4]:

df_listings.head(3).T

Out[4]:

	0	1	2
id	10080	11400	13188
name	D1 - Million Dollar View 2 BR	Central Lovely Rm in Victorian Home	Garden level studio in ideal loc.
host_id	30899	42305	51466
host_name	Rami	Jaynne	Family Guns
neighbourhood_group	NaN	NaN	NaN
neighbourhood	Downtown	Riley Park	Riley Park
latitude	49.2877	49.2538	49.2458
longitude	-123.121	-123.113	-123.105
room_type	Entire home/apt	Private room	Entire home/apt
price	296	60	120
minimum_nights	60	364	2
number_of_reviews	16	70	171
last_review	2017-02-26	2016-06-14	2019-02-24
reviews_per_month	0.18	0.64	1.55
calculated_host_listings_count	33	1	1
availability_365	191	365	214

We drop the columns that we do not need in the analysis (not required or repeated on other datasets):

In [5]:

df_listings.drop(['neighbourhood_group', 'last_review', 'reviews_per_month', 'availability_365', 'neighbourhood'], 
                 axis = 1, inplace = True)

In [6]:

#Checking for NaNs:
df_listings.isna().sum()

Out[6]:

id                                0
name                              1
host_id                           0
host_name                         0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
calculated_host_listings_count    0
dtype: int64

We'll deal with that one missing value later.

In [7]:

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

df_listings_more.head(2).T

Out[7]:

	0	1
id	10080	11400
listing_url	https://www.airbnb.com/rooms/10080	https://www.airbnb.com/rooms/11400
scrape_id	20190308052700	20190308052700
last_scraped	2019-03-08	2019-03-08
name	D1 - Million Dollar View 2 BR	Central Lovely Rm in Victorian Home
summary	Stunning two bedroom, two bathroom apartment. ...	Well-appointed room with a view of the garden ...
space	Bed setup: 2 x queen, I can add up to 2 twin s...	Centrally-located lovely, quiet home on tree-l...
description	Stunning two bedroom, two bathroom apartment. ...	Well-appointed room with a view of the garden ...
experiences_offered	none	none
neighborhood_overview	NaN	Very quiet residential area, yet only 1-1/2 bl...
notes	1. CHECK-IN TIME IS AFTER 3PM PST AND CHECK-OU...	NaN
transit	NaN	Bus stop 1-1/2 blocks away. 15 minutes to down...
access	There is no access to the building ammenities.	Access to my home and garden. Shared bathroom....
interaction	NaN	I live here so will interact with the guest wh...
house_rules	1. CHECK-IN TIME IS AFTER 3 PM PST AND CHECK-O...	-No smoking indoors. -Recycling, composting ...
thumbnail_url	NaN	NaN
medium_url	NaN	NaN
picture_url	https://a0.muscache.com/im/pictures/55778229/c...	https://a0.muscache.com/im/pictures/74189163/a...
xl_picture_url	NaN	NaN
host_id	30899	42305
host_url	https://www.airbnb.com/users/show/30899	https://www.airbnb.com/users/show/42305
host_name	Rami	Jaynne
host_since	2009-08-10	2009-09-30
host_location	Vancouver, British Columbia, Canada	Vancouver, British Columbia, Canada
host_about	I will be happy to host you.	I am a self-employed graphic designer, AutoCAD...
host_response_time	within a few hours	NaN
host_response_rate	100%	NaN
host_acceptance_rate	NaN	NaN
host_is_superhost	f	f
host_thumbnail_url	https://a0.muscache.com/im/pictures/user/55302...	https://a0.muscache.com/im/users/42305/profile...
host_picture_url	https://a0.muscache.com/im/pictures/user/55302...	https://a0.muscache.com/im/users/42305/profile...
host_neighbourhood	Coal Harbour	Riley Park
host_listings_count	33	1
host_total_listings_count	33	1
host_verifications	['email', 'phone', 'reviews', 'jumio', 'govern...	['email', 'phone', 'reviews', 'jumio', 'govern...
host_has_profile_pic	t	t
host_identity_verified	t	t
street	Vancouver, BC, Canada	Vancouver, BC, Canada
neighbourhood	Coal Harbour	Riley Park
neighbourhood_cleansed	Downtown	Riley Park
neighbourhood_group_cleansed	NaN	NaN
city	Vancouver	Vancouver
state	BC	BC
zipcode	V6E 2P4	V5Y
market	Vancouver	Vancouver
smart_location	Vancouver, Canada	Vancouver, Canada
country_code	CA	CA
country	Canada	Canada
latitude	49.2877	49.2538
longitude	-123.121	-123.113
is_location_exact	t	t
property_type	Condominium	House
room_type	Entire home/apt	Private room
accommodates	6	1
bathrooms	2	1
bedrooms	2	1
beds	4	1
bed_type	Real Bed	Real Bed
amenities	{TV,"Cable TV",Internet,Wifi,"Air conditioning...	{TV,"Cable TV",Internet,Wifi,"Free parking on ...
square_feet	1200	0
price	$296.00	$60.00
weekly_price	NaN	$385.00
monthly_price	$4,750.00	$1,400.00
security_deposit	$0.00	NaN
cleaning_fee	$195.00	NaN
guests_included	1	1
extra_people	$0.00	$0.00
minimum_nights	60	364
maximum_nights	1124	364
minimum_minimum_nights	60	364
maximum_minimum_nights	60	364
minimum_maximum_nights	1124	364
maximum_maximum_nights	1124	364
minimum_nights_avg_ntm	60	364
maximum_nights_avg_ntm	1124	364
calendar_updated	today	11 months ago
has_availability	t	t
availability_30	0	30
availability_60	0	60
availability_90	0	90
availability_365	191	365
calendar_last_scraped	2019-03-08	2019-03-08
number_of_reviews	16	70
number_of_reviews_ltm	0	0
first_review	2011-11-15	2010-02-18
last_review	2017-02-26	2016-06-14
review_scores_rating	93	92
review_scores_accuracy	9	9
review_scores_cleanliness	9	10
review_scores_checkin	10	10
review_scores_communication	9	10
review_scores_location	10	10
review_scores_value	9	9
requires_license	t	t
license	18-476608	NaN
jurisdiction_names	{"British Columbia"," Canada"," Vancouver"," B...	{"British Columbia"," Canada"," Vancouver"," B...
instant_bookable	f	f
is_business_travel_ready	f	f
cancellation_policy	strict_14_with_grace_period	strict_14_with_grace_period
require_guest_profile_picture	f	t
require_guest_phone_verification	f	t
calculated_host_listings_count	33	1
calculated_host_listings_count_entire_homes	33	0
calculated_host_listings_count_private_rooms	0	1
calculated_host_listings_count_shared_rooms	0	0
reviews_per_month	0.18	0.64

Not all columns will be required for visualization so let's do some clean-up:

In [8]:

columns_to_keep = ['id', 'summary', 'space', 'description', 'host_listings_count', 'host_total_listings_count',
                  'neighbourhood_cleansed', 'is_location_exact', 'property_type', 'accommodates', 'bedrooms',
                  'bathrooms', 'beds', 'amenities', 'availability_30', 'availability_60', 'availability_90',
                  'availability_365']

df_listings_more = df_listings_more.loc[:,columns_to_keep]

In [9]:

df_listings_more.dtypes

Out[9]:

id                             int64
summary                       object
space                         object
description                   object
host_listings_count            int64
host_total_listings_count      int64
neighbourhood_cleansed        object
is_location_exact             object
property_type                 object
accommodates                   int64
bedrooms                     float64
bathrooms                    float64
beds                         float64
amenities                     object
availability_30                int64
availability_60                int64
availability_90                int64
availability_365               int64
dtype: object

In [10]:

df_listings_more.isna().sum()

Out[10]:

id                              0
summary                       167
space                        1057
description                    81
host_listings_count             0
host_total_listings_count       0
neighbourhood_cleansed          0
is_location_exact               0
property_type                   0
accommodates                    0
bedrooms                        3
bathrooms                       0
beds                            5
amenities                       0
availability_30                 0
availability_60                 0
availability_90                 0
availability_365                0
dtype: int64

In case of missing data in summary, space and description columns we will replace NaN's with empty strings:

In [11]:

df_listings_more['summary'] = df_listings_more['summary'].fillna('')
df_listings_more['space'] = df_listings_more['space'].fillna('')
df_listings_more['description'] = df_listings_more['description'].fillna('')

In [12]:

df_listings_more.isna().sum()

Out[12]:

id                           0
summary                      0
space                        0
description                  0
host_listings_count          0
host_total_listings_count    0
neighbourhood_cleansed       0
is_location_exact            0
property_type                0
accommodates                 0
bedrooms                     3
bathrooms                    0
beds                         5
amenities                    0
availability_30              0
availability_60              0
availability_90              0
availability_365             0
dtype: int64

We will deal with the remaining NaN's in the following sections.

We now merge the two datasets, df_listings and df_listings_more into one single dataset and call it df_listings_clean:

In [13]:

df_listings_clean = df_listings.merge(df_listings_more, on = 'id')

In [14]:

df_listings_clean.isna().sum()

Out[14]:

id                                0
name                              1
host_id                           0
host_name                         0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
calculated_host_listings_count    0
summary                           0
space                             0
description                       0
host_listings_count               0
host_total_listings_count         0
neighbourhood_cleansed            0
is_location_exact                 0
property_type                     0
accommodates                      0
bedrooms                          3
bathrooms                         0
beds                              5
amenities                         0
availability_30                   0
availability_60                   0
availability_90                   0
availability_365                  0
dtype: int64

We can now drop the 9 remaining rows with missing values in name, bedrooms and beds columns.

In [15]:

df_listings_clean.dropna(axis = 0, inplace = True)

In [16]:

df_listings_clean.isna().sum().any()

Out[16]:

False

In [17]:

df_listings_clean.dtypes

Out[17]:

id                                  int64
name                               object
host_id                             int64
host_name                          object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
calculated_host_listings_count      int64
summary                            object
space                              object
description                        object
host_listings_count                 int64
host_total_listings_count           int64
neighbourhood_cleansed             object
is_location_exact                  object
property_type                      object
accommodates                        int64
bedrooms                          float64
bathrooms                         float64
beds                              float64
amenities                          object
availability_30                     int64
availability_60                     int64
availability_90                     int64
availability_365                    int64
dtype: object

As an optional help for visulaization down the road let's change data type in some columns to appropriate integer type:

In [18]:

df_listings_clean['bedrooms'] = df_listings_clean['bedrooms'].astype('int64')
df_listings_clean['bathrooms'] = df_listings_clean['bathrooms'].astype('int64')
df_listings_clean['beds'] = df_listings_clean['beds'].astype('int64')

EDA on different neighbourhoods:¶

Although Tableau will be used down the road to visualize different measures in different neighbourhoods we can also have a quick look at different features across neighbourhoods in here to get an understanding that will help us choose the best visualizations in Tableau:

In [19]:

# Creating an empty dataframe to store different measures for different neighbourhoods:
df_neighbourhood = pd.DataFrame(index = df_listings_clean['neighbourhood_cleansed'].unique())

# Average nighly price in different neighbourhoods:
df_neighbourhood['Avg_price'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['price'].mean().round()

# Total number of listings in different neighbourhoods:
df_neighbourhood['Listing_count'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['id'].count()

# Average 30-Day availability in different neighbourhoods:
df_neighbourhood['Avg_30_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_30']\
                                            .mean().round().astype(int)

# Average 60-Day availability in different neighbourhoods:
df_neighbourhood['Avg_60_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_60']\
                                            .mean().round().astype(int)

# Average 90-Day availability in different neighbourhoods:
df_neighbourhood['Avg_90_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_90']\
                                            .mean().round().astype(int)

# Average 365-Day availability in different neighbourhoods:
df_neighbourhood['Avg_365_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_365']\
                                            .mean().round().astype(int)

In [20]:

df_neighbourhood

Out[20]:

	Avg_price	Listing_count	Avg_30_availability	Avg_60_availability	Avg_90_availability	Avg_365_availability
Downtown	196.0	1064	11	25	41	147
Riley Park	145.0	275	10	21	32	117
Downtown Eastside	176.0	271	12	25	38	129
West End	168.0	362	10	22	36	117
Kensington-Cedar Cottage	134.0	319	10	22	35	136
Hastings-Sunrise	120.0	247	12	26	42	138
Renfrew-Collingwood	102.0	200	12	27	43	153
Mount Pleasant	150.0	294	10	22	35	126
Grandview-Woodland	135.0	247	8	17	28	116
Fairview	144.0	151	7	18	31	132
Kitsilano	190.0	424	10	22	34	133
Kerrisdale	173.0	69	15	31	47	145
Arbutus Ridge	198.0	84	13	29	47	156
Sunset	172.0	106	13	30	48	142
Dunbar Southlands	190.0	138	13	30	48	170
Marpole	129.0	166	17	38	60	167
Killarney	91.0	67	15	34	53	140
Shaughnessy	216.0	87	12	26	39	107
South Cambie	185.0	59	10	22	35	125
Strathcona	128.0	22	8	20	32	129
West Point Grey	216.0	88	13	27	40	133
Oakridge	141.0	88	15	34	52	136
Victoria-Fraserview	90.0	101	19	41	65	228

Finding the most common words in Reviews and listing Summaries:¶

With the aim of making word clouds of most common words in property summary (description) and reviews we go through all reviews and listing summaries in each neghbourhood and will feed Tableau with tables that list the top 50 words used in reviews and listing summaries, in each neighbourhood:

In [21]:

# Funtion to return the top 50 words:

from operator import itemgetter

def top_word_counter(df, filter_col, filter_match, lookup_col):

    words = list(entry.lower().strip() for entry in df[df[filter_col] == filter_match][lookup_col].astype(str))
    wordcount = {}
    
    # Words that will not be counted:
    stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", 
                 "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", 
                 "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", 
                 "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", 
                 "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", 
                 "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", 
                 "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", 
                 "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", 
                 "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", 
                 "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", 
                 "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", ""]

    symbols_to_replace = [".", ",", ":", ";", "\"", "/", "!", "?", "â€œ", "â€˜", "*", ")", "(", "|", 
                          ">", "<", "{", "}", "&", "^", "~", "+"]
    
    seperator = ' '
    text = seperator.join(words)

    for word in text.split():
        for symbol in symbols_to_replace:
            word = word.replace(symbol,"")

        if word not in stopwords:
            if word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1

    top_words = sorted(wordcount.items(), key=itemgetter(1), reverse = True)[0:50]
    
    return [(filter_match, i[0], i[1]) for i in top_words]

In [22]:

# Finding the top 50 words in summaries for each neighbouthood:
summary_words = []

for neighb in df_listings_clean['neighbourhood_cleansed'].unique():
    summary_words += top_word_counter(df_listings_clean, 'neighbourhood_cleansed', neighb, 'summary')

summary_words = pd.DataFrame(summary_words)
summary_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']

In [23]:

# Cleaning the reviews dataframe before checking for most used words:
reviews_clean = df_reviews_more.drop(['id', 'date', 'reviewer_id', 'reviewer_name'] , axis = 1)\
    .rename(columns = {'listing_id': 'id', 'comments': 'review'})\
        .merge(df_listings_more.loc[:, ['id', 'neighbourhood_cleansed']], on = 'id')

# Finding the top 50 words in reviews for each neighbouthood:
review_words = []

for neighb in reviews_clean['neighbourhood_cleansed'].unique():
    review_words += top_word_counter(reviews_clean, 'neighbourhood_cleansed', neighb, 'review')

review_words = pd.DataFrame(review_words)
review_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']

In [24]:

# Exporting new, clean dataframes as csv files that will feed into Tableau:

df_listings_clean.to_csv('Data/clean_data.csv')
review_words.to_csv('Data/review_words.csv')
summary_words.to_csv('Data/summary_words.csv')

Data visualization to continue on Tableau.