#!/usr/bin/env python
# coding: utf-8

# # (Re-) Exploring HDB Resale Flat Data in 17 Graphs
# About a year ago, I published [my first post on data&stuff](https://dataandstuff.wordpress.com/2017/08/14/hdb-resale-flat-prices-in-singapore/). I applied econometric techniques to develop three least squares regression models to explain HDB resale flat prices. A year on, I'm re-visiting the expanded dataset (now includes an additional year of data) with new skills and knowledge. This time, I intend to apply proper data science techniques to accurately predict prices.
#
# In this first post, I perform exploratory data analysis (EDA) on the dataset. In subsequent posts, I will develop a more complex regression model to predict resale flat prices.

# In[1]:


# Import
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings

# Settings
get_ipython().run_line_magic('matplotlib', 'inline')
warnings.filterwarnings('ignore')

# Read data
hdb = pd.read_csv('resale-flat-prices-based-on-registration-date-from-jan-2015-onwards.csv')


# ## Target: Resale Prices
# As we can see, resale prices are right-skewed (mean is to the right of the median). The mean resale price transacted was a whopping $440,000. Singaporeans must be crazy rich to afford a resale flat in this era.

# In[4]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ## Date and Month Purchased
# First, note that the `month` feature combines both the month and the year. Let's split these up while preserving the original notation.

# In[5]:


# Rename month variable
hdb = hdb.rename(columns={'month': 'year_mth'})

# Add variables for month and year
hdb['year'] = pd.to_numeric(hdb.year_mth.str[:4])
hdb['month'] = pd.to_numeric(hdb.year_mth.str[5:])


# From the graph below, we find that there are "hot" and "cold" periods for buying resale flats, with a surge in recent months. We note how lots of transactions take place on a regular basis: at least 1,000 per month.
# At the mean price of $440,000, that's at least $440 million transacted per month.

# In[6]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ### Relation to Target
# Plotting the median resale price from 2015 onwards, we find that the median price has remained stable over time. In addition, the variation in prices has remained relatively wide. Hence, as in my [first post](https://dataandstuff.wordpress.com/2017/08/14/hdb-resale-flat-prices-in-singapore/) on HDB resale flat prices, we will assume that the relationship between the flat characteristics and resale flat prices is stable for all transactions in the dataset. In other words, we treat the transactions as having occurred within a single, stable time period.

# In[7]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ## Town

# In[8]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ### Relation to Target
# We find high variability in resale flat prices across the respective towns. This tells us that the town is an important factor in predicting resale flat prices.

# In[9]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ## Flat Type

# In[10]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ### Relation to Target
# Naturally, we would expect flats that are "high SES" to have a higher resale price:

# In[11]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ## Storey Range

# In[12]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ### Relation to Target
# Conventional wisdom would tell us that the higher the storey, the nicer the view. The nicer the view, the higher the resale price. The data appears to agree.

# In[13]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ## Floor Area

# In[14]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ### Relation to Target
# Conventional wisdom would also suggest a positive relationship between floor area and price. Yet again, the data appears to agree.
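
# Since the plotting code is not included, here is a minimal sketch of the kind of summary that sits behind the "Relation to Target" graphs above: a median price per category, and a correlation for a numeric feature. The column names `town`, `floor_area_sqm` and `resale_price` are assumptions about the dataset, the helper `median_price_by` is hypothetical, and the rows are made up for illustration. This is not the code that produced the graphs.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT figures from the dataset.
# The column names (town, floor_area_sqm, resale_price) are assumptions.
sample = pd.DataFrame({
    'town': ['A', 'A', 'B', 'B', 'C', 'C'],
    'floor_area_sqm': [70, 90, 65, 110, 95, 120],
    'resale_price': [300000, 380000, 320000, 520000, 450000, 600000],
})


def median_price_by(df, column, target='resale_price'):
    """Median target value per category, sorted from lowest to highest."""
    return df.groupby(column)[target].median().sort_values()


# Categorical feature: compare the median price across categories
by_town = median_price_by(sample, 'town')

# Numeric feature: Pearson correlation with the target
area_corr = sample['floor_area_sqm'].corr(sample['resale_price'])

print(by_town)
print(f'Correlation between floor area and price: {area_corr:.2f}')
```

# The same hypothetical helper could be pointed at any categorical column (flat type, storey range) by changing the `column` argument. The median is used rather than the mean because resale prices are right-skewed.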

# In[15]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ## Flat Model

# In[16]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ### Relation to Target
# There appears to be high variability in resale prices across flat models. This suggests that flat models will be useful for prediction.

# In[17]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ## Lease Commencement Date
# Although we expect a higher price for later lease commencement dates, the relationship is not all that clear. Perhaps remaining lease is a bigger factor.

# In[18]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ### Relation to Target

# In[19]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ## Remaining Lease

# In[20]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED


# ### Relation to Target
# We find a positive relationship between resale price and the remaining years in lease from 50 to 90 years. However, from 90 years onwards (referring to Build-to-Order (BTO) flats sold in the last 5 years), the relationship weakens substantially, and the variation increases substantially as well. This suggests that we could create a special category for transactions of flats with 95 years remaining in their leases to predict resale prices.

# In[21]:


# CODE FOR CUSTOM GRAPHICS NOT INCLUDED
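
# The special category suggested above could be sketched roughly as follows. This assumes a numeric `remaining_lease` column in years and a `resale_price` column. Both names, the `LEASE_CUTOFF` constant and the `lease_group` column are illustrative assumptions, and the rows are made up rather than taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT figures from the dataset.
# `remaining_lease` (in years) and `resale_price` are assumed column names.
sample = pd.DataFrame({
    'remaining_lease': [55, 62, 70, 78, 85, 89, 95, 95, 95, 95],
    'resale_price': [250000, 280000, 330000, 370000, 420000,
                     450000, 380000, 520000, 610000, 450000],
})

# Hypothetical threshold, taken from the 95 years mentioned in the text
LEASE_CUTOFF = 95

# Label each transaction by whether it falls in the long-lease category
sample['lease_group'] = (
    sample['remaining_lease']
    .ge(LEASE_CUTOFF)
    .map({True: '95+ years', False: 'under 95 years'})
)

# Compare the count, median and spread of prices between the two groups
spread = sample.groupby('lease_group')['resale_price'].agg(['count', 'median', 'std'])
print(spread)
```

# A larger standard deviation in the long-lease group would be consistent with the wider variation seen in the graph, and the `lease_group` label could then be passed to a model as a categorical feature.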