The original aim of this notebook was to devise a real-estate price index for the data given in the "Sberbank Russian Housing Market" competition at Kaggle. The index is supposed to reduce the time-related price fluctuations.
Although the index turned out to be of little assistance, the attempt proved a useful exercise in time-series averaging and spline fitting.
The dataset consists of records of individual real-estate transactions. To develop the index, only two parameters are needed: the timestamp and the price.
Let's start by importing the relevant Python packages:
import pandas as pd
from pandas import Timestamp as ts
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
As always, we start by reading the data:
# sb_macro = pd.read_csv('./data/macro.csv')
sb_train = pd.read_csv('./data/train.csv')
sb_test = pd.read_csv('./data/test.csv')
# convert to time
# sb_macro['timestamp'] = pd.to_datetime(sb_macro['timestamp'], errors='raise')
sb_train['timestamp'] = pd.to_datetime(sb_train['timestamp'], errors='raise')
sb_train = sb_train.set_index(['timestamp']).sort_index()
sb_test['timestamp'] = pd.to_datetime(sb_test['timestamp'], errors='raise')
sb_test = sb_test.set_index(['timestamp']).sort_index()
Let's have a look at the distribution of real-estate prices in the data set:
sb_train['price_doc'].mean()
# sb_train['price_doc'].std()
7123035.2777394904
p1 = sns.distplot(sb_train['price_doc'])
p1.set_title('Price histogram')
p1.set_xlabel('price')
Since home prices are clearly not normally distributed and have a long tail towards higher prices (due to luxury assets), it makes sense to work with the log of the prices, which has a more normal shape:
p2 = sns.distplot(np.log(sb_train['price_doc']))
p2.set_title('Price histogram')
p2.set_xlabel('log(price)')
Let's continue for now and revisit the logarithmic conversion later.
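To see the effect of the log transform numerically, here is a small sketch on synthetic log-normally distributed prices. The data and the use of scipy.stats.skew are illustrative assumptions, not part of the original analysis:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# synthetic prices with a long right tail, standing in for price_doc
prices = np.exp(rng.normal(loc=15.5, scale=0.6, size=10_000))

print(skew(prices))          # strongly right-skewed (large positive skewness)
print(skew(np.log(prices)))  # roughly symmetric, skewness near zero
```

The raw prices show a large positive skewness, while their logarithm is close to symmetric, which is what makes the log scale convenient for this data.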
Real-estate prices change over time. We are interested in quantifying this change and relating it to the overall macro-economic situation. For that, we first have to calculate the average real-estate prices.
Let's start by importing some things to help us manage the dates:
from pandas.tseries.offsets import DateOffset
We now generate a new dataframe that will hold various approximations of the average house price. We first generate the index: a timestamp range at 1-day frequency.
start_date = sb_train.index.min()
end_date = sb_train.index.max()
dates = pd.date_range(start=start_date, end=end_date, freq='1D')
# create a DataFrame for housing index
hs_index = pd.DataFrame(index=dates)
# perhaps this could also be done using:
# pandas.DataFrame.rolling
# anyway - this is how it was implemented
for window in [30, 60, 90]:
    ave_type = '{}-ave'.format(window)
    hs_index[ave_type] = np.nan
    shift = window // 2
    # centered moving average: mean transaction price within +/- shift days
    for day in hs_index[shift:-shift].index:
        backward_offset = day - DateOffset(days=shift)
        forward_offset = day + DateOffset(days=shift)
        mean_price = sb_train[backward_offset:forward_offset]['price_doc'].mean()
        hs_index.loc[day, ave_type] = mean_price
    # complete missing data at the edges with the nearest computed value
    hs_index.loc[hs_index.index[:shift], ave_type] = hs_index[ave_type].iloc[shift + 1]
    hs_index.loc[hs_index.index[-shift:], ave_type] = hs_index[ave_type].iloc[-(shift + 1)]
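As the comment in the block above hints, pandas' rolling machinery can produce a similar series with far less code. Below is a minimal sketch on an illustrative stand-in for sb_train (note one subtle difference: the loop above weights every transaction in the window equally, whereas resampling to daily means first gives every day equal weight regardless of how many sales it contains):

```python
import numpy as np
import pandas as pd

# illustrative stand-in for sb_train: random prices at random dates
rng = np.random.default_rng(1)
days = rng.integers(0, 365, size=500)
idx = pd.to_datetime('2012-01-01') + pd.to_timedelta(days, unit='D')
prices = pd.Series(rng.uniform(1e6, 1e7, size=500), index=idx).sort_index()

# daily mean price, then a centered 61-day (roughly +/- 30 days) moving average
daily_mean = prices.resample('1D').mean()
rolling_ave = daily_mean.rolling(window=61, center=True, min_periods=1).mean()
```

With min_periods=1, days without transactions are filled from whatever observations do fall in the window, so no separate edge-padding step is needed.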
hs_index.plot(title='Real-estate prices - moving averages')
Conclusion: The larger-window averages certainly look smoother; however, only the 30-ave series seems to capture the trend at the beginning of the series.
from scipy.interpolate import CubicSpline, UnivariateSpline
# add a column of sequential integer positions (used as the spline's x-axis)
hs_nrows = hs_index.shape[0]
hs_index['numbers'] = np.arange(hs_nrows)
We'll generate a smooth curve using a "leave-some-out" strategy: we fit a cubic spline to a small random fraction of the data, repeat the procedure many times, and average the resulting splines into one smooth curve:
n_bootstrap = 200
time_period = 90
index_name = '{}-ave'.format(time_period)
index_name_spl = index_name + '-spl'
hs_index[index_name_spl] = 0
# we add padding of values from the hs_index dataframe
# to anchor the splines to the edges
pad = 4
for i in range(n_bootstrap):
    # generate a sub-frame: the padded edges plus a small random sample
    hs_sub = hs_index[:pad]
    hs_sub = hs_sub.append(hs_index[pad:-pad].sample(frac=0.02))
    hs_sub = hs_sub.append(hs_index[-pad:])
    hs_sub = hs_sub.sort_index()
    cs = CubicSpline(hs_sub['numbers'], hs_sub[index_name], bc_type='natural')
    hs_index[index_name_spl] += cs(hs_index['numbers']) / n_bootstrap
Now that we have generated an index, let's see if it is useful in reducing the variability of the real-estate prices. We'll test that by dividing the value of each asset by the index and checking whether the width of the log-histogram is reduced:
hs_index[[index_name, index_name_spl]].plot(title='Cubic spline fit to 90 days moving average')
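The deflation idea itself (divide each price by the index value at its date, then compare the spread of the log-histograms) can be sketched directly. The numbers below are synthetic, a drifting index with multiplicative noise, standing in for sb_train and the fitted spline:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# synthetic drifting index plus multiplicative (log-normal) noise
index_vals = pd.Series(np.linspace(6e6, 8e6, 2000))
prices = index_vals * rng.lognormal(mean=0.0, sigma=0.5, size=2000)

# deflate each price by the index value at its date and compare
# the spread (standard deviation) of the two log-histograms
raw_spread = np.log(prices).std()
deflated_spread = np.log(prices / index_vals).std()
print(raw_spread, deflated_spread)
```

If the index captures a real time trend, the deflated spread should be narrower; when the noise dominates, as in this sketch and in the actual data, the reduction is small.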
Conclusion (intermediate): I've found that in order to get good, smooth splines, only a small fraction of the data, on the order of 1% (frac=0.01), should be used for generating each spline. The number of averaging iterations is not that important; the averaged spline converges after about 100 iterations or fewer.
Another empirical finding: the larger the averaging window, the larger the fraction you need to use to get nice fits.
Now, we need to evaluate how good is our new index. For that we'll define some helper functions:
def r_sqare(sse, sst):
    """Get the R-squared value."""
    return 1 - sse / sst

def get_inx_val(index_frame, index_name, timestamp):
    """Get the value of the index at a given date."""
    return index_frame[index_name][timestamp]
base_resid = sb_train['price_doc'] - sb_train['price_doc'].mean()
sst = np.dot(base_resid, base_resid)
fit_resid = sb_train['price_doc'] - get_inx_val(hs_index, '90-ave-spl', sb_train.index)
sse = np.dot(fit_resid, fit_resid)
print('R_square is: {}'.format(r_sqare(sse, sst)))
R_square is: 0.018126715071587363
The overall R2 of using the 90-ave-spl fitted spline is close to nothing: in simple words, the spline helps explain only about 1.8% of the variability in real-estate prices. Although I don't show it here, using log(price) improves R2 to about 0.022 (2.2%), which is still quite meaningless.
What should we do next? The reason this didn't work is the large variability of home prices, regardless of any time dependence. Perhaps we should first partition the data set into groups with lower within-group variance, or use some other form of partition in the same spirit.
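One way to sketch such a partition is to compare the overall log-price variance with the variance inside candidate groups. The data below is synthetic, and the district label is purely illustrative (in the competition data a categorical column such as sub_area could play this role, but that choice is an assumption, not something established in this notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# illustrative stand-in: log-normal prices with a district label
df = pd.DataFrame({
    'district': rng.choice(['A', 'B', 'C'], size=3000),
    'price_doc': rng.lognormal(mean=15.5, sigma=0.6, size=3000),
})

# compare overall log-price variance with the within-group variances;
# groups with markedly lower internal variance are candidates for
# separate, per-group price indices
log_price = np.log(df['price_doc'])
overall_var = log_price.var()
within_var = log_price.groupby(df['district']).var()
print(overall_var)
print(within_var)
```

Groups whose within-group variance is well below the overall variance would each get their own moving-average-plus-spline index, fitted exactly as above.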