The original aim of this notebook was to devise a real-estate price index for the data given in the "Sberbank Russian Housing Market" competition at Kaggle. The index is supposed to reduce the time-related price fluctuations.
Although the index turned out to be of little assistance, the attempt proved a useful exercise in time-series averaging and spline fitting.
The dataset consists of records of individual real-estate transactions. To develop the index, only two parameters are needed: the timestamp and the price.
Let's start by importing the relevant Python packages:
import pandas as pd
from pandas import Timestamp as ts
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
As always, we start by reading the data:
# sb_macro = pd.read_csv('./data/macro.csv')
sb_train = pd.read_csv('./data/train.csv')
sb_test = pd.read_csv('./data/test.csv')
# convert to time
# sb_macro['timestamp'] = pd.to_datetime(sb_macro['timestamp'], errors='raise')
sb_train['timestamp'] = pd.to_datetime(sb_train['timestamp'], errors='raise')
sb_train = sb_train.set_index(['timestamp']).sort_index()
sb_test['timestamp'] = pd.to_datetime(sb_test['timestamp'], errors='raise')
sb_test = sb_test.set_index(['timestamp']).sort_index()
Let's have a look at the distribution of real-estate prices in the data set:
sb_train['price_doc'].mean()
# sb_train['price_doc'].std()
7123035.2777394904
p1 = sns.distplot(sb_train['price_doc'])
p1.set_title('Price histogram')
p1.set_xlabel('price')
Since home prices are clearly not normally distributed and have a long tail towards higher prices (due to luxury assets), it makes sense to work with the log of the prices, which has a more normal shape:
p2 = sns.distplot(np.log(sb_train['price_doc']))
p2.set_title('Price histogram')
p2.set_xlabel('log(price)')
Let's continue for now and revisit the logarithmic conversion later.
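To see the effect of the log transform numerically, here is a small sketch on synthetic log-normally distributed prices. The data and the use of scipy.stats.skew are illustrative assumptions, not part of the original analysis:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# synthetic prices with a long right tail, standing in for price_doc
prices = np.exp(rng.normal(loc=15.5, scale=0.6, size=10_000))

print(skew(prices))          # strongly right-skewed (large positive skewness)
print(skew(np.log(prices)))  # roughly symmetric, skewness near zero
```

The raw prices show a large positive skewness, while their logarithm is close to symmetric, which is what makes the log scale convenient for this data.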
Real-estate prices change over time. We are interested in quantifying this change and relating it to the overall macro-economic situation. For that, we first have to calculate the average real-estate prices.
Let's start by importing some things to help us manage the dates:
from pandas.tseries.offsets import DateOffset
We now generate a new dataframe that will hold various approximations of the average house price. We first generate the index: a timestamp range at 1-day frequency.
start_date = sb_train.index.min()
end_date = sb_train.index.max()
dates = pd.date_range(start=start_date, end=end_date, freq='1D')
# create a DataFrame for housing index
hs_index = pd.DataFrame(index=dates)
# perhaps this could also be done using:
# pandas.DataFrame.rolling
# anyway - this is how it was implemented
for window in [30, 60, 90]:
    ave_type = '{}-ave'.format(window)
    hs_index[ave_type] = np.nan
    shift = window // 2
    # centered moving average: mean transaction price within +/- shift days
    for day in hs_index[shift:-shift].index:
        backward_offset = day - DateOffset(days=shift)
        forward_offset = day + DateOffset(days=shift)
        mean_price = sb_train[backward_offset:forward_offset]['price_doc'].mean()
        hs_index.loc[day, ave_type] = mean_price
    # complete missing data at the edges with the nearest computed value
    hs_index.loc[hs_index.index[:shift], ave_type] = hs_index[ave_type].iloc[shift + 1]
    hs_index.loc[hs_index.index[-shift:], ave_type] = hs_index[ave_type].iloc[-(shift + 1)]
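As the comment in the block above hints, pandas' rolling machinery can produce a similar series with far less code. Below is a minimal sketch on an illustrative stand-in for sb_train (note one subtle difference: the loop above weights every transaction in the window equally, whereas resampling to daily means first gives every day equal weight regardless of how many sales it contains):

```python
import numpy as np
import pandas as pd

# illustrative stand-in for sb_train: random prices at random dates
rng = np.random.default_rng(1)
days = rng.integers(0, 365, size=500)
idx = pd.to_datetime('2012-01-01') + pd.to_timedelta(days, unit='D')
prices = pd.Series(rng.uniform(1e6, 1e7, size=500), index=idx).sort_index()

# daily mean price, then a centered 61-day (roughly +/- 30 days) moving average
daily_mean = prices.resample('1D').mean()
rolling_ave = daily_mean.rolling(window=61, center=True, min_periods=1).mean()
```

With min_periods=1, days without transactions are filled from whatever observations do fall in the window, so no separate edge-padding step is needed.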
hs_index.plot(title='Real-estate prices - moving averages')
Conclusion: The larger-window averages certainly look smoother; however, only the 30-ave series seems to capture the trend at the beginning of the series.
from scipy.interpolate import CubicSpline, UnivariateSpline
# add a column of sequential integer positions (used as the spline's x-axis)
hs_nrows = hs_index.shape[0]
hs_index['numbers'] = np.arange(hs_nrows)
We'll generate a smooth curve using a "leave-some-out" strategy: we fit a cubic spline to a small random fraction of the data, repeat the procedure many times, and average the resulting splines into one smooth curve:
n_bootstrap = 200
time_period = 90
index_name = '{}-ave'.format(time_period)
index_name_spl = index_name + '-spl'
hs_index[index_name_spl] = 0
# we add padding of values from the hs_index dataframe
# to anchor the splines to the edges
pad = 4
for i in range(n_bootstrap):
    # generate a sub-frame: the padded edges plus a small random sample
    hs_sub = hs_index[:pad]
    hs_sub = hs_sub.append(hs_index[pad:-pad].sample(frac=0.02))
    hs_sub = hs_sub.append(hs_index[-pad:])
    hs_sub = hs_sub.sort_index()
    cs = CubicSpline(hs_sub['numbers'], hs_sub[index_name], bc_type='natural')
    hs_index[index_name_spl] += cs(hs_index['numbers']) / n_bootstrap
Now that we have generated an index, let's see if it is useful in reducing the variability of the real-estate prices. We'll test that by dividing the value of each asset by the index and checking whether the width of the log-histogram is reduced:
hs_index[[index_name, index_name_spl]].plot(title='Cubic spline fit to 90 days moving average')
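The deflation idea itself (divide each price by the index value at its date, then compare the spread of the log-histograms) can be sketched directly. The numbers below are synthetic, a drifting index with multiplicative noise, standing in for sb_train and the fitted spline:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# synthetic drifting index plus multiplicative (log-normal) noise
index_vals = pd.Series(np.linspace(6e6, 8e6, 2000))
prices = index_vals * rng.lognormal(mean=0.0, sigma=0.5, size=2000)

# deflate each price by the index value at its date and compare
# the spread (standard deviation) of the two log-histograms
raw_spread = np.log(prices).std()
deflated_spread = np.log(prices / index_vals).std()
print(raw_spread, deflated_spread)
```

If the index captures a real time trend, the deflated spread should be narrower; when the noise dominates, as in this sketch and in the actual data, the reduction is small.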
Conclusion (intermediate): I've found that in order to get good, smooth splines, only a small fraction of the data, on the order of 1% (frac=0.01), should be used for generating each spline. The number of averaging iterations is not that important; the averaged spline converges after about 100 iterations or fewer.
Another empirical finding: the larger the averaging window, the larger the fraction you need to use to get nice fits.
Now, we need to evaluate how good is our new index. For that we'll define some helper functions:
def r_sqare(sse, sst):
    """Get the R-squared value."""
    return 1 - sse / sst

def get_inx_val(index_frame, index_name, timestamp):
    """Get the value of the index at a given date."""
    return index_frame[index_name][timestamp]
base_resid = sb_train['price_doc'] - sb_train['price_doc'].mean()
sst = np.dot(base_resid, base_resid)
fit_resid = sb_train['price_doc'] - get_inx_val(hs_index, '90-ave-spl', sb_train.index)
sse = np.dot(fit_resid, fit_resid)
print('R_square is: {}'.format(r_sqare(sse, sst)))
R_square is: 0.018126715071587363
The overall R2 of using the 90-ave-spl fitted spline is close to nothing: in simple words, the spline helps explain only about 1.8% of the variability in real-estate prices. Although I don't show it here, using log(price) improves R2 to about 0.022 (2.2%), which is still quite meaningless.
What should we do next? The reason this didn't work is the large variability of home prices, regardless of any time dependence. Perhaps we should first partition the data set into groups with lower within-group variance, or use some other form of partition in the same spirit.
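One way to sketch such a partition is to compare the overall log-price variance with the variance inside candidate groups. The data below is synthetic, and the district label is purely illustrative (in the competition data a categorical column such as sub_area could play this role, but that choice is an assumption, not something established in this notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# illustrative stand-in: log-normal prices with a district label
df = pd.DataFrame({
    'district': rng.choice(['A', 'B', 'C'], size=3000),
    'price_doc': rng.lognormal(mean=15.5, sigma=0.6, size=3000),
})

# compare overall log-price variance with the within-group variances;
# groups with markedly lower internal variance are candidates for
# separate, per-group price indices
log_price = np.log(df['price_doc'])
overall_var = log_price.var()
within_var = log_price.groupby(df['district']).var()
print(overall_var)
print(within_var)
```

Groups whose within-group variance is well below the overall variance would each get their own moving-average-plus-spline index, fitted exactly as above.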