This tutorial , uses multivariate regression to predict house price. The high level goal is the use multiple features (size , number of bedrooms,bathrooms etc) to predict the price of a house. This tutorial is a self-paced tutorial.

The language used throughout will be Python and libraries available in python for scientific and machine learning applications.

One of the Python tools, the IPython notebook = interactive Python rendered as HTML, you're watching right now. We'll go over other practical tools, widely used in the data science industry, below.

**You can run this notebook interactively**

Install the (free) [Anaconda](https://store.continuum.io/cshop/anaconda/) Python distribution, including Python itself.
Install the TextBlob library: [Textblob](http://textblob.readthedocs.org/en/dev/install.html).
Download the source for this notebook to your computer: [https://github.com/nmishra/mvregression/blob/master/predict_house_price_python.ipynb](https://github.com/nmishra/mvregression/blob/master/predict_house_price_python.ipynb) and run it with:
`$ ipython notebook predict_house_price_python.ipynb`
Watch the [IPython tutorial video](https://www.youtube.com/watch?v=H6dLGQw9yFQ) for notebook navigation basics.
Run the first code cell below; if it executes without errors, you're good to go!

Python tutorial : Multivariate linear regression to predict house prices¶

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import csv
from textblob import TextBlob
import pandas
import sklearn
import pickle
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation as cv
from sklearn.metrics import explained_variance_score, mean_squared_error,r2_score 
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, train_test_split 
from sklearn.learning_curve import learning_curve
from sklearn.externals import joblib
from sklearn.linear_model import LinearRegression

Step 1: Load data, plot data¶

Labeled data is crucial for supervised learning. By the gracious courtesy of redfin.com, we are going to focus on last 3 years of data for zipcode 94555. The houses in this data are at least 4500 sq ft lot size, 3+bed, 2+bath,1500 sqft - 2000 sqft and were not listed more than 30 days in www.redfin.com

$ mkdir data && cd data 
$ ls -ltrh
-rw-r--r--@ 1 user  staff    41K Feb 17 23:19 redfin_2015-02-17_94555_results.csv

Pandas is an amazing library to load and manipulate data. We will be using pandas to load the data.

In [14]:

labels=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
        "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"]
houses=pandas.read_csv('./data/redfin_2015-02-17_94555_results_massaged.csv', quoting=csv.QUOTE_NONE,names=labels)
print(houses.head())

   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      739000     3    2.5  1988      5595        1991              2   
1      749888     3    2.5  1642      9876        1986              2   
2      713500     4    2.0  1504      5800        1969              2   
3      749000     3    3.0  1781      5800        1971              2   
4      835000     4    2.5  1857      5061        1990              2   

  PARKING TYPE  ORIGINAL LIST PRICE  LAST SALE PRICE  
0       Garage               739000           755000  
1       Garage               749888           682500  
2       Garage               599000           713500  
3       Garage               749000           750000  
4       Garage               835000           835000

The matrix size of the input data is:

In [16]:

print(houses.shape)

(116, 10)

It's always advisable to visualize the data. Let's create a scatter plot with X-axis as the SQFT of the house and Y-axis as the LAST SALE PRICE of the house.

In [12]:

houses.plot(x='SQFT',y='LAST SALE PRICE', kind='scatter')

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x107423cf8>

pandas also helps with aggregate statistics.

In [20]:

houses.groupby('SQFT').describe()

Out[20]:

		BATHS	BEDS	LAST SALE PRICE	LIST PRICE	LOT SIZE	ORIGINAL LIST PRICE	PARKING SPOTS	YEAR BUILT
SQFT
1500	count	1.0	1.000000	1.000000	1.000000	1.000000	1.000000	1	1.000000
	mean	2.0	3.000000	583000.000000	583000.000000	6139.000000	557300.000000	2	1978.000000
	std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	min	2.0	3.000000	583000.000000	583000.000000	6139.000000	557300.000000	2	1978.000000
	25%	2.0	3.000000	583000.000000	583000.000000	6139.000000	557300.000000	2	1978.000000
	50%	2.0	3.000000	583000.000000	583000.000000	6139.000000	557300.000000	2	1978.000000
	75%	2.0	3.000000	583000.000000	583000.000000	6139.000000	557300.000000	2	1978.000000
	max	2.0	3.000000	583000.000000	583000.000000	6139.000000	557300.000000	2	1978.000000
1504	count	3.0	3.000000	3.000000	3.000000	3.000000	3.000000	3	3.000000
	mean	2.0	3.666667	731166.666667	687462.666667	5837.666667	649296.000000	2	1974.333333
	std	0.0	0.577350	42826.199146	34081.757310	467.639106	50501.236104	0	10.115994
	min	2.0	3.000000	700000.000000	648888.000000	5390.000000	599000.000000	2	1968.000000
	25%	2.0	3.500000	706750.000000	674444.000000	5595.000000	623944.000000	2	1968.500000
	50%	2.0	4.000000	713500.000000	700000.000000	5800.000000	648888.000000	2	1969.000000
	75%	2.0	4.000000	746750.000000	706750.000000	6061.500000	674444.000000	2	1977.500000
	max	2.0	4.000000	780000.000000	713500.000000	6323.000000	700000.000000	2	1986.000000
1507	count	1.0	1.000000	1.000000	1.000000	1.000000	1.000000	1	1.000000
	mean	2.0	4.000000	498000.000000	450000.000000	6323.000000	450000.000000	2	1969.000000
	std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	min	2.0	4.000000	498000.000000	450000.000000	6323.000000	450000.000000	2	1969.000000
	25%	2.0	4.000000	498000.000000	450000.000000	6323.000000	450000.000000	2	1969.000000
	50%	2.0	4.000000	498000.000000	450000.000000	6323.000000	450000.000000	2	1969.000000
	75%	2.0	4.000000	498000.000000	450000.000000	6323.000000	450000.000000	2	1969.000000
	max	2.0	4.000000	498000.000000	450000.000000	6323.000000	450000.000000	2	1969.000000
1523	count	1.0	1.000000	1.000000	1.000000	1.000000	1.000000	1	1.000000
	mean	2.0	3.000000	626000.000000	620000.000000	9241.000000	650000.000000	2	1974.000000
	std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	min	2.0	3.000000	626000.000000	620000.000000	9241.000000	650000.000000	2	1974.000000
	25%	2.0	3.000000	626000.000000	620000.000000	9241.000000	650000.000000	2	1974.000000
	50%	2.0	3.000000	626000.000000	620000.000000	9241.000000	650000.000000	2	1974.000000
...	...	...	...	...	...	...	...	...	...
1936	std	0.0	0.000000	255265.548008	248265.190875	1675.135965	218849.548777	0	1.414214
	min	3.0	4.000000	485000.000000	484900.000000	6400.000000	526500.000000	2	1973.000000
	25%	3.0	4.000000	575250.000000	572675.000000	6992.250000	603875.000000	2	1973.500000
	50%	3.0	4.000000	665500.000000	660450.000000	7584.500000	681250.000000	2	1974.000000
	75%	3.0	4.000000	755750.000000	748225.000000	8176.750000	758625.000000	2	1974.500000
	max	3.0	4.000000	846000.000000	836000.000000	8769.000000	836000.000000	2	1975.000000
1942	count	1.0	1.000000	1.000000	1.000000	1.000000	1.000000	1	1.000000
	mean	2.0	4.000000	595000.000000	564900.000000	6205.000000	564900.000000	2	1973.000000
	std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	min	2.0	4.000000	595000.000000	564900.000000	6205.000000	564900.000000	2	1973.000000
	25%	2.0	4.000000	595000.000000	564900.000000	6205.000000	564900.000000	2	1973.000000
	50%	2.0	4.000000	595000.000000	564900.000000	6205.000000	564900.000000	2	1973.000000
	75%	2.0	4.000000	595000.000000	564900.000000	6205.000000	564900.000000	2	1973.000000
	max	2.0	4.000000	595000.000000	564900.000000	6205.000000	564900.000000	2	1973.000000
1958	count	1.0	1.000000	1.000000	1.000000	1.000000	1.000000	1	1.000000
	mean	2.0	4.000000	690000.000000	649000.000000	6828.000000	649000.000000	2	1985.000000
	std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	min	2.0	4.000000	690000.000000	649000.000000	6828.000000	649000.000000	2	1985.000000
	25%	2.0	4.000000	690000.000000	649000.000000	6828.000000	649000.000000	2	1985.000000
	50%	2.0	4.000000	690000.000000	649000.000000	6828.000000	649000.000000	2	1985.000000
	75%	2.0	4.000000	690000.000000	649000.000000	6828.000000	649000.000000	2	1985.000000
	max	2.0	4.000000	690000.000000	649000.000000	6828.000000	649000.000000	2	1985.000000
1988	count	4.0	4.000000	4.000000	4.000000	4.000000	4.000000	4	4.000000
	mean	2.5	3.750000	683750.000000	678997.000000	5538.500000	683747.000000	2	1992.000000
	std	0.0	0.500000	113164.702978	102146.130793	884.457838	92866.140417	0	1.154701
	min	2.5	3.000000	515000.000000	529000.000000	4576.000000	548000.000000	2	1991.000000
	25%	2.5	3.750000	672500.000000	657241.000000	5105.500000	661991.000000	2	1991.000000
	50%	2.5	4.000000	732500.000000	719494.000000	5438.500000	719494.000000	2	1992.000000
	75%	2.5	4.000000	743750.000000	741250.000000	5871.500000	741250.000000	2	1993.000000
	max	2.5	4.000000	755000.000000	748000.000000	6701.000000	748000.000000	2	1993.000000

504 rows × 8 columns

Step 2: Data preprocessing¶

In this section, we will convert the data to a format, that can be normalized and consumed by our learning algorithm. We have 9 columns of input data based on which, we need to predict the house price. We have around 116 rows of data (training samples).

We will extract the LAST SALE PRICE column and save that to our target values. We will normalize the rest of the columns. By normalization, we are making sure the features are scaled to an uniform scale and no particular feature has more weight over the other.

Step 2.1 Extract target values¶

We need to extract LAST SALE PRICE and save that to a 'y' array and extract the rest of the columns and save it to 'houses_features' array

In [15]:

y= houses['LAST SALE PRICE'].copy()
houses_features =houses.iloc[:,[0,1,2,3,4,5,6,7,8]].copy()

print ("features: \n",houses_features.head())
print("target: \n", y.head())

features: 
    LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      739000     3    2.5  1988      5595        1991              2   
1      749888     3    2.5  1642      9876        1986              2   
2      713500     4    2.0  1504      5800        1969              2   
3      749000     3    3.0  1781      5800        1971              2   
4      835000     4    2.5  1857      5061        1990              2   

  PARKING TYPE  ORIGINAL LIST PRICE  
0       Garage               739000  
1       Garage               749888  
2       Garage               599000  
3       Garage               749000  
4       Garage               835000  
target: 
 0    755000
1    682500
2    713500
3    750000
4    835000
Name: LAST SALE PRICE, dtype: int64

Step 2.2 Normalize data¶

The first step for our data would be to give "garage" a numerical value. We can assign 1 for the presence of garage and 0 for the absence of garage.

In [16]:

import math
def convert_parking_type(features):
    features['PARKING TYPE'] \
    = features['PARKING TYPE'].map(lambda x : 1 if x.strip().lower() == 'garage' else 0)
    return features
        
houses_features = convert_parking_type(houses_features);
print(houses_features['PARKING TYPE'].head())

0    1
1    1
2    1
3    1
4    1
Name: PARKING TYPE, dtype: int64

There are many ways to normalize the features. We will use a simple way of normalizing the feature: given a list of values X[1,2,3,4,5] , we will normalize as:

\begin{equation*} X[i] = \frac{X[i] - mean(X[i])}{\sigma(X[i])} \end{equation*}

where m= total no. of features of X

In our case, we will normalize each column of features.

In [17]:

def feature_normalize(matrix, mean=None, std=None):
    mean = matrix.mean() if mean is None else mean
    std = matrix.std()  if std is None else std
    matrix = (matrix - mean)/std
    return {'matrix_norm': matrix.fillna(0), 'mean': mean, 'std': std}

Sanity Check !!!

In [19]:

results = feature_normalize(houses_features)
houses_norm = results['matrix_norm']
X_mean =results['mean']
X_std = results['std']
                      
results = feature_normalize(y);
y_norm = results['matrix_norm']
y_mean = results['mean']
y_std = results['std']

print ('{0}\n'.format(houses_norm.head()))
print ('{0}\n'.format(y_norm.head()))

   LIST PRICE      BEDS     BATHS      SQFT  LOT SIZE  YEAR BUILT  \
0    0.783326 -1.417852  0.444568  2.093853 -0.446035    1.547770   
1    0.899198 -1.417852  0.444568 -0.591009  1.947155    0.851334   
2    0.511950  0.517099 -0.949213 -1.661850 -0.331435   -1.516551   
3    0.889748 -1.417852  1.838348  0.487591 -0.331435   -1.237976   
4    1.804976  0.517099  0.444568  1.077330 -0.744555    1.408483   

   PARKING SPOTS  PARKING TYPE  ORIGINAL LIST PRICE  
0              0             0             0.777064  
1              0             0             0.894669  
2              0             0            -0.735122  
3              0             0             0.885077  
4              0             0             1.813991  

0    0.679040
1   -0.086546
2    0.240808
3    0.626241
4    1.523824
Name: LAST SALE PRICE, dtype: float64

We can use Preprocessing module of sklearn to apply zero mean and unit deviation norm. Let's use that and check the results.

In [22]:

scaler = preprocessing.StandardScaler().fit(houses_features.values)
houses_norm = scaler.transform(houses_features)
print("Mean: {0}, STD: {1}".format(scaler.mean_,scaler.std_))
print(houses_norm)

Mean: [  6.65394267e+05   3.73275862e+00   2.34051724e+00   1.71816379e+03
   6.39287931e+03   1.97988793e+03   2.00000000e+00   1.00000000e+00
   6.67058466e+05], STD: [  9.35597222e+04   5.14576468e-01   3.57186913e-01   1.28313997e+02
   1.78109820e+03   7.14839209e+00   1.00000000e+00   1.00000000e+00
   9.21813058e+04]
[[ 0.78672458 -1.42400336  0.44649665 ...,  0.          0.          0.78043519]
 [ 0.90309944 -1.42400336  0.44649665 ...,  0.          0.          0.89855024]
 [ 0.51417139  0.5193424  -0.95333068 ...,  0.          0.         -0.73831093]
 ..., 
 [-1.24406384  0.5193424  -0.95333068 ...,  0.          0.         -1.28072025]
 [-0.39019213  0.5193424  -0.95333068 ...,  0.          0.         -0.41408033]
 [-2.83769833 -1.42400336 -0.95333068 ...,  0.          0.         -2.89818487]]

Step 3: Training our model¶

Let's use a simple linear model , that uses the least squared error function.

Optional !!: A quick mathematical background.

$\begin{equation*} y' = h_\theta(X)\end{equation*}$

Where

$\begin{equation*}h_\theta(X) = \theta_0 + \theta_1*x_1 + ... + \theta_n*x_n\end{equation*}$

Here y' is the predicted value and y is the actual value. Our goal is to minimize the delta between these two values. So, if we can choose the best values for $\begin{equation*}\Theta\end{equation*}$, we can accurately predict the house prices.

To achieve that , we come up with least squared error function, for which, we need to find the minimum values of $\begin{equation*}\Theta\end{equation*}$ where , $\begin{equation*}\Theta = [\theta_0, \theta_1 ... \theta_n]\end{equation*}$

$\begin{equation*} J(\Theta) = \sum_{i=0}^m(h_\theta(X)-y)^2\end{equation*}$

We choose some arbitrary number of iterations and in each iteration we simultaneously calculate

$\begin{equation*} \theta_0 := \theta_0 -\alpha*\frac{\partial}{\partial\theta_0}J(\Theta)\end{equation*}$

$\begin{equation*} \theta_1 := \theta_1 - \alpha*\frac{\partial}{\partial\theta_1}J(\Theta)\end{equation*}$

. . .

$\begin{equation*} \theta_n := \theta_n - \alpha*\frac{\partial}{\partial\theta_n}J(\Theta)\end{equation*}$

Where $\begin{equation*}\alpha\end{equation*}$ is a scalar learning rate value.

Phew !! Thankfully, we can use different linear models available in sklearn.

Let's define a train_model function in which,

we create a regressor with intercept set to true. An intercept is the value of Y when, all features values are set to 0. We can either add $\begin{equation*}x_0 = 1\end{equation*}$ and add that to our features matrix or set the fit_intercept parameter to true. So this means, that if we are not given any features of a house, can we come up with a mean value of the house.
We fit the model (train ) with our house features and prices.

In [23]:

def train_model(X,y):
    regressor = LinearRegression(fit_intercept=True)
    regressor.fit(X,y)
    return regressor

In [24]:

regr = train_model(houses_norm, y_norm)
print(regr.coef_)

[ 0.95294642  0.02315098 -0.09140542  0.00173405  0.00790588 -0.03303855
  0.          0.          0.01064799]

Let's run a prediction and see if we can predict the price of a house. We are going to use the 3rd sample from our training set. The input features (house_test) for this sample : {"LIST PRICE" : 749888, "BEDS" : 3, "BATHS": 2.5, "SQFT": 1642, "LOT SIZE" : 9876, "YEAR BUILT" : 1968, "PARKING SPOTS" : 2, "PARKING TYPE" : "Garage", "ORIGINAL LIST PRICE": 74988}

and the actual output is "LAST SALE PRICE" which is 713,550.

In [25]:

house_test = pandas.DataFrame({"LIST PRICE" : 749888,
 "BEDS" : 3,
 "BATHS": 2.5,
 "SQFT": 1642,
 "LOT SIZE" : 9876,
 "YEAR BUILT" : 1968,
 "PARKING SPOTS" : 2, 
 "PARKING TYPE" : "Garage",
 "ORIGINAL LIST PRICE": 74988},index=[0])
house_test_converted = convert_parking_type(house_test)
house_test_norm = feature_normalize(house_test_converted, X_mean, X_std)

y_predicted_norm = regr.predict(house_test_norm['matrix_norm']);
y_predicted = y_predicted_norm*y_std + y_mean
print('Predicted Price: {0}'.format(y_predicted[0].round(0)))


print('Normalized Input: \n {0} \n'.format(house_test_norm['matrix_norm']))

Predicted Price: 713785.0
Normalized Input: 
       BATHS      BEDS  LIST PRICE  LOT SIZE  ORIGINAL LIST PRICE  \
0  0.444568 -1.417852    0.899198  1.947155            -6.395146   

   PARKING SPOTS  PARKING TYPE      SQFT  YEAR BUILT  
0              0             0 -0.591009   -1.655838

Let's do some score analysis to find out the performance of our prediction algorithm.

In [178]:

print("Mean square error: {0} \n".format(mean_squared_error([y_norm[2]],y_predicted_norm)))
print("R2 Score: {0} \n".format(regr.score(houses_norm, y_norm)))

Mean square error: 9.067023349686855e-06 

R2 Score: 0.8903593961815572

Step 5. Now what ? Optimizations and explorations - the fun part !!!¶

In the above "evaluation" we didn't break the data into training and test set. That's a big no no in preparing your machine learning models. Let's fix that.

A proper way is to split the data into a training/test set, where the model only ever sees the training data during its model fitting and parameter tuning. The test data is used for final evaluation of our model to find out it's prediction performance.

In [29]:

houses_norm_train,houses_norm_test, y_norm_train, y_norm_test = \
train_test_split(houses_norm,y_norm,test_size=0.2) #<=== 20% of the samples are used for testing.
print(len(houses_norm_train),len(houses_norm_test),len(houses_norm_train)+len(houses_norm_test))

92 24 116

So, as requested, the test size is 20% of the entire dataset (24 houses out of total 116), and the training is the rest (92 out of 116). Let me introduce the scikit-learn Pipeline concept here. The pipeline helps in breaking our learning into sub-steps and assemble them together. This way, we can run the pipeline on different set of parameters and cross-validation set without changing the code of each individual steps.

In [27]:

linear_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor', LinearRegression())
])

A common practice is to split the training data into further smaller subsets of data; for example 5 equally sized subsets. Then, train the model on 4 subsets of data and validate the accuracy on the last subset of data (called "validation set"). If we repeat, this training with different subset as the validation set, we can test the stability of the model. If the model gives different scores for different subsets , then we need to go back and check the model and/or the data.

In [30]:

scores = cross_val_score(linear_regressor_pipeline,
                         houses_norm_train, #<== training samples
                         y_norm_train,#<== training output labels
                         cv=5,#<=== split data randomly into 5 parts, 4 for training and 1 for scoring
                         scoring='r2', #<== use R^2 score
                         n_jobs=-1)
print(scores,scores.mean(),scores.std())

[ 0.89656312  0.91393417  0.887133    0.90887704  0.63537356] 0.848376178916 0.106913414232

A score of 1 implies we have predicted perfectly. As you can see we are little off in our prediction (16% off).The scores are quite consistent. But, we definitely can do better.

How can we fix this ?

There are few design principles that we need to consider :

Get more Data
- Sometimes more training data doesn't help. More data maynot help because we maybe overfitting our model to our training data.
- But, most of the time, this is a good place to start.
Try smaller sets of features
- We can do this by hand or use Dimension Reduction using PCA techniques
Polynomial Features
- If we think that some of the features have heavier influence over other features, we may want to use higher degree of those features.
Combining Features
- if we have better knowledge of the domain, we can combine and build better features.
** New models**
- We can try new regression models and see if they help in our prediction score

All of the above advice is good. But, where should we start ?

Let's start with Step 1 , Step 3 and Step 4 and see if we can make our predictions better.

We will get more data and use cubic interpolation to fix missing data. You can read more about interpolation here.

In [181]:

labels=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
        "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"]
data=pandas.read_csv('./data/redfin_more_data1.csv', quoting=csv.QUOTE_NONE,names=labels)
print(data.head())
data.plot(x='SQFT', y='LAST SALE PRICE', kind='scatter')
data = convert_parking_type(data)
data.interpolate(method='cubic',inplace=True)
data.fillna(0)


y= data['LAST SALE PRICE'].copy()
X =data.iloc[:,[0,1,2,3,4,5,6,7]].copy()

print ("features {0}: \n{1}".format(X.shape, X.head()))
print("target:{0}: \n{1}".format(y.shape, y.head()))

X_train,X_test, y_train, y_test = \
train_test_split(X,y,test_size=0.2,random_state=42) #<=== 20% of the samples are used for testing.
print(len(X_train),len(X_test),len(X_train)+len(X_test))
print(len(y_train) + len(y_test))
print (X_train.shape)
print (y_train.shape)

   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      538000     3    2.5  1252      1932        1993              2   
1      538888     3    2.0  1253      2100        1973              2   
2      539500     3    2.5  1298      2437        1984              2   
3      545000     3    2.5  1298      2482        1988              2   
4      548888     4    2.0  1298      2700        1973              2   

  PARKING TYPE  ORIGINAL LIST PRICE  LAST SALE PRICE  
0       Garage               538000           393500  
1       Garage               538888           400000  
2       Garage               539500           401000  
3       Garage               545000           404500  
4       Garage               548888           415000  
features (318, 8): 
   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      538000     3    2.5  1252      1932        1993              2   
1      538888     3    2.0  1253      2100        1973              2   
2      539500     3    2.5  1298      2437        1984              2   
3      545000     3    2.5  1298      2482        1988              2   
4      548888     4    2.0  1298      2700        1973              2   

   PARKING TYPE  
0             1  
1             1  
2             1  
3             1  
4             1  
target:(318,): 
0    393500
1    400000
2    401000
3    404500
4    415000
Name: LAST SALE PRICE, dtype: int64
254 64 318
318
(254, 8)
(254,)

In [86]:

from sklearn.linear_model import Ridge
from sklearn.metrics import explained_variance_score, r2_score, mean_squared_error, mean_absolute_error
ridge_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor',Ridge())
])

scores = cross_val_score(ridge_regressor_pipeline,
                         X_train, #<== training samples
                         y_train,#<== training output labels
                         cv=5,#<=== split data randomly into 5 parts, 4 for training and 1 for scoring
                         scoring='r2', #<== use R^2 score
                         n_jobs=-1)

print(scores,scores.mean(),scores.std())
print("Accuracy: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std() * 2))

[ 0.9822377   0.97144495  0.97506792  0.98181215  0.96860433] 0.975833408526 0.0054564552039
Accuracy: 0.97583 (+/- 0.01091)

Let's test with our test set and compare the predicted result with our actual results.

In [87]:

ridge_regressor_pipeline.fit(X_train,y_train)
y_predicted = ridge_regressor_pipeline.predict(X_test)
print("Variance Score: %0.05f and R2 Score: %0.5f \n" % (explained_variance_score(y_test,y_predicted), r2_score(y_test,y_predicted)))

Variance: 0.97419 and R2 Score: 0.97418

In [177]:

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and traning learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [184]:

%time plot_learning_curve(ridge_regressor_pipeline, "accuracy vs. training set size", X_train, y_train, cv=5)

CPU times: user 265 ms, sys: 48.7 ms, total: 314 ms
Wall time: 520 ms

Out[184]:

<module 'matplotlib.pyplot' from '/Users/nmishra/.virtualenvs/mvregression@python3.4.2/lib/python3.4/site-packages/matplotlib/pyplot.py'>

As you can see, with more data, our prediction confidence is almost at 97% confidence. We have limited data samples. Around 150 training examples, our accuracy has almost reached to perfect score.

Step 6. Find the best parameters for our regressor¶

Let's use GridSearch to find the best params for our model.

In [267]:

from sklearn.linear_model import Ridge
from sklearn.cross_validation import StratifiedKFold

#load the data
houses=pandas.read_csv('./data/redfin_more_data1.csv', quoting=csv.QUOTE_NONE,names=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
        "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"])
houses.interpolate(inplace=True) #replace missing values by interpolating the data in a particular column
y= houses['LAST SALE PRICE'].copy()
houses_features =houses.iloc[:,[0,1,2,3,4,5]].copy()

#split 80% of data to training set and 20% to test set
houses_train,houses_test, y_train, y_test = \
train_test_split(houses_features,y,test_size=0.2) #<=== 20% of the samples are used for testing.

#build the pipeline and use the StandardScalar to normalize the data
ridge_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor', Ridge())
])

#set some params for the regressor
param_ridge = [
{'regressor__alpha':[0.01, 0.1, 1], 'regressor__solver':['lsqr'],'regressor__tol': [.99], 'regressor__max_iter': [500] },
{'regressor__alpha':[0.01, 0.1, 1], 'regressor__solver':['cholesky'],'regressor__tol': [.99], 'regressor__max_iter': [500] },
]

#search the best fitting params for our regressor
grid_ridge = GridSearchCV(
    ridge_regressor_pipeline, #pipeline from above
    param_grid=param_ridge, #parameters to tune via cross validation
    refit=True, #fit using all data for the best regressor
    n_jobs=-1,
    scoring='r2',
    cv=StratifiedKFold(y_train,n_folds=5)
)

In [188]:

%time price_predictor = grid_ridge.fit(houses_train,y_train)
print (price_predictor.grid_scores_)

CPU times: user 196 ms, sys: 40.6 ms, total: 236 ms
Wall time: 522 ms
[mean: 0.96209, std: 0.07023, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.01, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.96208, std: 0.07026, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.1, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.96190, std: 0.07056, params: {'regressor__tol': 0.99, 'regressor__alpha': 1, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.98043, std: 0.01240, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.01, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}, mean: 0.98045, std: 0.01332, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.1, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}, mean: 0.97933, std: 0.01996, params: {'regressor__tol': 0.99, 'regressor__alpha': 1, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}]

So apparently, alpha=0.01, tol=.99 and solver='cholesky' provides the best fit. Let's run our Regressor on our test set for a sanity check.

In [268]:

yp = price_predictor.predict(houses_test)
print("Variance: %0.02f and R2 Score: %0.02f \n" % (explained_variance_score(y_test,yp), r2_score(y_test,yp)))

Variance: 0.97 and R2 Score: 0.95

We get 95% accuracy in our prediction when we try the prediction on 20% of training samples that we had split for test set. This test set has never been seen before by our predictor. Though, our R2 score as gone down a little bit.But, it represents a real world scenario. Let' predict the price of a house?

In [269]:

import math
data = {"LIST PRICE":879000,
"BEDS": 3,
"BATHS": 2.5,
"SQFT": 1900,
"LOT SIZE" : 6800,
"YEAR BUILT":2015} 
house_test = pandas.DataFrame(data,index=[0],columns=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT"])
%time yp=price_predictor.predict(house_test)
print("Predicted Price: {0}".format(round(yp[0])))

CPU times: user 409 µs, sys: 1 µs, total: 410 µs
Wall time: 419 µs
Predicted Price: 864631.0

For production, we may need to save the model, the parameters, cross validation scores, training scores and then run the model on a production database. We would then need to compare the model training , cv and test scores with the production data prediction scores and figure out if our model needs further tuning.

This is a basic example of linear multivariate regresion to predict a continous output. What's next ?

I will try to explore these following topics in future posts.

Noise Detection using Expectation Maximization Models.
Feature exploration using Novelty Detection.
Feature weighing using polynomail degrees of features.
Develop your own multivariate linear regression model,optimize the parameters and predict continuous rate.