This tutorial uses multivariate regression to predict house prices. The high-level goal is to use multiple features (size, number of bedrooms, bathrooms, etc.) to predict the price of a house. This is a self-paced tutorial.

The language used throughout is Python, along with the libraries available in Python for scientific and machine learning applications.

One of these tools is the IPython notebook: interactive Python rendered as HTML, which is what you are reading right now. We'll go over other practical tools, widely used in the data science industry, below.

**You can run this notebook interactively**
1. Install the (free) [Anaconda](https://store.continuum.io/cshop/anaconda/) Python distribution, including Python itself.
2. Install the TextBlob library: [Textblob](http://textblob.readthedocs.org/en/dev/install.html).
3. Launch the notebook:
$ ipython notebook predict_house_price_python.ipynb
4. Watch the [IPython tutorial video](https://www.youtube.com/watch?v=H6dLGQw9yFQ) for notebook navigation basics.
5. Run the first code cell below; if it executes without errors, you're good to go!

## Python tutorial: Multivariate linear regression to predict house prices

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import csv
from textblob import TextBlob
import pandas
import sklearn
import pickle
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation as cv
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, train_test_split
from sklearn.learning_curve import learning_curve
from sklearn.externals import joblib
from sklearn.linear_model import LinearRegression

### Step 1: Load data, plot data

Labeled data is crucial for supervised learning. By the gracious courtesy of redfin.com, we are going to focus on the last 3 years of data for zipcode 94555. The houses in this data have a lot size of at least 4500 sq ft, 3+ beds, 2+ baths, 1500-2000 sqft of living space, and were not listed for more than 30 days on www.redfin.com.

$ mkdir data && cd data
$ ls -ltrh
-rw-r--r--@ 1 user staff   41K Feb 17 23:19 redfin_2015-02-17_94555_results.csv

Pandas is an amazing library for loading and manipulating data. We will use pandas to load the data.

In [14]:
labels = ["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
          "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"]
houses = pandas.read_csv('./data/redfin_2015-02-17_94555_results_massaged.csv',
                         quoting=csv.QUOTE_NONE, names=labels)
print(houses.head())

   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      739000     3    2.5  1988      5595        1991              2
1      749888     3    2.5  1642      9876        1986              2
2      713500     4    2.0  1504      5800        1969              2
3      749000     3    3.0  1781      5800        1971              2
4      835000     4    2.5  1857      5061        1990              2

  PARKING TYPE  ORIGINAL LIST PRICE  LAST SALE PRICE
0       Garage               739000           755000
1       Garage               749888           682500
2       Garage               599000           713500
3       Garage               749000           750000
4       Garage               835000           835000

The matrix size of the input data is:

In [16]:
print(houses.shape)

(116, 10)

It's always advisable to visualize the data. Let's create a scatter plot with the SQFT of the house on the X-axis and the LAST SALE PRICE on the Y-axis.

In [12]:
houses.plot(x='SQFT', y='LAST SALE PRICE', kind='scatter')

Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x107423cf8>

pandas also helps with aggregate statistics.
In [20]:
houses.groupby('SQFT').describe()

Out[20]:
(A large per-SQFT summary table: count, mean, std, min, quartiles and max for BATHS, BEDS, LAST SALE PRICE, LIST PRICE, LOT SIZE, ORIGINAL LIST PRICE, PARKING SPOTS and YEAR BUILT; 504 rows × 8 columns, abbreviated here.)

### Step 2: Data preprocessing

In this section, we will convert the data to a format that can be normalized and consumed by our learning algorithm. We have 9 columns of input data from which we need to predict the house price, and around 116 rows of data (training samples). We will extract the LAST SALE PRICE column and save it as our target values, then normalize the remaining columns. Normalization scales the features to a uniform range so that no particular feature carries more weight than another.
#### Step 2.1 Extract target values

We need to extract LAST SALE PRICE into a 'y' array, and the rest of the columns into a 'houses_features' array.

In [15]:
y = houses['LAST SALE PRICE'].copy()
houses_features = houses.iloc[:,[0,1,2,3,4,5,6,7,8]].copy()
print("features: \n", houses_features.head())
print("target: \n", y.head())

features:
   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      739000     3    2.5  1988      5595        1991              2
1      749888     3    2.5  1642      9876        1986              2
2      713500     4    2.0  1504      5800        1969              2
3      749000     3    3.0  1781      5800        1971              2
4      835000     4    2.5  1857      5061        1990              2

  PARKING TYPE  ORIGINAL LIST PRICE
0       Garage               739000
1       Garage               749888
2       Garage               599000
3       Garage               749000
4       Garage               835000

target:
0    755000
1    682500
2    713500
3    750000
4    835000
Name: LAST SALE PRICE, dtype: int64

#### Step 2.2 Normalize data

The first step is to give "PARKING TYPE" a numerical value. We can assign 1 for the presence of a garage and 0 for its absence.

In [16]:
import math
def convert_parking_type(features):
    features['PARKING TYPE'] \
        = features['PARKING TYPE'].map(lambda x: 1 if x.strip().lower() == 'garage' else 0)
    return features
houses_features = convert_parking_type(houses_features)
print(houses_features['PARKING TYPE'].head())

0    1
1    1
2    1
3    1
4    1
Name: PARKING TYPE, dtype: int64

There are many ways to normalize features. We will use a simple one: given a column of values X, each value is normalized as

\begin{equation*} X[i] = \frac{X[i] - mean(X)}{\sigma(X)} \end{equation*}

where mean(X) and \sigma(X) are the mean and standard deviation of the column. In our case, we will normalize each column of features.

In [17]:
def feature_normalize(matrix, mean=None, std=None):
    mean = matrix.mean() if mean is None else mean
    std = matrix.std() if std is None else std
    matrix = (matrix - mean)/std
    return {'matrix_norm': matrix.fillna(0), 'mean': mean, 'std': std}

Sanity check!
In [19]:
results = feature_normalize(houses_features)
houses_norm = results['matrix_norm']
X_mean = results['mean']
X_std = results['std']
results = feature_normalize(y)
y_norm = results['matrix_norm']
y_mean = results['mean']
y_std = results['std']
print('{0}\n'.format(houses_norm.head()))
print('{0}\n'.format(y_norm.head()))

   LIST PRICE      BEDS     BATHS      SQFT  LOT SIZE  YEAR BUILT  \
0    0.783326 -1.417852  0.444568  2.093853 -0.446035    1.547770
1    0.899198 -1.417852  0.444568 -0.591009  1.947155    0.851334
2    0.511950  0.517099 -0.949213 -1.661850 -0.331435   -1.516551
3    0.889748 -1.417852  1.838348  0.487591 -0.331435   -1.237976
4    1.804976  0.517099  0.444568  1.077330 -0.744555    1.408483

   PARKING SPOTS  PARKING TYPE  ORIGINAL LIST PRICE
0              0             0             0.777064
1              0             0             0.894669
2              0             0            -0.735122
3              0             0             0.885077
4              0             0             1.813991

0    0.679040
1   -0.086546
2    0.240808
3    0.626241
4    1.523824
Name: LAST SALE PRICE, dtype: float64

We can use the preprocessing module of sklearn to apply the same zero-mean, unit-deviation normalization. Let's use that and check the results.

In [22]:
scaler = preprocessing.StandardScaler().fit(houses_features.values)
houses_norm = scaler.transform(houses_features)
print("Mean: {0}, STD: {1}".format(scaler.mean_, scaler.std_))
print(houses_norm)

Mean: [  6.65394267e+05   3.73275862e+00   2.34051724e+00   1.71816379e+03
   6.39287931e+03   1.97988793e+03   2.00000000e+00   1.00000000e+00
   6.67058466e+05], STD: [  9.35597222e+04   5.14576468e-01   3.57186913e-01   1.28313997e+02
   1.78109820e+03   7.14839209e+00   1.00000000e+00   1.00000000e+00
   9.21813058e+04]
[[ 0.78672458 -1.42400336  0.44649665 ...,  0.          0.          0.78043519]
 [ 0.90309944 -1.42400336  0.44649665 ...,  0.          0.          0.89855024]
 [ 0.51417139  0.5193424  -0.95333068 ...,  0.          0.         -0.73831093]
 ...,
 [-1.24406384  0.5193424  -0.95333068 ...,  0.          0.         -1.28072025]
 [-0.39019213  0.5193424  -0.95333068 ...,  0.          0.         -0.41408033]
 [-2.83769833 -1.42400336 -0.95333068 ...,  0.          0.         -2.89818487]]

### Step 3: Training our model

Let's use a simple linear model that minimizes the least-squares error function.
Optional: a quick mathematical background.

\begin{equation*} y' = h_\theta(X) \end{equation*}

where

\begin{equation*} h_\theta(X) = \theta_0 + \theta_1 x_1 + ... + \theta_n x_n \end{equation*}

Here y' is the predicted value and y is the actual value. Our goal is to minimize the delta between these two values. So, if we can choose the best values for $\Theta = [\theta_0, \theta_1, ..., \theta_n]$, we can accurately predict house prices. To achieve that, we use the least-squares error function

\begin{equation*} J(\Theta) = \sum_{i=1}^m (h_\theta(X^{(i)}) - y^{(i)})^2 \end{equation*}

and look for the $\Theta$ that minimizes it. We choose some number of iterations, and in each iteration we simultaneously update every parameter:

\begin{equation*} \theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\Theta), \qquad j = 0, 1, ..., n \end{equation*}

where $\alpha$ is a scalar learning rate.

Phew! Thankfully, we can use the linear models available in sklearn instead. Let's define a train_model function in which:

1. We create a regressor with the intercept enabled. The intercept is the value of y when all feature values are 0. We can either add $x_0 = 1$ to our features matrix ourselves or set the fit_intercept parameter to True. Intuitively, the intercept is the value we would come up with for a house when given no features at all.
2. We fit (train) the model with our house features and prices.
In [23]:
def train_model(X, y):
    regressor = LinearRegression(fit_intercept=True)
    regressor.fit(X, y)
    return regressor
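For the curious, the gradient-descent update above can be sketched in a few lines of NumPy. This is a toy illustration on made-up data (the gradient is scaled by 1/m for numerical stability, a common convention), not what sklearn's LinearRegression does internally:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Toy batch gradient descent for least-squares linear regression.

    X is an (m, n) feature matrix and y an (m,) target vector. A column
    of ones is prepended so theta[0] plays the role of the intercept.
    """
    m = len(y)
    Xb = np.hstack([np.ones((m, 1)), X])       # x_0 = 1 for the intercept term
    theta = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        error = Xb.dot(theta) - y              # h_theta(X) - y for every sample
        theta -= alpha * Xb.T.dot(error) / m   # simultaneous update of all theta_j
    return theta

# recover y = 1 + 2x from a tiny synthetic dataset
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1 + 2 * X.ravel()
theta = gradient_descent(X, y)   # converges to approximately [1, 2]
```

On normalized features like ours, this converges quickly; sklearn instead solves the least-squares problem directly.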

In [24]:
regr = train_model(houses_norm, y_norm)
print(regr.coef_)

[ 0.95294642  0.02315098 -0.09140542  0.00173405  0.00790588 -0.03303855
0.          0.          0.01064799]


Let's run a prediction and see if we can predict the price of a house. We are going to use the 3rd sample from our training set. The input features (house_test) for this sample are: {"LIST PRICE": 749888, "BEDS": 3, "BATHS": 2.5, "SQFT": 1642, "LOT SIZE": 9876, "YEAR BUILT": 1968, "PARKING SPOTS": 2, "PARKING TYPE": "Garage", "ORIGINAL LIST PRICE": 74988}

and the actual output is the LAST SALE PRICE, which is 713,500.

In [25]:
house_test = pandas.DataFrame({"LIST PRICE": 749888,
                               "BEDS": 3,
                               "BATHS": 2.5,
                               "SQFT": 1642,
                               "LOT SIZE": 9876,
                               "YEAR BUILT": 1968,
                               "PARKING SPOTS": 2,
                               "PARKING TYPE": "Garage",
                               "ORIGINAL LIST PRICE": 74988}, index=[0])
house_test_converted = convert_parking_type(house_test)
house_test_norm = feature_normalize(house_test_converted, X_mean, X_std)

y_predicted_norm = regr.predict(house_test_norm['matrix_norm']);
y_predicted = y_predicted_norm*y_std + y_mean
print('Predicted Price: {0}'.format(y_predicted[0].round(0)))

print('Normalized Input: \n {0} \n'.format(house_test_norm['matrix_norm']))

Predicted Price: 713785.0
Normalized Input:
BATHS      BEDS  LIST PRICE  LOT SIZE  ORIGINAL LIST PRICE  \
0  0.444568 -1.417852    0.899198  1.947155            -6.395146

PARKING SPOTS  PARKING TYPE      SQFT  YEAR BUILT
0              0             0 -0.591009   -1.655838



Let's do some score analysis to find out the performance of our prediction algorithm.

In [178]:
print("Mean square error: {0} \n".format(mean_squared_error([y_norm[2]],y_predicted_norm)))
print("R2 Score: {0} \n".format(regr.score(houses_norm, y_norm)))

Mean square error: 9.067023349686855e-06

R2 Score: 0.8903593961815572



### Step 5. Now what? Optimizations and explorations - the fun part!

In the "evaluation" above, we didn't split the data into a training set and a test set. That's a big no-no when preparing machine learning models. Let's fix that.

A proper approach is to split the data into a training set and a test set, where the model only ever sees the training data during fitting and parameter tuning. The test data is used for the final evaluation of our model, to find out its prediction performance.

In [29]:
houses_norm_train,houses_norm_test, y_norm_train, y_norm_test = \
train_test_split(houses_norm,y_norm,test_size=0.2) #<=== 20% of the samples are used for testing.
print(len(houses_norm_train),len(houses_norm_test),len(houses_norm_train)+len(houses_norm_test))

92 24 116


So, as requested, the test size is 20% of the entire dataset (24 houses out of 116), and the training set is the rest (92 out of 116). Let me introduce the scikit-learn Pipeline concept here. A pipeline breaks our learning into sub-steps and assembles them together. This way, we can run the pipeline with different sets of parameters and cross-validation splits without changing the code of the individual steps.

In [27]:
linear_regressor_pipeline = Pipeline([
    ('normalize', preprocessing.StandardScaler()),
    ('regressor', LinearRegression())
])


A common practice is to split the training data into smaller subsets, for example 5 equally sized subsets. We then train the model on 4 of the subsets and validate its accuracy on the remaining one (the "validation set"). If we repeat this training with a different subset as the validation set each time, we can test the stability of the model. If the model gives very different scores for different subsets, we need to go back and check the model and/or the data.
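Under the hood, the scheme is simple; here is a hand-rolled sketch of the 5-subset idea on toy data (plain NumPy, a simple least-squares line as the model), which is essentially what cross_val_score automates:

```python
import numpy as np

def kfold_r2_scores(X, y, k=5):
    """Split the rows into k folds; fit a least-squares line on k-1 folds
    and compute the R^2 score on the held-out fold, once per fold."""
    indices = np.arange(len(y))
    np.random.RandomState(0).shuffle(indices)
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val = folds[i]                                         # held-out validation fold
        train = np.hstack([folds[j] for j in range(k) if j != i])
        # least-squares fit with an intercept column
        A = np.column_stack([np.ones(len(train)), X[train].ravel()])
        w = np.linalg.lstsq(A, y[train], rcond=None)[0]
        pred = w[0] + w[1] * X[val].ravel()
        ss_res = np.sum((y[val] - pred) ** 2)
        ss_tot = np.sum((y[val] - y[val].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return scores

# perfectly linear toy data: every fold should score close to 1.0
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2
scores = kfold_r2_scores(X, y)
```

cross_val_score does the same index bookkeeping for us, for any estimator and scoring function.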

In [30]:
scores = cross_val_score(linear_regressor_pipeline,
                         houses_norm_train,  #<== training samples
                         y_norm_train,       #<== training output labels
                         cv=5,               #<== split data randomly into 5 parts, 4 for training and 1 for scoring
                         scoring='r2',       #<== use R^2 score
                         n_jobs=-1)
print(scores, scores.mean(), scores.std())

[ 0.89656312  0.91393417  0.887133    0.90887704  0.63537356] 0.848376178916 0.106913414232


A score of 1 implies perfect prediction. As you can see, we are a little off (roughly 15% off on average), and one fold scores noticeably worse than the rest. We can definitely do better.

How can we fix this ?

There are a few design principles that we need to consider:

1. Get more data
• Sometimes more training data doesn't help; for example, it won't help if we are overfitting our model to the training data.
• But most of the time, this is a good place to start.
2. Try smaller sets of features
• We can do this by hand, or use dimensionality reduction with PCA techniques.
3. Polynomial features
• If we think some features have a heavier influence than others, we may want to use higher degrees of those features.
4. Combining features
• If we have better knowledge of the domain, we can combine features to build better ones.
5. New models
• We can try other regression models and see if they improve our prediction score.
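For item 3, scikit-learn can generate the higher-degree terms for us. A minimal sketch on made-up numbers (not tuned for this dataset):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# two toy features; a degree-2 expansion adds the squares and the cross term
X = np.array([[2.0, 3.0],
              [1.0, 5.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)   # columns: x1, x2, x1^2, x1*x2, x2^2
```

Such an expansion can be dropped into a pipeline as an extra ('poly', PolynomialFeatures(degree=2)) step before the scaler.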

All of the above advice is good. But, where should we start ?

Let's start with items 1, 3 and 4 and see if we can make our predictions better.

We will get more data and use cubic interpolation to fill in missing values. You can read more about interpolation here.
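As a quick illustration of what interpolate does, here is a toy Series with holes in it (made-up numbers; the default linear method is shown for clarity, while the cell below uses method='cubic'):

```python
import numpy as np
import pandas

s = pandas.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled = s.interpolate()   # linear by default: each NaN falls on the connecting line
print(filled.tolist())     # [1.0, 2.0, 3.0, 4.0, 5.0]
```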

In [181]:
labels = ["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
          "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"]
# `data` is the larger dataset, read with pandas.read_csv(..., names=labels)
# as in Step 1 (the loading line and its path are not shown here)
data.plot(x='SQFT', y='LAST SALE PRICE', kind='scatter')
data = convert_parking_type(data)
data.interpolate(method='cubic', inplace=True)
data = data.fillna(0)   # fillna returns a copy, so keep the result

y = data['LAST SALE PRICE'].copy()
X = data.iloc[:,[0,1,2,3,4,5,6,7]].copy()

X_train,X_test, y_train, y_test = \
train_test_split(X,y,test_size=0.2,random_state=42) #<=== 20% of the samples are used for testing.
print(len(X_train),len(X_test),len(X_train)+len(X_test))
print(len(y_train) + len(y_test))
print (X_train.shape)
print (y_train.shape)

   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      538000     3    2.5  1252      1932        1993              2
1      538888     3    2.0  1253      2100        1973              2
2      539500     3    2.5  1298      2437        1984              2
3      545000     3    2.5  1298      2482        1988              2
4      548888     4    2.0  1298      2700        1973              2

PARKING TYPE  ORIGINAL LIST PRICE  LAST SALE PRICE
0       Garage               538000           393500
1       Garage               538888           400000
2       Garage               539500           401000
3       Garage               545000           404500
4       Garage               548888           415000
features (318, 8):
LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      538000     3    2.5  1252      1932        1993              2
1      538888     3    2.0  1253      2100        1973              2
2      539500     3    2.5  1298      2437        1984              2
3      545000     3    2.5  1298      2482        1988              2
4      548888     4    2.0  1298      2700        1973              2

PARKING TYPE
0             1
1             1
2             1
3             1
4             1
target:(318,):
0    393500
1    400000
2    401000
3    404500
4    415000
Name: LAST SALE PRICE, dtype: int64
254 64 318
318
(254, 8)
(254,)

In [86]:
from sklearn.linear_model import Ridge
from sklearn.metrics import explained_variance_score, r2_score, mean_squared_error, mean_absolute_error
ridge_regressor_pipeline = Pipeline([
    ('normalize', preprocessing.StandardScaler()),
    ('regressor', Ridge())
])

scores = cross_val_score(ridge_regressor_pipeline,
                         X_train,       #<== training samples
                         y_train,       #<== training output labels
                         cv=5,          #<== split data randomly into 5 parts, 4 for training and 1 for scoring
                         scoring='r2',  #<== use R^2 score
                         n_jobs=-1)

print(scores, scores.mean(), scores.std())
print("Accuracy: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std() * 2))

[ 0.9822377   0.97144495  0.97506792  0.98181215  0.96860433] 0.975833408526 0.0054564552039
Accuracy: 0.97583 (+/- 0.01091)


Let's test with our test set and compare the predicted result with our actual results.

In [87]:
ridge_regressor_pipeline.fit(X_train,y_train)
y_predicted = ridge_regressor_pipeline.predict(X_test)
print("Variance Score: %0.05f and R2 Score: %0.5f \n" % (explained_variance_score(y_test,y_predicted), r2_score(y_test,y_predicted)))

Variance: 0.97419 and R2 Score: 0.97418


In [177]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed; see the
        sklearn.cross_validation module for the list of possible objects.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [184]:
%time plot_learning_curve(ridge_regressor_pipeline, "accuracy vs. training set size", X_train, y_train, cv=5)

CPU times: user 265 ms, sys: 48.7 ms, total: 314 ms
Wall time: 520 ms

Out[184]:
<module 'matplotlib.pyplot' from '/Users/nmishra/.virtualenvs/[email protected]/lib/python3.4/site-packages/matplotlib/pyplot.py'>

As you can see, with more data our cross-validation score approaches 97%. Even with our limited number of samples, the score nearly plateaus at around 150 training examples.

### Step 6. Find the best parameters for our regressor

Let's use GridSearch to find the best params for our model.

In [267]:
from sklearn.linear_model import Ridge
from sklearn.cross_validation import StratifiedKFold

# reload the data (the start of this read_csv call was cut off in the
# original; the path is assumed to match the load in Step 1)
houses = pandas.read_csv('./data/redfin_2015-02-17_94555_results_massaged.csv',
                         quoting=csv.QUOTE_NONE,
                         names=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
                                "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"])
houses.interpolate(inplace=True)  # replace missing values by interpolating the data in a particular column
y = houses['LAST SALE PRICE'].copy()
houses_features = houses.iloc[:,[0,1,2,3,4,5]].copy()

# split 80% of the data into the training set and 20% into the test set
houses_train, houses_test, y_train, y_test = \
    train_test_split(houses_features, y, test_size=0.2)  #<=== 20% of the samples are used for testing

# build the pipeline and use StandardScaler to normalize the data
ridge_regressor_pipeline = Pipeline([
    ('normalize', preprocessing.StandardScaler()),
    ('regressor', Ridge())
])

# set some params for the regressor
param_ridge = [
    {'regressor__alpha': [0.01, 0.1, 1], 'regressor__solver': ['lsqr'],
     'regressor__tol': [.99], 'regressor__max_iter': [500]},
    {'regressor__alpha': [0.01, 0.1, 1], 'regressor__solver': ['cholesky'],
     'regressor__tol': [.99], 'regressor__max_iter': [500]},
]

# search the best-fitting params for our regressor
grid_ridge = GridSearchCV(
    ridge_regressor_pipeline,    #pipeline from above
    param_grid=param_ridge,      #parameters to tune via cross validation
    refit=True,                  #fit using all data for the best regressor
    n_jobs=-1,
    scoring='r2',
    cv=StratifiedKFold(y_train, n_folds=5)
)

In [188]:
%time price_predictor = grid_ridge.fit(houses_train,y_train)
print (price_predictor.grid_scores_)

CPU times: user 196 ms, sys: 40.6 ms, total: 236 ms
Wall time: 522 ms
[mean: 0.96209, std: 0.07023, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.01, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.96208, std: 0.07026, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.1, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.96190, std: 0.07056, params: {'regressor__tol': 0.99, 'regressor__alpha': 1, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.98043, std: 0.01240, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.01, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}, mean: 0.98045, std: 0.01332, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.1, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}, mean: 0.97933, std: 0.01996, params: {'regressor__tol': 0.99, 'regressor__alpha': 1, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}]


So apparently, solver='cholesky' with tol=.99 and a small alpha (0.01-0.1) provides the best fit. Let's run our regressor on the test set for a sanity check.
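Rather than eyeballing grid_scores_, GridSearchCV records the winner in its best_params_ and best_score_ attributes. A self-contained sketch on toy data (the import falls back for newer scikit-learn versions, which moved GridSearchCV to model_selection):

```python
import numpy as np
from sklearn.linear_model import Ridge
try:
    from sklearn.grid_search import GridSearchCV       # older scikit-learn, as used above
except ImportError:
    from sklearn.model_selection import GridSearchCV   # scikit-learn >= 0.18

# exactly linear toy data: the least-regularized alpha should win
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.01, 0.1, 1.0]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)   # the winning parameter combination
print(grid.best_score_)    # its mean cross-validation score
```

With refit=True (the default), the grid object itself acts as the refitted best estimator, which is why price_predictor.predict works directly below.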

In [268]:
yp = price_predictor.predict(houses_test)
print("Variance: %0.02f and R2 Score: %0.02f \n" % (explained_variance_score(y_test,yp), r2_score(y_test,yp)))

Variance: 0.97 and R2 Score: 0.95



We get 95% accuracy when we run the prediction on the 20% of samples that we split off as the test set. This test set has never been seen by our predictor. Our R2 score has gone down a little bit, but this represents a real-world scenario. Now, let's predict the price of a house.

In [269]:
import math
data = {"LIST PRICE":879000,
"BEDS": 3,
"BATHS": 2.5,
"SQFT": 1900,
"LOT SIZE" : 6800,
"YEAR BUILT":2015}
house_test = pandas.DataFrame(data,index=[0],columns=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT"])
%time yp=price_predictor.predict(house_test)
print("Predicted Price: {0}".format(round(yp[0])))

CPU times: user 409 µs, sys: 1 µs, total: 410 µs
Wall time: 419 µs
Predicted Price: 864631.0


For production, we may need to save the model, its parameters, cross-validation scores and training scores, and then run the model against a production database. We would then compare the model's training, CV and test scores with its prediction scores on production data to figure out whether the model needs further tuning.
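For the saving step, joblib (imported at the top of this notebook) is the usual tool. A minimal sketch with a hypothetical filename, using a tiny stand-in model rather than our real pipeline (the import falls back because newer scikit-learn versions dropped sklearn.externals.joblib in favor of the standalone joblib package):

```python
import os
import tempfile
import numpy as np
from sklearn.linear_model import Ridge
try:
    from sklearn.externals import joblib   # older scikit-learn, as imported above
except ImportError:
    import joblib                          # standalone package on newer installs

# a tiny stand-in model on made-up data
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel()
model = Ridge(alpha=0.1).fit(X, y)

# persist and reload ('house_price_model.pkl' is a hypothetical filename)
path = os.path.join(tempfile.mkdtemp(), 'house_price_model.pkl')
joblib.dump(model, path)
restored = joblib.load(path)

# the reloaded model predicts exactly what the original does
assert np.allclose(model.predict(X), restored.predict(X))
```

The same dump/load pair works on a whole fitted Pipeline or GridSearchCV object, so the scaler and the regressor travel together.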

This is a basic example of multivariate linear regression to predict a continuous output. What's next?

I will try to explore these following topics in future posts.

1. Noise detection using Expectation Maximization models.
2. Feature exploration using Novelty Detection.
3. Feature weighting using polynomial degrees of features.
4. Developing your own multivariate linear regression model, optimizing the parameters, and predicting a continuous output.