Alex's recommendations and best practices for model tuning.

I made this notebook as a warmup for Kaggle-style data science. Before you get into Kaggle-style model tuning, there are more important questions a data scientist should ask:

- How did I get this data? Is there some bias or systematic error in this dataset that may mislead me? Can I get more data or better data, for example by finding or scraping another related data source? The quality of the data is the most important thing: your insights can only be as good as your data, so this is usually the most important step in real life.
- What am I trying to achieve? What is the business-related metric I should be using?
- Only after you have decided on the dataset and the metric do you get into Kaggle-style model tuning. This is what this notebook is about. Note also that a few competitions allow you to combine external datasets; we will study this topic later.

This notebook requires the data files from the Advanced Regression Kaggle competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

For Kaggle-style model tuning, a recommended reading is linked at the end of this notebook.

In [1]:

```
# Let's start by importing some of our standard tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated alias
%matplotlib inline
```

In [2]:

```
# Let's read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
```

In [3]:

```
# The first thing you always do is LOOK at the data. Let's start with the shapes
train.shape
```

Out[3]:

In [4]:

```
test.shape
```

Out[4]:

OK, there are 1460 labeled examples in the training set and 1459 in the test set. Let's look more carefully.

In [5]:

```
test.head() #You should always do that.
```

Out[5]:

Each row is a house, and we have features like MSSubClass, MSZoning, LotFrontage (that one is easy to understand) and many others. It is important to understand what your features are. For this competition the file data_description.txt explains, for example, that MSSubClass is the type of house in that sale, e.g. 20 = 1-story house built in 1946 or newer, while code 30 is a 1-story house built before 1945, etc. Similarly, MSZoning being RL means 'Residential Low Density' while 'RH' means 'Residential High Density', etc.

We need to feed these features into models, so we have to go from categorical features to vectors of numbers. There are many ways to encode categorical features, and we will talk about this in class. The standard easy first method is called one-hot encoding (aka dummy variables): you replace one categorical feature with multiple columns that are 0 or 1. For MSZoning we will be adding columns such as MSZoning_RH ('is the zoning RH?'), MSZoning_RL, etc.

You can write manual code to do this, but pandas gives you the method get_dummies, which is convenient.
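As a quick illustration on a tiny made-up frame (not the competition data), get_dummies expands each categorical column into 0/1 indicator columns while leaving numeric columns alone:

```python
import pandas as pd

# A tiny made-up frame with one categorical and one numeric column
toy = pd.DataFrame({"MSZoning": ["RL", "RH", "RL"],
                    "LotArea": [8450, 9600, 11250]})
toy_dum = pd.get_dummies(toy, dtype=int)  # dtype=int gives 0/1 instead of booleans

print(list(toy_dum.columns))  # numeric column kept, categorical expanded
print(toy_dum)
```

The numeric column LotArea survives untouched, while MSZoning becomes the indicator columns MSZoning_RH and MSZoning_RL.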

In [6]:

```
testdum = pd.get_dummies(test)  # OK, let's run it to see what it does.
```

In [7]:

```
testdum.shape #And then we look
```

Out[7]:

So the test dataframe had 1459 examples and we still have 1459 rows, so that is good news. The 80 features have been replaced by 271 features; the increase reflects the number of distinct values the categorical features take, while the numerical features are left alone. Let's look more carefully.

In [8]:

```
testdum.head()
```

Out[8]:

You see the column MSZoning is gone but the numeric columns remain. You need to find the new dummy columns and check that they have the names you expect.

In [9]:

```
list(testdum.columns.values) #This actually lists all the columns
```

Out[9]:

If you scroll down the feature names you will find 'MSZoning_RH' and 'MSZoning_RL', as we were expecting. This is good news. You should check that the row which had MSZoning equal to RH indeed has MSZoning_RH=1 and all the other options equal to 0.

There is one important mistake we have made here. Do you see it?

The feature MSSubClass has remained numerical in testdum. This is because get_dummies did not understand that the numbers 20, 30, etc. are codes of a categorical variable, not quantities. You should be one-hot encoding that column too. Find out how to do that.
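One way to do it (a sketch on a tiny made-up frame, reusing this dataset's column names) is to cast the column to strings before calling get_dummies, so pandas treats the codes as categories:

```python
import pandas as pd

# Made-up sample with MSSubClass codes stored as integers
df = pd.DataFrame({"MSSubClass": [20, 30, 20],
                   "LotArea": [8450, 9600, 11250]})

# Cast the numeric codes to strings so get_dummies treats them as categories
df["MSSubClass"] = df["MSSubClass"].astype(str)
df_dum = pd.get_dummies(df, dtype=int)

print(list(df_dum.columns))  # MSSubClass is now one-hot encoded
```

Alternatively, `df["MSSubClass"].astype("category")` achieves the same effect.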

The next step is to look at the statistics of the features.

In [10]:

```
train.GrLivArea.hist()  # this plots the histogram of the feature GrLivArea
```

Out[10]:

Most homes are between 1,000 and 3,000 square feet. This is reasonable.

In [11]:

```
# Let's investigate how many bedrooms there are.
train.BedroomAbvGr.hist()  # This is the number of bedrooms above grade (i.e. not basement bedrooms)
```

Out[11]:

Hmm, this is odd. Why is there a gap? Let's look at the actual values of the feature.

In [12]:

```
train.BedroomAbvGr.value_counts()
```

Out[12]:

OK, it was just the way the histogram bins integers, so let's not worry about it. There is an 8-bedroom house; that seems odd. Let's find it and check whether it is an outlier or an error.

I want to select the row of the train dataframe that has BedroomAbvGr=8. I can do it with one line of pandas slicing.

In [13]:

```
train[ train.BedroomAbvGr==8]
```

Out[13]:

In [14]:

```
#Uh, it does not show me all the columns. I can fix that in many ways.
#We'll change the number of max_columns:
pd.set_option('display.max_columns',300)
train[ train.BedroomAbvGr==8] #ok lets see now.
```

Out[14]:

From the data description we read that MSSubClass=190 means this is a '2 family conversion, all styles' home. The neighborhood is SWISU (South and West of Iowa State University). The house style is 2.5Fin (two and one-half story, finished) and it was built in 1914. It has TotRmsAbvGrd = 14 (that's a lot of rooms), only 2 full bathrooms, and 2 kitchens which are unfortunately of quality 'Fa' (Fair).

Let's check how big it is overall.

In [15]:

```
train[train.BedroomAbvGr == 8].GrLivArea  # above grade (ground) living area in square feet
```

Out[15]:

So my conclusion here is that this is a pretty large old house. Probably 8 bedrooms is not an error, so let's leave it alone.

We are trying to predict the sale price. Let's look at its statistics.

In [16]:

```
train.SalePrice.hist()
```

Out[16]:

This is a very skewed distribution with a high dynamic range. Alex recommends that when you are predicting a variable with a high dynamic range, it is better to predict its logarithm instead. Engineers discovered this long ago and introduced the decibel (dB) scale. See also https://www.youtube.com/watch?v=_p-WyPg1sbU for fun.

It is best to use log1p, i.e. the logarithm of 1 plus the quantity (for non-negative features). The 1+ ensures that features that are 0 do not make the logarithm explode.
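A quick sanity check of that property, using numpy's log1p and its inverse expm1:

```python
import numpy as np

print(np.log1p(0.0))    # 0.0 -- a zero-valued feature stays finite at zero
print(np.log1p(100.0))  # log(101), compresses the dynamic range
print(np.expm1(np.log1p(7.0)))  # expm1 inverts log1p, recovering ~7.0
```

Compare with plain `np.log(0.0)`, which is negative infinity.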

In [17]:

```
# Some data cleaning
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))
# log transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])
# log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))  # compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index  # mark which features are very skewed
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
```

In [18]:

```
X_train.head()
```

Out[18]:

You see that we have taken logarithms of the skewed features. Unfortunately that even included the (categorical, numerically coded) MSSubClass, so this part should be improved. This step is called feature engineering; there is a big art to it that is specific to the data science problem at hand. We will now move into the next step, which is actually training and tuning models.

You can read more about Ridge regression here

http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression

In [19]:

```
from sklearn.linear_model import Ridge
model1= Ridge(alpha=1.0) #Alpha is the hyperparameter for the amount of regularization
#Machine learning in two lines of code, lol.
model1.fit(X_train, y) #Train
yhat= model1.predict(X_train) #Run model on train set to make predictions
```

In [20]:

```
# I want to compare my predictions to the true values.
# Let's plot them for the first 10 houses
plt.plot(yhat[0:10], marker="D")  # predictions
plt.plot(y[0:10], marker="s")     # and compare them to the ground truth
```

Out[20]:

Diamonds are the predictions and squares are the true values (log1p of the sale price). This looks pretty good. Let's compute the actual loss.

In [21]:

```
from sklearn.metrics import mean_squared_error
mean_squared_error(y,yhat)
```

Out[21]:

That seems pretty small. But I am making a mistake in interpreting this number. We will explain it later. Maybe you already see the mistake?

Let's compare this ridge model with ordinary least squares (OLS).

In [22]:

```
# We want to decide if alpha=0 (i.e. good old ordinary least squares regression)
# is better than ridge regression with alpha=1.
# Let's make a model for OLS by setting alpha (almost) to 0
model_ols = Ridge(alpha=10e-7)  # a tiny alpha, numerically close to OLS
model_ols.fit(X_train, y)
yhat_ols = model_ols.predict(X_train)  # OLS predictions
```

So which model is better? yhat or yhat_ols?

In [23]:

```
mean_squared_error(y,yhat)
```

Out[23]:

In [24]:

```
mean_squared_error(y,yhat_ols)
```

Out[24]:

OLS seems to be better than ridge regression. But this comparison is a mistake: we are looking at the training loss, while a model needs to perform well on data it has not seen before. We are overfitting. One fix is to train on half the data and test on the other half.

A better way is to use cross validation.

In [25]:

```
# How to do cross validation
from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    # Note that we take minus the negative mean squared error = MSE.
    # This is because cross_val_score wants a score, i.e. higher is better.
    return rmse
```

In [26]:

```
rmse_cv(Ridge(alpha = 1))
```

Out[26]:

Cross validation gives you 5 root mean squared error (RMSE) numbers, not one. This is because it trains the model and computes the error 5 times, once per fold. You average them to estimate your model's accuracy.

In [27]:

```
rmse_cv(Ridge(alpha = 1)).mean()
```

Out[27]:

In [28]:

```
rmse_cv(Ridge(alpha = 10e-7)).mean()
```

Out[28]:

In [30]:

```
# Instead of making alpha very small, we can also do vanilla OLS
from sklearn import linear_model  # implementation of OLS
reg_ols = linear_model.LinearRegression()
rmse_cv( reg_ols).mean()
```

Out[30]:

Note that Ridge with a very small alpha gives almost the same result as the linear_model implementation of linear regression. The small difference is probably due to the way OLS is solved in the two methods. If you set alpha too small, the solver can also run into numerical problems.

In [31]:

```
#Lets try an XGBoost regressor model
from xgboost import XGBRegressor
my_xgb_lala_model = XGBRegressor()
my_xgb_lala_model.fit(X_train, y, verbose=False)
```

Out[31]:

In [32]:

```
rmse_cv( my_xgb_lala_model).mean() #this is a little slower
```

Out[32]:

0.1308 is even better than Ridge with alpha=1 (which gave 0.1313). This is without tuning alpha or the XGB hyperparameters.

In [172]:

```
# Tuning: let's try various alphas and do cross validation to see which works best
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha=alpha)).mean()
            for alpha in alphas]
```

In [173]:

```
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title="Cross-Validation against Different Alphas")
plt.xlabel("alpha")
plt.ylabel("rmse")
```

Out[173]:

It seems that alpha=10 is the optimal amount of regularization for ridge. Let's see the performance with CV:

In [174]:

```
rmse_cv(Ridge(alpha = 10)).mean()
```

Out[174]:

This is better than the XGB regressor without tuning. But with tuning you could do better.
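As an aside, scikit-learn's RidgeCV automates the alpha sweep we did by hand above. A minimal sketch on synthetic data (swap in X_train and y to use it on the house prices):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression data standing in for X_train / y
rng = np.random.RandomState(0)
X = rng.randn(150, 8)
y_syn = X @ rng.randn(8) + 0.5 * rng.randn(150)

# RidgeCV cross-validates over the candidate alphas internally
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
model = RidgeCV(alphas=alphas, cv=5).fit(X, y_syn)
print(model.alpha_)  # the alpha that performed best in cross validation
```

This replaces the manual list comprehension over alphas with one fit call.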

We can also "stack" models by feeding the predictions of one model into another model. Be careful: if the first model was trained on a data point's label, then feeding its prediction on that point to the second model leaks the label into the second model's training.
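The standard way to avoid that leakage is to generate the first model's predictions out-of-fold, e.g. with scikit-learn's cross_val_predict, so every prediction comes from a model that never saw that row's label. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Synthetic regression data standing in for X_train / y
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y_syn = X @ rng.randn(10) + 0.3 * rng.randn(200)

# Out-of-fold predictions: each row is predicted by a model fit on the other folds
oof_pred = cross_val_predict(Ridge(alpha=1.0), X, y_syn, cv=5)

# Stack: feed the first model's predictions as an extra feature to a second model
X_stacked = np.column_stack([X, oof_pred])
second = Ridge(alpha=1.0).fit(X_stacked, y_syn)
print(X_stacked.shape)  # (200, 11)
```

For the test set, the first model is refit on the full training data before producing the extra feature.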

We recommend you continue by studying this notebook https://www.kaggle.com/apapiu/regularized-linear-models

HTML Styling commands here.

In [33]:

```
from IPython.core.display import HTML

def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)

css_styling()
```

Out[33]:

In [185]:

```
%%javascript
$('.math>span').css("border-left-color", "transparent")
```