L1 and L2 regularization are two closely related techniques that can be used to reduce the *over-fitting* problem in machine learning.
The *cost function* (error = $E[(y_{pred} - y_{true})^2]$) is modified by L1 or L2 regularization by adding an *extra* penalty term:
L1 regularization (Lasso)
- error = $E[(y_{pred} - y_{true})^2] + \lambda \sum_i |\beta_i|$
L2 regularization (Ridge)
- error = $E[(y_{pred} - y_{true})^2] + \lambda \sum_i \beta_i^2$
Both approaches can reduce over-fitting, but the resulting model accuracy and computing speed can differ substantially. In this notebook we discuss how the L1 and L2 terms are obtained and the relation between the L1/L2 penalties and the regularization strength λ (alpha).
In brief, regularization is how we prevent the model from fitting "too perfectly" to redundant features.
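As a concrete illustration, here is a minimal NumPy sketch of how the L1 and L2 terms change the plain mean-squared-error cost (the toy arrays `y_true`, `y_pred`, `beta` and the value of lambda are made up for this example):
import numpy as np

# toy values, purely illustrative
y_true = np.array([3.0, 2.5, 4.1])
y_pred = np.array([2.8, 2.9, 3.9])
beta   = np.array([0.5, -1.2, 0.0, 2.0])   # model coefficients
lam    = 0.1                               # regularization strength (lambda)

mse     = np.mean((y_pred - y_true) ** 2)       # plain squared-error cost
l1_cost = mse + lam * np.sum(np.abs(beta))      # Lasso-style penalized cost
l2_cost = mse + lam * np.sum(beta ** 2)         # Ridge-style penalized cost
print(mse, l1_cost, l2_cost)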
#cd analysis/ML_/doc/
# credit : http://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_l1_l2_sparsity.html
from IPython.display import Image
print ("""
L1 Penalty and Sparsity in Logistic Regression :
demo below show how L1, L2 "reduce" effect from redundant features
that make features sparsity that make model much simple for avoiding
over fitting and learn from "not good" features
ps : C=1/lambda, the small C, the stronger L1, L2 regression
""")
Image(filename='L1_L2_Penalty_Sparcity.png')
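A rough, minimal sketch of the same idea as the referenced scikit-learn example (the dataset and the C values below are assumptions for illustration, not the exact settings of that example): fit L1- and L2-penalized logistic regression and compare how many coefficients end up exactly zero.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

for C in (1.0, 0.1, 0.01):   # C = 1/lambda: smaller C -> stronger penalty
    l1 = LogisticRegression(penalty='l1', solver='liblinear', C=C).fit(X, y)
    l2 = LogisticRegression(penalty='l2', solver='liblinear', C=C).fit(X, y)
    print('C=%g  L1 zero coefs: %d%%  L2 zero coefs: %d%%'
          % (C, 100 * np.mean(l1.coef_ == 0), 100 * np.mean(l2.coef_ == 0)))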
Regularization can be applied to both linear regression and logistic regression models.
L1 regularization (Lasso)
- Can perform "feature selection" by driving some coefficients to exactly zero, which effectively excludes those features from the model.
- Works well with a very large number of features (~millions), since it yields sparse solutions; this is in effect a form of feature selection.
- Lasso arbitrarily selects one feature among a group of highly correlated ones and shrinks the coefficients of the rest to zero. The chosen feature can also change with small changes in the model parameters, so for correlated features Lasso generally does not work as well as ridge regression.
L2 regularization (Ridge)
- The advantage of ridge regression is coefficient shrinkage and reduced model complexity.
- L2 is mainly used to prevent over-fitting. Since it keeps all the features, it is not very useful when the number of features is extremely high (~millions), as it can cost a lot of computing time and resources.
- Works well even for highly correlated features: it keeps all of them in the model, but the coefficient weight is distributed among them depending on the correlation (a small sketch contrasting this with Lasso follows below).
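A small synthetic sketch of the contrast described above (the data here is made up purely for illustration): with two nearly identical columns, Lasso tends to zero one of them out, while Ridge spreads the weight between them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)          # almost perfectly correlated with x1
X  = np.column_stack([x1, x2])
y  = 3 * x1 + 0.1 * rng.randn(200)       # target depends on the shared signal

print('Lasso coefs:', Lasso(alpha=0.1).fit(X, y).coef_)   # one coefficient driven to ~0
print('Ridge coefs:', Ridge(alpha=0.1).fit(X, y).coef_)   # weight shared between x1 and x2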
# analysis library
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# ML
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
- Lasso, Ridge Regression
# helper functions
# linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

def linear_regression(data, power, models_to_plot):
    #initialize predictors:
    predictors=['x']
    if power>=2:
        predictors.extend(['x_%d'%i for i in range(2,power+1)])
    #Fit the model
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors],data['y'])
    y_pred = linreg.predict(data[predictors])
    #Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for power: %d'%power)
    #Return the result in pre-defined format
    """
    RSS refers to 'Residual Sum of Squares':
    sum of squared errors between the predicted
    and actual values in the training data set
    """
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret
def ridge_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model
    ridgereg = Ridge(alpha=alpha,normalize=True)
    ridgereg.fit(data[predictors],data['y'])
    y_pred = ridgereg.predict(data[predictors])
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret
def lasso_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model
    lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=1e5)
    lassoreg.fit(data[predictors],data['y'])
    y_pred = lassoreg.predict(data[predictors])
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret
# generate log like toy data with noise
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 10, 7
#Define the input array: x from 60 to 300 in steps of 4
#x = np.array([i*np.pi/180 for i in range(60,300,4)])   # alternative: angles converted to radians
x = np.array([i for i in range(60,300,4)])
np.random.seed(10) #Setting seed for reproducibility
#y = np.sin(x) + np.random.normal(0,0.15,len(x))         # alternative: sin-based toy target
y = np.log(x) + np.random.normal(0,0.15,len(x))
data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
plt.plot(data['x'],data['y'],'.')
plt.title('Train data')
plt.xlabel('x')
plt.ylabel('y')
# build polynomial features x_2 ... x_15, i.e. powers of x from 2 to 15
for i in range(2,16): #power of 1 is already there
    colname = 'x_%d'%i #new column will be x_<power>
    data[colname] = data['x']**i
#print (data.head())
data.head(3)
 | x | y | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | x_9 | x_10 | x_11 | x_12 | x_13 | x_14 | x_15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 60.0 | 4.294083 | 3600.0 | 216000.0 | 12960000.0 | 7.776000e+08 | 4.665600e+10 | 2.799360e+12 | 1.679616e+14 | 1.007770e+16 | 6.046618e+17 | 3.627971e+19 | 2.176782e+21 | 1.306069e+23 | 7.836416e+24 | 4.701850e+26 |
1 | 64.0 | 4.266175 | 4096.0 | 262144.0 | 16777216.0 | 1.073742e+09 | 6.871948e+10 | 4.398047e+12 | 2.814750e+14 | 1.801440e+16 | 1.152922e+18 | 7.378698e+19 | 4.722366e+21 | 3.022315e+23 | 1.934281e+25 | 1.237940e+27 |
2 | 68.0 | 3.987698 | 4624.0 | 314432.0 | 21381376.0 | 1.453934e+09 | 9.886748e+10 | 6.722989e+12 | 4.571632e+14 | 3.108710e+16 | 2.113923e+18 | 1.437468e+20 | 9.774779e+21 | 6.646850e+23 | 4.519858e+25 | 3.073503e+27 |
#Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['model_pow_%d'%i for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)
#Define the powers for which a plot is required:
models_to_plot = {1:231,3:232,6:233,9:234,12:235,15:236}
#Iterate through all powers and assimilate results
for i in range(1,16):
    coef_matrix_simple.iloc[i-1,0:i+2] = linear_regression(data, power=i, models_to_plot=models_to_plot)
//anaconda/envs/g_dash/lib/python3.4/site-packages/scipy/linalg/basic.py:1018: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver. warnings.warn(mesg, RuntimeWarning)
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_simple
 | rss | intercept | coef_x_1 | coef_x_2 | coef_x_3 | coef_x_4 | coef_x_5 | coef_x_6 | coef_x_7 | coef_x_8 | coef_x_9 | coef_x_10 | coef_x_11 | coef_x_12 | coef_x_13 | coef_x_14 | coef_x_15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
model_pow_1 | 1.5 | 4 | 0.0064 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_2 | 1.1 | 3.5 | 0.013 | -1.8e-05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_3 | 1.1 | 3 | 0.024 | -8.5e-05 | 1.2e-07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_4 | 1 | 3.5 | 0.0095 | 5.2e-05 | -4.2e-07 | 7.7e-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_5 | 1 | 5.4 | -0.061 | 0.001 | -6.4e-06 | 1.9e-08 | -2e-11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_6 | 0.99 | -0.52 | 0.2 | -0.0035 | 3.3e-05 | -1.6e-07 | 4e-10 | -3.9e-13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_7 | 0.93 | 22 | -0.94 | 0.021 | -0.00023 | 1.5e-06 | -5.7e-09 | 1.1e-11 | -9.5e-15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_8 | 0.92 | 45 | -2.4 | 0.055 | -0.0007 | 5.3e-06 | -2.5e-08 | 6.8e-11 | -1e-13 | 6.6e-17 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_9 | 0.87 | 1.7e+02 | -11 | 0.29 | -0.0045 | 4.3e-05 | -2.6e-07 | 1e-09 | -2.6e-12 | 3.6e-15 | -2.2e-18 | NaN | NaN | NaN | NaN | NaN | NaN |
model_pow_10 | 0.87 | 1.4e+02 | -8.4 | 0.22 | -0.0032 | 2.7e-05 | -1.4e-07 | 4.1e-10 | -3.9e-13 | -1.2e-15 | 4e-18 | -3.5e-21 | NaN | NaN | NaN | NaN | NaN |
model_pow_11 | 0.87 | -73 | 9 | -0.41 | 0.01 | -0.00015 | 1.5e-06 | -9.9e-09 | 4.5e-11 | -1.4e-13 | 2.7e-16 | -3.1e-19 | 1.6e-22 | NaN | NaN | NaN | NaN |
model_pow_12 | 0.87 | -3.4e+02 | 33 | -1.3 | 0.032 | -0.00049 | 5.1e-06 | -3.7e-08 | 1.9e-10 | -6.9e-13 | 1.7e-15 | -2.9e-18 | 2.8e-21 | -1.3e-24 | NaN | NaN | NaN |
model_pow_13 | 0.86 | 3.2e+03 | -3.1e+02 | 14 | -0.35 | 0.0061 | -7.5e-05 | 6.6e-07 | -4.2e-09 | 2e-11 | -6.8e-14 | 1.6e-16 | -2.6e-19 | 2.5e-22 | -1.1e-25 | NaN | NaN |
model_pow_14 | 0.79 | 2.4e+04 | -2.5e+03 | 1.2e+02 | -3.2 | 0.061 | -0.00081 | 8e-06 | -5.8e-08 | 3.2e-10 | -1.3e-12 | 3.8e-15 | -8.2e-18 | 1.2e-20 | -1e-23 | 4e-27 | NaN |
model_pow_15 | 0.7 | -3.6e+04 | 4.3e+03 | -2.3e+02 | 7.3 | -0.16 | 0.0025 | -2.8e-05 | 2.5e-07 | -1.6e-09 | 8.1e-12 | -3.1e-14 | 8.6e-17 | -1.7e-19 | 2.4e-22 | -2e-25 | 7.5e-29 |
So, from the polynomial modelling approach above, we can see that the model becomes much more complex (the coefficients of the high-order polynomial terms grow very large) as the polynomial power increases.
#Initialize predictors to be set of 15 powers of x
predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])
#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]
#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)
models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 5:236}
for i in range(10):
    coef_matrix_ridge.iloc[i,] = ridge_regression(data, predictors, alpha_ridge[i], models_to_plot)
//anaconda/envs/g_dash/lib/python3.4/site-packages/scipy/linalg/basic.py:223: RuntimeWarning: scipy.linalg.solve Ill-conditioned matrix detected. Result is not guaranteed to be accurate. Reciprocal condition number: 5.088744863659296e-17 ' condition number: {}'.format(rcond), RuntimeWarning)
When alpha increases (stronger L2 regularization), the model complexity reduces.
One thing to notice: although high values of alpha reduce over-fitting, excessively high values can cause under-fitting as well. Choosing the right alpha is a core part of model tuning, typically done with cross-validation (a small sketch follows below).
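As a hedged sketch of that tuning step (the alpha grid below is an arbitrary assumption, and scaling via a pipeline is our choice here, not part of the helper functions above), scikit-learn's cross-validated estimators can pick alpha automatically using the `data` and `predictors` defined above:
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]   # assumed candidate regularization strengths

# scale the polynomial features first, then let the CV estimator pick the best alpha
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5, max_iter=100000))
ridge_cv.fit(data[predictors], data['y'])
lasso_cv.fit(data[predictors], data['y'])

print('best ridge alpha:', ridge_cv.named_steps['ridgecv'].alpha_)
print('best lasso alpha:', lasso_cv.named_steps['lassocv'].alpha_)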
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_ridge
 | rss | intercept | coef_x_1 | coef_x_2 | coef_x_3 | coef_x_4 | coef_x_5 | coef_x_6 | coef_x_7 | coef_x_8 | coef_x_9 | coef_x_10 | coef_x_11 | coef_x_12 | coef_x_13 | coef_x_14 | coef_x_15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
alpha_1e-15 | 0.87 | 96 | -5.1 | 0.11 | -0.0012 | 5.7e-06 | 3.2e-09 | -1.4e-10 | 3.1e-13 | 1.3e-15 | -3.7e-18 | -1.5e-20 | 4e-23 | 1.5e-25 | -6.5e-28 | 7.1e-31 | -1.1e-34 |
alpha_1e-10 | 0.92 | 14 | -0.48 | 0.0092 | -7.9e-05 | 2.7e-07 | 2.6e-10 | -2.6e-12 | -5.2e-15 | 1.7e-17 | 9.6e-20 | 6.1e-23 | -9.2e-25 | -3.4e-27 | 3.1e-30 | 5.5e-32 | -1e-34 |
alpha_1e-08 | 0.95 | 4.2 | -0.017 | 0.00044 | -2.8e-06 | 4.2e-09 | 2e-11 | -1.5e-14 | -1.9e-16 | -3.5e-19 | 4.7e-22 | 3.8e-24 | 8e-27 | 4.4e-30 | 1.7e-32 | 1.7e-34 | -1.6e-36 |
alpha_0.0001 | 0.96 | 3.6 | 0.0095 | 1.2e-05 | -1.2e-08 | -1.6e-10 | -7.1e-13 | -2e-15 | -2.9e-18 | 7e-21 | 7.8e-23 | 4e-25 | 1.5e-27 | 4e-30 | 4.9e-33 | -3.1e-35 | -3.3e-37 |
alpha_0.001 | 0.99 | 3.5 | 0.011 | 4.3e-06 | -3e-08 | -1.4e-10 | -4e-13 | -6.7e-16 | 4.2e-19 | 9.5e-21 | 5.1e-23 | 2e-25 | 6.1e-28 | 1.4e-30 | 5.1e-34 | -1.7e-35 | -1.4e-37 |
alpha_0.01 | 1.1 | 3.8 | 0.0075 | 5.7e-06 | -6.7e-09 | -5.3e-11 | -1.8e-13 | -4.2e-16 | -6.2e-19 | 4.8e-22 | 8.2e-24 | 4.1e-26 | 1.5e-28 | 4.1e-31 | 7.3e-34 | -9.4e-37 | -1.8e-38 |
alpha_1 | 3.1 | 4.6 | 0.0016 | 3.3e-06 | 7.9e-09 | 1.9e-11 | 4.8e-14 | 1.2e-16 | 2.7e-19 | 5.7e-22 | 1e-24 | 7.4e-28 | -5.3e-30 | -4.1e-32 | -2.1e-34 | -9.2e-37 | -3.8e-39 |
alpha_5 | 5.5 | 4.8 | 0.00062 | 1.4e-06 | 4e-09 | 1.2e-11 | 3.4e-14 | 1e-16 | 3.1e-19 | 9.3e-22 | 2.8e-24 | 8.6e-27 | 2.6e-29 | 7.9e-32 | 2.4e-34 | 7.2e-37 | 2.2e-39 |
alpha_10 | 6.8 | 4.9 | 0.00039 | 9.4e-07 | 2.7e-09 | 8.3e-12 | 2.6e-14 | 8.1e-17 | 2.5e-19 | 8.1e-22 | 2.6e-24 | 8.2e-27 | 2.6e-29 | 8.5e-32 | 2.7e-34 | 8.8e-37 | 2.8e-39 |
alpha_20 | 8.4 | 5 | 0.00023 | 5.8e-07 | 1.8e-09 | 5.5e-12 | 1.8e-14 | 5.6e-17 | 1.8e-19 | 6e-22 | 2e-24 | 6.4e-27 | 2.1e-29 | 6.9e-32 | 2.3e-34 | 7.5e-37 | 2.5e-39 |
- The RSS (Residual Sum of Squares) increases as alpha increases, i.e. the model complexity is reduced by a larger alpha.
#Initialize predictors to all 15 powers of x
predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])
#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10]
#Initialize the dataframe to store coefficients
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_lasso[i] for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)
#Define the models to plot
models_to_plot = {1e-10:231, 1e-5:232,1e-4:233, 1e-3:234, 1e-2:235, 1:236}
#Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_lasso.iloc[i,] = lasso_regression(data, predictors, alpha_lasso[i], models_to_plot)
//anaconda/envs/g_dash/lib/python3.4/site-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)
The plots above again show that the model complexity decreases as alpha increases. But notice the straight-line subplot at alpha=1;
it looks a bit abnormal, so let's check the coefficients further:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_lasso
 | rss | intercept | coef_x_1 | coef_x_2 | coef_x_3 | coef_x_4 | coef_x_5 | coef_x_6 | coef_x_7 | coef_x_8 | coef_x_9 | coef_x_10 | coef_x_11 | coef_x_12 | coef_x_13 | coef_x_14 | coef_x_15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
alpha_1e-15 | 0.96 | 3.5 | 0.012 | -2.4e-05 | 1.4e-07 | 3.5e-11 | -1.4e-12 | -5.8e-15 | -1.1e-17 | 5.4e-21 | 1.5e-22 | 7.3e-25 | 2.5e-27 | 6.1e-30 | 5.2e-33 | -5.3e-35 | -4.7e-37 |
alpha_1e-10 | 0.96 | 3.5 | 0.012 | -2.4e-05 | 1.4e-07 | 3.5e-11 | -1.4e-12 | -5.8e-15 | -1.1e-17 | 5.4e-21 | 1.5e-22 | 7.3e-25 | 2.5e-27 | 6.1e-30 | 5.2e-33 | -5.3e-35 | -4.7e-37 |
alpha_1e-08 | 0.96 | 3.5 | 0.012 | -2.4e-05 | 1.4e-07 | 3.7e-11 | -1.4e-12 | -5.8e-15 | -1.1e-17 | 4.9e-21 | 1.5e-22 | 7.3e-25 | 2.5e-27 | 6e-30 | 5.2e-33 | -5.3e-35 | -4.7e-37 |
alpha_1e-05 | 0.97 | 3.5 | 0.011 | 0 | -0 | -3e-11 | -1.5e-12 | -0 | -0 | 0 | 0 | 7.6e-25 | 1.4e-27 | 0 | 0 | -0 | -3.1e-37 |
alpha_0.0001 | 1.1 | 3.5 | 0.011 | -0 | -5.8e-08 | -0 | -0 | 0 | 0 | 0 | 3e-23 | 0 | 0 | 0 | 0 | -0 | -7.3e-39 |
alpha_0.001 | 1.3 | 3.9 | 0.0071 | -0 | -0 | -0 | -1.1e-13 | -0 | -0 | -0 | -0 | -0 | -0 | -0 | -0 | -0 | -0 |
alpha_0.01 | 1.8 | 4.2 | 0.0052 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
alpha_1 | 13 | 5.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
alpha_5 | 13 | 5.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
alpha_10 | 13 | 5.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
From the coefficients vs. alpha table above: