Why is gradient descent so bad at optimizing polynomial regression?

Question from Cross Validated (Stack Exchange): https://stats.stackexchange.com/questions/350130/why-is-gradient-descent-so-bad-at-optimizing-polynomial-regression

Linear regression

Cost function

$J(\theta) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 $

$J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y) $ (vectorized version)

Gradient

$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m}X^T(X\theta - y) $

Hessian

$\frac{\partial^2 J(\theta)}{\partial \theta^2} = \frac{1}{m}X^T X $
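Both expressions follow from expanding the quadratic form and differentiating with respect to $\theta$:

$J(\theta) = \frac{1}{2m}\left(\theta^T X^T X \theta - 2\theta^T X^T y + y^T y\right) $

$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m}\left(2X^T X\theta - 2X^T y\right) = \frac{1}{m}X^T(X\theta - y) $

Differentiating once more gives the Hessian $\frac{1}{m}X^T X$, which does not depend on $\theta$: the cost surface is a fixed quadratic bowl whose curvature (and hence conditioning) is entirely determined by $X^T X$.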

Polynomial regression

The design matrix is of the form:

$\mathbf{X} = [\mathbf{1}, \mathbf{x}, \mathbf{x}^2, \mathbf{x}^3, \ldots, \mathbf{x}^n]$
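As a side note, the same design matrix can be built directly in NumPy with np.vander (the notebook below uses the pandas-based polynomial_features helper instead); a minimal sketch, assuming the same x grid as used later:

import numpy as np

x = np.linspace(0.1, 1, 40)
deg = 7

#Columns [1, x, x^2, ..., x^deg], matching the layout of polynomial_features below
X_design = np.vander(x, N = deg + 1, increasing = True)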

Libraries

In [1]:
import numpy as np
import pandas as pd
import scipy.optimize as opt
from sklearn import linear_model
import statsmodels.api as sm

from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
plt.style.use('seaborn-white')

Helper Functions

In [2]:
def costfunction(theta,X,y):
    m = np.size(y)
    theta = theta.reshape(-1,1)
    
    #Cost function in vectorized form
    h = X @ theta
    J = float((1./(2*m)) * (h - y).T @ (h - y))
    return J

def gradient_descent(theta,X,y,alpha = 0.0005,num_iters=1000):
    m = np.size(y)
    J_history = np.empty(num_iters)
    count_history = np.empty(num_iters)
    theta_1_hist, theta_2_hist = [], []
    
    for i in range(num_iters):
        #Grad function in vectorized form
        h = X @ theta
        theta = theta - alpha * (1/m)* (X.T @ (h-y))
        
        #Tracker values for plotting
        J_history[i] = costfunction(theta,X,y)
        count_history[i] = i
        theta_1_hist.append(theta[0,0])
        theta_2_hist.append(theta[1,0])
        
    return theta, J_history,count_history, theta_1_hist, theta_2_hist

def grad(theta,X,y):
    #Initializations
    theta = theta[:,np.newaxis]
    m = len(y)
    
    #Gradient in vectorized form
    h = X @ theta
    grad = (1./m)*(X.T @ (h - y))
    
    return grad.flatten()


def polynomial_features(data, deg):
    data_copy=data.copy()
    
    for i in range(1,deg):
        data_copy['X'+str(i+1)]=data_copy['X'+str(i)]*data_copy['X1']

    return data_copy

def hessian(theta,X,y):
    #Hessian of the least squares cost: (1/m) X^T X (constant in theta)
    m,n = X.shape
    X = np.asarray(X) #Accept either a pandas DataFrame or a numpy array
    return ((1./m)*(X.T @ X))

Comparing results for high-order polynomial regression

Initializing the data

In [3]:
#Create data from sin function with uniform noise
x = np.linspace(0.1,1,40)
noise = np.random.uniform(  size = 40)
y = np.sin(x * 1.5 * np.pi ) 
y_noise = (y + noise).reshape(-1,1)
y_noise = y_noise - y_noise.mean() #Centering the data

degree = 7
X_d = polynomial_features(pd.DataFrame({'X0':1,'X1': x}),degree)

Closed form solution

In [4]:
def closed_form_solution(X,y):
    return np.linalg.inv(X.T @ X) @ X.T @ y

coefs = closed_form_solution(X_d.values,y_noise)
coefs
Out[4]:
array([[-2.95437720e+00],
       [ 6.50561451e+01],
       [-5.29277649e+02],
       [ 2.29876198e+03],
       [-5.44238666e+03],
       [ 6.98373410e+03],
       [-4.56867395e+03],
       [ 1.19415965e+03]])
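A side note on the closed form: explicitly inverting X^T X is numerically fragile when the design matrix is as ill-conditioned as it is here. A sketch of a more stable alternative (not used in the rest of the notebook) based on np.linalg.lstsq:

#Solve the least squares problem directly instead of forming the inverse of X^T X
coefs_lstsq, *_ = np.linalg.lstsq(X_d.values, y_noise, rcond = None)
coefs_lstsq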

Numpy only fit

In [5]:
stepsize = .1
theta_result_1,J_history_1, count_history_1, theta_1_hist, theta_2_hist = gradient_descent(np.zeros((len(X_d.T),1)).reshape(-1,1), X_d,y_noise,alpha = stepsize,num_iters=5000)
display(theta_result_1)
array([[ 0.46548572],
       [ 1.24606966],
       [-1.36884742],
       [-1.79417815],
       [-1.19320446],
       [-0.36286233],
       [ 0.39300774],
       [ 0.99476541]])
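The coefficients found by plain gradient descent are nowhere near the closed form solution, even after 5000 iterations. One way to quantify the gap is to compare the cost at both parameter vectors; a quick check along these lines (the exact values depend on the random noise draw):

#Compare the cost reached by gradient descent with the closed form minimum
print('Gradient descent cost:', costfunction(theta_result_1, X_d, y_noise))
print('Closed form cost     :', costfunction(coefs, X_d, y_noise))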

Scipy optimize fit using the first-order derivative only

Comment: BFGS does very well here, but it requires adjusting the options.

In particular, the gradient tolerance must be made smaller, as the cost function is very flat near the global minimum.

In [6]:
import scipy.optimize as opt

theta_init = np.ones((len(X_d.T),1)).reshape(-1,1)
model_t = opt.minimize(fun = costfunction, x0 = theta_init , args = (X_d, y_noise), 
                             method = 'BFGS', jac = grad, options={'maxiter':1000, 'gtol': 1e-10, 'disp' : True})
model_t.x
Optimization terminated successfully.
         Current function value: 0.036158
         Iterations: 152
         Function evaluations: 156
         Gradient evaluations: 156
Out[6]:
array([-2.95437890e+00,  6.50561760e+01, -5.29277889e+02,  2.29876294e+03,
       -5.44238875e+03,  6.98373664e+03, -4.56867556e+03,  1.19416006e+03])
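To see why the tolerance matters, one can rerun the same call with BFGS's default gradient tolerance (gtol = 1e-5) and compare the reported minimum and coefficients; a sketch of that comparison (not run in the original notebook):

#Same BFGS call, but with the default gradient tolerance for comparison
model_default = opt.minimize(fun = costfunction, x0 = theta_init, args = (X_d, y_noise),
                             method = 'BFGS', jac = grad, options = {'maxiter':1000, 'disp' : True})
model_default.x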

Scipy optimize fit using the Hessian matrix

As expected, second-order information allows the optimizer to converge much faster

In [7]:
import scipy.optimize as opt

theta_init = np.ones((len(X_d.T),1)).reshape(-1,1)
model_t = opt.minimize(fun = costfunction, x0 = theta_init , args = (X_d, y_noise), 
                             method = 'dogleg', jac = grad, hess= hessian, options={'maxiter':1000, 'disp':True})
model_t.x
Optimization terminated successfully.
         Current function value: 0.036158
         Iterations: 20
         Function evaluations: 21
         Gradient evaluations: 21
         Hessian evaluations: 20
Out[7]:
array([-2.95437904e+00,  6.50561791e+01, -5.29277914e+02,  2.29876304e+03,
       -5.44238897e+03,  6.98373692e+03, -4.56867574e+03,  1.19416011e+03])
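In fact, because the least squares cost is exactly quadratic, a single full Newton step from any starting point lands on the closed form solution (up to the same numerical conditioning issues). A minimal sketch of that check, reusing the grad and hessian helpers above:

#One Newton step: theta - H^{-1} grad, which is exact for a quadratic cost
theta0 = np.zeros(X_d.shape[1])
H = hessian(theta0, X_d, y_noise)
g = grad(theta0, X_d.values, y_noise)
theta_newton = theta0 - np.linalg.solve(H, g)
theta_newton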

Sklearn fit

In [8]:
from sklearn import linear_model
model_d = linear_model.LinearRegression(fit_intercept=False)
model_d.fit(X_d,y_noise)
model_d.coef_
Out[8]:
array([[-2.95437904e+00,  6.50561790e+01, -5.29277914e+02,
         2.29876304e+03, -5.44238897e+03,  6.98373691e+03,
        -4.56867573e+03,  1.19416011e+03]])

Statsmodel fit

In [9]:
import statsmodels.api as sm
model_sm = sm.OLS(y_noise, X_d)
res = model_sm.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.891
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     37.49
Date:                Thu, 07 Jun 2018   Prob (F-statistic):           1.16e-13
Time:                        10:21:12   Log-Likelihood:                -4.2236
No. Observations:                  40   AIC:                             24.45
Df Residuals:                      32   BIC:                             37.96
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
X0            -2.9544      4.220     -0.700      0.489     -11.551       5.642
X1            65.0562     86.252      0.754      0.456    -110.633     240.745
X2          -529.2779    673.778     -0.786      0.438   -1901.719     843.163
X3          2298.7630   2651.280      0.867      0.392   -3101.718    7699.244
X4         -5442.3890   5762.460     -0.944      0.352   -1.72e+04    6295.358
X5          6983.7369   7007.140      0.997      0.326   -7289.340    2.13e+04
X6         -4568.6757   4460.607     -1.024      0.313   -1.37e+04    4517.283
X7          1194.1601   1156.577      1.032      0.310   -1161.709    3550.029
==============================================================================
Omnibus:                        1.721   Durbin-Watson:                   2.129
Prob(Omnibus):                  0.423   Jarque-Bera (JB):                1.447
Skew:                           0.306   Prob(JB):                        0.485
Kurtosis:                       2.297   Cond. No.                     3.02e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.02e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
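This large condition number is the crux of the original question: gradient descent on a quadratic cost converges at a rate governed by the condition number $\kappa$ of the Hessian $\frac{1}{m}X^T X$ (roughly the square of the design-matrix condition number reported above), and the largest eigenvalue caps the usable step size at roughly $2/\lambda_{max}$. Even with the best constant step size the error contracts only by a factor of about $(\kappa - 1)/(\kappa + 1)$ per iteration, so a huge condition number means painfully slow progress along the flat directions. A sketch of how to inspect this for the degree-7 design matrix:

#Condition number of the Hessian (1/m) X^T X, which drives gradient descent behaviour
H = (1. / len(y_noise)) * (X_d.values.T @ X_d.values)
eigvals = np.linalg.eigvalsh(H)

print('Largest eigenvalue        :', eigvals.max())
print('Smallest eigenvalue       :', eigvals.min())
print('Condition number          :', eigvals.max() / eigvals.min())
print('Step size limit 2/l_max   :', 2. / eigvals.max())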

Repeating the experiment with 2 polynomial variables and visualizing the results

Here we will focus on a 2-D design matrix with $x$ and $x^2$ columns. The $y$ values have been centered, so we can drop the constant term (y-intercept).

Initializing the data

In [10]:
#Create data from sin function with uniform noise
x = np.linspace(0.1,1,40) #Adjusting the starting point to reduce numerical instability
noise = np.random.uniform(  size = 40)
y = np.sin(x * 1.5 * np.pi ) 
y_noise = (y + noise).reshape(-1,1)
y_noise = y_noise - y_noise.mean() #Centering the data

#2nd order polynomial only
degree = 2
X_d = polynomial_features(pd.DataFrame({'X1': x}),degree)

#Starting point for gradient descent - see later diagrams
initial_theta = np.array([0,-2]).reshape(-1,1)
In [11]:
X_d = X_d[['X1','X2']]
X_d.head()
Out[11]:
X1 X2
0 0.100000 0.010000
1 0.123077 0.015148
2 0.146154 0.021361
3 0.169231 0.028639
4 0.192308 0.036982

Closed form solution

In [12]:
def closed_form_solution(X,y):
    return np.linalg.inv(X.T @ X) @ X.T @ y

coefs = closed_form_solution(X_d.values,y_noise)
coefs
Out[12]:
array([[ 3.62016596],
       [-5.38895073]])

Numpy only fit

In [14]:
stepsize = .3
theta_result,J_history, count_history_1, theta_1, theta_2 = gradient_descent(initial_theta,X_d,y_noise,alpha = stepsize,num_iters=10000)
display(theta_result)
array([[ 3.62016596],
       [-5.38895073]])

Plotting the gradient descent convergence and resulting fits

In [15]:
#Create the two axes up front to avoid re-adding the same subplots inside the loop
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (18,8))

#Looping through different stepsizes
for s in [.001,.01,.1,1]:
    theta_calc,J_history_1, count_history_1, theta_1, theta_2 = gradient_descent(initial_theta, X_d,y_noise,alpha = s,num_iters=5000)
    
    #Plot gradient descent convergence
    ax1.plot(count_history_1, J_history_1, label = 'Grad. desc. stepsize: {}'.format(s))

    #Plot resulting fits on data
    ax2.plot(x,X_d@theta_calc, label = 'Grad. desc. stepsize: {}'.format(s))
    
#Adding plot features
ax1.axhline(costfunction(coefs, X_d, y_noise), linestyle=':', label = 'Closed form minimum')
ax1.set_xlabel('Iteration')
ax1.set_ylabel('Cost function')
ax1.set_title('Plot of convergence: polynomial regression in x and x^2')
ax1.legend(loc = 1)

ax2.scatter(x,y_noise, facecolors = 'none', edgecolor = 'darkblue', label = 'f(x) + noise')
ax2.plot(x,X_d@coefs, linestyle=':', label = 'Closed form fit')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Noisy data and gradient descent fits')
ax2.legend()

plt.show()

Plotting the cost function in 3D

In [16]:
#Creating the dataset (as previously)

X = X_d.values

#Setup of meshgrid of theta values
T0, T1 = np.meshgrid(np.linspace(0,6,100),np.linspace(0,-8,100))

#Computing the cost function for each theta combination
zs = np.array(  [costfunction(np.array([t0,t1]).reshape(-1,1), X, y_noise.reshape(-1,1)) 
                     for t0, t1 in zip(np.ravel(T0), np.ravel(T1)) ] )
#Reshaping the cost values    
Z = zs.reshape(T0.shape)


#Computing the gradient descent
theta_result,J_history, count_history_1, theta_1, theta_2 = gradient_descent(initial_theta,X,y_noise,alpha = 0.3,num_iters=5000)


#Angles needed for quiver plot
anglesx = np.array(theta_1)[1:] - np.array(theta_1)[:-1]
anglesy = np.array(theta_2)[1:] - np.array(theta_2)[:-1]

%matplotlib inline
fig = plt.figure(figsize = (16,8))

#Surface plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(T0, T1, Z, rstride = 5, cstride = 5, cmap = 'jet', alpha=0.5)
ax.plot(theta_1,theta_2,J_history, marker = '*',markersize = 4, color = 'r', alpha = .2, label = 'Gradient descent')
ax.plot(coefs[0],coefs[1], marker = '*', color = 'black', markersize = 10)

ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')
ax.set_zlabel('Cost function')
ax.set_title('Gradient descent: minimum at {}'.format(theta_result.flatten().round(2)))
ax.view_init(45, -45)
ax.legend()


#Contour plot
ax = fig.add_subplot(1, 2, 2)
ax.contour(T0, T1, Z, 70, cmap = 'jet')
ax.quiver(theta_1[:-1], theta_2[:-1], anglesx, anglesy, scale_units = 'xy', angles = 'xy', scale = 1, color = 'r', alpha = .9, label = 'Gradient descent')
ax.plot(coefs[0],coefs[1], marker = '*', color = 'black', markersize = 10)
ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')
ax.set_title('Gradient descent: minimum at {}'.format(theta_result.flatten().round(2)))
ax.legend()

plt.show()

Scipy optimize fit

In [17]:
import scipy.optimize as opt

theta_init = np.ones((len(X_d.T),1)).reshape(-1,1)
model_t = opt.minimize(fun = costfunction, x0 = theta_init , args = (X_d, y_noise), 
                             method = 'dogleg', jac = grad, hess= hessian, options={'maxiter':1000})
model_t.x
Out[17]:
array([ 3.62016596, -5.38895073])

Sklearn fit

In [18]:
from sklearn import linear_model
model_d = linear_model.LinearRegression(fit_intercept=False)
model_d.fit(X_d,y_noise)
model_d.coef_
Out[18]:
array([[ 3.62016596, -5.38895073]])

Statsmodel fit

In [19]:
import statsmodels.api as sm
model_sm = sm.OLS(y_noise, X_d)
res = model_sm.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.850
Model:                            OLS   Adj. R-squared:                  0.842
Method:                 Least Squares   F-statistic:                     107.3
Date:                Thu, 07 Jun 2018   Prob (F-statistic):           2.33e-16
Time:                        10:22:03   Log-Likelihood:                -8.0664
No. Observations:                  40   AIC:                             20.13
Df Residuals:                      38   BIC:                             23.51
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
X1             3.6202      0.316     11.466      0.000       2.981       4.259
X2            -5.3890      0.403    -13.375      0.000      -6.205      -4.573
==============================================================================
Omnibus:                        2.004   Durbin-Watson:                   1.966
Prob(Omnibus):                  0.367   Jarque-Bera (JB):                1.893
Skew:                          -0.470   Prob(JB):                        0.388
Kurtosis:                       2.499   Cond. No.                         8.15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.