Stochastic Gradient descent

Important: Please read the installation page for details about how to install the toolboxes. $\newcommand{\dotp}[2]{\langle #1, #2 \rangle}$ $\newcommand{\enscond}[2]{\lbrace #1, #2 \rbrace}$ $\newcommand{\pd}[2]{ \frac{ \partial #1}{\partial #2} }$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\umax}[1]{\underset{#1}{\max}\;}$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\uargmin}[1]{\underset{#1}{argmin}\;}$ $\newcommand{\norm}[1]{\|#1\|}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\choice}[1]{ \left\{ \begin{array}{l} #1 \end{array} \right. }$ $\newcommand{\pa}[1]{\left(#1\right)}$ $\newcommand{\diag}[1]{{diag}\left( #1 \right)}$ $\newcommand{\qandq}{\quad\text{and}\quad}$ $\newcommand{\qwhereq}{\quad\text{where}\quad}$ $\newcommand{\qifq}{ \quad \text{if} \quad }$ $\newcommand{\qarrq}{ \quad \Longrightarrow \quad }$ $\newcommand{\ZZ}{\mathbb{Z}}$ $\newcommand{\CC}{\mathbb{C}}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\EE}{\mathbb{E}}$ $\newcommand{\Zz}{\mathcal{Z}}$ $\newcommand{\Ww}{\mathcal{W}}$ $\newcommand{\Vv}{\mathcal{V}}$ $\newcommand{\Nn}{\mathcal{N}}$ $\newcommand{\NN}{\mathcal{N}}$ $\newcommand{\Hh}{\mathcal{H}}$ $\newcommand{\Bb}{\mathcal{B}}$ $\newcommand{\Ee}{\mathcal{E}}$ $\newcommand{\Cc}{\mathcal{C}}$ $\newcommand{\Gg}{\mathcal{G}}$ $\newcommand{\Ss}{\mathcal{S}}$ $\newcommand{\Pp}{\mathcal{P}}$ $\newcommand{\Ff}{\mathcal{F}}$ $\newcommand{\Xx}{\mathcal{X}}$ $\newcommand{\Mm}{\mathcal{M}}$ $\newcommand{\Ii}{\mathcal{I}}$ $\newcommand{\Dd}{\mathcal{D}}$ $\newcommand{\Ll}{\mathcal{L}}$ $\newcommand{\Tt}{\mathcal{T}}$ $\newcommand{\si}{\sigma}$ $\newcommand{\al}{\alpha}$ $\newcommand{\la}{\lambda}$ $\newcommand{\ga}{\gamma}$ $\newcommand{\Ga}{\Gamma}$ $\newcommand{\La}{\Lambda}$ $\newcommand{\si}{\sigma}$ $\newcommand{\Si}{\Sigma}$ $\newcommand{\be}{\beta}$ $\newcommand{\de}{\delta}$ $\newcommand{\De}{\Delta}$ $\newcommand{\phi}{\varphi}$ $\newcommand{\th}{\theta}$ $\newcommand{\om}{\omega}$ $\newcommand{\Om}{\Omega}$ $\newcommand{\eqdef}{\equiv}$

This tour details Stochastic Gradient Descent, applied to the binary logistic classification problem.

We recommend that after doing this Numerical Tours, you apply it to your own data, for instance using a dataset from LibSVM.

Disclaimer: these machine learning tours are intended to be overly-simplistic implementations and applications of baseline machine learning methods. For more advanced uses and implementations, we recommend to use a state-of-the-art library, the most well known being Scikit-Learn

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2

Usefull function to convert to a column vector.

In [2]:
#  convert to a column vector
def MakeCol(y): return y.reshape(-1,1)
#  convert to a row vector
def MakeRow(y): return y.reshape(1,-1)
# find non zero/true elements
def find(x): return np.nonzero(x)[0]

Simple Example

We first illustrate the concept of stochastic gradient descent on a simple example $$ \umin{x \in \RR} f(x) \eqdef f_1(x) + f_2(x) $$ where $f_1(x) \eqdef (x-1)^2$ and $f_2(x) \eqdef (x+1)^2$.

Functions and their derivatives.

In [3]:
def f(x,i):
    if i==1:
        return 1/2*(x-1)**2
        return 1/2*(x+1)**2
def df(x,i):
    if i==1:
        return x-1
        return x+1
def F(x): return f(x,1)+f(x,2)
In [4]:
t = np.linspace(-3,3,201)
plt.legend(('$f_1$', '$f_2$', '$F$'));

Each iteration of SGD reads $$ x_{\ell+1} = x_\ell - \tau_\ell \nabla f_{i(\ell)}(x_\ell) $$ where $ i(\ell) \in \{1,2\} $ is drawn uniformly.

Exercise 0

Implement the SGD, with a random initial condition $x_0$. Display several paths (i.e. run several times the algorithm) to vizualize the evolution of the density of the random variable $x_\ell$.

In [5]:
run -i nt_solutions/ml_4_sgd/exo0
In [6]:
## Insert your code here.

Dataset Loading

We load a subset of the < dataset Quantum Physics Dataset> of $n=10000$ features in dimension $78$. The goal in this task is to learn a classification rule that differentiates between two types of particles generated in high energy collider experiments.

Load the dataset. Randomly permute it. Separate the features $X$ from the data $y$ to predict information.

In [7]:
from scipy import io
name = 'quantum';
U = io.loadmat('nt_toolbox/data/ml-' + name)
A = U['A']
A = A[np.random.permutation(A.shape[0]),:]
X = A[:,0:-1]
y = A[:,-1]

Set the classes indexes to be $\{-1,+1\}$.

In [8]:
y = 2*y-1
# y must be a column vector
y = MakeCol(y)

Remove empty features, normalize $X$.

In [9]:
I = find( np.abs(X).mean(axis=0)>1e-1 )
X = X[:,I]
# normalize
X = X-X.mean(axis=0)
X = X/X.std(axis=0)

$n$ is the number of samples, $p$ is the dimensionality of the features,

In [10]:
[n,p] = X.shape
10000 46

Compute PCA main axes for display.

In [11]:
U, s, V = np.linalg.svd(X)
Xr = V.transpose() )
# display the decay of eigenvalues
plt.plot(s, '.-');

Plot the classes.

In [12]:
I = find(y==-1)
J = find(y==+1)
plt.plot(Xr[I[0:200],0], Xr[I[0:200],1], '.')
plt.plot(Xr[J[0:200],0], Xr[J[0:200],1], '.')
In [13]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(Xr[I[0:200],0], Xr[I[0:200],1], Xr[I[0:200],2], '.')
ax.scatter(Xr[J[0:200],0], Xr[J[0:200],1], Xr[J[0:200],2], '.')

Batch Gradient Descent (BGD)

We first test the usual (batch) gradient descent (BGD) on the problem of supervised logistic classification.

We refer to the dedicated numerical tour on logistic classification for background and more details about the derivations of the energy and its gradient.

Logistic classification aims at solving the following convex program $$ \umin{w} E(w) \eqdef \frac{1}{n} \sum_{i=1}^n L(\dotp{x_i}{w},y_i) $$ where the logistic loss reads $$ L( s,y ) \eqdef \log( 1+\exp(-sy) ) $$

Define energy $E$ and its gradient $\nabla E$.

In [14]:
def L(s,y):  return 1/n * np.sum( np.log( 1 + np.exp(-s*y) ) )
def E(w,X,y): return L(,y)
def theta(v): return 1 / (1+np.exp(-v))
def nablaL(s,r): return - 1/n * y * theta(-s * y)
def nablaE(w,X,y): return X.transpose().dot( nablaL(,y) )

Exercise 1

Implement a gradient descent $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell). $$ Monitor the energy decay. Test different step size, and compare with the theory (in particular plot in log domain to illustrate the linear rate).

In [15]:
run -i nt_solutions/ml_4_sgd/exo1
In [16]:
## Insert your code here.

Stochastic Gradient Descent (SGD)

As any empirical risk minimization procedure, the logistic classification minimization problem can be written as $$ \umin{w} E(w) = \frac{1}{n} \sum_i E_i(w) \qwhereq E_i(w) = L(\dotp{x_i}{w},y_i). $$

For very large $n$ (which could in theory even be infinite, in which case the sum needs to be replaced by an expectation or equivalenty an integral), computing $\nabla E$ is prohebitive. It is possible instead to use a stochastic gradient descent (SGD) scheme $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E_{i(\ell)}(w_\ell) $$ where, for each iteration index $\ell$, $i(\ell)$ is drawn uniformly at random in $ \{1,\ldots,n\} $.

Note that here $$ \nabla E_{i}(w) = x_i \nabla L( \dotp{x_i}{w}, y_i ) \qwhereq \nabla L(u,v) = v \odot \th(-u) $$

In [17]:
def nablaEi(w,i): return MakeCol( -y[i] * X[i,:].transpose() * theta( -y[i] * (X[i,:].dot(w)) ) )

Note that each step of a batch gradient descent has complexity $O(np)$, while a step of SGD only has complexity $O(p)$. SGD is thus advantageous when $n$ is very large, and one cannot afford to do several passes through the data. In some situation, SGD can provide accurate results even with $\ell \ll n$, exploiting redundancy between the samples.

A crucial question is the choice of step size schedule $\tau_\ell$. It must tends to 0 in order to cancel the noise induced on the gradient by the stochastic sampling. But it should not go too fast to zero in order for the method to keep converging.

A typical schedule that ensures both properties is to have asymptically $\tau_\ell \sim \ell^{-1}$ for $\ell\rightarrow +\infty$. We thus propose to use $$ \tau_\ell \eqdef \frac{\tau_0}{1 + \ell/\ell_0} $$ where $\ell_0$ indicates roughly the number of iterations serving as a "warmup" phase.

One can prove the following convergence result $$ \EE( E(w_\ell) ) - E(w^\star) = O\pa{ \frac{1}{\sqrt{\ell}} }, $$ where $\EE$ indicates an expectation with respect to the i.i.d. sampling performed at each iteration.

Select default values for $ (\ell_0,\tau_0) $.

In [18]:
l0 = 100
tau0 = .05

Exercise 2

Perform the Stochastic gradient descent. Display the evolution of the energy $E(w_\ell)$. One can overlay on top (black dashed curve) the convergence of the batch gradient descent, with a carefull scaling of the number of iteration to account for the fact that the complexity of a batch iteration is $n$ times larger. Perform several runs to illustrate the probabilistic nature of the method. Explore different values for $ (\ell_0,\tau_0) $.

In [19]:
run -i nt_solutions/ml_4_sgd/exo2
In [20]:
## Insert your code here.

Stochastic Gradient Descent with Averaging (SGA)

Stochastic gradient descent is slow because of the fast decay of $\tau_\ell$ toward zero.

To improve somehow the convergence speed, it is possible to average the past iterate, i.e. run a "classical" SGD on auxiliary variables $ (\tilde w_\ell)_\ell$ $$ \tilde w_{\ell+1} = \tilde w_\ell - \tau_\ell \nabla E_{i(\ell)}(\tilde w_\ell) $$ and output as estimated weight vector the average $$ w_\ell \eqdef \frac{1}{\ell} \sum_{k=1}^\ell \tilde w_\ell. $$ This defines the Stochastic Gradient Descent with Averaging (SGA) algorithm.

Note that it is possible to avoid explicitely storing all the iterates by simply updating a running average as follow $$ w_{\ell+1} = \frac{1}{\ell} \tilde w_\ell + \frac{\ell-1}{\ell} w_\ell. $$

In this case, a typical choice of decay is rather of the form $$ \tau_\ell \eqdef \frac{\tau_0}{1 + \sqrt{\ell/\ell_0}}. $$ Notice that the step size now goes much slower to 0, at rate $\ell^{-1/2}$.

Typically, because the averaging stabilizes the iterates, the choice of $(\ell_0,\tau_0)$ is less important than for SGD.

Bach proves that for logistic classification, it leads to a faster convergence (the constant involved are smaller) than SGD, since on contrast to SGD, SGA is adaptive to the local strong convexity of $E$.

Exercise 3

Implement the Stochastic gradient descent with averaging. Display the evolution of the energy $E(w_\ell)$.

In [21]:
run -i nt_solutions/ml_4_sgd/exo3
In [22]:
## Insert your code here.

Stochastic Averaged Gradient Descent (SAG)

For problem size $n$ where the dataset (of size $n \times p$) can fully fit into memory, it is possible to further improve the SGA method by bookeeping the previous gradient. This gives rise to the Stochastic Averaged Gradient Descent (SAG) algorithm.

We stored all the previously computed gradient in $ (G^i)_{i=1}^n $, which necessitate $n \times p$ memory. The iterates are defined by using a proxy $g$ for the batch gradient, which is progressively enhanced during the iterates.

The algorithm reads $$ h \leftarrow \nabla E_{i(\ell)}(\tilde w_\ell), $$ $$ g \leftarrow g - G^{i(\ell)} + h, $$ $$ G^{i(\ell)} \leftarrow h, $$ $$ w_{\ell+1} = w_\ell - \tau g. $$ Note that in contrast to SGD and SGA, this method uses a fixed step size $\tau$. Similarely to the BGD, in order to ensure convergence, the step size $\tau$ should be of the order of $1/L$ where $L$ is the Lipschitz constant of $E$.

This algorithm improves over SGA and BGD since it has a convergence rate of $O(1/\ell)$. Furthermore, in the presence of strong convexity (for instance when $X$ is injective for logistic classification), it has a linear convergence rate, i.e. $$ \EE( E(w_\ell) ) - E(w^\star) = O\pa{ \rho^\ell }, $$ for some $0 < \rho < 1$.

Note that this improvement over SGD and SGA is made possible only because SAG explictely use the fact that $n$ is finite (while SGD and SGA can be extended to infinite $n$ and more general minimization of expectations).

Exercise 4

Implement SAG. Display the evolution of the energy $E(w_\ell)$.

In [23]:
run -i nt_solutions/ml_4_sgd/exo4
In [24]:
## Insert your code here.