Notebook

種々の一次アルゴリズムを試す¶

学習アルゴリズムを実装するための基本的な仕組みを前の章で見てきたので、本章ではその大事な一部である最適化法に焦点を当てることにする。具体的には、一次情報だけを用いる直線探索アルゴリズムを対象とする。その理由は、現代の機械学習では、この類いの最適化法がしばしば使われるからである。

表記について述べておくと、観測データ$z_{1},\ldots,z_{n}$を所与として、データ点および制御対象となるパラメータ$w \in \mathbb{R}^d$に依存する損失関数$L(w;z)$が中心となる。その勾配を

\begin{align*} \nabla L(w;z) = \left( \frac{\partial L(w;z)}{\partial w_{1}}, \ldots, \frac{\partial L(w;z)}{\partial w_{d}}\right) \end{align*}

と書いて、これは任意の$z$で定義されるとする。さらに、経験期待損失最小化（ERM)に従い、以下の問題を解くことになる。

\begin{align*} \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} L(w;z_{i}) \to \widehat{w}. \end{align*}

このERM解$\widehat{w}$は一意に決まるとは限らない。また、ERM解によっては、目的関数の値が一緒でも、汎化能力を示す指標では大きく異なることもあるため、注意が必要である。

直線探索のクラス
勾配降下法
差分法を使った勾配降下法
確率的勾配降下法
各種アルゴリズムの性能を丁寧に観察

直線探索のクラス¶

まず、ベースクラスとして、直線探索（line search）を行うアルゴリズムを実装しよう。この範疇の特徴は次の通りである。初期値$w_{(0)}$を所与として、反復的に以下の更新則を実行する。

\begin{align*} w_{(t+1)} \gets w_{(t)} + \alpha_{(t)} u_{(t)} \end{align*}

重要なのはサブプロセスの順序で、$w_{(t)} \mapsto u_{(t)} \mapsto \alpha_{(t)}$という順番に求めていくことが一般的である。つまり、探索方向$u_{(t)}$を計算してから、ステップサイズの$\alpha_{(t)} > 0$を決める一般的なAlgo_LineSearchを以下のように用意する。

In [1]:

class Algo_LineSearch:
    '''
    Basic archetype of an iterator for implementing a line
    search algorithm.

    Note that we assume w_init is a nparray matrix with the
    shape (d,1), where d represents the number of total
    parameters to be determined.
    '''

    def __init__(self, w_init, step, t_max, thres,
                 verbose=False, store=False):

        # Attributes passed from user.
        self.w = w_init
        self.step = step
        self.t_max = t_max
        self.thres = thres
        self.verbose = verbose
        self.store = store

        # Attributes determined internally.
        if self.w is None:
            self.nparas = None
            self.w_old = None
        else:
            self.nparas = self.w.size
            self.w_old = np.copy(self.w)
        self.t = None
        self.diff = np.inf
        self.stepcost = 0 # for per-step costs.
        self.cumcost = 0 # for cumulative costs.

        # Keep record of all updates (optional).
        if self.store:
            self.wstore = np.zeros((self.w.size,t_max+1), dtype=self.w.dtype)
            self.wstore[:,0] = self.w.flatten()
        else:
            self.wstore = None
        
        
    def __iter__(self):

        self.t = 0

        if self.verbose:
            print("(via __iter__)")
            self.print_state()
            
        return self
    

    def __next__(self):
        '''
        Check the stopping condition(s).
        '''
        
        if self.t >= self.t_max:
            raise StopIteration

        if self.diff <= self.thres:
            raise StopIteration

        if self.verbose:
            print("(via __next__)")
            self.print_state()
            
            
    def update(self, model, data):
        '''
        Carry out the main parameter update.
        '''
        
        # Parameter update.
        newdir = self.newdir(model=model, data=data)
        stepsize = self.step(t=self.t, model=model, data=data, newdir=newdir)
        self.w = self.w + stepsize * np.transpose(newdir)

        # Update the monitor attributes.
        self.monitor_update(model=model, data=data)

        # Run cost updates.
        self.cost_update(model=model, data=data)

        # Keep record of all updates (optional).
        if self.store:
            self.wstore[:,self.t] = self.w.flatten()


    def newdir(self, model, data):
        '''
        This will be implemented by sub-classes
        that inherit this class.
        '''
        raise NotImplementedError
    

    def monitor_update(self, model, data):
        '''
        This will be implemented by sub-classes
        that inherit this class.
        '''
        raise NotImplementedError

    
    def cost_update(self, model, data):
        '''
        This will be implemented by sub-classes
        that inherit this class.
        '''
        raise NotImplementedError

    
    def print_state(self):
        print("t =", self.t, "( max =", self.t_max, ")")
        print("diff =", self.diff, "( thres =", self.thres, ")")
        if self.verbose:
            print("w = ", self.w)
        print("------------")

今の段階では、このベースクラスは基本的な要素を明示しているだけであって、これ自体が何かの役に立つわけではない。このあとの節では、「肉付け」の作業に取りかかる。参考として、重要なパーツを羅列しておく。

update()で反復的に行う更新則のすべてを含む。
monitor_update()で終了条件にかかわる処理を行う。
cost_update()で計算コスト等についての処理を行う。
終了条件にかかわるパラメータはthresとt_maxである。
パラメータの更新は、上記の数式と同じように、まずはnewdir()を呼び出して探索方向を求めてから、step()でステップサイズを計算する。この2つが揃えば、直線探索の更新を行うことができる。

In [2]:

import numpy as np
import math

import models
import dataclass
import helpers as hlp

In [3]:

# Prepare the risk function and some data to test the various
# algorithms to follow in a simple 2D setting.

# Data-related parameters.
n = 15
d = 2
epsilon_sd = 2.0
X_sd = 1.0
X_cov = X_sd * np.eye(d)
w_star = math.pi * np.ones((d,1), dtype=np.float32)

# Put together risk function.
def risk(w):
    return hlp.riskMaker(w=w, A=X_cov, b=epsilon_sd, w_star=w_star)

# Generate data.
data = dataclass.DataSet()
X = np.random.normal(loc=0.0, scale=X_sd, size=(n,d))
epsilon = np.random.normal(loc=0.0, scale=epsilon_sd, size=(n,1))
y = X.dot(w_star) + epsilon
data.init_tr(X=X, y=y) # all for training

勾配降下法¶

探索方向を決めるにあたって、「降下方向」を求める手法がほとんどである。降下方向とは、ステップサイズさえ正しく設定すれば、現時点よりも良い候補が必ず見つかるという方向のことである。例として、まず一次元の目的関数$f:\mathbb{R} \to \mathbb{R}$を考えよう。この場合、「方向」は「左」か「右」しかなく、$u_{(t)} \in \{-1,1\}$ということである。このとき、

もし$f^{\prime}(w_{(t)}) > 0$ならば、関数$f$は増加しているため、「左」の方向に行けば（つまり$u_{(t)} = -1$と設定すれば）、ステップサイズさえ正しく決めれば、目的関数は減少する。
もし$f^{\prime}(w_{(t)}) < 0$ならば、関数$f$は減少しているため、「右」の方向に行けば（つまり$u_{(t)} = -1$と設定すれば）、ステップサイズさえ正しく決めれば、目的関数は減少する。

どのケースでも、$u_{(t)} \, f^{\prime}(w_{(t)}) < 0$が成り立つ。

この性質は多次元の関数へと拡張できる。たとえば、目的関数が$f: \mathbb{R}^d \to \mathbb{R}$のとき、$\langle u_{(t)}, \nabla f(w_{(t)}) \rangle < 0$が成り立つならば、$u_{(t)}$が降下方向である。幾何学的にいえば、負の勾配ベクトル$(-1)\,f^{\prime}(w_{(t)})$と探索方向が急角度をなすことが条件となる。

勾配降下法とは、単に探索方向と負の勾配ベクトルの角度がゼロというスペシャルケースである。これをユークリッド空間での「最急降下法」と呼ぶことも多い。その名付けは自然である。関数の特定の方向における「急峻さ」を示すのは、方向微分であることを思い出そう。我々の考えている変数等でいえば、チェーンルールを使うと、

\begin{align*} \frac{\partial f(w_{(t)}+\alpha u)}{\partial \alpha} & = \sum_{j=1}^{d} \left( \left.\frac{\partial f(x)}{\partial x_{j}}\right|_{x=w_{(t)}+\alpha u} \right) \frac{\partial(w_{(t),j}+\alpha u_{j})}{\partial \alpha}, \text{ set } \alpha = 0\\ & = \langle u, \nabla f(w_{(t)}) \rangle \end{align*}

と表せる（内積は通常のドット積）。最急降下法では、これがなるべく大きな負の値になるように$u$を選ぶ（つまり、最小化）。$(-1)$をかけて、最大化問題に置き換える。コーシー・シュワルツの不等式を使うと、下記が成り立つ。

\begin{align*} \langle u, (-1)\nabla f(w_{(t)}) \rangle \leq \|u\| \|\nabla f(w_{(t)})\|. \end{align*}

単位円上のすべての$u$を考える($\|u\|=1$)。角度がゼロとなるようにしてみると、

\begin{align*} \left\langle \frac{(-1)\nabla f(w_{(t)})}{\|\nabla f(w_{(t)})\|}, (-1)\nabla f(w_{(t)}) \right\rangle = \|\nabla f(w_{(t)})\|. \end{align*}

つまり、$u$を負の勾配ベクトルそのものに設定することで、先述の上界を完全に満たすことになる。同様に、$\langle u, \nabla f(w_{(t)}) \rangle$を最小にする方法でもある。

この背景を押さえた上で、あとは勾配降下法を実装するのみ。これは簡単である。更新則は次の通りである。

\begin{align*} w_{(t+1)} = w_{(t)} - \frac{\alpha_{(t)}}{n} \sum_{i=1}^{n} \nabla L(w_{(t)};z_{i}). \end{align*}

Algo_LineSearchを踏襲して、この更新則を実装したAlgo_GDを下記のように実装している。.

In [4]:

class Algo_GD(Algo_LineSearch):
    '''
    Iterator which implements a line-search steepest descent method,
    where the direction of steepest descent is measured using the
    Euclidean norm (this is the "usual" gradient descent). Here the
    gradient is a sample mean estimate of the true risk gradient.
    That is, this is ERM-GD.
    '''

    def __init__(self, w_init, step, t_max, thres, store):

        super(Algo_GD, self).__init__(w_init=w_init,
                                      step=step,
                                      t_max=t_max,
                                      thres=thres,
                                      store=store)

    def newdir(self, model, data):
        '''
        Determine the direction of the update.
        '''
        return (-1) * np.mean(model.g_tr(w=self.w, data=data),
                              axis=0, keepdims=True)

    def monitor_update(self, model, data):
        '''
        Update the counters and convergence
        monitors used by the algorithm. This is
        executed once every step.

        For GD, increment counter and check the
        differences at each step.
        '''
        self.t += 1
        self.diff = np.linalg.norm((self.w-self.w_old))
        self.w_old = np.copy(self.w)

        
    def cost_update(self, model, data):
        '''
        Update the amount of computational resources
        used by the routine.

        Cost here is number of gradient vectors
        computed. GD computes one vector for each
        element in the sample.
        '''
        self.stepcost = data.n_tr
        self.cumcost += self.stepcost

要点：

newdirでは、目的関数が$n$個のロス関数のサンプル平均であることを前提としている。axis=0の上でnp.mean()を呼び出しているのは、そのためである。
monitor_updateでは、更新前後の差分ノルムを取って、ステップ数tも記録している。
cost_updateでは、計算された$d$次元勾配ベクトルの数を毎回のように足している。

In [5]:

# Initialize model.
mod = models.LinearL2()

# Initialize learning algorithm.
w_init = np.random.uniform(size=(d,1))
def alpha_fixed(t, val):
    return val
def make_step(u):
    def mystep(t, model=None, data=None, newdir=None):
        return alpha_fixed(t=t, val=u)
    return mystep
al = Algo_GD(w_init=w_init,
             step=make_step(0.25),
             t_max=15,
             thres=1e-03,
             store=True)

In [6]:

# Iterate the learning algorithm.
for onestep in al:
    al.update(model=mod, data=data)

In [8]:

import matplotlib
import matplotlib.pyplot as plt

# Visualize the output (in 2D case).
mypath = al.wstore

def risk2D_helper(w1, w2):
    w2D = np.array([w1,w2]).reshape((2,1))
    return risk(w=w2D)

risk2D = np.vectorize(risk2D_helper)

if (d == 2):

    tmpdel = np.linalg.norm(w_star-w_init) * 1
    xvals = np.arange(w_star[0]-tmpdel,w_star[0]+tmpdel, 0.1)
    yvals = np.arange(w_star[1]-tmpdel,w_star[1]+tmpdel, 0.1)
    X, Y = np.meshgrid(xvals, yvals)
    Z = risk2D(w1=X, w2=Y)

    myfig = plt.figure(figsize=(7,7))
    ax = myfig.add_subplot(1,1,1)
    CS = ax.contour(X, Y, Z)
    ax.quiver(mypath[0,:-1], mypath[1,:-1],
              mypath[0,1:]-mypath[0,:-1],
              mypath[1,1:]-mypath[1,:-1],
              scale_units='xy', angles='xy', scale=1, color='k')
    CS.clabel(inline=1, fontsize=10)
    ax.plot(*w_star, 'r*', markersize=12) # print true value.
    ax.plot(*mypath[:,-1], 'bo', markersize=6) # print our final estimate.
    plt.title('Trajectory of GD routine')
    plt.show()
    

練習問題

データを生成する確率分布の設定やサンプル数を変えることで、アルゴリズムの性能がどのような影響を受けるか調べること。
反対に、データの設定を固定して、アルゴリズムの設定を変えることによって、アルゴリズムの性能がどのように変わるか調べること。最良のアルゴリズム設定がどのようにデータ分布に依存するか。

差分法を使った勾配降下法¶

基本的な技法と概念を押さえたので、学習アルゴリズムの理解をさらに深めていくために、勾配降下法の種々の改造を試してみることにする。

正真正銘のGDだと、当然ながら学習の対象となるすべてのパラメータについて偏微分を求める必要がある。自乗誤差ならば勾配は普通求まるのだが、ほかのロス関数では、勾配が定義されていても容易に計算できないケースがある（例：ロス自体が陽に表せないときなど）。

このような状況下では、ロスが十分滑らかであるなら、差分法で勾配の良い近似は簡単にできる。この方法は遅くてもオイラーの頃から知られている。座標軸を一つずつ動かしてみる。

\begin{align*} \Delta_{j} & = (0,\ldots,\delta,\ldots,0), \text{ such that} \\ w + \Delta_{j} & = (w_{1},\ldots,(w_{j}+\delta),\ldots,w_{d}), \quad j = 1,\ldots,d \end{align*}

このように差分の比を取ることで偏微分の近似が得られる。

\begin{align*} \widehat{g}_{i,j}(w) & = \frac{L((w+\Delta_{j});z_{i})-L(w;z_{i})}{\delta} \approx \frac{\partial L(w;z_{i})}{\partial w_{j}}, \quad i=1,\ldots,n \\ \widehat{g}_{i}(w) & = \left(\widehat{g}_{i,1}(w),\ldots,\widehat{g}_{i,d}(w)\right), \quad i=1,\ldots,n. \end{align*}

次に実装したいアルゴリズムは、偏微分が解析的に求められない状況下での近似的な勾配降下法である。勾配の代わりに、差分法による近似を使って、通常の更新式に代入するだけである。

\begin{align*} w_{(t+1)} = w_{(t)} - \alpha_{(t)} \frac{1}{n}\sum_{i=1}^{n} \widehat{g}_{i}(w_{(t)}). \end{align*}

具体的には、下記の通りになる。

In [9]:

class Algo_FDGD(Algo_LineSearch):
    '''
    Finite-difference gradient descent.
    '''

    def __init__(self, w_init, step, delta, t_max, thres, store):

        super(Algo_FDGD, self).__init__(w_init=w_init,
                                        step=step,
                                        t_max=t_max,
                                        thres=thres,
                                        store=store)
        self.delta = delta
        self.delmtx = np.eye(self.w.size) * delta
        
    
    def newdir(self, model, data):
        '''
        Determine the direction of the update.
        '''
        out = np.zeros((1,self.w.size), dtype=self.w.dtype)
        loss = model.l_tr(w=self.w, data=data)
        
        # Perturb one coordinate at a time, then 
        # compute finite difference.
        for j in range(self.w.size):
            delj = np.take(self.delmtx,[j],axis=1)
            loss_delta = model.l_tr((self.w + delj), data=data)
            out[:,j] = np.mean(loss_delta-loss) / self.delta
            
        return out * (-1)
    

    def monitor_update(self, model, data):
        self.t += 1
        self.diff = np.linalg.norm((self.w-self.w_old))
        self.w_old = np.copy(self.w)
        
        
    def cost_update(self, model, data):
        self.stepcost = data.n_tr
        self.cumcost += self.stepcost

前とまったく同様な数値実験も行なうことができる。通常の勾配降下法との唯一の違いは、精度を定めるパラメータdeltaを決めないといけないことである（これは引数の「差分」である$\delta$のこと）。

In [10]:

# Initialize model.
mod = models.LinearL2()

# Initialize learning algorithm.
w_init = np.random.uniform(size=(d,1))
def alpha_fixed(t, val):
    return val
def make_step(u):
    def mystep(t, model=None, data=None, newdir=None):
        return alpha_fixed(t=t, val=u)
    return mystep
al = Algo_FDGD(w_init=w_init,
               step=make_step(0.15),
               delta=0.05,
               t_max=15,
               thres=1e-03,
               store=True)

In [11]:

# Iterate the learning algorithm.
for onestep in al:
    al.update(model=mod, data=data)

In [12]:

import matplotlib
import matplotlib.pyplot as plt

# Visualize the output (in 2D case).
mypath = al.wstore

def risk2D_helper(w1, w2):
    w2D = np.array([w1,w2]).reshape((2,1))
    return risk(w=w2D)

risk2D = np.vectorize(risk2D_helper)

if (d == 2):

    tmpdel = np.linalg.norm(w_star-w_init) * 1
    xvals = np.arange(w_star[0]-tmpdel,w_star[0]+tmpdel, 0.1)
    yvals = np.arange(w_star[1]-tmpdel,w_star[1]+tmpdel, 0.1)
    X, Y = np.meshgrid(xvals, yvals)
    Z = risk2D(w1=X, w2=Y)

    myfig = plt.figure(figsize=(7,7))
    ax = myfig.add_subplot(1,1,1)
    CS = ax.contour(X, Y, Z)
    ax.quiver(mypath[0,:-1], mypath[1,:-1],
              mypath[0,1:]-mypath[0,:-1],
              mypath[1,1:]-mypath[1,:-1],
              scale_units='xy', angles='xy', scale=1, color='k')
    CS.clabel(inline=1, fontsize=10)
    ax.plot(*w_star, 'r*', markersize=12) # print true value.
    ax.plot(*mypath[:,-1], 'bo', markersize=6) # print our final estimate.
    plt.title('Trajectory of FDGD routine')
    plt.show()
    

余談にはなるが、上記のアルゴリズムは“forward difference”と呼ばれる差分法の一例である。正の値を足しているからforwardというが、当然、引き算でも問題ないし、点をまたぐような形でも良い。つまり、

\begin{align} \frac{l(w;z_{i})-l(w-\Delta_{j} ;z_{i})}{\delta} \quad \text{and} \quad \frac{l(w+\Delta_{j};z_{i})-l(w-\Delta_{j};z_{i})}{2\delta} \end{align}

という2種類の差分もある（$j$番目の偏微分の“backward difference”および“central difference”と呼ぶ）。直感的にもわかるように、後者のほうが断然、精度としては優れている。

練習問題

先ほどのAlgo_FDGDを改造し、backward differenceとcentral differenceの差分法を生かした勾配降下法を実装してみること。ほかのアルゴリズムと比較して、軌跡がどう異なるか。収束する速度はどうか。$n$が小さいときの頑健性はどうか。

確率的勾配降下法¶

学習の対象となるパラメータが$w=(w_{1},\ldots,w_{d})$きわめて数多くある状況について考える。この場合、勾配降下法を使うには、物凄い数の偏微分を計算しないといけないが、データのサンプル数が多くなると計算量に圧迫されて、通常のGDは現実的な手法でなくなる。

この計算コストを大幅に削減する良い方法として、サンプルの全体ではなく、無作為に選ぶサブセット（別名：ミニバッチ）の分だけ勾配を求める確率的な最適化法がある。まずはインデックスをランダムに取る。

\begin{align*} \mathcal{I} \subset \{1,2,\ldots,n\} \end{align*}

このミニバッチの大きさは$B = |\mathcal{I}| \ll n$となって、その部分集合の勾配を求めて、経験平均ベクトルを通常の更新式に代入する。

\begin{align*} w_{(t+1)} = w_{(t)} - \frac{\alpha_{(t)}}{B}\sum_{i \in \mathcal{I}} \nabla L(w_{(t)};z_{i}). \end{align*}

古典的には、$B=1$とするのが主流で、これを確率的勾配降下法(stochastic gradient descent, SGD)と呼ぶ。今では$B>1$のミニバッチが用いられることも多い。一般化すると$i \in \mathcal{I}$が独立かつ一様に$\{1,\ldots,n\}$からサンプルされるとすると、これらのランダム添え字について期待値を取ると：

\begin{align*} \mathbf{E} \left( \frac{1}{B}\sum_{i \in \mathcal{I}} \nabla L(w;z_{i}) \right) & = \frac{1}{B}\sum_{i \in \mathcal{I}} \mathbf{E} \nabla L(w;z_{i})\\ & = \frac{1}{B}\sum_{i \in \mathcal{I}}\left(\frac{1}{n} \sum_{j=1}^{n} \nabla L(w;z_{j}) \right)\\ & = \frac{1}{n} \sum_{j=1}^{n} \nabla L(w;z_{j}). \end{align*}

つまり、このミニバッチ平均の期待値というのは、全データ点を使った場合の平均ベクトルにほかならない。確実なことはいえないが、コストの小さい反復回数をたくさん投じることで、一回一回のばらつきが平滑化され、最終的には良い解の近辺にたどり着くだろう、という発想である。

実際の動きを自分の手で確かめてみることにしよう。実装するためには、例のnewdirメソッドのたったの数行を追加するだけで済む。

In [13]:

class Algo_SGD(Algo_LineSearch):
    '''
    Stochastic companion to Algo_GD, where we randomly take
    sub-samples (called mini-batches) to compute the sample
    mean estimate of the underlying risk gradient. By using
    small mini-batches, the computational burden of each
    iteration is eased, at the cost of poorer statistical
    estimates.
    '''
    def __init__(self, w_init, step, batchsize, replace,
                 t_max, thres, store):

        super(Algo_SGD, self).__init__(w_init=w_init,
                                       step=step,
                                       t_max=t_max,
                                       thres=thres,
                                       store=store)
        self.batchsize = batchsize
        self.replace = replace

        # Computed internally.
        self.nseen = 0
        self.npasses = 0


    def newdir(self, model, data):
        '''
        Determine the direction of the update.
        '''
        shufidx = np.random.choice(data.n_tr,
                                   size=self.batchsize,
                                   replace=self.replace)
        return (-1) * np.mean(model.g_tr(w=self.w, data=data, n_idx=shufidx),
                              axis=0, keepdims=True)

    def monitor_update(self, model, data):
        '''
        Update the counters and convergence
        monitors used by the algorithm. This is
        executed once every step.
        '''
        self.t += 1
        self.nseen += self.batchsize
        if self.nseen >= data.n_tr:
            self.npasses += 1
            self.diff = np.linalg.norm((self.w-self.w_old))
            self.w_old = np.copy(self.w)
            self.nseen = self.nseen % data.n_tr
    
    def cost_update(self, model, data):
        '''
        Update the amount of computational resources
        used by the routine.

        Cost here is number of gradient vectors
        computed. SGD computes one for each element
        of the mini-batch used.
        '''
        self.stepcost = self.batchsize
        self.cumcost += self.stepcost

要点：

batchsizeは上記の$B$で、無作為に取るミニバッチの大きさを決める。もし$B=n$であれば、言うまでもなくAlgo_GDとまったく同じ働きをする。
replaceは、ミニバッチのサンプルを取るときに、再抽出するかどうか決める論理値。
nseenとnpassesは、データ全体の何周目か記録する。上記の実装では、通常のGDと同じ終了条件となっている。

In [14]:

# Initialize model.
mod = models.LinearL2()

# Initialize learning algorithm.
w_init = np.random.uniform(size=(d,1))

def alpha_fixed(t, val):
    return val

def make_step(u):
    def mystep(t, model=None, data=None, newdir=None):
        return alpha_fixed(t=t, val=u)
    return mystep
al = Algo_SGD(w_init=w_init,
              batchsize=1,
              replace=False,
              step=make_step(0.05),
              t_max=150,
              thres=-1.0, # nullifies this condition.
              store=True)

In [15]:

# Iterate the learning algorithm.
for onestep in al:
    al.update(model=mod, data=data)

In [16]:

import matplotlib
import matplotlib.pyplot as plt

# Visualize the output (in 2D case).
mypath = al.wstore

def risk2D_helper(w1, w2):
    w2D = np.array([w1,w2]).reshape((2,1))
    return risk(w=w2D)

risk2D = np.vectorize(risk2D_helper)

if (d == 2):

    tmpdel = np.linalg.norm(w_star-w_init) * 1
    xvals = np.arange(w_star[0]-tmpdel,w_star[0]+tmpdel, 0.1)
    yvals = np.arange(w_star[1]-tmpdel,w_star[1]+tmpdel, 0.1)
    X, Y = np.meshgrid(xvals, yvals)
    Z = risk2D(w1=X, w2=Y)

    myfig = plt.figure(figsize=(7,7))
    ax = myfig.add_subplot(1,1,1)
    CS = ax.contour(X, Y, Z)
    ax.quiver(mypath[0,:-1], mypath[1,:-1],
              mypath[0,1:]-mypath[0,:-1],
              mypath[1,1:]-mypath[1,:-1],
              scale_units='xy', angles='xy', scale=1, color='k')
    CS.clabel(inline=1, fontsize=10)
    ax.plot(*w_star, 'r*', markersize=12) # print true value.
    ax.plot(*mypath[:,-1], 'bo', markersize=6) # print our final estimate.
    plt.title('Trajectory of SGD routine')
    plt.show()
    

練習問題

GDの節での練習問題に、今度はSGDについて同様に答えること。「良いアルゴリズムの設定」がGDとSGDとで、基本的にどのように異なるか。

各種アルゴリズムの性能を丁寧に観察¶

この節では、各種の学習アルゴリズムが時間とともに、平均的にどのような振る舞いを見せるか調べていく。

In [17]:

## Import all necessary modules.

import math
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

import models
import dataclass
import helpers as hlp

In [18]:

## Experiment setup.

# Data-related.
data = dataclass.DataSet() # Initialize one data object; will be re-populated at each trial.
n = 500 # sample size
d = 2 # number of parameters
init_delta = 5.0 # controls support of random initialization
num_trials = 250 # number of independent random trials to conduct
cov_X = np.eye(d) # covariance matrix of the inputs.

w_star = np.ones(d).reshape((d,1)) # vector specifying true model

w_init = w_star + np.random.uniform(low=-init_delta, high=init_delta, size=d).reshape((d,1))
# NOTE: Initial weights are randomly generated in advance, 
# so that per-trial randomness here is only due to the data.

# Algorithm-related.
m_idx_todo = [0,1,2] # let's us manually pick and choose which methods to evaluate.
t_max = 30 # termination condition; maximum number of iterations.
thres = -1.0 # termination condition; if negative, runs for max iterations.

def alpha_fixed(t, val): # step-size callback function.
    return val
def make_step(u):
    def mystep(t, model=None, data=None, newdir=None):
        return alpha_fixed(t=t, val=u)
    return mystep

# Clerical.
mth_names = ["gd", "fdgd", "sgd"]
num_mths = len(mth_names)
mth_colours = ["black", "purple", "red"]

In [19]:

# Make choice of additive noise distribution (un-comment your choice).
#paras = {"name": "norm", "shift": 0.0, "scale": 20.0}
paras = {"name": "lnorm", "meanlog": 0.0, "sdlog": 1.75}

# Put together risk function.
def risk(w):
    mean_noise, var_noise = hlp.noise_risk(paras=paras)
    return hlp.riskMaker(w=w, A=cov_X, b=math.sqrt(var_noise), w_star=w_star)
risk_star = risk(w=w_star) # optimal risk value.

In [20]:

## Running the algorithms.

# Prepare storage for performance metrics.
riskvals = np.zeros((num_trials,t_max,num_mths), dtype=np.float32)
loss_tr = np.zeros((num_trials,t_max,num_mths), dtype=np.float32)
truedist = np.zeros((num_trials,t_max,num_mths), dtype=np.float32)

# Loop over trials.
for tri in range(num_trials):
    
    # Generate new data (with *centred* noise).
    X = np.random.normal(loc=0.0, scale=1.0, size=(n,d))
    noise = hlp.noise_data(n=n, paras=paras)
    y = np.dot(X, w_star) + noise
    data.init_tr(X=X, y=y)
    
    # Initialize models.
    mod = models.LinearL2(data=data)
    loss_star = np.mean(mod.l_tr(w=w_star, data=data))
    
    
    # Initialize all algorithms.
    al_gd = Algo_GD(w_init=w_init,
                    step=make_step(0.25),
                    t_max=t_max,
                    thres=thres,
                    store=True)
    al_fdgd = Algo_FDGD(w_init=w_init,
                        step=make_step(0.25),
                        delta=0.15,
                        t_max=t_max,
                        thres=thres,
                        store=True)
    al_sgd = Algo_SGD(w_init=w_init,
                      batchsize=1,
                      replace=False,
                      step=make_step(0.01),
                      t_max=t_max,
                      thres=thres,
                      store=True)
    
    # Run all algorithms and save their performance.
    
    ## ERM-GD.
    mthidx = 0
    if mthidx in m_idx_todo:        
        idx = 0
        for onestep in al_gd:
            al_gd.update(model=mod, data=data)
            # Record performance
            loss_tr[tri,idx,mthidx] = np.mean(mod.l_tr(w=al_gd.w, data=data))-loss_star
            riskvals[tri,idx,mthidx] = risk(w=al_gd.w)-risk_star
            truedist[tri,idx,mthidx] = np.linalg.norm(w_star-al_gd.w)-0
            idx += 1
        
    ## FD-GD.
    mthidx = 1
    if mthidx in m_idx_todo:
        idx = 0
        for onestep in al_fdgd:
            al_fdgd.update(model=mod, data=data)
            # Record performance
            loss_tr[tri,idx,mthidx] = np.mean(mod.l_tr(w=al_fdgd.w, data=data))-loss_star
            riskvals[tri,idx,mthidx] = risk(w=al_fdgd.w)-risk_star
            truedist[tri,idx,mthidx] = np.linalg.norm(w_star-al_fdgd.w)-0
            idx += 1
        
    ## SGD.
    mthidx = 2
    if mthidx in m_idx_todo:
        idx = 0
        for onestep in al_sgd:
            al_sgd.update(model=mod, data=data)
            # Record performance
            loss_tr[tri,idx,mthidx] = np.mean(mod.l_tr(w=al_sgd.w, data=data))-loss_star
            riskvals[tri,idx,mthidx] = risk(w=al_sgd.w)-risk_star
            truedist[tri,idx,mthidx] = np.linalg.norm(w_star-al_sgd.w)-0
            idx += 1


# Finally, take statistics of the performance metrics over all trials.
ave_loss_tr = np.mean(loss_tr, axis=0)
ave_riskvals = np.mean(riskvals, axis=0)
ave_truedist = np.mean(truedist, axis=0)
sd_loss_tr = np.std(loss_tr, axis=0)
sd_riskvals = np.std(riskvals, axis=0)
sd_truedist = np.std(truedist, axis=0)

In [21]:

## Visualization of performance.

tvals = np.arange(t_max)+1 # better to start from the first update.

# Average over trials.
myfig = plt.figure(figsize=(15,5))

ax_loss_tr = myfig.add_subplot(1, 3, 1)
for m in m_idx_todo:
    vals = ave_loss_tr[:,m]
    err = sd_loss_tr[:,m]
    ax_loss_tr.plot(tvals, vals, color=mth_colours[m], label=mth_names[m])
    #ax_loss_tr.errorbar(tvals, vals, yerr=err, fmt='-o', col=mth_colours[m], label=mth_names[m])
    ax_loss_tr.legend(loc=1,ncol=1)
    plt.axhline(y=0.0, color="black")
    plt.title("Excess empirical risk")
    

ax_riskvals = myfig.add_subplot(1, 3, 2)
for m in m_idx_todo:
    vals = ave_riskvals[:,m]
    err = sd_riskvals[:,m]
    err_log = err / vals
    ax_riskvals.semilogy(tvals, vals, "-o", color=mth_colours[m], label=mth_names[m])
    ax_riskvals.legend(loc=1,ncol=1)
    plt.axhline(y=0.0, color="black")
    plt.title("Excess risk")

ax_variation = myfig.add_subplot(1, 3, 3)
for m in m_idx_todo:
    vals = sd_riskvals[:,m]
    ax_variation.plot(tvals, vals, "-o", color=mth_colours[m], label=mth_names[m])
    ax_variation.legend(loc=1,ncol=1)
    plt.axhline(y=0.0, color="black")
    plt.title("Variation of risk across samples")


plt.show()

練習問題

「学習効率」とは、より少ない計算資源（ここで反復回数）で、より良い解に至ることを指す。上記の実験で、データの次元、データ分布の設定、アルゴリズムの設定といった要因が学習効率にどのような影響を及ぼすか。「過学習」が起きやすい設定があるか。サンプルごとの性能のばらつきが大きくなりがちな設定があるか。この類いの問いに答えられるように調べること。
Numpyでは、多種多様な確率分布が整備されている。helpers_datagen.pyのソースを開き、ここで使っている関数hlp.noise_dataを拡大させてみること（つまり、対応ノイズの分布を増やすこと）。分布族によって、どのような違いが見られるか。