Function Approximator

Let's consider an $ f \in \mathcal{F} $ Function Approximator so that

$$ f : \mathcal{X} \rightarrow \mathcal{Y} $$

Let's consider an $ f_{\theta} $ Parametric Function Approximator with $ \theta $ Params so that

$$ f : \mathcal{X} \times \Theta \rightarrow \mathcal{Y} $$

Let's denote with $ \hat y \in \mathcal{Y} $ its estimate for a given $ x \in \mathcal{X} $ Input, assuming a certain $ \theta \in \Theta $ Params Set

$$ \hat y = f_{\theta}(x) = f(x; \theta) $$
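
As a concrete illustration, here is a minimal Python sketch of a Parametric Function Approximator; the affine model and the specific values of $ x $ and $ \theta $ are illustrative assumptions, not part of the definitions above

```python
# A minimal sketch of a Parametric Function Approximator f(x; theta):
# an affine model, assuming X = R, Theta = R^2, Y = R (illustrative choices)
def f(x, theta):
    w, b = theta
    return w * x + b

theta = (2.0, -1.0)  # a certain theta in the Theta Params Space
x = 3.0              # a given x Input
y_hat = f(x, theta)  # the estimate: y_hat = f(x; theta) = 5.0
```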

Metrics

Loss Function

Let's consider an $ f \in \mathcal{F} $ Function Approximator which is able to perform predictions such as $ \hat y = f(x) $ for some $ x \in \mathcal{X}, \hat y \in \mathcal{Y} $

To get a quantitative measure of how well the $ f $ predictor is working, let's introduce the Loss Function $ L \in \mathcal{L} $ as a dissimilarity measure of its two arguments

$$ L : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}^{+} $$

Examples

Binary Loss Function

The $ L^{(bin)}(a,b) $ returns $ 0 $ only when $ a = b $ and $ 1 $ otherwise

$$ L^{(bin)}(a,b) = \begin{cases} 0 & a = b \\ 1 & a \neq b \end{cases} $$

Absolute Loss Function

The $ L^{(abs)}(a,b) $ returns the Absolute Difference

$$ L^{(abs)}(a,b) = | a - b | $$

Square Loss Function

The $ L^{(sq)}(a,b) $ returns the Square Difference

$$ L^{(sq)}(a,b) = (a - b)^{2} $$
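
The three example Loss Functions above translate directly into code; a minimal Python sketch

```python
def loss_bin(a, b):
    """Binary loss: 0 when a = b, 1 otherwise."""
    return 0.0 if a == b else 1.0

def loss_abs(a, b):
    """Absolute loss: |a - b|."""
    return abs(a - b)

def loss_sq(a, b):
    """Square loss: (a - b)^2."""
    return (a - b) ** 2
```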

Risk

The Risk definition involves a certain $ P $ PDF from which it is possible to draw $ (x,y) $ Pairs

$$ R(f, P) = E_{(x,y) \sim P} \left [ L(f(x), y) \right ] $$
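
When $ P $ is known, the expectation can be approximated by Monte Carlo sampling; a minimal Python sketch, where the linear-plus-noise data distribution, the candidate predictor, and the square loss are all illustrative assumptions

```python
import random

def sample_P():
    """Draw one (x, y) Pair from a toy data distribution P."""
    x = random.gauss(0.0, 1.0)
    y = 2.0 * x + random.gauss(0.0, 0.1)  # linear signal plus noise
    return x, y

f = lambda x: 2.0 * x                  # a candidate predictor
L = lambda y_hat, y: (y_hat - y) ** 2  # square loss

# Monte Carlo estimate of R(f, P) = E[L(f(x), y)]
n = 100_000
risk = sum(L(f(x), y) for x, y in (sample_P() for _ in range(n))) / n
# here risk ~= Var(noise) = 0.01, since f matches the signal exactly
```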

Empirical Risk

The Empirical Risk typically involves a Sampled PDF Approximation $ D^{(ds)} = \{ (x_{i}, y_{i}) \}_{i=1,...,N} $ called Dataset, so the Empirical Risk definition becomes

$$ R(f, D^{(ds)}) = E_{(x,y) \sim D^{(ds)}} \left [ L(f(x), y) \right ] $$

Expectation as Averaging

If the $ E_{D^{(ds)}} \left [ \cdot \right ] $ Expectation Operator is defined as an Average then the Empirical Risk becomes

$$ R(f, D^{(ds)}) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_{i}), y_{i}) $$
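
In code the Empirical Risk is then just this average; a minimal Python sketch with an illustrative toy Dataset, predictor, and loss

```python
def empirical_risk(f, dataset, L):
    """Average of the pointwise losses over the Dataset."""
    return sum(L(f(x), y) for x, y in dataset) / len(dataset)

D_ds = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2)]  # toy {(x_i, y_i)} Dataset
R = empirical_risk(lambda x: 2.0 * x, D_ds, lambda a, b: (a - b) ** 2)
```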

Notes

  • Typically the Dataset is split into Training Set and Validation Set: the former is used to perform the Training while the latter is used to check the Generalization

$$ D^{(ds)} = D^{(ts)} \cup D^{(vs)} $$

  • In Stochastic Gradient Descent based Training the Training Set gets sub-sampled into Batches so that the Training actually happens according to the $ D_{t}^{(b)} $ Batch for a specific $ t $ Training Iteration (see the sketch after this list)

$$ D_{t}^{(b)} \subset D^{(ts)} $$
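
A minimal Python sketch of the Batch sub-sampling; the batch size and sampling without replacement are illustrative choices

```python
import random

def sample_batch(D_ts, batch_size=32):
    """Draw the D_b Batch from the D_ts Training Set for one Training Iteration."""
    return random.sample(D_ts, min(batch_size, len(D_ts)))

D_ts = [(float(i), 2.0 * float(i)) for i in range(1000)]  # toy Training Set
D_b = sample_batch(D_ts)  # D_b is a subset of D_ts, used at iteration t
```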

Training

Standard Risk based Training

In general the Training Problem is essentially a Search Problem in the $ \Theta $ Parameters Space: it is modeled in the Optimization Framework, relying on a Risk Measure that depends on a certain $ P $ Data Distribution

$$ \theta^{\star} = \arg\min_{\theta \in \Theta} R(f_{\theta}, P) \qquad f^{\star} = f_{\theta^{\star}} $$

Empirical Risk based Training

Practically the $ P $ True Data Distribution is unknown, instead the $ D^{(ds)} $ Dataset is available as a Sampled Approximation of $ P $ hence the Optimization Problem becomes

$$ \theta^{\dagger} = \arg\min_{\theta \in \Theta} R(f_{\theta}, D^{(ds)}) = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} L(f_{\theta}(x_{i}), y_{i}) \qquad f^{\dagger} = f_{\theta^{\dagger}} $$

Under regularity conditions there are theoretical convergence guarantees so that, as the $ N $ Dataset Size grows,

$$ f^{\dagger} \rightarrow f^{\star} $$
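
This convergence can be observed numerically; a minimal Python sketch for the model $ f(x; w) = w x $ with square loss, where the Empirical Risk minimizer has the closed form $ w^{\dagger} = \sum x_{i} y_{i} / \sum x_{i}^{2} $ (the toy data distribution is an illustrative assumption)

```python
import random

random.seed(0)

def erm_w(N):
    """Closed-form Empirical Risk minimizer of (1/N) sum (w*x - y)^2."""
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

for N in (10, 1_000, 100_000):
    print(N, erm_w(N))  # approaches the true parameter w_star = 2.0 as N grows
```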

Overfitting

The $ P \neq D^{(ds)} $ mismatch between the True Data PDF and the Sampled Data PDF used for the Training (hence composing the Training and Validation Set) can be responsible for the Model Overfitting, along with a high model learning capacity, as is the case for Big Deep Neural Networks

Dataset Improvement

In order to combat overfitting, the Big Data approach has led to the development of $ D^{(ds)} $ Datasets which are better approximations of $ P $, although in some applications the margin for improvement is still big

Besides increasing the Dataset Size with more elements, other approaches exist, like Dataset Augmentation
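
A minimal Python sketch of one common Dataset Augmentation recipe, perturbing inputs with small noise while keeping the targets; the noise scale and number of copies are illustrative assumptions

```python
import random

def augment(dataset, copies=2, sigma=0.01):
    """Enlarge the Dataset with noisy copies of the inputs."""
    augmented = list(dataset)
    for _ in range(copies):
        augmented += [(x + random.gauss(0.0, sigma), y) for x, y in dataset]
    return augmented

D_ds = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2)]
D_aug = augment(D_ds)  # 3x the original size with this sketch's settings
```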

Regularization

The "Regularization Path" Mol consists of introducing a "Prior Solution Measure Path" Mol to the "Search Path" Mol which should work as a "Compass" Mol along the "Minimization Path" Mol : it should steer away from certain kind of solutions privileging other kinds

Another approach to combact overfitting, regards adding $ \rho( \theta_{t} ) $ as a quality measure on a specific $ \theta_{t} $ solution at $ t $ Training Time

$$ \rho : \Theta \rightarrow \mathbb{R}^{+} $$

This term can be added to the Risk term to build a more complex Objective Function for the Optimization Problem, privileging e.g. simpler solutions:

$$ \theta^{\dagger} = \arg\min_{\theta \in \Theta} \left ( R(f_{\theta}, D^{(ds)}) + c \rho(\theta) \right ) $$
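
A minimal Python sketch of such a regularized Objective Function, using the Ridge penalty defined in the next subsection as $ \rho $; the affine model, toy data, and value of $ c $ are illustrative assumptions

```python
def rho_ridge(theta):
    """Ridge penalty: squared L2 norm of the Params."""
    return sum(w * w for w in theta)

def objective(theta, dataset, f, L, c=0.1):
    """Empirical Risk plus the weighted rho(theta) regularization term."""
    risk = sum(L(f(x, theta), y) for x, y in dataset) / len(dataset)
    return risk + c * rho_ridge(theta)

f = lambda x, th: th[0] * x + th[1]  # affine model f(x; theta)
J = objective((2.0, -1.0), [(0.0, -0.9), (1.0, 1.1)], f,
              lambda a, b: (a - b) ** 2)
```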

Ridge Regularization

The Ridge Regularization is defined as

$$ \rho^{(ridge)}(\theta) = \left \| \theta \right \|_{2}^{2} $$

hence it penalizes solutions with large weights, so implicitly making the $ P(\theta) $ Distribution of $ \theta $ more flattened (spikes are made unlikely)

Training Methods

The Params Fitting, also called Training in Machine Learning jargon, is typically solved using Iterative Numerical Estimation Methods so that

$$ \theta_{t+1} = s^{(tr)}(\theta_{t}) $$

is used to build the $ S = \{ \theta_{i} \}_{i=1,...,N_{s}} $ Solutions Sequence, which can have different properties like the following (see the sketch after this list)

  • Convergence: related to its capability to converge to the $ \theta^{\star} $ Optimum or at least to a $ \theta^{\dagger} $ Stationary Point where $ \nabla_{\theta} R(f_{\theta^{\dagger}}, D^{(ds)}) = 0 $
  • Convergence Rate: related to how fast it converges as a function of the number of iterations
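
A minimal Python sketch of such an iterative method: plain gradient descent as the $ s^{(tr)} $ update, for the model $ f(x; w) = w x $ with square loss; the learning rate, iteration count, and toy Dataset are illustrative assumptions

```python
def grad(w, dataset):
    """Gradient of the Empirical Risk (1/N) sum (w*x - y)^2 w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in dataset) / len(dataset)

def s_tr(w, dataset, lr=0.1):
    """One Training step: theta_{t+1} = s_tr(theta_t)."""
    return w - lr * grad(w, dataset)

dataset = [(0.5, 1.0), (1.0, 2.1), (2.0, 3.9)]
w = 0.0
S = [w]  # the Solutions Sequence
for t in range(100):
    w = s_tr(w, dataset)
    S.append(w)
# at convergence grad(w, dataset) ~= 0, i.e. w is a Stationary Point
```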