Function Approximator

Let's consider an $ f \in \mathcal{F} $ Function Approximator so that

$$ f : \mathcal{X} \rightarrow \mathcal{Y} $$

Let's consider an $ f_{\theta} $ Parametric Function Approximator with $ \theta $ Params so that

$$ f : \mathcal{X} \times \Theta \rightarrow \mathcal{Y} $$

Let's denote with $ \hat y \in \mathcal{Y} $ its estimate for a given $ x \in \mathcal{X} $ Input, assuming a certain $ \theta \in \Theta $ Params Set

$$ \hat y = f_{\theta}(x) = f(x; \theta) $$
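
As a concrete illustration, here is a minimal Python sketch of a Parametric Function Approximator; the affine model and the specific values of $ x $ and $ \theta $ are illustrative assumptions, not part of the definitions above

```python
# A minimal sketch of a Parametric Function Approximator f(x; theta):
# an affine model, assuming X = R, Theta = R^2, Y = R (illustrative choices)
def f(x, theta):
    w, b = theta
    return w * x + b

theta = (2.0, -1.0)  # a certain theta in the Theta Params Space
x = 3.0              # a given x Input
y_hat = f(x, theta)  # the estimate: y_hat = f(x; theta) = 5.0
```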

Metrics

Loss Function

Let's consider an $ f \in \mathcal{F} $ Function Approximator which is able to perform predictions such as $ \hat y = f(x) $ for some $ x \in \mathcal{X}, \hat y \in \mathcal{Y} $

To get a quantitative measure of how well the $ f $ predictor is working, let's introduce the Loss Function $ L \in \mathcal{L} $ as a dissimilarity measure of its two arguments

$$ L : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}^{+} $$

Examples

Binary Loss Function

The $ L^{(bin)}(a,b) $ returns $ 0 $ only when $ a = b $ and $ 1 $ otherwise

$$ L^{(bin)}(a,b) = \begin{cases} 0 & a = b \\ 1 & a \neq b \end{cases} $$

Absolute Loss Function

The $ L^{(abs)}(a,b) $ returns the Absolute Difference

$$ L^{(abs)}(a,b) = | a - b | $$

Square Loss Function

The $ L^{(sq)}(a,b) $ returns the Square Difference

$$ L^{(sq)}(a,b) = (a - b)^{2} $$
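
The three example Loss Functions above translate directly into code; a minimal Python sketch

```python
def loss_bin(a, b):
    """Binary loss: 0 when a = b, 1 otherwise."""
    return 0.0 if a == b else 1.0

def loss_abs(a, b):
    """Absolute loss: |a - b|."""
    return abs(a - b)

def loss_sq(a, b):
    """Square loss: (a - b)^2."""
    return (a - b) ** 2
```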

Risk

The Risk definition involves a certain $ P $ PDF from which it is possible to draw $ (x,y) $ Pairs

$$ R(f, P) = E_{(x,y) \sim P} \left [ L(f(x), y) \right ] $$
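
When $ P $ is known, the expectation can be approximated by Monte Carlo sampling; a minimal Python sketch, where the linear-plus-noise data distribution, the candidate predictor, and the square loss are all illustrative assumptions

```python
import random

def sample_P():
    """Draw one (x, y) Pair from a toy data distribution P."""
    x = random.gauss(0.0, 1.0)
    y = 2.0 * x + random.gauss(0.0, 0.1)  # linear signal plus noise
    return x, y

f = lambda x: 2.0 * x                  # a candidate predictor
L = lambda y_hat, y: (y_hat - y) ** 2  # square loss

# Monte Carlo estimate of R(f, P) = E[L(f(x), y)]
n = 100_000
risk = sum(L(f(x), y) for x, y in (sample_P() for _ in range(n))) / n
# here risk ~= Var(noise) = 0.01, since f matches the signal exactly
```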

Empirical Risk

The Empirical Risk typically involves a Sampled PDF Approximation $ D^{(ds)} = \{ (x_{i}, y_{i}) \}_{i=1,...,N} $ called Dataset, so the Empirical Risk definition becomes

$$ R(f, D^{(ds)}) = E_{(x,y) \sim D^{(ds)}} \left [ L(f(x), y) \right ] $$

Expectation as Averaging

If the $ E_{D^{(ds)}} \left [ \cdot \right ] $ Expectation Operator is defined as an Average then the Empirical Risk becomes

$$ R(f, D^{(ds)}) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_{i}), y_{i}) $$
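
In code the Empirical Risk is then just this average; a minimal Python sketch with an illustrative toy Dataset, predictor, and loss

```python
def empirical_risk(f, dataset, L):
    """Average of the pointwise losses over the Dataset."""
    return sum(L(f(x), y) for x, y in dataset) / len(dataset)

D_ds = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2)]  # toy {(x_i, y_i)} Dataset
R = empirical_risk(lambda x: 2.0 * x, D_ds, lambda a, b: (a - b) ** 2)
```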

Notes

  • Typically the Dataset is split into Training Set and Validation Set: the former is used to perform the Training while the latter is used to check the Generalization

$$ D^{(ds)} = D^{(ts)} \cup D^{(vs)} $$

  • In Stochastic Gradient Descent based Training the Training Set gets sub-sampled into Batches so that the Training actually happens according to the $ D_{t}^{(b)} $ Batch for a specific $ t $ Training Iteration (see the sketch after this list)

$$ D_{t}^{(b)} \subset D^{(ts)} $$
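
A minimal Python sketch of the Batch sub-sampling; the batch size and sampling without replacement are illustrative choices

```python
import random

def sample_batch(D_ts, batch_size=32):
    """Draw the D_b Batch from the D_ts Training Set for one Training Iteration."""
    return random.sample(D_ts, min(batch_size, len(D_ts)))

D_ts = [(float(i), 2.0 * float(i)) for i in range(1000)]  # toy Training Set
D_b = sample_batch(D_ts)  # D_b is a subset of D_ts, used at iteration t
```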

Training

Standard Risk based Training

In general the Training Problem is essentially a Search Problem in the $ \Theta $ Parameters Space: it is modeled in the Optimization Framework, relying on a Risk Measure that depends on a certain $ P $ Data Distribution

$$ \theta^{\star} = \arg\min_{\theta \in \Theta} R(f_{\theta}, P) \qquad f^{\star} = f_{\theta^{\star}} $$

Empirical Risk based Training

Practically the $ P $ True Data Distribution is unknown, instead the $ D^{(ds)} $ Dataset is available as a Sampled Approximation of $ P $ hence the Optimization Problem becomes

$$ \theta^{\dagger} = \arg\min_{\theta \in \Theta} R(f_{\theta}, D^{(ds)}) = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} L(f_{\theta}(x_{i}), y_{i}) \qquad f^{\dagger} = f_{\theta^{\dagger}} $$

Under regularity conditions there are theoretical convergence guarantees so that, as the $ N $ Dataset Size grows,

$$ f^{\dagger} \rightarrow f^{\star} $$
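
This convergence can be observed numerically; a minimal Python sketch for the model $ f(x; w) = w x $ with square loss, where the Empirical Risk minimizer has the closed form $ w^{\dagger} = \sum x_{i} y_{i} / \sum x_{i}^{2} $ (the toy data distribution is an illustrative assumption)

```python
import random

random.seed(0)

def erm_w(N):
    """Closed-form Empirical Risk minimizer of (1/N) sum (w*x - y)^2."""
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

for N in (10, 1_000, 100_000):
    print(N, erm_w(N))  # approaches the true parameter w_star = 2.0 as N grows
```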

Overfitting

The $ P \neq D^{(ds)} $ mismatch between the True Data PDF and the Sampled Data PDF used for the Training (hence composing the Training and Validation Set) can be responsible for the Model Overfitting, along with a high model learning capacity, as is the case for Big Deep Neural Networks

Dataset Improvement

In order to combat overfitting, the Big Data approach has led to the development of $ D^{(ds)} $ Datasets which are better approximations of $ P $, although in some applications the margin for improvement is still big

Besides increasing the Dataset Size with more elements, other approaches exist, like Dataset Augmentation
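
A minimal Python sketch of one common Dataset Augmentation recipe, perturbing inputs with small noise while keeping the targets; the noise scale and number of copies are illustrative assumptions

```python
import random

def augment(dataset, copies=2, sigma=0.01):
    """Enlarge the Dataset with noisy copies of the inputs."""
    augmented = list(dataset)
    for _ in range(copies):
        augmented += [(x + random.gauss(0.0, sigma), y) for x, y in dataset]
    return augmented

D_ds = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2)]
D_aug = augment(D_ds)  # 3x the original size with this sketch's settings
```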

Regularization

The "Regularization Path" Mol consists of introducing a "Prior Solution Measure Path" Mol to the "Search Path" Mol which should work as a "Compass" Mol along the "Minimization Path" Mol : it should steer away from certain kind of solutions privileging other kinds

Another approach to combact overfitting, regards adding $ \rho( \theta_{t} ) $ as a quality measure on a specific $ \theta_{t} $ solution at $ t $ Training Time

$$ \rho : \Theta \rightarrow \mathbb{R}^{+} $$

This term can be added to the Risk term to build a more complex Objective Function for the Optimization Problem, privileging e.g. simpler solutions:

$$ \theta^{\dagger} = \arg\min_{\theta \in \Theta} \left ( R(f_{\theta}, D^{(ds)}) + c \rho(\theta) \right ) $$
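
A minimal Python sketch of such a regularized Objective Function, using the Ridge penalty defined in the next subsection as $ \rho $; the affine model, toy data, and value of $ c $ are illustrative assumptions

```python
def rho_ridge(theta):
    """Ridge penalty: squared L2 norm of the Params."""
    return sum(w * w for w in theta)

def objective(theta, dataset, f, L, c=0.1):
    """Empirical Risk plus the weighted rho(theta) regularization term."""
    risk = sum(L(f(x, theta), y) for x, y in dataset) / len(dataset)
    return risk + c * rho_ridge(theta)

f = lambda x, th: th[0] * x + th[1]  # affine model f(x; theta)
J = objective((2.0, -1.0), [(0.0, -0.9), (1.0, 1.1)], f,
              lambda a, b: (a - b) ** 2)
```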

Ridge Regularization

The Ridge Regularization is defined as

$$ \rho^{(ridge)}(\theta) = \left \| \theta \right \|_{2}^{2} $$

hence it penalizes solutions with large weights, so implicitly making the $ P(\theta) $ Distribution of $ \theta $ more flattened (spikes are made unlikely)

Training Methods

The Params Fitting, also called Training in Machine Learning jargon, is typically solved using Iterative Numerical Estimation Methods so that

$$ \theta_{t+1} = s^{(tr)}(\theta_{t}) $$

is used to build the $ S = \{ \theta_{i} \}_{i=1,...,N_{s}} $ Solutions Sequence, which can have different properties like the following (see the sketch after this list)

  • Convergence: related to its capability to converge to the $ \theta^{\star} $ Optimum or at least to a $ \theta^{\dagger} $ Stationary Point where $ \nabla_{\theta} R(f_{\theta^{\dagger}}, D^{(ds)}) = 0 $
  • Convergence Rate: related to how fast it converges as a function of the number of iterations
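
A minimal Python sketch of such an iterative method: plain gradient descent as the $ s^{(tr)} $ update, for the model $ f(x; w) = w x $ with square loss; the learning rate, iteration count, and toy Dataset are illustrative assumptions

```python
def grad(w, dataset):
    """Gradient of the Empirical Risk (1/N) sum (w*x - y)^2 w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in dataset) / len(dataset)

def s_tr(w, dataset, lr=0.1):
    """One Training step: theta_{t+1} = s_tr(theta_t)."""
    return w - lr * grad(w, dataset)

dataset = [(0.5, 1.0), (1.0, 2.1), (2.0, 3.9)]
w = 0.0
S = [w]  # the Solutions Sequence
for t in range(100):
    w = s_tr(w, dataset)
    S.append(w)
# at convergence grad(w, dataset) ~= 0, i.e. w is a Stationary Point
```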