Let's consider a Function Approximator $ f \in \mathcal{F} $ so that

$$ f : \mathcal{X} \rightarrow \mathcal{Y} $$

Let's consider a Parametric Function Approximator $ f_{\theta} $ with Params $ \theta \in \Theta $ so that

$$ f : \mathcal{X} \times \Theta \rightarrow \mathcal{Y} $$

Let's denote by $ \hat y \in \mathcal{Y} $ its prediction for a given Input $ x \in \mathcal{X} $, assuming a certain Params Set $ \theta \in \Theta $

$$ \hat y = f_{\theta}(x) = f(x; \theta) $$
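
The notation above can be sketched in Python; the affine form of $ f $ and the specific Params values are illustrative assumptions, not part of the definition:

```python
# A minimal sketch (illustrative assumption): an affine parametric
# approximator f(x; theta) = theta[0] + theta[1] * x, i.e. Theta = R^2
def f(x, theta):
    """Prediction y_hat = f(x; theta) for a scalar Input x."""
    return theta[0] + theta[1] * x

theta = (1.0, 2.0)      # a particular point of the Params Space
y_hat = f(3.0, theta)   # y_hat = 1 + 2 * 3 = 7
```

Fixing `theta` turns the two-argument map $ f(x; \theta) $ into the single-argument predictor $ f_{\theta}(x) $.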

Let's consider a Function Approximator $ f \in \mathcal{F} $ which is able to perform predictions such as $ \hat y = f(x) $ for some $ x \in \mathcal{X}, \hat y \in \mathcal{Y} $

To get a quantitative measure of how well the $ f $ predictor is working, let's introduce the Loss Function $ L \in \mathcal{L} $ as a dissimilarity measure of its two arguments

$$ L : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}^{+} $$

The Binary Loss $ L^{(bin)}(a,b) $ returns $ 0 $ when $ a = b $ and $ 1 $ otherwise

$$ L^{(bin)}(a,b) = \begin{cases} 0 & a = b \\ 1 & a \neq b \end{cases} $$

The $ L^{(abs)}(a,b) $ returns the Absolute Difference

$$ L^{(abs)}(a,b) = | a - b | $$

The $ L^{(sq)}(a,b) $ returns the Square Difference

$$ L^{(sq)}(a,b) = (a - b)^{2} $$
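
The three losses above translate directly into Python (a minimal sketch; the function names are ours):

```python
def L_bin(a, b):
    """Binary (0-1) loss: 0 when a == b, 1 otherwise."""
    return 0.0 if a == b else 1.0

def L_abs(a, b):
    """Absolute difference loss."""
    return abs(a - b)

def L_sq(a, b):
    """Squared difference loss."""
    return (a - b) ** 2
```

All three map a pair of $ \mathcal{Y} $ values to a non-negative real, as required by the signature $ L : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}^{+} $.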

The Risk definition involves a certain PDF $ P $ from which it is possible to draw $ (x,y) $ Pairs

$$ R(f, P) = E_{(x,y) \sim P} \left [ L(f(x), y) \right ] $$

The Empirical Risk typically involves a Sampled PDF Approximation $ D^{(ds)} = \{ (x_{i}, y_{i}) \}_{i=1,...,N} $ called Dataset so the Empirical Risk definition becomes

$$ R(f, D^{(ds)}) = E_{(x,y) \sim D^{(ds)}} \left [ L(f(x), y) \right ] $$

If the $ E_{D^{(ds)}} \left [ \cdot \right ] $ Expectation Operator is defined as an Average then the Empirical Risk becomes

$$ R(f, D^{(ds)}) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_{i}), y_{i}) $$
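
The averaged Empirical Risk can be sketched in a few lines of Python; the predictor, the loss, and the toy Dataset are illustrative assumptions:

```python
def empirical_risk(f, dataset, loss):
    """R(f, D^(ds)) = (1/N) * sum_i L(f(x_i), y_i)."""
    return sum(loss(f(x), y) for x, y in dataset) / len(dataset)

# Hypothetical predictor, squared loss, and a 2-element Dataset
predictor = lambda x: 2.0 * x
squared = lambda a, b: (a - b) ** 2
dataset = [(1.0, 2.0), (2.0, 5.0)]  # first pair fits exactly, second has error 1

risk = empirical_risk(predictor, dataset, squared)  # (0 + 1) / 2 = 0.5
```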

- Typically the Dataset is split into Training Set and Validation Set : the former is used to perform the Training while the latter is used to check the Generalization

$$ D^{(ds)} = D^{(ts)} \cup D^{(vs)} $$

- In Stochastic Gradient Descent based Training the Training Set gets sub-sampled into Batches so that the Training actually happens according to the $ D_{t}^{(b)} $ Batch for a specific $ t $ Training Iteration

$$ D_{t}^{(b)} \subset D^{(ts)} $$
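
The split and the batching can be sketched as follows (the split fraction, seed, and batch size are illustrative assumptions):

```python
import random

def split_dataset(dataset, train_fraction=0.8, seed=0):
    """Split D^(ds) into D^(ts) and D^(vs) after shuffling."""
    shuffled = dataset[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

def batches(training_set, batch_size):
    """Yield the D_t^(b) subsets used at each Training Iteration t."""
    for start in range(0, len(training_set), batch_size):
        yield training_set[start:start + batch_size]

dataset = [(float(i), 2.0 * float(i)) for i in range(10)]
train_set, val_set = split_dataset(dataset)           # 8 / 2 split
all_batches = list(batches(train_set, batch_size=3))  # sizes 3, 3, 2
```

In practice the Training Set is usually reshuffled at every epoch so that the batches differ across passes.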

In general the Training Problem is essentially a Search Problem in the $ \Theta $ Parameters Space which is modeled in the Optimization Framework relying on a Risk Measure depending on a certain $ P $ Data Distribution

$$ f^{\star} = \arg\min_{\theta \in \Theta} R(f_{\theta}, P) $$

Practically the $ P $ True Data Distribution is unknown, instead the $ D^{(ds)} $ Dataset is available as a Sampled Approximation of $ P $ hence the Optimization Problem becomes

$$ f^{\dagger} = \arg\min_{\theta \in \Theta} R(f_{\theta}, D^{(ds)}) = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} L(f_{\theta}(x_{i}), y_{i}) $$

Under regularity conditions there are theoretical convergence guarantees so that

$$ f^{\dagger} \rightarrow f^{\star} $$
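
This convergence can be illustrated with a toy experiment; the distribution $ P $, the one-parameter linear model, and the sample size below are all illustrative assumptions:

```python
import random

def sample_pair(rng):
    # Assumed toy distribution P: x ~ U(0,1), y = 2x + Gaussian noise,
    # so the true optimal slope is theta* = 2
    x = rng.random()
    return x, 2.0 * x + rng.gauss(0.0, 0.1)

def fit_slope(dataset):
    # Closed-form Empirical Risk minimizer for f(x; theta) = theta * x
    # under the squared loss (least squares through the origin)
    num = sum(x * y for x, y in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den

rng = random.Random(0)
theta_dagger = fit_slope([sample_pair(rng) for _ in range(10_000)])
# theta_dagger approaches the true slope 2.0 as N grows
```

With a larger and larger Dataset the empirical minimizer $ f^{\dagger} $ gets closer to the true-risk minimizer $ f^{\star} $.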

The $ P \neq D^{(ds)} $ difference between the True Data PDF and the Sampled Data PDF used for the Training (hence composing the Training and Validation Sets) can be responsible for Model Overfitting, along with a high model learning capacity, as in the case of Big Deep Neural Networks

In order to combat overfitting, the Big Data approach has led to the development of $ D^{(ds)} $ Datasets which are better approximations of $ P $, but in some applications the margin for improvement is still big

Besides increasing the Dataset Size with more elements, other approaches exist like Dataset Augmentation

The "Regularization Path" Mol consists of introducing a "Prior Solution Measure Path" Mol to the "Search Path" Mol which should work as a "Compass" Mol along the "Minimization Path" Mol : it should steer away from certain kind of solutions privileging other kinds

Concretely, regularization adds $ \rho( \theta_{t} ) $ as a quality measure on a specific solution $ \theta_{t} $ at Training Time $ t $

$$ \rho : \Theta \rightarrow \mathbb{R}^{+} $$

This term can be added to the Risk Minimization Term to build a more complex Objective Function in the Optimization Problem to privilege e.g. simpler solutions like

$$ f^{\dagger} = \arg\min_{\theta \in \Theta} \left ( R(f_{\theta}, D^{(ds)}) + c \rho(\theta) \right ) $$

The Ridge Regularization is defined as

$$ \rho^{(ridge)}(\theta) = \left \| \theta \right \|_{2}^{2} $$

hence it penalizes solutions with large weights, so implicitly making the $ P(\theta) $ Distribution of $ \theta $ more flattened (spikes are made unlikely)
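
A minimal sketch of the regularized Objective Function with the ridge penalty; the affine model, the toy Dataset, and the value of $ c $ are illustrative assumptions:

```python
def ridge(theta):
    """rho^(ridge): squared L2 norm of the parameter vector."""
    return sum(t * t for t in theta)

def objective(theta, dataset, c):
    """Empirical risk (squared loss, assumed affine model) plus c * rho(theta)."""
    def f(x):  # assumed model f(x; theta) = theta[0] + theta[1] * x
        return theta[0] + theta[1] * x
    risk = sum((f(x) - y) ** 2 for x, y in dataset) / len(dataset)
    return risk + c * ridge(theta)

dataset = [(0.0, 0.0), (1.0, 2.0)]
# theta = (0, 2) fits the data exactly, but is penalized when c > 0
obj_no_reg = objective((0.0, 2.0), dataset, c=0.0)  # 0.0
obj_reg = objective((0.0, 2.0), dataset, c=0.1)     # 0.0 + 0.1 * 4 = 0.4
```

Larger $ c $ trades data fit for smaller weights, shifting the minimizer toward simpler solutions.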

The Params Fitting, also called Training in Machine Learning jargon, is typically solved using Iterative Numerical Estimation Methods so that

$$ \theta_{t+1} = s^{(tr)}(\theta_{t}) $$

is used to build the Solutions Sequence $ S = \{ \theta_{t} \}_{t=1,...,N_{s}} $ which can have different properties like

- Convergence: related to its capability to converge to the $ \theta^{\star} $ Optimum or at least to a $ \theta^{\dagger} $ Stationary Point where $ \nabla_{\theta} R(f_{\theta^{\dagger}}, D^{(ds)}) = 0 $
- Convergence Rate: related to how fast it converges as a function of the number of iterations
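
The iterative scheme above can be sketched with gradient descent as the update map $ s^{(tr)} $; the one-dimensional toy objective and the learning rate are illustrative assumptions:

```python
def train_step(theta, grad, lr=0.1):
    """One update theta_{t+1} = s^(tr)(theta_t): a gradient-descent step."""
    return theta - lr * grad(theta)

def train(theta0, grad, n_steps, lr=0.1):
    """Build the Solutions Sequence S = {theta_t}."""
    sequence = [theta0]
    for _ in range(n_steps):
        sequence.append(train_step(sequence[-1], grad, lr))
    return sequence

# Assumed toy objective R(theta) = (theta - 3)^2, gradient 2 * (theta - 3);
# the sequence converges to the stationary point theta = 3
sequence = train(0.0, lambda th: 2.0 * (th - 3.0), n_steps=50)
```

For this quadratic objective the update contracts the error by a constant factor per step, an example of a linear Convergence Rate.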