Let's consider a Function Approximator $ f \in \mathcal{F} $ so that

$$ f : \mathcal{X} \rightarrow \mathcal{Y} $$

Let's consider a Parametric Function Approximator $ f_{\theta} $ with Params $ \theta \in \Theta $ so that

$$ f : \mathcal{X} \times \Theta \rightarrow \mathcal{Y} $$

Let's denote by $ \hat y \in \mathcal{Y} $ its prediction for a given Input $ x \in \mathcal{X} $, assuming a certain Params Set $ \theta \in \Theta $

$$ \hat y = f_{\theta}(x) = f(x; \theta) $$
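
The notation above can be sketched in Python; the affine form of $ f $ and the specific Params values are illustrative assumptions, not part of the definition:

```python
# A minimal sketch (illustrative assumption): an affine parametric
# approximator f(x; theta) = theta[0] + theta[1] * x, i.e. Theta = R^2
def f(x, theta):
    """Prediction y_hat = f(x; theta) for a scalar Input x."""
    return theta[0] + theta[1] * x

theta = (1.0, 2.0)      # a particular point of the Params Space
y_hat = f(3.0, theta)   # y_hat = 1 + 2 * 3 = 7
```

Fixing `theta` turns the two-argument map $ f(x; \theta) $ into the single-argument predictor $ f_{\theta}(x) $.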

Let's consider a Function Approximator $ f \in \mathcal{F} $ which is able to perform predictions such as $ \hat y = f(x) $ for some $ x \in \mathcal{X}, \hat y \in \mathcal{Y} $

To get a quantitative measure of how well the $ f $ predictor is working, let's introduce the Loss Function $ L \in \mathcal{L} $ as a dissimilarity measure of its two arguments

$$ L : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}^{+} $$

The Binary Loss $ L^{(bin)}(a,b) $ returns $ 0 $ when $ a = b $ and $ 1 $ otherwise

$$ L^{(bin)}(a,b) = \begin{cases} 0 & a = b \\ 1 & a \neq b \end{cases} $$

The $ L^{(abs)}(a,b) $ returns the Absolute Difference

$$ L^{(abs)}(a,b) = | a - b | $$

The $ L^{(sq)}(a,b) $ returns the Square Difference

$$ L^{(sq)}(a,b) = (a - b)^{2} $$
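
The three losses above translate directly into Python (a minimal sketch; the function names are ours):

```python
def L_bin(a, b):
    """Binary (0-1) loss: 0 when a == b, 1 otherwise."""
    return 0.0 if a == b else 1.0

def L_abs(a, b):
    """Absolute difference loss."""
    return abs(a - b)

def L_sq(a, b):
    """Squared difference loss."""
    return (a - b) ** 2
```

All three map a pair of $ \mathcal{Y} $ values to a non-negative real, as required by the signature $ L : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}^{+} $.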

The Risk definition involves a certain PDF $ P $ from which it is possible to draw $ (x,y) $ Pairs

$$ R(f, P) = E_{(x,y) \sim P} \left [ L(f(x), y) \right ] $$

The Empirical Risk typically involves a Sampled PDF Approximation $ D^{(ds)} = \{ (x_{i}, y_{i}) \}_{i=1,...,N} $ called Dataset so the Empirical Risk definition becomes

$$ R(f, D^{(ds)}) = E_{(x,y) \sim D^{(ds)}} \left [ L(f(x), y) \right ] $$

If the $ E_{D^{(ds)}} \left [ \cdot \right ] $ Expectation Operator is defined as an Average then the Empirical Risk becomes

$$ R(f, D^{(ds)}) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_{i}), y_{i}) $$
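
The averaged Empirical Risk can be sketched in a few lines of Python; the predictor, the loss, and the toy Dataset are illustrative assumptions:

```python
def empirical_risk(f, dataset, loss):
    """R(f, D^(ds)) = (1/N) * sum_i L(f(x_i), y_i)."""
    return sum(loss(f(x), y) for x, y in dataset) / len(dataset)

# Hypothetical predictor, squared loss, and a 2-element Dataset
predictor = lambda x: 2.0 * x
squared = lambda a, b: (a - b) ** 2
dataset = [(1.0, 2.0), (2.0, 5.0)]  # first pair fits exactly, second has error 1

risk = empirical_risk(predictor, dataset, squared)  # (0 + 1) / 2 = 0.5
```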

- Typically the Dataset is split into Training Set and Validation Set : the former is used to perform the Training while the latter is used to check the Generalization

$$ D^{(ds)} = D^{(ts)} \cup D^{(vs)} $$

- In Stochastic Gradient Descent based Training the Training Set gets sub-sampled into Batches so that the Training actually happens according to the $ D_{t}^{(b)} $ Batch for a specific $ t $ Training Iteration

$$ D_{t}^{(b)} \subset D^{(ts)} $$
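
The split and the batching can be sketched as follows (the split fraction, seed, and batch size are illustrative assumptions):

```python
import random

def split_dataset(dataset, train_fraction=0.8, seed=0):
    """Split D^(ds) into D^(ts) and D^(vs) after shuffling."""
    shuffled = dataset[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

def batches(training_set, batch_size):
    """Yield the D_t^(b) subsets used at each Training Iteration t."""
    for start in range(0, len(training_set), batch_size):
        yield training_set[start:start + batch_size]

dataset = [(float(i), 2.0 * float(i)) for i in range(10)]
train_set, val_set = split_dataset(dataset)           # 8 / 2 split
all_batches = list(batches(train_set, batch_size=3))  # sizes 3, 3, 2
```

In practice the Training Set is usually reshuffled at every epoch so that the batches differ across passes.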

In general the Training Problem is essentially a Search Problem in the $ \Theta $ Parameters Space which is modeled in the Optimization Framework relying on a Risk Measure depending on a certain $ P $ Data Distribution

$$ f^{\star} = \arg\min_{\theta \in \Theta} R(f_{\theta}, P) $$

Practically the $ P $ True Data Distribution is unknown, instead the $ D^{(ds)} $ Dataset is available as a Sampled Approximation of $ P $ hence the Optimization Problem becomes

$$ f^{\dagger} = \arg\min_{\theta \in \Theta} R(f_{\theta}, D^{(ds)}) = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} L(f_{\theta}(x_{i}), y_{i}) $$

Under regularity conditions there are theoretical convergence guarantees so that

$$ f^{\dagger} \rightarrow f^{\star} $$
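
This convergence can be illustrated with a toy experiment; the distribution $ P $, the one-parameter linear model, and the sample size below are all illustrative assumptions:

```python
import random

def sample_pair(rng):
    # Assumed toy distribution P: x ~ U(0,1), y = 2x + Gaussian noise,
    # so the true optimal slope is theta* = 2
    x = rng.random()
    return x, 2.0 * x + rng.gauss(0.0, 0.1)

def fit_slope(dataset):
    # Closed-form Empirical Risk minimizer for f(x; theta) = theta * x
    # under the squared loss (least squares through the origin)
    num = sum(x * y for x, y in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den

rng = random.Random(0)
theta_dagger = fit_slope([sample_pair(rng) for _ in range(10_000)])
# theta_dagger approaches the true slope 2.0 as N grows
```

With a larger and larger Dataset the empirical minimizer $ f^{\dagger} $ gets closer to the true-risk minimizer $ f^{\star} $.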

The $ P \neq D^{(ds)} $ difference between the True Data PDF and the Sampled Data PDF used for the Training (hence composing the Training and Validation Sets) can be responsible for Model Overfitting, along with a high model learning capacity, as in the case of Big Deep Neural Networks

In order to combat overfitting, the Big Data approach has led to the development of $ D^{(ds)} $ Datasets which are better approximations of $ P $, but in some applications the margin for improvement is still big

Besides increasing the Dataset Size with more elements, other approaches exist like Dataset Augmentation

The "Regularization Path" Mol consists of introducing a "Prior Solution Measure Path" Mol to the "Search Path" Mol which should work as a "Compass" Mol along the "Minimization Path" Mol : it should steer away from certain kind of solutions privileging other kinds

Concretely, regularization adds $ \rho( \theta_{t} ) $ as a quality measure on a specific solution $ \theta_{t} $ at Training Time $ t $

$$ \rho : \Theta \rightarrow \mathbb{R}^{+} $$

This term can be added to the Risk Minimization Term to build a more complex Objective Function in the Optimization Problem to privilege e.g. simpler solutions like

$$ f^{\dagger} = \arg\min_{\theta \in \Theta} \left ( R(f_{\theta}, D^{(ds)}) + c \rho(\theta) \right ) $$

The Ridge Regularization is defined as

$$ \rho^{(ridge)}(\theta) = \left \| \theta \right \|_{2}^{2} $$

hence it penalizes solutions with large weights, so implicitly making the $ P(\theta) $ Distribution of $ \theta $ more flattened (spikes are made unlikely)
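
A minimal sketch of the regularized Objective Function with the ridge penalty; the affine model, the toy Dataset, and the value of $ c $ are illustrative assumptions:

```python
def ridge(theta):
    """rho^(ridge): squared L2 norm of the parameter vector."""
    return sum(t * t for t in theta)

def objective(theta, dataset, c):
    """Empirical risk (squared loss, assumed affine model) plus c * rho(theta)."""
    def f(x):  # assumed model f(x; theta) = theta[0] + theta[1] * x
        return theta[0] + theta[1] * x
    risk = sum((f(x) - y) ** 2 for x, y in dataset) / len(dataset)
    return risk + c * ridge(theta)

dataset = [(0.0, 0.0), (1.0, 2.0)]
# theta = (0, 2) fits the data exactly, but is penalized when c > 0
obj_no_reg = objective((0.0, 2.0), dataset, c=0.0)  # 0.0
obj_reg = objective((0.0, 2.0), dataset, c=0.1)     # 0.0 + 0.1 * 4 = 0.4
```

Larger $ c $ trades data fit for smaller weights, shifting the minimizer toward simpler solutions.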

The Params Fitting, also called Training in Machine Learning jargon, is typically solved using Iterative Numerical Estimation Methods so that

$$ \theta_{t+1} = s^{(tr)}(\theta_{t}) $$

is used to build the Solutions Sequence $ S = \{ \theta_{t} \}_{t=1,...,N_{s}} $ which can have different properties like

- Convergence: related to its capability to converge to the $ \theta^{\star} $ Optimum or at least to a $ \theta^{\dagger} $ Stationary Point where $ \nabla_{\theta} R(f_{\theta^{\dagger}}, D^{(ds)}) = 0 $
- Convergence Rate: related to how fast it converges as a function of the number of iterations
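
The iterative scheme above can be sketched with gradient descent as the update map $ s^{(tr)} $; the one-dimensional toy objective and the learning rate are illustrative assumptions:

```python
def train_step(theta, grad, lr=0.1):
    """One update theta_{t+1} = s^(tr)(theta_t): a gradient-descent step."""
    return theta - lr * grad(theta)

def train(theta0, grad, n_steps, lr=0.1):
    """Build the Solutions Sequence S = {theta_t}."""
    sequence = [theta0]
    for _ in range(n_steps):
        sequence.append(train_step(sequence[-1], grad, lr))
    return sequence

# Assumed toy objective R(theta) = (theta - 3)^2, gradient 2 * (theta - 3);
# the sequence converges to the stationary point theta = 3
sequence = train(0.0, lambda th: 2.0 * (th - 3.0), n_steps=50)
```

For this quadratic objective the update contracts the error by a constant factor per step, an example of a linear Convergence Rate.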