In [4]:

```
# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
#sns.set(font='SimHei')
plt.rcParams['axes.grid'] = False
#from IPython.display import SVG


def show_image(filename, figsize=None, res_dir=True):
    """Display an image file, by default looked up under ./res/."""
    if figsize:
        plt.figure(figsize=figsize)
    if res_dir:
        filename = './res/{}'.format(filename)
    plt.imshow(plt.imread(filename))
```

The best-fitting model (in the sense of lowest generalization error) is often a large model that has been regularized appropriately.

\begin{equation} \tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta) \end{equation}

where $\Omega(\theta)$ is a parameter norm penalty and $\alpha \in [0, \infty)$ is a hyperparameter that weights its contribution relative to $J$.

Typically, the penalty is applied **only to the weights** of the affine transformation at each layer, and the biases are left unregularized.
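As a minimal sketch of the regularized objective above, assuming an $L^2$ penalty $\Omega(\theta) = \frac{1}{2}\sum \|W\|^2$ (one common choice, not the only one): the helper names `l2_penalty` and `regularized_loss` and the toy parameter values are hypothetical, chosen only for illustration. Note that the biases are never passed to the penalty.

```python
import numpy as np


def l2_penalty(weights):
    """Omega(theta) = 0.5 * sum of squared weights; biases are excluded."""
    return 0.5 * sum(float(np.sum(w ** 2)) for w in weights)


def regularized_loss(base_loss, weights, alpha):
    """J_tilde = J + alpha * Omega(theta)."""
    return base_loss + alpha * l2_penalty(weights)


# Hypothetical parameters of a two-layer network.
weights = [np.array([[1.0, -2.0], [0.0, 3.0]]), np.array([[2.0], [-1.0]])]
biases = [np.array([10.0, 10.0]), np.array([10.0])]  # deliberately not penalized

total = regularized_loss(base_loss=1.0, weights=weights, alpha=0.1)
print(total)
```

Large bias values leave `total` unchanged here, which is the point of penalizing only the weights.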

The sparsity induced by $L^1$ regularization makes it usable as a feature selection mechanism: weights driven exactly to zero mark features that can be discarded.
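The sparsity effect can be illustrated under a simplifying assumption: a quadratic cost with a diagonal Hessian around the unregularized optimum `w_star`. In that setting the $L^1$ solution is a soft threshold, while the $L^2$ solution only rescales. The function names and the numbers below are hypothetical.

```python
import numpy as np


def l1_solution(w_star, hessian_diag, alpha):
    """Soft thresholding: entries with |w*_i| <= alpha / H_ii become exactly 0."""
    threshold = alpha / hessian_diag
    return np.sign(w_star) * np.maximum(np.abs(w_star) - threshold, 0.0)


def l2_solution(w_star, hessian_diag, alpha):
    """L2 shrinks every entry by H_ii / (H_ii + alpha) but never to exactly 0."""
    return hessian_diag / (hessian_diag + alpha) * w_star


w_star = np.array([3.0, -0.2, 0.5, -2.0])
h = np.ones(4)  # assumed diagonal Hessian entries

w_l1 = l1_solution(w_star, h, alpha=0.5)
w_l2 = l2_solution(w_star, h, alpha=0.5)
print(w_l1)
print(w_l2)
```

The two small entries are zeroed by the $L^1$ solution and merely shrunk by the $L^2$ one, which is why $L^1$ can act as feature selection.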

Alternatively, constrain $\Omega(\theta)$ to be less than some constant $k$ by forming a generalized Lagrangian:

\begin{equation} \mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha(\Omega(\theta) - k) \end{equation}

In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
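A minimal sketch of reprojection, assuming the constraint is an $L^2$ norm bound `k` on each column of a weight matrix and that the projection is applied after every gradient step. The name `project_columns` is hypothetical.

```python
import numpy as np


def project_columns(W, k):
    """Rescale any column whose L2 norm exceeds k back to norm k."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Columns already inside the constraint region get a scale of 1.
    scale = np.minimum(1.0, k / np.maximum(norms, 1e-12))
    return W * scale


W = np.array([[3.0, 0.3],
              [4.0, 0.4]])  # column norms are 5.0 and 0.5
W_proj = project_columns(W, k=1.0)
print(np.linalg.norm(W_proj, axis=0))
```

Only the column that violates the bound is changed; the other is left untouched, unlike a penalty term that would shrink both.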

The regularized matrix $X^\top X + \alpha I$ (with $\alpha > 0$) is guaranteed to be invertible.
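A small numerical check, assuming the matrix in question is $X^\top X + \alpha I$ from linear regression with weight decay. The toy data is hypothetical and chosen so that $X^\top X$ alone is singular (fewer independent examples than features).

```python
import numpy as np

# Two linearly dependent examples with three features: X^T X has rank 1.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
y = np.array([1.0, 2.0])
alpha = 0.1

gram = X.T @ X
rank_plain = np.linalg.matrix_rank(gram)

regularized = gram + alpha * np.eye(X.shape[1])
rank_reg = np.linalg.matrix_rank(regularized)

# The regularized normal equations have a unique solution.
w = np.linalg.solve(regularized, X.T @ y)
print(rank_plain, rank_reg, w)
```

Adding $\alpha I$ shifts every eigenvalue of the positive semidefinite $X^\top X$ up by $\alpha$, so none of them can be zero.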

Create fake data:

- transform
- inject noise

- add noise to the data
- add noise to the weights (Bayesian view: the weights are random variables with a distribution); under some assumptions this is equivalent to an additional regularization term
- add noise to the output targets: label smoothing
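Label smoothing from the last item can be sketched as follows, assuming `k` classes and a smoothing amount `eps`: the hard targets 0 and 1 are replaced by `eps / (k - 1)` and `1 - eps`. The function name `smooth_labels` is hypothetical.

```python
def smooth_labels(label, num_classes, eps=0.1):
    """Return a soft target vector for the integer class index `label`."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == label else off_value for i in range(num_classes)]


targets = smooth_labels(label=2, num_classes=5, eps=0.1)
print(targets)
```

The soft targets still sum to 1, so they remain a valid distribution for a softmax cross-entropy loss.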

Goal: learn a representation so that examples from the same class have similar representations.

- Task-specific parameters
- Generic parameters

In [5]:

```
show_image("fig7_2.png")
```

Run training until the validation set error has not improved for some amount of time.

Then use the parameters that achieved the lowest validation set error over the whole training run.
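The procedure above can be sketched as a patience-based loop. This is a simplified illustration, not a full training routine: `train_step` and `validation_error` are hypothetical callables supplied by the caller, and the toy versions below merely simulate a U-shaped validation curve.

```python
import copy


def early_stopping(train_step, validation_error, params, patience=3, max_steps=100):
    """Train until validation error fails to improve `patience` times in a row."""
    best_error = float("inf")
    best_params = copy.deepcopy(params)
    best_step = 0
    waited = 0
    for step in range(1, max_steps + 1):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            # Keep a copy of the best parameters seen so far.
            best_error, best_params, best_step = error, copy.deepcopy(params), step
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_params, best_error, best_step


# Toy stand-ins: the "parameters" are a step counter, and the validation
# error is lowest at t = 5, then rises as the model overfits.
def toy_train_step(p):
    return {"t": p["t"] + 1}


def toy_validation_error(p):
    return (p["t"] - 5) ** 2 + 1.0


best_params, best_error, best_step = early_stopping(
    toy_train_step, toy_validation_error, {"t": 0}
)
print(best_params, best_error, best_step)
```

The returned parameters are those from the best validation step, not from the last step executed.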

In [6]:

```
show_image("fig7_3.png", figsize=[10, 8])
```

- regularize the parameters of one model (trained as a supervised classifier) to be close to those of another model (trained unsupervised)
- force sets of parameters to be equal: parameter sharing, as in convolutional neural networks

Place a penalty on the activations of the units in a neural network, encouraging the activations to be sparse.

Dropout reduces the effective capacity of a model, so increase the model size when using it.

With very few training examples, dropout is less effective.
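For reference, a minimal sketch of a dropout mask on a layer's activations. It uses the "inverted dropout" variant, which rescales by `1 / keep_prob` at training time instead of scaling the weights at test time; this is an assumption about implementation, and the function name `dropout` and the toy activations are hypothetical.

```python
import numpy as np


def dropout(h, keep_prob, rng, training=True):
    """Zero each unit with probability 1 - keep_prob; rescale the survivors."""
    if not training:
        return h  # no mask and no rescaling at inference time
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob


rng = np.random.default_rng(0)
h = np.ones((1000, 50))  # toy activations

h_train = dropout(h, keep_prob=0.8, rng=rng)
h_test = dropout(h, keep_prob=0.8, rng=rng, training=False)
print(h_train.mean(), h_test.mean())
```

Because roughly a fraction `1 - keep_prob` of the units is silenced on every step, each sampled sub-network is smaller than the full model, which is one way to read the advice about increasing model size.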

In [8]:

```
show_image("fig7_8.png", figsize=[10, 8])
```