In [1]:

```
import autograd.numpy as np
np.seterr(all="ignore")
import autograd as ag
import matplotlib.pyplot as plt
%matplotlib inline
```

Suppose we want to estimate the probability of a coin landing heads (H), tails (T), or on the edge (E):

In [2]:

```
data = "HTTHEH"
ch = data.count("H")
ct = data.count("T")
ce = data.count("E")
print(ch, ct, ce)
```

Our model has two parameters, $\theta_H$ and $\theta_T$, the probability of heads and tails, respectively. In terms of these two parameters, the probability of seeing the data that we saw would be:
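Written out, with $c_H = 3$, $c_T = 2$, $c_E = 1$ the counts above and $\theta_E = 1 - \theta_H - \theta_T$, the likelihood of the data is

$$L(\theta_H, \theta_T) = \theta_H^{c_H} \, \theta_T^{c_T} \, (1 - \theta_H - \theta_T)^{c_E}.$$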

In [3]:

```
def likelihood(theta):
    ph, pt = theta
    pe = 1 - ph - pt
    return ph**ch * pt**ct * pe**ce

eps = 1e-2
X, Y = np.meshgrid(np.arange(eps, 1-eps, 0.01), np.arange(eps, 1-eps, 0.01))

def plot():
    fig = plt.figure(figsize=(6,6))
    CS = plt.contour(X, Y, likelihood((X, Y)), [0,1e-5,1e-4,5e-4,1e-3,1.5e-3,2e-3,2.3e-3], colors='k')
    plt.clabel(CS, inline=1, fontsize=10, fmt="%1.4f")

plot()
```

Below, we're going to make use of the gradient (vector of first derivatives) of the function, which we compute using a magic package called Autograd:

In [4]:

```
g = ag.grad(likelihood)
```
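A quick way to sanity-check the gradient Autograd returns is to compare it against central finite differences. Below is a self-contained sketch; `grad_numeric` is our own helper, not part of Autograd:

```
import numpy as np

ch, ct, ce = 3, 2, 1   # counts from "HTTHEH"

def likelihood(theta):
    ph, pt = theta
    pe = 1 - ph - pt
    return ph**ch * pt**ct * pe**ce

def grad_numeric(f, theta, h=1e-6):
    # central difference in each coordinate: (f(x+h) - f(x-h)) / 2h
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

g_num = grad_numeric(likelihood, np.array([0.3, 0.3]))
print(g_num)
```

Evaluating `g(np.array([0.3, 0.3]))` in the notebook should agree with `g_num` to several decimal places; the finite-difference version is only an approximation, while Autograd differentiates exactly.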

In gradient ascent/descent, at each step, we move a little bit in the direction that the surface rises/falls most steeply. The *step size* or *learning rate* $\eta$ controls how quickly we move. If it is too high, then we can overshoot, as seen below.
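Concretely, each update for maximizing is

$$\theta^{(t+1)} = \theta^{(t)} + \eta \, \nabla L(\theta^{(t)}),$$

with a minus sign instead when minimizing.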

In [8]:

```
def plot_path(path):
    plot()
    x0, y0 = path[0]
    plt.plot(x0, y0, marker='o', color='k')
    for x1, y1 in path[1:]:
        plt.arrow(x0, y0, x1-x0, y1-y0, head_width=0.01, color='k')
        x0, y0 = x1, y1

theta0 = np.array([0.1, 0.4])
theta = theta0
path = [theta]
eta = 1
for iteration in range(100):
    theta1 = theta + eta*g(theta)
    path.append(theta1)
    theta = theta1
plot_path(path)
```

One way to fix this is to make the step size decrease over time. But be careful not to decrease it too much, or convergence may take a long time.
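A common schedule divides the initial rate by the iteration number,

$$\eta_t = \frac{\eta_0}{t+1},$$

so early steps are large and later steps shrink toward zero.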

In [6]:

```
theta = theta0
path = [theta]
eta = 20
for iteration in range(100):
    theta1 = theta + eta/(iteration+1)*g(theta)
    path.append(theta1)
    theta = theta1
plot_path(path)
```

Another trick is to halve the learning rate when the objective function doesn't improve.

In [7]:

```
theta = theta0
path = [theta]
eta = 10
for iteration in range(100):
    theta1 = theta + eta*g(theta)
    path.append(theta1)
    # halve the step size whenever the step failed to increase the likelihood
    if likelihood(theta1) <= likelihood(theta):
        eta /= 2
    theta = theta1
plot_path(path)
plot_path(path)
```
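As a check on where all three runs should end up: for a model like this, the likelihood is maximized at the empirical frequencies, $\hat\theta = (c_H/n, c_T/n, c_E/n) = (1/2, 1/3, 1/6)$ here. A minimal sketch (using a finite-difference `grad_numeric` as a stand-in for Autograd's `grad`) confirms the gradient vanishes at that point:

```
import numpy as np

ch, ct, ce = 3, 2, 1   # counts from "HTTHEH"
n = ch + ct + ce

def likelihood(theta):
    ph, pt = theta
    pe = 1 - ph - pt
    return ph**ch * pt**ct * pe**ce

# closed-form maximizer: each probability is its count over the total
theta_mle = np.array([ch / n, ct / n])   # (0.5, 0.3333...)

def grad_numeric(theta, h=1e-6):
    # central-difference approximation to the gradient
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (likelihood(theta + e) - likelihood(theta - e)) / (2 * h)
    return g

print(theta_mle, grad_numeric(theta_mle))  # gradient ~ (0, 0) at the maximizer
```

The paths above should converge toward this same point, $(\theta_H, \theta_T) \approx (0.5, 0.33)$.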