
# Reinforcement Learning with Neural Network as Q Function¶

## Reinforcement Learning Objective¶

The objective for any reinforcement learning problem is to find the sequence of actions that maximizes (or minimizes) the sum of reinforcements along the sequence. This is reduced to the objective of acquiring the Q function the predicts the expected sum of future reinforcements, because the correct Q function determines the optimal next action.

So, the RL objective is to make this approximation as accurate as possible:

\begin{align*} Q(s_t,a_t) \approx \sum_{k=0}^\infty r_{t+k+1} \end{align*}

This is usually formulated as the least squares objective

\begin{align*} \text{Minimize } \mathrm{E} \left ( \sum_{k=0}^\infty r_{t+k+1} - Q(s_t,a_t)\right )^2 \end{align*}

What is that expectation taken over?

Sources of randomness:

• action selection
• state transitions

Of course, the key problem is that we don't know $\sum_{k=0}^\infty r_{t+k+1}$. What do we do?

Approximate it by

\begin{align*} \sum_{k=0}^\infty r_{t+k+1} = & r_{t+1} + \sum_{k=1}^\infty r_{t+k+1}\\ = & r_{t+1} + \sum_{k=0}^\infty r_{t+1+k+1}\\ \approx & r_{t+1} + Q(s_{t+1},a_{t+1}) \end{align*}

Now our minimization problem is the following. Let $Q$ be a parameterized function, with parameters $\wv$. How do we try to minimize it?

\begin{align*} \text{Minimize } J(\wv) &= \mathrm{E} \left ( \sum_{k=0}^\infty r_{t+k+1} - Q(s_t,a_t;\wv)\right )^2\\ & = \mathrm{E} \Bigl ( r_{t+1} + Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv) \Bigr )^2 \end{align*}

Right! Take the gradient of $J(\wv)$ with respect to $\wv$ and do gradient descent. $Q$ appears twice in this expression. A very common approach is to only take the gradient with respect to $Q(s_t,a_t;\wv)$ and treat $Q(s_{t+1},a_{t+1};\wv)$ as constant. Not correct, but works well and simplifies the resulting algorithm. Recently a second Q function, referred to as a target Q function, has been used to calculate $Q(s_{t+1}, a_{t+1};\wv_{\text{target}})$, with its own parameters $\wv_{\text{target}}$. At certain intervals $\wv$ is copied to $\wv_{\text{target}}$ to update the target Q function.

Recall that the expectation operator is a sum of possible values, weighted by their probabilities. If $d$ is the value on the top of a (fair) dice,

\begin{align*} \mathrm{E} (d) = \frac{1}{6} 1 + \frac{1}{6} 2 + \frac{1}{6} 3 + \frac{1}{6} 4 + \frac{1}{6} 5 + \frac{1}{6} 6 \end{align*}

So the gradient of an expected value is the expected value of the gradients.

\begin{align*} \grad_{\wv} J(\wv) = 2 \mathrm{E} \Bigl ( r_{t+1} + Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv) \Bigr ) (-1) \grad_{\wv} Q(s_t,a_t,;\wv) \end{align*}

The expectation operator requires knowledge of the probabilities of all of those random effects. Instead, we will just sample the world.

More probable events will take place more often. If we just add updates to $\wv$ for each sample, we effectively have done the expectation.

## Gradient for Q function as Table¶

What is the gradient when $Q$ is a table, like the maze and the tic-tac-toe problem? We must first decide what the $\wv$ parameters are for a table.

They are just the entries in the table. Can formulate the selection of the correct parameter in the table as a dot product.

Let $\xv_t$ be a column vector of length equal to the number of table cells and with values that are all 0's and a single 1. The 1 designates the cell corresponding to $s_t$ and $a_t$. Then

\begin{align*} Q(s_t,a_t;\wv) = \xv_t^T \wv \end{align*}

So,

\begin{align*} \grad_{\wv} Q(s_t,a_t;\wv) &= \grad_{\wv} (\xv_t^T \wv)\\ & = \xv_t \end{align*}

Let's define the temporal-difference error to be $\delta_t$

\begin{align*} \delta_t &= r_{t+1} + Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv) \end{align*}

Now

\begin{align*} \grad_{\wv} J(\wv) & = 2 \mathrm{E} \Bigl ( \delta_t (-1) \grad_{\wv} Q(s_t,a_t,;\wv) \Bigr )\\ & = -2 \mathrm{E} (\delta_t \xv_t)\\ & = \begin{cases} -2 \mathrm{E} (\delta_t) , & \text{ for cell for } s_t, a_t;\\ 0, & \text{ for all other cells} \end{cases} \end{align*}

Replacing the expectation with samples, we get the TD update rule for Q tables

\begin{align*} \wv &\leftarrow \wv - \grad_{\wv} J(\wv)\\ &\leftarrow \wv + \rho \delta_t \xv_t \end{align*}

which is really the same as what we have seen before,

\begin{align*} Q(s_t,a_t) &\leftarrow Q(s_t,a_t) + \rho \delta_t %% Q(s_t,a_t) &\leftarrow Q(s_t,a_t) + \rho \Bigl (r_{t+1} + Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv)\Bigr ) \end{align*}

where the update to just the $s_t,a_t$ cell of Q was implicit.

## Gradient for Q function as Neural Network¶

Using a table for Q has limited use.

• Size of table might be too large to store in memory.
• Learning could be slow. Experience in one cell does not transfer to other cells.

Let's use a different function approximator. Uhhh...what shall we use?

Brilliant! A neural network! What has to change in our derivation of the gradient?

We want to approximate $Q(s_t,a_t;\wv)$ with a neural network. The values of $\wv$ are the weights in all layers. If our network has just two layers, the $\wv$ is composed of hidden layer weights $\Vv$ and $\Wv$.

Looks like the neural network will have a single output value and inputs for $s_t$ and $a_t$. To find best action, we would input $s_t$ and try all possible actions $a_t$, calculating the output of the network for each. Then pick the action that produces the best neural network output.

Remember the objective function that we wish to minimize,

\begin{align*} \text{Minimize } J(\wv) &= \mathrm{E} \left ( \sum_{k=0}^\infty r_{t+k+1} - Q(s_t,a_t;\wv)\right )^2\\ & = \mathrm{E} \Bigl ( r_{t+1} + Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv) \Bigr )^2, \end{align*}

\begin{align*} \grad_{\wv} J(\wv) = 2 \mathrm{E} \Bigl ( r_{t+1} + Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv) \Bigr ) (-1) \grad_{\wv} Q(s_t,a_t,;\wv). \end{align*}

We replace the expectation operator with samples, at time $t$.

\begin{align*} \grad_{\wv_t} J(\wv_t) = -2 \Bigl ( r_{t+1} + Q(s_{t+1},a_{t+1};\wv_t) - Q(s_t,a_t;\wv_t) \Bigr ) \grad_{\wv_t} Q(s_t,a_t,;\wv_t). \end{align*}

What remains is to calculate

\begin{align*} \grad_{\wv_t} Q(s_t,a_t,;\wv_t). \end{align*}

but we have seen this before, when calculating the gradient of the output of a neural network with respect to its weights, $\wv = (V,W)$. So, the only change to our previous algorithm for training a neural network to do nonlinear regression is that we use the temporal-difference error rather than the target-output error.

\begin{align*} \Zv &= h(\tilde{\Xv} \Vv)\\ \Yv &= \tilde{\Zv} \Wv\\ \Vv & \leftarrow \Vv + \rho_h \frac{1}{N} \frac{1}{K} \tilde{\Xv}^T \left ( (\Tv - \Yv) \hat{\Wv}^T \cdot (1-\Zv^2) \right )\\ \Wv & \leftarrow \Wv + \rho_o \frac{1}{N}\frac{1}{K} \tilde{\Zv}^T (\Tv - \Yv) \end{align*}

The key change we need to make is in the error. Our targets are now given by a reinforcement plus the next step predicted Q values. Must assemble a bunch of samples into matrices.

Multiple $s_t, a_t$ tuples will be collected as rows in $\Xv$, which, when passed through the neural network, produce $\Zv$ and $\Yv$. $\Yv = \Qv$, because the output $\Yv$ of the network we are taking as our $Q$ value, the estimate of the expected return.

Reinforcements received, $r_{t+1}$'s, will be collected as rows of $\Rv$.

Now just need $Q(s_{t+1},a_{t+1})$'s. These are really like rows of $\Yv$ shifted up by one row, which we will call $\Qnv$, for $Q_{\mbox{next}}$.

Now we can rewrite the gradient descent updates. Assume we have $\Xv$, $\Rv$, $\Qnv$, $\Qv$, all but $\Xv$ having one column.

\begin{align*} \Tv &= \Rv + \Qnv\\ \Zv &= \tanh(\tilde{\Xv} \Vv)\\ \Vv & \leftarrow \Vv + \rho_h \frac{1}{N} \tilde{\Xv}^T \left ( (\Tv - \Qv) \hat{\Wv}^T \cdot (1-\Zv^2) \right )\\ \Wv & \leftarrow \Wv + \rho_o \frac{1}{N} \tilde{\Zv}^T \;(\Tv - \Qv) \end{align*}

To make the relationship to our nonlinear regression expressions even more obvious, let's replace $\Qv$ with $\Yv$ and $\Qnv$ with $\Ynv$. \begin{align*} \Tv &= \Rv + \Ynv\\ \Zv &= \tanh(\tilde{\Xv} \Vv)\\ \Vv & \leftarrow \Vv + \rho_h \frac{1}{N} \tilde{\Xv}^T \left ( (\Tv - \Yv) \hat{\Wv}^T \cdot (1-\Zv^2) \right )\\ \Wv & \leftarrow \Wv + \rho_o \frac{1}{N} \tilde{\Zv}^T \;(\Tv - \Yv) \end{align*}

## Dealing with Infinite Sums¶

For non-episodic tasks (no final state), the desired sum of reinforcements is over an infinite future.

This will grow to infinity unless forced to be a finite value. Will use common approach of discounting reinforcements, discounting ones more that occur further into the future.

\begin{align*} Q(s_t,a_t) &= \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \text{ for } 0 < \gamma < 1. \end{align*}

We must introduce this discounting into the objective function

\begin{align*} \text{Minimize } J(\wv) &= \mathrm{E} \left ( \sum_{k=0}^\infty \gamma^k r_{t+k+1} - Q(s_t,a_t;\wv)\right )^2\\ & = \mathrm{E} \Bigl ( \gamma^0 r_{t+1} + \sum_{k=1}^\infty \gamma^k r_{t+k+1} - Q(s_t,a_t;\wv) \Bigr )^2\\ & = \mathrm{E} \Bigl ( \gamma^0 r_{t+1} + \sum_{k=0}^\infty \gamma^{k+1} r_{t+k+1+1} - Q(s_t,a_t;\wv) \Bigr )^2\\ & = \mathrm{E} \Bigl ( \gamma^0 r_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k r_{t+k+1+1} - Q(s_t,a_t;\wv) \Bigr )^2\\ & = \mathrm{E} \Bigl ( r_{t+1} + \gamma Q(s_{t+1},a_{t+1};\wv) - Q(s_t,a_t;\wv) \Bigr )^2 \end{align*}

So, we simply add the multiplication by $\gamma$ to the next state-action Q value in the temporal-difference error.

## Using Batch-Oriented Algorithm for Non-Stationary Learning Task¶

These are batch updates, remember? Before, when doing nonlinear regression we had all input and target samples in $\Xv$ and $\Tv$. Now one set of samples are collected for one set of weight values, meaning for one Q function.

After updates, we have a new Q function, which will produce different actions. So what?

We must repeat the steps:

• Generate $N$ samples.
• Update $\Vv$ and $\Wv$.

in small-ish batches. Also, must not train for too many iterations each time. If we do train to convergence each time, the neural network can forget everything it learned from previous samples.

## 1-D Maze Example¶

Imagine a chain of states, numbered 1 through 10. From any state you can move left by one state, stay put, or move right. Moves off the ends are not permitted. The goal is to be in state 5. Let's assign reinforcements of -1 for every move except ones that result in being in state 5, for which you get reinforcement of 0. This will drive the agent to get to state 5 as quickly as possible.

So, the state is an integer from 1 through 10. Let's approximate the Q function by a neural network with 2 inputs for the integer state and the integer action, 6 hidden units, and 1 output.

\begin{align*} s_t &\in \{1,2,3,4,5,6,7,8,9,10\}\\[3mm] a_t &= \begin{cases} -1; \text{ for action left},\\ 0; \text{ for action stay},\\ 1; \text{ for action right} \end{cases}\\[3mm] s_{t+1} &= s_t + a_t\;\;\; \text{bounded to be between 1 and 10} \end{align*}

In [30]:
import numpy as np
import matplotlib.pyplot as plt
import random
import neuralnetwork_regression as nn   # from A4


Available actions step left, stay put, and step right. Represented as change in position, they are

In [2]:
valid_actions = [-1, 0, 1]


As mentioned in the previous lecture, we need to force exploration by taking random actions. But, as our Q network learns, we want to decrease the amount of exploration.

Remember that we used $\epsilon$ to indicate the probability of taking a random action. Its value is from 0 to 1. Given a Q net, called Qnet, the current state, the set of valid_actions and epsilon, we can define a function that either returns a random choice from valid_actions or the best action determined by the values of Q produced by Qnet for the current state and all valid_actions. This is referred to as an $\epsilon$-greedy policy.

In [32]:
def epsilon_greedy(Qnet, state, valid_actions, epsilon):
'''epsilon is between 0 and 1 and is the probability of returning a random action'''

if np.random.uniform() < epsilon:
# Random Move
action = np.random.choice(valid_actions)

else:
# Greedy Move
actions_randomly_ordered = random.sample(valid_actions, len(valid_actions))
Qs = [Qnet.use(np.array([[state, a]])) for a in actions_randomly_ordered]
ai = np.argmax(Qs)
action = actions_randomly_ordered[ai]

Q = Qnet.use(np.array([[state, action]]))

return action, Q   # return the chosen action and Q(state, action)


Now we need a function, make_samples, for collecting a bunch of samples of state, action, reinforcement, next state, next actions. We can make this function generic by passing in functions to create the initial state, the next state, and the reinforcement value.

In [33]:
valid_actions = [-1, 0, 1]

def initial_state():
return np.random.randint(1, 11)

def next_state(s, a):
newstate = min(max(1, s + a), 10)
return newstate

def reinf(s, sn):  # sn is next state
return 0 if sn is 5 else -1  # if we have arrived in state 5, we have reached the goal, so r is 0.

In [34]:
def make_samples(Qnet, initial_state_f, next_state_f, reinforcement_f,
valid_actions, n_samples, epsilon):

X = np.zeros((n_samples, Qnet.n_inputs))
R = np.zeros((n_samples, 1))
Qn = np.zeros((n_samples, 1))

s = initial_state_f()
a, _ = epsilon_greedy(Qnet, s, valid_actions, epsilon)

# Collect data from n_samples steps
for step in range(n_samples):

sn = next_state_f(s, a)        # Update state, sn, from s and a
rn = reinforcement_f(s, sn)    # Calculate resulting reinforcement
an, qn = epsilon_greedy(Qnet, sn, valid_actions, epsilon)  # choose next action
X[step, :] = (s, a)
R[step, 0] = rn
Qn[step, 0] = qn
s, a = sn, an  # Advance one time step

return (X, R, Qn)

In [35]:
def plot_status():
n = 100
xs = np.linspace(1, 10, n).reshape(-1, 1)
xsas = np.vstack((xs, xs, xs))
ones = np.ones((n, 1))
xsas = np.hstack((xsas, np.vstack((ones * -1, ones * 0, ones * 1))))
ys = Qnet.use(xsas)
ys = ys.reshape((n, 3), order='F')
plt.subplot(3, 3, 1)
plt.plot(np.tile(xs, (1, 3)), ys)
plt.ylabel('Q for each action')
plt.xlabel('State')
plt.legend(('Left','Stay','Right'))

plt.subplot(3, 3, 4)
plt.plot(xs, -1 + np.argmax(ys, axis=1))
plt.ylabel('Action')
plt.xlabel('State')
plt.ylim(-1.1, 1.1)

plt.subplot(3, 3, 7)
Qnet.use(np.hstack((xs, ones * -1)))
plt.gca().set_prop_cycle(None)
plt.plot(xs, Qnet.Ys[1], '--')
Qnet.use(np.hstack((xs, ones * 0)))
plt.gca().set_prop_cycle(None)
plt.plot(xs, Qnet.Ys[1])  # Ys[0] is inputs, Ys[1] is hidden layer, Ys[2] is output layer
Qnet.use(np.hstack((xs, ones * 1)))
plt.plot(xs, Qnet.Ys[1], linewidth=2)
plt.ylabel('Hidden Units for Actions -1 (dotted) 0 (solid) 1 (thick)')
plt.xlabel('State')

plt.subplot(3, 3, 2)
x = x_trace[j - 200:j, 0]
plt.plot(x, range(200), '.-')
plt.xlabel('x (last 200)')
plt.ylabel('Steps')
y = x.copy()
y[y != 5] = np.nan
plt.plot(y, range(200), 'ro')
plt.xlim(0, 10)

plt.subplot(3, 3, 5)
plt.plot(epsilon_trace[:trial])
plt.ylabel('$\epsilon$')
plt.xlabel('Trials')
plt.ylim(0, 1)

plt.subplot(3, 3 ,8)
plt.plot(error_trace)
plt.ylabel('TD Error')
plt.xlabel('Epochs')

plt.subplot(3, 3, 3)
plt.plot(r_trace[:j, 0])
plt.ylim(-1.1, 0.1)
plt.ylabel('R')
plt.xlabel('Steps')

plt.subplot(3, 3, 6)
plt.plot(np.convolve(r_trace[:j, 0], np.array([0.01] * 100), mode='valid'))
plt.ylabel('R smoothed')
plt.ylim(-1.1, 0)
plt.xlabel('Steps')

# plt.subplot(3, 3, 9)
# Qnet.draw(('x', 'a'), ('Q'))

plt.tight_layout()


Now for the main loop. Let's collect 20 interactions (steps) for each trial and update our Qnet at the end of each trial. And let's run for a total of 5000 trials. When updating our nnet Q function, let's let adam run for only 20 epochs.

In [45]:
n_trials = 5000
n_steps_per_trial = 20
n_epochs = 20
learning_rate = 0.1

n_hidden = [10]
gamma = 0.8
final_epsilon = 0.001  # value of epsilon at end of simulation. Decay rate is calculated

epsilon_decay =  np.exp(np.log(final_epsilon) / n_trials) # to produce this final value
print('epsilonDecay is', epsilon_decay)

epsilonDecay is 0.9986194028465246

In [46]:
from IPython.display import display, clear_output

epsilon = 1.0

Qnet = nn.NeuralNetwork(2, n_hidden, 1)

# We need to set standardization parameters now so Qnet can be called to get first set of samples,
# before it has been trained the first time.

def setup_standardization(Qnet, Xmeans, Xstds, Tmeans, Tstds):
Qnet.Xmeans = np.array(Xmeans)
Qnet.Xstds = np.array(Xstds)
Qnet.Tmeans = np.array(Tmeans)
Qnet.Tstds = np.array(Tstds)

# Inputs are position (1 to 10) and action (-1, 0, or 1)
setup_standardization(Qnet, [5, 0], [2.5, 0.5], [0], [1])

fig = plt.figure(figsize=(10, 10))

x_trace = np.zeros((n_trials * n_steps_per_trial, 2))
r_trace = np.zeros((n_trials * n_steps_per_trial, 1))
error_trace = []
epsilon_trace = np.zeros((n_trials, 1))

for trial in range(n_trials):

X, R, Qn = make_samples(Qnet, initial_state, next_state, reinf, valid_actions, n_steps_per_trial, epsilon)

T = R + gamma * Qn
Qnet.train(X, T, n_epochs, learning_rate, method='sgd', verbose=False)

epsilon_trace[trial] = epsilon
i = trial * n_steps_per_trial
j = i + n_steps_per_trial
x_trace[i:j, :] = X
r_trace[i:j, :] = R
error_trace += Qnet.error_trace

epsilon *= epsilon_decay

# Rest of this loop is for plots.
if True and (trial + 1) % int(n_trials * 0.01 + 0.5) == 0:

fig.clf()
plot_status()
clear_output(wait=True)
display(fig)

clear_output(wait=True)

In [13]:
def run(n_trials, n_steps_per_trial, n_epochs, learning_rate, n_hidden, gamma, final_epsilon):
epsilon = 1
epsilon_decay =  np.exp(np.log(final_epsilon)/n_trials) # to produce this final value

Qnet = nn.NeuralNetwork(2, n_hidden, 1)
setup_standardization(Qnet, [5, 0], [2.5, 0.5], [0], [1])

r_sum = 0
r_last_2_trials = 0
for trial in range(n_trials):
X, R, Qn = make_samples(Qnet, initial_state, next_state, reinf, valid_actions, n_steps_per_trial, epsilon)
r_sum += np.sum(R)
if trial > n_trials - 3:
r_last_2_trials += np.sum(R)
epsilon *= epsilon_decay
Qnet.train(X, R + gamma * Qn, n_epochs, learning_rate, method='sgd', verbose=False)

return r_sum / (n_trials * n_steps_per_trial), r_last_2_trials / (2 * n_steps_per_trial)

In [14]:
import time
start = time.time()
result = []
for nT in (10, 50, 100):  # number of trials
print('nT is', nT)
for nS in (10, 20, 50, 100):  # number of steps per trial
for nE in (5, 10, 50, 100):  # number of epochs
for nH in [[2], [5], [10], [10, 10]]:  # hidden layer structure
for g in (0.5, 0.9):  # gamma
for lr in (0.01, 0.1):  # learning rate
for fep in (0.01,):  # final epsilon

r, r_last_2 = run(nT, nS, nE, lr, nH, g, fep)
result.append([nT, nS, nE, nH, g, lr, fep, r, r_last_2])
# print(result[-1])

print(f'Took {(time.time() - start)/60:.1f} minutes.')

nT is 10
nT is 50
nT is 100
Took 4.3 minutes.

In [15]:
import pandas
result = pandas.DataFrame(result, columns=('Trials', 'StepsPerTrial', 'Epochs',
'Hiddens', 'gamma', 'lr', 'fep', 'R', 'R Last 2'))
result.sort_values(by='R Last 2', ascending=False)

Out[15]:
Trials StepsPerTrial Epochs Hiddens gamma lr fep R R Last 2
575 100 10 100 [10, 10] 0.9 0.10 0.01 -0.8870 -0.10
573 100 10 100 [10, 10] 0.5 0.10 0.01 -0.8500 -0.15
533 100 10 10 [5] 0.5 0.10 0.01 -0.9010 -0.45
693 100 50 100 [5] 0.5 0.10 0.01 -0.9854 -0.49
553 100 10 50 [10] 0.5 0.10 0.01 -0.9030 -0.55
... ... ... ... ... ... ... ... ... ...
499 50 100 100 [2] 0.9 0.10 0.01 -0.9974 -1.00
496 50 100 100 [2] 0.5 0.01 0.01 -0.9962 -1.00
494 50 100 50 [10, 10] 0.9 0.01 0.01 -0.9976 -1.00
493 50 100 50 [10, 10] 0.5 0.10 0.01 -0.9978 -1.00
767 100 100 100 [10, 10] 0.9 0.10 0.01 -0.9924 -1.00

768 rows × 9 columns

In [ ]: