%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
from IPython.display import Image
approximate value function: parameterized function $\hat{v}(s, w) \approx v_\pi(s)$
$s \to u$: $s$ is the state updated and $u$ is the update target that $s$'s estimated value is shifted toward.
We use machine learning methods and pass to them the $s \to u$ of each update as a training example. Then we interpret the approximate function they produce as an estimated value function.
not all function approximation methods are equally well suited for use in reinforcement learning: they need to learn efficiently from incrementally acquired data and to handle nonstationary target functions.
which states we care most about: a state distribution $\mu(s) \geq 0$, $\sum_s \mu(s) = 1$, often the on-policy distribution (the fraction of time spent in $s$).
objective function, the Mean Squared Value Error, denoted $\overline{VE}$:
\begin{equation} \overline{VE}(w) \doteq \sum_{s \in \mathcal{S}} \mu(s) \left [ v_\pi (s) - \hat{v}(s, w) \right ]^2 \end{equation}where $v_\pi(s)$ is the true value and $\hat{v}(s, w)$ is the approximate value.
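A tiny sketch of computing $\overline{VE}$ on a hypothetical four-state example; the numbers for $\mu$, $v_\pi$, and $\hat{v}$ are made up for illustration.

```python
import numpy as np

mu = np.array([0.4, 0.3, 0.2, 0.1])       # state distribution mu(s), sums to 1
v_pi = np.array([1.0, 0.5, -0.5, -1.0])   # true values v_pi(s) (assumed known here)
v_hat = np.array([0.9, 0.6, -0.4, -0.8])  # approximate values v_hat(s, w)

ve = np.sum(mu * (v_pi - v_hat) ** 2)     # Mean Squared Value Error
print(f"VE = {ve:.4f}")
```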
Note that minimizing $\overline{VE}$ is no guarantee of achieving our ultimate purpose: finding a better policy.
stochastic gradient descent (SGD): well suited to online reinforcement learning.
\begin{align} w_{t+1} &\doteq w_t - \frac1{2} \alpha \nabla \left [ v_\pi(S_t) - \hat{v}(S_t, w_t) \right ]^2 \\ &= w_t + \alpha \left [ \color{blue}{v_\pi(S_t)} - \hat{v}(S_t, w_t) \right ] \nabla \hat{v}(S_t, w_t) \\ &\approx w_t + \alpha \left [ \color{blue}{U_t} - \hat{v}(S_t, w_t) \right ] \nabla \hat{v}(S_t, w_t) \end{align}The target $U_t$ of the update $S_t \to U_t$ is not the true value $v_\pi(S_t)$, but some, possibly random, approximation to it (e.g., the value targets accumulated by the various methods of the earlier chapters, such as the Monte Carlo return $G_t$ or a bootstrapped TD target). If $U_t$ is an unbiased estimate, as $G_t$ is, then $w_t$ converges to a local optimum under the usual step-size conditions.
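A minimal sketch of this update in code, assuming a linear $\hat{v}(s, w) = w^T x(s)$ so that $\nabla \hat{v} = x(s)$; the function name and arguments are illustrative.

```python
import numpy as np

def sgd_value_update(w, x_s, u_t, alpha):
    """One SGD step: w_{t+1} = w_t + alpha * (U_t - v_hat(S_t, w_t)) * grad v_hat.

    Assumes linear v_hat(s, w) = w @ x(s), so grad v_hat = x(s).
    u_t may be a Monte Carlo return G_t (unbiased), or a bootstrapped target
    such as R_{t+1} + gamma * (w @ x_next) (semi-gradient TD(0)).
    """
    return w + alpha * (u_t - w @ x_s) * x_s
```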
state aggregation: states are grouped together, with one estimated value (one component of $w$) for each group; a simple special case of the linear function approximation below (see the sketch after the linear value equation).
For every state $s$, there is a real-valued feature vector $x(s) \doteq (x_1(s), x_2(s), \dots, x_d(s))^T$:
\begin{equation} \hat{v}(s, w) \doteq w^T x(s) \doteq \sum_{i=1}^d w_i x_i(s) \end{equation}Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems.
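A minimal sketch of gradient Monte Carlo with linear features, using state aggregation (one-hot group features) on a hypothetical 20-state random walk; the environment, group count, and step size are illustrative assumptions.

```python
import numpy as np

N_STATES, N_GROUPS = 20, 4                # hypothetical random walk, 5 states per group
rng = np.random.default_rng(0)

def features(s):
    """State aggregation: one-hot vector for s's group, so v_hat(s, w) = w[group]."""
    x = np.zeros(N_GROUPS)
    x[s * N_GROUPS // N_STATES] = 1.0
    return x

def episode():
    """Random walk from the middle; reward +1 off the right end, -1 off the left."""
    s, traj = N_STATES // 2, []
    while 0 <= s < N_STATES:
        traj.append(s)
        s += rng.choice([-1, 1])
    return traj, (1.0 if s >= N_STATES else -1.0)

w, alpha = np.zeros(N_GROUPS), 0.05
for _ in range(2000):
    traj, G = episode()                   # undiscounted: every visited state's return is G
    for s in traj:
        x = features(s)
        w += alpha * (G - w @ x) * x      # gradient MC update; grad of w @ x(s) is x(s)
print(np.round(w, 2))                     # group values increase from negative to positive
```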
Suppose you want to learn in about $\tau$ experiences with substantially the same feature vector. A good rule of thumb for setting the step-size parameter of linear SGD methods is then $\alpha \doteq (\tau \mathbf{E}[x^T x])^{-1}$.
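A tiny sketch of this rule of thumb, estimating $\mathbf{E}[x^T x]$ from a sample of feature vectors; the sample and the choice $\tau = 10$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((1000, 8))                 # hypothetical sample of feature vectors x(S_t)
tau = 10                                  # learn within ~10 similar-feature experiences
alpha = 1.0 / (tau * np.mean(np.sum(X**2, axis=1)))  # alpha = (tau * E[x^T x])^(-1)
print(alpha)
```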
nonlinear function approximation: artificial neural networks (ANNs), including convolutional networks (CNNs).
Least-Squares TD (LSTD) computes the TD fixed point $w_{TD} = A^{-1} b$ directly: the most data-efficient form of linear TD(0), but computationally more expensive ($O(d^2)$ per step with incremental inverse updates, versus $O(d)$ for semi-gradient TD).
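A minimal batch LSTD sketch, assuming transitions are given as `(x_t, r, x_next)` tuples of feature vectors (with `x_next` all zeros at termination); the regularizer `eps` is the usual trick to keep $\hat{A}$ invertible.

```python
import numpy as np

def lstd(transitions, d, gamma=0.9, eps=1e-3):
    """Batch LSTD: accumulate A_hat and b_hat, then solve for w_TD = A_hat^{-1} b_hat."""
    A = eps * np.eye(d)                   # eps * I keeps A_hat invertible early on
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)  # A_hat += x_t (x_t - gamma x_{t+1})^T
        b += r * x                            # b_hat += R_{t+1} x_t
    return np.linalg.solve(A, b)
```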
memory-based function approximation, e.g., the nearest neighbor method
kernel-based function approximation, e.g., radial basis functions (RBFs)
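A small sketch of radial basis features for a one-dimensional state; the centers and width $\sigma$ are illustrative choices.

```python
import numpy as np

def rbf_features(s, centers, sigma=0.5):
    """Radial basis features: x_i(s) = exp(-(s - c_i)^2 / (2 sigma^2))."""
    return np.exp(-((s - centers) ** 2) / (2 * sigma**2))

centers = np.linspace(0.0, 1.0, 5)        # hypothetical: 5 centers spread over [0, 1]
print(np.round(rbf_features(0.3, centers), 3))
```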
we are often more interested in some states than others: the concepts of interest and emphasis let us weight updates non-uniformly across states.