Neuro-symbolic Reinforcement Learning with Formally Verified Exploration

link

Contribution

Guarantees safe exploration in continuous control tasks.

Imitate Safely

Goal: obtain a verified shield (a code sketch follows this list).
Initialize the system with a manually constructed shield, then for $t = 1, \dots, T$ do:

  1. Train a neural network policy to imitate the current shield.
  2. Update the neural network with a gradient-based RL algorithm, keeping the shield active so that exploration stays safe.
  3. Distill the shielded neural network into a new verified shield.
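To make the loop concrete, here is a minimal, self-contained Python sketch. All names (`imitate`, `rl_update`, `project_to_shield`, `verified`) and the dictionary-of-weights policy representation are illustrative placeholders, not the paper's implementation; the real method uses DDPG-style updates and formal verification of piecewise-linear shields.

```python
import numpy as np

def imitate(shield):
    # Placeholder: behaviour-clone the shield into a neural policy.
    return {"weights": np.copy(shield["weights"])}

def rl_update(policy, shield):
    # Placeholder: one gradient-based RL update; during rollouts the shield
    # overrides any action that would leave the verified safe region.
    policy["weights"] = policy["weights"] + 0.01 * np.random.randn(*policy["weights"].shape)
    return policy

def verified(shield):
    # Placeholder for the formal safety check (e.g. an inductive invariant).
    return bool(np.all(np.abs(shield["weights"]) < 10.0))

def project_to_shield(policy, prev_shield):
    # Placeholder: distill the shielded neural policy back into the
    # verifiable (piecewise-linear) class; keep the previous shield if the
    # candidate cannot be verified (this mirrors the guess in question 1).
    candidate = {"weights": np.copy(policy["weights"])}
    return candidate if verified(candidate) else prev_shield

def imitate_safely(initial_shield, T=10):
    g = initial_shield               # manually constructed, verified shield
    h = imitate(g)                   # neural policy distilled from the shield
    for _ in range(T):
        h = rl_update(h, shield=g)   # safe exploration under the shield
        g = project_to_shield(h, g)  # new verified shield from the policy
    return h, g

policy, shield = imitate_safely({"weights": np.zeros((2, 3))})
```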

$Project_\mathcal{G}$

Refine the piecewise-linear shield $g^*$ toward the shielded neural policy $h$ (a toy sketch follows this list).
for $t = 1, \dots, T$ do:

  1. Split the state space into two parts.
  2. Call Imitate Safely on each of the two subspaces to obtain verified policies $g_i^1$ and $g_i^2$, one per split region.
  3. Compose $g_i^1$ and $g_i^2$ into a candidate shield $g'$.
  4. If $D(g', h) < D(g^*, h)$, set $g^* \leftarrow g'$, where $D$ is the $L_2$ distance and $h$ is the shielded neural policy.
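The refinement logic can be illustrated with a toy, self-contained example in which policies are 1-D functions on $[0, 1]$, the shield class is piecewise-linear, and least-squares line fitting stands in for "Imitate Safely restricted to a sub-region". Every name here is an illustrative assumption, and the verification step is omitted entirely; only the split / imitate / compose / accept-if-closer structure is shown.

```python
import numpy as np

def h(x):
    # Stand-in for the shielded neural policy we want to approximate.
    return np.sin(3 * x)

def fit_linear(xs, ys):
    # Least-squares line fit on one sub-region (our "Imitate Safely" stub).
    a, b = np.polyfit(xs, ys, 1)
    return lambda x: a * x + b

def l2_distance(f, g, xs):
    # L2 distance between two policies, estimated on sample points.
    return np.sqrt(np.mean((f(xs) - g(xs)) ** 2))

def project_g(T=5, n=200):
    xs = np.linspace(0, 1, n)
    g_star = fit_linear(xs, h(xs))              # initial one-piece shield
    for _ in range(T):
        cut = np.random.uniform(0.2, 0.8)       # 1. split the space in two
        left, right = xs[xs <= cut], xs[xs > cut]
        g1 = fit_linear(left, h(left))          # 2. one policy per part
        g2 = fit_linear(right, h(right))        #    (verification omitted)
        g_prime = lambda x, c=cut, f1=g1, f2=g2: np.where(x <= c, f1(x), f2(x))  # 3. compose g'
        if l2_distance(g_prime, h, xs) < l2_distance(g_star, h, xs):             # 4. accept if closer
            g_star = g_prime
    return g_star

g = project_g()
```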

Detailed Questions

  1. In step 2 of $Project_\mathcal{G}$, what happens if the newly learned shield cannot be verified? Redoing step 2 may eventually yield a verified policy, but there is no guarantee. My guess is that if no verified candidate is found, the previous shield is kept.
  2. If we already have a shielded policy at the beginning, why not keep using that fixed shield to guarantee safe exploration, and focus only on optimizing the neural network policy? The shield is invoked far less often than the neural network, so even if the shield's performance is mediocre, it should barely affect the reward, since it rarely intervenes (see the toy illustration below). Their results also show that the neural-network-based DDPG achieves the best reward in almost every case.
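As a toy illustration of the point in question 2, the sketch below runs a shielded policy in which the neural action is used whenever it passes a safety check and the shield only overrides otherwise; a rarely triggered override can then only slightly change the collected reward. The policies, the `is_safe` check, and the threshold are all made up for illustration, not taken from the paper.

```python
import numpy as np

def neural_action(state):
    return np.tanh(state.sum())           # stand-in for the learned policy

def shield_action(state):
    return 0.0                            # stand-in for the conservative shield

def is_safe(state, action):
    return abs(state.sum() + action) < 5  # stand-in for the safety check

def shielded_step(state):
    a = neural_action(state)
    if is_safe(state, a):
        return a, False                   # neural action accepted
    return shield_action(state), True     # shield intervenes

states = [np.random.randn(3) for _ in range(1000)]
interventions = sum(shielded_step(s)[1] for s in states)
print(f"shield intervened in {interventions} of {len(states)} steps")
```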

Other

The proof is useful for an AI conference submission.