## Learning Multi-Level Hierarchy with Hindsight

This work combines the universal value function approximator (UVFA) and Hindsight Experience Replay (HER).
UVFA is used to estimate the action-value function $q_\pi(s, g, a)$ of a goal-conditioned policy $\pi$.
HER is a data augmentation technique that can accelerate learning in sparse reward tasks.

### Technique Details

#### Space

The state spaces of all layers of the hierarchy are identical.
The action space of the bottom-most layer is identical to the original action space of the task; the action space of every other layer is identical to the state space.

#### Nested Policy

The policy at layer $i$ generates the goal for layer $i-1$.
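
The nesting can be sketched as a recursion. The following is a minimal illustration, not the paper's implementation: `run_level`, `low_policy`, `high_policy`, and `step_env` are hypothetical names, the policies are hand-written rather than learned, and the environment is a toy 1-D chain whose state is an integer.

```python
def run_level(level, state, goal, horizon, policies, step_env):
    """Run one layer for at most `horizon` actions and return the final state."""
    for _ in range(horizon):
        action = policies[level](state, goal)
        if level == 0:
            # Bottom layer: the action is a primitive environment action.
            state = step_env(state, action)
        else:
            # Higher layer: the action is a subgoal (a state) for the layer below.
            state = run_level(level - 1, state, action, horizon, policies, step_env)
        if state == goal:
            break
    return state


# Hypothetical hand-written policies for a 1-D chain (assumption, not learned).
def low_policy(state, goal):
    # Primitive action: move one step toward the goal.
    return 1 if goal > state else -1


def high_policy(state, goal):
    # Subgoal: a state at most 3 steps away in the direction of the goal.
    return state + max(-3, min(3, goal - state))


def step_env(state, action):
    return state + action


final_state = run_level(1, 0, 7, 3, [low_policy, high_policy], step_env)
```

Note that the top layer outputs states (subgoals) while the bottom layer outputs primitive actions, which is why only the bottom layer keeps the task's original action space.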

#### Hindsight Action Transitions

Similar to Hindsight Experience Replay.

#### Subgoal Test Transitions

To my understanding, this is just a simulation to see whether the hindsight goal can be achieved or not, but their claim and explanation are not very clear to me. More specifically, I do not see how the subgoal test solves the problem of subgoals that cannot be achieved.

### Code

#### Check Goal

If the agent is close enough to the goal in every state dimension, the goal is achieved.

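```python
def check_goal(self, state, goal, threshold):
    # The goal counts as achieved only if every state dimension
    # is within its own threshold of the goal.
    for i in range(self.state_dim):
        if abs(state[i] - goal[i]) > threshold[i]:
            return False
    return True
```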
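
To try the check outside the agent class, here is a standalone sketch of the same test. It drops `self` and compares all dimensions passed in rather than the first `state_dim` of them; the threshold and state values are made up for illustration.

```python
def check_goal(state, goal, threshold):
    """Return True if every dimension of `state` is within `threshold` of `goal`."""
    return all(abs(s - g) <= t for s, g, t in zip(state, goal, threshold))


# Hypothetical per-dimension thresholds (assumption, not from the notes).
threshold = [0.1, 0.5]

near = check_goal([1.0, 2.0], [1.05, 2.4], threshold)  # both dimensions within threshold
far = check_goal([1.0, 2.0], [1.05, 2.6], threshold)   # second dimension is too far
```

A per-dimension threshold lets position-like and velocity-like dimensions use different tolerances.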


#### Hindsight Transition

```python
# hindsight action transition
if goal_achieved:
    self.replay_buffer[i_level].add((state, action, 0.0, next_state, goal, 0.0, float(done)))
else:
    # If the agent does not achieve the goal, it is penalized.
    self.replay_buffer[i_level].add((state, action, -1.0, next_state, goal, self.gamma, float(done)))

# hindsight goal transition
# last transition reward and discount is 0
goal_transitions[-1][2] = 0.0
goal_transitions[-1][5] = 0.0
for transition in goal_transitions:
    # last state is goal for all transitions
    transition[4] = next_state
```
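
A small worked example of the hindsight goal relabeling, using the same field order as the snippet above: `(state, action, reward, next_state, goal, discount, done)`. This is a sketch under assumptions: `finalize_goal_transitions` is a hypothetical helper, and it assumes the transitions were stored with reward `-1.0`, discount `gamma`, and an empty goal before relabeling, which the snippet does not show.

```python
def finalize_goal_transitions(goal_transitions, final_state):
    """Relabel stored transitions so that `final_state` becomes their goal."""
    relabeled = [list(t) for t in goal_transitions]
    # The last transition reaches the relabeled goal: reward 0, discount 0.
    relabeled[-1][2] = 0.0
    relabeled[-1][5] = 0.0
    for transition in relabeled:
        # The state actually reached is the goal for all transitions.
        transition[4] = final_state
    return relabeled


gamma = 0.95  # hypothetical discount factor
# Fields: [state, action, reward, next_state, goal, discount, done]
episode = [
    [0, 1, -1.0, 1, None, gamma, 0.0],
    [1, 1, -1.0, 2, None, gamma, 0.0],
]
relabeled = finalize_goal_transitions(episode, final_state=2)
```

After relabeling, only the last transition has reward 0, so the episode looks like a successful attempt at reaching state `2`.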


#### Subgoal Test

With some probability (`self.lamda` in the code), the subgoal test is enabled. When it is enabled, no exploration noise is added.

```python
# If this is a subgoal test, then next/lower level goal has to be a subgoal test
is_next_subgoal_test = is_subgoal_test

# Determine whether to test subgoal (action)
if np.random.random_sample() < self.lamda:
    is_next_subgoal_test = True

# add noise or take random action if not subgoal testing
if not is_subgoal_test:
    if np.random.random_sample() > 0.2:
        action = action + np.random.normal(0, self.exploration_action_noise)
        action = action.clip(self.action_clip_low, self.action_clip_high)
    else:
        action = np.random.uniform(self.action_clip_low, self.action_clip_high)

# If subgoal was tested but not achieved, add subgoal testing transition.
# The -self.H is the penalty. Discount factor is set to 0
if is_next_subgoal_test and not self.check_goal(action, next_state, self.threshold):
    self.replay_buffer[i_level].add((state, action, -self.H, next_state, goal, 0.0, float(done)))
```
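
One way to read the two penalties: the sixth field of each stored transition is the discount applied to the bootstrapped value. Assuming the critic is trained on a standard one-step TD target (an assumption; the update rule is not shown in these notes), a discount of 0 pins the target to the reward alone. `td_target` and all numbers below are hypothetical.

```python
def td_target(reward, discount, next_q):
    """One-step TD target: reward plus discounted value of the next state."""
    return reward + discount * next_q


H = 20         # hypothetical horizon, used as the subgoal-test penalty
gamma = 0.95   # hypothetical discount factor
next_q = -4.0  # hypothetical critic estimate for the next state

# Ordinary missed goal: small penalty, still bootstraps from the next state.
missed_goal_target = td_target(-1.0, gamma, next_q)

# Failed subgoal test: the target is exactly -H, whatever the critic predicts.
failed_subgoal_target = td_target(-float(H), 0.0, next_q)
```

Under this reading, a failed subgoal test assigns the proposed subgoal the worst value a layer can receive, without relying on the critic's estimate at the next state.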


### Reviews

Scores of 7, 6, and 5 are enough to get this paper in, but that is not a very promising result. The main complaints are:

1. The multi-layer hierarchy is not fully explored.
2. The subgoal test is criticized.
3. The benchmarks are not as complex as those in other HRL papers.

## Hindsight Experience Replay

paper

Handles the challenge caused by binary, sparse rewards.
By replaying experience with a different goal (no additional simulation needed), more informative rewards can be generated. This approach differs from reward shaping because it does not require any domain knowledge.

Prerequisite knowledge: Universal Value Function Approximators (UVFA) and how a replay buffer works.
The example in Section 3.1 and Algorithm 1 make it easy to see how this simple approach works.
The reward $r_g$ is defined in Section 4.1.
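
The relabeling idea can be sketched on a small bit-flipping task. This is an illustration under assumptions, not the paper's code: all function names are hypothetical, the behavior policy is random, the substituted goal is the final state of the episode, and the sparse reward is evaluated on the state reached by each transition.

```python
import random


def reward(next_state, goal):
    # Binary, sparse reward: 0 when the goal is reached, -1 otherwise.
    return 0.0 if next_state == goal else -1.0


def flip(state, index):
    bits = list(state)
    bits[index] = 1 - bits[index]
    return tuple(bits)


def collect_episode(start, goal, steps, rng):
    """Roll out a random policy; store (state, action, reward, next_state, goal)."""
    episode, state = [], start
    for _ in range(steps):
        action = rng.randrange(len(state))
        next_state = flip(state, action)
        episode.append((state, action, reward(next_state, goal), next_state, goal))
        state = next_state
    return episode


def relabel_with_final_state(episode):
    """Replay the same transitions as if the final state had been the goal."""
    new_goal = episode[-1][3]
    return [(s, a, reward(s2, new_goal), s2, new_goal) for s, a, _, s2, _ in episode]


rng = random.Random(0)
# Three flips cannot turn four zeros into four ones, so every reward is -1.
episode = collect_episode((0, 0, 0, 0), (1, 1, 1, 1), 3, rng)
hindsight = relabel_with_final_state(episode)
```

The original episode carries no learning signal, while the relabeled copy ends with a reward of 0, and no extra environment steps were needed to produce it.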

## Near-Optimal Representation Learning for Hierarchical Reinforcement Learning

The goal does not need to be set in the state space; it can also be set in a representation space with lower dimension.

### Review

• The paper looks very strong in both experiments and theory, although there are some gaps in my understanding of the theory.
• The experiments are interesting and include a benchmark with image input.
• The authors clearly know what this community focuses on: sufficient comparison to $\beta$-VAE etc. is provided in the appendices.

## SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning

paper

The high-level policy is a planner based on PDDL. The planner requires domain knowledge.