Learning Multi-Level Hierarchies with Hindsight

paper, code, open review

This work combines the universal value function approximator (UVFA) and Hindsight Experience Replay (HER).
UVFA is used to estimate the action-value function $q_\pi(s, g, a)$ of a goal-conditioned policy $\pi$.
HER is a data augmentation technique that can accelerate learning in sparse reward tasks.

Technique Details


The state spaces of all layers of the hierarchy are identical.
The action space of the bottom-most layer is the original action space of the task. For every other layer, the action space is identical to the state space, because its actions are subgoal states.

Nested Policy

The policy at layer $i$ generates the goal for layer $i-1$.
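
The nesting can be sketched with a toy example. This is only an illustration under my own assumptions: the 1-D integer state, the hand-written `low_level_policy` and `high_level_policy`, and `run_level` are all hypothetical and are not taken from the paper or the implementation.

```python
def low_level_policy(state: int, goal: int) -> int:
    """Bottom layer (hypothetical): a primitive action that moves toward its goal."""
    if goal > state:
        return 1
    if goal < state:
        return -1
    return 0


def high_level_policy(state: int, goal: int, max_jump: int = 3) -> int:
    """Upper layer (hypothetical): its action is a subgoal state at most max_jump away."""
    delta = max(-max_jump, min(max_jump, goal - state))
    return state + delta


def run_level(level: int, state: int, goal: int, horizon: int) -> int:
    """Run one layer for at most `horizon` of its own actions and return the final state."""
    for _ in range(horizon):
        if state == goal:
            break
        if level == 0:
            # Bottom layer: apply a primitive action to the toy environment.
            state = state + low_level_policy(state, goal)
        else:
            # Upper layer: its action becomes the goal of the layer below.
            subgoal = high_level_policy(state, goal)
            state = run_level(level - 1, state, subgoal, horizon)
    return state


final_state = run_level(level=1, state=0, goal=7, horizon=5)
```

With two layers the agent reaches state 7, while the bottom layer alone with the same horizon of 5 stops at state 5. The point is only that each layer works on a shorter problem.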

Hindsight Action Transitions

Similar to Hindsight Experience Replay.

Subgoal Test Transitions

To my understanding, this is just a simulation to check whether the hindsight goal can be achieved or not, but their claim and explanation are not very clear to me. More specifically, I do not see how the subgoal test solves the problem of subgoals that cannot be achieved.


Reading notes on this implementation.

Check Goal

If the agent is within the threshold of the goal in every state dimension, the goal is achieved.

In [ ]:
def check_goal(self, state, goal, threshold):
    for i in range(self.state_dim):
        if abs(state[i] - goal[i]) > threshold[i]:
            return False
    return True
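
A small usage sketch of the same check, written as a free function so it runs on its own. The goal, threshold and state values are made up for illustration.

```python
def check_goal(state, goal, threshold):
    """The goal is achieved only if every dimension is within its own threshold."""
    return all(abs(s - g) <= t for s, g, t in zip(state, goal, threshold))


goal = [1.0, 0.0]
threshold = [0.1, 0.05]

# Both dimensions are inside their thresholds.
near = check_goal([0.95, 0.02], goal, threshold)
# The second dimension is off by 0.2, which exceeds its threshold of 0.05.
far = check_goal([0.95, 0.20], goal, threshold)
```

Here `near` is `True` and `far` is `False`: one dimension outside its threshold is enough to fail the check.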

Hindsight Transition

In [ ]:
# hindsight action transition
if goal_achieved:
    # Goal reached: reward 0 and discount 0.
    self.replay_buffer[i_level].add((state, action, 0.0, next_state, goal, 0.0, float(done)))
else:
    # If the agent does not achieve the goal, it is penalized.
    self.replay_buffer[i_level].add((state, action, -1.0, next_state, goal, self.gamma, float(done)))
# hindsight goal transition
# last transition reward and discount is 0
goal_transitions[-1][2] = 0.0
goal_transitions[-1][5] = 0.0
for transition in goal_transitions:
    # last state is goal for all transitions
    transition[4] = next_state
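
The hindsight goal relabeling above can be sketched as a standalone function. The field layout `[state, action, reward, next_state, goal, discount, done]` follows the snippets in these notes; the function name `relabel_with_hindsight_goal` and the toy episode are my own assumptions.

```python
def relabel_with_hindsight_goal(goal_transitions):
    """Relabel an episode so that the last reached state becomes the goal.

    Each transition is a list:
    [state, action, reward, next_state, goal, discount, done].
    """
    if not goal_transitions:
        return []
    achieved = goal_transitions[-1][3]  # next_state of the last transition
    relabeled = [list(t) for t in goal_transitions]  # copy, keep the input intact
    for t in relabeled:
        t[4] = achieved  # the last state is the goal for all transitions
    relabeled[-1][2] = 0.0  # the last transition reaches the goal: reward 0
    relabeled[-1][5] = 0.0  # and discount 0
    return [tuple(t) for t in relabeled]


gamma = 0.95
episode = [
    [0, 1, -1.0, 1, None, gamma, 0.0],
    [1, 1, -1.0, 2, None, gamma, 0.0],
    [2, 1, -1.0, 3, None, gamma, 0.0],
]
hindsight = relabel_with_hindsight_goal(episode)
```

Every relabeled transition now has goal 3, and only the last one has reward 0 and discount 0.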

Subgoal Test

With some probability (`lamda` in the code), the subgoal test is enabled. When it is enabled, no exploration noise is added to the action.

In [ ]:
# If this is a subgoal test, then next/lower level goal has to be a subgoal test
is_next_subgoal_test = is_subgoal_test

# Determine whether to test subgoal (action)
if np.random.random_sample() < self.lamda:
    is_next_subgoal_test = True

# add noise or take random action if not subgoal testing
if not is_subgoal_test:
    if np.random.random_sample() > 0.2:
        # with probability 0.8, perturb the policy action with Gaussian noise
        action = action + np.random.normal(0, self.exploration_action_noise)
        action = action.clip(self.action_clip_low, self.action_clip_high)
    else:
        # otherwise, take a uniformly random action
        action = np.random.uniform(self.action_clip_low, self.action_clip_high)

# If subgoal was tested but not achieved, add subgoal testing transition.
# The -self.H is the penalty. Discount factor is set to 0
if is_next_subgoal_test and not self.check_goal(action, next_state, self.threshold):
    self.replay_buffer[i_level].add((state, action, -self.H, next_state, goal, 0.0, float(done)))
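
To make the penalty concrete, here is a standalone sketch of the two decisions above. The function names and toy values are hypothetical. My reading is that, because the discount is 0, the TD target for a missed subgoal is exactly `-H`, so the upper layer learns a low value for subgoals the lower layer cannot reach.

```python
import random


def should_test_subgoal(parent_is_testing, lamda, rng):
    """Test if the parent layer is already testing, or with probability lamda."""
    return parent_is_testing or rng.random() < lamda


def subgoal_penalty_transition(state, subgoal, reached, goal, H, threshold, done=False):
    """Return the penalty transition if the tested subgoal was missed, else None."""
    missed = any(abs(r - s) > t for r, s, t in zip(reached, subgoal, threshold))
    if not missed:
        return None
    # Reward -H with discount 0: the target for this (state, subgoal) pair is -H.
    return (state, subgoal, -float(H), reached, goal, 0.0, float(done))


rng = random.Random(0)
testing = should_test_subgoal(False, 1.0, rng)  # lamda = 1.0 always tests

penalty = subgoal_penalty_transition(
    state=[0.0], subgoal=[5.0], reached=[2.0], goal=[9.0], H=20, threshold=[0.1])
no_penalty = subgoal_penalty_transition(
    state=[0.0], subgoal=[2.0], reached=[2.05], goal=[9.0], H=20, threshold=[0.1])
```

The first call misses the subgoal by 3.0 and yields a transition with reward -20 and discount 0. The second call lands within the threshold and adds nothing.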


The review scores (7, 6, 5) were enough to get this paper in, but they are not very promising. The main complaints are:

  1. The multi-layer hierarchy is not fully explored.
  2. The subgoal test is criticized.
  3. The benchmarks are not as complicated as those in other HRL papers.

Hindsight Experience Replay


HER handles the challenge caused by binary, sparse rewards.
By replaying the experience with a different goal (no additional simulation needed), a more informative reward can be generated. This approach is different from reward shaping because it does not require any domain knowledge.

Prerequisite knowledge: Universal Value Function Approximators (UVFA) and how a replay buffer works.
The example in Section 3.1 and Algorithm 1 are enough to see how this simple approach works.
The reward function $r_g$ is defined in Section 4.1.
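
A minimal sketch of the relabeling idea, assuming a sparse reward of 0 when the goal is reached after the step and -1 otherwise, and assuming the simplest strategy of using the final state of the episode as the hindsight goal. The 1-D toy states and the names `reward` and `her_final` are my own and not from the paper.

```python
def reward(next_state, goal, eps=0.0):
    """Assumed sparse reward: 0 if the goal is reached after the step, else -1."""
    return 0.0 if abs(next_state - goal) <= eps else -1.0


def her_final(episode, original_goal):
    """Store each transition twice: with the original goal and with the hindsight goal.

    episode: list of (state, action, next_state).
    Returns a list of (state, action, reward, next_state, goal).
    """
    hindsight_goal = episode[-1][2]  # final state actually reached
    buffer = []
    for state, action, next_state in episode:
        for g in (original_goal, hindsight_goal):
            buffer.append((state, action, reward(next_state, g), next_state, g))
    return buffer


# The agent wanted to reach 10 but only got to 3.
episode = [(0, 1, 1), (1, 1, 2), (2, 1, 3)]
replay = her_final(episode, original_goal=10)

original_rewards = [t[2] for t in replay if t[4] == 10]
hindsight_rewards = [t[2] for t in replay if t[4] == 3]
```

With the original goal every reward is -1, so nothing distinguishes the transitions. With the hindsight goal the last transition gets reward 0, which gives the value function something to learn from without any extra simulation.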

Near-Optimal Representation Learning for Hierarchical Reinforcement Learning

paper, review, tf code

The goal does not have to be set in the state space; it can also be set in a representation space with lower dimension.


  • The paper looks very strong in both experiments and theory, although there are some gaps in my understanding of the theory.
  • The experiments are interesting and include a benchmark with image input.
  • They clearly know the focus of this community; sufficient comparison to $\beta$-VAE etc. is provided in the appendices.

SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning


The high-level policy is a planner using PDDL. The planner requires domain knowledge.