This work combines the universal value function approximator (UVFA) and Hindsight Experience Replay (HER).
A UVFA is used to estimate the action-value function of a goal-conditioned policy $\pi$, $q_\pi(s, g, a)$.
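As a minimal sketch (a PyTorch-style MLP with made-up layer sizes, not the paper's exact architecture), a UVFA simply feeds the goal into the network alongside the state and action:

```python
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Minimal UVFA sketch: q(s, g, a) as an MLP over [state, goal, action]."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar action value
        )

    def forward(self, state, goal, action):
        # concatenating the goal with the state/action is what makes the
        # approximator "universal" over goals
        return self.net(torch.cat([state, goal, action], dim=-1))
```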
HER is a data augmentation technique that can accelerate learning in sparse reward tasks.
The state spaces of all layers of the hierarchy are identical.
The action space of the bottom-most layer is identical to the original action space of the task; every other layer's action space is identical to the state space (its actions are subgoal states).
The policy at layer $i$ generates the goal for layer $i-1$.
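A minimal sketch of this recursion, assuming hypothetical `policy`/`select_action` helpers and the old gym `env.step` API (the real implementation also records hindsight and subgoal-testing transitions):

```python
def run_level(self, i_level, state, goal):
    # each level gets at most H attempts to reach its goal
    for _ in range(self.H):
        # a higher level's action is a subgoal state for the level below
        action = self.policy[i_level].select_action(state, goal)
        if i_level > 0:
            # pass the chosen subgoal down as the lower level's goal
            next_state = self.run_level(i_level - 1, state, goal=action)
        else:
            # the bottom level acts in the real environment
            next_state, reward, done, _ = self.env.step(action)
        state = next_state
        if self.check_goal(state, goal, self.threshold):
            break
    return state
```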
This is similar to Hindsight Experience Replay.
To my understanding, this is just a simulation to see whether the hindsight goal can be achieved or not, but their claim and explanation are not very clear to me. More specifically, I did not see how the subgoal test solves the problem of subgoals that cannot be achieved.
```python
def check_goal(self, state, goal, threshold):
    # the goal counts as achieved only if every state dimension is
    # within its per-dimension threshold
    for i in range(self.state_dim):
        if abs(state[i] - goal[i]) > threshold[i]:
            return False
    return True
```
```python
# hindsight action transition
if goal_achieved:
    self.replay_buffer[i_level].add((state, action, 0.0, next_state, goal, 0.0, float(done)))
else:
    # if the agent does not achieve the goal, it is penalized
    self.replay_buffer[i_level].add((state, action, -1.0, next_state, goal, self.gamma, float(done)))

# hindsight goal transition
# the last transition's reward and discount are set to 0
goal_transitions[-1][2] = 0.0  # reward
goal_transitions[-1][5] = 0.0  # discount
for transition in goal_transitions:
    # the final achieved state is substituted as the goal of every transition
    transition[4] = next_state
    self.replay_buffer[i_level].add(tuple(transition))
```
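Each transition here is a mutable list of the form `(state, action, reward, next_state, goal, discount, done)`, so indices 2, 5, and 4 select the reward, discount, and goal respectively; setting the discount of the final transition to 0 marks it as terminal for bootstrapping.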
With some probability, the subgoal test is enabled; when it is, no exploration noise is added.
```python
# if this is a subgoal test, then the next/lower level's goal also has to be a subgoal test
is_next_subgoal_test = is_subgoal_test

# determine whether to test the subgoal (action)
if np.random.random_sample() < self.lamda:
    is_next_subgoal_test = True

# add noise or take a random action if not subgoal testing
if not is_subgoal_test:
    if np.random.random_sample() > 0.2:
        action = action + np.random.normal(0, self.exploration_action_noise)
        action = action.clip(self.action_clip_low, self.action_clip_high)
    else:
        action = np.random.uniform(self.action_clip_low, self.action_clip_high)

# if the subgoal was tested but not achieved, add a subgoal testing transition;
# -self.H is the penalty, and the discount factor is set to 0
if is_next_subgoal_test and not self.check_goal(action, next_state, self.threshold):
    self.replay_buffer[i_level].add((state, action, -self.H, next_state, goal, 0.0, float(done)))
```
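If I understand the penalty correctly: since the per-step reward is at most 0 and each level has horizon $H$, a return of $-H$ is the worst achievable value, so a failed subgoal test pushes the tested subgoal's Q-value below that of any reachable subgoal. This seems to be how the subgoal test discourages proposing unreachable subgoals.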
Scores of 7, 6, and 5 can get this paper in, but they are not a very promising profile. The main complaints are:
Handles the challenge caused by binary, sparse rewards.
By replaying the experience with a different goal (no additional simulation needed), more informative rewards can be generated. This approach differs from reward shaping, though, because it does not require any domain knowledge.
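A minimal sketch of the relabeling, assuming hypothetical `Step`, `buffer.add`, and `compute_reward` helpers and the "final" goal-selection strategy:

```python
from collections import namedtuple

Step = namedtuple("Step", ["state", "action", "next_state"])

def her_final_relabel(episode, buffer, compute_reward):
    """Replay every step with the episode's final achieved state as the goal."""
    achieved_goal = episode[-1].next_state  # the "final" relabeling strategy
    for step in episode:
        # recompute the reward for the substituted goal; no extra
        # environment simulation is needed
        reward = compute_reward(step.next_state, achieved_goal)
        buffer.add((step.state, step.action, reward,
                    step.next_state, achieved_goal))
```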
Prerequisite knowledge: Universal Value Function Approximators (UVFA) and how a replay buffer works.
After reading the example in Section 3.1 and Algorithm 1, one can easily see how this simple approach works.
The reward $r_g$ is defined in Section 4.1.
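In short, it is the sparse reward that is 0 once the goal is achieved and -1 otherwise. A sketch, where `eps` is an assumed tolerance (the paper defines achievement through a predicate on states):

```python
import numpy as np

def r_g(next_state, goal, eps=0.05):
    # binary, sparse reward: 0 once the goal is achieved, -1 otherwise
    return 0.0 if np.linalg.norm(next_state - goal) < eps else -1.0
```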
The goal does not have to be set in the state space; it can also be set in a lower-dimensional representation space.
The high-level policy is a PDDL-based planner, which requires domain knowledge.