DeepSynth: Automata Synthesis for Automatic Task Segmentation in Deep Reinforcement Learning


Main Idea

This paper synthesized an automaton for the Montezuma's Revenge with collected trace with approach proposed here. The states in traces are images, thus they extract the objects from images and use the overlapping between objects to represent the state of DFA.



After synthesizing the DFA, they use the DFA to augment the Q value (some proof and rephrasing from RL perspective are needed here. Will the optimal reward with such policy be the best policy? This breaks the MC property.) Since the DFA can provide richer reward information, thus better training results are expected.

Some other benchmarks are provided in their code.

Some thoughts

There are some points worth exploring:

  1. The DFA is not learning, it just uses the previous data collected in one run (code). Thus a learnable structure may provide better benefits.
  2. After getting the DFA, the paper mentioned that a verifier is used to check whether all the transitions satisfy the DFA. Why all transitions? Some transitions must be fail to explore the space. This part needs some rephrasing. I prefer reconstructing this idea as finding some high reward traces which do not have any counter-example for this DFA.
  3. The image segmentation is done by hand (code), which is hard to generalize to different tasks. Some algorithm to auto generate such segmentation will be a great step. I will consider using autoencoder or attention here.
  4. Richer temporal expression. DFA to LTL?
  5. Pre-knowledge to encode DFA.
  6. LSTM/transformer to handle the temporal relation.
  7. Instead of simply enhancing the Q value for reward shaping, considering attention.
  8. A related paper link.