This paper synthesizes an automaton for Montezuma's Revenge from collected traces, using the approach proposed here. Since the states in the traces are raw images, they extract objects from each image and use the overlap relations between objects to represent the state of the DFA.
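To make the object-overlap idea concrete, here is a minimal sketch of turning detected objects into DFA events. The axis-aligned bounding-box test, the `(x, y, w, h)` box format, and the names `overlaps` and `frame_to_events` are my own illustrative assumptions, not the paper's code (the object detector itself is not shown):

```python
# Sketch: map detected objects to overlap propositions that drive DFA
# transitions. Boxes are (x, y, w, h); the detector is assumed elsewhere.
from itertools import combinations

def overlaps(a, b):
    """Axis-aligned bounding-box intersection test."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def frame_to_events(objects):
    """Map {label: box} detections to the set of overlap propositions
    labeling a DFA transition, e.g. {('agent', 'key')}."""
    return {
        tuple(sorted((la, lb)))
        for (la, a), (lb, b) in combinations(objects.items(), 2)
        if overlaps(a, b)
    }

# Example: the agent's box touches the key's box, so the event fires.
print(frame_to_events({"agent": (10, 10, 8, 8), "key": (14, 12, 4, 4)}))
# {('agent', 'key')}
```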
After synthesizing the DFA, they use it to augment the Q value. (Some proof and a rephrasing from the RL perspective are needed here: is the policy that is optimal under the DFA-augmented reward also optimal for the original task? Making the reward depend on the automaton's history breaks the Markov property of the raw observations.) Since the DFA provides richer reward information, better training results are expected.
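For reference, the standard answer in the reward-machine literature is to learn over the cross product of environment state and DFA state, which restores the Markov property. Below is a hedged tabular sketch of that product-MDP idea; `env`, `dfa`, and `events` are assumed interfaces (states assumed hashable), not the paper's actual API, and the paper's exact augmentation may differ:

```python
# Sketch: Q-learning over the joint (env state, DFA state), with DFA
# transitions contributing extra reward. All interfaces are assumptions.
import random
from collections import defaultdict

def q_learning(env, dfa, events, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # keyed by ((env_state, dfa_state), action)
    for _ in range(episodes):
        s, u = env.reset(), dfa.initial_state
        done = False
        while not done:
            # epsilon-greedy over the augmented state (s, u)
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[((s, u), a_)])
            s2, r, done = env.step(a)
            u2 = dfa.step(u, events(s2))   # advance the automaton on events
            r += dfa.reward(u, u2)         # extra reward from DFA progress
            target = 0.0 if done else max(Q[((s2, u2), a2)] for a2 in env.actions)
            Q[((s, u), a)] += alpha * (r + gamma * target - Q[((s, u), a)])
            s, u = s2, u2
    return Q
```

Because the update conditions on the pair (s, u) rather than on s alone, the augmented process is Markov again, which is the usual argument for why such policies can still be optimal for the original task.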
Some other benchmarks are provided in their code.
There are some points worth exploring: