Tiger Tutorial: Solving POMDPs

This tutorial outlines how to define a POMDP using the POMDPs.jl interface. We will work through a simple problem known as the tiger problem (we will refer to it as the tiger POMDP). After defining the tiger POMDP, we will use QMDP and SARSOP to solve it. If you are new to working with this package, check out the tutorial on MDPs first.

Dependencies

You need to install a few modules in order to use this notebook. If you have all the modules below installed, great! If not, run the following commands:

# install the POMDPs.jl interface
Pkg.clone("https://github.com/sisl/POMDPs.jl.git")

using POMDPs

# install the SARSOP wrapper
POMDPs.add("SARSOP") # note this downloads and builds the APPL toolit and may take a few minutes 

# install the QMDP solver
POMDPs.add("QMDP")

# install a helper module
POMDPs.add("POMDPToolbox") # this provides implementations of discrete belief updating

# install a Julia package for working with distributions
Pkg.add("Distributions")

If you already have all of the modules above, make sure you have the most recent versions. Many of these are still under heavy development, so update before starting by running

Pkg.update()

Problem Overview

In the tiger POMDP, the agent is tasked with escaping from a room. There are two doors leading out of the room. Behind one of the doors is a tiger, and behind the other is sweet, sweet freedom. If the agent opens the door and finds the tiger, it gets eaten (and receives a reward of -100). If the agent opens the other door, it escapes and receives a reward of 10. The agent can also listen. Listening gives a noisy measurement of which door the tiger is hiding behind. Listening gives the agent the correct location of the tiger 85% of the time. The agent receives a reward of -1 for listening.

In [1]:
# first import POMDPs.jl
using POMDPs

POMDP

A POMDP is defined by the tuple $$(\mathcal{S}, \mathcal{A}, \mathcal{Z}, T, R, O).$$ In addition to the familiar state space $\mathcal{S}$ and action space $\mathcal{A}$, we must also define an observation space $\mathcal{Z}$ and an observation function $O$. The POMDP problem definition is very similar to that of an MDP. For example, if you want to add state uncertainty to an existing problem, you only need to define the observation space and observation function on top of your previous MDP definition.
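
As a concrete example (using the 85% listening accuracy from the problem overview), the observation function for the listen action can be written as $$O(z \mid s', a = \text{listen}) = \begin{cases} 0.85 & \text{if } z \text{ matches the tiger's location } s', \\ 0.15 & \text{otherwise.} \end{cases}$$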

Before defining the spaces for this problem, let's first define the concrete type for the tiger POMDP.

In [2]:
type TigerPOMDP <: POMDP{Bool, Symbol, Bool} # POMDP{State, Action, Observation}, parameterized by Bool, Symbol, and Bool
    r_listen::Float64 # reward for listening (default -1)
    r_findtiger::Float64 # reward for finding the tiger (default -100)
    r_escapetiger::Float64 # reward for escaping (default 10)
    p_listen_correctly::Float64 # prob of correctly listening (default 0.85)
    discount_factor::Float64 # discount
end
# default constructor
function TigerPOMDP()
    return TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95)
end;

There are a number of parameters in the problem definition, but we can treat them all as constants. You can read more about the tiger problem and POMDPs here. The default constructor above allows us to initialize the tiger POMDP by simply running:

In [3]:
pomdp = TigerPOMDP()
Out[3]:
TigerPOMDP(-1.0,-100.0,10.0,0.85,0.95)

Note that the TigerPOMDP type inherits from POMDP{Bool, Symbol, Bool}. This means that in our problem we will use Bool to represent our states, Symbol to represent our actions, and Bool to represent our observations. More details on the states, actions, and observations in this problem are below.

States

We define our state with a boolean that indicates whether or not the tiger is hiding behind the left door. If our state is true, the tiger is behind the left door. If it is false, the tiger is behind the right door.

In [4]:
example_state = false # tiger is hiding behind right door
Out[4]:
false

Since the state is a binary value, we represent it as a boolean, but we could have represented it as an integer or any other sensible type.

Actions

There are three possible actions our agent can take: open the left door, open the right door, and listen. For clarity, we will represent these with symbols.

In [5]:
example_action = :listen # agent listens, can be :openl or :openr
Out[5]:
:listen

We will represent our actions with the following symbols: open left (:openl), open right (:openr), and listen (:listen). For example, the action below represents listening:

In [6]:
a = :listen
Out[6]:
:listen

Observations

There are two possible observations: the agent either hears the tiger behind the left door or behind the right door. We use a boolean to represent the observations:

In [7]:
example_observation = true # agent heard the tiger behind the left door
Out[7]:
true

Spaces

Let's define our state, action and observation spaces.

State Space

There are only two states in the tiger POMDP: the tiger is either behind the left door or behind the right door. Our state space is simply an array of the states in the problem. Recall that the states function returns the state space for a given POMDP type.

In [8]:
POMDPs.states(::TigerPOMDP) = [true, false];

Action Space

There are three actions in our problem. Once again, we represent the action space as an array of the actions in our problem. The actions function serves a similar purpose to the states function above. Since the action space is discrete, we can define the action_index function that associates an integer index with each action.

In [9]:
POMDPs.actions(::TigerPOMDP) = [:openl, :openr, :listen] # default
POMDPs.actions(pomdp::TigerPOMDP, state::Bool) = POMDPs.actions(pomdp) # convenience (actions do not change in different states)
function POMDPs.action_index(::TigerPOMDP, a::Symbol)
    if a==:openl
        return 1
    elseif a==:openr
        return 2
    elseif a==:listen
        return 3
    end
    error("invalid TigerPOMDP action: $a")
end;
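
As a quick illustrative check (not part of the original notebook), the indices line up with the order of the actions returned by actions:

# print the index associated with each action
for a in POMDPs.actions(pomdp)
    println(a, " => ", POMDPs.action_index(pomdp, a))
end
# openl => 1, openr => 2, listen => 3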

Observation Space

The observation space looks similar to the state space. Recall that the state represents the truth about our system, while the observation is the potentially inaccurate information the agent receives about the state. In the tiger POMDP, the observation may give us a false impression of the state.

In [10]:
# function returning observation space
POMDPs.observations(::TigerPOMDP) = [true, false];
POMDPs.observations(pomdp::TigerPOMDP, s::Bool) = observations(pomdp);

Now that we've defined the POMDP spaces, let's move on to defining the model functions.

Transition and Observation Distributions

Before defining the model functions, we first need to create a distribution data type. In general, our distributions should support sampling (rand) and evaluating probabilities (pdf). If you only want to get a policy from the SARSOP and QMDP solvers, you do not need to worry about implementing a sampling function. However, if you want to simulate the policy, you should implement both.

Since the transition and observation distributions have identical form, we could just use a single type to serve the needs of both. This will not be the case in general.

In [11]:
# distribution type that will be used for both transitions and observations
type TigerDistribution
    p::Float64
    it::Vector{Bool}
end
TigerDistribution() = TigerDistribution(0.5, [true, false])
POMDPs.iterator(d::TigerDistribution) = d.it

We now define the pdf function. For a discrete problem, this function returns the probability of a given element (state or observation in our case).

In [12]:
# transition and observation pdf
function POMDPs.pdf(d::TigerDistribution, so::Bool)
    so ? (return d.p) : (return 1.0-d.p)
end;
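
As a quick sanity check (not part of the original notebook), the probabilities of the two outcomes should sum to one:

# evaluate the pdf at both outcomes
d = TigerDistribution(0.85, [true, false])
POMDPs.pdf(d, true)  # 0.85
POMDPs.pdf(d, false) # 0.15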

Finally, let's create the sampling functions.

In [13]:
# samples from transition or observation distribution
POMDPs.rand(rng::AbstractRNG, d::TigerDistribution) = rand(rng) <= d.p;

This is all we have to do for our distribution functions!
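
If you want to convince yourself that rand is consistent with pdf, here is a rough Monte Carlo check (not part of the original notebook):

# draw many samples and compare the empirical frequency to d.p
rng = MersenneTwister(0)
d = TigerDistribution(0.85, [true, false])
samples = [rand(rng, d) for i in 1:100_000]
mean(samples) # should be close to 0.85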

Transition Model

Here we define the transition model for the tiger POMDP. The model itself is fairly simple. Our state is represented by the location of the tiger (left or right). The location of the tiger doesn't change when the agent listens. However, after the agent opens a door, it reaches a terminal state: the agent either escapes or gets eaten. To simplify our formulation, we simply reset the location of the tiger randomly. We could instead model this problem with an explicit terminal state (i.e. one in which the agent no longer receives reward).

In [14]:
# Resets the problem after opening a door; does nothing after listening
function POMDPs.transition(pomdp::TigerPOMDP, s::Bool, a::Symbol)
    d = TigerDistribution()
    if a == :openl || a == :openr
        d.p = 0.5
    elseif s
        d.p = 1.0
    else
        d.p = 0.0
    end
    d
end;
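
As an illustrative check (not part of the original notebook), listening leaves the tiger where it is, while opening a door resets its location to a fair coin flip:

d = POMDPs.transition(pomdp, true, :listen)
POMDPs.pdf(d, true) # 1.0 - the tiger stays behind the left door
d = POMDPs.transition(pomdp, true, :openl)
POMDPs.pdf(d, true) # 0.5 - the tiger's location is reset at random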

Reward Model

The reward model captures the immediate objectives of the agent. It receives a large negative reward for opening the door with the tiger behind it (-100), a positive reward for opening the other door (+10), and a small penalty for listening (-1).

In [15]:
# reward model
function POMDPs.reward(pomdp::TigerPOMDP, s::Bool, a::Symbol)
    r = 0.0
    a == :listen ? (r+=pomdp.r_listen) : (nothing)
    if a == :openl
        s ? (r += pomdp.r_findtiger) : (r += pomdp.r_escapetiger)
    end
    if a == :openr
        s ? (r += pomdp.r_escapetiger) : (r += pomdp.r_findtiger)
    end
    return r
end
POMDPs.reward(pomdp::TigerPOMDP, s::Bool, a::Symbol, sp::Bool) = reward(pomdp, s, a);
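
To see the whole reward model laid out, this illustrative loop (not part of the original notebook) prints the reward for every state-action pair:

# enumerate rewards over all state-action pairs
for s in POMDPs.states(pomdp), a in POMDPs.actions(pomdp)
    println("s = $s, a = $a, r = $(POMDPs.reward(pomdp, s, a))")
end
# e.g. when s = true (tiger on the left): :openl gives -100.0, :openr gives 10.0, :listen gives -1.0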

Observation Model

The observation model captures the uncertainty in the agent's listening ability. When we listen, we receive a noisy measurement of the tiger's location.

In [16]:
# observation model
function POMDPs.observation(pomdp::TigerPOMDP, a::Symbol, s::Bool)
    d = TigerDistribution()
    pc = pomdp.p_listen_correctly
    if a == :listen
        s ? (d.p = pc) : (d.p = 1.0-pc)
    else
        d.p = 0.5
    end
    d
end;
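
An illustrative check (not part of the original notebook): listening reports the correct side with probability 0.85, while opening a door is uninformative:

d = POMDPs.observation(pomdp, :listen, true)
POMDPs.pdf(d, true) # 0.85 - hear the tiger behind the correct (left) door
d = POMDPs.observation(pomdp, :openl, true)
POMDPs.pdf(d, true) # 0.5 - opening a door gives no information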

Miscellaneous Functions

Let's define the discount function and the functions that return the size of our spaces.

In [17]:
POMDPs.discount(pomdp::TigerPOMDP) = pomdp.discount_factor
POMDPs.n_states(::TigerPOMDP) = 2
POMDPs.n_actions(::TigerPOMDP) = 3
POMDPs.n_observations(::TigerPOMDP) = 2;
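
A quick illustrative consistency check (not part of the original notebook) is that these sizes match the spaces defined earlier:

POMDPs.n_states(pomdp) == length(POMDPs.states(pomdp)) # true
POMDPs.n_actions(pomdp) == length(POMDPs.actions(pomdp)) # true
POMDPs.n_observations(pomdp) == length(POMDPs.observations(pomdp)) # true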

Beliefs

If you are somewhat familiar with Julia, defining all of the above may have been relatively simple. However, POMDP policies act on a belief over states rather than on the state directly, and implementing beliefs and their updaters can be tricky. Luckily, our solvers abstract away the belief updating. All you need to do is define a function that returns an initial distribution over states. This distribution needs to support the pdf and rand functions. We already defined a distribution like that, so our job here is simple!

In [18]:
POMDPs.initial_state_distribution(pomdp::TigerPOMDP) = TigerDistribution(0.5, [true, false]);

If you are interested in creating your own beliefs and update schemes, check out the POMDPToolbox module, which implements a number of beliefs and update schemes.
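
For reference, the belief update performed by a discrete Bayesian filter (such as the discrete updaters in POMDPToolbox) is, up to normalization, $$b'(s') \propto O(z \mid s', a) \sum_{s \in \mathcal{S}} T(s' \mid s, a)\, b(s),$$ which combines the transition and observation models we defined above with the previous belief $b$, the action $a$ taken, and the observation $z$ received.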

SARSOP Solver

Let's play around with the SARSOP.jl solver. The module we provide is a wrapper for the SARSOP backend. You can find more information about it here.

In [19]:
using SARSOP # load the module
# initialize our tiger POMDP
pomdp = TigerPOMDP()

# initialize the solver
solver = SARSOPSolver()
# run the solve function
policy = solve(solver, pomdp)
INFO: Recompiling stale cache file /home/zach/.julia/lib/v0.5/LightXML.ji for module LightXML.
Generating a pomdpx file: model.pomdpx

Loading the model ...
  input file   : model.pomdpx
  loading time : 0.00s 

SARSOP initializing ...
  initialization time : 0.00s

-------------------------------------------------------------------------------
 Time   |#Trial |#Backup |LBound    |UBound    |Precision  |#Alphas |#Beliefs  
-------------------------------------------------------------------------------
 0       0       0        -20        92.8206    112.821     3        1        
 0       2       51       -6.2981    63.1396    69.4377     7        16       
 0       4       103      0.149651   52.2764    52.1268     9        21       
 0       6       151      6.19248    42.0546    35.8621     9        21       
 0       8       200      10.3563    35.232     24.8757     12       21       
 0       11      250      14.0433    29.5471    15.5037     6        21       
 0       14      300      16.545     25.0926    8.54759     10       21       
 0       17      350      18.2281    21.8163    3.5882      14       21       
 0       18      400      18.7451    20.9384    2.19328     8        21       
 0       21      465      19.1109    20.0218    0.910956    5        21       
 0       22      500      19.2369    19.7071    0.470219    11       21       
 0       24      550      19.3036    19.5405    0.236865    6        21       
 0.01    25      600      19.3369    19.4574    0.120445    13       21       
 0.01    27      669      19.3579    19.4049    0.0469305   5        21       
 0.01    28      713      19.3643    19.389     0.024739    5        21       
 0.01    29      757      19.3676    19.3807    0.0130409   5        21       
 0.01    30      801      19.3694    19.3763    0.0068744   5        21       
 0.01    31      850      19.3704    19.3739    0.00351433  10       21       
 0.01    32      900      19.3709    19.3725    0.00155165  5        21       
 0.01    33      936      19.3711    19.3721    0.000976551 8        21       
-------------------------------------------------------------------------------

SARSOP finishing ...
  target precision reached
  target precision  : 0.001000
  precision reached : 0.000977 

-------------------------------------------------------------------------------
 Time   |#Trial |#Backup |LBound    |UBound    |Precision  |#Alphas |#Beliefs  
-------------------------------------------------------------------------------
 0.01    33      936      19.3711    19.3721    0.000976551 5        21       
-------------------------------------------------------------------------------

Writing out policy ...
  output file : out.policy

Out[19]:
SARSOP.POMDPPolicy("out.policy",POMDPXFiles.POMDPAlphas([-81.5975 3.01448 … 28.4025 19.3711; 28.4025 24.6954 … -81.5975 19.3711],[1,2,2,0,2]),TigerPOMDP(-1.0,-100.0,10.0,0.85,0.95),Any[:openl,:openr,:listen])
In [20]:
# we can retrieve the alpha vectors by calling
alphas(policy)
Out[20]:
2×5 Array{Float64,2}:
 -81.5975   3.01448  24.6954    28.4025  19.3711
  28.4025  24.6954    3.01452  -81.5975  19.3711

Let's now see how our policy changes with the belief.

In [21]:
# use POMDPToolbox for beliefs
using POMDPToolbox

# let's initialize the beliefs
b = DiscreteBelief(2) # initial uniform belief over the two states
Out[21]:
POMDPToolbox.DiscreteBelief([0.5,0.5])
In [22]:
a = action(policy, b) # the action recommended by the SARSOP policy for this belief
Out[22]:
:listen

Let's simulate our policy. We'll use POMDPToolbox to do the simulation. As mentioned earlier, in a POMDP the decision is based on a belief. However, each policy (provided by the solver modules) implements its own belief updating scheme, so you do not need to worry about dealing with beliefs yourself. The only thing you need to do is define an initial_state_distribution.

In [35]:
using POMDPToolbox # for simulation

pomdp = TigerPOMDP() # initialize problem
init_dist = initial_state_distribution(pomdp) # initial distribution over states

up = updater(policy) # belief updater for our policy, SARSOP uses a discrete Bayesian filter
hr = HistoryRecorder(max_steps=14, rng=MersenneTwister(1)) # history recorder that keeps track of states, observations and beliefs

hist = simulate(hr, pomdp, policy, up, init_dist)

for (s, b, a, r, sp, op) in hist
    println("s: $s, b: $(b.b), action: $a, obs: $op")
end
println("Total reward: $(discounted_reward(hist))")
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: false
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: false
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
Total reward: 21.136977555064835

Notice that over the first nine time steps the policy is fairly simple: the agent listens twice and then opens the door it believes is safe. Around the eleventh step, however, it receives a conflicting observation that pushes its belief back toward uniform, so it has to listen a few extra times before it is confident enough to open a door again.

QMDP Solver

We will briefly go over the QMDP.jl solver. Use QMDP with a word of caution: QMDP assumes that all state uncertainty disappears in the next time step. This can lead to poor policies in problems with information-gathering actions. For example, in the tiger POMDP listening is an information-gathering action, and the resulting QMDP policy is quite poor. However, QMDP can work very well in problems where the state uncertainty is not affected by the agent's actions (for example, systems with noisy sensor measurements).
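
For intuition, QMDP computes one alpha vector per action from the value function of the underlying MDP, roughly $$\alpha_a(s) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \max_{a'} \alpha_{a'}(s'),$$ and then chooses $\arg\max_a \sum_s \alpha_a(s)\, b(s)$ for a belief $b$. Because the observation model never enters this recursion, actions whose only benefit is gathering information (like listening) tend to be undervalued.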

In [36]:
using QMDP

# initialize the solver
# keyword args are the maximum number of iterations the solver will run for and the Bellman residual tolerance
solver = QMDPSolver(max_iterations=50, tolerance=1e-3)

# run the solver
qmdp_policy = solve(solver, pomdp, verbose=true)
Iteration : 1, residual: 14.75, iteration run-time: 8.34e-6, total run-time: 8.34e-6
Iteration : 2, residual: 12.59046875, iteration run-time: 6.069e-6, total run-time: 1.4409e-5
Iteration : 3, residual: 11.564691406249999, iteration run-time: 3.641e-6, total run-time: 1.805e-5
Iteration : 4, residual: 10.943236428222654, iteration run-time: 2.783e-6, total run-time: 2.0833000000000002e-5
Iteration : 5, residual: 10.2558588273941, iteration run-time: 2.468e-6, total run-time: 2.3301000000000003e-5
Iteration : 6, residual: 9.587976314837448, iteration run-time: 2.19e-6, total run-time: 2.5491000000000004e-5
Iteration : 7, residual: 8.957886507199987, iteration run-time: 2.245e-6, total run-time: 2.7736000000000003e-5
Iteration : 8, residual: 8.367828168991792, iteration run-time: 2.342e-6, total run-time: 3.0078000000000002e-5
Iteration : 9, residual: 7.816304847983972, iteration run-time: 2.12e-6, total run-time: 3.2198000000000006e-5
Iteration : 10, residual: 7.301052156282381, iteration run-time: 2.551e-6, total run-time: 3.4749e-5
Iteration : 11, residual: 6.8197456599030915, iteration run-time: 2.122e-6, total run-time: 3.6871e-5
Iteration : 12, residual: 6.370163598662359, iteration run-time: 2.45e-6, total run-time: 3.9321e-5
Iteration : 13, residual: 5.9502184661618, iteration run-time: 2.276e-6, total run-time: 4.1597e-5
Iteration : 14, residual: 5.557957422333274, iteration run-time: 2.509e-6, total run-time: 4.4106e-5
Iteration : 15, residual: 5.191555653202798, iteration run-time: 2.468e-6, total run-time: 4.6574e-5
Iteration : 16, residual: 4.849308471382614, iteration run-time: 2.298e-6, total run-time: 4.8871999999999996e-5
Iteration : 17, residual: 4.529623527415282, iteration run-time: 2.41e-6, total run-time: 5.1281999999999995e-5
Iteration : 18, residual: 4.231013435561891, iteration run-time: 2.328e-6, total run-time: 5.361e-5
Iteration : 19, residual: 3.9520888618093863, iteration run-time: 2.649e-6, total run-time: 5.6259e-5
Iteration : 20, residual: 3.6915520617660036, iteration run-time: 2.39e-6, total run-time: 5.8649e-5
Iteration : 21, residual: 3.448190843167879, iteration run-time: 2.318e-6, total run-time: 6.0967e-5
Iteration : 22, residual: 3.2208729260632936, iteration run-time: 2.229e-6, total run-time: 6.3196e-5
Iteration : 23, residual: 3.00854067471343, iteration run-time: 2.378e-6, total run-time: 6.557399999999999e-5
Iteration : 24, residual: 2.8102061767669397, iteration run-time: 2.307e-6, total run-time: 6.7881e-5
Iteration : 25, residual: 2.624946646829443, iteration run-time: 2.243e-6, total run-time: 7.0124e-5
Iteration : 26, residual: 2.451900133045797, iteration run-time: 2.205e-6, total run-time: 7.232899999999999e-5
Iteration : 27, residual: 2.290261506721066, iteration run-time: 2.217e-6, total run-time: 7.454599999999999e-5
Iteration : 28, residual: 2.13927871632049, iteration run-time: 2.267e-6, total run-time: 7.681299999999998e-5
Iteration : 29, residual: 1.9982492884203396, iteration run-time: 2.265e-6, total run-time: 7.907799999999999e-5
Iteration : 30, residual: 1.866517059329368, iteration run-time: 1.95e-6, total run-time: 8.1028e-5
Iteration : 31, residual: 1.7434691221742469, iteration run-time: 2.066e-6, total run-time: 8.3094e-5
Iteration : 32, residual: 1.628532975244866, iteration run-time: 2.013e-6, total run-time: 8.5107e-5
Iteration : 33, residual: 1.5211738583317356, iteration run-time: 2.138e-6, total run-time: 8.7245e-5
Iteration : 34, residual: 1.4208922646616031, iteration run-time: 2.247e-6, total run-time: 8.949199999999999e-5
Iteration : 35, residual: 1.32722161685669, iteration run-time: 2.317e-6, total run-time: 9.180899999999999e-5
Iteration : 36, residual: 1.2397260961028849, iteration run-time: 2.275e-6, total run-time: 9.408399999999999e-5
Iteration : 37, residual: 1.1579986144276404, iteration run-time: 2.14e-6, total run-time: 9.622399999999999e-5
Iteration : 38, residual: 1.0816589206533251, iteration run-time: 2.149e-6, total run-time: 9.837299999999999e-5
Iteration : 39, residual: 1.0103518312128017, iteration run-time: 2.246e-6, total run-time: 0.000100619
Iteration : 40, residual: 0.9437455775971557, iteration run-time: 2.099e-6, total run-time: 0.00010271799999999999
Iteration : 41, residual: 0.8815302627452866, iteration run-time: 2.205e-6, total run-time: 0.00010492299999999998
Iteration : 42, residual: 0.8234164191945297, iteration run-time: 2.075e-6, total run-time: 0.00010699799999999998
Iteration : 43, residual: 0.7691336622836786, iteration run-time: 2.305e-6, total run-time: 0.00010930299999999998
Iteration : 44, residual: 0.7184294321414484, iteration run-time: 2.111e-6, total run-time: 0.00011141399999999997
Iteration : 45, residual: 0.6710678186085772, iteration run-time: 2.229e-6, total run-time: 0.00011364299999999997
Iteration : 46, residual: 0.6268284636247756, iteration run-time: 2.262e-6, total run-time: 0.00011590499999999997
Iteration : 47, residual: 0.5855055359753294, iteration run-time: 2.193e-6, total run-time: 0.00011809799999999997
Iteration : 48, residual: 0.5469067736256363, iteration run-time: 2.083e-6, total run-time: 0.00012018099999999997
Iteration : 49, residual: 0.5108525891891986, iteration run-time: 2.038e-6, total run-time: 0.00012221899999999996
Iteration : 50, residual: 0.4771752343663138, iteration run-time: 2.022e-6, total run-time: 0.00012424099999999995
WARNING: using QMDP.create_policy in module Main conflicts with an existing identifier.
Out[36]:
QMDP.QMDPPolicy([193.239 83.2389 182.124; 83.4656 193.466 182.354],Any[:openl,:openr,:listen],TigerPOMDP(-1.0,-100.0,10.0,0.85,0.95))
In [37]:
qmdp_policy.alphas
Out[37]:
2×3 Array{Float64,2}:
 193.239    83.2389  182.124
  83.4656  193.466   182.354

Notice that these alpha vectors differ from those computed by SARSOP. Let's see how the policy looks in simulation. We'll use the same procedure to simulate the QMDP policy.

In [39]:
pomdp = TigerPOMDP() # initialize problem
init_dist = initial_state_distribution(pomdp) # initial distribution over states

up = updater(qmdp_policy) # belief updater for our policy
hr = HistoryRecorder(max_steps=14, rng=MersenneTwister(1)) # history recorder that keeps track of states, observations and beliefs

hist = simulate(hr, pomdp, qmdp_policy, up, init_dist)

for (s, b, a, r, sp, op) in hist
    println("s: $s, b: $(b.b), action: $a, obs: $op")
end
println("Total reward: $(discounted_reward(hist))")
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: false
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: false
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
Total reward: 21.136977555064835

The results are identical for this particular simulation of this problem. In general, if you are dealing with a more complex problem, it is good to compare the SARSOP and QMDP policies. This framework makes comparing the two policies very simple once you have defined the problem!
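
For example, a rough sketch of such a comparison (reusing only the HistoryRecorder and discounted_reward calls shown above; the seeds and step count are arbitrary) might look like this:

# average the discounted reward of each policy over a few random seeds
for (name, pol) in [("SARSOP", policy), ("QMDP", qmdp_policy)]
    total = 0.0
    for seed in 1:5
        hr = HistoryRecorder(max_steps=14, rng=MersenneTwister(seed))
        h = simulate(hr, pomdp, pol, updater(pol), initial_state_distribution(pomdp))
        total += discounted_reward(h)
    end
    println("$name mean discounted reward: $(total/5)")
end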