# Tiger Tutorial: Solving POMDPs¶

This tutorial outlines how to define a POMDP using the POMDPs.jl interface. We will go through a simple problem simply known as the tiger problem (we will refer to it as the tiger POMDP). After defining the tiger POMDP, we will use QMDP and SARSOP to solve the POMDP. If you are new to working with this package, check out the tutorial on MDPs first.

## Dependencies¶

You need to install a few modules in order to use this notebook. If you have all the modules below installed, great! If not run the following commands:

# install the POMDPs.jl interface
Pkg.clone("https://github.com/sisl/POMDPs.jl.git")

using POMDPs

# install the SARSOP wrapper

# install the QMDP solver

# install a helper modules
POMDPs.add("POMDPToolbox") # this provides implementations of discrete belief updating

# install a Julia packages for working with distributions


If you already have all of the modules above, make sure you have the most recent versions. Many of these are still under heavy development, so update before starting by running

Pkg.update()


## Problem Overview¶

In the tiger POMDP, the agent is tasked with escaping from a room. There are two doors leading out of the room. Behind one of the doors is a tiger, and behind the other is sweet, sweet freedom. If the agent opens the door and finds the tiger, it gets eaten (and receives a reward of -100). If the agent opens the other door, it escapes and receives a reward of 10. The agent can also listen. Listening gives a noisy measurement of which door the tiger is hiding behind. Listening gives the agent the correct location of the tiger 85% of the time. The agent receives a reward of -1 for listening.

In [1]:
# first import POMDPs.jl
using POMDPs


## POMDP¶

A POMDP is defined by the tuple $$(\mathcal{S}, \mathcal{A}, \mathcal{Z}, T, R, O).$$ In addition to the familiar, state $\mathcal{S}$ and action $\mathcal{A}$ spaces, we must also define an observation space $\mathcal{Z}$ and an observation function $O$. The POMDP problem definition may be similar to the one for MDP. For example, if you wanted to add state uncertainty to your problem, you can define the observation space, and observation function in addition to your previous MDP definition.

Before defining the spaces for this problem, let's first define the concrete type for the tiger POMDP.

In [2]:
type TigerPOMDP <: POMDP{Bool, Symbol, Bool} # POMDP{State, Action, Observation} all parametarized by Int64s
r_listen::Float64 # reward for listening (default -1)
r_findtiger::Float64 # reward for finding the tiger (default -100)
r_escapetiger::Float64 # reward for escaping (default 10)
p_listen_correctly::Float64 # prob of correctly listening (default 0.85)
discount_factor::Float64 # discount
end
# default constructor
function TigerPOMDP()
return TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95)
end;


There are a number of parameters in the problem definition, but we can treat them all as constants. You can read more about the Tiger problem and POMDPs here. However, we created a default constructor that allows us to initialize the tiger POMDP by simply running:

In [3]:
pomdp = TigerPOMDP()

Out[3]:
TigerPOMDP(-1.0,-100.0,10.0,0.85,0.95)

Note that the TigerPOMDP type inherits from a POMDP{Bool, Symbol, Bool}. This means that in our problem we will use Bool to represent our states, Symbol to represent our actions and Bool to represent our observations. More details on states, actions and observations in this problem are below.

## States¶

We define our state with a boolean that indicates weather or not the tiger is hiding behind the left door. If our state is true, the tiger is behind the left door. If its false, the tiger is behind the right door.

In [4]:
example_state = false # tiger is hiding behind right door

Out[4]:
false

Since the state is a binary value, we represent it as a boolean, but we could have represented it as an integer or any other sensible type.

## Actions¶

There are three possible actions our agent can take: open the left door, open the right door, and listen. For clarity, we will represent these with symbols.

In [5]:
example_action = :listen # agent listens, can be :openl or :openr

Out[5]:
:listen

We will represent our actions with the following symbols: open left (:openl), open right (:openr), and listen (:listen). For example, the action below represnts listening:

In [6]:
a = :listen

Out[6]:
:listen

## Observations¶

There are two possible observations: the agent either hears the tiger behind the left door or behind the right door. We use a boolean to represent the observations:

In [7]:
example_observation = true # agent heard the tiger behind the left door

Out[7]:
true

## Spaces¶

Let's define our state, action and observation spaces.

### State Space¶

There are only two states in the tiger POMDP: the tiger is either behind the left door or behind the right door. Our state space is simply an array of the states in the problem. Recall, that the states function returns the state space for a given POMDP type.

In [8]:
POMDPs.states(::TigerPOMDP) = [true, false];


### Action Space¶

There are three actions in our problem. Once again, we represent the action space as an array of the actions in our problem. The actions function serve a similar purpose to the states function above. Since the action space is discrete, we can define the action_index function that associates an integer index to each action.

In [9]:
POMDPs.actions(::TigerPOMDP) = [:openl, :openr, :listen] # default
POMDPs.actions(pomdp::TigerPOMDP, state::Bool) = POMDPs.actions(pomdp) # convenience (actions do not change in different states)
function POMDPs.action_index(::TigerPOMDP, a::Symbol)
if a==:openl
return 1
elseif a==:openr
return 2
elseif a==:listen
return 3
end
error("invalid TigerPOMDP action: $a") end;  ### Observation Space¶ The observation space looks similar to the state space. Recall that the state represents the truth about our system, while the observation is potentially false information recieves about the state. In the tiger POMDP, our observation could give us a false representation of our state. In [10]: # function returning observation space POMDPs.observations(::TigerPOMDP) = [true, false]; POMDPs.observations(pomdp::TigerPOMDP, s::Bool) = observations(pomdp);  Now that we've defined the POMDP spaces, let's move on to defining the model functions. ## Transition and Observation Distributions¶ Before defining the model functions, we first need to create a distributions data-type. In general, our distributions should support sampling and have a pdf method. If you only want to get a policy from the SARSOP and QMDP solvers, you do not need to worry about implementing a sampling function. However, if you want to simulate the policy, you should implement these functions. Since the transition and observation distributions have identical form, we could just use a single type to serve the needs of both. This will not be the case in general. In [11]: # distribution type that will be used for both transitions and observations type TigerDistribution p::Float64 it::Vector{Bool} end TigerDistribution() = TigerDistribution(0.5, [true, false]) POMDPs.iterator(d::TigerDistribution) = d.it  We now define the pdf function. For a discrete problem, this function returns the probability of a given element (state or observation in our case). In [12]: # transition and observation pdf function POMDPs.pdf(d::TigerDistribution, so::Bool) so ? (return d.p) : (return 1.0-d.p) end;  Finally, let's create the sampling functions. In [13]: # samples from transition or observation distribution POMDPs.rand(rng::AbstractRNG, d::TigerDistribution) = rand(rng) <= d.p;  This is all we have to do for our distribution functions! ## Transition Model¶ Here we define the transition model for the tiger POMDP. The model itself is fairly simple. Our state is represented by the location of the tiger (left or right). The location of the tiger doesn't change when the agent listens. However, after the agent opens the door, it reaches a terminal state. That is the agent either escapes or gets eaten. To simplify our formulation, we simply reset the location of the tiger randomly. We could model this problem with a terminal state (i.e. one in which the agent no longer receives reward) as well. In [14]: # Resets the problem after opening door; does nothing after listening function POMDPs.transition(pomdp::TigerPOMDP, s::Bool, a::Symbol) d = TigerDistribution() if a == :openl || a == :openr d.p = 0.5 elseif s d.p = 1.0 else d.p = 0.0 end d end;  ## Reward Model¶ The reward model caputres the immediate objectives of the agent. It recieves a large negative reward for opening the door with the tiger behind it (-100), gets a positive reward for opening the other door (+10), and a small penalty for listening (-1). In [15]: # reward model function POMDPs.reward(pomdp::TigerPOMDP, s::Bool, a::Symbol) r = 0.0 a == :listen ? (r+=pomdp.r_listen) : (nothing) if a == :openl s ? (r += pomdp.r_findtiger) : (r += pomdp.r_escapetiger) end if a == :openr s ? (r += pomdp.r_escapetiger) : (r += pomdp.r_findtiger) end return r end POMDPs.reward(pomdp::TigerPOMDP, s::Bool, a::Symbol, sp::Bool) = reward(pomdp, s, a);  ## Observation Model¶ The observation model captures the uncertaintiy in the agent's lsitening ability. When we listen, we receive a noisy measurement of the tiger's location. In [16]: # observation model function POMDPs.observation(pomdp::TigerPOMDP, a::Symbol, s::Bool) d = TigerDistribution() pc = pomdp.p_listen_correctly if a == :listen s ? (d.p = pc) : (d.p = 1.0-pc) else d.p = 0.5 end d end;  ## Miscallenous Functions¶ Let's define the discount function and the functions that return the size of our spaces. In [17]: POMDPs.discount(pomdp::TigerPOMDP) = pomdp.discount_factor POMDPs.n_states(::TigerPOMDP) = 2 POMDPs.n_actions(::TigerPOMDP) = 3 POMDPs.n_observations(::TigerPOMDP) = 2;  ## Beliefs¶ If you are somewhat familiar with Julia defining all of the above may have been relaitvely simple. However, all POMDPs must be represented with a belief. Implementing beliefs and their updaters can be tricky. Luckily, our solvers abstract away the belief updating. All you need to do is define a function that returns an initial distriubtion over states. This distribution needs to support pdf and rand function. We already defined a dsitribution like that, so our job here is simple! In [18]: POMDPs.initial_state_distribution(pomdp::TigerPOMDP) = TigerDistribution(0.5, [true, false]);  If you are interested in creating your own beliefs and update schemes check out the POMDPToolbox module which implements a number of beliefs and udpate schemes. ## SARSOP Solver¶ Let's play around with the SARSOP.jl solver. The module we provide is a wrapper for the SARSOP backend. You can find more information about it here. In [19]: using SARSOP # load the module # initialize our tiger POMDP pomdp = TigerPOMDP() # initialize the solver solver = SARSOPSolver() # run the solve function policy = solve(solver, pomdp)  INFO: Recompiling stale cache file /home/zach/.julia/lib/v0.5/LightXML.ji for module LightXML.  Generating a pomdpx file: model.pomdpx Loading the model ... input file : model.pomdpx loading time : 0.00s SARSOP initializing ... initialization time : 0.00s ------------------------------------------------------------------------------- Time |#Trial |#Backup |LBound |UBound |Precision |#Alphas |#Beliefs ------------------------------------------------------------------------------- 0 0 0 -20 92.8206 112.821 3 1 0 2 51 -6.2981 63.1396 69.4377 7 16 0 4 103 0.149651 52.2764 52.1268 9 21 0 6 151 6.19248 42.0546 35.8621 9 21 0 8 200 10.3563 35.232 24.8757 12 21 0 11 250 14.0433 29.5471 15.5037 6 21 0 14 300 16.545 25.0926 8.54759 10 21 0 17 350 18.2281 21.8163 3.5882 14 21 0 18 400 18.7451 20.9384 2.19328 8 21 0 21 465 19.1109 20.0218 0.910956 5 21 0 22 500 19.2369 19.7071 0.470219 11 21 0 24 550 19.3036 19.5405 0.236865 6 21 0.01 25 600 19.3369 19.4574 0.120445 13 21 0.01 27 669 19.3579 19.4049 0.0469305 5 21 0.01 28 713 19.3643 19.389 0.024739 5 21 0.01 29 757 19.3676 19.3807 0.0130409 5 21 0.01 30 801 19.3694 19.3763 0.0068744 5 21 0.01 31 850 19.3704 19.3739 0.00351433 10 21 0.01 32 900 19.3709 19.3725 0.00155165 5 21 0.01 33 936 19.3711 19.3721 0.000976551 8 21 ------------------------------------------------------------------------------- SARSOP finishing ... target precision reached target precision : 0.001000 precision reached : 0.000977 ------------------------------------------------------------------------------- Time |#Trial |#Backup |LBound |UBound |Precision |#Alphas |#Beliefs ------------------------------------------------------------------------------- 0.01 33 936 19.3711 19.3721 0.000976551 5 21 ------------------------------------------------------------------------------- Writing out policy ... output file : out.policy  Out[19]: SARSOP.POMDPPolicy("out.policy",POMDPXFiles.POMDPAlphas([-81.5975 3.01448 … 28.4025 19.3711; 28.4025 24.6954 … -81.5975 19.3711],[1,2,2,0,2]),TigerPOMDP(-1.0,-100.0,10.0,0.85,0.95),Any[:openl,:openr,:listen]) In [20]: # we can retrieve the alpha vectors by calling alphas(policy)  Out[20]: 2×5 Array{Float64,2}: -81.5975 3.01448 24.6954 28.4025 19.3711 28.4025 24.6954 3.01452 -81.5975 19.3711 Let's now see how our policy changes with the belief. In [21]: # use POMDPToolbox for beliefs using POMDPToolbox # let's initialize the beliefs b = DiscreteBelief(2); # the initial prior  Out[21]: POMDPToolbox.DiscreteBelief([0.5,0.5]) In [22]: a = action(policy, b) # index of action, you need to convert this to the true action, support for automatic conversion is coming soon  Out[22]: :listen Let's simulate our policy. We'll use POMDPToolbox to do the simulation. As mentioned earlier, in a POMDP, the decision is based on a belief. However, each policy (comes from the solver modules) implements its own belief udpating scheme, so you do not need to worry about deling with beliefs. The only thing you need is to define an initial_state_distribution. In [35]: using POMDPToolbox # for simulation pomdp = TigerPOMDP() # initialize problem init_dist = initial_state_distribution(pomdp) # initialize distriubtion over state up = updater(policy) # belief updater for our policy, SARSOP uses a discrete Bayesian filter hr = HistoryRecorder(max_steps=14, rng=MersenneTwister(1)) # history recorder that keeps track of states, observations and beliefs hist = simulate(hr, pomdp, policy, up, init_dist) for (s, b, a, r, sp, op) in hist println("s:$s, b: $(b.b), action:$a, obs: $op") end println("Total reward:$(discounted_reward(hist))")

s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: false
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: false
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
Total reward: 21.136977555064835


Notice that over the first nine time steps, the policy is fairly simple. We listen twice, and then decide which door to open. However, in the following steps, we get a mix of observations, which makes the decision harder. Our agent does not open a door, because its belief is still uniform at the last time step!

## QMDP Solver¶

We will briefly go over the QMDP.jl solver. You should use QMDP with a word of caution. QMDP assumes that all state uncetainty dissapears in the next time step. This could lead to bad policies in problems with information gathering actions. For example, in the tiger POMDP listening is an information gathering action, and the resulting QMDP policy is quite poor. However, QMDP can work very well in problems where the state uncertainity is not impacted by the agent's action (for example systems with noisy sensor measurements).

In [36]:
using QMDP

# initialize the solver
# key-word args are the maximum number of iterations the solver will run for, and the Bellman tolerance
solver = QMDPSolver(max_iterations=50, tolerance=1e-3)

# run the solver
qmdp_policy = solve(solver, pomdp, verbose=true)

Iteration : 1, residual: 14.75, iteration run-time: 8.34e-6, total run-time: 8.34e-6
Iteration : 2, residual: 12.59046875, iteration run-time: 6.069e-6, total run-time: 1.4409e-5
Iteration : 3, residual: 11.564691406249999, iteration run-time: 3.641e-6, total run-time: 1.805e-5
Iteration : 4, residual: 10.943236428222654, iteration run-time: 2.783e-6, total run-time: 2.0833000000000002e-5
Iteration : 5, residual: 10.2558588273941, iteration run-time: 2.468e-6, total run-time: 2.3301000000000003e-5
Iteration : 6, residual: 9.587976314837448, iteration run-time: 2.19e-6, total run-time: 2.5491000000000004e-5
Iteration : 7, residual: 8.957886507199987, iteration run-time: 2.245e-6, total run-time: 2.7736000000000003e-5
Iteration : 8, residual: 8.367828168991792, iteration run-time: 2.342e-6, total run-time: 3.0078000000000002e-5
Iteration : 9, residual: 7.816304847983972, iteration run-time: 2.12e-6, total run-time: 3.2198000000000006e-5
Iteration : 10, residual: 7.301052156282381, iteration run-time: 2.551e-6, total run-time: 3.4749e-5
Iteration : 11, residual: 6.8197456599030915, iteration run-time: 2.122e-6, total run-time: 3.6871e-5
Iteration : 12, residual: 6.370163598662359, iteration run-time: 2.45e-6, total run-time: 3.9321e-5
Iteration : 13, residual: 5.9502184661618, iteration run-time: 2.276e-6, total run-time: 4.1597e-5
Iteration : 14, residual: 5.557957422333274, iteration run-time: 2.509e-6, total run-time: 4.4106e-5
Iteration : 15, residual: 5.191555653202798, iteration run-time: 2.468e-6, total run-time: 4.6574e-5
Iteration : 16, residual: 4.849308471382614, iteration run-time: 2.298e-6, total run-time: 4.8871999999999996e-5
Iteration : 17, residual: 4.529623527415282, iteration run-time: 2.41e-6, total run-time: 5.1281999999999995e-5
Iteration : 18, residual: 4.231013435561891, iteration run-time: 2.328e-6, total run-time: 5.361e-5
Iteration : 19, residual: 3.9520888618093863, iteration run-time: 2.649e-6, total run-time: 5.6259e-5
Iteration : 20, residual: 3.6915520617660036, iteration run-time: 2.39e-6, total run-time: 5.8649e-5
Iteration : 21, residual: 3.448190843167879, iteration run-time: 2.318e-6, total run-time: 6.0967e-5
Iteration : 22, residual: 3.2208729260632936, iteration run-time: 2.229e-6, total run-time: 6.3196e-5
Iteration : 23, residual: 3.00854067471343, iteration run-time: 2.378e-6, total run-time: 6.557399999999999e-5
Iteration : 24, residual: 2.8102061767669397, iteration run-time: 2.307e-6, total run-time: 6.7881e-5
Iteration : 25, residual: 2.624946646829443, iteration run-time: 2.243e-6, total run-time: 7.0124e-5
Iteration : 26, residual: 2.451900133045797, iteration run-time: 2.205e-6, total run-time: 7.232899999999999e-5
Iteration : 27, residual: 2.290261506721066, iteration run-time: 2.217e-6, total run-time: 7.454599999999999e-5
Iteration : 28, residual: 2.13927871632049, iteration run-time: 2.267e-6, total run-time: 7.681299999999998e-5
Iteration : 29, residual: 1.9982492884203396, iteration run-time: 2.265e-6, total run-time: 7.907799999999999e-5
Iteration : 30, residual: 1.866517059329368, iteration run-time: 1.95e-6, total run-time: 8.1028e-5
Iteration : 31, residual: 1.7434691221742469, iteration run-time: 2.066e-6, total run-time: 8.3094e-5
Iteration : 32, residual: 1.628532975244866, iteration run-time: 2.013e-6, total run-time: 8.5107e-5
Iteration : 33, residual: 1.5211738583317356, iteration run-time: 2.138e-6, total run-time: 8.7245e-5
Iteration : 34, residual: 1.4208922646616031, iteration run-time: 2.247e-6, total run-time: 8.949199999999999e-5
Iteration : 35, residual: 1.32722161685669, iteration run-time: 2.317e-6, total run-time: 9.180899999999999e-5
Iteration : 36, residual: 1.2397260961028849, iteration run-time: 2.275e-6, total run-time: 9.408399999999999e-5
Iteration : 37, residual: 1.1579986144276404, iteration run-time: 2.14e-6, total run-time: 9.622399999999999e-5
Iteration : 38, residual: 1.0816589206533251, iteration run-time: 2.149e-6, total run-time: 9.837299999999999e-5
Iteration : 39, residual: 1.0103518312128017, iteration run-time: 2.246e-6, total run-time: 0.000100619
Iteration : 40, residual: 0.9437455775971557, iteration run-time: 2.099e-6, total run-time: 0.00010271799999999999
Iteration : 41, residual: 0.8815302627452866, iteration run-time: 2.205e-6, total run-time: 0.00010492299999999998
Iteration : 42, residual: 0.8234164191945297, iteration run-time: 2.075e-6, total run-time: 0.00010699799999999998
Iteration : 43, residual: 0.7691336622836786, iteration run-time: 2.305e-6, total run-time: 0.00010930299999999998
Iteration : 44, residual: 0.7184294321414484, iteration run-time: 2.111e-6, total run-time: 0.00011141399999999997
Iteration : 45, residual: 0.6710678186085772, iteration run-time: 2.229e-6, total run-time: 0.00011364299999999997
Iteration : 46, residual: 0.6268284636247756, iteration run-time: 2.262e-6, total run-time: 0.00011590499999999997
Iteration : 47, residual: 0.5855055359753294, iteration run-time: 2.193e-6, total run-time: 0.00011809799999999997
Iteration : 48, residual: 0.5469067736256363, iteration run-time: 2.083e-6, total run-time: 0.00012018099999999997
Iteration : 49, residual: 0.5108525891891986, iteration run-time: 2.038e-6, total run-time: 0.00012221899999999996
Iteration : 50, residual: 0.4771752343663138, iteration run-time: 2.022e-6, total run-time: 0.00012424099999999995

WARNING: using QMDP.create_policy in module Main conflicts with an existing identifier.

Out[36]:
QMDP.QMDPPolicy([193.239 83.2389 182.124; 83.4656 193.466 182.354],Any[:openl,:openr,:listen],TigerPOMDP(-1.0,-100.0,10.0,0.85,0.95))
In [37]:
qmdp_policy.alphas

Out[37]:
2×3 Array{Float64,2}:
193.239    83.2389  182.124
83.4656  193.466   182.354

Notice that these alpha-vectors differ from those compute by SARSOP. Let's see how the policy looks in simulation. We'll use the same procedure to simulate the QMDP policy.

In [39]:
pomdp = TigerPOMDP() # initialize problem
init_dist = initial_state_distribution(pomdp) # initialize distriubtion over state

up = updater(qmdp_policy) # belief updater for our policy
hist = HistoryRecorder(max_steps=14, rng=MersenneTwister(1)) # history recorder that keeps track of states, observations and beliefs

hist = simulate(hist, pomdp, qmdp_policy, up, init_dist)

for (s, b, a, r, sp, op) in hist
println("s: $s, b:$(b.b), action: $a, obs:$op")
end
println("Total reward: \$(discounted_reward(hist))")

s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: false
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: false
s: true, b: [0.5,0.5], action: listen, obs: true
s: true, b: [0.15,0.85], action: listen, obs: true
s: true, b: [0.0302013,0.969799], action: openr, obs: true
Total reward: 21.136977555064835


The results are identical for this single simulation for this particular problem. In general, if you are dealing with a complex problem, it is good to compare the SARSOP and QMDP policies. This framework makes comparing the two policies very simple once you have defined the problem!