This tutorial outlines how to define a POMDP using the POMDPs.jl interface. We will go through a simple problem simply known as the tiger problem (we will refer to it as the tiger POMDP). After defining the tiger POMDP, we will use QMDP and SARSOP to solve the POMDP. If you are new to working with this package, check out the tutorial on MDPs first.

You need to install a few modules in order to use this notebook. If you have all the modules below installed, great! If not run the following commands:

```
# install the POMDPs.jl interface
Pkg.clone("https://github.com/sisl/POMDPs.jl.git")
using POMDPs
# install the SARSOP wrapper
POMDPs.add("SARSOP") # note this downloads and builds the APPL toolit and may take a few minutes
# install the QMDP solver
POMDPs.add("QMDP")
# install a helper modules
POMDPs.add("POMDPToolbox") # this provides implementations of discrete belief updating
# install a Julia packages for working with distributions
Pkg.add("Distributions")
```

If you already have all of the modules above, make sure you have the most recent versions. Many of these are still under heavy development, so update before starting by running

```
Pkg.update()
```

In the tiger POMDP, the agent is tasked with escaping from a room. There are two doors leading out of the room. Behind one of the doors is a tiger, and behind the other is sweet, sweet freedom. If the agent opens the door and finds the tiger, it gets eaten (and receives a reward of -100). If the agent opens the other door, it escapes and receives a reward of 10. The agent can also listen. Listening gives a noisy measurement of which door the tiger is hiding behind. Listening gives the agent the correct location of the tiger 85% of the time. The agent receives a reward of -1 for listening.

In [1]:

```
# first import POMDPs.jl
using POMDPs
```

A POMDP is defined by the tuple $$(\mathcal{S}, \mathcal{A}, \mathcal{Z}, T, R, O).$$ In addition to the familiar, state $\mathcal{S}$ and action $\mathcal{A}$ spaces, we must also define an observation space $\mathcal{Z}$ and an observation function $O$. The POMDP problem definition may be similar to the one for MDP. For example, if you wanted to add state uncertainty to your problem, you can define the observation space, and observation function in addition to your previous MDP definition.

Before defining the spaces for this problem, let's first define the concrete type for the tiger POMDP.

In [2]:

```
type TigerPOMDP <: POMDP{Bool, Symbol, Bool} # POMDP{State, Action, Observation} all parametarized by Int64s
r_listen::Float64 # reward for listening (default -1)
r_findtiger::Float64 # reward for finding the tiger (default -100)
r_escapetiger::Float64 # reward for escaping (default 10)
p_listen_correctly::Float64 # prob of correctly listening (default 0.85)
discount_factor::Float64 # discount
end
# default constructor
function TigerPOMDP()
return TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95)
end;
```

There are a number of parameters in the problem definition, but we can treat them all as constants. You can read more about the Tiger problem and POMDPs here. However, we created a default constructor that allows us to initialize the tiger POMDP by simply running:

In [3]:

```
pomdp = TigerPOMDP()
```

Out[3]:

Note that the TigerPOMDP type inherits from a `POMDP{Bool, Symbol, Bool}`

. This means that in our problem we will use `Bool`

to represent our states, `Symbol`

to represent our actions and `Bool`

to represent our observations. More details on states, actions and observations in this problem are below.

We define our state with a boolean that indicates weather or not the tiger is hiding behind the left door. If our state is `true`

, the tiger is behind the left door. If its `false`

, the tiger is behind the right door.

In [4]:

```
example_state = false # tiger is hiding behind right door
```

Out[4]:

Since the state is a binary value, we represent it as a boolean, but we could have represented it as an integer or any other sensible type.

There are three possible actions our agent can take: open the left door, open the right door, and listen. For clarity, we will represent these with symbols.

In [5]:

```
example_action = :listen # agent listens, can be :openl or :openr
```

Out[5]:

We will represent our actions with the following symbols: open left (:openl), open right (:openr), and listen (:listen). For example, the action below represnts listening:

In [6]:

```
a = :listen
```

Out[6]:

There are two possible observations: the agent either hears the tiger behind the left door or behind the right door. We use a boolean to represent the observations:

In [7]:

```
example_observation = true # agent heard the tiger behind the left door
```

Out[7]:

Let's define our state, action and observation spaces.

There are only two states in the tiger POMDP: the tiger is either behind the left door or behind the right door. Our state space is simply an array of the states in the problem. Recall, that the `states`

function returns the state space for a given POMDP type.

In [8]:

```
POMDPs.states(::TigerPOMDP) = [true, false];
```

There are three actions in our problem. Once again, we represent the action space as an array of the actions in our problem. The `actions`

function serve a similar purpose to the `states`

function above. Since the action space is discrete, we can define the `action_index`

function that associates an integer index to each action.

In [9]:

```
POMDPs.actions(::TigerPOMDP) = [:openl, :openr, :listen] # default
POMDPs.actions(pomdp::TigerPOMDP, state::Bool) = POMDPs.actions(pomdp) # convenience (actions do not change in different states)
function POMDPs.action_index(::TigerPOMDP, a::Symbol)
if a==:openl
return 1
elseif a==:openr
return 2
elseif a==:listen
return 3
end
error("invalid TigerPOMDP action: $a")
end;
```

The observation space looks similar to the state space. Recall that the state represents the truth about our system, while the observation is potentially false information recieves about the state. In the tiger POMDP, our observation could give us a false representation of our state.

In [10]:

```
# function returning observation space
POMDPs.observations(::TigerPOMDP) = [true, false];
POMDPs.observations(pomdp::TigerPOMDP, s::Bool) = observations(pomdp);
```

Now that we've defined the POMDP spaces, let's move on to defining the model functions.

Before defining the model functions, we first need to create a distributions data-type. In general, our distributions should support sampling and have a `pdf`

method. If you only want to get a policy from the SARSOP and QMDP solvers, you do not need to worry about implementing a sampling function. However, if you want to simulate the policy, you should implement these functions.

Since the transition and observation distributions have identical form, we could just use a single type to serve the needs of both. This will not be the case in general.

In [11]:

```
# distribution type that will be used for both transitions and observations
type TigerDistribution
p::Float64
it::Vector{Bool}
end
TigerDistribution() = TigerDistribution(0.5, [true, false])
POMDPs.iterator(d::TigerDistribution) = d.it
```

We now define the `pdf`

function. For a discrete problem, this function returns the probability of a given element (state or observation in our case).

In [12]:

```
# transition and observation pdf
function POMDPs.pdf(d::TigerDistribution, so::Bool)
so ? (return d.p) : (return 1.0-d.p)
end;
```

Finally, let's create the sampling functions.

In [13]:

```
# samples from transition or observation distribution
POMDPs.rand(rng::AbstractRNG, d::TigerDistribution) = rand(rng) <= d.p;
```

This is all we have to do for our distribution functions!

Here we define the transition model for the tiger POMDP. The model itself is fairly simple. Our state is represented by the location of the tiger (left or right). The location of the tiger doesn't change when the agent listens. However, after the agent opens the door, it reaches a terminal state. That is the agent either escapes or gets eaten. To simplify our formulation, we simply reset the location of the tiger randomly. We could model this problem with a terminal state (i.e. one in which the agent no longer receives reward) as well.

In [14]:

```
# Resets the problem after opening door; does nothing after listening
function POMDPs.transition(pomdp::TigerPOMDP, s::Bool, a::Symbol)
d = TigerDistribution()
if a == :openl || a == :openr
d.p = 0.5
elseif s
d.p = 1.0
else
d.p = 0.0
end
d
end;
```

The reward model caputres the immediate objectives of the agent. It recieves a large negative reward for opening the door with the tiger behind it (-100), gets a positive reward for opening the other door (+10), and a small penalty for listening (-1).

In [15]:

```
# reward model
function POMDPs.reward(pomdp::TigerPOMDP, s::Bool, a::Symbol)
r = 0.0
a == :listen ? (r+=pomdp.r_listen) : (nothing)
if a == :openl
s ? (r += pomdp.r_findtiger) : (r += pomdp.r_escapetiger)
end
if a == :openr
s ? (r += pomdp.r_escapetiger) : (r += pomdp.r_findtiger)
end
return r
end
POMDPs.reward(pomdp::TigerPOMDP, s::Bool, a::Symbol, sp::Bool) = reward(pomdp, s, a);
```

The observation model captures the uncertaintiy in the agent's lsitening ability. When we listen, we receive a noisy measurement of the tiger's location.

In [16]:

```
# observation model
function POMDPs.observation(pomdp::TigerPOMDP, a::Symbol, s::Bool)
d = TigerDistribution()
pc = pomdp.p_listen_correctly
if a == :listen
s ? (d.p = pc) : (d.p = 1.0-pc)
else
d.p = 0.5
end
d
end;
```

Let's define the `discount`

function and the functions that return the size of our spaces.

In [17]:

```
POMDPs.discount(pomdp::TigerPOMDP) = pomdp.discount_factor
POMDPs.n_states(::TigerPOMDP) = 2
POMDPs.n_actions(::TigerPOMDP) = 3
POMDPs.n_observations(::TigerPOMDP) = 2;
```

If you are somewhat familiar with Julia defining all of the above may have been relaitvely simple. However, all POMDPs must be represented with a belief. Implementing beliefs and their updaters can be tricky. Luckily, our solvers abstract away the belief updating. All you need to do is define a function that returns an initial distriubtion over states. This distribution needs to support `pdf`

and `rand`

function. We already defined a dsitribution like that, so our job here is simple!

In [18]:

```
POMDPs.initial_state_distribution(pomdp::TigerPOMDP) = TigerDistribution(0.5, [true, false]);
```

If you are interested in creating your own beliefs and update schemes check out the POMDPToolbox module which implements a number of beliefs and udpate schemes.

In [19]:

```
using SARSOP # load the module
# initialize our tiger POMDP
pomdp = TigerPOMDP()
# initialize the solver
solver = SARSOPSolver()
# run the solve function
policy = solve(solver, pomdp)
```

Out[19]:

In [20]:

```
# we can retrieve the alpha vectors by calling
alphas(policy)
```

Out[20]:

Let's now see how our policy changes with the belief.

In [21]:

```
# use POMDPToolbox for beliefs
using POMDPToolbox
# let's initialize the beliefs
b = DiscreteBelief(2); # the initial prior
```

Out[21]:

In [22]:

```
a = action(policy, b) # index of action, you need to convert this to the true action, support for automatic conversion is coming soon
```

Out[22]:

Let's simulate our policy. We'll use POMDPToolbox to do the simulation. As mentioned earlier, in a POMDP, the decision is based on a belief. However, each policy (comes from the solver modules) implements its own belief udpating scheme, so you do not need to worry about deling with beliefs. The only thing you need is to define an `initial_state_distribution`

.

In [35]:

```
using POMDPToolbox # for simulation
pomdp = TigerPOMDP() # initialize problem
init_dist = initial_state_distribution(pomdp) # initialize distriubtion over state
up = updater(policy) # belief updater for our policy, SARSOP uses a discrete Bayesian filter
hr = HistoryRecorder(max_steps=14, rng=MersenneTwister(1)) # history recorder that keeps track of states, observations and beliefs
hist = simulate(hr, pomdp, policy, up, init_dist)
for (s, b, a, r, sp, op) in hist
println("s: $s, b: $(b.b), action: $a, obs: $op")
end
println("Total reward: $(discounted_reward(hist))")
```

Notice that over the first nine time steps, the policy is fairly simple. We listen twice, and then decide which door to open. However, in the following steps, we get a mix of observations, which makes the decision harder. Our agent does not open a door, because its belief is still uniform at the last time step!

We will briefly go over the QMDP.jl solver. You should use QMDP with a word of caution. QMDP assumes that all state uncetainty dissapears in the next time step. This could lead to bad policies in problems with information gathering actions. For example, in the tiger POMDP listening is an information gathering action, and the resulting QMDP policy is quite poor. However, QMDP can work very well in problems where the state uncertainity is not impacted by the agent's action (for example systems with noisy sensor measurements).

In [36]:

```
using QMDP
# initialize the solver
# key-word args are the maximum number of iterations the solver will run for, and the Bellman tolerance
solver = QMDPSolver(max_iterations=50, tolerance=1e-3)
# run the solver
qmdp_policy = solve(solver, pomdp, verbose=true)
```

Out[36]:

In [37]:

```
qmdp_policy.alphas
```

Out[37]:

Notice that these alpha-vectors differ from those compute by SARSOP. Let's see how the policy looks in simulation. We'll use the same procedure to simulate the QMDP policy.

In [39]:

```
pomdp = TigerPOMDP() # initialize problem
init_dist = initial_state_distribution(pomdp) # initialize distriubtion over state
up = updater(qmdp_policy) # belief updater for our policy
hist = HistoryRecorder(max_steps=14, rng=MersenneTwister(1)) # history recorder that keeps track of states, observations and beliefs
hist = simulate(hist, pomdp, qmdp_policy, up, init_dist)
for (s, b, a, r, sp, op) in hist
println("s: $s, b: $(b.b), action: $a, obs: $op")
end
println("Total reward: $(discounted_reward(hist))")
```

The results are identical for this single simulation for this particular problem. In general, if you are dealing with a complex problem, it is good to compare the SARSOP and QMDP policies. This framework makes comparing the two policies very simple once you have defined the problem!

In [ ]:

```
```