*Getting Started with a Simple Example*

**Daisuke Oyama**

*Faculty of Economics, University of Tokyo*

This notebook demonstrates via a simple example how to use the `DiscreteDP`

module.

In [1]:

```
import numpy as np
from scipy import sparse
from quantecon.markov import DiscreteDP
```

Let us consider the following two-state dynamic program, taken from Puterman (2005), Section 3.1, pp.33-35; see also Example 6.2.1, pp.155-156.

There are two possible states $0$ and $1$.

At state $0$, you may choose either "stay", say action $0$, or "move", action $1$.

At state $1$, there is no way to move, so that you can only stay, i.e., $0$ is the only available action. (You may alternatively distinguish between the action "staty" at state $0$ and that at state $1$, and call the latter action $2$; but here we choose to refer to the both actions as action $0$.)

At state $0$, if you choose action $0$ (stay), then you receive a reward $5$, and in the next period the state will remain at $0$ with probability $1/2$, but it moves to $1$ with probability $1/2$.

If you choose action $1$ (move), then you receive a reward $10$, and the state in the next period will be $1$ with probability $1$.

At state $1$, where the only action you can take is $0$ (stay), you receive a reward $-1$, and the state will remain at $1$ with probability $1$.

You want to maximize the sum of discounted expected reward flows with discount factor $\beta \in [0, 1)$.

The optimization problem consists of:

the state space: $S = \{0, 1\}$;

the action space: $A = \{0, 1\}$;

the set of feasible state-action pairs $\mathit{SA} = \{(0, 0), (0, 1), (1, 0)\} \subset S \times A$;

the reward function $r\colon \mathit{SA} \to \mathbb{R}$, where $$ r(0, 0) = 5,\ r(0, 1) = 10,\ r(1, 0) = -1; $$

the transition probability function $q \colon \mathit{SA} \to \Delta(S)$, where $$ \begin{aligned} &(q(0 | 0, 0), q(1 | 0, 0)) = (1/2, 1/2), \\ &(q(0 | 0, 1), q(1 | 0, 1)) = (0, 1), \\ &(q(0 | 1, 0), q(1 | 1, 0)) = (0, 1); \end{aligned} $$

the discount factor $\beta \in [0, 1)$.

The Belmann equation for this problem is: $$ \begin{aligned} v(0) &= \max \left\{5 + \beta \left(\frac{1}{2} v(0) + \frac{1}{2} v(1)\right), 10 + \beta v(1)\right\}, \\ v(1) &= (-1) + \beta v(1). \end{aligned} $$

This problem is simple enough to solve by hand: the optimal value function $v^*$ is given by $$ \begin{aligned} &v(0) = \begin{cases} \dfrac{5 - 5.5 \beta}{(1 - 0.5 \beta) (1 - \beta)} & \text{if $\beta > \frac{10}{11}$} \\ \dfrac{10 - 11 \beta}{1 - \beta} & \text{otherwise}, \end{cases}\\ &v(1) = -\frac{1}{1 - \beta}, \end{aligned} $$ and the optimal policy function $\sigma^*$ is given by $$ \begin{aligned} &\sigma^*(0) = \begin{cases} 0 & \text{if $\beta > \frac{10}{11}$} \\ 1 & \text{otherwise}, \end{cases}\\ &\sigma^*(1) = 0. \end{aligned} $$

In [2]:

```
def v_star(beta):
v = np.empty(2)
v[1] = -1 / (1 - beta)
if beta > 10/11:
v[0] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))
else:
v[0] = (10 - 11*beta) / (1 - beta)
return v
```

We want to solve this problem numerically by using the `DiscreteDP`

class.

We will set $\beta = 0.95$ ($> 10/11$), for which the anlaytical solution is: $\sigma^* = (0, 0)$ and

In [3]:

```
v_star(beta=0.95)
```

Out[3]:

There are two ways to represent the data for instantiating a `DiscreteDP`

object.
Let $n$, $m$, and $L$ denote the numbers of states, actions,
and feasbile state-action pairs, respectively;
in the above example, $n = 2$, $m = 2$, and $L = 3$.

`DiscreteDP(R, Q, beta)`

with parameters:

- $n \times m$ reward array
`R`

, - $n \times m \times n$ transition probability array
`Q`

, and - discount factor
`beta`

,

where

`R[s, a]`

is the reward for action`a`

when the state is`s`

and`Q[s, a, s']`

is the probability that the state in the next period is`s'`

when the current state is`s`

and the action chosen is`a`

.- $n \times m$ reward array
`DiscreteDP(R, Q, beta, s_indices, a_indices)`

with parameters:

- length $L$ reward vector
`R`

, - $L \times n$ transition probability array
`Q`

, - discount factor
`beta`

, - length $L$ array
`s_indices`

, and - length $L$ array
`a_indices`

,

where the pairs

`(s_indices[0], a_indices[0])`

, ...,`(s_indices[L-1], a_indices[L-1])`

enumerate feasible state-action pairs, and`R[i]`

is the reward for action`a_indices[i]`

when the state is`s_indices[i]`

and`Q[i, s']`

is the probability that the state in the next period is`s'`

when the current state is`s_indices[i]`

and the action chosen is`a_indices[0]`

.- length $L$ reward vector

`DiscreteDP`

instance¶Let us illustrate the two formulations by the simple example at the outset.

This formulation is straightforward
when the number of feasible actions is constant across states
so that the set of feasible state-action pairs is naturally represetend
by the product $S \times A$,
while any problem can actually be represented in this way
by defining the reward `R[s, a]`

to be $-\infty$ when action `a`

is infeasible under state `s`

.

To apply this approach to the current example, we consider the effectively equivalent problem in which at both states $0$ and $1$, both actions $0$ (stay) and $1$ (move) are available, but action $1$ yields a reward $-\infty$ at state $1$.

The reward array `R`

is an $n \times m$ 2-dimensional array:

In [4]:

```
R = [[5, 10],
[-1, -float('inf')]]
```

The transition probability array `Q`

is an $n \times m \times n$ 3-dimenstional array:

In [5]:

```
Q = [[(0.5, 0.5), (0, 1)],
[(0, 1), (0.5, 0.5)]] # Probabilities in Q[1, 1] are arbitrary
```

Note that the transition probabilities for action $(s, a) = (1, 1)$ are arbitrary, since $a = 1$ is infeasible at $s = 1$ in the original problem.

Let us set the discount factor $\beta$ to be $0.95$:

In [6]:

```
beta = 0.95
```

We are ready to create a `DiscreteDP`

instance:

In [7]:

```
ddp = DiscreteDP(R, Q, beta)
```

When the number of feasible actions varies across states,
it can be inefficient in terms of memory usage
to extend the domain by treating infeasible actions
to be "feasible but yielding reward $-\infty$".
This formulation takes the set of feasible state-action pairs as is,
defining `R`

to be a 1-dimensional array of length `L`

and `Q`

to be a 2-dimensional array of shape `(L, n)`

,
where `L`

is the number of feasible state-action pairs.

First, we have to list all the feasible state-action pairs. For our example, they are: $(s, a) = (0, 0), (0, 1), (1, 0)$.

We have arrays `s_indices`

and `a_indices`

of length $3$
contain the indices of states and actions, respectively.

In [8]:

```
s_indices = [0, 0, 1] # State indices
a_indices = [0, 1, 0] # Action indices
```

The reward vector `R`

is a length $L$ 1-dimensional array:

In [9]:

```
# Rewards for (s, a) = (0, 0), (0, 1), (1, 0), respectively
R = [5, 10, -1]
```

The transition probability array `Q`

is an $L \times n$ 2-dimensional array:

In [10]:

```
# Probability vectors for (s, a) = (0, 0), (0, 1), (1, 0), respectively
Q = [(0.5, 0.5), (0, 1), (0, 1)]
```

For the discount factor, set $\beta = 0.95$ as before:

In [11]:

```
beta = 0.95
```

Now create a `DiscreteDP`

instance:

In [12]:

```
ddp_sa = DiscreteDP(R, Q, beta, s_indices, a_indices)
```

Importantly, this formulation allows us to represent the transition probability array `Q`

as a `scipy.sparse`

matrix
(of any format),
which is useful for large and sparse problems.

For example, let us convert the above ndarray `Q`

to the Coordinate (coo) format:

In [13]:

```
import scipy.sparse
Q = scipy.sparse.coo_matrix(Q)
```

Pass it to `DiscreteDP`

with the other parameters:

In [14]:

```
ddp_sparse = DiscreteDP(R, Q, beta, s_indices, a_indices)
```

Internally, the matrix `Q`

is converted to the Compressed Sparse Row (csr) format:

In [15]:

```
ddp_sparse.Q
```

Out[15]:

In [16]:

```
ddp_sparse.Q.toarray()
```

Out[16]:

Now let us solve our model.
Currently, `DiscreteDP`

supports the following solution algorithms:

- policy iteration;
- value iteration;
- modified policy iteration.

(The methods are the same across the formulations.)

We solve the model first by policy iteration, which gives the exact solution:

In [17]:

```
v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))
res = ddp.solve(method='policy_iteration', v_init=v_init)
```

`res`

contains the information about the solution result:

In [18]:

```
res
```

Out[18]:

The optimal policy function:

In [19]:

```
res.sigma
```

Out[19]:

The optimal value function:

In [20]:

```
res.v
```

Out[20]:

This coincides with the analytical solution:

In [21]:

```
v_star(beta)
```

Out[21]:

In [22]:

```
np.allclose(res.v, v_star(beta))
```

Out[22]:

The number of iterations:

In [23]:

```
res.num_iter
```

Out[23]:

Verify that the value of the policy `[0, 0]`

is actually equal to the optimal value `v`

:

In [24]:

```
ddp.evaluate_policy(res.sigma)
```

Out[24]:

In [25]:

```
ddp.evaluate_policy(res.sigma) == res.v
```

Out[25]:

`res.mc`

is the controlled Markov chain given by the optimal policy `[0, 0]`

:

In [26]:

```
res.mc
```

Out[26]:

Next, solve the model by value iteration, which returns an $\varepsilon$-optimal solution for a specified value of $\varepsilon$:

In [27]:

```
epsilon = 1e-2 # Convergece tolerance, optional(default=1e-3)
v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))
res_vi = ddp.solve(method='value_iteration', v_init=v_init,
epsilon=epsilon)
```

In [28]:

```
res_vi
```

Out[28]:

The computed policy function `res1.sigma`

is an $\varepsilon$-optimal policy,
and the value function `res1.v`

is an $\varepsilon/2$-approximation
of the true optimal value function.

In [29]:

```
np.abs(v_star(beta) - res_vi.v).max()
```

Out[29]:

Finally, solve the model by modified policy iteration:

In [30]:

```
epsilon = 1e-2 # Convergece tolerance, optional(defaul=1e-3)
v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))
res_mpi = ddp.solve(method='modified_policy_iteration', v_init=v_init,
epsilon=epsilon)
```

In [31]:

```
res_mpi
```

Out[31]:

Modified policy function also returns an $\varepsilon$-optimal policy function and an $\varepsilon/2$-approximate value function:

In [32]:

```
np.abs(v_star(beta) - res_mpi.v).max()
```

Out[32]:

- M.L. Puterman,
*Markov Decision Processes: Discrete Stochastic Dynamic Programming*, Wiley-Interscience, 2005.