*Getting Started with a Simple Example*

**Daisuke Oyama**

*Faculty of Economics, University of Tokyo*

This notebook demonstrates via a simple example how to use the `DiscreteDP`

module.

In [1]:

```
from __future__ import division, print_function
import numpy as np
from scipy import sparse
from quantecon.markov import DiscreteDP
```

Let us consider the following two-state dynamic program, taken from Puterman (2005), Section 3.1, pp.33-35; see also Example 6.2.1, pp.155-156.

There are two possible states $0$ and $1$.

At state $0$, you may choose either "stay", say action $0$, or "move", action $1$.

At state $1$, there is no way to move, so that you can only stay, i.e., $0$ is the only available action. (You may alternatively distinguish between the action "staty" at state $0$ and that at state $1$, and call the latter action $2$; but here we choose to refer to the both actions as action $0$.)

At state $0$, if you choose action $0$ (stay), then you receive a reward $5$, and in the next period the state will remain at $0$ with probability $1/2$, but it moves to $1$ with probability $1/2$.

If you choose action $1$ (move), then you receive a reward $10$, and the state in the next period will be $1$ with probability $1$.

At state $1$, where the only action you can take is $0$ (stay), you receive a reward $-1$, and the state will remain at $1$ with probability $1$.

You want to maximize the sum of discounted expected reward flows with discount factor $\beta \in [0, 1)$.

The optimization problem consists of:

the state space: $S = \{0, 1\}$;

the action space: $A = \{0, 1\}$;

the set of feasible state-action pairs $\mathit{SA} = \{(0, 0), (0, 1), (1, 0)\} \subset S \times A$;

the reward function $r\colon \mathit{SA} \to \mathbb{R}$, where $$ r(0, 0) = 5,\ r(0, 1) = 10,\ r(1, 0) = -1; $$

the transition probability function $q \colon \mathit{SA} \to \Delta(S)$, where $$ \begin{aligned} &(q(0 | 0, 0), q(1 | 0, 0)) = (1/2, 1/2), \\ &(q(0 | 0, 1), q(1 | 0, 1)) = (0, 1), \\ &(q(0 | 1, 0), q(1 | 1, 0)) = (0, 1); \end{aligned} $$

the discount factor $\beta \in [0, 1)$.

In [2]:

```
def v_star(beta):
v = np.empty(2)
v[1] = -1 / (1 - beta)
if beta > 10/11:
v[0] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))
else:
v[0] = (10 - 11*beta) / (1 - beta)
return v
```

We want to solve this problem numerically by using the `DiscreteDP`

class.

In [3]:

```
v_star(beta=0.95)
```

Out[3]:

There are two ways to represent the data for instantiating a `DiscreteDP`

object.
Let $n$, $m$, and $L$ denote the numbers of states, actions,
and feasbile state-action pairs, respectively;
in the above example, $n = 2$, $m = 2$, and $L = 3$.

`DiscreteDP(R, Q, beta)`

with parameters:

- $n \times m$ reward array
`R`

, - $n \times m \times n$ transition probability array
`Q`

, and - discount factor
`beta`

,

where

`R[s, a]`

is the reward for action`a`

when the state is`s`

and`Q[s, a, s']`

is the probability that the state in the next period is`s'`

when the current state is`s`

and the action chosen is`a`

.- $n \times m$ reward array
`DiscreteDP(R, Q, beta, s_indices, a_indices)`

with parameters:

- length $L$ reward vector
`R`

, - $L \times n$ transition probability array
`Q`

, - discount factor
`beta`

, - length $L$ array
`s_indices`

, and - length $L$ array
`a_indices`

,

where the pairs

`(s_indices[0], a_indices[0])`

, ...,`(s_indices[L-1], a_indices[L-1])`

enumerate feasible state-action pairs, and`R[i]`

is the reward for action`a_indices[i]`

when the state is`s_indices[i]`

and`Q[i, s']`

is the probability that the state in the next period is`s'`

when the current state is`s_indices[i]`

and the action chosen is`a_indices[0]`

.- length $L$ reward vector

`DiscreteDP`

instance¶Let us illustrate the two formulations by the simple example at the outset.

`R[s, a]`

to be $-\infty$ when action `a`

is infeasible under state `s`

.

The reward array `R`

is an $n \times m$ 2-dimensional array:

In [4]:

```
R = [[5, 10],
[-1, -float('inf')]]
```

The transition probability array `Q`

is an $n \times m \times n$ 3-dimenstional array:

In [5]:

```
Q = [[(0.5, 0.5), (0, 1)],
[(0, 1), (0.5, 0.5)]] # Probabilities in Q[1, 1] are arbitrary
```

Let us set the discount factor $\beta$ to be $0.95$:

In [6]:

```
beta = 0.95
```

We are ready to create a `DiscreteDP`

instance:

In [7]:

```
ddp = DiscreteDP(R, Q, beta)
```

`R`

to be a 1-dimensional array of length `L`

and `Q`

to be a 2-dimensional array of shape `(L, n)`

,
where `L`

is the number of feasible state-action pairs.

`s_indices`

and `a_indices`

of length $3$
contain the indices of states and actions, respectively.

In [8]:

```
s_indices = [0, 0, 1] # State indices
a_indices = [0, 1, 0] # Action indices
```

The reward vector `R`

is a length $L$ 1-dimensional array:

In [9]:

```
# Rewards for (s, a) = (0, 0), (0, 1), (1, 0), respectively
R = [5, 10, -1]
```

The transition probability array `Q`

is an $L \times n$ 2-dimensional array:

In [10]:

```
# Probability vectors for (s, a) = (0, 0), (0, 1), (1, 0), respectively
Q = [(0.5, 0.5), (0, 1), (0, 1)]
```

For the discount factor, set $\beta = 0.95$ as before:

In [11]:

```
beta = 0.95
```

Now create a `DiscreteDP`

instance:

In [12]:

```
ddp_sa = DiscreteDP(R, Q, beta, s_indices, a_indices)
```

`Q`

as a `scipy.sparse`

matrix
(of any format),
which is useful for large and sparse problems.

For example, let us convert the above ndarray `Q`

to the Coordinate (coo) format:

In [13]:

```
import scipy.sparse
Q = scipy.sparse.coo_matrix(Q)
```

Pass it to `DiscreteDP`

with the other parameters:

In [14]:

```
ddp_sparse = DiscreteDP(R, Q, beta, s_indices, a_indices)
```

Internally, the matrix `Q`

is converted to the Compressed Sparse Row (csr) format:

In [15]:

```
ddp_sparse.Q
```

Out[15]:

In [16]:

```
ddp_sparse.Q.toarray()
```

Out[16]:

Now let us solve our model.
Currently, `DiscreteDP`

supports the following solution algorithms:

- policy iteration;
- value iteration;
- modified policy iteration.

(The methods are the same across the formulations.)

We solve the model first by policy iteration, which gives the exact solution:

In [17]:

```
v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))
res = ddp.solve(method='policy_iteration', v_init=v_init)
```

`res`

contains the information about the solution result:

In [18]:

```
res
```

Out[18]:

The optimal policy function:

In [19]:

```
res.sigma
```

Out[19]:

The optimal value function:

In [20]:

```
res.v
```

Out[20]:

This coincides with the analytical solution:

In [21]:

```
v_star(beta)
```

Out[21]:

In [22]:

```
np.allclose(res.v, v_star(beta))
```

Out[22]:

The number of iterations:

In [23]:

```
res.num_iter
```

Out[23]:

Verify that the value of the policy `[0, 0]`

is actually equal to the optimal value `v`

:

In [24]:

```
ddp.evaluate_policy(res.sigma)
```

Out[24]:

In [25]:

```
ddp.evaluate_policy(res.sigma) == res.v
```

Out[25]:

`res.mc`

is the controlled Markov chain given by the optimal policy `[0, 0]`

:

In [26]:

```
res.mc
```

Out[26]:

In [27]:

```
epsilon = 1e-2 # Convergece tolerance, optional(default=1e-3)
v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))
res_vi = ddp.solve(method='value_iteration', v_init=v_init,
epsilon=epsilon)
```

In [28]:

```
res_vi
```

Out[28]:

`res1.sigma`

is an $\varepsilon$-optimal policy,
and the value function `res1.v`

is an $\varepsilon/2$-approximation
of the true optimal value function.

In [29]:

```
np.abs(v_star(beta) - res_vi.v).max()
```

Out[29]:

Finally, solve the model by modified policy iteration:

In [30]:

```
epsilon = 1e-2 # Convergece tolerance, optional(defaul=1e-3)
v_init = [0, 0] # Initial value function, optional(default=max_a r(s, a))
res_mpi = ddp.solve(method='modified_policy_iteration', v_init=v_init,
epsilon=epsilon)
```

In [31]:

```
res_mpi
```

Out[31]:

In [32]:

```
np.abs(v_star(beta) - res_mpi.v).max()
```

Out[32]:

- M.L. Puterman,
*Markov Decision Processes: Discrete Stochastic Dynamic Programming*, Wiley-Interscience, 2005.

In [ ]:

```
```