For $0<\beta<1$, consider the finite-state, infinite-horizon discounted MDP in discrete time, given by
$$\begin{align*} &\underset{\gamma\in\Gamma}{\inf}\quad\mathbb{E}\left[\sum_{t=0}^{\infty}\beta^{t}\:c\left(x_{t},u_{t}\right)\right]\\ & x \sim \text{Markov}\left(P(u)\right), \quad [P_{ij}(u)]:= \mathbb{P}\left(x_{t+1} = j \mid x_{t} = i, u_{t} = u\right), \quad i,j=1,...,n, \quad u\in\mathcal{U}. \end{align*}$$Let $\mathcal{F}$ be the class of all functions $z:\{1,...,n\} \to \mathbb{R}$, and define the nonlinear Bellman operator $T : \mathcal{F} \to \mathcal{F}$ as
$$(Tz)(i):=\underset{u\in\mathcal{U}}{\inf}\quad\bigg\{c(i,u) + \beta \sum_{j=1}^{n}P_{ij}(u)z(j)\bigg\}, \quad i=1,...,n.$$Let $h_{0}(i) := 0$ for $i=1,...,n$, and consider the $k$-th iterate of value iteration: $h_{k}=T^{k}h_{0}$. Here, $T^{k}$ denotes the $k$-fold composition of the Bellman operator, i.e., $T^{k}\equiv\underbrace{T \circ T \circ ... \circ T}_{k\;\text{times}}$.
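As a numerical companion to these definitions, here is a minimal Python sketch of the operator $T$ and of value iteration on a randomly generated toy MDP. The problem data and the name `bellman_operator` are illustrative assumptions, not part of the exercise.

```python
import numpy as np

# Toy problem data (illustrative, not from the exercise): n states, |U| actions.
rng = np.random.default_rng(0)
n, n_actions, beta = 4, 3, 0.9

P = rng.random((n_actions, n, n))
P /= P.sum(axis=2, keepdims=True)   # normalize rows: each P[u] is a stochastic matrix
c = rng.random((n, n_actions))      # c[i, u] = stage cost c(i, u)

def bellman_operator(z):
    """Apply T: (Tz)(i) = min_u { c(i, u) + beta * sum_j P_ij(u) z(j) }."""
    q = c + beta * np.einsum("uij,j->iu", P, z)   # q[i, u] = c(i, u) + beta (P(u) z)(i)
    return q.min(axis=1)

# Value iteration: h_k = T^k h_0, starting from h_0 = 0.
h = np.zeros(n)
for _ in range(100):
    h = bellman_operator(h)          # the iterates converge geometrically at rate beta
```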
Let the value function associated with the $k$-stage finite horizon OCP be
$$\begin{align*}W_{k}(i) \: := \: &\underset{\gamma\in\Gamma}{\inf}\quad\mathbb{E}\left[\sum_{t=0}^{k-1}\beta^{t}\:c\left(x_{t},u_{t}\right) \:\bigg\vert\: x_{0}=i\right]\\ & x \sim \text{Markov}\left(P(u)\right), \quad [P_{ij}(u)]:= \mathbb{P}\left(x_{t+1} = j \mid x_{t} = i, u_{t} = u\right), \quad i,j=1,...,n, \quad u\in\mathcal{U}. \end{align*}$$(a) Carefully prove that $h_{k}(i)=W_{k}(i)$, $i=1,...,n$. [Hint: Induction]
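Before attempting the proof, the claim can be checked numerically. The sketch below, under illustrative toy data, computes $h_{k}=T^{k}h_{0}$ and, separately, $W_{k}$ by brute-force enumeration over all $k$-stage deterministic Markov policies (for a finite MDP such policies attain the infimum over $\Gamma$, in the spirit of the result being proved); the two vectors agree to machine precision.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, n_actions, beta, k = 3, 2, 0.9, 3   # small enough to enumerate every policy

P = rng.random((n_actions, n, n))
P /= P.sum(axis=2, keepdims=True)      # each P[u] is row-stochastic
c = rng.random((n, n_actions))

def T(z):
    return (c + beta * np.einsum("uij,j->iu", P, z)).min(axis=1)

# h_k = T^k h_0, with h_0 = 0.
h = np.zeros(n)
for _ in range(k):
    h = T(h)

# W_k by brute force: a k-stage deterministic Markov policy is a tuple
# (mu_0, ..., mu_{k-1}); each decision rule mu_t maps state i to an action.
decision_rules = list(itertools.product(range(n_actions), repeat=n))
W = np.full(n, np.inf)
for policy in itertools.product(decision_rules, repeat=k):
    v = np.zeros(n)
    for mu in reversed(policy):                 # backward cost evaluation
        mu = np.asarray(mu)
        P_mu = P[mu, np.arange(n), :]           # row i of P(mu(i))
        c_mu = c[np.arange(n), mu]              # c(i, mu(i))
        v = c_mu + beta * P_mu @ v
    W = np.minimum(W, v)                        # pointwise inf over policies

print(np.max(np.abs(h - W)))                    # ~1e-16: h_k = W_k numerically
```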
(b) Give a clear physical interpretation of the mathematical result derived in part (a).
The purpose of this exercise is to revisit the finite-horizon LQR in discrete time via dynamic programming. To recap the treatment via PMP, please review Lecture 9, slides 18-22. The problem considered is as follows:
$$\begin{align*} &\underset{\{\gamma_{t}\}_{t=0}^{N-1}\in\Gamma}{\inf}\:\mathbb{E}\left[\frac{1}{2}\bigg\{x_{N}^{\top}Mx_{N} + \sum_{t=0}^{N-1}\left(x_{t}^{\top}Qx_{t} + u_{t}^{\top}Ru_{t}\right)\bigg\}\right]\\ & x_{t+1} = Ax_{t} + Bu_{t}, \quad x_{0}\;\text{given}, \quad x\in\mathbb{R}^{n}, \quad u\in\mathbb{R}^{m}, \end{align*}$$wherein, as usual, $M,Q \succeq 0$, $R \succ 0$, and $\Gamma$ is the set of all (time-varying) history-dependent randomized policies; the expectation accounts for the randomness such policies may inject.
Let the value function $V_{k}$ be defined as
$$V_{k}(x) := \underset{\{\gamma_{t}\}_{t=k}^{N-1}\in\Gamma}{\inf}\:\mathbb{E}\left[\frac{1}{2}\bigg\{x_{N}^{\top}Mx_{N} + \sum_{t=k}^{N-1}\left(x_{t}^{\top}Qx_{t} + u_{t}^{\top}Ru_{t}\right)\bigg\}\:\bigg\vert\: x_{k}=x\right].$$(a.1) What is the physical interpretation of $V_{k}(x)$? What is the physical interpretation of $V_{0}(x_{0})$?
(a.2) Write down the dynamic programming equation.
(b.1) To solve the recursion derived in (a.2), start with the guess that the value function $V_{k+1}(x) = \frac{1}{2}x^{\top}P_{k+1} x$ for some $P_{k+1} \succeq 0$. Then prove that the value function $V_{k}(x)$ has the same structural form, i.e., $V_{k}(x) = \frac{1}{2}x^{\top}P_{k} x$ for some matrix $P_{k}$, expressed as an explicit function of $P_{k+1}$. From there, prove that $P_{k} \succeq 0$.
(b.2) Use your answer in part (b.1) to show that the optimal policy is a time-varying linear state feedback. Is this conclusion the same as, or different from, what we derived via PMP in Lecture 9? Explain.
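For self-checking (b.1)-(b.2) numerically, here is a minimal sketch assuming the recursion from (b.1) takes the standard discrete-time Riccati form $P_{k} = Q + A^{\top}P_{k+1}A - A^{\top}P_{k+1}B\left(R + B^{\top}P_{k+1}B\right)^{-1}B^{\top}P_{k+1}A$ with $P_{N}=M$; the problem data below is illustrative.

```python
import numpy as np

# Illustrative data (not from the lecture): double integrator, n = 2, m = 1.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q, R, M = np.eye(2), np.array([[1.0]]), np.eye(2)
N = 20

# Backward Riccati recursion: P_N = M and, for k = N-1, ..., 0,
#   K_k = (R + B^T P_{k+1} B)^{-1} B^T P_{k+1} A,
#   P_k = Q + A^T P_{k+1} (A - B K_k)   (an equivalent form of the recursion above).
P = [None] * (N + 1)
K = [None] * N
P[N] = M
for k in range(N - 1, -1, -1):
    K[k] = np.linalg.solve(R + B.T @ P[k + 1] @ B, B.T @ P[k + 1] @ A)
    P[k] = Q + A.T @ P[k + 1] @ (A - B @ K[k])

# Time-varying linear state feedback u_k = -K_k x_k; closed-loop cost check.
x0 = np.array([1.0, 0.0])
x, cost = x0.copy(), 0.0
for k in range(N):
    u = -K[k] @ x
    cost += 0.5 * (x @ Q @ x + u @ R @ u)
    x = A @ x + B @ u
cost += 0.5 * x @ M @ x
print(cost, 0.5 * x0 @ P[0] @ x0)   # the two numbers should coincide
```

The agreement of the two printed numbers illustrates both claims: the gains $K_{k}$ are time-varying, and $V_{0}(x_{0})=\frac{1}{2}x_{0}^{\top}P_{0}x_{0}$ is attained by the linear feedback.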