# Markov Decision Process: Finite horizon

This post contains my notes on the finite horizon Markov decision process, covering lecture 18 of Andrew Ng's lecture series. In my previous two notes ([1], [2]) on Markov decision processes (MDP), only state rewards were considered. We can easily generalize MDP to state-action rewards.

## State-Action Reward

Our reward function is now a function of both states and actions. More precisely, the reward function is a function

$$R: S \times A \to \R.$$

All other requirements in the definition of MDP remain intact. For completeness, we include the definition here: an MDP is a tuple \((S, A, \{P_{sa}\}, \gamma, R)\), where \(S\) is the set of states, \(A\) is the set of actions, \(\{P_{sa}\}\) are the state transition probabilities, \(\gamma \in [0, 1)\) is the discount factor, and \(R: S \times A \to \R\) is the reward function. We shall pay attention to the fifth component: the reward now depends on the action as well as the state.

For each policy \(\pi: S \to A\), the corresponding *Bellman equation* is

$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^\pi(s'),$$

where \(V^\pi: S \to \R\) is the value function for \(\pi\). To find the optimal policy, we may first find the optimal value function \(V^\ast\) using the following *Bellman equation*:

$$V^\ast(s) = \max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} P_{sa}(s') V^\ast(s')\right).$$

With \(V^\ast\) determined, the optimal policy \(\pi^\ast\) is given by

$$\pi^\ast(s) = \arg\max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} P_{sa}(s') V^\ast(s')\right).$$
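
As a concrete illustration, here is a minimal value iteration sketch for an MDP with state-action rewards. The two-state, two-action model below is made-up toy data, not from the lecture.

```python
import numpy as np

# Value iteration for an MDP with state-action rewards.
# The two-state, two-action MDP below is a made-up toy example.
gamma = 0.9

# P[s, a, s'] = transition probability P_{sa}(s')
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
# R[s, a] = reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(1000):
    # Bellman update: V(s) <- max_a [ R(s, a) + gamma * sum_s' P_{sa}(s') V(s') ]
    V_new = (R + gamma * P @ V).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# Recover the optimal policy from V*:
# pi*(s) = argmax_a [ R(s, a) + gamma * sum_s' P_{sa}(s') V*(s') ]
pi = (R + gamma * P @ V).argmax(axis=1)
```

Since the Bellman update is a \(\gamma\)-contraction, the loop converges to \(V^\ast\) regardless of the initial guess.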

## Finite Horizon MDP

We now consider a variant of MDP, the finite horizon MDP. The main difference between an MDP and a finite horizon MDP is that in the latter, time is limited to a finite horizon \(T\), and there is no discount factor.

Given a series of actions \(a_0, a_1, \dots, a_T\) with initial state \(s_0\), i.e.,

$$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots \xrightarrow{a_T} s_{T+1},$$

the total payoff is calculated by

$$R^{(0)}(s_0, a_0) + R^{(1)}(s_1, a_1) + \cdots + R^{(T)}(s_T, a_T).$$

Note that there is no discount factor, and the reward function \(R^{(t)}\) is allowed to depend on \(t\).

By the nature of its definition, the optimal policy and the optimal value function are not stationary in a finite horizon MDP, in the sense that the optimal policy \(\pi^\ast\) (resp. the optimal value function \(V^\ast\)) is a family \(\{\pi^\ast_t: t = 0, \dots, T\}\) (resp. \(\{V^\ast_t: t=0, \dots, T\}\)). Here,

$$V^\ast_t(s) = \max_\pi \mathbb{E}\left[R^{(t)}(s_t, a_t) + \cdots + R^{(T)}(s_T, a_T) \mid s_t = s, \pi\right].$$

Finite horizon MDP can be solved using dynamic programming. Its transition equation from time \(t+1\) to \(t\) is

$$V^\ast_t(s) = \max_{a \in A}\left(R^{(t)}(s, a) + \sum_{s' \in S} P^{(t)}_{sa}(s') V^\ast_{t+1}(s')\right),$$

for \(t = 0, \dots, T\), with boundary condition

$$V^\ast_{T+1}(s) = 0 \quad \text{for all } s \in S.$$

As usual, the optimal policy \(\pi^\ast\) can be recovered from \(V^\ast\). For \(t=0, \dots, T\),

$$\pi^\ast_t(s) = \arg\max_{a \in A}\left(R^{(t)}(s, a) + \sum_{s' \in S} P^{(t)}_{sa}(s') V^\ast_{t+1}(s')\right).$$
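
The backward recursion above can be sketched in code. The time-varying model \(\{P^{(t)}_{sa}\}\) and rewards \(R^{(t)}\) below are randomly generated toy data, not from the lecture.

```python
import numpy as np

# Backward dynamic programming for a finite horizon MDP.
# The time-varying transitions and rewards are random toy data.
rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 5

# P[t, s, a, s'] = P^(t)_{sa}(s'); normalize rows into distributions over s'
P = rng.random((T + 1, n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
# R[t, s, a] = R^(t)(s, a)
R = rng.random((T + 1, n_states, n_actions))

V = np.zeros((T + 2, n_states))            # boundary condition: V*_{T+1} = 0
pi = np.zeros((T + 1, n_states), dtype=int)
for t in range(T, -1, -1):
    # Q[s, a] = R^(t)(s, a) + sum_s' P^(t)_{sa}(s') V*_{t+1}(s')
    Q = R[t] + P[t] @ V[t + 1]
    V[t] = Q.max(axis=1)                   # V*_t
    pi[t] = Q.argmax(axis=1)               # pi*_t
```

One backward sweep suffices: each \(V^\ast_t\) is computed exactly from \(V^\ast_{t+1}\), so there is no fixed-point iteration as in the infinite horizon case.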

## Linear Quadratic Regulation

Linear quadratic regulation (LQR) is a special case of finite horizon MDP. In LQR, we make strong assumptions that make it possible to calculate closed forms of \(\pi^\ast\) and \(V^\ast\). We suppose \(S = \R^n\) and \(A = \R^d\) are both continuous. Then for each \(t\), we assume the linear dynamics

$$s_{t+1} = A_t s_t + B_t a_t + w_t,$$

where \(A_t\) is a known \(n \times n\) matrix, \(B_t\) is a known \(n \times d\) matrix, and the noise term satisfies

$$w_t \sim \mathcal{N}(0, \Sigma_t)$$

for some known covariance matrix \(\Sigma_t\). We also assume the reward function at each time \(t\) is quadratic:

$$R^{(t)}(s_t, a_t) = -\left(s_t^T U_t s_t + a_t^T V_t a_t\right)$$

for some known positive semi-definite matrices \(U_t\) and \(V_t\). In particular, the rewards are always non-positive.

Following the dynamic programming algorithm for finite horizon MDP, we get the following theorem.

**Theorem.** In LQR as above, we have for each \(t\), $$V^\ast_t(s) = s^T \Phi_t s + \Psi_t,$$ and $$\pi^\ast_t(s) = L_t s,$$ where $$L_t = (B_t^T \Phi_{t+1} B_t - V_t)^{-1}B_t^T\Phi_{t+1}A_t.$$ Here \(\Phi_t\) and \(\Psi_t\) are determined by $$\begin{equation} \begin{split} \Phi_t & = A^T_t(\Phi_{t+1}-\Phi_{t+1}B_t(B_t^T\Phi_{t+1}B_t-V_t)^{-1}B_t^T\Phi_{t+1})A_t-U_t \\ \Psi_t & = \mathrm{tr}(\Sigma_t\Phi_{t+1}) + \Psi_{t+1} \end{split} \label{riccati} \end{equation}$$ with boundary conditions \(\Phi_{T+1} = 0\) and \(\Psi_{T+1} = 0\).

**Remark 1.** The equation for \(\Phi_t\) in Equation \(\eqref{riccati}\) is called the discrete time Riccati equation.

**Remark 2.** If we only need \(\pi^\ast\), then \(\Sigma_t\) and \(\Psi_t\) are of no use. In this case, we may simply assume the model \(\{P^{(t)}_{sa}\}\) is deterministic, i.e., \(\Sigma_t = 0\).
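
Per Remark 2, the policy only requires \(\Phi_t\) and \(L_t\), so the sketch below iterates the discrete time Riccati equation without tracking \(\Psi_t\). The matrices \(A_t\), \(B_t\), \(U_t\), \(V_t\) (taken time-invariant here) are made-up toy values.

```python
import numpy as np

# Discrete time Riccati recursion for LQR, run backwards from t = T to 0.
# All matrices are made-up, time-invariant toy data (n = 2 states, d = 1 action).
n, d, T = 2, 1, 10
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])   # dynamics matrix A_t
B = np.array([[0.0],
              [0.1]])        # input matrix B_t
U = np.eye(n)                # state cost matrix U_t (positive semi-definite)
Vc = np.eye(d)               # action cost matrix V_t ("Vc" avoids clashing with a value function V)

Phi = np.zeros((n, n))       # boundary condition: Phi_{T+1} = 0
L = [None] * (T + 1)
for t in range(T, -1, -1):
    # At this point Phi holds Phi_{t+1}.
    M = B.T @ Phi @ B - Vc
    # L_t = (B^T Phi_{t+1} B - V)^{-1} B^T Phi_{t+1} A
    L[t] = np.linalg.solve(M, B.T @ Phi @ A)
    # Riccati update: Phi_t = A^T (Phi - Phi B M^{-1} B^T Phi) A - U
    Phi = A.T @ (Phi - Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - U
# After the loop, Phi holds Phi_0, and pi*_t(s) = L[t] @ s.
```

Note that \(M = B_t^T\Phi_{t+1}B_t - V_t\) is negative definite whenever \(V_t\) is positive definite and \(\Phi_{t+1}\) is negative semi-definite, so the solves are well posed.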