# Markov Decision Process: Finite horizon

This post contains my notes on the finite horizon Markov decision process, covering lecture 18 of Andrew Ng's lecture series. In my previous two notes ([1], [2]) on Markov decision processes (MDP), only state rewards were considered. We can easily generalize MDP to state-action rewards.

## State-Action Reward

Our reward function is now a function of both states and actions. More precisely, the reward function is a function

$$R: S \times A \to \R.$$

All other requirements in the definition of MDP remain intact. For completeness, we include the definition here: an MDP is a tuple \((S, A, \{P_{sa}\}, \gamma, R)\), where \(S\) is the set of states, \(A\) is the set of actions, \(\{P_{sa}\}\) are the state transition probabilities, \(\gamma \in [0, 1)\) is the discount factor, and \(R: S \times A \to \R\) is the reward function. We shall pay attention to the fifth component: the reward now depends on the action as well as the state.

For each policy \(\pi: S \to A\), the corresponding *Bellman equation* is

$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^\pi(s'),$$

where \(V^\pi: S \to \R\) is the value function for \(\pi\). To find the optimal policy, we may first find the optimal value function \(V^\ast\) using the following *Bellman equation*:

$$V^\ast(s) = \max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} P_{sa}(s') V^\ast(s')\right).$$

With \(V^\ast\) determined, the optimal policy \(\pi^\ast\) is given by

$$\pi^\ast(s) = \arg\max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} P_{sa}(s') V^\ast(s')\right).$$
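
As a concrete illustration, here is a minimal value iteration sketch for an MDP with state-action rewards. The two-state, two-action model below is made-up toy data, not from the lecture.

```python
import numpy as np

# Value iteration for an MDP with state-action rewards.
# The two-state, two-action MDP below is a made-up toy example.
gamma = 0.9

# P[s, a, s'] = transition probability P_{sa}(s')
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
# R[s, a] = reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(1000):
    # Bellman update: V(s) <- max_a [ R(s, a) + gamma * sum_s' P_{sa}(s') V(s') ]
    V_new = (R + gamma * P @ V).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# Recover the optimal policy from V*:
# pi*(s) = argmax_a [ R(s, a) + gamma * sum_s' P_{sa}(s') V*(s') ]
pi = (R + gamma * P @ V).argmax(axis=1)
```

Since the Bellman update is a \(\gamma\)-contraction, the loop converges to \(V^\ast\) regardless of the initial guess.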

## Finite Horizon MDP

We now consider a variant of MDP, the finite horizon MDP. The main difference between an MDP and a finite horizon MDP is that in the latter, time is limited to a finite horizon \(T\), and there is no discount factor.

Given a series of actions \(a_0, a_1, \dots, a_T\) with initial state \(s_0\), i.e.,

$$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots \xrightarrow{a_T} s_{T+1},$$

the total payoff is calculated by

$$R^{(0)}(s_0, a_0) + R^{(1)}(s_1, a_1) + \cdots + R^{(T)}(s_T, a_T).$$

Note that there is no discount factor, and the reward function \(R^{(t)}\) is allowed to depend on \(t\).

By the nature of its definition, the optimal policy and the optimal value function are not stationary in a finite horizon MDP, in the sense that the optimal policy \(\pi^\ast\) (resp. the optimal value function \(V^\ast\)) is a family \(\{\pi^\ast_t: t = 0, \dots, T\}\) (resp. \(\{V^\ast_t: t=0, \dots, T\}\)). Here,

$$V^\ast_t(s) = \max_\pi \mathbb{E}\left[R^{(t)}(s_t, a_t) + \cdots + R^{(T)}(s_T, a_T) \mid s_t = s, \pi\right].$$

Finite horizon MDP can be solved using dynamic programming. Its transition equation from time \(t+1\) to \(t\) is

$$V^\ast_t(s) = \max_{a \in A}\left(R^{(t)}(s, a) + \sum_{s' \in S} P^{(t)}_{sa}(s') V^\ast_{t+1}(s')\right),$$

for \(t = 0, \dots, T\), with boundary condition

$$V^\ast_{T+1}(s) = 0 \quad \text{for all } s \in S.$$

As usual, the optimal policy \(\pi^\ast\) can be recovered from \(V^\ast\). For \(t=0, \dots, T\),

$$\pi^\ast_t(s) = \arg\max_{a \in A}\left(R^{(t)}(s, a) + \sum_{s' \in S} P^{(t)}_{sa}(s') V^\ast_{t+1}(s')\right).$$
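
The backward recursion above can be sketched in code. The time-varying model \(\{P^{(t)}_{sa}\}\) and rewards \(R^{(t)}\) below are randomly generated toy data, not from the lecture.

```python
import numpy as np

# Backward dynamic programming for a finite horizon MDP.
# The time-varying transitions and rewards are random toy data.
rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 5

# P[t, s, a, s'] = P^(t)_{sa}(s'); normalize rows into distributions over s'
P = rng.random((T + 1, n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
# R[t, s, a] = R^(t)(s, a)
R = rng.random((T + 1, n_states, n_actions))

V = np.zeros((T + 2, n_states))            # boundary condition: V*_{T+1} = 0
pi = np.zeros((T + 1, n_states), dtype=int)
for t in range(T, -1, -1):
    # Q[s, a] = R^(t)(s, a) + sum_s' P^(t)_{sa}(s') V*_{t+1}(s')
    Q = R[t] + P[t] @ V[t + 1]
    V[t] = Q.max(axis=1)                   # V*_t
    pi[t] = Q.argmax(axis=1)               # pi*_t
```

One backward sweep suffices: each \(V^\ast_t\) is computed exactly from \(V^\ast_{t+1}\), so there is no fixed-point iteration as in the infinite horizon case.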

## Linear Quadratic Regulation

Linear quadratic regulation (LQR) is a special case of finite horizon MDP. In LQR, we make strong assumptions that make it possible to calculate closed forms of \(\pi^\ast\) and \(V^\ast\). We suppose \(S = \R^n\) and \(A = \R^d\) are both continuous. Then for each \(t\), we assume the linear dynamics

$$s_{t+1} = A_t s_t + B_t a_t + w_t,$$

where \(A_t\) is a known \(n \times n\) matrix, \(B_t\) is a known \(n \times d\) matrix, and the noise term satisfies

$$w_t \sim \mathcal{N}(0, \Sigma_t)$$

for some known covariance matrix \(\Sigma_t\). We also assume the reward function at each time \(t\) is quadratic:

$$R^{(t)}(s_t, a_t) = -\left(s_t^T U_t s_t + a_t^T V_t a_t\right)$$

for some known positive semi-definite matrices \(U_t\) and \(V_t\). In particular, the rewards are always non-positive.

Following the dynamic programming algorithm for finite horizon MDP, we get the following theorem.

**Theorem.** In LQR as above, we have for each \(t\), $$V^\ast_t(s) = s^T \Phi_t s + \Psi_t,$$ and $$\pi^\ast_t(s) = L_t s,$$ where $$L_t = (B_t^T \Phi_{t+1} B_t - V_t)^{-1}B_t^T\Phi_{t+1}A_t.$$ Here \(\Phi_t\) and \(\Psi_t\) are determined by $$\begin{equation} \begin{split} \Phi_t & = A^T_t(\Phi_{t+1}-\Phi_{t+1}B_t(B_t^T\Phi_{t+1}B_t-V_t)^{-1}B_t^T\Phi_{t+1})A_t-U_t \\ \Psi_t & = \mathrm{tr}(\Sigma_t\Phi_{t+1}) + \Psi_{t+1} \end{split} \label{riccati} \end{equation}$$ with boundary conditions \(\Phi_{T+1} = 0\) and \(\Psi_{T+1} = 0\).

**Remark 1.** The equation for \(\Phi_t\) in Equation \(\eqref{riccati}\) is called the discrete time Riccati equation.

**Remark 2.** If we only need \(\pi^\ast\), then \(\Sigma_t\) and \(\Psi_t\) are of no use. In this case, we may simply assume the model \(\{P^{(t)}_{sa}\}\) is deterministic, i.e., \(\Sigma_t = 0\).
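
Per Remark 2, the policy only requires \(\Phi_t\) and \(L_t\), so the sketch below iterates the discrete time Riccati equation without tracking \(\Psi_t\). The matrices \(A_t\), \(B_t\), \(U_t\), \(V_t\) (taken time-invariant here) are made-up toy values.

```python
import numpy as np

# Discrete time Riccati recursion for LQR, run backwards from t = T to 0.
# All matrices are made-up, time-invariant toy data (n = 2 states, d = 1 action).
n, d, T = 2, 1, 10
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])   # dynamics matrix A_t
B = np.array([[0.0],
              [0.1]])        # input matrix B_t
U = np.eye(n)                # state cost matrix U_t (positive semi-definite)
Vc = np.eye(d)               # action cost matrix V_t ("Vc" avoids clashing with a value function V)

Phi = np.zeros((n, n))       # boundary condition: Phi_{T+1} = 0
L = [None] * (T + 1)
for t in range(T, -1, -1):
    # At this point Phi holds Phi_{t+1}.
    M = B.T @ Phi @ B - Vc
    # L_t = (B^T Phi_{t+1} B - V)^{-1} B^T Phi_{t+1} A
    L[t] = np.linalg.solve(M, B.T @ Phi @ A)
    # Riccati update: Phi_t = A^T (Phi - Phi B M^{-1} B^T Phi) A - U
    Phi = A.T @ (Phi - Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - U
# After the loop, Phi holds Phi_0, and pi*_t(s) = L[t] @ s.
```

Note that \(M = B_t^T\Phi_{t+1}B_t - V_t\) is negative definite whenever \(V_t\) is positive definite and \(\Phi_{t+1}\) is negative semi-definite, so the solves are well posed.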