Markov Decision Process: Finite Horizon
This post contains my notes on finite horizon Markov decision processes, based on lecture 18 of Andrew Ng's lecture series. In my previous two notes ([1], [2]) on Markov decision processes (MDPs), only state rewards were considered. We can easily generalize MDPs to state-action rewards.
State-Action Reward
Our reward function is now a function of both states and actions. More precisely, the reward function is a map
\[ R: S \times A \to \R. \]
All other requirements in the definition of an MDP remain intact. For completeness, we include the definition here; pay attention to the fifth component. An MDP with state-action rewards is a tuple \((S, A, \{P_{sa}\}, \gamma, R)\), where \(S\) is the set of states, \(A\) is the set of actions, \(P_{sa}\) is the state transition distribution for taking action \(a\) in state \(s\), \(\gamma \in [0, 1)\) is the discount factor, and \(R: S \times A \to \R\) is the reward function.
For each policy \(\pi: S \to A\), the corresponding Bellman equation is
\[ V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^\pi(s'), \]
where \(V^\pi: S \to \R\) is the value function for \(\pi\). To find the optimal policy, we can first find the optimal value function \(V^\ast\) from the following Bellman equation:
\[ V^\ast(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P_{sa}(s') V^\ast(s') \right]. \]
With \(V^\ast\) determined, the optimal policy \(\pi^\ast\) is given by
\[ \pi^\ast(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P_{sa}(s') V^\ast(s') \right]. \]
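To make these equations concrete, here is a minimal value iteration sketch for the tabular case with state-action rewards. The array layout, the tolerance, and the name `value_iteration` are my own illustrative choices, not from the lecture.

```python
import numpy as np

def value_iteration(R, P, gamma, tol=1e-8):
    """Tabular value iteration with state-action rewards.

    R : (n_states, n_actions) array with R[s, a] = R(s, a).
    P : (n_states, n_actions, n_states) array with P[s, a, s'] = P_sa(s').
    Returns the optimal value function V* and a greedy policy pi*.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_{s'} P_sa(s') V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

For example, `V, pi = value_iteration(R, P, 0.95)` recovers \(V^\ast\) and a greedy \(\pi^\ast\) for a tabular model.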
Finite Horizon MDP
We now consider a variant of the MDP, the finite horizon MDP. The main difference between the two is that time is limited in a finite horizon MDP: the process runs only for the time steps \(t = 0, 1, \dots, T\), where \(T\) is the time horizon.
Given a sequence of actions taken from an initial state \(s_0\), i.e.,
\[ s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots \xrightarrow{a_{T-1}} s_T, \]
together with a final action \(a_T\) taken in \(s_T\), the total payoff is calculated by
\[ R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T). \]
Note that there is no discount factor \(\gamma\): since the horizon is finite, the sum is finite even without discounting.
By its very definition, the optimal policy and the optimal value function of a finite horizon MDP are not stationary, in the sense that the optimal policy \(\pi^\ast\) (resp. the optimal value function \(V^\ast\)) is a family \(\{\pi^\ast_t: t = 0, \dots, T\}\) (resp. \(\{V^\ast_t: t = 0, \dots, T\}\)): the best action in a state may depend on how much time is left. Here, \(V^\ast_t(s)\) is the optimal expected total payoff from time \(t\) onward when starting in state \(s\),
\[ V^\ast_t(s) = \max_\pi \operatorname{E}\left[ \sum_{k=t}^{T} R(s_k, a_k) \,\middle|\, s_t = s, \pi \right], \]
and \(\pi^\ast_t(s)\) is an action attaining this maximum.
Finite horizon MDPs can be solved using dynamic programming. The transition equation from time \(t+1\) to time \(t\) is
\[ V^\ast_t(s) = \max_{a \in A} \left[ R(s, a) + \sum_{s' \in S} P^{(t)}_{sa}(s') V^\ast_{t+1}(s') \right] \]
for \(t = 0, \dots, T-1\), with boundary condition
\[ V^\ast_T(s) = \max_{a \in A} R(s, a). \]
As usual, the optimal policy \(\pi^\ast\) can be recovered from \(V^\ast\). For \(t = 0, \dots, T-1\),
\[ \pi^\ast_t(s) = \arg\max_{a \in A} \left[ R(s, a) + \sum_{s' \in S} P^{(t)}_{sa}(s') V^\ast_{t+1}(s') \right], \]
and \(\pi^\ast_T(s) = \arg\max_{a \in A} R(s, a)\).
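As a sanity check on the recursion above, here is a short sketch of the backward dynamic programming pass for a tabular finite horizon MDP. The array shapes and the name `finite_horizon_dp` are illustrative assumptions rather than anything from the lecture; for simplicity the reward is taken to be time-independent.

```python
import numpy as np

def finite_horizon_dp(R, P, T):
    """Backward dynamic programming for a tabular finite horizon MDP.

    R : (n_states, n_actions) array with R[s, a] = R(s, a).
    P : list of T arrays of shape (n_states, n_actions, n_states),
        P[t][s, a, s'] = P^{(t)}_sa(s').
    Returns lists V and pi with V[t] = V*_t and pi[t] = pi*_t.
    """
    V = [None] * (T + 1)
    pi = [None] * (T + 1)

    # Boundary condition: V*_T(s) = max_a R(s, a).
    V[T] = R.max(axis=1)
    pi[T] = R.argmax(axis=1)

    # Recurse backward from t = T - 1 down to t = 0.
    for t in range(T - 1, -1, -1):
        Q = R + P[t] @ V[t + 1]      # Q[s, a] at time t
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```

Unlike the stationary case, the whole family of policies \(\pi^\ast_0, \dots, \pi^\ast_T\) has to be stored, one per time step.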
Linear Quadratic Regulation
Linear quadratic regulation (LQR) is a special case of the finite horizon MDP. In LQR, we make strong assumptions that make it possible to compute \(\pi^\ast\) and \(V^\ast\) in closed form. We suppose \(S = \R^n\) and \(A = \R^d\) are both continuous. Then for each \(t\), we assume
\[ s_{t+1} = A_t s_t + B_t a_t + w_t, \]
where \(A_t\) is a known \(n \times n\) matrix, \(B_t\) is a known \(n \times d\) matrix, and
\[ w_t \sim \mathcal{N}(0, \Sigma_t) \]
for some known covariance matrix \(\Sigma_t\). We also assume the reward function at each time \(t\) is of the form
\[ R^{(t)}(s_t, a_t) = -\left( s_t^T U_t s_t + a_t^T V_t a_t \right) \]
for some known positive semi-definite matrices \(U_t\) and \(V_t\). In particular, the reward is always non-positive: it penalizes large states and large actions.
Following the dynamic programming algorithm for finite horizon MDPs, we find that each \(V^\ast_t\) is a quadratic function of the state,
\[ V^\ast_t(s) = s^T \Phi_t s + \Psi_t, \]
for some matrix \(\Phi_t \in \R^{n \times n}\) and scalar \(\Psi_t \in \R\). The boundary condition \(V^\ast_T(s) = \max_a R^{(T)}(s, a) = -s^T U_T s\) gives \(\Phi_T = -U_T\) and \(\Psi_T = 0\). For \(t = T-1, \dots, 0\), the dynamic programming step yields
\begin{equation}\label{riccati}
\Phi_t = A_t^T \left( \Phi_{t+1} - \Phi_{t+1} B_t \left( B_t^T \Phi_{t+1} B_t - V_t \right)^{-1} B_t^T \Phi_{t+1} \right) A_t - U_t,
\end{equation}
\[ \Psi_t = \Psi_{t+1} + \operatorname{tr}\left( \Sigma_t \Phi_{t+1} \right). \]
Moreover, the optimal policy is linear in the state:
\[ \pi^\ast_t(s) = L_t s, \quad \text{where } L_t = \left( V_t - B_t^T \Phi_{t+1} B_t \right)^{-1} B_t^T \Phi_{t+1} A_t. \]
Remark 1. The equation for \(\Phi_t\) in Equation \(\eqref{riccati}\) is called the discrete time Riccati equation.
Remark 2. If we only need \(\pi^\ast\), then \(\Sigma_t\) and \(\Psi_t\) are of no use: the gain \(L_t\) depends only on \(\Phi_{t+1}\), and the Riccati recursion \(\eqref{riccati}\) for \(\Phi_t\) involves neither \(\Sigma_t\) nor \(\Psi_t\). In this case, we might as well assume the model \(\{P^{(t)}_{sa}\}\) is deterministic, i.e., \(s_{t+1} = A_t s_t + B_t a_t\).
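To tie the formulas together, here is a sketch of the backward LQR recursion under the assumptions above, dropping the noise as Remark 2 allows. The function name `lqr_backward` and the list-of-matrices interface are my own conventions for illustration.

```python
import numpy as np

def lqr_backward(A, B, U, V, T):
    """Backward recursion for a finite horizon LQR problem.

    A : list of T matrices of shape (n, n), the A_t.
    B : list of T matrices of shape (n, d), the B_t.
    U : list of T + 1 PSD matrices of shape (n, n).
    V : list of T + 1 PSD matrices of shape (d, d).
    Returns the gains L with pi*_t(s) = L[t] @ s, and the matrices Phi.
    """
    n, d = B[0].shape
    Phi = [None] * (T + 1)
    L = [None] * (T + 1)

    # Boundary condition: V*_T(s) = -s^T U_T s, attained at a_T = 0.
    Phi[T] = -U[T]
    L[T] = np.zeros((d, n))

    for t in range(T - 1, -1, -1):
        # M = B_t^T Phi_{t+1} B_t - V_t, assumed invertible
        # (e.g. when V_t is positive definite).
        M = B[t].T @ Phi[t + 1] @ B[t] - V[t]
        K = np.linalg.solve(M, B[t].T @ Phi[t + 1])   # M^{-1} B_t^T Phi_{t+1}
        # Discrete time Riccati equation for Phi_t.
        Phi[t] = A[t].T @ (Phi[t + 1] - Phi[t + 1] @ B[t] @ K) @ A[t] - U[t]
        # Linear policy: L_t = (V_t - B_t^T Phi_{t+1} B_t)^{-1} B_t^T Phi_{t+1} A_t.
        L[t] = -K @ A[t]
    return L, Phi
```

Using `np.linalg.solve` instead of forming \(\left( V_t - B_t^T \Phi_{t+1} B_t \right)^{-1}\) explicitly is just the usual numerically safer choice.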