I wrote this post for lecture 17 in Andrew Ng's lecture collection on Machine Learning. In my previous post, we discussed the Markov Decision Process (MDP) in its simplest form, where the set of states and the set of actions are both finite. But in real-world applications, states and actions can be infinite and even continuous. For example, if we want to model the state of a self-driving car in a 2D plane, we need at least its position \((x, y)\), the direction \(\theta\) the car is pointing in, its velocity \((v_x, v_y)\), and the rate \(r\) of change of \(\theta\). So the state of a car lives in a space of dimension at least 6. As for actions, we can control how fast the car goes in direction \(\theta\), and we can also control \(r\), so the action space has dimension 2.

In this post, we consider only continuous states with finite actions. Indeed, the action space usually has a much lower dimension than the state space, so if the actions are continuous, we can simply discretize the action space to get a finite set of representative actions. One may argue that we could also discretize the state space. We can, but only when the dimension \(n\) of the state space is small enough: if we discretize each dimension into \(k\) parts, then there would be \(k^n\) states. For instance, with \(n = 6\) and \(k = 100\) we would already face \(10^{12}\) states, which is not feasible. This is the so-called curse of dimensionality. Moreover, discretizing the state space usually results in a lack of smoothness.

Definition of MDP

Let's recall the set up of MDP.

A Markov Decision Process (MDP) consists of five components:

  1. a set of states \(S\);
  2. a set of actions \(A\);
  3. state transition distributions \(P_{sa}\) for each \(s \in S\) and \(a \in A\), where \(P_{sa}(s^\prime)\) is the probability of changing from state \(s\) to state \(s^\prime\) when taking action \(a\);
  4. discount factor \(\gamma \in [0, 1)\);
  5. reward function \(R: S \to \R\).

Model

Since we have continuously many states, we need to take some care with the infinite family of transition distributions \(P_{sa}\). We are going to assume that we have a model (or simulator), so that given any state \(s \in S\) and action \(a \in A\), we can obtain \(s^\prime \sim P_{sa}\). In the example of the self-driving car, given state \(s\) and action \(a\), we can compute the next state \(s^\prime\) from physics; this is a deterministic model. On the other hand, if we have a piece of code that takes \((s, a)\) as input and returns a sample \(s^\prime \sim P_{sa}\) as output, it is called a stochastic model.

We can use a machine learning algorithm to approximate a model if we don't know it in advance. First of all, starting from some initial states, we can take successive actions (for example, at random) and record \(m\) trajectories:

$$\begin{split} & s_0^{(1)} \xrightarrow{a_0^{(1)}} s_1^{(1)} \xrightarrow{a_1^{(1)}} s_2^{(1)} \to \cdots \to s_T^{(1)} \\ & s_0^{(2)} \xrightarrow{a_0^{(2)}} s_1^{(2)} \xrightarrow{a_1^{(2)}} s_2^{(2)} \to \cdots \to s_T^{(2)} \\ & \vdots \\ & s_0^{(m)} \xrightarrow{a_0^{(m)}} s_1^{(m)} \xrightarrow{a_1^{(m)}} s_2^{(m)} \to \cdots \to s_T^{(m)} \end{split}$$

The next state \(s_{t+1}\) is a function of the current state \(s_t\) and the current action \(a_t\). If we think a linear model is reasonable, we can set

$$s_{t+1} = As_t + Ba_t,$$


for some matrices \(A\) and \(B\). Then we learn \(A\) and \(B\) from the above sample data:

$$\mathrm{argmin}_{A, B} \sum_{i=1}^m \sum_{t=0}^{T-1} \left\|s^{(i)}_{t+1}-\left(As^{(i)}_t+Ba_t^{(i)}\right)\right\|^2.$$
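This is an ordinary least-squares problem, so \(A\) and \(B\) can be recovered in closed form. Below is a minimal numpy sketch; the array shapes and the name `fit_linear_model` are my own illustrative choices, not from the lecture.

```python
import numpy as np

def fit_linear_model(states, actions):
    """Least-squares fit of A and B in s_{t+1} = A s_t + B a_t.

    states  : array of shape (m, T + 1, n) -- m trajectories of T + 1 states
    actions : array of shape (m, T, d)     -- the corresponding T actions
    Returns (A, B) of shapes (n, n) and (n, d).
    """
    m, T, d = actions.shape
    n = states.shape[2]

    # Stack every transition (s_t, a_t) -> s_{t+1} into one regression problem.
    X = np.concatenate([states[:, :T, :].reshape(-1, n),   # current states s_t
                        actions.reshape(-1, d)], axis=1)   # current actions a_t
    Y = states[:, 1:, :].reshape(-1, n)                    # next states s_{t+1}

    # Solve min_W ||X W - Y||^2; the transpose of W is the block matrix [A  B].
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W[:n, :].T, W[n:, :].T
```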

Once we learn \(A\) and \(B\), we can have a deterministic model

$$s_{t+1} = As_t + Ba_t,$$


or a stochastic model

$$s_{t+1} = As_t + Ba_t + \epsilon_t,$$


for some \(\epsilon_t \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) is a fixed covariance matrix.
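Continuing the sketch above, the fitted matrices immediately give a simulator. Passing a covariance matrix \(\Sigma\) (fixed in advance, or estimated from the regression residuals) turns the deterministic model into the stochastic one; the helper name `make_model` is again illustrative.

```python
def make_model(A, B, Sigma=None, rng=np.random.default_rng(0)):
    """Return a simulator step(s, a) built from the fitted A and B.

    With Sigma=None the model is deterministic: s' = A s + B a.
    Otherwise Gaussian noise with covariance Sigma is added, giving the
    stochastic model s' = A s + B a + eps with eps ~ N(0, Sigma).
    """
    def step(s, a):
        s_next = A @ s + B @ a
        if Sigma is not None:
            s_next = s_next + rng.multivariate_normal(np.zeros(len(s_next)), Sigma)
        return s_next
    return step
```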

Fitted value iteration

We will first approximate the optimal value function \(V\), assuming that \(V\) depends linearly on selected features \(\phi(s)\) of \(s\). Here \(\phi: S = \R^n \to \R^N\) is a map from the state space to a feature space. For example, we can take \(\phi(s)=s\). Therefore, we can write

$$V(s) = \theta^T \cdot \phi(s),$$


for some parameter \(\theta \in \R^N\). Next, we are going to introduce an algorithm called fitted value iteration to learn \(\theta\):

sample \(\{s^{(1)}, \cdots, s^{(m)}\} \subset S\) randomly
initialize \(\theta = 0\)
repeat {
for i = 1, ..., m {
for each action a in A {
sample \(s_1^\prime, \dots, s_k^\prime \sim P_{s^{(i)}, a}\) using model
let \(q(a) = \frac{1}{k} \sum_{j=1}^k\left[R(s^{(i)})+\gamma V(s_j^\prime)\right]\)
}
set \(y^{(i)} = \max_{a \in A} q(a)\)
}
update \(\theta = \mathrm{argmin}_{\theta} \frac{1}{2}\sum_{i=1}^m \left(\theta^T \cdot \phi(s^{(i)})-y^{(i)}\right)^2\)
}
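The pseudocode translates almost line by line into numpy. The following is a sketch assuming a finite list of actions, a reward function `R`, a simulator `step(s, a)` such as the one built above, and the identity features \(\phi(s) = s\); all names and default values are illustrative choices.

```python
import numpy as np

def fitted_value_iteration(samples, actions, R, step, gamma,
                           phi=lambda s: s, k=1, iters=50):
    """Fitted value iteration with a linear value function V(s) = theta^T phi(s).

    samples : list of m sampled states s^(i)
    actions : finite list of actions
    R       : reward function R(s)
    step    : simulator; step(s, a) returns a sample s' ~ P_{sa}
    k       : samples per (state, action) pair (k = 1 for a deterministic model)
    """
    Phi = np.stack([phi(s) for s in samples])   # (m, N) design matrix
    theta = np.zeros(Phi.shape[1])

    for _ in range(iters):
        y = np.empty(len(samples))
        for i, s in enumerate(samples):
            # q(a) estimates R(s) + gamma * E_{s' ~ P_{sa}}[V(s')]
            q = [np.mean([R(s) + gamma * theta @ phi(step(s, a))
                          for _ in range(k)])
                 for a in actions]
            y[i] = max(q)                        # y^(i) = max_a q(a)
        # Supervised step: fit theta so that theta^T phi(s^(i)) ~ y^(i).
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```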

Remarks. (1) \(q(a)\) is an estimate of \(R(s^{(i)})+\gamma E_{s^\prime \sim P_{s^{(i)}, a}}[V(s^\prime)]\). So if we have a deterministic model, we can just take \(k\) to be \(1\).

(2) \(y^{(i)}\) is an estimate of \(R(s^{(i)})+\gamma \max_a E_{s^\prime \sim P_{s^{(i)}, a}}[V(s^\prime)]\).

(3) We update \(\theta\) so that \(V(s^{(i)}) \approx y^{(i)}\).

(4) Once we know \(V\) (i.e., \(\theta\) is determined from the algorithm above), we can calculate the optimal policy \(\pi\) on the fly:

$$\pi(s) = \mathrm{argmax}_a E_{s^\prime \sim P_{sa}}[V(s^\prime)].$$


Like \(q(a)\), the expectation \(E_{s^\prime \sim P_{sa}}[V(s^\prime)]\) can be estimated efficiently by sampling.
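In code, this amounts to the inner loop of fitted value iteration without the regression step; again a sketch under the same assumptions as above.

```python
def greedy_policy(s, actions, step, theta, phi=lambda s: s, k=10):
    """pi(s) = argmax_a E_{s' ~ P_{sa}}[V(s')], with the expectation
    estimated from k sampled next states per action."""
    values = [np.mean([theta @ phi(step(s, a)) for _ in range(k)])
              for a in actions]
    return actions[int(np.argmax(values))]
```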