Expectation-Maximization algorithm
In this article, I will collect my notes on the Expectation-Maximization (EM) algorithm, based on lectures 12 and 13 of Andrew Ng's online course. Given a set of unlabeled data points, EM iteratively tries to determine the distribution of the data, assuming that each data point carries an implicit label (an unobserved latent variable). For simplicity, we shall limit ourselves to the case where there are only finitely many implicit labels.
Description of the problem
Given a set of unlabeled data \(\{x^{(1)}, \dots, x^{(m)}\}\), our goal is to determine \(P(x)\), the distribution of \(x\), with the following assumptions.
Assumptions 1 and 2 give us a set of parameters \(\theta = (\phi_1, \dots, \phi_k, a_1, \dots, a_k)\) and
We want to find this set of parameters so that the likelihood function
is maximized. Or equivalently, the log likelihood function below is maximized:
where
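Concretely, under Assumptions 1 and 2 these quantities take the following form (a sketch of the standard setup, with \(P(x \mid z=j; a_j)\) denoting the conditional distribution of \(x\) given the label \(j\)):

\[
P(x; \theta) = \sum_{j=1}^{k} P(x \mid z=j; a_j)\,\phi_j, \qquad
L(\theta) = \prod_{i=1}^{m} P(x^{(i)}; \theta), \qquad
\ell(\theta) = \sum_{i=1}^{m} \log \sum_{j=1}^{k} P(x^{(i)} \mid z=j; a_j)\,\phi_j.
\]

The sum inside the logarithm couples the parameters and makes direct maximization of \(\ell(\theta)\) difficult, which is what motivates the EM iteration below.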
EM algorithm
For any \(x^{(i)}\), using Jensen's inequality, we have for any multinomial distribution \(Q_i\) on \(\{1, \dots, k\}\),
Equality holds if and only if the ratios \(\frac{P(x^{(i)}, z=j; \theta)}{Q_i(z=j)}\) are the same for all \(j\). That is to say, equality holds if and only if
where \(P(x^{(i)}; \theta)\) is calculated by
Therefore,
Based on Equations (\(\ref{inequality}\)) and (\(\ref{qij}\)), the EM algorithm uses the following steps to update \(\theta\).
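In summary, the iteration alternates the following two steps (a sketch in the standard notation, which is what Equations (\(\ref{qij}\)) and (\(\ref{mstep}\)) express):

\[
\textbf{E-step:}\quad Q_i(z=j) := P(z=j \mid x^{(i)}; \theta), \qquad i = 1, \dots, m,\ j = 1, \dots, k,
\]
\[
\textbf{M-step:}\quad \theta := \operatorname*{arg\,max}_{\theta} \sum_{i=1}^{m} \sum_{j=1}^{k} Q_i(z=j) \log \frac{P(x^{(i)}, z=j; \theta)}{Q_i(z=j)}.
\]

Each iteration cannot decrease \(\ell(\theta)\): the E-step makes the lower bound from Jensen's inequality tight at the current \(\theta\), and the M-step then maximizes that bound over \(\theta\).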
Mixture of Gaussians
Suppose \(x \in \mathbb{R}^n\) is a column vector, i.e., there are \(n\) features. We now require that \(P(x|z=j)\) be a Gaussian distribution for every \(j\). Thus, the above Assumptions are now specified as follows.
- \(P(z=j) = \phi_j\), for \(j = 1, \dots, k\) and \(\sum_j \phi_j = 1\).
- \(\displaystyle P(x|z=j) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma_j}} \exp\left(-\frac{1}{2} (x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j)\right)\), for \(j = 1, \dots, k\).
With these assumptions, our set of parameters is
Once we determine \(\theta\) via the EM algorithm, we can then use Equation (\(\ref{px}\)) to find \(P(x; \theta)\).
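Spelled out (a sketch of the standard Gaussian-mixture form, writing \(\mathcal{N}(x; \mu, \Sigma)\) for the Gaussian density displayed above):

\[
\theta = (\phi_1, \dots, \phi_k, \mu_1, \dots, \mu_k, \Sigma_1, \dots, \Sigma_k),
\qquad
P(x; \theta) = \sum_{j=1}^{k} \phi_j\, \mathcal{N}(x; \mu_j, \Sigma_j).
\]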
The E-step is always relatively easy. For the current \(\theta\), we set
from Equation (\(\ref{qij}\)). The quantities \(P(x^{(i)}, z=j; \theta)\) and \(P(x^{(i)}; \theta)\) can both be deduced from \(P(z)\) and \(P(x|z)\).
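Writing \(w_{ij}\) for \(Q_i(z=j)\) (as in the M-step below) and using \(\mathcal{N}\) as above, this works out to the familiar responsibility formula (a sketch):

\[
w_{ij} = \frac{\phi_j\, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \phi_l\, \mathcal{N}(x^{(i)}; \mu_l, \Sigma_l)}.
\]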
The M-step needs some work. Under the above assumptions and using \(w_{ij}\) in place of \(Q_i(z=j)\), Equation (\(\ref{mstep}\)) becomes
For \(\phi_j\), recall that we have the constraint \(\sum_j \phi_j = 1\). Moreover, since we are concerned only with the \(\phi_j\), we can change our objective function to
Applying the method of Lagrange multipliers to \(f_1\) with the constraint \(\sum_j \phi_j = 1\), there is a \(\lambda\) such that
Recall that \(w_{ij} = Q_i(z=j)\) defines a multinomial distribution on \(\{1, \dots, k\}\), so \(\sum_j w_{ij} = 1\). Thus, from the above equations, we conclude that
Therefore, we have the following formula to update \(\phi_j\) for \(j = 1, \dots, k\):
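Eliminating the multiplier \(\lambda\) via the two constraints \(\sum_j \phi_j = 1\) and \(\sum_j w_{ij} = 1\), the update takes the standard form (a sketch):

\[
\phi_j := \frac{1}{m} \sum_{i=1}^{m} w_{ij}, \qquad j = 1, \dots, k.
\]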
Now let's work on the parameter \(\mu_j\) for some fixed \(j\). The gradient of \(f\) with respect to \(\mu_j\) is
Since \(\Sigma_j\) is invertible, setting the gradient above to zero yields an updating formula for \(\mu_j\):
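In the standard form (a sketch), the vanishing-gradient condition \(\sum_{i} w_{ij}\, \Sigma_j^{-1}\bigl(x^{(i)} - \mu_j\bigr) = 0\) gives

\[
\mu_j := \frac{\sum_{i=1}^{m} w_{ij}\, x^{(i)}}{\sum_{i=1}^{m} w_{ij}}.
\]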
Finally, we focus on \(\Sigma_j\). Calculating the gradient of \(f\) with respect to \(\Sigma_j\) is not immediate; indeed, we need some lemmas from linear algebra.
Given an invertible matrix \(X\), we denote by \(\mathrm{adj}(X)\) the adjugate matrix of \(X\). Then
Proof. For (1), note that \(\det X = \sum_{k=1}^n X_{ik} \mathrm{adj}(X)_{ki}\). Thus,
The last equality holds because \(\mathrm{adj}(X)_{ki}\) is independent of \(X_{ij}\) for any \(k\). Since \(\mathrm{adj}(X) = (\det X) X^{-1}\), \(\mathrm{adj}(X)_{ji} = (\det X)(X^{-1})_{ji}\) is obvious.
For (2), we simply use the identity \(XX^{-1}=I\) and the product rule. \(\square\)
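For reference, the two identities used in this proof can be written as follows (a sketch; here \(\frac{\partial X}{\partial X_{ij}}\) denotes the single-entry matrix with a \(1\) in position \((i,j)\) and zeros elsewhere):

\[
(1)\ \ \frac{\partial \det X}{\partial X_{ij}} = \mathrm{adj}(X)_{ji} = (\det X)\,\bigl(X^{-1}\bigr)_{ji},
\qquad
(2)\ \ \frac{\partial X^{-1}}{\partial X_{ij}} = -\,X^{-1}\,\frac{\partial X}{\partial X_{ij}}\,X^{-1}.
\]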
The above lemma allows us to calculate the gradient of \(f\) with respect to \(\Sigma_j\) for some fixed \(j\), taking into consideration that \(\Sigma_j\) is symmetric. In fact, the corresponding gradient is
Setting the gradient to zero and solving for \(\Sigma_j\), we obtain
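In the standard form (a sketch), this gives

\[
\Sigma_j := \frac{\sum_{i=1}^{m} w_{ij}\, \bigl(x^{(i)} - \mu_j\bigr)\bigl(x^{(i)} - \mu_j\bigr)^T}{\sum_{i=1}^{m} w_{ij}}.
\]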
Equations (\(\ref{g-phi}\)), (\(\ref{g-mu}\)) and (\(\ref{g-sigma}\)) together update the current \(\theta\); this is the M-step.
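To tie the pieces together, here is a minimal sketch of the whole procedure in Python with NumPy, assuming the E-step and M-step formulas derived above; the function name `em_gmm` and all variable names are my own, not from the lecture, and no numerical safeguards (e.g. log-space computation) are included.

```python
import numpy as np

def em_gmm(X, k, n_iter=100, seed=0):
    """Sketch of EM for a mixture of k Gaussians on data X of shape (m, n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Initialization: uniform mixing weights, means picked from the data,
    # identity covariances.
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)
    sigma = np.array([np.eye(n) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: w[i, j] = Q_i(z = j) = P(z = j | x^(i); theta).
        w = np.zeros((m, k))
        for j in range(k):
            diff = X - mu[j]                                  # shape (m, n)
            inv = np.linalg.inv(sigma[j])
            quad = np.einsum('ab,bc,ac->a', diff, inv, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
            norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma[j]))
            w[:, j] = phi[j] * np.exp(-0.5 * quad) / norm
        w /= w.sum(axis=1, keepdims=True)                     # normalize responsibilities

        # M-step: update phi, mu, sigma with the formulas derived above.
        wj = w.sum(axis=0)                                    # shape (k,)
        phi = wj / m
        mu = (w.T @ X) / wj[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (w[:, j, None] * diff).T @ diff / wj[j]

    return phi, mu, sigma
```

For example, `em_gmm(X, k=3)` returns estimates of \(\phi_j\), \(\mu_j\) and \(\Sigma_j\), from which \(P(x; \theta)\) can be evaluated as in Equation (\(\ref{px}\)).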