In this article, I will collect my notes on the Expectation-Maximization (EM) algorithm, based on lectures 12 and 13 of Andrew Ng's online course. Given a set of unlabeled data points, EM iteratively tries to determine the distribution of the data, assuming that every data point carries an implicit, unobserved label (a latent variable). For simplicity, we shall limit ourselves to the case where there are only finitely many implicit labels.

Description of the problem

Given a set of unlabeled data \(\{x^{(1)}, \dots, x^{(m)}\}\), our goal is to determine \(P(x)\), the distribution of \(x\), with the following assumptions.

Assumptions.

  1. There are finitely many values of the unobserved latent variable \(z \in \{1, \dots, k\}\), and \(z\) obeys a multinomial distribution, i.e., \(P(z=j) = \phi_j\) with \(\sum_j \phi_j = 1\).

  2. \(\{P(x|z=j; a_j): j=1, \dots, k\}\) is a family of distributions of the same parametric form, each parametrized by \(a_j\).

Assumptions 1 and 2 give us a set of parameters \(\theta = (\phi_1, \dots, \phi_k, a_1,\dots, a_k)\) and

$$\begin{equation} P(x; \theta) = \sum_{j=1}^k P(x|z=j; \theta)P(z=j; \theta). \label{px} \end{equation}$$


We want to find this set of parameters so that the likelihood function

$$L(\theta) = \prod_{i=1}^m P(x^{(i)}; \theta) = \prod_{i=1}^m \sum_{j=1}^k P(x^{(i)}|z=j; \theta)P(z=j; \theta).$$


is maximized or, equivalently, so that the log-likelihood function below is maximized:

$$\begin{equation} l(\theta) = \sum_{i=1}^m \log\left(\sum_{j=1}^k P(x^{(i)}, z=j; \theta)\right), \label{log-likelihood} \end{equation}$$


where

$$P(x^{(i)}, z=j; \theta) = P(x^{(i)}|z=j; \theta)P(z=j; \theta).$$
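Numerically, the inner sum in Equation (\(\ref{log-likelihood}\)) is best evaluated in log space to avoid underflow. Below is a minimal Python sketch, assuming a hypothetical function `log_joint(i, j)` that returns \(\log P(x^{(i)}, z=j; \theta)\); the function name and signature are my own choices, not from the lectures.

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(log_joint, m, k):
    """Evaluate l(theta) = sum_i log(sum_j P(x^(i), z=j; theta)).

    log_joint(i, j) is assumed to return log P(x^(i), z=j; theta).
    """
    total = 0.0
    for i in range(m):
        # logsumexp computes log(sum_j exp(.)) without underflow
        total += logsumexp([log_joint(i, j) for j in range(k)])
    return total
```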

EM algorithm

For any \(x^{(i)}\) and any multinomial distribution \(Q_i\) on \(\{1, \dots, k\}\), Jensen's inequality gives

$$\begin{split} \log \left(\sum_{j=1}^k P(x^{(i)}, z=j; \theta)\right) &= \log \left(\sum_j Q_i(z=j) \frac{P(x^{(i)}, z=j; \theta)}{Q_i(z=j)}\right) \\ &\ge \sum_j Q_i(z=j) \log \left(\frac{P(x^{(i)}, z=j; \theta)}{Q_i(z=j)}\right). \end{split}$$


Equality holds if and only if \(\frac{P(x^{(i)}, z=j; \theta)}{Q_i(z=j)}\) is the same for all \(j\). That is to say, equality holds if and only if

$$\begin{equation} Q_i(z=j) = \frac{P(x^{(i)}, z=j; \theta)}{P(x^{(i)}; \theta)}, \label{qij} \end{equation}$$


where \(P(x^{(i)}; \theta)\) is calculated by

$$P(x^{(i)}; \theta) = \sum_j P(x^{(i)}, z=j; \theta).$$


Therefore,

$$\begin{equation}\begin{split} l(\theta) &= \sum_{i=1}^m \log\left(\sum_{j=1}^k P(x^{(i)}, z=j; \theta)\right)\\ &\ge \sum_i \sum_j Q_i(z=j) \log \left(\frac{P(x^{(i)}, z=j; \theta)}{Q_i(z=j)}\right). \label{inequality} \end{split}\end{equation}$$

Based on Equations (\(\ref{inequality}\)) and (\(\ref{qij}\)), the EM algorithm uses the following steps to update \(\theta\).

EM algorithm.

E-Step) Calculate \(Q_i(z=j)\) using Equation (\(\ref{qij}\)) from the current \(\theta\).

M-Step) Plug the resulting \(Q_i(z=j)\) into the right-hand side of Equation (\(\ref{inequality}\)), and update \(\theta\) to maximize

$$ \begin{equation} f(\theta) = \sum_{i=1}^m \sum_{j=1}^k Q_i(z=j) \log \left(\frac{P(x^{(i)}, z=j; \theta)}{Q_i(z=j)}\right), \label{mstep} \end{equation}$$

subject to \(\sum_j \phi_j = 1\).
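A useful consequence of the choice of \(Q_i\) in Equation (\(\ref{qij}\)) is that the algorithm never decreases the log-likelihood: if \(\theta^{(t)}\) is the current parameter, \(Q_i\) is computed in the E-step from \(\theta^{(t)}\), and \(\theta^{(t+1)}\) maximizes \(f\) in the M-step, then

$$l(\theta^{(t+1)}) \ge \sum_{i}\sum_{j} Q_i(z=j)\log\frac{P(x^{(i)}, z=j; \theta^{(t+1)})}{Q_i(z=j)} \ge \sum_{i}\sum_{j} Q_i(z=j)\log\frac{P(x^{(i)}, z=j; \theta^{(t)})}{Q_i(z=j)} = l(\theta^{(t)}),$$

where the first inequality is Equation (\(\ref{inequality}\)), the second holds because \(\theta^{(t+1)}\) maximizes the lower bound, and the final equality is the equality case of Jensen's inequality at \(\theta^{(t)}\).

In practice, the two steps are alternated until the log-likelihood stops increasing. Here is a minimal sketch of the loop in Python, assuming user-supplied `e_step` and `m_step` functions (hypothetical placeholders, not from the lectures):

```python
def em(theta0, e_step, m_step, n_iters=100):
    """Generic EM loop.

    e_step(theta) is assumed to return the m-by-k matrix Q with
    Q[i, j] = Q_i(z=j); m_step(Q) is assumed to return the theta
    maximizing f(theta) for that fixed Q.
    """
    theta = theta0
    for _ in range(n_iters):
        Q = e_step(theta)   # E-step: posterior of z under the current theta
        theta = m_step(Q)   # M-step: maximize the lower bound over theta
    return theta
```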

Mixture of Gaussians

Suppose \(x \in \mathbb{R}^n\) are column vectors, i.e., there are \(n\) features. In the Assumptions above, we now require that \(P(x|z=j)\) be a Gaussian distribution for each \(j\). Thus, the Assumptions are now specified as follows.

  1. \(P(z=j) = \phi_j\), for \(j = 1, \dots, k\) and \(\sum_j \phi_j = 1\).

  2. \(\displaystyle P(x|z=j) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma_j}} \exp\left(-\frac{1}{2} (x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j)\right)\), for \(j = 1, \dots, k\).

With these assumptions, our set of parameters is

$$\theta = (\phi_1, \dots, \phi_k, \mu_1, \dots, \mu_k, \Sigma_1, \dots, \Sigma_k).$$


Once we determine \(\theta\) via the EM algorithm, we can then use Equation (\(\ref{px}\)) to find \(P(x; \theta)\).

The E-step is always relatively easy. For the current \(\theta\), we set

$$w_{ij} = Q_i(z=j) = \frac{P(x^{(i)}, z=j; \theta)}{P(x^{(i)}; \theta)}$$


from Equation (\(\ref{qij}\)). The quantities \(P(x^{(i)}, z=j; \theta)\) and \(P(x^{(i)}; \theta)\) can both be computed from \(P(z)\) and \(P(x|z)\).
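For concreteness, here is a sketch of this E-step in Python, assuming the data are stacked into an \(m \times n\) array `X` and that `phi`, `mu`, `Sigma` hold the current parameters (the variable names are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, Sigma):
    """Return the m-by-k matrix W with W[i, j] = w_ij = Q_i(z=j)."""
    m, k = X.shape[0], len(phi)
    W = np.zeros((m, k))
    for j in range(k):
        # Joint density P(x^(i), z=j; theta) = phi_j * P(x^(i) | z=j; theta)
        W[:, j] = phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    # Normalize by P(x^(i); theta) = sum_j P(x^(i), z=j; theta)
    W /= W.sum(axis=1, keepdims=True)
    return W
```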

The M-step needs some work. Under the above assumptions, writing \(w_{ij}\) in place of \(Q_i(z=j)\) and dropping the additive constant \(-\sum_{i,j} w_{ij}\log w_{ij}\) (which does not depend on \(\theta\)), Equation (\(\ref{mstep}\)) becomes

$$ f(\theta) = \sum_{i=1}^m \sum_{j=1}^k \left[ w_{ij} \log \phi_j - \frac{1}{2}w_{ij}(x^{(i)}-\mu_j)^T \Sigma_j^{-1} (x^{(i)}-\mu_j) - \frac{1}{2}w_{ij}\log \det \Sigma_j - \frac{nw_{ij}}{2}\log(2\pi)\right]. $$


For \(\phi_j\), recall that we have the restriction \(\sum_j \phi_j = 1\). Moreover, since only the first term of \(f\) involves \(\phi_j\), we can reduce the objective function to

$$f_1(\theta) = \sum_{i=1}^m \sum_{j=1}^k w_{ij} \log \phi_j = \sum_j \left(\sum_i w_{ij}\right) \log \phi_j.$$


Applying the method of Lagrange multipliers to \(f_1\) with the restriction \(\sum_j \phi_j = 1\), there is a \(\lambda\) such that, for every \(j\),

$$\frac{\sum_i w_{ij}}{\phi_j} = \lambda.$$


Recall that \(w_{ij} = Q_i(z=j)\) is a multinomial distribution on \(\{1, \dots, k\}\), so \(\sum_j w_{ij} = 1\). Multiplying the previous equation by \(\phi_j\), summing over \(j\), and using \(\sum_j \phi_j = 1\), we conclude that

$$\lambda = \sum_{i=1}^m \sum_{j=1}^k w_{ij} = m.$$


Therefore, we have the following formula to update \(\phi_j\) for \(j = 1, \dots, k\):

$$\begin{equation} \phi_j = \frac{\sum_i w_{ij}}{m}. \label{g-phi} \end{equation}$$


Now, let's work on the parameter \(\mu_j\) for some fixed \(j\). The gradient of \(f\) with respect to \(\mu_j\) is

$$\sum_{i=1}^m w_{ij} \Sigma_j^{-1} (x^{(i)} - \mu_j).$$


Since \(\Sigma_j\) is invertible, setting the gradient above to zero yields an updating formula for \(\mu_j\):

$$\begin{equation} \mu_j = \frac{\sum_{i=1}^m w_{ij}x^{(i)}}{\sum_{i=1}^m w_{ij}}. \label{g-mu} \end{equation}$$


Finally, we focus on \(\Sigma_j\). Calculating the gradient of \(f\) with respect to \(\Sigma_j\) is not immediate; indeed, we need some lemmas from linear algebra.

Given an invertible matrix \(X\), we denote by \(\mathrm{adj}(X)\) the adjugate matrix of \(X\). Then

$$X^{-1} = (\det X)^{-1} \mathrm{adj}(X).$$

Lemma. Let \(X = (X_{ij})\) be an invertible \(n \times n\) matrix, viewed as a function of its entries \(X_{ij}\). Then

(1) \(\displaystyle \frac{\partial \det X}{\partial X_{ij}} = \mathrm{adj}(X)_{ji} = (\det X)(X^{-1})_{ji}\).

(2) \(\displaystyle \frac{\partial X^{-1}}{\partial X_{ij}} = -X^{-1} \frac{\partial X}{\partial X_{ij}} X^{-1}\).

Proof. For (1), note that \(\det X = \sum_{k=1}^n X_{ik} \mathrm{adj}(X)_{ki}\). Thus,

$$\begin{split} \frac{\partial \det X}{\partial X_{ij}} &= \sum_k \left( \mathrm{adj}(X)_{ki} \frac{\partial X_{ik}}{\partial X_{ij}} + X_{ik} \frac{\partial \,\mathrm{adj}(X)_{ki}}{\partial X_{ij}} \right) \\ &= \mathrm{adj}(X)_{ji}. \end{split}$$


The last equality holds because \(\mathrm{adj}(X)_{ki}\) is independent of \(X_{ij}\) for every \(k\) (the cofactor deletes row \(i\)), while \(\partial X_{ik}/\partial X_{ij}\) equals \(1\) if \(k=j\) and \(0\) otherwise. Since \(\mathrm{adj}(X) = (\det X) X^{-1}\), the identity \(\mathrm{adj}(X)_{ji} = (\det X)(X^{-1})_{ji}\) follows immediately.

For (2), we simply differentiate the identity \(XX^{-1}=I\) with respect to \(X_{ij}\), apply the product rule, and solve for \(\partial X^{-1}/\partial X_{ij}\).\(\square\)
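The lemma is easy to check numerically. The snippet below compares a finite-difference approximation of \(\partial \det X / \partial X_{ij}\) with \((\det X)(X^{-1})_{ji}\) for a random matrix; it is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
i, j, eps = 1, 2, 1e-6

X_plus = X.copy()
X_plus[i, j] += eps          # perturb the (i, j) entry only
numeric = (np.linalg.det(X_plus) - np.linalg.det(X)) / eps
exact = np.linalg.det(X) * np.linalg.inv(X)[j, i]   # note the (j, i) index

print(numeric, exact)        # the two values agree to several digits
```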

The above lemma allows us to calculate the gradient of \(f\) with respect to \(\Sigma_j\) for some fixed \(j\), taking into consideration that \(\Sigma_j\) is symmetric. In fact, the corresponding gradient is

$$\sum_{i=1}^m \frac{1}{2}w_{ij} \Sigma_j^{-1}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T\Sigma_j^{-1}-\frac{1}{2}w_{ij}\Sigma_j^{-1}.$$


Setting the gradient to zero and solving for \(\Sigma_j\), we obtain

$$\begin{equation} \Sigma_j = \frac{\sum_{i=1}^m w_{ij} (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^m w_{ij}}. \label{g-sigma} \end{equation}$$

Equations (\(\ref{g-phi}\)), (\(\ref{g-mu}\)) and (\(\ref{g-sigma}\)) together update the current \(\theta\), which completes the M-step.
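Putting the three updating formulas together, the M-step for the Gaussian mixture can be sketched in a few lines of Python, reusing the responsibilities \(w_{ij}\) from the E-step sketch above (again, the array names are my own):

```python
import numpy as np

def m_step(X, W):
    """Update (phi, mu, Sigma) from data X (m-by-n) and responsibilities W (m-by-k)."""
    m, n = X.shape
    k = W.shape[1]
    Nj = W.sum(axis=0)                    # sum_i w_ij, one entry per j
    phi = Nj / m                          # Equation (g-phi)
    mu = (W.T @ X) / Nj[:, None]          # Equation (g-mu); row j is mu_j
    Sigma = []
    for j in range(k):
        D = X - mu[j]                     # centered data, shape (m, n)
        Sigma.append((W[:, j, None] * D).T @ D / Nj[j])   # Equation (g-sigma)
    return phi, mu, Sigma
```

Alternating `e_step` and `m_step` until the log-likelihood stabilizes gives the full mixture-of-Gaussians EM.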