In this article, I collect my notes on the Expectation-Maximization (EM) algorithm, based on lectures 12 and 13 of Andrew Ng's online course. Given a set of unlabeled data points, EM iteratively tries to determine the distribution of the data, assuming that every data point carries an implicit, unobserved label (a latent variable). For simplicity, we shall limit ourselves to the case where there are only finitely many implicit labels.
Description of the problem
Given a set of unlabeled data \(\{x^{(1)}, \dots, x^{(m)}\}\), our goal is to determine \(P(x)\), the distribution of \(x\), with the following assumptions.
Assumptions.

1. The unobserved latent variable \(z\) takes finitely many values \(z \in \{1, \dots, k\}\) and obeys a multinomial distribution, i.e., \(P(z=j) = \phi_j\) with \(\sum_{j=1}^k \phi_j = 1\).

2. \(\{P(x \mid z=j; a_j) : j=1, \dots, k\}\) is a family of uniformly parametrized distributions.
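These two assumptions describe a generative process: first draw \(z\) from the multinomial, then draw \(x\) from \(P(x \mid z; a_z)\). Below is a minimal sketch of that process, using hypothetical univariate Gaussian components where each \(a_j = (\mu_j, \sigma_j)\); the parameter values are made up for illustration.

```python
import random

def sample_point(phi, mu, sigma):
    # Draw z ~ Multinomial(phi), then x | z=j ~ N(mu_j, sigma_j^2).
    z = random.choices(range(len(phi)), weights=phi)[0]
    x = random.gauss(mu[z], sigma[z])
    return x, z

# Hypothetical parameters for k = 2 components.
random.seed(0)
xs = [sample_point([0.3, 0.7], [0.0, 4.0], [1.0, 2.0])[0] for _ in range(5)]
```

In the EM setting we observe only the `x` values; the sampled `z` labels are exactly the latent variables that are hidden from us.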
Assumptions 1 and 2 give us a set of parameters \(\theta = (\phi_1, \dots, \phi_k, a_1, \dots, a_k)\) and
$$\begin{equation}
P(x; \theta) = \sum_{j=1}^k P(x \mid z=j; \theta)P(z=j; \theta).
\label{px}
\end{equation}$$
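To make the mixture density concrete, here is a minimal sketch of evaluating \(P(x; \theta)\), again assuming hypothetical univariate Gaussian components with \(a_j = (\mu_j, \sigma_j)\):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, phi, mu, sigma):
    # P(x; theta) = sum_j P(x | z=j; a_j) * phi_j
    return sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(phi, mu, sigma))

# Hypothetical parameters for k = 2 components.
density = mixture_density(0.0, phi=[0.3, 0.7], mu=[0.0, 4.0], sigma=[1.0, 2.0])
```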
We want to find this set of parameters so that the likelihood function
$$L(\theta) = \prod_{i=1}^m P(x^{(i)}; \theta) = \prod_{i=1}^m \sum_{j=1}^k P(x^{(i)} \mid z=j; \theta)P(z=j; \theta)$$
is maximized. Or equivalently, the log likelihood function below is maximized:
$$\begin{equation}
l(\theta) = \sum_{i=1}^m \log\left(\sum_{j=1}^k P(x^{(i)}, z=j; \theta)\right),
\label{loglikelihood}
\end{equation}$$
where
$$P(x^{(i)}, z=j; \theta) = P(x^{(i)} \mid z=j; \theta)P(z=j; \theta).$$
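Continuing the hypothetical Gaussian-mixture example, the log likelihood \(l(\theta)\) can be computed directly from a data set; in a full EM implementation this is the quantity one monitors for convergence across iterations.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(xs, phi, mu, sigma):
    # l(theta) = sum_i log( sum_j phi_j * P(x_i | z=j; a_j) )
    total = 0.0
    for x in xs:
        total += math.log(sum(p * gaussian_pdf(x, m, s)
                              for p, m, s in zip(phi, mu, sigma)))
    return total

xs = [-0.5, 0.2, 3.8, 4.1]  # toy unlabeled data points
ll = log_likelihood(xs, phi=[0.3, 0.7], mu=[0.0, 4.0], sigma=[1.0, 2.0])
```

Because \(l(\theta)\) contains a sum inside the logarithm, it has no closed-form maximizer in general, which is what motivates the iterative EM procedure.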