This article is my notes on the topic of factor analysis. These notes come out of lectures 13 and 14 of Andrew Ng's online course. Roughly speaking, factor analysis models some \(n\) dimensional observed data under the assumption that these data actually lie close to some \(d\) dimensional plane in \(\R^n\), up to some Gaussian distributed errors. Let's make this more precise.
Suppose we have a set of observed data \(\{x^{(1)}, \dots, x^{(m)}\}\) implicitly labeled by some latent random variable \(z \in \R^d\) where
$$z \sim \mathcal{N}(0, I).$$
The factor analysis model tries to model \(P(x)\) using the assumption that
$$\begin{equation}
x|z \sim \mathcal{N}(\mu+\Lambda z, \Psi),
\label{cond-xz}
\end{equation}$$
for some \(\mu \in \R^n, \Lambda \in \R^{n \times d}\) and diagonal matrix \(\Psi \in \R^{n \times n}\). These \(\mu, \Lambda\) and \(\Psi\) are parameters of the model.
An equivalent form of the model is that \(x = \mu + \Lambda z + \epsilon\), where
$$\begin{split}
z & \sim \mathcal{N}(0, I) \\
\epsilon & \sim \mathcal{N}(0, \Psi)
\end{split}$$
and \(z\) and \(\epsilon \in \R^n\) are independent. This is equivalent to Equation \(\eqref{cond-xz}\) because, for a fixed \(z\),
$$\mathbb{E}_\epsilon x = \mu + \Lambda z,$$
and
$$\begin{split}
\mathrm{Cov}_\epsilon(x)
&= \mathbb{E}_\epsilon[(x-\mathbb{E}_\epsilon x) (x-\mathbb{E}_\epsilon x)^T] \\
&= \mathbb{E}_\epsilon [\epsilon\epsilon^T] = \Psi.
\end{split}$$
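To make the generative story concrete, here is a minimal numpy sketch that draws samples from this model; the dimensions and the particular \(\mu\), \(\Lambda\) and \(\Psi\) below are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 2, 10000                            # observed dim, latent dim, sample count (arbitrary)

mu = rng.normal(size=n)                          # mu in R^n
Lam = rng.normal(size=(n, d))                    # Lambda in R^{n x d}
Psi = np.diag(rng.uniform(0.1, 1.0, size=n))     # diagonal Psi in R^{n x n}

z = rng.normal(size=(m, d))                      # z ~ N(0, I), one row per sample
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)   # eps ~ N(0, Psi)
x = mu + z @ Lam.T + eps                         # x = mu + Lambda z + eps, shape (m, n)
```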
Distribution \(P(x)\)
In this section, we try to compute \(P(x)\) from the factor analysis model. A naive idea is to compute
$$P(x) = \int_{\R^d} P(x|z)P(z) \, dz.$$
While we know \(P(x|z)\) and \(P(z)\), this integral turns out to be very hard to handle directly. Therefore, we need a better method to figure out \(P(x)\). Note that the joint distribution of \(x\) and \(z\) is
$$\begin{equation}
P(x, z) = P(x|z)P(z).
\label{joint-xz-bayes}
\end{equation}$$
Since \(x = \mu + \Lambda z + \epsilon\) is an affine function of the independent Gaussian variables \(z\) and \(\epsilon\), the joint distribution \(P(x, z)\) is Gaussian, and hence the marginal \(P(x)\) is Gaussian as well. The expected value of \(x = \mu + \Lambda z + \epsilon\) is
$$\begin{split}
\mathbb{E}_{z, \epsilon}[x] &= \mu + \Lambda \mathbb{E}_z [z]+\mathbb{E}_\epsilon [\epsilon] \\
&= \mu.
\end{split}$$
Now, since \(z\) and \(\epsilon\) are independent with zero mean, the cross terms vanish and the covariance matrix of \(x\) is
$$\begin{split}
\mathrm{Cov}_{z, \epsilon}(x) &= \mathbb{E}_{z, \epsilon}[(x-\mathbb{E}_{z, \epsilon}[x]) (x-\mathbb{E}_{z, \epsilon}[x])^T]\\
&= \mathbb{E}_{z, \epsilon}[(\Lambda z+\epsilon)(\Lambda z+\epsilon)^T]\\
&= \Lambda \mathbb{E}_{z}[zz^T] \Lambda^T + \mathbb{E}_{\epsilon}[\epsilon\epsilon^T] \\
&= \Lambda \Lambda^T + \Psi.
\end{split}$$
Therefore, we get
$$\begin{equation}
x \sim \mathcal{N}(\mu, \Lambda\Lambda^T+\Psi).
\label{px}
\end{equation}$$
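Continuing the numpy sketch above, we can sanity-check Equation \(\eqref{px}\) empirically: the sample mean and sample covariance of the simulated \(x\) should approach \(\mu\) and \(\Lambda\Lambda^T+\Psi\) as \(m\) grows.

```python
# Empirical check of x ~ N(mu, Lambda Lambda^T + Psi), reusing mu, Lam, Psi, x from above.
emp_mean = x.mean(axis=0)
emp_cov = np.cov(x, rowvar=False)
print(np.abs(emp_mean - mu).max())                    # small for large m
print(np.abs(emp_cov - (Lam @ Lam.T + Psi)).max())    # small for large m
```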
Joint Distribution \(P(x, z)\)
We already know \(z \sim \mathcal{N}(0, I)\) and \(x \sim \mathcal{N}(\mu, \Lambda\Lambda^T+\Psi)\). To compute \(P(x, z)\), we only need to calculate the cross-covariance \(\mathrm{Cov}_{z, \epsilon}(x, z)\):
$$\begin{split}
\mathrm{Cov}_{z, \epsilon}(x, z)
&= \mathbb{E}_{z, \epsilon}[(x-\mathbb{E}_{z, \epsilon}[x])(z-\mathbb{E}_z[z])^T] \\
&= \mathbb{E}_{z, \epsilon}[\Lambda zz^T+\epsilon z^T] \\
&= \Lambda.
\end{split}$$
Thus, the joint distribution for \(x\) and \(z\) is
$$\begin{equation}
\left(\begin{array}{c}x \\ z\end{array}\right) \sim
\mathcal{N}\left(
\left(\begin{array}{c}\mu \\ 0\end{array}\right),
\left(\begin{array}{cc} \Lambda\Lambda^T+\Psi & \Lambda \\ \Lambda^T & I\end{array}\right)
\right).
\label{joint-xz}
\end{equation}$$
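The off-diagonal block of \(\eqref{joint-xz}\) says that the cross-covariance between \(x\) and \(z\) is \(\Lambda\), which we can also check against the simulated samples from the sketch above.

```python
# Empirical cross-covariance between x and z, which should be close to Lambda.
xc = x - x.mean(axis=0)
zc = z - z.mean(axis=0)
cross_cov = xc.T @ zc / (m - 1)        # shape (n, d)
print(np.abs(cross_cov - Lam).max())   # small for large m
```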
Conditional Distribution \(P(z|x)\)
Let's work on a more general case. Suppose
$$\begin{equation}
x = \left(\begin{array}{c}x_1 \\ x_2 \end{array}\right) \sim
\mathcal{N}\left(
\mu = \left(\begin{array}{c}\mu_1 \\ \mu_2\end{array}\right),
\Sigma = \left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{array}\right)
\right),
\end{equation}$$
where \(x_1 \in \R^{n_1}\) and \(x_2 \in \R^{n_2}\). Then we know that
$$x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}) \text{ and } x_2 \sim \mathcal{N}(\mu_2, \Sigma_{22}).$$
The question is: what is \(P(x_1|x_2)\)? We will go through a somewhat cumbersome calculation to get the result. Before we go into any calculation, note that the covariance matrix of a Gaussian distribution is symmetric, \(\Sigma^{T}=\Sigma\), so
$$\Sigma_{11}^T = \Sigma_{11}, \Sigma_{22}^T = \Sigma_{22} \text{ and } \Sigma_{21}^T = \Sigma_{12}.$$
Let
$$\Gamma = \left(\begin{array}{cc} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22}\end{array}\right)$$
be the inverse of \(\Sigma\). Then from \(\Sigma\Gamma=I\), we get
$$\begin{align}
\Sigma_{11}\Gamma_{11} + \Sigma_{12}\Gamma_{21} = I \label{a}\\
\Sigma_{11}\Gamma_{12} + \Sigma_{12}\Gamma_{22} = 0 \label{b}\\
\Sigma_{21}\Gamma_{11} + \Sigma_{22}\Gamma_{21} = 0 \label{c}\\
\Sigma_{21}\Gamma_{12} + \Sigma_{22}\Gamma_{22} = I \label{d}
\end{align}$$
From Equation \(\eqref{c}\), one has \(\Gamma_{21} = -\Sigma_{22}^{-1}\Sigma_{21}\Gamma_{11}\). Putting this into Equation \(\eqref{a}\), one gets
$$\begin{equation}
\Gamma_{11} = (\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}.
\label{gamma11}
\end{equation}$$
Since \(\Sigma\) is symmetric, its inverse \(\Gamma\) is also symmetric, so
$$\begin{equation}
\Gamma_{12} = \Gamma_{21}^T = - \Gamma_{11} \Sigma_{12}\Sigma_{22}^{-1}.
\label{gamma12}
\end{equation}$$
Putting \(\Gamma_{12}\) into Equation \(\eqref{d}\), we get
$$\begin{equation}
\Gamma_{22} - \Sigma_{22}^{-1} = \Sigma_{22}^{-1}\Sigma_{21}\Gamma_{11}\Sigma_{12}\Sigma_{22}^{-1}.
\label{gamma22}
\end{equation}$$
Writing out the Gaussian densities,
$$P(x) = P(x_1, x_2) = \frac{1}{\sqrt{(2\pi)^{n_1+n_2}\det\Sigma}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
and
$$P(x_2) = \frac{1}{\sqrt{(2\pi)^{n_2}\det\Sigma_{22}}}\exp\left(-\frac{1}{2}(x_2-\mu_2)^T\Sigma_{22}^{-1}(x_2-\mu_2)\right).$$
By Bayes' rule, \(P(x_1|x_2)=P(x_1, x_2)/P(x_2)\). We separate the calculation of \(P(x_1, x_2)/P(x_2)\) into several parts. First of all, let's work on \(\det \Sigma\). By an easy matrix calculation,
$$\begin{split}
\Sigma &= \left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{array}\right) \\
&= \left(\begin{array}{cc} \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I\end{array}\right)
\left(\begin{array}{cc} I & 0 \\ \Sigma_{21} & I\end{array}\right)
\left(\begin{array}{cc} I & 0 \\ 0 & \Sigma_{22}\end{array}\right).
\end{split}$$
Thus,
$$\begin{equation}
\det \Sigma = \det (\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) \cdot \det \Sigma_{22}.
\label{det}
\end{equation}$$
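The block-inverse formulas \(\eqref{gamma11}\)–\(\eqref{gamma22}\) and the determinant identity \(\eqref{det}\) are easy to verify numerically. Here is a small self-contained numpy check with a random symmetric positive definite \(\Sigma\); the block sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 4                                   # arbitrary block sizes
A = rng.normal(size=(n1 + n2, n1 + n2))
Sigma = A @ A.T + np.eye(n1 + n2)               # random symmetric positive definite matrix
S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]

Gamma = np.linalg.inv(Sigma)
G11 = Gamma[:n1, :n1]
S22_inv = np.linalg.inv(S22)
Schur = S11 - S12 @ S22_inv @ S21               # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21

print(np.allclose(G11, np.linalg.inv(Schur)))                                        # Eq. (gamma11)
print(np.isclose(np.linalg.det(Sigma), np.linalg.det(Schur) * np.linalg.det(S22)))   # Eq. (det)
```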
Next, we need to take care of the arguments inside the exponentials. Replacing \(\Sigma^{-1}\) by \(\Gamma\), using the symmetry of the blocks \(\Gamma_{ij}\) and \(\Sigma_{ij}\), and applying Equations \(\eqref{gamma12}\) and \(\eqref{gamma22}\), we have
$$\begin{split}
& (x-\mu)^T\Sigma^{-1}(x-\mu) - (x_2-\mu_2)^T\Sigma_{22}^{-1}(x_2-\mu_2)\\
= & (x_1-\mu_1)^T \Gamma_{11} (x_1-\mu_1) + (x_2-\mu_2)^T\Gamma_{21}(x_1-\mu_1) \\
& \;\; + (x_1-\mu_1)^T \Gamma_{12}(x_2-\mu_2) + (x_2-\mu_2)^T(\Gamma_{22}-\Sigma_{22}^{-1})(x_2-\mu_2) \\
= & (x_1-\mu_1)^T \Gamma_{11} (x_1-\mu_1) - (x_2-\mu_2)^T (\Sigma_{22}^{-1})^T\Sigma_{12}^T \Gamma_{11}(x_1-\mu_1) \\
& \;\; -(x_1-\mu_1)^T \Gamma_{11}\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \\
& \;\; + (x_2-\mu_2)^T(\Sigma_{22}^{-1})^T\Sigma_{12}^T\Gamma_{11}\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \\
=& [x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)]^T\Gamma_{11}[x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)].
\end{split}$$
Now using Equations \(\eqref{gamma11}\) and \(\eqref{det}\), we have
$$\begin{split}
& P(x_1|x_2) = \frac{1}{\sqrt{(2\pi)^{n_1} \det(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})}} \\
& \cdot \exp\left(-\frac{1}{2}[x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)]^T(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}[x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)]\right).
\end{split}$$
In other words,
$$\begin{equation}
x_1|x_2 \sim \mathcal{N}(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2), \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}).
\label{cond-x12}
\end{equation}$$
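As a quick numerical illustration, we can check Equation \(\eqref{cond-x12}\) by comparing the Gaussian density it describes with the ratio \(P(x_1, x_2)/P(x_2)\) at an arbitrary point, reusing the random \(\Sigma\) from the check above (scipy is assumed to be available here).

```python
# Check that N(cond_mean, cond_cov) matches p(x1, x2) / p(x2) pointwise.
from scipy.stats import multivariate_normal

mu1, mu2 = rng.normal(size=n1), rng.normal(size=n2)
mu_full = np.concatenate([mu1, mu2])
x1, x2 = rng.normal(size=n1), rng.normal(size=n2)     # an arbitrary evaluation point

cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)
cond_cov = S11 - S12 @ S22_inv @ S21

lhs = multivariate_normal(cond_mean, cond_cov).pdf(x1)
rhs = (multivariate_normal(mu_full, Sigma).pdf(np.concatenate([x1, x2]))
       / multivariate_normal(mu2, S22).pdf(x2))
print(np.isclose(lhs, rhs))   # True up to floating point error
```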
Applying Equation \(\eqref{cond-x12}\) to the joint distribution \(\eqref{joint-xz}\) of our factor analysis model, with \(x_1 = z\), \(x_2 = x\), \(\Sigma_{11} = I\), \(\Sigma_{12} = \Lambda^T\) and \(\Sigma_{22} = \Lambda\Lambda^T+\Psi\), we finally get
$$\begin{equation}
z|x \sim \mathcal{N}(\Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}(x-\mu), I-\Lambda^{T}(\Lambda\Lambda^T+\Psi)^{-1}\Lambda).
\label{cond-zx}
\end{equation}$$
EM for Factor Analysis
We now recall that we have observed data \(\{x^{(1)}, \dots, x^{(m)}\}\) and a latent variable \(z \in \R^d\). The factor analysis model assumes \(x = \mu + \Lambda z + \epsilon\) with
$$\begin{split}
z & \sim \mathcal{N}(0, I) \\
\epsilon & \sim \mathcal{N}(0, \Psi)
\end{split}$$
The parameters \(\mu, \Lambda\) and \(\Psi\) will be determined via maximizing the log-likelihood function
$$\begin{split}
& l(\mu, \Lambda, \Psi) = \sum_{i=1}^m \log P(x^{(i)}) \\
& = -\frac{mn}{2}\log(2\pi)-\frac{m}{2}\log\det(\Lambda\Lambda^T+\Psi)-\frac{1}{2}\sum_{i=1}^m(x^{(i)}-\mu)^T(\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu).
\end{split}$$
We are going to use the EM algorithm to find the parameters, but before that, we can already find \(\mu\) directly from \(l\):
$$\nabla_\mu l(\mu, \Lambda, \Psi) = (\Lambda\Lambda^T+\Psi)^{-1}\left(\sum_{i=1}^m x^{(i)}-m\mu\right).$$
Setting \(\nabla_\mu l = 0\), we get
$$\begin{equation}
\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}.
\label{mu}
\end{equation}$$
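In code this is just the sample mean of the data; continuing the first sketch above, where the rows of x are the observations:

```python
mu_hat = x.mean(axis=0)   # Equation (mu): the maximizing mu is the sample mean
```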
I have written up some notes on the EM algorithm under the assumption that the latent variable \(z\) obeys a multinomial distribution. Here in factor analysis, \(z\) is continuous and obeys a Gaussian distribution. Fortunately, the EM algorithm can be adapted to continuous latent variables without much effort: we just need to change the appropriate summations to integrals.
The E-step is to calculate for each \(x^{(i)}\) a conditional distribution
$$Q_i(z^{(i)}) = \frac{P(x^{(i)}, z^{(i)})}{P(x^{(i)})} = P(z^{(i)}|x^{(i)}).$$
Using Equation \(\eqref{cond-zx}\), we get \(Q_i\) for the current values of \(\mu, \Lambda\) and \(\Psi\):
$$\begin{equation}
z^{(i)} \sim_{Q_i} \mathcal{N}(\mu^{(i)}, \Sigma^{(i)}),
\label{estep}
\end{equation}$$
where
$$\begin{equation}
\begin{split}
\mu^{(i)} &= \Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu) \\
\Sigma^{(i)} &= I-\Lambda^{T}(\Lambda\Lambda^T+\Psi)^{-1}\Lambda.
\label{zi}
\end{split}
\end{equation}$$
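As a sketch of the E-step in code (assuming, as in the earlier sketch, that the data points are stacked as rows of a matrix), the posterior moments \(\mu^{(i)}\) and \(\Sigma^{(i)}\) can be computed for all \(i\) at once; note that \(\Sigma^{(i)}\) does not depend on \(x^{(i)}\), so it is the same matrix for every data point.

```python
import numpy as np

def e_step(X, mu, Lam, Psi):
    """E-step of factor analysis: posterior moments of z^(i) given x^(i).

    X   : (m, n) data matrix, one x^(i) per row
    mu  : (n,) current mean
    Lam : (n, d) current Lambda
    Psi : (n, n) current diagonal Psi
    Returns mus (m, d), whose rows are the mu^(i), and Sigma_z (d, d) = Sigma^(i).
    """
    d = Lam.shape[1]
    G = np.linalg.inv(Lam @ Lam.T + Psi)     # (Lambda Lambda^T + Psi)^{-1}
    mus = (X - mu) @ G @ Lam                 # row i is Lambda^T G (x^(i) - mu)
    Sigma_z = np.eye(d) - Lam.T @ G @ Lam    # identical for every i
    return mus, Sigma_z
```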
Note that we use the current concrete values of the parameters to calculate \(Q_i\), so \(Q_i(z)\) is independent of the parameters that we optimize in the M-step. The M-step is to maximize
$$\begin{split}
f(\mu, \Lambda, \Psi) &= \sum_{i=1}^m \mathbb{E}_{Q_i} \log \frac{P(x^{(i)}, z^{(i)})}{Q_i(z^{(i)})} \\
&= \sum_{i=1}^m \left[\mathbb{E}_{Q_i} \log P(x^{(i)}|z^{(i)}) + \mathbb{E}_{Q_i} \log P(z^{(i)}) - \mathbb{E}_{Q_i} \log Q_i(z^{(i)})\right].
\end{split}$$
Since \(P(z)\) and \(Q_i(z)\) are independent of \(\mu, \Lambda\) and \(\Psi\), we just need to update the parameters with respect to the following function
$$\begin{equation}
\begin{split}
F(\mu, \Lambda, \Psi) &= \sum_{i=1}^m \mathbb{E}_{Q_i} \log P(x^{(i)}|z^{(i)}) \\
&= \sum_{i=1}^m \mathbb{E}_{Q_i}\left[-\frac{n}{2}\log(2\pi)-\frac{1}{2}\log\det\Psi - \frac{1}{2}(x^{(i)}-\mu-\Lambda z^{(i)})^T \Psi^{-1} (x^{(i)}-\mu-\Lambda z^{(i)})\right].
\end{split}
\label{F}
\end{equation}$$
Let's find the update formula for \(\Psi\) first. Since \(\Psi\) is symmetric, using a lemma in this post on matrix derivatives, we have
$$
\nabla_\Psi F = -\frac{m}{2} \Psi^{-1} + \frac{1}{2} \Psi^{-1} \left(\sum_{i=1}^m \mathbb{E}_{Q_i}\left[(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T \right]\right)\Psi^{-1}.
$$
Setting \(\nabla_\Psi F = 0\), we get
$$\Psi^\prime = \frac{1}{m}\left(\sum_{i=1}^m \mathbb{E}_{Q_i}\left[(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T \right]\right).$$
Using Equations \(\eqref{mu}\), \(\eqref{estep}\) and \(\eqref{zi}\), we can simplify the above equation. In particular, \(\sum_{i=1}^m \mu^{(i)} = \Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\sum_{i=1}^m(x^{(i)}-\mu) = 0\) by Equation \(\eqref{mu}\), so the cross terms involving both \(\mu\) and \(\mu^{(i)}\) drop out, and we get
$$\begin{equation}
\Psi^\prime = \frac{1}{m}\sum_{i=1}^m \left[x^{(i)}(x^{(i)})^T+\Lambda(\Sigma^{(i)}+\mu^{(i)}(\mu^{(i)})^T)\Lambda^T-x^{(i)}(\mu^{(i)})^T\Lambda^T-\Lambda\mu^{(i)}(x^{(i)})^T\right]-\mu\mu^T.
\label{psi}
\end{equation}$$
Since \(\Psi\) is diagonal, it should be updated using only the diagonal entries of \(\Psi^\prime\). Finally, we shall find the update formula for \(\Lambda\). It turns out that we need to take matrix derivatives of traces, so we will prove the following lemma first.
Lemma. Let \(A, B\) and \(C\) be matrices, and let \(f(A)\) be a differentiable scalar function of a matrix. Then
(1) \(\displaystyle \nabla_A \mathrm{tr}\, AB = B^{T}\).
(2) \(\displaystyle \nabla_{A^T} f(A) = (\nabla_A f(A))^T\).
(3) \(\displaystyle \nabla_A \mathrm{tr}\, ABA^TC = CAB+C^TAB^T\).
Proof. (1) From the definition of the trace, we write
$$\mathrm{tr}\, AB = \sum_{i}\sum_{j} A_{ij}B_{ji}.$$
Thus,
$$\frac{\partial \mathrm{tr}\, AB}{\partial A_{ij}} = B_{ji}.$$
This proves (1). For (2), it is simply because
$$\frac{\partial f(A)}{\partial (A^T)_{ij}} = \frac{\partial f(A)}{\partial A_{ji}}.$$
Finally, for (3), we apply the product rule together with (1) and (2), differentiating with respect to each occurrence of \(A\) separately. Holding the \(A\) inside \(A^T\) fixed, \(\mathrm{tr}\, A(BA^TC)\) has gradient \((BA^TC)^T = C^TAB^T\) by (1). Holding the other \(A\) fixed and using the cyclic property of the trace, \(\mathrm{tr}\,(AB)(A^TC) = \mathrm{tr}\, A^T(CAB)\), whose gradient with respect to \(A^T\) is \((CAB)^T\) by (1), and hence whose gradient with respect to \(A\) is \(CAB\) by (2). Adding the two contributions,
$$\nabla_A \mathrm{tr}\, ABA^TC = C^TAB^T + CAB.$$
This finishes the proof of the lemma. \(\square\)
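Part (3) of the lemma is also easy to sanity-check with finite differences; here is a small standalone numpy check with arbitrary matrix sizes.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 4, 3                                 # arbitrary sizes: A is p x q, B is q x q, C is p x p
A = rng.normal(size=(p, q))
B = rng.normal(size=(q, q))
C = rng.normal(size=(p, p))

def f(M):
    return np.trace(M @ B @ M.T @ C)        # f(A) = tr(A B A^T C)

analytic = C @ A @ B + C.T @ A @ B.T        # gradient claimed by part (3)
numeric = np.zeros_like(A)
h = 1e-6
for i in range(p):
    for j in range(q):
        E = np.zeros_like(A)
        E[i, j] = h
        numeric[i, j] = (f(A + E) - f(A - E)) / (2 * h)

print(np.abs(analytic - numeric).max())     # tiny, so the identity holds
```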
From Equation \(\eqref{F}\), we can extract the \(\Lambda\) terms:
$$F(\Lambda) = \sum_{i=1}^m \mathbb{E}_{Q_i}\left[-\frac{1}{2}(z^{(i)})^T \Lambda^T \Psi^{-1}\Lambda z^{(i)} + (x^{(i)}-\mu)^T\Psi^{-1}\Lambda z^{(i)}\right].$$
The above \(\Lambda\) terms can be expressed in terms of traces:
$$F(\Lambda) = \sum_{i=1}^m \mathbb{E}_{Q_i} \mathrm{tr}\, \left[-\frac{1}{2}\Lambda^T\Psi^{-1}\Lambda z^{(i)}(z^{(i)})^T + \Psi^{-1}\Lambda z^{(i)}(x^{(i)}-\mu)^T\right].$$
Now using the above Lemma, we get
$$\nabla_\Lambda F(\Lambda) = \sum_{i=1}^m \mathbb{E}_{Q_i} [-\Psi^{-1}\Lambda z^{(i)}(z^{(i)})^T+\Psi^{-1}(x^{(i)}-\mu)(z^{(i)})^T].$$
We then set \(\nabla_\Lambda F(\Lambda)=0\) to get
$$\Lambda = \left(\sum_{i=1}^m (x^{(i)}-\mu)(\mathbb{E}_{Q_i}[z^{(i)}])^T\right)\left(\sum_{i=1}^m \mathbb{E}_{Q_i} z^{(i)}(z^{(i)})^T\right)^{-1}.$$
If we use Equations \(\eqref{estep}\) and \(\eqref{zi}\) again, we have
$$\begin{equation}
\Lambda = \left(\sum_{i=1}^m (x^{(i)}-\mu)(\mu^{(i)})^T\right)\left(\sum_{i=1}^m \left(\Sigma^{(i)}+\mu^{(i)}(\mu^{(i)})^T\right)\right)^{-1}.
\label{lambda}
\end{equation}$$
Therefore, the M-step is given by Equations \(\eqref{psi}\) and \(\eqref{lambda}\).
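Putting Equations \(\eqref{mu}\), \(\eqref{estep}\), \(\eqref{zi}\), \(\eqref{psi}\) and \(\eqref{lambda}\) together, here is a minimal self-contained numpy sketch of the whole EM loop. The initialization, the number of iterations and the choice to plug the freshly updated \(\Lambda\) into the \(\Psi\) update are conventional but somewhat arbitrary choices.

```python
import numpy as np

def factor_analysis_em(X, d, n_iter=100, seed=0):
    """Minimal EM sketch for factor analysis on an (m, n) data matrix X with latent dimension d."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    mu = X.mean(axis=0)                           # Equation (mu); fixed for all iterations
    Xc = X - mu                                   # centered data, rows are x^(i) - mu
    Lam = rng.normal(size=(n, d))                 # arbitrary initialization
    Psi = np.diag(X.var(axis=0))                  # arbitrary initialization

    for _ in range(n_iter):
        # E-step: posterior moments of z^(i) (Equations (estep) and (zi)).
        G = np.linalg.inv(Lam @ Lam.T + Psi)
        mus = Xc @ G @ Lam                        # rows are mu^(i)
        Sigma_z = np.eye(d) - Lam.T @ G @ Lam     # Sigma^(i), the same for every i

        # M-step for Lambda (Equation (lambda)).
        Ezz_sum = m * Sigma_z + mus.T @ mus       # sum_i E_{Q_i}[z z^T]
        Lam = (Xc.T @ mus) @ np.linalg.inv(Ezz_sum)

        # M-step for Psi (Equation (psi)); keep only the diagonal part.
        S = Xc.T @ Xc / m                         # (1/m) sum_i (x^(i)-mu)(x^(i)-mu)^T
        XZ = Xc.T @ mus / m                       # (1/m) sum_i (x^(i)-mu)(mu^(i))^T
        Psi_full = S + Lam @ (Ezz_sum / m) @ Lam.T - XZ @ Lam.T - Lam @ XZ.T
        Psi = np.diag(np.diag(Psi_full))

    return mu, Lam, Psi
```

Running this on synthetic data generated as in the first sketch should approximately recover \(\mu\), \(\Psi\) and the subspace spanned by the columns of \(\Lambda\); note that \(\Lambda\) itself is only identifiable up to a rotation of the latent space, so recovered parameters are best compared through \(\Lambda\Lambda^T\) rather than entrywise.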