Markov Decision Process

This article is my notes for 16th lecture in Machine Learning by Andrew Ng on Markov Decision Process (MDP). MDP is a typical way in machine learning to formulate reinforcement learning, whose tasks roughly speaking are to train agents to take actions in order to get maximal rewards in some settings. One example of reinforcement learning would be developing a game bot to play Super Mario on its own.

Another simple example is used in the lecture, and I will use it throughout the post as well. Since the example is really simple, so MDP shown below is not of the most general form, but only good enough to solve the example and give the idea of what MDP and reinforcement learning are. The example begins with a 3 by 4 grid as below.

the grid

Read More

The Phantom of the Opera


Read More

Principal Component Analysis

This article is my notes on Principal Component Analysis (PCA) for Lecture 14 and 15 of Machine Learning by Andrew Ng. Given a set of high dimensional data \(\{x^{(1)}, \dots, x^{(m)}\}\), where each \(x^{(i)} \in \R^{n}\), with the assumption that these data actually roughly lie in a much smaller \(k\) dimensional subspace, PCA tries to find a basis for this \(k\) dimensional subspace. Let's look at a simple example:

Read More




Read More




Read More

Factor Analysis

This article is my notes on the topic of factor analysis. These notes come out of lecture 13 and 14 of Andrew Ng's online course. Roughly speaking, factor analysis models some \(n\) dimensional observed data with the assumption that these data are actually from some \(d\) dimensional plane in \(\R\), up to some Gaussian distributed errors. Let's make it more precise.

Suppose we have a set of observed data \(\{x^{(1)}, \dots, x^{(m)}\}\) implicitly labeled by some latent random variable \(z \in \R^d\) where

$$z \sim \mathcal{N}(0, I).$$

Factor analysis model tries to model \(P(x)\) using the assumption that

$$\begin{equation} x|z \sim \mathcal{N}(\mu+\Lambda z, \Psi), \label{cond-xz} \end{equation}$$

for some \(\mu \in \R^n, \Lambda \in \R^{n \times d}\) and diagonal matrix \(\Psi \in \R^{n \times n}\). These \(\mu, \Lambda\) and \(\Psi\) are parameters of the model.

Read More

Expectation-Maximization algorithm

In this article, I will collect my notes on Expectation-Maximization algorithm (EM) based on lecture 12 and 13 of Andrew Ng's online course. Given a set of unlabeled data points EM tries iteratively to determine the distribution of data, assuming that all data points are implicitly labeled (unobserved latent variables). For simplicity, we shall limit ourselves to the case where there are only finitely many implicit labels.

Description of the problem

Given a set of unlabeled data \(\{x^{(1)}, \dots, x^{(m)}\}\), our goal is to determine \(P(x)\), the distribution of \(x\), with the following assumptions.


  1. There are finitely many unobserved latent variables \(z \in \{1, \dots, k\}\) and they obey some multinomial distribution, i.e., \(P(z=j) = \phi_j\) with \(\sum \phi_j = 1\).

  2. \(\{P(x|z=j; a_j): j=1, \dots, k\}\) are a family of uniformly parametrized distribution.

Assumptions 1 and 2 will gives us a set of parameters \(\theta = (\phi_1, \dots, \phi_j, a_1,\dots, a_j)\) and

$$\begin{equation} P(x; \theta) = \sum_{j=1}^k P(x|z=j; \theta)P(z=j; \theta). \label{px} \end{equation}$$

We want to find this set of parameters so that the likelihood function

$$L(\theta) = \prod_{i=1}^m P(x^{(i)}) = \prod_{i=1}^m \sum_{j=1}^k P(x^{(i)}|z=j; \theta)P(z=j; \theta).$$

is maximized. Or equivalently, the log likelihood function below is maximized:

$$\begin{equation} l(\theta) = \sum_{i=1}^m \log\left(\sum_{j=1}^k P(x^{(i)}, z=j; \theta)\right), \label{log-likelihood} \end{equation}$$


$$P(x^{(i)}, z=j; \theta) = P(x^{(i)}|z=j; \theta)P(z=j; \theta).$$

Read More

Digit recognition, Softmax

在这篇博客中,我将使用softmax模型来识别手写数字。文章的第一部分是关于softmax模型的理论推导,而第二部分则是模型的实现。softmax的本质是一个线性模型,所以推导所需要的理论在我之前的一篇博客Generalized Linear Model已经详细介绍过了。softmax是逻辑回归(logistic regression)的推广:逻辑回归使用Bernoulli分布(二项分布),而softmax使用多项分布。

Read More

Support Vector Machine

This article is my notes on support vector machine for Lecture 7 and 8 of Machine Learning by Andrew Ng.


In a binary classification problem, we can use logistic regression

$$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}} = g(\theta^T x),$$

where \(g\) is the sigmoid function with a figure of it below.

Then given input \(x\), the model predicts \(1\) if and only if \(\theta^x \ge 0\), in which case \(h_\theta(x) = g(\theta^T x) \ge 0.5\); and it predicts \(0\) if and only if \(\theta^T x < 0\). Moreover, based on the shape of sigmoid function, if \(\theta^T x >> 0\), we are very confident that \(y=1\). Likewise, if \(\theta^T x << 0\), we are very confident that \(y=0\). Therefore, we hope that for the training set \(\{(x^{(i)}, y^{(i)})\}_{i=1}^m\), we can find such a \(\theta\) that \(\theta^T x^{(i)} >> 0\) if \(y^{(i)}=1\) and \(\theta^T x^{(i)} << 0\) if \(y^{(i)}=0\).

Read More

Generative Model

This article is my notes on generative model for Lecture 5 and 6 of Machine Learning by Andrew Ng. What we do in logistic regression using generalized linear model is that, we approximate \(P(y|x)\) using given data. This kind of learning algorithms is discriminative, in which we predict \(y\) based on the input features \(x\). On the contrary, generative model is to model \(P(x|y)\), the probability of the features \(x\) given class \(y\). In other words, we want to study how the features structure looks like given a class \(y\). If we also learn what \(P(y)\) is, we can easily recover \(P(y|x)\), for example, in the binary classification problem,

$$\begin{equation} P(y=1|x) = \frac{P(x|y=1)P(y=1)}{P(x)}, \label{eqn:bayes} \end{equation}$$

where \(P(x) = P(x|y=0)P(y=0) + P(x|y=1)P(y=1)\).

In this article, we are going to see a simple example of generative model on Gaussian discriminant analysis and Naive Bayes.

Read More