This article on Generalized Linear Model (GLM) is based on the first four lectures of Machine Learning by Andrew Ng. But the structure of the article is quite different from the lecture. I will talk about exponential family of distributions first. Then I will discuss the general idea of GLM. Finally, I will try to derive some well known learning algorithms from GLM.

Exponential Family

A family of distributions is an exponential family if it can be parametrized by vector $\eta$ in the form $$P(y; \eta) = b(y)\exp(\eta^{T} T(y)-a(\eta)),$$ where $T(y)$ and $b(y)$ are (vector-valued) functions in terms of $y$, and $a(\eta)$ is a function in terms of $\eta$.

$\eta$ is called the natural parameter and $T(y)$ is called the sufficient statistic.

Example 1. Consider a family of normal distributions $P(y; \mu)$ with unknown mean $\mu$ and known variance $\sigma^2$. Then

$$P(y; \mu) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y-\mu)^2}{2\sigma^2}}.$$

We can rewrite $P(y; \mu)$ in the following form:

$$P(y; \mu) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{y^2}{2\sigma^2}} \cdot \exp(\frac{\mu}{\sigma} \cdot \frac{y}{\sigma}-\frac{\mu^2}{2\sigma^2}).$$

Therefore, we can set

\begin{align*} \eta & = \frac{\mu}{\sigma} \\ T(y) & = \frac{y}{\sigma} \\ a(\eta) & = \frac{1}{2}\eta^2 \\ b(y) &= \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{y^2}{2\sigma^2}} \end{align*}

Example 2. Consider a family of Bernoulli distributions parametrized by $\phi$. Then

$$P(y; \phi) = \phi^{y}(1-\phi)^{1-y}.$$

Let's rewrite it in the exponential form:

$$\begin{equation*} \begin{split} P(y; \phi) &= \exp(\log(\phi^y(1-\phi)^{1-y})) \\ &= \exp(y\log(\phi)+(1-y)\log(1-\phi)) \\ &= \exp(y[\log(\phi)-\log(1-\phi)] + \log(1-\phi)) \\ &= \exp\left(\log\left(\frac{\phi}{1-\phi}\right) \cdot y + \log(1-\phi)\right) \end{split} \end{equation*}$$

Compare the above result with the definition of exponential family, we get

\begin{align*} \eta & = \log\left(\frac{\phi}{1-\phi}\right) \\ T(y) & = y \\ a(\eta) & = -\log(1-\phi) \\ b(y) &= 1 \end{align*}

We notice that $a(\eta)$ is not in terms of $\eta$. We need to express $\phi$ in terms of $\eta$. From $\eta = \log\left(\frac{\phi}{1-\phi}\right)$, we solve

$$\phi = \frac{1}{1+e^{-\eta}}.$$

So,

$$\begin{split} a(\eta) &= -\log(1-\phi) = -\log\left(1-\frac{1}{1+e^{-\eta}}\right) \\ &= -\log\left(\frac{e^{-\eta}}{1+e^{-\eta}}\right) = -\log\left(\frac{1}{1+e^{\eta}}\right) \\ &= \log\left(1+e^\eta\right). \end{split}$$

Generalized Linear Model

In machine learning, we are often given a large samples set $S = \{(x^{(i)}, y^{(i)}): i = 1, \dots, m\}$ and our task is to come up with some learning algorithm $h_\theta(x)$ depending on $\theta$ such that $h_\theta(x)$ models $S$ well in certain sense. GLM is one powerful machinery which gives us a way to find reasonably good $h_\theta(x)$.

GLM has three assumptions:

(1) Given input $x$ and learning parameter $\theta$, the output $y|x, \theta$ is distributed in an exponential family $P(y; \eta)$ for some natural parameter $\eta$.

(2) $h_\theta(x) = E(T(y)|x; \theta)$, the expected value of $T(y)$ given $x$.

(3) $\eta = \theta^T x$. (In case $\eta$ is a vector, assume $\eta_i = \theta_i^T x$.)

Remark. The exponential family varies as learning problem varies. For example, if we want to model the number of people visiting a certain website according to time, we should use Poisson distribution. The nature of the problem usually determine the exponential family we should use. The second assumption is the one that give us the learning algorithm, which is $h_\theta(x)$. However, one thing to keep in mind is that, the algorithm predicts $T(y)$ but not $y$. The third and the last assumption is the design choice in GLM. I guess this is the reason why this model is called generalized linear model.

Once we decide what kind of exponential family we should use, we can derive the learning algorithm $h_\theta(x)$ from GLM. But how to determine $\theta$? One of the answers is to use maximum likelihood estimation. Let's dive into the idea of maximum likelihood estimation.

The chosen exponential family $P(y; \eta)$ or $P(y; x, \theta)$ is a probability density function of $y$ in terms of $x$ and $\theta$. Let's fix $\theta$ for now. Then $P(y; x, \theta)$ is a probability density function of $y$ in terms of $x$. Given a sample point $(x^{(i)}, y^{(i)})$, $P(y^{(i)}; x^{(i)}, \theta)$ is the relative likelihood of $h_\theta(x^{(i)})$ being $y^{(i)}$, measuring how good our learning algorithm $h_\theta$ at the $i$-th sample point. We must be aware that $P(y^{(i)}; x^{(i)}, \theta)$ is the relative likelihood, and the absolute likelihood of $h_\theta(x^{(i)})$ being $y^{(i)}$ is always 0. Using the relative likelihood, we can define a likelihood function of $\theta$:

$$L(\theta) = \prod_{i=1}^m P(y^{(i)}; x^{(i)}, \theta).$$

Therefore, the larger $L(\theta)$ is, the better our learning algorithm $h_\theta(x)$ is. Hence, to find the best learning parameter $\theta$, we need to maximize $L(\theta)$. Equivalently, we need to maximize the log-likelihood function

$$l(\theta) = \log L(\theta) = \sum_{i = 1}^m \log P(y^{(i)}; x^{(i)}, \theta).$$

Linear regression

Suppose the chosen exponential family $P(y; \mu)$ for GLM is a family of normal distributions parametrized by mean $\mu$ with fix variant $\sigma^2$:

$$P(y; \mu) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y-\mu)^2}{2\sigma^2}}.$$

From Example 1 above,

\begin{align*} \eta & = \frac{\mu}{\sigma} \\ T(y) & = \frac{y}{\sigma} \\ a(\eta) & = \frac{1}{2}\eta^2 \\ b(y) &= \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{y^2}{2\sigma^2}} \end{align*}

By GLM, we have

$$$$\begin{split} h_\theta(x) &= E(T(y)|x; \theta) = \int_{-\infty}^{\infty} T(y)P(y; \mu) \, dy \\ &= \int_{-\infty}^{\infty} \frac{y}{\sigma} \cdot \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \, dy \\ &= \frac{\mu}{\sigma} = \eta = \theta^T x. \end{split} \label{eqn:linear}$$$$

By the remark of GLM above, $h_\theta(x)$ predicts $T(y)=\sigma y$, so the model predicting $y$ is the following, which I denote it also $h_\theta(x)$, since $\sigma$ can be absorbed in $\theta$.

$$h_\theta(x) = \sigma \theta^T x.$$

From Equation (\ref{eqn:linear}), we aslo know that $\mu = \sigma \theta^T x$. At each sample point $(x^{(i)}, y^{(i)})$,

$$\begin{split} P(y^{(i)}; x^{(i)}, \theta) &= P(y^{(i)}; \mu=\sigma \theta x^{(i)}) \\ &= \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y^{(i)}-\sigma \theta x^{(i)})^2}{2\sigma^2}} \\ &= \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y^{(i)}-h_\theta(x^{(i)}))^2}{2\sigma^2}}. \end{split}$$

Hence,

$$l(\theta) = -m\log(2\sqrt{\pi}\sigma)-\frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right)^2.$$

To maximize $l(\theta)$ is then the same as minmimize

$$J(\theta) = \sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right)^2.$$

Since $\sigma$ is fixed, we can absorb $\sigma$ into $\theta$ so that our learning algorithm is in the standard form

$$h_\theta(x) = \theta^T x,$$

where $\theta$ minimize $J(\theta)$. We rediscover linear regression from normal distributions. This discovery also hints that we should use linear regression where $y$ is normally distributed according to $x$ with fixed variant.

Logistic regression

The core of logistic problems is to group data into two category. For example, to determine whether a tumor is benign or malignant according to a set of medical features is a logistic problems. In the actual world, instead of being 100% certain, a doctor might tell a patient that there is, say 99% chances, that a tumor is benign. So a logistic regression is to consume an input and output the probability of being positive. (In the tumor example above, being positive means the tumor is benign.) The probability of being positive or not simply obeys Bernoulli distribution. Therefore, to obtain the learning algorithm for logistic regression, we start off the exponential family of Bernoulli distributions parametrized by $\phi$:

$$P(y; \phi) = \phi^{y}(1-\phi)^{1-y}.$$

Example 2 gives us

\begin{align*} \eta & = \log\left(\frac{\phi}{1-\phi}\right) \\ T(y) & = y \\ a(\eta) & = -\log(1-\phi) \\ b(y) &= 1 \end{align*}

and

$$\phi = \frac{1}{1+e^{-\eta}}.$$

So,

$$\begin{split} h_{\theta}(x) &= E(T(y)|x; \theta) = \phi \cdot 1 + (1-\phi) \cdot 0 \\ &= \phi = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\theta^T x}}. \end{split}$$

Thus, the learning algorithm for logistic regression is

$$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}.$$

At each sample point $(x^{(i)}, y^{(i)})$, the relative likelihood is given by

$$\begin{split} P(y^{(i)}; x^{(i)}, \theta) &= P(y^{(i)}; \phi = h_\theta(x^{(i)})) \\ &= h_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)})^{1-y^{(i)}}. \end{split}$$

Thus, to determine the best $\theta$, we need to maximize

$$l(\theta) = \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})).$$

Poisson regression

Suppose needed to design a learning algorithm to model some counting problems, Poisson distribution might be a good choice. In this section, we assume that sample points are distributed according to Poisson distribution parametried by mean $\mu$:

$$P(y; \mu) = \frac{e^{-\mu} \mu^y}{y!},$$

where $y = 0, 1, 2, \dots$. We need to rewrite $P(y; \mu)$ into an exponential family.

$$P(y; \mu) = \frac{e^{-\mu} \mu^y}{y!} = \frac{e^{-\mu}}{y!} \exp(y \log \mu).$$

Therefore, $P(y; \mu)$ is indeed an exponential family and

\begin{align*} \eta & = \log \mu \\ T(y) & = y \\ a(\eta) & = 0 \\ b(y) &= \frac{e^{-\mu}}{y!} \end{align*}

Since $T(y)=y$,

$$\begin{split} h_\theta(x) &= E(T(y)|x; \theta) \\ &= \sum_{y=0}^{\infty} y \cdot \frac{e^{-\mu} \mu^y}{y!} \\ &= \mu e^{-\mu} \sum_{y=1}^{\infty} \frac{\mu^{y-1}}{(y-1)!} \\ &= \mu e^{-\mu} e^{\mu} = \mu \\ &= e^\eta = e^{\theta^T x}. \end{split}$$

We obtain our learning algorithm: $h_\theta(x) = e^{\theta^T x}$. Again, we use maximum likelihood estimation to determine the best $\theta$. At each sample point $(x^{(i)}, y^{(i)})$, the relative likelihood is

$$\begin{split} P(y^{(i)}; x^{(i)}, \theta) &= P(y^{(i)}; \mu = h_\theta(x^{(i)})) \\ &= \frac{e^{-h_\theta(x^{(i)})} h_\theta(x^{(i)})^{y^{(i)}}}{y^{(i)}!}. \end{split}$$

Therefore,

$$l(\theta) = \sum_{i=0}^m -e^{\theta^T x^{(i)}} + y^{(i)} \cdot \theta^T x^{(i)} - \log\left(y^{(i)}!\right).$$

The desired $\theta$ is the one maximizing $l(\theta)$, or equivalently, the one maximizing

$$l^\prime(\theta) = \sum_{i=0}^m -e^{\theta^T x^{(i)}} + y^{(i)} \cdot \theta^T x^{(i)}.$$