Digit recognition, Softmax | 叶某人的碎碎念

在这篇博客中，我将使用softmax模型来识别手写数字。文章的第一部分是关于softmax模型的理论推导，而第二部分则是模型的实现。softmax的本质是一个线性模型，所以推导所需要的理论在我之前的一篇博客Generalized Linear Model已经详细介绍过了。softmax是逻辑回归(logistic regression)的推广：逻辑回归使用Bernoulli分布（二项分布），而softmax使用多项分布。

softmax模型

假设$y$取值于$\{1, 2, \dots, k\}$，也就是说数据集中包含$k$类。一般来说对于输入特征$x \in \mathbb{R}^{n+1}$，我们会规定$x_0 = 1$。也就是说实际上输入特征是$n$维的，引入$x_0=1$是为了可以将线性方程写成

$$\theta^T x.$$

如果去掉$x_0$这个维度，那么关于$x$的线性方程就要写成

$$W^T x + b.$$

在本文中我们只考虑特征$x \in \mathbb{R}^n$具有真实的$n$维，也就是说去除$x_0$维度。最后假设训练集有$m$个数据点：$\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$。

我们的目标是根据训练集建立一个线性模型进行分类。首先我们引进$k$个正参数：$\phi_1, \dots, \phi_k$，并且假设$y$遵循多项分布

$$P(y=j) = \frac{\phi_j}{\phi_1+\cdots+\phi_k}.$$

这些参数是跟输入特征$x$有关，我们希望根据特征$x$算出所有参数，而这些参数根据上述多项分布给出y属于各个类的概率。

推导线性模型的第一步是将概率分布写成指数形式，为此，我们先引入一个函数$1\{P\}$：

$$1\{P\} = \left\{\begin{array}{ll} 1 & \text{ if P is true} \\ 0 & \text{ if P is false} \end{array}\right.$$

那么上述多项分布可以写成

$$\begin{split} P(y) &= \frac{\prod_{j=1}^k \phi_j^{1\{y=j\}}}{\sum_{j=1}^k \phi_j} \\ &= \exp\left(\sum_{j=1}^k 1\{y=j\} \log \phi_j - \log\left(\sum_{j=1}^k \phi_j\right)\right) \\ &= b(y)\exp\left(\eta^T \cdot T(y) - a(\eta)\right) \end{split},$$

其中 $b(y) = 1, a(\eta) = \log\left(\sum_{j=1}^k \phi_j\right)$，

$$T(y) = \left[\begin{array}{c}1\{y=1\} \\ \vdots \\ 1\{y=k\}\end{array}\right], \text{ } \eta = \left[\begin{array}{c}\log(\phi_1) \\ \vdots \\ \log(\phi_k)\end{array}\right].$$

推广的线性模型（具体看Generalized Linear Model）假设

$$\log(\phi_i) = W_i^T \cdot x + b_i,$$

也就是

$$\phi_i = \exp(W_i^T \cdot x + b_i),$$

这里$W_i$是一个$n$维的列向量，$b_i$是一个实数。这样此线性模型由$T(y)$的期望值给出：

$$\begin{equation} \begin{split} h_{W, b}(x) &= E[T(y)|x; W, b] = \sum_{j=1}^k T(j)P(j) \\ &= \left[\begin{array}{c}\frac{\phi_1}{\sum_{j=1}^k \phi_j} \\ \vdots \\ \frac{\phi_k}{\sum_{j=1}^k \phi_j}\end{array}\right] = \left[\begin{array}{c}\frac{\exp(W_1^T \cdot x + b_1)}{\sum_{j=1}^k \exp(W_j^T \cdot x + b_j)} \\ \vdots \\ \frac{\exp(W_k^T \cdot x + b_k)}{\sum_{j=1}^k \exp(W_j^T \cdot x + b_j)}\end{array}\right] \end{split}. \label{model1} \end{equation}$$

如果我们定义

$$W = \left[\begin{array}{c}W_1^T \\ \vdots \\ W_k^T\end{array}\right], \text{ } b = \left[\begin{array}{c}b_1 \\ \vdots \\ b_k\end{array}\right],$$

并且定义softmax函数$\sigma: \mathbb{R}^k \to \mathbb{R}^k$为

$$\sigma\left(\begin{array}{c}z_1 \\ \vdots \\ z_k\end{array}\right) = \left[\begin{array}{c} \frac{e^{z_1}}{\sum_{j=1}^k e^{z_j}} \\ \vdots \\ \frac{e^{z_k}}{\sum_{j=1}^k e^{z_j}}\end{array}\right],$$

那么上述的线性模型($\ref{model1}$)可以简化为

$$\begin{equation} h_{W, b}(x) = \sigma(Wx+b). \label{model} \end{equation}$$

给定训练集，我们通过最大可能性估计(maximum likelihood estimation)来学习$W$和$b$。对于任意一个数据点$(x^{i}, y^{i})$

$$\begin{split} P(y^{(i)}; x^{(i)}, W, b) &= \exp\left(\sum_{j=1}^k 1\{y^{(i)}=j\} \log (\exp(W_j^T x^{(i)} + b_j)) - \log\left(\sum_{j=1}^k \exp(W_j^T x^{(i)} + b_j)\right)\right) \\ &= \exp\left(\sum_{j=1}^k 1\{y^{(i)}=j\} \log \frac{\exp(W_j^T x^{(i)} + b_j)}{\sum_{j=1}^k \exp(W_j^T x^{(i)} + b_j)}\right) \\ &= \exp(<T(y^{(i)}), \log h_{W, b}(x^{(i)})>). \end{split}$$

最后等式中的$<,>$是内积，$\log h_{W, b}(x^{(i)})$表示log函数作用在$h_{W, b}(x^{(i)})$的每个分量上。因此我们要最大化

$$\begin{equation} \begin{split} l(W, b) &= \frac{1}{m}\sum_{i=1}^m \log P(y^{(i)}; x^{(i)}, W, b) \\ &= \frac{1}{m}\sum_{i=1}^m <T(y^{(i)}), \log h_{W, b}(x^{(i)})> \end{split} \label{loglike} \end{equation}$$

模型实现

我们将用上面推导出来的softmax模型识别手写数字。这些数据从LeCun Yann的网页得到。说明一下的是我用jupyter notebook使用python3运行下面代码的。使用到的第三方库有 numpy, matplotlib 和 tensorflow。

import sys
py = 'Python ' + '.'.join(map(str, sys.version_info[:3]))
print('Jupyter notebook with kernel: {}'.format(py))

import gzip
import time
from urllib.request import urlopen
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

Jupyter notebook with kernel: Python 3.6.3

首先要获取训练集。因为文件是gz格式的，所以要使用gzip来解压。解压后的文件格式在网页上有说明，我根据说明写出了下面的函数get_samples。

def get_samples(images, labels):
    img_num = int.from_bytes(images[4:8], byteorder='big')
    lbl_num = int.from_bytes(labels[4:8], byteorder='big')
    assert(img_num == lbl_num)
    print('There are {} samples.'.format(img_num))
    row = int.from_bytes(images[8:12], byteorder='big')
    col = int.from_bytes(images[12:16], byteorder='big')
    img_size = row * col
    x, y = [], []
    for i in range(img_num):
        img_offset = 16 + img_size * i
        lbl_offset = 8 + i
        img = np.array(list(images[img_offset:img_offset+img_size]))
        lbl = labels[lbl_offset]
        lbl = np.array([i==lbl for i in range(10)], dtype=np.int8)
        x.append(img)
        y.append(lbl)
    return x, y

get_samples的输入是解压后图像文件和对应的标记文件的字节串，输出为(x, y)，其中x是包含表示图像的列表，每个元素是一个28x28=784维的向量。y是对应x的数字标记。下面是获取所有训练集并且展示其中前25个数据点的代码。

images_url = 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz'
labels_url = 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz'
images = gzip.decompress(urlopen(images_url).read())
labels = gzip.decompress(urlopen(labels_url).read())
train_x, train_y = get_samples(images, labels)

There are 60000 samples.

r = c = 5
fig, ax = plt.subplots(r, c)
for i in range(r):
    for j in range(c):
        k = i * c + j
        img = train_x[k]
        img = img.reshape((28, 28))
        lbl = train_y[k].argmax()
        lbl = str(lbl)
        ax[i, j].imshow(img, cmap='gray_r')
        ax[i, j].set_title(lbl)
        ax[i, j].axis('off')
plt.subplots_adjust(hspace=0.7)
plt.show()

接着我们使用tensorflow训练得到参数$W$和$b$。我们是要最大化等式$(\ref{loglike})$中的$l(W, b)$，这等价于最小化$-l(W, b)$。

x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W)+b)
y_ = tf.placeholder(tf.float32, [None, 10])
cost = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.00001).minimize(cost)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

begin = time.time()
for i in range(200):
    sess.run(train_step, feed_dict={x: train_x, y_: train_y})
end = time.time()
print('Runtime: {} seconds'.format(end-begin))

Runtime: 107.31829977035522 seconds

为了检测参数的好坏，我们从刚才的网页上下载测试集并计算错误率：

images_url = 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz'
labels_url = 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
images = gzip.decompress(urlopen(images_url).read())
labels = gzip.decompress(urlopen(labels_url).read())
test_x, test_y = get_samples(images, labels)

There are 10000 samples.

pred_y = sess.run(y, feed_dict={x: test_x})
result = [0, 0]
for ty, py in zip(test_y, pred_y):
    result[ty.argmax()==py.argmax()] += 1
error_rate = result[0] * 100 / sum(result)
print('Error rate is: {}%'.format(error_rate))

Error rate is: 8.9%

让我们看看对于测试集前25个数据的预测情况。图像上面的数字是预测的数字，紧接的"✔"表示预测成功，“✖”表示预测失败。

r = c = 5
fig, ax = plt.subplots(r, c)
for i in range(r):
    for j in range(c):
        k = i * c + j
        img = test_x[k]
        img = img.reshape((28, 28))
        lbl = test_y[k].argmax()
        lbl = str(lbl)
        pred = str(pred_y[k].argmax())
        ind = '✔' if lbl == pred else '✖'
        title = '{}{}'.format(pred, ind)
        ax[i, j].imshow(img, 'gray_r')
        ax[i, j].set_title(title)
        ax[i, j].axis('off')
plt.subplots_adjust(hspace=0.7)
plt.show()

第二行中的5被错误预测成6了。我们应该怎样减低错误率呢？一个最简单的方法是增加迭代次数。问题在于训练集中有60000数据点，所以迭代200次在我的机器上已经耗时107秒了。如果迭代1000次的话大概要10分钟，错误率将会降低到7.92%。注意到输入的特征$x$是一个784维的向量，所以特征空间的VC维度是785。因此每次迭代的时候我们只需要大概785个数据点就行，没有必要使用全部数据点。根据这个观察，我们可以将数据分成若干块，每次迭代使用其中一块数据，这种我们在增加迭代的同时可以减少时间的消耗（相对于使用全部数据点）。

sess.close()
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W)+b)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.00001).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

begin = time.time()
for i in range(1000):
    k = i % 60
    bb= k * 1000
    be = (k+1)*1000
    batch_x = train_x[bb:be]
    batch_y = train_y[bb:be]
    sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})
end = time.time()
print('Rumtime: {} seconds'.format(end-begin))
pred_y = sess.run(y, feed_dict={x: test_x})
result = [0, 0]
for ty, py in zip(test_y, pred_y):
    result[ty.argmax()==py.argmax()] += 1
error_rate = result[0] * 100 / sum(result)
print('Error rate is: {}%'.format(error_rate))

Rumtime: 8.727920055389404 seconds
Error rate is: 7.77%

上述代码每次迭代使用1000个数据点，迭代1000次耗时仅9秒，但是错误率比之前用107秒得到的8.9%低了1.13%！最后看看预测错误的25个数据点。

r = c = 5
fig, ax = plt.subplots(r, c)
k = 0
for i in range(r):
    for j in range(c):
        while test_y[k].argmax() == pred_y[k].argmax():
            k += 1
        img = test_x[k]
        img = img.reshape((28, 28))
        lbl = test_y[k].argmax()
        lbl = str(lbl)
        pred = str(pred_y[k].argmax())
        ind = '✔' if lbl == pred else '✖'
        title = '{}{}'.format(pred, ind)
        ax[i, j].imshow(img, 'gray_r')
        ax[i, j].set_title(title)
        ax[i, j].axis('off')
        k += 1
plt.subplots_adjust(hspace=0.7)
plt.show()