Simple MNIST numpy from scratch
Summary
This note is an MNIST digit recognizer implemented in numpy from scratch.
This is a simple demonstration, mainly for pedagogical purposes, showing the basic workflow of a machine learning algorithm using a simple feedforward neural network. The derivative in the backpropagation stage is computed explicitly through the chain rule.
- The model is a 3-layer feedforward neural network (FNN): the input layer has 784 units, the 256-unit hidden layer is activated by ReLU, and the output layer is activated by the softmax function to produce a discrete probability distribution for each input.
- The loss function, the model's hypothesis function, and the gradient of the loss function are all implemented from the ground up in numpy in a highly vectorized fashion (no for loops).
- The training uses standard gradient descent with the step size adapted by Root Mean Square propagation (RMSprop); for simplicity, no cross-validation set is reserved and no model averaging is performed.
The code is vectorized and is adapted from the Softmax regression and Neural Network lectures used in UCI Math 10.
Caveat lector: fluency in linear algebra and multivariate calculus is assumed.
References:
- Stanford Deep Learning tutorial in MATLAB.
- Learning PyTorch with examples.
- An overview of gradient descent optimization algorithms.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
Data input
import os
path = os.listdir("../input")
print(path)
['train.csv', 'sample_submission.csv', 'test.csv']
# Read the data
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv("../input/test.csv")
# Set up the data
y_train = train_data['label'].values
X_train = train_data.drop(columns=['label']).values/255
X_test = test_data.values/255
Some examples of the input data
We randomly choose 10 samples from the training set and visualize them.
fig, axes = plt.subplots(2,5, figsize=(12,5))
axes = axes.flatten()
idx = np.random.randint(0,42000,size=10)
for i in range(10):
    axes[i].imshow(X_train[idx[i],:].reshape(28,28), cmap='gray')
    axes[i].axis('off') # hide the axes ticks
    axes[i].set_title(str(int(y_train[idx[i]])), color='black', fontsize=25)
plt.show()

Network structures
[Figure: a simplified three-layer feedforward network, with "+1" denoting the bias units]
The figure above is a simplification of the neural network used in this example. The circles labeled "+1" are the bias units. Layer 1 is the input layer, Layer 3 is the output layer, and the middle layer, Layer 2, is the hidden layer.
The neural network in the figure has 2 input units (not counting the bias unit), 3 hidden units, and 1 output unit. In the actual computation below, the input layer has 784 units, the hidden layer has 256 units, and the output layer has 10 units ($K = 10$ classes).
The weight matrix $W^{(0)}$ mapping the input $x$ from the input layer (Layer 1) to the hidden layer (Layer 2) is of shape $(784, 256)$, together with a $(256,)$ bias $b$. The activation $a$ from the hidden layer (Layer 2) can then be written as
$$a = \mathrm{ReLU}\bigl((W^{(0)})^{\top} x + b\bigr),$$
where the ReLU activation function is $\mathrm{ReLU}(z) = \max(z, 0)$ and can be implemented in a vectorized fashion as follows.
# relu activation function
# THE fastest vectorized implementation for ReLU
def relu(x):
    x[x<0]=0
    return x
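Note that `relu` above modifies its input array in place (and returns it). An equivalent out-of-place version, shown here only as a sketch for comparison and not used in the rest of the notebook, relies on `np.maximum` and leaves the caller's array untouched:
import numpy as np

def relu_copy(x):
    # out-of-place ReLU: returns a new array instead of mutating x
    return np.maximum(x, 0)

# quick check against the in-place version above (relu mutates only the copy here)
z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.allclose(relu_copy(z), relu(z.copy())))  # True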
Softmax activation, prediction, and the loss function
From the hidden layer (Layer 2) to the output layer (Layer 3), the weight matrix $W^{(1)}$ is of shape $(256, 10)$ and has the form
$$W^{(1)} = \begin{pmatrix} | & | & & | \\ \theta_1 & \theta_2 & \cdots & \theta_K \\ | & | & & | \end{pmatrix},$$
which maps the activation from Layer 2 to Layer 3 (the output layer); there is no bias because a constant can be freely added to the activation without changing the final output.
At the last layer, a softmax activation is used. Combined with the weight matrix $W^{(1)}$ that maps the activation $a$ from the hidden layer to the output layer, it can be written as
$$P(y = k \mid a; W^{(1)}) = \sigma_k(a; W^{(1)}) := \frac{\exp(\theta_k^{\top} a)}{\sum_{j=1}^{K} \exp(\theta_j^{\top} a)}.$$
$\{P(y = k \mid a; W^{(1)})\}_{k=1}^{K}$ is the probability distribution of our model, which estimates the probability that the input $x$'s label $y$ is of class $k$. We denote this distribution by a vector
$$\sigma := (\sigma_1, \dots, \sigma_K)^{\top}.$$
We hope that this estimate is as close as possible to the true probability $1_{\{y = k\}}$, which is $1$ if the sample $x$ is in the $k$-th class and $0$ otherwise.
Lastly, our prediction $\hat{y}$ for the sample $x$ can be made by choosing the class with the highest probability:
$$\hat{y} = \operatorname*{arg\,max}_{k = 1, \dots, K} P(y = k \mid a; W^{(1)}). \tag{$*$}$$
Denote the label of the $i$-th input as $y^{(i)}$. The sample-wise loss function is then the cross entropy measuring the difference between the distribution given by the model above and the true one $1_{\{y^{(i)} = k\}}$: denoting $W = (W^{(0)}, W^{(1)})$, $b = (b)$, and letting $a^{(i)}$ be the activation of the $i$-th sample in the hidden layer (Layer 2),
$$J_i := J(W, b; x^{(i)}, y^{(i)}) := -\sum_{k=1}^{K} \Bigl\{ 1_{\{y^{(i)} = k\}} \log P\bigl(y^{(i)} = k \mid a^{(i)}; W^{(1)}\bigr) \Bigr\}. \tag{1}$$
Denote the data sample matrix $X := (x^{(1)}, \dots, x^{(N)})^{\top}$ and its label vector $y := (y^{(1)}, \dots, y^{(N)})$. The final loss then has an extra $L^2$-regularization term for the weight matrices (not for the bias):
$$L(W, b; X, y) := \frac{1}{N} \sum_{i=1}^{N} J_i + \frac{\alpha}{2} \bigl( \| W^{(0)} \|^2 + \| W^{(1)} \|^2 \bigr), \tag{2}$$
where $\alpha > 0$ is a hyper-parameter determining the strength of the regularization: the bigger $\alpha$ is, the smaller the magnitudes of the weights will be after training.
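As noted above, the output layer carries no bias because the softmax is invariant to adding the same constant to every logit. A quick numerical check (a small sketch, not part of the original notebook) illustrates this:
import numpy as np

z = np.array([1.0, 2.0, 0.5])   # some logits
c = 7.3                         # an arbitrary constant added to all logits
softmax_z = np.exp(z) / np.exp(z).sum()
softmax_z_shifted = np.exp(z + c) / np.exp(z + c).sum()
print(np.allclose(softmax_z, softmax_z_shifted))  # True: the distribution is unchanged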
def h(X,W,b):
    '''
    Hypothesis function: simple FNN with 1 hidden layer
    Layer 1: input
    Layer 2: hidden layer, with a size implied by the arguments W[0], b
    Layer 3: output layer, with a size implied by the arguments W[1]
    '''
    # layer 1 = input layer
    a1 = X
    # layer 1 (input layer) -> layer 2 (hidden layer)
    z1 = np.matmul(X, W[0]) + b[0]

    # add one more layer

    # layer 2 activation
    a2 = relu(z1)
    # layer 2 (hidden layer) -> layer 3 (output layer)
    z2 = np.matmul(a2, W[1])
    s = np.exp(z2)
    total = np.sum(s, axis=1).reshape(-1,1)
    sigma = s/total
    # the output is a probability for each sample
    return sigma
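As a quick smoke test (a sketch, not part of the original notebook; the `_demo` names are hypothetical), each row of the returned `sigma` should be a probability distribution over the 10 classes and therefore sum to 1:
rng = np.random.default_rng(0)
W_demo = [1e-1*rng.standard_normal((784, 256)), 1e-1*rng.standard_normal((256, 10))]
b_demo = [rng.standard_normal(256)]

sigma_demo = h(X_train[:5], W_demo, b_demo)
print(sigma_demo.shape)                           # (5, 10)
print(np.allclose(sigma_demo.sum(axis=1), 1.0))   # True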
def softmax(X_in,weights):
    '''
    Un-used cell, kept for demo purposes
    Activation function for the last FC layer: softmax function
    Output: K probabilities representing an estimate of P(y=k|X_in;weights) for k=1,...,K
    the weights have shape (n, K)
    n: the number of features X_in has
    n = X_in.shape[1]
    K: the number of classes
    K = 10
    '''
    s = np.exp(np.matmul(X_in,weights))
    total = np.sum(s, axis=1).reshape(-1,1)
    return s / total
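One caveat with the plain `np.exp` used in `h` and `softmax`: large logits can overflow to `inf`. A common remedy, sketched below using the shift-invariance property shown earlier (this variant is not used in the rest of the notebook), subtracts the row-wise maximum before exponentiating:
def softmax_stable(X_in, weights):
    # same interface as softmax() above, but numerically stabilized
    z = np.matmul(X_in, weights)
    z = z - z.max(axis=1, keepdims=True)   # shifting the logits does not change the output
    s = np.exp(z)
    return s / s.sum(axis=1, keepdims=True)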
def loss(y_pred,y_true):
    '''
    Loss function: cross entropy (the L^2 regularization term in (2) is handled in the gradient)
    y_true: ground truth, of shape (N, )
    y_pred: prediction made by the model, of shape (N, K)
    N: number of samples in the batch
    K: global variable, number of classes
    '''
    global K
    K = 10
    N = len(y_true)
    # loss_sample stores the cross entropy for each sample in X
    # convert y_true from labels to one-hot-vector encoding
    y_true_one_hot_vec = (y_true[:,np.newaxis] == np.arange(K))
    loss_sample = (np.log(y_pred) * y_true_one_hot_vec).sum(axis=1)
    # loss_sample is a dimension (N,) array
    # for the final loss, we need to take the average
    return -np.mean(loss_sample)
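If the model ever assigns probability exactly 0 to a true class, `np.log(y_pred)` above produces `-inf`. A hedged variant (not in the original notebook; `loss_clipped` and `eps` are names introduced here) clips the predictions away from zero first:
def loss_clipped(y_pred, y_true, eps=1e-12):
    # cross entropy with predictions clipped to [eps, 1] to avoid log(0)
    K = 10
    y_true_one_hot_vec = (y_true[:, np.newaxis] == np.arange(K))
    loss_sample = (np.log(np.clip(y_pred, eps, 1.0)) * y_true_one_hot_vec).sum(axis=1)
    return -np.mean(loss_sample)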
Backpropagation (chain rule)
The derivative of the cross entropy $J$ in (1), for a single sample and its label $(x, y)$, with respect to the weights and the bias is computed using the following procedure:
Step 1: Forward pass: compute the activations $a = (a_1, \dots, a_{n_2})$ from the hidden layer (Layer 2) and $\sigma = (\sigma_1, \dots, \sigma_K)$ from the output layer (Layer 3).
Step 2: Derivatives for $W^{(1)}$: recall that $W^{(1)} = (\theta_1, \cdots, \theta_K)$ and denote
$$z^{(2)} = \bigl(z^{(2)}_1, \dots, z^{(2)}_K\bigr) = (W^{(1)})^{\top} a = \bigl(\theta_1^{\top} a, \cdots, \theta_K^{\top} a\bigr);$$
then, for the $k$-th output unit in the output layer (Layer 3),
$$\delta^{(2)}_k := \frac{\partial J}{\partial z^{(2)}_k} = P(y = k \mid a; W^{(1)}) - 1_{\{y = k\}} = \sigma_k - 1_{\{y = k\}}$$
and
$$\frac{\partial J}{\partial \theta_k} = \frac{\partial J}{\partial z^{(2)}_k}\, \frac{\partial z^{(2)}_k}{\partial \theta_k} = \delta^{(2)}_k\, a.$$
Step 3: Derivatives for $W^{(0)}$, $b$: recall that $W^{(0)} = (w_1, \cdots, w_{n_2})$ and $b = (b_1, \dots, b_{n_2})$, where $n_2$ is the number of units in the hidden layer (Layer 2), and denote
$$z^{(1)} = \bigl(z^{(1)}_1, \dots, z^{(1)}_{n_2}\bigr) = (W^{(0)})^{\top} x + b = \bigl(w_1^{\top} x + b_1, \cdots, w_{n_2}^{\top} x + b_{n_2}\bigr);$$
then, for each node $i$ in the hidden layer (Layer 2), $i = 1, \dots, n_2$,
$$\delta^{(1)}_i := \frac{\partial J}{\partial z^{(1)}_i} = \frac{\partial J}{\partial a_i}\,\frac{\partial a_i}{\partial z^{(1)}_i} = \frac{\partial J}{\partial z^{(2)}} \cdot \Bigl(\frac{\partial z^{(2)}}{\partial a_i}\,\frac{\partial a_i}{\partial z^{(1)}_i}\Bigr) = \Bigl(\sum_{k=1}^{K} \frac{\partial J}{\partial z^{(2)}_k}\,\frac{\partial z^{(2)}_k}{\partial a_i}\Bigr) f'\bigl(z^{(1)}_i\bigr) = \Bigl(\sum_{k=1}^{K} w_{ki}\, \delta^{(2)}_k\Bigr) 1_{\{z^{(1)}_i > 0\}},$$
where $1_{\{z^{(1)}_i > 0\}}$ is the (weak) derivative of the ReLU activation $f$, and the partial derivative of the $k$-th component $z^{(2)}_k$ in the output layer (before the softmax activation) with respect to the $i$-th activation $a_i$ from the hidden layer is the weight $w^{(1)}_{ki}$. Thus
$$\frac{\partial J}{\partial w_{ji}} = x_j\, \delta^{(1)}_i, \qquad \frac{\partial J}{\partial b_i} = \delta^{(1)}_i, \qquad \text{and} \qquad \frac{\partial J}{\partial w_i} = \delta^{(1)}_i\, x, \qquad \frac{\partial J}{\partial b} = \delta^{(1)}.$$
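In the vectorized implementation below, these per-sample formulas are applied to the whole batch at once. As a clarifying restatement (the stacked matrices $A$, $\Delta^{(1)}$, $\Delta^{(2)}$ are notation introduced here, not in the original): stacking the hidden activations $a^{(i)}$ row-wise into $A \in \mathbb{R}^{N \times n_2}$ and the per-sample errors $\delta^{(2),(i)}$, $\delta^{(1),(i)}$ row-wise into $\Delta^{(2)} \in \mathbb{R}^{N \times K}$ and $\Delta^{(1)} \in \mathbb{R}^{N \times n_2}$, the gradients of the regularized loss (2) become
$$\frac{\partial L}{\partial W^{(1)}} = \frac{1}{N} A^{\top} \Delta^{(2)} + \alpha W^{(1)}, \qquad \frac{\partial L}{\partial W^{(0)}} = \frac{1}{N} X^{\top} \Delta^{(1)} + \alpha W^{(0)}, \qquad \frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(1),(i)},$$
which is exactly what the matrix products in the `backprop` function below compute.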
def backprop(W,b,X,y,alpha=1e-4):
    '''
    Step 1: explicit forward pass h(X;W,b)
    Step 2: backpropagation for dW and db
    '''
    K = 10
    N = X.shape[0]

    ### Step 1:
    # layer 1 = input layer
    a1 = X
    # layer 1 (input layer) -> layer 2 (hidden layer)
    z1 = np.matmul(X, W[0]) + b[0]
    # layer 2 activation
    a2 = relu(z1)

    # one more layer

    # layer 2 (hidden layer) -> layer 3 (output layer)
    z2 = np.matmul(a2, W[1])
    s = np.exp(z2)
    total = np.sum(s, axis=1).reshape(-1,1)
    sigma = s/total

    ### Step 2:
    # layer 2->layer 3 weights' derivative
    # delta2 is \partial J/\partial z2, of shape (N,K)
    y_one_hot_vec = (y[:,np.newaxis] == np.arange(K))
    delta2 = (sigma - y_one_hot_vec)
    grad_W1 = np.matmul(a2.T, delta2)

    # layer 1->layer 2 weights' derivative
    # delta1 is \partial J/\partial z1
    # layer 2 activation's (weak) derivative is 1*(z1>0)
    delta1 = np.matmul(delta2, W[1].T)*(z1>0)
    grad_W0 = np.matmul(X.T, delta1)

    # Student project: extra layer of derivative

    # no derivative for layer 1

    # the alpha part is the derivative for the regularization
    # regularization = 0.5*alpha*(np.sum(W[1]**2) + np.sum(W[0]**2))
    dW = [grad_W0/N + alpha*W[0], grad_W1/N + alpha*W[1]]
    db = [np.mean(delta1, axis=0)]
    # dW[0] is W[0]'s derivative, and dW[1] is W[1]'s derivative; similar for db
    return dW, db
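A standard way to validate a hand-derived backpropagation is a finite-difference gradient check. The sketch below is not part of the original notebook (the helper names are introduced here); it compares a single entry of `dW[1]` against a centered difference of the objective that `backprop` differentiates, i.e. the average cross entropy plus the $\frac{\alpha}{2}\|W\|^2$ term from (2). Running it on a small slice such as `X_train[:100]`, `y_train[:100]` keeps it cheap; the two returned numbers should agree to several digits.
def loss_with_reg(W, b, X, y, alpha=1e-4):
    # the objective backprop() differentiates: cross entropy + 0.5*alpha*||W||^2
    return loss(h(X, W, b), y) + 0.5*alpha*(np.sum(W[0]**2) + np.sum(W[1]**2))

def grad_check_W1_entry(W, b, X, y, i=0, j=0, alpha=1e-4, delta=1e-5):
    # analytic derivative from backprop
    dW, _ = backprop(W, b, X, y, alpha)
    # centered finite difference in the (i, j) entry of W[1]
    W_plus  = [W[0].copy(), W[1].copy()]
    W_minus = [W[0].copy(), W[1].copy()]
    W_plus[1][i, j]  += delta
    W_minus[1][i, j] -= delta
    numeric = (loss_with_reg(W_plus, b, X, y, alpha)
               - loss_with_reg(W_minus, b, X, y, alpha)) / (2*delta)
    return dW[1][i, j], numeric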
Hyper-parameters and network initialization
eta = 5e-1
alpha = 1e-6 # regularization
gamma = 0.99 # RMSprop
eps = 1e-3 # RMSprop
num_iter = 2000 # number of iterations of gradient descent
n_H = 256 # number of neurons in the hidden layer
n = X_train.shape[1] # number of pixels in an image
K = 10
# initialization
np.random.seed(1127)
W = [1e-1*np.random.randn(n, n_H), 1e-1*np.random.randn(n_H, K)]
b = [np.random.randn(n_H)]
Gradient Descent: training of the network
In the training, we use a gradient-descent variant with RMSprop adaptive step sizes. Let $w$ stand for the parameter vector in our model.
Choose $w_0$, $\eta$, $\gamma$, $\epsilon$, and let $g_{-1} = 1$.
For $k = 0, 1, 2, \cdots, M$:
$$g_k = \gamma\, g_{k-1} + (1 - \gamma)\, \bigl|\partial_w L(w_k)\bigr|^2,$$
$$w_{k+1} = w_k - \frac{\eta}{\sqrt{g_k + \epsilon}}\, \partial_w L(w_k).$$
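In the loop below, this update is applied separately to `W[0]`, `W[1]`, and `b[0]`, each with its own scalar accumulator $g$ built from the squared norm of that array's full gradient (standard RMSprop descriptions keep an element-wise running average instead). A small helper, sketched here with a hypothetical name and not used by the notebook's loop, makes the correspondence with the two formulas explicit:
def rmsprop_step(param, grad, g, eta=5e-1, gamma=0.99, eps=1e-3):
    # g     <- gamma*g + (1-gamma)*|grad|^2      (scalar accumulator)
    # param <- param - eta/sqrt(g + eps) * grad
    g = gamma*g + (1-gamma)*np.sum(grad**2)
    param -= (eta/np.sqrt(g + eps)) * grad
    return param, g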
Remark:
The training takes a while, since each gradient descent step uses all training samples (full batch).
%%time
gW0 = gW1 = gb0 = 1

for i in range(num_iter):
    dW, db = backprop(W,b,X_train,y_train,alpha)

    gW0 = gamma*gW0 + (1-gamma)*np.sum(dW[0]**2)
    etaW0 = eta/np.sqrt(gW0 + eps)
    W[0] -= etaW0 * dW[0]

    gW1 = gamma*gW1 + (1-gamma)*np.sum(dW[1]**2)
    etaW1 = eta/np.sqrt(gW1 + eps)
    W[1] -= etaW1 * dW[1]

    gb0 = gamma*gb0 + (1-gamma)*np.sum(db[0]**2)
    etab0 = eta/np.sqrt(gb0 + eps)
    b[0] -= etab0 * db[0]

    if i % 500 == 0:
        # sanity check 1
        y_pred = h(X_train,W,b)
        print("Cross-entropy loss after", i+1, "iterations is {:.8}".format(
              loss(y_pred,y_train)))
        print("Training accuracy after", i+1, "iterations is {:.4%}".format(
              np.mean(np.argmax(y_pred, axis=1)== y_train)))
        # sanity check 2
        print("gW0={:.4f} gW1={:.4f} gb0={:.4f}\netaW0={:.4f} etaW1={:.4f} etab0={:.4f}"
              .format(gW0, gW1, gb0, etaW0, etaW1, etab0))
        # sanity check 3
        print("|dW0|={:.5f} |dW1|={:.5f} |db0|={:.5f}"
              .format(np.linalg.norm(dW[0]), np.linalg.norm(dW[1]), np.linalg.norm(db[0])), "\n")
        # reset RMSprop
        gW0 = gW1 = gb0 = 1
y_pred_final = h(X_train,W,b)
print("Final cross-entropy loss is {:.8}".format(loss(y_pred_final,y_train)))
print("Final training accuracy is {:.4%}".format(np.mean(np.argmax(y_pred_final, axis=1)== y_train)))
Cross-entropy loss after 1 iterations is 7.3264114
Training accuracy after 1 iterations is 37.3524%
gW0=1.0715 gW1=1.2174 gb0=0.9919
etaW0=0.4828 etaW1=0.4530 etab0=0.5018
|dW0|=2.85527 |dW1|=4.76880 |db0|=0.43260
Cross-entropy loss after 501 iterations is 0.077221224
Training accuracy after 501 iterations is 97.8333%
gW0=0.1288 gW1=0.0487 gb0=0.0095
etaW0=1.3878 etaW1=2.2437 etab0=4.8738
|dW0|=0.01327 |dW1|=0.00698 |db0|=0.00108
Cross-entropy loss after 1001 iterations is 0.053528417
Training accuracy after 1001 iterations is 98.5405%
gW0=0.1640 gW1=0.0330 gb0=0.0101
etaW0=1.2308 etaW1=2.7135 etab0=4.7466
|dW0|=0.01254 |dW1|=0.00551 |db0|=0.00110
Cross-entropy loss after 1501 iterations is 0.025509637
Training accuracy after 1501 iterations is 99.4524%
gW0=0.0595 gW1=0.0156 gb0=0.0078
etaW0=2.0321 etaW1=3.8786 etab0=5.3213
|dW0|=0.01524 |dW1|=0.00678 |db0|=0.00237
Final cross-entropy loss is 0.027968823
Final training accuracy is 99.3476%
CPU times: user 1h 43min 11s, sys: 33min 39s, total: 2h 16min 50s
Wall time: 35min 24s
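The full-batch run above takes about 35 minutes of wall time. A mini-batch variant, sketched below and not part of the original notebook (the batch size and sampling scheme are assumptions, and running it here would continue training the weights already fitted above), typically makes each iteration far cheaper while reusing the same `backprop`:
batch_size = 256
rng = np.random.default_rng(1127)
gW0 = gW1 = gb0 = 1

for i in range(num_iter):
    # sample a random mini-batch instead of using all 42,000 images
    idx = rng.choice(X_train.shape[0], batch_size, replace=False)
    dW, db = backprop(W, b, X_train[idx], y_train[idx], alpha)

    gW0 = gamma*gW0 + (1-gamma)*np.sum(dW[0]**2)
    W[0] -= eta/np.sqrt(gW0 + eps) * dW[0]
    gW1 = gamma*gW1 + (1-gamma)*np.sum(dW[1]**2)
    W[1] -= eta/np.sqrt(gW1 + eps) * dW[1]
    gb0 = gamma*gb0 + (1-gamma)*np.sum(db[0]**2)
    b[0] -= eta/np.sqrt(gb0 + eps) * db[0]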
Predictions for testing data
The prediction labels are generated by $(*)$.
# predictions
y_pred_test = np.argmax(h(X_test,W,b), axis=1)
# Generating submission using pandas for grading
submission = pd.DataFrame({'ImageId': range(1,len(X_test)+1) ,'Label': y_pred_test })
submission.to_csv("simplemnist_result.csv",index=False)
https://www.kaggle.com/scaomath/simple-mnist-numpy-from-scratch