This note is an MNIST digit recognizer implemented in numpy from scratch.

This is a simple demonstration mainly for pedagogical purposes, which shows the basic workflow of a machine learning algorithm using a simple feedforward neural network. The derivative at the backpropagation stage is computed explicitly through the chain rule.

  • The model is a 3-layer feedforward neural network (FNN), in which the input layer has 784 units, and the 256-unit hidden layer is activated by ReLU, while the output layer is activated by softmax function to produce a discrete probability distribution for each input.
  • The loss function, model hypothesis function, and the gradient of the loss function are all implemented from ground-up in numpy in a highly vectorized fashion (no FOR loops).
  • The training is through a standard gradient descent with step size adaptively changing by Root Mean Square prop (RMSprop), and there is no cross-validation set reserved nor model averaging for simplicity.

The code is vectorized and is adapted from the Softmax regression and Neural Network lectures used in UCI Math 10.

Caveat lector: fluency in linear algebra and multivariate calculus.


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline

Data input

import os
path = os.listdir("../input")
['train.csv', 'sample_submission.csv', 'test.csv']
# Read the data
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv("../input/test.csv")
# Set up the data
y_train = train_data['label'].values
X_train = train_data.drop(columns=['label']).values/255
X_test = test_data.values/255

Some examples of the input data

We randomly choose 10 samples from the training set and visualize it.

fig, axes = plt.subplots(2,5, figsize=(12,5))
axes = axes.flatten()
idx = np.random.randint(0,42000,size=10)
for i in range(10):
    axes[i].imshow(X_train[idx[i],:].reshape(28,28), cmap='gray')
    axes[i].axis('off') # hide the axes ticks
    axes[i].set_title(str(int(y_train[idx[i]])), color= 'black', fontsize=25)

Network structures

The figure above is a simplication of the neural network used in this example. The circles labeled “+1” are the bias units. Layer 1 is the input layer, and Layer 3 is the output layer. The middle layer, Layer 2, is the hidden layer.

The neural network in the figure above has 2 input units (not counting the bias unit), 3 hidden units, and 1 output unit. In this actual computation below, the input layer has 784 units, the hidden layer has 256 units, and the output layers has 10 units (K=10K=10 classes).

The weight matrix W(0)W(0) mapping input xx from the input layer (Layer 1) to the hidden layer (Layer 2) is of shape (784,256) together with a (256,) bias. Then aa is the activation from the hidden layer (Layer 2) can be written as:a=ReLU((W(0))⊤x+b),a=ReLU((W(0))⊤x+b),where the ReLU activation function is ReLU(z)=max(z,0)ReLU(z)=max(z,0) and can be implemented in a vectorized fashion as follows.

# relu activation function
# THE fastest vectorized implementation for ReLU
def relu(x):
    return x

Softmax activation, prediction, and the loss function

From the hidden layer (Layer 2) to the output layer (layer 3), the weight matrix W(1)W(1) is of shape (256,10), the form of which is as follows:W(1)=⎛⎝⎜|θ1||θ2||⋯||θK|⎞⎠⎟,W(1)=(||||θ1θ2⋯θK||||),which maps the activation from Layer 2 to Layer 3 (output layer), and there is no bias because a constant can be freely added to the activation without changing the final output.

At the last layer, a softmax activation is used, which can be written as follows combining the weights matrix W(1)W(1) that maps the activation aa from the hidden layer to output layer:P(y=k|a;W(1))=σk(a;W(1)):=exp(θ⊤ka)∑Kj=1exp(θ⊤ja).P(y=k|a;W(1))=σk(a;W(1)):=exp⁡(θk⊤a)∑j=1Kexp⁡(θj⊤a).{P(y=k|a;W(1))}Kk=1{P(y=k|a;W(1))}k=1K is the probability distribution of our model, which estimates the probability of the input xx’s label yy is of class kk. We denote this distribution by a vectorσ:=(σ1,…,σK)⊤.σ:=(σ1,…,σK)⊤.We hope that this estimate is as close as possible to the true probability: 1{y=k}1{y=k}, that is 11 if the sample xx is in the kk-th class and 0 otherwise.

Lastly, our prediction y^y^ for sample xx can be made by choosing the class with the highest probability:y^=argmaxk=1,…,KP(y=k|a;W(1)).(∗)(∗)y^=argmaxk=1,…,K⁡P(y=k|a;W(1)).

Denote the label of the ii-th input as y(i)y(i), and then the sample-wise loss function is the cross entropy measuring the difference of the distribution of this model function above with the true one 1{y(i)=k}1{y(i)=k}: denote W=(W(0),W(1))W=(W(0),W(1)), b=(b)b=(b), let a(i)a(i) be the activation for the ii-th sample in the hidden layer (Layer 2),Ji:=J(W,b;x(i),y(i)):=−∑k=1K{1{y(i)=k}logP(y(i)=k|a(i);W(1))}.(1)(1)Ji:=J(W,b;x(i),y(i)):=−∑k=1K{1{y(i)=k}log⁡P(y(i)=k|a(i);W(1))}.

Denote the data sample matrix X:=(x(1),…,x(N))⊤X:=(x(1),…,x(N))⊤, its label vector as y:=(y(1),…,y(N))y:=(y(1),…,y(N)), and then the final loss has an extra L2L2-regularization term for the weight matrices (not for bias):L(W,b;X,y):=1N∑i=1NJi+α2(∥W(0)∥2+∥W(1)∥2),(2)(2)L(W,b;X,y):=1N∑i=1NJi+α2(‖W(0)‖2+‖W(1)‖2),where α>0α>0 is a hyper-parameter determining the strength of the regularization, the bigger the αα is, the smaller the magnitudes of the weights will be after training.

def h(X,W,b):
    Hypothesis function: simple FNN with 1 hidden layer
    Layer 1: input
    Layer 2: hidden layer, with a size implied by the arguments W[0], b
    Layer 3: output layer, with a size implied by the arguments W[1]
    # layer 1 = input layer
    a1 = X
    # layer 1 (input layer) -> layer 2 (hidden layer)
    z1 = np.matmul(X, W[0]) + b[0]
    # add one more layer
    # layer 2 activation
    a2 = relu(z1)
    # layer 2 (hidden layer) -> layer 3 (output layer)
    z2 = np.matmul(a2, W[1])
    s = np.exp(z2)
    total = np.sum(s, axis=1).reshape(-1,1)
    sigma = s/total
    # the output is a probability for each sample
    return sigma
def softmax(X_in,weights):
    Un-used cell for demo
    activation function for the last FC layer: softmax function 
    Output: K probabilities represent an estimate of P(y=k|X_in;weights) for k=1,...,K
    the weights has shape (n, K)
    n: the number of features X_in has
    n = X_in.shape[1]
    K: the number of classes
    K = 10
    s = np.exp(np.matmul(X_in,weights))
    total = np.sum(s, axis=1).reshape(-1,1)
    return s / total
def loss(y_pred,y_true):
    Loss function: cross entropy with an L^2 regularization
    y_true: ground truth, of shape (N, )
    y_pred: prediction made by the model, of shape (N, K) 
    N: number of samples in the batch
    K: global variable, number of classes
    global K 
    K = 10
    N = len(y_true)
    # loss_sample stores the cross entropy for each sample in X
    # convert y_true from labels to one-hot-vector encoding
    y_true_one_hot_vec = (y_true[:,np.newaxis] == np.arange(K))
    loss_sample = (np.log(y_pred) * y_true_one_hot_vec).sum(axis=1)
    # loss_sample is a dimension (N,) array
    # for the final loss, we need take the average
    return -np.mean(loss_sample)

Backpropagation (Chain rule)

The derivative of the cross entropy JJ in (1), for a single sample and its label (x,y)(x,y) , with respect to the weights and the bias is computed using the following procedure:

Step 1: Forward pass: computing the activations  a=(a1,…,an2)  from the hidden layer (Layer 2), and  σ=(σ1,…,σK)  from the output layer (Layer 3).

Step 2: Derivatives for  W(1) : recall that  W(1)=(θ1,⋯,θK)  and denote
for the  k -th output unit in the output layer (Layer 3), then
Step 3: Derivatives for  W(0) ,  b : recall that  W(0)=(w1,⋯,wn2) ,  b=(b1,…,bn2) , where  n2  is the number of units in the hidden layer (Layer 2), and denote
for each node  i  in the hidden layer (Layer  2 ),  i=1,…,n2 , then
where  1{z(1)i>0}  is ReLU activation  f 's (weak) derivative, and the partial derivative of the  k -th component (before activated by the softmax) in the output layer  z(2)k  with respect to the  i -th activation  ai  from the hidden layer is the weight  w(1)ki . Thus

∂J∂wji=xjδ(1)i,∂J∂bi=δ(1)i, and ∂J∂wi=δ(1)ix,∂J∂b=δ(1).
def backprop(W,b,X,y,alpha=1e-4):
    Step 1: explicit forward pass h(X;W,b)
    Step 2: backpropagation for dW and db
    K = 10
    N = X.shape[0]
    ### Step 1:
    # layer 1 = input layer
    a1 = X
    # layer 1 (input layer) -> layer 2 (hidden layer)
    z1 = np.matmul(X, W[0]) + b[0]
    # layer 2 activation
    a2 = relu(z1)
    # one more layer
    # layer 2 (hidden layer) -> layer 3 (output layer)
    z2 = np.matmul(a2, W[1])
    s = np.exp(z2)
    total = np.sum(s, axis=1).reshape(-1,1)
    sigma = s/total
    ### Step 2:
    # layer 2->layer 3 weights' derivative
    # delta2 is \partial L/partial z2, of shape (N,K)
    y_one_hot_vec = (y[:,np.newaxis] == np.arange(K))
    delta2 = (sigma - y_one_hot_vec)
    grad_W1 = np.matmul(a2.T, delta2)
    # layer 1->layer 2 weights' derivative
    # delta1 is \partial a2/partial z1
    # layer 2 activation's (weak) derivative is 1*(z1>0)
    delta1 = np.matmul(delta2, W[1].T)*(z1>0)
    grad_W0 = np.matmul(X.T, delta1)
    # Student project: extra layer of derivative
    # no derivative for layer 1
    # the alpha part is the derivative for the regularization
    # regularization = 0.5*alpha*(np.sum(W[1]**2) + np.sum(W[0]**2))
    dW = [grad_W0/N + alpha*W[0], grad_W1/N + alpha*W[1]]
    db = [np.mean(delta1, axis=0)]
    # dW[0] is W[0]'s derivative, and dW[1] is W[1]'s derivative; similar for db
    return dW, db

Hyper-parameters and network initialization

eta = 5e-1
alpha = 1e-6 # regularization
gamma = 0.99 # RMSprop
eps = 1e-3 # RMSprop
num_iter = 2000 # number of iterations of gradient descent
n_H = 256 # number of neurons in the hidden layer
n = X_train.shape[1] # number of pixels in an image
K = 10
# initialization
W = [1e-1*np.random.randn(n, n_H), 1e-1*np.random.randn(n_H, K)]
b = [np.random.randn(n_H)]

Gradient Descent: training of the network

In the training, we use a GD-variant of the RMSprop: for ww which stands for the parameter vector in our model

Choose  w0 ,  η ,  γ ,  ϵ , and let  g−1=1  

For  k=0,1,2,⋯,M 




The training takes a while since we use the gradient descent for all samples.

gW0 = gW1 = gb0 = 1

for i in range(num_iter):
    dW, db = backprop(W,b,X_train,y_train,alpha)
    gW0 = gamma*gW0 + (1-gamma)*np.sum(dW[0]**2)
    etaW0 = eta/np.sqrt(gW0 + eps)
    W[0] -= etaW0 * dW[0]
    gW1 = gamma*gW1 + (1-gamma)*np.sum(dW[1]**2)
    etaW1 = eta/np.sqrt(gW1 + eps)
    W[1] -= etaW1 * dW[1]
    gb0 = gamma*gb0 + (1-gamma)*np.sum(db[0]**2)
    etab0 = eta/np.sqrt(gb0 + eps)
    b[0] -= etab0 * db[0]
    if i % 500 == 0:
        # sanity check 1
        y_pred = h(X_train,W,b)
        print("Cross-entropy loss after", i+1, "iterations is {:.8}".format(
        print("Training accuracy after", i+1, "iterations is {:.4%}".format( 
              np.mean(np.argmax(y_pred, axis=1)== y_train)))
        # sanity check 2
        print("gW0={:.4f} gW1={:.4f} gb0={:.4f}\netaW0={:.4f} etaW1={:.4f} etab0={:.4f}"
              .format(gW0, gW1, gb0, etaW0, etaW1, etab0))
        # sanity check 3
        print("|dW0|={:.5f} |dW1|={:.5f} |db0|={:.5f}"
             .format(np.linalg.norm(dW[0]), np.linalg.norm(dW[1]), np.linalg.norm(db[0])), "\n")
        # reset RMSprop
        gW0 = gW1 = gb0 = 1

y_pred_final = h(X_train,W,b)
print("Final cross-entropy loss is {:.8}".format(loss(y_pred_final,y_train)))
print("Final training accuracy is {:.4%}".format(np.mean(np.argmax(y_pred_final, axis=1)== y_train)))
Cross-entropy loss after 1 iterations is 7.3264114
Training accuracy after 1 iterations is 37.3524%
gW0=1.0715 gW1=1.2174 gb0=0.9919
etaW0=0.4828 etaW1=0.4530 etab0=0.5018
|dW0|=2.85527 |dW1|=4.76880 |db0|=0.43260 

Cross-entropy loss after 501 iterations is 0.077221224
Training accuracy after 501 iterations is 97.8333%
gW0=0.1288 gW1=0.0487 gb0=0.0095
etaW0=1.3878 etaW1=2.2437 etab0=4.8738
|dW0|=0.01327 |dW1|=0.00698 |db0|=0.00108 

Cross-entropy loss after 1001 iterations is 0.053528417
Training accuracy after 1001 iterations is 98.5405%
gW0=0.1640 gW1=0.0330 gb0=0.0101
etaW0=1.2308 etaW1=2.7135 etab0=4.7466
|dW0|=0.01254 |dW1|=0.00551 |db0|=0.00110 

Cross-entropy loss after 1501 iterations is 0.025509637
Training accuracy after 1501 iterations is 99.4524%
gW0=0.0595 gW1=0.0156 gb0=0.0078
etaW0=2.0321 etaW1=3.8786 etab0=5.3213
|dW0|=0.01524 |dW1|=0.00678 |db0|=0.00237 

Final cross-entropy loss is 0.027968823
Final training accuracy is 99.3476%
CPU times: user 1h 43min 11s, sys: 33min 39s, total: 2h 16min 50s
Wall time: 35min 24s

Predictions for testing data

The prediction labels are generated by (∗)(∗).

# predictions
y_pred_test = np.argmax(h(X_test,W,b), axis=1)
# Generating submission using pandas for grading
submission = pd.DataFrame({'ImageId': range(1,len(X_test)+1) ,'Label': y_pred_test })