Convolutional Neural Network Specifications (draft)
Remarks on cs231n.
Minibatch SGD
Loop (num_iterations
times):
 Sample a batch of data (
batch_size
size) instead of using all data.  Forward prop it through the graph, get its loss (the final data loss and each layer’s regularization loss).
 Backprop to calculate the gradients.
 Update the weights by using gradients.
Notes
Usually, in above situation, the number of loops that we apply is provided by ourselves, e.g. num_iterations
. But in some literature, there’s another form to use num_epochs
instead to express this single number indirectly.
What’s the meaning of a epoch. It often means one round of going through all training data. But in every loop above, our minibatch can only go through batch_size
number of training data.
So in order to finish one epoch quantity execution, we need to go through num_train / batch_size
, which is also defined as iterations_per_epoch
, times of above loop.
Correspondingly, in order to satisfy num_epochs
quantity standard, we need to apply num_epochs * iterations_per_epoch
times of above loops. (We only use the middle variable iterations_per_epoch
here once.) That’s why I said if we provide num_epochs
only, it’s just another way to express the num_iterations
indirectly.
One Time Setup
 Activation functions
 Preprocessing
 Weight initialization
 Regularization
 Gradient Checking
1. Activation functions
sigmoid
, tanh
, ReLu
, LeakReLu
, Maxout
, and ELU
.
Three problems should be considered. Use sigmoid function
as example.

Problem 1: saturated neuraons kill the gradients. (squashes numbers to range )

Problem 2: sigmoid outputs are not zero centered.
As its formula is , if all the are positive, which is the case of sigmoid function’s output, the gradient on will be
Now, all the gradient on will be all positive or negative, depending on if is positive or negative. This means, the updating path of will follow the zigzag path! And this is also why we need the data to be zero mean!
 Problem 3: is a bit computation expensive.
In practice:
 Use
ReLu
, be careful with your learning rates.  Try out
Leaky Relu
,Maxout
,ELU
.  Try out
tanh
but don’t expect much.  Don’t use
sigmoid
2. Data Preprocessing
 zerocentered, normalized data.
 decorrelated, whitened data. (PCA and Whitening)
In practice for image: centered only.
3. Weight Initialization
If all are zero, then the network can’t just startup. So the first idea is just giving it small random numbers close to zero:
W = 0.01 * np.random.randn(D, H)
But it only works fine for small networks, but leads to nonhomogeneous distributions of activations across the layers of a network. Let’s take a look at one example with the following update step, where is initialized from above formula:
a = np.dot(W, X)
h = np.tanh(a)
As is normal distribution centered at zero point, the mean value of in above code will be closer and closer to zero with more layers. Also pay attension, this layer’s will be next layer’s . And as the gettting closer to zero with , in the backpropagation, the gradient of will also be closer to zero: . Thus the last multiple layers’ gradient will stay at zero, which makes the whole network not work.
On the other side, if the random coefficient of is close to 1, we’d face another extreme point. For example, if we use the formula:
W = 1.0 * np.random.randn(D, H)
Then each element of may be greater than or less than , which may lead to equals or . Thus, when we take gradient on those points in , we’d get the gradients equal . And again, this makes the whole network not work.
In summary, we can’t make the initialization of very close to or , but between them. So in order to improve this situation, we can allocate the value of based on the input size of , which introduces the second method to initialize .
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in) # layer initialization
which is introduced by the paper Xavier initialization, by Glorot et all, 2010.
But there may still be problem. When using the ReLu
, nonlinearity, it breaks. The tiny modification can be made on it(note the additonal /2
):
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2) # layer initialization
which is referenced from He et al., 2015.
Another thing need to be mentioned about is the Batch Normalization method, which can reduce the problem of coordinating updates across many layers.
According to Ioffe and Szegedy, 2015, batch normalization’s forward process is getting the mean and variance of one minibatch, which is then used to normalize the inputting data. Backward process is computed accordingly.
Forward algorithm:
Input: Values of over a minibatch ; Parameters to be learned .
Output:
where is a small number, e.g. 1e8
, to avoid dividing zero.
The corresponding python code is:
# Here we treat variable x as minibatch directly.
# 1. Compute sample mean, variance: (D, )
sample_mean = np.mean(x, axis=0)
sample_var = np.std(x, axis=0) ** 2
# 2. Scale and shift, x_hat: (N, D)
x_hat = (x  sample_mean) / np.sqrt(eps + sample_var)
# 3. Compute result: (N, D)
out = gamma * x_hat + beta
cache = (x, x_hat, gamma, eps, sample_mean, sample_var)
Backward
It’s the reverse process relative to forward.
Step 1: Output level, reverse of out = gamma * x_hat + beta
Step 2: Scale, shift level, reverse of x_hat = (x  sample_mean) / np.sqrt(eps + sample_var)
.
Step 3: Sample mean, variance level, reverse of
sample_mean = np.mean(x, axis=0)
sample_var = np.std(x, axis=0) ** 2
Step 3.1: Another gradient on mean from variance
Thus the whole gradient on mean should be
Step 3.2: The gradient on from variance, reverse of
Step 3.3: The gradient on from mean, the reverse of
Thus the whole gradient on should be
Based on above math formula, it’s very easy to write corresponding python code:
x, x_hat, gamma, eps, sample_mean, sample_var = cache
N, D = x_hat.shape
# Level 1, 'out = gamma * x_hat + beta'
#  dout : (N, D)
#  gamma: (D, )
#  beta : (D, )
#  x_hat: (N, D)
dx_hat = dout * gamma
dgamma = np.sum(dout * x_hat, axis=0)
dbeta = np.sum(dout, axis=0)
# Level 2, 'x_hat = (x  sample_mean) / np.sqrt(eps + sample_var)'
#  sample_mean, sample_var: (D, )
dvar = np.sum(dx_hat * (x  sample_mean), axis=0) * (0.5) * ((sample_var + eps) ** (1.5))
dmean1 = np.sum(dx_hat * (1 / (sample_var + eps) ** 0.5), axis=0)
dx1 = dx_hat / (sample_var + eps) ** 0.5
# Level 3, 'sample_mean = np.mean(x, axis=0); sample_var = np.std(x, axis=0) ** 2'
dmean2 = dvar * (np.sum(2 * (x  sample_mean), axis=0) / N)
dmean = dmean1 + dmean2
dx2 = dvar * (2 * (x  sample_mean) / N)
dx3 = dmean / N
dx = dx1 + dx2 + dx3
4. Regularization
Dropout
Dropout is a very powerful technique to address the overfitting issue. The idea is just removing the neurons randomly in training process. To drop the neuron, we only need to generate a mask, which element is 0 or 1 based on one predefined probability distribution. Let’s go into the details.
Assume we have layers, its index is , i.e. . In layer , define as hidden layer, its parameters and bias as , the input data is , which is the output of last layer (assume ). The action function as . Thus, for normal neural networks, we have relations:
With dropout, the process becomes
where the is the elementwise product. In practice, we can directly set , which is efficient for most neural networks situations.
The corresponding code is obvious:
# forward training
mask = np.random.rand(*x.shape) < p
out = x * mask
...
# backward training
dx = dout * mask
But for testing process, there’s no dropout process but scale
# forward test
out = x
...
# backward test
dx = dout * p
The production with in testing stage is used to keep expectation of each neuron consistent with training stage. For example, for a neuron , its output expectation of dropout is . For training stage, the output is fixed, i.e. . Thus we need to multiple on to make them equal.
But this is not a good idea to keep consistency in training stage, in which the efficiency is important. To solve this issue, we can directly add the coefficient in training stage.
# forward training
mask = (np.random.rand(*x.shape) < p) / p
out = x * mask
...
# backward training
dx = dout * mask
##############
# forward test
out = x
...
# backward test
dx = dout
Now, the output expectation of neuron will be .
TK
5. Gradient Checking
TK
Training Dynamics
 Babysitting the learning process
 Parameter updates
 Hyperparameter optimization
1. Babysitting the Learning Process
2. Parameter updates
The method of learning rate updates includes:
 SGD
 SGD + Momentum
 Adagrad
 RMSProp
 Adam
SGD
is simple and obvious:
x += learning_rate * dx
. But it may cause the zigzag convergent path, and progress with flat direction when one direction is steep and another is shallow. Thus, one direct modification is use momentum to reduce the flat direction impact.
Momentum
in physics is computed through the formula: . In SGD plus Momentum algorithm:
v = mu * v  learning_rate * dx
x += v
, the value of mass is often assumed as unit quantity, i.e. the . Thus, we only need to consider the velocity part. The whole optimization process can be treated as one particle’s movement. Its position at time can be treated as function . Then the particle may experience the net force . And we can get the acceleration as (note our mass has been assumed as 1)
. Introducing velocity , the above partial differential equation becomes: .
Now, we need to construct the force :
 As is the net force, it’s obvious to contain one force that proportional to the negative gradient of the cost function: , which pushes the particle downhill along the cost function surface.
 But if there’s only this force constituting net force, this particle will move forever between uphill and downhill. Thus we need to introduce another force, called viscous drag in physics, which is proportional to . Here, the negative symbol means it’s one reverse direction relative to the . This viscous drag can ensure the particle stop at the local minimum position.
So now, our partial differential equation becomes:
where is one constant number. Use Euler method to update :
Assume the (as for the next 1
step), use the definition of learning rate, and introduce the varible v
, we can get the beginning SGD plus Momentum code’s first part. As , we use Euler method again:
, which is the code’s second part. There’s also another accelerated gradient method Nesterov Momentum, but it seems not imporove the rate of convergence according to Deep Learning Book, Goodfellow, Bengio.
AdaGrad
adapts the learning rates:
cache += dx ** 2
x +=  learning_rate * dx / (np.sqrt(cache) + 1e7)
It has some desirable theoretical properties, but not all deep learning models perform well by using it.
RMSProp
improves AdaGrad on the nonconvex problem by changing the greadient accumulation into an exponentially weighted moving average.
cache += decay_rate * cache + (1  decay_rate) * (dx ** 2)
x +=  learning_rate * dx / (np.sqrt(cache) + 1e7)
Adam
is another adaptive learning rate optimization method. Its names comes from the phrase adaptive moments. It can be treated as one variant of the combination of Moment and RMSProp algorithms.
m, v = # ... initialize caches to zeros
for t in xrange(1, big_number):
dx = # ... evaluate gradient
# 1. Update momentum
# update first momentum
m = beta1 * m + (1  beta1) * dx
# update second momentum
v = beta2 * v + (1  beta2) * (dx ** 2)
# 2. Bias Correction, only relevant
# in first few iterations when t is small
mb = m / (1  beta1 ** t)
vb = v / (1  beta2 ** t)
# 3. Update x
x +=  learning_rate * mb / (np.sqrt(vb) + 1e7)
In practice, Adam is a good default choice in most cases. If you can afford the full batch updates, then try out LBFGS.
3. Hyperparameter optimization
TK
Evaluation
 Model Ensembles
1. Model Ensembles
TK