**5.1 Dataset Normalization**

Deep learning models take many types of inputs, and the ranges of their values can vary widely. For example, in the house price prediction problem, the inputs are properties of the house, such as its area, number of bedrooms, and distance to downtown. The area is typically in the hundreds or thousands of square feet, while the number of bedrooms is usually a single digit. There is thus a large scale difference between features, but this does not mean a large-scale feature is more or less influential than a small-scale one. This can become a problem, because the model tends to update in the most sensitive directions, where the sensitivity comes from the scale of a feature rather than its importance. Fig. 1 shows a contour plot of the loss, where the inner circle has lower loss than the outer ones. Two features, $x_1$ and $x_2$, determine the loss of the model, and both matter for reaching the optimal region. In Fig. 1 (A), the two features have different scales: a slight change in $x_1$ has more effect than a change in $x_2$, so updates are more likely to happen in the $x_1$ direction. Repeated updates in one direction make it hard for the model to converge to the optimal region. In Fig. 1 (B), the two features have the same scale, so updates happen in both directions and the model converges in fewer steps.
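The effect of uneven scales on gradient descent can be sketched with a toy quadratic loss (a hypothetical illustration, not the experiment from this chapter). The names `gd_steps` and `curvatures` are ours; the curvature along each axis plays the role of a feature's scale:

```python
import numpy as np

def gd_steps(curvatures, tol=1e-6, max_steps=100_000):
    """Steps for gradient descent on L(w) = 0.5 * sum(c_i * w_i^2) to converge.

    The step size is limited by the steepest direction (lr = 1 / max curvature),
    roughly the largest stable choice.
    """
    lr = 1.0 / curvatures.max()
    w = np.ones_like(curvatures)
    for step in range(1, max_steps + 1):
        w = w - lr * curvatures * w      # gradient of L is c_i * w_i
        if np.abs(w).max() < tol:
            return step
    return max_steps

# Features on very different scales -> elongated contours, as in Fig. 1 (A)
print(gd_steps(np.array([100.0, 1.0])))
# Features on the same scale -> round contours, as in Fig. 1 (B)
print(gd_steps(np.array([1.0, 1.0])))
```

With uneven curvatures, the step size safe for the steep direction crawls along the flat one, so convergence takes over a thousand steps; with even curvatures it takes a single step.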

In order to eliminate the uneven scales in the inputs, we can apply normalization, which shifts and scales the corresponding elements of the inputs to make them consistent. The same idea can be applied to image data; in practice, we usually apply a separate normalization to each image channel. Since this normalization is applied over the dataset, it is also called dataset normalization.
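A minimal sketch of dataset normalization for images, computing per-channel statistics over the whole dataset (random data stands in for real images here):

```python
import numpy as np

# A batch of RGB images, shape (N, C, H, W), values in [0, 1]
images = np.random.rand(16, 3, 32, 32)

# Per-channel mean and standard deviation over the whole dataset
mean = images.mean(axis=(0, 2, 3))
std = images.std(axis=(0, 2, 3))

# Shift and scale each channel to zero mean and unit variance
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]
print(normalized.mean(axis=(0, 2, 3)))  # ≈ [0, 0, 0]
```

For real datasets the statistics are computed once offline and then applied to every image, which is what `transforms.Normalize` does in the code listing at the end of this section.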

**5.2 Normalization Layers**

Beyond dataset normalization, normalization can also be applied to different parts of training~\cite{ioffe2015batch, salimans2016weight, ba2016layer, ulyanov2016instance, wu2018group}. Normalization layers prevent features from being ignored and keep them contributing to model improvement, which makes the model converge faster. They also smooth the loss surface~\cite{santurkar2018does} and prevent weights from exploding by constraining them to a certain region. Finally, they reduce internal covariate shift (ICS), which is defined as the change in the distribution of activations caused by weight updates. In the following sections, we introduce batch normalization (BN)~\cite{ioffe2015batch} and layer normalization (LN)~\cite{ba2016layer}. To compare their effectiveness, we apply them to the dog-vs-cat classification problem.

**5.2.1 Batch Normalization**

Batch normalization is one of the most popular normalization methods for training deep learning models. As shown in Fig. 2 (left), where $D$ is the size of the activations of each sample, $N$ is the mini-batch size, and $C$ is the number of channels, BN normalizes the activations across the mini-batch. As demonstrated in Alg. 1, BN takes as inputs a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$ of $m$ samples, together with $\gamma$ and $\beta$, which determine the final mean and variance. It first calculates the mean and variance of the mini-batch, denoted $\mu_\mathcal{B}$ and $\sigma^2_\mathcal{B}$, respectively. Then each sample $x_i$ is normalized to $\hat{x}_i$ using $\mu_\mathcal{B}$ and $\sigma^2_\mathcal{B}$. Notice the extra term $\epsilon$, which prevents division by zero. Finally, $\hat{x}_i$ is scaled and shifted to the target mean and variance defined by $\gamma$ and $\beta$.
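The steps above can be sketched as a minimal training-mode forward pass (no running statistics, which a full implementation would also track for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Minimal BN forward pass for an (N, D) mini-batch, training mode only."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # eps prevents division by zero
    return gamma * x_hat + beta              # scale and shift by gamma, beta

x = np.random.randn(32, 8)                   # a mini-batch of 32 samples
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0))  # ≈ 0 for every feature
```

With `gamma = 1` and `beta = 0`, every feature of the output has zero mean and roughly unit variance, regardless of its original scale.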

Even though BN has seen a lot of success, it still has some problems. First, it is highly dependent on the mini-batch: if the batch size is too small, the batch statistics become very noisy and hurt training. Second, it is not well suited to recurrent neural networks, which would require a separate BN for each time step and result in a very complicated model.

**5.2.2 Layer Normalization**

In order to solve these issues, layer normalization was proposed. It normalizes each input along the feature dimension, as shown in Fig. 2 (right), which makes the normalization independent of the mini-batch. As shown in Eq. 1, the mean and variance are calculated along the feature dimension.

$$\mu = \frac{1}{D}\sum_{i=1}^{D} x_i, \qquad \sigma^2 = \frac{1}{D}\sum_{i=1}^{D} (x_i - \mu)^2 \tag{1}$$

Another benefit of layer normalization is that it remains effective even in RNNs~\cite{ba2016layer}.
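The corresponding forward pass only differs from BN in the axis the statistics are taken over, which is what makes it batch-independent:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Minimal LN forward pass: statistics per sample, over the feature axis."""
    mu = x.mean(axis=-1, keepdims=True)      # per-sample mean over features
    var = x.var(axis=-1, keepdims=True)      # per-sample variance over features
    x_hat = (x - mu) / np.sqrt(var + eps)    # eps prevents division by zero
    return gamma * x_hat + beta              # scale and shift by gamma, beta

# Unlike BN, a batch of a single sample poses no problem:
x = np.random.randn(1, 8)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean())  # ≈ 0 even with batch size 1
```

Since each sample is normalized by its own statistics, the result is identical whether the sample arrives alone or inside a large batch.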

**5.3 Experiment**

First, to examine the effectiveness of dataset normalization, we train on the dog-vs-cat task with and without input normalization. As shown in Fig. 3, with normalization (blue line) the model indeed converges faster. We then compare BN and LN by incorporating them into the same architecture: batch normalization (or layer normalization) is added after each of the first two convolutional layers of AlexNet. We keep all other settings and hyperparameters the same; the results are shown in Fig. 4. As we can see from the figure, both batch norm and layer norm perform better than the baseline.
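A sketch of this setup (the exact placement and the remaining layers are an assumption on our part; channel sizes follow the standard AlexNet stem):

```python
import torch
import torch.nn as nn

# First two conv layers of an AlexNet-style network, each followed by batch
# norm. To test layer norm instead, the BatchNorm2d modules would be swapped
# for a layer-norm variant. The rest of the network is omitted here.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
    nn.BatchNorm2d(192),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.randn(4, 3, 224, 224)  # a mini-batch of 224x224 RGB images
print(stem(x).shape)             # feature maps after the normalized stem
```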

```
# input normalization
from torchvision import transforms

trans = transforms.Compose([
    transforms.Resize(255),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    # per-channel mean and std (the ImageNet statistics)
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

# batch normalization
import torch.nn as nn
nn.BatchNorm1d(features)   # over (N, D) activations
nn.BatchNorm2d(channels)   # over (N, C, H, W) feature maps

# layer normalization
nn.LayerNorm(input_shape)  # over the trailing feature dimensions
```