5 Normalization


5.1 Dataset Normalization

A deep learning model takes various types of inputs, and the ranges of their values can vary widely. For example, in the house price prediction problem, the inputs are properties of the house, such as area, number of bedrooms, and distance to downtown. The area can easily exceed 100 while the number of bedrooms is usually under 10. There is therefore a scale difference between features, but this does not mean a large-scale feature is more or less influential than a small-scale one. This can be a problem, since the model tends to update in sensitive directions, where the sensitivity comes from the scale of a feature rather than its importance. Fig. 1 shows a contour plot of the loss, where the inner circle has lower loss than the outer ones and two features x_1 and x_2 determine the loss. To reach the optimal region, both x_1 and x_2 are important. In Fig. 1 (A), the two features have different scales, and a slight change in x_2 has more effect than a change in x_1, so updates are more likely to happen in the x_2 direction. Repeated updates in one direction make it hard for the model to converge to the optimal region. In Fig. 1 (B), since the two features have the same scale, updates happen in both directions and the model converges in fewer steps.

In order to eliminate the uneven scales of the inputs, we can apply normalization, which shifts and scales each input feature to a consistent range. The same idea can be applied to image data: in practice, we usually apply a different normalization to each image channel. Since this normalization is applied to the dataset, it is also called dataset normalization.
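As a concrete sketch, per-feature standardization on the house price example might look as follows (the feature values are made up for illustration):

```python
import numpy as np

# Toy training set: each row is a house, columns are [area, bedrooms]
# (the values are illustrative, not real data).
train = np.array([[120.0, 3.0],
                  [ 85.0, 2.0],
                  [200.0, 5.0]])

# Statistics are computed on the training set only, then reused at test time.
mean = train.mean(axis=0)
std = train.std(axis=0)

# Shift and scale: every feature now has zero mean and unit variance.
normalized = (train - mean) / std
```

The same transformation (with the training-set mean and std) must be applied to validation and test inputs, so that the model sees consistently scaled features.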

Figure 1: (A) Non-normalized: the gradient of the larger-scale parameter dominates the update. (B) Normalized: both parameters are updated in equal proportion.

5.2 Normalization Layers

Beyond dataset normalization, normalization can also be applied to different parts of the network during training~\cite{ioffe2015batch, salimans2016weight, ba2016layer, ulyanov2016instance, wu2018group}. Normalization layers prevent features from being ignored and keep them contributing to model improvement, which makes the model converge faster. They also smooth the loss surface~\cite{santurkar2018does} and prevent weights from exploding by constraining them to a certain region. Finally, they reduce internal covariate shift (ICS), defined as the change in the distribution of activations caused by weight updates. In the following sections, we introduce batch normalization (BN)~\cite{ioffe2015batch} and layer normalization~\cite{ba2016layer}. To compare their effectiveness, we apply them to the dog-vs-cat classification problem.

5.2.1 Batch Normalization

Batch normalization is one of the most popular normalization methods for training deep learning models. As shown in Fig. 2 (left), where H \times W is the size of the activations of each sample, N is the mini-batch size, and C is the number of channels, BN normalizes the activations across the mini-batch. BN takes as inputs a mini-batch \mathcal{B} consisting of m samples, together with learnable parameters \gamma and \beta that determine the final mean and variance. It first calculates the mean and variance of the mini-batch, denoted \mu_{\mathcal{B}} and \sigma_{\mathcal{B}}^2, respectively. Each sample x_i is then normalized to \hat{x}_i using \mu_{\mathcal{B}} and \sigma_{\mathcal{B}}^2; an extra term \epsilon prevents division by zero. Finally, \hat{x}_i is scaled and shifted to the target mean and variance defined by \gamma and \beta.
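A minimal NumPy sketch of this procedure (training-mode BN only, without the running statistics used at inference; names and shapes are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization for a mini-batch x of shape (N, C, H, W).

    Statistics are computed per channel, over the N, H, W axes,
    i.e. across all samples in the mini-batch.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # mini-batch mean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)    # mini-batch variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalize (eps avoids division by zero)
    return gamma * x_hat + beta                   # scale and shift to target mean/variance

x = np.random.randn(8, 3, 4, 4)                   # 8 samples, 3 channels
gamma = np.ones((1, 3, 1, 1))                     # learnable scale (initialized to 1)
beta = np.zeros((1, 3, 1, 1))                     # learnable shift (initialized to 0)
y = batch_norm(x, gamma, beta)
# With gamma=1 and beta=0, each channel of y has ~zero mean and ~unit variance.
```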

Figure 2: (Left). Batch normalization. (Right). Layer normalization

Even though BN has achieved a lot of success, it still has some problems. First, it depends heavily on the mini-batch: if the batch size is too small, the batch statistics become very noisy and hurt training. Second, it is not well suited to recurrent neural networks, which would require a separate BN for each time step and result in a very complicated model.
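The batch dependence also shows up in how BN behaves at training versus inference time: in training mode it normalizes with the current batch's statistics, while in eval mode it uses running averages of them. A small PyTorch sketch (the channel count and batch size are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(num_features=3)   # one mean/variance pair per channel

x = torch.randn(2, 3, 4, 4)           # a tiny mini-batch of 2 samples
y_train = bn(x)                       # train mode: normalizes with batch statistics
bn.eval()
y_eval = bn(x)                        # eval mode: uses accumulated running statistics

# With such a small batch, the batch statistics are a noisy estimate,
# so the two outputs differ noticeably.
```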

5.2.2 Layer Normalization

Layer normalization was proposed to address these issues. It normalizes each input along the feature dimension, as shown in Fig. 2 (right), which makes the normalization independent of the mini-batch. As shown in Eq. 1, the mean and variance are calculated along the feature dimension.

(1)   \begin{equation*} \begin{split} \mu_i &= \frac{1}{m} \sum_{j=1}^m x_{i,j} \\ \sigma^2_i &= \frac{1}{m} \sum_{j=1}^m (x_{i,j} - \mu_i)^2 \\ \hat{x}_{i,j} &= \frac{x_{i,j} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} \end{split} \end{equation*}

Another benefit of layer normalization is that it remains effective even in RNNs~\cite{ba2016layer}.
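Eq. 1 can be sketched directly in NumPy (this omits the learnable gain and bias that the full method also applies after normalizing):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics are per sample, over the m features (the last axis),
    # so the result does not depend on the other samples in the batch.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(4, 10)   # 4 samples with 10 features each
y = layer_norm(x)
# Each row of y now has ~zero mean and ~unit variance,
# regardless of how many samples are in the batch.
```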

5.3 Experiment

First, to assess the effectiveness of dataset normalization, we trained on the dog-vs-cat task with and without input normalization. As shown in Fig. 3, with normalization (blue line) the model indeed converges faster. We then compare BN and LN by incorporating them into the same architecture: batch normalization (or layer normalization) is added after each of the first two convolutional layers of AlexNet. Keeping all other settings and hyperparameters the same, the results are shown in Fig. 4. As the figure shows, both batch norm and layer norm perform better than the baseline.

Figure 3: Model performance on the Dog-vs-Cat dataset with and without input normalization. (Left): validation accuracy, (Right): training loss.
Figure 4: Model performance on the Dog-vs-Cat dataset with batch normalization (BN), layer normalization (LN), and the baseline (no normalization). (Left): accuracy, (Right): training loss.
    # input normalization
    from torchvision import transforms
    trans = transforms.Compose([transforms.Resize(255),
                                transforms.ToTensor(),  # PIL image -> tensor, required before Normalize
                                transforms.Normalize([0.485, 0.456, 0.406],
                                                     [0.229, 0.224, 0.225])])
    # batch normalization
    import torch.nn as nn
    bn = nn.BatchNorm2d(64)          # one mean/variance pair per channel (64 channels here)
    # layer normalization
    ln = nn.LayerNorm([64, 32, 32])  # normalize each sample over its feature dims (C, H, W)