4.2 L1/L2 Norm Regularization



Norm regularization methods work by adding a parameter norm penalty \Omega(\pmb{\theta}) to the objective function J. The unregularized objective function J then becomes the regularized objective function \tilde{J}:

(1)   \begin{equation*}    \tilde{J}(\pmb{\theta}; \textbf{X}, \textbf{y}) = J(\pmb{\theta}; \textbf{X}, \textbf{y}) + \alpha \Omega(\pmb{\theta})\end{equation*}

where \alpha \in [0, \infty) is a coefficient that balances the contribution of the penalty term \Omega against the unregularized objective function J. \alpha = 0 means no regularization, and larger values of \alpha produce a stronger regularization effect. \alpha is a hyperparameter that needs to be tuned for each specific problem.
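To make equation (1) concrete, here is a minimal NumPy sketch of a regularized objective. The function names, the toy squared-error loss, and the random data are our own illustrative choices, not from the original text:

```python
import numpy as np

def regularized_objective(J, omega, theta, X, y, alpha):
    """Equation (1): J~(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)."""
    return J(theta, X, y) + alpha * omega(theta)

# Hypothetical example: squared-error loss with an L2 penalty on toy data.
J = lambda theta, X, y: 0.5 * np.mean((X @ theta - y) ** 2)
omega = lambda theta: 0.5 * np.sum(theta ** 2)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
theta = rng.normal(size=5)
print(regularized_objective(J, omega, theta, X, y, alpha=0.1))
```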

Figure 1: Effect of regularization strength on the fitting results. More strongly regularized models produce smoother decision boundaries and are therefore less prone to overfitting.

During training, the algorithm tries to minimize both the unregularized objective function J and the size of the parameters \pmb{\theta}. The choice of norm \Omega determines how \pmb{\theta} is shrunk, and therefore which solutions for \pmb{\theta} are preferred over others. Note that in practice the regularized parameters \pmb{\theta} are usually restricted to the weights w of the affine transformation at each layer. The biases are typically left out of the norm penalty: each bias controls only a single variable, while the weights, whose number is proportional to the number of neurons in the two adjacent layers, control far more variables and thus account for most of the variance of the model's output. Also note that although we could use a different weight decay strength \alpha for each layer, it is common practice to use the same value of \alpha across all layers for simpler hyperparameter tuning, as in the sketch below.
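As an illustration of this convention, one common way to exclude biases from the penalty in PyTorch is to place weights and biases in separate optimizer parameter groups. The architecture and the weight decay value 1e-4 below are arbitrary placeholders:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)

# Split parameters: weight matrices get the penalty, biases do not.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},    # same alpha for all layers
     {"params": no_decay, "weight_decay": 0.0}], # biases left unregularized
    lr=0.1,
)
```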

L2 norm regularization, also known as weight decay (and, for linear models, as Ridge regression), is one of the most common types of norm penalty regularization. It defines the norm penalty term as \Omega(\pmb{\theta}) = \frac{1}{2} ||\textbf{w}||^2_2. Let’s take a deeper look at how this regularization interacts with the usual gradient-based minimization of the objective function. After adding the L2 norm term, the regularized objective function becomes:

(2)   \begin{equation*}\tilde{J}(\pmb{\theta}; \textbf{X}, \textbf{y}) = \frac{\alpha}{2}\textbf{w}^T\textbf{w} + J(\pmb{\theta}; \textbf{X}, \textbf{y}) \end{equation*}

which will have the gradient as:

(3)   \begin{equation*}\nabla_\textbf{w} \tilde{J}(\pmb{\theta}; \textbf{X}, \textbf{y}) = \alpha \textbf{w} + \nabla_\textbf{w} J(\pmb{\theta}; \textbf{X}, \textbf{y}) \end{equation*}

At each gradient step, we update the weight \textbf{w}:

(4)   \begin{equation*} \textbf{w} \leftarrow \textbf{w} - \epsilon (\alpha \textbf{w} + \nabla_\textbf{w} J(\pmb{\theta}; \textbf{X}, \textbf{y})) \end{equation*}

which is equivalent to:

(5)   \begin{equation*}\textbf{w} \leftarrow (1 - \epsilon\alpha) \textbf{w} - \epsilon \nabla_\textbf{w} J(\pmb{\theta}; \textbf{X}, \textbf{y}) \end{equation*}

Thus the weights \textbf{w} are effectively shrunk by a constant factor (1 - \epsilon\alpha) before the usual gradient update is applied, which is why L2 norm regularization is also referred to as weight decay.
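A small NumPy sketch (with hypothetical helper names) confirms that the two forms of the update, equations (4) and (5), are identical:

```python
import numpy as np

def l2_sgd_step(w, grad_J, eps, alpha):
    """One gradient step on the L2-regularized objective, equation (4)."""
    return w - eps * (alpha * w + grad_J)

def l2_sgd_step_decay_form(w, grad_J, eps, alpha):
    """The same update written as weight decay, equation (5):
    shrink w by (1 - eps*alpha), then apply the usual gradient step."""
    return (1.0 - eps * alpha) * w - eps * grad_J

rng = np.random.default_rng(0)
w, g = rng.normal(size=4), rng.normal(size=4)
assert np.allclose(l2_sgd_step(w, g, 0.01, 0.5),
                   l2_sgd_step_decay_form(w, g, 0.01, 0.5))
```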

L1 norm regularization, also known as Lasso regression in the linear-model setting, is defined by adding an L1 norm penalty term \Omega(\pmb{\theta}) = ||\textbf{w}||_1 = \sum_i|w_i|, scaled by a positive hyperparameter \alpha, to the original objective function J:

(6)   \begin{equation*}    \tilde{J}(\textbf{w}; \textbf{X}, \textbf{y}) = J(\textbf{w}; \textbf{X}, \textbf{y}) + \alpha ||\textbf{w}||_1 \end{equation*}

which has a gradient of:

(7)   \begin{equation*}\nabla_\textbf{w} \tilde{J}(\textbf{w}; \textbf{X}, \textbf{y}) = \nabla_\textbf{w} J(\textbf{w}; \textbf{X}, \textbf{y}) + \alpha \text{sign}(\textbf{w})\end{equation*}

where \text{sign}(\textbf{w}) is the sign function applied element-wise to \textbf{w}.

We can observe that the L1 norm contribution to the gradient update no longer scales linearly with each w_i, as it does in L2 norm regularization; instead it has a constant magnitude \alpha with sign \text{sign}(w_i).
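The following sketch (illustrative values only) contrasts the two penalty gradients: the L1 contribution has constant magnitude \alpha, while the L2 contribution scales with each w_i:

```python
import numpy as np

def l1_gradient_contribution(w, alpha):
    """Penalty term of equation (7): constant magnitude alpha, sign of each w_i."""
    return alpha * np.sign(w)

def l2_gradient_contribution(w, alpha):
    """For comparison, the L2 contribution scales linearly with each w_i."""
    return alpha * w

w = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(l1_gradient_contribution(w, 0.5))  # [-0.5  -0.5   0.    0.5   0.5 ]
print(l2_gradient_contribution(w, 0.5))  # [-1.   -0.05  0.    0.05  1.  ]
```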

L1 norm regularization encourages sparser solutions than L2 norm regularization, i.e. the optimal set of parameters w tends to contain more elements that are exactly zero. L1 norm regularization is therefore commonly used as a feature selection mechanism. Feature selection in the context of deep learning means choosing the subset of features that best explains the data with respect to the output. Since L1 norm regularization yields a sparser set of weights, features whose weights are zero are effectively discarded when computing the network’s output: the network has learned which features are irrelevant to the output and has turned off their contributions by setting the corresponding weights to zero.
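One caveat worth noting: raw subgradient steps with \alpha \text{sign}(\textbf{w}) rarely land on exact zeros in floating point; in practice, exact sparsity is typically obtained with a proximal soft-thresholding update, as in the ISTA-style sketch below. The toy data, step size, and \alpha here are illustrative assumptions, not values from the original text:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrinks toward zero, zeroing |w_i| <= t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista(X, y, alpha, eps=0.1, steps=2000):
    """Proximal gradient descent for 0.5 * mean((X @ w - y)**2) + alpha * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n                 # gradient of the smooth loss J
        w = soft_threshold(w - eps * grad, eps * alpha)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [1.5, -2.0, 1.0]                        # only three relevant features
y = X @ w_true + 0.1 * rng.normal(size=100)
print(ista(X, y, alpha=0.05))  # weights of the irrelevant features come out exactly zero
```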