L1/L2 Norm Regularization
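Norm regularization methods work by adding a parameter norm penalty $\Omega(\theta)$ to the objective function. The unregularized objective function $J(\theta; X, y)$ becomes the regularized objective function $\tilde{J}(\theta; X, y)$:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\,\Omega(\theta) \tag{1}$$
where $\alpha \in [0, \infty)$ is a coefficient that balances the contribution of the penalty term $\Omega(\theta)$ against the unregularized objective function $J(\theta; X, y)$. Setting $\alpha = 0$ means there is no regularization. As $\alpha$ becomes larger, the regularization effect becomes stronger. The coefficient $\alpha$ is a hyperparameter that needs to be tuned for each specific problem.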
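As a rough illustration of equation (1), the sketch below adds a scaled penalty to a loss. The names `regularized_objective`, `toy_loss`, `half_squared_l2`, and `alpha` (standing for the penalty coefficient) are hypothetical and chosen only for this example; the toy loss is not from the text.

```python
def regularized_objective(loss, penalty, theta, alpha):
    """Return loss(theta) + alpha * penalty(theta), mirroring equation (1)."""
    return loss(theta) + alpha * penalty(theta)


# Hypothetical toy loss: squared distance of theta from a fixed target.
def toy_loss(theta):
    target = [1.0, -2.0]
    return sum((t - g) ** 2 for t, g in zip(theta, target))


# Hypothetical penalty: half the squared L2 norm of theta.
def half_squared_l2(theta):
    return 0.5 * sum(t * t for t in theta)


theta = [1.0, -2.0]
# alpha = 0 switches the penalty off, so only the loss remains.
no_reg = regularized_objective(toy_loss, half_squared_l2, theta, alpha=0.0)
# A positive alpha adds a cost for large parameters, even where the loss is zero.
with_reg = regularized_objective(toy_loss, half_squared_l2, theta, alpha=0.1)
```

With the penalty switched on, the point that minimizes the toy loss is no longer free: the optimizer has to trade a small increase in loss for smaller parameters.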

During training, the algorithm tries to minimize both the unregularized objective function and the size of the parameters $\theta$. The choice of norm $\Omega$ affects how $\theta$ is shrunk, so that some solutions for $\theta$ are preferred over others. In practice, the penalized parameters $\theta$ reduce to the weights $w$ of the affine transformation at each layer only; the biases are usually left out of the norm regularization term. This is because each bias controls only a single variable, while the weights control many more variables, in proportion to the number of neurons in the two adjacent layers, and thus account for most of the variance of the model's output. Although a different strength of the weight decay term $\alpha$ could be used for each layer, it is common practice to use the same value of $\alpha$ across all layers to simplify hyperparameter tuning.
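The convention of penalizing weights but not biases, with one shared coefficient for all layers, might look like the following sketch. The two-layer parameter layout, the dictionary keys `"W"` and `"b"`, and the function name `weight_penalty` are assumptions made for illustration, and a squared L2 penalty is used as one possible choice of norm.

```python
# Hypothetical two-layer network: each layer has a weight matrix W and a bias vector b.
layers = [
    {"W": [[0.5, -1.0], [2.0, 0.0]], "b": [0.1, -0.3]},
    {"W": [[1.5, -0.5]], "b": [4.0]},
]


def weight_penalty(layers, alpha):
    """Half the sum of squared weights over all layers, scaled by one shared alpha.

    Biases are deliberately skipped: only the "W" entries are read.
    """
    total = 0.0
    for layer in layers:
        for row in layer["W"]:
            total += sum(w * w for w in row)
    return 0.5 * alpha * total


penalty = weight_penalty(layers, alpha=0.01)
```

Because the biases never enter the sum, changing them leaves the penalty unchanged, which is exactly the behavior the paragraph above describes.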
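L2 norm regularization, also known as weight decay or, in the linear regression setting, ridge regression, is one of the most common types of norm penalty regularization. It defines the norm penalty term as $\Omega(w) = \frac{1}{2}\lVert w \rVert_2^2$. Let's take a closer look at how this penalty interacts with ordinary gradient-based minimization of the objective function. The regularized objective function after adding the L2 norm term is:
$$\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2}\, w^\top w \tag{2}$$
which has the gradient:
$$\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha w \tag{3}$$
At each gradient step with learning rate $\epsilon$, we update the weights $w$:
$$w \leftarrow w - \epsilon \left( \nabla_w J(w; X, y) + \alpha w \right) \tag{4}$$
which is equivalent to:
$$w \leftarrow (1 - \epsilon\alpha)\, w - \epsilon\, \nabla_w J(w; X, y) \tag{5}$$
Thus the weights are effectively shrunk by a constant factor $(1 - \epsilon\alpha)$ before being updated with the gradient term.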
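The equivalence between updates (4) and (5) can be checked numerically. In this minimal sketch, `lr` stands for the learning rate and `alpha` for the penalty coefficient; both names, the function names, and the sample numbers are assumptions for illustration only.

```python
def step_with_penalty_gradient(w, grad, lr, alpha):
    """Update (4): subtract lr times (data gradient + alpha * w)."""
    return [wi - lr * (gi + alpha * wi) for wi, gi in zip(w, grad)]


def step_with_weight_decay(w, grad, lr, alpha):
    """Update (5): shrink w by the factor (1 - lr * alpha), then apply the data gradient."""
    shrink = 1.0 - lr * alpha
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad)]


# Hypothetical weights and data-loss gradient for a single step.
w = [2.0, -1.0]
grad = [0.5, 0.25]

updated_a = step_with_penalty_gradient(w, grad, lr=0.1, alpha=0.5)
updated_b = step_with_weight_decay(w, grad, lr=0.1, alpha=0.5)
```

Both functions produce the same weights up to floating-point rounding; the second form simply makes the multiplicative shrinkage explicit.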
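L1 norm regularization, known as lasso regression in the linear regression setting, adds an L1 norm penalty term $\Omega(w) = \lVert w \rVert_1 = \sum_i |w_i|$, scaled by a positive hyperparameter $\alpha$, to the original objective function $J(w; X, y)$:
$$\tilde{J}(w; X, y) = J(w; X, y) + \alpha \lVert w \rVert_1 \tag{6}$$
which has a gradient (strictly, a subgradient, since $|w_i|$ is not differentiable at zero) of:
$$\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha\, \mathrm{sign}(w) \tag{7}$$
where $\mathrm{sign}(w)$ is the sign of $w$ applied element-wise.
We can observe that the L1 norm regularization contribution to the gradient update no longer scales linearly with each $w_i$, as in L2 norm regularization, but is instead a constant $\alpha$ whose sign is $\mathrm{sign}(w_i)$.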
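To make the contrast concrete, the sketch below computes the penalty's contribution to the gradient under each norm for a few sample weights. The function names, `alpha`, and the sample values are hypothetical; returning 0 for a weight that is exactly zero is one common subgradient choice, not something stated in the text.

```python
def sign(x):
    """Return 1.0, -1.0, or 0.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def l2_penalty_grad(w, alpha):
    """Gradient of (alpha / 2) * sum(w_i ** 2): proportional to each weight."""
    return [alpha * wi for wi in w]


def l1_penalty_grad(w, alpha):
    """Subgradient of alpha * sum(|w_i|): constant magnitude alpha, sign of each weight."""
    return [alpha * sign(wi) for wi in w]


# One large weight, one tiny weight, and one weight already at zero.
w = [3.0, -0.02, 0.0]
l2_part = l2_penalty_grad(w, alpha=0.1)
l1_part = l1_penalty_grad(w, alpha=0.1)
```

For the tiny weight, the L2 pull is correspondingly tiny, while the L1 pull keeps the full magnitude `alpha`, so small weights continue to be pushed toward zero at the same rate as large ones.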
L1 norm regularization encourages sparser solutions than L2 norm regularization, i.e. at the optimum more of the parameters tend to be exactly zero. For this reason, L1 norm regularization is commonly used as a feature selection mechanism. Feature selection in the context of deep learning means choosing the subset of features that best explains the output. Because L1 norm regularization yields a sparser set of weights, features whose weights are zero are effectively discarded when computing the network’s output. The network has learned which features are irrelevant to the output and has switched off their contributions by setting the corresponding weights to zero.
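One simplified way to see where the exact zeros come from is a one-dimensional quadratic loss $\frac{1}{2}(w - a)^2$, where $a$ is the unregularized optimum. This setting is an assumption introduced only for illustration and is not taken from the text. In that case both penalized minimizers have closed forms: the L1 penalty gives soft-thresholding, while the L2 penalty gives a uniform rescaling. The function names and sample values below are hypothetical.

```python
def l1_minimizer(a, alpha):
    """argmin over w of 0.5 * (w - a)**2 + alpha * abs(w), i.e. soft-thresholding."""
    magnitude = max(abs(a) - alpha, 0.0)
    if magnitude == 0.0:
        return 0.0  # any |a| <= alpha is mapped to exactly zero
    return magnitude if a > 0 else -magnitude


def l2_minimizer(a, alpha):
    """argmin over w of 0.5 * (w - a)**2 + 0.5 * alpha * w**2, i.e. uniform shrinkage."""
    return a / (1.0 + alpha)


# Hypothetical unregularized optima for four independent weights.
unregularized = [2.0, 0.3, -0.1, -1.5]
alpha = 0.5

l1_solution = [l1_minimizer(a, alpha) for a in unregularized]
l2_solution = [l2_minimizer(a, alpha) for a in unregularized]

l1_zero_count = sum(1 for v in l1_solution if v == 0.0)
l2_zero_count = sum(1 for v in l2_solution if v == 0.0)
```

In this toy setting the L1 solution zeroes out every weight whose unregularized value is no larger than `alpha` in magnitude, whereas the L2 solution makes all weights smaller but leaves none of the nonzero ones at exactly zero.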