# 4.2 L1/L2 Norm Regularization



Norm regularization methods work by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. The unregularized objective function $J(\theta)$ then becomes the regularized objective function $\tilde{J}(\theta)$:

$$\tilde{J}(\theta) = J(\theta) + \alpha\,\Omega(\theta) \tag{1}$$

where $\alpha \geq 0$ is a coefficient that balances the contribution of the penalty term $\Omega$ against the unregularized objective function $J$. Setting $\alpha = 0$ means there is no regularization; as $\alpha$ becomes larger, the regularization effect becomes stronger. $\alpha$ is a hyperparameter that needs to be tuned for each specific problem.
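As a minimal sketch of Eq. (1), the snippet below adds a scaled penalty to an unregularized objective. The mean squared error of a linear model and the half squared L2 norm are assumptions chosen only for illustration, and the names `unregularized_objective`, `squared_l2_penalty`, `regularized_objective`, and `alpha` (the penalty coefficient) are hypothetical.

```python
def unregularized_objective(theta, inputs, targets):
    """J(theta): mean squared error of a linear model (assumed for illustration)."""
    total = 0.0
    for x, y in zip(inputs, targets):
        prediction = sum(t * xi for t, xi in zip(theta, x))
        total += (prediction - y) ** 2
    return total / len(targets)


def squared_l2_penalty(theta):
    """Omega(theta): one possible norm penalty, half the squared L2 norm."""
    return 0.5 * sum(t * t for t in theta)


def regularized_objective(theta, inputs, targets, alpha, penalty=squared_l2_penalty):
    """J~(theta) = J(theta) + alpha * Omega(theta), as in Eq. (1)."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return unregularized_objective(theta, inputs, targets) + alpha * penalty(theta)


# Toy data, made up for this example.
inputs = [[1.0, 2.0], [3.0, 4.0]]
targets = [1.0, 2.0]
theta = [0.5, -0.5]

# alpha = 0 recovers the unregularized objective; larger alpha weighs the penalty more.
for alpha in (0.0, 0.1, 1.0):
    print(alpha, regularized_objective(theta, inputs, targets, alpha))
```

Passing a different `penalty` function swaps the norm without touching the rest of the objective, which mirrors how the choice of norm is independent of the choice of coefficient.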

During training, the algorithm tries to minimize both the unregularized objective function $J$ and the size of the parameters $\theta$. The choice of norm $\Omega$ determines how $\theta$ is shrunk, and therefore which solutions for $\theta$ are preferred over others. In practice, the penalized parameters are usually only the weights $w$ of the affine transformation at each layer; the biases are typically left out of the norm penalty. The reason is that each bias controls a single variable, whereas the weights control many more, in proportion to the number of neurons in the two adjacent layers, and thus account for most of the variance of the model's output. Finally, although a different weight decay strength could be used for each layer, it is common practice to use the same value of $\alpha$ across all layers to simplify hyperparameter tuning.
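The two practical conventions above (penalize weights but not biases, and share one coefficient across layers) can be sketched as follows. The dictionary layout of `layers`, the function name `l2_penalty_weights_only`, and the numeric values are all hypothetical; the half squared L2 norm is assumed as the penalty.

```python
def l2_penalty_weights_only(layers, alpha):
    """Return alpha * sum over layers of 0.5 * ||W||^2, skipping every bias vector."""
    penalty = 0.0
    for layer in layers:
        for row in layer["weights"]:
            penalty += 0.5 * sum(w * w for w in row)
        # layer["bias"] is deliberately left out of the penalty.
    return alpha * penalty


# A hypothetical two-layer network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    {"weights": [[0.2, -0.4], [0.1, 0.3], [-0.5, 0.6]], "bias": [5.0, -5.0, 5.0]},
    {"weights": [[0.7, -0.8, 0.9]], "bias": [10.0]},
]

alpha = 0.01  # one shared coefficient for all layers
print(l2_penalty_weights_only(layers, alpha))
```

Because the biases never enter the sum, they can grow as large as the data requires without being pulled toward zero.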

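L2 norm regularization, also known as ridge regression or weight decay, is one of the most common types of norm penalty regularization. It defines the norm penalty term as $\Omega(\theta) = \frac{1}{2}\lVert w \rVert_2^2$. Let’s take a closer look at how this regularization interacts with ordinary gradient-based minimization of the objective function. After adding the L2 norm term, the regularized objective function is:

$$\tilde{J}(w) = \frac{\alpha}{2} w^\top w + J(w) \tag{2}$$

which has the gradient:

$$\nabla_w \tilde{J}(w) = \alpha w + \nabla_w J(w) \tag{3}$$

At each gradient step with learning rate $\epsilon$, we update the weights $w$:

$$w \leftarrow w - \epsilon \left( \alpha w + \nabla_w J(w) \right) \tag{4}$$

which is equivalent to:

$$w \leftarrow (1 - \epsilon\alpha)\, w - \epsilon \nabla_w J(w) \tag{5}$$

Thus the weights are effectively shrunk by a constant factor $(1 - \epsilon\alpha)$ before being updated with the gradient term.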
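The equivalence between the update in Eq. (4) and the shrink-then-step form in Eq. (5) can be checked numerically. This is a small sketch under assumed values: the function names, the weights `w`, the gradient `grad` of the unregularized objective, the coefficient `alpha`, and the learning rate `lr` are all made up for illustration.

```python
import math


def step_with_penalty_gradient(w, grad, alpha, lr):
    """Eq. (4) form: w <- w - lr * (alpha * w + grad)."""
    return [wi - lr * (alpha * wi + gi) for wi, gi in zip(w, grad)]


def step_with_weight_shrinkage(w, grad, alpha, lr):
    """Eq. (5) form: w <- (1 - lr * alpha) * w - lr * grad."""
    shrink = 1.0 - lr * alpha
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad)]


w = [1.0, -2.0, 0.5]
grad = [0.3, -0.1, 0.0]  # gradient of the unregularized objective (made up)
alpha, lr = 0.1, 0.5

explicit = step_with_penalty_gradient(w, grad, alpha, lr)
shrunk = step_with_weight_shrinkage(w, grad, alpha, lr)

# Both forms agree up to floating-point rounding.
print(all(math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
          for a, b in zip(explicit, shrunk)))
```

The third weight has a zero gradient, so its update consists of the shrinkage alone: it is simply multiplied by `1 - lr * alpha`.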

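L1 norm regularization is also known as Lasso regression. It adds an L1 norm penalty term $\Omega(\theta) = \lVert w \rVert_1 = \sum_i |w_i|$, scaled by a positive hyperparameter $\alpha$, to the original objective function $J$:

$$\tilde{J}(w) = \alpha \lVert w \rVert_1 + J(w) \tag{6}$$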
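A minimal sketch of Eq. (6) follows, with hypothetical function names and made-up values. The subgradient helper is an addition not derived in the text above: it uses the standard fact that the derivative of $|w_i|$ is $\operatorname{sign}(w_i)$ away from zero, with 0 taken as a valid choice at $w_i = 0$.

```python
def l1_penalty(w):
    """L1 norm of the weights: sum of absolute values."""
    return sum(abs(wi) for wi in w)


def l1_regularized_objective(w, unregularized_value, alpha):
    """Eq. (6): alpha * ||w||_1 plus the value of the unregularized objective."""
    return alpha * l1_penalty(w) + unregularized_value


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def l1_penalty_subgradient(w, alpha):
    """A subgradient of alpha * ||w||_1 (assumption: 0 is chosen at w_i = 0)."""
    return [alpha * sign(wi) for wi in w]


w = [1.0, -2.0, 0.0]
alpha = 0.5

print(l1_penalty(w))
print(l1_regularized_objective(w, unregularized_value=4.0, alpha=alpha))
print(l1_penalty_subgradient(w, alpha))
```

Unlike the L2 case, where the penalty gradient is proportional to each weight, the magnitude of this term is the same for every nonzero weight regardless of its size.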