Neural networks have great expressive capacity, in the sense that they can match exactly the distribution of any given training data (overfitting). The longer the training time and the deeper the network, the more susceptible the network becomes to this problem. However, we want the network to perform well not only on the training data but also on new inputs it has never seen before, as this is the ultimate goal when deploying deep learning models to practical real-world problems.

Many strategies have been proposed to address this issue, forcing the network to sacrifice some training performance in exchange for lower test error and better generalization. Some strategies restrict the parameter values. Others add extra penalty terms to the objective function, designed to promote learning a simpler model and thus encourage generalization. Sometimes the extra terms encode prior knowledge about the specific problem domain at hand. Other times we combine multiple hypotheses to explain the data (ensembling). These methods are known collectively as regularization. Regularization can be defined as ``any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error'' \cite{goodfellow2016deep}. In terms of the bias/variance tradeoff we discussed in the previous section, a good regularizer is one that strikes the right balance, reducing the variance without overly increasing the bias. In this section we introduce some regularization methods commonly used in the deep learning community, including parameter norm penalties (L1/L2 regularization), dropout, early stopping, and data augmentation.
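To make the penalty-term idea concrete, the following is a minimal NumPy sketch (the function name and linear-model setup are illustrative, not a prescribed implementation): the regularized objective is the data loss plus a term $\lambda \lVert \mathbf{w} \rVert$ that penalizes large weights, with the choice of norm giving L2 (weight decay) or L1 (sparsity-inducing) regularization.

```python
import numpy as np

def penalized_loss(w, X, y, lam=0.1, norm="l2"):
    """Mean-squared-error loss of a linear model, plus a norm penalty on w.

    Illustrative helper: lam is the regularization strength lambda;
    norm selects the L2 (sum of squares) or L1 (sum of absolute values) penalty.
    """
    residual = X @ w - y
    data_loss = 0.5 * np.mean(residual ** 2)   # fit to the training data
    if norm == "l2":
        penalty = lam * np.sum(w ** 2)         # L2: shrinks weights toward zero
    else:
        penalty = lam * np.sum(np.abs(w))      # L1: encourages sparse weights
    return data_loss + penalty
```

With $\lambda = 0$ this reduces to the plain training loss; increasing $\lambda$ trades training accuracy for a simpler (smaller- or sparser-weight) model, which is exactly the bias/variance tradeoff described above.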