**4.3 Dropout**

The dropout method was proposed by Srivastava et al. \cite{srivastava2014dropout} in 2014 as another effort to regularize deep neural networks. Dropout can be thought of as a bagging method for neural networks. Bagging, short for “bootstrap aggregating”, is a machine learning technique that trains multiple models on bootstrap resamples of the training data and aggregates their predictions on each test sample \cite{goodfellow2016deep}. Bagging is widely used with other machine learning algorithms, but proves practically difficult for neural networks: training a single neural network usually requires more training time and computing resources than traditional machine learning methods, let alone training multiple versions of it for bagging to work. Dropout is a computationally inexpensive approximation that makes bagging available to neural networks, by training a bagged ensemble of subnetworks of the same large network at different training steps.
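To make the bagging idea concrete, the following is a minimal sketch of bootstrap aggregating on a toy regression problem; the data, model, and ensemble size are all hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (hypothetical, for illustration only).
X = np.linspace(0.0, 1.0, 50)
y = 2.0 * X + rng.normal(scale=0.1, size=X.shape)

def fit_linear(x, t):
    """Least-squares fit of t ~ a*x + b; returns (a, b)."""
    a, b = np.polyfit(x, t, deg=1)
    return a, b

# Bagging step 1: train each model on a bootstrap resample
# (sampling with replacement) of the training set.
n_models = 10
models = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(fit_linear(X[idx], y[idx]))

# Bagging step 2: aggregate by averaging all models' predictions
# on the test sample.
x_test = 0.5
preds = [a * x_test + b for a, b in models]
bagged_prediction = float(np.mean(preds))
```

For neural networks, each of these ensemble members would be a full training run, which is what makes naive bagging prohibitively expensive and motivates dropout as an approximation.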

In particular, dropout works by randomly removing units from the hidden layers of the base network. This is done by elementwise multiplying the layer responses with a binary mask, where 1 means the unit’s response is retained and 0 means it is nullified, effectively removing that unit’s contribution to the network output. The binary mask is drawn randomly based on a hyperparameter p, the probability of retaining a particular unit at a given step. Typical values of p are 0.8 for the input layer and 0.5 for the hidden layers \cite{goodfellow2016deep}. This yields a random subnetwork of the base network at each training step, and forward propagation, backpropagation, and weight updates are performed only on that subnetwork, ignoring the dropped units and connections at that time.

At the testing phase, no dropout is applied and the network output is computed using the responses of all units, each multiplied by p. This multiplication makes each unit’s test-time output match its expected output at training time. The final result is effectively multiple trained subnetworks collectively deciding the network output, similar to how bagging works.

We note, however, that dropout is not exactly the same as bagging. In bagging the models are trained independently, while in dropout the subnetworks share overlapping subsets of the base network’s parameters and are therefore not independent of each other.
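The masking and test-time scaling described above can be sketched as a single forward-pass function; this is a minimal illustration assuming p denotes the retention probability, with hypothetical activation values, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout_forward(h, p, train=True):
    """Apply dropout to a vector of layer responses h.

    p is the probability of *retaining* a unit (e.g. 0.5 for hidden
    layers). At training time, each unit is kept with probability p
    via a random binary mask; at test time all units are kept and
    scaled by p to match their expected training-time output.
    """
    if train:
        mask = (rng.random(h.shape) < p).astype(h.dtype)  # 1 = keep, 0 = drop
        return h * mask
    return h * p

h = np.ones(10_000)  # hypothetical layer activations, all 1.0
train_out = dropout_forward(h, p=0.5, train=True)
test_out = dropout_forward(h, p=0.5, train=False)

# About half the units are zeroed at training time, so the mean
# training-time output (~0.5) matches the scaled test-time output (0.5).
print(train_out.mean())
print(test_out.mean())
```

Many modern implementations instead use “inverted” dropout, dividing by p during training so that no scaling is needed at test time; the expected outputs are the same either way.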