4.4 Early Stopping
The ultimate goal of training a neural network is good performance on unseen test samples. To this end, it is typical to maintain two separate sets of data, a training set and a validation set, and to monitor the network's performance on both over the course of training. We often observe that both the training and validation loss curves fall gradually until some point at which the validation loss starts to rise, indicating the onset of overfitting (see figure ??). This behavior suggests that we can obtain the model that generalizes best by returning to the parameters learned at the point in time when the validation loss was lowest. This commonly used practice is known as "early stopping", and it is a cost-effective form of regularization because it prevents the network from fitting the training data too closely. Early stopping can also be viewed as restricting (regularizing) the space of learned parameters to a neighborhood of the initial parameter values.
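The procedure can be sketched as a training loop that keeps a copy of the parameters achieving the lowest validation loss seen so far. This is a minimal illustration, not a definitive implementation: `train_step` and `evaluate` are hypothetical callbacks standing in for one optimization step and for computing the validation loss.

```python
import copy

def train_with_early_stopping(train_step, evaluate, params, max_steps):
    """Sketch of early stopping: run at most `max_steps` steps, keep a
    copy of the parameters with the lowest validation loss seen so far,
    and return that copy at the end."""
    best_params = copy.deepcopy(params)
    best_loss = float("inf")
    best_step = 0
    for step in range(1, max_steps + 1):
        params = train_step(params)           # one optimization step
        val_loss = evaluate(params)           # loss on the validation set
        if val_loss < best_loss:              # new lowest point on the curve
            best_loss = val_loss
            best_step = step
            best_params = copy.deepcopy(params)
    return best_params, best_step, best_loss

# Toy demonstration: a scalar "parameter" whose validation loss is
# U-shaped in the number of steps, with its minimum at step 30.
best_p, best_step, best_loss = train_with_early_stopping(
    lambda p: p + 1,            # pretend parameter update
    lambda p: (p - 30) ** 2,    # U-shaped validation loss
    0, 100)
print(best_step, best_loss)     # 30 0
```

In practice the validation loss is noisy, so real implementations often add a "patience" threshold, stopping only after the loss has failed to improve for several consecutive evaluations.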
Early stopping can be thought of as an efficient hyperparameter search, where the hyperparameter is the number of training steps. Unlike most hyperparameters, whose values we typically have to guess coarsely, the optimal number of training steps can be read off directly as the lowest point of the U-shaped validation loss curve. In effect, we control the capacity of the model by stipulating how long it is allowed to fit the training set, in a simple and highly efficient manner.
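Concretely, this "search" reduces to an argmin over the recorded validation curve. The loss values below are made up for illustration:

```python
# Hypothetical validation losses recorded once per training step.
val_losses = [2.8, 2.1, 1.6, 1.3, 1.2, 1.25, 1.4, 1.7]

# The hyperparameter search is just the index of the lowest point
# on the curve (plus one, since steps are counted from 1).
best_num_steps = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
print(best_num_steps)  # 5: the bottom of the U-shaped curve
```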
One cost of early stopping is that it must store a copy of the best parameters seen so far. However, this copy is accessed infrequently, so it can be kept in a slower and larger form of memory such as a hard disk drive; the cost of accessing it is negligible compared to the other costs of training. Another cost is the need to evaluate on the validation set periodically during training. This cost can be reduced by evaluating less frequently, at the expense of lower resolution on the validation curve and the risk of missing the true optimum, or by using a smaller validation set so that the evaluation time becomes negligible. Alternatively, evaluation can run on a separate thread or device, e.g. on the CPU while the training job runs on the GPU, since evaluation does not affect the ongoing training process.
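The resolution trade-off from evaluating less frequently can be made concrete with a small sketch. Assuming the same hypothetical `train_step`/`evaluate` callbacks as before, validating only every `eval_every` steps means the reported optimum can be off by up to `eval_every - 1` steps:

```python
import copy

def train_with_periodic_eval(train_step, evaluate, params,
                             max_steps, eval_every):
    """Sketch: validate only every `eval_every` steps to amortize the
    evaluation cost, at the price of a coarser validation curve."""
    best_params = copy.deepcopy(params)
    best_loss, best_step = float("inf"), 0
    for step in range(1, max_steps + 1):
        params = train_step(params)
        if step % eval_every == 0:            # infrequent validation
            val_loss = evaluate(params)
            if val_loss < best_loss:
                best_loss, best_step = val_loss, step
                best_params = copy.deepcopy(params)  # rare copy; could live on disk
    return best_params, best_step

# With the true minimum at step 30 but evaluation every 7 steps,
# the closest evaluated step (28) is returned instead.
bp, bs = train_with_periodic_eval(
    lambda p: p + 1, lambda p: (p - 30) ** 2, 0, 100, 7)
print(bs)  # 28
```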
Early stopping requires a validation set, which reduces the number of data points available for training the model; this may be undesirable when data is limited. Two strategies can mitigate this inefficiency. The first is to re-initialize the parameters and retrain the whole network from scratch after early stopping has determined the optimal stopping point, this time using all available data from both the training and validation sets and running until the optimal amount of training is reached. A subtlety with this strategy is that it is unclear whether the stopping point found by early stopping should be interpreted as a number of parameter updates (iterations) or as a number of passes through the data (epochs), since we now have more data points than before. The second strategy is to keep the parameters learned at the early-stopping point and continue training, now on both the training and validation datasets, until the loss on the validation set falls below the objective value determined by early stopping. This avoids retraining the network from scratch, but there is no guarantee that the validation loss will ever reach that objective; often convergence is never achieved.
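The second strategy can be sketched as a loop with an explicit step budget, since the source notes the objective may never be reached. The callback names are illustrative, as before:

```python
def continue_training(train_step, evaluate, params,
                      target_loss, max_extra_steps):
    """Sketch of the second strategy: keep the early-stopped parameters
    and keep training (now on training + validation data) until the
    validation loss falls below `target_loss`, the objective recorded
    at the early-stopping point. A step budget guards against the case
    where that objective is never reached."""
    for step in range(1, max_extra_steps + 1):
        params = train_step(params)          # update now uses all data
        if evaluate(params) <= target_loss:  # reached the objective
            return params, step
    return params, None                      # objective never reached

# Converging case: loss (p - 10)^2 first drops to the target 4 at step 8.
p_ok, steps_ok = continue_training(
    lambda p: p + 1, lambda p: (p - 10) ** 2, 0, 4, 100)

# Non-converging case: the loss never drops below its floor of 2,
# so the target of 1 is unreachable and the budget runs out.
p_no, steps_no = continue_training(
    lambda p: p + 1, lambda p: (p - 10) ** 2 + 2, 0, 1, 100)
print(steps_ok, steps_no)  # 8 None
```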