We can update the model using either one training sample at a time or the entire training dataset at once. These two strategies are called stochastic gradient descent (SGD) and gradient descent (GD), respectively. GD's update direction is stable and follows the true gradient, but each update requires a pass over the whole dataset, so training is slow. SGD is fast, but at the price of noisy, unstable updates. In practice, people strike a balance between the two with mini-batch GD, which splits the training set into mini-batches and updates the model with one mini-batch per iteration. Mini-batch GD is not only efficient but also produces relatively stable weight updates. Even so, it is not a panacea for all learning problems, because the model's performance can still be highly impacted by the learning rate.
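The three strategies can be viewed as one loop with different batch sizes. Below is a minimal sketch on a hypothetical 1-D least-squares problem (the function name, toy data, and hyperparameters are all illustrative, not from the text); setting the batch size to the full dataset recovers GD, and setting it to 1 recovers SGD.

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y ~ w * x with mini-batch gradient descent.

    batch_size == len(xs) is full-batch GD; batch_size == 1 is SGD.
    """
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                      # new mini-batches each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # gradient of the mean of 0.5 * (w*x - y)**2 over the batch
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

# toy noiseless data: y = 2x, so all variants should recover w ~ 2
xs = [i / 10 for i in range(1, 21)]
ys = [2 * x for x in xs]

w_gd  = minibatch_gd(xs, ys, batch_size=len(xs))  # full-batch GD
w_sgd = minibatch_gd(xs, ys, batch_size=1)        # SGD
w_mb  = minibatch_gd(xs, ys, batch_size=4)        # mini-batch GD
```

On real, noisy data the SGD and mini-batch estimates would wander around the optimum rather than settle on it, which is exactly the instability the text describes.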

As shown in Fig. 1, with a high learning rate (green), the loss gets stuck at a relatively high value. If the learning rate goes even higher (yellow), the loss can explode. On the contrary, if the learning rate is too low (blue), the loss decreases steadily but takes a long time to converge. Ideally, a good learning rate (red) converges to the optimal loss efficiently. However, these are idealized examples. In reality, even with a good learning rate, the loss curve looks more like Fig. 2(B): generally decreasing, but stochastic. This is because under mini-batch GD, each update is computed from a different set of training samples, and their update directions are not necessarily consistent. To mitigate this inconsistency and obtain stable updates, we apply optimizers for weight updating. In the following section, we will first introduce exponential moving averages and from them derive several optimizers.
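The four regimes in Fig. 1 can be reproduced on a toy quadratic loss. This is a hypothetical stand-in for the figure, not the actual setup behind it: for f(w) = w², the GD update is w ← w(1 − 2·lr), so the learning rate directly controls whether the loss shrinks, stalls, or blows up.

```python
def gd_losses(lr, steps=30):
    """Run gradient descent on f(w) = w**2 from w = 1.0.

    Returns the loss after each step.
    """
    w = 1.0
    losses = []
    for _ in range(steps):
        w -= lr * 2 * w          # gradient of w**2 is 2w
        losses.append(w * w)
    return losses

low   = gd_losses(lr=0.01)  # blue:  decreases, but slowly
good  = gd_losses(lr=0.3)   # red:   converges quickly to ~0
stuck = gd_losses(lr=1.0)   # green: oscillates, loss never improves
high  = gd_losses(lr=1.1)   # yellow: overshoots, loss explodes
```

With lr = 1.0 the iterate flips sign every step (w ← −w), so the loss is stuck exactly where it started; above that threshold each step overshoots by more than it corrects, and the loss diverges.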