Inspired by the EMA method, we can also apply this smoothing to the gradient values that are used to update the weights. As shown in Eq. 1, the updating direction at the current step, $m_t$, is obtained from the weighted sum of the previous updating direction $m_{t-1}$ and the current gradient $g_t$. This new updating direction is then scaled by the learning rate $\alpha$ to update the weights from $w_t$ to $w_{t+1}$.

$$m_t = \beta\, m_{t-1} + (1-\beta)\, g_t, \qquad w_{t+1} = w_t - \alpha\, m_t \qquad (1)$$
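As a concrete illustration, here is a minimal sketch of this update rule in Python; the names `grad_fn`, `lr`, `beta`, and the initialization of $m_0 = 0$ are assumptions for illustration, not part of the notes.

```python
import numpy as np

def ema_gradient_descent(w, grad_fn, lr=0.01, beta=0.9, steps=100):
    """Gradient descent where the update direction is an EMA of past gradients (Eq. 1)."""
    m = np.zeros_like(w)                     # smoothed updating direction, m_0 = 0
    for _ in range(steps):
        g = grad_fn(w)                       # current gradient g_t
        m = beta * m + (1.0 - beta) * g      # EMA of the gradient (Eq. 1)
        w = w - lr * m                       # scale by learning rate and update
    return w
```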
Here, in addition to the learning rate $\alpha$, we have one more hyperparameter $\beta$ to tune. This method is similar to the momentum optimizer we introduced last week, which simply replaces the factor $(1-\beta)$ with $1$, as shown in Eq. 2.

$$m_t = \beta\, m_{t-1} + g_t, \qquad w_{t+1} = w_t - \alpha\, m_t \qquad (2)$$
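For comparison, a sketch of the classical momentum update of Eq. 2, written in the same illustrative style as above:

```python
import numpy as np

def momentum_descent(w, grad_fn, lr=0.01, beta=0.9, steps=100):
    """Classical momentum: the EMA weight (1 - beta) on the gradient is replaced by 1 (Eq. 2)."""
    m = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        m = beta * m + g          # accumulate the raw gradient instead of (1 - beta) * g
        w = w - lr * m
    return w
```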
We can also extend this idea to a second-order momentum by applying EMA to the squared gradient, as shown in Eq. 3. This method is called RMSprop.

$$v_t = \beta\, v_{t-1} + (1-\beta)\, g_t^2, \qquad w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t \qquad (3)$$
Here $v_t$ is a scalar (one per weight) which scales the learning rate $\alpha$ by $1/(\sqrt{v_t} + \epsilon)$, where $\epsilon$ is a very small number used to prevent division by zero. Notice that $\beta$ is another hyperparameter we can tune.
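A minimal sketch of this update in the same style; the function and variable names, and the default $\epsilon = 10^{-8}$, are assumptions for illustration.

```python
import numpy as np

def rmsprop(w, grad_fn, lr=0.01, beta=0.9, eps=1e-8, steps=100):
    """RMSprop: an EMA of the squared gradient rescales the learning rate per weight (Eq. 3)."""
    v = np.zeros_like(w)                          # second-order momentum, v_0 = 0
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v + (1.0 - beta) * g**2        # EMA of the squared gradient
        w = w - lr / (np.sqrt(v) + eps) * g       # per-weight scaled update
    return w
```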
Even though EMA can produce a good estimation over the long term, it is biased at the beginning. For example, in the first step of Eq. 1, if we choose a typical value such as $\beta = 0.9$, then $m_1 = (1-\beta)\, g_1 = 0.1\, g_1$ is much lower than the first observed value $g_1$ (since $m_0 = 0$). This bias persists for the first several steps. This might not be a problem for long-term time-series data, but in a deep learning problem a good initial update can prevent the model from getting trapped in a local minimum, and it is therefore essential to the success of the training. In addition, since the smoothed gradient is much smaller than the actual gradient, the updates are very slow at the start. Therefore, we should try to correct the bias at the beginning. One possible solution is to modify $m_t$ as shown in Eq. 4. For example, the corrected estimation at the first step is $\hat{m}_1 = m_1 / (1-\beta) = g_1$, so the update is identical to using the raw gradient $g_1$, which is a better estimate than $m_1$. Notice that after a certain number of steps, $\beta^t \to 0$ and the corrected estimate becomes identical to the previous result.

$$\hat{m}_t = \frac{m_t}{1 - \beta^t} \qquad (4)$$
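To see the correction numerically, here is a short sketch; the choice $\beta = 0.9$ and a constant gradient of $1.0$ are illustrative assumptions.

```python
beta = 0.9
m = 0.0
g = 1.0                                  # pretend the gradient is constant at 1.0
for t in range(1, 6):
    m = beta * m + (1.0 - beta) * g      # biased EMA (Eq. 1)
    m_hat = m / (1.0 - beta**t)          # bias-corrected estimate (Eq. 4)
    print(f"t={t}: m={m:.3f}, m_hat={m_hat:.3f}")
# t=1: m=0.100, m_hat=1.000  -> the correction recovers the true gradient immediately
# t=5: m=0.410, m_hat=1.000  -> m is still biased; m_hat stays correct
```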
Finally, we can combine these three methods, which gives us the Adam optimizer as shown in Eq. 5. We first estimate the first-order updating direction $m_t$ and the second-order scaling factor $v_t$ in the first two equations. Both are bias-corrected and then applied in the final updating equation. Notably, there are four hyperparameters involved, i.e., $\alpha$, $\beta_1$, $\beta_2$ and $\epsilon$.

$$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad w_{t+1} = w_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \qquad (5)$$
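Putting the pieces together, here is a self-contained sketch of the Adam update; the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the commonly used values, assumed here for illustration.

```python
import numpy as np

def adam(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Adam: EMA of the gradient and of the squared gradient, both bias-corrected (Eq. 5)."""
    m = np.zeros_like(w)                               # first-order momentum
    v = np.zeros_like(w)                               # second-order momentum
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g              # EMA of the gradient
        v = beta2 * v + (1.0 - beta2) * g**2           # EMA of the squared gradient
        m_hat = m / (1.0 - beta1**t)                   # bias correction (Eq. 4)
        v_hat = v / (1.0 - beta2**t)
        w = w - lr / (np.sqrt(v_hat) + eps) * m_hat    # final update
    return w
```

For example, minimizing $f(w) = \|w\|^2$ with `grad_fn = lambda w: 2 * w` drives `w` toward zero.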