2.3 Optimizers


Inspired by the EMA method, we can also apply this smoothing to the gradients that are used to update the weights. As shown in Eq. 1, the updating direction at the current step, \Hat{v}_k, is the weighted sum of the previous updating direction \Hat{v}_{k-1} and the current gradient \nabla_{w_k}. This new updating direction is then scaled by the learning rate \alpha to update the weights from w_k to w_{k+1}.

(1)   \begin{equation*}\begin{split}\Hat{v}_k & = \gamma \Hat{v}_{k-1} + (1-\gamma) \nabla_{w_k} \\ w_{k+1} & = w_k - \alpha \Hat{v}_k \end{split}\end{equation*}
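To make the update rule concrete, here is a minimal NumPy sketch of Eq. 1. The function name `ema_momentum_step` and the toy quadratic loss are illustrative choices, not part of any particular library.

```python
import numpy as np

def ema_momentum_step(w, v, grad, alpha=0.01, gamma=0.9):
    """One update of Eq. 1: EMA-smoothed gradient direction.

    w    -- current weights (w_k)
    v    -- previous smoothed direction (v_{k-1})
    grad -- gradient at the current weights (nabla_{w_k})
    """
    v = gamma * v + (1.0 - gamma) * grad   # EMA of the gradient
    w = w - alpha * v                      # scale by the learning rate
    return w, v

# Toy usage on f(w) = ||w||^2 / 2, whose gradient is simply w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = ema_momentum_step(w, v, grad=w, alpha=0.1)
print(w)  # approaches the minimum at the origin
```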

Here, in addition to the learning rate \alpha, we have one more hyperparameter \gamma to tune. This method is similar to the momentum optimizer we introduced last week, which simply replaces (1 - \gamma) with 1, as shown in Eq. 2.

(2)   \begin{equation*}\begin{split}\Hat{v}_k & = \gamma \Hat{v}_{k-1} + \nabla_{w_k} \\ w_{k+1} & = w_k - \alpha \Hat{v}_k \end{split}\end{equation*}
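In code, the only change from the previous sketch is that the gradient enters with weight 1 instead of (1 - \gamma); this is a sketch of Eq. 2, not any specific framework's implementation.

```python
def momentum_step(w, v, grad, alpha=0.01, gamma=0.9):
    """One update of Eq. 2: classical momentum (gradient weighted by 1)."""
    v = gamma * v + grad    # accumulate the gradient, no (1 - gamma) factor
    w = w - alpha * v
    return w, v
```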

We can also extend this to second-order momentum by applying EMA to the squared gradient, as shown in Eq. 3. This method is called RMSprop.

(3)   \begin{equation*}\begin{split}S_k & = \gamma S_{k-1} + (1 - \gamma) ( \nabla_{w_k})^2 \\w_{k + 1} & = w_k - \frac{\alpha}{\sqrt{S_k} + \epsilon} \nabla_{w_k}\\\end{split}\end{equation*}
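A minimal sketch of Eq. 3 follows; the default values for \gamma and \epsilon are commonly used ones and are assumptions, not prescribed by the text.

```python
import numpy as np

def rmsprop_step(w, s, grad, alpha=0.001, gamma=0.9, eps=1e-8):
    """One update of Eq. 3: EMA of the squared gradient rescales the step."""
    s = gamma * s + (1.0 - gamma) * grad**2      # second-order EMA (element-wise)
    w = w - alpha / (np.sqrt(s) + eps) * grad    # per-step rescaled learning rate
    return w, s
```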

Here S_k scales the learning rate \alpha by \frac{1}{\sqrt{S_k} + \epsilon}, where \epsilon is a very small number added to prevent division by zero. Note that \epsilon is another hyperparameter we can tune.

Even though EMA can produce a good estimate over the long term, it is biased at the beginning. For example, in the first step of the EMA update, if we choose \gamma = 0.95, v_1 is much lower than the first observed value \theta_1. This bias persists for the first several values. It might not be a problem for long-term time-series data, but in a deep learning problem a good initial update can prevent the model from getting trapped in a local minimum and is therefore essential to the success of the training. In addition, since the smoothed gradient is much smaller than the actual gradient, the updates become very slow. Therefore, we should correct the bias at the beginning. One possible solution is to modify v_k as shown in Eq. 4. For example, the corrected estimate at the first step is \Hat{v}_1 = \frac{v_1}{1 - \gamma}, which equals \theta_1 and is therefore a better estimate than (1 - \gamma) \theta_1. Notice that after a certain number of steps \gamma^k \approx 0, and the correction becomes identical to the previous result.

(4)   \begin{equation*}\begin{split}\Hat{v}_k = \frac{v_k}{1 - \gamma^k}\end{split}\end{equation*}
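The short numerical check below (a sketch using a constant observation of 10 and \gamma = 0.95, values chosen only for illustration) shows how strongly the raw EMA is biased toward zero at the start and how the correction in Eq. 4 removes that bias.

```python
gamma = 0.95
observations = [10.0, 10.0, 10.0, 10.0, 10.0]  # constant signal theta_k = 10

v = 0.0
for k, theta in enumerate(observations, start=1):
    v = gamma * v + (1.0 - gamma) * theta   # raw EMA
    v_hat = v / (1.0 - gamma**k)            # bias-corrected estimate (Eq. 4)
    print(k, round(v, 3), round(v_hat, 3))
# The raw EMA starts at (1 - gamma) * 10 = 0.5, while the corrected
# estimate already equals 10 at the very first step.
```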

Finally, we can combine these three methods, which gives us the Adam optimizer shown in Eq. 5. We first estimate the first-order updating direction \Hat{v}_k and the second-order scaling factor S_k in the first two equations. These are bias-corrected and then used in the final update equation. Notably, there are four hyperparameters involved, i.e., \gamma_1, \gamma_2, \alpha and \epsilon.

(5)   \begin{equation*}\begin{split}\Hat{v}_k & = \gamma_1 \Hat{v}_{k-1} + (1 - \gamma_1) \nabla_{w_k} \\ S_k & = \gamma_2 S_{k-1} + (1 - \gamma_2) (\nabla_{w_k})^2 \\ \Hat{v}_k^c & = \frac{\Hat{v}_k}{1 - \gamma_1^k} \\ S_k^c & = \frac{S_k}{1 - \gamma_2^k} \\ w_{k+1} & = w_k - \frac{\alpha}{\sqrt{S_k^c} + \epsilon} \Hat{v}_k^c \end{split}\end{equation*}
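Putting the pieces together, the following sketch implements Eq. 5 in plain NumPy, using the section's symbols. The default values for \gamma_1, \gamma_2 and \epsilon are the commonly used ones and are an assumption, not specified by the text.

```python
import numpy as np

def adam_step(w, v, s, grad, k, alpha=0.001, gamma1=0.9, gamma2=0.999, eps=1e-8):
    """One update of Eq. 5; k is the 1-based step index used for bias correction."""
    v = gamma1 * v + (1.0 - gamma1) * grad       # first-order EMA (direction)
    s = gamma2 * s + (1.0 - gamma2) * grad**2    # second-order EMA (scaling)
    v_c = v / (1.0 - gamma1**k)                  # bias-corrected direction
    s_c = s / (1.0 - gamma2**k)                  # bias-corrected scaling
    w = w - alpha / (np.sqrt(s_c) + eps) * v_c
    return w, v, s

# Toy usage: minimize f(w) = ||w||^2 / 2, whose gradient is w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
for k in range(1, 501):
    w, v, s = adam_step(w, v, s, grad=w, k=k, alpha=0.05)
print(w)  # close to the minimum at the origin
```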