Parameters

rms_regularizer : float, optional (default 1e-8)
    Regularize the learning rate scaling factor by this $$\epsilon$$.
momentum : float, optional (default 0)
    Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.
nesterov : bool, optional (default False)
    Set to True to enable Nesterov-style momentum updates whenever momentum is nonzero.

Notes

The ADAGRAD method uses the same general strategy as all first-order stochastic gradient methods, in the sense that these methods make small parameter adjustments iteratively using local derivative information.

The difference with ADAGRAD is that as gradients are computed during each parameter update, their squares are accumulated, and this accumulated value is used to rescale the global learning rate $$\alpha$$ separately for each parameter.

$$\begin{eqnarray*} g_{t+1} &=& g_t + \left(\frac{\partial\mathcal{L}}{\partial p}\right)^2 \\ p_{t+1} &=& p_t - \frac{\alpha}{\sqrt{g_{t+1}} + \epsilon}\,\frac{\partial\mathcal{L}}{\partial p} \end{eqnarray*}$$
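The update rule above can be sketched directly in NumPy. This is a minimal illustration, not the library's implementation; the names `adagrad_step`, `alpha`, and `eps` are hypothetical, with `eps` playing the role of the rms_regularizer parameter.

```python
import numpy as np

def adagrad_step(p, g_acc, grad, alpha=0.1, eps=1e-8):
    """One ADAGRAD update (illustrative sketch, not the library code)."""
    g_acc = g_acc + grad ** 2                        # g_{t+1} = g_t + (dL/dp)^2
    p = p - alpha / (np.sqrt(g_acc) + eps) * grad    # per-parameter rescaled step
    return p, g_acc

# Minimizing L(p) = p^2 for a single scalar parameter:
p, g_acc = np.array([5.0]), np.zeros(1)
for _ in range(100):
    grad = 2 * p                                     # dL/dp
    p, g_acc = adagrad_step(p, g_acc, grad)
```

Note that `g_acc` must persist across calls, since the squared gradients accumulate over the entire run of training.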

Like the other adaptive learning methods, this method effectively maintains a parameter-specific learning rate. Unlike RMSProp and ADADELTA, however, ADAGRAD accumulates gradient magnitudes throughout training, which not only scales the learning rate for each parameter but also effectively anneals the overall learning rate as training progresses.
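The annealing effect is easy to see numerically: with a constant gradient, the accumulator grows linearly with the step count, so the effective step size decays like $$1/\sqrt{t}$$. A small sketch (variable names are illustrative):

```python
import numpy as np

alpha, eps = 0.1, 1e-8   # learning rate and rms_regularizer
g_acc = 0.0
steps = []
for t in range(100):
    g_acc += 1.0 ** 2                        # constant gradient of 1
    steps.append(alpha / (np.sqrt(g_acc) + eps))

# steps[0] is alpha; after 100 updates the effective rate
# has shrunk by a factor of sqrt(100) = 10.
```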

In this implementation, the scale values are regularized (made less extreme) by $$\epsilon$$, which is specified using the rms_regularizer parameter.

References

 [Duch10] J. Duchi, E. Hazan, & Y. Singer. (2010) "Adaptive subgradient methods for online learning and stochastic optimization." Proc. Conference on Learning Theory (COLT).