downhill.adaptive.ADAGRAD

class downhill.adaptive.ADAGRAD(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

ADAGRAD optimizer.

Notes

The ADAGRAD method uses the same general strategy as all first-order stochastic gradient methods, in the sense that these methods make small parameter adjustments iteratively using local derivative information.

The difference with ADAGRAD is that as gradients are computed during each parameter update, their squares are accumulated, and this accumulated value is used to rescale the global learning rate \(\alpha\) separately for each parameter.

\[\begin{aligned}
g_{t+1} &= g_t + \left(\frac{\partial\mathcal{L}}{\partial p}\right)^2 \\
p_{t+1} &= p_t - \frac{\alpha}{\sqrt{g_{t+1}} + \epsilon}\,\frac{\partial\mathcal{L}}{\partial p}
\end{aligned}\]
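For illustration only, the update above can be sketched in plain NumPy. This is a minimal sketch of the equations, not the library's Theano-based implementation; the names adagrad_step, accum, and alpha are introduced here:

    import numpy as np

    def adagrad_step(param, grad, accum, alpha=0.01, eps=1e-8):
        # accum is the running sum of squared gradients (g in the equations above)
        accum = accum + grad ** 2                          # g_{t+1} = g_t + (dL/dp)^2
        param = param - alpha / (np.sqrt(accum) + eps) * grad
        return param, accum

    # toy usage: minimize f(p) = p ** 2, whose gradient is 2 * p
    p, g = np.array([3.0]), np.zeros(1)
    for _ in range(100):
        p, g = adagrad_step(p, 2 * p, g, alpha=0.5)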

Like the other adaptive learning methods, ADAGRAD effectively maintains a parameter-specific learning rate. Unlike RMSProp and ADADELTA, however, ADAGRAD accumulates gradient magnitudes throughout training, which not only rescales the learning rate for each parameter but also effectively anneals the overall learning rate as training progresses.
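The annealing effect is easy to see numerically: even when the gradient magnitude stays constant, the effective step size \(\alpha / (\sqrt{g} + \epsilon)\) shrinks as the squared gradients accumulate. A toy sketch with illustrative values only:

    import numpy as np

    alpha, eps, accum = 0.1, 1e-8, 0.0
    for t in range(1, 6):
        grad = 1.0                   # constant-magnitude gradient
        accum += grad ** 2           # g grows without bound
        print(t, alpha / (np.sqrt(accum) + eps))
    # effective step size shrinks: ~0.1, 0.0707, 0.0577, 0.05, 0.0447, ...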

In this implementation, the scale values are regularized (made less extreme) by \(\epsilon\), which is specified using the regularizer parameter.
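Because the accumulated squared gradient can be very small early in training, \(\epsilon\) also bounds how large the per-parameter scale can get, since \(\alpha / (\sqrt{g} + \epsilon) \le \alpha / \epsilon\). A brief sketch, with values chosen only for illustration:

    import numpy as np

    alpha, small_accum = 0.1, 1e-12      # accumulated squared gradient early in training
    for eps in (1e-8, 1e-4, 1e-2):
        print(eps, alpha / (np.sqrt(small_accum) + eps))
    # the scale is bounded above by alpha / eps, so a larger eps damps extreme updates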

References

[Duch10] J. Duchi, E. Hazan, & Y. Singer (2010) “Adaptive subgradient methods for online learning and stochastic optimization.” Proc. Conference on Learning Theory (COLT).
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Methods