downhill.adaptive.ADAGRAD

class downhill.adaptive.ADAGRAD(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)[source]

ADAGRAD optimizer.

Parameters:
rms_regularizer: float, optional (default 1e-8)

Regularize the learning rate scaling factor by this \(\epsilon\).

momentum: float, optional (default 0)

Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.

nesterov: bool, optional (default False)

Set this to True to enable Nesterov-style momentum updates, whenever momentum is nonzero.
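For concreteness, here is a minimal usage sketch. It assumes a Theano loss expression, that the optimizer is built via downhill.build('adagrad', ...), and that hyperparameters such as learning_rate, rms_regularizer, momentum, and nesterov can be supplied as keyword arguments while iterating, as in the downhill examples; the problem, data, and variable names below are illustrative only.

    import numpy as np
    import theano
    import theano.tensor as TT
    import downhill

    # Hypothetical least-squares problem: find w minimizing mean((x.dot(w) - y)^2).
    x = TT.matrix('x')
    y = TT.vector('y')
    w = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='w')
    loss = TT.sqr(x.dot(w) - y).mean()

    opt = downhill.build('adagrad', loss=loss, inputs=[x, y])

    # Illustrative synthetic data, cast to Theano's float type.
    X = np.random.randn(100, 10).astype(theano.config.floatX)
    Y = X.dot(np.linspace(-1, 1, 10)).astype(theano.config.floatX)

    # Hyperparameters are assumed here to be passed as keyword arguments
    # while iterating; each iteration yields training/validation monitors.
    for train_monitors, valid_monitors in opt.iterate(
            train=[X, Y],
            learning_rate=0.1,     # global rate, alpha in the update rule
            rms_regularizer=1e-8,  # epsilon in the update rule
            momentum=0.9):
        if train_monitors['loss'] < 1e-4:
            break

The same keyword arguments can also be given to minimize(), which runs the loop above to completion instead of yielding monitor values.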

Notes

The ADAGRAD method uses the same general strategy as all first-order stochastic gradient methods: it makes small, iterative parameter adjustments using local derivative information.

The difference with ADAGRAD is that as gradients are computed during each parameter update, their squares are accumulated, and this accumulated value is used to rescale the global learning rate \(\alpha\) separately for each parameter.

\[\begin{align*} g_{t+1} &= g_t + \left(\frac{\partial\mathcal{L}}{\partial p}\right)^2 \\ p_{t+1} &= p_t - \frac{\alpha}{\sqrt{g_{t+1}} + \epsilon}\, \frac{\partial\mathcal{L}}{\partial p} \end{align*}\]
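Read as pseudocode, the update accumulates squared gradients in \(g\) and uses the square root of the accumulator to scale each step. The NumPy sketch below is purely illustrative (it is not downhill's internal implementation); the function name, arguments, and default values are assumptions for the example.

    import numpy as np

    def adagrad_step(p, g_acc, grad, alpha=0.1, eps=1e-8):
        """One ADAGRAD update; names and defaults are illustrative only.

        p      -- current parameter values
        g_acc  -- running sum of squared gradients (g_t)
        grad   -- gradient of the loss with respect to p
        """
        g_acc = g_acc + grad ** 2                      # accumulate squared gradients
        p = p - alpha / (np.sqrt(g_acc) + eps) * grad  # per-parameter scaled step
        return p, g_acc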

Like the other adaptive learning methods, ADAGRAD effectively maintains a parameter-specific learning rate. Unlike RMSProp and ADADELTA, however, ADAGRAD accumulates gradient magnitudes throughout training, which not only scales the learning rate separately for each parameter but also effectively anneals the overall learning rate as training progresses.

In this implementation, the scale values are regularized (made less extreme) by \(\epsilon\), which is specified using the rms_regularizer parameter.
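As a small illustration of this annealing behavior (with assumed values for \(\alpha\), \(\epsilon\), and the gradients), the effective step size shrinks as the accumulator grows, even when the raw gradient stays constant:

    import numpy as np

    alpha, eps = 0.1, 1e-8   # assumed values for illustration
    g_acc = 0.0
    for step in range(1, 6):
        grad = 1.0           # constant gradient, purely for illustration
        g_acc += grad ** 2
        print(step, alpha / (np.sqrt(g_acc) + eps))
    # Prints approximately 0.1, 0.0707, 0.0577, 0.05, 0.0447: the effective
    # step size decays even though the raw gradient never changes.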

References

[Duch10] J. Duchi, E. Hazan, & Y. Singer (2010) “Adaptive subgradient methods for online learning and stochastic optimization.” Proc. Conference on Learning Theory (COLT).
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

x.__init__(…) initializes x; see help(type(x)) for signature

Methods

__init__(loss[, params, inputs, updates, …]) x.__init__(…) initializes x; see help(type(x)) for signature
evaluate(dataset) Evaluate the current model parameters on a dataset.
get_updates(**kwargs) Get parameter update expressions for performing optimization.
iterate([train, valid, max_updates]) Optimize a loss iteratively using a training and validation dataset.
minimize(*args, **kwargs) Optimize our loss exhaustively.
set_params([targets]) Set the values of the parameters to the given target values.