downhill.adaptive.ADADELTA

class downhill.adaptive.ADADELTA(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

ADADELTA optimizer.

Notes

The ADADELTA method uses the same general strategy as all first-order stochastic gradient methods, in the sense that these methods make small parameter adjustments iteratively using local derivative information.

The difference with ADADELTA is that, as gradients are computed during each parameter update, an exponentially weighted moving average (EWMA) of recent squared gradient values is maintained, along with an EWMA of recent squared parameter steps. The actual gradient is then normalized by the ratio of the root-mean-square (RMS) parameter step size to the RMS gradient magnitude.

\[\begin{eqnarray*}
g_{t+1} &=& \gamma g_t + (1 - \gamma) \left( \frac{\partial\mathcal{L}}{\partial p} \right)^2 \\
v_{t+1} &=& \frac{\sqrt{x_t + \epsilon}}{\sqrt{g_{t+1} + \epsilon}} \, \frac{\partial\mathcal{L}}{\partial p} \\
x_{t+1} &=& \gamma x_t + (1 - \gamma) v_{t+1}^2 \\
p_{t+1} &=& p_t - v_{t+1}
\end{eqnarray*}\]
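For concreteness, here is a minimal NumPy sketch of one such update step. This is not downhill's Theano implementation; the function name and the default values of gamma and eps are illustrative only.

import numpy as np

def adadelta_step(p, grad, g, x, gamma=0.95, eps=1e-4):
    """One ADADELTA update following the equations above (illustrative sketch).

    p    -- parameter array (p_t)
    grad -- gradient of the loss with respect to p
    g    -- EWMA of squared gradients (g_t)
    x    -- EWMA of squared parameter steps (x_t)
    """
    g = gamma * g + (1 - gamma) * grad ** 2           # g_{t+1}
    v = np.sqrt(x + eps) / np.sqrt(g + eps) * grad    # v_{t+1}
    x = gamma * x + (1 - gamma) * v ** 2              # x_{t+1}
    p = p - v                                         # p_{t+1}
    return p, g, x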

Like RProp and the RMSProp-ESGD family of methods, this learning method effectively maintains a sort of parameter-specific momentum value. The primary difference between this method and RMSProp is that ADADELTA additionally incorporates a sliding window of RMS parameter step sizes, (somewhat) obviating the need for a learning rate parameter.

In this implementation, the RMS values are regularized (made less extreme) by \(\epsilon\), which is specified using the rms_regularizer parameter.

The weight parameter \(\gamma\) for the EWMA window is computed from the rms_halflife keyword argument \(h\), via \(\gamma = e^{\frac{-\ln 2}{h}} = 2^{-1/h}\), so that a term's contribution to the moving average is halved after every \(h\) parameter updates.
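As a quick check of this conversion (the halflife value here is just an example):

import numpy as np

halflife = 7
gamma = np.exp(-np.log(2) / halflife)   # equivalently 2 ** (-1 / halflife)
print(round(gamma, 4))                  # 0.9057: a term's weight is halved after 7 updates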

References

[Zeil12] M. Zeiler. (2012) “ADADELTA: An adaptive learning rate method.” http://arxiv.org/abs/1212.5701
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)
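A rough usage sketch follows, assuming downhill's build() and Optimizer.iterate() interface; the model, synthetic data, and hyperparameter values are made up for illustration, not recommendations.

import numpy as np
import theano
import theano.tensor as TT
import downhill

# Symbolic least-squares loss: recover a weight vector from noisy observations.
x = TT.matrix('x')
y = TT.vector('y')
w = theano.shared(np.zeros(10), name='w')
loss = TT.sqr(TT.dot(x, w) - y).mean()

opt = downhill.build('adadelta', loss=loss, params=[w], inputs=[x, y])

# Synthetic training data matching the symbolic inputs above.
X = np.random.randn(1000, 10)
Y = X.dot(np.arange(10.)) + 0.1 * np.random.randn(1000)

# rms_halflife and rms_regularizer are the hyperparameters described in the Notes;
# here they are passed as keyword arguments when iterating over training updates.
for train_monitors, _ in opt.iterate([X, Y], rms_halflife=7, rms_regularizer=1e-4):
    if train_monitors['loss'] < 1e-2:
        break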

Methods