class downhill.adaptive.ADADELTA(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

ADADELTA optimizer.


rms_halflife: float, optional (default 14)

Compute RMS gradient values using an exponentially weighted moving average that decays with this halflife.

rms_regularizer: float, optional (default 1e-8)

Regularize RMS gradient values by this \(\epsilon\).

momentum: float, optional (default 0)

Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.

nesterov: bool, optional (default False)

Set this to True to use Nesterov-style momentum updates. This has an effect only when momentum is nonzero.


The ADADELTA method uses the same general strategy as all first-order stochastic gradient methods, in the sense that these methods make small parameter adjustments iteratively using local derivative information.

The difference with ADADELTA is that, as gradients are computed during each parameter update, the method maintains an exponentially weighted moving average (EWMA) of squared gradient values, as well as an EWMA of recent squared parameter steps. The actual gradient is then scaled by the ratio of the root-mean-square (RMS) parameter step size to the RMS gradient magnitude.

\[\begin{split}\begin{eqnarray*} g_{t+1} &=& \gamma g_t + (1 - \gamma) \left( \frac{\partial\mathcal{L}}{\partial p}\right)^2 \\ v_{t+1} &=& \frac{\sqrt{x_t + \epsilon}}{\sqrt{g_{t+1} + \epsilon}} \frac{\partial\mathcal{L}}{\partial p} \\ x_{t+1} &=& \gamma x_t + (1 - \gamma) v_{t+1}^2 \\ p_{t+1} &=& p_t - v_{t+1} \end{eqnarray*}\end{split}\]
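As a rough illustration of the update equations above, the following plain-Python sketch applies them to a single scalar parameter. It is not the downhill implementation; the names adadelta_step and run, the zero-initialized accumulators, and the quadratic toy loss are all assumptions made for this example.

```python
import math


def adadelta_step(p, grad, g, x, gamma, eps=1e-8):
    """Apply one ADADELTA update to a scalar parameter p.

    g is the EWMA of squared gradients and x is the EWMA of squared steps.
    Returns the updated (p, g, x).
    """
    g_next = gamma * g + (1 - gamma) * grad ** 2
    v = math.sqrt(x + eps) / math.sqrt(g_next + eps) * grad
    x_next = gamma * x + (1 - gamma) * v ** 2
    return p - v, g_next, x_next


def run(p0=3.0, steps=100, halflife=14.0):
    """Take a few ADADELTA steps on the hypothetical loss L(p) = p**2."""
    gamma = math.exp(-math.log(2) / halflife)
    p, g, x = p0, 0.0, 0.0  # assumed: both accumulators start at zero
    for _ in range(steps):
        grad = 2.0 * p  # dL/dp for L(p) = p**2
        p, g, x = adadelta_step(p, grad, g, x, gamma)
    return p


if __name__ == "__main__":
    # Slightly below 3.0: early steps are tiny because x starts at zero.
    print(run())
```

With a small \(\epsilon\), progress on this toy problem is slow at first, since the step-size average \(x\) has to build up from zero before the steps become large.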

Like RProp and the RMSProp-ESGD family, this learning method effectively maintains a sort of parameter-specific momentum value. The primary difference between this method and RMSProp is that ADADELTA additionally incorporates a sliding window of RMS parameter step sizes, (somewhat) obviating the need for a learning rate parameter.

In this implementation, the RMS values are regularized (made less extreme) by \(\epsilon\), which is specified using the rms_regularizer parameter.
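One way to see why \(\epsilon\) matters is to look at the very first update. The sketch below assumes that both moving averages start at zero (an assumption for this example, not something stated above) and evaluates the first step \(v_1\) for a given gradient; first_step is a hypothetical helper name.

```python
import math


def first_step(grad, eps, gamma=0.95):
    """Size of the first ADADELTA step, assuming g_0 = x_0 = 0."""
    g1 = (1 - gamma) * grad ** 2
    # With x_0 = 0, the numerator reduces to sqrt(eps).
    return math.sqrt(0.0 + eps) / math.sqrt(g1 + eps) * grad


if __name__ == "__main__":
    print(first_step(1.0, eps=0.0))   # 0.0: without eps the parameter never moves
    print(first_step(1.0, eps=1e-8))  # small but nonzero
```

Under these assumptions, \(\epsilon\) is what lets the optimizer take its first step at all, and it also keeps the denominator away from zero when gradients vanish.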

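The weight parameter \(\gamma\) for the EWMA window is computed from the rms_halflife keyword argument \(h\) as \(\gamma = e^{\frac{-\ln 2}{h}}\), so that \(\gamma^h = \frac{1}{2}\): the contribution of any past value to the average halves every \(h\) updates. A longer halflife gives a \(\gamma\) closer to 1, and therefore a smaller weight \(1 - \gamma\) on each new value.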
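The relation between the halflife and the EWMA weight can be checked numerically. This is a minimal sketch using only the formula above; ewma_weight is a hypothetical helper name, not part of the downhill API.

```python
import math


def ewma_weight(halflife):
    """EWMA decay weight gamma for a given halflife h: exp(-ln 2 / h)."""
    return math.exp(-math.log(2) / halflife)


if __name__ == "__main__":
    gamma = ewma_weight(14)  # the default rms_halflife
    print(gamma)             # roughly 0.952
    print(gamma ** 14)       # 0.5 up to floating-point rounding
```

After \(h\) updates, an old value's weight in the average has decayed by a factor of \(\gamma^h = 0.5\), which is what makes \(h\) a halflife.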


[Zeil12] M. Zeiler. (2012) “ADADELTA: An adaptive learning rate method.”


__init__(loss[, params, inputs, updates, ...])
evaluate(dataset) Evaluate the current model parameters on a dataset.
get_updates(**kwargs) Get parameter update expressions for performing optimization.
iterate([train, valid, max_updates]) Optimize a loss iteratively using a training and validation dataset.
minimize(*args, **kwargs) Optimize our loss exhaustively.
set_params([targets]) Set the values of the parameters to the given target values.