downhill.adaptive.ADADELTA
class downhill.adaptive.ADADELTA(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

ADADELTA optimizer.
Parameters:

rms_halflife: float, optional (default 14)
    Compute RMS gradient values using an exponentially weighted moving average that decays with this halflife.
rms_regularizer: float, optional (default 1e-8)
    Regularize RMS gradient values by this \(\epsilon\).
momentum: float, optional (default 0)
    Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.
nesterov: bool, optional (default False)
    Set this to True to enable Nesterov-style momentum updates whenever momentum is nonzero.
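For concreteness, here is a minimal sketch of building and running this optimizer against a toy least-squares loss. The model variables and data are hypothetical, and passing the hyperparameters above as keyword arguments to minimize() is an assumption about how downhill forwards them:

    import numpy as np
    import theano
    import theano.tensor as TT

    import downhill

    # Hypothetical least-squares model: fit y ~ x . w.
    x = TT.matrix('x')
    y = TT.vector('y')
    w = theano.shared(np.zeros(10, 'f'), name='w')
    loss = TT.sqr(TT.dot(x, w) - y).mean()

    # Construct the optimizer with the signature documented above.
    opt = downhill.adaptive.ADADELTA(loss=loss, params=[w], inputs=[x, y])

    # Toy dataset: 100 examples of 10 features each.
    train = [np.random.randn(100, 10).astype('f'),
             np.random.randn(100).astype('f')]

    # Assumption: algorithm hyperparameters are forwarded as keyword
    # arguments when optimization starts.
    opt.minimize(train, rms_halflife=14, rms_regularizer=1e-8)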
Notes

The ADADELTA method uses the same general strategy as all first-order stochastic gradient methods: it makes small, iterative parameter adjustments using local derivative information.
The difference with ADADELTA is that, as gradients are computed during each parameter update, an exponentially weighted moving average (EWMA) of squared gradient values is maintained, along with an EWMA of recent squared parameter steps. Each gradient is then scaled by the ratio of the root-mean-square (RMS) parameter step size to the RMS gradient magnitude.
\[\begin{split}\begin{eqnarray*}
g_{t+1} &=& \gamma g_t + (1 - \gamma) \left( \frac{\partial\mathcal{L}}{\partial p}\right)^2 \\
v_{t+1} &=& \frac{\sqrt{x_t + \epsilon}}{\sqrt{g_{t+1} + \epsilon}} \frac{\partial\mathcal{L}}{\partial p} \\
x_{t+1} &=& \gamma x_t + (1 - \gamma) v_{t+1}^2 \\
p_{t+1} &=& p_t - v_{t+1}
\end{eqnarray*}\end{split}\]
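These four updates translate directly into NumPy; the following sketch is for intuition only (the library itself builds the equivalent Theano update expressions), and the default gamma here is illustrative:

    import numpy as np

    def adadelta_step(p, grad, g, x, gamma=0.95, eps=1e-8):
        """One ADADELTA update, following the equations above.

        p: parameters, grad: dL/dp at p,
        g: EWMA of squared gradients, x: EWMA of squared steps.
        """
        g = gamma * g + (1 - gamma) * grad ** 2           # g_{t+1}
        v = np.sqrt(x + eps) / np.sqrt(g + eps) * grad    # v_{t+1}
        x = gamma * x + (1 - gamma) * v ** 2              # x_{t+1}
        p = p - v                                         # p_{t+1}
        return p, g, x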
Like RProp and the RMSProp-ESGD family, this learning method effectively maintains a sort of parameter-specific momentum value. The primary difference between this method and RMSProp is that ADADELTA additionally incorporates a sliding window of RMS parameter step sizes, (somewhat) obviating the need for a learning rate parameter.

In this implementation, the RMS values are regularized (made less extreme) by \(\epsilon\), which is specified using the rms_regularizer parameter.

The weight parameter \(\gamma\) for the EWMA window is computed from the rms_halflife keyword argument, such that the actual EWMA weight varies inversely with the halflife \(h\): \(\gamma = e^{\frac{-\ln 2}{h}}\).
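As a quick check of this formula, the decay weight for the default halflife can be computed directly (a throwaway helper, not part of the library API):

    import math

    def ewma_weight(halflife):
        # gamma = exp(-ln 2 / h): the EWMA weight decays by half over h steps.
        return math.exp(-math.log(2) / halflife)

    print(ewma_weight(14))  # ~0.9517 for the default rms_halflife of 14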
References

[Zeil12] M. Zeiler. (2012) "ADADELTA: An adaptive learning rate method." http://arxiv.org/abs/1212.5701
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)
Methods

__init__(loss[, params, inputs, updates, ...])

evaluate(dataset)
    Evaluate the current model parameters on a dataset.

get_updates(**kwargs)
    Get parameter update expressions for performing optimization.

iterate([train, valid, max_updates])
    Optimize a loss iteratively using a training and validation dataset.

minimize(*args, **kwargs)
    Optimize our loss exhaustively.

set_params([targets])
    Set the values of the parameters to the given target values.
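To illustrate the iteration API, a hedged sketch of a monitoring loop; it assumes iterate() yields pairs of train/validation monitor dictionaries keyed by 'loss', and reuses the hypothetical opt and train from the constructor example above:

    # valid: a hypothetical held-out dataset in the same format as train.
    valid = [np.random.randn(20, 10).astype('f'),
             np.random.randn(20).astype('f')]

    for tm, vm in opt.iterate(train, valid, max_updates=1000):
        print('train loss:', tm['loss'], 'valid loss:', vm['loss'])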