class downhill.adaptive.RMSProp(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

RMSProp optimizer.


learning_rate: float, optional (default 1e-4)

Step size to take during optimization.

rms_halflife: float, optional (default 14)

Compute RMS gradient values using an exponentially weighted moving average that decays with this halflife.

rms_regularizer: float, optional (default 1e-8)

Regularize RMS gradient values by this \(\epsilon\).

momentum: float, optional (default 0)

Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.

nesterov: bool, optional (default False)

Set this to True to enable Nesterov-style momentum updates. This has an effect only when momentum is nonzero.


The RMSProp method uses the same general strategy as all first-order stochastic gradient methods, in the sense that these methods make small parameter adjustments iteratively using local derivative information.

The difference here is that, as gradients are computed during each parameter update, an exponentially weighted moving average (EWMA) of gradient magnitudes is maintained as well. At each update, the EWMA is used to compute the root-mean-square (RMS) gradient value seen in the recent past. The actual gradient is divided by this RMS scaling factor before being applied to update the parameters. Intuitively, the normalized gradient has magnitude near 1 whenever the gradient is of roughly constant magnitude, so RMSProp takes steps whose size is close to the learning rate. Steps become larger whenever the local scale of the gradient starts to increase, because the RMS estimate lags behind the most recent gradients.

\[\begin{split}\begin{eqnarray*} f_{t+1} &=& \gamma f_t + (1 - \gamma) \frac{\partial\mathcal{L}}{\partial p} \\ g_{t+1} &=& \gamma g_t + (1 - \gamma) \left( \frac{\partial\mathcal{L}}{\partial p}\right)^2 \\ p_{t+1} &=& p_t - \frac{\alpha}{\sqrt{g_{t+1} - f_{t+1}^2 + \epsilon}} \frac{\partial\mathcal{L}}{\partial p} \end{eqnarray*}\end{split}\]
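The update equations above can be sketched in plain Python for a single scalar parameter. This is a minimal illustration, not downhill's own implementation: the helper names `rmsprop_step` and `minimize_quadratic`, the quadratic test loss, and the zero initialization of \(f\) and \(g\) are assumptions made for the example. The default arguments mirror the documented defaults for learning_rate, rms_halflife and rms_regularizer.

```python
import math


def rmsprop_step(p, grad, f, g, learning_rate=1e-4, rms_halflife=14,
                 rms_regularizer=1e-8):
    """Apply one update of the equations above to a scalar parameter.

    Hypothetical helper for illustration only. Returns the new (p, f, g).
    """
    # EWMA weight derived from the halflife: gamma = exp(-ln 2 / h).
    gamma = math.exp(-math.log(2) / rms_halflife)
    # Moving averages of the gradient and of the squared gradient.
    f = gamma * f + (1 - gamma) * grad
    g = gamma * g + (1 - gamma) * grad ** 2
    # Scale the gradient by the regularized RMS term, then step.
    scale = math.sqrt(g - f ** 2 + rms_regularizer)
    p = p - learning_rate / scale * grad
    return p, f, g


def minimize_quadratic(steps=200, target=3.0):
    """Run the sketch on the assumed toy loss 0.5 * (p - target) ** 2."""
    p, f, g = 0.0, 0.0, 0.0  # assumed initial state
    errors = [abs(p - target)]
    for _ in range(steps):
        grad = p - target  # derivative of the toy loss with respect to p
        p, f, g = rmsprop_step(p, grad, f, g)
        errors.append(abs(p - target))
    return p, errors


if __name__ == "__main__":
    final_p, errors = minimize_quadratic()
    print("initial error:", errors[0], "final error:", errors[-1])
```

With these defaults the multiplier \(\alpha / \sqrt{g - f^2 + \epsilon}\) is bounded by roughly \(\alpha / \sqrt{\epsilon} = 1\), so on this particular toy loss the distance to the minimum should not grow from one step to the next. Larger learning rates can behave quite differently.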

Like RProp, this learning method effectively maintains a sort of parameter-specific momentum value, but this method takes into account both the sign and the magnitude of the gradient for each parameter.

In this algorithm, RMS values are regularized by \(\epsilon\), which is added under the square root so that the scaling factor stays bounded away from zero. The value of \(\epsilon\) is specified using the rms_regularizer keyword argument.

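The weight parameter \(\gamma\) for the EWMA window is computed from the rms_halflife keyword argument \(h\) as \(\gamma = e^{\frac{-\ln 2}{h}}\). With this choice \(\gamma^h = \frac{1}{2}\), so the contribution of a past gradient to the average halves every \(h\) updates. A longer halflife gives a \(\gamma\) closer to 1 and therefore a slower-moving average.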
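The halflife relationship can be checked numerically. The snippet below is a small sketch using only the standard library; the function name `halflife_to_gamma` is a hypothetical helper, not part of downhill.

```python
import math


def halflife_to_gamma(halflife):
    """Return the EWMA weight gamma = exp(-ln 2 / h) for a halflife h."""
    return math.exp(-math.log(2) / halflife)


if __name__ == "__main__":
    gamma = halflife_to_gamma(14)  # documented default for rms_halflife
    # After h = 14 updates, an old gradient's weight has decayed by gamma ** 14.
    print(gamma, gamma ** 14)
```

For the default halflife of 14, \(\gamma\) comes out at roughly 0.95, and raising it to the 14th power recovers one half up to floating-point error.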

The implementation here is taken from [Grav13], equations (38)–(45). Graves’ implementation in particular seems to have introduced the \(f_t\) terms into the RMS computation; these terms appear to act as a sort of momentum for the RMS values.
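One way to read the \(f_t\) terms: \(g_t - f_t^2\) has the form of a second moment minus a squared first moment, that is, a running estimate of the gradient's variance rather than of its raw mean square. The sketch below compares the two quantities on made-up gradient sequences. The helper name `moving_moments`, the zero initialization, and the sequences themselves are assumptions for illustration and are not taken from [Grav13] or from downhill.

```python
import math


def moving_moments(grads, halflife=14):
    """Return the EWMAs (f, g) of the gradients and the squared gradients."""
    gamma = math.exp(-math.log(2) / halflife)
    f = g = 0.0  # assumed initial state
    for grad in grads:
        f = gamma * f + (1 - gamma) * grad
        g = gamma * g + (1 - gamma) * grad ** 2
    return f, g


def centered_and_uncentered(grads, halflife=14):
    """Return (g - f**2, g) for a gradient sequence."""
    f, g = moving_moments(grads, halflife)
    return g - f ** 2, g


if __name__ == "__main__":
    # Constant gradient: the centered term shrinks, the uncentered one does not.
    print(centered_and_uncentered([2.0] * 200))
    # Sign-alternating gradient: both terms stay close to the squared magnitude.
    print(centered_and_uncentered([2.0, -2.0] * 100))
```

In this sketch a steady gradient drives the centered term toward zero, so the regularizer \(\epsilon\) ends up dominating the denominator, while an oscillating gradient keeps the denominator near the gradient's magnitude.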


[Grav13] A. Graves. (2013) “Generating Sequences With Recurrent Neural Networks.”
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)


__init__(loss[, params, inputs, updates, ...])
evaluate(dataset) Evaluate the current model parameters on a dataset.
get_updates(**kwargs) Get parameter update expressions for performing optimization.
iterate([train, valid, max_updates]) Optimize a loss iteratively using a training and validation dataset.
minimize(*args, **kwargs) Optimize our loss exhaustively.
set_params([targets]) Set the values of the parameters to the given target values.