downhill.adaptive.RMSProp

class downhill.adaptive.RMSProp(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

RMSProp optimizer.
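
A brief usage sketch, assuming downhill's top-level build() helper and a Theano-style symbolic loss as shown in the project README; the toy problem, the data, and the idea of passing learning_rate, rms_halflife, and rms_regularizer at optimization time are illustrative assumptions, not a definitive recipe.

    import numpy as np
    import theano
    import theano.tensor as TT
    import downhill

    floatX = theano.config.floatX

    # A toy least-squares loss over a shared weight vector.
    w = theano.shared(np.zeros(10, dtype=floatX), name='w')
    x = TT.matrix('x')
    y = TT.vector('y')
    loss = TT.sqr(TT.dot(x, w) - y).mean()

    # 'rmsprop' selects this optimizer class.
    opt = downhill.build('rmsprop', loss=loss, inputs=[x, y])

    # Illustrative training data for the two symbolic inputs.
    data_x = np.random.randn(100, 10).astype(floatX)
    data_y = np.random.randn(100).astype(floatX)

    # Hyperparameters are assumed to be supplied when optimization starts.
    opt.minimize([data_x, data_y],
                 learning_rate=1e-3,
                 rms_halflife=14,
                 rms_regularizer=1e-8)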

Notes

The RMSProp method uses the same general strategy as all first-order stochastic gradient methods, in the sense that these methods make small parameter adjustments iteratively using local derivative information.

The difference here is that as gradients are computed during each parameter update, an exponentially-weighted moving average (EWMA) of gradient magnitudes is maintained as well. At each update, the EWMA is used to compute the root-mean-square (RMS) gradient value seen in the recent past. The actual gradient is normalized by this RMS scaling factor before being applied to update the parameters. Intuitively, when the gradient keeps a roughly constant magnitude, the normalized gradient stays near 1 and RMSProp takes steps of about the size of the learning rate \(\alpha\); the steps grow larger whenever the local scale of the gradient starts to increase beyond its recent average.

\[\begin{split}\begin{eqnarray*} f_{t+1} &=& \gamma f_t + (1 - \gamma) \frac{\partial\mathcal{L}}{\partial p} \\ g_{t+1} &=& \gamma g_t + (1 - \gamma) \left( \frac{\partial\mathcal{L}}{\partial p}\right)^2 \\ p_{t+1} &=& p_t - \frac{\alpha}{\sqrt{g_{t+1} - f_{t+1}^2 + \epsilon}} \frac{\partial\mathcal{L}}{\partial p} \end{eqnarray*}\end{split}\]
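
For concreteness, here is a minimal NumPy sketch of the update rule above; the function name, default values, and the toy loss at the end are illustrative only and are not part of downhill's API.

    import numpy as np

    def rmsprop_step(p, grad, f, g, alpha=1e-4, gamma=0.95, epsilon=1e-8):
        # EWMAs of the gradient and of the squared gradient.
        f = gamma * f + (1 - gamma) * grad
        g = gamma * g + (1 - gamma) * grad ** 2
        # Normalize the gradient by the (regularized) RMS estimate.
        p = p - alpha * grad / np.sqrt(g - f ** 2 + epsilon)
        return p, f, g

    # Toy usage: minimize 0.5 * ||p||^2, whose gradient is simply p.
    p = np.array([5.0, -3.0])
    f = np.zeros_like(p)
    g = np.zeros_like(p)
    for _ in range(1000):
        p, f, g = rmsprop_step(p, p, f, g, alpha=0.01)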

Like RProp, this learning method effectively maintains a sort of parameter-specific step size; unlike RProp, however, it takes into account both the sign and the magnitude of the gradient for each parameter.

In this algorithm, RMS values are regularized (made less extreme) by \(\epsilon\), which is specified using the rms_regularizer keyword argument.

The weight parameter \(\gamma\) for the EWMA window is computed from the rms_halflife keyword argument \(h\), such that a gradient's contribution to the moving averages is halved after every \(h\) updates: \(\gamma = e^{\frac{-\ln 2}{h}}\), or equivalently \(\gamma^h = \frac{1}{2}\).
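
As a quick check of this relationship (the halflife value here is purely illustrative):

    import numpy as np

    rms_halflife = 14  # illustrative value of the rms_halflife keyword argument
    gamma = np.exp(-np.log(2) / rms_halflife)

    # A term's weight in the EWMA is multiplied by gamma at every update,
    # so it falls to half its value after rms_halflife updates.
    assert abs(gamma ** rms_halflife - 0.5) < 1e-12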

The implementation here is taken from [Grav13], equations (38)–(45). Graves’ implementation in particular seems to have introduced the \(f_t\) terms into the RMS computation; these terms appear to act as a sort of momentum for the RMS values.

References

[Grav13] A. Graves. (2013) “Generating Sequences With Recurrent Neural Networks.” http://arxiv.org/abs/1308.0850

__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Methods