downhill.adaptive.RProp

class downhill.adaptive.RProp(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Resilient backpropagation optimizer.

Notes

The RProp method takes small steps in parameter space using local gradient information. Unlike “vanilla” first-order techniques such as SGD, however, RProp uses only the signs of the gradients when making parameter updates; the step size for each parameter is independent of the magnitude of that parameter's gradient.

To accomplish this, RProp maintains a separate learning rate (step size) for every parameter in the model and adjusts it based on how consistently the sign of that parameter's gradient behaves over time. Whenever two consecutive gradients for a parameter have the same sign, the learning rate for that parameter increases; whenever the signs disagree, it decreases. This has an effect similar to momentum in stochastic gradient methods, while maintaining a separate, adaptive learning rate for each parameter.

\[\begin{split}\begin{eqnarray*}
&& \mbox{if } \left(\frac{\partial\mathcal{L}}{\partial p}\right)_{t-1} \left(\frac{\partial\mathcal{L}}{\partial p}\right)_t > 0 \\
&& \qquad \Delta_t = \min\left(\eta_+ \Delta_{t-1}, \Delta_+\right) \\
&& \mbox{if } \left(\frac{\partial\mathcal{L}}{\partial p}\right)_{t-1} \left(\frac{\partial\mathcal{L}}{\partial p}\right)_t < 0 \\
&& \qquad \Delta_t = \max\left(\eta_- \Delta_{t-1}, \Delta_-\right) \\
&& \qquad \left(\frac{\partial\mathcal{L}}{\partial p}\right)_t = 0 \\
&& p_{t+1} = p_t - \operatorname{sgn}\left(\left(\frac{\partial\mathcal{L}}{\partial p}\right)_t\right) \Delta_t
\end{eqnarray*}\end{split}\]

Here, \(\operatorname{sgn}(\cdot)\) is the sign function (returning -1 if its argument is negative, 1 if it is positive, and 0 at zero), \(\eta_-\) and \(\eta_+\) are the factors by which the step size is decreased (increased) when consecutive gradients disagree (agree) in sign, and \(\Delta_+\) and \(\Delta_-\) are the maximum and minimum allowed step sizes.

The implementation here is actually the “iRprop-” variant of RProp described in Algorithm 4 from [Igel00]. This variant resets the running gradient estimates to zero in cases where the previous and current gradients have switched signs.
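To make the update rule concrete, here is a minimal NumPy sketch of one iRprop- step for a single parameter array. The function name and the hyperparameter values (the increase/decrease factors and the step bounds) are illustrative choices, not downhill's defaults::

    import numpy as np

    def irprop_minus_step(p, grad, prev_grad, step,
                          eta_plus=1.2, eta_minus=0.5,
                          max_step=50.0, min_step=1e-6):
        """Apply one iRprop- update to parameter array ``p`` (illustrative)."""
        agree = prev_grad * grad
        # Signs agree: grow this parameter's step size, capped at max_step.
        step = np.where(agree > 0, np.minimum(step * eta_plus, max_step), step)
        # Signs disagree: shrink the step size (floored at min_step) and zero
        # the gradient, so no update is made and the next sign comparison
        # starts fresh -- the zeroing is the iRprop- modification.
        step = np.where(agree < 0, np.maximum(step * eta_minus, min_step), step)
        grad = np.where(agree < 0, 0.0, grad)
        # Move against the *sign* of the gradient; the magnitude is ignored.
        p = p - np.sign(grad) * step
        return p, grad, step  # pass ``grad`` back in as ``prev_grad`` next time

In practice the per-parameter step sizes are typically initialized to a small constant (e.g. 0.1) and the stored gradient to zero.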

References

[Ried92] M. Riedmiller & H. Braun. (1992) “Rprop - A Fast Adaptive Learning Algorithm.” In Proceedings of the International Symposium on Computer and Information Science VII.
[Igel00] C. Igel & M. Hüsken. (2000) “Improving the Rprop Learning Algorithm.” In Proceedings of the Second International Symposium on Neural Computation. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.1332
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)
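A usage sketch under stated assumptions: the constructor call follows the signature documented above, but the Theano loss construction and the ``iterate`` training loop (including the ``'loss'`` monitor key) are assumptions based on downhill's ``Optimizer`` base class, not details specified on this page::

    import numpy as np
    import theano
    import theano.tensor as TT

    import downhill

    floatX = theano.config.floatX

    # A toy least-squares loss: minimize mean((x . w + b - y) ** 2) over w, b.
    x = TT.matrix('x')
    y = TT.vector('y')
    w = theano.shared(np.zeros(5, dtype=floatX), name='w')
    b = theano.shared(np.asarray(0.0, dtype=floatX), name='b')
    loss = TT.sqr(TT.dot(x, w) + b - y).mean()

    opt = downhill.adaptive.RProp(loss, params=[w, b], inputs=[x, y])

    # Synthetic training data matching the symbolic inputs [x, y].
    x_data = np.random.randn(100, 5).astype(floatX)
    y_data = (x_data.dot(np.arange(1, 6)) + 0.5).astype(floatX)

    # Assumption: RProp inherits the ``iterate`` loop (and a 'loss' entry in
    # the monitor dictionaries it yields) from the Optimizer base class.
    for train_monitors, valid_monitors in opt.iterate([x_data, y_data]):
        if train_monitors['loss'] < 1e-4:
            break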

Methods