downhill.adaptive.RProp

class downhill.adaptive.RProp(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Resilient backpropagation optimizer.

Parameters:

rprop_increase: float, optional (default 1.01)

Increase step sizes at this rate when the gradient sign stays the same.

rprop_decrease: float, optional (default 0.99)

Decrease step sizes at this rate when the gradient sign changes.

rprop_min_step: float, optional (default 0)

Minimum step size for any parameter.

rprop_max_step: float, optional (default 100)

Maximum step size for any parameter.

momentum: float, optional (default 0)

Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.

nesterov: bool, optional (default False)

Set this to True to enable Nesterov-style momentum updates whenever momentum is nonzero.
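
As with other downhill optimizers, these hyperparameters are typically supplied as keyword arguments when the optimizer is run. The sketch below is a minimal, hypothetical example: the least-squares loss, the random data, and the particular hyperparameter values are invented purely for illustration.

   import downhill
   import numpy as np
   import theano
   import theano.tensor as TT

   # A made-up least-squares problem: fit a weight vector to random data.
   x = TT.matrix('x')
   y = TT.vector('y')
   w = theano.shared(np.zeros(10), name='w')
   loss = TT.sqr(TT.dot(x, w) - y).mean()

   # RProp hyperparameters are passed through as keyword arguments.
   downhill.minimize(
       loss,
       train=[np.random.randn(100, 10), np.random.randn(100)],
       inputs=[x, y],
       algo='rprop',
       rprop_increase=1.01,
       rprop_decrease=0.99,
       rprop_max_step=1.0,
   )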

Notes

The RProp method takes small steps in parameter space using local gradient information. Unlike “vanilla” first-order techniques such as SGD, however, RProp uses only the signs of the gradients when making parameter updates: the step size for each parameter is independent of the magnitude of the gradient for that parameter.

To accomplish this, RProp maintains a separate step size for every parameter in the model and adjusts it based on the consistency of the sign of the gradient over time. Whenever two consecutive gradients for a parameter have the same sign, the step size for that parameter increases; whenever the signs disagree, the step size decreases. This has an effect similar to momentum-based stochastic gradient methods, while also maintaining a separate learning rate for each parameter.

\[\begin{split}\begin{eqnarray*} && \mbox{if } \left(\frac{\partial\mathcal{L}}{\partial p}\right)_{t-1} \left(\frac{\partial\mathcal{L}}{\partial p}\right)_t > 0 \\ && \qquad \Delta_t = \min \left(\eta_+\Delta_{t-1}, \Delta_+\right) \\ && \mbox{if } \left(\frac{\partial\mathcal{L}}{\partial p}\right)_{t-1} \left(\frac{\partial\mathcal{L}}{\partial p}\right)_t < 0 \\ && \qquad \Delta_t = \max \left(\eta_-\Delta_{t-1}, \Delta_-\right) \\ && \qquad \left(\frac{\partial\mathcal{L}}{\partial p}\right)_t = 0 \\ && p_{t+1} = p_t - \mbox{sgn}\left( \left(\frac{\partial\mathcal{L}}{\partial p}\right)_t \right) \Delta_t \end{eqnarray*}\end{split}\]

Here, \(\mbox{sgn}(\cdot)\) is the sign function (i.e., it returns -1 if its argument is negative and 1 otherwise), \(\eta_+\) and \(\eta_-\) are the rates at which the step size increases (decreases) when consecutive gradients agree (disagree) in sign, and \(\Delta_+\) and \(\Delta_-\) are the maximum and minimum step sizes, respectively.

The implementation here is actually the “iRprop-” variant of RProp described in Algorithm 4 from [Igel00]. This variant resets the running gradient estimates to zero in cases where the previous and current gradients have switched signs.
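
For concreteness, the following NumPy sketch shows one iRprop- update for a single parameter array. It is an illustrative re-implementation, not this class's actual code (which builds the equivalent updates as Theano expressions), and every name in it is hypothetical.

   import numpy as np

   def irprop_minus_step(p, grad, prev_grad, step,
                         eta_plus=1.01, eta_minus=0.99,
                         max_step=100.0, min_step=0.0):
       # Where consecutive gradients agree in sign, grow the step size
       # (capped at max_step); where they disagree, shrink it (floored
       # at min_step).
       same = prev_grad * grad > 0
       flip = prev_grad * grad < 0
       step = np.where(same, np.minimum(step * eta_plus, max_step), step)
       step = np.where(flip, np.maximum(step * eta_minus, min_step), step)

       # iRprop-: zero the gradient wherever the sign flipped, so that
       # parameter is left unchanged this iteration and the sign flip is
       # not seen again at the next one.
       grad = np.where(flip, 0.0, grad)

       # Move each parameter by its own step size, opposite the gradient sign.
       p = p - np.sign(grad) * step
       return p, grad, step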

References

[Ried92] M. Riedmiller & H. Braun. (1992) “Rprop - A Fast Adaptive Learning Algorithm.” In Proceedings of the International Symposium on Computer and Information Science VII.
[Igel00] C. Igel & M. Hüsken. (2000) “Improving the Rprop Learning Algorithm.” In Proceedings of the Second International Symposium on Neural Computation. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.1332
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Methods

__init__(loss[, params, inputs, updates, ...])
evaluate(dataset) Evaluate the current model parameters on a dataset.
get_updates(**kwargs) Get parameter update expressions for performing optimization.
iterate([train, valid, max_updates]) Optimize a loss iteratively using a training and validation dataset.
minimize(*args, **kwargs) Optimize our loss exhaustively.
set_params([targets]) Set the values of the parameters to the given target values.
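
As with other downhill optimizers, the hyperparameters documented above can also be supplied as keyword arguments when calling these methods. A hedged sketch of the iterative interface, reusing the hypothetical loss, inputs, and data from the earlier example:

   # 'loss', 'x', and 'y' are assumed from the earlier sketch; monitor
   # dictionaries are yielded as training proceeds.
   opt = downhill.build('rprop', loss=loss, inputs=[x, y])
   train = [np.random.randn(100, 10), np.random.randn(100)]
   for t_monitors, v_monitors in opt.iterate(train, max_updates=1000):
       print('training loss:', t_monitors['loss'])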