downhill.adaptive.ESGD

class downhill.adaptive.ESGD(*args, **kwargs)

Equilibrated SGD computes a diagonal preconditioner for gradient descent.

The ESGD method uses the same general strategy as all first-order stochastic gradient methods: it makes small, iterative parameter adjustments using local derivative information.

The difference here is that, as gradients are computed during each parameter update, an exponentially-weighted moving average (EWMA) of estimates of the diagonal preconditioner is maintained as well. At each update, the EWMA is used to compute the root-mean-square (RMS) preconditioner value seen in the recent past, and the gradient is normalized by this value before being applied to update the parameters.

\[\begin{split}\begin{eqnarray*} r &\sim& \mathcal{N}(0, 1) \\ Hr &=& \frac{\partial^2 \mathcal{L}}{\partial p^2}\, r \\ D_{t+1} &=& \gamma D_t + (1 - \gamma) (Hr)^2 \\ p_{t+1} &=& p_t - \frac{\alpha}{\sqrt{D_{t+1} + \epsilon}} \frac{\partial\mathcal{L}}{\partial p} \end{eqnarray*}\end{split}\]
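As a concrete illustration, here is a minimal NumPy sketch of a single ESGD update step under these equations. The hvp callable is a hypothetical stand-in for whatever computes the Hessian-vector product (downhill derives it with Theano), and the default values for alpha, gamma, and eps are illustrative rather than downhill's defaults:

    import numpy as np

    def esgd_step(params, grad, hvp, D, alpha=1e-3, gamma=0.99, eps=1e-8):
        # Draw a standard normal probe vector r ~ N(0, 1).
        r = np.random.standard_normal(params.shape)
        # Hessian-vector product; elementwise, E[(Hr)**2] equals diag(H**2),
        # so its running average estimates the equilibration preconditioner.
        Hr = hvp(r)
        # EWMA of squared Hessian-vector products: D' = gamma*D + (1-gamma)*(Hr)^2.
        D = gamma * D + (1 - gamma) * Hr ** 2
        # Normalize the gradient by the RMS preconditioner before updating.
        params = params - alpha * grad / np.sqrt(D + eps)
        return params, D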

Like Rprop and the ADADELTA/RMSProp family, this learning method effectively maintains a separate learning rate for each parameter in the loss.

The primary difference between this method and RMSProp is that ESGD treats the normalizing fraction explicitly as a preconditioner built from the diagonal of the Hessian, and estimates this diagonal by drawing a vector of standard normal values at every training step.

The primary difference between this implementation and the algorithm described in the paper (see below) is that here an EWMA is used to decay the diagonal estimates over time, whereas in the paper the accumulated diagonal is divided by the number of training iterations. The EWMA halflife should be set to something reasonably large to ensure that this implementation emulates the method described in the original paper; the two schemes are sketched below.
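To make the distinction concrete, here is a sketch of the two accumulation schemes, assuming Hr2 holds the squared Hessian-vector product at iteration t (the function names are illustrative, not part of downhill):

    def paper_update(D, Hr2, t):
        # Paper: arithmetic mean of all squared Hessian-vector products,
        # i.e. the accumulated sum divided by the iteration count t.
        return ((t - 1) * D + Hr2) / t

    def ewma_update(D, Hr2, gamma):
        # This implementation: exponential decay with weight gamma. A gamma
        # near one (a large halflife) approximates the paper's running mean.
        return gamma * D + (1 - gamma) * Hr2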

In this implementation, \(\epsilon\) regularizes the RMS values; it is specified using the rms_regularizer parameter.

The weight parameter \(\gamma\) for the EWMA is computed from the rms_halflife keyword argument \(h\); longer halflives yield weights closer to one: \(\gamma = e^{\frac{-\ln 2}{h}}\).
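As a quick check of this formula, a weight computed from halflife \(h\) satisfies \(\gamma^h = \frac{1}{2}\), i.e. a contribution from \(h\) steps ago carries half its original weight. A tiny helper (ewma_weight is our name, not a downhill function):

    import math

    def ewma_weight(halflife):
        # gamma = exp(-ln 2 / h), so gamma ** h == 0.5.
        return math.exp(-math.log(2) / halflife)

    gamma = ewma_weight(7)
    assert abs(gamma ** 7 - 0.5) < 1e-12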

References

[Dauphin14] Y. Dauphin, H. de Vries, J. Chung & Y. Bengio. (2014) “RMSProp and equilibrated adaptive learning rates for non-convex optimization.” http://arxiv.org/abs/1502.04390
Methods

__init__(*args, **kwargs)