class downhill.adaptive.ESGD(*args, **kwargs)

Equilibrated SGD computes a diagonal Hessian preconditioner.


hv_method: {‘rop’, ‘lop’, ‘grad’}, optional

The Hv (Hessian-vector) product will be computed using the given method. The default is to use ‘rop’.

learning_rate: float, optional (default 1e-4)

Step size to take during optimization.

rms_halflife: float, optional (default 14)

Compute RMS gradient values using an exponentially weighted moving average that decays with this halflife.

rms_regularizer: float, optional (default 1e-8)

Regularize RMS gradient values by this \(\epsilon\).

momentum: float, optional (default 0)

Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.

nesterov: bool, optional (default False)

Set this to True to enable Nesterov-style momentum updates, whenever momentum is nonzero.


The ESGD method uses the same general strategy as all first-order stochastic gradient methods, in the sense that these methods make small parameter adjustments iteratively using local derivative information.

The difference here is that as gradients are computed during each parameter update, an exponentially-weighted moving average (EWMA) of estimates of the diagonal of the Hessian (the matrix of second derivatives) is maintained as well. At each update, the EWMA is used to compute the root-mean-square (RMS) diagonal value that’s been seen in the recent past. The actual gradient is scaled by the inverse of this diagonal preconditioner before being applied to update the parameters. Intuitively, this causes the algorithm to “reshape” the loss function in parameter space, such that directions of steep gradient (i.e., large diagonal values) and directions of shallow gradient (i.e., small diagonal values) are scaled to be approximately the same slope.

The diagonal estimates are computed using a nice trick: A vector \(r \sim \mathcal{N}(0, 1)\) consisting of standard normal values is sampled randomly at each update step, and the value of \(Hr\) is computed symbolically. These vector values tend to approximate the diagonal of the Hessian. Because \(Hr\) is itself a vector, the full Hessian \(H\) does not need to be computed or stored.

\[\begin{split}\begin{eqnarray*} r &\sim& \mathcal{N}(0, 1) \\ Hr &=& \frac{\partial^2 \mathcal{L}}{\partial^2 p}r \\ D_{t+1} &=& \gamma D_t + (1 - \gamma) (Hr)^2 \\ p_{t+1} &=& p_t + - \frac{\alpha}{\sqrt{D_{t+1} + \epsilon}} \frac{\partial\mathcal{L}}{\partial p} \end{eqnarray*}\end{split}\]

Like Rprop and the ADADELTARMSProp family, this learning method effectively maintains a sort of parameter-specific learning rate for each parameter in the loss.

In this implementation, \(\epsilon\) regularizes the RMS values; it is is specified using the rms_regularizer parameter.

The weight parameter \(\gamma\) for the EWMA is computed from the rms_halflife keyword argument, such that the actual EWMA weight varies inversely with the halflife \(h\): \(\gamma = e^{\frac{-\ln 2}{h}}\).

The primary difference between this implementation and the algorithm described in the paper (see below) is the use of an EWMA to decay the diagonal values over time, while in the paper the diagonal is divided by the training iteration. The EWMA halflife should be set to something reasonably large to ensure that this method emulates the method described in the original paper.


[Daup14]Y. Dauphin, H. de Vries, J. Chung & Y. Bengio. (2014) “RMSProp and equilibrated adaptive learning rates for non-convex optimization.”
__init__(*args, **kwargs)


__init__(*args, **kwargs)
evaluate(dataset) Evaluate the current model parameters on a dataset.
get_updates(**kwargs) Get parameter update expressions for performing optimization.
iterate([train, valid, max_updates]) Optimize a loss iteratively using a training and validation dataset.
minimize(*args, **kwargs) Optimize our loss exhaustively.
set_params([targets]) Set the values of the parameters to the given target values.