downhill.adaptive.ESGD

class downhill.adaptive.ESGD(*args, **kwargs)

Equilibrated SGD computes a diagonal Hessian preconditioner.
Parameters:

hv_method : {'rop', 'lop', 'grad'}, optional
    The Hv (Hessian-vector) product will be computed using the given method. The default is to use 'rop'.
learning_rate : float, optional (default 1e-4)
    Step size to take during optimization.
rms_halflife : float, optional (default 14)
    Compute RMS gradient values using an exponentially weighted moving average that decays with this halflife.
rms_regularizer : float, optional (default 1e-8)
    Regularize RMS gradient values by this \(\epsilon\).
momentum : float, optional (default 0)
    Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.
nesterov : bool, optional (default False)
    Set this to True to enable Nesterov-style momentum updates whenever momentum is nonzero.

Notes
The ESGD method uses the same general strategy as all first-order stochastic gradient methods: it makes small parameter adjustments iteratively using local derivative information.

The difference here is that as gradients are computed during each parameter update, an exponentially weighted moving average (EWMA) of estimates of the diagonal of the Hessian (the matrix of second derivatives) is maintained as well. At each update, the EWMA is used to compute the root-mean-square (RMS) diagonal value that’s been seen in the recent past. The actual gradient is scaled by the inverse of this diagonal preconditioner before being applied to update the parameters. Intuitively, this causes the algorithm to “reshape” the loss function in parameter space, such that directions of steep gradient (i.e., large diagonal values) and directions of shallow gradient (i.e., small diagonal values) are scaled to have approximately the same slope.
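This preconditioning scheme can be sketched in a few lines of NumPy. The sketch below is illustrative, not downhill's implementation: it uses a toy quadratic loss \(\mathcal{L}(p) = \tfrac{1}{2} p^\top A p\) whose Hessian \(A\) is known in closed form, so the Hessian-vector product is computed directly rather than symbolically, and the hyperparameter values are arbitrary choices.

```python
import numpy as np

# Illustrative sketch of an ESGD-style update on a toy quadratic loss
# L(p) = 0.5 * p^T A p, whose gradient is A @ p and Hessian is A.
# All names and hyperparameter values here are assumptions for the demo.
rng = np.random.default_rng(0)

A = np.diag([100.0, 1.0])      # ill-conditioned Hessian
p = np.array([1.0, 1.0])       # parameters to optimize

alpha = 0.05                   # learning rate
halflife = 14                  # EWMA halflife (cf. rms_halflife)
gamma = np.exp(-np.log(2) / halflife)
eps = 1e-8                     # RMS regularizer (cf. rms_regularizer)
D = np.zeros_like(p)           # EWMA of squared Hessian-vector products

init_loss = 0.5 * p @ A @ p
for _ in range(500):
    grad = A @ p                       # dL/dp
    r = rng.standard_normal(p.shape)   # r ~ N(0, 1)
    Hr = A @ r                         # Hessian-vector product
    D = gamma * D + (1 - gamma) * Hr ** 2
    p = p - alpha * grad / np.sqrt(D + eps)

final_loss = 0.5 * p @ A @ p
```

Despite the 100:1 curvature ratio between the two coordinates, the RMS scaling makes both coordinates shrink at roughly the same rate, which is the "reshaping" effect described above.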
The diagonal estimates are computed using a nice trick: A vector \(r \sim \mathcal{N}(0, 1)\) consisting of standard normal values is sampled randomly at each update step, and the value of \(Hr\) is computed symbolically. These vector values tend to approximate the diagonal of the Hessian. Because \(Hr\) is itself a vector, the full Hessian \(H\) does not need to be computed or stored.
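The trick is easy to check numerically in the simplest case. In the sketch below (an illustrative check, with an arbitrary matrix and sample count), the Hessian is diagonal, so the RMS of many sampled \(Hr\) vectors recovers the magnitudes of the diagonal entries exactly in expectation:

```python
import numpy as np

# Numerical check of the Hv trick: for a diagonal Hessian H,
# E[(Hr)_i^2] = H_ii^2 when r ~ N(0, 1), so the RMS over many
# samples approximates |diag(H)|. Matrix and sample count are
# arbitrary choices for this demo.
rng = np.random.default_rng(1)

H = np.diag([1.0, 2.0, 3.0])
samples = rng.standard_normal((100_000, 3))  # each row is one r
Hr = samples @ H.T                           # each row is H @ r
rms = np.sqrt(np.mean(Hr ** 2, axis=0))      # close to [1, 2, 3]
```

For a non-diagonal Hessian the same quantity picks up contributions from off-diagonal entries as well, which is why the text says the values "tend to approximate" the diagonal.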
\[\begin{split}\begin{eqnarray*}
r &\sim& \mathcal{N}(0, 1) \\
Hr &=& \frac{\partial^2 \mathcal{L}}{\partial p^2} r \\
D_{t+1} &=& \gamma D_t + (1 - \gamma) (Hr)^2 \\
p_{t+1} &=& p_t - \frac{\alpha}{\sqrt{D_{t+1} + \epsilon}} \frac{\partial\mathcal{L}}{\partial p}
\end{eqnarray*}\end{split}\]

Like Rprop and the ADADELTA–RMSProp family, this learning method effectively maintains a sort of parameter-specific learning rate for each parameter in the loss.

In this implementation, \(\epsilon\) regularizes the RMS values; it is specified using the rms_regularizer parameter.

The weight parameter \(\gamma\) for the EWMA is computed from the rms_halflife keyword argument, such that the actual EWMA weight varies inversely with the halflife \(h\): \(\gamma = e^{-\frac{\ln 2}{h}}\).

The primary difference between this implementation and the algorithm described in the paper (see below) is the use of an EWMA to decay the diagonal values over time, whereas in the paper the diagonal estimate is divided by the number of training iterations. The EWMA halflife should be set to something reasonably large to ensure that this method emulates the method described in the original paper.
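The halflife-to-weight conversion can be verified directly: after \(h\) successive multiplications by \(\gamma\), a contribution to the EWMA has decayed to half its original weight. A minimal check (using the default halflife of 14 as an example):

```python
import math

# EWMA decay weight gamma computed from the halflife h, as above:
# gamma = exp(-ln(2) / h), so gamma ** h = exp(-ln(2)) = 1/2,
# i.e. after h updates a sample's weight in the average has halved.
h = 14  # the default rms_halflife
gamma = math.exp(-math.log(2) / h)
```

A larger halflife pushes \(\gamma\) closer to 1, making the EWMA decay more slowly, which is why a reasonably large halflife better emulates the paper's running average over all iterations.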
References
[Daup14] Y. Dauphin, H. de Vries, J. Chung & Y. Bengio. (2014) “RMSProp and equilibrated adaptive learning rates for non-convex optimization.” http://arxiv.org/abs/1502.04390
__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)
evaluate(dataset)
    Evaluate the current model parameters on a dataset.
get_updates(**kwargs)
    Get parameter update expressions for performing optimization.
iterate([train, valid, max_updates])
    Optimize a loss iteratively using a training and validation dataset.
minimize(*args, **kwargs)
    Optimize our loss exhaustively.
set_params([targets])
    Set the values of the parameters to the given target values.