downhill.first_order.NAG

class downhill.first_order.NAG(loss, params, inputs, updates=(), monitors=(), monitor_gradients=False)

Optimize using Nesterov’s Accelerated Gradient (NAG).
The basic difference between NAG and “classical” momentum in SGD optimization approaches is that NAG computes the gradients at the position in parameter space where “classical” momentum would put us at the next step. In classical SGD with momentum \(\mu\) and learning rate \(\alpha\), updates to parameter \(p\) at step \(t\) are computed by blending the current “velocity” \(v\) with the current gradient \(\frac{\partial\mathcal{L}}{\partial p}\):

\[\begin{eqnarray*}
v_{t+1} &=& \mu v_t - \alpha \frac{\partial\mathcal{L}}{\partial p} \\
p_{t+1} &=& p_t + v_{t+1}
\end{eqnarray*}\]

In contrast, NAG adjusts the update by blending the current “velocity” with the gradient at the next step; that is, the gradient is computed at the point where the velocity would have taken us:

\[\begin{eqnarray*}
v_{t+1} &=& \mu v_t - \alpha \left. \frac{\partial\mathcal{L}}{\partial p} \right|_{p_t + \mu v_t} \\
p_{t+1} &=& p_t + v_{t+1}
\end{eqnarray*}\]

Again, the difference is that the gradient is computed at the place in parameter space where the classical technique would have stepped, in the absence of a new gradient.
In theory, this helps correct for oversteps during learning: if momentum would lead us to overshoot, then the gradient at the overshot point will point back toward where we came from. See [R1] for details on this idea.
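For concreteness, the two update rules can be compared with a minimal NumPy sketch. The quadratic loss, the grad helper, and the hyperparameter values below are assumptions chosen for illustration; the NAG class itself operates on Theano expressions rather than NumPy arrays:

    import numpy as np

    def grad(p):
        # Gradient of an assumed quadratic loss L(p) = 0.5 * ||p||^2.
        return p

    mu, alpha = 0.9, 0.1            # example momentum and learning rate
    p_sgd = p_nag = np.ones(3)      # start both variants at the same point
    v_sgd = v_nag = np.zeros(3)

    for _ in range(10):
        # Classical momentum: gradient evaluated at the current position p_t.
        v_sgd = mu * v_sgd - alpha * grad(p_sgd)
        p_sgd = p_sgd + v_sgd

        # NAG: gradient evaluated at the lookahead point p_t + mu * v_t.
        v_nag = mu * v_nag - alpha * grad(p_nag + mu * v_nag)
        p_nag = p_nag + v_nag

    print(p_sgd, p_nag)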
References
[R1] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. “On the importance of initialization and momentum in deep learning.” ICML 2013. http://jmlr.csail.mit.edu/proceedings/papers/v28/sutskever13.pdf
__init__(loss, params, inputs, updates=(), monitors=(), monitor_gradients=False)
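As a usage sketch, the constructor takes a Theano loss expression, the shared variables to optimize, and the symbolic inputs the loss depends on. The toy least-squares model below is an assumption for illustration, and the training loop itself is not shown:

    import numpy as np
    import theano
    import theano.tensor as TT

    from downhill.first_order import NAG

    # Symbolic inputs for a toy least-squares problem (illustrative only).
    x = TT.matrix('x')
    y = TT.vector('y')

    # Model parameter stored as a Theano shared variable.
    w = theano.shared(np.zeros(5, dtype='float64'), name='w')

    # Mean squared error of a linear model.
    loss = TT.sqr(TT.dot(x, w) - y).mean()

    # Build the optimizer; training then proceeds via the optimizer's
    # iteration interface (not shown here).
    opt = NAG(loss=loss, params=[w], inputs=[x, y])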
Methods