downhill.first_order.NAG

class downhill.first_order.NAG(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)[source]

Stochastic gradient optimization with Nesterov momentum.

The class name is an abbreviation of “Nesterov’s Accelerated Gradient.” Note that a nonzero momentum value must be supplied during optimization for Nesterov momentum to take effect; momentum defaults to 0, so by default no momentum is applied.

Parameters:
learning_rate: float, optional (default 1e-4)

Step size to take during optimization.

momentum: float, optional (default 0)

Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.

Notes

The basic difference between NAG and “classical” momentum in SGD optimization approaches is that NAG computes the gradients at the position in parameter space where “classical” momentum would put us at the next step. In classical SGD with momentum \(\mu\) and learning rate \(\alpha\), updates to parameter \(p\) at step \(t\) are computed by blending the current “velocity” \(v\) with the current gradient \(\frac{\partial\mathcal{L}}{\partial p}\):

\[\begin{split}\begin{eqnarray*} v_{t+1} &=& \mu v_t - \alpha \frac{\partial\mathcal{L}}{\partial p} \\ p_{t+1} &=& p_t + v_{t+1} \end{eqnarray*}\end{split}\]
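As a concrete illustration, here is a plain Python sketch of this classical update on the toy quadratic loss \(\mathcal{L}(p) = p^2/2\) (so \(\frac{\partial\mathcal{L}}{\partial p} = p\)). The function name and constants are illustrative only, not part of the downhill API, and this is not the class's actual Theano implementation:

```python
def sgd_momentum(p, steps, mu=0.9, alpha=0.1):
    """Classical SGD with momentum on the toy loss L(p) = p**2 / 2."""
    v = 0.0
    for _ in range(steps):
        grad = p                   # dL/dp evaluated at the *current* position
        v = mu * v - alpha * grad  # v_{t+1} = mu * v_t - alpha * grad
        p = p + v                  # p_{t+1} = p_t + v_{t+1}
    return p

# Starting from p = 1.0, the iterates spiral in toward the minimum at p = 0.
print(sgd_momentum(1.0, 200))
```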

In contrast, NAG adjusts the update by blending the current “velocity” with the gradient at the next step—that is, the gradient is computed at the point where the velocity would have taken us:

\[\begin{split}\begin{eqnarray*} v_{t+1} &=& \mu v_t - \alpha \left. \frac{\partial\mathcal{L}}{\partial p}\right|_{p_t + \mu v_t} \\ p_{t+1} &=& p_t + v_{t+1} \end{eqnarray*}\end{split}\]

Again, the difference is that the gradient is computed at the point in parameter space where the momentum term alone would have taken us, that is, where the classical technique would have stepped before seeing a new gradient.
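The same toy example (\(\mathcal{L}(p) = p^2/2\), plain Python, illustrative names, not the class's Theano implementation) makes the lookahead explicit:

```python
def nag(p, steps, mu=0.9, alpha=0.1):
    """Nesterov momentum on the toy loss L(p) = p**2 / 2."""
    v = 0.0
    for _ in range(steps):
        lookahead = p + mu * v     # where classical momentum alone would land
        grad = lookahead           # dL/dp evaluated at the lookahead point
        v = mu * v - alpha * grad  # v_{t+1} = mu * v_t - alpha * grad
        p = p + v                  # p_{t+1} = p_t + v_{t+1}
    return p

print(nag(1.0, 200))
```

On this quadratic, evaluating the gradient at the lookahead point damps the oscillations, so the iterates converge noticeably faster than classical momentum does with the same \(\mu\) and \(\alpha\).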

In theory, this helps correct for oversteps during learning: If momentum would lead us to overshoot, then the gradient at that overshot place will point backwards, toward where we came from. See [Suts13] for a particularly clear exposition of this idea.

References

[Suts13] I. Sutskever, J. Martens, G. Dahl, & G. Hinton. (ICML 2013) “On the importance of initialization and momentum in deep learning.” http://www.cs.toronto.edu/~fritz/absps/momentum.pdf
[Nest83] Y. Nesterov. (1983) “A method of solving a convex programming problem with convergence rate O(1/k^2).” Soviet Mathematics Doklady, 27:372–376.
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

x.__init__(…) initializes x; see help(type(x)) for signature

Methods

__init__(loss[, params, inputs, updates, …]) x.__init__(…) initializes x; see help(type(x)) for signature
evaluate(dataset) Evaluate the current model parameters on a dataset.
get_updates(**kwargs) Get parameter update expressions for performing optimization.
iterate(*args, **kwargs) Optimize a loss iteratively using a training and validation dataset.
minimize(*args, **kwargs) Optimize our loss exhaustively.
set_params([targets]) Set the values of the parameters to the given target values.
iterate(*args, **kwargs)[source]

Optimize a loss iteratively using a training and validation dataset.

This method yields a series of monitor values to the caller. After every optimization epoch, a pair of monitor dictionaries is generated: one evaluated on the training dataset during the epoch, and another evaluated on the validation dataset at the most recent validation epoch.

The validation monitors might not be updated during every optimization iteration; in this case, the most recent validation monitors will be yielded along with the training monitors.

Additional keyword arguments supplied here will set the global optimizer attributes.

Parameters:
train : sequence or Dataset

A set of training data for computing updates to model parameters.

valid : sequence or Dataset

A set of validation data for computing monitor values and determining when the loss has stopped improving. Defaults to the training data.

max_updates : int, optional

If specified, halt optimization after this many gradient updates have been processed. If not provided, uses early stopping to decide when to halt.

Yields:
train_monitors : dict

A dictionary mapping monitor names to values, evaluated on the training dataset.

valid_monitors : dict

A dictionary containing monitor values evaluated on the validation dataset.