downhill.first_order.SGD

class downhill.first_order.SGD(loss, params, inputs, updates=(), monitors=(), monitor_gradients=False)

Optimize using stochastic gradient descent with momentum.

A stochastic gradient trainer with momentum \(\mu\) and learning rate \(\alpha\) updates parameter \(\theta\) at step \(t\) by blending the current “velocity” \(v\) with the current gradient \(\frac{\partial\mathcal{L}}{\partial\theta}\):

\[\begin{aligned} v_{t+1} &= \mu v_t - \alpha \frac{\partial\mathcal{L}}{\partial\theta} \\ \theta_{t+1} &= \theta_t + v_{t+1} \end{aligned}\]

Without momentum (i.e., when \(\mu = 0\)), these updates reduce to \(\theta_{t+1} = \theta_t - \alpha \frac{\partial\mathcal{L}}{\partial\theta}\), which just takes steps downhill according to the local gradient.

Adding the momentum term lets the algorithm incorporate information from previous steps as well, which in practice is thought to capture some information about the second-order structure (curvature) of the loss surface.
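The update rule above is easy to state directly in code. The following is an illustrative NumPy sketch of the two equations, not the downhill implementation; the function name, the toy quadratic loss, and the step count are assumptions chosen for the example.

```python
import numpy as np

def sgd_momentum(grad, theta0, alpha=0.1, mu=0.9, steps=100):
    """Illustrative sketch of the momentum update rule shown above.

    ``grad(theta)`` should return the gradient dL/dtheta. This is not
    the downhill implementation, just the two update equations in NumPy.
    """
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - alpha * grad(theta)   # v_{t+1} = mu * v_t - alpha * dL/dtheta
        theta = theta + v                  # theta_{t+1} = theta_t + v_{t+1}
    return theta

# Example: minimize L(theta) = ||theta - 1||^2 / 2, whose gradient is
# theta - 1; the iterates converge toward the all-ones vector.
print(sgd_momentum(lambda t: t - 1.0, np.zeros(3)))
```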

__init__(loss, params, inputs, updates=(), monitors=(), monitor_gradients=False)
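A minimal construction sketch using the signature documented above, assuming a Theano loss expression. The toy loss, the variable names x and targets, and the trailing comment about training hyperparameters are illustrative assumptions, not part of this excerpt.

```python
import numpy as np
import theano
import theano.tensor as TT
import downhill

# A toy quadratic loss over a shared parameter vector (assumed example).
x = theano.shared(np.zeros(5, dtype='float64'), name='x')
targets = TT.vector('targets')
loss = TT.sqr(x - targets).sum()

# Construct the optimizer with the documented signature:
# SGD(loss, params, inputs, ...).
opt = downhill.first_order.SGD(loss, [x], [targets])

# Training hyperparameters such as the learning rate and momentum are
# assumed here to be supplied when running the optimizer (e.g. through
# its minimize/iterate interface), not at construction time.
```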

Methods