downhill.first_order.SGD

class downhill.first_order.SGD(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Basic optimization using stochastic gradient descent.

Notes

A stochastic gradient trainer with momentum \(\mu\) and learning rate \(\alpha\) updates parameter \(\theta\) at step \(t\) by blending the current “velocity” \(v\) with the current gradient \(\frac{\partial\mathcal{L}}{\partial\theta}\):

\[\begin{split}\begin{eqnarray*} v_{t+1} &=& \mu v_t - \alpha \frac{\partial\mathcal{L}}{\partial\theta} \\ \theta_{t+1} &=& \theta_t + v_{t+1} \end{eqnarray*}\end{split}\]

Without momentum (i.e., when \(\mu = 0\)), these updates reduce to \(\theta_{t+1} = \theta_t - \alpha \frac{\partial\mathcal{L}}{\partial\theta}\), which just takes steps downhill according to the local gradient.
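The momentum-free case can be sketched in a few lines of plain Python. This is only an illustration of the update rule, not downhill's implementation; the loss \((x - 3)^2\), the learning rate value, and the iteration count are all arbitrary choices for the example.

```python
# Plain gradient descent (mu = 0) on the loss f(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). Illustrative values, not downhill defaults.
alpha = 0.1   # learning rate
theta = 0.0   # initial parameter value

for _ in range(200):
    grad = 2.0 * (theta - 3.0)
    theta = theta - alpha * grad   # theta_{t+1} = theta_t - alpha * dL/dtheta
```

Each step moves `theta` a fixed fraction of the way toward the minimizer at 3, so the iterates converge geometrically.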

Adding the momentum term lets the algorithm incorporate information from previous steps as well: the velocity accumulates a decaying average of past gradients, which smooths the update direction and, in practice, is thought to capture some information about the second-order structure (curvature) of the loss surface.

References

[Rume86] D. E. Rumelhart, G. E. Hinton, & R. J. Williams. (1986) “Learning representations by back-propagating errors”. Nature 323 (6088):533–536. doi:10.1038/323533a0. http://www.nature.com/nature/journal/v323/n6088/abs/323533a0.html
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Methods