downhill.adaptive.Adam

class downhill.adaptive.Adam(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Adam optimizer using unbiased gradient moment estimates.

Notes

The Adam method follows the same general strategy as other first-order stochastic gradient methods: it iteratively makes small parameter adjustments using local derivative (gradient) information.

The difference is that, as gradients are computed during each parameter update, Adam also maintains exponentially weighted moving averages (EWMAs) of (1) the first moment (mean) and (2) the second moment (uncentered variance) of the recent gradient values. At each update, the step taken is proportional to the ratio of the bias-corrected first-moment estimate to the square root of the bias-corrected second-moment estimate:

\[\begin{split}\begin{eqnarray*} \beta_1^t &=& \beta_1 \lambda^{t} \\ f_{t+1} &=& \beta_1^t f_t + (1 - \beta_1^t) \frac{\partial\mathcal{L}}{\partial\theta} \\ g_{t+1} &=& \beta_2 g_t + (1 - \beta_2) \left(\frac{\partial\mathcal{L}}{\partial\theta}\right)^2 \\ \theta_{t+1} &=& \theta_t - \frac{f_{t+1} / (1 - \beta_1^t)}{\sqrt{g_{t+1} / (1 - \beta_2)} + \epsilon} \end{eqnarray*}\end{split}\]
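As a concrete illustration, here is a minimal NumPy sketch of the update rule above. It is not downhill's Theano implementation: the global learning-rate scaling of the step and the default hyperparameter values are assumptions added for the example.

    import numpy as np

    def adam_step(theta, grad, f, g, t, learning_rate=0.001,
                  beta1=0.9, beta2=0.999, lmbda=1 - 1e-8, eps=1e-8):
        # Decayed first-moment weight: beta_1^t = beta_1 * lambda ** t.
        beta1_t = beta1 * lmbda ** t
        # EWMAs of the gradient and of the squared gradient.
        f = beta1_t * f + (1 - beta1_t) * grad
        g = beta2 * g + (1 - beta2) * grad ** 2
        # Bias-corrected ratio from the update rule; scaling by a global
        # learning rate is an assumption, not part of the equations above.
        step = (f / (1 - beta1_t)) / (np.sqrt(g / (1 - beta2)) + eps)
        return theta - learning_rate * step, f, g

    # Toy usage: minimize L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
    theta = np.array([1.0, -2.0])
    f = np.zeros_like(theta)
    g = np.zeros_like(theta)
    for t in range(1, 501):
        theta, f, g = adam_step(theta, theta, f, g, t, learning_rate=0.05)
    print(theta)  # driven toward the minimizer at the origin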

Like all adaptive optimization algorithms, this optimizer effectively maintains a sort of parameter-specific momentum value. It shares with RMSProp and ADADELTA the idea of using an EWMA to track recent quantities related to the stochastic gradient during optimization. But the Adam method is unique in that it incorporates an explicit computation to remove the bias from these estimates.
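To see why the correction matters, consider the very first update with the moving averages initialized to zero: the raw estimates

\[f_1 = (1 - \beta_1^1) \frac{\partial\mathcal{L}}{\partial\theta}, \qquad g_1 = (1 - \beta_2) \left(\frac{\partial\mathcal{L}}{\partial\theta}\right)^2\]

are both shrunk toward zero, but dividing by \(1 - \beta_1^1\) and \(1 - \beta_2\), as in the update rule above, recovers the gradient and squared gradient exactly.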

In this implementation, \(\epsilon\) regularizes the RMS values and is set using the rms_regularizer keyword argument. The weight parameters \(\beta_1\) and \(\beta_2\) for the first and second EWMA windows are computed from the beta1_halflife and beta2_halflife keyword arguments, respectively: a halflife \(h\) maps to the EWMA weight \(\gamma = e^{\frac{-\ln 2}{h}}\), so that a gradient sample's contribution to the corresponding moving average is halved every \(h\) updates. The decay \(\lambda\) applied to the \(\beta_1\) weight is given by the beta1_decay keyword argument.
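Because the halflife-to-weight conversion is easy to get backwards, here is a small sketch of the mapping; the numeric halflives are illustrative choices, not necessarily downhill's defaults.

    import math

    def halflife_to_weight(h):
        # gamma = exp(-ln 2 / h): a sample's contribution halves every h updates.
        return math.exp(-math.log(2) / h)

    def weight_to_halflife(gamma):
        # Inverse mapping, handy for picking halflives that match familiar betas.
        return -math.log(2) / math.log(gamma)

    print(halflife_to_weight(7))      # ~0.906, roughly the beta_1 = 0.9 of [King15]
    print(weight_to_halflife(0.999))  # ~693 updates matches the paper's beta_2 = 0.999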

The implementation here is taken from Algorithm 1 of [King15].

References

[King15] D. Kingma & J. Ba. “Adam: A Method for Stochastic Optimization.” ICLR 2015. http://arxiv.org/abs/1412.6980

__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)
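
For orientation, here is a hedged end-to-end sketch that builds this optimizer with downhill.build and runs it on a toy least-squares problem. The build and minimize calls, the pass-through of the keyword arguments described above at optimization time, and all data and values are assumptions for illustration, not taken from this page.

    import numpy as np
    import theano
    import theano.tensor as TT
    import downhill

    # Toy least-squares regression: learn w to minimize the mean squared error.
    x = TT.matrix('x')
    y = TT.vector('y')
    w = theano.shared(np.zeros(10), name='w')
    loss = TT.sqr(TT.dot(x, w) - y).mean()

    # Assumed construction route; the Adam class above can also be used directly.
    opt = downhill.build('adam', loss=loss, inputs=[x, y], params=[w])

    x_data = np.random.randn(1000, 10)
    y_data = x_data.dot(np.arange(10.0)) + 0.1 * np.random.randn(1000)

    # Hyperparameter names follow this page; passing them here is an assumption.
    opt.minimize([x_data, y_data],
                 learning_rate=0.01,
                 beta1_halflife=7,
                 beta2_halflife=69,
                 rms_regularizer=1e-8)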
