Optimizers

Say you have a parameter W initialized for your model and its gradient stored as ∇ (perhaps obtained from the AutoGrad APIs). Here is a minimal snippet showing how to update W with SGD.

julia> using MXNet

julia> opt = SGD(η = 10)
SGD(10, 0.0, 0, 0, 0.0001, MXNet.mx.LearningRate.Fixed(10.0), MXNet.mx.Momentum.Null())

julia> descend! = getupdater(opt)
(::getfield(MXNet.mx, Symbol("#updater#9144")){SGD,Dict{Int64,Any}}) (generic function with 1 method)

julia> W = NDArray(Float32[1, 2, 3, 4]);

julia> ∇ = NDArray(Float32[.1, .2, .3, .4]);

julia> descend!(1, ∇, W)
4-element NDArray{Float32,1} @ cpu0:
 -0.0010000467f0
 -0.0020000935f0
 -0.003000021f0
 -0.004000187f0
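
The numbers above can be sanity-checked by hand, assuming the vanilla SGD update with weight decay, W ← W − η (∇ + λ W), with η = 10 and the default λ = 0.0001 (see the SGD section below):

η, λ = 10.0, 0.0001
W = Float32[1, 2, 3, 4]
∇ = Float32[0.1, 0.2, 0.3, 0.4]
W .- η .* (∇ .+ λ .* W)   # ≈ [-0.001, -0.002, -0.003, -0.004], matching the output above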

# MXNet.mx.AbstractOptimizerType.

AbstractOptimizer

Base type for all optimizers.

source

# MXNet.mx.getupdaterMethod.

getupdater(optimizer)

A utility function to create an updater function for KVStore, which uses its closure to store all the state needed for each weight.

The returned function has the following signature:

descend!(index::Int, ∇::NDArray, x::NDArray)

If the optimizer is stateful and needs to access or store state during updating, index will be the key used to access or store that state.
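
For illustration (a hedged sketch building on the REPL session above; the arrays, indices, and the update! name are made up for this example), a stateful optimizer such as ADAM keeps one state entry per index inside the closure:

opt = ADAM(η = 0.001)
update! = getupdater(opt)

W₁ = NDArray(Float32[1, 2, 3, 4]); ∇₁ = NDArray(Float32[.1, .2, .3, .4])
W₂ = NDArray(Float32[5, 6]);       ∇₂ = NDArray(Float32[.5, .6])

update!(1, ∇₁, W₁)   # state (moment estimates) for index 1 is created lazily
update!(2, ∇₂, W₂)   # a separate state entry is kept for index 2
update!(1, ∇₁, W₁)   # reuses the state stored under index 1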

source

# MXNet.mx.normgrad!Method.

normgrad!(optimizer, W, ∇)

Get the properly normalized gradient (re-scaled and clipped if necessary).

  • optimizer: the optimizer; it should contain the fields scale, clip and λ.
  • W::NDArray: the trainable weights.
  • ∇::NDArray: the original gradient of the weights.
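
The following plain-Julia function is only an illustration of what "properly normalized" means here, written against the documented fields; it is not the library implementation:

function normalize_grad(scale, clip, λ, W::Vector{Float32}, ∇::Vector{Float32})
    g = scale != 0 ? scale .* ∇ : copy(∇)   # optional gradient rescaling
    if clip > 0
        g = clamp.(g, -clip, clip)          # clip into the range [-clip, clip]
    end
    g .+ λ .* W                             # add the weight-decay (L2) term
end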

source

# MXNet.mx.AbstractLearningRateSchedulerType.

AbstractLearningRateScheduler

Base type for all learning rate schedulers.

source

# MXNet.mx.AbstractMomentumSchedulerType.

AbstractMomentumScheduler

Base type for all momentum schedulers.

source

# MXNet.mx.OptimizationStateType.

OptimizationState

Attributes

  • batch_size: The size of the mini-batch used in stochastic training.
  • curr_epoch: The current epoch count. Epoch 0 means no training yet; during the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.
  • curr_batch: The current mini-batch count. The batch count is reset during every epoch. The batch count 0 means the beginning of each epoch, with no mini-batch seen yet. During the first mini-batch, the mini-batch count will be 1.
  • curr_iter: The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count does not reset in each epoch. So it tracks the total number of mini-batches seen so far.

source

# MXNet.mx.LearningRate.ExpType.

LearningRate.Exp(η₀; γ = 0.9)

Exponential decay scheduler: ηₜ = η₀ γᵗ, where t is the epoch count, or the iteration count.
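
A quick numeric illustration of the decay, assuming the formula above:

η₀, γ = 0.1, 0.9
ηₜ(t) = η₀ * γ^t
ηₜ.(0:3)   # [0.1, 0.09, 0.081, 0.0729]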

source

# MXNet.mx.LearningRate.FixedType.

LearningRate.Fixed(η)

The fixed learning rate scheduler always returns the same learning rate.

source

# MXNet.mx.LearningRate.InvType.

LearningRate.Inv(η₀; γ = 0.9, p = 0.5)

Inverse decay scheduler: ηₜ = η₀ (1 + γ t)^(−p), where t is the epoch count, or the iteration count.
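
And likewise for the inverse decay, assuming the formula above:

η₀, γ, p = 0.1, 0.9, 0.5
ηₜ(t) = η₀ * (1 + γ * t)^(-p)
ηₜ.(0:3)   # ≈ [0.1, 0.0725, 0.0598, 0.0520]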

source

# Base.getMethod.

get(sched::AbstractLearningRateScheduler)

Returns the current learning rate.

source

# MXNet.mx.Momentum.FixedType.

Momentum.Fixed

Fixed momentum scheduler always returns the same value.

source

# MXNet.mx.Momentum.NadamSchedulerType.

NadamScheduler(; μ = 0.99, δ = 0.004, γ = 0.5, α = 0.96)

Nesterov-accelerated adaptive momentum scheduler.

Description in Incorporating Nesterov Momentum into Adam.

The momentum at iteration t is

μₜ = μ₀ (1 − γ α^(t δ))

where

  • t: the iteration count.
  • μ: default 0.99, the base momentum μ₀.
  • δ: default 0.004, the scheduler decay.
  • γ: default 0.5.
  • α: default 0.96.
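
A rough numeric illustration of this schedule with the defaults, assuming the form given above:

μ₀, δ, γ, α = 0.99, 0.004, 0.5, 0.96
μₜ(t) = μ₀ * (1 - γ * α^(t * δ))
μₜ.((1, 100, 10_000))   # ≈ (0.495, 0.503, 0.893): ramps from μ₀ / 2 toward μ₀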

source

# MXNet.mx.Momentum.NullType.

Momentum.Null

The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.

source

# Base.getMethod.

get(n::NadamScheduler, t)

Where t is the iteration count.

source

Built-in optimizers

Stochastic Gradient Descent

# MXNet.mx.SGDType.

SGD(; kwargs...)

Stochastic gradient descent optimizer.

Vanilla SGD:

θ ← θ − η ∇

SGD with momentum:

ν ← μ ν − η ∇
θ ← θ + ν

Arguments

  • η: default 0.01, learning rate.
  • μ: default 0, the momentum, usually set to 0.9 in this implementation.
  • λ: default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
  • clip: default 0, gradient clipping. If positive, will clip the gradient into the bounded range [-clip, clip].
  • scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
  • μ_sched::AbstractMomentumScheduler: default Momentum.Null(), a dynamic momentum scheduler. If set, will overwrite the momentum parameter.
  • η_sched::AbstractLearningRateScheduler: default LearningRate.Fixed(η), a dynamic learning rate scheduler. If set, will overwrite the η parameter.

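A hedged usage sketch combining the keyword arguments documented above with the updater from the introductory example (the values are arbitrary; η_sched here overrides η as noted):

opt = SGD(η = 0.1, μ = 0.9, λ = 0.0001,
          η_sched = mx.LearningRate.Exp(0.1; γ = 0.9))
update! = getupdater(opt)

W = NDArray(Float32[1, 2, 3, 4])
∇ = NDArray(Float32[.1, .2, .3, .4])
update!(1, ∇, W)   # the momentum state for index 1 lives inside the updater closure
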
source

ADAM

# MXNet.mx.ADAMType.

ADAM

The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].

ADAM(; kwargs...)

Arguments

  • η: default 0.001, learning rate.
  • β1: default 0.9.
  • β2: default 0.999.
  • ϵ: default 1e-8.
  • clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
  • scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
  • λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
  • η_sched::AbstractLearningRateScheduler: default LearningRate.Fixed(η), a dynamic learning rate scheduler. If set, will overwrite the η parameter.

source

AdaGrad

# MXNet.mx.AdaGradType.

AdaGrad(; kwargs...)

Scale learning rates by dividing by the square root of accumulated squared gradients. See [1] for further description.

Arguments

  • η: default 0.1, learning rate.
  • ϵ: default 1e-6, small value added for numerical stability.
  • clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
  • scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
  • λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.

Notes

Using step size η, AdaGrad calculates the learning rate for feature i at time step t as:

η_{t,i} = η / √(∑_{t′=1}^{t} g_{t′,i}² + ϵ)

As such, the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
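
A one-coordinate sketch of that accumulation (illustrative only, not the library implementation):

let η = 0.1, ϵ = 1e-6, acc = 0.0
    for g in (0.3, 0.1, 0.4)            # made-up gradient sequence for one coordinate
        acc += g^2                      # accumulated squared gradients
        println(η / sqrt(acc + ϵ))      # effective step size; decreases monotonically
    end
end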

References

  1. Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
  2. Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf

source

AdaDelta

# MXNet.mx.AdaDeltaType.

AdaDelta(; kwargs...)

Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.

Attributes

  • η: default 1.0, learning rate.
  • ρ: default 0.95, squared gradient moving average decay factor.
  • ϵ: default 1e-6, small value added for numerical stability.
  • clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
  • scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
  • λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.

Notes

ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

ρ = 0.95 and ϵ = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so η = 1.0). Probably best to keep it at this value.

ϵ is important for the very first update (so the numerator does not become 0).

Using the step size η and a decay factor ρ, the learning rate is calculated as:

rₜ = ρ rₜ₋₁ + (1 − ρ) g²
ηₜ = η √(sₜ₋₁ + ϵ) / √(rₜ + ϵ)
sₜ = ρ sₜ₋₁ + (1 − ρ) (ηₜ g)²
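
A one-coordinate sketch of these running averages (illustrative only, assuming the update written above):

let η = 1.0, ρ = 0.95, ϵ = 1e-6, r = 0.0, s = 0.0
    for g in (0.3, 0.1, 0.4)                   # made-up gradient sequence
        r = ρ * r + (1 - ρ) * g^2              # running average of squared gradients
        Δ = η * sqrt(s + ϵ) / sqrt(r + ϵ) * g  # adaptively scaled step
        s = ρ * s + (1 - ρ) * Δ^2              # running average of squared updates
        println(Δ)
    end
end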

References

  1. Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701.

source

AdaMax

# MXNet.mx.AdaMaxType.

AdaMax(; kwargs...)

This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.

Arguments

  • η: default 0.002, learning rate.
  • β1: default 0.9, exponential decay rate for the first moment estimates.
  • β2: default 0.999, exponential decay rate for the weighted infinity norm estimates.
  • ϵ: default 1e-8, small value added for numerical stability.
  • clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
  • scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
  • λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.

References

  1. Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. Section 7. http://arxiv.org/abs/1412.6980.

source

RMSProp

# MXNet.mx.RMSPropType.

RMSProp(; kwargs...)

Scale learning rates by dividing by the moving average of the root mean squared (RMS) gradients. See [1] for further description.

Arguments

  • η: default 0.1, learning rate.
  • ρ: default 0.9, gradient moving average decay factor.
  • ϵ: default 1e-8, small value added for numerical stability.
  • clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
  • scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
  • λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.

Notes

ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

Using the step size η and a decay factor ρ, the learning rate ηₜ is calculated as:

rₜ = ρ rₜ₋₁ + (1 − ρ) g²
ηₜ = η / √(rₜ + ϵ)

References

  1. Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)

source

Nadam

# MXNet.mx.NadamType.

Nadam(; kwargs...)

Nesterov Adam optimizer: Adam with Nesterov momentum; see [1] and the notes for further description.

Arguments

  • η: default 0.001, learning rate.
  • β1: default 0.99.
  • β2: default 0.999.
  • ϵ: default 1e-8, small value added for numerical stability.
  • clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
  • scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating. Often chosen to be 1.0 / batch_size. If left at the default, a high-level API like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
  • λ: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
  • η_sched::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the η parameter.
  • μ_sched::NadamScheduler: default NadamScheduler(), the momentum scheduler of the form μₜ = μ₀ (1 − γ α^(t δ)) described above.

Notes

Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.

References

  1. Incorporating Nesterov Momentum into Adam.
  2. On the importance of initialization and momentum in deep learning.

source