# Optimizers

Say you have a parameter W initialized for your model and its gradient stored in ∇ (perhaps computed via the AutoGrad APIs). Here is a minimal snippet showing how to update W with SGD.

julia> using MXNet

julia> opt = SGD(η = 10)
SGD(10, 0.0, 0, 0, 0.0001, MXNet.mx.LearningRate.Fixed(10.0), MXNet.mx.Momentum.Null())

julia> descend! = getupdater(opt)
(::getfield(MXNet.mx, Symbol("#updater#9256")){SGD,Dict{Int64,Any}}) (generic function with 1 method)

julia> W = NDArray(Float32[1, 2, 3, 4]);

julia> ∇ = NDArray(Float32[.1, .2, .3, .4]);

julia> descend!(1, ∇, W)
4-element NDArray{Float32,1} @ cpu0:
-0.0010000467f0
-0.0020000935f0
-0.003000021f0
-0.004000187f0


# MXNet.mx.AbstractOptimizer (Type)

AbstractOptimizer


Base type for all optimizers.

# MXNet.mx.getupdater (Method)

getupdater(optimizer)


A utility function to create an updater function for KVStore; the returned closure stores all the state needed for each weight.

The returned function has the following signature:

descend!(index::Int, ∇::NDArray, x::NDArray)


If the optimizer is stateful and needs to access or store state during updating, index will be the key used to retrieve and store that state.
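
For illustration, here is a hedged sketch of one updater shared by two weights: with a stateful optimizer (here SGD with momentum, whose arguments are documented later in this section), the state for each weight is kept in the updater's closure under that weight's index.

using MXNet

opt = SGD(η = 0.01, μ = 0.9)         # stateful: keeps a momentum buffer per weight
descend! = getupdater(opt)

W₁, ∇₁ = NDArray(Float32[1, 2]), NDArray(Float32[0.1, 0.2])
W₂, ∇₂ = NDArray(Float32[3, 4]), NDArray(Float32[0.3, 0.4])

descend!(1, ∇₁, W₁)                  # state for W₁ is stored under index 1
descend!(2, ∇₂, W₂)                  # state for W₂ is stored under index 2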

# MXNet.mx.normgrad! (Method)

normgrad!(optimizer, W, ∇)


Get the properly normalized gradient (re-scaled and clipped if necessary).

• optimizer: the optimizer; it should contain the fields scale, clip, and λ.
• W::NDArray: the trainable weights.
• ∇::NDArray: the original gradient of the weights.
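
A rough sketch of what "properly normalized" means here, written in plain Julia on ordinary arrays; the actual implementation may differ in details such as ordering:

function normalized_grad(scale, clip, λ, W, ∇)
    g = scale != 0 ? scale .* ∇ : copy(∇)   # re-scale, e.g. by 1 / batch_size
    if clip > 0
        g = clamp.(g, -clip, clip)          # clip into [-clip, clip]
    end
    g .+ λ .* W                             # weight decay: add the L2 penalty gradient λ * W
end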

# MXNet.mx.AbstractLearningRateScheduler (Type)

AbstractLearningRateScheduler


Base type for all learning rate schedulers.

# MXNet.mx.AbstractMomentumScheduler (Type)

AbstractMomentumScheduler


Base type for all momentum schedulers.

# MXNet.mx.OptimizationState (Type)

OptimizationState


Attributes

• batch_size: The size of the mini-batch used in stochastic training.
• curr_epoch: The current epoch count. Epoch 0 means no training yet; during the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.
• curr_batch: The current mini-batch count. The batch count is reset during every epoch. The batch count 0 means the beginning of each epoch, with no mini-batch seen yet. During the first mini-batch, the mini-batch count will be 1.
• curr_iter: The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count does not reset in each epoch, so it tracks the total number of mini-batches seen so far.
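
A plain-Julia illustration (not the OptimizationState type itself) of how the three counters relate:

function count_progress(n_epochs, batches_per_epoch)
    curr_iter = 0
    for curr_epoch in 1:n_epochs          # 1 during the first pass over the data, 2 during the second, ...
        curr_batch = 0                    # the mini-batch count resets at the start of each epoch
        for _ in 1:batches_per_epoch
            curr_batch += 1
            curr_iter  += 1               # the iteration count never resets
        end
    end
    return curr_iter                      # total number of mini-batches seen
end

count_progress(2, 3)                      # 6 iterations over 2 epochs of 3 batches each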

# MXNet.mx.LearningRate.Exp (Type)

LearningRate.Exp(η₀; γ = 0.9)


The learning rate follows ηₜ = η₀ * γ^t, where t is the epoch count, or the iteration count.
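
A quick sketch evaluating this schedule directly in plain Julia (the scheduler itself is advanced by the training loop):

η₀, γ = 0.1, 0.9
[η₀ * γ^t for t in 0:4]     # ≈ [0.1, 0.09, 0.081, 0.0729, 0.06561]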

# MXNet.mx.LearningRate.Fixed (Type)

LearningRate.Fixed(η)


The fixed learning rate scheduler always returns the same learning rate.

# MXNet.mx.LearningRate.Inv (Type)

LearningRate.Inv(η₀; γ = 0.9, p = 0.5)


The learning rate follows ηₜ = η₀ * (1 + γ * t)^(-p), where t is the epoch count, or the iteration count.
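
Likewise, a plain-Julia sketch of the inverse-decay schedule:

η₀, γ, p = 0.1, 0.9, 0.5
[η₀ * (1 + γ * t)^(-p) for t in 0:4]   # decays slowly from 0.1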

# Base.get (Method)

get(sched::AbstractLearningRateScheduler)


Returns the current learning rate.
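
A hedged usage sketch, assuming the scheduler types are reached through the mx module as in the SGD output above:

sched = mx.LearningRate.Fixed(0.01)
get(sched)      # 0.01 on every call, since the rate is fixed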

# MXNet.mx.Momentum.Fixed (Type)

Momentum.Fixed


The fixed momentum scheduler always returns the same momentum value.

# MXNet.mx.Momentum.NadamScheduler (Type)

NadamScheduler(; μ = 0.99, δ = 0.004, γ = 0.5, α = 0.96)


Description in Incorporating Nesterov Momentum into Adam.

The momentum at iteration t is of the form μₜ = μ₀ * (1 - γ * α^(t * δ)), where

• t: the iteration count
• μ: default 0.99, the base momentum μ₀.
• δ: default 0.004, the scheduler decay.
• γ: default 0.5
• α: default 0.96

# MXNet.mx.Momentum.Null (Type)

Momentum.Null


The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.

# Base.get (Method)

get(n::NadamScheduler, t)


Where t is the iteration count.

## Built-in optimizers

# MXNet.mx.SGD (Type)

SGD(; kwargs...)


Vanilla SGD updates the weights as θ ← θ - η ∇.

SGD with momentum keeps a velocity buffer ν: ν ← μ ν - η ∇, then θ ← θ + ν.

Arguments

• η: default 0.01, learning rate.
• μ: default 0, the momentum, usually set to 0.9 in this implementation.
• λ: default 0.0001, weight decay, which is equivalent to adding a global L2 regularizer on the parameters.
• clip: default 0, gradient clipping. If positive, the gradient is clipped into the range [-clip, clip].
• scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating; it is often chosen to be 1.0 / batch_size. If left at the default, high-level APIs like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
• μ_sched::AbstractMomentumScheduler: default Momentum.Null(), a dynamic momentum scheduler. If set, will overwrite the momentum parameter.
• η_sched::AbstractLearningRateScheduler: default LearningRate.Fixed(η), a dynamic learning rate scheduler. If set, will overwrite the η parameter.
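
A hedged construction sketch combining the arguments above (the values are only illustrative):

opt = SGD(η = 0.01, μ = 0.9, λ = 0.0001,
          η_sched = mx.LearningRate.Exp(0.01; γ = 0.95))
descend! = getupdater(opt)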

# MXNet.mx.ADAM (Type)

 ADAM


The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].

ADAM(; kwargs...)


Arguments

• η: default 0.001, learning rate.
• β1: default 0.9.
• β2: default 0.999.
• ϵ: default 1e-8.
• clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
• scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating; it is often chosen to be 1.0 / batch_size. If left at the default, high-level APIs like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
• λ: default 0.00001, weight decay, which is equivalent to adding a global L2 regularizer for all the parameters.
• η_sched::AbstractLearningRateScheduler: default LearningRate.Fixed(η), a dynamic learning rate scheduler. If set, will overwrite the η parameter.
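
As orientation, here is a plain-Julia sketch of the per-element update ADAM is based on (first- and second-moment estimates with bias correction, as in Kingma & Ba); it is illustrative only, not the library's implementation:

function adam_step!(θ, m, v, g, t; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    m .= β1 .* m .+ (1 - β1) .* g            # first-moment estimate
    v .= β2 .* v .+ (1 - β2) .* g .^ 2       # second-moment estimate
    m̂ = m ./ (1 - β1^t)                      # bias correction
    v̂ = v ./ (1 - β2^t)
    θ .-= η .* m̂ ./ (sqrt.(v̂) .+ ϵ)
    θ
end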

# MXNet.mx.AdaGrad (Type)

AdaGrad(; kwargs...)


Scale learning rates by dividing with the square root of accumulated squared gradients. See [1] for further description.

Arguments

• η: default 0.1, learning rate.
• ϵ: default 1e-6, small value added for numerical stability.
• clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
• scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating; it is often chosen to be 1.0 / batch_size. If left at the default, high-level APIs like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
• λ: default 0.00001, weight decay, which is equivalent to adding a global L2 regularizer for all the parameters.

Notes

Using step size η, AdaGrad calculates the learning rate for feature i at time step t as

η_{t,i} = η / √(∑_{t′=1..t} g_{t′,i}² + ϵ)

so the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
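
A plain-Julia sketch of this accumulation, for illustration only (θ, r, and g are plain arrays; r holds the accumulated squared gradients):

function adagrad_step!(θ, r, g; η = 0.1, ϵ = 1e-6)
    r .+= g .^ 2                        # accumulate squared gradients per feature
    θ .-= η .* g ./ sqrt.(r .+ ϵ)       # effective step size shrinks as r grows
    θ
end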

References

1. Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.

# MXNet.mx.AdaDelta (Type)

AdaDelta(; kwargs...)


Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.

Attributes

• η: default 1.0, learning rate.
• ρ: default 0.95, squared gradient moving average decay factor.
• ϵ: default 1e-6, small value added for numerical stability.
• clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
• scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating; it is often chosen to be 1.0 / batch_size. If left at the default, high-level APIs like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
• λ: default 0.00001, weight decay, which is equivalent to adding a global L2 regularizer for all the parameters.

Notes

ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

ρ = 0.95 and ϵ = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so η = 1.0). Probably best to keep it at this value.

ϵ is important for the very first update (so the numerator does not become 0).

Using the step size η and a decay factor ρ, the learning rate is calculated as

rₜ = ρ rₜ₋₁ + (1 - ρ) g²
ηₜ = η √(sₜ₋₁ + ϵ) / √(rₜ + ϵ)
sₜ = ρ sₜ₋₁ + (1 - ρ) (ηₜ g)²

where r is the moving average of squared gradients and s is the moving average of squared updates.
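
A plain-Julia sketch of these recurrences, for illustration only (r and s are the two moving averages):

function adadelta_step!(θ, r, s, g; η = 1.0, ρ = 0.95, ϵ = 1e-6)
    r .= ρ .* r .+ (1 - ρ) .* g .^ 2                # moving average of squared gradients
    Δ  = η .* sqrt.(s .+ ϵ) ./ sqrt.(r .+ ϵ) .* g   # per-element update
    s .= ρ .* s .+ (1 - ρ) .* Δ .^ 2                # moving average of squared updates
    θ .-= Δ
    θ
end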

References

1. Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701.

# MXNet.mx.AdaMax (Type)

AdaMax(; kwargs...)


This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.

Arguments

• η: default 0.002, learning rate.
• β1: default 0.9, exponential decay rate for the first moment estimates.
• β2: default 0.999, exponential decay rate for the weighted infinity norm estimates.
• ϵ: default 1e-8, small value added for numerical stability.
• clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
• scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating; it is often chosen to be 1.0 / batch_size. If left at the default, high-level APIs like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
• λ: default 0.00001, weight decay, which is equivalent to adding a global L2 regularizer for all the parameters.
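
A plain-Julia sketch of the infinity-norm variant of the Adam step, for illustration only (m is the first-moment estimate, u the exponentially weighted infinity norm; the ϵ term is added here only for numerical stability, matching the argument above):

function adamax_step!(θ, m, u, g, t; η = 0.002, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    m .= β1 .* m .+ (1 - β1) .* g       # first-moment estimate, as in Adam
    u .= max.(β2 .* u, abs.(g))         # exponentially weighted infinity norm
    θ .-= (η / (1 - β1^t)) .* m ./ (u .+ ϵ)
    θ
end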

References

1. Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. Section 7. http://arxiv.org/abs/1412.6980.

# MXNet.mx.RMSProp (Type)

RMSProp(; kwargs...)


Scale learning rates by dividing with the moving average of the root mean squared (RMS) gradients. See [1] for further description.

Arguments

• η: default 0.1, learning rate.
• ρ: default 0.9, gradient moving average decay factor.
• ϵ: default 1e-8, small value added for numerical stability.
• clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
• scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating; it is often chosen to be 1.0 / batch_size. If left at the default, high-level APIs like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
• λ: default 0.00001, weight decay, which is equivalent to adding a global L2 regularizer for all the parameters.

Notes

ρ should be between 0 and 1. A value of ρ close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

Using the step size η and a decay factor ρ, the learning rate ηₜ is calculated as

rₜ = ρ rₜ₋₁ + (1 - ρ) g²
ηₜ = η / √(rₜ + ϵ)
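
A plain-Julia sketch of this recurrence, for illustration only:

function rmsprop_step!(θ, r, g; η = 0.1, ρ = 0.9, ϵ = 1e-8)
    r .= ρ .* r .+ (1 - ρ) .* g .^ 2    # moving average of squared gradients
    θ .-= η .* g ./ sqrt.(r .+ ϵ)
    θ
end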

References

1. Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)

# MXNet.mx.Nadam (Type)

Nadam(; kwargs...)


Nesterov Adam optimizer: Adam with Nesterov momentum. See [1] and the notes for further description.

Arguments

• η: default 0.001, learning rate.
• β1: default 0.99.
• β2: default 0.999.
• ϵ: default 1e-8, small value added for numerical stability.
• clip: default 0, gradient clipping. If positive, will clip the gradient into the range [-clip, clip].
• scale: default 0, gradient rescaling. If != 0, the gradient is multiplied by scale before updating; it is often chosen to be 1.0 / batch_size. If left at the default, high-level APIs like fit! will set it to 1.0 / batch_size, since fit! knows the batch_size.
• λ: default 0.00001, weight decay, which is equivalent to adding a global L2 regularizer for all the parameters.
• η_sched::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the η parameter.
• μ_sched::NadamScheduler: default NadamScheduler(), a dynamic momentum scheduler of the form described for NadamScheduler above.

Notes

Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.
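
A hedged construction sketch combining the documented arguments (values are just the listed defaults; the scheduler is reached through mx.Momentum as in its header above):

opt = Nadam(η = 0.001, β1 = 0.99, β2 = 0.999,
            μ_sched = mx.Momentum.NadamScheduler(μ = 0.99, δ = 0.004))
descend! = getupdater(opt)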

References

1. Dozat, T.: Incorporating Nesterov Momentum into Adam.