# Optimizers

Say you have a parameter `W` initialized for your model, and its gradient stored as `∇` (perhaps computed via the AutoGrad APIs). Here is a minimal snippet showing how to update `W` with `SGD`:

```
julia> using MXNet
julia> opt = SGD(η = 10)
SGD(10, 0.0, 0, 0, 0.0001, MXNet.mx.LearningRate.Fixed(10.0), MXNet.mx.Momentum.Null())
julia> decend! = getupdater(opt)
(::getfield(MXNet.mx, Symbol("#updater#9256")){SGD,Dict{Int64,Any}}) (generic function with 1 method)
julia> W = NDArray(Float32[1, 2, 3, 4]);
julia> ∇ = NDArray(Float32[.1, .2, .3, .4]);
julia> decend!(1, ∇, W)
4-element NDArray{Float32,1} @ cpu0:
-0.0010000467f0
-0.0020000935f0
-0.003000021f0
-0.004000187f0
```

# **MXNet.mx.AbstractOptimizer** — *Type*.

```
AbstractOptimizer
```

Base type for all optimizers.

# **MXNet.mx.getupdater** — *Method*.

```
getupdater(optimizer)
```

A utility function to create an updater function for `KVStore`, which uses a closure to store all the state needed for each weight.

The returned function has the following signature:

```
decend!(index::Int, ∇::NDArray, x::NDArray)
```

If the optimizer is stateful and needs to access/store state during updating, `index` will be the key to access/store states.

# **MXNet.mx.normgrad!** — *Method*.

```
normgrad!(optimizer, W, ∇)
```

Get the properly normalized gradient (re-scaled and clipped if necessary).

- `optimizer`: the optimizer; should contain the fields `scale`, `clip`, and `λ`.
- `W::NDArray`: the trainable weights.
- `∇::NDArray`: the original gradient of the weights.
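The normalization can be sketched on plain Julia arrays (an illustrative re-implementation; `normgrad_sketch` is a made-up name, and it assumes rescaling, clipping, and weight decay are applied in that order):

```julia
# Sketch of gradient normalization: rescale, clip, then add weight decay.
function normgrad_sketch(scale, clip, λ, W, ∇)
    g = copy(∇)
    scale != 0 && (g .*= scale)                # gradient rescaling
    clip > 0 && (g .= clamp.(g, -clip, clip))  # gradient clipping
    g .+= λ .* W                               # weight decay (global L2 regularizer)
    g
end
```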

# **MXNet.mx.AbstractLearningRateScheduler** — *Type*.

```
AbstractLearningRateScheduler
```

Base type for all learning rate schedulers.

# **MXNet.mx.AbstractMomentumScheduler** — *Type*.

```
AbstractMomentumScheduler
```

Base type for all momentum schedulers.

# **MXNet.mx.OptimizationState** — *Type*.

```
OptimizationState
```

**Attributes**

- `batch_size`: The size of the mini-batch used in stochastic training.
- `curr_epoch`: The current epoch count. Epoch 0 means no training yet; during the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.
- `curr_batch`: The current mini-batch count. The batch count is reset during every epoch. A batch count of 0 means the beginning of each epoch, with no mini-batch seen yet. During the first mini-batch, the mini-batch count will be 1.
- `curr_iter`: The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count does **not** reset in each epoch, so it tracks the *total* number of mini-batches seen so far.

# **MXNet.mx.LearningRate.Exp** — *Type*.

```
LearningRate.Exp(η₀; γ = 0.9)
```

```math
\eta_t = \eta_0 \gamma^t
```

where `t` is the epoch count, or the iteration count.
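As a quick sketch, the exponential decay schedule can be computed directly (an illustrative helper, not the library's scheduler type):

```julia
# Exponential decay: η_t = η₀ * γ^t, with t the epoch (or iteration) count.
exp_lr(η₀, γ, t) = η₀ * γ^t

exp_lr(0.1, 0.9, 0)  # t = 0 gives the initial rate η₀
```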

# **MXNet.mx.LearningRate.Fixed** — *Type*.

```
LearningRate.Fixed(η)
```

Fixed learning rate scheduler; always returns the same learning rate.

# **MXNet.mx.LearningRate.Inv** — *Type*.

```
LearningRate.Inv(η₀; γ = 0.9, p = 0.5)
```

```math
\eta_t = \eta_0 (1 + \gamma t)^{-p}
```

where `t` is the epoch count, or the iteration count.
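The inverse decay schedule can likewise be sketched as a plain function (an illustrative helper, not the library's scheduler type):

```julia
# Inverse decay: η_t = η₀ * (1 + γ*t)^(-p)
inv_lr(η₀, γ, p, t) = η₀ * (1 + γ * t)^(-p)

inv_lr(0.01, 0.9, 0.5, 0)  # t = 0 gives the initial rate η₀
```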

# **Base.get** — *Method*.

```
get(sched::AbstractLearningRateScheduler)
```

Returns the current learning rate.

# **MXNet.mx.Momentum.Fixed** — *Type*.

```
Momentum.Fixed
```

Fixed momentum scheduler always returns the same value.

# **MXNet.mx.Momentum.NadamScheduler** — *Type*.

```
NadamScheduler(; μ = 0.99, δ = 0.004, γ = 0.5, α = 0.96)
```

Nesterov-accelerated adaptive momentum scheduler.

Described in *Incorporating Nesterov Momentum into Adam*.

```math
\mu_t = \mu_0 \left(1 - \gamma\, \alpha^{t\,\delta}\right)
```

where

- `t`: iteration count
- `μ`: default `0.99`, μ₀
- `δ`: default `0.004`, scheduler decay
- `γ`: default `0.5`
- `α`: default `0.96`

# **MXNet.mx.Momentum.Null** — *Type*.

```
Momentum.Null
```

The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.

# **Base.get** — *Method*.

```
get(n::NadamScheduler, t)
```

Where `t`

is the iteration count.

## Built-in optimizers

### Stochastic Gradient Descent

# **MXNet.mx.SGD** — *Type*.

```
SGD(; kwargs...)
```

Stochastic gradient descent optimizer.

Vanilla SGD:

```math
\theta \leftarrow \theta - \eta \nabla
```

SGD with momentum:

```math
\nu \leftarrow \mu \nu - \eta \nabla, \qquad
\theta \leftarrow \theta + \nu
```

**Arguments**

- `η`: default `0.01`, learning rate.
- `μ`: default `0`, the momentum, usually set to `0.9` in this implementation.
- `λ`: default `0.0001`; weight decay is equivalent to adding a global L2 regularizer to the parameters.
- `clip`: default `0`, gradient clipping. If positive, will clip the gradient into the bounded range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, multiply the gradient with `scale` before updating. Often chosen to be `1.0 / batch_size`. If left at the default, a high-level API like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `μ_sched::AbstractMomentumScheduler`: default `Momentum.Null()`, a dynamic momentum scheduler. If set, will overwrite the `momentum` parameter.
- `η_sched::AbstractLearningRateScheduler`: default `LearningRate.Fixed(η)`, a dynamic learning rate scheduler. If set, will overwrite the `η` parameter.
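The momentum update above can be sketched on plain Julia vectors (an illustrative re-implementation, ignoring gradient normalization and schedulers; `sgd_momentum!` is a made-up name):

```julia
# One SGD-with-momentum step; ν holds the velocity state across calls.
function sgd_momentum!(θ::Vector, ∇::Vector, ν::Vector; η = 0.01, μ = 0.9)
    @. ν = μ * ν - η * ∇   # update velocity
    @. θ = θ + ν           # apply the update
    θ
end
```

With `μ = 0` this degenerates to vanilla SGD, `θ .-= η .* ∇`.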

### ADAM

# **MXNet.mx.ADAM** — *Type*.

```
ADAM
```

The solver described in Diederik Kingma, Jimmy Ba: *Adam: A Method for Stochastic Optimization*. arXiv:1412.6980 [cs.LG].

```
ADAM(; kwargs...)
```

**Arguments**

- `η`: default `0.001`, learning rate.
- `β1`: default `0.9`.
- `β2`: default `0.999`.
- `ϵ`: default `1e-8`.
- `clip`: default `0`, gradient clipping. If positive, will clip the gradient into the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, multiply the gradient with `scale` before updating. Often chosen to be `1.0 / batch_size`. If left at the default, a high-level API like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`; weight decay is equivalent to adding a global L2 regularizer for all the parameters.
- `η_sched::AbstractLearningRateScheduler`: default `LearningRate.Fixed(η)`, a dynamic learning rate scheduler. If set, will overwrite the `η` parameter.
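The standard Adam update from the paper can be sketched as follows (an illustrative re-implementation on plain arrays, ignoring clipping, rescaling, and weight decay; `adam!` is a made-up name):

```julia
# One Adam step: bias-corrected first (m) and second (v) moment estimates.
function adam!(θ, ∇, m, v, t; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * ∇       # first moment (mean of gradients)
    @. v = β2 * v + (1 - β2) * ∇^2     # second moment (uncentered variance)
    m̂ = m ./ (1 - β1^t)                # bias correction for step t ≥ 1
    v̂ = v ./ (1 - β2^t)
    @. θ -= η * m̂ / (sqrt(v̂) + ϵ)
    θ
end
```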

### AdaGrad

# **MXNet.mx.AdaGrad** — *Type*.

```
AdaGrad(; kwargs...)
```

Scale learning rates by dividing with the square root of accumulated squared gradients. See [1] for further description.

**Arguments**

- `η`: default `0.1`, learning rate.
- `ϵ`: default `1e-6`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, will clip the gradient into the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, multiply the gradient with `scale` before updating. Often chosen to be `1.0 / batch_size`. If left at the default, a high-level API like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`; weight decay is equivalent to adding a global L2 regularizer for all the parameters.

**Notes**

Using step size `η`, AdaGrad calculates the learning rate for feature `i` at time step `t` as:

```math
\eta_{t,i} = \frac{\eta}{\sqrt{\sum_{t'=1}^{t} g_{t',i}^2 + \epsilon}}
```

As such, the learning rate is monotonically decreasing. Epsilon is not included in the typical formula; see [2].
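The accumulation-and-divide rule can be sketched on plain arrays (an illustrative re-implementation, ignoring clipping, rescaling, and weight decay; `adagrad!` is a made-up name):

```julia
# One AdaGrad step; r accumulates squared gradients per coordinate,
# so the effective learning rate shrinks monotonically.
function adagrad!(θ, ∇, r; η = 0.1, ϵ = 1e-6)
    @. r += ∇^2
    @. θ -= η * ∇ / sqrt(r + ϵ)
    θ
end
```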

**References**

- Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
- Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf

### AdaDelta

# **MXNet.mx.AdaDelta** — *Type*.

```
AdaDelta(; kwargs...)
```

Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.

**Attributes**

- `η`: default `1.0`, learning rate.
- `ρ`: default `0.95`, squared gradient moving average decay factor.
- `ϵ`: default `1e-6`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, will clip the gradient into the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, multiply the gradient with `scale` before updating. Often chosen to be `1.0 / batch_size`. If left at the default, a high-level API like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`; weight decay is equivalent to adding a global l2 regularizer for all the parameters.

**Notes**

`ρ`

should be between 0 and 1. A value of `ρ`

close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

`ρ = 0.95`

and `ϵ = 1e-6`

are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so `η = 1.0`

). Probably best to keep it at this value.

`ϵ`

is important for the very first update (so the numerator does not become 0).

Using the step size `η`

and a decay factor `ρ`

the learning rate is calculated as:
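The two-accumulator scheme can be sketched on plain arrays (an illustrative re-implementation, ignoring clipping, rescaling, and weight decay; `adadelta!` is a made-up name):

```julia
# One AdaDelta step; r tracks the moving average of squared gradients,
# s tracks the moving average of squared updates.
function adadelta!(θ, ∇, r, s; η = 1.0, ρ = 0.95, ϵ = 1e-6)
    @. r = ρ * r + (1 - ρ) * ∇^2
    Δ = @. η * sqrt(s + ϵ) / sqrt(r + ϵ) * ∇   # ϵ keeps the first update nonzero
    @. s = ρ * s + (1 - ρ) * Δ^2
    @. θ -= Δ
    θ
end
```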

**References**

- Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701.

### AdaMax

# **MXNet.mx.AdaMax** — *Type*.

```
AdaMax(; kwargs...)
```

This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.

**Arguments**

- `η`: default `0.002`, learning rate.
- `β1`: default `0.9`, exponential decay rate for the first moment estimates.
- `β2`: default `0.999`, exponential decay rate for the weighted infinity norm estimates.
- `ϵ`: default `1e-8`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, will clip the gradient into the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, multiply the gradient with `scale` before updating. Often chosen to be `1.0 / batch_size`. If left at the default, a high-level API like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`; weight decay is equivalent to adding a global L2 regularizer for all the parameters.
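The infinity-norm variant from Section 7 of the Adam paper can be sketched as follows (an illustrative re-implementation on plain arrays, ignoring clipping, rescaling, and weight decay; `adamax!` is a made-up name):

```julia
# One AdaMax step: like Adam, but the second moment is replaced by an
# exponentially weighted infinity norm u, which needs no bias correction.
function adamax!(θ, ∇, m, u, t; η = 0.002, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * ∇
    @. u = max(β2 * u, abs(∇))
    @. θ -= (η / (1 - β1^t)) * m / (u + ϵ)
    θ
end
```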

**References**

- Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. Section 7. http://arxiv.org/abs/1412.6980.

### RMSProp

# **MXNet.mx.RMSProp** — *Type*.

```
RMSProp(; kwargs...)
```

Scale learning rates by dividing with the moving average of the root mean squared (RMS) gradients. See [1] for further description.

**Arguments**

- `η`: default `0.1`, learning rate.
- `ρ`: default `0.9`, gradient moving average decay factor.
- `ϵ`: default `1e-8`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, will clip the gradient into the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, multiply the gradient with `scale` before updating. Often chosen to be `1.0 / batch_size`. If left at the default, a high-level API like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`; weight decay is equivalent to adding a global L2 regularizer for all the parameters.

**Notes**

`ρ`

should be between 0 and 1. A value of `ρ`

close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

Using the step size `η`

and a decay factor `ρ the learning rate`

ηₜ` is calculated as:
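The moving-average rule can be sketched on plain arrays (an illustrative re-implementation, ignoring clipping, rescaling, and weight decay; `rmsprop!` is a made-up name):

```julia
# One RMSProp step; unlike AdaGrad, r is a decaying moving average of
# squared gradients, so the effective learning rate can recover.
function rmsprop!(θ, ∇, r; η = 0.1, ρ = 0.9, ϵ = 1e-8)
    @. r = ρ * r + (1 - ρ) * ∇^2
    @. θ -= η * ∇ / sqrt(r + ϵ)
    θ
end
```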

**References**

- Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)

### Nadam

# **MXNet.mx.Nadam** — *Type*.

```
Nadam(; kwargs...)
```

Nesterov Adam optimizer: Adam RMSprop with Nesterov momentum, see [1] and notes for further description.

**Arguments**

- `η`: default `0.001`, learning rate.
- `β1`: default `0.99`.
- `β2`: default `0.999`.
- `ϵ`: default `1e-8`, small value added for numerical stability.
- `clip`: default `0`, gradient clipping. If positive, will clip the gradient into the range `[-clip, clip]`.
- `scale`: default `0`, gradient rescaling. If != 0, multiply the gradient with `scale` before updating. Often chosen to be `1.0 / batch_size`. If left at the default, a high-level API like `fit!` will set it to `1.0 / batch_size`, since `fit!` knows the `batch_size`.
- `λ`: default `0.00001`; weight decay is equivalent to adding a global L2 regularizer for all the parameters.
- `η_sched::AbstractLearningRateScheduler`: default `nothing`, a dynamic learning rate scheduler. If set, will overwrite the `η` parameter.
- `μ_sched::NadamScheduler`: default `NadamScheduler()` of the form

```math
\mu_t = \mu_0 \left(1 - \gamma\, \alpha^{t\,\delta}\right)
```

**Notes**

Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.

**References**