Gluon Package

Warning

This package is currently experimental and may change in the near future.

Overview

The Gluon package is a high-level interface for MXNet designed to be easy to use while keeping most of the flexibility of the low-level API. Gluon supports both imperative and symbolic programming, making it easy to train complex models imperatively in Python and then deploy them with a symbolic graph in C++ and Scala.

Parameter

Parameter A Container holding parameters (weights) of `Block`s.
ParameterDict A dictionary managing a set of parameters.

Containers

Block Base class for all neural network layers and models.
HybridBlock HybridBlock supports forwarding with both Symbol and NDArray.
SymbolBlock Construct block from symbol.

Neural Network Layers

Containers

Sequential Stacks `Block`s sequentially.
HybridSequential Stacks `HybridBlock`s sequentially.

Basic Layers

Dense Just your regular densely-connected NN layer.
Activation Applies an activation function to input.
Dropout Applies Dropout to the input.
BatchNorm Batch normalization layer (Ioffe and Szegedy, 2014).
LeakyReLU Leaky version of a Rectified Linear Unit.
Embedding Turns non-negative integers (indexes/tokens) into dense vectors of fixed size.

Convolutional Layers

Conv1D 1D convolution layer (e.g. temporal convolution).
Conv2D 2D convolution layer (e.g. spatial convolution over images).
Conv3D 3D convolution layer (e.g. spatial convolution over volumes).
Conv1DTranspose Transposed 1D convolution layer (sometimes called Deconvolution).
Conv2DTranspose Transposed 2D convolution layer (sometimes called Deconvolution).
Conv3DTranspose Transposed 3D convolution layer (sometimes called Deconvolution).

Pooling Layers

MaxPool1D Max pooling operation for one dimensional data.
MaxPool2D Max pooling operation for two dimensional (spatial) data.
MaxPool3D Max pooling operation for 3D data (spatial or spatio-temporal).
AvgPool1D Average pooling operation for temporal data.
AvgPool2D Average pooling operation for spatial data.
AvgPool3D Average pooling operation for 3D data (spatial or spatio-temporal).
GlobalMaxPool1D Global max pooling operation for temporal data.
GlobalMaxPool2D Global max pooling operation for spatial data.
GlobalMaxPool3D Global max pooling operation for 3D data.
GlobalAvgPool1D Global average pooling operation for temporal data.
GlobalAvgPool2D Global average pooling operation for spatial data.
GlobalAvgPool3D Global average pooling operation for 3D data.

Recurrent Layers

RecurrentCell Abstract base class for RNN cells.
RNN Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.
LSTM Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
GRU Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
RNNCell Simple recurrent neural network cell.
LSTMCell Long short-term memory (LSTM) network cell.
GRUCell Gated Recurrent Unit (GRU) network cell.
SequentialRNNCell Sequentially stacking multiple RNN cells.
BidirectionalCell Bidirectional RNN cell.
DropoutCell Applies dropout on input.
ZoneoutCell Applies Zoneout on base cell.
ResidualCell Adds residual connection as described in Wu et al, 2016 (https://arxiv.org/abs/1609.08144).

Trainer

Trainer Applies an Optimizer on a set of Parameters.

Loss functions

L2Loss Calculates the mean squared error between output and label.
L1Loss Calculates the mean absolute error between output and label.
SoftmaxCrossEntropyLoss Computes the softmax cross entropy loss.
KLDivLoss The Kullback-Leibler divergence loss.

Utilities

split_data Splits an NDArray into num_slice slices along batch_axis.
split_and_load Splits an NDArray into len(ctx_list) slices along batch_axis and loads each slice to one context in ctx_list.
clip_global_norm Rescales NDArrays so that the sum of their 2-norm is smaller than max_norm.
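
The rescaling that clip_global_norm describes can be sketched in plain Python on lists of floats. This is a minimal illustration, not the library code: it assumes the global norm is the square root of the summed squared 2-norms of all arrays, and the function name `clip_global_norm_sketch` and the small `1e-8` stabilizer are hypothetical choices.

```python
import math

def clip_global_norm_sketch(arrays, max_norm):
    """Rescale lists of floats in place so their combined 2-norm is at most max_norm.

    Assumption: the global norm is sqrt(sum of squares over all arrays).
    Returns the global norm measured before rescaling.
    """
    total_norm = math.sqrt(sum(v * v for array in arrays for v in array))
    scale = max_norm / (total_norm + 1e-8)  # stabilizer avoids division by zero
    if scale < 1.0:
        for array in arrays:
            for i in range(len(array)):
                array[i] *= scale
    return total_norm

grads = [[3.0], [4.0]]
norm_before = clip_global_norm_sketch(grads, max_norm=1.0)
print(norm_before, grads)
```

Arrays whose global norm is already below max_norm are left unchanged, because the scale factor is then greater than 1.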

Data

Dataset Abstract dataset class.
ArrayDataset A dataset with a data array and a label array.
RecordFileDataset A dataset wrapping over a RecordIO (.rec) file.
ImageRecordDataset
Sampler Base class for samplers.
SequentialSampler Samples elements from [0, length) sequentially.
RandomSampler Samples elements from [0, length) randomly without replacement.
BatchSampler Wraps over another Sampler and returns mini-batches of samples.
DataLoader Loads data from a dataset and returns mini-batches of data.

Vision

MNIST MNIST handwritten digits dataset from http://yann.lecun.com/exdb/mnist.
FashionMNIST A dataset of Zalando’s article images consisting of fashion products, a drop-in replacement of the original MNIST dataset, from https://github.com/zalandoresearch/fashion-mnist.
CIFAR10 CIFAR10 image classification dataset from https://www.cs.toronto.edu/~kriz/cifar.html.

Model Zoo

Model zoo provides pre-defined and pre-trained models to help bootstrap machine learning applications.

Vision

Module for pre-defined neural network models.

This module contains definitions for the following model architectures:

  • AlexNet
  • DenseNet
  • Inception V3
  • ResNet V1
  • ResNet V2
  • SqueezeNet
  • VGG

You can construct a model with random weights by calling its constructor:

from mxnet.gluon.model_zoo import vision as models
resnet18 = models.resnet18_v1()
alexnet = models.alexnet()
squeezenet = models.squeezenet1_0()
densenet = models.densenet161()

We provide pre-trained models for all the models except ResNet V2. These can be constructed by passing pretrained=True:

from mxnet.gluon.model_zoo import vision as models
resnet18 = models.resnet18_v1(pretrained=True)
alexnet = models.alexnet(pretrained=True)

Pretrained models are converted from torchvision. All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (N x 3 x H x W), where N is the batch size, and H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. The transformation should preferably happen at preprocessing. You can use mx.image.color_normalize for such a transformation:

image = image/255
normalized = mx.image.color_normalize(image,
                                      mean=mx.nd.array([0.485, 0.456, 0.406]),
                                      std=mx.nd.array([0.229, 0.224, 0.225]))
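
As a rough illustration of what this preprocessing computes, the following pure-Python sketch applies the same scaling and per-channel normalization to a single RGB pixel. The helper name `normalize_pixel` is hypothetical and not part of MXNet; only the mean and std values come from the text above.

```python
# Per-channel statistics quoted in the text above (R, G, B order).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then normalize each channel.

    Hypothetical helper for illustration only; not an MXNet function.
    """
    scaled = [value / 255 for value in rgb]
    return [(s - m) / sd for s, m, sd in zip(scaled, MEAN, STD)]

pixel = normalize_pixel([255, 0, 128])
print(pixel)
```

A fully saturated channel maps to (1 - mean) / std and a zero channel maps to -mean / std, so the normalized values are roughly centered around zero.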

get_model Returns a pre-defined model by name.

ResNet

resnet18_v1 ResNet-18 V1 model from “Deep Residual Learning for Image Recognition” paper.
resnet34_v1 ResNet-34 V1 model from “Deep Residual Learning for Image Recognition” paper.
resnet50_v1 ResNet-50 V1 model from “Deep Residual Learning for Image Recognition” paper.
resnet101_v1 ResNet-101 V1 model from “Deep Residual Learning for Image Recognition” paper.
resnet152_v1 ResNet-152 V1 model from “Deep Residual Learning for Image Recognition” paper.
resnet18_v2 ResNet-18 V2 model from “Identity Mappings in Deep Residual Networks” paper.
resnet34_v2 ResNet-34 V2 model from “Identity Mappings in Deep Residual Networks” paper.
resnet50_v2 ResNet-50 V2 model from “Identity Mappings in Deep Residual Networks” paper.
resnet101_v2 ResNet-101 V2 model from “Identity Mappings in Deep Residual Networks” paper.
resnet152_v2 ResNet-152 V2 model from “Identity Mappings in Deep Residual Networks” paper.
ResNetV1 ResNet V1 model from “Deep Residual Learning for Image Recognition” paper.
ResNetV2 ResNet V2 model from “Identity Mappings in Deep Residual Networks” paper.
BasicBlockV1 BasicBlock V1 from “Deep Residual Learning for Image Recognition” paper.
BasicBlockV2 BasicBlock V2 from “Identity Mappings in Deep Residual Networks” paper.
BottleneckV1 Bottleneck V1 from “Deep Residual Learning for Image Recognition” paper.
BottleneckV2 Bottleneck V2 from “Identity Mappings in Deep Residual Networks” paper.
get_resnet ResNet V1 model from “Deep Residual Learning for Image Recognition” paper.

AlexNet

alexnet AlexNet model from the “One weird trick...” paper.
AlexNet AlexNet model from the “One weird trick...” paper.

DenseNet

densenet121 Densenet-BC 121-layer model from the “Densely Connected Convolutional Networks” paper.
densenet161 Densenet-BC 161-layer model from the “Densely Connected Convolutional Networks” paper.
densenet169 Densenet-BC 169-layer model from the “Densely Connected Convolutional Networks” paper.
densenet201 Densenet-BC 201-layer model from the “Densely Connected Convolutional Networks” paper.
DenseNet Densenet-BC model from the “Densely Connected Convolutional Networks” paper.

API Reference

class mxnet.gluon.Parameter(name, grad_req='write', shape=None, dtype='float32', lr_mult=1.0, wd_mult=1.0, init=None, allow_deferred_init=False, differentiable=True)

A Container holding parameters (weights) of `Block`s.

Parameter holds a copy of the parameter on each Context after it is initialized with Parameter.initialize(...). If grad_req is not null, it will also hold a gradient array on each Context:

import mxnet as mx

ctx = mx.gpu(0)
x = mx.nd.zeros((16, 100), ctx=ctx)
w = mx.gluon.Parameter('fc_weight', shape=(64, 100), init=mx.init.Xavier())
b = mx.gluon.Parameter('fc_bias', shape=(64,), init=mx.init.Zero())
w.initialize(ctx=ctx)
b.initialize(ctx=ctx)
out = mx.nd.FullyConnected(x, w.data(ctx), b.data(ctx), num_hidden=64)

Parameters:
  • name (str) – Name of this parameter.
  • grad_req ({'write', 'add', 'null'}, default 'write') –

    Specifies how to update gradient to grad arrays.

    • ‘write’ means the gradient is written to the grad NDArray every time.
    • ‘add’ means the gradient is added to the grad NDArray every time. You need to manually call zero_grad() to clear the gradient buffer before each iteration when using this option.
    • ‘null’ means gradient is not requested for this parameter. Gradient arrays will not be allocated.
  • shape (tuple of int, default None) – Shape of this parameter. By default the shape is not specified. A Parameter with unknown shape can be used with the Symbol API, but initialization will throw an error when using the NDArray API.
  • dtype (numpy.dtype or str, default 'float32') – Data type of this parameter. For example, numpy.float32 or ‘float32’.
  • lr_mult (float, default 1.0) – Learning rate multiplier. Learning rate will be multiplied by lr_mult when updating this parameter with optimizer.
  • wd_mult (float, default 1.0) – Weight decay multiplier (L2 regularizer coefficient). Works similarly to lr_mult.
  • init (Initializer, default None) – Initializer of this parameter. Will use the global initializer by default.
grad_req

{‘write’, ‘add’, ‘null’}

This can be set before or after initialization. Setting grad_req to null with x.grad_req = ‘null’ saves memory and computation when you don’t need gradient w.r.t x.
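
The difference between the three grad_req modes can be sketched with a toy stand-in that stores a scalar gradient. `ToyParameter` and `receive_gradient` are hypothetical names used only for illustration; the real Parameter keeps NDArray buffers per context and is driven by autograd.

```python
class ToyParameter:
    """Toy stand-in that mimics the three grad_req modes on a scalar gradient."""

    def __init__(self, grad_req="write"):
        self.grad_req = grad_req
        # No buffer is allocated when gradients are not requested.
        self.grad = None if grad_req == "null" else 0.0

    def receive_gradient(self, value):
        """Hypothetical hook standing in for one backward pass."""
        if self.grad_req == "write":
            self.grad = value      # overwritten on every backward pass
        elif self.grad_req == "add":
            self.grad += value     # accumulated until zero_grad() is called
        # 'null': gradient is discarded

    def zero_grad(self):
        if self.grad is not None:
            self.grad = 0.0

w, a, n = ToyParameter("write"), ToyParameter("add"), ToyParameter("null")
for g in (1.0, 2.0):
    for p in (w, a, n):
        p.receive_gradient(g)
print(w.grad, a.grad, n.grad)
```

After two passes the 'write' buffer holds only the latest gradient while the 'add' buffer holds the sum, which is why 'add' requires an explicit zero_grad() between iterations.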

initialize(init=None, ctx=None, default_init=, force_reinit=False)

Initializes parameter and gradient arrays. Only used for NDArray API.

Parameters:
  • init (Initializer) – The initializer to use. Overrides Parameter.init and default_init.
  • ctx (Context or list of Context, defaults to context.current_context().) –

    Initialize Parameter on given context. If ctx is a list of Context, a copy will be made for each context.

    Note

    Copies are independent arrays. User is responsible for keeping their values consistent when updating. Normally gluon.Trainer does this for you.

  • default_init (Initializer) – Default initializer is used when both init and Parameter.init are None.
  • force_reinit (bool, default False) – Whether to force re-initialization if parameter is already initialized.

Examples

>>> weight = mx.gluon.Parameter('weight', shape=(2, 2))
>>> weight.initialize(ctx=mx.cpu(0))
>>> weight.data()
[[-0.01068833  0.01729892]
 [ 0.02042518 -0.01618656]]

>>> weight.grad()
[[ 0.  0.]
 [ 0.  0.]]

>>> weight.initialize(ctx=[mx.gpu(0), mx.gpu(1)])
>>> weight.data(mx.gpu(0))
[[-0.00873779 -0.02834515]
 [ 0.05484822 -0.06206018]]

>>> weight.data(mx.gpu(1))
[[-0.00873779 -0.02834515]
 [ 0.05484822 -0.06206018]]

reset_ctx(ctx)

Re-assign Parameter to other contexts.

Parameters:ctx (Context or list of Context, default context.current_context()) – Assign Parameter to given context. If ctx is a list of Context, a copy will be made for each context.
set_data(data)

Sets this parameter’s value on all contexts to data.

data(ctx=None)

Returns a copy of this parameter on one context. Must have been initialized on this context before.

Parameters:ctx (Context) – Desired context.
Returns:NDArray on ctx
list_data()

Returns copies of this parameter on all contexts, in the same order as creation.

grad(ctx=None)

Returns a gradient buffer for this parameter on one context.

Parameters:ctx (Context) – Desired context.
list_grad()

Returns gradient buffers on all contexts, in the same order as values.

list_ctx()

Returns a list of contexts this parameter is initialized on.

zero_grad()

Sets gradient buffer on all contexts to 0. No action is taken if parameter is uninitialized or doesn’t require gradient.

var()

Returns a symbol representing this parameter.

class mxnet.gluon.ParameterDict(prefix='', shared=None)

A dictionary managing a set of parameters.

Parameters:
  • prefix (str, default '') – The prefix to be prepended to all Parameters’ names created by this dict.
  • shared (ParameterDict or None) – If not None, when this dict’s get method creates a new parameter, will first try to retrieve it from shared dict. Usually used for sharing parameters with another Block.
prefix

Prefix of this dict. It will be prepended to Parameters’ name created with get.

get(name, **kwargs)

Retrieves a Parameter with name self.prefix+name. If not found, get will first try to retrieve it from the shared dict. If still not found, get will create a new Parameter with the keyword arguments and insert it into self.

Parameters:
  • name (str) – Name of the desired Parameter. It will be prepended with this dictionary’s prefix.
  • **kwargs

    The rest of the keyword arguments for the created Parameter.

Returns:

The created or retrieved Parameter.

Return type:

Parameter
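
The lookup order that get describes (own entries, then the shared dict, then creation) can be sketched with a toy dictionary. `ToyParameterDict` is a hypothetical stand-in, and parameters are modeled as plain dicts rather than Parameter objects.

```python
class ToyParameterDict:
    """Toy model of prefix handling and shared lookup; not the MXNet class."""

    def __init__(self, prefix="", shared=None):
        self.prefix = prefix
        self.shared = shared
        self._params = {}

    def get(self, name, **kwargs):
        full_name = self.prefix + name
        # 1. Already managed by this dict.
        if full_name in self._params:
            return self._params[full_name]
        # 2. Available in the shared dict: reuse the same object.
        if self.shared is not None and full_name in self.shared._params:
            param = self.shared._params[full_name]
        # 3. Otherwise create a new entry from the keyword arguments.
        else:
            param = {"name": full_name, **kwargs}
        self._params[full_name] = param
        return param

shared = ToyParameterDict(prefix="dense0_")
weight = shared.get("weight", shape=(20, 10))

other = ToyParameterDict(prefix="dense0_", shared=shared)
reused = other.get("weight")            # found in the shared dict
bias = other.get("bias", shape=(20,))   # created, since it is in neither dict
print(reused is weight, bias["name"])
```

Because the shared entry is returned by reference, both dicts see the same parameter object, which is the mechanism behind weight sharing between Blocks.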

update(other)

Copies all Parameters in other to self.

initialize(init=, ctx=None, verbose=False, force_reinit=False)

Initializes all Parameters managed by this dictionary to be used for NDArray API. It has no effect when using Symbol API.

Parameters:
  • init (Initializer) – Global default Initializer to be used when Parameter.init is None. Otherwise, Parameter.init takes precedence.
  • ctx (Context or list of Context) – Keeps a copy of Parameters on one or many context(s).
  • force_reinit (bool, default False) – Whether to force re-initialization if parameter is already initialized.
zero_grad()

Sets all Parameters’ gradient buffer to 0.

reset_ctx(ctx)

Re-assign all Parameters to other contexts.

Parameters:ctx (Context or list of Context, default context.current_context()) – Assign Parameter to given context. If ctx is a list of Context, a copy will be made for each context.
setattr(name, value)

Set an attribute to a new value for all Parameters.

For example, set grad_req to null if you don’t need gradient w.r.t a model’s Parameters:

model.collect_params().setattr('grad_req', 'null')

or change the learning rate multiplier:

model.collect_params().setattr('lr_mult', 0.5)
Parameters:
  • name (str) – Name of the attribute.
  • value (valid type for attribute name) – The new value for the attribute.
save(filename, strip_prefix='')

Save parameters to file.

Parameters:
  • filename (str) – Path to parameter file.
  • strip_prefix (str, default '') – Strip prefix from parameter names before saving.
load(filename, ctx, allow_missing=False, ignore_extra=False, restore_prefix='')

Load parameters from file.

Parameters:
  • filename (str) – Path to parameter file.
  • ctx (Context or list of Context) – Context(s) on which to initialize loaded parameters.
  • allow_missing (bool, default False) – Whether to silently skip loading parameters not represented in the file.
  • ignore_extra (bool, default False) – Whether to silently ignore parameters from the file that are not present in this ParameterDict.
  • restore_prefix (str, default '') – Prefix to prepend to names of stored parameters before loading.
class mxnet.gluon.Block(prefix=None, params=None)

Base class for all neural network layers and models. Your models should subclass this class.

Block can be nested recursively in a tree structure. You can create and assign child Block as regular attributes:

import mxnet as mx
from mxnet.gluon import Block, nn
from mxnet import ndarray as F

class Model(Block):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        # use name_scope to give child Blocks appropriate names.
        # It also allows sharing Parameters between Blocks recursively.
        with self.name_scope():
            self.dense0 = nn.Dense(20)
            self.dense1 = nn.Dense(20)

    def forward(self, x):
        x = F.relu(self.dense0(x))
        return F.relu(self.dense1(x))

model = Model()
model.initialize(ctx=mx.cpu(0))
model(F.zeros((10, 10), ctx=mx.cpu(0)))

Child Block assigned this way will be registered and collect_params will collect their Parameters recursively.

Parameters:
  • prefix (str) – Prefix acts like a name space. It will be prepended to the names of all Parameters and child Blocks in this Block’s name_scope. Prefix should be unique within one model to prevent name collisions.
  • params (ParameterDict or None) –

    ParameterDict for sharing weights with the new Block. For example, if you want dense1 to share dense0’s weights, you can do:

    dense0 = nn.Dense(20)
    dense1 = nn.Dense(20, params=dense0.collect_params())
    
__setattr__(name, value)

Registers parameters.

prefix

Prefix of this Block.

name

Name of this Block, without ‘_’ at the end.

name_scope()

Returns a name space object managing a child Block and parameter names. Should be used within a with statement:

with self.name_scope():
    self.dense = nn.Dense(20)
params

Returns this Block’s parameter dictionary (does not include its children’s parameters).

collect_params()

Returns a ParameterDict containing this Block and all of its children’s Parameters.

save_params(filename)

Save parameters to file.

Parameters:filename (str) – Path to file.
load_params(filename, ctx, allow_missing=False, ignore_extra=False)

Load parameters from file.

Parameters:
  • filename (str) – Path to parameter file.
  • ctx (Context or list of Context) – Context(s) on which to initialize loaded parameters.
  • allow_missing (bool, default False) – Whether to silently skip loading parameters not represented in the file.
  • ignore_extra (bool, default False) – Whether to silently ignore parameters from the file that are not present in this Block.
register_child(block)

Registers block as a child of self. `Block`s assigned to self as attributes will be registered automatically.

initialize(init=, ctx=None, verbose=False)

Initializes Parameters of this Block and its children.

Equivalent to block.collect_params().initialize(...)

hybridize(active=True)

Activates or deactivates `HybridBlock`s recursively. Has no effect on non-hybrid children.

Parameters:active (bool, default True) – Whether to turn hybrid on or off.
__call__(*args)

Calls forward. Only accepts positional arguments.

forward(*args)

Override to implement forward computation using NDArray. Only accepts positional arguments.

Parameters:*args

Input tensors.

class mxnet.gluon.HybridBlock(prefix=None, params=None)

HybridBlock supports forwarding with both Symbol and NDArray.

Forward computation in HybridBlock must be static to work with Symbols, i.e. you cannot call .asnumpy(), .shape, .dtype, etc. on tensors. Also, you cannot use branching or loop logic that depends on non-constant expressions like random numbers or intermediate results, since they change the graph structure for each iteration.

Before activating with hybridize(), HybridBlock works just like normal Block. After activation, HybridBlock will create a symbolic graph representing the forward computation and cache it. On subsequent forwards, the cached graph will be used instead of hybrid_forward.

Refer to the Hybrid tutorial to see the end-to-end usage.

__setattr__(name, value)

Registers parameters.

infer_shape(*args)

Infers shape of Parameters from inputs.

forward(x, *args)

Defines the forward computation. Arguments can be either NDArray or Symbol.

hybrid_forward(F, x, *args, **kwargs)

Override to construct symbolic graph for this Block.

Parameters:
  • x (Symbol or NDArray) – The first input tensor.
  • *args

    Additional input tensors.

class mxnet.gluon.SymbolBlock(outputs, inputs, params=None)

Construct block from symbol. This is useful for using pre-trained models as feature extractors. For example, you may want to extract the output from the fc2 layer in AlexNet.

Parameters:
  • outputs (Symbol or list of Symbol) – The desired output for SymbolBlock.
  • inputs (Symbol or list of Symbol) – The Variables in output’s argument that should be used as inputs.
  • params (ParameterDict) – Parameter dictionary for arguments and auxiliary states of outputs that are not inputs.

Examples

>>> # To extract the feature from fc1 and fc2 layers of AlexNet:
>>> alexnet = gluon.model_zoo.vision.alexnet(pretrained=True, ctx=mx.cpu(),
                                             prefix='model_')
>>> inputs = mx.sym.var('data')
>>> out = alexnet(inputs)
>>> internals = out.get_internals()
>>> print(internals.list_outputs())
['data', ..., 'model_dense0_relu_fwd_output', ..., 'model_dense1_relu_fwd_output', ...]
>>> outputs = [internals['model_dense0_relu_fwd_output'],
               internals['model_dense1_relu_fwd_output']]
>>> # Create SymbolBlock that shares parameters with alexnet
>>> feat_model = gluon.SymbolBlock(outputs, inputs, params=alexnet.collect_params())
>>> x = mx.nd.random.normal(shape=(16, 3, 224, 224))
>>> print(feat_model(x))
class mxnet.gluon.nn.Sequential(prefix=None, params=None)

Stacks `Block`s sequentially.

Example:

net = nn.Sequential()
# use net's name_scope to give child Blocks appropriate names.
with net.name_scope():
    net.add(nn.Dense(10, activation='relu'))
    net.add(nn.Dense(20))
add(block)

Adds block on top of the stack.

class mxnet.gluon.nn.HybridSequential(prefix=None, params=None)

Stacks `HybridBlock`s sequentially.

Example:

net = nn.HybridSequential()
# use net's name_scope to give child Blocks appropriate names.
with net.name_scope():
    net.add(nn.Dense(10, activation='relu'))
    net.add(nn.Dense(20))
add(block)

Adds block on top of the stack.

class mxnet.gluon.nn.Dense(units, activation=None, use_bias=True, flatten=True, weight_initializer=None, bias_initializer='zeros', in_units=0, **kwargs)

Just your regular densely-connected NN layer.

Dense implements the operation: output = activation(dot(input, weight) + bias) where activation is the element-wise activation function passed as the activation argument, weight is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).

Note: the input must be a tensor with rank 2. Use flatten to convert it to rank 2 manually if necessary.

Parameters:
  • units (int) – Dimensionality of the output space.
  • activation (str) – Activation function to use. See help on Activation layer. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • flatten (bool) – Whether the input tensor should be flattened. If true, all but the first axis of input data are collapsed together. If false, all but the last axis of input data are kept the same, and the transformation applies on the last axis.
  • weight_initializer (str or Initializer) – Initializer for the kernel weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
  • in_units (int, optional) – Size of the input data. If not specified, initialization will be deferred to the first time forward is called and in_units will be inferred from the shape of input data.
  • prefix (str or None) – See document of Block.
  • params (ParameterDict or None) – See document of Block.

If flatten is set to True, then the shapes are:

Input shape:
An N-D input with shape (batch_size, x1, x2, ..., xn) with x1 * x2 * ... * xn equal to in_units.
Output shape:
The output would have shape (batch_size, units).

If flatten is set to False, then the shapes are:

Input shape:
An N-D input with shape (x1, x2, ..., xn, in_units).
Output shape:
The output would have shape (x1, x2, ..., xn, units).
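
The operation output = activation(dot(input, weight) + bias) can be sketched on nested Python lists for a rank-2 input. This is an illustration under stated assumptions, not the layer's implementation: `dense_forward` and `relu` are hypothetical helpers, and the weight is assumed to be stored as (units, in_units), matching the (64, 100) weight used with num_hidden=64 in the Parameter example earlier in this document.

```python
def dense_forward(rows, weight, bias=None, activation=None):
    """Compute activation(rows . weight^T + bias) for a rank-2 input.

    rows:   input of shape (batch_size, in_units)
    weight: assumed layout (units, in_units)
    bias:   list of length units, or None when use_bias is False
    """
    outputs = []
    for row in rows:
        # One dot product per output unit.
        out = [sum(x * w for x, w in zip(row, unit_w)) for unit_w in weight]
        if bias is not None:
            out = [o + b for o, b in zip(out, bias)]
        if activation is not None:
            out = [activation(o) for o in out]
        outputs.append(out)
    return outputs

def relu(value):
    return max(0.0, value)

x = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]   # (batch_size=2, in_units=3)
w = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]   # (units=2, in_units=3)
b = [0.5, 0.0]
y = dense_forward(x, w, b, activation=relu)
print(y)  # [[1.5, 0.0], [0.5, 0.0]]
```

The result has shape (batch_size, units), in line with the output shape stated for flatten=True.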
class mxnet.gluon.nn.Activation(activation, **kwargs)

Applies an activation function to input.

Parameters:activation (str) – Name of activation function to use. See Activation() for available choices.
Input shape:
Arbitrary.
Output shape:
Same shape as input.
class mxnet.gluon.nn.Dropout(rate, **kwargs)

Applies Dropout to the input.

Dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting.

Parameters:rate (float) – Fraction of the input units to drop. Must be a number between 0 and 1.
Input shape:
Arbitrary.
Output shape:
Same shape as input.
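
A minimal sketch of the random zeroing described above, on a flat Python list. The rescaling of kept units by 1 / (1 - rate) ("inverted dropout") is an assumption of this sketch and is not stated in the text; the `dropout` helper and its `training` flag are hypothetical.

```python
import random

def dropout(values, rate, rng, training=True):
    """Zero each value with probability `rate` during training.

    Assumption: kept values are scaled by 1 / (1 - rate) so the expected
    sum is unchanged. Outside training the input is returned as is.
    """
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)  # seeded so repeated runs give the same mask
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)
print(dropped)
```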

References

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

class mxnet.gluon.nn.BatchNorm(axis=1, momentum=0.9, epsilon=1e-05, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', running_mean_initializer='zeros', running_variance_initializer='ones', in_channels=0, **kwargs)

Batch normalization layer (Ioffe and Szegedy, 2014). Normalizes the input at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.

Parameters:
  • axis (int, default 1) – The axis that should be normalized. This is typically the channels (C) axis. For instance, after a Conv2D layer with layout=’NCHW’, set axis=1 in BatchNorm. If layout=’NHWC’, then set axis=3.
  • momentum (float, default 0.9) – Momentum for the moving average.
  • epsilon (float, default 1e-5) – Small float added to variance to avoid dividing by zero.
  • center (bool, default True) – If True, add offset of beta to normalized tensor. If False, beta is ignored.
  • scale (bool, default True) – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
  • beta_initializer (str or Initializer, default ‘zeros’) – Initializer for the beta weight.
  • gamma_initializer (str or Initializer, default ‘ones’) – Initializer for the gamma weight.
  • running_mean_initializer (str or Initializer, default ‘zeros’) – Initializer for the moving mean.
  • running_variance_initializer (str or Initializer, default ‘ones’) – Initializer for the moving variance.
  • in_channels (int, default 0) – Number of channels (feature maps) in input data. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Input shape:
Arbitrary.
Output shape:
Same shape as input.
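
The normalization applied to one channel can be sketched as follows, assuming training-mode behavior where the mean and variance come from the current batch. The helper `batch_norm_channel` is hypothetical, and the moving averages controlled by momentum are left out of this sketch.

```python
import math

def batch_norm_channel(values, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Normalize the values of one channel across a batch, then scale and shift.

    Uses the batch mean and (biased) batch variance; running statistics
    are intentionally omitted.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = gamma / math.sqrt(variance + epsilon)
    return [(v - mean) * scale + beta for v in values]

normalized = batch_norm_channel([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

With gamma=1 and beta=0 the output has mean close to 0 and standard deviation close to 1; epsilon keeps the division well defined when the batch variance is zero.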
class mxnet.gluon.nn.LeakyReLU(alpha, **kwargs)

Leaky version of a Rectified Linear Unit.

It allows a small gradient when the unit is not active:

`f(x) = alpha * x for x < 0`,
`f(x) = x for x >= 0`.
Parameters:alpha (float) – Slope coefficient for the negative half axis. Must be >= 0.
Input shape:
Arbitrary.
Output shape:
Same shape as input.
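
The piecewise definition above translates directly into a scalar function. The name `leaky_relu` is a hypothetical helper for illustration; the layer itself applies the same rule element-wise to a tensor.

```python
def leaky_relu(x, alpha):
    """f(x) = alpha * x for x < 0, and f(x) = x for x >= 0."""
    if alpha < 0:
        raise ValueError("alpha must be >= 0")
    return x if x >= 0 else alpha * x

activations = [leaky_relu(v, alpha=0.1) for v in (-2.0, 0.0, 3.0)]
print(activations)
```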
class mxnet.gluon.nn.Embedding(input_dim, output_dim, dtype='float32', weight_initializer=None, **kwargs)

Turns non-negative integers (indexes/tokens) into dense vectors of fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

Parameters:
  • input_dim (int) – Size of the vocabulary, i.e. maximum integer index + 1.
  • output_dim (int) – Dimension of the dense embedding.
  • dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
  • weight_initializer (Initializer) – Initializer for the embeddings matrix.
Input shape:
2D tensor with shape: (N, M).
Output shape:
3D tensor with shape: (N, M, output_dim).
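
An embedding lookup is a row selection from a weight table, which the following sketch shows on nested lists. The helper `embed` and the table values are made up for illustration; the point is the shape change from (N, M) indices to (N, M, output_dim) vectors.

```python
def embed(indices, table):
    """Replace every integer index with the corresponding row of `table`.

    indices: nested list of shape (N, M)
    table:   nested list of shape (input_dim, output_dim)
    """
    return [[table[i] for i in row] for row in indices]

# input_dim=3 (valid indices 0..2), output_dim=2; values are arbitrary.
table = [[0.0, 0.0], [0.25, 0.1], [0.6, -0.2]]
embedded = embed([[1], [2]], table)   # indices of shape (N=2, M=1)
print(embedded)  # [[[0.25, 0.1]], [[0.6, -0.2]]]
```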
class mxnet.gluon.nn.Conv1D(channels, kernel_size, strides=1, padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

1D convolution layer (e.g. temporal convolution).

This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 1 int) – Specifies the strides of the convolution.
  • padding (int or tuple/list of 1 int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Input shape:
This depends on the layout parameter. Input is 3D array of shape (batch_size, in_channels, width) if layout is NCW.
Output shape:

This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW. out_width is calculated as:

out_width = floor((width+2*padding-dilation*(kernel_size-1)-1)/stride)+1
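
The out_width formula above can be checked with a few lines of Python. The helper name `conv1d_out_width` is hypothetical; integer floor division stands in for floor(), which is equivalent here for a positive stride.

```python
def conv1d_out_width(width, kernel_size, stride=1, padding=0, dilation=1):
    """Evaluate floor((width + 2*padding - dilation*(kernel_size-1) - 1) / stride) + 1."""
    return (width + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

print(conv1d_out_width(10, kernel_size=3))                       # 8
print(conv1d_out_width(10, kernel_size=3, stride=2, padding=1))  # 5
print(conv1d_out_width(10, kernel_size=3, dilation=2))           # 6
```

Dilation widens the effective kernel to dilation * (kernel_size - 1) + 1 points, which is why the third call loses more positions than the first.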
class mxnet.gluon.nn.Conv2D(channels, kernel_size, strides=(1, 1), padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

2D convolution layer (e.g. spatial convolution over images).

This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 2 int) – Specifies the strides of the convolution.
  • padding (int or tuple/list of 2 int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Input shape:
This depends on the layout parameter. Input is 4D array of shape (batch_size, in_channels, height, width) if layout is NCHW.
Output shape:

This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.

out_height and out_width are calculated as:

out_height = floor((height+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/strides[0])+1
out_width = floor((width+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/strides[1])+1
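
The output-size formula above can be checked with a few lines of plain Python. This is only an illustrative sketch: conv_output_size and _as_tuple are hypothetical helper names, not part of MXNet, and the code simply evaluates the documented formula for each spatial dimension.

```python
def _as_tuple(value, ndim):
    """Expand an int to an ndim-tuple; validate the length of sequences."""
    if isinstance(value, int):
        return (value,) * ndim
    value = tuple(value)
    if len(value) != ndim:
        raise ValueError("expected %d values, got %d" % (ndim, len(value)))
    return value


def conv_output_size(in_size, kernel_size, strides=1, padding=0, dilation=1):
    """Evaluate the documented output-size formula per spatial dimension.

    in_size is the spatial part of the input shape, e.g. (height, width).
    """
    ndim = len(in_size)
    kernel = _as_tuple(kernel_size, ndim)
    strides = _as_tuple(strides, ndim)
    padding = _as_tuple(padding, ndim)
    dilation = _as_tuple(dilation, ndim)
    # Integer floor division implements floor(...) for integer inputs.
    return tuple(
        (n + 2 * p - d * (k - 1) - 1) // s + 1
        for n, k, s, p, d in zip(in_size, kernel, strides, padding, dilation)
    )


# A 3x3 kernel with stride 2 and padding 1 halves a 224x224 input.
print(conv_output_size((224, 224), kernel_size=3, strides=2, padding=1))  # (112, 112)
# A 5x5 kernel without padding shrinks each side by 4.
print(conv_output_size((32, 32), kernel_size=5))  # (28, 28)
```

Passing a one-element or three-element in_size evaluates the same formula for the 1D and 3D layers.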
class mxnet.gluon.nn.Conv3D(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

3D convolution layer (e.g. spatial convolution over volumes).

This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 3 int) – Specifies the strides of the convolution.
  • padding (int or tuple/list of 3 int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Input shape:
This depends on the layout parameter. Input is 5D array of shape (batch_size, in_channels, depth, height, width) if layout is NCDHW.
Output shape:

This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.

out_depth, out_height and out_width are calculated as:

out_depth = floor((depth+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/strides[0])+1
out_height = floor((height+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/strides[1])+1
out_width = floor((width+2*padding[2]-dilation[2]*(kernel_size[2]-1)-1)/strides[2])+1
class mxnet.gluon.nn.Conv1DTranspose(channels, kernel_size, strides=1, padding=0, output_padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

Transposed 1D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 1 int) – Specifies the strides of the convolution.
  • padding (int or tuple/list of 1 int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Input shape:
This depends on the layout parameter. Input is 3D array of shape (batch_size, in_channels, width) if layout is NCW.
Output shape:

This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.

out_width is calculated as:

out_width = (width-1)*strides-2*padding+kernel_size+output_padding
class mxnet.gluon.nn.Conv2DTranspose(channels, kernel_size, strides=(1, 1), padding=(0, 0), output_padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

Transposed 2D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 2 int) – Specifies the strides of the convolution.
  • padding (int or tuple/list of 2 int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Input shape:
This depends on the layout parameter. Input is 4D array of shape (batch_size, in_channels, height, width) if layout is NCHW.
Output shape:

This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.

out_height and out_width are calculated as:

out_height = (height-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
out_width = (width-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
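
As a quick check of the transposed-convolution formula above, the following sketch evaluates it in plain Python. The helper names transposed_conv_output_size and _as_tuple are hypothetical and not part of MXNet. Like the documented formula, the sketch does not account for dilation, so it should be read as assuming dilation=1.

```python
def _as_tuple(value, ndim):
    """Expand an int to an ndim-tuple; validate the length of sequences."""
    if isinstance(value, int):
        return (value,) * ndim
    value = tuple(value)
    if len(value) != ndim:
        raise ValueError("expected %d values, got %d" % (ndim, len(value)))
    return value


def transposed_conv_output_size(in_size, kernel_size, strides=1, padding=0,
                                output_padding=0):
    """Evaluate (n-1)*strides - 2*padding + kernel_size + output_padding."""
    ndim = len(in_size)
    kernel = _as_tuple(kernel_size, ndim)
    strides = _as_tuple(strides, ndim)
    padding = _as_tuple(padding, ndim)
    output_padding = _as_tuple(output_padding, ndim)
    return tuple(
        (n - 1) * s - 2 * p + k + op
        for n, k, s, p, op in zip(in_size, kernel, strides, padding,
                                  output_padding)
    )


# A 4x4 kernel with stride 2 and padding 1 doubles a 112x112 input.
print(transposed_conv_output_size((112, 112), 4, strides=2, padding=1))  # (224, 224)
# With a 3x3 kernel, output_padding=1 supplies the missing row and column.
print(transposed_conv_output_size((112, 112), 3, strides=2, padding=1,
                                  output_padding=1))  # (224, 224)
```

The second call illustrates why output_padding exists: a strided convolution maps several input sizes to the same output size, so the transposed layer needs an extra term to pick one of them.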
class mxnet.gluon.nn.Conv3DTranspose(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), output_padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)

Transposed 3D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 3 int) – Specifies the strides of the convolution.
  • padding (int or tuple/list of 3 int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’, and ‘W’ dimensions.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Input shape:
This depends on the layout parameter. Input is 5D array of shape (batch_size, in_channels, depth, height, width) if layout is NCDHW.
Output shape:

This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW. out_depth, out_height and out_width are calculated as:

out_depth = (depth-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
out_height = (height-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
out_width = (width-1)*strides[2]-2*padding[2]+kernel_size[2]+output_padding[2]
class mxnet.gluon.nn.MaxPool1D(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)

Max pooling operation for one dimensional data.

Parameters:
  • pool_size (int) – Size of the max pooling windows.
  • strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Pooling is applied on the W dimension.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 3D array of shape (batch_size, channels, width) if layout is NCW.
Output shape:

This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.

out_width is calculated as:

out_width = floor((width+2*padding-pool_size)/strides)+1

When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.MaxPool2D(pool_size=(2, 2), strides=None, padding=0, layout='NCHW', ceil_mode=False, **kwargs)

Max pooling operation for two dimensional (spatial) data.

Parameters:
  • pool_size (int or list/tuple of 2 ints) – Size of the max pooling windows.
  • strides (int, list/tuple of 2 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int or list/tuple of 2 ints) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Pooling is applied on the ‘H’ and ‘W’ dimensions.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 4D array of shape (batch_size, channels, height, width) if layout is NCHW.
Output shape:

This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.

out_height and out_width are calculated as:

out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1

When ceil_mode is True, ceil will be used instead of floor in this equation.
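
The effect of ceil_mode and of the strides=None default is easiest to see numerically. The sketch below evaluates the pooling formula above in plain Python. pool_output_size and _as_tuple are hypothetical helper names, not MXNet functions.

```python
import math


def _as_tuple(value, ndim):
    """Expand an int to an ndim-tuple; validate the length of sequences."""
    if isinstance(value, int):
        return (value,) * ndim
    value = tuple(value)
    if len(value) != ndim:
        raise ValueError("expected %d values, got %d" % (ndim, len(value)))
    return value


def pool_output_size(in_size, pool_size=2, strides=None, padding=0,
                     ceil_mode=False):
    """Evaluate the documented pooling output-size formula per dimension."""
    ndim = len(in_size)
    pool = _as_tuple(pool_size, ndim)
    # strides=None falls back to pool_size, as described in the parameters.
    strides = pool if strides is None else _as_tuple(strides, ndim)
    padding = _as_tuple(padding, ndim)
    rounding = math.ceil if ceil_mode else math.floor
    return tuple(
        rounding((n + 2 * p - k) / s) + 1
        for n, k, s, p in zip(in_size, pool, strides, padding)
    )


# An odd-sized input shows the difference between floor and ceil.
print(pool_output_size((7, 7)))                  # (3, 3)
print(pool_output_size((7, 7), ceil_mode=True))  # (4, 4)
```

With floor, the last row and column of the 7x7 input are dropped. With ceil, they are covered by an extra, partially filled window.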

class mxnet.gluon.nn.MaxPool3D(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)

Max pooling operation for 3D data (spatial or spatio-temporal).

Parameters:
  • pool_size (int or list/tuple of 3 ints) – Size of the max pooling windows.
  • strides (int, list/tuple of 3 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int or list/tuple of 3 ints) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Pooling is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 5D array of shape (batch_size, channels, depth, height, width) if layout is NCDHW.
Output shape:

This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.

out_depth, out_height and out_width are calculated as:

out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1

When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.AvgPool1D(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)

Average pooling operation for temporal data.

Parameters:
  • pool_size (int) – Size of the average pooling windows.
  • strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Pooling is applied on the ‘W’ dimension.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 3D array of shape (batch_size, channels, width) if layout is NCW.
Output shape:

This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.

out_width is calculated as:

out_width = floor((width+2*padding-pool_size)/strides)+1

When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.AvgPool2D(pool_size=(2, 2), strides=None, padding=0, ceil_mode=False, layout='NCHW', **kwargs)

Average pooling operation for spatial data.

Parameters:
  • pool_size (int or list/tuple of 2 ints) – Size of the average pooling windows.
  • strides (int, list/tuple of 2 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int or list/tuple of 2 ints) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Pooling is applied on the ‘H’ and ‘W’ dimensions.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 4D array of shape (batch_size, channels, height, width) if layout is NCHW.
Output shape:

This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.

out_height and out_width are calculated as:

out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1

When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.AvgPool3D(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)

Average pooling operation for 3D data (spatial or spatio-temporal).

Parameters:
  • pool_size (int or list/tuple of 3 ints) – Size of the average pooling windows.
  • strides (int, list/tuple of 3 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int or list/tuple of 3 ints) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Pooling is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Input shape:
This depends on the layout parameter. Input is 5D array of shape (batch_size, channels, depth, height, width) if layout is NCDHW.
Output shape:

This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.

out_depth, out_height and out_width are calculated as:

out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1

When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.GlobalMaxPool1D(layout='NCW', **kwargs)

Global max pooling operation for temporal data.

class mxnet.gluon.nn.GlobalMaxPool2D(layout='NCHW', **kwargs)

Global max pooling operation for spatial data.

class mxnet.gluon.nn.GlobalMaxPool3D(layout='NCDHW', **kwargs)

Global max pooling operation for 3D data.

class mxnet.gluon.nn.GlobalAvgPool1D(layout='NCW', **kwargs)

Global average pooling operation for temporal data.

class mxnet.gluon.nn.GlobalAvgPool2D(layout='NCHW', **kwargs)

Global average pooling operation for spatial data.

class mxnet.gluon.nn.GlobalAvgPool3D(layout='NCDHW', **kwargs)

Global average pooling operation for 3D data.

class mxnet.gluon.rnn.RecurrentCell(prefix=None, params=None)

Abstract base class for RNN cells.

Parameters:
  • prefix (str, optional) – Prefix for names of `Block`s (this prefix is also used for names of weights if `params` is None, i.e. if params are being created and not reused).
  • params (Parameter or None, optional) – Container for weight sharing between cells. A new Parameter container is created if params is None.
reset()

Reset before re-using the cell for another graph.

state_info(batch_size=0)

Shape and layout information of states.

begin_state(batch_size=0, func=, **kwargs)

Initial state for this cell.

Parameters:
  • func (callable, default symbol.zeros) –

    Function for creating initial state.

    For Symbol API, func can be symbol.zeros, symbol.uniform, symbol.var etc. Use symbol.var if you want to directly feed input as states.

    For NDArray API, func can be ndarray.zeros, ndarray.ones, etc.

  • batch_size (int, default 0) – Only required for NDArray API. Size of the batch (‘N’ in layout) dimension of input.
  • **kwargs

    Additional keyword arguments passed to func. For example mean, std, dtype, etc.

Returns:

states – Starting states for the first RNN step.

Return type:

nested list of Symbol

unroll(length, inputs, begin_state=None, layout='NTC', merge_outputs=None)

Unrolls an RNN cell across time steps.

Parameters:
  • length (int) – Number of steps to unroll.
  • inputs (Symbol, list of Symbol, or None) –

    If inputs is a single Symbol (usually the output of Embedding symbol), it should have shape (batch_size, length, ...) if layout is ‘NTC’, or (length, batch_size, ...) if layout is ‘TNC’.

    If inputs is a list of symbols (usually output of previous unroll), they should all have shape (batch_size, ...).

  • begin_state (nested list of Symbol, optional) – Input states created by begin_state() or output state of another cell. Created from begin_state() if None.
  • layout (str, optional) – layout of input symbol. Only used if inputs is a single Symbol.
  • merge_outputs (bool, optional) – If False, returns outputs as a list of Symbols. If True, concatenates output across time steps and returns a single symbol with shape (batch_size, length, ...) if layout is ‘NTC’, or (length, batch_size, ...) if layout is ‘TNC’. If None, output whatever is faster.
Returns:

  • outputs (list of Symbol or Symbol) – Symbol (if merge_outputs is True) or list of Symbols (if merge_outputs is False) corresponding to the output from the RNN from this unrolling.
  • states (list of Symbol) – The new state of this RNN after this unrolling. The type of this symbol is same as the output of begin_state().

forward(inputs, states)

Unrolls the recurrent cell for one time step.

Parameters:
  • inputs (sym.Variable) – Input symbol, 2D, of shape (batch_size, num_units).
  • states (list of sym.Variable) – RNN state from previous step or the output of begin_state().
Returns:

  • output (Symbol) – Symbol corresponding to the output from the RNN when unrolling for a single time step.
  • states (list of Symbol) – The new state of this RNN after this unrolling. The type of this symbol is same as the output of begin_state(). This can be used as an input state to the next time step of this RNN.

See also

begin_state()
This function can provide the states for the first time step.
unroll()
This function unrolls an RNN for a given number of (>=1) time steps.
class mxnet.gluon.rnn.RNN(hidden_size, num_layers=1, activation='relu', layout='TNC', dropout=0, bidirectional=False, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, **kwargs)

Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.

For each element in the input sequence, each layer computes the following function:

\[h_t = \tanh(w_{ih} * x_t + b_{ih} + w_{hh} * h_{(t-1)} + b_{hh})\]

where \(h_t\) is the hidden state at time t, and \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer. If activation=’relu’, then ReLU is used instead of tanh.
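
To make the recurrence concrete, here is a minimal pure-Python sketch of one Elman step for a single sample. It is not how MXNet implements the layer: rnn_step is a hypothetical name, vectors are plain lists, and weight matrices are lists of rows (one row per hidden unit).

```python
import math


def rnn_step(x_t, h_prev, w_ih, w_hh, b_ih, b_hh, activation="tanh"):
    """Compute h_t = act(w_ih * x_t + b_ih + w_hh * h_prev + b_hh)."""
    if activation == "tanh":
        act = math.tanh
    elif activation == "relu":
        act = lambda v: max(0.0, v)
    else:
        raise ValueError("activation must be 'tanh' or 'relu'")
    h_t = []
    for row_ih, row_hh, bias_ih, bias_hh in zip(w_ih, w_hh, b_ih, b_hh):
        # Input-to-hidden and hidden-to-hidden contributions for one unit.
        pre = sum(w * x for w, x in zip(row_ih, x_t)) + bias_ih
        pre += sum(w * h for w, h in zip(row_hh, h_prev)) + bias_hh
        h_t.append(act(pre))
    return h_t


# One hidden unit, one input feature, starting from a zero hidden state.
h1 = rnn_step([0.5], [0.0], w_ih=[[1.0]], w_hh=[[0.5]], b_ih=[0.0], b_hh=[0.0])
# Feeding h1 back in gives the next time step.
h2 = rnn_step([0.5], h1, w_ih=[[1.0]], w_hh=[[0.5]], b_ih=[0.0], b_hh=[0.0])
print(h1, h2)
```

A multi-layer network applies the same step per layer, using the hidden state of the layer below as x_t.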

Parameters:
  • hidden_size (int) – The number of features in the hidden state h.
  • num_layers (int, default 1) – Number of recurrent layers.
  • activation ({'relu' or 'tanh'}, default 'relu') – The activation function to use.
  • layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
  • dropout (float, default 0) – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer.
  • bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
  • i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
  • h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
  • i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
  • prefix (str or None) – Prefix of this Block.
  • params (ParameterDict or None) – Shared Parameters for this Block.
Input shapes:
The input shape depends on layout. For layout=’TNC’, the input has shape (sequence_length, batch_size, input_size).
Output shape:
The output shape depends on layout. For layout=’TNC’, the output has shape (sequence_length, batch_size, hidden_size). If bidirectional is True, the output shape will instead be (sequence_length, batch_size, 2*hidden_size).
Recurrent state:
The recurrent state is an NDArray with shape (num_layers, batch_size, hidden_size). If bidirectional is True, the recurrent state shape will instead be (2*num_layers, batch_size, hidden_size). If the input recurrent state is None, zeros are used as the default begin states, and the output recurrent state is omitted.
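
The shape rules above can be summarized in a small helper. This is a sketch for the ’TNC’ layout only, and rnn_shapes is a hypothetical name rather than an MXNet function.

```python
def rnn_shapes(sequence_length, batch_size, input_size, hidden_size,
               num_layers=1, bidirectional=False):
    """Return the input, output and state shapes for layout='TNC'."""
    # A bidirectional layer runs one pass per direction.
    directions = 2 if bidirectional else 1
    return {
        "input": (sequence_length, batch_size, input_size),
        "output": (sequence_length, batch_size, directions * hidden_size),
        "state": (num_layers * directions, batch_size, hidden_size),
    }


# Three layers, 100 hidden units, a batch of 3 sequences of length 5.
print(rnn_shapes(5, 3, 10, 100, num_layers=3))
# {'input': (5, 3, 10), 'output': (5, 3, 100), 'state': (3, 3, 100)}
```

When a begin state is created by hand, its shape has to match the "state" entry for the layer's configuration.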

Examples

>>> import mxnet as mx
>>> layer = mx.gluon.rnn.RNN(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random.uniform(shape=(5, 3, 10))
>>> # by default zeros are used as begin state
>>> output = layer(input)
>>> # manually specify begin state.
>>> h0 = mx.nd.random.uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, h0)
class mxnet.gluon.rnn.LSTM(hidden_size, num_layers=1, layout='TNC', dropout=0, bidirectional=False, input_size=0, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', **kwargs)

Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

\[\begin{split}\begin{array}{ll} i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ f_t = sigmoid(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\ g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\ o_t = sigmoid(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\ c_t = f_t * c_{(t-1)} + i_t * g_t \\ h_t = o_t * \tanh(c_t) \end{array}\end{split}\]

where \(h_t\) is the hidden state at time t, \(c_t\) is the cell state at time t, \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer, and \(i_t\), \(f_t\), \(g_t\), \(o_t\) are the input, forget, cell, and out gates, respectively.
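
The gate equations can be traced with a scalar example. The following sketch assumes a single hidden unit and a single input feature so that every weight is a plain float. lstm_step and the dictionary keys (which mirror the weight subscripts, with "hg" for the recurrent weight of the cell gate) are illustrative choices, not MXNet API.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One scalar LSTM step; w and b are dicts keyed by gate subscript."""
    i_t = sigmoid(w["ii"] * x_t + b["ii"] + w["hi"] * h_prev + b["hi"])  # input gate
    f_t = sigmoid(w["if"] * x_t + b["if"] + w["hf"] * h_prev + b["hf"])  # forget gate
    g_t = math.tanh(w["ig"] * x_t + b["ig"] + w["hg"] * h_prev + b["hg"])  # cell gate
    o_t = sigmoid(w["io"] * x_t + b["io"] + w["ho"] * h_prev + b["ho"])  # output gate
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


keys = ("ii", "hi", "if", "hf", "ig", "hg", "io", "ho")
zero_w = dict.fromkeys(keys, 0.0)
zero_b = dict.fromkeys(keys, 0.0)

# With all weights and biases at zero, every sigmoid gate is 0.5 and g_t is 0,
# so the cell state is simply halved: c_t = 0.5 * c_prev.
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, w=zero_w, b=zero_b)
print(c_t)  # 1.0
```

Raising the forget-gate bias pushes f_t toward 1, which lets the cell state persist across more time steps.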

Parameters:
  • hidden_size (int) – The number of features in the hidden state h.
  • num_layers (int, default 1) – Number of recurrent layers.
  • layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
  • dropout (float, default 0) – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer.
  • bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
  • i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
  • h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
  • i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero.
  • h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
  • prefix (str or None) – Prefix of this Block.
  • params (ParameterDict or None) – Shared Parameters for this Block.
Input shapes:
The input shape depends on layout. For layout='TNC', the input has shape (sequence_length, batch_size, input_size).
Output shape:
The output shape depends on layout. For layout='TNC', the output has shape (sequence_length, batch_size, hidden_size). If bidirectional is True, the output shape is instead (sequence_length, batch_size, 2*hidden_size).
Recurrent state:
The recurrent state is a list of two NDArrays. Both have shape (num_layers, batch_size, hidden_size). If bidirectional is True, each recurrent state instead has shape (2*num_layers, batch_size, hidden_size). If the input recurrent state is None, zeros are used as the default begin states, and the output recurrent state is omitted.
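The shape rules above can be summarised in a small helper. This is a hypothetical function written for this note (rnn_shapes is not part of Gluon); it assumes layout='TNC' and takes the hidden dimension to be the hidden_size constructor argument.

```python
def rnn_shapes(sequence_length, batch_size, hidden_size,
               num_layers=1, bidirectional=False):
    """Return (output_shape, state_shape) for layout='TNC'."""
    directions = 2 if bidirectional else 1
    # Forward and backward outputs are concatenated on the feature axis.
    output_shape = (sequence_length, batch_size, directions * hidden_size)
    # Each direction of each layer contributes one slice of the state.
    state_shape = (directions * num_layers, batch_size, hidden_size)
    return output_shape, state_shape

out_shape, state_shape = rnn_shapes(5, 3, 100, num_layers=3)
print(out_shape, state_shape)  # (5, 3, 100) (3, 3, 100)
```

For an LSTM, both entries of the recurrent state list (h and c) have the state shape returned here.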

Examples

>>> layer = mx.gluon.rnn.LSTM(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random.uniform(shape=(5, 3, 10))
>>> # by default zeros are used as begin state
>>> output = layer(input)
>>> # manually specify begin state.
>>> h0 = mx.nd.random.uniform(shape=(3, 3, 100))
>>> c0 = mx.nd.random.uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, [h0, c0])
class mxnet.gluon.rnn.GRU(hidden_size, num_layers=1, layout='TNC', dropout=0, bidirectional=False, input_size=0, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', **kwargs)

Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

\[\begin{split}\begin{array}{ll} r_t = sigmoid(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\ i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)} + b_{hn})) \\ h_t = (1 - i_t) * n_t + i_t * h_{(t-1)} \\ \end{array}\end{split}\]

where \(h_t\) is the hidden state at time t, \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer, and \(r_t\), \(i_t\), \(n_t\) are the reset, input, and new gates, respectively.
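As a rough illustration, one GRU time step for a single layer can be spelled out in plain Python. This follows the equations above term by term and is not MXNet's implementation; the names affine, gru_step and zero_params, and the dictionary keys such as 'W_ir', are assumptions made for the sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, v, b):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(x, h_prev, p):
    """One GRU time step; p maps names like 'W_ir' and 'b_hn' to weights."""
    def pre(src, gate, v):
        return affine(p['W_' + src + gate], v, p['b_' + src + gate])
    # Reset gate r_t and input gate i_t.
    r = [sigmoid(a + b) for a, b in zip(pre('i', 'r', x), pre('h', 'r', h_prev))]
    i = [sigmoid(a + b) for a, b in zip(pre('i', 'i', x), pre('h', 'i', h_prev))]
    # New gate n_t: the reset gate scales the recurrent term, bias included.
    n = [math.tanh(a + rt * b)
         for a, rt, b in zip(pre('i', 'n', x), r, pre('h', 'n', h_prev))]
    return [(1.0 - it) * nt + it * hp for it, nt, hp in zip(i, n, h_prev)]

def zero_params(input_size, hidden_size):
    """Hypothetical helper: all-zero weights, only to make the sketch runnable."""
    p = {}
    for gate in 'rin':
        p['W_i' + gate] = [[0.0] * input_size for _ in range(hidden_size)]
        p['W_h' + gate] = [[0.0] * hidden_size for _ in range(hidden_size)]
        p['b_i' + gate] = [0.0] * hidden_size
        p['b_h' + gate] = [0.0] * hidden_size
    return p

p = zero_params(input_size=3, hidden_size=2)
h = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], p)
print(h)  # [0.5, -1.0]
```

With zero weights both gates are 0.5 and n_t is 0, so the new hidden state is half the previous one. Note that b_hn sits inside the product with r_t, which is the cuDNN-style variant rather than the original formulation.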

Parameters:
  • hidden_size (int) – The number of features in the hidden state h.
  • num_layers (int, default 1) – Number of recurrent layers.
  • layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
  • dropout (float, default 0) – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer.
  • bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
  • i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
  • h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
  • i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
  • prefix (str or None) – Prefix of this Block.
  • params (ParameterDict or None) – Shared Parameters for this Block.
Input shapes:
The input shape depends on layout. For layout='TNC', the input has shape (sequence_length, batch_size, input_size).
Output shape:
The output shape depends on layout. For layout='TNC', the output has shape (sequence_length, batch_size, hidden_size). If bidirectional is True, the output shape is instead (sequence_length, batch_size, 2*hidden_size).
Recurrent state:
The recurrent state is an NDArray with shape (num_layers, batch_size, hidden_size). If bidirectional is True, the recurrent state shape is instead (2*num_layers, batch_size, hidden_size). If the input recurrent state is None, zeros are used as the default begin states, and the output recurrent state is omitted.

Examples

>>> layer = mx.gluon.rnn.GRU(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random.uniform(shape=(5, 3, 10))
>>> # by default zeros are used as begin state
>>> output = layer(input)
>>> # manually specify begin state.
>>> h0 = mx.nd.random.uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, h0)
class mxnet.gluon.rnn.RNNCell(hidden_size, activation='tanh', i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)

Simple recurrent neural network cell.

Parameters:
  • hidden_size (int) – Number of units in output symbol.
  • activation (str or Symbol, default 'tanh') – Type of activation function.
  • i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
  • h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
  • i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • prefix (str, default 'rnn_') – Prefix for the names of `Block`s (and for the names of weights if params is `None`).
  • params (Parameter or None) – Container for weight sharing between cells. Created if None.
class mxnet.gluon.rnn.LSTMCell(hidden_size, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)

Long Short-Term Memory (LSTM) network cell.

Parameters:
  • hidden_size (int) – Number of units in output symbol.
  • i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
  • h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
  • i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero.
  • h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • prefix (str, default 'lstm_') – Prefix for the names of `Block`s (and for the names of weights if params is `None`).
  • params (Parameter or None) – Container for weight sharing between cells. Created if None.
class mxnet.gluon.rnn.GRUCell(hidden_size, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)

Gated Recurrent Unit (GRU) network cell. Note: this is an implementation of the cuDNN version of GRUs (a slight modification of Cho et al. 2014).

Parameters:
  • hidden_size (int) – Number of units in output symbol.
  • i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
  • h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
  • i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
  • prefix (str, default 'gru_') – Prefix for the names of `Block`s (and for the names of weights if params is `None`).
  • params (Parameter or None) – Container for weight sharing between cells. Created if None.
class mxnet.gluon.rnn.SequentialRNNCell(prefix=None, params=None)

Sequentially stacking multiple RNN cells.

add(cell)

Appends a cell into the stack.

Parameters:cell (rnn cell) – The cell to append to the stack.
class mxnet.gluon.rnn.BidirectionalCell(l_cell, r_cell, output_prefix='bi_')

Bidirectional RNN cell.

Parameters:
class mxnet.gluon.rnn.DropoutCell(rate, prefix=None, params=None)

Applies dropout on input.

Parameters:rate (float) – Percentage of elements to drop out, which is 1 - percentage to retain.
class mxnet.gluon.rnn.ZoneoutCell(base_cell, zoneout_outputs=0.0, zoneout_states=0.0)

Applies Zoneout on base cell.

class mxnet.gluon.rnn.ResidualCell(base_cell)

Adds residual connection as described in Wu et al, 2016 (https://arxiv.org/abs/1609.08144). Output of the cell is output of the base cell plus input.

class mxnet.gluon.Trainer(params, optimizer, optimizer_params=None, kvstore='device')

Applies an Optimizer on a set of Parameters. Trainer should be used together with autograd.

Parameters:
  • params (ParameterDict) – The set of parameters to optimize.
  • optimizer (str or Optimizer) – The optimizer to use. See help on Optimizer for a list of available optimizers.
  • optimizer_params (dict) – Keyword arguments to be passed to the optimizer constructor. For example, {'learning_rate': 0.1}. All optimizers accept learning_rate, wd (weight decay), clip_gradient, and lr_scheduler. See each optimizer’s constructor for a list of additional supported arguments.
  • kvstore (str or KVStore) – kvstore type for multi-gpu and distributed training. See help on mxnet.kvstore.create for more information.
step(batch_size, ignore_stale_grad=False)

Makes one step of parameter update. Should be called after autograd.compute_gradient and outside of record() scope.

Parameters:
  • batch_size (int) – Batch size of data processed. Gradient will be normalized by 1/batch_size. Set this to 1 if you normalized loss manually with loss = mean(loss).
  • ignore_stale_grad (bool, optional, default=False) – If True, ignores Parameters with a stale gradient (a gradient that has not been updated by backward since the last step) and skips their update.
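To make the 1/batch_size normalization concrete, here is a minimal sketch that assumes plain SGD with no momentum, weight decay or gradient clipping. The function sgd_step is hypothetical and only mimics the arithmetic; it is not how Trainer is implemented.

```python
def sgd_step(weights, grads, learning_rate, batch_size):
    """Plain SGD update with the gradient rescaled by 1/batch_size."""
    return [w - learning_rate * g / batch_size for w, g in zip(weights, grads)]

per_sample_grads = [2.0, 4.0, 6.0, 4.0]

# Loss summed over the batch: pass the real batch size to the step.
summed = sgd_step([1.0], [sum(per_sample_grads)], 0.1, batch_size=4)

# Loss already averaged with mean(loss): pass batch_size=1 instead.
mean_grad = sum(per_sample_grads) / len(per_sample_grads)
averaged = sgd_step([1.0], [mean_grad], 0.1, batch_size=1)
```

Both calls produce the same updated weight, which is why batch_size should be set to 1 when the loss has already been averaged by hand.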
save_states(fname)

Saves trainer states (e.g. optimizer, momentum) to a file.

Parameters:fname (str) – Path to output states file.
load_states(fname)

Loads trainer states (e.g. optimizer, momentum) from a file.

Parameters:fname (str) – Path to input states file.
class mxnet.gluon.loss.L2Loss(weight=1.0, batch_axis=0, **kwargs)

Calculates the mean squared error between output and label:

\[L = \frac{1}{2}\sum_i \Vert {output}_i - {label}_i \Vert^2.\]

Output and label can have arbitrary shape as long as they have the same number of elements.

Parameters:
  • weight (float or None) – Global scalar weight for loss.
  • sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
  • batch_axis (int, default 0) – The axis that represents mini-batch.
class mxnet.gluon.loss.L1Loss(weight=None, batch_axis=0, **kwargs)

Calculates the mean absolute error between output and label:

\[L = \sum_i \vert {output}_i - {label}_i \vert.\]

Output and label must have the same shape.

Parameters:
  • weight (float or None) – Global scalar weight for loss.
  • sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
  • batch_axis (int, default 0) – The axis that represents mini-batch.
class mxnet.gluon.loss.SoftmaxCrossEntropyLoss(axis=-1, sparse_label=True, from_logits=False, weight=None, batch_axis=0, **kwargs)

Computes the softmax cross entropy loss. (alias: SoftmaxCELoss)

If sparse_label is True, label should contain integer category indicators:

\[p = {softmax}({output})\]\[L = -\sum_i {log}(p_{i,{label}_i})\]

Label’s shape should be output’s shape without the axis dimension. i.e. for output.shape = (1,2,3,4) and axis = 2, label.shape should be (1,2,4).

If sparse_label is False, label should contain probability distribution with the same shape as output:

\[p = {softmax}({output})\]\[L = -\sum_i \sum_j {label}_j {log}(p_{ij})\]
Parameters:
  • axis (int, default -1) – The axis to sum over when computing softmax and entropy.
  • sparse_label (bool, default True) – Whether label is an integer array instead of probability distribution.
  • from_logits (bool, default False) – Whether input is a log probability (usually from log_softmax) instead of unnormalized numbers.
  • weight (float or None) – Global scalar weight for loss.
  • sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
  • batch_axis (int, default 0) – The axis that represents mini-batch.
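The two label conventions can be compared with a small pure-Python sketch for a single sample. It implements the formulas above for a one-dimensional score vector and is not the Gluon implementation; softmax_ce and log_softmax are names made up for this example, and from_logits, weighting and batching are left out.

```python
import math

def log_softmax(scores):
    """Numerically stable log(softmax(scores)) for a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def softmax_ce(output, label, sparse_label=True):
    """Loss for one sample; label is a class index or a distribution."""
    logp = log_softmax(output)
    if sparse_label:
        return -logp[label]
    return -sum(l * lp for l, lp in zip(label, logp))

scores = [1.0, 2.0, 3.0]
sparse = softmax_ce(scores, 2)
dense = softmax_ce(scores, [0.0, 0.0, 1.0], sparse_label=False)
```

A one-hot probability vector gives the same loss as the corresponding integer label, so sparse_label=True is simply the cheaper encoding when each sample has exactly one class.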
class mxnet.gluon.loss.KLDivLoss(from_logits=True, weight=None, batch_axis=0, **kwargs)

The Kullback-Leibler divergence loss.

KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.

\[L = 1/n \sum_i (label_i * (log(label_i) - output_i))\]

Label’s shape should be the same as output’s.

Parameters:
  • from_logits (bool, default is True) – Whether the input is log probability (usually from log_softmax) instead of unnormalized numbers.
  • weight (float or None) – Global scalar weight for loss.
  • sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
  • batch_axis (int, default 0) – The axis that represents mini-batch.
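A short sketch of the formula above, assuming from_logits=True so that output already holds log probabilities. The function kl_div is hypothetical, and skipping zero-probability label entries (the 0 * log 0 = 0 convention) is an assumption of this sketch rather than documented Gluon behaviour.

```python
import math

def kl_div(log_pred, label):
    """Mean over elements of label * (log(label) - log_pred)."""
    n = len(label)
    total = sum(l * (math.log(l) - lp)
                for l, lp in zip(label, log_pred) if l > 0)
    return total / n

label = [0.5, 0.5]
same = kl_div([math.log(0.5), math.log(0.5)], label)
different = kl_div([math.log(0.25), math.log(0.75)], label)
```

The loss is zero when the predicted distribution equals the label distribution and positive otherwise.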
utils.split_data(data, num_slice, batch_axis=0, even_split=True)

Splits an NDArray into num_slice slices along batch_axis. Usually used for data parallelism, where each slice is sent to one device (e.g. a GPU).

Parameters:
  • data (NDArray) – A batch of data.
  • num_slice (int) – Number of desired slices.
  • batch_axis (int, default 0) – The axis along which to slice.
  • even_split (bool, default True) – Whether to force all slices to have the same number of elements. If True, an error will be raised when num_slice does not evenly divide data.shape[batch_axis].
Returns:

Return value is a list even if num_slice is 1.

Return type:

list of NDArray
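The slicing behaviour can be sketched on a plain Python list. The function split_list below is a stand-in written for this note, not the Gluon function; in particular, giving the remainder to the last slice when even_split is False is an assumption about how uneven batches are handled.

```python
def split_list(data, num_slice, even_split=True):
    """Split a list into num_slice contiguous slices along its only axis."""
    size = len(data)
    if size < num_slice:
        raise ValueError('too many slices for %d elements' % size)
    if even_split and size % num_slice != 0:
        raise ValueError('%d elements cannot be split evenly into %d slices'
                         % (size, num_slice))
    step = size // num_slice
    slices = [data[i * step:(i + 1) * step] for i in range(num_slice - 1)]
    slices.append(data[(num_slice - 1) * step:])  # last slice takes the rest
    return slices

print(split_list(list(range(6)), 3))  # [[0, 1], [2, 3], [4, 5]]
```

As in the description above, the result is a list even when num_slice is 1.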

utils.split_and_load(data, ctx_list, batch_axis=0, even_split=True)

Splits an NDArray into len(ctx_list) slices along batch_axis and loads each slice to one context in ctx_list.

Parameters:
  • data (NDArray) – A batch of data.
  • ctx_list (list of Context) – A list of Contexts.
  • batch_axis (int, default 0) – The axis along which to slice.
  • even_split (bool, default True) – Whether to force all slices to have the same number of elements.
Returns:

Each corresponds to a context in ctx_list.

Return type:

list of NDArray

utils.clip_global_norm(arrays, max_norm)

Rescales NDArrays so that the sum of their 2-norm is smaller than max_norm.
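A sketch of global-norm clipping on plain Python lists, for illustration. It assumes the usual definition, in which the global norm is the square root of the sum of squared entries over all arrays; the function name clip_by_global_norm and the small epsilon are choices made for this example, not details taken from the text above.

```python
import math

def clip_by_global_norm(arrays, max_norm):
    """Return (rescaled arrays, original global norm)."""
    total_norm = math.sqrt(sum(v * v for arr in arrays for v in arr))
    scale = max_norm / (total_norm + 1e-8)  # epsilon avoids division by zero
    if scale < 1.0:
        arrays = [[v * scale for v in arr] for arr in arrays]
    return arrays, total_norm

clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

All arrays are scaled by the same factor, so the direction of the combined gradient is preserved and only its length changes.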

class mxnet.gluon.data.Dataset

Abstract dataset class. All datasets should have this interface.

Subclasses need to override __getitem__, which returns the i-th element, and __len__, which returns the total number of elements.

Note

An mxnet or numpy array can be directly used as a dataset.
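The interface amounts to Python's sequence protocol. The sketch below is a plain class so that it runs without MXNet installed; in real code it would subclass mxnet.gluon.data.Dataset. SquaresDataset is a made-up example.

```python
class SquaresDataset:
    """Toy dataset whose i-th sample is the pair (i, i * i)."""

    def __init__(self, length):
        self._length = length

    def __getitem__(self, idx):
        if not 0 <= idx < self._length:
            raise IndexError(idx)  # lets plain iteration terminate
        return idx, idx * idx

    def __len__(self):
        return self._length

dataset = SquaresDataset(5)
print(len(dataset), dataset[3])  # 5 (3, 9)
```

Raising IndexError for out-of-range indices is what allows the object to be iterated with an ordinary for loop.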

class mxnet.gluon.data.ArrayDataset(data, label)

A dataset with a data array and a label array.

The i-th sample is (data[i], label[i]).

Parameters:
  • data (array-like object) – The data array. Can be mxnet or numpy array.
  • label (array-like object) – The label array. Can be mxnet or numpy array.
class mxnet.gluon.data.RecordFileDataset(filename)

A dataset wrapping over a RecordIO (.rec) file.

Each sample is a string representing the raw content of a record.

Parameters:filename (str) – Path to rec file.
class mxnet.gluon.data.Sampler

Base class for samplers.

All samplers should subclass Sampler and define __iter__ and __len__ methods.

class mxnet.gluon.data.SequentialSampler(length)

Samples elements from [0, length) sequentially.

Parameters:length (int) – Length of the sequence.
class mxnet.gluon.data.RandomSampler(length)

Samples elements from [0, length) randomly without replacement.

Parameters:length (int) – Length of the sequence.
class mxnet.gluon.data.BatchSampler(sampler, batch_size, last_batch='keep')

Wraps another Sampler and returns mini-batches of samples.

Parameters:
  • sampler (Sampler) – The source Sampler.
  • batch_size (int) – Size of mini-batch.
  • last_batch ({'keep', 'discard', 'rollover'}) –

    Specifies how the last batch is handled if batch_size does not evenly divide sequence length.

    If ‘keep’, the last batch will be returned directly, but will contain fewer elements than batch_size requires.

    If ‘discard’, the last batch will be discarded.

    If ‘rollover’, the remaining elements will be rolled over to the next iteration.

Examples

>>> sampler = gluon.data.SequentialSampler(10)
>>> batch_sampler = gluon.data.BatchSampler(sampler, 3, 'keep')
>>> list(batch_sampler)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
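The three last_batch modes can be compared with a small stand-in written in plain Python. SimpleBatchSampler is a hypothetical class for illustration, not the Gluon implementation; it takes any iterable of indices in place of a Sampler.

```python
class SimpleBatchSampler:
    """Group indices into batches, with keep/discard/rollover handling."""

    def __init__(self, indices, batch_size, last_batch='keep'):
        if last_batch not in ('keep', 'discard', 'rollover'):
            raise ValueError('unknown last_batch: %r' % (last_batch,))
        self._indices = list(indices)
        self._batch_size = batch_size
        self._last_batch = last_batch
        self._prev = []  # leftovers carried into the next iteration

    def __iter__(self):
        batch, self._prev = self._prev, []
        for i in self._indices:
            batch.append(i)
            if len(batch) == self._batch_size:
                yield batch
                batch = []
        if batch:
            if self._last_batch == 'keep':
                yield batch
            elif self._last_batch == 'rollover':
                self._prev = batch
            # 'discard': the incomplete batch is dropped

rollover = SimpleBatchSampler(range(10), 3, 'rollover')
first_pass = list(rollover)
second_pass = list(rollover)
print(second_pass[0])  # [9, 0, 1]
```

With 'rollover', index 9 is left over from the first pass and becomes the start of the first batch in the second pass.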
class mxnet.gluon.data.DataLoader(dataset, batch_size=None, shuffle=False, sampler=None, last_batch=None, batch_sampler=None)

Loads data from a dataset and returns mini-batches of data.

Parameters:
  • dataset (Dataset) – Source dataset. Note that numpy and mxnet arrays can be directly used as a Dataset.
  • batch_size (int) – Size of mini-batch.
  • shuffle (bool) – Whether to shuffle the samples.
  • sampler (Sampler) – The sampler to use. Either specify sampler or shuffle, not both.
  • last_batch ({'keep', 'discard', 'rollover'}) –

    How to handle the last batch if batch_size does not evenly divide len(dataset).

    keep - A batch with fewer samples than previous batches is returned.

    discard - The last batch is discarded if it is incomplete.

    rollover - The remaining samples are rolled over to the next epoch.

  • batch_sampler (Sampler) – A sampler that returns mini-batches. Do not specify batch_size, shuffle, sampler, and last_batch if batch_sampler is specified.

Dataset container.

class mxnet.gluon.data.vision.MNIST(root='~/.mxnet/datasets/mnist', train=True, transform=None)

MNIST handwritten digits dataset from http://yann.lecun.com/exdb/mnist.

Each sample is an image (in 3D NDArray) with shape (28, 28, 1).

Parameters:
  • root (str) – Path to temp folder for storing data.
  • train (bool) – Whether to load the training or testing set.
  • transform (function) –

    A user defined callback that transforms each instance. For example:

    transform=lambda data, label: (data.astype(np.float32)/255, label)

class mxnet.gluon.data.vision.FashionMNIST(root='~/.mxnet/datasets/fashion-mnist', train=True, transform=None)

A dataset of Zalando’s article images consisting of fashion products, a drop-in replacement for the original MNIST dataset, from https://github.com/zalandoresearch/fashion-mnist.

Each sample is an image (in 3D NDArray) with shape (28, 28, 1).

Parameters:
  • root (str) – Path to temp folder for storing data.
  • train (bool) – Whether to load the training or testing set.
  • transform (function) –

    A user defined callback that transforms each instance. For example:

    transform=lambda data, label: (data.astype(np.float32)/255, label)

class mxnet.gluon.data.vision.CIFAR10(root='~/.mxnet/datasets/cifar10', train=True, transform=None)

CIFAR10 image classification dataset from https://www.cs.toronto.edu/~kriz/cifar.html.

Each sample is an image (in 3D NDArray) with shape (32, 32, 3).

Parameters:
  • root (str) – Path to temp folder for storing data.
  • train (bool) – Whether to load the training or testing set.
  • transform (function) –

    A user defined callback that transforms each instance. For example:

    transform=lambda data, label: (data.astype(np.float32)/255, label)

class mxnet.gluon.data.vision.ImageRecordDataset(filename, flag=1, transform=None)

A dataset wrapping over a RecordIO file containing images.

Each sample is an image and its corresponding label.

Parameters:
  • filename (str) – Path to rec file.
  • flag ({0, 1}, default 1) –

    If 0, always convert images to greyscale.

    If 1, always convert images to colored (RGB).

  • transform (function) –

    A user defined callback that transforms each instance. For example:

    transform=lambda data, label: (data.astype(np.float32)/255, label)

class mxnet.gluon.data.vision.ImageFolderDataset(root, flag=1, transform=None)

A dataset for loading image files stored in a folder structure like:

root/car/0001.jpg
root/car/xxxa.jpg
root/car/yyyb.jpg
root/bus/123.jpg
root/bus/023.jpg
root/bus/wwww.jpg
Parameters:
  • root (str) – Path to root directory.
  • flag ({0, 1}, default 1) – If 0, always convert loaded images to greyscale (1 channel). If 1, always convert loaded images to colored (3 channels).
  • transform (callable) –

    A function that takes data and label and transforms them:

    transform = lambda data, label: (data.astype(np.float32)/255, label)

synsets

list

List of class names. synsets[i] is the name for the integer label i.

items

list of tuples

List of all images in (filename, label) pairs.
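How synsets and items relate to the folder layout can be sketched with the standard library. The function list_image_folder is hypothetical; assigning labels in sorted order of the class folder names and filtering on a fixed set of extensions are assumptions of this sketch, not behaviour stated above.

```python
import os
import tempfile

def list_image_folder(root, exts=('.jpg', '.jpeg', '.png')):
    """Return (synsets, items) for a root/<class>/<image> layout."""
    synsets, items = [], []
    for folder in sorted(os.listdir(root)):
        path = os.path.join(root, folder)
        if not os.path.isdir(path):
            continue  # ignore stray files directly under root
        label = len(synsets)  # integer label is the index into synsets
        synsets.append(folder)
        for filename in sorted(os.listdir(path)):
            if os.path.splitext(filename)[1].lower() in exts:
                items.append((os.path.join(path, filename), label))
    return synsets, items

# Recreate part of the example layout in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for name in ('car/0001.jpg', 'car/xxxa.jpg', 'bus/123.jpg'):
        os.makedirs(os.path.join(root, os.path.dirname(name)), exist_ok=True)
        open(os.path.join(root, name), 'w').close()
    synsets, items = list_image_folder(root)

print(synsets)  # ['bus', 'car']
```

Under these assumptions every file in root/bus gets label 0 and every file in root/car gets label 1, so synsets[label] recovers the class name of any item.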

vision.get_model(name, **kwargs)

Returns a pre-defined model by name.

Parameters:
  • name (str) – Name of the model.
  • pretrained (bool) – Whether to load the pretrained weights for model.
  • classes (int) – Number of classes for the output layer.
Returns:

The model.

Return type:

HybridBlock

vision.resnet18_v1(**kwargs)

ResNet-18 V1 model from “Deep Residual Learning for Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet34_v1(**kwargs)

ResNet-34 V1 model from “Deep Residual Learning for Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet50_v1(**kwargs)

ResNet-50 V1 model from “Deep Residual Learning for Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet101_v1(**kwargs)

ResNet-101 V1 model from “Deep Residual Learning for Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet152_v1(**kwargs)

ResNet-152 V1 model from “Deep Residual Learning for Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet18_v2(**kwargs)

ResNet-18 V2 model from “Identity Mappings in Deep Residual Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet34_v2(**kwargs)

ResNet-34 V2 model from “Identity Mappings in Deep Residual Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet50_v2(**kwargs)

ResNet-50 V2 model from “Identity Mappings in Deep Residual Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet101_v2(**kwargs)

ResNet-101 V2 model from “Identity Mappings in Deep Residual Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.resnet152_v2(**kwargs)

ResNet-152 V2 model from “Identity Mappings in Deep Residual Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.get_resnet(version, num_layers, pretrained=False, ctx=cpu(0), **kwargs)

ResNet V1 model from “Deep Residual Learning for Image Recognition” paper. ResNet V2 model from “Identity Mappings in Deep Residual Networks” paper.

Parameters:
  • version (int) – Version of ResNet. Options are 1, 2.
  • num_layers (int) – Numbers of layers. Options are 18, 34, 50, 101, 152.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
class mxnet.gluon.model_zoo.vision.ResNetV1(block, layers, channels, classes=1000, thumbnail=False, **kwargs)

ResNet V1 model from “Deep Residual Learning for Image Recognition” paper.

Parameters:
  • block (HybridBlock) – Class for the residual block. Options are BasicBlockV1, BottleneckV1.
  • layers (list of int) – Numbers of layers in each block.
  • channels (list of int) – Numbers of channels in each block. Length should be one larger than layers list.
  • classes (int, default 1000) – Number of classification classes.
  • thumbnail (bool, default False) – Enable thumbnail.
class mxnet.gluon.model_zoo.vision.BasicBlockV1(channels, stride, downsample=False, in_channels=0, **kwargs)

BasicBlock V1 from “Deep Residual Learning for Image Recognition” paper. This is used for ResNet V1 for 18, 34 layers.

Parameters:
  • channels (int) – Number of output channels.
  • stride (int) – Stride size.
  • downsample (bool, default False) – Whether to downsample the input.
  • in_channels (int, default 0) – Number of input channels. Default is 0, to infer from the graph.
class mxnet.gluon.model_zoo.vision.BottleneckV1(channels, stride, downsample=False, in_channels=0, **kwargs)

Bottleneck V1 from “Deep Residual Learning for Image Recognition” paper. This is used for ResNet V1 for 50, 101, 152 layers.

Parameters:
  • channels (int) – Number of output channels.
  • stride (int) – Stride size.
  • downsample (bool, default False) – Whether to downsample the input.
  • in_channels (int, default 0) – Number of input channels. Default is 0, to infer from the graph.
class mxnet.gluon.model_zoo.vision.ResNetV2(block, layers, channels, classes=1000, thumbnail=False, **kwargs)

ResNet V2 model from “Identity Mappings in Deep Residual Networks” paper.

Parameters:
  • block (HybridBlock) – Class for the residual block. Options are BasicBlockV2, BottleneckV2.
  • layers (list of int) – Numbers of layers in each block.
  • channels (list of int) – Numbers of channels in each block. Length should be one larger than layers list.
  • classes (int, default 1000) – Number of classification classes.
  • thumbnail (bool, default False) – Enable thumbnail.
class mxnet.gluon.model_zoo.vision.BasicBlockV2(channels, stride, downsample=False, in_channels=0, **kwargs)

BasicBlock V2 from “Identity Mappings in Deep Residual Networks” paper. This is used for ResNet V2 for 18, 34 layers.

Parameters:
  • channels (int) – Number of output channels.
  • stride (int) – Stride size.
  • downsample (bool, default False) – Whether to downsample the input.
  • in_channels (int, default 0) – Number of input channels. Default is 0, to infer from the graph.
class mxnet.gluon.model_zoo.vision.BottleneckV2(channels, stride, downsample=False, in_channels=0, **kwargs)

Bottleneck V2 from “Identity Mappings in Deep Residual Networks” paper. This is used for ResNet V2 for 50, 101, 152 layers.

Parameters:
  • channels (int) – Number of output channels.
  • stride (int) – Stride size.
  • downsample (bool, default False) – Whether to downsample the input.
  • in_channels (int, default 0) – Number of input channels. Default is 0, to infer from the graph.
vision.vgg11(**kwargs)

VGG-11 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.vgg13(**kwargs)

VGG-13 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.vgg16(**kwargs)

VGG-16 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.vgg19(**kwargs)

VGG-19 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.vgg11_bn(**kwargs)

VGG-11 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.vgg13_bn(**kwargs)

VGG-13 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.vgg16_bn(**kwargs)

VGG-16 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.vgg19_bn(**kwargs)

VGG-19 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.get_vgg(num_layers, pretrained=False, ctx=cpu(0), **kwargs)

VGG model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • num_layers (int) – Number of layers for the variant of VGG. Options are 11, 13, 16, 19.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
class mxnet.gluon.model_zoo.vision.VGG(layers, filters, classes=1000, batch_norm=False, **kwargs)

VGG model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.

Parameters:
  • layers (list of int) – Numbers of layers in each feature block.
  • filters (list of int) – Numbers of filters in each feature block. List length should match the layers.
  • classes (int, default 1000) – Number of classification classes.
  • batch_norm (bool, default False) – Use batch normalization.
vision.alexnet(pretrained=False, ctx=cpu(0), **kwargs)

AlexNet model from the “One weird trick...” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
class mxnet.gluon.model_zoo.vision.AlexNet(classes=1000, **kwargs)

AlexNet model from the “One weird trick...” paper.

Parameters: classes (int, default 1000) – Number of classes for the output layer.
vision.densenet121(**kwargs)

Densenet-BC 121-layer model from the “Densely Connected Convolutional Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.densenet161(**kwargs)

Densenet-BC 161-layer model from the “Densely Connected Convolutional Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.densenet169(**kwargs)

Densenet-BC 169-layer model from the “Densely Connected Convolutional Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.densenet201(**kwargs)

Densenet-BC 201-layer model from the “Densely Connected Convolutional Networks” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
class mxnet.gluon.model_zoo.vision.DenseNet(num_init_features, growth_rate, block_config, bn_size=4, dropout=0, classes=1000, **kwargs)

Densenet-BC model from the “Densely Connected Convolutional Networks” paper.

Parameters:
  • num_init_features (int) – Number of filters to learn in the first convolution layer.
  • growth_rate (int) – Number of filters to add each layer (k in the paper).
  • block_config (list of int) – List of integers for numbers of layers in each pooling block.
  • bn_size (int, default 4) – Multiplicative factor for the number of features in each bottleneck layer (i.e. bn_size * k features in the bottleneck layer).
  • dropout (float, default 0) – Rate of dropout after each dense layer.
  • classes (int, default 1000) – Number of classification classes.
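
How num_init_features, growth_rate, block_config, and bn_size interact can be sketched with plain arithmetic. The helper densenet_channels is hypothetical. It assumes, following the DenseNet-BC design in the paper, that every dense block except the last is followed by a transition that halves the channel count. The example values are the paper's 121-layer configuration.

```python
def densenet_channels(num_init_features, growth_rate, block_config, bn_size=4):
    """Return (channels after each block, bottleneck width).

    Assumes a DenseNet-BC layout: each layer adds `growth_rate` channels,
    and a transition after every block but the last halves the channels.
    """
    num_features = num_init_features
    per_block = []
    last = len(block_config) - 1
    for i, num_layers in enumerate(block_config):
        # Every layer in the block concatenates growth_rate new feature maps.
        num_features += num_layers * growth_rate
        if i != last:
            # Transition layer: compress the channel count by half (assumed).
            num_features //= 2
        per_block.append(num_features)
    # Each bottleneck 1x1 convolution produces bn_size * k feature maps.
    return per_block, bn_size * growth_rate


per_block, bottleneck = densenet_channels(64, 32, [6, 12, 24, 16])
print(per_block)   # [128, 256, 512, 1024]
print(bottleneck)  # 128
```

The last entry of per_block is the number of features fed to the classifier under these assumptions.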
vision.squeezenet1_0(**kwargs)

SqueezeNet 1.0 model from the “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
vision.squeezenet1_1(**kwargs)

SqueezeNet 1.1 model from the official SqueezeNet repo. SqueezeNet 1.1 has 2.4x less computation and slightly fewer parameters than SqueezeNet 1.0, without sacrificing accuracy.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
class mxnet.gluon.model_zoo.vision.SqueezeNet(version, classes=1000, **kwargs)

SqueezeNet model from the “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” paper. Version 1.1 comes from the official SqueezeNet repo and has 2.4x less computation and slightly fewer parameters than SqueezeNet 1.0, without sacrificing accuracy.

Parameters:
  • version (str) – Version of squeezenet. Options are ‘1.0’, ‘1.1’.
  • classes (int, default 1000) – Number of classification classes.
vision.inception_v3(pretrained=False, ctx=cpu(0), **kwargs)

Inception v3 model from the “Rethinking the Inception Architecture for Computer Vision” paper.

Parameters:
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
class mxnet.gluon.model_zoo.vision.Inception3(classes=1000, **kwargs)

Inception v3 model from the “Rethinking the Inception Architecture for Computer Vision” paper.

Parameters: classes (int, default 1000) – Number of classification classes.