Gluon Package¶
Warning
This package is currently experimental and may change in the near future.
Overview¶
Gluon package is a high-level interface for MXNet designed to be easy to use while keeping most of the flexibility of the low-level API. Gluon supports both imperative and symbolic programming, making it easy to train complex models imperatively in Python and then deploy them as a symbolic graph in C++ and Scala.
Parameter¶
Parameter 
A Container holding parameters (weights) of `Block`s. 
ParameterDict 
A dictionary managing a set of parameters. 
Containers¶
Block 
Base class for all neural network layers and models. 
HybridBlock 
HybridBlock supports forwarding with both Symbol and NDArray. 
SymbolBlock 
Construct block from symbol. 
Neural Network Layers¶
Containers¶
Sequential 
Stacks `Block`s sequentially. 
HybridSequential 
Stacks `HybridBlock`s sequentially. 
Basic Layers¶
Dense 
Just your regular densely-connected NN layer.
Activation 
Applies an activation function to input. 
Dropout 
Applies Dropout to the input. 
BatchNorm 
Batch normalization layer (Ioffe and Szegedy, 2014). 
LeakyReLU 
Leaky version of a Rectified Linear Unit. 
Embedding 
Turns non-negative integers (indexes/tokens) into dense vectors of fixed size.
Convolutional Layers¶
Conv1D 
1D convolution layer (e.g. temporal convolution). 
Conv2D 
2D convolution layer (e.g. spatial convolution over images). 
Conv3D 
3D convolution layer (e.g. spatial convolution over volumes). 
Conv1DTranspose 
Transposed 1D convolution layer (sometimes called Deconvolution). 
Conv2DTranspose 
Transposed 2D convolution layer (sometimes called Deconvolution). 
Conv3DTranspose 
Transposed 3D convolution layer (sometimes called Deconvolution). 
Pooling Layers¶
MaxPool1D 
Max pooling operation for one dimensional data. 
MaxPool2D 
Max pooling operation for two dimensional (spatial) data. 
MaxPool3D 
Max pooling operation for 3D data (spatial or spatiotemporal). 
AvgPool1D 
Average pooling operation for temporal data. 
AvgPool2D 
Average pooling operation for spatial data. 
AvgPool3D 
Average pooling operation for 3D data (spatial or spatiotemporal). 
GlobalMaxPool1D 
Global max pooling operation for temporal data. 
GlobalMaxPool2D 
Global max pooling operation for spatial data. 
GlobalMaxPool3D 
Global max pooling operation for 3D data. 
GlobalAvgPool1D 
Global average pooling operation for temporal data. 
GlobalAvgPool2D 
Global average pooling operation for spatial data. 
GlobalAvgPool3D 
Global average pooling operation for 3D data.
Recurrent Layers¶
RecurrentCell 
Abstract base class for RNN cells.
RNN
Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.
LSTM
Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
GRU
Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
RNNCell 
Simple recurrent neural network cell. 
LSTMCell 
Long-Short Term Memory (LSTM) network cell.
GRUCell
Gated Recurrent Unit (GRU) network cell.
SequentialRNNCell 
Sequentially stacking multiple RNN cells. 
BidirectionalCell 
Bidirectional RNN cell. 
DropoutCell 
Applies dropout on input. 
ZoneoutCell 
Applies Zoneout on base cell. 
ResidualCell 
Adds residual connection as described in Wu et al, 2016 (https://arxiv.org/abs/1609.08144). 
Loss functions¶
L2Loss 
Calculates the mean squared error between output and label.
L1Loss
Calculates the mean absolute error between output and label.
SoftmaxCrossEntropyLoss 
Computes the softmax cross entropy loss. 
KLDivLoss 
The Kullback-Leibler divergence loss.
Utilities¶
split_data 
Splits an NDArray into num_slice slices along batch_axis. 
split_and_load 
Splits an NDArray into len(ctx_list) slices along batch_axis and loads each slice to one context in ctx_list. 
clip_global_norm 
Rescales NDArrays so that the sum of their 2-norm is smaller than max_norm.
Data¶
Dataset 
Abstract dataset class. 
ArrayDataset 
A dataset with a data array and a label array. 
RecordFileDataset 
A dataset wrapping over a RecordIO (.rec) file. 
ImageRecordDataset 
Sampler 
Base class for samplers. 
SequentialSampler 
Samples elements from [0, length) sequentially. 
RandomSampler 
Samples elements from [0, length) randomly without replacement. 
BatchSampler 
Wraps over another Sampler and returns mini-batches of samples.
DataLoader
Loads data from a dataset and returns mini-batches of data.
Vision¶
MNIST 
MNIST handwritten digits dataset from http://yann.lecun.com/exdb/mnist.
CIFAR10
CIFAR10 image classification dataset from https://www.cs.toronto.edu/~kriz/cifar.html.
Model Zoo¶
Model zoo provides predefined and pretrained models to help bootstrap machine learning applications.
Vision¶
Module for predefined neural network models.
This module contains definitions for the following model architectures: AlexNet, DenseNet, Inception V3, ResNet V1, ResNet V2, SqueezeNet, VGG.
You can construct a model with random weights by calling its constructor:
from mxnet.gluon.model_zoo import vision as models
resnet18 = models.resnet18_v1()
alexnet = models.alexnet()
squeezenet = models.squeezenet1_0()
densenet = models.densenet161()
We provide pretrained models for all the models except ResNet V2.
These can be constructed by passing pretrained=True:
from mxnet.gluon.model_zoo import vision as models
resnet18 = models.resnet18_v1(pretrained=True)
alexnet = models.alexnet(pretrained=True)
Pretrained models are converted from torchvision.
All pretrained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (N x 3 x H x W), where N is the batch size, and H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
The transformation should preferably happen at preprocessing. You can use mx.image.color_normalize for such transformation:
image = image/255
normalized = mx.image.color_normalize(image,
mean=mx.nd.array([0.485, 0.456, 0.406]),
std=mx.nd.array([0.229, 0.224, 0.225]))
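As a rough illustration of what this preprocessing computes, the sketch below applies the same scaling and per-channel normalization with NumPy rather than MXNet. The helper name `normalize_image` and the H x W x 3 input layout are assumptions made for the example; they are not part of the Gluon API.

```python
import numpy as np

# Per-channel statistics quoted in the text above (RGB order).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(image_hwc_uint8):
    """Hypothetical helper: scale an H x W x 3 uint8 image to [0, 1],
    normalize each channel, and return it in 3 x H x W layout."""
    scaled = image_hwc_uint8.astype(np.float32) / 255.0
    normalized = (scaled - MEAN) / STD  # broadcasts over the channel axis
    return normalized.transpose(2, 0, 1)


# A dummy all-white 224 x 224 image, batched to N x 3 x H x W with N = 1.
image = np.full((224, 224, 3), 255, dtype=np.uint8)
batch = normalize_image(image)[np.newaxis]
print(batch.shape)
```

The transpose is only needed when the decoded image stores channels last; data that is already channel-first can skip it.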
get_model 
Returns a predefined model by name.
ResNet¶
resnet18_v1 
ResNet18 V1 model from “Deep Residual Learning for Image Recognition” paper. 
resnet34_v1 
ResNet34 V1 model from “Deep Residual Learning for Image Recognition” paper. 
resnet50_v1 
ResNet50 V1 model from “Deep Residual Learning for Image Recognition” paper. 
resnet101_v1 
ResNet101 V1 model from “Deep Residual Learning for Image Recognition” paper. 
resnet152_v1 
ResNet152 V1 model from “Deep Residual Learning for Image Recognition” paper. 
resnet18_v2 
ResNet18 V2 model from “Identity Mappings in Deep Residual Networks” paper. 
resnet34_v2 
ResNet34 V2 model from “Identity Mappings in Deep Residual Networks” paper. 
resnet50_v2 
ResNet50 V2 model from “Identity Mappings in Deep Residual Networks” paper. 
resnet101_v2 
ResNet101 V2 model from “Identity Mappings in Deep Residual Networks” paper. 
resnet152_v2 
ResNet152 V2 model from “Identity Mappings in Deep Residual Networks” paper. 
ResNetV1 
ResNet V1 model from “Deep Residual Learning for Image Recognition” paper. 
ResNetV2 
ResNet V2 model from “Identity Mappings in Deep Residual Networks” paper. 
BasicBlockV1 
BasicBlock V1 from “Deep Residual Learning for Image Recognition” paper. 
BasicBlockV2 
BasicBlock V2 from “Identity Mappings in Deep Residual Networks” paper. 
BottleneckV1 
Bottleneck V1 from “Deep Residual Learning for Image Recognition” paper. 
BottleneckV2 
Bottleneck V2 from “Identity Mappings in Deep Residual Networks” paper. 
get_resnet 
ResNet V1 model from “Deep Residual Learning for Image Recognition” paper. 
VGG¶
vgg11
VGG11 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
vgg13
VGG13 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
vgg16
VGG16 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
vgg19
VGG19 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
vgg11_bn
VGG11 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
vgg13_bn
VGG13 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
vgg16_bn
VGG16 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
vgg19_bn
VGG19 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
VGG
VGG model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
get_vgg
VGG model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Alexnet¶
alexnet 
AlexNet model from the “One weird trick...” paper. 
AlexNet 
AlexNet model from the “One weird trick...” paper. 
DenseNet¶
densenet121 
Densenet-BC 121-layer model from the “Densely Connected Convolutional Networks” paper.
densenet161
Densenet-BC 161-layer model from the “Densely Connected Convolutional Networks” paper.
densenet169
Densenet-BC 169-layer model from the “Densely Connected Convolutional Networks” paper.
densenet201
Densenet-BC 201-layer model from the “Densely Connected Convolutional Networks” paper.
DenseNet
Densenet-BC model from the “Densely Connected Convolutional Networks” paper.
SqueezeNet¶
squeezenet1_0 
SqueezeNet 1.0 model from the “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” paper.
squeezenet1_1
SqueezeNet 1.1 model from the official SqueezeNet repo.
SqueezeNet
SqueezeNet model from the “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” paper.
Inception¶
inception_v3 
Inception v3 model from “Rethinking the Inception Architecture for Computer Vision” paper. 
Inception3 
Inception v3 model from “Rethinking the Inception Architecture for Computer Vision” paper. 
API Reference¶

class
mxnet.gluon.
Parameter
(name, grad_req='write', shape=None, dtype='float32', lr_mult=1.0, wd_mult=1.0, init=None, allow_deferred_init=False, differentiable=True)¶ A Container holding parameters (weights) of Blocks.
Parameter holds a copy of the parameter on each Context after it is initialized with Parameter.initialize(...). If grad_req is not null, it will also hold a gradient array on each Context:
import mxnet as mx
ctx = mx.gpu(0)
x = mx.nd.zeros((16, 100), ctx=ctx)
w = mx.gluon.Parameter('fc_weight', shape=(64, 100), init=mx.init.Xavier())
b = mx.gluon.Parameter('fc_bias', shape=(64,), init=mx.init.Zero())
w.initialize(ctx=ctx)
b.initialize(ctx=ctx)
out = mx.nd.FullyConnected(x, w.data(ctx), b.data(ctx), num_hidden=64)
Parameters:  name (str) – Name of this parameter.
 grad_req ({'write', 'add', 'null'}, default 'write') –
Specifies how to update gradient to grad arrays.
 ‘write’ means the gradient is written to the grad NDArray every time.
 ‘add’ means the gradient is added to the grad NDArray every time. You need to manually call zero_grad() to clear the gradient buffer before each iteration when using this option.
 ‘null’ means gradient is not requested for this parameter. Gradient arrays will not be allocated.
 shape (tuple of int, default None) – Shape of this parameter. By default shape is not specified. Parameter with unknown shape can be used for Symbol API, but init will throw an error when using NDArray API.
 dtype (numpy.dtype or str, default 'float32') – Data type of this parameter. For example, numpy.float32 or ‘float32’.
 lr_mult (float, default 1.0) – Learning rate multiplier. Learning rate will be multiplied by lr_mult when updating this parameter with optimizer.
 wd_mult (float, default 1.0) – Weight decay multiplier (L2 regularizer coefficient). Works similar to lr_mult.
 init (Initializer, default None) – Initializer of this parameter. Will use the global initializer by default.

grad_req
¶ {‘write’, ‘add’, ‘null’} – This can be set before or after initialization. Setting grad_req to null with x.grad_req = ‘null’ saves memory and computation when you don’t need gradient w.r.t x.
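The three grad_req modes can be pictured with a toy gradient buffer. The sketch below is plain Python and does not call MXNet; the helper name `accumulate` is hypothetical and only models how the buffer changes after each backward pass.

```python
def accumulate(grad_buffer, new_grad, grad_req):
    """Toy model of a gradient buffer after one backward pass."""
    if grad_req == 'write':
        return list(new_grad)  # overwrite the previous gradient
    if grad_req == 'add':
        # accumulate on top of whatever is already in the buffer
        return [g + n for g, n in zip(grad_buffer, new_grad)]
    if grad_req == 'null':
        return None  # no buffer is kept at all
    raise ValueError('unknown grad_req: %r' % (grad_req,))


write_buf = [0.0, 0.0]
add_buf = [0.0, 0.0]
for step_grad in ([1.0, 2.0], [3.0, 4.0]):
    write_buf = accumulate(write_buf, step_grad, 'write')
    add_buf = accumulate(add_buf, step_grad, 'add')

print(write_buf)  # [3.0, 4.0]
print(add_buf)    # [4.0, 6.0]
```

With 'add', the buffer keeps growing across iterations, which is why it has to be cleared with zero_grad() before each one.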

initialize
(init=None, ctx=None, default_init=mx.init.Uniform(), force_reinit=False)¶ Initializes parameter and gradient arrays. Only used for NDArray API.
Parameters:  init (Initializer) – The initializer to use. Overrides Parameter.init and default_init.
 ctx (Context or list of Context, defaults to context.current_context()) – Initialize Parameter on given context. If ctx is a list of Context, a copy will be made for each context.
Note
Copies are independent arrays. User is responsible for keeping their values consistent when updating. Normally gluon.Trainer does this for you.
 default_init (Initializer) – Default initializer is used when both init and Parameter.init are None.
 force_reinit (bool, default False) – Whether to force reinitialization if parameter is already initialized.
Examples
>>> weight = mx.gluon.Parameter('weight', shape=(2, 2))
>>> weight.initialize(ctx=mx.cpu(0))
>>> weight.data()
[[0.01068833 0.01729892]
 [ 0.02042518 0.01618656]]
>>> weight.grad()
[[ 0. 0.]
 [ 0. 0.]]
>>> weight.initialize(ctx=[mx.gpu(0), mx.gpu(1)])
>>> weight.data(mx.gpu(0))
[[0.00873779 0.02834515]
 [ 0.05484822 0.06206018]]
>>> weight.data(mx.gpu(1))
[[0.00873779 0.02834515]
 [ 0.05484822 0.06206018]]

reset_ctx
(ctx)¶ Reassign Parameter to other contexts.
Parameters: ctx (Context or list of Context, default context.current_context()) – Assign Parameter to given context. If ctx is a list of Context, a copy will be made for each context.

set_data
(data)¶ Sets this parameter’s value on all contexts to data.

data
(ctx=None)¶ Returns a copy of this parameter on one context. Must have been initialized on this context before.
Parameters: ctx (Context) – Desired context.
Returns: NDArray on ctx.

list_data
()¶ Returns copies of this parameter on all contexts, in the same order as creation.

grad
(ctx=None)¶ Returns a gradient buffer for this parameter on one context.
Parameters: ctx (Context) – Desired context.

list_grad
()¶ Returns gradient buffers on all contexts, in the same order as values.

list_ctx
()¶ Returns a list of contexts this parameter is initialized on.

zero_grad
()¶ Sets gradient buffer on all contexts to 0. No action is taken if parameter is uninitialized or doesn’t require gradient.

var
()¶ Returns a symbol representing this parameter.

class
mxnet.gluon.
ParameterDict
(prefix='', shared=None)¶ A dictionary managing a set of parameters.
Parameters:  prefix (str, default '') – The prefix to be prepended to all Parameters’ names created by this dict.
 shared (ParameterDict or None) – If not None, when this dict’s get method creates a new parameter, will first try to retrieve it from shared dict. Usually used for sharing parameters with another Block.

prefix
¶ Prefix of this dict. It will be prepended to Parameters’ name created with get.

get
(name, **kwargs)¶ Retrieves a Parameter with name self.prefix+name. If not found, get will first try to retrieve it from shared dict. If still not found, get will create a new Parameter with keyword arguments and insert it to self.
Parameters:  name (str) – Name of the desired Parameter. It will be prepended with this dictionary’s prefix.
 **kwargs (dict) – The rest of keyword arguments for the created Parameter.
Returns: The created or retrieved Parameter.
Return type: Parameter
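The lookup order described for get can be sketched with a small stand-in class. `ToyParameterDict` is hypothetical: it stores plain dicts instead of real Parameter objects and mimics only the order in which get searches, not the actual MXNet implementation.

```python
class ToyParameterDict(object):
    """Hypothetical stand-in that mimics only the lookup order of get()."""

    def __init__(self, prefix='', shared=None):
        self.prefix = prefix
        self.shared = shared
        self._params = {}

    def get(self, name, **kwargs):
        full_name = self.prefix + name
        if full_name in self._params:
            return self._params[full_name]            # 1. already in this dict
        if self.shared is not None and full_name in self.shared._params:
            param = self.shared._params[full_name]    # 2. found in shared dict
        else:
            param = dict(name=full_name, **kwargs)    # 3. create a new entry
        self._params[full_name] = param
        return param


shared = ToyParameterDict(prefix='net_')
weight = shared.get('weight', shape=(2, 2))

other = ToyParameterDict(prefix='net_', shared=shared)
print(other.get('weight') is weight)          # True
print(other.get('bias', shape=(2,))['name'])  # net_bias
```

Because the second dict returns the very same object for 'weight', two Blocks built on these dicts would share that parameter rather than hold separate copies.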

update
(other)¶ Copies all Parameters in other to self.

initialize
(init=mx.init.Uniform(), ctx=None, verbose=False, force_reinit=False)¶ Initializes all Parameters managed by this dictionary to be used for NDArray API. It has no effect when using Symbol API.
Parameters:  init (Initializer) – Global default Initializer to be used when Parameter.init is None. Otherwise, Parameter.init takes precedence.
 ctx (Context or list of Context) – Keeps a copy of Parameters on one or many context(s).
 force_reinit (bool, default False) – Whether to force reinitialization if parameter is already initialized.

zero_grad
()¶ Sets all Parameters’ gradient buffer to 0.

reset_ctx
(ctx)¶ Reassign all Parameters to other contexts.
Parameters: ctx (Context or list of Context, default context.current_context()) – Assign Parameter to given context. If ctx is a list of Context, a copy will be made for each context.

setattr
(name, value)¶ Set an attribute to a new value for all Parameters.
For example, set grad_req to null if you don’t need gradient w.r.t a model’s Parameters:
model.collect_params().setattr('grad_req', 'null')
or change the learning rate multiplier:
model.collect_params().setattr('lr_mult', 0.5)
Parameters:  name (str) – Name of the attribute.
 value (valid type for attribute name) – The new value for the attribute.

save
(filename, strip_prefix='')¶ Save parameters to file.
Parameters:  filename (str) – Path to parameter file.
 strip_prefix (str, default '') – Strip prefix from parameter names before saving.

load
(filename, ctx, allow_missing=False, ignore_extra=False, restore_prefix='')¶ Load parameters from file.
Parameters:  filename (str) – Path to parameter file.
 ctx (Context or list of Context) – Context(s) on which to initialize loaded parameters.
 allow_missing (bool, default False) – Whether to silently skip loading parameters not represented in the file.
 ignore_extra (bool, default False) – Whether to silently ignore parameters from the file that are not present in this ParameterDict.
 restore_prefix (str, default '') – Prefix to prepend to names of stored parameters before loading.
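How allow_missing, ignore_extra and restore_prefix interact can be sketched as a name-matching check. The function `match_loaded` below is a hypothetical illustration in plain Python. Raising KeyError for the non-silent cases is an assumption, since the text only says what happens when the flags are set.

```python
def match_loaded(expected_names, file_names, allow_missing=False,
                 ignore_extra=False, restore_prefix=''):
    """Toy check: return the parameter names that would be loaded."""
    loaded = set(restore_prefix + name for name in file_names)
    expected = set(expected_names)
    missing = expected - loaded   # in the dict but not in the file
    extra = loaded - expected     # in the file but not in the dict
    if missing and not allow_missing:
        raise KeyError('missing from file: %s' % sorted(missing))
    if extra and not ignore_extra:
        raise KeyError('not in ParameterDict: %s' % sorted(extra))
    return sorted(expected & loaded)


expected = ['net_weight', 'net_bias']
stored = ['weight', 'unused']  # names saved with the prefix stripped

print(match_loaded(expected, stored, allow_missing=True,
                   ignore_extra=True, restore_prefix='net_'))
# ['net_weight']
```

restore_prefix is the counterpart of strip_prefix in save: names stored without a prefix get it back before they are matched.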

class
mxnet.gluon.
Block
(prefix=None, params=None)¶ Base class for all neural network layers and models. Your models should subclass this class.
Block can be nested recursively in a tree structure. You can create and assign child Block as regular attributes:
import mxnet as mx
from mxnet.gluon import Block, nn
from mxnet import ndarray as F

class Model(Block):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        # use name_scope to give child Blocks appropriate names.
        # It also allows sharing Parameters between Blocks recursively.
        with self.name_scope():
            self.dense0 = nn.Dense(20)
            self.dense1 = nn.Dense(20)

    def forward(self, x):
        x = F.relu(self.dense0(x))
        return F.relu(self.dense1(x))

model = Model()
model.initialize(ctx=mx.cpu(0))
model(F.zeros((10, 10), ctx=mx.cpu(0)))
Child Blocks assigned this way will be registered and collect_params will collect their Parameters recursively.
Parameters:  prefix (str) – Prefix acts like a name space. It will be prepended to the names of all Parameters and child Blocks in this Block's name_scope. Prefix should be unique within one model to prevent name collisions.
 params (ParameterDict or None) – ParameterDict for sharing weights with the new Block. For example, if you want dense1 to share dense0's weights, you can do:
dense0 = nn.Dense(20)
dense1 = nn.Dense(20, params=dense0.collect_params())

prefix
¶ Prefix of this Block.

name
¶ Name of this Block, without ‘_’ in the end.

name_scope
()¶ Returns a name space object managing a child Block and parameter names. Should be used within a with statement:
with self.name_scope():
    self.dense = nn.Dense(20)

params
¶ Returns this Block's parameter dictionary (does not include its children's parameters).

collect_params
()¶ Returns a ParameterDict containing the Parameters of this Block and all of its children.

save_params
(filename)¶ Save parameters to file.
Parameters: filename (str) – Path to file.

load_params
(filename, ctx, allow_missing=False, ignore_extra=False)¶ Load parameters from file.
Parameters:  filename (str) – Path to parameter file.
 ctx (Context or list of Context) – Context(s) on which to initialize loaded parameters.
 allow_missing (bool, default False) – Whether to silently skip loading parameters not represented in the file.
 ignore_extra (bool, default False) – Whether to silently ignore parameters from the file that are not present in this Block.

register_child
(block)¶ Registers block as a child of self. Blocks assigned to self as attributes will be registered automatically.

initialize
(init=mx.init.Uniform(), ctx=None, verbose=False)¶ Initializes Parameters of this Block and its children.
Equivalent to block.collect_params().initialize(...)

hybridize
(active=True)¶ Activates or deactivates HybridBlocks recursively. Has no effect on non-hybrid children.
Parameters: active (bool, default True) – Whether to turn hybrid on or off.

forward
(*args)¶ Override to implement forward computation using NDArray. Only accepts positional arguments.
Parameters: *args (list of NDArray) – Input tensors.

class
mxnet.gluon.
HybridBlock
(prefix=None, params=None)¶ HybridBlock supports forwarding with both Symbol and NDArray.
Forward computation in HybridBlock must be static to work with Symbols, i.e. you cannot call .asnumpy(), .shape, .dtype, etc. on tensors. Also, you cannot use branching or loop logic that is based on non-constant expressions like random numbers or intermediate results, since they change the graph structure for each iteration.
Before activating with hybridize(), HybridBlock works just like a normal Block. After activation, HybridBlock will create a symbolic graph representing the forward computation and cache it. On subsequent forwards, the cached graph will be used instead of hybrid_forward.
Refer to the Hybrid tutorial to see the end-to-end usage.
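The trace-once-then-replay behaviour can be pictured with a toy model in plain Python. Everything below (`Eager`, `Recorder`, `ToyHybridBlock`, and passing the backend as an argument `F`) is a hypothetical sketch of the caching idea, not the MXNet implementation.

```python
class Eager(object):
    """Toy imperative backend: every call computes a value immediately."""

    @staticmethod
    def double(x):
        return 2 * x

    @staticmethod
    def add_one(x):
        return x + 1


class Recorder(object):
    """Toy symbolic backend: every call only records the operation name."""

    def __init__(self):
        self.ops = []

    def double(self, x):
        self.ops.append('double')
        return x

    def add_one(self, x):
        self.ops.append('add_one')
        return x


class ToyHybridBlock(object):
    def __init__(self):
        self.active = False
        self.cached_ops = None
        self.forward_calls = 0

    def hybridize(self, active=True):
        self.active = active
        self.cached_ops = None

    def hybrid_forward(self, F, x):
        self.forward_calls += 1
        return F.add_one(F.double(x))

    def __call__(self, x):
        if not self.active:
            return self.hybrid_forward(Eager, x)
        if self.cached_ops is None:
            recorder = Recorder()
            self.hybrid_forward(recorder, 'placeholder')  # trace once
            self.cached_ops = recorder.ops
        for op in self.cached_ops:  # replay the cached graph
            x = getattr(Eager, op)(x)
        return x


block = ToyHybridBlock()
print(block(3))             # 7
block.hybridize()
print(block(3), block(4))   # 7 9
print(block.forward_calls)  # 2
```

During tracing the input is only a placeholder, so any Python branch that depends on its value would be fixed at trace time. That is the reason forward computation has to be static.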

infer_shape
(*args)¶ Infers shape of Parameters from inputs.

forward
(x, *args)¶ Defines the forward computation. Arguments can be either NDArray or Symbol.


class
mxnet.gluon.
SymbolBlock
(outputs, inputs, params=None)¶ Construct block from symbol. This is useful for using pretrained models as feature extractors. For example, you may want to extract the output from the fc2 layer in AlexNet.
Parameters:  outputs (Symbol or list of Symbol) – The desired output for SymbolBlock.
 inputs (Symbol or list of Symbol) – The Variables in output’s argument that should be used as inputs.
 params (ParameterDict) – Parameter dictionary for arguments and auxiliary states of outputs that are not inputs.
Examples
>>> # To extract the feature from fc1 and fc2 layers of AlexNet:
>>> alexnet = gluon.model_zoo.vision.alexnet(pretrained=True, ctx=mx.cpu(), prefix='model_')
>>> inputs = mx.sym.var('data')
>>> out = alexnet(inputs)
>>> internals = out.get_internals()
>>> print(internals.list_outputs())
['data', ..., 'model_dense0_relu_fwd_output', ..., 'model_dense1_relu_fwd_output', ...]
>>> outputs = [internals['model_dense0_relu_fwd_output'], internals['model_dense1_relu_fwd_output']]
>>> # Create SymbolBlock that shares parameters with alexnet
>>> feat_model = gluon.SymbolBlock(outputs, inputs, params=alexnet.collect_params())
>>> x = mx.nd.random_normal(shape=(16, 3, 224, 224))
>>> print(feat_model(x))

class
mxnet.gluon.nn.
Sequential
(prefix=None, params=None)¶ Stacks Blocks sequentially.
Example:
net = nn.Sequential()
# use net's name_scope to give child Blocks appropriate names.
with net.name_scope():
    net.add(nn.Dense(10, activation='relu'))
    net.add(nn.Dense(20))

add
(block)¶ Adds block on top of the stack.


class
mxnet.gluon.nn.
HybridSequential
(prefix=None, params=None)¶ Stacks HybridBlocks sequentially.
Example:
net = nn.HybridSequential()
# use net's name_scope to give child Blocks appropriate names.
with net.name_scope():
    net.add(nn.Dense(10, activation='relu'))
    net.add(nn.Dense(20))

add
(block)¶ Adds block on top of the stack.


class
mxnet.gluon.nn.
Dense
(units, activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_units=0, **kwargs)¶ Just your regular densely-connected NN layer.
Dense implements the operation: output = activation(dot(input, weight) + bias) where activation is the elementwise activation function passed as the activation argument, weight is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Note: the input must be a tensor with rank 2. Use flatten to convert it to rank 2 manually if necessary.
Parameters:  units (int) – Dimensionality of the output space.
 activation (str) – Activation function to use. See help on Activation layer. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the kernel weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 in_units (int, optional) – Size of the input data. If not specified, initialization will be deferred to the first time forward is called and in_units will be inferred from the shape of input data.
 prefix (str or None) – See document of Block.
 params (ParameterDict or None) – See document of Block.
 Input shape:
 A 2D input with shape (batch_size, in_units).
 Output shape:
 The output would have shape (batch_size, units).
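The Dense operation stated above, output = activation(dot(input, weight) + bias), can be sketched with NumPy. This is not the Gluon implementation: the helper names are hypothetical, and the weight layout (in_units, units) is an assumption chosen so that the formula reads exactly as written.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0)


def dense_forward(x, weight, bias=None, activation=None):
    """Hypothetical helper: activation(dot(input, weight) + bias)."""
    out = np.dot(x, weight)      # (batch_size, in_units) x (in_units, units)
    if bias is not None:         # corresponds to use_bias=True
        out = out + bias
    if activation is not None:   # None means the "linear" activation a(x) = x
        out = activation(out)
    return out


x = np.array([[1.0, -2.0, 3.0],
              [0.0, 1.0, -1.0]])   # batch_size=2, in_units=3
weight = np.ones((3, 2))           # units=2
bias = np.array([0.5, -3.0])

out = dense_forward(x, weight, bias, relu)
print(out.shape)  # (2, 2)
print(out)
```

The result has shape (batch_size, units), matching the output shape documented for the layer.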

class
mxnet.gluon.nn.
Activation
(activation, **kwargs)¶ Applies an activation function to input.
Parameters: activation (str) – Name of activation function to use. See Activation() for available choices.
 Input shape:
 Arbitrary.
 Output shape:
 Same shape as input.

class
mxnet.gluon.nn.
Dropout
(rate, **kwargs)¶ Applies Dropout to the input.
Dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting.
Parameters: rate (float) – Fraction of the input units to drop. Must be a number between 0 and 1.
 Input shape:
 Arbitrary.
 Output shape:
 Same shape as input.
References
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
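A minimal NumPy sketch of the masking step described above follows. The helper name `dropout` is hypothetical. Many implementations also rescale the surviving units during training; that step is left out here because the description above does not mention it.

```python
import numpy as np


def dropout(x, rate, rng):
    """Hypothetical helper: zero out roughly a fraction `rate` of the inputs."""
    keep_mask = rng.uniform(size=x.shape) >= rate  # True where a unit survives
    return x * keep_mask


rng = np.random.RandomState(0)
x = np.ones((1000, 100))
y = dropout(x, rate=0.25, rng=rng)

zero_fraction = float(np.mean(y == 0))
print(y.shape)
print(round(zero_fraction, 2))
```

The output has the same shape as the input, and the measured fraction of zeros should land close to the requested rate.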

class
mxnet.gluon.nn.
BatchNorm
(axis=1, momentum=0.9, epsilon=1e-05, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', running_mean_initializer='zeros', running_variance_initializer='ones', in_channels=0, **kwargs)¶ Batch normalization layer (Ioffe and Szegedy, 2014). Normalizes the input at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.
Parameters:  axis (int, default 1) – The axis that should be normalized. This is typically the channels (C) axis. For instance, after a Conv2D layer with layout=’NCHW’, set axis=1 in BatchNorm. If layout=’NHWC’, then set axis=3.
 momentum (float, default 0.9) – Momentum for the moving average.
 epsilon (float, default 1e-5) – Small float added to variance to avoid dividing by zero.
 center (bool, default True) – If True, add offset of beta to normalized tensor. If False, beta is ignored.
 scale (bool, default True) – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
 beta_initializer (str or Initializer, default ‘zeros’) – Initializer for the beta weight.
 gamma_initializer (str or Initializer, default ‘ones’) – Initializer for the gamma weight.
 running_mean_initializer (str or Initializer, default ‘zeros’) – Initializer for the moving mean.
 running_variance_initializer (str or Initializer, default ‘ones’) – Initializer for the moving variance.
 in_channels (int, default 0) – Number of channels (feature maps) in input data. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 Input shape:
 Arbitrary.
 Output shape:
 Same shape as input.
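The training-time transformation can be sketched with NumPy as follows. The helper `batch_norm_train` is hypothetical: it normalizes with the statistics of the current batch only and leaves out the moving averages controlled by momentum, so it is an illustration of the formula rather than the layer's implementation.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, axis=1, epsilon=1e-5):
    """Hypothetical helper: normalize over every axis except `axis`."""
    reduce_axes = tuple(i for i in range(x.ndim) if i != axis)
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    # Reshape gamma/beta so they broadcast along the channel axis.
    shape = [1] * x.ndim
    shape[axis] = x.shape[axis]
    return gamma.reshape(shape) * x_hat + beta.reshape(shape)


rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 6, 6))  # NCHW, 4 channels
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.shape)
print(np.round(y.mean(axis=(0, 2, 3)), 6))
print(np.round(y.std(axis=(0, 2, 3)), 3))
```

With gamma set to ones and beta to zeros, each channel of the output has a mean near 0 and a standard deviation near 1, as the description states.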

class
mxnet.gluon.nn.
LeakyReLU
(alpha, **kwargs)¶ Leaky version of a Rectified Linear Unit.
It allows a small gradient when the unit is not active:
`f(x) = alpha * x for x < 0`, `f(x) = x for x >= 0`.
Parameters: alpha (float) – Slope coefficient for the negative half axis. Must be >= 0.
 Input shape:
 Arbitrary.
 Output shape:
 Same shape as input.
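The piecewise definition of f(x) translates directly into a scalar function. This is a plain-Python sketch for illustration, with `leaky_relu` as a hypothetical name, not the layer itself.

```python
def leaky_relu(x, alpha):
    """f(x) = alpha * x for x < 0, f(x) = x for x >= 0."""
    return x if x >= 0 else alpha * x


values = [leaky_relu(v, alpha=0.1) for v in (-2.0, 0.0, 3.0)]
print(values)
```

Setting alpha to 0 in this sketch gives the ordinary ReLU, since negative inputs are then mapped to zero.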

class
mxnet.gluon.nn.
Embedding
(input_dim, output_dim, dtype='float32', weight_initializer=None, **kwargs)¶ Turns non-negative integers (indexes/tokens) into dense vectors of fixed size. e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, 0.2]]
Parameters:  input_dim (int) – Size of the vocabulary, i.e. maximum integer index + 1.
 output_dim (int) – Dimension of the dense embedding.
 dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
 weight_initializer (Initializer) – Initializer for the embeddings matrix.
 Input shape:
 2D tensor with shape: (N, M).
 Output shape:
 3D tensor with shape: (N, M, output_dim).
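An embedding lookup amounts to indexing rows of a weight matrix. The NumPy sketch below assumes a weight of shape (input_dim, output_dim) filled with easily recognizable values; it illustrates the shapes involved and is not the Gluon layer.

```python
import numpy as np

input_dim, output_dim = 5, 2  # vocabulary size and embedding size

# Row i of the weight matrix is the dense vector for index i.
weight = np.arange(input_dim * output_dim, dtype=np.float32)
weight = weight.reshape(input_dim, output_dim)

indices = np.array([[4, 0, 1],
                    [2, 2, 3]])   # shape (N, M) = (2, 3)
embedded = weight[indices]        # shape (N, M, output_dim)

print(embedded.shape)  # (2, 3, 2)
print(embedded[0, 0])  # the row stored for index 4
```

Every valid index must be smaller than input_dim, which is why input_dim is described as the maximum integer index plus one.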

class
mxnet.gluon.nn.
Conv1D
(channels, kernel_size, strides=1, padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)¶ 1D convolution layer (e.g. temporal convolution).
This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 1 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 1 int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
 dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 3D array of shape (batch_size, in_channels, width) if layout is NCW.
 Output shape:
This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW. out_width is calculated as:
out_width = floor((width + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1
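The output width can be computed ahead of time with a few lines of Python. The function name `conv1d_out_width` is hypothetical; integer floor division stands in for floor, which is valid as long as the numerator is not negative.

```python
def conv1d_out_width(width, kernel_size, stride=1, padding=0, dilation=1):
    """floor((width + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1"""
    return (width + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


print(conv1d_out_width(10, kernel_size=3))                       # 8
print(conv1d_out_width(10, kernel_size=3, stride=2, padding=1))  # 5
print(conv1d_out_width(10, kernel_size=3, dilation=2))           # 6
```

A dilation of 2 makes a 3-tap kernel span 5 input positions, which is why the third call yields the same width as a kernel of size 5 would.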

class
mxnet.gluon.nn.
Conv2D
(channels, kernel_size, strides=(1, 1), padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)¶ 2D convolution layer (e.g. spatial convolution over images).
This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 2 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 2 int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 4D array of shape (batch_size, in_channels, height, width) if layout is NCHW.
 Output shape:
This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.
out_height and out_width are calculated as:
out_height = floor((height+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/stride[0])+1
out_width = floor((width+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/stride[1])+1

class
mxnet.gluon.nn.
Conv3D
(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)¶ 3D convolution layer (e.g. spatial convolution over volumes).
This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 3 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 3 int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 5D array of shape (batch_size, in_channels, depth, height, width) if layout is NCDHW.
 Output shape:
This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.
out_depth, out_height and out_width are calculated as:
out_depth = floor((depth+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/stride[0])+1
out_height = floor((height+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/stride[1])+1
out_width = floor((width+2*padding[2]-dilation[2]*(kernel_size[2]-1)-1)/stride[2])+1
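The groups parameter of the convolution layers is described as running several convolutions side by side, each seeing a slice of the input channels and producing a slice of the output channels. A rough way to picture that partition is sketched below. The function name is hypothetical, and the divisibility check is an assumption made for the illustration rather than something stated in this section.

```python
def group_channel_slices(in_channels, channels, groups):
    """Per group, the (input, output) channel ranges it connects."""
    # Assumption for this sketch: both counts split evenly across groups.
    if in_channels % groups or channels % groups:
        raise ValueError("in_channels and channels must be divisible by groups")
    in_per, out_per = in_channels // groups, channels // groups
    return [
        (range(g * in_per, (g + 1) * in_per),
         range(g * out_per, (g + 1) * out_per))
        for g in range(groups)
    ]


for g, (ins, outs) in enumerate(group_channel_slices(4, 8, groups=2)):
    print(g, list(ins), list(outs))
# 0 [0, 1] [0, 1, 2, 3]
# 1 [2, 3] [4, 5, 6, 7]
```

With groups=1 there is a single slice covering every input and output channel, which matches the "all inputs are convolved to all outputs" case.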

class
mxnet.gluon.nn.
Conv1DTranspose
(channels, kernel_size, strides=1, padding=0, output_padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)¶ Transposed 1D convolution layer (sometimes called Deconvolution).
The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 1 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 1 int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 3D array of shape (batch_size, in_channels, width) if layout is NCW.
 Output shape:
This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.
out_width is calculated as:
out_width = (width-1)*strides-2*padding+kernel_size+output_padding
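The transposed output width grows with the stride rather than shrinking: it is (width-1) times strides, minus twice the padding, plus kernel_size and output_padding. A minimal sketch of that arithmetic follows; `conv_transpose_out_size` is a hypothetical helper name, and like the formula it does not model dilation.

```python
def conv_transpose_out_size(size, kernel_size, strides=1,
                            padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * strides - 2 * padding + kernel_size + output_padding


print(conv_transpose_out_size(5, 3))                        # 7
print(conv_transpose_out_size(5, 4, strides=2, padding=1))  # 10
```

The second call shows a common upsampling setup: kernel_size=4, strides=2, padding=1 exactly doubles the width.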

class
mxnet.gluon.nn.
Conv2DTranspose
(channels, kernel_size, strides=(1, 1), padding=(0, 0), output_padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)¶ Transposed 2D convolution layer (sometimes called Deconvolution).
The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 2 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 2 int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 4D array of shape (batch_size, in_channels, height, width) if layout is NCHW.
 Output shape:
This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.
out_height and out_width are calculated as:
out_height = (height-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
out_width = (width-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
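A strided convolution maps several input sizes to the same output size, so a transposed layer cannot always restore the original size with output_padding=0. The sketch below, with hypothetical helper names and dilation fixed at 1, derives the output_padding that undoes a given forward convolution along one axis by equating the two size formulas.

```python
def conv_out_size(size, kernel_size, stride, padding):
    """Forward convolution size along one axis (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_transpose_out_size(size, kernel_size, stride, padding,
                            output_padding=0):
    """Transposed convolution size along one axis."""
    return (size - 1) * stride - 2 * padding + kernel_size + output_padding


def output_padding_to_invert(size, kernel_size, stride, padding):
    """output_padding that makes the transposed layer restore `size`."""
    # The remainder dropped by the floor in the forward formula.
    return (size + 2 * padding - kernel_size) % stride


height, k, s, p = 8, 3, 2, 1
down = conv_out_size(height, k, s, p)                # 4
op = output_padding_to_invert(height, k, s, p)       # 1
up = conv_transpose_out_size(down, k, s, p, op)      # 8
print(down, op, up)  # 4 1 8
```

Without the extra output_padding the same settings would give 7 rather than 8.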

class
mxnet.gluon.nn.
Conv3DTranspose
(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), output_padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)¶ Transposed 3D convolution layer (sometimes called Deconvolution).
The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.
If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Parameters:  channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
 kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
 strides (int or tuple/list of 3 int) – Specifies the strides of the convolution.
 padding (int or tuple/list of 3 int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
 groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
 layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’, and ‘W’ dimensions.
 in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
 activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
 use_bias (bool) – Whether the layer uses a bias vector.
 weight_initializer (str or Initializer) – Initializer for the weights matrix.
 bias_initializer (str or Initializer) – Initializer for the bias vector.
 Input shape:
 This depends on the layout parameter. Input is 5D array of shape (batch_size, in_channels, depth, height, width) if layout is NCDHW.
 Output shape:
This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW. out_depth, out_height and out_width are calculated as:
out_depth = (depth-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
out_height = (height-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
out_width = (width-1)*strides[2]-2*padding[2]+kernel_size[2]+output_padding[2]

class
mxnet.gluon.nn.
MaxPool1D
(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)¶ Max pooling operation for one dimensional data.
Parameters:  pool_size (int) – Size of the max pooling windows.
 strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Pooling is applied on the W dimension.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 3D array of shape (batch_size, channels, width) if layout is NCW.
 Output shape:
This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.
out_width is calculated as:
out_width = floor((width+2*padding-pool_size)/strides)+1
When ceil_mode is True, ceil will be used instead of floor in this equation.
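The effect of ceil_mode is easiest to see when the pooling windows do not tile the input exactly. The sketch below evaluates the pooled width, (width + 2*padding - pool_size) / strides rounded down or up, plus one. `pool_out_size` is a hypothetical helper, and it mirrors the documented default of strides falling back to pool_size.

```python
import math


def pool_out_size(size, pool_size, strides=None, padding=0, ceil_mode=False):
    """Pooled length along one axis (illustrative helper)."""
    if strides is None:
        strides = pool_size  # documented default
    rounding = math.ceil if ceil_mode else math.floor
    return rounding((size + 2 * padding - pool_size) / strides) + 1


print(pool_out_size(7, 2))                        # 3
print(pool_out_size(7, 2, ceil_mode=True))        # 4
print(pool_out_size(7, 3, strides=2, padding=1))  # 4
```

For a width of 7 and pool_size=2, floor mode drops the last element, while ceil mode adds a final, partially filled window.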

class
mxnet.gluon.nn.
MaxPool2D
(pool_size=(2, 2), strides=None, padding=0, layout='NCHW', ceil_mode=False, **kwargs)¶ Max pooling operation for two dimensional (spatial) data.
Parameters:  pool_size (int or list/tuple of 2 ints) – Size of the max pooling windows.
 strides (int, list/tuple of 2 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int or list/tuple of 2 ints) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. padding is applied on ‘H’ and ‘W’ dimension.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 4D array of shape (batch_size, channels, height, width) if layout is NCHW.
 Output shape:
This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.
out_height and out_width are calculated as:
out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1
When ceil_mode is True, ceil will be used instead of floor in this equation.
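To make the operation itself concrete, here is a small pure-Python sketch of 2D max pooling over a single feature map (one sample, one channel). It is an illustration only, not how MXNet implements the layer: it assumes padding=0 and floor mode, and `max_pool_2d` is a hypothetical name.

```python
def max_pool_2d(feature_map, pool_size=(2, 2), strides=None):
    """Max-pool one 2D feature map given as a list of rows."""
    ph, pw = pool_size
    sh, sw = strides if strides is not None else pool_size
    height, width = len(feature_map), len(feature_map[0])
    out_h = (height - ph) // sh + 1
    out_w = (width - pw) // sw + 1
    return [
        [
            # Largest value inside the window whose top-left corner
            # sits at (i * sh, j * sw).
            max(
                feature_map[i * sh + di][j * sw + dj]
                for di in range(ph)
                for dj in range(pw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
print(max_pool_2d(image))  # [[6, 4], [7, 9]]
```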

class
mxnet.gluon.nn.
MaxPool3D
(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)¶ Max pooling operation for 3D data (spatial or spatiotemporal).
Parameters:  pool_size (int or list/tuple of 3 ints) – Size of the max pooling windows.
 strides (int, list/tuple of 3 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int or list/tuple of 3 ints) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. padding is applied on ‘D’, ‘H’ and ‘W’ dimension.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 5D array of shape (batch_size, channels, depth, height, width) if layout is NCDHW.
 Output shape:
This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.
out_depth, out_height and out_width are calculated as:
out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1
When ceil_mode is True, ceil will be used instead of floor in this equation.

class
mxnet.gluon.nn.
AvgPool1D
(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)¶ Average pooling operation for temporal data.
Parameters:  pool_size (int) – Size of the average pooling windows.
 strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCW') – Dimension ordering of data and weight. Can be ‘NCW’, ‘NWC’, etc. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. padding is applied on ‘W’ dimension.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 3D array of shape (batch_size, channels, width) if layout is NCW.
 Output shape:
This depends on the layout parameter. Output is 3D array of shape (batch_size, channels, out_width) if layout is NCW.
out_width is calculated as:
out_width = floor((width+2*padding-pool_size)/strides)+1
When ceil_mode is True, ceil will be used instead of floor in this equation.
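Average pooling over temporal data can be sketched in a few lines of plain Python. This illustration handles a single channel, assumes padding=0 and floor mode, and uses a hypothetical function name. How padded positions would be counted in the average is not covered in this section, so padding is left out.

```python
def avg_pool_1d(sequence, pool_size=2, strides=None):
    """Average-pool a 1D sequence of numbers (no padding, floor mode)."""
    if strides is None:
        strides = pool_size  # documented default
    out_width = (len(sequence) - pool_size) // strides + 1
    return [
        sum(sequence[i * strides:i * strides + pool_size]) / pool_size
        for i in range(out_width)
    ]


print(avg_pool_1d([1, 3, 5, 7, 9]))                          # [2.0, 6.0]
print(avg_pool_1d([1, 3, 5, 7, 9], pool_size=3, strides=1))  # [3.0, 5.0, 7.0]
```

In the first call the trailing 9 is dropped because a full window of size 2 no longer fits.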

class
mxnet.gluon.nn.
AvgPool2D
(pool_size=(2, 2), strides=None, padding=0, ceil_mode=False, layout='NCHW', **kwargs)¶ Average pooling operation for spatial data.
Parameters:  pool_size (int or list/tuple of 2 ints) – Size of the average pooling windows.
 strides (int, list/tuple of 2 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int or list/tuple of 2 ints) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCHW') – Dimension ordering of data and weight. Can be ‘NCHW’, ‘NHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. padding is applied on ‘H’ and ‘W’ dimension.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 4D array of shape (batch_size, channels, height, width) if layout is NCHW.
 Output shape:
This depends on the layout parameter. Output is 4D array of shape (batch_size, channels, out_height, out_width) if layout is NCHW.
out_height and out_width are calculated as:
out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1
When ceil_mode is True, ceil will be used instead of floor in this equation.

class
mxnet.gluon.nn.
AvgPool3D
(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)¶ Average pooling operation for 3D data (spatial or spatiotemporal).
Parameters:  pool_size (int or list/tuple of 3 ints) – Size of the average pooling windows.
 strides (int, list/tuple of 3 ints, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
 padding (int or list/tuple of 3 ints) – If padding is nonzero, then the input is implicitly zero-padded on both sides for padding number of points.
 layout (str, default 'NCDHW') – Dimension ordering of data and weight. Can be ‘NCDHW’, ‘NDHWC’, etc. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. padding is applied on ‘D’, ‘H’ and ‘W’ dimension.
 ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
 Input shape:
 This depends on the layout parameter. Input is 5D array of shape (batch_size, channels, depth, height, width) if layout is NCDHW.
 Output shape:
This depends on the layout parameter. Output is 5D array of shape (batch_size, channels, out_depth, out_height, out_width) if layout is NCDHW.
out_depth, out_height and out_width are calculated as:
out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1
When ceil_mode is True, ceil will be used instead of floor in this equation.

class
mxnet.gluon.nn.
GlobalMaxPool1D
(layout='NCW', **kwargs)¶ Global max pooling operation for temporal data.

class
mxnet.gluon.nn.
GlobalMaxPool2D
(layout='NCHW', **kwargs)¶ Global max pooling operation for spatial data.

class
mxnet.gluon.nn.
GlobalMaxPool3D
(layout='NCDHW', **kwargs)¶ Global max pooling operation for 3D data.

class
mxnet.gluon.nn.
GlobalAvgPool1D
(layout='NCW', **kwargs)¶ Global average pooling operation for temporal data.

class
mxnet.gluon.nn.
GlobalAvgPool2D
(layout='NCHW', **kwargs)¶ Global average pooling operation for spatial data.

class
mxnet.gluon.nn.
GlobalAvgPool3D
(layout='NCDHW', **kwargs)¶ Global average pooling operation for 3D data.
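Global pooling differs from the windowed layers in that each channel is reduced over its entire spatial extent. The sketch below shows the idea for NCW data held in nested lists. It assumes the pooled axis is kept with length 1, which this section does not state, and the function names are hypothetical.

```python
def global_pool_1d(batch, reduce):
    """Collapse the width axis of an NCW nested list to length 1."""
    return [[[reduce(channel)] for channel in sample] for sample in batch]


def mean(values):
    return sum(values) / len(values)


# Shape (1, 2, 4): one sample, two channels, width 4.
batch = [
    [[1, 5, 2, 4], [0, 2, 4, 6]],
]
print(global_pool_1d(batch, max))   # [[[5], [6]]]
print(global_pool_1d(batch, mean))  # [[[3.0], [3.0]]]
```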

class
mxnet.gluon.rnn.
RecurrentCell
(prefix=None, params=None)¶ Abstract base class for RNN cells.
Parameters:  prefix (str, optional) – Prefix for names of `Block`s (this prefix is also used for names of weights if `params` is None, i.e. if params are being created and not reused).
 params (Parameter or None, optional) – Container for weight sharing between cells. A new Parameter container is created if params is None.

begin_state
(batch_size=0, func=, **kwargs)¶ Initial state for this cell.
Parameters:  func (callable, default symbol.zeros) –
Function for creating initial state.
For Symbol API, func can be symbol.zeros, symbol.uniform, symbol.var etc. Use symbol.var if you want to directly feed input as states.
For NDArray API, func can be ndarray.zeros, ndarray.ones, etc.
 batch_size (int, default 0) – Only required for NDArray API. Size of the batch (‘N’ in layout) dimension of input.
 **kwargs – Additional keyword arguments passed to func. For example mean, std, dtype, etc.
Returns: states – Starting states for the first RNN step.
Return type: nested list of Symbol

forward
(inputs, states)¶ Unrolls the recurrent cell for one time step.
Parameters:  inputs (sym.Variable) – Input symbol, 2D, of shape (batch_size * num_units).
 states (list of sym.Variable) – RNN state from previous step or the output of begin_state().
Returns:  output (Symbol) – Symbol corresponding to the output from the RNN when unrolling for a single time step.
 states (list of Symbol) – The new state of this RNN after this unrolling. The type of this symbol is same as the output of begin_state(). This can be used as an input state to the next time step of this RNN.
See also
begin_state()
 This function can provide the states for the first time step.
unroll()
 This function unrolls an RNN for a given number of (>=1) time steps.

reset
()¶ Reset before reusing the cell for another graph.

state_info
(batch_size=0)¶ Shape and layout information of states.

unroll
(length, inputs, begin_state=None, layout='NTC', merge_outputs=None)¶ Unrolls an RNN cell across time steps.
Parameters:  length (int) – Number of steps to unroll.
 inputs (Symbol, list of Symbol, or None) –
If inputs is a single Symbol (usually the output of Embedding symbol), it should have shape (batch_size, length, ...) if layout is ‘NTC’, or (length, batch_size, ...) if layout is ‘TNC’.
If inputs is a list of symbols (usually output of previous unroll), they should all have shape (batch_size, ...).
 begin_state (nested list of Symbol, optional) – Input states created by begin_state() or output state of another cell. Created from begin_state() if None.
 layout (str, optional) – layout of input symbol. Only used if inputs is a single Symbol.
 merge_outputs (bool, optional) – If False, returns outputs as a list of Symbols. If True, concatenates output across time steps and returns a single symbol with shape (batch_size, length, ...) if layout is ‘NTC’, or (length, batch_size, ...) if layout is ‘TNC’. If None, output whatever is faster.
Returns:  outputs (list of Symbol or Symbol) – Symbol (if merge_outputs is True) or list of Symbols (if merge_outputs is False) corresponding to the output from the RNN from this unrolling.
 states (list of Symbol) – The new state of this RNN after this unrolling. The type of this symbol is same as the output of begin_state().
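Conceptually, unrolling repeats the single-step forward call once per time step, threading the states from each step into the next. The following is a simplified sketch of that loop using plain Python values instead of Symbols. The function signature and the toy cell are hypothetical, and layout handling and merge_outputs are left out.

```python
def unroll(step, length, inputs, begin_state):
    """Apply step(x_t, states) -> (output, states) for `length` steps.

    `inputs` is a list with one entry per time step.
    """
    if len(inputs) != length:
        raise ValueError("need exactly one input per time step")
    states, outputs = begin_state, []
    for t in range(length):
        output, states = step(inputs[t], states)
        outputs.append(output)
    return outputs, states


# Hypothetical toy cell: the state is a running sum, the output echoes it.
def running_sum_step(x, states):
    total = states[0] + x
    return total, [total]


outputs, states = unroll(running_sum_step, 4, [1, 2, 3, 4], begin_state=[0])
print(outputs, states)  # [1, 3, 6, 10] [10]
```

The returned states are those produced by the last step, so they can be passed as begin_state to a further unroll.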

class
mxnet.gluon.rnn.
RNN
(hidden_size, num_layers=1, activation='relu', layout='TNC', dropout=0, bidirectional=False, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, **kwargs)¶ Applies a multilayer Elman RNN with tanh or ReLU nonlinearity to an input sequence.
For each element in the input sequence, each layer computes the following function:
\[h_t = \tanh(w_{ih} * x_t + b_{ih} + w_{hh} * h_{(t-1)} + b_{hh})\]
where \(h_t\) is the hidden state at time t, and \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer. If activation=’relu’, then ReLU is used instead of tanh.
Parameters:  hidden_size (int) – The number of features in the hidden state h.
 num_layers (int, default 1) – Number of recurrent layers.
 activation ({'relu' or 'tanh'}, default 'relu') – The activation function to use.
 layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
 dropout (float, default 0) – If nonzero, introduces a dropout layer on the outputs of each RNN layer except the last layer.
 bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
 prefix (str or None) – Prefix of this Block.
 params (ParameterDict or None) – Shared Parameters for this Block.
 Input shapes:
 The input shape depends on layout. For layout=’TNC’, the input has shape (sequence_length, batch_size, input_size)
 Output shape:
 The output shape depends on layout. For layout=’TNC’, the output has shape (sequence_length, batch_size, num_hidden). If bidirectional is True, output shape will instead be (sequence_length, batch_size, 2*num_hidden)
 Recurrent state:
 The recurrent state is an NDArray with shape (num_layers, batch_size, num_hidden). If bidirectional is True, the recurrent state shape will instead be (2*num_layers, batch_size, num_hidden). If input recurrent state is None, zeros are used as default begin states, and the output recurrent state is omitted.
Examples
>>> layer = mx.gluon.rnn.RNN(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> # by default zeros are used as begin state
>>> output = layer(input)
>>> # manually specify begin state.
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, h0)
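The per-step Elman update, h_t = act(w_ih x_t + b_ih + w_hh h_(t-1) + b_hh), can also be written out without any framework. The sketch below is a plain-Python illustration for a single layer and a single sample. The function names and the tiny weights are made up for the example and say nothing about how the layer stores its parameters.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(value):
    return max(0.0, value)


def rnn_step(x_t, h_prev, w_ih, b_ih, w_hh, b_hh, activation=math.tanh):
    """One Elman step: act(w_ih x_t + b_ih + w_hh h_prev + b_hh)."""
    i2h = matvec(w_ih, x_t)
    h2h = matvec(w_hh, h_prev)
    return [
        activation(a + b + c + d)
        for a, b, c, d in zip(i2h, b_ih, h2h, b_hh)
    ]


# Toy parameters: hidden_size=2, input_size=2.
w_ih = [[0.5, -0.5], [1.0, 1.0]]
w_hh = [[0.0, 1.0], [1.0, 0.0]]
b_ih = [0.0, 0.0]
b_hh = [0.0, -1.0]

h1 = rnn_step([1.0, 2.0], [0.0, 0.0], w_ih, b_ih, w_hh, b_hh, activation=relu)
print(h1)  # [0.0, 2.0]
```

Passing activation=math.tanh instead gives the tanh variant of the same step.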

class
mxnet.gluon.rnn.
LSTM
(hidden_size, num_layers=1, layout='TNC', dropout=0, bidirectional=False, input_size=0, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', **kwargs)¶ Applies a multilayer long short-term memory (LSTM) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:
\[\begin{split}\begin{array}{ll} i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ f_t = sigmoid(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\ g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\ o_t = sigmoid(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\ c_t = f_t * c_{(t-1)} + i_t * g_t \\ h_t = o_t * \tanh(c_t) \end{array}\end{split}\]
where \(h_t\) is the hidden state at time t, \(c_t\) is the cell state at time t, \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer, and \(i_t\), \(f_t\), \(g_t\), \(o_t\) are the input, forget, cell, and out gates, respectively.
Parameters:  hidden_size (int) – The number of features in the hidden state h.
 num_layers (int, default 1) – Number of recurrent layers.
 layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
 dropout (float, default 0) – If nonzero, introduces a dropout layer on the outputs of each RNN layer except the last layer.
 bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
 prefix (str or None) – Prefix of this Block.
 params (ParameterDict or None) – Shared Parameters for this Block.
 Input shapes:
 The input shape depends on layout. For layout='TNC', the input has shape (sequence_length, batch_size, input_size).
 Output shape:
 The output shape depends on layout. For layout='TNC', the output has shape (sequence_length, batch_size, hidden_size). If bidirectional is True, the output shape will instead be (sequence_length, batch_size, 2*hidden_size).
 Recurrent state:
 The recurrent state is a list of two NDArrays. Both have shape (num_layers, batch_size, hidden_size). If bidirectional is True, each recurrent state will instead have shape (2*num_layers, batch_size, hidden_size). If the input recurrent state is None, zeros are used as the default begin states, and the output recurrent state is omitted.
Examples
>>> import mxnet as mx
>>> layer = mx.gluon.rnn.LSTM(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> # by default zeros are used as begin state
>>> output = layer(input)
>>> # manually specify begin state.
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> c0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, [h0, c0])
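The gate equations above can also be traced by hand. The following is a minimal plain-Python sketch of one time step for a single sample. It is not the library's implementation: the helper names (sigmoid, affine, lstm_step) and the params dictionary layout are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W and vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single sample, following the equations above.

    params maps each gate name ('i', 'f', 'g', 'o') to a tuple
    (W_i2h, b_i2h, W_h2h, b_h2h); this container layout is hypothetical.
    """
    pre = {}
    for gate, (W_x, b_x, W_h, b_h) in params.items():
        from_input = affine(W_x, x, b_x)
        from_state = affine(W_h, h_prev, b_h)
        pre[gate] = [p + q for p, q in zip(from_input, from_state)]
    i = [sigmoid(v) for v in pre['i']]
    f = [sigmoid(v) for v in pre['f']]
    g = [math.tanh(v) for v in pre['g']]
    o = [sigmoid(v) for v in pre['o']]
    # c_t = f_t * c_(t-1) + i_t * g_t, then h_t = o_t * tanh(c_t)
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Tiny example: input_size=2, hidden_size=1, all weights and biases zero,
# so every sigmoid gate equals 0.5 and the cell gate equals 0.
zero_gate = ([[0.0, 0.0]], [0.0], [[0.0]], [0.0])
params = {name: zero_gate for name in 'ifgo'}
h, c = lstm_step([1.0, -1.0], [0.0], [2.0], params)
```

With all-zero weights the new cell state is half of the previous one (here 1.0), which makes the roles of the forget and input gates easy to check.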

class
mxnet.gluon.rnn.
GRU
(hidden_size, num_layers=1, layout='TNC', dropout=0, bidirectional=False, input_size=0, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', **kwargs)¶ Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:
\[\begin{split}\begin{array}{ll} r_t = sigmoid(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\ i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)} + b_{hn})) \\ h_t = (1 - i_t) * n_t + i_t * h_{(t-1)} \\ \end{array}\end{split}\]where \(h_t\) is the hidden state at time t, \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer, and \(r_t\), \(i_t\), \(n_t\) are the reset, input, and new gates, respectively.
Parameters:  hidden_size (int) – The number of features in the hidden state h.
 num_layers (int, default 1) – Number of recurrent layers.
 layout (str, default 'TNC') – The format of input and output tensors. T, N and C stand for sequence length, batch size, and feature dimensions respectively.
 dropout (float, default 0) – If nonzero, introduces a dropout layer on the outputs of each RNN layer except the last layer.
 bidirectional (bool, default False) – If True, becomes a bidirectional RNN.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 input_size (int, default 0) – The number of expected features in the input x. If not specified, it will be inferred from input.
 prefix (str or None) – Prefix of this Block.
 params (ParameterDict or None) – Shared Parameters for this Block.
 Input shapes:
 The input shape depends on layout. For layout='TNC', the input has shape (sequence_length, batch_size, input_size).
 Output shape:
 The output shape depends on layout. For layout='TNC', the output has shape (sequence_length, batch_size, hidden_size). If bidirectional is True, the output shape will instead be (sequence_length, batch_size, 2*hidden_size).
 Recurrent state:
 The recurrent state is an NDArray with shape (num_layers, batch_size, hidden_size). If bidirectional is True, the recurrent state shape will instead be (2*num_layers, batch_size, hidden_size). If the input recurrent state is None, zeros are used as the default begin states, and the output recurrent state is omitted.
Examples
>>> import mxnet as mx
>>> layer = mx.gluon.rnn.GRU(100, 3)
>>> layer.initialize()
>>> input = mx.nd.random_uniform(shape=(5, 3, 10))
>>> # by default zeros are used as begin state
>>> output = layer(input)
>>> # manually specify begin state.
>>> h0 = mx.nd.random_uniform(shape=(3, 3, 100))
>>> output, hn = layer(input, h0)
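As a rough illustration of the GRU equations above, here is a plain-Python sketch of one time step for a single sample. It is an assumption-laden teaching aid rather than the library's code: gru_step, affine and the params layout are hypothetical names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W and vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def gru_step(x, h_prev, params):
    """One GRU time step; params maps 'r', 'i', 'n' to
    (W_i2h, b_i2h, W_h2h, b_h2h). The layout is hypothetical."""
    def parts(gate):
        W_x, b_x, W_h, b_h = params[gate]
        return affine(W_x, x, b_x), affine(W_h, h_prev, b_h)

    rx, rh = parts('r')
    ix, ih = parts('i')
    nx, nh = parts('n')
    r = [sigmoid(a + b) for a, b in zip(rx, rh)]
    i = [sigmoid(a + b) for a, b in zip(ix, ih)]
    # The reset gate scales only the recurrent part of the new gate.
    n = [math.tanh(a + r_k * b) for a, r_k, b in zip(nx, r, nh)]
    # h_t = (1 - i_t) * n_t + i_t * h_(t-1)
    return [(1.0 - i_k) * n_k + i_k * h_k for i_k, n_k, h_k in zip(i, n, h_prev)]


# input_size=2, hidden_size=1, all weights and biases zero:
# i_t = 0.5 and n_t = 0, so the new state is half of the old one.
zero_gate = ([[0.0, 0.0]], [0.0], [[0.0]], [0.0])
params = {name: zero_gate for name in 'rin'}
h = gru_step([1.0, -1.0], [0.8], params)
```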

class
mxnet.gluon.rnn.
RNNCell
(hidden_size, activation='tanh', i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)¶ Simple recurrent neural network cell.
Parameters:  hidden_size (int) – Number of units in output symbol.
 activation (str or Symbol, default 'tanh') – Type of activation function.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 prefix (str, default 'rnn_') – Prefix for names of `Block`s (and names of weights if params is None).
 params (Parameter or None) – Container for weight sharing between cells. Created if None.

class
mxnet.gluon.rnn.
LSTMCell
(hidden_size, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)¶ Long Short-Term Memory (LSTM) network cell.
Parameters:  hidden_size (int) – Number of units in output symbol.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 prefix (str, default 'lstm_') – Prefix for names of `Block`s (and names of weights if params is None).
 params (Parameter or None) – Container for weight sharing between cells. Created if None.

class
mxnet.gluon.rnn.
GRUCell
(hidden_size, i2h_weight_initializer=None, h2h_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, prefix=None, params=None)¶ Gated Recurrent Unit (GRU) network cell. Note: this is an implementation of the cuDNN version of GRUs (a slight modification compared to Cho et al. 2014).
Parameters:  hidden_size (int) – Number of units in output symbol.
 i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
 h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the recurrent state.
 i2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
 prefix (str, default 'gru_') – Prefix for names of `Block`s (and names of weights if params is None).
 params (Parameter or None) – Container for weight sharing between cells. Created if None.

class
mxnet.gluon.rnn.
SequentialRNNCell
(prefix=None, params=None)¶ Sequentially stacking multiple RNN cells.

add
(cell)¶ Appends a cell into the stack.
Parameters: cell (rnn cell) – The cell to append.


class
mxnet.gluon.rnn.
BidirectionalCell
(l_cell, r_cell, output_prefix='bi_')¶ Bidirectional RNN cell.
Parameters:  l_cell (RecurrentCell) – Cell for forward unrolling.
 r_cell (RecurrentCell) – Cell for backward unrolling.

class
mxnet.gluon.rnn.
DropoutCell
(rate, prefix=None, params=None)¶ Applies dropout on input.
Parameters: rate (float) – Percentage of elements to drop out, which is 1 - percentage to retain.

class
mxnet.gluon.rnn.
ZoneoutCell
(base_cell, zoneout_outputs=0.0, zoneout_states=0.0)¶ Applies Zoneout on base cell.

class
mxnet.gluon.rnn.
ResidualCell
(base_cell)¶ Adds residual connection as described in Wu et al, 2016 (https://arxiv.org/abs/1609.08144). Output of the cell is output of the base cell plus input.

class
mxnet.gluon.
Trainer
(params, optimizer, optimizer_params=None, kvstore='device')¶ Applies an Optimizer on a set of Parameters. Trainer should be used together with autograd.
Parameters:  params (ParameterDict) – The set of parameters to optimize.
 optimizer (str or Optimizer) – The optimizer to use. See help on Optimizer for a list of available optimizers.
 optimizer_params (dict) – Keyword arguments to be passed to optimizer constructor. For example, {‘learning_rate’: 0.1}. All optimizers accept learning_rate, wd (weight decay), clip_gradient, and lr_scheduler. See each optimizer’s constructor for a list of additional supported arguments.
 kvstore (str or KVStore) – kvstore type for multi-GPU and distributed training. See help on mxnet.kvstore.create for more information.

step
(batch_size, ignore_stale_grad=False)¶ Makes one step of parameter update. Should be called after autograd.compute_gradient and outside of record() scope.
Parameters:  batch_size (int) – Batch size of data processed. Gradient will be normalized by 1/batch_size. Set this to 1 if you normalized loss manually with loss = mean(loss).
 ignore_stale_grad (bool, optional, default=False) – If true, ignores Parameters with a stale gradient (a gradient that has not been updated by backward since the last step) and skips their update.
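The effect of batch_size on the update can be shown with a small sketch. The code below stands in plain SGD for whichever optimizer the Trainer is configured with, and sgd_step is a hypothetical helper, not part of the library. It only illustrates why batch_size should be 1 when the loss was already averaged.

```python
def sgd_step(weights, grads, learning_rate, batch_size):
    """Plain SGD update with gradients rescaled by 1/batch_size.

    This is a stand-in for the configured optimizer; only the
    1/batch_size normalization described above is illustrated.
    """
    scale = 1.0 / batch_size
    return [w - learning_rate * scale * g for w, g in zip(weights, grads)]


# Per-sample losses left unreduced: the gradient is a sum over 4 samples,
# so the real batch size is passed.
w_a = sgd_step([1.0, 2.0], [4.0, -8.0], learning_rate=0.5, batch_size=4)

# Loss already averaged with mean(loss): the gradient is already a mean,
# so batch_size=1 avoids dividing twice.
w_b = sgd_step([1.0, 2.0], [1.0, -2.0], learning_rate=0.5, batch_size=1)
```

Both calls produce the same weights, which is the behaviour the batch_size note describes.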

class
mxnet.gluon.loss.
L2Loss
(weight=1.0, batch_axis=0, **kwargs)¶ Calculates the mean squared error between output and label:
\[L = \frac{1}{2}\sum_i \Vert {output}_i - {label}_i \Vert^2.\]Output and label can have arbitrary shape as long as they have the same number of elements.
Parameters:  weight (float or None) – Global scalar weight for loss.
 sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
 batch_axis (int, default 0) – The axis that represents minibatch.
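A minimal sketch of the L2 formula above, written in plain Python and following the formula literally (half the sum of squared differences per sample, scaled by weight). The function name l2_loss is hypothetical, and any further reduction the library applies across non-batch axes is not modelled here.

```python
def l2_loss(outputs, labels, weight=1.0):
    """Per-sample value of 0.5 * sum((output - label)^2), scaled by weight.

    outputs and labels are lists of equal-length rows, one row per sample.
    """
    losses = []
    for out_row, label_row in zip(outputs, labels):
        squared = sum((o - l) ** 2 for o, l in zip(out_row, label_row))
        losses.append(weight * 0.5 * squared)
    return losses


losses = l2_loss([[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [3.0, 4.0]])
# first sample: 0.5 * (1 + 4), second sample: 0.5 * (9 + 16)
```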

class
mxnet.gluon.loss.
L1Loss
(weight=None, batch_axis=0, **kwargs)¶ Calculates the mean absolute error between output and label:
\[L = \frac{1}{2}\sum_i \vert {output}_i - {label}_i \vert.\]Output and label must have the same shape.
Parameters:  weight (float or None) – Global scalar weight for loss.
 sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
 batch_axis (int, default 0) – The axis that represents minibatch.

class
mxnet.gluon.loss.
SoftmaxCrossEntropyLoss
(axis=-1, sparse_label=True, from_logits=False, weight=None, batch_axis=0, **kwargs)¶ Computes the softmax cross entropy loss. (alias: SoftmaxCELoss)
If sparse_label is True, label should contain integer category indicators:
\[ \begin{align}\begin{aligned}p = {softmax}({output})\\L = -\sum_i {log}(p_{i,{label}_i})\end{aligned}\end{align} \]Label’s shape should be output’s shape without the axis dimension, i.e. for output.shape = (1,2,3,4) and axis = 2, label.shape should be (1,2,4).
If sparse_label is False, label should contain a probability distribution with the same shape as output:
\[ \begin{align}\begin{aligned}p = {softmax}({output})\\L = -\sum_i \sum_j {label}_j {log}(p_{ij})\end{aligned}\end{align} \]Parameters:  axis (int, default -1) – The axis to sum over when computing softmax and entropy.
 sparse_label (bool, default True) – Whether label is an integer array instead of probability distribution.
 from_logits (bool, default False) – Whether input is a log probability (usually from log_softmax) instead of unnormalized numbers.
 weight (float or None) – Global scalar weight for loss.
 sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
 batch_axis (int, default 0) – The axis that represents minibatch.
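The two label conventions can be compared with a short plain-Python sketch. It handles only 2D outputs of shape (batch, classes) and uses hypothetical helper names (log_softmax, softmax_ce); it is meant to illustrate the cross-entropy formulas, not to reproduce the library's implementation.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def softmax_ce(outputs, labels, sparse_label=True):
    """Per-sample softmax cross entropy for outputs of shape (batch, classes)."""
    losses = []
    for row, label in zip(outputs, labels):
        log_p = log_softmax(row)
        if sparse_label:
            # label is an integer class index
            losses.append(-log_p[label])
        else:
            # label is a probability distribution over classes
            losses.append(-sum(q * lp for q, lp in zip(label, log_p)))
    return losses


outputs = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
sparse = softmax_ce(outputs, [0, 2])
dense = softmax_ce(outputs, [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                   sparse_label=False)
```

A one-hot distribution gives the same loss as the corresponding integer label, and the uniform second row yields log(3).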

class
mxnet.gluon.loss.
KLDivLoss
(from_logits=True, weight=None, batch_axis=0, **kwargs)¶ The Kullback-Leibler divergence loss.
KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.
\[L = 1/n \sum_i (label_i * (log(label_i) - output_i))\]Label’s shape should be the same as output’s.
Parameters:  from_logits (bool, default is True) – Whether the input is log probability (usually from log_softmax) instead of unnormalized numbers.
 weight (float or None) – Global scalar weight for loss.
 sample_weight (Symbol or None) – Per sample weighting. Must be broadcastable to the same shape as loss. For example, if loss has shape (64, 10) and you want to weight each sample in the batch, sample_weight should have shape (64, 1).
 batch_axis (int, default 0) – The axis that represents minibatch.
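The KL formula above can be checked on small inputs with the following sketch. It assumes the from_logits=True case, so the output values are log probabilities, and it treats terms with a zero label as contributing 0, which is a convention assumed here rather than something the text states. The name kl_div is hypothetical.

```python
import math


def kl_div(log_probs, label):
    """Mean over elements of label_i * (log(label_i) - output_i).

    log_probs are log probabilities (the from_logits=True case). Terms with
    label_i == 0 are taken as 0 (assumed convention 0 * log(0) = 0).
    """
    terms = [q * (math.log(q) - lp) if q > 0 else 0.0
             for q, lp in zip(label, log_probs)]
    return sum(terms) / len(terms)


label = [0.5, 0.5]
same = kl_div([math.log(0.5), math.log(0.5)], label)
other = kl_div([math.log(0.25), math.log(0.75)], label)
```

The loss is zero when the predicted distribution matches the label and positive otherwise.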

utils.
split_data
(data, num_slice, batch_axis=0, even_split=True)¶ Splits an NDArray into num_slice slices along batch_axis. Usually used for data parallelism where each slice is sent to one device (e.g. a GPU).
Parameters:  data (NDArray) – A batch of data.
 num_slice (int) – Number of desired slices.
 batch_axis (int, default 0) – The axis along which to slice.
 even_split (bool, default True) – Whether to force all slices to have the same number of elements. If True, an error will be raised when num_slice does not evenly divide data.shape[batch_axis].
Returns: Return value is a list even if num_slice is 1.
Return type: list of NDArray
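The slicing behaviour described above can be sketched on plain Python lists. This is only an illustration along axis 0: split_rows is a hypothetical name, and letting the last slice absorb the remainder when even_split is False is an assumption of the sketch.

```python
def split_rows(data, num_slice, even_split=True):
    """Split a list of rows into num_slice contiguous slices along axis 0."""
    size = len(data)
    if size < num_slice:
        raise ValueError("cannot split %d rows into %d slices" % (size, num_slice))
    if even_split and size % num_slice != 0:
        raise ValueError(
            "data of length %d cannot be evenly split into %d slices"
            % (size, num_slice))
    step = size // num_slice
    slices = []
    for k in range(num_slice):
        start = k * step
        # With even_split=False the last slice absorbs the remainder
        # (an assumption of this sketch).
        stop = (k + 1) * step if k < num_slice - 1 else size
        slices.append(data[start:stop])
    return slices


even = split_rows(list(range(6)), 3)
uneven = split_rows(list(range(7)), 3, even_split=False)
single = split_rows(list(range(6)), 1)   # still a list, with one slice
```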

utils.
split_and_load
(data, ctx_list, batch_axis=0, even_split=True)¶ Splits an NDArray into len(ctx_list) slices along batch_axis and loads each slice to one context in ctx_list.
Parameters:  data (NDArray) – A batch of data.
 ctx_list (list of Context) – A list of Contexts.
 batch_axis (int, default 0) – The axis along which to slice.
 even_split (bool, default True) – Whether to force all slices to have the same number of elements.
Returns: Each corresponds to a context in ctx_list.
Return type: list of NDArray

utils.
clip_global_norm
(arrays, max_norm)¶ Rescales NDArrays so that the sum of their 2-norm is smaller than max_norm.
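A rough sketch of gradient clipping by a joint norm, on plain lists of floats. It assumes the joint norm is the square root of the sum of squares over all arrays, which is the usual definition of a global norm; the one-line description above does not spell this out, so treat that choice, and the name clip_by_joint_norm, as assumptions.

```python
import math


def clip_by_joint_norm(arrays, max_norm):
    """Rescale lists of floats in place so their joint 2-norm is at most max_norm.

    Returns the joint norm measured before rescaling.
    """
    total = math.sqrt(sum(v * v for arr in arrays for v in arr))
    if total > max_norm:
        scale = max_norm / total
        for arr in arrays:
            arr[:] = [v * scale for v in arr]
    return total


grads = [[3.0], [4.0]]
norm_before = clip_by_joint_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(v * v for arr in grads for v in arr))
```

All arrays are scaled by the same factor, so the direction of the combined gradient is preserved.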

class
mxnet.gluon.data.
Dataset
¶ Abstract dataset class. All datasets should have this interface.
Subclasses need to override __getitem__, which returns the i-th element, and __len__, which returns the total number of elements.
Note
An mxnet or numpy array can be directly used as a dataset.
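As a sketch of the interface described above, the class below implements only the two required methods. It deliberately does not inherit from the real base class so that it runs without the library installed; PairDataset is a hypothetical name.

```python
class PairDataset:
    """Minimal object exposing the two methods the Dataset interface requires.

    The class name is hypothetical and the real base class is not used,
    so this is an illustration of the protocol only.
    """

    def __init__(self, data, label):
        if len(data) != len(label):
            raise ValueError("data and label must have the same length")
        self._data = data
        self._label = label

    def __getitem__(self, idx):
        # Returns the idx-th sample as a (data, label) pair.
        return self._data[idx], self._label[idx]

    def __len__(self):
        # Total number of samples.
        return len(self._data)


dataset = PairDataset([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], [0, 1, 0])
sample = dataset[1]
```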

class
mxnet.gluon.data.
ArrayDataset
(data, label)¶ A dataset with a data array and a label array.
The i-th sample is (data[i], label[i]).
Parameters:  data (array-like object) – The data array. Can be mxnet or numpy array.
 label (array-like object) – The label array. Can be mxnet or numpy array.

class
mxnet.gluon.data.
RecordFileDataset
(filename)¶ A dataset wrapping over a RecordIO (.rec) file.
Each sample is a string representing the raw content of a record.
Parameters: filename (str) – Path to rec file.

class
mxnet.gluon.data.
Sampler
¶ Base class for samplers.
All samplers should subclass Sampler and define __iter__ and __len__ methods.

class
mxnet.gluon.data.
SequentialSampler
(length)¶ Samples elements from [0, length) sequentially.
Parameters: length (int) – Length of the sequence.

class
mxnet.gluon.data.
RandomSampler
(length)¶ Samples elements from [0, length) randomly without replacement.
Parameters: length (int) – Length of the sequence.

class
mxnet.gluon.data.
BatchSampler
(sampler, batch_size, last_batch='keep')¶ Wraps over another Sampler and returns minibatches of samples.
Parameters:  sampler (Sampler) – The source Sampler.
 batch_size (int) – Size of minibatch.
 last_batch ({'keep', 'discard', 'rollover'}) –
Specifies how the last batch is handled if batch_size does not evenly divide sequence length.
If ‘keep’, the last batch will be returned directly, but will contain fewer elements than batch_size requires.
If ‘discard’, the last batch will be discarded.
If ‘rollover’, the remaining elements will be rolled over to the next iteration.
Examples
>>> from mxnet import gluon
>>> sampler = gluon.data.SequentialSampler(10)
>>> batch_sampler = gluon.data.BatchSampler(sampler, 3, 'keep')
>>> list(batch_sampler)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
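The three last_batch modes can be compared side by side with a plain-Python sketch. SimpleBatchSampler is a hypothetical stand-in written for illustration; it follows the descriptions above but is not the library class.

```python
class SimpleBatchSampler:
    """Plain-Python illustration of the three last_batch modes."""

    def __init__(self, indices, batch_size, last_batch='keep'):
        if last_batch not in ('keep', 'discard', 'rollover'):
            raise ValueError("last_batch must be 'keep', 'discard' or 'rollover'")
        self._indices = list(indices)
        self._batch_size = batch_size
        self._last_batch = last_batch
        self._carry = []   # leftovers saved by 'rollover' for the next pass

    def __iter__(self):
        # Start from whatever the previous pass rolled over.
        batch, self._carry = self._carry, []
        for idx in self._indices:
            batch.append(idx)
            if len(batch) == self._batch_size:
                yield batch
                batch = []
        if batch:
            if self._last_batch == 'keep':
                yield batch
            elif self._last_batch == 'rollover':
                self._carry = batch
            # 'discard': the incomplete batch is dropped


keep = list(SimpleBatchSampler(range(10), 3, 'keep'))
discard = list(SimpleBatchSampler(range(10), 3, 'discard'))
roll = SimpleBatchSampler(range(10), 3, 'rollover')
first_pass = list(roll)
second_pass = list(roll)   # begins with the element left over from the first pass
```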

class
mxnet.gluon.data.
DataLoader
(dataset, batch_size=None, shuffle=False, sampler=None, last_batch=None, batch_sampler=None)¶ Loads data from a dataset and returns minibatches of data.
Parameters:  dataset (Dataset) – Source dataset. Note that numpy and mxnet arrays can be directly used as a Dataset.
 batch_size (int) – Size of minibatch.
 shuffle (bool) – Whether to shuffle the samples.
 sampler (Sampler) – The sampler to use. Either specify sampler or shuffle, not both.
 last_batch ({'keep', 'discard', 'rollover'}) –
How to handle the last batch if batch_size does not evenly divide len(dataset).
keep - A batch with fewer samples than previous batches is returned.
discard - The last batch is discarded if it is incomplete.
rollover - The remaining samples are rolled over to the next epoch.
 batch_sampler (Sampler) – A sampler that returns minibatches. Do not specify batch_size, shuffle, sampler, and last_batch if batch_sampler is specified.
Dataset container.

class
mxnet.gluon.data.vision.
MNIST
(root='~/.mxnet/datasets/', train=True, transform=None)¶ MNIST handwritten digits dataset from http://yann.lecun.com/exdb/mnist.
Each sample is an image (in 3D NDArray) with shape (28, 28, 1).
Parameters:  root (str) – Path to temp folder for storing data.
 train (bool) – Whether to load the training or testing set.
 transform (function) –
A user defined callback that transforms each instance. For example:
transform=lambda data, label: (data.astype(np.float32)/255, label)

class
mxnet.gluon.data.vision.
CIFAR10
(root='~/.mxnet/datasets/', train=True, transform=None)¶ CIFAR10 image classification dataset from https://www.cs.toronto.edu/~kriz/cifar.html.
Each sample is an image (in 3D NDArray) with shape (32, 32, 3).
Parameters:  root (str) – Path to temp folder for storing data.
 train (bool) – Whether to load the training or testing set.
 transform (function) –
A user defined callback that transforms each instance. For example:
transform=lambda data, label: (data.astype(np.float32)/255, label)

class
mxnet.gluon.data.vision.
ImageRecordDataset
(filename, flag=1, transform=None)¶ A dataset wrapping over a RecordIO file containing images.
Each sample is an image and its corresponding label.
Parameters:  filename (str) – Path to rec file.
 flag ({0, 1}, default 1) –
If 0, always convert images to greyscale.
If 1, always convert images to colored (RGB).
 transform (function) –
A user defined callback that transforms each instance. For example:
transform=lambda data, label: (data.astype(np.float32)/255, label)

class
mxnet.gluon.data.vision.
ImageFolderDataset
(root, flag=1, transform=None)¶ A dataset for loading image files stored in a folder structure like:
root/car/0001.jpg
root/car/xxxa.jpg
root/car/yyyb.jpg
root/bus/123.jpg
root/bus/023.jpg
root/bus/wwww.jpg
Parameters:  root (str) – Path to root directory.
 flag ({0, 1}, default 1) – If 0, always convert loaded images to greyscale (1 channel). If 1, always convert loaded images to colored (3 channels).
 transform (callable) –
A function that takes data and label and transforms them:
transform = lambda data, label: (data.astype(np.float32)/255, label)

synsets
¶ list – List of class names. synsets[i] is the name for the integer label i.

items
¶ list of tuples – List of all images in (filename, label) pairs.
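How a folder layout like the one above maps to synsets and items can be sketched with the standard library alone. The function list_image_folder, the alphabetical ordering of classes, and the extension filter are assumptions made for this illustration and may differ from the library's behaviour.

```python
import os
import tempfile


def list_image_folder(root, extensions=('.jpg', '.jpeg', '.png')):
    """Return (synsets, items) for a root/<class>/<image> folder layout.

    synsets[i] is the folder name for integer label i; items is a list of
    (filename, label) pairs. Ordering and extensions are assumptions.
    """
    synsets = []
    items = []
    for folder in sorted(os.listdir(root)):
        path = os.path.join(root, folder)
        if not os.path.isdir(path):
            continue
        label = len(synsets)
        synsets.append(folder)
        for name in sorted(os.listdir(path)):
            if os.path.splitext(name)[1].lower() in extensions:
                items.append((os.path.join(path, name), label))
    return synsets, items


# Build a throwaway tree that mirrors part of the layout shown above.
with tempfile.TemporaryDirectory() as root:
    for rel in ('car/0001.jpg', 'car/xxxa.jpg', 'bus/123.jpg'):
        full = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, 'wb').close()   # empty placeholder file
    synsets, items = list_image_folder(root)
    labels = [label for _, label in items]
```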

vision.
get_model
(name, **kwargs)¶ Returns a pre-defined model by name.
Parameters:  name (str) – Name of the model.
 pretrained (bool) – Whether to load the pretrained weights for model.
 classes (int) – Number of classes for the output layer.
Returns: The model.
Return type:

vision.
resnet18_v1
(**kwargs)¶ ResNet18 V1 model from “Deep Residual Learning for Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet34_v1
(**kwargs)¶ ResNet34 V1 model from “Deep Residual Learning for Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet50_v1
(**kwargs)¶ ResNet50 V1 model from “Deep Residual Learning for Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet101_v1
(**kwargs)¶ ResNet101 V1 model from “Deep Residual Learning for Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet152_v1
(**kwargs)¶ ResNet152 V1 model from “Deep Residual Learning for Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet18_v2
(**kwargs)¶ ResNet18 V2 model from “Identity Mappings in Deep Residual Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet34_v2
(**kwargs)¶ ResNet34 V2 model from “Identity Mappings in Deep Residual Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet50_v2
(**kwargs)¶ ResNet50 V2 model from “Identity Mappings in Deep Residual Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet101_v2
(**kwargs)¶ ResNet101 V2 model from “Identity Mappings in Deep Residual Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
resnet152_v2
(**kwargs)¶ ResNet152 V2 model from “Identity Mappings in Deep Residual Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
get_resnet
(version, num_layers, pretrained=False, ctx=cpu(0), **kwargs)¶ ResNet V1 model from “Deep Residual Learning for Image Recognition” paper. ResNet V2 model from “Identity Mappings in Deep Residual Networks” paper.
Parameters:  version (int) – Version of ResNet. Options are 1, 2.
 num_layers (int) – Numbers of layers. Options are 18, 34, 50, 101, 152.
 pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

class
mxnet.gluon.model_zoo.vision.
ResNetV1
(block, layers, channels, classes=1000, thumbnail=False, **kwargs)¶ ResNet V1 model from “Deep Residual Learning for Image Recognition” paper.
Parameters:  block (HybridBlock) – Class for the residual block. Options are BasicBlockV1, BottleneckV1.
 layers (list of int) – Numbers of layers in each block
 channels (list of int) – Numbers of channels in each block. Length should be one larger than layers list.
 classes (int, default 1000) – Number of classification classes.
 thumbnail (bool, default False) – Enable thumbnail.

class
mxnet.gluon.model_zoo.vision.
BasicBlockV1
(channels, stride, downsample=False, in_channels=0, **kwargs)¶ BasicBlock V1 from “Deep Residual Learning for Image Recognition” paper. This is used for ResNet V1 for 18, 34 layers.
Parameters:  channels (int) – Number of output channels.
 stride (int) – Stride size.
 downsample (bool, default False) – Whether to downsample the input.
 in_channels (int, default 0) – Number of input channels. Default is 0, to infer from the graph.

class
mxnet.gluon.model_zoo.vision.
BottleneckV1
(channels, stride, downsample=False, in_channels=0, **kwargs)¶ Bottleneck V1 from “Deep Residual Learning for Image Recognition” paper. This is used for ResNet V1 for 50, 101, 152 layers.
Parameters:  channels (int) – Number of output channels.
 stride (int) – Stride size.
 downsample (bool, default False) – Whether to downsample the input.
 in_channels (int, default 0) – Number of input channels. Default is 0, to infer from the graph.

class
mxnet.gluon.model_zoo.vision.
ResNetV2
(block, layers, channels, classes=1000, thumbnail=False, **kwargs)¶ ResNet V2 model from “Identity Mappings in Deep Residual Networks” paper.
Parameters:  block (HybridBlock) – Class for the residual block. Options are BasicBlockV2, BottleneckV2.
 layers (list of int) – Numbers of layers in each block
 channels (list of int) – Numbers of channels in each block. Length should be one larger than layers list.
 classes (int, default 1000) – Number of classification classes.
 thumbnail (bool, default False) – Enable thumbnail.

class
mxnet.gluon.model_zoo.vision.
BasicBlockV2
(channels, stride, downsample=False, in_channels=0, **kwargs)¶ BasicBlock V2 from “Identity Mappings in Deep Residual Networks” paper. This is used for ResNet V2 for 18, 34 layers.
Parameters:  channels (int) – Number of output channels.
 stride (int) – Stride size.
 downsample (bool, default False) – Whether to downsample the input.
 in_channels (int, default 0) – Number of input channels. Default is 0, to infer from the graph.

class
mxnet.gluon.model_zoo.vision.
BottleneckV2
(channels, stride, downsample=False, in_channels=0, **kwargs)¶ Bottleneck V2 from “Identity Mappings in Deep Residual Networks” paper. This is used for ResNet V2 for 50, 101, 152 layers.
Parameters:  channels (int) – Number of output channels.
 stride (int) – Stride size.
 downsample (bool, default False) – Whether to downsample the input.
 in_channels (int, default 0) – Number of input channels. Default is 0, to infer from the graph.

vision.
vgg11
(**kwargs)¶ VGG11 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
vgg13
(**kwargs)¶ VGG13 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
vgg16
(**kwargs)¶ VGG16 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
vgg19
(**kwargs)¶ VGG19 model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
vgg11_bn
(**kwargs)¶ VGG11 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
vgg13_bn
(**kwargs)¶ VGG13 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
vgg16_bn
(**kwargs)¶ VGG16 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
vgg19_bn
(**kwargs)¶ VGG19 model with batch normalization from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
get_vgg
(num_layers, pretrained=False, ctx=cpu(0), **kwargs)¶ VGG model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  num_layers (int) – Number of layers for the variant of VGG. Options are 11, 13, 16, 19.
 pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

class
mxnet.gluon.model_zoo.vision.
VGG
(layers, filters, classes=1000, batch_norm=False, **kwargs)¶ VGG model from the “Very Deep Convolutional Networks for Large-Scale Image Recognition” paper.
Parameters:  layers (list of int) – Numbers of layers in each feature block.
 filters (list of int) – Numbers of filters in each feature block. List length should match the layers.
 classes (int, default 1000) – Number of classification classes.
 batch_norm (bool, default False) – Use batch normalization.

vision.
alexnet
(pretrained=False, ctx=cpu(0), **kwargs)¶ AlexNet model from the “One weird trick...” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

class
mxnet.gluon.model_zoo.vision.
AlexNet
(classes=1000, **kwargs)¶ AlexNet model from the “One weird trick...” paper.
Parameters: classes (int, default 1000) – Number of classes for the output layer.

vision.
densenet121
(**kwargs)¶ Densenet-BC 121-layer model from the “Densely Connected Convolutional Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
densenet161
(**kwargs)¶ Densenet-BC 161-layer model from the “Densely Connected Convolutional Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
densenet169
(**kwargs)¶ Densenet-BC 169-layer model from the “Densely Connected Convolutional Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.
densenet201
(**kwargs)¶ Densenet-BC 201-layer model from the “Densely Connected Convolutional Networks” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

class mxnet.gluon.model_zoo.vision.DenseNet(num_init_features, growth_rate, block_config, bn_size=4, dropout=0, classes=1000, **kwargs)
Densenet-BC model from the “Densely Connected Convolutional Networks” paper.
Parameters:  num_init_features (int) – Number of filters to learn in the first convolution layer.
 growth_rate (int) – Number of filters added by each layer (k in the paper).
 block_config (list of int) – List of integers for numbers of layers in each pooling block.
 bn_size (int, default 4) – Multiplicative factor for the width of the bottleneck layers (i.e. bn_size * k features in each bottleneck layer).
 dropout (float, default 0) – Rate of dropout after each dense layer.
 classes (int, default 1000) – Number of classification classes.
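
How `num_init_features`, `growth_rate`, `block_config` and `bn_size` interact can be traced with simple arithmetic. The sketch below is pure Python and does not call the library. It assumes that channels are halved at each transition between blocks (the compression used by the BC design), which this page does not confirm for this class. `densenet_channel_trace` is a hypothetical helper, and the example values are the DenseNet-121 settings from the paper.

```python
# Illustrative sketch only: densenet_channel_trace is a hypothetical helper.
# Halving channels at each transition is an assumption based on the BC design.


def densenet_channel_trace(num_init_features, growth_rate, block_config, bn_size=4):
    """Trace channel counts through the dense blocks of a DenseNet-BC style network."""
    channels = num_init_features
    trace = []
    last_block = len(block_config) - 1
    for index, num_layers in enumerate(block_config):
        # Every layer in a dense block appends growth_rate new feature maps.
        channels += num_layers * growth_rate
        block_output = channels
        if index != last_block:
            # Assumed transition step: halve the channel count between blocks.
            channels //= 2
        trace.append((index + 1, block_output, channels))
    return {
        # Width of the 1x1 bottleneck inside each dense layer: bn_size * k.
        "bottleneck_width": bn_size * growth_rate,
        "trace": trace,
        "final_channels": channels,
    }


result = densenet_channel_trace(64, 32, [6, 12, 24, 16])
print(result["bottleneck_width"])  # 128
for block, out_channels, next_channels in result["trace"]:
    print(block, out_channels, next_channels)
print(result["final_channels"])    # 1024
```

Under these assumptions the four blocks output 256, 512, 1024 and 1024 channels, so the classifier would see 1024 features.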

vision.squeezenet1_0(**kwargs)
SqueezeNet 1.0 model from the “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

vision.squeezenet1_1(**kwargs)
SqueezeNet 1.1 model from the official SqueezeNet repo. SqueezeNet 1.1 has 2.4x less computation and slightly fewer parameters than SqueezeNet 1.0, without sacrificing accuracy.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

class mxnet.gluon.model_zoo.vision.SqueezeNet(version, classes=1000, **kwargs)
SqueezeNet model from the “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” paper. Version 1.1 comes from the official SqueezeNet repo; it has 2.4x less computation and slightly fewer parameters than SqueezeNet 1.0, without sacrificing accuracy.
Parameters:  version (str) – Version of squeezenet. Options are ‘1.0’, ‘1.1’.
 classes (int, default 1000) – Number of classification classes.

vision.inception_v3(pretrained=False, ctx=cpu(0), **kwargs)
Inception v3 model from the “Rethinking the Inception Architecture for Computer Vision” paper.
Parameters:  pretrained (bool, default False) – Whether to load the pretrained weights for model.
 ctx (Context, default CPU) – The context in which to load the pretrained weights.

class mxnet.gluon.model_zoo.vision.Inception3(classes=1000, **kwargs)
Inception v3 model from the “Rethinking the Inception Architecture for Computer Vision” paper.
Parameters: classes (int, default 1000) – Number of classification classes.