gluon.loss
Gluon provides predefined loss functions for training neural networks in the mxnet.gluon.loss module.
Classes

Loss: Base class for loss.
L2Loss: Calculates the mean squared error between label and pred.
L1Loss: Calculates the mean absolute error between label and pred.
SigmoidBinaryCrossEntropyLoss: The cross-entropy loss for binary classification.
SigmoidBCELoss: The cross-entropy loss for binary classification.
SoftmaxCrossEntropyLoss: Computes the softmax cross entropy loss.
SoftmaxCELoss: Computes the softmax cross entropy loss.
KLDivLoss: The Kullback-Leibler divergence loss.
CTCLoss: Connectionist Temporal Classification Loss.
HuberLoss: Calculates smoothed L1 loss that is equal to L1 loss if absolute error exceeds rho but is equal to L2 loss otherwise.
HingeLoss: Calculates the hinge loss function often used in SVMs.
SquaredHingeLoss: Calculates the soft-margin loss function used in SVMs.
LogisticLoss: Calculates the logistic loss (for binary losses only).
TripletLoss: Calculates triplet loss given three input tensors and a positive margin.
PoissonNLLLoss: For a target (random variable) in a Poisson distribution, the function calculates the negative log-likelihood loss.
CosineEmbeddingLoss: For a target label 1 or -1, vectors input1 and input2, the function computes the cosine distance between the vectors.
SDMLLoss: Calculates Batchwise Smoothed Deep Metric Learning (SDML) Loss given two input tensors and a smoothing weight. SDML Loss learns similarity between paired samples by using unpaired samples in the minibatch as potential negative examples.

class mxnet.gluon.loss.Loss(weight, batch_axis, **kwargs)
Bases: mxnet.gluon.block.HybridBlock
Base class for loss.
 Parameters
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
Methods
hybrid_forward(F, x, *args, **kwargs): Overrides to construct symbolic graph for this Block.

class mxnet.gluon.loss.L2Loss(weight=1.0, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
Calculates the mean squared error between label and pred.
\[L = \frac{1}{2} \sum_i \vert {label}_i - {pred}_i \vert^2.\]
label and pred can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, label[, sample_weight]): Overrides to construct symbolic graph for this Block.
 Parameters
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: prediction tensor with arbitrary shape
label: target tensor with the same size as pred.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
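The formula and the Outputs note can be illustrated with a small pure-Python sketch. This is not the library implementation: the function name l2_loss is hypothetical, a batch is assumed to be a list of equal-length rows, and the per-sample reduction is assumed to be a mean over the non-batch dimension, as the Outputs note states.

```python
def l2_loss(pred, label, weight=1.0):
    """Per-sample L2 loss for a batch given as a list of equal-length rows."""
    losses = []
    for pred_row, label_row in zip(pred, label):
        # Half the squared error per element, scaled by the global weight.
        per_element = [0.5 * weight * (l - p) ** 2
                       for p, l in zip(pred_row, label_row)]
        # Average over the non-batch dimension (assumed from the Outputs note).
        losses.append(sum(per_element) / len(per_element))
    return losses


pred = [[1.0, 2.0], [0.0, 0.0]]
label = [[1.0, 4.0], [1.0, 1.0]]
print(l2_loss(pred, label))  # [1.0, 0.5]
```

The result has one entry per sample, matching the documented output shape (batch_size,).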

class mxnet.gluon.loss.L1Loss(weight=None, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
Calculates the mean absolute error between label and pred.
\[L = \sum_i \vert {label}_i - {pred}_i \vert.\]
label and pred can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, label[, sample_weight]): Overrides to construct symbolic graph for this Block.
 Parameters
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: prediction tensor with arbitrary shape
label: target tensor with the same size as pred.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.

class mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss(from_sigmoid=False, weight=None, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
The cross-entropy loss for binary classification. (alias: SigmoidBCELoss)
BCE loss is useful when training logistic regression. If from_sigmoid is False (default), this loss computes:
\[ \begin{align}\begin{aligned}prob = \frac{1}{1 + \exp(-{pred})}\\L = - \sum_i {label}_i * \log({prob}_i) * pos\_weight + (1 - {label}_i) * \log(1 - {prob}_i)\end{aligned}\end{align} \]
If from_sigmoid is True, this loss computes:
\[L = - \sum_i {label}_i * \log({pred}_i) * pos\_weight + (1 - {label}_i) * \log(1 - {pred}_i)\]
A tensor pos_weight > 1 decreases the false negative count, hence increasing the recall. Conversely, setting pos_weight < 1 decreases the false positive count and increases the precision.
pred and label can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, label[, …]): Overrides to construct symbolic graph for this Block.
 Parameters
from_sigmoid (bool, default is False) – Whether the input is from the output of sigmoid. Setting this to False makes the loss compute sigmoid and BCE together, which is more numerically stable through the log-sum-exp trick.
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: prediction tensor with arbitrary shape
label: target tensor with values in range [0, 1]. Must have the same size as pred.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
pos_weight: a weighting tensor of positive examples. Must be a vector with length equal to the number of classes. For example, if pred has shape (64, 10), pos_weight should have shape (1, 10).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
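The numerical-stability remark for from_sigmoid=False can be made concrete with a pure-Python sketch for a single element. It compares the formula written out directly against a commonly used stable rearrangement, max(x, 0) - x*z + log(1 + exp(-|x|)). The function names are hypothetical, pos_weight is omitted, and whether the library uses exactly this rearrangement is an assumption.

```python
import math


def bce_naive(pred, label):
    """BCE on a raw score: apply sigmoid, then take logs (unstable for large |pred|)."""
    prob = 1.0 / (1.0 + math.exp(-pred))
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def bce_stable(pred, label):
    """Algebraically equivalent form that never takes log(0)."""
    return max(pred, 0.0) - pred * label + math.log1p(math.exp(-abs(pred)))


# Both forms agree for moderate scores.
print(abs(bce_naive(0.3, 1.0) - bce_stable(0.3, 1.0)) < 1e-12)  # True

# For a very large score the sigmoid rounds to exactly 1.0, so the naive
# form would evaluate log(0); the stable form still returns a finite value.
print(bce_stable(1000.0, 0.0))  # 1000.0
```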

mxnet.gluon.loss.SigmoidBCELoss
Alias of mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss.

class mxnet.gluon.loss.SoftmaxCrossEntropyLoss(axis=1, sparse_label=True, from_logits=False, weight=None, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
Computes the softmax cross entropy loss. (alias: SoftmaxCELoss)
If sparse_label is True (default), label should contain integer category indicators:
\[ \begin{align}\begin{aligned}\DeclareMathOperator{softmax}{softmax}\\p = \softmax({pred})\\L = -\sum_i \log p_{i,{label}_i}\end{aligned}\end{align} \]
label’s shape should be pred’s shape with the axis dimension removed, i.e. for pred with shape (1,2,3,4) and axis = 2, label’s shape should be (1,2,4).
If sparse_label is False, label should contain a probability distribution and label’s shape should be the same as pred’s:
\[ \begin{align}\begin{aligned}p = \softmax({pred})\\L = -\sum_i \sum_j {label}_j \log p_{ij}\end{aligned}\end{align} \]
Methods
hybrid_forward(F, pred, label[, sample_weight]): Overrides to construct symbolic graph for this Block.
 Parameters
axis (int, default 1) – The axis to sum over when computing softmax and entropy.
sparse_label (bool, default True) – Whether label is an integer array instead of probability distribution.
from_logits (bool, default False) – Whether input is a log probability (usually from log_softmax) instead of unnormalized numbers.
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: the prediction tensor, where the batch_axis dimension ranges over batch size and axis dimension ranges over the number of classes.
label: the truth tensor. When sparse_label is True, label’s shape should be pred’s shape with the axis dimension removed. i.e. for pred with shape (1,2,3,4) and axis = 2, label’s shape should be (1,2,4) and values should be integers between 0 and 2. If sparse_label is False, label’s shape must be the same as pred and values should be floats in the range [0, 1].
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as label. For example, if label has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
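The two label conventions can be compared with a pure-Python sketch for 2-D input of shape (batch_size, num_classes). The names log_softmax and softmax_ce are hypothetical and the code only mirrors the formulas above; it is not the library implementation.

```python
import math


def log_softmax(row):
    """Log of the softmax of one row, computed with the max-shift for stability."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def softmax_ce(pred, label, sparse_label=True):
    """Per-sample softmax cross entropy for unnormalized scores."""
    losses = []
    for row, target in zip(pred, label):
        logp = log_softmax(row)
        if sparse_label:
            # target is an integer class index.
            losses.append(-logp[target])
        else:
            # target is a probability distribution over classes.
            losses.append(-sum(t * lp for t, lp in zip(target, logp)))
    return losses


scores = [[1.0, 2.0, 3.0]]
sparse = softmax_ce(scores, [2])
dense = softmax_ce(scores, [[0.0, 0.0, 1.0]], sparse_label=False)
print(abs(sparse[0] - dense[0]) < 1e-12)  # True
```

A one-hot distribution gives the same value as the corresponding integer label, which is why sparse labels are a convenient shorthand for hard targets.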

mxnet.gluon.loss.SoftmaxCELoss
Alias of mxnet.gluon.loss.SoftmaxCrossEntropyLoss.

class mxnet.gluon.loss.KLDivLoss(from_logits=True, axis=1, weight=None, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
The Kullback-Leibler divergence loss.
KL divergence measures the distance between continuous distributions. It can be used to minimize information loss when approximating a distribution. If from_logits is True (default), loss is defined as:
\[L = \sum_i {label}_i * \big[\log({label}_i) - {pred}_i\big]\]
If from_logits is False, loss is defined as:
\[ \begin{align}\begin{aligned}\DeclareMathOperator{softmax}{softmax}\\prob = \softmax({pred})\\L = \sum_i {label}_i * \big[\log({label}_i) - \log({prob}_i)\big]\end{aligned}\end{align} \]
label and pred can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, label[, sample_weight]): Overrides to construct symbolic graph for this Block.
 Parameters
from_logits (bool, default is True) – Whether the input is log probability (usually from log_softmax) instead of unnormalized numbers.
axis (int, default 1) – The dimension along which to compute softmax. Only used when from_logits is False.
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: prediction tensor with arbitrary shape. If from_logits is True, pred should be log probabilities. Otherwise, it should be unnormalized predictions, i.e. from a dense layer.
label: truth tensor with values in range (0, 1). Must have the same size as pred.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
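The from_logits=True case can be sketched in pure Python for a single sample. The function name kl_div is hypothetical. The sketch returns the plain sum from the formula, whereas the Outputs note says the library averages over non-batch dimensions, so values may differ by a constant factor. Skipping zero-probability labels is an assumption made here to avoid log(0).

```python
import math


def kl_div(pred_log_prob, label):
    """Sum of label_i * (log(label_i) - pred_i), with pred given as log probabilities."""
    return sum(l * (math.log(l) - p)
               for p, l in zip(pred_log_prob, label)
               if l > 0)  # assumption: zero-probability labels contribute nothing


label = [0.5, 0.5]
same = [math.log(0.5), math.log(0.5)]
other = [math.log(0.25), math.log(0.75)]

print(kl_div(same, label))        # 0.0
print(kl_div(other, label) > 0)   # True
```

The divergence is zero when the prediction matches the label distribution and positive otherwise.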

class mxnet.gluon.loss.CTCLoss(layout='NTC', label_layout='NT', weight=None, **kwargs)
Bases: mxnet.gluon.loss.Loss
Connectionist Temporal Classification Loss.
 Parameters
layout (str, default 'NTC') – Layout of prediction tensor. ‘N’, ‘T’, ‘C’ stands for batch size, sequence length, and alphabet_size respectively.
label_layout (str, default 'NT') – Layout of the labels. ‘N’, ‘T’ stands for batch size, and sequence length respectively.
weight (float or None) – Global scalar weight for loss.
Methods
hybrid_forward(F, pred, label[, …]): Overrides to construct symbolic graph for this Block.
 Inputs:
pred: unnormalized prediction tensor (before softmax). Its shape depends on layout. If layout is ‘TNC’, pred should have shape (sequence_length, batch_size, alphabet_size). Note that in the last dimension, index alphabet_size-1 is reserved for internal use as blank label. So alphabet_size is one plus the actual alphabet size.
label: zero-based label tensor. Its shape depends on label_layout. If label_layout is ‘TN’, label should have shape (label_sequence_length, batch_size).
pred_lengths: optional (default None), used for specifying the length of each entry when different pred entries in the same batch have different lengths. pred_lengths should have shape (batch_size,).
label_lengths: optional (default None), used for specifying the length of each entry when different label entries in the same batch have different lengths. label_lengths should have shape (batch_size,).
 Outputs:
loss: output loss has shape (batch_size,).
Example: suppose the vocabulary is [a, b, c], and in one batch we have three sequences ‘ba’, ‘cbb’, and ‘abac’. We can index the labels as {‘a’: 0, ‘b’: 1, ‘c’: 2, blank: 3}. Then alphabet_size should be 4, where label 3 is reserved for internal use by CTCLoss. We then need to pad each sequence with -1 to make a rectangular label tensor:
[[1, 0, -1, -1], [2, 1, 1, -1], [0, 1, 0, 2]]
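The label preparation in this example can be reproduced with a few lines of plain Python. The helper name encode_and_pad is hypothetical and not part of the library; it only builds the padded index lists, which would then be converted to a tensor before being passed to the loss.

```python
def encode_and_pad(sequences, vocab, pad_value=-1):
    """Map characters to zero-based indices and right-pad to a rectangular list."""
    index = {ch: i for i, ch in enumerate(vocab)}
    max_len = max(len(seq) for seq in sequences)
    return [[index[ch] for ch in seq] + [pad_value] * (max_len - len(seq))
            for seq in sequences]


vocab = ['a', 'b', 'c']
labels = encode_and_pad(['ba', 'cbb', 'abac'], vocab)
# One extra class for the blank label reserved by CTCLoss.
alphabet_size = len(vocab) + 1

print(labels)         # [[1, 0, -1, -1], [2, 1, 1, -1], [0, 1, 0, 2]]
print(alphabet_size)  # 4
```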

class mxnet.gluon.loss.HuberLoss(rho=1, weight=None, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
Calculates smoothed L1 loss that is equal to L1 loss if absolute error exceeds rho but is equal to L2 loss otherwise. Also called SmoothedL1 loss.
\[\begin{split}L = \sum_i \begin{cases} \frac{1}{2 {rho}} ({label}_i - {pred}_i)^2 & \text{ if } |{label}_i - {pred}_i| < {rho} \\ |{label}_i - {pred}_i| - \frac{{rho}}{2} & \text{ otherwise } \end{cases}\end{split}\]
label and pred can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, label[, sample_weight]): Overrides to construct symbolic graph for this Block.
 Parameters
rho (float, default 1) – Threshold for trimmed mean estimator.
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: prediction tensor with arbitrary shape
label: target tensor with the same size as pred.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
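The piecewise definition can be checked element by element with a short pure-Python sketch. The function name huber is hypothetical, and the code evaluates a single element rather than reducing over a batch.

```python
def huber(pred, label, rho=1.0):
    """Huber loss for one element: quadratic below rho, linear at or above it."""
    err = abs(label - pred)
    if err < rho:
        return 0.5 / rho * err ** 2
    return err - 0.5 * rho


print(huber(0.0, 0.5))  # 0.125 (quadratic branch)
print(huber(0.0, 3.0))  # 2.5 (linear branch)
print(huber(0.0, 1.0))  # 0.5 (the two branches meet at err == rho)
```

Because both branches give rho / 2 at the threshold, the loss is continuous while growing only linearly for large errors.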

class mxnet.gluon.loss.HingeLoss(margin=1, weight=None, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
Calculates the hinge loss function often used in SVMs:
\[L = \sum_i max(0, {margin} - {pred}_i \cdot {label}_i)\]
where pred is the classifier prediction and label is the target tensor containing values -1 or 1. label and pred must have the same number of elements.
Methods
hybrid_forward(F, pred, label[, sample_weight]): Overrides to construct symbolic graph for this Block.
 Parameters
margin (float) – The margin in hinge loss. Defaults to 1.0
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: prediction tensor with arbitrary shape.
label: truth tensor with values -1 or 1. Must have the same size as pred.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
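The effect of the margin can be seen in a single-element pure-Python sketch. The function name hinge is hypothetical, and the label is assumed to be -1 or 1 as described above.

```python
def hinge(pred, label, margin=1.0):
    """Hinge loss for one element; label must be -1 or 1."""
    return max(0.0, margin - pred * label)


print(hinge(2.0, 1))    # 0.0: correct side, beyond the margin
print(hinge(0.5, 1))    # 0.5: correct side, inside the margin
print(hinge(0.5, -1))   # 1.5: wrong side
```

A prediction only reaches zero loss once pred * label is at least the margin, so correct but low-confidence predictions are still penalized.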

class mxnet.gluon.loss.SquaredHingeLoss(margin=1, weight=None, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
Calculates the soft-margin loss function used in SVMs:
\[L = \sum_i max(0, {margin} - {pred}_i \cdot {label}_i)^2\]
where pred is the classifier prediction and label is the target tensor containing values -1 or 1. label and pred can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, label[, sample_weight]): Overrides to construct symbolic graph for this Block.
 Parameters
margin (float) – The margin in hinge loss. Defaults to 1.0
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: prediction tensor with arbitrary shape
label: truth tensor with values -1 or 1. Must have the same size as pred.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.

class mxnet.gluon.loss.LogisticLoss(weight=None, batch_axis=0, label_format='signed', **kwargs)
Bases: mxnet.gluon.loss.Loss
Calculates the logistic loss (for binary losses only):
\[L = \sum_i \log(1 + \exp(- {pred}_i \cdot {label}_i))\]
where pred is the classifier prediction and label is the target tensor containing values -1 or 1 (0 or 1 if label_format is binary). label and pred can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, label[, sample_weight]): Overrides to construct symbolic graph for this Block.
 Parameters
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
label_format (str, default 'signed') – Can be either ‘signed’ or ‘binary’. If the label_format is ‘signed’, all label values should be either -1 or 1. If the label_format is ‘binary’, all label values should be either 0 or 1.
 Inputs:
pred: prediction tensor with arbitrary shape.
label: truth tensor with values -1/1 (label_format is ‘signed’) or 0/1 (label_format is ‘binary’). Must have the same size as pred.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.
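The relationship between the two label formats can be sketched in pure Python for a single element. The function name logistic_loss is hypothetical. Mapping binary labels with 2 * label - 1 and the stable evaluation of log(1 + exp(z)) are choices made for this sketch, not confirmed details of the library.

```python
import math


def logistic_loss(pred, label, label_format='signed'):
    """log(1 + exp(-pred * label)) for one element, with optional 0/1 labels."""
    if label_format not in ('signed', 'binary'):
        raise ValueError("label_format must be 'signed' or 'binary'")
    if label_format == 'binary':
        label = 2 * label - 1  # map {0, 1} to {-1, 1}
    z = -pred * label
    # Stable evaluation of log(1 + exp(z)) that avoids overflow for large z.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


print(logistic_loss(2.0, 0, 'binary') == logistic_loss(2.0, -1))  # True
print(abs(logistic_loss(0.0, 1) - math.log(2.0)) < 1e-12)         # True
```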

class mxnet.gluon.loss.TripletLoss(margin=1, weight=None, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
Calculates triplet loss given three input tensors and a positive margin. Triplet loss measures the relative similarity between a positive example, a negative example, and prediction:
\[L = \sum_i \max(\Vert {positive}_i - {pred}_i \Vert_2^2 - \Vert {negative}_i - {pred}_i \Vert_2^2 + {margin}, 0)\]
positive, negative, and pred can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, positive, negative): Overrides to construct symbolic graph for this Block.
 Parameters
margin (float) – Margin of separation between correct and incorrect pair.
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
pred: prediction tensor with arbitrary shape
positive: positive example tensor with arbitrary shape. Must have the same size as pred.
negative: negative example tensor with arbitrary shape. Must have the same size as pred.
 Outputs:
loss: loss tensor with shape (batch_size,).
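The triplet formula can be illustrated for one sample with a pure-Python sketch. The function name triplet_loss is hypothetical and each input is assumed to be a flat list of floats for a single example.

```python
def triplet_loss(pred, positive, negative, margin=1.0):
    """Triplet loss for one sample using squared Euclidean distances."""
    d_pos = sum((p - a) ** 2 for p, a in zip(positive, pred))
    d_neg = sum((n - a) ** 2 for n, a in zip(negative, pred))
    return max(d_pos - d_neg + margin, 0.0)


anchor = [0.0, 0.0]
print(triplet_loss(anchor, [1.0, 0.0], [3.0, 0.0]))  # 0.0: negative is far enough away
print(triplet_loss(anchor, [1.0, 0.0], [1.0, 0.0]))  # 1.0: equal distances leave only the margin
```

The loss vanishes once the negative example is farther from the prediction than the positive example by at least the margin.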

class mxnet.gluon.loss.PoissonNLLLoss(weight=None, from_logits=True, batch_axis=0, compute_full=False, **kwargs)
Bases: mxnet.gluon.loss.Loss
For a target (random variable) in a Poisson distribution, the function calculates the negative log-likelihood loss. PoissonNLLLoss measures the loss accrued from a Poisson regression prediction made by the model.
\[L = \text{pred} - \text{target} * \log(\text{pred}) +\log(\text{target!})\]
target and pred can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, pred, target[, …]): Overrides to construct symbolic graph for this Block.
 Parameters
from_logits (boolean, default True) – Indicates whether the log(predicted) value has already been computed. If True, the loss is computed as \(\exp(\text{pred}) - \text{target} * \text{pred}\), and if False, the loss is computed as \(\text{pred} - \text{target} * \log(\text{pred}+\text{epsilon})\).
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
compute_full (boolean, default False) – Indicates whether to add an approximation (Stirling factor) for the factorial term in the formula for the loss. The Stirling factor is: \(\text{target} * \log(\text{target}) - \text{target} + 0.5 * \log(2 * \pi * \text{target})\)
epsilon (float, default 1e-08) – This is to avoid calculating log(0), which is not defined.
 Inputs:
pred: Predicted value
target: Random variable (count or number) which belongs to a Poisson distribution.
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: Average loss (shape=(1,1)) of the loss tensor with shape (batch_size,).
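The interaction of from_logits, compute_full, and epsilon can be sketched for a single element in pure Python. The function name poisson_nll is hypothetical. Adding the Stirling factor only when target > 1 is an assumption made here so that log(0) is never evaluated; the library may handle this differently.

```python
import math


def poisson_nll(pred, target, from_logits=True, compute_full=False, epsilon=1e-08):
    """Poisson negative log-likelihood for one element."""
    if from_logits:
        # pred is already log(rate).
        loss = math.exp(pred) - target * pred
    else:
        # pred is the rate itself; epsilon guards against log(0).
        loss = pred - target * math.log(pred + epsilon)
    if compute_full and target > 1:  # assumption: skip the factor for small targets
        loss += (target * math.log(target) - target
                 + 0.5 * math.log(2 * math.pi * target))
    return loss


print(poisson_nll(0.0, 3.0))  # 1.0: exp(0) - 3 * 0
a = poisson_nll(math.log(2.0), 3.0)              # log-rate input
b = poisson_nll(2.0, 3.0, from_logits=False)     # rate input
print(abs(a - b) < 1e-6)  # True: both parameterizations agree up to epsilon
```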

class mxnet.gluon.loss.CosineEmbeddingLoss(weight=None, batch_axis=0, margin=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
For a target label 1 or -1, vectors input1 and input2, the function computes the cosine distance between the vectors. This can be interpreted as how similar/dissimilar two input vectors are.
\[\begin{split}L = \sum_i \begin{cases} 1 - {cos\_sim({input1}_i, {input2}_i)} & \text{ if } {label}_i = 1\\ {cos\_sim({input1}_i, {input2}_i)} & \text{ if } {label}_i = -1 \end{cases}\\ cos\_sim(input1, input2) = \frac{{input1}_i.{input2}_i}{||{input1}_i||.||{input2}_i||}\end{split}\]
input1 and input2 can have arbitrary shape as long as they have the same number of elements.
Methods
hybrid_forward(F, input1, input2, label[, …]): Overrides to construct symbolic graph for this Block.
 Parameters
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
margin (float) – Margin of separation between correct and incorrect pair.
 Inputs:
input1: a tensor with arbitrary shape
input2: another tensor with the same shape as input1, to which input1 is compared for similarity and loss calculation
label: A 1D tensor indicating, for each pair input1 and input2, whether the target label is 1 or -1
sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as input1. For example, if input1 has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).
 Outputs:
loss: The loss tensor with shape (batch_size,).
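The displayed formula can be sketched for one pair of vectors in pure Python. The names cos_sim and cosine_embedding are hypothetical. The sketch follows the formula as written and leaves out the margin parameter, so it should not be read as the library's exact behavior for nonzero margins.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def cosine_embedding(input1, input2, label):
    """Loss for one pair: 1 - cos_sim for label 1, cos_sim for label -1."""
    sim = cos_sim(input1, input2)
    return 1.0 - sim if label == 1 else sim


print(cosine_embedding([1.0, 0.0], [1.0, 0.0], 1))   # 0.0: similar pair, identical vectors
print(cosine_embedding([1.0, 0.0], [1.0, 0.0], -1))  # 1.0: dissimilar pair, identical vectors
print(cosine_embedding([1.0, 0.0], [0.0, 1.0], -1))  # 0.0: dissimilar pair, orthogonal vectors
```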

class mxnet.gluon.loss.SDMLLoss(smoothing_parameter=0.3, weight=1.0, batch_axis=0, **kwargs)
Bases: mxnet.gluon.loss.Loss
Calculates Batchwise Smoothed Deep Metric Learning (SDML) Loss given two input tensors and a smoothing weight. SDML Loss learns similarity between paired samples by using unpaired samples in the minibatch as potential negative examples.
The loss is described in greater detail in “Large Scale Question Paraphrase Retrieval with Smoothed Deep Metric Learning” by Bonadiman, Daniele, Anjishnu Kumar, and Arpit Mittal. arXiv preprint arXiv:1905.12786 (2019). URL: https://arxiv.org/pdf/1905.12786.pdf
According to the authors, this loss formulation achieves comparable or higher accuracy to Triplet Loss but converges much faster. The loss assumes that the items in both tensors in each minibatch are aligned such that x1[0] corresponds to x2[0] and all other datapoints in the minibatch are unrelated. x1 and x2 are minibatches of vectors.
 Parameters
smoothing_parameter (float) – Probability mass to be distributed over the minibatch. Must be < 1.0.
weight (float or None) – Global scalar weight for loss.
batch_axis (int, default 0) – The axis that represents minibatch.
 Inputs:
x1: Minibatch of data points with shape (batch_size, vector_dim)
x2: Minibatch of data points with shape (batch_size, vector_dim). Each item in x2 is a positive sample for the same index in x1. That is, x1[0] and x2[0] form a positive pair, x1[1] and x2[1] form a positive pair, and so on. All data points in different rows should be decorrelated.
 Outputs:
loss: loss tensor with shape (batch_size,).
Methods
hybrid_forward(F, x1, x2): Overrides to construct symbolic graph for this Block.