contrib.quantization

Quantization module for generating quantized (INT8) models from FP32 models.

Classes

CalibrationCollector()

Base class for all other collectors used with quantization

Functions

calib_graph(qsym, arg_params, aux_params, …)

User-level API for calibrating a quantized model using a filled collector.

quantize_graph(sym, arg_params, aux_params)

User-level API for generating a quantized model from an FP32 model without running calibration, together with a collector for subsequent naive or entropy calibration.

quantize_model(sym, arg_params, aux_params)

User-level API for generating a quantized model from an FP32 model w/ or w/o calibration.

quantize_model_onednn(sym, arg_params, …)

User-level API for generating a fusion + quantized model from an FP32 model w/ or w/o calibration with oneDNN.

quantize_net(network[, quantized_dtype, …])

User-level API for Gluon users to generate a quantized SymbolBlock from an FP32 HybridBlock w/ or w/o calibration.

class CalibrationCollector[source]

Bases: object

Base class for all other collectors used with quantization

Methods

collect(name, op_name, arr)

Function registered to a Block as a monitor callback.

post_collect()

Function called after collecting parameters.

abstract collect(name, op_name, arr)[source]

Function registered to a Block as a monitor callback. The names of layers requiring calibration are stored in the self.include_layers attribute.

Parameters
  • name (str) – Name of the node from which the collected data comes.

  • op_name (str) – Name of the operator from which the collected data comes. A single operator can have multiple input/output nodes, each with a different name.

  • arr (NDArray) – NDArray containing the data of the monitored node.

post_collect()[source]

Function called after collecting parameters. Returns a dictionary of min and max values for each calibrated layer. If not overridden, returns the content of self.min_max_dict.
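For illustration, a minimal custom collector might look like the following sketch. It assumes the base class initializes the include_layers and min_max_dict attributes described above; the class name and the min/max statistic are hypothetical choices, not part of the library.

from mxnet.contrib.quantization import CalibrationCollector

class MinMaxCollector(CalibrationCollector):
    # Hypothetical collector tracking a running min/max per monitored node.
    def collect(self, name, op_name, arr):
        # Skip nodes that the quantization pass did not mark for calibration.
        if self.include_layers is not None and name not in self.include_layers:
            return
        cur_min = arr.min().asscalar()
        cur_max = arr.max().asscalar()
        if name in self.min_max_dict:
            old_min, old_max = self.min_max_dict[name]
            self.min_max_dict[name] = (min(old_min, cur_min),
                                       max(old_max, cur_max))
        else:
            self.min_max_dict[name] = (cur_min, cur_max)
    # post_collect() is inherited: by default it returns self.min_max_dict.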

calib_graph(qsym, arg_params, aux_params, collector, calib_mode='entropy', logger=None)[source]

User-level API for calibrating a quantized model using a filled collector. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.

Parameters
  • qsym (Symbol) – Defines the structure of a neural network for INT8 data types.

  • arg_params (dict) – Dictionary of name to NDArray.

  • aux_params (dict) – Dictionary of name to NDArray.

  • collector (function) – Layer collector for naive or entropy calibration.

  • calib_mode (str) – If calib_mode=’none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. Quantized models generated in this mode are normally 10-20% slower at inference than those generated with calibration. If calib_mode=’naive’, the min and max values of the layer outputs from a calibration dataset will be taken directly as the thresholds for quantization. If calib_mode=’entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based on the calibration dataset.

  • quantized_dtype (str) – The quantized destination type for input data. Currently supports ‘int8’, ‘uint8’ and ‘auto’. ‘auto’ means the output type is selected automatically according to the calibration result. Default value is ‘int8’.

  • logger (Object) – A logging object for printing information during the process of quantization.

Returns

quantized_model – A tuple of the calibrated symbol, quantized arg_params, and aux_params.

Return type

tuple
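A minimal usage sketch, assuming qsym, arg_params and aux_params come from quantize_graph and that collector has already been filled by running forward passes over a calibration dataset:

from mxnet.contrib.quantization import calib_graph

cqsym, cqarg_params, cqaux_params = calib_graph(
    qsym, arg_params, aux_params, collector, calib_mode='entropy')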

quantize_graph(sym, arg_params, aux_params, device=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', quantized_dtype='int8', quantize_mode='full', quantize_granularity='tensor-wise', LayerOutputCollector=None, logger=None)[source]

User-level API for generating a quantized model from an FP32 model without running calibration, together with a collector for subsequent naive or entropy calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.

Parameters
  • sym (Symbol) – Defines the structure of a neural network for FP32 data types.

  • device (Device) – Defines the device on which forward propagation over the calibration dataset is run to collect layer output statistics. Currently, only a single device is supported.

  • arg_params (dict) – Dictionary of name to NDArray.

  • aux_params (dict) – Dictionary of name to NDArray.

  • excluded_sym_names (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.

  • excluded_op_names (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.

  • calib_mode (str) – If calib_mode=’none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. Quantized models generated in this mode are normally 10-20% slower at inference than those generated with calibration. If calib_mode=’naive’, the min and max values of the layer outputs from a calibration dataset will be taken directly as the thresholds for quantization. If calib_mode=’entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based on the calibration dataset.

  • quantized_dtype (str) – The quantized destination type for input data. Currently supports ‘int8’, ‘uint8’ and ‘auto’. ‘auto’ means the output type is selected automatically according to the calibration result. Default value is ‘int8’.

  • quantize_mode (str) – The quantization mode to apply. Supports ‘full’ and ‘smart’. ‘full’ quantizes all operators where possible, while ‘smart’ lets the quantization pass choose which operators should be quantized.

  • quantize_granularity (str) – The granularity of quantization, currently supports ‘tensor-wise’ and ‘channel-wise’ quantization. The default value is ‘tensor-wise’.

  • LayerOutputCollector (subclass of CalibrationCollector) – Used with custom calibration methods. The passed object’s include_layers attribute will be filled with the names of the layers that require calibration.

  • logger (Object) – A logging object for printing information during the process of quantization.

Returns

quantized_model – A tuple of the quantized symbol, quantized arg_params, aux_params, and the collector.

Return type

tuple
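A sketch of the typical flow; the excluded layer name is illustrative. After this call, the returned collector's collect method is registered as a monitor callback, forward passes are run over calibration data, and the process is finished with calib_graph as shown earlier.

import mxnet as mx
from mxnet.contrib.quantization import quantize_graph

qsym, qarg_params, qaux_params, collector = quantize_graph(
    sym, arg_params, aux_params,
    device=mx.cpu(),
    excluded_sym_names=['fc_output'],  # hypothetical layer name
    calib_mode='entropy',
    quantized_dtype='auto')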

quantize_model(sym, arg_params, aux_params, data_names=('data', ), device=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', calib_data=None, num_calib_batches=None, quantized_dtype='int8', quantize_mode='smart', quantize_granularity='tensor-wise', logger=None)[source]

User-level API for generating a quantized model from an FP32 model w/ or w/o calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now. The quantization implementation adopts TensorFlow’s approach: https://www.tensorflow.org/lite/performance/post_training_quantization. The calibration implementation borrows the idea of Nvidia’s 8-bit Inference with TensorRT: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf and adapts the method to MXNet.

Parameters
  • sym (Symbol) – Defines the structure of a neural network for FP32 data types.

  • arg_params (dict) – Dictionary of name to NDArray.

  • aux_params (dict) – Dictionary of name to NDArray.

  • data_names (list of strings) – Data names required for creating a Module object to run forward propagation on the calibration dataset.

  • device (Device) – Defines the device on which forward propagation over the calibration dataset is run to collect layer output statistics. Currently, only a single device is supported.

  • excluded_sym_names (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.

  • excluded_op_names (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.

  • calib_mode (str) – If calib_mode=’none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. Quantized models generated in this mode are normally 10-20% slower at inference than those generated with calibration. If calib_mode=’naive’, the min and max values of the layer outputs from a calibration dataset will be taken directly as the thresholds for quantization. If calib_mode=’entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based on the calibration dataset.

  • calib_data (DataLoader) – A DataLoader initialized with the calibration dataset.

  • num_calib_batches (int or None) – The maximum number of batches that the user would like to use for calibration. If not provided, the whole calibration dataset will be used.

  • quantized_dtype (str) – The quantized destination type for input data. Currently supports ‘int8’, ‘uint8’ and ‘auto’. ‘auto’ means the output type is selected automatically according to the calibration result. Default value is ‘int8’.

  • quantize_mode (str) – The quantization mode to apply. Supports ‘full’ and ‘smart’. ‘full’ quantizes all operators where possible, while ‘smart’ lets the quantization pass choose which operators should be quantized.

  • quantize_granularity (str) – The granularity of quantization, currently supports ‘tensor-wise’ and ‘channel-wise’ quantization. The default value is ‘tensor-wise’.

  • logger (Object) – A logging object for printing information during the process of quantization.

Returns

quantized_model – A tuple of the quantized symbol, quantized arg_params, and aux_params.

Return type

tuple
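A minimal sketch, assuming sym, arg_params and aux_params were loaded from an FP32 checkpoint and calib_loader is a DataLoader over a calibration dataset (both names are illustrative):

import mxnet as mx
from mxnet.contrib.quantization import quantize_model

qsym, qarg_params, qaux_params = quantize_model(
    sym, arg_params, aux_params,
    data_names=('data',),
    device=mx.cpu(),
    calib_mode='entropy',
    calib_data=calib_loader,
    num_calib_batches=10,
    quantized_dtype='auto')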

quantize_model_onednn(sym, arg_params, aux_params, data_names=('data', ), device=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', calib_data=None, num_calib_batches=None, quantized_dtype='int8', quantize_mode='smart', quantize_granularity='tensor-wise', logger=None)[source]

User-level API for generating a fusion + quantized model from an FP32 model w/ or w/o calibration with oneDNN. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.

Parameters

  • all – As in quantize_model.

Returns

quantized_model – A tuple of the quantized symbol, quantized arg_params, and aux_params.

Return type

tuple
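The call pattern mirrors quantize_model; under the same assumptions as the sketch above, oneDNN operator fusion is applied before quantization:

from mxnet.contrib.quantization import quantize_model_onednn

qsym, qarg_params, qaux_params = quantize_model_onednn(
    sym, arg_params, aux_params,
    calib_mode='naive',
    calib_data=calib_loader,
    num_calib_batches=10)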

quantize_net(network, quantized_dtype='auto', quantize_mode='full', quantize_granularity='tensor-wise', exclude_layers=None, exclude_layers_match=None, exclude_operators=None, calib_data=None, data_shapes=None, calib_mode='none', num_calib_batches=None, device=cpu(0), LayerOutputCollector=None, logger=None)[source]

User-level API for Gluon users to generate a quantized SymbolBlock from an FP32 HybridBlock w/ or w/o calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.

Parameters
  • network (Gluon HybridBlock) – Defines the structure of a neural network for FP32 data types.

  • quantized_dtype (str) – The quantized destination type for input data. Currently supports ‘int8’, ‘uint8’ and ‘auto’. ‘auto’ means the output type is selected automatically according to the calibration result. Default value is ‘auto’.

  • quantize_mode (str) – The quantization mode to apply. Supports ‘full’ and ‘smart’. ‘full’ quantizes all operators where possible, while ‘smart’ lets the quantization pass choose which operators should be quantized.

  • quantize_granularity (str) – The granularity of quantization, currently supports ‘tensor-wise’ and ‘channel-wise’ quantization. The default value is ‘tensor-wise’.

  • exclude_layers (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.

  • exclude_layers_match (list of strings) – A list of wildcard strings matched against the names of the symbols that users want to exclude from being quantized.

  • exclude_operators (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.

  • calib_data (gluon.DataLoader) – An iterable data loading object.

  • data_shapes (list of DataDesc or list of tuple) – A list of data shapes. Required if calib_data is not provided. If tuples are given, the input names are generated automatically.

  • calib_mode (str) – If calib_mode=’none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. Quantized models generated in this mode are normally 10-20% slower at inference than those generated with calibration. If calib_mode=’naive’, the min and max values of the layer outputs from a calibration dataset will be taken directly as the thresholds for quantization. If calib_mode=’entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based on the calibration dataset. If calib_mode=’custom’, the provided LayerOutputCollector will be used to determine the thresholds for quantization. For more information refer to the CalibrationCollector documentation.

  • num_calib_batches (int or None) – The maximum number of batches that the user would like to use for calibration. If not provided, the whole calibration dataset will be used.

  • device (Device) – Defines the device on which forward propagation over the calibration dataset is run to collect layer output statistics. Currently, only a single device is supported.

  • LayerOutputCollector (subclass of CalibrationCollector) – Used with custom calibration methods. The passed object’s include_layers attribute will be filled with the names of the layers that require calibration.

  • logger (Object) – A logging object for printing information during the process of quantization.

Returns

network – Defines the structure of a neural network for INT8 data types.

Return type

Gluon SymbolBlock
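A minimal sketch for the Gluon path, assuming an MXNet build with the INT8 backend enabled; the network choice, input shapes, and random calibration data are purely illustrative:

import mxnet as mx
from mxnet import gluon
from mxnet.contrib.quantization import quantize_net

net = gluon.model_zoo.vision.resnet18_v1(pretrained=True)

# Calibration data: any iterable DataLoader yielding input batches.
calib_data = gluon.data.DataLoader(
    gluon.data.ArrayDataset(mx.np.random.uniform(size=(64, 3, 224, 224))),
    batch_size=8)

qnet = quantize_net(net,
                    quantized_dtype='auto',
                    quantize_mode='full',
                    calib_data=calib_data,
                    calib_mode='naive',
                    num_calib_batches=4,
                    device=mx.cpu())

# The result is a SymbolBlock and can be called like the original network.
out = qnet(mx.np.ones((1, 3, 224, 224)))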