contrib.quantization

Quantization module for generating quantized (INT8) models from FP32 models.

Functions

calib_graph(qsym, arg_params, aux_params, …)

User-level API for calibrating a quantized model using a filled collector.

combine_histogram(old_hist, arr, new_min, …)

Collect layer histogram for arr and combine it with old histogram.

quantize_graph(sym, arg_params, aux_params)

User-level API for generating a quantized model from an FP32 model without calibration, together with a collector for naive or entropy calibration.

quantize_model(sym, arg_params, aux_params)

User-level API for generating a quantized model from a FP32 model w/ or w/o calibration.

quantize_model_mkldnn(sym, arg_params, …)

User-level API for generating a fusion + quantized model from a FP32 model w/ or w/o calibration with Intel MKL-DNN.

quantize_net(network[, quantized_dtype, …])

User-level API for Gluon users to generate a quantized SymbolBlock from a FP32 HybridBlock w/ or w/o calibration.

mxnet.contrib.quantization.calib_graph(qsym, arg_params, aux_params, collector, calib_mode='entropy', quantized_dtype='int8', logger=logging)

User-level API for calibrating a quantized model using a filled collector. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.

Parameters
  • qsym (str or Symbol) – Defines the structure of a neural network for INT8 data types.

  • arg_params (dict) – Dictionary of name to NDArray.

  • aux_params (dict) – Dictionary of name to NDArray.

  • collector (function) – Layer collector for naive or entropy calibration.

  • calib_mode (str) – If calib_mode=‘none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower during inference than those with calibration. If calib_mode=‘naive’, the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization. If calib_mode=‘entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset.

  • quantized_dtype (str) – The quantized destination type for input data. Currently supports ‘int8’, ‘uint8’ and ‘auto’. ‘auto’ means the output type is selected automatically according to the calibration result. Default value is ‘int8’.

  • logger (Object) – A logging object for printing information during the process of quantization.

Returns

  • tuple – A tuple of calibrated symbol, quantized arg_params, aux_params.
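
The ‘naive’ calibration mode described above can be illustrated with a minimal pure-Python sketch. It is not MXNet's implementation: the helper names `naive_thresholds` and `quantize_int8` are hypothetical, and the symmetric scaling to the range [-127, 127] is an assumption made for illustration.

```python
def naive_thresholds(batches):
    """Return (min, max) over all calibration batches of layer outputs."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi


def quantize_int8(values, lo, hi):
    """Symmetric int8 quantization using the threshold max(|lo|, |hi|).

    Assumption for this sketch: values are scaled so that the threshold
    maps to 127 and anything outside the range is clipped.
    """
    threshold = max(abs(lo), abs(hi))
    scale = 127.0 / threshold if threshold > 0 else 1.0
    quantized = [max(-127, min(127, round(v * scale))) for v in values]
    return quantized, scale


# Two hypothetical calibration batches of one layer's FP32 outputs.
calib_batches = [[-0.5, 0.25, 1.0], [0.75, -2.0, 0.1]]
lo, hi = naive_thresholds(calib_batches)
quantized, scale = quantize_int8([0.5, -2.0, 0.0], lo, hi)
```

Because the thresholds are fixed ahead of time, no min or max operators have to run at inference time, which is the cost that the ‘none’ mode pays.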


mxnet.contrib.quantization.combine_histogram(old_hist, arr, new_min, new_max, new_th)

Collect layer histogram for arr and combine it with old histogram.
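
The page gives no detail on how the histograms are combined, so the following is only a sketch of one way to merge a new batch into an existing symmetric histogram. The function names, the `(counts, threshold)` representation, and the omission of the `new_min` and `new_max` bookkeeping are all assumptions of this example, not the library's behavior.

```python
def histogram(values, num_bins, th):
    """Count values into num_bins equal-width bins spanning [-th, th]."""
    counts = [0] * num_bins
    step = 2.0 * th / num_bins
    for v in values:
        if -th <= v <= th:
            # The right edge (v == th) belongs to the last bin.
            idx = min(int((v + th) / step), num_bins - 1)
            counts[idx] += 1
    return counts


def combine(old_counts, old_th, arr, new_th):
    """Merge arr into an existing symmetric histogram.

    If arr fits inside the old range, the old bins are reused. Otherwise
    the same number of bins is added on both sides while keeping the old
    bin width, so the old counts can be copied into the middle unchanged.
    """
    num_bins = len(old_counts)
    if new_th <= old_th:
        new_counts = histogram(arr, num_bins, old_th)
        return [a + b for a, b in zip(old_counts, new_counts)], old_th
    step = 2.0 * old_th / num_bins
    extra = int((new_th - old_th) // step) + 1  # bins added per side
    widened_th = old_th + extra * step
    counts = histogram(arr, num_bins + 2 * extra, widened_th)
    for i, c in enumerate(old_counts):
        counts[extra + i] += c
    return counts, widened_th


old_counts = histogram([-0.75, -0.25, 0.25, 0.75], 4, 1.0)
merged, merged_th = combine(old_counts, 1.0, [1.75, -0.25], 1.75)
```

Keeping the bin width fixed when the range grows means that no old count ever has to be split across two new bins.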

mxnet.contrib.quantization.quantize_graph(sym, arg_params, aux_params, ctx=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', quantized_dtype='int8', quantize_mode='full', logger=logging)

User-level API for generating a quantized model from an FP32 model without calibration, together with a collector for naive or entropy calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.

Parameters
  • sym (str or Symbol) – Defines the structure of a neural network for FP32 data types.

  • arg_params (dict) – Dictionary of name to NDArray.

  • aux_params (dict) – Dictionary of name to NDArray.

  • ctx (Context) – Defines the device that users want to run forward propagation on the calibration dataset for collecting layer output statistics. Currently, only a single context is supported.

  • excluded_sym_names (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.

  • excluded_op_names (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.

  • calib_mode (str) – If calib_mode=‘none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower during inference than those with calibration. If calib_mode=‘naive’, the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization. If calib_mode=‘entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset.

  • quantized_dtype (str) – The quantized destination type for input data. Currently supports ‘int8’, ‘uint8’ and ‘auto’. ‘auto’ means the output type is selected automatically according to the calibration result. Default value is ‘int8’.

  • quantize_mode (str) – The mode that the quantization pass applies. Supports ‘full’ and ‘smart’. ‘full’ means quantizing all operators if possible. ‘smart’ means the quantization pass chooses which operators should be quantized.

  • logger (Object) – A logging object for printing information during the process of quantization.

Returns

  • tuple – A tuple of quantized symbol, quantized arg_params, aux_params and collector.


mxnet.contrib.quantization.quantize_model(sym, arg_params, aux_params, data_names=('data', ), label_names=('softmax_label', ), ctx=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', calib_data=None, num_calib_examples=None, quantized_dtype='int8', quantize_mode='smart', logger=logging)

User-level API for generating a quantized model from an FP32 model with or without calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now. The quantization implementation adopts TensorFlow’s approach: https://www.tensorflow.org/performance/quantization. The calibration implementation borrows the idea of Nvidia’s 8-bit Inference with TensorRT: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf and adapts the method to MXNet.

Parameters
  • sym (str or Symbol) – Defines the structure of a neural network for FP32 data types.

  • arg_params (dict) – Dictionary of name to NDArray.

  • aux_params (dict) – Dictionary of name to NDArray.

  • data_names (a list of strs) – Data names required for creating a Module object to run forward propagation on the calibration dataset.

  • label_names (a list of strs) – Label names required for creating a Module object to run forward propagation on the calibration dataset.

  • ctx (Context) – Defines the device that users want to run forward propagation on the calibration dataset for collecting layer output statistics. Currently, only supports single context.

  • excluded_sym_names (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.

  • excluded_op_names (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.

  • calib_mode (str) – If calib_mode=‘none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower during inference than those with calibration. If calib_mode=‘naive’, the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization. If calib_mode=‘entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset.

  • calib_data (DataIter) – A data iterator initialized by the calibration dataset.

  • num_calib_examples (int or None) – The maximum number of examples that the user would like to use for calibration. If not provided, the whole calibration dataset will be used.

  • quantized_dtype (str) – The quantized destination type for input data. Currently supports ‘int8’, ‘uint8’ and ‘auto’. ‘auto’ means the output type is selected automatically according to the calibration result. Default value is ‘int8’.

  • quantize_mode (str) – The mode that the quantization pass applies. Supports ‘full’ and ‘smart’. ‘full’ means quantizing all operators if possible. ‘smart’ means the quantization pass chooses which operators should be quantized.

  • logger (Object) – A logging object for printing information during the process of quantization.

Returns

  • tuple – A tuple of quantized symbol, quantized arg_params, and aux_params.
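
The ‘entropy’ mode picks thresholds by minimizing a KL divergence. The sketch below shows a simplified, pure-Python version of a histogram-based threshold search in the spirit of the TensorRT talk linked above. It is not MXNet's code: the function names, the use of a histogram of absolute values, and the small `num_levels` used in the example are all assumptions chosen to keep the illustration short.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two unnormalized histograms of equal length."""
    p_sum, q_sum = float(sum(p)), float(sum(q))
    if p_sum == 0.0 or q_sum == 0.0:
        return float("inf")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return float("inf")  # q cannot represent mass that p has
            total += (pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
    return total


def quantize_distribution(hist, num_levels):
    """Merge hist into num_levels groups, then spread each group's mass
    evenly over the bins of that group that were nonzero in hist."""
    n = len(hist)
    q = [0.0] * n
    for level in range(num_levels):
        start = level * n // num_levels
        stop = (level + 1) * n // num_levels
        group = hist[start:stop]
        nonzero = sum(1 for c in group if c > 0)
        if nonzero:
            share = sum(group) / nonzero
            for j in range(start, stop):
                if hist[j] > 0:
                    q[j] = share
    return q


def entropy_threshold(hist, bin_width, num_levels):
    """Return the clipping threshold whose quantized histogram is closest,
    in KL divergence, to the clipped reference histogram.

    hist is assumed to hold counts of absolute values, starting at 0.
    """
    best_kl, best_bins = float("inf"), len(hist)
    for i in range(num_levels, len(hist) + 1):
        clipped = list(hist[:i])
        p = list(clipped)
        p[-1] += sum(hist[i:])  # fold clipped outliers into the last bin
        q = quantize_distribution(clipped, num_levels)
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_bins = kl, i
    return best_bins * bin_width


# A flat histogram keeps the full range; a sparse tail gets clipped.
flat = entropy_threshold([5] * 8, 0.5, 4)
tailed = entropy_threshold([100, 80, 60, 40, 0, 0, 0, 1], 0.5, 4)
```

In the second call the single outlier in the last bin is sacrificed so that the four quantization levels can resolve the bulk of the distribution, which is the trade-off that distinguishes ‘entropy’ from ‘naive’ calibration.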


mxnet.contrib.quantization.quantize_model_mkldnn(sym, arg_params, aux_params, data_names=('data', ), label_names=('softmax_label', ), ctx=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', calib_data=None, num_calib_examples=None, quantized_dtype='int8', quantize_mode='smart', logger=logging)

User-level API for generating a fusion + quantized model from a FP32 model w/ or w/o calibration with Intel MKL-DNN. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.

Parameters

Same as quantize_model.

Returns

  • tuple – A tuple of quantized symbol, quantized arg_params, and aux_params.


mxnet.contrib.quantization.quantize_net(network, quantized_dtype='auto', quantize_mode='full', exclude_layers=None, exclude_layers_match=None, exclude_operators=None, calib_data=None, data_shapes=None, calib_mode='none', num_calib_examples=None, ctx=cpu(0), logger=logging)

User-level API for Gluon users to generate a quantized SymbolBlock from a FP32 HybridBlock w/ or w/o calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.

Parameters
  • network (Gluon HybridBlock) – Defines the structure of a neural network for FP32 data types.

  • quantized_dtype (str) – The quantized destination type for input data. Currently supports ‘int8’, ‘uint8’ and ‘auto’. ‘auto’ means the output type is selected automatically according to the calibration result. Default value is ‘auto’, as shown in the signature.

  • quantize_mode (str) – The mode that the quantization pass applies. Supports ‘full’ and ‘smart’. ‘full’ means quantizing all operators if possible. ‘smart’ means the quantization pass chooses which operators should be quantized.

  • exclude_layers (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.

  • exclude_layers_match (list of strings) – A list of wildcard strings matching the names of the symbols that users want to exclude from being quantized.

  • exclude_operators (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.

  • calib_data (mx.io.DataIter or gluon.DataLoader) – An iterable data loading object.

  • data_shapes (list) – List of DataDesc, required if calib_data is not provided.

  • calib_mode (str) – If calib_mode=‘none’ (the default for this function, as shown in the signature), no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower during inference than those with calibration. If calib_mode=‘naive’, the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization. If calib_mode=‘entropy’, the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset.

  • num_calib_examples (int or None) – The maximum number of examples that the user would like to use for calibration. If not provided, the whole calibration dataset will be used.

  • ctx (Context) – Defines the device that users want to run forward propagation on the calibration dataset for collecting layer output statistics. Currently, only supports single context.

  • logger (Object) – A logging object for printing information during the process of quantization.

Returns

  • network (Gluon SymbolBlock) – Defines the structure of a neural network for INT8 data types.
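
The ‘auto’ option for quantized_dtype selects the output type from the calibration result, but this page does not state the rule. The sketch below assumes one plausible rule, chosen purely for illustration: use ‘uint8’ when the calibrated minimum is non-negative (for example after a ReLU) and ‘int8’ otherwise. The helper names `select_dtype` and `quantize` are hypothetical and not part of MXNet.

```python
def select_dtype(calib_min):
    """Assumed rule: non-negative outputs can use the full unsigned range."""
    return "uint8" if calib_min >= 0 else "int8"


def quantize(values, calib_min, calib_max):
    """Quantize values with the dtype chosen from the calibrated range."""
    dtype = select_dtype(calib_min)
    if dtype == "uint8":
        # Map [0, calib_max] onto [0, 255].
        scale = 255.0 / calib_max if calib_max > 0 else 1.0
        lo, hi = 0, 255
    else:
        # Map [-threshold, threshold] onto [-127, 127].
        threshold = max(abs(calib_min), abs(calib_max))
        scale = 127.0 / threshold if threshold > 0 else 1.0
        lo, hi = -127, 127
    return dtype, [max(lo, min(hi, round(v * scale))) for v in values]


unsigned_result = quantize([0.0, 0.5, 2.0], 0.0, 2.0)
signed_result = quantize([-1.0, 0.25], -1.0, 0.5)
```

Under this assumed rule, non-negative activations get 255 usable levels instead of 127, which is why a per-layer choice can be preferable to forcing ‘int8’ everywhere.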
