contrib.quantization¶
Quantization module for generating quantized (INT8) models from FP32 models.
Classes

CalibrationCollector – Base class for all other collectors used with quantization.

Functions

calib_graph – User-level API for calibrating a quantized model using a filled collector.
quantize_graph – User-level API for generating a quantized model from a FP32 model without calibration, together with a collector for naive or entropy calibration.
quantize_model – User-level API for generating a quantized model from a FP32 model with or without calibration.
quantize_model_onednn – User-level API for generating a fused and quantized model from a FP32 model with or without calibration using oneDNN.
quantize_net – User-level API for Gluon users to generate a quantized SymbolBlock from a FP32 HybridBlock with or without calibration.
-
class CalibrationCollector[source]¶
Bases: object

Base class for all other collectors used with quantization.
Methods

collect(name, op_name, arr) – Function which is registered to Block as monitor callback.
post_collect() – Function called after collecting parameters.
-
abstract collect(name, op_name, arr)[source]¶
Function which is registered to Block as monitor callback. Names of layers requiring calibration are stored in the self.include_layers variable.
- Parameters
name (str) – Name of the node from which the collected data comes.
op_name (str) – Name of the operator from which the collected data comes. A single operator can have multiple input/output nodes; each should have a different name.
arr (NDArray) – NDArray containing the data of the monitored node.
-
abstract post_collect()[source]¶
Function called after collecting parameters.
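A custom collector only needs to implement collect (and may use include_layers to filter which layers it records). The sketch below is framework-free: NumPy arrays stand in for the NDArray values the monitor callback would deliver, and the class name MinMaxCollector is illustrative, not part of the API.

```python
import numpy as np


class MinMaxCollector:
    """Sketch of a collector recording running min/max per monitored layer."""

    def __init__(self):
        # Filled by the quantization pass with names of layers needing calibration.
        self.include_layers = None
        self.min_max = {}

    def collect(self, name, op_name, arr):
        # Skip layers that are not selected for calibration.
        if self.include_layers is not None and name not in self.include_layers:
            return
        lo, hi = float(arr.min()), float(arr.max())
        if name in self.min_max:
            old_lo, old_hi = self.min_max[name]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        self.min_max[name] = (lo, hi)


collector = MinMaxCollector()
collector.collect("conv0_output", "Convolution", np.array([-1.5, 0.0, 2.5]))
collector.collect("conv0_output", "Convolution", np.array([-3.0, 1.0]))
print(collector.min_max["conv0_output"])  # (-3.0, 2.5)
```

The recorded (min, max) pairs are exactly the statistics a naive calibration would turn into quantization thresholds.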
-
calib_graph(qsym, arg_params, aux_params, collector, calib_mode='entropy', logger=None)[source]¶
User-level API for calibrating a quantized model using a filled collector. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.
- Parameters
qsym (Symbol) – Defines the structure of a neural network for INT8 data types.
arg_params (dict) – Dictionary of name to NDArray.
aux_params (dict) – Dictionary of name to NDArray.
collector (function) – Layer collector for naive or entropy calibration.
calib_mode (str) – If calib_mode=’none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower than those with calibrations during inference. If calib_mode=’naive’, the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization. If calib_mode=’entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset.
quantized_dtype (str) – The quantized destination type for input data. Currently supports 'int8', 'uint8' and 'auto'. 'auto' means automatically select the output type according to the calibration result. Default value is 'int8'.
logger (Object) – A logging object for printing information during the process of quantization.
- Returns
quantized_model – A tuple of calibrated symbol, quantized arg_params, aux_params.
- Return type
tuple
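The 'naive' and 'entropy' modes described above can be contrasted with a deliberately simplified, framework-free sketch. The real implementation collects per-layer histograms and searches thresholds at INT8 resolution; the bin counts and the coarse requantization step below are illustrative assumptions only.

```python
import numpy as np


def naive_threshold(outputs):
    # 'naive': take the observed extrema of the layer outputs directly.
    return float(outputs.min()), float(outputs.max())


def _kl(p, q):
    # KL divergence between two histograms, with smoothing to avoid log(0).
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    q = np.where(q > 0, q, 1e-12)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def entropy_threshold(outputs, num_bins=512, num_quantized_bins=128):
    # 'entropy' (simplified): try a few clipping thresholds and keep the one
    # whose clipped, coarsely requantized histogram is closest in KL
    # divergence to the original distribution.
    hist, edges = np.histogram(np.abs(outputs), bins=num_bins)
    best_t, best_kl = float(edges[-1]), np.inf
    for i in range(num_quantized_bins, num_bins + 1, num_quantized_bins):
        p = hist[:i].astype(float)
        p[-1] += hist[i:].sum()  # fold clipped outliers into the edge bin
        # Simulate the coarser quantized resolution: merge bins into groups
        # and spread each group's mass back uniformly.
        groups = p.reshape(num_quantized_bins, -1)
        q = np.repeat(groups.mean(axis=1), groups.shape[1])
        kl = _kl(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, float(edges[i])
    return best_t
```

With a heavy-tailed activation distribution, the entropy search typically clips below the raw maximum, trading a few outliers for finer resolution over the bulk of the distribution.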
-
quantize_graph(sym, arg_params, aux_params, device=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', quantized_dtype='int8', quantize_mode='full', quantize_granularity='tensor-wise', LayerOutputCollector=None, logger=None)[source]¶
User-level API for generating a quantized model from a FP32 model without calibration, together with a collector for naive or entropy calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.
- Parameters
sym (Symbol) – Defines the structure of a neural network for FP32 data types.
device (Device) – The device on which to run forward propagation on the calibration dataset for collecting layer output statistics. Currently, only a single device is supported.
arg_params (dict) – Dictionary of name to NDArray.
aux_params (dict) – Dictionary of name to NDArray.
excluded_sym_names (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.
excluded_op_names (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.
calib_mode (str) – If calib_mode=’none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower than those with calibrations during inference. If calib_mode=’naive’, the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization. If calib_mode=’entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset.
quantized_dtype (str) – The quantized destination type for input data. Currently supports 'int8', 'uint8' and 'auto'. 'auto' means automatically select the output type according to the calibration result. Default value is 'int8'.
quantize_mode (str) – The mode for the quantization pass to apply. Supports 'full' and 'smart'. 'full' means quantize all operators if possible. 'smart' means the quantization pass will intelligently choose which operators should be quantized.
quantize_granularity (str) – The granularity of quantization, currently supports ‘tensor-wise’ and ‘channel-wise’ quantization. The default value is ‘tensor-wise’.
LayerOutputCollector (subclass of CalibrationCollector) – For custom calibration method usage. The passed object's include_layers attribute will be filled with the names of layers that need calibration.
logger (Object) – A logging object for printing information during the process of quantization.
- Returns
quantized_model – A tuple of quantized symbol, quantized arg_params, aux_params and collector.
- Return type
tuple
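The quantize_granularity options amount to how many scale factors are fit: one for the whole tensor versus one per output channel. A NumPy-only sketch with symmetric INT8 quantization and deliberately mismatched channel magnitudes:

```python
import numpy as np


def quantize_int8(x, scale):
    # Symmetric INT8 quantize, then dequantize back to FP32.
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale


# Two output channels with very different magnitudes.
w = np.stack([np.linspace(-0.01, 0.01, 8),   # small-magnitude channel
              np.linspace(-10.0, 10.0, 8)])  # large-magnitude channel

# 'tensor-wise': a single scale for the whole tensor.
w_tensor = quantize_int8(w, np.abs(w).max() / 127.0)

# 'channel-wise': one scale per output channel (row).
w_channel = quantize_int8(w, np.abs(w).max(axis=1, keepdims=True) / 127.0)

# The shared scale wipes out the small channel; per-channel scales keep it.
err_tensor = np.abs(w[0] - w_tensor[0]).max()
err_channel = np.abs(w[0] - w_channel[0]).max()
```

This is why channel-wise granularity usually recovers accuracy on layers whose channels span very different ranges, at the cost of storing one scale per channel.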
-
quantize_model(sym, arg_params, aux_params, data_names=('data', ), device=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', calib_data=None, num_calib_batches=None, quantized_dtype='int8', quantize_mode='smart', quantize_granularity='tensor-wise', logger=None)[source]¶
User-level API for generating a quantized model from a FP32 model with or without calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now. The quantization implementation adopts TensorFlow's approach: https://www.tensorflow.org/lite/performance/post_training_quantization. The calibration implementation borrows the idea of Nvidia's 8-bit Inference with TensorRT: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf and adapts the method to MXNet.
- Parameters
sym (Symbol) – Defines the structure of a neural network for FP32 data types.
arg_params (dict) – Dictionary of name to NDArray.
aux_params (dict) – Dictionary of name to NDArray.
data_names (list of strings) – Data names required for creating a Module object to run forward propagation on the calibration dataset.
device (Device) – The device on which to run forward propagation on the calibration dataset for collecting layer output statistics. Currently, only a single device is supported.
excluded_sym_names (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.
excluded_op_names (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.
calib_mode (str) – If calib_mode=’none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower than those with calibrations during inference. If calib_mode=’naive’, the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization. If calib_mode=’entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset.
calib_data (DataLoader) – A DataLoader initialized by the calibration dataset.
num_calib_batches (int or None) – The maximum number of batches that user would like to use for calibration. If not provided, the whole calibration dataset will be used.
quantized_dtype (str) – The quantized destination type for input data. Currently supports 'int8', 'uint8' and 'auto'. 'auto' means automatically select the output type according to the calibration result. Default value is 'int8'.
quantize_mode (str) – The mode for the quantization pass to apply. Supports 'full' and 'smart'. 'full' means quantize all operators if possible. 'smart' means the quantization pass will intelligently choose which operators should be quantized.
quantize_granularity (str) – The granularity of quantization, currently supports ‘tensor-wise’ and ‘channel-wise’ quantization. The default value is ‘tensor-wise’.
logger (Object) – A logging object for printing information during the process of quantization.
- Returns
quantized_model – A tuple of quantized symbol, quantized arg_params, and aux_params.
- Return type
tuple
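One plausible reading of the 'auto' destination type is sketched below: a calibrated range that never goes negative (e.g. a ReLU output) fits 'uint8', while any signed range needs 'int8'. The helper select_quantized_dtype is hypothetical, not part of the MXNet API.

```python
def select_quantized_dtype(calib_min, calib_max, quantized_dtype="auto"):
    # Hypothetical helper illustrating the 'auto' policy: pick an unsigned
    # type when the calibrated range is entirely non-negative.
    if quantized_dtype != "auto":
        return quantized_dtype
    return "uint8" if calib_min >= 0.0 else "int8"


print(select_quantized_dtype(0.0, 6.0))   # ReLU-like, non-negative range
print(select_quantized_dtype(-2.5, 3.1))  # signed range
```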
-
quantize_model_onednn(sym, arg_params, aux_params, data_names=('data', ), device=cpu(0), excluded_sym_names=None, excluded_op_names=None, calib_mode='entropy', calib_data=None, num_calib_batches=None, quantized_dtype='int8', quantize_mode='smart', quantize_granularity='tensor-wise', logger=None)[source]¶
User-level API for generating a fused and quantized model from a FP32 model with or without calibration using oneDNN. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.
- Parameters
all parameters – Same as for quantize_model.
- Returns
quantized_model – A tuple of quantized symbol, quantized arg_params, and aux_params.
- Return type
tuple
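The "fusion" here refers to oneDNN operator fusion performed before quantization, so a fused sequence can be quantized as a single unit. A framework-free sketch of the arithmetic behind one classic fusion, folding batch-norm into the preceding linear op (the helper fold_bn is illustrative, not the MXNet/oneDNN implementation):

```python
import numpy as np


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fold y = gamma * ((w @ x + b) - mean) / sqrt(var + eps) + beta
    # into a single affine op y = w_f @ x + b_f.
    scale = gamma / np.sqrt(var + eps)  # one factor per output channel
    return w * scale[:, None], (b - mean) * scale + beta


rng = np.random.default_rng(0)
w, b = rng.normal(size=(4, 3)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.random(4) + 0.1
x = rng.normal(size=3)

# Reference: linear op followed by a separate batch-norm.
y_separate = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta

# Fused: one affine op, ready to be quantized as a single unit.
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
y_fused = w_f @ x + b_f
```

Fusing first also means only one set of quantization thresholds is needed for the combined op instead of one per stage.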
-
quantize_net(network, quantized_dtype='auto', quantize_mode='full', quantize_granularity='tensor-wise', exclude_layers=None, exclude_layers_match=None, exclude_operators=None, calib_data=None, data_shapes=None, calib_mode='none', num_calib_batches=None, device=cpu(0), LayerOutputCollector=None, logger=None)[source]¶
User-level API for Gluon users to generate a quantized SymbolBlock from a FP32 HybridBlock with or without calibration. The backend quantized operators are only enabled for Linux systems. Please do not run inference using the quantized models on Windows for now.
- Parameters
network (Gluon HybridBlock) – Defines the structure of a neural network for FP32 data types.
quantized_dtype (str) – The quantized destination type for input data. Currently supports 'int8', 'uint8' and 'auto'. 'auto' means automatically select the output type according to the calibration result. Default value is 'auto'.
quantize_mode (str) – The mode for the quantization pass to apply. Supports 'full' and 'smart'. 'full' means quantize all operators if possible. 'smart' means the quantization pass will intelligently choose which operators should be quantized.
quantize_granularity (str) – The granularity of quantization, currently supports ‘tensor-wise’ and ‘channel-wise’ quantization. The default value is ‘tensor-wise’.
exclude_layers (list of strings) – A list of strings representing the names of the symbols that users want to exclude from being quantized.
exclude_layers_match (list of strings) – A list of wildcard patterns matching the names of the symbols that users want to exclude from being quantized.
exclude_operators (list of strings) – A list of strings representing the names of the operators that users want to exclude from being quantized.
calib_data (gluon.DataLoader) – An iterable data loading object.
data_shapes (list of DataDesc or list of tuple) – A list of data shapes. Required if calib_data is not provided. In case of tuples, the names of inputs are generated.
calib_mode (str) – If calib_mode=’none’, no calibration will be used and the thresholds for requantization after the corresponding layers will be calculated at runtime by calling min and max operators. The quantized models generated in this mode are normally 10-20% slower than those with calibrations during inference. If calib_mode=’naive’, the min and max values of the layer outputs from a calibration dataset will be directly taken as the thresholds for quantization. If calib_mode=’entropy’ (default mode), the thresholds for quantization will be derived such that the KL divergence between the distributions of FP32 layer outputs and quantized layer outputs is minimized based upon the calibration dataset. If calib_mode=’custom’, the provided LayerOutputCollector will be used to determine the thresholds for quantization. For more information refer to CalibrationCollector documentation.
num_calib_batches (int or None) – The maximum number of batches that user would like to use for calibration. If not provided, the whole calibration dataset will be used.
device (Device) – The device on which to run forward propagation on the calibration dataset for collecting layer output statistics. Currently, only a single device is supported.
LayerOutputCollector (subclass of CalibrationCollector) – For custom calibration method usage. The passed object's include_layers attribute will be filled with the names of layers that need calibration.
logger (Object) – A logging object for printing information during the process of quantization.
- Returns
network – Defines the structure of a neural network for INT8 data types.
- Return type
Gluon SymbolBlock
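Unlike exclude_layers, exclude_layers_match takes wildcard patterns rather than exact names. Whether the implementation uses shell-style (fnmatch) semantics is an assumption here; the sketch below only illustrates the intended effect:

```python
import fnmatch

layer_names = ["conv0", "conv1", "stage1_conv0", "fc_output"]
exclude_layers_match = ["conv*"]  # wildcard patterns

# Layers whose names match any pattern are kept in FP32.
excluded = [name for name in layer_names
            if any(fnmatch.fnmatch(name, pat) for pat in exclude_layers_match)]
print(excluded)  # 'conv*' matches conv0/conv1 but not stage1_conv0
```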