bitorch_engine.utils.quant_operators.nv_tensor_quant

bitorch_engine.utils.quant_operators.nv_tensor_quant(inputs, amax=None, num_bits=8, unsigned=False, narrow_range=True) → Tuple[Tensor, Tensor]

Quantizes the given tensor using specified quantization parameters. This method supports both signed and unsigned quantization with an option for narrow range quantization. This function is shared between TensorQuantFunction and FakeTensorQuantFunction.

Author: nv_pytorch_quantization

Source: https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/pytorch_quantization/tensor_quant.py#L315

Parameters:
  • inputs (torch.Tensor) – The input tensor to be quantized.

  • amax (torch.Tensor or None) – The maximum absolute value used for quantization scaling. If None, it will be calculated from the input tensor.

  • num_bits (int) – Number of bits to use for quantization, default is 8.

  • unsigned (bool) – Flag indicating if the quantization is unsigned, default is False.

  • narrow_range (bool) – Flag indicating if the quantization should use narrow range, default is True.

Raises:
  • ValueError – If amax has a different shape than inputs or contains negative values.

  • TypeError – If negative values are encountered in unsigned quantization mode.

Returns:

A tuple of two tensors: the quantized tensor and the scale factor used for quantization.

Return type:

Tuple[torch.Tensor, torch.Tensor]

Note

  • Quantization is performed in FP32 to avoid overflow.

  • If inputs or amax are in FP16 or BF16, they are converted to FP32 for calculation.

  • The quantization range is adjusted based on unsigned and narrow_range flags.

  • amax values smaller than the minimum representable value of FP16 receive special handling.
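The notes above can be illustrated with a minimal pure-Python sketch that works on a single float instead of a tensor. It is not the library's implementation: the helper names quant_bounds and quantize_scalar are hypothetical, the bound and scale formulas are assumed to follow the NVIDIA source linked above, and the treatment of very small amax (quantized value 0, scale 1) is likewise an assumption.

```python
def quant_bounds(num_bits=8, unsigned=False, narrow_range=True):
    """Integer range implied by the flags (assumed formula, hypothetical helper)."""
    # Unsigned mode gains one extra bit of magnitude because no sign bit is needed.
    max_bound = 2 ** (num_bits - 1 + int(unsigned)) - 1
    if unsigned:
        min_bound = 0
    elif narrow_range:
        min_bound = -max_bound      # symmetric range, e.g. [-127, 127]
    else:
        min_bound = -max_bound - 1  # full range, e.g. [-128, 127]
    return min_bound, max_bound


def quantize_scalar(x, amax, num_bits=8, unsigned=False, narrow_range=True):
    """Quantize one float; returns (quantized value, scale). Hypothetical helper."""
    if amax < 0:
        raise ValueError("Negative values in amax")
    if unsigned and x < 0:
        raise TypeError("Negative values encountered in unsigned quantization")

    min_bound, max_bound = quant_bounds(num_bits, unsigned, narrow_range)

    # 2**-24 is the smallest positive FP16 value (subnormal).
    epsilon = 1.0 / (1 << 24)
    if amax <= epsilon:
        # Assumption: such inputs quantize to 0 and the reported scale is 1.
        return 0.0, 1.0

    scale = max_bound / amax
    # Round to the nearest integer, then clamp into the quantization range.
    q = max(min_bound, min(max_bound, round(x * scale)))
    return float(q), scale
```

With the defaults (8 bits, signed, narrow range) an input of 0.25 with amax=1.0 maps to 32 with a scale of 127, and dividing the quantized value by the scale gives an approximation of the original input.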