bitorch_engine.utils.quant_operators.q8_quantization

bitorch_engine.utils.quant_operators.q8_quantization(input: Tensor, scale_a: Tensor | None = None, eps: Tensor | None = None) → Tensor

Quantizes an input tensor to 8-bit integers using uniform quantization.

The function first ensures that the input tensor is of floating-point type. It then adjusts the scale factor scale_a to avoid division by values too close to zero, applying a lower threshold defined by eps. The quantization process scales the input tensor by the inverse of scale_a, rounds the result to the nearest integer, and clamps the values to the 8-bit range [-128, 127].
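The steps described above can be sketched in plain Python. This is an illustration of the arithmetic only, not the library's implementation, which operates on torch tensors. The helper name `q8_quantize_sketch`, the list-based inputs, and the default `eps` value are assumptions made for this example, as is the reading of "lower threshold" as an element-wise maximum of the scale and `eps`.

```python
def q8_quantize_sketch(values, scales, eps=1e-8):
    """Plain-Python illustration of uniform 8-bit quantization.

    `values` and `scales` are equal-length lists of floats.
    `eps` is an assumed default, not a value taken from the library.
    """
    quantized = []
    for x, s in zip(values, scales):
        s = max(s, eps)               # lower-bound the scale to avoid dividing by ~0
        q = round(x / s)              # round to the nearest integer (ties go to even)
        q = max(-128, min(127, q))    # clamp to the signed 8-bit range
        quantized.append(q)
    return quantized


values = [0.5, -1.0, 3.0, 0.26]
scales = [0.01, 0.01, 0.01, 0.0]      # the zero scale is replaced by eps
print(q8_quantize_sketch(values, scales))  # [50, -100, 127, 127]
```

The last two inputs show saturation: 3.0 / 0.01 = 300 exceeds the 8-bit range and is clamped to 127, and a zero scale is raised to `eps`, so the quotient becomes very large and is clamped as well. In PyTorch, the same three steps would roughly correspond to `torch.clamp`, `torch.round`, and a second `torch.clamp` applied to whole tensors.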

Parameters:
  • input (torch.Tensor) – The input tensor to be quantized. Should ideally be of floating-point type.

  • scale_a (torch.Tensor, optional) – The scale factor for quantization. Each element in scale_a scales the corresponding element in input. Defaults to None.

  • eps (torch.Tensor, optional) – A small positive tensor used as a lower bound on scale_a, preventing division by zero or by values too close to zero. Defaults to None.

Returns:

The quantized tensor, with values rounded and clamped to fit within the 8-bit integer range.

Return type:

torch.Tensor