
bitorch_engine.utils.quant_operators.q8_quantization(input: Tensor, scale_a: Tensor | None = None, eps: Tensor | None = None) → Tensor

Quantizes an input tensor to 8-bit integers using uniform quantization.

The function first ensures that the input tensor is of floating-point type. It then adjusts the scale factor scale_a to avoid division by values too close to zero, applying a lower threshold defined by eps. The quantization process scales the input tensor by the inverse of scale_a, rounds the result to the nearest integer, and clamps the values to the 8-bit range [-128, 127].
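The steps described above can be sketched in plain PyTorch as follows. This is a minimal illustration rather than the library's implementation: the name `q8_quantize_sketch` is hypothetical, the use of `torch.maximum` to apply the `eps` threshold is an assumption, and the result is left in floating point because the description does not state the output dtype.

```python
import torch


def q8_quantize_sketch(x: torch.Tensor, scale: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of the documented steps, not the bitorch_engine code."""
    # Step 1: make sure we are working with a floating-point tensor.
    if not x.is_floating_point():
        x = x.to(torch.float32)
    # Step 2: raise any scale value below eps up to eps (assumed to be an
    # element-wise maximum) so the division below stays well defined.
    safe_scale = torch.maximum(scale, eps)
    # Step 3: scale by the inverse of the scale factor and round to integers.
    q = torch.round(x / safe_scale)
    # Step 4: clamp to the signed 8-bit range [-128, 127].
    return torch.clamp(q, -128, 127)


x = torch.tensor([0.5, -1.2, 3.0, -100.0])
scale = torch.tensor([0.5, 0.5, 0.0, 0.5])  # the zero entry is lifted to eps
eps = torch.tensor(1e-5)

q = q8_quantize_sketch(x, scale, eps)
print(q.tolist())  # [1.0, -2.0, 127.0, -128.0]
```

In this example the third element has a zero scale, so it is divided by `eps` instead and saturates at 127, while the last element saturates at -128.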

Parameters:

  • input (torch.Tensor) – The input tensor to be quantized. Should ideally be of floating-point type.

  • scale_a (torch.Tensor, optional) – The scale factor for quantization. Each element in scale_a scales the corresponding element in input. Defaults to None.

  • eps (torch.Tensor, optional) – A small positive tensor used as a lower bound on scale_a, preventing division by zero or by values too close to zero. Defaults to None.


Returns:

The quantized tensor, with values rounded and clamped to fit within the 8-bit integer range.

Return type:

torch.Tensor
