bitorch_engine.utils.quant_operators.q8_quantization
- bitorch_engine.utils.quant_operators.q8_quantization(input: Tensor, scale_a: Tensor | None = None, eps: Tensor | None = None) → Tensor
Quantizes an input tensor to 8-bit integers using uniform quantization.
The function first ensures that the input tensor is of floating-point type. It then adjusts the scale factor scale_a to avoid division by values too close to zero, applying a lower threshold defined by eps. The quantization process scales the input tensor by the inverse of scale_a, rounds the result to the nearest integer, and clamps the values to the 8-bit range [-128, 127].
- Parameters:
input (torch.Tensor) – The input tensor to be quantized. Should ideally be of floating-point type.
scale_a (torch.Tensor, optional) – The scale factor for quantization. Each element in scale_a scales the corresponding element in input. Defaults to None.
eps (torch.Tensor, optional) – A small positive tensor used as the lower bound on scale_a, preventing division by zero or by values too close to zero. Defaults to None.
- Returns:
The quantized tensor, with values rounded and clamped to fit within the 8-bit integer range [-128, 127].
- Return type:
torch.Tensor
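Example (illustrative sketch):
The steps described above (float conversion, lower-bounding the scale by eps, dividing, rounding, clamping) can be sketched in plain PyTorch as follows. This is a hypothetical re-implementation written for illustration, not the library's source: the name q8_quantization_sketch is invented here, using torch.maximum for the eps threshold is an assumption, and the sketch leaves the result in floating point because the documentation above does not state the output dtype.

```python
import torch


def q8_quantization_sketch(input, scale_a, eps):
    # Hypothetical helper mirroring the documented steps; not bitorch_engine code.
    # Step 1: make sure we are working with a floating-point tensor.
    if not input.is_floating_point():
        input = input.to(torch.float32)
    # Step 2: lower-bound the scale by eps so the division stays well-behaved
    # (assumed here to be an element-wise maximum).
    scale = torch.maximum(scale_a, eps)
    # Step 3: scale, round to the nearest integer, clamp to the signed 8-bit range.
    return torch.clamp(torch.round(input / scale), -128, 127)


x = torch.tensor([0.26, -1.0, 3.0])
scale = torch.tensor(0.02)
eps = torch.tensor(1e-5)

q = q8_quantization_sketch(x, scale, eps)
print(q)  # values: 13, -50, 127 (3.0 / 0.02 = 150 is clamped to 127)
```

The last element shows the clamping behaviour: any value whose scaled magnitude exceeds the 8-bit range saturates at -128 or 127 rather than wrapping around.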