bitorch_engine.layers.qlinear.nbit.cutlass.q8_layer.Q8LinearCutlass
- class bitorch_engine.layers.qlinear.nbit.cutlass.q8_layer.Q8LinearCutlass(*args, **kwargs)[source]
Implements an 8-bit quantized linear layer using CUTLASS for efficient computation.
This class inherits from nBitLinearBase and adds specific functionality for handling 8-bit quantized weights and activations, aiming to reduce the memory footprint and accelerate computation on compatible hardware. It introduces parameters for scaling and bias adjustment of activations to maintain accuracy with quantized values.
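The memory saving from 8-bit weights is easy to quantify: storing parameters as int8 instead of float32 cuts the weight footprint by 4x. A minimal sketch (the helper name is illustrative, not part of the library API):

```python
def weight_bytes(n_params: int, bits: int) -> int:
    # Storage required for n_params weights at the given bit width.
    return n_params * bits // 8

fp32 = weight_bytes(1_000_000, 32)  # 4,000,000 bytes
int8 = weight_bytes(1_000_000, 8)   # 1,000,000 bytes
assert fp32 // int8 == 4
```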
- bias_a
Learnable bias applied to activations, helping preserve the quantized model's accuracy.
- Type:
torch.nn.Parameter
- scale_a
Scale factor for activations, used in quantization to maintain numerical stability.
- Type:
torch.nn.Parameter
- scale_w
Scale factor for weights, adjusting the quantized weights’ magnitude.
- Type:
torch.nn.Parameter
- eps
A small epsilon value to prevent division by zero in computations, set to 0.00001 by default.
- Type:
torch.Tensor
- prepare_params()[source]
Prepares and initializes the model parameters for training, converting weights to int8 format.
- generate_quantized_weight()[source]
Quantizes the weights, preparing them for efficient storage or computation.
- set_activation()[source]
Quantizes activations to 8 bits using the scale factor scale_a and adjusts them with bias_a.
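The roles of scale_a, bias_a, and eps can be illustrated with a plain-Python sketch; the exact rounding and clamping scheme used by Q8LinearCutlass is an assumption here, and the real layer operates on CUDA tensors rather than scalars:

```python
def quantize_activation(x: float, scale_a: float, bias_a: float, eps: float = 1e-5) -> int:
    """Sketch of 8-bit activation quantization with a learnable scale and bias.

    This only illustrates the roles of scale_a, bias_a, and eps; it is not
    the library's implementation.
    """
    # eps guards against division by zero when the scale collapses toward 0.
    q = round(x / max(scale_a, eps) + bias_a)
    # Clamp to the signed 8-bit range.
    return max(-128, min(127, q))
```

For example, `quantize_activation(0.5, 0.01, 0.0)` maps to 50, while out-of-range inputs saturate at -128 or 127.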
Methods
Initializes the Q8LinearCutlass layer, setting up parameters for activation scaling, weight scaling, and a small epsilon value for numerical stability.
Defines the forward pass for the quantized linear layer.
Quantizes the weights, preparing them for efficient computation or storage.
Prepares and initializes the model parameters for training.
Quantizes activations to 8 bits and applies a learnable bias adjustment.
Attributes
training
- __init__(*args, **kwargs)[source]
Initializes the Q8LinearCutlass layer, setting up parameters for activation scaling, weight scaling, and a small epsilon value for numerical stability.
- forward(x: Tensor) → Tensor[source]
Defines the forward pass for the quantized linear layer.
- Parameters:
x (torch.Tensor) – Input tensor with shape (batch size, number of features).
- Returns:
The output of the quantized linear layer.
- Return type:
torch.Tensor
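The arithmetic of such a quantized forward pass can be sketched in plain Python: multiply int8 activations by int8 weights, accumulate in wider integer precision, then rescale to floating point. The real layer dispatches an int8 GEMM to CUTLASS on the GPU, and the exact rescaling order is an assumption:

```python
def int8_linear_forward(x_q, w_q, scale_a, scale_w):
    """Sketch of a quantized linear forward pass (not the CUTLASS kernel).

    x_q: rows of int8 activations, shape (batch, in_features)
    w_q: rows of int8 weights, shape (out_features, in_features)
    Integer products are summed exactly, then rescaled back to float
    with the activation and weight scale factors.
    """
    out = []
    for row in x_q:
        out.append([
            sum(a * w for a, w in zip(row, w_row)) * scale_a * scale_w
            for w_row in w_q
        ])
    return out
```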
- generate_quantized_weight(qweight_only: bool = False) → None[source]
Quantizes the weights, preparing them for efficient computation or storage.
- Parameters:
qweight_only (bool) – If True, only quantizes the weights without altering other parameters.
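A per-tensor symmetric scheme, sketched below, is one common way to derive scale_w and the int8 weights; whether Q8LinearCutlass uses exactly this scheme is an assumption:

```python
def quantize_weights(weights, eps=1e-5):
    """Sketch of symmetric per-tensor weight quantization to int8.

    The maximum weight magnitude is mapped to 127; eps keeps the scale
    strictly positive even for an all-zero weight tensor.
    """
    max_abs = max(abs(w) for w in weights)
    scale_w = max(max_abs, eps) / 127.0
    q = [max(-128, min(127, round(w / scale_w))) for w in weights]
    return q, scale_w
```

Dequantizing with `q_i * scale_w` then recovers an approximation of each original weight.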
- prepare_params() → None[source]
Prepares and initializes the model parameters for training.
Note
This method MUST be called after model initialization and before training starts to ensure the weights are properly prepared for efficient computation.
One can use the prepare_bie_layers method from project_root.utils.model_helper to call this function.