bitorch_engine.layers.qlinear.nbit.cutlass.q8_layer.Q8LinearCutlass
- class bitorch_engine.layers.qlinear.nbit.cutlass.q8_layer.Q8LinearCutlass(*args, **kwargs)[source]
Implements an 8-bit quantized linear layer using CUTLASS for efficient computation.
This class inherits from nBitLinearBase and adds specific functionality for handling 8-bit quantized weights and activations, aiming to reduce the memory footprint and accelerate computation on compatible hardware. It introduces parameters for scaling and bias adjustment of activations to maintain accuracy with quantized values.
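The memory saving from 8-bit weights is easy to quantify: storing parameters as int8 instead of float32 cuts the weight footprint by 4x. A minimal sketch (the helper name is illustrative, not part of the library API):

```python
def weight_bytes(n_params: int, bits: int) -> int:
    # Storage required for n_params weights at the given bit width.
    return n_params * bits // 8

fp32 = weight_bytes(1_000_000, 32)  # 4,000,000 bytes
int8 = weight_bytes(1_000_000, 8)   # 1,000,000 bytes
assert fp32 // int8 == 4
```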
- bias_a
Learnable bias applied to activations, helping preserve the quantized model's accuracy.
- Type:
torch.nn.Parameter
- scale_a
Scale factor for activations, used in quantization to maintain numerical stability.
- Type:
torch.nn.Parameter
- scale_w
Scale factor for weights, adjusting the quantized weights’ magnitude.
- Type:
torch.nn.Parameter
- eps
A small epsilon value to prevent division by zero in computations, set to 0.00001 by default.
- Type:
torch.Tensor
- prepare_params()[source]
Prepares and initializes the model parameters for training, converting weights to int8 format.
- generate_quantized_weight()[source]
Quantizes the weights, preparing them for efficient storage or computation.
- set_activation()[source]
Quantizes activations to 8 bits using the scale factor scale_a and adjusts them with bias_a.
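The roles of scale_a, bias_a, and eps can be illustrated with a plain-Python sketch; the exact rounding and clamping scheme used by Q8LinearCutlass is an assumption here, and the real layer operates on CUDA tensors rather than scalars:

```python
def quantize_activation(x: float, scale_a: float, bias_a: float, eps: float = 1e-5) -> int:
    """Sketch of 8-bit activation quantization with a learnable scale and bias.

    This only illustrates the roles of scale_a, bias_a, and eps; it is not
    the library's implementation.
    """
    # eps guards against division by zero when the scale collapses toward 0.
    q = round(x / max(scale_a, eps) + bias_a)
    # Clamp to the signed 8-bit range.
    return max(-128, min(127, q))
```

For example, `quantize_activation(0.5, 0.01, 0.0)` maps to 50, while out-of-range inputs saturate at -128 or 127.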
Methods
Initializes the Q8LinearCutlass layer, setting up parameters for activation scaling, weight scaling, and a small epsilon value for numerical stability.
Defines the forward pass for the quantized linear layer.
Quantizes the weights, preparing them for efficient computation or storage.
Prepares and initializes the model parameters for training.
Quantizes activations to 8 bits and applies a learnable bias adjustment.
Attributes
training
- __init__(*args, **kwargs)[source]
Initializes the Q8LinearCutlass layer, setting up parameters for activation scaling, weight scaling, and a small epsilon value for numerical stability.
- forward(x: Tensor) → Tensor[source]
Defines the forward pass for the quantized linear layer.
- Parameters:
x (torch.Tensor) – Input tensor with shape (batch size, number of features).
- Returns:
The output of the quantized linear layer.
- Return type:
torch.Tensor
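The arithmetic of such a quantized forward pass can be sketched in plain Python: multiply int8 activations by int8 weights, accumulate in wider integer precision, then rescale to floating point. The real layer dispatches an int8 GEMM to CUTLASS on the GPU, and the exact rescaling order is an assumption:

```python
def int8_linear_forward(x_q, w_q, scale_a, scale_w):
    """Sketch of a quantized linear forward pass (not the CUTLASS kernel).

    x_q: rows of int8 activations, shape (batch, in_features)
    w_q: rows of int8 weights, shape (out_features, in_features)
    Integer products are summed exactly, then rescaled back to float
    with the activation and weight scale factors.
    """
    out = []
    for row in x_q:
        out.append([
            sum(a * w for a, w in zip(row, w_row)) * scale_a * scale_w
            for w_row in w_q
        ])
    return out
```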
- generate_quantized_weight(qweight_only: bool = False) → None[source]
Quantizes the weights, preparing them for efficient computation or storage.
- Parameters:
qweight_only (bool) – If True, only quantizes the weights without altering other parameters.
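A per-tensor symmetric scheme, sketched below, is one common way to derive scale_w and the int8 weights; whether Q8LinearCutlass uses exactly this scheme is an assumption:

```python
def quantize_weights(weights, eps=1e-5):
    """Sketch of symmetric per-tensor weight quantization to int8.

    The maximum weight magnitude is mapped to 127; eps keeps the scale
    strictly positive even for an all-zero weight tensor.
    """
    max_abs = max(abs(w) for w in weights)
    scale_w = max(max_abs, eps) / 127.0
    q = [max(-128, min(127, round(w / scale_w))) for w in weights]
    return q, scale_w
```

Dequantizing with `q_i * scale_w` then recovers an approximation of each original weight.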
- prepare_params() → None[source]
Prepares and initializes the model parameters for training.
Note
This method MUST be called after model initialization and before training starts to ensure the weights are properly prepared for efficient computation.
One can use the prepare_bie_layers method from project_root.utils.model_helper to call this function.