bitorch_engine.layers.qlinear.nbit.layer.MPQLinearBase
- class bitorch_engine.layers.qlinear.nbit.layer.MPQLinearBase(in_channels: int, out_channels: int, a_bit: int = 16, w_bit: int = 4, dtype=torch.float16, group_size=-1, use_gba_quant=True, dq_group_size=-1, dq_mode=2, disable_bias=True, asym=False, requires_grad=True)[source]
Base class for mixed precision quantized (MPQ) linear layers, designed to support the computational needs of large language models (LLMs) with mixed precision quantization, such as 16-bit activations and 4-bit weights for efficient inference. It introduces optimized computation for bitwise unpacking of quantized weights and 16-bit floating-point matrix multiplication, tailored for various hardware platforms.
Unlike nBitLinearBase, MPQLinearBase serves as the base class for mixed precision quantized linear layers. This class mainly supports the mixed precision linear layers used in current LLMs, such as 16-bit activations combined with 4-bit quantized weights for inference. During inference, two main computation steps are introduced: bitwise unpacking of qweight from low-bit storage to 16-bit float, and 16-bit matrix multiplication. The performance of both steps has been optimized for different hardware platforms.
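As a rough illustration of these two steps, the sketch below dequantizes int32-packed 4-bit weights to fp16 and then performs the 16-bit matrix multiplication in plain PyTorch. The packing layout (eight 4-bit values per int32, least-significant nibble first) and the scale/zero-point shapes are assumptions made for illustration; the optimized kernels behind this layer fuse and accelerate these steps on the target hardware.

```python
import torch

def unpack_4bit_to_fp16(qweight: torch.Tensor, scale: torch.Tensor,
                        zero: torch.Tensor) -> torch.Tensor:
    """Dequantize int32-packed 4-bit weights to fp16 (illustrative layout)."""
    # Assumed layout: eight 4-bit values per int32, lowest nibble first.
    nibbles = [(qweight >> (4 * i)) & 0xF for i in range(8)]       # 8 x (K/8, N)
    q = torch.stack(nibbles, dim=1).reshape(-1, qweight.shape[1])  # (K, N)
    return (q.half() - zero.half()) * scale.half()                 # fp16 weights

def mpq_linear_forward(x: torch.Tensor, qweight: torch.Tensor,
                       scale: torch.Tensor, zero: torch.Tensor) -> torch.Tensor:
    # Step 1: bitwise unpacking of qweight to 16-bit float.
    w = unpack_4bit_to_fp16(qweight, scale, zero)
    # Step 2: 16-bit matrix multiplication with the activations.
    return x.half() @ w
```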
- in_channels
The number of input features after bit-packing, representing the dimensionality of the input space.
- Type:
int
- out_channels
The number of output features, representing the dimensionality of the output space.
- Type:
int
- a_bit
The bit-width used for activation quantization, defaulting to 16 bits for high precision.
- Type:
int
- w_bit
The bit-width used for weight quantization, aiming to reduce memory footprint and computational cost.
- Type:
int
- dtype
The data type for computations within this layer, typically torch.half for efficiency.
- Type:
torch.dtype
- group_size
The grouping size for quantization, affecting scale and zero-point calculation. A value of -1 indicates that the entire input width is treated as one group.
- Type:
int
- use_gba_quant
Flag to indicate the use of GBA-specific quantization techniques over GPTQ-compliant methods.
- Type:
bool
- dq_group_size
Double quantization group size, specific to GBA quantization, for further granularity in quantization.
- Type:
int
- dq_mode
Double quantization mode, catering to different versions and requirements of LLaMA models.
- Type:
int
- disable_bias
Whether to disable the bias term in the linear calculation. Disabling the bias reduces parameters and computation.
- Type:
bool
- asym
Indicates whether asymmetric quantization is used, offering an alternative to symmetric quantization strategies.
- Type:
bool
- initialize()[source]
Initializes parameters and quantization buffers based on the selected quantization method.
- init_gba()[source]
Configures buffers and scales for GBA quantization, accommodating asymmetric quantization and double quantization modes.
- generate_quantized_weight()[source]
Placeholder for weight quantization method, to be implemented by subclasses.
- check_parameters()[source]
Placeholder for parameter validation, ensuring correct layer configuration.
- prepare_params()[source]
Prepares quantized parameters for the forward pass, potentially decompressing quantized values.
Methods
- __init__(in_channels, out_channels, ...) – Initializes the layer; in_channels is the dim of input features after bit-packing.
- check_parameters() – Validates the configuration and parameters of the layer to ensure they are set correctly for the quantization process.
- generate_quantized_weight(qweight_only) – A placeholder method for the weight quantization process.
- init_gba() – Prepares the layer for GBA-specific quantization, configuring buffers for scales, zero-points, and statistics for double quantization if enabled.
- init_gptq() – Initializes parameters and buffers specific to the GPTQ quantization method.
- initialize() – Initializes layer parameters and quantization buffers.
- prepare_params() – Prepares quantized parameters for the forward pass; should be executed before the actual forward pass.
- set_qweight_data(data) – Updates the quantized weight tensor with new data.
Attributes
training
- __init__(in_channels: int, out_channels: int, a_bit: int = 16, w_bit: int = 4, dtype=torch.float16, group_size=-1, use_gba_quant=True, dq_group_size=-1, dq_mode=2, disable_bias=True, asym=False, requires_grad=True) None [source]
- Parameters:
in_channels (int) – dim of input features after bit-packing
out_channels (int) – dim of hidden states
a_bit – activation bits
w_bit – weight bits
dtype – data type used in this layer
group_size – number of weight elements that share one scale and zero-point factor
disable_bias – whether to disable the bias term
use_gba_quant – True: use GBA-specific quantization; False: use GPTQ-compliant methods
dq_group_size – GBA-specific parameter. Indicates the double quantization group size.
dq_mode – GBA-specific parameter. Indicates the double quantization mode, used to adapt to different LLaMA versions.
asym – GBA-specific parameter. Indicates whether asymmetric or symmetric quantization is used.
requires_grad (bool) – Indicates whether gradient calculation should be enabled for the parameters.
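For illustration, a concrete subclass would typically be constructed with these arguments. The class name and import path below are assumptions (use the concrete subclass shipped with your backend rather than the abstract MPQLinearBase), and the values shown are only examples.

```python
import torch
# Assumed import path and class name for a CUDA backend implementation.
from bitorch_engine.layers.qlinear.nbit import MPQLinearCuda

layer = MPQLinearCuda(
    in_channels=4096,     # dim of input features after bit-packing
    out_channels=4096,    # dim of hidden states
    a_bit=16,             # 16-bit activations
    w_bit=4,              # 4-bit weights
    dtype=torch.float16,
    group_size=128,       # 128 weight elements share one scale/zero-point
    use_gba_quant=True,   # GBA quantization instead of GPTQ-compliant mode
    dq_group_size=256,    # double quantization group size (GBA only)
    dq_mode=2,
    disable_bias=True,
    requires_grad=False,  # inference only
)
```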
- check_parameters() None [source]
Validates the configuration and parameters of the layer to ensure they are set correctly for the quantization process. This method should check for common configuration errors and ensure that all required parameters for the selected quantization method are correctly initialized.
- Raises:
NotImplementedError – Indicates that the method has not been implemented yet and needs to be provided by subclasses.
- generate_quantized_weight(qweight_only: bool = False) None [source]
A placeholder method for the weight quantization process. Subclasses should implement this method to define how the layer’s weights are quantized based on the current configuration and quantization method. This operation is typically executed before saving the model weights or performing inference to ensure that the weights are in the appropriate quantized format.
- Parameters:
qweight_only (bool) – A flag to indicate whether only the quantized weights need to be generated, without considering other quantization parameters like scales or zero-points. Default is False, which means all relevant quantization parameters are generated.
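A minimal sketch of what a subclass implementation might look like is shown below. The attribute names (weight, scales, dtype) and the symmetric per-column 4-bit scheme are assumptions for illustration only; real subclasses must follow the packing layout expected by their kernels.

```python
import torch

class MyMPQLinear(MPQLinearBase):  # hypothetical subclass for illustration
    def generate_quantized_weight(self, qweight_only: bool = False) -> None:
        # Symmetric per-column 4-bit quantization of the floating-point weights.
        w = self.weight.data.float()                        # (K, N), assumed attribute
        scale = w.abs().amax(dim=0, keepdim=True) / 7.0     # one scale per column
        q = torch.clamp(torch.round(w / scale) + 8, 0, 15).to(torch.int32)
        # Pack eight 4-bit values into each int32 entry.
        packed = torch.zeros(q.shape[0] // 8, q.shape[1],
                             dtype=torch.int32, device=q.device)
        for i in range(8):
            packed |= q[i::8] << (4 * i)
        self.set_qweight_data(packed)
        if not qweight_only:
            self.scales = scale.to(self.dtype)               # keep scales for dequantization
```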
- init_gba() None [source]
Prepares the layer for GBA-specific quantization, configuring buffers for scales, zero-points, and statistics for double quantization if enabled. GBA quantization allows for fine-tuned control over the quantization process, accommodating asymmetric quantization and providing additional parameters to adjust for different model versions and requirements.
- init_gptq() None [source]
Initializes parameters and buffers specific to the GPTQ quantization method. This includes setting up zero-point buffers, scale factors, and ensuring asymmetric quantization is enabled. GPTQ, being a more general quantization approach, requires specific buffers to hold quantization parameters for accurate computation and minimal precision loss.
- initialize() None [source]
Initializes layer parameters and quantization buffers. This method sets up the infrastructure for either GBA or GPTQ quantization methods, based on the layer configuration. It allocates memory for quantized weights, scales, zero-points, and other necessary buffers, ensuring they are ready for the quantization process.
- prepare_params() None [source]
This method should be executed before the actual forward pass. It mainly decompresses quantized parameters such as qscale and qzero. This step could be simplified or eliminated in the future by a kernel implementation that decompresses during kernel computation.
One can use the “prepare_bie_layers” method from project_root.utils.model_helper to call this function.
Note
This method should be called before executing the forward pass, especially after loading the model from a checkpoint or before inference to ensure that quantized parameters are correctly prepared.
- Raises:
NotImplementedError – Indicates that the method has not been implemented yet and should be provided by subclasses.
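For example, after restoring a checkpoint one might prepare all MPQ layers in a model before running inference. The import path below is an assumption about where project_root.utils.model_helper lives in the installed package, and model / input_ids stand in for an existing model and its inputs.

```python
import torch
# "project_root" in the note above refers to the package root; the exact
# import path may differ depending on your installation.
from bitorch_engine.utils.model_helper import prepare_bie_layers

# model: an nn.Module containing MPQ linear layers; input_ids: its inputs.
model.load_state_dict(torch.load("checkpoint.pt"))
prepare_bie_layers(model)   # calls prepare_params() on every supported layer

with torch.no_grad():
    output = model(input_ids)
```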
- set_qweight_data(data: Tensor) None [source]
Updates the quantized weight tensor with new data. This method is crucial for adjusting the quantized weights based on training or fine-tuning processes, ensuring the layer’s weights reflect the most recent updates.
- Parameters:
data (torch.Tensor) – The new quantized weight data to be set in the layer.
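For instance (assuming the packed weight buffer is exposed as layer.qweight, which is an assumption made for this sketch):

```python
import torch

# new_packed: an int32 tensor produced by an external quantization pass,
# matching the shape and device of the existing packed weight buffer.
new_packed = torch.zeros_like(layer.qweight)  # placeholder data for illustration
layer.set_qweight_data(new_packed)
```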