bitorch_engine.layers.qlinear.nbit.layer.MPQLinearBase

class bitorch_engine.layers.qlinear.nbit.layer.MPQLinearBase(in_channels: int, out_channels: int, a_bit: int = 16, w_bit: int = 4, dtype=torch.float16, group_size=-1, use_gba_quant=True, dq_group_size=-1, dq_mode=2, disable_bias=True, asym=False, requires_grad=True)[source]

Base class for mixed precision quantized (MPQ) linear layers, designed to support the computational needs of large language models (LLMs) with mixed precision quantization, such as 16-bit activations and 4-bit weights for efficient inference. It introduces optimized computation for bitwise unpacking of quantized weights and 16-bit floating-point matrix multiplication, tailored for various hardware platforms.

Unlike nBitLinearBase, MPQLinearBase serves as the base class for mixed precision quantized linear layers. It primarily targets the mixed precision linear layers used in current LLMs, for example 16-bit activations combined with 4-bit quantized weights for inference. During inference, two main computation steps are involved: bitwise unpacking of qweight from low-bit storage to 16-bit float, and 16-bit matrix multiplication. The performance of both steps has been optimized for different hardware platforms.
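
The two computation steps can be illustrated with a plain PyTorch sketch (conceptual only, not the engine's optimized kernels; the packing layout assumed here is eight 4-bit values per int32, and the scale/zero-point shapes are assumptions):

    import torch

    def mpq_forward_sketch(x_fp16, qweight_int32, scales_fp16, zeros_fp16, group_size):
        # Unpack 4-bit weights stored as int32 (assumed layout: 8 nibbles per int32 element).
        shifts = torch.arange(0, 32, 4, device=qweight_int32.device)
        w = (qweight_int32.unsqueeze(-1) >> shifts) & 0xF          # (out, in/8, 8)
        w = w.flatten(start_dim=-2).to(torch.float16)              # (out, in_features)
        # Dequantize with per-group scales and zero-points (shapes assumed to be (out, groups)).
        groups = w.shape[-1] // group_size
        w = w.view(w.shape[0], groups, group_size)
        w = (w - zeros_fp16.unsqueeze(-1)) * scales_fp16.unsqueeze(-1)
        w = w.reshape(w.shape[0], -1)
        # 16-bit matrix multiplication with the fp16 activations.
        return x_fp16 @ w.t()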

in_channels

The number of input features after bit-packing, representing the dimensionality of the input space.

Type:

int

out_channels

The number of output features, representing the dimensionality of the output space.

Type:

int

a_bit

The bit-width used for activation quantization, defaulting to 16 bits for high precision.

Type:

int

w_bit

The bit-width used for weight quantization, aiming to reduce memory footprint and computational cost.

Type:

int

dtype

The data type for computations within this layer, typically torch.half for efficiency.

Type:

torch.dtype

group_size

The grouping size for quantization, affecting scale and zero-point calculation. A value of -1 indicates that the entire input width is treated as one group.

Type:

int
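
For example, in a typical per-group scheme with 4096 (unpacked) input features and group_size=128, the weights along the input dimension are split into 4096 / 128 = 32 groups, each with its own scale and zero-point per output channel.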

use_gba_quant

Flag to indicate the use of GBA-specific quantization techniques over GPTQ-compliant methods.

Type:

bool

dq_group_size

Double quantization group size, specific to GBA quantization, for further granularity in quantization.

Type:

int

dq_mode

Double quantization mode, catering to different versions and requirements of LLaMA models.

Type:

int

disable_bias

Whether to disable the bias term in the linear computation. Disabling the bias reduces parameter count and computation.

Type:

bool

asym

Indicates whether asymmetric quantization is used, offering an alternative to symmetric quantization strategies.

Type:

bool

initialize()[source]

Initializes parameters and quantization buffers based on the selected quantization method.

init_gptq()[source]

Sets up parameters specific to GPTQ quantization.

init_gba()[source]

Configures buffers and scales for GBA quantization, accommodating asymmetric quantization and double-quantization modes.

set_qweight_data(data)[source]

Updates the quantized weight tensor with new data.

generate_quantized_weight()[source]

Placeholder for weight quantization method, to be implemented by subclasses.

check_parameters()[source]

Placeholder for parameter validation, ensuring correct layer configuration.

prepare_params()[source]

Prepares quantized parameters for the forward pass, potentially decompressing quantized values.

Methods

__init__

param in_channels:

dim of input features after bit-packing

check_parameters

Validates the configuration and parameters of the layer to ensure they are set correctly for the quantization process.

generate_quantized_weight

A placeholder method for the weight quantization process.

init_gba

Prepares the layer for GBA-specific quantization, configuring buffers for scales, zero-points, and statistics for double quantization if enabled.

init_gptq

Initializes parameters and buffers specific to the GPTQ quantization method.

initialize

Initializes layer parameters and quantization buffers.

prepare_params

This method should be executed before the actual forward pass.

set_qweight_data

Updates the quantized weight tensor with new data.

Attributes

training

__init__(in_channels: int, out_channels: int, a_bit: int = 16, w_bit: int = 4, dtype=torch.float16, group_size=-1, use_gba_quant=True, dq_group_size=-1, dq_mode=2, disable_bias=True, asym=False, requires_grad=True) None[source]
Parameters:
  • in_channels (int) – dimension of the input features after bit-packing

  • out_channels (int) – dimension of the output features (hidden states)

  • a_bit (int) – activation bit-width

  • w_bit (int) – weight bit-width

  • dtype (torch.dtype) – data type used for computation in this layer

  • group_size (int) – number of weight elements that share one scale and zero-point factor; -1 treats the entire input width as one group

  • disable_bias (bool) – whether to disable the bias term

  • use_gba_quant (bool) – True: use GBA-specific quantization; False: use GPTQ-compliant methods

  • dq_group_size (int) – GBA-specific parameter; the double-quantization group size

  • dq_mode (int) – GBA-specific parameter; the double-quantization mode, used to adapt to different LLaMA versions

  • asym (bool) – GBA-specific parameter; selects asymmetric or symmetric quantization

  • requires_grad (bool) – whether gradient calculation should be enabled for the parameters
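
For illustration, a concrete subclass (the class name SomeMPQLinear below is a placeholder, not an actual engine class) configured for a LLaMA-style projection with hidden size 4096 might be constructed roughly as follows; the packed in_channels value assumes 32 // w_bit weights per int32 element:

    import torch

    layer = SomeMPQLinear(                 # placeholder for a concrete MPQLinearBase subclass
        in_channels=4096 * 4 // 32,        # 4-bit weights packed into int32 (assumed packing factor)
        out_channels=4096,
        a_bit=16,
        w_bit=4,
        dtype=torch.float16,
        group_size=128,                    # one scale/zero-point per 128 weights
        use_gba_quant=True,
        dq_group_size=256,                 # illustrative double-quantization group size
        dq_mode=2,
        disable_bias=True,
        asym=False,
        requires_grad=False,               # inference only
    )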

check_parameters() None[source]

Validates the configuration and parameters of the layer to ensure they are set correctly for the quantization process. This method should check for common configuration errors and ensure that all required parameters for the selected quantization method are correctly initialized.

Raises:

NotImplementedError – Indicates that the method has not been implemented yet and needs to be provided by subclasses.
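
A subclass override might, for instance, validate bit-widths and group sizes along these lines (an illustrative sketch; the supported values shown are assumptions, not the engine's actual checks):

    from bitorch_engine.layers.qlinear.nbit.layer import MPQLinearBase

    class MyMPQLinear(MPQLinearBase):                          # hypothetical subclass
        def check_parameters(self) -> None:
            # Illustrative checks only; concrete subclasses define their own rules.
            if self.w_bit not in (2, 4, 8):                    # supported bit-widths are an assumption
                raise ValueError(f"unsupported weight bit-width: {self.w_bit}")
            if self.group_size != -1 and self.group_size <= 0:
                raise ValueError(f"invalid group_size: {self.group_size}")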

generate_quantized_weight(qweight_only: bool = False) None[source]

A placeholder method for the weight quantization process. Subclasses should implement this method to define how the layer’s weights are quantized based on the current configuration and quantization method. This operation is typically executed before saving the model weights or performing inference to ensure that the weights are in the appropriate quantized format.

Parameters:

qweight_only (bool) – A flag to indicate whether only the quantized weights need to be generated, without considering other quantization parameters like scales or zero-points. Default is False, which means all relevant quantization parameters are generated.
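
Illustrative usage, assuming layer is an instance of a concrete subclass, typically right before saving the model or running inference:

    layer.generate_quantized_weight()                     # quantize weights and related parameters
    layer.generate_quantized_weight(qweight_only=True)    # regenerate only the packed weights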

init_gba() None[source]

Prepares the layer for GBA-specific quantization, configuring buffers for scales, zero-points, and statistics for double quantization if enabled. GBA quantization allows for fine-tuned control over the quantization process, accommodating asymmetric quantization and providing additional parameters to adjust for different model versions and requirements.

init_gptq() None[source]

Initializes parameters and buffers specific to the GPTQ quantization method. This includes setting up zero-point buffers, scale factors, and ensuring asymmetric quantization is enabled. GPTQ, being a more general quantization approach, requires specific buffers to hold quantization parameters for accurate computation and minimal precision loss.

initialize() None[source]

Initializes layer parameters and quantization buffers. This method sets up the infrastructure for either GBA or GPTQ quantization methods, based on the layer configuration. It allocates memory for quantized weights, scales, zero-points, and other necessary buffers, ensuring they are ready for the quantization process.

prepare_params() None[source]

This method should be executed before the actual forward pass. It mainly decompresses quantized parameters such as qscale and qzero. This step could be simplified or eliminated in the future by a kernel implementation that decompresses during computation.

One can use the prepare_bie_layers method from project_root.utils.model_helper to call this function.

Note

This method should be called before executing the forward pass, especially after loading the model from a checkpoint or before inference to ensure that quantized parameters are correctly prepared.

Raises:

NotImplementedError – Indicates that the method has not been implemented yet and should be provided by subclasses.
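
A typical preparation step after loading a checkpoint could look like this (illustrative; the manual loop is equivalent in spirit to using the prepare_bie_layers helper mentioned above):

    from bitorch_engine.layers.qlinear.nbit.layer import MPQLinearBase

    # 'model' is assumed to be a torch.nn.Module containing MPQ linear layers.
    for module in model.modules():
        if isinstance(module, MPQLinearBase):
            module.prepare_params()        # decompress qscale / qzero before inference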

set_qweight_data(data: Tensor) None[source]

Updates the quantized weight tensor with new data. This method is crucial for adjusting the quantized weights based on training or fine-tuning processes, ensuring the layer’s weights reflect the most recent updates.

Parameters:

data (torch.Tensor) – The new quantized weight data to be set in the layer.
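
For example (illustrative; new_qweight is assumed to already match the layer's packed layout and dtype, and 'layer' is an MPQ layer instance):

    import torch

    new_qweight = torch.load("repacked_qweight.pt")   # externally produced packed weights (illustrative file name)
    layer.set_qweight_data(new_qweight)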