bitorch_engine.layers.qlinear.nbit.cuda.mbwq_layer.MBWQLinearCudaFunction

class bitorch_engine.layers.qlinear.nbit.cuda.mbwq_layer.MBWQLinearCudaFunction(*args, **kwargs)

Custom CUDA function for performing forward and backward passes in MBWQ Linear layers.

This function supports both forward and backward passes, implemented as static methods. The forward pass calculates the output of the MBWQ Linear layer based on the input tensor and quantized weights. The backward pass computes gradients with respect to the input tensor and quantized weights.
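As a minimal usage sketch: since this is a torch.autograd.Function, it is invoked through .apply() rather than called directly. The tensor shapes, dtypes, and weight packing below are illustrative assumptions for a hypothetical 1024 → 1024 layer, not the layer's actual storage layout.

    import torch
    from bitorch_engine.layers.qlinear.nbit.cuda.mbwq_layer import MBWQLinearCudaFunction

    # Illustrative shapes only: 4-bit weights packed eight values per
    # int32 (assumed layout), quantization group size 128.
    x = torch.randn(8, 1024, dtype=torch.half, device="cuda")
    qweight = torch.randint(0, 2**31 - 1, (1024 // 8, 1024),
                            dtype=torch.int32, device="cuda")
    scales = torch.ones(1024 // 128, 1024, dtype=torch.half, device="cuda")
    zeros = torch.zeros(1024 // 128, 1024, dtype=torch.half, device="cuda")

    # .apply() routes through forward() and registers backward() in the
    # autograd graph; the remaining parameters keep their defaults.
    y = MBWQLinearCudaFunction.apply(
        x, qweight,
        False,   # use_mbw: standard group-wise quantization path
        True,    # is_train: save tensors for the backward pass
        scales, zeros,
        128,     # group_size
    )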

Methods

backward

Perform the backward pass of the MBWQ Linear layer.

forward

Perform the forward pass of the MBWQ Linear layer using CUDA.

static backward(ctx: BackwardCFunction, output_gradient: Tensor) → Tuple[Tensor, ...]

Perform the backward pass of the MBWQ Linear layer.

Parameters:
  • ctx (BackwardCFunction) – Autograd context providing the tensors saved during the forward pass.

  • output_gradient (torch.Tensor) – Output gradient.

Returns:

A tuple of gradients corresponding to the inputs of forward, with None for inputs that do not require gradients.

Return type:

Tuple[torch.Tensor, …]

Note

This method is experimental and may not guarantee error-free or consistent behavior.
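The backward pass is reached through autograd rather than by calling this method directly. Continuing the hypothetical tensors from the sketch above:

    # backward() is invoked by autograd when the graph is traversed.
    x.requires_grad_(True)
    y = MBWQLinearCudaFunction.apply(x, qweight, False, True, scales, zeros, 128)
    y.sum().backward()   # dispatches to MBWQLinearCudaFunction.backward
    print(x.grad.shape)  # gradient with respect to the input tensor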

static forward(ctx, x: Tensor, qweight: Tensor, use_mbw: bool, is_train: bool, scales: Tensor, zeros: Tensor, group_size: int, q_perm: Tensor | None = None, bits: int = 4, privileged_grad: Tensor | None = None, q_group_map: Tensor | None = None, rows: list | None = None) → Tensor

Perform the forward pass of the MBWQ Linear layer using CUDA.

This method computes the output of a linear layer with Mixed Binary Weight Quantization (MBWQ), optimizing the computation for CUDA-enabled devices. It supports both standard group-wise quantization and MBWQ modes, selected at runtime by the use_mbw flag, and it operates in either training or inference mode, as indicated by the is_train flag.

Parameters:
  • ctx (Any) – Autograd context, used for saving variables needed for backward computation.

  • x (torch.Tensor) – Input tensor, representing the data that will be processed by the layer.

  • qweight (torch.Tensor) – Quantized weights tensor, which contains the quantized values of the weights used in the layer.

  • use_mbw (bool) – Flag indicating whether to use Mixed Binary Weight Quantization (MBWQ) mode for processing.

  • is_train (bool) – Flag indicating whether the operation is being performed in training mode.

  • scales (torch.Tensor) – Scale factors for quantization.

  • zeros (torch.Tensor) – Zero points for quantization.

  • group_size (int) – The size of groups for group-wise quantization.

  • q_perm (torch.Tensor, optional) – Permutation tensor for reordering the quantized weights.

  • bits (int, optional) – Bit-width of the packed qweight tensor (default: 4).

  • q_group_map (torch.Tensor, optional) – Mapping tensor for group-wise quantization.

  • rows (list, optional) – Distribution and permutation information for weights in MBWQ mode.

Returns:

The output tensor of the forward pass, after processing by the MBWQ Linear layer.

Return type:

torch.Tensor

Note

This method is optimized for CUDA computation and should be used when performance on CUDA-enabled devices is a priority. Implementation details and parameter usage may change, as the method is experimental and targets advanced use cases.
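A sketch of an inference-mode call follows, reusing the hypothetical tensors from the sketch above. In practice q_perm, q_group_map, and rows are prepared by the enclosing layer; the placeholder values here only illustrate the positional argument order, not a working MBWQ configuration.

    # Hypothetical MBWQ-mode inference call. None placeholders mark
    # inputs that are layer-provided in practice.
    q_perm = torch.arange(1024, device="cuda")  # identity permutation (assumed dtype/shape)
    with torch.no_grad():
        y = MBWQLinearCudaFunction.apply(
            x, qweight,
            True,    # use_mbw: mixed bit-width weight path
            False,   # is_train: inference, nothing saved for backward
            scales, zeros,
            128,     # group_size
            q_perm,
            4,       # bits
            None,    # privileged_grad
            None,    # q_group_map (layer-provided in practice)
            None,    # rows (layer-provided in practice)
        )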