bitorch_engine.layers.qconv.nbit.cutlass.layer.Q4Conv2dCutlass

class bitorch_engine.layers.qconv.nbit.cutlass.layer.Q4Conv2dCutlass(*args, **kwargs)

A specialized 4-bit quantized convolutional layer using CUTLASS kernels.

This class extends nBitConv2dBase to implement a 4-bit quantized convolution layer optimized for CUTLASS. It supports quantization for both weights and activations, aiming to reduce model size and improve computational efficiency while maintaining accuracy.

bias_a

Bias parameter for activation quantization.

Type:

torch.nn.Parameter

scale_a

Scale parameter for activation quantization.

Type:

torch.nn.Parameter

scale_w

Scale parameter for weight quantization.

Type:

torch.nn.Parameter

eps

A small epsilon value to prevent division by zero in calculations.

Type:

torch.Tensor

Methods

__init__

Initializes the Q4Conv2dCutlass layer with provided arguments.

forward

Defines the forward pass of the 4-bit quantized convolution using CUTLASS.

generate_quantized_weight

Performs weight quantization and optionally releases the floating-point weights.

prepare_params

Prepares and initializes the model parameters for training.

set_activation

Calculates the scale of the input and shifts the input using the learnable bias_a.

Attributes

training

__init__(*args, **kwargs)

Initializes the Q4Conv2dCutlass layer with provided arguments.

Parameters:
  • *args – Variable length argument list for base class.

  • **kwargs – Arbitrary keyword arguments for base class.

forward(x: Tensor) → Tensor

Defines the forward pass of the 4-bit quantized convolution using CUTLASS.

Parameters:

x (torch.Tensor) – Input tensor of shape (N, C, H, W).

Returns:

The output tensor of the convolution operation.
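The page does not state the output shape. As a rough guide, the sketch below applies the standard 2D convolution shape formula to an (N, C, H, W) input. It assumes this layer follows the same stride, padding, and dilation semantics as torch.nn.Conv2d, which the source does not confirm; conv2d_output_shape is a hypothetical helper, not part of bitorch_engine.

```python
def conv2d_output_shape(n, h, w, out_channels, kernel_size,
                        stride=1, padding=0, dilation=1):
    """Return (N, C_out, H_out, W_out) for a standard 2D convolution.

    Hypothetical helper: assumes square kernels and the usual
    Conv2d shape arithmetic (an assumption about this layer).
    """
    def out_dim(size):
        # Standard formula: floor((size + 2p - d*(k-1) - 1) / s) + 1
        return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

    return (n, out_channels, out_dim(h), out_dim(w))


# A 3x3 kernel with padding 1 and stride 1 preserves the spatial size.
same_shape = conv2d_output_shape(n=8, h=32, w=32, out_channels=64,
                                 kernel_size=3, stride=1, padding=1)

# Stride 2 halves the spatial size.
halved_shape = conv2d_output_shape(n=8, h=32, w=32, out_channels=64,
                                   kernel_size=3, stride=2, padding=1)
```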

generate_quantized_weight(qweight_only: bool = False) → None

Performs weight quantization and optionally releases the floating-point weights.

This method should be called before saving the model weights, especially for inference.

Parameters:

qweight_only (bool) – If True, releases the floating-point weight after quantization. It will save runtime memory for inference.
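To illustrate why keeping only the quantized weight saves memory, the sketch below packs signed 4-bit integers two per byte. The packing layout (low nibble first) and the helper names pack_int4 and unpack_int4 are assumptions made for illustration; the storage format actually used by the CUTLASS kernels is not described here.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) two per byte, low nibble first.

    Hypothetical layout for illustration only.
    """
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("all values must fit in a signed 4-bit range [-8, 7]")
    padded = list(values) + [0] * (len(values) % 2)  # pad to an even count
    return bytes(((hi & 0xF) << 4) | (lo & 0xF)
                 for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


qweights = [7, -3, 0, 1, -8]
packed = pack_int4(qweights)
restored = unpack_int4(packed, len(qweights))
```

Five values stored as 32-bit floats would occupy 20 bytes; packed this way they occupy 3 bytes.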

prepare_params() → None

Prepares and initializes the model parameters for training.

Note

This method MUST be called after model initialization and before training starts to ensure the weights are properly prepared for efficient computation.

You can use the “prepare_bie_layers” method from project_root.utils.model_helper to call this function.
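The required call order can be sketched as a loop that invokes prepare_params() on every layer that defines it. This is only an illustration of the pattern: prepare_all_layers and the stub classes are hypothetical, and the actual implementation of prepare_bie_layers is not shown on this page.

```python
def prepare_all_layers(modules):
    """Call prepare_params() on each module that defines it.

    Hypothetical helper; returns the number of layers prepared.
    """
    prepared = 0
    for module in modules:
        prepare = getattr(module, "prepare_params", None)
        if callable(prepare):
            prepare()
            prepared += 1
    return prepared


class StubQuantLayer:
    """Stand-in for a quantized layer such as Q4Conv2dCutlass (illustration only)."""

    def __init__(self):
        self.prepared = False

    def prepare_params(self):
        self.prepared = True


class StubPlainLayer:
    """Stand-in for an ordinary layer without prepare_params()."""


layers = [StubQuantLayer(), StubPlainLayer(), StubQuantLayer()]
count = prepare_all_layers(layers)  # run after construction, before training
```

With a real PyTorch model, the iterable would typically be model.modules().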

set_activation(x: Tensor) → Tensor

Calculates the scale of the input and shifts the input using the learnable bias_a.

Parameters:

x (torch.Tensor) – The input activation tensor.

Returns:

The quantized activation tensor.

Return type:

torch.Tensor
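As a rough picture of a shift-then-scale activation quantizer, the sketch below adds a bias to the input, derives a scale from the largest shifted magnitude, and rounds to the signed 4-bit range. The sign of the shift, the scale formula, and the way eps is applied are all assumptions; the layer's actual computation may differ, and bias_a is a plain float here rather than a learnable parameter.

```python
def quantize_activation(x, bias_a=0.0, eps=1e-5, bits=4):
    """Shift inputs by bias_a, then quantize symmetrically to `bits` bits.

    Hypothetical sketch; returns (quantized integers, scale).
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    shifted = [v + bias_a for v in x]
    # eps keeps the scale non-zero when every shifted input is zero.
    scale = max(max(abs(v) for v in shifted), eps) / qmax
    q = [max(qmin, min(qmax, round(v / scale))) for v in shifted]
    return q, scale


# Shifting by -1.5 centers the inputs around zero before quantization.
q, scale = quantize_activation([0.0, 1.0, 2.0, 3.0], bias_a=-1.5)

# An all-zero input does not cause a division by zero.
q_zero, scale_zero = quantize_activation([0.0, 0.0])
```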