bitorch_engine.layers.qlinear.binary.cuda.layer.BinaryLinearCuda

class bitorch_engine.layers.qlinear.binary.cuda.layer.BinaryLinearCuda(*args, bmm_type: BMM = BMM.ADAPTIVE, **kwargs)[source]

A CUDA implementation of binary linear layers for neural networks. This class specializes in handling binary weights and activations for efficient computation on GPU devices. It extends BinaryLinearBase and mixes in BinaryLinearImplementationMixin to leverage both generic and hardware-specific optimizations.

bmm_type

Specifies the type of binary matrix multiplication kernel to use.

Type:

BMM

bits_binary_word

Defines the bit width of the binary words used in CUTLASS operations.

Type:

int

bias_a

Layer-wise bias for input activations.

Type:

torch.nn.Parameter

scale_a

Layer-wise scale for input activations to manage quantization effects.

Type:

torch.nn.Parameter

scale_w

Scale for the weights to maintain numerical stability in lower precision.

Type:

torch.nn.Parameter

Parameters:
  • *args – Variable length argument list for base class initialization.

  • bmm_type (BMM) – Enum indicating the binary matrix multiplication (BMM) kernel type.

  • **kwargs – Arbitrary keyword arguments for base class initialization.
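A minimal usage sketch, assuming a CUDA device is available. The import location of the BMM enum is an assumption (only BMM.ADAPTIVE is confirmed by the signature above); adjust it to your installed bitorch_engine layout:

    import torch
    from bitorch_engine.layers.qlinear.binary.cuda.layer import BinaryLinearCuda, BMM

    layer = BinaryLinearCuda(
        input_features=1024,            # input dimension
        out_features=512,               # output dimension / hidden size
        device=torch.device("cuda:0"),
        bmm_type=BMM.ADAPTIVE,          # let the layer pick a suitable binary GEMM kernel
    )
    layer.prepare_params()              # required after init, before training (see prepare_params below)

    x = torch.randn(8, 1024, device="cuda:0")
    y = layer(x)                        # binary linear forward pass
    print(y.shape)                      # expected: torch.Size([8, 512])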

Methods

__init__

Initializes the binary linear CUDA layer with the specified configuration.

create_clone_from

Creates a clone of this layer from a given recipe and device.

forward

Forward pass for the binary linear layer.

generate_quantized_weight

Generates and sets the quantized weight parameter from the current weights.

prepare_params

Prepares and initializes the model parameters for training, specifically converting floating-point weights to int8 format.

set_activation

Sets and scales the activation tensor x using the layer's scaling parameter and bias.

set_weight_data

Sets the weight data for this layer and prepares the parameters for training.

w_pack

Packs the given floating-point weights into a binary format suitable for binary matrix multiplication.

Attributes

device_id

Returns the index of the device the layer currently resides on.

training

Boolean flag inherited from torch.nn.Module indicating whether the layer is in training mode.

__init__(*args, bmm_type: BMM = BMM.ADAPTIVE, **kwargs)[source]

Initializes the binary linear CUDA layer with the specified configuration.

Parameters:
  • input_features (int) – Dimension of input features after bit-packing.

  • out_features (int) – Dimension of output features or hidden states.

  • bmm_type (BMM, optional) – The binary matrix multiplication kernel type to use. Defaults to BMM.ADAPTIVE.

  • device (torch.device, optional) – Device on which to allocate tensors. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for floating-point weights. Defaults to torch.float.

  • symmetric (bool, optional) – If True, quantization is symmetric around 0. Defaults to True.

classmethod create_clone_from(recipe: LayerRecipe, device: device | None = None) → Any[source]

Creates a clone of this layer from a given recipe and device.

Parameters:
  • recipe (LayerRecipe) – A recipe object containing layer configuration and weights.

  • device (torch.device, optional) – The device on which the layer should be deployed.

Returns:

An instance of BinaryLinearCuda with configurations and weights copied from the recipe.
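A hedged sketch of cloning; the recipe object is assumed to come from bitorch_engine's layer-replacement tooling, and only the call shape is taken from the signature above:

    import torch
    # `recipe` is a LayerRecipe obtained elsewhere (hypothetical variable)
    clone = BinaryLinearCuda.create_clone_from(recipe, device=torch.device("cuda:0"))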

property device_id: int

Returns the index of the device the layer currently resides on.

Returns:

The index of the device.

Return type:

int

forward(x: Tensor, bmm_type: BMM = BMM.ADAPTIVE) → Tensor[source]

Forward pass for the binary linear layer. Applies quantized matrix multiplication based on the specified BMM type, scales and biases the input activations, and returns the output tensor.

Parameters:
  • x (torch.Tensor) – The input activation tensor.

  • bmm_type (BMM) – The type of binary matrix multiplication kernel to use.

Returns:

The output tensor of the binary linear operation.

Return type:

torch.Tensor
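A minimal call sketch, reusing the layer and input from the example above (BMM.ADAPTIVE is the only enum member confirmed on this page; other kernel choices may exist):

    y = layer.forward(x, bmm_type=BMM.ADAPTIVE)
    # equivalent to layer(x), which uses the default BMM.ADAPTIVE kernel selection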

generate_quantized_weight(qweight_only: bool = False) → None[source]

Generates and sets the quantized weight parameter from the current weights by invoking a bit-packing CUDA kernel.

Parameters:

qweight_only (bool) – If True, the original weight tensor is discarded to save memory.
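A usage sketch; the inference-only caveat is an assumption that follows from discarding the floating-point weights:

    layer.generate_quantized_weight(qweight_only=True)  # packs weights, drops the float copy
    # after this, the layer should only be used for inference (assumption)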

prepare_params() → None[source]

Prepares and initializes the model parameters for training, specifically converting floating-point weights to int8 format.

This method leverages the init_weight function to convert the model’s floating-point weights to int8, achieving a significant reduction in memory usage. It also computes a scale for the weights, which is essential for maintaining the numerical fidelity of the model’s computations in the lower precision format. The conversion to int8 format is particularly beneficial for accelerating training and inference on hardware that supports lower precision arithmetic.

Note

This method MUST be called after model initialization and before training starts to ensure the weights are properly prepared for efficient computation.

One can use the “prepare_bie_layers” method from project_root.utils.model_helper to call this function.
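A preparation sketch; build_model is a hypothetical function, and prepare_bie_layers (referenced above) is assumed to wrap a loop like this:

    model = build_model()              # hypothetical model containing BinaryLinearCuda layers
    for m in model.modules():
        if isinstance(m, BinaryLinearCuda):
            m.prepare_params()         # convert float weights to int8 and compute scale_w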

set_activation(x: Tensor) → Tensor[source]

Sets and scales the activation tensor x using the layer’s scaling parameter and bias.

Parameters:

x (torch.Tensor) – The input activation tensor.

Returns:

The scaled and biased activation tensor.

Return type:

torch.Tensor

set_weight_data(x: Tensor) → None[source]

Sets the weight data for this layer and prepares the parameters for training.

Parameters:

x (torch.Tensor) – The new weight tensor.
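A sketch of loading externally computed weights; the (out_features, input_features) shape is an assumption based on the constructor arguments:

    w = torch.randn(512, 1024, device="cuda:0")  # assumed shape: (out_features, input_features)
    layer.set_weight_data(w)                     # also prepares the parameters for training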

static w_pack(weights: Tensor, bmm_type: BMM) → Tensor[source]

Packs the given floating-point weights into a binary format suitable for binary matrix multiplication.

Parameters:
  • weights (torch.Tensor) – The floating-point weight tensor to be packed.

  • bmm_type (BMM) – The binary matrix multiplication kernel type to be used.

Returns:

The packed binary weights.

Return type:

torch.Tensor
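A packing sketch; the input shape and the resulting packed dtype/layout are assumptions, since they depend on bits_binary_word and the chosen kernel:

    weights = torch.randn(512, 1024, device="cuda:0")        # floating-point weights
    packed = BinaryLinearCuda.w_pack(weights, BMM.ADAPTIVE)  # bit-packed binary tensor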