Modules#

Module#

class Module() -> 'None':

Base class for all neural-network modules.

Subclasses must override :meth:forward. Parameters (tensors with requires_grad=True) assigned to attributes are automatically tracked and yielded by :meth:parameters. Submodules assigned to attributes are recursively tracked by :meth:modules.

Modules are registered as PyTree nodes, so they can be passed directly to transforms like :func:~nabla.vmap, :func:~nabla.grad, and :func:~nabla.compile without any special wrapping.

Example::

class MLP(nabla.nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nabla.nn.Linear(in_dim, 64)
        self.fc2 = nabla.nn.Linear(64, out_dim)

    def forward(self, x):
        return self.fc2(nabla.relu(self.fc1(x)))
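
Because a Module is registered as a PyTree node, it can be passed straight through transforms. A minimal sketch building on the MLP above, assuming a nabla.rand tensor constructor, standard tensor arithmetic with a .mean() reduction, and that :func:~nabla.grad differentiates with respect to its first argument (all assumptions, not documented on this page)::

def mse(model, x, y):
    # Mean-squared error between predictions and targets.
    return ((model(x) - y) ** 2).mean()

mlp = MLP(8, 2)
x, y = nabla.rand((16, 8)), nabla.rand((16, 2))  # assumed random-tensor constructor
grads = nabla.grad(mse)(mlp, x, y)               # gradients mirror the MLP's structure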

Methods#

backward#

def backward(self, loss: 'Tensor', gradient: 'Tensor | None' = None, retain_graph: 'bool' = False, create_graph: 'bool' = False, *, realize_grads: 'bool | None' = None) -> 'None':

PyTorch-style backward() convenience method attached to Module.

Optionally realizes all parameter gradients after backward.
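
A usage sketch, assuming a nabla.rand tensor constructor and standard tensor arithmetic with a .mean() reduction (both assumptions, not part of this page)::

model = MLP(8, 2)                         # MLP from the example above
x = nabla.rand((4, 8))                    # assumed random-tensor constructor
loss = (model(x) ** 2).mean()             # any scalar loss works
model.backward(loss, realize_grads=True)  # populate (and realize) parameter grads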

buffers#

def buffers(self) -> 'Iterator[Tensor]':

eval#

def eval(self) -> 'Module':

extra_repr#

def extra_repr(self) -> 'str':

forward#

def forward(self, *args: 'Any', **kwargs: 'Any') -> 'Any':

load_state_dict#

def load_state_dict(self, state_dict: 'OrderedDict[str, Tensor]') -> 'None':

modules#

def modules(self) -> 'Iterator[Module]':

named_buffers#

def named_buffers(self, prefix: 'str' = '') -> 'Iterator[tuple[str, Tensor]]':

named_parameters#

def named_parameters(self, prefix: 'str' = '') -> 'Iterator[tuple[str, Tensor]]':

parameters#

def parameters(self) -> 'Iterator[Tensor]':

register_buffer#

def register_buffer(self, name: 'str', tensor: 'Tensor | None') -> 'None':

state_dict#

def state_dict(self) -> 'OrderedDict[str, Tensor]':
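
A round-trip sketch using only the signatures on this page: copy parameters from one module into a fresh instance::

src = nabla.nn.Linear(16, 4)
dst = nabla.nn.Linear(16, 4)
dst.load_state_dict(src.state_dict())  # dst now holds src's parameter values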

train#

def train(self) -> 'Module':

zero_grad#

def zero_grad(self) -> 'None':

Linear#

class Linear(in_features: 'int', out_features: 'int', bias: 'bool' = True, *, dtype: 'DType' = float32) -> 'None':

Apply a linear transformation: y = x @ W.T + b.

Parameters

  • in_features – Number of input features.

  • out_features – Number of output features.

  • bias – If True (default), adds a learnable bias term.

  • dtype – Dtype for weight and bias. Default: float32.
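
A shape sketch, assuming a nabla.rand tensor constructor::

layer = nabla.nn.Linear(128, 64)
x = nabla.rand((32, 128))   # assumed constructor: batch of 32 inputs
y = layer(x)                # shape (32, 64)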

Methods#

extra_repr#

def extra_repr(self) -> 'str':

forward#

def forward(self, x: 'Tensor') -> 'Tensor':

LayerNorm#

class LayerNorm(normalized_shape: 'int | tuple[int, ...]', eps: 'float' = 1e-05, elementwise_affine: 'bool' = True, *, dtype: 'DType' = float32) -> 'None':

Apply layer normalization over the last len(normalized_shape) dimensions.

Normalizes inputs as (x - mean) / sqrt(var + eps), then applies a learnable per-element affine transform when elementwise_affine is True.

Parameters

  • normalized_shape – Input shape from an expected input of size (*, normalized_shape[0], ..., normalized_shape[-1]). Can be an int for the common last-dimension case.

  • eps – Value added to the denominator for numerical stability. Default: 1e-5.

  • elementwise_affine – If True (default), learnable weight (initialized to 1) and bias (initialized to 0) are added.

  • dtype – Dtype for weight and bias. Default: float32.
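
A sketch of the common last-dimension case, assuming a nabla.rand tensor constructor::

ln = nabla.nn.LayerNorm(512)   # int form: normalize over the last dimension
x = nabla.rand((8, 16, 512))   # assumed constructor: (batch, seq, features)
y = ln(x)                      # each (batch, seq) position has mean ~0, variance ~1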

Methods#

extra_repr#

def extra_repr(self) -> 'str':

forward#

def forward(self, x: 'Tensor') -> 'Tensor':

Dropout#

class Dropout(p: 'float' = 0.5) -> 'None':

Randomly zero elements of the input with probability p (Bernoulli dropout).

Elements that are not zeroed are scaled by 1 / (1 - p) (inverted dropout) so that the expected value of each element is unchanged. Call eval() to disable dropout during inference.

Parameters

  • p – Probability of an element being zeroed. Default: 0.5.
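
A train/eval sketch, assuming a nabla.rand tensor constructor::

drop = nabla.nn.Dropout(p=0.1)
x = nabla.rand((32, 64))  # assumed random-tensor constructor
drop.train()
y_train = drop(x)         # ~10% of elements zeroed, survivors scaled by 1 / 0.9
drop.eval()
y_eval = drop(x)          # dropout disabled: y_eval equals x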


Embedding#

class Embedding(num_embeddings: 'int', embedding_dim: 'int', *, dtype: 'DType' = float32) -> 'None':

A learnable lookup table mapping integer indices to dense vectors.

Parameters

  • num_embeddings : int – Size of the vocabulary (number of rows).

  • embedding_dim : int – Dimensionality of each embedding vector.
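
A lookup sketch, assuming a nabla.tensor constructor for integer indices (an assumption, not documented here)::

emb = nabla.nn.Embedding(num_embeddings=10_000, embedding_dim=256)
ids = nabla.tensor([[1, 5, 42], [7, 7, 0]])  # assumed constructor: (2, 3) int indices
vecs = emb(ids)                              # shape (2, 3, 256)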

Methods#

extra_repr#

def extra_repr(self) -> 'str':

forward#

def forward(self, indices: 'Tensor') -> 'Tensor':

MultiHeadAttention#

class MultiHeadAttention(d_model: 'int', num_heads: 'int', dropout: 'float' = 0.0, bias: 'bool' = True, *, dtype: 'DType' = float32) -> 'None':

Multi-head attention as described in Attention Is All You Need.

Parameters

  • d_model : int – Total model dimensionality.

  • num_heads : int – Number of parallel attention heads. d_model must be divisible by num_heads.

  • dropout : float – Dropout probability on attention weights (applied during training).

  • bias : bool – Whether the linear projections include bias terms.

Methods#

extra_repr#

def extra_repr(self) -> 'str':

forward#

def forward(self, query: 'Tensor', key: 'Tensor', value: 'Tensor', attn_mask: 'Tensor | None' = None, is_causal: 'bool' = False) -> 'Tensor':

Run multi-head attention.

Parameters

  • query, key, value – Tensors of shape (batch, seq_*, d_model).

  • attn_mask – Optional additive mask of shape (…, seq_q, seq_k).

  • is_causal – If True, apply a causal mask.

Returns

Tensor of shape (batch, seq_q, d_model).
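
A self-attention sketch (query, key, and value are the same tensor), assuming a nabla.rand tensor constructor::

mha = nabla.nn.MultiHeadAttention(d_model=256, num_heads=8)
x = nabla.rand((4, 32, 256))        # assumed constructor: (batch, seq, d_model)
out = mha(x, x, x, is_causal=True)  # shape (4, 32, 256)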


TransformerEncoderLayer#

class TransformerEncoderLayer(d_model: 'int', num_heads: 'int', dim_feedforward: 'int' = 2048, dropout: 'float' = 0.1, *, dtype: 'DType' = float32) -> 'None':

A single Transformer encoder layer (pre-norm variant).

Structure::

x ─→ LayerNorm ─→ MultiHeadAttention ─→ Dropout ─→ + ─→
│                                                    ↑
└────────────────────────────────────────────────────┘
x ─→ LayerNorm ─→ FFN ─→ Dropout ─→ + ─→
│                                      ↑
└──────────────────────────────────────┘

Parameters

  • d_model : int – Model dimensionality.

  • num_heads : int – Number of attention heads.

  • dim_feedforward : int – Hidden size of the position-wise feed-forward network.

  • dropout : float – Dropout probability.
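
A sketch, assuming a nabla.rand tensor constructor::

layer = nabla.nn.TransformerEncoderLayer(d_model=256, num_heads=8)
src = nabla.rand((4, 32, 256))  # assumed constructor: (batch, seq, d_model)
out = layer(src)                # output keeps the input shape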

Methods#

forward#

def forward(self, src: 'Tensor', src_mask: 'Tensor | None' = None, is_causal: 'bool' = False) -> 'Tensor':

TransformerDecoderLayer#

class TransformerDecoderLayer(d_model: 'int', num_heads: 'int', dim_feedforward: 'int' = 2048, dropout: 'float' = 0.1, *, dtype: 'DType' = float32) -> 'None':

A single Transformer decoder layer (pre-norm variant).

Structure::

tgt ─→ LayerNorm ─→ Masked-Self-Attention ─→ Dropout ─→ + ─→
tgt ─→ LayerNorm ─→ Cross-Attention(tgt, memory) ─→ Dropout ─→ + ─→
tgt ─→ LayerNorm ─→ FFN ─→ Dropout ─→ + ─→

Parameters

  • d_model, num_heads, dim_feedforward, dropout – Same as for :class:TransformerEncoderLayer.

Methods#

forward#

def forward(self, tgt: 'Tensor', memory: 'Tensor', tgt_mask: 'Tensor | None' = None, memory_mask: 'Tensor | None' = None, is_causal: 'bool' = False) -> 'Tensor':
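
A sketch pairing the decoder with encoder output, assuming a nabla.rand tensor constructor::

dec = nabla.nn.TransformerDecoderLayer(d_model=256, num_heads=8)
tgt = nabla.rand((4, 16, 256))          # assumed constructor: target embeddings
memory = nabla.rand((4, 32, 256))       # e.g. output of an encoder stack
out = dec(tgt, memory, is_causal=True)  # shape (4, 16, 256)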

Sequential#

class Sequential(*args: 'Any') -> 'None':

Chain modules sequentially — output of each becomes input to the next.

Parameters

  • *args – Either a flat sequence of :class:Module instances, or a single :class:OrderedDict mapping string names to modules.
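
A sketch chaining modules documented on this page, assuming a nabla.rand tensor constructor::

mlp = nabla.nn.Sequential(
    nabla.nn.Linear(128, 64),
    nabla.nn.ReLU(),
    nabla.nn.Linear(64, 10),
)
logits = mlp(nabla.rand((32, 128)))  # assumed constructor; output shape (32, 10)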


ReLU#

class ReLU() -> 'None':

Apply the Rectified Linear Unit activation: max(0, x).


GELU#

class GELU() -> 'None':

Apply the Gaussian Error Linear Unit activation.


Sigmoid#

class Sigmoid() -> 'None':

Apply the sigmoid activation: 1 / (1 + exp(-x)).


Tanh#

class Tanh() -> 'None':

Apply the hyperbolic tangent activation: tanh(x).


SiLU#

class SiLU() -> 'None':

Apply the Sigmoid Linear Unit (Swish) activation: x * sigmoid(x).


Softmax#

class Softmax(axis: 'int' = -1) -> 'None':

Apply softmax along a given axis (default: last).