ggmlR (version 0.6.1)

ag_multihead_attention: Create a Multi-Head Attention layer

Description

Implements scaled dot-product multi-head attention as in "Attention Is All You Need" (Vaswani et al., 2017).
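The core operation can be sketched in a few lines of base R. This is a minimal single-head illustration in the column-per-token layout used below; the function name `scaled_dot_attention` is illustrative only, not part of the ggmlR API.

```r
# Single-head scaled dot-product attention, [d_model, seq_len] layout.
# Illustrative sketch only -- not ggmlR code.
scaled_dot_attention <- function(q, k, v) {
  s <- t(q) %*% k / sqrt(nrow(q))           # similarity scores [seq_len, seq_len]
  e <- exp(sweep(s, 2, apply(s, 2, max)))   # subtract column max for stability
  a <- sweep(e, 2, colSums(e), "/")         # column-wise softmax
  v %*% a                                   # weighted sum of value columns
}

x <- matrix(rnorm(4 * 6), 4, 6)             # d_model = 4, seq_len = 6
dim(scaled_dot_attention(x, x, x))          # [1] 4 6
```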

Usage

ag_multihead_attention(d_model, n_heads, dropout = 0, bias = TRUE)

Value

An ag_multihead_attention environment with methods $forward(q, k, v, causal_mask) and $parameters()

Arguments

d_model

Model (embedding) dimension

n_heads

Number of attention heads. d_model must be divisible by n_heads.
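The divisibility requirement fixes the per-head dimension; a quick illustrative check (assumed values, not ggmlR code):

```r
# Per-head key/query dimension is d_model / n_heads.
d_model <- 64L
n_heads <- 8L
stopifnot(d_model %% n_heads == 0L)  # required by ag_multihead_attention
d_k <- d_model %/% n_heads           # 8 dimensions per head
```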

dropout

Attention dropout probability (default 0, applied in training mode only)

bias

Logical: add bias to output projection (default TRUE)

Details

Calling convention (mirrors PyTorch nn.MultiheadAttention):

  • layer$forward(q) — self-attention (k = v = q)

  • layer$forward(q, k, v) — cross-attention

Tensor layout: [d_model, seq_len] — columns are tokens, consistent with the rest of the ag_* API.
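A small illustration of the column-per-token convention (plain base R, not ggmlR code):

```r
# In the [d_model, seq_len] layout, each column is one token's embedding.
d_model <- 4L
n_tok   <- 3L
x <- matrix(1:(d_model * n_tok), nrow = d_model, ncol = n_tok)
x[, 2]    # embedding of the 2nd token: [1] 5 6 7 8
ncol(x)   # number of tokens: [1] 3
```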

Forward pass:


  Q = W_q %*% q            [d_k * n_heads, seq_len]
  K = W_k %*% k            [d_k * n_heads, seq_len]
  V = W_v %*% v            [d_v * n_heads, seq_len]

  for each head h (h = 0, ..., n_heads - 1):
    q_h = Q[h*d_k+1 : (h+1)*d_k, ]               [d_k, seq_len]
    k_h = K[h*d_k+1 : (h+1)*d_k, ]               [d_k, seq_len]
    v_h = V[h*d_v+1 : (h+1)*d_v, ]               [d_v, seq_len]
    A_h = softmax(t(q_h) %*% k_h / sqrt(d_k))    [seq_len, seq_len]
    if causal_mask: A_h[i,j] = 0 for j > i
    head_h = v_h %*% A_h                         [d_v, seq_len]

  concat = rbind(head_1, ..., head_H)            [d_v*n_heads, seq_len]
  out    = W_o %*% concat + b_o                  [d_model, seq_len]
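The pseudocode above can be mirrored in plain base R. The helper names (col_softmax, mha_forward) and the explicit weight arguments are illustrative assumptions for this sketch; ggmlR manages its projection weights internally.

```r
# Column-wise softmax (each column sums to 1), numerically stabilized.
col_softmax <- function(m) {
  e <- exp(sweep(m, 2, apply(m, 2, max)))
  sweep(e, 2, colSums(e), "/")
}

# Reference multi-head forward pass mirroring the pseudocode above.
# Weights are passed explicitly here for illustration only.
mha_forward <- function(q, k, v, W_q, W_k, W_v, W_o, b_o, n_heads,
                        causal_mask = FALSE) {
  d_k <- nrow(W_q) %/% n_heads
  d_v <- nrow(W_v) %/% n_heads
  Q <- W_q %*% q; K <- W_k %*% k; V <- W_v %*% v
  heads <- lapply(0:(n_heads - 1), function(h) {
    q_h <- Q[(h * d_k + 1):((h + 1) * d_k), , drop = FALSE]
    k_h <- K[(h * d_k + 1):((h + 1) * d_k), , drop = FALSE]
    v_h <- V[(h * d_v + 1):((h + 1) * d_v), , drop = FALSE]
    s <- t(q_h) %*% k_h / sqrt(d_k)             # [seq_len, seq_len]
    if (causal_mask) s[upper.tri(s)] <- -Inf    # A_h[i,j] = 0 for j > i after softmax
    v_h %*% col_softmax(s)                      # [d_v, seq_len]
  })
  W_o %*% do.call(rbind, heads) + b_o           # [d_model, seq_len]
}
```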

Examples

# \donttest{
# Self-attention
mha <- ag_multihead_attention(64L, 8L)
x   <- ag_tensor(matrix(rnorm(64 * 10), 64, 10))  # [d_model=64, seq_len=10]
out <- mha$forward(x)                              # [64, 10]

# Cross-attention
q   <- ag_tensor(matrix(rnorm(64 * 10), 64, 10))
kv  <- ag_tensor(matrix(rnorm(64 * 15), 64, 15))
out <- mha$forward(q, kv, kv)

# Causal (GPT-style)
out <- mha$forward(x, causal_mask = TRUE)
# }
