ggmlR (version 0.6.1)

ggml_flash_attn_ext: Flash Attention (Graph)

Description

Creates a graph node for Flash Attention computation. This is a memory-efficient implementation of scaled dot-product attention.

Usage

ggml_flash_attn_ext(
  ctx,
  q,
  k,
  v,
  mask = NULL,
  scale,
  max_bias = 0,
  logit_softcap = 0
)

Value

Attention output tensor of shape [head_dim, n_head, n_tokens, batch]

Arguments

ctx

GGML context

q

Query tensor of shape [head_dim, n_head, n_tokens, batch]

k

Key tensor of shape [head_dim, n_head_kv, n_kv, batch]

v

Value tensor of shape [head_dim, n_head_kv, n_kv, batch]

mask

Optional attention mask tensor (NULL for no mask). For causal attention, use ggml_diag_mask_inf instead.

scale

Attention scale factor, typically 1/sqrt(head_dim)

max_bias

Maximum ALiBi bias (0.0 to disable ALiBi)

logit_softcap

Logit soft-capping value (0.0 to disable). Used by some models like Gemma 2.
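For intuition, here is a minimal base R sketch of the tanh-based soft-capping transform; that ggml applies exactly this form is an assumption of the sketch (it matches the transform described for Gemma 2), not something this page documents:

# Sketch only: assumes the tanh-based soft-cap described for Gemma 2
logit_softcap <- 50
logits <- c(-200, -10, 0, 10, 200)
# Smoothly bounds every logit to the open interval (-50, 50)
capped <- logit_softcap * tanh(logits / logit_softcap)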

Details

Flash Attention computes: softmax(Q * K^T * scale + mask) * V
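For reference, a naive single-head, single-batch-item version of the same computation in plain R (illustrative only; the Flash Attention node produces the same result without materializing the full n_tokens x n_kv score matrix):

# Naive reference implementation for one head and one batch item
naive_attention <- function(Q, K, V, scale) {
  # Q: [n_tokens x head_dim]; K, V: [n_kv x head_dim]
  logits <- (Q %*% t(K)) * scale               # [n_tokens x n_kv] scores
  logits <- logits - apply(logits, 1, max)     # subtract row max for stability
  probs <- exp(logits) / rowSums(exp(logits))  # row-wise softmax
  probs %*% V                                  # [n_tokens x head_dim] output
}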

Key features:

- Memory efficient: O(n) memory instead of O(n^2) for the attention matrix
- Supports grouped-query attention (GQA) when n_head_kv < n_head
- Supports multi-query attention (MQA) when n_head_kv = 1
- Optional ALiBi (Attention with Linear Biases) for position encoding
- Optional logit soft-capping for numerical stability (see the call sketch after this list)
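Given the defaults shown in Usage, enabling ALiBi and soft-capping only requires passing nonzero values. The values below are illustrative choices, not recommendations (tensor setup as in the Examples section):

# max_bias = 8 reproduces the head slopes from the original ALiBi paper;
# 50 is the attention soft-cap reportedly used by Gemma 2
out <- ggml_flash_attn_ext(ctx, q, k, v, mask = NULL, scale = scale,
                           max_bias = 8.0, logit_softcap = 50.0)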

Examples

# \donttest{
ctx <- ggml_init(64 * 1024 * 1024)  # 64 MiB context
head_dim <- 64
n_head <- 8
n_head_kv <- 2  # GQA with 4:1 ratio
seq_len <- 32
q <- ggml_new_tensor_4d(ctx, GGML_TYPE_F32, head_dim, n_head, seq_len, 1)
k <- ggml_new_tensor_4d(ctx, GGML_TYPE_F32, head_dim, n_head_kv, seq_len, 1)
v <- ggml_new_tensor_4d(ctx, GGML_TYPE_F32, head_dim, n_head_kv, seq_len, 1)
# Fill tensors with random values
ggml_set_f32(q, rnorm(head_dim * n_head * seq_len))
ggml_set_f32(k, rnorm(head_dim * n_head_kv * seq_len))
ggml_set_f32(v, rnorm(head_dim * n_head_kv * seq_len))
# Scale = 1/sqrt(head_dim)
scale <- 1.0 / sqrt(head_dim)
# Compute attention
out <- ggml_flash_attn_ext(ctx, q, k, v, NULL, scale, 0.0, 0.0)
graph <- ggml_build_forward_expand(ctx, out)  # build the compute graph
ggml_graph_compute(ctx, graph)                # execute it
ggml_free(ctx)
# }