edgemodelr (version 0.2.0)

edge_load_model: Load a local GGUF model for inference

Description

Loads a GGUF-format model file from disk and returns a context object used for local text generation.

Usage

edge_load_model(model_path, n_ctx = 2048L, n_gpu_layers = 0L,
  n_threads = NULL, flash_attn = TRUE)

Value

An external pointer to the loaded model context. Pass it to edge_completion() for text generation, and release it with edge_free_model() when finished.

Arguments

model_path

Path to a .gguf model file

n_ctx

Maximum context length (default: 2048)

n_gpu_layers

Number of layers to offload to GPU (default: 0, CPU-only)

n_threads

Number of CPU threads for inference (default: NULL = use all hardware threads). Set to a lower value to leave cores free for other tasks.

flash_attn

Enable flash attention for faster inference (default: TRUE). Reduces memory usage and improves speed. Set to FALSE for maximum compatibility.
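The arguments above can be combined to tune resource usage. A minimal sketch (the model path is hypothetical; assumes the package is installed) that follows the n_threads advice of leaving some cores free for other tasks:

```r
# Pick a thread count that leaves two cores free, per the n_threads advice.
total_cores <- parallel::detectCores()
n_threads <- max(1L, total_cores - 2L)

# Hypothetical path; replace with a real .gguf file.
model_path <- "~/models/model.Q4_K_M.gguf"
if (file.exists(model_path)) {
  ctx <- edge_load_model(
    model_path,
    n_ctx = 4096L,         # larger context window than the 2048 default
    n_gpu_layers = 0L,     # CPU-only; raise this to offload layers to a GPU build
    n_threads = n_threads,
    flash_attn = TRUE      # default; set FALSE for maximum compatibility
  )
  edge_free_model(ctx)
}
```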

Examples

if (FALSE) {
# Load a TinyLlama model (requires model file)
model_path <- "~/models/TinyLlama-1.1B-Chat.Q4_K_M.gguf"
if (file.exists(model_path)) {
  ctx <- edge_load_model(model_path, n_ctx = 2048)

  # Generate completion
  result <- edge_completion(ctx, "Explain R data.frame:", n_predict = 100)
  cat(result)

  # Load with explicit threading control
  ctx2 <- edge_load_model(model_path, n_threads = 4, flash_attn = TRUE)

  # Free both models when done
  edge_free_model(ctx)
  edge_free_model(ctx2)
}
}