Skip to content

strands.models.llamacpp

llama.cpp model provider.

Provides integration with llama.cpp servers running in OpenAI-compatible mode, with support for advanced llama.cpp-specific features.

class LlamaCppModel(Model)

Defined in: src/strands/models/llamacpp.py:41

llama.cpp model provider implementation.

Connects to a llama.cpp server running in OpenAI-compatible mode with support for advanced llama.cpp-specific features like grammar constraints, Mirostat sampling, native JSON schema validation, and native multimodal support for audio and image content.

The llama.cpp server must be started with the OpenAI-compatible API enabled: llama-server -m model.gguf —host 0.0.0.0 —port 8080

Example:

Basic usage:

model = LlamaCppModel(base_url=“http://localhost:8080”) model.update_config(params={“temperature”: 0.7, “top_k”: 40})

Grammar constraints via params:

model.update_config(params={ … “grammar”: ''' … root ::= answer … answer ::= “yes” | “no” … ''' … })

Advanced sampling:

model.update_config(params={ … “mirostat”: 2, … “mirostat_lr”: 0.1, … “tfs_z”: 0.95, … “repeat_penalty”: 1.1 … })

Multimodal usage (requires multimodal model like Qwen2.5-Omni):

audio_content = [{ … “audio”: {“source”: {“bytes”: audio_bytes}, “format”: “wav”}, … “text”: “What do you hear in this audio?” … }] response = agent(audio_content)

image_content = [{ … “image”: {“source”: {“bytes”: image_bytes}, “format”: “png”}, … “text”: “Describe this image” … }] response = agent(image_content)

class LlamaCppConfig(TypedDict)

Defined in: src/strands/models/llamacpp.py:89

Configuration options for llama.cpp models.

Attributes:

  • model_id - Model identifier for the loaded model in llama.cpp server. Default is “default” as llama.cpp typically loads a single model.

  • params - Model parameters supporting both OpenAI and llama.cpp-specific options.

    OpenAI-compatible parameters:

    • max_tokens: Maximum number of tokens to generate
    • temperature: Sampling temperature (0.0 to 2.0)
    • top_p: Nucleus sampling parameter (0.0 to 1.0)
    • frequency_penalty: Frequency penalty (-2.0 to 2.0)
    • presence_penalty: Presence penalty (-2.0 to 2.0)
    • stop: List of stop sequences
    • seed: Random seed for reproducibility
    • n: Number of completions to generate
    • logprobs: Include log probabilities in output
    • top_logprobs: Number of top log probabilities to include

    llama.cpp-specific parameters:

    • repeat_penalty: Penalize repeat tokens (1.0 = no penalty)
    • top_k: Top-k sampling (0 = disabled)
    • min_p: Min-p sampling threshold (0.0 to 1.0)
    • typical_p: Typical-p sampling (0.0 to 1.0)
    • tfs_z: Tail-free sampling parameter (0.0 to 1.0)
    • top_a: Top-a sampling parameter
    • mirostat: Mirostat sampling mode (0, 1, or 2)
    • mirostat_lr: Mirostat learning rate
    • mirostat_ent: Mirostat target entropy
    • grammar: GBNF grammar string for constrained generation
    • json_schema: JSON schema for structured output
    • penalty_last_n: Number of tokens to consider for penalties
    • n_probs: Number of probabilities to return per token
    • min_keep: Minimum tokens to keep in sampling
    • ignore_eos: Ignore end-of-sequence token
    • logit_bias: Token ID to bias mapping
    • cache_prompt: Cache the prompt for faster generation
    • slot_id: Slot ID for parallel inference
    • samplers: Custom sampler order
def __init__(base_url: str = "http://localhost:8080",
timeout: float | tuple[float, float] | None = None,
**model_config: Unpack[LlamaCppConfig]) -> None

Defined in: src/strands/models/llamacpp.py:134

Initialize llama.cpp provider instance.

Arguments:

  • base_url - Base URL for the llama.cpp server. Default is “http://localhost:8080” for local server.
  • timeout - Request timeout in seconds. Can be float or tuple of (connect, read) timeouts.
  • **model_config - Configuration options for the llama.cpp model.
@override
def update_config(**model_config: Unpack[LlamaCppConfig]) -> None

Defined in: src/strands/models/llamacpp.py:177

Update the llama.cpp model configuration with provided arguments.

Arguments:

  • **model_config - Configuration overrides.
@override
def get_config() -> LlamaCppConfig

Defined in: src/strands/models/llamacpp.py:187

Get the llama.cpp model configuration.

Returns:

The llama.cpp model configuration.

@override
async def stream(messages: Messages,
tool_specs: list[ToolSpec] | None = None,
system_prompt: str | None = None,
*,
tool_choice: ToolChoice | None = None,
**kwargs: Any) -> AsyncGenerator[StreamEvent, None]

Defined in: src/strands/models/llamacpp.py:513

Stream conversation with the llama.cpp model.

Arguments:

  • messages - List of message objects to be processed by the model.
  • tool_specs - List of tool specifications to make available to the model.
  • system_prompt - System prompt to provide context to the model.
  • tool_choice - Selection strategy for tool invocation. Note: This parameter is accepted for interface consistency but is currently ignored for this model provider.
  • **kwargs - Additional keyword arguments for future extensibility.

Yields:

Formatted message chunks from the model.

Raises:

  • ContextWindowOverflowException - When the context window is exceeded.
  • ModelThrottledException - When the llama.cpp server is overloaded.
@override
async def structured_output(
output_model: type[T],
prompt: Messages,
system_prompt: str | None = None,
**kwargs: Any) -> AsyncGenerator[dict[str, T | Any], None]

Defined in: src/strands/models/llamacpp.py:709

Get structured output using llama.cpp’s native JSON schema support.

This implementation uses llama.cpp’s json_schema parameter to constrain the model output to valid JSON matching the provided schema.

Arguments:

  • output_model - The Pydantic model defining the expected output structure.
  • prompt - The prompt messages to use for generation.
  • system_prompt - System prompt to provide context to the model.
  • **kwargs - Additional keyword arguments for future extensibility.

Yields:

Model events with the last being the structured output.

Raises:

  • json.JSONDecodeError - If the model output is not valid JSON.
  • pydantic.ValidationError - If the output doesn’t match the model schema.