strands.models.llamacpp
llama.cpp model provider.
Provides integration with llama.cpp servers running in OpenAI-compatible mode, with support for advanced llama.cpp-specific features.
- Docs: https://github.com/ggml-org/llama.cpp
- Server docs: https://github.com/ggml-org/llama.cpp/tree/master/tools/server
- OpenAI API compatibility: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints
LlamaCppModel
class LlamaCppModel(Model)
Defined in: src/strands/models/llamacpp.py:41
llama.cpp model provider implementation.
Connects to a llama.cpp server running in OpenAI-compatible mode with support for advanced llama.cpp-specific features like grammar constraints, Mirostat sampling, native JSON schema validation, and native multimodal support for audio and image content.
The llama.cpp server must be started with the OpenAI-compatible API enabled:
llama-server -m model.gguf --host 0.0.0.0 --port 8080
Example:
Basic usage:
model = LlamaCppModel(base_url="http://localhost:8080")
model.update_config(params={"temperature": 0.7, "top_k": 40})
Grammar constraints via params:
model.update_config(params={
    "grammar": '''
        root ::= answer
        answer ::= "yes" | "no"
    '''
})
Advanced sampling:
model.update_config(params={
    "mirostat": 2,
    "mirostat_lr": 0.1,
    "tfs_z": 0.95,
    "repeat_penalty": 1.1
})
Multimodal usage (requires a multimodal model such as Qwen2.5-Omni):
Audio analysis:
audio_content = [{
    "audio": {"source": {"bytes": audio_bytes}, "format": "wav"},
    "text": "What do you hear in this audio?"
}]
response = agent(audio_content)
Image analysis:
image_content = [{
    "image": {"source": {"bytes": image_bytes}, "format": "png"},
    "text": "Describe this image"
}]
response = agent(image_content)
LlamaCppConfig
class LlamaCppConfig(TypedDict)
Defined in: src/strands/models/llamacpp.py:89
Configuration options for llama.cpp models.
Attributes:
- model_id - Model identifier for the loaded model in the llama.cpp server. Default is "default", as llama.cpp typically loads a single model.
- params - Model parameters supporting both OpenAI and llama.cpp-specific options.

OpenAI-compatible parameters:
- max_tokens: Maximum number of tokens to generate
- temperature: Sampling temperature (0.0 to 2.0)
- top_p: Nucleus sampling parameter (0.0 to 1.0)
- frequency_penalty: Frequency penalty (-2.0 to 2.0)
- presence_penalty: Presence penalty (-2.0 to 2.0)
- stop: List of stop sequences
- seed: Random seed for reproducibility
- n: Number of completions to generate
- logprobs: Include log probabilities in output
- top_logprobs: Number of top log probabilities to include
llama.cpp-specific parameters:
- repeat_penalty: Penalize repeat tokens (1.0 = no penalty)
- top_k: Top-k sampling (0 = disabled)
- min_p: Min-p sampling threshold (0.0 to 1.0)
- typical_p: Typical-p sampling (0.0 to 1.0)
- tfs_z: Tail-free sampling parameter (0.0 to 1.0)
- top_a: Top-a sampling parameter
- mirostat: Mirostat sampling mode (0, 1, or 2)
- mirostat_lr: Mirostat learning rate
- mirostat_ent: Mirostat target entropy
- grammar: GBNF grammar string for constrained generation
- json_schema: JSON schema for structured output
- penalty_last_n: Number of tokens to consider for penalties
- n_probs: Number of probabilities to return per token
- min_keep: Minimum tokens to keep in sampling
- ignore_eos: Ignore end-of-sequence token
- logit_bias: Token ID to bias mapping
- cache_prompt: Cache the prompt for faster generation
- slot_id: Slot ID for parallel inference
- samplers: Custom sampler order
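Since params accepts both families of options in a single dict, a mixed configuration can be sketched as follows (the specific values are illustrative, not recommendations):

```python
# A params dict mixing OpenAI-compatible and llama.cpp-specific keys;
# both are passed through to the llama.cpp server together.
params = {
    # OpenAI-compatible
    "max_tokens": 512,
    "temperature": 0.7,
    "stop": ["\n\n"],
    # llama.cpp-specific
    "top_k": 40,
    "min_p": 0.05,
    "repeat_penalty": 1.1,
    "cache_prompt": True,
}
```

Such a dict would then be passed as LlamaCppModel(base_url=..., params=params) or applied later via update_config(params=params).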
__init__
def __init__(base_url: str = "http://localhost:8080", timeout: float | tuple[float, float] | None = None, **model_config: Unpack[LlamaCppConfig]) -> None
Defined in: src/strands/models/llamacpp.py:134
Initialize llama.cpp provider instance.
Arguments:
- base_url - Base URL for the llama.cpp server. Default is "http://localhost:8080" for a local server.
- timeout - Request timeout in seconds. Can be a float or a tuple of (connect, read) timeouts.
- **model_config - Configuration options for the llama.cpp model.
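The tuple form of timeout separates connection setup from generation time, which is useful when the server is reachable quickly but generation is slow. A minimal sketch of the convention (values illustrative):

```python
# (connect, read): fail fast if the server is unreachable,
# but allow long reads while the model generates tokens.
timeout = (5.0, 120.0)
connect_timeout, read_timeout = timeout
```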
update_config
@override
def update_config(**model_config: Unpack[LlamaCppConfig]) -> None
Defined in: src/strands/models/llamacpp.py:177
Update the llama.cpp model configuration with provided arguments.
Arguments:
- **model_config - Configuration overrides.
get_config
@override
def get_config() -> LlamaCppConfig
Defined in: src/strands/models/llamacpp.py:187
Get the llama.cpp model configuration.
Returns:
The llama.cpp model configuration.
stream
@override
async def stream(messages: Messages, tool_specs: list[ToolSpec] | None = None, system_prompt: str | None = None, *, tool_choice: ToolChoice | None = None, **kwargs: Any) -> AsyncGenerator[StreamEvent, None]
Defined in: src/strands/models/llamacpp.py:513
Stream conversation with the llama.cpp model.
Arguments:
- messages - List of message objects to be processed by the model.
- tool_specs - List of tool specifications to make available to the model.
- system_prompt - System prompt to provide context to the model.
- tool_choice - Selection strategy for tool invocation. Note: this parameter is accepted for interface consistency but is currently ignored by this model provider.
- **kwargs - Additional keyword arguments for future extensibility.
Yields:
Formatted message chunks from the model.
Raises:
- ContextWindowOverflowException - When the context window is exceeded.
- ModelThrottledException - When the llama.cpp server is overloaded.
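Consuming the generator follows the standard async-iteration pattern. The sketch below uses a stand-in generator in place of a live stream() call; the event shape shown (text deltas under "contentBlockDelta") is an assumption for illustration only:

```python
import asyncio

async def fake_stream():
    # Stand-in for model.stream(messages): yields streamed event dicts.
    for piece in ["Hello", ", ", "world"]:
        yield {"contentBlockDelta": {"delta": {"text": piece}}}

async def collect_text():
    # Accumulate text deltas into the final response string.
    parts = []
    async for event in fake_stream():
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            parts.append(delta["text"])
    return "".join(parts)

text = asyncio.run(collect_text())
```

With a real model, the same loop body would wrap the call in a try/except for the two exceptions documented above.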
structured_output
@override
async def structured_output(output_model: type[T], prompt: Messages, system_prompt: str | None = None, **kwargs: Any) -> AsyncGenerator[dict[str, T | Any], None]
Defined in: src/strands/models/llamacpp.py:709
Get structured output using llama.cpp’s native JSON schema support.
This implementation uses llama.cpp’s json_schema parameter to constrain the model output to valid JSON matching the provided schema.
Arguments:
- output_model - The Pydantic model defining the expected output structure.
- prompt - The prompt messages to use for generation.
- system_prompt - System prompt to provide context to the model.
- **kwargs - Additional keyword arguments for future extensibility.
Yields:
Model events with the last being the structured output.
Raises:
- json.JSONDecodeError - If the model output is not valid JSON.
- pydantic.ValidationError - If the output doesn't match the model schema.
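Since generation is constrained by a JSON schema derived from the Pydantic model, it can help to see what such a schema looks like. A hand-written equivalent for a hypothetical two-field model (field names are illustrative, not from the library):

```python
import json

# Hypothetical schema equivalent to a Pydantic model with fields
# `answer: str` and `confidence: float`.
json_schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}

# A response conforming to the schema parses cleanly;
# a malformed one would raise json.JSONDecodeError here.
raw_output = '{"answer": "yes", "confidence": 0.9}'
parsed = json.loads(raw_output)
```

In practice the schema is produced automatically from output_model; this sketch only shows the shape of the constraint that llama.cpp's json_schema parameter receives.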