strands.models.llamacpp
llama.cpp model provider.
Provides integration with llama.cpp servers running in OpenAI-compatible mode, with support for advanced llama.cpp-specific features.
- Docs: https://github.com/ggml-org/llama.cpp
- Server docs: https://github.com/ggml-org/llama.cpp/tree/master/tools/server
- OpenAI API compatibility: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints
LlamaCppModel
class LlamaCppModel(Model)
Defined in: src/strands/models/llamacpp.py:41
llama.cpp model provider implementation.
Connects to a llama.cpp server running in OpenAI-compatible mode with support for advanced llama.cpp-specific features like grammar constraints, Mirostat sampling, native JSON schema validation, and native multimodal support for audio and image content.
The llama.cpp server must be started with the OpenAI-compatible API enabled:
llama-server -m model.gguf --host 0.0.0.0 --port 8080
Example:
Basic usage:
model = LlamaCppModel(base_url="http://localhost:8080")
model.update_config(params={"temperature": 0.7, "top_k": 40})
Grammar constraints via params:
model.update_config(params={
    "grammar": '''
        root ::= answer
        answer ::= "yes" | "no"
    '''
})
Advanced sampling:
model.update_config(params={
    "mirostat": 2,
    "mirostat_lr": 0.1,
    "tfs_z": 0.95,
    "repeat_penalty": 1.1
})
Multimodal usage (requires a multimodal model such as Qwen2.5-Omni):
Audio analysis:
audio_content = [{
    "audio": {"source": {"bytes": audio_bytes}, "format": "wav"},
    "text": "What do you hear in this audio?"
}]
response = agent(audio_content)
Image analysis:
image_content = [{
    "image": {"source": {"bytes": image_bytes}, "format": "png"},
    "text": "Describe this image"
}]
response = agent(image_content)
LlamaCppConfig
class LlamaCppConfig(TypedDict)
Defined in: src/strands/models/llamacpp.py:89
Configuration options for llama.cpp models.
Attributes:
- model_id - Model identifier for the loaded model in the llama.cpp server. Default is "default", as llama.cpp typically loads a single model.
- params - Model parameters supporting both OpenAI and llama.cpp-specific options.

OpenAI-compatible parameters:
- max_tokens: Maximum number of tokens to generate
- temperature: Sampling temperature (0.0 to 2.0)
- top_p: Nucleus sampling parameter (0.0 to 1.0)
- frequency_penalty: Frequency penalty (-2.0 to 2.0)
- presence_penalty: Presence penalty (-2.0 to 2.0)
- stop: List of stop sequences
- seed: Random seed for reproducibility
- n: Number of completions to generate
- logprobs: Include log probabilities in output
- top_logprobs: Number of top log probabilities to include
llama.cpp-specific parameters:
- repeat_penalty: Penalize repeat tokens (1.0 = no penalty)
- top_k: Top-k sampling (0 = disabled)
- min_p: Min-p sampling threshold (0.0 to 1.0)
- typical_p: Typical-p sampling (0.0 to 1.0)
- tfs_z: Tail-free sampling parameter (0.0 to 1.0)
- top_a: Top-a sampling parameter
- mirostat: Mirostat sampling mode (0, 1, or 2)
- mirostat_lr: Mirostat learning rate
- mirostat_ent: Mirostat target entropy
- grammar: GBNF grammar string for constrained generation
- json_schema: JSON schema for structured output
- penalty_last_n: Number of tokens to consider for penalties
- n_probs: Number of probabilities to return per token
- min_keep: Minimum tokens to keep in sampling
- ignore_eos: Ignore end-of-sequence token
- logit_bias: Token ID to bias mapping
- cache_prompt: Cache the prompt for faster generation
- slot_id: Slot ID for parallel inference
- samplers: Custom sampler order
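Since params accepts both families of options in a single dict, a mixed configuration can be sketched as follows (the specific values are illustrative, not recommendations):

```python
# A params dict mixing OpenAI-compatible and llama.cpp-specific keys;
# both are passed through to the llama.cpp server together.
params = {
    # OpenAI-compatible
    "max_tokens": 512,
    "temperature": 0.7,
    "stop": ["\n\n"],
    # llama.cpp-specific
    "top_k": 40,
    "min_p": 0.05,
    "repeat_penalty": 1.1,
    "cache_prompt": True,
}
```

Such a dict would then be passed as LlamaCppModel(base_url=..., params=params) or applied later via update_config(params=params).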
__init__
def __init__(base_url: str = "http://localhost:8080", timeout: float | tuple[float, float] | None = None, **model_config: Unpack[LlamaCppConfig]) -> None
Defined in: src/strands/models/llamacpp.py:134
Initialize llama.cpp provider instance.
Arguments:
- base_url - Base URL for the llama.cpp server. Default is "http://localhost:8080" for a local server.
- timeout - Request timeout in seconds. Can be a float or a tuple of (connect, read) timeouts.
- **model_config - Configuration options for the llama.cpp model.
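The tuple form of timeout separates connection setup from generation time, which is useful when the server is reachable quickly but generation is slow. A minimal sketch of the convention (values illustrative):

```python
# (connect, read): fail fast if the server is unreachable,
# but allow long reads while the model generates tokens.
timeout = (5.0, 120.0)
connect_timeout, read_timeout = timeout
```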
update_config
@override
def update_config(**model_config: Unpack[LlamaCppConfig]) -> None
Defined in: src/strands/models/llamacpp.py:177
Update the llama.cpp model configuration with provided arguments.
Arguments:
- **model_config - Configuration overrides.
get_config
@override
def get_config() -> LlamaCppConfig
Defined in: src/strands/models/llamacpp.py:187
Get the llama.cpp model configuration.
Returns:
The llama.cpp model configuration.
stream
@override
async def stream(messages: Messages, tool_specs: list[ToolSpec] | None = None, system_prompt: str | None = None, *, tool_choice: ToolChoice | None = None, **kwargs: Any) -> AsyncGenerator[StreamEvent, None]
Defined in: src/strands/models/llamacpp.py:513
Stream conversation with the llama.cpp model.
Arguments:
- messages - List of message objects to be processed by the model.
- tool_specs - List of tool specifications to make available to the model.
- system_prompt - System prompt to provide context to the model.
- tool_choice - Selection strategy for tool invocation. Note: this parameter is accepted for interface consistency but is currently ignored by this model provider.
- **kwargs - Additional keyword arguments for future extensibility.
Yields:
Formatted message chunks from the model.
Raises:
- ContextWindowOverflowException - When the context window is exceeded.
- ModelThrottledException - When the llama.cpp server is overloaded.
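Consuming the generator follows the standard async-iteration pattern. The sketch below uses a stand-in generator in place of a live stream() call; the event shape shown (text deltas under "contentBlockDelta") is an assumption for illustration only:

```python
import asyncio

async def fake_stream():
    # Stand-in for model.stream(messages): yields streamed event dicts.
    for piece in ["Hello", ", ", "world"]:
        yield {"contentBlockDelta": {"delta": {"text": piece}}}

async def collect_text():
    # Accumulate text deltas into the final response string.
    parts = []
    async for event in fake_stream():
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            parts.append(delta["text"])
    return "".join(parts)

text = asyncio.run(collect_text())
```

With a real model, the same loop body would wrap the call in a try/except for the two exceptions documented above.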
structured_output
@override
async def structured_output(output_model: type[T], prompt: Messages, system_prompt: str | None = None, **kwargs: Any) -> AsyncGenerator[dict[str, T | Any], None]
Defined in: src/strands/models/llamacpp.py:709
Get structured output using llama.cpp’s native JSON schema support.
This implementation uses llama.cpp’s json_schema parameter to constrain the model output to valid JSON matching the provided schema.
Arguments:
- output_model - The Pydantic model defining the expected output structure.
- prompt - The prompt messages to use for generation.
- system_prompt - System prompt to provide context to the model.
- **kwargs - Additional keyword arguments for future extensibility.
Yields:
Model events with the last being the structured output.
Raises:
- json.JSONDecodeError - If the model output is not valid JSON.
- pydantic.ValidationError - If the output doesn't match the model schema.
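Since generation is constrained by a JSON schema derived from the Pydantic model, it can help to see what such a schema looks like. A hand-written equivalent for a hypothetical two-field model (field names are illustrative, not from the library):

```python
import json

# Hypothetical schema equivalent to a Pydantic model with fields
# `answer: str` and `confidence: float`.
json_schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}

# A response conforming to the schema parses cleanly;
# a malformed one would raise json.JSONDecodeError here.
raw_output = '{"answer": "yes", "confidence": 0.9}'
parsed = json.loads(raw_output)
```

In practice the schema is produced automatically from output_model; this sketch only shows the shape of the constraint that llama.cpp's json_schema parameter receives.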