# Filesystem Hosted Models
To use a model hosted on a filesystem, specify the path to the model file or folder in the `from` field:
```yaml
models:
  - from: file://models/llms/llama3.2-1b-instruct/
    name: llama3
    params:
      model_type: llama
```
Supported formats include GGUF, GGML, and SafeTensor for large language models (LLMs) and ONNX for traditional machine learning (ML) models.
## Configuration

### `from`

An absolute or relative path to the model file or folder:

```yaml
from: file://absolute/path/models/llms/llama3.2-1b-instruct/
from: file:models/llms/llama3.2-1b-instruct/
```
### `params` (optional)

| Param | Description |
| --- | --- |
| `model_type` | The architecture to load the model as. Supported values: `mistral`, `gemma`, `mixtral`, `llama`, `phi2`, `phi3`, `qwen2`, `gemma2`, `starcoder2`, `phi3.5moe`, `deepseekv2`, `deepseek`. |
| `tools` | Which tools should be made available to the model. Set to `auto` to use all available tools. |
| `system_prompt` | An additional system prompt used for all chat completions to this model. |
| `chat_template` | Customizes the transformation of OpenAI chat messages into a character stream for the model. See [Overriding the Chat Template](#overriding-the-chat-template). |
See Large Language Models for additional configuration options.
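For example, a single model definition might combine several of these parameters; the model path and system prompt text below are illustrative:

```yaml
models:
  - name: llama3
    from: file:models/llms/llama3.2-1b-instruct/
    params:
      model_type: llama
      tools: auto
      system_prompt: 'You are a helpful, concise assistant.'
```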
### `files` (optional)

The `files` field specifies additional files required by the model, such as the tokenizer, configuration, and other supporting files.
```yaml
- name: local-model
  from: file://models/llms/llama3.2-1b-instruct/model.safetensors
  files:
    - path: models/llms/llama3.2-1b-instruct/tokenizer.json
    - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
    - path: models/llms/llama3.2-1b-instruct/config.json
```
## Examples

### Loading a GGML Model

```yaml
models:
  - from: file://absolute/path/to/my/model.ggml
    name: local_ggml_model
    files:
      - path: models/llms/ggml/tokenizer.json
      - path: models/llms/ggml/tokenizer_config.json
      - path: models/llms/ggml/config.json
```
### Loading a SafeTensor Model

```yaml
models:
  - name: safety
    from: file:models/llms/llama3.2-1b-instruct/model.safetensors
    files:
      - path: models/llms/llama3.2-1b-instruct/tokenizer.json
      - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
      - path: models/llms/llama3.2-1b-instruct/config.json
```
### Loading an LLM from a Directory

```yaml
models:
  - name: llama3
    from: file:models/llms/llama3.2-1b-instruct/
```
Note: The folder provided should contain all the expected files (see examples above).
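For reference, based on the files used in the examples above, a directory-loaded model might be laid out as follows (the exact file list depends on the model):

```
models/llms/llama3.2-1b-instruct/
├── config.json
├── model.safetensors
├── tokenizer.json
└── tokenizer_config.json
```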
### Loading an ONNX Model

```yaml
models:
  - from: file://absolute/path/to/my/model.onnx
    name: local_fs_model
```
### Loading a GGUF Model

```yaml
models:
  - from: file://absolute/path/to/my/model.gguf
    name: local_gguf_model
```
## Overriding the Chat Template

Chat templates convert OpenAI-compatible chat messages (see format) and other components of a request into a stream of characters for the language model. Templates use Jinja templating syntax.
Further details on chat templates can be found here.
```yaml
models:
  - name: local_model
    from: file:path/to/my/model.gguf
    params:
      chat_template: |
        {% set loop_messages = messages %}
        {% for message in loop_messages %}
        {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' %}
        {{ content }}
        {% endfor %}
        {% if add_generation_prompt %}
        {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        {% endif %}
```
### Templating Variables

- `messages`: List of chat messages, in the OpenAI format.
- `add_generation_prompt`: Boolean flag indicating whether to add a generation prompt.
- `tools`: List of callable tools, in the OpenAI format.
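The chat template example above uses only `messages` and `add_generation_prompt`. As an illustrative sketch (not taken from the example above), a template fragment could also reference the `tools` variable; the field access below assumes tools arrive in the OpenAI function-calling shape (`tool['function']['name']`, `tool['function']['description']`):

```yaml
params:
  chat_template: |
    {# Illustrative only: list each available tool before the conversation #}
    {% if tools %}
    {{ 'Available tools:\n' }}
    {% for tool in tools %}
    {{ '- ' + tool['function']['name'] + ': ' + tool['function']['description'] + '\n' }}
    {% endfor %}
    {% endif %}
```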
- The throughput, concurrency, and latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports Apple Metal and CUDA for accelerated inference.
