Filesystem Hosted Models
To use a model hosted on a filesystem, specify the path to the model file or folder in the `from` field:
```yaml
models:
  - from: file://models/llms/llama3.2-1b-instruct/
    name: llama3
    params:
      model_type: llama
```
Supported formats include GGUF, GGML, and SafeTensor for large language models (LLMs) and ONNX for traditional machine learning (ML) models.
Configuration
`from`
An absolute or relative path to the model file or folder:
```yaml
# Absolute path
from: file://absolute/path/models/llms/llama3.2-1b-instruct/

# Relative path
from: file:models/llms/llama3.2-1b-instruct/
```
`params` (optional)
| Param | Description |
| --- | --- |
| `model_type` | The architecture to load the model as. Supported values: `mistral`, `gemma`, `mixtral`, `llama`, `phi2`, `phi3`, `qwen2`, `gemma2`, `starcoder2`, `phi3.5moe`, `deepseekv2`, `deepseek` |
| `tools` | Which tools should be made available to the model. Set to `auto` to use all available tools. |
| `system_prompt` | An additional system prompt used for all chat completions to this model. |
| `chat_template` | Customizes the transformation of OpenAI chat messages into a character stream for the model. See Overriding the Chat Template. |
See Large Language Models for additional configuration options.
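For illustration, the sketch below combines these params for a local Llama model; the tool setting and system prompt text are placeholders, not required values:

```yaml
models:
  - name: llama3
    from: file://models/llms/llama3.2-1b-instruct/
    params:
      model_type: llama   # load the weights as a Llama-architecture model
      tools: auto         # expose all available tools to the model
      system_prompt: |    # placeholder prompt; replace with your own instructions
        You are a concise, helpful assistant.
```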
`files` (optional)
The `files` field specifies additional files required by the model, such as the tokenizer and configuration files.
```yaml
- name: local-model
  from: file://models/llms/llama3.2-1b-instruct/model.safetensors
  files:
    - path: models/llms/llama3.2-1b-instruct/tokenizer.json
    - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
    - path: models/llms/llama3.2-1b-instruct/config.json
```
Examples
Loading a GGML Model
```yaml
models:
  - from: file://absolute/path/to/my/model.ggml
    name: local_ggml_model
    files:
      - path: models/llms/ggml/tokenizer.json
      - path: models/llms/ggml/tokenizer_config.json
      - path: models/llms/ggml/config.json
```
Loading a SafeTensor Model
```yaml
models:
  - name: safety
    from: file:models/llms/llama3.2-1b-instruct/model.safetensors
    files:
      - path: models/llms/llama3.2-1b-instruct/tokenizer.json
      - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
      - path: models/llms/llama3.2-1b-instruct/config.json
```
Loading an LLM from a Directory
```yaml
models:
  - name: llama3
    from: file:models/llms/llama3.2-1b-instruct/
```
Note: The folder provided should contain all the expected files (see examples above).
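For example, based on the files used throughout this page, a model directory would typically look like the following (a sketch; exact filenames depend on how the model was exported):

```
models/llms/llama3.2-1b-instruct/
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
└── config.json
```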
Loading an ONNX Model
```yaml
models:
  - from: file://absolute/path/to/my/model.onnx
    name: local_fs_model
```
Loading a GGUF Model
```yaml
models:
  - from: file://absolute/path/to/my/model.gguf
    name: local_gguf_model
```
Overriding the Chat Template
Chat templates convert OpenAI-compatible chat messages (see format) and other components of a request into a stream of characters for the language model. Templates follow Jinja templating syntax.
Further details on chat templates can be found here.
```yaml
models:
  - name: local_model
    from: file:path/to/my/model.gguf
    params:
      chat_template: |
        {% set loop_messages = messages %}
        {% for message in loop_messages %}
          {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' %}
          {{ content }}
        {% endfor %}
        {% if add_generation_prompt %}
          {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        {% endif %}
```
Templating Variables
- `messages`: List of chat messages, in the OpenAI format.
- `add_generation_prompt`: Boolean flag indicating whether to add a generation prompt.
- `tools`: List of callable tools, in the OpenAI format.
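To illustrate how these variables can be used together, here is a generic, hedged sketch (not the template for any particular model; a real template must match the model's expected special tokens and tool-call format):

```yaml
models:
  - name: local_model
    from: file:path/to/my/model.gguf
    params:
      chat_template: |
        {% if tools %}
          {{ 'Available tools:\n' }}
          {% for tool in tools %}
            {# tools follow the OpenAI format: type, function.name, function.description, ... #}
            {{ tool['function']['name'] + ': ' + tool['function']['description'] + '\n' }}
          {% endfor %}
        {% endif %}
        {% for message in messages %}
          {{ message['role'] + ': ' + message['content'] | trim + '\n' }}
        {% endfor %}
        {% if add_generation_prompt %}
          {{ 'assistant: ' }}
        {% endif %}
```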
- The throughput, concurrency, and latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports Apple Metal and CUDA for accelerated inference.