
Filesystem Hosted Models

To use a model hosted on a filesystem, specify the path to the model file or folder in the from field:

models:
  - from: file://models/llms/llama3.2-1b-instruct/
    name: llama3
    params:
      model_type: llama

Supported formats include GGUF, GGML, and SafeTensor for large language models (LLMs) and ONNX for traditional machine learning (ML) models.

Configuration

from

An absolute or relative path to the model file or folder:

from: file://absolute/path/models/llms/llama3.2-1b-instruct/ # absolute path
from: file:models/llms/llama3.2-1b-instruct/ # relative path

params (optional)

| Param | Description |
| --- | --- |
| model_type | The architecture to load the model as. Supported values: mistral, gemma, mixtral, llama, phi2, phi3, qwen2, gemma2, starcoder2, phi3.5moe, deepseekv2, deepseek |
| tools | Which tools should be made available to the model. Set to auto to use all available tools. |
| system_prompt | An additional system prompt used for all chat completions to this model. |
| chat_template | Customizes the transformation of OpenAI chat messages into a character stream for the model. See Overriding the Chat Template. |

See Large Language Models for additional configuration options.
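
For illustration, several of these params can be combined on a single model entry. The following sketch reuses the llama3 example from above; the system prompt text is a placeholder assumption:

models:
  - name: llama3
    from: file://models/llms/llama3.2-1b-instruct/
    params:
      model_type: llama      # architecture to load the model as
      tools: auto            # expose all available tools to the model
      system_prompt: You are a helpful assistant.  # placeholder prompt (assumption)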

files (optional)

The files field specifies additional files required by the model, such as the tokenizer and model configuration files.

- name: local-model
  from: file://models/llms/llama3.2-1b-instruct/model.safetensors
  files:
    - path: models/llms/llama3.2-1b-instruct/tokenizer.json
    - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
    - path: models/llms/llama3.2-1b-instruct/config.json

Examples

Loading a GGML Model

models:
  - from: file://absolute/path/to/my/model.ggml
    name: local_ggml_model
    files:
      - path: models/llms/ggml/tokenizer.json
      - path: models/llms/ggml/tokenizer_config.json
      - path: models/llms/ggml/config.json

Loading a SafeTensor Model

models:
  - name: safety
    from: file:models/llms/llama3.2-1b-instruct/model.safetensors
    files:
      - path: models/llms/llama3.2-1b-instruct/tokenizer.json
      - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
      - path: models/llms/llama3.2-1b-instruct/config.json

Loading an LLM from a Directory

models:
  - name: llama3
    from: file:models/llms/llama3.2-1b-instruct/

Note: The folder provided should contain all the expected files (see examples above).
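
For reference, a model folder matching the SafeTensor example above would typically contain something like the following layout; the exact file names depend on the model:

models/llms/llama3.2-1b-instruct/
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
└── config.json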

Loading an ONNX Model

models:
  - from: file://absolute/path/to/my/model.onnx
    name: local_fs_model

Loading a GGUF Model

models:
  - from: file://absolute/path/to/my/model.gguf
    name: local_gguf_model

Overriding the Chat Template

Chat templates convert OpenAI-compatible chat messages (see format) and other components of a request into a stream of characters for the language model. Templates follow Jinja templating syntax.

Further details on chat templates can be found here.

models:
  - name: local_model
    from: file:path/to/my/model.gguf
    params:
      chat_template: |
        {% set loop_messages = messages %}
        {% for message in loop_messages %}
        {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' %}
        {{ content }}
        {% endfor %}
        {% if add_generation_prompt %}
        {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        {% endif %}

Templating Variables

  • messages: List of chat messages, in the OpenAI format.
  • add_generation_prompt: Boolean flag whether to add a generation prompt.
  • tools: List of callable tools, in the OpenAI format.
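
As a sketch of how these variables fit together: given a single user message in the OpenAI format (role user, content "Hello!") and add_generation_prompt set to true, the example template above renders roughly the following character stream:

<|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>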

Limitations

  • The throughput, concurrency, and latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports Apple Metal and CUDA for accelerated inference.