Filesystem Hosted Models

To use a model hosted on a filesystem, specify the path to the model file or folder in the from field:

models:
  - from: file://models/llms/llama3.2-1b-instruct/
    name: llama3
    params:
      model_type: llama

Supported formats include GGUF, GGML, and SafeTensor for large language models (LLMs) and ONNX for traditional machine learning (ML) models.

Configuration

from

An absolute or relative path to the model file or folder:

from: file://absolute/path/models/llms/llama3.2-1b-instruct/  # absolute path
from: file:models/llms/llama3.2-1b-instruct/                   # relative path

params (optional)

| Param | Description |
| --- | --- |
| model_type | The architecture to load the model as. Supported values: mistral, gemma, mixtral, llama, phi2, phi3, qwen2, gemma2, starcoder2, phi3.5moe, deepseekv2, deepseek |
| tools | Which tools should be made available to the model. Set to auto to use all available tools. |
| system_prompt | An additional system prompt used for all chat completions to this model. |
| chat_template | Customizes the transformation of OpenAI chat messages into a character stream for the model. See Overriding the Chat Template. |

See Large Language Models for additional configuration options.
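
For illustration, a model definition combining several of these parameters might look like the following. The system prompt text and the tool selection are placeholders, not recommended values:

models:
  - name: llama3
    from: file:models/llms/llama3.2-1b-instruct/
    params:
      model_type: llama
      tools: auto
      system_prompt: |
        You are a helpful assistant. Keep answers concise.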

files (optional)

The files field specifies additional files required by the model, such as the tokenizer, configuration, and other supporting files.

- name: local-model
  from: file://models/llms/llama3.2-1b-instruct/model.safetensors
  files:
    - path: models/llms/llama3.2-1b-instruct/tokenizer.json
    - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
    - path: models/llms/llama3.2-1b-instruct/config.json

Examples

Loading a GGML Model

models:
  - from: file://absolute/path/to/my/model.ggml
    name: local_ggml_model
    files:
      - path: models/llms/ggml/tokenizer.json
      - path: models/llms/ggml/tokenizer_config.json
      - path: models/llms/ggml/config.json

Loading a SafeTensor Model

models:
  - name: safety
    from: file:models/llms/llama3.2-1b-instruct/model.safetensors
    files:
      - path: models/llms/llama3.2-1b-instruct/tokenizer.json
      - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
      - path: models/llms/llama3.2-1b-instruct/config.json

Loading an LLM from a Directory

models:
  - name: llama3
    from: file:models/llms/llama3.2-1b-instruct/

Note: The folder provided should contain all the expected files (see examples above).
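
For instance, a model directory like the one referenced above would typically be laid out as follows (file names are illustrative and depend on how the model was exported):

models/llms/llama3.2-1b-instruct/
  config.json
  model.safetensors
  tokenizer.json
  tokenizer_config.json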

Loading an ONNX Model

models:
  - from: file://absolute/path/to/my/model.onnx
    name: local_fs_model

Loading a GGUF Model

models:
  - from: file://absolute/path/to/my/model.gguf
    name: local_gguf_model

Overriding the Chat Template

Chat templates convert OpenAI-compatible chat messages (see format) and other components of a request into a stream of characters for the language model. Templates follow Jinja3 templating syntax.

Further details on chat templates can be found here.

models:
  - name: local_model
    from: file:path/to/my/model.gguf
    params:
      chat_template: |
        {% set loop_messages = messages %}
        {% for message in loop_messages %}
        {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' %}
        {{ content }}
        {% endfor %}
        {% if add_generation_prompt %}
        {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        {% endif %}

Templating Variables

  • messages: List of chat messages, in the OpenAI format.
  • add_generation_prompt: Boolean flag whether to add a generation prompt.
  • tools: List of callable tools, in the OpenAI format.
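
As a sketch of how these variables fit together, the following template (illustrative only; the exact tags and wording must match the prompt format your model expects) lists the available tools before replaying the conversation:

params:
  chat_template: |
    {% if tools %}
    You may call the following tools:
    {% for tool in tools %}
    - {{ tool['function']['name'] }}: {{ tool['function']['description'] }}
    {% endfor %}
    {% endif %}
    {% for message in messages %}
    {{ message['role'] }}: {{ message['content'] }}
    {% endfor %}
    {% if add_generation_prompt %}
    assistant:
    {% endif %}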

Limitations

  • The throughput, concurrency, and latency of a locally hosted model vary based on the underlying hardware and model size. Spice supports Apple Metal and CUDA for accelerated inference.