Vector-Based Search
Spice provides vector-based search, which ranks results by semantic similarity to the query text rather than by exact keyword match. The runtime supports both:
- Local embedding models, e.g. sentence-transformers/all-MiniLM-L6-v2.
- Remote embedding providers, e.g. OpenAI.
Embedding models are defined in the spicepod.yaml file as top-level components.
embeddings:
  - from: openai
    name: remote_service
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
  - name: local_embedding_model
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Datasets can be augmented with embeddings on specific columns to enable similarity search over those columns.
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    acceleration:
      enabled: true
    columns:
      - name: body
        embeddings:
          - from: local_embedding_model # Embedding model used for this column
With embeddings defined on the body column, Spice can execute similarity searches on the dataset.
curl -XPOST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["spiceai.issues"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 2
  }'
For more details, see the API reference for /v1/search.
Spice also supports vector search on datasets with preexisting embeddings. See below for compatibility details.
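The same request can also be issued from a script. The following is a minimal Python sketch using only the standard library. The helper names `build_search_request` and `search` are hypothetical, the endpoint and field names are copied from the curl example above, and it assumes that `where`, `additional_columns`, and `limit` may be omitted from the body when unused.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust for your deployment.
SEARCH_URL = "http://localhost:8090/v1/search"


def build_search_request(text, datasets, where=None, additional_columns=None, limit=None):
    """Build the JSON body for /v1/search (hypothetical helper).

    Assumption: fields other than "datasets" and "text" can be left out.
    """
    body = {"datasets": list(datasets), "text": text}
    if where is not None:
        body["where"] = where
    if additional_columns:
        body["additional_columns"] = list(additional_columns)
    if limit is not None:
        body["limit"] = limit
    return body


def search(body, url=SEARCH_URL):
    """POST the body to a running Spice runtime and return the decoded JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    payload = build_search_request(
        "cutting edge AI",
        ["spiceai.issues"],
        where='author="jeadie"',
        additional_columns=["title", "state"],
        limit=2,
    )
    # Print the body only; calling search(payload) requires a running runtime.
    print(json.dumps(payload, indent=2))
```

Building the body as a dictionary and serializing it with `json.dumps` avoids the manual quote escaping needed in the shell version of the `where` clause.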
Document Retrieval
When performing searches on datasets with chunking enabled, Spice returns the most relevant chunk for each match. To retrieve the full content of a column, include the column that carries the embeddings (for instance, body) in the additional_columns list.
For example:
curl -XPOST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["spiceai.issues"],
    "text": "cutting edge AI",
    "where": "array_has(assignees, \"jeadie\")",
    "additional_columns": ["title", "state", "body"],
    "limit": 2
  }'
Response:
{
  "matches": [
    {
      "value": "implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Improve scalar UDF array_distance",
        "state": "Closed",
        "body": "## Overview\n- Previous PR https://github.com/spiceai/spiceai/pull/1601 implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])\narray_distance(FixedSizeList[Float32], List[Float64])\n```\n\n### Changes\n - Improve using Native arrow function, e.g. `arrow_cast`, [`sub_checked`](https://arrow.apache.org/rust/arrow/array/trait.ArrowNativeTypeOp.html#tymethod.sub_checked)\n - Support a greater range of array types and numeric types\n - Possibly create a sub operator and UDF, e.g.\n\t- `FixedSizeList[Float32] - FixedSizeList[Float32]`\n\t- `Norm(FixedSizeList[Float32])`"
      }
    },
    {
      "value": "est external tools being returned for toolusing models",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Automatic NSQL retries in /v1/nsql ",
        "state": "Open",
        "body": "To mimic our ability for LLMs to repeatedly retry tools based on errors, the `/v1/nsql`, which does not use this same paradigm, should retry internally.\n\nIf possible, improve the structured output to increase the likelihood of valid SQL in the response. Currently we just inforce JSON like this\n```json\n{\n \"sql\": \"SELECT ...\"\n}\n```"
      }
    }
  ],
  "duration_ms": 45
}
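To illustrate how the matched chunk and the full column content sit side by side in this response, here is a small Python sketch that flattens the `matches` array. The function name `summarize_matches` and the output keys are hypothetical; the input field names (`matches`, `value`, `dataset`, `metadata`) are the ones shown in the response above, and the sample values in the usage section are placeholders.

```python
def summarize_matches(response, full_text_column="body"):
    """Return one summary per match: dataset, matched chunk, and full text.

    "value" holds the most relevant chunk; the full column content is only
    present in "metadata" when the column was listed in additional_columns.
    """
    summaries = []
    for match in response.get("matches", []):
        metadata = match.get("metadata", {})
        summaries.append(
            {
                "dataset": match.get("dataset"),
                "chunk": match.get("value"),
                # None when the column was not requested.
                "full_text": metadata.get(full_text_column),
                # Remaining requested columns, e.g. title and state.
                "extra": {k: v for k, v in metadata.items() if k != full_text_column},
            }
        )
    return summaries


if __name__ == "__main__":
    # Placeholder response with the same shape as the example above.
    sample = {
        "matches": [
            {
                "value": "<matched chunk>",
                "dataset": "spiceai.issues",
                "metadata": {"title": "<title>", "state": "Open", "body": "<full body>"},
            }
        ],
        "duration_ms": 45,
    }
    for row in summarize_matches(sample):
        print(row["dataset"], row["extra"], row["chunk"])
```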
Pre-Existing Embeddings
Datasets that already include embeddings can use the same functionality (e.g., vector search) as datasets augmented with embeddings by Spice. To be compatible, the table columns must satisfy the following constraints:
- Underlying Column Presence:
  - The underlying column must exist in the table, and be of string Arrow data type.
- Embeddings Column Naming Convention:
  - For each underlying column, the corresponding embeddings column must be named as <column_name>_embedding. For example, a customer_reviews table with a review column must have a review_embedding column.
- Embeddings Column Data Type:
  - The embeddings column must have the following Arrow data type when loaded into Spice: FixedSizeList[Float32 or Float64, N], where N is the dimension (size) of the embedding vector. FixedSizeList is used for efficient storage and processing of fixed-size vectors.
  - If the column is chunked, use List[FixedSizeList[Float32 or Float64, N]].
- Offset Column for Chunked Data:
  - If the underlying column is chunked, there must be an additional offset column named <column_name>_offsets with the following Arrow data type: List[FixedSizeList[Int32, 2]], where each element is a pair of integers [start, end] representing the start and end indices of the chunk in the underlying text column. This offset column maps each chunk in the embeddings back to the corresponding segment in the underlying text column.
    - For instance, [[0, 100], [101, 200]] indicates two chunks covering indices 0–100 and 101–200, respectively.
Datasets that follow these guidelines are compatible with vector search and the other embedding functionality provided by Spice.
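The constraints above can be checked before loading data. The following Python sketch works on plain lists rather than Arrow arrays, so it is only an illustration of the rules, not how Spice validates them. The function names are hypothetical, and the bound `end <= len(text)` is an assumption, since the text does not state whether the end index is inclusive.

```python
def expected_column_names(column, chunked=False):
    """Column names required for an underlying column, per the naming rules."""
    names = [column, f"{column}_embedding"]
    if chunked:
        # Chunked columns also need the offsets column.
        names.append(f"{column}_offsets")
    return names


def check_chunked_row(text, embeddings, offsets, dim):
    """Return a list of problems found in one row of a chunked column.

    embeddings: one vector per chunk, each expected to have `dim` values.
    offsets: one [start, end] pair per chunk, indexing into `text`.
    """
    problems = []
    if len(embeddings) != len(offsets):
        problems.append(
            f"{len(embeddings)} chunk vectors but {len(offsets)} offset pairs"
        )
    for i, vector in enumerate(embeddings):
        if len(vector) != dim:
            problems.append(f"chunk {i}: vector has {len(vector)} values, expected {dim}")
    for i, pair in enumerate(offsets):
        if len(pair) != 2:
            problems.append(f"chunk {i}: offset entry must be a [start, end] pair")
            continue
        start, end = pair
        # Assumption: end may equal len(text); tighten if end is inclusive.
        if not (0 <= start <= end <= len(text)):
            problems.append(f"chunk {i}: offsets [{start}, {end}] fall outside the text")
    return problems


if __name__ == "__main__":
    print(expected_column_names("review", chunked=True))
    row_problems = check_chunked_row(
        "x" * 200, [[0.0] * 384, [0.0] * 384], [[0, 100], [101, 200]], dim=384
    )
    print(row_problems or "row is consistent")
```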
Example
A table sales with an address column and corresponding embedding column(s).
sql> describe sales;
+-------------------+-----------------------------------------+-------------+
| column_name       | data_type                               | is_nullable |
+-------------------+-----------------------------------------+-------------+
| order_number      | Int64                                   | YES         |
| quantity_ordered  | Int64                                   | YES         |
| price_each        | Float64                                 | YES         |
| order_line_number | Int64                                   | YES         |
| address           | Utf8                                    | YES         |
| address_embedding | FixedSizeList(                          | NO          |
|                   |   Field {                               |             |
|                   |     name: "item",                       |             |
|                   |     data_type: Float32,                 |             |
|                   |     nullable: false,                    |             |
|                   |     dict_id: 0,                         |             |
|                   |     dict_is_ordered: false,             |             |
|                   |     metadata: {}                        |             |
|                   |   },                                    |             |
|                   |   384                                   |             |
|                   | )                                       |             |
+-------------------+-----------------------------------------+-------------+
The same table with a chunked address column:
sql> describe sales;
+-------------------+-----------------------------------------+-------------+
| column_name       | data_type                               | is_nullable |
+-------------------+-----------------------------------------+-------------+
| order_number      | Int64                                   | YES         |
| quantity_ordered  | Int64                                   | YES         |
| price_each        | Float64                                 | YES         |
| order_line_number | Int64                                   | YES         |
| address           | Utf8                                    | YES         |
| address_embedding | List(Field {                            | NO          |
|                   |   name: "item",                         |             |
|                   |   data_type: FixedSizeList(             |             |
|                   |     Field {                             |             |
|                   |       name: "item",                     |             |
|                   |       data_type: Float32,               |             |
|                   |     },                                  |             |
|                   |     384                                 |             |
|                   |   ),                                    |             |
|                   | })                                      |             |
+-------------------+-----------------------------------------+-------------+
| address_offset    | List(Field {                            | NO          |
|                   |   name: "item",                         |             |
|                   |   data_type: FixedSizeList(             |             |
|                   |     Field {                             |             |
|                   |       name: "item",                     |             |
|                   |       data_type: Int32,                 |             |
|                   |     },                                  |             |
|                   |     2                                   |             |
|                   |   ),                                    |             |
|                   | })                                      |             |
+-------------------+-----------------------------------------+-------------+