# Datasets
A Spicepod can contain one or more datasets, referenced by relative path or defined inline.
Inline example:

**spicepod.yaml**

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb / sqlite / postgres
      refresh_check_interval: 1h
      refresh_mode: full # / append
```
**spicepod.yaml**

```yaml
datasets:
  - from: databricks:spiceai.datasets.specific_table
    name: uniswap_eth_usd
    params:
      environment: prod
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_check_interval: 1h
      refresh_mode: full # / append
```
Relative path example:

**spicepod.yaml**

```yaml
datasets:
  - ref: datasets/taxi_trips
```

**datasets/taxi_trips/dataset.yaml**

```yaml
from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
type: overwrite
acceleration:
  enabled: true
  refresh: 1h
```
## from

The `from` field is a string that represents the Uniform Resource Identifier (URI) for the dataset. This URI is composed of a prefix indicating the Data Connector to use to connect to the dataset, a delimiter, and the path to the dataset within the source.

The syntax for the `from` field is as follows:

```yaml
from: <data_connector>:<path>
# OR
from: <data_connector>/<path>
# OR
from: <data_connector>://<path>
```
Where:

- `<data_connector>`: The Data Connector to use to connect to the dataset. Currently supported data connectors:
  - `spiceai`
  - `dremio`
  - `spark`
  - `databricks`
  - `s3`
  - `postgres`
  - `mysql`
  - `flightsql`
  - `snowflake`
  - `ftp`, `sftp`
  - `http`, `https`
  - `clickhouse`
  - `graphql`

  If the Data Connector is not explicitly specified, it defaults to `spiceai`.
- `<delimiter>`: The delimiter between the Data Connector and the path. Currently supported delimiters are `:`, `/`, and `://`. Some connectors place additional restrictions on the allowed delimiters to better conform to the expected syntax of the underlying data source, e.g. `s3://` is the only supported delimiter for the `s3` connector.
- `<path>`: The path to the dataset within the source.
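For illustration, the three delimiter forms side by side. The first two `from` values are taken from the examples above; the `s3` bucket and path are placeholders:

```yaml
datasets:
  # '/' delimiter (the quickstart example above)
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
  # ':' delimiter (the Databricks example above)
  - from: databricks:spiceai.datasets.specific_table
    name: specific_table
  # '://' delimiter; the only delimiter the s3 connector supports
  - from: s3://my_bucket/my_dataset/
    name: my_dataset
```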
## ref

An alternative to adding the dataset definition inline in the `spicepod.yaml` file. `ref` can be used to point to a directory with a dataset defined in a `dataset.yaml` file. For example, a dataset configured in a `dataset.yaml` file in the `datasets/sample` directory can be referenced with the following:
**dataset.yaml**

```yaml
from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
type: overwrite
acceleration:
  enabled: true
  refresh: 1h
```
**ref used in spicepod.yaml**

```yaml
version: v1
kind: Spicepod
name: duckdb
datasets:
  - ref: datasets/sample
```
## name

The name of the dataset. Used to reference the dataset in the pod manifest, as well as in external data sources.
## description

The description of the dataset. Used as part of the Semantic Data Model.
## time_column

Optional. The name of the column that represents the temporal (time) ordering of the dataset.
Required to enable a retention policy on the dataset.
## time_format

Optional. The format of the `time_column`. The following values are supported:

- `timestamp` - Default. Timestamp without a timezone. E.g. `2016-06-22 19:10:25` with data type `timestamp`.
- `timestamptz` - Timestamp with a timezone. E.g. `2016-06-22 19:10:25-07` with data type `timestamptz`.
- `unix_seconds` - Unix timestamp in seconds. E.g. `1718756687`.
- `unix_millis` - Unix timestamp in milliseconds. E.g. `1718756687000`.
- `ISO8601` - ISO 8601 format.
- `date` - Date in `YYYY-MM-DD` format. E.g. `2024-01-01`.
Spice emits a warning if the `time_column` from the data source is incompatible with the `time_format` config.

- String-based columns are assumed to be in ISO8601 format.
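As a sketch, a dataset whose rows carry Unix-seconds timestamps might combine `time_column` and `time_format` like this (the source path and the column name `created_at` are illustrative):

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source
    name: events
    time_column: created_at # column giving the temporal ordering
    time_format: unix_seconds # values like 1718756687
```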
## time_partition_column

Optional. Specify the column that represents the physical partitioning of the dataset when using append-based acceleration. When the defined `time_column` is a fine-grained timestamp and the dataset is physically partitioned by a coarser granularity (for example, by date), setting `time_partition_column` to the partition column (e.g. `date_col`) improves partition pruning, excludes irrelevant partitions during refreshes, and optimizes scan efficiency.
## time_partition_format

Optional. Define the format of the `time_partition_column`. For instance, if the physical partitions follow a date format (`YYYY-MM-DD`), set this value to `date`. The same format options as `time_format` are supported for `time_partition_column`.
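For example, a sketch pairing a fine-grained `time_column` with a coarser, date-partitioned physical layout (the source path and column names are illustrative):

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source, physically partitioned by day
    name: events
    time_column: created_at # fine-grained event timestamp
    time_format: timestamp
    time_partition_column: date_col # column matching the physical partitions
    time_partition_format: date # partitions follow YYYY-MM-DD
```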
## unsupported_type_action

Optional. Specifies the action to take when a data type that is not supported by the data connector is encountered. The following values are supported:

- `error` - Default. Return an error when an unsupported data type is encountered.
- `warn` - Log a warning and ignore the column containing the unsupported data type.
- `ignore` - Log nothing and ignore the column containing the unsupported data type.
- `string` - Attempt to convert the unsupported data type to a string. Currently only supports converting the PostgreSQL JSONB type.

Not all connectors support specifying an `unsupported_type_action`; when specified on a connector that does not support the option, the connector will fail to register.
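As a sketch, converting PostgreSQL JSONB columns to strings instead of erroring (the schema and table names are placeholders):

```yaml
datasets:
  - from: postgres:my_schema.my_table # hypothetical table with a JSONB column
    name: my_table
    unsupported_type_action: string # convert JSONB values to strings
```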
## ready_state

Supports one of two values:

- `on_registration`: Mark the dataset as ready immediately; queries on this table will fall back to the underlying source directly until the initial acceleration is complete.
- `on_load`: Mark the dataset as ready only after the initial acceleration. Queries against the dataset will return an error before the load has completed.
```yaml
datasets:
  - from: s3://my_bucket/my_dataset/
    name: my_dataset
    ready_state: on_registration # or on_load
    params: ...
    acceleration:
      enabled: true
```
## acceleration

Optional. Accelerate queries to the dataset by caching data locally.
### acceleration.enabled

Enable or disable acceleration, defaults to `true`.
### acceleration.engine

The acceleration engine to use, defaults to `arrow`. The following engines are supported:

- `arrow` - Accelerated in-memory backed by Apache Arrow DataTables.
- `duckdb` - Accelerated by an embedded DuckDB database.
- `postgres` - Accelerated by a Postgres database.
- `sqlite` - Accelerated by an embedded SQLite database.
### acceleration.mode

Optional. The mode of acceleration. The following values are supported:

- `memory` - Store acceleration data in-memory.
- `file` - Store acceleration data in a file. Only supported for the `duckdb` and `sqlite` acceleration engines.

`mode` is currently only supported for the `duckdb` engine.
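As a sketch, a file-backed DuckDB acceleration (the source path is a placeholder):

```yaml
datasets:
  - from: s3://my_bucket/my_dataset/ # hypothetical source
    name: my_dataset
    acceleration:
      enabled: true
      engine: duckdb # embedded DuckDB database
      mode: file # persist acceleration data to a file rather than memory
```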
### acceleration.refresh_mode

Optional. How to refresh the dataset. The following values are supported:

- `full` - Refresh the entire dataset.
- `append` - Append new data to the dataset. When `time_column` is specified, new records are fetched from the latest timestamp in the accelerated data at the `acceleration.refresh_check_interval`.
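As a sketch, an append-style refresh driven by a time column (the source path and column name are illustrative):

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source
    name: events
    time_column: created_at # used to find the latest accelerated timestamp
    acceleration:
      enabled: true
      refresh_mode: append # fetch only records newer than the latest timestamp
      refresh_check_interval: 10m # how often to check for new records
```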