Data Ingestion
Data can be ingested into the Spice runtime using the following methods:
- Acceleration Refresh Modes – Pull data from a source connector into a local accelerator using one of the standard refresh modes (full, append, changes, snapshot, caching). This is the most common ingestion path for keeping a local accelerator in sync with an upstream system.
- SQL Statements – Write data directly to write-capable connectors using standard SQL INSERT (and, where supported, UPDATE/DELETE) syntax.
- OpenTelemetry (OTEL) Ingestion – Stream OTEL metrics for real-time processing and acceleration.
Data ingestion is useful for scenarios such as keeping a local accelerator continuously in sync with an upstream database, collecting metrics from edge devices, writing application events for later analysis, or populating datasets from external sources.
Ingestion via Acceleration Refresh Modes
When a dataset is configured with acceleration.enabled: true, Spice ingests rows from the source connector into a local engine (Arrow, DuckDB, SQLite, PostgreSQL, or Cayenne). The refresh_mode controls how that ingestion happens.
| Refresh Mode | What it ingests | Typical source |
|---|---|---|
| full | Replaces the accelerator's contents with a fresh read of the source on every refresh. | Slowly-changing reference tables; small lookup datasets. |
| append | Inserts only rows newer than the highest seen time_column value on each refresh. | Time-series, event/log data, append-only tables. |
| changes | Streams row-level inserts, updates, and deletes from a source CDC feed (PostgreSQL logical replication, DynamoDB Streams, MongoDB Change Streams, Debezium, Kafka, etc.). | Operational databases where you need a near real-time mirror of the source. |
| snapshot | Loads exclusively from an external snapshot store; no source reads. | Read-only replicas bootstrapped from a centralized snapshot, e.g. for fan-out reader fleets. |
| caching | Caches HTTP/HTTPS responses per request (read-through), each with a TTL. | API search results or other request-keyed content fetched lazily. |
For cross-cutting refresh behavior — refresh intervals, on-demand refresh, retries, retention, and zero-results handling — see Data Refresh.
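The difference between the full and append rows in the table can be illustrated with a small, self-contained Python sketch. It is a conceptual model only, not Spice's implementation: the function names refresh_full and refresh_append, the row shape, and the ts column are hypothetical and chosen purely for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def refresh_full(source_rows: List[Row]) -> List[Row]:
    """Full refresh: discard the old contents and take a fresh copy of the source."""
    return list(source_rows)


def refresh_append(accelerated: List[Row], source_rows: List[Row], time_column: str) -> List[Row]:
    """Append refresh: add only rows newer than the highest time_column value already held."""
    if not accelerated:
        # Nothing loaded yet, so every source row counts as new.
        return list(source_rows)
    high_watermark = max(row[time_column] for row in accelerated)
    new_rows = [row for row in source_rows if row[time_column] > high_watermark]
    return accelerated + new_rows


# Initial load: both modes end up with the same two rows.
source = [{"id": 1, "ts": 10}, {"id": 2, "ts": 20}]
local = refresh_append([], source, "ts")

# Later, the source gains one newer row and an older row is edited in place.
source = [{"id": 1, "ts": 10, "note": "edited"}, {"id": 2, "ts": 20}, {"id": 3, "ts": 30}]
local_after_append = refresh_append(local, source, "ts")
local_after_full = refresh_full(source)
```

In this model the append refresh picks up the new row with ts 30 but never sees the edit to the older row, while the full refresh picks up both at the cost of re-reading everything. That trade-off is why append suits append-only data and full suits small, slowly-changing tables.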
Example: continuous CDC ingestion into an accelerator
datasets:
  - from: postgres:public.users
    name: users
    params:
      pg_host: pg.internal
      pg_port: '5432'
      pg_user: spice
      pg_pass: ${secrets:pg_pass}
      pg_db: myapp
    acceleration:
      enabled: true
      engine: duckdb
      mode: file # Persist to disk so resuming after a restart is cheap
      refresh_mode: changes
This uses PostgreSQL Logical Replication to ingest every INSERT, UPDATE, and DELETE from public.users into a local DuckDB accelerator with low latency.
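Conceptually, a changes-mode accelerator behaves like a mirror that applies each row-level event in order. The following Python sketch shows that idea under stated assumptions: the event shape (op, key, row) and the apply_change helper are hypothetical and do not reflect PostgreSQL's replication format or Spice's internals.

```python
from typing import Any, Dict, List

Mirror = Dict[int, Dict[str, Any]]


def apply_change(mirror: Mirror, event: Dict[str, Any]) -> None:
    """Apply one row-level change event to a mirror keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Inserts and updates both leave the latest row image under its key.
        mirror[key] = event["row"]
    elif op == "delete":
        # Deleting a key that is already absent is treated as a no-op.
        mirror.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


# Hypothetical change feed for a users table.
events: List[Dict[str, Any]] = [
    {"op": "insert", "key": 1, "row": {"id": 1, "email": "a@example.com"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "email": "b@example.com"}},
    {"op": "update", "key": 1, "row": {"id": 1, "email": "a2@example.com"}},
    {"op": "delete", "key": 2},
]

users: Mirror = {}
for event in events:
    apply_change(users, event)
```

Unlike append mode, this approach captures updates and deletes, which is why it fits operational tables where rows change after they are written.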
SQL Statements
Spice supports writing data to compatible data connectors using standard SQL INSERT INTO syntax.
Write-Capable Connectors
Data connectors that support write operations are tagged as write:
- Apache Iceberg - Write to Iceberg tables via data connector or catalog connector
- AWS Glue - Write to Glue Data Catalog tables via data connector or catalog connector
Configuration for Write Operations
To enable write operations, configure your dataset or catalog with read_write access:
datasets:
  - from: glue:my_catalog.my_schema.my_table
    name: my_table
    access: read_write
    params:
      # ... connector-specific parameters
Example SQL
-- Insert a single row of literal values
INSERT INTO my_table (column1, column2)
VALUES ('value1', 'value2');

-- Insert the result of a query against another table
INSERT INTO my_table (column1, column2)
SELECT source_column1, source_column2
FROM source_table
WHERE condition = 'filter';
For more details on the INSERT statement syntax, see the SQL INSERT documentation.
OpenTelemetry Data Ingestion
By default, the runtime exposes an OpenTelemetry (OTEL) endpoint at grpc://127.0.0.1:50051 for OTEL data ingestion.
Each OTEL metric is inserted into the dataset whose name matches the metric name (metric name = dataset name) and is optionally replicated to the dataset source.
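The name-matching rule can be pictured with a short Python sketch. This is a simplified model, not the runtime's code: the route_metrics function and the point fields (name, timestamp, value) are hypothetical, and skipping metrics that have no matching dataset is an assumption made here for illustration.

```python
from typing import Any, Dict, List

Point = Dict[str, Any]


def route_metrics(datasets: Dict[str, List[Point]], points: List[Point]) -> int:
    """Append each metric point to the dataset whose name equals the metric name.

    Returns the number of points that were stored.
    """
    stored = 0
    for point in points:
        target = datasets.get(point["name"])
        if target is None:
            # Assumption for this sketch: points without a matching dataset are skipped.
            continue
        target.append({"timestamp": point["timestamp"], "value": point["value"]})
        stored += 1
    return stored


# One configured dataset; only the metric with the same name lands in it.
datasets: Dict[str, List[Point]] = {"smart_attribute_raw_value": []}
points = [
    {"name": "smart_attribute_raw_value", "timestamp": 1, "value": 42.0},
    {"name": "cpu_usage", "timestamp": 1, "value": 0.5},
]
stored = route_metrics(datasets, points)
```

The practical consequence is that the dataset name in the Spice configuration must be chosen to match the metric name emitted by the collector.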
Benefits
Spice.ai OSS includes built-in data ingestion support for collecting the latest data from edge nodes for use in subsequent queries. This eliminates the need for additional ETL pipelines and shortens the feedback loop.
For example, consider CPU usage anomaly detection. When CPU metrics are sent to the Spice OpenTelemetry endpoint, the loaded machine learning model can use the most recent observations for inference and provide recommendations to the edge node. This happens on the edge itself, within milliseconds, and without generating network traffic.
Additionally, Spice will periodically replicate the data to the data connector for further use.
Considerations
- Data Quality: Use Spice SQL capabilities to transform and cleanse ingested edge data, ensuring high-quality inputs.
- Data Security: Evaluate data sensitivity and secure network connections between the edge and data connector when replicating data for further use. Implement encryption, access controls, and secure protocols.
Example
Disk SMART
Start Spice with the following dataset:
datasets:
  - from: spice.ai/coolorg/smart/datasets/drive_stats
    name: smart_attribute_raw_value
    access: read_write
    replication:
      enabled: true
    acceleration:
      enabled: true
Start Telegraf with the following config:
[[inputs.smart]]
  attributes = true

[[outputs.opentelemetry]]
  service_address = "localhost:50051"

[agent]
  interval = "1s"
  flush_interval = "1s"
SMART data will be available in the smart_attribute_raw_value dataset in Spice.ai OSS and replicated to the coolorg.smart.drive_stats dataset in Spice.ai Cloud.
Limitations
- Write Support: Only selected connectors and catalogs support write operations.
- Replication: For OpenTelemetry ingestion, only replication to Spice.ai is supported.
