# Partitioning
Partitioning splits an accelerated dataset into multiple physical units (files, tables, or in-memory tables) keyed by an expression evaluated per row. Queries that filter on the partitioning expression — or on a column it references — only read the partitions that can match, dramatically reducing the data scanned.
```yaml
datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      partition_by:
        - bucket(50, PULocationID)
```
This config writes 50 separate DuckDB files, each containing the rows whose `PULocationID` hashes to the same bucket. A query like `WHERE PULocationID = 132` only opens the single bucket file that could contain `132`.
## How partitioning works
- At refresh time, Spice evaluates each `partition_by` expression for every row and routes the row to a partition keyed by the expression's value.
- At query time, Spice rewrites filters that reference the partition column or expression into a partition selection, and only reads the partitions that could satisfy the filter (partition pruning).
- Composite partitioning (Arrow and Cayenne) layers multiple expressions hierarchically (e.g. `year` then `month`) to combine pruning across dimensions.
Partitioning is most useful when:

- Queries reliably filter on a small subset of values (`region = 'EU'`, `tenant_id IN (...)`, `created_at >= today() - INTERVAL '7 days'`).
- A single column has high cardinality and partitioning by it directly would create too many tiny files; `bucket(N, col)` collapses it to `N` partitions.
- The dataset grows large enough that scanning the whole acceleration on every query is expensive.
## Configuration
### `partition_by`
Lives directly under `acceleration:` (not under `acceleration.params:`). It's a list of expressions; each entry is either a plain string or a single-entry `{ name: expression }` mapping:
```yaml
acceleration:
  enabled: true
  engine: arrow
  partition_by:
    # Anonymous expression — auto-named "expr0", "expr1", …
    - "YEAR(created_at)"
    # Named expression
    - month: "MONTH(created_at)"
```
Multi-entry mappings (`- year: "…", month: "…"` on one list item) are rejected at load time.
### `partition_mode` (DuckDB only)
Lives under `acceleration.params.partition_mode` and selects how DuckDB physically lays out partitions:
| Value | Layout | Default |
|---|---|---|
| `files` | Each partition is its own DuckDB file in a Hive-style directory (`column=value/…`). | ✓ |
| `tables` | Single DuckDB file with one table per partition (discovered via `information_schema.tables`). | |
```yaml
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  partition_by:
    - bucket(50, PULocationID)
  params:
    partition_mode: tables # default is `files`
```
## Supported engines
| Engine | Required `mode:` | Multi-expression | Layout |
|---|---|---|---|
| `arrow` | (in-memory; default) | Yes | One Arrow `MemTable` per partition value. |
| `duckdb` | `file` | No (single only) | Hive-style files (`partition_mode: files`) or per-partition tables (`partition_mode: tables`). |
| `cayenne` | `file` | Yes | One Vortex table per partition; catalog in a SQLite metadata file. |
The `sqlite`, `postgres`, and `turso` accelerators do not support `partition_by`.
## Partition transforms
`partition_by` accepts any DataFusion-compatible scalar SQL expression that returns a `String`, integer, `Boolean`, or `Timestamp` and references exactly one column from the dataset. The most common partition transforms are:
### `bucket(num_buckets, column)`
Hashes `column` into `num_buckets` deterministic buckets.

- `num_buckets` must be a positive integer literal ≤ 1,000,000.
- The return type matches `num_buckets`: `bucket(50, …)` (an `Int64` literal) returns `Int64`; `bucket(50::int32, …)` returns `Int32`.
- The same input always maps to the same bucket for a given `num_buckets` (uses `ahash` with a fixed seed).
- Use for: high-cardinality columns where direct partitioning would create too many partitions (`user_id`, `account_id`, `device_id`).
```yaml
partition_by:
  - bucket(100, user_id)
```
See the bucket reference for the full SQL signature.
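The contract above can be sketched in Python. This is an illustration only: Spice's actual implementation uses `ahash` with a fixed seed, while the stand-in hash here is `crc32`, so the bucket assignments differ — only the properties (deterministic, bounded, stable across runs) carry over.

```python
import zlib

def bucket(num_buckets: int, value) -> int:
    """Illustrative stand-in for the bucket transform (not Spice's code).

    Demonstrates the contract: the same input always maps to the same
    bucket, and every bucket id lies in [0, num_buckets).
    """
    h = zlib.crc32(repr(value).encode())  # stand-in for seeded ahash
    return h % num_buckets

# Same input always lands in the same bucket...
assert bucket(100, 42) == bucket(100, 42)
# ...and every bucket id is within [0, num_buckets).
assert all(0 <= bucket(50, uid) < 50 for uid in range(10_000))
```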
### `truncate(width, value)`
Truncates to the next-lower multiple of `width` (Iceberg's `truncate` transform).

- `width` must be a positive `Int64` literal.
- `value` may be any signed/unsigned integer, `Decimal128`/`Decimal256`, `Utf8` (string), or `Binary`. For strings/binary, returns the first `width` units.
- Returns the same type as `value`.
- Use for: floor-bucketing wide numeric ranges (`truncate(1000, amount)`), or grouping strings by prefix (`truncate(2, country_code)`).
```yaml
partition_by:
  - truncate(1000, amount)
```
See the truncate reference for examples.
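The integer and string cases can be sketched in Python — an illustrative model of the transform's semantics, not Spice's implementation:

```python
def truncate(width: int, value):
    """Iceberg-style truncate transform (illustrative sketch).

    Integers floor to the next-lower multiple of `width`; strings keep
    their first `width` characters.
    """
    if isinstance(value, int):
        # Python's floored modulo keeps negatives on the next-LOWER
        # multiple: truncate(10, -1) -> -10, not 0.
        return value - (value % width)
    if isinstance(value, str):
        return value[:width]
    raise TypeError(f"unsupported type: {type(value).__name__}")

print(truncate(1000, 5500))   # 5000
print(truncate(10, -1))       # -10
print(truncate(2, "country")) # co
```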
### `date_part(unit, column)` and `date_trunc(unit, column)`
Built-in DataFusion datetime functions. Useful for time-based partitioning at year, month, day, or hour granularity.
- `date_part('year', col)` returns the integer year (e.g. `2026`).
- `date_trunc('day', col)` returns the timestamp truncated to the start of the day.
```yaml
partition_by:
  - date_part('year', l_shipdate)
```

```yaml
partition_by:
  - day: "date_trunc('day', created_at)"
```
`YEAR(col)`, `MONTH(col)`, `DAY(col)`, etc. are aliases of `date_part(...)` and work identically.
> **Time-range pruning for `date_part()` is not yet implemented.** A `date_part('year', l_shipdate)` partition still produces correctly-distributed partitions, but a filter like `WHERE l_shipdate >= '2026-01-01'` does not currently translate into a partition selection. Queries return correct results; they just scan more partitions than necessary. If your filter is on the bare partition expression (`WHERE date_part('year', l_shipdate) = 2026`), the equality form does prune.
### Modulo (`column % N`)
A plain modulo expression also produces stable, partition-prunable buckets:
```yaml
partition_by:
  - "id % 16"
```
Range filters on the base column (`id BETWEEN 0 AND 1000`) are pruned for `%` partitions.
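The pruning idea for a bounded range can be sketched in Python (`buckets_for_range` is a hypothetical helper, not Spice's code): enumerate each value in the range, take its residue, and read only those buckets.

```python
def buckets_for_range(lo: int, hi: int, n: int) -> set:
    """Which `col % n` partitions can a bounded range filter touch?

    Illustrative sketch of the pruning idea, not Spice's implementation.
    """
    if hi - lo + 1 >= n:
        # The range covers every residue class: no pruning is possible.
        return set(range(n))
    return {v % n for v in range(lo, hi + 1)}

# `id BETWEEN 0 AND 7` against `id % 16` touches only buckets 0..7:
assert buckets_for_range(0, 7, 16) == set(range(8))
# A range wider than n necessarily touches all buckets:
assert buckets_for_range(0, 1000, 16) == set(range(16))
```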
### Plain column reference
A bare column name creates one partition per distinct value, which is useful for low-cardinality columns:
```yaml
partition_by:
  - region
```
Pruning supports equality, `IN`, `NOT IN`, and range filters on the column.
## Composite partitioning (Arrow + Cayenne)
Arrow and Cayenne accelerations accept multiple `partition_by` expressions. Spice partitions hierarchically: first by the leftmost expression, then by the next, and so on.
```yaml
acceleration:
  enabled: true
  engine: cayenne
  mode: file
  partition_by:
    - year: "date_part('year', created_at)"
    - month: "date_part('month', created_at)"
    - region: region
```
A query with `WHERE region = 'EU' AND date_part('year', created_at) = 2026` prunes on both axes.
A DuckDB partitioned acceleration rejects multiple expressions with a `PartitionByRequired` error (single-expression only).
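The hierarchical routing can be sketched in Python — an illustration with made-up rows; `partition_key` is a hypothetical helper, not a Spice API:

```python
from datetime import date

def partition_key(row: dict) -> tuple:
    """Composite key for the example above: year, then month, then region.

    Rows sharing all three components land in the same partition.
    """
    d: date = row["created_at"]
    return (d.year, d.month, row["region"])

rows = [
    {"created_at": date(2026, 1, 3), "region": "EU"},
    {"created_at": date(2026, 1, 9), "region": "EU"},  # same partition as row 1
    {"created_at": date(2026, 2, 1), "region": "US"},
]
keys = {partition_key(r) for r in rows}
assert keys == {(2026, 1, "EU"), (2026, 2, "US")}
```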
## Partition pruning
Spice attempts to translate query-time filters into a partition selection. The matrix below summarizes which filter shapes prune which partition transforms:
| Filter shape | Plain column / `truncate` / `date_trunc` / `% N` | `bucket(N, col)` |
|---|---|---|
| `col = X` | ✓ | ✓ (filter substituted into bucket expression) |
| `col != X` | ✓ | ✗ (no pruning) |
| `col IN (a, b, …)` | ✓ | ✓ |
| `col NOT IN (a, b, …)` | ✓ | ✗ (no pruning) |
| `col < X` / `<= X` / `> X` / `>= X` | ✓ | ✓ only for bounded ranges (both lower and upper) |
| `col BETWEEN a AND b` | ✓ | ✓ (when the range expands to ≤ a few thousand `Int32` values) |
| `col = a OR col = b OR …` | ✓ | ✓ |
| `partition_expr = X` (e.g. `bucket(50, c) = 7`) | ✓ | ✓ |
| `expr1_filter AND expr2_filter` (composite partitions) | ✓ | ✓ |
Pruning notes:

- A filter on the base column is substituted into the partition expression: `WHERE user_id = 42` against a `bucket(100, user_id)` partition evaluates `bucket(100, 42)` once and reads only that partition.
- Bucket inequality pruning enumerates candidate values within a bounded `Int32` range (capped at `MAX_BUCKET_ENUMERATION_I32` candidate values). Open-ended ranges (`col < X` with no lower bound) and non-`Int32` types fall back to no pruning.
- `date_part()` filter pruning is not yet implemented for time-range filters. Equality on the partition expression still prunes; range filters on the base column do not.
- Filters that don't fully resolve at the partition layer are passed through to the data layer; pruning is best-effort and never returns wrong rows.
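The substitution idea behind `IN`-list pruning can be sketched in Python (hypothetical helpers with a `crc32` stand-in hash; Spice uses seeded `ahash`): each listed value is pushed through the bucket expression, and the query reads at most that many partitions.

```python
import zlib

def bucket(num_buckets: int, value) -> int:
    # Stand-in hash for illustration (Spice uses ahash with a fixed seed).
    return zlib.crc32(repr(value).encode()) % num_buckets

def prune_in_list(num_buckets: int, values) -> set:
    """Substitute each IN-list value into the bucket expression.

    The resulting set is the only partitions the query needs to read —
    at most len(values) of the num_buckets partitions.
    """
    return {bucket(num_buckets, v) for v in values}

candidates = prune_in_list(100, [1, 2, 3, 4])
assert len(candidates) <= 4      # reads at most 4 of 100 partitions
# The selection is order-independent:
assert prune_in_list(100, [1, 2]) == prune_in_list(100, [2, 1])
```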
## Engine-specific behavior
### Arrow (`engine: arrow`)
An in-memory `MemTable` per partition value. Partitions are rebuilt on every refresh. The simplest engine to start with for moderate datasets that fit in RAM. `hash_index` and `sort_columns` are propagated per partition.
### DuckDB files mode (`engine: duckdb`, `partition_mode: files`)
Each partition value becomes its own DuckDB file under a Hive-style directory layout (e.g. `accelerator_dir/dataset/year=2026/data.duckdb`). Useful when you want OS-level visibility of partitions and per-partition I/O.
Requires `mode: file`. Single-expression only.
### DuckDB tables mode (`engine: duckdb`, `partition_mode: tables`)
A single DuckDB file with one table per partition value. Cheaper to manage operationally (one file to back up, one connection pool to size), but loses per-partition file-level isolation.
Requires `mode: file`. Single-expression only.
### Cayenne (`engine: cayenne`)
Each partition is a separate Cayenne (Vortex) table; the partition catalog is tracked in a SQLite metadata file. Cayenne supports composite partitioning natively and is the right pick for very large datasets where Arrow would not fit in memory.
Requires `mode: file`.
## Changing `partition_by` after refresh
Once partitions exist on disk, changing `partition_by` is rejected with an error:
> The `partition_by` expressions are different from the expressions used to create the existing partition files. Revert the `partition_by` expressions, delete the partition files, or change the location the partition files are stored to create new partitions.
To re-partition:

1. Stop the runtime.
2. Delete the partitioned acceleration directory, or point `accelerator_dir` at a fresh location.
3. Update `partition_by`.
4. Restart; Spice will rebuild from the source.
There is no automatic in-place re-partitioning.
## Validation rules
A `partition_by` expression is rejected at startup if any of the following hold:
- The result type is not `String`, an integer (signed or unsigned), `Boolean`, or `Timestamp`.
- The expression references zero or multiple dataset columns.
- The expression contains a subquery, `OUTER REFERENCE`, `UNNEST`, window function, aggregate function, `EXISTS`, `GROUPING SET`, or `PLACEHOLDER`.
- The expression aliases the column (`col AS partition_key`).
For the DuckDB engine, additionally:

- `mode:` must be `file`.
- Only one `partition_by` expression is allowed.
## Examples
### High-cardinality hash partitioning
Partition events by user, collapsing millions of users into 100 stable buckets:
```yaml
datasets:
  - from: s3://my-bucket/events/
    name: events
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      partition_by:
        - bucket(100, user_id)
```
```sql
-- Reads the single bucket that hashes user_id 42:
SELECT COUNT(*) FROM events WHERE user_id = 42;

-- Reads at most 4 bucket files:
SELECT * FROM events WHERE user_id IN (1, 2, 3, 4);
```
### Time-based partitioning by year
```yaml
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  partition_by:
    - date_part('year', l_shipdate)
```
```sql
-- Equality on the partition expression prunes:
SELECT * FROM lineitem WHERE date_part('year', l_shipdate) = 2026;
```
### Composite year/month partitioning (Cayenne)
```yaml
acceleration:
  enabled: true
  engine: cayenne
  mode: file
  partition_by:
    - year: "date_part('year', created_at)"
    - month: "date_part('month', created_at)"
```
### Truncate-based numeric partitioning
```yaml
acceleration:
  enabled: true
  engine: arrow
  partition_by:
    - "truncate(1000, order_total_cents)"
```
```sql
-- Range pruning works because truncate is monotonic:
SELECT * FROM orders WHERE order_total_cents BETWEEN 5000 AND 9999;
```
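Why the range prunes can be sketched in Python (hypothetical helpers, not Spice's code): because `truncate` is monotonic, the `BETWEEN` bounds map to a contiguous run of partition values, and only those partitions are read.

```python
def truncate(width: int, value: int) -> int:
    # Next-lower multiple of width (Iceberg-style truncate).
    return value - (value % width)

def partitions_for_between(width: int, lo: int, hi: int) -> list:
    """Contiguous partition values a BETWEEN filter can touch.

    Monotonicity means truncate(lo) and truncate(hi) bound the run;
    everything outside [truncate(lo), truncate(hi)] is pruned.
    """
    return list(range(truncate(width, lo), truncate(width, hi) + 1, width))

# BETWEEN 5000 AND 9999 with width 1000 reads exactly 5 partitions:
assert partitions_for_between(1000, 5000, 9999) == [5000, 6000, 7000, 8000, 9000]
```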
### Plain column partitioning by region
```yaml
acceleration:
  enabled: true
  engine: arrow
  partition_by:
    - region
```
```sql
-- One partition read:
SELECT * FROM events WHERE region = 'EU';
```
## Related
- `bucket` SQL reference
- `truncate` SQL reference
- `date_part` SQL reference
- `date_trunc` SQL reference
- Data refresh modes: `time_partition_column` for time-pruned refreshes
- Sharded deployment: partitioning across multiple Spice instances rather than within one acceleration
