Hash Index for Arrow Acceleration
Hash index is an experimental feature available in Spice v1.11.0-rc.2 and later.
The hash index is an optional, high-performance indexing feature for Arrow-accelerated datasets. It provides O(1) point lookups on primary key and secondary index columns, dramatically improving query performance for equality predicates.
Key Featuresβ
- O(1) Point Lookups: Direct row access via primary key or secondary indexes without full table scans
- Secondary Indexes: Optional indexes on non-primary-key columns for fast lookups
- 256-Shard Design: Minimizes lock contention for concurrent reads
- SIMD-Optimized Hashing: Uses XXH3_64 for fast, high-quality hashing
- Built-in Bloom Filter: Fast negative lookups to skip unnecessary hash table probes
- Auto-Threshold: Index is only built when data size exceeds a minimum threshold
Configurationβ
To use the hash index, explicitly enable it and specify a primary key:
datasets:
- from: s3://bucket/orders.parquet
name: orders
acceleration:
engine: arrow
primary_key: order_id
params:
hash_index: enabled
Secondary Indexesβ
Secondary indexes can be added on non-primary-key columns to accelerate equality lookups on those columns. Define them using the indexes field in the acceleration configuration:
datasets:
- from: s3://bucket/users.parquet
name: users
acceleration:
engine: arrow
primary_key: user_id
indexes:
email: unique
status: enabled
'(region, category)': unique
params:
hash_index: enabled
Index types:
uniqueβ Enforces uniqueness and enables O(1) indexed lookups.enabledβ Permits duplicates. The index is built and maintained but does not currently accelerate queries (queries fall back to a full scan).
Compound secondary indexes can be defined with a multicolumn key in parentheses, e.g. '(col1, col2)': unique, but are not yet used for query optimization.
Only single-column unique secondary indexes currently accelerate queries. Non-unique and compound secondary indexes are maintained for future use.
Configuration Optionsβ
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
hash_index | enabled/disabled | No | disabled | Enable hash indexing |
primary_key | string or list | Yes (if hash_index enabled) | None | Column(s) for the primary key index |
indexes | YAML map | No | None | Secondary indexes (see indexes) |
Supported Data Typesβ
The hash index supports the following primary key column types:
Primitive Typesβ
Int8,Int16,Int32,Int64UInt8,UInt16,UInt32,UInt64
String Typesβ
Utf8,LargeUtf8
Binary Typesβ
Binary,LargeBinary
Query Optimizationβ
The hash index automatically accelerates queries with equality predicates on indexed columns.
Optimized Queriesβ
-- Primary key lookup (uses primary key index)
SELECT * FROM my_dataset WHERE id = 123;
-- Multiple key lookups (uses primary key index for each key)
SELECT * FROM my_dataset WHERE id IN (1, 2, 3);
-- Secondary index lookup (uses unique secondary index)
SELECT * FROM my_dataset WHERE email = '[email protected]';
-- Primary key lookup with additional filter (index + post-filter)
SELECT * FROM my_dataset WHERE id = 123 AND status = 'active';
When a primary key lookup is combined with additional filters (e.g. WHERE id = 123 AND status = 'active'), the index is used for the primary key lookup and the remaining filters are applied afterward by DataFusion.
Non-Optimized Queriesβ
-- Range queries (full scan)
SELECT * FROM my_dataset WHERE id > 100 AND id < 200;
-- Pattern matching (full scan)
SELECT * FROM my_dataset WHERE id LIKE 'A%';
-- Composite primary keys (full scan, not yet supported)
SELECT * FROM my_dataset WHERE region = 'US' AND customer_id = 42;
-- Non-unique secondary index (full scan, not yet optimized)
SELECT * FROM my_dataset WHERE status = 'active';
Index Thresholdβ
The hash index is only built when the dataset exceeds a minimum size:
threshold = 256 Γ CPU_cores
| CPU Cores | Minimum Rows for Index |
|---|---|
| 1 | 256 |
| 4 | 1,024 |
| 8 | 2,048 |
| 16 | 4,096 |
| 32 | 8,192 |
For small tables below the threshold, a full scan is faster than index maintenance overhead.
Performanceβ
Bloom Filter Performanceβ
The built-in bloom filter provides:
- ~0.82% false positive rate (10 bits/item, 7 hash functions)
- O(1) negative lookup confirmation
- Reduced unnecessary hash table probes for non-existent keys
Memory Usageβ
| Component | Memory per Entry |
|---|---|
| Hash slot | 16 bytes (8-byte hash + 8-byte location) |
| Bloom filter | ~1.25 bytes |
| Total | ~17.25 bytes per indexed row |
Estimating Memoryβ
For a 10 million row dataset:
Index memory β 10M Γ 17.25 bytes β 165 MB
Architectureβ
Sharded Hash Tableβ
The index uses 256 independent shards to minimize lock contention:
ββββββββββββββββββββββββββββββββββββββββββββββββββ
β HashIndex β
ββββββββββββββββββββββββββββββββββββββββββββββββββ€
β βββββββββββ βββββββββββ βββββββββββ β
β β Shard 0 β β Shard 1 β ... βShard 255β β
β β RwLock β β RwLock β β RwLock β β
β ββββββ¬βββββ ββββββ¬βββββ ββββββ¬βββββ β
β β β β β
β βΌ βΌ βΌ β
β βββββββββββ βββββββββββ βββββββββββ β
β β Hash β β Hash β ... β Hash β β
β β Table β β Table β β Table β β
β βββββββββββ βββββββββββ βββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββ€
β βββββββββββββββββββββββββββββββββββββββββββ β
β β Optional Bloom Filter β β
β β (Fast Negative Lookups) β β
β βββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββ
Shard Selection: Uses XOR-folded hash bits: ((hash >> 56) ^ (hash >> 48) ^ hash) & 0xFF
Row Locationβ
Each indexed key maps to a RowLocation:
RowLocation {
partition: u32, // Partition index
batch: u32, // Batch index within partition
row: u32, // Row index within batch
}
Hash Functionβ
Uses XXH3_64 with a fixed seed (0x5370_6963_6541_4920 = "SpiceAI ") for:
- Deterministic hashing across instances
- High-quality distribution (passes SMHasher)
- SIMD acceleration on arm64/amd64
Limitationsβ
- Arrow Engine Only: Hash index is only available for
engine: arrowacceleration - Single-Column Primary Keys Only: Composite primary keys are not yet supported for indexed lookups; only single-column primary keys use the index
- Experimental: API and behavior may change in future releases
- No Persistence: Index is rebuilt on restart (data persists, index is in-memory)
- Duplicate Keys: Primary key columns must have unique values
- Secondary Index Limitations: Only single-column
uniquesecondary indexes accelerate queries. Non-unique and compound secondary indexes are built and maintained but do not yet optimize queries
Troubleshootingβ
"No index available for point lookup"β
Cause: Dataset row count is below the index threshold.
Solution: This is expected behavior for small datasets. The full scan is faster than index overhead.
Warning: "Add 'hash_index: enabled' to use primary_key for fast lookups"β
Cause: primary_key is specified but hash_index is not enabled.
Solution: Add hash_index: enabled to params:
params:
hash_index: enabled
High Memory Usageβ
Cause: Index consumes ~17 bytes per row.
Solution:
- Disable hash_index for datasets where point lookups are rare
- Consider using a different acceleration engine for very large datasets
