ScyllaDB Data Connector
The ScyllaDB Data Connector enables federated SQL queries on data stored in ScyllaDB clusters using CQL (Cassandra Query Language).
```yaml
datasets:
  - from: scylladb:users
    name: users
    params:
      scylladb_host: localhost
      scylladb_port: 9042
      scylladb_keyspace: my_app
```
Configuration
from
The from field takes the form scylladb:{table_name} where table_name is the table identifier in the ScyllaDB keyspace to read from.
```yaml
datasets:
  - from: scylladb:users
    name: users
    params:
      scylladb_keyspace: my_app
      # ...
```
name
The dataset name. This will be used as the table name within Spice.
Example:
```yaml
datasets:
  - from: scylladb:users
    name: app_users
    params: ...
```
```sql
SELECT COUNT(*) FROM app_users;
```
```
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
```
The dataset name cannot be a reserved keyword.
params
The ScyllaDB Data Connector can be configured by providing the following params. Use the secret replacement syntax to load secrets from a secret store, e.g. ${secrets:scylladb_pass}.
| Parameter Name | Description | Required | Default |
|---|---|---|---|
| scylladb_host | Hostname(s) of ScyllaDB nodes. Comma-separated for multiple nodes. | Yes | - |
| scylladb_hosts | Alternative to scylladb_host. Comma-separated list of hostnames. | No | - |
| scylladb_port | ScyllaDB CQL native transport port. | No | 9042 |
| scylladb_keyspace | The keyspace to use for queries. | Yes | - |
| scylladb_user | Username for authentication. | No | - |
| scylladb_pass | Password for authentication. | No | - |
| scylladb_datacenter | Preferred datacenter for connection routing. | No | - |
| scylladb_ssl | Enable SSL/TLS for connections. | No | false |
| connection_timeout | Connection timeout in milliseconds. | No | 10000 |
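For example, a dataset that combines several of these parameters, with the password loaded via the secret replacement syntax, might look like the following sketch (hostnames, keyspace, and credentials are illustrative):

```yaml
datasets:
  - from: scylladb:orders
    name: orders
    params:
      scylladb_host: scylla-node1.example.com,scylla-node2.example.com
      scylladb_port: 9042
      scylladb_keyspace: ecommerce
      scylladb_user: app_user
      scylladb_pass: ${secrets:scylladb_pass}
      scylladb_datacenter: us-east-1
      scylladb_ssl: true
      connection_timeout: 30000
```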
Types
The table below shows the CQL data types supported, along with the type mapping to Apache Arrow types in Spice.
| CQL Type | Arrow Type | Notes |
|---|---|---|
| boolean | Boolean | |
| tinyint | Int8 | |
| smallint | Int16 | |
| int | Int32 | |
| bigint | Int64 | |
| counter | Int64 | Cassandra counter type |
| float | Float32 | |
| double | Float64 | |
| decimal | Decimal128(38, 2) | Arbitrary precision → fixed precision |
| blob | Binary | |
| date | Date32 | Days since epoch |
| timestamp | Timestamp(Millisecond, None) | Milliseconds since epoch |
| time | Timestamp(Microsecond, None) | Time of day |
| text, varchar, ascii | Utf8 | |
| uuid, timeuuid | Utf8 | String representation |
| inet | Utf8 | IP address as string |
| varint | Utf8 | Arbitrary precision integer as string |
| duration | Utf8 | ISO 8601 duration string |
| list<T>, set<T>, map<K,V> | Utf8 | JSON string representation |
| tuple<...> | Utf8 | String representation |
| frozen<T> | Utf8 | Same as underlying collection |
| User-Defined Types | Utf8 | String representation |
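Because collections, UUIDs, and user-defined types all surface as Utf8 columns, they can be filtered and returned as plain strings. A minimal sketch, assuming a hypothetical users table with a uuid column id and a set<text> column tags (returned as a JSON string):

```sql
-- id arrives as its string representation; tags arrives as a JSON-encoded string
SELECT id, tags
FROM users
WHERE tags LIKE '%premium%';
```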
Decimal Handling
CQL decimal is an arbitrary-precision type, while Arrow Decimal128 has a maximum precision of 38 digits. The connector uses:
- Precision: 38 (maximum for Decimal128)
- Scale: 2 (suitable for monetary/financial data)
Values that exceed 38 digits of precision, or that carry more than two fractional digits, may be truncated or rounded.
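To confirm how a decimal column surfaces in Spice, you can inspect its Arrow type at query time. This sketch assumes a hypothetical products table with a CQL decimal column price, and that DataFusion's arrow_typeof function is available in your Spice version:

```sql
-- Should report Decimal128(38, 2) for a CQL decimal column
SELECT arrow_typeof(price) AS price_type
FROM products
LIMIT 1;
```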
Date/Time Handling
- CQL date: Stored as days since epoch with a 2^31 offset. Converted to Arrow Date32 (signed days since 1970-01-01).
- CQL timestamp: Milliseconds since Unix epoch. Directly mapped to Arrow Timestamp(Millisecond).
- CQL time: Nanoseconds since midnight. Converted to Arrow Timestamp(Microsecond) with nanosecond truncation.
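Because timestamp columns map to Arrow Timestamp(Millisecond), standard DataFusion date/time functions apply and are evaluated locally. A minimal sketch, assuming a hypothetical events table with a created_at timestamp column:

```sql
-- Bucket events per day; date_trunc runs locally in DataFusion
SELECT date_trunc('day', created_at) AS day, COUNT(*) AS event_count
FROM events
GROUP BY date_trunc('day', created_at)
ORDER BY day;
```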
Query Execution
Due to fundamental differences between CQL and SQL, the connector implements a local filtering strategy. All data is fetched from ScyllaDB using SELECT * queries, and filtering, joins, aggregations, and other SQL operations are performed locally by DataFusion.
Why Filter Pushdown is Disabled
CQL lacks many SQL constructs:
| Feature | SQL | CQL |
|---|---|---|
| CASE WHEN | ✅ | ❌ |
| Subqueries | ✅ | ❌ |
| Complex JOINs | ✅ | ❌ |
| CAST expressions | ✅ | ❌ |
| INTERVAL | ✅ | ❌ |
| Window functions | ✅ | ❌ |
| NULLS FIRST/LAST | ✅ | ❌ |
| COUNT(DISTINCT) | ✅ | ❌ |
| Arbitrary WHERE | ✅ | ❌ |
Because of these limitations, the connector:
- Fetches full table data using SELECT * FROM keyspace.table
- Lets DataFusion handle all filtering locally after data retrieval
- Supports projection pushdown, so only requested columns are transferred (see the example below)
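For example, in the following query only the country column needs to be transferred from ScyllaDB (projection pushdown), while the WHERE filter and the aggregation are evaluated locally by DataFusion. The users table and country column are illustrative:

```sql
SELECT country, COUNT(*) AS user_count
FROM users
WHERE country IN ('DE', 'FR')
GROUP BY country;
```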
Streaming Execution
Query results are streamed using the scylla driver's paging mechanism in batches of 8192 rows by default, minimizing memory usage for large result sets.
Performance Considerations
Since filter pushdown is disabled, all table data is transferred for each query. Consider the following optimizations:
Enable Acceleration
For frequently queried data, enable Spice acceleration to cache data locally:
```yaml
datasets:
  - from: scylladb:products
    name: products
    params:
      scylladb_host: ${env:SCYLLADB_HOST}
      scylladb_keyspace: catalog
    acceleration:
      enabled: true
      engine: duckdb
      refresh_check_interval: 1h
```
Configure Datacenter Locality
Set the datacenter preference to route queries to the nearest nodes:
```yaml
params:
  scylladb_datacenter: us-east-1
```
Adjust Connection Timeout
Set connection timeouts appropriately for your network:
```yaml
params:
  connection_timeout: 30000 # 30 seconds
```
Examples
Basic Federated Query
```yaml
datasets:
  - from: scylladb:users
    name: users
    params:
      scylladb_host: localhost
      scylladb_port: 9042
      scylladb_keyspace: my_app
```
With Authentication
```yaml
datasets:
  - from: scylladb:orders
    name: orders
    params:
      scylladb_host: scylla-cluster.example.com
      scylladb_keyspace: ecommerce
      scylladb_user: app_user
      scylladb_pass: ${secrets:SCYLLA_PASSWORD}
```
Multi-Node Cluster
```yaml
datasets:
  - from: scylladb:events
    name: events
    params:
      scylladb_hosts: node1.scylla.local,node2.scylla.local,node3.scylla.local
      scylladb_keyspace: analytics
      scylladb_datacenter: us-west-2
```
With Acceleration
```yaml
datasets:
  - from: scylladb:products
    name: products
    params:
      scylladb_host: ${env:SCYLLADB_HOST}
      scylladb_keyspace: catalog
    acceleration:
      enabled: true
      engine: duckdb
      refresh_check_interval: 1h
```
Using Environment Variables
```yaml
datasets:
  - from: scylladb:customer
    name: customer
    params:
      scylladb_host: ${env:SCYLLADB_HOST}
      scylladb_port: ${env:SCYLLADB_PORT}
      scylladb_keyspace: ${env:SCYLLADB_KEYSPACE}
```
Limitations
CQL Limitations
The following SQL operations cannot be pushed down to ScyllaDB and are performed locally:
- JOINs: All joins are performed locally by DataFusion (see the example after this list)
- Aggregations: COUNT, SUM, AVG, etc. are computed locally
- Subqueries: Nested queries are not supported in CQL
- Window functions: RANK, ROW_NUMBER, etc. not supported
- Complex WHERE clauses: CQL requires partition key in WHERE; Spice fetches all data
- ORDER BY: Sorting is done locally
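As an illustration, a join between a ScyllaDB-backed dataset and any other Spice dataset is planned and executed entirely by DataFusion, with each table fetched (subject to projection pushdown) before the join is applied. The orders and products datasets below are hypothetical:

```sql
SELECT p.name, SUM(o.quantity) AS total_quantity
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY p.name
ORDER BY total_quantity DESC
LIMIT 10;
```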
Connector Limitations
- Read-only: The connector does not support INSERT, UPDATE, or DELETE operations
- Decimal precision: Fixed at precision=38, scale=2; may not suit all use cases
- Collection types: Lists, sets, and maps are converted to JSON string representation
- Large tables: Without acceleration, every query against a large table transfers the full table contents from ScyllaDB
Data Type Limitations
- varint: Arbitrary-precision integers are converted to strings
- duration: CQL durations are converted to string representation
- UDTs: User-defined types are converted to string representation
- Nested collections: Deeply nested collections become complex JSON strings
Secrets
Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the secret stores documentation. Additionally, learn how to use referenced secrets in component parameters by visiting the using referenced secrets guide.
