Skip to main content
Version: Next (v1.11)

ScyllaDB Data Connector

The ScyllaDB Data Connector enables federated SQL queries on data stored in ScyllaDB clusters using CQL (Cassandra Query Language).

datasets:
- from: scylladb:users
name: users
params:
scylladb_host: localhost
scylladb_port: 9042
scylladb_keyspace: my_app

Configuration

from

The from field takes the form scylladb:{table_name} where table_name is the table identifier in the ScyllaDB keyspace to read from.

datasets:
- from: scylladb:users
name: users
params:
scylladb_keyspace: my_app
...

name

The dataset name. This will be used as the table name within Spice.

Example:

datasets:
- from: scylladb:users
name: app_users
params: ...
SELECT COUNT(*) FROM app_users;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

The dataset name cannot be a reserved keyword.

params

The ScyllaDB data connector can be configured by providing the following params. Use the secret replacement syntax to load the secret from a secret store, e.g. ${secrets:scylladb_pass}.

Parameter NameDescriptionRequiredDefault
scylladb_hostHostname(s) of ScyllaDB nodes. Comma-separated for multiple nodes.Yes-
scylladb_hostsAlternative to scylladb_host. Comma-separated list of hostnames.No-
scylladb_portScyllaDB CQL native transport port.No9042
scylladb_keyspaceThe keyspace to use for queries.Yes-
scylladb_userUsername for authentication.No-
scylladb_passPassword for authentication.No-
scylladb_datacenterPreferred datacenter for connection routing.No-
scylladb_sslEnable SSL/TLS for connections.Nofalse
connection_timeoutConnection timeout in milliseconds.No10000

Types

The table below shows the CQL data types supported, along with the type mapping to Apache Arrow types in Spice.

CQL TypeArrow TypeNotes
booleanBoolean
tinyintInt8
smallintInt16
intInt32
bigintInt64
counterInt64Cassandra counter type
floatFloat32
doubleFloat64
decimalDecimal128(38, 2)Arbitrary precision → fixed precision
blobBinary
dateDate32Days since epoch
timestampTimestamp(Millisecond, None)Milliseconds since epoch
timeTimestamp(Microsecond, None)Time of day
text, varchar, asciiUtf8
uuid, timeuuidUtf8String representation
inetUtf8IP address as string
varintUtf8Arbitrary precision integer as string
durationUtf8ISO 8601 duration string
list<T>, set<T>, map<K,V>Utf8JSON string representation
tuple<...>Utf8String representation
frozen<T>Utf8Same as underlying collection
User-Defined TypesUtf8String representation

Decimal Handling

CQL decimal is an arbitrary-precision type, while Arrow Decimal128 has a maximum precision of 38 digits. The connector uses:

  • Precision: 38 (maximum for Decimal128)
  • Scale: 2 (suitable for monetary/financial data)

For decimals that exceed this precision, values may be truncated or rounded.

Date/Time Handling

  • CQL date: Stored as days since epoch with a 2^31 offset. Converted to Arrow Date32 (signed days since 1970-01-01).
  • CQL timestamp: Milliseconds since Unix epoch. Directly mapped to Arrow Timestamp(Millisecond).
  • CQL time: Nanoseconds since midnight. Converted to Arrow Timestamp(Microsecond) with nanosecond truncation.

Query Execution

Due to fundamental differences between CQL and SQL, the connector implements a local filtering strategy. All data is fetched from ScyllaDB using SELECT * queries, and filtering, joins, aggregations, and other SQL operations are performed locally by DataFusion.

Why Filter Pushdown is Disabled

CQL lacks many SQL constructs:

FeatureSQLCQL
CASE WHEN
Subqueries
Complex JOINs
CAST expressions
INTERVAL
Window functions
NULLS FIRST/LAST
COUNT(DISTINCT)
Arbitrary WHERE

Because of these limitations, the connector:

  1. Fetches full table data using SELECT * FROM keyspace.table
  2. Lets DataFusion handle all filtering locally after data retrieval
  3. Supports projection pushdown—only requested columns are transferred

Streaming Execution

Query results are streamed using the scylla driver's paging mechanism in batches of 8192 rows by default, minimizing memory usage for large result sets.

Performance Considerations

Since filter pushdown is disabled, all table data is transferred for each query. Consider the following optimizations:

Enable Acceleration

For frequently queried data, enable Spice acceleration to cache data locally:

datasets:
- from: scylladb:products
name: products
params:
scylladb_host: ${env:SCYLLADB_HOST}
scylladb_keyspace: catalog
acceleration:
enabled: true
engine: duckdb
refresh_check_interval: 1h

Configure Datacenter Locality

Set the datacenter preference to route queries to the nearest nodes:

params:
scylladb_datacenter: us-east-1

Adjust Connection Timeout

Set connection timeouts appropriately for your network:

params:
connection_timeout: 30000 # 30 seconds

Examples

Basic Federated Query

datasets:
- from: scylladb:users
name: users
params:
scylladb_host: localhost
scylladb_port: 9042
scylladb_keyspace: my_app

With Authentication

datasets:
- from: scylladb:orders
name: orders
params:
scylladb_host: scylla-cluster.example.com
scylladb_keyspace: ecommerce
scylladb_user: app_user
scylladb_pass: ${secrets:SCYLLA_PASSWORD}

Multi-Node Cluster

datasets:
- from: scylladb:events
name: events
params:
scylladb_hosts: node1.scylla.local,node2.scylla.local,node3.scylla.local
scylladb_keyspace: analytics
scylladb_datacenter: us-west-2

With Acceleration

datasets:
- from: scylladb:products
name: products
params:
scylladb_host: ${env:SCYLLADB_HOST}
scylladb_keyspace: catalog
acceleration:
enabled: true
engine: duckdb
refresh_check_interval: 1h

Using Environment Variables

datasets:
- from: scylladb:customer
name: customer
params:
scylladb_host: ${env:SCYLLADB_HOST}
scylladb_port: ${env:SCYLLADB_PORT}
scylladb_keyspace: ${env:SCYLLADB_KEYSPACE}

Limitations

CQL Limitations

The following SQL operations cannot be pushed down to ScyllaDB and are performed locally:

  • JOINs: All joins are performed locally by DataFusion
  • Aggregations: COUNT, SUM, AVG, etc. are computed locally
  • Subqueries: Nested queries are not supported in CQL
  • Window functions: RANK, ROW_NUMBER, etc. not supported
  • Complex WHERE clauses: CQL requires partition key in WHERE; Spice fetches all data
  • ORDER BY: Sorting is done locally

Connector Limitations

  • Read-only: The connector does not support INSERT, UPDATE, or DELETE operations
  • Decimal precision: Fixed at precision=38, scale=2; may not suit all use cases
  • Collection types: Lists, sets, and maps are converted to JSON string representation
  • Large tables: Without acceleration, large tables cause significant data transfer

Data Type Limitations

  • varint: Arbitrary-precision integers are converted to strings
  • duration: CQL durations are converted to string representation
  • UDTs: User-defined types are converted to string representation
  • Nested collections: Deeply nested collections become complex JSON strings

Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the secret stores documentation. Additionally, learn how to use referenced secrets in component parameters by visiting the using referenced secrets guide.

See Also