Databricks Data Connector
Databricks as a connector for federated SQL query against Databricks using Spark Connect or directly from Delta Lake tables.
datasets:
- from: databricks:spiceai.datasets.my_awesome_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_aws_access_key_id: ${secrets:aws_access_key_id}
databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
Configuration​
from
​
The from
field for the Databricks connector takes the form databricks:catalog.schema.table
where catalog.schema.table
is the fully-qualified path to the table to read from.
name
​
The dataset name. This will be used as the table name within Spice.
Example:
datasets:
- from: databricks:spiceai.datasets.my_awesome_table
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
params
​
Use the secret replacement syntax to reference a secret, e.g. ${secrets:my_token}
.
Parameter Name | Description |
---|---|
mode | The execution mode for querying against Databricks. The default is spark_connect . Possible values:
|
databricks_endpoint | The endpoint of the Databricks instance. Required for both modes. |
databricks_cluster_id | The ID of the compute cluster in Databricks to use for the query. Only valid when mode is spark_connect . |
databricks_use_ssl | If true, use a TLS connection to connect to the Databricks endpoint. Default is true . |
client_timeout | Optional. Applicable only in delta_lake mode. Specifies timeout for object store operations. Default value is 30s E.g. client_timeout: 60s |
Delta Lake object store parameters​
Configure the connection to the object store when using mode: delta_lake
. Use the secret replacement syntax to reference a secret, e.g. ${secrets:aws_access_key_id}
.
AWS S3​
Parameter Name | Description |
---|---|
databricks_aws_region | Optional. The AWS region for the S3 object store. E.g. us-west-2 . |
databricks_aws_access_key_id | The access key ID for the S3 object store. |
databricks_aws_secret_access_key | The secret access key for the S3 object store. |
databricks_aws_endpoint | Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com . |
Azure Blob​
One of the following auth values must be provided for Azure Blob:
databricks_azure_storage_account_key
,databricks_azure_storage_client_id
andazure_storage_client_secret
, ordatabricks_azure_storage_sas_key
.
Parameter Name | Description |
---|---|
databricks_azure_storage_account_name | The Azure Storage account name. |
databricks_azure_storage_account_key | The Azure Storage key for accessing the storage account. |
databricks_azure_storage_client_id | The Service Principal client ID for accessing the storage account. |
databricks_azure_storage_client_secret | The Service Principal client secret for accessing the storage account. |
databricks_azure_storage_sas_key | The shared access signature key for accessing the storage account. |
databricks_azure_storage_endpoint | Optional. The endpoint for the Azure Blob storage account. |
Google Storage (GCS)​
Parameter Name | Description |
---|---|
google_service_account | Filesystem path to the Google service account JSON key file. |
Examples​
Spark Connect​
- from: databricks:spiceai.datasets.my_spark_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: spark_connect
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_cluster_id: 1234-567890-abcde123
databricks_token: ${secrets:my_token}
Delta Lake (S3)​
- from: databricks:spiceai.datasets.my_delta_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_aws_region: us-west-2 # Optional
databricks_aws_access_key_id: ${secrets:aws_access_key_id}
databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
databricks_aws_endpoint: s3.us-west-2.amazonaws.com # Optional
Delta Lake (Azure Blobs)​
- from: databricks:spiceai.datasets.my_adls_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
# Account Name + Key
databricks_azure_storage_account_name: my_account
databricks_azure_storage_account_key: ${secrets:my_key}
# OR Service Principal + Secret
databricks_azure_storage_client_id: my_client_id
databricks_azure_storage_client_secret: ${secrets:my_secret}
# OR SAS Key
databricks_azure_storage_sas_key: my_sas_key
Delta Lake (GCP)​
- from: databricks:spiceai.datasets.my_gcp_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_google_service_account_path: /path/to/service-account.json
Types​
mode: delta_lake​
The table below shows the Databricks (mode: delta_lake) data types supported, along with the type mapping to Apache Arrow types in Spice.
Databricks SQL Type | Arrow Type |
---|---|
STRING | Utf8 |
BIGINT | Int64 |
INT | Int32 |
SMALLINT | Int16 |
TINYINT | Int8 |
FLOAT | Float32 |
DOUBLE | Float64 |
BOOLEAN | Boolean |
BINARY | Binary |
DATE | Date32 |
TIMESTAMP | Timestamp(Microsecond, Some("UTC")) |
TIMESTAMP_NTZ | Timestamp(Microsecond, None) |
DECIMAL | Decimal128 |
ARRAY | List |
STRUCT | Struct |
MAP | Map |
Secrets​
Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the secret stores documentation. Additionally, learn how to use referenced secrets in component parameters by visiting the using referenced secrets guide.
Limitations​
-
Databricks connector (mode: delta_lake) does not support reading Delta tables with the
V2Checkpoint
feature enabled. To use the Databricks connector (mode: delta_lake) with such tables, drop theV2Checkpoint
feature by executing the following command:ALTER TABLE <table-name> DROP FEATURE v2Checkpoint [TRUNCATE HISTORY];
For more details on dropping Delta table features, refer to the official documentation: Drop Delta table features
-
When using
mode: spark_connect
, correlated scalar subqueries can only be used in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands. Spark Docs
When using the Databricks (mode: delta_lake) Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Memory limitations can be mitigated by storing acceleration data on disk, which is supported by duckdb
and sqlite
accelerators by specifying mode: file
.
- The Databricks Connector (
mode: spark_connect
) does not yet support streaming query results from Spark.
Cookbook​
- A cookbook recipe to configure Databricks as data connector in Spice under
delta_lake
mode. Spice on Databricks (mode: delta_lake)