# Data Connectors
Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.
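For example, a minimal spicepod dataset configuration for a PostgreSQL source might look like the following (host, database, and table names are placeholders, not a definitive configuration):

```yaml
# Illustrative only: connection details below are placeholders.
datasets:
  - from: postgres:public.customers
    name: customers
    params:
      pg_host: localhost
      pg_port: "5432"
      pg_db: mydb
```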
Supported Data Connectors include:
| Name | Description | Status | Protocol/Format |
|---|---|---|---|
| postgres | PostgreSQL, Amazon Redshift | Stable | PostgreSQL-wire |
| mysql | MySQL | Stable | |
| s3 | S3 | Stable | Parquet, CSV, JSON |
| file | File | Stable | Parquet, CSV, JSON |
| duckdb | DuckDB | Stable | Embedded |
| dremio | Dremio | Stable | Arrow Flight |
| spice.ai | Spice.ai OSS & Cloud | Stable | Arrow Flight |
| databricks (mode: delta_lake) | Databricks | Stable | S3/Delta Lake |
| delta_lake | Delta Lake | Stable | Delta Lake |
| github | GitHub | Stable | GitHub API |
| graphql | GraphQL | Release Candidate | JSON |
| dynamodb | DynamoDB | Release Candidate | |
| databricks (mode: spark_connect) | Databricks | Beta | Spark Connect |
| flightsql | FlightSQL | Beta | Arrow Flight SQL |
| mssql | Microsoft SQL Server | Beta | Tabular Data Stream (TDS) |
| odbc | ODBC | Beta | ODBC |
| snowflake | Snowflake | Beta | Arrow |
| spark | Spark | Beta | Spark Connect |
| iceberg | Apache Iceberg | Beta | Parquet |
| abfs | Azure BlobFS | Alpha | Parquet, CSV, JSON |
| ftp, sftp | FTP/SFTP | Alpha | Parquet, CSV, JSON |
| smb | SMB | Alpha | Parquet, CSV, JSON |
| nfs | NFS | Alpha | Parquet, CSV, JSON |
| glue | Glue | Alpha | Iceberg, Parquet, CSV |
| http, https | HTTP(s) | Alpha | Parquet, CSV, JSON |
| imap | IMAP | Alpha | IMAP Emails |
| localpod | Local dataset replication | Alpha | |
| oracle | Oracle | Alpha | Oracle ODPI-C |
| sharepoint | Microsoft SharePoint | Alpha | Unstructured UTF-8 documents |
| clickhouse | ClickHouse | Alpha | |
| debezium | Debezium CDC | Alpha | Kafka + JSON |
| kafka | Kafka | Alpha | Kafka + JSON |
| mongodb | MongoDB | Alpha | |
| scylladb | ScyllaDB | Alpha | CQL, Alternator (DynamoDB) |
| elasticsearch | Elasticsearch | Roadmap | |
## File Formats
Data connectors that read files from object stores (S3, Azure Blob, GCS) or network-attached storage (FTP, SFTP, SMB, NFS) support a variety of file formats. These connectors work with both structured data formats (Parquet, CSV) and document formats (Markdown, PDF).
### Specifying File Format
When connecting to a directory, specify the file format using `params.file_format`:
```yaml
datasets:
  - from: s3://bucket/data/sales/
    name: sales
    params:
      file_format: parquet
```
When connecting to a specific file, the format is inferred from the file extension:
```yaml
datasets:
  - from: sftp://files.example.com/reports/quarterly.parquet
    name: quarterly_report
```
### Supported Formats
| Name | Parameter | Status | Description |
|---|---|---|---|
| Apache Parquet | file_format: parquet | Stable | Columnar format optimized for analytics |
| CSV | file_format: csv | Stable | Comma-separated values |
| JSON | file_format: json | Roadmap | JavaScript Object Notation |
| Apache Iceberg | file_format: iceberg | Roadmap | Open table format for large analytic datasets |
| Microsoft Excel | file_format: xlsx | Roadmap | Excel spreadsheet format |
| Markdown | file_format: md | Stable | Plain text with formatting (document format) |
| Text | file_format: txt | Stable | Plain text files (document format) |
| PDF | file_format: pdf | Alpha | Portable Document Format (document format) |
| Microsoft Word | file_format: docx | Alpha | Word document format (document format) |
### Format-Specific Parameters
File formats support additional parameters for fine-grained control. Common examples include:
| Parameter | Applies To | Description |
|---|---|---|
| csv_has_header | CSV | Whether the first row contains column headers |
| csv_delimiter | CSV | Field delimiter character (default: `,`) |
| csv_quote | CSV | Quote character for fields containing delimiters |
For complete format options, see File Formats Reference.
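As an illustration, a CSV dataset with a header row and a semicolon delimiter could combine these parameters (bucket and path are placeholders):

```yaml
datasets:
  - from: s3://bucket/data/report/
    name: report
    params:
      file_format: csv
      csv_has_header: "true"
      csv_delimiter: ";"
```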
### Applicable Connectors
The following data connectors support file format configuration:
| Connector Type | Connectors |
|---|---|
| Object Stores | S3, Azure Blob (ABFS), GCS, HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |
## Hive Partitioning
File-based connectors support Hive-style partitioning, which extracts partition columns from folder names. Enable it with `hive_partitioning_enabled: true`.
Given a folder structure:
```
/data/
  year=2024/
    month=01/
      data.parquet
    month=02/
      data.parquet
```
Configure the dataset:
```yaml
datasets:
  - from: s3://bucket/data/
    name: partitioned_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```
Query with partition filters:
```sql
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
```
Partition pruning improves query performance by reading only the relevant files.
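Conceptually, partition extraction parses `key=value` path segments into columns; the minimal Python sketch below illustrates the idea (it is not Spice's implementation):

```python
# Sketch of Hive-style partition extraction: derive partition columns
# from key=value folder segments in a file path.
from pathlib import PurePosixPath

def hive_partitions(path: str) -> dict[str, str]:
    """Return {column: value} for each key=value segment in the path."""
    parts = PurePosixPath(path).parts
    return {k: v for k, v in (p.split("=", 1) for p in parts if "=" in p)}

print(hive_partitions("data/year=2024/month=01/data.parquet"))
# → {'year': '2024', 'month': '01'}
```

A filter such as `year = '2024'` can then skip any file whose extracted partition values do not match, which is what makes pruning effective.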
The table below summarizes file format support and whether each format is treated as a document format:
| Name | Parameter | Supported | Is Document Format |
|---|---|---|---|
| Apache Parquet | file_format: parquet | ✅ | ❌ |
| CSV | file_format: csv | ✅ | ❌ |
| Apache Iceberg | file_format: iceberg | Roadmap | ❌ |
| JSON | file_format: json | Roadmap | ❌ |
| Microsoft Excel | file_format: xlsx | Roadmap | ❌ |
| Markdown | file_format: md | ✅ | ✅ |
| Text | file_format: txt | ✅ | ✅ |
| PDF | file_format: pdf | Alpha | ✅ |
| Microsoft Word | file_format: docx | Alpha | ✅ |
## Document Formats
Document formats (Markdown, Text, PDF, Word) are handled differently from structured data formats. Each file becomes a row in the resulting table, with the file contents stored in a content column.
Document formats in Alpha (PDF, DOCX) may not parse all structure or text from the underlying documents correctly.
### Document Table Schema
| Column | Type | Description |
|---|---|---|
| location | String | Path to the source file |
| content | String | Full text content of the document |
### Example
Consider a local filesystem:
```
>>> ls -la
total 232
drwxr-sr-x@ 22 jeadie staff  704 30 Jul 13:12 .
drwxr-sr-x@ 18 jeadie staff  576 30 Jul 13:12 ..
-rw-r--r--@  1 jeadie staff 1329 15 Jan  2024 DR-000-Template.md
-rw-r--r--@  1 jeadie staff 4966 11 Aug  2023 DR-001-Dremio-Architecture.md
-rw-r--r--@  1 jeadie staff 2307 28 Jul  2023 DR-002-Data-Completeness.md
```
And the spicepod:
```yaml
datasets:
  - name: my_documents
    from: file:docs/decisions/
    params:
      file_format: md
```
A document table will be created:
```
>>> SELECT * FROM my_documents LIMIT 3
+----------------------------------------------------+--------------------------------------------------+
| location                                           | content                                          |
+----------------------------------------------------+--------------------------------------------------+
| Users/docs/decisions/DR-000-Template.md            | # DR-000: DR Template                            |
|                                                    | **Date:** <>                                     |
|                                                    | **Decision Makers:**                             |
|                                                    | - @<>                                            |
|                                                    | - @<>                                            |
|                                                    | ...                                              |
| Users/docs/decisions/DR-001-Dremio-Architecture.md | # DR-001: Add "Cached" Dremio Dataset            |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | We use [Dremio](https://www.dremio.com/) to p... |
| Users/docs/decisions/DR-002-Data-Completeness.md   | # DR-002: Append-Only Data Completeness          |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | Our Ethereum append-only dataset is incomple...  |
+----------------------------------------------------+--------------------------------------------------+
```
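The `content` column can then be filtered like any other string column; for example (the search term is illustrative):

```sql
SELECT location
FROM my_documents
WHERE content LIKE '%Dremio%';
```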
## Data Connector Docs
📄️ Redshift Data Connector
Connect to Amazon Redshift using the PostgreSQL connector in Spice.
📄️ Azure BlobFS Data Connector
Azure BlobFS Data Connector Documentation
📄️ ClickHouse Data Connector
ClickHouse Data Connector Documentation
📄️ Databricks Data Connector
Databricks Data Connector Documentation
📄️ Debezium Data Connector
Debezium Data Connector Documentation
📄️ Delta Lake Data Connector
Delta Lake Data Connector Documentation
📄️ Dremio Data Connector
Dremio Data Connector Documentation
📄️ DuckDB Data Connector
DuckDB Data Connector Documentation
📄️ DynamoDB Data Connector
DynamoDB Data Connector Documentation
📄️ File Data Connector
File Data Connector Documentation
📄️ Flight SQL Data Connector
Flight SQL Data Connector Documentation
📄️ FTP/SFTP Data Connector
FTP/SFTP Data Connector Documentation
📄️ GitHub Data Connector
GitHub Data Connector Documentation
📄️ Glue Data Connector
Glue Data Connector Documentation
📄️ GraphQL Data Connector
GraphQL Data Connector Documentation
📄️ HTTP(s) Data Connector
HTTP(s) Data Connector Documentation
📄️ Iceberg Data Connector
Connect to and query Apache Iceberg tables
📄️ IMAP Data Connector
IMAP Data Connector Documentation
📄️ Kafka Data Connector
Kafka Data Connector Documentation
📄️ Localpod Data Connector
Localpod Data Connector Documentation
📄️ Memory Data Connector
Memory Data Connector Documentation
📄️ MongoDB Data Connector
MongoDB Data Connector Documentation
📄️ Microsoft SQL Server
Microsoft SQL Server Data Connector
📄️ MySQL Data Connector
MySQL Data Connector Documentation
📄️ NFS Data Connector
NFS Data Connector Documentation
📄️ ODBC Data Connector
ODBC Data Connector Documentation
📄️ Oracle Data Connector
Oracle Data Connector Documentation
📄️ PostgreSQL Data Connector
PostgreSQL Data Connector Documentation
📄️ S3 Data Connector
S3 Data Connector Documentation
📄️ ScyllaDB Data Connector
ScyllaDB Data Connector Documentation
📄️ SharePoint Data Connector
SharePoint Data Connector Documentation
📄️ SMB Data Connector
SMB Data Connector Documentation
📄️ Snowflake Data Connector
Snowflake Data Connector Documentation
📄️ Apache Spark Connector
Apache Spark Connector Documentation
📄️ Spice.ai Data Connector
Spice.ai Data Connector Documentation
