Version: v1.10

Data Connectors

Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.

Supported Data Connectors include:

| Name | Description | Status | Protocol/Format |
| ---- | ----------- | ------ | --------------- |
| `postgres` | PostgreSQL, Amazon Redshift | Stable | PostgreSQL wire protocol |
| `mysql` | MySQL | Stable | |
| `s3` | S3 | Stable | Parquet, CSV, JSON |
| `file` | File | Stable | Parquet, CSV, JSON |
| `duckdb` | DuckDB | Stable | Embedded |
| `dremio` | Dremio | Stable | Arrow Flight |
| `spice.ai` | Spice.ai OSS & Cloud | Stable | Arrow Flight |
| `databricks` (mode: `delta_lake`) | Databricks | Stable | S3/Delta Lake |
| `delta_lake` | Delta Lake | Stable | Delta Lake |
| `github` | GitHub | Stable | GitHub API |
| `graphql` | GraphQL | Release Candidate | JSON |
| `databricks` (mode: `spark_connect`) | Databricks | Beta | Spark Connect |
| `flightsql` | FlightSQL | Beta | Arrow Flight SQL |
| `mssql` | Microsoft SQL Server | Beta | Tabular Data Stream (TDS) |
| `odbc` | ODBC | Beta | ODBC |
| `snowflake` | Snowflake | Beta | Arrow |
| `spark` | Spark | Beta | Spark Connect |
| `iceberg` | Apache Iceberg | Beta | Parquet |
| `abfs` | Azure BlobFS | Alpha | Parquet, CSV, JSON |
| `ftp`, `sftp` | FTP/SFTP | Alpha | Parquet, CSV, JSON |
| `glue` | Glue | Alpha | Iceberg, Parquet, CSV |
| `http`, `https` | HTTP(s) | Alpha | Parquet, CSV, JSON |
| `imap` | IMAP | Alpha | IMAP Emails |
| `localpod` | Local dataset replication | Alpha | |
| `oracle` | Oracle | Alpha | Oracle ODPI-C |
| `sharepoint` | Microsoft SharePoint | Alpha | Unstructured UTF-8 documents |
| `clickhouse` | ClickHouse | Alpha | |
| `debezium` | Debezium CDC | Alpha | Kafka + JSON |
| `kafka` | Kafka | Alpha | Kafka + JSON |
| `dynamodb` | DynamoDB | Release Candidate | |
| `mongodb` | MongoDB | Alpha | |
| `elasticsearch` | Elasticsearch | Roadmap | |
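Once a connector is configured as a dataset, it can be queried with federated SQL. As a minimal sketch, a PostgreSQL-backed dataset might look like the following (the `pg_*` parameter names and all connection values are illustrative placeholders, not a definitive configuration):

```yaml
datasets:
  - from: postgres:public.orders   # schema.table in the source database
    name: orders
    params:
      pg_host: localhost           # placeholder connection details
      pg_port: "5432"
      pg_db: shop
      pg_user: spice
      pg_pass: ${secrets:pg_pass}  # load the password from a secret store
```

The dataset is then queryable by name, e.g. `SELECT * FROM orders`.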

Object Store File Formats

For data connectors that are object store compatible, if a folder path is provided, the file format must be specified with `params.file_format`.

If a single file is provided, the file format is inferred from the file extension, and `params.file_format` is unnecessary.

For example, specifying the format for a folder of Parquet files:

```yaml
datasets:
  - from: s3://bucket/data/sales/
    name: sales
    params:
      file_format: parquet
```

When connecting to a specific file, the format is inferred from the file extension:

```yaml
datasets:
  - from: sftp://files.example.com/reports/quarterly.parquet
    name: quarterly_report
```

Supported Formats

| Name | Parameter | Status | Description |
| ---- | --------- | ------ | ----------- |
| Apache Parquet | `file_format: parquet` | Stable | Columnar format optimized for analytics |
| [CSV](../../reference/file_format.md#csv) | `file_format: csv` | Stable | Comma-separated values |
| JSON | `file_format: json` | Roadmap | JavaScript Object Notation |
| Apache Iceberg | `file_format: iceberg` | Roadmap | Open table format for large analytic datasets |
| Microsoft Excel | `file_format: xlsx` | Roadmap | Excel spreadsheet format |
| Markdown | `file_format: md` | Stable | Plain text with formatting (document format) |
| Text | `file_format: txt` | Stable | Plain text files (document format) |
| PDF | `file_format: pdf` | Alpha | Portable Document Format (document format) |
| Microsoft Word | `file_format: docx` | Alpha | Word document format (document format) |

Format-Specific Parameters

File formats support additional parameters for fine-grained control. Common examples include:

| Parameter | Applies To | Description |
| --------- | ---------- | ----------- |
| `csv_has_header` | CSV | Whether the first row contains column headers |
| `csv_delimiter` | CSV | Field delimiter character (default: `,`) |
| `csv_quote` | CSV | Quote character for fields containing delimiters |

For complete format options, see the [File Formats Reference](../../reference/file_format).
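For instance, a semicolon-delimited CSV export might be configured as follows (the bucket and path are illustrative):

```yaml
datasets:
  - from: s3://bucket/data/exports/
    name: exports
    params:
      file_format: csv
      csv_has_header: "true"   # first row holds column names
      csv_delimiter: ";"       # fields are semicolon-separated
```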

Applicable Connectors

The following data connectors support file format configuration:

| Connector Type | Connectors |
| -------------- | ---------- |
| Object Stores | S3, Azure Blob (ABFS), GCS, HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |
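As an illustration, an HTTP-backed dataset pointing at a single CSV file needs no `file_format` parameter, since the format is inferred from the `.csv` extension (the URL below is hypothetical):

```yaml
datasets:
  - from: https://example.com/reports/summary.csv
    name: summary
```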

Hive Partitioning

File-based connectors support Hive-style partitioning, which extracts partition columns from folder names. Enable it with `hive_partitioning_enabled: true`.

Given a folder structure:

```
/data/
  year=2024/
    month=01/
      data.parquet
    month=02/
      data.parquet
```

Configure the dataset:

```yaml
datasets:
  - from: s3://bucket/data/
    name: partitioned_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```

Query with partition filters:

```sql
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
```

Partition pruning improves query performance by reading only the relevant files.

| Name | Parameter | Supported | Is Document Format |
| ---- | --------- | --------- | ------------------ |
| Apache Parquet | `file_format: parquet` | ✅ | |
| [CSV](../../reference/file_format.md#csv) | `file_format: csv` | ✅ | |
| Apache Iceberg | `file_format: iceberg` | Roadmap | |
| JSON | `file_format: json` | Roadmap | |
| Microsoft Excel | `file_format: xlsx` | Roadmap | |
| Markdown | `file_format: md` | ✅ | ✅ |
| Text | `file_format: txt` | ✅ | ✅ |
| PDF | `file_format: pdf` | Alpha | ✅ |
| Microsoft Word | `file_format: docx` | Alpha | ✅ |

File formats support additional parameters in `params` (such as `csv_has_header`), described in the [File Formats Reference](../../reference/file_format).

If a format is a document format, each file is treated as a document, as described in Document Support below.

Note: Document formats in Alpha (e.g. `pdf`, `docx`) may not parse all structure or text from the underlying documents correctly.

Document Support

If a Data Connector supports documents, then when the appropriate file format is specified (see above), each file is treated as a row in the table, with the contents of the file in the `content` column. Additional columns exist, depending on the data connector.

Example

Consider a local filesystem:

```shell
>>> ls -la
total 232
drwxr-sr-x@ 22 jeadie staff  704 30 Jul 13:12 .
drwxr-sr-x@ 18 jeadie staff  576 30 Jul 13:12 ..
-rw-r--r--@  1 jeadie staff 1329 15 Jan  2024 DR-000-Template.md
-rw-r--r--@  1 jeadie staff 4966 11 Aug  2023 DR-001-Dremio-Architecture.md
-rw-r--r--@  1 jeadie staff 2307 28 Jul  2023 DR-002-Data-Completeness.md
```

And the spicepod:

```yaml
datasets:
  - name: my_documents
    from: file:docs/decisions/
    params:
      file_format: md
```

A Document table will be created.

```
>>> SELECT * FROM my_documents LIMIT 3
+----------------------------------------------------+--------------------------------------------------+
| location                                           | content                                          |
+----------------------------------------------------+--------------------------------------------------+
| Users/docs/decisions/DR-000-Template.md            | # DR-000: DR Template                            |
|                                                    | **Date:** <>                                     |
|                                                    | **Decision Makers:**                             |
|                                                    | - @<>                                            |
|                                                    | - @<>                                            |
|                                                    | ...                                              |
| Users/docs/decisions/DR-001-Dremio-Architecture.md | # DR-001: Add "Cached" Dremio Dataset            |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | We use [Dremio](https://www.dremio.com/) to p... |
| Users/docs/decisions/DR-002-Data-Completeness.md   | # DR-002: Append-Only Data Completeness          |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | Our Ethereum append-only dataset is incomple...  |
+----------------------------------------------------+--------------------------------------------------+
```

Data Connector Docs