Version: v1.10

Data Connectors

Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.

Supported Data Connectors include:

| Name | Description | Status | Protocol/Format |
| ---- | ----------- | ------ | --------------- |
| `postgres` | PostgreSQL, Amazon Redshift | Stable | PostgreSQL wire protocol |
| `mysql` | MySQL | Stable | |
| `s3` | S3 | Stable | Parquet, CSV, JSON |
| `file` | File | Stable | Parquet, CSV, JSON |
| `duckdb` | DuckDB | Stable | Embedded |
| `dremio` | Dremio | Stable | Arrow Flight |
| `spice.ai` | Spice.ai OSS & Cloud | Stable | Arrow Flight |
| `databricks` (mode: `delta_lake`) | Databricks | Stable | S3/Delta Lake |
| `delta_lake` | Delta Lake | Stable | Delta Lake |
| `github` | GitHub | Stable | GitHub API |
| `graphql` | GraphQL | Release Candidate | JSON |
| `databricks` (mode: `spark_connect`) | Databricks | Beta | Spark Connect |
| `flightsql` | FlightSQL | Beta | Arrow Flight SQL |
| `mssql` | Microsoft SQL Server | Beta | Tabular Data Stream (TDS) |
| `odbc` | ODBC | Beta | ODBC |
| `snowflake` | Snowflake | Beta | Arrow |
| `spark` | Spark | Beta | Spark Connect |
| `iceberg` | Apache Iceberg | Beta | Parquet |
| `abfs` | Azure BlobFS | Alpha | Parquet, CSV, JSON |
| `ftp`, `sftp` | FTP/SFTP | Alpha | Parquet, CSV, JSON |
| `glue` | Glue | Alpha | Iceberg, Parquet, CSV |
| `http`, `https` | HTTP(s) | Alpha | Parquet, CSV, JSON |
| `imap` | IMAP | Alpha | IMAP Emails |
| `localpod` | Local dataset replication | Alpha | |
| `oracle` | Oracle | Alpha | Oracle ODPI-C |
| `sharepoint` | Microsoft SharePoint | Alpha | Unstructured UTF-8 documents |
| `clickhouse` | ClickHouse | Alpha | |
| `debezium` | Debezium CDC | Alpha | Kafka + JSON |
| `kafka` | Kafka | Alpha | Kafka + JSON |
| `dynamodb` | DynamoDB | Release Candidate | |
| `mongodb` | MongoDB | Alpha | |
| `elasticsearch` | Elasticsearch | Roadmap | |
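Once a connector is configured as a dataset, it can be queried with federated SQL. As a minimal sketch, a PostgreSQL-backed dataset might look like the following (the `pg_*` parameter names and all connection values are illustrative placeholders, not a definitive configuration):

```yaml
datasets:
  - from: postgres:public.orders   # schema.table in the source database
    name: orders
    params:
      pg_host: localhost           # placeholder connection details
      pg_port: "5432"
      pg_db: shop
      pg_user: spice
      pg_pass: ${secrets:pg_pass}  # load the password from a secret store
```

The dataset is then queryable by name, e.g. `SELECT * FROM orders`.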

Object Store File Formats

For data connectors that are object store compatible, if a folder path is provided, the file format must be specified with `params.file_format`.

If a single file is provided, the file format is inferred from the file extension, and `params.file_format` is unnecessary.

For example, specifying the format for a folder of Parquet files:

```yaml
datasets:
  - from: s3://bucket/data/sales/
    name: sales
    params:
      file_format: parquet
```

When connecting to a specific file, the format is inferred from the file extension:

```yaml
datasets:
  - from: sftp://files.example.com/reports/quarterly.parquet
    name: quarterly_report
```

Supported Formats

| Name | Parameter | Status | Description |
| ---- | --------- | ------ | ----------- |
| Apache Parquet | `file_format: parquet` | Stable | Columnar format optimized for analytics |
| [CSV](../../reference/file_format.md#csv) | `file_format: csv` | Stable | Comma-separated values |
| JSON | `file_format: json` | Roadmap | JavaScript Object Notation |
| Apache Iceberg | `file_format: iceberg` | Roadmap | Open table format for large analytic datasets |
| Microsoft Excel | `file_format: xlsx` | Roadmap | Excel spreadsheet format |
| Markdown | `file_format: md` | Stable | Plain text with formatting (document format) |
| Text | `file_format: txt` | Stable | Plain text files (document format) |
| PDF | `file_format: pdf` | Alpha | Portable Document Format (document format) |
| Microsoft Word | `file_format: docx` | Alpha | Word document format (document format) |

Format-Specific Parameters

File formats support additional parameters for fine-grained control. Common examples include:

| Parameter | Applies To | Description |
| --------- | ---------- | ----------- |
| `csv_has_header` | CSV | Whether the first row contains column headers |
| `csv_delimiter` | CSV | Field delimiter character (default: `,`) |
| `csv_quote` | CSV | Quote character for fields containing delimiters |

For complete format options, see the [File Formats Reference](../../reference/file_format).
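For instance, a semicolon-delimited CSV export might be configured as follows (the bucket and path are illustrative):

```yaml
datasets:
  - from: s3://bucket/data/exports/
    name: exports
    params:
      file_format: csv
      csv_has_header: "true"   # first row holds column names
      csv_delimiter: ";"       # fields are semicolon-separated
```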

Applicable Connectors

The following data connectors support file format configuration:

| Connector Type | Connectors |
| -------------- | ---------- |
| Object Stores | S3, Azure Blob (ABFS), GCS, HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |
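As an illustration, an HTTP-backed dataset pointing at a single CSV file needs no `file_format` parameter, since the format is inferred from the `.csv` extension (the URL below is hypothetical):

```yaml
datasets:
  - from: https://example.com/reports/summary.csv
    name: summary
```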

Hive Partitioning

File-based connectors support Hive-style partitioning, which extracts partition columns from folder names. Enable it with `hive_partitioning_enabled: true`.

Given a folder structure:

```
/data/
  year=2024/
    month=01/
      data.parquet
    month=02/
      data.parquet
```

Configure the dataset:

```yaml
datasets:
  - from: s3://bucket/data/
    name: partitioned_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```

Query with partition filters:

```sql
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
```

Partition pruning improves query performance by reading only the relevant files.

| Name | Parameter | Supported | Is Document Format |
| ---- | --------- | --------- | ------------------ |
| Apache Parquet | `file_format: parquet` | ✅ | |
| [CSV](../../reference/file_format.md#csv) | `file_format: csv` | ✅ | |
| Apache Iceberg | `file_format: iceberg` | Roadmap | |
| JSON | `file_format: json` | Roadmap | |
| Microsoft Excel | `file_format: xlsx` | Roadmap | |
| Markdown | `file_format: md` | ✅ | ✅ |
| Text | `file_format: txt` | ✅ | ✅ |
| PDF | `file_format: pdf` | Alpha | ✅ |
| Microsoft Word | `file_format: docx` | Alpha | ✅ |

File formats support additional parameters in `params` (such as `csv_has_header`), described in the [File Formats Reference](../../reference/file_format).

If a format is a document format, each file is treated as a document, as described in Document Support below.

Note: Document formats in Alpha (e.g. `pdf`, `docx`) may not parse all structure or text from the underlying documents correctly.

Document Support

If a Data Connector supports documents, then when the appropriate file format is specified (see above), each file is treated as a row in the table, with the contents of the file in the `content` column. Additional columns exist, depending on the data connector.

Example

Consider a local filesystem:

```shell
>>> ls -la
total 232
drwxr-sr-x@ 22 jeadie staff  704 30 Jul 13:12 .
drwxr-sr-x@ 18 jeadie staff  576 30 Jul 13:12 ..
-rw-r--r--@  1 jeadie staff 1329 15 Jan  2024 DR-000-Template.md
-rw-r--r--@  1 jeadie staff 4966 11 Aug  2023 DR-001-Dremio-Architecture.md
-rw-r--r--@  1 jeadie staff 2307 28 Jul  2023 DR-002-Data-Completeness.md
```

And the spicepod:

```yaml
datasets:
  - name: my_documents
    from: file:docs/decisions/
    params:
      file_format: md
```

A Document table will be created.

```
>>> SELECT * FROM my_documents LIMIT 3
+----------------------------------------------------+--------------------------------------------------+
| location                                           | content                                          |
+----------------------------------------------------+--------------------------------------------------+
| Users/docs/decisions/DR-000-Template.md            | # DR-000: DR Template                            |
|                                                    | **Date:** <>                                     |
|                                                    | **Decision Makers:**                             |
|                                                    | - @<>                                            |
|                                                    | - @<>                                            |
|                                                    | ...                                              |
| Users/docs/decisions/DR-001-Dremio-Architecture.md | # DR-001: Add "Cached" Dremio Dataset            |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | We use [Dremio](https://www.dremio.com/) to p... |
| Users/docs/decisions/DR-002-Data-Completeness.md   | # DR-002: Append-Only Data Completeness          |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | Our Ethereum append-only dataset is incomple...  |
+----------------------------------------------------+--------------------------------------------------+
```

Data Connector Docs