Version: Latest (v1.11)

Data Connectors

Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.

Each connector is configured using the from field in a dataset definition. For example:

```yaml
datasets:
  - from: postgres:public.orders # Database connector
    name: orders
    params:
      pg_host: localhost
      pg_db: mydb
      pg_user: reader
      pg_pass: ${secrets:PG_PASS}

  - from: s3://my-bucket/events/ # Object storage connector
    name: events
    params:
      file_format: parquet
      s3_auth: iam_role
```
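
Once registered, datasets from different connectors can be queried and joined with standard SQL through the runtime. The following sketch assumes `order_id` and `event_type` columns that are not part of the example above:

```sql
-- Hypothetical federated join across the Postgres and S3 datasets
-- defined above (column names are illustrative).
SELECT o.order_id, e.event_type
FROM orders o
JOIN events e ON e.order_id = o.order_id
WHERE e.event_type = 'refund';
```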

Supported Data Connectors include:

| Name | Description | Status | Protocol/Format |
| --- | --- | --- | --- |
| postgres | PostgreSQL, Amazon Redshift | Stable | PostgreSQL-wire |
| mysql | MySQL | Stable | |
| s3 | S3 | Stable | Parquet, CSV, JSON |
| file | File | Stable | Parquet, CSV, JSON |
| duckdb | DuckDB | Stable | Embedded |
| dremio | Dremio | Stable | Arrow Flight |
| spice.ai | Spice.ai OSS & Cloud | Stable | Arrow Flight |
| databricks (mode: delta_lake) | Databricks | Stable | S3/Delta Lake |
| delta_lake | Delta Lake | Stable | Delta Lake |
| github | GitHub | Stable | GitHub API |
| graphql | GraphQL | Release Candidate | JSON |
| dynamodb | DynamoDB | Release Candidate | |
| databricks (mode: spark_connect) | Databricks | Beta | Spark Connect |
| flightsql | FlightSQL | Beta | Arrow Flight SQL |
| mssql | Microsoft SQL Server | Beta | Tabular Data Stream (TDS) |
| odbc | ODBC | Beta | ODBC |
| snowflake | Snowflake | Beta | Arrow |
| spark | Spark | Beta | Spark Connect |
| iceberg | Apache Iceberg | Beta | Parquet |
| abfs | Azure BlobFS | Alpha | Parquet, CSV, JSON |
| ftp, sftp | FTP/SFTP | Alpha | Parquet, CSV, JSON |
| smb | SMB | Alpha | Parquet, CSV, JSON |
| nfs | NFS | Alpha | Parquet, CSV, JSON |
| glue | Glue | Alpha | Iceberg, Parquet, CSV |
| http, https | HTTP(s) | Alpha | Parquet, CSV, JSON |
| imap | IMAP | Alpha | IMAP Emails |
| localpod | Local dataset replication | Alpha | |
| oracle | Oracle | Alpha | Oracle ODPI-C |
| sharepoint | Microsoft SharePoint | Alpha | Unstructured UTF-8 documents |
| clickhouse | ClickHouse | Alpha | |
| debezium | Debezium CDC | Alpha | Kafka + JSON |
| kafka | Kafka | Alpha | Kafka + JSON |
| mongodb | MongoDB | Alpha | |
| scylladb | ScyllaDB | Alpha | CQL, Alternator (DynamoDB) |
| elasticsearch | Elasticsearch | Roadmap | |

File Formats

Data connectors that read files from object stores (S3, Azure Blob, GCS) or network-attached storage (FTP, SFTP, SMB, NFS) support a variety of file formats. These connectors work with both structured data formats (Parquet, CSV) and document formats (Markdown, PDF).

Specifying File Format

When connecting to a directory, specify the file format using params.file_format:

```yaml
datasets:
  - from: s3://bucket/data/sales/
    name: sales
    params:
      file_format: parquet
```

When connecting to a specific file, the format is inferred from the file extension:

```yaml
datasets:
  - from: sftp://files.example.com/reports/quarterly.parquet
    name: quarterly_report
```

Supported Formats

| Name | Parameter | Status | Description |
| --- | --- | --- | --- |
| Apache Parquet | `file_format: parquet` | Stable | Columnar format optimized for analytics |
| CSV | `file_format: csv` | Stable | Comma-separated values |
| JSON | `file_format: json` | Stable | JavaScript Object Notation |
| Delta Lake | `file_format: delta` | Stable | Open table format with ACID transactions. Object stores only. |
| Apache Iceberg | `file_format: iceberg` | Beta | Open table format for large analytic datasets. Object stores only. Requires a catalog. |
| Microsoft Excel | `file_format: xlsx` | Roadmap | Excel spreadsheet format |
| Markdown | `file_format: md` | Stable | Plain text with formatting (document format) |
| Text | `file_format: txt` | Stable | Plain text files (document format) |
| PDF | `file_format: pdf` | Beta | Portable Document Format (document format) |
| Microsoft Word | `file_format: docx` | Alpha | Word document format (document format) |

Format-Specific Parameters

File formats support additional parameters for fine-grained control. Common examples include:

| Parameter | Applies To | Description |
| --- | --- | --- |
| `csv_has_header` | CSV | Whether the first row contains column headers |
| `csv_delimiter` | CSV | Field delimiter character (default: `,`) |
| `csv_quote` | CSV | Quote character for fields containing delimiters |
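
As a sketch, a CSV dataset using these parameters might be configured as follows (the bucket path, delimiter, and header settings are illustrative):

```yaml
datasets:
  - from: s3://bucket/data/exports/
    name: exports
    params:
      file_format: csv
      csv_has_header: true
      csv_delimiter: ";"
      csv_quote: '"'
```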

For complete format options, see File Formats Reference.

Applicable Connectors

The following data connectors support file format configuration:

| Connector Type | Connectors |
| --- | --- |
| Object Stores | S3, Azure Blob (ABFS), GCS, HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |

Hive Partitioning

File-based connectors support Hive-style partitioning, which extracts partition columns from folder names. Enable with hive_partitioning_enabled: true.

Given a folder structure:

```
/data/
  year=2024/
    month=01/
      data.parquet
    month=02/
      data.parquet
```

Configure the dataset:

```yaml
datasets:
  - from: s3://bucket/data/
    name: partitioned_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```

Query with partition filters:

```sql
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
```

Partition pruning improves query performance by reading only the relevant files.

Metadata Columns

File-based connectors can expose per-file object store metadata as virtual columns in the dataset schema. These columns are not stored in the data files — they are derived from object store file metadata at query time.

Available Columns

| Column | Type | Description |
| --- | --- | --- |
| `location` | `Utf8` | Full URI of the source file |
| `last_modified` | `Timestamp(µs, "UTC")` | When the file was last modified |
| `size` | `UInt64` | File size in bytes |

Enabling Metadata Columns

Metadata columns are enabled by adding a metadata section to the dataset definition with each desired column set to enabled:

```yaml
datasets:
  - from: s3://bucket/data/
    name: my_data
    params:
      file_format: parquet
    metadata:
      location: enabled
      last_modified: enabled
      size: enabled
```

Each column can be individually enabled or omitted:

```yaml
metadata:
  location: enabled # Only add the location column
```
note

If the data files already contain a column with the same name as a metadata column (e.g., a Parquet file with a size column), the metadata column is not added to avoid conflicts.

Querying Metadata Columns

Once enabled, metadata columns appear alongside the regular data columns:

```sql
SELECT * FROM my_data LIMIT 3;
```

```
+----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
| id | value   | year | month | day | last_modified        | location                                                      | size |
+----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
| 0  | value_0 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet       | 2317 |
| 1  | value_1 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet       | 2317 |
| 2  | value_2 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet       | 2317 |
+----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
```

Metadata columns can be used in filters, projections, aggregations, and joins like any other column:

```sql
-- Filter by file location
SELECT id, value FROM my_data
WHERE location = 's3://bucket/data/year=2022/month=1/day=1/data_0.parquet';

-- Find recently modified files
SELECT DISTINCT location, last_modified FROM my_data
WHERE last_modified > '2024-01-01T00:00:00Z';

-- Aggregate by file
SELECT location, COUNT(*) AS row_count, size
FROM my_data
GROUP BY location, size
ORDER BY location;
```

Applicable Connectors

Metadata columns are supported by all file-based connectors:

| Connector Type | Connectors |
| --- | --- |
| Object Stores | S3, Azure Blob (ABFS), HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |

Schema Inference

Spice infers the schema for each dataset from its data source at startup. The inferred schema defines the column names, data types, and nullability used by the dataset for the lifetime of that runtime process.

Schema inference happens once, when the dataset is first registered. Some connectors support tuning the inference behavior with connector-specific parameters:

| Connector | Parameter | Default | Description |
| --- | --- | --- | --- |
| Kafka | `schema_infer_max_records` | 10 | Number of messages sampled to infer the JSON schema |
| DynamoDB | `schema_infer_max_records` | 10 | Number of items sampled to infer the schema |
| MongoDB | `mongodb_num_docs_to_infer_schema` | 400 | Number of documents sampled to infer the schema |
| CSV files | `csv_schema_infer_max_records` | 1000 | Number of rows sampled to infer the CSV schema |

For connectors that read self-describing formats (Parquet, Arrow, Avro), the schema is read directly from file metadata and does not require sampling.
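
For example, to sample more rows when inferring the schema of a wide or sparse CSV dataset, the limit can be raised in the dataset's `params` (the bucket path and the value 5000 are illustrative):

```yaml
datasets:
  - from: s3://bucket/data/logs/
    name: logs
    params:
      file_format: csv
      csv_schema_infer_max_records: 5000
```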

Runtime Schema Changes

Spice does not apply schema changes at runtime. If the source schema changes while the runtime is running — for example, new columns are added, columns are removed, or data types change — subsequent data refreshes will fail with an error such as:

```
Failed to load data for dataset <name>: Cannot cast struct field ...
```

This behavior is by design. Blocking runtime schema evolution protects accelerated tables from unintentional or breaking schema changes that could corrupt data or produce unexpected query results.

To apply a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source and re-initializes the dataset with the updated column definitions.

Recommendation

Pin a known-good schema version in the data source or use the columns configuration to explicitly define the expected columns. This makes schema expectations explicit and produces clear errors if the source drifts.
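
A sketch of an explicit column definition, assuming the `columns` configuration accepts per-column names and types (the field names and types shown here are illustrative; consult the dataset reference for the exact schema):

```yaml
datasets:
  - from: postgres:public.orders
    name: orders
    columns: # illustrative sketch, not the verified field layout
      - name: order_id
        data_type: int64
      - name: total
        data_type: float64
```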

note

Runtime schema evolution controls are planned for a future release. When available, schema evolution will remain off by default.

| Name | Parameter | Supported | Is Document Format |
| --- | --- | --- | --- |
| Apache Parquet | `file_format: parquet` | ✅ | |
| CSV | `file_format: csv` | ✅ | |
| Delta Lake | `file_format: delta` | ✅ | |
| Apache Iceberg | `file_format: iceberg` | ✅ | |
| JSON | `file_format: json` | ✅ | |
| Microsoft Excel | `file_format: xlsx` | Roadmap | |
| Markdown | `file_format: md` | ✅ | ✅ |
| Text | `file_format: txt` | ✅ | ✅ |
| PDF | `file_format: pdf` | Beta | ✅ |
| Microsoft Word | `file_format: docx` | Alpha | ✅ |

Document Formats

Document formats (Markdown, Text, PDF, Word) are handled differently from structured data formats. Each file becomes a row in the resulting table, with the file contents stored in a content column.

Note

Document formats in Alpha (PDF, DOCX) may not parse all structure or text from the underlying documents correctly.

Document Table Schema

| Column | Type | Description |
| --- | --- | --- |
| `location` | String | Path to the source file |
| `content` | String | Full text content of the document |

Example

Consider a local filesystem:

```
>>> ls -la
total 232
drwxr-sr-x@ 22 jeadie staff  704 30 Jul 13:12 .
drwxr-sr-x@ 18 jeadie staff  576 30 Jul 13:12 ..
-rw-r--r--@  1 jeadie staff 1329 15 Jan  2024 DR-000-Template.md
-rw-r--r--@  1 jeadie staff 4966 11 Aug  2023 DR-001-Dremio-Architecture.md
-rw-r--r--@  1 jeadie staff 2307 28 Jul  2023 DR-002-Data-Completeness.md
```

And the spicepod:

```yaml
datasets:
  - name: my_documents
    from: file:docs/decisions/
    params:
      file_format: md
```

A document table is created:

```
>>> SELECT * FROM my_documents LIMIT 3
+----------------------------------------------------+--------------------------------------------------+
| location                                           | content                                          |
+----------------------------------------------------+--------------------------------------------------+
| Users/docs/decisions/DR-000-Template.md            | # DR-000: DR Template                            |
|                                                    | **Date:** <>                                     |
|                                                    | **Decision Makers:**                             |
|                                                    | - @<>                                            |
|                                                    | - @<>                                            |
|                                                    | ...                                              |
| Users/docs/decisions/DR-001-Dremio-Architecture.md | # DR-001: Add "Cached" Dremio Dataset            |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | We use [Dremio](https://www.dremio.com/) to p... |
| Users/docs/decisions/DR-002-Data-Completeness.md   | # DR-002: Append-Only Data Completeness          |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | Our Ethereum append-only dataset is incomple...  |
+----------------------------------------------------+--------------------------------------------------+
```
