S3 Data Connector

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder path is specified as the dataset source, all files within the folder will be loaded.

File formats are specified using the file_format parameter, as described in Object Store File Formats.

datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet

Configuration

from

S3-compatible URI to a folder or file, in the format s3://<bucket>/<path>

Example: from: s3://my-bucket/path/to/file.parquet

name

The dataset name. This will be used as the table name within Spice.

Example:

datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv

SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+

params

| Parameter Name | Description |
| --- | --- |
| file_format | Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet, csv, json. Refer to Object Store File Formats for details. |
| s3_endpoint | S3 endpoint URL (e.g., for MinIO). Defaults to the region endpoint. E.g. s3_endpoint: https://my.minio.server |
| s3_region | S3 bucket region. Default: us-east-1. |
| client_timeout | Timeout for S3 operations. Default: 30s. |
| hive_partitioning_enabled | Enable hive-style partitioning inferred from the folder structure. Default: false. |
| s3_auth | Authentication type. Options: public, key, and iam_role. Defaults to public if s3_key and s3_secret are not provided; otherwise defaults to key. |
| s3_key | Access key (e.g. AWS_ACCESS_KEY_ID for AWS). |
| s3_secret | Secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS). |
| allow_http | Allow insecure HTTP connections to s3_endpoint. Default: false. |

For additional CSV parameters, see CSV Parameters.
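
A sketch combining several of these parameters in one dataset (bucket, path, and values are illustrative):

datasets:
  - from: s3://my-bucket/events/
    name: events
    params:
      file_format: csv
      s3_region: us-west-2
      client_timeout: 60s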

Authentication

No authentication is required for public buckets. For private buckets, set s3_auth to key and provide s3_key and s3_secret, or set s3_auth to iam_role to use the AWS IAM role of the running instance (this also covers Kubernetes Service Accounts with assigned IAM roles).
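
For example, a minimal sketch of key-based authentication, assuming the access key and secret are stored in a secret store (bucket name and secret names are illustrative; see the Secrets section below):

datasets:
  - from: s3://my-private-bucket/data/
    name: private_data
    params:
      file_format: parquet
      s3_auth: key
      s3_key: ${secrets:S3_KEY}       # illustrative secret reference
      s3_secret: ${secrets:S3_SECRET} # illustrative secret reference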

Minimum IAM policy for S3 access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
    }
  ]
}
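
With a policy like this attached to the instance's IAM role, a dataset can then be configured with s3_auth: iam_role; a sketch (dataset name and path are illustrative):

datasets:
  - from: s3://company-bucketname-datasets/data/
    name: secure_data
    params:
      file_format: parquet
      s3_auth: iam_role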

Types

Refer to Object Store Data Types for the data type mapping from object store files to Arrow data types.

Examples

Public Bucket Example

Create a dataset named taxi_trips from a public S3 folder.

- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet

MinIO Example

Create a dataset named cool_dataset from a Parquet file stored in MinIO.

- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: http://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
    allow_http: true

Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet

Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.

version: v1
kind: Spicepod
name: hive_data

datasets:
  - from: s3://spiceai-public-datasets/hive_partitioned_data/
    name: hive_data_infer
    params:
      file_format: parquet
      hive_partitioning_enabled: true
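
Once loaded, the inferred partition columns can be used like ordinary columns, including in filters that prune partitions. A sketch of such a query, assuming the inferred partition values are read as strings (hence the quoted literals):

SELECT year, month, day, COUNT(*) AS row_count
FROM hive_data_infer
WHERE year = '2024' AND month = '03'
GROUP BY year, month, day;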

Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the secret stores documentation. Additionally, learn how to use referenced secrets in component parameters by visiting the using referenced secrets guide.

Limitations

Performance Considerations

When using the S3 Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

Memory limitations can be mitigated by storing acceleration data on disk, which the duckdb and sqlite accelerators support via mode: file, as sketched below.
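
For example, a minimal sketch of a disk-backed DuckDB acceleration (bucket, path, and dataset name are illustrative):

datasets:
  - from: s3://my-bucket/large_dataset/
    name: large_dataset
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: duckdb
      mode: file # persist accelerated data to disk instead of memory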

Each query retrieves data from the S3 source, which might result in significant network requests and bandwidth consumption. This can affect network performance and incur costs related to data transfer from S3.

Cookbook