S3 Data Connector
The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).
If a folder path is specified as the dataset source, all files within the folder will be loaded.
File formats are specified using the file_format parameter, as described in Object Store File Formats.
datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
Configuration
from
S3-compatible URI to a folder or file, in the format s3://<bucket>/<path>
Example: from: s3://my-bucket/path/to/file.parquet
name
The dataset name. This will be used as the table name within Spice.
Example:
datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
The dataset name cannot be a reserved keyword.
params
Parameter Name | Description |
---|---|
file_format | Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet, csv, json. Refer to Object Store File Formats for details. |
s3_endpoint | S3 endpoint URL (e.g., for MinIO). Defaults to the region endpoint. Example: s3_endpoint: https://my.minio.server |
s3_region | S3 bucket region. Default: us-east-1. |
client_timeout | Timeout for S3 operations. Default: 30s. |
hive_partitioning_enabled | Enables hive-style partitioning inferred from the folder structure. Defaults to false. |
s3_auth | Authentication type. Options: public, key, and iam_role. Defaults to public. If set to key, the s3_key and s3_secret parameters must also be set. If set to iam_role, credentials are loaded from environment variables or IAM roles (see Authentication for details). |
s3_key | Access key (e.g., AWS_ACCESS_KEY_ID for AWS). Requires s3_auth to be set to key. |
s3_secret | Secret key (e.g., AWS_SECRET_ACCESS_KEY for AWS). Requires s3_auth to be set to key. |
s3_session_token | Session token (e.g., AWS_SESSION_TOKEN for AWS) for temporary credentials. Requires s3_auth to be set to key. |
allow_http | Enables insecure HTTP connections to s3_endpoint. Defaults to false. |
schema_source_path | Specifies the URL used to infer the dataset schema. Defaults to the most recently modified file. |
For additional CSV parameters, see CSV Parameters.
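As an illustration of how several of these parameters combine, the sketch below overrides the region and timeout for a public bucket. The bucket name, path, dataset name, and the specific values (us-west-2, 60s) are placeholders chosen for the example, not recommendations.

```yaml
datasets:
  # Hypothetical bucket, path, and dataset name, for illustration only.
  - from: s3://my-bucket/path/to/data/
    name: my_dataset
    params:
      file_format: parquet   # stated explicitly because a folder URI has no file extension
      s3_region: us-west-2   # overrides the us-east-1 default
      client_timeout: 60s    # overrides the 30s default
```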
Authentication
No authentication is required for public endpoints. For private buckets, set s3_auth to key or iam_role.
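For key-based authentication, a dataset definition might look like the following sketch. The bucket, path, dataset name, and secret names (S3_KEY, S3_SECRET) are hypothetical, and the ${secrets:...} reference form is an assumption about the secret reference syntax; confirm the exact form in the using referenced secrets guide before relying on it.

```yaml
datasets:
  # Hypothetical private bucket and dataset name.
  - from: s3://my-private-bucket/path/to/data/
    name: private_dataset
    params:
      file_format: parquet
      s3_auth: key
      # Assumed secret reference syntax; S3_KEY and S3_SECRET are placeholder secret names.
      s3_key: ${secrets:S3_KEY}
      s3_secret: ${secrets:S3_SECRET}
```

Referencing secrets rather than inlining keys keeps credentials out of the Spicepod file.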
If s3_auth is set to iam_role, the connector will automatically load credentials from the following sources, in order:
- Environment Variables:
  - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
  - AWS_SESSION_TOKEN (if using temporary credentials)
- Shared AWS Config/Credentials Files:
  - Config file: ~/.aws/config (Linux/Mac) or %UserProfile%\.aws\config (Windows)
  - Credentials file: ~/.aws/credentials (Linux/Mac) or %UserProfile%\.aws\credentials (Windows)
  - The AWS_PROFILE environment variable can be used to specify a named profile; otherwise, the [default] profile is used.
  - Supports both static credentials and SSO sessions.
  - Example credentials file:
# Static credentials (in .aws/credentials)
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
# SSO profile (in .aws/config)
[profile sso-profile]
sso_start_url = https://my-sso-portal.awsapps.com/start
sso_region = us-west-2
sso_account_id = 123456789012
sso_role_name = MyRole
region = us-west-2
  Tip: To set up SSO authentication:
  - Run aws configure sso to configure a new SSO profile.
  - Use the profile by setting AWS_PROFILE=sso-profile.
  - Run aws sso login --profile sso-profile to start a new SSO session.
- AWS STS Web Identity Token Credentials:
  - Used primarily with OpenID Connect (OIDC) and OAuth.
  - Common in Kubernetes environments using IAM roles for service accounts (IRSA).
  - Relies on the environment variables AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN being present.
    - These environment variables are automatically injected by EKS when the IAM role is annotated on the pod service account.
- ECS Container Credentials:
  - Used when running in Amazon ECS containers.
  - Automatically uses the task's IAM role.
  - Retrieved from the ECS credential provider endpoint.
  - Relies on the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or AWS_CONTAINER_CREDENTIALS_FULL_URI, which are automatically injected by ECS.
- AWS EC2 Instance Metadata Service (IMDSv2):
  - Used when running on EC2 instances.
  - Automatically uses the instance's IAM role.
  - Retrieved securely using IMDSv2.
The connector will try each source in order until valid credentials are found. If no valid credentials are found, an authentication error will be returned.
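A dataset that relies on this credential chain only needs s3_auth set to iam_role; no key parameters appear in the Spicepod. A minimal sketch, with a hypothetical bucket, path, dataset name, and region:

```yaml
datasets:
  # Hypothetical private bucket and dataset name.
  - from: s3://my-private-bucket/path/to/data/
    name: private_dataset
    params:
      file_format: parquet
      s3_auth: iam_role      # credentials are resolved from the sources described in this section
      s3_region: us-west-2   # placeholder; set to the bucket's actual region
```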
Regardless of the credential source, the IAM role or user must have appropriate S3 permissions (e.g., s3:ListBucket, s3:GetObject) to access the files. If the Spicepod connects to multiple different AWS services, the permissions should cover all of them.
kube2iam is a project that provides IAM roles to Kubernetes pods based on annotations. It has been superseded by IAM roles for service accounts (IRSA), which should be preferred for new deployments.
Spice requires kube2iam >= 0.12; versions prior to 0.12 only supported IMDSv1.
Required IAM Permissions
Minimum IAM policy for S3 access:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
    }
  ]
}
Permission Details
Permission | Purpose |
---|---|
s3:ListBucket | Required. Allows listing the objects in the bucket. |
s3:GetObject | Required. Allows fetching objects. |
Types
Refer to Object Store Data Types for the mapping from object store file data types to Arrow data types.
Examples
Public Bucket Example
Create a dataset named taxi_trips from a public S3 folder.
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet
MinIO Example
Create a dataset named cool_dataset from a Parquet file stored in MinIO.
- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: http://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
    allow_http: true
Hive Partitioning Example
Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.
For example, a dataset partitioned by year, month, and day might have a directory structure like:
s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.
version: v1
kind: Spicepod
name: hive_data
datasets:
  - from: s3://spiceai-public-datasets/hive_partitioned_data/
    name: hive_data_infer
    params:
      file_format: parquet
      hive_partitioning_enabled: true
Schema Source Path Example
Use schema_source_path to speed up dataset registration by specifying a URL to use to infer the schema.
- from: s3://spiceai-demo-datasets/taxi_trips/
  name: taxi_trips
  params:
    file_format: parquet
    schema_source_path: s3://spiceai-demo-datasets/taxi_trips/2014/1/trips_01.parquet # or s3://spiceai-demo-datasets/taxi_trips/2014/1/
Secrets
Spice integrates with multiple secret stores to help manage sensitive data securely. For details on supported secret stores, refer to the secret stores documentation. To use referenced secrets in component parameters, see the using referenced secrets guide.
Limitations
When using the S3 Data Connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Memory limitations can be mitigated by storing acceleration data on disk. The duckdb and sqlite accelerators support this when mode: file is specified.
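A file-mode acceleration might be configured as in the sketch below. The acceleration block and its enabled and engine keys are not described in this section; they are assumed here from Spice's general dataset acceleration settings and should be verified against the acceleration documentation. Only mode: file and the duckdb and sqlite engine names come from the text above.

```yaml
datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
    # Assumed acceleration keys; confirm against the acceleration documentation.
    acceleration:
      enabled: true
      engine: duckdb   # sqlite is the other accelerator named as supporting file mode
      mode: file       # keep accelerated data on disk rather than in memory
```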
Each query retrieves data from the S3 source, which might result in significant network requests and bandwidth consumption. This can affect network performance and incur costs related to data transfer from S3.
Cookbook
- S3 Data Connector: A cookbook recipe to configure S3 as a data connector in Spice.