URL Tables
URL tables enable querying files in object stores directly using their URLs, without pre-registering datasets in a Spicepod. This provides an ad-hoc query capability for exploring data stored in S3, Azure Blob Storage, or HTTP endpoints.
Enabling URL Tables
URL tables are disabled by default and must be explicitly enabled in the Spicepod configuration:
runtime:
params:
url_tables: enabled
Supported URL Schemes
| Scheme | Description | Example |
|---|---|---|
s3:// | Amazon S3 | s3://bucket/path/file.parquet |
abfs:// | Azure Blob Storage | abfs://container@account/path/file.parquet |
abfss:// | Azure Data Lake Storage Gen2 | abfss://[email protected]/path/ |
https:// | HTTPS endpoints | https://example.com/data.parquet |
http:// | HTTP endpoints | http://localhost:8080/data.csv |
Query Patterns
Single File
Query a single file by specifying its full URL:
SELECT * FROM 's3://my-bucket/data/sales.parquet' LIMIT 10
Directory or Prefix
Query all files under a directory or prefix by including a trailing slash:
-- All files in a directory
SELECT * FROM 's3://my-bucket/data/'
-- All files in a bucket
SELECT * FROM 's3://my-bucket/'
Glob Patterns
Use glob patterns to match specific files:
-- All parquet files in a directory
SELECT * FROM 's3://my-bucket/data/*.parquet'
-- Files matching a pattern across subdirectories
SELECT * FROM 's3://my-bucket/year=2024/month=*/data.parquet'
Hive-Style Partitions
Hive-style partitions are automatically inferred from the path structure, enabling partition pruning:
-- If data is stored at s3://bucket/data/year=2024/month=01/file.parquet
-- the year and month columns are available for filtering
SELECT * FROM 's3://my-bucket/data/' WHERE year = '2024' AND month = '01'
Authentication
URL tables use the same authentication mechanisms as the corresponding data connectors. Credentials are loaded automatically from environment variables or cloud provider defaults.
S3
For S3, credentials are loaded from:
- Environment variables:
AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,AWS_SESSION_TOKEN - Shared AWS credentials file (
~/.aws/credentials) - IAM instance profiles or roles
For public buckets, no authentication is required.
Azure Blob Storage
For Azure, set the storage account name via environment variable:
export AZURE_STORAGE_ACCOUNT=mystorageaccount
Alternatively, include the account name in the URL:
SELECT * FROM 'abfss://[email protected]/path/file.parquet'
Additional authentication options:
- Environment variable:
AZURE_STORAGE_KEYfor access key authentication - Azure Managed Identity (automatic when running on Azure)
- Azure CLI credentials
Examples
S3 Query
runtime:
params:
url_tables: enabled
-- Query a public S3 dataset
SELECT VendorID, passenger_count, trip_distance
FROM 's3://spiceai-public-datasets/taxi_small_samples/taxi_sample.parquet'
LIMIT 5
Azure Blob Storage Query
runtime:
params:
url_tables: enabled
Set the account via environment variable:
export AZURE_STORAGE_ACCOUNT=mystorageaccount
export AZURE_STORAGE_KEY=${your_access_key}
Or include the account in the URL:
SELECT *
FROM 'abfss://[email protected]/data/'
LIMIT 10
Cross-Source Query
URL tables can be combined with registered datasets in federated queries:
runtime:
params:
url_tables: enabled
datasets:
- from: postgres:orders
name: orders
params:
pg_host: localhost
pg_db: mydb
-- Join a registered dataset with an ad-hoc S3 query
SELECT o.order_id, o.customer_id, s.product_name
FROM orders o
JOIN 's3://my-bucket/products.parquet' s ON o.product_id = s.id
Considerations
- Schema Inference: The schema is inferred from the files at query time. For best performance with large datasets, consider registering datasets in the Spicepod.
- File Format Detection: File formats are automatically inferred from file extensions. Supported formats include Parquet, CSV, and JSON.
- Performance: URL tables query data directly from the object store without local acceleration. For frequently accessed data or performance-critical queries, register datasets with data acceleration.
- Authentication Scope: URL table queries use environment-level credentials. For queries requiring different credentials per source, register datasets with explicit authentication parameters.
Related Topics
- S3 Data Connector - Register S3 datasets with full configuration options
- Azure BlobFS Data Connector - Register Azure datasets with full configuration options
- Query Federation - Learn about federated queries across multiple sources
- Data Acceleration - Accelerate query performance with local caching
