Skip to main content
Version: Latest (v1.11)

URL Tables

URL tables enable querying files in object stores directly using their URLs, without pre-registering datasets in a Spicepod. This provides an ad-hoc query capability for exploring data stored in S3, Azure Blob Storage, or HTTP endpoints.

Enabling URL Tables

URL tables are disabled by default and must be explicitly enabled in the Spicepod configuration:

runtime:
params:
url_tables: enabled

Supported URL Schemes

SchemeDescriptionExample
s3://Amazon S3s3://bucket/path/file.parquet
abfs://Azure Blob Storageabfs://container@account/path/file.parquet
abfss://Azure Data Lake Storage Gen2abfss://[email protected]/path/
https://HTTPS endpointshttps://example.com/data.parquet
http://HTTP endpointshttp://localhost:8080/data.csv

Query Patterns

Single File

Query a single file by specifying its full URL:

SELECT * FROM 's3://my-bucket/data/sales.parquet' LIMIT 10

Directory or Prefix

Query all files under a directory or prefix by including a trailing slash:

-- All files in a directory
SELECT * FROM 's3://my-bucket/data/'

-- All files in a bucket
SELECT * FROM 's3://my-bucket/'

Glob Patterns

Use glob patterns to match specific files:

-- All parquet files in a directory
SELECT * FROM 's3://my-bucket/data/*.parquet'

-- Files matching a pattern across subdirectories
SELECT * FROM 's3://my-bucket/year=2024/month=*/data.parquet'

Hive-Style Partitions

Hive-style partitions are automatically inferred from the path structure, enabling partition pruning:

-- If data is stored at s3://bucket/data/year=2024/month=01/file.parquet
-- the year and month columns are available for filtering
SELECT * FROM 's3://my-bucket/data/' WHERE year = '2024' AND month = '01'

Authentication

URL tables use the same authentication mechanisms as the corresponding data connectors. Credentials are loaded automatically from environment variables or cloud provider defaults.

S3

For S3, credentials are loaded from:

  1. Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
  2. Shared AWS credentials file (~/.aws/credentials)
  3. IAM instance profiles or roles

For public buckets, no authentication is required.

Azure Blob Storage

For Azure, set the storage account name via environment variable:

export AZURE_STORAGE_ACCOUNT=mystorageaccount

Alternatively, include the account name in the URL:

SELECT * FROM 'abfss://[email protected]/path/file.parquet'

Additional authentication options:

  • Environment variable: AZURE_STORAGE_KEY for access key authentication
  • Azure Managed Identity (automatic when running on Azure)
  • Azure CLI credentials

Examples

S3 Query

runtime:
params:
url_tables: enabled
-- Query a public S3 dataset
SELECT VendorID, passenger_count, trip_distance
FROM 's3://spiceai-public-datasets/taxi_small_samples/taxi_sample.parquet'
LIMIT 5

Azure Blob Storage Query

runtime:
params:
url_tables: enabled

Set the account via environment variable:

export AZURE_STORAGE_ACCOUNT=mystorageaccount
export AZURE_STORAGE_KEY=${your_access_key}

Or include the account in the URL:

SELECT *
FROM 'abfss://[email protected]/data/'
LIMIT 10

Cross-Source Query

URL tables can be combined with registered datasets in federated queries:

runtime:
params:
url_tables: enabled

datasets:
- from: postgres:orders
name: orders
params:
pg_host: localhost
pg_db: mydb
-- Join a registered dataset with an ad-hoc S3 query
SELECT o.order_id, o.customer_id, s.product_name
FROM orders o
JOIN 's3://my-bucket/products.parquet' s ON o.product_id = s.id

Considerations

  • Schema Inference: The schema is inferred from the files at query time. For best performance with large datasets, consider registering datasets in the Spicepod.
  • File Format Detection: File formats are automatically inferred from file extensions. Supported formats include Parquet, CSV, and JSON.
  • Performance: URL tables query data directly from the object store without local acceleration. For frequently accessed data or performance-critical queries, register datasets with data acceleration.
  • Authentication Scope: URL table queries use environment-level credentials. For queries requiring different credentials per source, register datasets with explicit authentication parameters.