HTTP(s) Data Connector Deployment Guide
Production operating guide for the HTTP(s) data connector covering authentication, rate control, retry tuning, and observability.
Authentication & Secrets
The connector supports HTTP Basic, custom-header, and OAuth2 refresh-token authentication. Secrets must be sourced from a secret store in production.
| Parameter | Description |
|---|---|
| http_username | Username for HTTP Basic authentication. |
| http_password | Password for HTTP Basic authentication. Use ${secrets:...} to resolve from a secret store. |
| http_headers | Custom headers (e.g. Authorization:Bearer ${secrets:api_token}). Treated as sensitive and not logged. |
| auth_token_url | OAuth2 token endpoint URL (must be HTTPS in production). |
| http_auth_refresh_token | OAuth2 refresh token. Required when auth_token_url is set. |
| http_auth_client_id | OAuth2 client ID (required for confidential clients). |
| http_auth_client_secret | OAuth2 client secret (required for confidential clients). Use ${secrets:...}. |
For OAuth2-protected APIs, prefer the refresh-token flow over storing long-lived bearer tokens. The connector exchanges the refresh token for a short-lived access token at startup and refreshes it before expiry.
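As a minimal sketch, a dataset using the refresh-token flow might be configured as shown here. The dataset layout follows the rate-control example in this guide. The endpoint URLs, client ID, and secret names (api_refresh_token, api_client_secret) are placeholders, and the exact ${secrets:...} key format depends on the secret store in use.

```yaml
datasets:
  - from: https://api.example.com/v1   # placeholder upstream API
    name: api_data
    params:
      file_format: json
      # OAuth2 refresh-token flow; the token endpoint must be HTTPS in production
      auth_token_url: https://auth.example.com/oauth/token    # placeholder URL
      http_auth_refresh_token: ${secrets:api_refresh_token}   # hypothetical secret name
      http_auth_client_id: my-client-id                       # placeholder client ID
      http_auth_client_secret: ${secrets:api_client_secret}   # hypothetical secret name
```

Keeping both the refresh token and the client secret in the secret store means no credential appears in the configuration file itself.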
TLS
Use HTTPS endpoints in production. auth_token_url must use HTTPS (loopback addresses are allowed for local testing only). Self-signed certificates require a trusted CA bundle in the container or host OS trust store.
Resilience Controls
Rate Control
The HTTP connector participates in the shared HTTP rate control system. Concurrency and per-second/per-minute request limits can be configured per-dataset (in params) or globally (in runtime.params). Dataset-level settings override the global defaults. Multiple datasets targeting the same upstream origin share a single rate controller.
| Parameter | Description |
|---|---|
| max_concurrent_requests | Maximum concurrent HTTP requests to the same origin. Disabled when unset. |
| requests_per_second_limit | Maximum HTTP requests per second to the same origin. Disabled when unset. |
| requests_per_minute_limit | Maximum HTTP requests per minute to the same origin. Disabled when unset. |
| rate_control_jitter_min | Minimum random delay before requests when rate control is active. Defaults to 5ms. |
| rate_control_jitter_max | Maximum random delay before requests when rate control is active. Defaults to 10ms. |
The runtime equivalents (http_max_concurrent_requests, http_requests_per_second_limit, http_requests_per_minute_limit, http_rate_control_jitter_min, http_rate_control_jitter_max) set defaults that apply to every HTTP-based connector unless overridden per dataset.
runtime:
  params:
    http_max_concurrent_requests: 10
    http_requests_per_second_limit: 5
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      allowed_request_paths: '/data/**'
      max_concurrent_requests: 3 # Override for this dataset
      requests_per_minute_limit: 60
Use rate control when the upstream API enforces request quotas, when many datasets share a single origin, or when running large IN-list refreshes that would otherwise burst hundreds of concurrent requests.
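The shared-origin case can be sketched as follows: two datasets point at the same upstream origin and rely only on the global defaults, so they draw from one rate controller. The dataset names and paths are hypothetical, and the jitter value format (a duration string such as 20ms) is an assumption based on the documented 5ms and 10ms defaults.

```yaml
runtime:
  params:
    # Global defaults for every HTTP-based connector
    http_max_concurrent_requests: 10
    http_requests_per_minute_limit: 300
    http_rate_control_jitter_min: 20ms   # assumed duration format
    http_rate_control_jitter_max: 50ms   # assumed duration format

datasets:
  # Both datasets target https://api.example.com, so per this guide
  # they share a single rate controller for that origin.
  - from: https://api.example.com/v1/orders      # hypothetical path
    name: orders
    params:
      file_format: json
  - from: https://api.example.com/v1/customers   # hypothetical path
    name: customers
    params:
      file_format: json
```

With this layout the 300 requests-per-minute budget applies to the origin as a whole rather than to each dataset separately.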
Retry Behavior
HTTP-level retries follow the shared resilient_http policy: 408, 429, and 5xx responses plus transient network errors are retried. The connector respects Retry-After, retry-after-ms, and x-retry-after-ms headers.
| Parameter | Default | Description |
|---|---|---|
| max_retries | 3 | Maximum retry attempts per request. |
| retry_backoff_method | fibonacci | Backoff strategy. Options: fibonacci, linear, exponential. |
| retry_max_duration | unset | Maximum total duration across all retries (e.g. 30s, 5m). When set, retries stop after this elapsed time. |
| retry_jitter | 0.3 | Randomization factor (0.0–1.0) applied to retry delays. Set to 0 to disable jitter. |
Retries are configured independently of rate control but remain subject to it. If a retry would exceed the configured per-second or per-minute rate, it waits for the rate window to open before issuing the request.
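A retry policy combined with a rate limit might look like the sketch below. It assumes the retry parameters are set in the dataset's params block alongside the rate-control parameters, which this guide does not state explicitly. The values are illustrative, not recommendations.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      max_retries: 5
      retry_backoff_method: exponential   # fibonacci (default), linear, or exponential
      retry_max_duration: 2m              # stop retrying after two minutes in total
      retry_jitter: 0.2                   # randomization factor between 0.0 and 1.0
      requests_per_second_limit: 5        # retries also wait for this window to open
```

Setting retry_max_duration gives a hard upper bound on how long a single request can keep a refresh waiting, even when rate-limit waits stretch the time between attempts.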
Timeouts and Connection Pool
| Parameter | Default | Description |
|---|---|---|
| client_timeout | 30 | Maximum time (seconds) to wait for the entire request-response cycle. |
| connect_timeout | 10 | Maximum time (seconds) to establish a TCP/TLS connection. |
| pool_max_idle_per_host | 10 | Maximum idle connections held per upstream host. |
| pool_idle_timeout | 90 | Idle connection lifetime (seconds) before the pool closes them. |
Increase client_timeout for endpoints with large response bodies or expensive server-side computation. Reduce pool_max_idle_per_host when running many small datasets against the same host to keep the runtime's open file descriptors bounded.
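For a slow endpoint that is one of many small datasets on the same host, the tuning described above could be sketched like this. The dataset name and path are hypothetical, and the values are written as plain numbers of seconds on the assumption that they follow the same form as the defaults in the table.

```yaml
datasets:
  - from: https://api.example.com/v1/reports   # hypothetical slow endpoint
    name: slow_reports
    params:
      file_format: json
      client_timeout: 120         # seconds; raised for expensive server-side work
      connect_timeout: 10         # seconds; default
      pool_max_idle_per_host: 2   # fewer idle connections per host
      pool_idle_timeout: 90       # seconds; default
```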
Caching Mode
When using refresh_mode: caching, transient HTTP errors (5xx, 429) are excluded from the cache and propagated to clients. Set caching_stale_if_error: enabled to serve expired cached data on upstream failure. Always set caching_ttl explicitly; the default of 30s is rarely the desired window.
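A caching-mode dataset might be configured along these lines. This is a sketch under a significant assumption: that refresh_mode sits in an acceleration block and that the caching_* settings go in that block's params. This guide names the settings but not their location, so verify the placement against the acceleration reference before use.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
    acceleration:                 # assumed location of refresh_mode
      enabled: true
      refresh_mode: caching
      params:                     # assumed location of the caching_* settings
        caching_ttl: 5m                   # explicit TTL instead of the 30s default
        caching_stale_if_error: enabled   # serve expired data on upstream failure
```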
Capacity & Sizing
- Throughput: Bounded by the upstream rate limit, then by max_concurrent_requests and connect_timeout. Plan limits to stay within the API quota.
- Memory: Response bodies are streamed; memory footprint is bounded by max_request_body_bytes (filter inputs) and DataFusion's record-batch size for response rows.
- Connection setup: TLS handshake adds latency. The connection pool keeps pool_max_idle_per_host warm connections to absorb burst traffic.
- Partitioned refreshes: When using IN-list filters or cross-product partitioning, the runtime issues one HTTP request per partition. Use max_request_partitions to cap the request count for unbounded filter combinations, and max_concurrent_requests to throttle their fan-out.
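The partitioned-refresh guidance can be sketched as a dataset that caps both the number of partitions and their fan-out. The list sizes in the comment are a made-up example of how a cross product grows, and the cap values are illustrative. What happens when the cap is exceeded is not covered here.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      allowed_request_paths: '/data/**'
      # Example arithmetic: two IN-lists of 10 and 20 values multiply into
      # 10 x 20 = 200 partitions, and each partition is one HTTP request.
      max_request_partitions: 100   # cap on the request count per query
      max_concurrent_requests: 4    # throttle how many run at once
```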
Metrics
When rate control is active, the connector exposes per-origin metrics that can be enabled per-dataset:
| Metric Name | Description |
|---|---|
| rate_control_max_concurrent_requests | Configured maximum concurrent HTTP requests for this upstream origin; 0 means disabled. |
| rate_control_requests_per_second_limit | Configured HTTP request-per-second limit for this upstream origin; 0 means disabled. |
| rate_control_requests_per_minute_limit | Configured HTTP request-per-minute limit for this upstream origin; 0 means disabled. |
| rate_control_jitter_min_ms | Minimum jitter (in milliseconds) added before HTTP requests when rate control is active. |
| rate_control_jitter_max_ms | Maximum jitter (in milliseconds) added before HTTP requests when rate control is active. |
Enable component metrics in the dataset's metrics section. See Component Metrics for general configuration.
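Enabling two of the metrics from the table might look like the following sketch. The shape of each metrics entry (a name plus an enabled flag) is an assumption about the Component Metrics format, so confirm it against that reference.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      requests_per_second_limit: 5
      max_concurrent_requests: 3
    metrics:   # entry shape assumed; see Component Metrics
      - name: rate_control_requests_per_second_limit
        enabled: true
      - name: rate_control_max_concurrent_requests
        enabled: true
```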
For broader observability, also monitor:
- Spice query execution metrics (query_duration_ms, query_processed_rows, query_failures_total) from runtime.metrics.
- HTTP response status distribution via the shared resilient_http instrumentation.
Task History
HTTP requests participate in task history through the HTTP client's span. Each partitioned request and each pagination page is a child of the enclosing sql_query or accelerated_table_refresh task.
Known Limitations
- Read-only: The connector is read-only. Only GET and POST (via request_body filters) are supported.
- Filter pushdown is opt-in: request_path, request_query, request_body, and request_headers filters require explicit allowlists or _filters: enabled parameters.
- OAuth2 grant types: Only the refresh-token grant is supported. Client-credentials and authorization-code flows are not exposed.
- OR across virtual filter columns: WHERE request_path = '/a' OR request_query = 'b=1' is rejected. Use separate datasets or UNION ALL for cross-column alternatives. Single-column OR (and IN-lists) is supported.
Troubleshooting
| Symptom | Likely cause | Resolution |
|---|---|---|
| 401 Unauthorized | Wrong/expired token or password. | Rotate the credential in the secret store. |
| 429 Too Many Requests (frequent) | Upstream rate limit hit; concurrency too high. | Set requests_per_second_limit / requests_per_minute_limit; reduce max_concurrent_requests. |
| Refresh blocked / queue building up | max_concurrent_requests set too low for the workload. | Raise the dataset-level limit or move heavy datasets to their own origin. |
| OAuth2 token refresh fails | auth_token_url not HTTPS, or wrong client credentials. | Verify the token endpoint URL; check http_auth_client_id/secret and required scopes. |
| Request rejected: "OR across HTTP filter columns" | WHERE request_path = '...' OR request_query = '...'. | Split into separate refreshes or UNION ALL. |
| Many partitions created from cross-product | Multiple IN-list filters multiplied into many requests. | Set max_request_partitions to cap; tighten filters. |
| Slow first refresh | Cold connection pool + TLS handshake per request. | Raise pool_max_idle_per_host; ensure pool_idle_timeout is long enough to keep connections warm. |
