Observability & Monitoring
Spice provides monitoring and observability through three mechanisms:
- Prometheus-compatible metrics endpoint: Exposes metrics in the Prometheus exposition format for scraping by monitoring systems like Datadog, New Relic, and Chronosphere.
- OpenTelemetry metrics export: Pushes metrics to an OpenTelemetry collector using gRPC.
- Distributed tracing: Integrates with Zipkin and compatible tracing systems for request tracing.
Monitoring Integrations
Prometheus Metrics Endpoint
Spice exposes a Prometheus-compatible metrics endpoint that monitoring systems can scrape. The endpoint serves metrics in the Prometheus exposition format, which is supported by most enterprise monitoring platforms including Datadog, New Relic, Chronosphere, Grafana Cloud, and others.
Default Configuration
The metrics endpoint listens on port 9090 by default. The endpoint address is logged at startup:
2024-11-28T19:48:10.942003Z INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
Custom Port Binding
Use the --metrics flag to bind to a specific address and port:
spiced --metrics 0.0.0.0:9091
For Docker deployments:
FROM spiceai/spiceai:latest
EXPOSE 9090
CMD ["--metrics", "0.0.0.0:9090"]
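For example, this docker-compose sketch (the service name and port mapping are illustrative) publishes the metrics port so a scraper on the host can reach it:
services:
  spiced:
    image: spiceai/spiceai:latest
    # Same flag as the Dockerfile CMD above: bind metrics to all interfaces
    command: ['--metrics', '0.0.0.0:9090']
    ports:
      - '9090:9090' # metrics endpoint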
Verifying the Endpoint
Verify the metrics endpoint is working with a GET request:
curl http://localhost:9090/metrics
# HELP runtime_flight_server_started Indicates the runtime Flight server has started.
# TYPE runtime_flight_server_started counter
runtime_flight_server_started 1
# HELP runtime_http_server_started Indicates the runtime HTTP server has started.
# TYPE runtime_http_server_started counter
runtime_http_server_started 1
# HELP dataset_load_state Status of the dataset. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
# TYPE dataset_load_state gauge
dataset_load_state{dataset="taxi_trips"} 2
dataset_load_state{dataset="taxi_trips_accelerated"} 2
# HELP dataset_active_count Number of currently loaded datasets.
# TYPE dataset_active_count gauge
dataset_active_count{engine="None"} 1
dataset_active_count{engine="duckdb"} 1
...
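To collect these metrics with a Prometheus server, add a scrape job pointing at the endpoint. A minimal prometheus.yml sketch (the job name, interval, and target are illustrative):
scrape_configs:
  - job_name: 'spiced'
    scrape_interval: 15s
    static_configs:
      # Address of the Spice metrics endpoint (default port 9090)
      - targets: ['localhost:9090']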
OpenTelemetry Metrics Exporter
Spice can push metrics to an OpenTelemetry collector, enabling integration with platforms such as Datadog, Grafana Cloud, New Relic, Honeycomb, and other OpenTelemetry-compatible backends.
Configuration
Configure the OpenTelemetry exporter in spicepod.yaml under runtime.telemetry.otel_exporter:
| Parameter | Required | Default | Description |
|---|---|---|---|
| enabled | No | true | Whether the OpenTelemetry exporter is enabled. |
| endpoint | Yes | - | The OpenTelemetry collector endpoint. The protocol (gRPC or HTTP) is inferred from the endpoint format. |
| push_interval | No | 60s | How frequently metrics are pushed to the collector. |
| metrics | No | [] | List of metric names to export. When empty, all metrics are exported. |
| headers | No | {} | Map of headers to send with each export request. For HTTP, sent as HTTP headers; for gRPC, sent as metadata entries (keys must be lowercase ASCII). Values support the ${secrets:...} replacement syntax. |
Protocol
Spice infers the OTLP protocol from the endpoint format:
- gRPC: a bare host:port with no scheme (e.g. localhost:4317). Default port: 4317.
- HTTP: includes the http:// or https:// scheme and ends in /v1/metrics (e.g. http://localhost:4318/v1/metrics, https://otlp.us3.datadoghq.com/v1/metrics). Default port: 4318.
Authentication
For collectors that require authentication (Datadog, Grafana Cloud, New Relic, Honeycomb, etc.), set the headers map. Secret values should be loaded from a supported secret store using the ${secrets:...} replacement syntax rather than committed to source:
runtime:
telemetry:
otel_exporter:
endpoint: 'https://otlp.example.com/v1/metrics'
headers:
Authorization: 'Bearer ${secrets:otlp_token}'
When exporting over gRPC, header keys are sent as gRPC metadata and must be lowercase ASCII — use authorization, not Authorization. The runtime fails fast at startup if any gRPC metadata key is invalid. HTTP exports preserve the casing you provide.
Examples
Local gRPC collector
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'localhost:4317'
push_interval: '30s'
Local HTTP collector
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'http://localhost:4318/v1/metrics'
push_interval: '30s'
Datadog (OTLP/HTTP)
Replace us3 with your Datadog site (us3, us5, eu, ap1, etc.) and store the API key in a secret store:
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'https://otlp.us3.datadoghq.com/v1/metrics'
push_interval: '30s'
headers:
DD-API-KEY: ${secrets:dd_api_key}
Equivalent standard OTLP environment-variable form (for cross-reference):
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.us3.datadoghq.com"
export OTEL_EXPORTER_OTLP_HEADERS="DD-API-KEY=${DD_API_KEY}"
For a complete Datadog setup including metric prefixing and custom tags via OTLP resource attributes, see the Datadog monitoring guide.
Grafana Cloud (OTLP/HTTP)
Grafana Cloud's OTLP gateway expects HTTP Basic authentication. Obtain the base64-encoded instanceID:accessPolicyToken credential from the Grafana Cloud "OpenTelemetry" connection page and store it in a secret:
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'https://otlp-gateway-us-central2.grafana.net/otlp/v1/metrics'
push_interval: '30s'
headers:
Authorization: 'Basic ${secrets:grafana_cloud_auth}'
Equivalent standard OTLP environment-variable form (for cross-reference):
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-us-central2.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic ${GRAFANA_CLOUD_AUTH}"
Match the region in the URL to your Grafana Cloud stack (us-central2, eu-west-2, prod-ap-south-0, etc.).
gRPC collector with auth metadata
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'otel-collector.internal:4317'
push_interval: '30s'
headers:
# Keys MUST be lowercase for gRPC
api-key: ${secrets:collector_api_key}
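On the receiving side, the collector needs an OTLP receiver listening on the matching protocol and port. A minimal OpenTelemetry Collector configuration sketch (the debug exporter is a stand-in; substitute the exporter for your backend):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: '0.0.0.0:4317' # matches a bare host:port Spice endpoint
      http:
        endpoint: '0.0.0.0:4318' # matches an http(s)://.../v1/metrics endpoint
exporters:
  debug: {} # illustrative; replace with your backend's exporter
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]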
Metric Naming and Custom Tags
Two runtime fields control how exported metrics are named and labeled across all readers (Prometheus scrape, cluster OTLP reader, and the otel_exporter push exporter):
- runtime.telemetry.metric_prefix prepends a string to every metric name (e.g. spiceai.query_duration_ms). Useful for namespacing in shared backends.
- runtime.telemetry.properties attaches custom key/value attributes as OpenTelemetry resource attributes, which most backends surface as dimensions or tags.
runtime:
telemetry:
metric_prefix: 'spiceai.'
properties:
environment: prod
region: us-west-2
team: data-platform
Both fields apply to every exporter the runtime has enabled. See the Datadog monitoring guide for backend-specific notes (Datadog requires dd-otel-metric-config to map resource attributes to tags).
Metric Filtering
To export only specific metrics, use the metrics parameter:
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'localhost:4317'
metrics:
- query_duration_ms
- query_executions
- dataset_load_state
When metrics is empty or omitted, all available metrics are exported.
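Putting these options together, a fuller telemetry block might look like the following sketch (the endpoint, secret name, property values, and metric list are illustrative):
runtime:
  telemetry:
    enabled: true
    metric_prefix: 'spiceai.'       # namespace all exported metric names
    properties:
      environment: prod             # surfaced as a resource attribute/tag
    otel_exporter:
      endpoint: 'https://otlp.example.com/v1/metrics'
      push_interval: '30s'
      headers:
        authorization: 'Bearer ${secrets:otlp_token}'
      metrics:                      # export only these metrics
        - query_duration_ms
        - dataset_load_state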
For full configuration details, see the runtime.telemetry reference.
Available Metrics
Spice exposes the following metrics. The Dimensions column lists the labels available for filtering and aggregation; an em dash (—) indicates the metric is emitted without dimensions. Dimensions annotated (request context) expand to: protocol, client, client_version, client_system, user_agent, runtime, runtime_version, runtime_system (each label is only emitted when the corresponding request attribute is present).
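For example, an http_requests sample with request-context labels present might appear in the exposition output like this (label values are illustrative):
http_requests{method="POST",path="/v1/sql",status="200",protocol="http",user_agent="curl/8.1.2"} 42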
| Metric | Description | Type | Dimensions |
|---|---|---|---|
| accelerated_ready_state_federated_fallback | Number of times the federated table was queried due to the accelerated table loading the initial data. | count | dataset_name |
| accelerated_zero_results_federated_fallback | Number of times the federated table was queried due to the accelerated table returning zero results. | count | dataset_name |
| ai_inferences_with_spice_count | AI inferences with Spice count. | count | tools_used |
| catalog_load_errors | Number of errors loading the catalog provider. | count | — |
| catalog_load_state | Status of the catalog provider. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | catalog |
| component_metric_registered_count | Number of currently registered component metrics. | gauge | — |
| dataset_acceleration_ingestion_lag_ms | Lag between the current wall-clock time and the maximum time_column value after the refresh operation, in milliseconds. Disabled by default. | gauge | dataset, mode |
| dataset_acceleration_last_refresh_time_ms | Unix timestamp in milliseconds when the last refresh completed. Disabled by default. | gauge | dataset |
| dataset_acceleration_max_timestamp_after_refresh_ms | Maximum value of the dataset's time_column after the refresh operation, in milliseconds. Disabled by default. | gauge | dataset, mode |
| dataset_acceleration_max_timestamp_before_refresh_ms | Maximum value of the dataset's time_column before the refresh operation, in milliseconds. Disabled by default. | gauge | dataset, mode |
| dataset_acceleration_refresh_data_fetches_skipped | Number of refresh data fetches skipped due to unchanged file metadata. | count | dataset, mode |
| dataset_acceleration_refresh_duration_ms | Duration in milliseconds to load full or appended refresh data. | histogram | dataset, mode |
| dataset_acceleration_refresh_errors | Number of errors refreshing the dataset. | count | dataset, mode |
| dataset_acceleration_refresh_lag_ms | Difference between the maximum time_column value after and before the refresh operation, in milliseconds. | gauge | dataset, mode |
| dataset_acceleration_refresh_rows_written | Cumulative number of rows read from the federated source and written into the accelerated table. | count | dataset |
| dataset_acceleration_refresh_bytes_written | Cumulative number of bytes (Arrow in-memory size) read from the federated source and written into the accelerated table. | count | dataset |
| dataset_acceleration_refresh_worker_panics | Number of times a refresh worker panicked while refreshing a dataset. | count | dataset |
| dataset_acceleration_size_bytes | Size of the accelerated table storage in bytes. | gauge | dataset |
| dataset_acceleration_snapshot_bootstrap_bytes | Number of bytes downloaded when bootstrapping the acceleration from a snapshot. | gauge | dataset |
| dataset_acceleration_snapshot_bootstrap_checksum | Checksum of the snapshot downloaded during bootstrap (emitted with a checksum attribute). | gauge | dataset, checksum |
| dataset_acceleration_snapshot_bootstrap_duration_ms | Time in milliseconds taken to download the snapshot used to bootstrap the acceleration. | count | dataset |
| dataset_acceleration_snapshot_failure_count | Number of failures encountered while writing snapshots. | count | dataset |
| dataset_acceleration_snapshot_write_bytes | Number of bytes written for the most recent snapshot. | gauge | dataset |
| dataset_acceleration_snapshot_write_checksum | Checksum of the most recent snapshot write (emitted with a checksum attribute). | gauge | dataset, checksum |
| dataset_acceleration_snapshot_write_duration_ms | Time in milliseconds taken to write the latest snapshot to object storage. | histogram | dataset |
| dataset_acceleration_snapshot_write_timestamp | Unix timestamp (seconds) when the most recent snapshot write completed. | gauge | dataset |
| dataset_active_count | Number of currently loaded datasets. | gauge | engine |
| dataset_load_errors | Number of errors loading the dataset. | count | — |
| dataset_load_state | Status of the dataset. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | dataset |
| dataset_unavailable_time_ms | Time the dataset went offline, in milliseconds. | gauge | dataset |
| embeddings_active_count | Number of currently loaded embeddings. | gauge | embeddings, source |
| embeddings_cache_evictions | Number of cache evictions. | count | — |
| embeddings_cache_hit_ratio | Cache hit ratio (hits / total requests). | gauge | — |
| embeddings_cache_hits | Cache hit count. | count | — |
| embeddings_cache_items_count | Number of items currently in the cache. | gauge | — |
| embeddings_cache_max_size_bytes | Maximum allowed size of the cache in bytes. | gauge | — |
| embeddings_cache_misses | Cache miss count. | count | — |
| embeddings_cache_requests | Number of requests to get a key from the cache. | count | — |
| embeddings_cache_size_bytes | Size of the cache in bytes. | gauge | — |
| embeddings_cache_stale_swr_count | Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation. | count | — |
| embeddings_cache_swr_background_query_count | Number of background queries triggered for stale-while-revalidate cache refreshes. | count | — |
| embeddings_failures | Number of embedding failures. | count | model, encoding_format, user, dimensions |
| embeddings_internal_request_duration_ms | Duration of running an embedding request internally, in milliseconds. | histogram | model, encoding_format, user, dimensions |
| embeddings_load_errors | Number of errors loading the embedding. | count | — |
| embeddings_load_state | Status of the embedding. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | model |
| embeddings_requests | Number of embedding requests. | count | model, encoding_format, user, dimensions |
| flight_do_exchange_data_updates_sent | Number of data updates sent via DoExchange. | count | — |
| flight_do_put_bytes_written | Cumulative number of bytes (Arrow in-memory size) received and written via Flight DoPut. | count | dataset |
| flight_do_put_rows_written | Cumulative number of rows received and written via Flight DoPut. | count | dataset |
| flight_request_duration_ms | Duration of Flight requests in milliseconds. | histogram | method, command, (request context) |
| flight_requests | Total number of Flight requests. | count | method, command, (request context) |
| http_requests | Number of HTTP requests. | count | method, path, status, (request context) |
| http_requests_duration_ms | Duration of HTTP requests in milliseconds. | histogram | method, path, status, (request context) |
| llm_failures | Number of LLM failures. | count | model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions |
| llm_internal_request_duration_ms | Duration of running an LLM request internally, in milliseconds. | histogram | model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions |
| llm_load_state | Status of the LLM model. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | model |
| llm_requests | Number of LLM requests. | count | model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions |
| model_active_count | Number of currently loaded models. | gauge | model, source |
| model_load_duration_ms | Duration in milliseconds to load the model. | histogram | — |
| model_load_errors | Number of errors loading the model. | count | — |
| model_load_state | Status of the model. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | model |
| query_active_count | Number of concurrent top-level queries actively being processed in the runtime. | histogram | protocol (one of http, flight, flightsql, internal) |
| query_duration_ms | Total amount of time spent planning and executing queries, in milliseconds. | histogram | tags, datasets, (request context) |
| query_execution_duration_ms | Total amount of time spent only executing queries, in milliseconds (0 for cached queries). | histogram | tags, datasets, (request context) |
| query_executions | Number of query executions. | count | (request context) |
| query_failures | Number of query failures. | count | tags, datasets, err_code, (request context) |
| query_processed_bytes | Number of bytes processed by the runtime. | count | (request context) |
| query_produced_spills | Number of spills produced by the query. | count | (request context) |
| query_returned_bytes | Number of bytes returned to query clients. | count | (request context) |
| query_returned_rows | Number of rows returned to query clients. | histogram | (request context) |
| query_spilled_bytes | Number of spilled bytes produced by the query. | count | (request context) |
| query_spilled_rows | Number of spilled rows produced by the query. | count | (request context) |
| results_cache_evictions | Number of cache evictions. | count | — |
| results_cache_hit_ratio | Cache hit ratio (hits / total requests). | gauge | — |
| results_cache_hits | Cache hit count. | count | — |
| results_cache_items_count | Number of items currently in the cache. | gauge | — |
| results_cache_max_size_bytes | Maximum allowed size of the cache in bytes. | gauge | — |
| results_cache_misses | Cache miss count. | count | — |
| results_cache_requests | Number of requests to get a key from the cache. | count | — |
| results_cache_size_bytes | Size of the cache in bytes. | gauge | — |
| results_cache_stale_swr_count | Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation. | count | — |
| results_cache_swr_background_query_count | Number of background queries triggered for stale-while-revalidate cache refreshes. | count | — |
| runtime_flight_server_started | Indicates the runtime Flight server has started. | count | — |
| runtime_http_server_started | Indicates the runtime HTTP server has started. | count | — |
| scheduler_active_executors_count | Number of executors currently connected to the scheduler node. | gauge | node_id |
| search_results_cache_evictions | Number of cache evictions. | count | — |
| search_results_cache_hit_ratio | Cache hit ratio (hits / total requests). | gauge | — |
| search_results_cache_hits | Search cache hit count. | count | — |
| search_results_cache_items_count | Number of items currently in the search cache. | gauge | — |
| search_results_cache_max_size_bytes | Maximum allowed size of the search cache in bytes. | gauge | — |
| search_results_cache_misses | Cache miss count. | count | — |
| search_results_cache_requests | Number of requests to get a key from the search cache. | count | — |
| search_results_cache_size_bytes | Size of the search cache in bytes. | gauge | — |
| search_results_cache_stale_swr_count | Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation. | count | — |
| search_results_cache_swr_background_query_count | Number of background queries triggered for stale-while-revalidate cache refreshes. | count | — |
| secrets_store_load_duration_ms | Duration in milliseconds to load the secret stores. | histogram | — |
| tool_active_count | Number of currently loaded LLM tools. | gauge | tool or tool_catalog |
| tool_load_errors | Number of errors loading the LLM tool. | count | — |
| tool_load_state | Status of the LLM tools. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | tool or tool_catalog |
| view_load_errors | Number of errors loading the view. | count | — |
| view_load_state | Status of the views. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | view |
| worker_active_count | Number of currently loaded workers. | gauge | worker |
| workers_load_duration_ms | Duration in milliseconds to load the worker. | histogram | — |
In addition to these core metrics, individual components can expose their own metrics. For example, the MySQL data connector exposes connection pool metrics. See Component Metrics for more information.
