Observability & Monitoring
Spice provides monitoring and observability through three mechanisms:
- Prometheus-compatible metrics endpoint: Exposes metrics in the Prometheus exposition format for scraping by monitoring systems like Datadog, New Relic, and Chronosphere.
- OpenTelemetry metrics export: Pushes metrics to an OpenTelemetry collector using gRPC.
- Distributed tracing: Integrates with Zipkin and compatible tracing systems for request tracing.
Monitoring Integrations
Prometheus Metrics Endpoint
Spice exposes a Prometheus-compatible metrics endpoint that monitoring systems can scrape. The endpoint serves metrics in the Prometheus exposition format, which is supported by most enterprise monitoring platforms including Datadog, New Relic, Chronosphere, Grafana Cloud, and others.
Default Configuration
The metrics endpoint listens on port 9090 by default. The endpoint address is logged at startup:
2024-11-28T19:48:10.942003Z INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
Custom Port Binding
Use the --metrics flag to bind to a specific address and port:
spiced --metrics 0.0.0.0:9091
For Docker deployments:
FROM spiceai/spiceai:latest
EXPOSE 9090
CMD ["--metrics", "0.0.0.0:9090"]
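For example, this docker-compose sketch (the service name and port mapping are illustrative) publishes the metrics port so a scraper on the host can reach it:
services:
  spiced:
    image: spiceai/spiceai:latest
    # Same flag as the Dockerfile CMD above: bind metrics to all interfaces
    command: ['--metrics', '0.0.0.0:9090']
    ports:
      - '9090:9090' # metrics endpoint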
Verifying the Endpoint
Verify the metrics endpoint is working with a GET request:
curl http://localhost:9090/metrics
# HELP runtime_flight_server_started Indicates the runtime Flight server has started.
# TYPE runtime_flight_server_started counter
runtime_flight_server_started 1
# HELP runtime_http_server_started Indicates the runtime HTTP server has started.
# TYPE runtime_http_server_started counter
runtime_http_server_started 1
# HELP dataset_load_state Status of the dataset. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
# TYPE dataset_load_state gauge
dataset_load_state{dataset="taxi_trips"} 2
dataset_load_state{dataset="taxi_trips_accelerated"} 2
# HELP dataset_active_count Number of currently loaded datasets.
# TYPE dataset_active_count gauge
dataset_active_count{engine="None"} 1
dataset_active_count{engine="duckdb"} 1
...
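To collect these metrics with a Prometheus server, add a scrape job pointing at the endpoint. A minimal prometheus.yml sketch (the job name, interval, and target are illustrative):
scrape_configs:
  - job_name: 'spiced'
    scrape_interval: 15s
    static_configs:
      # Address of the Spice metrics endpoint (default port 9090)
      - targets: ['localhost:9090']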
OpenTelemetry Metrics Exporter
Spice can push metrics to an OpenTelemetry collector, enabling integration with platforms such as Datadog, Grafana Cloud, New Relic, Honeycomb, and other OpenTelemetry-compatible backends.
Configuration
Configure the OpenTelemetry exporter in spicepod.yaml under runtime.telemetry.otel_exporter:
| Parameter | Required | Default | Description |
|---|---|---|---|
| enabled | No | true | Whether the OpenTelemetry exporter is enabled. |
| endpoint | Yes | - | The OpenTelemetry collector endpoint. The protocol (gRPC or HTTP) is inferred from the endpoint format. |
| push_interval | No | 60s | How frequently metrics are pushed to the collector. |
| metrics | No | [] | List of metric names to export. When empty, all metrics are exported. |
| headers | No | {} | Map of headers to send with each export request. For HTTP, sent as HTTP headers; for gRPC, sent as metadata entries (keys must be lowercase ASCII). Values support the ${secrets:...} replacement syntax. |
Protocol
Spice infers the OTLP protocol from the endpoint format:
- gRPC: a bare host:port with no scheme (e.g. localhost:4317). Default port: 4317.
- HTTP: includes the http:// or https:// scheme and ends in /v1/metrics (e.g. http://localhost:4318/v1/metrics, https://otlp.us3.datadoghq.com/v1/metrics). Default port: 4318.
Authentication
For collectors that require authentication (Datadog, Grafana Cloud, New Relic, Honeycomb, etc.), set the headers map. Secret values should be loaded from a supported secret store using the ${secrets:...} replacement syntax rather than committed to source:
runtime:
telemetry:
otel_exporter:
endpoint: 'https://otlp.example.com/v1/metrics'
headers:
Authorization: 'Bearer ${secrets:otlp_token}'
When exporting over gRPC, header keys are sent as gRPC metadata and must be lowercase ASCII — use authorization, not Authorization. The runtime fails fast at startup if any gRPC metadata key is invalid. HTTP exports preserve the casing you provide.
Examples
Local gRPC collector
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'localhost:4317'
push_interval: '30s'
Local HTTP collector
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'http://localhost:4318/v1/metrics'
push_interval: '30s'
Datadog (OTLP/HTTP)
Replace us3 with your Datadog site (us3, us5, eu, ap1, etc.) and store the API key in a secret store:
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'https://otlp.us3.datadoghq.com/v1/metrics'
push_interval: '30s'
headers:
DD-API-KEY: ${secrets:dd_api_key}
Equivalent standard OTLP environment-variable form (for cross-reference):
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.us3.datadoghq.com"
export OTEL_EXPORTER_OTLP_HEADERS="DD-API-KEY=${DD_API_KEY}"
For a complete Datadog setup including metric prefixing and custom tags via OTLP resource attributes, see the Datadog monitoring guide.
Grafana Cloud (OTLP/HTTP)
Grafana Cloud's OTLP gateway expects HTTP Basic authentication. Obtain the base64-encoded instanceID:accessPolicyToken credential from the Grafana Cloud "OpenTelemetry" connection page and store it in a secret:
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'https://otlp-gateway-us-central2.grafana.net/otlp/v1/metrics'
push_interval: '30s'
headers:
Authorization: 'Basic ${secrets:grafana_cloud_auth}'
Equivalent standard OTLP environment-variable form (for cross-reference):
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-us-central2.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic ${GRAFANA_CLOUD_AUTH}"
Match the region in the URL to your Grafana Cloud stack (us-central2, eu-west-2, prod-ap-south-0, etc.).
gRPC collector with auth metadata
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'otel-collector.internal:4317'
push_interval: '30s'
headers:
# Keys MUST be lowercase for gRPC
api-key: ${secrets:collector_api_key}
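On the receiving side, the collector needs an OTLP receiver listening on the matching protocol and port. A minimal OpenTelemetry Collector configuration sketch (the debug exporter is a stand-in; substitute the exporter for your backend):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: '0.0.0.0:4317' # matches a bare host:port Spice endpoint
      http:
        endpoint: '0.0.0.0:4318' # matches an http(s)://.../v1/metrics endpoint
exporters:
  debug: {} # illustrative; replace with your backend's exporter
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]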
Metric Naming and Custom Tags
Two runtime fields control how exported metrics are named and labeled across all readers (Prometheus scrape, cluster OTLP reader, and the otel_exporter push exporter):
- runtime.telemetry.metric_prefix prepends a string to every metric name (e.g. spiceai.query_duration_ms). Useful for namespacing in shared backends.
- runtime.telemetry.properties attaches custom key/value attributes as OpenTelemetry resource attributes, which most backends surface as dimensions or tags.
runtime:
telemetry:
metric_prefix: 'spiceai.'
properties:
environment: prod
region: us-west-2
team: data-platform
Both fields apply to every exporter the runtime has enabled. See the Datadog monitoring guide for backend-specific notes (Datadog requires dd-otel-metric-config to map resource attributes to tags).
Metric Filtering
To export only specific metrics, use the metrics parameter:
runtime:
telemetry:
enabled: true
otel_exporter:
endpoint: 'localhost:4317'
metrics:
- query_duration_ms
- query_executions
- dataset_load_state
When metrics is empty or omitted, all available metrics are exported.
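Putting these options together, a fuller telemetry block might look like the following sketch (the endpoint, secret name, property values, and metric list are illustrative):
runtime:
  telemetry:
    enabled: true
    metric_prefix: 'spiceai.'       # namespace all exported metric names
    properties:
      environment: prod             # surfaced as a resource attribute/tag
    otel_exporter:
      endpoint: 'https://otlp.example.com/v1/metrics'
      push_interval: '30s'
      headers:
        authorization: 'Bearer ${secrets:otlp_token}'
      metrics:                      # export only these metrics
        - query_duration_ms
        - dataset_load_state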
For full configuration details, see the runtime.telemetry reference.
Available Metrics
Spice exposes the following metrics. The Dimensions column lists the labels available for filtering and aggregation; an em dash (—) indicates the metric is emitted without dimensions. Dimensions annotated (request context) expand to: protocol, client, client_version, client_system, user_agent, runtime, runtime_version, runtime_system (each label is only emitted when the corresponding request attribute is present).
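For example, an http_requests sample with request-context labels present might appear in the exposition output like this (label values are illustrative):
http_requests{method="POST",path="/v1/sql",status="200",protocol="http",user_agent="curl/8.1.2"} 42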
| Metric | Description | Type | Dimensions |
|---|---|---|---|
| accelerated_ready_state_federated_fallback | Number of times the federated table was queried due to the accelerated table loading the initial data. | count | dataset_name |
| accelerated_zero_results_federated_fallback | Number of times the federated table was queried due to the accelerated table returning zero results. | count | dataset_name |
| ai_inferences_with_spice_count | AI inferences with Spice count. | count | tools_used |
| catalog_load_errors | Number of errors loading the catalog provider. | count | — |
| catalog_load_state | Status of the catalog provider. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | catalog |
| component_metric_registered_count | Number of currently registered component metrics. | gauge | — |
| dataset_acceleration_ingestion_lag_ms | Lag between the current wall-clock time and the maximum time_column value after the refresh operation, in milliseconds. Disabled by default. | gauge | dataset, mode |
| dataset_acceleration_last_refresh_time_ms | Unix timestamp in milliseconds when the last refresh completed. Disabled by default. | gauge | dataset |
| dataset_acceleration_max_timestamp_after_refresh_ms | Maximum value of the dataset's time_column after the refresh operation, in milliseconds. Disabled by default. | gauge | dataset, mode |
| dataset_acceleration_max_timestamp_before_refresh_ms | Maximum value of the dataset's time_column before the refresh operation, in milliseconds. Disabled by default. | gauge | dataset, mode |
| dataset_acceleration_refresh_data_fetches_skipped | Number of refresh data fetches skipped due to unchanged file metadata. | count | dataset, mode |
| dataset_acceleration_refresh_duration_ms | Duration in milliseconds to load full or appended refresh data. | histogram | dataset, mode |
| dataset_acceleration_refresh_errors | Number of errors refreshing the dataset. | count | dataset, mode |
| dataset_acceleration_refresh_lag_ms | Difference between the maximum time_column value after and before the refresh operation, in milliseconds. | gauge | dataset, mode |
| dataset_acceleration_refresh_rows_written | Cumulative number of rows read from the federated source and written into the accelerated table. | count | dataset |
| dataset_acceleration_refresh_bytes_written | Cumulative number of bytes (Arrow in-memory size) read from the federated source and written into the accelerated table. | count | dataset |
| dataset_acceleration_refresh_worker_panics | Number of times a refresh worker panicked while refreshing a dataset. | count | dataset |
| dataset_acceleration_size_bytes | Size of the accelerated table storage in bytes. | gauge | dataset |
| dataset_acceleration_snapshot_bootstrap_bytes | Number of bytes downloaded when bootstrapping the acceleration from a snapshot. | gauge | dataset |
| dataset_acceleration_snapshot_bootstrap_checksum | Checksum of the snapshot downloaded during bootstrap (emitted with a checksum attribute). | gauge | dataset, checksum |
| dataset_acceleration_snapshot_bootstrap_duration_ms | Time in milliseconds taken to download the snapshot used to bootstrap the acceleration. | count | dataset |
| dataset_acceleration_snapshot_failure_count | Number of failures encountered while writing snapshots. | count | dataset |
| dataset_acceleration_snapshot_write_bytes | Number of bytes written for the most recent snapshot. | gauge | dataset |
| dataset_acceleration_snapshot_write_checksum | Checksum of the most recent snapshot write (emitted with a checksum attribute). | gauge | dataset, checksum |
| dataset_acceleration_snapshot_write_duration_ms | Time in milliseconds taken to write the latest snapshot to object storage. | histogram | dataset |
| dataset_acceleration_snapshot_write_timestamp | Unix timestamp (seconds) when the most recent snapshot write completed. | gauge | dataset |
| dataset_active_count | Number of currently loaded datasets. | gauge | engine |
| dataset_load_errors | Number of errors loading the dataset. | count | — |
| dataset_load_state | Status of the dataset. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | dataset |
| dataset_unavailable_time_ms | Time the dataset went offline, in milliseconds. | gauge | dataset |
| embeddings_active_count | Number of currently loaded embeddings. | gauge | embeddings, source |
| embeddings_cache_evictions | Number of cache evictions. | count | — |
| embeddings_cache_hit_ratio | Cache hit ratio (hits / total requests). | gauge | — |
| embeddings_cache_hits | Cache hit count. | count | — |
| embeddings_cache_items_count | Number of items currently in the cache. | gauge | — |
| embeddings_cache_max_size_bytes | Maximum allowed size of the cache in bytes. | gauge | — |
| embeddings_cache_misses | Cache miss count. | count | — |
| embeddings_cache_requests | Number of requests to get a key from the cache. | count | — |
| embeddings_cache_size_bytes | Size of the cache in bytes. | gauge | — |
| embeddings_cache_stale_swr_count | Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation. | count | — |
| embeddings_cache_swr_background_query_count | Number of background queries triggered for stale-while-revalidate cache refreshes. | count | — |
| embeddings_failures | Number of embedding failures. | count | model, encoding_format, user, dimensions |
| embeddings_internal_request_duration_ms | Duration of running an embedding request internally, in milliseconds. | histogram | model, encoding_format, user, dimensions |
| embeddings_load_errors | Number of errors loading the embedding. | count | — |
| embeddings_load_state | Status of the embedding. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | model |
| embeddings_requests | Number of embedding requests. | count | model, encoding_format, user, dimensions |
| flight_do_exchange_data_updates_sent | Number of data updates sent via DoExchange. | count | — |
| flight_do_put_bytes_written | Cumulative number of bytes (Arrow in-memory size) received and written via Flight DoPut. | count | dataset |
| flight_do_put_rows_written | Cumulative number of rows received and written via Flight DoPut. | count | dataset |
| flight_request_duration_ms | Duration of Flight requests in milliseconds. | histogram | method, command, (request context) |
| flight_requests | Total number of Flight requests. | count | method, command, (request context) |
| http_requests | Number of HTTP requests. | count | method, path, status, (request context) |
| http_requests_duration_ms | Duration of HTTP requests in milliseconds. | histogram | method, path, status, (request context) |
| llm_failures | Number of LLM failures. | count | model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions |
| llm_internal_request_duration_ms | Duration of running an LLM request internally, in milliseconds. | histogram | model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions |
| llm_load_state | Status of the LLM model. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | model |
| llm_requests | Number of LLM requests. | count | model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions |
| model_active_count | Number of currently loaded models. | gauge | model, source |
| model_load_duration_ms | Duration in milliseconds to load the model. | histogram | — |
| model_load_errors | Number of errors loading the model. | count | — |
| model_load_state | Status of the model. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | model |
| query_active_count | Number of concurrent top-level queries actively being processed in the runtime. | histogram | protocol (one of http, flight, flightsql, internal) |
| query_duration_ms | Total amount of time spent planning and executing queries, in milliseconds. | histogram | tags, datasets, (request context) |
| query_execution_duration_ms | Total amount of time spent only executing queries, in milliseconds (0 for cached queries). | histogram | tags, datasets, (request context) |
| query_executions | Number of query executions. | count | (request context) |
| query_failures | Number of query failures. | count | tags, datasets, err_code, (request context) |
| query_processed_bytes | Number of bytes processed by the runtime. | count | (request context) |
| query_produced_spills | Number of spills produced by the query. | count | (request context) |
| query_returned_bytes | Number of bytes returned to query clients. | count | (request context) |
| query_returned_rows | Number of rows returned to query clients. | histogram | (request context) |
| query_spilled_bytes | Number of spilled bytes produced by the query. | count | (request context) |
| query_spilled_rows | Number of spilled rows produced by the query. | count | (request context) |
| results_cache_evictions | Number of cache evictions. | count | — |
| results_cache_hit_ratio | Cache hit ratio (hits / total requests). | gauge | — |
| results_cache_hits | Cache hit count. | count | — |
| results_cache_items_count | Number of items currently in the cache. | gauge | — |
| results_cache_max_size_bytes | Maximum allowed size of the cache in bytes. | gauge | — |
| results_cache_misses | Cache miss count. | count | — |
| results_cache_requests | Number of requests to get a key from the cache. | count | — |
| results_cache_size_bytes | Size of the cache in bytes. | gauge | — |
| results_cache_stale_swr_count | Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation. | count | — |
| results_cache_swr_background_query_count | Number of background queries triggered for stale-while-revalidate cache refreshes. | count | — |
| runtime_flight_server_started | Indicates the runtime Flight server has started. | count | — |
| runtime_http_server_started | Indicates the runtime HTTP server has started. | count | — |
| scheduler_active_executors_count | Number of executors currently connected to the scheduler node. | gauge | node_id |
| search_results_cache_evictions | Number of cache evictions. | count | — |
| search_results_cache_hit_ratio | Cache hit ratio (hits / total requests). | gauge | — |
| search_results_cache_hits | Search cache hit count. | count | — |
| search_results_cache_items_count | Number of items currently in the search cache. | gauge | — |
| search_results_cache_max_size_bytes | Maximum allowed size of the search cache in bytes. | gauge | — |
| search_results_cache_misses | Cache miss count. | count | — |
| search_results_cache_requests | Number of requests to get a key from the search cache. | count | — |
| search_results_cache_size_bytes | Size of the search cache in bytes. | gauge | — |
| search_results_cache_stale_swr_count | Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation. | count | — |
| search_results_cache_swr_background_query_count | Number of background queries triggered for stale-while-revalidate cache refreshes. | count | — |
| secrets_store_load_duration_ms | Duration in milliseconds to load the secret stores. | histogram | — |
| tool_active_count | Number of currently loaded LLM tools. | gauge | tool or tool_catalog |
| tool_load_errors | Number of errors loading the LLM tool. | count | — |
| tool_load_state | Status of the LLM tools. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | tool or tool_catalog |
| view_load_errors | Number of errors loading the view. | count | — |
| view_load_state | Status of the views. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown. | gauge | view |
| worker_active_count | Number of currently loaded workers. | gauge | worker |
| workers_load_duration_ms | Duration in milliseconds to load the worker. | histogram | — |
In addition to these core metrics, individual components can expose their own metrics. For example, the MySQL data connector exposes connection pool metrics. See Component Metrics for more information.
