Version: Next

Observability & Monitoring

Spice provides monitoring and observability through three mechanisms: a Prometheus-compatible metrics endpoint, an OpenTelemetry metrics exporter, and component-level metrics.

Monitoring Integrations

Prometheus Metrics Endpoint

Spice exposes a Prometheus-compatible metrics endpoint that monitoring systems can scrape. The endpoint serves metrics in the Prometheus exposition format, which is supported by most enterprise monitoring platforms including Datadog, New Relic, Chronosphere, Grafana Cloud, and others.

Default Configuration

The metrics endpoint listens on port 9090 by default. The endpoint address is logged at startup:

2024-11-28T19:48:10.942003Z  INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090

Custom Port Binding

Use the --metrics flag to bind to a specific address and port:

spiced --metrics 0.0.0.0:9091

For Docker deployments:

FROM spiceai/spiceai:latest

EXPOSE 9090
CMD ["--metrics", "0.0.0.0:9090"]

Verifying the Endpoint

Verify the metrics endpoint is working with a GET request:

curl http://localhost:9090/metrics

# HELP runtime_flight_server_started Indicates the runtime Flight server has started.
# TYPE runtime_flight_server_started counter
runtime_flight_server_started 1
# HELP runtime_http_server_started Indicates the runtime HTTP server has started.
# TYPE runtime_http_server_started counter
runtime_http_server_started 1

# HELP dataset_load_state Status of the dataset. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
# TYPE dataset_load_state gauge
dataset_load_state{dataset="taxi_trips"} 2
dataset_load_state{dataset="taxi_trips_accelerated"} 2

# HELP dataset_active_count Number of currently loaded datasets.
# TYPE dataset_active_count gauge
dataset_active_count{engine="None"} 1
dataset_active_count{engine="duckdb"} 1
...
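To scrape this endpoint with a local Prometheus server, a minimal scrape configuration might look like the following (the job name and interval are illustrative; the target assumes the default bind address):

```yaml
# prometheus.yml: minimal scrape job for the Spice metrics endpoint.
scrape_configs:
  - job_name: 'spiced'              # illustrative job name
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090'] # default Spice metrics bind
```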

OpenTelemetry Metrics Exporter

Spice can push metrics to an OpenTelemetry collector, enabling integration with platforms such as Jaeger, New Relic, Honeycomb, and other OpenTelemetry-compatible backends.

Configuration

Configure the OpenTelemetry exporter in spicepod.yaml under runtime.telemetry.otel_exporter:

  • enabled (optional; default: true): Whether the OpenTelemetry exporter is enabled.
  • endpoint (required): The OpenTelemetry collector endpoint. The protocol (gRPC or HTTP) is inferred from the format.
  • push_interval (optional; default: 60s): How frequently metrics are pushed to the collector.
  • metrics (optional; default: []): List of metric names to export. When empty, all metrics are exported.
  • headers (optional; default: {}): Map of headers to send with each export request. For HTTP, sent as HTTP headers; for gRPC, sent as metadata entries (keys must be lowercase ASCII). Values support the ${secrets:...} replacement syntax.

Protocol

Spice infers the OTLP protocol from the endpoint format:

  • gRPC — bare host:port with no scheme (e.g. localhost:4317). Default port: 4317.
  • HTTP — includes the http:// or https:// scheme and ends in /v1/metrics (e.g. http://localhost:4318/v1/metrics, https://otlp.us3.datadoghq.com/v1/metrics). Default port: 4318.
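The inference rule above can be sketched in Python (an illustrative approximation, not the runtime's actual implementation):

```python
def infer_otlp_protocol(endpoint: str) -> str:
    """Approximate Spice's OTLP protocol inference from the endpoint format."""
    if endpoint.startswith(("http://", "https://")):
        # An explicit scheme selects OTLP/HTTP; the path ends in /v1/metrics.
        return "http"
    # A bare host:port with no scheme selects OTLP/gRPC.
    return "grpc"

print(infer_otlp_protocol("localhost:4317"))                    # grpc
print(infer_otlp_protocol("http://localhost:4318/v1/metrics"))  # http
```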

Authentication

For collectors that require authentication (Datadog, Grafana Cloud, New Relic, Honeycomb, etc.), set the headers map. Secret values should be loaded from a supported secret store using the ${secrets:...} replacement syntax rather than committed to source:

runtime:
  telemetry:
    otel_exporter:
      endpoint: 'https://otlp.example.com/v1/metrics'
      headers:
        Authorization: 'Bearer ${secrets:otlp_token}'

Note: gRPC metadata keys must be lowercase

When exporting over gRPC, header keys are sent as gRPC metadata and must be lowercase ASCII — use authorization, not Authorization. The runtime fails fast at startup if any gRPC metadata key is invalid. HTTP exports preserve the casing you provide.
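A sketch of that validation, assuming a simple lowercase-ASCII check (the real rules for gRPC metadata keys are stricter, restricting the allowed character set):

```python
def validate_grpc_metadata_key(key: str) -> None:
    """Raise if a header key is not lowercase ASCII.

    Hypothetical helper mirroring the fail-fast startup behavior
    described above; not the runtime's actual code.
    """
    if not key.isascii() or key != key.lower():
        raise ValueError(
            f"invalid gRPC metadata key {key!r}: keys must be lowercase ASCII"
        )

validate_grpc_metadata_key("authorization")  # ok, no exception
```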

Examples

Local gRPC collector

runtime:
  telemetry:
    enabled: true
    otel_exporter:
      endpoint: 'localhost:4317'
      push_interval: '30s'

Local HTTP collector

runtime:
  telemetry:
    enabled: true
    otel_exporter:
      endpoint: 'http://localhost:4318/v1/metrics'
      push_interval: '30s'

Datadog (OTLP/HTTP)

Replace us3 with your Datadog site (us3, us5, eu, ap1, etc.) and store the API key in a secret store:

runtime:
  telemetry:
    enabled: true
    otel_exporter:
      endpoint: 'https://otlp.us3.datadoghq.com/v1/metrics'
      push_interval: '30s'
      headers:
        DD-API-KEY: ${secrets:dd_api_key}

Equivalent standard OTLP environment-variable form (for cross-reference):

export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.us3.datadoghq.com"
export OTEL_EXPORTER_OTLP_HEADERS="DD-API-KEY=${DD_API_KEY}"

For a complete Datadog setup including metric prefixing and custom tags via OTLP resource attributes, see the Datadog monitoring guide.

Grafana Cloud (OTLP/HTTP)

Grafana Cloud's OTLP gateway expects HTTP Basic authentication. Obtain the base64-encoded instanceID:accessPolicyToken credential from the Grafana Cloud "OpenTelemetry" connection page and store it in a secret:

runtime:
  telemetry:
    enabled: true
    otel_exporter:
      endpoint: 'https://otlp-gateway-us-central2.grafana.net/otlp/v1/metrics'
      push_interval: '30s'
      headers:
        Authorization: 'Basic ${secrets:grafana_cloud_auth}'

Equivalent standard OTLP environment-variable form (for cross-reference):

export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-us-central2.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic ${GRAFANA_CLOUD_AUTH}"

Match the region in the URL to your Grafana Cloud stack (us-central2, eu-west-2, prod-ap-south-0, etc.).
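The Basic credential is the base64 encoding of instanceID:accessPolicyToken. A hypothetical Python helper for producing it (the function name and parameters are illustrative; Grafana Cloud's connection page supplies the same value pre-encoded):

```python
import base64

def grafana_basic_credential(instance_id: str, access_policy_token: str) -> str:
    """Build the base64 value for an 'Authorization: Basic ...' header."""
    raw = f"{instance_id}:{access_policy_token}".encode("ascii")
    return base64.b64encode(raw).decode("ascii")

print(grafana_basic_credential("123456", "glc_example_token"))
```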

gRPC collector with auth metadata

runtime:
  telemetry:
    enabled: true
    otel_exporter:
      endpoint: 'otel-collector.internal:4317'
      push_interval: '30s'
      headers:
        # Keys MUST be lowercase for gRPC
        api-key: ${secrets:collector_api_key}

Metric Naming and Custom Tags

Two runtime fields control how exported metrics are named and labeled across all readers (Prometheus scrape, cluster OTLP reader, and the otel_exporter push exporter):

  • runtime.telemetry.metric_prefix — prepends a string to every metric name (e.g. spiceai.query_duration_ms). Useful for namespacing in shared backends.
  • runtime.telemetry.properties — attaches custom key/value attributes as OpenTelemetry resource attributes, which most backends surface as dimensions or tags.

runtime:
  telemetry:
    metric_prefix: 'spiceai.'
    properties:
      environment: prod
      region: us-west-2
      team: data-platform

Both fields apply to every exporter the runtime has enabled. See the Datadog monitoring guide for backend-specific notes (Datadog requires dd-otel-metric-config to map resource attributes to tags).
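As a sketch of the combined effect, assuming a Prometheus-style exposition of the final metric (backends differ in how resource attributes surface, so the label rendering below is illustrative):

```python
def render_metric(name, value, prefix="", resource_attrs=None):
    """Illustrative: how a metric might appear once metric_prefix is
    prepended and properties are surfaced as labels."""
    labels = resource_attrs or {}
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{prefix}{name}{{{label_str}}} {value}"

print(render_metric("query_duration_ms", 12.5, prefix="spiceai.",
                    resource_attrs={"environment": "prod", "region": "us-west-2"}))
# spiceai.query_duration_ms{environment="prod",region="us-west-2"} 12.5
```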

Metric Filtering

To export only specific metrics, use the metrics parameter:

runtime:
  telemetry:
    enabled: true
    otel_exporter:
      endpoint: 'localhost:4317'
      metrics:
        - query_duration_ms
        - query_executions
        - dataset_load_state

When metrics is empty or omitted, all available metrics are exported.
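The allow-list semantics can be sketched as follows (illustrative, not the runtime's code):

```python
def should_export(metric_name, allowed):
    """Illustrative semantics of the `metrics` parameter:
    an empty allow-list means every metric is exported."""
    return not allowed or metric_name in allowed

print(should_export("query_duration_ms", ["query_duration_ms"]))  # True
print(should_export("http_requests", ["query_duration_ms"]))      # False
print(should_export("http_requests", []))                         # True
```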

For full configuration details, see the runtime.telemetry reference.

Available Metrics

Spice exposes the following metrics. Each metric lists its type and its dimensions, the labels available for filtering and aggregation; a metric with no listed dimensions is emitted without labels. Dimensions annotated (request context) expand to: protocol, client, client_version, client_system, user_agent, runtime, runtime_version, runtime_system (individual labels are only emitted when the corresponding request attribute is present).
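The shared load-state encoding used by the *_load_state gauges can be decoded with a simple mapping when writing alerts (an illustrative helper, not part of Spice):

```python
# State encoding shared by dataset_load_state, catalog_load_state,
# model_load_state, and the other *_load_state gauges.
LOAD_STATES = {
    0: "Initializing",
    1: "Ready",
    2: "Disabled",
    3: "Error",
    4: "Refreshing",
    5: "ShuttingDown",
}

def load_state_name(value):
    """Map a *_load_state gauge value to its state name."""
    return LOAD_STATES.get(int(value), "Unknown")

print(load_state_name(1))  # Ready
print(load_state_name(3))  # Error
```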

accelerated_ready_state_federated_fallback (count; dimensions: dataset_name)
Number of times the federated table was queried because the accelerated table was still loading its initial data.

accelerated_zero_results_federated_fallback (count; dimensions: dataset_name)
Number of times the federated table was queried because the accelerated table returned zero results.

ai_inferences_with_spice_count (count; dimensions: tools_used)
Number of AI inferences made with Spice.

catalog_load_errors (count; no dimensions)
Number of errors loading the catalog provider.

catalog_load_state (gauge; dimensions: catalog)
Status of the catalog provider. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.

component_metric_registered_count (gauge; no dimensions)
Number of currently registered component metrics.

dataset_acceleration_ingestion_lag_ms (gauge; dimensions: dataset, mode)
Lag between the current wall-clock time and the maximum time_column value after the refresh operation, in milliseconds. Disabled by default.

dataset_acceleration_last_refresh_time_ms (gauge; dimensions: dataset)
Unix timestamp in milliseconds when the last refresh completed. Disabled by default.

dataset_acceleration_max_timestamp_after_refresh_ms (gauge; dimensions: dataset, mode)
Maximum value of the dataset's time_column after the refresh operation, in milliseconds. Disabled by default.

dataset_acceleration_max_timestamp_before_refresh_ms (gauge; dimensions: dataset, mode)
Maximum value of the dataset's time_column before the refresh operation, in milliseconds. Disabled by default.

dataset_acceleration_refresh_data_fetches_skipped (count; dimensions: dataset, mode)
Number of refresh data fetches skipped due to unchanged file metadata.

dataset_acceleration_refresh_duration_ms (histogram; dimensions: dataset, mode)
Duration in milliseconds to load full or appended refresh data.

dataset_acceleration_refresh_errors (count; dimensions: dataset, mode)
Number of errors refreshing the dataset.

dataset_acceleration_refresh_lag_ms (gauge; dimensions: dataset, mode)
Difference between the maximum time_column value after and before the refresh operation, in milliseconds.

dataset_acceleration_refresh_rows_written (count; dimensions: dataset)
Cumulative number of rows read from the federated source and written into the accelerated table.

dataset_acceleration_refresh_bytes_written (count; dimensions: dataset)
Cumulative number of bytes (Arrow in-memory size) read from the federated source and written into the accelerated table.

dataset_acceleration_refresh_worker_panics (count; dimensions: dataset)
Number of times a refresh worker panicked while refreshing a dataset.

dataset_acceleration_size_bytes (gauge; dimensions: dataset)
Size of the accelerated table storage in bytes.

dataset_acceleration_snapshot_bootstrap_bytes (gauge; dimensions: dataset)
Number of bytes downloaded when bootstrapping the acceleration from a snapshot.

dataset_acceleration_snapshot_bootstrap_checksum (gauge; dimensions: dataset, checksum)
Checksum of the snapshot downloaded during bootstrap (emitted with the checksum attribute).

dataset_acceleration_snapshot_bootstrap_duration_ms (count; dimensions: dataset)
Time in milliseconds taken to download the snapshot used to bootstrap the acceleration.

dataset_acceleration_snapshot_failure_count (count; dimensions: dataset)
Number of failures encountered while writing snapshots.

dataset_acceleration_snapshot_write_bytes (gauge; dimensions: dataset)
Number of bytes written for the most recent snapshot.

dataset_acceleration_snapshot_write_checksum (gauge; dimensions: dataset, checksum)
Checksum of the most recent snapshot write (emitted with the checksum attribute).

dataset_acceleration_snapshot_write_duration_ms (histogram; dimensions: dataset)
Time in milliseconds taken to write the latest snapshot to object storage.

dataset_acceleration_snapshot_write_timestamp (gauge; dimensions: dataset)
Unix timestamp (seconds) when the most recent snapshot write completed.

dataset_active_count (gauge; dimensions: engine)
Number of currently loaded datasets.

dataset_load_errors (count; no dimensions)
Number of errors loading the dataset.

dataset_load_state (gauge; dimensions: dataset)
Status of the dataset. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.

dataset_unavailable_time_ms (gauge; dimensions: dataset)
Time the dataset went offline, in milliseconds.

embeddings_active_count (gauge; dimensions: embeddings, source)
Number of currently loaded embeddings.

embeddings_cache_evictions (count; no dimensions)
Number of cache evictions.

embeddings_cache_hit_ratio (gauge; no dimensions)
Cache hit ratio (hits / total requests).

embeddings_cache_hits (count; no dimensions)
Cache hit count.

embeddings_cache_items_count (gauge; no dimensions)
Number of items currently in the cache.

embeddings_cache_max_size_bytes (gauge; no dimensions)
Maximum allowed size of the cache in bytes.

embeddings_cache_misses (count; no dimensions)
Cache miss count.

embeddings_cache_requests (count; no dimensions)
Number of requests to get a key from the cache.

embeddings_cache_size_bytes (gauge; no dimensions)
Size of the cache in bytes.

embeddings_cache_stale_swr_count (count; no dimensions)
Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation.

embeddings_cache_swr_background_query_count (count; no dimensions)
Number of background queries triggered for stale-while-revalidate cache refreshes.

embeddings_failures (count; dimensions: model, encoding_format, user, dimensions)
Number of embedding failures.

embeddings_internal_request_duration_ms (histogram; dimensions: model, encoding_format, user, dimensions)
Duration in milliseconds of running an embedding request internally.

embeddings_load_errors (count; no dimensions)
Number of errors loading the embedding.

embeddings_load_state (gauge; dimensions: model)
Status of the embedding. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.

embeddings_requests (count; dimensions: model, encoding_format, user, dimensions)
Number of embedding requests.

flight_do_exchange_data_updates_sent (count; no dimensions)
Number of data updates sent via DoExchange.

flight_do_put_bytes_written (count; dimensions: dataset)
Cumulative number of bytes (Arrow in-memory size) received and written via Flight DoPut.

flight_do_put_rows_written (count; dimensions: dataset)
Cumulative number of rows received and written via Flight DoPut.

flight_request_duration_ms (histogram; dimensions: method, command, (request context))
Measures the duration of Flight requests in milliseconds.

flight_requests (count; dimensions: method, command, (request context))
Total number of Flight requests.

http_requests (count; dimensions: method, path, status, (request context))
Number of HTTP requests.

http_requests_duration_ms (histogram; dimensions: method, path, status, (request context))
Measures the duration of HTTP requests in milliseconds.

llm_failures (count; dimensions: model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions)
Number of LLM failures.

llm_internal_request_duration_ms (histogram; dimensions: model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions)
Duration in milliseconds of running an LLM request internally.

llm_load_state (gauge; dimensions: model)
Status of the LLM model. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.

llm_requests (count; dimensions: model, stream, request_level_tools, tool_choice, user, metadata, responses_api, instructions)
Number of LLM requests.

model_active_count (gauge; dimensions: model, source)
Number of currently loaded models.

model_load_duration_ms (histogram; no dimensions)
Duration in milliseconds to load the model.

model_load_errors (count; no dimensions)
Number of errors loading the model.

model_load_state (gauge; dimensions: model)
Status of the model. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.

query_active_count (histogram; dimensions: protocol, one of http, flight, flightsql, internal)
Number of concurrent top-level queries actively being processed in the runtime.

query_duration_ms (histogram; dimensions: tags, datasets, (request context))
The total amount of time spent planning and executing queries, in milliseconds.

query_execution_duration_ms (histogram; dimensions: tags, datasets, (request context))
The total amount of time spent only executing queries, in milliseconds (0 for cached queries).

query_executions (count; dimensions: (request context))
Number of query executions.

query_failures (count; dimensions: tags, datasets, err_code, (request context))
Number of query failures.

query_processed_bytes (count; dimensions: (request context))
Number of bytes processed by the runtime.

query_produced_spills (count; dimensions: (request context))
Number of spills produced by the query.

query_returned_bytes (count; dimensions: (request context))
Number of bytes returned to query clients.

query_returned_rows (histogram; dimensions: (request context))
Number of rows returned to query clients.

query_spilled_bytes (count; dimensions: (request context))
Number of spilled bytes produced by the query.

query_spilled_rows (count; dimensions: (request context))
Number of spilled rows produced by the query.

results_cache_evictions (count; no dimensions)
Number of cache evictions.

results_cache_hit_ratio (gauge; no dimensions)
Cache hit ratio (hits / total requests).

results_cache_hits (count; no dimensions)
Cache hit count.

results_cache_items_count (gauge; no dimensions)
Number of items currently in the cache.

results_cache_max_size_bytes (gauge; no dimensions)
Maximum allowed size of the cache in bytes.

results_cache_misses (count; no dimensions)
Cache miss count.

results_cache_requests (count; no dimensions)
Number of requests to get a key from the cache.

results_cache_size_bytes (gauge; no dimensions)
Size of the cache in bytes.

results_cache_stale_swr_count (count; no dimensions)
Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation.

results_cache_swr_background_query_count (count; no dimensions)
Number of background queries triggered for stale-while-revalidate cache refreshes.

runtime_flight_server_started (count; no dimensions)
Indicates the runtime Flight server has started.

runtime_http_server_started (count; no dimensions)
Indicates the runtime HTTP server has started.

scheduler_active_executors_count (gauge; dimensions: node_id)
Number of executors currently connected to the scheduler node.

search_results_cache_evictions (count; no dimensions)
Number of cache evictions.

search_results_cache_hit_ratio (gauge; no dimensions)
Cache hit ratio (hits / total requests).

search_results_cache_hits (count; no dimensions)
Search cache hit count.

search_results_cache_items_count (gauge; no dimensions)
Number of items currently in the search cache.

search_results_cache_max_size_bytes (gauge; no dimensions)
Maximum allowed size of the search cache in bytes.

search_results_cache_misses (count; no dimensions)
Cache miss count.

search_results_cache_requests (count; no dimensions)
Number of requests to get a key from the search cache.

search_results_cache_size_bytes (gauge; no dimensions)
Size of the search cache in bytes.

search_results_cache_stale_swr_count (count; no dimensions)
Number of stale-while-revalidate background refreshes skipped due to an existing in-flight revalidation.

search_results_cache_swr_background_query_count (count; no dimensions)
Number of background queries triggered for stale-while-revalidate cache refreshes.

secrets_store_load_duration_ms (histogram; no dimensions)
Duration in milliseconds to load the secret stores.

tool_active_count (gauge; dimensions: tool or tool_catalog)
Number of currently loaded LLM tools.

tool_load_errors (count; no dimensions)
Number of errors loading the LLM tool.

tool_load_state (gauge; dimensions: tool or tool_catalog)
Status of the LLM tool. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.

view_load_errors (count; no dimensions)
Number of errors loading the view.

view_load_state (gauge; dimensions: view)
Status of the view. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.

worker_active_count (gauge; dimensions: worker)
Number of currently loaded workers.

workers_load_duration_ms (histogram; no dimensions)
Duration in milliseconds to load the worker.
Component Metrics

In addition to these core metrics, individual components can expose their own metrics. For example, the MySQL data connector exposes connection pool metrics. See Component Metrics for more information.