Azure Cosmos DB Data Connector

The Azure Cosmos DB Data Connector exposes Cosmos DB containers (NoSQL / Core SQL API) as SQL tables in Spice. The connector samples a configurable number of documents at startup, infers an Arrow schema, and streams documents into DataFusion for federated SQL queries alongside data from other connectors.

datasets:
  - from: cosmosdb:store.products
    name: products
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}

Configuration

from

The from field takes the form cosmosdb:{database}.{container} or cosmosdb:{database}/{container}. The connector also accepts a bare {container} when cosmosdb_database is provided in params.

datasets:
  - from: cosmosdb:store.orders
    name: orders
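
The slash-delimited and bare-container forms described above are equivalent; a sketch of both (the bare form requires cosmosdb_database, and the dataset names here are illustrative):

```yaml
datasets:
  - from: cosmosdb:store/orders
    name: orders
  - from: cosmosdb:orders
    name: orders_by_param
    params:
      cosmosdb_database: store
```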

name

The dataset name, used as the table name within Spice. The name cannot be a reserved keyword.

params

Authentication

Provide either a full Cosmos DB connection string (preferred) or the discrete account_endpoint + account_key pair. Secrets must be sourced from a secret store in production.

| Parameter Name | Description | Required |
|---|---|---|
| cosmosdb_connection_string | Full connection string copied from the Azure portal. Takes precedence over account_endpoint / account_key. | Either this or both endpoint and key |
| cosmosdb_account_endpoint | Account endpoint URL, e.g. https://my-account.documents.azure.com:443/. | When the connection string isn't set |
| cosmosdb_account_key | Primary or secondary account key. | When the connection string isn't set |

Microsoft Entra ID and managed-identity authentication are tracked as a post-RC enhancement and are not supported in the current release.

Data Shape

| Parameter Name | Description | Default |
|---|---|---|
| cosmosdb_database | Database name. When unset, parsed from the from: path (database.container). | - |
| query | Cosmos SQL query used to scan the container. Useful when the container is large and only a subset should be surfaced as a dataset. | SELECT * FROM c |
| schema_infer_max_records | Number of documents sampled during schema inference at dataset registration. Larger samples produce a more precise schema at the cost of more RU consumption. | 100 |

Resilience

The connector applies per-account concurrency limits, bounded retries with backoff, and a permanent-error latch that disables the connector account-wide on 401/403/404 responses.

| Parameter Name | Description | Default |
|---|---|---|
| max_concurrent_requests | Maximum number of concurrent Cosmos DB requests per account endpoint, shared across all datasets pointing at the same account. | 4 |
| http_max_retries | Maximum number of retries for transient errors (HTTP 429, 5xx, network) during the schema-inference pass at dataset registration. Retries honor Retry-After and x-ms-retry-after-ms headers. | 3 |
| backoff_method | Backoff strategy between retries. exponential doubles the delay each attempt (capped at 30s); fibonacci follows the Fibonacci sequence (capped at 30s). | exponential |
| disable_on_permanent_error | When true, a permanent error (401/403/404) latches the connector into a disabled state and short-circuits subsequent requests until Spice is restarted. | true |
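
The two backoff strategies can be sketched as follows; the 1-second initial delay is an assumption for illustration, not a documented value:

```python
def backoff_delays(method: str, attempts: int, cap: float = 30.0) -> list[float]:
    """Return the delay (seconds) before each retry attempt.

    Sketches the documented behavior: exponential doubles the delay each
    attempt, fibonacci follows the Fibonacci sequence; both cap at 30s.
    The 1-second starting delay is an assumption.
    """
    delays: list[float] = []
    if method == "exponential":
        d = 1.0
        for _ in range(attempts):
            delays.append(min(d, cap))
            d *= 2
    elif method == "fibonacci":
        a, b = 1.0, 1.0
        for _ in range(attempts):
            delays.append(min(a, cap))
            a, b = b, a + b
    else:
        raise ValueError(f"unknown backoff method: {method}")
    return delays
```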

See the deployment guide for sizing, troubleshooting, and observability details.

Authentication

Copy the connection string from the Azure portal under Settings → Keys for the Cosmos DB account, then reference it from a secret store:

datasets:
  - from: cosmosdb:store.products
    name: products
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}

Explicit Endpoint and Key

When the endpoint and key are stored separately (for example in Key Vault), provide both:

datasets:
  - from: cosmosdb:store.products
    name: products
    params:
      cosmosdb_account_endpoint: https://my-account.documents.azure.com:443/
      cosmosdb_account_key: ${secrets:COSMOSDB_ACCOUNT_KEY}

If both styles are supplied, the connection string takes precedence.
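
A Cosmos DB connection string is a semicolon-separated list of AccountEndpoint=... and AccountKey=... pairs. A minimal parsing sketch (illustrative only, not the connector's implementation):

```python
def parse_connection_string(conn_str: str) -> dict[str, str]:
    """Split 'AccountEndpoint=...;AccountKey=...;' into its key/value parts."""
    parts: dict[str, str] = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue  # a trailing semicolon produces an empty segment
        # Keys never contain '='; values may (base64 key padding), so split once.
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts
```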

Schema Inference

Cosmos DB has no native schema. At dataset registration the connector runs the configured query (default SELECT * FROM c) limited to schema_infer_max_records documents and hands the result to Arrow's JSON inference. The inferred schema is locked for the lifetime of the runtime process.

JSON → Arrow type mapping

| Cosmos / JSON value | Arrow type | Notes |
|---|---|---|
| "abc" | Utf8 | |
| Integer (42, -7) | Int64 | Widens to Float64 if any sampled document contains a decimal value for the same field. |
| Floating (3.14, 1.0e9) | Float64 | |
| true / false | Boolean | |
| Object { ... } | Struct | Nested objects are preserved as Arrow structs. |
| Array [ ... ] | List | The element type is inferred from the first non-null item; heterogeneous arrays may surface as Utf8 or require a wider sample to disambiguate. |
| All-null in sample | Null | Dropped with a warning by default. Set unsupported_type_action: string to coerce to Utf8, or widen the sample so real values appear. |
| System fields (_rid, ...) | stripped | The system fields _rid, _self, _etag, _attachments, and _ts are stripped and never appear in the dataset schema. |
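
The Int64 → Float64 widening rule above can be sketched as a fold over the values sampled for one field. This is illustrative only: the real inference is Arrow's JSON reader, and the fallback to Utf8 for irreconcilable types is an assumption:

```python
def infer_field_type(sampled_values: list) -> str:
    """Merge the JSON values sampled for one field into a single Arrow type name."""
    inferred = None
    for value in sampled_values:
        if value is None:
            continue  # nulls neither narrow nor widen the type
        if isinstance(value, bool):  # check bool before int: True is an int in Python
            current = "Boolean"
        elif isinstance(value, int):
            current = "Int64"
        elif isinstance(value, float):
            current = "Float64"
        else:
            current = "Utf8"
        if inferred is None or inferred == current:
            inferred = current
        elif {inferred, current} == {"Int64", "Float64"}:
            inferred = "Float64"  # any decimal sample widens the integer field
        else:
            inferred = "Utf8"  # assumption: mixed types fall back to strings
    return inferred or "Null"  # every sampled value was null
```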

Cosmos does not emit Date, Time, Timestamp, Decimal, or Binary natively — they round-trip as strings and should be handled with CAST at query time.
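
For example, a timestamp stored as an ISO-8601 string (the created_at field here is hypothetical) can be cast when querying:

```sql
SELECT id, CAST(created_at AS TIMESTAMP) AS created_ts
FROM orders
ORDER BY created_ts DESC
LIMIT 10;
```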

Tuning Schema Inference

When optional fields are sparse in the first 100 documents but present in production data, increase schema_infer_max_records:

datasets:
  - from: cosmosdb:store.orders
    name: orders
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}
      schema_infer_max_records: '500'

Each additional sampled document consumes extra Request Units (RUs) at startup. For the most precise control, pin the schema explicitly with the columns: parameter.

unsupported_type_action

Optional. Controls behavior for fields that infer as DataType::Null (every sampled document had null for the field). Defaults to warn.

  • error — Fail dataset registration.
  • warn — Log a warning and drop the column. (Default.)
  • ignore — Silently drop the column.
  • string — Coerce the column to Utf8.

datasets:
  - from: cosmosdb:store.products
    name: products
    unsupported_type_action: string
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}

Querying

After registering a dataset, query it like any other Spice table:

SELECT id, name, price
FROM products
WHERE price > 100
ORDER BY price DESC
LIMIT 10;

SELECT
  category,
  COUNT(*) AS count,
  AVG(price) AS avg_price
FROM products
GROUP BY category
ORDER BY count DESC;

Custom Cosmos SQL Queries

When the container is large and only a subset should be surfaced as a dataset, push the predicate to Cosmos with a custom query:

datasets:
  - from: cosmosdb:store.orders
    name: active_orders
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}
      query: "SELECT * FROM c WHERE c.status = 'active'"

Joins Across Containers

Cosmos DB does not support joins across containers. Spice federates joins between Cosmos-backed datasets (and any other connector) in the local DataFusion engine:

SELECT
  o.id AS order_id,
  p.name AS product_name,
  p.price AS unit_price
FROM active_orders o
JOIN products p ON o.product_id = p.id
LIMIT 50;

Acceleration

Standard Spice acceleration (DuckDB, SQLite, Arrow in-memory, Cayenne) works on top of the Cosmos DB connector. Acceleration is recommended when the container is large or when query latency matters — it avoids per-query RU consumption against the Cosmos account.

datasets:
  - from: cosmosdb:store.products
    name: products
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_check_interval: 1h

Limitations

  • Read-only: Only SELECT scans are supported. Writes (INSERT / UPDATE / DELETE) are not implemented.
  • No filter / projection / limit pushdown into Cosmos: SQL predicates are evaluated locally by DataFusion after the documents have been streamed in. Use the query parameter to narrow at the Cosmos side.
  • Schema is frozen at registration: New fields added in Cosmos after startup are not picked up until the runtime restarts.
  • No change feed: acceleration.refresh_mode: changes is not supported.
  • No native temporal / decimal / binary types: Cosmos stores everything as JSON. Round-trip these as strings and CAST in SQL.
  • Microsoft Entra ID / managed identity not supported: Use account keys via connection string or account_endpoint + account_key.

Cookbook

A copy-pasteable example Spicepod is in the runtime repo at examples/cosmosdb-connector/.