
OpenAI Embedding Deployment Guide

Production operating guide for the OpenAI embedding provider (and OpenAI-compatible endpoints) covering authentication, usage-tier rate limiting, batching, retries, and observability.

Authentication & Secrets

| Parameter | Description |
| --- | --- |
| `openai_api_key` / `api_key` | OpenAI API key. Use `${secrets:...}` to resolve from a configured secret store. |
| `openai_org_id` / `org_id` | OpenAI organization ID (optional). |
| `openai_project_id` / `project_id` | OpenAI project ID (optional). |
| `openai_usage_tier` / `usage_tier` | OpenAI account usage tier. |
| `endpoint` | Endpoint override. Defaults to `https://api.openai.com/v1`. Set for OpenAI-compatible providers (Azure OpenAI, etc.). |

API keys must be sourced from a secret store in production. Aliases exist for credential parameters: api_key ↔ openai_api_key, org_id ↔ openai_org_id, etc.
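As a minimal sketch of wiring these parameters together (the `embedding_provider` block name and the secret name `my-openai-key` are assumptions for illustration; only the parameter names come from the table above):

```yaml
# Hypothetical provider block; parameter names match the table above.
embedding_provider:
  openai_api_key: ${secrets:my-openai-key}   # resolved from the secret store
  openai_org_id: org-example                 # optional
  openai_usage_tier: tier2                   # drives the internal rate controller
```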

OpenAI-Compatible Providers

Set endpoint to route embeddings through any OpenAI-compatible provider (Azure OpenAI, Together, vLLM, Groq, local Ollama with the OpenAI-compat endpoint). Verify the provider implements /v1/embeddings.
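For example, routing through a local OpenAI-compatible server might look like this (the endpoint value, port, and secret name are illustrative assumptions):

```yaml
# Route through a local OpenAI-compatible server.
embedding_provider:
  endpoint: http://localhost:8000/v1    # e.g. a local vLLM serving /v1/embeddings
  api_key: ${secrets:local-key}         # many local servers accept any token
```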

Resilience Controls

Usage Tier Rate Limiting

Tier selection governs the internal rate controller:

| Tier | Max concurrency | Requests / minute |
| --- | --- | --- |
| free | 1 | 100 |
| tier1 | 35 | 3,000 |
| tier2 | 60 | 5,000 |
| tier3 | 60 | 5,000 |
| tier4 | 125 | 10,000 |
| tier5 | 125 | 10,000 |
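The tier table can be modeled as a small lookup plus a concurrency gate. A minimal sketch follows — the `TIER_LIMITS` mapping mirrors the table above, but the `TierGate` class is illustrative, not the provider's actual controller, and RPM enforcement is omitted for brevity:

```python
import threading

# Mirrors the tier table above: tier -> (max concurrency, requests/minute).
TIER_LIMITS = {
    "free":  (1, 100),
    "tier1": (35, 3_000),
    "tier2": (60, 5_000),
    "tier3": (60, 5_000),
    "tier4": (125, 10_000),
    "tier5": (125, 10_000),
}

class TierGate:
    """Caps in-flight embedding requests at the tier's max concurrency."""

    def __init__(self, tier):
        self.max_concurrency, self.rpm = TIER_LIMITS[tier]
        self._sem = threading.BoundedSemaphore(self.max_concurrency)

    def __enter__(self):
        self._sem.acquire()
        return self

    def __exit__(self, *exc):
        self._sem.release()

gate = TierGate("tier2")
with gate:
    pass  # issue the embeddings request here
```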

Batching

The embeddings client automatically chunks input into batches bounded by:

  • 256 inputs per batch (OpenAI's per-request input cap).
  • ~512 KiB of string bytes per request batch (safeguard against oversized requests).

Large embedding jobs are transparently split across multiple API calls.
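The chunking rule above can be sketched as follows (the function name and exact byte accounting are assumptions; the client applies equivalent bounds internally):

```python
MAX_BATCH_INPUTS = 256         # OpenAI's per-request input cap
MAX_BATCH_BYTES = 512 * 1024   # ~512 KiB safeguard on total string bytes

def chunk_inputs(inputs):
    """Split inputs into batches bounded by count and total UTF-8 bytes."""
    batches, batch, batch_bytes = [], [], 0
    for text in inputs:
        size = len(text.encode("utf-8"))
        # Start a new batch if adding this input would exceed either bound.
        # (A single oversized input still gets its own batch; it cannot be split.)
        if batch and (len(batch) >= MAX_BATCH_INPUTS
                      or batch_bytes + size > MAX_BATCH_BYTES):
            batches.append(batch)
            batch, batch_bytes = [], 0
        batch.append(text)
        batch_bytes += size
    if batch:
        batches.append(batch)
    return batches
```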

Retry Behavior

Embeddings retry with Fibonacci backoff, up to 10 retries. Retriable conditions:

  • HTTP 429 (rate limit, throttling)
  • HTTP 500, 503 (transient server errors)
  • Transient reqwest errors (connect failures, timeouts)

Throttling (429 with rate-limit body) is detected explicitly and surfaces as a structured rate-limit error after retries are exhausted.
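The schedule and retry predicate described above can be sketched as follows (the one-second base delay and helper names are assumptions; the status codes are those listed):

```python
import itertools

RETRIABLE_STATUSES = {429, 500, 503}  # from the list above
MAX_RETRIES = 10

def fibonacci_delays(base=1.0):
    """Yield Fibonacci-spaced delays: base*1, base*1, base*2, base*3, ..."""
    a, b = 1, 1
    while True:
        yield base * a
        a, b = b, a + b

def is_retriable(status, transport_error):
    # Transient transport errors (connect failures, timeouts) also retry.
    return transport_error or status in RETRIABLE_STATUSES

# First ten delays with base=1.0: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55 seconds.
delays = list(itertools.islice(fibonacci_delays(), MAX_RETRIES))
```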

Capacity & Sizing

  • Vector dimensions: Bounded by the selected embedding model (e.g., text-embedding-3-small: 1536, text-embedding-3-large: 3072). Choose based on downstream storage and retrieval cost.
  • Concurrency budget: Plan for tier-based concurrency × typical per-request latency (~100-300 ms) to estimate achievable throughput. Embedding requests are IO-bound and scale well with concurrency up to the budget.
  • Token limits: Each input is bounded by the model's context window (8192 tokens for text-embedding-3-*). Inputs longer than the window fail with a 400 — truncate or chunk at the caller.
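As a worked example of the sizing rule above (the 200 ms latency is an assumed typical value; the tier2 figures come from the tier table):

```python
# Tier2 figures from the tier table; latency is an assumed typical value.
concurrency = 60
rpm_budget = 5_000
latency_s = 0.2  # ~200 ms per request

concurrency_bound = concurrency / latency_s   # ceiling from concurrency alone
rate_bound = rpm_budget / 60                  # ceiling from the RPM budget
sustained = min(concurrency_bound, rate_bound)
```

Here the concurrency bound is 300 requests/s but the RPM budget caps sustained throughput at roughly 83 requests/s, so the rate budget, not concurrency, is the binding limit.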

Metrics

Embedding requests use a dedicated metric namespace separate from chat/LLM metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `embeddings_requests` | Counter | model, encoding_format, optional user, optional dimensions | Total embedding requests issued. |
| `embeddings_failures` | Counter | same as above | Total embedding request failures. |
| `embeddings_internal_request_duration_ms` | Histogram | same as above | Request latency (client-side). |
| `embeddings_load_errors` | Counter | — | Runtime load-time errors. |
| `embeddings_active_count` | Gauge | — | Currently loaded embedding models. |
| `embeddings_load_state` | Gauge | — | Load state (0/1). |

See Component Metrics for enabling and exporting metrics.

Task History

Embedding request operations emit text_embed spans in task history, with fields:

  • input (truncated)
  • labels (model, encoding_format, optional user, optional dimensions)
  • outputs_produced (number of vectors returned)
  • errors (when applicable)

Known Limitations

  • No automatic truncation: Inputs longer than the model's context window fail with a 400 error; truncate or chunk at the caller.
  • No token-level rate limiting: The rate controller counts requests; token-level TPM limits imposed by OpenAI may still be hit and surface as 429.
  • Provider compatibility varies: OpenAI-compatible providers may not implement every parameter (dimensions, user, encoding_format).
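The "truncate or chunk at the caller" advice can be sketched with a crude word-count heuristic. A real implementation should count tokens with the model's tokenizer (e.g. tiktoken); the `WORDS_PER_CHUNK` ratio here is an assumption, not a guarantee of fitting the window:

```python
# Rough caller-side chunking. Real token counts require the model's
# tokenizer; the words-per-token ratio here is a crude approximation.
MAX_TOKENS = 8192        # context window for text-embedding-3-* models
WORDS_PER_CHUNK = 6000   # conservative guess: well under 8192 tokens

def chunk_text(text):
    """Split text into word-bounded chunks intended to fit the window."""
    words = text.split()
    return [" ".join(words[i:i + WORDS_PER_CHUNK])
            for i in range(0, len(words), WORDS_PER_CHUNK)] or [""]
```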

Troubleshooting

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| 401 Unauthorized | Wrong or revoked API key. | Rotate the key; update the secret store. |
| Sustained 429 rate_limit_exceeded | Tier budget too low or burst exceeds concurrency. | Raise openai_usage_tier, reduce max_concurrency, or upgrade the OpenAI tier. |
| 400 with "maximum context length" | Input exceeds model context window. | Truncate or chunk inputs at the caller. |
| Embeddings much slower than expected | Single-threaded caller, no batching. | Batch inputs; the client chunks into 256-input / 512 KiB batches, but the caller must parallelize embedding jobs. |
| Latency spikes every few hundred requests | Transient 429 with Fibonacci backoff recovering. | Expected at tier ceiling; raise tier or reduce load. |