
OpenAI Model Deployment Guide

Production operating guide for the OpenAI model provider (and OpenAI-compatible endpoints) covering authentication, usage-tier-based rate limiting, the Responses API, and observability.

Authentication & Secrets

| Parameter | Description |
| --- | --- |
| `openai_api_key` | OpenAI API key. Use `${secrets:...}` to resolve from a configured secret store. |
| `openai_org_id` | OpenAI organization ID (optional). |
| `openai_project_id` | OpenAI project ID (optional). |
| `endpoint` | Endpoint override. Defaults to `https://api.openai.com/v1`. Set for OpenAI-compatible providers (Azure OpenAI, Groq, etc.). |

API keys must be sourced from a secret store in production. Rotate keys periodically; the OpenAI dashboard tracks per-key usage, which helps with rotation planning.
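As a sketch, a Spicepod-style model definition that sources the key from a secret store might look like the following. The model name, component name, and secret key names are illustrative assumptions, not prescribed values:

```yaml
models:
  - from: openai:gpt-4o-mini
    name: assistant
    params:
      # Resolve the API key from the configured secret store; never inline it.
      openai_api_key: ${secrets:OPENAI_API_KEY}
      # Optional: scope requests to a specific organization and project.
      openai_org_id: ${secrets:OPENAI_ORG_ID}
      openai_project_id: ${secrets:OPENAI_PROJECT_ID}
```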

OpenAI-Compatible Providers

Set endpoint to target any OpenAI-compatible provider (Azure OpenAI, xAI, Groq, Together, on-prem vLLM, etc.). See the OpenAI model reference for provider-specific configuration examples. When using the Responses API (responses_api: enabled), confirm the target provider implements OpenAI's Responses API.
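For example, targeting a Groq-hosted OpenAI-compatible endpoint is a matter of overriding `endpoint`. This is a hedged sketch assuming the same Spicepod `models`/`params` layout; the model identifier and secret name are illustrative:

```yaml
models:
  - from: openai:llama-3.3-70b-versatile
    name: groq-llama
    params:
      # Route OpenAI-protocol traffic to Groq's compatible endpoint.
      endpoint: https://api.groq.com/openai/v1
      openai_api_key: ${secrets:GROQ_API_KEY}
```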

Resilience Controls

Usage Tier Rate Limiting

| Parameter | Default | Description |
| --- | --- | --- |
| `openai_usage_tier` | `tier1` | OpenAI account usage tier. Accepted values: `free`, `tier1`, `tier2`, `tier3`, `tier4`, `tier5`. |

Tier selection governs the internal rate controller's concurrency and per-minute request budgets:

| Tier | Max concurrency | Requests / minute |
| --- | --- | --- |
| `free` | 1 | 100 |
| `tier1` | 35 | 3,000 |
| `tier2` | 60 | 5,000 |
| `tier3` | 60 | 5,000 |
| `tier4` | 125 | 10,000 |
| `tier5` | 125 | 10,000 |

Override tier defaults per model via global model parameters:

| Parameter | Description |
| --- | --- |
| `max_concurrency` | Override the per-model concurrency budget. |
| `requests_per_minute_limit` | Override the per-model RPM budget. |

The built-in rate controller queues and paces outbound requests to stay within these budgets, avoiding the tight loop of hitting the OpenAI rate limiter and retrying.
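Putting tier selection and the per-model overrides together, a hedged sketch (assuming these parameters accept the same `params` placement as the authentication settings; the numeric values are illustrative):

```yaml
models:
  - from: openai:gpt-4o-mini
    name: paced-assistant
    params:
      openai_api_key: ${secrets:OPENAI_API_KEY}
      # Match the account's actual usage tier so its defaults apply.
      openai_usage_tier: tier3
      # Optional per-model overrides; these take precedence over tier defaults.
      max_concurrency: 40
      requests_per_minute_limit: 2000
```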

Retry Behavior

Chat Completions / Responses requests rely on the rate controller for pacing; there is no application-level retry loop around the chat/responses path. Transient 429 / 5xx responses surface to the caller.

Embeddings (see the OpenAI Embedding Deployment Guide) implement an application-level retry with Fibonacci backoff and a 10-retry cap.

Responses API

| Parameter | Default | Description |
| --- | --- | --- |
| `responses_api` | `disabled` | `enabled` routes `/v1/responses` traffic through the OpenAI Responses API. |
| `openai_responses_tools` | - | Comma-separated list of OpenAI-hosted tools (`code_interpreter`, `web_search`) exposed via Responses. |

Note: Responses API hosted tools are not available from the /v1/chat/completions endpoint.
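A sketch of enabling the Responses API with hosted tools, assuming the parameters follow the same `params` convention as above (model and component names are illustrative):

```yaml
models:
  - from: openai:gpt-4o
    name: researcher
    params:
      openai_api_key: ${secrets:OPENAI_API_KEY}
      # Route /v1/responses traffic through the OpenAI Responses API.
      responses_api: enabled
      # Hosted tools are only reachable via /v1/responses,
      # not /v1/chat/completions.
      openai_responses_tools: code_interpreter,web_search
```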

Capacity & Sizing

  • Context window: Driven by the selected model (e.g., gpt-4o-mini: 128k). Spice does not enforce context trimming — prompt and history management is the caller's responsibility.
  • Concurrency: Set by openai_usage_tier or explicit max_concurrency. For high-throughput workloads, upgrade the OpenAI tier rather than over-subscribing the rate limiter.
  • Latency: Bounded by OpenAI's server-side latency plus network RTT. Streaming (stream=true) begins emitting tokens quickly and is preferred for chat interfaces.

Metrics

All LLM requests (OpenAI, Hugging Face, filesystem) share a common metric namespace:

| Metric | Type | Description |
| --- | --- | --- |
| `llm_requests` | Counter | Total LLM requests issued. |
| `llm_failures` | Counter | Total LLM request failures. |
| `llm_internal_request_duration_ms` | Histogram | Request latency (client-side). |
| `llm_prompt_tokens_total` | Counter | Total prompt tokens sent. |
| `llm_completion_tokens_total` | Counter | Total completion tokens received. |

Requests routed through the Responses API carry a responses_api label for differentiation.

See Component Metrics for enabling and exporting metrics.

Task History

Chat and Responses operations emit these task history spans:

| Span | Fields | Description |
| --- | --- | --- |
| `ai_completion` | `stream=true\|false`, usage tokens | Chat Completions request. |
| `responses` | `stream=true\|false`, usage tokens | Responses API request. |
| `health` | - | Health probe against the endpoint (chat or responses). |

captured_output and token usage fields are logged in task-history entries.

Known Limitations

  • No automatic token counting for rate limiting: The rate controller counts requests, not tokens. Token-level rate limits imposed by OpenAI (TPM) are not pre-checked; excess requests surface as 429 errors.
  • No chat/responses application retry: Retries for chat/responses are not implemented at the Spice layer. Pace via max_concurrency/requests_per_minute_limit to stay under rate limits.
  • OpenAI-compatible providers vary: Not every OpenAI-compatible provider implements every parameter (e.g., responses_api, openai_reasoning_effort). Test against your specific provider.

Troubleshooting

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| `401 Unauthorized` | Wrong or revoked API key. | Rotate the key and update the secret store. |
| `429 rate_limit_exceeded` | Tier budget too low or bursts exceed concurrency. | Reduce `max_concurrency` to smooth bursts, or upgrade the OpenAI account tier and raise `openai_usage_tier` to match. |
| `429` tokens-per-minute errors despite pacing | Spice rate-limits requests, not tokens. | Reduce the per-request token budget; throttle via `max_concurrency`. |
| Responses API returns `404` | Provider does not implement Responses. | Set `responses_api: disabled`, or point at a Responses-capable endpoint. |
| Slow first-token latency | `stream=false` waits for the full completion. | Use `stream=true` for interactive chat. |