# OpenAI Model Deployment Guide
Production operating guide for the OpenAI model provider (and OpenAI-compatible endpoints) covering authentication, usage-tier-based rate limiting, the Responses API, and observability.
## Authentication & Secrets
| Parameter | Description |
|---|---|
| `openai_api_key` | OpenAI API key. Use `${secrets:...}` to resolve from a configured secret store. |
| `openai_org_id` | OpenAI organization ID (optional). |
| `openai_project_id` | OpenAI project ID (optional). |
| `endpoint` | Endpoint override. Defaults to `https://api.openai.com/v1`. Set for OpenAI-compatible providers (Azure OpenAI, Groq, etc.). |
API keys must be sourced from a secret store in production. Rotate keys periodically; the OpenAI dashboard tracks per-key usage, which helps with rotation planning.
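A minimal configuration sketch under these assumptions: the parameters from the table above are set under a model's `params`, and the model name and secret name (`SPICE_OPENAI_API_KEY`) are illustrative.

```yaml
models:
  - from: openai:gpt-4o-mini
    name: openai
    params:
      openai_api_key: ${secrets:SPICE_OPENAI_API_KEY}  # resolved from the secret store
      openai_org_id: org-example                       # optional
      openai_project_id: proj_example                  # optional
```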
## OpenAI-Compatible Providers
Set `endpoint` to target any OpenAI-compatible provider (Azure OpenAI, xAI, Groq, Together, on-prem vLLM, etc.). See the OpenAI model reference for provider-specific configuration examples. When using the Responses API (`responses_api: enabled`), confirm the target provider implements OpenAI's Responses API.
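For example, pointing the same provider at a local vLLM server might look like the following sketch (the endpoint URL, model, and secret name are illustrative assumptions):

```yaml
models:
  - from: openai:meta-llama/Llama-3.1-8B-Instruct
    name: local-vllm
    params:
      endpoint: http://localhost:8000/v1               # OpenAI-compatible server
      openai_api_key: ${secrets:VLLM_API_KEY}          # many local servers accept any key
```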
## Resilience Controls
### Usage Tier Rate Limiting
| Parameter | Default | Description |
|---|---|---|
| `openai_usage_tier` | `tier1` | OpenAI account usage tier. Accepted values: `free`, `tier1`, `tier2`, `tier3`, `tier4`, `tier5`. |
Tier selection governs the internal rate controller's concurrency and per-minute request budget:
| Tier | Max concurrency | Requests / minute |
|---|---|---|
| `free` | 1 | 100 |
| `tier1` | 35 | 3,000 |
| `tier2` | 60 | 5,000 |
| `tier3` | 60 | 5,000 |
| `tier4` | 125 | 10,000 |
| `tier5` | 125 | 10,000 |
Override tier defaults per model via global model parameters:
| Parameter | Description |
|---|---|
| `max_concurrency` | Override the per-model concurrency budget. |
| `requests_per_minute_limit` | Override the per-model RPM budget. |
The built-in rate controller queues and paces outbound requests to stay within these budgets, avoiding the tight loop of hitting the OpenAI rate limiter and retrying.
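The pacing mechanism can be sketched as a semaphore that caps in-flight requests plus a 60-second sliding window that enforces the per-minute budget. This is an illustrative sketch, not Spice's actual implementation; the class and method names are invented for illustration.

```python
import time
from collections import deque
from threading import Semaphore

class RateController:
    """Sketch of a client-side rate controller: a semaphore caps in-flight
    requests, and a 60-second sliding window enforces the RPM budget."""

    def __init__(self, max_concurrency: int, requests_per_minute: int):
        self.slots = Semaphore(max_concurrency)
        self.rpm = requests_per_minute
        self.sent = deque()  # monotonic send times within the last 60 s

    def _rpm_delay(self, now: float) -> float:
        """Seconds to wait so the next send stays within the RPM budget."""
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()  # drop sends older than the window
        if len(self.sent) < self.rpm:
            return 0.0
        return 60.0 - (now - self.sent[0])  # wait for the oldest send to age out

    def acquire(self) -> None:
        self.slots.acquire()  # blocks when max_concurrency is reached
        delay = self._rpm_delay(time.monotonic())
        if delay > 0:
            time.sleep(delay)
        self.sent.append(time.monotonic())

    def release(self) -> None:
        self.slots.release()
```

A caller would wrap each outbound request in `acquire()` / `release()`; excess requests queue on the semaphore or sleep out the window instead of spinning against OpenAI's 429 responses.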
### Retry Behavior
Chat Completions / Responses requests rely on the rate controller for pacing; there is no application-level retry loop around the chat/responses path. Transient 429 / 5xx responses surface to the caller.
Embeddings (see the OpenAI Embedding Deployment Guide) implement an application-level retry with Fibonacci backoff and a 10-retry cap.
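The embedding retry schedule can be sketched as follows. The guide specifies only the Fibonacci progression and the 10-retry cap; the one-second base unit and the function name are assumptions for illustration.

```python
def fibonacci_backoff_delays(max_retries: int = 10, base_seconds: float = 1.0):
    """Yield the wait before each retry attempt: base_seconds * Fib(n),
    stopping after max_retries attempts."""
    a, b = 1, 1
    for _ in range(max_retries):
        yield a * base_seconds
        a, b = b, a + b  # advance the Fibonacci sequence
```

With the defaults, a caller would sleep 1, 1, 2, 3, 5, ... seconds between attempts and give up after the tenth retry.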
## Responses API
| Parameter | Default | Description |
|---|---|---|
| `responses_api` | `disabled` | `enabled` routes `/v1/responses` traffic through the OpenAI Responses API. |
| `openai_responses_tools` | - | Comma-separated list of OpenAI-hosted tools (`code_interpreter`, `web_search`) exposed via Responses. |
Note: Responses API hosted tools are not available from the `/v1/chat/completions` endpoint.
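A configuration sketch enabling the Responses API with hosted tools, assuming the same parameter placement as above (model and secret names are illustrative):

```yaml
models:
  - from: openai:gpt-4o
    name: openai-responses
    params:
      openai_api_key: ${secrets:SPICE_OPENAI_API_KEY}
      responses_api: enabled                            # route via /v1/responses
      openai_responses_tools: code_interpreter,web_search
```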
## Capacity & Sizing
- **Context window:** Driven by the selected model (e.g., `gpt-4o-mini`: 128k). Spice does not enforce context trimming; prompt and history management is the caller's responsibility.
- **Concurrency:** Set by `openai_usage_tier` or an explicit `max_concurrency`. For high-throughput workloads, upgrade the OpenAI tier rather than over-subscribing the rate limiter.
- **Latency:** Bounded by OpenAI's server-side latency plus network RTT. Streaming (`stream=true`) begins emitting tokens quickly and is preferred for chat interfaces.
## Metrics
All LLM requests (OpenAI, Hugging Face, filesystem) share a common metric namespace:
| Metric | Type | Description |
|---|---|---|
| `llm_requests` | Counter | Total LLM requests issued. |
| `llm_failures` | Counter | Total LLM request failures. |
| `llm_internal_request_duration_ms` | Histogram | Request latency (client-side). |
| `llm_prompt_tokens_total` | Counter | Total prompt tokens sent. |
| `llm_completion_tokens_total` | Counter | Total completion tokens received. |
Requests routed through the Responses API carry a `responses_api` label for differentiation.
See Component Metrics for enabling and exporting metrics.
## Task History
Chat and Responses operations emit these task history spans:
| Span | Fields | Description |
|---|---|---|
| `ai_completion` | `stream=true\|false`, usage tokens | Chat completion request. |
| `responses` | `stream=true\|false`, usage tokens | Responses API request. |
| `health` | - | Health probe against the endpoint (chat or responses). |
`captured_output` and token usage fields are logged in task-history entries.
## Known Limitations
- **No automatic token counting for rate limiting:** The rate controller counts requests, not tokens. Token-level rate limits imposed by OpenAI (TPM) are not pre-checked; excess requests surface as 429 errors.
- **No chat/responses application retry:** Retries for chat/responses are not implemented at the Spice layer. Pace via `max_concurrency` / `requests_per_minute_limit` to stay under rate limits.
- **OpenAI-compatible providers vary:** Not every OpenAI-compatible provider implements every parameter (e.g., `responses_api`, `openai_reasoning_effort`). Test against your specific provider.
## Troubleshooting
| Symptom | Likely cause | Resolution |
|---|---|---|
| `401 Unauthorized` | Wrong or revoked API key. | Rotate the key and update the secret store. |
| `429 rate_limit_exceeded` | Tier budget too low or burst exceeds concurrency. | Raise `openai_usage_tier` to match your actual account tier, reduce `max_concurrency`, or upgrade the OpenAI tier. |
| `429` `tokens_per_min` errors despite pacing | Spice rate-limits requests, not tokens. | Reduce the per-request token budget; throttle via `max_concurrency`. |
| Responses API returns 404 | Provider does not implement Responses. | Set `responses_api: disabled`, or point at a Responses-capable endpoint. |
| Slow first-token latency | `stream=false` waits for the full completion. | Use `stream=true` for interactive chat. |
