docs: clarify BYOK + Custom Inference request path and data flow (#138)

hongyi-chen · oz-agent · oz-for-oss[bot] · web-flow · commit 13ae0470cfd6 · 2026-05-28T15:22:44.000+02:00
* docs: clarify BYOK request path and data flow The BYOK doc previously said keys are "stored locally" (true) and that Warp "directly routes" requests to the provider (misleading — the Warp Agent harness is server-hosted, so traffic does transit Warp's backend while the key is used in-flight per request). This commit: - Replaces "directly route" language with an explicit 3-step data flow. - Adds a "Why does the request route through Warp's backend?" note explaining the server-side harness. - Adds a sentence to the ZDR section noting BYOK request bodies are not retained, used for training, or logged for analytics. - Tightens the diagram alt text and intro paragraph to remove the same "directly" ambiguity. Co-Authored-By: Oz <oz-agent@warp.dev> * Update src/content/docs/agent-platform/inference/bring-your-own-api-key.mdx Co-authored-by: oz-for-oss[bot] <277970191+oz-for-oss[bot]@users.noreply.github.com> * docs: generalize harness explanation in BYOK doc Replace specific feature list (Codebase Context, Rules, Secret Redaction, multi-step tool orchestration) with a more general 'Warp's agent harness' reference. Keeps the explanation accurate without enumerating internals that may evolve. Co-Authored-By: Oz <oz-agent@warp.dev> * docs: clarify Custom Inference endpoint request path and data flow Issue #11681 reported that the privacy framing on the Custom Inference endpoint page was misleading: requests are server-hosted through warp-server, so traffic does transit Warp's backend even though the endpoint URL and API key are stored locally on the client. This commit narrows and corrects the privacy claim on the Custom Inference endpoint page, mirroring the BYOK rewrite already in this PR: - Replace the blanket 'never synced to the cloud' wording for endpoint URLs with a narrower, accurate claim: API keys are never synced or stored on Warp's servers; endpoint URLs and model identifiers may appear in Warp's usage telemetry, but API keys never do. - Add an explicit 3-step request flow (harness assembles -> in-flight key authenticates the call -> response streams back) so the server-side path is no longer surprising. - Add a 'Why does the request route through Warp's backend?' callout matching the BYOK page. - Tighten the ZDR section to note that prompts/responses transit Warp's backend without being used for training, and scope the existing retention bullets to the provider side. Also align the BYOK headline claim with the same wording ('never synced or stored on Warp's servers') so both pages converge on a single phrasing. Confirmed against warp-server: - logic/ai/llm/custom_endpoint/client.go:14-21 - the OpenAI-compatible client is constructed server-side using hostConfig.CustomEndpointAPIKey() and hostConfig.CustomEndpointBaseURL() from the request, not from persistent server config. - logic/ai/llm/user_api_keys/util.go:7 - keys arrive per-request via Request_Settings_ApiKeys. Co-Authored-By: Oz <oz-agent@warp.dev> * docs: drop telemetry caveat from Custom Inference doc Per review feedback, simplify the Custom Inference endpoint privacy framing to a single durable claim — API keys are never synced or stored on Warp's servers — without adding a separate caveat about endpoint URL or model identifier telemetry. Co-Authored-By: Oz <oz-agent@warp.dev> * docs: soften BYOK / Custom Inference key-storage claim per review Per Petra's review feedback: the previous phrasing 'stored locally on your device and never synced or stored on Warp's servers' technically holds but implies too strongly that the API key never leaves the user's machine. The key does transit Warp's backend in-flight per request (see the 3-step flow further down each page). Reframe the headline storage claim on both pages to focus on what the key is for instead of where it isn't: it is stored only on the user's device and used to authenticate requests to the model provider / configured endpoint. The downstream 3-step flow and 'Why does the request route through Warp's backend?' callout remain unchanged and continue to explain the actual transit path. Co-Authored-By: Oz <oz-agent@warp.dev> * docs: adopt Petra's plain-language framing for BYOK key handling Per the remaining review feedback on PR #138: - Replace 'stored only on your device' headline claim with explicit language that the key passes through Warp's servers but is not stored there, mirroring Petra's preferred phrasing. - Reshuffle the 3-step flow so step 1 is local (client pulls the key from secure storage and sends it up) and step 2 explicitly states that the agent harness runs on Warp's backend, answering Petra's question about where assembly happens. - Reword the 'held in memory' sentence to use the same 'passes through but is not stored' framing. Same changes applied in parallel to the Custom Inference Endpoint page. Co-Authored-By: Oz <oz-agent@warp.dev> --------- Co-authored-by: Oz <oz-agent@warp.dev> Co-authored-by: oz-for-oss[bot] <277970191+oz-for-oss[bot]@users.noreply.github.com>
diff --git a/src/content/docs/agent-platform/inference/bring-your-own-api-key.mdx b/src/content/docs/agent-platform/inference/bring-your-own-api-key.mdx
@@ -7,7 +7,7 @@ description: >-
 
 Warp supports **Bring Your Own API Key (BYOK)** for users who want to connect Warp's agents to their own Anthropic, OpenAI, or Google API accounts.
 
-This lets you use your own API keys to access models directly, giving you full control over model selection, billing, and data routing. See [Model Choice](/agent-platform/inference/model-choice/) for a list of supported models.
+This lets you use your own API keys for model access, giving you control over model selection, billing, and data routing. See [Model Choice](/agent-platform/inference/model-choice/) for a list of supported models.
 
 BYOK provides greater flexibility in model access and ensures Warp **never consumes your** [AI credits](/support-and-community/plans-and-billing/credits/) for requests routed through your own keys.
 
@@ -31,9 +31,19 @@ Platform credits apply to every cloud agent run on any plan, and to local agent
 
 ## How BYOK works
 
-When you add your own model API keys in Warp, those keys are stored **locally on your device** and are **never synced to the cloud**.
+When you add your own model API keys in Warp, those keys are stored **only on your device** (in your OS keychain or equivalent secure storage), never on Warp's servers. They're used to make requests to your chosen model provider.
 
-Warp uses these API keys when routing your agent requests to the model provider you've configured.
+When you send a prompt using a model with the **key icon**:
+
+1. Your local Warp client pulls your API key from your device's secure storage and sends it up to Warp's backend along with your prompt.
+2. Warp's agent harness, which runs on Warp's backend, assembles the full request (system instructions, conversation context, tools) and uses your key in-flight to call your chosen model provider (Anthropic, OpenAI, or Google).
+3. The provider's response streams back through Warp's backend to your client.
+
+Your API key passes through Warp's servers each time you send a request, but Warp never stores it there — it's used only in-flight to call the provider, then discarded.
+
+:::note
+**Why does the request route through Warp's backend?** Warp's agent harness runs server-side — the same runtime that powers [Agent Mode](/agent-platform/local-agents/interacting-with-agents/terminal-and-agent-modes/) with Warp-billed models. BYOK swaps the credential used to call the provider; it does not change where the harness runs.
+:::
 
 :::caution
 BYOK does not apply to [Cloud Agents](/agent-platform/cloud-agents/overview/). Because your API keys are stored locally on your device, they are not available to cloud-hosted agent runs. Cloud agent runs always consume [Warp credits](/support-and-community/plans-and-billing/credits/).
@@ -45,7 +55,7 @@ When a model is selected using your own key:
 * Costs are billed directly through your model provider account.
 * Warp does not retain or store your API key on any of its servers.
 
-![Diagram showing how Warp routes BYOK agent requests directly through your provider API key, bypassing Warp credits.](../../../../assets/support-and-community/Pricing-Blog-BYOK.png)
+![Diagram showing how Warp authenticates BYOK agent requests with your provider API key, bypassing Warp credits.](../../../../assets/support-and-community/Pricing-Blog-BYOK.png)
 
 ## Enabling BYOK
 
@@ -117,9 +127,11 @@ You can choose to enable **Warp credit fallback**. When enabled, if an agent req
 
 Warp is **SOC 2 compliant** and has **Zero Data Retention (ZDR)** policies with all of its contracted LLM providers. No customer AI data is retained, stored, or used for training by the model providers.
 
+BYOK prompts and responses transit Warp's backend (see [How BYOK works](#how-byok-works)). Warp does not use this content for training; retention and analytics handling follow the same account-level privacy and telemetry settings that apply to Warp-billed traffic.
+
 However, when you use your own API key:
 
-* Data retention policies depend on your provider’s account settings.
+* Data retention policies on the **provider side** depend on your provider’s account settings.
 * Warp cannot enforce ZDR for requests sent through your API keys.
 * If your Anthropic, OpenAI, or Google account does not have ZDR enabled, your requests may be retained by the provider according to their terms.
 
diff --git a/src/content/docs/agent-platform/inference/custom-inference-endpoint.mdx b/src/content/docs/agent-platform/inference/custom-inference-endpoint.mdx
@@ -18,7 +18,7 @@ Custom inference endpoints are available on Free and all eligible paid plans for
 * **OpenAI-compatible** - Works with any endpoint that implements the OpenAI Chat Completions API.
 * **Provider flexibility** - Use a model router (OpenRouter, LiteLLM), a model provider with an OpenAI-compatible surface (z.ai), or your own internal gateway.
 * **No AI credits consumed for inference** - Inference is billed directly by your endpoint provider. On Business and Enterprise, local agent runs that route through a custom inference endpoint still consume [platform credits](/support-and-community/plans-and-billing/platform-credits/) for Warp's platform infrastructure.
-* **Local configuration** - Endpoint URLs and credentials are stored locally on your device and never synced to the cloud.
+* **Local API key storage** - Your endpoint API key is stored **only on your device** (in your OS keychain or equivalent secure storage), never on Warp's servers. It's used to make requests to your configured endpoint.
 
 ## How it works
 
@@ -29,7 +29,19 @@ A custom inference endpoint expects your endpoint to implement the **OpenAI Chat
 * **z.ai** - A model provider with an OpenAI-compatible API surface for its models.
 * **Internal gateways** - Any in-house service that fronts model providers behind an OpenAI-compatible endpoint (for example, a corporate AI gateway with logging, redaction, or access control).
 
-When you configure a custom inference endpoint, Warp stores the endpoint URL, model identifiers, and credentials **locally on your device**. They are never synced to Warp's servers.
+When you configure a custom inference endpoint, your endpoint URL, model identifiers, and API key are stored **only on your device**, never on Warp's servers. Your API key is used to make requests to your configured endpoint.
+
+When you send a prompt using an endpoint-routed model:
+
+1. Your local Warp client pulls your endpoint URL and API key from your device's secure storage and sends them up to Warp's backend along with your prompt.
+2. Warp's agent harness, which runs on Warp's backend, assembles the full request (system instructions, conversation context, tools) and uses your key in-flight to call your configured endpoint.
+3. Your endpoint's response streams back through Warp's backend to your client.
+
+Your API key passes through Warp's servers each time you send a request, but Warp never stores it there — it's used only in-flight to call your endpoint, then discarded.
+
+:::note
+**Why does the request route through Warp's backend?** Warp's agent harness runs server-side — the same runtime that powers [Agent Mode](/agent-platform/local-agents/interacting-with-agents/terminal-and-agent-modes/) with Warp-billed models and [BYOK](/agent-platform/inference/bring-your-own-api-key/). A custom inference endpoint swaps the upstream destination and credential; it does not change where the harness runs.
+:::
 
 :::caution
 Custom inference endpoints don't apply to [Cloud Agents](/agent-platform/cloud-agents/overview/). Because the configuration is stored locally, it isn't available to cloud-hosted agent runs. Cloud agent runs always consume [Warp credits](/support-and-community/plans-and-billing/credits/).
@@ -39,7 +51,7 @@ When a model routed through your endpoint is selected:
 
 * Warp **doesn't consume** your [AI credits](/support-and-community/plans-and-billing/credits/) for that request.
 * Costs are billed directly by your endpoint provider.
-* Warp doesn't retain or store your endpoint credentials on any of its servers.
+* Warp doesn't retain or store your API key on any of its servers.
 
 ## Enabling a custom inference endpoint
 
@@ -86,13 +98,15 @@ Some AI-powered features (Codebase Context, Active AI recommendations, cloud age
 
 Warp is **SOC 2 compliant** and has **Zero Data Retention (ZDR)** agreements with all of its contracted LLM providers.
 
+Custom inference endpoint prompts and responses transit Warp's backend (see [How it works](#how-it-works)). Warp does not use this content for training; retention and analytics handling follow the same account-level privacy and telemetry settings that apply to Warp-billed traffic.
+
 When you use a custom inference endpoint:
 
-* Data retention is determined by **your endpoint provider** and any upstream model providers they route to.
+* Data retention on the **provider side** is determined by your endpoint provider and any upstream model providers they route to.
 * Warp **cannot enforce ZDR** for requests sent through a custom inference endpoint.
 * If your endpoint provider does not have ZDR with the underlying model provider, your requests may be retained according to their terms.
 
-Review your endpoint provider's data handling and retention policies before routing sensitive prompts through a custom inference endpoint.
+Warp itself never stores your endpoint API key. Review your endpoint provider's data handling and retention policies before routing sensitive prompts through a custom inference endpoint.
 
 ## Centrally managed configuration