diff --git a/daprdocs/content/en/concepts/building-blocks-concept.md b/daprdocs/content/en/concepts/building-blocks-concept.md index 08d94f5ccde..d9f5d78c84c 100644 --- a/daprdocs/content/en/concepts/building-blocks-concept.md +++ b/daprdocs/content/en/concepts/building-blocks-concept.md @@ -31,4 +31,4 @@ Dapr provides the following building blocks: | [**Distributed lock**]({{% ref "distributed-lock-api-overview" %}}) | `/v1.0-alpha1/lock` | The distributed lock API enables you to take a lock on a resource so that multiple instances of an application can access the resource without conflicts and provide consistency guarantees. | [**Cryptography**]({{% ref "cryptography-overview" %}}) | `/v1.0-alpha1/crypto` | The Cryptography API enables you to perform cryptographic operations, such as encrypting and decrypting messages, without exposing keys to your application. | [**Jobs**]({{% ref "jobs-overview" %}}) | `/v1.0-alpha1/jobs` | The Jobs API enables you to schedule and orchestrate jobs. Example scenarios include: -| [**Conversation**]({{% ref "conversation-overview" %}}) | `/v1.0-alpha2/conversation` | The Conversation API enables you to supply prompts to converse with different large language models (LLMs) and includes features such as prompt caching and personally identifiable information (PII) obfuscation. \ No newline at end of file +| [**Conversation**]({{% ref "conversation-overview" %}}) | `/v1.0-alpha2/conversation` | The Conversation API enables you to supply prompts to converse with different large language models (LLMs) and includes features such as prompt caching, response formatting, usage metrics, and personally identifiable information (PII) obfuscation. 
\ No newline at end of file diff --git a/daprdocs/content/en/concepts/overview.md b/daprdocs/content/en/concepts/overview.md index 125b040c421..e72b48ab4b1 100644 --- a/daprdocs/content/en/concepts/overview.md +++ b/daprdocs/content/en/concepts/overview.md @@ -53,7 +53,7 @@ Each of these building block APIs is independent, meaning that you can use any n | [**Distributed lock**]({{% ref "distributed-lock-api-overview" %}}) | The distributed lock API enables your application to acquire a lock for any resource that gives it exclusive access until either the lock is released by the application, or a lease timeout occurs. | [**Cryptography**]({{% ref "cryptography-overview" %}}) | The cryptography API provides an abstraction layer on top of security infrastructure such as key vaults. It contains APIs that allow you to perform cryptographic operations, such as encrypting and decrypting messages, without exposing keys to your applications. | [**Jobs**]({{% ref "jobs-overview" %}}) | The jobs API enables you to schedule jobs at specific times or intervals. -| [**Conversation**]({{% ref "conversation-overview" %}}) | The conversation API enables you to abstract the complexities of interacting with large language models (LLMs) and includes features such as prompt caching and personally identifiable information (PII) obfuscation. Using [conversation components]({{% ref supported-conversation %}}), you can supply prompts to converse with different LLMs. +| [**Conversation**]({{% ref "conversation-overview" %}}) | The conversation API enables you to abstract the complexities of interacting with large language models (LLMs) and includes features such as prompt caching, response formatting, usage metrics, and personally identifiable information (PII) obfuscation. Using [conversation components]({{% ref supported-conversation %}}), you can supply prompts to converse with different LLMs. 
### Cross-cutting APIs diff --git a/daprdocs/content/en/developing-applications/building-blocks/conversation/conversation-overview.md b/daprdocs/content/en/developing-applications/building-blocks/conversation/conversation-overview.md index 7483a41c296..e5dfddad27f 100644 --- a/daprdocs/content/en/developing-applications/building-blocks/conversation/conversation-overview.md +++ b/daprdocs/content/en/developing-applications/building-blocks/conversation/conversation-overview.md @@ -14,7 +14,7 @@ Dapr's conversation API reduces the complexity of securely and reliably interact Diagram showing the flow of a user's app communicating with Dapr's LLM components. -In addition to enabling critical performance and security functionality (like [prompt caching]({{% ref "#prompt-caching" %}}) and [PII scrubbing]({{% ref "#personally-identifiable-information-pii-obfuscation" %}})), the conversation API also provides: +In addition to enabling critical performance and security functionality (like [caching]({{% ref "#caching" %}}) and [PII scrubbing]({{% ref "#personally-identifiable-information-pii-obfuscation" %}})), the conversation API also provides: - **Tool calling capabilities** that allow LLMs to interact with external functions and APIs, enabling more sophisticated AI applications - **OpenAI-compatible interface** for seamless integration with existing AI workflows and tools @@ -29,9 +29,20 @@ You can also pair the conversation API with Dapr functionalities, like: The following features are out-of-the-box for [all the supported conversation components]({{% ref supported-conversation %}}). -### Prompt caching +### Caching -The Conversation API includes a built-in caching mechanism (enabled by the cacheTTL parameter) that optimizes both performance and cost by storing previous model responses for faster delivery to repetitive requests. This is particularly valuable in scenarios where similar prompt patterns occur frequently. 
When caching is enabled, Dapr creates a deterministic hash of the prompt text and all configuration parameters, checks if a valid cached response exists for this hash within the time period (for example, 10 minutes), and returns the cached response immediately if found. If no match exists, Dapr makes the API call and stores the result. This eliminates external API calls, lowers latency, and avoids provider charges for repeated requests. The cache exists entirely within your runtime environment, with each Dapr sidecar maintaining its own local cache. +The Conversation API supports two kinds of caching: + +- **Prompt caching**: Some LLM providers cache prompt prefixes on their side to reduce the latency and cost of repeated prompts. You enable this per request via the API using the `promptCacheRetention` parameter (for example, `24h` for OpenAI). See the [Conversation API reference]({{% ref conversation_api.md %}}) for request-level options. Support depends on the provider. +- **Response caching**: Conversation components can cache full LLM responses in the sidecar. When you set the component metadata field `responseCacheTTL` (for example, `10m`), Dapr caches responses keyed by the request (prompt and options). Repeated identical requests are served from the cache without calling the LLM, reducing latency and cost. This cache is in-memory and per sidecar. Configure this in your [conversation component]({{% ref supported-conversation %}}) spec. + +### Response formatting + +You can request structured output from the model by passing a `responseFormat` (JSON Schema) in the request. This is supported by the Deepseek, Google AI, Hugging Face, OpenAI, and Anthropic components. See the [Conversation API reference]({{% ref conversation_api.md %}}). + +### Usage metrics + +Responses can include token usage (`promptTokens`, `completionTokens`, `totalTokens`) for the conversation. See [Response content]({{% ref "conversation_api.md#response-content" %}}) in the API reference. 
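As a rough illustration of the usage metrics described above, the following Python sketch totals token usage across the `outputs` of a conversation response. The field names follow the response example in the API reference. The helper name `summarize_usage` and the trimmed sample payload are assumptions made for this example and are not part of Dapr or its SDKs.

```python
import json

# Trimmed sample shaped like the basic conversation response in the API
# reference; "choices" is left empty here for brevity (assumption).
SAMPLE_RESPONSE = """
{
  "outputs": [
    {
      "choices": [],
      "model": "gpt-4o",
      "usage": {
        "promptTokens": 12,
        "completionTokens": 8,
        "totalTokens": 20,
        "promptTokensDetails": {"audioTokens": 0, "cachedTokens": 0}
      }
    }
  ]
}
"""


def summarize_usage(response: dict) -> dict:
    """Sum token counts over every output that reports usage.

    Hypothetical helper: `usage` and `promptTokensDetails` are optional,
    so missing fields are treated as zero.
    """
    totals = {
        "promptTokens": 0,
        "completionTokens": 0,
        "totalTokens": 0,
        "cachedTokens": 0,
    }
    for output in response.get("outputs", []):
        usage = output.get("usage")
        if not usage:
            continue  # this output did not report usage
        for key in ("promptTokens", "completionTokens", "totalTokens"):
            # int() also accepts numeric strings, in case counts arrive as text
            totals[key] += int(usage.get(key, 0))
        details = usage.get("promptTokensDetails") or {}
        totals["cachedTokens"] += int(details.get("cachedTokens", 0))
    return totals


if __name__ == "__main__":
    print(summarize_usage(json.loads(SAMPLE_RESPONSE)))
```

A non-zero `cachedTokens` total would suggest that the provider served part of the prompt from its prompt cache, which can help when judging whether `promptCacheRetention` is having an effect.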
### Personally identifiable information (PII) obfuscation diff --git a/daprdocs/content/en/developing-applications/building-blocks/conversation/howto-conversation-layer.md b/daprdocs/content/en/developing-applications/building-blocks/conversation/howto-conversation-layer.md index e79af9f190e..b71824fb978 100644 --- a/daprdocs/content/en/developing-applications/building-blocks/conversation/howto-conversation-layer.md +++ b/daprdocs/content/en/developing-applications/building-blocks/conversation/howto-conversation-layer.md @@ -149,7 +149,7 @@ with DaprClient() as d: metadata = { 'model': 'modelname', 'key': 'authKey', - 'cacheTTL': '10m', + 'responseCacheTTL': '10m', } response = d.converse_alpha1( diff --git a/daprdocs/content/en/reference/api/conversation_api.md b/daprdocs/content/en/reference/api/conversation_api.md index 95980f41e10..c84289b9017 100644 --- a/daprdocs/content/en/reference/api/conversation_api.md +++ b/daprdocs/content/en/reference/api/conversation_api.md @@ -40,6 +40,8 @@ POST http://localhost:/v1.0-alpha2/conversation//converse | `temperature` | A float value to control the temperature of the model. Used to optimize for consistency (0) or creativity (1). Optional | | `tools` | Tools register the tools available to be used by the LLM during the conversation. Optional | | `toolChoice` | Controls which (if any) tool is called by the model. Values: `auto`, `required`, or specific tool name. Defaults to `auto` if tools are present. Optional | +| `responseFormat` | Structured output described using a JSON Schema object. Use this when you want typed structured output. Supported by Deepseek, Google AI, Hugging Face, OpenAI, and Anthropic components. Optional | +| `promptCacheRetention` | Retention duration for the prompt cache. When set, enables extended prompt caching so cached prefixes stay active longer. With OpenAI, supports up to 24 hours. See [OpenAI prompt caching](https://platform.openai.com/docs/guides/prompt-caching#prompt-cache-retention). 
Optional | #### Input body @@ -211,8 +213,35 @@ Code | Description `400` | Request was malformed `500` | Request formatted correctly, error in Dapr code or underlying component +### Resiliency and timeouts + +Conversation component calls use Dapr's [resiliency policies]({{% ref "resiliency-overview.md" %}}). You can target the conversation component by name under `targets/components//outbound` and attach timeout, retry, and circuit breaker policies. + +- **Timeout**: The timeout is applied to the request context. That context is passed through to the conversation component (and thus to the LLM provider in the sidecar). If the LLM does not respond within the configured duration, the context is cancelled and the request is terminated with an error. Set a timeout that accounts for typical LLM response times. +- **Retries and circuit breaker**: These apply to the overall Converse invocation. Retries re-run the entire conversation call on failure (for example, after a timeout or network error). The circuit breaker, when open, skips calling the component and returns an error immediately. These are not passed to the LLM as configuration. + ### Response content +Each item in `outputs` can include: + +| Field | Description | +| ----- | ----------- | +| `choices` | Completion choices. | +| `model` | The model used for the conversation. Optional | +| `usage` | Token usage metrics for the request. Optional | + +#### Usage metrics + +When present, `usage` contains: + +| Field | Description | +| ----- | ----------- | +| `promptTokens` | Number of tokens in the prompt. | +| `completionTokens` | Number of tokens in the generated completion. | +| `totalTokens` | Total tokens used (prompt + completion). | +| `promptTokensDetails` | Optional. Can include `audioTokens` (audio input tokens in the prompt) and `cachedTokens` (tokens served from prompt cache). | +| `completionTokensDetails` | Optional. 
Can include `reasoningTokens`, `acceptedPredictionTokens`, `rejectedPredictionTokens`, `audioTokens`. | + #### Basic conversation response ```json @@ -226,7 +255,23 @@ Code | Description "content": "Distributed application runtime, open-source." } } - ] + ], + "model": "gpt-4o", + "usage": { + "promptTokens": 12, + "completionTokens": 8, + "totalTokens": 20, + "promptTokensDetails": { + "audioTokens": 0, + "cachedTokens": 0 + }, + "completionTokensDetails": { + "acceptedPredictionTokens": 0, + "audioTokens": 0, + "reasoningTokens": 0, + "rejectedPredictionTokens": 0 + } + } } ] } @@ -253,7 +298,23 @@ Code | Description ] } } - ] + ], + "model": "gpt-4o", + "usage": { + "promptTokens": 25, + "completionTokens": 18, + "totalTokens": 43, + "promptTokensDetails": { + "audioTokens": 0, + "cachedTokens": 0 + }, + "completionTokensDetails": { + "acceptedPredictionTokens": 0, + "audioTokens": 0, + "reasoningTokens": 0, + "rejectedPredictionTokens": 0 + } + } } ] } diff --git a/daprdocs/content/en/reference/components-reference/supported-conversation/anthropic.md b/daprdocs/content/en/reference/components-reference/supported-conversation/anthropic.md index 8ef75ce9c16..7dc03c0cccc 100644 --- a/daprdocs/content/en/reference/components-reference/supported-conversation/anthropic.md +++ b/daprdocs/content/en/reference/components-reference/supported-conversation/anthropic.md @@ -21,7 +21,7 @@ spec: value: "mykey" - name: model value: claude-3-5-sonnet-20240620 - - name: cacheTTL + - name: responseCacheTTL value: 10m ``` @@ -35,7 +35,7 @@ The above example uses secrets as plain strings. It is recommended to use a secr |--------------------|:--------:|---------|---------| | `key` | Y | API key for Anthropic. | `"mykey"` | | `model` | N | The Anthropic LLM to use. Defaults to `claude-3-5-sonnet-20240620` | `claude-3-5-sonnet-20240620` | -| `cacheTTL` | N | A time-to-live value for a prompt cache to expire. Uses Golang duration format. 
| `10m` | +| `responseCacheTTL` | N | Time-to-live for the in-memory response cache. When set, identical requests are served from cache until they expire. | `10m` | ## Related links diff --git a/daprdocs/content/en/reference/components-reference/supported-conversation/aws-bedrock.md b/daprdocs/content/en/reference/components-reference/supported-conversation/aws-bedrock.md index 4ec5a68f1d7..a57a45f4baf 100644 --- a/daprdocs/content/en/reference/components-reference/supported-conversation/aws-bedrock.md +++ b/daprdocs/content/en/reference/components-reference/supported-conversation/aws-bedrock.md @@ -21,7 +21,7 @@ spec: value: "http://localhost:4566" - name: model value: amazon.titan-text-express-v1 - - name: cacheTTL + - name: responseCacheTTL value: 10m ``` @@ -35,7 +35,7 @@ The above example uses secrets as plain strings. It is recommended to use a secr |--------------------|:--------:|---------|---------| | `endpoint` | N | AWS endpoint for the component to use and connect to emulators. Not recommended for production AWS use. | `http://localhost:4566` | | `model` | N | The LLM to use. Defaults to Bedrock's default provider model from Amazon. | `amazon.titan-text-express-v1` | -| `cacheTTL` | N | A time-to-live value for a prompt cache to expire. Uses Golang duration format. | `10m` | +| `responseCacheTTL` | N | Time-to-live for the in-memory response cache. When set, identical requests are served from cache until they expire. 
| `10m` | ## Authenticating AWS diff --git a/daprdocs/content/en/reference/components-reference/supported-conversation/googleai.md b/daprdocs/content/en/reference/components-reference/supported-conversation/googleai.md index ad3621637db..6ec53123ac9 100644 --- a/daprdocs/content/en/reference/components-reference/supported-conversation/googleai.md +++ b/daprdocs/content/en/reference/components-reference/supported-conversation/googleai.md @@ -21,7 +21,7 @@ spec: value: mykey - name: model value: gemini-1.5-flash - - name: cacheTTL + - name: responseCacheTTL value: 10m ``` @@ -35,7 +35,7 @@ The above example uses secrets as plain strings. It is recommended to use a secr |--------------------|:--------:|---------|---------| | `key` | Y | API key for GoogleAI. | `mykey` | | `model` | N | The GoogleAI LLM to use. Defaults to `gemini-1.5-flash`. | `gemini-2.0-flash` | -| `cacheTTL` | N | A time-to-live value for a prompt cache to expire. Uses Golang duration format. | `10m` | +| `responseCacheTTL` | N | Time-to-live for the in-memory response cache. When set, identical requests are served from cache until they expire. | `10m` | ## Related links diff --git a/daprdocs/content/en/reference/components-reference/supported-conversation/hugging-face.md b/daprdocs/content/en/reference/components-reference/supported-conversation/hugging-face.md index 64f7058a0ea..5a34f73f5f2 100644 --- a/daprdocs/content/en/reference/components-reference/supported-conversation/hugging-face.md +++ b/daprdocs/content/en/reference/components-reference/supported-conversation/hugging-face.md @@ -21,7 +21,7 @@ spec: value: mykey - name: model value: meta-llama/Meta-Llama-3-8B - - name: cacheTTL + - name: responseCacheTTL value: 10m ``` @@ -35,7 +35,7 @@ The above example uses secrets as plain strings. It is recommended to use a secr |--------------------|:--------:|---------|---------| | `key` | Y | API key for Huggingface. | `mykey` | | `model` | N | The Huggingface LLM to use. 
Defaults to `meta-llama/Meta-Llama-3-8B`. | `meta-llama/Meta-Llama-3-8B` | -| `cacheTTL` | N | A time-to-live value for a prompt cache to expire. Uses Golang duration format. | `10m` | +| `responseCacheTTL` | N | Time-to-live for the in-memory response cache. When set, identical requests are served from cache until they expire. | `10m` | ## Related links diff --git a/daprdocs/content/en/reference/components-reference/supported-conversation/mistral.md b/daprdocs/content/en/reference/components-reference/supported-conversation/mistral.md index 3085de37617..79aaeb188a9 100644 --- a/daprdocs/content/en/reference/components-reference/supported-conversation/mistral.md +++ b/daprdocs/content/en/reference/components-reference/supported-conversation/mistral.md @@ -21,7 +21,7 @@ spec: value: mykey - name: model value: open-mistral-7b - - name: cacheTTL + - name: responseCacheTTL value: 10m ``` @@ -35,7 +35,7 @@ The above example uses secrets as plain strings. It is recommended to use a secr |--------------------|:--------:|---------|---------| | `key` | Y | API key for Mistral. | `mykey` | | `model` | N | The Mistral LLM to use. Defaults to `open-mistral-7b`. | `open-mistral-7b` | -| `cacheTTL` | N | A time-to-live value for a prompt cache to expire. Uses Golang duration format. | `10m` | +| `responseCacheTTL` | N | Time-to-live for the in-memory response cache. When set, identical requests are served from cache until they expire. 
| `10m` | ## Related links diff --git a/daprdocs/content/en/reference/components-reference/supported-conversation/ollama.md b/daprdocs/content/en/reference/components-reference/supported-conversation/ollama.md index a0dbac727fc..44a3c4ee4bf 100644 --- a/daprdocs/content/en/reference/components-reference/supported-conversation/ollama.md +++ b/daprdocs/content/en/reference/components-reference/supported-conversation/ollama.md @@ -19,7 +19,7 @@ spec: metadata: - name: model value: llama3.2:latest - - name: cacheTTL + - name: responseCacheTTL value: 10m ``` @@ -32,7 +32,7 @@ The above example uses secrets as plain strings. It is recommended to use a secr | Field | Required | Details | Example | |--------------------|:--------:|---------|---------| | `model` | N | The Ollama LLM to use. Defaults to `llama3.2:latest`. | `phi4:latest` | -| `cacheTTL` | N | A time-to-live value for a prompt cache to expire. Uses Golang duration format. | `10m` | +| `responseCacheTTL` | N | Time-to-live for the in-memory response cache. When set, identical requests are served from cache until they expire. | `10m` | ### OpenAI Compatibility diff --git a/daprdocs/content/en/reference/components-reference/supported-conversation/openai.md b/daprdocs/content/en/reference/components-reference/supported-conversation/openai.md index f1c29e2b5f3..c016fa14fb2 100644 --- a/daprdocs/content/en/reference/components-reference/supported-conversation/openai.md +++ b/daprdocs/content/en/reference/components-reference/supported-conversation/openai.md @@ -23,7 +23,7 @@ spec: value: gpt-4-turbo - name: endpoint value: 'https://api.openai.com/v1' - - name: cacheTTL + - name: responseCacheTTL value: 10m # - name: apiType # Optional # value: 'azure' @@ -42,7 +42,7 @@ The above example uses secrets as plain strings. It is recommended to use a secr | `key` | Y | API key for OpenAI. | `mykey` | | `model` | N | The OpenAI LLM to use. Defaults to `gpt-4-turbo`. 
| `gpt-4-turbo` | | `endpoint` | N | Custom API endpoint URL for OpenAI API-compatible services. If not specified, the default OpenAI API endpoint is used. Required when `apiType` is set to `azure`. | `https://api.openai.com/v1`, `https://example.openai.azure.com/` | -| `cacheTTL` | N | A time-to-live value for a prompt cache to expire. Uses Golang duration format. | `10m` | +| `responseCacheTTL` | N | Time-to-live for the in-memory response cache. When set, identical requests are served from cache until they expire. | `10m` | | `apiType` | N | Specifies the API provider type. Required when using a provider that does not follow the default OpenAI API endpoint conventions. | `azure` | | `apiVersion`| N | The API version to use. Required when the `apiType` is set to `azure`. | `2025-04-01-preview` |