diff --git a/.gitignore b/.gitignore index 2fb302a1c..058dbc86c 100644 --- a/.gitignore +++ b/.gitignore @@ -286,6 +286,10 @@ plan/ **/agent_state_*.json **/optimize_state_*.json **/runtime_config.json +**/aws-targets.json +!01-features/07-centralize-and-govern-your-ai-infrastructure/01-gateway/**/aws-targets.json +!06-workshops/02-AgentCore-gateway/**/aws-targets.json +**/.cli/deployed-state.json # Evaluation and session results **/batch_eval_*_result.json diff --git a/01-features/05-authenticate-and-authorize/01-inbound-auth/04-inbound-auth-pingfederate/agent/private-idp-ping-agent/agentcore/aws-targets.json b/01-features/05-authenticate-and-authorize/01-inbound-auth/04-inbound-auth-pingfederate/agent/private-idp-ping-agent/agentcore/aws-targets.json deleted file mode 100644 index abf2e642e..000000000 --- a/01-features/05-authenticate-and-authorize/01-inbound-auth/04-inbound-auth-pingfederate/agent/private-idp-ping-agent/agentcore/aws-targets.json +++ /dev/null @@ -1 +0,0 @@ -[{"name": "default", "account": "837460776723", "region": "us-east-1"}] diff --git a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/.gitignore b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/.gitignore index 3eb0d5e9d..a50df3d23 100644 --- a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/.gitignore +++ b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/.gitignore @@ -21,4 +21,9 @@ Thumbs.db # Temporary files *.tmp -*.temp \ No newline at end of file +*.temp + +# Generated state and output files (contain account-specific ARNs) +agent_state_*.json +optimize_state_*.json +insights_result.json \ No newline at end of file diff --git a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/.pylintrc b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/.pylintrc new file mode 100644 index 000000000..f69ec50d9 --- /dev/null +++ b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/.pylintrc @@ -0,0 
+1,3 @@ +[FORMAT] +# Match the line length configured for ruff so the two tools don't conflict. +max-line-length=120 diff --git a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/README.md b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/README.md index a86a98f57..0b0453a71 100644 --- a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/README.md +++ b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/README.md @@ -1,18 +1,19 @@ -# AgentCore Optimization +# AgentCore optimization -End-to-end optimization workflow for an HR Assistant agent on Amazon Bedrock AgentCore runtime. Demonstrates how to measure baseline performance, generate AI-driven improvements, and validate them through A/B testing — without redeploying code. +End-to-end optimization workflow for an HR Assistant agent on Amazon Bedrock AgentCore runtime. Covers automated insights for detecting agent failures, baseline evaluation, recommendations for improving the agent, and A/B testing via configuration bundles and target-based routing. 
### What You Will Learn | Stage | Concepts Covered | |-------|-----------------| +| **Insights** | FailureAnalysis, UserIntent, ExecutionSummary: root cause clustering of agent failures | | **Baseline evaluation** | Batch evaluations on agent sessions | | **Recommendations** | System prompt optimization, tool description optimization from production traces | | **Configuration Bundles** | Versioned config containers, runtime config hooks, baggage-based injection | | **A/B Test: Config-Bundle Routing** | Prompt-level A/B testing without redeployment, online evaluation, statistical analysis | | **A/B Test: Target-Based Routing** | Code-level A/B testing, phased rollout (90/10 canary), multi-runtime comparison | -![AgentCore optimization](images/ac-optimization.png) +![AgentCore optimization](images/observe-eval-improve.png) ### Key Components @@ -21,6 +22,8 @@ End-to-end optimization workflow for an HR Assistant agent on Amazon Bedrock Age | AgentCore runtime | `bedrock-agentcore-control` | Hosts the HR Assistant container | | Configuration Bundle | `bedrock-agentcore-control` | Versioned system prompt and tool description storage | | Batch evaluation | `bedrock-agentcore` (DP) | Off-line scoring of historical sessions | +| Batch insights | `bedrock-agentcore` (DP) | Root-cause failure clustering, user intent analysis, execution summaries | +| Online insights config | `bedrock-agentcore-control` (CP) | Recurring daily insights over live agent traffic | | Recommendation | `bedrock-agentcore` (DP) | AI-generated prompt/tool improvements | | gateway + Targets | `bedrock-agentcore-control` | Traffic routing for A/B tests | | Online Eval Config | `bedrock-agentcore-control` | Continuous automatic session scoring | @@ -40,7 +43,7 @@ End-to-end optimization workflow for an HR Assistant agent on Amazon Bedrock Age - Python 3.10+ - Access to Amazon Bedrock models (Nova Lite) in your region -> **Timing note:** CloudWatch ingestion takes 2–3 minutes after invoking the agent. 
Batch evaluations take 1–5 minutes. Recommendations take 2–5 minutes. Budget ~45 minutes for the full workflow. +> **Timing note:** CloudWatch ingestion takes 2-3 minutes after invoking the agent. Batch evaluations take 1-5 minutes. Recommendations take 2-5 minutes. Budget ~45 minutes for the full workflow. ## Quick Start @@ -53,6 +56,9 @@ python deploy.py --name HRAssistV1 # Invoke the deployed agent python invoke.py --name HRAssistV1 +# [Optional] Run insights: generate traces then analyze with all 3 insight types +python insights.py --name HRAssistV1 --generate-traces + # Run the full optimization workflow python optimize.py --name HRAssistV1 @@ -67,8 +73,8 @@ The following commands reproduce the full optimization workflow from the command Install the AgentCore CLI: ```bash -npm install -g @aws/agentcore -agentcore --version # should print 0.13.0 or later +npm install -g @aws/agentcore@latest +agentcore --version # 0.20.1 or later ``` ### Step 1: Deploy the HR Assistant @@ -103,7 +109,46 @@ agentcore run batch-evaluation \ --evaluator Builtin.GoalSuccessRate Builtin.Helpfulness Builtin.Correctness ``` -### Step 3: Get Recommendations +### Step 3: Run Insights + +After generating traffic, run insights to understand which sessions are failing and why before optimizing. The CLI insights commands require an `agentcore` project with a deployed runtime (v0.20.1+). 
+ +```bash +# Run a one-shot insights job over the last 7 days of traces: +agentcore run insights \ + --runtime HRAssistant \ + --insights Builtin.Insight.FailureAnalysis \ + --insights Builtin.Insight.UserIntent \ + --insights Builtin.Insight.ExecutionSummary \ + --lookback-days 7 \ + --wait \ + --json + +# List all insights jobs for this project: +agentcore insights history --json + +# View results for a specific job: +agentcore insights results --id --json + +# Add a recurring daily online-insights config: +agentcore add online-insights \ + --name HROnlineInsights \ + --runtime HRAssistant \ + --insights Builtin.Insight.FailureAnalysis \ + --insights Builtin.Insight.UserIntent \ + --insights Builtin.Insight.ExecutionSummary \ + --sampling-rate 100 \ + --clustering-frequency DAILY \ + --enable-on-create +agentcore deploy -y +``` + +Notes: +- `insights` and `evaluators` are mutually exclusive in a single batch job. Do not pass `--evaluator` to `agentcore run insights`. +- Resource names must match `^[a-zA-Z][a-zA-Z0-9_]{0,47}$` (a leading letter followed by up to 47 letters, numbers, or underscores; no hyphens). +- `agentcore insights history` and `agentcore insights results` must be run from inside the project directory. 
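The resource-name rule in the notes above can be checked locally before calling the CLI. The following is a minimal sketch, not part of the tutorial scripts: the pattern is copied from the note, while the helper name `is_valid_resource_name` is hypothetical.

```python
import re

# Pattern copied verbatim from the note above.
RESOURCE_NAME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]{0,47}$")


def is_valid_resource_name(name: str) -> bool:
    """Return True if the name is a letter followed by up to 47 letters, digits, or underscores.

    Hypothetical helper for illustration; the CLI performs its own validation.
    """
    # fullmatch rejects a trailing newline, which a bare `$` would tolerate.
    return RESOURCE_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["HROnlineInsights", "HR-Online-Insights", "1stConfig"]:
        print(candidate, is_valid_resource_name(candidate))
```

Running a check like this before `agentcore add online-insights` catches hyphenated names early, which the note calls out as a common mistake.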
+ +### Step 4: Get Recommendations ```bash # System prompt recommendation (optimize for GoalSuccessRate) @@ -121,7 +166,7 @@ agentcore run recommendation \ --tools "get_policy:Look up an HR policy by name" ``` -### Step 4: Create Configuration Bundles +### Step 5: Create Configuration Bundles ```bash # Create control bundle (original prompt) @@ -141,7 +186,7 @@ agentcore cb versions --bundle HRControl --json agentcore cb versions --bundle HRTreatment --json ``` -### Step 5a: A/B Test — Config-Bundle Routing +### Step 6a: A/B Test -- Config-Bundle Routing ```bash # Create gateway @@ -181,13 +226,13 @@ agentcore deploy agentcore ab-test HRBundleABTest ``` -### Step 5b: A/B Test — Target-Based Routing (Phased Rollout) +### Step 6b: A/B Test -- Target-Based Routing (Phased Rollout) ```bash # Deploy v2 of the agent (with new code changes) agentcore create --name HRAssistantV2 --framework Strands --model-provider Bedrock --defaults cp utils/hr_assistant_agent.py app/HRAssistantV2/main.py -# (Apply v2 code changes to main.py — e.g. add escalate_to_hr_manager tool) +# (Apply v2 code changes to main.py -- e.g. add escalate_to_hr_manager tool) cd HRAssistantV2 && agentcore deploy # Add v2 gateway target @@ -224,7 +269,7 @@ agentcore ab-test HRTargetABTest agentcore stop ab-test HRTargetABTest ``` -### Step 6: Cleanup +### Step 7: Cleanup ```bash agentcore stop ab-test HRBundleABTest @@ -249,13 +294,176 @@ Creates an IAM execution role, packages `utils/hr_assistant_agent.py` with ARM64 State is saved to `agent_state_{name}.json` for use by subsequent scripts. -The `--version v2` flag builds an enhanced version that adds an `escalate_to_hr_manager` tool and a more detailed system prompt baked into the code — used in the target-based A/B test. +The `--version v2` flag builds an enhanced version that adds an `escalate_to_hr_manager` tool and a more detailed system prompt baked into the code. This is the version used in the target-based A/B test. 
+ +### Step 2: Insights (`insights.py`) + +Sends a set of failure-mode and successful sessions to the agent, waits for traces to propagate to CloudWatch, then calls `start_batch_evaluation` with all three insight types. Polls until the job completes and prints the full failure hierarchy, user intent clusters, and execution summary clusters. The `--online` flag also creates a recurring daily `OnlineEvaluationConfig` so insights continue running automatically over live traffic. + + +| Insight | What It Produces | +|---------|-----------------| +| **FailureAnalysis** | Identifies failures, categorizes them using a signal taxonomy (see below), traces root causes to specific spans, and provides fix recommendations. Results appear as a three-level hierarchy: failure categories, subcategories, and root cause clusters with affected session IDs. | +| **UserIntent** | Extracts what users were trying to accomplish in each session, then clusters similar intents together. Shows the most common use cases your agent handles and reveals gaps between user requests and agent capabilities. | +| **ExecutionSummary** | Summarizes the approach the agent took and the outcome for each session, then clusters similar execution patterns. Requires at least 3 sessions. | + +Each failure in the response includes one or more `signals`, the specific evidence found at a span level. Each signal has a `category` (a machine-readable taxonomy label), `evidence` (a quoted description of what went wrong in that span), and `confidence` (0–1 float). 
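Because each signal carries a `confidence` score, one way to triage a large result set is to keep only the high-confidence signals. The sketch below assumes only the three signal fields described above (`category`, `evidence`, `confidence`); the helper name `strong_signals`, the 0.8 threshold, and the sample values are illustrative and not taken from `insights.py`.

```python
def strong_signals(signals, min_confidence=0.8):
    """Return signals at or above the threshold, highest confidence first.

    Hypothetical helper: the threshold is an arbitrary choice for illustration.
    """
    kept = [s for s in signals if s.get("confidence", 0.0) >= min_confidence]
    return sorted(kept, key=lambda s: s["confidence"], reverse=True)


# Illustrative sample; field names follow the description above, values are made up.
sample_signals = [
    {
        "category": "task-instruction-category-non-compliance",
        "evidence": "Agent answered without calling any tool.",
        "confidence": 0.55,
    },
    {
        "category": "hallucination-category-hall-capabilities",
        "evidence": "Agent claimed a date-range limit the tool does not document.",
        "confidence": 0.9,
    },
]

if __name__ == "__main__":
    for signal in strong_signals(sample_signals):
        print(f"{signal['confidence']:.2f}  {signal['category']}")
```

Lowering `min_confidence` widens the net; a reasonable starting point probably depends on how noisy the results are for a given agent.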
+ +The signal categories returned by the API: + +| Category | What it means | +|---|---| +| `hallucination-category-hall-capabilities` | Agent invented constraints or limitations for a tool that do not exist in the tool spec (e.g., "this tool only supports years 2025-2026" when no such constraint is documented) | +| `hallucination-category-hall-misunderstand` | Agent misread the tool result and reported a different value (e.g., tool returned "5% match", agent reported "6%") | +| `hallucination-category-hall-usage` | Agent asserted knowledge of a tool's contents without calling it (e.g., "sabbatical leave is not in the available topics" without calling `lookup_hr_policy`) | +| `task-instruction-category-non-compliance` | Agent violated an explicit instruction in the system prompt (e.g., system prompt says "always use available tools", agent answered without calling any tool) | +| `orchestration-related-errors-category-premature-termination` | Agent terminated the task without attempting a relevant tool call, rather than calling the tool and handling any error it returns | +| `llm-output-category-nonsensical` | Agent output was incoherent or exposed internal artifacts — unresolved template placeholders, raw `` tags, or other content that should not appear in a user-facing response | +| `repetitive-behavior-category-repetition-info` | Agent asked for information the user had already provided in the same session | + +Each `affectedSessions` entry under a root cause contains: +- `sessionId` — the session where this failure occurred +- `explanation` — a sentence citing the specific span ID and what went wrong +- `fixType` — `SYSTEM_PROMPT_FIX` (addressable via prompt change) or `OTHERS` (backend or data issue) +- `recommendation` — a concrete instruction to add to the system prompt +- `failureSpans[]` — the span(s) where the failure was detected, each with `spanId`, `traceId`, and `signals[]` + +Example from a real run: + +```json +{ + "sessionId": "3b927bee-...", + 
"explanation": "At span 619f31562100b9b6, the agent concluded without ever invoking get_pay_stub, because it hallucinated a non-existent date-range constraint.", + "fixType": "SYSTEM_PROMPT_FIX", + "recommendation": "Add a guardrail requiring the agent to attempt tool calls before concluding a request cannot be fulfilled.", + "failureSpans": [{ + "spanId": "619f31562100b9b6", + "traceId": "6a3080fa61c3929d7c75522421a71cec", + "signals": [{ + "category": "hallucination-category-hall-capabilities", + "evidence": "Agent claims get_pay_stub only supports 2025-2026, but the tool definition has no such constraint documented.", + "confidence": 0.9 + }] + }] +} +``` + +### How Insights Are Triggered + +Insights run in two modes: + +**One-time (batch):** Call `start_batch_evaluation` with an `insights` list. Results come back through `get_batch_evaluation`. Use this for on-demand analysis over a specific time range. + +**Recurring (scheduled):** Create an `OnlineEvaluationConfig` with a `clusteringConfig` frequency (`DAILY`, `WEEKLY`, or `MONTHLY`). The service automatically triggers batch evaluation jobs on that cadence. Per-session analysis runs continuously; clustered results are generated during each scheduled batch job. + +### Data Source + +Insights pull from the `aws/spans` CloudWatch log group, which receives OTel span documents from AgentCore Runtime via the `opentelemetry-instrument` entry point. Each session's tool calls, model calls, and errors are captured as spans and correlated by session ID. + +The runtime log group (`/aws/bedrock-agentcore/runtimes/...`) must also be included. Without it, the insights engine cannot resolve the log events that spans reference, which produces incomplete results. + +`insights` and `evaluators` are mutually exclusive in the batch evaluation API. Use a separate batch job for each. 
+ + +The `--generate-traces` flag sends sessions across several failure categories: +- **Unknown employee IDs** (`EMP-999`, `EMP-003`) -> tool returns "not found" errors +- **Unsupported policy topics** (`sabbatical`, `floating_holiday`, `relocation`) -> tool returns error +- **Unknown benefit types** (`gym`, `commuter`, `wellness`) -> tool returns error +- **Unavailable pay periods** (`2019-12`, `2020-03`) -> tool returns "not found" error +- **Invalid date formats** in PTO requests -> agent confusion / multi-turn failure +- **Normal successful sessions** -> required for UserIntent and ExecutionSummary clustering + +### Reading the FailureAnalysis Output -### Step 2: Configuration Bundles (`optimize.py`) +``` +FailureAnalysis (2 top-level categories) + + Category: Tool Execution Failures (sessions affected: 8) + Subcategory: Employee Lookup Failures + Root cause: Unknown Employee ID Errors (4 sessions) + Recommendation: Add input validation and a list of valid employee ID formats + to the system prompt. Return a helpful error message with + instructions to verify the employee ID. + Session IDs: ['3fa85f64-...', 'c3d4e5f6-...', ...] + + Subcategory: Data Not Found + Root cause: Pay Stub Period Unavailable (2 sessions) + Recommendation: Document available pay periods in the tool description + and add graceful handling when a period is not found. + + Category: Out-of-Scope Requests (sessions affected: 7) + Subcategory: Unsupported Policy Topics + Root cause: Missing Policy Coverage (4 sessions) + Recommendation: Expand the policy knowledge base or add a clear out-of-scope + message when a policy topic is not available. +``` + +### Online Insights Config + +The `--online` flag creates an `OnlineEvaluationConfig` that runs on a daily schedule. It uses `clusteringConfig` with `frequencies: ["DAILY"]` to re-cluster sessions automatically. 
View results with `get_online_evaluation_config` or the CLI: + +```bash +# Run from inside your agentcore project directory: +agentcore insights history --json +agentcore insights results --id --json +``` + +The Python SDK call (used by `insights.py --online`): + +```python +ctrl.create_online_evaluation_config( + onlineEvaluationConfigName="HROnlineInsights", + rule={"samplingConfig": {"samplingPercentage": 100}}, + dataSourceConfig={ + "cloudWatchLogs": { + "logGroupNames": ["aws/spans", "/aws/bedrock-agentcore/runtimes/-DEFAULT"], + "serviceNames": [".DEFAULT"], + } + }, + insights=[ + {"insightId": "Builtin.Insight.FailureAnalysis"}, + {"insightId": "Builtin.Insight.UserIntent"}, + {"insightId": "Builtin.Insight.ExecutionSummary"}, + ], + clusteringConfig={"frequencies": ["DAILY"]}, + evaluationExecutionRoleArn=ROLE_ARN, + enableOnCreate=True, +) +``` + +### Chaining Insights into Recommendations -A **Configuration Bundle** is a versioned container for agent configuration keyed by runtime ARN. The agent reads the bundle at invocation time via `BedrockAgentCoreContext.get_config_bundle()` — changing the system prompt or tool descriptions requires no redeployment. +Pass the insights job ID to a recommendation to generate a system prompt that targets the identified failures: -Each bundle call returns a `bundleId` (stable) and a `versionId` (immutable snapshot). Pass `parentVersionIds` on updates to record lineage and prevent accidental overwrites. Use `commitMessage` on every create and update to document why the config changed — just like a Git commit. +```bash +# CLI: +agentcore run recommendation \ + --from-insights \ + --type system-prompt \ + --inline "You are a helpful HR Assistant for Acme Corp..." 
\ + --json + +# Python (DP client): +dp.start_recommendation( + name="HRSpRecFromInsights", + type="SYSTEM_PROMPT_RECOMMENDATION", + recommendationConfig={ + "systemPromptRecommendationConfig": { + "systemPrompt": {"text": CURRENT_SYSTEM_PROMPT}, + "agentTraces": {"batchEvaluation": {"batchEvaluationArn": INSIGHTS_EVAL_ARN}}, + "evaluationConfig": { + "evaluators": [{"evaluatorArn": "arn:aws:bedrock-agentcore:::evaluator/Builtin.GoalSuccessRate"}] + }, + } + }, +) +``` + +### Step 3: Configuration Bundles (`optimize.py`) + +A **Configuration Bundle** is a versioned container for agent configuration keyed by runtime ARN. The agent reads the bundle at invocation time via `BedrockAgentCoreContext.get_config_bundle()`. Changing the system prompt or tool descriptions does not require redeployment. + +Each bundle call returns a `bundleId` (stable) and a `versionId` (immutable snapshot). Pass `parentVersionIds` on updates to track lineage and prevent accidental overwrites. Use `commitMessage` on every create and update to document the change. #### Bundle lifecycle @@ -267,10 +475,10 @@ Each bundle call returns a `bundleId` (stable) and a `versionId` (immutable snap | Compare | `get_configuration_bundle_version` | Diff two versions; useful for audits and rollback decisions | **What we create:** -- **Control (C)** — original system prompt + original tool descriptions -- **Treatment (T1)** — recommended system prompt + recommended tool descriptions +- **Control (C)** -- original system prompt + original tool descriptions +- **Treatment (T1)** -- recommended system prompt + recommended tool descriptions -### Step 3: Batch evaluation +### Step 4: Batch evaluation Baseline batch evaluation discovers sessions from CloudWatch, runs them through built-in LLM evaluators, and returns aggregate scores: @@ -280,106 +488,96 @@ Baseline batch evaluation discovers sessions from CloudWatch, runs them through | **Helpfulness** | Was the response useful and actionable? 
| | **Correctness** | Did the agent give accurate information? | -### Step 4: Optimization Recommendations +### Step 5: Optimization Recommendations AgentCore analyzes production traces and generates: - **System Prompt Recommendation**: rewrites your system prompt to improve a target metric - **Tool Description Recommendation**: improves tool descriptions so the agent selects tools more reliably -Recommendations are returned as text and can be applied immediately via configuration bundles — no code changes needed. +Recommendations are returned as text and can be applied immediately via configuration bundles. No code changes needed. -### Step 5: Config-Bundle A/B Test +### Step 6: Config-Bundle A/B Test -Use configuration bundle routing when the change you are testing is purely configuration — a different system prompt, a different model ID, or different tool descriptions. Both variants run on the **same runtime** with different configuration bundle versions. +Use configuration bundle routing when the change is purely configuration: a different system prompt, model ID, or tool descriptions. Both variants run on the same runtime with different bundle versions. **Architecture:** ``` User Request - │ - ▼ -[gateway] ──50%──▶ [Control Bundle C] ──┐ - │ ├──▶ [HR runtime v1] ──▶ CloudWatch - └──50%──▶ [Treatment Bundle T1] ──────┘ │ - [Online Eval Config] ◀┘ - │ - [A/B Test Results] + | + v +[gateway] --50%--> [Control Bundle C] --| + | |--> [HR runtime v1] --> CloudWatch + +--50%--> [Treatment Bundle T1] -----| | + [Online Eval Config] <-+ + | + [A/B Test Results] ``` The gateway injects the correct bundle reference into each request via W3C baggage headers. The agent reads it at runtime via `BedrockAgentCoreContext.get_config_bundle()`. -**Session stickiness:** Once a session ID is assigned to a variant, all subsequent requests with that same session ID route to the same variant. 
This ensures a consistent experience within a session while distributing new sessions across variants according to your traffic weights. +**Session stickiness:** Once a session ID is assigned to a variant, all requests with that session ID go to the same variant. New sessions are distributed across variants by weight. -An **online evaluation config** automatically scores every session as it closes, without requiring explicit API calls per session. It monitors the agent's CloudWatch log group, detects when a session closes (after `sessionTimeoutMinutes` of inactivity), and runs the configured evaluators. +An **online evaluation config** scores sessions automatically as they close. It watches the agent's CloudWatch log group, detects session close (after `sessionTimeoutMinutes` of inactivity), and runs the configured evaluators. -**Results timeline:** Budget 10–15 minutes from your last request: session timeout (2 min) → evaluation (2–3 min) → aggregation (~5 min cycle). Poll until `analysisTimestamp` is populated. +**Results timeline:** Budget 10-15 minutes from your last request: session timeout (2 min) -> evaluation (2-3 min) -> aggregation (~5 min cycle). Poll until `analysisTimestamp` is populated. -### Step 6: Target-Based A/B Test +### Step 7: Target-Based A/B Test -When code changes are involved (new tools, framework upgrade, or entirely different agent implementation), use target-based routing instead. It sends traffic to two **separate runtimes** — each registered as a different gateway target. Each variant has its own online evaluation config since they have different log groups. +When the change involves code (new tools, framework upgrade, different agent implementation), use target-based routing. Traffic splits between two separate runtimes, each registered as a gateway target. Each variant needs its own online evaluation config because they have different log groups. 
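Session stickiness combined with a weighted split can be pictured as a deterministic mapping from session ID to variant. The sketch below is only an illustration of that idea: this README does not describe how the gateway actually assigns sessions, so the hash-based approach and the name `assign_variant` are assumptions.

```python
import hashlib


def assign_variant(session_id: str, weights: dict[str, int]) -> str:
    """Map a session ID to a variant deterministically, in proportion to integer weights.

    Illustrative only: hashing is one common way to get sticky assignment,
    not necessarily what the gateway does.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # The same session ID always hashes to the same bucket, which gives stickiness.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    canary_weights = {"HRAgentV1": 90, "HRAgentV2": 10}
    print(assign_variant("session-123", canary_weights))
```

With this kind of mapping, repeated requests in one session land on the same variant, while new session IDs spread across variants roughly according to the weights.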
**Architecture:** ``` -User ──► [gateway] ──90%──► [Target HRAgentV1 → HR runtime v1 (stable)] ──► CloudWatch - │ │ - └──10%──► [Target HRAgentV2 → HR runtime v2 (canary)] ──► CloudWatch - │ - [Online Eval v1 + Online Eval v2] - │ +User --> [gateway] --90%--> [Target HRAgentV1 -> HR runtime v1 (stable)] --> CloudWatch + | | + +--10%--> [Target HRAgentV2 -> HR runtime v2 (canary)] --> CloudWatch + | + [Online Eval v1 + Online Eval v2] + | [A/B Test Results] ``` -**Phased rollout:** 10% canary → validate no regressions → 50% ramp → gather statistical significance → 100% cutover → decommission old runtime. +**Phased rollout:** 10% canary -> validate no regressions -> 50% ramp -> gather statistical significance -> 100% cutover -> decommission old runtime. -**`gatewayFilter.targetPaths`** scopes the A/B test routing rule to requests matching the control target's path, ensuring only traffic intended for this test is intercepted. +**`gatewayFilter.targetPaths`** restricts the A/B routing rule to requests matching the control target's path, so only traffic for this test is affected. -## Files +## Key Concepts -| File | Description | -|:-----|:------------| -| `deploy.py` | Deploys HR Assistant v1 or v2 to AgentCore runtime | -| `invoke.py` | Invokes the deployed agent with sample HR queries | -| `optimize.py` | End-to-end optimization workflow (Steps 2–8) | -| `cleanup.py` | Deletes all AWS resources created by this tutorial | -| `requirements.txt` | Python dependencies | -| `utils/hr_assistant_agent.py` | HR Assistant agent with Configuration Bundle hook | +### From Triage to Optimization -## HR Assistant Sample Prompts +After insights identifies failure patterns, you can feed those findings into the Recommendations API to generate an improved system prompt: -```bash -python invoke.py --name HRAssistV1 --prompt "What is the PTO balance for EMP-001?" -python invoke.py --name HRAssistV1 --prompt "What is the company remote work policy?" 
-python invoke.py --name HRAssistV1 --prompt "Show me EMP-042 pay stub for January 2026." -python invoke.py --name HRAssistV1 --prompt "How many vacation days do I get after 3 years?" -``` +1. Run insights to identify recurring failure categories and root causes. +2. Call `start_recommendation` with your current system prompt, pointing it at the same agent traces (or pass the insights batch evaluation ARN directly). +3. Use A/B testing to compare the original and recommended configurations with live traffic. -## Key Concepts +The `insights.py --online` flag and the `agentcore run recommendation --from-insights ` CLI command both implement this flow. ### Config-Bundle vs. Target-Based A/B Testing | | Config-Bundle Routing | Target-Based Routing | |---|---|---| | **What changes** | System prompt or config (no code change) | Agent code, tools, model, or framework | -| **Redeployment needed** | No — config applied at request time | Yes — new runtime required | +| **Redeployment needed** | No -- config applied at request time | Yes -- new runtime required | | **Runtimes needed** | One shared runtime | Two separate runtimes | | **Eval configs needed** | One shared online eval config | One per variant (different log groups) | | **Best for** | Prompt tuning, config experiments | Code releases, version upgrades | | **Traffic split** | Typically 50/50 | Typically 90/10 canary | -| **Rollback** | Instant — update bundle version | runtime still running; shift weights back | -| **Risk** | Very low | Higher — binary change | +| **Rollback** | Instant -- update bundle version | runtime still running; shift weights back | +| **Risk** | Very low | Higher -- binary change | ### Phased Rollout (Target-Based) ``` -10% canary → validate no regressions (errors, latency, quality drop) - ↓ -50% ramp → gather statistical significance - ↓ -100% promote → complete cutover; decommission old runtime +10% canary -> validate no regressions (errors, latency, quality drop) + | +50% ramp -> gather 
statistical significance + | +100% promote -> complete cutover; decommission old runtime ``` ### Configuration Bundle Hook -The agent reads its system prompt **and tool descriptions** from the bundle on every model call — enabling live prompt updates and A/B testing without redeployment: +The agent reads its system prompt and tool descriptions from the bundle on every model call. This supports live prompt updates and A/B testing without redeployment: ```python from bedrock_agentcore.runtime import BedrockAgentCoreContext @@ -402,38 +600,23 @@ def _config_bundle_hook(event: BeforeModelCallEvent) -> None: agent.hooks.add_callback(BeforeModelCallEvent, _config_bundle_hook) ``` -This pattern allows testing both prompt changes and tool description improvements in the same A/B experiment. +This supports testing both prompt changes and tool description improvements in the same A/B experiment. ## Next Steps -- **Add custom evaluators**: Lambda-based code evaluators for deterministic policy compliance checks +- **Add custom evaluators**: Lambda-based evaluators for deterministic policy checks - **Automate the loop**: Run batch evaluations in CI/CD to catch regressions before deployment -- **Use recommendations iteratively**: Re-run recommendations after each traffic batch to compound improvements -- **Multi-metric optimization**: Run separate recommendation jobs targeting different evaluators, then pick the best balance -- **Increase canary exposure**: Use `update_ab_test` to gradually raise treatment weight (10% → 25% → 50% → 100%) -- **Continuous monitoring**: Keep online eval configs enabled in production for zero-effort quality monitoring - -## Workflow Summary - -| Step | What you do | Key API | -|------|-------------|---------| -| 1 | Deploy HR Assistant to AgentCore runtime | `create_agent_runtime` | -| 2 | Create baseline Configuration Bundle and send traffic | `create_configuration_bundle`, `invoke_agent_runtime` | -| 3 | Measure baseline performance with batch 
evaluation | `start_batch_evaluation` / `get_batch_evaluation` | -| 4a | Generate improved system prompt from production traces | `start_recommendation` (SYSTEM_PROMPT) | -| 4b | Generate improved tool descriptions from production traces | `start_recommendation` (TOOL_DESCRIPTION) | -| 5 | Package control and treatment configs into bundles | `create_configuration_bundle` / `update_configuration_bundle` | -| 6 | A/B test prompt + tool description change via config-bundle routing | `create_ab_test` (configurationBundle variants, 50/50) | -| 7 | Canary rollout of v2 via target-based routing | `create_ab_test` (target variants, 90/10 split) | -| 8 | Promote winner or roll back | `update_configuration_bundle` / stop A/B test | - ---- +- **Use recommendations iteratively**: Re-run after each traffic batch to compound improvements +- **Multi-metric optimization**: Run separate recommendation jobs for different evaluators, then pick the best result +- **Increase canary exposure**: Use `update_ab_test` to gradually raise treatment weight (10% -> 25% -> 50% -> 100%) +- **Continuous monitoring**: Leave online eval configs enabled in production + ## Decision Framework | A/B Test Result | Action | |-----------------|--------| -| **Config-bundle T1 wins** | Promote treatment bundle (`update_configuration_bundle`) as new default — no code deployment | +| **Config-bundle T1 wins** | Promote treatment bundle (`update_configuration_bundle`) as new default -- no code deployment | | **Target-based v2 wins** | Ramp to 50%, then 100% cutover; delete v1 runtime | | **Regression detected** | Stop A/B test immediately (`update_ab_test(executionStatus="STOPPED")`), investigate | | **Inconclusive** | Continue sending traffic to accumulate sample size (p < 0.05 threshold) | diff --git a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/images/observe-eval-improve.png b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/images/observe-eval-improve.png new file mode 
100644 index 000000000..397d919be Binary files /dev/null and b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/images/observe-eval-improve.png differ diff --git a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/insights.py b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/insights.py new file mode 100644 index 000000000..8fc8293ea --- /dev/null +++ b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/insights.py @@ -0,0 +1,478 @@ +""" +Run AgentCore Insights on the HR Assistant agent. + +Three insight types: + - Builtin.Insight.FailureAnalysis -- clusters failure sessions by root cause category + - Builtin.Insight.UserIntent -- groups sessions by what the user was trying to do + - Builtin.Insight.ExecutionSummary -- summarizes agent execution patterns across sessions + +With --generate-traces, the script first sends a set of failure-mode sessions to populate +CloudWatch, so FailureAnalysis has real patterns to work with: + - Unknown employee IDs that the agent cannot look up + - HR policy topics not in the system (sabbatical, floating holiday, relocation) + - Unknown benefit types (gym membership, commuter benefits, wellness) + - Pay stubs for unavailable periods or unknown employees + - Ambiguous prompts that cause multi-step confusion + +Usage: + # Generate failure traces then run all three insights: + python insights.py --name HRInsights849 --generate-traces [--region us-west-2] + + # Run insights on existing traces from the last N days: + python insights.py --name HRInsights849 [--lookback-days 7] + + # Also create an OnlineEvaluationConfig for recurring daily insights: + python insights.py --name HRInsights849 --generate-traces --online + + # Run FailureAnalysis only (skips UserIntent and ExecutionSummary): + python insights.py --name HRInsights849 --insight Builtin.Insight.FailureAnalysis + +Prerequisites: + 1. 
Deploy the HR Assistant agent: + python deploy.py --name HRInsights849 --region us-west-2 + + 2. Install dependencies: + pip install -r requirements.txt +""" + +import argparse +import json +import os +import time +import uuid +from datetime import datetime, timedelta, timezone +from pathlib import Path + +import boto3 + +# ── Parse arguments ─────────────────────────────────────────────────────── + +parser = argparse.ArgumentParser(description="AgentCore Insights (FailureAnalysis, UserIntent, ExecutionSummary)") +parser.add_argument("--name", required=True, help="Runtime name (matches agent_state_{name}.json)") +parser.add_argument("--region", default=os.environ.get("AWS_DEFAULT_REGION", "us-west-2")) +parser.add_argument( + "--generate-traces", + action="store_true", + help="Send failure-mode invocations to generate diverse traces before running insights", +) +parser.add_argument( + "--online", + action="store_true", + help="Also create an OnlineEvaluationConfig for recurring daily insights", +) +parser.add_argument( + "--lookback-days", + type=int, + default=7, + help="Number of days of traces to analyze (default: 7)", +) +parser.add_argument( + "--insight", + action="append", + default=None, + dest="insights", + metavar="INSIGHT_ID", + help=( + "Specific insight IDs to run. May be repeated. " + "Default: all three (FailureAnalysis, UserIntent, ExecutionSummary). " + "Example: --insight Builtin.Insight.FailureAnalysis" + ), +) +args = parser.parse_args() + +REGION = args.region +LOOKBACK_DAYS = args.lookback_days + +# ── Load agent state ─────────────────────────────────────────────────────── + +STATE_FILE = Path(f"agent_state_{args.name}.json") +if not STATE_FILE.exists(): + raise FileNotFoundError( + f"{STATE_FILE} not found. Run 'python deploy.py --name {args.name} --region {REGION}' first." 
+ ) + +state = json.loads(STATE_FILE.read_text(encoding="utf-8")) +AGENT_ARN = state["runtime_arn"] +LOG_GROUP = state["log_group"] +SERVICE_NAME = state["service_name"] +ROLE_ARN = state["role_arn"] +ACCOUNT_ID = state["account_id"] +SPANS_LOG_GROUP = "aws/spans" + +# Both log groups are required for reliable session coverage. +# aws/spans holds OTel span documents (used by all insight types). +# The runtime log group holds the log events that spans reference — without it, +# the evaluator engine sees incomplete spans (LogEventMissingException) even +# though the agent executed successfully. +LOG_GROUP_NAMES = [SPANS_LOG_GROUP, LOG_GROUP] + +# ── AWS clients ─────────────────────────────────────────────────────────── + +dp = boto3.client("bedrock-agentcore", region_name=REGION) +ctrl = boto3.client("bedrock-agentcore-control", region_name=REGION) + +print(f"Account : {ACCOUNT_ID}") +print(f"Region : {REGION}") +print(f"Runtime : {args.name}") +print(f"Agent ARN : {AGENT_ARN}") +print(f"Service Name: {SERVICE_NAME}") +print(f"Log Group : {LOG_GROUP}") + +# ── Insight selection ───────────────────────────────────────────────────── + +ALL_INSIGHTS = [ + "Builtin.Insight.FailureAnalysis", + "Builtin.Insight.UserIntent", + "Builtin.Insight.ExecutionSummary", +] + +SELECTED_INSIGHTS = args.insights if args.insights else ALL_INSIGHTS +print(f"\nInsights to run: {SELECTED_INSIGHTS}") + +# ── Step 1: Generate failure-mode traces (optional) ─────────────────────── +# +# These prompts exercise different failure paths in the HR Assistant so that +# FailureAnalysis has real patterns to cluster. Each category below maps to +# one or more tool errors or agent reasoning failures. +# +# Category: Tool Execution Failures +# - get_pto_balance(employee_id="EMP-999") -> "Employee EMP-999 not found" +# - get_pay_stub(employee_id="EMP-003", ...) 
-> "Pay stub not found" +# +# Category: Invalid Input / Bad Formatting +# - PTO request with non-date strings -> agent loops or declines +# +# Category: Out-of-Scope Requests +# - Policy topics not in the system: sabbatical, floating_holiday, relocation +# - Benefit types not in the system: gym, commuter, wellness +# +# Category: Ambiguous Requests +# - Vague prompts where the agent guesses incorrectly or asks for clarification + +FAILURE_PROMPTS = [ + # --- Tool failure: unknown employee IDs --- + ("EMP-999", "What is my current PTO balance?"), + ("EMP-999", "Please submit a PTO request for me from 2026-07-01 to 2026-07-05."), + ("EMP-003", "Can you pull up my January 2026 pay stub?"), + ("EMP-003", "How many PTO days do I have remaining?"), + # --- Tool failure: unavailable pay periods --- + ("EMP-001", "Get my pay stub for December 2019."), + ("EMP-001", "Show me my pay stub for March 2020."), + ("EMP-042", "I need my pay stub for period 2022-06."), + # --- Policy topics not in the system --- + ("EMP-001", "What is the sabbatical leave policy?"), + ("EMP-002", "Do we have a floating holiday policy?"), + ("EMP-042", "What is the relocation assistance policy?"), + ("EMP-001", "Explain the bereavement leave policy."), + # --- Benefit types not in the system --- + ("EMP-001", "What gym membership benefits does the company offer?"), + ("EMP-002", "Tell me about our commuter benefits — transit and parking."), + ("EMP-042", "What wellness program benefits are available?"), + ("EMP-001", "Do we have a childcare or dependent care FSA benefit?"), + # --- Invalid input formats --- + ("EMP-001", "Submit PTO for me starting yesterday through the end of next month."), + ("EMP-002", "Request time off from ASAP to whenever — for burnout recovery."), + # --- Multi-step failures: agent finds no data, user pushes further --- + ("EMP-999", "Check my PTO balance first, then submit a request for next week."), + ("EMP-001", "Can you give me the 2018 annual pay summary? 
I need it for a loan."), + # --- Successful sessions for user intent diversity --- + ("EMP-001", "What is my current PTO balance?"), + ("EMP-042", "Tell me about the 401k plan — how much does the company match?"), + ("EMP-001", "What are my health insurance options?"), + ("EMP-002", "What is the parental leave policy for primary caregivers?"), + ("EMP-001", "What is the remote work policy?"), + ("EMP-042", "Can you pull up my January 2026 pay stub?"), + ("EMP-001", "Submit a PTO request from 2026-08-11 to 2026-08-15 for a vacation."), + ("EMP-002", "How does the dental insurance work for major procedures?"), +] + +if args.generate_traces: + print("\n" + "=" * 60) + print("STEP 1: Generate Failure-Mode Traces") + print("=" * 60) + print(f"Sending {len(FAILURE_PROMPTS)} sessions (failure + success mix)...\n") + + session_ids = [] + success_count = 0 + error_count = 0 + + for i, (emp_id, prompt) in enumerate(FAILURE_PROMPTS): + session_id = str(uuid.uuid4()) + session_ids.append(session_id) + full_prompt = f"Employee ID: {emp_id}. {prompt}" if emp_id != "custom" else prompt + + try: + resp = dp.invoke_agent_runtime( + agentRuntimeArn=AGENT_ARN, + runtimeSessionId=session_id, + payload=json.dumps({"prompt": full_prompt}).encode(), + ) + resp["response"].read() + success_count += 1 + status_tag = "OK " # pylint: disable=invalid-name + except Exception as e: # pylint: disable=broad-exception-caught + error_count += 1 + status_tag = "ERR" # pylint: disable=invalid-name + + print(f" [{i + 1:2d}] {status_tag} {session_id[:8]}... 
[{emp_id}] {prompt[:60]}") + + print(f"\nSent {success_count} OK, {error_count} errors (invoke errors; tool errors are expected)") + print("Waiting 3 minutes for traces to propagate to CloudWatch...") + + for remaining in range(180, 0, -30): + print(f" {remaining}s remaining...") + time.sleep(30) + + print("CloudWatch ingestion complete.") +else: + print("\n(Skipping trace generation — use --generate-traces to send failure-mode sessions first)") + +# ── Step 2: Run Batch Insights ───────────────────────────────────────────── +# +# Notes on dataSourceConfig for insights: +# - "aws/spans" holds OTel span documents and is required for session coverage +# - The runtime log group (/aws/bedrock-agentcore/runtimes/...) must also be +# included; without it the engine cannot resolve log events referenced by spans +# - insights and evaluators are mutually exclusive -- do not mix them + +print("\n" + "=" * 60) +print("STEP 2: Start Batch Insights") +print("=" * 60) + +now = datetime.now(timezone.utc) +start_time = now - timedelta(days=LOOKBACK_DAYS) + +EVAL_NAME = f"HRInsights{uuid.uuid4().hex[:8]}" + +print(f"Batch eval name : {EVAL_NAME}") +print(f"Time range : {start_time.strftime('%Y-%m-%dT%H:%M:%SZ')} to {now.strftime('%Y-%m-%dT%H:%M:%SZ')}") +print(f"Service name : {SERVICE_NAME}") +print(f"Log groups : {LOG_GROUP_NAMES}") + +insights_list = [{"insightId": iid} for iid in SELECTED_INSIGHTS] + +eval_resp = dp.start_batch_evaluation( + batchEvaluationName=EVAL_NAME, + insights=insights_list, + dataSourceConfig={ + "cloudWatchLogs": { + "serviceNames": [SERVICE_NAME], + "logGroupNames": LOG_GROUP_NAMES, + "filterConfig": { + "timeRange": { + "startTime": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"), + "endTime": now.strftime("%Y-%m-%dT%H:%M:%SZ"), + } + }, + } + }, + clientToken=str(uuid.uuid4()), +) + +EVAL_ID = eval_resp["batchEvaluationId"] +EVAL_ARN = eval_resp.get("batchEvaluationArn", "") +print(f"\nStarted : {EVAL_ID}") +print(f"ARN : {EVAL_ARN}") + +# ── Step 3: Poll 
for completion ──────────────────────────────────────────── + +print("\n" + "=" * 60) +print("STEP 3: Poll for Completion") +print("=" * 60) + +TERMINAL = {"COMPLETED", "FAILED", "STOPPED", "COMPLETED_WITH_ERRORS"} +poll = 0 +result = {} + +while True: + poll += 1 + result = dp.get_batch_evaluation(batchEvaluationId=EVAL_ID) + status = result["status"] + processed = result.get("statistics", {}).get("processedSessionCount", "?") + failed = result.get("statistics", {}).get("failedSessionCount", "?") + print(f" Poll {poll:3d} [{time.strftime('%H:%M:%S')}] status={status} processed={processed} failed={failed}") + + if status in TERMINAL: + break + + time.sleep(30) + +print(f"\nFinal status: {status}") + +if result.get("errorDetails"): + print(f"Error details: {result['errorDetails']}") + +# ── Step 4: Print Insights Results ──────────────────────────────────────── + +print("\n" + "=" * 60) +print("STEP 4: Insights Results") +print("=" * 60) + +# ── 4a: FailureAnalysis ─────────────────────────────────────────────────── + +fa = result.get("failureAnalysisResult") or result.get("failureAnalysisOutput") +if fa: + failures = fa.get("failures", []) + print(f"\n--- FailureAnalysis ({len(failures)} top-level categories) ---") + + if not failures: + print(" No failure categories found.") + else: + for cat in failures: + cat_name = cat.get("name", "(unnamed)") + sub_cats = cat.get("subCategories", []) + total_affected = sum( + rc.get("affectedSessionCount", 0) for sc in sub_cats for rc in sc.get("rootCauses", []) + ) + print(f"\n Category: {cat_name} (sessions affected: {total_affected})") + + for sc in sub_cats: + sc_name = sc.get("name", "(unnamed)") + root_causes = sc.get("rootCauses", []) + print(f" Subcategory: {sc_name}") + + for rc in root_causes: + rc_name = rc.get("name", "(unnamed)") + rc_rec = rc.get("recommendation", "") + rc_count = rc.get("affectedSessionCount", 0) + rc_sessions = rc.get("affectedSessions", []) + print(f" Root cause : {rc_name} ({rc_count} 
sessions)") + if rc_rec: + print(f" Recommendation: {rc_rec}") + if rc_sessions: + preview = rc_sessions[:3] + more = len(rc_sessions) - 3 + suffix = f" (+{more} more)" if more > 0 else "" + print(f" Session IDs : {preview}{suffix}") +else: + print("\n(No failureAnalysisResult in response)") + +# ── 4b: UserIntent ──────────────────────────────────────────────────────── + +ui = result.get("userIntentResult") or result.get("userIntentOutput") +if ui: + intents = ui.get("userIntents", []) + print(f"\n--- UserIntent ({len(intents)} intent clusters) ---") + + if not intents: + print(" No user intent clusters found.") + else: + for intent in intents: + cluster_id = intent.get("clusterId", "") + name = intent.get("name", "(unnamed)") + description = intent.get("description", "") + count = intent.get("affectedSessionCount", 0) + print(f"\n Intent cluster: {name} ({count} sessions)") + print(f" Cluster ID : {cluster_id}") + if description: + print(f" Description : {description}") +else: + print("\n(No userIntentResult in response — may be a known beta issue)") + +# ── 4c: ExecutionSummary ────────────────────────────────────────────────── + +es = result.get("executionSummaryResult") or result.get("executionSummaryOutput") +if es: + summaries = es.get("executionSummaries", []) + print(f"\n--- ExecutionSummary ({len(summaries)} clusters) ---") + + if not summaries: + print(" No execution summary clusters found.") + print(" Note: ExecutionSummary requires at least 3 sessions for clustering.") + else: + for summary in summaries: + cluster_id = summary.get("clusterId", "") + description = summary.get("description", "") + count = summary.get("affectedSessionCount", 0) + print(f"\n Cluster: {cluster_id} ({count} sessions)") + if description: + print(f" Description: {description}") +else: + print("\n(No executionSummaryResult in response)") + +# ── 4d: Error details per insight ───────────────────────────────────────── + +error_details = result.get("errorDetails", []) +if 
error_details: + print("\n--- Error details ---") + if isinstance(error_details, dict): + for key, val in error_details.items(): + print(f" {key}: {val}") + else: + for item in error_details: + print(f" {item}") + +# ── Step 5: Online Insights Config (optional) ────────────────────────────── +# +# Creates a recurring insights job that runs daily over the last 24 hours of +# traces. Results accumulate automatically with no manual intervention needed. + +if args.online: + print("\n" + "=" * 60) + print("STEP 5: Create Online Insights Config (Daily Recurring)") + print("=" * 60) + + ONLINE_NAME = f"HROnlineInsights{uuid.uuid4().hex[:6]}" + + online_resp = ctrl.create_online_evaluation_config( + onlineEvaluationConfigName=ONLINE_NAME, + description="HR Assistant daily insights: FailureAnalysis, UserIntent, ExecutionSummary", + rule={ + "samplingConfig": {"samplingPercentage": 100}, + }, + dataSourceConfig={ + "cloudWatchLogs": { + "logGroupNames": LOG_GROUP_NAMES, + "serviceNames": [SERVICE_NAME], + } + }, + insights=[{"insightId": iid} for iid in SELECTED_INSIGHTS], + clusteringConfig={"frequencies": ["DAILY"]}, + evaluationExecutionRoleArn=ROLE_ARN, + enableOnCreate=True, + clientToken=str(uuid.uuid4()), + ) + + ONLINE_ID = online_resp["onlineEvaluationConfigId"] + ONLINE_ARN = online_resp["onlineEvaluationConfigArn"] + + print("Online insights config created:") + print(f" ID : {ONLINE_ID}") + print(f" ARN : {ONLINE_ARN}") + print(f" Name : {ONLINE_NAME}") + print(f" Status: {online_resp.get('executionStatus', 'unknown')}") + print() + print("The config will run daily. 
To view results:") + print( + f" python -c \"import boto3, json; ctrl=boto3.client('bedrock-agentcore-control', " + f"region_name='{REGION}'); r=ctrl.get_online_evaluation_config(" + f"onlineEvaluationConfigId='{ONLINE_ID}'); print(json.dumps(r, indent=2, default=str))\"" + ) + print() + print("To archive (disable) this config:") + print(f" ctrl.update_online_evaluation_config(onlineEvaluationConfigId='{ONLINE_ID}', executionStatus='DISABLED')") + +# ── Summary ──────────────────────────────────────────────────────────────── + +print("\n" + "=" * 60) +print("INSIGHTS SUMMARY") +print("=" * 60) +print(f"Batch evaluation ID : {EVAL_ID}") +print(f"Status : {status}") + +stats = result.get("statistics", {}) +if stats: + print(f"Sessions processed : {stats.get('processedSessionCount', 'N/A')}") + print(f"Sessions failed : {stats.get('failedSessionCount', 'N/A')}") + +fa_clusters = len((fa or {}).get("failures", [])) +ui_clusters = len((ui or {}).get("userIntents", [])) +es_clusters = len((es or {}).get("executionSummaries", [])) + +print(f"\nFailureAnalysis : {fa_clusters} top-level categories") +print(f"UserIntent : {ui_clusters} intent clusters") +print(f"ExecutionSummary : {es_clusters} execution clusters") + +with open("insights_result.json", "w", encoding="utf-8") as f: + json.dump(result, f, indent=2, default=str) +print("\nFull response saved to insights_result.json") diff --git a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/requirements.txt b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/requirements.txt index 7e2e80aa7..67838a727 100644 --- a/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/requirements.txt +++ b/01-features/06-observe-evaluate-optimize-your-agent/03-optimize/requirements.txt @@ -1,5 +1,5 @@ -bedrock-agentcore>=1.7.0 -boto3>=1.43.0 +bedrock-agentcore>=1.15.0 +boto3>=1.43.32 strands-agents[otel]
aws-opentelemetry-distro requests diff --git a/01-features/06-observe-evaluate-optimize-your-agent/AGENT-LOOPS.PNG b/01-features/06-observe-evaluate-optimize-your-agent/AGENT-LOOPS.PNG new file mode 100644 index 000000000..384e98ceb Binary files /dev/null and b/01-features/06-observe-evaluate-optimize-your-agent/AGENT-LOOPS.PNG differ diff --git a/01-features/06-observe-evaluate-optimize-your-agent/README.md b/01-features/06-observe-evaluate-optimize-your-agent/README.md index 3170b8bd8..c02ccbe6c 100644 --- a/01-features/06-observe-evaluate-optimize-your-agent/README.md +++ b/01-features/06-observe-evaluate-optimize-your-agent/README.md @@ -6,19 +6,7 @@ automated evaluation, and AI-driven optimization with A/B testing. ## Overview -``` -┌─────────────────────────────────────────────────────────────────┐ -│ Agent Lifecycle on AgentCore │ -│ │ -│ [Infrastructure Setup] [Observe] [Evaluate] [Optimize] │ -│ (one-time) │ │ │ │ -│ │ ▼ ▼ ▼ │ -│ Enable Transaction ──► Advanced ──► Batch ──► Config │ -│ Search + ADOT OTel Techniques Eval Bundles │ -│ Custom Spans Online Eval A/B Tests │ -│ Data Protection GT Datasets AI Recs │ -└─────────────────────────────────────────────────────────────────┘ -``` +![Agent development to production loops](AGENT-LOOPS.PNG) ## Before You Start — Enable observability Infrastructure diff --git a/01-features/07-centralize-and-govern-your-ai-infrastructure/03-registry/TEST_LOG.md b/01-features/07-centralize-and-govern-your-ai-infrastructure/03-registry/TEST_LOG.md deleted file mode 100644 index fe126c69d..000000000 --- a/01-features/07-centralize-and-govern-your-ai-infrastructure/03-registry/TEST_LOG.md +++ /dev/null @@ -1,330 +0,0 @@ -# Registry Migration Test Log - -**Date:** 2026-05-18 / 2026-05-19 -**Source:** `06-workshops/10-Agent-Registry` (amazon-bedrock-agentcore-samples fork) -**Target:** `02-features/06-centralize-and-govern-your-ai-infrastructure/03-registry` (private staging) -**AWS Account:** 849138760372 | **Region:** us-west-2
- ---- - -## Summary - -| Check | Result | -|:------|:-------| -| Folder structure complete | PASS | -| README content preserved | PASS | -| Images/diagrams preserved | PASS (20/20 PNGs) | -| Python syntax checks | PASS (17/17 scripts) | -| AWS execution — scripts run | PASS (8/10 scripts; 2 skipped — external deps) | - ---- - -## Bugs Found and Fixed During Testing - -Four scripts shared the same bug: trying to submit a registry record for approval while it is still in `CREATING` state, causing a `ConflictException`. Fixed by adding a `wait_for_record_draft()` helper that polls until status is `DRAFT` before calling `submit_registry_record_for_approval`. - -Additionally, `deploy_lambda_push_sync.py` had a registry wait loop that ran only 12 iterations (60s), not enough for the ~90s CREATING phase. Fixed by changing to an unbounded `while True` loop. - -| Script | Bug | Fix Applied | -|:---|:---|:---| -| `registry_end_to_end_oauth.py` | Submit record while CREATING | Added `wait_for_record_draft()` | -| `registry_skills_dynamic_discovery.py` | Submit record while CREATING | Added `wait_for_record_draft()` | -| `publish_agentcore_a2a_mcp_in_registry.py` | Submit record while CREATING | Added `wait_for_record_draft()` | -| `deploy_lambda_push_sync.py` | Registry wait loop too short (60s); submit record while CREATING | Changed wait to unbounded loop; added `wait_for_record_draft()` | - ---- - -## Folder Mapping - -| Source Folder | Target Folder | Status | -|:---|:---|:---| -| `00-getting-started/end-to-end/01-registry-end-to-end/` | `01-registry-end-to-end/` | PASS | -| `00-getting-started/end-to-end/02-registry-end-to-end-oauth/` | `02-registry-end-to-end-oauth/` | PASS | -| `01-advanced/admin-approval-workflow/` | `03-advanced/admin-approval-workflow/` | PASS | -| `01-advanced/consumer-discovery-semantic-search/` | `03-advanced/consumer-discovery-semantic-search/` | PASS | -| `01-advanced/discovery-and-invocation-at-runtime/` | 
`03-advanced/discovery-and-invocation-at-runtime/` | PASS | -| `01-advanced/kiro-registry-dcr-auth0/` | `03-advanced/kiro-registry-dcr-auth0/` | PASS | -| `01-advanced/kiro/kiro-power-publisher-workflow/` | `03-advanced/kiro/kiro-power-publisher-workflow/` | PASS | -| `01-advanced/publish-agentcore-tools-in-registry/` | `03-advanced/publish-agentcore-tools-in-registry/` | PASS | -| `01-advanced/registry-push-sync-lambda/` | `03-advanced/registry-push-sync-lambda/` | PASS | -| `01-advanced/registry-skills-dynamic-discovery/` | `03-advanced/registry-skills-dynamic-discovery/` | PASS | -| `01-advanced/registry-synchronize-mcpserver/` | `03-advanced/registry-synchronize-mcpserver/` | PASS | - ---- - -## Per-Sample Test Results - -### 1. `01-registry-end-to-end` — Zero to Registry in 10 Minutes - -**README content check:** PASS -**Images:** `images/quick-setup-architecture.png` ✅ -**Python syntax:** PASS -**AWS execution:** PASS - -Output highlights: -- Registry created and reached READY (~90s) -- IAM users created for Admin, Publisher, Consumer personas -- MCP, A2A, CUSTOM records registered -- Governance guardrail tests: - - Publisher self-approval correctly denied (`AccessDeniedException`) ✅ - - Consumer `CreateRegistryRecord` correctly denied ✅ - - Consumer `UpdateRegistryRecordStatus` correctly denied ✅ - - Consumer read operations (List, Get) allowed ✅ - - Admin approval succeeded ✅ -- All 3 records reached APPROVED status -- Semantic search: 30s propagation wait insufficient for 3 records; returned 0 results (expected for short wait — index behavior) -- Cleanup: IAM users + records + registry deleted manually (cleanup section is commented out) - -**Note:** Cleanup is commented out in the script. Uncomment or add a cleanup step for production use. - ---- - -### 2. 
`02-registry-end-to-end-oauth` — Registry with OAuth Authentication - -**README content check:** PASS -**Images:** `images/registry-end-to-end-oauth.png` ✅ -**Python syntax:** PASS -**AWS execution:** PASS (after fix) - -**Bug fixed:** Script submitted record for approval while in `CREATING` state → added `wait_for_record_draft()`. - -Output highlights: -- Cognito user pool, app client, test user created -- Registry created with `CUSTOM_JWT` authorizer pointing to Cognito discovery URL -- MCP record approved successfully -- Cognito `USER_PASSWORD_AUTH` authentication succeeded, Bearer token obtained -- Authenticated search returned 1 result (`weather_server`) ✅ -- Negative auth tests: - - Request without Authorization header → `403` ✅ - - Request with invalid token → `401` ✅ -- Full cleanup (record, registry, Cognito pool, domain, user) completed - ---- - -### 3. `03-advanced/admin-approval-workflow` — Admin CI/CD Approval Workflow - -**README content check:** PASS -**Images:** `admin-flow-architecture.png`, `slack-message.png`, `ai-scan-report.png` ✅ -**Python syntax:** PASS -**AWS execution:** SKIPPED — requires a real Slack incoming webhook URL and channel. The `SLACK_INC_HOOK` env var must be set to a valid Slack webhook before running. The script is otherwise structurally sound. - ---- - -### 4. `03-advanced/consumer-discovery-semantic-search` — Consumer Discovery Semantic Search - -**README content check:** PASS -**Images:** `consumer-discovery-semantic-search.png` ✅ -**Python syntax:** PASS -**AWS execution:** PASS - -Output highlights: -- Registry created; 14 records seeded and submitted for approval -- All 12 discovery scenarios ran (semantic, filtered, cross-type, negative) -- Search index propagation: 45s wait caused most queries to return only the last-indexed record (`loyalty_rewards_tool`). This is a known index propagation latency — increasing the wait improves result diversity. 
-- Filtered search operators (`$eq`, `$ne`, `$in`, `$and`, `$or`) tested -- MCP `serverSchema`/`toolSchema` drill-down and A2A `agentCard` drill-down tested -- Full cleanup (14 records + registry) completed - ---- - -### 5. `03-advanced/discovery-and-invocation-at-runtime` — Discovering Tools and Agents at Runtime - -**README content check:** PASS -**Images:** `With_Vs_Without_AWS_Agent_Registry.png`, `OrderManagement_AWS_Agent_Registry_Flow.png`, `orchestrator_agent_flow_v3.png` ✅ -**Python syntax:** PASS -**AWS execution:** PASS - -Output highlights: -- Lambda function deployed (`order-management-mcp-20260518185145`) -- Cognito user pool + OAuth setup for AgentCore Gateway -- AgentCore Gateway created and targeted to Lambda -- Pricing A2A agent deployed to AgentCore Runtime via CodeBuild -- Customer Support A2A agent deployed to AgentCore Runtime via CodeBuild -- Orchestrator agent deployed to AgentCore Runtime via CodeBuild -- Registry created; 3 records registered (1 MCP + 2 A2A) and approved -- **Demo 1 (Order Status):** Retrieved order details via MCP tool ✅ -- **Demo 2 (Pricing & Discounts):** Retrieved order details via MCP + A2A pricing agent (some service hiccups handled gracefully) ✅ -- **Demo 3 (Return & Refund):** Return eligibility + refund amount via Customer Support A2A agent ✅ -- All resources deleted (registry, Lambda, gateway, Cognito, 3 runtimes) - ---- - -### 6. `03-advanced/kiro-registry-dcr-auth0` — Registry as MCP from Kiro (Auth0 DCR) - -**README content check:** PASS -**Images (all 5):** `0_authflow_dcr.png`, `1_kiro_mcp_json.png`, `2_authorization_pkce.png`, `3_successful_auth.png`, `4_kiro_search.png` ✅ -**Python syntax:** PASS -**AWS execution:** SKIPPED — requires an Auth0 account with DCR enabled and `.env` file with `AUTH0_DOMAIN`, `AUTH0_AUDIENCE`, `AWS_REGION`, `AWS_ACCOUNT_ID`. The script is structurally sound and creates a `CUSTOM_JWT` registry using Auth0 as IdP. - ---- - -### 7. 
`03-advanced/kiro/kiro-power-publisher-workflow` — Kiro Power Publisher Workflow - -**README content check:** PASS -**Images (all 4):** `publisher-workflow.png`, `activate-kiro-power.png`, `import-from-github.png`, `aws-agent-registry-power.png` ✅ -**Python syntax:** N/A — Kiro IDE-driven (no standalone script) -**AWS execution:** N/A — IDE-driven via `POWER.md` steering document - ---- - -### 8. `03-advanced/publish-agentcore-tools-in-registry` — Publishing AgentCore Tools in Registry - -**README content check:** PASS -**Images:** `images/agentregistry_flow.png` ✅ -**Python syntax:** PASS -**AWS execution:** PASS (after fix) - -**Bug fixed:** Submit for approval while records in `CREATING` state → added `wait_for_record_draft()`. - -Output highlights: -- MCP server (`mcp_order_server`) deployed to AgentCore Runtime via CodeBuild (~30s build) -- MCP tools verified via `tools/list` + `tools/call` (`get_order_status`, `create_order`, `update_order`, `cancel_order`) -- A2A agent (`a2a_order_agent`) deployed to AgentCore Runtime via CodeBuild -- A2A agent verified via `GET /agent_card` + `POST /message/send` (task completed with order details) -- Registry created; MCP + A2A records registered and approved -- Semantic search returned both records for all queries (`cancel update an order`, `order management MCP tools`, `A2A agent order`) ✅ -- Runtimes, records, and registry deleted - ---- - -### 9. `03-advanced/registry-push-sync-lambda` — Registry Push Sync Lambda - -**README content check:** PASS -**Images:** `architecture.png` ✅ -**Python syntax:** PASS -**AWS execution:** PASS (after fixes) - -**Bugs fixed:** -1. Registry wait loop ran only 12 iterations (60s) — insufficient for ~90s CREATING phase → changed to `while True` loop -2. 
Record submitted while still in `CREATING` → added `wait_for_record_draft()`
-
-Output highlights:
-- Registry and MCP server record created and approved
-- AgentCore Identity workload identity created (`registry-push-sync-agent`)
-- IAM Lambda execution role created with registry + identity + secrets + logs permissions
-- Lambda function built (bundled `boto3`, `botocore`, `requests` into `handler.zip`) and deployed
-- EventBridge rule created to match `UpdateAgentRuntime` CloudTrail events
-- Lambda test skipped (no `TEST_RUNTIME_ID` set) — expected behavior
-- All resources deleted (Lambda, IAM role, workload identity, EventBridge rule, record, registry)
-
----
-
-### 10. `03-advanced/registry-skills-dynamic-discovery` — Publishing and Discovering Agent Skills
-
-**README content check:** PASS
-**Images:** `images/registry-skill-flow.png` ✅
-**Python syntax:** PASS
-**AWS execution:** PASS (after fix)
-
-**Bug fixed:** Submit for approval while record in `CREATING` state → added `wait_for_record_draft()`.
-
-Output highlights:
-- Registry created with `AWS_IAM` authorizer
-- PDF Processing Skill registered as `AGENT_SKILLS` record with `skillMd` + `skillDefinition`
-- Record approved (DRAFT → PENDING_APPROVAL → APPROVED)
-- 100s index propagation wait
-- Strands Agent with `search_and_load_skill` tool initialized
-- Agent searched registry: found `PDF_Processing_Skill` for query "PDF creation generate document" ✅
-- Agent downloaded skill from GitHub (`anthropics/skills/skills/pdf`), installed `pypdf` + `reportlab`
-- Agent created `hello_from_agent_skills.pdf` (1421 bytes) using the loaded skill ✅
-- Record + registry deleted
-
----
-
-### 11. `03-advanced/registry-synchronize-mcpserver` — Synchronize MCP Server Metadata
-
-**README content check:** PASS
-**Images:** `registry-synchronize-mcpserver-arch.png` ✅
-**Python syntax:** PASS
-**AWS execution:** PASS
-
-Output highlights:
-- Registry created with `AWS_IAM` authorizer
-- **Section 3 — Public MCP server sync:** Synced `AWSKnowledgeMCP` from public URL; record transitioned CREATING → DRAFT with extracted `serverSchema` + `toolSchema` ✅
-- **Section 4 — OAuth-protected MCP server sync:**
-  - Cognito user pool + OAuth provider created
-  - MCP server deployed via CodeBuild (~30s)
-  - Synced with OAuth credential provider; tool schemas extracted automatically ✅
-- **Section 5 — IAM-protected MCP server sync:**
-  - MCP server deployed via CodeBuild (~30s)
-  - IAM role `RegistrySyncRole_1779155208` created for registry-to-runtime invocation
-  - Synced with IAM credential provider; tool schemas extracted automatically ✅
-- Final listing: 3 records (public + OAuth + IAM)
-- Full cleanup (3 records, registry, 2 runtimes, OAuth provider, Cognito, IAM role, local files)
-
----
-
-## Image Migration Summary
-
-Total PNG images in source: **20**
-Total PNG images in target: **20**
-Missing: **0**
-
-| Sample | Images | Status |
-|:---|:---|:---|
-| `01-registry-end-to-end` | `quick-setup-architecture.png` | PASS |
-| `02-registry-end-to-end-oauth` | `registry-end-to-end-oauth.png` | PASS |
-| `admin-approval-workflow` | `admin-flow-architecture.png`, `slack-message.png`, `ai-scan-report.png` | PASS |
-| `consumer-discovery-semantic-search` | `consumer-discovery-semantic-search.png` | PASS |
-| `discovery-and-invocation-at-runtime` | `With_Vs_Without_AWS_Agent_Registry.png`, `OrderManagement_AWS_Agent_Registry_Flow.png`, `orchestrator_agent_flow_v3.png` | PASS |
-| `kiro-registry-dcr-auth0` | `0_authflow_dcr.png`, `1_kiro_mcp_json.png`, `2_authorization_pkce.png`, `3_successful_auth.png`, `4_kiro_search.png` | PASS |
-| `kiro-power-publisher-workflow` | `publisher-workflow.png`, `activate-kiro-power.png`, `import-from-github.png`, `aws-agent-registry-power.png` | PASS |
-| `publish-agentcore-tools-in-registry` | `agentregistry_flow.png` | PASS |
-| `registry-push-sync-lambda` | `architecture.png` | PASS |
-| `registry-skills-dynamic-discovery` | `registry-skill-flow.png` | PASS |
-| `registry-synchronize-mcpserver` | `registry-synchronize-mcpserver-arch.png` | PASS |
-
----
-
-## Python Syntax Check Summary
-
-All scripts tested with `python3 -m py_compile`.
-
-| Script | Result |
-|:---|:---|
-| `01-registry-end-to-end/getting_started_registry_end_to_end.py` | PASS |
-| `02-registry-end-to-end-oauth/registry_end_to_end_oauth.py` | PASS |
-| `03-advanced/admin-approval-workflow/admin_approval_workflow.py` | PASS |
-| `03-advanced/admin-approval-workflow/utils.py` | PASS |
-| `03-advanced/consumer-discovery-semantic-search/consumer_discovery_semantic_search.py` | PASS |
-| `03-advanced/discovery-and-invocation-at-runtime/discovery_and_invocation_at_runtime.py` | PASS |
-| `03-advanced/discovery-and-invocation-at-runtime/cleanup.py` | PASS |
-| `03-advanced/discovery-and-invocation-at-runtime/utils.py` | PASS |
-| `03-advanced/kiro-registry-dcr-auth0/dcr_registry_search_mcp_in_kiro.py` | PASS |
-| `03-advanced/kiro-registry-dcr-auth0/seed_records.py` | PASS |
-| `03-advanced/publish-agentcore-tools-in-registry/publish_agentcore_a2a_mcp_in_registry.py` | PASS |
-| `03-advanced/registry-push-sync-lambda/deploy_lambda_push_sync.py` | PASS |
-| `03-advanced/registry-push-sync-lambda/handler.py` | PASS |
-| `03-advanced/registry-skills-dynamic-discovery/registry_skills_dynamic_discovery.py` | PASS |
-| `03-advanced/registry-skills-dynamic-discovery/utils/python_exec_tool.py` | PASS |
-| `03-advanced/registry-skills-dynamic-discovery/utils/skill_loader.py` | PASS |
-| `03-advanced/registry-synchronize-mcpserver/registry_synchronize_mcpserver.py` | PASS |
-
----
-
-## AWS Execution Summary
-
-| Script | AWS Execution | Notes |
-|:---|:---|:---|
-| `getting_started_registry_end_to_end.py` | PASS | All IAM guardrails verified; cleanup section commented out |
-| `registry_end_to_end_oauth.py` | PASS (after fix) | Cognito + CUSTOM_JWT auth + negative auth tests |
-| `consumer_discovery_semantic_search.py` | PASS | 45s propagation wait may be short for 14 records |
-| `discovery_and_invocation_at_runtime.py` | PASS | 3 demos: MCP, MCP+A2A, A2A — all ran |
-| `publish_agentcore_a2a_mcp_in_registry.py` | PASS (after fix) | CodeBuild deployment; semantic search verified |
-| `registry_synchronize_mcpserver.py` | PASS | Public + OAuth + IAM sync all verified |
-| `registry_skills_dynamic_discovery.py` | PASS (after fix) | Agent loaded skill, created PDF |
-| `deploy_lambda_push_sync.py` | PASS (after fixes) | Lambda + EventBridge + workload identity deployed |
-| `admin_approval_workflow.py` | SKIPPED | Requires real Slack webhook (`SLACK_INC_HOOK` env var) |
-| `dcr_registry_search_mcp_in_kiro.py` | SKIPPED | Requires Auth0 account + `.env` with `AUTH0_DOMAIN`, `AUTH0_AUDIENCE` |
-| `kiro-power-publisher-workflow` | N/A | IDE-driven, no standalone script |
-
----
-
-## Migration Pattern Notes
-
-- Source used Jupyter notebooks (`.ipynb`); target uses Python scripts (`.py`) with equivalent logic.
-- All target READMEs have a "Running the Python Scripts" section with `pip install` + `python