feat(ci): add hourly canary for smoke test#1486
Conversation
|
Claude Security Review: no high-confidence findings. (run) |
Package TarballHow to installgh release download pr-1486-tarball --repo aws/agentcore-cli --pattern "*.tgz" --dir /tmp/pr-tarball
npm install -g /tmp/pr-tarball/aws-agentcore-0.18.0.tgz |
agentcore-cli-automation
left a comment
There was a problem hiding this comment.
The CI plumbing looks solid overall — the script extraction is a nice cleanup, and the matrix covers the install vectors I'd expect. I have one concern about noise that's worth thinking through before merging; everything else is minor.
| OPENAI_API_KEY: ${{ env.E2E_OPENAI_API_KEY }} | ||
| GEMINI_API_KEY: ${{ env.E2E_GEMINI_API_KEY }} | ||
| # Smoke test for now, only runs strands-bedrock.test.ts | ||
| run: npx vitest run --project e2e e2e-tests/strands-bedrock.test.ts |
There was a problem hiding this comment.
Hourly canary + no retry will create noisy high-severity GitHub issues on transient failures.
This job runs an e2e flow (create → deploy → invoke → teardown against real AWS) hourly across 4 matrix cells. Real e2e tests are inherently flaky — AWS API throttles, transient CFN/Lambda hiccups, network blips, occasional model-provider 5xx, etc. — and the very first such flake will:
- Open a
high-severityissue with titleCanary Failure: <variant>/<build> - Notify the team via the Slack integration
- Stay open until someone manually closes it (which then re-arms the alert for the next flake)
There's no retry around npx vitest run … and vitest doesn't retry by default, so a single transient failure → a real page. At hourly cadence with 4 variants, that's almost certainly going to generate a steady stream of false-positive issues that erode signal.
A few options to consider (any one of these would help):
- Add
retry: 2(or similar) at the vitest level for canary runs, e.g. via avitest.canary.config.tsor an inline--retryflag, so the smoke test only fails after multiple consecutive attempts. - Wrap the smoke-test step in a small bash/
actions/retryloop that re-runs the wholenpm install -g … && vitest …flow on failure before falling through to the issue-creation step. - Require N consecutive failed scheduled runs before opening an issue (e.g., write a marker to repo state / artifacts and only open the issue on the 2nd or 3rd consecutive failure).
- Drop the severity to something less alarming (e.g., a separate
canarylabel withouthigh-severity) until you've calibrated the noise level for a week or two.
Happy with any of these — just want to avoid waking people up for transient AWS hiccups before the canary has earned its alert budget.
There was a problem hiding this comment.
Good call, added 3 retries via CLI.
|
Claude Security Review: no high-confidence findings. (run) |
|
Claude Security Review: no high-confidence findings. (run) |
aidandaly24
left a comment
There was a problem hiding this comment.
One comment that can be a follow up + a question.
| description: 'AWS region for deployment' | ||
| default: 'us-east-1' | ||
| schedule: | ||
| - cron: '0 14 * * 1' # Every Monday at 9 AM EST (14:00 UTC) |
There was a problem hiding this comment.
This removes the Monday cron from e2e-full-test which in my opinion shrinks our coverage pretty significantly. The new canary replaces it with only strands-bedrock.test.ts (1 of 30 e2e suites). For catching CDK schema drift this is fine. But a regression specific to, say, LangGraph+Gemini or container builds is now only caught on push-to-main, not on a schedule. Is this intentional?
There was a problem hiding this comment.
Yes, intentional for a few reasons:
- the scheduled monday cron job didn't have a notification mechanism, so we didn't know when it failed.
- running the entire test suite means we need to reduce the frequency we run it. This change proposed here is to lean towards frequent smoke tests instead of full e2e tests to justify a high frequency.
- full e2e tests are already run on every commit to main, which IMO is where we're more likely to see regressions.
| beforeAll(async () => { | ||
| if (!canRun) return; | ||
|
|
||
| await cleanupStaleCredentialProviders(); | ||
|
|
||
| testDir = join(tmpdir(), `agentcore-e2e-${randomUUID()}`); | ||
| await mkdir(testDir, { recursive: true }); | ||
|
|
||
| agentName = `E2e${cfg.framework.slice(0, 4)}${cfg.modelProvider.slice(0, 4)}${String(Date.now()).slice(-8)}`; | ||
| agentName = `E2e${cfg.framework.slice(0, 4)}${cfg.modelProvider.slice(0, 4)}${randomUUID().replace(/-/g, '').slice(0, 8)}`; |
There was a problem hiding this comment.
beforeAll calls cleanupStaleCredentialProviders(), which sweeps orphaned E2e* credential providers from runs that died before afterAll teardown. But there's no equivalent janitor for the CloudFormation stack itself.
In the existing e2e flows this is low-risk — they run on push/PR, so a leaked stack is rare and gets noticed. The canary changes that calculus: at 4 matrix cells × hourly = ~96 deploy/teardown cycles/day, a teardown that crashes between deploy and afterAll (timeout, runner eviction, AWS flake mid-tear-down) leaks an AgentCore-E2e…-default stack with nothing to reclaim it. A single persistent teardown bug could pile up ~96 stacks/day, and each carries real resources (runtime, IAM roles, KMS key from bootstrap) — so this compounds into cost and CFN/runtime quota exhaustion before anyone sees the Slack issue.
Suggest adding a beforeAll stack sweep that mirrors the credential-provider janitor — delete AgentCore-E2e* stacks older than the same cutoff.
This can also be a follow up, should be non-blocking.
There was a problem hiding this comment.
Good call, I think this is something we should do since I see 1000+ stacks siting in the CI account.
Description
Add smoke tests once an hour that create an issue on failure. The issue will notify the team via the slack integration.
Related Issue
#1485
Closes #
Documentation PR
Type of Change
Testing
How have you tested the change?
npm run test:unitandnpm run test:integnpm run typechecknpm run lintsrc/assets/, I rannpm run test:update-snapshotsand committed the updated snapshotsChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.