Releases: padosoft/agentic-qa-kit
v1.7.0-rc.1 — pack authoring (slices 1+2)
First release candidate for v1.7. Delivers the pack-authoring story (slices 1 and 2 of 4 planned).
What's in
📖 docs/PACK-AUTHORING.md
End-to-end tutorial for community pack authors:
- Directory layout, manifest schema, scenarios + risks structure
- Three distribution patterns (workspace pack / vendored copy / npm scope alias)
- Programmatic validation (aqa validate)
- Honest about current limitations: the no-network NO_NETWORK_PROBE stub returns {probe_id, status: 200, body: null} for every probe kind, only the http_status / response_contains / response_not_contains oracles are wired, and there is no custom oracle/probe loader yet
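To make the stub limitation concrete, here is a minimal sketch of how a fixed stub result interacts with the three wired oracle kinds. The type names, the `evaluate` helper, and the treatment of a null body as an empty string are assumptions for illustration, not the actual @aqa/runner API.

```typescript
// Hypothetical shapes; the real @aqa/runner types may differ.
type ProbeResult = { probe_id: string; status: number; body: string | null };

type Oracle =
  | { kind: "http_status"; expected: number }
  | { kind: "response_contains"; text: string }
  | { kind: "response_not_contains"; text: string };

// Mirrors the documented stub: status 200 and a null body for every probe.
function noNetworkProbe(probeId: string): ProbeResult {
  return { probe_id: probeId, status: 200, body: null };
}

// Assumption: a null body is evaluated as an empty string.
function evaluate(oracle: Oracle, result: ProbeResult): boolean {
  const body = result.body ?? "";
  switch (oracle.kind) {
    case "http_status":
      return result.status === oracle.expected;
    case "response_contains":
      return body.includes(oracle.text);
    case "response_not_contains":
      return !body.includes(oracle.text);
  }
}

const stubResult = noNetworkProbe("health");
const statusPasses = evaluate({ kind: "http_status", expected: 200 }, stubResult);
const containsPasses = evaluate({ kind: "response_contains", text: "ok" }, stubResult);
```

Under these assumptions an http_status: 200 oracle passes against the stub, while any response_contains oracle cannot, because the stub never carries a body.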
🛠 aqa pack new
CLI to scaffold a runnable pack at <cwd>/packs/<slug>/:
aqa pack new pack-myapp --sut-type api
aqa pack new pack-frontend --sut-type web --description "Smoke tests for the marketing site"
The scaffold produces a starter scenario whose http_status: 200 oracle passes cleanly against the stub probe out of the box (avoiding the iter-17 footgun where bundled packs emitted synthetic findings). Supports:
- --sut-type (api / web / cli / lib / agent / pipeline)
- --force: atomic backup-rename overwrite (non-destructive on failure)
- --description, --author, --license (SPDX)
Hardened against:
- Symlinks at packs/ parent and packDir (lstatSync checks)
- Non-directory parent (regular file at <root>/packs)
- Over-length slugs (cap at 52 chars to keep every derived ID within the 64-char Slug schema cap)
- TOCTOU on the existence check (uses non-recursive mkdir + explicit lstat)
- Schema-invalid generated output (in-memory PackManifest / Scenario / RiskMap validation before writing)
- Validation failures destroying the existing pack (backup-rename + restore-on-error)
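The slug cap in the list above can be sketched as a small pre-write check. Only the 52 and 64 character limits come from the release notes; the kebab-case pattern, the function names, and the "-smoke-001" suffix used for a derived ID are hypothetical.

```typescript
const SLUG_SCHEMA_MAX = 64; // documented Slug schema cap
const PACK_SLUG_MAX = 52;   // documented cap for pack slugs
// Assumed slug shape (lowercase kebab-case); the real schema may differ.
const SLUG_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Returns a list of problems; an empty list means the slug is acceptable.
function validatePackSlug(slug: string): string[] {
  const errors: string[] = [];
  if (!SLUG_PATTERN.test(slug)) {
    errors.push("slug must be lowercase kebab-case");
  }
  if (slug.length > PACK_SLUG_MAX) {
    errors.push(`slug is ${slug.length} chars; max is ${PACK_SLUG_MAX}`);
  }
  return errors;
}

// Hypothetical derived ID: the suffix is illustrative only.
function derivedScenarioId(slug: string): string {
  return `${slug}-smoke-001`;
}

// The cap leaves a 12-character budget for suffixes on derived IDs.
const suffixBudget = SLUG_SCHEMA_MAX - PACK_SLUG_MAX;
```

Capping the input rather than each derived ID means one check at scaffold time covers every ID generated from the slug, provided no suffix exceeds the remaining budget.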
🧪 Tests
54 in @aqa/kit (50 pack-new + 42 run-cmd subset). Lint + typecheck clean.
What's still pending in v1.7
- Slice 3: Admin "Create pack" wizard (future PR)
- Slice 4: Audit + wire/implement/document all 81 silent placeholder buttons across packages/admin/src/app.tsx (slices 4a–4f, future PRs; plan documented in docs/internal/admin-placeholder-audit.md)
- Final v1.7.0 tag after slices 3 + 4 ship
Review
19 iterations on PR #25 with both Copilot and Codex review bots. All real issues addressed; final iter returned 0 new must-fix items.
v1.6.0 — aqa run CLI + bundled packs
v1.6 — aqa run + ecosystem foundation
The missing piece between aqa init and a real audit trail. After 21 review iterations with Copilot + Codex (every one surfaced a real bug or coverage gap, zero false alarms), the inner loop is end-to-end usable.
What lands
- aqa run CLI command. Loads .aqa/project.yaml + .aqa/profiles.yaml via the canonical @aqa/schemas shapes, resolves packs from three discovery tiers (project's packs/*, node_modules/@aqa/*, kit-bundled dist/packs/*), filters scenarios by the selected profile's tags, and runs each one via @aqa/runner.runScenario. Streams events + findings into .aqa/runs/<run_id>/.
- Flags: --profile <name> (defaults to smoke if present, else first profile) and --seed <string> (deterministic run_id for tests + replay).
- Bundled packs. All 5 baseline packs (pack-core, pack-api-core, pack-web-ui, pack-llm-agent, pack-security) now ship inside @aqa/kit's npm tarball via a bundle-packs.mjs build step. A fresh aqa init + aqa run --profile smoke works with only @aqa/kit installed.
- SUT-aware init. aqa init picks the right packs from the detected sut_type (api → pack-api-core, web → pack-web-ui, agent → pack-llm-agent, else → pack-core). The framework clause on pack-api-core was dropped so plain Node/Bun APIs without a recognized framework still get coverage.
- Hardened orchestration: atomic run-dir creation (no TOCTOU on concurrent seeded runs), pack-manifest scenario discovery (no glob-scanning), path-traversal + symlink-escape rejection, applies_when filtering, manifest-name dedup with priority (project > node_modules > bundled), legacy bare-slug pack-name aliasing, agent-mode profile rejection until that driver lands, unrelated-broken-pack tolerance with structured warnings.
- Structured RunResult with ok, runId, runDir, scenariosRun, findingsCount, capped error string (MAX_DETAIL_PER_KIND + "…+N more" truncation), and a warnings array for non-fatal diagnostics. Detail samples (pack_error_samples, scenario_error_samples, …) live in the run_finished audit event for auditors.
- 42 TDD tests in packages/kit/test/run-cmd.test.ts: every behavior above is covered, written before the code existed.
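Two of the rules above, manifest-name dedup with tier priority and the --profile default, can be sketched as follows. The type and function names are hypothetical, and throwing on an unknown or missing profile is an assumption; only the priority order and the "smoke if present, else first profile" rule come from the release notes.

```typescript
type Tier = "project" | "node_modules" | "bundled";
type DiscoveredPack = { name: string; tier: Tier; dir: string };

// Lower number wins: project > node_modules > bundled.
const TIER_PRIORITY: Record<Tier, number> = {
  project: 0,
  node_modules: 1,
  bundled: 2,
};

// Keeps one pack per manifest name, preferring the highest-priority tier.
function dedupePacks(found: DiscoveredPack[]): DiscoveredPack[] {
  const winners = new Map<string, DiscoveredPack>();
  for (const pack of found) {
    const current = winners.get(pack.name);
    if (!current || TIER_PRIORITY[pack.tier] < TIER_PRIORITY[current.tier]) {
      winners.set(pack.name, pack);
    }
  }
  return Array.from(winners.values());
}

// Picks the requested profile, else "smoke" if defined, else the first one.
function selectProfile(names: string[], requested?: string): string {
  if (requested !== undefined) {
    // Assumption: an unknown profile name is a hard error.
    if (!names.includes(requested)) {
      throw new Error(`unknown profile: ${requested}`);
    }
    return requested;
  }
  if (names.includes("smoke")) return "smoke";
  const first = names[0];
  if (first === undefined) throw new Error("no profiles defined");
  return first;
}
```

With this shape, a project-local copy of pack-core shadows the kit-bundled one regardless of the order in which the tiers were scanned.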
Known scoped follow-ups (v1.7)
- Real HTTP probe runner. Today's runScenario still uses the no-network probe stub (@aqa/runner's NO_NETWORK_PROBE). The release-gate "fail on any finding" semantic (from require_deterministic_replay: true) is deferred until probes hit a real SUT; every finding the stub produces is synthetic.
- EventChainWriter ↔ verifyEventChain reconciliation. Writer omits prev_hash from the canonical body and emits null for seq=0; @aqa/compliance.verifyEventChain includes prev_hash and expects "0…". Tests ship a local writer-matching verifier; reconciling the two implementations is a separate cleanup.
- Pack authoring story. v1.7 will add docs/PACK-AUTHORING.md (community tutorial), aqa pack new <slug> (scaffolding CLI), and an admin "Create pack" wizard, plus a full audit pass on every placeholder button in the admin panel so nothing renders as a "muted click".
- Browser-driven ecosystem smoke. Playwright test that starts admin + runs aqa run against examples/bun-api and asserts findings appear in the admin UI end-to-end.
v1.5.0 — admin design integration
v1.5 — Admin design integration
The hi-fi prototype shipped by Claude Design (30 screens) is now the official admin web panel. The bundled prototype was ported to Vite + React 19 + TypeScript strict, all 30 screens render in production, and a Playwright suite drives every screen.
What landed
- 30 screens, real markup: 8.9k LOC ported to packages/admin/src/app.tsx (bundled, @ts-nocheck for the prototype's design-tool conventions). Dark-themed, token-driven CSS.
- Vite production build: replaced design-tool CDN React/Babel scripts with a regular Vite SPA. bun run dev boots in <500ms; bun run build ships a static bundle.
- Playwright suite: packages/admin/test/e2e/*.e2e.ts covers per-screen smoke, audit chain verify (OK + tampered), Findings views (Clusters/List/Kanban), Replay tabs, risk-map matrix, theme, palette. Real DOM, no mocks. bun run test:e2e runs it.
- CI gating: new E2E (Playwright, admin UI) job in .github/workflows/ci.yml builds the admin and runs the Playwright suite against the dev server.
- Quality: Biome ignores the bundled prototype to keep lint targeted; smoke filter tolerates the prototype's intentional console.error demo calls.
Known scoped tradeoffs
- In-memory routing only (the prototype was never URL-driven); reading window.location on boot is deferred to a follow-up.
- Live-mode currently animates time but still reads in-file mock data; wiring VITE_AQA_SERVER_URL to a real fetch layer is deferred to the next macro task.
What's next (v1.6)
Full end-to-end ecosystem smoke via Playwright: boot server + runner pool + admin in a single command, drive a real aqa run against examples/bun-api, verify findings appear in the admin and the audit chain stays valid end-to-end.
v1.4.0 — Admin API surface + issue #3 closed
Backend gap closure ahead of the parallel admin v2 design integration.
Server expansion
packages/server/src/api.ts makeApi() grows from 4 → 28 routes:
- Runs: list, detail, events, create
- Findings: list, detail, status mutation (with audit reason)
- Packs: list, detail, install, uninstall
- Profiles: list, detail, save, delete
- Risks: list, detail, save, delete
- Scenarios: list, detail, save
- Audit: scoped event query
- Cost: per-window summary aggregation
- Queue: snapshot + runner tap
- Notifications: list, mark-read
- Saved views: list, save, delete
- API tokens: list, create, revoke
- Tenancy: list orgs, list projects, create org, create project
All routes are permission-gated via @aqa/auth, tenant-scoped via
x-aqa-org / x-aqa-project headers, and return shape-compliant
@aqa/schemas objects. Multi-tenant fail-closed: missing scope → 400;
cross-tenant ID lookup → 404 (so probing for IDs in other projects
gains no information).
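The fail-closed behaviour described above can be sketched as a single lookup handler. The header names and the 400 / 404 status codes come from the release notes; the types, the in-memory array, and the handler name are hypothetical, and the @aqa/auth permission check is left out.

```typescript
type TenantScope = { org: string; project: string };
type StoredFinding = { id: string; org: string; project: string; summary: string };
type ApiResponse = { status: number; body: unknown };

// Reads the tenant scope from request headers; null when either is missing.
function readScope(headers: Record<string, string | undefined>): TenantScope | null {
  const org = headers["x-aqa-org"];
  const project = headers["x-aqa-project"];
  return org && project ? { org, project } : null;
}

function getFinding(
  headers: Record<string, string | undefined>,
  id: string,
  store: StoredFinding[],
): ApiResponse {
  const scope = readScope(headers);
  // Fail closed: no scope means no query at all.
  if (!scope) return { status: 400, body: { error: "missing tenant scope" } };

  // The tenant filter is part of the lookup itself, so an ID that exists
  // in another project is indistinguishable from one that does not exist.
  const hit = store.find(
    (f) => f.id === id && f.org === scope.org && f.project === scope.project,
  );
  if (!hit) return { status: 404, body: { error: "not found" } };
  return { status: 200, body: hit };
}
```

Returning 403 for the cross-tenant case would confirm that the ID exists somewhere, which is the information leak the 404 avoids.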
Schemas
6 new @aqa/schemas namespaces with Draft 2020-12 JSON Schemas
emitted (schemas/v1/ now ships 15 files):
- Notification
- SavedView
- ApiToken
- CostSummary
- Tenancy.Org + Tenancy.ProjectRef
Store
StoreProvider extended with 15+ methods covering the new endpoints.
MemoryStore implements all of them; PostgresStore retains the
explicit not implemented pattern so a misconfigured production
deployment fails loudly.
RunnerQueue gains snapshot(), requeue(id), kill(id) for the
admin queue ops screen.
Issue #3 closed
Three remaining Zod superRefines mirrored into JSON Schema:
- Finding.status='duplicate' ⇒ duplicate_of required
- ReproLevel.deterministic=true ⇒ attempts >= 1
- ProfilesFile.profile.name === key (via $comment; cross-field)
Cross-field invariants JSON Schema cannot express
(duplicate_of !== id, successes === attempts,
finished_at >= started_at, profile.name === key) surfaced via
$comment on the emitted schemas.
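Because the cross-field invariants only survive as $comment text in the emitted schemas, a consumer that validates with JSON Schema alone would need to re-check them in code. A minimal sketch of such a check follows; the type names, the `RunLike` holder for the timestamps, and the message strings are hypothetical, and this is not the Zod superRefine implementation.

```typescript
type FindingLike = {
  id: string;
  status: "draft" | "verified" | "rejected" | "duplicate" | "fixed";
  duplicate_of?: string;
};

// Hypothetical holder for the timestamp invariant (ISO 8601 strings assumed).
type RunLike = { started_at: string; finished_at?: string };

function findingIssues(f: FindingLike): string[] {
  const issues: string[] = [];
  // Conditional requirement: expressible in JSON Schema via if/then.
  if (f.status === "duplicate" && f.duplicate_of === undefined) {
    issues.push("duplicate_of is required when status is 'duplicate'");
  }
  // Cross-field comparison: not expressible in JSON Schema.
  if (f.duplicate_of !== undefined && f.duplicate_of === f.id) {
    issues.push("duplicate_of must differ from id");
  }
  return issues;
}

function runIssues(r: RunLike): string[] {
  if (
    r.finished_at !== undefined &&
    Date.parse(r.finished_at) < Date.parse(r.started_at)
  ) {
    return ["finished_at must be >= started_at"];
  }
  return [];
}
```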
Ajv 2020 round-trip test (packages/schemas/test/ajv-roundtrip.test.ts)
validates every fixture against the emitted schema — catches Zod ↔
JSON-Schema divergence at build time.
All 6 emitter patches now resolve the #/definitions/<name>
indirection that zod-to-json-schema emits.
Docs
- docs/design/admin-panel-spec-v2.md: full enterprise design brief
(tokens, 30 screens, component library, interactions, a11y, perf,
deliverables) for the external designer who builds the React
template in parallel.
- docs/PROGRESS.md updated with v1.4 entry, the post-design Playwright
smoke roadmap, and the final closing step (README + docs refresh
pass: audit v0.x references, finalise quick-start, write the
"How you use it" workflow section, prune obsolete docs).
Review loop
Codex + Copilot iterated 2 times before merge, surfacing and addressing
5 must-fix items across schema enum alignment, tenant-scope enforcement
on runs / findings detail, MemoryStore audit filter leniency on
unstamped events. CI 14/15 green throughout.
Numbers
- 28 server routes (was 4)
- 15 JSON Schemas (was 9)
- 205 tests (was 165)
- 19 packages (added @aqa/compliance previously; this release adds no new package)
PR: #22.
v1.3.0 — Quality batch
Six post-v1.2 polish items + an extended review-and-fix loop. No new packages; all quality / coverage / docs / correctness.
What landed
1. Admin server↔UI mapping
- packages/admin/src/data/api.ts fetches from VITE_AQA_SERVER_URL (real @aqa/server shape) with explicit error surfacing: no silent mock fallback in live mode.
- mapRun() / mapFinding() translate Run.Run (state, totals.findings, totals.llm_cost_usd) and Finding.Finding (status enum draft|verified|rejected|duplicate|fixed, verification_floor enum bug_level|scenario_level|agent_level, discovered_at) into the UI types. Screens stay source-agnostic.
- live / mock badge + red error banner on Runs and Findings.
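A rough sketch of what such a mapping layer might look like. Only the server-side field names and enum values listed above are taken from the release notes; `run_id`, `id`, and both UI-side types are assumptions, and the real mapRun() / mapFinding() may differ.

```typescript
type FindingStatus = "draft" | "verified" | "rejected" | "duplicate" | "fixed";
type VerificationFloor = "bug_level" | "scenario_level" | "agent_level";

// Server-side shapes (run_id and id are assumed identifiers).
type ServerRun = {
  run_id: string;
  state: string;
  totals: { findings: number; llm_cost_usd: number };
};
type ServerFinding = {
  id: string;
  status: FindingStatus;
  verification_floor: VerificationFloor;
  discovered_at: string; // ISO 8601 timestamp assumed
};

// Hypothetical UI-side types the screens would consume.
type UiRun = { id: string; state: string; findingCount: number; costUsd: number };
type UiFinding = {
  id: string;
  status: FindingStatus;
  floor: VerificationFloor;
  discoveredAt: Date;
};

// Flattens the nested totals so screens never touch the server shape.
function mapRun(run: ServerRun): UiRun {
  return {
    id: run.run_id,
    state: run.state,
    findingCount: run.totals.findings,
    costUsd: run.totals.llm_cost_usd,
  };
}

// Parses the timestamp once so screens can format it freely.
function mapFinding(finding: ServerFinding): UiFinding {
  return {
    id: finding.id,
    status: finding.status,
    floor: finding.verification_floor,
    discoveredAt: new Date(finding.discovered_at),
  };
}
```

Keeping the translation in one module is what lets the screens stay source-agnostic: a mock source only has to produce the UI types.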
2. Admin sub-screens (6 detail routes)
/runs/$runId, /findings/$findingId, /risk-map/$riskId, /profiles/$profileName, /packs/$packSlug, /scenarios/$scenarioId. Each with Breadcrumb + PageHeader. Runs table rows are clickable links.
3. Admin unit tests (12 new, 176 total)
- test/audit.test.ts (5): parseEventLines ×2, verifyEventChain ×3 (good chain, tampered, vacuous truth).
- test/cluster.test.ts (6): signatureOf (identity, normalisation, divergence), clusterFindings (grouping, worst-severity, sort).
4. CLI E2E smoke gate
scripts/e2e-cli.mjs runs against a fresh tmpdir sandbox (seeded with a minimal package.json + aqa init). All four checks (--version, --help, doctor, validate) must exit 0. Wired into CI as a new e2e-cli job in .github/workflows/ci.yml.
5. Threat model expansion
docs/security/threat-model.md grows from a 12-line stub to a full STRIDE catalog: trust-boundary diagram, 20 specific threats with current mitigation + status, and agentic-specific cross-cutting threats (tool-result poisoning, confirmation bypass, supply chain, cost-based DoS).
6. CHANGELOG.md backfill
Entries for v0.2.0 → v1.3.0 in Keep-a-Changelog format.
Review loop
Codex + Copilot review iterated 3 times before merge, surfacing and addressing 21 must-fix items across schema enum alignment, fake-live fallback, error-vs-not-found splitting, CLI E2E hardness, threat-model precision (S-03 narrower scope, D-01 / S-01 / I-03 downgraded from Mitigated to Partial / Unmitigated to reflect actual code). All inline comments addressed. CI 15/15 green.
v1.2.0 — Admin wired
v1.2.0 — Admin SPA wired
The admin panel goes from inline-style placeholder to a real SPA.
Stack
- Tailwind 4 via @tailwindcss/vite: @theme tokens + .dark variant
- TanStack Router (code-based): 12 typed routes
- TanStack Query: async data + mutations
- Zustand: theme store, persisted to localStorage
- lucide-react: icons
- date-fns: relative timestamps
12 screens
Dashboard (KPIs), Runs (table), Findings (clustered via content-hash signature), Risk map (grouped by category), Profiles, Packs (with signature badge), Scenarios (pack→scenario tree), Agents (per-agent instruction files), Replay (per-finding repro.sh / repro.curl preview + verify button), Audit log (paste events.jsonl → re-walk sha256 chain in-browser; "Load tampered chain" demo button), Cost (bar by profile), Settings (theme toggle).
Real, not placeholder
- Findings clusters via signatureOf(scenario × risk × normalised summary) using Web Crypto. Mirrors @aqa/clustering.
- Audit log re-walks the sha256 prev_hash chain in-browser; demo shows mechanical tamper detection. Mirrors @aqa/compliance's verifyEventChain on top of crypto.subtle.
- Theme toggle persisted, applies .dark on <html>.
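The clustering idea can be sketched without the hashing step. The example below groups findings on the normalised (scenario, risk, summary) tuple itself rather than its sha256 digest, so it is a simplification of signatureOf, not a reimplementation. The normalisation rules, the four-level severity scale, and the worst-first sort are assumptions; "representative = earliest, severity = worst" follows the v0.5.0 notes.

```typescript
type Severity = "low" | "medium" | "high" | "critical"; // assumed scale
type ClusterableFinding = {
  id: string;
  scenario: string;
  risk: string;
  summary: string;
  severity: Severity;
  discovered_at: string;
};
type Cluster = {
  signature: string;
  representative: ClusterableFinding;
  severity: Severity;
  members: ClusterableFinding[];
};

const SEVERITY_RANK: Record<Severity, number> = { low: 0, medium: 1, high: 2, critical: 3 };

// Assumed normalisation: lowercase, digits masked, whitespace collapsed.
function normalise(summary: string): string {
  return summary.toLowerCase().replace(/\d+/g, "#").replace(/\s+/g, " ").trim();
}

// Stand-in for signatureOf: the real one hashes this tuple with sha256.
function signatureKey(f: ClusterableFinding): string {
  return [f.scenario, f.risk, normalise(f.summary)].join("|");
}

function clusterFindings(findings: ClusterableFinding[]): Cluster[] {
  const byKey = new Map<string, Cluster>();
  for (const f of findings) {
    const key = signatureKey(f);
    const cluster = byKey.get(key);
    if (!cluster) {
      byKey.set(key, { signature: key, representative: f, severity: f.severity, members: [f] });
      continue;
    }
    cluster.members.push(f);
    // Representative is the earliest finding in the cluster.
    if (Date.parse(f.discovered_at) < Date.parse(cluster.representative.discovered_at)) {
      cluster.representative = f;
    }
    // Cluster severity is the worst severity among its members.
    if (SEVERITY_RANK[f.severity] > SEVERITY_RANK[cluster.severity]) {
      cluster.severity = f.severity;
    }
  }
  // Assumed ordering: worst clusters first.
  return Array.from(byKey.values()).sort(
    (a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity],
  );
}
```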
Browser-side hash verifier
node:crypto is not Vite-safe; the admin re-implements the verifier on Web Crypto. The Node CLI in @aqa/compliance remains the SOC2 source of truth — the in-browser copy is a UX affordance only.
Build
376 KB JS (116 KB gzip), 9.94 KB CSS (2.92 KB gzip). 165 tests still pass.
PR: #19.
v1.1.0 — Polish
Post-GA polish release. Drops the "pre-alpha" label, ships the operator-facing chart, demonstrates language-agnostic targeting.
What landed
- README: banner image wired (docs/assets/banner.png), pre-alpha badge replaced with GA + Release badges, Status section reflects v1.0 GA / v1.1 current.
- deploy/helm feature-complete: server Deployment + Service, runner StatefulSet with per-pod PVC (deterministic fixtures), optional Ingress + TLS, NetworkPolicy that confines runner egress to server + DNS + operator-provided CIDRs, optional in-cluster Postgres subchart for dev/PoC.
- Examples: examples/bun-api (Hono+Bun, api-core + security), examples/nextjs-saas (Next.js 15 with session-cookie invariant, web-ui + api-core), examples/laravel-app (PHP/Laravel 11, api-core; demonstrates that AQA is target-language agnostic).
- docs/LESSON.md: consolidated v1.0 → v1.1 retrospective covering bundling strategy, exactOptionalPropertyTypes patterns, LongSlug pitfalls, deploy-scaffold self-labeling, audit-verifier package separation.
165 tests pass, biome + tsc strict zero errors.
PR: #18.
v1.0.0 — GA
v1.0.0 — GA: SOC2/ISO readiness
Task 23 — closes the 24-task roadmap.
What landed
- @aqa/compliance: SOC2 TSC + ISO 27001:2022 Annex A controls catalog (CONTROL_MAPPINGS), controlsCoverage() summarizer.
- verifyEventChain(events): re-walks the sha256 prev_hash chain emitted by the runner; reports first mismatch.
- aqa-audit-verify <path> CLI: non-zero exit on chain break; wire into CI to fail builds on tampered audit logs.
- docs/compliance/soc2-iso-mapping.md: auditor-facing source of truth.
- docs/compliance/pen-test-scope.md: pen-test engagement contract.
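The chain re-walk can be illustrated with a small self-contained sketch. It assumes a Node-compatible runtime for `node:crypto`; the event shape, the hashed string layout, and the `GENESIS` sentinel are all hypothetical and do not reproduce the canonical body that the runner and @aqa/compliance actually hash.

```typescript
import { createHash } from "node:crypto";

type AuditEvent = { seq: number; prev_hash: string; payload: string; hash: string };

// Hypothetical sentinel for the first event's prev_hash.
const GENESIS = "genesis";

// Assumed canonical form: seq, prev_hash and payload joined by newlines.
function hashEvent(seq: number, prevHash: string, payload: string): string {
  return createHash("sha256").update(`${seq}\n${prevHash}\n${payload}`).digest("hex");
}

// Returns a new chain with one event appended and linked to the previous hash.
function appendEvent(chain: AuditEvent[], payload: string): AuditEvent[] {
  const last = chain[chain.length - 1];
  const seq = chain.length;
  const prev_hash = last ? last.hash : GENESIS;
  return [...chain, { seq, prev_hash, payload, hash: hashEvent(seq, prev_hash, payload) }];
}

type VerifyResult = { ok: true } | { ok: false; firstMismatch: number };

// Re-walks the chain and reports the index of the first broken link.
// An empty chain verifies (vacuous truth).
function verifyChain(events: AuditEvent[]): VerifyResult {
  let expectedPrev = GENESIS;
  let index = 0;
  for (const e of events) {
    const valid =
      e.seq === index &&
      e.prev_hash === expectedPrev &&
      e.hash === hashEvent(e.seq, e.prev_hash, e.payload);
    if (!valid) return { ok: false, firstMismatch: index };
    expectedPrev = e.hash;
    index += 1;
  }
  return { ok: true };
}
```

A CLI wrapper in the spirit of aqa-audit-verify would read the events file, call the verifier, and exit non-zero when `ok` is false.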
Roadmap closed
24 tasks, 18 packages, 165 tests across Bun 1.3.11 and Node 22 LTS.
PR: #17.
v0.6.0
v0.6.0 — Methodology + deploy assets
Tasks 21, 22 in one bundle.
What landed
- @aqa/methodology (Task 21): STRIDE/FMEA/OWASP risk mapping. strideOf, fmeaScore (RPN = severity × occurrence × detection), owaspOf, methodologyCheck (flags risks without any framework anchor).
- Deploy scaffolds (Task 22): deploy/helm chart skeleton, deploy/terraform namespace module, scripts/air-gap-install.sh (bundle + verify).
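As a worked illustration of the RPN formula and the anchor check, here is a minimal sketch. The 1 to 10 rating bounds follow common FMEA practice and are an assumption about this package, as are the `RiskLike` shape and the `unanchoredRisks` name; the signatures of the real fmeaScore and methodologyCheck are not stated in these notes.

```typescript
type FmeaRating = { severity: number; occurrence: number; detection: number };

// Assumption: each factor is an integer on the conventional 1..10 scale.
function assertRating(label: string, value: number): void {
  if (!Number.isInteger(value) || value < 1 || value > 10) {
    throw new RangeError(`${label} must be an integer from 1 to 10, got ${value}`);
  }
}

// Risk Priority Number: severity × occurrence × detection.
function fmeaScore(rating: FmeaRating): number {
  assertRating("severity", rating.severity);
  assertRating("occurrence", rating.occurrence);
  assertRating("detection", rating.detection);
  return rating.severity * rating.occurrence * rating.detection;
}

// Hypothetical risk shape with optional framework anchors.
type RiskLike = { id: string; stride?: string[]; owasp?: string[]; fmea?: FmeaRating };

// Returns the IDs of risks that carry no STRIDE, OWASP, or FMEA anchor.
function unanchoredRisks(risks: RiskLike[]): string[] {
  return risks
    .filter(
      (r) =>
        (r.stride?.length ?? 0) === 0 &&
        (r.owasp?.length ?? 0) === 0 &&
        r.fmea === undefined,
    )
    .map((r) => r.id);
}
```

For example, a risk rated severity 8, occurrence 3, detection 5 yields an RPN of 8 × 3 × 5 = 120.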
PR: #16.
v0.5.0
v0.5.0 — Multi-team + clustering
Tasks 19, 20 in one bundle.
What landed
- @aqa/server (Task 19): framework-agnostic makeApi() routing table + RunnerQueue (FIFO with visibility leases).
- @aqa/clustering (Task 20): signatureOf (sha256 of scenario / risk / normalised summary) + clusterFindings (representative = earliest, severity = worst).
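"FIFO with visibility leases" usually means that a leased job stays in the queue but is hidden from other consumers until it is acknowledged or its lease expires. The sketch below illustrates that general pattern; the class, its method names, and the injected clock are hypothetical and are not the RunnerQueue API.

```typescript
type Job = { id: string; payload: string };
type QueueEntry = { job: Job; leasedUntil: number | null };

class LeaseQueue {
  private entries: QueueEntry[] = [];

  enqueue(job: Job): void {
    this.entries.push({ job, leasedUntil: null });
  }

  // Hands out the oldest job that is not under an active lease and hides it
  // for leaseMs. Time is passed in explicitly to keep the sketch deterministic.
  lease(now: number, leaseMs: number): Job | null {
    for (const entry of this.entries) {
      if (entry.leasedUntil === null || entry.leasedUntil <= now) {
        entry.leasedUntil = now + leaseMs;
        return entry.job;
      }
    }
    return null;
  }

  // Removes a finished job; returns false when the ID is unknown.
  ack(id: string): boolean {
    const index = this.entries.findIndex((e) => e.job.id === id);
    if (index === -1) return false;
    this.entries.splice(index, 1);
    return true;
  }

  size(): number {
    return this.entries.length;
  }
}
```

If a runner crashes without acknowledging, its job becomes visible again once the lease lapses, so work is retried rather than lost.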
PR: #15.