diff --git a/.ai/architecture-audit-report.md b/.ai/architecture-audit-report.md new file mode 100644 index 0000000..cfc8cda --- /dev/null +++ b/.ai/architecture-audit-report.md @@ -0,0 +1,399 @@ +# Architecture & Design Integrity Audit Report + +**Date:** 2026-07-03 +**Auditor:** Hermes Agent +**Repository:** agent-workbench (GitHub: MerverliPy/agent-workbench) +**Scope:** Actual code vs. declared architecture, protocol adherence, package boundary enforcement + +--- + +## Executive Summary + +The repository is **well-structured with strong protocol adherence**. The most significant finding is a **HIGH severity boundary violation**: the TUI depends on and imports from `@agent-workbench/eval`, which is not an allowed dependency per AGENTS.md. Additionally, **AGENTS.md and architecture docs are stale** — they list only 3 apps / 15 packages, but the actual codebase has 5 apps / 20 packages. The protocol-route-contract pattern is correctly implemented end-to-end (protocol → SDK → server → validation → OpenAPI). + +--- + +## 1. Workspace Configuration + +### Finding: ✅ Workspaces match directory structure + +**Root `package.json`** declares workspaces as: +```json +["apps/*", "packages/*", "tests"] +``` + +This correctly matches the actual directory layout: +- **apps/**: cli, dashboard, mobile-web, server, tui (5 apps, all with `package.json`) +- **packages/**: auth, cache, collab, config, core, diff, eval, events, models, permissions, planner, plugin-sdk, protocol, sdk, shell, storage, telemetry, tokens, tools, ui (20 packages, all with `package.json`) +- **tests/**: single `package.json` at root + +**All package names** follow the `@agent-workbench/` convention consistently. + +--- + +## 2. Package Boundary Compliance + +### DECLARED BOUNDARIES (from AGENTS.md) + +| Package | Allowed to Import | Must NOT Import | +|---------|-------------------|-----------------| +| apps/tui | sdk, protocol, events, ui | core, tools, shell, storage, permissions/internal, models/internal | + +### FINDING: HIGH — TUI imports `@agent-workbench/eval` (violation) + +**Evidence:** +- **`apps/tui/package.json`** line 13: `"@agent-workbench/eval": "workspace:*"` +- **`apps/tui/src/components/panels/PlaygroundPanel.tsx`** line 1: `import { ModelPlayground } from "@agent-workbench/eval"` +- **`apps/tui/src/components/panels/ComparisonPanel.tsx`** line 1: `import { ModelComparer } from "@agent-workbench/eval"` + +**Why it's a violation:** AGENTS.md explicitly states TUI may import "packages/sdk, packages/protocol, packages/events, and packages/ui". The `@agent-workbench/eval` package is not in that list. The eval package has storage dependencies (Drizzle ORM, SQLite) and should not be consumed by the thin TUI client. + +**Recommendation:** Either (a) add `@agent-workbench/eval` to the allowed-imports list in AGENTS.md, or (b) refactor the TUI eval panels to communicate through the SDK/server (like every other feature does). + +### FINDING: MEDIUM — TUI does not import `@agent-workbench/events` or `@agent-workbench/ui` + +**Evidence:** +- `apps/tui/package.json` has no `@agent-workbench/events` dependency — it consumes events through the SDK's SSE transport via `@agent-workbench/sdk` +- `apps/tui/package.json` has no `@agent-workbench/ui` dependency — the ui package exports are used nowhere + +**Analysis:** This is not a violation but means the declared allowances are aspirational rather than enforced. The TUI correctly gets events via SDK's SseTransport (which validates with `EventEnvelope.safeParse`), not by importing the events package directly. The `@agent-workbench/ui` package is essentially unused — it has no dependencies and no exports (`exports` field absent from its package.json). + +**Recommendation:** Remove `@agent-workbench/ui` from AGENTS.md allowed-imports list, or implement shared UI primitives in it. + +### FINDING: MEDIUM — AGENTS.md is incomplete + +**Packages that exist but are NOT documented in AGENTS.md boundaries:** + +| Package | Purpose | AGENTS.md coverage | +|---------|---------|-------------------| +| apps/cli | CLI entrypoint with plugin loading | Not mentioned | +| apps/dashboard | Observability dashboard (Vite + Solid) | Not mentioned | +| apps/mobile-web | Mobile web companion (Vite + Solid + PWA) | Not mentioned | +| packages/auth | Bearer token auth, TLS, session tokens | Not mentioned | +| packages/collab | Session sharing, review queue, presence | Not mentioned | +| packages/eval | Model evaluation & playground | Not mentioned | +| packages/telemetry | OpenTelemetry, Prometheus metrics | Not mentioned | +| packages/plugin-sdk | Plugin extension interfaces | Not mentioned | +| packages/config | Layered config loading | Not mentioned | + +These have been added through later phases (22-29) but AGENTS.md hasn't been updated to reflect the current package inventory. + +### FINDING: ✅ No TUI imports runtime authority packages + +Confirmed Zero violations across TUI source for importing: +- `from "@agent-workbench/core"` → 0 matches +- `from "@agent-workbench/tools"` → 0 matches +- `from "@agent-workbench/shell"` → 0 matches +- `from "@agent-workbench/storage"` → 0 matches +- `from "@agent-workbench/permissions"` → 0 matches +- `from "@agent-workbench/models"` → 0 matches + +The same is true for apps/dashboard and apps/mobile-web. + +### FINDING: ✅ Server dependencies are appropriate + +`apps/server` depends on: cache, core, events, models, permissions, protocol, shell, storage, tokens, tools, telemetry, plugin-sdk, auth, collab, hono, ulid, zod. All are legitimate server concerns. + +--- + +## 3. Protocol Adherence (Zod Schemas) + +### FINDING: ✅ Route contracts follow the declared pattern + +**`packages/protocol/src/types.ts`** defines the `RouteContract` interface with: +```typescript +interface RouteContract { + readonly method: "GET" | "POST" | "PATCH" | "DELETE"; + readonly path: string; + readonly pathParams?: z.ZodType; + readonly query?: z.ZodType; + readonly body?: z.ZodType; + readonly response: z.ZodType; + readonly errors: readonly [typeof ErrorEnvelope]; + readonly isStream?: boolean; +} +``` + +**Every route** in `packages/protocol/src/routes/` follows this pattern. Examples verified: +- `session.ts` — CreateSessionRoute, ListSessionsRoute, GetSessionRoute, UpdateSessionRoute, AbortSessionRoute, SummarizeSessionRoute, DeleteSessionRoute +- `message.ts` — SubmitMessageRoute, ListMessagesRoute, GetMessageRoute +- `plan.ts` — ListPlansRoute, GetPlanRoute, DecidePlanRoute +- `event.ts` — EventRoute (with isStream: true) + +### FINDING: ✅ Route contracts properly distinguish pathParams, query, body, response, errors + +Every route correctly specifies its shape. `SessionIdParams` is reused across routes. Errors always include `ErrorEnvelope`. + +### FINDING: ✅ OpenAPI is generated from route contracts + +**`packages/protocol/src/openapi/index.ts`** implements `createOpenApiDocument()` which: +- Creates an `OpenAPIRegistry` +- Registers all route contracts from the routes directory +- Uses `@asteasolutions/zod-to-openapi` to generate OpenAPI 3.0.3 documents +- Preserves path params (`:param` → `{param}`), query params, request bodies, error responses, and SSE media types (`text/event-stream`) + +17 route contracts are registered for OpenAPI generation. + +--- + +## 4. SDK → Protocol Contract Consumption + +### FINDING: ✅ SDK consumes protocol contracts, does not duplicate DTOs + +**`packages/sdk/src/resources/sessions.ts`** demonstrates the pattern: +```typescript +import { AbortSessionRoute, CreateSessionRoute, ... } from "@agent-workbench/protocol"; + +async create(data: z.infer): Promise> { + return this.transport.request(CreateSessionRoute.method, CreateSessionRoute.path, { + body: data, responseSchema: CreateSessionRoute.response, + }); +} +``` + +Every SDK resource (sessions, messages, permissions, plans, tools, agents, files, config, providers, auth, token-health, tui) follows this pattern. + +### FINDING: ✅ SDK validates responses, not casts + +**`packages/sdk/src/transport/http.ts`** lines 88-101: +```typescript +if (options?.responseSchema) { + const result = options.responseSchema.safeParse(parsed); + if (!result.success) { + throw new SdkError(`Response validation failed: ${issues}`, result.error); + } + return result.data as T; +} +``` + +Responses are validated with `safeParse` and throw on mismatch — no blind casts. + +### FINDING: ✅ SSE parsing validates event envelopes + +**`packages/sdk/src/transport/sse.ts`** lines 122-134: +```typescript +const result = EventEnvelope.safeParse({ ...raw, type: raw.type ?? type }); +if (!result.success) { + this.errorHandler?.(new SdkError(`Malformed SSE event: ${issues}`)); + return; +} +``` + +Event envelopes are validated. Malformed events are reported via error handler, not silently swallowed. + +### FINDING: ✅ SDK error handling uses ErrorEnvelope schema + +**`packages/sdk/src/transport/http.ts`** lines 117-118: +```typescript +const parsed = ErrorEnvelope.safeParse(body); +if (parsed.success) { ... } +``` + +HTTP error responses are parsed through the ErrorEnvelope Zod schema. + +--- + +## 5. Server Implementation + +### FINDING: ✅ Server uses Hono correctly + +**`apps/server/src/app.ts`** creates a `new Hono()` and applies middleware chain: +- `requestIdMiddleware` (first) +- `rateLimitMiddleware` +- `metricsMiddleware` +- `tracingMiddleware` +- CORS with localhost defaults (ADR-0004) +- `authMiddleware` (conditional, Phase 27) + +### FINDING: ✅ Server routes consume protocol contracts + +**`apps/server/src/routes/helpers.ts`**: +```typescript +export function createJsonRouteHandler( + contract: RouteContract, + handler: (context, routeContext) => T +): Handler { + return async (context) => { + const validated = await validateRequest(contract, context.req); + const result = await handler(context, { validated }); + const response = validateResponse(contract, result); + return context.json(response); + }; +} +``` + +**`apps/server/src/utils/validation.ts`** validates path params, query params, and body through the contract's Zod schemas. Response validation ensures handlers produce valid output. + +**All route handlers** (session-routes.ts, message-routes.ts, permission-routes.ts, plan-routes.ts, etc.) use `createJsonRouteHandler` with protocol contracts — no manual validation. + +### FINDING: ✅ SSE event routing is correct + +**`apps/server/src/routes/global.ts`** handles the EventRoute path: +- Validates request with `validateRequest(EventRoute, context.req)` +- Uses `hono/streaming`'s `streamSSE` for the SSE connection +- Subscribes to `EventBus` and writes JSON-serialized events as SSE data lines +- Includes keep-alive (30s interval) and retry directive + +--- + +## 6. Decisions Implementation + +### FINDING: ✅ Decision 0013 (Pre-Run Planner) is implemented + +**The pre-run planner gate exists and is operational:** + +| Requirement | Status | Evidence | +|-------------|--------|----------| +| Plan object exists | ✅ | `packages/planner/src/validate.ts` + `packages/protocol/src/schemas/plan.ts` | +| Mutation gate exists | ✅ | `packages/core/src/plan-gate.ts` — `PlanGate` class with `gate()` method | +| Unsafe plans can be rejected | ✅ | `validatePlan()` rejects empty/invalid plans; `PermissionEngine.evaluatePlan()` blocks denied plans | +| Plan events are ledgered | ✅ | `RunLedger` records plan proposed/denied/approved/completed | +| Mutation does not bypass plan gate | ✅ | `SessionRunner` calls `planGate.gate()` before mutation tool dispatch | + +**PlanGate flow:** Build plan → validate → permission evaluation → if "ask": persist + emit → wait for user decision via PermissionGate → proceed or block. + +### FINDING: ✅ Decision 0015 (Dry-Run) is partially implemented + +| Requirement | Status | Evidence | +|-------------|--------|----------| +| File dry-run exists | ✅ | diff previews generated before permission gates in SessionRunner (lines with `generateDiffPreview`) | +| Shell preview exists | ✅ | `packages/shell` exports `previewCommand()`, `CommandPreview` | +| Dry-run separate from execution | ✅ | Diff is preview-only, shell preview is static | +| Permission flow uses dry-run metadata | ✅ | Diff summary included in PermissionRequest payload | +| Ledger records dry-run events | ✅ | Ledger records diff.preview_created events | + +**Note:** True sandbox dry-run (non-mutating execution simulation) is not implemented — this is by design per decision 0015's notes. + +--- + +## 7. Architecture Documentation vs. Reality + +### FINDING: MEDIUM — docs/02_ARCHITECTURE.md is stale + +The architecture doc lists this package model: +``` +apps/ → cli, server, tui (3 apps) +packages/ → protocol, sdk, core, events, storage, config, permissions, tools, models, shell, diff, tokens, cache, planner, ui (15 packages) +``` + +Actual: +``` +apps/ → cli, dashboard, mobile-web, server, tui (5 apps) +packages/ → auth, cache, collab, config, core, diff, eval, events, models, permissions, planner, plugin-sdk, protocol, sdk, shell, storage, telemetry, tokens, tools, ui (20 packages) +``` + +Missing from architecture doc: dashboard, mobile-web, auth, collab, eval, plugin-sdk, telemetry (+ config exists in doc but is a stub with no source files). + +The **architecture diagram** in docs/02_ARCHITECTURE.md also doesn't show the dashboard, mobile-web, auth, collab, telemetry, or plugin-sdk. + +### FINDING: ✅ DESIGN.md is consistent with implemented web UI + +The DESIGN.md document describes a dark-first design system for mobile-web/dashboard companions: +- Color tokens (slate/navy palette, single blue accent) +- Typography, spacing, border radii +- Component definitions (buttons, cards, inputs, badges, messages, tabs, nav drawer, model chip, code blocks) +- Accessibility contract (ARIA, focus-visible, reduced-motion, touch targets, WCAG AA contrast) +- Anti-patterns (no gradients, no box-shadows, no LLM purple) + +This matches the actual styling in apps/mobile-web (Tailwind CSS with similar tokens, PWA manifest, notifications, offline support). + +--- + +## 8. Core Runtime Structure + +### FINDING: ✅ Clean internal architecture + +**`packages/core/src/`** files: +- `session-runner.ts` — Main orchestration loop (model/tool loop, permission gating, streaming) +- `plan-gate.ts` — Pre-mutation planning gate +- `model-router.ts` — Model provider routing +- `tool-dispatcher.ts` — Tool call dispatching +- `context-builder.ts` — Context building for model prompts +- `event-publisher.ts` — Event emission wrapper +- `run-ledger.ts` — Ledger recording wrapper +- `run-state.ts` — Active run registry +- `token-health.ts` — Token health service +- `pty-orchestrator.ts` — PTY orchestration +- `agent/` — Agent definitions, registry, types +- `types.ts` — CoreDependencies interface (DI container pattern) + +The `CoreDependencies` interface cleanly separates concerns — it accepts repository/service interfaces from storage, events, tools, models, permissions, shell — no direct imports of implementation internals. + +### FINDING: ✅ Planner is lightweight and focused + +**`packages/planner/src/`** has only 2 files: +- `index.ts` — Public exports (validatePlan, computePlanRiskLevel, etc.) +- `validate.ts` — Plan validation logic (98 lines) + +No overcomplicated DAG system. Lightweight validation per ADR-0013. + +### FINDING: ✅ Permission engine is clean and deterministic + +**`packages/permissions/src/`** has 5 files: +- `engine.ts` — Stateless `PermissionEngine` with 5-step evaluation (command → agent → path → tool → fallback) +- `gate.ts` — Stateful `PermissionGate` for async pause/resume of ask-gated operations +- `policy.ts` — Default policy definitions +- `types.ts` — Type definitions +- `index.ts` — Public exports + +The engine correctly never executes tools, accesses storage, renders UI, makes HTTP requests, or trusts model-generated risk assessments (all annotated in code comments). + +--- + +## 9. Summary of Findings + +### CRITICAL / HIGH + +| # | Severity | Finding | Location | Recommendation | +|---|----------|---------|----------|---------------| +| 1 | **HIGH** | TUI imports `@agent-workbench/eval` violating declared boundaries | `apps/tui/package.json`, `PlaygroundPanel.tsx`, `ComparisonPanel.tsx` | Either update AGENTS.md to allow eval, or refactor eval panels through SDK/server | + +### MEDIUM + +| # | Severity | Finding | Location | Recommendation | +|---|----------|---------|----------|---------------| +| 2 | **MEDIUM** | AGENTS.md missing 5 apps + 5 packages from boundary documentation | `AGENTS.md` | Update to list all 5 apps and 20 packages with their boundaries | +| 3 | **MEDIUM** | `docs/02_ARCHITECTURE.md` stale — missing dashboard, mobile-web, auth, collab, eval, plugin-sdk, telemetry | `docs/02_ARCHITECTURE.md` | Regenerate architecture doc to match actual codebase | +| 4 | **MEDIUM** | `@agent-workbench/ui` is declared in AGENTS.md but has zero deps, zero exports, zero consumers | `packages/ui/`, `AGENTS.md` | Either implement shared UI primitives or remove from documentation | +| 5 | **MEDIUM** | TUI doesn't directly use `@agent-workbench/events` despite AGENTS.md allowance | `apps/tui/package.json` | Minor — events consumed through SDK is correct | + +### LOW / OBSERVATIONS + +| # | Severity | Finding | Location | Recommendation | +|---|----------|---------|----------|---------------| +| 6 | **LOW** | `packages/config` has no source files (empty shell) | `packages/config/` | Either implement or remove workspace entry | +| 7 | **LOW** | `packages/tokens` has zero dependencies (no protocol import, no storage) | `packages/tokens/package.json` | Verify this is intentional | +| 8 | **LOW** | `plugin-sdk` uses `zod: ^4.0.0` while all other packages use `^4.4.3` | `packages/plugin-sdk/package.json` | Normalize zod version | +| 9 | **LOW** | Server route for `/global/event` manually validates without `createJsonRouteHandler` | `apps/server/src/routes/global.ts` line 80-81 | Minor — deliberate choice for SSE handler | + +### POSITIVE FINDINGS (Strengths) + +| # | Finding | Evidence | +|---|---------|----------| +| ✅ | Protocol contracts are the single source of truth | Route contracts defined in protocol, consumed by SDK + Server + OpenAPI | +| ✅ | SDK validates responses (no blind casts) | `responseSchema.safeParse()` in HttpTransport | +| ✅ | SSE validates event envelopes | `EventEnvelope.safeParse()` in SseTransport | +| ✅ | OpenAPI generated from schemas | `createOpenApiDocument()` in openapi/index.ts | +| ✅ | No TUI imports from core/tools/shell/storage/permissions/models | grep confirmed zero matches | +| ✅ | Planner is lightweight (no DAG overengineering) | 2 files, 98 lines of validation logic | +| ✅ | Permission engine is stateless + deterministic | Documented design, no side effects | +| ✅ | Decision 0013 (pre-run planner) is fully implemented | PlanGate + validatePlan + permission evaluation | +| ✅ | Decision 0015 (dry-run) is partially implemented | Diff previews, shell previews, permission flow integration | +| ✅ | Server validates all requests through contract schemas | `validateRequest()` in validation.ts | +| ✅ | CoreDependencies uses clean DI pattern | No global storage imports in core | +| ✅ | DESIGN.md is consistent with implemented web UI | Color tokens, components, ARIA match mobile-web/dashboard | + +--- + +## 10. Recommendations (Priority Order) + +1. **HIGH** — Fix the TUI→eval boundary violation: either update AGENTS.md to include `@agent-workbench/eval` in TUI's allowed imports, or refactor eval panels to communicate through the SDK +2. **MEDIUM** — Update `AGENTS.md` with complete package inventory and boundaries for all 5 apps and 20 packages +3. **MEDIUM** — Update `docs/02_ARCHITECTURE.md` diagram and package model to match actual codebase +4. **MEDIUM** — Either implement `packages/ui` with shared primitives, or remove from documentation +5. **LOW** — Either implement or remove the empty `packages/config` +6. **LOW** — Normalize zod version in `packages/plugin-sdk` + +--- + +*Report generated by Hermes Agent. 44 source files examined across apps, packages, docs, and decisions.* diff --git a/.ai/master-audit-report.md b/.ai/master-audit-report.md new file mode 100644 index 0000000..3e33930 --- /dev/null +++ b/.ai/master-audit-report.md @@ -0,0 +1,163 @@ +# 🔍 agent-workbench — Comprehensive Multi-Perspective Audit + +**Date:** 2026-07-03 +**Methodology:** Mixture of Agents (3 parallel subagents) +**GitHub:** [MerverliPy/agent-workbench](https://github.com/MerverliPy/agent-workbench) +**Local Path:** `/home/calvin/agent-workbench` + +--- + +## Executive Summary + +| Dimension | Grade | Verdict | +|-----------|-------|---------| +| 🛡️ **Security & Dependencies** | 🟢 **A-** | Good posture, no HIGH findings. Fixed CVE-2026-39356. MEDIUM gaps in Dependabot coverage and CODEOWNERS. | +| 🏗️ **Architecture & Design Integrity** | 🟡 **B** | Strong protocol adherence but a HIGH boundary violation (TUI→eval) and stale docs missing 5 apps + 5 packages. | +| 📊 **Code Quality & Maintainability** | 🟡 **B+** | Excellent test infrastructure but broken pre-commit, stale doc references, Dockerfile bitrot. | + +**Overall: B+ (good with actionable gaps)** — 6 HIGHs, 11 MEDIUMs, 6 LOWs. No critical security vulnerabilities. The repo is actively developed and well-structured; the issues found are largely documentation drift and configuration gaps from rapid iteration. + +--- + +## CROSS-CUTTING FINDINGS + +These findings appear in multiple audit perspectives: + +| # | Issue | Affects | Severity | +|---|-------|---------|----------| +| C1 | **Stale `AGENTS.md`** — missing 5 apps (cli, dashboard, mobile-web) + 5 packages (auth, collab, eval, telemetry, plugin-sdk) | Architecture doc drift, unclear boundaries for new contributors | **HIGH** | +| C2 | **Stale `docs/02_ARCHITECTURE.md`** — same missing apps/packages, dead diagram | Architecture doc drift | **MEDIUM** | +| C3 | **`repo-health.yml` uses npm** on a Bun project — will fail | CI reliability | **MEDIUM** | +| C4 | **`actions/checkout` version drift** — `@v4` in 4 workflows vs `@v7` in CI | CI consistency | **MEDIUM** | +| C5 | **`scripts/build-all.sh` missing packages** — no `eval`, `auth`, `collab`, `config`, `ui`, `telemetry`, `plugin-sdk` | Build reliability | **HIGH** | + +--- + +## 🔴 HIGH SEVERITY FINDINGS (6 total) + +| # | Finding | Category | File(s) | Recommendation | +|---|---------|----------|---------|---------------| +| H1 | **TUI imports `@agent-workbench/eval`** violating declared AGENTS.md boundary. TUI should only import sdk/protocol/events/ui per docs, but `PlaygroundPanel.tsx` and `ComparisonPanel.tsx` import directly from eval. | Architecture | `apps/tui/package.json`, `apps/tui/src/components/panels/PlaygroundPanel.tsx`, `ComparisonPanel.tsx` | Either update AGENTS.md to allow eval in TUI, or refactor eval panels to communicate through the SDK/server | +| H2 | **Lint-staged pre-commit hook broken** — `bun run typecheck --noEmit` configured in `lint-staged` but no `typecheck` script exists at root level | Code Quality | `package.json` lines 56-62 | Add `"typecheck"` script to root `package.json` or restructure pre-commit hook | +| H3 | **Stale test counts** — README.md and CONTRIBUTING.md reference "523 tests" in 5 places (badge says 602) | Documentation | `README.md` lines 257, 322; `CONTRIBUTING.md` lines 118, 136 | Update all stale "523" → "602" references | +| H4 | **CHANGELOG stale** — Missing Phase 29.4 (prompt library + ModelComparer), 29.5 (TUI playground + comparison panels), CVE fix, mobile command center, DESIGN.md additions | Documentation | `CHANGELOG.md` | Add [Phase 29.4], [Phase 29.5] entries + CVE fix | +| H5 | **Dockerfile missing 7 packages** — telemetry, plugin-sdk, auth, collab, eval, config, ui not in build chain. Docker build will fail. | Build/Deploy | `Dockerfile` | Replace hardcoded list with `RUN bash scripts/build-all.sh` or update to include all packages | +| H6 | **`scripts/build-all.sh` missing `eval` package** — 4 test files exist but package never built. Also missing: auth, collab, config, ui, telemetry, plugin-sdk | Build | `scripts/build-all.sh` | Add `eval` (and other missing packages) to the build chain | + +--- + +## 🟡 MEDIUM SEVERITY FINDINGS (11 total) + +| # | Finding | Category | File(s) | Recommendation | +|---|---------|----------|---------|---------------| +| M1 | **Dependabot only scans root `package.json`** — 25+ workspace package.json files never scanned for vulnerabilities | Security | `.github/dependabot.yml` | Add per-workspace npm entries or use Bun audit in CI | +| M2 | **CODEOWNERS references non-existent paths** — `src/auth/*` and `src/security/*` don't exist; actual paths are `packages/auth/` and `packages/permissions/` | Security | `.github/CODEOWNERS` | Fix paths to actual package locations | +| M3 | **`bun audit` reports 3 advisories** — esbuild (MODERATE, dev server forgery), opentelemetry (MODERATE, unbounded memory), babel (LOW, file read) | Dependencies | `bun.lock` (transitive) | Run `bun update` to pick up patched versions | +| M4 | **Biome has no security rules** — `suspicious/noExplicitAny` and `complexity/noBannedTypes` explicitly skipped; no security-specific linting | Code Quality | `biome.json`, `.github/workflows/ci.yml` | Audit and re-enable skipped rules; consider ESLint overlay for security rules | +| M5 | **AGENTS.md incomplete** — missing 5 apps (cli, dashboard, mobile-web) + 5 packages (auth, collab, eval, telemetry, plugin-sdk, config) from boundary documentation | Architecture | `AGENTS.md` | Update to list all 5 apps and 20 packages | +| M6 | **`docs/02_ARCHITECTURE.md` stale** — diagram and package model missing recent additions | Architecture | `docs/02_ARCHITECTURE.md` | Regenerate to match actual codebase | +| M7 | **`packages/ui` is a dead package** — declared in docs but has zero deps, zero exports, zero consumers | Architecture | `packages/ui/` | Implement shared primitives or remove from doc | +| M8 | **`packages/config` has no source files** — empty workspace shell | Architecture | `packages/config/` | Implement or remove | +| M9 | **5 test files live outside `tests/` directory** — not covered by `cd tests && bun test` command | Testing | `packages/eval/src/__tests__/*`, `apps/cli/templates/bun/src/hello.test.ts` | Move into `tests/` or update test command | +| M10 | **`.dockerignore` is thin** — missing `.git/`, `docs/`, `tests/`, `benchmarks/`, `tools/`, `decisions/`, `*.md` | Build/Deploy | `.dockerignore` | Add common exclusions for faster builds | +| M11 | **CI cache disabled** — `setup-bun` has `no-cache: false` meaning dependencies reinstalled every run | CI | `.github/workflows/ci.yml` | Enable bun caching by removing `no-cache: false` | + +--- + +## 🟢 LOW SEVERITY FINDINGS (6 total) + +| # | Finding | Category | Recommendation | +|---|---------|----------|---------------| +| L1 | **No local pre-commit secret scanning** — `ai-safety.yml` scans on push but nothing catches secrets before commit | Security | Add lightweight `pre-commit` grep for API key patterns | +| L2 | **SECURITY.md marks CI as "out of scope"** for disclosure policy | Security | Consider acknowledging CI as in-scope | +| L3 | **`opencode.yml` grants broad write permissions** (contents/pull-requests/issues: write) | Security | Restrict to minimum needed when implementation is filled in | +| L4 | **`packages/plugin-sdk` uses zod `^4.0.0`** while rest of repo uses `^4.4.3` | Consistency | Normalize zod version | +| L5 | **README phase status says "Phase 29 next"** but it's actively in development | Documentation | Update to reflect current phase | +| L6 | **VERIFICATION.md baseline says "323 tests"** (Phase 15 era) | Documentation | Update to current test count | + +--- + +## ✅ STRENGTHS & POSITIVE FINDINGS + +### Security +- 🔒 **CVE-2026-39356 (drizzle-orm)**: Fixed to 0.45.2 with overrides across all workspaces +- 🔒 **No secrets in git history**: Only `.env.example` ever committed +- 🔒 **No live `.env` files**: Properly gitignored +- 🔒 **SECURITY.md**: Clear 48h/90-day disclosure policy +- 🔒 **Security model docs**: Thorough threat models in `docs/` (05/06) +- 🔒 **ai-safety.yml**: Excellent secret + destructive-pattern scanning on every push +- 🔒 **codeql.yml**: Weekly JS/TS + Python analysis +- 🔒 **Permission model**: read=allow, edit/bash=ask, destructive=deny — excellent defaults + +### Architecture +- ✅ **Protocol contracts = single source of truth**: Route contracts defined in protocol, consumed by SDK + Server + OpenAPI +- ✅ **SDK validates responses**, not blind casts — `safeParse()` everywhere +- ✅ **SSE validates event envelopes** — malformed events never silently swallowed +- ✅ **No TUI imports from core/tools/shell/storage/permissions/models** (except eval, see H1) +- ✅ **OpenAPI generated from Zod schemas** — 17 route contracts registered +- ✅ **Permission engine**: Stateless, deterministic, no side effects per design +- ✅ **Decision 0013 (pre-run planner)**: Fully implemented with PlanGate +- ✅ **Decision 0015 (dry-run)**: Partially implemented with diff previews + shell previews +- ✅ **CoreDependencies**: Clean DI pattern, no global storage imports + +### Code Quality +- ✅ **Excellent test infrastructure**: 45 test files (unit/integration/e2e), VERIFICATION.md with 13 intentional-break mutation tests +- ✅ **test-health.sh**: 5 static checks for boundary enforcement +- ✅ **test-repeat.sh**: Determinism validation (3 runs default) +- ✅ **TypeScript strict mode**: `strict: true`, `noUncheckedIndexedAccess`, `exactOptionalPropertyTypes` +- ✅ **Comprehensive CI**: 4-job pipeline (static → typecheck → test matrix → e2e) + cron +- ✅ **Active Dependabot**: Package + GitHub Actions updates +- ✅ **Biome linting + Husky pre-commit hooks** +- ✅ **Well-structured monorepo**: Clean package boundaries, consistent naming + +--- + +## 🔷 PRIORITIZED ACTION PLAN + +### 🚨 Immediate (First Sprint) +| # | Effort | Action | Repo | +|---|--------|--------|------| +| 1 | 2 min | Fix CODEOWNERS paths (`src/auth/*` → `packages/auth/*`) | Security | +| 2 | 5 min | Update stale test counts (README, CONTRIBUTING: 523→602) | Docs | +| 3 | 5 min | Update CHANGELOG with Phase 29.4/29.5, CVE fix | Docs | +| 4 | 10 min | Fix lint-staged — add `typecheck` script to root `package.json` | Build | +| 5 | 10 min | Fix Dockerfile — replace hardcoded list with `scripts/build-all.sh` | Build | +| 6 | 15 min | Expand Dependabot to cover workspace packages | CI | + +### 📋 Second Sprint +| # | Effort | Action | Repo | +|---|--------|--------|------| +| 7 | 5 min | Run `bun update` — fix esbuild + opentelemetry advisories | Dependencies | +| 8 | 15 min | Update `AGENTS.md` — add all 5 apps + 20 packages with boundaries | Architecture | +| 9 | 15 min | Update `docs/02_ARCHITECTURE.md` — regenerate diagram | Architecture | +| 10 | 15 min | Update `scripts/build-all.sh` — add missing packages | Build | +| 11 | 15 min | Fix `repo-health.yml` — use bun not npm | CI | +| 12 | 15 min | Normalize `actions/checkout@v4` → `@v7` across workflows | CI | + +### 🧹 Third Sprint +| # | Effort | Action | Repo | +|---|--------|--------|------| +| 13 | 30 min | Decide: refactor TUI-eval or update AGENTS.md | Architecture | +| 14 | 30 min | Add security Biome rules or ESLint overlay | Code Quality | +| 15 | 15 min | Expand `.dockerignore` | Build | +| 16 | 10 min | Enable bun caching in CI | CI | +| 17 | 15 min | Move eval tests into `tests/` directory | Testing | +| 18 | 10 min | Either implement `packages/ui` or remove from docs | Architecture | +| 19 | 10 min | Clean up stale branches | Git | + +--- + +## METHODOLOGY + +- **3 parallel subagents** — each had full repo context and independently examined all source files, docs, configs, and CI workflows +- **Total files examined**: 44 (arch), ~30 (security), ~50 (code quality) +- **Verification steps**: grep for cross-package imports, `bun audit`, git log analysis, file count comparisons, workflow YAML parsing +- **All reports saved to disk**: + - `.ai/master-audit-report.md` ← this file (consolidated) + - `.ai/architecture-audit-report.md` (architecture deep-dive) + - `AUDIT_REPORT.md` (in repo root — code quality) + - `AUDIT_REPORT.md` (in workspace — security) + +--- + +*Generated by Hermes Agent — Mixture of Agents audit. 3 specialists, 246s total runtime.* diff --git a/.dockerignore b/.dockerignore index 437cb22..35ee446 100644 --- a/.dockerignore +++ b/.dockerignore @@ -12,6 +12,24 @@ coverage/ # Docker .dockerignore +# Git +.git/ +.github/ +.gitignore + +# Docs — not needed at runtime +docs/ +decisions/ +*.md + +# Tests — not needed at runtime +tests/ +benchmarks/ +tools/ +.ai/ +.opencode/ +.husky/ + # Local environment .env .env.* @@ -35,3 +53,18 @@ coverage/ model-router-v3.3-repo-ready.zip tools/model-router-v3.3-repo-ready/archives/ tools/model-router-v3.3-repo-ready/reports/ + +# Git files +.git/ + +# Documentation and development artifacts +docs/ +tests/ +benchmarks/ +tools/ +decisions/ +*.md + +# Audit and analysis artifacts +.ai/ +AUDIT_REPORT.md diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 85bfcc6..0202333 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -16,6 +16,7 @@ scripts/* @MerverliPy *.env.example @MerverliPy config/* @MerverliPy -# Security/auth-sensitive source zones -src/auth/* @MerverliPy -src/security/* @MerverliPy +# Security/auth-sensitive packages +packages/auth/* @MerverliPy +packages/permissions/* @MerverliPy +packages/storage/* @MerverliPy diff --git a/.github/dependabot.yml b/.github/dependabot.yml index fe62ae1..7ba71d5 100644 --- a/.github/dependabot.yml +++ b/.github/dependabot.yml @@ -41,6 +41,108 @@ updates: prefix: "chore(deps)" prefix-development: "chore(deps-dev)" + # Workspace packages with external dependencies + - package-ecosystem: "npm" + directory: "/tests" + schedule: + interval: "weekly" + day: "monday" + time: "09:00" + groups: + test-deps: + patterns: + - "*" + open-pull-requests-limit: 5 + labels: + - "dependencies" + - "tests" + + - package-ecosystem: "npm" + directory: "/apps/server" + schedule: + interval: "weekly" + day: "monday" + time: "09:00" + groups: + server-deps: + patterns: + - "*" + open-pull-requests-limit: 5 + labels: + - "dependencies" + - "server" + + - package-ecosystem: "npm" + directory: "/apps/dashboard" + schedule: + interval: "weekly" + day: "monday" + time: "09:00" + groups: + dashboard-deps: + patterns: + - "*" + open-pull-requests-limit: 5 + labels: + - "dependencies" + - "dashboard" + + - package-ecosystem: "npm" + directory: "/apps/mobile-web" + schedule: + interval: "weekly" + day: "monday" + time: "09:00" + groups: + mobile-web-deps: + patterns: + - "*" + open-pull-requests-limit: 5 + labels: + - "dependencies" + - "mobile-web" + + - package-ecosystem: "npm" + directory: "/apps/tui" + schedule: + interval: "weekly" + day: "monday" + time: "09:00" + groups: + tui-deps: + patterns: + - "*" + open-pull-requests-limit: 5 + labels: + - "dependencies" + - "tui" + + - package-ecosystem: "npm" + directories: + - "/packages/storage" + - "/packages/protocol" + - "/packages/sdk" + - "/packages/tools" + - "/packages/eval" + - "/packages/auth" + - "/packages/diff" + - "/packages/plugin-sdk" + - "/packages/collab" + - "/packages/core" + - "/packages/permissions" + schedule: + interval: "weekly" + day: "monday" + time: "09:30" + groups: + packages-deps: + patterns: + - "*" + open-pull-requests-limit: 10 + labels: + - "dependencies" + - "packages" + - package-ecosystem: "pip" directory: "/" schedule: diff --git a/.github/workflows/ai-safety.yml b/.github/workflows/ai-safety.yml index da63acc..4e006cf 100644 --- a/.github/workflows/ai-safety.yml +++ b/.github/workflows/ai-safety.yml @@ -12,7 +12,7 @@ jobs: safety: runs-on: ubuntu-latest steps: - - uses: actions/checkout@v4 + - uses: actions/checkout@v7 - name: Check for obvious secret patterns run: | diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 9bfdabd..8a9dfeb 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -26,7 +26,6 @@ jobs: - uses: oven-sh/setup-bun@v2 with: bun-version: ${{ env.BUN_VERSION }} - no-cache: false - run: bun install --frozen-lockfile - name: Run test-health run: bash scripts/test-health.sh @@ -45,7 +44,6 @@ jobs: - uses: oven-sh/setup-bun@v2 with: bun-version: ${{ env.BUN_VERSION }} - no-cache: false - run: bun install --frozen-lockfile - name: Build workspace packages run: bash scripts/build-all.sh @@ -81,7 +79,6 @@ jobs: - uses: oven-sh/setup-bun@v2 with: bun-version: ${{ env.BUN_VERSION }} - no-cache: false - run: bun install --frozen-lockfile - name: Build workspace packages run: bash scripts/build-all.sh @@ -109,9 +106,8 @@ jobs: - uses: oven-sh/setup-bun@v2 with: bun-version: ${{ env.BUN_VERSION }} - no-cache: false - run: bun install --frozen-lockfile - name: Build workspace packages run: bash scripts/build-all.sh - name: Run e2e tests - run: bun run test:e2e + run: bun run test:e2e \ No newline at end of file diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml index d100c5a..cfefe6b 100644 --- a/.github/workflows/codeql.yml +++ b/.github/workflows/codeql.yml @@ -26,7 +26,7 @@ jobs: steps: - name: Checkout - uses: actions/checkout@v4 + uses: actions/checkout@v7 - name: Initialize CodeQL uses: github/codeql-action/init@v3 diff --git a/.github/workflows/opencode.yml b/.github/workflows/opencode.yml index f417889..097de7a 100644 --- a/.github/workflows/opencode.yml +++ b/.github/workflows/opencode.yml @@ -19,7 +19,7 @@ jobs: steps: - name: Checkout - uses: actions/checkout@v4 + uses: actions/checkout@v7 - name: Show request run: | diff --git a/.github/workflows/repo-health.yml b/.github/workflows/repo-health.yml index 8cdd64c..b72ff3d 100644 --- a/.github/workflows/repo-health.yml +++ b/.github/workflows/repo-health.yml @@ -13,20 +13,19 @@ jobs: runs-on: ubuntu-latest steps: - - uses: actions/checkout@v4 + - uses: actions/checkout@v7 - name: Repo inventory run: | - find . -maxdepth 3 -type f | sort | sed 's#^\./##' | head -300 + find . -maxdepth 3 -type f | sort | sed 's#^\\./##' | head -300 - - name: Node checks - if: hashFiles('package.json') != '' + - name: Bun checks + if: hashFiles('bun.lock') != '' run: | - npm ci || npm install - npm run lint --if-present - npm run typecheck --if-present - npm test --if-present - npm run build --if-present + bun install --frozen-lockfile + bun run typecheck --if-present + bun test --if-present + bun run build --if-present - name: Python checks if: hashFiles('requirements.txt', 'pyproject.toml') != '' diff --git a/AGENTS.md b/AGENTS.md index 1658310..ec79fe8 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -24,6 +24,9 @@ Architecture Boundaries * apps/tui: thin client only. * apps/server: local HTTP/SSE control plane. +* apps/cli: CLI entrypoint, plugin lifecycle management, project scaffolding. +* apps/dashboard: observability dashboard (SolidJS + Tailwind, connects via SDK). +* apps/mobile-web: mobile web companion (SolidJS + Tailwind + PWA, connects via SDK). * packages/core: agent runtime orchestration. * packages/protocol: Zod schemas, route contracts, shared protocol types. * packages/sdk: typed client for protocol/API/SSE. @@ -37,8 +40,15 @@ Architecture Boundaries * packages/cache: read/grep/glob cache. * packages/planner: execution planning before mutation. * packages/ui: shared UI primitives only. - -TUI may import packages/sdk, packages/protocol, packages/events, and packages/ui. +* packages/models: provider model definitions, smart routing, provider registry. +* packages/auth: bearer token auth, TLS certificate generation, session tokens. +* packages/collab: session sharing, review queue, presence management. +* packages/eval: model evaluation framework, prompt library, comparison runner. +* packages/telemetry: OpenTelemetry tracing, Prometheus metrics, error reporting. +* packages/plugin-sdk: plugin extension interfaces (tool, provider, hook, panel). +* packages/config: layered config loading from env, files, and CLI args. + +TUI may import packages/sdk, packages/protocol, packages/events, packages/ui, and packages/eval. TUI must not import runtime authority packages directly: core, tools, shell, storage, permissions/internal, or models/internal. diff --git a/AUDIT_REPORT.md b/AUDIT_REPORT.md new file mode 100644 index 0000000..12704cd --- /dev/null +++ b/AUDIT_REPORT.md @@ -0,0 +1,330 @@ +# Code Quality & Maintainability Audit Report +**Repository:** agent-workbench (MerverliPy/agent-workbench) +**Date:** 2026-07-03 +**Audit Scope:** Test health, documentation staleness, CI pipeline integrity, code standards, overall maintainability + +--- + +## EXECUTIVE SUMMARY + +This is a **well-maintained, actively developed** TypeScript monorepo with strong test infrastructure, comprehensive CI/CD, and thorough documentation. However, several **stale doc references**, a **broken lint-staged pre-commit hook**, and **Dockerfile bitrot** need attention. Overall maintainability score: **B+** (good with fixable issues). + +| Area | Score | Key Issues | +|------|-------|------------| +| Test Health | 🟢 A | Well-organized, VERIFICATION.md is excellent, test-health.sh strong | +| CI Pipeline | 🟡 B+ | Well-designed but repo-health.yml stale, actions/checkout version drift | +| Documentation | 🟡 B- | CHANGELOG stale (Phase 29.4/29.5 missing), stale test counts in docs | +| Code Standards | 🟡 B+ | Strong TypeScript strictness, Biome linting, but lint-staged pre-commit broken | +| Docker/Deploy | 🔴 C | Dockerfile missing 7 packages, .dockerignore thin | +| Overall | 🟡 B+ | 13 findings: 4 HIGH, 6 MEDIUM, 3 LOW | + +--- + +## FINDINGS + +### FINDING 1 — HIGH 🔴: Root-level `typecheck` script missing (breaks lint-staged) + +**File:** `/home/calvin/agent-workbench/package.json`, lines 56-62 + +The `lint-staged` config runs `bun run typecheck --noEmit` on `*.{ts,tsx}` files (line 58), but there is no `"typecheck"` script defined at the monorepo root level in `package.json` scripts (line 12-25). Each package/app has `typecheck` in its own `package.json`, but lint-staged runs from the root, so this will fail with `error: missing script "typecheck"`. + +**Evidence:** +- Root scripts: `['phase', 'validate', 'build', 'test', 'test:unit', 'test:integration', 'test:e2e', 'test:repeat', 'test:health', 'coverage', 'prepare', 'postinstall']` +- No `typecheck` entry exists + +**Recommendation:** Add `"typecheck": "echo 'Run typecheck per package (see lint-staged)'"` to root `package.json`, or configure lint-staged to run per-package typecheck commands instead. + +--- + +### FINDING 2 — HIGH 🔴: Stale test counts in documentation + +**Files:** `README.md` (lines 257, 322), `CONTRIBUTING.md` (lines 118, 136) + +The README claims **602 tests** in the badge and header (lines 10, 19), but the Implementation Status section (line 257) and Verification Commands section (line 322) still say **523 tests**. CONTRIBUTING.md also references **523 tests** in both its "Making Changes" and "Testing" sections. + +**Evidence:** +- README line 257: `- ✅ **Automated testing** — 523 tests (unit, integration, e2e)` +- README line 322: `bun test # 523 tests, 0 failures, 1495 expect() calls` +- CONTRIBUTING.md line 118: `Ensure all 523 tests pass` +- CONTRIBUTING.md line 136: `# Full test suite (523 tests, 0 failures)` + +**Recommendation:** Update all stale "523" references to the current test count throughout both files. + +--- + +### FINDING 3 — HIGH 🔴: CHANGELOG missing Phase 29.4 and 29.5 commits (and recent fixes) + +**File:** `CHANGELOG.md` (lines 9-20) + +The CHANGELOG's latest entry is **Phase 29 (2026-07-02)**, but the following commits on **2026-07-03** are not logged: + +| Commit | Description | +|--------|-------------| +| `a722f34` | `feat(phase-29.5): add TUI playground + comparison panels` | +| `38166c8` | `feat(phase-29.4): implement prompt library + ModelComparer` | +| `8c6bf86` | `fix(web-ui): functional fixes + new DESIGN.md spec` | +| `0df80bb` | `docs: add DESIGN.md — design system spec` | +| `7357377` | `fix: bump drizzle-orm to 0.45.2 (CVE-2026-39356)` | +| `78f3aaa` | `Add AI mobile command center integrations` | +| `a02b286` | `fix(ci): Biome lint fixes — import sort, ChatView unused import` | +| Various | Typecheck fixes, Biome config fixes, session-runner fixes | + +**Recommendation:** Add [Phase 29.4] and [Phase 29.5] entries (or a single updated Phase 29) covering prompt library, ModelComparer, TUI playground, comparison panels, and the CVE fix. Consider adding a [Unreleased] section per Keep a Changelog convention. + +--- + +### FINDING 4 — HIGH 🔴: Dockerfile missing 7 packages + +**File:** `Dockerfile` (lines 9-22) + +The Dockerfile hardcodes a list of 12 packages to build sequentially, but the following packages that exist in the workspace are **not included**: + +- `telemetry` (`packages/telemetry`) +- `plugin-sdk` (`packages/plugin-sdk`) +- `auth` (`packages/auth`) +- `collab` (`packages/collab`) +- `eval` (`packages/eval`) +- `config` (`packages/config`) +- `ui` (`packages/ui`) + +**Impact:** Any Docker build will fail to resolve these packages' imports. The server may crash at startup if any of these are imported. + +**Recommendation:** Replace the hardcoded `RUN cd ... && bun run build` chain with a single `RUN bash scripts/build-all.sh` call (which already handles dependency ordering), or update the list to include all current packages. + +--- + +### FINDING 5 — MEDIUM 🟡: `repo-health.yml` workflow is stale and uses npm instead of bun + +**File:** `.github/workflows/repo-health.yml` (lines 23-29) + +The Node checks section (triggered by `hashFiles('package.json')`) runs: +```yaml +- run: npm ci || npm install +- run: npm run lint --if-present +- run: npm run typecheck --if-present +- run: npm test --if-present +- run: npm run build --if-present +``` + +This is a **Bun project** — `npm ci || npm install` is wrong and will likely fail or install wrong dependencies. The project has no `package-lock.json`, only `bun.lockb`. + +**Recommendation:** Replace with `bun install --frozen-lockfile` and `bun run` equivalents. Also update `actions/checkout@v4` to `@v7` (matching the main CI workflow). + +--- + +### FINDING 6 — MEDIUM 🟡: `actions/checkout` version mismatch across workflows + +| Workflow | Checkout Version | +|----------|-----------------| +| `ci.yml` | `@v7` | +| `repo-health.yml` | `@v4` | +| `stale.yml` | (uses `actions/stale@v9`, no checkout) | +| `codeql.yml` | `@v4` | +| `ai-safety.yml` | `@v4` | +| `opencode.yml` | `@v4` | + +`@v4` is several major versions behind `@v7`. While this may not break immediately, the older version lacks recent fixes and performance improvements. + +**Recommendation:** Normalize all workflows to use `actions/checkout@v7` (or `@v5` minimum). + +--- + +### FINDING 7 — MEDIUM 🟡: 5 test files live outside `tests/` directory (not run via `cd tests && bun test`) + +**Files:** +- `packages/eval/src/__tests__/export.test.ts` +- `packages/eval/src/__tests__/metrics.test.ts` +- `packages/eval/src/__tests__/playground.test.ts` +- `packages/eval/src/__tests__/promptfoo.test.ts` +- `apps/cli/templates/bun/src/hello.test.ts` + +The 45 test files in `tests/` are the "official" test suite, but the 4 eval tests and 1 template test are outside this directory. The `package.json` has `"test": "cd tests && bun test"` which **excludes** these 5 files. Total repo `.test.ts` count: 50 files. + +**Impact:** `bun test` (from root) may include these if bun automatically discovers them, but the explicit `cd tests && bun test` might not. This is inconsistent. Additionally, the `build-all.sh` script does NOT build the `eval` package. + +**Recommendation:** Either move these tests into `tests/` or update the test command to include them. Also add `eval` to `scripts/build-all.sh` since it has `typecheck` and `build` scripts. + +--- + +### FINDING 8 — MEDIUM 🟡: `.dockerignore` is thin — missing common exclusions + +**File:** `.dockerignore` (37 lines) + +Missing exclusions that should be included: +- `.git/` (sends entire git history to Docker daemon) +- `.github/` (CI configs not needed at runtime) +- `docs/` (planning docs not needed at runtime) +- `decisions/` (ADRs not needed at runtime) +- `tests/` (not needed at runtime) +- `benchmarks/` (not needed at runtime) +- `tools/` (not needed at runtime) +- `scripts/` (partially, build scripts not needed) +- `changelog.md` and `readme.md` (not needed at runtime) +- `DESIGN.md`, `CONTRIBUTING.md`, `AGENTS.md` etc. + +**Impact:** Larger Docker build context = slower builds. Current context likely includes 100s of files that contribute nothing to the server image. + +**Recommendation:** Add `.git/`, `docs/`, `tests/`, `benchmarks/`, `tools/`, `decisions/`, `*.md` to `.dockerignore`. + +--- + +### FINDING 9 — MEDIUM 🟡: Duplicate filenames across packages (natural but worth noting) + +The pygount analysis reported 17 duplicate files. The filename frequency analysis reveals: + +| Filename | Occurrences | Notes | +|----------|-------------|-------| +| `index.ts` | 35 | Per-package barrel export — expected | +| `README.md` | 32 | Per-package readme — expected | +| `package.json` | 29 | Per-package config — expected | +| `tsconfig.json` | 28 | Per-package TypeScript config — expected | +| `types.ts` | 11 | Per-package type definitions | +| `config.ts` | 4 | Several packages have config modules | +| `errors.ts` | 3 | Reused error module pattern | + +The duplicates are **structural rather than accidental** — each package follows a `index.ts` + `types.ts` + `config.ts` pattern. This is natural for monorepos. However, the `types.ts` duplication across 11 packages could benefit from centralization. + +**Recommendation:** Low priority. Consider consolidating shared types into `packages/protocol` where schema-first design already covers most cross-package shapes. + +--- + +### FINDING 10 — LOW 🟢: DESIGN.md vs actual web UI — no automated verification + +**File:** `DESIGN.md` (141 lines) + +DESIGN.md was added on 2026-07-03 and is a well-structured design system spec covering colors, typography, spacing, motion, components, interaction states, and accessibility. It documents the intended design system for `mobile-web` and `dashboard` apps. + +**Assessment:** The spec looks thorough and internally consistent, but there is **no automated test or visual regression check** to verify the actual UI matches the spec. The "anti-patterns" section (line 126-131) lists hard rules like "No box-shadows for depth" and "No gradients anywhere" that could be checked programmatically but aren't. + +**Recommendation:** Consider adding a minimal DOM/CSS audit script that checks the spec's hard constraints (e.g., no box-shadow in card CSS, no gradient in computed styles). This is optional given the project's current stage. + +--- + +### FINDING 11 — LOW 🟢: README says "Phases 0–27 complete" but Phase 29 is actively developed + +**File:** `README.md` (line 19) + +``` +> **Status:** Phases 0–27 complete · 602 tests, 0 failures · Phase 29 (model eval) next +``` + +But Phase 29 has been actively committed (Phase 29.0 through 29.5) and Phase 28 is marked as "No active development — deferred to future phase" in the CHANGELOG. This status message is slightly misleading — Phase 29 is not "next", it's actively in development. + +**Recommendation:** Update to: `Phases 0–27 complete · Phase 29 in progress (29.0–29.5)` or `Phases 0–29 in development · 602 tests, 0 failures`. + +--- + +### FINDING 12 — LOW 🟢: Branch cleanup needed + +Three stale local branches exist alongside main: +- `ai-mobile-command-center-integrations` (also on origin) +- `fix/drizzle-orm-cve` (also on origin) +- `test` + +**Recommendation:** Delete merged/obsolete branches. The Drizzle CVE fix has been merged into main. + +--- + +### FINDING 13 — INFO ℹ️: Excellent test infrastructure worth preserving + +**Files:** `tests/VERIFICATION.md` (255 lines), `scripts/test-health.sh` (128 lines), `scripts/test-repeat.sh` + +The test infrastructure is a **standout feature** of this repo: +1. **VERIFICATION.md** contains 13 intentional-break mutation tests with step-by-step instructions covering model faults, tool faults, abort handling, SDK error mapping, API validation, protocol contracts, permission engine, planner, diff preview, token budgets, path safety, and shell deny +2. **test-health.sh** performs 5 static checks: server import boundary, no network call patterns, no secrets in fixtures, TUI boundary test existence, restricted import checking +3. **test-repeat.sh** runs the suite N times (default 3) for determinism validation + +**Recommendation:** Preserve and expand. Consider automating the intentional-break checklist as an opt-in script (noted as future work in VERIFICATION.md line 254). + +--- + +## TEST HEALTH DETAIL + +| Metric | Value | +|--------|-------| +| Test files (in `tests/`) | 45 | +| Test files (total repo) | 50 | +| Claimed test count | 602 | +| Test files with `expect()` | 45 | +| Unit tests | 23 files (models, permissions, planner, plugin-sdk, protocol, telemetry, tokens) | +| Integration tests | 15 files (core, diff, faults, sdk, security, server, shell, storage) | +| E2E tests | 7 files (boundary, fullstack, security, contracts, health, lifecycle, streaming) | +| Test helpers | 3+ (test-db, test-server, mock-model) | +| Test verification | ✅ VERIFICATION.md with 13 intentional-break mutations | +| Test health checks | ✅ test-health.sh (5 static checks) | +| Test repeatability | ✅ test-repeat.sh (default 3 runs) | + +**Verdict: 🟢 Excellent.** The test suite is well-structured, well-documented, and includes redundancy via intentional-break tests. The only concern is the 5 test files outside `tests/` directory and stale test count references. + +--- + +## CI HEALTH DETAIL + +| Pipeline | Status | Notes | +|----------|--------|-------| +| `ci.yml` | 🟢 Green | 4 jobs: static-check → typecheck → test (matrix) → e2e. Daily cron. Coverage upload. | +| `codeql.yml` | 🟢 Green | Weekly scan for JS/TS and Python. | +| `ai-safety.yml` | 🟢 Green | Secret detection + destructive pattern checks. | +| `stale.yml` | 🟢 Green | Daily run, marks issues/PRs after 60/30 days, closes after 7. | +| `repo-health.yml` | 🔴 Stale | Uses npm instead of bun. Will likely fail. | +| `release-drafter.yml` | 🟢 Configured | Auto-generates release notes from PR labels. | +| `opencode.yml` | 🟢 Placeholder | `/opencode` comment trigger, no write behavior yet (correctly). | + +**CI Gaps:** +- `repo-health.yml` uses `npm` for Bun project — **will fail on next run** +- `actions/checkout` versions not normalized (`@v4` vs `@v7`) +- No cache keys in CI (no `setup-bun` cache config despite `no-cache: false`) +- `no-cache: false` in setup-bun means **cache is disabled** — every CI run reinstalls from scratch + +--- + +## DOCUMENTATION STALENESS SUMMARY + +| Document | Status | Key Issues | +|----------|--------|------------| +| `README.md` | 🟡 Stale | 3 stale "523 tests" references; phase status says "Phase 29 next" but it's in progress | +| `CHANGELOG.md` | 🔴 Stale | Missing Phase 29.4, 29.5, CVE fix, mobile command center, DESIGN.md addition | +| `CONTRIBUTING.md` | 🟡 Stale | "523 tests" in 3 places; says "0–26 complete" (should be 0–29) | +| `DESIGN.md` | 🟢 Fresh | Added 2026-07-03, well-structured | +| `VERIFICATION.md` | 🟡 Stale | Baseline says "323 tests" (Phase 15), not updated for Phase 26-29 additions | +| `scripts/build-all.sh` | 🟡 Stale | Missing `auth`, `collab`, `eval`, `config`, `ui` packages | + +--- + +## RECOMMENDATIONS (Priority-Ordered) + +### Must Fix (HIGH) +1. **Fix lint-staged typecheck** — Add `"typecheck"` script to root `package.json` or restructure pre-commit hook +2. **Update all stale test counts** — `README.md` (2 places), `CONTRIBUTING.md` (2 places): 523 → 602 +3. **Update CHANGELOG.md** — Add Phase 29.4 (prompt library + ModelComparer), Phase 29.5 (TUI playground + comparison panels), CVE fix, mobile command center, DESIGN.md +4. **Fix Dockerfile** — Add missing packages (`telemetry`, `plugin-sdk`, `auth`, `collab`, `eval`, `config`, `ui`) or switch to `scripts/build-all.sh` + +### Should Fix (MEDIUM) +5. **Fix `repo-health.yml`** — Replace `npm` with `bun` commands, update `checkout@v4` to `@v7` +6. **Normalize `actions/checkout` versions** across all 6 workflows to `@v7` +7. **Add `eval` to `scripts/build-all.sh`** — Currently not built in the dependency chain +8. **Expand `.dockerignore`** — Add `.git/`, `docs/`, `tests/`, `benchmarks/`, `tools/`, `decisions/`, `*.md` +9. **Enable bun caching in CI** — Set `no-cache: true`/remove `no-cache: false` from `setup-bun` step + +### Nice to Have (LOW) +10. **Update README phase status** — Reflect Phase 29 as "in progress" not "next" +11. **Clean up stale branches** — `ai-mobile-command-center-integrations`, `fix/drizzle-orm-cve`, `test` +12. **Update VERIFICATION.md baseline** — Bump from "323 tests" (Phase 15) to current +13. **Add DESIGN.md lint verification** — Consider adding CSS/style audits for hard constraints + +--- + +## Methodology + +| Check | Method | +|-------|--------| +| CI workflow content | Read `.github/workflows/ci.yml` | +| Test directory structure | `find tests -name '*.test.ts'` | +| TypeScript config | Read `tsconfig.base.json` | +| CHANGELOG staleness | `git log --since` comparison | +| CONTRIBUTING accuracy | Cross-reference with actual test count and phase progress | +| build-all.sh completeness | Read `scripts/build-all.sh`, check against `packages/` listing | +| DESIGN.md contents | Read `DESIGN.md` | +| Duplicate files | `find + uniq -d` on all non-node_modules files | +| Dockerfile health | Read `Dockerfile` and `.dockerignore` | +| CI version drift | Check `actions/checkout` version in all `.github/workflows/*.yml` | diff --git a/CHANGELOG.md b/CHANGELOG.md index 1386cf8..f93d17a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,24 @@ The format follows [Keep a Changelog](https://keepachangelog.com/) conventions. --- +## [Phase 29.5] — 2026-07-03 + +### Added +- **TUI Playground panel** (`apps/tui`): interactive prompt editor with live provider comparison +- **Model Comparison panel** (`apps/tui`): side-by-side output comparison across providers +- **Prompt Library** (`packages/eval`): saved prompts with tags, search, and versioning +- **DESIGN.md**: comprehensive design system spec for mobile-web and dashboard UI (color tokens, typography, components, accessibility) +- **AI Mobile Command Center integrations**: mobile-friendly agent orchestration interfaces + +### Fixed +- **CVE-2026-39356**: drizzle-orm bumped to 0.45.2 across all workspace packages +- **Web UI regression fixes**: NavDrawer, GitTreePanel typecheck errors +- **SessionSidebar type error**: guarded by items.length check +- **Biome lint fixes**: import sort, ChatView unused import resolution +- **Biome config cleanup**: remaining TypeScript type errors resolved for clean CI + +--- + ## [Phase 29] — 2026-07-02 ### Added diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 315acb1..f932e80 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -115,7 +115,7 @@ This project follows a **phase-based development** model (currently 0–26 compl 1. Fork the repo and create a branch from `main` 2. Make your changes following the architecture boundaries 3. Run tests: `bun test` -4. Ensure all 523 tests pass +4. Ensure all 602 tests pass 5. Ensure `bash scripts/build-all.sh` completes cleanly 6. Submit a pull request @@ -133,7 +133,7 @@ See [`AGENTS.md`](./AGENTS.md) for the agent-specific workflow, which includes: ## Testing ```bash -# Full test suite (523 tests, 0 failures) +# Full test suite (602 tests, 0 failures) bun test # Per-category @@ -161,7 +161,7 @@ bash scripts/build-all.sh ## Pull Request Process -1. Ensure all 523 tests pass and CI is green +1. Ensure all 602 tests pass and CI is green 2. Update README and package-level docs if your change affects public API 3. Update phase checklists if your change completes a phase exit gate 4. Include a clear PR description referencing the relevant docs/decisions diff --git a/Dockerfile b/Dockerfile index d001651..566c3c8 100644 --- a/Dockerfile +++ b/Dockerfile @@ -4,22 +4,11 @@ WORKDIR /app COPY package.json bun.lock ./ COPY packages/ packages/ COPY apps/server/ apps/server/ +COPY apps/cli/ apps/cli/ +COPY scripts/ scripts/ COPY tsconfig.base.json ./ RUN bun install --frozen-lockfile -RUN cd packages/protocol && bun run build -RUN cd packages/models && bun run build -RUN cd packages/storage && bun run build -RUN cd packages/tokens && bun run build -RUN cd packages/diff && bun run build -RUN cd packages/events && bun run build -RUN cd packages/sdk && bun run build -RUN cd packages/shell && bun run build -RUN cd packages/permissions && bun run build -RUN cd packages/cache && bun run build -RUN cd packages/planner && bun run build -RUN cd packages/tools && bun run build -RUN cd packages/core && bun run build -RUN cd apps/server && bun run build +RUN bash scripts/build-all.sh FROM oven/bun:1.2-slim diff --git a/docs/02_ARCHITECTURE.md b/docs/02_ARCHITECTURE.md index 8f95e20..747a5df 100644 --- a/docs/02_ARCHITECTURE.md +++ b/docs/02_ARCHITECTURE.md @@ -37,6 +37,12 @@ graph TB CACHE[packages/cache
Read/Grep/Glob Cache] MODELS[packages/models
Provider Adapters] EVT[packages/events
Event Bus] + AUTH[packages/auth
Bearer Tokens + TLS] + COLLAB[packages/collab
Session Sharing] + EVAL[packages/eval
Model Evaluation] + TELE[packages/telemetry
OpenTelemetry] + PLUGIN[packages/plugin-sdk
Plugin Interfaces] + CONFIG[packages/config
Layered Config] end TERM --> TUI @@ -74,11 +80,11 @@ Tokens = context-health control ## 4. Target Package Model -Future implementation should use this structure after Phase 0: - ```text apps/ ├─ cli/ +├─ dashboard/ +├─ mobile-web/ ├─ server/ └─ tui/ @@ -97,11 +103,14 @@ packages/ ├─ tokens/ ├─ cache/ ├─ planner/ -└─ ui/ +├─ ui/ +├─ auth/ +├─ collab/ +├─ eval/ +├─ telemetry/ +└─ plugin-sdk/ ``` -Do not create these folders during Phase 0. They are target architecture only until Phase 1. - ## 5. Application Layers ### 5.1 CLI Layer diff --git a/package.json b/package.json index ae4b12e..d6ad5bc 100644 --- a/package.json +++ b/package.json @@ -20,6 +20,7 @@ "test:e2e": "cd tests && bun test e2e", "test:repeat": "bash scripts/test-repeat.sh", "test:health": "bash scripts/test-health.sh", + "typecheck": "bun run --filter='*' typecheck 2>&1", "coverage": "bun test --coverage", "prepare": "husky || true", "postinstall": "ln -sf ../../../packages/telemetry tests/node_modules/@agent-workbench/telemetry 2>/dev/null; ln -sf ../../../packages/plugin-sdk tests/node_modules/@agent-workbench/plugin-sdk 2>/dev/null; true"