fix(profile): ecosystem-aware verification detection (#471)#474
Conversation
Verification share read ~0.1% for verification-heavy non-JS/TS stacks because "verification" was a single JS/TS-flavored regex substring-matched against the tool label (/test|check|verify|lint|typecheck|tsc|vitest/i). That over-counted `git checkout` (substring "check") and missed runners whose names carry no English keyword: `rspec` has no "test", `rubocop` has no "lint", `playwright`/`bin/pw` drive a browser.

Replace the regex with a shared semantic taxonomy keyed off the program token (basename of `command_norm ?? name`):

- tool-taxonomy.ts: isVerificationTool / isContextTool. Ecosystem-aware programs (rspec, rubocop, pytest, ruff, mypy, phpunit, clippy, vitest, tsc, eslint, playwright, cypress, mcp__playwright__*, ...), multi-token subcommand forms (go test, cargo test/clippy, dotnet test, bun test), narrow generic keyword programs (verify/typecheck/lint/test/check), and an explicit git exclusion so `git checkout` no longer counts.
- profile/queries.ts + dashboard/wrapped.ts: drop the duplicated regexes, share the predicates (toolCount now takes a predicate).
- turn-analysis.ts: group the user-ask verification regex (/\bverify|test|.../ matched "test" inside "fastest"/"latest").

Matching on the program token (not a substring of the whole label) means flags/args can't trigger a false positive. Behavior of existing wrapped counts preserved (bun test -> verification, Read -> context).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
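To make the difference between substring matching and program-token matching concrete, here is a minimal sketch. It is not the code in tool-taxonomy.ts: `programOf`, `isVerificationLabel`, and the two abbreviated lookup tables are hypothetical stand-ins built only from the programs named in the commit message.

```typescript
// Old approach quoted in the commit message: one regex, substring-matched
// against the whole tool label.
const OLD_VERIFY_RE = /test|check|verify|lint|typecheck|tsc|vitest/i;

// Hypothetical, abbreviated program list; the real taxonomy is larger.
const VERIFY_PROGRAMS: ReadonlySet<string> = new Set([
  "rspec", "rubocop", "pytest", "ruff", "mypy", "phpunit",
  "vitest", "tsc", "eslint", "playwright", "cypress", "pw",
]);

// Hypothetical multi-token forms: program -> subcommands that count.
const VERIFY_SUBCOMMANDS: Record<string, readonly string[]> = {
  go: ["test", "vet"],
  cargo: ["test", "clippy"],
  dotnet: ["test"],
  bun: ["test"],
};

// Program token = basename of the first whitespace-separated token.
function programOf(label: string): string {
  const first = label.trim().split(/\s+/)[0] ?? "";
  const base = first.split("/").pop() ?? "";
  return base.toLowerCase();
}

function isVerificationLabel(label: string | null | undefined): boolean {
  if (!label) return false; // null-safe
  const program = programOf(label);
  if (program === "git") return false; // explicit git exclusion
  if (program.startsWith("mcp__playwright__")) return true;
  if (VERIFY_PROGRAMS.has(program)) return true;
  // Only the token right after the program is inspected, so flags and
  // arguments further along cannot trigger a match.
  const second = label.trim().split(/\s+/)[1] ?? "";
  return (VERIFY_SUBCOMMANDS[program] ?? []).includes(second);
}
```

With this shape, `git checkout main` matches the old regex but is rejected by the predicate, while `rspec spec/models` is missed by the old regex and accepted by the predicate.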
Deploying ax with
| | |
|---|---|
| Latest commit: | 56539a2 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://de99a39e.ax-62d.pages.dev |
| Branch Preview URL: | https://fix-verification-taxonomy.ax-62d.pages.dev |
…iew)

Address /review-all feedback (simplify x3 + codex review + adversarial):

Correctness regressions vs the old regex, now fixed by reusing the existing ingest classifiers instead of a parallel hand-rolled parser:

- Tokenize via commandTokenSegments (shell tokenizer): env prefixes (`NODE_ENV=test vitest`), `cd x && pytest`, and `a | b` chains now classify (naive whitespace split missed them).
- Delegate the core decision to checkFamilyFromCommand (churn classifier): package run-scripts `bun run typecheck` / `npm run lint` / `pnpm lint` / `yarn check` count again (the first draft only credited `test`), and the shell builtin `test` (`test -f foo`) is correctly excluded.
- Context: reuse isReadTool / READ_COMMANDS so `git grep` is context again (head-token reduction had dropped it to `git`).
- turn-analysis: keep plural/inflected forms (`run the tests`, `checks pass`) while keeping the boundaries that avoid `fastest`/`latest`.

Ecosystem coverage (the review's "cover every language" ask): Ruby, Python, Go, Rust, JS/TS, PHP, Elixir, .NET, JVM (mvn/gradle + mvnw/gradlew wrappers), Scala/Clojure, Haskell, Swift (xcodebuild), C/C++, shell (bats/shellcheck/yamllint), plus wrapper runners (`bundle exec`, `poetry/uv/pdm run ...`).

e2e/browser drivers (`bin/pw`, `playwright`, `cypress`, `mcp__playwright__*`) are credited as verification - the workload issue #471 was raised about. Codex flagged that bare navigate/screenshot is not a test assertion; kept as verification per the reporter's intent (it IS the QA loop on an e2e stack).

check-family.ts consumed read-only - no change to churn episode classification (188 tests across profile/check-family/outcomes/churn green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
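The tokenization point can be illustrated with a deliberately simplified sketch. This is not `commandTokenSegments`: `segmentHeads` is a hypothetical helper that ignores quoting, subshells, and redirections, and exists only to show why a naive whitespace split picks the wrong program token for env prefixes and chained commands.

```typescript
// Leading NAME=value words are shell variable assignments, not programs.
const ENV_ASSIGNMENT = /^[A-Za-z_][A-Za-z0-9_]*=/;

// Hypothetical helper: split a command line on &&, ||, | and ;, then return
// the first non-assignment token of each segment.
// Simplification: quotes, subshells and redirections are not handled.
function segmentHeads(command: string): string[] {
  return command
    .split(/&&|\|\||[|;]/)
    .map((segment) => segment.trim().split(/\s+/).filter((t) => t.length > 0))
    .map((tokens) => tokens.find((t) => !ENV_ASSIGNMENT.test(t)) ?? "")
    .filter((head) => head.length > 0);
}

// The naive approach the first draft used: first whitespace-separated token.
function naiveHead(command: string): string {
  return command.trim().split(/\s+/)[0] ?? "";
}
```

For `NODE_ENV=test vitest run`, `naiveHead` yields `NODE_ENV=test` while `segmentHeads` yields `vitest`. For `cd app && pytest -q`, the naive head is `cd`, whereas the segment heads include `pytest`, so a classifier that checks every segment can credit the run.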
Review pass (
| # | Finding | Fix |
|---|---|---|
| 1 | `bun run typecheck` / `npm run lint` / `pnpm lint` / `yarn check` returned false (only `test` credited) | Delegate to checkFamilyFromCommand (already models package run-scripts) |
| 2 | Naive `split(/\s+/)` missed env-prefixes (`NODE_ENV=test vitest`), &&/pipe chains (`cd app && pytest`), wrappers (`bundle exec rspec`) | Tokenize via commandTokenSegments (the shared shell tokenizer) |
| 3 | Grouped regex dropped plural asks (run the tests, make sure the checks pass) | Keep `tests?`/`checks?` while retaining `\b` boundaries (no fastest/latest) |
| 4 | Head-token reduction dropped `git grep` → `git`, so it stopped counting as context | Reuse isReadTool / READ_COMMANDS (already lists `git grep`) |
| 5 | Shell builtin `test` (`test -f foo`) counted as verification | checkFamilyFromCommand deliberately excludes it; keyword-prefix fallback removed |
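Finding 3 and the original `fastest`/`latest` bug both come down to how alternation interacts with `\b`. The sketch below is a hypothetical reconstruction of the three regex shapes involved, limited to the keywords quoted in this thread; the real pattern in turn-analysis.ts is elided in the source and may cover more words and inflections.

```typescript
// Ungrouped: \b binds only to "verify"; "test" and "check" match anywhere,
// including inside "fastest" and "latest".
const UNGROUPED = /\bverify|test|check/i;

// Grouped with boundaries on both sides but no plural: fixes "fastest",
// but drops "run the tests" and "checks pass".
const GROUPED_STRICT = /\b(?:verify|test|check)\b/i;

// Grouped with optional plural: keeps the boundaries and the plural asks.
const GROUPED_PLURAL = /\b(?:verify|tests?|checks?)\b/i;

function asksForVerification(userText: string): boolean {
  return GROUPED_PLURAL.test(userText);
}
```

`asksForVerification("use the fastest option")` is false because the `t` of the embedded `test` is preceded by a word character, so `\b` cannot match there, while `asksForVerification("run the tests")` is true through the optional `s`.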
Architecture (altitude consensus): the draft was a 2nd verification classifier drifting from ingest/check-family.ts (the self-declared single source of truth). Revised to consume check-family / tool-calls / tool-classes read-only (no churn-path change — 188 tests across profile/check-family/outcomes/churn green) and layer only the cross-ecosystem programs they don't model.
Ecosystem coverage (the "cover every language" ask): Ruby, Python, Go, Rust, JS/TS, PHP, Elixir, .NET, JVM (+ mvnw/gradlew wrappers), Scala/Clojure, Haskell, Swift (xcodebuild test), C/C++, shell (bats/shellcheck/yamllint), plus bundle exec / poetry|uv|pdm run wrapper runners. Each has a fixture in tool-taxonomy.test.ts.
Dedup (simplify): removed the keyword-prefix fallback (killed a dead testcafe entry + the shell-test false positive), collapsed CONTEXT_PROGRAMS/head() redundancy into one programOf + isReadTool.
One open policy call
Codex flagged that bare MCP browser actions (mcp__playwright__browser_navigate, …__screenshot) are context, not a test assertion, and argued only playwright test should count. Kept them as verification — that's the premise of this issue: on an e2e/Playwright stack the QA loop is bin/pw + mcp__playwright__* browser-driving (~10% of all tool calls here), and the stricter line puts verification share back near ~0%. If preferred, these can instead land in a separate e2e bucket (credited, but not conflated with test/lint/typecheck proof) — small follow-up.
Out of scope (deferred, per the issue's own list)
Crediting verification surfaced via skills (create-pr, ci-watch), agent types (frontend-qc), and hooks (rubocop-check) — those live in the invoked / hook_invocation streams, a different entity than tool_call labels; the (label) => boolean shape doesn't generalize to them. Tracked as a follow-up.
…gaps

Final-review (codex adversarial) caught that the production aggregation feeds the COLLAPSED command_norm to the classifier, not the full command. Since normalizeCommand strips the subcommand for tools outside SUBCOMMAND_TOOLS (`mvn test` -> `mvn`, `npm run lint` -> `npm run`, `bundle exec rspec` -> `bundle`), the expanded taxonomy could never see those verifiers - the tests passed only because they fed full command strings that never reach the query.

- profile/queries.ts + dashboard/wrapped.ts: new VERIFY_AGG / WRAPPED_VERIFY query groups by `command_text ?? command_norm ?? name` and classifies the full command in-process (counts only ever leave - privacy invariant intact). This makes the shell tokenizer + ecosystem maps actually live in production, so JVM/Scala/.NET/run-script/wrapper verifiers now count.
- tool-taxonomy: e2e drivers exclude setup/inspection subcommands (`playwright install|codegen`, `pw --help`) when the full command is known; a bare normalized `playwright` still counts. Add module-runner forms (`python -m pytest`, `node --test`, `rake test`) and option-value-safe subcommand scanning (`xcodebuild -scheme App test`, `mvn clean test`).
- Context: credit `NotebookRead` (the old /read/i regex matched it).

Tests: per-ecosystem + regression fixtures; render/queries mocks updated for the added query. 190 tests across profile/dashboard/ingest green; check-family consumed read-only (no churn-path change); typecheck clean on changed files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final review round (codex adversarial) — resolved in
| Severity | Finding | Fix |
|---|---|---|
| high | Production aggregation feeds the collapsed command_norm, not the full command. normalizeCommand strips the subcommand for tools outside SUBCOMMAND_TOOLS (`mvn test` → `mvn`, `npm run lint` → `npm run`, `bundle exec rspec` → `bundle`), so the expanded taxonomy could never see those verifiers in production; the tests passed only because they fed full command strings. | New VERIFY_AGG / WRAPPED_VERIFY query groups by `command_text ?? command_norm ?? name` and classifies the full command in-process (only counts leave; privacy invariant intact). This makes the shell tokenizer + ecosystem maps actually live, so JVM/Scala/.NET/run-script/wrapper verifiers count. No schema/ingest change. |
| medium | e2e drivers blanket-true → `playwright install`/`codegen`/`pw --help` counted as verification | Exclude setup/inspection subcommands + help/version flags when the full command is known; a bare normalized `playwright` still counts (ambiguous → run) |
| medium | Module-runner forms missed (`python -m pytest`, `node --test`, `rake test`); `xcodebuild -scheme App test` read the scheme value as the action | Module-runner handling + option-value-safe subcommand scan (`mvn clean test`, `./gradlew check` also fixed) |
| medium | NotebookRead dropped from context (old /read/i matched it) | Added to context extras |
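The option-value-safe scan from the third row can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `VALUE_OPTIONS` and `VERIFY_ACTIONS` are hypothetical, abbreviated tables, and `hasVerifyAction` is a made-up name. The idea is that an option known to take a value consumes the next token, so `App` in `-scheme App` is never mistaken for the action.

```typescript
// Hypothetical subset of options that consume the following token as a value.
const VALUE_OPTIONS: Record<string, ReadonlySet<string>> = {
  xcodebuild: new Set(["-scheme", "-project", "-workspace", "-destination", "-configuration"]),
  mvn: new Set(["-f", "-pl", "-P"]),
};

// Hypothetical subset of actions that count as verification per program.
const VERIFY_ACTIONS: Record<string, ReadonlySet<string>> = {
  xcodebuild: new Set(["test"]),
  mvn: new Set(["test"]),
  gradlew: new Set(["test", "check"]),
};

// Drop flags, and drop the value that follows any value-taking option.
function positionalTokens(tokens: readonly string[], valueOptions: ReadonlySet<string>): string[] {
  const out: string[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const tok = tokens[i] ?? "";
    if (valueOptions.has(tok)) {
      i++; // skip the option's value as well
      continue;
    }
    if (tok.startsWith("-")) continue; // standalone flag
    out.push(tok);
  }
  return out;
}

// Scan every positional token, so `mvn clean test` is credited via `test`.
function hasVerifyAction(command: string): boolean {
  const [head = "", ...rest] = command.trim().split(/\s+/);
  const program = (head.split("/").pop() ?? "").toLowerCase();
  const actions = VERIFY_ACTIONS[program];
  if (!actions) return false;
  const valueOptions = VALUE_OPTIONS[program] ?? new Set<string>();
  return positionalTokens(rest, valueOptions).some((t) => actions.has(t));
}
```

Scanning all positional tokens rather than only the first one is what lets multi-goal invocations such as `mvn clean test` count, while `xcodebuild -scheme test build` stays uncredited because `test` there is the scheme name.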
Why command_text is the right call: outcomes.ts already classifies churn from command_text, the column exists on tool_call, and it's the only way "cover every language" is actually true rather than test-green. The classifier internals are unchanged — they were just being fed the wrong column.
Verification: 190 tests across profile/dashboard/ingest green (per-ecosystem + regression fixtures); check-family.ts consumed read-only (no churn-path change); typecheck clean on changed files.
Note: the command_norm-collapse limitation was pre-existing (the old regex ran on the same collapsed label) — this round removes it entirely by switching to command_text.
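The effect of switching columns can be shown with a small in-process aggregation sketch. The column names `command_text`, `command_norm`, and `name` come from this thread; everything else (`ToolCallGroup`, the `calls` field, `countVerification`, and the toy classifier) is hypothetical and stands in for the real query and taxonomy.

```typescript
// One grouped row as an aggregation query might return it (shape assumed).
interface ToolCallGroup {
  command_text: string | null;
  command_norm: string | null;
  name: string;
  calls: number; // hypothetical count column
}

type LabelOf = (group: ToolCallGroup) => string;

// New behavior: prefer the full command.
const fullLabel: LabelOf = (g) => g.command_text ?? g.command_norm ?? g.name;
// Old behavior: the collapsed label only.
const collapsedLabel: LabelOf = (g) => g.command_norm ?? g.name;

// Classify in-process and return counts only; no command text leaves.
function countVerification(
  groups: readonly ToolCallGroup[],
  labelOf: LabelOf,
  isVerification: (label: string) => boolean,
): { verification: number; total: number } {
  let verification = 0;
  let total = 0;
  for (const g of groups) {
    total += g.calls;
    if (isVerification(labelOf(g))) verification += g.calls;
  }
  return { verification, total };
}

// Toy classifier for the demo only; it needs the subcommand to decide.
const toyIsVerification = (label: string): boolean => {
  const tokens = label.trim().split(/\s+/);
  if (tokens[0] === "mvn") return tokens.includes("test");
  if (tokens[0] === "bundle") return tokens[1] === "exec" && tokens[2] === "rspec";
  return false;
};

const sampleGroups: ToolCallGroup[] = [
  { command_text: "mvn test", command_norm: "mvn", name: "Bash", calls: 3 },
  { command_text: "bundle exec rspec spec/", command_norm: "bundle", name: "Bash", calls: 2 },
  { command_text: null, command_norm: null, name: "Read", calls: 5 },
];
```

On `sampleGroups`, `countVerification(sampleGroups, fullLabel, toyIsVerification)` credits 5 of 10 calls, while the same classifier fed `collapsedLabel` credits 0 of 10, which mirrors the gap between test-green and production behavior described above.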
…nomy

# Conflicts:
#   apps/axctl/src/profile/queries.ts
Fixes #471.
Problem
`ax` verification share read 0.12% for verification-heavy non-JS/TS stacks. Root cause: "verification" was one JS/TS-flavored regex (`/test|check|verify|lint|typecheck|tsc|vitest|bun test/i`), substring-matched against the tool label, duplicated across `profile/queries.ts` and `dashboard/wrapped.ts`.

- Over-counted: `git checkout` (the largest false-positive bucket, via the substring "check").
- Missed: runners whose names carry no English keyword. `rspec` has no "test", `rubocop` has no "lint", `playwright`/`bin/pw` drive a browser.

Fix
New shared `apps/axctl/src/profile/tool-taxonomy.ts` with `isVerificationTool` / `isContextTool`, matched on the program token (basename of `command_norm ?? name`) instead of any substring of the whole label:

- Ecosystem-aware programs: `rspec`, `rubocop`, `standardrb`, `pytest`, `ruff`, `flake8`, `mypy`, `pyright`, `phpunit`, `phpstan`, `clippy`, `credo`, `vitest`, `jest`, `tsc`, `eslint`, `biome`, `oxlint`, `playwright`, `cypress`, `mcp__playwright__*`, …
- Multi-token subcommand forms: `go test`/`go vet`, `cargo test/clippy`, `dotnet test`, `mix test`, `bun test`, `gradle check`, … (and explicitly not `go build`/`cargo build`).
- Narrow generic keyword programs: `verify`, `typecheck`, `lint`, `test`, `check`, `bin/check-types`.
- Explicit `git` exclusion → `git checkout` no longer counts.

Both consumers now share the predicates (`toolCount` takes a predicate). The `turn-analysis.ts` user-ask regex is now grouped (`/\bverify|test|.../` matched "test" inside "fastest"/"latest").

Scope
Per discussion on the issue, this covers the tool-taxonomy + git-exclude + turn-analysis regex directions. The broader "credit verification surfaced via skills (`create-pr`), agent types (`frontend-qc`), and hooks (`rubocop-check`)" is not in this PR: those live in separate data streams (`invoked`/`hook_invocation`) and change the count model, so they are left as a follow-up.

Tests

- `tool-taxonomy.test.ts`: rspec/rubocop/playwright credited, `git checkout` excluded, `fastest`/`contest` not matched, multi-token subcommands, null-safe.
- `queries.test.ts` / `wrapped.test.ts` / `turn-analysis.test.ts` green (behavior preserved).
- `bun run typecheck`: no errors in changed files (pre-existing `packages/lib` `@effect/platform-bun` failures are unrelated).

🤖 Generated with Claude Code