fix(profile): ecosystem-aware verification detection (#471) by Necmttn · Pull Request #474 · Necmttn/ax

Necmttn · 2026-06-16T08:24:42Z

Fixes #471.

Problem

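The previous detection substring-matched a single regex against the whole tool label, which:

- Over-counted `git checkout` (the largest false-positive bucket, via the substring "check").
- Missed runners whose names carry no English keyword: `rspec` has no "test", `rubocop` has no "lint", and `playwright`/`bin/pw` drive a browser.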

Fix

New shared `apps/axctl/src/profile/tool-taxonomy.ts` with `isVerificationTool` / `isContextTool`, matched on the program token (basename of `command_norm ?? name`) instead of any substring of the whole label:

- Ecosystem-aware programs: `rspec`, `rubocop`, `standardrb`, `pytest`, `ruff`, `flake8`, `mypy`, `pyright`, `phpunit`, `phpstan`, `clippy`, `credo`, `vitest`, `jest`, `tsc`, `eslint`, `biome`, `oxlint`, `playwright`, `cypress`, `mcp__playwright__*`, …
- Multi-token subcommand forms: `go test`/`go vet`, `cargo test`/`clippy`, `dotnet test`, `mix test`, `bun test`, `gradle check`, … (and explicitly not `go build` / `cargo build`).
- Narrow generic keyword programs, only as the leading token: `verify`, `typecheck`, `lint`, `test`, `check`, `bin/check-types`.
- Explicit git exclusion, so `git checkout` no longer counts.
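As a rough illustration of the program-token idea in the list above, the sketch below classifies a label by the basename of its leading token instead of by substring. All names (`VERIFY_PROGRAMS`, `VERIFY_SUBCOMMANDS`, `isVerificationSketch`) and the trimmed-down tables are hypothetical, and the naive whitespace split is a simplification that the later review in this thread replaced with the shared shell tokenizer; this is not the code in `tool-taxonomy.ts`.

```typescript
// Hypothetical, trimmed-down tables; the lists in the PR are longer.
const VERIFY_PROGRAMS: ReadonlySet<string> = new Set<string>([
  "rspec", "rubocop", "pytest", "vitest", "tsc", "eslint", "playwright",
]);
const VERIFY_SUBCOMMANDS = new Map<string, ReadonlySet<string>>([
  ["go", new Set<string>(["test", "vet"])],
  ["cargo", new Set<string>(["test", "clippy"])],
  ["bun", new Set<string>(["test"])],
]);

// Naive whitespace tokenization (an assumption made for brevity).
function tokensOf(label: string): string[] {
  return label.trim().split(/\s+/).filter((t) => t.length > 0);
}

// Basename of a token: "./node_modules/.bin/vitest" -> "vitest".
function programToken(token: string): string {
  return token.slice(token.lastIndexOf("/") + 1).toLowerCase();
}

function isVerificationSketch(label: string | null | undefined): boolean {
  if (!label) return false; // null-safe
  const [head, sub] = tokensOf(label);
  if (head === undefined) return false;
  const program = programToken(head);
  if (program === "git") return false; // explicit exclusion: `git checkout` is not a check
  if (VERIFY_PROGRAMS.has(program)) return true;
  // Multi-token forms: only specific subcommands count (`go test`, not `go build`).
  const subs = VERIFY_SUBCOMMANDS.get(program);
  return subs !== undefined && sub !== undefined && subs.has(sub);
}
```

Because only the leading token and its subcommand are inspected, a flag or argument containing "check" or "test" cannot trigger a match on its own.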

Both consumers now share the predicates (`toolCount` takes a predicate). The user-ask regex in `turn-analysis.ts` is now grouped: in the ungrouped form `/\bverify|test|.../`, the `\b` applied only to the first alternative, so "test" matched inside "fastest"/"latest".
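The precedence issue can be reproduced in isolation. The two patterns below are reduced to the two alternatives quoted above (the rest of the original alternation is elided in this description and is not reconstructed here); the optional plural `tests?` reflects the later review finding in this thread.

```typescript
// Ungrouped: alternation binds looser than `\b`, so this reads as
// (\bverify) OR (test), and "test" can match in the middle of a word.
const ungrouped = /\bverify|test/i;

// Grouped: both boundaries apply to every alternative; `tests?` keeps plurals.
const grouped = /\b(?:verify|tests?)\b/i;

const samples = ["the fastest path", "use the latest build", "run the tests", "please verify this"];
const ungroupedHits = samples.filter((s) => ungrouped.test(s));
const groupedHits = samples.filter((s) => grouped.test(s));
```

With these samples, `ungroupedHits` contains all four strings, while `groupedHits` contains only "run the tests" and "please verify this".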

Scope

Per discussion on the issue, this covers the tool-taxonomy + git-exclude + turn-analysis regex directions. The broader "credit verification surfaced via skills (create-pr), agent types (frontend-qc), and hooks (rubocop-check)" is not in this PR — those live in separate data streams (invoked / hook_invocation) and change the count model; left as a follow-up.

Tests

- New `tool-taxonomy.test.ts`: rspec/rubocop/playwright credited, `git checkout` excluded, fastest/contest not matched, multi-token subcommands, null-safe.
- Existing `queries.test.ts` / `wrapped.test.ts` / `turn-analysis.test.ts` green (behavior preserved).
- `bun run typecheck`: no errors in changed files (pre-existing `packages/lib` `@effect/platform-bun` failures are unrelated).

🤖 Generated with Claude Code

Verification share read ~0.1% for verification-heavy non-JS/TS stacks because "verification" was a single JS/TS-flavored regex substring-matched against the tool label (`/test|check|verify|lint|typecheck|tsc|vitest/i`). That over-counted `git checkout` (substring "check") and missed runners whose names carry no English keyword: `rspec` has no "test", `rubocop` has no "lint", `playwright`/`bin/pw` drive a browser.

Replace the regex with a shared semantic taxonomy keyed off the program token (basename of `command_norm ?? name`):

- tool-taxonomy.ts: isVerificationTool / isContextTool. Ecosystem-aware programs (rspec, rubocop, pytest, ruff, mypy, phpunit, clippy, vitest, tsc, eslint, playwright, cypress, `mcp__playwright__*`, ...), multi-token subcommand forms (go test, cargo test/clippy, dotnet test, bun test), narrow generic keyword programs (verify/typecheck/lint/test/check), and an explicit git exclusion so `git checkout` no longer counts.
- profile/queries.ts + dashboard/wrapped.ts: drop the duplicated regexes, share the predicates (toolCount now takes a predicate).
- turn-analysis.ts: group the user-ask verification regex (`/\bverify|test|.../` matched "test" inside "fastest"/"latest").

Matching on the program token (not a substring of the whole label) means flags/args can't trigger a false positive. Behavior of existing wrapped counts preserved (bun test -> verification, Read -> context).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-06-16T08:24:43Z

Deploying ax with Cloudflare Pages

Latest commit:	`56539a2`
Status:	✅ Deploy successful!
Preview URL:	https://de99a39e.ax-62d.pages.dev
Branch Preview URL:	https://fix-verification-taxonomy.ax-62d.pages.dev


…iew)

Address /review-all feedback (simplify x3 + codex review + adversarial):

Correctness regressions vs the old regex, now fixed by reusing the existing ingest classifiers instead of a parallel hand-rolled parser:

- Tokenize via commandTokenSegments (shell tokenizer): env prefixes (`NODE_ENV=test vitest`), `cd x && pytest`, and `a | b` chains now classify (naive whitespace split missed them).
- Delegate the core decision to checkFamilyFromCommand (churn classifier): package run-scripts `bun run typecheck` / `npm run lint` / `pnpm lint` / `yarn check` count again (the first draft only credited `test`), and the shell builtin `test` (`test -f foo`) is correctly excluded.
- Context: reuse isReadTool / READ_COMMANDS so `git grep` is context again (head-token reduction had dropped it to `git`).
- turn-analysis: keep plural/inflected forms (`run the tests`, `checks pass`) while keeping the boundaries that avoid `fastest`/`latest`.

Ecosystem coverage (the review's "cover every language" ask): Ruby, Python, Go, Rust, JS/TS, PHP, Elixir, .NET, JVM (mvn/gradle + mvnw/gradlew wrappers), Scala/Clojure, Haskell, Swift (xcodebuild), C/C++, shell (bats/shellcheck/yamllint), plus wrapper runners (`bundle exec`, `poetry/uv/pdm run ...`).

e2e/browser drivers (`bin/pw`, `playwright`, `cypress`, `mcp__playwright__*`) are credited as verification - the workload issue #471 was raised about. Codex flagged that bare navigate/screenshot is not a test assertion; kept as verification per the reporter's intent (it IS the QA loop on an e2e stack).

check-family.ts consumed read-only - no change to churn episode classification (188 tests across profile/check-family/outcomes/churn green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Necmttn · 2026-06-16T12:59:05Z

Review pass (`/review-all`: simplify ×3 + codex review + codex adversarial)

Ran the multi-reviewer pass against the first taxonomy draft and revised in 1cc14572. Summary for the record.

Findings (all addressed in the revision)

Correctness regressions vs the old regex — the first draft hand-rolled its own command parser, which was weaker than what ingest already has:

| # | Finding | Fix |
|---|---------|-----|
| 1 | `bun run typecheck` / `npm run lint` / `pnpm lint` / `yarn check` returned `false` (only `test` credited) | Delegate to `checkFamilyFromCommand` (already models package run-scripts) |
| 2 | Naive `split(/\s+/)` missed env-prefixes (`NODE_ENV=test vitest`), `&&`/pipe chains (`cd app && pytest`), wrappers (`bundle exec rspec`) | Tokenize via `commandTokenSegments` (the shared shell tokenizer) |
| 3 | Grouped regex dropped plural asks (`run the tests`, `make sure the checks pass`) | Keep `tests?`/`checks?` while retaining `\b` boundaries (no `fastest`/`latest`) |
| 4 | Head-token reduction dropped `git grep` → `git`, so it stopped counting as context | Reuse `isReadTool` / `READ_COMMANDS` (already lists `git grep`) |
| 5 | Shell builtin `test` (`test -f foo`) counted as verification | `checkFamilyFromCommand` deliberately excludes it; keyword-prefix fallback removed |
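To make finding 2 concrete, here is a deliberately simplified stand-in for segment-aware tokenization. It is not `commandTokenSegments` (whose signature is not shown in this thread): `sketchSegments` and `WRAPPERS` are hypothetical names, and the splitter is not quote-aware. It only shows why splitting on chain operators, dropping `VAR=value` prefixes, and stripping wrapper runners exposes the real program token.

```typescript
// Wrapper runners named in this thread; the real handling may differ.
const WRAPPERS: ReadonlyArray<ReadonlyArray<string>> = [
  ["bundle", "exec"], ["poetry", "run"], ["uv", "run"], ["pdm", "run"],
];

function sketchSegments(command: string): string[][] {
  return command
    .split(/&&|\|\||[;|]/) // chain operators: &&, ||, ;, | (ignores quoting)
    .map((segment) => segment.trim().split(/\s+/).filter((t) => t.length > 0))
    .map((tokens) => {
      // Drop leading environment assignments such as NODE_ENV=test.
      let start = 0;
      while (start < tokens.length && /^[A-Za-z_][A-Za-z0-9_]*=/.test(tokens[start] ?? "")) {
        start++;
      }
      const rest = tokens.slice(start);
      // Strip a known wrapper prefix so `bundle exec rspec` exposes `rspec`.
      for (const wrapper of WRAPPERS) {
        if (wrapper.every((part, i) => rest[i] === part)) return rest.slice(wrapper.length);
      }
      return rest;
    })
    .filter((tokens) => tokens.length > 0);
}

// True when any segment is headed by one of the given programs.
function anySegmentRuns(command: string, programs: ReadonlySet<string>): boolean {
  return sketchSegments(command).some((tokens) => programs.has(tokens[0] ?? ""));
}
```

For example, `sketchSegments("cd app && pytest")` yields two segments, `["cd", "app"]` and `["pytest"]`, whereas a plain `split(/\s+/)` would only ever see `cd` as the head token.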

Architecture (altitude consensus): the draft was a 2nd verification classifier drifting from ingest/check-family.ts (the self-declared single source of truth). Revised to consume check-family / tool-calls / tool-classes read-only (no churn-path change — 188 tests across profile/check-family/outcomes/churn green) and layer only the cross-ecosystem programs they don't model.

Ecosystem coverage (the "cover every language" ask): Ruby, Python, Go, Rust, JS/TS, PHP, Elixir, .NET, JVM (+ mvnw/gradlew wrappers), Scala/Clojure, Haskell, Swift (xcodebuild test), C/C++, shell (bats/shellcheck/yamllint), plus bundle exec / poetry|uv|pdm run wrapper runners. Each has a fixture in tool-taxonomy.test.ts.

Dedup (simplify): removed the keyword-prefix fallback (killed a dead testcafe entry + the shell-test false positive), collapsed CONTEXT_PROGRAMS/head() redundancy into one programOf + isReadTool.

One open policy call

Codex flagged that bare MCP browser actions (mcp__playwright__browser_navigate, …__screenshot) are context, not a test assertion, and argued only playwright test should count. Kept them as verification — that's the premise of this issue: on an e2e/Playwright stack the QA loop is bin/pw + mcp__playwright__* browser-driving (~10% of all tool calls here), and the stricter line puts verification share back near ~0%. If preferred, these can instead land in a separate e2e bucket (credited, but not conflated with test/lint/typecheck proof) — small follow-up.

Out of scope (deferred, per the issue's own list)

Crediting verification surfaced via skills (create-pr, ci-watch), agent types (frontend-qc), and hooks (rubocop-check) — those live in the invoked / hook_invocation streams, a different entity than tool_call labels; the (label) => boolean shape doesn't generalize to them. Tracked as a follow-up.

…gaps

Final-review (codex adversarial) caught that the production aggregation feeds the COLLAPSED command_norm to the classifier, not the full command. Since normalizeCommand strips the subcommand for tools outside SUBCOMMAND_TOOLS (`mvn test` -> `mvn`, `npm run lint` -> `npm run`, `bundle exec rspec` -> `bundle`), the expanded taxonomy could never see those verifiers - the tests passed only because they fed full command strings that never reach the query.

- profile/queries.ts + dashboard/wrapped.ts: new VERIFY_AGG / WRAPPED_VERIFY query groups by `command_text ?? command_norm ?? name` and classifies the full command in-process (counts only ever leave - privacy invariant intact). This makes the shell tokenizer + ecosystem maps actually live in production, so JVM/Scala/.NET/run-script/wrapper verifiers now count.
- tool-taxonomy: e2e drivers exclude setup/inspection subcommands (`playwright install|codegen`, `pw --help`) when the full command is known; a bare normalized `playwright` still counts. Add module-runner forms (`python -m pytest`, `node --test`, `rake test`) and option-value-safe subcommand scanning (`xcodebuild -scheme App test`, `mvn clean test`).
- Context: credit `NotebookRead` (the old /read/i regex matched it).

Tests: per-ecosystem + regression fixtures; render/queries mocks updated for the added query. 190 tests across profile/dashboard/ingest green; check-family consumed read-only (no churn-path change); typecheck clean on changed files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Necmttn · 2026-06-16T13:15:03Z

Final review round (codex adversarial) — resolved in `6702de28`

The final pass caught a real blocker the earlier rounds missed, plus a few edge cases. All addressed:

| Severity | Finding | Fix |
|----------|---------|-----|
| high | Production aggregation feeds the collapsed `command_norm`, not the full command. `normalizeCommand` strips the subcommand for tools outside `SUBCOMMAND_TOOLS` (`mvn test`→`mvn`, `npm run lint`→`npm run`, `bundle exec rspec`→`bundle`), so the expanded taxonomy could never see those verifiers in production; the tests passed only because they fed full command strings. | New `VERIFY_AGG` / `WRAPPED_VERIFY` query groups by `command_text ?? command_norm ?? name` and classifies the full command in-process (only counts leave, so the privacy invariant is intact). This makes the shell tokenizer + ecosystem maps actually live, so JVM/Scala/.NET/run-script/wrapper verifiers count. No schema/ingest change. |
| medium | e2e drivers blanket-true → `playwright install`/`codegen`/`pw --help` counted as verification | Exclude setup/inspection subcommands + help/version flags when the full command is known; a bare normalized `playwright` still counts (ambiguous → run) |
| medium | Module-runner forms missed (`python -m pytest`, `node --test`, `rake test`); `xcodebuild -scheme App test` read the scheme value as the action | Module-runner handling + option-value-safe subcommand scan (`mvn clean test`, `./gradlew check` also fixed) |
| medium | `NotebookRead` dropped from context (old `/read/i` matched it) | Added to context extras |
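The "option-value-safe subcommand scan" in the table can be sketched as follows. This is an assumption about the general technique, not the PR's implementation: `VALUE_OPTIONS`, `VERIFY_ACTIONS`, `actionTokens`, and `hasVerifyAction` are hypothetical names, and the option list is a partial, illustrative one. The idea is that an option known to consume the next token must have that value skipped, so a scheme name is never read as the action.

```typescript
// Options that take the following token as their value (partial, illustrative list).
const VALUE_OPTIONS = new Map<string, ReadonlySet<string>>([
  ["xcodebuild", new Set<string>(["-scheme", "-project", "-workspace", "-destination", "-configuration"])],
]);
const VERIFY_ACTIONS: ReadonlySet<string> = new Set<string>(["test", "check", "verify"]);
const NO_VALUE_OPTIONS: ReadonlySet<string> = new Set<string>();

// Returns the positional (non-option, non-option-value) tokens after the program.
function actionTokens(tokens: ReadonlyArray<string>): string[] {
  const program = tokens[0];
  if (program === undefined) return [];
  const takesValue = VALUE_OPTIONS.get(program) ?? NO_VALUE_OPTIONS;
  const actions: string[] = [];
  let i = 1;
  while (i < tokens.length) {
    const token = tokens[i] ?? "";
    if (token.startsWith("-")) {
      // Skip the option itself, plus its value when it is known to take one.
      i += takesValue.has(token) ? 2 : 1;
      continue;
    }
    actions.push(token);
    i += 1;
  }
  return actions;
}

function hasVerifyAction(tokens: ReadonlyArray<string>): boolean {
  return actionTokens(tokens).some((action) => VERIFY_ACTIONS.has(action));
}
```

Under this sketch, `["xcodebuild", "-scheme", "App", "test"]` yields the single action `test`, and `["mvn", "clean", "test"]` is matched on its second positional token rather than only its first.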

Why command_text is the right call: outcomes.ts already classifies churn from command_text, the column exists on tool_call, and it's the only way "cover every language" is actually true rather than test-green. The classifier internals are unchanged — they were just being fed the wrong column.
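A minimal sketch of why the input column matters, assuming a row shape with the three columns named above (`ToolCallRow`, `verificationCount`, and the toy predicate are hypothetical; the real queries are `VERIFY_AGG` / `WRAPPED_VERIFY`, whose SQL is not shown here). Rows are grouped by the richest available label, classified in-process, and only a count is returned.

```typescript
interface ToolCallRow {
  command_text: string | null; // full command, when recorded
  command_norm: string | null; // collapsed form, e.g. "mvn" for "mvn test"
  name: string;                // tool name fallback
}

function verificationCount(
  rows: ReadonlyArray<ToolCallRow>,
  isVerification: (label: string) => boolean,
): number {
  // Group by label so each distinct command is classified once.
  const groups = new Map<string, number>();
  for (const row of rows) {
    const label = row.command_text ?? row.command_norm ?? row.name;
    groups.set(label, (groups.get(label) ?? 0) + 1);
  }
  let total = 0;
  groups.forEach((n, label) => {
    if (isVerification(label)) total += n;
  });
  return total; // only the count is returned; labels stay in-process
}

// Toy predicate for the demo: credits exactly `mvn test`.
const toyPredicate = (label: string): boolean => label === "mvn test";

const fullRows: ToolCallRow[] = [
  { command_text: "mvn test", command_norm: "mvn", name: "Bash" },
  { command_text: "mvn test", command_norm: "mvn", name: "Bash" },
  { command_text: null, command_norm: null, name: "Read" },
];
// Same rows as the classifier saw them before: collapsed labels only.
const collapsedRows: ToolCallRow[] = fullRows.map((r) => ({ ...r, command_text: null }));

const withFullText = verificationCount(fullRows, toyPredicate);
const withCollapsed = verificationCount(collapsedRows, toyPredicate);
```

Here `withFullText` is 2 and `withCollapsed` is 0: the predicate is identical in both calls, and only the column it is fed changes.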

Verification: 190 tests across profile/dashboard/ingest green (per-ecosystem + regression fixtures); check-family.ts consumed read-only (no churn-path change); typecheck clean on changed files.

Note: the command_norm-collapse limitation was pre-existing (the old regex ran on the same collapsed label) — this round removes it entirely by switching to command_text.

…nomy # Conflicts: # apps/axctl/src/profile/queries.ts

Merge remote-tracking branch 'origin/main' into fix/verification-taxo…

56539a2


Necmttn merged commit 418a13d into main Jun 16, 2026
3 checks passed

Necmttn deleted the fix/verification-taxonomy branch June 16, 2026 13:22

Necmttn mentioned this pull request Jun 16, 2026

chore(main): release 0.33.0 #483

Open

Necmttn merged 4 commits into main from fix/verification-taxonomy
