fix(profile): ecosystem-aware verification detection (#471) #474

Merged
Necmttn merged 4 commits into main from fix/verification-taxonomy on Jun 16, 2026

Necmttn commented Jun 16, 2026

Fixes #471.

Problem

ax verification share read 0.12% for verification-heavy non-JS/TS stacks. Root cause: "verification" was one JS/TS-flavored regex (/test|check|verify|lint|typecheck|tsc|vitest|bun test/i), substring-matched against the tool label, duplicated across profile/queries.ts and dashboard/wrapped.ts.

  • Over-counted git checkout (the largest false-positive bucket — substring "check").
  • Missed runners whose names carry no English keyword: rspec has no "test", rubocop has no "lint", playwright/bin/pw drive a browser.

Fix

New shared apps/axctl/src/profile/tool-taxonomy.ts → isVerificationTool / isContextTool, matched on the program token (basename of command_norm ?? name) instead of any substring of the whole label:

  • Ecosystem-aware programs: rspec, rubocop, standardrb, pytest, ruff, flake8, mypy, pyright, phpunit, phpstan, clippy, credo, vitest, jest, tsc, eslint, biome, oxlint, playwright, cypress, mcp__playwright__*, …
  • Multi-token subcommand forms: go test/go vet, cargo test/clippy, dotnet test, mix test, bun test, gradle check, … (and explicitly not go build / cargo build).
  • Narrow generic keyword programs only as the leading token: verify, typecheck, lint, test, check, bin/check-types.
  • Explicit git exclusion → git checkout no longer counts.
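
A minimal sketch of the program-token idea described in the list above. The tables, `programOf`, and `isVerificationToolSketch` are illustrative assumptions, not the contents of tool-taxonomy.ts; the point is only that the decision keys off the basename of the first token plus an optional subcommand, so a substring like "check" in `git checkout` never matters.

```typescript
// Illustrative tables only: the real lists in tool-taxonomy.ts are larger.
const VERIFY_PROGRAMS: ReadonlySet<string> = new Set([
  "rspec", "rubocop", "pytest", "ruff", "mypy", "phpunit",
  "vitest", "jest", "tsc", "eslint", "playwright", "cypress",
]);

// Programs that count only together with a specific subcommand.
const VERIFY_SUBCOMMANDS = new Map<string, ReadonlySet<string>>([
  ["go", new Set(["test", "vet"])],
  ["cargo", new Set(["test", "clippy"])],
  ["dotnet", new Set(["test"])],
  ["mix", new Set(["test"])],
  ["bun", new Set(["test"])],
  ["gradle", new Set(["check"])],
]);

// Basename of the first whitespace-separated token, lowercased.
function programOf(label: string): string {
  const first = label.trim().split(/\s+/)[0] ?? "";
  return first.slice(first.lastIndexOf("/") + 1).toLowerCase();
}

function isVerificationToolSketch(label: string | null | undefined): boolean {
  if (!label) return false; // null-safe
  const program = programOf(label);
  if (program === "git") return false; // explicit exclusion
  if (program.startsWith("mcp__playwright__")) return true;
  if (VERIFY_PROGRAMS.has(program)) return true;
  const subcommands = VERIFY_SUBCOMMANDS.get(program);
  if (subcommands === undefined) return false;
  const sub = label.trim().split(/\s+/)[1];
  return sub !== undefined && subcommands.has(sub);
}
```

Under these assumptions `bin/rubocop -A` and `go test ./...` are credited, while `git checkout main` and `go build` are not.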

Both consumers now share the predicates (toolCount takes a predicate). The turn-analysis.ts user-ask regex is now grouped: the ungrouped /\bverify|test|.../ matched "test" inside "fastest"/"latest".
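
To illustrate the regex precedence issue: alternation binds more loosely than `\b`, so in the ungrouped pattern the boundary applies only to the first alternative. The two patterns below are shortened stand-ins (the full alternation in turn-analysis.ts is elided above), and `compare` is a hypothetical helper.

```typescript
// Ungrouped: parsed as (\bverify) | (test), so "test" matches anywhere.
const ungrouped = /\bverify|test/i;

// Grouped: both boundaries apply to every alternative.
const grouped = /\b(?:verify|test)\b/i;

// Hypothetical helper to compare the two patterns on the same input.
function compare(text: string): { ungrouped: boolean; grouped: boolean } {
  return { ungrouped: ungrouped.test(text), grouped: grouped.test(text) };
}
```

With these patterns, "pick the fastest option" matches only the ungrouped form, while "please test it" matches both.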

Scope

Per discussion on the issue, this covers the tool-taxonomy + git-exclude + turn-analysis regex directions. The broader "credit verification surfaced via skills (create-pr), agent types (frontend-qc), and hooks (rubocop-check)" is not in this PR — those live in separate data streams (invoked / hook_invocation) and change the count model; left as a follow-up.

Tests

  • New tool-taxonomy.test.ts: rspec/rubocop/playwright credited, git checkout excluded, fastest/contest not matched, multi-token subcommands, null-safe.
  • Existing queries.test.ts / wrapped.test.ts / turn-analysis.test.ts green (behavior preserved).
  • bun run typecheck: no errors in changed files (pre-existing packages/lib @effect/platform-bun failures are unrelated).

🤖 Generated with Claude Code

Verification share read ~0.1% for verification-heavy non-JS/TS stacks
because "verification" was a single JS/TS-flavored regex substring-matched
against the tool label (/test|check|verify|lint|typecheck|tsc|vitest/i).
That over-counted `git checkout` (substring "check") and missed runners
whose names carry no English keyword: `rspec` has no "test", `rubocop`
has no "lint", `playwright`/`bin/pw` drive a browser.

Replace the regex with a shared semantic taxonomy keyed off the program
token (basename of `command_norm ?? name`):

- tool-taxonomy.ts: isVerificationTool / isContextTool. Ecosystem-aware
  programs (rspec, rubocop, pytest, ruff, mypy, phpunit, clippy, vitest,
  tsc, eslint, playwright, cypress, mcp__playwright__*, ...), multi-token
  subcommand forms (go test, cargo test/clippy, dotnet test, bun test),
  narrow generic keyword programs (verify/typecheck/lint/test/check), and
  an explicit git exclusion so `git checkout` no longer counts.
- profile/queries.ts + dashboard/wrapped.ts: drop the duplicated regexes,
  share the predicates (toolCount now takes a predicate).
- turn-analysis.ts: group the user-ask verification regex
  (/\bverify|test|.../  matched "test" inside "fastest"/"latest").

Matching on the program token (not a substring of the whole label) means
flags/args can't trigger a false positive. Behavior of existing wrapped
counts preserved (bun test -> verification, Read -> context).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented Jun 16, 2026

Deploying ax with Cloudflare Pages

Latest commit: 56539a2
Status: ✅  Deploy successful!
Preview URL: https://de99a39e.ax-62d.pages.dev
Branch Preview URL: https://fix-verification-taxonomy.ax-62d.pages.dev

…iew)

Address /review-all feedback (simplify x3 + codex review + adversarial):

Correctness regressions vs the old regex, now fixed by reusing the existing
ingest classifiers instead of a parallel hand-rolled parser:
- Tokenize via commandTokenSegments (shell tokenizer): env prefixes
  (`NODE_ENV=test vitest`), `cd x && pytest`, and `a | b` chains now classify
  (naive whitespace split missed them).
- Delegate the core decision to checkFamilyFromCommand (churn classifier):
  package run-scripts `bun run typecheck` / `npm run lint` / `pnpm lint` /
  `yarn check` count again (the first draft only credited `test`), and the
  shell builtin `test` (`test -f foo`) is correctly excluded.
- Context: reuse isReadTool / READ_COMMANDS so `git grep` is context again
  (head-token reduction had dropped it to `git`).
- turn-analysis: keep plural/inflected forms (`run the tests`, `checks pass`)
  while keeping the boundaries that avoid `fastest`/`latest`.
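
A small sketch of the plural-tolerant form mentioned in the last bullet. The alternation here is an assumed subset (the real pattern in turn-analysis.ts is not shown in this PR text); `asksForVerification` is a hypothetical name.

```typescript
// Assumed subset of the user-ask keywords; optional "s" keeps plurals,
// while the surrounding \b boundaries still reject "fastest" and "latest".
const VERIFY_ASK = /\b(?:verify|tests?|checks?|lint)\b/i;

function asksForVerification(message: string): boolean {
  return VERIFY_ASK.test(message);
}
```

Under this pattern "run the tests" and "make sure the checks pass" match, and "use the fastest path" does not.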

Ecosystem coverage (the review's "cover every language" ask): Ruby, Python,
Go, Rust, JS/TS, PHP, Elixir, .NET, JVM (mvn/gradle + mvnw/gradlew wrappers),
Scala/Clojure, Haskell, Swift (xcodebuild), C/C++, shell (bats/shellcheck/
yamllint), plus wrapper runners (`bundle exec`, `poetry/uv/pdm run ...`).

e2e/browser drivers (`bin/pw`, `playwright`, `cypress`, `mcp__playwright__*`)
are credited as verification - the workload issue #471 was raised about.
Codex flagged that bare navigate/screenshot is not a test assertion; kept as
verification per the reporter's intent (it IS the QA loop on an e2e stack).

check-family.ts consumed read-only - no change to churn episode classification
(188 tests across profile/check-family/outcomes/churn green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Necmttn commented Jun 16, 2026

Review pass (/review-all: simplify ×3 + codex review + codex adversarial)

Ran the multi-reviewer pass against the first taxonomy draft and revised in 1cc14572. Summary for the record.

Findings (all addressed in the revision)

Correctness regressions vs the old regex — the first draft hand-rolled its own command parser, which was weaker than what ingest already has:

| # | Finding | Fix |
|---|---------|-----|
| 1 | bun run typecheck / npm run lint / pnpm lint / yarn check returned false (only test credited) | Delegate to checkFamilyFromCommand (already models package run-scripts) |
| 2 | Naive split(/\s+/) missed env-prefixes (NODE_ENV=test vitest), &&/pipe chains (cd app && pytest), wrappers (bundle exec rspec) | Tokenize via commandTokenSegments (the shared shell tokenizer) |
| 3 | Grouped regex dropped plural asks (run the tests, make sure the checks pass) | Keep tests?/checks? while retaining \b boundaries (no fastest/latest) |
| 4 | Head-token reduction dropped git grep → git, so it stopped counting as context | Reuse isReadTool / READ_COMMANDS (already lists git grep) |
| 5 | Shell builtin test (test -f foo) counted as verification | checkFamilyFromCommand deliberately excludes it; keyword-prefix fallback removed |
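
For finding 2, a rough sketch of why a whitespace split is not enough. This is a deliberately simplified stand-in, not the actual commandTokenSegments: it ignores quoting, subshells, and redirections, and `segmentsOf` / `headPrograms` are hypothetical names.

```typescript
// Split a command line into segments at &&, ||, ; and |, then drop
// leading NAME=value env assignments from each segment.
function segmentsOf(command: string): string[][] {
  return command
    .split(/&&|\|\||[;|]/)
    .map((segment) => segment.trim().split(/\s+/).filter((t) => t.length > 0))
    .map((tokens) => {
      let i = 0;
      while (i < tokens.length && /^[A-Za-z_][A-Za-z0-9_]*=/.test(tokens[i] ?? "")) i++;
      return tokens.slice(i);
    })
    .filter((tokens) => tokens.length > 0);
}

// The program token of every segment, in order.
function headPrograms(command: string): string[] {
  return segmentsOf(command).map((tokens) => tokens[0] ?? "");
}
```

A naive `split(/\s+/)[0]` on `NODE_ENV=test vitest run` yields `NODE_ENV=test`, whereas `headPrograms` yields `vitest`; for `cd app && pytest -q` it yields both `cd` and `pytest`, so a classifier can inspect each segment.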

Architecture (altitude consensus): the draft was a 2nd verification classifier drifting from ingest/check-family.ts (the self-declared single source of truth). Revised to consume check-family / tool-calls / tool-classes read-only (no churn-path change — 188 tests across profile/check-family/outcomes/churn green) and layer only the cross-ecosystem programs they don't model.

Ecosystem coverage (the "cover every language" ask): Ruby, Python, Go, Rust, JS/TS, PHP, Elixir, .NET, JVM (+ mvnw/gradlew wrappers), Scala/Clojure, Haskell, Swift (xcodebuild test), C/C++, shell (bats/shellcheck/yamllint), plus bundle exec / poetry|uv|pdm run wrapper runners. Each has a fixture in tool-taxonomy.test.ts.

Dedup (simplify): removed the keyword-prefix fallback (killed a dead testcafe entry + the shell-test false positive), collapsed CONTEXT_PROGRAMS/head() redundancy into one programOf + isReadTool.

One open policy call

Codex flagged that bare MCP browser actions (mcp__playwright__browser_navigate, …__screenshot) are context, not a test assertion, and argued only playwright test should count. Kept them as verification — that's the premise of this issue: on an e2e/Playwright stack the QA loop is bin/pw + mcp__playwright__* browser-driving (~10% of all tool calls here), and the stricter line puts verification share back near ~0%. If preferred, these can instead land in a separate e2e bucket (credited, but not conflated with test/lint/typecheck proof) — small follow-up.

Out of scope (deferred, per the issue's own list)

Crediting verification surfaced via skills (create-pr, ci-watch), agent types (frontend-qc), and hooks (rubocop-check) — those live in the invoked / hook_invocation streams, a different entity than tool_call labels; the (label) => boolean shape doesn't generalize to them. Tracked as a follow-up.

…gaps

Final-review (codex adversarial) caught that the production aggregation feeds
the COLLAPSED command_norm to the classifier, not the full command. Since
normalizeCommand strips the subcommand for tools outside SUBCOMMAND_TOOLS
(`mvn test` -> `mvn`, `npm run lint` -> `npm run`, `bundle exec rspec` ->
`bundle`), the expanded taxonomy could never see those verifiers - the tests
passed only because they fed full command strings that never reach the query.

- profile/queries.ts + dashboard/wrapped.ts: new VERIFY_AGG / WRAPPED_VERIFY
  query groups by `command_text ?? command_norm ?? name` and classifies the
  full command in-process (counts only ever leave - privacy invariant intact).
  This makes the shell tokenizer + ecosystem maps actually live in production,
  so JVM/Scala/.NET/run-script/wrapper verifiers now count.
- tool-taxonomy: e2e drivers exclude setup/inspection subcommands
  (`playwright install|codegen`, `pw --help`) when the full command is known;
  a bare normalized `playwright` still counts. Add module-runner forms
  (`python -m pytest`, `node --test`, `rake test`) and option-value-safe
  subcommand scanning (`xcodebuild -scheme App test`, `mvn clean test`).
- Context: credit `NotebookRead` (the old /read/i regex matched it).
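
A sketch of what "option-value-safe subcommand scanning" can look like. The option table and the `positionals` / `runsTests` helpers are assumptions for illustration, not the code in tool-taxonomy.ts; the idea is that an option known to take a value consumes the next token, so `App` in `-scheme App` is never read as the action.

```typescript
// Illustrative table: options that consume the following token as a value.
const VALUE_OPTIONS = new Map<string, ReadonlySet<string>>([
  ["xcodebuild", new Set(["-scheme", "-project", "-workspace", "-destination", "-configuration"])],
  ["mvn", new Set(["-f", "-pl", "-P"])],
]);

// Positional (non-option) tokens after the program, skipping option values.
function positionals(tokens: readonly string[]): string[] {
  const [program, ...rest] = tokens;
  if (program === undefined) return [];
  const valued: ReadonlySet<string> = VALUE_OPTIONS.get(program) ?? new Set<string>();
  const out: string[] = [];
  for (let i = 0; i < rest.length; i++) {
    const token = rest[i] ?? "";
    if (valued.has(token)) {
      i++; // skip the option's value as well
      continue;
    }
    if (token.startsWith("-")) continue; // bare flag
    out.push(token);
  }
  return out;
}

// True when any positional token is the "test" action.
function runsTests(tokens: readonly string[]): boolean {
  return positionals(tokens).includes("test");
}
```

Scanning all positionals rather than only the first is what lets `mvn clean test` count, and skipping option values is what keeps a scheme that happens to be named `test` from counting on its own.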

Tests: per-ecosystem + regression fixtures; render/queries mocks updated for
the added query. 190 tests across profile/dashboard/ingest green; check-family
consumed read-only (no churn-path change); typecheck clean on changed files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Necmttn commented Jun 16, 2026

Final review round (codex adversarial) — resolved in 6702de28

The final pass caught a real blocker the earlier rounds missed, plus a few edge cases. All addressed:

| Severity | Finding | Fix |
|----------|---------|-----|
| high | Production aggregation feeds the collapsed command_norm, not the full command. normalizeCommand strips the subcommand for tools outside SUBCOMMAND_TOOLS (mvn test → mvn, npm run lint → npm run, bundle exec rspec → bundle), so the expanded taxonomy could never see those verifiers in production; the tests passed only because they fed full command strings. | New VERIFY_AGG / WRAPPED_VERIFY query groups by command_text ?? command_norm ?? name and classifies the full command in-process (only counts leave, so the privacy invariant is intact). This makes the shell tokenizer + ecosystem maps actually live, so JVM/Scala/.NET/run-script/wrapper verifiers count. No schema/ingest change. |
| medium | e2e drivers blanket-true → playwright install/codegen/pw --help counted as verification | Exclude setup/inspection subcommands + help/version flags when the full command is known; a bare normalized playwright still counts (ambiguous → run) |
| medium | Module-runner forms missed (python -m pytest, node --test, rake test); xcodebuild -scheme App test read the scheme value as the action | Module-runner handling + option-value-safe subcommand scan (mvn clean test, ./gradlew check also fixed) |
| medium | NotebookRead dropped from context (old /read/i matched it) | Added to context extras |
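
A sketch of the e2e-driver rule from the second row, under assumed tables: the driver set, the excluded subcommands, and the flag list below are illustrative, and `isE2eVerification` is a hypothetical name.

```typescript
// Illustrative sets; the real lists may differ.
const E2E_DRIVERS: ReadonlySet<string> = new Set(["playwright", "cypress", "pw"]);
const SETUP_SUBCOMMANDS: ReadonlySet<string> = new Set(["install", "codegen"]);
const INFO_FLAGS: ReadonlySet<string> = new Set(["--help", "-h", "--version"]);

function isE2eVerification(tokens: readonly string[]): boolean {
  const [first, ...rest] = tokens;
  if (first === undefined) return false;
  const program = first.slice(first.lastIndexOf("/") + 1);
  if (!E2E_DRIVERS.has(program)) return false;
  // Help/version invocations inspect the tool rather than run it.
  if (rest.some((token) => INFO_FLAGS.has(token))) return false;
  // First non-flag token is treated as the subcommand.
  const sub = rest.find((token) => !token.startsWith("-"));
  // No subcommand (bare, possibly normalized label) is ambiguous and counts as a run.
  return sub === undefined || !SETUP_SUBCOMMANDS.has(sub);
}
```

The bare `playwright` case returns true on purpose: when only the normalized label is available there is no subcommand to inspect, and the table above resolves that ambiguity in favor of a run.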

Why command_text is the right call: outcomes.ts already classifies churn from command_text, the column exists on tool_call, and it's the only way "cover every language" is actually true rather than test-green. The classifier internals are unchanged — they were just being fed the wrong column.
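
To make the wrong-column problem concrete, here is a toy reproduction. Everything in it is an assumption for illustration: `collapse` only mimics the stripping behavior described above (it is not normalizeCommand), the `SUBCOMMAND_TOOLS_SKETCH` contents are guessed, and `isVerifySketch` is a three-pattern stand-in for the real classifier.

```typescript
interface ToolCallRow {
  name: string;
  command_norm: string | null;
  command_text: string | null;
}

// Guessed membership: tools in the set keep one subcommand, others keep none.
const SUBCOMMAND_TOOLS_SKETCH: ReadonlySet<string> = new Set(["npm", "git"]);

function collapse(command: string): string {
  const [program = "", sub] = command.trim().split(/\s+/);
  return SUBCOMMAND_TOOLS_SKETCH.has(program) && sub !== undefined
    ? `${program} ${sub}`
    : program;
}

// Toy classifier: it can only recognize a verifier from the full command.
function isVerifySketch(label: string): boolean {
  return /^(?:mvn\b.*\btest\b|npm run (?:lint|test|typecheck)\b|bundle exec rspec\b)/.test(label);
}

// Count rows whose chosen label classifies as verification.
function countVerification(
  rows: readonly ToolCallRow[],
  labelOf: (row: ToolCallRow) => string,
): number {
  return rows.filter((row) => isVerifySketch(labelOf(row))).length;
}

// Before: the collapsed label. After: full command first, then fallbacks.
const collapsedLabel = (row: ToolCallRow): string => row.command_norm ?? row.name;
const fullLabel = (row: ToolCallRow): string =>
  row.command_text ?? row.command_norm ?? row.name;
```

Fed rows for `mvn clean test`, `npm run lint`, and `bundle exec rspec`, `countVerification` with `collapsedLabel` sees only `mvn`, `npm run`, and `bundle` and counts none of them, while `fullLabel` counts all three with the same classifier.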

Verification: 190 tests across profile/dashboard/ingest green (per-ecosystem + regression fixtures); check-family.ts consumed read-only (no churn-path change); typecheck clean on changed files.

Note: the command_norm-collapse limitation was pre-existing (the old regex ran on the same collapsed label) — this round removes it entirely by switching to command_text.

…nomy

# Conflicts:
#	apps/axctl/src/profile/queries.ts
@Necmttn Necmttn merged commit 418a13d into main Jun 16, 2026
3 checks passed
@Necmttn Necmttn deleted the fix/verification-taxonomy branch June 16, 2026 13:22
Development

Successfully merging this pull request may close these issues.

Verification share undercounted (~0.1%): detector is a JS/TS tool-name regex; misses Rails/Playwright tools, skills, and hooks
