Skip to content

v1.10 audit: release hygiene + judge-scoring fixes#170

Merged
forrestmurray-db merged 6 commits into
rc/v1.10.0from
fix/v1.10-audit
Jun 16, 2026
Merged

v1.10 audit: release hygiene + judge-scoring fixes#170
forrestmurray-db merged 6 commits into
rc/v1.10.0from
fix/v1.10-audit

Conversation

@jay-wong-db

@jay-wong-db jay-wong-db commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Summary

Targeted fixes surfaced by a review of the v1.10 release candidate, ahead of the public release.
Two buckets: release hygiene (what ships to customers) and judge-scoring correctness (the
agreement / Cohen's κ numbers the workshop produces). Each fix is a separate, individually
revertable commit (git revert <sha>).

Stacked on rc/v1.10.0 so it flows into the release via the existing RC → main PR. Does not touch
the bug-bash fixes already in the RC. Merged up to date with rc/v1.10.0 (no conflicts).

Key changes (one commit each)

Commit Type What
a30fb25 hygiene Regenerate requirements.txt with --no-dev — stop shipping pytest / debugpy / docker / testcontainers / black / mypy to customer deployments. Dev workflow unaffected (devs install via uv pip install -e ".[dev]" from pyproject).
af7d83b hygiene Untrack dev/test artifacts that were shipping in the repo: mlflow.db.bak (contains local employee paths), e2e-test-output.txt, the vite build-timestamp file, test-results/.last-run.json; fix the .gitignore rules.
74d87e2 hygiene Move internal Databricks package-proxy hostnames out of justfile into DB_PYPI_INDEX / DB_NPM_REGISTRY env vars with public-registry defaults, so no internal hostnames ship. Internal dev flow still works when the vars are set.
819ddcd correctness Simple-mode (non-MLflow) eval: (1) stop recording failed LLM calls as perfect agreement (predicted == human), which inflated κ; they're now excluded and surfaced as a failure count. (2) Match binary verdicts by whole word, not substring — "incorrect" / "unacceptable" were scoring as PASS (they contain "correct" / "acceptable"). (3) Stop defaulting unparseable output to PASS.
612bee1 correctness Alignment eval: fail loudly when zero results are produced (judge value column missing / all rows error) instead of reporting "success" with κ=0, which was indistinguishable from a real zero-agreement result.

Reviewer focus

  • 819ddcd and 612bee1 are the behavior changes worth the closest look — they're on the
    judge-scoring path (the workshop's core output). Both only affect the simple-mode (non-MLflow)
    eval path
    and the zero-results edge; the MLflow eval path was already correct.
  • The three chore/ commits are mechanical (generated file / untracking / config) — quick skim.
  • Heads-up: 819ddcd lives in code the proposed "require MLflow / drop simple-mode eval"
    simplification would remove. If you take that direction, this fix is simply mooted — revert it.

Validation

  • Backend: 913 passed locally (incl. a new test for the zero-results guard). The only 3
    failures are pre-existing and environment-only (google-genai not installed in the test venv),
    unrelated to this PR.
  • Frontend: 363 passed (untouched by these commits).
  • ruff: no new errors introduced.
  • No CI ran (GitHub Actions appear dormant on this repo) — all validation is local.

Risks / notes

  • No dedicated unit tests for 819ddcd (the simple-mode fixes): that logic is an inline closure
    in a large endpoint (not unit-testable without refactoring) and is a deletion candidate under
    the simplification above, so it was verified via full-suite no-regression rather than new tests.
    612bee1 has a dedicated test.
  • Pre-existing lint issues spotted while here (NOT introduced by this PR, flagging for
    awareness): F821 undefined name 'host' at server/routers/workshops.py:3577 and F821 undefined name 'JudgePrompt' at server/services/alignment_service.py:113,216 — both look like latent
    NameErrors on those paths. Also .pre-commit-config.yaml is absent on this branch, so local
    hooks no-op.
  • Opened as draft for review; happy to iterate on scope, granularity, or commit messages.

Review follow-up (from /review on this PR)

  • Binary parser now checks FAIL keywords before PASS ("negatives win"). Combined with the
    blunt pre-existing no / bad tokens, free-form prose like "no issues, correct" now classifies
    as FAIL where the old substring matching returned PASS. Low-likelihood (only when a judge emits
    prose rather than Pass/Fail/0/1) and simple-mode only, but flagging so it isn't a surprise later.
  • Emergent win: because failed calls no longer pad evaluations, an all-failed simple-mode run
    now falls through to the existing if not evaluations: set_status("failed") guard and fails
    loudly, so simple-mode gets zero-results protection for free (no new guard needed).

requirements.txt is the deploy/app-install artifact (justfile app-install
recipe); developers install dev deps via `uv pip install -e ".[dev]"` from
pyproject, so this does not affect the dev workflow.

Regenerated with `uv export --frozen --no-dev`. Removes pytest, debugpy,
black, mypy, ruff, testcontainers, etc. from what the deployed Databricks
App installs (smaller install, smaller attack surface; debugpy is a remote
debugger). Runtime-critical packages that also live in the [dev] group
(alembic, gunicorn, uvicorn) survive as mlflow transitive deps; docker
remains as an mlflow[databricks,genai] transitive dep (not the dev SDK).

Audit finding C1. Revertable in isolation.
Untrack mlflow.db.bak, e2e-test-output.txt, test-results/.last-run.json,
and the vite build-timestamp file so they don't ship in the public release.
Files stay on disk; only removed from version control. Fix the .gitignore
rule that should have matched test-results/, and add rules for *.bak and
the vite timestamp artifact.
The justfile hardcoded internal Databricks proxy hostnames. Read them from
DB_PYPI_INDEX / DB_NPM_REGISTRY with public-registry defaults so no internal
hostnames ship in the release; the internal dev flow still works when the
vars are set.
Two simple-mode (non-MLflow) judge-scoring bugs that silently skewed
Cohen's kappa:

- Failed LLM calls were recorded as predicted == human (fake perfect
  agreement). They are now excluded from the agreement set and surfaced
  as a failure count in metrics and logs.
- Binary verdicts were matched by substring, so "incorrect"/"unacceptable"
  scored as PASS (they contain "correct"/"acceptable"), and unparseable
  output defaulted to PASS/3. Now matched by whole word with FAIL checked
  first plus common negated-pass phrases; unparseable output is recorded
  as a failed evaluation instead of a fabricated rating.

The MLflow eval path already handled these correctly; this only touches
the simple-mode fallback. Combined because the parser fix reuses the
failed-evaluation tracking added for the first fix.
run_evaluation_with_answer_sheet built a success result even when no
evaluations were extracted (judge value column missing, result DataFrame
None, or every row failed to parse). _calculate_eval_metrics([]) returns
kappa=0, so the job reported "completed" with kappa=0 — indistinguishable
from a real zero-agreement result. It now yields an explicit error result
and stops before computing metrics.

Adds a JUDGE_EVALUATION_SPEC test asserting the zero-results guard runs
before metrics and yields success=False.
@jay-wong-db jay-wong-db marked this pull request as ready for review June 13, 2026 13:04
@forrestmurray-db forrestmurray-db merged commit c1fc41d into rc/v1.10.0 Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants