v1.10 audit: release hygiene + judge-scoring fixes#170
Merged
Conversation
requirements.txt is the deploy/app-install artifact (justfile app-install recipe); developers install dev deps via `uv pip install -e ".[dev]"` from pyproject, so this does not affect the dev workflow. Regenerated with `uv export --frozen --no-dev`. Removes pytest, debugpy, black, mypy, ruff, testcontainers, etc. from what the deployed Databricks App installs (smaller install, smaller attack surface; debugpy is a remote debugger). Runtime-critical packages that also live in the [dev] group (alembic, gunicorn, uvicorn) survive as mlflow transitive deps; docker remains as an mlflow[databricks,genai] transitive dep (not the dev SDK). Audit finding C1. Revertable in isolation.
Untrack mlflow.db.bak, e2e-test-output.txt, test-results/.last-run.json, and the vite build-timestamp file so they don't ship in the public release. Files stay on disk; only removed from version control. Fix the .gitignore rule that should have matched test-results/, and add rules for *.bak and the vite timestamp artifact.
The justfile hardcoded internal Databricks proxy hostnames. Read them from DB_PYPI_INDEX / DB_NPM_REGISTRY with public-registry defaults so no internal hostnames ship in the release; the internal dev flow still works when the vars are set.
Two simple-mode (non-MLflow) judge-scoring bugs that silently skewed Cohen's kappa: - Failed LLM calls were recorded as predicted == human (fake perfect agreement). They are now excluded from the agreement set and surfaced as a failure count in metrics and logs. - Binary verdicts were matched by substring, so "incorrect"/"unacceptable" scored as PASS (they contain "correct"/"acceptable"), and unparseable output defaulted to PASS/3. Now matched by whole word with FAIL checked first plus common negated-pass phrases; unparseable output is recorded as a failed evaluation instead of a fabricated rating. The MLflow eval path already handled these correctly; this only touches the simple-mode fallback. Combined because the parser fix reuses the failed-evaluation tracking added for the first fix.
run_evaluation_with_answer_sheet built a success result even when no evaluations were extracted (judge value column missing, result DataFrame None, or every row failed to parse). _calculate_eval_metrics([]) returns kappa=0, so the job reported "completed" with kappa=0 — indistinguishable from a real zero-agreement result. It now yields an explicit error result and stops before computing metrics. Adds a JUDGE_EVALUATION_SPEC test asserting the zero-results guard runs before metrics and yields success=False.
forrestmurray-db
approved these changes
Jun 16, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Targeted fixes surfaced by a review of the v1.10 release candidate, ahead of the public release.
Two buckets: release hygiene (what ships to customers) and judge-scoring correctness (the
agreement / Cohen's κ numbers the workshop produces). Each fix is a separate, individually
revertable commit (
git revert <sha>).Stacked on
rc/v1.10.0so it flows into the release via the existing RC → main PR. Does not touchthe bug-bash fixes already in the RC. Merged up to date with
rc/v1.10.0(no conflicts).Key changes (one commit each)
a30fb25requirements.txtwith--no-dev— stop shipping pytest / debugpy / docker / testcontainers / black / mypy to customer deployments. Dev workflow unaffected (devs install viauv pip install -e ".[dev]"frompyproject).af7d83bmlflow.db.bak(contains local employee paths),e2e-test-output.txt, the vite build-timestamp file,test-results/.last-run.json; fix the.gitignorerules.74d87e2justfileintoDB_PYPI_INDEX/DB_NPM_REGISTRYenv vars with public-registry defaults, so no internal hostnames ship. Internal dev flow still works when the vars are set.819ddcdpredicted == human), which inflated κ; they're now excluded and surfaced as a failure count. (2) Match binary verdicts by whole word, not substring —"incorrect"/"unacceptable"were scoring as PASS (they contain"correct"/"acceptable"). (3) Stop defaulting unparseable output to PASS.612bee1Reviewer focus
819ddcdand612bee1are the behavior changes worth the closest look — they're on thejudge-scoring path (the workshop's core output). Both only affect the simple-mode (non-MLflow)
eval path and the zero-results edge; the MLflow eval path was already correct.
chore/commits are mechanical (generated file / untracking / config) — quick skim.819ddcdlives in code the proposed "require MLflow / drop simple-mode eval"simplification would remove. If you take that direction, this fix is simply mooted — revert it.
Validation
failures are pre-existing and environment-only (
google-genainot installed in the test venv),unrelated to this PR.
ruff: no new errors introduced.Risks / notes
819ddcd(the simple-mode fixes): that logic is an inline closurein a large endpoint (not unit-testable without refactoring) and is a deletion candidate under
the simplification above, so it was verified via full-suite no-regression rather than new tests.
612bee1has a dedicated test.awareness):
F821 undefined name 'host'atserver/routers/workshops.py:3577andF821 undefined name 'JudgePrompt'atserver/services/alignment_service.py:113,216— both look like latentNameErrors on those paths. Also.pre-commit-config.yamlis absent on this branch, so localhooks no-op.
Review follow-up (from /review on this PR)
blunt pre-existing
no/badtokens, free-form prose like "no issues, correct" now classifiesas FAIL where the old substring matching returned PASS. Low-likelihood (only when a judge emits
prose rather than Pass/Fail/0/1) and simple-mode only, but flagging so it isn't a surprise later.
evaluations, an all-failed simple-mode runnow falls through to the existing
if not evaluations: set_status("failed")guard and failsloudly, so simple-mode gets zero-results protection for free (no new guard needed).