v1.10 audit: release hygiene + judge-scoring fixes by jay-wong-db · Pull Request #170 · databricks-solutions/vibescaler

jay-wong-db · 2026-06-13T12:38:21Z

Summary

Targeted fixes surfaced by a review of the v1.10 release candidate, ahead of the public release.
Two buckets: release hygiene (what ships to customers) and judge-scoring correctness (the
agreement / Cohen's κ numbers the workshop produces). Each fix is a separate, individually
revertable commit (git revert <sha>).

Stacked on rc/v1.10.0 so it flows into the release via the existing RC → main PR. Does not touch
the bug-bash fixes already in the RC. Merged up to date with rc/v1.10.0 (no conflicts).

Key changes (one commit each)

Commit	Type	What
`a30fb25`	hygiene	Regenerate `requirements.txt` with `--no-dev` — stop shipping pytest / debugpy / docker / testcontainers / black / mypy to customer deployments. Dev workflow unaffected (devs install via `uv pip install -e ".[dev]"` from `pyproject`).
`af7d83b`	hygiene	Untrack dev/test artifacts that were shipping in the repo: `mlflow.db.bak` (contains local employee paths), `e2e-test-output.txt`, the vite build-timestamp file, `test-results/.last-run.json`; fix the `.gitignore` rules.
`74d87e2`	hygiene	Move internal Databricks package-proxy hostnames out of `justfile` into `DB_PYPI_INDEX` / `DB_NPM_REGISTRY` env vars with public-registry defaults, so no internal hostnames ship. Internal dev flow still works when the vars are set.
`819ddcd`	correctness	Simple-mode (non-MLflow) eval: (1) stop recording failed LLM calls as perfect agreement (`predicted == human`), which inflated κ; they're now excluded and surfaced as a failure count. (2) Match binary verdicts by whole word, not substring — `"incorrect"` / `"unacceptable"` were scoring as PASS (they contain `"correct"` / `"acceptable"`). (3) Stop defaulting unparseable output to PASS.
`612bee1`	correctness	Alignment eval: fail loudly when zero results are produced (judge value column missing / all rows error) instead of reporting "success" with κ=0, which was indistinguishable from a real zero-agreement result.

Reviewer focus

819ddcd and 612bee1 are the behavior changes worth the closest look — they're on the
judge-scoring path (the workshop's core output). Both only affect the simple-mode (non-MLflow)
eval path and the zero-results edge; the MLflow eval path was already correct.
The three chore/ commits are mechanical (generated file / untracking / config) — quick skim.
Heads-up: 819ddcd lives in code the proposed "require MLflow / drop simple-mode eval"
simplification would remove. If you take that direction, this fix is simply mooted — revert it.

Validation

Backend: 913 passed locally (incl. a new test for the zero-results guard). The only 3
failures are pre-existing and environment-only (google-genai not installed in the test venv),
unrelated to this PR.
Frontend: 363 passed (untouched by these commits).
ruff: no new errors introduced.
No CI ran (GitHub Actions appear dormant on this repo) — all validation is local.

Risks / notes

No dedicated unit tests for 819ddcd (the simple-mode fixes): that logic is an inline closure
in a large endpoint (not unit-testable without refactoring) and is a deletion candidate under
the simplification above, so it was verified via full-suite no-regression rather than new tests.
612bee1 has a dedicated test.
Pre-existing lint issues spotted while here (NOT introduced by this PR, flagging for
awareness): F821 undefined name 'host' at server/routers/workshops.py:3577 and F821 undefined name 'JudgePrompt' at server/services/alignment_service.py:113,216 — both look like latent
NameErrors on those paths. Also .pre-commit-config.yaml is absent on this branch, so local
hooks no-op.
Opened as draft for review; happy to iterate on scope, granularity, or commit messages.

Review follow-up (from /review on this PR)

Binary parser now checks FAIL keywords before PASS ("negatives win"). Combined with the
blunt pre-existing no / bad tokens, free-form prose like "no issues, correct" now classifies
as FAIL where the old substring matching returned PASS. Low-likelihood (only when a judge emits
prose rather than Pass/Fail/0/1) and simple-mode only, but flagging so it isn't a surprise later.
Emergent win: because failed calls no longer pad evaluations, an all-failed simple-mode run
now falls through to the existing if not evaluations: set_status("failed") guard and fails
loudly, so simple-mode gets zero-results protection for free (no new guard needed).

requirements.txt is the deploy/app-install artifact (justfile app-install recipe); developers install dev deps via `uv pip install -e ".[dev]"` from pyproject, so this does not affect the dev workflow. Regenerated with `uv export --frozen --no-dev`. Removes pytest, debugpy, black, mypy, ruff, testcontainers, etc. from what the deployed Databricks App installs (smaller install, smaller attack surface; debugpy is a remote debugger). Runtime-critical packages that also live in the [dev] group (alembic, gunicorn, uvicorn) survive as mlflow transitive deps; docker remains as an mlflow[databricks,genai] transitive dep (not the dev SDK). Audit finding C1. Revertable in isolation.

Untrack mlflow.db.bak, e2e-test-output.txt, test-results/.last-run.json, and the vite build-timestamp file so they don't ship in the public release. Files stay on disk; only removed from version control. Fix the .gitignore rule that should have matched test-results/, and add rules for *.bak and the vite timestamp artifact.

The justfile hardcoded internal Databricks proxy hostnames. Read them from DB_PYPI_INDEX / DB_NPM_REGISTRY with public-registry defaults so no internal hostnames ship in the release; the internal dev flow still works when the vars are set.

Two simple-mode (non-MLflow) judge-scoring bugs that silently skewed Cohen's kappa: - Failed LLM calls were recorded as predicted == human (fake perfect agreement). They are now excluded from the agreement set and surfaced as a failure count in metrics and logs. - Binary verdicts were matched by substring, so "incorrect"/"unacceptable" scored as PASS (they contain "correct"/"acceptable"), and unparseable output defaulted to PASS/3. Now matched by whole word with FAIL checked first plus common negated-pass phrases; unparseable output is recorded as a failed evaluation instead of a fabricated rating. The MLflow eval path already handled these correctly; this only touches the simple-mode fallback. Combined because the parser fix reuses the failed-evaluation tracking added for the first fix.

run_evaluation_with_answer_sheet built a success result even when no evaluations were extracted (judge value column missing, result DataFrame None, or every row failed to parse). _calculate_eval_metrics([]) returns kappa=0, so the job reported "completed" with kappa=0 — indistinguishable from a real zero-agreement result. It now yields an explicit error result and stops before computing metrics. Adds a JUDGE_EVALUATION_SPEC test asserting the zero-results guard runs before metrics and yields success=False.

jay-wong-db added 6 commits June 13, 2026 12:02

Merge remote-tracking branch 'origin/rc/v1.10.0' into fix/v1.10-audit

3d5c0c4

jay-wong-db marked this pull request as ready for review June 13, 2026 13:04

jay-wong-db requested review from forrestmurray-db and vivian-xie-db as code owners June 13, 2026 13:04

forrestmurray-db approved these changes Jun 16, 2026

View reviewed changes

forrestmurray-db merged commit c1fc41d into rc/v1.10.0 Jun 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.10 audit: release hygiene + judge-scoring fixes#170

v1.10 audit: release hygiene + judge-scoring fixes#170
forrestmurray-db merged 6 commits into
rc/v1.10.0from
fix/v1.10-audit

jay-wong-db commented Jun 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jay-wong-db commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key changes (one commit each)

Reviewer focus

Validation

Risks / notes

Review follow-up (from /review on this PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jay-wong-db commented Jun 13, 2026 •

edited

Loading