Problem
--threshold compares the mean score against the threshold, but RESULT: uses per-test pass/fail (score >= 0.8). This produces contradictory output and exit codes:
RESULT: FAIL (28/31 passed, mean score: 0.927)
Suite score: 0.93 (threshold: 0.80) — PASS ← exit code 0
The output says FAIL but the exit code is 0. Users expect --threshold 0.8 to mean "each test must score >= 0.8" — matching the per-test requirement.
Root cause
- formatEvaluationSummary(): per-test pass/fail (score >= hardcoded 0.8)
- formatThresholdSummary(): mean score comparison
- Exit code follows threshold (mean-based), not RESULT (per-test)
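The mismatch can be reproduced with a minimal sketch. Everything here is an assumption made for illustration: the helper names, the four sample scores, and the rule that RESULT is FAIL when any single test falls below 0.8. It is not the project's actual code.

```typescript
// Hypothetical sketch of the pre-fix behaviour; names and data are illustrative only.
const HARDCODED_PASS_SCORE = 0.8;

// Per-test check: assumed to drive the RESULT line.
function perTestVerdict(scores: number[]): { passed: number; total: number; ok: boolean } {
  const passed = scores.filter((s) => s >= HARDCODED_PASS_SCORE).length;
  return { passed, total: scores.length, ok: passed === scores.length };
}

// Mean check: assumed to drive the threshold line and the exit code.
function meanVerdict(scores: number[], threshold: number): { mean: number; ok: boolean } {
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return { mean, ok: mean >= threshold };
}

const sampleScores = [1.0, 1.0, 0.95, 0.5]; // one test is well below 0.8
const resultLine = perTestVerdict(sampleScores);
const suiteLine = meanVerdict(sampleScores, 0.8);
const oldExitCode = suiteLine.ok ? 0 : 1;

console.log(`RESULT: ${resultLine.ok ? "PASS" : "FAIL"} (${resultLine.passed}/${resultLine.total} passed)`);
console.log(`Suite: ${suiteLine.ok ? "PASS" : "FAIL"}, exit code ${oldExitCode}`);
```

One low score fails the per-test check, but the high scores pull the mean above the threshold, so the two lines disagree and the exit code follows the mean.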
Fix
PR #885 — --threshold now overrides the per-test score requirement:
- calculateEvaluationSummary() recomputes passed/failed using the threshold
- RESULT line shows the threshold: 28/31 scored >= 0.8
- Exit code matches RESULT verdict
- Removed separate formatThresholdSummary(); there is now one unified output line
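A rough sketch of the unified behaviour, under stated assumptions: summarizeScores, formatResultLine, and exitCodeFor are hypothetical names, the default of 0.8 when no threshold is given is assumed from the previously hardcoded value, and the exact wording of the RESULT line may differ from what the PR emits.

```typescript
// Hypothetical sketch of the post-fix behaviour; not the code from the PR.
interface EvaluationSummary {
  passed: number;
  failed: number;
  total: number;
  meanScore: number;
  threshold: number;
  ok: boolean;
}

// The threshold is applied to each test; the mean is reported but does not decide the verdict.
function summarizeScores(scores: number[], threshold: number = 0.8): EvaluationSummary {
  const passed = scores.filter((s) => s >= threshold).length;
  const total = scores.length;
  const meanScore = total === 0 ? 0 : scores.reduce((sum, s) => sum + s, 0) / total;
  return { passed, failed: total - passed, total, meanScore, threshold, ok: passed === total };
}

// One output line that carries both the verdict and the threshold it was judged against.
function formatResultLine(summary: EvaluationSummary): string {
  const verdict = summary.ok ? "PASS" : "FAIL";
  return `RESULT: ${verdict} (${summary.passed}/${summary.total} scored >= ${summary.threshold}, ` +
    `mean score: ${summary.meanScore.toFixed(3)})`;
}

// The exit code is derived from the same verdict, so it cannot contradict the RESULT line.
function exitCodeFor(summary: EvaluationSummary): number {
  return summary.ok ? 0 : 1;
}

const strictSummary = summarizeScores([1.0, 1.0, 0.95, 0.5], 0.8);
console.log(formatResultLine(strictSummary), "exit code", exitCodeFor(strictSummary));
```

Because the verdict and the exit code both read from the same summary object, lowering the threshold (for example to 0.5 for the scores above) flips both together.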