Skip to content

Night Nurse: Research-to-ship loop: pick 3–5 relevant papers (brand governance, brand voice control, style consistency, evals for creative output) → extract 10 concrete enhancement ideas → implement ONE small, testable improvement (docs or code) in BrandOS.#2

Draft
amadad wants to merge 27 commits intomainfrom
night-nurse/001-research-to-ship-loop-pick-35-relevant-papers-bran

Conversation

@amadad
Copy link
Owner

@amadad amadad commented Feb 10, 2026

Summary

Automated implementation by Night Nurse.

Source: backlog
Iterations: 19
Tasks completed: 18
Tasks remaining: 0

Spec

Proposal: Research-Grounded Eval Improvement for BrandOS

Why

BrandOS has a working eval system (rubric → LLM-as-judge → heal loop) but it was built from intuition, not research. The grader prompt is a generic "you are an expert content evaluator" instruction. The rubric dimensions (clarity, engagement, brand_voice, accuracy) are reasonable defaults but lack specificity that would make evaluations reliable and actionable.

Academic research on LLM-as-judge systems, brand voice consistency, and creative output evaluation has matured significantly. There are concrete, proven techniques we're not using — techniques that would make our evaluations more reliable with minimal code change.

Research Survey (5 Papers)

1. Zheng et al. (2023) — "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"

Key finding: Single-point scoring is the least reliable LLM-as-judge mode. Reference-guided grading (providing examples of good/bad output alongside the rubric) significantly improves agreement with human evaluators. Position bias exists: judges favor the first content shown.

Relevance to BrandOS: Our grade_content() uses single-point scoring with no reference examples. Adding reference-guided grading using the brand's examples field would improve reliability.

2. Kim et al. (2024) — "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other LMs"

Key finding: Rubrics with anchor descriptions (what a score of 1, 3, 5 looks like concretely) produce much more calibrated evaluations than rubrics with only dimension descriptions. A rubric that says "brand_voice: Does it match the brand voice?" is far less reliable than one with explicit score anchors.

Relevance to BrandOS: Our RubricDimension has description and criteria but no score anchors. Adding score_anchors: dict[int, str] (e.g., {1: "Contradicts brand tone", 3: "Generally on-brand but inconsistent", 5: "Indistinguishable from brand exemplars"}) is a small model change with large impact.

3. Li et al. (2024) — "Style Over Substance: Evaluation Biases for Large Language Models"

Key finding: LLM judges systematically favor longer, more verbose outputs and outputs with more formatting (headers, bullet points). This means a heal loop that iterates based on LLM-judge feedback will tend to make content longer and more formatted with each iteration — drifting away from platform-native style (280-char tweets become essays).

Relevance to BrandOS: Our heal loop (heal.py) has no length/format guardrails. Content can inflate over iterations. Adding a platform-aware length check between heal iterations would prevent this.

4. Wang et al. (2024) — "CriticBench: Benchmarking LLMs for Critique-Correct Reasoning"

Key finding: The quality of critique (identifying what's wrong) and correction (fixing it) are separate capabilities. Critique accuracy is much higher than correction quality. Systems that separate the critique step from the correction step outperform combined "evaluate and fix" approaches.

Relevance to BrandOS: Our heal loop already separates grade (critique) from improve (correct), which aligns with this research. But the improve prompt doesn't receive the full rubric context — only failed dimensions and suggestions. Passing the rubric anchors into the improve step would help.

Log

Iteration 12 - 04:19:24

Task: 5.1 Create tests/eval/test_rubric.py: test parse_rubric() with and without score_anchors
Result: ✓ Complete

Iteration 13 - 04:20:23

Task: 5.2 Create tests/eval/test_grader.py: test that grade_content() prompt includes anchors when present and omits them when absent (mock LLM)
Result: ✓ Complete

Iteration 14 - 04:21:09

Task: 5.3 Create tests/eval/test_heal.py: test that _improve_content() includes target anchors for failed dimensions (mock LLM)
Result: ✓ Complete

Iteration 15 - 04:24:13

Task: 6.1 Run uv run ruff check src/ and fix any lint issues
Result: ✓ Complete

Iteration 16 - 04:24:20

Task: 6.2 Run uv run ruff format src/
Result: ✓ Complete

Iteration 17 - 04:24:30

Task: 6.3 Run uv run pytest tests/eval/ and confirm all tests pass
Result: ✓ Complete

Iteration 18 - 04:26:32

Task: 6.4 Manual smoke test: uv run brandos eval grade with default rubric, confirm anchors appear in output
Result: ✓ Complete

Result: SUCCESS

Verification: SKIPPED (no diff or spec)

amadad added 27 commits January 28, 2026 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant