Night Nurse: Research-to-ship loop: pick 3–5 relevant papers (brand governance, brand voice control, style consistency, evals for creative output) → extract 10 concrete enhancement ideas → implement ONE small, testable improvement (docs or code) in BrandOS. by amadad · Pull Request #2 · amadad/brandOS

amadad · 2026-02-10T04:26:36Z

Summary

Automated implementation by Night Nurse.

Source: backlog
Iterations: 19
Tasks completed: 18
Tasks remaining: 0

Spec

Proposal: Research-Grounded Eval Improvement for BrandOS

Why

BrandOS has a working eval system (rubric → LLM-as-judge → heal loop) but it was built from intuition, not research. The grader prompt is a generic "you are an expert content evaluator" instruction. The rubric dimensions (clarity, engagement, brand_voice, accuracy) are reasonable defaults but lack specificity that would make evaluations reliable and actionable.

Academic research on LLM-as-judge systems, brand voice consistency, and creative output evaluation has matured significantly. There are concrete, proven techniques we're not using — techniques that would make our evaluations more reliable with minimal code change.

Research Survey (5 Papers)

1. Zheng et al. (2023) — "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"

Key finding: Single-point scoring is the least reliable LLM-as-judge mode. Reference-guided grading (providing examples of good/bad output alongside the rubric) significantly improves agreement with human evaluators. Position bias exists: judges favor the first content shown.

Relevance to BrandOS: Our grade_content() uses single-point scoring with no reference examples. Adding reference-guided grading using the brand's examples field would improve reliability.

2. Kim et al. (2024) — "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other LMs"

Key finding: Rubrics with anchor descriptions (what a score of 1, 3, 5 looks like concretely) produce much more calibrated evaluations than rubrics with only dimension descriptions. A rubric that says "brand_voice: Does it match the brand voice?" is far less reliable than one with explicit score anchors.

Relevance to BrandOS: Our RubricDimension has description and criteria but no score anchors. Adding score_anchors: dict[int, str] (e.g., {1: "Contradicts brand tone", 3: "Generally on-brand but inconsistent", 5: "Indistinguishable from brand exemplars"}) is a small model change with large impact.

3. Li et al. (2024) — "Style Over Substance: Evaluation Biases for Large Language Models"

Key finding: LLM judges systematically favor longer, more verbose outputs and outputs with more formatting (headers, bullet points). This means a heal loop that iterates based on LLM-judge feedback will tend to make content longer and more formatted with each iteration — drifting away from platform-native style (280-char tweets become essays).

Relevance to BrandOS: Our heal loop (heal.py) has no length/format guardrails. Content can inflate over iterations. Adding a platform-aware length check between heal iterations would prevent this.

4. Wang et al. (2024) — "CriticBench: Benchmarking LLMs for Critique-Correct Reasoning"

Key finding: The quality of critique (identifying what's wrong) and correction (fixing it) are separate capabilities. Critique accuracy is much higher than correction quality. Systems that separate the critique step from the correction step outperform combined "evaluate and fix" approaches.

Relevance to BrandOS: Our heal loop already separates grade (critique) from improve (correct), which aligns with this research. But the improve prompt doesn't receive the full rubric context — only failed dimensions and suggestions. Passing the rubric anchors into the improve step would help.

Log

Iteration 12 - 04:19:24

Task: 5.1 Create tests/eval/test_rubric.py: test parse_rubric() with and without score_anchors
Result: ✓ Complete

Iteration 13 - 04:20:23

Task: 5.2 Create tests/eval/test_grader.py: test that grade_content() prompt includes anchors when present and omits them when absent (mock LLM)
Result: ✓ Complete

Iteration 14 - 04:21:09

Task: 5.3 Create tests/eval/test_heal.py: test that _improve_content() includes target anchors for failed dimensions (mock LLM)
Result: ✓ Complete

Iteration 15 - 04:24:13

Task: 6.1 Run uv run ruff check src/ and fix any lint issues
Result: ✓ Complete

Iteration 16 - 04:24:20

Task: 6.2 Run uv run ruff format src/
Result: ✓ Complete

Iteration 17 - 04:24:30

Task: 6.3 Run uv run pytest tests/eval/ and confirm all tests pass
Result: ✓ Complete

Iteration 18 - 04:26:32

Task: 6.4 Manual smoke test: uv run brandos eval grade with default rubric, confirm anchors appear in output
Result: ✓ Complete

Result: SUCCESS

Verification: SKIPPED (no diff or spec)

…via Ralph loop

…/grader.py`

…Exemplars | None` with markdown parsing

…kquotes, malformed markdown

…rs each

…VoiceExemplars` parameter

… Off-Brand Examples` sections

…hars good + 500 chars bad = 1500 total cap

…ines and examples

…rs()` when `brand` is provided

…nt and absent

…lars when `brand` is provided

… / "Voice Anti-Pattern" in healing prompt

… with machine-parseable structure

…by the grader

…voice_exemplars()`

…les from well-formed voice-guide

…ing/empty file

…es + 300 char limit

…en provided

…tal budget

…en brand has voice-guide

…nd has no voice-guide

…xemplars

…rs-bran

amadad added 27 commits January 28, 2026 15:05

feat(001-implement-notification-via-slack-sdk-or-email): implemented …

d4fc943

…via Ralph loop

night-nurse: 1.1 Add VoiceExemplars dataclass to `src/brand_os/eval…

0ef96e5

…/grader.py`

night-nurse: 1.2 Implement `load_voice_exemplars(brand: str) -> Voice…

10eddf5

…Exemplars | None` with markdown parsing

night-nurse: 1.3 Handle edge cases: missing file, empty file, no bloc…

41bd6b1

…kquotes, malformed markdown

night-nurse: 1.4 Enforce limits: max 3 good + 3 bad examples, 300 cha…

a4334d0

…rs each

night-nurse: 2.1 Extend _build_voice_context() to accept optional `…

c9be1ed

…VoiceExemplars` parameter

night-nurse: 2.2 Format exemplars as ### On-Brand Examples and `###…

40176c4

… Off-Brand Examples` sections

night-nurse: 2.3 Implement token budget: 500 chars definition + 500 c…

679de73

…hars good + 500 chars bad = 1500 total cap

night-nurse: 2.4 Update the instruction line to reference both guidel…

a5deca7

…ines and examples

night-nurse: 3.1 Update grade_content() to call `load_voice_exempla…

83acd72

…rs()` when `brand` is provided

night-nurse: 3.2 Pass exemplars to _build_voice_context()

1e7a544

night-nurse: 3.3 Verify prompt renders correctly with exemplars prese…

e536d3f

…nt and absent

night-nurse: 4.1 Update _improve_content() to load and inject exemp…

9648142

…lars when `brand` is provided

night-nurse: 4.2 Format exemplars as "Voice Reference (target style)"…

80d9075

… / "Voice Anti-Pattern" in healing prompt

night-nurse: 5.1 Rewrite brands/_template/references/voice-guide.md…

c03fc56

… with machine-parseable structure

night-nurse: 5.2 Add HTML comments explaining how exemplars are used …

def0d6d

…by the grader

night-nurse: 5.3 Verify the empty template returns None from `load_…

17d03cb

…voice_exemplars()`

night-nurse: 6.1 Test load_voice_exemplars(): parses good/bad examp…

e9bc513

…les from well-formed voice-guide

night-nurse: 6.2 Test load_voice_exemplars(): returns None for miss…

945290c

…ing/empty file

night-nurse: 6.3 Test load_voice_exemplars(): respects max 3 exampl…

72ac5fe

…es + 300 char limit

night-nurse: 6.4 Test _build_voice_context(): includes exemplars wh…

ea5eb4e

…en provided

night-nurse: 6.5 Test _build_voice_context(): respects 1500 char to…

94fd137

…tal budget

night-nurse: 6.6 Test grade_content(): prompt includes exemplars wh…

645d8b2

…en brand has voice-guide

night-nurse: 6.7 Test grade_content(): backward compatible when bra…

505911c

…nd has no voice-guide

night-nurse: 6.8 Test _improve_content(): healing prompt includes e…

b4a69c0

…xemplars

night-nurse: 7.1 uv run ruff check src/ — no lint errors

091436e

night-nurse: finalize 001-research-to-ship-loop-pick-35-relevant-pape…

4d04373

…rs-bran

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Night Nurse: Research-to-ship loop: pick 3–5 relevant papers (brand governance, brand voice control, style consistency, evals for creative output) → extract 10 concrete enhancement ideas → implement ONE small, testable improvement (docs or code) in BrandOS.#2

amadad commented Feb 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

amadad commented Feb 10, 2026

Summary

Spec

Proposal: Research-Grounded Eval Improvement for BrandOS

Why

Research Survey (5 Papers)

1. Zheng et al. (2023) — "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"

2. Kim et al. (2024) — "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other LMs"

3. Li et al. (2024) — "Style Over Substance: Evaluation Biases for Large Language Models"

4. Wang et al. (2024) — "CriticBench: Benchmarking LLMs for Critique-Correct Reasoning"

Log

Iteration 12 - 04:19:24

Iteration 13 - 04:20:23

Iteration 14 - 04:21:09

Iteration 15 - 04:24:13

Iteration 16 - 04:24:20

Iteration 17 - 04:24:30

Iteration 18 - 04:26:32

Result: SUCCESS

Verification: SKIPPED (no diff or spec)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant