feat(darwin-skill): heterogeneous validation + tie-break strategy + stopping principles by definewbie · Pull Request #2 · alchaincyf/darwin-skill

definewbie · 2026-04-15T08:35:00Z

Background

This PR introduces the first heterogeneous validation of Darwin Skill across three distinct skill types:

Low-score minimal skill (obsidian)
Mid-score resource-heavy skill (popular-web-designs)
High-score mature skill (systematic-debugging, control group)

Goal: validate whether Darwin can:

Repair low-quality skills (structure building)
Improve mid-quality skills (non-invasive enhancement)
Avoid over-optimizing high-quality skills

Experiment Setup

Rubric: full 8-dimension scoring (structure + effectiveness)
Evaluation: dry_run (due to sub-agent constraints)
Constraints:
- Single-dimension per round
- File size ≤ 150% (low-score) / ≤110% (mid-score)
- Same prompts / same rubric (closed-loop consistency)
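
The size constraint above can be checked mechanically before a round is accepted. The following is a minimal sketch, not the skill's actual implementation: the ceilings (150% and 110%) come from the constraints listed here, while the function name, the tier labels `low` and `mid`, and the byte-based measurement are assumptions made for illustration.

```python
# Hypothetical helper; only the 150% / 110% ceilings come from the PR text.
SIZE_CEILING_PCT = {"low": 150, "mid": 110}  # max size as % of the original file


def within_size_budget(original_bytes: int, new_bytes: int, tier: str) -> bool:
    """Return True if the edited skill stays within its tier's size ceiling."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return new_bytes * 100 <= original_bytes * SIZE_CEILING_PCT[tier]
```

For example, a 1,000-byte low-score skill may grow to 1,500 bytes, while a mid-score skill of the same size is capped at 1,100 bytes.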

Results

1. obsidian (31.7 → 51.2, +61.5%)

Type: command pile → executable workflow

Key improvements:

Added checkpoints (R1, R2)
Introduced minimal workflow (R3)

Outcome:

Structure formed (workflow + safety)
Stopped due to round limit + size ceiling (149%)

2. popular-web-designs (71.5 → 76.55, +5.05 points, +7.1%)

Type: high-quality templates, weak execution guidance

Key improvements:

Added 3 inline checkpoints (non-invasive)
No template or structure changes

Outcome:

Checkpoints dimension score: 4 → 7.5
Zero negative impact on other dimensions
File size: +255 B only (102.7% of original)

3. systematic-debugging (~88)

Control group, no optimization applied.
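
As a quick arithmetic check on the headline numbers above, the sketch below computes both the absolute point gain and the relative percentage gain from the before/after scores. The helper name is illustrative; only the four scores come from the results.

```python
def relative_gain(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


scores = [("obsidian", 31.7, 51.2), ("popular-web-designs", 71.5, 76.55)]
for name, before, after in scores:
    print(f"{name}: {after - before:+.2f} points, {relative_gain(before, after):+.1f}%")
# obsidian: +19.50 points, +61.5%
# popular-web-designs: +5.05 points, +7.1%
```

Reporting both figures keeps absolute points and relative percentages from being confused: a 5.05-point gain on a 71.5 baseline is about 7.1% in relative terms.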

Key Findings

Darwin handles two distinct failure modes:
- "No structure" → structural reconstruction
- "Good content, weak guardrails" → minimal enhancement
Tie-break strategy is critical:
- When dimensions tie, prioritize structural leverage over raw score
Diminishing returns detected:
- After 3 rounds or near size ceiling, further optimization is inefficient
Size constraint acts as effective regularization:
- Prevents over-engineering and verbosity

New Mechanisms Added

1. Tie-Break Strategy

Resolve equal-score dimensions based on:

structural leverage
previous round delta
core bottleneck identification

2. Stopping Principles

Stop optimization when:

round limit reached
file size ≥ threshold
remaining weak dimensions are low-leverage
marginal gain < expected complexity cost

Artifacts

results.tsv: experiment records
workspace/EXPERIMENT-SUMMARY.md: full experiment details

Conclusion

Darwin is now validated for:

low-score skill reconstruction
mid-score skill enhancement (safe mode)

Next step: validate on real sub-agent execution (full_test mode)

- Extract Rubric to references/rubric.md (lazy-load, save ~430 tokens)
- Replace Rubric section with quick-ref table + load instruction
- Fix screenshot.mjs hardcoded playwright path + execSync injection
- Adapt paths from .claude/skills to ~/.hermes/skills
- Add scripts/lint-skill.sh static checker
- Clean workspace stale experiment data

…ding timing, compressed phases, result-card extracted
- Move constraints before Phase 0 (was after Phase 3)
- Add preflight check table with degradation strategy
- Add scoring closed-loop consistency rule (评分闭环一致性, constraint #7)
- Clarify rubric loading: must reload before every evaluation round
- Move strategy library next to Phase 2
- Compress Phase descriptions: remove ASCII art templates, keep checklist
- Extract result-card guide to references/result-card-guide.md
- Compress usage modes to dispatch table
- Compress results.tsv format to field list
- SKILL.md: 12,547 chars → 8,086 chars (-35.6%, 4,461 chars saved)

1. Preflight: unify semantics; remove mixed shell/API/abstract commands, use pure check descriptions
2. Rubric fallback: explicit fallback path in Phase 1/2, add rubric_mode column to results.tsv
3. Phase 2: replace pseudocode with numbered execution protocol (10-step checklist)
4. Phase 0.5: add 3 hard rules for test prompt design (min 2 / max 3, must cover core, locked prompts)
5. results.tsv: add round + rubric_mode columns
6. Phase 3: add decision-layer conclusion table (4 categories: continue/local-opt/rewrite/unreliable)

1. Phase 2: round must be read from results.tsv, not inferred
2. Phase 1: test-prompts.json locked after baseline, no regeneration
3. Phase 2 step 7: score drift must be annotated in note field
4. Preflight: single skill failure → skip (not block), add status=skip
5. Phase 2.5: exploration rewrite only when eval_mode=full_test

SKILL.md: 9,447 → 10,315 chars (+9.2%)
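
The rule that the round number is read from results.tsv rather than inferred could look roughly like the sketch below. It assumes a tab-separated file with a header row; the `round` column is named in these commits, but the `skill` column name, the convention that the baseline is round 0, and the sample rows are all assumptions for illustration.

```python
import csv
import io


def next_round(tsv_text: str, skill: str) -> int:
    """Return the next round number for `skill` based on recorded rows.

    Assumes `skill` and `round` columns; `skill` is a hypothetical name.
    Returns 0 (taken here to mean the baseline) when no rows exist yet.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rounds = [int(row["round"]) for row in rows if row["skill"] == skill]
    return max(rounds, default=-1) + 1


# Made-up sample data, not taken from the PR's results.tsv.
sample = (
    "skill\tround\trubric_mode\n"
    "obsidian\t0\tfull\n"
    "obsidian\t1\tfull\n"
    "popular-web-designs\t0\tfull\n"
)
print(next_round(sample, "obsidian"))  # 2
```

Deriving the round from the recorded rows means a resumed or interrupted run cannot silently reuse or skip a round number.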

4-step decision framework for tied dimensions:
1. Exclude low-value dims (resources, frontmatter in minimal skills)
2. Structural leverage priority: workflow > checkpoint > boundary > specificity
3. Recent delta: Δ=0 → downgrade, Δ>0 → can continue
4. Core contradiction: identify skill's main problem

Also: Phase 2 opening line now references Tie-Break on ties.
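
One possible reading of the four-step framework, expressed as a sort key, is sketched below. The dimension names and the leverage order come from the commit message; how the steps combine (stalled dimensions are pushed behind the rest, then leverage order, then the core problem as a final preference) is an assumption, as are the function and parameter names.

```python
# Order taken from the commit message: workflow > checkpoint > boundary > specificity.
LEVERAGE_ORDER = ["workflow", "checkpoint", "boundary", "specificity"]
LOW_VALUE_IN_MINIMAL = {"resources", "frontmatter"}


def break_tie(tied, last_delta=None, minimal_skill=False, core_problem=None):
    """Pick one dimension to optimize among equally scored ones.

    tied:          dimension names with equal scores
    last_delta:    dict of dimension -> score change in its most recent round
    minimal_skill: True if the skill is a minimal one (enables step 1)
    core_problem:  dimension judged to be the skill's main problem, if any
    """
    last_delta = last_delta or {}
    # Step 1: drop low-value dimensions for minimal skills.
    candidates = [d for d in tied if not (minimal_skill and d in LOW_VALUE_IN_MINIMAL)]
    if not candidates:  # everything was excluded; fall back to the full set
        candidates = list(tied)

    def key(dim):
        stalled = last_delta.get(dim) == 0           # step 3: delta of 0 is downgraded
        rank = (LEVERAGE_ORDER.index(dim)            # step 2: structural leverage
                if dim in LEVERAGE_ORDER else len(LEVERAGE_ORDER))
        off_core = dim != core_problem               # step 4: prefer the core problem
        return (stalled, rank, off_core)

    return min(candidates, key=key)
```

With this ordering, `break_tie(["specificity", "checkpoint", "resources"], minimal_skill=True)` selects `checkpoint`, and a `workflow` dimension whose last round produced no gain yields to `checkpoint`.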

4 conditions for stopping and escalating to manual review:
- 3 rounds reached
- file size >= 145% of original
- remaining weak dims are low-leverage
- further optimization likely adds complexity not structure

Also: stopped skill marked manual_review_needed with KEEP-<skill>.md record.
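
The four stopping conditions can be collected into a single check, sketched below. The thresholds (3 rounds, 145%) come from the commit message; treating any single condition as sufficient to stop is an assumption, and the `SkillState` type, its field names, and the reason labels are hypothetical. The last two conditions are judgment calls, so they are modeled as flags supplied by the evaluator.

```python
from dataclasses import dataclass

MAX_ROUNDS = 3        # from the commit message
SIZE_STOP_PCT = 145   # stop at >= 145% of the original size


@dataclass
class SkillState:
    rounds_done: int
    original_bytes: int
    current_bytes: int
    weak_dims_low_leverage: bool  # evaluator's judgment
    adds_complexity_only: bool    # evaluator's judgment


def stop_reasons(state: SkillState) -> list[str]:
    """Return the stopping conditions that currently hold (empty list: continue)."""
    reasons = []
    if state.rounds_done >= MAX_ROUNDS:
        reasons.append("round_limit")
    if state.current_bytes * 100 >= state.original_bytes * SIZE_STOP_PCT:
        reasons.append("size_ceiling")
    if state.weak_dims_low_leverage:
        reasons.append("low_leverage_remaining")
    if state.adds_complexity_only:
        reasons.append("complexity_without_structure")
    return reasons
```

A non-empty result would correspond to marking the skill `manual_review_needed`; returning the reasons rather than a bare boolean leaves something concrete to write into the KEEP-<skill>.md record.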

definewbie · 2026-04-15T13:59:26Z

I made some optimizations for the hermes agent and hope to learn from the experts here.

definewbie added 10 commits April 15, 2026 08:46

baseline: before philosophy trim experiment

0aaac70

trim: remove 设计哲学 (Design Philosophy) and 设计灵感 (Design Inspiration) sections, replace with 2-line 核心原则 (Core Principles)

9adea0b

Revert "trim: remove 设计哲学 (Design Philosophy) and 设计灵感 (Design Inspiration) sections, replace with 2-line 核心原则 (Core Principles)"

c2dee77

This reverts commit 9adea0b.

Reapply "trim: remove 设计哲学 (Design Philosophy) and 设计灵感 (Design Inspiration) sections, replace with 2-line 核心原则 (Core Principles)"

40b0fff

This reverts commit c2dee77.

definewbie wants to merge 10 commits into alchaincyf:master from definewbie:feat/heterogeneous-validation