feat(darwin-skill): heterogeneous validation + tie-break strategy + stopping principles #2

Open
definewbie wants to merge 10 commits into alchaincyf:master from definewbie:feat/heterogeneous-validation

@definewbie

Background

This PR introduces the first heterogeneous validation of Darwin Skill across three distinct skill types:

  • Low-score minimal skill (obsidian)
  • Mid-score resource-heavy skill (popular-web-designs)
  • High-score mature skill (systematic-debugging, control group)

Goal: validate whether Darwin can:

  1. Repair low-quality skills (structure building)
  2. Improve mid-quality skills (non-invasive enhancement)
  3. Avoid over-optimizing high-quality skills

Experiment Setup

  • Rubric: full 8-dimension scoring (structure + effectiveness)
  • Evaluation: dry_run (due to sub-agent constraints)
  • Constraints:
    • Single-dimension per round
    • File size ≤ 150% of original (low-score) / ≤ 110% of original (mid-score)
    • Same prompts / same rubric (closed-loop consistency)
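
The size constraint above can be sketched as a simple guard. This is a minimal illustration, not the skill's actual implementation: the function name, the tier labels `low` and `mid`, and the use of byte counts are assumptions; only the 150% and 110% ceilings come from the constraint list.

```python
# Ceilings from the constraint list, as integer percentages of the original size.
# The tier labels "low" and "mid" are hypothetical names for the two skill classes.
SIZE_CEILING_PCT = {"low": 150, "mid": 110}


def within_size_ceiling(original_bytes: int, current_bytes: int, tier: str) -> bool:
    """Return True if the edited file stays at or below the ceiling for its tier."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    # Integer arithmetic avoids floating-point surprises at the exact boundary.
    return current_bytes * 100 <= original_bytes * SIZE_CEILING_PCT[tier]


print(within_size_ceiling(1000, 1490, "low"))  # True: 149% is under the 150% ceiling
print(within_size_ceiling(1000, 1101, "mid"))  # False: just over the 110% ceiling
```

A round whose edit fails this check would be rejected or reverted before scoring, which is how the ceiling can act as a hard limit rather than a soft preference.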

Results

1. obsidian (31.7 → 51.2, +61.5%)

Type: command pile → executable workflow

Key improvements:

  • Added checkpoints (R1, R2)
  • Introduced minimal workflow (R3)

Outcome:

  • Structure formed (workflow + safety)
  • Stopped due to round limit + size ceiling (149%)

2. popular-web-designs (71.5 → 76.55, +5.05 points, +7.1%)

Type: high-quality templates, weak execution guidance

Key improvements:

  • Added 3 inline checkpoints (non-invasive)
  • No template or structure changes

Outcome:

  • Checkpoint dimension score: 4 → 7.5
  • Zero negative dimension impact
  • +255 bytes only (102.7% of original size)

3. systematic-debugging (~88)

Control group, no optimization applied.
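
For reference, the gains for the two optimized skills can be recomputed from the before/after scores. The helper name below is hypothetical; the scores are the ones reported in the results above.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


# obsidian: 31.7 -> 51.2
print(round(51.2 - 31.7, 2), round(relative_gain(31.7, 51.2), 1))    # 19.5 61.5

# popular-web-designs: 71.5 -> 76.55
print(round(76.55 - 71.5, 2), round(relative_gain(71.5, 76.55), 1))  # 5.05 7.1
```

Note that 5.05 is the absolute gain in points for popular-web-designs; on the same relative basis used for obsidian, the improvement is about 7.1%.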


Key Findings

  1. Darwin handles two distinct failure modes:

    • "No structure" → structural reconstruction
    • "Good content, weak guardrails" → minimal enhancement
  2. Tie-break strategy is critical:

    • When dimensions tie, prioritize structural leverage over raw score
  3. Diminishing returns detected:

    • After 3 rounds, or once the file nears its size ceiling, further optimization yields little additional gain
  4. Size constraint acts as effective regularization:

    • Prevents over-engineering and verbosity

New Mechanisms Added

1. Tie-Break Strategy

Resolve equal-score dimensions based on:

  • structural leverage
  • previous round delta
  • core bottleneck identification

2. Stopping Principles

Stop optimization when:

  • round limit reached
  • file size ≥ threshold
  • remaining weak dimensions are low-leverage
  • marginal gain < expected complexity cost
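
The four stopping conditions can be sketched as a single check that reports which of them fired. This is an illustrative sketch under stated assumptions: the `RoundState` fields, the reason labels, and the way gain and cost are estimated are all hypothetical. The defaults of 3 rounds and 145% mirror the values given in this PR's commit notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoundState:
    """Hypothetical snapshot of one skill after an optimization round."""
    rounds_done: int
    size_ratio: float              # current size / original size
    weak_dims_low_leverage: bool   # True if only low-leverage dimensions remain weak
    expected_gain: float           # estimated score gain of one more round
    complexity_cost: float         # estimated cost of that round, in the same units


def stop_reasons(state: RoundState, max_rounds: int = 3,
                 size_threshold: float = 1.45) -> List[str]:
    """Return every stopping condition that applies; an empty list means continue."""
    reasons = []
    if state.rounds_done >= max_rounds:
        reasons.append("round_limit")
    if state.size_ratio >= size_threshold:
        reasons.append("size_threshold")
    if state.weak_dims_low_leverage:
        reasons.append("low_leverage")
    if state.expected_gain < state.complexity_cost:
        reasons.append("marginal_gain")
    return reasons


# Shaped like the obsidian run: 3 rounds done, file at 149% of original size.
print(stop_reasons(RoundState(3, 1.49, False, 1.0, 0.5)))
# ['round_limit', 'size_threshold']
```

Returning the list of reasons rather than a bare boolean makes it straightforward to record why a skill was stopped.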

Artifacts

  • results.tsv: experiment records
  • workspace/EXPERIMENT-SUMMARY.md: full experiment details

Conclusion

Darwin is now validated for:

  • low-score skill reconstruction
  • mid-score skill enhancement (safe mode)

Next step: validate on real sub-agent execution (full_test mode)

- Extract Rubric to references/rubric.md (lazy-load, save ~430 tokens)
- Replace Rubric section with quick-ref table + load instruction
- Fix screenshot.mjs hardcoded playwright path + execSync injection
- Adapt paths from .claude/skills to ~/.hermes/skills
- Add scripts/lint-skill.sh static checker
- Clean workspace stale experiment data
…ding timing, compressed phases, result-card extracted

- Move constraints before Phase 0 (was after Phase 3)
- Add preflight check table with degradation strategy
- Add scoring closed-loop consistency rule (constraint alchaincyf#7)
- Clarify rubric loading: must reload before every evaluation round
- Move strategy library next to Phase 2
- Compress Phase descriptions: remove ASCII art templates, keep checklist
- Extract result-card guide to references/result-card-guide.md
- Compress usage modes to dispatch table
- Compress results.tsv format to field list
- SKILL.md: 12,547 chars → 8,086 chars (-35.6%, 4,461 chars saved)

1. Preflight: unify semantics by removing mixed shell/API/abstract commands; use pure check descriptions
2. Rubric fallback: explicit fallback path in Phase 1/2, add rubric_mode column to results.tsv
3. Phase 2: replace pseudocode with numbered execution protocol (10-step checklist)
4. Phase 0.5: add 3 hard rules for test prompt design (min2/max3, must cover core, locked prompts)
5. results.tsv: add round + rubric_mode columns
6. Phase 3: add decision-layer conclusion table (4 categories: continue/local-opt/rewrite/unreliable)

1. Phase 2: round must be read from results.tsv, not inferred
2. Phase 1: test-prompts.json locked after baseline, no regeneration
3. Phase 2 step 7: score drift must be annotated in note field
4. Preflight: single skill failure → skip (not block), add status=skip
5. Phase 2.5: exploration rewrite only when eval_mode=full_test
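
As an illustration of the first rule in this list, the next round number can be derived from the recorded rows instead of being inferred from conversation state. This is a sketch only: the `skill` and `score` columns, the `rubric_mode` value `full`, the convention that round 0 is the baseline, and the sample rows are all assumptions made for the example. Only the `round` and `rubric_mode` column names come from the commit notes.

```python
import csv
import io

# Made-up sample rows in an assumed results.tsv layout.
SAMPLE = (
    "skill\tround\tscore\trubric_mode\n"
    "demo-skill\t0\t30.0\tfull\n"
    "demo-skill\t1\t35.0\tfull\n"
)


def next_round(tsv_text: str, skill: str) -> int:
    """Return the round number to run next for `skill`, based on recorded rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rounds = [int(row["round"]) for row in reader if row["skill"] == skill]
    # With no rows recorded yet, start at round 0 (assumed to be the baseline).
    return max(rounds, default=-1) + 1


print(next_round(SAMPLE, "demo-skill"))   # 2
print(next_round(SAMPLE, "other-skill"))  # 0
```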

SKILL.md: 9,447 → 10,315 chars (+9.2%)

4-step decision framework for tied dimensions:
1. Exclude low-value dims (resources, frontmatter in minimal skills)
2. Structural leverage priority: workflow > checkpoint > boundary > specificity
3. Recent delta: Δ=0 → downgrade, Δ>0 → can continue
4. Core contradiction: identify skill's main problem

Also: Phase 2 opening line now references Tie-Break on ties.
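
One possible reading of the 4-step framework is a filter followed by a lexicographic sort key. The commit message does not say how the steps combine, so applying them strictly in the listed order is an assumption, as are the function name, the argument shapes, and the treatment of dimensions outside the four named leverage classes (they share the lowest leverage rank).

```python
from typing import Dict, List, Optional

# Step 2 ordering from the commit message: workflow > checkpoint > boundary > specificity.
LEVERAGE_ORDER = ["workflow", "checkpoint", "boundary", "specificity"]
# Step 1: dimensions treated as low-value in minimal skills.
LOW_VALUE_IN_MINIMAL = {"resources", "frontmatter"}


def pick_dimension(tied: List[str], last_delta: Dict[str, float],
                   core_problem: Optional[str] = None,
                   minimal_skill: bool = True) -> Optional[str]:
    """Choose one dimension among those tied on score (hypothetical sketch)."""
    # Step 1: exclude low-value dimensions.
    candidates = [d for d in tied
                  if not (minimal_skill and d in LOW_VALUE_IN_MINIMAL)]
    if not candidates:
        return None

    def sort_key(dim: str):
        # Step 2: a smaller index means higher structural leverage.
        leverage = (LEVERAGE_ORDER.index(dim) if dim in LEVERAGE_ORDER
                    else len(LEVERAGE_ORDER))
        # Step 3: a dimension whose last round produced no change is downgraded.
        stalled = 1 if last_delta.get(dim) == 0 else 0
        # Step 4: prefer the dimension identified as the skill's main problem.
        not_core = 0 if dim == core_problem else 1
        return (leverage, stalled, not_core)

    return min(candidates, key=sort_key)


print(pick_dimension(["frontmatter", "checkpoint", "workflow"], {}))  # workflow
```

With this ordering, steps 3 and 4 only decide between candidates of equal structural leverage. A variant that demotes stalled dimensions before ranking by leverage would be an equally plausible reading.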
4 conditions for stopping and escalating to manual review:
- 3 rounds reached
- file size >= 145% of original
- remaining weak dims are low-leverage
- further optimization likely adds complexity not structure

Also: stopped skill marked manual_review_needed with KEEP-<skill>.md record.
@definewbie
Author

I made some optimizations on the hermes agent and hope to learn from the experts here.
