feat(darwin-skill): heterogeneous validation + tie-break strategy + stopping principles#2
Open
definewbie wants to merge 10 commits into
This reverts commit 9adea0b.
This reverts commit c2dee77.
- Extract Rubric to references/rubric.md (lazy-load, save ~430 tokens)
- Replace Rubric section with quick-ref table + load instruction
- Fix screenshot.mjs hardcoded playwright path + execSync injection
- Adapt paths from .claude/skills to ~/.hermes/skills
- Add scripts/lint-skill.sh static checker
- Clean workspace stale experiment data
…ding timing, compressed phases, result-card extracted
- Move constraints before Phase 0 (was after Phase 3)
- Add preflight check table with degradation strategy
- Add scoring closed-loop consistency (评分闭环一致性) rule (constraint alchaincyf#7)
- Clarify rubric loading: must reload before every evaluation round
- Move strategy library next to Phase 2
- Compress Phase descriptions: remove ASCII art templates, keep checklist
- Extract result-card guide to references/result-card-guide.md
- Compress usage modes to dispatch table
- Compress results.tsv format to field list
- SKILL.md: 12,547 chars → 8,086 chars (-35.6%, 4,461 chars saved)
1. Preflight: unify semantics. Remove mixed shell/API/abstract commands, use pure check descriptions
2. Rubric fallback: explicit fallback path in Phase 1/2, add rubric_mode column to results.tsv
3. Phase 2: replace pseudocode with numbered execution protocol (10-step checklist)
4. Phase 0.5: add 3 hard rules for test prompt design (min 2 / max 3, must cover core, locked prompts)
5. results.tsv: add round + rubric_mode columns
6. Phase 3: add decision-layer conclusion table (4 categories: continue/local-opt/rewrite/unreliable)
1. Phase 2: round must be read from results.tsv, not inferred
2. Phase 1: test-prompts.json locked after baseline, no regeneration
3. Phase 2 step 7: score drift must be annotated in note field
4. Preflight: single skill failure → skip (not block), add status=skip
5. Phase 2.5: exploration rewrite only when eval_mode=full_test

SKILL.md: 9,447 → 10,315 chars (+9.2%)
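Item 1 above (read the round from results.tsv rather than inferring it) can be illustrated with a minimal Python sketch. The column names `skill`, `status`, and `note`, the values `full`, `fallback`, and `keep`, and the helper `next_round` are assumptions made for illustration; only the `round` and `rubric_mode` columns and `status=skip` come from the commit messages, and the actual file layout may differ.

```python
import csv
import io


def next_round(tsv_text: str, skill: str) -> int:
    """Return 1 + the highest round recorded for `skill`, or 1 if it has no rows.

    The round is derived from the recorded rows only, never from how many
    iterations the caller believes it has run.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rounds = [int(row["round"]) for row in reader if row["skill"] == skill]
    return max(rounds, default=0) + 1


# Hypothetical sample data; the real results.tsv may use other columns and values.
SAMPLE = (
    "skill\tround\trubric_mode\tstatus\tnote\n"
    "obsidian\t1\tfull\tkeep\t\n"
    "obsidian\t2\tfallback\tkeep\tscore drift noted\n"
    "popular-web-designs\t1\tfull\tkeep\t\n"
)

if __name__ == "__main__":
    print(next_round(SAMPLE, "obsidian"))
```

Reading the maximum recorded round keeps the counter correct even if a session is interrupted and resumed.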
4-step decision framework for tied dimensions:
1. Exclude low-value dims (resources, frontmatter in minimal skills)
2. Structural leverage priority: workflow > checkpoint > boundary > specificity
3. Recent delta: Δ=0 → downgrade, Δ>0 → can continue
4. Core contradiction: identify skill's main problem

Also: Phase 2 opening line now references Tie-Break on ties.
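The four steps can be read as successive tie-breakers. The Python sketch below is one possible interpretation, not the skill's actual logic: it assumes the steps apply in the listed order, and the names `Dim`, `break_tie`, and `core_problem` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Leverage order from the commit message; a higher number means more leverage.
LEVERAGE = {"workflow": 4, "checkpoint": 3, "boundary": 2, "specificity": 1}
# Dimensions the commit message names as low-value in minimal skills.
LOW_VALUE = {"resources", "frontmatter"}


@dataclass
class Dim:
    name: str
    score: float
    last_delta: float  # score change in the most recent round


def break_tie(tied: Sequence[Dim], core_problem: Optional[str] = None) -> Dim:
    """Pick one dimension to optimize among dimensions with equal scores."""
    # Step 1: drop low-value dimensions, unless that would leave nothing.
    candidates = [d for d in tied if d.name not in LOW_VALUE] or list(tied)
    # Steps 2-4 as a lexicographic key (assumed ordering):
    # structural leverage, then recent progress (delta > 0),
    # then whether the dimension matches the skill's main problem.
    return max(
        candidates,
        key=lambda d: (
            LEVERAGE.get(d.name, 0),
            d.last_delta > 0,
            d.name == core_problem,
        ),
    )


if __name__ == "__main__":
    tied = [Dim("resources", 5.0, 1.0), Dim("boundary", 5.0, 0.0), Dim("workflow", 5.0, 0.0)]
    print(break_tie(tied).name)
```

If the intended rule lets a stalled dimension (Δ=0) lose to a lower-leverage one that is still improving, the first two key entries would be swapped.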
4 conditions for stopping and escalating to manual review:
- 3 rounds reached
- file size >= 145% of original
- remaining weak dims are low-leverage
- further optimization likely adds complexity not structure

Also: stopped skill marked manual_review_needed with KEEP-<skill>.md record.
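A minimal Python sketch of these checks follows. It assumes any single condition is enough to stop (the commit message does not say whether they combine with "any" or "all"), and treats the last two conditions as judgments supplied by the evaluator. `SkillState`, `stop_reasons`, and the reason labels are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

MAX_ROUNDS = 3          # "3 rounds reached"
MAX_SIZE_PERCENT = 145  # "file size >= 145% of original"


@dataclass
class SkillState:
    rounds_done: int
    original_chars: int
    current_chars: int
    weak_dims_low_leverage: bool         # evaluator judgment, not computed here
    adds_complexity_not_structure: bool  # evaluator judgment, not computed here


def stop_reasons(s: SkillState) -> List[str]:
    """Return the stopping conditions that currently hold (empty list = continue)."""
    reasons = []
    if s.rounds_done >= MAX_ROUNDS:
        reasons.append("max_rounds")
    # Integer arithmetic avoids float rounding exactly at the 145% boundary.
    if s.current_chars * 100 >= MAX_SIZE_PERCENT * s.original_chars:
        reasons.append("size_limit")
    if s.weak_dims_low_leverage:
        reasons.append("low_leverage_remaining")
    if s.adds_complexity_not_structure:
        reasons.append("complexity_over_structure")
    return reasons


if __name__ == "__main__":
    state = SkillState(1, 10_000, 14_500, False, False)
    print(stop_reasons(state))
```

A non-empty result would correspond to marking the skill manual_review_needed and writing its KEEP-<skill>.md record.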
I made some optimizations on hermes agent and hope to learn from the experts here.
Background
This PR introduces the first heterogeneous validation of Darwin Skill across three distinct skill types:
Goal: validate whether Darwin can:
Experiment Setup
Results
1. obsidian (31.7 → 51.2, +61.5%)
Type: command pile → executable workflow
Key improvements:
Outcome:
2. popular-web-designs (71.5 → 76.55, +5.05 points, +7.1%)
Type: high-quality templates, weak execution guidance
Key improvements:
Outcome:
3. systematic-debugging (~88)
Control group, no optimization applied.
Key Findings
Darwin handles two distinct failure modes:
Tie-break strategy is critical:
Diminishing returns detected:
Size constraint acts as effective regularization:
New Mechanisms Added
1. Tie-Break Strategy
Resolve equal-score dimensions based on:
2. Stopping Principles
Stop optimization when:
Artifacts
results.tsv: experiment records
workspace/EXPERIMENT-SUMMARY.md: full experiment details
Conclusion
Darwin is now validated for:
Next step: validate on real sub-agent execution (full_test mode)