feat: variant task-scoping + tactical-ddd skill benchmark example by szjanikowski · Pull Request #54 · NoesisVision/nasde-toolkit

szjanikowski · 2026-05-22T10:02:33Z

Two logical changes, squashed into two commits.

1. `feat`: variant task-scoping (+ two small fixes) — toolkit code

A `variant.toml` may now declare `tasks = [...]` to restrict a variant to specific
tasks. Use it for a repo-specific variant (e.g. a skill whose examples reference one
repo's conventions) so it never runs against the wrong codebase. With
`--all-variants`, a scoped variant runs only against its declared tasks; the
scope wins even over an explicit `--tasks` filter. If `tasks` is absent or empty,
the variant is unscoped (the default).

Bundled with two small fixes that belong with the toolkit:

- evaluator default max_turns 30 → 60 (avoids error_max_turns on DDD-heavy workspaces)
- starlette 1.0.0 → 1.1.0 (PYSEC-2026-161, transitive via harbor/fastapi/mcp)

Covered by tests (test_runner.py, test_config.py); scaffold template documents the new field.

2. `example`: tactical-ddd skill benchmark — `examples/` only

Reworks the ddd-architectural-challenges example into a worked study of a public
DDD skill (Nick Tune's tactical-ddd) and a repo-tuned version, measured across four
variants (vanilla / guided / public-skill / repo-tuned) on two tasks — a clean-feature
task (weather discount) and a legacy anemic→rich refactor (movie-rental), both .NET 8.
New README section with per-dimension radars, an increment-over-vanilla chart, and
token/time charts. No toolkit code touched by this part.
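The increment-over-vanilla view can be read as a per-task difference against the vanilla baseline, as in the sketch below. The function name `increments_over_vanilla` is hypothetical, and the scores are placeholders chosen for readability, not measured results from this benchmark; the task names are likewise illustrative.

```python
def increments_over_vanilla(
    scores: dict[str, dict[str, float]], baseline: str = "vanilla"
) -> dict[str, dict[str, float]]:
    """For each task, subtract the baseline variant's score from every other variant's."""
    increments: dict[str, dict[str, float]] = {}
    for task, by_variant in scores.items():
        base = by_variant[baseline]
        increments[task] = {
            variant: score - base
            for variant, score in by_variant.items()
            if variant != baseline
        }
    return increments


# Placeholder numbers only; they are NOT results from the PR.
placeholder_scores = {
    "task-a": {"vanilla": 50.0, "guided": 60.0, "public-skill": 55.0, "repo-tuned": 70.0},
    "task-b": {"vanilla": 20.0, "guided": 25.0, "public-skill": 30.0, "repo-tuned": 40.0},
}
deltas = increments_over_vanilla(placeholder_scores)
```

Each task is compared only against its own baseline, so the differences can be set side by side across tasks even when the absolute scores sit on different scales.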

Report/analysis tooling used to produce these results (ad-hoc, not yet integrated)
is parked on poc/benchmark-report-tooling for a future nasde report feature —
deliberately kept out of this PR.

Quality gates green (ruff/format/mypy/pytest ×4 platforms, CVE audit, benchmark configs).

- Variant task-scoping: a variant.toml may declare `tasks = [...]` to restrict that variant to specific tasks (e.g. a repo-specific skill that should never run against the wrong codebase). With --all-variants, a scoped variant runs only against its declared tasks; the scope wins even over an explicit --tasks filter. (runner: load_variant_task_scope / scope_tasks_for_variant; cli wiring; scaffold template gains a commented example; tests in test_runner.py.)
- Raise default evaluator max_turns 30 → 60 (error_max_turns on DDD-heavy workspaces).
- Bump starlette 1.0.0 → 1.1.0 (PYSEC-2026-161, transitive via harbor/fastapi/mcp).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reworks the ddd-architectural-challenges example into a worked study of a public DDD skill (Nick Tune's tactical-ddd) and a repo-tuned version, measured with NASDE across four variants (vanilla / guided / public-skill / repo-tuned) on two tasks: a clean-feature task (weather discount) and a legacy anemic→rich refactor (movie-rental), both on .NET 8.

- Tasks: add csharp-movie-rental-anemic, modernize csharp-anemic-to-rich-domain to .NET 8, align rubrics + verifiers.
- Variants: public skill (repo-leaks removed, idiomatic C#), three repo-tuned skills; explicit skill invocation in instructions so activation is deterministic.
- README: new Claude/skill deep-dive section: per-dimension radars, increment-over-vanilla line (absolute cross-task scores aren't comparable; increments are), and average token/time charts. Existing cross-agent table kept.
- Lessons surfaced: judge per-dimension, not one aggregate; a mounted skill isn't a used skill (verify activation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

szjanikowski force-pushed the experiment/tactical-ddd-focus branch 2 times, most recently from da8ec5b to c5c0961 Compare May 26, 2026 11:23

szjanikowski changed the title ~~tactical-ddd skill eval: variant task-scoping, skill-activation fix, focus benchmark~~ feat: variant task-scoping + tactical-ddd skill benchmark example May 26, 2026

szjanikowski force-pushed the experiment/tactical-ddd-focus branch 7 times, most recently from 1e8c524 to 56cf9b8 Compare May 26, 2026 12:52

Szymon Janikowski and others added 2 commits May 26, 2026 14:55

szjanikowski force-pushed the experiment/tactical-ddd-focus branch from 56cf9b8 to 5a7076b Compare May 26, 2026 12:55

szjanikowski merged commit de1ea5d into main May 26, 2026
9 checks passed

szjanikowski merged 2 commits into main from experiment/tactical-ddd-focus
