docs: update tactical-ddd example results to final n=5 by szjanikowski · Pull Request #55 · NoesisVision/nasde-toolkit

szjanikowski · 2026-05-27T21:14:16Z

Updates the ddd-architectural-challenges deep-dive (README + docs/benchmark-results.md) and its charts to the final n=5 attempts/variant numbers, and adds full per-dimension significance analysis. Replaces the earlier n=3 framing.

What the data supports (each configuration vs the bare model)

| Configuration | Weather (greenfield) | Movie (legacy) |
| --- | --- | --- |
| guided (hand-written hints) | +0.01 (noise) | +0.00 (noise) |
| public skill | +0.07 (real) | −0.02 (noise) |
| repo-tuned | +0.12 (real) | +0.05 (real) |

The repo-tuned skill significantly beats the bare model on both tasks, and beats hand-written hints.
Off-the-shelf public skill helps only on the clean feature; on the legacy refactor it doesn't beat vanilla.
Hand-written hints ≈ vanilla (no measurable lift — a placebo).

The earlier draft headlined "tuning to the repo adds +0.13/+0.06"; at n=5 the public→tuned gap is within noise, so the honest claim is skill (especially tuned) vs the bare model, not tuned vs public.

Per-dimension: the aggregate hides the Movie story

On Movie the aggregate barely moves (+0.05), but per dimension the repo-tuned skill significantly lifts 4 of 5 — architecture +0.12, encapsulation +0.08, domain +0.04, extensibility +0.05 — while test quality drops (−0.035) and flattens the average (the skill teaches modeling, not testing). Full per-dimension tables with forest charts are included.
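The arithmetic behind "flattens the average" can be checked in a few lines. This is only a sketch: it assumes the aggregate is the unweighted mean of the five dimension scores (the description does not state the weighting), and it uses the rounded gaps quoted above rather than the raw per-attempt data.

```python
# Per-dimension Movie gaps (repo-tuned vs bare model), as quoted in the paragraph above.
movie_gaps = {
    "architecture": 0.12,
    "encapsulation": 0.08,
    "domain": 0.04,
    "extensibility": 0.05,
    "test_quality": -0.035,
}

# Assumption: the aggregate is the plain (unweighted) mean of the five dimensions.
aggregate = sum(movie_gaps.values()) / len(movie_gaps)
print(f"aggregate gap: {aggregate:+.3f}")  # +0.051

# The same mean with the test-quality dimension left out, for comparison.
modeling_only = [gap for name, gap in movie_gaps.items() if name != "test_quality"]
without_tests = sum(modeling_only) / len(modeling_only)
print(f"gap without test_quality: {without_tests:+.4f}")
```

Under that assumption the five quoted gaps average to about +0.05, matching the aggregate, while the four modeling dimensions alone average noticeably higher.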

Method (significance)

A gap is called real only when its 95% CI (percentile bootstrap on per-attempt means) excludes zero. (Not a t-test — the linked Wolfe article uses resampling, not the t-distribution.)
Also computed a Bayesian bootstrap (Rubin) side by side: aggregate verdicts agree 100%, per-dimension agree almost everywhere (borderline-on-zero cases flagged). Honest small-n caveat stated: n=5 bootstrap is orientation, not a certificate.
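To make the decision rule concrete, here is a minimal sketch of both procedures using only the Python standard library. It is not the script used for this PR (that lives on a separate branch); the function names, the resample count, and the independent resampling of each group are assumptions, and the scores are made-up illustration values, not benchmark data.

```python
import random
import statistics


def percentile_ci(samples, level=0.95):
    """Central `level` interval of `samples`, read off the sorted values."""
    ordered = sorted(samples)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


def bootstrap_gap_ci(baseline, variant, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for mean(variant) - mean(baseline)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        # Resample attempts with replacement, separately within each group.
        b = rng.choices(baseline, k=len(baseline))
        v = rng.choices(variant, k=len(variant))
        gaps.append(statistics.fmean(v) - statistics.fmean(b))
    return percentile_ci(gaps)


def bayesian_bootstrap_gap_ci(baseline, variant, n_draws=10_000, seed=0):
    """Bayesian bootstrap (Rubin): reweight attempts instead of resampling them."""
    rng = random.Random(seed)

    def weighted_mean(scores):
        # Dirichlet(1, ..., 1) weights via normalised Exponential(1) draws.
        weights = [rng.expovariate(1.0) for _ in scores]
        total = sum(weights)
        return sum(w * s for w, s in zip(weights, scores)) / total

    gaps = [weighted_mean(variant) - weighted_mean(baseline) for _ in range(n_draws)]
    return percentile_ci(gaps)


def is_real(ci):
    """A gap counts as real only when the interval excludes zero."""
    lo, hi = ci
    return lo > 0 or hi < 0


# Hypothetical per-attempt mean scores, n=5 per variant (illustration only).
vanilla = [0.70, 0.72, 0.69, 0.71, 0.70]
tuned = [0.82, 0.84, 0.81, 0.83, 0.80]

print("bootstrap:", bootstrap_gap_ci(vanilla, tuned), is_real(bootstrap_gap_ci(vanilla, tuned)))
print("bayesian: ", bayesian_bootstrap_gap_ci(vanilla, tuned), is_real(bayesian_bootstrap_gap_ci(vanilla, tuned)))
```

With only five attempts per group the resampled means can take few distinct values, which is one reason the small-n caveat above applies: the interval is a rough guide rather than a guarantee.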

Changed

README.md — deep-dive paragraph rewritten to the supported claims
docs/benchmark-results.md — table (averages, not medians), per-dimension Movie paragraph, increment/radar/ops charts, significance method + links
examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bootstrap.md — per-dimension tables + 12 forest charts (Weather/Movie × Overall + 5 dimensions)
examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bayes-vs-bootstrap.md — Bayesian-vs-frequentist comparison
regenerated chart PNGs under examples/ddd-architectural-challenges/assets/

Chart-generation script and the experiment journal live on poc/benchmark-report-tooling (kept out of the release per the earlier split).

🤖 Generated with Claude Code

Raises both tasks to n=5 attempts/variant and reframes the deep-dive on what the data actually supports (Welch t-test, 95% CI), replacing the earlier n=3 framing:
- A repo-tuned skill significantly beats the bare model on BOTH tasks (+0.12 weather, +0.05 movie) and beats hand-written hints.
- The off-the-shelf public skill helps only on the clean feature (+0.07); on the legacy refactor it does not beat vanilla.
- Hand-written 'guided' hints ≈ vanilla (no measurable lift).
- Updated table (means, not medians), increment chart, per-dimension radars, and token/time charts to the n=5 numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…es) and link from results

Adds the full per-dimension significance tables to the DDD example and links them from benchmark-results.md:
- SIGNIFICANCE-per-dimension-bootstrap.md: percentile bootstrap (95% CI), overall + per dimension, both tasks. Key finding: on Movie the repo-tuned skill is significant on 4/5 dimensions (architecture +0.12, encapsulation +0.08, ...) while test_quality drops (−0.035), which is what flattens the aggregate.
- SIGNIFICANCE-per-dimension-bayes-vs-bootstrap.md: Bayesian bootstrap (Rubin) side by side with the frequentist bootstrap; aggregate verdicts agree 100%, with 4 borderline per-dimension disagreements honestly flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Welch t-test

The deep-dive prose still said gaps are called real 'when it clears a Welch t-test at 95%', a leftover from the earlier t-Welch framing, inconsistent with the bootstrap tables this doc now links and with the post's method. Changed to percentile bootstrap 95% CI. Verified all in-text verdicts (W vanilla→tuned +0.12 sig, W vanilla→public +0.07 sig, M vanilla→tuned +0.05 sig, M vanilla→public doesn't beat) match the bootstrap table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…per-dim tension)

benchmark-results had per-dimension only as radars + a generic 'aggregate can hide a win' line, but never stated the concrete, strongest finding: on Movie the aggregate says +0.05 (barely moved), yet per dimension the repo-tuned skill significantly lifts 4 of 5 (architecture +0.12, encapsulation +0.08, domain +0.04, extensibility +0.05), while test_quality drops (−0.035) and flattens the average. Added one paragraph after the radars; numbers verified against the linked bootstrap table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…charts

Replaces the long flat significance tables with a cleaner layout: top sections Weather and Movie, each split into Overall + one sub-section per dimension. Comparisons are now each configuration vs the bare model (vanilla → guided / public / repo-tuned).
- Bootstrap file: every sub-section has a forest chart (12 total) + a small table. Charts color real gaps (CI off zero) by task, grey out noise.
- Bayes-vs-bootstrap file: same per-dimension layout, tables only (bootstrap CI beside Bayesian-bootstrap CI, Agree? column); 2 borderline-on-zero disagreements flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Paths were ../assets/ (resolving to examples/assets/, which does not exist) instead of assets/. The MD lives in examples/ddd-architectural-challenges/ and the charts in its assets/ subdir, so the relative path is assets/, no ../. Fixes broken image icons on GitHub.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
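The path fix in the last commit can be reproduced with a short sketch. The directory comes from the commit message; the file name `chart.png` is a hypothetical stand-in, since the actual chart file names are not listed here.

```python
import posixpath

# Directory that holds the significance markdown files (from the commit message).
md_dir = "examples/ddd-architectural-challenges"

# Hypothetical chart file name, used only to illustrate the resolution.
broken = posixpath.normpath(posixpath.join(md_dir, "../assets/chart.png"))
fixed = posixpath.normpath(posixpath.join(md_dir, "assets/chart.png"))

print(broken)  # examples/assets/chart.png
print(fixed)   # examples/ddd-architectural-challenges/assets/chart.png
```

The `../` prefix steps out of the example directory before descending into `assets/`, which is why the original links pointed at a folder that does not exist.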

Szymon Janikowski and others added 6 commits May 27, 2026 23:13

szjanikowski merged commit e6e79cb into main May 28, 2026
9 checks passed

szjanikowski merged 6 commits into main from docs/tactical-ddd-n5-update
