docs: update tactical-ddd example results to final n=5 #55

Merged
szjanikowski merged 6 commits into main from docs/tactical-ddd-n5-update
May 28, 2026

szjanikowski commented on May 27, 2026

Updates the ddd-architectural-challenges deep-dive (README + docs/benchmark-results.md) and its charts to the final n=5 attempts/variant numbers, and adds full per-dimension significance analysis. Replaces the earlier n=3 framing.

What the data supports (each configuration vs the bare model)

| step | Weather (greenfield) | Movie (legacy) |
|---|---|---|
| guided (hand-written hints) | +0.01 noise | +0.00 noise |
| public skill | +0.07 real | −0.02 noise |
| repo-tuned | +0.12 real | +0.05 real |

  • The repo-tuned skill significantly beats the bare model on both tasks, and beats hand-written hints.
  • The off-the-shelf public skill helps only on the clean greenfield feature (Weather); on the legacy refactor (Movie) it doesn't beat vanilla.
  • Hand-written hints ≈ vanilla (no measurable lift; effectively a placebo).

The earlier draft headlined "tuning to the repo adds +0.13/+0.06"; at n=5 the public→tuned gap is within noise, so the honest claim is skill (especially tuned) vs the bare model, not tuned vs public.

Per-dimension: the aggregate hides the Movie story

On Movie the aggregate barely moves (+0.05), but per dimension the repo-tuned skill significantly lifts 4 of 5 — architecture +0.12, encapsulation +0.08, domain +0.04, extensibility +0.05 — while test quality drops (−0.035) and flattens the average (the skill teaches modeling, not testing). Full per-dimension tables with forest charts are included.
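
A quick arithmetic check of how one negative dimension flattens the average. This is only a sketch: it assumes the aggregate is the unweighted mean of the five dimension scores (the weighting is not stated here), and the dictionary keys are illustrative labels for the dimensions named above.

```python
from statistics import mean

# Movie, repo-tuned vs bare model: per-dimension deltas quoted above.
# ASSUMPTION: the aggregate is the unweighted mean of these five dimensions.
movie_tuned_deltas = {
    "architecture": 0.12,
    "encapsulation": 0.08,
    "domain": 0.04,
    "extensibility": 0.05,
    "test_quality": -0.035,
}

# Mean over all five dimensions: the negative test-quality delta pulls it down.
aggregate_delta = mean(movie_tuned_deltas.values())

# Mean over the four modeling dimensions only, for comparison.
modeling_only = mean(
    delta for name, delta in movie_tuned_deltas.items() if name != "test_quality"
)

print(f"all five dimensions: {aggregate_delta:+.3f}")
print(f"without test quality: {modeling_only:+.4f}")
```

Under that assumption the five deltas average to roughly +0.05, which matches the reported aggregate, while the four modeling dimensions alone average about +0.07.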

Method (significance)

  • A gap is called real only when its 95% CI (percentile bootstrap on per-attempt means) excludes zero. (Not a t-test — the linked Wolfe article uses resampling, not the t-distribution.)
  • Also computed a Bayesian bootstrap (Rubin) side by side: aggregate verdicts agree 100%, per-dimension agree almost everywhere (borderline-on-zero cases flagged). Honest small-n caveat stated: n=5 bootstrap is orientation, not a certificate.
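
For readers who want to reproduce the idea, here is a minimal sketch of both intervals using only the Python standard library. It is not the script used for this PR (that lives on poc/benchmark-report-tooling). The function names, the resample count, the seed, and the score lists are all hypothetical; the scores are made-up illustrative values, not benchmark data.

```python
import random
from statistics import mean


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 1]) of a pre-sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_gap_ci(baseline, variant, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap CI for mean(variant) - mean(baseline)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        # Resample each group's per-attempt scores with replacement.
        b = rng.choices(baseline, k=len(baseline))
        v = rng.choices(variant, k=len(variant))
        gaps.append(mean(v) - mean(b))
    gaps.sort()
    alpha = (1 - level) / 2
    return percentile(gaps, alpha), percentile(gaps, 1 - alpha)


def bayesian_bootstrap_gap_ci(baseline, variant, n_resamples=10_000, level=0.95, seed=0):
    """Bayesian-bootstrap (Rubin) interval: Dirichlet(1, ..., 1) weights per group."""
    rng = random.Random(seed)

    def weighted_mean(xs):
        # Normalised Exp(1) draws are a Dirichlet(1, ..., 1) sample.
        w = [rng.expovariate(1.0) for _ in xs]
        total = sum(w)
        return sum(wi * xi for wi, xi in zip(w, xs)) / total

    gaps = sorted(
        weighted_mean(variant) - weighted_mean(baseline) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    return percentile(gaps, alpha), percentile(gaps, 1 - alpha)


def is_real(ci):
    """A gap counts as 'real' only when the interval excludes zero."""
    lo, hi = ci
    return lo > 0 or hi < 0


# HYPOTHETICAL per-attempt mean scores (n=5 each), for illustration only.
vanilla = [0.70, 0.72, 0.69, 0.71, 0.73]
guided = [0.69, 0.74, 0.70, 0.72, 0.71]
tuned = [0.80, 0.83, 0.79, 0.82, 0.81]

for name, scores in [("guided", guided), ("tuned", tuned)]:
    freq = bootstrap_gap_ci(vanilla, scores)
    bayes = bayesian_bootstrap_gap_ci(vanilla, scores)
    print(name, "bootstrap real:", is_real(freq), "| bayes real:", is_real(bayes))
```

With only five attempts per group there are few distinct resamples, so the interval endpoints are coarse; that is the small-n caveat above in practice.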

Changed

  • README.md — deep-dive paragraph rewritten to the supported claims
  • docs/benchmark-results.md — table (averages, not medians), per-dimension Movie paragraph, increment/radar/ops charts, significance method + links
  • examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bootstrap.md — per-dimension tables + 12 forest charts (Weather/Movie × Overall + 5 dimensions)
  • examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bayes-vs-bootstrap.md — Bayesian-vs-frequentist comparison
  • regenerated chart PNGs under examples/ddd-architectural-challenges/assets/

Chart-generation script and the experiment journal live on poc/benchmark-report-tooling (kept out of the release per the earlier split).

🤖 Generated with Claude Code

Szymon Janikowski and others added 6 commits May 27, 2026 23:13
Raises both tasks to n=5 attempts/variant and reframes the deep-dive on what the
data actually supports (Welch t-test, 95% CI), replacing the earlier n=3 framing:

- A repo-tuned skill significantly beats the bare model on BOTH tasks (+0.12 weather,
  +0.05 movie) and beats hand-written hints.
- The off-the-shelf public skill helps only on the clean feature (+0.07); on the
  legacy refactor it does not beat vanilla.
- Hand-written 'guided' hints ≈ vanilla (no measurable lift).
- Updated table (means, not medians), increment chart, per-dimension radars, and
  token/time charts to the n=5 numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es) and link from results

Adds the full per-dimension significance tables to the DDD example and links them from
benchmark-results.md:
- SIGNIFICANCE-per-dimension-bootstrap.md — percentile bootstrap (95% CI), overall + per
  dimension, both tasks. Key finding: on Movie the repo-tuned skill is significant on 4/5
  dimensions (architecture +0.12, encapsulation +0.08, ...) while test_quality drops
  (−0.035), which is what flattens the aggregate.
- SIGNIFICANCE-per-dimension-bayes-vs-bootstrap.md — Bayesian bootstrap (Rubin) side by
  side with the frequentist bootstrap; aggregate verdicts agree 100%, with 4 borderline
  per-dimension disagreements honestly flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Welch t-test

The deep-dive prose still said gaps are called real 'when it clears a Welch t-test at
95%' — a leftover from the earlier t-Welch framing, inconsistent with the bootstrap
tables this doc now links and with the post's method. Changed to percentile bootstrap
95% CI. Verified all in-text verdicts (W vanilla→tuned +0.12 sig, W vanilla→public +0.07
sig, M vanilla→tuned +0.05 sig, M vanilla→public doesn't beat) match the bootstrap table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…per-dim tension)

benchmark-results had per-dimension only as radars + a generic 'aggregate can hide a
win' line — but never stated the concrete, strongest finding: on Movie the aggregate
says +0.05 (barely moved), yet per dimension the repo-tuned skill significantly lifts
4 of 5 (architecture +0.12, encapsulation +0.08, domain +0.04, extensibility +0.05),
while test_quality drops (−0.035) and flattens the average. Added one paragraph after
the radars; numbers verified against the linked bootstrap table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…charts

Replaces the long flat significance tables with a cleaner layout: top sections Weather
and Movie, each split into Overall + one sub-section per dimension. Comparisons are now
each configuration vs the bare model (vanilla → guided / public / repo-tuned).
- Bootstrap file: every sub-section has a forest chart (12 total) + a small table.
  Charts color real gaps (CI off zero) by task, grey out noise.
- Bayes-vs-bootstrap file: same per-dimension layout, tables only (bootstrap CI beside
  Bayesian-bootstrap CI, Agree? column); 2 borderline-on-zero disagreements flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paths were ../assets/ (resolving to examples/assets/ — nonexistent) instead of assets/.
The MD lives in examples/ddd-architectural-challenges/ and the charts in its assets/
subdir, so the relative path is assets/, no ../. Fixes broken image icons on GitHub.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
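
The path fix in the last commit can be checked with a short sketch of how a relative image link resolves against the directory of the Markdown file that contains it. The helper name `resolve_link` and the file name `chart.png` are hypothetical, and the bootstrap file is used as the example document on the assumption that it is one of the files carrying the chart links.

```python
import posixpath

# Assumed example document; the commit only says the MD lives in this directory.
MD_PATH = "examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bootstrap.md"


def resolve_link(md_path, link):
    """Resolve a relative link against the directory containing md_path."""
    base_dir = posixpath.dirname(md_path)
    return posixpath.normpath(posixpath.join(base_dir, link))


# Old link: climbs out of the example directory, so the target does not exist.
print(resolve_link(MD_PATH, "../assets/chart.png"))

# Fixed link: stays inside the example directory, where the charts live.
print(resolve_link(MD_PATH, "assets/chart.png"))
```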
szjanikowski merged commit e6e79cb into main on May 28, 2026
9 checks passed