docs: update tactical-ddd example results to final n=5 #55
Merged
Raises both tasks to n=5 attempts/variant and reframes the deep-dive on what the data actually supports (Welch t-test, 95% CI), replacing the earlier n=3 framing:
- A repo-tuned skill significantly beats the bare model on BOTH tasks (+0.12 weather, +0.05 movie) and beats hand-written hints.
- The off-the-shelf public skill helps only on the clean feature (+0.07); on the legacy refactor it does not beat vanilla.
- Hand-written 'guided' hints ≈ vanilla (no measurable lift).
- Updated table (means, not medians), increment chart, per-dimension radars, and token/time charts to the n=5 numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es) and link from results
Adds the full per-dimension significance tables to the DDD example and links them from benchmark-results.md:
- SIGNIFICANCE-per-dimension-bootstrap.md: percentile bootstrap (95% CI), overall + per dimension, both tasks. Key finding: on Movie the repo-tuned skill is significant on 4/5 dimensions (architecture +0.12, encapsulation +0.08, ...) while test_quality drops (−0.035), which is what flattens the aggregate.
- SIGNIFICANCE-per-dimension-bayes-vs-bootstrap.md: Bayesian bootstrap (Rubin) side by side with the frequentist bootstrap; aggregate verdicts agree 100%, with 4 borderline per-dimension disagreements explicitly flagged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Welch t-test
The deep-dive prose still said a gap is called real 'when it clears a Welch t-test at 95%'. That wording was left over from the earlier Welch t-test framing and was inconsistent with both the bootstrap tables this doc now links and the post's method. Changed it to percentile bootstrap 95% CI. Verified that all in-text verdicts (W vanilla→tuned +0.12 sig, W vanilla→public +0.07 sig, M vanilla→tuned +0.05 sig, M vanilla→public doesn't beat) match the bootstrap table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…per-dim tension)
benchmark-results showed per-dimension results only as radars plus a generic 'aggregate can hide a win' line, and never stated the concrete, strongest finding: on Movie the aggregate says +0.05 (barely moved), yet per dimension the repo-tuned skill significantly lifts 4 of 5 (architecture +0.12, encapsulation +0.08, domain +0.04, extensibility +0.05), while test_quality drops (−0.035) and flattens the average. Added one paragraph after the radars; numbers verified against the linked bootstrap table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…charts
Replaces the long flat significance tables with a cleaner layout: top sections Weather and Movie, each split into Overall + one sub-section per dimension. Comparisons are now each configuration vs the bare model (vanilla → guided / public / repo-tuned).
- Bootstrap file: every sub-section has a forest chart (12 total) + a small table. Charts color real gaps (CI off zero) by task, grey out noise.
- Bayes-vs-bootstrap file: same per-dimension layout, tables only (bootstrap CI beside Bayesian-bootstrap CI, Agree? column); 2 borderline-on-zero disagreements flagged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paths were ../assets/ (resolving to examples/assets/ — nonexistent) instead of assets/. The MD lives in examples/ddd-architectural-challenges/ and the charts in its assets/ subdir, so the relative path is assets/, no ../. Fixes broken image icons on GitHub. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the ddd-architectural-challenges deep-dive (README + docs/benchmark-results.md) and its charts to the final n=5 attempts/variant numbers, and adds full per-dimension significance analysis. Replaces the earlier n=3 framing.

What the data supports (each configuration vs the bare model)
The earlier draft headlined "tuning to the repo adds +0.13/+0.06"; at n=5 the public→tuned gap is within noise, so the honest claim is skill (especially tuned) vs the bare model, not tuned vs public.
Per-dimension: the aggregate hides the Movie story
On Movie the aggregate barely moves (+0.05), but per dimension the repo-tuned skill significantly lifts 4 of 5 — architecture +0.12, encapsulation +0.08, domain +0.04, extensibility +0.05 — while test quality drops (−0.035) and flattens the average (the skill teaches modeling, not testing). Full per-dimension tables with forest charts are included.
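
As a quick arithmetic check on the paragraph above, the five per-dimension deltas can be averaged to see how one negative dimension pulls the aggregate down. This sketch assumes the aggregate is the unweighted mean of the five dimensions; the weighting is not stated here, and the dictionary keys are illustrative names, not identifiers taken from the repo.

```python
# Movie deltas (vanilla -> repo-tuned) as quoted in the text above.
movie_deltas = {
    "architecture": 0.12,
    "encapsulation": 0.08,
    "domain": 0.04,
    "extensibility": 0.05,
    "test_quality": -0.035,
}

# ASSUMPTION: the aggregate is the plain (unweighted) mean of the dimensions.
aggregate = sum(movie_deltas.values()) / len(movie_deltas)

# Mean of the four dimensions that improved, leaving test quality out.
improved = [d for name, d in movie_deltas.items() if name != "test_quality"]
without_tests = sum(improved) / len(improved)

print(f"aggregate over 5 dimensions: {aggregate:+.3f}")
print(f"mean of the 4 lifted dimensions: {without_tests:+.4f}")
```

Under that assumption the five deltas average to about +0.051, consistent with the reported +0.05, while the four lifted dimensions alone average about +0.07.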
Method (significance)
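
The two interval methods named in the commits (percentile bootstrap and Rubin's Bayesian bootstrap) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's analysis script: the function names, resample count, seed, and sample scores are all invented for the example.

```python
import random


def _mean(xs):
    return sum(xs) / len(xs)


def _percentile_interval(draws, alpha):
    """Return the (alpha/2, 1 - alpha/2) empirical quantiles of the draws."""
    ordered = sorted(draws)
    lo_idx = round((alpha / 2) * (len(ordered) - 1))
    hi_idx = round((1 - alpha / 2) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def percentile_bootstrap_ci(baseline, variant, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(variant) - mean(baseline)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement, at its own size.
        b = rng.choices(baseline, k=len(baseline))
        v = rng.choices(variant, k=len(variant))
        diffs.append(_mean(v) - _mean(b))
    return _percentile_interval(diffs, alpha)


def bayesian_bootstrap_ci(baseline, variant, n_draws=10_000, alpha=0.05, seed=0):
    """Bayesian-bootstrap (Rubin) credible interval for the same difference."""
    rng = random.Random(seed)

    def weighted_mean(xs):
        # Normalised Exp(1) draws give Dirichlet(1, ..., 1) weights over the observations.
        w = [rng.expovariate(1.0) for _ in xs]
        return sum(wi * x for wi, x in zip(w, xs)) / sum(w)

    diffs = [weighted_mean(variant) - weighted_mean(baseline) for _ in range(n_draws)]
    return _percentile_interval(diffs, alpha)


# Invented scores for illustration only; NOT the benchmark's data.
baseline = [0.50, 0.55, 0.45, 0.52, 0.48]
variant = [0.70, 0.65, 0.72, 0.68, 0.75]

for name, fn in [("bootstrap", percentile_bootstrap_ci), ("bayes", bayesian_bootstrap_ci)]:
    lo, hi = fn(baseline, variant)
    verdict = "real gap" if lo > 0 or hi < 0 else "within noise"
    print(f"{name}: 95% interval [{lo:+.3f}, {hi:+.3f}] -> {verdict}")
```

In both cases a gap is read as real when the interval excludes zero, which matches the "CI off zero" rule the forest charts are described as coloring by.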
Changed
- README.md: deep-dive paragraph rewritten to the supported claims
- docs/benchmark-results.md: table (averages, not medians), per-dimension Movie paragraph, increment/radar/ops charts, significance method + links
- examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bootstrap.md: per-dimension tables + 12 forest charts (Weather/Movie × Overall + 5 dimensions)
- examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bayes-vs-bootstrap.md: Bayesian-vs-frequentist comparison
- examples/ddd-architectural-challenges/assets/

Chart-generation script and the experiment journal live on poc/benchmark-report-tooling (kept out of the release per the earlier split).

🤖 Generated with Claude Code