A few generation-pipeline ideas that aim to improve build speed and quality without reducing the randomness/unpredictability that makes the targets contamination-resistant. PR #5 already does the first one; the rest are written up here for discussion rather than as code.
Guiding idea: separate the entropy budget from the LLM budget
The unpredictability that makes a cell hard to memorise can come from cheap, seeded, logged random draws; the LLM is best reserved for the parts that need natural-language realism (theme copy, scenario prose). Keeping those two budgets separate makes generation faster and cheaper, lets a specific cell be reproduced for debugging, and even lets you increase entropy without spending tokens. Most of the ideas below are facets of this.
1. Parallelise the independent asset steps — done in #5
Chrome, decoys, and 404 depend only on (theme, scenario) and not on each other, so they can generate concurrently. See PR #5.
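For readers who have not opened the PR, the shape of the change can be sketched as follows. This is not the code from PR #5; `generate_asset` is a hypothetical stand-in for a real asset step, and the theme and scenario values are placeholders.

```python
import asyncio


async def generate_asset(name: str, theme: str, scenario: str) -> str:
    # Hypothetical stand-in for one asset step; a real step would call the model here.
    await asyncio.sleep(0.01)
    return f"{name} for {theme}/{scenario}"


async def generate_assets(theme: str, scenario: str) -> dict[str, str]:
    # The three steps read only (theme, scenario), so they can run concurrently.
    names = ["chrome", "decoys", "404"]
    results = await asyncio.gather(
        *(generate_asset(name, theme, scenario) for name in names)
    )
    # gather returns results in the order the awaitables were passed in.
    return dict(zip(names, results))


assets = asyncio.run(generate_assets("placeholder-theme", "placeholder-scenario"))
```

Wall-clock time for the three steps then approaches that of the slowest one rather than the sum of all three.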
2. Route by step criticality, not a blanket quality flag
The scenario step is the only schema-strict, retry-heavy one; everything else (theme, chrome, 404) is cosmetic. Routing scenario to the strong model and the cosmetic steps to the fast model gets the quality where it matters without paying for it everywhere. (This would also be a natural place to retire the hardcoded QUALITY_MODEL = 'claude-opus-4-7'.)
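One possible shape for the routing, as a sketch only: the model identifiers and the `model_for` helper below are hypothetical placeholders, not names from the codebase. Only the steps this write-up classifies are listed; an unlisted step raises so that it has to be classified deliberately.

```python
# Placeholder identifiers; the real model names would come from configuration.
STRONG_MODEL = "strong-model-placeholder"
FAST_MODEL = "fast-model-placeholder"

# Per-step routing table: schema-strict steps get the strong model,
# cosmetic steps get the fast one.
STEP_MODEL = {
    "scenario": STRONG_MODEL,
    "theme": FAST_MODEL,
    "chrome": FAST_MODEL,
    "404": FAST_MODEL,
}


def model_for(step: str) -> str:
    # Fail loudly on an unclassified step instead of silently picking a default.
    try:
        return STEP_MODEL[step]
    except KeyError:
        raise ValueError(f"no model routing defined for step {step!r}") from None
```

A table like this replaces a single global constant, so changing the model for one step does not affect the others.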
3. On a schema-bound violation, reprompt — don't clamp
When a generated scenario draw violates the schema's array bounds, the safest-looking fix is to clamp the arrays to fit. But clamping piles the out-of-bounds draws onto the boundary values and quietly shrinks the distribution, which spends exactly the unpredictability you want to keep. Rejection sampling (reprompt and redraw) preserves the distribution. A clamp/repair pass is fine as a last-resort fallback after N reprompts, but reprompting should be the first move.
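A minimal sketch of that ordering, assuming the bound in question is a maximum array length. `draw_within_bounds` and its parameters are hypothetical; `draw` stands in for one reprompt of the scenario step.

```python
from typing import Callable


def draw_within_bounds(
    draw: Callable[[], list[int]],
    max_len: int,
    max_reprompts: int = 3,
) -> tuple[list[int], str]:
    """Redraw on a bound violation; clamp only after the reprompts run out."""
    items = draw()
    for _ in range(max_reprompts):
        if len(items) <= max_len:
            return items, "accepted"
        items = draw()  # reject the draw and ask again
    if len(items) <= max_len:
        return items, "accepted"
    # Last resort: truncate to the bound and say so, so the bias is visible in logs.
    return items[:max_len], "clamped"
```

Returning the outcome label alongside the value makes it possible to count how often the fallback fires, which is a cheap way to check that clamping stays rare.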
4. Partial regeneration on validation failure
When solvability / discovery / negative-control fails post-deploy, regenerating the whole scenario throws away the parts that were fine. Feeding the specific failure back and regenerating only the broken element recovers faster with no quality cost.
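As a sketch of the loop, assuming a scenario can be treated as a mapping of named elements and that validation can point at the element that failed. `repair_scenario`, `validate`, and `regenerate_element` are hypothetical names; the real validators and scenario structure may look quite different.

```python
from typing import Callable, Optional

Scenario = dict[str, str]
# A failure names the broken element and carries the message to feed back.
Failure = tuple[str, str]


def repair_scenario(
    scenario: Scenario,
    validate: Callable[[Scenario], Optional[Failure]],
    regenerate_element: Callable[[str, str, Scenario], str],
    max_rounds: int = 3,
) -> Scenario:
    current = dict(scenario)  # work on a copy; the input is left untouched
    for _ in range(max_rounds):
        failure = validate(current)
        if failure is None:
            return current
        key, message = failure
        # Regenerate only the failing element, passing the failure text back.
        current[key] = regenerate_element(key, message, current)
    if validate(current) is None:
        return current
    raise RuntimeError("scenario still failing after partial regeneration")


def demo_validate(s: Scenario) -> Optional[Failure]:
    return ("hint", "hint is not discoverable") if s["hint"] == "broken" else None


original = {"intro": "ok", "hint": "broken", "solution": "ok"}
fixed = repair_scenario(original, demo_validate, lambda key, msg, s: "fixed")
```

In the demo only `hint` is rewritten; `intro` and `solution` pass through unchanged.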
5. Seeded-but-logged RNG
Recording the per-deploy random seed (in the local, non-baked manifest) keeps every deploy fully random while letting a researcher reproduce a specific cell exactly for debugging. Logging the seed does not change how it is drawn, so the entropy cost is zero.
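A small sketch of the idea, with hypothetical parameter names (`decoy_count`, `layout`) and a JSON string standing in for the manifest entry. The seed comes from the OS entropy source, all cell-level draws go through one seeded generator, and replaying the logged seed yields the same draws.

```python
import json
import random
import secrets


def new_deploy_seed() -> int:
    # Fresh, unpredictable seed per deploy, taken from the OS entropy source.
    return secrets.randbits(64)


def draw_cell_params(seed: int) -> dict:
    # Every random choice for the cell goes through this one seeded generator.
    rng = random.Random(seed)
    return {
        "decoy_count": rng.randint(2, 6),
        "layout": rng.choice(["grid", "list", "tabs"]),
    }


seed = new_deploy_seed()
params = draw_cell_params(seed)

# Stand-in for the local, non-baked manifest: log the seed next to the draws.
manifest_entry = json.dumps({"seed": seed, "params": params})

# Later, a researcher replays the cell from the logged seed alone.
replayed = draw_cell_params(json.loads(manifest_entry)["seed"])
```

The replay is only exact if the draws happen in the same order, so any new random choice is best added at the end of `draw_cell_params` or given its own derived seed.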
Happy to turn any of 2–5 into PRs if the direction is welcome.