
Upd0611#55

Merged
grauwolf32 merged 40 commits into main from upd0611 on Jun 10, 2026

@grauwolf32

Owner

No description provided.

grauwolf32 and others added 30 commits June 7, 2026 07:53
…ssing from ground truth

- dvblab-023: PyYAML 5.3.1 vulnerable dependency (CVE-2020-14343), reachable via
  yaml.load(Loader=yaml.Loader) at auth_routes.py:157.
- dvpwa-023: stored XSS via unescaped {{ student.name }} in evaluate.jinja2:25
  under autoescape=False (excluded course.name — Undefined on the Course tuple).
Both were reported by the scanner but scored as FP under the incomplete GT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-pay)

The fixture's project root is tests/playground/typescript/vault-pay (package.json
+ src/); the doubled path never resolved. Makes the 2 vaultpay trace cases runnable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each run writes a single-fixture envelope + per-case metrics/artifacts to
eval_runs/<RUN_STAMP>/<scenario>-<unit>-eval-<fixture>/ (RUN_STAMP = mmdd-HHMMSS
per process, overridable via CONTRACTOR_EVAL_RUN_STAMP). The flat
eval_runs/<unit>/ path is kept as a 'latest' pointer for analytics-ui back-compat.
case_artifact_dir gains a scenario tag (default agent; task callers updated).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
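
The archive layout described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names `run_stamp` and `archive_dir` are hypothetical, and only the `CONTRACTOR_EVAL_RUN_STAMP` variable, the `mmdd-HHMMSS` shape, and the directory pattern come from the commit message. A real implementation would presumably compute the stamp once per process and reuse it.

```python
import os
from datetime import datetime
from pathlib import Path


def run_stamp(now=None):
    """Return the env override if set, else a mmdd-HHMMSS stamp (hypothetical helper)."""
    override = os.environ.get("CONTRACTOR_EVAL_RUN_STAMP")
    if override:
        return override
    return (now or datetime.now()).strftime("%m%d-%H%M%S")


def archive_dir(root, stamp, scenario, unit, fixture):
    """Build <root>/<RUN_STAMP>/<scenario>-<unit>-eval-<fixture>/ (hypothetical helper)."""
    return Path(root) / stamp / f"{scenario}-{unit}-eval-{fixture}"
```

Keeping the stamp as the first path component means one process's fixtures sort together, while the flat per-unit path can still be rewritten as the 'latest' pointer.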
… cap

Lets an experiment loosen/disable count-based heavy-result elision (keep_last_n)
without code changes; >0 overrides the caller's elide_keep_last_n (default 15),
0 keeps historical behaviour.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
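
The override rule above (a positive value wins, zero defers to the caller) is small enough to show directly. The function name and its parameters are assumptions for illustration; the commit title is truncated, so the real setting name is not reproduced here.

```python
def effective_keep_last_n(override: int, caller_keep_last_n: int = 15) -> int:
    """Pick the keep_last_n actually used for heavy-result elision.

    override > 0  -> the experiment's value replaces the caller's value
    override == 0 -> historical behaviour: the caller's value is kept
    """
    return override if override > 0 else caller_keep_last_n
```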
…S_AT)

Run each trace case N times; passes if any attempt passes (pass@N semantics,
consistent with the detection eval). Default 1 = unchanged. Per-attempt runs[]
recorded when N>1.
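
A rough sketch of the pass@N rule described above, assuming a hypothetical `run_once` callable that returns whether one attempt passed. The result keys are illustrative; only the "any attempt passes" rule and the per-attempt `runs` list for N>1 come from the commit message.

```python
from typing import Callable


def run_pass_at_n(run_once: Callable[[], bool], attempts: int = 1) -> dict:
    """Run a case `attempts` times; the case passes if any attempt passes."""
    attempts = max(1, attempts)
    results = [bool(run_once()) for _ in range(attempts)]
    outcome = {"passed": any(results)}
    if attempts > 1:
        # Per-attempt records are only kept when more than one run happened.
        outcome["runs"] = [
            {"attempt": index + 1, "passed": passed}
            for index, passed in enumerate(results)
        ]
    return outcome
```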
…ore thrash

Records showed failing crapi-workshop trace cases churning annotate->restore->
re-annotate (restore 2.4-5.8x the annotate count) because annotate_trace refuses
duplicates, so revising forces a full-path restore. converge.md = v7 + HARD RULE 9
(commit discipline: annotate once, cap restores, stop on no-progress) + matching
anti-pattern; general, not benchmark-specific. Task prompt also fixed: it told the
agent to use insert_line (contradicting HARD RULE 3) — now says annotate_trace,
annotate-once. active stays v7 pending eval (test via CONTRACTOR_EVAL_TRACE_PROMPT_VERSION=converge).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Eval confirms: crapi-workshop 7/15 -> 13/15 (+6) with restores 110 -> 20 (-82%)
and lowest token cost; fastapi/spring/vulnyapi unchanged at 11/12 (zero regression
— clean cases never thrashed so commit-discipline is a no-op there). The workshop
recall collapse was an annotate->restore convergence thrash; commit-discipline fixes
it generally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review found the active trace task (v1) misaligned with the planner/agent/skill
surface split: it leaked worker tool/edit mechanics into the planner surface
(planner is tool-agnostic), duplicated the trace skill's sink catalogue + finding
shapes (already drifted — stale sink list, missing Shape D), and its output_format
diverged from the agent §OUTPUT. v3 keeps only planner-appropriate content (goals,
state, scope, planning rules, stop conditions), delegates all domain knowledge to
the injected trace skill, drops tool/mechanic detail, and aligns output to the
agent §OUTPUT. Added CONTRACTOR_TASK_VERSION_<NAME> override to TaskTemplate.load so
v3 can be A/B eval-gated without flipping active (stays v1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
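
One way the `CONTRACTOR_TASK_VERSION_<NAME>` override could resolve a version is sketched below. The function name and the normalization of `<NAME>` (uppercase, non-alphanumerics folded to underscores) are assumptions; the commit message does not say how the name is derived.

```python
import os
import re


def resolve_task_version(name, active, env=None):
    """Return the env-pinned version for a task template, else the active one."""
    env = os.environ if env is None else env
    # Assumed normalization: "trace-annotation" -> "TRACE_ANNOTATION".
    suffix = re.sub(r"[^A-Za-z0-9]+", "_", name).upper()
    return env.get(f"CONTRACTOR_TASK_VERSION_{suffix}") or active
```

With this shape, an A/B run pins v3 through the environment while the checked-in active version stays at v1.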
Pipeline eval (planner-driven trace-graph + trace-graph-pathpar on vulnyapi,
converge agent held constant) confirms v3 improves annotation F1 on BOTH variants:
trace-graph 0.706->0.776 (recall 0.57->0.91), pathpar 0.636->0.783 (recall
0.67->0.86), zero regression. The lean, drift-free, tool-agnostic task lets the
planner/worker cover the path better while delegating knowledge to the trace skill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hmark dirs

The exploit wrapper's final envelope write was truncated (top-level 0/0) despite
all 15 per-benchmark dirs persisting. scripts/rebuild_xbow_envelope.py re-aggregates
them into a capture envelope: pass = a real uuid-shaped flag{...} present in the
dir (the XBOW flag-capture metric, broader than the exploitability verdict). Now
shows 14/15 (XBEN-010 = the timeout, only dir with no flag/metrics).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
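
The capture rule above (a directory passes if it contains a uuid-shaped `flag{...}`) might look roughly like this. It is a sketch and not the contents of scripts/rebuild_xbow_envelope.py: the function names, the envelope keys, and the choice to scan every file in the directory are assumptions.

```python
import re
from pathlib import Path

# flag{8-4-4-4-12 hex digits}, i.e. a uuid-shaped flag body.
FLAG_RE = re.compile(
    r"flag\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}"
)


def dir_captured_flag(bench_dir):
    """True if any file under bench_dir contains a uuid-shaped flag."""
    for path in Path(bench_dir).rglob("*"):
        if path.is_file() and FLAG_RE.search(path.read_text(errors="replace")):
            return True
    return False


def rebuild_capture_envelope(root):
    """Re-aggregate per-benchmark dirs into a pass/total summary."""
    cases = {
        d.name: dir_captured_flag(d)
        for d in sorted(Path(root).iterdir())
        if d.is_dir()
    }
    return {"passed": sum(cases.values()), "total": len(cases), "cases": cases}
```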
…-flask-app)

The 10th realvuln repo, never used to tune codereview v3 → a genuine held-out
split for the overfit test. Result: held-out recall (best 0.19) sits within the
seen range (0.17-0.40), so overfit is NOT confirmed — v3's low recall is a uniform
single-pass weakness, not memorization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Feeds the fixture's OpenAPI spec to the trace agent as an attack-surface map for
the X1 A/B (spec-first vs code-direct). Default off = unchanged. Tests the founding
formats-as-substrate premise: does the OAS map lift endpoint/annotation recall?

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… brace pitfall)

ADK inject_session_state reads {id} as a session-state lookup -> KeyError('id'),
aborting the likec4 workflow. The example OAS path now uses the non-identifier
{note-id} form (renders literally, ADK-safe), per the project's brace convention.
Surfaced while generating a LikeC4 model for the X2 (LikeC4->STRIDE) test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eat-with-likec4)

Generated by the likec4 workflow (after the brace fix). Used for the X2 A/B and
kept as a precompute so the threat eval can run with-LikeC4. X2 CONFIRMED: feeding
it lifts STRIDE coverage 3→5 categories, reports 8→27, endpoint recall 0.67→1.00.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
InvalidToolCallGuardrailCallback returned the response unconditionally
whenever parts were non-empty, so CallbackChain stopped at it and any
callback appended after the worker chain was dead code — notably
MandatoryToolCallback in both exploitability agents, whose verdict
enforcement never fired. Track whether a part was actually rewritten and
return None otherwise (state is still saved).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
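
The failure mode above follows from first-non-None-wins chaining. The sketch below models it with plain dicts; `run_chain`, `guardrail`, and `mandatory_tool` are simplified stand-ins, not the project's CallbackChain, InvalidToolCallGuardrailCallback, or MandatoryToolCallback, and the `malformed` / `tool_call` keys are invented for the example.

```python
from typing import Callable, Optional

Response = dict
Callback = Callable[[Response], Optional[Response]]


def run_chain(callbacks: list[Callback], response: Response) -> Optional[Response]:
    """Run callbacks in order; the first non-None result stops the chain."""
    for callback in callbacks:
        result = callback(response)
        if result is not None:
            return result
    return None


def guardrail(response: Response) -> Optional[Response]:
    """Rewrite malformed parts; return None when nothing was rewritten."""
    rewritten = False
    parts = []
    for part in response.get("parts", []):
        if part.get("malformed"):
            parts.append({"text": "invalid tool call, please retry"})
            rewritten = True
        else:
            parts.append(part)
    # Returning the response unconditionally here would starve later callbacks.
    return {"parts": parts} if rewritten else None


def mandatory_tool(response: Response) -> Optional[Response]:
    """Demand a tool call when the response contains none."""
    if any("tool_call" in part for part in response.get("parts", [])):
        return None
    return {"parts": [{"text": "a tool call is required"}]}
```

With the guardrail returning None on untouched responses, a text-only reply reaches `mandatory_tool` instead of being short-circuited.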
The overlay glob relied on PurePosixPath.match, which on Python 3.12 is
not recursive for ** and matches right-anchored: /**/*.py missed
root-level files, /src/**/*.py returned nothing, and multi-level
non-** patterns failed. Move the correct path-aware matcher from
cli/fs.py into contractor/tools/fs/globmatch.py (cli imports it from
there, preserving the import direction rule) and rewrite the overlay
glob as walk + regex, mirroring RootedLocalFileSystem semantics
(tombstones excluded, overlay-added files included, files only).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
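
A walk + regex glob in the spirit of the fix above can be sketched like this. It is not the matcher in contractor/tools/fs/globmatch.py: the function names are hypothetical, tombstone and overlay handling are omitted, and only `*`, `?`, and `**` are translated.

```python
import os
import re


def glob_to_regex(pattern):
    """Translate a rooted glob into a regex over '/'-separated relative paths."""
    pattern = pattern.lstrip("/")
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")  # trailing **: anything below this point
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # within one path segment only
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def glob_files(root, pattern):
    """Walk root and return rooted paths of files (only) that match pattern."""
    regex = glob_to_regex(pattern)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")
            if regex.match(rel):
                hits.append("/" + rel)
    return sorted(hits)
```

Because `**/` compiles to "zero or more directories", `/**/*.py` also matches root-level files, which is the case the right-anchored match missed.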
… per-invocation artifact keys

Four task-runner fixes from the project review:

- Exceptions inside _run_single_iteration (transient LiteLLM/network
  errors, typo'd skills) now consume an attempt and retry per the
  documented max_attempts invariant instead of aborting the whole
  workflow; ITERATION_RESULT carries error_type/error_message and
  TaskNotCompletedError chains the last exception. CancelledError
  still propagates.
- Declared input artifacts that were never published are now warned
  about instead of silently rendering as empty strings; artifact bytes
  decode with errors="replace" + warning instead of errors="ignore".
- add_task(artifacts=[]) no longer resurrects template defaults
  (mirrors the skills is-not-None handling).
- add_task accepts an optional artifact_key so fan-out tasks from the
  same template publish non-colliding artifacts; checkpoint restore
  validates against the invocation's own expected artifact names, not
  whatever a sibling task published.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
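
The first fix in this commit (an exception consumes an attempt instead of aborting) can be sketched as a retry loop. The names below are stand-ins: `run_with_attempts` is hypothetical, the local `TaskNotCompletedError` only mirrors the name from the commit message, and the log entries approximate what ITERATION_RESULT might carry.

```python
import asyncio


class TaskNotCompletedError(RuntimeError):
    """Stand-in for the project's error of the same name."""


async def run_with_attempts(run_iteration, max_attempts):
    """Retry run_iteration up to max_attempts; failures consume an attempt."""
    last_exc = None
    log = []
    for attempt in range(1, max_attempts + 1):
        try:
            done = await run_iteration(attempt)
        except asyncio.CancelledError:
            raise  # cancellation must never be swallowed by the retry loop
        except Exception as exc:
            last_exc = exc
            log.append({
                "attempt": attempt,
                "error_type": type(exc).__name__,
                "error_message": str(exc),
            })
            continue
        log.append({"attempt": attempt, "completed": done})
        if done:
            return log
    # Chain the last exception so the root cause survives in the traceback.
    raise TaskNotCompletedError(f"not completed after {max_attempts} attempts") from last_exc
```

On current Python versions `CancelledError` derives from `BaseException`, so the explicit clause mostly documents intent; it would matter if the handler were ever widened.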
…que fan-out artifact keys

trace-verify only read findings under trace-annotation:* — running it
after the production-default trace-graph (trace-graph:*) or pathpar
(trace-graph-pathpar:*) silently verified nothing. Extract the three
producer prefixes into shared constants (contractor/workflows/
namespaces.py; pathpar's PATH_NAMESPACE_PREFIX aliases the shared one),
probe all of them per path, and WARN naming the probed prefixes when
zero findings exist anywhere instead of skipping at DEBUG.

Also wire the new per-invocation artifact_key into the fan-out
workflows so per-finding tasks stop overwriting each other's published
artifacts: trace_verify keys by source namespace + finding,
exploitability and vuln_scan_trace by finding name.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…kpoint validation

- TaskRunner/AgentRunner._emit no longer let an exception in the event
  handler (metrics sink disk error, UI render failure) abort the run;
  CancelledError still propagates.
- metrics plugin: _CallTracker.resolve prefers the oldest unfinished
  non-errored call and register() finishes stale errored same-fingerprint
  calls, so a retried identical tool call pairs with its own after_tool
  instead of the stale errored one (wrong success flag, wrong timing,
  leaked pending call).
- checkpoint restore is skipped with a warning when the entry's
  template_key/template_version no longer match the invocation, instead
  of silently feeding stale artifacts downstream.
- Checkpoint.load treats structurally malformed entries as corrupt
  (warning + ignore) instead of raising KeyError; _load_checkpoint moved
  inside run()'s try so cleanup runs if it fails.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…k loss, settings routing

- vuln_scan_fast._dedup no longer crashes on trailing "CWE-" or
  null details/place (str coercion + regex CWE extraction); same
  coercion in the trace-confirm stage.
- stable explicit ref= on conditionally-added tasks in oas_building,
  likec4_building, vuln_assess, vuln_scan_fast so --resume matches
  checkpoint entries regardless of which tasks were skipped.
- oas_building/oas_enrichment/likec4_building TaskRunners now use
  ctx.app_name instead of ad-hoc names, aligning publish/load with the
  skip-checks and CLI export under app-partitioned artifact services.
- trace_graph_pathpar merges and saves completed overlay forks in a
  try/finally around the TaskGroup, so one failed path no longer
  discards every other path's annotations (inline rather than _cleanup
  because vuln_assess drives _run_impl directly).
- CONTRACTOR_TARGET_URL/CONTRACTOR_PROXY routed through Settings
  (aliased fields, same env vars) instead of ad-hoc os.environ reads;
  cli/.env anchored via Path(__file__) so non-CLI entrypoints load the
  same env file.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
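
The first bullet's combination of str coercion and regex CWE extraction might look like the sketch below. It is illustrative only: the assumption that the CWE id lives in `details`, that `place` looks like `file:line`, and the `dedup_key` / `dedup` names are not taken from vuln_scan_fast.

```python
import re

CWE_RE = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def dedup_key(finding):
    """Build a (file, CWE) key that tolerates None fields and a bare 'CWE-'."""
    details = str(finding.get("details") or "")
    place = str(finding.get("place") or "")
    match = CWE_RE.search(details)
    cwe = f"CWE-{match.group(1)}" if match else ""
    return (place.split(":", 1)[0], cwe)


def dedup(findings):
    """Keep the first finding for each (file, CWE) key, preserving order."""
    seen = set()
    unique = []
    for finding in findings:
        key = dedup_key(finding)
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique
```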
…radictions, context-callback fixes

- finish-time summarizer runs with tools=[] instead of inheriting the
  worker's whole toolset; its records payload is capped to max_records
  and each record field truncated (20k + marker), so a long run can't
  blow the context window inside finish; malformed-worker raw output is
  stored truncated instead of verbatim.
- TASK_LIMIT_REACHED_MSG no longer tells the planner to "finish
  immediately" (finish(done) refuses while subtasks are new); v5
  bootstrap step 2 no longer instructs a zero-subtask finish(done) the
  tool is guaranteed to refuse; decompose_subtask reports remaining
  capacity instead of claiming the limit is reached for other failures.
- SubtaskDecomposition enforces the documented 1-3 children
  (max_length=3).
- SummarizationLimitCallback latches per invocation instead of
  appending the summarize message to every subsequent request.
- FunctionResultsRemovalCallback gives unmatched function responses a
  per-index sentinel signature so they never dedup against each other.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ns, fixture pinning

- eval auto-skip gate bypassed only when the -m markexpr actually
  selects eval (word match), not for any marker expression;
  CONTRACTOR_RUN_EVAL parsed as a boolean so =0 disables evals.
- rebuild_eval_envelope.py discovers both the legacy flat layout and
  the dated archive layout (eval_runs/<stamp>/<scenario>-<unit>-eval-
  <fixture>/cases), deduped newest-wins; --all scans both.
- EvalSink default run_name includes scenario (and non-generic
  metric_kind) so buckets sharing a unit stop overwriting each other's
  latest envelope; warns when records in a bucket disagree on
  model/prompt_version.
- prepare_vuln_benchmarks.py treats checkout-pin failures as fatal and
  removes the partial clone instead of printing "OK (pinned)" while
  building the fixture from HEAD.
- dump_langfuse_trace.py resolves cli/.env from the repo root (the old
  contractor/cli/.env candidate never existed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
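
The first bullet's two checks can be sketched as small helpers. Both function names and the accepted truthy spellings are assumptions; only the word-match rule and the "=0 disables evals" behaviour come from the commit message.

```python
import re

_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(value, default=False):
    """Parse an env value as a boolean, so "0" and "false" disable the flag."""
    if value is None:
        return default
    return value.strip().lower() in _TRUTHY


def markexpr_selects_eval(markexpr):
    """True only when `eval` appears as a whole word in the -m expression."""
    return re.search(r"\beval\b", markexpr or "") is not None
```

A plain word match still fires on an expression such as `not eval`; a stricter gate would have to parse the marker expression, which this sketch does not attempt.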
… close http clients, additive prompt sections

- the live UI now stops only on workflow_finished (emitted exactly once
  in Workflow.run's finally), so per-finding task_failed events in
  vuln_scan_trace/exploitability no longer blank the rest of the run;
  workflow construction errors surface as a clean click.UsageError.
- HTTPClient opens a per-request httpx.AsyncClient (async with) instead
  of an instance-lifetime client nothing ever closed; session cookies
  persist via a shared jar merged back after each response; public tool
  signatures unchanged.
- oas_analyzer TaskDescription.format composes sections additively, so
  the general sub-agents get their objective back and idor/ssrf keep
  their examples (the old chained conditional dropped them).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e, descriptive errors

- add_task validates skill names eagerly (ValueError listing available
  skills) instead of surfacing a FileNotFoundError hours into a run;
  skill/artifact memory injection hoisted out of the retry loop to once
  per task (inputs never change across attempts).
- carry state strips the task's stale invocation-scoped planner keys
  between attempts, so retries stop deep-copying and re-snapshotting
  every previous attempt's subtask state into metrics.jsonl.
- AgentRunner threads the event handler through run() instead of
  stashing it on the instance — concurrent run() calls no longer
  clobber each other's handler.
- RenderedTask raises on artifact refs that normalize to the same
  template variable instead of silently dropping one; TaskTemplate.load
  reports missing required fields as descriptive ValueErrors.
- EventType/AgioEventType mirroring pinned by a test so new event types
  can't silently vanish from metrics.jsonl (TASK_SKIPPED kept — it IS
  emitted via Workflow.emit_task_skipped, contrary to the review note).
- resolved the after_run_callback comment contradiction empirically: it
  fires when the run completes, not on mid-stream errors/cancellation —
  SandboxCleanupPlugin documented as primary teardown, the run() sweep
  as backstop.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… resolution

- a rejected skip (current subtask done/decomposed/skipped) now returns
  an {"error": ...} naming the actual cause instead of the same
  "no active subtasks" result as a successful skip — the planner can
  finally tell the two apart.
- task_tools raises a descriptive ValueError at build time when
  worker_instrumentation=False, use_input_schema=True and the worker
  has no input_schema, instead of an obscure ADK KeyError at runtime.
- Likec4Linter resolves its binary lazily on first use, so building the
  tool list never raises and a missing binary surfaces as the
  documented error tool-result instead of crashing workflow assembly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ine, walk ceiling

- walk() filters symlinked-outside files from its yielded names the way
  ls/glob already do (content was never readable; the names leaked).
- _strip_protocol returns the realpath it just validated instead of the
  unresolved candidate, closing the check-then-use window; in-sandbox
  symlinks keep resolving (regression-tested).
- overlay forks record their pre-fork tombstone set and
  merge_overlay_forks subtracts that baseline — pre-fork deletes no
  longer re-propagate on merge (the old filter subtracted tombstones
  from file-content keys, a no-op).
- fs glob/grep walks are bounded by a new fs_max_files_per_walk setting
  (default 100k, env FS_MAX_FILES_PER_WALK); hitting the ceiling is
  reported in the tool output instead of silently scanning forever.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
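
The `_strip_protocol` change above amounts to "return what you validated". A minimal sketch of that idea, assuming a POSIX filesystem and a hypothetical `resolve_inside` helper (the real method's signature and error type are not shown in the commit message):

```python
import os


def resolve_inside(root, candidate):
    """Resolve candidate under root and return the validated real path."""
    root_real = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root_real, candidate.lstrip("/")))
    if os.path.commonpath([root_real, target]) != root_real:
        raise PermissionError(f"{candidate!r} resolves outside the sandbox")
    # Return the resolved path that was just checked, not the raw candidate,
    # so a symlink swapped in later cannot change what gets opened.
    return target
```

In-sandbox symlinks still work because their resolved target stays under the root; only targets that resolve outside it are rejected.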
…, no-raise observability

- RpmRatelimitCallback rolls its window forward when stale, mirroring
  the TPM implementation, instead of accumulating counts across stale
  windows; both classes now carry a prominent warning that their sync
  time.sleep blocks the event loop (the CallbackChain is strictly sync,
  so async conversion is not an option today).
- TokenUsageCallback mirrors the in-progress invocation into history on
  every call (same-key overwrite, no double-count), so the final
  invocation is no longer missing; Russian comments translated; unused
  TokenUsageCallbackException deleted.
- BaseCallback.validate() actually validates now — parameter name/kind
  comparison against the expected ADK callback signature (return
  annotations excluded, which is why the old check was disabled); all
  production callbacks verified to pass.
- observability.run_context degrades to a no-op span with a warning
  when Langfuse is enabled but broken, honoring the module's
  never-raises contract; init()'s no-retry-after-failure is documented.
- module-level logger.setLevel(DEBUG) removed from ratelimits and
  guardrails (library code must not override app log policy).

ThinkingBudgetGuardrailCallback and ToolMaxCallsGuardrailCallback are
deliberately kept unwired for future use.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
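
The stale-window fix in the first bullet can be illustrated with a fixed-window counter. This is a sketch, not RpmRatelimitCallback: the class name, the injectable clock, and the choice to return the wait time instead of calling `time.sleep` are all assumptions made to keep the example testable and non-blocking.

```python
import time


class RpmWindow:
    """Fixed-window request counter that rolls forward when the window is stale."""

    def __init__(self, limit, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def acquire(self):
        """Register one request; return how many seconds the caller should wait."""
        now = self.clock()
        if now - self.window_start >= self.window_s:
            # Stale window: start a fresh one instead of accumulating counts.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            wait = self.window_s - (now - self.window_start)
            # This request is charged to the window that opens after the wait.
            self.window_start = now + wait
            self.count = 1
            return wait
        self.count += 1
        return 0.0
```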
…, per-path artifact keys

- oas_builder/oas_linter get accurate AgentTool descriptions instead of
  all three workers presenting as "software engineering agent" to the
  router; http_agent's context-limit instruction no longer references a
  nonexistent report tool.
- planner v5 Rule 6 reworded format-neutrally (it asserted an XML
  record shape while the default planner format is json); orphaned
  FINISH_MAX_CALLS_RVALUE deleted; leftover code_graph_agent dir
  removed.
- oas_analyzer: severity sorting uses an explicit rank map (medium no
  longer sorts below low; unknown severities pinned last), sub-agent
  order is deterministic (tuple, not set), output_schema annotation
  fixed to type[BaseModel].
- trace_annotation passes a per-path artifact_key so fanned-out path
  tasks stop overwriting each other's published artifacts (the other
  three trace workflows drive AgentRunner directly and publish none —
  no change needed); module-level logger.setLevel(DEBUG) removed from
  all four trace workflows.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
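
The severity fix in the oas_analyzer bullet comes down to sorting by rank instead of by string. A small sketch, with an assumed set of severity labels and a hypothetical `sort_by_severity` helper:

```python
# Assumed label set; lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3, "info": 4}


def sort_by_severity(findings):
    """Sort most severe first; unknown or missing severities are pinned last."""
    unknown_rank = len(SEVERITY_RANK)
    return sorted(
        findings,
        key=lambda f: SEVERITY_RANK.get(
            str(f.get("severity") or "").lower(), unknown_rank
        ),
    )
```

Sorting the raw strings alphabetically would place "low" ahead of "medium", which is the ordering bug the rank map avoids.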
- eval harness run_agent raises AgentRunTimeout carrying the partial
  AgentRun (tool calls, best-effort session state and artifacts,
  timed_out=True) instead of discarding everything on timeout; eval
  consumers still see the timeout as a failure, but err.partial is now
  inspectable for debugging.
- analyze_metrics.py computes cost only for models present in the
  pricing table and says which models were skipped, instead of pricing
  local lm-studio models with a Gemini fallback; all-unknown runs show
  "n/a" rather than a fake $0.
- select_fixture annotation corrected (it never returns None);
  MetricsSink's envelope-wins key shadowing documented as intentional.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
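
The pricing rule in the analyze_metrics.py bullet can be sketched as below. The function name, the `(input_tokens, output_tokens)` usage shape, the per-million-token price tuples, and the output format are assumptions for illustration; no real model prices are implied.

```python
def summarize_cost(usage, pricing):
    """Price only models present in the table and report the ones skipped.

    usage:   {model: (input_tokens, output_tokens)}
    pricing: {model: (usd_per_1m_input, usd_per_1m_output)}
    """
    total = 0.0
    priced_any = False
    skipped = []
    for model, (tokens_in, tokens_out) in usage.items():
        if model not in pricing:
            skipped.append(model)  # no fallback price for unknown models
            continue
        price_in, price_out = pricing[model]
        total += tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
        priced_any = True
    return {
        "cost": f"${total:.4f}" if priced_any else "n/a",
        "skipped": sorted(skipped),
    }
```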
README: injection happens once per task (not per attempt), checkpoint +
best-effort event delivery, invocation-scoped subtask state, per-invocation
artifact keys, summarizer is tool-less + record-capped, finish refusal rules,
full 14-workflow registry. TUNABLE_PARAMS: FS_MAX_FILES_PER_WALK, task-version
env override, trace_annotation v3, real harness timeouts, new eval knobs.
tuning: Settings-routed sampling + tool caps, observations block, corrected
per-workflow budget table, current prompt actives. eval-tuning: Tier 1-2
marked landed (Settings env vars), envelope/rebuild guidance, eval gate.
insights-parallel: merge-in-finally semantics, per-call on_event, config-routed
max_concurrency. arch.likec4: Pipeline->Workflow names, full registry,
oas_validate tasks, --workflow CLI forms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
grauwolf32 and others added 10 commits June 11, 2026 00:15
…relocate shannon design doc

- trace_annotation_direct: with_graph_tools back to false — it had drifted
  to true (carried over in the 127ed1f relocation), making trace-direct
  functionally identical to trace-graph and invalidating both docstrings'
  A/B framing. trace-direct is the prompt-only baseline again.
- trace_annotation_direct + trace_graph config.yaml: drop output_format —
  both workflows pass _format=template.format, the YAML knob was never read.
- contractor/workflows/shannon/ held only DESIGN.md (no workflow.py /
  config.yaml / __init__.py, unregistered) — moved to
  docs/shannon-workflow-design.md so the workflows package stays
  one-folder-per-workflow.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ed vuln tool-name constants

The ~35-line after_model chain merge was duplicated verbatim in
exploitability_agent and web_exploitability_agent; it now lives as
chain_after_model_callback() in callbacks/adapter.py (raises on a
non-after_model callback instead of silently dropping it).

READ_ONLY_VULN_TOOL_NAMES / VERDICT_TOOL_NAMES move next to the tools
they describe in contractor/tools/vuln.py, replacing two private copies
of each plus a third read-only set in trace_verifier_agent.

No behavior change; full unit suite green except two pre-existing
trace-postdiff namespace failures from in-flight work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The load-artifact -> guard -> yaml.safe_load -> mapping-guard ->
name-backfill reshape was copy-pasted in five workflows (exploitability,
trace_verify, vuln_scan_fast, vuln_scan_trace, vuln_assess). It now
lives in contractor/workflows/findings.py as load_findings_artifact()
(name->fields mapping to list of finding dicts) and
load_yaml_dict_artifact() (the dict-merging variant vuln_assess needs),
so the read side of the findings contract can't drift between workflows.

Call-site extras (severity sort, file+CWE dedup) stay where they were.
Net -83 lines; behavior unchanged except unparseable-YAML warnings are
now uniformly logged (vuln_scan_fast and vuln_assess previously
swallowed them silently).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
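
The mapping-guard and name-backfill steps of that reshape can be sketched on an already-parsed object. This is not load_findings_artifact(): the artifact load and `yaml.safe_load` steps are left out, and `findings_from_mapping` is a hypothetical name.

```python
def findings_from_mapping(data):
    """Turn a name -> fields mapping into a list of finding dicts."""
    if not isinstance(data, dict):
        return []  # mapping guard: empty or malformed documents yield nothing
    findings = []
    for name, fields in data.items():
        if not isinstance(fields, dict):
            continue  # skip entries whose value is not a field mapping
        finding = dict(fields)
        finding.setdefault("name", str(name))  # backfill name from the key
        findings.append(finding)
    return findings
```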
…tics_agent

New trace-postdiff workflow: stage A runs the trace_agent loop with
vulnerability reporting disabled (navigation only); stage B runs the
new vuln_analytics_agent once per path over the @trace-marked diff of
that path's annotations (judgement only), reporting findings under the
trace-postdiff: namespace prefix so trace-verify/vuln-assess pick them
up. Targets the small-model navigate-vs-judge split; A/B against the
single-stage trace-graph.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
group_paths_by_prefix(depth) makes the route group — not the single
path — the unit of memory namespace, skill injection, fork/concurrency
(pathpar), and the analytics stage (postdiff), so sibling handlers
share discovered context instead of re-navigating N times. Defaults:
trace-postdiff groups at depth 1; pathpar keeps depth 0 (per-path,
byte-identical namespaces). Consumers (trace-verify, vuln_assess)
probe depth-1/2 group keys alongside path keys, with cross-path dedup
so a group's findings are verified once.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
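
A guess at what grouping by route prefix could look like is sketched below. The name `group_paths_by_prefix` appears in the commit message, but this signature, the return type, and the group-key format are assumptions rather than the project's implementation.

```python
from collections import defaultdict


def group_paths_by_prefix(paths, depth):
    """Group API paths by their first `depth` segments.

    depth <= 0 keeps every path in its own group (per-path behaviour).
    """
    if depth <= 0:
        return {path: [path] for path in paths}
    groups = defaultdict(list)
    for path in paths:
        segments = [segment for segment in path.split("/") if segment]
        key = "/" + "/".join(segments[:depth])
        groups[key].append(path)
    return dict(groups)
```

At depth 1, sibling handlers such as `/users` and `/users/{id}` land in one group and can share discovered context.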
Pass 1 fans one nomination sweep out per vulnerability class in
parallel (sink_nomination template), each sweeping the whole project
for its single class at low confidence — narrow attention beats one
do-everything scan on large codebases. Includes an explicit
absence-of-control class so missing-auth/ownership, which a taint
scan can't surface, gets nominated per handler. Pass 2 merges, dedups,
caps, and deep-traces survivors by reusing VulnScanTraceWorkflow's
trace phase (refactored to instance-level self.CFG + namespace-derived
refs so the subclass reuses it verbatim).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New docs/planner.md documents the planner/task_runner spine with nine
Mermaid diagrams: component layering, the TaskRunner per-task retry
lifecycle, the one-iteration sequence (planner↔manager↔worker↔summarizer),
the prompt-v5 action-picker loop, the subtask state machine, the
execute_current_subtask worker-retry/parse flow, the flat
decompose-and-insert layout, finish + summarizer, the two-tier
session-state keyspace, and the artifact hand-off. Cross-linked from
README §4.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ent implementation'

§8 lays out the planner/task_runner design space as A/B-able experiments
(V0–V7) grouped by axis — control-flow/plan-shape, verification,
context-passing — each with what-changes / hypothesis / metric.

§9 frames the next decisions: 9.1 names the design decisions the baseline
already commits to (and which variant revisits each), 9.2 recommends the
first batch (V0→V1→V3, V2), 9.3 specs the enabling planner_builder seam +
CONTRACTOR_PLANNER_STRATEGY env routing (symmetric to the existing
worker_builder / task-version pattern) so variants run through the same
eval/v1 pass@N harness. 'Where to look next' renumbered to §10.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…odology

Incorporates a synthesized technical review of §8–§9:

- §8.2 reframed 'Verification' → 'Trust the worker's verdict': V3/V4/V8
  are three challengers to ONE committed decision (judge / ask-N /
  re-ask-once), tested as a single axis. V3 split into three scopings
  (acceptance-line / finish-gate / per-subtask), cheapest first.
- New variants: V8 (re-execute-once on incomplete — relaxes the §4
  invariant for one retry), V9 (worker-proposed decomposition — the
  info-flow fix V7 approximates blindly), V10 (plan carry-forward across
  attempts — new §8.4 retry/resume axis).
- §8.1: fixed V1 ('plan-once' is tool removal AND a paired prompt, not one
  disabled tool); added V2 attribution caveat.
- §9.1 decision table updated (trust row = V3·V4·V8; empty-plan-per-attempt
  row = V10; decompose row += V9).
- §9.2 resequenced: seam → baseline telemetry → V0/V1 2x2 → V8 →
  V3-cheapest-scope → V9.
- §9.3: made the strategy contract explicit (write the fixed §6 keys + set
  end_invocation — the whole planner_builder interface); registry entries
  are (builder, prompt_version, toolset) bundles since prompts travel with
  strategies.
- New §10 methodology: instrument the baseline first (telemetry pre-checks
  hypotheses for free) + budget for variance (pre-register N and a
  threshold beating baseline seed spread). 'Where to look next' → §11.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Extends the strategy seam beyond planner_builder (§9.3) to the three
injection points the in-loop variants need, each defaulting to today's
behaviour so streamline stays byte-identical:

- execution_policy (task_tools → execute_current_subtask): the worker-call
  core behind a protocol — unlocks V8 (re-execute-once), V4 (best-of-N),
  V3 per-subtask critic. Notes that V3's cheaper finish-gate scope hooks
  the finish closure instead.
- records_policy (task_tools → get_records / save_record): the history
  view behind a protocol — unlocks V6 rolling summary.
- scheduler (StreamlineManager.get_current_subtask + advance, plus runner
  parallelism): subtask selection + DAG concurrency — unlocks V5; flagged
  as the deep, second-wave seam (spans manager + runner, needs depends_on
  + pathpar fork/merge).

Adds a bundle-composition diagram and notes V10 is a runner-level flag
(_build_task_initial_state), not one of these policies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@grauwolf32 grauwolf32 merged commit e3d5632 into main Jun 10, 2026
1 check failed