Upd0611 by grauwolf32 · Pull Request #55 · grauwolf32/contractor

grauwolf32 · 2026-06-10T22:14:10Z

No description provided.

…ssing from ground truth
- dvblab-023: PyYAML 5.3.1 vulnerable dependency (CVE-2020-14343), reachable via yaml.load(Loader=yaml.Loader) at auth_routes.py:157.
- dvpwa-023: stored XSS via unescaped {{ student.name }} in evaluate.jinja2:25 under autoescape=False (course.name is excluded: it is Undefined on the Course tuple).
Both were reported by the scanner but scored as false positives (FP) under the incomplete ground truth (GT).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-pay) The fixture's project root is tests/playground/typescript/vault-pay (package.json + src/); the doubled path never resolved. Makes the 2 vaultpay trace cases runnable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Each run writes a single-fixture envelope + per-case metrics/artifacts to eval_runs/<RUN_STAMP>/<scenario>-<unit>-eval-<fixture>/ (RUN_STAMP = mmdd-HHMMSS per process, overridable via CONTRACTOR_EVAL_RUN_STAMP). The flat eval_runs/<unit>/ path is kept as a 'latest' pointer for analytics-ui back-compat. case_artifact_dir gains a scenario tag (default agent; task callers updated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
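A minimal sketch of the directory layout described above, assuming the stamp is computed once per process and reused. The helper names run_stamp and run_dir are hypothetical and are not taken from the repository; only the env var name and the path shape come from the commit message.

```python
import os
from datetime import datetime
from pathlib import Path
from typing import Optional


def run_stamp(now: Optional[datetime] = None) -> str:
    """Return an mmdd-HHMMSS stamp, or the CONTRACTOR_EVAL_RUN_STAMP override."""
    override = os.environ.get("CONTRACTOR_EVAL_RUN_STAMP")
    if override:
        return override
    return (now or datetime.now()).strftime("%m%d-%H%M%S")


def run_dir(root: Path, stamp: str, scenario: str, unit: str, fixture: str) -> Path:
    """Build eval_runs/<RUN_STAMP>/<scenario>-<unit>-eval-<fixture>/ (hypothetical helper)."""
    return root / stamp / f"{scenario}-{unit}-eval-{fixture}"
```

A caller would compute the stamp once at startup and pass it to every run_dir call so that all fixtures of one process land under the same dated directory.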

… cap Lets an experiment loosen/disable count-based heavy-result elision (keep_last_n) without code changes; >0 overrides the caller's elide_keep_last_n (default 15), 0 keeps historical behaviour. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…S_AT) Run each trace case N times; passes if any attempt passes (pass@N semantics, consistent with the detection eval). Default 1 = unchanged. Per-attempt runs[] recorded when N>1.
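The pass@N rule above can be sketched as follows. This is an illustration only: run_pass_at_n and the shape of the returned dict are assumptions, and the sketch runs all N attempts without short-circuiting because the message says each case is run N times.

```python
from typing import Callable, Dict, List


def run_pass_at_n(run_case: Callable[[], bool], n: int = 1) -> Dict:
    """Run a case n times; the case passes if any attempt passes."""
    runs: List[Dict] = []
    for attempt in range(1, max(1, n) + 1):
        runs.append({"attempt": attempt, "passed": bool(run_case())})
    result: Dict = {"passed": any(r["passed"] for r in runs)}
    if n > 1:
        # Per-attempt records are only kept when more than one attempt ran.
        result["runs"] = runs
    return result
```

With n=1 the result has no runs entry, which matches the "Default 1 = unchanged" note.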

…ore thrash Records showed failing crapi-workshop trace cases churning annotate->restore->re-annotate (restore 2.4-5.8x the annotate count) because annotate_trace refuses duplicates, so revising forces a full-path restore. converge.md = v7 + HARD RULE 9 (commit discipline: annotate once, cap restores, stop on no-progress) + matching anti-pattern; general, not benchmark-specific. Task prompt also fixed: it told the agent to use insert_line (contradicting HARD RULE 3); it now says annotate_trace, annotate-once. active stays v7 pending eval (test via CONTRACTOR_EVAL_TRACE_PROMPT_VERSION=converge). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Eval confirms: crapi-workshop 7/15 -> 13/15 (+6) with restores 110 -> 20 (-82%) and lowest token cost; fastapi/spring/vulnyapi unchanged at 11/12 (zero regression — clean cases never thrashed so commit-discipline is a no-op there). The workshop recall collapse was an annotate->restore convergence thrash; commit-discipline fixes it generally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Review found the active trace task (v1) misaligned with the planner/agent/skill surface split: it leaked worker tool/edit mechanics into the planner surface (planner is tool-agnostic), duplicated the trace skill's sink catalogue + finding shapes (already drifted — stale sink list, missing Shape D), and its output_format diverged from the agent §OUTPUT. v3 keeps only planner-appropriate content (goals, state, scope, planning rules, stop conditions), delegates all domain knowledge to the injected trace skill, drops tool/mechanic detail, and aligns output to the agent §OUTPUT. Added CONTRACTOR_TASK_VERSION_<NAME> override to TaskTemplate.load so v3 can be A/B eval-gated without flipping active (stays v1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Pipeline eval (planner-driven trace-graph + trace-graph-pathpar on vulnyapi, converge agent held constant) confirms v3 improves annotation F1 on BOTH variants: trace-graph 0.706->0.776 (recall 0.57->0.91), pathpar 0.636->0.783 (recall 0.67->0.86), zero regression. The lean, drift-free, tool-agnostic task lets the planner/worker cover the path better while delegating knowledge to the trace skill. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…hmark dirs The exploit wrapper's final envelope write was truncated (top-level 0/0) despite all 15 per-benchmark dirs persisting. scripts/rebuild_xbow_envelope.py re-aggregates them into a capture envelope: pass = a real uuid-shaped flag{...} present in the dir (the XBOW flag-capture metric, broader than the exploitability verdict). Now shows 14/15 (XBEN-010 = the timeout, only dir with no flag/metrics). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
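As a rough illustration of the capture metric described above, the sketch below treats a benchmark directory as passed when any file contains a uuid-shaped flag{...}. The exact pattern used by scripts/rebuild_xbow_envelope.py is not shown in this PR, so FLAG_RE and both helper names are assumptions.

```python
import re
from pathlib import Path

# Assumed shape: flag{<uuid>} with a canonical 8-4-4-4-12 hex UUID inside.
FLAG_RE = re.compile(
    r"flag\{[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\}",
    re.IGNORECASE,
)


def has_flag(text: str) -> bool:
    """True when the text contains a uuid-shaped flag."""
    return FLAG_RE.search(text) is not None


def dir_has_flag(benchmark_dir: Path) -> bool:
    """Scan every file under a per-benchmark directory for a flag."""
    for path in benchmark_dir.rglob("*"):
        if path.is_file():
            if has_flag(path.read_text(encoding="utf-8", errors="replace")):
                return True
    return False
```

A placeholder such as flag{...} does not match, which is what separates a real capture from a prompt that merely mentions the flag format.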

…-flask-app) The 10th realvuln repo, never used to tune codereview v3 → a genuine held-out split for the overfit test. Result: held-out recall (best 0.19) sits within the seen range (0.17-0.40), so overfit is NOT confirmed — v3's low recall is a uniform single-pass weakness, not memorization. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Feeds the fixture's OpenAPI spec to the trace agent as an attack-surface map for the X1 A/B (spec-first vs code-direct). Default off = unchanged. Tests the founding formats-as-substrate premise: does the OAS map lift endpoint/annotation recall. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… brace pitfall) ADK inject_session_state reads {id} as a session-state lookup -> KeyError('id'), aborting the likec4 workflow. The example OAS path now uses the non-identifier {note-id} form (renders literally, ADK-safe), per the project's brace convention. Surfaced while generating a LikeC4 model for the X2 (LikeC4->STRIDE) test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…eat-with-likec4) Generated by the likec4 workflow (after the brace fix). Used for the X2 A/B and kept as a precompute so the threat eval can run with-LikeC4. X2 CONFIRMED: feeding it lifts STRIDE coverage 3→5 categories, reports 8→27, endpoint recall 0.67→1.00. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

InvalidToolCallGuardrailCallback returned the response unconditionally whenever parts were non-empty, so CallbackChain stopped at it and any callback appended after the worker chain was dead code — notably MandatoryToolCallback in both exploitability agents, whose verdict enforcement never fired. Track whether a part was actually rewritten and return None otherwise (state is still saved). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
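The fix above can be illustrated with a toy chain. This is not the ADK or contractor API: run_chain, guardrail and the dict-shaped response are stand-ins, assuming only what the message states, namely that the chain stops at the first callback that returns a response.

```python
from typing import Callable, List, Optional

Callback = Callable[[dict], Optional[dict]]


def run_chain(callbacks: List[Callback], response: dict) -> Optional[dict]:
    """Stop at the first callback that returns a non-None response."""
    for callback in callbacks:
        result = callback(response)
        if result is not None:
            return result
    return None


def guardrail(response: dict) -> Optional[dict]:
    """Return the response only if a part was actually rewritten."""
    rewritten = False
    for part in response.get("parts", []):
        if part.pop("invalid_tool_call", False):
            part["text"] = "[invalid tool call removed]"
            rewritten = True
    # Returning None lets later callbacks in the chain run.
    return response if rewritten else None
```

With the old behaviour (return the response whenever parts is non-empty), any callback placed after guardrail would never be reached for a normal response.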

The overlay glob relied on PurePosixPath.match, which on Python 3.12 is not recursive for ** and matches right-anchored: /**/*.py missed root-level files, /src/**/*.py returned nothing, and multi-level non-** patterns failed. Move the correct path-aware matcher from cli/fs.py into contractor/tools/fs/globmatch.py (cli imports it from there, preserving the import direction rule) and rewrite the overlay glob as walk + regex, mirroring RootedLocalFileSystem semantics (tombstones excluded, overlay-added files included, files only). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
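A minimal sketch of a path-aware glob-to-regex translation of the kind described above. It is not the matcher from contractor/tools/fs/globmatch.py; the function names and the exact wildcard rules (a "**/" segment matches zero or more directories, "*" and "?" never cross a slash) are assumptions chosen to cover the three failure cases listed in the message.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a rooted glob into an anchored regex."""
    i, out = 0, []
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")  # trailing **: anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def glob_match(pattern: str, path: str) -> bool:
    """Match the whole path, anchored on both ends."""
    return glob_to_regex(pattern).fullmatch(path) is not None
```

A walk-plus-regex glob would then call glob_match on each candidate file path instead of relying on right-anchored matching.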

… per-invocation artifact keys
Three task-runner fixes from the project review:
- Exceptions inside _run_single_iteration (transient LiteLLM/network errors, typo'd skills) now consume an attempt and retry per the documented max_attempts invariant instead of aborting the whole workflow; ITERATION_RESULT carries error_type/error_message and TaskNotCompletedError chains the last exception. CancelledError still propagates.
- Declared input artifacts that were never published are now warned about instead of silently rendering as empty strings; artifact bytes decode with errors="replace" + warning instead of errors="ignore".
- add_task(artifacts=[]) no longer resurrects template defaults (mirrors the skills is-not-None handling).
- add_task accepts an optional artifact_key so fan-out tasks from the same template publish non-colliding artifacts; checkpoint restore validates against the invocation's own expected artifact names, not whatever a sibling task published.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
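The retry invariant in the first fix can be sketched as below. This is an illustration under assumptions, not the TaskRunner code: run_with_attempts is a hypothetical helper and the TaskNotCompletedError class here is a local stand-in for the one named in the message.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class TaskNotCompletedError(RuntimeError):
    """Local stand-in for the error raised when all attempts are used up."""


async def run_with_attempts(
    iteration: Callable[[int], Awaitable[bool]], max_attempts: int = 3
) -> int:
    """Return the attempt number that succeeded; exceptions consume an attempt."""
    last_exc: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            if await iteration(attempt):
                return attempt
        except asyncio.CancelledError:
            raise  # cancellation is never treated as a retryable failure
        except Exception as exc:
            last_exc = exc  # a failed attempt, not a workflow abort
    raise TaskNotCompletedError(
        f"task not completed after {max_attempts} attempts"
    ) from last_exc
```

Chaining with "from last_exc" keeps the final underlying error visible on the raised exception's __cause__.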

…que fan-out artifact keys trace-verify only read findings under trace-annotation:*, so running it after the production-default trace-graph (trace-graph:*) or pathpar (trace-graph-pathpar:*) silently verified nothing. Extract the three producer prefixes into shared constants (contractor/workflows/namespaces.py; pathpar's PATH_NAMESPACE_PREFIX aliases the shared one), probe all of them per path, and WARN naming the probed prefixes when zero findings exist anywhere instead of skipping at DEBUG. Also wire the new per-invocation artifact_key into the fan-out workflows so per-finding tasks stop overwriting each other's published artifacts: trace_verify keys by source namespace + finding, exploitability and vuln_scan_trace by finding name. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…kpoint validation
- TaskRunner/AgentRunner._emit no longer lets an exception in the event handler (metrics sink disk error, UI render failure) abort the run; CancelledError still propagates.
- metrics plugin: _CallTracker.resolve prefers the oldest unfinished non-errored call and register() finishes stale errored same-fingerprint calls, so a retried identical tool call pairs with its own after_tool instead of the stale errored one (wrong success flag, wrong timing, leaked pending call).
- checkpoint restore is skipped with a warning when the entry's template_key/template_version no longer match the invocation, instead of silently feeding stale artifacts downstream.
- Checkpoint.load treats structurally malformed entries as corrupt (warning + ignore) instead of raising KeyError; _load_checkpoint moved inside run()'s try so cleanup runs if it fails.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…k loss, settings routing
- vuln_scan_fast._dedup no longer crashes on trailing "CWE-" or null details/place (str coercion + regex CWE extraction); same coercion in the trace-confirm stage.
- stable explicit ref= on conditionally-added tasks in oas_building, likec4_building, vuln_assess, vuln_scan_fast so --resume matches checkpoint entries regardless of which tasks were skipped.
- oas_building/oas_enrichment/likec4_building TaskRunners now use ctx.app_name instead of ad-hoc names, aligning publish/load with the skip-checks and CLI export under app-partitioned artifact services.
- trace_graph_pathpar merges and saves completed overlay forks in a try/finally around the TaskGroup, so one failed path no longer discards every other path's annotations (inline rather than _cleanup because vuln_assess drives _run_impl directly).
- CONTRACTOR_TARGET_URL/CONTRACTOR_PROXY routed through Settings (aliased fields, same env vars) instead of ad-hoc os.environ reads; cli/.env anchored via Path(__file__) so non-CLI entrypoints load the same env file.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
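The "str coercion + regex CWE extraction" in the first bullet might look roughly like this. The field names details and place come from the message; CWE_RE, extract_cwe and dedup_key are hypothetical, and the real _dedup key may differ.

```python
import re
from typing import Optional, Tuple

CWE_RE = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def extract_cwe(value: object) -> Optional[str]:
    """Return a normalized CWE id, or None for null values or a bare 'CWE-'."""
    match = CWE_RE.search(str(value or ""))
    return f"CWE-{match.group(1)}" if match else None


def dedup_key(finding: dict) -> Tuple[str, Optional[str]]:
    """Assumed dedup key: coerced place plus extracted CWE."""
    return (str(finding.get("place") or ""), extract_cwe(finding.get("details")))
```

Because every input goes through str() and a regex search, neither a missing field nor a truncated identifier can raise.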

…radictions, context-callback fixes
- finish-time summarizer runs with tools=[] instead of inheriting the worker's whole toolset; its records payload is capped to max_records and each record field truncated (20k + marker), so a long run can't blow the context window inside finish; malformed-worker raw output is stored truncated instead of verbatim.
- TASK_LIMIT_REACHED_MSG no longer tells the planner to "finish immediately" (finish(done) refuses while subtasks are new); v5 bootstrap step 2 no longer instructs a zero-subtask finish(done) the tool is guaranteed to refuse; decompose_subtask reports remaining capacity instead of claiming the limit is reached for other failures.
- SubtaskDecomposition enforces the documented 1-3 children (max_length=3).
- SummarizationLimitCallback latches per invocation instead of appending the summarize message to every subsequent request.
- FunctionResultsRemovalCallback gives unmatched function responses a per-index sentinel signature so they never dedup against each other.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
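A small sketch of the record capping in the first bullet, under stated assumptions: the marker text, the helper names, and the choice to keep the most recent records are all guesses, since the message only gives the 20k field limit and the max_records cap.

```python
from typing import Any, Dict, List

TRUNCATION_MARKER = "...[truncated]"  # assumed marker text


def truncate_field(value: Any, limit: int = 20_000) -> Any:
    """Cut long string fields at the limit and append a marker."""
    if isinstance(value, str) and len(value) > limit:
        return value[:limit] + TRUNCATION_MARKER
    return value


def cap_records(
    records: List[Dict[str, Any]], max_records: int, field_limit: int = 20_000
) -> List[Dict[str, Any]]:
    """Keep at most max_records (assumed: the newest) with truncated fields."""
    kept = records[-max_records:] if max_records > 0 else []
    return [
        {key: truncate_field(value, field_limit) for key, value in record.items()}
        for record in kept
    ]
```

The payload size is then bounded by roughly max_records times the number of fields times the field limit, regardless of run length.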

…ns, fixture pinning
- eval auto-skip gate bypassed only when the -m markexpr actually selects eval (word match), not for any marker expression; CONTRACTOR_RUN_EVAL parsed as a boolean so =0 disables evals.
- rebuild_eval_envelope.py discovers both the legacy flat layout and the dated archive layout (eval_runs/<stamp>/<scenario>-<unit>-eval-<fixture>/cases), deduped newest-wins; --all scans both.
- EvalSink default run_name includes scenario (and non-generic metric_kind) so buckets sharing a unit stop overwriting each other's latest envelope; warns when records in a bucket disagree on model/prompt_version.
- prepare_vuln_benchmarks.py treats checkout-pin failures as fatal and removes the partial clone instead of printing "OK (pinned)" while building the fixture from HEAD.
- dump_langfuse_trace.py resolves cli/.env from the repo root (the old contractor/cli/.env candidate never existed).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
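The two gate checks in the first bullet can be sketched as follows. Both helpers are hypothetical, and the set of accepted truthy spellings is an assumption; the message only states that =0 must disable evals and that the marker check is a word match.

```python
import re
from typing import Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted spellings


def env_flag(value: Optional[str], default: bool = False) -> bool:
    """Parse an env var as a boolean, so '0' is False rather than truthy."""
    if value is None:
        return default
    return value.strip().lower() in _TRUTHY


def markexpr_mentions_eval(markexpr: Optional[str]) -> bool:
    """Word match on 'eval'; this sketch does not interpret 'not eval'."""
    return re.search(r"\beval\b", markexpr or "") is not None
```

A plain truthiness test on the raw string would treat "0" as enabled, which is the bug the boolean parse avoids.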

… close http clients, additive prompt sections
- the live UI now stops only on workflow_finished (emitted exactly once in Workflow.run's finally), so per-finding task_failed events in vuln_scan_trace/exploitability no longer blank the rest of the run; workflow construction errors surface as a clean click.UsageError.
- HTTPClient opens a per-request httpx.AsyncClient (async with) instead of an instance-lifetime client nothing ever closed; session cookies persist via a shared jar merged back after each response; public tool signatures unchanged.
- oas_analyzer TaskDescription.format composes sections additively, so the general sub-agents get their objective back and idor/ssrf keep their examples (the old chained conditional dropped them).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…e, descriptive errors
- add_task validates skill names eagerly (ValueError listing available skills) instead of surfacing a FileNotFoundError hours into a run; skill/artifact memory injection hoisted out of the retry loop to once per task (inputs never change across attempts).
- carry state strips the task's stale invocation-scoped planner keys between attempts, so retries stop deep-copying and re-snapshotting every previous attempt's subtask state into metrics.jsonl.
- AgentRunner threads the event handler through run() instead of stashing it on the instance, so concurrent run() calls no longer clobber each other's handler.
- RenderedTask raises on artifact refs that normalize to the same template variable instead of silently dropping one; TaskTemplate.load reports missing required fields as descriptive ValueErrors.
- EventType/AgioEventType mirroring pinned by a test so new event types can't silently vanish from metrics.jsonl (TASK_SKIPPED kept: it IS emitted via Workflow.emit_task_skipped, contrary to the review note).
- resolved the after_run_callback comment contradiction empirically: it fires when the run completes, not on mid-stream errors/cancellation; SandboxCleanupPlugin documented as primary teardown, the run() sweep as backstop.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… resolution
- a rejected skip (current subtask done/decomposed/skipped) now returns an {"error": ...} naming the actual cause instead of the same "no active subtasks" result as a successful skip, so the planner can finally tell the two apart.
- task_tools raises a descriptive ValueError at build time when worker_instrumentation=False, use_input_schema=True and the worker has no input_schema, instead of an obscure ADK KeyError at runtime.
- Likec4Linter resolves its binary lazily on first use, so building the tool list never raises and a missing binary surfaces as the documented error tool-result instead of crashing workflow assembly.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ine, walk ceiling
- walk() filters symlinked-outside files from its yielded names the way ls/glob already do (content was never readable; the names leaked).
- _strip_protocol returns the realpath it just validated instead of the unresolved candidate, closing the check-then-use window; in-sandbox symlinks keep resolving (regression-tested).
- overlay forks record their pre-fork tombstone set and merge_overlay_forks subtracts that baseline, so pre-fork deletes no longer re-propagate on merge (the old filter subtracted tombstones from file-content keys, a no-op).
- fs glob/grep walks are bounded by a new fs_max_files_per_walk setting (default 100k, env FS_MAX_FILES_PER_WALK); hitting the ceiling is reported in the tool output instead of silently scanning forever.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
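The walk ceiling in the last bullet can be sketched like this. The 100k default comes from the message; walk_files and take_bounded are hypothetical helpers, and the real implementation walks a sandboxed filesystem rather than os.walk.

```python
import os
from typing import Iterable, Iterator, List, Tuple


def walk_files(root: str) -> Iterator[str]:
    """Yield file paths under root lazily."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def take_bounded(paths: Iterable[str], max_files: int = 100_000) -> Tuple[List[str], bool]:
    """Collect at most max_files paths; the flag says whether the ceiling was hit."""
    collected: List[str] = []
    for path in paths:
        if len(collected) >= max_files:
            return collected, True  # more files exist beyond the ceiling
        collected.append(path)
    return collected, False
```

A tool wrapper could then append a note to its output when the flag is True, for example by calling take_bounded(walk_files(root), max_files) and reporting the truncation to the agent.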

…, no-raise observability
- RpmRatelimitCallback rolls its window forward when stale, mirroring the TPM implementation, instead of accumulating counts across stale windows; both classes now carry a prominent warning that their sync time.sleep blocks the event loop (the CallbackChain is strictly sync, so async conversion is not an option today).
- TokenUsageCallback mirrors the in-progress invocation into history on every call (same-key overwrite, no double-count), so the final invocation is no longer missing; Russian comments translated; unused TokenUsageCallbackException deleted.
- BaseCallback.validate() actually validates now: parameter name/kind comparison against the expected ADK callback signature (return annotations excluded, which is why the old check was disabled); all production callbacks verified to pass.
- observability.run_context degrades to a no-op span with a warning when Langfuse is enabled but broken, honoring the module's never-raises contract; init()'s no-retry-after-failure is documented.
- module-level logger.setLevel(DEBUG) removed from ratelimits and guardrails (library code must not override app log policy). ThinkingBudgetGuardrailCallback and ToolMaxCallsGuardrailCallback are deliberately kept unwired for future use.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
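A fixed-window sketch of the "roll the window forward when stale" behaviour from the first bullet. RpmWindow is a hypothetical class, not RpmRatelimitCallback; it returns the wait time instead of calling time.sleep, and the injectable clock exists only to make the logic easy to check.

```python
import time
from typing import Callable


class RpmWindow:
    """Count requests per window; reset the window once it has expired."""

    def __init__(
        self,
        limit: int,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def acquire(self) -> float:
        """Return seconds to wait before sending (0.0 when under the limit)."""
        now = self.clock()
        if now - self.window_start >= self.window_s:
            # Stale window: start a fresh one instead of keeping old counts.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return self.window_s - (now - self.window_start)
        self.count += 1
        return 0.0
```

Without the reset, counts from an expired window would keep throttling requests that arrive long after the burst ended.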

…, per-path artifact keys
- oas_builder/oas_linter get accurate AgentTool descriptions instead of all three workers presenting as "software engineering agent" to the router; http_agent's context-limit instruction no longer references a nonexistent report tool.
- planner v5 Rule 6 reworded format-neutrally (it asserted an XML record shape while the default planner format is json); orphaned FINISH_MAX_CALLS_RVALUE deleted; leftover code_graph_agent dir removed.
- oas_analyzer: severity sorting uses an explicit rank map (medium no longer sorts below low; unknown severities pinned last), sub-agent order is deterministic (tuple, not set), output_schema annotation fixed to type[BaseModel].
- trace_annotation passes a per-path artifact_key so fanned-out path tasks stop overwriting each other's published artifacts (the other three trace workflows drive AgentRunner directly and publish none, so no change is needed there); module-level logger.setLevel(DEBUG) removed from all four trace workflows.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
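The explicit rank map in the oas_analyzer bullet can be illustrated as below. The severity labels and the helper name are assumptions; the point is only that an alphabetical sort puts "medium" after "low", while a rank lookup with a fallback pins unknown values last.

```python
from typing import Dict, List

# Assumed label set; unknown labels fall through to the last rank.
SEVERITY_RANK: Dict[str, int] = {
    "critical": 0,
    "high": 1,
    "medium": 2,
    "low": 3,
    "info": 4,
}


def sort_by_severity(findings: List[dict]) -> List[dict]:
    """Sort most severe first; unknown severities go to the end."""
    unknown_rank = len(SEVERITY_RANK)
    return sorted(
        findings,
        key=lambda f: SEVERITY_RANK.get(str(f.get("severity", "")).lower(), unknown_rank),
    )
```

Since sorted is stable, findings with equal severity keep their original relative order.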

- eval harness run_agent raises AgentRunTimeout carrying the partial AgentRun (tool calls, best-effort session state and artifacts, timed_out=True) instead of discarding everything on timeout; eval consumers still see the timeout as a failure, but err.partial is now inspectable for debugging.
- analyze_metrics.py computes cost only for models present in the pricing table and says which models were skipped, instead of pricing local lm-studio models with a Gemini fallback; all-unknown runs show "n/a" rather than a fake $0.
- select_fixture annotation corrected (it never returns None); MetricsSink's envelope-wins key shadowing documented as intentional.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

README: injection happens once per task (not per attempt), checkpoint + best-effort event delivery, invocation-scoped subtask state, per-invocation artifact keys, summarizer is tool-less + record-capped, finish refusal rules, full 14-workflow registry.
TUNABLE_PARAMS: FS_MAX_FILES_PER_WALK, task-version env override, trace_annotation v3, real harness timeouts, new eval knobs.
tuning: Settings-routed sampling + tool caps, observations block, corrected per-workflow budget table, current prompt actives.
eval-tuning: Tier 1-2 marked landed (Settings env vars), envelope/rebuild guidance, eval gate.
insights-parallel: merge-in-finally semantics, per-call on_event, config-routed max_concurrency.
arch.likec4: Pipeline->Workflow names, full registry, oas_validate tasks, --workflow CLI forms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…relocate shannon design doc
- trace_annotation_direct: with_graph_tools back to false. It had drifted to true (carried over in the 127ed1f relocation), making trace-direct functionally identical to trace-graph and invalidating both docstrings' A/B framing. trace-direct is the prompt-only baseline again.
- trace_annotation_direct + trace_graph config.yaml: drop output_format. Both workflows pass _format=template.format; the YAML knob was never read.
- contractor/workflows/shannon/ held only DESIGN.md (no workflow.py / config.yaml / __init__.py, unregistered), so it moved to docs/shannon-workflow-design.md and the workflows package stays one-folder-per-workflow.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ed vuln tool-name constants The ~35-line after_model chain merge was duplicated verbatim in exploitability_agent and web_exploitability_agent; it now lives as chain_after_model_callback() in callbacks/adapter.py (raises on a non-after_model callback instead of silently dropping it). READ_ONLY_VULN_TOOL_NAMES / VERDICT_TOOL_NAMES move next to the tools they describe in contractor/tools/vuln.py, replacing two private copies of each plus a third read-only set in trace_verifier_agent. No behavior change; full unit suite green except two pre-existing trace-postdiff namespace failures from in-flight work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The load-artifact -> guard -> yaml.safe_load -> mapping-guard -> name-backfill reshape was copy-pasted in five workflows (exploitability, trace_verify, vuln_scan_fast, vuln_scan_trace, vuln_assess). It now lives in contractor/workflows/findings.py as load_findings_artifact() (name->fields mapping to list of finding dicts) and load_yaml_dict_artifact() (the dict-merging variant vuln_assess needs), so the read side of the findings contract can't drift between workflows. Call-site extras (severity sort, file+CWE dedup) stay where they were. Net -83 lines; behavior unchanged except unparseable-YAML warnings are now uniformly logged (vuln_scan_fast and vuln_assess previously swallowed them silently). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…tics_agent New trace-postdiff workflow: stage A runs the trace_agent loop with vulnerability reporting disabled (navigation only); stage B runs the new vuln_analytics_agent once per path over the @trace-marked diff of that path's annotations (judgement only), reporting findings under the trace-postdiff: namespace prefix so trace-verify/vuln-assess pick them up. Targets the small-model navigate-vs-judge split; A/B against the single-stage trace-graph. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

group_paths_by_prefix(depth) makes the route group — not the single path — the unit of memory namespace, skill injection, fork/concurrency (pathpar), and the analytics stage (postdiff), so sibling handlers share discovered context instead of re-navigating N times. Defaults: trace-postdiff groups at depth 1; pathpar keeps depth 0 (per-path, byte-identical namespaces). Consumers (trace-verify, vuln_assess) probe depth-1/2 group keys alongside path keys, with cross-path dedup so a group's findings are verified once. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
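One way to read the grouping rule above, as a sketch only: depth 0 keeps every path as its own group, and depth d groups by the first d path segments. The signature and return shape here are assumptions, and the real group_paths_by_prefix may key or normalize paths differently.

```python
from typing import Dict, Iterable, List


def group_paths_by_prefix(paths: Iterable[str], depth: int) -> Dict[str, List[str]]:
    """Group route paths by their first `depth` segments (0 = per-path)."""
    groups: Dict[str, List[str]] = {}
    for path in paths:
        if depth <= 0:
            key = path  # per-path behaviour: the path is its own group
        else:
            segments = [segment for segment in path.split("/") if segment]
            key = "/" + "/".join(segments[:depth])
        groups.setdefault(key, []).append(path)
    return groups
```

Under this reading, sibling handlers such as /users/{id} and /users/{id}/orders share one depth-1 group, /users, and so can share a memory namespace.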

Pass 1 fans one nomination sweep out per vulnerability class in parallel (sink_nomination template), each sweeping the whole project for its single class at low confidence — narrow attention beats one do-everything scan on large codebases. Includes an explicit absence-of-control class so missing-auth/ownership, which a taint scan can't surface, gets nominated per handler. Pass 2 merges, dedups, caps, and deep-traces survivors by reusing VulnScanTraceWorkflow's trace phase (refactored to instance-level self.CFG + namespace-derived refs so the subclass reuses it verbatim). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

New docs/planner.md documents the planner/task_runner spine with nine Mermaid diagrams: component layering, the TaskRunner per-task retry lifecycle, the one-iteration sequence (planner↔manager↔worker↔summarizer), the prompt-v5 action-picker loop, the subtask state machine, the execute_current_subtask worker-retry/parse flow, the flat decompose-and-insert layout, finish + summarizer, the two-tier session-state keyspace, and the artifact hand-off. Cross-linked from README §4. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ent implementation' §8 lays out the planner/task_runner design space as A/B-able experiments (V0–V7) grouped by axis — control-flow/plan-shape, verification, context-passing — each with what-changes / hypothesis / metric. §9 frames the next decisions: 9.1 names the design decisions the baseline already commits to (and which variant revisits each), 9.2 recommends the first batch (V0→V1→V3, V2), 9.3 specs the enabling planner_builder seam + CONTRACTOR_PLANNER_STRATEGY env routing (symmetric to the existing worker_builder / task-version pattern) so variants run through the same eval/v1 pass@N harness. 'Where to look next' renumbered to §10. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…odology
Incorporates a synthesized technical review of §8–§9:
- §8.2 reframed 'Verification' → 'Trust the worker's verdict': V3/V4/V8 are three challengers to ONE committed decision (judge / ask-N / re-ask-once), tested as a single axis. V3 split into three scopings (acceptance-line / finish-gate / per-subtask), cheapest first.
- New variants: V8 (re-execute-once on incomplete; relaxes the §4 invariant for one retry), V9 (worker-proposed decomposition; the info-flow fix V7 approximates blindly), V10 (plan carry-forward across attempts; new §8.4 retry/resume axis).
- §8.1: fixed V1 ('plan-once' is tool removal AND a paired prompt, not one disabled tool); added V2 attribution caveat.
- §9.1 decision table updated (trust row = V3·V4·V8; empty-plan-per-attempt row = V10; decompose row += V9).
- §9.2 resequenced: seam → baseline telemetry → V0/V1 2x2 → V8 → V3-cheapest-scope → V9.
- §9.3: made the strategy contract explicit (write the fixed §6 keys + set end_invocation; that is the whole planner_builder interface); registry entries are (builder, prompt_version, toolset) bundles since prompts travel with strategies.
- New §10 methodology: instrument the baseline first (telemetry pre-checks hypotheses for free) + budget for variance (pre-register N and a threshold beating baseline seed spread). 'Where to look next' → §11.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Extends the strategy seam beyond planner_builder (§9.3) to the three injection points the in-loop variants need, each defaulting to today's behaviour so streamline stays byte-identical:
- execution_policy (task_tools → execute_current_subtask): the worker-call core behind a protocol; unlocks V8 (re-execute-once), V4 (best-of-N), V3 per-subtask critic. Notes that V3's cheaper finish-gate scope hooks the finish closure instead.
- records_policy (task_tools → get_records / save_record): the history view behind a protocol; unlocks V6 rolling summary.
- scheduler (StreamlineManager.get_current_subtask + advance, plus runner parallelism): subtask selection + DAG concurrency; unlocks V5; flagged as the deep, second-wave seam (spans manager + runner, needs depends_on + pathpar fork/merge).
Adds a bundle-composition diagram and notes V10 is a runner-level flag (_build_task_initial_state), not one of these policies.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

grauwolf32 and others added 30 commits June 7, 2026 07:53

feat(eval): pass@N loop for the trace eval (CONTRACTOR_EVAL_TRACE_PAS…

3ede881


grauwolf32 and others added 10 commits June 11, 2026 00:15

grauwolf32 merged commit e3d5632 into main Jun 10, 2026
1 check failed

grauwolf32 merged 40 commits into main from upd0611
