diff --git a/RESUME.md b/RESUME.md deleted file mode 100644 index 737a1c3..0000000 --- a/RESUME.md +++ /dev/null @@ -1,97 +0,0 @@ -# RESUME — planner observations + xbow work - -Saved 2026-06-06. Branch: **main** (work landed here; was `feat/planner-observations`). -Nothing running. LM Studio + PC about to be powered off. - -## TL;DR of where we are -- **Observations feature: shipped.** lean (`enabled, include_tool_errors:false`) + - `track_file_paths:true` now set on **all 11 planner workflow configs**. -- **Audit pass: 4 bugs fixed** (committed, not pushed); more deferred. -- **xbow: DONE — 14/15 captured** (XBEN-004..018, lean+paths, 27b-mtp), 0 miss, 0 crash. - All three infra blockers fixed in the harness (commit `8af8751`): GPU-VRAM/context OOM, - buster build-errors, db `expose` wedge. Only XBEN-010 was a transient first-build apt/pip - flake (builds clean from cache on retry). Real per-benchmark table + tokens in REPORT-xbow.html. -- **Reports** live in `~/src/pentest-ai-agents/` (that dir is NOT a git repo). - `REPORT-xbow.html` regenerated 2026-06-06 with the real 14/15 data + corrected root-cause. - -## Key commits this session (newest first, NOT pushed) -``` -57b6225 feat(observations): enable lean+paths for all planner tasks -d37534c deploy(litellm): add lm-studio-qwen3.6-27b-mtp alias -61b4a00 fix: audit pass — 4 verified bugs (1 HIGH, 2 MEDIUM, 1 LOW) -53070c8 feat(exploitability): adopt lean observations for the assess task -852f765 fix(fs): push file paths + per-invocation reset in write_tools too -3db71f0 deploy(litellm): add lm-studio-qwen3.6-mtp alias -… (+ earlier observations/tagging/env-overlay commits) -``` -Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutive.sh` (committed alongside this file). - -## Eval findings (durable) -- **Observations buy vuln-detection RECALL** vs off, across 3 models: +4–5.5 true vulns - (qwen3.6 8→12; 27b 10→15; 35b 11→16.5) at flat-or-lower cost. Annotation F1 unchanged. 
-- **lean+paths** recovers precision (vuln FP ~21→~13) vs lean, replicated n=2 on 35b-mtp, - at equal/lower cost. (Earlier "wins" before the write_tools fix were a no-op bug — paths - were empty — so treat only post-852f765 runs as valid.) -- **Post-audit-fix trace rerun (2026-06-06, vulnyapi, 27b-mtp, n=1/arm): NO REGRESSION.** - lean_paths quality=0.630 (annotF1=0.642 P=.531 R=.810; vulnF1=0.612 TP15/FP17/FN2; 3.58M tok) - vs lean_no_errors quality=0.628 (vulnF1=0.607; 3.32M tok). Δquality=0.002 = a tie at n=1; - lean_paths nominally best but +8% tokens. Annotation F1 identical → paths only nudge vuln - detection. Confirms the tasks-area audit fixes didn't degrade trace quality. Logs: - `eval_runs/ab_matrix/vulnyapi/{lean_paths,lean_no_errors}/`. -- **Rejected arms:** `include_tool_errors` (erased gains), `track_memories` (FP inflation). -- **27b-dense-mtp** = best annotator (0.750). MTP ~26× faster generation but only ~14% - faster full eval (prefill/tool-bound). - -## xbow status + the OOM fix (IMPORTANT for resume) -- **Root cause of the model crashes:** GPU-VRAM KV-cache OOM (RTX 5090, 32 GB) — the model - was loaded with `-c 180000 --parallel 4`; the token-heavy exploit loop grew context until - KV cache + ~22 GB weights exceeded VRAM → LM Studio runtime crash (NOT a kernel OOM). -- **The fix:** load with a **safe context**: - `~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y` - (leaves ~8.8 GB VRAM for KV; verified stable — XBEN-005, the prior crasher, captured). -- **All benchmarks now runnable** (was: only 80/104). Two harness fixes in `tests/eval/xbow.py` - (commit `8af8751`): `ensure_buster_base()` rebuilds `python:2.7.18-slim` against - archive.debian.org (fixes the ~10 buster build-errors), and `_effective_compose_file()` - sanitizes `expose: "host:container"` → bare port into a sibling `docker-compose.podman.yml` - (unblocks the 24 db-having benchmarks; validated on XBEN-001). Both run automatically in `up()`. 
-- **Resilient runner:** `scripts/xbow_consecutive.sh ` — runs each benchmark in - its own process, health-checks/reloads the model between, per-benchmark 900s timeout, - tears down containers. This is how to run xbow "consecutively" without cascade. - -### xbow 15-case run — FINAL (list: XBEN-004..018, lean+paths, 27b-mtp @ ctx 65536) -**14/15 CAPTURED, 0 miss, 0 model crash.** Run consecutively over two passes -(initial + post-fix rebuild of the 10 buster-build-errored ones); last-result-wins. -Captured: 004,005,006,007,008,009,011,012,013,014,015,016,017,018. -Only **XBEN-010** never captured: build flaked (transient apt/pip exit 100) on first attempts but -builds clean from cache after (`rc=0`, target up). On clean runs the exploit agent **timed out -twice** — 900s, then a 1800s retry that hit the harness internal exploit timeout (`TimeoutError` -at 1524s). So 010 is a **reproducible agent holdout** on one xss case, not an infra/budget gap. -Next: manual look at where the agent gets stuck (likely an xss payload/encoding it never lands). -Totals (14 caps): in=12,666,693 out=269,537; 961 tool calls, 772 llm; mean ~905k in / 19k out per cap. -Effort span: easy xss ~26–28 llm / ~0.37M in (016/012/008); hard ~89–128 llm / 1.7–2.3M in (005/011/014). -Per-benchmark metrics: `eval_runs/xbow_exploit/XBEN-*/metrics.json`. -Logs: `eval_runs/xbow_15_consecutive.log`, summary `eval_runs/xbow_15_summary.txt`. -NOTE: wrapper `model_alive` health-check (20s) can false-fail vs a busy/loading model and -spawn a duplicate JIT instance / SKIP a benchmark — when re-running ONE benchmark, run pytest -directly (see below) instead of the wrapper, and keep a single instance (`lms unload --all` first). - -## TO RESUME — exact steps -0. **Prereqs:** LM Studio up + single instance at safe context - `~/.lmstudio/bin/lms unload --all && ~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y` - (litellm proxy: `podman ps`; if down, `cd deploy/litellm && bash run.sh`). -1. 
**xbow: DONE (14/15).** Report regenerated. Only open case: XBEN-010 timed out at 900s on - the clean run. Optional larger-budget retry — run pytest DIRECTLY (not the wrapper): - `OBS='{"enabled":true,"include_tool_errors":false,"track_file_paths":true}'` - `CONTRACTOR_RUN_EVAL=1 CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp CONTRACTOR_EVAL_OBSERVATIONS="$OBS" CONTRACTOR_XBOW_BENCHMARKS=XBEN-010-24 CONTRACTOR_XBOW_AGENT=exploit timeout 1800 poetry run pytest tests/eval/test_xbow_eval.py -s -q -k exploit` -2. **DONE — trace lean+paths post-audit-fix rerun.** No regression (see Eval findings above). -3. **REMAINING — open a PR** for the work when ready (currently on main, not pushed; - commits a50fd4e/7cf2ac9 + the observations/audit/harness chain above). - -## Backlog / deferred -- **Deferred audit bugs** (verified, not yet fixed — see audit_report.html): ratelimits - `time.sleep`→async (callback-framework risk), insert_line adjacent-skip (intended+tested), - gitlabfs threading + search, http client timeout, analytics_ui XSS + SQLite leaks, - code-graph find_callers, overlay merge.py delete-baseline, overlayfs mv-into-subtree. -- Optional: patch the 24 db-benchmarks' compose `expose: "3306:3306"` to unlock full xbow. -- Stale doc: CLAUDE.md says run `isort`; ruff's `I` rule is canonical (they conflict). -- Open a PR for the branch when ready (currently on main, not pushed). 
diff --git a/cli/fs.py b/cli/fs.py index 66e7011..0a837d2 100644 --- a/cli/fs.py +++ b/cli/fs.py @@ -1,68 +1,12 @@ import os -import re from collections.abc import Iterator from typing import Any from fsspec.implementations.local import LocalFileSystem, stringify_path +from contractor.tools.fs.globmatch import glob_to_regex from contractor.utils.formatting import norm_unicode - - -def _translate_glob_segment(seg: str) -> str: - """Translate one glob path segment to regex, never crossing ``/``.""" - out: list[str] = [] - i, n = 0, len(seg) - while i < n: - c = seg[i] - if c == "*": - out.append("[^/]*") - elif c == "?": - out.append("[^/]") - elif c == "[": - j = i + 1 - if j < n and seg[j] == "!": - j += 1 - if j < n and seg[j] == "]": - j += 1 - while j < n and seg[j] != "]": - j += 1 - if j >= n: # no closing bracket: treat '[' literally - out.append(re.escape(c)) - else: - inner = seg[i + 1 : j] - if inner.startswith("!"): - inner = "^" + inner[1:] - out.append("[" + inner + "]") - i = j + 1 - continue - else: - out.append(re.escape(c)) - i += 1 - return "".join(out) - - -def _glob_to_regex(pattern: str) -> "re.Pattern[str]": - """ - Compile a glob pattern into a path-aware regex with Python-like semantics: - ``*``/``?``/``[...]`` match within a single path segment, while ``**`` - matches any number of segments (including zero). Matches relative paths - without a leading ``/``. 
- """ - segments = pattern.split("/") - parts: list[str] = [] - last = len(segments) - 1 - for idx, seg in enumerate(segments): - if seg == "**": - if idx == last: - parts.append(".*") # trailing ** matches anything, any depth - else: - parts.append("(?:[^/]*/)*") # **/ matches zero or more segments - continue # the separator is baked into the group above - else: - parts.append(_translate_glob_segment(seg)) - if idx != last: - parts.append("/") - return re.compile("(?s:" + "".join(parts) + r")\Z") +from contractor.utils.settings import get_settings class RootedLocalFileSystem(LocalFileSystem): @@ -130,7 +74,11 @@ def _strip_protocol(self, path: str) -> str: resolved = os.path.realpath(candidate) if self._is_within_sandbox(resolved): - return candidate + # Return the *resolved* path — the exact path that was validated — + # so the later open()/stat() cannot re-resolve a symlink component + # swapped in after this check (check-then-use TOCTOU). In-sandbox + # symlinks still work: they resolve to their (validated) target. + return resolved return self._blocked_path @@ -164,6 +112,11 @@ def walk( # Prune symlinked directories so os.walk never descends into them. dirs[:] = [d for d in dirs if self._is_safe_entry(current_root, d)] + # Hide symlinked files too (same policy as ls/glob): their content + # is already unreadable through the sandbox, so leaking the names + # would only disclose the existence of out-of-sandbox targets. + files = [f for f in files if self._is_safe_entry(current_root, f)] + yield self._to_virtual(real_root), dirs, files def ls( @@ -201,17 +154,35 @@ def glob(self, pattern: str, **kwargs: Any) -> list[str]: Returns virtual paths such as ``/file.txt`` or ``/dir/inner.txt``. """ + matches, _truncated = self.glob_scanned(pattern) + return matches + + def glob_scanned( + self, pattern: str, max_files: int | None = None + ) -> tuple[list[str], bool]: + """``glob`` plus a truncation flag. 
+ + The tree walk is hard-bounded at *max_files* scanned files (default: + ``Settings.fs_max_files_per_walk``) so a glob over a huge repo cannot + run away. The flag is ``True`` when the ceiling was hit, i.e. the + match list may be incomplete. + """ if not pattern: - return [] + return [], False pattern = norm_unicode(pattern.lstrip("/")) or "" # Reject obvious traversal attempts. if ".." in pattern.split("/"): - return [] + return [], False + + if max_files is None: + max_files = get_settings().fs_max_files_per_walk - regex = _glob_to_regex(pattern) + regex = glob_to_regex(pattern) matches: set[str] = set() + scanned = 0 + truncated = False # Always walk the full tree: a non-recursive pattern like ``sub/*.py`` # still needs to descend into ``sub``. The regex is path-aware, so a @@ -225,6 +196,11 @@ def glob(self, pattern: str, **kwargs: Any) -> list[str]: rel_root = "" for name in files: + if scanned >= max_files: + truncated = True + break + scanned += 1 + normalized_name = norm_unicode(name) or name host_path = os.path.join(host_root, normalized_name) @@ -239,4 +215,7 @@ def glob(self, pattern: str, **kwargs: Any) -> list[str]: if regex.match(rel_path): matches.add("/" + rel_path) - return sorted(matches) + if truncated: + break + + return sorted(matches), truncated diff --git a/cli/main.py b/cli/main.py index 3df2e64..eb4311d 100644 --- a/cli/main.py +++ b/cli/main.py @@ -67,7 +67,15 @@ def _project_artifacts_dir(base: Path, project_path: Path) -> Path: "opentelemetry", ) -_UI_STOP_EVENTS = frozenset({"run_finished", "task_failed", "workflow_finished"}) +# Only the single, truly-terminal workflow event stops the live UI. 
Both +# ``run_finished`` (per TaskRunner.run(), fired once per finding in multi-run +# workflows) and ``task_failed`` (per-finding failure that the workflow catches +# and continues past) happen mid-workflow — stopping on them froze the UI and, +# because the handler returned early, suppressed every later event from both the +# live render and the print fallback. ``workflow_finished`` is emitted exactly +# once in ``Workflow.run()``'s finally block (even on abort), so it is the only +# safe place to tear the renderer down. +_UI_STOP_EVENTS = frozenset({"workflow_finished"}) # High-volume / non-user-facing events. Persisted to metrics.jsonl when they # match, but never forwarded to the live UI (they would just flood it). @@ -201,7 +209,13 @@ async def async_main( checkpoint_path=checkpoint_path, ) - runner = workflow_cls(ctx) + try: + runner = workflow_cls(ctx) + except ValueError as exc: + # Some workflows (e.g. ExploitabilityWorkflow without a target URL) + # validate their context in __init__. Surface that as a clean CLI + # error instead of an uncaught traceback. + raise click.UsageError(str(exc)) from exc handler = _build_event_handler(output_dir, workflow, enable_ui=enable_ui) with observability.run_context( diff --git a/cli/metrics.py b/cli/metrics.py index fdea33d..d93f529 100644 --- a/cli/metrics.py +++ b/cli/metrics.py @@ -59,6 +59,8 @@ def _event_to_record(event: TaskRunnerEvent) -> dict[str, Any]: "task_name": getattr(event, "task_name", None), "task_id": getattr(event, "task_id", None), } + # Intentional: setdefault means payload keys that shadow envelope keys + # ("type", "task_name", ...) are dropped — the envelope always wins. 
for key, value in payload_dict.items(): record.setdefault(key, value) return record diff --git a/contractor/agents/exploitability_agent/agent.py b/contractor/agents/exploitability_agent/agent.py index fae11c4..5f1c128 100644 --- a/contractor/agents/exploitability_agent/agent.py +++ b/contractor/agents/exploitability_agent/agent.py @@ -9,7 +9,7 @@ from contractor.agents.worker_factory import build_worker from contractor.callbacks import default_tool -from contractor.callbacks.adapter import CallbackAdapter +from contractor.callbacks.adapter import chain_after_model_callback from contractor.callbacks.guardrails import MandatoryToolCallback from contractor.tools.caido import caido_tools from contractor.tools.code import attach_graph_tools_if_local, code_tools @@ -18,6 +18,8 @@ from contractor.tools.memory import MemoryFormat, memory_tools from contractor.tools.podman import code_exec_tools from contractor.tools.vuln import ( + READ_ONLY_VULN_TOOL_NAMES, + VERDICT_TOOL_NAMES, VerifiedFindingFormat, VulnerabilityReportFormat, verification_tools, @@ -27,12 +29,6 @@ EXPLOIT_PROMPT: Final[str] = load_prompt("exploitability_agent") -_READ_ONLY_VULN_TOOL_NAMES: frozenset[str] = frozenset( - {"get_vulnerability", "list_vulnerabilities"} -) - -_VERDICT_TOOL_NAMES: list[str] = ["submit_verdict", "report_verification"] - _ELIDE_TOOLS: list[str] = [ "read_file", "grep", "glob", "list_symbols", "http_request", "http_read_body", @@ -112,7 +108,7 @@ def build_exploitability_agent( name=src_ns, fmt=VulnerabilityReportFormat(_format=_format), ) - if t.__name__ in _READ_ONLY_VULN_TOOL_NAMES + if t.__name__ in READ_ONLY_VULN_TOOL_NAMES ] verif_tools = verification_tools( @@ -150,28 +146,9 @@ def build_exploitability_agent( elide_keep_last_n=elide_keep_last_n, ) - mandatory_cb = MandatoryToolCallback(tool_names=_VERDICT_TOOL_NAMES, max_nudges=3) - adapter = CallbackAdapter(agent_name=name) - adapter.register(mandatory_cb) - extra_callbacks = adapter() - if "after_model_callback" in 
extra_callbacks: - existing = agent.after_model_callback - new_cb = extra_callbacks["after_model_callback"] - if existing is not None: - original = existing - def _chain(callback_context, llm_response, _orig=original, _new=new_cb): - result = _orig( - callback_context=callback_context, - llm_response=llm_response, - ) - if result is not None: - return result - return _new( - callback_context=callback_context, - llm_response=llm_response, - ) - agent.after_model_callback = _chain - else: - agent.after_model_callback = new_cb + chain_after_model_callback( + agent, + MandatoryToolCallback(tool_names=list(VERDICT_TOOL_NAMES), max_nudges=3), + ) return agent diff --git a/contractor/agents/http_agent/agent.py b/contractor/agents/http_agent/agent.py index cbf5f3d..1053df4 100644 --- a/contractor/agents/http_agent/agent.py +++ b/contractor/agents/http_agent/agent.py @@ -14,7 +14,14 @@ HTTP_PROMPT: Final[str] = load_prompt("http_agent") _SUMMARIZATION_BULLETS: Final[str] = ( - "You have reached context limit. Summarize your progress and call report tool." + "You have reached the context limit. Summarize your progress:\n" + "1. Subtask objective as you understand it\n" + "2. Requests issued so far (method + URL) and the key responses observed\n" + "3. Findings worth keeping — persist them to memory before stopping\n" + "4. Open questions or blockers\n" + "5. Smallest concrete next step to resume the flow\n" + "Then return the structured result. 
Include only claims supported by " + "tool output; mark anything inferred as such.\n" ) def build_http_agent( diff --git a/contractor/agents/likec4_builder_agent/prompts/v3.md b/contractor/agents/likec4_builder_agent/prompts/v3.md index c16acfb..5434d7b 100644 --- a/contractor/agents/likec4_builder_agent/prompts/v3.md +++ b/contractor/agents/likec4_builder_agent/prompts/v3.md @@ -179,7 +179,7 @@ Inbound entry points: - In the relationship title, name the specific vulnerability if present: "POST /notes/search (unauthenticated, SQL injection via q param)" "GET /admin/users (no RBAC — any auth user can access)" - "DELETE /notes/{id} (no ownership check)" + "DELETE /notes/{note-id} (no ownership check)" Outbound calls: - Protocol, transport (TLS?), credential type in the title: diff --git a/contractor/agents/oas_analyzer/prompts/factory.py b/contractor/agents/oas_analyzer/prompts/factory.py index dca12b8..6f3bb56 100644 --- a/contractor/agents/oas_analyzer/prompts/factory.py +++ b/contractor/agents/oas_analyzer/prompts/factory.py @@ -18,13 +18,12 @@ class TaskDescription: examples: str = "" def format(self) -> str: - return ( - f"OBJECTIVE:\n{self.objective}\n\nINSTRUCTIONS:\n{self.instructions}\n\n" - if self.instructions - else f"EXAMPLES:\n{self.examples}\n\n" - if self.examples - else "" - ) + sections = [f"OBJECTIVE:\n{self.objective}\n\n"] + if self.instructions: + sections.append(f"INSTRUCTIONS:\n{self.instructions}\n\n") + if self.examples: + sections.append(f"EXAMPLES:\n{self.examples}\n\n") + return "".join(sections) @dataclass diff --git a/contractor/agents/oas_analyzer/sub_agents/analytic_agents.py b/contractor/agents/oas_analyzer/sub_agents/analytic_agents.py index 6c2beba..55c9d86 100644 --- a/contractor/agents/oas_analyzer/sub_agents/analytic_agents.py +++ b/contractor/agents/oas_analyzer/sub_agents/analytic_agents.py @@ -30,7 +30,7 @@ def build( *, model: LiteLlm | None = None, tools: list[Callable] | None = None, - output_schema: BaseModel | None = 
None, + output_schema: type[BaseModel] | None = None, output_key: str | None = None, ) -> list[LlmAgent]: tools = tools or [] @@ -132,7 +132,7 @@ def __init__(self): output_key="oas_analyzer::service_information", )[0] sub_agents = [] - for spec in {"appsec", "datasec", "ddos"}: + for spec in ("appsec", "datasec", "ddos"): sub_agents.extend(BotFactory.build(spec=spec, tools=[save_vulnerability])) super().__init__( diff --git a/contractor/agents/oas_analyzer/sub_agents/report_agent.py b/contractor/agents/oas_analyzer/sub_agents/report_agent.py index 849d65d..2fd0c4e 100644 --- a/contractor/agents/oas_analyzer/sub_agents/report_agent.py +++ b/contractor/agents/oas_analyzer/sub_agents/report_agent.py @@ -58,6 +58,15 @@ def format_vulnerability(vulnerability: EndpointVulnerability) -> str: ) +# Explicit severity ranking: most severe first; unknown severities sort last. +_SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3} +_UNKNOWN_SEVERITY_RANK = len(_SEVERITY_RANK) + + +def _severity_rank(severity: str) -> int: + return _SEVERITY_RANK.get(severity, _UNKNOWN_SEVERITY_RANK) + + def format_vulnerabilities(vulnerabilities: list[dict]) -> str: """ Format the vulnerabilities into a string. 
@@ -72,7 +81,9 @@ def format_vulnerabilities(vulnerabilities: list[dict]) -> str: vulns_by_tag[vulnerability["tag"]].append(vulnerability) for tag in vulns_by_tag: - vulns_by_tag[tag] = sorted(vulns_by_tag[tag], key=lambda x: x["severity"]) + vulns_by_tag[tag] = sorted( + vulns_by_tag[tag], key=lambda x: _severity_rank(x["severity"]) + ) for tag in tags: result += f"\n\n## {tag} \n\n" diff --git a/contractor/agents/oas_builder_agent/agent.py b/contractor/agents/oas_builder_agent/agent.py index cf3b85f..c2332b2 100644 --- a/contractor/agents/oas_builder_agent/agent.py +++ b/contractor/agents/oas_builder_agent/agent.py @@ -53,7 +53,10 @@ def build_oas_builder_agent( return build_worker( name=name, instruction=OAS_PROMPT, - description="software engineering agent", + description=( + "OpenAPI schema builder — derives endpoints and components " + "from source-code evidence and upserts them into the schema." + ), tools=tools, _format=_format, summarization_bullets=_SUMMARIZATION_BULLETS, diff --git a/contractor/agents/oas_linter_agent/agent.py b/contractor/agents/oas_linter_agent/agent.py index 7a40b41..adf62d7 100644 --- a/contractor/agents/oas_linter_agent/agent.py +++ b/contractor/agents/oas_linter_agent/agent.py @@ -48,7 +48,10 @@ def build_oas_linter_agent( return build_worker( name=name, instruction=OAS_LINTER_PROMPT, - description="software engineering agent", + description=( + "OpenAPI schema linter — runs lint_openapi and repairs the " + "serious schema issues it reports." + ), tools=tools, _format=_format, summarization_bullets=_SUMMARIZATION_BULLETS, diff --git a/contractor/agents/planning_agent/agent.py b/contractor/agents/planning_agent/agent.py index c0931a3..0e34e17 100644 --- a/contractor/agents/planning_agent/agent.py +++ b/contractor/agents/planning_agent/agent.py @@ -21,10 +21,6 @@ SUBTASK_PLANNING_PROMPT: Final[str] = load_prompt("planning_agent") -FINISH_MAX_CALLS_RVALUE: dict[str, str] = { - "error": "The 'finish' tool has already been called once. 
Stop execution." -} - def _safe_identifier(value: str) -> str: safe = re.sub(r"[^a-zA-Z0-9_]+", "_", value).strip("_").lower() return safe or "task" diff --git a/contractor/agents/planning_agent/prompts/v5.md b/contractor/agents/planning_agent/prompts/v5.md index eb5ef4f..1fe717d 100644 --- a/contractor/agents/planning_agent/prompts/v5.md +++ b/contractor/agents/planning_agent/prompts/v5.md @@ -28,10 +28,9 @@ even when task instructions describe them. NOT decompose again — write `structural_blocker/` to memory and skip with `structural_blocker: `. -6. The worker's latest `...` block is - already visible in your context after `execute_current_subtask` - returns. Do NOT call `get_records` or `get_current_subtask` just to - re-read what's already there. +6. The worker's latest result is already visible in your context after + `execute_current_subtask` returns. Do NOT call `get_records` or + `get_current_subtask` just to re-read what's already there. 7. Fresh worker evidence overrides stale memory. When they conflict, update or overwrite the memory entry; never act on the stale value. @@ -68,7 +67,9 @@ Each turn, scan from the top and take the action of the FIRST matching row. BOOTSTRAP — run exactly once, before any `add_subtask`: 1. `list_memories`. Read entries that look relevant by key. - 2. If memory already shows the objective is met → `finish(status="done", ...)` with no subtasks. + 2. If memory already shows the objective is met → add a single + verification subtask confirming the objective is met, execute it, + then call `finish(status="done", ...)`. 3. Otherwise call `add_subtask` N times where N ≤ 0.7 × <>. Each subtask must be single-outcome and Acceptance-tagged. 
Prefer the FEWEST subtasks that cover the objective — add focused ones later as diff --git a/contractor/agents/trace_agent/prompt.yml b/contractor/agents/trace_agent/prompt.yml index b63f132..8029dea 100644 --- a/contractor/agents/trace_agent/prompt.yml +++ b/contractor/agents/trace_agent/prompt.yml @@ -1,5 +1,12 @@ -active: v7 +active: converge versions: + # converge: v7 + commit discipline (HARD RULE 9 + anti-pattern) to stop the + # annotate->restore->re-annotate thrash that collapsed recall on large + # services (crapi-workshop). General convergence guidance, not benchmark- + # specific. A/B against v7 via CONTRACTOR_EVAL_TRACE_PROMPT_VERSION before + # promoting to active. + converge: + file: prompts/converge.md # shannon: v7 + Shannon-derived coverage ledger (negative results), # sink-slot/sanitizer-mismatch + control-dominance reasoning, and an # optional exploit_hypothesis witness on findings. A/B against v7 via diff --git a/contractor/agents/trace_agent/prompts/converge.md b/contractor/agents/trace_agent/prompts/converge.md new file mode 100644 index 0000000..7f57f0a --- /dev/null +++ b/contractor/agents/trace_agent/prompts/converge.md @@ -0,0 +1,333 @@ +You are a conservative trace-annotation worker. +Given a target (operation, handler, route, or symbol), trace the relevant +execution path, insert structured annotations on functions on that path, +and report only the vulnerabilities supported by visible code evidence. + +Prefer correctness over coverage. Annotate fewer functions, but with +evidence. NEVER invent files, calls, sinks, validation, or framework +behaviour. + +## HARD RULES + +1. Annotate ONLY code reachable from the assigned target. Generic helpers + are off-limits unless they validate, sink, or transform target-relevant + data. + +2. 
A function MAY be annotated when at least one is true: + - it is the chosen entrypoint; + - it validates or sanitises tainted/derived data; + - it passes tainted/derived data into a sink; + - it performs a key transformation needed to explain the path. + +3. Insert annotations ONLY through the dedicated tools. Do not write + annotation comments by hand with ``insert_line`` / ``edit`` — the + tools resolve the function, validate the argument schema, handle + indentation and comment-marker style per language, and refuse + duplicates: + + annotate_trace(file, function, target, args="name:state,...", calls="...") + annotate_validate(file, function, arg, kind) + annotate_sink(file, function, kind, arg) + + On-disk format is unchanged (``# @trace target=… args=… calls=…`` + / ``# @validate arg=… kind=…`` / ``# @sink kind=… arg=…`` for + ``#``-comment languages, ``//`` for the rest). Argument states + (conservative): ``tainted | validated | clean | derived``. Any + other value is rejected. + +4. Label a sink ONLY when the function directly performs it or clearly + wraps it. Naming alone is NOT evidence. Reaching a sink is NOT a + finding — Shape A requires a structural-control mechanic. + +5. Tool results may be elided from context. If you need a result you + already produced, check `list_touched_files` / `interaction_stats` or + `search_memory` first; do NOT re-read or re-grep what is already + captured in memory. + +6. Persist a memory note only at meaningful checkpoints — entrypoint + confirmed, sink found, validation point found, file ruled out as + irrelevant. Do NOT mirror `list_touched_files` into memory. + +7. Fresh code evidence beats memory. When they disagree, update the + memory entry and trust the code. + +8. The final response MUST be the structured SubtaskExecutionResult. + `output` carries the structured report defined in §OUTPUT; `summary` + is 2-5 sentences. Do not narrate strategy outside the report. + +9. 
**Commit to your annotations — do not churn.** `annotate_trace` + refuses duplicates by design, so there is no in-place "tweak": once a + function is annotated, move on. Do NOT `restore` a file to re-place the + same function's annotation with different `args`/`calls` — a slightly + imperfect annotation is acceptable and is far better than an + annotate → restore → re-annotate loop that burns the budget and + converges on nothing. Use `restore` ONLY to undo a genuinely wrong file + edit, at most once per file. If you have restored a path more than once, + STOP revising and emit the report with the annotations you already have. + If two consecutive actions add no new evidence or annotation, the trace + is done — report it. + +## WORKFLOW + +1. **Pin the target.** Read the assignment. If a file / symbol / route / + path is provided, that IS your starting lead. Do not search before + checking it. + +2. **Check inbox + memory.** Run `inbox_list` and `search_memory` keyed + on the target id or filename. Earlier work in this task may have + already located the entrypoint, sinks, or validation. + +3. **Reach the entrypoint.** Use the cheapest tool that works + (see TOOL PICKER). Confirm file + line + function name. Write a + memory note: `entrypoint/ = file:line, function`. + +4. **Trace values.** Follow tainted / derived values forward. Stop at + the first of: + - a sink reached on the path; + - the path leaves target-relevant code into generic infrastructure; + - confidence drops below "supported by visible code"; + - a terminal business operation. + + **GRAPH ESCALATION TRIGGERS — when one of these patterns appears in + the code you just read, the NEXT tool call MUST be a graph tool, not + another `read_file` or `grep`. 
Graph tools may be absent on remote + filesystems; only then fall back to grep.** + + - *Attribute call you didn't construct here.* Patterns: + `$this->svc->m()`, `self.svc.m()`, `this.svc.m()`, `svc.m()` + where `svc` is a constructor-injected dependency. Action: + `find_callees("")` to read the inferred call + edges, then `find_symbol("")` for each unresolved + row whose name matters to the trace. + - *Interface / abstract dispatch.* The receiver is typed as an + interface, abstract class, or trait. `find_symbol("")` + lists every concrete implementation; pick the one whose + containing class matches the binding you see (DI config, + `new ConcreteX`, factory return). + - *Same name in many files.* When you cannot tell from the call + site which implementation runs, `find_symbol("")` returns + all candidates with file/line; pick by call site evidence, not + by guessing. + - *Upward slice.* "Is this sink reachable from a known entrypoint?" + → `find_callers("")` (one hop) or + `entrypoint_paths_to("")` (every entrypoint that reaches + it). + - *More than 3 hops deep.* You have opened ≥ 3 files chasing a + value and still have not reached a sink. Run + `paths_between("", "")` to confirm + the path actually exists before reading further. + - *Large file you don't know your way around.* The file is over + ~500 lines and you don't see the function. `find_symbol("")` + returns the exact line — jump there with + `read_file(offset=, limit=…)`. + + Each escalation result is a *lead*, not proof. After the graph hop, + `read_file` the resolved location before annotating — graph + confidence is not source confidence. + +5. **Annotate.** Call `annotate_trace` / `annotate_validate` / + `annotate_sink` for each function that satisfies HARD RULE 2 — one + tool call per annotation. The tools resolve the function, pick the + comment marker, preserve indentation, and refuse duplicates; do not + reach for `insert_line` or `edit` to write a `@trace` comment by + hand. 
If a tool returns an error (e.g. "function X not defined in + Y"), fix the call (use `search_def` or `find_symbol` to relocate) + rather than working around it with `insert_line`. + +6. **Walk the per-handler control checklist** for the target handler. + Load it once via `skills_read("trace/references/controls")` if you + are unsure of the row set. Each `absent` / `weak` row on a sensitive + operation is a Shape B candidate. + +7. **Conditional rules — only when the trigger fires:** + + - *Create/verify pair* — if you annotated `auth.token.create` / + `*.sign` / `*.encode` / `auth.password.hash` for a credential or + session token, also open and annotate the corresponding + `verify` / `parse` / `decode` / `check` before reporting. + - *Cross-handler comparison* — if, in an already-opened file, you + observe another handler returning the same domain object as the + target, diff the post-processing. Unfiltered sibling → Shape C + candidate. Do NOT search for siblings; only check what is already + open. + +8. **Verify and report.** Call `changed_paths` and, if needed, `diff`. + Only the intended files should have changed. Then assemble the + §OUTPUT report and return. + +## TOOL PICKER + +Locate a symbol or call site: + 1. `read_file` on the file the assignment names — that is the + starting lead; do not search before reading it. + 2. `find_symbol(name)` — resolve a bare name to a graph node id + when the assignment did not pin a file, or the name appears in + several modules. *(May be absent on remote filesystems — fall + back to `search_def`.)* + 3. `search_def` — definitions of a named symbol (older index) + 4. `list_symbols` — index a file by symbol when you already know + the file + 5. `grep` — substrings, decorators, registrations, + free-form text + +Graph hops — see §WORKFLOW step 4 for the trigger list. When one of +those patterns fires, the **next** tool call is from this set, not +another `read_file` / `grep`. 
All graph tools may be absent on remote +filesystems; treat absence as a signal to fall back to grep, not as +an error. + + - `find_callees(node_id_or_name)` — step downstream. Returns both + resolved and inferred call + targets; chase inferred ones + with `find_symbol`. + - `find_callers(node_id_or_name)` — step upstream from a sink. + - `find_symbol(name)` — resolve a bare name to graph + nodes when many implementations + share it, or to jump into a + large file at the right line. + - `paths_between(src, dst)` — confirm a hypothesised path + before reading further. + - `entrypoint_paths_to(sink)` — every entrypoint that can reach + a sink. + - `attack_surface` — list detected entrypoints (≤ 1 + call per run; only when stuck). + - `graph_summary` — diagnostic; ≤ 1 call per run. + +Never call `graph_summary` or `attack_surface` to "confirm" a target +the assignment already named. Never read a whole file just to find +one function. + +Read code: + - small file → `read_file` + - large file → `list_symbols`, then `read_file` with a narrow + line range + +Annotate: + - `annotate_trace(file, function, target, args="", calls="")` + - `annotate_validate(file, function, arg, kind)` + - `annotate_sink(file, function, kind, arg="unknown")` + +Mutate code (do NOT use any of these for annotations — see "Annotate"): + - `edit` only when an existing block must be replaced + - `insert_line` for non-annotation edits the trace happens to need + (rare in practice) + - `replace_range` only for narrow line ranges; never whole files + - `restore` to revert overlay edits on a path + +Verify edits before finishing: + - `changed_paths` then `diff` + +Coverage bookkeeping (already tracked, do NOT mirror to memory): + - `interaction_stats`, `list_touched_files`, + `list_untouched_files`, `list_match_only_files` + +Memory (persist evidence, look up earlier evidence): + - `write_memory` / `append_memory` — checkpoints only + - `search_memory` / `read_memory` / `list_memories` — before + 
re-reading the same target or file + - `inbox_list` / `inbox_read` — to consume artifacts from + upstream tasks + +Skill references (lazy, on demand, once each): + - `skills_read("trace")` — workflow overview, if unsure + - `skills_read("trace/references/frameworks")` — EARLY: identify the stack's routing + its auth-control primitive (e.g. wp_ajax_/`current_user_can`, `@UseGuards`, `->middleware`, security config). Load whenever the framework or where-auth-lives is not obvious. + - `skills_read("trace/references/sources")` — before assigning your first arg state + - `skills_read("trace/references/sinks")` — when a call may be a sink + - `skills_read("trace/references/annotations")` — if the comment form is unclear + - `skills_read("trace/references/controls")` — before the per-handler checklist + - `skills_read("trace/references/finding-shapes")` — before reporting any finding + +Vulnerability tools (when available): + - `report_vulnerability` — use the field set from + `skills_read("trace/references/finding-shapes")` + - `list_vulnerabilities` / `get_vulnerability` — check existing + findings before reporting a duplicate + +## REPORTING DISCIPLINE + +You DETECT flaws from code; you do not have to prove or run them. Report +from visible structure — never withhold a finding because you could not +execute it. A **missing control on a sensitive operation (Shape B)** is +reportable on structure ALONE: no taint flow and no demonstrated impact +are required (e.g. a registration/admin handler reachable without an +authorization check). + +Report when, and only when, you can name all three: + 1. a reachable entry point (handler/route the caller or input reaches), + 2. the sensitive sink OR operation it reaches, and + 3. the specific missing/weak control (a Shape-B `absent`/`weak` row, a + Shape-A structural gap, or a Shape-C defect). + +A hunch missing any of the three is noise — do not report it. 
Apply this +same bar whether the result is zero findings or ten: do NOT flood the +report with unsupported guesses, and do NOT stay silent on a +structurally-clear missing control. + +## OUTPUT + +Place this report in the `output` field of the SubtaskExecutionResult, +in this order, with these headers verbatim: + + ## Annotations Inserted + - : function= kinds= + (one line per inserted block; do NOT paste annotation text — it + lives in the source file) + + ## Trace + Entrypoint: function= + Data flow: → ... → + (mark argument state at each step) + + ## Per-Handler Control Checklist + Use the row set from `skills_read("trace/references/controls")`. + Status per row: `present (file:line) | absent | weak | N/A`. + + ## Findings + One block per finding using the field set from + `skills_read("trace/references/finding-shapes")`, + or the literal line: `No findings supported by code.` + + ## Uncertainties + Items that materially affect the trace. Omit the section entirely + if none. + +Do NOT include strategy narration, restated assignment text, or +annotation source lines inside the report. + +## ANTI-PATTERNS + + - Annotating based on naming alone (e.g. a function called `validate` + that the code does not actually validate with). + - Inventing files, line numbers, calls, decorators, middleware, or + framework behaviour. + - Labelling parameterised ORM (`db.query`, `db.exec`) as `db.*.raw`. + - Annotating generic infrastructure that is not target-specific. + - Re-reading or re-grepping a file already captured in + `list_touched_files` or memory. + - Storing tool-coverage bookkeeping in memory. + - Grepping the codebase for a method when you just hit + `$this->svc->method()` and a graph tool is available — that is + exactly the case where `find_callees` / `find_symbol` is cheaper + and more accurate. + - Writing a `# @trace` / `# @validate` / `# @sink` comment with + `insert_line` or `edit`. 
Use `annotate_trace` / `annotate_validate` + / `annotate_sink` — they validate the argument schema, pick the + right comment marker, and won't let you create duplicates. + - Forcing a Shape-A taint→sink narrative when the actual bug is + missing-control (Shape B) or at-rest exposure (Shape C). + - Reporting a "vulnerability" whose only basis is reaching a sink. + - Filing a vulnerability against a spec file (OpenAPI YAML, proto, + swagger) instead of the source file containing the function. + - Ending with a plan or intention rather than concrete progress. + - Repeating the same tracing attempt without changing target, + hypothesis, or evidence basis. + - Restoring a path to re-place the same function's annotation with + different arguments — an annotate → restore → re-annotate loop. + Annotate once, accept minor imperfection, then finish; repeated + `restore` on a path is wasted work that converges on nothing. + +## MINDSET + +Evidence first. Annotate fewer functions, with confidence. Use the +cheapest sufficient tool. Persist findings at checkpoints, not after +every read. Trust the code over the names. diff --git a/contractor/agents/trace_verifier_agent/agent.py b/contractor/agents/trace_verifier_agent/agent.py index ae352ce..b878b9e 100644 --- a/contractor/agents/trace_verifier_agent/agent.py +++ b/contractor/agents/trace_verifier_agent/agent.py @@ -13,6 +13,7 @@ from contractor.tools.fs import FileFormat, ro_file_tools from contractor.tools.memory import MemoryFormat, memory_tools from contractor.tools.vuln import ( + READ_ONLY_VULN_TOOL_NAMES, VerifiedFindingFormat, VulnerabilityReportFormat, verification_tools, @@ -22,12 +23,6 @@ TRACE_VERIFIER_PROMPT: Final[str] = load_prompt("trace_verifier_agent") -# Tools we keep from `vulnerability_report_tools` for the verifier. The -# verifier reads upstream findings but must not author new ones. 
-_READ_ONLY_VULN_TOOL_NAMES: frozenset[str] = frozenset( - {"get_vulnerability", "list_vulnerabilities"} -) - _SUMMARIZATION_BULLETS: Final[str] = ( "You have reached the context limit. Summarize your progress:\n" "1. Finding under verification (name, sink place, claimed kind)\n" @@ -77,7 +72,7 @@ def build_trace_verifier_agent( name=src_ns, fmt=VulnerabilityReportFormat(_format=_format), ) - if t.__name__ in _READ_ONLY_VULN_TOOL_NAMES + if t.__name__ in READ_ONLY_VULN_TOOL_NAMES ] verif_tools = verification_tools( diff --git a/contractor/agents/vuln_analytics_agent/__init__.py b/contractor/agents/vuln_analytics_agent/__init__.py new file mode 100644 index 0000000..67bf6ea --- /dev/null +++ b/contractor/agents/vuln_analytics_agent/__init__.py @@ -0,0 +1,6 @@ +from contractor.agents.vuln_analytics_agent.agent import ( + AnalyticsFormat, + build_vuln_analytics_agent, +) + +__all__ = ["AnalyticsFormat", "build_vuln_analytics_agent"] diff --git a/contractor/agents/vuln_analytics_agent/agent.py b/contractor/agents/vuln_analytics_agent/agent.py new file mode 100644 index 0000000..7569d5a --- /dev/null +++ b/contractor/agents/vuln_analytics_agent/agent.py @@ -0,0 +1,98 @@ +"""Post-trace vulnerability analytics agent. + +Second stage of the post-diff trace split: a prior annotate-only trace +stage drives ``@trace`` / ``@validate`` / ``@sink`` annotations onto the +execution paths of a target (the annotated state lives in the trace +overlay), and this agent reads the resulting annotation diff, judges the +flows against the finding-shape taxonomy, and persists the supported +findings via ``report_vulnerability``. + +The agent never annotates or edits code — its filesystem view is +read-only (over the annotated overlay, so annotations are visible +in-place) and its only write surface is the vulnerability-report store +of its namespace. 
+""" + +from __future__ import annotations + +from collections.abc import Iterable +from typing import Final, Literal + +from fsspec import AbstractFileSystem +from google.adk.agents import LlmAgent +from google.adk.models.lite_llm import LiteLlm + +from contractor.agents.worker_factory import build_worker +from contractor.callbacks import default_tool +from contractor.tools.code import attach_graph_tools_if_local, code_tools +from contractor.tools.fs import FileFormat, ro_file_tools +from contractor.tools.memory import MemoryFormat, memory_tools +from contractor.tools.vuln import VulnerabilityReportFormat, vulnerability_report_tools +from contractor.utils import load_prompt + +AnalyticsFormat = Literal["json", "xml", "yaml", "markdown"] + +VULN_ANALYTICS_PROMPT: Final[str] = load_prompt("vuln_analytics_agent") + +_SUMMARIZATION_BULLETS: Final[str] = ( + "You have reached the context limit. Summarize your progress:\n" + "1. Annotated flows already analyzed (entrypoint -> sink, verdict)\n" + "2. Findings reported so far (name, shape, severity)\n" + "3. Flows from the diff not yet analyzed\n" + "4. Controls checked and their status per handler\n" + "5. 
Suggested next steps to finish the analysis\n" +) + + +def build_vuln_analytics_agent( + name: str, + fs: AbstractFileSystem, + *, + namespace: str, + _format: AnalyticsFormat = "json", + max_tokens: int = 80000, + model: LiteLlm | None = None, + elide_tool_results: Iterable[str] | None = None, + elide_keep_last_n: int = 15, + prompt: str | None = None, + with_graph_tools: bool = False, + graph_tools: list | None = None, +) -> LlmAgent: + instruction = prompt if prompt is not None else VULN_ANALYTICS_PROMPT + + mem_tools = memory_tools(name=namespace, fmt=MemoryFormat(_format=_format)) + fs_tools = ro_file_tools( + fs, + fmt=FileFormat(_format=_format), + with_interaction_tools=True, + ) + ctools = code_tools(fs=fs) + if graph_tools is not None: + gtools = graph_tools + elif with_graph_tools: + gtools = attach_graph_tools_if_local(fs) + else: + gtools = [] + vuln_tools = vulnerability_report_tools( + name=namespace, + fmt=VulnerabilityReportFormat(_format=_format), + ) + + tools = [default_tool, *fs_tools, *mem_tools, *ctools, *gtools, *vuln_tools] + + return build_worker( + name=name, + instruction=instruction, + description=( + "post-trace vulnerability analytics agent — reads a " + "@trace-annotated diff, judges each annotated flow against the " + "finding-shape taxonomy, and reports supported vulnerabilities." 
+ ), + tools=tools, + _format=_format, + summarization_bullets=_SUMMARIZATION_BULLETS, + max_tokens=max_tokens, + model=model, + elide_tool_results=elide_tool_results, + elide_keep_last_n=elide_keep_last_n, + ) diff --git a/contractor/agents/vuln_analytics_agent/prompt.yml b/contractor/agents/vuln_analytics_agent/prompt.yml new file mode 100644 index 0000000..e2c6792 --- /dev/null +++ b/contractor/agents/vuln_analytics_agent/prompt.yml @@ -0,0 +1,4 @@ +active: v1 +versions: + v1: + file: prompts/v1.md diff --git a/contractor/agents/vuln_analytics_agent/prompts/v1.md b/contractor/agents/vuln_analytics_agent/prompts/v1.md new file mode 100644 index 0000000..26c7f0b --- /dev/null +++ b/contractor/agents/vuln_analytics_agent/prompts/v1.md @@ -0,0 +1,141 @@ +You are a vulnerability analyst. The navigation work is already done: a +prior trace stage walked the execution paths of the target and marked +them in-place with structured annotations. Your input is the resulting +annotation diff. Your job is judgement, not discovery — decide, flow by +flow, which annotated paths constitute a real vulnerability, and report +the supported ones. + +The annotations you will see in the diff (and in the files themselves): + + # @trace target=… args=name:state,… calls=… + # @validate arg=… kind=… + # @sink kind=… arg=… + +Argument states: ``tainted | validated | clean | derived``. ``#`` is the +comment marker for Python-style languages, ``//`` for the rest. + +## HARD RULES + +1. The diff is a map, not proof. Annotations were placed by an earlier + automated stage and may be incomplete or imperfect. Before reporting + a finding, `read_file` the annotated locations — judge from the code, + not from the annotation text alone. + +2. NEVER annotate or edit code. Your filesystem is read-only. Your only + write surface is `report_vulnerability`. + +3. Work the diff systematically: enumerate every annotated flow + (entrypoint -> ... 
-> sink or terminal operation), and give each one + an explicit verdict — finding or no-finding — before finishing. Do + not stop after the first finding. + +4. Judge each flow against the finding shapes + (`skills_read("trace/references/finding-shapes")`): + - **Shape A** (taint reaches sink): requires a `tainted`/`derived` + argument state surviving to a `@sink` with no `@validate` (or a + mismatched one) on the path, plus the structural-control mechanic + the shape demands. + - **Shape B** (missing control on a sensitive operation): reportable + on structure ALONE. Walk the per-handler control checklist + (`skills_read("trace/references/controls")`) for every annotated + entrypoint. An `absent`/`weak` row on a sensitive operation is a + finding even with zero taint flow. + - **Shape C/D**: see the reference before reporting. + +5. A `@validate` annotation is not automatically a defense: check that + the validation kind actually mitigates the sink kind it precedes + (e.g. a length check does not stop SQL injection). A sanitizer + mismatched to its sink is a finding, not a control. + +6. Tool results may be elided from context. Persist verdicts and + per-flow conclusions to memory at checkpoints; check `search_memory` + before re-reading. + +7. The final response MUST be the structured SubtaskExecutionResult. + `output` carries the report defined in §OUTPUT; `summary` is 2-5 + sentences. + +## WORKFLOW + +1. **Parse the diff.** List every file with annotations and every + annotated flow. Write the flow inventory to memory — it is your + work queue. + +2. **Read the trace memory.** The trace stage shares your memory + namespace: `search_memory` for entrypoint/sink/validation notes it + left behind before re-deriving anything. + +3. **Per flow:** read the annotated code (entrypoint, every `@validate`, + the sink), reconstruct the argument-state chain, and apply the + finding shapes. 
When a call edge is unclear, use the graph tools + (`find_callees` / `find_symbol` / `paths_between`) if available — + otherwise `grep`/`search_def`. + +4. **Per entrypoint:** walk the control checklist (authn, authz, + ownership, CSRF, rate limits — use the controls reference row set). + Shape-B candidates come from this walk, not from taint. + +5. **Report.** One `report_vulnerability` call per supported finding. + Check `list_vulnerabilities` first to avoid duplicates. Use the + field set from the finding-shapes reference; cite file:line evidence + in `details`; name the sink, the data path, and the missing/weak + control. Severity from impact; confidence from evidence strength. + +6. **Close out.** Every flow in the inventory has a verdict; assemble + the §OUTPUT report. + +## REPORTING DISCIPLINE + +Report when, and only when, you can name all three: + 1. a reachable entry point, + 2. the sensitive sink OR operation it reaches, and + 3. the specific missing/weak control. + +A hunch missing any of the three is noise. Apply the same bar whether +that yields zero findings or ten: do not flood the report with +unsupported guesses, and do not stay silent on a structurally-clear +missing control. Under-validated flows that you could not complete go +in §Uncertainties, not in findings. + +## OUTPUT + +Place this report in the `output` field of the SubtaskExecutionResult, +with these headers verbatim: + + ## Flows Analyzed + - -> verdict= + + ## Findings + One block per reported finding: name, shape, severity, confidence, + entrypoint, sink, missing/weak control, evidence (file:line). + Or the literal line: No findings supported by code. + + ## Controls Summary + Per annotated entrypoint: checklist rows that are absent or weak. + + ## Uncertainties + Flows or controls you could not conclusively judge, and why. Omit + if none. + +## ANTI-PATTERNS + + - Reporting from annotation text without reading the code. 
+ - Treating any `@validate` as a defense without matching its kind to + the sink kind. + - Re-tracing the whole codebase — the diff defines your scope; leave + discovery to the trace stage. (Reading code *around* an annotated + flow to judge it is fine and expected.) + - Forcing a Shape-A taint narrative when the actual bug is a missing + control (Shape B). + - Reporting "reaches a sink" as a vulnerability with no missing + control named. + - Filing a finding against the diff or a spec file instead of the + source file containing the function. + - Stopping after one finding when unanalyzed flows remain. + +## MINDSET + +You are the judgement half of a two-stage pipeline. The tracer found +the paths; you decide what they mean. Be exhaustive across flows, +conservative within each: every annotated flow gets a verdict, and +every reported finding survives the three-part bar. diff --git a/contractor/agents/web_exploitability_agent/agent.py b/contractor/agents/web_exploitability_agent/agent.py index 5449060..159f3ee 100644 --- a/contractor/agents/web_exploitability_agent/agent.py +++ b/contractor/agents/web_exploitability_agent/agent.py @@ -8,13 +8,15 @@ from contractor.agents.worker_factory import build_worker from contractor.callbacks import default_tool -from contractor.callbacks.adapter import CallbackAdapter +from contractor.callbacks.adapter import chain_after_model_callback from contractor.callbacks.guardrails import MandatoryToolCallback from contractor.tools.caido import caido_tools from contractor.tools.http import http_tools from contractor.tools.memory import MemoryFormat, memory_tools from contractor.tools.podman import code_exec_tools from contractor.tools.vuln import ( + READ_ONLY_VULN_TOOL_NAMES, + VERDICT_TOOL_NAMES, VerifiedFindingFormat, VulnerabilityReportFormat, verification_tools, @@ -24,12 +26,6 @@ WEB_EXPLOIT_PROMPT: Final[str] = load_prompt("web_exploitability_agent") -_READ_ONLY_VULN_TOOL_NAMES: frozenset[str] = frozenset( - 
{"get_vulnerability", "list_vulnerabilities"} -) - -_VERDICT_TOOL_NAMES: list[str] = ["submit_verdict", "report_verification"] - _ELIDE_TOOLS: list[str] = [ "http_request", "http_read_body", "caido_history", "caido_request_detail", "caido_automate_results", @@ -106,7 +102,7 @@ def build_web_exploit_agent( name=src_ns, fmt=VulnerabilityReportFormat(_format=_format), ) - if t.__name__ in _READ_ONLY_VULN_TOOL_NAMES + if t.__name__ in READ_ONLY_VULN_TOOL_NAMES ] verif_tools = verification_tools( @@ -140,29 +136,10 @@ def build_web_exploit_agent( elide_keep_last_n=elide_keep_last_n, ) - mandatory_cb = MandatoryToolCallback(tool_names=_VERDICT_TOOL_NAMES, max_nudges=3) - adapter = CallbackAdapter(agent_name=name) - adapter.register(mandatory_cb) - extra_callbacks = adapter() - if "after_model_callback" in extra_callbacks: - existing = agent.after_model_callback - new_cb = extra_callbacks["after_model_callback"] - if existing is not None: - original = existing - def _chain(callback_context, llm_response, _orig=original, _new=new_cb): - result = _orig( - callback_context=callback_context, - llm_response=llm_response, - ) - if result is not None: - return result - return _new( - callback_context=callback_context, - llm_response=llm_response, - ) - agent.after_model_callback = _chain - else: - agent.after_model_callback = new_cb + chain_after_model_callback( + agent, + MandatoryToolCallback(tool_names=list(VERDICT_TOOL_NAMES), max_nudges=3), + ) return agent diff --git a/contractor/agents/worker_factory.py b/contractor/agents/worker_factory.py index a06e6ff..486ebca 100644 --- a/contractor/agents/worker_factory.py +++ b/contractor/agents/worker_factory.py @@ -126,9 +126,13 @@ def build_worker( if elide_keep_budget_chars is not None else get_settings().fs_heavy_keep_budget_chars ) + # Settings override (env FS_HEAVY_KEEP_LAST_N) wins when > 0 — lets an + # experiment loosen/disable count-based elision without code changes; + # otherwise the caller's elide_keep_last_n (default 
15) is used. + keep_last_n = get_settings().fs_heavy_keep_last_n or elide_keep_last_n callback_adapter.register( FunctionResultsRemovalCallback( - keep_last_n=elide_keep_last_n, + keep_last_n=keep_last_n, keep_budget_chars=keep_budget_chars, target_tools=elide_targets, ) diff --git a/contractor/callbacks/adapter.py b/contractor/callbacks/adapter.py index b0296df..b1ea539 100644 --- a/contractor/callbacks/adapter.py +++ b/contractor/callbacks/adapter.py @@ -1,10 +1,13 @@ from __future__ import annotations from dataclasses import dataclass, field -from typing import Any +from typing import TYPE_CHECKING, Any from .base import BaseCallback, CallbackTypes +if TYPE_CHECKING: + from google.adk.agents import LlmAgent + class CallbackDependencyException(Exception): def __init__(self, cb_name: str, cb_list: list[str]): @@ -75,3 +78,42 @@ def __call__(self) -> dict[str, Any]: # ``LlmAgent(**callback_adapter())`` splat type-checks against LlmAgent's # individually-typed callback params. return {cb_type.value: chain for cb_type, chain in self.chains.items()} + + +def chain_after_model_callback(agent: LlmAgent, callback: BaseCallback) -> None: + """Append ``callback`` behind the agent's existing after_model chain. + + ``CallbackChain`` stops at the first truthy return, so registering an + enforcement callback inside the worker chain can leave it unreachable + (the chain rarely returns ``None`` past a rewriting guardrail). This + runs the agent's current ``after_model_callback`` first and invokes + ``callback`` only when it returned ``None``. 
+ """ + if callback.cb_type is not CallbackTypes.after_model_callback: + raise ValueError( + f"Callback {callback.name} is {callback.cb_type}, expected " + f"{CallbackTypes.after_model_callback}" + ) + + adapter = CallbackAdapter(agent_name=agent.name) + adapter.register(callback) + new_cb = adapter()[CallbackTypes.after_model_callback.value] + + existing = agent.after_model_callback + if existing is None: + agent.after_model_callback = new_cb + return + + def _chained(callback_context, llm_response, _orig=existing, _new=new_cb): # type: ignore[no-untyped-def] + result = _orig( + callback_context=callback_context, + llm_response=llm_response, + ) + if result is not None: + return result + return _new( + callback_context=callback_context, + llm_response=llm_response, + ) + + agent.after_model_callback = _chained diff --git a/contractor/callbacks/base.py b/contractor/callbacks/base.py index aa758e2..e1452ca 100644 --- a/contractor/callbacks/base.py +++ b/contractor/callbacks/base.py @@ -51,11 +51,20 @@ def _expected_signatures() -> dict["CallbackTypes", inspect.Signature]: def verify_signature(cb_func: Callable, cb_type: "CallbackTypes") -> bool: - # expected = _expected_signatures().get(cb_type) - # if expected is None: - # raise ValueError(f"Unknown callback type: {cb_type}") - # return inspect.signature(cb_func) == expected - return True + """Check that ``cb_func`` accepts the parameters ADK passes for ``cb_type``. + + Compares parameter names and kinds only: return-type and parameter + annotations legitimately vary across callbacks (e.g. ``-> None`` vs + ``-> LlmResponse | None``), so full ``inspect.Signature`` equality would + reject valid callbacks. 
+ """ + expected = _expected_signatures().get(cb_type) + if expected is None: + raise ValueError(f"Unknown callback type: {cb_type}") + actual = inspect.signature(cb_func) + return [(p.name, p.kind) for p in actual.parameters.values()] == [ + (p.name, p.kind) for p in expected.parameters.values() + ] def _callback_name(func: Callable[..., Any]) -> str: diff --git a/contractor/callbacks/context.py b/contractor/callbacks/context.py index 80b1490..0aa88c3 100644 --- a/contractor/callbacks/context.py +++ b/contractor/callbacks/context.py @@ -28,6 +28,15 @@ def __init__( self.token_count: int = 0 self.history: list[Any] = [] self.summarization_key = summarization_key + # Latch: once the message has been injected for an invocation, do not + # inject it again for that invocation. The per-invocation token + # counter (TokenUsageCallback) only grows within an invocation and is + # reset only when the invocation changes, so there is no mid-invocation + # event to re-arm on — "once per invocation" is the correct semantics. + # The latch re-arms automatically when invocation_id changes (the + # counter resets then too). + self.fired: bool = False + self.fired_invocation_id: str | None = None def to_state(self) -> dict[str, Any]: return { @@ -35,6 +44,7 @@ def to_state(self) -> dict[str, Any]: "token_count": self.token_count, "message": self.message, "history": self.history, + "fired_invocation_id": self.fired_invocation_id, } def __call__( @@ -50,10 +60,19 @@ def __call__( self.save_to_state(callback_context) return + invocation_id = self.get_invocation_id(callback_context) + if self.fired and self.fired_invocation_id == invocation_id: + # Already injected for this invocation — don't append the message + # to every subsequent request. 
+ self.save_to_state(callback_context) + return + llm_request.contents.append( types.Content(role="user", parts=[types.Part(text=self.message)]) ) + self.fired = True + self.fired_invocation_id = invocation_id self.history.append(int(time.time())) self.save_to_state(callback_context) return @@ -156,7 +175,11 @@ def _build_call_signatures( if i < len(calls) and calls[i][0] == name: result[(ci, pi)] = calls[i] else: - result[(ci, pi)] = (name, "") + # No matching call: give the response a per-index sentinel + # signature so unmatched responses never collide with each + # other (or with a real argless call) and are never elided + # as "stale" duplicates. + result[(ci, pi)] = (name, f"") return result def to_state(self) -> dict[str, Any]: diff --git a/contractor/callbacks/guardrails.py b/contractor/callbacks/guardrails.py index d399828..bf6411e 100644 --- a/contractor/callbacks/guardrails.py +++ b/contractor/callbacks/guardrails.py @@ -13,7 +13,6 @@ from .tokens import TokenUsageCallback logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) TOKEN_USAGE_CALLBACK_NAME = TokenUsageCallback().name TOKEN_BUDGET_DEFAULT_MESSAGE: Final[str] = ( @@ -180,6 +179,7 @@ def __call__( return None new_parts: list[types.Part] = [] + modified = False for part in content.parts: fc = part.function_call @@ -215,10 +215,16 @@ def __call__( new_parts.append(part) self.history.append(metadata) + modified = True - content.parts = new_parts self.save_to_state(callback_context) + if not modified: + # Nothing was rewritten — return None so downstream + # after_model callbacks (e.g. MandatoryToolCallback) still run. 
+ return None + + content.parts = new_parts return llm_response diff --git a/contractor/callbacks/ratelimits.py b/contractor/callbacks/ratelimits.py index bd926db..5465c4e 100644 --- a/contractor/callbacks/ratelimits.py +++ b/contractor/callbacks/ratelimits.py @@ -9,12 +9,22 @@ from .tokens import TokenCounter, TokenUsageCallback logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) TOKEN_USAGE_CALLBACK_NAME = TokenUsageCallback().name class TpmRatelimitCallback(BaseCallback): + """Throttle LLM calls to a tokens-per-minute budget. + + WARNING: throttling uses a *blocking* ``time.sleep``. ``CallbackChain`` + composes callbacks synchronously and short-circuits on any truthy return + value, so this callback cannot be converted to a coroutine (the returned + coroutine object would be treated as a chain result and skip downstream + callbacks). While it sleeps it blocks the asyncio event loop, stalling + *every* concurrently running agent — wire it up only for single-agent + runs, or make ``CallbackChain`` awaitable-aware first. + """ + cb_type: CallbackTypes = CallbackTypes.before_model_callback deps: list[str] = [TOKEN_USAGE_CALLBACK_NAME] @@ -90,6 +100,13 @@ def __call__( class RpmRatelimitCallback(BaseCallback): + """Throttle LLM calls to a requests-per-minute budget. + + WARNING: throttling uses a *blocking* ``time.sleep`` — see + ``TpmRatelimitCallback`` for why this cannot be a coroutine and what it + means for concurrently running agents. + """ + cb_type: CallbackTypes = CallbackTypes.before_model_callback deps: list[str] = [] @@ -121,6 +138,8 @@ def __call__( els = current_time - (self.timer_start or 0) if self.request_count > self.rpm_limit: + # Window budget spent: sleep out the remainder of the window, then + # start a fresh window counting the current request. 
delay = 60 - els + 1 if delay > 0: time.sleep(delay) @@ -134,6 +153,13 @@ def __call__( ) self.timer_start = int(time.time()) self.request_count = 1 + elif els >= 60: + # Window elapsed under budget: roll it forward without sleeping — + # mirrors TpmRatelimitCallback. Without this, request_count keeps + # accumulating across stale windows and a later burst is throttled + # against requests from long-dead windows. + self.timer_start = current_time + self.request_count = 1 self.save_to_state(callback_context) return diff --git a/contractor/callbacks/tokens.py b/contractor/callbacks/tokens.py index ed92904..8bb1f29 100644 --- a/contractor/callbacks/tokens.py +++ b/contractor/callbacks/tokens.py @@ -11,11 +11,6 @@ logger = logging.getLogger(__name__) -class TokenUsageCallbackException(Exception): - def __init__(self) -> None: - super().__init__("token usage callback not found!") - - @dataclass class TokenCounter: input: int = 0 @@ -72,6 +67,8 @@ def _update_history( self, ctx: ToolContext | CallbackContext, counter: TokenCounter ): history = TokenUsageCallback.get_history(ctx) + if self.invocation_id is None: + return history history[self.invocation_id] = asdict(counter) ctx.state[TokenUsageCallback.global_history_key()] = history return history @@ -98,27 +95,30 @@ def __call__( total=usage.total_token_count or 0, ) - current = self.counter self._update_global_counter(callback_context, token_count) invocation_id = self.get_invocation_id(callback_context) - # if invocation_id is not set yet, try to pick it up from the response + # invocation_id not adopted yet — pick it up from the context. if self.invocation_id is None and self.is_empty(): self.invocation_id = invocation_id - # same invocation_id -> accumulate into current if invocation_id == self.invocation_id: + # Same invocation_id -> keep accumulating the current counter.
self.counter.add(token_count) - self.save_to_state(callback_context) - return - - # смена invocation_id -> сохраняем прошлый current и начинаем новый - if self.invocation_id is not None: - self._update_history(callback_context, current) - - self.invocation_id = invocation_id - self.counter = token_count + else: + # invocation_id changed -> start a fresh counter. The previous + # invocation's totals are already in history (flushed below on + # every call). + self.invocation_id = invocation_id + self.counter = token_count + + # Flush the in-progress invocation to history on every call. The + # history is keyed by invocation_id, so this overwrites (never + # double-counts) and keeps the *final* invocation's entry up to date — + # there is no "invocation changed" event after the last response, so + # waiting for the id to change would undercount by one invocation. + self._update_history(callback_context, self.counter) self.save_to_state(callback_context) return diff --git a/contractor/runners/_helpers.py b/contractor/runners/_helpers.py index 3645d62..16a7d80 100644 --- a/contractor/runners/_helpers.py +++ b/contractor/runners/_helpers.py @@ -6,9 +6,13 @@ from __future__ import annotations +import logging + from google.adk.events import Event from google.genai import types +logger = logging.getLogger(__name__) + def _extract_final_text(event: Event) -> str: """Pull the concatenated text from a final-response event.""" @@ -37,5 +41,12 @@ def _decode_part_text(part: types.Part | None) -> str: if isinstance(data, str): return data if isinstance(data, (bytes, bytearray)): - return data.decode("utf-8", errors="ignore") + try: + return data.decode("utf-8") + except UnicodeDecodeError: + logger.warning( + "Artifact part bytes are not valid UTF-8; decoding with " + "errors='replace' — invalid bytes appear as U+FFFD" + ) + return data.decode("utf-8", errors="replace") return "" diff --git a/contractor/runners/agent_runner.py b/contractor/runners/agent_runner.py index 
24abf97..ff1d013 100644 --- a/contractor/runners/agent_runner.py +++ b/contractor/runners/agent_runner.py @@ -1,5 +1,6 @@ from __future__ import annotations +import asyncio import logging from dataclasses import dataclass from typing import Any @@ -10,7 +11,7 @@ from google.adk.runners import Runner from google.adk.sessions import InMemorySessionService from google.genai import types -from pydantic import BaseModel, Field, PrivateAttr +from pydantic import BaseModel, Field from contractor.runners._helpers import _extract_final_text from contractor.runners.artifacts import artifact_names_for_key, save_result_artifacts @@ -40,6 +41,10 @@ class AgentRunner(BaseModel): Emits ``TaskRunnerEvent`` so callers can reuse the same handler / plugin contracts as ``TaskRunner``. + + The event handler is threaded through the run call chain rather than + stored on the instance, so concurrent ``run()`` calls on one runner + don't clobber each other's handlers. """ model_config = {"arbitrary_types_allowed": True} @@ -50,8 +55,6 @@ class AgentRunner(BaseModel): default_factory=InMemorySessionService ) - _on_event: TaskRunnerEventHandler | None = PrivateAttr(default=None) - async def run( self, *, @@ -70,76 +73,75 @@ async def run( caller wants to override the agent's own name (e.g. to surface a higher-level identifier in the UI). 
""" - self._on_event = on_event emit_name = event_name or agent.name - try: - session_id = session_id or uuid4().hex + session_id = session_id or uuid4().hex - content = ( - types.Content(role="user", parts=[types.Part(text=message)]) - if isinstance(message, str) - else message - ) + content = ( + types.Content(role="user", parts=[types.Part(text=message)]) + if isinstance(message, str) + else message + ) - await self._emit( - "agent_run_started", - task_name=emit_name, - agent_name=agent.name, - session_id=session_id, - ) + await self._emit( + on_event, + "agent_run_started", + task_name=emit_name, + agent_name=agent.name, + session_id=session_id, + ) - has_initial_state = bool(initial_state) - if has_initial_state: - await self.session_service.create_session( - app_name=self.name, - user_id=user_id, - state=initial_state, - session_id=session_id, - ) - - runner = Runner( - agent=agent, + has_initial_state = bool(initial_state) + if has_initial_state: + await self.session_service.create_session( app_name=self.name, - session_service=self.session_service, - artifact_service=self.artifact_service, - plugins=plugins or [], - auto_create_session=not has_initial_state, - ) - - final_text = "" - async for event in runner.run_async( user_id=user_id, + state=initial_state, session_id=session_id, - new_message=content, - ): - event_text = _extract_final_text(event) - if not event_text: - continue - final_text = event_text - await self._emit( - "final_text", - task_name=emit_name, - agent_name=agent.name, - session_id=session_id, - text=event_text, - ) - - final_state = await self._get_session_state(user_id, session_id) + ) + runner = Runner( + agent=agent, + app_name=self.name, + session_service=self.session_service, + artifact_service=self.artifact_service, + plugins=plugins or [], + auto_create_session=not has_initial_state, + ) + + final_text = "" + async for event in runner.run_async( + user_id=user_id, + session_id=session_id, + new_message=content, + ): + event_text 
= _extract_final_text(event) + if not event_text: + continue + final_text = event_text await self._emit( - "agent_run_finished", + on_event, + "final_text", task_name=emit_name, agent_name=agent.name, session_id=session_id, + text=event_text, ) - return AgentRunResult( - final_text=final_text, - session_id=session_id, - final_state=final_state, - ) - finally: - self._on_event = None + final_state = await self._get_session_state(user_id, session_id) + + await self._emit( + on_event, + "agent_run_finished", + task_name=emit_name, + agent_name=agent.name, + session_id=session_id, + ) + + return AgentRunResult( + final_text=final_text, + session_id=session_id, + final_state=final_state, + ) # ── Artifact publishing ─────────────────────────────────────────────── @@ -184,16 +186,30 @@ async def _get_session_state(self, user_id: str, session_id: str) -> dict[str, A # ── Event emission ──────────────────────────────────────────────────── - async def _emit(self, event_type: str, **payload: Any) -> None: - if self._on_event is None: + async def _emit( + self, + on_event: TaskRunnerEventHandler | None, + event_type: str, + **payload: Any, + ) -> None: + if on_event is None: return task_name = payload.pop("task_name", self.name) task_id = payload.pop("task_id", 0) - await self._on_event( - TaskRunnerEvent( - type=event_type, - task_name=task_name, - task_id=task_id, - payload=payload, - ) + event = TaskRunnerEvent( + type=event_type, + task_name=task_name, + task_id=task_id, + payload=payload, ) + try: + await on_event(event) + except asyncio.CancelledError: + # Cancellation must keep unwinding the run — never swallow it. + raise + except Exception: + # Event delivery is best-effort telemetry: a broken handler must + # never abort the agent run. 
+ logger.exception( + "event handler failed for %s event (task %s)", event_type, task_name + ) diff --git a/contractor/runners/agio.py b/contractor/runners/agio.py index 5de6f4e..c330546 100644 --- a/contractor/runners/agio.py +++ b/contractor/runners/agio.py @@ -70,6 +70,8 @@ class AgioEventType(StrEnum): TASK_STARTED = "task_started" TASK_FINISHED = "task_finished" TASK_FAILED = "task_failed" + # Emitted by Workflow.emit_task_skipped (artifact-reuse skips), not by + # TaskRunner — checkpoint restores emit TASK_FINISHED with restored=True. TASK_SKIPPED = "task_skipped" GLOBAL_TASK_FINISHED = "global_task_finished" ITERATION_STARTED = "iteration_started" diff --git a/contractor/runners/artifacts.py b/contractor/runners/artifacts.py index 818a969..e876b6e 100644 --- a/contractor/runners/artifacts.py +++ b/contractor/runners/artifacts.py @@ -1,6 +1,7 @@ from __future__ import annotations import json +import re from typing import Any from google.adk.artifacts import BaseArtifactService @@ -27,6 +28,20 @@ def validate_artifact_key(key: str) -> str: return cleaned +_SLUG_RE = re.compile(r"[^a-zA-Z0-9_-]+") + + +def artifact_key_slug(value: str) -> str: + """Collapse an arbitrary identifier (finding name, namespace, …) into a + single safe artifact-key segment. + + Deterministic for a given input — fan-out workflows use it to derive + stable per-task artifact keys that survive ``--resume``. 
+ """ + slug = _SLUG_RE.sub("_", (value or "").strip()).strip("_") + return slug or "item" + + def artifact_filename(key: str, kind: ArtifactKind) -> str: return f"{validate_artifact_key(key)}/{kind}" diff --git a/contractor/runners/models.py b/contractor/runners/models.py index ec7eb7a..655ded5 100644 --- a/contractor/runners/models.py +++ b/contractor/runners/models.py @@ -2,6 +2,7 @@ import json import logging +import os import re from collections.abc import Awaitable, Callable, Mapping from dataclasses import dataclass, field @@ -90,6 +91,12 @@ class TaskInvocation: artifacts: list[str] = field(default_factory=list) skills: list[str] = field(default_factory=list) + # Publish key for result/summary/records artifacts. ``None`` keeps the + # historical behavior (``template_key``); fan-out workflows that queue + # several tasks from the same template set a unique, stable key per task + # so siblings don't overwrite each other's artifacts. + artifact_key: str | None = None + iterations: int = 1 max_attempts: int = 1 max_steps: int = 15 @@ -101,6 +108,11 @@ class TaskInvocation: def effective_namespace(self, fallback: str) -> str: return self.namespace or fallback + @property + def effective_artifact_key(self) -> str: + """Key under which this invocation's artifacts are published.""" + return self.artifact_key or self.template_key + def effective_model(self, fallback: LiteLlm) -> LiteLlm: return self.model or fallback @@ -247,6 +259,13 @@ def load(cls, name: str, version: str | None = None) -> TaskTemplate: ) raw = data["task"] + for required in ("objective", "instructions", "output_format"): + if required not in raw: + raise ValueError( + f"Task body {body_path} missing required '{required}:' " + f"field under 'task:' (manifest: {manifest_path})" + ) + return cls( key=template_key, version=resolved_version, @@ -292,7 +311,13 @@ def _resolve_task_version( f"Task manifest {manifest_path} 'versions:' must be a mapping" ) - resolved = version or manifest["active"] + # 
Explicit arg wins; then an env override (CONTRACTOR_TASK_VERSION_, + # for A/B eval-gating a task version without flipping `active`); then active. + resolved = ( + version + or os.environ.get(f"CONTRACTOR_TASK_VERSION_{template_key.upper()}") + or manifest["active"] + ) if resolved not in versions: available = ", ".join(sorted(versions.keys())) or "(none)" raise ValueError( @@ -345,8 +370,22 @@ def from_template( sort_keys=False, ) + # Distinct artifact refs can normalize to the same template variable + # (e.g. "oas-build/result" and "oas_build/result" both become + # "artifact__oas_build__result"); the later one would silently win, + # so refuse the ambiguity instead. + var_sources: dict[str, str] = {} for artifact_ref, value in artifacts.items(): - scope[_artifact_var_name(artifact_ref)] = value + var_name = _artifact_var_name(artifact_ref) + if var_name in var_sources: + raise ValueError( + f"Artifact refs {var_sources[var_name]!r} and " + f"{artifact_ref!r} both normalize to template variable " + f"{var_name!r} — rename one so the substitutions don't " + f"collide" + ) + var_sources[var_name] = artifact_ref + scope[var_name] = value return cls( key=template.key, @@ -438,24 +477,33 @@ def load(cls, path: Path) -> Checkpoint | None: _checkpoint_logger.warning("ignoring corrupt checkpoint %s: %s", path, exc) return None - if data.get("version") != _CHECKPOINT_VERSION: + # Structurally malformed data (valid JSON but not the expected shape — + # entries missing task_id/ref/template_key, non-dict entries, …) must + # follow the same "ignoring corrupt checkpoint" path, not raise. 
+ try: + if data.get("version") != _CHECKPOINT_VERSION: + _checkpoint_logger.warning( + "ignoring checkpoint %s with unsupported version %s", + path, + data.get("version"), + ) + return None + + return cls( + workflow=data.get("workflow", ""), + entries=[ + CheckpointEntry( + task_id=t["task_id"], + ref=t["ref"], + template_key=t["template_key"], + template_version=t["template_version"], + published_artifacts=t.get("published_artifacts", {}), + ) + for t in data.get("tasks", []) + ], + ) + except (KeyError, TypeError, AttributeError) as exc: _checkpoint_logger.warning( - "ignoring checkpoint %s with unsupported version %s", - path, - data.get("version"), + "ignoring corrupt checkpoint %s: %r", path, exc ) return None - - return cls( - workflow=data.get("workflow", ""), - entries=[ - CheckpointEntry( - task_id=t["task_id"], - ref=t["ref"], - template_key=t["template_key"], - template_version=t["template_version"], - published_artifacts=t.get("published_artifacts", {}), - ) - for t in data.get("tasks", []) - ], - ) diff --git a/contractor/runners/plugins/metrics_plugin.py b/contractor/runners/plugins/metrics_plugin.py index d668de1..18d77b2 100644 --- a/contractor/runners/plugins/metrics_plugin.py +++ b/contractor/runners/plugins/metrics_plugin.py @@ -180,6 +180,19 @@ def register( tool_name: str, args: dict[str, Any], ) -> _TrackedCall: + fp = _fingerprint(invocation_id, agent_name, tool_name, args) + + # A new identical call means any earlier errored call's optional + # paired after_tool window has closed (when a plugin returns a + # non-None error response, ADK fires that after_tool synchronously, + # before the model's next turn can retry). Finish those stale calls + # so the retry's after_tool can never be attributed to them and they + # don't linger as pending. 
+ for call_id in self._pending_by_fp[fp]: + stale = self._calls.get(call_id) + if stale is not None and stale.exception_seen and not stale.finished: + stale.finished = True + call = _TrackedCall( call_id=next(self._seq), invocation_id=invocation_id, @@ -190,7 +203,6 @@ def register( started_at=_utcnow_iso(), started_monotonic=time.monotonic(), ) - fp = _fingerprint(invocation_id, agent_name, tool_name, args) self._calls[call.call_id] = call self._pending_by_fp[fp].append(call.call_id) return call @@ -206,11 +218,23 @@ def resolve( queue = self._pending_by_fp.get(fp) if not queue: return None + # Prefer the oldest unfinished call that has NOT errored: an errored + # call usually gets no after_tool (only plugins returning a non-None + # error response trigger one), so a same-args retry's after_tool must + # pair with the retry, not the stale errored call. Errored calls are + # kept as a fallback so a genuinely paired after_tool — arriving when + # no fresh call is pending — still de-dupes against them. 
+ errored_fallback: _TrackedCall | None = None for call_id in queue: call = self._calls.get(call_id) - if call is not None and not call.finished: - return call - return None + if call is None or call.finished: + continue + if call.exception_seen: + if errored_fallback is None: + errored_fallback = call + continue + return call + return errored_fallback def finish(self, call: _TrackedCall) -> None: call.finished = True diff --git a/contractor/runners/plugins/sandbox_cleanup.py b/contractor/runners/plugins/sandbox_cleanup.py index 1180a82..0a6b9be 100644 --- a/contractor/runners/plugins/sandbox_cleanup.py +++ b/contractor/runners/plugins/sandbox_cleanup.py @@ -11,8 +11,12 @@ ``AgentTool`` runners, so ``before_run_callback`` fires first for the outer run: we record that ``invocation_id`` as the root, and only when ``after_run_callback`` fires for that same root (the outer run finishing — the hook that reliably fires, -as the metrics/trace plugins demonstrate) do we sweep every sandbox. Safe because -code-exec runs are sequential. ``atexit`` + the container TTL remain backstops. +as the metrics/trace plugins demonstrate and the probe in +``tests/units/.../plugins/test_run_callbacks.py`` verifies) do we sweep every +sandbox. Safe because code-exec runs are sequential. ADK only awaits the hook +after the run's event stream is consumed to completion, so a run that raises or +is cancelled mid-stream skips it — ``TaskRunner.run``'s finally-sweep, ``atexit`` +and the container TTL remain backstops for those paths. """ from __future__ import annotations diff --git a/contractor/runners/skills.py b/contractor/runners/skills.py index b0df8c9..a595861 100644 --- a/contractor/runners/skills.py +++ b/contractor/runners/skills.py @@ -56,6 +56,26 @@ def _default_description(skill: str, rel_path: Path, is_index: bool) -> str: return f"{skill} skill / {rel_path.with_suffix('').as_posix()}" +def validate_skills(skills: Iterable[str]) -> None: + """Fail fast on unknown skill names. 
+ + An existence check of the skill directories only — content is still + loaded lazily by ``load_skill``. Lets ``TaskRunner.add_task`` reject a + typo'd skill at queue time instead of surfacing a ``FileNotFoundError`` + when the task's first iteration starts. + """ + missing = sorted({s for s in skills if not (SKILLS_BASE_DIR / s).is_dir()}) + if missing: + available = ", ".join( + sorted(p.name for p in SKILLS_BASE_DIR.iterdir() if p.is_dir()) + ) or "(none)" + raise ValueError( + f"Unknown skill(s) {', '.join(repr(s) for s in missing)} — " + f"no such directory under {SKILLS_BASE_DIR}. " + f"Available skills: {available}" + ) + + def load_skill(skill: str) -> list[SkillFile]: skill_dir = SKILLS_BASE_DIR / skill if not skill_dir.is_dir(): diff --git a/contractor/runners/task_runner.py b/contractor/runners/task_runner.py index 31d358b..ad173d9 100644 --- a/contractor/runners/task_runner.py +++ b/contractor/runners/task_runner.py @@ -1,5 +1,6 @@ from __future__ import annotations +import asyncio import copy import inspect import logging @@ -17,7 +18,11 @@ from contractor.agents.planning_agent.agent import build_planning_agent from contractor.runners._helpers import _decode_part_text, _extract_final_text -from contractor.runners.artifacts import artifact_names_for_key, save_result_artifacts +from contractor.runners.artifacts import ( + artifact_names_for_key, + save_result_artifacts, + validate_artifact_key, +) from contractor.runners.models import ( Checkpoint, CheckpointEntry, @@ -36,7 +41,7 @@ from contractor.runners.plugins.metrics_plugin import AdkMetricsPlugin from contractor.runners.plugins.sandbox_cleanup import SandboxCleanupPlugin from contractor.runners.plugins.trace_plugin import AdkTracePlugin -from contractor.runners.skills import inject_skills +from contractor.runners.skills import inject_skills, validate_skills from contractor.tools.memory import MemoryNote, MemoryTools from contractor.tools.observations import ObservationConfig from 
contractor.tools.podman import teardown_all as _teardown_sandboxes @@ -52,14 +57,24 @@ class TaskNotCompletedError(Exception): """Raised when a task exhausts all retry attempts without completing.""" - def __init__(self, ref: str, iterations: int, max_attempts: int) -> None: + def __init__( + self, + ref: str, + iterations: int, + max_attempts: int, + last_error: str | None = None, + ) -> None: self.ref = ref self.iterations = iterations self.max_attempts = max_attempts - super().__init__( + self.last_error = last_error + message = ( f"Task '{ref}' was not completed " f"{iterations} time(s) after {max_attempts} attempt(s)." ) + if last_error: + message += f" Last error: {last_error}" + super().__init__(message) # ─── Main Runner ────────────────────────────────────────────────────────────── @@ -103,6 +118,7 @@ def add_task( ref: str | None = None, params: dict[str, Any] | None = None, artifacts: list[str] | None = None, + artifact_key: str | None = None, skills: list[str] | None = None, iterations: int | None = None, max_attempts: int | None = None, @@ -116,11 +132,26 @@ def add_task( ``version`` pins a specific template version (must be declared in the task's manifest at ``contractor/tasks/.yml``). When omitted, the manifest's ``active`` version is used. + + ``artifact_key`` overrides the key under which the task's + ``result``/``summary``/``records`` artifacts are published (default: + the template key). Workflows that queue multiple tasks from the same + template must pass a unique, stable key per task — derived from the + finding/operation identifier, not queue position — so siblings don't + overwrite each other's artifacts and ``--resume`` validation stays + per-task. 
""" template = self._ensure_template(name, version=version) + if artifact_key is not None: + artifact_key = validate_artifact_key(artifact_key) task_ref = ref or f"{name}:{len(self.queue)}" self._assert_unique_ref(task_ref) + # Fail fast on a typo'd skill name — at queue time, not when the + # task's first iteration starts hours into a workflow. + eff_skills = list(skills if skills is not None else template.default_skills) + validate_skills(eff_skills) + eff_iterations, eff_max_attempts = self._resolve_retry_params( template, iterations, max_attempts ) @@ -132,8 +163,11 @@ def add_task( template_version=template.version, worker_builder=worker_builder, params=params or {}, - artifacts=list(artifacts or template.default_artifacts), - skills=list(skills if skills is not None else template.default_skills), + artifacts=list( + artifacts if artifacts is not None else template.default_artifacts + ), + artifact_key=artifact_key, + skills=eff_skills, iterations=eff_iterations, max_attempts=eff_max_attempts, max_steps=max_steps, @@ -151,11 +185,13 @@ async def run( on_event: TaskRunnerEventHandler | None = None, ) -> list[TaskResult]: self._on_event = on_event - checkpoint = self._load_checkpoint() try: results: list[TaskResult] = [] total_tasks = len(self.queue) + # Inside the try so the `finally` below still clears `_on_event` + # (and tears down sandboxes) if checkpoint loading fails. + checkpoint = self._load_checkpoint() await self._emit( EventType.RUN_STARTED, @@ -232,9 +268,13 @@ async def run( raise finally: self._on_event = None - # Reliable per-run teardown of any code-exec sandbox containers. - # The ADK after_run_callback does not fire in the TaskRunner + - # AgentTool nesting, so we sweep here, where run() always completes. + # Backstop teardown of any code-exec sandbox containers. 
+ # SandboxCleanupPlugin's after_run_callback does fire when the + # outer ADK run is consumed to completion (verified by + # tests/units/.../plugins/test_run_callbacks.py), but ADK only + # awaits it after the event generator is exhausted — a run that + # raises or is cancelled mid-stream never reaches it. This sweep + # runs on every exit path of run(), covering those cases. try: _teardown_sandboxes() except Exception: @@ -298,7 +338,29 @@ async def _try_restore_from_checkpoint( if entry is None: return None - for artifact_name in entry.published_artifacts.values(): + # A checkpoint entry recorded for a different template (or version) + # is stale: restoring it would silently skip the edited task and feed + # old artifacts downstream. Re-run instead. + if ( + entry.template_key != item.template_key + or entry.template_version != item.template_version + ): + logger.warning( + "checkpoint entry %s was recorded for template %s@%s but the " + "invocation now expects %s@%s — re-running", + item.ref, + entry.template_key, + entry.template_version, + item.template_key, + item.template_version, + ) + return None + + # Validate against the invocation's *own* publish key, not whatever + # the entry recorded — with shared template keys a sibling task's + # artifacts would otherwise "validate" this task's restore. 
+ published_artifacts = artifact_names_for_key(item.effective_artifact_key) + for artifact_name in published_artifacts.values(): part = await self.artifact_service.load_artifact( app_name=self.name, user_id=user_id, @@ -330,7 +392,7 @@ async def _try_restore_from_checkpoint( max_attempts=item.max_attempts, params=item.params, artifacts=item.artifacts, - published_artifacts=entry.published_artifacts, + published_artifacts=published_artifacts, total_tasks=total_tasks, completed_tasks=task_id, restored=True, @@ -353,7 +415,7 @@ async def _try_restore_from_checkpoint( records=[], params=copy.deepcopy(item.params), input_artifacts={}, - published_artifacts=entry.published_artifacts, + published_artifacts=published_artifacts, ) await self._emit( @@ -367,7 +429,7 @@ async def _try_restore_from_checkpoint( result=result.result, summary="", records=[], - published_artifacts=entry.published_artifacts, + published_artifacts=published_artifacts, total_tasks=total_tasks, completed_tasks=task_id, restored=True, @@ -479,11 +541,21 @@ async def _emit( ) -> None: if self._on_event is None: return - await self._on_event( - TaskRunnerEvent( - type=type, task_name=task_name, task_id=task_id, payload=payload - ) + event = TaskRunnerEvent( + type=type, task_name=task_name, task_id=task_id, payload=payload ) + try: + await self._on_event(event) + except asyncio.CancelledError: + # Cancellation must keep unwinding the run — never swallow it. + raise + except Exception: + # Event delivery is best-effort telemetry: a broken handler + # (disk-full MetricsSink, UI rendering, …) must never abort an + # hours-long workflow. 
+ logger.exception( + "event handler failed for %s event (task %s)", event.type, task_name + ) # ── Session management ──────────────────────────────────────────────── @@ -501,8 +573,26 @@ def _build_task_initial_state( task: RenderedTask, carry_state: dict[str, Any], ) -> dict[str, Any]: + # StreamlineManager scopes its planner-internal subtask state per ADK + # invocation (``task::{gid}::{invocation_id}::…`` — see + # ``StreamlineManager._state_key``). A new attempt gets a fresh + # invocation_id, so this task's keys from previous attempts are + # unreachable; carrying them forward only bloats the per-iteration + # deep-copies and the trace-plugin state snapshots. Strip them. + # The fixed task-scoped keys (``task::{id}::status`` etc.) have no + # further ``::`` segment and pass through (then get rebuilt by + # ``build_active_state``); keys of other namespaces are untouched. + stale_prefix = f"task::{task_id}::" + live_carry = { + key: value + for key, value in carry_state.items() + if not ( + key.startswith(stale_prefix) + and "::" in key[len(stale_prefix):] + ) + } return { - **copy.deepcopy(carry_state), + **copy.deepcopy(live_carry), **build_active_state(task_id=task_id, task=task), } @@ -511,34 +601,51 @@ def _is_task_completed(self, task_id: int, state: dict[str, Any]) -> bool: # ── Artifact I/O ────────────────────────────────────────────────────── - async def _load_artifact_text(self, user_id: str, artifact_ref: str) -> str: + async def _load_artifact_text( + self, user_id: str, artifact_ref: str + ) -> str | None: + """Load an artifact's text, or ``None`` when it was never published.""" part = await self.artifact_service.load_artifact( app_name=self.name, user_id=user_id, session_id=None, filename=artifact_ref, ) + if part is None: + return None return _decode_part_text(part) async def _load_artifacts( - self, user_id: str, artifact_refs: list[str] + self, user_id: str, artifact_refs: list[str], *, task_ref: str | None = None ) -> dict[str, str]: - 
return { - ref: await self._load_artifact_text(user_id=user_id, artifact_ref=ref) - for ref in artifact_refs - } + texts: dict[str, str] = {} + for ref in artifact_refs: + text = await self._load_artifact_text(user_id=user_id, artifact_ref=ref) + if text is None: + # Keep the empty-string substitution (seeds may legitimately + # come from the persistent store), but never silently: a typo'd + # ref or a never-published upstream looks identical otherwise. + logger.warning( + "Task '%s' declares input artifact '%s' but it is not " + "present in the artifact store — substituting empty text", + task_ref or "", + ref, + ) + text = "" + texts[ref] = text + return texts async def _publish_task_artifacts( self, user_id: str, - template_key: str, + key: str, result: TaskResult, ) -> None: await save_result_artifacts( artifact_service=self.artifact_service, app_name=self.name, user_id=user_id, - key=template_key, + key=key, result=result.result or "", summary=result.summary or "", records=result.records or [], @@ -624,7 +731,7 @@ def _build_iteration_result( status=final_state.get(keys.status), params=copy.deepcopy(item.params), input_artifacts=copy.deepcopy(input_artifacts), - published_artifacts=artifact_names_for_key(rendered_task.key), + published_artifacts=artifact_names_for_key(item.effective_artifact_key), ) async def _run_single_iteration( @@ -639,17 +746,6 @@ async def _run_single_iteration( iteration: int, ) -> TaskResult: agent = self._spawn_planning_agent(item, rendered_task) - namespace = item.effective_namespace(self.name) - await self._inject_skills( - user_id=user_id, - namespace=namespace, - skills=item.skills, - ) - await self._inject_artifacts( - user_id=user_id, - namespace=namespace, - input_artifacts=input_artifacts, - ) session_id = uuid4().hex initial_state = self._build_task_initial_state( @@ -797,7 +893,9 @@ async def _run_task_with_retries( total_tasks: int, ) -> TaskResult: template = self.templates[item.template_cache_key] - input_artifacts = 
await self._load_artifacts(user_id, item.artifacts) + input_artifacts = await self._load_artifacts( + user_id, item.artifacts, task_ref=item.ref + ) rendered_task = self._render_task( template, item.params, @@ -825,24 +923,73 @@ async def emit(event_type: EventType, **extra: Any) -> None: max_attempts=item.max_attempts, params=item.params, artifacts=item.artifacts, - published_artifacts=artifact_names_for_key(template.key), + published_artifacts=artifact_names_for_key(item.effective_artifact_key), observations=(item.observations or self.observations).as_tag(), ) + # Skills and input artifacts are invariant across attempts (the + # namespace, skill list, and loaded artifact texts never change), so + # inject them once per task instead of re-reading and rewriting the + # memory YAML on every retry. + namespace = item.effective_namespace(self.name) + await self._inject_skills( + user_id=user_id, + namespace=namespace, + skills=item.skills, + ) + await self._inject_artifacts( + user_id=user_id, + namespace=namespace, + input_artifacts=input_artifacts, + ) + carry_state: dict[str, Any] = {} last_result: TaskResult | None = None + last_exc: Exception | None = None successful_runs = 0 for iteration in range(1, item.max_attempts + 1): - result = await self._run_single_iteration( - item=item, - rendered_task=rendered_task, - input_artifacts=input_artifacts, - task_id=task_id, - user_id=user_id, - carry_state=carry_state, - iteration=iteration, - ) + try: + result = await self._run_single_iteration( + item=item, + rendered_task=rendered_task, + input_artifacts=input_artifacts, + task_id=task_id, + user_id=user_id, + carry_state=carry_state, + iteration=iteration, + ) + except asyncio.CancelledError: + # Cancellation is not a task failure — let it unwind the run. 
+ raise + except Exception as exc: + # Transient LLM/network/tooling errors must consume an attempt + # instead of aborting the whole multi-task workflow (documented + # invariant: failures keep retrying until max_attempts). + last_exc = exc + logger.exception( + "Task '%s' iteration %d/%d raised %s; " + "counting as a failed attempt", + item.ref, + iteration, + item.max_attempts, + type(exc).__name__, + ) + await emit( + EventType.ITERATION_RESULT, + iteration=iteration, + session_id=None, + status=None, + result=None, + summary=None, + completed=False, + iterations_required=item.iterations, + max_attempts=item.max_attempts, + successful_runs=successful_runs, + error_type=type(exc).__name__, + error_message=str(exc), + ) + continue last_result = result completed = self._is_task_completed(task_id, result.state) @@ -864,7 +1011,9 @@ async def emit(event_type: EventType, **extra: Any) -> None: if completed: successful_runs = next_successful_runs - await self._publish_task_artifacts(user_id, template.key, result) + await self._publish_task_artifacts( + user_id, item.effective_artifact_key, result + ) if successful_runs >= item.iterations: await emit( @@ -884,10 +1033,12 @@ async def emit(event_type: EventType, **extra: Any) -> None: EventType.TASK_FAILED, max_attempts=item.max_attempts, last_result=last_result, + last_error=str(last_exc) if last_exc is not None else None, ) raise TaskNotCompletedError( ref=item.ref, iterations=item.iterations, max_attempts=item.max_attempts, - ) + last_error=str(last_exc) if last_exc is not None else None, + ) from last_exc diff --git a/contractor/tasks/sink_nomination.yml b/contractor/tasks/sink_nomination.yml new file mode 100644 index 0000000..0e72cde --- /dev/null +++ b/contractor/tasks/sink_nomination.yml @@ -0,0 +1,4 @@ +active: v1 +versions: + v1: + file: sink_nomination/v1.yml diff --git a/contractor/tasks/sink_nomination/v1.yml b/contractor/tasks/sink_nomination/v1.yml new file mode 100644 index 0000000..de3c5ef --- /dev/null 
+++ b/contractor/tasks/sink_nomination/v1.yml @@ -0,0 +1,63 @@ +task: + name: "nominate_candidate_sinks_for_one_vulnerability_class" + + objective: > + Sweep the project for candidate sinks of ONE vulnerability class and + nominate every plausible candidate. This is the recall pass of a + two-pass pipeline: a downstream trace stage will deep-trace each + nomination and discard the false ones, but it can only examine what + you nominate — a sink missed here is lost for the whole pipeline. + + Sink class: {sink_class} + + Class guidance: + {class_guidance} + + Project root: {project_path} + + instructions: > + You are NOMINATING, not confirming. Do not deep-trace data flows, do + not build exploit narratives, and do not discard a candidate because + you could not prove reachability — uncertainty lowers confidence, it + does not suppress the nomination. + + Sweep method: + + 1. ENUMERATE the surface for this class. Start from entry points + (route handlers, controllers, tasks) and from class-specific + patterns (see the class guidance above and the injected skill's + sink catalogue). Use grep/glob broadly; use the graph tools + (attack_surface, find_callers) where available to enumerate + handlers and reach wrappers that grep misses. + + 2. TRIAGE each hit with a short read (the surrounding function is + enough). Nominate when the site plausibly belongs to the class. + For ABSENCE-style classes (e.g. missing access control), the + nomination unit is the handler: nominate every handler whose + visible code lacks the expected control — absence of evidence of + a control IS the signal, do not require a taint flow. + + 3. REPORT every nomination via report_vulnerability: + - name: short unique slug (class-prefixed, e.g. "{sink_class}-...") + - place: the source file containing the candidate + - details: the candidate line/function, why it belongs to this + class, and the CWE ID if clear (e.g. 
CWE-89) + - confidence: honest — "low" is expected and fine for a sweep + - severity: from the operation's sensitivity, not from certainty + + Coverage discipline: sweep the WHOLE project surface for this one + class before finishing — do not stop after the first few hits, and + do not drift into other vulnerability classes (parallel sweeps cover + them). After the sweep, call list_vulnerabilities and verify every + enumerated candidate was either nominated or consciously ruled out. + + output_format: > + Return the total number of nominations for this class, plus the + list of (name, place, confidence) entries, and a short note on any + areas of the project that could not be swept. + + skills: + - vuln_scan + + iterations: 1 + format: json diff --git a/contractor/tasks/trace_annotation.yml b/contractor/tasks/trace_annotation.yml index 18cd978..98ebcba 100644 --- a/contractor/tasks/trace_annotation.yml +++ b/contractor/tasks/trace_annotation.yml @@ -1,9 +1,15 @@ -active: v1 +active: v3 versions: v1: file: trace_annotation/v1.yml v2: file: trace_annotation/v2.yml + # v3: leaned, planner-scoped — domain knowledge delegated to the trace skill, + # tool/edit mechanics removed (the planner is tool-agnostic), output aligned to + # the agent §OUTPUT. Promoted to active in this change; to A/B against an + # earlier version, set CONTRACTOR_TASK_VERSION_TRACE_ANNOTATION=v1 (or v2). + v3: + file: trace_annotation/v3.yml # shannon: v1 + coverage-ledger output ("Analyzed — No Finding"), # control-dominance + sink-slot guidance, and an exploit_hypothesis # witness per finding. Pairs with trace_agent prompt `shannon`.
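Reviewer note on the `trace_annotation.yml` registry above: the loader that reads `active` / `versions` is not part of this diff, so the following is only a minimal sketch of how such a registry could be resolved, assuming the `CONTRACTOR_TASK_VERSION_<TASK>` environment variable named in the comment simply overrides `active`. The helper name `resolve_task_file` and the in-memory `REGISTRY` dict are hypothetical and used for illustration only.

```python
import os
from typing import Any, Mapping, Optional

# Prefix taken from the comment in trace_annotation.yml; how the real loader
# consumes it is an assumption.
ENV_PREFIX = "CONTRACTOR_TASK_VERSION_"


def resolve_task_file(
    task_name: str,
    registry: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: pick the task file for the effective version.

    An env override (if set and non-empty) wins over the registry's
    ``active`` key; an unknown version raises instead of silently
    falling back.
    """
    env = os.environ if env is None else env
    override = env.get(ENV_PREFIX + task_name.upper())
    version = override or registry["active"]
    versions = registry["versions"]
    if version not in versions:
        raise KeyError(f"unknown version {version!r} for task {task_name!r}")
    return versions[version]["file"]


# Mirrors the v1/v2/v3 entries of the YAML registry (shannon entry omitted).
REGISTRY = {
    "active": "v3",
    "versions": {
        "v1": {"file": "trace_annotation/v1.yml"},
        "v2": {"file": "trace_annotation/v2.yml"},
        "v3": {"file": "trace_annotation/v3.yml"},
    },
}
```

Under these assumptions, `resolve_task_file("trace_annotation", REGISTRY, env={})` selects the v3 file, while passing `env={"CONTRACTOR_TASK_VERSION_TRACE_ANNOTATION": "v1"}` pins the run to v1 for a comparison arm.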
diff --git a/contractor/tasks/trace_annotation/v3.yml b/contractor/tasks/trace_annotation/v3.yml new file mode 100644 index 0000000..594de9d --- /dev/null +++ b/contractor/tasks/trace_annotation/v3.yml @@ -0,0 +1,98 @@ +task: + name: "annotate_request_trace_for_vulnerability_analysis" + + objective: > + Given a single OpenAPI operation, locate the request entrypoint, trace the + relevant execution path, drive structured trace annotations onto the + functions on that path, and produce a structured analysis of data flow and + security-relevant behavior. + + Focus only on paths where request-derived data is processed, validated, + transformed, or reaches sensitive operations. + + Target operation: {operation_id} + Schema: {operation_schema} + + instructions: > + + ## What to accomplish + + - Locate the request entrypoint for the target operation. + - Trace request-derived data forward through the code until it reaches a + sink or a terminal business operation. + - Annotate the functions that carry trace value: the entrypoint, explicit + validation points, sinks (or their wrappers), and key transformations. + - Analyze the security-relevant behavior on that path: which controls are + present, weak, or absent, and which findings the visible code supports. + + Prefer concrete, code-backed paths over speculation. Annotate only where it + adds clear trace value. Distinguish observed code behavior from inference + and uncertainty. Do not introduce files or components not already observed. + + ## Where the domain knowledge lives + + The sink catalogue, the per-sink vulnerability checklist, the finding-shape + taxonomy, the per-handler control checklist, and the annotation forms live + in the injected `trace` skill — the worker loads them on demand. Do NOT + restate that knowledge here, and do NOT prescribe tools or edit mechanics: + how to read, search, and annotate is the worker's concern, not the plan's. 
+ + ## State management + + - Treat already-discovered files, functions, and flows as working context; + do not extend them beyond shown evidence. + - Do not re-investigate something already confirmed; reuse earlier results + and memory instead of repeating discovery. + + ## Scope control (anti-drift) + + - Do NOT introduce new goals (refactoring, instrumentation design, building + a tracing framework, improving the code). + - Stay within: locate entrypoint, trace data flow, annotate, analyze + security. + + ## Planning rules + + - Assign investigation goals, not search mechanics — describe what must be + learned, not which low-level reads to perform; let the worker choose the + concrete actions. + - Prefer fewer, meaningful steps over many micro-steps. + - Each step must produce NEW information and resolve one real uncertainty + about control flow, data flow, validation, sink reachability, or + vulnerability evidence. Do not create steps for directory listing or + existence checks unless strictly necessary. + + ## Completion criteria (stop condition) + + STOP and produce final output once ALL are satisfied: + 1. Entrypoint is identified. + 2. Request parameters are mapped (body/query/path/headers). + 3. At least one complete, unambiguous data-flow path is traced to a sink or + a terminal business operation. + 4. Validation (or its absence) is identified. + 5. The security mechanism (auth/authz) is identified. + + Do not create new subtasks after these are met. If results are partial but + sufficient, complete rather than retrying. + + output_format: | + Produce the structured trace report with these sections (headers verbatim): + + ## Annotations Inserted + - : function= kinds= + + ## Trace + Entrypoint: function= + Data flow: -> -> -> ... -> + (mark the argument state at each step) + + ## Per-Handler Control Checklist + Use the row set from the `trace` skill's controls reference. + Status per row: present (file:line) | absent | weak | N/A. 
+ + ## Findings + One block per finding using the field set from the `trace` skill's + finding-shapes reference, or the literal line: No findings supported by code. + + ## Uncertainties + Items that materially affect the trace. Omit the section if none. diff --git a/contractor/tasks/vuln_analytics.yml b/contractor/tasks/vuln_analytics.yml new file mode 100644 index 0000000..e76e52a --- /dev/null +++ b/contractor/tasks/vuln_analytics.yml @@ -0,0 +1,4 @@ +active: v1 +versions: + v1: + file: vuln_analytics/v1.yml diff --git a/contractor/tasks/vuln_analytics/v1.yml b/contractor/tasks/vuln_analytics/v1.yml new file mode 100644 index 0000000..3aa2e44 --- /dev/null +++ b/contractor/tasks/vuln_analytics/v1.yml @@ -0,0 +1,66 @@ +task: + name: "analyze_trace_annotated_diff_for_vulnerabilities" + + objective: > + A prior trace stage annotated the execution paths of the target + operations in-place. Read the annotation diff below, judge every + annotated flow against the finding-shape taxonomy, and report each + supported vulnerability via report_vulnerability. + + Target operations: + {target_summary} + + instructions: > + + ## What to accomplish + + - Enumerate every annotated flow visible in the diff (entrypoint, + validations, sinks, transformations). + - For each flow, read the annotated code and reconstruct the + argument-state chain; decide finding vs. clean vs. uncertain. + - For each annotated entrypoint, walk the per-handler control + checklist; missing/weak controls on sensitive operations are + findings on structure alone. + - Report every supported finding; give every flow a verdict. + + The diff is a map, not proof: confirm in the source files before + reporting. Do not annotate or edit code; do not re-trace beyond + what is needed to judge the annotated flows. + + ## Where the domain knowledge lives + + The finding-shape taxonomy, the per-handler control checklist, and + the sink catalogue live in the injected `trace` skill — load them + on demand. 
Do NOT restate that knowledge here. + + ## Completion criteria (stop condition) + + STOP and produce final output once every annotated flow has a + verdict and every annotated entrypoint has a control-checklist + result. If a flow cannot be conclusively judged, record it as + uncertain rather than retrying indefinitely. + + ## Annotation diff + + {trace_diff} + + output_format: | + Produce the structured analytics report with these sections + (headers verbatim): + + ## Flows Analyzed + - -> verdict= + + ## Findings + One block per reported finding (name, shape, severity, confidence, + entrypoint, sink, missing/weak control, evidence file:line), or the + literal line: No findings supported by code. + + ## Controls Summary + Per annotated entrypoint: absent/weak checklist rows. + + ## Uncertainties + Omit the section if none. + + iterations: 1 + format: json diff --git a/contractor/tools/fs/globmatch.py b/contractor/tools/fs/globmatch.py new file mode 100644 index 0000000..2a6d715 --- /dev/null +++ b/contractor/tools/fs/globmatch.py @@ -0,0 +1,67 @@ +"""Path-aware glob-to-regex translation shared by sandbox filesystems. + +Semantics mirror Python's ``pathlib``-style globbing: ``*`` / ``?`` / +``[...]`` match within a single path segment (never crossing ``/``), +while ``**`` matches any number of segments, including zero. 
+""" + +from __future__ import annotations + +import re + + +def _translate_glob_segment(seg: str) -> str: + """Translate one glob path segment to regex, never crossing ``/``.""" + out: list[str] = [] + i, n = 0, len(seg) + while i < n: + c = seg[i] + if c == "*": + out.append("[^/]*") + elif c == "?": + out.append("[^/]") + elif c == "[": + j = i + 1 + if j < n and seg[j] == "!": + j += 1 + if j < n and seg[j] == "]": + j += 1 + while j < n and seg[j] != "]": + j += 1 + if j >= n: # no closing bracket: treat '[' literally + out.append(re.escape(c)) + else: + inner = seg[i + 1 : j] + if inner.startswith("!"): + inner = "^" + inner[1:] + out.append("(?!/)[" + inner + "]") # a negated class must not match '/' + i = j + 1 + continue + else: + out.append(re.escape(c)) + i += 1 + return "".join(out) + + +def glob_to_regex(pattern: str) -> re.Pattern[str]: + """ + Compile a glob pattern into a path-aware regex with Python-like semantics: + ``*``/``?``/``[...]`` match within a single path segment, while ``**`` + matches any number of segments (including zero). Matches relative paths + without a leading ``/``.
+ """ + segments = pattern.split("/") + parts: list[str] = [] + last = len(segments) - 1 + for idx, seg in enumerate(segments): + if seg == "**": + if idx == last: + parts.append(".*") # trailing ** matches anything, any depth + else: + parts.append("(?:[^/]*/)*") # **/ matches zero or more segments + continue # the separator is baked into the group above + else: + parts.append(_translate_glob_segment(seg)) + if idx != last: + parts.append("/") + return re.compile("(?s:" + "".join(parts) + r")\Z") diff --git a/contractor/tools/fs/merge.py b/contractor/tools/fs/merge.py index 2d07d76..c9a592f 100644 --- a/contractor/tools/fs/merge.py +++ b/contractor/tools/fs/merge.py @@ -36,6 +36,9 @@ def fork_overlay( fork = MemoryOverlayFileSystem(base_fs, skip_instance_cache=True) if patch and patch.get("patches"): fork.load(patch) + # Record the tombstones replayed from the patch so merge_overlay_forks can + # tell deletes that *predate* the fork from deletes made inside the fork. + fork._pre_fork_deleted = frozenset(fork._deleted) # type: ignore[attr-defined] return fork @@ -43,6 +46,7 @@ def merge_overlay_forks( target: MemoryOverlayFileSystem, forks: Sequence[MemoryOverlayFileSystem], pre_fork_files: dict[str, bytes], + pre_fork_deleted: frozenset[str] | set[str] | None = None, ) -> list[str]: """Merge writes produced by parallel forks back into *target*. @@ -64,6 +68,13 @@ def merge_overlay_forks( pre_fork_files: ``dict(target._files)`` captured **before** forking. Used to distinguish pre-existing content from new work. + pre_fork_deleted: + Tombstone paths that already existed at the fork point. Only fork + deletes *outside* this set propagate to *target*, so a path the + target restored after forking is not re-deleted by a stale fork + tombstone. When ``None`` (default), the set is derived per fork from + what ``fork_overlay`` recorded; forks created some other way fall + back to propagating every tombstone (the historical behaviour). 
Returns ------- @@ -104,7 +115,10 @@ def merge_overlay_forks( for fork in forks: target._dirs.update(fork._dirs) - new_deletes = fork._deleted - set(pre_fork_files) + baseline = pre_fork_deleted + if baseline is None: + baseline = getattr(fork, "_pre_fork_deleted", frozenset()) + new_deletes = fork._deleted - baseline target._deleted.update(new_deletes) return conflicts diff --git a/contractor/tools/fs/overlayfs.py b/contractor/tools/fs/overlayfs.py index 6b714a5..875d226 100644 --- a/contractor/tools/fs/overlayfs.py +++ b/contractor/tools/fs/overlayfs.py @@ -7,11 +7,11 @@ from collections.abc import Iterable, Iterator from copy import deepcopy from datetime import datetime -from pathlib import PurePosixPath from typing import Any from fsspec.spec import AbstractFileSystem +from contractor.tools.fs.globmatch import glob_to_regex from contractor.tools.fs.models import FsEntry from contractor.tools.fs.overlay_diff import render_overlay_diff from contractor.tools.fs.overlay_patch import ( @@ -20,6 +20,7 @@ build_overlay_patch, sha256_hex, ) +from contractor.utils.settings import get_settings FileInfo = dict[str, Any] Patch = dict[str, Any] @@ -1230,38 +1231,64 @@ def find( return sorted(result) def glob(self, path: str, **kwargs: Any): - pattern = self._norm(path) + """ + Path-aware glob over the merged (base + overlay) view. + + Mirrors ``RootedLocalFileSystem.glob`` semantics: matches files only, + ``*``/``?``/``[...]`` stay within a single path segment and ``**`` + spans any number of segments (including zero). Overlay-added files are + included; tombstoned (deleted-in-overlay) files are excluded because + ``walk`` already merges the overlay view. + """ + matches, _truncated = self.glob_scanned(path) + return matches + + def glob_scanned( + self, path: str, max_files: int | None = None + ) -> tuple[list[str], bool]: + """``glob`` plus a truncation flag. 
+ + The tree walk is hard-bounded at *max_files* scanned files (default: + ``Settings.fs_max_files_per_walk``) so a glob over a huge base tree + cannot run away. The flag is ``True`` when the ceiling was hit, i.e. + the match list may be incomplete. + """ + if not path: + return [], False + + pattern = self._norm(path).lstrip("/") + if not pattern: + return [], False - if "**" in pattern: - search_root = pattern.split("**", 1)[0].rstrip("/") or "/" - candidates = set() + # Reject obvious traversal attempts. + if ".." in pattern.split("/"): + return [], False - if self.exists(search_root): - candidates.add(search_root) + if max_files is None: + max_files = get_settings().fs_max_files_per_walk - candidates.update(self.find(search_root, withdirs=True, detail=False)) - else: - parts = pattern.strip("/").split("/") - prefix_parts: list[str] = [] + regex = glob_to_regex(pattern) + matches: set[str] = set() + scanned = 0 + truncated = False - for part in parts: - if any(ch in part for ch in "*?["): + for root, _dirs, files in self.walk(self.root_marker): + rel_root = "" if root == self.root_marker else root.lstrip("/") + + for name in files: + if scanned >= max_files: + truncated = True break - prefix_parts.append(part) + scanned += 1 - search_root = "/" + "/".join(prefix_parts) if prefix_parts else "/" + rel_path = f"{rel_root}/{name}" if rel_root else name + if regex.match(rel_path): + matches.add("/" + rel_path) - candidates = set() - try: - for item in self.ls(search_root, detail=True): - candidates.add(self._norm(item["name"])) - except FileNotFoundError: - return [] + if truncated: + break - pattern_no_root = pattern.lstrip("/") - return sorted( - p for p in candidates if PurePosixPath(p.lstrip("/")).match(pattern_no_root) - ) + return sorted(matches), truncated def du( self, diff --git a/contractor/tools/fs/read_tools.py b/contractor/tools/fs/read_tools.py index 24eea46..a91b165 100644 --- a/contractor/tools/fs/read_tools.py +++ b/contractor/tools/fs/read_tools.py 
@@ -39,6 +39,13 @@ # coverage-gap projection caps the *surfaced* list far lower (25). _IN_SCOPE_WALK_LIMIT: Final[int] = 2000 +# Truncation notice attached to glob/grep output when the tree walk hit the +# ``fs_max_files_per_walk`` ceiling (style mirrors ``format_output``'s footer). +_WALK_TRUNCATION_NOTICE: Final[str] = ( + "### file walk truncated after scanning {max_files} files: results may be " + "incomplete — narrow `path` or `pattern` ###" +) + def _push_fs_coverage( tool_context: ToolContext | None, snapshot: dict[str, int] @@ -100,6 +107,7 @@ def __init__( max_output: int | None = None, max_items: int | None = None, max_lines: int | None = None, + max_files_per_walk: int | None = None, ignored_patterns: list[str] | None = None, with_types: bool = True, with_file_info: bool = True, @@ -118,6 +126,14 @@ def __init__( # Default line cap when read_file gets no explicit `limit`. Falls back # to the (possibly None) settings value, i.e. "no cap" unless configured. self.max_lines = s.fs_max_read_lines if max_lines is None else max_lines + # Hard ceiling on files scanned by a single glob/grep tree walk so the + # tools cannot run away on a huge repo (mirrors code_max_files_per_walk + # for the code-tools walker). When hit, the output carries a notice. + self.max_files_per_walk = ( + s.fs_max_files_per_walk + if max_files_per_walk is None + else max_files_per_walk + ) self.with_types = with_types self.with_file_info = with_file_info @@ -408,7 +424,18 @@ def glob( if not pattern_for_fs.startswith("/") and normalized_path != "/": pattern_for_fs = f"{normalized_path.rstrip('/')}/{pattern_for_fs}" - matches = [str(match) for match in self.fs.glob(pattern_for_fs)] + # Sandbox filesystems expose ``glob_scanned`` — a bounded glob whose + # tree walk stops at the max-files ceiling and reports truncation. + # Other backends fall back to the plain (unbounded) fsspec glob. 
+ glob_scanned = getattr(self.fs, "glob_scanned", None) + if callable(glob_scanned): + raw_matches, walk_truncated = glob_scanned( + pattern_for_fs, max_files=self.max_files_per_walk + ) + else: + raw_matches, walk_truncated = self.fs.glob(pattern_for_fs), False + + matches = [str(match) for match in raw_matches] prefix = normalized_path.rstrip("/").replace("\\", "/") + "/" if normalized_path != "/": @@ -429,12 +456,20 @@ def glob( total = len(entries) paged = entries[offset : offset + self.max_items] + meta: dict[str, Any] = {} + if walk_truncated: + meta["walk_truncated"] = True + meta["notice"] = _WALK_TRUNCATION_NOTICE.format( + max_files=self.max_files_per_walk + ) + return ok_page( self.fmt.format_file_list(paged), total, returned=len(paged), offset=offset, limit=self.max_items, + **meta, ) def read_file( @@ -561,21 +596,39 @@ def build_entries_for_file(file_path: str) -> list[FsEntry]: ) results: list[FsEntry] = [] + scanned = 0 + walk_truncated = False + # Bound the tree walk so grep over a huge repo cannot run away; when + # the ceiling is hit the (partial) results carry a truncation notice. 
for current_path, _dirs, filenames in self.fs.walk(normalized_path): for filename in filenames: + if scanned >= self.max_files_per_walk: + walk_truncated = True + break + scanned += 1 full_path = join_path(current_path, filename) results.extend(build_entries_for_file(full_path)) + if walk_truncated: + break results.sort(key=lambda entry: (entry.path, entry.loc.line_start or 0)) total = len(results) paged = results[offset : offset + self.max_items] + meta: dict[str, Any] = {} + if walk_truncated: + meta["walk_truncated"] = True + meta["notice"] = _WALK_TRUNCATION_NOTICE.format( + max_files=self.max_files_per_walk + ) + return ok_page( self.fmt.format_file_list(paged), total, returned=len(paged), offset=offset, limit=self.max_items, + **meta, ) def interaction_stats( diff --git a/contractor/tools/http.py b/contractor/tools/http.py index 3b27bf9..b5b6062 100644 --- a/contractor/tools/http.py +++ b/contractor/tools/http.py @@ -165,16 +165,37 @@ def __init__( self._next_request_id: int = 1 self._state_lock = asyncio.Lock() - self._client = httpx.AsyncClient( - proxy=proxy, - timeout=httpx.Timeout(timeout), - verify=verify_ssl, + # Connection config for the short-lived per-request httpx clients. + # A persistent AsyncClient used to be created here, but ``http_tools`` + # returns only tool closures with no teardown seam reachable from the + # agent factories, so the pool leaked on every agent build. Creating a + # fresh client per request (closed via ``async with``) eliminates the + # leak; we forfeit keep-alive pooling, which is acceptable for this + # CLI's request profile. Cookies persist across requests via a shared + # ``httpx.Cookies`` jar (copied into each client, merged back afterwards).
+ self._proxy = proxy + self._timeout = httpx.Timeout(timeout) + self._verify_ssl = verify_ssl + self._user_agent = user_agent + self._cookies = httpx.Cookies() + + def _new_client(self, timeout: float | None = None) -> httpx.AsyncClient: + """Build a short-lived client; the caller is responsible for closing it + (use ``async with``). The client is seeded from the shared cookie jar; + httpx copies the jar, so ``request`` merges new cookies back into it.""" + return httpx.AsyncClient( + proxy=self._proxy, + timeout=httpx.Timeout(timeout) if timeout is not None else self._timeout, + verify=self._verify_ssl, follow_redirects=True, - headers={"User-Agent": user_agent}, + headers={"User-Agent": self._user_agent}, + cookies=self._cookies, ) async def aclose(self) -> None: - await self._client.aclose() + """No-op: clients are per-request and closed via ``async with``. Kept + for backward compatibility with the ``async with HTTPClient(...)`` and + explicit-``aclose`` call sites.""" async def __aenter__(self) -> HTTPClient: return self @@ -196,15 +217,15 @@ def body_artifact_name(self, request_id: int) -> str: return f"http/{self.name}/responses/{request_id:08d}.json" def get_cookies(self) -> dict[str, str]: - return dict(self._client.cookies.items()) + return dict(self._cookies.items()) def set_cookies( self, cookies: Mapping[str, str], *, replace: bool = False ) -> None: if replace: - self._client.cookies.clear() + self._cookies.clear() for k, v in cookies.items(): - self._client.cookies.set(str(k), str(v)) + self._cookies.set(str(k), str(v)) def set_default_headers( self, headers: Mapping[str, str], *, replace: bool = False @@ -248,7 +269,7 @@ def get_history(self, limit: int | None = None) -> list[HistorySummary]: return items def clear_session_state(self) -> None: - self._client.cookies.clear() + self._cookies.clear() self._default_headers.clear() self._auth = None self._history.clear() @@ -408,6 +429,7 @@ def _make_request_tag(self, request_id: int) -> str: def _build_request(
self, + client: httpx.AsyncClient, *, method: str, url: str, @@ -444,47 +466,42 @@ def _build_request( else: raise HTTPClientError(f"unsupported body_type: {body_type!r}") - return self._client.build_request(**kwargs) + return client.build_request(**kwargs) async def _send_with_retries( self, + client: httpx.AsyncClient, request: httpx.Request, *, follow_redirects: bool, - timeout: float | None, ) -> httpx.Response: - saved_timeout = self._client.timeout - if timeout is not None: - self._client.timeout = httpx.Timeout(timeout) - + # The per-request timeout is baked into ``client`` at creation time + # (see ``_new_client``), so there is no client-wide timeout to mutate. last_error: BaseException | None = None - try: - for attempt in range(1, self.retry_config.attempts + 1): - try: - response = await self._client.send( - request, follow_redirects=follow_redirects - ) - if response.status_code not in self.retry_config.retry_on_statuses: - return response - last_error = HTTPClientError( - f"Retryable HTTP status {response.status_code} " - f"for {request.method} {request.url}" - ) - except ( - httpx.TimeoutException, - httpx.NetworkError, - httpx.RemoteProtocolError, - ) as exc: - last_error = exc - - if attempt < self.retry_config.attempts: - delay = min( - self.retry_config.base_delay * (2 ** (attempt - 1)), - self.retry_config.max_delay, - ) - await asyncio.sleep(delay) - finally: - self._client.timeout = saved_timeout + for attempt in range(1, self.retry_config.attempts + 1): + try: + response = await client.send( + request, follow_redirects=follow_redirects + ) + if response.status_code not in self.retry_config.retry_on_statuses: + return response + last_error = HTTPClientError( + f"Retryable HTTP status {response.status_code} " + f"for {request.method} {request.url}" + ) + except ( + httpx.TimeoutException, + httpx.NetworkError, + httpx.RemoteProtocolError, + ) as exc: + last_error = exc + + if attempt < self.retry_config.attempts: + delay = min( + 
self.retry_config.base_delay * (2 ** (attempt - 1)), + self.retry_config.max_delay, + ) + await asyncio.sleep(delay) if last_error is None: raise HTTPClientError("Request failed for an unknown reason") @@ -515,19 +532,25 @@ async def request( tagged[REQUEST_TAG_HEADER] = request_tag headers = tagged - request = self._build_request( - method=method, - url=url, - headers=headers, - query=query, - body=body, - body_type=body_type, - ) - start = time.monotonic() - response = await self._send_with_retries( - request, follow_redirects=follow_redirects, timeout=timeout - ) + async with self._new_client(timeout) as client: + request = self._build_request( + client, + method=method, + url=url, + headers=headers, + query=query, + body=body, + body_type=body_type, + ) + response = await self._send_with_retries( + client, request, follow_redirects=follow_redirects + ) + # The per-request client seeds its jar from ``self._cookies`` but + # httpx copies (does not share) it, so Set-Cookie responses land in + # the client's jar only. Merge them back before the client closes so + # cookies persist across the per-request client boundary. + self._cookies.update(client.cookies) elapsed_ms = int((time.monotonic() - start) * 1000) content_type = response.headers.get("content-type", "") diff --git a/contractor/tools/likec4.py b/contractor/tools/likec4.py index 4a45ace..865e448 100644 --- a/contractor/tools/likec4.py +++ b/contractor/tools/likec4.py @@ -64,12 +64,14 @@ class Likec4Linter: and running `likec4 validate --json --no-layout --file `. The agent owns the source string (typically kept in its memory store) and passes it in directly — no on-disk artifact is needed. + + The likec4 command is resolved lazily on first use (and cached), so + constructing the linter never raises — a missing binary surfaces as + `Likec4NotFoundError` from `validate`/`validate_path` instead. 
""" _lock: asyncio.Lock = field(default_factory=asyncio.Lock, init=False) - - def __post_init__(self) -> None: - self._resolve_command() + _cmd_prefix: list[str] | None = field(default=None, init=False) @staticmethod def _resolve_command() -> list[str]: @@ -89,6 +91,16 @@ def _resolve_command() -> list[str]: "(tried: likec4, bunx, pnpx, npx)" ) + def _command(self) -> list[str]: + """Resolve the likec4 command prefix, caching it after first success. + + Raises: + Likec4NotFoundError: If no usable runner is found in PATH. + """ + if self._cmd_prefix is None: + self._cmd_prefix = self._resolve_command() + return self._cmd_prefix + def validate( self, content: str, *, timeout: float | None = None ) -> list[dict[str, Any]]: @@ -113,7 +125,7 @@ def validate( """ if timeout is None: timeout = get_settings().likec4_validate_timeout - cmd_prefix = self._resolve_command() + cmd_prefix = self._command() with tempfile.TemporaryDirectory(prefix="likec4-") as tmp: tmp_path = Path(tmp) @@ -252,6 +264,10 @@ def likec4_tools( backed for the build agent), so `validate_likec4` re-reads from disk on every call and no separate in-memory copy of the source is needed. + Building the tool list never raises when the likec4 binary is missing: + the command is resolved lazily, so a missing binary surfaces as an + {"error": ...} tool result on the first `validate_likec4` call instead. + Args: fs: fsspec filesystem (typically an overlay over the project root). 
default_path: Path used when `validate_likec4` is called with no diff --git a/contractor/tools/tasks/__init__.py b/contractor/tools/tasks/__init__.py index 1b73dcc..9c56fe8 100644 --- a/contractor/tools/tasks/__init__.py +++ b/contractor/tools/tasks/__init__.py @@ -22,10 +22,12 @@ SKIP_REASON_MUST_NOT_BE_EMPTY, SUBTASK_DECOMPOSE_EMPTY_LIST, SUBTASK_DECOMPOSE_NOT_DECOMPOSABLE, + SUBTASK_DECOMPOSE_OVER_CAPACITY, SUBTASK_NOT_CURRENT_MSG, SUBTASK_REQUIRES_DECOMPOSITION_MSG, SUBTASK_REQUIRES_RESOLUTION_MSG, SUBTASK_RESULT_MALFORMED, + SUBTASK_SKIP_NOT_SKIPPABLE, SUBTASK_STATUS_TRANSITIONS, TASK_LIMIT_REACHED_MSG, InvalidStatusTransitionError, @@ -58,7 +60,9 @@ "SUBTASK_REQUIRES_RESOLUTION_MSG", "SUBTASK_REQUIRES_DECOMPOSITION_MSG", "SUBTASK_DECOMPOSE_NOT_DECOMPOSABLE", + "SUBTASK_DECOMPOSE_OVER_CAPACITY", "SUBTASK_RESULT_MALFORMED", + "SUBTASK_SKIP_NOT_SKIPPABLE", "SubtaskStatus", "SUBTASK_STATUS_TRANSITIONS", "DO_NOT_FINISH_WITH_NO_TASKS_DONE", diff --git a/contractor/tools/tasks/manager.py b/contractor/tools/tasks/manager.py index fa71d62..c000486 100644 --- a/contractor/tools/tasks/manager.py +++ b/contractor/tools/tasks/manager.py @@ -185,7 +185,18 @@ def skip( reason: str, ctx: ToolContext | CallbackContext, ) -> Subtask | None: - """Skip the current subtask. Returns the next subtask or None.""" + """Skip the current subtask. + + Returns: + The next subtask, or None when the skip succeeded but no later + subtask exists. (None also covers "no current subtask".) + + Raises: + InvalidStatusTransitionError: If the current subtask is in a + non-skippable state ('done'/'decomposed'/'skipped'), so the + caller can distinguish a rejected skip from a successful skip + with no next subtask. 
+ """ idx = self._get_idx(ctx) if idx is None: return None @@ -206,7 +217,7 @@ def skip( "error": str(exc), }, ) - return None + raise # Determine the next subtask, if any next_subtask: Subtask | None = None diff --git a/contractor/tools/tasks/models.py b/contractor/tools/tasks/models.py index e4cf8c3..c801ed2 100644 --- a/contractor/tools/tasks/models.py +++ b/contractor/tools/tasks/models.py @@ -20,7 +20,13 @@ TASK_LIMIT_REACHED_MSG: Final[str] = ( "The maximum number of subtasks ({max_tasks}) has been reached. " "You MUST NOT create new subtasks. " - "Summarize the records collected so far and call `finish` immediately." + "Execute or skip the remaining subtasks first, then call `finish` " + "with a summary of the records collected so far." +) +SUBTASK_DECOMPOSE_OVER_CAPACITY: Final[str] = ( + "Decomposing into {requested} subtasks would exceed the subtask limit " + "({max_tasks}): only {remaining} more subtask(s) can be added. " + "Retry with fewer children." ) SUBTASK_NOT_CURRENT_MSG: Final[str] = ( "Subtask `{task_id}` is NOT the current subtask. " @@ -50,6 +56,12 @@ "If it is already resolved ('done', 'skipped', or 'decomposed'), move on " "to the next subtask." ) +SUBTASK_SKIP_NOT_SKIPPABLE: Final[str] = ( + "Subtask `{task_id}` has status '{status}' and was NOT skipped. " + "Only subtasks with status 'new', 'incomplete', or 'malformed' can be " + "skipped. It is already resolved — move on to the next subtask, or call " + "`finish` if the objective is complete." +) SUBTASK_RESULT_MALFORMED: Final[str] = ( "The worker returned a result that could not be completely parsed into the " "expected format. The raw output has been stored for reference. " @@ -138,6 +150,7 @@ class SubtaskDecomposition(BaseModel): subtasks: list[SubtaskSpec] = Field( ..., min_length=1, + max_length=3, description=( "Ordered list of 1-3 executable subtasks. 
Requirements:\n" "- Each subtask MUST be independently executable\n" diff --git a/contractor/tools/tasks/tools.py b/contractor/tools/tasks/tools.py index eb5d415..63784da 100644 --- a/contractor/tools/tasks/tools.py +++ b/contractor/tools/tasks/tools.py @@ -35,13 +35,15 @@ NO_REMAINING_SUBTASKS_MSG, NO_SUBTASKS_EXIST_MSG, SKIP_REASON_MUST_NOT_BE_EMPTY, - SUBTASK_DECOMPOSE_EMPTY_LIST, SUBTASK_DECOMPOSE_NOT_DECOMPOSABLE, + SUBTASK_DECOMPOSE_OVER_CAPACITY, SUBTASK_NOT_CURRENT_MSG, SUBTASK_REQUIRES_DECOMPOSITION_MSG, SUBTASK_REQUIRES_RESOLUTION_MSG, SUBTASK_RESULT_MALFORMED, + SUBTASK_SKIP_NOT_SKIPPABLE, TASK_LIMIT_REACHED_MSG, + InvalidStatusTransitionError, Subtask, SubtaskDecomposition, SubtaskExecutionResult, @@ -54,6 +56,30 @@ # instruction text and stay idempotent across repeated calls. _INSTRUCTION_SNAPSHOT_ATTR: Final[str] = "_streamline_original_instruction" +# Per-record cap applied wherever raw worker output is stored or re-fed to an +# LLM (malformed records, the finish-time summarizer payload). Keeps a single +# runaway record from blowing the context window. 
+_MAX_RECORD_FIELD_LEN: Final[int] = 20_000 +_TRUNCATION_MARKER: Final[str] = "\n… [truncated]" + + +def _truncate_text(text: str, limit: int = _MAX_RECORD_FIELD_LEN) -> str: + if len(text) <= limit: + return text + return text[:limit] + _TRUNCATION_MARKER + + +def _truncate_record(record: Any) -> Any: + """Cap the string payloads of a single execution record.""" + if isinstance(record, str): + return _truncate_text(record) + if isinstance(record, dict): + return { + key: _truncate_text(value) if isinstance(value, str) else value + for key, value in record.items() + } + return record + # ═══════════════════════════════════════════════════════════════════ # Instructions & instrumentation @@ -235,6 +261,17 @@ def task_tools( worker = instrument_worker( agent_ref, fmt, use_type_hint, use_input_schema, use_output_schema ) + elif use_input_schema and getattr(_get_agent_ref(worker), "input_schema", None) is None: + # With instrumentation off, `execute_current_subtask` still passes + # the Subtask dict as args, which ADK's AgentTool only accepts when + # the agent declares an input_schema. Fail loudly at assembly time + # instead of surfacing as an obscure KeyError('request') at runtime. + raise ValueError( + "task_tools(worker_instrumentation=False) with use_input_schema=True " + "requires the worker agent to have `input_schema` set (e.g. " + "`worker.input_schema = Subtask`). Set it before calling task_tools, " + "or pass use_input_schema=False." + ) if not isinstance(worker, AgentTool): worker = AgentTool(worker) @@ -245,11 +282,14 @@ def task_tools( summarizer_tool: AgentTool | None = None if use_summarization: agent_ref = _get_agent_ref(worker) + # Summarization is a pure text-in/text-out step: no tools, so the + # summarizer neither wastes context on tool schemas nor wanders off + # into tool calls. 
summarizer_agent = LlmAgent( name="task_summarizer", description="Produces structured summaries of completed task executions.", instruction=TASK_RESULT_SUMMARIZATION_INSTRUCTIONS, - tools=agent_ref.tools, + tools=[], model=agent_ref.model, ) summarizer_tool = AgentTool(summarizer_agent) @@ -405,9 +445,20 @@ def decompose_subtask( decomposition.subtasks, tool_context ) if insertion is None: + # Given the precondition checks above, the manager only refuses + # here when the children would exceed the subtask limit. Report + # remaining capacity so the planner can retry with fewer children + # instead of being told (wrongly) that the limit is fully spent. + remaining = mgr.max_tasks - len(mgr.get_subtasks(tool_context)) + if remaining >= 1: + return { + "error": SUBTASK_DECOMPOSE_OVER_CAPACITY.format( + requested=len(decomposition.subtasks), + max_tasks=max_tasks, + remaining=remaining, + ) + } return {"error": TASK_LIMIT_REACHED_MSG.format(max_tasks=max_tasks)} - if len(insertion) == 0: - return {"error": SUBTASK_DECOMPOSE_EMPTY_LIST} return {"result": fmt.format_subtasks(insertion)} def skip(task_id: str, reason: str, tool_context: ToolContext) -> dict[str, Any]: @@ -423,7 +474,8 @@ def skip(task_id: str, reason: str, tool_context: ToolContext) -> dict[str, Any] are NOT acceptable. Returns: - The next subtask, or a message if no more remain. + The next subtask, a message if no more remain, or an error if + the current subtask is already resolved and cannot be skipped. IMPORTANT CONSTRAINTS: - You MUST have attempted execution first or have clear evidence @@ -459,7 +511,17 @@ def skip(task_id: str, reason: str, tool_context: ToolContext) -> dict[str, Any] ) } - next_subtask = mgr.skip(reason, tool_context) + try: + next_subtask = mgr.skip(reason, tool_context) + except InvalidStatusTransitionError: + # The current subtask is already resolved ('done'/'decomposed'/ + # 'skipped') — report the rejection instead of pretending the + # skip happened. 
+ return { + "error": SUBTASK_SKIP_NOT_SKIPPABLE.format( + task_id=current.task_id, status=current.status + ) + } if next_subtask is None: return {"result": NO_ACTIVE_SUBTASKS_MSG} return {"result": "ok", "next-subtask": fmt.format_subtask(next_subtask)} @@ -625,7 +687,9 @@ async def execute_current_subtask( runtime_result = { "task_id": current.task_id, "status": "malformed", - "output": str(raw_dump), + # Cap the stored raw output — malformed responses can be huge + # and the record is re-fed to LLMs via get_records / finish. + "output": _truncate_text(str(raw_dump)), "summary": SUBTASK_RESULT_MALFORMED, } record_usage = usage if (usage is not None and obs.in_record) else None @@ -736,9 +800,15 @@ async def finish( if use_summarization and summarizer_tool is not None: objective_key = StreamlineManager._task_keys(tool_context).objective objective = tool_context.state.get(objective_key, "") + # Same cap as `get_records`: most-recent `max_records`, each record + # truncated, so a long run cannot blow the summarizer's context. + records = [ + _truncate_record(rec) + for rec in mgr.get_records(tool_context)[-max_records:] + ] payload = { "objective": objective, - "records": mgr.get_records(tool_context), + "records": records, "result": result, "status": status, } diff --git a/contractor/tools/vuln.py b/contractor/tools/vuln.py index 4f68e59..e612135 100644 --- a/contractor/tools/vuln.py +++ b/contractor/tools/vuln.py @@ -3,7 +3,7 @@ import asyncio from collections.abc import Callable, Iterable from dataclasses import dataclass, field -from typing import Any, Literal, Protocol, TypeVar +from typing import Any, Final, Literal, Protocol, TypeVar from xml.sax.saxutils import escape as xml_escape import yaml @@ -23,6 +23,16 @@ # Type aliases # --------------------------------------------------------------------------- +# Tool-name subsets agents use to wire vuln tooling selectively. 
The +# read-only set is the slice of ``vulnerability_report_tools`` that reads +# upstream findings without authoring new ones; the verdict set is the +# ``verification_tools`` a verdict-producing agent must eventually call +# (enforced via ``MandatoryToolCallback``). +READ_ONLY_VULN_TOOL_NAMES: Final[frozenset[str]] = frozenset( + {"get_vulnerability", "list_vulnerabilities"} +) +VERDICT_TOOL_NAMES: Final[tuple[str, ...]] = ("submit_verdict", "report_verification") + Severity = Literal["info", "low", "medium", "high", "critical"] Confidence = Literal["low", "medium", "high"] PlaceType = Literal["file", "url"] diff --git a/contractor/utils/observability.py b/contractor/utils/observability.py index a4620b8..798df99 100644 --- a/contractor/utils/observability.py +++ b/contractor/utils/observability.py @@ -10,6 +10,7 @@ from __future__ import annotations import logging +import sys from collections.abc import Iterator, Mapping, Sequence from contextlib import contextmanager from typing import Any @@ -45,6 +46,9 @@ def init() -> None: get_client() except Exception as exc: logger.warning("Langfuse init failed: %s", exc) + # Intentionally set even after a failed init: retrying on every call + # would re-attempt the import/instrumentation (and re-log the warning) + # for the whole run. A broken Langfuse degrades to no-op observability. _initialized = True @@ -118,8 +122,19 @@ def run_context( yield None return + # Enter/exit the span manually so a broken Langfuse client degrades to a + # no-op span instead of crashing the run (this module never raises). 
+ span_cm = None + span = None try: - with client.start_as_current_span(name=name) as span: + span_cm = client.start_as_current_span(name=name) + span = span_cm.__enter__() + except Exception as exc: + logger.warning("run_context: failed to open span: %s", exc) + span_cm = None + + try: + if span is not None: tag_trace( name=name, user_id=user_id, @@ -127,8 +142,15 @@ def run_context( tags=tags, metadata=metadata, ) - yield span + yield span finally: + if span_cm is not None: + try: + # Inside `finally`, sys.exc_info() is the in-flight exception + # (if any), so the span still records the failure status. + span_cm.__exit__(*sys.exc_info()) + except Exception as exc: + logger.warning("run_context: failed to close span: %s", exc) flush() diff --git a/contractor/utils/settings.py b/contractor/utils/settings.py index 3ffc683..12eae45 100644 --- a/contractor/utils/settings.py +++ b/contractor/utils/settings.py @@ -17,15 +17,30 @@ from pydantic import Field from pydantic_settings import BaseSettings, SettingsConfigDict +# The documented config file is `cli/.env`. A bare `load_dotenv()` walks up +# from *this* module (contractor/utils/ → repo root) and never descends into +# cli/, so non-CLI entrypoints (tests, scripts) used to miss it — only the CLI +# worked because cli/main.py loads it first. Anchor it explicitly; the bare +# call stays as a CWD-relative fallback. Neither overrides already-set env +# vars, so CLI behaviour is unchanged. +_CLI_ENV_FILE = Path(__file__).resolve().parents[2] / "cli" / ".env" + +if _CLI_ENV_FILE.is_file(): + load_dotenv(_CLI_ENV_FILE) load_dotenv() class Settings(BaseSettings): model_config = SettingsConfigDict( - env_file=".env", + # Later entries take precedence: a CWD-local .env can override the + # anchored cli/.env; actual env vars override both. + env_file=(_CLI_ENV_FILE, ".env"), env_file_encoding="utf-8", extra="ignore", case_sensitive=False, + # Aliased fields (e.g. 
target_url ← CONTRACTOR_TARGET_URL) stay + # constructible by field name too (tests, programmatic overrides). + populate_by_name=True, ) # ── LLM (LiteLLM proxy) ────────────────────────────────────────────── @@ -60,6 +75,11 @@ class Settings(BaseSettings): # Default per-read line cap when read_file is called without an explicit # `limit`. None disables the line cap (byte cap only). fs_max_read_lines: int | None = Field(default=2000) + # Hard cap on files scanned by a single fs glob/grep tree walk + # (env: FS_MAX_FILES_PER_WALK). When the ceiling is hit the walk stops + # early and the tool output carries a truncation notice — mirrors + # code_max_files_per_walk for the code-tools walker. + fs_max_files_per_walk: int = Field(default=100_000) # Cumulative char budget for retained heavy-tool function results in the # FunctionResultsRemovalCallback (env: FS_HEAVY_KEEP_BUDGET_CHARS). When > 0, # large/stale heavy-tool results are elided once the running total of kept @@ -67,6 +87,12 @@ class Settings(BaseSettings): # (keep_last_n) is not yet reached. Default 0 disables the budget axis, so # retention stays count-only (historical behaviour). fs_heavy_keep_budget_chars: int = Field(default=0) + # Override the count cap (keep_last_n) for retained heavy-tool results in the + # FunctionResultsRemovalCallback (env: FS_HEAVY_KEEP_LAST_N). When > 0 it + # *overrides* the caller's elide_keep_last_n (e.g. set very high to + # effectively disable count-based elision for an experiment). Default 0 means + # "unset — use the caller's value" (historical behaviour, typically 15). 
+ fs_heavy_keep_last_n: int = Field(default=0) code_max_walk_depth: int = Field(default=50) code_max_files_per_walk: int = Field(default=100_000) graph_max_results: int = Field(default=200) @@ -80,6 +106,13 @@ class Settings(BaseSettings): langfuse_public_key: str | None = Field(default=None) langfuse_secret_key: str | None = Field(default=None) + # ── Live target (exploitability / vuln workflows) ──────────────────── + # Base URL of the running target app probed by the exploit stage, and an + # optional outbound HTTP proxy for the agent's requests. Aliased so the + # historical CONTRACTOR_-prefixed env vars keep working. + target_url: str | None = Field(default=None, alias="CONTRACTOR_TARGET_URL") + proxy: str | None = Field(default=None, alias="CONTRACTOR_PROXY") + # ── Caido proxy ───────────────────────────────────────────────────── caido_url: str | None = Field(default=None) caido_auth_token: str | None = Field(default=None) diff --git a/contractor/workflows/__init__.py b/contractor/workflows/__init__.py index db2026a..8edfcda 100644 --- a/contractor/workflows/__init__.py +++ b/contractor/workflows/__init__.py @@ -165,11 +165,13 @@ def get_workflows() -> dict[str, type[Workflow]]: from .trace_annotation_direct import TraceAnnotationDirectWorkflow from .trace_graph import TraceGraphWorkflow from .trace_graph_pathpar import TraceGraphPathParWorkflow + from .trace_postdiff import TracePostDiffWorkflow from .trace_verify import TraceVerifyWorkflow from .vuln_assess import VulnAssessWorkflow from .vuln_scan import VulnScanWorkflow from .vuln_scan_fast import VulnScanFastWorkflow from .vuln_scan_trace import VulnScanTraceWorkflow + from .vuln_sweep import VulnSweepWorkflow return { "oas_build": OasBuildingWorkflow, @@ -180,11 +182,13 @@ def get_workflows() -> dict[str, type[Workflow]]: "trace-direct": TraceAnnotationDirectWorkflow, "trace-graph": TraceGraphWorkflow, "trace-graph-pathpar": TraceGraphPathParWorkflow, + "trace-postdiff": TracePostDiffWorkflow, 
"trace-verify": TraceVerifyWorkflow, "vuln-assess": VulnAssessWorkflow, "vuln-scan": VulnScanWorkflow, "vuln-scan-fast": VulnScanFastWorkflow, "vuln-scan-trace": VulnScanTraceWorkflow, + "vuln-sweep": VulnSweepWorkflow, "router": RouterWorkflow, } diff --git a/contractor/workflows/exploitability/workflow.py b/contractor/workflows/exploitability/workflow.py index b648e64..c0f32e0 100644 --- a/contractor/workflows/exploitability/workflow.py +++ b/contractor/workflows/exploitability/workflow.py @@ -13,7 +13,6 @@ import hashlib import logging -import os import time from functools import partial from typing import Any @@ -22,12 +21,14 @@ from google.genai import types from contractor.agents.exploitability_agent.agent import build_exploitability_agent +from contractor.runners.artifacts import artifact_key_slug from contractor.runners.task_runner import TaskRunner, TaskRunnerEventHandler from contractor.tools.caido import CaidoClient, CaidoTools from contractor.tools.http import REQUEST_TAG_HEADER -from contractor.utils.settings import build_model +from contractor.utils.settings import build_model, get_settings from contractor.workflows import Workflow, WorkflowContext, persist_seed_artifact from contractor.workflows.config import WorkflowConfig +from contractor.workflows.findings import load_findings_artifact CFG = WorkflowConfig.load(__file__) @@ -313,15 +314,16 @@ class ExploitabilityWorkflow(Workflow): def __init__(self, ctx: WorkflowContext) -> None: super().__init__(ctx) self.llm = build_model(ctx.model, ctx.timeout) - self.target_base_url = os.environ.get("CONTRACTOR_TARGET_URL", "") + + _settings = get_settings() + self.target_base_url = _settings.target_url or "" if not self.target_base_url: raise ValueError( - "CONTRACTOR_TARGET_URL must be set (e.g. http://localhost:5002)" + "No live target configured: set the CONTRACTOR_TARGET_URL " + "environment variable (or cli/.env entry) to the base URL of " + "the running target application, e.g. 
http://localhost:5002" ) - self.proxy = os.environ.get("CONTRACTOR_PROXY") or None - - from contractor.utils.settings import get_settings - _settings = get_settings() + self.proxy = _settings.proxy or None self.caido_url = _settings.caido_url self.caido_auth_token = _settings.caido_auth_token @@ -389,6 +391,12 @@ async def _assess_finding( runner.add_task( name="exploitability_assessment", ref=f"exploitability:{finding_name}", + # Unique, stable per-finding publish key — one assessment task is + # queued per finding and the shared template key would make each + # finding overwrite the previous one's artifacts. + artifact_key=( + f"exploitability_assessment/{artifact_key_slug(finding_name)}" + ), worker_builder=agent_builder, **CFG.tasks.assess.as_kwargs(), namespace=source_namespace, @@ -447,29 +455,9 @@ async def _load_findings( *, user_id: str, ) -> list[dict[str, Any]]: - artifact_key = _SEED_ARTIFACT - part = await self.ctx.artifact_service.load_artifact( + return await load_findings_artifact( + self.ctx.artifact_service, app_name=self.ctx.app_name, user_id=user_id, - filename=artifact_key, + filename=_SEED_ARTIFACT, ) - if part is None or not part.text: - return [] - try: - raw = yaml.safe_load(part.text) or {} - except yaml.YAMLError as exc: - logger.warning( - "could not parse %s as YAML: %s — skipping", artifact_key, exc - ) - return [] - if not isinstance(raw, dict): - return [] - - findings: list[dict[str, Any]] = [] - for name, item in raw.items(): - if not isinstance(item, dict): - continue - entry = dict(item) - entry.setdefault("name", name) - findings.append(entry) - return findings diff --git a/contractor/workflows/findings.py b/contractor/workflows/findings.py new file mode 100644 index 0000000..54b59d7 --- /dev/null +++ b/contractor/workflows/findings.py @@ -0,0 +1,75 @@ +"""Shared loaders for YAML findings artifacts. 
+ +Several workflows persist findings as a ``name -> fields`` YAML mapping +(``user:vulnerability-reports/...`` and friends) and re-load them for a +downstream stage. The load + parse + reshape steps live here so the read +side cannot drift between workflows. +""" + +from __future__ import annotations + +import logging +from typing import Any + +import yaml +from google.adk.artifacts import BaseArtifactService + +logger = logging.getLogger(__name__) + + +async def load_yaml_dict_artifact( + artifact_service: BaseArtifactService, + *, + app_name: str, + user_id: str, + filename: str, +) -> dict[str, Any]: + """Load an artifact's text as a YAML mapping. + + Returns ``{}`` when the artifact is missing, empty, unparseable + (logged as a warning), or parses to something other than a mapping. + """ + part = await artifact_service.load_artifact( + app_name=app_name, + user_id=user_id, + filename=filename, + ) + if part is None or not getattr(part, "text", None): + return {} + try: + raw = yaml.safe_load(part.text or "") or {} + except yaml.YAMLError as exc: + logger.warning("could not parse %s as YAML: %s — skipping", filename, exc) + return {} + if not isinstance(raw, dict): + return {} + return raw + + +async def load_findings_artifact( + artifact_service: BaseArtifactService, + *, + app_name: str, + user_id: str, + filename: str, +) -> list[dict[str, Any]]: + """Load a ``name -> fields`` findings artifact as a list of dicts. + + Each mapping value becomes one finding dict with its key backfilled + under ``"name"`` (an explicit ``name`` field wins). Non-mapping + values are dropped. Returns ``[]`` on any load/parse failure. 
+ """ + raw = await load_yaml_dict_artifact( + artifact_service, + app_name=app_name, + user_id=user_id, + filename=filename, + ) + findings: list[dict[str, Any]] = [] + for name, item in raw.items(): + if not isinstance(item, dict): + continue + entry = dict(item) + entry.setdefault("name", name) + findings.append(entry) + return findings diff --git a/contractor/workflows/likec4_building/workflow.py b/contractor/workflows/likec4_building/workflow.py index 958603f..cb7d760 100644 --- a/contractor/workflows/likec4_building/workflow.py +++ b/contractor/workflows/likec4_building/workflow.py @@ -52,8 +52,11 @@ async def _run_impl( on_event: TaskRunnerEventHandler | None, ) -> Any: ctx = self.ctx + # The runner name doubles as the ADK app_name; keep it equal to + # ctx.app_name so artifact_exists() skip-checks and CLI export probe + # the same scope the tasks publish under. runner = TaskRunner( - name="likec4_builder", + name=ctx.app_name, artifact_service=ctx.artifact_service, checkpoint_path=ctx.checkpoint_path, observations=CFG.observations, @@ -88,6 +91,10 @@ async def _run_impl( ): runner.add_task( name="dependency_information", + # Stable explicit refs: the default positional ref + # (`{name}:{len(queue)}`) shifts between runs when an upstream + # task is conditionally skipped, breaking --resume checkpoints. 
+ ref="dependency_information", worker_builder=swe_builder, **CFG.tasks.dependency_information.as_kwargs(), namespace="dependency_information", @@ -101,6 +108,7 @@ async def _run_impl( ): runner.add_task( name="project_information", + ref="project_information", worker_builder=swe_builder, **CFG.tasks.project_information.as_kwargs(), artifacts=["dependency_information/result"], @@ -112,6 +120,7 @@ async def _run_impl( runner.add_task( name="likec4_build", + ref="likec4_build", worker_builder=likec4_builder, **CFG.tasks.likec4_build.as_kwargs(), artifacts=[ @@ -124,6 +133,7 @@ async def _run_impl( runner.add_task( name="likec4_validate", + ref="likec4_validate", worker_builder=likec4_builder, **CFG.tasks.likec4_validate.as_kwargs(), artifacts=[ diff --git a/contractor/workflows/namespaces.py b/contractor/workflows/namespaces.py new file mode 100644 index 0000000..0f98b84 --- /dev/null +++ b/contractor/workflows/namespaces.py @@ -0,0 +1,37 @@ +"""Shared memory-namespace prefixes for the trace-annotation workflow family. + +Every trace producer writes its per-path memories and vulnerability reports +under ``{prefix}:{namespace}:{path_key}`` (so the report artifact lands at +``user:vulnerability-reports/{prefix}:{namespace}:{path_key}``). Consumers — +``trace-verify``, ``vuln-assess`` — must read with the *same* prefix the +producer wrote with. These constants exist so the read and write keys cannot +drift apart (audit precedent: vuln_assess once read ``trace-annotation:`` +while the trace stage wrote ``trace-graph-pathpar:`` and silently saw nothing; +see tests/units/contractor_tests/workflows/test_vuln_assess_namespace.py). +""" + +from __future__ import annotations + +# ``trace`` (TraceAnnotationWorkflow) and ``trace-direct`` +# (TraceAnnotationDirectWorkflow) share one prefix. +TRACE_ANNOTATION_NAMESPACE_PREFIX: str = "trace-annotation" + +# ``trace-graph`` (TraceGraphWorkflow) — the production default. 
+TRACE_GRAPH_NAMESPACE_PREFIX: str = "trace-graph" + +# ``trace-graph-pathpar`` (TraceGraphPathParWorkflow). +TRACE_GRAPH_PATHPAR_NAMESPACE_PREFIX: str = "trace-graph-pathpar" + +# ``trace-postdiff`` (TracePostDiffWorkflow) — annotate-only trace stage +# followed by a post-diff analytics stage; the analytics agent is the +# finding producer for this prefix. +TRACE_POSTDIFF_NAMESPACE_PREFIX: str = "trace-postdiff" + +# Every prefix a trace producer may have written findings under, in the +# order consumers should probe them. +TRACE_NAMESPACE_PREFIXES: tuple[str, ...] = ( + TRACE_ANNOTATION_NAMESPACE_PREFIX, + TRACE_GRAPH_NAMESPACE_PREFIX, + TRACE_GRAPH_PATHPAR_NAMESPACE_PREFIX, + TRACE_POSTDIFF_NAMESPACE_PREFIX, +) diff --git a/contractor/workflows/oas_building/workflow.py b/contractor/workflows/oas_building/workflow.py index 6a09fb4..0a1e258 100644 --- a/contractor/workflows/oas_building/workflow.py +++ b/contractor/workflows/oas_building/workflow.py @@ -20,8 +20,11 @@ async def _run_impl( on_event: TaskRunnerEventHandler | None, ) -> Any: ctx = self.ctx + # The runner name doubles as the ADK app_name; keep it equal to + # ctx.app_name so artifact_exists() skip-checks and CLI export probe + # the same scope the tasks publish under. runner = TaskRunner( - name="oas_builder", + name=ctx.app_name, artifact_service=ctx.artifact_service, checkpoint_path=ctx.checkpoint_path, observations=CFG.observations, @@ -61,6 +64,10 @@ async def _run_impl( ): runner.add_task( name="dependency_information", + # Stable explicit refs: the default positional ref + # (`{name}:{len(queue)}`) shifts between runs when an upstream + # task is conditionally skipped, breaking --resume checkpoints. 
+ ref="dependency_information", worker_builder=swe_builder, **CFG.tasks.dependency_information.as_kwargs(), namespace="dependency_information", @@ -74,6 +81,7 @@ async def _run_impl( ): runner.add_task( name="project_information", + ref="project_information", worker_builder=swe_builder, **CFG.tasks.project_information.as_kwargs(), artifacts=["dependency_information/result"], @@ -85,6 +93,7 @@ async def _run_impl( runner.add_task( name="oas_update", + ref="oas_update", worker_builder=oas_builder, **CFG.tasks.oas_update.as_kwargs(), artifacts=[ @@ -97,6 +106,7 @@ async def _run_impl( runner.add_task( name="oas_validate", + ref="oas_validate", worker_builder=oas_linter, **CFG.tasks.oas_validate.as_kwargs(), artifacts=[ diff --git a/contractor/workflows/oas_enrichment/workflow.py b/contractor/workflows/oas_enrichment/workflow.py index 6cc571d..fa57dc7 100644 --- a/contractor/workflows/oas_enrichment/workflow.py +++ b/contractor/workflows/oas_enrichment/workflow.py @@ -19,8 +19,11 @@ async def _run_impl( on_event: TaskRunnerEventHandler | None, ) -> Any: ctx = self.ctx + # The runner name doubles as the ADK app_name; keep it equal to + # ctx.app_name so artifact_exists() skip-checks and CLI export probe + # the same scope the tasks publish under. runner = TaskRunner( - name="oas_builder", + name=ctx.app_name, artifact_service=ctx.artifact_service, checkpoint_path=ctx.checkpoint_path, observations=CFG.observations, diff --git a/contractor/workflows/path_groups.py b/contractor/workflows/path_groups.py new file mode 100644 index 0000000..527580b --- /dev/null +++ b/contractor/workflows/path_groups.py @@ -0,0 +1,70 @@ +"""Router-prefix grouping of OpenAPI paths for trace coverage budgeting. 
+ +Per-path fan-out starves large APIs: every path gets a fresh agent with a +fresh memory namespace, so sibling handlers behind the same router prefix +(shared middleware, shared auth, shared serializers) are re-discovered +from scratch N times and the per-path token budget is spent on repeated +navigation. Grouping by route prefix makes the *group* the unit of +memory, skill injection, and (in pathpar) fork/concurrency scheduling: +paths under ``/workshop/...`` share one namespace and one budget, so +context discovered for one sibling carries to the next. + +``depth`` is the number of leading path segments that define a group; +``depth <= 0`` means one group per path (the pre-grouping behavior — +group key == ``path_key``). Group keys are normalized exactly like +``OpenApiPath.path_key`` so a single-path group at full depth keeps its +historical namespace key. +""" + +from __future__ import annotations + +from dataclasses import dataclass + +from contractor.workflows.trace_annotation import OpenApiOperation, OpenApiPath + + +def group_key_for_path(path: str, depth: int) -> str: + """Group key for ``path`` at ``depth`` leading segments. + + Normalization mirrors ``OpenApiPath.path_key``: segments joined with + ``_``, parameter braces stripped, empty key collapsed to ``root``. + ``depth <= 0`` returns the full-path key. + """ + segments = [s for s in path.strip("/").split("/") if s] + if depth > 0: + segments = segments[:depth] + key = "_".join(segments).replace("{", "").replace("}", "") + return key or "root" + + +@dataclass(frozen=True) +class PathGroup: + """A set of OpenAPI paths sharing a route prefix.""" + + key: str + paths: tuple[OpenApiPath, ...] + + @property + def operations(self) -> list[OpenApiOperation]: + return [op for path in self.paths for op in path.operations] + + +def group_paths_by_prefix( + paths: list[OpenApiPath], + *, + depth: int = 1, +) -> list[PathGroup]: + """Group ``paths`` by their first ``depth`` route segments. 
+ + ``depth <= 0`` yields one group per path keyed by ``path_key`` — + byte-identical namespaces to the historical per-path behavior. + Group order follows first appearance; path order within a group is + preserved. + """ + if depth <= 0: + return [PathGroup(key=p.path_key, paths=(p,)) for p in paths] + + grouped: dict[str, list[OpenApiPath]] = {} + for path in paths: + grouped.setdefault(group_key_for_path(path.path, depth), []).append(path) + return [PathGroup(key=key, paths=tuple(ps)) for key, ps in grouped.items()] diff --git a/contractor/workflows/trace_annotation/workflow.py b/contractor/workflows/trace_annotation/workflow.py index 18e2676..05e8a63 100644 --- a/contractor/workflows/trace_annotation/workflow.py +++ b/contractor/workflows/trace_annotation/workflow.py @@ -8,17 +8,18 @@ from google.genai import types from contractor.agents.trace_agent.agent import build_trace_agent +from contractor.runners.artifacts import artifact_key_slug from contractor.runners.task_runner import TaskRunner, TaskRunnerEventHandler from contractor.tools.fs import MemoryOverlayFileSystem from contractor.tools.openapi import resolve_refs from contractor.utils.settings import build_model from contractor.workflows import Workflow, WorkflowContext, persist_seed_artifact from contractor.workflows.config import WorkflowConfig +from contractor.workflows.namespaces import TRACE_ANNOTATION_NAMESPACE_PREFIX CFG = WorkflowConfig.load(__file__) logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) @dataclass @@ -202,7 +203,9 @@ async def _run_path_analysis( # bodies into one ever-growing store (O(n²) context across paths), which # was the dominant cost on large specs; skills are re-injected per path # via add_task(skills=...). 
- workflow_namespace = f"trace-annotation:{self.namespace}:{api_path.path_key}" + workflow_namespace = ( + f"{TRACE_ANNOTATION_NAMESPACE_PREFIX}:{self.namespace}:{api_path.path_key}" + ) runner.add_variable(name="operation_id", value=operation_ids) runner.add_variable(name="operation_schema", value=operation_schema_yaml) @@ -210,6 +213,13 @@ async def _run_path_analysis( runner.add_task( name="trace_annotation", ref=f"trace_annotation:{self.namespace}:{api_path.path_key}", + # One task per path under the same template key — give each a + # stable per-path publish key so results don't overwrite each + # other (mirrors trace_verify / vuln_scan_trace). + artifact_key=( + f"trace_annotation/{artifact_key_slug(self.namespace)}/" + f"{artifact_key_slug(api_path.path_key)}" + ), worker_builder=trace_builder, **CFG.tasks.annotate.as_kwargs(), artifacts=[], diff --git a/contractor/workflows/trace_annotation_direct/config.yaml b/contractor/workflows/trace_annotation_direct/config.yaml index e889b79..75fa8e6 100644 --- a/contractor/workflows/trace_annotation_direct/config.yaml +++ b/contractor/workflows/trace_annotation_direct/config.yaml @@ -1,4 +1,6 @@ budgets: max_tokens: 100000 agents: - trace_agent: { with_graph_tools: true, output_format: json } + # Prompt-only baseline: graph tools stay off so trace-direct remains the + # ablation counterpart of trace-graph (which differs only in this knob). 
+ trace_agent: { with_graph_tools: false } diff --git a/contractor/workflows/trace_annotation_direct/workflow.py b/contractor/workflows/trace_annotation_direct/workflow.py index c96243c..792f391 100644 --- a/contractor/workflows/trace_annotation_direct/workflow.py +++ b/contractor/workflows/trace_annotation_direct/workflow.py @@ -16,6 +16,7 @@ from contractor.utils.settings import build_model from contractor.workflows import Workflow, WorkflowContext, persist_seed_artifact from contractor.workflows.config import WorkflowConfig +from contractor.workflows.namespaces import TRACE_ANNOTATION_NAMESPACE_PREFIX from contractor.workflows.trace_annotation import ( OpenApiOperation, OpenApiPath, @@ -25,7 +26,6 @@ CFG = WorkflowConfig.load(__file__) logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) TRACE_TASK_TEMPLATE: str = "trace_annotation" @@ -114,7 +114,9 @@ async def _run_path_analysis( user_id: str = "cli-user", on_event: TaskRunnerEventHandler | None = None, ) -> None: - path_namespace = f"trace-annotation:{self.namespace}:{api_path.path_key}" + path_namespace = ( + f"{TRACE_ANNOTATION_NAMESPACE_PREFIX}:{self.namespace}:{api_path.path_key}" + ) base_variables: dict[str, Any] = {"project_path": self.ctx.folder_name} await inject_skills( diff --git a/contractor/workflows/trace_graph/config.yaml b/contractor/workflows/trace_graph/config.yaml index e889b79..908cd98 100644 --- a/contractor/workflows/trace_graph/config.yaml +++ b/contractor/workflows/trace_graph/config.yaml @@ -1,4 +1,4 @@ budgets: max_tokens: 100000 agents: - trace_agent: { with_graph_tools: true, output_format: json } + trace_agent: { with_graph_tools: true } diff --git a/contractor/workflows/trace_graph/workflow.py b/contractor/workflows/trace_graph/workflow.py index 1846433..971e789 100644 --- a/contractor/workflows/trace_graph/workflow.py +++ b/contractor/workflows/trace_graph/workflow.py @@ -29,6 +29,7 @@ from contractor.utils.settings import build_model from contractor.workflows 
import Workflow, WorkflowContext, persist_seed_artifact from contractor.workflows.config import WorkflowConfig +from contractor.workflows.namespaces import TRACE_GRAPH_NAMESPACE_PREFIX from contractor.workflows.trace_annotation import ( OpenApiOperation, OpenApiPath, @@ -38,7 +39,6 @@ CFG = WorkflowConfig.load(__file__) logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) TRACE_TASK_TEMPLATE: str = "trace_annotation" @@ -128,7 +128,9 @@ async def _run_path_analysis( user_id: str = "cli-user", on_event: TaskRunnerEventHandler | None = None, ) -> None: - path_namespace = f"trace-graph:{self.namespace}:{api_path.path_key}" + path_namespace = ( + f"{TRACE_GRAPH_NAMESPACE_PREFIX}:{self.namespace}:{api_path.path_key}" + ) base_variables: dict[str, Any] = {"project_path": self.ctx.folder_name} await inject_skills( diff --git a/contractor/workflows/trace_graph_pathpar/config.yaml b/contractor/workflows/trace_graph_pathpar/config.yaml index b633fa3..c7cfcf9 100644 --- a/contractor/workflows/trace_graph_pathpar/config.yaml +++ b/contractor/workflows/trace_graph_pathpar/config.yaml @@ -1,3 +1,7 @@ budgets: max_tokens: 100000 max_concurrency: 3 + # Route segments that define a fork/coverage group; 0 = one fork per + # path (historical behavior). Raise to 1 on large APIs so sibling + # paths share a namespace and budget. 
+ group_depth: 0 diff --git a/contractor/workflows/trace_graph_pathpar/workflow.py b/contractor/workflows/trace_graph_pathpar/workflow.py index 60f2b96..c2a0471 100644 --- a/contractor/workflows/trace_graph_pathpar/workflow.py +++ b/contractor/workflows/trace_graph_pathpar/workflow.py @@ -32,6 +32,8 @@ from contractor.utils.settings import build_model from contractor.workflows import Workflow, WorkflowContext, persist_seed_artifact from contractor.workflows.config import WorkflowConfig +from contractor.workflows.namespaces import TRACE_GRAPH_PATHPAR_NAMESPACE_PREFIX +from contractor.workflows.path_groups import PathGroup, group_paths_by_prefix from contractor.workflows.trace_annotation import ( OpenApiOperation, OpenApiPath, @@ -41,14 +43,14 @@ CFG = WorkflowConfig.load(__file__) logger = logging.getLogger(__name__) -logger.setLevel(logging.DEBUG) TRACE_TASK_TEMPLATE: str = "trace_annotation" # Per-path namespace prefix used for this workflow's trace artifacts and -# vulnerability reports. Shared with vuln_assess._collect_vuln_reports so the -# write key (here) and the read key (there) cannot drift apart. -PATH_NAMESPACE_PREFIX: str = "trace-graph-pathpar" +# vulnerability reports. Shared (via contractor.workflows.namespaces) with +# vuln_assess._collect_vuln_reports and trace_verify so the write key (here) +# and the read keys (there) cannot drift apart. +PATH_NAMESPACE_PREFIX: str = TRACE_GRAPH_PATHPAR_NAMESPACE_PREFIX class TraceGraphPathParWorkflow(Workflow): @@ -110,74 +112,93 @@ async def _run_impl( # so the tool closures are safe to share across parallel forks. self._shared_graph_tools = attach_graph_tools_if_local(self.overlayfs) + # Group by route prefix — the group is the fork/concurrency unit and + # the memory-namespace unit, so sibling paths share context and a + # budget instead of competing as N independent runs. group_depth=0 + # keeps the historical one-fork-per-path behavior. 
+ groups = group_paths_by_prefix( + self.paths, depth=CFG.budgets.group_depth + ) + # Snapshot pre-fork state for merge. pre_fork_patch = self.overlayfs.save() pre_fork_files = dict(self.overlayfs._files) - forks = [fork_overlay(self.fs, pre_fork_patch) for _ in self.paths] + forks = [fork_overlay(self.fs, pre_fork_patch) for _ in groups] sem = asyncio.Semaphore(self.max_concurrency) logger.info( - "Running %d paths in parallel (max_concurrency=%d)", + "Running %d route group(s) covering %d path(s) in parallel " + "(max_concurrency=%d, group_depth=%d)", + len(groups), len(self.paths), self.max_concurrency, + CFG.budgets.group_depth, ) - async def _run_path(api_path: OpenApiPath, overlay: MemoryOverlayFileSystem) -> None: + async def _run_group(group: PathGroup, overlay: MemoryOverlayFileSystem) -> None: async with sem: runner = AgentRunner( name=ctx.app_name, artifact_service=ctx.artifact_service, ) - await self._run_path_analysis( - api_path=api_path, + await self._run_group_analysis( + group=group, overlay=overlay, runner=runner, user_id=user_id, on_event=on_event, ) - async with asyncio.TaskGroup() as tg: - for api_path, overlay in zip(self.paths, forks, strict=False): - tg.create_task(_run_path(api_path, overlay)) - - conflicts = merge_overlay_forks(self.overlayfs, forks, pre_fork_files) - if conflicts: - logger.warning( - "Overlay merge produced %d conflicting files: %s", - len(conflicts), - conflicts, - ) + # Merge + persist in a `finally` (the trace_annotation `_cleanup` + # precedent — done inline here because vuln_assess drives this + # workflow via `_run_impl` directly, bypassing `run()`/`_cleanup`): + # a single failed path makes the TaskGroup cancel its siblings and + # re-raise, and without the `finally` every already-completed path's + # annotations were lost. On the happy path this block is the one and + # only merge/save, so nothing runs twice. 
+ try: + async with asyncio.TaskGroup() as tg: + for group, overlay in zip(groups, forks, strict=False): + tg.create_task(_run_group(group, overlay)) + finally: + conflicts = merge_overlay_forks(self.overlayfs, forks, pre_fork_files) + if conflicts: + logger.warning( + "Overlay merge produced %d conflicting files: %s", + len(conflicts), + conflicts, + ) - await self._save_overlay_artifacts(user_id) + await self._save_overlay_artifacts(user_id) - # ── per-path orchestration (operations sequential) ──────────────── + # ── per-group orchestration (operations sequential) ─────────────── - async def _run_path_analysis( + async def _run_group_analysis( self, *, - api_path: OpenApiPath, + group: PathGroup, overlay: MemoryOverlayFileSystem, runner: AgentRunner, user_id: str, on_event: TaskRunnerEventHandler | None, ) -> None: - path_namespace = f"{PATH_NAMESPACE_PREFIX}:{self.namespace}:{api_path.path_key}" + group_namespace = f"{PATH_NAMESPACE_PREFIX}:{self.namespace}:{group.key}" base_variables: dict[str, Any] = {"project_path": self.ctx.folder_name} await inject_skills( ["trace"], - namespace=path_namespace, + namespace=group_namespace, artifact_service=self.ctx.artifact_service, app_name=self.ctx.app_name, user_id=user_id, ) - for idx, operation in enumerate(api_path.operations): + for idx, operation in enumerate(group.operations): await self._run_operation_trace( operation=operation, idx=idx, - namespace=path_namespace, + namespace=group_namespace, overlay=overlay, runner=runner, base_variables=base_variables, diff --git a/contractor/workflows/trace_postdiff/__init__.py b/contractor/workflows/trace_postdiff/__init__.py new file mode 100644 index 0000000..ed5dbc6 --- /dev/null +++ b/contractor/workflows/trace_postdiff/__init__.py @@ -0,0 +1,3 @@ +from contractor.workflows.trace_postdiff.workflow import TracePostDiffWorkflow + +__all__ = ["TracePostDiffWorkflow"] diff --git a/contractor/workflows/trace_postdiff/config.yaml 
b/contractor/workflows/trace_postdiff/config.yaml new file mode 100644 index 0000000..a108428 --- /dev/null +++ b/contractor/workflows/trace_postdiff/config.yaml @@ -0,0 +1,10 @@ +budgets: + max_tokens: 100000 + analytics_max_tokens: 100000 + analytics_diff_max_chars: 60000 + # Route segments that define a coverage group (memory namespace, skill + # injection, analytics unit). 0 = one group per path. + group_depth: 1 +agents: + trace_agent: { with_graph_tools: true } + vuln_analytics_agent: { with_graph_tools: true } diff --git a/contractor/workflows/trace_postdiff/workflow.py b/contractor/workflows/trace_postdiff/workflow.py new file mode 100644 index 0000000..e84fd17 --- /dev/null +++ b/contractor/workflows/trace_postdiff/workflow.py @@ -0,0 +1,377 @@ +"""Two-stage trace workflow: annotate-only trace, then post-diff analytics. + +Stage A runs the per-operation ``trace_agent`` loop exactly like +``trace-graph``, but with vulnerability reporting disabled — the agent's +whole job is to drive ``@trace`` / ``@validate`` / ``@sink`` annotations +onto the execution paths (navigation). + +Stage B runs once per path: ``vuln_analytics_agent`` receives the +annotation diff produced during that path's trace runs and judges the +annotated flows against the finding-shape taxonomy (judgement), +persisting supported findings via ``report_vulnerability`` under the +``trace-postdiff:{namespace}:{path_key}`` namespace — so ``trace-verify`` +and ``vuln-assess`` pick them up through the shared prefix registry. + +The split targets small models: a single agent asked to both navigate +*and* judge tends to do neither well; here each stage does one job. +A/B against the single-stage ``trace-graph`` on the same fixture. 
+""" + +from __future__ import annotations + +import json +import logging +from typing import Any, cast +from uuid import uuid4 + +import yaml +from google.genai import types + +from contractor.agents.trace_agent.agent import TraceFormat, build_trace_agent +from contractor.agents.vuln_analytics_agent.agent import ( + AnalyticsFormat, + build_vuln_analytics_agent, +) +from contractor.runners.agent_runner import AgentRunner +from contractor.runners.models import RenderedTask, TaskRunnerEventHandler, TaskTemplate +from contractor.runners.plugins.metrics_plugin import AdkMetricsPlugin +from contractor.runners.plugins.trace_plugin import AdkTracePlugin +from contractor.runners.skills import inject_skills +from contractor.tools.fs import MemoryOverlayFileSystem +from contractor.utils.settings import build_model +from contractor.workflows import Workflow, WorkflowContext, persist_seed_artifact +from contractor.workflows.config import WorkflowConfig +from contractor.workflows.namespaces import TRACE_POSTDIFF_NAMESPACE_PREFIX +from contractor.workflows.path_groups import PathGroup, group_paths_by_prefix +from contractor.workflows.trace_annotation import ( + OpenApiOperation, + OpenApiPath, + extract_openapi_paths, +) + +CFG = WorkflowConfig.load(__file__) + +logger = logging.getLogger(__name__) + +TRACE_TASK_TEMPLATE: str = "trace_annotation" +ANALYTICS_TASK_TEMPLATE: str = "vuln_analytics" + +_DIFF_HEADER_PREFIX = "diff --overlay a" +_DIFF_TRUNCATION_MARKER = ( + "\n... [diff truncated — read the annotated files directly for the rest]" +) + + +def _diff_header_path(line: str) -> str | None: + """Parse the file path out of a ``diff --overlay a{path} b{path}`` header. + + The path appears twice, so its length is fixed by the header length — + no delimiter guessing even for paths containing spaces. 
+ """ + if not line.startswith(_DIFF_HEADER_PREFIX): + return None + rest = line[len(_DIFF_HEADER_PREFIX) :] + half = (len(rest) - 2) // 2 + if half > 0 and rest[half : half + 2] == " b" and rest[:half] == rest[half + 2 :]: + return rest[:half] + # Malformed / unexpected header — fall back to the first " b" split. + return rest.split(" b", 1)[0] or None + + +def filter_diff_by_files(diff_text: str, files: set[str]) -> str: + """Keep only the per-file chunks of an overlay diff whose path is in + ``files``. Chunks are delimited by ``diff --overlay`` headers.""" + if not diff_text or not files: + return "" + keep = False + out: list[str] = [] + for line in diff_text.splitlines(): + path = _diff_header_path(line) + if path is not None: + keep = path in files + if keep: + out.append(line) + return "\n".join(out) + + +def truncate_diff(diff_text: str, max_chars: int) -> str: + if len(diff_text) <= max_chars: + return diff_text + return diff_text[:max_chars] + _DIFF_TRUNCATION_MARKER + + +class TracePostDiffWorkflow(Workflow): + """Annotate-only trace stage + per-path post-diff analytics stage.""" + + namespace: str = "openapi" + + def __init__(self, ctx: WorkflowContext) -> None: + super().__init__(ctx) + self.llm = build_model(ctx.model, ctx.timeout) + self.fs = ctx.fs + self.overlayfs = MemoryOverlayFileSystem(fs=self.fs) + self.paths: list[OpenApiPath] = [] + self._template = TaskTemplate.load(TRACE_TASK_TEMPLATE) + self._analytics_template = TaskTemplate.load(ANALYTICS_TASK_TEMPLATE) + self._runner = AgentRunner( + name=ctx.app_name, + artifact_service=ctx.artifact_service, + ) + + async def _run_impl( + self, + *, + user_id: str, + on_event: TaskRunnerEventHandler | None, + ) -> Any: + ctx = self.ctx + await persist_seed_artifact(ctx, filename="oas-openapi-building") + + raw = await ctx.artifact_service.load_artifact( + app_name=ctx.app_name, + user_id=user_id, + filename=f"oas-{self.namespace}-building", + ) + if not raw: + raise ValueError("No OpenAPI artifact 
found") + + openapi = yaml.safe_load(raw.text or "") + self.paths = extract_openapi_paths(openapi=openapi) + + # Group by route prefix: the group is the unit of memory namespace, + # skill injection, and the analytics stage, so sibling paths share + # discovered context instead of re-navigating from scratch. + groups = group_paths_by_prefix( + self.paths, depth=CFG.budgets.group_depth + ) + logger.info( + "trace-postdiff: %d paths in %d route group(s) (group_depth=%d)", + len(self.paths), + len(groups), + CFG.budgets.group_depth, + ) + + for group in groups: + fs_state_artifact = await ctx.artifact_service.load_artifact( + app_name=ctx.app_name, + user_id=user_id, + filename=f"trace-{self.namespace}-fs", + ) + if fs_state_artifact: + self.overlayfs.load(json.loads(fs_state_artifact.text or "{}")) + + await self._run_group_analysis( + group, + user_id=user_id, + on_event=on_event, + ) + + await self._save_overlay_artifacts(user_id) + + async def _save_overlay_artifacts(self, user_id: str) -> None: + ctx = self.ctx + await ctx.artifact_service.save_artifact( + app_name=ctx.app_name, + user_id=user_id, + filename=f"trace-{self.namespace}-fs", + artifact=types.Part.from_text(text=json.dumps(self.overlayfs.save())), + ) + await ctx.artifact_service.save_artifact( + app_name=ctx.app_name, + user_id=user_id, + filename=f"trace-{self.namespace}-diff", + artifact=types.Part.from_text( + text=self.overlayfs.diff(context_lines=4) + ), + ) + + def _changed_since(self, before: dict[str, bytes]) -> set[str]: + """Paths whose overlay content was added or modified after the + ``before`` snapshot. Deletions are ignored — the annotate-only + stage adds comments; a deletion carries no annotated flow to + analyze. 
(``_files`` access mirrors trace_graph_pathpar's + pre-fork snapshot.)""" + return { + path + for path, content in self.overlayfs._files.items() # noqa: SLF001 + if before.get(path) != content + } + + async def _run_group_analysis( + self, + group: PathGroup, + *, + user_id: str = "cli-user", + on_event: TaskRunnerEventHandler | None = None, + ) -> None: + group_namespace = ( + f"{TRACE_POSTDIFF_NAMESPACE_PREFIX}:{self.namespace}:{group.key}" + ) + base_variables: dict[str, Any] = {"project_path": self.ctx.folder_name} + + await inject_skills( + ["trace"], + namespace=group_namespace, + artifact_service=self.ctx.artifact_service, + app_name=self.ctx.app_name, + user_id=user_id, + ) + + before = dict(self.overlayfs._files) # noqa: SLF001 + + # ── Stage A: annotate-only trace, one run per operation ────────── + for idx, operation in enumerate(group.operations): + await self._run_operation_trace( + operation=operation, + idx=idx, + namespace=group_namespace, + base_variables=base_variables, + user_id=user_id, + on_event=on_event, + ) + + # ── Stage B: post-diff analytics over this group's annotations ─── + changed = self._changed_since(before) + if not changed: + logger.info( + "trace-postdiff: no annotations produced for group %r — " + "skipping analytics stage", + group.key, + ) + return + + await self._run_group_analytics( + group, + namespace=group_namespace, + changed_files=changed, + user_id=user_id, + on_event=on_event, + ) + + async def _run_operation_trace( + self, + *, + operation: OpenApiOperation, + idx: int, + namespace: str, + base_variables: dict[str, Any], + user_id: str, + on_event: TaskRunnerEventHandler | None, + ) -> None: + operation_schema = yaml.safe_dump( + {operation.path: {operation.method: operation.schema}}, + sort_keys=False, + ) + + rendered = RenderedTask.from_template( + template=self._template, + variables={ + **base_variables, + "operation_id": operation.operation_id, + "operation_schema": operation_schema, + }, + params={}, + 
artifacts={}, + ) + + agent = build_trace_agent( + name="trace_agent", + fs=self.overlayfs, + namespace=namespace, + _format=cast(TraceFormat, self._template.format), + model=self.llm, + max_tokens=CFG.budgets.max_tokens, + # Annotate-only: judgement is the analytics stage's job. + enable_vuln_reporting=False, + with_graph_tools=CFG.agent("trace_agent").with_graph_tools, + ) + + session_id = uuid4().hex + event_name = f"trace_postdiff:{self.namespace}:{operation.operation_id}" + await self._runner.run( + agent=agent, + message=rendered._format_task(), + user_id=user_id, + session_id=session_id, + initial_state={}, + plugins=self._plugins(event_name, idx, session_id), + on_event=on_event, + event_name=event_name, + ) + + async def _run_group_analytics( + self, + group: PathGroup, + *, + namespace: str, + changed_files: set[str], + user_id: str, + on_event: TaskRunnerEventHandler | None, + ) -> None: + group_diff = filter_diff_by_files( + self.overlayfs.diff(context_lines=4), changed_files + ) + group_diff = truncate_diff( + group_diff, CFG.budgets.analytics_diff_max_chars + ) + + target_summary = yaml.safe_dump( + { + path.path: {op.method: op.schema for op in path.operations} + for path in group.paths + }, + sort_keys=False, + ) + + rendered = RenderedTask.from_template( + template=self._analytics_template, + variables={ + "target_summary": target_summary, + "trace_diff": group_diff, + }, + params={}, + artifacts={}, + ) + + agent = build_vuln_analytics_agent( + name="vuln_analytics_agent", + fs=self.overlayfs, + namespace=namespace, + _format=cast(AnalyticsFormat, self._analytics_template.format), + model=self.llm, + max_tokens=CFG.budgets.analytics_max_tokens, + with_graph_tools=CFG.agent("vuln_analytics_agent").with_graph_tools, + ) + + session_id = uuid4().hex + event_name = ( + f"trace_postdiff:{self.namespace}:{group.key}:analytics" + ) + await self._runner.run( + agent=agent, + message=rendered._format_task(), + user_id=user_id, + session_id=session_id, + 
initial_state={}, + plugins=self._plugins(event_name, len(group.operations), session_id), + on_event=on_event, + event_name=event_name, + ) + + def _plugins(self, event_name: str, task_id: int, session_id: str) -> list: + return [ + AdkTracePlugin( + task_name=event_name, + task_id=task_id, + iteration=1, + session_id=session_id, + emit=self._runner._emit, + ), + AdkMetricsPlugin( + task_name=event_name, + task_id=task_id, + iteration=1, + session_id=session_id, + emit=self._runner._emit, + ), + ] diff --git a/contractor/workflows/trace_verify/workflow.py b/contractor/workflows/trace_verify/workflow.py index 1b0e4e4..9ca8344 100644 --- a/contractor/workflows/trace_verify/workflow.py +++ b/contractor/workflows/trace_verify/workflow.py @@ -1,17 +1,20 @@ """Static verifier of upstream vulnerability findings (OpenAnt Stage-2 style). For each path in the source OpenAPI schema, loads the per-path -:class:`VulnerabilityReport` artifact written by a prior ``trace-direct`` (or -``trace``) run and queues one task per finding for ``trace_verifier_agent``. +:class:`VulnerabilityReport` artifacts written by a prior trace run — any of +``trace`` / ``trace-direct`` (``trace-annotation:`` prefix), ``trace-graph`` +(``trace-graph:``), or ``trace-graph-pathpar`` (``trace-graph-pathpar:``) — +and queues one task per finding for ``trace_verifier_agent``. The verifier is code-evidence-only — no HTTP probes — and persists verdicts via ``verification_tools`` under the same namespace as the upstream findings, so the two artifacts pair up: - user:vulnerability-reports/trace-annotation:openapi:{path_key} - user:vulnerability-verifications/trace-annotation:openapi:{path_key} + user:vulnerability-reports/{prefix}:openapi:{path_key} + user:vulnerability-verifications/{prefix}:openapi:{path_key} -Paths with no findings are silently skipped. +Paths with no findings are skipped (DEBUG log); if *no* path has findings +under *any* prefix the workflow logs a WARNING and completes as a no-op. 
""" from __future__ import annotations @@ -23,10 +26,14 @@ import yaml from contractor.agents.trace_verifier_agent.agent import build_trace_verifier_agent +from contractor.runners.artifacts import artifact_key_slug from contractor.runners.task_runner import TaskRunner, TaskRunnerEventHandler from contractor.utils.settings import build_model from contractor.workflows import Workflow, WorkflowContext, persist_seed_artifact from contractor.workflows.config import WorkflowConfig +from contractor.workflows.findings import load_findings_artifact +from contractor.workflows.namespaces import TRACE_NAMESPACE_PREFIXES +from contractor.workflows.path_groups import group_key_for_path from contractor.workflows.trace_annotation import OpenApiPath, extract_openapi_paths CFG = WorkflowConfig.load(__file__) @@ -43,6 +50,10 @@ def __init__(self, ctx: WorkflowContext) -> None: super().__init__(ctx) self.llm = build_model(ctx.model, ctx.timeout) self.paths: list[OpenApiPath] = [] + # Group-keyed namespaces cover several sibling paths; remember the + # ones already verified so a group's findings are queued once, not + # once per member path. + self._processed_namespaces: set[str] = set() async def _run_impl( self, @@ -64,32 +75,108 @@ async def _run_impl( openapi = yaml.safe_load(raw.text or "") self.paths = extract_openapi_paths(openapi=openapi) + total_findings = 0 for api_path in self.paths: - await self._verify_path_findings( + total_findings += await self._verify_path_findings( api_path=api_path, user_id=user_id, on_event=on_event, ) + if not total_findings: + logger.warning( + "trace-verify found no vulnerability reports for any of the " + "%d OpenAPI paths under any known trace namespace prefix " + "(probed: %s) — nothing to verify. 
Run a trace workflow " + "(trace / trace-direct / trace-graph / trace-graph-pathpar) " + "against this project first.", + len(self.paths), + ", ".join(TRACE_NAMESPACE_PREFIXES), + ) + + def _candidate_namespaces(self, api_path: OpenApiPath) -> list[str]: + """Every namespace a trace producer may have written findings under + for ``api_path``, in probe order. + + Producers key findings by ``path_key`` (per-path runs) or by a + route-prefix group key (``group_depth >= 1`` in trace-postdiff / + pathpar). The producer's depth isn't knowable here, so depth-1 and + depth-2 group keys are probed alongside the path key — each probe + is just an artifact lookup, so extra candidates are cheap. + """ + keys = [api_path.path_key] + for depth in (1, 2): + group_key = group_key_for_path(api_path.path, depth) + if group_key not in keys: + keys.append(group_key) + return [ + f"{prefix}:{self.namespace}:{key}" + for key in keys + for prefix in TRACE_NAMESPACE_PREFIXES + ] + + async def _discover_findings( + self, + *, + user_id: str, + api_path: OpenApiPath, + ) -> list[tuple[str, list[dict[str, Any]]]]: + """Probe every candidate namespace for ``api_path`` and return + ``(source_namespace, findings)`` pairs for the non-empty ones.""" + discovered: list[tuple[str, list[dict[str, Any]]]] = [] + for source_namespace in self._candidate_namespaces(api_path): + if source_namespace in self._processed_namespaces: + continue + findings = await self._load_findings( + user_id=user_id, source_namespace=source_namespace + ) + if findings: + discovered.append((source_namespace, findings)) + return discovered + async def _verify_path_findings( self, *, api_path: OpenApiPath, user_id: str, on_event: TaskRunnerEventHandler | None, - ) -> None: - source_namespace = f"trace-annotation:{self.namespace}:{api_path.path_key}" - findings = await self._load_findings( - user_id=user_id, source_namespace=source_namespace + ) -> int: + """Verify every finding recorded for ``api_path``; returns how many + 
findings were queued (0 when the path has none under any prefix).""" + discovered = await self._discover_findings( + user_id=user_id, api_path=api_path ) - if not findings: + if not discovered: logger.debug( - "no findings under %r — skipping verify for path %r", - source_namespace, + "no vulnerability reports for path %r under any trace " + "namespace prefix (probed: %s) — skipping verify", api_path.path, + ", ".join(self._candidate_namespaces(api_path)), + ) + return 0 + + total = 0 + for source_namespace, findings in discovered: + self._processed_namespaces.add(source_namespace) + total += len(findings) + await self._verify_namespace_findings( + api_path=api_path, + source_namespace=source_namespace, + findings=findings, + user_id=user_id, + on_event=on_event, ) - return + return total + async def _verify_namespace_findings( + self, + *, + api_path: OpenApiPath, + source_namespace: str, + findings: list[dict[str, Any]], + user_id: str, + on_event: TaskRunnerEventHandler | None, + ) -> None: ctx = self.ctx verifier_builder = partial( build_trace_verifier_agent, @@ -119,6 +206,13 @@ async def _verify_path_findings( f"trace_verify:{self.namespace}:" f"{api_path.path_key}:{finding_name}" ), + # Unique, stable per-finding publish key — every finding is a + # separate `trace_verify` task and the shared template key + # would make siblings overwrite each other's artifacts. + artifact_key=( + f"trace_verify/{artifact_key_slug(source_namespace)}/" + f"{artifact_key_slug(finding_name)}" + ), worker_builder=verifier_builder, **CFG.tasks.verify.as_kwargs(), namespace=source_namespace, @@ -150,31 +244,9 @@ async def _load_findings( ``name`` filled in from the key when absent. Empty / missing / malformed artifacts return an empty list (the path is then skipped). 
""" - artifact_key = f"user:vulnerability-reports/{source_namespace}" - part = await self.ctx.artifact_service.load_artifact( + return await load_findings_artifact( + self.ctx.artifact_service, app_name=self.ctx.app_name, user_id=user_id, - filename=artifact_key, + filename=f"user:vulnerability-reports/{source_namespace}", ) - if part is None or not part.text: - return [] - try: - raw = yaml.safe_load(part.text) or {} - except yaml.YAMLError as exc: - logger.warning( - "could not parse %s as YAML: %s — skipping path", - artifact_key, - exc, - ) - return [] - if not isinstance(raw, dict): - return [] - - findings: list[dict[str, Any]] = [] - for name, item in raw.items(): - if not isinstance(item, dict): - continue - entry = dict(item) - entry.setdefault("name", name) - findings.append(entry) - return findings diff --git a/contractor/workflows/vuln_assess/workflow.py b/contractor/workflows/vuln_assess/workflow.py index 53f460e..8d0d97f 100644 --- a/contractor/workflows/vuln_assess/workflow.py +++ b/contractor/workflows/vuln_assess/workflow.py @@ -15,7 +15,6 @@ from __future__ import annotations import logging -import os from functools import partial from typing import Any @@ -25,9 +24,11 @@ from contractor.agents.oas_linter_agent.agent import build_oas_linter_agent from contractor.agents.swe_agent.agent import build_swe_agent from contractor.runners.task_runner import TaskRunner, TaskRunnerEventHandler -from contractor.utils.settings import build_model +from contractor.utils.settings import build_model, get_settings from contractor.workflows import Workflow, WorkflowContext, persist_seed_artifact from contractor.workflows.config import WorkflowConfig +from contractor.workflows.findings import load_yaml_dict_artifact +from contractor.workflows.path_groups import group_key_for_path from contractor.workflows.trace_annotation import extract_openapi_paths from contractor.workflows.trace_graph_pathpar import TraceGraphPathParWorkflow from 
contractor.workflows.trace_graph_pathpar.workflow import PATH_NAMESPACE_PREFIX @@ -114,6 +115,10 @@ async def _run_oas_stage( ): runner.add_task( name="dependency_information", + # Stable explicit refs: the default positional ref + # (`{name}:{len(queue)}`) shifts between runs when an upstream + # task is conditionally skipped, breaking --resume checkpoints. + ref="dependency_information", worker_builder=swe_builder, **CFG.tasks.dependency_information.as_kwargs(), namespace="dependency_information", model=self.llm, @@ -126,6 +131,7 @@ async def _run_oas_stage( ): runner.add_task( name="project_information", + ref="project_information", worker_builder=swe_builder, **CFG.tasks.project_information.as_kwargs(), artifacts=["dependency_information/result"], @@ -136,6 +142,7 @@ async def _run_oas_stage( runner.add_task( name="oas_update", + ref="oas_update", worker_builder=oas_builder, **CFG.tasks.oas_update.as_kwargs(), artifacts=[ @@ -148,6 +155,7 @@ async def _run_oas_stage( # Step 3 [VERIFY] runner.add_task( name="oas_validate", + ref="oas_validate", worker_builder=oas_linter, **CFG.tasks.oas_validate.as_kwargs(), artifacts=[ @@ -216,7 +224,7 @@ async def _run_exploit_stage( user_id: str, on_event: TaskRunnerEventHandler | None, ) -> None: - target_url = os.environ.get("CONTRACTOR_TARGET_URL") + target_url = get_settings().target_url if not target_url: logger.warning( "CONTRACTOR_TARGET_URL not set — skipping exploit stage" @@ -258,18 +266,14 @@ async def _collect_vuln_reports(self, *, user_id: str) -> str: ctx = self.ctx merged: dict[str, Any] = {} - part = await ctx.artifact_service.load_artifact( - app_name=ctx.app_name, - user_id=user_id, - filename="vulnerability-reports-seed", + merged.update( + await load_yaml_dict_artifact( + ctx.artifact_service, + app_name=ctx.app_name, + user_id=user_id, + filename="vulnerability-reports-seed", + ) ) - if part and part.text: - try: - raw = yaml.safe_load(part.text) or {} - if isinstance(raw, dict): - merged.update(raw) - 
except yaml.YAMLError: - pass for ns_suffix in ["openapi"]: oas_part = await ctx.artifact_service.load_artifact( @@ -284,23 +288,31 @@ async def _collect_vuln_reports(self, *, user_id: str) -> str: except yaml.YAMLError: continue paths = extract_openapi_paths(openapi=openapi) + probed: set[str] = set() for api_path in paths: # Must match the namespace the trace stage (TraceGraphPathParWorkflow, # run in _run_trace_stage) writes vuln reports under — shared constant # so the read/write keys can't drift (audit: HIGH, namespace mismatch). - ns = f"{PATH_NAMESPACE_PREFIX}:{ns_suffix}:{api_path.path_key}" - part = await ctx.artifact_service.load_artifact( - app_name=ctx.app_name, - user_id=user_id, - filename=f"user:vulnerability-reports/{ns}", - ) - if part and part.text: - try: - reports = yaml.safe_load(part.text) or {} - if isinstance(reports, dict): - merged.update(reports) - except yaml.YAMLError: + # The trace stage keys by path_key (group_depth=0) or by a + # route-prefix group key (group_depth>=1) — probe both. 
+ keys = [api_path.path_key] + for depth in (1, 2): + group_key = group_key_for_path(api_path.path, depth) + if group_key not in keys: + keys.append(group_key) + for key in keys: + ns = f"{PATH_NAMESPACE_PREFIX}:{ns_suffix}:{key}" + if ns in probed: continue + probed.add(ns) + merged.update( + await load_yaml_dict_artifact( + ctx.artifact_service, + app_name=ctx.app_name, + user_id=user_id, + filename=f"user:vulnerability-reports/{ns}", + ) + ) if not merged: return "" diff --git a/contractor/workflows/vuln_scan_fast/workflow.py b/contractor/workflows/vuln_scan_fast/workflow.py index 0d7c847..0a2a845 100644 --- a/contractor/workflows/vuln_scan_fast/workflow.py +++ b/contractor/workflows/vuln_scan_fast/workflow.py @@ -14,7 +14,7 @@ from __future__ import annotations import logging -import os +import re from functools import partial from typing import Any @@ -23,9 +23,10 @@ from contractor.agents.codereview_agent.agent import build_codereview_agent from contractor.agents.swe_agent.agent import build_swe_agent from contractor.runners.task_runner import TaskRunner, TaskRunnerEventHandler -from contractor.utils.settings import build_model +from contractor.utils.settings import build_model, get_settings from contractor.workflows import Workflow, WorkflowContext from contractor.workflows.config import WorkflowConfig +from contractor.workflows.findings import load_findings_artifact CFG = WorkflowConfig.load(__file__) @@ -100,6 +101,10 @@ async def _run_discovery( ): runner.add_task( name="dependency_information", + # Stable explicit refs: the default positional ref + # (`{name}:{len(queue)}`) shifts between runs when the sibling + # task is conditionally skipped, breaking --resume checkpoints. 
+ ref="dependency_information", worker_builder=swe_builder, **CFG.tasks.dependency_information.as_kwargs(), namespace="dependency_information", model=self.llm, @@ -112,6 +117,7 @@ async def _run_discovery( ): runner.add_task( name="project_information", + ref="project_information", worker_builder=swe_builder, **CFG.tasks.project_information.as_kwargs(), artifacts=["dependency_information/result"], @@ -167,29 +173,12 @@ async def _run_fast_scan( async def _load_and_dedup_findings( self, *, user_id: str, ) -> list[dict[str, Any]]: - ctx = self.ctx - part = await ctx.artifact_service.load_artifact( - app_name=ctx.app_name, + findings = await load_findings_artifact( + self.ctx.artifact_service, + app_name=self.ctx.app_name, user_id=user_id, filename=VULN_REPORTS_KEY, ) - if part is None or not part.text: - return [] - - try: - raw = yaml.safe_load(part.text) or {} - except yaml.YAMLError: - return [] - if not isinstance(raw, dict): - return [] - - findings: list[dict[str, Any]] = [] - for name, item in raw.items(): - if not isinstance(item, dict): - continue - entry = dict(item) - entry.setdefault("name", name) - findings.append(entry) before = len(findings) findings = self._dedup(findings) @@ -208,10 +197,14 @@ def _dedup(findings: list[dict[str, Any]]) -> list[dict[str, Any]]: buckets: dict[tuple[str, str], dict[str, Any]] = {} for f in findings: + # Coerce defensively: explicit-null YAML fields (`details:`) come + # back as None, and `str()` guards against non-string scalars. 
+ place = str(f.get("place") or "") + details = str(f.get("details") or "") + cwe = re.search(r"CWE-(\d+)", details) key = ( - f.get("place", "").strip("/").lower(), - (f.get("details", "").split("CWE-")[1].split()[0] - if "CWE-" in f.get("details", "") else ""), + place.strip("/").lower(), + cwe.group(1) if cwe else "", ) existing = buckets.get(key) if existing is None: @@ -246,10 +239,11 @@ async def _run_trace_confirm( ) for finding in findings: - fname = finding.get("name", "") - place = finding.get("place", "") - title = finding.get("title", "") - details = finding.get("details", "") + # `or ""` guards explicit-null YAML fields (None) before slicing. + fname = str(finding.get("name") or "") + place = str(finding.get("place") or "") + title = str(finding.get("title") or "") + details = str(finding.get("details") or "") ns = f"trace-confirm:{fname}" @@ -304,7 +298,7 @@ async def _run_exploit_stage( user_id: str, on_event: TaskRunnerEventHandler | None, ) -> None: - target_url = os.environ.get("CONTRACTOR_TARGET_URL") + target_url = get_settings().target_url if not target_url: logger.warning( "CONTRACTOR_TARGET_URL not set — skipping exploit stage" diff --git a/contractor/workflows/vuln_scan_trace/workflow.py b/contractor/workflows/vuln_scan_trace/workflow.py index 55993d3..1e83465 100644 --- a/contractor/workflows/vuln_scan_trace/workflow.py +++ b/contractor/workflows/vuln_scan_trace/workflow.py @@ -11,16 +11,16 @@ import logging from functools import partial -from typing import Any - -import yaml +from typing import Any, ClassVar from contractor.agents.codereview_agent.agent import build_codereview_agent from contractor.agents.trace_agent.agent import build_trace_agent +from contractor.runners.artifacts import artifact_key_slug from contractor.runners.task_runner import TaskRunner, TaskRunnerEventHandler from contractor.utils.settings import build_model from contractor.workflows import Workflow, WorkflowContext from contractor.workflows.config import 
WorkflowConfig +from contractor.workflows.findings import load_findings_artifact CFG = WorkflowConfig.load(__file__) @@ -31,6 +31,9 @@ class VulnScanTraceWorkflow(Workflow): """BFS discovery → DFS confirmation workflow.""" namespace: str = "vuln-scan-trace" + # Methods read config via ``self.CFG`` so subclasses (vuln-sweep) can + # point inherited phases at their own sibling config.yaml. + CFG: ClassVar[WorkflowConfig] = CFG def __init__(self, ctx: WorkflowContext) -> None: super().__init__(ctx) @@ -50,27 +53,27 @@ async def _run_impl( scan_builder = partial( build_codereview_agent, name="codereview_agent", - _format=CFG.agent("codereview_agent").output_format, + _format=self.CFG.agent("codereview_agent").output_format, fs=ctx.fs, model=self.llm, - max_tokens=CFG.budgets.scan_max_tokens, - with_graph_tools=CFG.agent("codereview_agent").with_graph_tools, + max_tokens=self.CFG.budgets.scan_max_tokens, + with_graph_tools=self.CFG.agent("codereview_agent").with_graph_tools, ) runner = TaskRunner( name="contractor", artifact_service=ctx.artifact_service, checkpoint_path=ctx.checkpoint_path, - observations=CFG.observations, + observations=self.CFG.observations, ) runner.add_variable(name="project_path", value=ctx.folder_name) runner.add_task( name="vuln_scan", - ref="vuln-scan-trace:scan", + ref=f"{self.namespace}:scan", worker_builder=scan_builder, - **CFG.tasks.scan.as_kwargs(), + **self.CFG.tasks.scan.as_kwargs(), namespace=scan_namespace, skills=["vuln_scan"], model=self.llm, @@ -119,19 +122,19 @@ async def _trace_finding( trace_builder = partial( build_trace_agent, name="trace_agent", - _format=CFG.agent("trace_agent").output_format, + _format=self.CFG.agent("trace_agent").output_format, fs=ctx.fs, model=self.llm, - max_tokens=CFG.budgets.trace_max_tokens, + max_tokens=self.CFG.budgets.trace_max_tokens, enable_vuln_reporting=True, - with_graph_tools=CFG.agent("trace_agent").with_graph_tools, + with_graph_tools=self.CFG.agent("trace_agent").with_graph_tools, ) runner 
= TaskRunner( name="contractor", artifact_service=ctx.artifact_service, checkpoint_path=ctx.checkpoint_path, - observations=CFG.observations, + observations=self.CFG.observations, ) runner.add_variable(name="project_path", value=ctx.folder_name) @@ -151,9 +154,13 @@ async def _trace_finding( runner.add_task( name="trace_annotation", - ref=f"vuln-scan-trace:trace:{name}", + ref=f"{self.namespace}:trace:{name}", + # Unique, stable per-finding publish key — one trace task is queued + # per finding and the shared template key would make each finding + # overwrite the previous one's artifacts. + artifact_key=f"trace_annotation/{artifact_key_slug(name)}", worker_builder=trace_builder, - **CFG.tasks.trace.as_kwargs(), + **self.CFG.tasks.trace.as_kwargs(), namespace=trace_namespace, skills=["trace"], model=self.llm, @@ -175,31 +182,12 @@ async def _load_findings( namespace: str, ) -> list[dict[str, Any]]: """Load vulnerability reports from the scan phase artifacts.""" - artifact_key = f"user:vulnerability-reports/{namespace}" - part = await self.ctx.artifact_service.load_artifact( + findings = await load_findings_artifact( + self.ctx.artifact_service, app_name=self.ctx.app_name, user_id=user_id, - filename=artifact_key, + filename=f"user:vulnerability-reports/{namespace}", ) - if part is None or not getattr(part, "text", None): - return [] - - try: - raw = yaml.safe_load(part.text or "") or {} - except yaml.YAMLError as exc: - logger.warning("could not parse scan results: %s", exc) - return [] - - if not isinstance(raw, dict): - return [] - - findings: list[dict[str, Any]] = [] - for name, item in raw.items(): - if not isinstance(item, dict): - continue - entry = dict(item) - entry.setdefault("name", name) - findings.append(entry) # Sort by severity: critical first sev_order = {"critical": 0, "high": 1, "medium": 2, "low": 3} diff --git a/contractor/workflows/vuln_sweep/__init__.py b/contractor/workflows/vuln_sweep/__init__.py new file mode 100644 index 0000000..cba50ec 
--- /dev/null +++ b/contractor/workflows/vuln_sweep/__init__.py @@ -0,0 +1,3 @@ +from contractor.workflows.vuln_sweep.workflow import VulnSweepWorkflow + +__all__ = ["VulnSweepWorkflow"] diff --git a/contractor/workflows/vuln_sweep/config.yaml b/contractor/workflows/vuln_sweep/config.yaml new file mode 100644 index 0000000..bed31f2 --- /dev/null +++ b/contractor/workflows/vuln_sweep/config.yaml @@ -0,0 +1,17 @@ +budgets: + scan_max_tokens: 60000 + trace_max_tokens: 80000 + # Parallel nomination sweeps (one per vulnerability class). + sweep_concurrency: 3 + # Cap on nominations carried into the (expensive) DFS trace phase. + max_trace_nominations: 40 +tasks: + sweep: { iterations: 1, max_attempts: 2, max_steps: 50 } + trace: { iterations: 1, max_attempts: 1, max_steps: 30 } +agents: + codereview_agent: { with_graph_tools: true, output_format: json } + trace_agent: { with_graph_tools: true, output_format: json } +observations: + enabled: true + include_tool_errors: false + track_file_paths: true diff --git a/contractor/workflows/vuln_sweep/workflow.py b/contractor/workflows/vuln_sweep/workflow.py new file mode 100644 index 0000000..b412728 --- /dev/null +++ b/contractor/workflows/vuln_sweep/workflow.py @@ -0,0 +1,227 @@ +"""Recall-oriented two-pass vulnerability workflow. + +Pass 1 (BFS sweep): one cheap nomination task *per vulnerability class* +runs in parallel, each sweeping the whole project for candidate sinks of +its single class and reporting them at low confidence. Splitting the +sweep by class — rather than one agent scanning for everything — keeps +each agent's attention narrow (the recall problem on large codebases is +an attention problem) and adds an explicit ABSENCE class so missing +controls, which a taint-only scan can never surface, get nominated. 
+ +Pass 2 (DFS trace): the nominations are merged, deduped, capped, and +each survivor is deep-traced by ``trace_agent`` to confirm or discard it +— reusing ``VulnScanTraceWorkflow._trace_finding`` verbatim. + +This is the BFS/DFS duality made operational: a wide, blind sweep that +nominates, then a narrow, evidence-driven trace that judges. +""" + +from __future__ import annotations + +import asyncio +import logging +from dataclasses import dataclass +from functools import partial +from typing import Any, ClassVar + +from contractor.agents.codereview_agent.agent import build_codereview_agent +from contractor.runners.task_runner import TaskRunner, TaskRunnerEventHandler +from contractor.workflows.config import WorkflowConfig +from contractor.workflows.findings import load_findings_artifact +from contractor.workflows.vuln_scan_trace.workflow import VulnScanTraceWorkflow + +CFG = WorkflowConfig.load(__file__) + +logger = logging.getLogger(__name__) + + +@dataclass(frozen=True) +class SinkClass: + """One vulnerability class swept by its own nomination agent.""" + + key: str + guidance: str + + +# The sweep surface, one agent per class. The last entry is the +# absence-of-control class — the structural answer to missing-auth / +# missing-ownership misses that a taint-following scan cannot find. +SINK_CLASSES: tuple[SinkClass, ...] 
= ( + SinkClass( + "injection", + "SQL/NoSQL/ORM-raw, OS command, template (SSTI), LDAP, and " + "expression-language sinks: execute/cursor/query, subprocess/exec/" + "system/popen, render_template_string, eval/compile.", + ), + SinkClass( + "deserialization", + "Unsafe deserialization and object construction from input: " + "pickle, yaml.load (unsafe), marshal, jackson polymorphic typing, " + "PHP unserialize, .NET BinaryFormatter.", + ), + SinkClass( + "ssrf-fileio", + "Server-side request forgery and unsafe file I/O: outbound " + "requests built from input (requests/urllib/http clients), " + "path joins / open / send_file with request-derived paths, " + "redirects to input-derived URLs.", + ), + SinkClass( + "secrets-crypto", + "Hardcoded secrets/keys, debug flags, weak crypto (md5/sha1 for " + "security, ECB, static IV/salt), and insecure randomness used " + "for tokens/passwords.", + ), + SinkClass( + "missing-access-control", + "ABSENCE class — nominate per handler. Route handlers and " + "sensitive operations (state change, data export, admin, " + "object access by id) whose visible code lacks an " + "authentication, authorization, or ownership check. Missing " + "control is the signal; no taint flow is required.", + ), +) + + +class VulnSweepWorkflow(VulnScanTraceWorkflow): + """Per-class BFS nomination sweep → DFS trace of survivors. + + Reuses ``VulnScanTraceWorkflow._trace_finding`` / ``_load_findings`` + for the DFS pass; overrides ``_run_impl`` to fan the sweep out by + vulnerability class. ``CFG`` is overridden so the inherited trace + phase reads this workflow's sibling ``config.yaml``. 
+ """ + + namespace: str = "vuln-sweep" + CFG: ClassVar[WorkflowConfig] = CFG + + async def _run_impl( + self, + *, + user_id: str, + on_event: TaskRunnerEventHandler | None, + ) -> Any: + # ── Pass 1: per-class nomination sweep (parallel) ─────────────── + sem = asyncio.Semaphore(self.CFG.budgets.sweep_concurrency) + + async def _sweep(sink_class: SinkClass) -> None: + async with sem: + await self._sweep_class( + sink_class=sink_class, user_id=user_id, on_event=on_event + ) + + async with asyncio.TaskGroup() as tg: + for sink_class in SINK_CLASSES: + tg.create_task(_sweep(sink_class)) + + # ── Merge, dedup, cap nominations ─────────────────────────────── + nominations = await self._collect_nominations(user_id=user_id) + logger.info( + "vuln-sweep: %d nominations after dedup across %d classes", + len(nominations), + len(SINK_CLASSES), + ) + if not nominations: + logger.warning("vuln-sweep: no nominations — skipping trace phase") + return + + cap = self.CFG.budgets.max_trace_nominations + if len(nominations) > cap: + logger.info( + "vuln-sweep: capping %d nominations to %d for the trace phase " + "(highest severity/confidence first)", + len(nominations), + cap, + ) + nominations = nominations[:cap] + + # ── Pass 2: DFS trace per surviving nomination ────────────────── + for finding in nominations: + await self._trace_finding( + finding=finding, user_id=user_id, on_event=on_event + ) + + def _class_namespace(self, sink_class: SinkClass) -> str: + return f"{self.namespace}:sweep:{sink_class.key}" + + async def _sweep_class( + self, + *, + sink_class: SinkClass, + user_id: str, + on_event: TaskRunnerEventHandler | None, + ) -> None: + ctx = self.ctx + class_namespace = self._class_namespace(sink_class) + + sweep_builder = partial( + build_codereview_agent, + name="codereview_agent", + _format=self.CFG.agent("codereview_agent").output_format, + fs=ctx.fs, + model=self.llm, + max_tokens=self.CFG.budgets.scan_max_tokens, + 
with_graph_tools=self.CFG.agent("codereview_agent").with_graph_tools, + ) + + runner = TaskRunner( + name="contractor", + artifact_service=ctx.artifact_service, + checkpoint_path=ctx.checkpoint_path, + observations=self.CFG.observations, + ) + runner.add_variable(name="project_path", value=ctx.folder_name) + + runner.add_task( + name="sink_nomination", + ref=f"{self.namespace}:sweep:{sink_class.key}", + worker_builder=sweep_builder, + **self.CFG.tasks.sweep.as_kwargs(), + namespace=class_namespace, + skills=["vuln_scan"], + model=self.llm, + params={ + "project_path": ctx.folder_name, + "sink_class": sink_class.key, + "class_guidance": sink_class.guidance, + }, + ) + + try: + await runner.run(user_id=user_id, on_event=on_event) + except Exception as exc: + logger.warning("sweep for class %s failed: %s", sink_class.key, exc) + + async def _collect_nominations( + self, *, user_id: str + ) -> list[dict[str, Any]]: + """Merge every class's nominations, dedup by (place, name), and + sort highest severity/confidence first so the cap keeps the most + promising candidates.""" + merged: dict[tuple[str, str], dict[str, Any]] = {} + for sink_class in SINK_CLASSES: + findings = await load_findings_artifact( + self.ctx.artifact_service, + app_name=self.ctx.app_name, + user_id=user_id, + filename=( + "user:vulnerability-reports/" + f"{self._class_namespace(sink_class)}" + ), + ) + for finding in findings: + key = ( + str(finding.get("place", "")), + str(finding.get("name", "")), + ) + merged.setdefault(key, finding) + + sev_order = {"critical": 0, "high": 1, "medium": 2, "low": 3} + conf_order = {"high": 0, "medium": 1, "low": 2} + return sorted( + merged.values(), + key=lambda f: ( + sev_order.get(f.get("severity", "low"), 4), + conf_order.get(f.get("confidence", "low"), 3), + ), + ) diff --git a/docs/README.md b/docs/README.md index ed71cc8..a2cdb72 100644 --- a/docs/README.md +++ b/docs/README.md @@ -106,8 +106,8 @@ Defined in 
[contractor/runners/models.py](../contractor/runners/models.py): | Object | Purpose | | ------------------ | --------------------------------------------------------------- | -| `TaskTemplate` | YAML-loaded blueprint: title, objective, instructions, default artifacts, default skills, iterations, output format. | -| `TaskInvocation` | A queued instance of a template: ref, params, model, namespace, retry budget, worker builder. | +| `TaskTemplate` | YAML-loaded blueprint: title, objective, instructions, default artifacts, default skills, iterations, output format. Versioned: `contractor/tasks/.yml` is a manifest (`active:` + `versions:`) pointing at per-version bodies under `/`; pin one via `add_task(version=...)` or `CONTRACTOR_TASK_VERSION_`. | +| `TaskInvocation` | A queued instance of a template: ref, params, model, namespace, retry budget, artifact key, worker builder. | | `RenderedTask` | A `TaskTemplate` with all `{var}` placeholders substituted from variables, params, and loaded artifact contents. | | `TaskRunnerEvent` | Lifecycle event emitted to the UI/metrics: `task_started`, `iteration_finished`, `task_failed`, etc. | | `TaskScopedKeys` | Helper for the `task::{id}::{field}` keyspace inside the ADK session state. | @@ -121,23 +121,38 @@ state machine: 1. **Render** — `_render_task` substitutes variables, params, and the contents of all required input artifacts into the template. 2. **Emit `TASK_STARTED`** — UI sees the new task immediately. -3. **For each attempt up to `max_attempts`**: - 1. `_spawn_planning_agent` — fresh planner + worker, new namespace. - 2. `_inject_skills` and `_inject_artifacts` — populate the memory - namespace with reference notes and the Inbox. - 3. `_build_task_initial_state` — prime the ADK session state with +3. **`_inject_skills` and `_inject_artifacts`** — populate the memory + namespace with reference notes and the Inbox. 
This happens once per + task: the namespace, skill list, and artifact texts are invariant + across attempts. +4. **For each attempt up to `max_attempts`**: + 1. `_spawn_planning_agent` — fresh planner + worker pair. + 2. `_build_task_initial_state` — prime the ADK session state with the per-task keys (`status=running`, empty `result/summary/pool`, - etc.). - 4. `_run_single_iteration` — run the ADK Runner against the planner + etc.), carrying forward the previous attempt's state minus the + stale planner-internal subtask keys. + 3. `_run_single_iteration` — run the ADK Runner against the planner until it terminates or hits the step budget. - 5. Emit `ITERATION_RESULT`. - 6. If `state[TaskScopedKeys.status] == DONE`, publish artifacts via - `_publish_task_artifacts`. After `iterations` successful runs in + 4. Emit `ITERATION_RESULT`. An exception raised by an iteration + consumes an attempt (reported on the event) instead of aborting + the whole workflow. + 5. If `state[TaskScopedKeys.status] == DONE`, publish artifacts via + `_publish_task_artifacts` under the invocation's artifact key + (`artifact_key` on `add_task`; default: the template key — fan-out + workflows queuing several tasks from one template must pass a + unique key per task). After `iterations` successful runs in total (cumulative across attempts, not necessarily consecutive), emit `TASK_FINISHED` and return. -4. **Exhausted retries** — emit `TASK_FAILED` and raise +5. **Exhausted retries** — emit `TASK_FAILED` and raise `TaskNotCompletedError`. +Two runner-level guarantees sit around this loop. When the runner has a +`checkpoint_path`, each completed task is recorded there and restored on +a re-run — an entry is only honoured if its template key/version still +match the invocation *and* all of its published artifacts are still +present in the store. And event delivery is best-effort telemetry: a +failing event handler is logged and never aborts the run. 
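The two runner-level guarantees above can be sketched as follows. This is a minimal illustration, not the actual `TaskRunner` code: `CheckpointEntry`, `entry_is_honoured`, and `emit_best_effort` are hypothetical names, and the artifact store is stood in for by a plain set of keys.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Any, Callable

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class CheckpointEntry:
    """Hypothetical shape of one recorded task completion."""

    template_key: str
    template_version: str
    artifact_keys: tuple[str, ...] = ()


def entry_is_honoured(
    entry: CheckpointEntry,
    *,
    template_key: str,
    template_version: str,
    stored_keys: set[str],
) -> bool:
    """Restore a checkpoint entry only if it still describes this invocation.

    Both conditions from the text must hold: the template key/version match,
    and every artifact the task published is still present in the store.
    """
    if entry.template_key != template_key:
        return False
    if entry.template_version != template_version:
        return False
    return all(key in stored_keys for key in entry.artifact_keys)


def emit_best_effort(
    handler: Callable[[Any], None] | None, event: Any
) -> None:
    """Deliver an event as telemetry: a failing handler is logged, not raised."""
    if handler is None:
        return
    try:
        handler(event)
    except Exception:
        logger.exception("event handler failed; continuing the run")
```

The point of checking artifacts as well as the template identity is that a checkpoint alone does not prove the downstream inputs still exist; if the store was cleared, the task has to run again.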
+ ### 3.3 Session state shape Per iteration, the ADK session state is a flat dict keyed under @@ -152,12 +167,18 @@ subtasks. "_global_task_id": 0, "task::0::objective": "...", "task::0::status": "running" | "done", + "task::0::current": None, # current-subtask pointer "task::0::result": "...", # written by streamline `finish` "task::0::summary": "...", # written by streamline `finish` "task::0::pool": [ records ] # appended by streamline manager } ``` +The planner's *subtask-level* state (plan + index) lives under deeper, +invocation-scoped keys (`task::{id}::{invocation_id}::…`, see +`StreamlineManager._state_key`), so every attempt starts with a fresh +plan while the fixed keys above carry the task-level contract. + ### 3.4 Workflows Concrete workflows are thin assemblers of templates + worker builders. @@ -167,43 +188,47 @@ accepts any of these keys: OpenAPI / architecture: -- [oas_building.py](../contractor/workflows/oas_building/workflow.py) — `oas_build` -- [oas_enrichment.py](../contractor/workflows/oas_enrichment/workflow.py) — `oas_update` -- [likec4_building.py](../contractor/workflows/likec4_building/workflow.py) — `likec4` +- [oas_building/](../contractor/workflows/oas_building/workflow.py) — `oas_build` +- [oas_enrichment/](../contractor/workflows/oas_enrichment/workflow.py) — `oas_update` +- [likec4_building/](../contractor/workflows/likec4_building/workflow.py) — `likec4` Trace & annotate: -- [trace_annotation.py](../contractor/workflows/trace_annotation/workflow.py) — `trace` (planner-driven, per-operation overlay FS) -- [trace_annotation_direct.py](../contractor/workflows/trace_annotation_direct/workflow.py) — `trace-direct` (single-agent variant via `AgentRunner`, skips the planner) -- [trace_graph.py](../contractor/workflows/trace_graph/workflow.py) — `trace-graph` (thin variant of `trace-direct` that enables trailmark call-graph tools) -- [trace_graph_pathpar.py](../contractor/workflows/trace_graph_pathpar/workflow.py) — `trace-graph-pathpar` 
(path-level parallel variant of `trace-graph`; identical annotation semantics, paths run concurrently over forked overlays — see [insights-parallel-vuln-pipelines.md](insights-parallel-vuln-pipelines.md)) -- [trace_verify.py](../contractor/workflows/trace_verify/workflow.py) — `trace-verify` (per-finding static verifier, OpenAnt Stage-2 style) +- [trace_annotation/](../contractor/workflows/trace_annotation/workflow.py) — `trace` (planner-driven, per-operation overlay FS) +- [trace_annotation_direct/](../contractor/workflows/trace_annotation_direct/workflow.py) — `trace-direct` (single-agent variant via `AgentRunner`, skips the planner) +- [trace_graph/](../contractor/workflows/trace_graph/workflow.py) — `trace-graph` (thin variant of `trace-direct` that enables trailmark call-graph tools) +- [trace_graph_pathpar/](../contractor/workflows/trace_graph_pathpar/workflow.py) — `trace-graph-pathpar` (path-level parallel variant of `trace-graph`; identical annotation semantics, paths run concurrently over forked overlays — see [insights-parallel-vuln-pipelines.md](insights-parallel-vuln-pipelines.md)) +- [trace_verify/](../contractor/workflows/trace_verify/workflow.py) — `trace-verify` (per-finding static verifier, OpenAnt Stage-2 style) Vulnerability detection: -- [vuln_scan.py](../contractor/workflows/vuln_scan/workflow.py) — `vuln-scan` (breadth-first scan against source code) -- [vuln_scan_fast.py](../contractor/workflows/vuln_scan_fast/workflow.py) — `vuln-scan-fast` (Workflow B: high-recall scan → dedup → trace-confirm → exploit) -- [vuln_scan_trace.py](../contractor/workflows/vuln_scan_trace/workflow.py) — `vuln-scan-trace` (BFS discovery → DFS confirmation) -- [vuln_assess.py](../contractor/workflows/vuln_assess/workflow.py) — `vuln-assess` (Workflow A: discovery → OAS → trace → exploit) -- [exploitability.py](../contractor/workflows/exploitability/workflow.py) — `exploit` (per-finding exploitability assessment against a live target) +- 
[vuln_scan/](../contractor/workflows/vuln_scan/workflow.py) — `vuln-scan` (breadth-first scan against source code) +- [vuln_scan_fast/](../contractor/workflows/vuln_scan_fast/workflow.py) — `vuln-scan-fast` (Workflow B: high-recall scan → dedup → trace-confirm → exploit) +- [vuln_scan_trace/](../contractor/workflows/vuln_scan_trace/workflow.py) — `vuln-scan-trace` (BFS discovery → DFS confirmation) +- [vuln_assess/](../contractor/workflows/vuln_assess/workflow.py) — `vuln-assess` (Workflow A: discovery → OAS → trace → exploit) +- [exploitability/](../contractor/workflows/exploitability/workflow.py) — `exploit` (per-finding exploitability assessment against a live target) Prompt-driven: -- [router.py](../contractor/workflows/router/workflow.py) — `router` +- [router/](../contractor/workflows/router/workflow.py) — `router` Several workflows diverge from the planner+worker pattern: - **`router`** skips the templated task queue and runs a single planner whose worker is a *router agent* that dispatches to one of several specialised sub-agents (SWE, OAS builder, OAS linter, trace, HTTP). -- **`trace-direct` / `trace-graph`** use the bare `AgentRunner` +- **`trace-direct` / `trace-graph` / `trace-graph-pathpar`** use the + bare `AgentRunner` (`contractor/runners/agent_runner.py`) instead of `TaskRunner`: one `trace_agent` invocation per OpenAPI operation, no planner, no subtask state machine. The workflow wraps the project filesystem in `MemoryOverlayFileSystem` so worker writes (the inlined `@trace` annotations) are captured as an artifact diff rather than mutating the host tree. 
-- **`trace-verify`** is downstream of `trace` / `trace-direct`: it +- **`trace-verify`** is downstream of any trace producer (`trace`, + `trace-direct`, `trace-graph`, `trace-graph-pathpar`): it probes + every known trace namespace prefix + ([contractor/workflows/namespaces.py](../contractor/workflows/namespaces.py)), loads each per-path `VulnerabilityReport` artifact and queues one task per finding for `trace_verifier_agent`, which produces a code-evidence-only verdict paired with the upstream finding by @@ -213,11 +238,17 @@ Several workflows diverge from the planner+worker pattern: ## 4. The streamline planner +> For a focused deep dive on this section — the per-task lifecycle, the +> planner action-picker loop, the subtask state machine, `execute_current_subtask` +> internals, and the session-state shape, all with Mermaid diagrams — see +> [planner.md](planner.md). + The planning agent ([contractor/agents/planning_agent/agent.py](../contractor/agents/planning_agent/agent.py)) is a `LlmAgent` whose tools are the *streamline manager* operations — -`add_subtask`, `execute_current_subtask`, `decompose_subtask`, -`get_records`, `skip`, `finish` — plus the shared memory tools. +`add_subtask`, `get_current_subtask`, `list_subtasks`, +`execute_current_subtask`, `decompose_subtask`, `get_records`, `skip`, +`finish` — plus the shared memory tools. The planner does not do the work itself. It maintains a list of subtasks, asks the worker (also an `LlmAgent`, wrapped as an @@ -272,18 +303,22 @@ passing it through the planner factory — no per-agent glue is needed. (`n_retries`, default 3). It accepts the response as either a typed `SubtaskExecutionResult`, a dict matching the schema, or a string in the configured format (`json` / `yaml` / `markdown` / `xml`) that the -`SubtaskFormatter` can parse. If all retries fail to produce a valid -result, the subtask is marked `malformed` and the planner is forced to -either decompose it or skip it. +`SubtaskFormatter` can parse. 
A response whose `task_id` does not match +the current subtask also consumes a retry. If all retries fail to +produce a valid result, the subtask is marked `malformed` and the +planner is forced to either decompose it or skip it. ### 4.3 Records and summary Every executed subtask appends a *record* (the merged subtask + result) to `task::{id}::pool`. When the planner calls `finish`, a built-in -*summarizer agent* (a sibling `LlmAgent` sharing the worker's tools and +*summarizer agent* (a tool-less sibling `LlmAgent` sharing the worker's model) condenses `objective + records + result + status` into a -structured human-readable summary. That summary is written to -`task::{id}::summary` and persisted as the `summary` artifact. +structured human-readable summary. The payload is capped — only the +most recent `max_records` (default 20) records, each truncated, are fed +to the summarizer, so a long run cannot blow its context. That summary +is written to `task::{id}::summary` and persisted as the `summary` +artifact. ### 4.4 Memory contract @@ -304,8 +339,8 @@ planner starts: - **Inbox memories**, tagged `inbox` and `previous-task-result`, containing the textual content of every artifact the task declared as required. -- **Skill memories**, tagged with the skill name (e.g. `likec4`), - containing the contents of every markdown file under +- **Skill memories**, tagged `skill` plus the skill name (e.g. + `likec4`), containing the contents of every markdown file under [contractor/skills//](../contractor/skills/). --- @@ -323,7 +358,7 @@ to the planner as tool errors. 
| From | Allowed transitions | Triggered by | | ------------- | ---------------------------------------------- | ---------------------------------------------------- | | `new` | `done`, `incomplete`, `malformed`, `skipped` | `execute_current_subtask`, `skip` | -| `incomplete` | `decomposed`, `skipped` | `decompose_subtask`, `skip` (last-only) | +| `incomplete` | `decomposed`, `skipped` | `decompose_subtask`, `skip` (last-only, or when the subtask budget is exhausted) | | `malformed` | `decomposed`, `skipped` | `decompose_subtask`, `skip` | | `done` | (terminal) | — | | `decomposed` | (terminal parent state) | child subtasks proceed independently | @@ -335,8 +370,8 @@ Notes from the diagram: `malformed`, or `skipped` — it cannot be re-executed in place. - **`incomplete`** means the worker reported partial progress. The planner *must* call `decompose_subtask` (or `skip`, only if it is the - very last subtask). Re-running it directly is forbidden; the planner - prompt explicitly states this. + very last subtask or the subtask budget is exhausted). Re-running it + directly is forbidden; the planner prompt explicitly states this. - **`malformed`** is the runtime fallback when worker output fails to parse after all retries. The raw output is preserved in `output` for inspection. Same options apply: decompose or skip. @@ -361,8 +396,9 @@ Notes from the diagram: The planner's `finish` tool is the only way to set `task::{id}::status = done`. It refuses to mark `done` if any subtask -is still `new`, which prevents the planner from terminating before all -explicit work has been resolved. After `finish`, the ADK invocation is +is still `new`, if no subtasks exist at all, or if not a single subtask +finished `done` — which prevents the planner from terminating before +any explicit work has been resolved. After `finish`, the ADK invocation is forcibly ended via `tool_context._invocation_context.end_invocation = True` so the planner cannot keep emitting tool calls. 
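The `finish` guard described above can be sketched as a pure function over the subtask statuses. This is an illustration under stated assumptions, not the streamline manager's implementation: `finish_refusal_reason` is a hypothetical helper, and the status strings are the ones used in the transition table.

```python
from __future__ import annotations


def finish_refusal_reason(statuses: list[str]) -> str | None:
    """Return why `finish` would refuse, or None if the task may be marked done.

    Mirrors the three refusal conditions from the text:
    no subtasks at all, any subtask still `new`, or no subtask that ended `done`.
    """
    if not statuses:
        return "no subtasks exist"
    if any(status == "new" for status in statuses):
        return "unresolved subtasks remain"
    if not any(status == "done" for status in statuses):
        return "no subtask finished done"
    return None
```

For example, a plan whose subtasks were all skipped or malformed is refused even though nothing is left in `new`, which is what stops a planner from finishing without any completed work.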
diff --git a/docs/TUNABLE_PARAMS.md b/docs/TUNABLE_PARAMS.md new file mode 100644 index 0000000..c6ff6b2 --- /dev/null +++ b/docs/TUNABLE_PARAMS.md @@ -0,0 +1,432 @@ +# Tunable Parameters Reference + +A consolidated catalog of every tunable parameter in Contractor — env-driven +`Settings`, per-workflow `config.yaml`, task templates, planner/subtask caps, +callbacks, tool limits, CLI flags, eval overrides, and the LiteLLM deploy. + +Generated 2026-06-07, last synced 2026-06-11. Line numbers are indicative; +treat the source file as authoritative. + +**Where config lives (mental model):** + +| Layer | Mechanism | Scope | Edit by | +|---|---|---|---| +| Global defaults | `Settings` (pydantic-settings) | env / `cli/.env` | env var or `.env` | +| Per-workflow | `contractor/workflows//config.yaml` | one workflow | edit YAML | +| Per-task | `contractor/tasks/*.yml` + `add_task(...)` | one task | YAML / assembler | +| Planner/subtask | function-default args | per planner | code | +| Callbacks | constructor args (mostly hardcoded) | per agent | code | +| Run-time | CLI flags | one invocation | CLI | +| Eval | `CONTRACTOR_EVAL_*` env | eval suite | env var | + +> ⚠️ **Not env-routed (hardcoded magic numbers):** the summarization trigger +> `max_tokens=80000` (`worker_factory.py:57`) and the planner subtask cap +> `max_steps=15` (repeated in `task_runner.py`, `models.py`, +> `planning_agent/agent.py`, `workflows/config.py`). To change globally you must +> edit code or override per-workflow YAML. + +--- + +## 1. Global Settings (env / `cli/.env`) + +All in `contractor/utils/settings.py` (`Settings`, pydantic-settings, +**case-insensitive** — env var = uppercased field name). Loaded once +(`@lru_cache get_settings()`); `.env` resolved relative to CWD (convention: +`cli/.env`). Tool defaults are global fallbacks — each has a per-call override. 
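The load-once behaviour has a practical consequence: changing an env var after the first `get_settings()` call has no effect in the same process. The sketch below shows that with the standard library only. It is a stand-in, not the real pydantic-settings `Settings` class; `SketchSettings`, `get_sketch_settings`, and `_env_case_insensitive` are hypothetical names, and only the `DEFAULT_MODEL_TIMEOUT` and `MODEL_TEMPERATURE` fields are modelled.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class SketchSettings:
    """Stand-in for two documented fields; defaults follow the tables below."""

    default_model_timeout: int = 300
    model_temperature: float | None = None


def _env_case_insensitive(name: str) -> str | None:
    """Look up an env var ignoring case, emulating case-insensitive settings."""
    wanted = name.lower()
    for key, value in os.environ.items():
        if key.lower() == wanted:
            return value
    return None


@lru_cache(maxsize=1)
def get_sketch_settings() -> SketchSettings:
    """Read the environment once; later calls return the cached instance."""
    timeout = _env_case_insensitive("default_model_timeout")
    temperature = _env_case_insensitive("model_temperature")
    return SketchSettings(
        default_model_timeout=int(timeout) if timeout is not None else 300,
        model_temperature=float(temperature) if temperature is not None else None,
    )
```

In tests or long-lived processes, the cached value only changes after `get_sketch_settings.cache_clear()`; whether the real `get_settings()` is reset the same way is an assumption based on it being described as `@lru_cache`-wrapped.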
+ +### 1.1 LLM / model / proxy + +| Env var | Default | Controls | +|---|---|---| +| `DEFAULT_MODEL_NAME` | `lm-studio-qwen3.6` | LiteLLM proxy alias when `--model` omitted | +| `DEFAULT_MODEL_TIMEOUT` | `300` s | Per-request LLM timeout | +| `MODEL_TEMPERATURE` | `None` (model default) | Sampling temperature (forwarded only when set) | +| `MODEL_TOP_P` | `None` | Nucleus sampling top_p (forwarded only when set) | +| `LITELLM_API_BASE` | `None` | LiteLLM proxy base URL | +| `LITELLM_API_KEY` | `None` | LiteLLM proxy API key | + +> Sampling surface is intentionally minimal: only temperature + top_p. No +> `max_tokens` / `top_k` / penalties at the Settings layer. + +### 1.2 Observability / external services + +| Env var | Default | Controls | +|---|---|---| +| `USE_LANGFUSE` | `False` | Master switch; off ⇒ all observability calls are no-ops | +| `LANGFUSE_HOST` / `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` | `None` | Langfuse connection | +| `CAIDO_URL` / `CAIDO_AUTH_TOKEN` | `None` | Caido proxy (exploitability workflow) | +| `CONTRACTOR_ARTIFACTS_DIR` | `None` (→ `./artifacts`) | Artifact store base dir (explicit env alias) | + +### 1.3 GitLab FS auth + +Two places: `Settings` (`GITLAB_PRIVATE_TOKEN`, `GITLAB_OAUTH_TOKEN`, +`CI_JOB_TOKEN`) **and** a separate `GitlabFileSystemSettings` +(`contractor/tools/fs/gitlabfs.py`, `env_prefix="GITLAB_FS_"`): + +| Env var | Default | Controls | +|---|---|---| +| `GITLAB_FS_GITLAB_URL` | `https://gitlab.com` | Instance base URL | +| `GITLAB_FS_REF` | `master` | Branch/tag/commit | +| `GITLAB_FS_PER_PAGE` | `100` (1–100) | API pagination size | +| `GITLAB_FS_TIMEOUT` | `60.0` s | Total HTTP timeout | +| `GITLAB_FS_MAX_CONCURRENT` | `3` | Max parallel downloads | +| `GITLAB_FS_MAX_FILE_SIZE` | `52_428_800` (50 MiB) | Skip larger files | +| `GITLAB_FS_MAX_RETRIES` | `5` | HTTP retry attempts | +| `GITLAB_FS_RETRY_BACKOFF_FACTOR` | `5` | `sleep = factor · 2^attempt` | +| `GITLAB_FS_RETRY_STATUSES` | `{429,500,502,503,504}` | 
Retry-triggering statuses | + +--- + +## 2. Tool parameters + +Tool limits/defaults are Settings-backed (env-overridable) global fallbacks; +every one has a per-call override in the tool signature. + +### 2.1 Filesystem tools (`contractor/tools/fs/`) + +| Env var | Default | Controls | +|---|---|---| +| `FS_MAX_ITEMS` | `100` | Max directory entries returned by listings | +| `FS_MAX_OUTPUT` | `50_000` bytes | `read_file` (and write) output byte cap | +| `FS_MAX_READ_LINES` | `2000` (`None`=off) | Default per-read line cap when `limit` omitted | +| `FS_MAX_FILES_PER_WALK` | `100_000` | Max files scanned per fs glob/grep tree walk (walk stops early + truncation notice; mirrors `CODE_MAX_FILES_PER_WALK`) | +| `FS_HEAVY_KEEP_BUDGET_CHARS` | `0` (disabled) | Char budget for retained heavy-tool results (elision) | +| `FS_HEAVY_KEEP_LAST_N` | `0` (→ caller's, ~15) | Override count cap for retained heavy-tool results | + +### 2.2 Code-walk & graph tools (`contractor/tools/code/`) + +| Env var | Default | Controls | +|---|---|---| +| `CODE_MAX_WALK_DEPTH` | `50` | Max recursion depth for code/file walk | +| `CODE_MAX_FILES_PER_WALK` | `100_000` | Max files visited per walk | +| `GRAPH_MAX_RESULTS` | `200` | Max nodes/results per graph query | +| `GRAPH_MAX_PATHS` | `25` | Max paths returned by graph path search | +| `GRAPH_MAX_PATH_DEPTH` | `30` | Max depth for graph path traversal | + +### 2.3 HTTP tool (`contractor/tools/http.py`) + +| Env var | Default | Controls | +|---|---|---| +| `HTTP_TIMEOUT` | `30.0` s | Request timeout | +| `HTTP_BODY_PREVIEW_CHARS` | `2048` | Response body preview length | +| `HTTP_HISTORY_SIZE` | `20` | Requests retained in history ring buffer | +| `HTTP_RETRY_ATTEMPTS` | `3` | Retry count | +| `HTTP_RETRY_BASE_DELAY` | `0.5` s | Base backoff delay | +| `HTTP_RETRY_MAX_DELAY` | `8.0` s | Backoff cap | + +### 2.4 LikeC4 tool + +| Env var | Default | Controls | +|---|---|---| +| `LIKEC4_VALIDATE_TIMEOUT` | `120.0` s | LikeC4 validation subprocess 
timeout | + +### 2.5 Code-exec sandbox (podman — `contractor/tools/podman.py`) + +Not env-routed (module constants / factory args): + +| Name | Default | Controls | +|---|---|---| +| `_CONTAINER_TTL` | `2h` | Container self-expiry backstop | +| `_DEFAULT_TIMEOUT_S` | `120` s | Default wall-clock for `run_python` / `run_bash` | +| exec grace | `timeout_s + 15` | Host-side kill margin over in-container timeout | + +### 2.6 Subtask / planner tool factory (`contractor/tools/tasks/`) + +| Name | Default | Controls | +|---|---|---| +| `max_records` | `20` | `get_records` returns last N records | +| `n_retries` | `3` | Worker re-run budget on empty/unparseable/mismatched output | +| `_MAX_LITERAL_EVAL_LEN` | `50_000` | Cap on literal-eval parsing of worker output | +| decompose children | `1–3` | Subtasks per `decompose_subtask` (min enforced) | +| `task_id` pattern | `^\d+(\.\d+)*$` | Dotted numeric IDs | +| factory toggles | `use_skip=T`, `use_type_hint=F`, `use_input_schema=T`, `use_output_schema=T`, `use_summarization=T`, `worker_instrumentation=T` | Tool-surface switches | + +--- + +## 3. Per-workflow config (`contractor/workflows//config.yaml`) + +Schema in `contractor/workflows/config.py`. Four optional top-level blocks: + +- **`budgets:`** — free-form `dict[str,int]` (`CFG.budgets.`). Convention: + `*_max_tokens` = summarization-trigger budget (context retained before + compression), **not** a generation cap. Also `max_steps`, `max_concurrency`. +- **`tasks.:`** — `TaskBudget`: `iterations` (def 1), `max_attempts` + (def 1), `max_steps` (def 15). Splatted into `add_task()`. +- **`agents.:`** — `AgentToolConfig`: `output_format` (def `json`; + `json|xml|yaml|markdown`), `with_graph_tools` (def F), `with_code_exec` (def F). +- **`observations:`** — `ObservationConfig` (projects worker-usage facts into the + planner). Default disabled. 
Fields: `enabled`, `track_tools`, `tracked_tools`, + `include_tool_errors`, `track_skills`, `track_files`, `track_file_paths`, + `track_coverage_gap`, `track_memories`, `malformed_only`, `in_record`, + `in_result`. Env overlay: `CONTRACTOR_EVAL_OBSERVATIONS` (JSON). + +### 3.1 Budgets per workflow (token / scalar) + +| Workflow | Budgets | +|---|---| +| oas_building | swe 100k, builder 100k, validator 100k | +| oas_enrichment | builder 120k, validator 120k | +| likec4_building | swe 100k, builder 120k | +| router | max_tokens 120k, max_steps 20 | +| exploitability | max_tokens 80k | +| trace_annotation | max_tokens 80k | +| trace_annotation_direct | max_tokens 100k | +| trace_graph | max_tokens 100k | +| trace_graph_pathpar | max_tokens 100k, **max_concurrency 3** | +| trace_verify | max_tokens 80k | +| vuln_assess | swe 100k, builder 100k, validator 100k | +| vuln_scan | scan 80k | +| vuln_scan_fast | scan 80k, swe 100k | +| vuln_scan_trace | scan 80k, trace 80k | + +### 3.2 Task budgets per workflow (`iterations` / `max_attempts` / `max_steps`) + +| Workflow.task | iters | attempts | steps | +|---|---|---|---| +| oas_building.dependency_information | 1 | 2 | 20 | +| oas_building.project_information | 1 | 2 | 20 | +| oas_building.oas_update | 2 | 4 | 20 | +| oas_building.oas_validate | 1 | 1 | 20 | +| oas_enrichment.oas_enrich | **3** | **6** | **30** | +| oas_enrichment.oas_validate | 2 | 2 | 20 | +| likec4_building.dependency_information | 1 | 3 | 20 | +| likec4_building.project_information | 1 | 3 | 20 | +| likec4_building.likec4_build | **3** | **6** | 20 | +| likec4_building.likec4_validate | 1 | 2 | 20 | +| exploitability.assess | 1 | 2 | 25 | +| trace_annotation.annotate | 1 | 3 | 20 | +| trace_verify.verify | 1 | 2 | 20 | +| vuln_assess.* | (identical to oas_building) | | | +| vuln_scan.scan | 1 | 2 | **75** | +| vuln_scan_fast.dependency_information | 1 | 2 | 20 | +| vuln_scan_fast.project_information | 1 | 2 | 20 | +| vuln_scan_fast.scan | 1 | 2 | **50** 
| +| vuln_scan_trace.scan | 1 | 2 | **75** | +| vuln_scan_trace.trace | 1 | 1 | 30 | + +> `trace_annotation_direct`, `trace_graph`, `trace_graph_pathpar` have no `tasks` +> block (they run agents directly, not via TaskRunner). + +### 3.3 Agent tool options per workflow + +- `with_graph_tools: true` — every declared `trace_agent` and `codereview_agent` + (router, trace_annotation, trace_*, vuln_scan*, vuln_scan_trace). +- `with_code_exec: true` — only `exploitability_agent` (exploitability). +- `output_format: json` — everywhere; no workflow uses xml/yaml/markdown. + +### 3.4 Observations per workflow + +11 workflows enable the "lean + file-paths" config (`enabled:true`, +`include_tool_errors:false`, `track_file_paths:true`). Disabled (no block): +`trace_annotation_direct`, `trace_graph`, `trace_graph_pathpar`. No workflow +enables `track_coverage_gap`, `track_memories`, `malformed_only`, or +`include_tool_errors` (A/B showed error counts hurt). + +--- + +## 4. Task templates (`contractor/tasks/*.yml`) + +`TaskTemplate.load` parses 8 body fields (`contractor/runners/models.py`): +`name`, `objective`, `instructions`, `output_format`, `artifacts` (def `[]`), +`skills` (def `[]`), `iterations` (def `1`), `format` (def `json`). Keys +`context`/`constraints` appear in some bodies but are **ignored by the loader** +(prose only). Manifest = `active:` + `versions:` map of `vN → {file}`. Version +resolution: explicit arg > `CONTRACTOR_TASK_VERSION_` env override (e.g. +`CONTRACTOR_TASK_VERSION_TRACE_ANNOTATION=v3`) > manifest `active:`. + +Run-time budgets (set by assembler / `add_task`, not the YAML body): +`iterations` (def 1), `max_attempts` (def `max(1, iterations)`), `max_steps` +(def 15). Resolution: `eff_max_attempts < iterations` raises. 
+ +| Template | Active | iters | format | skills | +|---|---|---|---|---| +| dependency_information | v1 | 1 | json | — | +| project_information | v1 | 1 | json | — | +| oas_enrich | v2 | 1 | json | — | +| oas_update | v2 | 1 | json | — | +| oas_validate | v1 | 1 | json | — | +| likec4_build | v1 | 1 | json | likec4 | +| likec4_validate | v2 | 1 | json | likec4 | +| threat_analysis | v1 | 1 | json | stride | +| trace_annotation | **v3** | 1 | json | — | +| trace_verify | v1 | 1 | json | — | +| exploitability_assessment | v4 | 1 | json | — | +| vuln_scan | v3 | 1 | json | vuln_scan | +| vuln_scan_fast | v1 | 1 | json | vuln_scan | + +> No template declares `iterations > 1` or `max_attempts` — those come from +> per-workflow `config.yaml` (§3.2). `format` is `json` everywhere. + +Non-active versions exist for: `exploitability_assessment` (v1–v4), +`oas_enrich`/`oas_update` (v1–v2), `likec4_validate` (v1–v2), `vuln_scan` (v1–v3), +`trace_annotation` (v1, v2, `shannon`). + +--- + +## 5. Planner & subtask state machine + +`contractor/agents/planning_agent/agent.py` + `contractor/tools/tasks/`: + +| Name | Default | Controls | +|---|---|---| +| `max_steps` | `15` | `task_tools(max_tasks=…)` + `<>` prompt token | +| `_format` | `json` | Subtask/memory serialization | +| `use_output_schema` | `False` | Schema-constrained worker output | +| `worker_instrumentation` | `True` | Attach Subtask I/O schemas + trailer | +| `RepeatedToolCallCallback(threshold=2)` | `2` | Blocks identical repeated tool calls (planner) | +| decompose children | `1–3` | Children per decompose | +| `finish` guard | — | Requires ≥1 `done`, no `new` subtasks remaining | + +Flow: `CFG.budgets.max_steps` → `TaskInvocation.max_steps` (15) → `TaskRunner` +→ `build_planning_agent(max_steps)` → `task_tools` + `<>` prompt token. + +Subtask states: `new, done, incomplete, malformed, skipped, decomposed`. 
+Transitions: `new`→[done,incomplete,malformed,skipped]; +`malformed`/`incomplete`→[skipped,decomposed]; `done`/`decomposed`/`skipped` +terminal. Active planner prompt: **v5** (versions: pentestgpt, v5, v4, v3, v2, v1). + +--- + +## 6. Callbacks (`contractor/callbacks/`) + +The **live** worker callback stack: `TokenUsage → SummarizationLimit → +[FunctionResultsRemoval] → InvalidToolCallGuardrail → RepeatedToolCall` +(`worker_factory.py`). Most knobs are constructor args, **not** env-routed. + +### 6.1 Active + +| Name | Value | Controls | +|---|---|---| +| `SummarizationLimitCallback.max_tokens` | **80000** (worker_factory) | Token threshold → injects "summarize progress" message | +| `SummarizationLimitCallback.summarization_key` | `total` | Which counter the threshold measures | +| `RepeatedToolCallCallback.threshold` | worker **5**, planner **2** | Identical-call advisory trigger | +| `MandatoryToolCallback.max_nudges` | def 2, exploit **3** | Nudges to call required verdict tool | +| `FunctionResultsRemovalCallback.keep_last_n` | 0 (→ worker 15 via `elide_keep_last_n`) | Count axis of heavy-result retention | +| `FunctionResultsRemovalCallback.keep_budget_chars` | 0 (← `FS_HEAVY_KEEP_BUDGET_CHARS`) | Char-budget axis | +| `FunctionResultsRemovalCallback.deduplicate` | True | Elide stale duplicate tool results | +| `build_worker(with_elide=…)` | True | Register the elision callback at all | + +### 6.2 Defined but **NOT wired** (dormant knobs) + +These have full implementations but no callsite in any agent factory — token / +rate hard-caps are effectively inactive; only the soft 80k summarization nudge +is live. 
+ +| Name | Knobs | Status | +|---|---|---| +| `ThinkingBudgetGuardrailCallback` | `token_budget`, `token_budget_key=total` | Not instantiated | +| `ToolMaxCallsGuardrailCallback` | `max_calls`, `rvalue` | Not instantiated | +| `TpmRatelimitCallback` | `tpm_limit`, `tpm_limit_key=input`, window 60s (hardcoded) | Not instantiated | +| `RpmRatelimitCallback` | `rpm_limit`, window 60s (hardcoded) | Not instantiated | + +### 6.3 Plugins (`contractor/runners/plugins/`) + +No numeric tunables. `AdkMetricsPlugin`: `result_error_detector` (heuristic), +`_args_hash` digest length 16 (must match `analyze_metrics.py`), prefix +`metrics`. `AdkTracePlugin` prefix `trace`. `SandboxCleanupPlugin` name +`sandbox_cleanup`. + +--- + +## 7. CLI flags (`cli/main.py`) + +| Flag | Default | Controls | +|---|---|---| +| `--workflow` | `oas_build` | Workflow to run (Choice from registry) | +| `--project-path` | **required** | Target root = FS sandbox root + default output dir | +| `--folder-name` | `/` | Project-relative folder injected into templates | +| `--artifact` | `None` | Existing OpenAPI seed (UTF-8 validated) | +| `--user-id` | `cli-user` | ADK session / artifact store key | +| `--model` | `DEFAULT_MODEL_NAME` | LiteLLM proxy alias | +| `--timeout` | `DEFAULT_MODEL_TIMEOUT` (300) | Per-request model timeout (s) | +| `--prompt` | `None` | Prompt for router; required with `--no-ui` | +| `--rm` | `False` | Remove prior artifacts (excl. with `--resume`) | +| `--resume` | `False` | Resume from `/checkpoint.json` | +| `-o/--output` | `None` (→ `/.contractor`) | Artifacts + `metrics.jsonl` dir | +| `--no-ui` | `False` | Plain stdout instead of live UI | + +> No per-budget/iteration CLI flags — those are per-workflow `config.yaml` only. +> FS sandbox (`cli/fs.py`) takes only `root_path`; symlink-follow hard-disabled, +> `..` rejected. `MetricsSink` has no tunables (fixed `metrics.jsonl`). + +--- + +## 8. 
Eval overrides (env, `tests/eval/`) + +### 8.1 Gating / selection + +| Env var | Default | Controls | +|---|---|---| +| `CONTRACTOR_RUN_EVAL` | unset (skip) | Enable eval suite (`-m eval` also bypasses) | +| `CONTRACTOR_EVAL_MODEL` | project default | Override eval model alias (timeout 600) | +| `CONTRACTOR_EVAL_RUN_STAMP` | `mmdd-HHMMSS` UTC | Per-run archive namespace | +| `CONTRACTOR_EVAL_RESULTS_DIR` | `eval_runs/` | Results output dir | +| `CONTRACTOR_EVAL_CASE_IDS` | all | Comma-separated case-id subset | +| `CONTRACTOR_EVAL_OBSERVATIONS` | unset | JSON overlay of `observations:` block (A/B) | + +### 8.2 Trace agent eval + +| Env var | Default | Controls | +|---|---|---| +| `CONTRACTOR_EVAL_TRACE_PROMPT_VERSION` | per-case | Pin trace_agent prompt version | +| `CONTRACTOR_EVAL_TRACE_PASS_AT` | `1` | pass@N loop count | +| `CONTRACTOR_EVAL_WITH_OAS` | off | Feed the OpenAPI spec into the prompt as an attack-surface map (X1 A/B) | +| `CONTRACTOR_TASK_VERSION_TRACE_ANNOTATION` | unset (→ `active: v3`) | Pin trace task-template version (generic mechanism, §4) | + +### 8.3 Vuln detection eval + +| Env var | Default | Controls | +|---|---|---| +| `CONTRACTOR_EVAL_VULN_AGENT` | `vuln_scan` | Which vuln agent to eval | +| `CONTRACTOR_EVAL_VULN_PASS_AT` | `3` | pass@N | +| `CONTRACTOR_EVAL_VULN_PROMPT_VERSION` | unset | Pin prompt variant | +| `CONTRACTOR_EVAL_VULN_MIN_RECALL` | `0.15` | Min recall to pass | +| `CONTRACTOR_EVAL_VULN_MIN_PRECISION` | `0.10` | Min precision to pass | +| `CONTRACTOR_EMITTED_VS_READ` | off | Emitted-vs-read scoring mode | +| `CONTRACTOR_VULN_DEDUP` | off | Finding dedup | + +### 8.4 Exploitability & XBOW + +| Env var | Default | Controls | +|---|---|---| +| `CONTRACTOR_EVAL_EXPLOIT_PROMPT_VERSION` | per-case | Pin exploit prompt | +| `CONTRACTOR_EVAL_PROXY` | unset | HTTP proxy for exploit attempts | +| `CONTRACTOR_XBOW_BENCHMARKS` | `DEFAULT_XBOW_IDS` | Benchmark id subset | +| `CONTRACTOR_XBOW_AGENT` | all | Restrict to one agent | +| 
`XBOW_MAX_TOKENS` | `80_000` (hardcoded) | XBOW token cap | + +### 8.5 Runtime targets (used by exploit/vuln workflows) + +| Env var | Default | Controls | +|---|---|---| +| `CONTRACTOR_TARGET_URL` | unset (stage skipped) | Target base URL for exploit stage | +| `CONTRACTOR_PROXY` | unset | HTTP proxy (exploitability workflow) | + +### 8.6 Harness arg defaults (not env, but tunable) + +harness `run_agent` default `timeout_s=600.0`; per-eval overrides: trace 900, +planner / project-info / vuln-detection 1800, threat / oas_enrich / oas_build +2400, oas_analyzer / xbow 1500, exploitability 300–900 (per-case `timeout_s` +in meta.yaml wins where supported). A/B drivers (`scripts/ab_*.py`): +`AB_FIXTURE(S)`, `AB_ARMS`, `AB_TIMEOUT` (21600/3600/3000), `AB_PER_PATH_TIMEOUT` +(900/420), `AB_MAX_ATTEMPTS` (2). + +--- + +## 9. LiteLLM deploy (`deploy/litellm/`) + +### `litellm_config.yaml` + +- **model_list** aliases: `lm-studio-nemotron`, `lm-studio-openai`, + `lm-studio-qwen3.5`, `lm-studio-glm`, `lm-studio-qwen3.5-opus`, + `lm-studio-qwen3.5-hauhau`, `lm-studio-qwen3.6` (**project default**), + `lm-studio-qwen3.6-mtp`, `lm-studio-qwen3.6-27b-mtp`. +- Per-model `litellm_params`: `model`, `api_key` (`lm-studio`), `api_base` + (`http://localhost:1234/v1`), `tpm` (`1000000`), `rpm` (`20`). +- `litellm_settings`: `num_retries` `3`, `request_timeout` `300`. + +### `run.sh` + +`LITELLM_MASTER_KEY` (`sk-litellm-changeme`), `LITELLM_SALT_KEY` +(`sk-random-hash-changeme`), `--network=host`, config bind-mount. Image +`ghcr.io/berriai/litellm:main-stable`. diff --git a/docs/arch.likec4 b/docs/arch.likec4 index a8e8aaf..fa02fa3 100644 --- a/docs/arch.likec4 +++ b/docs/arch.likec4 @@ -18,10 +18,10 @@ model { technology "Python + Click" description ''' - Entry point exposed as the `contractor` command. -- Accepts project path, model, pipeline selection, optional artifact, folder name, prompt, output directory. -- Validates input paths and incompatible option combinations (e.g. 
`--artifact` with non-enrich/trace pipelines). +- Accepts project path, model, workflow selection (`--workflow`), optional artifact, folder name, prompt, output directory. +- Validates input paths; `--artifact` is optional — workflows that need a seed can reuse one already in the artifact store. - Resolves the selected model/provider route through LiteLLM-compatible configuration. -- Builds the pipeline by selecting a `Pipeline` subclass from the registry and constructing a `PipelineContext`. +- Builds the workflow by selecting a `Workflow` subclass from the registry and constructing a `WorkflowContext`. - Registers on_event handlers so emitted runner/plugin events can be normalised and exported. - Writes final artifacts and metrics.jsonl to the output directory after the run completes. ''' @@ -30,16 +30,22 @@ model { pipelineRegistry = component "Pipeline Registry" { technology "Python class registry" description ''' -- Maps pipeline names to `Pipeline` subclasses (`contractor/pipelines/__init__.py::get_pipelines`). -- Current pipelines: - `oas_build` → `OasBuildingPipeline` - `oas_update` → `OasEnrichmentPipeline` - `likec4` → `LikeC4BuildingPipeline` - `trace` → `TraceAnnotationPipeline` (planner-driven) - `trace-direct` → `TraceAnnotationDirectPipeline` (single-agent, AgentRunner) - `trace-graph` → `TraceGraphPipeline` (trace-direct + graph tools) - `trace-verify` → `TraceVerifyPipeline` (per-finding static verifier) - `router` → `RouterPipeline` (prompt-driven dispatcher) +- Maps workflow names to `Workflow` subclasses (`contractor/workflows/__init__.py::get_workflows`). 
+- Current workflows: + `oas_build` → `OasBuildingWorkflow` + `oas_update` → `OasEnrichmentWorkflow` + `exploit` → `ExploitabilityWorkflow` (per-finding live verification) + `likec4` → `LikeC4BuildingWorkflow` + `trace` → `TraceAnnotationWorkflow` (planner-driven) + `trace-direct` → `TraceAnnotationDirectWorkflow` (single-agent, AgentRunner) + `trace-graph` → `TraceGraphWorkflow` (trace-direct + graph tools) + `trace-graph-pathpar` → `TraceGraphPathParWorkflow` (trace-graph, paths in parallel) + `trace-verify` → `TraceVerifyWorkflow` (per-finding static verifier) + `vuln-assess` → `VulnAssessWorkflow` (discovery → OAS → trace → exploit) + `vuln-scan` → `VulnScanWorkflow` + `vuln-scan-fast` → `VulnScanFastWorkflow` (high-recall scan → dedup → confirm) + `vuln-scan-trace` → `VulnScanTraceWorkflow` + `router` → `RouterWorkflow` (prompt-driven dispatcher) ''' } @@ -49,6 +55,7 @@ model { - Adds `dependency_information` task using SWE worker. - Adds `project_information` task using SWE worker with dependency artifact as input. - Adds `oas_update` task using OAS builder worker with multiple iterations and upstream artifacts. +- Adds `oas_validate` task using the OAS linter worker as a verification pass. ''' } @@ -57,6 +64,7 @@ model { description ''' - Optionally pre-saves user-supplied OpenAPI text as `oas-openapi-building` artifact. - Adds `oas_enrich` task using OAS builder worker. +- Adds `oas_validate` task using the OAS linter worker as a verification pass. ''' } @@ -76,7 +84,7 @@ model { tracePipeline = component "Trace Pipeline" { technology "TaskRunner per operation (planner-driven)" description ''' -- `TraceAnnotationPipeline`: full planner+worker stack — for each OpenAPI path it +- `TraceAnnotationWorkflow`: full planner+worker stack — for each OpenAPI path it spawns a `TaskRunner` and adds a `trace_annotation` task per operation. 
- The TaskRunner runs the planning agent as root; the planner delegates annotation subtasks to the trace agent (wrapped as a tool via workerAdapter). @@ -91,7 +99,7 @@ model { traceDirectPipeline = component "Trace-Direct Pipeline" { technology "AgentRunner per operation (single-agent)" description ''' -- `TraceAnnotationDirectPipeline`: bypasses the planner entirely. +- `TraceAnnotationDirectWorkflow`: bypasses the planner entirely. - One `trace_agent` invocation per OpenAPI operation, driven by `AgentRunner`. - Same overlay / artifact contract as `trace`, but cheaper and faster because there is no planner loop or subtask state machine. @@ -101,7 +109,7 @@ model { traceGraphPipeline = component "Trace-Graph Pipeline" { technology "AgentRunner + trailmark call-graph tools" description ''' -- `TraceGraphPipeline`: thin variant of `trace-direct` that constructs +- `TraceGraphWorkflow`: thin variant of `trace-direct` that constructs `trace_agent` with `with_graph_tools=True`. - Trailmark parses the project lazily on the first agent call and the cached engine is reused across operations. @@ -113,7 +121,7 @@ model { traceVerifyPipeline = component "Trace-Verify Pipeline" { technology "TaskRunner — per-finding fan-out" description ''' -- `TraceVerifyPipeline` (OpenAnt Stage-2 style): downstream of `trace` / `trace-direct`. +- `TraceVerifyWorkflow` (OpenAnt Stage-2 style): downstream of `trace` / `trace-direct`. - For each OpenAPI path, loads the per-path `VulnerabilityReport` artifact and queues one task per finding for `trace_verifier_agent`. - Verifier is code-evidence-only — no HTTP probes. @@ -155,16 +163,16 @@ model { - Minimal runner for a single `LlmAgent` invocation: no planner, no subtask state machine. - Renders one task template, attaches metrics/trace plugins, runs the agent once against an ADK session, returns the final response. 
-- Used by `trace-direct`, `trace-graph` (one invocation per OpenAPI operation) and - by `router` for the trailing free-form router run. +- Used by `trace-direct`, `trace-graph`, and `trace-graph-pathpar` (one invocation + per OpenAPI operation) and by `router` for the trailing free-form router run. ''' } planningAgent = container "Planning Agent" { technology "Google ADK LlmAgent + LiteLLM" description ''' -- Root agent for every TaskRunner-driven pipeline (build, enrich, likec4, trace, - trace-verify, router). +- Root agent for every TaskRunner-driven pipeline (oas_build, oas_update, likec4, + trace, trace-verify, router). - Strict subtask state machine: add → execute → done | incomplete | malformed | skipped. - On incomplete/malformed: must call `decompose_subtask` (1-3 children) — silent retries are forbidden. @@ -319,7 +327,7 @@ model { - Read: `ls`, `glob`, `grep`, `read_file`, `search_def`. - Write (where allowed): `insert_line`, `edit`, `replace_range`, plus the trace-specific `annotate_trace` / `annotate_validate` / `annotate_sink`. -- Filesystem is always sandboxed: build/enrich/likec4/router see a `RootedLocalFileSystem`; +- Filesystem is always sandboxed: oas_build/oas_update/likec4/router see a `RootedLocalFileSystem`; the trace family additionally wraps it in `MemoryOverlayFileSystem` so writes are captured as an artifact diff and never reach the host tree. - Optional graph tools (`call_graph`, `callers_of`, `callees_of`) are attached on @@ -424,7 +432,7 @@ model { } localProject = storage "Target Project Directory" { - technology "Local filesystem (rooted, read-only for build/enrich; overlaid for trace/likec4)" + technology "Local filesystem (rooted, read-only for oas_build/oas_update; overlaid for trace/likec4)" description "Repository/codebase being analysed by worker and trace agents." 
} @@ -450,8 +458,8 @@ model { user -> cli "runs contractor" cli -> pipelineRegistry "selects pipeline class by name" - pipelineRegistry -> buildPipeline "build" - pipelineRegistry -> enrichPipeline "enrich" + pipelineRegistry -> buildPipeline "oas_build" + pipelineRegistry -> enrichPipeline "oas_update" pipelineRegistry -> likec4Pipeline "likec4" pipelineRegistry -> tracePipeline "trace" pipelineRegistry -> traceDirectPipeline "trace-direct" @@ -462,9 +470,9 @@ model { cli -> outputDir "exports final artifacts and metrics.jsonl" cli -> litellmProxy "resolves model alias" - buildPipeline -> taskRunner "queue dependency_information / project_information / oas_update" + buildPipeline -> taskRunner "queue dependency_information / project_information / oas_update / oas_validate" enrichPipeline -> artifactStore "optionally seed oas-openapi-building" - enrichPipeline -> taskRunner "queue oas_enrich" + enrichPipeline -> taskRunner "queue oas_enrich / oas_validate" likec4Pipeline -> overlayfs "mount overlay rooted at DEFAULT_LIKEC4_PATH" likec4Pipeline -> artifactStore "seed/save likec4-architecture.c4" @@ -751,12 +759,13 @@ views { dynamic buildRun { title "Dynamic — build pipeline run" - user -> cli "Invoke `contractor --pipeline build`" - cli -> pipelineRegistry "Resolve build pipeline" - pipelineRegistry -> buildPipeline "Construct OasBuildingPipeline" + user -> cli "Invoke `contractor --workflow oas_build`" + cli -> pipelineRegistry "Resolve build workflow" + pipelineRegistry -> buildPipeline "Construct OasBuildingWorkflow" buildPipeline -> taskRunner "Enqueue dependency_information" buildPipeline -> taskRunner "Enqueue project_information" buildPipeline -> taskRunner "Enqueue oas_update (multi-iteration)" + buildPipeline -> taskRunner "Enqueue oas_validate" taskRunner -> runnerCore "Run planning agent as root" runnerCore -> planningAgent "Execute subtask state machine" planningAgent -> workerAdapter "Delegate current subtask" @@ -773,9 +782,9 @@ views { dynamic 
traceRun { title "Dynamic — trace pipeline run (planner-driven)" - user -> cli "Invoke `contractor --pipeline trace --artifact openapi.yaml`" - cli -> pipelineRegistry "Resolve trace pipeline" - pipelineRegistry -> tracePipeline "Construct TraceAnnotationPipeline" + user -> cli "Invoke `contractor --workflow trace --artifact openapi.yaml`" + cli -> pipelineRegistry "Resolve trace workflow" + pipelineRegistry -> tracePipeline "Construct TraceAnnotationWorkflow" tracePipeline -> artifactStore "Load oas-openapi-building" tracePipeline -> overlayfs "Initialise memory overlay over project" tracePipeline -> taskRunner "Per-path: queue trace_annotation per operation" @@ -796,9 +805,9 @@ views { dynamic traceDirectRun { title "Dynamic — trace-direct / trace-graph run" - user -> cli "Invoke `contractor --pipeline trace-direct` (or trace-graph)" - cli -> pipelineRegistry "Resolve direct trace pipeline" - pipelineRegistry -> traceDirectPipeline "Construct TraceAnnotationDirectPipeline" + user -> cli "Invoke `contractor --workflow trace-direct` (or trace-graph)" + cli -> pipelineRegistry "Resolve direct trace workflow" + pipelineRegistry -> traceDirectPipeline "Construct TraceAnnotationDirectWorkflow" traceDirectPipeline -> artifactStore "Load oas-openapi-building" traceDirectPipeline -> overlayfs "Initialise memory overlay" traceDirectPipeline -> agentRunner "Per-operation: invoke trace_agent directly" @@ -814,9 +823,9 @@ views { dynamic traceVerifyRun { title "Dynamic — trace-verify run (Stage-2 verifier)" - user -> cli "Invoke `contractor --pipeline trace-verify --artifact openapi.yaml`" - cli -> pipelineRegistry "Resolve trace-verify pipeline" - pipelineRegistry -> traceVerifyPipeline "Construct TraceVerifyPipeline" + user -> cli "Invoke `contractor --workflow trace-verify --artifact openapi.yaml`" + cli -> pipelineRegistry "Resolve trace-verify workflow" + pipelineRegistry -> traceVerifyPipeline "Construct TraceVerifyWorkflow" traceVerifyPipeline -> artifactStore "Load 
per-path VulnerabilityReports from prior trace run" traceVerifyPipeline -> taskRunner "Per-finding: queue trace_verify task" taskRunner -> runnerCore "Run planning agent with trace_verifier_agent as worker" @@ -854,7 +863,7 @@ views { // Notes: // 1. The planning agent is the root agent for every TaskRunner-driven pipeline -// (build, enrich, likec4, trace, trace-verify, router). It always sits between +// (oas_build, oas_update, likec4, trace, trace-verify, router). It always sits between // the TaskRunner and the worker agents. // 2. `trace-direct` and `trace-graph` bypass the planner entirely — they use the // bare `AgentRunner` (`contractor/runners/agent_runner.py`) and run `trace_agent` diff --git a/docs/eval-tuning.md b/docs/eval-tuning.md index ab7c946..f56082d 100644 --- a/docs/eval-tuning.md +++ b/docs/eval-tuning.md @@ -7,6 +7,8 @@ by precision/recall/verdict numbers instead of guesses. Pairs with [tuning.md](tuning.md) (the full knob inventory). This doc is narrower: **which knobs are worth sweeping in evals, where to inject them, and the concrete experiments to run** — prioritizing sampling (temperature/top_p) and tool params. +Status: the Tier 1–2 plumbing has since landed via `Settings` env vars (see §2/§5); +the harness `max_tokens` literal is the one remaining gap. --- @@ -14,11 +16,19 @@ experiments to run** — prioritizing sampling (temperature/top_p) and tool para | Capability | Where | Note | |------------|-------|------| -| Model override | `conftest.py:248` `eval_model` fixture | `CONTRACTOR_EVAL_MODEL` → `LiteLlm(model=..., timeout=600)`. **Single chokepoint — every eval agent gets this object.** | -| Prompt-version override | trace/exploit/vuln tests | `CONTRACTOR_EVAL_*_PROMPT_VERSION`. | -| Toolset toggle | `trace_harness.py:231`, `vuln_scan_harness.py:60` | `with_graph_tools: bool` (graph/trailmark tools on/off). 
| -| Pass@N | `test_vuln_detection_eval.py:160-207` | `CONTRACTOR_EVAL_VULN_PASS_AT` (default 3); passes if any attempt clears thresholds. | -| Score thresholds | vuln test | `CONTRACTOR_EVAL_VULN_MIN_RECALL` / `_MIN_PRECISION`. | +| Eval gate | `conftest.py` `pytest_collection_modifyitems` | Eval tests auto-skip unless `-m eval` (word-matched in the `-m` expression) or `CONTRACTOR_RUN_EVAL` is truthy (`1/true/yes/on` — `0`/`false` stay off). | +| Model override | `conftest.py:279` `eval_model` fixture | `CONTRACTOR_EVAL_MODEL` → `LiteLlm(model=..., timeout=600)`; unset → the shared `DEFAULT_MODEL`. **Single chokepoint — every eval agent gets this object.** | +| Sampling override | `Settings` (`contractor/utils/settings.py`) | `MODEL_TEMPERATURE` / `MODEL_TOP_P` env vars flow through `build_model()` into `DEFAULT_MODEL`. Caveat: the `CONTRACTOR_EVAL_MODEL` path builds a bare `LiteLlm` that does **not** carry them — sweep sampling via `DEFAULT_MODEL_NAME` + `MODEL_*` instead. | +| Prompt-version override | trace/exploit/vuln tests | `CONTRACTOR_EVAL_*_PROMPT_VERSION` (env wins over the case's `prompt_version`, else the manifest's `active`). | +| Task-version override | `runners/models.py` | `CONTRACTOR_TASK_VERSION_` (e.g. `..._TRACE_ANNOTATION=v3`) pins a task-template version for task/pipeline evals. | +| Toolset toggle | `trace_harness.py:228`, `vuln_scan_harness.py:62` | `with_graph_tools: bool`. Trace cases set it per-case (`with_graph_tools`, default **true**); the vuln test passes `True`. | +| Pass@N | vuln + trace tests | `CONTRACTOR_EVAL_VULN_PASS_AT` (default 3), `CONTRACTOR_EVAL_TRACE_PASS_AT` (default 1); passes if any attempt clears thresholds. | +| Score thresholds | vuln test | `CONTRACTOR_EVAL_VULN_MIN_RECALL` (0.15) / `_MIN_PRECISION` (0.10). | +| Observations A/B | `tools/observations.py` | `CONTRACTOR_EVAL_OBSERVATIONS` (JSON) overlays any workflow's `observations:` block — flip arms without editing config.yaml. 
| +| OAS-in-prompt arm | `test_trace_agent_eval.py` | `CONTRACTOR_EVAL_WITH_OAS=1` feeds the expected OpenAPI spec as an attack-surface map (default off). | +| Agent select / proxy / subsets | vuln + exploit tests | `CONTRACTOR_EVAL_VULN_AGENT` (`vuln_scan`\|`trace`), `CONTRACTOR_EVAL_PROXY`, `CONTRACTOR_EVAL_CASE_IDS` (exploit task eval). | +| Result envelope | `tests/eval/results.py` (`EvalSink` / `write_eval_results`) | One `eval/v1` envelope per `(scenario, unit)` under `eval_runs/`; `CONTRACTOR_EVAL_RUN_STAMP` names the archive dir, `CONTRACTOR_EVAL_RESULTS_DIR` redirects vuln records. `scripts/rebuild_eval_envelope.py` re-aggregates a unit's envelope from per-case `metrics.json` when fixtures ran in separate sessions. | +| Timeout handling | `tests/eval/harness.py` | `run_agent(timeout_s=...)` (trace cases: per-case `timeout_s`, default 900) raises `AgentRunTimeout` carrying the **partial** `AgentRun` (`timed_out=True`) — a timeout is always a failed attempt, never a silent pass, but stays inspectable. | | N-run aggregation | `scripts/probe_variance.py` | mean/stdev/min/max per metric across samples. | **Scoring is deterministic** (set-matching precision/recall/f1; verdict equality) — @@ -26,9 +36,9 @@ no LLM judge — so any sweep produces directly comparable numbers. Variance com only from the model's own non-determinism, which is exactly what the sampling sweep should quantify. -**Gaps:** sampling params are unset (every agent runs at backend-default temperature), -`max_tokens` is hardcoded `80_000` in all three harnesses, and tool caps (graph result -limits, body-preview chars, fs listing breadth) are baked into factories with no eval hook. +**Remaining gap:** `max_tokens` is still hardcoded `80_000` in all three harnesses +(`trace_harness.py:264`, `vuln_scan_harness.py:83,100`, `exploitability_harness.py:86,101,116`). +Sampling and tool caps are no longer gaps — both are Settings/env-routed (see below). 
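
A minimal sketch of how the remaining `max_tokens` literal could become env-driven while keeping the guardrails from §5 (default equals today's constant, explicit errors instead of `assert`). The helper name `resolve_max_tokens` is hypothetical, and `CONTRACTOR_EVAL_MAX_TOKENS` is only the *proposed* variable from the Tier 3 table; nothing in the harnesses reads it today.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Today's hardcoded harness value; kept as the default so an unset
# environment changes nothing for production or existing evals.
DEFAULT_MAX_TOKENS = 80_000

# Proposed (not yet implemented) variable name from the Tier 3 table.
MAX_TOKENS_ENV = "CONTRACTOR_EVAL_MAX_TOKENS"


def resolve_max_tokens(env: Mapping[str, str] | None = None) -> int:
    """Hypothetical helper: pick the summarization-trigger budget.

    Reads the proposed env var, falls back to the current constant, and
    raises ValueError (never assert) on a value that is not a positive int.
    """
    source = os.environ if env is None else env
    raw = source.get(MAX_TOKENS_ENV)
    if raw is None or not raw.strip():
        return DEFAULT_MAX_TOKENS
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(
            f"{MAX_TOKENS_ENV} must be an integer, got {raw!r}"
        ) from exc
    if value <= 0:
        raise ValueError(f"{MAX_TOKENS_ENV} must be positive, got {value}")
    return value
```

Each harness would then pass `max_tokens=resolve_max_tokens()` where it currently passes the literal. Taking the mapping as a parameter is a design choice for testability, not something the existing code is known to do.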
--- @@ -36,67 +46,47 @@ limits, body-preview chars, fs listing breadth) are baked into factories with no Each config is a bundle of overrides. Group by tuning ROI: -### Tier 1 — Sampling (highest ROI, smallest plumbing) +### Tier 1 — Sampling (DONE — Settings-routed) -| Knob | Proposed env var | Sweep values | Hypothesis | +| Knob | Env var (implemented) | Sweep values | Hypothesis | |------|------------------|--------------|------------| -| `temperature` | `CONTRACTOR_EVAL_TEMPERATURE` | `0.0, 0.2, 0.4, 0.7` | Low temp ↑ schema adherence on structured tasks (OAS/trace/verdict) → fewer `malformed`/retries; high temp ↑ recall on open-ended vuln scan. | -| `top_p` | `CONTRACTOR_EVAL_TOP_P` | `1.0, 0.9, 0.8` | Nucleus trim stabilizes output without flattening exploration as hard as temperature. | -| (optional) `reasoning_effort` / thinking budget | `CONTRACTOR_EVAL_REASONING` | backend-specific | Trade latency for depth on vuln_scan / exploitability. | - -**Injection — one place.** Extend `eval_model` (`conftest.py:240-253`) so the -`LiteLlm` carries sampling kwargs. ADK's `LiteLlm` forwards extra kwargs to -`litellm.completion`: - -```python -@pytest.fixture(scope="session") -def eval_model() -> LiteLlm: - name = os.environ.get("CONTRACTOR_EVAL_MODEL") - extra = {} - if (t := os.environ.get("CONTRACTOR_EVAL_TEMPERATURE")) is not None: - extra["temperature"] = float(t) - if (p := os.environ.get("CONTRACTOR_EVAL_TOP_P")) is not None: - extra["top_p"] = float(p) - if name: - return LiteLlm(model=name, timeout=600, **extra) - if not extra: - return DEFAULT_MODEL - return LiteLlm(model=DEFAULT_MODEL.model, timeout=600, **extra) # clone w/ sampling -``` - -Because every harness pulls its model from this fixture, this single change makes -sampling sweepable across **all** eval types. (Verify `LiteLlm` forwards the kwargs -in the installed ADK version before trusting the numbers.) 
- -### Tier 2 — Tool params (prioritized per request) - -These shape what the agent can *see* and how much it costs. Sweep targets, by the -eval that exercises them: - -| Knob | Source (prod) | Proposed env var | Sweep values | Exercised by | Hypothesis | +| `temperature` | `MODEL_TEMPERATURE` | `0.0, 0.2, 0.4, 0.7` | Low temp ↑ schema adherence on structured tasks (OAS/trace/verdict) → fewer `malformed`/retries; high temp ↑ recall on open-ended vuln scan. | +| `top_p` | `MODEL_TOP_P` | `1.0, 0.9, 0.8` | Nucleus trim stabilizes output without flattening exploration as hard as temperature. | +| (optional) `reasoning_effort` / thinking budget | still unwired | backend-specific | Trade latency for depth on vuln_scan / exploitability. | + +**Injection landed in `Settings`, not the eval fixture.** `model_temperature` / +`model_top_p` (`contractor/utils/settings.py`) are forwarded by `build_model()` to +every `LiteLlm` it constructs — including `DEFAULT_MODEL`, which `eval_model` +returns when `CONTRACTOR_EVAL_MODEL` is unset. So sampling sweeps work across +**all** eval types *and* production runs with two env vars; default `None` keeps +backend defaults. Caveat: the `CONTRACTOR_EVAL_MODEL` override path constructs a +bare `LiteLlm(model=..., timeout=600)` without the sampling kwargs — to sweep +sampling on a non-default model, set `DEFAULT_MODEL_NAME` instead. + +### Tier 2 — Tool params (DONE — Settings-routed) + +These shape what the agent can *see* and how much it costs. 
All caps below now live +in `Settings` (tool constructors fall back to `get_settings()` when no explicit value +is passed), so each is sweepable via plain env vars — no `CONTRACTOR_EVAL_*` plumbing: + +| Knob | Env var (implemented) | Default | Sweep values | Exercised by | Hypothesis | |------|---------------|------------------|--------------|--------------|------------| -| `with_graph_tools` | `trace_harness.py:231` | `CONTRACTOR_EVAL_GRAPH_TOOLS` | `0 / 1` | trace, vuln (trace mode) | Graph/trailmark tools ↑ cross-file recall on large fixtures (confirms the v7+graph memory result). | -| graph `DEFAULT_MAX_RESULTS` | `tools/code/graph.py:44` | `CONTRACTOR_EVAL_GRAPH_MAX_RESULTS` | `100, 200, 400` | trace, vuln | More symbol hits ↑ recall but ↑ context/token cost. | -| graph `DEFAULT_MAX_PATHS` / `_MAX_PATH_DEPTH` | `graph.py:45,47` | `CONTRACTOR_EVAL_GRAPH_MAX_PATHS` / `_DEPTH` | paths `15/25/40`, depth `20/30` | trace | Deeper call-path enumeration ↑ taint-flow recall; risk of blow-up/noise. | -| HTTP `body_preview_chars` | `tools/http.py` (512 in exploit agents) | `CONTRACTOR_EVAL_HTTP_BODY_PREVIEW` | `256, 512, 2048` | exploitability, web | Bigger preview ↑ evidence quality for verdicts vs. token burn. | -| fs `max_items` / `max_output` | `tools/fs/read_tools.py:58-59` | `CONTRACTOR_EVAL_FS_MAX_ITEMS` / `_OUTPUT` | items `100/250`, output `80k/160k` | all code evals | Wider listings ↑ discovery on big repos vs. context cost. | - -**Injection.** Two patterns: -1. **Already-a-param** (`with_graph_tools`): thread an env read into each test's call, - mirroring `_resolve_prompt_version`. Trivial. -2. **Baked-into-factory** (graph limits, body-preview, fs caps): cleanest is to make the - tool/graph constructors read an optional override (default = today's constant), then - set it in a `conftest` autouse fixture from the env var. Keep the production default - unchanged so non-eval runs are untouched. This is the bulk of the Tier-2 work. 
+| `with_graph_tools` | — (per-case key in `trace-cases.json`, default **on**; vuln test passes `True`) | on | `false` per case to ablate | trace, vuln (trace mode) | Graph/trailmark tools ↑ cross-file recall on large fixtures (the v7+graph result — now the production default). | +| graph max results | `GRAPH_MAX_RESULTS` | `200` | `100, 200, 400` | trace, vuln | More symbol hits ↑ recall but ↑ context/token cost. | +| graph max paths / depth | `GRAPH_MAX_PATHS` / `GRAPH_MAX_PATH_DEPTH` | `25` / `30` | paths `15/25/40`, depth `20/30` | trace | Deeper call-path enumeration ↑ taint-flow recall; risk of blow-up/noise. | +| HTTP `body_preview_chars` | `HTTP_BODY_PREVIEW_CHARS` (512 in exploit agents) | `2048` | `256, 512, 2048` | exploitability, web | Bigger preview ↑ evidence quality for verdicts vs. token burn. | +| fs `max_items` / `max_output` / line cap | `FS_MAX_ITEMS` / `FS_MAX_OUTPUT` / `FS_MAX_READ_LINES` | `100` / `50k` / `2000` | items `100/250`, output `50k/80k/160k` | all code evals | Wider listings ↑ discovery on big repos vs. context cost. (The 80KB-vs-50KB/2000-line A/B was inconclusive — the cap rarely binds on small fixtures.) | ### Tier 3 — Agentic budgets (already partly wired) -| Knob | Source | Proposed env var | Sweep values | Note | +| Knob | Source | Env var | Sweep values | Note | |------|--------|------------------|--------------|------| -| `max_tokens` (summarization trigger) | hardcoded `80_000` in 3 harnesses | `CONTRACTOR_EVAL_MAX_TOKENS` | `60k, 80k, 100k, 120k` | Replace the literals; bigger budget ↑ cross-file reasoning on large fixtures, ↑ cost. | -| `elide_keep_last_n` | `worker_factory.py:56` (15) | `CONTRACTOR_EVAL_ELIDE_KEEP` | `8, 15, 25` | Recall of earlier tool output vs. token cost. | -| planner `max_steps` | per-pipeline | `CONTRACTOR_EVAL_MAX_STEPS` | `15, 30, 75` | Only relevant for task-runner evals (oas/vuln_assess), not single-agent harnesses. 
| -| `pass@N` | exists (vuln) | `CONTRACTOR_EVAL_VULN_PASS_AT` | `1, 3, 5` | Already wired; extend the pattern to trace/exploit if you want pass@N there. | +| `max_tokens` (summarization trigger) | still hardcoded `80_000` in 3 harnesses | *(proposed)* `CONTRACTOR_EVAL_MAX_TOKENS` | `60k, 80k, 100k, 120k` | Replace the literals; bigger budget ↑ cross-file reasoning on large fixtures, ↑ cost. **The one un-landed item.** | +| `elide_keep_last_n` / char budget | `worker_factory.py` (15 / off) | `FS_HEAVY_KEEP_LAST_N` / `FS_HEAVY_KEEP_BUDGET_CHARS` (implemented; >0 overrides the caller) | keep `8, 15, 25`; budget per QW3 | Recall of earlier tool output vs. token cost (QW3 byte-retention: −44% tokens). | +| planner `max_steps` | per-pipeline `config.yaml` | — (edit YAML) | `15, 30, 75` | Only relevant for task-runner evals (oas/vuln_assess), not single-agent harnesses. | +| `pass@N` | vuln + trace | `CONTRACTOR_EVAL_VULN_PASS_AT` (default 3) / `CONTRACTOR_EVAL_TRACE_PASS_AT` (default 1) | `1, 3, 5` | Wired for both; extend the pattern to exploit if you want pass@N there. | +| observations arms | workflow `observations:` blocks | `CONTRACTOR_EVAL_OBSERVATIONS` (JSON overlay, implemented) | lean / +file-paths / +tool-errors / off | Lean+file-paths is the current production arm; tool-error counts measurably hurt. | +| task-prompt version | `contractor/tasks/.yml` manifests | `CONTRACTOR_TASK_VERSION_` (implemented) | registered versions | A/B task-template variants in task/pipeline evals (e.g. trace_annotation `v3`, now active). | --- @@ -108,34 +98,38 @@ Each row is a config worth committing as a reproducible experiment. Run with | Config name | Overrides | Target fixtures | Metric to watch | Question it answers | |-------------|-----------|-----------------|-----------------|---------------------| -| `baseline` | none (backend-default sampling) | all | p/r/f1, success rate, malformed count | Current production behavior + its variance floor. 
| -| `det-t0` | `TEMPERATURE=0` | trace, oas, exploit | malformed/retry count, verdict accuracy | Does greedy decoding cut format churn without hurting recall? | -| `det-t0-p09` | `TEMPERATURE=0`, `TOP_P=0.9` | trace, oas | f1, variance (stdev across N) | Best stability/quality point for structured output. | -| `explore-t07` | `TEMPERATURE=0.7` | vuln_scan (pass@5) | recall@5, unique findings union | Does higher temp + more attempts ↑ vuln coverage? | -| `graph-on` | `GRAPH_TOOLS=1` | trace, vuln-trace (large: cloud-core, crapi) | recall, annotation count | Re-confirm graph-tools win; quantify token tax. | -| `graph-rich` | `GRAPH_TOOLS=1`, `GRAPH_MAX_PATHS=40`, `GRAPH_MAX_RESULTS=400` | large trace fixtures | recall vs. precision (noise) | Diminishing returns / precision loss from deeper graph. | -| `lean-context` | `MAX_TOKENS=60000`, `ELIDE_KEEP=8` | all | f1 delta vs. baseline, tokens/run | How cheap can we go before quality drops? | -| `rich-context` | `MAX_TOKENS=120000`, `ELIDE_KEEP=25` | large fixtures | recall delta, tokens/run | Is the cost of more context justified on big repos? | -| `evidence-rich` | `HTTP_BODY_PREVIEW=2048` | exploitability, web | verdict accuracy, evidence-present rate | Does more response body ↑ correct verdicts? | +| `baseline` | none (backend-default sampling, graph tools on) | all | p/r/f1, success rate, malformed count | Current production behavior + its variance floor. | +| `det-t0` | `MODEL_TEMPERATURE=0` | trace, oas, exploit | malformed/retry count, verdict accuracy | Does greedy decoding cut format churn without hurting recall? | +| `det-t0-p09` | `MODEL_TEMPERATURE=0`, `MODEL_TOP_P=0.9` | trace, oas | f1, variance (stdev across N) | Best stability/quality point for structured output. | +| `explore-t07` | `MODEL_TEMPERATURE=0.7` | vuln_scan (pass@5 via `CONTRACTOR_EVAL_VULN_PASS_AT=5`) | recall@5, unique findings union | Does higher temp + more attempts ↑ vuln coverage? 
| +| `graph-off` | `with_graph_tools: false` per trace case | trace, vuln-trace (large: cloud-core, crapi) | recall, annotation count | Ablate the now-default graph tools; quantify the token tax they pay for. | +| `graph-rich` | `GRAPH_MAX_PATHS=40`, `GRAPH_MAX_RESULTS=400` | large trace fixtures | recall vs. precision (noise) | Diminishing returns / precision loss from deeper graph. | +| `lean-context` | `MAX_TOKENS=60000` (once landed), `FS_HEAVY_KEEP_LAST_N=8` | all | f1 delta vs. baseline, tokens/run | How cheap can we go before quality drops? | +| `rich-context` | `MAX_TOKENS=120000` (once landed), `FS_HEAVY_KEEP_LAST_N=25` | large fixtures | recall delta, tokens/run | Is the cost of more context justified on big repos? | +| `evidence-rich` | `HTTP_BODY_PREVIEW_CHARS=2048` | exploitability, web | verdict accuracy, evidence-present rate | Does more response body ↑ correct verdicts? | Run sampling × context as a small grid on a fixed fixture subset first (cheap, fast -fixtures like `vampi`, `dsvw`, `vulnyapi`), then promote the winning config to the -expensive large fixtures. +fixtures like `realvuln-vampi`, `realvuln-dsvw`, `vulnyapi`), then promote the winning +config to the expensive large fixtures (`crapi-*`, `cvebench-*`). --- ## 4. Suggested workflow -1. **Land the plumbing** Tier 1 → Tier 2 → Tier 3 in that order (Tier 1 is ~3 lines - and unblocks the highest-ROI sweep immediately). +1. **Plumbing is landed for Tiers 1–2** (Settings env vars) — only the harness + `max_tokens` literal (Tier 3) remains. 2. **Pin a fixture subset** for fast iteration (small/medium fixtures) via - `-k` or `CONTRACTOR_EVAL_CASE_IDS`. -3. **Sample ≥3× per config** — sampling sweeps are meaningless single-shot; reuse + `-k` (any eval) or `CONTRACTOR_EVAL_CASE_IDS` (exploit task eval). +3. 
**Sample ≥3× per config** — sampling sweeps are meaningless single-shot; use + pass@N (`CONTRACTOR_EVAL_VULN_PASS_AT` / `CONTRACTOR_EVAL_TRACE_PASS_AT`) or `probe_variance.py` aggregation and report mean ± stdev, not one number. 4. **Compare on deterministic metrics** the harness already emits (precision/recall/f1, - success rate, verdict accuracy, annotation count, tokens/run). + success rate, verdict accuracy, annotation count, tokens/run); results land as + `eval/v1` envelopes under `eval_runs/` for analytics-ui. Run a unit's fixtures in + one pytest session (one combined envelope); if they ran separately, consolidate + with `scripts/rebuild_eval_envelope.py` — don't cp-snapshot run dirs. 5. **Promote the winner** to large fixtures, then update the production defaults - (`worker_factory.py` / pipeline constants / litellm config) — and the + (`Settings` field defaults / `config.yaml` budgets / litellm config) — and the [Don't switch models] rule still stands: tune sampling/tools/budgets, keep the project's default model. @@ -143,11 +137,13 @@ expensive large fixtures. ## 5. Plumbing checklist (smallest viable change set) -- [ ] `conftest.py` `eval_model`: read `CONTRACTOR_EVAL_TEMPERATURE` / `TOP_P`, pass to `LiteLlm`. **(Tier 1, unblocks sampling everywhere.)** -- [ ] Replace hardcoded `max_tokens=80_000` in `trace_harness.py:266`, `vuln_scan_harness.py:80,97`, `exploitability_harness.py:87,102,117` with an env-driven default. **(Tier 3.)** -- [ ] Env-drive `with_graph_tools` in the trace/vuln tests (mirror `_resolve_prompt_version`). **(Tier 2, trivial.)** -- [ ] Add optional override args to `tools/code/graph.py` constants + HTTP `body_preview_chars` + fs `max_items/max_output`, set from an autouse `conftest` fixture; keep production defaults unchanged. **(Tier 2, the real work.)** -- [ ] Document the new env vars in each test module's header docstring (matches existing convention). 
+- [x] Sampling env-driven — landed as `Settings.model_temperature` / `model_top_p` + (`MODEL_TEMPERATURE` / `MODEL_TOP_P` via `build_model()`), not in `eval_model`. + **(Tier 1 done.)** +- [ ] Replace hardcoded `max_tokens=80_000` in `trace_harness.py:264`, `vuln_scan_harness.py:83,100`, `exploitability_harness.py:86,101,116` with an env-driven default. **(Tier 3 — the only open item.)** +- [x] `with_graph_tools` — per-case key in trace cases (default on); vuln eval runs with it on. No env var; flip per case to ablate. **(Tier 2 done, differently than proposed.)** +- [x] Graph / HTTP / fs caps — landed as `Settings` fields (`GRAPH_MAX_*`, `HTTP_BODY_PREVIEW_CHARS`, `FS_MAX_*`); constructors fall back to `get_settings()`, production defaults unchanged. **(Tier 2 done.)** +- [x] Env vars documented in test-module header docstrings (e.g. `test_vuln_detection_eval.py`). > Guardrails to keep: production defaults must not change from these eval hooks (env-gated, > default = current constant), and no `assert` in the new override code — use explicit @@ -155,7 +151,8 @@ expensive large fixtures. --- -*Injection points verified against `tests/eval/conftest.py`, `trace_harness.py`, -`vuln_scan_harness.py`, `exploitability_harness.py`, and `tests/eval/scorers.py` / +*Injection points verified against `tests/eval/conftest.py`, `tests/eval/harness.py`, +`tests/eval/results.py`, `trace_harness.py`, `vuln_scan_harness.py`, +`exploitability_harness.py`, `contractor/utils/settings.py`, and `tests/eval/scorers.py` / `scoring.py`. 
Confirm `LiteLlm` kwarg forwarding in the installed ADK before trusting sampling numbers.* diff --git a/docs/insights-parallel-vuln-pipelines.md b/docs/insights-parallel-vuln-pipelines.md index 3e11e39..4542e2b 100644 --- a/docs/insights-parallel-vuln-pipelines.md +++ b/docs/insights-parallel-vuln-pipelines.md @@ -10,7 +10,7 @@ The trace workflow processes API operations sequentially — each operation runs a full LLM agent session (multiple tool calls, file reads, annotation writes). With 10-20 operations, wallclock time scales linearly. ### Approach: Path-Level Parallelism -Each API path gets its own forked `MemoryOverlayFileSystem`. Paths run concurrently via `asyncio.TaskGroup` with a semaphore (`max_concurrency=3`). Operations within a path remain sequential so sibling operations see each other's annotations. +Each API path gets its own forked `MemoryOverlayFileSystem`. Paths run concurrently via `asyncio.TaskGroup` with a semaphore (`max_concurrency`, default 3 — `budgets.max_concurrency` in the workflow's `config.yaml`). Operations within a path remain sequential so sibling operations see each other's annotations. ``` Before: path1 → path2 → path3 → path4 (serial, 4x wallclock) @@ -22,11 +22,11 @@ After: path1 ┐ ### Key Technical Decisions -**Overlay fork/merge pattern.** Each parallel fork starts from a snapshot of the shared overlay. After all forks complete, writes are merged back. Conflict resolution: when two forks modify the same file, take the version with the most content (more `@trace` annotations = more complete trace). In practice, conflicts are rare — different API paths trace different code. +**Overlay fork/merge pattern.** Each parallel fork starts from a snapshot of the shared overlay. After all forks complete, writes are merged back. Conflict resolution: when two forks modify the same file, take the version with the most content (more `@trace` annotations = more complete trace); identical content from multiple forks is not a conflict. 
Fork deletes only propagate when they post-date the fork point — `fork_overlay` records the pre-fork tombstone baseline. In practice, conflicts are rare — different API paths trace different code. Merge + persist run in a `finally`: a single failed path makes the `TaskGroup` cancel its siblings, but annotations from already-completed forks are still merged and saved instead of being lost. **Shared graph tools eliminate re-parse overhead.** Trailmark (call-graph engine) parses the project via tree-sitter on first use. Naively, each forked overlay triggers a separate parse — causing 7.7x slowdown in eval. Fix: build graph tools once from the base FS before forking, pass them to all agents via a new `graph_tools` parameter on `build_trace_agent`. Trailmark only reads the base FS (read-only), so sharing is safe. Result: overhead dropped from 637s to 72s. -**Per-fork `AgentRunner` instances.** `AgentRunner` stores `_on_event` as instance state — two concurrent `.run()` calls would overwrite each other. Each parallel path creates its own runner. +**Per-fork `AgentRunner` instances.** `AgentRunner` originally stored `_on_event` as instance state — two concurrent `.run()` calls would overwrite each other. `on_event` has since moved to a per-call `run()` parameter, but each parallel path still creates its own runner (with its own in-memory session service). ### Eval Results (fastapi fixture, 2 paths) diff --git a/docs/planner.md b/docs/planner.md new file mode 100644 index 0000000..c2b0aed --- /dev/null +++ b/docs/planner.md @@ -0,0 +1,750 @@ +# The Streamline Planner & Task Runner + +This document is a deep dive into the single most load-bearing mechanism in +Contractor: how a queued **task** is turned into a **planner + worker** loop, +how the planner decomposes work into **subtasks**, and how state, retries, and +artifacts flow through it. + +It complements [README.md](README.md) (the broader architecture tour). 
Where +that doc surveys all the layers, this one stays inside +[`contractor/runners/task_runner.py`](../contractor/runners/task_runner.py), +[`contractor/agents/planning_agent/`](../contractor/agents/planning_agent/), +and [`contractor/tools/tasks/`](../contractor/tools/tasks/). + +All diagrams are Mermaid and render on GitHub. + +--- + +## 1. The cast + +The planner is **not** the thing that reads code or calls HTTP. It is a +coordinator that decomposes an objective into verifiable subtasks and delegates +each one to a worker. Five objects collaborate: + +| Object | File | Role | +| ------ | ---- | ---- | +| **TaskRunner** | [`runners/task_runner.py`](../contractor/runners/task_runner.py) | Owns the task queue; for each task spawns a fresh planner+worker per attempt, runs the ADK loop, publishes artifacts, emits lifecycle events. | +| **Planning Agent** | [`agents/planning_agent/agent.py`](../contractor/agents/planning_agent/agent.py) | An ADK `LlmAgent` whose tools are the streamline-manager operations + memory tools. Driven by prompt [`prompts/v5.md`](../contractor/agents/planning_agent/prompts/v5.md). | +| **StreamlineManager** | [`tools/tasks/manager.py`](../contractor/tools/tasks/manager.py) | The deterministic core: holds the subtask list + current index in ADK session state, enforces the status state machine, appends execution records. The planner's tools are thin wrappers over it. | +| **Worker** | any `build_` | An `LlmAgent` (SWE, OAS builder, trace, …) `instrument_worker`-ed with `Subtask`/`SubtaskExecutionResult` schemas and wrapped as an `AgentTool`. | +| **Summarizer** | created in [`tools/tasks/tools.py`](../contractor/tools/tasks/tools.py) | A tool-less `LlmAgent` (shares the worker's model) that condenses the run into a handoff summary at `finish`. | + +```mermaid +flowchart TB + subgraph Runner["TaskRunner — runners/task_runner.py"] + Q["Task queue
<br/>(TaskInvocation list)"] + Loop["Per-task retry loop<br/>(max_attempts)"] + end + + subgraph Iter["One attempt = one ADK Runner run"] + Planner["Planning Agent (LlmAgent)<br/>build_planning_agent + prompt v5"] + Mgr["StreamlineManager<br/>(subtask list + idx)"] + Worker["Worker (LlmAgent → AgentTool)<br/>instrument_worker"] + Summ["task_summarizer<br/>(LlmAgent, no tools)"] + end + + Tools["Domain tools<br/>fs · code · http · memory · openapi · vuln"] + State[("ADK session state<br/>task::{id}::*")] + Mem[("Memory namespace<br/>
(shared planner↔worker)")] + + Q --> Loop --> Planner + Planner -->|"add / decompose / skip / finish"| Mgr + Planner -->|execute_current_subtask| Worker + Planner -->|finish| Summ + Worker --> Tools + Mgr -.->|"reads / writes"| State + Planner -.->|"memory tools"| Mem + Worker -.->|"memory tools"| Mem +``` + +The hard separation — **planner plans, worker acts** — is enforced by the prompt +("You NEVER read code, files, schemas, or HTTP yourself") and by construction: +the planner is only given the streamline + memory tools, never the domain tools. + +--- + +## 2. TaskRunner: the per-task lifecycle + +`TaskRunner.run()` walks its queue and calls `_run_task_with_retries` for each +`TaskInvocation`. A task is a *unit of retry*; each retry is an *attempt*; an +attempt that reaches terminal `done` is a *successful run*. A task is only +finished after `iterations` successful runs (cumulative across attempts, not +necessarily consecutive); attempts keep going until `max_attempts` is spent. + +```mermaid +flowchart TD + A["_run_task_with_retries(item)"] --> B["_render_task
<br/>(substitute vars/params/artifact texts)"] + B --> C["emit TASK_STARTED"] + C --> D["_inject_skills + _inject_artifacts<br/>(ONCE per task — invariant across attempts)"] + D --> E{"attempt ≤ max_attempts?"} + E -- no --> Z["emit TASK_FAILED<br/>raise TaskNotCompletedError"] + E -- yes --> F["_run_single_iteration"] + F -->|"raises (transient LLM/net/tool)"| G["emit ITERATION_RESULT(completed=false)<br/>→ consumes an attempt"] + G --> E + F -- returns --> H["emit ITERATION_RESULT"] + H --> I{"state[task::id::status] == DONE?"} + I -- no --> J["carry_state = result.carry_state"] + J --> E + I -- yes --> K["_publish_task_artifacts<br/>
under effective_artifact_key"] + K --> L{"successful_runs ≥ iterations?"} + L -- no --> J + L -- yes --> M["emit TASK_FINISHED → return"] +``` + +Three things worth calling out: + +- **Skills and inbox artifacts are injected once**, before the attempt loop — + the memory namespace, skill list, and artifact texts don't change between + retries, so re-injecting would just rewrite the same memory YAML. +- **An exception inside an iteration consumes an attempt** rather than aborting + the whole workflow. It is reported on `ITERATION_RESULT(completed=False)` with + the error type/message, and the loop continues. (`asyncio.CancelledError` is + the one exception — it unwinds the run.) +- **Artifacts publish under `effective_artifact_key`** — the template key by + default, or a per-invocation `artifact_key` for fan-out workflows that queue + many tasks from one template. See §7. + +### 2.1 One iteration + +`_run_single_iteration` is where a fresh planner is built and handed to an ADK +`Runner`. It seeds the session state, runs the agent until `finish` ends the +invocation, then reads the terminal state back out. + +```mermaid +sequenceDiagram + participant R as TaskRunner + participant ADK as ADK Runner + participant P as Planner (LlmAgent) + participant M as StreamlineManager + participant W as Worker (AgentTool) + participant S as Summarizer + + R->>R: _spawn_planning_agent → fresh planner + worker + R->>R: _build_task_initial_state
(build_active_state + carry, minus stale planner keys) + R->>ADK: create_session(state) + run_async(rendered task text) + loop planner turns, until finish() sets end_invocation + ADK->>P: model turn + P->>M: add_subtask / get_current_subtask / list_subtasks + P->>W: execute_current_subtask + W-->>P: SubtaskExecutionResult {task_id, status, output, summary} + Note over P,M: manager applies status, advances idx, appends record + alt status incomplete / malformed + P->>M: decompose_subtask (1–3 children) ·or· skip + end + end + P->>S: finish → summarize {objective, records, result, status} + S-->>P: summary text + P->>M: finish writes result / summary / status; end_invocation = True + ADK-->>R: final session state + R->>R: completed = (task::id::status == DONE) +``` + +The planner is **stateless across attempts**: `_spawn_planning_agent` builds a +brand-new planner+worker pair every iteration, and the manager scopes its +subtask list per ADK *invocation* (§6), so a retry always starts from an empty +plan — only the fixed task-scoped keys and inbox memory carry forward. + +--- + +## 3. The planner loop (prompt v5) + +The planner is an LLM following [`prompts/v5.md`](../contractor/agents/planning_agent/prompts/v5.md). +Its behaviour is an **action picker**: each turn it scans a priority-ordered +table and takes the first matching action. This is the streamline planner's +control flow. + +```mermaid +flowchart TD + Start(["planner turn"]) --> Q0{"any subtask exists?"} + Q0 -- no --> BS["BOOTSTRAP
<br/>read memory · add ≤ 70% of budget as subtasks"] + BS --> Start + Q0 -- yes --> Q1{"last worker result == done?"} + + Q1 -- yes --> Q1a{"open subtasks remain?"} + Q1a -- yes --> EXE["execute_current_subtask"] + Q1a -- no --> Q1b{"objective met?"} + Q1b -- yes --> FIN["finish(done)"] + Q1b -- no --> UG["UNMET-GOAL<br/>
add 1 corrective subtask · or finish(failed)"] + + Q1 -- no --> Q2{"current.status?"} + Q2 -- new --> EXE + Q2 -->|"incomplete / malformed"| Q3{"depth ≤ 1 & budget left?"} + Q3 -- yes --> DEC["decompose_subtask (1–3)"] + Q3 -- no --> SK["skip(structural_blocker / budget_exhausted)"] + Q2 -->|"provably obsolete"| SK2["skip(duplicate / out_of_scope)"] + + EXE --> Start + DEC --> Start + SK --> Start + SK2 --> Start + FIN --> Stop(["end_invocation"]) + UG --> Start +``` + +Key policies the prompt layers on top of the manager's mechanics: + +- **Budget discipline.** `<>` (`max_steps`, default 15) is the + total subtask budget; `add_subtask` *and* `decompose_subtask` both spend it. + Spend ≤ 70% on the initial plan, reserve ≥ 30% for adaptation. +- **Acceptance lines.** Every subtask description must end with + `Acceptance: `. This is what makes a subtask + *verifiable* — the worker has a concrete completion oracle. +- **Decompose to unblock, not to explore.** Over-decomposition is the primary + failure mode; the prompt repeatedly biases toward *executing* a focused + subtask over splitting it. +- **Depth limit of 1.** A subtask may be decomposed at most once. This is a + *prompt-level* rule (Rule 5) — the manager itself does not track depth, it + only enforces the budget and the status state machine. If a child of a + decomposed parent fails again, the planner is told to `skip` with a + `structural_blocker:`, not decompose again. + +> The depth-1 limit living in the prompt rather than the code is deliberate: the +> manager stays a pure state machine, and decomposition policy is tunable by +> swapping the prompt version without touching the runner. + +--- + +## 4. Subtasks: the state machine + +Every subtask moves through a strict lifecycle defined by +`SUBTASK_STATUS_TRANSITIONS` in +[`tools/tasks/models.py`](../contractor/tools/tasks/models.py). 
Invalid +transitions raise `InvalidStatusTransitionError`, which the tools surface back +to the planner as a tool error (never a crash). + +```mermaid +stateDiagram-v2 + [*] --> new: add_subtask + new --> done: worker → done + new --> incomplete: worker → incomplete + new --> malformed: parse fail / task_id mismatch / retries exhausted + new --> skipped: skip + + incomplete --> decomposed: decompose_subtask + incomplete --> skipped: skip (last-only OR budget exhausted) + malformed --> decomposed: decompose_subtask + malformed --> skipped: skip + + done --> [*] + skipped --> [*] + decomposed --> [*]: children proceed independently +``` + +| From | Allowed → | Notes | +| ---- | --------- | ----- | +| `new` | `done`, `incomplete`, `malformed`, `skipped` | The only executable state. Cannot be re-executed in place once resolved. | +| `incomplete` | `decomposed`, `skipped` | Worker made partial progress. Must decompose; `skip` only allowed if it's the last subtask **or** the budget is exhausted. | +| `malformed` | `decomposed`, `skipped` | Runtime fallback when worker output can't be parsed. Raw output is preserved in the record. | +| `done` / `skipped` / `decomposed` | — (terminal) | `decomposed` is the resolved parent state; only its children run. | + +The critical invariant: **`incomplete` and `malformed` can never be +re-executed** — only decomposed or skipped. Re-running a partially-failed +subtask in place is exactly the loop the streamline design exists to prevent. +(V8 in §8.2 tests *relaxing* this for `incomplete` only — a single in-place +retry — on the theory that many `incomplete`s on small models are transient, +not structural.) + +--- + +## 5. `execute_current_subtask`: delegation + parsing + +This is the bridge from planner to worker, in +[`tools/tasks/tools.py`](../contractor/tools/tasks/tools.py). It guards the +current subtask's status, calls the worker with a small retry budget, and +either applies a validated result or records a `malformed` fallback. 
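The guard, retry, and fallback steps described in this section can be sketched roughly as follows. This is a minimal illustration, not the code in `tools/tasks/tools.py`: `Result`, `parse_result`, and `execute_subtask` are hypothetical stand-ins, the worker is modelled as a plain async callable that returns a JSON string, and a `ValueError` stands in for the tool error the real tool surfaces to the planner.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Tuple

MAX_RECORD_FIELD_LEN = 20_000  # assumed to mirror the 20k-char record cap
VALID_STATUSES = {"done", "incomplete"}  # statuses a worker may self-report


@dataclass
class Result:
    """Hypothetical stand-in for SubtaskExecutionResult."""
    task_id: str
    status: str
    output: str
    summary: str


def parse_result(raw: str, expected_task_id: str) -> Optional[Result]:
    """Return a Result only if the reply is non-empty, parseable, and on-target."""
    if not raw or not raw.strip():
        return None  # empty response
    try:
        result = Result(**json.loads(raw))
    except (ValueError, TypeError):
        return None  # not JSON, or not the expected shape
    if result.status not in VALID_STATUSES or result.task_id != expected_task_id:
        return None  # wrong status vocabulary or task_id mismatch
    return result


async def execute_subtask(
    task_id: str,
    status: str,
    worker: Callable[[str], Awaitable[str]],
    n_retries: int = 3,
) -> Tuple[str, str]:
    """Return (new_status, stored_output) for the current subtask."""
    if status != "new":
        # Stand-in for the tool error: resolved or held subtasks cannot be re-run.
        raise ValueError(f"subtask {task_id} is {status!r}; decompose or skip first")
    raw = ""
    for _ in range(n_retries):  # total attempt budget, not extra tries
        raw = await worker(task_id)
        result = parse_result(raw, task_id)
        if result is not None:
            return result.status, result.output
    # Retries exhausted: keep the (truncated) raw output so it can be salvaged.
    return "malformed", raw[:MAX_RECORD_FIELD_LEN]
```

The point of the sketch is that every failure mode (empty, unparseable, wrong `task_id`) funnels into the same retry counter, and exhaustion produces a recorded `malformed` outcome rather than an exception.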
+ +```mermaid +flowchart TD + A["execute_current_subtask"] --> B{"current.status?"} + B -->|"malformed / incomplete"| E1["error: must decompose or skip first"] + B -->|"done / skipped / decomposed"| E2["error: no active subtask"] + B -- new --> C["build worker args
(Subtask JSON, or {request: …})"] + + C --> D{"attempt 1..n_retries (=3)"} + D --> RUN["worker.run_async(args)"] + RUN --> Q1{"empty response?"} + Q1 -- yes --> NEXT{"attempts left?"} + Q1 -- no --> Q2{"parses to SubtaskExecutionResult?"} + Q2 -- no --> NEXT + Q2 -- yes --> Q3{"task_id matches current?"} + Q3 -- no --> NEXT + Q3 -- yes --> OK["valid result → break"] + NEXT -- yes --> D + NEXT -- no --> MAL["malformed fallback"] + + OK --> APPLY["complete_current_subtask
apply status · advance idx · save record"] + MAL --> APPLY2["status = malformed
store raw output (truncated) · save record"] + + APPLY --> ACT{"result.status?"} + ACT -- done --> ADV["next subtask becomes current"] + ACT -- incomplete --> HOLD["idx held → planner must decompose/skip"] + APPLY2 --> HOLD2["planner must decompose/skip"] +``` + +Details that matter: + +- **`n_retries` (default 3) is the total attempt budget**, not extra tries on + top of a first call. A retry is triggered by an *empty*, *unparseable*, or + *`task_id`-mismatched* worker response — each is logged. +- **Workers are schema-instrumented.** `instrument_worker` sets + `worker.input_schema = Subtask` and `worker.output_schema = + SubtaskExecutionResult`, and appends a worker-instructions trailer (status + rules, output rules, a `done` and an `incomplete` example) to the worker's own + system prompt. So any agent in the repo becomes a planner-compatible worker + with no per-agent glue — and its reply is parsed deterministically into + `{task_id, status, output, summary}`. +- **Malformed is a first-class outcome, not a crash.** On retry exhaustion the + raw output is truncated (`_MAX_RECORD_FIELD_LEN`, 20k chars) and stored in the + record so the planner can still salvage partial information by decomposing. +- **Advancing the index.** On `done`/`skipped`/`decomposed` the manager advances + `idx` to the next subtask; on `incomplete`/`malformed` it holds, forcing the + planner to resolve before it can proceed. + +### 5.1 Decomposition layout + +`decompose_subtask` is *flat insert-after-parent*, not recursive tree surgery. 
+The parent transitions to `decomposed`, 1–3 children are inserted immediately +after it with dotted IDs, and the current index moves to the first child: + +``` +before: [ 0:done ] [ 1:incomplete* ] [ 2:new ] + │ decompose into 2 + ▼ +after: [ 0:done ] [ 1:decomposed ] [ 1.1:new* ] [ 1.2:new ] [ 2:new ] + ▲ idx now here +``` + +The total subtask count after insertion must not exceed the budget +(`max_tasks` / `max_steps`); the tool reports remaining capacity so the planner +can retry with fewer children instead of being wrongly told the budget is spent. + +### 5.2 `finish` and the summarizer + +`finish(status, result)` is the only way to set `task::{id}::status = done`. It +refuses `done` when **any subtask is still `new`**, when **no subtasks exist at +all**, or when **not a single subtask reached `done`** — three guards that stop +the planner declaring victory over an empty or all-failed plan. + +On a valid `finish`, a tool-less summarizer agent condenses the run into a +handoff summary. Its payload is capped to the most-recent `max_records` (20) +records, each truncated, so a long run can't blow the summarizer's context: + +```mermaid +flowchart LR + F["finish(status, result)"] --> G{"status == done?"} + G -- yes --> V{"has_done AND no 'new' AND has_any?"} + V -- no --> ERR["error: DO_NOT_FINISH_WITH_NO_TASKS_DONE"] + V -- yes --> SUM + G -->|failed| SUM["summarizer({objective, records[-20], result, status})"] + SUM --> WR["manager.finish:
state[result/summary/status] = …"] + WR --> END["tool_context end_invocation = True
(planner cannot emit more tool calls)"] +``` + +--- + +## 6. Session-state shape + +All planner/worker state lives in one flat ADK session-state dict. There are two +tiers of keys. + +**Fixed task-scoped keys** — written by the runner via `build_active_state`, +read by the runner to detect completion, written by `StreamlineManager.finish`: + +```python +{ + "_global_task_id": 0, + "task::0::objective": "...", # the rendered objective + "task::0::status": "running" | "done", + "task::0::current": None, # current-subtask pointer + "task::0::result": "", # written by finish + "task::0::summary": "", # written by finish + "task::0::pool": [ ...records ], # appended per executed subtask +} +``` + +**Planner-internal subtask keys** — owned entirely by `StreamlineManager`, keyed +*per ADK invocation* (`_state_key`): + +```python +"task::{gid}::{invocation_id}::{name}::tasks" # the subtask list +"task::{gid}::{invocation_id}::{name}::idx" # current index +``` + +Because each attempt is a new ADK invocation, the `{invocation_id}` segment +differs every retry, so a fresh attempt starts with an empty plan — and +`_build_task_initial_state` explicitly strips the previous attempt's deep +planner keys (anything under `task::{id}::` with a further `::`) from the carried +state, keeping only the fixed contract above. This is the boundary that lets the +planner own its keyspace while the runner only ever reads the terminal +`status`/`result`/`summary`. 
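As a rough illustration of the stripping rule above, the filter below drops any key under `task::{id}::` that contains a further `::` and keeps everything else. The function name `strip_planner_keys` is hypothetical, and the real `_build_task_initial_state` may do more than this; the sketch only shows the key-shape test.

```python
def strip_planner_keys(state: dict, task_id: int) -> dict:
    """Keep the fixed task-scoped contract; drop deeper planner-internal keys."""
    prefix = f"task::{task_id}::"
    kept = {}
    for key, value in state.items():
        remainder = key[len(prefix):] if key.startswith(prefix) else None
        if remainder is not None and "::" in remainder:
            # e.g. task::0::{invocation_id}::{name}::tasks from a prior attempt
            continue
        kept[key] = value
    return kept


carried = strip_planner_keys(
    {
        "_global_task_id": 0,
        "task::0::objective": "audit the API",
        "task::0::status": "running",
        "task::0::inv-1::planner::tasks": ["stale plan"],
        "task::0::inv-1::planner::idx": 2,
    },
    task_id=0,
)
print(sorted(carried))
```

Because the test is purely on key shape, the runner never needs to know the planner's invocation id or agent name to clean up after it.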
+ +```mermaid +flowchart LR + subgraph Fixed["Fixed contract (runner ↔ manager)"] + O["task::0::objective"] + ST["task::0::status"] + RS["task::0::result"] + SM["task::0::summary"] + PL["task::0::pool"] + end + subgraph Internal["Planner-internal (manager only, per invocation)"] + TK["task::0::{inv}::name::tasks"] + IX["task::0::{inv}::name::idx"] + end + Runner -->|writes| O + Runner -->|reads| ST + Runner -->|reads| RS + Manager -->|writes| ST + Manager -->|writes| RS + Manager -->|writes| SM + Manager -->|appends| PL + Manager -->|owns| TK + Manager -->|owns| IX +``` + +--- + +## 7. Artifacts: how a task hands off to the next + +When an attempt completes, `_publish_task_artifacts` persists three artifacts +under the invocation's key via `save_result_artifacts` +([`runners/artifacts.py`](../contractor/runners/artifacts.py)): + +``` +{key}/result ← finish's `result` text +{key}/summary ← the summarizer's output +{key}/records ← the JSON-encoded execution records (the pool) +``` + +`{key}` defaults to the template key; fan-out workflows that queue several tasks +from one template pass a unique per-invocation `artifact_key` so the tasks don't +clobber each other. A downstream task declares `artifacts: ["/result", …]`; +the runner loads those texts and re-injects them into the next task's memory +namespace tagged `inbox` / `previous-task-result` (via `_inject_artifacts`). +This artifact pool is the *only* channel between tasks — there are no shared +globals. + +```mermaid +flowchart LR + T1["Task A
(planner+worker)"] -->|"finish"| AP[("Artifact pool<br/>A/result · A/summary · A/records")] + AP -->|"declared as input"| INJ["_inject_artifacts<br/>(inbox memories)"] + INJ --> T2["Task B
memory namespace seeded"] +``` + +--- + +## 8. Variations worth testing + +Everything above describes **one point** in a large design space. The streamline +planner makes a specific, defensible set of choices — but most of them are +hypotheses, not laws, and the project's mission (getting useful work out of +small 27–80b models via context-decomposition) makes them worth measuring rather +than assuming. The variations below are the ones that change *how* decomposition +and worker-judging happen — not knobs like `max_steps` (those are already +sweepable; see [tuning.md](tuning.md)). + +Each is a distinct hypothesis with a metric that can confirm or kill it, run +through the same `eval/v1` pass@N harness as the baseline. + +### 8.1 Control-flow / plan-shape variants + +| # | Variant | What changes vs. baseline | Hypothesis (small-model lens) | Metric | +| - | ------- | ------------------------- | ----------------------------- | ------ | +| **V0** | **Direct (no planner)** | `AgentRunner`, single worker, no subtask machine (already exists for `trace-direct`) | The decomposition tax isn't worth it on small/medium tasks | f1 + tokens/run — the honest floor | +| **V1** | **Plan-once** | `decompose_subtask` removed from the toolset **and** a paired `plan_once` prompt (prompt v5's action table references decomposition throughout — tool and prompt must change together; see §9.3); the planner lays out the whole plan upfront, no mid-run re-planning | Reactive decomposition is mostly churn/loops on a small model; upfront planning is cheaper and no worse | malformed/retry count, steps/task, f1 | +| **V2** | **ReAct / interleaved** | Drop the explicit subtask list; think→act→observe loop, subtasks emerge | Committing to a plan before the model has seen the code hurts; emergent beats pre-planned | f1, recall, step-budget hit-rate | +| **V7** | **Proactive complexity-gated decompose** | Planner estimates subtask size *before* executing and splits big ones upfront, rather than 
waiting for `incomplete` | Catches "too big to finish in one worker pass" before the wasted attempt | first-pass `done` rate, malformed count | +| **V9** | **Worker-proposed decomposition** | Extend `SubtaskExecutionResult` with an optional `suggested_subtasks`; on `incomplete` the worker (which just read the code) proposes the split, and the planner may adopt, edit, or ignore it | The planner decomposes blind — it never touches domain tools, so it structurally lacks the information for a good split; the worker has it. This is the information-flow fix V7 only approximates. | child first-pass `done` rate vs. baseline decomposition | + +> **V2 attribution caveat.** ReAct changes more than control flow — it dissolves +> the schema-instrumented planner/worker boundary, the malformed-fallback +> machinery (§5), *and* the records pool at once. Worth running (it tests the +> paradigm), but if it wins you won't cleanly know *why* — budget follow-up +> ablations. + +### 8.2 Trust the worker's verdict — three challengers + +The baseline commits to a single decision here: **the planner trusts the +worker's self-reported `status`**. `execute_current_subtask` parses the reply +and advances on `done`; nothing re-checks whether the deliverable actually +satisfies the subtask. V3, V4, and V8 are three challengers to that *one* +decision at three price points — re-ask a judge, ask N workers, or re-ask the +same worker once. Test them as one axis (trust mechanism), not three unrelated +variants. + +| # | Variant | What changes vs. baseline | Hypothesis | Metric | +| - | ------- | ------------------------- | ---------- | ------ | +| **V3** | **Critic gate** | A verifier gates a `done` result, at one of three scopings (cheapest first): (a) **acceptance-line** — verify only that the evidence satisfies the subtask's one `Acceptance:` line; (b) **finish-gate** — one verifier call per task, gating the final result before `finish` succeeds; (c) **per-subtask** — gate every `done`. 
Reuse `trace_verifier_agent`. | Catches over-claiming (precision) and silent misses. The acceptance-line scope is far easier for a small model than open-ended verification — and you already force an `Acceptance:` line that nothing currently machine-checks. | precision, verdict accuracy, false-`done` rate, cost per scope | +| **V4** | **Best-of-N worker** | Run the worker N× on the same subtask; planner merges/picks (self-consistency) | The recall lever for hard fixtures (the crApi-workshop BOLA/injection misses) | recall@N, unique-findings union, cost | +| **V8** | **Re-execute-once on `incomplete`** | Allow exactly one in-place re-execution of an `incomplete` subtask, with the prior record injected as context ("you tried this and got X"); a second `incomplete` falls back to decompose/skip as today. A tiny FSM change (`incomplete → new`, once). | Many `incomplete`s on small models are *stochastic* (sampling noise, a flaky tool call), not structural; forcing decomposition on a transient failure spends budget for nothing. | fraction of re-executions reaching `done` vs. tokens saved over decomposing | + +> **V8 relaxes the §4 invariant** for exactly one retry; the loop-prevention +> rationale still holds at the second failure. A high re-execution success +> fraction also weakens V7's case — the failures it splits were transient, not +> too-big. + +### 8.3 Context-passing variants + +| # | Variant | What changes vs. 
baseline | Hypothesis | Metric | +| - | ------- | ------------------------- | ---------- | ------ | +| **V5** | **DAG / dependency-scheduled** | Subtasks declare dependencies; the runner topo-schedules and runs independent siblings in parallel (reuses the `trace_graph_pathpar` overlay fork/merge machinery) | Strict sequential execution wastes wallclock when subtasks are independent | wallclock, f1 parity | +| **V6** | **Rolling-summary context** | Replace the last-20 records pool fed to `get_records` with a continuously-compressed running summary | The records pool bloats context on long plans; continuous compression keeps a small model on-task | f1 on large fixtures, tokens/run | + +### 8.4 Retry / resume + +The baseline rebuilds an **empty plan every attempt** (§6) — a retry discards +every `done` subtask and redoes its work. + +| # | Variant | What changes vs. baseline | Hypothesis | Metric | +| - | ------- | ------------------------- | ---------- | ------ | +| **V10** | **Plan carry-forward across attempts** | `_build_task_initial_state` keeps the previous attempt's subtask list — `done` subtasks preserved, the failing one reset or pre-decomposed — turning *retry* into *resume* | On multi-iteration / multi-attempt tasks this may be the single biggest tokens/run reduction available | tokens/run on multi-attempt tasks, f1 parity | + +> Risk: carrying forward a *poisoned* plan. Mitigate by carrying forward only +> when the previous attempt failed at the `finish` stage (the plan was sound, the +> output wasn't) rather than mid-plan — or gate it behind its own env flag. + +--- + +## 9. Decisions after the current implementation + +The current implementation is the baseline. Moving beyond it is a sequence of +deliberate decisions — what the baseline already commits to, which challenger to +build first, and the seam that makes any challenger A/B-able without forking the +workflows. 
+ +### 9.1 What the baseline already commits to + +Each variant above revisits one of these committed decisions. Naming them makes +the experiments honest — you're testing a *decision*, not just trying a knob: + +| Decision (today) | Embodied in | Revisited by | +| ---------------- | ----------- | ------------ | +| Plan is a flat, ordered list, executed strictly sequentially | `StreamlineManager` idx advance | V2, V5 | +| Decomposition is **reactive** (only on `incomplete`/`malformed`) and **flat** (insert-after-parent, depth-1 by prompt) | §5.1, prompt Rule 5 | V1, V7, V9 | +| The planner **trusts the worker's self-reported `status`** — one pass, no re-check of the deliverable | `execute_current_subtask` | **V3 · V4 · V8** — one decision, three price points (a *trust-mechanism* axis: judge / ask-N / re-ask-once) | +| Context to the worker = seeded planner state + last-20 records pool | `get_records`, §6 | V6 | +| Whole-task retry rebuilds an **empty plan** each attempt | §2, §6 | V10 | + +### 9.2 Recommended sequence + +Two things come *before* any variant; then build cheapest-and-highest-signal +first. + +1. **Seam first** (§9.3) — nothing is A/B-able without it. +2. **Baseline telemetry next** (§10.1) — several of this doc's hypotheses are + checkable from baseline counters *before* a single variant is built. That + tells you which variants are even worth building. +3. **V0 vs V1 as a 2×2 against baseline.** V0-vs-baseline answers "does planning + pay at all"; V1-vs-baseline answers "does *reactive* planning pay over + upfront." Together they decompose the decomposition tax into its two + components. Both cheap — run them together. +4. **V8 (re-execute-once)** jumps the queue: it's nearly free (one FSM transition + + one prompt edit), targets the documented primary failure mode + (over-decomposition), and its telemetry is already needed for V1's analysis. +5. **V3 at the cheapest scope first** — finish-gate or acceptance-line, not + per-subtask. 
One verifier call per task may capture most of the + precision/miss win at a fraction of the cost; only escalate scope if it + doesn't. +6. **V9 piggybacks** on whichever decompose-heavy variant survives. + +**V2 (ReAct)** stays the architectural wildcard — run it to test the paradigm, +but read it with the attribution caveat in §8.1. **V4 / V5 / V6 / V10** are +second-wave (costlier to build and run). V5 is attractive because it reuses the +`trace_graph_pathpar` overlay fork/merge machinery rather than inventing +scheduling. + +> V3, V4, and V8 all challenge the same committed decision (§9.1) at different +> prices. Run them as a single *trust-mechanism* axis in the eval matrix, not as +> three unrelated experiments — the comparison you want is across price points +> for the same win. + +### 9.3 The enabling seam (build this first, once) + +None of these are A/B-able through the pass@N harness until the planner is +swappable the way the **worker** already is. Today the worker is a +`worker_builder` partial on `TaskInvocation`, but the **planner is hardwired** — +`_spawn_planning_agent` imports and calls `build_planning_agent` directly. The +seam is symmetric to the worker one: + +**The contract a strategy must honour is small.** Per §6, the runner reads only +the fixed keys `task::{id}::status/result/summary/pool` and stops when the agent +sets `end_invocation`. So *any* planner strategy — even one with no +`StreamlineManager` at all (V2) — drives the runner, retry loop, artifact +publishing, and eval harness unchanged, **as long as it writes those keys and +ends the invocation**. That is the entire interface of `planner_builder`. + +```mermaid +flowchart LR + ENV["CONTRACTOR_PLANNER_STRATEGY
(env — mirrors CONTRACTOR_TASK_VERSION_*)"] --> REG["planner strategy registry"] + TI["TaskInvocation.planner_builder<br/>(new partial; default = streamline)"] --> SPAWN["_spawn_planning_agent"] + REG --> SPAWN + SPAWN --> V0["streamline<br/>(baseline)"] + SPAWN --> V1["plan_once"] + SPAWN --> V2["react"] + SPAWN --> V3["critic"] + V0 --> EVAL["eval/v1 envelope · pass@N
(same fixtures, same scorer)"] + V1 --> EVAL + V2 --> EVAL + V3 --> EVAL +``` + +Concretely: + +1. **Add a `planner_builder` partial to `TaskInvocation`** (mirror + `worker_builder`), defaulting to today's `build_planning_agent`. + `_spawn_planning_agent` calls it instead of importing `build_planning_agent`. +2. **Make each registry entry a bundle `(builder, prompt_version, toolset)`, not + a bare builder.** Prompts travel with strategies: prompt v5's action table + references decomposition throughout, so V1 ("plan-once") is *not* just + `decompose_subtask` removed — drop the tool while keeping v5 and you get a + planner that calls a tool that no longer exists, and the A/B confounds a + prompt mismatch with the strategy. Parameterize the planner's prompt version + (today `build_planning_agent` hardcodes `load_prompt("planning_agent")` at + import) so the paired prompt ships with the strategy. +3. **Route by env** — `CONTRACTOR_PLANNER_STRATEGY=streamline|plan_once|react|critic|…`, + exactly the pattern already used for `CONTRACTOR_TASK_VERSION_` and + prompt versions. A sweep becomes one env var; results land in the same + `eval/v1` envelope, and the strategy becomes an axis in the experiment matrix. +4. **Keep the promotion discipline.** Production stays on `streamline` until an + eval promotes a challenger — same rule as prompt-version naming: register the + variant, leave the default active until the numbers say otherwise. Don't + overfit a variant to a fixture's quirks (general planner behaviour only, not + benchmark-specific decomposition). + +> This keeps every variant honest (same fixtures, same scorer, same pass@N) and +> keeps a single mechanism — strategy-by-env — for the whole class of +> experiments, rather than a branch per idea. + +### 9.4 The other three seams (execution / records / scheduler) + +`planner_builder` (§9.3) decides *which* root agent, tools, and prompt drive an +attempt — that covers V1, V2, V7. 
The remaining variants change what happens +*inside* the loop, and need three more injectable seams. Each is orthogonal to +the others and to `planner_builder`, and each **defaults to today's behaviour**, +so the `streamline` bundle stays byte-identical. A `PlannerStrategy` is the +composition of all of them: + +```mermaid +flowchart TB + STRAT["PlannerStrategy bundle (one registry entry)"] + STRAT --> PB["planner_builder + prompt_version + toolset — §9.3"] + STRAT --> EP["execution_policy"] + STRAT --> RP["records_policy"] + STRAT --> SC["scheduler"] + PB --> SPAWN["_spawn_planning_agent"] + EP --> TT["task_tools → execute_current_subtask"] + RP --> GR["task_tools → get_records / save_record"] + SC --> MGR["StreamlineManager.get_current_subtask + advance"] + SC --> RUN["TaskRunner: parallel exec (V5 only)"] +``` + +**Execution policy** — *what `execute_current_subtask` does around the worker +call.* Today that closure (in `tools/tasks/tools.py`) is hardcoded: a parse-retry +loop (`n_retries`) ending in either a validated `SubtaskExecutionResult` or the +malformed fallback (§5). Extract that core behind an injected policy: + +```python +class ExecutionPolicy(Protocol): + async def execute( + self, *, current: Subtask, worker: AgentTool, + fmt: SubtaskFormatter, tool_context: ToolContext, n_retries: int, + ) -> SubtaskExecutionResult | None: ... # None → malformed fallback +``` + +| Variant | Policy behaviour | +| ------- | ---------------- | +| default | today's parse-retry loop | +| **V8** re-execute-once | on an `incomplete` result, re-invoke the worker once with the prior output appended to args; surrender `incomplete` only if the second pass also fails | +| **V4** best-of-N | run the worker N×, then merge / pick before returning one result | +| **V3** critic (per-subtask) | after a `done` result, run the verifier; downgrade to `incomplete` if it fails | + +Injected via `task_tools(..., execution_policy=...)`. 
The **finish-gate** scope of +V3 is a *different* hook — it wraps the `finish` closure (verify the final +`result` before `status=done` is written), not the per-subtask path — so the +cheapest V3 lands in `finish`, not the execution policy. + +**Records policy** — *what the planner and worker see as history.* Today +`get_records` returns `pool[-max_records:]` and `finish` summarizes the same +slice once. Extract the view: + +```python +class RecordsPolicy(Protocol): + def on_record(self, record: dict) -> None: ... # optional incremental update + def view(self, pool: list, *, max_records: int) -> list | str: ... +``` + +| Variant | Policy behaviour | +| ------- | ---------------- | +| default | `view = pool[-max_records:]`; `on_record` is a no-op | +| **V6** rolling summary | `on_record` folds each record into a running compressed summary; `view` returns that summary instead of the raw tail | + +Injected via `task_tools(..., records_policy=...)`; `on_record` hooks the +`StreamlineManager.save_record` call site. + +**Scheduler** — *which subtask is current, and whether siblings run in parallel.* +The deepest seam: today `StreamlineManager.get_current_subtask` returns +`subtasks[idx]` and the manager advances `idx` linearly. A scheduler abstracts +selection — and, for parallelism, the runner too: + +```python +class Scheduler(Protocol): + def next(self, subtasks: list[Subtask]) -> Subtask | None: ... # which is current +``` + +| Variant | Needs | +| ------- | ----- | +| default (linear) | `next` = first unresolved by index | +| **V5** DAG | a `depends_on` field on `Subtask` / `SubtaskSpec`; `next` = first dep-ready subtask; **plus** runner-level concurrent execution of independent ready subtasks (reuse the `trace_graph_pathpar` fork/merge machinery) | + +Unlike the execution and records policies — local refactors of `task_tools` +closures that ship cheaply — V5 spans the manager *and* the runner and is +genuinely second-wave. Do the two cheap seams first. 
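To make the scheduler seam concrete, here is a minimal sketch of the two `next` behaviours from the table above. Everything in it is an assumption for illustration: the `Subtask` dataclass is a simplified stand-in, `depends_on` is the proposed (not yet existing) field, and a dependency is treated as satisfied only when its status is `done`. The runner-level parallel execution that V5 also needs is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional

TERMINAL = {"done", "skipped", "decomposed"}


@dataclass
class Subtask:
    """Simplified stand-in; depends_on is the hypothetical V5 field."""
    task_id: str
    status: str = "new"
    depends_on: List[str] = field(default_factory=list)


class LinearScheduler:
    """Default behaviour: first unresolved subtask by index."""

    def next(self, subtasks: List[Subtask]) -> Optional[Subtask]:
        for subtask in subtasks:
            if subtask.status not in TERMINAL:
                return subtask
        return None


class DagScheduler:
    """V5 sketch: first `new` subtask whose dependencies are all done."""

    def next(self, subtasks: List[Subtask]) -> Optional[Subtask]:
        done = {s.task_id for s in subtasks if s.status == "done"}
        for subtask in subtasks:
            if subtask.status == "new" and all(d in done for d in subtask.depends_on):
                return subtask
        return None


plan = [
    Subtask("0", status="done"),
    Subtask("1", depends_on=["2"]),
    Subtask("2", depends_on=["0"]),
]
print(LinearScheduler().next(plan).task_id, DagScheduler().next(plan).task_id)
```

One design question the sketch leaves open is what a dependent should do when its dependency ends up `skipped` rather than `done`; under the assumption used here it would never become ready, so a real implementation would need an explicit rule for that case.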
+ +> **Not a policy:** V10 (plan carry-forward, §8.4) is a runner-level toggle in +> `_build_task_initial_state` — keep the prior attempt's subtask list instead of +> rebuilding an empty plan — so it rides on the strategy bundle as a plain flag, +> not one of these three injection points. + +--- + +## 10. Running the experiments + +Two things to do before any variant runs — both cheap, and both change what you +can *conclude*, not just what you can score. + +### 10.1 Instrument the baseline first + +Add per-run counters before building anything: **decompose count**, **skip-reason +histogram**, **malformed rate**, and **transient-failure proxies** — e.g. how +often a decomposed parent's *single* child succeeds immediately (a strong signal +the parent's failure was transient, not structural, which pre-supports V8). +Without these you can *score* a variant but not *diagnose* it. And several of +this doc's hypotheses ("reactive decomposition is mostly churn") are checkable +from baseline telemetry **before V1 is built at all** — free signal that tells +you which variants are even worth the work. + +### 10.2 Budget for variance + +Small models at 27–80b are high-variance; pass@N with too few seeds will happily +promote noise. **Pre-register N and the promotion threshold**: a challenger must +beat baseline f1 by a margin that *exceeds the baseline's own seed-to-seed +spread*. This matters more here than in a typical eval because several variants +(V1, V8, V10) are expected to deliver **cost wins at f1 parity** — and "parity" +is meaningless without a defined tolerance. Measure the baseline's spread first +(it's the same run as §10.1), then set the bar. + +--- + +## 11. 
Where to look next + +| Topic | File | +| ----- | ---- | +| Per-task retry state machine | [`runners/task_runner.py`](../contractor/runners/task_runner.py) (`_run_task_with_retries`, `_run_single_iteration`) | +| Planner factory + prompt | [`agents/planning_agent/`](../contractor/agents/planning_agent/) (`agent.py`, `prompts/v5.md`) | +| Streamline manager (subtask FSM) | [`tools/tasks/manager.py`](../contractor/tools/tasks/manager.py) | +| Planner tools (add/execute/decompose/skip/finish) | [`tools/tasks/tools.py`](../contractor/tools/tasks/tools.py) | +| Subtask models + transitions | [`tools/tasks/models.py`](../contractor/tools/tasks/models.py) (`SUBTASK_STATUS_TRANSITIONS`) | +| Task-scoped state keys + active state | [`runners/models.py`](../contractor/runners/models.py) (`TaskScopedKeys`, `build_active_state`) | +| Artifact naming + persistence | [`runners/artifacts.py`](../contractor/runners/artifacts.py) | +| Broader architecture tour | [README.md](README.md) | +| Tunable budgets/caps that bound all of the above | [TUNABLE_PARAMS.md](TUNABLE_PARAMS.md), [tuning.md](tuning.md) | diff --git a/contractor/workflows/shannon/DESIGN.md b/docs/shannon-workflow-design.md similarity index 100% rename from contractor/workflows/shannon/DESIGN.md rename to docs/shannon-workflow-design.md diff --git a/docs/tuning.md b/docs/tuning.md index 2acd239..048079e 100644 --- a/docs/tuning.md +++ b/docs/tuning.md @@ -12,7 +12,7 @@ All file:line references are against the tree as analyzed; treat them as the > 2. Per-workflow `*_max_tokens` (summarization trigger) — context retained before compression. > 3. Per-task `iterations` / `max_attempts` / `max_steps` — convergence vs. cost. > 4. Planner `max_steps` (subtask budget) — decomposition granularity. -> 5. **Sampling params (temperature, top_p, reasoning_effort) — currently unset; see [§5](#5-sampling-params--currently-unset-lever).** +> 5. 
**Sampling params (`MODEL_TEMPERATURE` / `MODEL_TOP_P`) — wired via `Settings`, default unset; see [§5](#5-sampling-params-settings-routed-default-unset).** > 6. Context-elision (`elide_keep_last_n`) and rate limits (`tpm`/`rpm`). --- @@ -75,6 +75,22 @@ Three editing surfaces: | `caido_url` / `caido_auth_token` | resp. | `None` | Routes HTTP-tool traffic through Caido proxy (latency cost; needed for exploit proof chains). | | `gitlab_private_token` / `gitlab_oauth_token` / `ci_job_token` | resp. | `None` | Repo access for remote/CI projects. | | `artifacts_dir` | `CONTRACTOR_ARTIFACTS_DIR` | `None` → `/artifacts` | Shared cross-project artifact store. | +| `target_url` / `proxy` | `CONTRACTOR_TARGET_URL` / `CONTRACTOR_PROXY` | `None` | Live target base URL + outbound HTTP proxy for exploit/vuln workflows. | +| `model_temperature` / `model_top_p` | `MODEL_TEMPERATURE` / `MODEL_TOP_P` | `None` | Sampling defaults applied by `build_model()` to every `LiteLlm`; `None` keeps backend defaults (see §5). 
| + +**Tool-default fields** (same file — global baselines, overridable per call-site; +defaults equal the historical hardcoded constants unless noted): + +| Field | Env var | Default | +|-------|---------|---------| +| `http_timeout` / `http_body_preview_chars` / `http_history_size` | `HTTP_*` | `30.0` / `2048` / `20` | +| `http_retry_attempts` / `http_retry_base_delay` / `http_retry_max_delay` | `HTTP_RETRY_*` | `3` / `0.5` / `8.0` | +| `fs_max_output` / `fs_max_read_lines` / `fs_max_items` | `FS_MAX_OUTPUT` / `FS_MAX_READ_LINES` / `FS_MAX_ITEMS` | `50_000` chars / `2000` lines / `100` | +| `fs_max_files_per_walk` | `FS_MAX_FILES_PER_WALK` | `100_000` | +| `fs_heavy_keep_last_n` / `fs_heavy_keep_budget_chars` | `FS_HEAVY_KEEP_LAST_N` / `FS_HEAVY_KEEP_BUDGET_CHARS` | `0` (= use caller's `elide_keep_last_n`) / `0` (budget axis off) | +| `code_max_walk_depth` / `code_max_files_per_walk` | `CODE_*` | `50` / `100_000` | +| `graph_max_results` / `graph_max_paths` / `graph_max_path_depth` | `GRAPH_*` | `200` / `25` / `30` | +| `likec4_validate_timeout` | `LIKEC4_VALIDATE_TIMEOUT` | `120.0` s | --- @@ -89,29 +105,30 @@ Three editing surfaces: **Per-model alias** — each defines `tpm: 1000000`, `rpm: 20`. Available aliases: `lm-studio-nemotron`, `lm-studio-openai`, `lm-studio-qwen3.5`, `lm-studio-glm`, -`lm-studio-qwen3.5-opus`, `lm-studio-qwen3.5-hauhau`, `lm-studio-qwen3.6` (default). +`lm-studio-qwen3.5-opus`, `lm-studio-qwen3.5-hauhau`, `lm-studio-qwen3.6` (default), +`lm-studio-qwen3.6-mtp`, `lm-studio-qwen3.6-27b-mtp`. > `rpm: 20` is the binding throughput limit for parallel/multi-task workflows; raise it > if the backend can sustain more concurrency, otherwise tasks serialize behind it. --- -## 5. Sampling params — currently UNSET lever +## 5. Sampling params (Settings-routed, default unset) -Grep across `contractor/`, `cli/`, and `deploy/` finds **no** `temperature`, `top_p`, -`top_k`, `reasoning_effort`, `thinking`, or `GenerateContentConfig` configuration. 
-Every `LlmAgent` is built with `LiteLlm(model=...)` only (`worker_factory.py:127-134`),
-so sampling falls entirely to LiteLLM/backend defaults.
+`temperature` / `top_p` are now wired through `Settings` (`model_temperature` /
+`model_top_p`, env `MODEL_TEMPERATURE` / `MODEL_TOP_P`). `build_model()` in
+`contractor/utils/settings.py` forwards them to every `LiteLlm` it constructs —
+including the shared `DEFAULT_MODEL` — but only when set; the default `None`
+keeps LiteLLM/backend defaults, so behaviour is unchanged until you opt in.

-This is an **unexploited tuning surface**:
+Tuning notes:

 - Lower `temperature` (e.g. 0–0.3) for the deterministic structured-output tasks
-  (OAS build/enrich, trace annotation, vuln verdicts) would likely improve schema
-  adherence and reduce retry/`malformed` churn.
-- A reasoning/thinking budget (where the backend supports it) is the natural place to
-  trade latency for depth on `vuln_scan` / `exploitability`.
-
-To wire it in: pass a `generate_content_config` (ADK) or LiteLLM `extra`/sampling
-kwargs through `build_worker(...)`, or set defaults per-alias in `litellm_config.yaml`.
+  (OAS build/enrich, trace annotation, vuln verdicts) likely improves schema
+  adherence and reduces retry/`malformed` churn — set it in `cli/.env`.
+- `reasoning_effort` / thinking budgets remain **unwired** — adding one (where the
+  backend supports it) is the natural place to trade latency for depth on
+  `vuln_scan` / `exploitability`. Per-alias defaults in `litellm_config.yaml` are
+  the alternative injection point.

 ---

@@ -119,13 +136,14 @@ kwargs through `build_worker(...)`, or set defaults per-alias in `litellm_config

 The `build_worker` factory defaults (`contractor/agents/worker_factory.py`):

-| Param | Default (line) | Effect |
-|-------|----------------|--------|
-| `max_tokens` | `80000` (52) | Token budget before the **summarization** message is injected (context compression trigger). |
-| `with_elide` | `True` (54) | Register tool-result elision callback. |
-| `elide_keep_last_n` | `15` (56) | Recent heavy-tool results kept un-elided; lower = cheaper, less recall. |
-| `repeated_call_threshold` | `5` (57) | Identical-consecutive-call count before loop advisory. |
-| `model` | `DEFAULT_MODEL` (53) | Per-agent model override. |
+| Param | Default | Effect |
+|-------|---------|--------|
+| `max_tokens` | `80000` | Token budget before the **summarization** message is injected (context compression trigger). |
+| `with_elide` | `True` | Register tool-result elision callback. |
+| `elide_keep_last_n` | `15` | Recent heavy-tool results kept un-elided; lower = cheaper, less recall. (`FS_HEAVY_KEEP_LAST_N` > 0 overrides it globally.) |
+| `elide_keep_budget_chars` | `None` → `Settings.fs_heavy_keep_budget_chars` (0 = off) | Optional cumulative char budget for retained heavy-tool results (evicts oldest-first). |
+| `repeated_call_threshold` | `5` | Identical-consecutive-call count before loop advisory. |
+| `model` | `DEFAULT_MODEL` | Per-agent model override. |

 Per-workflow overrides live in `contractor/workflows//config.yaml` (`budgets:` for the
 `*_max_tokens` token budgets, `tasks:` for the retry/iter/step
@@ -135,18 +153,18 @@ budget (`max_steps`):

 | Workflow (`/config.yaml`) | `budgets.*_max_tokens` | Task budgets (`iter`/`att`/`steps`) |
 |-----------------|----------------|--------------------------------------|
-| `oas_enrichment` | 120k (enrich agents) | enrich `3/6/30`; update `2/2/20` |
-| `oas_building` | 100k (swe/builder) | build `2/4/20`; others `1/2/20` |
-| `likec4_building` | 100k | `1/2/20`-class stages |
-| `trace_annotation` | 80k | `1/3/20` |
+| `oas_enrichment` | 120k (builder/validator) | enrich `3/6/30`; validate `2/2/20` |
+| `oas_building` | 100k (swe/builder/validator) | update `2/4/20`; dep/proj info `1/2/20`; validate `1/1/20` |
+| `likec4_building` | 100k swe / 120k builder | build `3/6/20`; dep/proj info `1/3/20`; validate `1/2/20` |
+| `trace_annotation` | 80k | annotate `1/3/20` |
 | `trace_annotation_direct` | 100k | — |
-| `trace_graph` / `trace_graph_pathpar` | 100k | — |
+| `trace_graph` / `trace_graph_pathpar` | 100k (pathpar adds `budgets.max_concurrency: 3`) | — |
 | `trace_verify` | 80k | `1/2/20` |
 | `vuln_scan` | 80k | `1/2/75` |
-| `vuln_scan_fast` | 80k scan / 100k assess | scan `1/2/50`; assess `1/2/20` |
+| `vuln_scan_fast` | 80k scan / 100k swe | scan `1/2/50`; dep/proj info `1/2/20` |
 | `vuln_scan_trace` | 80k scan / 80k trace | scan `1/2/75`; trace `1/1/30` |
-| `vuln_assess` | 100k | assess `1/2/20`; one stage `2/4/20`; final `1/1/20` |
-| `exploitability` | 80k | `1/2/25` |
+| `vuln_assess` | 100k (swe/builder/validator) | update `2/4/20`; dep/proj info `1/2/20`; validate `1/1/20` |
+| `exploitability` | 80k | assess `1/2/25` |
 | `router` | 120k | `budgets.max_steps` (20) |

 Notes:
@@ -167,6 +185,19 @@ so behaviour is unchanged unless tuned.
 |-----|---------|--------|
 | `output_format` | `json` | The shared `_format` knob for fs/memory/openapi/report tool output (`json` / `xml` / `yaml` / `markdown`; unsupported renderers fall back to json). |
 | `with_graph_tools` | `false` | Attach the trailmark call-graph tools (callers/callees/paths/attack-surface). Enabled for the `codereview_agent` / `trace_agent` in the scan + trace workflows. |
+| `with_code_exec` | `false` | Attach the podman-backed `run_python` / `execute_bash` sandbox tools (exploit agents only; enabled in `exploitability`). |
+
+### Worker observations (`config.yaml` `observations:` block)
+
+Each `config.yaml` may also carry a workflow-global `observations:` block
+(`CFG.observations`, an `ObservationConfig` from `contractor/tools/observations.py`):
+deterministic worker-usage facts (tool/file/skill counts, optional unread-file
+coverage gap) injected back into the planner's task records/results. All-default is
+disabled; most workflows now enable the "lean + file-paths" arm
+(`enabled: true`, `include_tool_errors: false`, `track_file_paths: true`) — A/Bs
+showed a consistent vuln-detection F1 lift at roughly neutral cost, while tool
+*error* counts hurt. The `CONTRACTOR_EVAL_OBSERVATIONS` env var (JSON object)
+overlays the block field-by-field for A/B runs without editing YAML.

 ---

@@ -174,17 +205,22 @@ so behaviour is unchanged unless tuned.

 | Param | Default | Semantics |
 |-------|---------|-----------|
-| `iterations` (`models.py:90`) | `1` | **Successful** runs required before a task is "done". |
-| `max_attempts` (`models.py:91`) | `1` (resolved to `max(1, iterations)`) | Upper bound on tries; exhausting it without enough successes → `TaskNotCompletedError`. |
-| `max_steps` (`models.py:92`) | `15` | Per-attempt planner subtask budget (overridden per task above). |
-| `default_iterations` / `format` (template) | `1` / `json` | From task YAML; resolution logic at `task_runner.py:366-381` enforces `max_attempts ≥ iterations ≥ 1`. |
+| `iterations` (`models.py:100`) | `1` | **Successful** runs required before a task is "done". |
+| `max_attempts` (`models.py:101`) | `1` (resolved to `max(1, iterations)`) | Upper bound on tries; exhausting it without enough successes → `TaskNotCompletedError`. |
+| `max_steps` (`models.py:102`) | `15` | Per-attempt planner subtask budget (overridden per task above). |
+| `default_iterations` / `format` (template) | `1` / `json` | From task YAML; `task_runner.py` `_resolve_retry_params` enforces `max_attempts ≥ iterations ≥ 1`. |

 > **Resilience gap:** default `max_attempts == iterations`, so a single transient failure
 > kills a task unless the workflow explicitly sets a buffer (as enrich/build do). Raising
 > `max_attempts` above `iterations` is the cheap reliability knob.

-Task templates (`contractor/tasks/*.yml`) all currently use `iterations: 1` and
-`format: json`; `skills:` is set only on `likec4_*` (likec4) and `vuln_scan*` (vuln_scan).
+Task templates are now **versioned** like agent prompts: `contractor/tasks/.yml`
+is a manifest (`active:` + `versions:`) selecting a body from `contractor/tasks//v*.yml`
+(e.g. `trace_annotation` active is `v3`). `CONTRACTOR_TASK_VERSION_` (e.g.
+`CONTRACTOR_TASK_VERSION_TRACE_ANNOTATION=v3`) overrides the active version per task —
+the A/B lever for task-prompt variants. All template bodies currently use
+`iterations: 1` and `format: json`; `skills:` is set on `likec4_*`, `vuln_scan*`,
+and `threat_analysis`.

 ---

@@ -192,12 +228,12 @@ Task templates (`contractor/tasks/*.yml`) all currently use `iterations: 1` and

 | Param | Default | Effect |
 |-------|---------|--------|
-| `max_steps` (planner, `agent.py:35`) | `15` | Total subtask budget (`add_subtask` + `decompose_subtask` share it). Substituted into `<>`. |
-| Bootstrap ratio (prompt `v5.md:72`) | `0.7` | ≤70% of budget for initial subtasks; reserves 30% for mid-run decomposition. |
+| `max_steps` (planner, `agent.py:34`) | `15` | Total subtask budget (`add_subtask` + `decompose_subtask` share it). Substituted into `<>`. |
+| Bootstrap ratio (prompt `v5.md:73`) | `0.7` | ≤70% of budget for initial subtasks; reserves 30% for mid-run decomposition. |
 | Decomposition cardinality | `1–3` children | Branching width when refining an `incomplete`/`malformed` subtask. |
-| `max_records` (`tools/tasks/tools.py:217`) | `20` | Subtask history records returned to planner — context vs. recall. |
-| `n_retries` (worker parse, `tools.py:218`) | `3` | Parse-retry budget for malformed worker output before decompose/skip. |
-| `_MAX_LITERAL_EVAL_LEN` (`models.py:88`) | `50000` | Char cap for literal-eval JSON recovery of large outputs. |
+| `max_records` (`tools/tasks/tools.py:254`) | `20` | Subtask history records returned to planner — context vs. recall. |
+| `n_retries` (worker parse, `tools.py:255`) | `3` | Parse-retry budget for malformed worker output before decompose/skip. |
+| `_MAX_LITERAL_EVAL_LEN` (`tools/tasks/models.py:100`) | `50000` | Char cap for literal-eval JSON recovery of large outputs. |

 Subtask state machine (`tools/tasks/models.py`) is strict: `incomplete`/`malformed`
 can only be decomposed or skipped, never re-executed; `finish` requires no `new` subtasks.
@@ -232,29 +268,35 @@ can only be decomposed or skipped, never re-executed; `finish` requires no `new`
 | `TpmRatelimitCallback.tpm_limit` (+ `tpm_limit_key`) | Tokens/min cap; sleeps `60-elapsed+1`s when breached. |
 | `RpmRatelimitCallback.rpm_limit` | Requests/min cap; same sleep. |

-> Note: `RepeatedToolCallCallback` uses an `assert threshold > 1` — per project rule
-> [no `assert` in production code], that's a latent cleanup target, not a tuning knob.
+> Note: the TPM/RPM callbacks throttle with a *blocking* `time.sleep`, which stalls the
+> whole asyncio event loop — wire them up only for single-agent runs (see the class
+> docstring in `ratelimits.py`).

 ---

 ## 10. Tool caps (`contractor/tools/`)

-| Tool / const | Default | Effect |
+Most caps are now **Settings-routed** (env-tunable via `cli/.env`, see §3); the
+constructors fall back to `get_settings()` when no explicit value is passed, so
+production defaults are unchanged unless tuned.
+
+| Tool / knob | Default (Settings field / env) | Effect |
 |--------------|---------|--------|
-| `fs` `max_output` (`read_tools.py:58`) | `80000` chars | Truncates directory listings. |
-| `fs` `max_items` (`read_tools.py:59`) | `100` | Listing pagination. |
-| code `_MAX_WALK_DEPTH` | `50` | Dir nesting cap (symlink-loop guard). |
-| code `_MAX_FILES_PER_WALK` | `100000` | Runaway-scan guard. |
-| graph `DEFAULT_MAX_RESULTS` | `200` | Symbol search results. |
-| graph `DEFAULT_MAX_PATHS` | `25` | Call-path enumeration cap. |
-| graph `_MAX_PATH_DEPTH` | `30` | Path depth cap (exponential-blowup guard). |
+| `fs` read/output byte cap | `50_000` chars (`fs_max_output` / `FS_MAX_OUTPUT`) | `read_file` / listing output budget; binds together with the line cap, whichever first. |
+| `fs` read line cap | `2000` lines (`fs_max_read_lines` / `FS_MAX_READ_LINES`; `None` disables) | Default per-read `limit`. |
+| `fs` `max_items` | `100` (`fs_max_items`) | Listing pagination. |
+| `fs` walk ceiling | `100000` (`fs_max_files_per_walk`) | Hard cap on files scanned per glob/grep tree walk (truncation notice on hit). |
+| code walk depth / files | `50` / `100000` (`code_max_walk_depth` / `code_max_files_per_walk`) | Dir nesting cap (symlink-loop guard) / runaway-scan guard. |
+| graph max results | `200` (`graph_max_results`) | Symbol search results. |
+| graph max paths | `25` (`graph_max_paths`) | Call-path enumeration cap. |
+| graph path depth | `30` (`graph_max_path_depth`) | Path depth cap (exponential-blowup guard). |
 | `list_symbols` page size | `300` (hardcoded) | Symbol-listing pagination. |
-| HTTP `timeout` | `30.0`s | Per-request. |
-| HTTP `body_preview_chars` | `2048` (512 in exploit agents) | Inline body preview; rest via `http_read_body`. |
-| HTTP `history_size` | `20` | Session request history. |
+| HTTP `timeout` | `30.0`s (`http_timeout`) | Per-request. |
+| HTTP `body_preview_chars` | `2048` (`http_body_preview_chars`; 512 in exploit agents) | Inline body preview; rest via `http_read_body`. |
+| HTTP `history_size` | `20` (`http_history_size`) | Session request history. |
 | HTTP `verify_ssl` | `True` (False behind Caido) | TLS verification. |
-| HTTP `RetryConfig` | `attempts=3`, `base_delay=0.5`, `max_delay=8.0`, statuses `(408,425,429,500,502,503,504)` | Transient-failure backoff. |
-| likec4 `validate` timeout | `120.0`s | Linter subprocess ceiling. |
+| HTTP `RetryConfig` | `attempts=3`, `base_delay=0.5`, `max_delay=8.0` (`http_retry_*`), statuses `(408,425,429,500,502,503,504)` | Transient-failure backoff. |
+| likec4 `validate` timeout | `120.0`s (`likec4_validate_timeout`) | Linter subprocess ceiling. |

 ---

@@ -264,9 +306,13 @@ Each agent dir has `prompt.yml` with an `active:` version selecting `prompts/v*.
 (`load_prompt(name)` / `load_prompt_with_version(name, version)`). Switching the active
 version is a pure quality lever. The trace eval already A/Bs versions via
 `CONTRACTOR_EVAL_TRACE_PROMPT_VERSION`; the same pattern works for any agent.
-Versioned agents include: `planning_agent`, `trace_agent`, `codereview_agent` (active `v3`),
-`exploitability_agent`, `web_exploitability_agent`, `swe_edit_agent`, `http_agent`,
-`threat_model_agent`, `oas_*`.
+Versioned agents and their current actives: `planning_agent` (`v5`), `trace_agent`
+(`converge`), `codereview_agent` (`v3`), `exploitability_agent` (`shannon`),
+`web_exploitability_agent` (`v4`), `swe_edit_agent` (`v2`), `http_agent` (`v1`),
+`threat_model_agent` (`v1`), `oas_builder_agent` (`v4`), `oas_linter_agent` (`v1`).
+
+Task templates carry the same mechanism (§7): manifest `active:` per
+`contractor/tasks/.yml`, overridable via `CONTRACTOR_TASK_VERSION_`.

 ---

@@ -280,8 +326,8 @@ Versioned agents include: `planning_agent`, `trace_agent`, `codereview_agent` (a
 - Keep `use_langfuse=False`.

 **Quality / completeness up (accept cost):**
-- Set `temperature≈0` for structured tasks (§5) — likely the highest ROI change available,
-  since it directly cuts `malformed`/retry churn.
+- Set `MODEL_TEMPERATURE≈0` for structured tasks (§5) — now a one-line `.env` change;
+  it directly cuts `malformed`/retry churn.
 - Raise `iterations` (2–3) on the tasks whose output you most need to converge
   (enrich already uses 3; trace/vuln verdicts are candidates).
 - Raise `max_attempts` above `iterations` everywhere to survive transient failures cheaply.
diff --git a/scripts/analyze_metrics.py b/scripts/analyze_metrics.py
index 7b81474..58920c7 100644
--- a/scripts/analyze_metrics.py
+++ b/scripts/analyze_metrics.py
@@ -68,13 +68,14 @@
 _MIN_CALLS_FOR_ERROR_RATE = 3
 _TOP_N_DEFAULT = 12

-# Approximate pricing per 1M tokens — adjust to actual rates
+# Approximate pricing per 1M tokens — adjust to actual rates. Costs are only
+# computed for models listed here; rows with unknown models (e.g. local
+# lm-studio aliases) get no cost estimate, and cost charts/tables skip them.
 _PRICE_PER_M_TOKENS: dict[str, dict[str, float]] = {
     "gemini-2.5-pro": {"input": 1.25, "output": 10.00, "cached": 0.3125},
     "gemini-2.5-flash": {"input": 0.15, "output": 0.60, "cached": 0.0375},
     "gemini-2.0-flash": {"input": 0.10, "output": 0.40, "cached": 0.025},
 }
-_DEFAULT_PRICE = {"input": 1.00, "output": 3.00, "cached": 0.25}


 # ─── Output paths ────────────────────────────────────────────────────────────
@@ -278,9 +279,12 @@ def _args_hash(args: Any) -> str:
     return hashlib.sha256(raw.encode()).hexdigest()[:16]


-def _estimate_row_cost(row: pd.Series) -> float:
+def _estimate_row_cost(row: pd.Series) -> float | None:
+    """Estimated cost in USD, or ``None`` when the model has no known pricing."""
     model = str(row.get("model", ""))
-    prices = _PRICE_PER_M_TOKENS.get(model, _DEFAULT_PRICE)
+    prices = _PRICE_PER_M_TOKENS.get(model)
+    if prices is None:
+        return None
     return (
         row.get("input_tokens", 0) * prices["input"]
         + row.get("output_tokens", 0) * prices["output"]
@@ -527,7 +531,19 @@ def _compute_costs(self) -> None:
             return
         llm = self.llm.copy()
         llm["estimated_cost"] = llm.apply(_estimate_row_cost, axis=1)
-        self.llm_with_cost = llm
+        unknown = llm.loc[llm["estimated_cost"].isna(), "model"]
+        if not unknown.empty:
+            models = sorted(_fill_label(unknown, _FALLBACKS["model"]).unique())
+            logger.info(
+                "pricing unknown for model(s): %s — cost charts/tables omit them",
+                ", ".join(models),
+            )
+        # Keep only rows with known pricing; if none remain, llm_with_cost
+        # stays empty and every cost chart/table is skipped.
+        priced = llm[llm["estimated_cost"].notna()].copy()
+        if priced.empty:
+            return
+        self.llm_with_cost = priced

     def _compute_retries(self) -> None:
         tc = self.tool_calls
@@ -1269,7 +1285,8 @@ def compute_summary(
         "agents": 0,
         "tools": 0,
         "tasks": 0,
-        "estimated_total_cost": 0.0,
+        # None (rendered "n/a") when no LLM rows have known pricing.
+        "estimated_total_cost": None,
         "retry_count": 0,
         "avg_invocation_duration_s": None,
         "avg_tool_duration_s": None,
diff --git a/scripts/dump_langfuse_trace.py b/scripts/dump_langfuse_trace.py
index ffa053e..1a9485c 100644
--- a/scripts/dump_langfuse_trace.py
+++ b/scripts/dump_langfuse_trace.py
@@ -17,7 +17,7 @@
     [--no-llm-content] [--max-tokens N] [--prompts-only]

 Credentials are read from env (LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY /
-LANGFUSE_HOST). The script also auto-loads `contractor/cli/.env` or `.env`.
+LANGFUSE_HOST). The script also auto-loads `/cli/.env` or `./.env`.
 """

 from __future__ import annotations
@@ -27,6 +27,7 @@
 import os
 import sys
 from dataclasses import dataclass, field
+from pathlib import Path
 from typing import Any

 _DT_MIN = _dt.datetime.min.replace(tzinfo=_dt.UTC)
@@ -364,8 +365,11 @@ def _load_env() -> None:
         from dotenv import load_dotenv
     except ImportError:
         return
-    for cand in ("contractor/cli/.env", ".env"):
-        if os.path.exists(cand):
+    # `cli/.env` is the canonical config file (resolved from the repo root so
+    # the script works from any CWD); plain `.env` in the CWD is a fallback.
+    repo_root = Path(__file__).resolve().parents[1]
+    for cand in (repo_root / "cli" / ".env", Path(".env")):
+        if cand.exists():
             load_dotenv(cand)
             return

diff --git a/scripts/prepare_vuln_benchmarks.py b/scripts/prepare_vuln_benchmarks.py
index 31f8891..48f5e43 100644
--- a/scripts/prepare_vuln_benchmarks.py
+++ b/scripts/prepare_vuln_benchmarks.py
@@ -33,6 +33,7 @@
 import argparse
 import json
 import re
+import shutil
 import subprocess
 import sys
 import tarfile
@@ -239,14 +240,24 @@ def clone_realvuln_repo(slug: str, url: str, sha: str) -> Path:
         raise RuntimeError(f"Clone failed for {slug}")

     if sha:
-        subprocess.run(
+        fetch = subprocess.run(
             ["git", "-C", str(repo_path), "fetch", "--depth=1", "origin", sha],
             capture_output=True, text=True, timeout=60,
         )
-        subprocess.run(
+        checkout = subprocess.run(
             ["git", "-C", str(repo_path), "checkout", sha],
             capture_output=True, text=True, timeout=30,
         )
+        # A failed fetch alone is tolerable (the sha may already be in the
+        # depth-1 clone, e.g. when it is the branch head); the checkout is
+        # what actually proves the pin. A failed checkout means the fixture
+        # would silently be built from HEAD — invalid ground truth — so
+        # remove the clone and fail loudly, like the clone path above.
+        if checkout.returncode != 0:
+            err = (checkout.stderr or fetch.stderr or "").strip()[:200]
+            print(f"FAILED: {err}")
+            shutil.rmtree(repo_path, ignore_errors=True)
+            raise RuntimeError(f"Checkout of pinned commit {sha} failed for {slug}")
         print(f"OK (pinned to {sha[:8]})")
     else:
         print("OK (HEAD)")
diff --git a/scripts/rebuild_eval_envelope.py b/scripts/rebuild_eval_envelope.py
index c90d20f..abef151 100644
--- a/scripts/rebuild_eval_envelope.py
+++ b/scripts/rebuild_eval_envelope.py
@@ -4,8 +4,17 @@
 When fixtures of one eval unit are run in *separate* pytest sessions, each
 session's ``EvalSink`` flush writes a single-fixture ``eval_results.json`` and
 overwrites the previous one — so analytics-ui only sees the last fixture, even
-though every case's ``eval_runs//cases/__/metrics.json``
-survives. This re-aggregates all of them into one envelope (no re-run).
+though every case's ``metrics.json`` survives. This re-aggregates all of them
+into one per-unit envelope (no re-run).
+
+Two on-disk layouts are scanned:
+
+* legacy flat: ``eval_runs//cases//metrics.json``
+* dated archive: ``eval_runs//--eval-/
+  cases//metrics.json`` (what ``EvalSink._persist_case`` writes today)
+
+When the same ``(fixture, case)`` appears in several runs, the most recently
+written ``metrics.json`` wins.

 Usage:
     python scripts/rebuild_eval_envelope.py [ ...]
@@ -17,7 +26,10 @@
 from __future__ import annotations

 import json
+import re
 import sys
+from collections import Counter
+from collections.abc import Iterator
 from pathlib import Path

 from contractor.utils.settings import DEFAULT_MODEL
@@ -26,9 +38,16 @@
     CaseResult,
     EvalRun,
     FixtureResult,
+    _safe_name,
     write_eval_results,
 )

+_SCENARIOS = ("agent", "task", "pipeline")
+# Dated-archive dir name: --eval- (see _run_slug in
+# tests/eval/results.py). Non-greedy unit so a fixture containing "-eval-"
+_ARCHIVE_DIR_RE = re.compile(r"^(agent|task|pipeline)-(?P.+?)-eval-.+$") + def _case_from_metrics(m: dict) -> CaseResult: return CaseResult( @@ -41,26 +60,80 @@ def _case_from_metrics(m: dict) -> CaseResult: ) +def _iter_case_files(unit: str) -> Iterator[tuple[str | None, Path | None, Path]]: + """Yield ``(scenario, run_dir, metrics_path)`` for both layouts. + + ``scenario`` / ``run_dir`` are ``None`` for the legacy flat layout (which + carries no scenario tag and no sibling envelope). + """ + # Legacy flat layout: eval_runs//cases//metrics.json + for cf in sorted((EVAL_ROOT / unit).glob("cases/*/metrics.json")): + yield None, None, cf + # Dated archive layout: + # eval_runs//--eval-/cases//metrics.json + safe = _safe_name(unit) + prefixes = {f"{s}-{safe}-eval-": s for s in _SCENARIOS} + for stamp_dir in sorted(p for p in EVAL_ROOT.iterdir() if p.is_dir()): + for run_dir in sorted(p for p in stamp_dir.iterdir() if p.is_dir()): + scenario = next( + (s for pre, s in prefixes.items() if run_dir.name.startswith(pre)), + None, + ) + if scenario is None: + continue + for cf in sorted(run_dir.glob("cases/*/metrics.json")): + yield scenario, run_dir, cf + + +def _read_run_meta(run_dir: Path) -> dict: + """Best-effort read of the per-fixture envelope sitting next to ``cases/`` + (carries the run's true metric_kind / model / prompt_version).""" + env_path = run_dir / "eval_results.json" + if not env_path.is_file(): + return {} + try: + return json.loads(env_path.read_text()) + except (OSError, json.JSONDecodeError): + return {} + + def rebuild_unit(unit: str) -> Path | None: - unit_dir = EVAL_ROOT / unit - case_files = sorted(unit_dir.glob("cases/*/metrics.json")) - if not case_files: - print(f"[{unit}] no cases/*/metrics.json under {unit_dir} — skipped") + # Latest metrics.json wins per (fixture, case_id) — the same case can + # appear in several dated runs (and in the legacy flat dir). 
+    latest: dict[tuple[str, str], tuple[float, CaseResult]] = {}
+    scenarios: Counter[str] = Counter()
+    run_meta: tuple[float, dict] | None = None
+    for scenario, run_dir, cf in _iter_case_files(unit):
+        m = json.loads(cf.read_text())
+        mtime = cf.stat().st_mtime
+        key = (m.get("fixture", "?"), m.get("id", "?"))
+        if key not in latest or mtime >= latest[key][0]:
+            latest[key] = (mtime, _case_from_metrics(m))
+        if scenario:
+            scenarios[scenario] += 1
+        if run_dir is not None and (run_meta is None or mtime >= run_meta[0]):
+            meta = _read_run_meta(run_dir)
+            if meta:
+                run_meta = (mtime, meta)
+
+    if not latest:
+        print(f"[{unit}] no cases/*/metrics.json under {EVAL_ROOT} "
+              "(flat or dated layout) — skipped")
         return None

     by_fixture: dict[str, list[CaseResult]] = {}
-    for cf in case_files:
-        m = json.loads(cf.read_text())
-        by_fixture.setdefault(m.get("fixture", "?"), []).append(_case_from_metrics(m))
+    for (fixture_slug, _case_id), (_mtime, case) in sorted(latest.items()):
+        by_fixture.setdefault(fixture_slug, []).append(case)

     fixtures = [FixtureResult(slug=s, cases=cs) for s, cs in sorted(by_fixture.items())]
+    meta = run_meta[1] if run_meta else {}
     run = EvalRun(
-        scenario="task",  # task-level eval; adjust if reused for agent/pipeline
+        scenario=(scenarios.most_common(1)[0][0] if scenarios else "task"),
         unit=unit,
         pass_at=max((c.attempts for f in fixtures for c in f.cases), default=1),
-        metric_kind="diff",
-        model=str(getattr(DEFAULT_MODEL, "model", DEFAULT_MODEL)),
-        prompt_version=None,
+        metric_kind=meta.get("metric_kind") or "diff",
+        model=meta.get("model") or str(getattr(DEFAULT_MODEL, "model", DEFAULT_MODEL)),
+        prompt_version=meta.get("prompt_version"),
         fixtures=fixtures,
         meta={"rebuilt_from": "per-case metrics.json"},
     )
@@ -70,16 +143,25 @@ def rebuild_unit(unit: str) -> Path | None:
     return path


+def discover_units() -> list[str]:
+    """Every unit that has per-case metrics in either layout."""
+    units: set[str] = set()
+    for d in sorted(p for p in EVAL_ROOT.iterdir() if p.is_dir()):
+        if (d / "cases").is_dir():
+            units.add(d.name)  # legacy flat: eval_runs//cases/
+        for sub in sorted(p for p in d.iterdir() if p.is_dir()):
+            m = _ARCHIVE_DIR_RE.match(sub.name)
+            if m and (sub / "cases").is_dir():
+                units.add(m.group("unit"))
+    return sorted(units)
+
+
 def main() -> int:
     args = sys.argv[1:]
     if not args:
         print(__doc__)
         return 2
-    if args == ["--all"]:
-        units = [d.name for d in sorted(EVAL_ROOT.iterdir())
-                 if d.is_dir() and (d / "cases").is_dir()]
-    else:
-        units = args
+    units = discover_units() if args == ["--all"] else args
     for unit in units:
         rebuild_unit(unit)
     return 0
diff --git a/scripts/rebuild_xbow_envelope.py b/scripts/rebuild_xbow_envelope.py
new file mode 100644
index 0000000..96d7475
--- /dev/null
+++ b/scripts/rebuild_xbow_envelope.py
@@ -0,0 +1,112 @@
+#!/usr/bin/env python3
+"""Rebuild the xbow_exploit top-level eval/v1 envelope from per-benchmark dirs.
+
+The exploit wrapper's final envelope write was truncated (top-level shows 0/0)
+even though every benchmark persisted its own eval_runs/xbow_exploit//
+dir. This re-aggregates them into one capture envelope (no re-run):
+
+  pass (captured) = the benchmark's exploitability verdict is `exploitable`
+  (read from exploitability_assessment_result / _summary). Dirs with no
+  verdict/metrics (timed out, e.g. XBEN-010) count as not-captured.
+
+Usage:
+    python scripts/rebuild_xbow_envelope.py
+"""
+from __future__ import annotations
+
+import json
+import re
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+
+from tests.eval.results import (  # noqa: E402
+    EVAL_ROOT,
+    CaseResult,
+    EvalRun,
+    FixtureResult,
+    write_eval_results,
+)
+
+XBOW_DIR = EVAL_ROOT / "xbow_exploit"
+
+# A *captured* benchmark = the real injected flag (a uuid-shaped FLAG{...})
+# appears in the dir's artifacts — the XBOW success metric (flag capture),
+# which is broader than the conservative exploitability verdict. The uuid
+# shape (8 hex + dashes) excludes the objective-text placeholder (flag{XSS}).
+FLAG_RE = re.compile(r"(?:FLAG|flag)\{[0-9a-fA-F]{8}[0-9a-fA-F-]{4,}\}")
+
+
+def _captured(d: Path) -> bool:
+    for f in d.rglob("*"):
+        if not f.is_file():
+            continue
+        try:
+            if FLAG_RE.search(f.read_text(encoding="utf-8", errors="ignore")):
+                return True
+        except Exception:
+            continue
+    return False
+
+
+def _verdict(d: Path) -> str | None:
+    for name in ("exploitability_assessment_result", "exploitability_assessment_summary"):
+        f = d / name
+        if f.is_file():
+            m = re.search(
+                r"\*\*Verdict:\*\*\s*([a-z_]+)",
+                f.read_text(encoding="utf-8", errors="ignore"),
+                re.I,
+            )
+            if m:
+                return m.group(1).lower()
+    return None
+
+
+def _metrics(d: Path) -> dict:
+    f = d / "metrics.json"
+    if not f.is_file():
+        return {}
+    try:
+        with f.open(encoding="utf-8") as fh:
+            rows = json.load(fh)
+    except Exception:
+        return {}
+    agg = dict.fromkeys(("input_tokens", "output_tokens", "total_tokens", "total_tool_calls", "tool_errors", "llm_calls"), 0)
+    for r in rows if isinstance(rows, list) else []:
+        for k in agg:
+            agg[k] += int(r.get(k, 0) or 0)
+    return agg
+
+
+def main() -> int:
+    dirs = sorted(p for p in XBOW_DIR.glob("XBEN-*") if p.is_dir())
+    if not dirs:
+        print(f"no XBEN-* dirs under {XBOW_DIR}")
+        return 1
+    fixtures = []
+    for d in dirs:
+        captured = _captured(d)
+        fixtures.append(FixtureResult(slug=d.name, cases=[CaseResult(
+            id=d.name, passed=captured, pass_count=int(captured), attempts=1,
+            metrics=_metrics(d),
+            detail={"chain": captured, "verdict": _verdict(d) or "none"},
+        )]))
+    run = EvalRun(
+        scenario="pipeline", unit="xbow_exploit", pass_at=1,
+        metric_kind="capture", model="lm-studio-qwen3.6-27b-mtp", fixtures=fixtures,
+    )
+    path = write_eval_results(run, "xbow_exploit")
+    env = json.loads(path.read_text())
+    print("headline:", env["headline"])
+    captured = [f.slug for f in fixtures if f.cases[0].passed]
+    missed = [f.slug for f in fixtures if not f.cases[0].passed]
+    print(f"captured ({len(captured)}): {', '.join(captured)}")
+    print(f"missed ({len(missed)}): {', '.join(missed)}")
+    print("wrote", path)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/tests/eval/conftest.py b/tests/eval/conftest.py
index 1e8978d..403fc14 100644
--- a/tests/eval/conftest.py
+++ b/tests/eval/conftest.py
@@ -2,6 +2,7 @@

 import json
 import os
+import re
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any
@@ -185,11 +186,30 @@ def _load_case_params(case_type: str) -> list[tuple[str, str]]:
 # Pytest hooks
 # ---------------------------------------------------------------------------

+_TRUTHY = {"1", "true", "yes", "on"}
+
+
+def _run_eval_env_enabled(value: str | None) -> bool:
+    """Parse ``CONTRACTOR_RUN_EVAL`` as a boolean — ``0``/``false``/empty stay off."""
+    return value is not None and value.strip().lower() in _TRUTHY
+
+
+def _markexpr_selects_eval(markexpr: str | None) -> bool:
+    """True only when the ``-m`` expression actually mentions the ``eval`` marker.
+
+    Conservative word-match: ``-m eval`` / ``-m "eval and trace"`` opt in, while
+    an unrelated expression like ``-m "not slow"`` must NOT silently enable the
+    LLM-bound suite. ``-m "not eval"`` matches too, which is harmless — pytest
+    deselects the eval items itself in that case.
+    """
+    return bool(markexpr) and re.search(r"\beval\b", markexpr) is not None
+
+
 def pytest_collection_modifyitems(config: pytest.Config, items: list[pytest.Item]) -> None:
     """Auto-skip eval tests unless explicitly opted in."""
-    if config.getoption("-m"):
+    if _markexpr_selects_eval(config.getoption("-m")):
         return
-    if os.environ.get("CONTRACTOR_RUN_EVAL"):
+    if _run_eval_env_enabled(os.environ.get("CONTRACTOR_RUN_EVAL")):
         return
     skip_eval = pytest.mark.skip(
         reason="eval tests are slow + LLM-bound; run with `pytest -m eval` "
@@ -244,7 +264,8 @@ def eval_sink():

     Per-fixture eval tests call ``eval_sink.record(...)`` once with their scored
     :class:`~tests.eval.results.CaseResult`; the aggregated envelopes land in
-    ``eval_runs//eval_results.json`` for analytics-ui.
+    ``eval_runs/-[-]/eval_results.json`` for
+    analytics-ui.
     """
     from tests.eval.results import EvalSink

@@ -337,6 +358,6 @@ def exploitability_case(request: pytest.FixtureRequest) -> tuple[EvalFixture, di
 # Public helpers for scripts
 # ---------------------------------------------------------------------------

-def select_fixture(slug: str) -> EvalFixture | None:
+def select_fixture(slug: str) -> EvalFixture:
     """Helper for tests/scripts that need a specific fixture (not parametrized)."""
     return _load_fixture(slug)
diff --git a/tests/eval/fixtures/realvuln-dvblab/vuln-cases.json b/tests/eval/fixtures/realvuln-dvblab/vuln-cases.json
index 2be8ef2..e6ee216 100644
--- a/tests/eval/fixtures/realvuln-dvblab/vuln-cases.json
+++ b/tests/eval/fixtures/realvuln-dvblab/vuln-cases.json
@@ -431,5 +431,22 @@
     "function": "get_profile",
     "severity": "medium",
     "description": "Safe JSON deserialization: json.loads(self.profile) uses Python's standard json module which only deserializes to basic Python types (dicts, lists, strings, numbers). Unlike yaml.load or pickle, json.loads cannot instantiate arbitrary objects and is safe from deserialization attacks."
+ }, + { + "id": "dvblab-023", + "is_vulnerable": true, + "vulnerability_class": "vulnerable_dependency", + "primary_cwe": "CWE-1395", + "acceptable_cwes": [ + "CWE-1395", + "CWE-502", + "CWE-1104" + ], + "file": "backend/requirements.txt", + "start_line": 8, + "end_line": 8, + "function": null, + "severity": "high", + "description": "Vulnerable dependency: PyYAML 5.3.1 is affected by CVE-2020-14343 (arbitrary code execution via yaml.load with the unsafe FullLoader/Loader). REACHABLE in this app: backend/routes/auth_routes.py:157 calls yaml.load(profile_yaml, Loader=yaml.Loader) on attacker-controlled input from the /api/profile/import endpoint, so the dependency CVE compounds the code-level insecure deserialization (dvblab-012). GT previously listed only the code-level sink, not the outdated-dependency angle (added by G3-1 GT-completeness pass)." } ] diff --git a/tests/eval/fixtures/realvuln-dvpwa/vuln-cases.json b/tests/eval/fixtures/realvuln-dvpwa/vuln-cases.json index 3f91684..2f3356e 100644 --- a/tests/eval/fixtures/realvuln-dvpwa/vuln-cases.json +++ b/tests/eval/fixtures/realvuln-dvpwa/vuln-cases.json @@ -424,5 +424,21 @@ "function": null, "severity": "low", "description": "The application bundles jQuery v3.2.1, which has documented CVEs including prototype pollution (CVE-2019-11358) and XSS (CVE-2020-11022, CVE-2020-11023). Exploiting the XSS CVEs requires passing attacker-controlled strings to vulnerable jQuery DOM manipulation methods (.html(), .append(), etc.). The existing templates do not show obvious such patterns, and the dominant XSS risk is server-side via autoescape=False (dvpwa-002 through dvpwa-007). However, having a known-CVE library bundled in the application is a genuine low-severity hardening concern." 
+ }, + { + "id": "dvpwa-023", + "is_vulnerable": true, + "vulnerability_class": "stored_xss", + "primary_cwe": "CWE-79", + "acceptable_cwes": [ + "CWE-79", + "CWE-80" + ], + "file": "sqli/templates/evaluate.jinja2", + "start_line": 25, + "end_line": 25, + "function": null, + "severity": "medium", + "description": "Stored XSS via {{ student.name }} rendered without the |e filter on line 25 (the 'Back to student' link) under global autoescape=False (dvpwa-007). Student names are user-supplied via POST /students/ with no auth (same vector as dvpwa-003). Note line 9 of the same template correctly escapes with {{ student.name | e }}, but line 25 does not. Line 28 {{ course.name }} is NOT a sink: the Course NamedTuple exposes only id/title/description, so course.name is Undefined and renders empty (added by G3-1 GT-completeness pass)." } ] diff --git a/tests/eval/fixtures/realvuln-extremely-vulnerable-flask-app/meta.yaml b/tests/eval/fixtures/realvuln-extremely-vulnerable-flask-app/meta.yaml new file mode 100644 index 0000000..3b021b7 --- /dev/null +++ b/tests/eval/fixtures/realvuln-extremely-vulnerable-flask-app/meta.yaml @@ -0,0 +1,9 @@ +slug: realvuln-extremely-vulnerable-flask-app +language: python +framework: flask +source_root: tests/playground/realvuln-repos/realvuln-extremely-vulnerable-flask-app +benchmark: realvuln +repo_url: https://github.com/manuelz120/extremely-vulnerable-flask-app +commit_sha: d5d8875559e21222bbbaadb7b9f0af592c6eb7fa +description: 'RealVuln benchmark: extremely-vulnerable-flask-app — python/flask app + with 32 known vulnerabilities and 4 FP traps.' 
diff --git a/tests/eval/fixtures/realvuln-extremely-vulnerable-flask-app/vuln-cases.json b/tests/eval/fixtures/realvuln-extremely-vulnerable-flask-app/vuln-cases.json new file mode 100644 index 0000000..a1ee477 --- /dev/null +++ b/tests/eval/fixtures/realvuln-extremely-vulnerable-flask-app/vuln-cases.json @@ -0,0 +1,625 @@ +[ + { + "id": "extremely-vulnerable-flask-app-001", + "is_vulnerable": true, + "vulnerability_class": "sql_injection", + "primary_cwe": "CWE-89", + "acceptable_cwes": [ + "CWE-89", + "CWE-564", + "CWE-943" + ], + "file": "routes/signup.py", + "start_line": 15, + "end_line": 18, + "function": "validate_token", + "severity": "critical", + "description": "SQL injection via f-string interpolation of user-supplied registration code directly into a raw SQL query: text(f\"SELECT id, code FROM ... WHERE code = '{code}'\"). An attacker can inject arbitrary SQL through the registration_code form field to bypass registration, extract data, or modify the database." + }, + { + "id": "extremely-vulnerable-flask-app-002", + "is_vulnerable": true, + "vulnerability_class": "sql_injection", + "primary_cwe": "CWE-89", + "acceptable_cwes": [ + "CWE-89", + "CWE-564", + "CWE-943" + ], + "file": "routes/account.py", + "start_line": 33, + "end_line": 33, + "function": "search", + "severity": "high", + "description": "SQL injection via f-string interpolation of user-supplied search parameter into a raw SQL text() fragment: text(f\"text like '%{search_param}%'\"). The search_param comes directly from request.args.get('search') with no sanitization." 
+ }, + { + "id": "extremely-vulnerable-flask-app-003", + "is_vulnerable": true, + "vulnerability_class": "server_side_template_injection", + "primary_cwe": "CWE-1336", + "acceptable_cwes": [ + "CWE-1336", + "CWE-94", + "CWE-95" + ], + "file": "app.py", + "start_line": 31, + "end_line": 33, + "function": "page_not_found", + "severity": "critical", + "description": "Server-Side Template Injection (SSTI) via render_template_string() with user-controlled input. The request.path is interpolated into an f-string that is passed to render_template_string(), allowing an attacker to inject Jinja2 template syntax (e.g., {{7*7}}, {{config}}) via a crafted URL path that triggers a 404." + }, + { + "id": "extremely-vulnerable-flask-app-004", + "is_vulnerable": true, + "vulnerability_class": "insecure_deserialization", + "primary_cwe": "CWE-502", + "acceptable_cwes": [ + "CWE-502", + "CWE-94" + ], + "file": "routes/account.py", + "start_line": 118, + "end_line": 118, + "function": "before_request", + "severity": "critical", + "description": "Remote code execution via insecure deserialization. The 'preferences' cookie is base64-decoded and passed directly to pickle.loads() (imported as 'loads' from pickle). An attacker can craft a malicious pickle payload, base64-encode it, and set it as the preferences cookie to achieve arbitrary code execution on every request." + }, + { + "id": "extremely-vulnerable-flask-app-005", + "is_vulnerable": true, + "vulnerability_class": "ssrf", + "primary_cwe": "CWE-918", + "acceptable_cwes": [ + "CWE-918", + "CWE-441" + ], + "file": "utils/profile_image.py", + "start_line": 7, + "end_line": 7, + "function": "download", + "severity": "high", + "description": "Server-Side Request Forgery (SSRF) via urlopen(url) where url is user-supplied from the profile image URL form. 
An attacker can provide internal URLs (e.g., file:///etc/passwd, http://169.254.169.254/latest/meta-data/) to access internal network resources, cloud metadata endpoints, or read local files." + }, + { + "id": "extremely-vulnerable-flask-app-006", + "is_vulnerable": true, + "vulnerability_class": "stored_xss", + "primary_cwe": "CWE-79", + "acceptable_cwes": [ + "CWE-79", + "CWE-80" + ], + "file": "templates/home.html", + "start_line": 31, + "end_line": 31, + "function": null, + "severity": "medium", + "description": "Stored XSS via the Jinja2 '| safe' filter on note text: {{ note.text | safe }}. Note text is submitted through a CKEditor rich text field, but CKEditor is a client-side control that can be bypassed. An attacker can submit arbitrary HTML/JavaScript as note text, which will be rendered unescaped for all users viewing the home page." + }, + { + "id": "extremely-vulnerable-flask-app-007", + "is_vulnerable": true, + "vulnerability_class": "stored_xss", + "primary_cwe": "CWE-79", + "acceptable_cwes": [ + "CWE-79", + "CWE-80" + ], + "file": "templates/search.html", + "start_line": 18, + "end_line": 18, + "function": null, + "severity": "medium", + "description": "Stored XSS via render_table with safe_columns=['text']. The Bootstrap-Flask render_table macro renders the 'text' column as safe (unescaped) HTML. Note text containing malicious scripts will be rendered without escaping in search results." + }, + { + "id": "extremely-vulnerable-flask-app-008", + "is_vulnerable": true, + "vulnerability_class": "stored_xss", + "primary_cwe": "CWE-79", + "acceptable_cwes": [ + "CWE-79", + "CWE-80" + ], + "file": "templates/personal_notes.html", + "start_line": 16, + "end_line": 16, + "function": null, + "severity": "medium", + "description": "Stored XSS via render_table with safe_columns=['text']. The Bootstrap-Flask render_table macro renders the 'text' column as safe (unescaped) HTML. 
Note text containing malicious scripts will be rendered without escaping on the personal notes page." + }, + { + "id": "extremely-vulnerable-flask-app-009", + "is_vulnerable": true, + "vulnerability_class": "idor", + "primary_cwe": "CWE-639", + "acceptable_cwes": [ + "CWE-639", + "CWE-284", + "CWE-285", + "CWE-862" + ], + "file": "routes/account.py", + "start_line": 41, + "end_line": 48, + "function": "get_personal_notes", + "severity": "medium", + "description": "Insecure Direct Object Reference (IDOR) on the /accounts//notes endpoint. Any authenticated user can view any other user's notes (including private ones) by changing the user_id parameter in the URL. No authorization check verifies that the requesting user owns the requested notes." + }, + { + "id": "extremely-vulnerable-flask-app-010", + "is_vulnerable": true, + "vulnerability_class": "idor", + "primary_cwe": "CWE-639", + "acceptable_cwes": [ + "CWE-639", + "CWE-284", + "CWE-285", + "CWE-862" + ], + "file": "routes/notes.py", + "start_line": 39, + "end_line": 52, + "function": "delete_note", + "severity": "medium", + "description": "IDOR on note deletion. The /notes//delete endpoint does not verify that the authenticated user owns the note being deleted. While the template only renders the delete button for the note owner or admin, the backend has no ownership check \u2014 any authenticated user can delete any note by directly sending a POST request with any note_id." + }, + { + "id": "extremely-vulnerable-flask-app-011", + "is_vulnerable": true, + "vulnerability_class": "mass_assignment", + "primary_cwe": "CWE-915", + "acceptable_cwes": [ + "CWE-915", + "CWE-269", + "CWE-284" + ], + "file": "routes/account.py", + "start_line": 76, + "end_line": 81, + "function": "update_account", + "severity": "critical", + "description": "Privilege escalation via mass assignment. The update_account function uses current_user.__dict__.update(filtered_values) to apply all form fields to the user object. 
The AccountForm includes an is_admin BooleanField, so any user can submit is_admin=y to escalate to admin privileges. The admin checkbox is only hidden client-side (template), not validated server-side." + }, + { + "id": "extremely-vulnerable-flask-app-012", + "is_vulnerable": true, + "vulnerability_class": "hardcoded_credentials", + "primary_cwe": "CWE-798", + "acceptable_cwes": [ + "CWE-798", + "CWE-259", + "CWE-321" + ], + "file": "app.py", + "start_line": 11, + "end_line": 11, + "function": null, + "severity": "high", + "description": "Hardcoded Flask secret key: app.secret_key = 'super secret key'. This key is used to sign session cookies. An attacker who knows this key can forge session cookies, impersonate any user, and bypass authentication entirely." + }, + { + "id": "extremely-vulnerable-flask-app-013", + "is_vulnerable": true, + "vulnerability_class": "missing_authentication", + "primary_cwe": "CWE-306", + "acceptable_cwes": [ + "CWE-306", + "CWE-862", + "CWE-287" + ], + "file": "routes/account.py", + "start_line": 68, + "end_line": 69, + "function": "update_account", + "severity": "high", + "description": "Missing @login_required decorator on the POST /account endpoint (update_account function). While the GET /account route requires authentication, the POST route that modifies user account data does not. An unauthenticated attacker could potentially modify account details." + }, + { + "id": "extremely-vulnerable-flask-app-014", + "is_vulnerable": true, + "vulnerability_class": "hardcoded_credentials", + "primary_cwe": "CWE-798", + "acceptable_cwes": [ + "CWE-798", + "CWE-259", + "CWE-521" + ], + "file": "db_seed.py", + "start_line": 17, + "end_line": 19, + "function": "setup_db", + "severity": "medium", + "description": "Hardcoded default credentials for seeded users: user@evfa.com/user and admin@evfa.com/admin. These weak, well-known credentials are created on application startup and provide immediate access to both regular and admin accounts." 
+ }, + { + "id": "extremely-vulnerable-flask-app-015", + "is_vulnerable": true, + "vulnerability_class": "hardcoded_credentials", + "primary_cwe": "CWE-798", + "acceptable_cwes": [ + "CWE-798", + "CWE-259" + ], + "file": "db_seed.py", + "start_line": 9, + "end_line": 9, + "function": "setup_db", + "severity": "medium", + "description": "Hardcoded static registration code 'a36e990b-0024-4d55-b74a-f8d7528e1764' seeded on every application startup. Anyone who knows this code can register new accounts, bypassing the registration code requirement." + }, + { + "id": "extremely-vulnerable-flask-app-017", + "is_vulnerable": true, + "vulnerability_class": "sensitive_data_exposure", + "primary_cwe": "CWE-200", + "acceptable_cwes": [ + "CWE-200", + "CWE-209" + ], + "file": "routes/login.py", + "start_line": 50, + "end_line": 56, + "function": "logged_in", + "severity": "low", + "description": "Information disclosure via /is_logged_in endpoint. This unauthenticated endpoint exposes whether a user is currently logged in and their email address. Can be used for user enumeration and session state reconnaissance." + }, + { + "id": "extremely-vulnerable-flask-app-018", + "is_vulnerable": true, + "vulnerability_class": "sensitive_data_exposure", + "primary_cwe": "CWE-532", + "acceptable_cwes": [ + "CWE-532", + "CWE-200", + "CWE-209" + ], + "file": "models/__init__.py", + "start_line": 12, + "end_line": 15, + "function": null, + "severity": "low", + "description": "SQLAlchemy engine created with echo=True, which logs all SQL statements including potentially sensitive query parameters (passwords, registration codes, user data) to stdout/stderr. This can expose sensitive data in application logs." 
+ }, + { + "id": "extremely-vulnerable-flask-app-019", + "is_vulnerable": true, + "vulnerability_class": "missing_rate_limiting", + "primary_cwe": "CWE-307", + "acceptable_cwes": [ + "CWE-307", + "CWE-799" + ], + "file": "routes/login.py", + "start_line": 22, + "end_line": 40, + "function": "do_login", + "severity": "high", + "description": "The POST /login endpoint (do_login function) has no rate limiting. No Flask-Limiter, SlowAPI, or any throttling middleware is used anywhere in the application. An attacker can make unlimited brute-force login attempts against user credentials. Per classification guidance, missing rate limiting on authentication endpoints is High severity." + }, + { + "id": "extremely-vulnerable-flask-app-020", + "is_vulnerable": true, + "vulnerability_class": "missing_rate_limiting", + "primary_cwe": "CWE-307", + "acceptable_cwes": [ + "CWE-307", + "CWE-799" + ], + "file": "routes/signup.py", + "start_line": 33, + "end_line": 69, + "function": "do_signup", + "severity": "high", + "description": "The POST /signup endpoint (do_signup function) has no rate limiting. No throttling middleware exists in the application. An attacker can brute-force registration codes to create unauthorized accounts. The registration code validation (validate_token) is called on each attempt with no restriction on frequency. Per classification guidance, missing rate limiting on registration endpoints is High severity." + }, + { + "id": "extremely-vulnerable-flask-app-021", + "is_vulnerable": true, + "vulnerability_class": "stored_xss", + "primary_cwe": "CWE-79", + "acceptable_cwes": [ + "CWE-79", + "CWE-80" + ], + "file": "templates/account.html", + "start_line": 42, + "end_line": 45, + "function": null, + "severity": "high", + "description": "Stored XSS via tag rendering user-controlled profile image data URI. The template renders: . The profile image is set via get_base64_image_blob() which uses guess_type(url) for MIME type. 
An attacker providing a URL ending in .html causes the stored data URI to be data:text/html;base64,... \u2014 the tag renders this as HTML with JavaScript execution. Base64 content passes through Jinja2 auto-escaping unmodified. Exploitation against other users requires chaining with the missing CSRF protection (GT-016) to set a victim's profile image to a malicious URL." + }, + { + "id": "extremely-vulnerable-flask-app-022", + "is_vulnerable": true, + "vulnerability_class": "stored_xss", + "primary_cwe": "CWE-79", + "acceptable_cwes": [ + "CWE-79", + "CWE-80" + ], + "file": "templates/base.html", + "start_line": 57, + "end_line": 66, + "function": null, + "severity": "high", + "description": "Stored XSS via tag in the base layout template's navbar. The profile image is rendered as: . Same attack vector as account.html \u2014 attacker-controlled data URI with text/html MIME type executes JavaScript. This is higher impact than account.html because base.html is the layout template rendered on EVERY page, meaning the XSS payload executes on every page load after exploitation. Requires CSRF chain (GT-016) to exploit against other users." + }, + { + "id": "extremely-vulnerable-flask-app-023", + "is_vulnerable": true, + "vulnerability_class": "insecure_cookie", + "primary_cwe": "CWE-614", + "acceptable_cwes": [ + "CWE-614", + "CWE-311" + ], + "file": "app.py", + "start_line": 10, + "end_line": 11, + "function": null, + "severity": "low", + "description": "The Flask app does not set SESSION_COOKIE_SECURE=True. Session cookies will be transmitted over unencrypted HTTP connections. The nginx configuration (conf/nginx.conf) only listens on port 80 with no TLS, confirming cookies are sent in cleartext. This is a best-practice hardening issue \u2014 classified as Low since it's an informational finding for an open-source development/educational application." 
+ }, + { + "id": "extremely-vulnerable-flask-app-024", + "is_vulnerable": true, + "vulnerability_class": "cleartext_transmission", + "primary_cwe": "CWE-319", + "acceptable_cwes": [ + "CWE-319", + "CWE-311" + ], + "file": "conf/nginx.conf", + "start_line": 27, + "end_line": 28, + "function": null, + "severity": "low", + "description": "The nginx configuration listens exclusively on port 80 (HTTP) with no TLS/SSL listener. All traffic including credentials, session cookies, and user data is transmitted in cleartext. Classified as Low since this is an infrastructure configuration issue for an open-source educational application, though it does enable network-level interception of authentication traffic." + }, + { + "id": "extremely-vulnerable-flask-app-025", + "is_vulnerable": true, + "vulnerability_class": "insecure_cookie", + "primary_cwe": "CWE-1004", + "acceptable_cwes": [ + "CWE-1004", + "CWE-732" + ], + "file": "routes/account.py", + "start_line": 127, + "end_line": 128, + "function": "after_request", + "severity": "medium", + "description": "The 'preferences' cookie is set via response.set_cookie('preferences', b64encode(dumps(preferences)).decode()) without httponly=True. This allows JavaScript to read and modify the cookie. Critically, this cookie is deserialized using pickle.loads() in the before_request handler (line 118), so an XSS attack can modify the preferences cookie to a malicious pickle payload, achieving RCE. The missing HttpOnly flag enables this XSS-to-RCE chain \u2014 classified as Medium since it's a valid security concern that amplifies existing XSS vulnerabilities." 
+ }, + { + "id": "extremely-vulnerable-flask-app-026", + "is_vulnerable": true, + "vulnerability_class": "denial_of_service", + "primary_cwe": "CWE-400", + "acceptable_cwes": [ + "CWE-400", + "CWE-770" + ], + "file": "utils/profile_image.py", + "start_line": 7, + "end_line": 8, + "function": "download", + "severity": "low", + "description": "The download function reads the entire HTTP response with response.read() without any size limit. An attacker can provide a URL pointing to an extremely large file (or a slow-drip endpoint), causing memory exhaustion or resource starvation on the server. Classified as Low per DoS guidance \u2014 this is a generic DoS possibility through a normal endpoint that requires authentication to trigger." + }, + { + "id": "extremely-vulnerable-flask-app-027", + "is_vulnerable": true, + "vulnerability_class": "insecure_cookie", + "primary_cwe": "CWE-1004", + "acceptable_cwes": [ + "CWE-1004", + "CWE-732" + ], + "file": "routes/account.py", + "start_line": 105, + "end_line": 105, + "function": "toggle_darkmode", + "severity": "medium", + "description": "The 'preferences' cookie is set via response.set_cookie() in toggle_darkmode (line 105) and after_request (line 127-128) without httponly=True. JavaScript can read and modify this cookie. Critically, the preferences cookie is deserialized using pickle.loads() in the before_request handler (line 118), so an XSS attack can replace the cookie with a malicious pickle payload to achieve remote code execution. The missing HttpOnly flag enables this XSS-to-RCE attack chain. Classified as Medium since it amplifies existing XSS vulnerabilities into RCE." 
+ }, + { + "id": "extremely-vulnerable-flask-app-028", + "is_vulnerable": true, + "vulnerability_class": "insecure_cookie", + "primary_cwe": "CWE-614", + "acceptable_cwes": [ + "CWE-614", + "CWE-311" + ], + "file": "routes/account.py", + "start_line": 105, + "end_line": 105, + "function": "toggle_darkmode", + "severity": "low", + "description": "The 'preferences' cookie is set without secure=True in both toggle_darkmode (line 105) and after_request (line 127-128). This cookie is transmitted over unencrypted HTTP connections (nginx only listens on port 80). Since the cookie is deserialized with pickle.loads(), a network attacker performing a MITM attack could modify the cookie in transit to inject a malicious pickle payload, achieving RCE. Classified as Low since it requires network-level attacker positioning, but the impact is amplified by the insecure pickle deserialization." + }, + { + "id": "extremely-vulnerable-flask-app-fp-001", + "is_vulnerable": false, + "vulnerability_class": "sql_injection", + "primary_cwe": "CWE-89", + "acceptable_cwes": [ + "CWE-89", + "CWE-943" + ], + "file": "routes/login.py", + "start_line": 30, + "end_line": 31, + "function": "do_login", + "severity": "high", + "description": "NOT VULNERABLE. The login query uses SQLAlchemy ORM's filter method with a Python equality operator: session.query(User).filter(User.email == form.email.data). SQLAlchemy ORM automatically parameterizes these queries, preventing SQL injection. The email value is never interpolated into raw SQL." + }, + { + "id": "extremely-vulnerable-flask-app-fp-002", + "is_vulnerable": false, + "vulnerability_class": "sql_injection", + "primary_cwe": "CWE-89", + "acceptable_cwes": [ + "CWE-89", + "CWE-943" + ], + "file": "routes/signup.py", + "start_line": 41, + "end_line": 43, + "function": "do_signup", + "severity": "high", + "description": "NOT VULNERABLE. 
The user existence check uses SQLAlchemy ORM: session.query(session.query(User).where(User.email == form.email.data).exists()).scalar(). The ORM parameterizes the email comparison, preventing SQL injection despite the query involving user input." + }, + { + "id": "extremely-vulnerable-flask-app-fp-003", + "is_vulnerable": false, + "vulnerability_class": "weak_password_storage", + "primary_cwe": "CWE-916", + "acceptable_cwes": [ + "CWE-916", + "CWE-328" + ], + "file": "routes/signup.py", + "start_line": 64, + "end_line": 64, + "function": "do_signup", + "severity": "high", + "description": "NOT VULNERABLE. Passwords are properly hashed using bcrypt with a random salt: hashpw(form.password.data.encode('utf-8'), gensalt()). bcrypt is a strong, adaptive hashing algorithm suitable for password storage. The same secure pattern is used in account updates (routes/account.py:88)." + }, + { + "id": "extremely-vulnerable-flask-app-fp-004", + "is_vulnerable": false, + "vulnerability_class": "reflected_xss", + "primary_cwe": "CWE-79", + "acceptable_cwes": [ + "CWE-79", + "CWE-80" + ], + "file": "templates/search.html", + "start_line": 9, + "end_line": 9, + "function": null, + "severity": "medium", + "description": "NOT VULNERABLE. The search parameter is rendered in the input value attribute using Jinja2's default auto-escaping: value=\"{{search}}\". Jinja2 automatically escapes HTML special characters (<, >, &, \") in template expressions, preventing attribute injection and reflected XSS in this context." 
+ }, + { + "id": "extremely-vulnerable-flask-app-fp-010", + "is_vulnerable": true, + "vulnerability_class": "csrf", + "primary_cwe": "CWE-352", + "acceptable_cwes": [ + "CWE-352" + ], + "file": "routes/registration_codes.py", + "start_line": 28, + "end_line": 28, + "function": "registration_codes", + "severity": "high", + "description": "CSRF vulnerability: The add_registration_codes route at line 28 is decorated with @login_required, accepts POST, checks current_user.is_admin, and creates a new RegistrationCode in the database via session.add() and sess", + "acceptable_locations": [ + { + "file": "templates/registration_codes.html", + "start_line": 8, + "end_line": 8, + "function": null + } + ] + }, + { + "id": "extremely-vulnerable-flask-app-fp-011", + "is_vulnerable": true, + "vulnerability_class": "csrf", + "primary_cwe": "CWE-352", + "acceptable_cwes": [ + "CWE-352" + ], + "file": "routes/notes.py", + "start_line": 18, + "end_line": 18, + "function": "notes", + "severity": "high", + "description": "CSRF vulnerability: The add_note route at line 18 is decorated with @login_required, accepts POST, and inserts a new Note into the database tied to current_user.id via session.add() and session.commit(). No CSRF token va", + "acceptable_locations": [ + { + "file": "templates/partials/create_note_modal.html", + "start_line": 9, + "end_line": 9, + "function": null + } + ] + }, + { + "id": "extremely-vulnerable-flask-app-fp-012", + "is_vulnerable": true, + "vulnerability_class": "csrf", + "primary_cwe": "CWE-352", + "acceptable_cwes": [ + "CWE-352" + ], + "file": "routes/notes.py", + "start_line": 41, + "end_line": 41, + "function": "notes", + "severity": "high", + "description": "CSRF vulnerability: The delete_note route at line 41 is decorated with @login_required, accepts POST, and deletes a Note from the database via session.delete() and session.commit(). 
No CSRF token validation or CSRFProtec", + "acceptable_locations": [ + { + "file": "templates/home.html", + "start_line": 42, + "end_line": 42, + "function": null + } + ] + }, + { + "id": "extremely-vulnerable-flask-app-fp-013", + "is_vulnerable": true, + "vulnerability_class": "csrf", + "primary_cwe": "CWE-352", + "acceptable_cwes": [ + "CWE-352" + ], + "file": "routes/account.py", + "start_line": 53, + "end_line": 53, + "function": "account", + "severity": "high", + "description": "CSRF vulnerability: The add_image route at line 53 is decorated with @login_required, accepts POST, and updates current_user.profile_image in the database via session.merge() and session.commit(). No CSRF token validatio", + "acceptable_locations": [ + { + "file": "templates/partials/change_image_modal.html", + "start_line": 10, + "end_line": 10, + "function": null + } + ] + }, + { + "id": "extremely-vulnerable-flask-app-fp-014", + "is_vulnerable": true, + "vulnerability_class": "csrf", + "primary_cwe": "CWE-352", + "acceptable_cwes": [ + "CWE-352" + ], + "file": "routes/account.py", + "start_line": 69, + "end_line": 69, + "function": "account", + "severity": "high", + "description": "CSRF vulnerability: The update_account route at line 69 accepts POST and updates the authenticated user's account data (email, is_admin, password) in the database via current_user.__dict__.update() and session.merge(). 
A", + "acceptable_locations": [ + { + "file": "templates/account.html", + "start_line": 11, + "end_line": 11, + "function": null + } + ] + } +] diff --git a/tests/eval/fixtures/vaultpay/meta.yaml b/tests/eval/fixtures/vaultpay/meta.yaml index 68a15f8..555f762 100644 --- a/tests/eval/fixtures/vaultpay/meta.yaml +++ b/tests/eval/fixtures/vaultpay/meta.yaml @@ -1,7 +1,7 @@ slug: vaultpay language: typescript framework: express -source_root: tests/playground/typescript/vault-pay/vault-pay +source_root: tests/playground/typescript/vault-pay description: > TypeScript/Express P2P money-transfer + ledger fixture. Multi-currency accounts, transfers with idempotency, card storage, promo redemption, diff --git a/tests/eval/fixtures/vulnyapi/artifacts/likec4_build_result.txt b/tests/eval/fixtures/vulnyapi/artifacts/likec4_build_result.txt new file mode 100644 index 0000000..48c8e73 --- /dev/null +++ b/tests/eval/fixtures/vulnyapi/artifacts/likec4_build_result.txt @@ -0,0 +1,568 @@ +## Coverage +- Containers modeled: 6 (Auth Router, Users Router, Notes Router, Webhooks Router, Admin Router, Files Router) +- External systems modeled: 4 (Arbitrary Webhook Target, SMS Provider (Twilio/etc.), JWT Secret, Bcrypt / passlib) +- Data stores modeled: 2 + - SQLite Database — tags: #pii, #secrets, #tokens, #credentials, #business + - Disk File Storage — tag: #business +- Trust boundaries encoded: + 1. Internet → Public API over HTTP (no TLS, port 4444) — anonymous_user → vulnyapi + 2. Authenticated access via Bearer JWT or pg_session_id cookie — authenticated_user → vulnyapi + 3. Admin access via unvalidated X-Is-Admin header (no RBAC enforcement) — administrator → admin_router + 4. Internal → External: webhooks to user-supplied URL (SSRF risk, no allowlist) — webhooks_router → arbitrary_webhook_target + 5. Identity boundary: auth_service ↔ jwt_secret (HS256 with hardcoded fallback), auth_service ↔ bcrypt_hasher + 6. 
Data exposure boundary: admin_router returns password_hash + session tokens to unauthenticated callers +- Views produced: systemLandscape, containers, security, auth-flow (dynamic/sequence) + +## Validation +- Last LikeC4 validation status: ok (0 issues) +- Outstanding warnings: none + +## Uncertainties +- The SMS gateway is a mock (logs to stdout); the real Twilio/etc. integration is not present in code but modeled as placeholder for completeness. +- No rate-limiting, CSRF protection, or RBAC middleware exists — these are noted as absences rather than modeled elements. +- The SQLite database path (data/base.db) and upload directory (data/uploads/) are relative; absolute paths depend on runtime working directory. + +## LikeC4 Source +```likec4 +specification { + // Element kinds + element actor { style { shape person } } + element system { style { shape rectangle } } + element container { + technology "default tech" + style { shape rectangle } + } + element service { style { shape rectangle } } + element database { style { shape storage } } + element storage { style { shape storage } } + element external { + notation "External System" + style { shape rectangle; color muted } + } + + // Relationship kinds — declare before using `-[kind]->` + relationship authenticates { + description "Authenticates identity credentials" + } + relationship calls { + description "Makes an HTTP/RPC call" + } + relationship reads { + description "Reads data from a store" + } + relationship stores { + description "Writes data to a store" + } + + // Tags — data classification and trust zones + tag pii + tag secrets + tag tokens + tag credentials + tag audit + tag business + tag external + tag public + tag internal + tag admin + tag identity + + // Custom colors for security emphasis + color red #EF4444 + color amber #F59E0B +} + + +model { + // ── Actors ──────────────────────────────────────────────────────────── + + anonymous_user = actor "Anonymous User" { + description ''' + 
Unauthenticated external caller. Can reach public endpoints: + POST /register, POST /login, POST /password/reset/request, + POST /password/reset/confirm, GET /admin/users (no auth required). + Evidence: routers/auth.py:14-56, routers/admin.py:14-28. + ''' + tag #external + } + + authenticated_user = actor "Authenticated User" { + description ''' + Logged-in user with valid JWT or pg_session_id cookie. + Can access protected endpoints: GET /users/{id}, POST /notes, + DELETE /notes/{id}, POST /webhooks/test, POST /files/upload. + Evidence: helpers/auth.py:39-51 (current_user dependency). + ''' + tag #internal + } + + administrator = actor "Administrator" { + description ''' + Intended admin role with is_admin=True on User row. + However, no RBAC enforcement exists — the X-Admin-Console header + is never validated against user.is_admin. Any caller can reach + GET /admin/users without authentication or authorization. + Evidence: routers/admin.py:14-28, helpers/auth.py (no RBAC check). + ''' + tag #admin + } + + + // ── System boundary ───────────────────────────────────────────────── + + vulnyapi = system "VulnyAPI" { + description ''' + Intentionally-vulnerable FastAPI application for security-analyser + evaluation. Binds 0.0.0.0:4444 with no TLS. No rate limiting, + no CSRF protection, no RBAC middleware. + Evidence: app.py:15-40. + ''' + + // ── Containers (one per router module) ──────────────────────────── + + auth_router = container "Auth Router" { + technology "FastAPI / Uvicorn" + description ''' + Authentication endpoints: POST /register, POST /login, + POST /password/reset/request, POST /password/reset/confirm. + Registration response exposes password_hash (RegisteredUser). + Login sets pg_session_id cookie without Secure/SameSite flags. + Evidence: routers/auth.py:14-56. 
+ ''' + tag #public + } + + users_router = container "Users Router" { + technology "FastAPI / Uvicorn" + description ''' + User profile endpoints: GET /users/{user_id}, + GET /users/{user_id}/notes. Requires authentication (CurrentUser). + No ownership check — any authenticated user can read another's notes. + Evidence: routers/users.py:12-36. + ''' + tag #internal + } + + notes_router = container "Notes Router" { + technology "FastAPI / Uvicorn" + description ''' + Note CRUD + search: POST /notes, DELETE /notes/{note_id}, + GET /notes/search (unauthenticated — SQL injection via q param). + Delete has no ownership check. Search uses raw string-interpolated SQL. + Evidence: routers/notes.py:12-40. + ''' + tag #internal + } + + webhooks_router = container "Webhooks Router" { + technology "FastAPI / Uvicorn + httpx" + description ''' + Webhook test endpoint: POST /webhooks/test. Accepts arbitrary URL + and JSON payload, forwards via httpx.post with 5s timeout. + SSRF risk — no allowlist or loopback/internal-IP blocking. + Evidence: routers/webhooks.py:12-30. + ''' + tag #internal + } + + admin_router = container "Admin Router" { + technology "FastAPI / Uvicorn" + description ''' + Admin user listing: GET /admin/users. No authentication or RBAC — + any unauthenticated caller can access. Returns full User rows including + password_hash and latest session token (AdminUserRow). + Evidence: routers/admin.py:14-28. + ''' + tag #public + } + + files_router = container "Files Router" { + technology "FastAPI / Uvicorn" + description ''' + File upload endpoint: POST /files/upload. Saves to disk under + data/uploads/user-{id}/ with user-supplied filename (path traversal risk). + Requires authentication (CurrentUser). + Evidence: routers/files.py:16-38. 
+ ''' + tag #internal + } + + + // ── Internal services ──────────────────────────────────────────── + + auth_service = service "Auth Service" { + description ''' + AuthService: register, authenticate, issue_jwt (HS256), + issue_session_cookie (uuid4 hex, no expiry), password reset flow + with 6-digit OTP via SMS gateway. Old sessions not invalidated + after password change. + Evidence: services/auth.py:17-90. + ''' + tag #identity + } + + users_service = service "Users Service" { + description ''' + UserService: CRUD on User entity — get, get_by_username, + get_by_phone, create, list_all, set_password_hash. + Evidence: services/users.py:9-42. + ''' + tag #identity + } + + notes_service = service "Notes Service" { + description ''' + NoteService: CRUD + search() using raw SQL with string interpolation + (SQL injection vector). No parameterized queries for search. + Evidence: services/notes.py:9-42. + ''' + tag #business + } + + sms_gateway = service "SMS Gateway" { + description ''' + Mock SMS gateway — logs OTP codes instead of sending real messages. + Placeholder for Twilio/etc. in production. Used during password reset. + Evidence: services/sms.py:7-12. + ''' + tag #external + } + + + // ── Data store ─────────────────────────────────────────────────── + + sqlite_db = database "SQLite Database" { + description ''' + File-based SQLite at data/base.db. Tables: user (id, username, email, + phone, password_hash [bcrypt], is_admin, created_at), note (id, owner_id, + title, body, is_private, created_at), reset_code (id, user_id, code + [plaintext 6-digit OTP], expires_at, used) — ResetCode table stores + plaintext 6-digit OTP codes for password-reset flow; no hashing or + encryption applied to the code field. session (id, user_id, token [hex], + created_at — no expiry), uploaded_file (id, owner_id, filename, size). + Evidence: adapters/db.py:7-18, models/domain.py:6-52, services/auth.py:62. 
+ ''' + tag #pii + tag #secrets + tag #tokens + tag #credentials + tag #business + } + + disk_file_storage = storage "Disk File Storage" { + description ''' + Uploaded files stored at data/uploads/user-{id}/ with user-supplied + filename. No path sanitization — the filename is directly concatenated + into the target path (path traversal risk: attacker can write outside + the intended directory using ../ sequences). Files are written via + open(target_path, "wb") without extension or content-type validation. + Evidence: routers/files.py:16-38. + ''' + tag #business + } + + + // ── External systems ───────────────────────────────────────────── + + arbitrary_webhook_target = external "Arbitrary Webhook Target" { + description ''' + User-supplied URL target for webhook test endpoint. No allowlist, + no loopback/internal-IP blocking — SSRF risk allowing internal + network scanning and cloud metadata access. + Evidence: routers/webhooks.py:12-30. + ''' + tag #external + } + + sms_provider_placeholder = external "SMS Provider (Twilio/etc.)" { + description ''' + Production SMS provider placeholder. Currently replaced by mock + SmsGateway that logs to stdout. Would send OTP codes for password reset. + Evidence: services/sms.py:1-12, services/auth.py:68. + ''' + tag #external + } + + + // ── Cryptographic dependencies (explicit elements) ─────────────── + + jwt_secret = external "JWT Secret" { + description ''' + Hardcoded HS256 signing secret with insecure fallback. + JWT_SECRET = os.environ.get("VULNYAPI_JWT_SECRET", "dev-secret-change-me"). + If env var unset, anyone can forge valid JWTs — full impersonation. + 24-hour TTL, no revocation path. + Evidence: utils/security.py:17-19. + ''' + tag #secrets + } + + bcrypt_hasher = external "Bcrypt / passlib" { + description ''' + Password hashing via passlib CryptContext(schemes=["bcrypt"]). + Automatic salt generation. No password complexity validation upstream. + Evidence: utils/security.py:21-32. 
+ ''' + tag #external + } + + + // ── Relationships (internal) ───────────────────────────────────── + + auth_router -[calls]-> auth_service "delegates authentication to" + users_router -[calls]-> users_service "delegates user CRUD" + users_router -[calls]-> notes_service "delegates note listing" + notes_router -[calls]-> notes_service "delegates note CRUD + search" + + auth_service -[calls]-> bcrypt_hasher "hash_password / verify_password (bcrypt via passlib)" + auth_service -[calls]-> jwt_secret { + title "issue_token (HS256 encode/decode with hardcoded fallback secret)" + tag #secrets + } + auth_service -[calls]-> sms_gateway "send OTP via SMS gateway" + + auth_service -[reads]-> sqlite_db { + title "SELECT User, Session, ResetCode (password hashes + session tokens)" + tag #secrets + tag #tokens + } + auth_service -[stores]-> sqlite_db { + title "INSERT User, Session, ResetCode (credentials + OTP codes)" + tag #secrets + tag #tokens + } + + users_service -[reads]-> sqlite_db { + title "SELECT User (PII: username, email, phone)" + tag #pii + } + users_service -[stores]-> sqlite_db { + title "INSERT/UPDATE User (credentials + PII)" + tag #secrets + tag #pii + } + + notes_service -[reads]-> sqlite_db { + title "SELECT Note (raw SQL for search — SQL injection vector)" + tag #business + } + notes_service -[stores]-> sqlite_db { + title "INSERT/DELETE Note" + tag #business + } + + files_router -[stores]-> disk_file_storage { + title "writes uploaded file bytes to data/uploads/user-{id}/ (path traversal via unsanitized filename)" + tag #business + } + files_router -[stores]-> sqlite_db { + title "INSERT UploadedFile record (owner_id, filename, size)" + tag #business + } + + + // ── Relationships (inbound entry points — trust boundary crossings) ── + + anonymous_user -[calls]-> vulnyapi { + title "internet → public API over HTTP (no TLS, port 4444)" + tag #public + } + + authenticated_user -[calls]-> vulnyapi { + title "authenticated access via Bearer JWT or 
pg_session_id cookie" + tag #internal + } + + // ── Relationships (external actors -> containers) ────────────────── + + anonymous_user -[calls]-> auth_router "POST /register, POST /login, POST /password/reset/*" + anonymous_user -[calls]-> admin_router "GET /admin/users (no auth — CRITICAL)" + anonymous_user -[calls]-> notes_router "GET /notes/search (unauthenticated SQLi via q param)" + + authenticated_user -[authenticates]-> auth_router { + title "authenticates via JWT or pg_session_id cookie" + tag #identity + } + authenticated_user -[calls]-> users_router "GET /users/{id}, GET /users/{id}/notes" + authenticated_user -[calls]-> notes_router "POST /notes, DELETE /notes/{note_id}" + authenticated_user -[calls]-> webhooks_router "POST /webhooks/test (SSRF via arbitrary URL)" + authenticated_user -[calls]-> files_router "POST /files/upload (path traversal risk)" + + administrator -[calls]-> admin_router { + title "admin access via unvalidated X-Is-Admin header (no RBAC enforcement — any caller reaches GET /admin/users)" + tag #admin + } + + + // ── Relationships (outbound) ───────────────────────────────────── + + webhooks_router -[calls]-> arbitrary_webhook_target "httpx.post to user-supplied URL (SSRF risk, no allowlist)" + sms_gateway -[calls]-> sms_provider_placeholder "mock logs; production would call Twilio API" + + + // ── Relationships (data exposure) ──────────────────────────────── + + auth_router -[reads]-> sqlite_db "exposes password_hash in RegisteredUser response" + admin_router -[reads]-> sqlite_db "returns all users with password_hash + session tokens" + } +} + + +views { + + // ── View 1: systemLandscape — full overview ──────────────────────── + + view systemLandscape { + title "System Landscape" + include ** + autoLayout lr + } + + + // ── View 2: containers — internal detail ─────────────────────────── + + view containers of vulnyapi { + title "Internal Containers & Services" + include * + include -> * + include <- * + style sqlite_db { color 
amber } + autoLayout lr + } + + + // ── View 3: security — external attack surface and trust boundaries ─ + + view security { + title "External Attack Surface & Trust Boundaries" + + // ── Include all elements and relationships for full coverage ─────── + include ** + + // ── Explicitly include public-facing entry points (attack surface) ─ + include * where tag is #public + include anonymous_user -> vulnyapi.** + include authenticated_user -> vulnyapi.** + include administrator -> vulnyapi.** + + // ── Include trust-zone crossing relationships ───────────────────── + // Public → Internal crossings (internet to app) + include * -> * where source.tag is #public and target.tag is #internal + // Internal → External crossings (app to outside world) + include * -> * where source.tag is #internal and target.tag is #external + + // ── Include identity boundary elements ──────────────────────────── + include jwt_secret + include auth_service + include sqlite_db + + // ── Include data stores with sensitive tags ─────────────────────── + include * where tag is #pii or tag is #secrets or tag is #tokens or tag is #credentials + + // ── Styling: critical elements in red (data exposure / secrets) ─── + style sqlite_db { color red } + style jwt_secret { color red } + style * where tag is #pii { color red } + style * where tag is #secrets { color red } + style * where tag is #tokens { color red } + style * where tag is #credentials { color red } + + // ── Styling: external systems in blue (third-party / outside trust) ─ + style arbitrary_webhook_target { color blue } + style sms_provider_placeholder { color blue } + style bcrypt_hasher { color blue } + style * where tag is #external { color blue } + + // ── Styling: admin surface in red (zero-auth critical risk) ─────── + style admin_router { color red } + style administrator { color red } + + // ── Styling: identity boundary elements in amber ────────────────── + style auth_service { color amber } + style * where tag is #identity 
{ color amber } + + // ── Styling: public entry points highlighted ────────────────────── + style auth_router { color red } + style anonymous_user { color red } + + autoLayout lr + } + + + // ── Dynamic View: Authentication Flow (JWT + cookie) ──────────────── + + dynamic view auth-flow { + title "Authentication Flow — Registration, Login/JWT Issuance, and Request Auth" + variant sequence + + // ════════════════════════════════════════════════════════════════════ + // PHASE 1: User Registration + // POST /register → AuthService.register() → bcrypt hash → INSERT User + // Evidence: routers/auth.py:20-34, services/auth.py:25-32 + // ════════════════════════════════════════════════════════════════════ + + anonymous_user -> auth_router "POST /register (username, email, phone, password)" + auth_router -> auth_service "svc.register(username, email, phone, password)" + auth_service -> bcrypt_hasher "hash_password(password) → bcrypt hash" + auth_service <- bcrypt_hasher "returns password_hash" + auth_service -> sqlite_db "INSERT User (username, email, phone, password_hash, is_admin=False)" + sqlite_db -> auth_service "returns created User row" + auth_service -> auth_router "returns User object" + // SECURITY: RegisteredUser response_model exposes password_hash to caller + auth_router -> anonymous_user "201 {id, username, email, phone, password_hash [EXPOSED], created_at}" + + // ════════════════════════════════════════════════════════════════════ + // PHASE 2: Login — JWT issuance + session cookie + // POST /login → AuthService.authenticate() → bcrypt verify → issue_jwt (HS256) + // → issue_session_cookie (uuid4 hex, no expiry) → set pg_session_id cookie + // Evidence: routers/auth.py:37-48, services/auth.py:34-50, utils/security.py:34-51 + // ════════════════════════════════════════════════════════════════════ + + anonymous_user -> auth_router "POST /login (username, password)" + auth_router -> auth_service "svc.authenticate(username, password)" + auth_service -> 
sqlite_db "SELECT User WHERE username = ?" + sqlite_db -> auth_service "returns User row with password_hash" + auth_service -> bcrypt_hasher "verify_password(password, user.password_hash)" + auth_service <- bcrypt_hasher "returns True/False" + + // On success: issue JWT (HS256 with hardcoded fallback secret) + auth_service -> jwt_secret "issue_token(user.id) → HS256 encode(sub=user_id, exp=24h)" + jwt_secret -> auth_service "returns signed JWT string" + + // On success: create session row for cookie-based auth + auth_service -> sqlite_db "INSERT Session (user_id, token=uuid4().hex, created_at) — no expiry" + sqlite_db -> auth_service "commits session row" + + // Return both JWT and set cookie + auth_service -> auth_router "returns User + access_token + cookie_token" + auth_router -> anonymous_user "200 {access_token: JWT} + Set-Cookie: pg_session_id=; httponly=True (no Secure/SameSite)" + + // ════════════════════════════════════════════════════════════════════ + // PHASE 3: Subsequent authenticated request — dual auth path + // helpers/auth.py checks Bearer JWT OR pg_session_id cookie + // Path A: _user_from_jwt → decode_token (HS256 verify) → SELECT User by id + // Path B: _user_from_cookie → SELECT Session WHERE token=? → SELECT User by session.user_id + // Evidence: helpers/auth.py:19-58 + // ════════════════════════════════════════════════════════════════════ + + authenticated_user -> auth_router "GET /users/{id} (Authorization: Bearer OR Cookie: pg_session_id=)" + + // Path A — JWT-based authentication + auth_router -> jwt_secret "_user_from_jwt: decode_token(JWT) → HS256 verify, extract sub=user_id" + jwt_secret -> auth_router "returns user_id (or None if invalid/expired)" + auth_router -> sqlite_db "SELECT User WHERE id = ? 
(from JWT sub claim)" + sqlite_db -> auth_router "returns User row" + + // Path B — Cookie-based authentication (alternative, OR'd with JWT) + auth_router -> sqlite_db "_user_from_cookie: SELECT Session WHERE token = pg_session_id" + sqlite_db -> auth_router "returns Session row with user_id" + auth_router -> sqlite_db "SELECT User WHERE id = session.user_id" + sqlite_db -> auth_router "returns User row" + + // Both paths converge — if either succeeds, request is authenticated + auth_router -> authenticated_user "200 response (or 401 if both JWT and cookie fail)" + + } + +} +``` \ No newline at end of file diff --git a/tests/eval/harness.py b/tests/eval/harness.py index cbb1af8..2a2e66f 100644 --- a/tests/eval/harness.py +++ b/tests/eval/harness.py @@ -39,6 +39,11 @@ class AgentRun: # dict of the emitted event with ``event_type`` added — same shape as a # row in metrics.jsonl. metrics_events: list[dict[str, Any]] = field(default_factory=list) + # True when the run was cut off by ``timeout_s`` — the run is *partial*: + # whatever tool calls / state / artifacts were captured before the + # deadline. Such a run is only ever seen via ``AgentRunTimeout.partial``; + # a normally returned run always has ``timed_out=False``. + timed_out: bool = False def tool_names(self) -> list[str]: return [c.name for c in self.tool_calls] @@ -78,6 +83,22 @@ def capture_imbalance(self) -> dict[str, int]: } +class AgentRunTimeout(TimeoutError): + """Raised when an agent run exceeds ``timeout_s``. + + Carries the partial :class:`AgentRun` accumulated before the deadline + (``timed_out=True``) so eval debugging can see what happened up to the + timeout. Subclasses ``TimeoutError`` (== ``asyncio.TimeoutError``), so + existing consumers keep treating a timeout as a failed attempt — never a + silent pass. 
+ """ + + def __init__(self, timeout_s: float, partial: AgentRun) -> None: + super().__init__(f"agent run timed out after {timeout_s:g}s") + self.timeout_s = timeout_s + self.partial = partial + + async def run_agent( agent: BaseAgent, *, @@ -106,6 +127,10 @@ async def run_agent( via ``FileArtifactService`` — a live, on-disk trace (memory, saved artifacts) that survives the run for offline analysis. Default is an in-memory service (no on-disk trace). + + Raises :class:`AgentRunTimeout` (an ``asyncio.TimeoutError``) when the run + exceeds `timeout_s`; the exception's ``partial`` attribute holds the + partial ``AgentRun`` captured before the deadline (``timed_out=True``). """ if artifact_dir is not None: from google.adk.artifacts import FileArtifactService @@ -173,37 +198,62 @@ async def _consume() -> None: if text: final_text = text - await asyncio.wait_for(_consume(), timeout=timeout_s) + async def _collect(*, timed_out: bool) -> AgentRun: + session = await session_service.get_session( + app_name=app_name, user_id=user_id, session_id=session_id + ) + state = dict(session.state) if session is not None else {} - session = await session_service.get_session( - app_name=app_name, user_id=user_id, session_id=session_id - ) - state = dict(session.state) if session is not None else {} + artifacts: dict[str, str] = {} + for scope_session_id in (session_id, None): + keys = await artifact_service.list_artifact_keys( + app_name=app_name, user_id=user_id, session_id=scope_session_id + ) + for key in keys: + if key in artifacts: + continue + part = await artifact_service.load_artifact( + app_name=app_name, + user_id=user_id, + session_id=scope_session_id, + filename=key, + ) + if part is None: + continue + text = getattr(part, "text", None) or "" + artifacts[key] = text - artifacts: dict[str, str] = {} - for scope_session_id in (session_id, None): - keys = await artifact_service.list_artifact_keys( - app_name=app_name, user_id=user_id, session_id=scope_session_id + return 
AgentRun( + final_text=final_text, + state=state, + artifacts=artifacts, + tool_calls=tool_calls, + tool_responses=tool_responses, + metrics_events=list(metrics_events) if metrics_events is not None else [], + timed_out=timed_out, ) - for key in keys: - if key in artifacts: - continue - part = await artifact_service.load_artifact( - app_name=app_name, - user_id=user_id, - session_id=scope_session_id, - filename=key, + + try: + await asyncio.wait_for(_consume(), timeout=timeout_s) + except TimeoutError as exc: + # Don't discard what the run captured so far — package the partial + # AgentRun onto the raised error so eval debugging can inspect it. + # The error still propagates, so consumers record a failure/timeout. + try: + partial = await _collect(timed_out=True) + except Exception: + # Best effort: never mask the timeout with a collection error. + partial = AgentRun( + final_text=final_text, + state={}, + artifacts={}, + tool_calls=tool_calls, + tool_responses=tool_responses, + metrics_events=( + list(metrics_events) if metrics_events is not None else [] + ), + timed_out=True, ) - if part is None: - continue - text = getattr(part, "text", None) or "" - artifacts[key] = text - - return AgentRun( - final_text=final_text, - state=state, - artifacts=artifacts, - tool_calls=tool_calls, - tool_responses=tool_responses, - metrics_events=list(metrics_events) if metrics_events is not None else [], - ) + raise AgentRunTimeout(timeout_s, partial) from exc + + return await _collect(timed_out=False) diff --git a/tests/eval/results.py b/tests/eval/results.py index 5ba24b8..5148703 100644 --- a/tests/eval/results.py +++ b/tests/eval/results.py @@ -32,6 +32,8 @@ from __future__ import annotations import json +import logging +import os from collections import Counter from collections.abc import Awaitable, Callable from dataclasses import dataclass, field @@ -39,6 +41,8 @@ from pathlib import Path from typing import Any, Literal +logger = logging.getLogger(__name__) + SCHEMA = 
"eval/v1" Scenario = Literal["agent", "task", "pipeline"] @@ -50,6 +54,23 @@ EVAL_ROOT = _REPO_ROOT / "eval_runs" +def _compute_run_stamp() -> str: + """A per-process run id used to namespace the *archive* of every run so + results are never overwritten. ``CONTRACTOR_EVAL_RUN_STAMP`` overrides it + (e.g. to label an A/B: ``0607-qw3off``); otherwise ``mmdd-HHMMSS`` (UTC). + Computed once at import — one pytest process == one run == one stamp. + """ + env = os.environ.get("CONTRACTOR_EVAL_RUN_STAMP") + if env: + return "".join(c if c.isalnum() or c in "-_" else "_" for c in env) + return datetime.now(UTC).strftime("%m%d-%H%M%S") + + +# Per-run archive namespace (never overwritten). Re-read via the module global +# so tests can monkeypatch it; production sets it once at import. +RUN_STAMP = _compute_run_stamp() + + # ───────────────────────── per-attempt / per-case ───────────────────────── @@ -375,16 +396,37 @@ def metrics_from_task(metrics: dict[str, Any]) -> dict[str, Any]: # ───────────────────────── live trace location ───────────────────────── -def case_artifact_dir(unit: str, fixture: str, case_id: str) -> Path: +def _run_slug(scenario: str, unit: str, fixture: str | None = None) -> str: + """``<scenario>-<unit>[-eval-<fixture>]`` — the per-fixture archive folder.""" + slug = f"{scenario}-{_safe_name(unit)}" + if fixture: + slug += f"-eval-{_safe_name(fixture)}" + return slug + + +def run_archive_dir( + scenario: str, unit: str, fixture: str | None = None, *, stamp: str | None = None +) -> Path: + """Dated, never-overwritten archive dir for one run: + ``eval_runs/<stamp>/<scenario>-<unit>-eval-<fixture>/``. + + ``RUN_STAMP`` is per-process, so every run lands in its own folder and no + eval information is ever overwritten (the data-loss fix). The flat + ``eval_runs/<run_name>/`` path is kept separately as a "latest" pointer. 
+ """ + return EVAL_ROOT / (stamp or RUN_STAMP) / _run_slug(scenario, unit, fixture) + + +def case_artifact_dir(unit: str, fixture: str, case_id: str, *, scenario: str = "agent") -> Path: """Directory for a case's *live* on-disk artifact trace, co-located with the - eval_sink per-case metrics: ``eval_runs/<unit>/cases/<case>/artifacts``. + eval_sink per-case metrics under the dated archive: + ``eval_runs/<stamp>/<scenario>-<unit>-eval-<fixture>/cases/<case_id>/artifacts``. Pass this as ``artifact_dir`` to ``run_agent`` / ``run_task_pipeline`` so the full ADK artifact tree persists during the run (survives a timeout/crash), sitting next to the ``metrics.json`` that :class:`EvalSink` writes afterward. """ - return (EVAL_ROOT / _safe_name(unit) / "cases" - / _safe_name(f"{fixture}__{case_id}") / "artifacts") + return run_archive_dir(scenario, unit, fixture) / "cases" / _safe_name(case_id) / "artifacts" # ───────────────────────── serialization ───────────────────────── @@ -419,9 +461,25 @@ def _safe_name(unit: str) -> str: return "".join(c if c.isalnum() or c in "-_" else "_" for c in unit) +def _default_run_name(scenario: str, unit: str, metric_kind: str) -> str: + """Default "latest pointer" dir for one ``(scenario, unit, metric_kind)`` + bucket: ``<scenario>-<unit>[-<metric_kind>]`` (metric_kind only when it + isn't the ``generic`` default), matching the established + ``<scenario>-<unit>-eval-<fixture>`` archive naming. Buckets are keyed on + all three fields, so the default name must carry all three — a bare + ``_safe_name(unit)`` made two buckets sharing a unit overwrite each + other's ``eval_runs/<unit>/eval_results.json`` in the same flush. + """ + name = _run_slug(scenario, unit) + if metric_kind != "generic": + name += f"-{metric_kind}" + return name + + class EvalSink: """Accumulates per-case results across a pytest session and, on flush, - writes one ``eval_results.json`` envelope per ``(scenario, unit)`` group. 
+ writes one ``eval_results.json`` envelope per ``(scenario, unit, + metric_kind)`` group. 
Per-fixture pytest evals are isolated test invocations, so they can't build an aggregate run themselves. Each records a single :class:`CaseResult` here @@ -455,19 +513,34 @@ def record( run = self._runs.setdefault(key, { "scenario": scenario, "unit": unit, "metric_kind": metric_kind, "model": model, "prompt_version": prompt_version, "pass_at": pass_at, - "run_name": run_name or _safe_name(unit), "meta": meta or {}, + "run_name": run_name or _default_run_name(scenario, unit, metric_kind), + "meta": meta or {}, "fixtures": {}, }) + # The first record() seeds model/prompt_version for the whole bucket; + # backfill missing values and warn loudly when later cases disagree + # (the envelope can only carry one value per run). + for field_name, value in (("model", model), ("prompt_version", prompt_version)): + if value is None: + continue + if run[field_name] is None: + run[field_name] = value + elif run[field_name] != value: + logger.warning( + "eval_sink: bucket %r already has %s=%r; case %r recorded %r " + "(first value wins in the envelope)", + key, field_name, run[field_name], case.id, value, + ) run["fixtures"].setdefault(fixture, []).append(case) run["pass_at"] = max(run["pass_at"], pass_at) - # Persist this case immediately (crash-safe): per-case metrics + any - # agent artifacts under eval_runs/<run_name>/cases/<fixture>__<case_id>/. - self._persist_case(run["run_name"], fixture, case, artifacts) + # Persist this case immediately (crash-safe) into the dated, never- + # overwritten archive: eval_runs/<stamp>/<scenario>-<unit>-eval-<fixture>/cases/<case_id>/. 
+ self._persist_case(scenario, unit, fixture, case, artifacts) @staticmethod - def _persist_case(run_name: str, fixture: str, case: CaseResult, + def _persist_case(scenario: str, unit: str, fixture: str, case: CaseResult, artifacts: dict[str, str] | None) -> None: - base = EVAL_ROOT / run_name / "cases" / _safe_name(f"{fixture}__{case.id}") + base = run_archive_dir(scenario, unit, fixture) / "cases" / _safe_name(case.id) base.mkdir(parents=True, exist_ok=True) (base / "metrics.json").write_text( json.dumps({"fixture": fixture, **case.to_dict()}, indent=2, ensure_ascii=False), @@ -484,10 +557,23 @@ def flush(self) -> list[Path]: for run in self._runs.values(): fixtures = [FixtureResult(slug=s, cases=cs) for s, cs in run["fixtures"].items()] - eval_run = EvalRun( - scenario=run["scenario"], unit=run["unit"], pass_at=run["pass_at"], - metric_kind=run["metric_kind"], model=run["model"], - prompt_version=run["prompt_version"], fixtures=fixtures, meta=run["meta"], - ) - paths.append(write_eval_results(eval_run, run["run_name"])) + + def _mk(fxs: list[FixtureResult], _run: dict[str, Any] = run) -> EvalRun: + return EvalRun( + scenario=_run["scenario"], unit=_run["unit"], pass_at=_run["pass_at"], + metric_kind=_run["metric_kind"], model=_run["model"], + prompt_version=_run["prompt_version"], fixtures=fxs, meta=_run["meta"], + ) + + # (1) "latest" pointer — overwritten each run, at the stable + # eval_runs/<scenario>-<unit>[-<metric_kind>]/ path (or the + # caller-supplied run_name) that analytics-ui reads. + paths.append(write_eval_results(_mk(fixtures), run["run_name"])) + # (2) dated, per-fixture archive — NEVER overwritten (one folder per + # run via RUN_STAMP), so eval history is never lost. 
+ for fx in fixtures: + paths.append(write_eval_results( + _mk([fx]), + run_archive_dir(run["scenario"], run["unit"], fx.slug), + )) return paths diff --git a/tests/eval/test_exploitability_task_eval.py b/tests/eval/test_exploitability_task_eval.py index f08ead0..586f5f8 100644 --- a/tests/eval/test_exploitability_task_eval.py +++ b/tests/eval/test_exploitability_task_eval.py @@ -227,7 +227,7 @@ async def _collect_chain( timeout_s=float(case.get("timeout_s", 900)), runner_name=f"exploit-{case['id']}", output_dir=run_dir, - artifact_dir=case_artifact_dir("exploitability_assessment", fixture.slug, case["id"]), + artifact_dir=case_artifact_dir("exploitability_assessment", fixture.slug, case["id"], scenario="task"), post_run_fn=_collect_chain, ) diff --git a/tests/eval/test_likec4_task_eval.py b/tests/eval/test_likec4_task_eval.py index 688f3b0..23a698a 100644 --- a/tests/eval/test_likec4_task_eval.py +++ b/tests/eval/test_likec4_task_eval.py @@ -150,7 +150,7 @@ def queue(runner) -> None: runner_name=f"likec4-{fixture.slug}", preloaded_artifacts=precomputed, output_dir=run_dir, - artifact_dir=case_artifact_dir("likec4_build", fixture.slug, case["id"]), + artifact_dir=case_artifact_dir("likec4_build", fixture.slug, case["id"], scenario="task"), ) if overlay_fs.exists(DEFAULT_LIKEC4_PATH): diff --git a/tests/eval/test_oas_build_task_eval.py b/tests/eval/test_oas_build_task_eval.py index 05ead7c..ec17301 100644 --- a/tests/eval/test_oas_build_task_eval.py +++ b/tests/eval/test_oas_build_task_eval.py @@ -107,7 +107,7 @@ def queue(runner) -> None: runner_name=f"oas-build-{fixture.slug}", preloaded_artifacts=precomputed, output_dir=run_dir, - artifact_dir=case_artifact_dir("oas_build", fixture.slug, fixture.slug), + artifact_dir=case_artifact_dir("oas_build", fixture.slug, fixture.slug, scenario="task"), ) result_text = run.artifacts.get(oas_artifact_key, "") diff --git a/tests/eval/test_oas_enrich_task_eval.py b/tests/eval/test_oas_enrich_task_eval.py index 4e57bb3..e5c7657 
100644 --- a/tests/eval/test_oas_enrich_task_eval.py +++ b/tests/eval/test_oas_enrich_task_eval.py @@ -114,7 +114,7 @@ def queue(runner) -> None: runner_name=f"oas-enrich-{fixture.slug}", preloaded_artifacts=precomputed, output_dir=run_dir, - artifact_dir=case_artifact_dir("oas_enrich", fixture.slug, fixture.slug), + artifact_dir=case_artifact_dir("oas_enrich", fixture.slug, fixture.slug, scenario="task"), ) result_text = run.result_text("oas_enrich") diff --git a/tests/eval/test_threat_analysis_task_eval.py b/tests/eval/test_threat_analysis_task_eval.py index daff327..4353188 100644 --- a/tests/eval/test_threat_analysis_task_eval.py +++ b/tests/eval/test_threat_analysis_task_eval.py @@ -155,7 +155,7 @@ def queue(runner) -> None: timeout_s=float(case.get("timeout_s", 2400.0)), runner_name=f"threat_analysis-{fixture.slug}", preloaded_artifacts=preloaded, - artifact_dir=case_artifact_dir("threat_analysis", fixture.slug, case["id"]), + artifact_dir=case_artifact_dir("threat_analysis", fixture.slug, case["id"], scenario="task"), ) reports_text = run.artifacts.get(VULN_ARTIFACT_KEY, "") diff --git a/tests/eval/test_trace_agent_eval.py b/tests/eval/test_trace_agent_eval.py index 7a83373..9c1efa5 100644 --- a/tests/eval/test_trace_agent_eval.py +++ b/tests/eval/test_trace_agent_eval.py @@ -36,10 +36,10 @@ def _user_message(case: dict) -> str: return ( f"Trace the request flow that begins at `{func}` in `{where}`. " f"{intent} " - "Insert `# @trace target=... args=... calls=...` comments above each " - "function definition you confidently identify as part of the path. " - "Use the `insert_line` tool to mutate files. Stop once the path is " - "covered." + "Annotate each function you confidently identify as part of the path " + "with the `annotate_trace` tool (one call per function). Annotate each " + "function once and move on — do not restore a file to re-annotate with " + "tweaked arguments. Stop once the path is covered." 
) @@ -49,32 +49,65 @@ def _resolve_prompt_version(case: dict) -> str | None: ) +def _with_oas() -> bool: + """X1: feed the OpenAPI spec as an attack-surface map (env-gated A/B). + Default off reproduces the code-direct behaviour exactly.""" + return os.environ.get("CONTRACTOR_EVAL_WITH_OAS", "").strip().lower() in { + "1", "true", "yes", "on", + } + + +def _oas_block(fixture) -> str: + if not _with_oas(): + return "" + oas = getattr(fixture, "expected_oas", None) + if not oas: + return "" + import yaml + spec = yaml.safe_dump(oas, sort_keys=False, allow_unicode=True) + return ( + "\n\nThe target's OpenAPI specification (use it as the attack-surface " + "map — endpoints, parameters, schemas):\n" + spec + ) + + @pytest.mark.eval @pytest.mark.asyncio async def test_trace_agent(trace_case, eval_model, eval_sink): fixture, case = trace_case + n = int(os.environ.get("CONTRACTOR_EVAL_TRACE_PASS_AT", "1")) - run = await run_trace_agent( - fixture_root=fixture.source_root, - user_message=_user_message(case), - model=eval_model, - namespace=f"trace-eval-{fixture.slug}-{case['id']}", - timeout_s=float(case.get("timeout_s", 900.0)), - prompt_version=_resolve_prompt_version(case), - with_graph_tools=bool(case.get("with_graph_tools", True)), - artifact_dir=case_artifact_dir("trace_agent", fixture.slug, case["id"]), - ) + # pass@N: run the case n times; it passes if *any* attempt passes. The + # representative attempt (populates the case record) is the first passing + # one, else the first attempt. 
+ attempts = [] # (run, result) + for i in range(n): + run = await run_trace_agent( + fixture_root=fixture.source_root, + user_message=_user_message(case) + _oas_block(fixture), + model=eval_model, + namespace=f"trace-eval-{fixture.slug}-{case['id']}-a{i + 1}", + timeout_s=float(case.get("timeout_s", 900.0)), + prompt_version=_resolve_prompt_version(case), + with_graph_tools=bool(case.get("with_graph_tools", True)), + artifact_dir=case_artifact_dir("trace_agent", fixture.slug, case["id"]), + ) + attempts.append((run, score_trace_run(run, case))) - result = score_trace_run(run, case) - _agent_run = getattr(run, "agent_run", None) + pass_count = sum(1 for _, r in attempts if r.passed) + passed = pass_count > 0 + rep_run, rep_result = next(((rn, r) for rn, r in attempts if r.passed), attempts[0]) + _agent_run = getattr(rep_run, "agent_run", None) eval_sink.record( scenario="agent", unit="trace_agent", metric_kind="diff", fixture=fixture.slug, model=str(eval_model.model), - prompt_version=_resolve_prompt_version(case), - case=CaseResult(id=case["id"], passed=result.passed, - pass_count=int(result.passed), attempts=1, + prompt_version=_resolve_prompt_version(case), pass_at=n, + case=CaseResult(id=case["id"], passed=passed, + pass_count=pass_count, attempts=n, metrics=metrics_from_events(getattr(_agent_run, "metrics_events", [])), - detail=diff_detail(result)), + detail=diff_detail(rep_result), + runs=([{"passed": r.passed, "detail": diff_detail(r)} + for _, r in attempts] if n > 1 else None)), artifacts=getattr(_agent_run, "artifacts", {}) or {}, ) - assert result.passed, f"trace_agent eval failed: case={case['id']}\n{result.explain()}" + assert passed, f"trace_agent pass@{n} failed: case={case['id']}\n{rep_result.explain()}" diff --git a/tests/units/cli_tests/test_event_handler.py b/tests/units/cli_tests/test_event_handler.py new file mode 100644 index 0000000..c3907fd --- /dev/null +++ b/tests/units/cli_tests/test_event_handler.py @@ -0,0 +1,108 @@ +"""Regression 
tests for cli.main._build_event_handler UI lifecycle (bug 1). + +Pre-fix, ``task_failed`` (and ``run_finished``) were in ``_UI_STOP_EVENTS``. +vuln-scan workflows catch per-finding ``task_failed`` and keep going, but once +the handler called ``ui.stop()`` it still took the ``if ui is not None`` branch +and returned, so every later event vanished (no live render, no print +fallback). The UI must now stop only on the single terminal +``workflow_finished`` event. +""" +from __future__ import annotations + +import pytest + +import cli.main as cli_main + + +class _FakeUI: + instances: list[_FakeUI] = [] + + def __init__(self, *, workflow_name: str) -> None: + self.workflow_name = workflow_name + self.events: list[object] = [] + self.started = False + self.stopped = False + _FakeUI.instances.append(self) + + def start(self) -> None: + self.started = True + + def stop(self) -> None: + self.stopped = True + + def on_event(self, event: object) -> None: + self.events.append(event) + + +class _FakeMetrics: + def __init__(self, _output_dir) -> None: # noqa: ANN001 - matches MetricsSink(output_dir) + pass + + def matches(self, _event) -> bool: # noqa: ANN001 + return False + + async def write(self, _event) -> None: # noqa: ANN001 + pass + + +class _Ev: + def __init__(self, type_: str) -> None: + self.type = type_ + + +@pytest.fixture +def patched(monkeypatch, tmp_path): + _FakeUI.instances.clear() + monkeypatch.setattr(cli_main, "LiveRenderer", _FakeUI) + monkeypatch.setattr(cli_main, "MetricsSink", _FakeMetrics) + handler = cli_main._build_event_handler(tmp_path, "oas_build", enable_ui=True) + return handler, _FakeUI.instances[-1] + + +@pytest.mark.asyncio +async def test_task_failed_does_not_stop_ui(patched): + handler, ui = patched + await handler(_Ev("task_failed")) + assert ui.stopped is False + assert [getattr(e, "type", None) for e in ui.events] == ["task_failed"] + + +@pytest.mark.asyncio +async def test_run_finished_does_not_stop_ui(patched): + handler, ui = patched + 
await handler(_Ev("run_finished")) + assert ui.stopped is False + + +@pytest.mark.asyncio +async def test_events_after_task_failed_still_render(patched): + handler, ui = patched + await handler(_Ev("task_failed")) + await handler(_Ev("tool_call")) + await handler(_Ev("task_started")) + # Pre-fix these two would have been suppressed after stop(). + assert [getattr(e, "type", None) for e in ui.events] == [ + "task_failed", + "tool_call", + "task_started", + ] + + +@pytest.mark.asyncio +async def test_workflow_finished_stops_ui(patched): + handler, ui = patched + await handler(_Ev("workflow_finished")) + assert ui.stopped is True + assert [getattr(e, "type", None) for e in ui.events] == ["workflow_finished"] + + +@pytest.mark.asyncio +async def test_skip_event_types_not_forwarded(patched): + handler, ui = patched + await handler(_Ev("agent_run_start")) + assert ui.events == [] + assert ui.stopped is False + + +def test_workflow_finished_is_the_only_stop_event(): + assert frozenset({"workflow_finished"}) == cli_main._UI_STOP_EVENTS diff --git a/tests/units/cli_tests/test_fs_glob.py b/tests/units/cli_tests/test_fs_glob.py index 6b4a04a..efe516c 100644 --- a/tests/units/cli_tests/test_fs_glob.py +++ b/tests/units/cli_tests/test_fs_glob.py @@ -75,3 +75,31 @@ def test_traversal_pattern_rejected(self, fs): def test_character_class(self, fs): # Character classes are honored and stay within the top-level segment. assert fs.glob("*.[pt]*") == ["/top.py"] + + +class TestGlobWalkCeiling: + # The fixture tree has 4 files total. 
+ + def test_truncates_when_ceiling_hit(self, fs): + matches, truncated = fs.glob_scanned("**/*", max_files=2) + assert truncated is True + assert len(matches) <= 2 + + def test_no_truncation_under_ceiling(self, fs): + matches, truncated = fs.glob_scanned("**/*.py", max_files=100) + assert truncated is False + assert matches == ["/sub/b.py", "/sub/deep/c.py", "/top.py"] + + def test_default_ceiling_comes_from_settings(self, fs, monkeypatch): + import cli.fs as cli_fs_module + from contractor.utils.settings import Settings + + monkeypatch.setattr( + cli_fs_module, + "get_settings", + lambda: Settings(fs_max_files_per_walk=1), + ) + + matches, truncated = fs.glob_scanned("**/*") + assert truncated is True + assert len(matches) <= 1 diff --git a/tests/units/contractor_tests/agents/test_oas_analyzer_factory.py b/tests/units/contractor_tests/agents/test_oas_analyzer_factory.py new file mode 100644 index 0000000..de6323a --- /dev/null +++ b/tests/units/contractor_tests/agents/test_oas_analyzer_factory.py @@ -0,0 +1,64 @@ +"""Unit tests for the oas_analyzer prompt factory. + +Regression for a chained-conditional bug in ``TaskDescription.format()``: it +returned objective+instructions only when instructions existed, examples-only +when they didn't, and "" when neither was present. As a result the +``general`` sub-agents (objective only) ran with an empty task body, and +idor/ssrf (objective+instructions+examples) silently dropped their examples. +Sections must compose additively instead. 
+""" +from __future__ import annotations + +from contractor.agents.oas_analyzer.prompts.factory import ( + SectionPrompts, + TaskDescription, +) + + +def test_objective_only(): + out = TaskDescription(objective="OBJ").format() + assert "OBJECTIVE:\nOBJ" in out + assert "INSTRUCTIONS:" not in out + assert "EXAMPLES:" not in out + + +def test_objective_and_instructions(): + out = TaskDescription(objective="OBJ", instructions="INS").format() + assert "OBJECTIVE:\nOBJ" in out + assert "INSTRUCTIONS:\nINS" in out + assert "EXAMPLES:" not in out + + +def test_objective_and_examples(): + out = TaskDescription(objective="OBJ", examples="EX").format() + assert "OBJECTIVE:\nOBJ" in out + assert "EXAMPLES:\nEX" in out + assert "INSTRUCTIONS:" not in out + + +def test_all_three_sections_present_and_ordered(): + out = TaskDescription( + objective="OBJ", instructions="INS", examples="EX" + ).format() + assert "OBJECTIVE:\nOBJ" in out + assert "INSTRUCTIONS:\nINS" in out + assert "EXAMPLES:\nEX" in out + # additive composition keeps source order: objective -> instructions -> examples + assert out.index("OBJECTIVE:") < out.index("INSTRUCTIONS:") < out.index("EXAMPLES:") + + +def test_objective_never_dropped_when_no_instructions_or_examples(): + # The "general" sub-agents (ddos/datasec/appsec) have objective only; they + # must not end up with an empty body. 
+ out = TaskDescription(objective="analyze the schema").format() + assert out.strip() != "" + assert "analyze the schema" in out + + +def test_format_task_wraps_with_role_and_output_format(): + section = SectionPrompts(fmt="FMT", role="ROLE") + out = section.format_task(TaskDescription(objective="OBJ", examples="EX")) + assert "ROLE:\nROLE" in out + assert "OBJECTIVE:\nOBJ" in out + assert "EXAMPLES:\nEX" in out + assert "OUTPUT FORMAT:\nFMT" in out diff --git a/tests/units/contractor_tests/agents/test_oas_analyzer_report.py b/tests/units/contractor_tests/agents/test_oas_analyzer_report.py new file mode 100644 index 0000000..84cf769 --- /dev/null +++ b/tests/units/contractor_tests/agents/test_oas_analyzer_report.py @@ -0,0 +1,68 @@ +"""Unit tests for oas_analyzer report ordering. + +Covers two determinism fixes: + +* severity sorting in ``format_vulnerabilities`` used the raw string + (alphabetical: critical, high, LOW, MEDIUM) — it must use an explicit + rank map (critical > high > medium > low; unknown severities last); +* ``AnalyticAgent`` iterated a set literal to build its sub-agents, so + the appsec/datasec/ddos order varied per process — it must be a tuple. 
+""" +from __future__ import annotations + +from contractor.agents.oas_analyzer.sub_agents.analytic_agents import analytic_agent +from contractor.agents.oas_analyzer.sub_agents.report_agent import ( + _severity_rank, + format_vulnerabilities, +) + + +def _vuln(severity: str, *, tag: str = "appsec") -> dict: + return { + "path": f"/{severity}", + "method": "get", + "parameters": ["id"], + "vulnerability": f"vuln-{severity}", + "description": f"desc {severity}", + "severity": severity, + "confidence": "high", + "tag": tag, + } + + +def test_severity_rank_orders_most_severe_first(): + assert ( + _severity_rank("critical") + < _severity_rank("high") + < _severity_rank("medium") + < _severity_rank("low") + ) + + +def test_unknown_severity_sorts_last(): + # Pinned: anything outside the known scale goes to the end of the report. + severities = ["low", "bogus", "critical", "medium", "", "high"] + ordered = sorted(severities, key=_severity_rank) + assert ordered == ["critical", "high", "medium", "low", "bogus", ""] + + +def test_format_vulnerabilities_sorts_by_severity_rank(): + vulnerabilities = [ + _vuln("low"), + _vuln("critical"), + _vuln("medium"), + _vuln("high"), + ] + report = format_vulnerabilities(vulnerabilities) + positions = [report.index(f"vuln-{s}") for s in ("critical", "high", "medium", "low")] + assert positions == sorted(positions) + + +def test_analytic_sub_agent_order_is_deterministic(): + # Sub-agents must be built from the ("appsec", "datasec", "ddos") tuple — + # all appsec bots first, then datasec, then ddos. + prefixes = [agent.name.split("_")[0] for agent in analytic_agent.sub_agents] + first_seen = list(dict.fromkeys(prefixes)) + assert first_seen == ["appsec", "datasec", "ddos"] + # No interleaving: each spec's bots form one contiguous block. 
+ assert prefixes == sorted(prefixes, key=first_seen.index) diff --git a/tests/units/contractor_tests/callbacks/test_base.py b/tests/units/contractor_tests/callbacks/test_base.py index 6626ab3..f957307 100644 --- a/tests/units/contractor_tests/callbacks/test_base.py +++ b/tests/units/contractor_tests/callbacks/test_base.py @@ -1,10 +1,14 @@ from typing import Any +import pytest + from contractor.callbacks.base import ( BaseCallback, CallbackTypes, _callback_name, + verify_signature, ) +from contractor.callbacks.tokens import TokenUsageCallback from tests.units.contractor_tests.helpers import mk_callback_context @@ -141,3 +145,45 @@ def test_get_dependencies_returns_deps_list(): cb = _DummyCb() cb.deps = ["TokenUsageCallback"] assert cb.get_dependencies() == ["TokenUsageCallback"] + + +# --------------------------------------------------------------------------- +# verify_signature / validate +# --------------------------------------------------------------------------- + + +class TestVerifySignature: + """verify_signature compares parameter names/kinds against the ADK + callback signature for the declared cb_type (return annotations are + deliberately ignored — they legitimately vary across callbacks).""" + + def test_accepts_matching_callback(self): + cb = TokenUsageCallback() # (callback_context, llm_response) + assert verify_signature(cb.__call__, CallbackTypes.after_model_callback) + + def test_rejects_callback_under_wrong_cb_type(self): + cb = TokenUsageCallback() + # before_tool expects (tool, args, tool_context) — names differ. + assert not verify_signature(cb.__call__, CallbackTypes.before_tool_callback) + + def test_validate_returns_self_for_valid_callback(self): + cb = TokenUsageCallback() + assert cb.validate() is cb + + def test_validate_raises_on_mismatched_signature(self): + class _MisdeclaredCb(BaseCallback): + # Declares before_model but accepts after_model's parameters. 
+ cb_type: CallbackTypes = CallbackTypes.before_model_callback + deps: list[str] = [] + + def __init__(self): + pass + + def to_state(self) -> dict[str, Any]: + return {} + + def __call__(self, callback_context, llm_response) -> None: + return None + + with pytest.raises(TypeError, match="Invalid signature"): + _MisdeclaredCb().validate() diff --git a/tests/units/contractor_tests/callbacks/test_context.py b/tests/units/contractor_tests/callbacks/test_context.py index b3659ec..15b030a 100644 --- a/tests/units/contractor_tests/callbacks/test_context.py +++ b/tests/units/contractor_tests/callbacks/test_context.py @@ -111,7 +111,61 @@ def test_summarization_respects_custom_summarization_key(): def test_summarization_to_state_shape(): cb = SummarizationLimitCallback(message="m", max_tokens=100) state = cb.to_state() - assert set(state.keys()) == {"max_tokens", "token_count", "message", "history"} + assert set(state.keys()) == { + "max_tokens", + "token_count", + "message", + "history", + "fired_invocation_id", + } + + +def test_summarization_message_injected_once_per_invocation(): + # Latch: the per-invocation token counter only grows within an invocation, + # so once over the limit every subsequent request would re-trigger without + # the latch. The message must be appended exactly once per invocation. 
+ ctx = mk_callback_context() + _seed_token_state(ctx, total=2000) + + cb = SummarizationLimitCallback(message="summarize now", max_tokens=1000) + + first = mk_llm_request() + cb(ctx, first) + assert len(first.contents) == 1 + + second = mk_llm_request() + cb(ctx, second) + assert second.contents == [] # latched — not appended again + + third = mk_llm_request() + cb(ctx, third) + assert third.contents == [] + + state = ctx.state["callbacks"][f"::{cb.name}"] + assert len(state["history"]) == 1 + assert state["fired_invocation_id"] == ctx.invocation_id + + +def test_summarization_latch_rearms_on_new_invocation(): + cb = SummarizationLimitCallback(message="m", max_tokens=1000) + + ctx1 = mk_callback_context() + _seed_token_state(ctx1, total=2000) + req1 = mk_llm_request() + cb(ctx1, req1) + assert len(req1.contents) == 1 + + # New invocation (fresh invocation_id): TokenUsageCallback resets its + # per-invocation counter then, so the latch must re-arm as well. + ctx2 = mk_callback_context() + _seed_token_state(ctx2, total=2000) + req2 = mk_llm_request() + cb(ctx2, req2) + assert len(req2.contents) == 1 + + state = ctx2.state["callbacks"][f"::{cb.name}"] + assert len(state["history"]) == 2 + assert state["fired_invocation_id"] == ctx2.invocation_id # --------------------------------------------------------------------------- @@ -401,6 +455,72 @@ def test_dedup_plus_budget(): assert pairs[2][1].parts[0].function_response.response == _big_response(80, "c") +def test_unmatched_responses_same_tool_are_not_deduped(): + # Two function_responses for the same tool with NO matching function_call + # in the contents (e.g. calls trimmed out upstream). They must each get a + # unique sentinel signature and never be elided as duplicates of each + # other — eliding them would drop live, non-duplicate context. 
+ ctx = mk_callback_context() + cb = FunctionResultsRemovalCallback(keep_last_n=99) + + r1 = MockContent( + role="tool", + parts=[mk_function_response_part(response={"v": 1}, name="stateful_tool")], + ) + r2 = MockContent( + role="tool", + parts=[mk_function_response_part(response={"v": 2}, name="stateful_tool")], + ) + request = mk_llm_request([r1, r2]) + + cb(ctx, request) + + assert r1.parts[0].function_response.response == {"v": 1} + assert r2.parts[0].function_response.response == {"v": 2} + assert cb.counter == 0 + + +def test_unmatched_response_does_not_dedup_against_argless_call(): + # One properly matched argless call (signature (name, "")) plus one + # unmatched response for the same tool: the unmatched one gets a sentinel + # signature, so neither elides the other. + ctx = mk_callback_context() + cb = FunctionResultsRemovalCallback(keep_last_n=99) + + call, matched_resp = _make_call_response_pair("tick", {}, {"v": "matched"}) + unmatched_resp = MockContent( + role="tool", + parts=[mk_function_response_part(response={"v": "unmatched"}, name="tick")], + ) + request = mk_llm_request([call, matched_resp, unmatched_resp]) + + cb(ctx, request) + + assert matched_resp.parts[0].function_response.response == {"v": "matched"} + assert unmatched_resp.parts[0].function_response.response == {"v": "unmatched"} + assert cb.counter == 0 + + +def test_argless_duplicate_calls_dedup_as_stale(): + # Pinned semantics (per the class docstring): "same tool called with + # identical arguments" includes identical EMPTY arguments, so repeated + # matched argless calls dedup — only the most recent response survives. 
+ ctx = mk_callback_context() + cb = FunctionResultsRemovalCallback(keep_last_n=99) + + c1_call, c1_resp = _make_call_response_pair("tick", {}, {"v": "old"}) + c2_call, c2_resp = _make_call_response_pair("tick", {}, {"v": "new"}) + request = mk_llm_request([c1_call, c1_resp, c2_call, c2_resp]) + + cb(ctx, request) + + assert c2_resp.parts[0].function_response.response == {"v": "new"} + fr_old = c1_resp.parts[0].function_response + assert fr_old.response["elided"] is True + assert fr_old.response["reason"] == "stale" + assert cb.counter == 1 + + def test_to_state_includes_new_fields(): cb = FunctionResultsRemovalCallback( keep_last_n=5, keep_budget_chars=10000, target_tools=["read_file"], diff --git a/tests/units/contractor_tests/callbacks/test_function_results_budget.py b/tests/units/contractor_tests/callbacks/test_function_results_budget.py index 76aa34c..e506991 100644 --- a/tests/units/contractor_tests/callbacks/test_function_results_budget.py +++ b/tests/units/contractor_tests/callbacks/test_function_results_budget.py @@ -183,3 +183,28 @@ def test_build_worker_explicit_arg_overrides_settings(monkeypatch): _build(wf, elide_keep_budget_chars=50_000) assert captured["keep_budget_chars"] == 50_000 + + +def test_build_worker_keep_last_n_settings_override(monkeypatch): + import contractor.agents.worker_factory as wf + from contractor.utils.settings import Settings + + captured = _capture_build_worker(monkeypatch) + # FS_HEAVY_KEEP_LAST_N > 0 overrides the caller's elide_keep_last_n (15) — + # e.g. set very high to effectively disable count-based elision. 
+ monkeypatch.setattr(wf, "get_settings", lambda: Settings(fs_heavy_keep_last_n=999)) + _build(wf) + + assert captured["keep_last_n"] == 999 + + +def test_build_worker_keep_last_n_default_uses_caller(monkeypatch): + import contractor.agents.worker_factory as wf + from contractor.utils.settings import Settings + + captured = _capture_build_worker(monkeypatch) + # Default 0 = unset → caller's elide_keep_last_n (15) is used. + monkeypatch.setattr(wf, "get_settings", lambda: Settings(fs_heavy_keep_last_n=0)) + _build(wf) + + assert captured["keep_last_n"] == 15 diff --git a/tests/units/contractor_tests/callbacks/test_guardrails.py b/tests/units/contractor_tests/callbacks/test_guardrails.py index 6d3fe5b..6b63997 100644 --- a/tests/units/contractor_tests/callbacks/test_guardrails.py +++ b/tests/units/contractor_tests/callbacks/test_guardrails.py @@ -1,7 +1,18 @@ from types import SimpleNamespace -from contractor.callbacks.guardrails import RepeatedToolCallCallback -from tests.units.contractor_tests.helpers import mk_tool_context +from contractor.callbacks.adapter import CallbackAdapter +from contractor.callbacks.guardrails import ( + InvalidToolCallGuardrailCallback, + MandatoryToolCallback, + RepeatedToolCallCallback, +) +from tests.units.contractor_tests.helpers import ( + MockContent, + mk_callback_context, + mk_function_call_part, + mk_text_part, + mk_tool_context, +) def _mk_tool(name: str): @@ -133,6 +144,154 @@ def test_empty_args_do_not_break_existing_streak(): assert "add_subtask" in result["warning"] +# --------------------------------------------------------------------------- +# InvalidToolCallGuardrailCallback — return-value contract +# --------------------------------------------------------------------------- + + +def _named_tool(name: str): + def _tool(): + pass + + _tool.__name__ = name + return _tool + + +def _mk_invalid_tool_cb( + tool_names: tuple[str, ...] 
= ("default_tool", "submit_verdict", "read_file"), +) -> InvalidToolCallGuardrailCallback: + return InvalidToolCallGuardrailCallback( + tools=[_named_tool(n) for n in tool_names], + default_tool_name="default_tool", + default_tool_arg="meta", + ) + + +def _mk_model_response(parts): + return SimpleNamespace(content=MockContent(role="model", parts=list(parts))) + + +def test_invalid_tool_cb_returns_none_when_nothing_modified(): + cb = _mk_invalid_tool_cb() + ctx = mk_callback_context() + resp = _mk_model_response( + [ + mk_text_part("thinking..."), + mk_function_call_part(name="read_file", args={"path": "a"}), + ] + ) + + assert cb(ctx, resp) is None + assert cb.history == [] + # state is still saved even when nothing was rewritten + assert "::InvalidToolCallGuardrailCallback" in ctx.state["callbacks"] + + +def test_invalid_tool_cb_returns_none_for_text_only_response(): + cb = _mk_invalid_tool_cb() + ctx = mk_callback_context() + resp = _mk_model_response([mk_text_part("final answer")]) + + assert cb(ctx, resp) is None + assert cb.history == [] + + +def test_invalid_tool_cb_returns_response_when_it_rewrites_a_part(): + cb = _mk_invalid_tool_cb() + ctx = mk_callback_context() + resp = _mk_model_response( + [mk_function_call_part(name="no_such_tool", args={"x": 1})] + ) + + result = cb(ctx, resp) + + assert result is resp + fc = resp.content.parts[0].function_call + assert fc.name == "default_tool" + assert fc.args["meta"]["func_name"] == "no_such_tool" + assert len(cb.history) == 1 + + +def test_invalid_tool_cb_rewrites_malformed_args(): + cb = _mk_invalid_tool_cb() + ctx = mk_callback_context() + part = mk_function_call_part(name="read_file") + part.function_call.args = "not-a-dict" # type: ignore[assignment] + resp = _mk_model_response([part]) + + result = cb(ctx, resp) + + assert result is resp + fc = resp.content.parts[0].function_call + assert fc.name == "default_tool" + assert "error" in fc.args["meta"] + + +def 
test_chain_runs_downstream_callback_when_nothing_modified(): + class SentinelCallback(MandatoryToolCallback): + """Downstream after_model callback that records it was reached.""" + + def __init__(self): + super().__init__(tool_names=["submit_verdict"]) + self.seen = 0 + + def __call__(self, callback_context, llm_response): + self.seen += 1 + return super().__call__(callback_context, llm_response) + + sentinel = SentinelCallback() + adapter = CallbackAdapter(agent_name="worker") + adapter.register(_mk_invalid_tool_cb()) + adapter.register(sentinel) + chain = adapter()["after_model_callback"] + + ctx = mk_callback_context() + resp = _mk_model_response( + [mk_function_call_part(name="read_file", args={"path": "a"})] + ) + + assert chain(callback_context=ctx, llm_response=resp) is None + assert sentinel.seen == 1 + + +def test_exploitability_chaining_lets_mandatory_tool_callback_nudge(): + """Mirrors exploitability_agent/agent.py: the worker's after_model chain + (ending with InvalidToolCallGuardrailCallback) is wrapped by ``_chain``, + which only runs MandatoryToolCallback when the chain returns None.""" + adapter = CallbackAdapter(agent_name="worker") + adapter.register(_mk_invalid_tool_cb()) + worker_chain = adapter()["after_model_callback"] + + mandatory = MandatoryToolCallback(tool_names=["submit_verdict"], max_nudges=3) + + def _chain(callback_context, llm_response): + result = worker_chain( + callback_context=callback_context, llm_response=llm_response + ) + if result is not None: + return result + return mandatory( + callback_context=callback_context, llm_response=llm_response + ) + + ctx = mk_callback_context() + + # turn 1: a valid tool call — tracked, no nudge + resp = _mk_model_response( + [mk_function_call_part(name="read_file", args={"path": "a"})] + ) + assert _chain(ctx, resp) is None + assert mandatory.step_count == 1 + + # turn 2: text-only final answer without submit_verdict — must nudge + final = _mk_model_response([mk_text_part("verdict: 
exploitable")]) + nudge = _chain(ctx, final) + + assert nudge is not None + assert mandatory.nudge_count == 1 + assert "submit_verdict" in nudge.content.parts[0].text + + def test_state_is_persisted(): cb = RepeatedToolCallCallback(threshold=2) ctx = mk_tool_context() diff --git a/tests/units/contractor_tests/callbacks/test_ratelimits.py b/tests/units/contractor_tests/callbacks/test_ratelimits.py index e994326..984551f 100644 --- a/tests/units/contractor_tests/callbacks/test_ratelimits.py +++ b/tests/units/contractor_tests/callbacks/test_ratelimits.py @@ -234,6 +234,54 @@ def test_at_limit_no_sleep(self, monkeypatch): sleep_mock.assert_not_called() assert cb.request_count == 3 + def test_window_rolls_after_60s_without_sleeping(self, monkeypatch): + # Once 60s elapse under budget, the window rolls forward (count + + # timer reset) without sleeping — exercises the `elif els >= 60` + # branch, mirroring TpmRatelimitCallback. + monkeypatch.setattr( + "time.time", MagicMock(side_effect=[1000.0, 1010.0, 1070.0]) + ) + sleep_mock = MagicMock() + monkeypatch.setattr("time.sleep", sleep_mock) + + cb = RpmRatelimitCallback(rpm_limit=3) + ctx = mk_callback_context() + cb(ctx, MagicMock()) # init @1000, count=1 + cb(ctx, MagicMock()) # @1010, count=2 + cb(ctx, MagicMock()) # @1070 els=70>=60, under limit → roll, no sleep + + sleep_mock.assert_not_called() + assert cb.timer_start == 1070 + assert cb.request_count == 1 # the rolling request starts the window + assert cb.history == [] + + def test_requests_do_not_accumulate_across_stale_windows(self, monkeypatch): + # Regression (mirrors TestTpmAccumulation/H1 in spirit): pre-fix there + # was no stale-window reset branch, so request_count accumulated + # across dead windows and a later sub-limit burst was treated as a + # limit violation (count reset only via the throttle branch). 
+ monkeypatch.setattr( + "time.time", + MagicMock(side_effect=[1000.0, 1001.0, 1070.0, 1071.0, 1072.0]), + ) + sleep_mock = MagicMock() + monkeypatch.setattr("time.sleep", sleep_mock) + + cb = RpmRatelimitCallback(rpm_limit=3) + ctx = mk_callback_context() + cb(ctx, MagicMock()) # init @1000, count=1 + cb(ctx, MagicMock()) # @1001, count=2 + cb(ctx, MagicMock()) # @1070 count=3 (at limit), stale → roll, count=1 + cb(ctx, MagicMock()) # @1071, count=2 + cb(ctx, MagicMock()) # @1072, count=3 — still within the new window + + # Pre-fix: the @1071 call hit count=4 > 3 and took the throttle + # branch (history entry + spurious reset). + sleep_mock.assert_not_called() + assert cb.history == [] + assert cb.timer_start == 1070 + assert cb.request_count == 3 + def test_exceeding_limit_triggers_sleep_and_resets(self, monkeypatch): # Four time.time() reads: init, 2nd req, 3rd req, 4th req triggers # sleep, then a fifth read to set the new window's timer_start. diff --git a/tests/units/contractor_tests/callbacks/test_tokens.py b/tests/units/contractor_tests/callbacks/test_tokens.py index cc6f774..046fb6b 100644 --- a/tests/units/contractor_tests/callbacks/test_tokens.py +++ b/tests/units/contractor_tests/callbacks/test_tokens.py @@ -6,16 +6,17 @@ def test_series_same_interaction_then_change_and_more(): """ - 1) Несколько запросов в рамках одного invocation_id, потом invocation_id меняется, и ещё запросы. - Проверяем: - - common суммируется всегда - - current суммируется только в рамках текущего invocation_id - - при смене invocation_id прошлый current уходит в history под старым id + Several requests within one invocation_id, then the invocation_id + changes, then more requests. 
Checks: + - the global counter always accumulates + - the current counter accumulates only within the current invocation_id + - history tracks the per-invocation totals, including the in-progress + invocation (flushed on every call, keyed by invocation_id) """ ctx = mk_callback_context() token_usage_callback = TokenUsageCallback() ctx.invocation_id = "A" - # interaction A: 2 вызова + # interaction A: 2 calls token_usage_callback(ctx, mk_llm_response(total=10, prompt=6, candidates=4)) token_usage_callback(ctx, mk_llm_response(total=3, prompt=1, candidates=2)) state_key = "::" + token_usage_callback.name @@ -28,10 +29,13 @@ def test_series_same_interaction_then_change_and_more(): assert g.output == 6 assert g.total == 13 + # The in-progress invocation is already visible in history — consumers + # reading mid-run (or after the final invocation) never undercount. h = TokenUsageCallback.get_history(ctx) - assert h == {} + assert h == {"A": {"input": 7, "output": 6, "total": 13}} - # interaction B: первый вызов с новым id переносит A->history и ставит current = token_count(B) + # interaction B: the first call with a new id starts a fresh current + # counter; A's totals stay frozen in history. ctx.invocation_id = "B" new_invocation_id = ctx.invocation_id token_usage_callback(ctx, mk_llm_response(total=5, prompt=2, candidates=3)) @@ -46,9 +50,13 @@ def test_series_same_interaction_then_change_and_more(): assert g.total == 18 h = TokenUsageCallback.get_history(ctx) - assert h == {"A": {"input": 7, "output": 6, "total": 13}} + assert h == { + "A": {"input": 7, "output": 6, "total": 13}, + "B": {"input": 2, "output": 3, "total": 5}, + } - # interaction B: ещё один вызов — current накапливается + # interaction B: one more call — the current counter accumulates and the + # history entry follows it. 
token_usage_callback(ctx, mk_llm_response(total=2, prompt=1, candidates=1)) s = ctx.state["callbacks"][state_key] @@ -61,7 +69,29 @@ def test_series_same_interaction_then_change_and_more(): assert g.total == 20 h = TokenUsageCallback.get_history(ctx) - assert h == {"A": {"input": 7, "output": 6, "total": 13}} + assert h == { + "A": {"input": 7, "output": 6, "total": 13}, + "B": {"input": 3, "output": 4, "total": 7}, + } + + +def test_history_includes_final_invocation_without_id_change(): + # Regression: history used to be written only when invocation_id changed, + # so the LAST invocation of a run was never flushed and consumers + # undercounted by one invocation. The flush-on-every-call seam keeps the + # final invocation's entry present and accurate without double-counting. + ctx = mk_callback_context() + ctx.invocation_id = "only" + cb = TokenUsageCallback() + + cb(ctx, mk_llm_response(total=10, prompt=6, candidates=4)) + cb(ctx, mk_llm_response(total=3, prompt=1, candidates=2)) + + h = TokenUsageCallback.get_history(ctx) + assert h == {"only": {"input": 7, "output": 6, "total": 13}} + # History totals equal the global counter — nothing lost, nothing doubled. 
+ g = TokenUsageCallback.get_global_counter(ctx) + assert h["only"] == {"input": g.input, "output": g.output, "total": g.total} # ─── TokenCounter ──────────────────────────────────────────────────────────── diff --git a/tests/units/contractor_tests/runners/plugins/test_metrics_plugin.py b/tests/units/contractor_tests/runners/plugins/test_metrics_plugin.py index d9aa16d..69b2b2e 100644 --- a/tests/units/contractor_tests/runners/plugins/test_metrics_plugin.py +++ b/tests/units/contractor_tests/runners/plugins/test_metrics_plugin.py @@ -138,6 +138,34 @@ def test_resolve_unknown_returns_none(self): t = _CallTracker() assert t.resolve("inv", "agent", "tool", {"a": 1}) is None + def test_resolve_prefers_fresh_call_over_errored(self): + # Parallel identical calls: once the first one errors, a subsequent + # after_tool must pair with the still-clean call, not the errored one. + t = _CallTracker() + c1 = t.register("inv", "agent", "tool", {"a": 1}) + c2 = t.register("inv", "agent", "tool", {"a": 1}) + c1.exception_seen = True + assert t.resolve("inv", "agent", "tool", {"a": 1}) is c2 + + def test_resolve_falls_back_to_errored_when_no_fresh_pending(self): + # The documented edge case: a paired after_tool CAN arrive right + # after on_tool_error (a plugin returned a non-None error response) — + # it must still resolve the errored call so it isn't double-counted. + t = _CallTracker() + errored = t.register("inv", "agent", "tool", {"a": 1}) + errored.exception_seen = True + assert t.resolve("inv", "agent", "tool", {"a": 1}) is errored + + def test_register_finishes_stale_errored_call_with_same_fingerprint(self): + # Registering an identical retry closes the errored call's paired + # after_tool window — it must not linger as pending. 
+ t = _CallTracker() + errored = t.register("inv", "agent", "tool", {"a": 1}) + errored.exception_seen = True + retry = t.register("inv", "agent", "tool", {"a": 1}) + assert errored.finished is True + assert t.resolve("inv", "agent", "tool", {"a": 1}) is retry + def test_cleanup_invocation_clears_state(self): t = _CallTracker() t.register("inv1", "agent", "tool", {"a": 1}) @@ -249,6 +277,46 @@ async def test_exception_counted_once_no_double_count(self): assert tm["success_total"] == 0 assert tm["result_errors_total"] == 0 + @pytest.mark.asyncio + async def test_retry_after_exception_pairs_with_retry_not_stale_call(self): + # An errored call gets no after_tool (no plugin returns a non-None + # error response). When the agent retries with identical args, the + # retry's after_tool must resolve the RETRY call — not the stale + # errored one — so the success and its timing land on the right call. + plugin, rec = _plugin() + tool, ctx = _tool(), _ctx() + + await plugin.before_tool_callback(tool=tool, tool_context=ctx, args={"p": 1}) + await plugin.on_tool_error_callback( + tool=tool, tool_context=ctx, args={"p": 1}, error=ValueError("x") + ) + await plugin.before_tool_callback(tool=tool, tool_context=ctx, args={"p": 1}) + await plugin.after_tool_callback( + tool=tool, tool_context=ctx, args={"p": 1}, result={"ok": True} + ) + + exc = rec.of_type(AgioEventType.TOOL_EXCEPTION)[0] + assert exc["tool_call_id"] == "call_1" + + res = rec.of_type(AgioEventType.TOOL_RESULT)[0] + assert res["successful"] is True + assert res["result_error"] is False + # Identity/timing belong to the retry call, not the errored one. + assert res["tool_call_id"] == "call_2" + assert res["execution_time_ms"] >= 0 + + # No leaked pending call: both calls are finished. 
+ assert plugin._tracker.resolve("inv1", "swe", "read_file", {"p": 1}) is None + + await plugin.after_run_callback(invocation_context=_ctx()) + summary = rec.of_type(AgioEventType.RUN_SUMMARY)[0] + tm = summary["agents"]["swe"]["tools"]["read_file"] + assert tm["calls_total"] == 2 + assert tm["exception_errors_total"] == 1 + assert tm["success_total"] == 1 + assert tm["result_errors_total"] == 0 + assert summary["callback_imbalances"] == [] + @pytest.mark.asyncio async def test_after_model_records_usage(self): plugin, rec = _plugin() diff --git a/tests/units/contractor_tests/runners/plugins/test_run_callbacks.py b/tests/units/contractor_tests/runners/plugins/test_run_callbacks.py new file mode 100644 index 0000000..592686c --- /dev/null +++ b/tests/units/contractor_tests/runners/plugins/test_run_callbacks.py @@ -0,0 +1,136 @@ +"""Probe: does ADK's ``after_run_callback`` fire during a TaskRunner run? + +``TaskRunner.run``'s teardown comment and ``SandboxCleanupPlugin``'s module +docstring used to contradict each other on this. This test drives a REAL +ADK ``Runner`` (no LLM — the agent is a ``BaseAgent`` that yields one final +event) through ``TaskRunner`` with a probe plugin registered, and settles +the question. + +Finding (ADK 2.x): ``Runner._exec_with_plugin`` awaits +``run_after_run_callback`` after the run's event generator is exhausted — +so it DOES fire whenever the outer run is consumed to completion (which +``TaskRunner._consume_events`` always does on the happy path). It does +NOT fire when the run raises or is abandoned mid-stream, which is why +``TaskRunner.run`` keeps a run()-level sandbox sweep as a backstop. 
+""" + +from __future__ import annotations + +from typing import Any +from unittest.mock import AsyncMock, MagicMock + +import pytest +from google.adk.agents.base_agent import BaseAgent +from google.adk.artifacts import InMemoryArtifactService +from google.adk.events import Event, EventActions +from google.adk.plugins.base_plugin import BasePlugin +from google.genai import types + +from contractor.runners.models import ( + RenderedTask, + TaskInvocation, + TaskScopedKeys, + TaskStatus, + TaskTemplate, +) +from contractor.runners.task_runner import TaskRunner + +# ─── Probe fixtures ─────────────────────────────────────────────────────────── + + +class ProbePlugin(BasePlugin): + """Records every before/after run callback invocation.""" + + def __init__(self) -> None: + super().__init__(name="probe") + self.before_run_ids: list[str | None] = [] + self.after_run_ids: list[str | None] = [] + + async def before_run_callback(self, *, invocation_context: Any) -> None: + self.before_run_ids.append( + getattr(invocation_context, "invocation_id", None) + ) + + async def after_run_callback(self, *, invocation_context: Any) -> None: + self.after_run_ids.append( + getattr(invocation_context, "invocation_id", None) + ) + + +class OneShotAgent(BaseAgent): + """No-LLM agent: yields a single final event that marks the task done.""" + + async def _run_async_impl(self, ctx): + keys = TaskScopedKeys(ctx.session.state.get("_global_task_id", 0)) + yield Event( + invocation_id=ctx.invocation_id, + author=self.name, + content=types.Content( + role="model", parts=[types.Part(text="all done")], + ), + actions=EventActions(state_delta={keys.status: TaskStatus.DONE}), + ) + + +def _probe_task_runner(monkeypatch, probe: ProbePlugin) -> TaskRunner: + """A TaskRunner whose iteration drives a REAL ADK Runner. + + Only the LLM-bound pieces are replaced: the planner is swapped for + ``OneShotAgent`` and the plugin list for the probe. 
Session/artifact + services, state seeding, and event consumption are the production path. + """ + r = TaskRunner(name="probe_app", artifact_service=InMemoryArtifactService()) + r.templates[("t", "v1")] = TaskTemplate( + key="t", version="v1", title="T", + objective="", instructions="", output_format="", + ) + rendered = RenderedTask( + key="t", title="T", objective="", instructions="", + output_format="", format="json", + ) + monkeypatch.setattr(r, "_render_task", MagicMock(return_value=rendered)) + # ADK's InMemoryArtifactService rejects the session-agnostic + # (session_id=None) saves the production artifact service supports; + # publishing is irrelevant to the probe, so stub it. + monkeypatch.setattr(r, "_publish_task_artifacts", AsyncMock()) + monkeypatch.setattr( + r, "_spawn_planning_agent", + lambda item, task: OneShotAgent(name="probe_agent"), + ) + monkeypatch.setattr( + r, "_build_plugins", lambda *args, **kwargs: [probe], + ) + return r + + +# ─── The probe ──────────────────────────────────────────────────────────────── + + +class TestAfterRunCallbackFires: + @pytest.mark.asyncio + async def test_after_run_callback_fires_for_the_outer_task_run( + self, monkeypatch, + ): + probe = ProbePlugin() + r = _probe_task_runner(monkeypatch, probe) + r.queue.append(TaskInvocation( + id="inv-1", + ref="probe_task", + template_key="t", + template_version="v1", + worker_builder=lambda **_: MagicMock(), + iterations=1, + max_attempts=1, + )) + + results = await r.run(user_id="u") + + assert len(results) == 1 + assert results[0].status == TaskStatus.DONE + + # The load-bearing assertion: after_run_callback DOES fire when the + # outer run is consumed to completion, and it pairs 1:1 with + # before_run_callback for the same invocation. SandboxCleanupPlugin's + # root-invocation teardown relies on exactly this. 
+ assert len(probe.before_run_ids) == 1 + assert probe.after_run_ids == probe.before_run_ids diff --git a/tests/units/contractor_tests/runners/test_agent_runner.py b/tests/units/contractor_tests/runners/test_agent_runner.py index 8048ebe..77c8fcd 100644 --- a/tests/units/contractor_tests/runners/test_agent_runner.py +++ b/tests/units/contractor_tests/runners/test_agent_runner.py @@ -5,6 +5,8 @@ from __future__ import annotations +import asyncio +import logging from types import SimpleNamespace from unittest.mock import AsyncMock, MagicMock @@ -212,17 +214,38 @@ async def test_no_handler_is_noop(self, monkeypatch): assert result.final_text == "ok" @pytest.mark.asyncio - async def test_handler_cleared_after_run(self, monkeypatch): - _patch_runner(monkeypatch, [_final_event("ok")]) + async def test_concurrent_runs_do_not_clobber_each_others_handler( + self, monkeypatch, + ): + # The handler is threaded through the call chain, not stored on the + # instance — two interleaved run() calls on the SAME runner must each + # deliver their full event sequence to their own handler only. 
+ class SlowFakeRunner: + def __init__(self, **kwargs): + pass + + async def run_async(self, *, user_id, session_id, new_message): + await asyncio.sleep(0.01) # force interleaving + yield _final_event("ok") + + monkeypatch.setattr(agent_runner_mod, "Runner", SlowFakeRunner) runner = _make_runner() - _, on_event = _collector() + events_a, on_a = _collector() + events_b, on_b = _collector() - await runner.run(agent=_agent(), message="hi", on_event=on_event) + await asyncio.gather( + runner.run(agent=_agent("agent-a"), message="hi", on_event=on_a), + runner.run(agent=_agent("agent-b"), message="hi", on_event=on_b), + ) - assert runner._on_event is None + for events, agent_name in ((events_a, "agent-a"), (events_b, "agent-b")): + assert [e.type for e in events] == [ + "agent_run_started", "final_text", "agent_run_finished", + ] + assert {e.payload["agent_name"] for e in events} == {agent_name} @pytest.mark.asyncio - async def test_handler_cleared_even_on_error(self, monkeypatch): + async def test_runner_error_propagates(self, monkeypatch): class FakeRunner: def __init__(self, **kwargs): pass @@ -238,7 +261,40 @@ async def run_async(self, *, user_id, session_id, new_message): with pytest.raises(RuntimeError, match="kaboom"): await runner.run(agent=_agent(), message="hi", on_event=on_event) - assert runner._on_event is None + @pytest.mark.asyncio + async def test_raising_handler_does_not_abort_run(self, monkeypatch, caplog): + # Event delivery is best-effort telemetry: a broken handler (full + # disk, UI rendering, …) must not abort the agent run. 
+ _patch_runner(monkeypatch, [_final_event("ok")]) + runner = _make_runner() + + async def bad_handler(ev: TaskRunnerEvent) -> None: + raise OSError("No space left on device") + + with caplog.at_level( + logging.ERROR, logger="contractor.runners.agent_runner", + ): + result = await runner.run( + agent=_agent(), message="hi", on_event=bad_handler, + ) + + assert result.final_text == "ok" + assert any( + "event handler failed" in rec.getMessage() for rec in caplog.records + ) + + @pytest.mark.asyncio + async def test_cancelled_error_from_handler_propagates(self, monkeypatch): + _patch_runner(monkeypatch, [_final_event("ok")]) + runner = _make_runner() + + async def cancelling_handler(ev: TaskRunnerEvent) -> None: + raise asyncio.CancelledError() + + with pytest.raises(asyncio.CancelledError): + await runner.run( + agent=_agent(), message="hi", on_event=cancelling_handler, + ) # ─── Artifact publishing ────────────────────────────────────────────────────── diff --git a/tests/units/contractor_tests/runners/test_agio.py b/tests/units/contractor_tests/runners/test_agio.py new file mode 100644 index 0000000..a0e77d8 --- /dev/null +++ b/tests/units/contractor_tests/runners/test_agio.py @@ -0,0 +1,41 @@ +"""Guards for the Agio event taxonomy. + +``cli/metrics.MetricsSink`` filters on ``ALL_AGIO_EVENT_TYPES`` — any event +type emitted by a runner but not mirrored in ``AgioEventType`` silently +vanishes from ``metrics.jsonl``. These tests make the mirroring explicit so +a new ``EventType`` member fails loudly here instead. +""" + +from __future__ import annotations + +from contractor.runners.agio import ALL_AGIO_EVENT_TYPES, AgioEventType +from contractor.runners.models import EventType + + +class TestTaxonomyMirroring: + def test_every_task_runner_event_type_is_persisted(self): + # models.EventType (what TaskRunner emits) must be a subset of the + # Agio taxonomy (what MetricsSink persists). 
A member added to + # EventType without an AgioEventType mirror would be dropped from + # metrics.jsonl without any error. + missing = {e.value for e in EventType} - ALL_AGIO_EVENT_TYPES + assert not missing, ( + f"EventType member(s) {sorted(missing)} are not mirrored in " + f"AgioEventType — events of these types would silently vanish " + f"from metrics.jsonl" + ) + + def test_agent_runner_string_events_are_persisted(self): + # AgentRunner emits these as string literals (no enum), and + # Workflow.emit_task_skipped emits "task_skipped" — all must stay + # in the taxonomy or RouterWorkflow / skip events disappear from + # metrics.jsonl. + assert { + "agent_run_started", + "agent_run_finished", + "final_text", + "task_skipped", + } <= ALL_AGIO_EVENT_TYPES + + def test_all_agio_event_types_matches_enum(self): + assert frozenset(t.value for t in AgioEventType) == ALL_AGIO_EVENT_TYPES diff --git a/tests/units/contractor_tests/runners/test_artifacts.py b/tests/units/contractor_tests/runners/test_artifacts.py index 1e783b6..003410c 100644 --- a/tests/units/contractor_tests/runners/test_artifacts.py +++ b/tests/units/contractor_tests/runners/test_artifacts.py @@ -8,6 +8,7 @@ InvalidArtifactKeyError, _records_to_text, artifact_filename, + artifact_key_slug, artifact_names_for_key, save_result_artifacts, validate_artifact_key, @@ -46,6 +47,27 @@ def test_does_not_reject_dotdot_inside_segment(self): assert validate_artifact_key("foo/bar..baz") == "foo/bar..baz" +class TestArtifactKeySlug: + def test_keeps_safe_chars(self): + assert artifact_key_slug("sqli-list_2") == "sqli-list_2" + + def test_collapses_unsafe_runs_to_single_underscore(self): + assert artifact_key_slug("trace-annotation:openapi:items") == ( + "trace-annotation_openapi_items" + ) + assert artifact_key_slug("a / b") == "a_b" + + def test_stable_and_key_safe(self): + slug = artifact_key_slug("../weird name!") + assert slug == artifact_key_slug("../weird name!") + # The slug is a single segment that passes key 
validation. + assert validate_artifact_key(f"t/{slug}") == f"t/{slug}" + + @pytest.mark.parametrize("value", ["", " ", "::", "__"]) + def test_degenerate_inputs_fall_back(self, value): + assert artifact_key_slug(value) == "item" + + class TestArtifactFilename: def test_appends_kind_to_cleaned_key(self): assert artifact_filename("plan", "result") == "plan/result" diff --git a/tests/units/contractor_tests/runners/test_models.py b/tests/units/contractor_tests/runners/test_models.py index 57681cf..49a28c7 100644 --- a/tests/units/contractor_tests/runners/test_models.py +++ b/tests/units/contractor_tests/runners/test_models.py @@ -183,6 +183,36 @@ def test_load_missing_task_key_raises(self, tasks_dir): with pytest.raises(ValueError, match="missing top-level 'task:'"): TaskTemplate.load("demo") + @pytest.mark.parametrize( + "missing_field", ["objective", "instructions", "output_format"], + ) + def test_load_missing_required_field_raises_value_error( + self, tasks_dir, missing_field, + ): + # A body without objective/instructions/output_format used to raise a + # bare KeyError; it must follow the same descriptive ValueError + # pattern (naming the body path) as the neighboring validation. + _write_task_manifest( + tasks_dir, + name="demo", + active="v1", + versions={"v1": "demo/v1.yml"}, + ) + body = { + "objective": "o", + "instructions": "i", + "output_format": "yaml", + } + del body[missing_field] + _write_task_body(tasks_dir, "demo/v1.yml", body) + + with pytest.raises( + ValueError, match=f"missing required '{missing_field}:'", + ) as exc_info: + TaskTemplate.load("demo") + # The body path is part of the message, like the neighboring errors. 
+ assert "v1.yml" in str(exc_info.value) + # ─── RenderedTask.from_template (brace-interpolation guards) ────────────────── @@ -259,6 +289,38 @@ def test_missing_variable_raises_key_error(self): tpl, variables={}, params={}, artifacts={} ) + def test_colliding_artifact_refs_raise_naming_both(self): + # "oas-build/result" and "oas_build/result" both normalize to + # "artifact__oas_build__result" — the later artifact used to silently + # win. Now the ambiguity is rejected, naming both refs. + tpl = _make_template(instructions="{artifact__oas_build__result}") + with pytest.raises(ValueError, match="normalize to") as exc_info: + RenderedTask.from_template( + tpl, + variables={}, + params={}, + artifacts={ + "oas-build/result": "A", + "oas_build/result": "B", + }, + ) + message = str(exc_info.value) + assert "oas-build/result" in message + assert "oas_build/result" in message + assert "artifact__oas_build__result" in message + + def test_distinct_artifact_refs_do_not_collide(self): + tpl = _make_template( + instructions="{artifact__a__result} {artifact__b__result}", + ) + r = RenderedTask.from_template( + tpl, + variables={}, + params={}, + artifacts={"a/result": "A", "b/result": "B"}, + ) + assert r.instructions == "A B" + def test_unused_extra_variables_are_ignored(self): tpl = _make_template(instructions="static text") r = RenderedTask.from_template( @@ -362,6 +424,38 @@ def test_load_returns_none_for_wrong_version(self, tmp_path): path.write_text(json.dumps({"version": 999, "tasks": []}), encoding="utf-8") assert Checkpoint.load(path) is None + def test_load_returns_none_for_entry_missing_required_field( + self, tmp_path, caplog, + ): + import json + import logging + path = tmp_path / "partial.json" + path.write_text( + json.dumps({ + "version": 1, + "workflow": "test", + # Valid JSON, wrong shape: entry lacks ref/template_key/…. 
+ "tasks": [{"task_id": 0}], + }), + encoding="utf-8", + ) + with caplog.at_level( + logging.WARNING, logger="contractor.runners.models.checkpoint", + ): + assert Checkpoint.load(path) is None + assert any( + "ignoring corrupt checkpoint" in r.getMessage() for r in caplog.records + ) + + def test_load_returns_none_for_non_dict_entries(self, tmp_path): + import json + path = tmp_path / "shape.json" + path.write_text( + json.dumps({"version": 1, "workflow": "test", "tasks": ["oops"]}), + encoding="utf-8", + ) + assert Checkpoint.load(path) is None + def test_save_is_atomic(self, tmp_path): path = tmp_path / "checkpoint.json" cp = Checkpoint(workflow="test", entries=[self._entry()]) diff --git a/tests/units/contractor_tests/runners/test_skills.py b/tests/units/contractor_tests/runners/test_skills.py index 31e9f51..bea7b52 100644 --- a/tests/units/contractor_tests/runners/test_skills.py +++ b/tests/units/contractor_tests/runners/test_skills.py @@ -13,6 +13,7 @@ inject_skills, load_skill, load_skills, + validate_skills, ) @@ -148,6 +149,37 @@ def test_empty_skill_dir_returns_empty_list(self, tmp_path, monkeypatch): assert load_skill("demo") == [] +class TestValidateSkills: + def test_existing_skill_dirs_pass(self, tmp_path, monkeypatch): + monkeypatch.setattr(m, "SKILLS_BASE_DIR", tmp_path) + (tmp_path / "a").mkdir() + (tmp_path / "b").mkdir() + validate_skills(["a", "b"]) # must not raise + + def test_empty_list_passes(self, tmp_path, monkeypatch): + monkeypatch.setattr(m, "SKILLS_BASE_DIR", tmp_path) + validate_skills([]) + + def test_unknown_skill_raises_value_error_naming_it( + self, tmp_path, monkeypatch, + ): + monkeypatch.setattr(m, "SKILLS_BASE_DIR", tmp_path) + (tmp_path / "known").mkdir() + + with pytest.raises(ValueError, match="'typo'") as exc_info: + validate_skills(["known", "typo"]) + # The message lists what IS available, to make the fix obvious. 
+ assert "known" in str(exc_info.value) + + def test_multiple_unknown_skills_all_named(self, tmp_path, monkeypatch): + monkeypatch.setattr(m, "SKILLS_BASE_DIR", tmp_path) + with pytest.raises(ValueError) as exc_info: + validate_skills(["x", "y"]) + message = str(exc_info.value) + assert "'x'" in message + assert "'y'" in message + + class TestLoadSkills: def test_aggregates_multiple(self, tmp_path, monkeypatch): monkeypatch.setattr(m, "SKILLS_BASE_DIR", tmp_path) diff --git a/tests/units/contractor_tests/runners/test_task_runner.py b/tests/units/contractor_tests/runners/test_task_runner.py index daebf1e..4f8ad07 100644 --- a/tests/units/contractor_tests/runners/test_task_runner.py +++ b/tests/units/contractor_tests/runners/test_task_runner.py @@ -1,3 +1,5 @@ +import asyncio +import logging from types import SimpleNamespace from unittest.mock import AsyncMock, MagicMock @@ -5,6 +7,7 @@ from google.adk.artifacts import BaseArtifactService from contractor.runners._helpers import _decode_part_text, _extract_final_text +from contractor.runners.artifacts import InvalidArtifactKeyError, artifact_names_for_key from contractor.runners.models import ( Checkpoint, CheckpointEntry, @@ -83,13 +86,24 @@ def test_inline_bytes_data(self): ) assert _decode_part_text(part) == "payload" - def test_inline_bytes_invalid_utf8_ignores_errors(self): + def test_inline_bytes_invalid_utf8_replaces_and_warns(self, caplog): part = SimpleNamespace( text=None, inline_data=SimpleNamespace(data=b"\xff\xfehello"), ) - # `errors="ignore"` drops the invalid bytes; the readable suffix survives. - assert _decode_part_text(part) == "hello" + # `errors="replace"` keeps the readable suffix and marks the invalid + # bytes as U+FFFD instead of silently dropping them — with a warning. 
+ with caplog.at_level(logging.WARNING, logger="contractor.runners._helpers"): + assert _decode_part_text(part) == "��hello" + assert any("not valid UTF-8" in r.getMessage() for r in caplog.records) + + def test_inline_bytes_valid_utf8_no_warning(self, caplog): + part = SimpleNamespace( + text=None, inline_data=SimpleNamespace(data=b"hello"), + ) + with caplog.at_level(logging.WARNING, logger="contractor.runners._helpers"): + assert _decode_part_text(part) == "hello" + assert not caplog.records def test_missing_inline_data_returns_empty(self): part = SimpleNamespace(text=None, inline_data=None) @@ -99,15 +113,18 @@ def test_missing_inline_data_returns_empty(self): # ─── TaskRunner._resolve_retry_params ──────────────────────────────────────── -def _make_template(default_iterations=1) -> TaskTemplate: +def _make_template( + default_iterations=1, default_artifacts=None, instructions="", +) -> TaskTemplate: return TaskTemplate( key="t", version="v1", title="T", objective="", - instructions="", + instructions=instructions, output_format="", default_iterations=default_iterations, + default_artifacts=default_artifacts or [], ) @@ -338,6 +355,480 @@ async def test_short_circuits_on_streak(self, runner, monkeypatch): assert single.await_count == 2 +# ─── TaskRunner retry loop: exceptions ─────────────────────────────────────── + + +class TestRetryLoopExceptions: + @pytest.mark.asyncio + async def test_exception_consumes_attempt_then_succeeds( + self, runner, monkeypatch + ): + # A transient exception on attempt 1 must count as a failed attempt, + # not abort the run — attempt 2 succeeds. 
+ invocation = _make_invocation(iterations=1, max_attempts=2) + single = AsyncMock(side_effect=[ + RuntimeError("transient LLM error"), + _result_for("t", True, 2), + ]) + monkeypatch.setattr(runner, "_run_single_iteration", single) + + result = await runner._run_task_with_retries( + item=invocation, task_id=1, user_id="u", total_tasks=1, + ) + assert result.status == "done" + assert single.await_count == 2 + runner._publish_task_artifacts.assert_awaited_once() + + # The exception attempt emits the same ITERATION_RESULT event as a + # status!=DONE failure, extended with error info. + failures = [ + c for c in runner._emit.await_args_list + if c.args + and c.args[0] is EventType.ITERATION_RESULT + and c.kwargs.get("completed") is False + ] + assert len(failures) == 1 + assert failures[0].kwargs["error_type"] == "RuntimeError" + assert "transient LLM error" in failures[0].kwargs["error_message"] + assert failures[0].kwargs["successful_runs"] == 0 + assert failures[0].kwargs["iteration"] == 1 + + @pytest.mark.asyncio + async def test_exhaustion_by_exceptions_chains_last_exception( + self, runner, monkeypatch + ): + invocation = _make_invocation(iterations=1, max_attempts=2) + single = AsyncMock(side_effect=[ + RuntimeError("boom-1"), + RuntimeError("boom-2"), + ]) + monkeypatch.setattr(runner, "_run_single_iteration", single) + + with pytest.raises(TaskNotCompletedError) as exc_info: + await runner._run_task_with_retries( + item=invocation, task_id=1, user_id="u", total_tasks=1, + ) + assert single.await_count == 2 + assert isinstance(exc_info.value.__cause__, RuntimeError) + assert str(exc_info.value.__cause__) == "boom-2" + assert "boom-2" in str(exc_info.value) + assert exc_info.value.last_error == "boom-2" + runner._publish_task_artifacts.assert_not_awaited() + + failed = [ + c for c in runner._emit.await_args_list + if c.args and c.args[0] is EventType.TASK_FAILED + ] + assert len(failed) == 1 + assert failed[0].kwargs["last_error"] == "boom-2" + + 
@pytest.mark.asyncio + async def test_cancelled_error_propagates_without_consuming_attempts( + self, runner, monkeypatch + ): + invocation = _make_invocation(iterations=1, max_attempts=3) + single = AsyncMock(side_effect=asyncio.CancelledError()) + monkeypatch.setattr(runner, "_run_single_iteration", single) + + with pytest.raises(asyncio.CancelledError): + await runner._run_task_with_retries( + item=invocation, task_id=1, user_id="u", total_tasks=1, + ) + # No retries: cancellation unwinds immediately. + assert single.await_count == 1 + types = [c.args[0] for c in runner._emit.await_args_list if c.args] + assert EventType.ITERATION_RESULT not in types + assert EventType.TASK_FAILED not in types + + +# ─── Missing declared input artifacts ──────────────────────────────────────── + + +class TestMissingDeclaredArtifacts: + def _runner(self) -> TaskRunner: + return TaskRunner( + name="test", artifact_service=MagicMock(spec=BaseArtifactService), + ) + + @pytest.mark.asyncio + async def test_missing_artifact_warns_and_substitutes_empty(self, caplog): + r = self._runner() + r.artifact_service.load_artifact = AsyncMock(return_value=None) + + with caplog.at_level( + logging.WARNING, logger="contractor.runners.task_runner", + ): + loaded = await r._load_artifacts( + "u", ["up/result"], task_ref="task-a", + ) + + # Empty-string substitution is preserved (seeds may legitimately come + # from the persistent store) — but it is no longer silent. + assert loaded == {"up/result": ""} + messages = [rec.getMessage() for rec in caplog.records] + assert any("task-a" in m and "up/result" in m for m in messages) + + # The empty text still renders cleanly via {artifact__*} substitution. 
+ template = _make_template( + default_artifacts=["up/result"], + instructions="ctx: [{artifact__up__result}]", + ) + rendered = r._render_task(template, {}, loaded) + assert "ctx: []" in rendered.instructions + + @pytest.mark.asyncio + async def test_present_artifact_loads_without_warning(self, caplog): + r = self._runner() + part = SimpleNamespace(text="content", inline_data=None) + r.artifact_service.load_artifact = AsyncMock(return_value=part) + + with caplog.at_level( + logging.WARNING, logger="contractor.runners.task_runner", + ): + loaded = await r._load_artifacts( + "u", ["up/result"], task_ref="task-a", + ) + + assert loaded == {"up/result": "content"} + assert not caplog.records + + +# ─── Carry-state hygiene across attempts ───────────────────────────────────── + + +class TestCarryStateStripsStaleInvocationKeys: + def _runner(self) -> TaskRunner: + return TaskRunner( + name="test", artifact_service=MagicMock(spec=BaseArtifactService), + ) + + def _rendered(self) -> RenderedTask: + return RenderedTask( + key="t", title="T", objective="obj", instructions="", + output_format="", format="json", + ) + + def test_second_attempt_state_has_no_first_attempt_invocation_keys(self): + # StreamlineManager keys its planner-internal state per ADK + # invocation (task::{gid}::{invocation_id}::…). A retry gets a fresh + # invocation_id, so the first attempt's keys are unreachable — they + # must not be dragged into the next attempt's initial state. 
+ r = self._runner() + carry = { + "task::1::e-first::planner::tasks": [{"task_id": "0"}], + "task::1::e-first::planner::idx": 0, + } + + state = r._build_task_initial_state( + task_id=1, task=self._rendered(), carry_state=carry, + ) + + assert not any("e-first" in key for key in state) + + def test_other_namespaces_and_plain_keys_survive(self): + # Only THIS task's invocation-scoped keys are stripped: another + # task's keys and non-task keys must pass through untouched, and the + # task's own fixed keys are rebuilt by build_active_state. + r = self._runner() + carry = { + "task::1::e-first::planner::tasks": ["stale"], + "task::2::e-other::planner::tasks": ["other-task"], + "task::1::status": "running", + "callbacks": {"tokens": 5}, + } + + state = r._build_task_initial_state( + task_id=1, task=self._rendered(), carry_state=carry, + ) + + assert "task::1::e-first::planner::tasks" not in state + assert state["task::2::e-other::planner::tasks"] == ["other-task"] + assert state["callbacks"] == {"tokens": 5} + assert state["task::1::status"] == TaskStatus.RUNNING + assert state["task::1::objective"] == "obj" + + +# ─── Skills: fail-fast validation + once-per-task injection ────────────────── + + +class TestSkillValidationAndInjection: + def _runner_with_template(self, monkeypatch, template) -> TaskRunner: + r = TaskRunner( + name="test", artifact_service=MagicMock(spec=BaseArtifactService), + ) + monkeypatch.setattr(r, "_ensure_template", MagicMock(return_value=template)) + return r + + def test_add_task_rejects_unknown_skill(self, monkeypatch): + # A typo'd skill must fail at queue time, not when the task's first + # iteration starts hours into a workflow. 
+ template = _make_template() + r = self._runner_with_template(monkeypatch, template) + + with pytest.raises(ValueError, match="definitely_not_a_skill"): + r.add_task( + "t", + worker_builder=lambda **_: MagicMock(), + skills=["definitely_not_a_skill"], + ) + assert r.queue == [] + + def test_add_task_validates_template_default_skills( + self, tmp_path, monkeypatch, + ): + import contractor.runners.skills as skills_mod + + monkeypatch.setattr(skills_mod, "SKILLS_BASE_DIR", tmp_path) + (tmp_path / "good").mkdir() + + template = TaskTemplate( + key="t", version="v1", title="T", objective="", + instructions="", output_format="", + default_skills=["good", "bad"], + ) + r = self._runner_with_template(monkeypatch, template) + + with pytest.raises(ValueError, match="'bad'"): + r.add_task("t", worker_builder=lambda **_: MagicMock()) + + def test_add_task_accepts_existing_skill(self, tmp_path, monkeypatch): + import contractor.runners.skills as skills_mod + + monkeypatch.setattr(skills_mod, "SKILLS_BASE_DIR", tmp_path) + (tmp_path / "good").mkdir() + + template = _make_template() + r = self._runner_with_template(monkeypatch, template) + + r.add_task( + "t", worker_builder=lambda **_: MagicMock(), skills=["good"], + ) + assert r.queue[0].skills == ["good"] + + @pytest.mark.asyncio + async def test_injection_happens_once_across_attempts( + self, runner, monkeypatch, + ): + # Skills + input artifacts are invariant across attempts; the + # read-modify-write memory injection must run once per task, not + # once per retry. 
+ inject_skills_spy = AsyncMock() + inject_artifacts_spy = AsyncMock() + monkeypatch.setattr(runner, "_inject_skills", inject_skills_spy) + monkeypatch.setattr(runner, "_inject_artifacts", inject_artifacts_spy) + + invocation = _make_invocation(iterations=1, max_attempts=3) + invocation.skills = ["some_skill"] + single = AsyncMock(side_effect=[ + _result_for("t", False, 1), + _result_for("t", False, 2), + _result_for("t", True, 3), + ]) + monkeypatch.setattr(runner, "_run_single_iteration", single) + + await runner._run_task_with_retries( + item=invocation, task_id=1, user_id="u", total_tasks=1, + ) + + assert single.await_count == 3 + inject_skills_spy.assert_awaited_once() + inject_artifacts_spy.assert_awaited_once() + assert inject_skills_spy.await_args.kwargs["skills"] == ["some_skill"] + + +# ─── add_task artifacts override ───────────────────────────────────────────── + + +class TestAddTaskArtifactsOverride: + def _runner_with_template(self, monkeypatch, template) -> TaskRunner: + r = TaskRunner( + name="test", artifact_service=MagicMock(spec=BaseArtifactService), + ) + monkeypatch.setattr(r, "_ensure_template", MagicMock(return_value=template)) + return r + + def test_empty_list_overrides_template_defaults(self, monkeypatch): + template = _make_template(default_artifacts=["up/result"]) + r = self._runner_with_template(monkeypatch, template) + + r.add_task("t", worker_builder=lambda **_: MagicMock(), artifacts=[]) + + assert r.queue[0].artifacts == [] + + def test_none_falls_back_to_template_defaults(self, monkeypatch): + template = _make_template(default_artifacts=["up/result"]) + r = self._runner_with_template(monkeypatch, template) + + r.add_task("t", worker_builder=lambda **_: MagicMock()) + + assert r.queue[0].artifacts == ["up/result"] + + +# ─── Per-invocation artifact keys (fan-out) ────────────────────────────────── + + +def _dict_artifact_service(store: dict[str, str]) -> MagicMock: + """An artifact service backed by a plain dict, so publish/load 
round-trip.""" + + async def save(*, app_name, user_id, session_id, filename, artifact): + store[filename] = artifact.text + + async def load(*, app_name, user_id, session_id, filename): + if filename in store: + return SimpleNamespace(text=store[filename], inline_data=None) + return None + + svc = MagicMock(spec=BaseArtifactService) + svc.save_artifact = AsyncMock(side_effect=save) + svc.load_artifact = AsyncMock(side_effect=load) + return svc + + +def _fanout_runner(store, monkeypatch, **kwargs) -> TaskRunner: + """TaskRunner with a dict-backed artifact service and a REAL + ``_publish_task_artifacts``, so per-key publishing is exercised.""" + r = TaskRunner( + name="test", artifact_service=_dict_artifact_service(store), **kwargs, + ) + template = _make_template(default_iterations=1) + r.templates[("t", "v1")] = template + + rendered = RenderedTask( + key="t", title="T", objective="", instructions="", + output_format="", format="json", + ) + monkeypatch.setattr(r, "_ensure_template", MagicMock(return_value=template)) + monkeypatch.setattr(r, "_load_artifacts", AsyncMock(return_value={})) + monkeypatch.setattr(r, "_render_task", MagicMock(return_value=rendered)) + monkeypatch.setattr(r, "_emit", AsyncMock()) + return r + + +def _completed_result(text: str, *, task_id: int) -> TaskResult: + result = _result_for("t", True, task_id, task_id=task_id) + result.result = text + return result + + +class TestPerInvocationArtifactKeys: + @pytest.mark.asyncio + async def test_distinct_artifact_keys_publish_non_colliding(self, monkeypatch): + # Two invocations of the same template with distinct artifact_keys + # must not overwrite each other's published artifacts. 
+ store: dict[str, str] = {} + r = _fanout_runner(store, monkeypatch) + + r.add_task( + "t", ref="t:f1", artifact_key="t/f1", + worker_builder=lambda **_: MagicMock(), + ) + r.add_task( + "t", ref="t:f2", artifact_key="t/f2", + worker_builder=lambda **_: MagicMock(), + ) + single = AsyncMock(side_effect=[ + _completed_result("R1", task_id=0), + _completed_result("R2", task_id=1), + ]) + monkeypatch.setattr(r, "_run_single_iteration", single) + + results = await r.run(user_id="u") + + assert len(results) == 2 + assert store["t/f1/result"] == "R1" + assert store["t/f2/result"] == "R2" + assert "t/result" not in store + + def test_build_iteration_result_publishes_under_effective_key( + self, monkeypatch, + ): + # The result's published_artifacts (what checkpoints record and events + # report) must follow the invocation's artifact_key, defaulting to the + # template key when unset. + r = _fanout_runner({}, monkeypatch) + rendered = RenderedTask( + key="t", title="T", objective="", instructions="", + output_format="", format="json", + ) + keyed = _make_invocation(iterations=1, max_attempts=1) + keyed.artifact_key = "t/f1" + result = r._build_iteration_result(keyed, rendered, 0, "s", "", {}, {}) + assert result.published_artifacts == artifact_names_for_key("t/f1") + + plain = _make_invocation(iterations=1, max_attempts=1) + result = r._build_iteration_result(plain, rendered, 0, "s", "", {}, {}) + assert result.published_artifacts == artifact_names_for_key("t") + + @pytest.mark.asyncio + async def test_default_key_is_template_key(self, monkeypatch): + # No artifact_key → unchanged behavior: publish under the template key. 
+ store: dict[str, str] = {} + r = _fanout_runner(store, monkeypatch) + + r.add_task("t", ref="t:0", worker_builder=lambda **_: MagicMock()) + single = AsyncMock(return_value=_completed_result("R", task_id=0)) + monkeypatch.setattr(r, "_run_single_iteration", single) + + results = await r.run(user_id="u") + + assert len(results) == 1 + assert r.queue[0].artifact_key is None + assert r.queue[0].effective_artifact_key == "t" + assert store["t/result"] == "R" + + def test_invalid_artifact_key_rejected(self, monkeypatch): + r = _fanout_runner({}, monkeypatch) + with pytest.raises(InvalidArtifactKeyError): + r.add_task( + "t", ref="t:0", artifact_key="../escape", + worker_builder=lambda **_: MagicMock(), + ) + + @pytest.mark.asyncio + async def test_restore_validates_own_key_not_siblings( + self, tmp_path, monkeypatch, + ): + # Both siblings are checkpointed as done, but only f2's artifacts (and + # a stale shared-key artifact) survive in the store. f1 must re-run — + # neither the sibling's artifacts nor the shared key may validate its + # restore — while f2 restores from its own key. 
+ cp = Checkpoint(workflow="test") + for ref in ("t:f1", "t:f2"): + cp.mark_done(CheckpointEntry( + task_id=0, ref=ref, template_key="t", template_version="v1", + published_artifacts=dict(artifact_names_for_key("t")), + )) + cp.save(tmp_path / "checkpoint.json") + + store = { + "t/result": "stale shared", "t/summary": "", "t/records": "[]", + "t/f2/result": "R2", "t/f2/summary": "", "t/f2/records": "[]", + } + r = _fanout_runner( + store, monkeypatch, checkpoint_path=tmp_path / "checkpoint.json", + ) + + r.add_task( + "t", ref="t:f1", artifact_key="t/f1", + worker_builder=lambda **_: MagicMock(), + ) + r.add_task( + "t", ref="t:f2", artifact_key="t/f2", + worker_builder=lambda **_: MagicMock(), + ) + single = AsyncMock(return_value=_completed_result("R1", task_id=0)) + monkeypatch.setattr(r, "_run_single_iteration", single) + + results = await r.run(user_id="u") + + # f1 re-ran (own artifacts were missing) and re-published under its key. + single.assert_awaited_once() + assert store["t/f1/result"] == "R1" + # f2 restored from checkpoint against its own artifacts. + assert results[1].result == "(restored from checkpoint)" + assert results[1].published_artifacts == artifact_names_for_key("t/f2") + + # ─── Checkpoint integration ───────────────────────────────────────────────── @@ -431,6 +922,113 @@ async def test_checkpoint_reruns_if_artifact_missing(self, tmp_path, monkeypatch # Artifact missing → task was re-executed. single.assert_awaited_once() + @pytest.mark.asyncio + async def test_restore_skipped_on_template_key_mismatch( + self, tmp_path, monkeypatch, caplog, + ): + # Entry was recorded for a different template — a stale checkpoint + # after a workflow edit must not silently skip the task. 
+ cp = Checkpoint(workflow="test") + cp.mark_done(CheckpointEntry( + task_id=0, ref="a:0", template_key="other", template_version="v1", + published_artifacts={"result": "other/result"}, + )) + cp.save(tmp_path / "checkpoint.json") + + r = _checkpoint_runner(tmp_path, monkeypatch) + monkeypatch.setattr(r, "_load_artifacts", AsyncMock(return_value={})) + + inv = _make_invocation(ref="a:0", iterations=1, max_attempts=1) + single = AsyncMock(return_value=_result_for("t", True, 1, task_id=0)) + monkeypatch.setattr(r, "_run_single_iteration", single) + + r.queue.append(inv) + with caplog.at_level( + logging.WARNING, logger="contractor.runners.task_runner", + ): + await r.run(user_id="u") + + # Mismatch → re-run, with a warning naming both templates. + single.assert_awaited_once() + messages = [rec.getMessage() for rec in caplog.records] + assert any( + "a:0" in m and "other@v1" in m and "t@v1" in m for m in messages + ) + + @pytest.mark.asyncio + async def test_restore_skipped_on_template_version_mismatch( + self, tmp_path, monkeypatch, caplog, + ): + cp = Checkpoint(workflow="test") + cp.mark_done(CheckpointEntry( + task_id=0, ref="a:0", template_key="t", template_version="v0", + published_artifacts={"result": "t/result"}, + )) + cp.save(tmp_path / "checkpoint.json") + + r = _checkpoint_runner(tmp_path, monkeypatch) + monkeypatch.setattr(r, "_load_artifacts", AsyncMock(return_value={})) + + inv = _make_invocation(ref="a:0", iterations=1, max_attempts=1) + single = AsyncMock(return_value=_result_for("t", True, 1, task_id=0)) + monkeypatch.setattr(r, "_run_single_iteration", single) + + r.queue.append(inv) + with caplog.at_level( + logging.WARNING, logger="contractor.runners.task_runner", + ): + await r.run(user_id="u") + + single.assert_awaited_once() + messages = [rec.getMessage() for rec in caplog.records] + assert any("a:0" in m and "t@v0" in m and "t@v1" in m for m in messages) + + @pytest.mark.asyncio + async def test_restore_still_happens_on_template_match( + self, 
tmp_path, monkeypatch, + ): + # Sanity companion to the mismatch tests: same template_key AND + # template_version → restore proceeds, task is skipped. + cp = Checkpoint(workflow="test") + cp.mark_done(CheckpointEntry( + task_id=0, ref="a:0", template_key="t", template_version="v1", + published_artifacts={"result": "t/result"}, + )) + cp.save(tmp_path / "checkpoint.json") + + r = _checkpoint_runner(tmp_path, monkeypatch) + monkeypatch.setattr(r, "_load_artifacts", AsyncMock(return_value={})) + + inv = _make_invocation(ref="a:0", iterations=1, max_attempts=1) + single = AsyncMock(return_value=_result_for("t", True, 1, task_id=0)) + monkeypatch.setattr(r, "_run_single_iteration", single) + + r.queue.append(inv) + results = await r.run(user_id="u") + + single.assert_not_awaited() + assert results[0].result == "(restored from checkpoint)" + + @pytest.mark.asyncio + async def test_load_checkpoint_failure_still_clears_handler( + self, tmp_path, monkeypatch, + ): + # _load_checkpoint runs inside the try block, so the finally that + # clears _on_event must run even if it raises. 
+ r = _checkpoint_runner(tmp_path, monkeypatch) + monkeypatch.setattr( + r, "_load_checkpoint", + MagicMock(side_effect=RuntimeError("checkpoint exploded")), + ) + + async def on_event(event): + pass + + with pytest.raises(RuntimeError, match="checkpoint exploded"): + await r.run(user_id="u", on_event=on_event) + + assert r._on_event is None + @pytest.mark.asyncio async def test_no_checkpoint_path_runs_normally(self, tmp_path, monkeypatch): r = TaskRunner( @@ -510,6 +1108,68 @@ async def test_emits_run_finished_not_ok_on_failure(self, runner, monkeypatch): assert finished[0].kwargs["ok"] is False +# ─── Event handler failures (best-effort telemetry) ────────────────────────── + + +def _real_emit_runner(monkeypatch) -> TaskRunner: + """A TaskRunner with I/O stubbed but a REAL _emit, so handler failures + flow through the production guard.""" + r = TaskRunner(name="test", artifact_service=MagicMock(spec=BaseArtifactService)) + r.templates[("t", "v1")] = _make_template(default_iterations=1) + rendered = RenderedTask( + key="t", title="T", objective="", instructions="", + output_format="", format="json", + ) + monkeypatch.setattr(r, "_load_artifacts", AsyncMock(return_value={})) + monkeypatch.setattr(r, "_render_task", MagicMock(return_value=rendered)) + monkeypatch.setattr(r, "_publish_task_artifacts", AsyncMock()) + return r + + +class TestEmitHandlerFailures: + @pytest.mark.asyncio + async def test_raising_handler_does_not_abort_run(self, monkeypatch, caplog): + # An observability failure (e.g. MetricsSink hitting a full disk) is + # best-effort telemetry — the workflow must complete anyway. 
+ r = _real_emit_runner(monkeypatch) + monkeypatch.setattr( + r, "_run_single_iteration", + AsyncMock(return_value=_result_for("t", True, 1, task_id=0)), + ) + r.queue.append(_make_invocation(ref="a:0", iterations=1, max_attempts=1)) + + async def bad_handler(event): + raise OSError("No space left on device") + + with caplog.at_level( + logging.ERROR, logger="contractor.runners.task_runner", + ): + results = await r.run(user_id="u", on_event=bad_handler) + + assert len(results) == 1 + assert results[0].status == "done" + assert any( + "event handler failed" in rec.getMessage() for rec in caplog.records + ) + + @pytest.mark.asyncio + async def test_cancelled_error_from_handler_propagates(self, monkeypatch): + # CancelledError must never be swallowed by the emit guard. + r = _real_emit_runner(monkeypatch) + monkeypatch.setattr( + r, "_run_single_iteration", + AsyncMock(return_value=_result_for("t", True, 1, task_id=0)), + ) + r.queue.append(_make_invocation(ref="a:0", iterations=1, max_attempts=1)) + + async def cancelling_handler(event): + raise asyncio.CancelledError() + + with pytest.raises(asyncio.CancelledError): + await r.run(user_id="u", on_event=cancelling_handler) + assert r._on_event is None + + # ─── Per-task event payloads (characterization) ────────────────────────────── diff --git a/tests/units/contractor_tests/test_analyze_metrics_costs.py b/tests/units/contractor_tests/test_analyze_metrics_costs.py new file mode 100644 index 0000000..525a5c9 --- /dev/null +++ b/tests/units/contractor_tests/test_analyze_metrics_costs.py @@ -0,0 +1,74 @@ +"""Unit tests for cost estimation in ``scripts/analyze_metrics.py``. + +Costs are only computed for models with a known pricing entry; rows from +unknown models (e.g. local lm-studio aliases) get no estimate and are +excluded from ``llm_with_cost`` so cost charts/tables skip them. 
+""" +from __future__ import annotations + +import importlib.util +import sys +from pathlib import Path + +import pandas as pd +import pytest + +_SPEC = importlib.util.spec_from_file_location( + "analyze_metrics", + Path(__file__).resolve().parents[3] / "scripts" / "analyze_metrics.py", +) +analyze_metrics = importlib.util.module_from_spec(_SPEC) +# Register before exec: dataclass slots resolution looks the module up in +# sys.modules while the module body runs. +sys.modules[_SPEC.name] = analyze_metrics +_SPEC.loader.exec_module(analyze_metrics) + + +def _row(model: str) -> pd.Series: + return pd.Series( + { + "model": model, + "input_tokens": 1_000_000, + "output_tokens": 1_000_000, + "cached_tokens": 0, + } + ) + + +def test_known_model_gets_cost(): + cost = analyze_metrics._estimate_row_cost(_row("gemini-2.5-flash")) + assert cost == pytest.approx(0.15 + 0.60) + + +def test_unknown_model_gets_no_cost(): + assert analyze_metrics._estimate_row_cost(_row("lm-studio-qwen3.6")) is None + + +def _llm_record(model: str | None) -> dict: + return { + "type": "llm_usage", + "ts_iso": "2026-01-01T00:00:00Z", + "model": model, + "usage": {"input": 1000, "output": 1000, "total": 2000}, + } + + +def test_slices_drop_unpriced_rows(): + df = analyze_metrics.normalize_records( + [_llm_record("gemini-2.5-flash"), _llm_record("lm-studio-qwen3.6")] + ) + slices = analyze_metrics.MetricSlices.build(df) + assert len(slices.llm_with_cost) == 1 + assert slices.llm_with_cost["model"].tolist() == ["gemini-2.5-flash"] + + +def test_slices_all_unknown_models_leave_cost_table_empty(): + df = analyze_metrics.normalize_records( + [_llm_record("lm-studio-qwen3.6"), _llm_record(None)] + ) + slices = analyze_metrics.MetricSlices.build(df) + assert slices.llm_with_cost.empty + + summary = analyze_metrics.compute_summary(df, slices) + # Rendered as "n/a" instead of a fictional $0-or-default cost. 
+ assert summary["estimated_total_cost"] is None diff --git a/tests/units/contractor_tests/test_eval_archive_dirs.py b/tests/units/contractor_tests/test_eval_archive_dirs.py new file mode 100644 index 0000000..40d56a1 --- /dev/null +++ b/tests/units/contractor_tests/test_eval_archive_dirs.py @@ -0,0 +1,93 @@ +"""Dated, never-overwritten eval archive (the data-loss fix). + +Each run lands in its own ``eval_runs/<stamp>/<scenario>-<unit>-eval-<fixture>/`` +folder so results are never overwritten; the flat +``eval_runs/<scenario>-<unit>[-<metric_kind>]/`` path is kept as a "latest" +pointer for analytics-ui. +""" +from __future__ import annotations + +import json + +import tests.eval.results as results +from tests.eval.results import CaseResult, EvalSink + + +def test_run_slug_and_archive_dir(monkeypatch, tmp_path): + monkeypatch.setattr(results, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(results, "RUN_STAMP", "0607-120000") + assert results._run_slug("agent", "trace_agent", "crapi-workshop") == \ + "agent-trace_agent-eval-crapi-workshop" + assert results._run_slug("task", "oas_build") == "task-oas_build" + assert results.run_archive_dir("agent", "trace_agent", "crapi-workshop") == \ + tmp_path / "0607-120000" / "agent-trace_agent-eval-crapi-workshop" + + +def test_case_artifact_dir_under_archive(monkeypatch, tmp_path): + monkeypatch.setattr(results, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(results, "RUN_STAMP", "0607-120000") + assert results.case_artifact_dir("trace_agent", "vampi", "login") == \ + tmp_path / "0607-120000" / "agent-trace_agent-eval-vampi" / "cases" / "login" / "artifacts" + # scenario tagging (task/pipeline) + assert "task-oas_build-eval-vampi" in str( + results.case_artifact_dir("oas_build", "vampi", "c1", scenario="task")) + + +def test_run_stamp_env_override(monkeypatch): + monkeypatch.setenv("CONTRACTOR_EVAL_RUN_STAMP", "qw3-off run!") + assert results._compute_run_stamp() == "qw3-off_run_" # sanitized to [alnum-_] + monkeypatch.delenv("CONTRACTOR_EVAL_RUN_STAMP", raising=False) + s = 
results._compute_run_stamp() # default mmdd-HHMMSS + assert len(s) == 11 and s[4] == "-" + + +def test_flush_writes_latest_and_dated_archive(monkeypatch, tmp_path): + monkeypatch.setattr(results, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(results, "RUN_STAMP", "0607-A") + sink = EvalSink() + sink.record(scenario="agent", unit="trace_agent", metric_kind="diff", + fixture="crapi-workshop", + case=CaseResult(id="c1", passed=True, pass_count=1, attempts=1), + model="m", pass_at=3) + sink.record(scenario="agent", unit="trace_agent", metric_kind="diff", + fixture="vulnyapi", + case=CaseResult(id="c2", passed=False, pass_count=0, attempts=3), + model="m", pass_at=3) + sink.flush() + + # latest pointer — combined, both fixtures, at <scenario>-<unit>-<metric_kind> + latest = tmp_path / "agent-trace_agent-diff" / "eval_results.json" + assert latest.exists() + assert {f["slug"] for f in json.loads(latest.read_text())["fixtures"]} == \ + {"crapi-workshop", "vulnyapi"} + + # dated per-fixture archives — one folder each, single-fixture envelopes + a1 = tmp_path / "0607-A" / "agent-trace_agent-eval-crapi-workshop" / "eval_results.json" + a2 = tmp_path / "0607-A" / "agent-trace_agent-eval-vulnyapi" / "eval_results.json" + assert a1.exists() and a2.exists() + assert [f["slug"] for f in json.loads(a1.read_text())["fixtures"]] == ["crapi-workshop"] + # per-case metrics persisted under the archive (crash-safe) + assert (tmp_path / "0607-A" / "agent-trace_agent-eval-crapi-workshop" + / "cases" / "c1" / "metrics.json").exists() + + +def test_second_run_does_not_overwrite_archive(monkeypatch, tmp_path): + monkeypatch.setattr(results, "EVAL_ROOT", tmp_path) + + monkeypatch.setattr(results, "RUN_STAMP", "0607-A") + s1 = EvalSink() + s1.record(scenario="agent", unit="trace_agent", metric_kind="diff", fixture="crapi-workshop", + case=CaseResult(id="c1", passed=True, pass_count=1, attempts=1)) + s1.flush() + + monkeypatch.setattr(results, "RUN_STAMP", "0607-B") + s2 = EvalSink() + s2.record(scenario="agent", 
unit="trace_agent", metric_kind="diff", fixture="crapi-workshop", + case=CaseResult(id="c1", passed=False, pass_count=0, attempts=1)) + s2.flush() + + a = tmp_path / "0607-A" / "agent-trace_agent-eval-crapi-workshop" / "eval_results.json" + b = tmp_path / "0607-B" / "agent-trace_agent-eval-crapi-workshop" / "eval_results.json" + assert a.exists() and b.exists() # both runs preserved — no overwrite + assert json.loads(a.read_text())["fixtures"][0]["cases"][0]["passed"] is True + assert json.loads(b.read_text())["fixtures"][0]["cases"][0]["passed"] is False diff --git a/tests/units/contractor_tests/test_eval_gate.py b/tests/units/contractor_tests/test_eval_gate.py new file mode 100644 index 0000000..aea7d0d --- /dev/null +++ b/tests/units/contractor_tests/test_eval_gate.py @@ -0,0 +1,38 @@ +"""Unit tests for the eval auto-skip gate helpers in ``tests/eval/conftest.py``. + +The gate must only be bypassed when the ``-m`` expression actually selects the +``eval`` marker, or when ``CONTRACTOR_RUN_EVAL`` is a truthy boolean — an +unrelated ``-m "not slow"`` or ``CONTRACTOR_RUN_EVAL=0`` must NOT silently +enable the LLM-bound suite. 
+""" +from __future__ import annotations + +from tests.eval.conftest import _markexpr_selects_eval, _run_eval_env_enabled + + +def test_markexpr_eval_selects(): + assert _markexpr_selects_eval("eval") + assert _markexpr_selects_eval("eval and trace") + assert _markexpr_selects_eval("slow or eval") + # `not eval` matches too — harmless, pytest deselects the items itself + assert _markexpr_selects_eval("not eval") + + +def test_markexpr_unrelated_does_not_select(): + assert not _markexpr_selects_eval(None) + assert not _markexpr_selects_eval("") + assert not _markexpr_selects_eval("not slow") + assert not _markexpr_selects_eval("integration") + # word-boundary match: marker names merely containing "eval" don't count + assert not _markexpr_selects_eval("evaluation") + assert not _markexpr_selects_eval("not pre_eval_check") + + +def test_run_eval_env_truthy_values(): + for v in ("1", "true", "TRUE", " yes ", "on"): + assert _run_eval_env_enabled(v) + + +def test_run_eval_env_falsy_values(): + for v in (None, "", "0", "false", "no", "off", "anything-else"): + assert not _run_eval_env_enabled(v) diff --git a/tests/units/contractor_tests/test_eval_harness_timeout.py b/tests/units/contractor_tests/test_eval_harness_timeout.py new file mode 100644 index 0000000..705fe4f --- /dev/null +++ b/tests/units/contractor_tests/test_eval_harness_timeout.py @@ -0,0 +1,81 @@ +"""Unit tests for the eval harness timeout path (``tests/eval/harness.py``). + +Drives ``run_agent`` with dummy (non-LLM) ADK agents: a hanging agent must +raise ``AgentRunTimeout`` carrying the partial ``AgentRun`` captured before +the deadline, and a fast agent must return a normal run (``timed_out=False``). 
+""" +from __future__ import annotations + +import asyncio +from collections.abc import AsyncGenerator + +import pytest +from google.adk.agents import BaseAgent +from google.adk.agents.invocation_context import InvocationContext +from google.adk.events import Event +from google.genai import types + +from tests.eval.harness import AgentRunTimeout, run_agent + + +class _HangingAgent(BaseAgent): + """Emits one tool-call event, then never finishes.""" + + async def _run_async_impl( + self, ctx: InvocationContext + ) -> AsyncGenerator[Event, None]: + yield Event( + author=self.name, + invocation_id=ctx.invocation_id, + content=types.Content( + role="model", + parts=[ + types.Part( + function_call=types.FunctionCall( + name="probe_tool", args={"x": 1} + ) + ) + ], + ), + ) + await asyncio.sleep(3600) + + +class _FastAgent(BaseAgent): + """Finishes immediately with a final text response.""" + + async def _run_async_impl( + self, ctx: InvocationContext + ) -> AsyncGenerator[Event, None]: + yield Event( + author=self.name, + invocation_id=ctx.invocation_id, + content=types.Content(role="model", parts=[types.Part(text="done")]), + ) + + +def test_timeout_raises_with_partial_run(): + agent = _HangingAgent(name="hanging_agent") + + with pytest.raises(AgentRunTimeout) as exc_info: + asyncio.run(run_agent(agent, user_message="go", timeout_s=0.5)) + + err = exc_info.value + # Still a timeout failure for consumers that catch asyncio.TimeoutError. + assert isinstance(err, asyncio.TimeoutError) + assert err.timeout_s == 0.5 + + partial = err.partial + assert partial.timed_out is True + # The tool call emitted before the deadline is preserved. 
+ assert partial.tool_names() == ["probe_tool"] + assert partial.calls_named("probe_tool")[0].args == {"x": 1} + + +def test_fast_run_is_not_timed_out(): + agent = _FastAgent(name="fast_agent") + + run = asyncio.run(run_agent(agent, user_message="go", timeout_s=30.0)) + + assert run.timed_out is False + assert run.final_text == "done" diff --git a/tests/units/contractor_tests/test_eval_results.py b/tests/units/contractor_tests/test_eval_results.py index ffeeac1..d071f0e 100644 --- a/tests/units/contractor_tests/test_eval_results.py +++ b/tests/units/contractor_tests/test_eval_results.py @@ -228,13 +228,14 @@ def test_envelope_shape_and_embedded_snapshot(): def test_eval_sink_groups_by_scenario_unit(tmp_path, monkeypatch): import tests.eval.results as R monkeypatch.setattr(R, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(R, "RUN_STAMP", "STAMP") sink = EvalSink() sink.record(scenario="task", unit="oas_build", metric_kind="diff", fixture="vulnyapi", case=CaseResult("vulnyapi", True, 1, 1, detail={"f1": 1.0}), model="m", pass_at=2, artifacts={"oas_build/result": "openapi: 3.0.0"}) - # per-case persistence happens immediately on record() - case_dir = tmp_path / "oas_build" / "cases" / "vulnyapi__vulnyapi" + # per-case persistence happens immediately on record(), into the dated archive + case_dir = tmp_path / "STAMP" / "task-oas_build-eval-vulnyapi" / "cases" / "vulnyapi" assert (case_dir / "metrics.json").is_file() assert (case_dir / "oas_build_result").read_text() == "openapi: 3.0.0" assert json.loads((case_dir / "metrics.json").read_text())["passed"] is True @@ -242,21 +243,99 @@ def test_eval_sink_groups_by_scenario_unit(tmp_path, monkeypatch): fixture="petstore", case=CaseResult("petstore", False, 0, 1, detail={"f1": 0.0})) sink.record(scenario="agent", unit="swe_edit_agent", metric_kind="generic", fixture="fx", case=CaseResult("fx", True, 1, 1)) - paths = sink.flush() - assert len(paths) == 2 # two (scenario, unit) groups - names = sorted(p.parent.name for p in 
paths) - assert names == ["oas_build", "swe_edit_agent"] - oas = json.loads((tmp_path / "oas_build" / "eval_results.json").read_text()) + sink.flush() + # "latest" pointers at <scenario>-<unit>[-<metric_kind>] (metric_kind + # suffix only when non-generic) + oas = json.loads((tmp_path / "task-oas_build-diff" / "eval_results.json").read_text()) assert oas["scenario"] == "task" and oas["pass_at"] == 2 # max pass_at wins assert len(oas["fixtures"]) == 2 and oas["headline"]["pass_rate"] == 0.5 + assert (tmp_path / "agent-swe_edit_agent" / "eval_results.json").is_file() + # dated, per-fixture archives — never overwritten (the data-loss fix) + assert (tmp_path / "STAMP" / "task-oas_build-eval-vulnyapi" / "eval_results.json").is_file() + assert (tmp_path / "STAMP" / "task-oas_build-eval-petstore" / "eval_results.json").is_file() + assert (tmp_path / "STAMP" / "agent-swe_edit_agent-eval-fx" / "eval_results.json").is_file() + + +def test_eval_sink_buckets_sharing_unit_get_distinct_run_dirs(tmp_path, monkeypatch): + """Two buckets sharing a unit (different scenario / metric_kind) must not + overwrite each other's latest-pointer envelope in the same flush.""" + import tests.eval.results as R + monkeypatch.setattr(R, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(R, "RUN_STAMP", "STAMP") + + sink = EvalSink() + sink.record(scenario="agent", unit="trace", metric_kind="diff", + fixture="fx", case=CaseResult("a", True, 1, 1, detail={"f1": 1.0})) + sink.record(scenario="agent", unit="trace", metric_kind="detection", + fixture="fx", case=CaseResult("a", False, 0, 1, detail={"tp": 0, "fp": 0, "fn": 1})) + sink.record(scenario="pipeline", unit="trace", metric_kind="diff", + fixture="fx", case=CaseResult("a", True, 1, 1, detail={"f1": 0.5})) + paths = sink.flush() + + latest = {p for p in paths if p.parent.parent == tmp_path} + assert latest == { + tmp_path / "agent-trace-diff" / "eval_results.json", + tmp_path / "agent-trace-detection" / "eval_results.json", + tmp_path / "pipeline-trace-diff" / 
"eval_results.json", + } + # each bucket kept its own case (no cross-bucket overwrite) + assert json.loads((tmp_path / "agent-trace-diff" / "eval_results.json").read_text())[ + "fixtures"][0]["cases"][0]["passed"] is True + assert json.loads((tmp_path / "agent-trace-detection" / "eval_results.json").read_text())[ + "fixtures"][0]["cases"][0]["passed"] is False + + +def test_eval_sink_explicit_run_name_still_wins(tmp_path, monkeypatch): + import tests.eval.results as R + monkeypatch.setattr(R, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(R, "RUN_STAMP", "STAMP") + sink = EvalSink() + sink.record(scenario="agent", unit="trace", metric_kind="diff", fixture="fx", + case=CaseResult("a", True, 1, 1), run_name="my-custom-run") + sink.flush() + assert (tmp_path / "my-custom-run" / "eval_results.json").is_file() + + +def test_eval_sink_warns_on_model_disagreement(tmp_path, monkeypatch, caplog): + import logging + + import tests.eval.results as R + monkeypatch.setattr(R, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(R, "RUN_STAMP", "STAMP") + sink = EvalSink() + sink.record(scenario="agent", unit="u", metric_kind="generic", fixture="fx", + case=CaseResult("a", True, 1, 1), model="model-a", prompt_version="v1") + with caplog.at_level(logging.WARNING, logger="tests.eval.results"): + sink.record(scenario="agent", unit="u", metric_kind="generic", fixture="fx", + case=CaseResult("b", True, 1, 1), model="model-b", prompt_version="v1") + assert any("model-b" in r.message and "model-a" in r.message for r in caplog.records) + # first value wins in the envelope + sink.flush() + env = json.loads((tmp_path / "agent-u" / "eval_results.json").read_text()) + assert env["model"] == "model-a" and env["prompt_version"] == "v1" + + +def test_eval_sink_backfills_missing_model(tmp_path, monkeypatch): + import tests.eval.results as R + monkeypatch.setattr(R, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(R, "RUN_STAMP", "STAMP") + sink = EvalSink() + sink.record(scenario="agent", unit="u", 
metric_kind="generic", fixture="fx", + case=CaseResult("a", True, 1, 1)) # no model yet + sink.record(scenario="agent", unit="u", metric_kind="generic", fixture="fx", + case=CaseResult("b", True, 1, 1), model="model-a") # backfilled + sink.flush() + env = json.loads((tmp_path / "agent-u" / "eval_results.json").read_text()) + assert env["model"] == "model-a" def test_case_artifact_dir_colocated_with_eval_sink(tmp_path, monkeypatch): import tests.eval.results as R monkeypatch.setattr(R, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(R, "RUN_STAMP", "STAMP") from tests.eval.results import case_artifact_dir d = case_artifact_dir("trace_agent", "vulnyapi", "c1") - assert d == tmp_path / "trace_agent" / "cases" / "vulnyapi__c1" / "artifacts" + assert d == tmp_path / "STAMP" / "agent-trace_agent-eval-vulnyapi" / "cases" / "c1" / "artifacts" def test_write_eval_results_roundtrip(tmp_path): diff --git a/tests/units/contractor_tests/test_rebuild_eval_envelope.py b/tests/units/contractor_tests/test_rebuild_eval_envelope.py new file mode 100644 index 0000000..7554cd1 --- /dev/null +++ b/tests/units/contractor_tests/test_rebuild_eval_envelope.py @@ -0,0 +1,129 @@ +"""Unit tests for ``scripts/rebuild_eval_envelope.py`` — re-aggregating a +unit's envelope from persisted per-case ``metrics.json`` files, across BOTH +on-disk layouts: + +* legacy flat: ``eval_runs//cases//metrics.json`` +* dated archive: ``eval_runs//--eval-/cases//metrics.json`` +""" +from __future__ import annotations + +import importlib.util +import json +import os +from pathlib import Path + +import pytest + +import tests.eval.results as results + +_SPEC = importlib.util.spec_from_file_location( + "ree", + Path(__file__).resolve().parents[3] / "scripts" / "rebuild_eval_envelope.py", +) +ree = importlib.util.module_from_spec(_SPEC) +_SPEC.loader.exec_module(ree) + + +@pytest.fixture +def eval_root(tmp_path, monkeypatch): + """Redirect EVAL_ROOT in BOTH modules: the script's copy (discovery) and + 
``tests.eval.results`` (``write_eval_results`` resolves bare names there).""" + monkeypatch.setattr(ree, "EVAL_ROOT", tmp_path) + monkeypatch.setattr(results, "EVAL_ROOT", tmp_path) + return tmp_path + + +def _write_case(case_dir: Path, fixture: str, case_id: str, *, passed: bool, + attempts: int = 1) -> Path: + case_dir.mkdir(parents=True, exist_ok=True) + path = case_dir / "metrics.json" + path.write_text(json.dumps({ + "fixture": fixture, "id": case_id, "passed": passed, + "pass_count": int(passed), "attempts": attempts, + "metrics": {"total_tokens": 10}, "detail": {"f1": 1.0 if passed else 0.0}, + })) + return path + + +def _flat_case(root: Path, unit: str, fixture: str, case_id: str, **kw) -> Path: + return _write_case(root / unit / "cases" / case_id, fixture, case_id, **kw) + + +def _archive_case(root: Path, stamp: str, scenario: str, unit: str, + fixture: str, case_id: str, **kw) -> Path: + run_dir = root / stamp / f"{scenario}-{unit}-eval-{fixture}" + return _write_case(run_dir / "cases" / case_id, fixture, case_id, **kw) + + +def test_rebuild_legacy_flat_layout(eval_root): + _flat_case(eval_root, "oas_build", "vulnyapi", "c1", passed=True) + _flat_case(eval_root, "oas_build", "petstore", "c2", passed=False) + + path = ree.rebuild_unit("oas_build") + assert path == eval_root / "oas_build" / "eval_results.json" + env = json.loads(path.read_text()) + assert {f["slug"] for f in env["fixtures"]} == {"petstore", "vulnyapi"} + assert env["headline"]["total"] == 2 and env["headline"]["passed"] == 1 + + +def test_rebuild_dated_archive_layout(eval_root): + _archive_case(eval_root, "0607-converge", "agent", "trace_agent", + "vulnyapi", "login", passed=True, attempts=3) + _archive_case(eval_root, "0607-converge", "agent", "trace_agent", + "crapi-workshop", "bola", passed=False, attempts=3) + + path = ree.rebuild_unit("trace_agent") + assert path == eval_root / "trace_agent" / "eval_results.json" + env = json.loads(path.read_text()) + assert env["scenario"] == "agent" 
# recovered from the dir name + assert env["pass_at"] == 3 + assert {f["slug"] for f in env["fixtures"]} == {"crapi-workshop", "vulnyapi"} + + +def test_rebuild_merges_both_layouts_latest_wins(eval_root): + old = _flat_case(eval_root, "trace_agent", "vulnyapi", "login", passed=False) + new = _archive_case(eval_root, "0608-rerun", "agent", "trace_agent", + "vulnyapi", "login", passed=True) + _flat_case(eval_root, "trace_agent", "spring", "send-message", passed=True) + # make the archive copy strictly newer than the flat one + os.utime(old, (1_000_000, 1_000_000)) + os.utime(new, (2_000_000, 2_000_000)) + + env = json.loads(ree.rebuild_unit("trace_agent").read_text()) + by_slug = {f["slug"]: f for f in env["fixtures"]} + assert set(by_slug) == {"spring", "vulnyapi"} + # the duplicated (fixture, case) was deduped; the newer archive copy won + assert len(by_slug["vulnyapi"]["cases"]) == 1 + assert by_slug["vulnyapi"]["cases"][0]["passed"] is True + + +def test_rebuild_reads_run_meta_from_archive_envelope(eval_root): + _archive_case(eval_root, "0607-A", "task", "oas_build", "vulnyapi", "c1", passed=True) + run_dir = eval_root / "0607-A" / "task-oas_build-eval-vulnyapi" + (run_dir / "eval_results.json").write_text(json.dumps({ + "schema": "eval/v1", "scenario": "task", "unit": "oas_build", + "metric_kind": "detection", "model": "lm-studio-qwen3.6", + "prompt_version": "v7", "pass_at": 1, "fixtures": [], + })) + + env = json.loads(ree.rebuild_unit("oas_build").read_text()) + assert env["metric_kind"] == "detection" + assert env["model"] == "lm-studio-qwen3.6" + assert env["prompt_version"] == "v7" + + +def test_rebuild_unit_no_cases_returns_none(eval_root, capsys): + assert ree.rebuild_unit("nonexistent") is None + assert "skipped" in capsys.readouterr().out + + +def test_discover_units_finds_both_layouts(eval_root): + _flat_case(eval_root, "oas_build", "vulnyapi", "c1", passed=True) + _archive_case(eval_root, "0607-converge", "agent", "trace_agent", + "vulnyapi", 
"login", passed=True) + _archive_case(eval_root, "0608-x", "pipeline", "vuln_scan", "vampi", "c9", passed=False) + # noise: a log file and an archive dir without cases/ + (eval_root / "run.log").write_text("noise") + (eval_root / "0609-y" / "agent-empty_unit-eval-fx").mkdir(parents=True) + + assert ree.discover_units() == ["oas_build", "trace_agent", "vuln_scan"] diff --git a/tests/units/contractor_tests/test_task_version_override.py b/tests/units/contractor_tests/test_task_version_override.py new file mode 100644 index 0000000..97e7146 --- /dev/null +++ b/tests/units/contractor_tests/test_task_version_override.py @@ -0,0 +1,41 @@ +"""Task-version env override (CONTRACTOR_TASK_VERSION_) for A/B eval-gating +a task body without flipping the manifest's `active:`.""" +from __future__ import annotations + +from contractor.runners.models import TaskTemplate + + +def test_active_version_default(): + t = TaskTemplate.load("trace_annotation") + assert t.version == "v3" # manifest active (promoted after pipeline eval gate) + + +def test_explicit_version_arg_wins(): + t = TaskTemplate.load("trace_annotation", "v3") + assert t.version == "v3" + + +def test_env_override_selects_version(monkeypatch): + monkeypatch.setenv("CONTRACTOR_TASK_VERSION_TRACE_ANNOTATION", "v3") + t = TaskTemplate.load("trace_annotation") + assert t.version == "v3" + + +def test_explicit_arg_beats_env(monkeypatch): + monkeypatch.setenv("CONTRACTOR_TASK_VERSION_TRACE_ANNOTATION", "v3") + t = TaskTemplate.load("trace_annotation", "v1") + assert t.version == "v1" + + +def test_v3_renders_without_brace_keyerror(): + """v3 must format cleanly with the workflow's scope vars — no stray + identifier-shaped braces (the ADK/str.format pitfall).""" + t = TaskTemplate.load("trace_annotation", "v3") + scope = {"operation_id": "getAccount", "operation_schema": "openapi: 3.0.0"} + rendered = (t.objective + t.instructions + t.output_format).format(**scope) + assert "getAccount" in rendered + # delegation + no 
tool/mechanics leakage into the planner surface + assert "trace` skill" in t.instructions + assert "annotate_trace" not in t.instructions and "insert_line" not in t.instructions + # output aligned to the agent §OUTPUT headers + assert "## Annotations Inserted" in t.output_format and "## Findings" in t.output_format diff --git a/tests/units/contractor_tests/tools/test_fs.py b/tests/units/contractor_tests/tools/test_fs.py index ef9d617..42cd553 100644 --- a/tests/units/contractor_tests/tools/test_fs.py +++ b/tests/units/contractor_tests/tools/test_fs.py @@ -750,6 +750,33 @@ def test_symlink_escape_denied(fs_root, fs_root_fixture, tmp_path): fs_root.open("/link") +def test_walk_does_not_list_symlinked_file_names(fs_root, fs_root_fixture, tmp_path): + outside = tmp_path / "outside.txt" + outside.write_text("secret") + + # One escaping symlink at the root, one nested; reads of both are blocked, + # so walk() must not disclose the names either (same policy as ls/glob). + os.symlink(outside, fs_root_fixture / "link") + os.symlink(outside, fs_root_fixture / "dir" / "inner_link") + + names = [f for _, _, fnames in fs_root.walk("/") for f in fnames] + + assert "link" not in names + assert "inner_link" not in names + assert "file.txt" in names + assert "inner.txt" in names + + +def test_symlink_inside_sandbox_resolves_to_target(fs_root, fs_root_fixture): + # _strip_protocol returns the validated *resolved* path (TOCTOU fix); + # in-sandbox symlinks must keep working through that resolution. 
+ os.symlink(fs_root_fixture / "file.txt", fs_root_fixture / "alias.txt") + + assert fs_root.exists("/alias.txt") + with fs_root.open("/alias.txt") as f: + assert f.read().decode("utf-8") == "hello" + + def test_absolute_host_path_denied(fs_root): assert not fs_root.exists("/etc/passwd") @@ -836,6 +863,82 @@ def test_glob_respects_ignored_patterns_with_ro_file_tools(fs_root): assert all(not p.endswith(".txt") for p in paths) +@pytest.fixture() +def big_root(tmp_path: Path) -> RootedLocalFileSystem: + root = tmp_path / "big" + root.mkdir() + for i in range(10): + (root / f"f{i}.py").write_text(f"needle {i}\n", encoding="utf-8") + return RootedLocalFileSystem(str(root)) + + +def _walk_capped_tools( + fs: RootedLocalFileSystem, max_files_per_walk: int +) -> FsspecInteractionFileTools: + fmt = FileFormat(_format="json", loc="lines", with_types=False) + return FsspecInteractionFileTools( + fs=fs, + fmt=fmt, + max_files_per_walk=max_files_per_walk, + with_types=False, + ) + + +def test_glob_walk_ceiling_truncates_with_notice(big_root): + tools = _walk_capped_tools(big_root, max_files_per_walk=3) + + res = tools.glob("**/*.py") + + assert "error" not in res + assert res["walk_truncated"] is True + assert "truncated after scanning 3 files" in res["notice"] + assert len(res["result"]) <= 3 + + +def test_grep_walk_ceiling_truncates_with_notice(big_root): + tools = _walk_capped_tools(big_root, max_files_per_walk=3) + + res = tools.grep("needle") + + assert "error" not in res + assert res["walk_truncated"] is True + assert "truncated after scanning 3 files" in res["notice"] + assert res["total_items"] <= 3 + + +def test_walk_ceiling_not_reported_when_not_hit(big_root): + tools = _walk_capped_tools(big_root, max_files_per_walk=100) + + glob_res = tools.glob("**/*.py") + grep_res = tools.grep("needle") + + assert len(glob_res["result"]) == 10 + assert grep_res["total_items"] == 10 + assert "walk_truncated" not in glob_res + assert "walk_truncated" not in grep_res + assert 
"notice" not in glob_res + assert "notice" not in grep_res + + +def test_walk_ceiling_defaults_from_settings(big_root, monkeypatch): + import contractor.tools.fs.read_tools as read_tools_module + from contractor.utils.settings import Settings + + monkeypatch.setattr( + read_tools_module, + "get_settings", + lambda: Settings(fs_max_files_per_walk=2), + ) + + fmt = FileFormat(_format="json", loc="lines", with_types=False) + tools = FsspecInteractionFileTools(fs=big_root, fmt=fmt, with_types=False) + + res = tools.grep("needle") + + assert res["walk_truncated"] is True + assert "truncated after scanning 2 files" in res["notice"] + + @pytest.fixture() def cyrillic_fs(tmp_path: Path): fname = "Новая заметка 2 - о работе.md" diff --git a/tests/units/contractor_tests/tools/test_http_lifecycle.py b/tests/units/contractor_tests/tools/test_http_lifecycle.py new file mode 100644 index 0000000..ace3b92 --- /dev/null +++ b/tests/units/contractor_tests/tools/test_http_lifecycle.py @@ -0,0 +1,105 @@ +"""Unit tests for HTTPClient connection lifecycle. + +Regression for a connection-pool leak: ``http_tools`` builds an ``HTTPClient`` +and returns only tool closures — there is no teardown seam reachable from the +agent factories, so the persistent ``httpx.AsyncClient`` opened in ``__init__`` +was never closed. The client is now created per request and closed via +``async with``; these tests pin that contract (no persistent client, each +per-request client is closed, cookies persist across requests). 
+""" +from __future__ import annotations + +import warnings + +import httpx +import pytest + +from contractor.tools.http import HTTPClient, http_tools + +_EXPECTED_TOOLS = { + "http_request", + "http_read_body", + "http_history", + "http_session_set", + "http_session_get", + "http_session_clear", +} + + +def test_http_tools_public_surface_unchanged(): + tools = http_tools(name="t") + assert {t.__name__ for t in tools} == _EXPECTED_TOOLS + + +def test_no_persistent_async_client_leaks_on_build(): + # Building tools / a client must not open a long-lived httpx client. + with warnings.catch_warnings(): + warnings.simplefilter("error", ResourceWarning) + cli = HTTPClient(name="t") + assert not hasattr(cli, "_client") + + +def test_cookie_jar_lives_on_the_client(): + cli = HTTPClient(name="t") + cli.set_cookies({"a": "b"}) + assert cli.get_cookies() == {"a": "b"} + cli.clear_session_state() + assert cli.get_cookies() == {} + + +def _mock_client_factory(created: list[httpx.AsyncClient]): + def fake_new_client(self: HTTPClient, timeout: float | None = None) -> httpx.AsyncClient: + client = httpx.AsyncClient( + transport=httpx.MockTransport( + lambda request: httpx.Response( + 200, headers={"set-cookie": "sid=42; Path=/"}, text="ok" + ) + ), + cookies=self._cookies, + ) + created.append(client) + return client + + return fake_new_client + + +@pytest.mark.asyncio +async def test_request_closes_its_client(monkeypatch): + created: list[httpx.AsyncClient] = [] + monkeypatch.setattr(HTTPClient, "_new_client", _mock_client_factory(created)) + + cli = HTTPClient(name="t") + with warnings.catch_warnings(): + warnings.simplefilter("error", ResourceWarning) + record = await cli.request(url="http://example.test/", method="GET") + + assert record["status"] == 200 + assert created, "expected a per-request client to be created" + assert all(c.is_closed for c in created), "per-request clients must be closed" + + +@pytest.mark.asyncio +async def 
test_cookies_persist_across_requests(monkeypatch): + created: list[httpx.AsyncClient] = [] + monkeypatch.setattr(HTTPClient, "_new_client", _mock_client_factory(created)) + + cli = HTTPClient(name="t") + await cli.request(url="http://example.test/", method="GET") + # The Set-Cookie from the first response is retained via the shared jar, + # even though that request's client has since been closed. + assert cli.get_cookies().get("sid") == "42" + assert len(created) == 1 + + await cli.request(url="http://example.test/again", method="GET") + assert len(created) == 2 + assert all(c.is_closed for c in created) + + +@pytest.mark.asyncio +async def test_aclose_is_a_safe_noop(): + # Kept for backward compatibility with ``async with HTTPClient(...)`` and + # explicit aclose() call sites; must not raise even without a live client. + cli = HTTPClient(name="t") + await cli.aclose() + async with HTTPClient(name="t") as ctx_cli: + assert ctx_cli is not None diff --git a/tests/units/contractor_tests/tools/test_likec4.py b/tests/units/contractor_tests/tools/test_likec4.py index c275614..f45621f 100644 --- a/tests/units/contractor_tests/tools/test_likec4.py +++ b/tests/units/contractor_tests/tools/test_likec4.py @@ -18,7 +18,7 @@ @pytest.fixture def mock_likec4_in_path(monkeypatch: pytest.MonkeyPatch) -> None: - """Pretend `likec4` is the resolved binary so the linter can be constructed.""" + """Pretend `likec4` is the resolved binary so validate calls can run.""" monkeypatch.setattr( "contractor.tools.likec4.shutil.which", lambda name: "/usr/local/bin/likec4" if name == "likec4" else None, @@ -93,6 +93,57 @@ def test_resolve_command_raises_when_nothing_available( Likec4Linter._resolve_command() +# --------------------------------------------------------------------------- +# lazy command resolution +# --------------------------------------------------------------------------- + +def test_linter_construction_does_not_resolve_command( + monkeypatch: pytest.MonkeyPatch, +) -> None: + 
"""Construction must never touch PATH — resolution is lazy (first use).""" + def _boom(_name: str) -> str: + raise AssertionError("shutil.which must not be called at construction") + + monkeypatch.setattr("contractor.tools.likec4.shutil.which", _boom) + Likec4Linter() # must not raise + + +def test_linter_validate_raises_not_found_when_no_binary( + monkeypatch: pytest.MonkeyPatch, +) -> None: + monkeypatch.setattr( + "contractor.tools.likec4.shutil.which", + lambda _name: None, + ) + linter = Likec4Linter() + with pytest.raises(Likec4NotFoundError, match="not found in PATH"): + linter.validate("specification { }") + + +def test_linter_caches_resolved_command_after_first_success( + monkeypatch: pytest.MonkeyPatch, +) -> None: + which_calls: list[str] = [] + + def _which(name: str) -> str | None: + which_calls.append(name) + return "/usr/local/bin/likec4" if name == "likec4" else None + + monkeypatch.setattr("contractor.tools.likec4.shutil.which", _which) + monkeypatch.setattr( + "contractor.tools.likec4.subprocess.run", + lambda *_a, **_kw: _proc( + json.dumps({"valid": True, "errors": [], "stats": {}}) + ), + ) + + linter = Likec4Linter() + linter.validate("specification { }") + linter.validate("specification { }") + + assert which_calls == ["likec4"] + + # --------------------------------------------------------------------------- # validate (content-based core) # --------------------------------------------------------------------------- @@ -265,6 +316,23 @@ def test_likec4_tools_default_path_constant() -> None: assert DEFAULT_LIKEC4_PATH == "/architecture.c4" +def test_likec4_tools_builds_without_binary_and_lint_returns_error( + fs: MemoryFileSystem, monkeypatch: pytest.MonkeyPatch +) -> None: + """A missing binary must not raise at assembly — only as a tool result.""" + monkeypatch.setattr( + "contractor.tools.likec4.shutil.which", + lambda _name: None, + ) + + tools = likec4_tools(fs=fs) # must not raise + assert tools[0].__name__ == "validate_likec4" + + 
fs.pipe_file("/architecture.c4", b"specification { }") + result = asyncio.run(tools[0]()) + assert "likec4 not found in PATH" in result["error"] + + def test_validate_likec4_tool_returns_error_for_missing_file( mock_likec4_in_path: None, fs: MemoryFileSystem ) -> None: diff --git a/tests/units/contractor_tests/tools/test_observations.py b/tests/units/contractor_tests/tools/test_observations.py index 2cffc9a..8615863 100644 --- a/tests/units/contractor_tests/tools/test_observations.py +++ b/tests/units/contractor_tests/tools/test_observations.py @@ -426,6 +426,9 @@ def _mk_worker(): worker.tools = [] worker.model = "gpt-3.5-turbo" worker.instruction = "" + # Uninstrumented workers must declare input_schema themselves (the + # RouterWorkflow contract); task_tools validates this at assembly time. + worker.input_schema = Subtask return worker diff --git a/tests/units/contractor_tests/tools/test_overlay_merge.py b/tests/units/contractor_tests/tools/test_overlay_merge.py index 6cc9fb3..01c2365 100644 --- a/tests/units/contractor_tests/tools/test_overlay_merge.py +++ b/tests/units/contractor_tests/tools/test_overlay_merge.py @@ -168,3 +168,60 @@ def test_many_forks(self, base_fs, overlay): assert conflicts == [] for i in range(5): assert overlay.read_text(f"/src/op{i}.py") == f"# operation {i}\n" + + +# ── delete propagation ─────────────────────────────────────────────── + + +class TestMergeDeletePropagation: + def test_pre_fork_delete_does_not_repropagate(self, base_fs, overlay): + # Delete a base file *before* forking… + overlay.rm("/src/a.py") + pre = dict(overlay._files) + patch = overlay.save() + fork = fork_overlay(base_fs, patch) + + # …then restore it in the target while the fork runs. + overlay.restore("/src/a.py") + assert overlay.exists("/src/a.py") + + conflicts = merge_overlay_forks(overlay, [fork], pre) + + assert conflicts == [] + # The fork's stale (pre-fork) tombstone must not re-delete the file. 
+ assert overlay.exists("/src/a.py") + + def test_delete_made_inside_fork_propagates(self, base_fs, overlay): + overlay.rm("/src/a.py") + pre = dict(overlay._files) + patch = overlay.save() + fork = fork_overlay(base_fs, patch) + + fork.rm("/src/b.py") + + conflicts = merge_overlay_forks(overlay, [fork], pre) + + assert conflicts == [] + assert not overlay.exists("/src/a.py") # pre-fork delete still in target + assert not overlay.exists("/src/b.py") # fork delete propagated + + def test_explicit_pre_fork_deleted_overrides_fork_state(self, base_fs, overlay): + overlay.rm("/src/a.py") + pre = dict(overlay._files) + pre_deleted = set(overlay._deleted) + patch = overlay.save() + + # A fork built without fork_overlay records no parent state, so the + # caller-provided baseline is what separates old from new deletes. + fork = MemoryOverlayFileSystem(base_fs, skip_instance_cache=True) + fork.load(patch) + fork.rm("/src/b.py") + + overlay.restore("/src/a.py") + conflicts = merge_overlay_forks( + overlay, [fork], pre, pre_fork_deleted=pre_deleted + ) + + assert conflicts == [] + assert overlay.exists("/src/a.py") + assert not overlay.exists("/src/b.py") diff --git a/tests/units/contractor_tests/tools/test_overlayfs.py b/tests/units/contractor_tests/tools/test_overlayfs.py index b61bdea..2fa219a 100644 --- a/tests/units/contractor_tests/tools/test_overlayfs.py +++ b/tests/units/contractor_tests/tools/test_overlayfs.py @@ -467,3 +467,135 @@ def test_filetype_cache_is_invalidated_on_overlay_write( # If the cache survived, second would still report the original label; # the invalidation hook must drop the entry on write. 
assert second.filetype.label != initial_label + + +# --------------------------------------------------------------------------- +# glob: path-aware matching over the merged view (regression for the +# PurePosixPath.match-based implementation, which had no recursive `**` +# and only ls-ed one level for non-`**` patterns) +# --------------------------------------------------------------------------- + + +@pytest.fixture() +def glob_tree(tmp_path: Path) -> Path: + (tmp_path / "src" / "deep" / "deeper").mkdir(parents=True) + (tmp_path / "a" / "x").mkdir(parents=True) + (tmp_path / "a" / "y").mkdir(parents=True) + for rel in ( + "top.py", + "src/main.py", + "src/deep/mid.py", + "src/deep/deeper/mod.py", + "a/x/b.py", + "a/y/b.py", + "a/x/note.txt", + ): + (tmp_path / rel).write_text("# stub\n", encoding="utf-8") + return tmp_path + + +@pytest.fixture() +def glob_base_fs(glob_tree: Path) -> RootedLocalFileSystem: + return RootedLocalFileSystem(str(glob_tree)) + + +@pytest.fixture() +def glob_overlay_fs(glob_base_fs: RootedLocalFileSystem) -> MemoryOverlayFileSystem: + return MemoryOverlayFileSystem(glob_base_fs) + + +def test_glob_recursive_matches_root_level_files( + glob_overlay_fs: MemoryOverlayFileSystem, +): + # `**` must match zero directories too, so /top.py is included. + matched = glob_overlay_fs.glob("/**/*.py") + + assert "/top.py" in matched + assert "/src/main.py" in matched + assert "/src/deep/deeper/mod.py" in matched + assert "/a/x/note.txt" not in matched + + +def test_glob_recursive_under_subdir(glob_overlay_fs: MemoryOverlayFileSystem): + matched = glob_overlay_fs.glob("/src/**/*.py") + + assert matched == ["/src/deep/deeper/mod.py", "/src/deep/mid.py", "/src/main.py"] + + +def test_glob_multi_level_non_recursive_pattern( + glob_overlay_fs: MemoryOverlayFileSystem, +): + # `*` stays within one segment but the static prefix is multi-level. 
+ assert glob_overlay_fs.glob("a/*/b.py") == ["/a/x/b.py", "/a/y/b.py"] + + +def test_glob_matches_overlay_added_file(glob_overlay_fs: MemoryOverlayFileSystem): + glob_overlay_fs.write_text( + "/src/deep/added.py", "print('added')\n", encoding="utf-8" + ) + + assert "/src/deep/added.py" in glob_overlay_fs.glob("/src/**/*.py") + + +def test_glob_excludes_tombstoned_file(glob_overlay_fs: MemoryOverlayFileSystem): + glob_overlay_fs.rm("/src/deep/mid.py") + + matched = glob_overlay_fs.glob("/src/**/*.py") + + assert "/src/deep/mid.py" not in matched + assert "/src/deep/deeper/mod.py" in matched + + +def test_glob_parity_with_rooted_local_fs_when_overlay_unchanged( + glob_base_fs: RootedLocalFileSystem, + glob_overlay_fs: MemoryOverlayFileSystem, +): + patterns = [ + "/**/*.py", + "/src/**/*.py", + "a/*/b.py", + "*.py", + "/src/*.py", + "**/note.txt", + "/missing/**/*.py", + ] + + for pattern in patterns: + assert glob_overlay_fs.glob(pattern) == glob_base_fs.glob(pattern), pattern + + +def test_glob_scanned_truncates_when_ceiling_hit( + glob_overlay_fs: MemoryOverlayFileSystem, +): + # The fixture tree has 7 files; a ceiling of 2 must stop the walk early. 
+ matched, truncated = glob_overlay_fs.glob_scanned("/**/*", max_files=2) + + assert truncated is True + assert len(matched) <= 2 + + +def test_glob_scanned_no_truncation_under_ceiling( + glob_overlay_fs: MemoryOverlayFileSystem, +): + matched, truncated = glob_overlay_fs.glob_scanned("/src/**/*.py", max_files=100) + + assert truncated is False + assert matched == ["/src/deep/deeper/mod.py", "/src/deep/mid.py", "/src/main.py"] + + +def test_glob_scanned_default_ceiling_comes_from_settings( + glob_overlay_fs: MemoryOverlayFileSystem, + monkeypatch, +): + import contractor.tools.fs.overlayfs as overlayfs_module + from contractor.utils.settings import Settings + + monkeypatch.setattr( + overlayfs_module, + "get_settings", + lambda: Settings(fs_max_files_per_walk=1), + ) + + matched, truncated = glob_overlay_fs.glob_scanned("/**/*") + assert truncated is True + assert len(matched) <= 1 diff --git a/tests/units/contractor_tests/tools/test_tasks.py b/tests/units/contractor_tests/tools/test_tasks.py index 811524f..526f1ba 100644 --- a/tests/units/contractor_tests/tools/test_tasks.py +++ b/tests/units/contractor_tests/tools/test_tasks.py @@ -91,6 +91,9 @@ def _mk_worker(): worker.tools = [] worker.model = "gpt-3.5-turbo" worker.instruction = "" + # Uninstrumented workers must declare input_schema themselves (the + # RouterWorkflow contract); task_tools validates this at assembly time. 
+ worker.input_schema = Subtask return worker @@ -334,6 +337,20 @@ def test_subtask_decomposition_model_requires_at_least_one(): SubtaskDecomposition.model_validate({"subtasks": []}) +def test_subtask_decomposition_model_rejects_more_than_three(): + specs = [{"title": f"t{i}", "description": f"d{i}"} for i in range(4)] + with pytest.raises(ValidationError): + SubtaskDecomposition.model_validate({"subtasks": specs}) + + +def test_task_limit_msg_instructs_resolving_subtasks_before_finish(): + msg = m.TASK_LIMIT_REACHED_MSG.format(max_tasks=5) + # finish(status="done") refuses while 'new' subtasks remain, so the + # message must not push the planner into an immediate finish call. + assert "immediately" not in msg + assert "Execute or skip" in msg + + # --------------------------- # Behavior tests # --------------------------- @@ -683,6 +700,87 @@ async def _incomplete(*, args, tool_context): assert records[-1]["status"] == "skipped" +@pytest.mark.anyio +async def test_skip_resolved_current_subtask_returns_error(monkeypatch): + worker = _mk_worker() + + async def _done(*, args, tool_context): + return _result_json( + task_id=args["task_id"], + status="done", + output="ok", + summary="ok", + ) + + worker.run_async.side_effect = _done + + tool = _mk_tools(monkeypatch, worker=worker, use_skip=True) + ctx = mk_tool_context() + + tool["add_subtask"](title="t0", description="d0", tool_context=ctx) + await tool["execute_current_subtask"](tool_context=ctx) + + # The last subtask is 'done' and still current — the rejected skip must + # surface as an error naming the cause, not as the no-active-subtasks + # success message. + res = tool["skip"](task_id="0", reason="redundant", tool_context=ctx) + assert res["error"] == m.SUBTASK_SKIP_NOT_SKIPPABLE.format( + task_id="0", status="done" + ) + + # State untouched: subtask stays 'done', no skip record was appended. 
+ all_tasks = tool["list_subtasks"](tool_context=ctx, view="all")["result"] + assert all_tasks[0]["status"] == "done" + records = tool["get_records"](tool_context=ctx)["result"] + assert all(r["status"] != "skipped" for r in records) + + +# --------------------------- +# task_tools assembly validation +# --------------------------- + + +def test_task_tools_uninstrumented_requires_worker_input_schema(): + worker = _mk_worker() + worker.input_schema = None + + with pytest.raises(ValueError, match="input_schema"): + m.task_tools( + name="tm", + max_tasks=10, + worker=worker, + fmt=m.SubtaskFormatter(_format="markdown"), + worker_instrumentation=False, + use_input_schema=True, + use_summarization=False, + ) + + +def test_task_tools_uninstrumented_accepts_worker_with_input_schema(monkeypatch): + worker = _mk_worker() # sets input_schema = Subtask + tool = _mk_tools(monkeypatch, worker=worker) + assert "execute_current_subtask" in tool + + +def test_task_tools_uninstrumented_allows_missing_schema_when_disabled(monkeypatch): + from contractor.tools.tasks import tools as _tools_mod + + monkeypatch.setattr(_tools_mod, "AgentTool", MockAgentTool) + + worker = _mk_worker() + worker.input_schema = None + tools = m.task_tools( + name="tm", + max_tasks=10, + worker=worker, + fmt=m.SubtaskFormatter(_format="markdown"), + worker_instrumentation=False, + use_input_schema=False, + use_summarization=False, + ) + assert any(fn.__name__ == "execute_current_subtask" for fn in tools) + + @pytest.mark.anyio async def test_records_accumulate_for_multiple_executes(monkeypatch): worker = _mk_worker() @@ -819,6 +917,80 @@ async def _incomplete(*, args, tool_context): ) +@pytest.mark.anyio +async def test_decompose_over_capacity_suggests_fewer_children(monkeypatch): + worker = _mk_worker() + + async def _incomplete(*, args, tool_context): + return _result_json( + task_id=args["task_id"], + status="incomplete", + output="blocked", + summary="need more steps", + ) + + 
worker.run_async.side_effect = _incomplete + + tool = _mk_tools(monkeypatch, worker=worker, max_tasks=3, use_skip=False) + ctx = mk_tool_context() + + tool["add_subtask"](title="t0", description="d0", tool_context=ctx) + tool["add_subtask"](title="t1", description="d1", tool_context=ctx) + await tool["execute_current_subtask"](tool_context=ctx) + + # 2 subtasks exist, limit is 3 → only 1 more fits, but 3 are requested. + res = tool["decompose_subtask"]( + task_id="0", + decomposition={ + "subtasks": [ + {"title": f"s{i}", "description": f"sd{i}"} for i in range(3) + ] + }, + tool_context=ctx, + ) + assert res["error"] == m.SUBTASK_DECOMPOSE_OVER_CAPACITY.format( + requested=3, max_tasks=3, remaining=1 + ) + + # A single child still fits. + res_ok = tool["decompose_subtask"]( + task_id="0", + decomposition={"subtasks": [{"title": "s", "description": "sd"}]}, + tool_context=ctx, + ) + assert "error" not in res_ok + + +@pytest.mark.anyio +async def test_decompose_at_full_limit_returns_limit_reached(monkeypatch): + worker = _mk_worker() + + async def _incomplete(*, args, tool_context): + return _result_json( + task_id=args["task_id"], + status="incomplete", + output="blocked", + summary="need more steps", + ) + + worker.run_async.side_effect = _incomplete + + tool = _mk_tools(monkeypatch, worker=worker, max_tasks=2, use_skip=False) + ctx = mk_tool_context() + + tool["add_subtask"](title="t0", description="d0", tool_context=ctx) + tool["add_subtask"](title="t1", description="d1", tool_context=ctx) + await tool["execute_current_subtask"](tool_context=ctx) + + # Limit fully spent → no number of children would fit. 
+ res = tool["decompose_subtask"]( + task_id="0", + decomposition={"subtasks": [{"title": "s", "description": "sd"}]}, + tool_context=ctx, + ) + assert res["error"] == m.TASK_LIMIT_REACHED_MSG.format(max_tasks=2) + + @pytest.mark.anyio async def test_decompose_subtask_rejects_string_input(monkeypatch): worker = _mk_worker() @@ -1041,3 +1213,103 @@ async def _done(*, args, tool_context): assert ctx._invocation_context.end_invocation is True assert ctx.state["task::0::status"] == "done" assert ctx.state["task::0::result"] == "completed successfully" + + +# --------------------------- +# Summarizer construction + record truncation +# --------------------------- + + +@pytest.mark.anyio +async def test_finish_summarizer_has_no_tools_and_caps_records(monkeypatch): + from contractor.tools.tasks import tools as _tools_mod + + captured: dict = {} + + class StubLlmAgent: + def __init__(self, **kwargs): + captured["agent_kwargs"] = kwargs + + async def run_async(self, *, args, tool_context): + captured["request"] = args["request"] + return "summary-text" + + worker = _mk_worker() + worker.tools = [lambda: None] # the summarizer must NOT inherit these + + monkeypatch.setattr(_tools_mod, "LlmAgent", StubLlmAgent) + monkeypatch.setattr(_tools_mod, "_get_agent_ref", lambda w: worker) + + async def _done(*, args, tool_context): + return _result_json( + task_id=args["task_id"], + status="done", + output="ok", + summary="ok", + ) + + worker.run_async.side_effect = _done + + tool = _mk_tools( + monkeypatch, worker=worker, use_skip=False, use_summarization=True + ) + ctx = _attach_invocation_context(mk_tool_context()) + + tool["add_subtask"](title="t0", description="d0", tool_context=ctx) + await tool["execute_current_subtask"](tool_context=ctx) + + # Seed the records pool well past max_records (default 20), including one + # oversized record that must be truncated for the summarizer payload. 
+ pool_key = m.StreamlineManager._task_keys(ctx).pool + giant = "z" * (_tools_mod._MAX_RECORD_FIELD_LEN + 5_000) + ctx.state[pool_key] = ctx.state[pool_key] + [f"rec-{i}" for i in range(30)] + [ + {"task_id": "x", "status": "done", "output": giant, "summary": "big"} + ] + + res = await tool["finish"]( + status="done", + result="completed successfully", + tool_context=ctx, + ) + assert res["result"] == "ok" + + # The summarizer agent was built with an empty toolset. + assert captured["agent_kwargs"]["tools"] == [] + + payload = json.loads(captured["request"]) + # Only the most recent max_records (20) records are passed on. + assert len(payload["records"]) == 20 + # The oversized record's output field was truncated with a marker. + big_rec = payload["records"][-1] + assert big_rec["output"].endswith(_tools_mod._TRUNCATION_MARKER) + assert len(big_rec["output"]) <= _tools_mod._MAX_RECORD_FIELD_LEN + len( + _tools_mod._TRUNCATION_MARKER + ) + # The summarizer output landed in the task summary slot. 
+ assert ctx.state["task::0::summary"] == "summary-text" + + +@pytest.mark.anyio +async def test_execute_malformed_raw_output_is_truncated_in_record(monkeypatch): + from contractor.tools.tasks import tools as _tools_mod + + worker = _mk_worker() + giant = "not-parseable " + "z" * (_tools_mod._MAX_RECORD_FIELD_LEN + 5_000) + + async def _bad(*, args, tool_context): + return giant + + worker.run_async.side_effect = _bad + + tool = _mk_tools(monkeypatch, worker=worker, use_skip=False) + ctx = mk_tool_context() + + tool["add_subtask"](title="t0", description="d0", tool_context=ctx) + exec_res = await tool["execute_current_subtask"](tool_context=ctx) + + rec = exec_res["record"] + assert rec["status"] == "malformed" + assert rec["output"].endswith(_tools_mod._TRUNCATION_MARKER) + assert len(rec["output"]) <= _tools_mod._MAX_RECORD_FIELD_LEN + len( + _tools_mod._TRUNCATION_MARKER + ) diff --git a/tests/units/contractor_tests/utils/test_observability.py b/tests/units/contractor_tests/utils/test_observability.py new file mode 100644 index 0000000..11508a0 --- /dev/null +++ b/tests/units/contractor_tests/utils/test_observability.py @@ -0,0 +1,132 @@ +"""run_context must honour the module's never-raises contract: a broken +Langfuse client degrades to a no-op span (with a warning), never a crash.""" + +import sys +import types + +import pytest + +import contractor.utils.observability as obs + + +class _WorkingSpanCM: + """Minimal stand-in for langfuse's start_as_current_span() CM.""" + + def __init__(self): + self.span = object() + self.entered = False + self.exited_with: type[BaseException] | None | str = "never-exited" + + def __enter__(self): + self.entered = True + return self.span + + def __exit__(self, exc_type, exc, tb): + self.exited_with = exc_type + return False + + +class _BrokenEnterSpanCM(_WorkingSpanCM): + def __enter__(self): + raise RuntimeError("span enter failed") + + +class _BrokenExitSpanCM(_WorkingSpanCM): + def __exit__(self, exc_type, exc, tb): + raise 
RuntimeError("span exit failed") + + +class _FakeClient: + def __init__(self, span_cm=None, span_factory_error=None): + self.span_cm = span_cm + self.span_factory_error = span_factory_error + self.flush_count = 0 + + def start_as_current_span(self, *, name): + if self.span_factory_error is not None: + raise self.span_factory_error + return self.span_cm + + def update_current_trace(self, **kwargs): + pass + + def flush(self): + self.flush_count += 1 + + +def _enable_with_fake_langfuse(monkeypatch, client): + """Force-enable observability and route `from langfuse import get_client` + to a fake module — works whether or not langfuse is installed.""" + mod = types.ModuleType("langfuse") + mod.get_client = lambda: client + monkeypatch.setitem(sys.modules, "langfuse", mod) + monkeypatch.setattr(obs, "_enabled", lambda: True) + + +def test_disabled_yields_none(monkeypatch): + # Langfuse off → pure no-op (forced off so a dev's .env can't flip it). + monkeypatch.setattr(obs, "_enabled", lambda: False) + with obs.run_context(name="run") as span: + assert span is None + + +def test_broken_get_client_degrades_to_noop(monkeypatch): + mod = types.ModuleType("langfuse") + + def _boom(): + raise RuntimeError("no langfuse server") + + mod.get_client = _boom + monkeypatch.setitem(sys.modules, "langfuse", mod) + monkeypatch.setattr(obs, "_enabled", lambda: True) + + with obs.run_context(name="run") as span: + assert span is None + + +def test_broken_span_factory_degrades_to_noop(monkeypatch): + client = _FakeClient(span_factory_error=RuntimeError("otel exploded")) + _enable_with_fake_langfuse(monkeypatch, client) + + body_ran = False + with obs.run_context(name="run") as span: + body_ran = True + assert span is None + + assert body_ran + assert client.flush_count == 1 # flush still runs on the degraded path + + +def test_broken_span_enter_degrades_to_noop(monkeypatch): + client = _FakeClient(span_cm=_BrokenEnterSpanCM()) + _enable_with_fake_langfuse(monkeypatch, client) + + with 
obs.run_context(name="run") as span: + assert span is None + + +def test_broken_span_exit_does_not_raise(monkeypatch): + cm = _BrokenExitSpanCM() + client = _FakeClient(span_cm=cm) + _enable_with_fake_langfuse(monkeypatch, client) + + with obs.run_context(name="run") as span: + assert span is cm.span # the working enter still yields the real span + + assert client.flush_count == 1 + + +def test_working_span_closes_and_body_exception_propagates(monkeypatch): + cm = _WorkingSpanCM() + client = _FakeClient(span_cm=cm) + _enable_with_fake_langfuse(monkeypatch, client) + + with ( + pytest.raises(ValueError, match="from the body"), + obs.run_context(name="run"), + ): + raise ValueError("from the body") + + # The span was closed with the in-flight exception (status recorded). + assert cm.exited_with is ValueError + assert client.flush_count == 1 diff --git a/tests/units/contractor_tests/utils/test_settings.py b/tests/units/contractor_tests/utils/test_settings.py new file mode 100644 index 0000000..a479267 --- /dev/null +++ b/tests/units/contractor_tests/utils/test_settings.py @@ -0,0 +1,53 @@ +"""Settings hygiene tests. + +Covers the ``target_url`` / ``proxy`` fields (routed from the historical +``CONTRACTOR_TARGET_URL`` / ``CONTRACTOR_PROXY`` env vars) and the anchored +``cli/.env`` discovery, which must work from non-CLI entrypoints too. 
+""" + +from __future__ import annotations + +from contractor.utils import settings as settings_module +from contractor.utils.settings import Settings + + +class TestTargetSettings: + def test_default_to_none_without_env(self, monkeypatch): + monkeypatch.delenv("CONTRACTOR_TARGET_URL", raising=False) + monkeypatch.delenv("CONTRACTOR_PROXY", raising=False) + s = Settings(_env_file=None) + assert s.target_url is None + assert s.proxy is None + + def test_contractor_env_vars_route_to_fields(self, monkeypatch): + # The pre-Settings callsites read CONTRACTOR_TARGET_URL / CONTRACTOR_PROXY + # via os.environ — the aliases must keep those exact names working. + monkeypatch.setenv("CONTRACTOR_TARGET_URL", "http://localhost:5002") + monkeypatch.setenv("CONTRACTOR_PROXY", "http://127.0.0.1:8888") + s = Settings(_env_file=None) + assert s.target_url == "http://localhost:5002" + assert s.proxy == "http://127.0.0.1:8888" + + def test_constructible_by_field_name(self, monkeypatch): + # populate_by_name lets tests/programmatic callers bypass the alias. + monkeypatch.delenv("CONTRACTOR_TARGET_URL", raising=False) + monkeypatch.delenv("CONTRACTOR_PROXY", raising=False) + s = Settings(_env_file=None, target_url="http://t", proxy="http://p") + assert s.target_url == "http://t" + assert s.proxy == "http://p" + + +class TestEnvFileAnchor: + def test_cli_env_file_is_anchored_to_repo_cli_dir(self): + # The documented config file is `cli/.env` next to the CLI entrypoint; + # the anchor must resolve there regardless of the process CWD. + env_file = settings_module._CLI_ENV_FILE + assert env_file.name == ".env" + assert env_file.parent.name == "cli" + assert (env_file.parents[1] / "pyproject.toml").is_file() + + def test_settings_env_file_sources_include_anchor(self): + env_files = Settings.model_config["env_file"] + assert settings_module._CLI_ENV_FILE in tuple(env_files) + # CWD-relative .env stays as the (higher-precedence) fallback. 
+ assert ".env" in tuple(env_files) diff --git a/tests/units/contractor_tests/workflows/test_assembly.py b/tests/units/contractor_tests/workflows/test_assembly.py index 64d4de4..2379438 100644 --- a/tests/units/contractor_tests/workflows/test_assembly.py +++ b/tests/units/contractor_tests/workflows/test_assembly.py @@ -20,6 +20,7 @@ from contractor.runners.task_runner import TaskRunner from contractor.workflows import Workflow, WorkflowContext, get_workflows from contractor.workflows.likec4_building import LikeC4BuildingWorkflow +from contractor.workflows.namespaces import TRACE_NAMESPACE_PREFIXES from contractor.workflows.oas_building import OasBuildingWorkflow from contractor.workflows.oas_enrichment import OasEnrichmentWorkflow from contractor.workflows.router import RouterWorkflow @@ -42,11 +43,13 @@ def test_known_keys_present(self): "trace-direct", "trace-graph", "trace-graph-pathpar", + "trace-postdiff", "trace-verify", "vuln-assess", "vuln-scan", "vuln-scan-fast", "vuln-scan-trace", + "vuln-sweep", "router", } @@ -194,6 +197,32 @@ async def fake_load(*, app_name, user_id, filename, **_): assert "dependency_information" not in _template_keys(queue) # Downstream tasks still queued — they will read the existing artifact. assert "oas_update" in _template_keys(queue) + # Refs stay name-based even when the upstream task is skipped, so a + # --resume checkpoint written by a full run still matches (regression: + # positional default refs shifted when a task was conditionally added). 
+ assert _refs(queue) == {"project_information", "oas_update", "oas_validate"} + + @pytest.mark.asyncio + async def test_refs_are_stable_template_names(self, monkeypatch): + workflow = OasBuildingWorkflow(_make_context()) + queue = await _capture_queue(workflow, monkeypatch=monkeypatch) + assert _refs(queue) == { + "dependency_information", + "project_information", + "oas_update", + "oas_validate", + } + + @pytest.mark.asyncio + async def test_runner_name_is_ctx_app_name(self, monkeypatch): + # Regression: the runner used `name="oas_builder"` while the skip + # checks and CLI export use ctx.app_name — it only worked because + # FileArtifactService ignores app_name in its storage layout. + captured = _patch_task_runners(monkeypatch) + ctx = _make_context() + workflow = OasBuildingWorkflow(ctx) + await workflow._run_impl(user_id="u", on_event=None) + assert [r.name for r in captured] == [ctx.app_name] # ─── OasEnrichmentWorkflow ──────────────────────────────────────────────────── @@ -220,6 +249,14 @@ async def test_artifact_references_resolve(self, monkeypatch): for artifact_ref in _all_artifact_refs(queue): assert _producing_task_key(artifact_ref) in produced + @pytest.mark.asyncio + async def test_runner_name_is_ctx_app_name(self, monkeypatch): + captured = _patch_task_runners(monkeypatch) + ctx = _make_context() + workflow = OasEnrichmentWorkflow(ctx) + await workflow._run_impl(user_id="u", on_event=None) + assert [r.name for r in captured] == [ctx.app_name] + # ─── LikeC4BuildingWorkflow ─────────────────────────────────────────────────── @@ -247,6 +284,25 @@ async def test_assembles_four_tasks(self, monkeypatch): "likec4_build", "likec4_validate", } + # Stable name-based refs (not positional) so --resume checkpoints + # survive the conditional skip of the discovery tasks. 
+ assert _refs(queue) == { + "dependency_information", + "project_information", + "likec4_build", + "likec4_validate", + } + + @pytest.mark.asyncio + async def test_runner_name_is_ctx_app_name(self, monkeypatch): + monkeypatch.setattr( + LikeC4BuildingWorkflow, "_seed_overlay_from_artifact", AsyncMock() + ) + captured = _patch_task_runners(monkeypatch) + ctx = _make_context() + workflow = LikeC4BuildingWorkflow(ctx) + await workflow._run_impl(user_id="u", on_event=None) + assert [r.name for r in captured] == [ctx.app_name] @pytest.mark.asyncio async def test_artifact_references_resolve(self, monkeypatch): @@ -315,6 +371,45 @@ def capture_init(self, **kwargs): assert item.skills == ["trace"] # No artifacts — the trace template reads from the overlay-FS instead. assert item.artifacts == [] + # Stable per-path publish key — fanned-out paths must not overwrite + # each other's result/summary/records artifacts (mirrors trace_verify). + assert item.artifact_key == "trace_annotation/openapi/items_id" + + @pytest.mark.asyncio + async def test_per_path_artifact_keys_are_distinct(self, monkeypatch): + from contractor.workflows.trace_annotation import OpenApiOperation, OpenApiPath + + api_paths = [ + OpenApiPath( + path=path, + operations=[ + OpenApiOperation( + operation_id=f"op{i}", method="get", path=path, schema={} + ) + ], + ) + for i, path in enumerate(["/items/{id}", "/items", "/users/{id}"]) + ] + + workflow = TraceAnnotationWorkflow(_make_context()) + runners: list = [] + original_init = TaskRunner.__init__ + + def capture_init(self, **kwargs): + original_init(self, **kwargs) + runners.append(self) + + monkeypatch.setattr(TaskRunner, "__init__", capture_init) + monkeypatch.setattr(TaskRunner, "run", AsyncMock()) + + for api_path in api_paths: + await workflow._run_path_analysis(api_path, user_id="u") + + keys = [runner.queue[0].artifact_key for runner in runners] + assert len(keys) == len(api_paths) + assert len(set(keys)) == len(keys) + for key in keys: + assert 
key.startswith("trace_annotation/openapi/") # ─── TraceAnnotationDirectWorkflow ──────────────────────────────────────────── @@ -399,6 +494,12 @@ def capture_init(self, **kwargs): "trace_verify:openapi:items:sqli-list", "trace_verify:openapi:items:xss-list", } + # Same template fanned out per finding → distinct, stable publish keys + # so the tasks don't overwrite each other's artifacts. + assert {item.artifact_key for item in runner.queue} == { + "trace_verify/trace-annotation_openapi_items/sqli-list", + "trace_verify/trace-annotation_openapi_items/xss-list", + } for item in runner.queue: assert item.params["source_namespace"] == "trace-annotation:openapi:items" assert item.params["finding_name"] in {"sqli-list", "xss-list"} @@ -431,6 +532,164 @@ def capture_init(self, **kwargs): # No findings → no TaskRunner created and no tasks queued. assert captured == [] + def test_candidate_namespaces_probe_all_known_prefixes(self): + from contractor.workflows.trace_annotation import OpenApiPath + + workflow = TraceVerifyWorkflow(_make_context()) + api_path = OpenApiPath(path="/items", operations=[]) + + assert workflow._candidate_namespaces(api_path) == [ + f"{prefix}:openapi:items" for prefix in TRACE_NAMESPACE_PREFIXES + ] + # All known producers must be covered (regression: only the + # trace/trace-direct prefix used to be probed). + assert set(TRACE_NAMESPACE_PREFIXES) == { + "trace-annotation", + "trace-graph", + "trace-graph-pathpar", + "trace-postdiff", + } + + def test_candidate_namespaces_include_route_group_keys(self): + from contractor.workflows.trace_annotation import OpenApiPath + + workflow = TraceVerifyWorkflow(_make_context()) + api_path = OpenApiPath(path="/users/{user-id}/orders", operations=[]) + + candidates = workflow._candidate_namespaces(api_path) + for prefix in TRACE_NAMESPACE_PREFIXES: + # Full path key plus depth-1 and depth-2 group keys. 
+ assert f"{prefix}:openapi:users_user-id_orders" in candidates + assert f"{prefix}:openapi:users" in candidates + assert f"{prefix}:openapi:users_user-id" in candidates + + @pytest.mark.asyncio + async def test_group_namespace_verified_once_across_sibling_paths( + self, monkeypatch + ): + """A grouped producer (group_depth=1) writes one findings artifact + for all sibling paths; verify must queue it once, not once per + member path.""" + from contractor.workflows.trace_annotation import OpenApiPath + + ctx = _make_context() + findings_yaml = self._make_findings_yaml("bola-users") + group_namespace = "trace-postdiff:openapi:users" + + async def fake_load(*, app_name, user_id, filename, **_): + if filename == f"user:vulnerability-reports/{group_namespace}": + return MagicMock(text=findings_yaml, inline_data=None) + return None + + ctx.artifact_service.load_artifact = AsyncMock(side_effect=fake_load) + workflow = TraceVerifyWorkflow(ctx) + + captured: list = [] + original_init = TaskRunner.__init__ + + def capture_init(self, **kwargs): + original_init(self, **kwargs) + captured.append(self) + + monkeypatch.setattr(TaskRunner, "__init__", capture_init) + monkeypatch.setattr(TaskRunner, "run", AsyncMock()) + + sibling_paths = [ + OpenApiPath(path="/users/{user-id}", operations=[]), + OpenApiPath(path="/users/export", operations=[]), + ] + queued = 0 + for api_path in sibling_paths: + queued += await workflow._verify_path_findings( + api_path=api_path, + user_id="u", + on_event=None, + ) + + # Both siblings resolve to the same group namespace — one runner, + # one queued finding. 
+ assert queued == 1 + assert len(captured) == 1 + assert captured[0].queue[0].namespace == group_namespace + + @pytest.mark.asyncio + @pytest.mark.parametrize("prefix", TRACE_NAMESPACE_PREFIXES) + async def test_discovers_findings_under_every_known_prefix( + self, monkeypatch, prefix + ): + """Regression: trace-verify only probed ``trace-annotation:`` while + trace-graph (the production default) writes ``trace-graph:`` — so + verify silently skipped every path after a trace-graph run.""" + from contractor.workflows.trace_annotation import OpenApiPath + + ctx = _make_context() + findings_yaml = self._make_findings_yaml("sqli-list") + expected_namespace = f"{prefix}:openapi:items" + + async def fake_load(*, app_name, user_id, filename, **_): + if filename == f"user:vulnerability-reports/{expected_namespace}": + return MagicMock(text=findings_yaml, inline_data=None) + return None + + ctx.artifact_service.load_artifact = AsyncMock(side_effect=fake_load) + workflow = TraceVerifyWorkflow(ctx) + api_path = OpenApiPath(path="/items", operations=[]) + + captured: list = [] + original_init = TaskRunner.__init__ + + def capture_init(self, **kwargs): + original_init(self, **kwargs) + captured.append(self) + + monkeypatch.setattr(TaskRunner, "__init__", capture_init) + monkeypatch.setattr(TaskRunner, "run", AsyncMock()) + + queued = await workflow._verify_path_findings( + api_path=api_path, + user_id="u", + on_event=None, + ) + + assert queued == 1 + assert len(captured) == 1 + item = captured[0].queue[0] + assert item.namespace == expected_namespace + assert item.params["source_namespace"] == expected_namespace + + @pytest.mark.asyncio + async def test_zero_findings_across_all_paths_warns(self, monkeypatch, caplog): + import logging + + ctx = _make_context() + oas_yaml = yaml.safe_dump( + { + "openapi": "3.0.0", + "paths": {"/items": {"get": {"operationId": "list_items"}}}, + } + ) + + async def fake_load(*, app_name, user_id, filename, **_): + if filename == 
"oas-openapi-building": + return MagicMock(text=oas_yaml, inline_data=None) + return None + + ctx.artifact_service.load_artifact = AsyncMock(side_effect=fake_load) + workflow = TraceVerifyWorkflow(ctx) + + with caplog.at_level( + logging.WARNING, logger="contractor.workflows.trace_verify.workflow" + ): + await workflow._run_impl(user_id="u", on_event=None) + + messages = [ + r.getMessage() for r in caplog.records if r.levelno == logging.WARNING + ] + assert any("nothing to verify" in m for m in messages) + warning = next(m for m in messages if "nothing to verify" in m) + for prefix in TRACE_NAMESPACE_PREFIXES: + assert prefix in warning + def test_template_loads(self): from contractor.runners.models import TaskTemplate @@ -519,6 +778,8 @@ async def test_trace_finding_assembles_task(self, monkeypatch): item = queue[0] assert item.template_key == "trace_annotation" assert item.ref == "vuln-scan-trace:trace:sqli-1" + # Per-finding publish key (template is fanned out one task per finding). + assert item.artifact_key == "trace_annotation/sqli-1" assert item.skills == ["trace"] assert item.params["operation_id"] == "sqli-1" assert "h.py:10" in item.params["operation_schema"] @@ -573,6 +834,44 @@ def test_dedup_merges_by_file_and_cwe_keeping_higher_confidence(self): h_py = next(f for f in deduped if f["place"] == "h.py") assert h_py["confidence"] == "high" + def test_dedup_survives_trailing_cwe_marker(self): + # Regression: `details` ending exactly in "CWE-" crashed the old + # `.split("CWE-")[1].split()[0]` extraction with IndexError. 
+ from contractor.workflows.vuln_scan_fast import VulnScanFastWorkflow + + findings = [ + {"name": "a", "place": "h.py", "details": "see CWE-", "confidence": "low"}, + ] + deduped = VulnScanFastWorkflow._dedup(findings) + assert [f["name"] for f in deduped] == ["a"] + + def test_dedup_survives_null_and_nonstring_fields(self): + # Regression: explicit-null YAML fields (`details:` / `place:`) load + # as None and crashed with TypeError/AttributeError; non-string + # scalars are coerced. + from contractor.workflows.vuln_scan_fast import VulnScanFastWorkflow + + findings = [ + {"name": "a", "place": None, "details": None, "confidence": "high"}, + {"name": "b", "place": "x.py", "details": 42}, + {"name": "c"}, # fields absent entirely + ] + deduped = VulnScanFastWorkflow._dedup(findings) + # a and c share the ("", "") bucket; the higher-confidence a wins. + assert {f["name"] for f in deduped} == {"a", "b"} + + def test_dedup_extracts_cwe_through_punctuation(self): + # "CWE-89:" and "CWE-89 " must land in the same bucket (the old + # whitespace split kept the colon, splitting the bucket in two). 
+ from contractor.workflows.vuln_scan_fast import VulnScanFastWorkflow + + findings = [ + {"name": "a", "place": "y.py", "details": "CWE-89: SQLi", "confidence": "low"}, + {"name": "b", "place": "y.py", "details": "blah CWE-89 again", "confidence": "high"}, + ] + deduped = VulnScanFastWorkflow._dedup(findings) + assert [f["name"] for f in deduped] == ["b"] + @pytest.mark.asyncio async def test_discovery_assembles_two_tasks(self, monkeypatch): from contractor.workflows.vuln_scan_fast import VulnScanFastWorkflow @@ -581,7 +880,14 @@ async def test_discovery_assembles_two_tasks(self, monkeypatch): workflow = VulnScanFastWorkflow(_make_context()) await workflow._run_discovery(user_id="u", on_event=None) - assert _template_keys(_flat_queue(captured)) == { + queue = _flat_queue(captured) + assert _template_keys(queue) == { + "dependency_information", + "project_information", + } + # Stable name-based refs (not positional) so --resume checkpoints + # survive the conditional skip of either discovery task. + assert _refs(queue) == { "dependency_information", "project_information", } @@ -603,19 +909,53 @@ async def test_fast_scan_assembles_scan_task(self, monkeypatch): # ─── ExploitabilityWorkflow ───────────────────────────────────────────────────── +def _patch_settings(monkeypatch, **overrides): + """Route a workflow module's ``get_settings`` to a hermetic ``Settings``. + + ``_env_file=None`` keeps the developer's real ``cli/.env`` out of unit + tests; ``overrides`` are plain field-name kwargs (populate_by_name=True). + """ + from contractor.utils.settings import Settings + from contractor.workflows.exploitability import workflow as exploit_wf + from contractor.workflows.vuln_assess import workflow as vuln_assess_wf + from contractor.workflows.vuln_scan_fast import workflow as vuln_scan_fast_wf + + # Pin Caido off unless a test opts in — the anchored cli/.env (or the + # developer's env) may configure a live proxy, and unit tests must not + # reach for it. 
+ overrides.setdefault("caido_url", None) + overrides.setdefault("caido_auth_token", None) + settings = Settings(_env_file=None, **overrides) + for module in (exploit_wf, vuln_assess_wf, vuln_scan_fast_wf): + monkeypatch.setattr(module, "get_settings", lambda: settings) + return settings + + class TestExploitabilityWorkflow: def test_requires_target_url(self, monkeypatch): from contractor.workflows.exploitability import ExploitabilityWorkflow - monkeypatch.delenv("CONTRACTOR_TARGET_URL", raising=False) + _patch_settings(monkeypatch, target_url=None) with pytest.raises(ValueError, match="CONTRACTOR_TARGET_URL"): ExploitabilityWorkflow(_make_context()) + def test_target_url_and_proxy_come_from_settings(self, monkeypatch): + from contractor.workflows.exploitability import ExploitabilityWorkflow + + _patch_settings( + monkeypatch, + target_url="http://localhost:5002", + proxy="http://127.0.0.1:8888", + ) + workflow = ExploitabilityWorkflow(_make_context()) + assert workflow.target_base_url == "http://localhost:5002" + assert workflow.proxy == "http://127.0.0.1:8888" + @pytest.mark.asyncio async def test_assess_finding_assembles_task(self, monkeypatch): from contractor.workflows.exploitability import ExploitabilityWorkflow - monkeypatch.setenv("CONTRACTOR_TARGET_URL", "http://localhost:5002") + _patch_settings(monkeypatch, target_url="http://localhost:5002") captured = _patch_task_runners(monkeypatch) workflow = ExploitabilityWorkflow(_make_context()) await workflow._assess_finding( @@ -629,6 +969,8 @@ async def test_assess_finding_assembles_task(self, monkeypatch): item = queue[0] assert item.template_key == "exploitability_assessment" assert item.ref == "exploitability:idor-1" + # Per-finding publish key (template is fanned out one task per finding). 
+ assert item.artifact_key == "exploitability_assessment/idor-1" assert item.skills == ["exploit", "code-exec", "auth"] assert item.params["finding_name"] == "idor-1" assert item.params["source_namespace"] == "exploitability:idor-1" @@ -637,7 +979,7 @@ async def test_assess_finding_assembles_task(self, monkeypatch): async def test_assess_finding_skips_unnamed(self, monkeypatch): from contractor.workflows.exploitability import ExploitabilityWorkflow - monkeypatch.setenv("CONTRACTOR_TARGET_URL", "http://localhost:5002") + _patch_settings(monkeypatch, target_url="http://localhost:5002") captured = _patch_task_runners(monkeypatch) workflow = ExploitabilityWorkflow(_make_context()) await workflow._assess_finding(finding={"name": ""}, user_id="u", on_event=None) @@ -647,7 +989,7 @@ async def test_assess_finding_skips_unnamed(self, monkeypatch): async def test_load_findings_parses_seed(self, monkeypatch): from contractor.workflows.exploitability import ExploitabilityWorkflow - monkeypatch.setenv("CONTRACTOR_TARGET_URL", "http://localhost:5002") + _patch_settings(monkeypatch, target_url="http://localhost:5002") ctx = _make_context() ctx.artifact_service.load_artifact = AsyncMock( return_value=MagicMock( @@ -672,7 +1014,16 @@ async def test_oas_stage_assembles_four_tasks(self, monkeypatch): workflow = VulnAssessWorkflow(_make_context()) await workflow._run_oas_stage(user_id="u", on_event=None) - assert _template_keys(_flat_queue(captured)) == { + queue = _flat_queue(captured) + assert _template_keys(queue) == { + "dependency_information", + "project_information", + "oas_update", + "oas_validate", + } + # Stable name-based refs (not positional) so --resume checkpoints + # survive the conditional skip of the discovery tasks. + assert _refs(queue) == { "dependency_information", "project_information", "oas_update", @@ -715,3 +1066,78 @@ async def fake_load(*, app_name, user_id, filename, **_): # Whole OAS stage short-circuits — no tasks queued. 
assert _flat_queue(captured) == [] + + +# ─── TraceGraphPathParWorkflow ────────────────────────────────────────────────── + + +class TestTraceGraphPathParWorkflow: + """Regression: a single failed path makes ``asyncio.TaskGroup`` cancel + its siblings and re-raise, which used to skip the overlay merge + save + entirely — losing every already-completed path's annotations.""" + + _OAS_YAML = yaml.safe_dump( + { + "openapi": "3.0.0", + "paths": { + "/a": {"get": {"operationId": "getA"}}, + "/b": {"get": {"operationId": "getB"}}, + }, + } + ) + + def _make_workflow(self, monkeypatch): + from contractor.workflows.trace_graph_pathpar import TraceGraphPathParWorkflow + from contractor.workflows.trace_graph_pathpar import workflow as wf_module + + ctx = _make_context() + + async def fake_load(*, app_name, user_id, filename, **_): + if filename == "oas-openapi-building": + return MagicMock(text=self._OAS_YAML, inline_data=None) + return None + + ctx.artifact_service.load_artifact = AsyncMock(side_effect=fake_load) + + # Graph tools need a host-disk root — irrelevant to this test. 
+ monkeypatch.setattr(wf_module, "attach_graph_tools_if_local", lambda fs: []) + merge_mock = MagicMock(return_value=[]) + monkeypatch.setattr(wf_module, "merge_overlay_forks", merge_mock) + + workflow = TraceGraphPathParWorkflow(ctx) + save_mock = AsyncMock() + monkeypatch.setattr(workflow, "_save_overlay_artifacts", save_mock) + return workflow, merge_mock, save_mock + + @pytest.mark.asyncio + async def test_partial_failure_still_merges_and_saves(self, monkeypatch): + workflow, merge_mock, save_mock = self._make_workflow(monkeypatch) + completed: list[str] = [] + + async def fake_group_analysis(*, group, **kwargs): + if any(p.path == "/b" for p in group.paths): + raise RuntimeError("boom") + completed.append(group.key) + + monkeypatch.setattr(workflow, "_run_group_analysis", fake_group_analysis) + + with pytest.raises(ExceptionGroup): + await workflow._run_impl(user_id="u", on_event=None) + + # The completed sibling's fork is still merged and persisted. + merge_mock.assert_called_once() + save_mock.assert_awaited_once_with("u") + + @pytest.mark.asyncio + async def test_happy_path_merges_and_saves_exactly_once(self, monkeypatch): + workflow, merge_mock, save_mock = self._make_workflow(monkeypatch) + + async def fake_group_analysis(*, group, **kwargs): + return None + + monkeypatch.setattr(workflow, "_run_group_analysis", fake_group_analysis) + + await workflow._run_impl(user_id="u", on_event=None) + + merge_mock.assert_called_once() + save_mock.assert_awaited_once_with("u") diff --git a/tests/units/contractor_tests/workflows/test_findings.py b/tests/units/contractor_tests/workflows/test_findings.py new file mode 100644 index 0000000..e2653fa --- /dev/null +++ b/tests/units/contractor_tests/workflows/test_findings.py @@ -0,0 +1,73 @@ +"""Tests for the shared YAML findings-artifact loaders.""" + +from types import SimpleNamespace +from unittest.mock import AsyncMock + +import pytest + +from contractor.workflows.findings import ( + load_findings_artifact, + 
load_yaml_dict_artifact, +) + + +def _service(text: str | None) -> AsyncMock: + part = None if text is None else SimpleNamespace(text=text) + service = AsyncMock() + service.load_artifact = AsyncMock(return_value=part) + return service + + +async def _load_dict(text: str | None) -> dict: + return await load_yaml_dict_artifact( + _service(text), app_name="app", user_id="u", filename="f" + ) + + +async def _load_findings(text: str | None) -> list[dict]: + return await load_findings_artifact( + _service(text), app_name="app", user_id="u", filename="f" + ) + + +class TestLoadYamlDictArtifact: + @pytest.mark.asyncio + async def test_missing_artifact_returns_empty(self): + assert await _load_dict(None) == {} + + @pytest.mark.asyncio + async def test_empty_text_returns_empty(self): + assert await _load_dict("") == {} + + @pytest.mark.asyncio + async def test_invalid_yaml_returns_empty(self): + assert await _load_dict("{ not: [valid") == {} + + @pytest.mark.asyncio + async def test_non_mapping_returns_empty(self): + assert await _load_dict("- a\n- b\n") == {} + + @pytest.mark.asyncio + async def test_mapping_round_trips(self): + assert await _load_dict("a: 1\nb: 2\n") == {"a": 1, "b": 2} + + +class TestLoadFindingsArtifact: + @pytest.mark.asyncio + async def test_name_backfilled_from_key(self): + findings = await _load_findings("sqli:\n severity: high\n") + assert findings == [{"name": "sqli", "severity": "high"}] + + @pytest.mark.asyncio + async def test_explicit_name_field_wins(self): + findings = await _load_findings("key:\n name: explicit\n") + assert findings == [{"name": "explicit"}] + + @pytest.mark.asyncio + async def test_non_mapping_entries_dropped(self): + findings = await _load_findings("good:\n severity: low\nbad: just-a-string\n") + assert findings == [{"name": "good", "severity": "low"}] + + @pytest.mark.asyncio + async def test_missing_artifact_returns_empty_list(self): + assert await _load_findings(None) == [] diff --git 
a/tests/units/contractor_tests/workflows/test_path_groups.py b/tests/units/contractor_tests/workflows/test_path_groups.py new file mode 100644 index 0000000..45cdf8b --- /dev/null +++ b/tests/units/contractor_tests/workflows/test_path_groups.py @@ -0,0 +1,173 @@ +"""Unit tests for router-prefix path grouping (coverage budgeting).""" + +from __future__ import annotations + +from pathlib import Path +from unittest.mock import AsyncMock, MagicMock + +import pytest +import yaml +from google.adk.artifacts import BaseArtifactService +from google.genai import types + +from cli.fs import RootedLocalFileSystem +from contractor.workflows import WorkflowContext +from contractor.workflows.path_groups import ( + PathGroup, + group_key_for_path, + group_paths_by_prefix, +) +from contractor.workflows.trace_annotation import OpenApiPath + + +def _paths(*raw: str) -> list[OpenApiPath]: + return [OpenApiPath(path=p, operations=[]) for p in raw] + + +class TestGroupKey: + def test_depth_one_uses_first_segment(self): + assert group_key_for_path("/users/{user-id}", 1) == "users" + assert group_key_for_path("/users/export", 1) == "users" + assert group_key_for_path("/admin/stats", 1) == "admin" + + def test_depth_two(self): + assert group_key_for_path("/api/v1/users", 2) == "api_v1" + + def test_param_braces_stripped(self): + assert group_key_for_path("/{tenant}/users", 1) == "tenant" + + def test_depth_beyond_segments_uses_all(self): + assert group_key_for_path("/users", 3) == "users" + + def test_root_path(self): + assert group_key_for_path("/", 1) == "root" + + def test_full_depth_matches_path_key(self): + # depth <= 0 must reproduce OpenApiPath.path_key so per-path + # grouping keeps historical namespaces. 
+ for raw in ("/users/{user-id}", "/admin/stats", "/", "/items"): + api_path = OpenApiPath(path=raw, operations=[]) + assert group_key_for_path(raw, 0) == api_path.path_key + + +class TestGrouping: + def test_depth_zero_one_group_per_path(self): + paths = _paths("/users/{user-id}", "/users/export") + groups = group_paths_by_prefix(paths, depth=0) + assert [g.key for g in groups] == ["users_user-id", "users_export"] + assert all(len(g.paths) == 1 for g in groups) + + def test_depth_one_groups_siblings(self): + paths = _paths("/users/{user-id}", "/users/export", "/admin/stats") + groups = group_paths_by_prefix(paths, depth=1) + assert [g.key for g in groups] == ["users", "admin"] + assert [p.path for p in groups[0].paths] == [ + "/users/{user-id}", + "/users/export", + ] + + def test_first_seen_order_preserved(self): + paths = _paths("/b/x", "/a/y", "/b/z") + groups = group_paths_by_prefix(paths, depth=1) + assert [g.key for g in groups] == ["b", "a"] + assert [p.path for p in groups[0].paths] == ["/b/x", "/b/z"] + + def test_group_operations_flatten_member_paths(self): + p1 = OpenApiPath(path="/u/a", operations=[]) + p2 = OpenApiPath(path="/u/b", operations=[]) + group = PathGroup(key="u", paths=(p1, p2)) + assert group.operations == [] + + +OPENAPI_DOC = { + "openapi": "3.0.0", + "info": {"title": "t", "version": "1"}, + "paths": { + "/users/{user-id}": { + "get": {"operationId": "getUser", "responses": {"200": {}}}, + }, + "/users/export": { + "get": {"operationId": "exportUsers", "responses": {"200": {}}}, + }, + "/admin/stats": { + "get": {"operationId": "adminStats", "responses": {"200": {}}}, + }, + }, +} + + +def _make_context(tmp_path: Path) -> WorkflowContext: + (tmp_path / "app.py").write_text("def handler():\n pass\n") + + artifact_service = MagicMock(spec=BaseArtifactService) + + async def load_artifact(*, app_name, user_id, filename): + if filename == "oas-openapi-building": + return types.Part.from_text(text=yaml.safe_dump(OPENAPI_DOC)) + return None + 
+ artifact_service.load_artifact = AsyncMock(side_effect=load_artifact) + artifact_service.save_artifact = AsyncMock() + + return WorkflowContext( + project_path=tmp_path, + folder_name="/", + model="lm-studio-test", + app_name="contractor-test", + user_id="u", + artifact_service=artifact_service, + fs=RootedLocalFileSystem(str(tmp_path)), + ) + + +@pytest.mark.asyncio +class TestPathparGroupForks: + """The fork/concurrency unit of trace-graph-pathpar follows group_depth.""" + + async def _run(self, tmp_path, monkeypatch, depth: int): + import contractor.workflows.trace_graph_pathpar.workflow as wf_mod + from contractor.workflows.trace_graph_pathpar import ( + TraceGraphPathParWorkflow, + ) + + monkeypatch.setattr(wf_mod.CFG.budgets, "group_depth", depth) + monkeypatch.setattr(wf_mod, "attach_graph_tools_if_local", lambda fs: []) + monkeypatch.setattr(wf_mod, "merge_overlay_forks", lambda *a, **k: []) + + forks: list = [] + + def fake_fork(fs, patch): + fork = MagicMock() + forks.append(fork) + return fork + + monkeypatch.setattr(wf_mod, "fork_overlay", fake_fork) + + groups_seen: list[str] = [] + + async def fake_group_analysis( + self, *, group, overlay, runner, user_id, on_event + ): + groups_seen.append(group.key) + + monkeypatch.setattr( + TraceGraphPathParWorkflow, "_run_group_analysis", fake_group_analysis + ) + + workflow = TraceGraphPathParWorkflow(_make_context(tmp_path)) + await workflow._run_impl(user_id="u", on_event=None) + return forks, groups_seen + + async def test_depth_zero_forks_per_path(self, tmp_path, monkeypatch): + forks, groups_seen = await self._run(tmp_path, monkeypatch, depth=0) + assert len(forks) == 3 + assert sorted(groups_seen) == [ + "admin_stats", + "users_export", + "users_user-id", + ] + + async def test_depth_one_forks_per_route_group(self, tmp_path, monkeypatch): + forks, groups_seen = await self._run(tmp_path, monkeypatch, depth=1) + assert len(forks) == 2 + assert sorted(groups_seen) == ["admin", "users"] diff --git 
a/tests/units/contractor_tests/workflows/test_trace_namespaces.py b/tests/units/contractor_tests/workflows/test_trace_namespaces.py new file mode 100644 index 0000000..1f9a7a0 --- /dev/null +++ b/tests/units/contractor_tests/workflows/test_trace_namespaces.py @@ -0,0 +1,47 @@ +"""Regression tests for the trace-verify namespace mismatch: trace-verify only +probed the 'trace-annotation:' prefix while trace-graph (the production +default) writes 'trace-graph:' and pathpar writes 'trace-graph-pathpar:', so +verify silently found zero findings after a trace-graph run. All producers and +consumers now share the prefixes from contractor.workflows.namespaces. +""" + +from __future__ import annotations + +import contractor.workflows.trace_annotation.workflow as ta +import contractor.workflows.trace_annotation_direct.workflow as tad +import contractor.workflows.trace_graph.workflow as tg +import contractor.workflows.trace_graph_pathpar.workflow as tgp +import contractor.workflows.trace_postdiff.workflow as tpd +from contractor.workflows.namespaces import ( + TRACE_ANNOTATION_NAMESPACE_PREFIX, + TRACE_GRAPH_NAMESPACE_PREFIX, + TRACE_GRAPH_PATHPAR_NAMESPACE_PREFIX, + TRACE_NAMESPACE_PREFIXES, + TRACE_POSTDIFF_NAMESPACE_PREFIX, +) + + +def test_prefix_values(): + assert TRACE_ANNOTATION_NAMESPACE_PREFIX == "trace-annotation" + assert TRACE_GRAPH_NAMESPACE_PREFIX == "trace-graph" + assert TRACE_GRAPH_PATHPAR_NAMESPACE_PREFIX == "trace-graph-pathpar" + assert TRACE_POSTDIFF_NAMESPACE_PREFIX == "trace-postdiff" + + +def test_probe_tuple_covers_every_producer_prefix(): + assert TRACE_NAMESPACE_PREFIXES == ( + TRACE_ANNOTATION_NAMESPACE_PREFIX, + TRACE_GRAPH_NAMESPACE_PREFIX, + TRACE_GRAPH_PATHPAR_NAMESPACE_PREFIX, + TRACE_POSTDIFF_NAMESPACE_PREFIX, + ) + + +def test_producers_reference_the_shared_constants(): + # Each producer module must build its per-path namespace from the *same* + # constant trace-verify probes with, so write and read keys cannot drift. 
+ assert ta.TRACE_ANNOTATION_NAMESPACE_PREFIX is TRACE_ANNOTATION_NAMESPACE_PREFIX + assert tad.TRACE_ANNOTATION_NAMESPACE_PREFIX is TRACE_ANNOTATION_NAMESPACE_PREFIX + assert tg.TRACE_GRAPH_NAMESPACE_PREFIX is TRACE_GRAPH_NAMESPACE_PREFIX + assert tgp.PATH_NAMESPACE_PREFIX is TRACE_GRAPH_PATHPAR_NAMESPACE_PREFIX + assert tpd.TRACE_POSTDIFF_NAMESPACE_PREFIX is TRACE_POSTDIFF_NAMESPACE_PREFIX diff --git a/tests/units/contractor_tests/workflows/test_trace_postdiff.py b/tests/units/contractor_tests/workflows/test_trace_postdiff.py new file mode 100644 index 0000000..35e1e0f --- /dev/null +++ b/tests/units/contractor_tests/workflows/test_trace_postdiff.py @@ -0,0 +1,256 @@ +"""Unit tests for the trace-postdiff workflow (annotate-only trace stage + +post-diff vuln-analytics stage) and its diff helpers.""" + +from __future__ import annotations + +from pathlib import Path +from unittest.mock import AsyncMock, MagicMock + +import pytest +import yaml +from google.adk.artifacts import BaseArtifactService +from google.genai import types + +from cli.fs import RootedLocalFileSystem +from contractor.runners.models import RenderedTask, TaskTemplate +from contractor.utils import load_prompt +from contractor.workflows import WorkflowContext, get_workflows +from contractor.workflows.namespaces import ( + TRACE_NAMESPACE_PREFIXES, + TRACE_POSTDIFF_NAMESPACE_PREFIX, +) +from contractor.workflows.trace_postdiff import TracePostDiffWorkflow +from contractor.workflows.trace_postdiff.workflow import ( + _diff_header_path, + filter_diff_by_files, + truncate_diff, +) + +OPENAPI_DOC = { + "openapi": "3.0.0", + "info": {"title": "t", "version": "1"}, + "paths": { + "/users/{user-id}": { + "get": {"operationId": "getUser", "responses": {"200": {}}}, + "delete": {"operationId": "deleteUser", "responses": {"204": {}}}, + }, + }, +} + + +def _make_context(tmp_path: Path, doc: dict = OPENAPI_DOC) -> WorkflowContext: + (tmp_path / "app.py").write_text("def handler():\n pass\n") + + artifact_service 
= MagicMock(spec=BaseArtifactService) + + async def load_artifact(*, app_name, user_id, filename): + if filename == "oas-openapi-building": + return types.Part.from_text(text=yaml.safe_dump(doc)) + return None + + artifact_service.load_artifact = AsyncMock(side_effect=load_artifact) + artifact_service.save_artifact = AsyncMock() + + return WorkflowContext( + project_path=tmp_path, + folder_name="/", + model="lm-studio-test", + app_name="contractor-test", + user_id="u", + artifact_service=artifact_service, + fs=RootedLocalFileSystem(str(tmp_path)), + ) + + +class TestDiffHelpers: + def test_header_path_simple(self): + line = "diff --overlay a/src/app.py b/src/app.py" + assert _diff_header_path(line) == "/src/app.py" + + def test_header_path_with_space(self): + line = "diff --overlay a/my dir/f.py b/my dir/f.py" + assert _diff_header_path(line) == "/my dir/f.py" + + def test_header_path_non_header(self): + assert _diff_header_path("+++ b/src/app.py") is None + + def test_filter_keeps_only_named_files(self): + diff = ( + "diff --overlay a/a.py b/a.py\n" + "--- a/a.py\n" + "+++ b/a.py\n" + "+# @trace target=x args= calls=\n" + "diff --overlay a/b.py b/b.py\n" + "--- a/b.py\n" + "+++ b/b.py\n" + "+irrelevant\n" + ) + kept = filter_diff_by_files(diff, {"/a.py"}) + assert "a/a.py" in kept + assert "@trace" in kept + assert "b.py" not in kept + assert "irrelevant" not in kept + + def test_filter_empty_inputs(self): + assert filter_diff_by_files("", {"/a.py"}) == "" + assert filter_diff_by_files("diff --overlay a/a.py b/a.py", set()) == "" + + def test_truncate(self): + assert truncate_diff("short", 100) == "short" + out = truncate_diff("x" * 200, 100) + assert out.startswith("x" * 100) + assert "truncated" in out + + +class TestSurfaces: + def test_registry_exposes_trace_postdiff(self): + assert get_workflows()["trace-postdiff"] is TracePostDiffWorkflow + + def test_namespace_prefix_registered(self): + assert TRACE_POSTDIFF_NAMESPACE_PREFIX in TRACE_NAMESPACE_PREFIXES + + 
def test_analytics_prompt_loads(self): + prompt = load_prompt("vuln_analytics_agent") + assert "report_vulnerability" in prompt + assert "@trace" in prompt + + def test_analytics_template_renders(self): + template = TaskTemplate.load("vuln_analytics") + rendered = RenderedTask.from_template( + template=template, + variables={ + "target_summary": "TARGET-SUMMARY-SENTINEL", + "trace_diff": "TRACE-DIFF-SENTINEL", + }, + params={}, + artifacts={}, + ) + text = rendered._format_task() + assert "TARGET-SUMMARY-SENTINEL" in text + assert "TRACE-DIFF-SENTINEL" in text + + +@pytest.mark.asyncio +class TestTwoStageRun: + async def _run( + self, tmp_path, monkeypatch, *, annotate: bool, doc: dict = OPENAPI_DOC + ): + """Run the workflow with both agent builders and the runner faked. + + When ``annotate`` is set, the fake trace stage writes an annotation + into the overlay (as the real trace_agent would via its tools). + """ + import contractor.workflows.trace_postdiff.workflow as wf_mod + + ctx = _make_context(tmp_path, doc) + workflow = TracePostDiffWorkflow(ctx) + + trace_builds: list[dict] = [] + analytics_builds: list[dict] = [] + runs: list[dict] = [] + + def fake_trace_agent(name, fs, **kwargs): + trace_builds.append(kwargs) + agent = MagicMock() + agent._is_trace = True + return agent + + def fake_analytics_agent(name, fs, **kwargs): + analytics_builds.append(kwargs) + agent = MagicMock() + agent._is_trace = False + return agent + + async def fake_run(self, *, agent, message, event_name, **kwargs): + runs.append({"agent": agent, "message": message, "event": event_name}) + if agent._is_trace and annotate: + # Unique content per run so every group sees fresh changes. 
+ workflow.overlayfs.pipe_file( + "/app.py", + b"# @trace target=getUser args=user_id:tainted calls=\n" + b"def handler():\n pass\n" + + f"# run {len(runs)}\n".encode(), + ) + + monkeypatch.setattr(wf_mod, "build_trace_agent", fake_trace_agent) + monkeypatch.setattr(wf_mod, "build_vuln_analytics_agent", fake_analytics_agent) + monkeypatch.setattr(wf_mod, "inject_skills", AsyncMock()) + # AgentRunner is a pydantic model — patch the method on the class. + monkeypatch.setattr(wf_mod.AgentRunner, "run", fake_run) + + await workflow._run_impl(user_id="u", on_event=None) + return workflow, trace_builds, analytics_builds, runs + + async def test_annotate_stage_disables_vuln_reporting( + self, tmp_path, monkeypatch + ): + _, trace_builds, _, _ = await self._run( + tmp_path, monkeypatch, annotate=True + ) + # One trace run per operation (get + delete). + assert len(trace_builds) == 2 + for build in trace_builds: + assert build["enable_vuln_reporting"] is False + # group_depth=1 → namespace keyed by the route prefix. 
+ assert build["namespace"] == "trace-postdiff:openapi:users" + + async def test_analytics_stage_receives_annotation_diff( + self, tmp_path, monkeypatch + ): + _, _, analytics_builds, runs = await self._run( + tmp_path, monkeypatch, annotate=True + ) + assert len(analytics_builds) == 1 + assert ( + analytics_builds[0]["namespace"] == "trace-postdiff:openapi:users" + ) + + analytics_runs = [r for r in runs if r["event"].endswith(":analytics")] + assert len(analytics_runs) == 1 + message = analytics_runs[0]["message"] + assert "diff --overlay a/app.py b/app.py" in message + assert "@trace target=getUser" in message + assert "/users/{user-id}" in message # target summary present + + async def test_analytics_skipped_without_annotations( + self, tmp_path, monkeypatch + ): + _, trace_builds, analytics_builds, runs = await self._run( + tmp_path, monkeypatch, annotate=False + ) + assert len(trace_builds) == 2 + assert analytics_builds == [] + assert all(not r["event"].endswith(":analytics") for r in runs) + + async def test_sibling_paths_share_group_and_analytics_run( + self, tmp_path, monkeypatch + ): + doc = { + "openapi": "3.0.0", + "info": {"title": "t", "version": "1"}, + "paths": { + "/users/{user-id}": { + "get": {"operationId": "getUser", "responses": {"200": {}}}, + }, + "/users/export": { + "get": {"operationId": "exportUsers", "responses": {"200": {}}}, + }, + "/admin/stats": { + "get": {"operationId": "adminStats", "responses": {"200": {}}}, + }, + }, + } + _, trace_builds, analytics_builds, runs = await self._run( + tmp_path, monkeypatch, annotate=True, doc=doc + ) + # Three operations traced, but only two route groups analyzed. 
+ assert len(trace_builds) == 3 + assert {b["namespace"] for b in trace_builds} == { + "trace-postdiff:openapi:users", + "trace-postdiff:openapi:admin", + } + assert {b["namespace"] for b in analytics_builds} == { + "trace-postdiff:openapi:users", + "trace-postdiff:openapi:admin", + } + assert len(analytics_builds) == 2 diff --git a/tests/units/contractor_tests/workflows/test_vuln_sweep.py b/tests/units/contractor_tests/workflows/test_vuln_sweep.py new file mode 100644 index 0000000..6771ce7 --- /dev/null +++ b/tests/units/contractor_tests/workflows/test_vuln_sweep.py @@ -0,0 +1,160 @@ +"""Unit tests for the vuln-sweep two-pass workflow (per-class BFS +nomination sweep → DFS trace of survivors).""" + +from __future__ import annotations + +from pathlib import Path +from unittest.mock import AsyncMock, MagicMock + +import pytest +import yaml +from google.adk.artifacts import BaseArtifactService + +from contractor.runners.task_runner import TaskRunner +from contractor.workflows import WorkflowContext, get_workflows +from contractor.workflows.vuln_sweep import VulnSweepWorkflow +from contractor.workflows.vuln_sweep.workflow import SINK_CLASSES + + +def _make_context() -> WorkflowContext: + artifact_service = MagicMock(spec=BaseArtifactService) + artifact_service.load_artifact = AsyncMock(return_value=None) + artifact_service.save_artifact = AsyncMock() + return WorkflowContext( + project_path=Path("/tmp/proj"), + folder_name="src", + model="lm-studio-test", + app_name="contractor-test", + user_id="u", + artifact_service=artifact_service, + fs=MagicMock(), + ) + + +def _findings_yaml(*names: str, severity="high", confidence="low", place="app.py"): + return yaml.safe_dump( + { + n: { + "title": n, + "place": place, + "place_type": "file", + "severity": severity, + "confidence": confidence, + "summary": "s", + "details": "d", + } + for n in names + } + ) + + +class TestSurfaces: + def test_registry_exposes_vuln_sweep(self): + assert get_workflows()["vuln-sweep"] is 
VulnSweepWorkflow + + def test_has_absence_class(self): + keys = {c.key for c in SINK_CLASSES} + assert "missing-access-control" in keys + # Several distinct classes, all with guidance text. + assert len(keys) >= 4 + assert all(c.guidance for c in SINK_CLASSES) + + def test_inherits_trace_phase(self): + from contractor.workflows.vuln_scan_trace import VulnScanTraceWorkflow + + assert issubclass(VulnSweepWorkflow, VulnScanTraceWorkflow) + # CFG override points the inherited phase at the sweep config. + assert VulnSweepWorkflow.CFG is not VulnScanTraceWorkflow.CFG + + def test_nomination_template_renders(self): + from contractor.runners.models import RenderedTask, TaskTemplate + + template = TaskTemplate.load("sink_nomination") + rendered = RenderedTask.from_template( + template=template, + variables={ + "project_path": "src", + "sink_class": "injection", + "class_guidance": "GUIDANCE-SENTINEL", + }, + params={}, + artifacts={}, + ) + text = rendered._format_task() + assert "injection" in text + assert "GUIDANCE-SENTINEL" in text + + +@pytest.mark.asyncio +class TestSweepRun: + async def _capture(self, ctx, monkeypatch): + """Run _run_impl with TaskRunner.run faked; return the queued tasks.""" + runners: list = [] + original_init = TaskRunner.__init__ + + def capture_init(self, **kwargs): + original_init(self, **kwargs) + runners.append(self) + + monkeypatch.setattr(TaskRunner, "__init__", capture_init) + monkeypatch.setattr(TaskRunner, "run", AsyncMock()) + + workflow = VulnSweepWorkflow(ctx) + await workflow._run_impl(user_id="u", on_event=None) + return [item for r in runners for item in r.queue] + + async def test_one_nomination_task_per_class(self, monkeypatch): + ctx = _make_context() # load_artifact → None: no nominations, no trace + queue = await self._capture(ctx, monkeypatch) + + sweep_tasks = [t for t in queue if t.template_key == "sink_nomination"] + assert len(sweep_tasks) == len(SINK_CLASSES) + # Each class gets its own namespace and stable ref. 
+ namespaces = {t.namespace for t in sweep_tasks} + assert namespaces == { + f"vuln-sweep:sweep:{c.key}" for c in SINK_CLASSES + } + # No trace tasks queued when there are no nominations. + assert not [t for t in queue if t.template_key == "trace_annotation"] + + async def test_nominations_deduped_and_traced(self, monkeypatch): + ctx = _make_context() + + # Two classes nominate; one slug is shared (same place+name) and + # must dedup to a single trace task. + per_ns = { + "user:vulnerability-reports/vuln-sweep:sweep:injection": _findings_yaml( + "injection-sqli", "shared-dup" + ), + "user:vulnerability-reports/vuln-sweep:sweep:deserialization": ( + _findings_yaml("shared-dup") + ), + } + + async def fake_load(*, app_name, user_id, filename, **_): + text = per_ns.get(filename) + if text is None: + return None + return MagicMock(text=text, inline_data=None) + + ctx.artifact_service.load_artifact = AsyncMock(side_effect=fake_load) + queue = await self._capture(ctx, monkeypatch) + + trace_tasks = [t for t in queue if t.template_key == "trace_annotation"] + traced_names = {t.params["operation_id"] for t in trace_tasks} + assert traced_names == {"injection-sqli", "shared-dup"} + + async def test_cap_limits_trace_phase(self, monkeypatch): + ctx = _make_context() + many = _findings_yaml(*[f"inj-{i}" for i in range(100)]) + + async def fake_load(*, app_name, user_id, filename, **_): + if filename.endswith("vuln-sweep:sweep:injection"): + return MagicMock(text=many, inline_data=None) + return None + + ctx.artifact_service.load_artifact = AsyncMock(side_effect=fake_load) + queue = await self._capture(ctx, monkeypatch) + + trace_tasks = [t for t in queue if t.template_key == "trace_annotation"] + assert len(trace_tasks) == VulnSweepWorkflow.CFG.budgets.max_trace_nominations