feat(agent-runtime): OpenAI Codex as the third agent runtime + resolve E2E findings (#1187)#1203
feat(agent-runtime): OpenAI Codex as the third agent runtime + resolve E2E findings (#1187)#1203AndriiPasternak31 wants to merge 8 commits into
Conversation
…findings (#1187) Lands the #1187 "harness == runtime" work: Codex CLI joins Claude Code and Gemini as a pluggable AgentRuntime, plus the two blockers the live E2E surfaced that kept a Codex agent from *acting*. Core runtime (codex_runtime.py, built on the per-runtime primitives — NOT a shared helper, so it never inherits Gemini's blanket kill_cgroup_orphans): - `codex exec --json` JSONL parse (thread.started / turn.completed.usage / item.completed); reasoning_output_tokens ⊂ output_tokens (no double-count); `-o` output file authoritative (read-then-delete in finally); estimated cost. - chat continuity via `codex exec resume <thread_id>`; concurrency-safe `_drain_bounded` orphan cleanup; error→HTTP auth 503 / rate 429 / runtime-unavailable 500-not-503 / pipe-drop 502. - RuntimeCapabilities dataclass + per-runtime capabilities(); get_runtime() validates AGENT_RUNTIME and fails loud on unknown. Codex skips Claude subscription assign (is_claude_runtime); OPENAI_API_KEY read from .env; CODEX_HOME relocated under $TMPDIR (gitignored). Session tab gated off via RUNTIMES_WITHOUT_SESSION_TAB_RESUME; frontend RuntimeBadge/terminal/default model; test-codex template; @openai/codex@0.139.0 pinned. E2E findings resolved: - Resume arg-order: exec-level flags (--json/-C/--sandbox/-o) must precede the `resume` sub-subcommand (codex 0.139.0 rejects them after, exit 2 → every turn-2+ failed). Fixed + arg-order guard test. - F-TOOLS (sandbox): normal mode → `--sandbox danger-full-access`. Codex's own bubblewrap sandbox can't create a user namespace in the hardened agent container (`bwrap: No permissions to create a new namespace`), which blocked every shell tool. Dropping it leaves the Trinity container as the sole boundary — same posture as Claude/Gemini (cap_drop ALL + AppArmor + no-new-privileges); no container hardening is weakened. Read-only keeps `--sandbox read-only`. - F-MCP (runtime-aware prompt): PLATFORM_INSTRUCTIONS documented MCP tools with Claude's `mcp__trinity__<tool>` prefix; Codex auto-discovers tools by bare name and answered "unknown MCP server". get_platform_system_prompt(runtime=…) / compose_system_prompt(runtime=…) strip the prefix + add a Codex orientation note for runtime="codex"; threaded from chat.py + task_execution_service.py via the trinity.agent-runtime label (docker_service.get_agent_runtime, resolved lazily + guarded so a re-import under a stubbed docker_service in unit tests can't break dispatch). Claude/Gemini/unknown unchanged. - Test-isolation: test_validate_runtime.py hardened against the sibling sys.modules Mock leak. Housekeeping: cap bcrypt<5 in tests/requirements-test.txt (bcrypt 5.0.0 breaks passlib 1.7.4 at conftest DB-init in a fresh venv). Docs: architecture.md, feature-flows.md + feature-flows/codex-runtime.md, harness-authoring-guide.md, requirements.md, agent guide. Verification: full tests/unit green (2218 passed) incl. new sandbox-resolution, runtime-aware-prompt, and resume arg-order tests. Live F-TOOLS/F-MCP E2E re-verification pending a fresh OPENAI_API_KEY (the test key was rotated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🟡 Open design decision: read-only enforcement for CodexThis PR ships read-only for Codex as The gap: Claude enforces read-only via a So there's a real question of whether Options for the team:
The PR ships (A) as the safe interim. Requesting a decision before this hardens into the contract — and either way we should add a write-attempt test that asserts a blocked write actually fails under read-only. |
…ccess) normal agents run --sandbox danger-full-access; workspace-write/read-only both invoke bwrap (user namespace), which the hardened Trinity container forbids. The container is already the boundary, same posture as Claude/Gemini. Read-only agents keep --sandbox read-only as interim enforcement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
adds dispatcher routing for AGENT_RUNTIME=codex, empty-input no-op, and TOML scalar/escape primitive coverage (#1187 follow-up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
the backend/frontend serialize to_dict() to gate UI surfaces; pin every flag so the contract can't drift silently (#1187). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
covers docker_service.get_agent_runtime() label read + claude-code fallbacks, and task_execution_service._resolve_agent_runtime()'s guarded import — both best-effort, never-raise, never block dispatch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
adds tests for pricing prefix-match, system-prompt compose, result-file read/unlink, guardrail surfacing, API-key loading precedence, availability probe, JSONL event parser edges, and CODEX_HOME resolution (#1187). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Resolve by running |
PR #1203's two red checks, without weakening anything real: - test_validate_runtime: adopt the _STUBBED_MODULE_NAMES + autouse _restore_sys_modules snapshot/restore pair so the required import-time stub eviction is the sanctioned escape hatch the sys.modules pollution lint recognizes (precedent: test_telegram_webhook_backfill), instead of a bare sys.modules.pop the linter flags. - codex_runtime: add a sink-side _ensure_within() containment guard on the -o result file's open()/unlink(). The filename is already reduced to a safe token (_safe_result_token + fixed -last.txt suffix), so the two CodeQL path alerts are false positives — but anchoring a realpath containment check at the filesystem sink is defense-in-depth and the barrier CodeQL recognizes. _read_and_consume_result_file/_safe_unlink now take a required base arg; both production call sites pass codex_home and a tripped guard logs a warning rather than silently swallowing. - tests: update the 4 path-helper call sites for the new signature and add test_ensure_within_rejects_escape (out-of-base path refused; wrappers no-op without touching the file). The remaining CodeQL "uncontrolled command line" alert (list-form Popen, shell=False, cmd[0]="codex", prompt after a -- separator) is dismissed via the API as a verified false positive — no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vybe
left a comment
There was a problem hiding this comment.
✅ Validated & approved (/validate-pr) — clean, no defects. Recommend a /cso --diff security pass before merge (see below).
P1 infrastructure feature (Closes #1187). 42 files, +4423/-41 — large, but the bulk is the new runtime + 12 test files + thorough docs; scope is coherent.
Validation — all green:
- Base →
dev✓ · 7 conventional commits ✓ · full CI matrix green (build, verify-non-root, schema-parity, prod-image-smoke, pytest ×6 seeds, CodeQL, regression-diff) - Docs exceptional: architecture + requirements + feature-flows index + new
codex-runtime.md+ harness-authoring-guide + TRINITY_COMPATIBLE_AGENT_GUIDE. New capability → requirements + feature-flow both present ✓ - Security scan clean: no real secrets (the
trinity_mcp_SUPERSECRETVALUEhit is a test fixture intest_codex_mcp_config.py; thesk-*strings are test fixtures); MCP token referenced viabearer_token_env_var(never persisted);OPENAI_API_KEYread from.env/process env, not exported into the agent-server process;CODEX_HOMEunder$TMPDIR+ gitignored. - Invariant #5 (agent-server + backend both updated) ✓; #13 (no MCP-server change needed — Codex consumes the existing Trinity MCP via
config.toml) ✓; no Dockerfile COPY gap (agent_server/copied wholesale, socodex_runtime.pyships) ✓; no new backend env vars. codex_runtime.pycorrectly built on per-runtime primitives (never inherits Gemini's blanketkill_cgroup_orphans()); resume arg-order fix + guard test; runtime-aware prompt (stripsmcp__trinity__for Codex); cost/token parse avoids reasoning double-count.
Sandbox posture change — reviewed, sound: normal mode → --sandbox danger-full-access drops Codex's inner bubblewrap (which can't create a userns in the hardened container → blocks every shell tool). Claude/Gemini already run with no inner sandbox, so this is parity, not a regression — the Trinity container (non-root, CAP_DROP, network isolation) remains the boundary. Read-only mode correctly keeps --sandbox read-only via ~/.trinity/read-only-config.json (fail-open consistent with the Claude reference hook).
/cso --diff for a deep security pass focused on the danger-full-access decision, the Codex subprocess env/credential flow, and the credential sanitizer over Codex output — /validate-pr + CodeQL are green, but the full P1 pipeline (/review + /validate-pr + /cso) prescribes it for a sandbox-posture change.
Merge note: this shares task_execution_service.py + docs/memory/architecture.md + docker/base-image/Dockerfile with #1195/#1196 — land it last (it pays the rebase). Base-image rebuild required (@openai/codex@0.139.0).
…e-1187-plan # Conflicts: # docs/memory/feature-flows.md
Summary
Lands the #1187 "harness == runtime" work — OpenAI Codex joins Claude Code and Gemini as a pluggable
AgentRuntimeinside the agent container — and resolves the two blockers the live E2E surfaced that kept a Codex agent from acting.A Trinity harness IS an
AgentRuntime.AGENT_RUNTIME(fromtemplate.yaml runtime:, also atrinity.agent-runtimelabel) selects one;runtime_adapter.get_runtime()validates againstKNOWN_RUNTIMESand raises on an unknown value.What's in it
Core runtime (
codex_runtime.py, built on the per-runtime primitives — not a shared helper, so it never inherits Gemini's blanketkill_cgroup_orphans()):codex exec --jsonJSONL parse (thread.started/turn.completed.usage/item.completed);reasoning_output_tokens ⊂ output_tokens(no double-count);-ooutput file authoritative (read-then-delete infinally); estimated cost (CODEX_PRICING).codex exec resume <thread_id>; concurrency-safe_drain_boundedorphan cleanup; error→HTTP auth 503 / rate 429 / runtime-unavailable 500-not-503 / pipe-drop 502.RuntimeCapabilitiesdataclass + per-runtimecapabilities(); Codex skips Claude-subscription assign (is_claude_runtime);OPENAI_API_KEYread from.env;CODEX_HOMEunder$TMPDIR(gitignored). Session tab gated off viaRUNTIMES_WITHOUT_SESSION_TAB_RESUME; frontendRuntimeBadge/terminal/default-model;test-codextemplate;@openai/codex@0.139.0pinned.E2E findings resolved (the four fixes):
--json/-C/--sandbox/-o) must precede theresumesub-subcommand (codex 0.139.0 rejects them after → exit 2 → every turn-2+ failed). Fixed + arg-order guard test.--sandbox danger-full-access. Codex's own bubblewrap sandbox can't create a user namespace inside the hardened agent container (bwrap: No permissions to create a new namespace), which blocked every shell tool. Dropping it leaves the Trinity container as the sole boundary. Read-only keeps--sandbox read-only.PLATFORM_INSTRUCTIONSdocumented MCP tools with Claude'smcp__trinity__<tool>prefix; Codex auto-discovers tools by bare name and answeredunknown MCP server.get_platform_system_prompt(runtime=…)/compose_system_prompt(runtime=…)strip the prefix + add a Codex orientation note forruntime="codex", threaded fromrouters/chat.py+services/task_execution_service.pyvia thetrinity.agent-runtimelabel (docker_service.get_agent_runtime, resolved lazily + guarded so a re-import under a stubbeddocker_servicein unit tests can't break dispatch). Claude/Gemini/unknown unchanged.test_validate_runtime.pyhardened against the siblingsys.modulesMock leak.Housekeeping: cap
bcrypt<5intests/requirements-test.txt(bcrypt 5.0.0 removed the__about__shim passlib 1.7.4 reads at import → breaks every test at conftest DB-init in a fresh venv).🔒 Security note
Disabling Codex's internal bubblewrap sandbox (
danger-full-access) does not weaken container hardening. The Trinity agent container is already the security boundary —cap_drop: ALL, AppArmor,no-new-privileges, network isolation — and runs with no userns grant and unchanged caps. This is exactly the posture under which Claude Code and Gemini already run (no internal sandbox). The change drops a redundant inner sandbox that simply cannot function inside the hardened container.Test plan
Unit: full
tests/unitgreen — 2218 passed, 24 skipped, 0 fail (incl. new sandbox-resolution, runtime-aware-prompt, and resume arg-order tests). Caught + fixed a real order-dependency I'd briefly introduced (top-leveldocker_serviceimport → lazy+guarded), re-verified under adversarial ordering.Live E2E on a real Codex agent booted from the rebuilt base image (key injected via CRED-002):
HELLO_CODEX_E2E); logsandbox=danger-full-access; tools ran (wasbwrapfail)list_agentsvia Trinity MCP returned the agent (wasunknown MCP server)resume=Truesandbox=read-onlysk-proj×0 in agent logsdanger-full-accessran tools non-interactively (no approval prompt → no--dangerously-bypass-approvals-and-sandboxneeded).verify-local
--skip-agent(the agent side was validated more thoroughly by the live E2E above; the dev stack held the global agent network).Docs
architecture.md,feature-flows.md+feature-flows/codex-runtime.md,harness-authoring-guide.md,requirements.md, agent guide.Implements #1187.
🤖 Generated with Claude Code