feat(agent-runtime): OpenAI Codex as the third agent runtime + resolve E2E findings (#1187) by AndriiPasternak31 · Pull Request #1203 · Abilityai/trinity

AndriiPasternak31 · 2026-06-14T04:51:26Z

Summary

Lands the #1187 "harness == runtime" work — OpenAI Codex joins Claude Code and Gemini as a pluggable AgentRuntime inside the agent container — and resolves the two blockers the live E2E surfaced that kept a Codex agent from acting.

A Trinity harness IS an AgentRuntime. AGENT_RUNTIME (from template.yaml runtime:, also a trinity.agent-runtime label) selects one; runtime_adapter.get_runtime() validates against KNOWN_RUNTIMES and raises on an unknown value.

What's in it

Core runtime (codex_runtime.py, built on the per-runtime primitives — not a shared helper, so it never inherits Gemini's blanket kill_cgroup_orphans()):

codex exec --json JSONL parse (thread.started / turn.completed.usage / item.completed); reasoning_output_tokens ⊂ output_tokens (no double-count); -o output file authoritative (read-then-delete in finally); estimated cost (CODEX_PRICING).
chat continuity via codex exec resume <thread_id>; concurrency-safe _drain_bounded orphan cleanup; error→HTTP auth 503 / rate 429 / runtime-unavailable 500-not-503 / pipe-drop 502.
RuntimeCapabilities dataclass + per-runtime capabilities(); Codex skips Claude-subscription assign (is_claude_runtime); OPENAI_API_KEY read from .env; CODEX_HOME under $TMPDIR (gitignored). Session tab gated off via RUNTIMES_WITHOUT_SESSION_TAB_RESUME; frontend RuntimeBadge/terminal/default-model; test-codex template; @openai/codex@0.139.0 pinned.

E2E findings resolved (the four fixes):

Resume arg-order — exec-level flags (--json/-C/--sandbox/-o) must precede the resume sub-subcommand (codex 0.139.0 rejects them after → exit 2 → every turn-2+ failed). Fixed + arg-order guard test.
F-TOOLS (sandbox) — normal mode → --sandbox danger-full-access. Codex's own bubblewrap sandbox can't create a user namespace inside the hardened agent container (bwrap: No permissions to create a new namespace), which blocked every shell tool. Dropping it leaves the Trinity container as the sole boundary. Read-only keeps --sandbox read-only.
F-MCP (runtime-aware prompt) — PLATFORM_INSTRUCTIONS documented MCP tools with Claude's mcp__trinity__<tool> prefix; Codex auto-discovers tools by bare name and answered unknown MCP server. get_platform_system_prompt(runtime=…) / compose_system_prompt(runtime=…) strip the prefix + add a Codex orientation note for runtime="codex", threaded from routers/chat.py + services/task_execution_service.py via the trinity.agent-runtime label (docker_service.get_agent_runtime, resolved lazily + guarded so a re-import under a stubbed docker_service in unit tests can't break dispatch). Claude/Gemini/unknown unchanged.
Test-isolation — test_validate_runtime.py hardened against the sibling sys.modules Mock leak.

Housekeeping: cap bcrypt<5 in tests/requirements-test.txt (bcrypt 5.0.0 removed the __about__ shim passlib 1.7.4 reads at import → breaks every test at conftest DB-init in a fresh venv).

🔒 Security note

Disabling Codex's internal bubblewrap sandbox (danger-full-access) does not weaken container hardening. The Trinity agent container is already the security boundary — cap_drop: ALL, AppArmor, no-new-privileges, network isolation — and runs with no userns grant and unchanged caps. This is exactly the posture under which Claude Code and Gemini already run (no internal sandbox). The change drops a redundant inner sandbox that simply cannot function inside the hardened container.

Test plan

Unit: full tests/unit green — 2218 passed, 24 skipped, 0 fail (incl. new sandbox-resolution, runtime-aware-prompt, and resume arg-order tests). Caught + fixed a real order-dependency I'd briefly introduced (top-level docker_service import → lazy+guarded), re-verified under adversarial ordering.

Live E2E on a real Codex agent booted from the rebuilt base image (key injected via CRED-002):

Check	Result
F-TOOLS	file created + read back (`HELLO_CODEX_E2E`); log `sandbox=danger-full-access`; tools ran (was `bwrap` fail)
F-MCP	`list_agents` via Trinity MCP returned the agent (was `unknown MCP server`)
turn-1	cost ≈ $0.046, session id
turn-2 resume	recalled prior-turn context with no tools; log `resume=True`
read-only	log `sandbox=read-only`
credential leak	`sk-proj` ×0 in agent logs

danger-full-access ran tools non-interactively (no approval prompt → no --dangerously-bypass-approvals-and-sandbox needed).

verify-local --skip-agent (the agent side was validated more thoroughly by the live E2E above; the dev stack held the global agent network).

Docs

architecture.md, feature-flows.md + feature-flows/codex-runtime.md, harness-authoring-guide.md, requirements.md, agent guide.

Implements #1187.

ℹ️ See the follow-up comment on read-only enforcement for Codex — an open design decision for the team (Codex has no PreToolUse hook).

🤖 Generated with Claude Code

…findings (#1187) Lands the #1187 "harness == runtime" work: Codex CLI joins Claude Code and Gemini as a pluggable AgentRuntime, plus the two blockers the live E2E surfaced that kept a Codex agent from *acting*. Core runtime (codex_runtime.py, built on the per-runtime primitives — NOT a shared helper, so it never inherits Gemini's blanket kill_cgroup_orphans): - `codex exec --json` JSONL parse (thread.started / turn.completed.usage / item.completed); reasoning_output_tokens ⊂ output_tokens (no double-count); `-o` output file authoritative (read-then-delete in finally); estimated cost. - chat continuity via `codex exec resume <thread_id>`; concurrency-safe `_drain_bounded` orphan cleanup; error→HTTP auth 503 / rate 429 / runtime-unavailable 500-not-503 / pipe-drop 502. - RuntimeCapabilities dataclass + per-runtime capabilities(); get_runtime() validates AGENT_RUNTIME and fails loud on unknown. Codex skips Claude subscription assign (is_claude_runtime); OPENAI_API_KEY read from .env; CODEX_HOME relocated under $TMPDIR (gitignored). Session tab gated off via RUNTIMES_WITHOUT_SESSION_TAB_RESUME; frontend RuntimeBadge/terminal/default model; test-codex template; @openai/codex@0.139.0 pinned. E2E findings resolved: - Resume arg-order: exec-level flags (--json/-C/--sandbox/-o) must precede the `resume` sub-subcommand (codex 0.139.0 rejects them after, exit 2 → every turn-2+ failed). Fixed + arg-order guard test. - F-TOOLS (sandbox): normal mode → `--sandbox danger-full-access`. Codex's own bubblewrap sandbox can't create a user namespace in the hardened agent container (`bwrap: No permissions to create a new namespace`), which blocked every shell tool. Dropping it leaves the Trinity container as the sole boundary — same posture as Claude/Gemini (cap_drop ALL + AppArmor + no-new-privileges); no container hardening is weakened. Read-only keeps `--sandbox read-only`. - F-MCP (runtime-aware prompt): PLATFORM_INSTRUCTIONS documented MCP tools with Claude's `mcp__trinity__<tool>` prefix; Codex auto-discovers tools by bare name and answered "unknown MCP server". get_platform_system_prompt(runtime=…) / compose_system_prompt(runtime=…) strip the prefix + add a Codex orientation note for runtime="codex"; threaded from chat.py + task_execution_service.py via the trinity.agent-runtime label (docker_service.get_agent_runtime, resolved lazily + guarded so a re-import under a stubbed docker_service in unit tests can't break dispatch). Claude/Gemini/unknown unchanged. - Test-isolation: test_validate_runtime.py hardened against the sibling sys.modules Mock leak. Housekeeping: cap bcrypt<5 in tests/requirements-test.txt (bcrypt 5.0.0 breaks passlib 1.7.4 at conftest DB-init in a fresh venv). Docs: architecture.md, feature-flows.md + feature-flows/codex-runtime.md, harness-authoring-guide.md, requirements.md, agent guide. Verification: full tests/unit green (2218 passed) incl. new sandbox-resolution, runtime-aware-prompt, and resume arg-order tests. Live F-TOOLS/F-MCP E2E re-verification pending a fresh OPENAI_API_KEY (the test key was rotated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AndriiPasternak31 · 2026-06-14T04:51:34Z

🟡 Open design decision: read-only enforcement for Codex

This PR ships read-only for Codex as --sandbox read-only (verified: a read-only turn logs [Codex] exec sandbox=read-only). I want to flag the underlying tradeoff for a team decision, because Codex's read-only story is weaker than Claude's and I don't want it to slip in unexamined.

The gap: Claude enforces read-only via a PreToolUse hook (read-only-guard.py) that inspects every tool call against blocked/allowed patterns. Codex has no PreToolUse hook, so we can't replicate that per-pattern enforcement. Its only native read-only mechanism is --sandbox read-only — but that sandbox is bubblewrap, the same mechanism we just had to drop for normal mode because it can't create a user namespace inside the hardened container (bwrap: No permissions to create a new namespace).

So there's a real question of whether --sandbox read-only actually enforces anything for Codex inside our container, or whether bwrap fails the same way and the agent silently has no sandbox. (In this PR's E2E the read-only turn was a no-tool prompt, so it proves the flag is wired, not that writes are blocked — that needs a dedicated write-attempt test.)

Options for the team:

(A) fail-closed — keep --sandbox read-only; if bwrap can't init, tools error → no writes possible → agent is effectively chat-only in read-only mode. Safe; this is what the PR ships as the interim.
(B) advisory — danger-full-access + a strong read-only prompt directive. Tools work, but write-protection is soft/unenforced (prompt-only).
(C) unsupported — block/warn read-only for Codex in the MVP and revisit when Codex exposes a tool-gating hook.

The PR ships (A) as the safe interim. Requesting a decision before this hardens into the contract — and either way we should add a write-attempt test that asserts a blocked write actually fails under read-only.

…ccess) normal agents run --sandbox danger-full-access; workspace-write/read-only both invoke bwrap (user namespace), which the hardened Trinity container forbids. The container is already the boundary, same posture as Claude/Gemini. Read-only agents keep --sandbox read-only as interim enforcement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

adds dispatcher routing for AGENT_RUNTIME=codex, empty-input no-op, and TOML scalar/escape primitive coverage (#1187 follow-up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

the backend/frontend serialize to_dict() to gate UI surfaces; pin every flag so the contract can't drift silently (#1187). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

covers docker_service.get_agent_runtime() label read + claude-code fallbacks, and task_execution_service._resolve_agent_runtime()'s guarded import — both best-effort, never-raise, never block dispatch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

adds tests for pricing prefix-match, system-prompt compose, result-file read/unlink, guardrail surfacing, API-key loading precedence, availability probe, JSONL event parser edges, and CODEX_HOME resolution (#1187). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-14T09:50:27Z

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

PR #1203's two red checks, without weakening anything real: - test_validate_runtime: adopt the _STUBBED_MODULE_NAMES + autouse _restore_sys_modules snapshot/restore pair so the required import-time stub eviction is the sanctioned escape hatch the sys.modules pollution lint recognizes (precedent: test_telegram_webhook_backfill), instead of a bare sys.modules.pop the linter flags. - codex_runtime: add a sink-side _ensure_within() containment guard on the -o result file's open()/unlink(). The filename is already reduced to a safe token (_safe_result_token + fixed -last.txt suffix), so the two CodeQL path alerts are false positives — but anchoring a realpath containment check at the filesystem sink is defense-in-depth and the barrier CodeQL recognizes. _read_and_consume_result_file/_safe_unlink now take a required base arg; both production call sites pass codex_home and a tripped guard logs a warning rather than silently swallowing. - tests: update the 4 path-helper call sites for the new signature and add test_ensure_within_rejects_escape (out-of-base path refused; wrappers no-op without touching the file). The remaining CodeQL "uncontrolled command line" alert (list-form Popen, shell=False, cmd[0]="codex", prompt after a -- separator) is dismissed via the API as a verified false positive — no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vybe

✅ Validated & approved (/validate-pr) — clean, no defects. Recommend a /cso --diff security pass before merge (see below).

P1 infrastructure feature (Closes #1187). 42 files, +4423/-41 — large, but the bulk is the new runtime + 12 test files + thorough docs; scope is coherent.

Validation — all green:

Base → dev ✓ · 7 conventional commits ✓ · full CI matrix green (build, verify-non-root, schema-parity, prod-image-smoke, pytest ×6 seeds, CodeQL, regression-diff)
Docs exceptional: architecture + requirements + feature-flows index + new codex-runtime.md + harness-authoring-guide + TRINITY_COMPATIBLE_AGENT_GUIDE. New capability → requirements + feature-flow both present ✓
Security scan clean: no real secrets (the trinity_mcp_SUPERSECRETVALUE hit is a test fixture in test_codex_mcp_config.py; the sk-* strings are test fixtures); MCP token referenced via bearer_token_env_var (never persisted); OPENAI_API_KEY read from .env/process env, not exported into the agent-server process; CODEX_HOME under $TMPDIR + gitignored.
Invariant #5 (agent-server + backend both updated) ✓; #13 (no MCP-server change needed — Codex consumes the existing Trinity MCP via config.toml) ✓; no Dockerfile COPY gap (agent_server/ copied wholesale, so codex_runtime.py ships) ✓; no new backend env vars.
codex_runtime.py correctly built on per-runtime primitives (never inherits Gemini's blanket kill_cgroup_orphans()); resume arg-order fix + guard test; runtime-aware prompt (strips mcp__trinity__ for Codex); cost/token parse avoids reasoning double-count.

Sandbox posture change — reviewed, sound: normal mode → --sandbox danger-full-access drops Codex's inner bubblewrap (which can't create a userns in the hardened container → blocks every shell tool). Claude/Gemini already run with no inner sandbox, so this is parity, not a regression — the Trinity container (non-root, CAP_DROP, network isolation) remains the boundary. Read-only mode correctly keeps --sandbox read-only via ~/.trinity/read-only-config.json (fail-open consistent with the Claude reference hook).

⚠️ Recommended before merge (P1 methodology gate): run /cso --diff for a deep security pass focused on the danger-full-access decision, the Codex subprocess env/credential flow, and the credential sanitizer over Codex output — /validate-pr + CodeQL are green, but the full P1 pipeline (/review + /validate-pr + /cso) prescribes it for a sandbox-posture change.

Merge note: this shares task_execution_service.py + docs/memory/architecture.md + docker/base-image/Dockerfile with #1195/#1196 — land it last (it pays the rebase). Base-image rebuild required (@openai/codex@0.139.0).

…e-1187-plan # Conflicts: # docs/memory/feature-flows.md

github-advanced-security AI found potential problems Jun 14, 2026

View reviewed changes

Comment thread docker/base-image/agent_server/services/codex_runtime.py Fixed

Comment thread docker/base-image/agent_server/services/codex_runtime.py Fixed

Comment thread docker/base-image/agent_server/services/codex_runtime.py Dismissed

AndriiPasternak31 requested review from dolho, oleksandr-korin and vybe June 14, 2026 04:57

AndriiPasternak31 self-assigned this Jun 14, 2026

AndriiPasternak31 and others added 5 commits June 14, 2026 06:30

test(codex): cover MCP template-server dispatcher + TOML edge cases

5e22c20

adds dispatcher routing for AGENT_RUNTIME=codex, empty-input no-op, and TOML scalar/escape primitive coverage (#1187 follow-up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(codex): assert RuntimeCapabilities.to_dict() round-trips as JSON

0e366aa

the backend/frontend serialize to_dict() to gate UI surfaces; pin every flag so the contract can't drift silently (#1187). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-advanced-security AI found potential problems Jun 14, 2026

View reviewed changes

Comment thread docker/base-image/agent_server/services/codex_runtime.py Dismissed

Comment thread docker/base-image/agent_server/services/codex_runtime.py Dismissed

vybe mentioned this pull request Jun 15, 2026

fix(executor): consume the terminal-write CAS verdict (#671/H4) + #1095 research doc #1195

Merged

vybe approved these changes Jun 15, 2026

View reviewed changes

Merge remote-tracking branch 'origin/dev' into AndriiPasternak31/issu…

3a471d9

…e-1187-plan # Conflicts: # docs/memory/feature-flows.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent-runtime): OpenAI Codex as the third agent runtime + resolve E2E findings (#1187)#1203

feat(agent-runtime): OpenAI Codex as the third agent runtime + resolve E2E findings (#1187)#1203
AndriiPasternak31 wants to merge 8 commits into
devfrom
AndriiPasternak31/issue-1187-plan

AndriiPasternak31 commented Jun 14, 2026

Uh oh!

AndriiPasternak31 commented Jun 14, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 14, 2026

Uh oh!

Uh oh!

Uh oh!

vybe left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

AndriiPasternak31 commented Jun 14, 2026

Summary

What's in it

🔒 Security note

Test plan

Docs

Uh oh!

AndriiPasternak31 commented Jun 14, 2026

🟡 Open design decision: read-only enforcement for Codex

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 14, 2026

Uh oh!

Uh oh!

Uh oh!

vybe left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants