Skip to content

feat(agent-runtime): OpenAI Codex as the third agent runtime + resolve E2E findings (#1187)#1203

Open
AndriiPasternak31 wants to merge 8 commits into
devfrom
AndriiPasternak31/issue-1187-plan
Open

feat(agent-runtime): OpenAI Codex as the third agent runtime + resolve E2E findings (#1187)#1203
AndriiPasternak31 wants to merge 8 commits into
devfrom
AndriiPasternak31/issue-1187-plan

Conversation

@AndriiPasternak31

Copy link
Copy Markdown
Contributor

Summary

Lands the #1187 "harness == runtime" work — OpenAI Codex joins Claude Code and Gemini as a pluggable AgentRuntime inside the agent container — and resolves the two blockers the live E2E surfaced that kept a Codex agent from acting.

A Trinity harness IS an AgentRuntime. AGENT_RUNTIME (from template.yaml runtime:, also a trinity.agent-runtime label) selects one; runtime_adapter.get_runtime() validates against KNOWN_RUNTIMES and raises on an unknown value.

What's in it

Core runtime (codex_runtime.py, built on the per-runtime primitives — not a shared helper, so it never inherits Gemini's blanket kill_cgroup_orphans()):

  • codex exec --json JSONL parse (thread.started / turn.completed.usage / item.completed); reasoning_output_tokens ⊂ output_tokens (no double-count); -o output file authoritative (read-then-delete in finally); estimated cost (CODEX_PRICING).
  • chat continuity via codex exec resume <thread_id>; concurrency-safe _drain_bounded orphan cleanup; error→HTTP auth 503 / rate 429 / runtime-unavailable 500-not-503 / pipe-drop 502.
  • RuntimeCapabilities dataclass + per-runtime capabilities(); Codex skips Claude-subscription assign (is_claude_runtime); OPENAI_API_KEY read from .env; CODEX_HOME under $TMPDIR (gitignored). Session tab gated off via RUNTIMES_WITHOUT_SESSION_TAB_RESUME; frontend RuntimeBadge/terminal/default-model; test-codex template; @openai/codex@0.139.0 pinned.

E2E findings resolved (the four fixes):

  1. Resume arg-order — exec-level flags (--json/-C/--sandbox/-o) must precede the resume sub-subcommand (codex 0.139.0 rejects them after → exit 2 → every turn-2+ failed). Fixed + arg-order guard test.
  2. F-TOOLS (sandbox) — normal mode → --sandbox danger-full-access. Codex's own bubblewrap sandbox can't create a user namespace inside the hardened agent container (bwrap: No permissions to create a new namespace), which blocked every shell tool. Dropping it leaves the Trinity container as the sole boundary. Read-only keeps --sandbox read-only.
  3. F-MCP (runtime-aware prompt)PLATFORM_INSTRUCTIONS documented MCP tools with Claude's mcp__trinity__<tool> prefix; Codex auto-discovers tools by bare name and answered unknown MCP server. get_platform_system_prompt(runtime=…) / compose_system_prompt(runtime=…) strip the prefix + add a Codex orientation note for runtime="codex", threaded from routers/chat.py + services/task_execution_service.py via the trinity.agent-runtime label (docker_service.get_agent_runtime, resolved lazily + guarded so a re-import under a stubbed docker_service in unit tests can't break dispatch). Claude/Gemini/unknown unchanged.
  4. Test-isolationtest_validate_runtime.py hardened against the sibling sys.modules Mock leak.

Housekeeping: cap bcrypt<5 in tests/requirements-test.txt (bcrypt 5.0.0 removed the __about__ shim passlib 1.7.4 reads at import → breaks every test at conftest DB-init in a fresh venv).

🔒 Security note

Disabling Codex's internal bubblewrap sandbox (danger-full-access) does not weaken container hardening. The Trinity agent container is already the security boundary — cap_drop: ALL, AppArmor, no-new-privileges, network isolation — and runs with no userns grant and unchanged caps. This is exactly the posture under which Claude Code and Gemini already run (no internal sandbox). The change drops a redundant inner sandbox that simply cannot function inside the hardened container.

Test plan

  • Unit: full tests/unit green — 2218 passed, 24 skipped, 0 fail (incl. new sandbox-resolution, runtime-aware-prompt, and resume arg-order tests). Caught + fixed a real order-dependency I'd briefly introduced (top-level docker_service import → lazy+guarded), re-verified under adversarial ordering.

  • Live E2E on a real Codex agent booted from the rebuilt base image (key injected via CRED-002):

    Check Result
    F-TOOLS file created + read back (HELLO_CODEX_E2E); log sandbox=danger-full-access; tools ran (was bwrap fail)
    F-MCP list_agents via Trinity MCP returned the agent (was unknown MCP server)
    turn-1 cost ≈ $0.046, session id
    turn-2 resume recalled prior-turn context with no tools; log resume=True
    read-only log sandbox=read-only
    credential leak sk-proj ×0 in agent logs

    danger-full-access ran tools non-interactively (no approval prompt → no --dangerously-bypass-approvals-and-sandbox needed).

  • verify-local --skip-agent (the agent side was validated more thoroughly by the live E2E above; the dev stack held the global agent network).

Docs

architecture.md, feature-flows.md + feature-flows/codex-runtime.md, harness-authoring-guide.md, requirements.md, agent guide.

Implements #1187.

ℹ️ See the follow-up comment on read-only enforcement for Codex — an open design decision for the team (Codex has no PreToolUse hook).

🤖 Generated with Claude Code

…findings (#1187)

Lands the #1187 "harness == runtime" work: Codex CLI joins Claude Code and
Gemini as a pluggable AgentRuntime, plus the two blockers the live E2E surfaced
that kept a Codex agent from *acting*.

Core runtime (codex_runtime.py, built on the per-runtime primitives — NOT a
shared helper, so it never inherits Gemini's blanket kill_cgroup_orphans):
- `codex exec --json` JSONL parse (thread.started / turn.completed.usage /
  item.completed); reasoning_output_tokens ⊂ output_tokens (no double-count);
  `-o` output file authoritative (read-then-delete in finally); estimated cost.
- chat continuity via `codex exec resume <thread_id>`; concurrency-safe
  `_drain_bounded` orphan cleanup; error→HTTP auth 503 / rate 429 /
  runtime-unavailable 500-not-503 / pipe-drop 502.
- RuntimeCapabilities dataclass + per-runtime capabilities(); get_runtime()
  validates AGENT_RUNTIME and fails loud on unknown. Codex skips Claude
  subscription assign (is_claude_runtime); OPENAI_API_KEY read from .env;
  CODEX_HOME relocated under $TMPDIR (gitignored). Session tab gated off via
  RUNTIMES_WITHOUT_SESSION_TAB_RESUME; frontend RuntimeBadge/terminal/default
  model; test-codex template; @openai/codex@0.139.0 pinned.

E2E findings resolved:
- Resume arg-order: exec-level flags (--json/-C/--sandbox/-o) must precede the
  `resume` sub-subcommand (codex 0.139.0 rejects them after, exit 2 → every
  turn-2+ failed). Fixed + arg-order guard test.
- F-TOOLS (sandbox): normal mode → `--sandbox danger-full-access`. Codex's own
  bubblewrap sandbox can't create a user namespace in the hardened agent
  container (`bwrap: No permissions to create a new namespace`), which blocked
  every shell tool. Dropping it leaves the Trinity container as the sole
  boundary — same posture as Claude/Gemini (cap_drop ALL + AppArmor +
  no-new-privileges); no container hardening is weakened. Read-only keeps
  `--sandbox read-only`.
- F-MCP (runtime-aware prompt): PLATFORM_INSTRUCTIONS documented MCP tools with
  Claude's `mcp__trinity__<tool>` prefix; Codex auto-discovers tools by bare
  name and answered "unknown MCP server". get_platform_system_prompt(runtime=…)
  / compose_system_prompt(runtime=…) strip the prefix + add a Codex orientation
  note for runtime="codex"; threaded from chat.py + task_execution_service.py
  via the trinity.agent-runtime label (docker_service.get_agent_runtime,
  resolved lazily + guarded so a re-import under a stubbed docker_service in
  unit tests can't break dispatch). Claude/Gemini/unknown unchanged.
- Test-isolation: test_validate_runtime.py hardened against the sibling
  sys.modules Mock leak.

Housekeeping: cap bcrypt<5 in tests/requirements-test.txt (bcrypt 5.0.0 breaks
passlib 1.7.4 at conftest DB-init in a fresh venv).

Docs: architecture.md, feature-flows.md + feature-flows/codex-runtime.md,
harness-authoring-guide.md, requirements.md, agent guide.

Verification: full tests/unit green (2218 passed) incl. new sandbox-resolution,
runtime-aware-prompt, and resume arg-order tests. Live F-TOOLS/F-MCP E2E
re-verification pending a fresh OPENAI_API_KEY (the test key was rotated).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AndriiPasternak31

Copy link
Copy Markdown
Contributor Author

🟡 Open design decision: read-only enforcement for Codex

This PR ships read-only for Codex as --sandbox read-only (verified: a read-only turn logs [Codex] exec sandbox=read-only). I want to flag the underlying tradeoff for a team decision, because Codex's read-only story is weaker than Claude's and I don't want it to slip in unexamined.

The gap: Claude enforces read-only via a PreToolUse hook (read-only-guard.py) that inspects every tool call against blocked/allowed patterns. Codex has no PreToolUse hook, so we can't replicate that per-pattern enforcement. Its only native read-only mechanism is --sandbox read-only — but that sandbox is bubblewrap, the same mechanism we just had to drop for normal mode because it can't create a user namespace inside the hardened container (bwrap: No permissions to create a new namespace).

So there's a real question of whether --sandbox read-only actually enforces anything for Codex inside our container, or whether bwrap fails the same way and the agent silently has no sandbox. (In this PR's E2E the read-only turn was a no-tool prompt, so it proves the flag is wired, not that writes are blocked — that needs a dedicated write-attempt test.)

Options for the team:

  • (A) fail-closed — keep --sandbox read-only; if bwrap can't init, tools error → no writes possible → agent is effectively chat-only in read-only mode. Safe; this is what the PR ships as the interim.
  • (B) advisorydanger-full-access + a strong read-only prompt directive. Tools work, but write-protection is soft/unenforced (prompt-only).
  • (C) unsupported — block/warn read-only for Codex in the MVP and revisit when Codex exposes a tool-gating hook.

The PR ships (A) as the safe interim. Requesting a decision before this hardens into the contract — and either way we should add a write-attempt test that asserts a blocked write actually fails under read-only.

Comment thread docker/base-image/agent_server/services/codex_runtime.py Fixed
Comment thread docker/base-image/agent_server/services/codex_runtime.py Fixed
Comment thread docker/base-image/agent_server/services/codex_runtime.py Dismissed
AndriiPasternak31 and others added 5 commits June 14, 2026 06:30
…ccess)

normal agents run --sandbox danger-full-access; workspace-write/read-only
both invoke bwrap (user namespace), which the hardened Trinity container
forbids. The container is already the boundary, same posture as Claude/Gemini.
Read-only agents keep --sandbox read-only as interim enforcement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
adds dispatcher routing for AGENT_RUNTIME=codex, empty-input no-op,
and TOML scalar/escape primitive coverage (#1187 follow-up).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
the backend/frontend serialize to_dict() to gate UI surfaces; pin every
flag so the contract can't drift silently (#1187).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
covers docker_service.get_agent_runtime() label read + claude-code
fallbacks, and task_execution_service._resolve_agent_runtime()'s guarded
import — both best-effort, never-raise, never block dispatch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
adds tests for pricing prefix-match, system-prompt compose, result-file
read/unlink, guardrail surfacing, API-key loading precedence, availability
probe, JSONL event parser edges, and CODEX_HOME resolution (#1187).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

PR #1203's two red checks, without weakening anything real:

- test_validate_runtime: adopt the _STUBBED_MODULE_NAMES + autouse
  _restore_sys_modules snapshot/restore pair so the required import-time
  stub eviction is the sanctioned escape hatch the sys.modules pollution
  lint recognizes (precedent: test_telegram_webhook_backfill), instead of
  a bare sys.modules.pop the linter flags.

- codex_runtime: add a sink-side _ensure_within() containment guard on the
  -o result file's open()/unlink(). The filename is already reduced to a
  safe token (_safe_result_token + fixed -last.txt suffix), so the two
  CodeQL path alerts are false positives — but anchoring a realpath
  containment check at the filesystem sink is defense-in-depth and the
  barrier CodeQL recognizes. _read_and_consume_result_file/_safe_unlink now
  take a required base arg; both production call sites pass codex_home and a
  tripped guard logs a warning rather than silently swallowing.

- tests: update the 4 path-helper call sites for the new signature and add
  test_ensure_within_rejects_escape (out-of-base path refused; wrappers
  no-op without touching the file).

The remaining CodeQL "uncontrolled command line" alert (list-form Popen,
shell=False, cmd[0]="codex", prompt after a -- separator) is dismissed via
the API as a verified false positive — no code change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread docker/base-image/agent_server/services/codex_runtime.py Dismissed
Comment thread docker/base-image/agent_server/services/codex_runtime.py Dismissed

@vybe vybe left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated & approved (/validate-pr) — clean, no defects. Recommend a /cso --diff security pass before merge (see below).

P1 infrastructure feature (Closes #1187). 42 files, +4423/-41 — large, but the bulk is the new runtime + 12 test files + thorough docs; scope is coherent.

Validation — all green:

  • Base → dev ✓ · 7 conventional commits ✓ · full CI matrix green (build, verify-non-root, schema-parity, prod-image-smoke, pytest ×6 seeds, CodeQL, regression-diff)
  • Docs exceptional: architecture + requirements + feature-flows index + new codex-runtime.md + harness-authoring-guide + TRINITY_COMPATIBLE_AGENT_GUIDE. New capability → requirements + feature-flow both present ✓
  • Security scan clean: no real secrets (the trinity_mcp_SUPERSECRETVALUE hit is a test fixture in test_codex_mcp_config.py; the sk-* strings are test fixtures); MCP token referenced via bearer_token_env_var (never persisted); OPENAI_API_KEY read from .env/process env, not exported into the agent-server process; CODEX_HOME under $TMPDIR + gitignored.
  • Invariant #5 (agent-server + backend both updated) ✓; #13 (no MCP-server change needed — Codex consumes the existing Trinity MCP via config.toml) ✓; no Dockerfile COPY gap (agent_server/ copied wholesale, so codex_runtime.py ships) ✓; no new backend env vars.
  • codex_runtime.py correctly built on per-runtime primitives (never inherits Gemini's blanket kill_cgroup_orphans()); resume arg-order fix + guard test; runtime-aware prompt (strips mcp__trinity__ for Codex); cost/token parse avoids reasoning double-count.

Sandbox posture change — reviewed, sound: normal mode → --sandbox danger-full-access drops Codex's inner bubblewrap (which can't create a userns in the hardened container → blocks every shell tool). Claude/Gemini already run with no inner sandbox, so this is parity, not a regression — the Trinity container (non-root, CAP_DROP, network isolation) remains the boundary. Read-only mode correctly keeps --sandbox read-only via ~/.trinity/read-only-config.json (fail-open consistent with the Claude reference hook).

⚠️ Recommended before merge (P1 methodology gate): run /cso --diff for a deep security pass focused on the danger-full-access decision, the Codex subprocess env/credential flow, and the credential sanitizer over Codex output — /validate-pr + CodeQL are green, but the full P1 pipeline (/review + /validate-pr + /cso) prescribes it for a sandbox-posture change.

Merge note: this shares task_execution_service.py + docs/memory/architecture.md + docker/base-image/Dockerfile with #1195/#1196 — land it last (it pays the rebase). Base-image rebuild required (@openai/codex@0.139.0).

…e-1187-plan

# Conflicts:
#	docs/memory/feature-flows.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants