From 03c4925597cfe69a67d4d251aa994b767cbc171f Mon Sep 17 00:00:00 2001
From: Andrii Pasternak
Date: Sun, 14 Jun 2026 05:34:49 +0100
Subject: [PATCH 01/10] feat(agent-runtime): OpenAI Codex as the third runtime + resolve E2E findings (#1187)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Lands the #1187 "harness == runtime" work: Codex CLI joins Claude Code and Gemini as a pluggable AgentRuntime, plus the two blockers the live E2E surfaced that kept a Codex agent from *acting*.

Core runtime (codex_runtime.py, built on the per-runtime primitives — NOT a shared helper, so it never inherits Gemini's blanket kill_cgroup_orphans):
- `codex exec --json` JSONL parse (thread.started / turn.completed.usage / item.completed); reasoning_output_tokens ⊂ output_tokens (no double-count); `-o` output file authoritative (read-then-delete in finally); estimated cost.
- chat continuity via `codex exec resume `; concurrency-safe `_drain_bounded` orphan cleanup; error→HTTP auth 503 / rate 429 / runtime-unavailable 500-not-503 / pipe-drop 502.
- RuntimeCapabilities dataclass + per-runtime capabilities(); get_runtime() validates AGENT_RUNTIME and fails loud on unknown.

Codex skips Claude subscription assign (is_claude_runtime); OPENAI_API_KEY read from .env; CODEX_HOME relocated under $TMPDIR (gitignored). Session tab gated off via RUNTIMES_WITHOUT_SESSION_TAB_RESUME; frontend RuntimeBadge/terminal/default model; test-codex template; @openai/codex@0.139.0 pinned.

E2E findings resolved:
- Resume arg-order: exec-level flags (--json/-C/--sandbox/-o) must precede the `resume` sub-subcommand (codex 0.139.0 rejects them after, exit 2 → every turn-2+ failed). Fixed + arg-order guard test.
- F-TOOLS (sandbox): normal mode → `--sandbox danger-full-access`. Codex's own bubblewrap sandbox can't create a user namespace in the hardened agent container (`bwrap: No permissions to create a new namespace`), which blocked every shell tool. Dropping it leaves the Trinity container as the sole boundary — same posture as Claude/Gemini (cap_drop ALL + AppArmor + no-new-privileges); no container hardening is weakened. Read-only keeps `--sandbox read-only`.
- F-MCP (runtime-aware prompt): PLATFORM_INSTRUCTIONS documented MCP tools with Claude's `mcp__trinity__` prefix; Codex auto-discovers tools by bare name and answered "unknown MCP server". get_platform_system_prompt(runtime=…) / compose_system_prompt(runtime=…) strip the prefix + add a Codex orientation note for runtime="codex"; threaded from chat.py + task_execution_service.py via the trinity.agent-runtime label (docker_service.get_agent_runtime, resolved lazily + guarded so a re-import under a stubbed docker_service in unit tests can't break dispatch). Claude/Gemini/unknown unchanged.
- Test-isolation: test_validate_runtime.py hardened against the sibling sys.modules Mock leak.

Housekeeping: cap bcrypt<5 in tests/requirements-test.txt (bcrypt 5.0.0 breaks passlib 1.7.4 at conftest DB-init in a fresh venv).

Docs: architecture.md, feature-flows.md + feature-flows/codex-runtime.md, harness-authoring-guide.md, requirements.md, agent guide.

Verification: full tests/unit green (2218 passed) incl. new sandbox-resolution, runtime-aware-prompt, and resume arg-order tests. Live F-TOOLS/F-MCP E2E re-verification pending a fresh OPENAI_API_KEY (the test key was rotated).
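Illustration of the resume arg-order finding above. This is a minimal sketch, not the code in codex_runtime.py: `build_codex_exec_args` is a hypothetical helper name, and placing the prompt as the final positional argument is an assumption. It uses only the flags named in this message.

```python
from typing import List, Optional


def build_codex_exec_args(
    prompt: str,
    workdir: str,
    sandbox: str,
    output_file: str,
    thread_id: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: assemble argv for `codex exec`, optionally resuming."""
    # Exec-level flags go first: per the finding above, codex 0.139.0
    # rejects them (exit 2) when they appear after `resume`.
    args = [
        "codex", "exec",
        "--json",
        "-C", workdir,
        "--sandbox", sandbox,
        "-o", output_file,
    ]
    if thread_id:
        # The `resume` sub-subcommand and its thread id follow the flags.
        args += ["resume", thread_id]
    # Assumption: the prompt is passed as the trailing positional argument.
    args.append(prompt)
    return args
```

A guard test in the spirit of the one mentioned above could assert that `resume` sits after every exec-level flag in the returned list, and that a fresh (non-resume) turn never contains `resume` at all.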
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 config/agent-templates/test-codex/CLAUDE.md   |   43 +
 .../agent-templates/test-codex/template.yaml  |   34 +
 docker/base-image/Dockerfile                  |    7 +
 docker/base-image/agent_server/models.py      |    6 +
 .../agent_server/services/claude_code.py      |   14 +-
 .../agent_server/services/codex_runtime.py    | 1035 +++++++++++++++++
 .../agent_server/services/gemini_runtime.py   |   14 +-
 .../agent_server/services/runtime_adapter.py  |   69 +-
 .../agent_server/services/trinity_mcp.py      |  226 ++++
 docker/base-image/startup.sh                  |   25 +
 docs/TRINITY_COMPATIBLE_AGENT_GUIDE.md        |    1 +
 docs/memory/architecture.md                   |   10 +
 docs/memory/feature-flows.md                  |    2 +
 docs/memory/feature-flows/codex-runtime.md    |  119 ++
 docs/memory/harness-authoring-guide.md        |  165 +++
 docs/memory/requirements.md                   |   47 +
 src/backend/routers/chat.py                   |   12 +-
 src/backend/routers/sessions.py               |   40 +-
 src/backend/services/agent_service/crud.py    |   47 +-
 src/backend/services/agent_service/helpers.py |   43 +
 .../services/agent_service/lifecycle.py       |   19 +-
 .../services/agent_service/terminal.py        |    7 +-
 src/backend/services/docker_service.py        |   24 +-
 src/backend/services/git_service.py           |    1 +
 .../services/platform_prompt_service.py       |   52 +-
 .../services/task_execution_service.py        |   23 +-
 src/frontend/src/components/AgentTerminal.vue |   12 +-
 src/frontend/src/components/RuntimeBadge.vue  |   24 +
 src/frontend/src/views/AgentDetail.vue        |    8 +-
 tests/requirements-test.txt                   |    5 +-
 tests/unit/test_agent_readiness_probe.py      |    1 +
 tests/unit/test_codex_backend_inertness.py    |   50 +
 tests/unit/test_codex_mcp_config.py           |  197 ++++
 tests/unit/test_codex_runtime.py              |  589 ++++++++++
 .../test_codex_skips_subscription_assign.py   |   41 +
 tests/unit/test_platform_prompt_runtime.py    |  105 ++
 .../unit/test_runtime_capabilities_matrix.py  |   45 +
 tests/unit/test_runtime_factory_codex.py      |   43 +
 tests/unit/test_runtime_label_fast_path.py    |   64 +
 tests/unit/test_session_tab_gate_codex.py     |   61 +
 tests/unit/test_validate_runtime.py           |   67 ++
 41 files changed, 3356 insertions(+), 41 deletions(-)
 create mode 100644 config/agent-templates/test-codex/CLAUDE.md
 create mode 100644 config/agent-templates/test-codex/template.yaml
 create mode 100644 docker/base-image/agent_server/services/codex_runtime.py
 create mode 100644 docs/memory/feature-flows/codex-runtime.md
 create mode 100644 docs/memory/harness-authoring-guide.md
 create mode 100644 tests/unit/test_codex_backend_inertness.py
 create mode 100644 tests/unit/test_codex_mcp_config.py
 create mode 100644 tests/unit/test_codex_runtime.py
 create mode 100644 tests/unit/test_codex_skips_subscription_assign.py
 create mode 100644 tests/unit/test_platform_prompt_runtime.py
 create mode 100644 tests/unit/test_runtime_capabilities_matrix.py
 create mode 100644 tests/unit/test_runtime_factory_codex.py
 create mode 100644 tests/unit/test_runtime_label_fast_path.py
 create mode 100644 tests/unit/test_session_tab_gate_codex.py
 create mode 100644 tests/unit/test_validate_runtime.py

diff --git a/config/agent-templates/test-codex/CLAUDE.md b/config/agent-templates/test-codex/CLAUDE.md
new file mode 100644
index 000000000..8447af062
--- /dev/null
+++ b/config/agent-templates/test-codex/CLAUDE.md
@@ -0,0 +1,43 @@
+# Test Codex Agent
+
+You are a test agent running on **OpenAI's Codex CLI** (`codex exec`), Trinity's
+third agent runtime alongside Claude Code and Gemini.
+
+> Trinity mirrors this file to `AGENTS.md` at startup — Codex reads `AGENTS.md`,
+> not `CLAUDE.md`.
+
+## Your Purpose
+
+Validate that Trinity's Codex runtime works correctly:
+- Codex CLI integration (`codex exec --json`)
+- The `-o` durable result record (authoritative response)
+- MCP tool access (Trinity MCP wired via `config.toml`)
+- Cost tracking (estimated from `turn.completed.usage` tokens)
+- Chat continuity (`codex exec resume `)
+- Sandbox safety (`workspace-write` + network; `read-only` when the agent is read-only)
+
+## Key Differences from Claude Code
+
+1. 
**Instructions file:** You read `AGENTS.md` (Trinity mirrors `CLAUDE.md` → `AGENTS.md`). +2. **Cost:** No native cost field — Trinity estimates it from token counts. +3. **Sandbox:** You run under `--sandbox workspace-write` with network access; writes + outside the workspace are blocked. A read-only agent runs `--sandbox read-only`. +4. **Provider:** OpenAI (not Anthropic). +5. **Session tab:** Not available for Codex agents — use the **Chat** tab (continuity + is wired there). The Session tab's cached-UUID `--resume` model is Claude-specific. + +## Authentication + +Codex authenticates with `OPENAI_API_KEY`, injected via the agent's `.env` +(Quick Inject → `OPENAI_API_KEY`). Codex agents are not assigned a Claude +subscription. + +## Testing Commands + +When asked to test, verify: +- `/test` — basic functionality +- Tool calling works (shell commands, web search) +- MCP servers are accessible +- Cost / token tracking reports correctly + +Report any differences in behavior compared to Claude Code agents. diff --git a/config/agent-templates/test-codex/template.yaml b/config/agent-templates/test-codex/template.yaml new file mode 100644 index 000000000..b7e110980 --- /dev/null +++ b/config/agent-templates/test-codex/template.yaml @@ -0,0 +1,34 @@ +name: test-codex +display_name: Test Codex Agent +description: Test agent using OpenAI's Codex CLI runtime for validation (#1187) +version: "1.0.0" +author: Trinity Platform +priority: 10 # Lower = higher in list (after system templates) + +type: business-assistant + +# Use the OpenAI Codex runtime instead of Claude Code +runtime: + type: codex + model: gpt-5.1-codex + +resources: + cpu: "2" + memory: "2g" + +capabilities: + - chat + - code-generation + - web-search + +mcp_servers: [] + +# Codex authenticates with an OpenAI API key read from the agent's .env +# (CRED-002). Declaring it here makes the Quick Inject UI prompt for it. 
+credentials: + env_file: + - OPENAI_API_KEY + +slash_commands: + - name: /test + description: Test command to verify the Codex runtime is working diff --git a/docker/base-image/Dockerfile b/docker/base-image/Dockerfile index c0593b188..aae9533cd 100644 --- a/docker/base-image/Dockerfile +++ b/docker/base-image/Dockerfile @@ -50,6 +50,13 @@ RUN npm install -g @anthropic-ai/claude-code@latest # Install Gemini CLI for multi-runtime support RUN npm install -g @google/gemini-cli +# Install OpenAI Codex CLI for multi-runtime support (#1187). +# Pinned (not @latest) so base-image rebuilds are reproducible and a +# breaking Codex release can't silently change the agent runtime. The npm +# package resolves the platform-specific Rust binary via optional deps; +# the linux-x64 build lands here. Bump deliberately after testing. +RUN npm install -g @openai/codex@0.139.0 + RUN useradd -m -s /bin/bash -u 1000 developer && \ echo "developer:developer" | chpasswd && \ usermod -aG sudo developer && \ diff --git a/docker/base-image/agent_server/models.py b/docker/base-image/agent_server/models.py index 4fa19d96d..c148154f7 100644 --- a/docker/base-image/agent_server/models.py +++ b/docker/base-image/agent_server/models.py @@ -109,6 +109,12 @@ class ExecutionMetadata(BaseModel): compact_events: List[CompactEvent] = [] # Auto-compact events observed mid-turn recovered_from_jsonl: bool = False # Stdout race + JSONL fallback fired (response from disk, not stream) model_name: Optional[str] = None # Actual model id from assistant.message.model (e.g., "claude-sonnet-4-5") — #678 + # #1187: typed terminal-result seed (the #945 taxonomy). Populated by + # newer runtimes (Codex) and currently UNUSED by the backend in the MVP — + # the backend still infers AUTH from the HTTP status. A fast-follow makes + # the backend read error_code directly and retire status-inference. 
+ status: Optional[str] = None # "success" | "error" + error_code: Optional[str] = None # "AUTH" | "RATE_LIMIT" | "TIMEOUT" | "AGENT_ERROR" | "RUNTIME_UNAVAILABLE" # ============================================================================ diff --git a/docker/base-image/agent_server/services/claude_code.py b/docker/base-image/agent_server/services/claude_code.py index 4725b5f44..c85acddb4 100644 --- a/docker/base-image/agent_server/services/claude_code.py +++ b/docker/base-image/agent_server/services/claude_code.py @@ -40,7 +40,7 @@ ) from .headless_executor import _attempt_empty_result_recovery, execute_headless_task from .process_registry import get_process_registry -from .runtime_adapter import AgentRuntime +from .runtime_adapter import AgentRuntime, RuntimeCapabilities from .stream_parser import process_stream_line from .subprocess_lifecycle import ( _capture_pgid, @@ -73,6 +73,18 @@ class ClaudeCodeRuntime(AgentRuntime): """Claude Code implementation of AgentRuntime interface.""" + @classmethod + def capabilities(cls) -> RuntimeCapabilities: + # Claude is the reference runtime: full continuity (--continue), the + # Session tab's cached-UUID --resume machinery, MCP, and native cost + # reporting (Claude Code emits total_cost_usd directly). (#1187) + return RuntimeCapabilities( + chat_continuity=True, + session_tab_resume=True, + mcp_support=True, + cost_reporting="native", + ) + def is_available(self) -> bool: """Check if Claude Code CLI is installed.""" try: diff --git a/docker/base-image/agent_server/services/codex_runtime.py b/docker/base-image/agent_server/services/codex_runtime.py new file mode 100644 index 000000000..e922e3870 --- /dev/null +++ b/docker/base-image/agent_server/services/codex_runtime.py @@ -0,0 +1,1035 @@ +"""OpenAI Codex CLI execution service (#1187). + +Implements the :class:`AgentRuntime` interface for OpenAI's Codex CLI, the third +Trinity agent runtime alongside Claude Code and Gemini. 
+ +Built **independently** on the existing per-runtime primitives (process +registry, concurrency-safe orphan drain, activity tracking, credential +sanitizer) rather than on a shared subprocess helper — see #1187 decision 4. +That keeps Codex from inheriting Gemini's blanket ``kill_cgroup_orphans()`` +(which SIGKILLs sibling executions in the same cgroup); Codex uses the +concurrency-safe ``_drain_bounded`` path that preserves other in-flight work. + +Safety parity with the Claude path (#1187 decision 8, Phase C): + * **System prompt / identity** — the backend's effective ``system_prompt`` + is prepended to every turn (Codex ``exec`` has no ``--append-system-prompt``); + persistent identity comes from ``AGENTS.md`` (startup copies ``CLAUDE.md``). + * **Read-only mode** — when ``~/.trinity/read-only-config.json`` is enabled, + Codex runs with ``--sandbox read-only`` (the Claude hook can't apply here). + * **Guardrails** — read-only is honored via the sandbox; ``disallowed_tools`` + that have no Codex equivalent are SURFACED in the logs, never silently + dropped. + * **Credential sanitization** — every stdout line, the final response, and + stderr pass through ``utils.credential_sanitizer`` exactly as the Claude / + headless paths do. + +Codex specifics: + * Non-interactive: ``codex exec [PROMPT]``; ``--json`` emits a JSONL event + stream; ``-o/--output-last-message FILE`` is the durable result record + (#548/#333) — read-then-delete in ``finally``. + * Continuity: ``codex exec resume `` replays a prior thread. + * No native cost — derived from ``turn.completed.usage`` token counts. 
+""" +from __future__ import annotations + +import asyncio +import json +import logging +import os +import re +import subprocess +import uuid +from concurrent.futures import ThreadPoolExecutor +from dataclasses import dataclass, field +from datetime import datetime +from pathlib import Path +from typing import Dict, List, Optional, Tuple + +from fastapi import HTTPException + +from ..models import ExecutionLogEntry, ExecutionMetadata +from ..state import agent_state +from ..utils.credential_sanitizer import ( + sanitize_dict, + sanitize_subprocess_line, + sanitize_text, +) +from ..utils.subprocess_pgroup import EXECUTION_TAG_NAME +from ._runtime_config import _DEFAULT_EXECUTION_TIMEOUT_SEC, _load_guardrails +from .activity_tracking import complete_tool_execution, start_tool_execution +from .process_registry import get_process_registry +from .runtime_adapter import AgentRuntime, RuntimeCapabilities +from .subprocess_lifecycle import ( + _capture_pgid, + _drain_bounded, + _safe_close_pipes, + _terminate_process_group, +) + +logger = logging.getLogger(__name__) + +# One long-lived reader-thread worker (mirrors claude_code.py / gemini_runtime.py). +# A fresh ThreadPoolExecutor per call relies on CPython's non-deterministic +# weakref cleanup of the worker thread under load (#333 hardening). +_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="codex-subproc") + +# GPT-5 context window (input). Cosmetic — drives the context gauge only. +CODEX_CONTEXT_WINDOW = 272000 + +# Codex / GPT-5 pricing per 1K tokens (USD). Codex reports no cost; we derive +# it from token counts. ``cached`` is the discounted rate for cached input +# tokens. Bump deliberately when OpenAI pricing changes (#1137-style). 
+CODEX_PRICING: Dict[str, Dict[str, float]] = { + "gpt-5.1-codex": {"input": 0.00125, "cached": 0.000125, "output": 0.01}, + "gpt-5.1-codex-max": {"input": 0.00125, "cached": 0.000125, "output": 0.01}, + "gpt-5-codex": {"input": 0.00125, "cached": 0.000125, "output": 0.01}, + "gpt-5.1": {"input": 0.00125, "cached": 0.000125, "output": 0.01}, + "gpt-5": {"input": 0.00125, "cached": 0.000125, "output": 0.01}, + "gpt-5-mini": {"input": 0.00025, "cached": 0.000025, "output": 0.002}, + "gpt-5-nano": {"input": 0.00005, "cached": 0.000005, "output": 0.0004}, + # Unknown / future model → GPT-5 standard pricing. + "default": {"input": 0.00125, "cached": 0.000125, "output": 0.01}, +} + + +def _resolve_pricing(model: Optional[str]) -> Dict[str, float]: + """Pricing for ``model`` — exact key first, then longest matching prefix, + then the ``default`` fallback (never KeyErrors).""" + if not model: + return CODEX_PRICING["default"] + key = model.lower() + if key in CODEX_PRICING: + return CODEX_PRICING[key] + # Longest-prefix match so "gpt-5.1-codex-2025-xx" resolves to the codex rate. + candidates = [k for k in CODEX_PRICING if k != "default" and key.startswith(k)] + if candidates: + return CODEX_PRICING[max(candidates, key=len)] + return CODEX_PRICING["default"] + + +def calculate_codex_cost( + input_tokens: int, + cached_input_tokens: int, + output_tokens: int, + model: Optional[str] = None, +) -> float: + """Estimated USD cost for a Codex turn. + + ``reasoning_output_tokens`` is a SUBSET of ``output_tokens`` — bill + ``output_tokens`` once, never ``output_tokens + reasoning_output_tokens``. + Cached input tokens bill at the cheaper cached rate; only the uncached + remainder bills at the full input rate. 
+ """ + pricing = _resolve_pricing(model) + uncached_input = max(0, input_tokens - cached_input_tokens) + cached = max(0, cached_input_tokens) + input_cost = (uncached_input / 1000) * pricing["input"] + ( + cached / 1000 + ) * pricing["cached"] + output_cost = (output_tokens / 1000) * pricing["output"] + return round(input_cost + output_cost, 6) + + +# --------------------------------------------------------------------------- +# Credentials, sandbox, CODEX_HOME (parity wiring — #1187 Phase C/T4) +# --------------------------------------------------------------------------- + +_API_KEY_VARS = ("OPENAI_API_KEY", "CODEX_API_KEY") +_AGENT_HOME = "/home/developer" +_READ_ONLY_CONFIG = Path(_AGENT_HOME) / ".trinity" / "read-only-config.json" + + +def _parse_env_value(raw_value: str) -> str: + """Extract a value from a ``.env`` ``KEY=VALUE`` right-hand side. + + Handles the shapes a human SSH-editing ``.env`` would produce that Trinity's + own plain ``KEY=VALUE`` writer never emits: a quoted value (the quotes are + stripped and an interior ``#`` is kept), and an unquoted value with a + trailing ``# inline comment`` (dropped at the first whitespace-``#``). + """ + value = raw_value.strip() + if value[:1] in ('"', "'"): + quote = value[0] + end = value.find(quote, 1) + return value[1:end] if end != -1 else value[1:] + comment = value.find(" #") + if comment != -1: + value = value[:comment].rstrip() + return value + + +def _load_openai_api_key() -> Optional[str]: + """Resolve the OpenAI/Codex API key. + + The per-agent ``.env`` (CRED-002) is copied to ``/home/developer/.env`` by + startup.sh but is NOT exported into the agent-server process — so unlike the + Claude/Gemini key (a container env var), the Codex key must be read from the + process env (if present) OR parsed out of ``.env`` (the cold-start path the + outside-voice review flagged). Accepts either OPENAI_API_KEY or CODEX_API_KEY. 
+ """ + for var in _API_KEY_VARS: + value = os.environ.get(var) + if value: + return value + env_path = Path(_AGENT_HOME) / ".env" + try: + for raw in env_path.read_text().splitlines(): + line = raw.strip() + if not line or line.startswith("#") or "=" not in line: + continue + # Tolerate `export KEY=VALUE` (a hand-edited .env), not just KEY=VALUE. + if line.startswith("export "): + line = line[len("export "):].lstrip() + key, _, value = line.partition("=") + if key.strip() in _API_KEY_VARS: + cleaned = _parse_env_value(value) + if cleaned: + return cleaned + except (IOError, OSError): + pass + return None + + +def _codex_home() -> str: + """Non-workspace home for Codex state + the ``-o`` result file. + + Codex defaults ``CODEX_HOME`` to ``~/.codex`` — inside the git-tracked agent + repo, which would dirty auto-sync. Relocate it under ``$TMPDIR`` (the + disk-backed ``/home/developer/.tmp`` scratch dir, #1098) which startup.sh + gitignores for Codex agents. + """ + explicit = os.environ.get("CODEX_HOME") + if explicit: + return explicit + tmpdir = os.environ.get("TMPDIR") or os.path.join(_AGENT_HOME, ".tmp") + return os.path.join(tmpdir, "codex") + + +def _ensure_codex_home() -> str: + home = _codex_home() + try: + os.makedirs(home, exist_ok=True) + except OSError as exc: # pragma: no cover - defensive + logger.warning("[Codex] could not create CODEX_HOME %s: %s", home, exc) + return home + + +def _is_read_only() -> bool: + """True when the backend has put this agent in read-only mode. + + The signal is the same JSON file the Claude read-only *hook* consumes + (``~/.trinity/read-only-config.json`` → ``enabled``). Codex can't run Claude + hooks, so we read the file directly and translate it to ``--sandbox + read-only`` (a sandbox-native, non-cooperative enforcement). + + An absent file ⇒ not read-only (the normal writable-agent state — silent). 
+ A present-but-unreadable/corrupt file fails OPEN **and logs**, matching the + reference hook (``read-only-guard.py`` logs ``read_only_config_load_error`` + and allows). Diverging one runtime to fail-closed would make read-only + enforcement inconsistent across runtimes (CSO #1187 finding 3); if the + platform wants fail-closed, change both loaders together in a dedicated + issue. + """ + try: + raw = _READ_ONLY_CONFIG.read_text() + except FileNotFoundError: + return False + except OSError as exc: + logger.warning( + "[Codex] read-only config unreadable (%s); treating as not read-only", exc + ) + return False + try: + return bool(json.loads(raw).get("enabled")) + except json.JSONDecodeError as exc: + logger.warning( + "[Codex] read-only config malformed (%s); treating as not read-only", exc + ) + return False + + +def _resolve_sandbox_mode() -> str: + """Map Trinity's mode to a Codex ``--sandbox`` value. + + Normal (writable) agents run with ``danger-full-access``, which DISABLES + Codex's own bubblewrap sandbox. ``workspace-write``/``read-only`` both invoke + ``bwrap`` to create a user namespace, which the hardened Trinity container + forbids (``bwrap: No permissions to create a new namespace``) — so any + in-sandbox mode blocks EVERY shell tool. The Trinity container is already the + security boundary (``cap_drop ALL`` + AppArmor + ``no-new-privileges``), + exactly the posture Claude and Gemini run under (no internal sandbox), so + dropping Codex's redundant inner sandbox weakens nothing. + + Read-only mode is the deliberate exception: it keeps ``--sandbox read-only`` + (sandbox-native write protection) as the interim enforcement. Read-only's + enforcement story for Codex is an open discussion on the #1187 PR. 
+ """ + return "read-only" if _is_read_only() else "danger-full-access" + + +def _surface_unmapped_guardrails(allowed_tools: Optional[List[str]]) -> None: + """Honor what maps to Codex's control surface; SURFACE (never silently + drop) the rest (#1187 decision 8 + the unresolved-decision caveat). + + Read-only is enforced via the sandbox. Claude ``disallowed_tools`` names + (Bash, Write, Edit, WebSearch, …) have no 1:1 Codex ``exec`` CLI toggle in + the MVP, so we log them at WARNING for operator visibility rather than + pretending they're enforced. + """ + guardrails = _load_guardrails() + disallowed = guardrails.get("disallowed_tools") or [] + if disallowed: + logger.warning( + "[Codex] guardrails disallow %s — Codex exec has no per-tool CLI " + "toggle in the MVP; only read-only (sandbox) and network access are " + "enforced. Tracking finer-grained Codex tool gating as a fast-follow.", + disallowed, + ) + if allowed_tools: + logger.info( + "[Codex] allowed_tools=%s requested; Codex exec runs its full tool " + "set under the sandbox (no allowlist CLI flag in the MVP).", + allowed_tools, + ) + + +def _compose_prompt(system_prompt: Optional[str], prompt: str) -> str: + """Codex ``exec`` has no system-prompt flag, so the effective platform + prompt (platform instructions + execution context + caller prompt, always + sent by the backend) is prepended to the user message. Persistent identity + additionally comes from AGENTS.md.""" + if system_prompt: + return f"{system_prompt}\n\n---\n\n{prompt}" + return prompt + + +def _read_and_consume_result_file(path: str) -> Optional[str]: + """Read the ``-o`` durable result file. 
Deletion is the caller's ``finally`` + (read-then-delete, happy + error path — #1187 decision 5).""" + try: + with open(path, "r", encoding="utf-8", errors="replace") as fh: + return fh.read() + except (IOError, OSError): + return None + + +def _safe_unlink(path: str) -> None: + try: + os.unlink(path) + except OSError: + pass + + +def _safe_result_token(execution_id: str) -> str: + """Filesystem-safe token for the ``-o`` result filename. ``execution_id`` is + system-generated today (uuid4 fallback / backend urlsafe token), but never + build a path from it unguarded: reduce it to a basename and a conservative + charset so a '/' or '..' can't escape CODEX_HOME (defense-in-depth — CSO + #1187 finding 2).""" + token = re.sub(r"[^A-Za-z0-9_.-]", "_", os.path.basename(execution_id)) + return token or "codex" + + +def _resolve_returned_session_id(metadata: ExecutionMetadata) -> Optional[str]: + """The thread id to cache for chat continuity (review I4). + + Codex emits ``thread.started`` on every ``exec``; if it somehow didn't, + return ``None`` so the next turn degrades to a fresh run — NOT a fabricated + id (e.g. the random ``execution_id``), which would make the next + ``codex exec resume `` fail hard and repeat every turn. + """ + return metadata.session_id + + +def _finalize_codex_response( + result_file_text: Optional[str], response_parts: List[str] +) -> str: + """The ``-o`` file is the authoritative response; JSONL ``agent_message`` + parts are the fallback when the file is missing/empty (#1187 decision 5).""" + if result_file_text and result_file_text.strip(): + return result_file_text.strip() + return "\n".join(response_parts).strip() + + +# --------------------------------------------------------------------------- +# JSONL event parsing +# --------------------------------------------------------------------------- + +# item.type values that represent tool/command activity (vs. agent_message / +# reasoning / todo_list). 
Confirmed against codex exec_events.rs ThreadItemDetails. +_CODEX_TOOL_ITEM_TYPES = { + "command_execution", + "file_change", + "mcp_tool_call", + "web_search", +} + +_CODEX_TOOL_DISPLAY = { + "command_execution": "Shell", + "file_change": "FileChange", + "mcp_tool_call": "McpTool", + "web_search": "WebSearch", +} + + +@dataclass +class _CodexParseState: + """Mutable accumulators threaded through per-event parsing.""" + + execution_log: List[ExecutionLogEntry] + metadata: ExecutionMetadata + response_parts: List[str] + model: Optional[str] = None + seen_tool_ids: set = field(default_factory=set) + + +def _tool_display_name(item: dict, item_type: str) -> str: + if item_type == "mcp_tool_call": + tool = item.get("tool") or item.get("name") + server = item.get("server") + if tool: + return f"{server}.{tool}" if server else str(tool) + return _CODEX_TOOL_DISPLAY.get(item_type, item_type) + + +def _tool_input(item: dict, item_type: str) -> dict: + if item_type == "command_execution": + return {"command": item.get("command")} + if item_type == "web_search": + return {"query": item.get("query")} + if item_type == "file_change": + return {"changes": item.get("changes")} + if item_type == "mcp_tool_call": + return {"arguments": item.get("arguments")} + return {} + + +def _tool_output(item: dict, item_type: str) -> str: + for key in ("aggregated_output", "output", "result", "stdout"): + value = item.get(key) + if isinstance(value, str) and value: + return value + return "" + + +def _record_tool_use(state: _CodexParseState, tool_id: str, item: dict, item_type: str) -> None: + if tool_id in state.seen_tool_ids: + return + state.seen_tool_ids.add(tool_id) + name = _tool_display_name(item, item_type) + tool_input = _tool_input(item, item_type) + state.execution_log.append( + ExecutionLogEntry( + id=tool_id, + type="tool_use", + tool=name, + input=tool_input, + timestamp=datetime.now().isoformat(), + ) + ) + try: + start_tool_execution(tool_id, name, tool_input) + except 
Exception: # noqa: BLE001 - activity tracking is best-effort + logger.debug("[Codex] start_tool_execution failed for %s", tool_id, exc_info=True) + + +def _record_tool_result(state: _CodexParseState, tool_id: str, item: dict, item_type: str) -> None: + name = _tool_display_name(item, item_type) + output = _tool_output(item, item_type) + status = item.get("status") + exit_code = item.get("exit_code") + is_error = status == "failed" or (isinstance(exit_code, int) and exit_code != 0) + state.execution_log.append( + ExecutionLogEntry( + id=tool_id, + type="tool_result", + tool=name, + output=output or None, + success=not is_error, + timestamp=datetime.now().isoformat(), + ) + ) + try: + complete_tool_execution(tool_id, not is_error, output) + except Exception: # noqa: BLE001 + logger.debug("[Codex] complete_tool_execution failed for %s", tool_id, exc_info=True) + + +def _process_codex_event(event: dict, state: _CodexParseState) -> None: + """Update ``state`` from one parsed Codex JSONL event. Tolerant of unknown + event/item types and missing fields — the ``-o`` file is authoritative for + the response, so a best-effort parser here only affects tokens, tool + activity, and error classification.""" + event_type = event.get("type") + + if event_type == "thread.started": + state.metadata.session_id = event.get("thread_id") or state.metadata.session_id + + elif event_type == "turn.completed": + usage = event.get("usage") or {} + input_tokens = int(usage.get("input_tokens") or 0) + cached = int(usage.get("cached_input_tokens") or 0) + output_tokens = int(usage.get("output_tokens") or 0) + # reasoning_output_tokens is a subset of output_tokens — do NOT add it. 
+ state.metadata.input_tokens = input_tokens + state.metadata.output_tokens = output_tokens + state.metadata.cache_read_tokens = cached + state.metadata.cost_usd = calculate_codex_cost( + input_tokens, cached, output_tokens, state.model + ) + + elif event_type == "turn.failed": + error = event.get("error") or {} + state.metadata.error_type = "turn_failed" + state.metadata.error_message = ( + error.get("message") if isinstance(error, dict) else str(error) + ) or "Codex turn failed" + + elif event_type == "error": + state.metadata.error_type = "error" + state.metadata.error_message = event.get("message") or "Codex error" + + elif event_type in ("item.started", "item.updated", "item.completed"): + item = event.get("item") or {} + item_type = item.get("type") or (item.get("details") or {}).get("type") + if not item_type: + return + item_id = item.get("id") or str(uuid.uuid4()) + + if item_type == "agent_message": + if event_type == "item.completed": + text = item.get("text") or item.get("message") or "" + if text: + state.response_parts.append(text) + elif item_type in _CODEX_TOOL_ITEM_TYPES: + if event_type == "item.started": + _record_tool_use(state, item_id, item, item_type) + elif event_type == "item.completed": + _record_tool_use(state, item_id, item, item_type) # no-op if seen + _record_tool_result(state, item_id, item, item_type) + elif item_type == "error": + state.metadata.error_type = "error" + state.metadata.error_message = ( + item.get("message") or state.metadata.error_message or "Codex item error" + ) + + +def parse_codex_jsonl( + lines: List[str], model: Optional[str] = None +) -> Tuple[str, List[ExecutionLogEntry], ExecutionMetadata, List[Dict]]: + """Parse a full Codex ``--json`` line stream (unit-test entrypoint). 
+ + Returns ``(response_text, execution_log, metadata, raw_messages)`` where + ``response_text`` is the JSONL-assembled fallback (the live path overrides + it with the ``-o`` file).""" + metadata = ExecutionMetadata() + metadata.context_window = CODEX_CONTEXT_WINDOW + state = _CodexParseState(execution_log=[], metadata=metadata, response_parts=[], model=model) + raw_messages: List[Dict] = [] + for line in lines: + line = line.strip() + if not line: + continue + try: + event = json.loads(line) + except json.JSONDecodeError: + continue + if isinstance(event, dict): + raw_messages.append(event) + _process_codex_event(event, state) + metadata.tool_count = len([e for e in state.execution_log if e.type == "tool_use"]) + response_text = "\n".join(state.response_parts).strip() + return response_text, state.execution_log, metadata, raw_messages + + +# --------------------------------------------------------------------------- +# Error classification (return-code path) +# --------------------------------------------------------------------------- + +# AUTH detection is anchored, not bare-substring. Bare "401"/"api key" are +# over-broad — a non-auth failure whose output merely contains "401" (e.g. an +# upstream MCP/tool returning 401) must NOT be read as an auth failure, because +# 503 is the backend's AUTH signal and the dispatch breaker counts AUTH only +# (#1187 decision 3, review I1). Each pattern names an actual auth condition and +# uses word boundaries so it won't fire on an incidental token. 
+_AUTH_PATTERNS = ( + re.compile(r"\bunauthorized\b", re.IGNORECASE), + re.compile(r"\b401\s+unauthorized\b", re.IGNORECASE), + re.compile(r"\b(?:invalid|incorrect|missing|no)[ _]api[ _]key\b", re.IGNORECASE), + re.compile(r"\bnot\s+authenticated\b", re.IGNORECASE), + re.compile(r"\bauthentication\s+(?:failed|error)\b", re.IGNORECASE), +) +_RATE_MARKERS = ("429", "rate limit", "rate_limit", "quota", "too many requests") + + +def _classify_codex_failure( + return_code: int, stderr: str, metadata: ExecutionMetadata +) -> Tuple[int, str]: + """Map a non-zero Codex exit (+ stderr + parsed error) to an HTTP status. + + auth → 503, rate-limit → 429, everything else → 500 (runtime-unavailable). + Crucially a generic runtime failure is 500, NOT 503 — 503 is the backend's + AUTH signal and the dispatch breaker counts AUTH only (#1187 decision 3).""" + haystack = " ".join( + s for s in (stderr or "", metadata.error_message or "") if s + ) + haystack_lower = haystack.lower() + if any(marker in haystack_lower for marker in _RATE_MARKERS): + return 429, f"Codex rate limit: {(stderr or metadata.error_message or '')[:300]}" + if any(pattern.search(haystack) for pattern in _AUTH_PATTERNS): + return 503, ( + f"Codex authentication failure: {(stderr or metadata.error_message or '')[:300]}. " + "Check OPENAI_API_KEY." + ) + detail = stderr.strip() or metadata.error_message or "see agent logs" + return 500, f"Codex execution failed (exit code {return_code}): {detail[:300]}" + + +# --------------------------------------------------------------------------- +# Runtime +# --------------------------------------------------------------------------- + +class CodexRuntime(AgentRuntime): + """OpenAI Codex CLI implementation of AgentRuntime.""" + + def __init__(self) -> None: + # Codex thread id for the interactive chat session (continuity). The + # singleton instance persists across /api/chat calls in a container. 
+ self._chat_thread_id: Optional[str] = None + + # -- capability declaration (#1187 Phase G) -------------------------------- + @classmethod + def capabilities(cls) -> RuntimeCapabilities: + return RuntimeCapabilities( + chat_continuity=True, # codex exec resume + session_tab_resume=False, # MVP: Session tab stays Claude/Gemini + mcp_support=True, # codex mcp add + cost_reporting="estimated", # no native cost → derived from tokens + ) + + def is_available(self) -> bool: + try: + result = subprocess.run( + ["codex", "--version"], capture_output=True, text=True, timeout=5 + ) + return result.returncode == 0 + except Exception: + return False + + def get_default_model(self) -> str: + return "gpt-5.1-codex" + + def get_context_window(self, model: Optional[str] = None) -> int: + return CODEX_CONTEXT_WINDOW + + def configure_mcp(self, mcp_servers: Dict) -> bool: + """Delegate to the shared Codex MCP configuration in trinity_mcp.py.""" + from .trinity_mcp import _configure_codex_mcp_servers + + return _configure_codex_mcp_servers(mcp_servers) + + # -- command construction -------------------------------------------------- + def _build_codex_command( + self, + *, + model: Optional[str], + sandbox_mode: str, + result_file: str, + agent_home: str, + resume_thread_id: Optional[str], + ) -> List[str]: + cmd = ["codex", "exec"] + # Exec-level flags belong to `codex exec`, NOT to the `resume` + # sub-subcommand. In codex 0.139.0, `exec resume [OPTIONS] [SESSION_ID] + # [PROMPT]` has a NARROWER option set and rejects -C/--sandbox/--json/-o + # ("error: unexpected argument '-C' found", exit 2 — breaks every + # turn-2+ continuity call). So they MUST be emitted BEFORE `resume`. 
+ cmd += [ + "--json", + "--skip-git-repo-check", + "-C", + agent_home, + "--sandbox", + sandbox_mode, + "-o", + result_file, + ] + # Normal mode is `danger-full-access` (no inner sandbox; the Trinity + # container is the boundary — see _resolve_sandbox_mode), which already + # permits network access, so no `sandbox_workspace_write.network_access` + # override is needed. Read-only stays `read-only`. We no longer emit + # `workspace-write` at all. + if model: + cmd += ["-m", model] + # Continuity: `codex exec resume ` replays a prior + # thread. Emitted AFTER the exec-level flags above (narrower arg set). + if resume_thread_id: + cmd += ["resume", resume_thread_id] + # End-of-options separator (review I3): the caller appends the prompt as + # the next (positional) token — for a resume it is resume's PROMPT arg — + # so a prompt starting with "-"/"--" can never be reparsed as a flag + # (worst case weakening the sandbox). + cmd.append("--") + return cmd + + # -- core subprocess execution (stubbed in unit tests) --------------------- + async def _execute_codex( + self, + *, + prompt: str, + model: Optional[str], + system_prompt: Optional[str], + resume_thread_id: Optional[str], + timeout_seconds: int, + allowed_tools: Optional[List[str]], + execution_id: Optional[str], + concurrent_reader: bool = False, + ) -> Tuple[str, List[ExecutionLogEntry], ExecutionMetadata, List[Dict], Optional[str]]: + execution_id = execution_id or str(uuid.uuid4()) + + api_key = _load_openai_api_key() + if not api_key: + raise HTTPException( + status_code=503, + detail=( + "OpenAI API key not configured in agent container. Inject " + "OPENAI_API_KEY via credentials." 
+ ), + ) + + codex_home = _ensure_codex_home() + result_file = os.path.join(codex_home, f"{_safe_result_token(execution_id)}-last.txt") + sandbox_mode = _resolve_sandbox_mode() + _surface_unmapped_guardrails(allowed_tools) + composed_prompt = _compose_prompt(system_prompt, prompt) + + cmd = self._build_codex_command( + model=model, + sandbox_mode=sandbox_mode, + result_file=result_file, + agent_home=_AGENT_HOME, + resume_thread_id=resume_thread_id, + ) + cmd.append(composed_prompt) + + env = { + **os.environ, + EXECUTION_TAG_NAME: execution_id, + "CODEX_HOME": codex_home, + # Inject under both names — the ecosystem standard is OPENAI_API_KEY; + # some Codex builds also read CODEX_API_KEY. Defensive (verified in + # /verify-local). + "OPENAI_API_KEY": api_key, + "CODEX_API_KEY": api_key, + } + + metadata = ExecutionMetadata() + metadata.context_window = self.get_context_window(model) + metadata.execution_id = execution_id + execution_log: List[ExecutionLogEntry] = [] + raw_messages: List[Dict] = [] + response_parts: List[str] = [] + state = _CodexParseState( + execution_log=execution_log, + metadata=metadata, + response_parts=response_parts, + model=model, + ) + stderr_lines: List[str] = [] + + registry = get_process_registry() + logger.info( + "[Codex] exec sandbox=%s resume=%s model=%s execution_id=%s", + sandbox_mode, bool(resume_thread_id), model or "(default)", execution_id, + ) + + # stdin=DEVNULL: the prompt is a positional arg, so Codex must not block + # waiting on stdin. start_new_session=True isolates the process group so + # cleanup signals only Codex's descendants, never sibling executions. 
+ process = subprocess.Popen( + cmd, + stdin=subprocess.DEVNULL, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True, + bufsize=1, + start_new_session=True, + env=env, + ) + process_pgid = _capture_pgid(process) + registry.register( + execution_id, process, metadata={"type": "codex", "pgid": process_pgid} + ) + + import threading + + def read_stdout() -> None: + try: + for line in iter(process.stdout.readline, ""): + if not line: + break + try: + sanitized = sanitize_subprocess_line(line) + try: + event = json.loads(sanitized.strip()) + except json.JSONDecodeError: + continue + if isinstance(event, dict): + event = sanitize_dict(event) + raw_messages.append(event) + try: + registry.publish_log_entry(execution_id, event) + except Exception as pub_err: # noqa: BLE001 + logger.debug( + "[Codex] publish_log_entry failed (continuing): %s", + pub_err, + ) + _process_codex_event(event, state) + except Exception as line_err: # noqa: BLE001 + logger.debug( + "[Codex] per-line processing error (continuing): %s", + line_err, + ) + except Exception as exc: # noqa: BLE001 + logger.error("[Codex] error reading stdout: %s", exc) + + def read_stderr() -> None: + try: + for line in iter(process.stderr.readline, ""): + if not line: + break + stderr_lines.append(line) + except Exception as exc: # noqa: BLE001 + logger.error("[Codex] error reading stderr: %s", exc) + + def read_subprocess_output() -> Tuple[str, int]: + stdout_thread = threading.Thread(target=read_stdout, daemon=True) + stderr_thread = threading.Thread(target=read_stderr, daemon=True) + stdout_thread.start() + stderr_thread.start() + try: + return_code = process.wait(timeout=timeout_seconds) + except subprocess.TimeoutExpired: + logger.error( + "[Codex] execution %s timed out after %ss — killing group", + execution_id, timeout_seconds, + ) + _terminate_process_group( + process, graceful_timeout=5, pgid=process_pgid, + execution_tag=execution_id, + ) + _drain_bounded( + process, stdout_thread, stderr_thread, 
grace=3, + pgid=process_pgid, execution_tag=execution_id, + ) + raise + _drain_bounded( + process, stdout_thread, stderr_thread, grace=5, + pgid=process_pgid, execution_tag=execution_id, + ) + stderr = "".join(stderr_lines) + return (sanitize_text(stderr) if stderr else stderr), return_code + + # The lock-serialized chat path uses the bounded single-worker executor; + # the concurrent /api/task path uses the loop's default executor so + # parallel task readers don't serialize behind one worker (review I2, + # parity with Claude's headless path). None → default executor. + reader_executor = None if concurrent_reader else _executor + + loop = asyncio.get_event_loop() + try: + try: + stderr_output, return_code = await asyncio.wait_for( + loop.run_in_executor(reader_executor, read_subprocess_output), + timeout=timeout_seconds + 60, + ) + except asyncio.TimeoutError: + logger.error( + "[Codex] outer timeout on %s — killing group as last resort", + execution_id, + ) + await loop.run_in_executor( + None, + lambda: _terminate_process_group( + process, graceful_timeout=2, pgid=process_pgid, + execution_tag=execution_id, + ), + ) + await loop.run_in_executor(None, _safe_close_pipes, process) + raise HTTPException( + status_code=504, + detail=f"Codex execution timed out after {timeout_seconds} seconds", + ) + except subprocess.TimeoutExpired: + raise HTTPException( + status_code=504, + detail=f"Codex execution timed out after {timeout_seconds} seconds", + ) + + if return_code != 0: + status_code, detail = _classify_codex_failure( + return_code, stderr_output, metadata + ) + # NOTE: no metadata.status write here — this path raises + # HTTPException and the local metadata is discarded, so the + # backend reads the failure from the HTTP status, not metadata. + logger.error("[Codex] %s", detail) + raise HTTPException(status_code=status_code, detail=detail) + + # -o file is authoritative; JSONL parts are the fallback. 
+ result_text = _read_and_consume_result_file(result_file) + response_text = _finalize_codex_response(result_text, response_parts) + response_text = sanitize_text(response_text) + + tool_use_count = len([e for e in execution_log if e.type == "tool_use"]) + metadata.tool_count = tool_use_count + if not response_text: + response_text = ( + "(Task completed)" if tool_use_count else "(No response from Codex)" + ) + metadata.status = "success" + session_id = _resolve_returned_session_id(metadata) + logger.info( + "[Codex] done execution_id=%s cost=$%s tokens=%s/%s tools=%s", + execution_id, metadata.cost_usd, metadata.input_tokens, + metadata.output_tokens, metadata.tool_count, + ) + return response_text, execution_log, metadata, raw_messages, session_id + finally: + # Read-then-delete in finally — happy + error path (#1187 decision 5). + _safe_unlink(result_file) + registry.unregister(execution_id) + + # -- public interface ------------------------------------------------------ + async def execute( + self, + prompt: str, + model: Optional[str] = None, + continue_session: bool = False, + stream: bool = False, + system_prompt: Optional[str] = None, + execution_id: Optional[str] = None, + ) -> Tuple[str, List[ExecutionLogEntry], ExecutionMetadata, List[Dict]]: + if not self.is_available(): + raise HTTPException( + status_code=503, + detail="Codex CLI is not available in this container", + ) + + resume_thread_id: Optional[str] = None + if continue_session and agent_state.session_started and self._chat_thread_id: + resume_thread_id = self._chat_thread_id + else: + agent_state.session_started = True + self._chat_thread_id = None + + guardrails = _load_guardrails() + timeout_seconds = int( + guardrails.get("execution_timeout_sec") or _DEFAULT_EXECUTION_TIMEOUT_SEC + ) + + try: + response, log, metadata, raw, session_id = await self._execute_codex( + prompt=prompt, + model=model, + system_prompt=system_prompt, + resume_thread_id=resume_thread_id, + 
timeout_seconds=timeout_seconds, + allowed_tools=None, + execution_id=execution_id, + concurrent_reader=False, # chat is lock-serialized → bounded reader + ) + except HTTPException: + raise + except TimeoutError as exc: + raise HTTPException(status_code=504, detail=str(exc)) + except (BrokenPipeError, ConnectionResetError) as pipe_err: + logger.info("[Codex] subprocess pipe closed before completion: %s", pipe_err) + raise HTTPException( + status_code=502, + detail="Agent subprocess closed before the chat could complete", + ) + except Exception as exc: # noqa: BLE001 + logger.error("[Codex] execution error: %s", exc) + raise HTTPException(status_code=500, detail=f"Execution error: {exc}") + + # Track thread id for the next continue_session turn. + if session_id: + self._chat_thread_id = session_id + agent_state.session_started = True + + # Update session rollups (mirrors the Gemini path). + if metadata.cost_usd: + agent_state.session_total_cost += metadata.cost_usd + agent_state.session_total_output_tokens += metadata.output_tokens + if metadata.input_tokens > agent_state.session_context_tokens: + agent_state.session_context_tokens = metadata.input_tokens + agent_state.session_context_window = metadata.context_window + return response, log, metadata, raw + + async def execute_headless( + self, + prompt: str, + model: Optional[str] = None, + allowed_tools: Optional[List[str]] = None, + system_prompt: Optional[str] = None, + timeout_seconds: int = 900, + max_turns: Optional[int] = None, + execution_id: Optional[str] = None, + resume_session_id: Optional[str] = None, + persist_session: bool = False, + images: Optional[List[Dict]] = None, + ) -> Tuple[str, List[ExecutionLogEntry], ExecutionMetadata, Optional[str]]: + if not self.is_available(): + raise HTTPException( + status_code=503, + detail="Codex CLI is not available in this container", + ) + if images: + logger.warning("[Codex] images are not supported in the MVP — ignoring") + if max_turns is not None: + 
logger.info( + "[Codex] max_turns=%s requested; Codex exec has no turn cap CLI " + "flag — relying on the %ss wall-clock timeout.", + max_turns, timeout_seconds, + ) + + try: + response, log, metadata, raw, session_id = await self._execute_codex( + prompt=prompt, + model=model, + system_prompt=system_prompt, + resume_thread_id=resume_session_id, + timeout_seconds=timeout_seconds, + allowed_tools=allowed_tools, + execution_id=execution_id, + concurrent_reader=True, # /api/task runs concurrently → default reader + ) + except HTTPException: + raise + except TimeoutError as exc: + raise HTTPException(status_code=504, detail=str(exc)) + except (BrokenPipeError, ConnectionResetError) as pipe_err: + # 502 (not 503) so the SUB-003 auth-switch isn't tripped by an early + # child exit — parity with the Claude/Gemini headless paths (#474). + logger.info("[Codex] subprocess pipe closed before completion: %s", pipe_err) + raise HTTPException( + status_code=502, + detail="Agent subprocess closed before task could complete", + ) + except Exception as exc: # noqa: BLE001 + logger.error("[Codex] task execution error: %s", exc) + raise HTTPException(status_code=500, detail=f"Task execution error: {exc}") + + return response, log, metadata, session_id + + +# Global Codex runtime instance (singleton, mirrors claude/gemini). 
+_codex_runtime: Optional[CodexRuntime] = None + + +def get_codex_runtime() -> CodexRuntime: + global _codex_runtime + if _codex_runtime is None: + _codex_runtime = CodexRuntime() + return _codex_runtime diff --git a/docker/base-image/agent_server/services/gemini_runtime.py b/docker/base-image/agent_server/services/gemini_runtime.py index ec8b15032..7d619069e 100644 --- a/docker/base-image/agent_server/services/gemini_runtime.py +++ b/docker/base-image/agent_server/services/gemini_runtime.py @@ -21,7 +21,7 @@ from ..utils.subprocess_pgroup import EXECUTION_TAG_NAME from ..utils.orphan_sweep import kill_cgroup_orphans from .activity_tracking import start_tool_execution, complete_tool_execution -from .runtime_adapter import AgentRuntime +from .runtime_adapter import AgentRuntime, RuntimeCapabilities logger = logging.getLogger(__name__) @@ -90,6 +90,18 @@ def calculate_gemini_cost(input_tokens: int, output_tokens: int, model: Optional class GeminiRuntime(AgentRuntime): """Gemini CLI implementation of AgentRuntime interface.""" + @classmethod + def capabilities(cls) -> RuntimeCapabilities: + # Gemini supports chat continuity (--resume) and MCP, but NOT the + # Session-tab cached-UUID resume (execute_headless ignores + # resume_session_id), and cost is derived from tokens. 
(#1187) + return RuntimeCapabilities( + chat_continuity=True, + session_tab_resume=False, + mcp_support=True, + cost_reporting="estimated", + ) + def is_available(self) -> bool: """Check if Gemini CLI is installed.""" try: diff --git a/docker/base-image/agent_server/services/runtime_adapter.py b/docker/base-image/agent_server/services/runtime_adapter.py index e40e19a61..a91bc1b61 100644 --- a/docker/base-image/agent_server/services/runtime_adapter.py +++ b/docker/base-image/agent_server/services/runtime_adapter.py @@ -7,6 +7,7 @@ import os import logging from abc import ABC, abstractmethod +from dataclasses import dataclass, asdict from typing import List, Dict, Optional, Tuple from datetime import datetime @@ -15,6 +16,24 @@ logger = logging.getLogger(__name__) +@dataclass(frozen=True) +class RuntimeCapabilities: + """What a runtime supports, so callers gate on a capability instead of + branching on the runtime name (#1187). + + ``cost_reporting`` is a string, not a bool: ``"native"`` means the CLI + reports a real cost (Claude Code), ``"estimated"`` means Trinity derives + it from token counts (Gemini, Codex). + """ + chat_continuity: bool = False + session_tab_resume: bool = False + mcp_support: bool = False + cost_reporting: str = "estimated" # "native" | "estimated" + + def to_dict(self) -> Dict[str, object]: + return asdict(self) + + class AgentRuntime(ABC): """ Abstract base class for agent execution runtimes. @@ -142,28 +161,60 @@ async def execute_headless( """ pass + @classmethod + def capabilities(cls) -> RuntimeCapabilities: + """Declare what this runtime supports. + + Conservative by default (#1187, AC2): a runtime that forgets to + override this is treated as the least-capable — no Session-tab + resume, no assumed MCP, estimated cost. Override per runtime to + declare real support. + """ + return RuntimeCapabilities() + + +# Accepted AGENT_RUNTIME values (lowercased). Unknown values fail loudly +# rather than silently selecting Claude (#1187 Phase D). 
+_CLAUDE_RUNTIMES = frozenset({"claude-code", "claude"}) +_GEMINI_RUNTIMES = frozenset({"gemini-cli", "gemini"}) +_CODEX_RUNTIMES = frozenset({"codex"}) +KNOWN_RUNTIMES = _CLAUDE_RUNTIMES | _GEMINI_RUNTIMES | _CODEX_RUNTIMES + def get_runtime() -> AgentRuntime: """ Factory function to get the appropriate runtime based on configuration. Reads AGENT_RUNTIME environment variable to determine which runtime to use. - Defaults to Claude Code for backward compatibility. + Defaults to Claude Code (env unset) for backward compatibility, but an + explicitly-set UNKNOWN value raises instead of silently falling back to + Claude — a typo'd runtime should fail loudly, not run the wrong engine + (#1187 Phase D). Returns: - AgentRuntime instance (ClaudeCodeRuntime or GeminiRuntime) + AgentRuntime instance (ClaudeCodeRuntime, GeminiRuntime, or CodexRuntime) + + Raises: + ValueError: if AGENT_RUNTIME is set to an unrecognized value. """ runtime_type = os.getenv("AGENT_RUNTIME", "claude-code").lower() - if runtime_type == "gemini-cli" or runtime_type == "gemini": + if runtime_type in _GEMINI_RUNTIMES: from .gemini_runtime import get_gemini_runtime - runtime = get_gemini_runtime() logger.info("Using Gemini CLI runtime") - return runtime - else: - # Default to Claude Code + return get_gemini_runtime() + if runtime_type in _CODEX_RUNTIMES: + from .codex_runtime import get_codex_runtime + logger.info("Using OpenAI Codex runtime") + return get_codex_runtime() + if runtime_type in _CLAUDE_RUNTIMES: from .claude_code import get_claude_runtime - runtime = get_claude_runtime() logger.info("Using Claude Code runtime") - return runtime + return get_claude_runtime() + + raise ValueError( + f"Unknown AGENT_RUNTIME={runtime_type!r}. " + f"Known runtimes: {sorted(KNOWN_RUNTIMES)}. " + "Refusing to silently fall back to Claude Code." 
+ ) diff --git a/docker/base-image/agent_server/services/trinity_mcp.py b/docker/base-image/agent_server/services/trinity_mcp.py index c398ef9c1..0fd24dcff 100644 --- a/docker/base-image/agent_server/services/trinity_mcp.py +++ b/docker/base-image/agent_server/services/trinity_mcp.py @@ -7,6 +7,7 @@ import json import logging import subprocess +import tomllib # py3.11+; agent base image is python 3.13 from pathlib import Path logger = logging.getLogger(__name__) @@ -31,6 +32,8 @@ def inject_trinity_mcp_if_configured() -> bool: runtime = os.getenv("AGENT_RUNTIME", "claude-code").lower() + if runtime == "codex": + return _inject_codex_mcp(trinity_mcp_url, trinity_mcp_api_key) if runtime == "gemini-cli": return _inject_gemini_mcp(trinity_mcp_url, trinity_mcp_api_key) else: @@ -143,6 +146,8 @@ def configure_mcp_servers(mcp_servers: dict) -> bool: runtime = os.getenv("AGENT_RUNTIME", "claude-code").lower() + if runtime == "codex": + return _configure_codex_mcp_servers(mcp_servers) if runtime == "gemini-cli": return _configure_gemini_mcp_servers(mcp_servers) else: @@ -211,3 +216,224 @@ def _configure_gemini_mcp_servers(mcp_servers: dict) -> bool: logger.info(f"Configured {success_count}/{len(mcp_servers)} MCP servers for Gemini CLI") return success_count > 0 or len(mcp_servers) == 0 + + +# --------------------------------------------------------------------------- +# Codex CLI MCP configuration (#1187 Phase F) +# +# Codex reads MCP servers from ``$CODEX_HOME/config.toml`` under +# ``[mcp_servers.]``. We write that file DIRECTLY (the same approach the +# Gemini path uses for its settings.json — deterministic, avoids `codex mcp +# add` CLI-syntax drift) and MERGE so the Trinity-MCP injection and the +# template-MCP configuration (two separate calls) don't clobber each other. 
+# +# CODEX_HOME is the relocated, gitignored scratch path (see codex_runtime.py); +# both this config writer and the runtime resolve it via the same helper so the +# file we write is the file Codex reads. +# --------------------------------------------------------------------------- + +def _codex_config_path() -> Path: + from .codex_runtime import _codex_home # lazy: avoid an import cycle + + return Path(_codex_home()) / "config.toml" + + +def _read_codex_config(path: Path) -> dict: + try: + with open(path, "rb") as fh: + return tomllib.load(fh) + except (IOError, OSError): + return {} + except tomllib.TOMLDecodeError as exc: + # Do NOT silently reset on a decode error. If we returned {} here, the + # next _upsert_codex_mcp_servers would rewrite the file from {} and + # drop every previously-written server — including the Trinity MCP + # wiring — with no trace. Back the bad file up and log loudly so the + # corruption is recoverable and visible; the caller re-injects its + # servers onto a clean slate on this run. + try: + backup = path.with_name(path.name + ".corrupt") + path.replace(backup) + logger.error( + "Codex config.toml is malformed (%s); backed it up to %s and " + "starting from an empty config. MCP servers are re-written this " + "run.", + exc, backup, + ) + except OSError as backup_err: + logger.error( + "Codex config.toml is malformed (%s) and the backup also failed " + "(%s); rewriting from an empty config.", + exc, backup_err, + ) + return {} + + +# Bare TOML keys are limited to ASCII letters, digits, '_' and '-'. Anything +# else (space, '.', ']', '#', control chars) must be a quoted basic-string key. +_BARE_KEY_CHARS = frozenset( + "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-" +) + +# Basic-string escapes with TOML shorthand. Everything else < 0x20 (plus 0x7F) +# becomes a \uXXXX escape; otherwise an out-of-band character (e.g. a newline +# in a server name or env value) yields invalid TOML. 
+_TOML_SHORTHAND_ESCAPES = { + "\\": "\\\\", + '"': '\\"', + "\b": "\\b", + "\t": "\\t", + "\n": "\\n", + "\f": "\\f", + "\r": "\\r", +} + + +def _toml_escape(value: str) -> str: + out: list[str] = [] + for ch in value: + shorthand = _TOML_SHORTHAND_ESCAPES.get(ch) + if shorthand is not None: + out.append(shorthand) + elif ord(ch) < 0x20 or ord(ch) == 0x7F: + out.append(f"\\u{ord(ch):04X}") + else: + out.append(ch) + return "".join(out) + + +def _toml_key(key: str) -> str: + """Render a TOML key segment: bare when it is a valid bare key, otherwise a + quoted basic-string key. Used for both ``key = ...`` lines and the dotted + segments of ``[table.header]`` lines so a server name or env key with a + space/dot/special char can't produce unparseable TOML.""" + if key and all(c in _BARE_KEY_CHARS for c in key): + return key + return f'"{_toml_escape(key)}"' + + +def _toml_scalar(value) -> str: + if isinstance(value, bool): + return "true" if value else "false" + if isinstance(value, (int, float)): + return str(value) + if isinstance(value, list): + # A list of dicts is a TOML array-of-tables ([[name]]), which this + # writer never emits. Stringifying the dicts would silently corrupt a + # pre-existing config on round-trip. Raise instead: the caller + # (_upsert_codex_mcp_servers) serializes BEFORE write_text, so a raise + # leaves the original file intact and logs a warning — a safe no-op + # rather than a mangled rewrite (#1187 review). + if any(isinstance(item, dict) for item in value): + raise TypeError( + "codex config writer does not support TOML array-of-tables; " + "refusing to serialize to avoid corrupting the existing file" + ) + return "[" + ", ".join(_toml_scalar(item) for item in value) + "]" + if isinstance(value, dict): + # _serialize_table routes dicts to sub-tables, so a dict reaching here + # is an unexpected nesting. Raise rather than emit a stringified dict. 
+ raise TypeError( + "codex config writer received a dict where a scalar was expected" + ) + return f'"{_toml_escape(str(value))}"' + + +def _serialize_table(path: list[str], table: dict, lines: list[str]) -> None: + """Recursively emit a TOML table. ``path`` is the header segments (empty for + the document root). Scalar keys are always emitted before any nested-table + headers (TOML requires it). A table with only sub-tables is left as an + implicit super-table (no redundant ``[parent]`` header), matching the + hand-written output this replaced.""" + scalars = {k: v for k, v in table.items() if not isinstance(v, dict)} + sub_tables = {k: v for k, v in table.items() if isinstance(v, dict)} + # Emit a header for a non-root table that has its own scalar keys, or that + # is entirely empty (so an explicit empty table round-trips). Skip it for a + # pure super-table whose only contents are nested tables. + emit_header = bool(path) and (bool(scalars) or not sub_tables) + if emit_header: + lines.append("[" + ".".join(_toml_key(seg) for seg in path) + "]") + for key, value in scalars.items(): + lines.append(f"{_toml_key(key)} = {_toml_scalar(value)}") + if emit_header or (scalars and sub_tables): + lines.append("") + for key, sub in sub_tables.items(): + _serialize_table(path + [key], sub, lines) + + +def _serialize_codex_config(config: dict) -> str: + """Serialize the codex config we manage to TOML: arbitrary top-level scalars + and tables (preserved on round-trip) plus the ``[mcp_servers.]`` + tables we write. 
Table/key names and string values are quoted/escaped so a + special character can never yield unparseable TOML.""" + lines: list[str] = [] + _serialize_table([], config, lines) + return "\n".join(lines).rstrip() + "\n" + + +def _upsert_codex_mcp_servers(servers: dict) -> bool: + """Merge ``servers`` into ``$CODEX_HOME/config.toml`` ``[mcp_servers.*]``, + preserving any existing servers + top-level settings.""" + path = _codex_config_path() + try: + path.parent.mkdir(parents=True, exist_ok=True) + config = _read_codex_config(path) + config.setdefault("mcp_servers", {}) + config["mcp_servers"].update(servers) + path.write_text(_serialize_codex_config(config)) + return True + except Exception as e: # noqa: BLE001 + logger.warning(f"Failed to write codex config.toml: {e}") + return False + + +def _inject_codex_mcp(trinity_mcp_url: str, trinity_mcp_api_key: str) -> bool: + """Wire the Trinity HTTP MCP server into the codex config. + + The bearer token is referenced by ENV VAR (``bearer_token_env_var``), NOT + written as a literal — the secret stays in the agent's environment and is + never persisted to config.toml (#1187 Phase F). + """ + # trinity_mcp_api_key is intentionally unused: Codex reads it from the + # TRINITY_MCP_API_KEY env var at run time. Accepting it keeps the + # _inject_*_mcp signatures uniform across runtimes. + del trinity_mcp_api_key + server = { + "url": trinity_mcp_url, + "bearer_token_env_var": "TRINITY_MCP_API_KEY", + } + if _upsert_codex_mcp_servers({"trinity": server}): + logger.info("Injected Trinity MCP server into codex config.toml") + return True + return False + + +def _configure_codex_mcp_servers(mcp_servers: dict) -> bool: + """Configure template-supplied MCP servers for Codex via config.toml. + + Stdio servers (command + args) are supported, matching the Gemini path's + scope. A server with no command is skipped with a warning. 
+    """
+    servers: dict = {}
+    for server_name, config in mcp_servers.items():
+        command = config.get("command", "")
+        if not command:
+            logger.warning(f"Skipping MCP server '{server_name}': no command specified")
+            continue
+        entry: dict = {"command": command}
+        args = config.get("args")
+        if args:
+            entry["args"] = args
+        env = config.get("env")
+        if isinstance(env, dict) and env:
+            entry["env"] = env
+        servers[server_name] = entry
+
+    if not servers:
+        return len(mcp_servers) == 0
+
+    ok = _upsert_codex_mcp_servers(servers)
+    logger.info(
+        f"Configured {len(servers)}/{len(mcp_servers)} MCP servers for Codex"
+    )
+    return ok
diff --git a/docker/base-image/startup.sh b/docker/base-image/startup.sh
index 59d0f7a07..06f80fe4f 100644
--- a/docker/base-image/startup.sh
+++ b/docker/base-image/startup.sh
@@ -307,6 +307,31 @@ if [ -d "/generated-creds" ]; then
     echo "Credential files copied"
 fi

+# === Codex runtime setup (#1187) ===
+# Codex is the third agent runtime. Two Codex-specific quirks are fixed here:
+# 1. Identity: Codex reads AGENTS.md (NOT CLAUDE.md). Mirror the agent's
+#    instructions so a Codex agent gets the same platform identity Claude does
+#    (the per-turn system prompt is additionally prepended by codex_runtime.py).
+# 2. CODEX_HOME defaults to ~/.codex — inside the git-tracked repo, which would
+#    dirty auto-sync. Relocate it onto the disk-backed scratch dir (#1098).
+# NOTE: startup.sh must NOT write to .gitignore (#953) — the canonical list
+# (`_GITIGNORE_PATTERNS` in src/backend/services/git_service.py, applied on git
+# init/push by `_build_gitignore_merge_command`) carries `.tmp/`, so the
+# relocated CODEX_HOME under $TMPDIR is excluded from git without a shell write.
+if [ "${AGENT_RUNTIME}" = "codex" ]; then
+    echo "Configuring Codex runtime..."
+
+    if [ -f "/home/developer/CLAUDE.md" ] && [ ! -f "/home/developer/AGENTS.md" ]; then
+        cp /home/developer/CLAUDE.md /home/developer/AGENTS.md 2>/dev/null && \
+            echo " Mirrored CLAUDE.md -> AGENTS.md for Codex" || \
+            echo " Warning: could not create AGENTS.md"
+    fi
+
+    CODEX_HOME_DIR="${CODEX_HOME:-${AGENT_TMPDIR}/codex}"
+    mkdir -p "${CODEX_HOME_DIR}" 2>/dev/null && chmod 700 "${CODEX_HOME_DIR}" 2>/dev/null || \
+        echo " Warning: could not create CODEX_HOME ${CODEX_HOME_DIR}"
+fi
+
 # Ensure core agent-server dependencies are installed correctly
 # This prevents template repos from breaking the agent server with incompatible packages
 echo "Verifying agent-server dependencies..."
diff --git a/docs/TRINITY_COMPATIBLE_AGENT_GUIDE.md b/docs/TRINITY_COMPATIBLE_AGENT_GUIDE.md
index 843251143..6f523c1c5 100644
--- a/docs/TRINITY_COMPATIBLE_AGENT_GUIDE.md
+++ b/docs/TRINITY_COMPATIBLE_AGENT_GUIDE.md
@@ -156,6 +156,7 @@ credentials.json
 .npm/
 .ssh/
 .trinity/
+.tmp/

 # Large generated content - DO NOT COMMIT
 content/
diff --git a/docs/memory/architecture.md b/docs/memory/architecture.md
index cd7f5d047..61add2d65 100644
--- a/docs/memory/architecture.md
+++ b/docs/memory/architecture.md
@@ -325,6 +325,16 @@ Services that run continuously in the backend process:

 Canonical home for each multi-component feature. Endpoint signatures live in [API Endpoints](#api-endpoints); table DDL in [Database Schema](#database-schema).

+### Agent Runtimes — multi-runtime / "harness == runtime" (#1187)
+
+A Trinity **harness IS an `AgentRuntime`** — the pluggable execution engine inside the agent container. Three ship today: **Claude Code** (default), **Gemini CLI**, and **OpenAI Codex** (#1187). `AGENT_RUNTIME` (container env, set from `template.yaml runtime:` via `crud.py`; also a `trinity.agent-runtime` label) selects one; `runtime_adapter.get_runtime()` is the factory — it **validates** the value against `KNOWN_RUNTIMES` and raises on an unknown one rather than silently defaulting to Claude.
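
The fail-loud selection described above can be sketched roughly as follows. This is an illustration only, not the code in `runtime_adapter.py`: the contents of `KNOWN_RUNTIMES` (only `codex` is confirmed by this patch), the `DEFAULT_RUNTIME` value, the behaviour when the variable is unset, and the helper and exception names are all assumptions made for the example.

```python
import os

# Assumed identifiers: only "codex" appears in this patch; the other two names
# are placeholders for the Claude Code and Gemini CLI runtimes.
KNOWN_RUNTIMES = ("claude-code", "gemini-cli", "codex")
DEFAULT_RUNTIME = "claude-code"  # assumption: used only when AGENT_RUNTIME is unset


class UnknownRuntimeError(ValueError):
    """Hypothetical error raised when AGENT_RUNTIME names an unknown runtime."""


def resolve_runtime_name(env=None):
    """Return a validated runtime name from an environment mapping.

    An unset or empty value falls back to the default; any other value that
    is not in KNOWN_RUNTIMES raises instead of silently becoming the default.
    """
    env = os.environ if env is None else env
    raw = env.get("AGENT_RUNTIME", "").strip()
    if not raw:
        return DEFAULT_RUNTIME
    if raw not in KNOWN_RUNTIMES:
        raise UnknownRuntimeError(
            f"Unknown AGENT_RUNTIME {raw!r}; expected one of {KNOWN_RUNTIMES}"
        )
    return raw
```

Raising on a typo such as `codx` surfaces the misconfiguration at startup, whereas a silent fallback would run the agent on a different engine than the template asked for.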
+ +**ABC** (`agent_server/services/runtime_adapter.py`): `execute` (chat), `execute_headless` (stateless task), `configure_mcp`, `is_available`, `get_default_model`, `get_context_window`, plus a non-abstract `capabilities()` classmethod returning a `RuntimeCapabilities` dataclass (`chat_continuity`, `session_tab_resume`, `mcp_support`, `cost_reporting: "native"|"estimated"`) — conservative by default so a new runtime that forgets to override is treated as least-capable. Each runtime is a singleton (`get__runtime()`). + +**Codex** (`codex_runtime.py`, built independently on the per-runtime primitives — NOT a shared helper, so it never inherits Gemini's blanket `kill_cgroup_orphans()`): `codex exec --json` → JSONL events (`thread.started`→session id, `turn.completed.usage`→tokens where `reasoning_output_tokens` is a SUBSET of `output_tokens`, `item.completed`→response/tool activity, `turn.failed`/`error`); `-o/--output-last-message` is the authoritative result (read-then-delete in `finally`); `codex exec resume ` for chat continuity; cost estimated via `CODEX_PRICING`. Concurrency-safe orphan cleanup via `_drain_bounded` (`kill_cgroup_orphans(extra_pids=…)` preserves sibling executions). Error→HTTP: auth→503, rate→429, runtime-unavailable→**500** (not 503 — avoids the AUTH collision), pipe-drop→**502** (SUB-003 guard). 
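
To make the token accounting above concrete, here is a small sketch of folding `codex exec --json` output into usage totals. The event names come from the text; the field names (`type`, `thread_id`, `usage`, `input_tokens`), the sample events, and the choice to sum usage across turns are assumptions for illustration, not a verified Codex schema or the logic in `codex_runtime.py`.

```python
import json


def summarize_events(jsonl_text):
    """Fold JSONL events into a session id and token totals.

    reasoning_output_tokens is treated as a subset of output_tokens, so it is
    reported separately but never added to the total a second time.
    """
    session_id = None
    input_tokens = output_tokens = reasoning_tokens = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise on stdout
        if not isinstance(event, dict):
            continue
        kind = event.get("type")  # assumed discriminator field
        if kind == "thread.started":
            session_id = event.get("thread_id")  # assumed field name
        elif kind == "turn.completed":
            usage = event.get("usage") or {}
            input_tokens += int(usage.get("input_tokens", 0))
            output_tokens += int(usage.get("output_tokens", 0))
            reasoning_tokens += int(usage.get("reasoning_output_tokens", 0))
    return {
        "session_id": session_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "reasoning_output_tokens": reasoning_tokens,
        # Reasoning tokens are already inside output_tokens: do not add them.
        "total_tokens": input_tokens + output_tokens,
    }


# Hypothetical sample stream, built here so the example is self-contained.
SAMPLE = "\n".join([
    json.dumps({"type": "thread.started", "thread_id": "t-123"}),
    json.dumps({
        "type": "turn.completed",
        "usage": {
            "input_tokens": 100,
            "output_tokens": 40,
            "reasoning_output_tokens": 15,
        },
    }),
])

summary = summarize_events(SAMPLE)
```

With the sample above the total is 140 tokens (100 input + 40 output); adding the 15 reasoning tokens on top would overstate usage, and therefore the estimated cost.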
+ +**Parity surface** (every runtime must wire these — see the [Harness Authoring Guide](harness-authoring-guide.md)): platform **system prompt** (Codex prepends it + mirrors `CLAUDE.md`→`AGENTS.md` at startup), **sandbox** (`_resolve_sandbox_mode`: normal mode → `--sandbox danger-full-access` — Codex's own bubblewrap sandbox can't create a user namespace inside the hardened agent container (`bwrap: No permissions to create a new namespace`), which blocks every shell tool, so it's dropped and the Trinity container is the sole boundary, same posture as Claude/Gemini; **read-only mode** → `--sandbox read-only`, read from `~/.trinity/read-only-config.json` since the Claude PreToolUse hook doesn't apply — read-only enforcement for Codex is an open #1187 PR discussion), **guardrails** (`_load_guardrails()`; unmappable Claude tool-names are surfaced in logs, not silently dropped), and **credential sanitization** (`utils/credential_sanitizer` over response + logs). Codex credentials: `OPENAI_API_KEY` from process env else parsed from `/home/developer/.env` (CRED-002; not exported into the agent-server process), injected into the subprocess; `CODEX_HOME` is relocated under `$TMPDIR` (gitignored) so codex state never dirties the repo. Codex agents skip Claude-subscription auto-assign in `crud.py`/`lifecycle.py` (`is_claude_runtime`). Backend reads nothing runtime-specific in MVP: it still infers AUTH from HTTP 503; `ExecutionMetadata.status`/`error_code` ship unused (fast-follow). The **Session tab** is gated off for runtimes lacking `session_tab_resume` (one backend constant `RUNTIMES_WITHOUT_SESSION_TAB_RESUME` in `sessions.py` runs a stateless turn; frontend hides the tab). MCP: `_configure_codex_mcp_servers`/`_inject_codex_mcp` write `$CODEX_HOME/config.toml` directly, the Trinity HTTP MCP referencing the token via `bearer_token_env_var` (never persisted as a literal). 
The platform prompt is **runtime-aware** (`platform_prompt_service.get_platform_system_prompt(runtime=…)`/`compose_system_prompt(runtime=…)`, threaded from `routers/chat.py` + `task_execution_service.py` via the `trinity.agent-runtime` label resolved best-effort by `docker_service.get_agent_runtime`): for Codex it strips the Claude-only `mcp__trinity__` tool-name prefix (which otherwise made Codex emit `unknown MCP server`) and references the auto-discovered `trinity` tools by bare name; Claude/Gemini/unknown keep the canonical naming. Frontend: `RuntimeBadge.vue` codex case, `AgentDetail.vue` default model + terminal map, `AgentTerminal.vue` `codex` mode. + ### Capacity & Backlog (#428) `CapacityManager` (CAPACITY-CONSOLIDATE) is the single public API for admit/release/status across `/chat` (`max_concurrent=max_parallel_tasks`, `queue_in_memory` policy) and `/task` (`queue_persistent` policy). It composes two private internals — `slot_service.py` (atomic N-ary counter, Redis ZSET `agent:slots:{name}`, dynamic per-agent TTL) and `backlog_service.py` (SQLite FIFO over `schedule_executions.status='queued'`, drain-on-release) — and owns the in-memory overflow store (Redis LIST, depth 3). diff --git a/docs/memory/feature-flows.md b/docs/memory/feature-flows.md index d78d6df41..1ac737614 100644 --- a/docs/memory/feature-flows.md +++ b/docs/memory/feature-flows.md @@ -11,6 +11,7 @@ | Date | ID | Feature | Flow | |------|-----|---------|------| +| 2026-06-14 | #1187 | feat(agent-runtime): **OpenAI Codex** as the third agent runtime ("harness == runtime") alongside Claude Code + Gemini. 
New `codex_runtime.py` implements `AgentRuntime` on the existing per-runtime primitives (NOT a shared helper → never inherits Gemini's blanket `kill_cgroup_orphans()`): `codex exec --json` JSONL parse (`thread.started`/`turn.completed.usage`/`item.completed`; `reasoning_output_tokens` ⊂ `output_tokens`, no double-count), `-o/--output-last-message` authoritative result (read-then-delete in `finally`), `codex exec resume ` continuity, estimated cost (`CODEX_PRICING`), concurrency-safe `_drain_bounded` orphan cleanup, error→HTTP (auth **503** / rate **429** / runtime-unavailable **500**-not-503 / pipe-drop **502**). **Safety parity** (blocking): platform system-prompt prepended + `CLAUDE.md`→`AGENTS.md` mirror, **sandbox** normal→`--sandbox danger-full-access` (drops Codex's inner bwrap sandbox — it can't create a user namespace in the hardened container, which blocks every tool — leaving the Trinity container as the sole boundary, same as Claude) / read-only→`--sandbox read-only` (reads `~/.trinity/read-only-config.json`; enforcement an open PR discussion), guardrails honored/surfaced, credential sanitizer over output+logs. **Runtime-aware platform prompt** (E2E fix): Codex strips the Claude-only `mcp__trinity__` tool prefix (else the model emits `unknown MCP server`) via `platform_prompt_service.{get_platform_system_prompt,compose_system_prompt}(runtime=…)`, threaded from `chat.py` + `task_execution_service.py` through the `trinity.agent-runtime` label (`docker_service.get_agent_runtime`). New `RuntimeCapabilities` dataclass + per-runtime `capabilities()`; `get_runtime()` validates `AGENT_RUNTIME` and fails loud on unknown. Codex skips Claude-subscription assign (`is_claude_runtime`, crud+lifecycle); `OPENAI_API_KEY` read from `.env`; `CODEX_HOME` relocated under `$TMPDIR` + gitignored. MCP via `$CODEX_HOME/config.toml` (`bearer_token_env_var` — token never persisted). 
Session tab gated off via one backend constant (`RUNTIMES_WITHOUT_SESSION_TAB_RESUME`); frontend `RuntimeBadge`/default-model/terminal map + hidden Session tab; `test-codex` template; `@openai/codex@0.139.0` pinned. E2E-verified live (boot/chat/cost/resume/read-only/no-leak); resume arg-order bug fixed (exec flags must precede the `resume` sub-subcommand). | [codex-runtime.md](feature-flows/codex-runtime.md) | | 2026-06-10 | #1130 | fix: retired `gemini-2.0-flash` replaced with env-configurable models — `GEMINI_TEXT_MODEL` (image-gen prompt refinement) + `GEMINI_TRANSCRIPTION_MODEL` (Telegram voice), both default `gemini-3.5-flash`, defined in `config.py`, empty-string-safe wiring in both compose files (#1076 pattern). | [image-generation.md](feature-flows/image-generation.md), [telegram-integration.md](feature-flows/telegram-integration.md) | | 2026-06-10 | #1108 | feat(ui): Agent Detail **Guardrails** tab renamed to **Settings** — sectioned config home. New `components/settings/SettingsPanel.vue` renders `GuardrailsPanel` unchanged as section #1; future per-agent settings land as additive sections, not new tabs. `?tab=guardrails` deep links alias to `settings` via `TAB_ALIASES`. Pure frontend. | [agent-guardrails.md](feature-flows/agent-guardrails.md) | | 2026-06-10 | #1114 | feat(ui): Agent Detail tabs overflow into a **"More ▾"** dropdown instead of horizontal scroll. New reusable `components/OverflowTabs.vue` ("priority+" pattern): a hidden, zero-layout mirror row measures every `{id,label,badge?}` tab's width (+ a worst-case "More" button) so the visible row renders as many tabs as fit and collapses the trailing remainder into a right-aligned disclosure menu. Re-measures on container resize (`ResizeObserver` on the outer wrapper, width-diff-guarded + rAF-debounced) and after `document.fonts.ready`; re-measures on tab/label/badge changes via a derived-signature `watch` (`flush:'post'`). 
Defaults to all-inline before the first measure (no first-paint snap; no "More" when everything fits). Active-in-overflow reflected on the trigger (active underline + dot), tab order never reshuffled. Plain `