feat(specs): add custom runner image specification#1563

Open
jbpratt wants to merge 7 commits into
ambient-code:mainfrom
jbpratt:spec/custom-runner-image

Conversation

@jbpratt
Collaborator

@jbpratt jbpratt commented May 12, 2026

Summary

  • Adds specs/agents/runner-image.spec.md defining the stable runner contract and a workspace-level custom image override
  • Custom images are built via Dockerfile FROM on a published base image — no init hooks
  • New runner_image and runner_image_pull_secret fields on ProjectSettings let workspace admins configure a custom runner per project
  • Defines stable interfaces: AG-UI HTTP endpoints, filesystem layout, entrypoint contract, environment variables, security constraints, and Python runtime requirements
  • Includes image selection precedence (ProjectSettings > agent registry > operator default), registry allowlist validation, RBAC, and failure mode scenarios
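For illustration, the selection precedence in the last bullet (ProjectSettings > agent registry > operator default) could be sketched as below. This is a minimal sketch, not the implementation proposed in the PR: the function name, parameter names, and the choice to treat an empty string as "not set" are assumptions.

```python
from typing import Optional


def resolve_runner_image(
    project_settings_image: Optional[str],
    registry_image: Optional[str],
    operator_default: str,
) -> str:
    """Return the first configured image in precedence order.

    Hypothetical helper: ProjectSettings.runner_image wins, then the
    agent registry image, then the operator-level default.
    """
    for candidate in (project_settings_image, registry_image):
        # Assumption: None and "" both mean "not configured".
        if candidate:
            return candidate
    return operator_default


# Example usage with made-up image references.
print(resolve_runner_image("quay.io/acme/runner:1", "registry/img:2", "default/img:3"))
print(resolve_runner_image(None, "registry/img:2", "default/img:3"))
print(resolve_runner_image(None, None, "default/img:3"))
```

Only the image reference is resolved here; per the PR description, RUNNER_TYPE, resources, and sandbox config still come from the agent registry.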

Details

The spec establishes the boundary between "what the platform guarantees" and "what custom images can change." Key design decisions:

  • Dockerfile FROM only — init hooks rejected due to non-reproducibility, startup latency, network dependency, and OpenShift SCC conflicts
  • ProjectSettings, not Session — image trust is an admin concern; all sessions in a project use the same vetted image
  • Agent registry is orthogonal — custom image overrides the container image but preserves RUNNER_TYPE, resources, and sandbox config from the registry

Test plan

  • Review spec for completeness against runner.spec.md and control-plane.spec.md
  • Verify GIVEN/WHEN/THEN scenarios are testable
  • Confirm implementation touchpoints table is accurate

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive spec for custom runner images: required AG-UI HTTP endpoints/port, Python 3.12+ requirement, preserved Python minor version, and required filesystem paths.
    • Clarified runtime constraints: no CMD/ENTRYPOINT overrides, required non-root UID 1001, forbidden overrides of specific injected env vars, and graceful shutdown behavior.
    • Added ProjectSettings options for runner_image and runner_image_pull_secret with precedence, validation, pull-policy rules, RBAC guidance, failure modes, and security/isolation expectations.
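The registry validation mentioned above might look roughly like the following. This is a sketch under stated assumptions: the walkthrough names an optional `RUNNER_IMAGE_ALLOWED_REGISTRIES` allowlist, but the comma-separated format, the "empty allowlist permits everything" behavior, and both function names are hypothetical. Host extraction follows the common container-reference convention that the first path component is a registry host only if it contains a `.` or `:` or equals `localhost`.

```python
def registry_host(image: str) -> str:
    """Extract the registry host from an image reference.

    Follows the usual convention: the first path component is a host only
    if it contains "." or ":" or is exactly "localhost"; otherwise the
    reference is a Docker Hub shorthand.
    """
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_image_allowed(image: str, allowed_registries: str) -> bool:
    """Check an image against a comma-separated allowlist (assumed format)."""
    allowed = {h.strip().lower() for h in allowed_registries.split(",") if h.strip()}
    if not allowed:
        # Assumption: the allowlist is optional, so an empty value allows any registry.
        return True
    return registry_host(image).lower() in allowed


# Example usage with made-up references.
print(is_image_allowed("quay.io/acme/runner:1.2", "quay.io, ghcr.io"))
print(is_image_allowed("evil.example.com/runner:latest", "quay.io, ghcr.io"))
```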

@netlify

netlify Bot commented May 12, 2026

Deploy Preview for cheerful-kitten-f556a0 canceled.

Name Link
🔨 Latest commit 54f94db
🔍 Latest deploy log https://app.netlify.com/projects/cheerful-kitten-f556a0/deploys/6a0388fe8232c10009b70e0c

@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Adds a stable "custom runner image" specification describing required AG-UI HTTP endpoints, Python/runtime and filesystem constraints, non-root and exec/signal expectations, ProjectSettings-based runner image override with validation and pull-secret support, failure modes, security assumptions, and base-image contract labeling.

Changes

Runner image contract & ProjectSettings override

| Layer | File(s) | Summary |
|-------|---------|---------|
| AG-UI HTTP contract and required endpoints | specs/agents/runner-image.spec.md (lines 1–173) | Defines AG-UI interface on AGUI_PORT (default 8001) and required endpoints: /, /interrupt, /health, /capabilities, /events/{thread_id}; response formats must not be removed/changed; remaining endpoints inherit from ambient_runner. |
| Runtime, filesystem, and packaging constraints | specs/agents/runner-image.spec.md (lines 1–173) | Requires Python 3.12+ and ambient_runner package; runner process must preserve base image Python major.minor; enforces preservation of paths (/workspace, /app, /app/ambient-runner, /app/vertex, /tmp) and forbids removing/relocating them. |
| Entrypoint, signal handling, and startup behavior | specs/agents/runner-image.spec.md (lines 1–173) | Custom images must not override CMD/ENTRYPOINT; wrappers allowed only if they exec the runner, ensure listener on AGUI_PORT, propagate SIGTERM for graceful shutdown (PID 1 or child of PID 1), and start within configured startup timeout. |
| Control-plane injected env vars and UID/security constraints | specs/agents/runner-image.spec.md (lines 1–173) | Lists env vars that must not be overridden (e.g., SESSION_ID, PROJECT_NAME, WORKSPACE_PATH, AGUI_PORT, backend/grpc/token vars, INITIAL_PROMPT, IS_RESUME, CREDENTIAL_IDS, RUNNER_TYPE); requires UID 1001/non-root execution, allowPrivilegeEscalation: false, capability drops; root allowed in build stages only. |
| ProjectSettings fields and selection precedence | specs/agents/runner-image.spec.md (lines 175–305) | Adds runner_image and runner_image_pull_secret to ProjectSettings; defines selection precedence (ProjectSettings > agent registry defaults > operator RUNNER_IMAGE), without altering agent-type-specific settings like RUNNER_TYPE or resource limits; changes apply only to new sessions. |
| Image validation, allowlist, and pull policy | specs/agents/runner-image.spec.md (lines 175–305) | Specifies image syntax/host validation and optional allowlist via RUNNER_IMAGE_ALLOWED_REGISTRIES; injects imagePullSecrets from runner_image_pull_secret; imagePullPolicy = IfNotPresent for digests and localhost/ references, Always otherwise. |
| RBAC and operational constraints | specs/agents/runner-image.spec.md (lines 175–305) | Documents RBAC requirement: only principals with project_settings:update may modify runner_image/runner_image_pull_secret; notes that updates affect new sessions only. |
| Failure modes and session state transitions | specs/agents/runner-image.spec.md (lines 306–346) | Enumerates failure cases and outcomes: health/readiness timeout → session Failed; startup crashes → Failed; missing bridge for RUNNER_TYPE → session error; image pull failures → error/backoff and Failed as applicable. |
| Security boundary and isolation expectations | specs/agents/runner-image.spec.md (lines 347–356) | States assumed controls: NetworkPolicy isolation, per-turn credential fetching, per-session ServiceAccount isolation; clarifies these are platform responsibilities around the image contract. |
| Base image publishing and contract labeling | specs/agents/runner-image.spec.md (lines 359–379) | Requires base image to include OCI label io.ambient-code.runner-contract-version="1"; mismatch emits a warning (non-blocking) on pod creation; includes example publishing guidance. |
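As a small illustration of the pull-policy rule summarized above (IfNotPresent for digests and `localhost/` references, Always otherwise), one possible reading is sketched here. The function name is hypothetical, and treating any reference containing `@` as a digest reference is an assumption about how the spec intends digests to be detected.

```python
def image_pull_policy(image: str) -> str:
    """Pick an imagePullPolicy for a runner image reference.

    Assumption: a digest-pinned reference is any reference containing "@"
    (for example "repo/img@sha256:..."), since tags cannot contain "@".
    """
    if "@" in image or image.startswith("localhost/"):
        # Content is immutable (digest) or only available locally.
        return "IfNotPresent"
    # Mutable tags should be re-pulled so updates are picked up.
    return "Always"


# Example usage with made-up references.
for ref in (
    "quay.io/acme/runner@sha256:" + "0" * 64,
    "localhost/runner:dev",
    "quay.io/acme/runner:latest",
):
    print(ref, "->", image_pull_policy(ref))
```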
🚥 Pre-merge checks | ✅ 8
✅ Passed checks (8 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title follows Conventional Commits format (feat(specs): description) and accurately describes the main change: adding a new runner image specification file. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Performance And Algorithmic Complexity | ✅ Passed | Documentation-only PR (379-line spec file added). No code modifications, no algorithmic implementations, no data structures, loops, caching, or API patterns. Performance check not applicable. |
| Security And Secret Handling | ✅ Passed | No hardcoded secrets, tokens, or injection vulnerabilities. Spec mandates CP-injected env vars, per-turn credentials, per-session SA isolation, RBAC enforcement, and network isolation. |
| Kubernetes Resource Safety | ✅ Passed | PR is documentation-only (spec file, no actual Kubernetes resources). Spec documents security context, resource limits, RBAC, and namespace scoping requirements that implementations must follow. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
specs/agents/runner-image.spec.md (2)

93-93: 💤 Low value

Clarify path description to avoid confusion.

The phrase "MUST contain installed ambient_runner package" could be misread to mean the pip package must be installed at /app/ambient-runner, when it actually means this directory contains the application code (main.py) that imports the package installed elsewhere in site-packages.

📝 Clearer phrasing
-| `/app/ambient-runner` | Runner package source and working directory | MUST contain installed `ambient_runner` package |
+| `/app/ambient-runner` | Runner application root and working directory | MUST contain main.py and application code; requires `ambient_runner` package installed via pip |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/agents/runner-image.spec.md` at line 93, The spec line for
`/app/ambient-runner` is ambiguous about where the pip-installed ambient_runner
resides; update the wording so it clearly states that `/app/ambient-runner`
contains the application source (e.g., main.py) which imports the
`ambient_runner` package installed in site-packages, not that the pip package
itself is installed at that path; reference the `/app/ambient-runner` directory,
the application entrypoint `main.py`, and the `ambient_runner` package in the
revised sentence to make this distinction explicit.

461-461: ⚡ Quick win

Consider blocking contract version mismatches by default.

The spec makes version checking advisory-only (CP logs warning but creates pod anyway). However, if a custom image uses contract v2 with breaking changes and the CP expects v1, the session will fail unpredictably at runtime rather than being rejected upfront.

💡 Alternative design

Make blocking the default with operator opt-in for mismatches:

-The CP MAY read this label at pod creation time and log a warning if the contract version does not match the expected version. This is advisory — the CP SHALL NOT block pod creation based on contract version mismatch.
+The CP SHALL read this label at pod creation time. If the contract version does not match the expected version, the CP SHALL transition the session to `Failed` with a condition describing the mismatch UNLESS the operator has set `ALLOW_CONTRACT_VERSION_MISMATCH=true`.

This preserves flexibility for operators who explicitly opt in while preventing accidental incompatibilities.
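To make the two behaviors concrete, the decision logic of the alternative design could be sketched as follows. This is an illustrative sketch only: the `Decision` enum, the function name, and the choice to treat a missing label as a mismatch are assumptions, and the `allow_mismatch` flag stands in for the suggested operator opt-in.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"    # versions match, create the pod
    WARN = "warn"      # mismatch tolerated by operator opt-in, log a warning
    REJECT = "reject"  # mismatch, transition the session to Failed


def check_contract_version(
    label_value: Optional[str],
    expected: str,
    allow_mismatch: bool = False,
) -> Decision:
    """Decide what to do with an image's runner-contract-version label.

    Assumption: an image with no label at all is treated as a mismatch.
    """
    if label_value == expected:
        return Decision.ALLOW
    return Decision.WARN if allow_mismatch else Decision.REJECT


# Example usage: CP expects contract version "1".
print(check_contract_version("1", "1"))
print(check_contract_version("2", "1"))
print(check_contract_version("2", "1", allow_mismatch=True))
```

With `allow_mismatch=True` the sketch reproduces the advisory behavior currently in the spec; with the default it reproduces the blocking behavior proposed in this comment.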

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/agents/runner-image.spec.md` at line 461, Update the sentence about
contract-version handling so the Control Plane (CP) SHALL by default reject pod
creation on a contract version mismatch instead of merely warning; add a clear
operator-configurable override (e.g., an "allowContractMismatch" opt-in flag)
that, when enabled, permits the previous advisory behavior and logs a warning;
ensure the wording references the "contract version" label and the CP's behavior
at "pod creation" so readers can locate and implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@specs/agents/runner-image.spec.md`:
- Around line 274-276: Document that ProjectSettings.runner_image can override
image but not agent-type config (RUNNER_TYPE, resource limits, state dir) and
add a Failure Modes entry describing the cryptic Python import error when a
custom image lacks the required bridge implementation (e.g., ClaudeBridge,
GeminiCLIBridge, LangGraphBridge) for the session's runner type; update the
recommendations to advise building custom images FROM the standard base to
inherit all bridges and add a runtime validation step in the session creation
flow (where ProjectSettings.runner_image is applied) that inspects the image or
performs a quick probe to confirm the presence of the required bridge for the
requested RUNNER_TYPE and surface a clear, actionable error if missing.
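A rough sketch of the suggested validation step is shown below. Everything here is hypothetical except the bridge class names quoted in the comment: the runner-type keys, the exception type, and the idea that the set of available bridges has already been obtained (by inspecting or probing the image) are assumptions made for illustration.

```python
from typing import Iterable

# Hypothetical mapping: the bridge class names come from the review comment,
# but the runner-type keys are made up for this sketch.
REQUIRED_BRIDGE = {
    "claude": "ClaudeBridge",
    "gemini-cli": "GeminiCLIBridge",
    "langgraph": "LangGraphBridge",
}


class MissingBridgeError(RuntimeError):
    """Raised when a custom image lacks the bridge for the session's runner type."""


def validate_bridge(runner_type: str, available_bridges: Iterable[str]) -> None:
    """Fail early with an actionable message instead of a late import error."""
    bridge = REQUIRED_BRIDGE.get(runner_type)
    if bridge is None:
        raise ValueError(f"unknown runner type: {runner_type!r}")
    if bridge not in set(available_bridges):
        raise MissingBridgeError(
            f"custom runner image has no {bridge} for RUNNER_TYPE={runner_type!r}; "
            "build the image FROM the standard base image to inherit all bridges"
        )


# Example usage: an image that only ships the Claude bridge.
validate_bridge("claude", ["ClaudeBridge"])
try:
    validate_bridge("langgraph", ["ClaudeBridge"])
except MissingBridgeError as exc:
    print(exc)
```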

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 34cd54f5-c174-486c-a499-0113c9af9cf5

📥 Commits

Reviewing files that changed from the base of the PR and between 28874a9 and 0add287.

📒 Files selected for processing (1)
  • specs/agents/runner-image.spec.md

Comment thread specs/agents/runner-image.spec.md Outdated
Define the stable runner contract and a ProjectSettings-driven image
override so workspace admins can layer tools onto the base runner via
Dockerfile FROM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jbpratt jbpratt force-pushed the spec/custom-runner-image branch from b49e8eb to 2308ab4 Compare May 12, 2026 13:50
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@specs/agents/runner-image.spec.md`:
- Around line 154-164: The spec enforces a contradictory UID requirement: it
mandates a fixed UID 1001 via Dockerfile `USER 1001` while also recommending
OpenShift arbitrary-UID compatibility (e.g. `chmod -R g=u`), which conflicts
under restrictive SCCs; change the normative contract to require non-root
runtime behavior (`runAsNonRoot: true`, `allowPrivilegeEscalation: false`,
`drop: ["ALL"]` and no root at runtime) and demote `UID 1001`/`Dockerfile USER
1001` to a base-image default or recommendation, keeping the OpenShift
compatibility guidance (`chmod -R g=u` on writable paths) as a SHOULD rather
than a SHALL so implementations can satisfy `runAsNonRoot` without a fixed UID.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7fd1e84c-8a85-4262-a291-b54a4719e4c9

📥 Commits

Reviewing files that changed from the base of the PR and between 0add287 and e576f72.

📒 Files selected for processing (1)
  • specs/agents/runner-image.spec.md

Comment on lines +154 to +164
A custom runner image SHALL run as UID 1001 with no root privileges.

| Constraint | Enforced by |
|------------|-------------|
| UID 1001 | Dockerfile `USER 1001` |
| `runAsNonRoot: true` | Pod SecurityContext |
| `allowPrivilegeEscalation: false` | Pod SecurityContext |
| `drop: ["ALL"]` capabilities | Pod SecurityContext |

Custom images MAY use `USER 0` during build stages for installing system packages, provided the final `USER` directive sets UID 1001. Custom images SHOULD include OpenShift arbitrary-UID compatibility (`chmod -R g=u` on writeable paths).

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve UID contract contradiction for OpenShift compatibility

Line 154 mandates a fixed UID (1001), but Line 163 simultaneously recommends OpenShift arbitrary-UID compatibility. These are mutually inconsistent as a normative contract and can lead to incompatible implementations under restricted SCC.

Use a non-root contract as normative (runAsNonRoot, no privilege escalation, dropped caps), and make 1001 a base-image default rather than a hard runtime requirement.

Proposed spec wording change
-A custom runner image SHALL run as UID 1001 with no root privileges.
+A custom runner image SHALL run as non-root with no root privileges. The base image default runtime user is UID 1001, but deployments MAY run with an arbitrary non-root UID (e.g., OpenShift restricted SCC).

 | Constraint | Enforced by |
 |------------|-------------|
-| UID 1001 | Dockerfile `USER 1001` |
+| Non-root runtime user (default: UID 1001) | Dockerfile `USER 1001` + Pod SecurityContext |
 | `runAsNonRoot: true` | Pod SecurityContext |
 | `allowPrivilegeEscalation: false` | Pod SecurityContext |
 | `drop: ["ALL"]` capabilities | Pod SecurityContext |
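As an illustration of the non-root contract proposed above, a check over a container `securityContext` could be sketched like this. The field names (`runAsNonRoot`, `runAsUser`, `allowPrivilegeEscalation`, `capabilities.drop`) are standard Kubernetes fields; the function name and the plain-dict input are assumptions for the sketch. Note that it deliberately does not require UID 1001, only that the UID is not 0.

```python
from typing import Any, Dict, List


def non_root_violations(security_context: Dict[str, Any]) -> List[str]:
    """Return human-readable violations of the non-root runtime contract."""
    problems: List[str] = []
    if security_context.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    if security_context.get("runAsUser") == 0:
        problems.append("runAsUser must not be 0")
    if security_context.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    drops = (security_context.get("capabilities") or {}).get("drop") or []
    if "ALL" not in drops:
        problems.append('capabilities.drop must include "ALL"')
    return problems


# Example usage: an arbitrary non-root UID (as assigned under a restricted SCC)
# satisfies the contract; a root context does not.
arbitrary_uid = {
    "runAsNonRoot": True,
    "runAsUser": 1000650000,
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
}
print(non_root_violations(arbitrary_uid))
print(non_root_violations({"runAsUser": 0}))
```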
