379 changes: 379 additions & 0 deletions specs/agents/runner-image.spec.md
@@ -0,0 +1,379 @@
# Custom Runner Image Specification

**Date:** 2026-05-12
**Status:** Proposed
**Related:**
- `runner.spec.md` — Runner runtime, AG-UI protocol, bridge layer
- `../control-plane/control-plane.spec.md` — Pod provisioning, image selection, env var injection
- `../api/ambient-model.spec.md` — ProjectSettings, Session data model
- `../security/security.spec.md` — Per-session SA isolation, credential boundaries

---

## Purpose

The Ambient Runner ships a single image containing Python, git, Node.js, Go, and several CLI tools. Workspace admins who need additional tools — Terraform, kubectl, language-specific SDKs, internal CLIs — have no supported extension path short of forking the image.

This spec defines a **stable runner contract** (the set of filesystem paths, HTTP endpoints, environment variables, and security constraints that custom images must preserve), a **Dockerfile FROM extension model** (users layer tools onto a published base image), and a **ProjectSettings-driven image override** (workspace admins declare a custom image per project).

The extension model is Dockerfile FROM only. Init hooks (scripts run at pod startup) were rejected: they are non-reproducible across pods, add startup latency, require runtime network egress that conflicts with NetworkPolicy isolation, and create OpenShift SCC conflicts when installing system packages.

This spec covers only the **image boundary** — what must be true about a container image for the platform to run it as a runner. Runner internals (bridge layer, gRPC transport, credential management) are defined in `runner.spec.md`. Pod provisioning mechanics are defined in `control-plane.spec.md`.

---

## Stable Runner Contract

Everything in this section is the stable interface. Anything not listed here is internal and MAY change without notice between runner releases.

### Requirement: AG-UI HTTP Contract

A custom runner image SHALL expose the AG-UI protocol on the port specified by the `AGUI_PORT` environment variable (default `8001`).

The following endpoints are part of the stable contract:

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/` | POST | AG-UI run — execute one turn, stream SSE events |
| `/interrupt` | POST | Halt the active run for a thread |
| `/health` | GET | Liveness/readiness probe |
| `/capabilities` | GET | Declare supported features to callers |
| `/events/{thread_id}` | GET | SSE live event stream for a specific thread |

Custom images MUST NOT remove, relocate, or change the response format of these endpoints. The remaining platform endpoints (`/repos`, `/workflow`, `/feedback`, `/mcp-status`, `/content`, `/tasks`) are registered by the `ambient_runner` package and inherited automatically.
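
As a rough illustration of the contract above, the sketch below records the five stable endpoints as data and resolves the listening port from `AGUI_PORT` with the documented default of `8001`. The names `STABLE_ENDPOINTS` and `resolve_agui_port` are hypothetical; they are not taken from the `ambient_runner` package.

```python
from collections.abc import Mapping

# The stable endpoints and their methods, copied from the table above.
STABLE_ENDPOINTS: dict[str, str] = {
    "/": "POST",
    "/interrupt": "POST",
    "/health": "GET",
    "/capabilities": "GET",
    "/events/{thread_id}": "GET",
}

DEFAULT_AGUI_PORT = 8001


def resolve_agui_port(env: Mapping[str, str]) -> int:
    """Return the port from AGUI_PORT, or the default when unset or empty."""
    raw = env.get("AGUI_PORT", "").strip()
    if not raw:
        return DEFAULT_AGUI_PORT
    port = int(raw)  # raises ValueError on a non-numeric value
    if not 1 <= port <= 65535:
        raise ValueError(f"AGUI_PORT out of range: {port}")
    return port
```

A conformance test for a custom image could iterate over `STABLE_ENDPOINTS` and probe each path on the resolved port.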

#### Scenario: Custom image passes health check

- GIVEN a custom runner image built FROM the base
- WHEN the CP creates a pod and the readiness probe calls `GET /health`
- THEN the response is `200 OK`
- AND the session transitions to `Running` phase

#### Scenario: Custom image serves AG-UI protocol

- GIVEN a custom runner image is running in a session pod
- WHEN the api-server proxies a user message to `POST /`
- THEN the runner processes the turn and streams AG-UI events via SSE
- AND the event format is identical to the standard runner

---

### Requirement: Python Runtime Contract

Custom images SHALL provide Python 3.12+ and SHALL have the `ambient_runner` package installed. The runner process MUST use the same Python major.minor version as the base image.

Custom tools MAY use different Python versions via explicit interpreter paths, but the runner's uvicorn process MUST run under the base image's Python.

#### Scenario: Missing ambient_runner package

- GIVEN a custom image without the `ambient_runner` package
- WHEN the pod starts
- THEN the runner process fails to start
- AND the pod exits with a non-zero exit code
- AND the CP transitions the session to `Failed`

---

### Requirement: Filesystem Contract

A custom runner image SHALL preserve the following filesystem layout:

| Path | Constraint |
|------|------------|
| `/workspace` | MUST exist; `emptyDir` volume mounted by the CP at pod creation |
| `/app` | MUST exist; writeable by UID 1001; serves as `HOME` |
| `/app/ambient-runner` | MUST contain installed `ambient_runner` package |
| `/app/vertex` | MUST tolerate read-only Secret mount by CP (when Vertex AI enabled) |
| `/tmp` | MUST be writeable |

Custom images MAY add files and directories anywhere. Custom images MUST NOT remove or relocate the paths listed above.
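
The layout table could be checked inside a built image with something like the following sketch. It is an assumption about how an image author might self-test, not a platform tool: `REQUIRED_PATHS` and `layout_violations` are hypothetical names, and the check only confirms that directories exist and are writable by the current user. It does not verify that `/app/ambient-runner` actually contains the package.

```python
import os
from pathlib import Path

# Paths from the table above, mapped to whether they must also be writable.
# /app/vertex is omitted: the contract only requires tolerating a mount there.
REQUIRED_PATHS: dict[str, bool] = {
    "/workspace": False,
    "/app": True,
    "/app/ambient-runner": False,
    "/tmp": True,
}


def layout_violations(root: Path = Path("/")) -> list[str]:
    """Return one message per missing or non-writable required path."""
    problems = []
    for path, must_be_writable in REQUIRED_PATHS.items():
        target = root / path.lstrip("/")
        if not target.is_dir():
            problems.append(f"{path}: missing")
        elif must_be_writable and not os.access(target, os.W_OK):
            problems.append(f"{path}: not writable by the current user")
    return problems
```

The `root` parameter exists only so the check can be exercised against a scratch directory; inside the image it would be called with no arguments while running as the runtime user.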

#### Scenario: Custom tools installed in system PATH

- GIVEN a custom image with additional system packages installed
- WHEN a session runs in a pod using this image
- THEN the additional binaries are available in the agent's PATH
- AND all AG-UI endpoints function normally

---

### Requirement: Entrypoint Contract

Custom images SHOULD NOT override CMD or ENTRYPOINT. The platform controls the runner process lifecycle through the base image's default command.

If a custom image needs pre-startup logic, it MAY use a wrapper entrypoint that performs setup and then `exec`s the original command. The runner process MUST:

- Listen on the port specified by `AGUI_PORT` (default `8001`)
- Receive SIGTERM for graceful shutdown (the process must run as PID 1, or as a direct child of a PID 1 that forwards signals to it)
- Start within the pod's startup timeout
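
A wrapper that satisfies the `exec` requirement might look like the sketch below. It is written in Python only because the interpreter is already present in the image; the file name `wrapper.py`, the function names, and the setup step are all hypothetical, and the runner command itself is passed in as arguments rather than assumed.

```python
import os
import sys


def runner_command(argv: list[str]) -> list[str]:
    """Return the command to exec: everything after the wrapper's own name."""
    if len(argv) < 2:
        raise SystemExit("usage: wrapper.py <runner command> [args...]")
    return argv[1:]


def main() -> None:
    # Hypothetical pre-startup setup would go here. Keep it short so the
    # runner still starts within the pod's startup timeout.
    command = runner_command(sys.argv)
    # exec replaces the current process image, so the runner keeps the
    # wrapper's PID and receives SIGTERM directly, with no parent in between.
    os.execvp(command[0], command)

# In the real wrapper script, call main() as the last statement.
```

In Docker, declaring a new `ENTRYPOINT` in a derived image resets any `CMD` inherited from the base image, so a Dockerfile that installs such a wrapper would also need to restate the runner command explicitly.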

#### Scenario: Wrapper entrypoint preserves signal handling

- GIVEN a custom image with a wrapper entrypoint that execs the runner process
- WHEN the CP sends SIGTERM to the pod
- THEN the runner process receives the signal
- AND shuts down gracefully within `terminationGracePeriodSeconds`

---

### Requirement: Environment Contract

The following environment variables are injected by the CP at pod creation time. Custom images MUST NOT override these in the Dockerfile:

| Variable | Purpose |
|----------|---------|
| `SESSION_ID` | Primary session identifier |
| `PROJECT_NAME` | Project context |
| `WORKSPACE_PATH` | Workspace root (always `/workspace`) |
| `AGUI_PORT` | Runner HTTP port |
| `BACKEND_API_URL` | api-server base URL |
| `AMBIENT_GRPC_URL` | api-server gRPC address |
| `AMBIENT_GRPC_USE_TLS` | TLS flag for gRPC channel |
| `AMBIENT_CP_TOKEN_URL` | CP token endpoint |
| `AMBIENT_CP_TOKEN_PUBLIC_KEY` | RSA public key for token auth |
| `INITIAL_PROMPT` | Auto-execute prompt |
| `IS_RESUME` | Resume flag on pod restart |
| `CREDENTIAL_IDS` | JSON map of resolved credential IDs |
| `RUNNER_TYPE` | Bridge selection (from agent registry) |

The base image also sets `PYTHONUNBUFFERED=1`, `HOME=/app`, and `SHELL=/bin/bash`. Custom images SHOULD preserve these.

Custom images MAY set additional environment variables. Custom images MUST NOT unset CP-injected variables.
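
One way to catch accidental overrides before an image is published is to lint the Dockerfile for `ENV` directives that name a CP-injected variable. The sketch below assumes a simple line-based scan: `overridden_cp_vars` is a hypothetical helper, it ignores continuation lines of a multi-line `ENV`, and it raises `ValueError` on unbalanced quotes.

```python
import shlex

# Variable names from the table above.
CP_INJECTED_VARS = frozenset({
    "SESSION_ID", "PROJECT_NAME", "WORKSPACE_PATH", "AGUI_PORT",
    "BACKEND_API_URL", "AMBIENT_GRPC_URL", "AMBIENT_GRPC_USE_TLS",
    "AMBIENT_CP_TOKEN_URL", "AMBIENT_CP_TOKEN_PUBLIC_KEY",
    "INITIAL_PROMPT", "IS_RESUME", "CREDENTIAL_IDS", "RUNNER_TYPE",
})


def overridden_cp_vars(dockerfile_text: str) -> list[str]:
    """Return the CP-injected variable names that the Dockerfile sets via ENV."""
    found = set()
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        if not stripped.upper().startswith("ENV "):
            continue
        # Drop a trailing line-continuation backslash before tokenizing.
        tokens = shlex.split(stripped[4:].rstrip("\\"))
        if not tokens:
            continue
        if "=" in tokens[0]:
            # Key=value form: ENV KEY=value [KEY2=value2 ...]
            keys = [tok.split("=", 1)[0] for tok in tokens if "=" in tok]
        else:
            # Space-separated form: ENV KEY value
            keys = [tokens[0]]
        found.update(key for key in keys if key in CP_INJECTED_VARS)
    return sorted(found)
```

An empty result means no CP-injected name appears in an `ENV` directive; it does not prove the image is otherwise conformant.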

#### Scenario: Custom image adds environment variables

- GIVEN a custom image with additional `ENV` directives
- WHEN a session pod starts
- THEN both the custom env vars and all CP-injected env vars are present
- AND the runner starts normally

---

### Requirement: Security Contract

A custom runner image SHALL run as UID 1001 with no root privileges.

| Constraint | Enforced by |
|------------|-------------|
| UID 1001 | Dockerfile `USER 1001` |
| `runAsNonRoot: true` | Pod SecurityContext |
| `allowPrivilegeEscalation: false` | Pod SecurityContext |
| `drop: ["ALL"]` capabilities | Pod SecurityContext |

Custom images MAY use `USER 0` during build stages for installing system packages, provided the final `USER` directive sets UID 1001. Custom images SHOULD include OpenShift arbitrary-UID compatibility (`chmod -R g=u` on writeable paths).

Comment on lines +154 to +164

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve UID contract contradiction for OpenShift compatibility

Line 154 mandates a fixed UID (1001), but Line 163 simultaneously recommends OpenShift arbitrary-UID compatibility. These are mutually inconsistent as a normative contract and can lead to incompatible implementations under restricted SCC.

Use a non-root contract as normative (runAsNonRoot, no privilege escalation, dropped caps), and make 1001 a base-image default rather than a hard runtime requirement.

Proposed spec wording change
-A custom runner image SHALL run as UID 1001 with no root privileges.
+A custom runner image SHALL run as non-root with no root privileges. The base image default runtime user is UID 1001, but deployments MAY run with an arbitrary non-root UID (e.g., OpenShift restricted SCC).

 | Constraint | Enforced by |
 |------------|-------------|
-| UID 1001 | Dockerfile `USER 1001` |
+| Non-root runtime user (default: UID 1001) | Dockerfile `USER 1001` + Pod SecurityContext |
 | `runAsNonRoot: true` | Pod SecurityContext |
 | `allowPrivilegeEscalation: false` | Pod SecurityContext |
 | `drop: ["ALL"]` capabilities | Pod SecurityContext |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/agents/runner-image.spec.md` around lines 154 - 164, The spec enforces
a contradictory UID requirement: it mandates a fixed UID 1001 via Dockerfile
`USER 1001` while also recommending OpenShift arbitrary-UID compatibility (e.g.
`chmod -R g=u`), which conflicts under restrictive SCCs; change the normative
contract to require non-root runtime behavior (`runAsNonRoot: true`,
`allowPrivilegeEscalation: false`, `drop: ["ALL"]` and no root at runtime) and
demote `UID 1001`/`Dockerfile USER 1001` to a base-image default or
recommendation, keeping the OpenShift compatibility guidance (`chmod -R g=u` on
writable paths) as a SHOULD rather than a SHALL so implementations can satisfy
`runAsNonRoot` without a fixed UID.

#### Scenario: Custom image with system package installation

- GIVEN a custom image that installs system packages as root during build
- AND sets `USER 1001` as the final directive
- WHEN the pod starts with `securityContext.runAsNonRoot: true`
- THEN the pod starts successfully
- AND the installed packages are executable by UID 1001
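
Under the spec text as currently written (a final `USER 1001`), an image author could lint the Dockerfile with a sketch like the one below. `final_user` is a hypothetical helper; it does no variable expansion, and it returns `None` when the last build stage declares no `USER`, in which case the user is inherited from that stage's base image.

```python
from typing import Optional


def final_user(dockerfile_text: str) -> Optional[str]:
    """Return the last USER value in the final build stage, or None if absent."""
    user = None
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if not parts:
            continue
        directive = parts[0].upper()
        if directive == "FROM":
            # A new stage starts; its user comes from its base image,
            # which this scan cannot see.
            user = None
        elif directive == "USER" and len(parts) > 1:
            user = parts[1]
    return user
```

A build-time check could then fail when `final_user(text)` is `"0"` or `"root"`, which would be rejected by `runAsNonRoot: true` at pod start in any case.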

---

## ProjectSettings Integration

### Requirement: Custom Runner Image Field

The ProjectSettings resource SHALL support a `runner_image` field (string). When set, the CP SHALL use this image instead of the default when creating session pods for that project.

The field SHALL contain a fully qualified container image reference: `registry/repository:tag` or `registry/repository@sha256:digest`. When empty or unset, the CP uses the default image.

#### Scenario: Project with custom runner image

- GIVEN a ProjectSettings with `runner_image` set to a custom image
- WHEN a session is started in that project
- THEN the CP creates the runner pod with the custom image
- AND all other pod configuration (env vars, volumes, security context) is unchanged

#### Scenario: Project without custom runner image

- GIVEN a ProjectSettings with `runner_image` unset
- WHEN a session is started
- THEN the CP uses the default runner image

---

### Requirement: Image Selection Precedence

The CP SHALL select the runner image using the following precedence (highest to lowest):

1. **ProjectSettings `runner_image`** — workspace admin override
2. **Agent registry `container.image`** — per-agent-type default
3. **Operator `RUNNER_IMAGE` env var** — cluster-level default
4. **Hardcoded fallback**

`ProjectSettings.runner_image` overrides the **image** but not the **agent type configuration**. The `RUNNER_TYPE` env var, resource limits, state directory, and other agent-registry settings are still applied from the registry entry matching the session's runner type.

Custom images MUST contain the bridge implementation for every agent type that sessions in this project may use. Images built FROM the standard base inherit all bridges.
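
The precedence list reduces to "first non-empty value wins". A minimal sketch, assuming each source is available as an optional string: the function name is hypothetical, and `HARDCODED_FALLBACK` is a placeholder because this spec does not state the actual fallback reference.

```python
from typing import Optional

# Placeholder only: the real hardcoded fallback is not given in this spec.
HARDCODED_FALLBACK = "registry.example/ambient/runner:placeholder"


def select_runner_image(
    project_image: Optional[str],
    registry_image: Optional[str],
    operator_image: Optional[str],
    fallback: str = HARDCODED_FALLBACK,
) -> str:
    """Pick the runner image, highest-precedence source first."""
    # Order: ProjectSettings, agent registry, operator RUNNER_IMAGE env var.
    for candidate in (project_image, registry_image, operator_image):
        if candidate and candidate.strip():
            return candidate.strip()
    return fallback
```

Only the image is chosen here. `RUNNER_TYPE`, resource limits, and the other agent-registry settings would still be read from the registry entry regardless of which source supplied the image.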

#### Scenario: Custom image with non-default runner type

- GIVEN a project with `runner_image` set to a custom image
- AND a session created with a non-default runner type
- WHEN the CP provisions the pod
- THEN the pod uses the custom image
- AND the pod env includes the `RUNNER_TYPE` from the agent registry
- AND the custom image MUST contain the matching bridge implementation

#### Scenario: No custom image — agent registry selects image

- GIVEN a project with `runner_image` unset
- AND a session with a specific runner type
- WHEN the CP provisions the pod
- THEN the pod uses the image from the agent registry entry for that runner type

---

### Requirement: Image Validation

The CP SHALL validate the `runner_image` value before creating pods.

The CP SHALL reject images where the reference is syntactically invalid (missing repository or tag/digest) or the registry host is empty.

The CP SHOULD support an operator-level allowlist of permitted registries via `RUNNER_IMAGE_ALLOWED_REGISTRIES` (comma-separated hostnames). When set, images from unlisted registries SHALL be rejected and the session SHALL transition to `Failed` with a descriptive condition.

When the allowlist is unset, any registry is allowed.
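
The validation rules above could be sketched as follows. This is an illustrative approximation, not the CP's implementation: `parse_image_ref` and `is_registry_allowed` are hypothetical names, the parser is far less strict than a full OCI reference grammar, and comparing the allowlist against the whole registry component (including any port) is an assumption.

```python
import re
from typing import Optional

_DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def parse_image_ref(ref: str) -> tuple[str, str, str]:
    """Split a reference into (registry, repository, tag_or_digest).

    Raises ValueError when a required component is missing.
    """
    if "@" in ref:
        name, _, suffix = ref.partition("@")
        if not _DIGEST.fullmatch(suffix):
            raise ValueError("malformed sha256 digest")
    else:
        # A tag colon must come after the last slash, so that a registry
        # port (host:5000/repo) is not mistaken for a tag.
        slash, colon = ref.rfind("/"), ref.rfind(":")
        if colon <= slash or colon == len(ref) - 1:
            raise ValueError("missing tag or digest")
        name, suffix = ref[:colon], ref[colon + 1:]
    registry, _, repository = name.partition("/")
    if not registry:
        raise ValueError("empty registry host")
    if not repository:
        raise ValueError("missing repository")
    return registry, repository, suffix


def is_registry_allowed(registry: str, allowlist: Optional[str]) -> bool:
    """Apply RUNNER_IMAGE_ALLOWED_REGISTRIES; an unset or empty list allows all."""
    allowed = {host.strip() for host in (allowlist or "").split(",") if host.strip()}
    return not allowed or registry in allowed
```

A caller would parse first, then pass the returned registry and the raw value of `RUNNER_IMAGE_ALLOWED_REGISTRIES` to `is_registry_allowed`, failing the session with a descriptive condition if either step rejects the image.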

#### Scenario: Image from disallowed registry

- GIVEN a registry allowlist that does not include `docker.io`
- AND a ProjectSettings with `runner_image` pointing to `docker.io`
- WHEN the CP validates the image reference
- THEN the session transitions to `Failed` with a condition describing the rejection

#### Scenario: No registry allowlist

- GIVEN no registry allowlist configured
- AND a ProjectSettings with `runner_image` pointing to any registry
- WHEN the CP validates the image reference
- THEN the image is accepted

---

### Requirement: Image Pull Credentials

The ProjectSettings resource SHALL support a `runner_image_pull_secret` field (string) containing the name of a Kubernetes Secret (type `kubernetes.io/dockerconfigjson`) in the project's namespace.

When set, the CP SHALL add it to the pod's `spec.imagePullSecrets`.
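
In pod-spec terms, this amounts to appending one `{"name": ...}` entry to `imagePullSecrets`. The sketch below assumes the pod spec is held as a plain dictionary; `with_pull_secret` is a hypothetical helper and does not check that the Secret exists.

```python
import copy
from typing import Optional


def with_pull_secret(pod_spec: dict, secret_name: Optional[str]) -> dict:
    """Return a copy of pod_spec with the pull secret added when one is set."""
    spec = copy.deepcopy(pod_spec)  # leave the caller's dictionary untouched
    if secret_name:
        secrets = spec.setdefault("imagePullSecrets", [])
        entry = {"name": secret_name}
        if entry not in secrets:  # avoid duplicate entries
            secrets.append(entry)
    return spec
```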

#### Scenario: Private registry with pull secret

- GIVEN a ProjectSettings with `runner_image` and `runner_image_pull_secret` set
- AND the referenced Secret exists in the project namespace
- WHEN the CP creates the runner pod
- THEN the pod spec includes the secret in `imagePullSecrets`

---

### Requirement: Image Pull Policy

The CP SHALL set `imagePullPolicy` based on the image reference:

| Reference type | Policy |
|----------------|--------|
| `@sha256:` digest | `IfNotPresent` |
| `localhost/` prefix | `IfNotPresent` |
| All others (tags) | `Always` |
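
The table maps directly onto a small function. A sketch, with a hypothetical function name:

```python
def image_pull_policy(image_ref: str) -> str:
    """Map an image reference to an imagePullPolicy per the table above."""
    # Digest references are immutable, and localhost/ images exist only
    # on the node, so neither needs a pull on every pod start.
    if "@sha256:" in image_ref or image_ref.startswith("localhost/"):
        return "IfNotPresent"
    # Tags are mutable, so always check the registry.
    return "Always"
```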

---

### Requirement: RBAC for Runner Image Configuration

Only users with `project_settings:update` permission SHALL be permitted to modify ProjectSettings, including the `runner_image` and `runner_image_pull_secret` fields. This follows the existing endpoint-level RBAC model.

#### Scenario: User without update permission

- GIVEN a user without `project_settings:update` permission
- WHEN they PATCH ProjectSettings with a `runner_image` value
- THEN the request is rejected with `403 Forbidden`

---

### Requirement: Running Sessions Unaffected

When `runner_image` changes on a ProjectSettings resource, the change SHALL apply to **new sessions only**. Running sessions continue using the image they were created with.

#### Scenario: Image change does not affect running sessions

- GIVEN running sessions in a project using image A
- WHEN the admin changes `runner_image` to image B
- THEN running sessions continue with image A
- AND the next session started uses image B

---

## Failure Modes

### Requirement: Health Check Timeout

The CP SHALL configure a readiness probe on the runner container (`GET /health` on `AGUI_PORT`). If the probe does not pass within the pod's startup timeout, the CP SHALL transition the session to `Failed`.

#### Scenario: Custom image crashes on start

- GIVEN a custom image with a broken dependency
- WHEN the pod starts and the runner process fails to initialize
- THEN the pod exits with a non-zero exit code
- AND the CP transitions the session to `Failed`

### Requirement: Bridge Mismatch

When a custom image does not contain the bridge implementation required by the session's `RUNNER_TYPE`, the runner process SHALL fail at startup. The pod logs SHALL contain an error identifying the missing bridge module.

Custom images built FROM the standard base image inherit all bridge implementations and are not affected.

#### Scenario: Custom image missing bridge for session runner type

- GIVEN a custom image that does not include the bridge for a given runner type
- AND a session is created with that runner type
- WHEN the pod starts
- THEN the runner process fails to load the bridge module
- AND the pod exits with a non-zero exit code
- AND the CP transitions the session to `Failed`

### Requirement: Image Pull Failure

When the kubelet cannot pull the custom image, the CP SHALL transition the session to `Failed` with the pull error in the session condition.

#### Scenario: Image does not exist in registry

- GIVEN `runner_image` pointing to a non-existent image
- WHEN the CP creates the pod
- THEN the pod's runner container enters the `ImagePullBackOff` waiting state
- AND the CP transitions the session to `Failed`

---

## Security Boundary

Custom runner images run within the same security perimeter as the standard runner:

- **Network isolation**: Runner pods are subject to NetworkPolicy. Outbound internet access is blocked by default.
- **Credential isolation**: Credentials are fetched per-turn via cluster-local endpoints only.
- **Per-session ServiceAccount**: Each session gets its own SA with minimal RBAC.

Custom images inherit these constraints.

---

## Base Image Publishing

### Requirement: Published Base Image

The platform SHALL publish a base runner image suitable for `FROM` directives at a stable, versioned tag. The image SHALL be built from the same source as the standard runner image.

Breaking changes to the stable contract SHALL increment the major version.

### Requirement: Contract Version Label

The base image SHALL carry an OCI label indicating the contract version (e.g., `io.ambient-code.runner-contract-version`).

The CP MAY log a warning if the contract version does not match the expected version. The CP SHALL NOT block pod creation based on contract version mismatch.
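
The warn-but-never-block behaviour could be sketched as below. The label key is the example given above; the logger name and function name are hypothetical, and treating a missing label as a mismatch is an assumption this spec does not settle.

```python
import logging
from collections.abc import Mapping
from typing import Optional

logger = logging.getLogger("cp.runner_image")  # hypothetical logger name

CONTRACT_LABEL = "io.ambient-code.runner-contract-version"


def contract_version_warning(labels: Mapping[str, str], expected: str) -> Optional[str]:
    """Log and return a warning on mismatch; never raise, never block."""
    actual = labels.get(CONTRACT_LABEL)
    if actual == expected:
        return None
    message = (
        f"runner contract version mismatch: expected {expected!r}, "
        f"image label is {actual!r}"
    )
    logger.warning(message)
    return message
```

The caller proceeds to create the pod whatever this function returns.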

#### Scenario: Contract version mismatch

- GIVEN the CP expects contract version `1`
- AND a custom image has a different contract version label
- WHEN the CP creates the pod
- THEN the CP logs a warning
- AND the pod is created normally