diff --git a/recipes/README.md b/recipes/README.md
index fb06b97..728e972 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -1,37 +1,41 @@
 # Recipes
 
-Two flavors:
+forkd is a fork-on-write microVM primitive — the "for AI agents"
+framing on the front page is one prominent use case, not the
+ceiling. Anything that wants **N isolated children spawned from a
+warmed parent in milliseconds** fits.
 
-1. **Integration recipes** (host-side, no rootfs build) — Python scripts
-   that drive forkd from an agent framework, demonstrating the
-   BRANCH + fanout pattern with that framework's idioms. Start here if
-   you want to plug forkd into CrewAI / AutoGen / Swarm / Claude
-   Desktop in five minutes.
-2. **Rootfs recipes** (parent images) — `build.sh` scripts that turn
-   public OCI images into forkd parent snapshots. Start here if you
-   want a custom warmed image to fork from.
+## Pick your starting point
 
-## Integration recipes (host-side)
+### By problem you're solving
 
-Read the script, copy the helper, drop it into your project. Each is
-~150-250 lines of Python with a `--dry-run` mode so you can verify
-the forkd plumbing without an LLM key.
-
-| Recipe | Framework | Forkd-specific value |
+| Problem | Recipes | What forkd buys you |
 |---|---|---|
-| [`mcp-agent/`](./mcp-agent/) | Claude Desktop / Cursor / Cline (via MCP) | End-to-end MCP protocol verification — same JSON-RPC framing a real LLM client uses |
-| [`crewai-fanout/`](./crewai-fanout/) | CrewAI | N agents on N microVMs from one warmed parent — per-agent isolation without Docker cold-start |
-| [`autogen-branch/`](./autogen-branch/) | AutoGen | Forkd-backed `CodeExecutor` + mid-conversation BRANCH that fans out N alternatives from the same warmed state |
-| [`openai-swarm/`](./openai-swarm/) | OpenAI Swarm / Agents SDK | Handoff = BRANCH: agent B inherits agent A's full VM state (filesystem, imports, env) on handoff |
+| **AI agent fan-out** — try N approaches, branch a thinking agent | [`langgraph-react/`](./langgraph-react/) · [`crewai-fanout/`](./crewai-fanout/) · [`autogen-branch/`](./autogen-branch/) · [`openai-swarm/`](./openai-swarm/) · [`mcp-agent/`](./mcp-agent/) · [`speculative-agent/`](./speculative-agent/) · [`coding-agent-fork/`](./coding-agent-fork/) | Per-child KVM isolation + warmed runtime inheritance. The "fork mid-thought" story. |
+| **CI test parallelism** — run 100 pytest workers from a warmed parent | [`postgres-fixture/`](./postgres-fixture/) (DB-per-test) · [`ci-parallel-pytest/`](./ci-parallel-pytest/) (worker fan-out) | Skip per-worker container cold-start + dependency install. ~50 ms / worker instead of ~3 s. |
+| **Database test fixtures** — fresh, isolated postgres per test | [`postgres-fixture/`](./postgres-fixture/) | `initdb` runs **once** at parent build; every fork inherits the post-init state. ~200× faster than per-test container. |
+| **Browser automation farms** — Playwright / Puppeteer fan-out at scale | [`playwright-browser/`](./playwright-browser/) | Fork warmed headless Chromium at ~10 ms instead of ~2 s cold-boot. |
+| **Notebook / code interpreter** — Jupyter kernel per session | [`jupyter-kernel/`](./jupyter-kernel/) · [`e2b-codeinterpreter/`](./e2b-codeinterpreter/) | Full SciPy stack pre-imported. ~1 ms per fresh kernel. |
+| **General-purpose compute fan-out** — anything that needs N warmed sandboxes | [`python-numpy/`](./python-numpy/) · [`coding-agent/`](./coding-agent/) · [`nodejs/`](./nodejs/) · [`agent-workbench/`](./agent-workbench/) | Pre-baked language runtime + canonical fan-out benchmark. |
+
+### By integration framework (host-side Python scripts)
 
-## Rootfs recipes
+If you're plugging forkd into an existing agent framework, these
+are ~150-250 lines of Python with a `--dry-run` mode so you can
+verify the forkd plumbing without an LLM key.
 
-Ready-made parent-rootfs recipes for common workbench images.
-Each recipe takes a public Docker / OCI image and turns it into a
-forkd parent snapshot, so you can fork N warmed children from it
-in milliseconds.
+| Framework | Recipe |
+|---|---|
+| Claude Desktop / Cursor / Cline (via MCP) | [`mcp-agent/`](./mcp-agent/) |
+| CrewAI multi-agent crew | [`crewai-fanout/`](./crewai-fanout/) |
+| AutoGen ConversableAgent / GroupChat | [`autogen-branch/`](./autogen-branch/) |
+| OpenAI Swarm / Agents SDK | [`openai-swarm/`](./openai-swarm/) |
+| LangGraph ReAct (the front-page demo) | [`langgraph-react/`](./langgraph-react/) |
 
-The pattern is the same across recipes:
+## How rootfs recipes work
+
+Rootfs recipes turn a public Docker / OCI image into a forkd parent
+snapshot. Same shape across all of them:
 
 ```bash
 # 1. Build a parent rootfs from an upstream image
@@ -39,7 +43,7 @@ sudo bash recipes/<name>/build.sh
 
 # 2. Snapshot the warmed parent (one-time per image version)
 sudo forkd snapshot --tag <name> \
-    --kernel ./vmlinux-6.1.141 \
+    --kernel /var/lib/forkd/kernels/vmlinux \
     --rootfs recipes/<name>/parent.ext4 \
     --tap forkd-tap0
 
@@ -47,43 +51,32 @@ sudo forkd snapshot --tag <name> \
 sudo -E forkd fork --tag <name> -n 100 --per-child-netns
 ```
 
+The first-time `build.sh` of each recipe takes a few minutes
+(pulling the Docker image + converting to ext4). The snapshot step
+is ~10 s. After that, each fork costs only what the published
+benchmarks report.
+
 ### Available rootfs recipes
 
-| Recipe | Parent image | Size | Audience |
+| Recipe | Parent image | Size | Best for |
 |---|---|---|---|
-| [`python-numpy/`](./python-numpy/) | `python:3.12-slim` + `python3-numpy` | ~1.5 GB | The canonical fan-out demo; what the chart on the front README measures |
-| [`e2b-codeinterpreter/`](./e2b-codeinterpreter/) | `e2bdev/code-interpreter` | ~600 MB | AI code-execution agents (Anthropic / OpenAI tutorials use this image). Lightest "agent ready" option |
-| [`jupyter-kernel/`](./jupyter-kernel/) | `quay.io/jupyter/scipy-notebook` | ~3 GB | Code-interpreter / notebook-style agents — full SciPy stack pre-imported, ~1 ms per fresh kernel instead of ~2 s |
-| [`coding-agent/`](./coding-agent/) | `python:3.12` + git + ruff + black + pytest | ~1.8 GB | SWE-style coding agents that need a real dev toolchain inside the sandbox |
-| [`nodejs/`](./nodejs/) | `node:22-slim` | ~250 MB | JavaScript / TypeScript workloads (Jest, Playwright fan-out) |
-| [`playwright-browser/`](./playwright-browser/) | `mcr.microsoft.com/playwright` (Node + Chromium pre-warmed) | ~2.5 GB | Browser-driving agents (computer-use, web research, UI test gen). Fork warmed headless Chromium at ~10 ms instead of ~2 s. **Alpha** |
-| [`agent-workbench/`](./agent-workbench/) | `agent-infra/sandbox` (browser + VSCode + Jupyter + MCP + shell) | ~5 GB | Kitchen-sink agent workbench when you want every tool already mounted; trades a bigger memory.bin for batteries-included |
-| [`postgres-fixture/`](./postgres-fixture/) | `postgres:16` (initdb done, postmaster pre-launched) | ~500 MB | Fork-per-test isolated databases; each child gets a ready-to-query postgres in ~10 ms instead of ~2 s for fresh initdb |
-
-## Choosing a recipe
-
-**By framework / driver (integration recipes):**
-- **Claude Desktop / Cursor / Cline** → `mcp-agent/`
-- **CrewAI multi-agent crew** → `crewai-fanout/`
-- **AutoGen ConversableAgent / GroupChat** → `autogen-branch/`
-- **OpenAI Swarm / Agents SDK** → `openai-swarm/`
-
-**By workload (rootfs recipes):**
-- **You're benchmarking** → `python-numpy/`
-- **You're running an AI code interpreter** → `e2b-codeinterpreter/`
-- **You need the full SciPy / notebook stack** → `jupyter-kernel/`
-- **You're running a coding agent (SWE-bench style)** → `coding-agent/`
-- **JS / TS only** → `nodejs/`
-- **Browser-driving agent (computer-use, scraping, UI testing)** → `playwright-browser/`
-- **You want browser + IDE + everything in one box** → `agent-workbench/`
-- **You're running a test suite that needs an isolated DB per test** → `postgres-fixture/`
+| [`python-numpy/`](./python-numpy/) | `python:3.12-slim` + `python3-numpy` | ~1.5 GB | **The canonical fan-out benchmark** — what the chart on the front README measures |
+| [`postgres-fixture/`](./postgres-fixture/) | `postgres:16` (initdb done, postmaster pre-launched) | ~500 MB | **Fork-per-test isolated databases** — each child gets a ready-to-query postgres in ~10 ms vs ~2 s for fresh initdb |
+| [`ci-parallel-pytest/`](./ci-parallel-pytest/) | `python:3.12-slim` + numpy/pandas/sklearn + your test suite | ~2 GB | **CI test fan-out** — parallel pytest workers without per-worker container cold-start |
+| [`playwright-browser/`](./playwright-browser/) | `mcr.microsoft.com/playwright` (Node + Chromium pre-warmed) | ~2.5 GB | **Browser automation farms** — warmed headless Chromium at ~10 ms instead of ~2 s. **Alpha** |
+| [`jupyter-kernel/`](./jupyter-kernel/) | `quay.io/jupyter/scipy-notebook` | ~3 GB | **Code-interpreter / notebook agents** — full SciPy stack pre-imported, ~1 ms per fresh kernel |
+| [`e2b-codeinterpreter/`](./e2b-codeinterpreter/) | `e2bdev/code-interpreter` | ~600 MB | **AI code-execution agents** (Anthropic / OpenAI tutorials use this image). Lightest "agent ready" option |
+| [`coding-agent/`](./coding-agent/) | `python:3.12` + git + ruff + black + pytest | ~1.8 GB | **SWE-style coding agents** that need a real dev toolchain inside the sandbox |
+| [`nodejs/`](./nodejs/) | `node:22-slim` | ~250 MB | **JavaScript / TypeScript workloads** (Jest, Playwright fan-out, scraping) |
+| [`agent-workbench/`](./agent-workbench/) | `agent-infra/sandbox` (browser + VSCode + Jupyter + MCP + shell) | ~5 GB | **Kitchen-sink workbench** when you want every tool already mounted |
 
 ## Notes
 
 - Recipes are tested on Ubuntu 24.04 / Linux 6.14 / x86_64. Other distros
   may need adjustments to `scripts/build-rootfs.sh`.
-- The first-time `build.sh` of each recipe takes a few minutes (pulling
-  the Docker image + converting to ext4). The snapshot step is ~10 s.
-  After that, forking children is the published benchmark cost.
 - Each recipe is self-contained — pick one, run it; you don't need to
   understand the others.
+- The "AI agent" framing on the project front page is the dominant use
+  case **today** but not the only one — the technology is `fork(2)` for
+  KVM microVMs. If your workload needs N hardware-isolated children
+  spawned from a warmed parent in milliseconds, forkd is the primitive.
diff --git a/recipes/ci-parallel-pytest/README.md b/recipes/ci-parallel-pytest/README.md
new file mode 100644
index 0000000..cba1042
--- /dev/null
+++ b/recipes/ci-parallel-pytest/README.md
@@ -0,0 +1,174 @@
+# `ci-parallel-pytest`
+
+**Run your pytest suite across N forkd microVMs in parallel,
+without paying per-worker container cold-start or dependency
+import cost.**
+
+A typical Python CI job re-imports numpy / pandas / scikit-learn on
+every fresh worker container — ~1-2 s of pure overhead before the
+first test runs. With forkd, those imports live in the warmed
+parent's snapshot; every fork inherits them via `mmap MAP_PRIVATE`
+copy-on-write. Per-worker fixed cost drops to ~50-100 ms.
+
+## Architecture
+
+```
+                 ┌──────────────────────────────────────┐
+                 │  parent snapshot `ci-pytest`         │
+                 │  python:3.12-slim                    │
+                 │  + pytest 8.3 numpy 2.0 pandas 2.2   │
+                 │  + scikit-learn 1.5                  │
+                 │  + your /opt/test_project            │
+                 │  (heavy imports already paid)        │
+                 └────────────────┬─────────────────────┘
+                                  │  mmap MAP_PRIVATE (CoW)
+            ┌─────────────────────┼─────────────────────┐
+            │                     │                     │
+       ┌────▼───────┐       ┌─────▼──────┐       ┌──────▼─────┐
+       │ worker 1   │       │ worker 2   │       │ worker N   │
+       │ pytest     │       │ pytest     │       │ pytest     │
+       │ slice 1/N  │  ...  │ slice 2/N  │  ...  │ slice N/N  │
+       └────────────┘       └────────────┘       └────────────┘
+                            run in parallel
+```
+
+## What ships in this recipe
+
+| File | What it does |
+|---|---|
+| [`build.sh`](./build.sh) | Builds a forkd parent rootfs: `python:3.12-slim` + pinned pytest/numpy/pandas/sklearn, the demo test project copied to `/opt/test_project`, and a pre-warm step that imports the heavy deps so they're in the snapshot's page cache |
+| [`test_project/`](./test_project/) | A representative pytest project — ~30 tests across 5 files (arithmetic, numpy, pandas, sklearn, text). Replace with your own |
+| [`demo.py`](./demo.py) | Fan-out driver: slices test files across N forkd workers, runs each slice in a child sandbox, reports per-worker spawn/exec timing + total wall-clock + sequential-baseline comparison |
+
+## When to use this
+
+- **CI pipelines with hundreds of pytest tests** that re-import heavy
+  ML libs every run. The savings compound: every PR run, every
+  retry, every nightly.
+- **PR-preview environments** where each PR needs its own clean
+  pytest run with fresh side-effects (DB, filesystem, env). forkd's
+  per-child KVM isolation means workers truly don't see each other.
+- **Sharded fuzz / property testing**: split a 10 000-iteration
+  Hypothesis run across N microVMs without setup tax.
+
+## When NOT to use this
+
+- Your test suite is < 30 tests and finishes in < 2 s sequentially —
+  parallelism overhead exceeds the gain.
+- You don't actually need per-worker isolation (e.g. pure-function
+  unit tests with no shared state) — `pytest -n <N>` (pytest-xdist)
+  in a single container is simpler.
+- You can't run forkd on your CI host (managed CI like default
+  GitHub Actions, no KVM). For self-hosted runners with bare-metal
+  Linux + KVM this works great.
+
+## Quickstart
+
+```bash
+# 1. Build the parent (one-time, ~5 min — pip install pandas + sklearn
+#    dominates the time)
+sudo bash recipes/ci-parallel-pytest/build.sh
+
+# 2. Snapshot the warmed parent (one-time, ~10 s)
+sudo forkd snapshot --tag ci-pytest \
+    --kernel /var/lib/forkd/kernels/vmlinux \
+    --rootfs recipes/ci-parallel-pytest/parent.ext4 \
+    --tap forkd-tap0
+
+# 3. Fan out — 4 workers in parallel
+FORKD_TOKEN=$(sudo cat /tmp/bench-pause/token) \
+    python3 recipes/ci-parallel-pytest/demo.py --workers 4 \
+                                               --sequential-baseline
+```
+
+Output from the dev box (Intel i7-12700, ext4, 2026-06-06):
+
+```
+Plan: 4 worker(s) × pytest slice off `ci-pytest`.
+  worker 0: 2 file(s) — test_arithmetic.py, test_text_processing.py
+  worker 1: 1 file(s) — test_numpy_ops.py
+  worker 2: 1 file(s) — test_pandas_etl.py
+  worker 3: 1 file(s) — test_sklearn_models.py
+
+=== fan-out: 4 workers in parallel ===
+  batch spawn (4 children): 81 ms
+  [0] PASS  exec= 232 ms  files=test_arithmetic.py,test_text_processing.py
+  [1] PASS  exec= 304 ms  files=test_numpy_ops.py
+  [2] PASS  exec= 546 ms  files=test_pandas_etl.py
+  [3] PASS  exec=1458 ms  files=test_sklearn_models.py
+
+fan-out wall-clock:  1601 ms   (batch spawn=81 ms = ~20 ms/worker,
+                                slowest worker exec=1458 ms)
+
+=== sequential baseline: one child runs the whole suite ===
+  [0] PASS  spawn=61 ms  exec=1507 ms
+sequential wall-clock: 1625 ms   (fan-out speedup: 1.01×)
+```
+
+The 1.01× fan-out-vs-sequential figure is honest: this demo suite
+only has ~30 tests and is dominated by one sklearn slice (1458 ms).
+Fan-out shines when **your suite has many slow slices of comparable
+size** — e.g. 8 sklearn-heavy slices each taking ~1.5 s would fan
+out to ~1.5 s wall, vs ~12 s sequentially.
+
+**The number that matters across suite shapes is the batch spawn
+cost: 81 ms for 4 children — ~20 ms per worker.** That's the
+forkd-vs-container comparison: ~20 ms to start a forkd worker vs
+~2-3 s to start a fresh container.
+
+## GitHub Actions integration
+
+Drop this into your workflow on a self-hosted runner that has forkd
++ a `ci-pytest` snapshot pre-built:
+
+```yaml
+jobs:
+  test:
+    runs-on: [self-hosted, linux, x64, forkd]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Refresh the parent snapshot
+        run: |
+          sudo cp -rT ./tests /opt/test_project/tests   # copy your tests into the snap dir; -T avoids a nested tests/tests
+          # or rebuild the parent if your deps changed:
+          # sudo bash recipes/ci-parallel-pytest/build.sh
+      - name: Fan out
+        env:
+          FORKD_TOKEN: ${{ secrets.FORKD_TOKEN }}
+        run: |
+          python3 recipes/ci-parallel-pytest/demo.py \
+              --workers 8 \
+              --snapshot-tag ci-pytest
+```
+
+For a hosted-runner setup, the equivalent is one forkd daemon on
+your CI infrastructure, exposed over a port the runner can reach.
+
+## How it compares
+
+| Approach | Per-worker fixed cost | Notes |
+|---|---|---|
+| `pytest` sequential, fresh container | ~2 s container cold + ~1.5 s `import numpy/pandas/sklearn` | Each PR run / retry / nightly re-pays both |
+| `pytest-xdist -n 4` in one container | ~2 s container cold + ~1.5 s imports (shared across workers) | Single container, single shared kernel; workers are not isolated from each other |
+| `docker run` × 4 fresh containers | ~3.5 s × 4 cold-starts, parallelized | Per-container isolation, but slow to spawn |
+| **forkd fan-out (this recipe)** | **~20 ms batch spawn + 0 ms imports** | Per-child KVM isolation, warmed Python deps inherited via mmap CoW |
+
+The break-even point is roughly: if each test slice runs much
+longer than the container cold-start (~3 s), the start-up tax is
+amortized and container parallelism is fine. If a slice is
+**comparable to or shorter than** the ~3 s container tax, forkd
+wins outright. ML / data science suites that re-pay the sklearn /
+torch import on every worker fall squarely in the forkd-wins zone.
+
+## Caveats
+
+- **`pip install` inside snapshots requires v0.5.1+** — the guest
+  kernel rebuild that landed in #226 closed #218 (CRNG starvation
+  blocked OpenSSL → pip hung). Confirm your kernel:
+  `forkd snapshot-info ci-pytest`
+- **Per-worker netns is on by default** — children get their own
+  `lo`, no cross-talk. If your tests need to hit a shared DB, use
+  `--per-child-netns=false` or put the DB on the host tap.
+- **Worker count vs vCPU**: forkd's per-vCPU policy is "share the
+  host's cores". On a 20-core host, 8 workers is comfortable; 50
+  is over-subscribed.
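The fan-out arithmetic in the README above (fan-out wall-clock is bounded by the slowest slice, sequential wall-clock by the sum of all slices) can be written down as a rough model. This is a minimal sketch, not part of the recipe: `estimate_wall_clock_ms` is a hypothetical helper, and the spawn figures passed in are placeholders borrowed from the 4-worker run, since no 8-worker measurement is reported.

```python
from __future__ import annotations


def estimate_wall_clock_ms(
    slice_exec_ms: list[float],
    batch_spawn_ms: float,
    single_spawn_ms: float,
) -> tuple[float, float, float]:
    """Rough model of fan-out vs sequential wall-clock.

    Fan-out waits for the slowest slice; sequential runs every slice
    back to back in one child. Returns (fan_out, sequential, speedup).
    """
    if not slice_exec_ms:
        raise ValueError("need at least one slice")
    fan_out = batch_spawn_ms + max(slice_exec_ms)
    sequential = single_spawn_ms + sum(slice_exec_ms)
    return fan_out, sequential, sequential / fan_out


# The README's hypothetical: 8 sklearn-heavy slices of ~1.5 s each.
# 81 ms / 61 ms are placeholder spawn costs (assumption, see above).
fan_out, sequential, speedup = estimate_wall_clock_ms([1500.0] * 8, 81.0, 61.0)
print(f"fan-out ~{fan_out:.0f} ms, sequential ~{sequential:.0f} ms, "
      f"speedup ~{speedup:.1f}x")
```

The model is optimistic about the sequential side for small suites: in the measured run, the four slice times sum to 2540 ms while the single-child run took 1507 ms of exec time, presumably because each worker pays its own pytest start-up. Treat the output as an upper bound on speedup, not a prediction.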
diff --git a/recipes/ci-parallel-pytest/build.sh b/recipes/ci-parallel-pytest/build.sh
new file mode 100644
index 0000000..cc7e3f1
--- /dev/null
+++ b/recipes/ci-parallel-pytest/build.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env bash
+# Build a forkd parent rootfs for CI test parallelism — pytest +
+# numpy/pandas/sklearn pre-imported, the demo test project under
+# /opt/test_project. Children fork from this, each running a slice
+# of the test suite from the warmed parent.
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+IMAGE="${IMAGE:-python:3.12-slim}"
+SIZE_MIB="${SIZE_MIB:-2048}"
+OUT="$SCRIPT_DIR/parent.ext4"
+
+[ "$(id -u)" -eq 0 ] || { echo "run as root" >&2; exit 1; }
+
+# Heavy deps baked in so children inherit them pre-installed. Pinned so
+# the benchmark numbers in README.md stay reproducible across builds.
+PIP_PKGS="pytest==8.3.4 numpy==2.0.2 pandas==2.2.3 scikit-learn==1.5.2"
+
+WRAPPED_TAG="forkd-ci-pytest:tmp-$$"
+TMP_CTX="$(mktemp -d)"
+trap 'rm -rf "$TMP_CTX"; docker image rm -f "$WRAPPED_TAG" >/dev/null 2>&1 || true' EXIT
+
+# Copy the test project into the build context so it's baked into
+# the rootfs at /opt/test_project. Real users would `cp -r` their
+# own project here instead.
+cp -r "$SCRIPT_DIR/test_project" "$TMP_CTX/test_project"
+
+cat > "$TMP_CTX/Dockerfile" <<DOCKER
+FROM ${IMAGE}
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        build-essential \
+ && rm -rf /var/lib/apt/lists/*
+RUN pip install --no-cache-dir ${PIP_PKGS}
+COPY test_project /opt/test_project
+WORKDIR /opt/test_project
+# Pre-warm: import the heavy deps so they live in the snapshot's
+# page cache. Children inherit the warmed mappings via mmap CoW.
+RUN python3 -c "import numpy, pandas, sklearn; print('prewarm:', numpy.__version__, pandas.__version__, sklearn.__version__)"
+DOCKER
+
+docker build -t "$WRAPPED_TAG" "$TMP_CTX"
+
+bash "$REPO_ROOT/scripts/build-rootfs.sh" "$WRAPPED_TAG" "$OUT" "$SIZE_MIB"
+
+echo
+echo "parent rootfs ready: $OUT ($(du -h "$OUT" | cut -f1))"
+echo
+echo "next:"
+echo "  sudo forkd snapshot --tag ci-pytest \\"
+echo "      --kernel /var/lib/forkd/kernels/vmlinux \\"
+echo "      --rootfs $OUT --tap forkd-tap0"
+echo
+echo "then run the fan-out demo:"
+echo "  FORKD_TOKEN=\$(cat /tmp/bench-pause/token) python3 $SCRIPT_DIR/demo.py"
diff --git a/recipes/ci-parallel-pytest/demo.py b/recipes/ci-parallel-pytest/demo.py
new file mode 100644
index 0000000..3e8e2b6
--- /dev/null
+++ b/recipes/ci-parallel-pytest/demo.py
@@ -0,0 +1,239 @@
+#!/usr/bin/env python3
+"""Fan out a pytest suite across N forkd microVMs.
+
+Splits the test_project's tests into N slices (by file), spawns N
+children from the `ci-pytest` snapshot, runs one slice per child in
+parallel, collects results, and reports total wall-clock vs the
+sequential baseline.
+
+For the demo to work the parent must already be built + registered:
+
+    sudo bash recipes/ci-parallel-pytest/build.sh
+    sudo forkd snapshot --tag ci-pytest \\
+        --kernel /var/lib/forkd/kernels/vmlinux \\
+        --rootfs recipes/ci-parallel-pytest/parent.ext4 \\
+        --tap forkd-tap0
+
+Then drive it:
+
+    FORKD_TOKEN=$(cat /tmp/bench-pause/token) \\
+        python3 recipes/ci-parallel-pytest/demo.py --workers 4
+
+Usage:
+    demo.py [--workers N] [--snapshot-tag TAG] [--sequential-baseline]
+"""
+
+from __future__ import annotations
+
+import argparse
+import concurrent.futures as futures
+import json
+import os
+import time
+import urllib.error
+import urllib.request
+
+DEFAULT_TAG = "ci-pytest"
+DEFAULT_URL = os.environ.get("FORKD_URL", "http://127.0.0.1:8889")
+
+
+def http(
+    method: str, path: str, token: str, body: dict | None = None, timeout: float = 120
+) -> dict:
+    data = json.dumps(body).encode() if body is not None else None
+    headers = {"Authorization": f"Bearer {token}"}
+    if body is not None:
+        headers["Content-Type"] = "application/json"
+    req = urllib.request.Request(
+        f"{DEFAULT_URL}{path}", data=data, method=method, headers=headers
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            raw = resp.read()
+            return json.loads(raw) if raw else {}
+    except urllib.error.HTTPError as e:
+        detail = e.read().decode("utf-8", "replace")
+        raise RuntimeError(f"{method} {path} → HTTP {e.code} {detail[:400]}") from e
+
+
+# The set of test files baked into /opt/test_project/tests/ in the
+# `ci-pytest` snapshot. In a real CI setup this would come from
+# `pytest --collect-only -q` against the user's project.
+TEST_FILES = [
+    "tests/test_arithmetic.py",
+    "tests/test_numpy_ops.py",
+    "tests/test_pandas_etl.py",
+    "tests/test_sklearn_models.py",
+    "tests/test_text_processing.py",
+]
+
+
+def slice_tests(n_workers: int) -> list[list[str]]:
+    """Round-robin assign test files to N worker slices."""
+    slices: list[list[str]] = [[] for _ in range(n_workers)]
+    for i, f in enumerate(TEST_FILES):
+        slices[i % n_workers].append(f)
+    return [s for s in slices if s]
+
+
+def batch_spawn(n: int, snap_tag: str, token: str) -> tuple[list[str], float]:
+    """One POST /v1/sandboxes with n=N. The daemon's `restore_many`
+    spawns all N children atomically — this avoids the
+    'operation not supported after starting the microVM' race that
+    bites if multiple POST /v1/sandboxes calls fire concurrently
+    against the same snapshot.
+
+    Returns (sandbox_ids, total_spawn_wall_clock_ms).
+    """
+    t0 = time.monotonic()
+    spawned = http(
+        "POST",
+        "/v1/sandboxes",
+        token,
+        # per_child_netns: each child gets its own network namespace
+        # (forkd-child-<i>) so workers don't compete for forkd-tap0.
+        {"snapshot_tag": snap_tag, "n": n, "per_child_netns": True},
+    )
+    spawn_ms = (time.monotonic() - t0) * 1000
+    return [s["id"] for s in spawned], spawn_ms
+
+
+def run_pytest_in_sandbox(
+    idx: int, sb_id: str, files: list[str], token: str
+) -> dict:
+    """Drive an already-spawned child: ping until ready → exec pytest
+    → delete. Returns per-worker timing.
+    """
+    # Wait for the guest agent.
+    deadline = time.monotonic() + 30
+    while time.monotonic() < deadline:
+        try:
+            http("POST", f"/v1/sandboxes/{sb_id}/ping", token, body={}, timeout=2)
+            break
+        except Exception:
+            time.sleep(0.1)
+
+    cmd = "cd /opt/test_project && python3 -m pytest -v --tb=short " + " ".join(files)
+    args = ["sh", "-c", cmd]
+    t_exec = time.monotonic()
+    try:
+        result = http(
+            "POST",
+            f"/v1/sandboxes/{sb_id}/exec",
+            token,
+            {"args": args, "timeout_secs": 120},
+            timeout=130,
+        )
+        exec_ms = (time.monotonic() - t_exec) * 1000
+        return {
+            "worker_idx": idx,
+            "files": files,
+            "exec_ms": round(exec_ms, 1),
+            "exit_code": result.get("exit_code", -1),
+            "stdout_tail": (result.get("stdout") or "").strip().split("\n")[-3:],
+        }
+    except Exception as e:
+        return {
+            "worker_idx": idx,
+            "files": files,
+            "exec_ms": round((time.monotonic() - t_exec) * 1000, 1),
+            "exit_code": -1,
+            "stdout_tail": [f"ERR: {e}"],
+        }
+    finally:
+        try:
+            http("DELETE", f"/v1/sandboxes/{sb_id}", token, timeout=15)
+        except Exception:
+            pass
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--workers", type=int, default=4)
+    ap.add_argument("--snapshot-tag", default=DEFAULT_TAG)
+    ap.add_argument(
+        "--sequential-baseline",
+        action="store_true",
+        help="Also run the full suite in one child for comparison",
+    )
+    ap.add_argument(
+        "--token",
+        default=os.environ.get("FORKD_TOKEN", ""),
+        help="Bearer token (or FORKD_TOKEN env)",
+    )
+    args = ap.parse_args()
+
+    if not args.token:
+        print("ERROR: set FORKD_TOKEN env or pass --token")
+        return 2
+
+    slices = slice_tests(args.workers)
+    print(
+        f"Plan: {len(slices)} worker(s) × pytest slice off `{args.snapshot_tag}`."
+    )
+    for i, s in enumerate(slices):
+        print(f"  worker {i}: {len(s)} file(s) — {', '.join(f.split('/')[-1] for f in s)}")
+    print()
+
+    print(f"=== fan-out: {len(slices)} workers in parallel ===")
+    t_wall0 = time.monotonic()
+    sb_ids, batch_spawn_ms = batch_spawn(len(slices), args.snapshot_tag, args.token)
+    print(f"  batch spawn ({len(slices)} children): {batch_spawn_ms:.0f} ms")
+
+    with futures.ThreadPoolExecutor(max_workers=len(slices)) as pool:
+        results = list(
+            pool.map(
+                lambda p: run_pytest_in_sandbox(*p),
+                [
+                    (i, sb_ids[i], slices[i], args.token)
+                    for i in range(len(slices))
+                ],
+            )
+        )
+    wall_ms = (time.monotonic() - t_wall0) * 1000
+
+    fail = 0
+    for r in results:
+        status = "PASS" if r["exit_code"] == 0 else f"FAIL({r['exit_code']})"
+        files_short = ",".join(f.split("/")[-1] for f in r["files"])
+        print(
+            f"  [{r['worker_idx']}] {status}  exec={r['exec_ms']:>5.0f}ms  "
+            f"files={files_short}"
+        )
+        if r["exit_code"] != 0:
+            fail += 1
+            for line in r["stdout_tail"]:
+                print(f"        | {line}")
+
+    exec_ms = [r["exec_ms"] for r in results]
+    spawn_per_worker = batch_spawn_ms / len(slices)
+    print()
+    print(
+        f"fan-out wall-clock:  {wall_ms:.0f} ms   "
+        f"(batch spawn={batch_spawn_ms:.0f} ms = ~{spawn_per_worker:.0f} ms/worker, "
+        f"slowest worker exec={max(exec_ms):.0f} ms)"
+    )
+
+    if args.sequential_baseline:
+        print()
+        print("=== sequential baseline: one child runs the whole suite ===")
+        t0 = time.monotonic()
+        seq_ids, seq_spawn_ms = batch_spawn(1, args.snapshot_tag, args.token)
+        seq = run_pytest_in_sandbox(0, seq_ids[0], TEST_FILES, args.token)
+        seq_wall_ms = (time.monotonic() - t0) * 1000
+        status = "PASS" if seq["exit_code"] == 0 else f"FAIL({seq['exit_code']})"
+        print(
+            f"  [0] {status}  spawn={seq_spawn_ms:.0f}ms  "
+            f"exec={seq['exec_ms']:.0f}ms"
+        )
+        speedup = seq_wall_ms / wall_ms if wall_ms > 0 else 0
+        print(
+            f"sequential wall-clock: {seq_wall_ms:.0f} ms   "
+            f"(fan-out speedup: {speedup:.2f}×)"
+        )
+
+    return 1 if fail else 0  # exit status: non-zero if any worker failed
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
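`slice_tests` in demo.py assigns files round-robin, which is why one sklearn-heavy file can dominate a slice. If per-file durations from a previous run are available, a greedy longest-processing-time-first assignment usually balances slices better. The sketch below is an alternative technique, not what demo.py does: `slice_by_duration` is a hypothetical helper, and the durations in the example are made-up illustrative values, not measurements.

```python
from __future__ import annotations

import heapq


def slice_by_duration(durations_ms: dict[str, float], n_workers: int) -> list[list[str]]:
    """Greedy LPT: hand the longest remaining file to the lightest worker."""
    if n_workers < 1:
        raise ValueError("n_workers must be >= 1")
    # Min-heap of (accumulated_ms, worker_index); lightest worker on top.
    heap = [(0.0, i) for i in range(n_workers)]
    heapq.heapify(heap)
    slices: list[list[str]] = [[] for _ in range(n_workers)]
    longest_first = sorted(durations_ms.items(), key=lambda kv: kv[1], reverse=True)
    for path, ms in longest_first:
        total, idx = heapq.heappop(heap)
        slices[idx].append(path)
        heapq.heappush(heap, (total + ms, idx))
    # Drop empty slices, matching slice_tests' behaviour.
    return [s for s in slices if s]


# Illustrative durations only (assumed, not measured).
example = {
    "tests/test_arithmetic.py": 100.0,
    "tests/test_numpy_ops.py": 300.0,
    "tests/test_pandas_etl.py": 500.0,
    "tests/test_sklearn_models.py": 1400.0,
    "tests/test_text_processing.py": 120.0,
}
for i, files in enumerate(slice_by_duration(example, 2)):
    print(i, sum(example[f] for f in files), files)
```

With these example values and two workers, the sklearn file gets a worker to itself and the other four files share the second, so the slowest slice is 1400 ms; round-robin over the same values would pair sklearn with numpy for a 1700 ms slice.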
diff --git a/recipes/ci-parallel-pytest/test_project/pyproject.toml b/recipes/ci-parallel-pytest/test_project/pyproject.toml
new file mode 100644
index 0000000..4ea4a7e
--- /dev/null
+++ b/recipes/ci-parallel-pytest/test_project/pyproject.toml
@@ -0,0 +1,9 @@
+[project]
+name = "forkd-ci-pytest-demo"
+version = "0.1.0"
+description = "Pytest test suite used by the forkd ci-parallel-pytest recipe"
+requires-python = ">=3.10"
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-v --tb=short"
diff --git a/recipes/ci-parallel-pytest/test_project/tests/test_arithmetic.py b/recipes/ci-parallel-pytest/test_project/tests/test_arithmetic.py
new file mode 100644
index 0000000..b03ecd4
--- /dev/null
+++ b/recipes/ci-parallel-pytest/test_project/tests/test_arithmetic.py
@@ -0,0 +1,27 @@
+"""Tiny arithmetic suite — exercises pytest startup + import overhead.
+
+Realistic CI suites have hundreds of these; they're individually cheap
+but the per-test fixed overhead (`pytest` startup + test collection +
+fixture setup) is what eats wall-clock when run sequentially.
+"""
+import pytest
+
+
+@pytest.mark.parametrize("a,b,expected", [(1, 2, 3), (5, 7, 12), (-1, 1, 0), (0, 0, 0)])
+def test_addition(a: int, b: int, expected: int) -> None:
+    assert a + b == expected
+
+
+@pytest.mark.parametrize("a,b,expected", [(10, 3, 30), (-2, 4, -8), (0, 999, 0)])
+def test_multiplication(a: int, b: int, expected: int) -> None:
+    assert a * b == expected
+
+
+def test_division_by_zero_raises() -> None:
+    with pytest.raises(ZeroDivisionError):
+        _ = 1 / 0
+
+
+def test_modulo_invariant() -> None:
+    for n in range(20):
+        assert n % 7 in range(7)
diff --git a/recipes/ci-parallel-pytest/test_project/tests/test_numpy_ops.py b/recipes/ci-parallel-pytest/test_project/tests/test_numpy_ops.py
new file mode 100644
index 0000000..8d96db0
--- /dev/null
+++ b/recipes/ci-parallel-pytest/test_project/tests/test_numpy_ops.py
@@ -0,0 +1,52 @@
+"""numpy array tests — the import alone is ~150 ms cold.
+
+The whole point of forkd for CI: numpy is already imported in the
+parent snapshot, so children inherit the import for free. Each
+child's pytest startup skips the import cost.
+"""
+import numpy as np
+import pytest
+
+
+def test_zeros_shape() -> None:
+    a = np.zeros((4, 5))
+    assert a.shape == (4, 5)
+    assert a.sum() == 0
+
+
+def test_linspace_endpoints() -> None:
+    a = np.linspace(0.0, 1.0, 11)
+    assert a[0] == pytest.approx(0.0)
+    assert a[-1] == pytest.approx(1.0)
+    assert len(a) == 11
+
+
+def test_dot_product_associative() -> None:
+    rng = np.random.default_rng(seed=0)
+    a = rng.standard_normal((3, 3))
+    b = rng.standard_normal((3, 3))
+    c = rng.standard_normal((3, 3))
+    np.testing.assert_allclose((a @ b) @ c, a @ (b @ c), rtol=1e-10)
+
+
+def test_solve_smoke() -> None:
+    a = np.array([[2.0, 1.0], [1.0, 3.0]])
+    b = np.array([1.0, 2.0])
+    x = np.linalg.solve(a, b)
+    np.testing.assert_allclose(a @ x, b)
+
+
+def test_eigvals_real_for_symmetric() -> None:
+    rng = np.random.default_rng(seed=1)
+    a = rng.standard_normal((8, 8))
+    sym = (a + a.T) / 2
+    vals = np.linalg.eigvalsh(sym)
+    assert np.isrealobj(vals)
+    assert len(vals) == 8
+
+
+def test_fft_inverse_round_trips() -> None:
+    rng = np.random.default_rng(seed=2)
+    a = rng.standard_normal(64)
+    recovered = np.fft.ifft(np.fft.fft(a)).real
+    np.testing.assert_allclose(a, recovered, atol=1e-10)
diff --git a/recipes/ci-parallel-pytest/test_project/tests/test_pandas_etl.py b/recipes/ci-parallel-pytest/test_project/tests/test_pandas_etl.py
new file mode 100644
index 0000000..8a51a43
--- /dev/null
+++ b/recipes/ci-parallel-pytest/test_project/tests/test_pandas_etl.py
@@ -0,0 +1,57 @@
+"""pandas DataFrame tests — `import pandas` is ~400-800 ms cold."""
+import numpy as np
+import pandas as pd
+import pytest
+
+
+@pytest.fixture
+def sample_df() -> pd.DataFrame:
+    return pd.DataFrame({
+        "user": ["alice", "bob", "carol", "dave", "eve"],
+        "age": [29, 41, 35, 22, 58],
+        "score": [0.81, 0.65, 0.92, 0.74, 0.88],
+    })
+
+
+def test_dataframe_construction(sample_df: pd.DataFrame) -> None:
+    assert len(sample_df) == 5
+    assert list(sample_df.columns) == ["user", "age", "score"]
+
+
+def test_filter_by_age(sample_df: pd.DataFrame) -> None:
+    thirty_plus = sample_df[sample_df["age"] >= 30]
+    assert len(thirty_plus) == 3
+    assert set(thirty_plus["user"]) == {"bob", "carol", "eve"}
+
+
+def test_groupby_aggregation() -> None:
+    df = pd.DataFrame({
+        "team": ["a", "a", "b", "b", "c"],
+        "score": [1, 2, 3, 4, 5],
+    })
+    means = df.groupby("team")["score"].mean()
+    assert means["a"] == pytest.approx(1.5)
+    assert means["b"] == pytest.approx(3.5)
+    assert means["c"] == pytest.approx(5.0)
+
+
+def test_merge_inner() -> None:
+    left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
+    right = pd.DataFrame({"id": [2, 3, 4], "y": [20, 30, 40]})
+    merged = left.merge(right, on="id", how="inner")
+    assert len(merged) == 2
+    assert list(merged["y"]) == [20, 30]
+
+
+def test_to_csv_roundtrip(tmp_path) -> None:
+    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
+    p = tmp_path / "x.csv"
+    df.to_csv(p, index=False)
+    loaded = pd.read_csv(p)
+    pd.testing.assert_frame_equal(df, loaded)
+
+
+def test_numeric_describe(sample_df: pd.DataFrame) -> None:
+    stats = sample_df[["age", "score"]].describe()
+    assert stats.loc["count", "age"] == 5
+    assert stats.loc["min", "score"] == pytest.approx(0.65)
diff --git a/recipes/ci-parallel-pytest/test_project/tests/test_sklearn_models.py b/recipes/ci-parallel-pytest/test_project/tests/test_sklearn_models.py
new file mode 100644
index 0000000..5b91a4c
--- /dev/null
+++ b/recipes/ci-parallel-pytest/test_project/tests/test_sklearn_models.py
@@ -0,0 +1,53 @@
+"""sklearn model tests — `import sklearn` is ~600-1200 ms cold.
+
+This is the worst fixed cost per invocation in a typical Python ML CI
+suite: every fresh pytest process re-pays it. In a forkd parent, the
+import is part of the warmed snapshot.
+"""
+import numpy as np
+import pytest
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_classification
+from sklearn.linear_model import LinearRegression, LogisticRegression
+from sklearn.metrics import accuracy_score, r2_score
+from sklearn.model_selection import train_test_split
+
+
+def test_linear_regression_exact_fit() -> None:
+    rng = np.random.default_rng(seed=0)
+    x = rng.standard_normal((100, 3))
+    coef_true = np.array([1.5, -2.0, 0.5])
+    y = x @ coef_true + 7.0
+    model = LinearRegression().fit(x, y)
+    np.testing.assert_allclose(model.coef_, coef_true, atol=1e-9)
+    assert model.intercept_ == pytest.approx(7.0)
+    assert r2_score(y, model.predict(x)) == pytest.approx(1.0)
+
+
+def test_logistic_regression_separable() -> None:
+    x, y = make_classification(
+        n_samples=200, n_features=8, n_informative=4, random_state=42,
+    )
+    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)
+    model = LogisticRegression(max_iter=500).fit(x_tr, y_tr)
+    acc = accuracy_score(y_te, model.predict(x_te))
+    assert acc > 0.75
+
+
+def test_kmeans_two_clusters() -> None:
+    rng = np.random.default_rng(seed=7)
+    cluster_a = rng.standard_normal((50, 2)) + [0, 0]
+    cluster_b = rng.standard_normal((50, 2)) + [10, 10]
+    x = np.vstack([cluster_a, cluster_b])
+    km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(x)
+    centers = sorted(km.cluster_centers_.tolist(), key=lambda c: c[0])
+    assert centers[0][0] < 2.0
+    assert centers[1][0] > 8.0
+
+
+def test_pipeline_predict_shape() -> None:
+    rng = np.random.default_rng(seed=3)
+    x = rng.standard_normal((50, 4))
+    y = x @ np.array([1.0, -1.0, 0.5, 0.0]) + 2.0
+    pred = LinearRegression().fit(x, y).predict(x[:10])
+    assert pred.shape == (10,)
diff --git a/recipes/ci-parallel-pytest/test_project/tests/test_text_processing.py b/recipes/ci-parallel-pytest/test_project/tests/test_text_processing.py
new file mode 100644
index 0000000..b4ea4db
--- /dev/null
+++ b/recipes/ci-parallel-pytest/test_project/tests/test_text_processing.py
@@ -0,0 +1,45 @@
+"""String / regex tests — stdlib-only, very fast per test.
+
+The point of including these is to mirror real CI suites: a mix of
+heavy ML tests and fast unit tests. forkd's per-worker fixed cost
+is amortized across whatever slice each worker gets.
+"""
+import re
+
+import pytest
+
+
+@pytest.mark.parametrize("s,expected", [
+    ("hello world", 11),
+    ("forkd", 5),
+    ("", 0),
+])
+def test_string_length(s: str, expected: int) -> None:
+    assert len(s) == expected
+
+
+def test_split_join_roundtrip() -> None:
+    s = "one two three four"
+    assert " ".join(s.split()) == s
+
+
+def test_regex_email_basic() -> None:
+    pattern = re.compile(r"^[\w.+-]+@[\w.-]+\.[a-z]{2,}$", re.IGNORECASE)
+    assert pattern.match("alice@example.com")
+    assert pattern.match("user+tag@sub.example.co.uk")
+    assert not pattern.match("not-an-email")
+    assert not pattern.match("@example.com")
+
+
+def test_dict_comprehension() -> None:
+    src = {"a": 1, "b": 2, "c": 3}
+    doubled = {k: v * 2 for k, v in src.items()}
+    assert doubled == {"a": 2, "b": 4, "c": 6}
+
+
+def test_set_operations() -> None:
+    a = {1, 2, 3, 4}
+    b = {3, 4, 5, 6}
+    assert a & b == {3, 4}
+    assert a | b == {1, 2, 3, 4, 5, 6}
+    assert a - b == {1, 2}