147 changes: 147 additions & 0 deletions bench/chain-spawn/RESULTS-v0.5.1-pip-chain.md
@@ -0,0 +1,147 @@
# v0.5.1 pip-install chain bench

Real-package follow-up to [`RESULTS-v0.5.md`](./RESULTS-v0.5.md). Same host, same FC, same daemon — but with the v0.5.1 guest kernel (Linux 6.1.141) so `pip install` actually works inside the guest. The Phase 5 bench had to use stdlib-only Python source deltas because `pip install` hung on CRNG starvation (#218); this bench uses the real `pip install numpy → pandas → scikit-learn` chain that v0.5 was designed for.

## TL;DR

| | |
|---|---|
| **Per-link tax** | **~460 ms** (same as Phase 5 stdlib chain — model holds) |
| **Depth-3 vs compacted** | **1700 ms vs 78 ms** — 22× faster after `snapshot-compact` |
| **Strategy** | Chain to build, compact to ship |

## Setup

| | |
|---|---|
| Host | `yangdongxu-desktop` — Intel i7-12700, 32 GiB DDR4, ext4 |
| Kernel | host 6.14.0-36, **guest 6.1.141** (was 4.14.174 in v0.5.0) |
| FC | v1.12.0 + `mem_backend.shared` vendored patch |
| forkd | v0.5.1 (commit a1b32561) |
| Base (L0) | `demo-pyt` — `python:3.12-slim`, 512 MiB guest RAM |
| Iterations | 10 per head |
| Date | 2026-06-05 |

## Chain shape

```
demo-pyt (L0, base, python:3.12-slim)
└── py-numpy (L1: +numpy 2.0.2) chain depth 1
└── py-pandas (L2: +pandas 2.2.3) chain depth 2
└── py-sklearn (L3: +scikit-learn 1.5.2) chain depth 3
└── py-sklearn-flat (compact of py-sklearn) depth 0
```

Built by running `forkd snapshot-diff --from <parent> --tag <child> --exec "pip install <pkg>==<ver>"` once per layer. Wall-clock per build:

| layer | exec | build wall | diff bytes (FC's count) |
|---|---|---:|---:|
| py-numpy | `pip install numpy==2.0.2` | **27.2 s** | 222 MB |
| py-pandas | `pip install pandas==2.2.3` | **30.5 s** | 184 MB |
| py-sklearn | `pip install scikit-learn==1.5.2` | **60.6 s** | 380 MB |

Total chain build: ~2 minutes for the full numpy/pandas/sklearn stack.

## Spawn phase

`POST /v1/sandboxes` HTTP round-trip, 10 iterations per head. Each iteration kills any orphan FC process and sleeps 0.5 s so the tap device can clear (forkd ships one tap per VMM).

| head | depth | p50 (ms) | p90 (ms) | max (ms) | min (ms) |
|---|---:|---:|---:|---:|---:|
| L0 (base `demo-pyt`) | 0 | **75** | 79 | 92 | 69 |
| L1 (`+numpy`) | 1 | **778** | 786 | 787 | 752 |
| L2 (`+pandas`) | 2 | **1 229** | 1 258 | 1 308 | 1 224 |
| L3 (`+sklearn`) | 3 | **1 700** | 1 703 | 1 706 | 1 687 |
| Flat-equiv (`py-sklearn-flat`) | 0 | **78** | 79 | 81 | 72 |
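
The summary columns above can be recomputed from the raw per-iteration timings in the bench log. The sketch below is not the harness used for this bench: `time_spawns` takes a hypothetical caller-supplied `spawn` callable standing in for one `POST /v1/sandboxes` round-trip, and it omits the orphan-FC cleanup. The percentile definitions (median rounded half-to-even for p50, nearest-rank for p90) are assumptions, chosen because they reproduce the summary lines in the log.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_spawns(spawn: Callable[[], None], iters: int = 10,
                settle_s: float = 0.5) -> List[float]:
    """Call `spawn` `iters` times; return each wall-clock duration in ms."""
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        spawn()  # hypothetical callable: one POST /v1/sandboxes round-trip
        samples.append((time.perf_counter() - start) * 1000.0)
        time.sleep(settle_s)  # settle time so the tap device can clear
    return samples


def summarize(samples_ms: List[float]) -> Dict[str, int]:
    """Reduce samples to the p50/p90/max/min/n fields shown in the log."""
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = (9 * n + 9) // 10  # nearest-rank p90: ceil(0.9 * n), 1-based
    return {
        "p50": round(statistics.median(ordered)),  # Python rounds half to even
        "p90": round(ordered[rank - 1]),
        "max": round(ordered[-1]),
        "min": round(ordered[0]),
        "n": n,
    }


# L1 (py-numpy) samples copied from the bench log.
l1 = [752, 786, 784, 780, 775, 781, 771, 777, 787, 774]
print(summarize(l1))
# {'p50': 778, 'p90': 786, 'max': 787, 'min': 752, 'n': 10}
```

With 10 samples the median is the mean of the 5th and 6th sorted values, so a half-millisecond tie such as 778.5 has to be rounded one way or the other; half-to-even matches both the L1 (778) and Flat (78) rows.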

Per-link incremental tax (p50):

| from → to | Δ p50 (ms) |
|---|---:|
| L0 → L1 | **+703** |
| L1 → L2 | **+451** |
| L2 → L3 | **+471** |
| **L3 (depth 3) vs Flat-equivalent** | **+1 622** |

L0→L1 is about 240 ms higher than the later increments because it includes the chain handler's per-spawn fixed cost (verify schema, build the resolver closure). L1→L2 and L2→L3 are pure SHA-256 of one more base-sized memory image: ~460 ms at the host CPU's ~1.1 GiB/s SHA-256 throughput (512 MiB ÷ 1.1 GiB/s ≈ 455 ms). This is the same model the Phase 5 bench fit.
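
The arithmetic behind the per-link model can be checked directly from the p50 column. This is a minimal sketch: the helper names are made up for illustration, and the 1.1 GiB/s throughput is the approximate figure quoted above, not a separately measured value.

```python
# p50 spawn latency per chain head, in ms, from the spawn table.
P50_MS = {"L0": 75, "L1": 778, "L2": 1229, "L3": 1700}


def link_deltas(p50):
    """Difference in p50 between each head and its parent."""
    heads = list(p50)
    return {f"{a}->{b}": p50[b] - p50[a] for a, b in zip(heads, heads[1:])}


def sha_ms(image_mib, throughput_gib_s):
    """Time to hash one memory image at a given SHA-256 throughput."""
    return image_mib / (throughput_gib_s * 1024.0) * 1000.0


deltas = link_deltas(P50_MS)
steady = (deltas["L1->L2"] + deltas["L2->L3"]) / 2  # mean of the later links
first_link_extra = deltas["L0->L1"] - steady        # attributed to fixed cost
modelled = sha_ms(512, 1.1)                         # one 512 MiB image

print(deltas)            # {'L0->L1': 703, 'L1->L2': 451, 'L2->L3': 471}
print(steady)            # 461.0
print(first_link_extra)  # 242.0
print(round(modelled))   # 455
```

The measured steady-state increment (461 ms) and the throughput model (about 455 ms) agree to within a few milliseconds, which is what "the model holds" amounts to numerically.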

## L0 vs Flat-equivalent: 75 vs 78 ms

These are within noise. The two are different snapshots (one is the original `python:3.12-slim` base, the other was produced by `snapshot-compact py-sklearn → py-sklearn-flat`) but spawning either takes ~75 ms because both have `parent_tag = None` and the daemon takes the historical non-chain fast path.

This is the **headline operational guidance**:

> Build with `snapshot-diff` chains, ship with `snapshot-compact` to flatten.
>
> A chain stores its lineage compactly during agent iteration / experimentation (no need to re-pip-install when one upstream layer changes), but every spawn pays ~460 ms per link, plus ~240 ms of fixed chain-handling cost on the first link. Once the chain stabilizes, one `snapshot-compact` collapses the per-link tax to zero forever.

## Disk

| | logical | du -sh |
|---|---:|---:|
| demo-pyt | 513 MiB | 513 M |
| py-numpy | 513 MiB | 513 M |
| py-pandas | 513 MiB | 513 M |
| py-sklearn | 513 MiB | 513 M |
| py-sklearn-flat | 513 MiB | 513 M |

Same story as Phase 5: FC's diff snapshots write a fixed-size `memory.bin` with zeros for unchanged pages rather than punching holes, so on ext4 every link weighs in at the full base size. The actual *changed* bytes per link are ~184–380 MB per the `diff_physical_bytes` numbers FC reports during BRANCH (see the build table above), but those aren't visible to `du` without a reflink filesystem.
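
Why `du` reports the full size can be illustrated with two files of the same logical length: one with zeros explicitly written, one left as a hole. This is a generic POSIX sketch, not forkd or Firecracker code; file names and sizes are arbitrary, and the allocated figures depend on the filesystem, so no expected output is shown.

```python
import os
import tempfile


def sizes(path):
    """Return (logical bytes, allocated bytes) for a file."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units regardless of fs block size.
    return st.st_size, st.st_blocks * 512


def demo(n_bytes=1 << 20):
    with tempfile.TemporaryDirectory() as tmp:
        zeros = os.path.join(tmp, "zeros.bin")
        hole = os.path.join(tmp, "hole.bin")
        with open(zeros, "wb") as f:
            f.write(b"\0" * n_bytes)  # zeros are real data: blocks allocated
        with open(hole, "wb") as f:
            f.truncate(n_bytes)       # same length, but nothing written
        return sizes(zeros), sizes(hole)


(zero_logical, zero_alloc), (hole_logical, hole_alloc) = demo()
print("written zeros:", zero_logical, zero_alloc)
print("sparse hole:  ", hole_logical, hole_alloc)
```

On a typical ext4 mount the first file is usually charged its full length while the second is charged close to nothing, which is the distinction the `logical` and `du -sh` columns would show if unchanged pages were holes rather than written zeros.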

On btrfs / xfs with reflink, the `assemble_chain_memory` call in `crates/forkd-vmm/src/chain.rs` issues `ioctl(FICLONE)` for the base copy so blocks share with the parent — disk savings would be real there. Untested in this round; flagged for a v0.5.2 follow-up.
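
For readers unfamiliar with `FICLONE`, the sketch below shows the ioctl from Python. It is not the Rust code in `chain.rs`; the function name and the fall-back-to-plain-copy behaviour are assumptions made for illustration. The request number is the Linux `FICLONE` value from `<linux/fs.h>`.

```python
import fcntl
import shutil

FICLONE = 0x40049409  # Linux: _IOW(0x94, 9, int)


def clone_or_copy(src_path, dst_path):
    """Reflink src to dst if the filesystem supports it, else byte-copy.

    Returns "reflink" or "copy" so callers can tell which path ran.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the kernel to make dst share src's extents (copy-on-write).
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            pass  # e.g. ext4, tmpfs, or a cross-filesystem clone
    # Fallback (an assumption, not confirmed forkd behaviour): full copy.
    shutil.copyfile(src_path, dst_path)
    return "copy"
```

On ext4 the ioctl is expected to fail and the fallback runs, consistent with every snapshot directory weighing 513 M in this round.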

## What changed vs Phase 5

The two benches measure the same thing: `POST /v1/sandboxes` HTTP RTT for chains of varying depth on a 512 MiB base. The chained heads (L1 to L3) agree within 4 %; only L0 moves noticeably (+27 %, attributed to a cold cache after a fresh boot):

| | Phase 5 (stdlib delta) | v0.5.1 (real pip) | Δ |
|---|---:|---:|---:|
| L0 p50 | 59 ms | 75 ms | +27 % (cold-cache after a fresh boot) |
| L1 p50 | 751 ms | 778 ms | +4 % |
| L2 p50 | 1 222 ms | 1 229 ms | +1 % |
| L3 p50 | 1 668 ms | 1 700 ms | +2 % |
| Per-link tax | ~460 ms | ~460 ms | — |
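
The Δ column is the relative change in p50, rounded to a whole percent. A quick check, with the values copied from the table (the `pct_delta` helper is only for illustration):

```python
# (Phase 5 p50, v0.5.1 p50) in ms, per chain head.
P50_PAIRS = {"L0": (59, 75), "L1": (751, 778), "L2": (1222, 1229), "L3": (1668, 1700)}


def pct_delta(old_ms, new_ms):
    """Relative change from old to new, as a rounded percentage."""
    return round((new_ms - old_ms) / old_ms * 100)


pct = {head: pct_delta(old, new) for head, (old, new) in P50_PAIRS.items()}
print(pct)  # {'L0': 27, 'L1': 4, 'L2': 1, 'L3': 2}
```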

The big behavioral difference is the **Flat-equivalent** row. Phase 5's "Flat" was a separately-built single-link snapshot (`chain-bench-flat`), so it still paid one chain hop (~746 ms p50). This bench's "Flat" is `py-sklearn-flat` produced by `snapshot-compact`, which actually sets `parent_tag = None` and restores via the original Phase 1 non-chain path — **78 ms p50**.

That's the v0.5 design rounding out: chains compose, compact flattens, both with predictable cost.

## Probe correctness on the chain

After the chain was built, one `POST /v1/sandboxes` against `py-sklearn` → exec `python3 -c "import numpy, pandas, sklearn; from sklearn.linear_model import LinearRegression; ..."`:

```
numpy 2.0.2
pandas 2.2.3
sklearn 1.5.2
sklearn.LinearRegression fitted, coef=[1.0, 1.9999999999999993] intercept=3.0
```

100 % import success across all three layers, plus the fitted-model probe runs to completion. The vmstate-drift question — closed in Phase 5 for synthetic deltas — stays closed for real PyPI packages.

## Reproducing

```sh
# 1. Build the chain (one-time, ~2 min):
forkd snapshot-diff --from demo-pyt --tag py-numpy --exec "pip install numpy==2.0.2"
forkd snapshot-diff --from py-numpy --tag py-pandas --exec "pip install pandas==2.2.3"
forkd snapshot-diff --from py-pandas --tag py-sklearn --exec "pip install scikit-learn==1.5.2"

# 2. Compact for prod (one-time, a few seconds):
forkd snapshot-compact --from py-sklearn --to py-sklearn-flat

# 3. Spawn from either (chain head ~1.7 s, flat ~78 ms):
forkd fork --tag py-sklearn -n 1 # chain
forkd fork --tag py-sklearn-flat -n 1 # compacted

# 4. Bench harness used here:
scripts/dev/v05-e2e.sh # asserts the chain semantics
# (a dedicated spawn-bench script lives at bench/chain-spawn/bench-chain-spawn.py
# for Phase 5; it accepts --base-tag and can be re-pointed at py-sklearn.)
```

## Operational takeaways

- **~1.7 s** is the price of a depth-3 chain on a 512 MiB base, dominated by per-link SHA-256.
- That tax is **deterministic and per-base-MiB** (~460 ms / 512 MiB ≈ 0.9 ms / MiB on this CPU). A 2 GiB chain head would be ~7 s at depth 3 (4× the 1.7 s measured here).
- The v0.6 mmap-once-then-incremental SHA verify (queued from the Phase 5 design) is the right path to cut this — it would amortize each parent's SHA over the lifetime of the daemon process rather than per-spawn.
- Until then: build with chain, ship with compact.
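
The projection in the second bullet can be reproduced two ways. This sketch assumes everything scales linearly with base size, which the bench only establishes for the SHA-256 portion; the function names are illustrative.

```python
MS_PER_MIB = 460 / 512  # per-link tax per MiB of base image on this CPU


def hashing_only_ms(base_mib, depth):
    """Lower-bound estimate: per-link SHA-256 cost only, no fixed overhead."""
    return base_mib * MS_PER_MIB * depth


def scaled_spawn_ms(measured_ms, measured_base_mib, target_base_mib):
    """Scale a whole measured spawn time linearly with base size."""
    return measured_ms * target_base_mib / measured_base_mib


print(round(MS_PER_MIB, 2))              # 0.9
print(hashing_only_ms(2048, 3))          # 5520.0
print(scaled_spawn_ms(1700, 512, 2048))  # 6800.0
```

The two figures bracket the expected cost of a 2 GiB depth-3 head at roughly 5.5 to 6.8 s, depending on how much of the first-link overhead grows with image size.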
85 changes: 85 additions & 0 deletions bench/chain-spawn/bench-v0.5.1-pip-chain.log
@@ -0,0 +1,85 @@

real 0m3.620s
user 0m0.003s
sys 0m0.005s
rc=0
=== build py-sklearn-flat for apples-to-apples comparison ===
✗ py-sklearn-flat (snapshot py-sklearn-flat not found (no daemon entry, no disk dir at /home/yangdongxu/.local/share/forkd/snapshots/py-sklearn-flat))
compact py-sklearn → py-sklearn-flat
tag: py-sklearn-flat
dir: /home/yangdongxu/.local/share/forkd/snapshots/py-sklearn-flat
source lineage: compact:py-sklearn

Verify with: forkd snapshot-info py-sklearn-flat

=== L0 base (demo-pyt, no chain) (tag=demo-pyt) ===
iter 1: 92ms
iter 2: 69ms
iter 3: 79ms
iter 4: 75ms
iter 5: 74ms
iter 6: 75ms
iter 7: 79ms
iter 8: 73ms
iter 9: 79ms
iter 10: 74ms
→ p50=75 p90=79 max=92 min=69 n=10

=== L1 numpy (chain depth 1) (tag=py-numpy) ===
iter 1: 752ms
iter 2: 786ms
iter 3: 784ms
iter 4: 780ms
iter 5: 775ms
iter 6: 781ms
iter 7: 771ms
iter 8: 777ms
iter 9: 787ms
iter 10: 774ms
→ p50=778 p90=786 max=787 min=752 n=10

=== L2 pandas (chain depth 2) (tag=py-pandas) ===
iter 1: 1236ms
iter 2: 1228ms
iter 3: 1229ms
iter 4: 1224ms
iter 5: 1258ms
iter 6: 1308ms
iter 7: 1230ms
iter 8: 1228ms
iter 9: 1227ms
iter 10: 1229ms
→ p50=1229 p90=1258 max=1308 min=1224 n=10

=== L3 sklearn (chain depth 3) (tag=py-sklearn) ===
iter 1: 1702ms
iter 2: 1687ms
iter 3: 1700ms
iter 4: 1696ms
iter 5: 1700ms
iter 6: 1697ms
iter 7: 1706ms
iter 8: 1703ms
iter 9: 1702ms
iter 10: 1699ms
→ p50=1700 p90=1703 max=1706 min=1687 n=10

=== Flat equivalent of L3 (tag=py-sklearn-flat) ===
iter 1: 76ms
iter 2: 72ms
iter 3: 79ms
iter 4: 77ms
iter 5: 81ms
iter 6: 77ms
iter 7: 79ms
iter 8: 79ms
iter 9: 78ms
iter 10: 73ms
→ p50=78 p90=79 max=81 min=72 n=10

=== on-disk sizes (du -sh) ===
513M /home/yangdongxu/.local/share/forkd/snapshots/demo-pyt
513M /home/yangdongxu/.local/share/forkd/snapshots/py-numpy
513M /home/yangdongxu/.local/share/forkd/snapshots/py-pandas
513M /home/yangdongxu/.local/share/forkd/snapshots/py-sklearn
513M /home/yangdongxu/.local/share/forkd/snapshots/py-sklearn-flat