From cf2e1c17e6c2347f62a9f3d7afcde9de9529722d Mon Sep 17 00:00:00 2001
From: Wayland Yang
Date: Fri, 5 Jun 2026 13:09:07 +0800
Subject: [PATCH] bench(v0.5.1): real-package pip-install chain spawn numbers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Same harness shape as Phase 5 (bench/chain-spawn/RESULTS-v0.5.md) but
driven against the actual numpy → pandas → scikit-learn chain that v0.5
was designed for. The Phase 5 bench had to use stdlib-only Python source
deltas because pip install hung on CRNG starvation (#218); now that #218
and #225 are closed in v0.5.1, this is the apples-to-apples follow-up.

Headline numbers (10 iters per head, 512 MiB python:3.12-slim base):

  head               depth   p50_ms   per-link tax
  L0 demo-pyt        0       75       -
  L1 +numpy          1       778      +703
  L2 +pandas         2       1229     +451
  L3 +sklearn        3       1700     +471
  Flat (compacted)   0       78       -

Confirms the per-link tax model from Phase 5: ~460 ms per link, tracking
SHA-256 of the 512 MiB base at ~1.1 GiB/s.

The big new data point is the **Flat-equivalent row at 78 ms p50**,
produced by `snapshot-compact py-sklearn → py-sklearn-flat`, which
actually sets parent_tag=None and restores via the historical Phase 1
fast path. So `forkd snapshot-compact` really does buy back the chain's
per-link tax: 22× faster spawn vs the depth-3 chain head.

Headline operational guidance for v0.5.1 users:

  Build with snapshot-diff chains, ship with snapshot-compact. The chain
  stores lineage compactly while you iterate; compact collapses the
  per-link tax to zero once the chain stabilizes.

Probe correctness: spawn from py-sklearn, `import numpy, pandas, sklearn`
succeeds and a fitted LinearRegression model runs to completion.
Vmstate-drift question stays closed for real PyPI packages.

Raw log included at bench/chain-spawn/bench-v0.5.1-pip-chain.log for
reproducibility.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 bench/chain-spawn/RESULTS-v0.5.1-pip-chain.md | 147 ++++++++++++++++++
 bench/chain-spawn/bench-v0.5.1-pip-chain.log  |  85 ++++++++++
 2 files changed, 232 insertions(+)
 create mode 100644 bench/chain-spawn/RESULTS-v0.5.1-pip-chain.md
 create mode 100644 bench/chain-spawn/bench-v0.5.1-pip-chain.log

diff --git a/bench/chain-spawn/RESULTS-v0.5.1-pip-chain.md b/bench/chain-spawn/RESULTS-v0.5.1-pip-chain.md
new file mode 100644
index 0000000..cf4bc9c
--- /dev/null
+++ b/bench/chain-spawn/RESULTS-v0.5.1-pip-chain.md
@@ -0,0 +1,147 @@
+# v0.5.1 pip-install chain bench
+
+Real-package follow-up to [`RESULTS-v0.5.md`](./RESULTS-v0.5.md). Same host, same FC, same daemon, but with the v0.5.1 guest kernel (Linux 6.1.141) so `pip install` actually works inside the guest. The Phase 5 bench had to use stdlib-only Python source deltas because `pip install` hung on CRNG starvation (#218); this bench uses the real `pip install numpy → pandas → scikit-learn` chain that v0.5 was designed for.
+
+## TL;DR
+
+| | |
+|---|---|
+| **Per-link tax** | **~460 ms** (same as Phase 5 stdlib chain; model holds) |
+| **Depth-3 vs compacted** | **1700 ms vs 78 ms**: 22× faster after `snapshot-compact` |
+| **Strategy** | Chain to build, compact to ship |
+
+## Setup
+
+| | |
+|---|---|
+| Host | `yangdongxu-desktop`: Intel i7-12700, 32 GiB DDR4, ext4 |
+| Kernel | host 6.14.0-36, **guest 6.1.141** (was 4.14.174 in v0.5.0) |
+| FC | v1.12.0 + `mem_backend.shared` vendored patch |
+| forkd | v0.5.1 (commit a1b32561) |
+| Base (L0) | `demo-pyt`: `python:3.12-slim`, 512 MiB guest RAM |
+| Iterations | 10 per head |
+| Date | 2026-06-05 |
+
+## Chain shape
+
+```
+demo-pyt (L0, base, python:3.12-slim)
+ └── py-numpy (L1: +numpy 2.0.2) chain depth 1
+      └── py-pandas (L2: +pandas 2.2.3) chain depth 2
+           └── py-sklearn (L3: +scikit-learn 1.5.2) chain depth 3
+                └── py-sklearn-flat (compact of py-sklearn) depth 0
+```
+
+Built by feeding `forkd snapshot-diff --from <parent> --tag <child> --exec "pip install <pkg>==<version>"` for each layer. Wall-clock per build:
+
+| layer | exec | build wall | diff bytes (FC's count) |
+|---|---|---:|---:|
+| py-numpy | `pip install numpy==2.0.2` | **27.2 s** | 222 MB |
+| py-pandas | `pip install pandas==2.2.3` | **30.5 s** | 184 MB |
+| py-sklearn | `pip install scikit-learn==1.5.2` | **60.6 s** | 380 MB |
+
+Total chain build: ~2 minutes for the full numpy/pandas/sklearn stack.
+
+## Spawn phase
+
+`POST /v1/sandboxes` HTTP round-trip, 10 iters per head. Each iter kills any orphaned FC process, then sleeps 0.5 s to give the tap device a chance to clear (forkd ships one tap per VMM).
+
+| head | depth | p50 (ms) | p90 (ms) | max (ms) | min (ms) |
+|---|---:|---:|---:|---:|---:|
+| L0 (base `demo-pyt`) | 0 | **75** | 79 | 92 | 69 |
+| L1 (`+numpy`) | 1 | **778** | 786 | 787 | 752 |
+| L2 (`+pandas`) | 2 | **1 229** | 1 258 | 1 308 | 1 224 |
+| L3 (`+sklearn`) | 3 | **1 700** | 1 703 | 1 706 | 1 687 |
+| Flat-equiv (`py-sklearn-flat`) | 0 | **78** | 79 | 81 | 72 |
+
+Per-link incremental tax (p50):
+
+| from → to | Δ p50 (ms) |
+|---|---:|
+| L0 → L1 | **+703** |
+| L1 → L2 | **+451** |
+| L2 → L3 | **+471** |
+| **L3 (depth 3) vs Flat-equivalent** | **+1 622** |
+
+L0→L1 is about 240 ms higher than the later increments because it also includes the chain handler's per-spawn fixed cost (verify schema, build the resolver closure). L1→L2 and L2→L3 are pure SHA-256 of one more base-sized memory image: ~460 ms at the host CPU's ~1.1 GiB/s SHA-256 throughput. Same model the Phase 5 bench fit.
+
+## L0 vs Flat-equivalent: 75 vs 78 ms
+
+These are within noise. The two are different snapshots (one is the original `python:3.12-slim` base, the other was produced by `snapshot-compact py-sklearn → py-sklearn-flat`) but spawning either takes ~75 ms because both have `parent_tag = None` and the daemon takes the historical non-chain fast path.
+
+This is the **headline operational guidance**:
+
+> Build with `snapshot-diff` chains, ship with `snapshot-compact` to flatten.
+>
+> A chain stores its lineage compactly during agent iteration / experimentation (no need to re-pip-install when one upstream layer changes), but every spawn pays ~460 ms × depth. Once the chain stabilizes, one `snapshot-compact` collapses the per-link tax to zero forever.
+
+## Disk
+
+| | logical | du -sh |
+|---|---:|---:|
+| demo-pyt | 513 MiB | 513 M |
+| py-numpy | 513 MiB | 513 M |
+| py-pandas | 513 MiB | 513 M |
+| py-sklearn | 513 MiB | 513 M |
+| py-sklearn-flat | 513 MiB | 513 M |
+
+Same story as Phase 5: FC's diff snapshots write a fixed-size `memory.bin` with zeros for unchanged pages rather than punching holes, so on ext4 every link weighs in at the full base size. The actual *changed* bytes per link are ~184 to 380 MB per the `diff_physical_bytes` numbers FC reports during BRANCH (see the build table above), but those aren't visible to `du` without a reflink filesystem.
+
+On btrfs / xfs with reflink, the `assemble_chain_memory` call in `crates/forkd-vmm/src/chain.rs` issues `ioctl(FICLONE)` for the base copy so blocks share with the parent, so disk savings would be real there. Untested in this round; flagged for a v0.5.2 follow-up.
+
+## What changed vs Phase 5
+
+The two benches measure the same thing: `POST /v1/sandboxes` HTTP RTT for chains of varying depth on a 512 MiB base. Numbers are within noise of each other, apart from L0:
+
+| | Phase 5 (stdlib delta) | v0.5.1 (real pip) | Δ |
+|---|---:|---:|---:|
+| L0 p50 | 59 ms | 75 ms | +27 % (cold-cache after a fresh boot) |
+| L1 p50 | 751 ms | 778 ms | +4 % |
+| L2 p50 | 1 222 ms | 1 229 ms | +1 % |
+| L3 p50 | 1 668 ms | 1 700 ms | +2 % |
+| Per-link tax | ~460 ms | ~460 ms | none |
+
+The big behavioral difference is the **Flat-equivalent** row. Phase 5's "Flat" was a separately-built single-link snapshot (`chain-bench-flat`), so it still paid one chain hop (~746 ms p50). This bench's "Flat" is `py-sklearn-flat` produced by `snapshot-compact`, which actually sets `parent_tag = None` and restores via the original Phase 1 non-chain path: **78 ms p50**.
+
+That's the v0.5 design rounding out: chains compose, compact flattens, both with predictable cost.
+
+## Probe correctness on the chain
+
+After the chain built, one POST /v1/sandboxes against `py-sklearn` → exec `python3 -c "import numpy, pandas, sklearn; from sklearn.linear_model import LinearRegression; ..."`:
+
+```
+numpy 2.0.2
+pandas 2.2.3
+sklearn 1.5.2
+sklearn.LinearRegression fitted, coef=[1.0, 1.9999999999999993] intercept=3.0
+```
+
+100 % import success across all three layers, plus the fitted-model probe runs to completion. The vmstate-drift question, closed in Phase 5 for synthetic deltas, stays closed for real PyPI packages.
+
+## Reproducing
+
+```sh
+# 1. Build the chain (one-time, ~2 min):
+forkd snapshot-diff --from demo-pyt --tag py-numpy --exec "pip install numpy==2.0.2"
+forkd snapshot-diff --from py-numpy --tag py-pandas --exec "pip install pandas==2.2.3"
+forkd snapshot-diff --from py-pandas --tag py-sklearn --exec "pip install scikit-learn==1.5.2"
+
+# 2. Compact for prod (one-time, a few seconds):
+forkd snapshot-compact --from py-sklearn --to py-sklearn-flat
+
+# 3. Spawn from either (chain head ~1.7 s, flat ~78 ms):
+forkd fork --tag py-sklearn -n 1 # chain
+forkd fork --tag py-sklearn-flat -n 1 # compacted
+
+# 4. Bench harness used here:
+scripts/dev/v05-e2e.sh # asserts the chain semantics
+# (a dedicated spawn-bench script lives at bench/chain-spawn/bench-chain-spawn.py
+# for Phase 5; it accepts --base-tag and can be re-pointed at py-sklearn.)
+```
+
+## Operational takeaways
+
+- **~1.7 s** is the price of a depth-3 chain on a 512 MiB base, dominated by per-link SHA-256.
+- That tax is **deterministic and per-base-MiB** (~460 ms / 512 MiB ≈ 0.9 ms / MiB on this CPU). A 2 GiB chain head would be ~7 s at depth 3 (4 × the 1.7 s measured here).
+- The v0.6 mmap-once-then-incremental SHA verify (queued from the Phase 5 design) is the right path to cut this: it would amortize each parent's SHA over the lifetime of the daemon process rather than per-spawn.
+- Until then: build with chain, ship with compact.
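
Reviewer note: the per-base-MiB extrapolation in the takeaways above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the bench harness; the function names are hypothetical, and the linear-scaling assumption (the entire depth-3 spawn cost grows in proportion to base size) is a simplification that probably overstates the fixed per-spawn costs.

```python
# Figures measured in this bench on the 512 MiB base.
BASE_MIB = 512
PER_LINK_MS = 460      # approximate per-link tax
DEPTH3_P50_MS = 1700   # p50 spawn latency of the depth-3 chain head


def per_mib_tax_ms(per_link_ms=PER_LINK_MS, base_mib=BASE_MIB):
    """Per-link tax expressed per MiB of base memory image."""
    return per_link_ms / base_mib


def scaled_depth3_ms(target_mib, measured_ms=DEPTH3_P50_MS, base_mib=BASE_MIB):
    """Depth-3 spawn estimate for another base size.

    Assumption: the whole measured latency scales linearly with base size.
    """
    return measured_ms * target_mib / base_mib


print(round(per_mib_tax_ms(), 2))   # 0.9 ms per MiB
print(scaled_depth3_ms(2048))       # 6800.0 ms, i.e. roughly 7 s
```

If only the SHA-256 portion scales with base size and the fixed costs stay constant, the 2 GiB estimate would come out lower than this, so the ~7 s figure is best read as an upper-end estimate.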
diff --git a/bench/chain-spawn/bench-v0.5.1-pip-chain.log b/bench/chain-spawn/bench-v0.5.1-pip-chain.log
new file mode 100644
index 0000000..9ebc42a
--- /dev/null
+++ b/bench/chain-spawn/bench-v0.5.1-pip-chain.log
@@ -0,0 +1,85 @@
+
+real 0m3.620s
+user 0m0.003s
+sys 0m0.005s
+rc=0
+=== build py-sklearn-flat for apples-to-apples comparison ===
+ ✗ py-sklearn-flat (snapshot py-sklearn-flat not found (no daemon entry, no disk dir at /home/yangdongxu/.local/share/forkd/snapshots/py-sklearn-flat))
+compact py-sklearn → py-sklearn-flat
+tag: py-sklearn-flat
+dir: /home/yangdongxu/.local/share/forkd/snapshots/py-sklearn-flat
+source lineage: compact:py-sklearn
+
+Verify with: forkd snapshot-info py-sklearn-flat
+
+=== L0 base (demo-pyt, no chain) (tag=demo-pyt) ===
+  iter 1: 92ms
+  iter 2: 69ms
+  iter 3: 79ms
+  iter 4: 75ms
+  iter 5: 74ms
+  iter 6: 75ms
+  iter 7: 79ms
+  iter 8: 73ms
+  iter 9: 79ms
+  iter 10: 74ms
+  → p50=75 p90=79 max=92 min=69 n=10
+
+=== L1 numpy (chain depth 1) (tag=py-numpy) ===
+  iter 1: 752ms
+  iter 2: 786ms
+  iter 3: 784ms
+  iter 4: 780ms
+  iter 5: 775ms
+  iter 6: 781ms
+  iter 7: 771ms
+  iter 8: 777ms
+  iter 9: 787ms
+  iter 10: 774ms
+  → p50=778 p90=786 max=787 min=752 n=10
+
+=== L2 pandas (chain depth 2) (tag=py-pandas) ===
+  iter 1: 1236ms
+  iter 2: 1228ms
+  iter 3: 1229ms
+  iter 4: 1224ms
+  iter 5: 1258ms
+  iter 6: 1308ms
+  iter 7: 1230ms
+  iter 8: 1228ms
+  iter 9: 1227ms
+  iter 10: 1229ms
+  → p50=1229 p90=1258 max=1308 min=1224 n=10
+
+=== L3 sklearn (chain depth 3) (tag=py-sklearn) ===
+  iter 1: 1702ms
+  iter 2: 1687ms
+  iter 3: 1700ms
+  iter 4: 1696ms
+  iter 5: 1700ms
+  iter 6: 1697ms
+  iter 7: 1706ms
+  iter 8: 1703ms
+  iter 9: 1702ms
+  iter 10: 1699ms
+  → p50=1700 p90=1703 max=1706 min=1687 n=10
+
+=== Flat equivalent of L3 (tag=py-sklearn-flat) ===
+  iter 1: 76ms
+  iter 2: 72ms
+  iter 3: 79ms
+  iter 4: 77ms
+  iter 5: 81ms
+  iter 6: 77ms
+  iter 7: 79ms
+  iter 8: 79ms
+  iter 9: 78ms
+  iter 10: 73ms
+  → p50=78 p90=79 max=81 min=72 n=10
+
+=== on-disk sizes (du -sh) ===
+513M /home/yangdongxu/.local/share/forkd/snapshots/demo-pyt
+513M /home/yangdongxu/.local/share/forkd/snapshots/py-numpy
+513M /home/yangdongxu/.local/share/forkd/snapshots/py-pandas
+513M /home/yangdongxu/.local/share/forkd/snapshots/py-sklearn
+513M /home/yangdongxu/.local/share/forkd/snapshots/py-sklearn-flat
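
Reviewer note: the headline figures in the commit message (per-link deltas, the ~460 ms steady-state tax, the 22× speedup, and the SHA-256 throughput cross-check) can be re-derived from the p50 values in the log. The sketch below is illustrative only and is not the bench harness; the helper name and dictionary keys are hypothetical, and the 1.1 GiB/s throughput is the approximate figure quoted in the results file rather than something measured here.

```python
# p50 spawn latencies in ms, copied from the log summary lines.
P50_MS = {"L0": 75, "L1": 778, "L2": 1229, "L3": 1700, "flat": 78}


def per_link_deltas(p50, heads=("L0", "L1", "L2", "L3")):
    """p50 increase between each consecutive pair of chain heads."""
    return {f"{a}->{b}": p50[b] - p50[a] for a, b in zip(heads, heads[1:])}


deltas = per_link_deltas(P50_MS)

# Steady-state tax: average of the two increments past the first link,
# which exclude the chain handler's per-spawn fixed cost.
steady_ms = (deltas["L1->L2"] + deltas["L2->L3"]) / 2

# Speedup of the compacted snapshot over the depth-3 chain head.
speedup = P50_MS["L3"] / P50_MS["flat"]

# Time to hash one 512 MiB image at ~1.1 GiB/s (quoted throughput).
sha_ms = 512 / (1.1 * 1024) * 1000

print(deltas)              # {'L0->L1': 703, 'L1->L2': 451, 'L2->L3': 471}
print(steady_ms)           # 461.0
print(round(speedup, 1))   # 21.8
print(round(sha_ms))       # 455
```

The steady-state average (461 ms) and the hashing estimate (about 455 ms) agree to within a few milliseconds, which is consistent with the claim that the per-link tax tracks one SHA-256 pass over the base image.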