Scope host-global resources to a per-installation instance id by aljoscha · Pull Request #17 · aljoscha/ember

aljoscha · 2026-05-19T09:29:19Z

Summary

Two ember installations on the same host previously shared the dm-thin
pool name (ember-pool), the em-* TAP namespace, the 10.100.0.0/16
default subnet, and — on macOS — vmnet's 192.168.64.0/24. Worse,
reconcile() listed every em-* interface and would delete the other
install's TAPs on any command; deinit --purge tore down the singleton
ember-pool regardless of which install was asking. An integration
test against a temp --state-dir could destroy a live install on the
same machine.

This branch introduces a 4-hex-char instance_id (derived from a hash
of the canonicalized state-dir at ember init, overridable via
--instance-id) and threads it through every host-global resource
name.

What gets scoped

dm-thin: ember-{id}-pool, ember-{id}-img-, ember-{id}-vm-
Networking (Linux): em{id}- TAP prefix (14-char names fit
IFNAMSIZ), ember:{id} iptables comment threaded through every
add/remove so cleanup only ever matches this install's rules
IP subnet (Linux): /16 slot inside 10.0.0.0/8 derived from
the id; overridable via --ip-subnet
IP subnet (macOS): vmnet's /24 sub-allocated into 8× /27 slices
picked by the low 3 bits of the id (1/8 collision per pair; the
override is --ip-subnet)
macOS guest IPs: replaced the Linux /30 P2P allocator with a
single-IP /32 allocator since vmnet is a shared L2 bridge. Lifts
the per-install cap from 8 to ~30 VMs while staying inside the /27.

reconcile() now scopes the orphan-TAP sweep by the install's prefix
and skips it entirely if the config can't be read (preferable to
deleting a foreign device).

Boundary discipline

GlobalConfig does not own the format of any subsystem's names. It
exposes one generic accessor — instance_namespace() -> Option<&str>,
which collapses the empty-string legacy sentinel to None — and each
subsystem owns its own pure derivation next to the code that has to
match the literals byte-for-byte:

dm_thin::pool::name(Option<&str>)
dm_thin::thin::{image_prefix, vm_prefix}(Option<&str>)
network::tap::prefix(Option<&str>)
network::nat::comment(Option<&str>)
ember_linux::network::ip::derive_default_subnet(...) (Linux /16 in /8)
ember_macos::network::derive_vmnet_subnet(...) (macOS /27 slot)

Added a one-bullet style note in CLAUDE.md capturing the principle.

Back-compat

No deinit --purge && init migration required. Older configs lack
instance_id; instance_namespace() returns None and each
subsystem falls back to its historical literal (ember-pool, em-,
ember-img-, ember-vm-, empty iptables comment, full /24 on macOS).

The subtle one is iptables: older binaries added rules without
-m comment at all, so on -D the comment match has to be elided
or the rule won't match. Handled in network::nat.

Risk

Cross-install isolation is invariant-level: covered by three
integration tests in tests/isolation.rs that go through the
binary — two-install dm-thin non-interference, legacy config
compatibility, scoped TAP reconcile.
macOS /27 collision is probabilistic (1/8 per install pair).
Acceptable for personal use; --ip-subnet is the escape hatch.
macOS allocations.json layout is preserved byte-for-byte for
legacy installs (no instance_id → keep the /30 allocator on the
full /24); new installs get the /32 allocator on the /27.
derive_vmnet_subnet on bad instance_id panics rather than
silently falling back to slot 0 (which would re-collide every
affected install on the same slice). The id is CLI-validated or
auto-derived, so this is unreachable in practice — the panic is
there to catch a regression in validation.

Test plan

cargo test (229 unit tests pass)
cargo test --test isolation (three cross-install tests)
cargo test --test dm_thin -- --ignored on a host with the
dm-thin module + thin-provisioning-tools
Manual: two ember init invocations against distinct
--state-dirs on the same Linux host; create VMs in both;
deinit --purge on A leaves B running
Manual: upgrade a pre-existing install in place (no re-init);
verify VMs keep working and deinit cleans up old iptables
rules
Manual macOS: two installs running VMs concurrently; verify
distinct /27 slices and no guest-IP collisions

Two ember installations on the same host previously shared the dm-thin pool name, the `em-*` TAP namespace, and (when both ran VMs) the IP subnet. Worse, reconcile() listed every `em-*` interface and would delete the other install's TAPs on any command, and `deinit --purge` tore down the singleton `ember-pool` regardless of which install was asking. Integration tests against a temp `--state-dir` could destroy the user's live install on the same machine. Introduce `instance_id` (4 hex chars, derived from a hash of the canonicalized state-dir at `ember init`, overridable via `--instance-id`) and `ip_subnet` (defaulting to a `/16` slot in `10.0.0.0/8` derived from the same hash, overridable via `--ip-subnet`). Both are persisted on `GlobalConfig`. Helper accessors derive every host-global name from the id: `ember-{id}-pool`, `ember-{id}-{img,vm}-` device-mapper prefixes, `em{id}-` TAP prefix (14-char names fit Linux IFNAMSIZ), and `ember:{id}` iptables comment threaded through every rule add/remove so cleanup only ever matches this install's rules. reconcile() now scopes the orphan-TAP sweep by the install's prefix and skips it entirely if the config can't be read, which is preferable to deleting a foreign device. NetworkBackend::teardown takes &GlobalConfig to thread the comment through; macOS impl is a no-op update. Removes the singleton `pool::POOL_NAME`, `thin::IMAGE_PREFIX`, `thin::VM_PREFIX`, `network::ip::DEFAULT_SUBNET`, and `tap::list_ember_devices()` constants/helpers — names now flow from config. No back-compat shim for older configs: existing installs need `ember deinit --purge && ember init` to migrate. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6

Previous commit's clean-slate naming forced existing installs to deinit + reinit because the config schema gained two required-by- convention fields (instance_id, ip_subnet). That's a worse migration than the problem it solves. Make the accessors back-compat aware instead: when instance_id is empty (the serde default for older configs), return the historical unprefixed names — `ember-pool`, `em-`, `ember-img-`, `ember-vm-`, and an empty iptables comment. ip_subnet's serde default already maps to `10.100.0.0/16`. The iptables comment match is the subtle one. Older binaries added rules without `-m comment` at all; if the upgraded binary then tried to delete the same rule WITH a comment match, iptables would treat it as a different rule and the cleanup would silently no-op. Splice the comment match in only when the comment is non-empty so the byte- exact rule stays matchable on `-D`. Add three integration tests guarding the isolation contract: * `dm_thin_two_installs_dont_interfere`: two installs with distinct --instance-id stand up `ember-aaaa-pool` and `ember-bbbb-pool` side-by-side; deinit on A leaves B's pool intact. * `legacy_config_without_instance_id_keeps_working`: writes a hand-crafted legacy config (no instance_id), runs `ember info` through the binary, asserts the rendered pool name is the unprefixed `ember-pool`. * `reconcile_does_not_touch_other_installs_taps`: creates a TAP matching install A's prefix manually, runs `ember vm list` against install B (which triggers reconcile), verifies A's TAP survives. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6

The cross-installation isolation tests (two-install dm-thin non-interference, legacy config compatibility, scoped TAP reconciliation) live in `tests/isolation.rs` rather than piggybacking on `tests/dm_thin.rs`. The dm-thin file is for backend mechanics; isolation is a different invariant that happens to use dm-thin as its substrate. Also fold a one-line WHY note into each legacy-mode branch on GlobalConfig — the literals there must match what the pre-instance-id binary wrote to disk and to the kernel, and the specific failure mode (orphaned thin ids, IFNAMSIZ overflow, silent iptables -D no-ops) varies per accessor. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6

vmnet's shared subnet (192.168.64.0/24) is owned by the framework and there's no public API to ask for a different one, so two macOS installs running VMs concurrently both allocate from the same /24 and hand out colliding guest IPs. State-dir isolation handled everything else on macOS, but this last collision survived. Carve the /24 into 8 /27 slots (32 addresses, 8 /30 VM blocks per install) and pick a slot from the low 3 bits of the instance id. Two installs collide with 1/8 probability per pair; `--ip-subnet` is the override when that bites. 8 concurrent VMs per install is generous for personal use on macOS. Plumb via a new `Platform::default_ip_subnet(instance_id)` trait method so init.rs picks the right derivation per platform: Linux keeps the existing /16 slot inside 10/8, macOS gets the /27 slice. `MacosNetwork::setup` reads `config.ip_subnet` for new configs and falls back to the full /24 when `instance_id` is empty so legacy installs continue to use the same allocations.json layout the old binary wrote. IP reuse on VM teardown was already correct: the allocator picks the lowest-numbered free `/30` block and `release()` removes the entry, so freed slots are reused before higher indices. Existing `allocate_reuses_released_block` test covers it. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6

Previous /27 slicing capped each macOS install at 8 VMs because ember reused Linux's /30 P2P allocator on macOS, where it makes no sense — vmnet provides a shared L2 bridge so every guest sits on the same subnet behind one gateway, and the host (.1) / broadcast (.3) of the per-VM /30 are pure waste. 75% of the address space went unused. Add `network::ip::allocate_single`: hands out one /32 address from a shared subnet, with a `reserved` list so callers can pin host-global addresses (vmnet's network/gateway/broadcast in our case) without seeding allocations.json. Persists through the same flock-locked file as `allocate`, with `block_index` reinterpreted as an address offset. `MacosNetwork::setup` branches on `instance_id`: * Legacy (no instance id): keeps using the /30 allocator against the full /24 so existing allocations.json layouts stay valid byte-for-byte across the upgrade. * New install: uses `allocate_single` against the /27 slot. ~30 VMs/install (32 - reserved), same 1/8 collision profile. Move the /30-minimum constraint out of `parse_cidr` into `allocate` so the shared parser can also accept /27, /28, /32 for single-IP callers. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6

GlobalConfig previously owned the format of the dm-thin pool name, the image/VM thin-volume prefixes, the TAP device prefix, and the iptables comment — five accessors that hard-coded what should be a subsystem-private concern. Each had its own legacy-fallback branch duplicating the same `if instance_id.is_empty()` shape. Replace the five accessors with one generic identity surface, `GlobalConfig::instance_namespace() -> Option<&str>`, that collapses the empty-string legacy sentinel into `None`. Each subsystem owns its own pure derivation: * `dm_thin::pool::name(Option<&str>)` * `dm_thin::thin::{image_prefix, vm_prefix}(Option<&str>)` * `network::tap::prefix(Option<&str>)` * `network::nat::comment(Option<&str>)` The legacy literals (`ember-pool`, `ember-img-`, `ember-vm-`, `em-`, empty iptables comment) live next to the code that has to match them byte-for-byte against existing kernel state, with the "why this exact string" comment alongside. Two related cleanups from the review: * `DmThinStorage::init` was rebuilding the pool name itself with `format!("ember-{}-pool", ...)`, duplicating the `dm_thin_pool_name()` accessor's format and creating a silent drift risk. It now calls `pool::name(Some(...))` — one source of truth. * The macOS vmnet-reserved address list was inlined in `MacosNetwork::setup`; hoist to a module-level `VMNET_RESERVED` const so the three host-global addresses have one definition. * `reconcile::cleanup_network` was a one-line passthrough to `network::cleanup` — inlined the call. 229 ember-core + ember-linux unit tests still pass; integration tests (`tests/isolation.rs`, `tests/dm_thin.rs`) are unchanged and still cover the cross-install isolation contract via the binary. https://claude.ai/code/session_01F51Z281wnRSUHqHPub4msK

Adds a one-bullet style note under "Coding Style & Conventions" capturing the principle that motivated the previous refactor: subsystems own their own names/formats, shared types like `GlobalConfig` expose generic identity, and reaching across a boundary to format or match another subsystem's strings is a signal to move the logic to the owning side. https://claude.ai/code/session_01F51Z281wnRSUHqHPub4msK

`MacosNetwork::setup` was the lone holdout still reading `config.instance_id` raw to decide between the legacy /30 allocator and the new /27 single-IP path — every other subsystem goes through `GlobalConfig::instance_namespace()`. Switch to the accessor and match on `Option<&str>` so the legacy-vs-tagged intent reads at the type level, consistent with `dm_thin::pool`, `network::tap`, `network::nat`, etc. No behavior change — `instance_namespace()` collapses the empty instance_id to `None`, which selects the same legacy branch the `is_empty()` check did. https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi

`derive_ip_subnet` lived in `ember_core::config` but was only ever called from `LinuxPlatform::default_ip_subnet`. macOS has its own derivation (`derive_vmnet_subnet` in `ember-macos`) that carves a /27 out of vmnet's fixed /24, so keeping the Linux-specific function in core was the opposite of the boundary discipline this branch is trying to land: shared types stay generic; platforms own their own scoped derivations. Move it to `ember_linux::network::ip::derive_default_subnet`, alongside the `network::ip::allocate`/`release` it pairs with. Expose `ember_core::config::fnv1a_32` as `pub` so the platform crate can hash the instance id with the same primitive `derive_instance_id` (still in core, still platform-agnostic) uses — duplicating a 7-line hash would be sillier than publishing a stable utility. `network.rs`'s `pub use ember_core::network::ip` re-export becomes a local `pub mod ip` that re-exports the same surface (`allocate`, `allocate_single`, `release`, `IpAllocation`, …) and adds the Linux-only `derive_default_subnet`. Every existing `network::ip::X` call site keeps resolving — the new module pulls core's public items through `pub use ember_core::network::ip::*`. https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi

`derive_vmnet_subnet`'s `unwrap_or(0)` was defensive against a hand-edited config, but the function is only called from `MacosPlatform::default_ip_subnet` at `ember init` — never at runtime — and the `instance_id` it gets there is either CLI-validated (`parse_instance_id` enforces 4 lowercase hex) or auto-derived (`derive_instance_id` always emits 4 hex chars). The fallback was unreachable, but in the hypothetical world where validation regresses it would silently collide every affected install on slot 0 — exactly the cross-install clash this branch exists to prevent. Switch to a `panic!` with a message that names the contract ("CLI validation should have rejected this"), and replace the `falls_back_to_slot_zero_on_garbage_id` test with two `#[should_panic]` cases. The test that was locking in the silent fallback is now locking in the loud one. Also drops a trailing blank line in reconcile.rs that `cargo fmt` flagged in passing. https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi

claude added 10 commits May 18, 2026 07:32

aljoscha merged commit 62250d9 into main May 20, 2026
2 checks passed

aljoscha deleted the claude/isolate-config-installations-7rzVm branch May 20, 2026 12:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scope host-global resources to a per-installation instance id#17

Scope host-global resources to a per-installation instance id#17
aljoscha merged 10 commits into
mainfrom
claude/isolate-config-installations-7rzVm

aljoscha commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

aljoscha commented May 19, 2026

Summary

What gets scoped

Boundary discipline

Back-compat

Risk

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants