Scope host-global resources to a per-installation instance id#17
Merged
Conversation
Two ember installations on the same host previously shared the dm-thin
pool name, the `em-*` TAP namespace, and (when both ran VMs) the IP
subnet. Worse, reconcile() listed every `em-*` interface and would
delete the other install's TAPs on any command, and `deinit --purge`
tore down the singleton `ember-pool` regardless of which install was
asking. Integration tests against a temp `--state-dir` could destroy
the user's live install on the same machine.
Introduce `instance_id` (4 hex chars, derived from a hash of the
canonicalized state-dir at `ember init`, overridable via
`--instance-id`) and `ip_subnet` (defaulting to a `/16` slot in
`10.0.0.0/8` derived from the same hash, overridable via
`--ip-subnet`). Both are persisted on `GlobalConfig`. Helper
accessors derive every host-global name from the id:
`ember-{id}-pool`, `ember-{id}-{img,vm}-` device-mapper prefixes,
`em{id}-` TAP prefix (14-char names fit Linux IFNAMSIZ), and
`ember:{id}` iptables comment threaded through every rule add/remove
so cleanup only ever matches this install's rules.
reconcile() now scopes the orphan-TAP sweep by the install's prefix
and skips it entirely if the config can't be read, which is
preferable to deleting a foreign device. NetworkBackend::teardown
takes &GlobalConfig to thread the comment through; macOS impl is a
no-op update.
Removes the singleton `pool::POOL_NAME`, `thin::IMAGE_PREFIX`,
`thin::VM_PREFIX`, `network::ip::DEFAULT_SUBNET`, and
`tap::list_ember_devices()` constants/helpers — names now flow from
config. No back-compat shim for older configs: existing installs
need `ember deinit --purge && ember init` to migrate.
https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
Previous commit's clean-slate naming forced existing installs to deinit + reinit because the config schema gained two required-by- convention fields (instance_id, ip_subnet). That's a worse migration than the problem it solves. Make the accessors back-compat aware instead: when instance_id is empty (the serde default for older configs), return the historical unprefixed names — `ember-pool`, `em-`, `ember-img-`, `ember-vm-`, and an empty iptables comment. ip_subnet's serde default already maps to `10.100.0.0/16`. The iptables comment match is the subtle one. Older binaries added rules without `-m comment` at all; if the upgraded binary then tried to delete the same rule WITH a comment match, iptables would treat it as a different rule and the cleanup would silently no-op. Splice the comment match in only when the comment is non-empty so the byte- exact rule stays matchable on `-D`. Add three integration tests guarding the isolation contract: * `dm_thin_two_installs_dont_interfere`: two installs with distinct --instance-id stand up `ember-aaaa-pool` and `ember-bbbb-pool` side-by-side; deinit on A leaves B's pool intact. * `legacy_config_without_instance_id_keeps_working`: writes a hand-crafted legacy config (no instance_id), runs `ember info` through the binary, asserts the rendered pool name is the unprefixed `ember-pool`. * `reconcile_does_not_touch_other_installs_taps`: creates a TAP matching install A's prefix manually, runs `ember vm list` against install B (which triggers reconcile), verifies A's TAP survives. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
The cross-installation isolation tests (two-install dm-thin non-interference, legacy config compatibility, scoped TAP reconciliation) live in `tests/isolation.rs` rather than piggybacking on `tests/dm_thin.rs`. The dm-thin file is for backend mechanics; isolation is a different invariant that happens to use dm-thin as its substrate. Also fold a one-line WHY note into each legacy-mode branch on GlobalConfig — the literals there must match what the pre-instance-id binary wrote to disk and to the kernel, and the specific failure mode (orphaned thin ids, IFNAMSIZ overflow, silent iptables -D no-ops) varies per accessor. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
vmnet's shared subnet (192.168.64.0/24) is owned by the framework and there's no public API to ask for a different one, so two macOS installs running VMs concurrently both allocate from the same /24 and hand out colliding guest IPs. State-dir isolation handled everything else on macOS, but this last collision survived. Carve the /24 into 8 /27 slots (32 addresses, 8 /30 VM blocks per install) and pick a slot from the low 3 bits of the instance id. Two installs collide with 1/8 probability per pair; `--ip-subnet` is the override when that bites. 8 concurrent VMs per install is generous for personal use on macOS. Plumb via a new `Platform::default_ip_subnet(instance_id)` trait method so init.rs picks the right derivation per platform: Linux keeps the existing /16 slot inside 10/8, macOS gets the /27 slice. `MacosNetwork::setup` reads `config.ip_subnet` for new configs and falls back to the full /24 when `instance_id` is empty so legacy installs continue to use the same allocations.json layout the old binary wrote. IP reuse on VM teardown was already correct: the allocator picks the lowest-numbered free `/30` block and `release()` removes the entry, so freed slots are reused before higher indices. Existing `allocate_reuses_released_block` test covers it. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
Previous /27 slicing capped each macOS install at 8 VMs because ember reused Linux's /30 P2P allocator on macOS, where it makes no sense — vmnet provides a shared L2 bridge so every guest sits on the same subnet behind one gateway, and the host (.1) / broadcast (.3) of the per-VM /30 are pure waste. 75% of the address space went unused. Add `network::ip::allocate_single`: hands out one /32 address from a shared subnet, with a `reserved` list so callers can pin host-global addresses (vmnet's network/gateway/broadcast in our case) without seeding allocations.json. Persists through the same flock-locked file as `allocate`, with `block_index` reinterpreted as an address offset. `MacosNetwork::setup` branches on `instance_id`: * Legacy (no instance id): keeps using the /30 allocator against the full /24 so existing allocations.json layouts stay valid byte-for-byte across the upgrade. * New install: uses `allocate_single` against the /27 slot. ~30 VMs/install (32 - reserved), same 1/8 collision profile. Move the /30-minimum constraint out of `parse_cidr` into `allocate` so the shared parser can also accept /27, /28, /32 for single-IP callers. https://claude.ai/code/session_01XbXjsQa8SRWLzxnuKFzxt6
GlobalConfig previously owned the format of the dm-thin pool name,
the image/VM thin-volume prefixes, the TAP device prefix, and the
iptables comment — five accessors that hard-coded what should be a
subsystem-private concern. Each had its own legacy-fallback branch
duplicating the same `if instance_id.is_empty()` shape.
Replace the five accessors with one generic identity surface,
`GlobalConfig::instance_namespace() -> Option<&str>`, that collapses
the empty-string legacy sentinel into `None`. Each subsystem owns
its own pure derivation:
* `dm_thin::pool::name(Option<&str>)`
* `dm_thin::thin::{image_prefix, vm_prefix}(Option<&str>)`
* `network::tap::prefix(Option<&str>)`
* `network::nat::comment(Option<&str>)`
The legacy literals (`ember-pool`, `ember-img-`, `ember-vm-`, `em-`,
empty iptables comment) live next to the code that has to match
them byte-for-byte against existing kernel state, with the
"why this exact string" comment alongside.
Two related cleanups from the review:
* `DmThinStorage::init` was rebuilding the pool name itself with
`format!("ember-{}-pool", ...)`, duplicating the
`dm_thin_pool_name()` accessor's format and creating a silent
drift risk. It now calls `pool::name(Some(...))` — one source of
truth.
* The macOS vmnet-reserved address list was inlined in
`MacosNetwork::setup`; hoist to a module-level `VMNET_RESERVED`
const so the three host-global addresses have one definition.
* `reconcile::cleanup_network` was a one-line passthrough to
`network::cleanup` — inlined the call.
229 ember-core + ember-linux unit tests still pass; integration
tests (`tests/isolation.rs`, `tests/dm_thin.rs`) are unchanged and
still cover the cross-install isolation contract via the binary.
https://claude.ai/code/session_01F51Z281wnRSUHqHPub4msK
Adds a one-bullet style note under "Coding Style & Conventions" capturing the principle that motivated the previous refactor: subsystems own their own names/formats, shared types like `GlobalConfig` expose generic identity, and reaching across a boundary to format or match another subsystem's strings is a signal to move the logic to the owning side. https://claude.ai/code/session_01F51Z281wnRSUHqHPub4msK
`MacosNetwork::setup` was the lone holdout still reading `config.instance_id` raw to decide between the legacy /30 allocator and the new /27 single-IP path — every other subsystem goes through `GlobalConfig::instance_namespace()`. Switch to the accessor and match on `Option<&str>` so the legacy-vs-tagged intent reads at the type level, consistent with `dm_thin::pool`, `network::tap`, `network::nat`, etc. No behavior change — `instance_namespace()` collapses the empty instance_id to `None`, which selects the same legacy branch the `is_empty()` check did. https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi
`derive_ip_subnet` lived in `ember_core::config` but was only ever called from `LinuxPlatform::default_ip_subnet`. macOS has its own derivation (`derive_vmnet_subnet` in `ember-macos`) that carves a /27 out of vmnet's fixed /24, so keeping the Linux-specific function in core was the opposite of the boundary discipline this branch is trying to land: shared types stay generic; platforms own their own scoped derivations. Move it to `ember_linux::network::ip::derive_default_subnet`, alongside the `network::ip::allocate`/`release` it pairs with. Expose `ember_core::config::fnv1a_32` as `pub` so the platform crate can hash the instance id with the same primitive `derive_instance_id` (still in core, still platform-agnostic) uses — duplicating a 7-line hash would be sillier than publishing a stable utility. `network.rs`'s `pub use ember_core::network::ip` re-export becomes a local `pub mod ip` that re-exports the same surface (`allocate`, `allocate_single`, `release`, `IpAllocation`, …) and adds the Linux-only `derive_default_subnet`. Every existing `network::ip::X` call site keeps resolving — the new module pulls core's public items through `pub use ember_core::network::ip::*`. https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi
`derive_vmnet_subnet`'s `unwrap_or(0)` was defensive against a
hand-edited config, but the function is only called from
`MacosPlatform::default_ip_subnet` at `ember init` — never at
runtime — and the `instance_id` it gets there is either
CLI-validated (`parse_instance_id` enforces 4 lowercase hex) or
auto-derived (`derive_instance_id` always emits 4 hex chars).
The fallback was unreachable, but in the hypothetical world where
validation regresses it would silently collide every affected
install on slot 0 — exactly the cross-install clash this branch
exists to prevent.
Switch to a `panic!` with a message that names the contract
("CLI validation should have rejected this"), and replace the
`falls_back_to_slot_zero_on_garbage_id` test with two
`#[should_panic]` cases. The test that was locking in the silent
fallback is now locking in the loud one.
Also drops a trailing blank line in reconcile.rs that
`cargo fmt` flagged in passing.
https://claude.ai/code/session_01DdHPXvcXCZagRRDobyPbvi
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two ember installations on the same host previously shared the dm-thin
pool name (
ember-pool), theem-*TAP namespace, the10.100.0.0/16default subnet, and — on macOS — vmnet's
192.168.64.0/24. Worse,reconcile()listed everyem-*interface and would delete the otherinstall's TAPs on any command;
deinit --purgetore down the singletonember-poolregardless of which install was asking. An integrationtest against a temp
--state-dircould destroy a live install on thesame machine.
This branch introduces a 4-hex-char
instance_id(derived from a hashof the canonicalized state-dir at
ember init, overridable via--instance-id) and threads it through every host-global resourcename.
What gets scoped
ember-{id}-pool,ember-{id}-img-,ember-{id}-vm-em{id}-TAP prefix (14-char names fitIFNAMSIZ),
ember:{id}iptables comment threaded through everyadd/remove so cleanup only ever matches this install's rules
/16slot inside10.0.0.0/8derived fromthe id; overridable via
--ip-subnetpicked by the low 3 bits of the id (1/8 collision per pair; the
override is
--ip-subnet)single-IP
/32allocator since vmnet is a shared L2 bridge. Liftsthe per-install cap from 8 to ~30 VMs while staying inside the /27.
reconcile()now scopes the orphan-TAP sweep by the install's prefixand skips it entirely if the config can't be read (preferable to
deleting a foreign device).
Boundary discipline
GlobalConfigdoes not own the format of any subsystem's names. Itexposes one generic accessor —
instance_namespace() -> Option<&str>,which collapses the empty-string legacy sentinel to
None— and eachsubsystem owns its own pure derivation next to the code that has to
match the literals byte-for-byte:
dm_thin::pool::name(Option<&str>)dm_thin::thin::{image_prefix, vm_prefix}(Option<&str>)network::tap::prefix(Option<&str>)network::nat::comment(Option<&str>)ember_linux::network::ip::derive_default_subnet(...)(Linux /16 in /8)ember_macos::network::derive_vmnet_subnet(...)(macOS /27 slot)Added a one-bullet style note in
CLAUDE.mdcapturing the principle.Back-compat
No
deinit --purge && initmigration required. Older configs lackinstance_id;instance_namespace()returnsNoneand eachsubsystem falls back to its historical literal (
ember-pool,em-,ember-img-,ember-vm-, empty iptables comment, full /24 on macOS).The subtle one is iptables: older binaries added rules without
-m commentat all, so on-Dthe comment match has to be elidedor the rule won't match. Handled in
network::nat.Risk
integration tests in
tests/isolation.rsthat go through thebinary — two-install dm-thin non-interference, legacy config
compatibility, scoped TAP reconcile.
Acceptable for personal use;
--ip-subnetis the escape hatch.legacy installs (no instance_id → keep the /30 allocator on the
full /24); new installs get the /32 allocator on the /27.
derive_vmnet_subneton bad instance_id panics rather thansilently falling back to slot 0 (which would re-collide every
affected install on the same slice). The id is CLI-validated or
auto-derived, so this is unreachable in practice — the panic is
there to catch a regression in validation.
Test plan
cargo test(229 unit tests pass)cargo test --test isolation(three cross-install tests)cargo test --test dm_thin -- --ignoredon a host with thedm-thin module + thin-provisioning-tools
ember initinvocations against distinct--state-dirs on the same Linux host; create VMs in both;deinit --purgeon A leaves B runningverify VMs keep working and
deinitcleans up old iptablesrules
distinct /27 slices and no guest-IP collisions