
serviceability: Create RFC for interface upgrades without breaking existing readers #3652

@ben-dz

Description


Background

The 2026-05-01 devnet incident (lake#556, lake#557) exposed a fundamental Borsh limitation: Vec<T> is encoded as (count: u32, elements: T[count]) — element count, not byte size. When an old SDK encounters a Vec<Interface> containing an element whose version it doesn't recognize, it can't skip the element to reach the next one — there's no framing telling it where the unknown element ends. The dispatcher consumes only the version byte and returns early, leaving the reader cursor stranded inside the unknown element's body. The next iteration reads garbage as the next discriminant. Cursor desync then propagates through the rest of the Vec and through the 12 Device fields after Interfaces.
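The following is a toy reproduction of the desync, for illustration only. The element layouts and body sizes are invented, and `encode_vec`, `known_body_len` and `old_decode` are hypothetical names. An "old" decoder that knows versions 1 and 2 reads a Vec whose first element is an unknown version 3. It consumes only the version byte, so the first byte of the V3 body is read as the next discriminant, and it finishes at the wrong offset.

```rust
/// Encode a Borsh-style Vec: element count (u32, little-endian) followed by
/// each element as (version: u8, body bytes). There is no per-element length.
fn encode_vec(elements: &[(u8, Vec<u8>)]) -> Vec<u8> {
    let mut out = (elements.len() as u32).to_le_bytes().to_vec();
    for (version, body) in elements {
        out.push(*version);
        out.extend_from_slice(body);
    }
    out
}

/// Hypothetical body sizes known to an "old" SDK: V1 = 2 bytes, V2 = 3 bytes.
fn known_body_len(version: u8) -> Option<usize> {
    match version {
        1 => Some(2),
        2 => Some(3),
        _ => None,
    }
}

/// Old-SDK decoder. On an unknown version it consumes only the version byte
/// and moves on, leaving the cursor inside the unknown element's body.
/// Returns the versions it believes it saw and the final cursor position.
fn old_decode(buf: &[u8]) -> (Vec<Option<u8>>, usize) {
    let count = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let mut pos = 4;
    let mut seen = Vec::new();
    for _ in 0..count {
        let version = buf.get(pos).copied().unwrap_or(0);
        pos += 1;
        match known_body_len(version) {
            Some(len) => {
                pos += len; // skip a body whose size we know
                seen.push(Some(version));
            }
            None => seen.push(None), // body NOT skipped: cursor is now stranded
        }
    }
    (seen, pos)
}
```

With elements `[(3, [2, 2, 2, 2]), (1, [7, 7])]` the encoded buffer is 12 bytes. The old decoder reports a phantom V2 element instead of the real V1 one and stops at byte 9, so the last three bytes would be misread as the following Device field.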

This is a systemic forward-compatibility hole that recurs every time a new Interface version ships and an off-chain SDK consumer hasn't been upgraded.

New Proposed Design

Add a new vector at the end of the Device struct to hold V3+ interfaces. For compatibility, writes put each interface into both vectors: the original interfaces Vec gets a truncated copy that matches the V2 format, and the new vector gets the V3+ form.

Original Proposed design

Stop evolving Interface itself. Freeze it at the current shape (V2) forever. New per-interface fields live in a new sibling Vec on Device that is appended at the end of Device's on-chain layout:

pub struct Device {
    // ... all existing fields, unchanged ...
    pub interfaces: Vec<Interface>,           // frozen at V2 shape, never evolves
    // ... rest of existing Device fields, unchanged ...
    pub max_multicast_publishers: u16,        // currently the last on-chain field
    pub interface_additional: Vec<InterfaceAdditional>,  // NEW, appended at end
}

pub struct InterfaceAdditional {
    pub length: u32,    // total bytes of this struct after this field — for skip-on-unknown
    pub name: String,   // FK to interface.name; on-chain code MUST keep names in sync
    pub version: u8,    // V1, V2, ... independent of Interface's discriminant
    // version-specific body follows
}

The on-chain program is responsible for keeping interface_additional[i].name in sync with the corresponding interfaces[j].name on every name mutation. (Names are mutable on-chain; this is the cost of using name as the join key, but no other Interface field is a better candidate.)
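A minimal sketch of how one InterfaceAdditional entry could be serialized in the field order shown in the struct above, assuming standard Borsh rules (little-endian integers, String as a u32 byte count followed by UTF-8 bytes). `encode_additional` is a hypothetical helper, and `body` is a placeholder for the version-specific fields, which this issue does not define.

```rust
/// Serialize one InterfaceAdditional entry in the proposed field order:
/// length: u32 | name: String | version: u8 | body.
/// `body` stands in for the (unspecified) version-specific fields, already
/// serialized.
fn encode_additional(name: &str, version: u8, body: &[u8]) -> Vec<u8> {
    // Everything after the length field: name, version, body.
    let mut rest = Vec::new();
    rest.extend_from_slice(&(name.len() as u32).to_le_bytes()); // String prefix
    rest.extend_from_slice(name.as_bytes());
    rest.push(version);
    rest.extend_from_slice(body);

    // `length` = total bytes of the entry after the length field itself.
    let mut out = (rest.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(&rest);
    out
}
```

For `("eth0", 1, [0xAA, 0xBB])` the length field holds 11 (4-byte string prefix + 4 name bytes + 1 version byte + 2 body bytes), and the whole entry is 15 bytes.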

Why this preserves current SDKs

Current SDKs reading post-migration Device accounts:

  1. Read everything up through max_multicast_publishers correctly — Interface is frozen at V2 shape, which they already understand.
  2. Their DeserializeDevice has no code for interface_additional, so they simply stop reading. The reader cursor lands at the start of the new field but nothing tries to consume it. ByteReader returns zeros past EOF anyway, so even mistaken reads don't panic.
  3. They get the base interface data correctly and silently miss the additional data.
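The three points above can be modelled with a toy buffer that starts at max_multicast_publishers (all earlier Device fields omitted). `ZeroPadReader` is a hypothetical Rust stand-in for the zeros-past-EOF behaviour attributed to the Go SDK's ByteReader; `old_sdk_tail` and `new_sdk_tail` are illustrative names. The sketch also assumes that a new SDK reading an unmigrated account relies on the same behaviour, so the missing interface_additional count reads as 0.

```rust
/// Hypothetical stand-in for a reader that returns zeros past EOF.
struct ZeroPadReader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ZeroPadReader<'a> {
    fn new(buf: &'a [u8]) -> Self {
        ZeroPadReader { buf, pos: 0 }
    }

    /// Next byte, or 0 once the buffer is exhausted.
    fn read_u8(&mut self) -> u8 {
        let b = self.buf.get(self.pos).copied().unwrap_or(0);
        self.pos += 1;
        b
    }

    fn read_u16(&mut self) -> u16 {
        u16::from_le_bytes([self.read_u8(), self.read_u8()])
    }

    fn read_u32(&mut self) -> u32 {
        u32::from_le_bytes([self.read_u8(), self.read_u8(), self.read_u8(), self.read_u8()])
    }
}

/// Old SDK: its schema ends at max_multicast_publishers, so it stops there.
/// Returns the field value and how many bytes were consumed.
fn old_sdk_tail(buf: &[u8]) -> (u16, usize) {
    let mut r = ZeroPadReader::new(buf);
    let max_multicast_publishers = r.read_u16();
    (max_multicast_publishers, r.pos) // trailing bytes are never touched
}

/// New SDK: also reads the interface_additional element count. On an
/// unmigrated account that count lies past EOF and reads as 0 (empty Vec).
fn new_sdk_tail(buf: &[u8]) -> (u16, u32) {
    let mut r = ZeroPadReader::new(buf);
    let max_multicast_publishers = r.read_u16();
    let additional_count = r.read_u32();
    (max_multicast_publishers, additional_count)
}
```

On a migrated buffer the old SDK consumes exactly the two bytes it knows about and returns the correct value; the new SDK sees one additional entry there and zero entries on an unmigrated buffer.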

No SDK consumer needs to upgrade before the migration runs. Compare Options A and B (below), which both require every consumer to be on the new SDK before the migration ships.

This is the property that elevates this design above any of the in-Vec framing approaches. Old CLIs, old admin tools, old monitoring scripts all keep working — they just don't see the additional fields.

Forward compat within interface_additional

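The same evolution problem has to be solved within the new Vec itself. Future InterfaceAdditional versions will add fields. SDKs that know V_n but encounter V_(n+1) should:

  • Read length (u32),
  • Read name (Borsh String: u32 byte count + UTF-8 bytes), then version (u8),
  • If version is unknown, reader.Advance(length - (4 + len(name) + 1)) to reach the next element, because name sits between length and version and both have already been consumed,
  • Continue.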
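A sketch of the skip-on-unknown loop for Vec<InterfaceAdditional>, assuming the field order from the struct definition and a reader that understands V1 only. Because name sits between length and version, a reader that has already consumed the name must subtract those bytes before skipping; this sketch sidesteps that arithmetic by slicing the whole element out by `length` first. `KnownEntry`, `MAX_KNOWN_VERSION`, `read_u32` and `decode_additional` are hypothetical names, and `body` is a placeholder for the version-specific fields.

```rust
/// An entry this (hypothetical) SDK version understands.
#[derive(Debug, PartialEq)]
struct KnownEntry {
    name: String,
    version: u8,
    body: Vec<u8>, // placeholder for the version-specific fields
}

/// Assumed: this reader understands InterfaceAdditional V1 only.
const MAX_KNOWN_VERSION: u8 = 1;

/// Little-endian u32 at `pos`, or None if the buffer is too short.
fn read_u32(buf: &[u8], pos: usize) -> Option<u32> {
    let b = buf.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Decode Vec<InterfaceAdditional>: count: u32, then elements laid out as
/// length: u32 | name: String | version: u8 | body. Entries with unknown
/// versions are skipped using `length`; truncated input yields None.
fn decode_additional(buf: &[u8]) -> Option<Vec<KnownEntry>> {
    let count = read_u32(buf, 0)?;
    let mut pos = 4;
    let mut known = Vec::new();
    for _ in 0..count {
        let length = read_u32(buf, pos)? as usize;
        let start = pos + 4; // first byte covered by `length`
        let end = start.checked_add(length)?;
        let elem = buf.get(start..end)?; // the whole element, by its framing

        let name_len = read_u32(elem, 0)? as usize;
        let name_bytes = elem.get(4..4 + name_len)?;
        let version = *elem.get(4 + name_len)?;

        if (1..=MAX_KNOWN_VERSION).contains(&version) {
            let name = String::from_utf8(name_bytes.to_vec()).ok()?;
            let body = elem[5 + name_len..].to_vec();
            known.push(KnownEntry { name, version, body });
        }
        // Known or not, resume at the byte right after this element.
        pos = end;
    }
    Some(known)
}
```

Given an unknown V2 entry for `eth0` followed by a V1 entry for `eth1`, the decoder skips the first entry cleanly and returns only `eth1`, with the cursor correctly aligned for every element.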

Per-element length + version is the same length-prefix-per-element pattern that made Option B (below) robust on its own. We're applying it where it's needed (inside the new evolutionary container) but not paying for it on interfaces (which is now frozen).

We choose a single catch-all interface_additional rather than one Vec per evolution so future fields grow into the existing structure rather than proliferating per-evolution Vecs.

Migration

Required:

  1. On-chain program upgrade that:

    • Treats Interface as frozen at V2 shape (drops V3-with-inline-flex-algo from the writer side).
    • Defines InterfaceAdditional and the interface_additional field on Device.
    • Adds a migration instruction that, for every existing Device, rewrites each Interface from V3 back to V2 and creates a corresponding InterfaceAdditional entry holding the moved flex_algo_node_segments data.
    • Keeps interface_additional[i].name in sync with interfaces[j].name on every Interface name mutation.
  2. SDK updates across Go (internal + external), TypeScript, Python — adding InterfaceAdditional deserialization. Crucially, this can roll out on a relaxed timeline because old SDKs continue to work in degraded-but-correct mode.

  3. Consumer rollouts: lake-indexer, CLIs, admin tools update at their own pace. Strict consumers (lake) should adopt new-format reads before downstream queries depend on extension fields.
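The data movement in the migration instruction can be sketched as a pure function. This is only an illustration: the issue does not list the V2 fields or the element type of flex_algo_node_segments, so `mtu` and `Vec<u32>` are placeholders, and `Interface`, `InterfaceV2`, `InterfaceV3`, `InterfaceAdditionalV1` and `migrate` are hypothetical names. It also assumes interfaces already at V2 need no InterfaceAdditional entry.

```rust
// Placeholders: `mtu` stands in for the real V2 fields, and `Vec<u32>` for
// the real element type of flex_algo_node_segments.
#[derive(Clone, Debug, PartialEq)]
struct InterfaceV2 {
    name: String,
    mtu: u16,
}

#[derive(Clone, Debug, PartialEq)]
struct InterfaceV3 {
    name: String,
    mtu: u16,
    flex_algo_node_segments: Vec<u32>,
}

/// First InterfaceAdditional version: carries the data moved out of V3.
#[derive(Debug, PartialEq)]
struct InterfaceAdditionalV1 {
    name: String, // join key back to interfaces[j].name
    flex_algo_node_segments: Vec<u32>,
}

enum Interface {
    V2(InterfaceV2),
    V3(InterfaceV3),
}

/// Rewrite every V3 interface to the frozen V2 shape and move its
/// flex_algo_node_segments into a matching InterfaceAdditional entry.
/// Interfaces already at V2 are kept as-is and get no additional entry.
fn migrate(interfaces: Vec<Interface>) -> (Vec<InterfaceV2>, Vec<InterfaceAdditionalV1>) {
    let mut frozen = Vec::with_capacity(interfaces.len());
    let mut additional = Vec::new();
    for iface in interfaces {
        match iface {
            Interface::V2(v2) => frozen.push(v2),
            Interface::V3(v3) => {
                frozen.push(InterfaceV2 { name: v3.name.clone(), mtu: v3.mtu });
                additional.push(InterfaceAdditionalV1 {
                    name: v3.name,
                    flex_algo_node_segments: v3.flex_algo_node_segments,
                });
            }
        }
    }
    (frozen, additional)
}
```

Order in `interfaces` is preserved, so nothing that depends on Vec position changes; the new entries are linked back only by name.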

The contrast with the original RFC-18 migration is the headline benefit: that one needed every consumer in lockstep before the migration ran on each env. This one doesn't.

Bonus cleanup

The pre-flight heuristic in deserialize.go:164 (length*18 > reader.Remaining()) hardcodes a minimum interface size that is already inaccurate post-V3. Once Interface is frozen at V2, the minimum is well-defined again, but it is better to drop the heuristic and let ByteReader's EOF handling do its job.

Discipline (what stays true forever)

  • Interface struct shape never changes again.
  • All future per-interface evolution lives in interface_additional with per-element length+version framing.
  • interface_additional stays at the end of Device's on-chain layout.
  • On-chain Interface name mutations propagate to corresponding interface_additional entries atomically.
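A sketch of the last rule: one rename routine that updates both Vecs in the same mutation, validating first so that a failed rename leaves both untouched. `Device`, `Iface`, `Additional`, `RenameError` and `rename_interface` are hypothetical, trimmed-down names; the uniqueness check is an assumption, based on the idea that a name can only serve as a join key if it identifies one interface.

```rust
#[derive(Debug, PartialEq)]
struct Iface {
    name: String,
}

#[derive(Debug, PartialEq)]
struct Additional {
    name: String, // must always equal some interfaces[j].name
    version: u8,
}

struct Device {
    interfaces: Vec<Iface>,
    interface_additional: Vec<Additional>,
}

#[derive(Debug, PartialEq)]
enum RenameError {
    NotFound,
    NameTaken,
}

/// Rename an interface and every interface_additional entry keyed to it.
/// All checks happen before any field is written.
fn rename_interface(dev: &mut Device, old: &str, new: &str) -> Result<(), RenameError> {
    if new != old && dev.interfaces.iter().any(|i| i.name == new) {
        return Err(RenameError::NameTaken); // would make the join key ambiguous
    }
    let iface = dev
        .interfaces
        .iter_mut()
        .find(|i| i.name == old)
        .ok_or(RenameError::NotFound)?;
    iface.name = new.to_string();
    for extra in dev.interface_additional.iter_mut().filter(|a| a.name == old) {
        extra.name = new.to_string();
    }
    Ok(())
}
```

Keeping this as the only code path that writes an interface name is one way to make the invariant hard to break by accident.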

Alternatives considered

A. Length-prefix the existing interfaces Vec body + sort by version

(byte_length: u32, count: u32, elements...) for the Vec, with on-chain code re-sorting interfaces ascending by version on every mutation. SDKs hitting an unknown version bail out of the Vec and reader.Advance(remaining_bytes) to resync.
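For comparison, a sketch of how an SDK would read the Option A framing. It assumes byte_length covers the count plus all elements, and it reuses invented body sizes (V1 = 2 bytes, V2 = 3 bytes); `known_body_len`, `read_u32` and `decode_option_a` are hypothetical names. Because elements are sorted ascending, the first unknown version means every later element is at least as new, so the reader stops and jumps to the end of the Vec.

```rust
/// Hypothetical body sizes per known version (V1 = 2 bytes, V2 = 3 bytes).
fn known_body_len(version: u8) -> Option<usize> {
    match version {
        1 => Some(2),
        2 => Some(3),
        _ => None,
    }
}

/// Little-endian u32 at `pos`, or None if the buffer is too short.
fn read_u32(buf: &[u8], pos: usize) -> Option<u32> {
    let b = buf.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Option A framing: byte_length: u32 | count: u32 | elements, sorted
/// ascending by version. Assumes byte_length covers count + elements.
/// Returns the known versions read and the cursor position after the Vec.
fn decode_option_a(buf: &[u8]) -> Option<(Vec<u8>, usize)> {
    let byte_length = read_u32(buf, 0)? as usize;
    let vec_end = 4usize.checked_add(byte_length)?; // resync point
    let count = read_u32(buf, 4)?;
    let mut pos = 8;
    let mut versions = Vec::new();
    for _ in 0..count {
        let version = *buf.get(pos)?;
        match known_body_len(version) {
            Some(len) => {
                versions.push(version);
                pos += 1 + len;
            }
            // Sorted ascending: everything after this is at least as new.
            None => break,
        }
    }
    // Equivalent to reader.Advance(remaining_bytes): jump past the whole Vec.
    Some((versions, vec_end))
}
```

With a V1 element followed by a V3 element and one trailing byte for the next Device field, the reader returns only the V1 version and resumes exactly at that trailing byte. The cost noted below still applies: today's SDKs cannot parse the leading byte_length at all.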

Why rejected: requires every off-chain consumer to upgrade before the migration runs (current SDKs can't read the new format). Also requires maintaining a sort invariant on every Interface mutator forever, and breaks any latent positional dependencies on Vec order.

B. Length-prefix per Interface element

Each Interface element gains its own (version: u8, body_length: u32, body) framing. SDKs skip unknown elements via body_length regardless of position.

Why rejected: also requires every off-chain consumer to upgrade before the migration runs. Marginally more robust than Option A (no sort needed) but doesn't preserve current SDKs at all.

C. Per-evolution sibling Vec on Device

Each new evolution gets its own typed Vec field on Device (e.g., interface_flex_algo: Vec<...>, later interface_bgp: Vec<...>). Same backward-compat properties as the chosen design.

Why not chosen: works identically from old-SDK perspective, but proliferates fields on Device as evolutions accumulate. The catch-all interface_additional keeps the Device struct cleaner.

D. Tagged Borsh enum for Interface

Doesn't help — Borsh's enum encoding is (discriminant: u8, body) with no length information. Old SDKs hitting an unknown discriminant still can't skip past it.

Related
