Skip to content

Design multi-device server lifecycle#786

Merged
smolkaj merged 1 commit into
mainfrom
multidevice-design
Jun 17, 2026
Merged

Design multi-device server lifecycle#786
smolkaj merged 1 commit into
mainfrom
multidevice-design

Conversation

@smolkaj

@smolkaj smolkaj commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a high-level design for running many independent 4ward devices on one large machine.

Requirements are intentionally terse:

  • scale: support the target workload on one large host
  • equivalence: behave like N independent 4ward instances, except for shared host resources and one management endpoint
  • simple default: one-device startup does not require the management API or device-ID ceremony
  • P4Runtime compliance: multi-device P4Runtime behavior follows the spec
  • dynamic lifecycle: create/delete devices after startup, including 10k devices in one RPC
  • simplicity: keep the design as simple as the requirements allow

The second half now stays at HLD level while preserving the concrete proto/API shape: one JVM/gRPC server, independent DeviceContexts, P4Runtime routing by uint64 device_id, no state sharing in the first design, bulk management RPCs, and P4Runtime-only pipeline loading.

Validation

Design-only change. Checked markdown diff with git diff --check.

@smolkaj smolkaj force-pushed the multidevice-design branch 12 times, most recently from 4820bdb to 961e8f7 Compare June 17, 2026 05:01

@smolkaj smolkaj left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design is clear, well-structured, and appropriately scoped. The decision table distinguishing spec-forced choices from simplicity choices is a great pattern. The "Assumptions to validate" section in particular is the right move — concrete numbers to check before committing to the architecture.

A few questions and suggestions:

Substantive

  1. Atomicity boundary on DeleteDevices: The doc says deletion is atomic and that active streams are closed with a clear error. But stream closure is inherently asynchronous. What is the precise atomicity guarantee? I'd expect: the registry map update is atomic (under the registry mutex), in-flight RPCs that already resolved a DeviceContext before deletion complete normally, and stream teardown happens after the map update. "Atomic" reads as all-or-nothing on the map, not "all streams are gone before the RPC returns" — worth stating that explicitly to avoid implementer confusion.

  2. Error code for forcibly closed streams: The error model table covers all other cases but omits this one. What Status should a stream receive when its device is deleted mid-flight? (e.g., NOT_FOUND with "device N deleted"?) Test #10 says "clearly" but the design should be the authoritative reference.

  3. FourwardServer::DeviceId() in a multi-device world: The design says this method "continues to expose DeviceId()." In the multi-device world, this must mean "returns 1, the startup default." Worth making that explicit here — callers who see server.DeviceId() will wonder whether it returns something meaningful in a multi-device context, and the answer ("always 1") is load-bearing for the single-device compatibility story.

  4. Server-wide device cap: The error model covers per-RPC batch limits (RESOURCE_EXHAUSTED), but is there a configured server-wide maximum live device count? If not, that's a valid choice, but worth stating explicitly rather than leaving it implied.

Minor

  • count overflow: The overflow check first_device_id + count - 1 must not overflow uint64 is correct, but since count is uint32, implementations must widen it to uint64 before adding. Worth a note — this is exactly the kind of subtle bug that surfaces only at boundary values.

  • ListDevices pagination: The empty ListDevicesRequest leaves no room for a page token. 10k IDs is ~80 KB packed — fine for now. Since "Future work" already mentions range-compressed listing, noting that pagination is a natural extension of that path would make the limitations explicit.

  • Proto package: The proto snippets don't specify a package declaration. Adding package fourward; (or matching the existing convention) makes the snippets self-contained as a reference.

  • NewManagementStub naming: Google C++ Style generally reserves New* for raw-pointer factories; unique_ptr-returning factories conventionally use Create*. Minor, but easy to fix before the API is cast in code.

@smolkaj smolkaj marked this pull request as ready for review June 17, 2026 05:03
@smolkaj smolkaj force-pushed the multidevice-design branch from 961e8f7 to 5f1e08d Compare June 17, 2026 05:04
@smolkaj

smolkaj commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks, this was useful. I kept the doc at HLD level and addressed the substantive points:

  • clarified that delete atomicity is the registry update, not synchronous teardown of every active stream
  • specified NOT_FOUND for streams closed because their device was deleted
  • made FourwardServer::DeviceId() explicitly mean the default device ID (1) even after more devices are created
  • added a server-wide device limit alongside the per-RPC request limit
  • noted uint64 overflow handling after widening count
  • added package fourward; to the proto snippet
  • moved ListDevices pagination/range compression into future work

The NewManagementStub naming point no longer applies after the HLD cleanup; I removed the stub-factory detail and kept only the higher-level C++ wrapper shape.

@smolkaj smolkaj enabled auto-merge (squash) June 17, 2026 05:09
@smolkaj smolkaj disabled auto-merge June 17, 2026 05:09
@smolkaj smolkaj enabled auto-merge (squash) June 17, 2026 05:09
@smolkaj smolkaj merged commit dfb3333 into main Jun 17, 2026
8 checks passed
@smolkaj smolkaj deleted the multidevice-design branch June 17, 2026 05:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant