Design multi-device server lifecycle#786
Conversation
4820bdb to
961e8f7
Compare
smolkaj
left a comment
There was a problem hiding this comment.
The design is clear, well-structured, and appropriately scoped. The decision table distinguishing spec-forced choices from simplicity choices is a great pattern. The "Assumptions to validate" section in particular is the right move — concrete numbers to check before committing to the architecture.
A few questions and suggestions:
Substantive
-
Atomicity boundary on
DeleteDevices: The doc says deletion is atomic and that active streams are closed with a clear error. But stream closure is inherently asynchronous. What is the precise atomicity guarantee? I'd expect: the registry map update is atomic (under the registry mutex), in-flight RPCs that already resolved aDeviceContextbefore deletion complete normally, and stream teardown happens after the map update. "Atomic" reads as all-or-nothing on the map, not "all streams are gone before the RPC returns" — worth stating that explicitly to avoid implementer confusion. -
Error code for forcibly closed streams: The error model table covers all other cases but omits this one. What
Statusshould a stream receive when its device is deleted mid-flight? (e.g.,NOT_FOUNDwith"device N deleted"?) Test #10 says "clearly" but the design should be the authoritative reference. -
FourwardServer::DeviceId()in a multi-device world: The design says this method "continues to exposeDeviceId()." In the multi-device world, this must mean "returns 1, the startup default." Worth making that explicit here — callers who seeserver.DeviceId()will wonder whether it returns something meaningful in a multi-device context, and the answer ("always 1") is load-bearing for the single-device compatibility story. -
Server-wide device cap: The error model covers per-RPC batch limits (
RESOURCE_EXHAUSTED), but is there a configured server-wide maximum live device count? If not, that's a valid choice, but worth stating explicitly rather than leaving it implied.
Minor
-
countoverflow: The overflow checkfirst_device_id + count - 1 must not overflow uint64is correct, but sincecountisuint32, implementations must widen it touint64before adding. Worth a note — this is exactly the kind of subtle bug that surfaces only at boundary values. -
ListDevicespagination: The emptyListDevicesRequestleaves no room for a page token. 10k IDs is ~80 KB packed — fine for now. Since "Future work" already mentions range-compressed listing, noting that pagination is a natural extension of that path would make the limitations explicit. -
Proto package: The proto snippets don't specify a
packagedeclaration. Addingpackage fourward;(or matching the existing convention) makes the snippets self-contained as a reference. -
NewManagementStubnaming: Google C++ Style generally reservesNew*for raw-pointer factories;unique_ptr-returning factories conventionally useCreate*. Minor, but easy to fix before the API is cast in code.
961e8f7 to
5f1e08d
Compare
|
Thanks, this was useful. I kept the doc at HLD level and addressed the substantive points:
The |
Summary
Adds a high-level design for running many independent 4ward devices on one large machine.
Requirements are intentionally terse:
The second half now stays at HLD level while preserving the concrete proto/API shape: one JVM/gRPC server, independent
DeviceContexts, P4Runtime routing byuint64 device_id, no state sharing in the first design, bulk management RPCs, and P4Runtime-only pipeline loading.Validation
Design-only change. Checked markdown diff with
git diff --check.