Efa relax ordering/01 per token weak signals #1
Open
vladimiraerov wants to merge 3 commits into
added 3 commits
May 18, 2026 10:36
EFA does not natively enforce same-QP ordering between RDMA writes and signals, so the dispatch kernel's count-signal contract (count implies all prior data puts are visible) is currently upheld by aws-ofi-nccl's sequence-number enforcement. This is not free on the datapath.

Add a parallel per-token weak-signal protocol on the dispatch path so the receiver can verify each token landed independently of the plugin's strong-signal contract. Allocate a third signal range for per-token data signals (3 * num_total_signals). The existing count-signal protocol stays intact; the per-token signal is additive and shares the timeout budget. NVLink path is unchanged.

This is a validation scaffold: the plan is to relax plugin ordering enforcement in aws-ofi-nccl while keeping this kernel-side check to catch regressions.

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>
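The receiver-side check described in this commit can be modelled on the host as follows. This is only a sketch under stated assumptions, not the kernel code: `RecvSlot`, `WaitResult`, and `wait_for_tokens` are hypothetical names, the signals are modelled as `std::atomic` words rather than GPU signal-table entries, and the count is assumed to use the `numTokensSent + 1` encoding mentioned in the later commit.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical host-side model of one receive slot (not the real kernel state).
struct RecvSlot {
    std::atomic<int> count_signal{0};            // numTokensSent + 1; 0 means "not arrived"
    std::atomic<std::uint64_t> data_signal{0};   // bumped once per token whose data landed
};

enum class WaitResult { Ok, Timeout };

// Wait for the count signal, then for the per-token data signals.
// Both waits draw from the same deadline, i.e. they share one timeout budget.
inline WaitResult wait_for_tokens(const RecvSlot& slot,
                                  std::chrono::nanoseconds budget,
                                  int* num_tokens_out) {
    assert(num_tokens_out != nullptr);
    const auto deadline = std::chrono::steady_clock::now() + budget;

    // Step 1: existing count-signal protocol (kept intact).
    int encoded = 0;
    while ((encoded = slot.count_signal.load(std::memory_order_acquire)) == 0) {
        if (std::chrono::steady_clock::now() >= deadline) return WaitResult::Timeout;
    }
    assert(encoded > 0);
    const std::uint64_t expected = static_cast<std::uint64_t>(encoded - 1);

    // Step 2: additive per-token check; do not trust ordering, count landed tokens.
    while (slot.data_signal.load(std::memory_order_acquire) < expected) {
        if (std::chrono::steady_clock::now() >= deadline) return WaitResult::Timeout;
    }

    *num_tokens_out = encoded - 1;
    return WaitResult::Ok;
}
```

With this shape, a count that arrives before its data no longer lets the receiver proceed early: it either sees every per-token signal or times out, which is what makes the check usable for catching ordering regressions.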
Mirrors the dispatch-side weak-signal scaffold (prior commit) on the combine path.

Goes further: combine RX no longer depends on the count-carrying finish-flag signal for inter-node. Dispatch writes a per-expert token count to a device buffer; combine reads it locally to know the expected data-signal count and spins only on per-token signals.

Signal layout simplified to 3*N (combine-data, dispatch-count, dispatch-data). The old combine finish-flag region is removed. NVLink intra-node path unchanged.

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>
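One way to picture the 3*N signal layout is as three contiguous regions of N signals each. The sketch below assumes the regions are laid out in the order the commit message lists them; that ordering, and the names `SignalRegion`, `total_signals`, and `signal_index`, are illustrative assumptions rather than the actual kernel definitions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical region tags, in the order given in the commit message.
enum class SignalRegion : std::size_t {
    CombineData   = 0,  // per-token data signals on the combine path
    DispatchCount = 1,  // count signals on the dispatch path
    DispatchData  = 2   // per-token data signals on the dispatch path
};

constexpr std::size_t kNumRegions = 3;

// Total number of signals to allocate for N signals per region.
inline std::size_t total_signals(std::size_t n) {
    return kNumRegions * n;
}

// Flat index of signal `local` (0 <= local < n) inside `region`.
inline std::size_t signal_index(SignalRegion region, std::size_t n, std::size_t local) {
    assert(local < n);
    return static_cast<std::size_t>(region) * n + local;
}
```

For example, with N = 8 the table holds 24 signals, and local signal 3 of the dispatch-data region sits at flat index 19 under this assumed ordering.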
EFA's GDAKI path does not correctly handle atomic SignalAdd values greater than 1, which is exactly what the dispatch count carried (numTokensSent + 1).

Replace the 0-byte put + SignalAdd with net.putValue<int>, an inline 4-byte RDMA write: on GDAKI the value rides inline in the WQE (no local MR, no DMA read on the sender); on the proxy path it goes through NCCL's per-state pre-registered inlines slot, invisible to us. This also lowers latency, since we skip the signal-table atomic on RX and the per-signal proxy bookkeeping.

Receiver polls recvCntBuf directly for non-zero; encoding unchanged (numTokensSent + 1). cleanNextRecvCntBuf*() drop the !dP2pDisabled guard since the count region is now RDMA-written on both P2P and GIN paths. Intra-node P2P and the per-token data-signal step are unchanged.

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>
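The count encoding and the non-zero poll can be illustrated with a small host-side model. This is a sketch only: the slot is modelled as a `std::atomic<int>` standing in for one `recvCntBuf` entry, and `encode_count`, `try_read_count`, and `clean_slot` are hypothetical helpers, not functions from the kernel or from NCCL. The apparent reason for the `+ 1` is that a slot value of zero can then mean "nothing written yet" even when zero tokens were sent.

```cpp
#include <atomic>
#include <cassert>

// Sender side: value that would be carried by the inline 4-byte write.
inline int encode_count(int num_tokens_sent) {
    assert(num_tokens_sent >= 0);
    return num_tokens_sent + 1;  // never 0, so 0 stays reserved for "not arrived"
}

// Receiver side: one poll attempt on a count slot.
// Returns true and stores the decoded token count once the slot is non-zero.
inline bool try_read_count(const std::atomic<int>& slot, int* num_tokens_out) {
    assert(num_tokens_out != nullptr);
    const int value = slot.load(std::memory_order_acquire);
    if (value == 0) return false;  // count has not landed yet
    *num_tokens_out = value - 1;
    return true;
}

// Reset a slot so the next round starts from "not arrived".
inline void clean_slot(std::atomic<int>& slot) {
    slot.store(0, std::memory_order_release);
}
```

Because the receiver only tests for non-zero and then decodes the full word, no atomic add with a value greater than 1 is needed on the receive side.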
Description
Modify the LL kernels to be unordered so that they work with weak-signal semantics.