Efa relax ordering/01 per token weak signals by vladimiraerov · Pull Request #1 · amazon-contributing/upstream-to-nccl

vladimiraerov · 2026-05-18T07:38:04Z

Description

Modify the LL kernels so that they no longer depend on ordering, which lets them work with weak-signal semantics.

EFA does not natively enforce same-QP ordering between RDMA writes and signals, so the dispatch kernel's count-signal contract (count implies all prior data puts are visible) is currently upheld by aws-ofi-nccl's sequence-number enforcement. This is not free on the datapath.

Add a parallel per-token weak-signal protocol on the dispatch path so the receiver can verify each token landed independently of the plugin's strong-signal contract. Allocate a third signal range for per-token data signals (3 * num_total_signals). The existing count-signal protocol stays intact; the per-token signal is additive and shares the timeout budget. NVLink path is unchanged.

This is a validation scaffold: the plan is to relax plugin ordering enforcement in aws-ofi-nccl while keeping this kernel-side check to catch regressions.

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>
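The receiver-side wait in this commit can be sketched on the host roughly as follows. This is a minimal illustration under stated assumptions, not the kernel code: `waitForTokens`, `WaitResult`, and the budget expressed as a number of polls are hypothetical, and the count is assumed to use the numTokensSent + 1 encoding mentioned in the third commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

enum class WaitResult { Ready, TimedOut };

// Hypothetical receiver-side check for one sender.
// readCount / readDataSignal stand in for loads of the count signal and of
// the per-token data signal. pollBudget models the shared timeout budget as
// a number of polls (the real kernel would more likely use a clock).
inline WaitResult waitForTokens(const std::function<uint64_t()>& readCount,
                                const std::function<uint64_t()>& readDataSignal,
                                uint64_t pollBudget,
                                uint64_t* numTokensOut) {
  assert(numTokensOut != nullptr);
  uint64_t polls = 0;
  uint64_t count = 0;

  // Phase 1: existing count-signal protocol. Zero means "not arrived yet".
  while ((count = readCount()) == 0) {
    if (++polls >= pollBudget) return WaitResult::TimedOut;
  }
  const uint64_t expected = count - 1;  // undo the +1 encoding

  // Phase 2: additive per-token check. The same `polls` counter keeps
  // running, so both phases draw from one budget.
  while (readDataSignal() < expected) {
    if (++polls >= pollBudget) return WaitResult::TimedOut;
  }

  *numTokensOut = expected;
  return WaitResult::Ready;
}
```

With ordering enforced by the plugin, phase 2 should complete on its first poll. Once enforcement is relaxed, a count that overtakes its data puts shows up as extra polls in phase 2, or as a timeout.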

Mirrors the dispatch-side weak-signal scaffold (prior commit) on the combine path. Goes further: combine RX no longer depends on the count-carrying finish-flag signal for inter-node. Dispatch writes a per-expert token count to a device buffer; combine reads it locally to know the expected data-signal count and spins only on per-token signals.

Signal layout simplified to 3*N (combine-data, dispatch-count, dispatch-data). The old combine finish-flag region is removed. NVLink intra-node path unchanged.

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>
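The 3*N layout and the local expected-count check can be sketched as below. All names are hypothetical, and the order of the three ranges simply follows the order in which the commit message lists them; the actual offsets in the kernel may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Three signal ranges of N slots each. The ordering is an assumption taken
// from the commit message: (combine-data, dispatch-count, dispatch-data).
enum SignalRange : std::size_t {
  kCombineData = 0,
  kDispatchCount = 1,
  kDispatchData = 2,
  kNumRanges = 3
};

// Total number of signal slots to allocate for N signals per range.
inline std::size_t totalSignals(std::size_t numTotalSignals) {
  return kNumRanges * numTotalSignals;
}

// Flat index of `slot` inside `range`.
inline std::size_t signalIndex(SignalRange range, std::size_t slot,
                               std::size_t numTotalSignals) {
  assert(range < kNumRanges && slot < numTotalSignals);
  return static_cast<std::size_t>(range) * numTotalSignals + slot;
}

// Combine RX check for one slot: the expected number of data signals comes
// from a local buffer that dispatch filled in earlier, so no count-carrying
// finish flag has to cross the network.
inline bool combineSlotDone(const std::vector<unsigned long long>& signals,
                            const std::vector<int>& tokensPerSlot,
                            std::size_t slot, std::size_t numTotalSignals) {
  assert(signals.size() >= totalSignals(numTotalSignals));
  assert(slot < tokensPerSlot.size() && tokensPerSlot[slot] >= 0);
  const unsigned long long expected =
      static_cast<unsigned long long>(tokensPerSlot[slot]);
  return signals[signalIndex(kCombineData, slot, numTotalSignals)] >= expected;
}
```

In the kernel, the receiver would spin on this check under the shared timeout rather than evaluate it once.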

EFA's GDAKI path does not correctly handle atomic SignalAdd values greater than 1, which is exactly what the dispatch count carried (numTokensSent + 1). Replace the 0-byte put + SignalAdd with net.putValue<int>, an inline 4-byte RDMA write: on GDAKI the value rides inline in the WQE (no local MR, no DMA read on the sender); on the proxy path it goes through NCCL's per-state pre-registered inlines slot, invisible to us. This also lowers latency, since it skips the signal-table atomic on RX and the per-signal proxy bookkeeping.

Receiver polls recvCntBuf directly for non-zero; encoding unchanged (numTokensSent + 1). cleanNextRecvCntBuf*() drop the !dP2pDisabled guard since the count region is now RDMA-written on both P2P and GIN paths. Intra-node P2P and the per-token data-signal step are unchanged.

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>
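The count encoding and the non-zero poll can be illustrated with a small host-side model. This is only a sketch: a `std::atomic<int>` stands in for one recvCntBuf slot, `writeCount` stands in for the inline 4-byte write (it does not reproduce the `net.putValue<int>` signature), and every helper name is hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Encode so that 0 can mean "nothing written yet", even for zero tokens.
inline int encodeTokenCount(int numTokensSent) {
  assert(numTokensSent >= 0);
  return numTokensSent + 1;
}

// Stand-in for the sender's inline value write into the receiver's slot.
inline void writeCount(std::atomic<int>& slot, int numTokensSent) {
  slot.store(encodeTokenCount(numTokensSent), std::memory_order_release);
}

// One receiver poll: returns true and decodes the count once the slot is
// non-zero, otherwise returns false and leaves *numTokens untouched.
inline bool tryReadCount(const std::atomic<int>& slot, int* numTokens) {
  assert(numTokens != nullptr);
  const int value = slot.load(std::memory_order_acquire);
  if (value == 0) return false;
  *numTokens = value - 1;
  return true;
}

// Reset the slot before it is reused. Because the slot is now written on
// every path, the reset has to happen on every path as well.
inline void cleanSlot(std::atomic<int>& slot) {
  slot.store(0, std::memory_order_relaxed);
}
```

The +1 offset is what lets a sender that has zero tokens still produce a visible, non-zero write, and a plain store of the full value avoids any atomic add larger than 1.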

Vladimir Aerov added 3 commits May 18, 2026 10:36

vladimiraerov wants to merge 3 commits into amazon-contributing:master from vladimiraerov:efa-relax-ordering/01-per-token-weak-signals
