Efa relax ordering/01 per token weak signals #1

Open
vladimiraerov wants to merge 3 commits into amazon-contributing:master from vladimiraerov:efa-relax-ordering/01-per-token-weak-signals

@vladimiraerov
Collaborator

Description

Modify the LL kernels so that they are unordered, which lets them work with weak-signal semantics.

Vladimir Aerov added 3 commits May 18, 2026 10:36
EFA does not natively enforce same-QP ordering between RDMA writes and
signals, so the dispatch kernel's count-signal contract (count implies
all prior data puts are visible) is currently upheld by aws-ofi-nccl's
sequence-number enforcement. This is not free on the datapath.

Add a parallel per-token weak-signal protocol on the dispatch path so
the receiver can verify each token landed independently of the plugin's
strong-signal contract. Allocate a third signal range for per-token data
signals (3 * num_total_signals). The existing count-signal protocol
stays intact; the per-token signal is additive and shares the timeout
budget.
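
As a rough illustration of the receiver-side rule described above, the host-side C++ model below (not the kernel code) treats a slot as consumable only once the count signal has arrived and the per-token data signals have reached that count, whatever order they arrive in. The names `SlotState`, `Arrival`, `slotReady`, and `eventsUntilReady` are hypothetical, and the timeout is modeled simply as running out of events.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of one receive slot; not the kernel's real state.
struct SlotState {
    bool     countSeen = false;    // count signal has arrived
    uint32_t expectedTokens = 0;   // value carried by the count signal
    uint32_t tokensSignalled = 0;  // per-token data signals observed so far
};

enum class Arrival { Count, Token };

// A slot may be consumed only when both conditions hold, in any arrival order.
inline bool slotReady(const SlotState& s) {
    return s.countSeen && s.tokensSignalled == s.expectedTokens;
}

// Replays arrivals in order and returns how many events were needed before the
// slot became ready, or -1 if it never did (the real kernel would time out).
inline int eventsUntilReady(const std::vector<Arrival>& arrivals, uint32_t numTokens) {
    SlotState s;
    int seen = 0;
    for (Arrival a : arrivals) {
        ++seen;
        if (a == Arrival::Count) {
            s.countSeen = true;
            s.expectedTokens = numTokens;
        } else {
            assert(s.tokensSignalled < numTokens && "more token signals than tokens sent");
            ++s.tokensSignalled;
        }
        if (slotReady(s)) return seen;
    }
    return -1;
}
```

In this model a count signal that overtakes its data does not release the slot early, which is the regression the per-token check is meant to catch once plugin ordering is relaxed.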

NVLink path is unchanged. This is a validation scaffold: the plan is to
relax plugin ordering enforcement in aws-ofi-nccl while keeping this
kernel-side check to catch regressions.

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>

Mirror the dispatch-side weak-signal scaffold (prior commit) on the
combine path, and go one step further: for inter-node traffic, combine
RX no longer depends on the count-carrying finish-flag signal. Dispatch
writes a per-expert token count to a device buffer; combine reads it
locally to learn the expected data-signal count and spins only on
per-token signals.
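
The combine-side wait can be sketched on the host as below. This is only a model under stated assumptions: the per-expert count buffer is a plain `std::vector`, the data signal is a cumulative counter held in a `std::atomic`, and the timeout is a spin budget. `spinUntilAtLeast` and `combineExpertReady` are hypothetical names, not functions from this PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Polls a cumulative signal until it reaches `expected` or the spin budget
// runs out. Returns true on success, false on (modeled) timeout.
inline bool spinUntilAtLeast(const std::atomic<uint64_t>& signal,
                             uint64_t expected, uint64_t maxSpins) {
    for (uint64_t spin = 0; spin <= maxSpins; ++spin) {
        if (signal.load(std::memory_order_acquire) >= expected) return true;
    }
    return false;
}

// Looks up the locally stored token count for `expert` (written earlier by
// dispatch in this model) and waits only on the per-token data signal.
inline bool combineExpertReady(const std::vector<uint32_t>& sentPerExpert,
                               std::size_t expert,
                               const std::atomic<uint64_t>& dataSignal,
                               uint64_t maxSpins) {
    assert(expert < sentPerExpert.size());
    return spinUntilAtLeast(dataSignal, sentPerExpert[expert], maxSpins);
}
```

Because the expected count comes from local memory, no remote count-carrying signal is needed before the wait can start.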

Signal layout simplified to 3*N (combine-data, dispatch-count,
dispatch-data). The old combine finish-flag region is removed.
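
One way to picture the 3*N layout is as three back-to-back regions of N signals each. The sketch below assumes the regions are laid out in the order the commit message lists them (combine-data, dispatch-count, dispatch-data); the actual order and the names `SignalRegion`, `totalSignals`, and `signalIndex` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Region order is an assumption taken from the order in the commit message.
enum class SignalRegion : std::size_t {
    CombineData   = 0,
    DispatchCount = 1,
    DispatchData  = 2
};

constexpr std::size_t kNumRegions = 3;

// Total number of signals to allocate for N signals per region.
inline std::size_t totalSignals(std::size_t n) {
    return kNumRegions * n;
}

// Flat index of signal `i` inside `region`, for N signals per region.
inline std::size_t signalIndex(SignalRegion region, std::size_t i, std::size_t n) {
    assert(i < n);
    return static_cast<std::size_t>(region) * n + i;
}
```

With N = 8, for example, this model allocates 24 signals and places dispatch-count signal 3 at flat index 11.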

NVLink intra-node path unchanged.

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>

EFA's GDAKI path does not correctly handle atomic SignalAdd values
greater than 1, which is exactly what the dispatch count carried
(numTokensSent + 1). Replace the 0-byte put + SignalAdd with
net.putValue<int>, an inline 4-byte RDMA write: on GDAKI the value
rides inline in the WQE (no local MR, no DMA read on the sender);
on the proxy path it goes through NCCL's per-state pre-registered
inlines slot, invisible to us. This also lowers latency, since we skip
the signal-table atomic on RX and the per-signal proxy bookkeeping.

Receiver polls recvCntBuf directly for non-zero; encoding unchanged
(numTokensSent + 1). cleanNextRecvCntBuf*() drop the !dP2pDisabled
guard since the count region is now RDMA-written on both P2P and GIN
paths. Intra-node P2P and the per-token data-signal step are unchanged.
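
The count encoding can be illustrated with a minimal sketch. It assumes a cleared slot holds 0 and that the +1 offset exists so that a send of zero tokens is still distinguishable from "nothing written yet"; `encodeCount` and `decodeCount` are hypothetical helpers, not functions from this PR.

```cpp
#include <cassert>

// Sender side: value written into the receiver's count slot.
inline int encodeCount(int numTokensSent) {
    assert(numTokensSent >= 0);
    return numTokensSent + 1;  // never 0, so 0 can mean "not arrived"
}

// Receiver side: returns the token count, or -1 while the slot is still 0.
inline int decodeCount(int slotValue) {
    return slotValue == 0 ? -1 : slotValue - 1;
}
```

A receiver following this scheme keeps polling while `decodeCount` returns -1 and treats any other result as the number of tokens to expect.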

Signed-off-by: Vladimir Aerov <vaerov@amazon.com>