Question
In issue #4, @a-szegel pointed to amzn/efa-dp-direct as the EFA equivalent of IBGDA: a CUDA library that lets GPU kernels post work requests and poll completions directly on EFA queue pairs, with no CPU proxy involved.
Currently, NVSHMEM's libfabric transport on EFA routes all device-initiated operations through a CPU proxy thread. The performance docs note that IBGDA "differs from other transports that rely on passing messages to a CPU proxy thread to initiate the transfer." For fine-grained P2P workloads this proxy path is a severe bottleneck.
Our use case: Ring attention for distributed video diffusion training on AWS P5en (H200, 4× EFA NICs/node). The workload performs nvshmemx_putmem_nbi_on_stream + nvshmemx_signal_op across nodes (32 layers × 7 ring hops × forward+backward).
Observed:
NVSHMEM cross-node ring: ~248 s/step, 1% GPU utilization
Same topology with NCCL P2P: ~65 s/step, about 3.8× faster (248 / 65)
GPU shows burst-then-idle pattern: 100% during compute, then 0% while proxy drains transfers
This matches the Perseus paper (arXiv:2605.00686) analysis of proxy fence serialization reducing signaled transfer throughput to ~2% of unsignaled at high concurrency on non-IBGDA fabrics.
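For scale, here is a rough back-of-the-envelope sketch of the per-hop budget implied by the numbers above. It assumes one signaled transfer per ring hop and spreads the whole step time evenly across hops, so it overstates the communication cost (compute is included). The constant and function names are illustrative only.

```python
# Back-of-the-envelope per-hop budget from the figures quoted in this issue.
# Assumption: one signaled transfer per ring hop, and the full step time is
# divided evenly across hops (compute time is not separated out).

LAYERS = 32
RING_HOPS = 7
PASSES = 2  # forward + backward

hops_per_step = LAYERS * RING_HOPS * PASSES


def seconds_per_hop(step_seconds: float) -> float:
    """Average wall-clock time per ring hop for a given step time."""
    return step_seconds / hops_per_step


nvshmem_per_hop = seconds_per_hop(248.0)  # NVSHMEM cross-node ring
nccl_per_hop = seconds_per_hop(65.0)      # NCCL P2P, same topology

print(f"hops per step:   {hops_per_step}")
print(f"NVSHMEM per hop: {nvshmem_per_hop:.3f} s")
print(f"NCCL per hop:    {nccl_per_hop:.3f} s")
```

Under these assumptions a step contains 448 hops, which gives roughly 0.55 s per hop with NVSHMEM against roughly 0.15 s with NCCL P2P.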
Questions:
Is there a plan to integrate efa-dp-direct as a GPU-initiated transport in NVSHMEM (similar to how IBGDA provides GPU-direct NIC submission for Mellanox)?
If not planned, are there alternative API patterns or tuning parameters that can reduce the proxy serialization overhead for fine-grained signaled P2P on EFA?
Environment:
NVSHMEM v3.6.5-0 (with patches #35, #76)
CUDA 12.8.1, NCCL v2.29.7, aws-ofi-nccl v1.19.0, GDRCopy v2.5.1
4× H200 nodes (P5en), 8 GPUs/node, 4× EFA NICs/node