[Question]: Is efa-dp-direct integration planned as a GPU-initiated transport for libfabric/EFA? #87

@quanta42

Question

In issue #4, @a-szegel pointed to amzn/efa-dp-direct as the EFA equivalent of IBGDA: a CUDA library that lets GPU kernels post work requests and poll completions directly on EFA queue pairs, with no CPU proxy involved.

Currently, NVSHMEM's libfabric transport on EFA routes all device-initiated operations through a CPU proxy thread. The performance docs note that IBGDA "differs from other transports that rely on passing messages to a CPU proxy thread to initiate the transfer." For fine-grained P2P workloads this proxy path is a severe bottleneck.

Our use case: Ring attention for distributed video diffusion training on AWS P5en (H200, 4× EFA NICs/node). The workload performs nvshmemx_putmem_nbi_on_stream + nvshmemx_signal_op across nodes (32 layers × 7 ring hops × forward+backward).
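
To give a sense of how fine-grained this traffic is, the loop structure above can be turned into a per-step count. The following is a minimal Python sketch, not part of the actual workload. It assumes each ring hop issues exactly one put plus one signal per layer and per pass on every rank (the issue does not state this). `seconds_per_put` and its `step_seconds` argument are hypothetical helpers introduced only for illustration.

```python
# Back-of-envelope model of the ring-attention communication pattern.
# ASSUMPTION (not stated in the issue): every ring hop issues exactly one
# put + signal pair per layer, per pass, on each rank.

LAYERS = 32     # layers, from the issue
RING_HOPS = 7   # ring hops per layer, from the issue
PASSES = 2      # forward + backward, from the issue


def signaled_puts_per_step(layers: int = LAYERS,
                           hops: int = RING_HOPS,
                           passes: int = PASSES) -> int:
    """Number of put + signal pairs one rank issues in a training step."""
    return layers * hops * passes


def seconds_per_put(step_seconds: float, puts: int) -> float:
    """Average wall-clock budget per signaled put (compute time included).

    Hypothetical helper: step_seconds is whatever step time was measured.
    """
    if puts <= 0:
        raise ValueError("puts must be positive")
    return step_seconds / puts


if __name__ == "__main__":
    puts = signaled_puts_per_step()
    print(f"signaled puts per step per rank: {puts}")
    # 100.0 is a placeholder; substitute a measured step time.
    print(f"average budget per put: {seconds_per_put(100.0, puts) * 1e3:.1f} ms")
```

Under that assumption the count is 32 × 7 × 2 = 448 signaled puts per rank per step, so any per-operation proxy overhead is paid hundreds of times per step.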

Observed:
NVSHMEM cross-node ring: ~248 s/step, 1% GPU utilization
Same topology with NCCL P2P: ~65 s/step, about 3.8× faster (248/65)
GPU shows burst-then-idle pattern: 100% during compute, then 0% while proxy drains transfers
This matches the analysis in the Perseus paper (arXiv:2605.00686), which finds that proxy fence serialization reduces signaled-transfer throughput to ~2% of unsignaled throughput at high concurrency on non-IBGDA fabrics.

Questions:
Is there a plan to integrate efa-dp-direct as a GPU-initiated transport in NVSHMEM (similar to how IBGDA provides GPU-direct NIC submission for Mellanox)?

If not planned, are there alternative API patterns or tuning parameters that can reduce the proxy serialization overhead for fine-grained signaled P2P on EFA?

Environment:
NVSHMEM v3.6.5-0 (with patches #35, #76)
CUDA 12.8.1, NCCL v2.29.7, aws-ofi-nccl v1.19.0, GDRCopy v2.5.1
4× H200 nodes (P5en), 8 GPUs/node, 4× EFA NICs/node
