[Question]: Is efa-dp-direct integration planned as a GPU-initiated transport for libfabric/EFA?

### Question

In issue #4, @a-szegel pointed to [amzn/efa-dp-direct](https://github.com/amzn/efa-dp-direct) as the EFA equivalent of IBGDA. efa-dp-direct is a CUDA library that lets GPU kernels post work requests and poll completions directly on EFA queue pairs without CPU proxy involvement.

Currently, NVSHMEM's libfabric transport on EFA routes all device-initiated operations through a CPU proxy thread. The performance docs note that IBGDA "differs from other transports that rely on passing messages to a CPU proxy thread to initiate the transfer." For fine-grained P2P workloads this proxy path is a severe bottleneck.

Our use case: ring attention for distributed video diffusion training on AWS P5en (H200, 4× EFA NICs/node). The workload issues `nvshmemx_putmem_nbi_on_stream` followed by `nvshmemx_signal_op` across nodes (32 layers × 7 ring hops × forward + backward).

Observed:

- NVSHMEM cross-node ring: ~248 s/step, 1% GPU utilization
- Same topology with NCCL P2P: ~65 s/step, roughly 3.8× faster
- GPU shows a burst-then-idle pattern: 100% during compute, then 0% while the proxy drains transfers

This matches the [Perseus paper (arXiv:2605.00686)](https://arxiv.org/abs/2605.00686) analysis, in which proxy fence serialization reduces signaled transfer throughput to ~2% of unsignaled throughput at high concurrency on non-IBGDA fabrics.
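
To put the step times in perspective, here is a rough back-of-envelope sketch. It assumes exactly one put + signal pair per ring hop, per layer, per pass, and that the layer/hop counts above apply per GPU per step; the real count may differ if transfers are chunked. The function names are illustrative only and not part of any library.

```python
# Back-of-envelope only. Assumption: one put + signal pair per ring hop,
# per layer, per pass (forward and backward), per GPU, per training step.
LAYERS = 32
RING_HOPS = 7
PASSES = 2  # forward + backward


def signaled_transfers_per_step(layers=LAYERS, hops=RING_HOPS, passes=PASSES):
    """Number of put + signal pairs one GPU issues in one training step."""
    return layers * hops * passes


def mean_seconds_per_transfer(step_seconds, transfers):
    """Step time divided by transfer count.

    This is an upper bound on per-transfer cost, because the step time
    also includes compute.
    """
    return step_seconds / transfers


transfers = signaled_transfers_per_step()
nvshmem_budget = mean_seconds_per_transfer(248.0, transfers)  # observed NVSHMEM step
nccl_budget = mean_seconds_per_transfer(65.0, transfers)      # observed NCCL step

print(f"signaled transfers per step: {transfers}")
print(f"NVSHMEM: at most {nvshmem_budget:.3f} s per transfer")
print(f"NCCL:    at most {nccl_budget:.3f} s per transfer")
```

Under these assumptions, each GPU issues only a few hundred signaled transfers per step, so a step time of hundreds of seconds implies a per-transfer cost far above what the network itself should need.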

Questions:

1. Is there a plan to integrate efa-dp-direct as a GPU-initiated transport in NVSHMEM (similar to how IBGDA provides GPU-direct NIC submission for Mellanox)?
2. If not planned, are there alternative API patterns or tuning parameters that can reduce the proxy serialization overhead for fine-grained signaled P2P on EFA?

Environment:

- NVSHMEM v3.6.5-0 (with patches [#35](https://github.com/NVIDIA/nvshmem/pull/35), [#76](https://github.com/NVIDIA/nvshmem/pull/76))
- CUDA 12.8.1, NCCL v2.29.7, aws-ofi-nccl v1.19.0, GDRCopy v2.5.1
- 4× H200 nodes (P5en), 8 GPUs/node, 4× EFA NICs/node


[Question]: Is efa-dp-direct integration planned as a GPU-initiated transport for libfabric/EFA? #87
