From 4361fbfc7622574fb856e583e44b76b987eabc24 Mon Sep 17 00:00:00 2001
From: Chuck Ketcham
Date: Fri, 3 Apr 2026 19:38:57 +0000
Subject: [PATCH 1/2] Add nv-qldpc-decoder Relay BP decoding guide

Signed-off-by: Chuck Ketcham
---
 docs/realtime_qldpc_relay_bp_guide.md | 406 ++++++++++++++++++++++++++
 1 file changed, 406 insertions(+)
 create mode 100644 docs/realtime_qldpc_relay_bp_guide.md

diff --git a/docs/realtime_qldpc_relay_bp_guide.md b/docs/realtime_qldpc_relay_bp_guide.md
new file mode 100644
index 00000000..e095b5a1
--- /dev/null
+++ b/docs/realtime_qldpc_relay_bp_guide.md
@@ -0,0 +1,406 @@
+# Realtime nv-qldpc-decoder Relay BP Decoding Guide
+
+This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP decoder
+using CUDA-Q's realtime host dispatch system. The decoder runs as a
+CPU-launched CUDA graph (`HOST_LOOP` dispatch path) and can operate in three
+configurations:
+
+- **CI unit test** -- standalone executable, no FPGA or network hardware needed
+- **Emulated end-to-end test** -- software FPGA emulator replaces real hardware
+- **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE
+
+---
+
+## Table of Contents
+
+1. [Prerequisites](#prerequisites)
+2. [Repository Layout](#repository-layout)
+3. [Building](#building)
+4. [CI Unit Test](#ci-unit-test)
+5. [Emulated End-to-End Test](#emulated-end-to-end-test)
+6. [FPGA End-to-End Test](#fpga-end-to-end-test)
+7. [Orchestration Script Reference](#orchestration-script-reference)
+
+---
+
+## Prerequisites
+
+### Hardware
+
+| Configuration | GPU | ConnectX NIC | FPGA |
+|---|---|---|---|
+| CI unit test | Any CUDA-capable GPU | Not required | Not required |
+| Emulated E2E | Any CUDA-capable GPU | Required (loopback cable) | Not required |
+| FPGA E2E | CUDA GPU with GPUDirect RDMA | Required | Required |
+
+Tested platforms: DGX Spark, GB200.
+
+### Software
+
+- **CUDA Toolkit**: 12.6 or 13.0
+- **CUDA-Q SDK**: pre-installed (provides `libcudaq`, `libnvqir`, `nvq++`)
+- **`nv-qldpc-decoder` plugin**: the proprietary nv-qldpc-decoder shared library
+  (`libcudaq-qec-nv-qldpc-decoder.so`). Required at runtime for all
+  three configurations.
+
+### Source Repositories
+
+| Repository | URL | Version |
+|---|---|---|
+| **cudaqx** | | `main` branch (or your feature branch) |
+| **cuda-quantum** (realtime) | | Commit `9ce3d2e886` |
+| **holoscan-sensor-bridge** | | Tag `2.6.0-EA2` |
+
+`cuda-quantum` provides `libcudaq-realtime` (the host dispatcher, ring buffer
+management, and dispatch kernel). `holoscan-sensor-bridge` provides the
+Hololink `GpuRoceTransceiver` library for RDMA transport.
+
+> **Note:** `holoscan-sensor-bridge` is only needed for the emulated and FPGA
+> end-to-end tests. The CI unit test requires only `libcudaq-realtime`.
+
+---
+
+## Repository Layout
+
+Key files within `cudaqx`:
+
+```
+libs/qec/
+  unittests/
+    realtime/
+      qec_graph_decode_test/
+        test_realtime_qldpc_graph_decoding.cpp    # CI unit test
+      qec_roce_decode_test/
+        data/
+          config_nv_qldpc_relay.yml               # Relay BP decoder config
+          syndromes_nv_qldpc_relay.txt            # 100 test syndrome shots
+    utils/
+      hololink_qldpc_graph_decoder_bridge.cpp     # Bridge tool (RDMA ↔ decoder)
+      hololink_qldpc_graph_decoder_test.sh        # Orchestration script
+      hololink_fpga_syndrome_playback.cpp         # Playback tool (loads syndromes)
+```
+
+The FPGA emulator is in the `cuda-quantum` repository:
+
+```
+cuda-quantum/realtime/
+  unittests/utils/
+    hololink_fpga_emulator.cpp                    # Software FPGA emulator
+```
+
+---
+
+## Building
+
+### CI unit test only (no Hololink tools)
+
+If you only need to run the CI unit test, you can build without
+`holoscan-sensor-bridge`:
+
+```bash
+# 1. Build libcudaq-realtime
+git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
+cd cudaq-realtime-src
+git checkout 9ce3d2e886
+cd realtime && mkdir -p build && cd build
+cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime ..
+ninja && ninja install
+cd ../../..
+
+# 2. Build cudaqx with the nv-qldpc-decoder test
+cmake -S cudaqx -B cudaqx/build \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
+  -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
+  -DCUDAQX_ENABLE_LIBS="qec" \
+  -DCUDAQX_INCLUDE_TESTS=ON
+cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding
+```
+
+### Full build (CI test + Hololink bridge/playback tools)
+
+To also build the bridge and playback tools for emulated or FPGA testing:
+
+```bash
+# 1. Clone cuda-quantum (realtime)
+git clone --filter=blob:none --no-checkout \
+  https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
+cd cudaq-realtime-src
+git sparse-checkout init --cone
+git sparse-checkout set realtime
+git checkout 9ce3d2e886
+cd ..
+
+# 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2)
+# Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger).
+# If your system cmake is older: pip install cmake
+git clone --branch 2.6.0-EA2 \
+  https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git
+cd holoscan-sensor-bridge
+
+# Strip operators we don't need to avoid configure failures from missing deps
+sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d;
+    /add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d;
+    /add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d;
+    /add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d;
+    /add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d;
+    /add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d;
+    /add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \
+  src/hololink/operators/CMakeLists.txt
+
+mkdir -p build && cd build
+cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
+  -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \
+  -DHOLOLINK_BUILD_PYTHON=OFF \
+  -DHOLOLINK_BUILD_TESTS=OFF \
+  -DHOLOLINK_BUILD_TOOLS=OFF \
+  -DHOLOLINK_BUILD_EXAMPLES=OFF \
+  -DHOLOLINK_BUILD_EMULATOR=OFF ..
+cmake --build . --target gpu_roce_transceiver hololink_core
+cd ../..
+
+# 3. Build libcudaq-realtime with Hololink tools enabled
+# This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge tool)
+# as well as the FPGA emulator.
+cd cudaq-realtime-src/realtime && mkdir -p build && cd build
+cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \
+  -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \
+  -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \
+  -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \
+  ..
+ninja && ninja install
+cd ../../..
+
+# 4. Build cudaqx with Hololink tools enabled
+cmake -S cudaqx -B cudaqx/build \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
+  -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
+  -DCUDAQX_ENABLE_LIBS="qec" \
+  -DCUDAQX_INCLUDE_TESTS=ON \
+  -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \
+  -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \
+  -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build
+cmake --build cudaqx/build --target \
+  test_realtime_qldpc_graph_decoding \
+  hololink_qldpc_graph_decoder_bridge \
+  hololink_fpga_syndrome_playback
+```
+
+### Using the orchestration script
+
+The orchestration script can build everything automatically:
+
+```bash
+./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
+  --build \
+  --hsb-dir /path/to/holoscan-sensor-bridge \
+  --cuda-quantum-dir /path/to/cuda-quantum \
+  --no-run
+```
+
+---
+
+## CI Unit Test
+
+The CI unit test (`test_realtime_qldpc_graph_decoding`) exercises the full
+host dispatch decode path without any network hardware. It:
+
+1. Loads the Relay BP config and syndrome data from YAML/text files
+2. Creates the decoder via the `decoder::get("nv-qldpc-decoder", ...)` plugin API
+3. Captures a CUDA graph of the decode pipeline
+4. Wires `libcudaq-realtime`'s host dispatcher (HOST_LOOP) to a ring buffer
+5. Writes RPC requests into the ring buffer, the host dispatcher launches the
+   CUDA graph, and the test verifies corrections
+
+### Running
+
+```bash
+cd cudaqx/build
+
+# The nv-qldpc-decoder plugin must be discoverable at runtime.
+# Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path:
+export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so
+
+./libs/qec/unittests/test_realtime_qldpc_graph_decoding
+```
+
+Expected output:
+
+```
+[==========] Running 1 test from 1 test suite.
+[----------] 1 test from RealtimeQLDPCGraphDecodingTest
+[ RUN      ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots
+...
+[       OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms)
+[==========] 1 test from 1 test suite ran.
+[  PASSED  ] 1 test.
+```
+
+---
+
+## Emulated End-to-End Test
+
+The emulated test replaces the physical FPGA with a software emulator. Three
+processes run concurrently:
+
+1. **Emulator** -- receives syndromes via the UDP control plane, sends them
+   to the bridge via RDMA, and captures corrections
+2. **Bridge** -- runs the host dispatcher and CUDA graph decode loop on the GPU,
+   receiving syndromes and sending corrections via RDMA
+3. **Playback** -- loads syndrome data into the emulator's BRAM and triggers
+   playback, then verifies corrections
+
+### Requirements
+
+- ConnectX NIC with a loopback cable connecting both ports (the emulator
+  sends RDMA traffic out one port and the bridge receives on the other)
+- `libibverbs` / RDMA-capable network stack
+- All three tools built (bridge, playback, emulator)
+
+### Running
+
+```bash
+./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
+  --emulate \
+  --build \
+  --setup-network \
+  --hsb-dir /path/to/holoscan-sensor-bridge
+```
+
+The `--setup-network` flag configures the ConnectX interface with the
+appropriate IP addresses and MTU. It only needs to be run once per boot.
+
+After the initial build and network setup, subsequent runs are faster:
+
+```bash
+./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate
+```
+
+---
+
+## FPGA End-to-End Test
+
+The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two
+processes run:
+
+1. **Bridge** -- same as emulated mode
+2. **Playback** -- loads syndromes into the FPGA's BRAM and triggers playback,
+   then reads back corrections from the FPGA's capture RAM to verify them
+
+### Requirements
+
+- FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via
+  direct cable or switch. Bitfiles for supported FPGA vendors are available
+  [here](https://edge.urm.nvidia.com/artifactory/sw-holoscan-thirdparty-generic-local/QEC/HSB-2.6.0-EA/).
+  See the [cuda-quantum realtime user guide](https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/user_guide.md)
+  for FPGA setup instructions.
+- FPGA IP and bridge IP on the same subnet
+- ConnectX device name (e.g., `mlx5_4`, `mlx5_5`)
+
+### Running
+
+```bash
+./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
+  --build \
+  --setup-network \
+  --device mlx5_5 \
+  --bridge-ip 192.168.0.1 \
+  --fpga-ip 192.168.0.2 \
+  --gpu 2 \
+  --page-size 512 \
+  --hsb-dir /path/to/holoscan-sensor-bridge
+```
+
+Key parameters for FPGA mode:
+
+| Parameter | Description |
+|---|---|
+| `--device` | ConnectX IB device name (e.g., `mlx5_5`) |
+| `--bridge-ip` | IP address assigned to the ConnectX interface |
+| `--fpga-ip` | FPGA's IP address |
+| `--gpu` | GPU device ID (choose NUMA-local GPU for lowest latency) |
+| `--page-size` | Ring buffer slot size in bytes (use `512` on GB200 for alignment) |
+| `--spacing` | Inter-shot spacing in microseconds |
+
+> **Note:** The `--spacing` value should be set to at least the per-shot decode
+> time to avoid overrunning the input ring buffer. If syndromes arrive faster
+> than the decoder can process them, the buffer fills up and messages are lost.
+> Use a `--spacing` value at or above the observed decode time for sustained
+> operation.
+
+### GPU Selection
+
+For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC.
+For example, on a GB200 system where `mlx5_5` is on NUMA node 1,
+use `--gpu 2` or `--gpu 3`. Check NUMA locality with:
+
+```bash
+cat /sys/class/infiniband/<device>/device/numa_node
+```
+
+### Network Sanity Check
+
+Before running, verify that the bridge IP is assigned to exactly one interface:
+
+```bash
+ip addr show | grep 192.168.0.1
+```
+
+If multiple interfaces show the same IP, remove the duplicate to avoid
+routing ambiguity that silently drops RDMA packets.
+
+---
+
+## Orchestration Script Reference
+
+```
+hololink_qldpc_graph_decoder_test.sh [options]
+```
+
+### Modes
+
+| Flag | Description |
+|---|---|
+| `--emulate` | Use FPGA emulator (no real FPGA needed) |
+| *(default)* | FPGA mode (requires real FPGA) |
+
+### Actions
+
+| Flag | Description |
+|---|---|
+| `--build` | Build all required tools before running |
+| `--setup-network` | Configure ConnectX network interfaces |
+| `--no-run` | Skip running the test (useful with `--build`) |
+
+### Build Options
+
+| Flag | Default | Description |
+|---|---|---|
+| `--hsb-dir DIR` | `/workspaces/holoscan-sensor-bridge` | holoscan-sensor-bridge source directory |
+| `--cuda-quantum-dir DIR` | `/workspaces/cuda-quantum` | cuda-quantum source directory |
+| `--cuda-qx-dir DIR` | `/workspaces/cudaqx` | cudaqx source directory |
+| `--jobs N` | `nproc` | Parallel build jobs |
+
+### Network Options
+
+| Flag | Default | Description |
+|---|---|---|
+| `--device DEV` | auto-detect | ConnectX IB device name |
+| `--bridge-ip ADDR` | `10.0.0.1` | Bridge tool IP address |
+| `--emulator-ip ADDR` | `10.0.0.2` | Emulator IP (emulate mode) |
+| `--fpga-ip ADDR` | `192.168.0.2` | FPGA IP address |
+| `--mtu N` | `4096` | MTU size |
+
+### Run Options
+
+| Flag | Default | Description |
+|---|---|---|
+| `--gpu N` | `0` | GPU device ID |
+| `--timeout N` | `60` | Timeout in seconds |
+| `--num-shots N` | all available | Limit number of syndrome shots |
+| `--page-size N` | `384` | Ring buffer slot size in bytes |
+| `--num-pages N` | `128` | Number of ring buffer slots |
+| `--spacing N` | `10` | Inter-shot spacing in microseconds |
+| `--no-verify` | *(verify)* | Skip correction verification |
+| `--control-port N` | `8193` | UDP control port for emulator |

From 8635c45dd26d1101a6b4819c2e52b57b0612d6c2 Mon Sep 17 00:00:00 2001
From: Chuck Ketcham
Date: Fri, 3 Apr 2026 19:52:32 +0000
Subject: [PATCH 2/2] Fix emulated mode prerequisites in Relay BP guide

Emulated mode requires GPUDirect RDMA (same as FPGA mode). Note loopback
cable requirement in hardware table. Replace libibverbs mention with
reference to cuda-quantum realtime build guide for complete software
dependency listing.

Signed-off-by: Chuck Ketcham
---
 docs/realtime_qldpc_relay_bp_guide.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/realtime_qldpc_relay_bp_guide.md b/docs/realtime_qldpc_relay_bp_guide.md
index e095b5a1..8b906319 100644
--- a/docs/realtime_qldpc_relay_bp_guide.md
+++ b/docs/realtime_qldpc_relay_bp_guide.md
@@ -30,7 +30,7 @@ configurations:
 | Configuration | GPU | ConnectX NIC | FPGA |
 |---|---|---|---|
 | CI unit test | Any CUDA-capable GPU | Not required | Not required |
-| Emulated E2E | Any CUDA-capable GPU | Required (loopback cable) | Not required |
+| Emulated E2E | CUDA GPU with GPUDirect RDMA | Required (loopback cable) | Not required |
 | FPGA E2E | CUDA GPU with GPUDirect RDMA | Required | Required |
 
 Tested platforms: DGX Spark, GB200.
@@ -255,7 +255,8 @@ processes run concurrently:
 
 - ConnectX NIC with a loopback cable connecting both ports (the emulator
   sends RDMA traffic out one port and the bridge receives on the other)
-- `libibverbs` / RDMA-capable network stack
+- Software dependencies (DOCA, Holoscan SDK, etc.) as described in the
+  [cuda-quantum realtime build guide](https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/building.md)
 - All three tools built (bridge, playback, emulator)
 
 ### Running