From 4361fbfc7622574fb856e583e44b76b987eabc24 Mon Sep 17 00:00:00 2001 From: Chuck Ketcham Date: Fri, 3 Apr 2026 19:38:57 +0000 Subject: [PATCH 1/5] Add nv-qldpc-decoder Relay BP decoding guide Signed-off-by: Chuck Ketcham --- docs/realtime_qldpc_relay_bp_guide.md | 406 ++++++++++++++++++++++++++ 1 file changed, 406 insertions(+) create mode 100644 docs/realtime_qldpc_relay_bp_guide.md diff --git a/docs/realtime_qldpc_relay_bp_guide.md b/docs/realtime_qldpc_relay_bp_guide.md new file mode 100644 index 00000000..e095b5a1 --- /dev/null +++ b/docs/realtime_qldpc_relay_bp_guide.md @@ -0,0 +1,406 @@ +# Realtime nv-qldpc-decoder Relay BP Decoding Guide + +This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP decoder +using CUDA-Q's realtime host dispatch system. The decoder runs as a +CPU-launched CUDA graph (`HOST_LOOP` dispatch path) and can operate in three +configurations: + +- **CI unit test** -- standalone executable, no FPGA or network hardware needed +- **Emulated end-to-end test** -- software FPGA emulator replaces real hardware +- **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE + +--- + +## Table of Contents + +1. [Prerequisites](#prerequisites) +2. [Repository Layout](#repository-layout) +3. [Building](#building) +4. [CI Unit Test](#ci-unit-test) +5. [Emulated End-to-End Test](#emulated-end-to-end-test) +6. [FPGA End-to-End Test](#fpga-end-to-end-test) +7. [Orchestration Script Reference](#orchestration-script-reference) + +--- + +## Prerequisites + +### Hardware + +| Configuration | GPU | ConnectX NIC | FPGA | +|---|---|---|---| +| CI unit test | Any CUDA-capable GPU | Not required | Not required | +| Emulated E2E | Any CUDA-capable GPU | Required (loopback cable) | Not required | +| FPGA E2E | CUDA GPU with GPUDirect RDMA | Required | Required | + +Tested platforms: DGX Spark, GB200. 
+ +### Software + +- **CUDA Toolkit**: 12.6 or 13.0 +- **CUDA-Q SDK**: pre-installed (provides `libcudaq`, `libnvqir`, `nvq++`) +- **`nv-qldpc-decoder` plugin**: the proprietary nv-qldpc-decoder shared library + (`libcudaq-qec-nv-qldpc-decoder.so`). Required at runtime for all + three configurations. + +### Source Repositories + +| Repository | URL | Version | +|---|---|---| +| **cudaqx** | https://github.com/NVIDIA/cudaqx | `main` branch (or your feature branch) | +| **cuda-quantum** (realtime) | https://github.com/NVIDIA/cuda-quantum | Commit `9ce3d2e886` | +| **holoscan-sensor-bridge** | https://github.com/nvidia-holoscan/holoscan-sensor-bridge | Tag `2.6.0-EA2` | + +`cuda-quantum` provides `libcudaq-realtime` (the host dispatcher, ring buffer +management, and dispatch kernel). `holoscan-sensor-bridge` provides the +Hololink `GpuRoceTransceiver` library for RDMA transport. + +> **Note:** `holoscan-sensor-bridge` is only needed for the emulated and FPGA +> end-to-end tests. The CI unit test requires only `libcudaq-realtime`. + +--- + +## Repository Layout + +Key files within `cudaqx`: + +``` +libs/qec/ + unittests/ + realtime/ + qec_graph_decode_test/ + test_realtime_qldpc_graph_decoding.cpp # CI unit test + qec_roce_decode_test/ + data/ + config_nv_qldpc_relay.yml # Relay BP decoder config + syndromes_nv_qldpc_relay.txt # 100 test syndrome shots + utils/ + hololink_qldpc_graph_decoder_bridge.cpp # Bridge tool (RDMA ↔ decoder) + hololink_qldpc_graph_decoder_test.sh # Orchestration script + hololink_fpga_syndrome_playback.cpp # Playback tool (loads syndromes) +``` + +The FPGA emulator is in the `cuda-quantum` repository: + +``` +cuda-quantum/realtime/ + unittests/utils/ + hololink_fpga_emulator.cpp # Software FPGA emulator +``` + +--- + +## Building + +### CI unit test only (no Hololink tools) + +If you only need to run the CI unit test, you can build without +`holoscan-sensor-bridge`: + +```bash +# 1. 
Build libcudaq-realtime +git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src +cd cudaq-realtime-src +git checkout 9ce3d2e886 +cd realtime && mkdir -p build && cd build +cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime .. +ninja && ninja install +cd ../../.. + +# 2. Build cudaqx with the nv-qldpc-decoder test +cmake -S cudaqx -B cudaqx/build \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ + -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ + -DCUDAQX_ENABLE_LIBS="qec" \ + -DCUDAQX_INCLUDE_TESTS=ON +cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding +``` + +### Full build (CI test + Hololink bridge/playback tools) + +To also build the bridge and playback tools for emulated or FPGA testing: + +```bash +# 1. Clone cuda-quantum (realtime) +git clone --filter=blob:none --no-checkout \ + https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src +cd cudaq-realtime-src +git sparse-checkout init --cone +git sparse-checkout set realtime +git checkout 9ce3d2e886 +cd .. + +# 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2) +# Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger). 
+# If your system cmake is older: pip install cmake +git clone --branch 2.6.0-EA2 \ + https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git +cd holoscan-sensor-bridge + +# Strip operators we don't need to avoid configure failures from missing deps +sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d; + /add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d; + /add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d; + /add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d; + /add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d; + /add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d; + /add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \ + src/hololink/operators/CMakeLists.txt + +mkdir -p build && cd build +cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \ + -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \ + -DHOLOLINK_BUILD_PYTHON=OFF \ + -DHOLOLINK_BUILD_TESTS=OFF \ + -DHOLOLINK_BUILD_TOOLS=OFF \ + -DHOLOLINK_BUILD_EXAMPLES=OFF \ + -DHOLOLINK_BUILD_EMULATOR=OFF .. +cmake --build . --target gpu_roce_transceiver hololink_core +cd ../.. + +# 3. Build libcudaq-realtime with Hololink tools enabled +# This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge tool) +# as well as the FPGA emulator. +cd cudaq-realtime-src/realtime && mkdir -p build && cd build +cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \ + -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \ + -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \ + -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \ + .. +ninja && ninja install +cd ../../.. + +# 4. 
Build cudaqx with Hololink tools enabled +cmake -S cudaqx -B cudaqx/build \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ + -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ + -DCUDAQX_ENABLE_LIBS="qec" \ + -DCUDAQX_INCLUDE_TESTS=ON \ + -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \ + -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \ + -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build +cmake --build cudaqx/build --target \ + test_realtime_qldpc_graph_decoding \ + hololink_qldpc_graph_decoder_bridge \ + hololink_fpga_syndrome_playback +``` + +### Using the orchestration script + +The orchestration script can build everything automatically: + +```bash +./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --build \ + --hsb-dir /path/to/holoscan-sensor-bridge \ + --cuda-quantum-dir /path/to/cuda-quantum \ + --no-run +``` + +--- + +## CI Unit Test + +The CI unit test (`test_realtime_qldpc_graph_decoding`) exercises the full +host dispatch decode path without any network hardware. It: + +1. Loads the Relay BP config and syndrome data from YAML/text files +2. Creates the decoder via the `decoder::get("nv-qldpc-decoder", ...)` plugin API +3. Captures a CUDA graph of the decode pipeline +4. Wires `libcudaq-realtime`'s host dispatcher (HOST_LOOP) to a ring buffer +5. Writes RPC requests into the ring buffer, the host dispatcher launches the + CUDA graph, and the test verifies corrections + +### Running + +```bash +cd cudaqx/build + +# The nv-qldpc-decoder plugin must be discoverable at runtime. +# Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path: +export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so + +./libs/qec/unittests/test_realtime_qldpc_graph_decoding +``` + +Expected output: + +``` +[==========] Running 1 test from 1 test suite. 
+[----------] 1 test from RealtimeQLDPCGraphDecodingTest +[ RUN ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots +... +[ OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms) +[==========] 1 test from 1 test suite ran. +[ PASSED ] 1 test. +``` + +--- + +## Emulated End-to-End Test + +The emulated test replaces the physical FPGA with a software emulator. Three +processes run concurrently: + +1. **Emulator** -- receives syndromes via the UDP control plane, sends them + to the bridge via RDMA, and captures corrections +2. **Bridge** -- runs the host dispatcher and CUDA graph decode loop on the GPU, + receiving syndromes and sending corrections via RDMA +3. **Playback** -- loads syndrome data into the emulator's BRAM and triggers + playback, then verifies corrections + +### Requirements + +- ConnectX NIC with a loopback cable connecting both ports (the emulator + sends RDMA traffic out one port and the bridge receives on the other) +- `libibverbs` / RDMA-capable network stack +- All three tools built (bridge, playback, emulator) + +### Running + +```bash +./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --emulate \ + --build \ + --setup-network \ + --hsb-dir /path/to/holoscan-sensor-bridge +``` + +The `--setup-network` flag configures the ConnectX interface with the +appropriate IP addresses and MTU. It only needs to be run once per boot. + +After the initial build and network setup, subsequent runs are faster: + +```bash +./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate +``` + +--- + +## FPGA End-to-End Test + +The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two +processes run: + +1. **Bridge** -- same as emulated mode +2. 
**Playback** -- loads syndromes into the FPGA's BRAM and triggers playback, + then reads back corrections from the FPGA's capture RAM to verify them + +### Requirements + +- FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via + direct cable or switch. Bitfiles for supported FPGA vendors are available + [here](https://edge.urm.nvidia.com/artifactory/sw-holoscan-thirdparty-generic-local/QEC/HSB-2.6.0-EA/). + See the [cuda-quantum realtime user guide](https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/user_guide.md) + for FPGA setup instructions. +- FPGA IP and bridge IP on the same subnet +- ConnectX device name (e.g., `mlx5_4`, `mlx5_5`) + +### Running + +```bash +./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --build \ + --setup-network \ + --device mlx5_5 \ + --bridge-ip 192.168.0.1 \ + --fpga-ip 192.168.0.2 \ + --gpu 2 \ + --page-size 512 \ + --hsb-dir /path/to/holoscan-sensor-bridge +``` + +Key parameters for FPGA mode: + +| Parameter | Description | +|---|---| +| `--device` | ConnectX IB device name (e.g., `mlx5_5`) | +| `--bridge-ip` | IP address assigned to the ConnectX interface | +| `--fpga-ip` | FPGA's IP address | +| `--gpu` | GPU device ID (choose NUMA-local GPU for lowest latency) | +| `--page-size` | Ring buffer slot size in bytes (use `512` on GB200 for alignment) | +| `--spacing` | Inter-shot spacing in microseconds | + +> **Note:** The `--spacing` value should be set to at least the per-shot decode +> time to avoid overrunning the input ring buffer. If syndromes arrive faster +> than the decoder can process them, the buffer fills up and messages are lost. +> Use a `--spacing` value at or above the observed decode time for sustained +> operation. + +### GPU Selection + +For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC. +For example, on a GB200 system where `mlx5_5` is on NUMA node 1, +use `--gpu 2` or `--gpu 3`. 
Check NUMA locality with: + +```bash +cat /sys/class/infiniband/<device>/device/numa_node +``` + +### Network Sanity Check + +Before running, verify that the bridge IP is assigned to exactly one interface: + +```bash +ip addr show | grep 192.168.0.1 +``` + +If multiple interfaces show the same IP, remove the duplicate to avoid +routing ambiguity that silently drops RDMA packets. + +--- + +## Orchestration Script Reference + +``` +hololink_qldpc_graph_decoder_test.sh [options] +``` + +### Modes + +| Flag | Description | +|---|---| +| `--emulate` | Use FPGA emulator (no real FPGA needed) | +| *(default)* | FPGA mode (requires real FPGA) | + +### Actions + +| Flag | Description | +|---|---| +| `--build` | Build all required tools before running | +| `--setup-network` | Configure ConnectX network interfaces | +| `--no-run` | Skip running the test (useful with `--build`) | + +### Build Options + +| Flag | Default | Description | +|---|---|---| +| `--hsb-dir DIR` | `/workspaces/holoscan-sensor-bridge` | holoscan-sensor-bridge source directory | +| `--cuda-quantum-dir DIR` | `/workspaces/cuda-quantum` | cuda-quantum source directory | +| `--cuda-qx-dir DIR` | `/workspaces/cudaqx` | cudaqx source directory | +| `--jobs N` | `nproc` | Parallel build jobs | + +### Network Options + +| Flag | Default | Description | +|---|---|---| +| `--device DEV` | auto-detect | ConnectX IB device name | +| `--bridge-ip ADDR` | `10.0.0.1` | Bridge tool IP address | +| `--emulator-ip ADDR` | `10.0.0.2` | Emulator IP (emulate mode) | +| `--fpga-ip ADDR` | `192.168.0.2` | FPGA IP address | +| `--mtu N` | `4096` | MTU size | + +### Run Options + +| Flag | Default | Description | +|---|---|---| +| `--gpu N` | `0` | GPU device ID | +| `--timeout N` | `60` | Timeout in seconds | +| `--num-shots N` | all available | Limit number of syndrome shots | +| `--page-size N` | `384` | Ring buffer slot size in bytes | +| `--num-pages N` | `128` | Number of ring buffer slots | +| `--spacing N` | `10` | Inter-shot 
spacing in microseconds | +| `--no-verify` | *(verify)* | Skip correction verification | +| `--control-port N` | `8193` | UDP control port for emulator | From 8635c45dd26d1101a6b4819c2e52b57b0612d6c2 Mon Sep 17 00:00:00 2001 From: Chuck Ketcham Date: Fri, 3 Apr 2026 19:52:32 +0000 Subject: [PATCH 2/5] Fix emulated mode prerequisites in Relay BP guide Emulated mode requires GPUDirect RDMA (same as FPGA mode). Note loopback cable requirement in hardware table. Replace libibverbs mention with reference to cuda-quantum realtime build guide for complete software dependency listing. Signed-off-by: Chuck Ketcham --- docs/realtime_qldpc_relay_bp_guide.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/realtime_qldpc_relay_bp_guide.md b/docs/realtime_qldpc_relay_bp_guide.md index e095b5a1..8b906319 100644 --- a/docs/realtime_qldpc_relay_bp_guide.md +++ b/docs/realtime_qldpc_relay_bp_guide.md @@ -30,7 +30,7 @@ configurations: | Configuration | GPU | ConnectX NIC | FPGA | |---|---|---|---| | CI unit test | Any CUDA-capable GPU | Not required | Not required | -| Emulated E2E | Any CUDA-capable GPU | Required (loopback cable) | Not required | +| Emulated E2E | CUDA GPU with GPUDirect RDMA | Required (loopback cable) | Not required | | FPGA E2E | CUDA GPU with GPUDirect RDMA | Required | Required | Tested platforms: DGX Spark, GB200. @@ -255,7 +255,8 @@ processes run concurrently: - ConnectX NIC with a loopback cable connecting both ports (the emulator sends RDMA traffic out one port and the bridge receives on the other) -- `libibverbs` / RDMA-capable network stack +- Software dependencies (DOCA, Holoscan SDK, etc.) 
as described in the + [cuda-quantum realtime build guide](https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/building.md) - All three tools built (bridge, playback, emulator) ### Running From cf65f5d405d164c44447eefdad092501c070c6ff Mon Sep 17 00:00:00 2001 From: Chuck Ketcham Date: Mon, 6 Apr 2026 21:12:33 +0000 Subject: [PATCH 3/5] Move Relay BP decoding guide into Sphinx documentation Relocate docs/realtime_qldpc_relay_bp_guide.md into the Sphinx doc tree as docs/sphinx/examples_rst/qec/realtime_relay_bp.rst (RST format). Add toctree entry in the QEC examples page and a cross-reference link in the QEC introduction page under Real-Time Decoding. Rename the document title to "Relay BP Decoding with HSB RDMA". Signed-off-by: Chuck Ketcham --- docs/realtime_qldpc_relay_bp_guide.md | 407 -------------- docs/sphinx/components/qec/introduction.rst | 1 + docs/sphinx/examples_rst/qec/examples.rst | 3 +- .../examples_rst/qec/realtime_relay_bp.rst | 509 ++++++++++++++++++ 4 files changed, 512 insertions(+), 408 deletions(-) delete mode 100644 docs/realtime_qldpc_relay_bp_guide.md create mode 100644 docs/sphinx/examples_rst/qec/realtime_relay_bp.rst diff --git a/docs/realtime_qldpc_relay_bp_guide.md b/docs/realtime_qldpc_relay_bp_guide.md deleted file mode 100644 index 8b906319..00000000 --- a/docs/realtime_qldpc_relay_bp_guide.md +++ /dev/null @@ -1,407 +0,0 @@ -# Realtime nv-qldpc-decoder Relay BP Decoding Guide - -This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP decoder -using CUDA-Q's realtime host dispatch system. The decoder runs as a -CPU-launched CUDA graph (`HOST_LOOP` dispatch path) and can operate in three -configurations: - -- **CI unit test** -- standalone executable, no FPGA or network hardware needed -- **Emulated end-to-end test** -- software FPGA emulator replaces real hardware -- **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE - ---- - -## Table of Contents - -1. 
[Prerequisites](#prerequisites) -2. [Repository Layout](#repository-layout) -3. [Building](#building) -4. [CI Unit Test](#ci-unit-test) -5. [Emulated End-to-End Test](#emulated-end-to-end-test) -6. [FPGA End-to-End Test](#fpga-end-to-end-test) -7. [Orchestration Script Reference](#orchestration-script-reference) - ---- - -## Prerequisites - -### Hardware - -| Configuration | GPU | ConnectX NIC | FPGA | -|---|---|---|---| -| CI unit test | Any CUDA-capable GPU | Not required | Not required | -| Emulated E2E | CUDA GPU with GPUDirect RDMA | Required (loopback cable) | Not required | -| FPGA E2E | CUDA GPU with GPUDirect RDMA | Required | Required | - -Tested platforms: DGX Spark, GB200. - -### Software - -- **CUDA Toolkit**: 12.6 or 13.0 -- **CUDA-Q SDK**: pre-installed (provides `libcudaq`, `libnvqir`, `nvq++`) -- **`nv-qldpc-decoder` plugin**: the proprietary nv-qldpc-decoder shared library - (`libcudaq-qec-nv-qldpc-decoder.so`). Required at runtime for all - three configurations. - -### Source Repositories - -| Repository | URL | Version | -|---|---|---| -| **cudaqx** | https://github.com/NVIDIA/cudaqx | `main` branch (or your feature branch) | -| **cuda-quantum** (realtime) | https://github.com/NVIDIA/cuda-quantum | Commit `9ce3d2e886` | -| **holoscan-sensor-bridge** | https://github.com/nvidia-holoscan/holoscan-sensor-bridge | Tag `2.6.0-EA2` | - -`cuda-quantum` provides `libcudaq-realtime` (the host dispatcher, ring buffer -management, and dispatch kernel). `holoscan-sensor-bridge` provides the -Hololink `GpuRoceTransceiver` library for RDMA transport. - -> **Note:** `holoscan-sensor-bridge` is only needed for the emulated and FPGA -> end-to-end tests. The CI unit test requires only `libcudaq-realtime`. 
- ---- - -## Repository Layout - -Key files within `cudaqx`: - -``` -libs/qec/ - unittests/ - realtime/ - qec_graph_decode_test/ - test_realtime_qldpc_graph_decoding.cpp # CI unit test - qec_roce_decode_test/ - data/ - config_nv_qldpc_relay.yml # Relay BP decoder config - syndromes_nv_qldpc_relay.txt # 100 test syndrome shots - utils/ - hololink_qldpc_graph_decoder_bridge.cpp # Bridge tool (RDMA ↔ decoder) - hololink_qldpc_graph_decoder_test.sh # Orchestration script - hololink_fpga_syndrome_playback.cpp # Playback tool (loads syndromes) -``` - -The FPGA emulator is in the `cuda-quantum` repository: - -``` -cuda-quantum/realtime/ - unittests/utils/ - hololink_fpga_emulator.cpp # Software FPGA emulator -``` - ---- - -## Building - -### CI unit test only (no Hololink tools) - -If you only need to run the CI unit test, you can build without -`holoscan-sensor-bridge`: - -```bash -# 1. Build libcudaq-realtime -git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src -cd cudaq-realtime-src -git checkout 9ce3d2e886 -cd realtime && mkdir -p build && cd build -cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime .. -ninja && ninja install -cd ../../.. - -# 2. Build cudaqx with the nv-qldpc-decoder test -cmake -S cudaqx -B cudaqx/build \ - -DCMAKE_BUILD_TYPE=Release \ - -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ - -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ - -DCUDAQX_ENABLE_LIBS="qec" \ - -DCUDAQX_INCLUDE_TESTS=ON -cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding -``` - -### Full build (CI test + Hololink bridge/playback tools) - -To also build the bridge and playback tools for emulated or FPGA testing: - -```bash -# 1. Clone cuda-quantum (realtime) -git clone --filter=blob:none --no-checkout \ - https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src -cd cudaq-realtime-src -git sparse-checkout init --cone -git sparse-checkout set realtime -git checkout 9ce3d2e886 -cd .. - -# 2. 
Build holoscan-sensor-bridge (tag 2.6.0-EA2) -# Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger). -# If your system cmake is older: pip install cmake -git clone --branch 2.6.0-EA2 \ - https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git -cd holoscan-sensor-bridge - -# Strip operators we don't need to avoid configure failures from missing deps -sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d; - /add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d; - /add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d; - /add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d; - /add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d; - /add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d; - /add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \ - src/hololink/operators/CMakeLists.txt - -mkdir -p build && cd build -cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \ - -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \ - -DHOLOLINK_BUILD_PYTHON=OFF \ - -DHOLOLINK_BUILD_TESTS=OFF \ - -DHOLOLINK_BUILD_TOOLS=OFF \ - -DHOLOLINK_BUILD_EXAMPLES=OFF \ - -DHOLOLINK_BUILD_EMULATOR=OFF .. -cmake --build . --target gpu_roce_transceiver hololink_core -cd ../.. - -# 3. Build libcudaq-realtime with Hololink tools enabled -# This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge tool) -# as well as the FPGA emulator. -cd cudaq-realtime-src/realtime && mkdir -p build && cd build -cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \ - -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \ - -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \ - -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \ - .. -ninja && ninja install -cd ../../.. - -# 4. 
Build cudaqx with Hololink tools enabled -cmake -S cudaqx -B cudaqx/build \ - -DCMAKE_BUILD_TYPE=Release \ - -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ - -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ - -DCUDAQX_ENABLE_LIBS="qec" \ - -DCUDAQX_INCLUDE_TESTS=ON \ - -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \ - -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \ - -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build -cmake --build cudaqx/build --target \ - test_realtime_qldpc_graph_decoding \ - hololink_qldpc_graph_decoder_bridge \ - hololink_fpga_syndrome_playback -``` - -### Using the orchestration script - -The orchestration script can build everything automatically: - -```bash -./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ - --build \ - --hsb-dir /path/to/holoscan-sensor-bridge \ - --cuda-quantum-dir /path/to/cuda-quantum \ - --no-run -``` - ---- - -## CI Unit Test - -The CI unit test (`test_realtime_qldpc_graph_decoding`) exercises the full -host dispatch decode path without any network hardware. It: - -1. Loads the Relay BP config and syndrome data from YAML/text files -2. Creates the decoder via the `decoder::get("nv-qldpc-decoder", ...)` plugin API -3. Captures a CUDA graph of the decode pipeline -4. Wires `libcudaq-realtime`'s host dispatcher (HOST_LOOP) to a ring buffer -5. Writes RPC requests into the ring buffer, the host dispatcher launches the - CUDA graph, and the test verifies corrections - -### Running - -```bash -cd cudaqx/build - -# The nv-qldpc-decoder plugin must be discoverable at runtime. -# Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path: -export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so - -./libs/qec/unittests/test_realtime_qldpc_graph_decoding -``` - -Expected output: - -``` -[==========] Running 1 test from 1 test suite. 
-[----------] 1 test from RealtimeQLDPCGraphDecodingTest -[ RUN ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots -... -[ OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms) -[==========] 1 test from 1 test suite ran. -[ PASSED ] 1 test. -``` - ---- - -## Emulated End-to-End Test - -The emulated test replaces the physical FPGA with a software emulator. Three -processes run concurrently: - -1. **Emulator** -- receives syndromes via the UDP control plane, sends them - to the bridge via RDMA, and captures corrections -2. **Bridge** -- runs the host dispatcher and CUDA graph decode loop on the GPU, - receiving syndromes and sending corrections via RDMA -3. **Playback** -- loads syndrome data into the emulator's BRAM and triggers - playback, then verifies corrections - -### Requirements - -- ConnectX NIC with a loopback cable connecting both ports (the emulator - sends RDMA traffic out one port and the bridge receives on the other) -- Software dependencies (DOCA, Holoscan SDK, etc.) as described in the - [cuda-quantum realtime build guide](https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/building.md) -- All three tools built (bridge, playback, emulator) - -### Running - -```bash -./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ - --emulate \ - --build \ - --setup-network \ - --hsb-dir /path/to/holoscan-sensor-bridge -``` - -The `--setup-network` flag configures the ConnectX interface with the -appropriate IP addresses and MTU. It only needs to be run once per boot. - -After the initial build and network setup, subsequent runs are faster: - -```bash -./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate -``` - ---- - -## FPGA End-to-End Test - -The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two -processes run: - -1. **Bridge** -- same as emulated mode -2. 
**Playback** -- loads syndromes into the FPGA's BRAM and triggers playback, - then reads back corrections from the FPGA's capture RAM to verify them - -### Requirements - -- FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via - direct cable or switch. Bitfiles for supported FPGA vendors are available - [here](https://edge.urm.nvidia.com/artifactory/sw-holoscan-thirdparty-generic-local/QEC/HSB-2.6.0-EA/). - See the [cuda-quantum realtime user guide](https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/user_guide.md) - for FPGA setup instructions. -- FPGA IP and bridge IP on the same subnet -- ConnectX device name (e.g., `mlx5_4`, `mlx5_5`) - -### Running - -```bash -./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ - --build \ - --setup-network \ - --device mlx5_5 \ - --bridge-ip 192.168.0.1 \ - --fpga-ip 192.168.0.2 \ - --gpu 2 \ - --page-size 512 \ - --hsb-dir /path/to/holoscan-sensor-bridge -``` - -Key parameters for FPGA mode: - -| Parameter | Description | -|---|---| -| `--device` | ConnectX IB device name (e.g., `mlx5_5`) | -| `--bridge-ip` | IP address assigned to the ConnectX interface | -| `--fpga-ip` | FPGA's IP address | -| `--gpu` | GPU device ID (choose NUMA-local GPU for lowest latency) | -| `--page-size` | Ring buffer slot size in bytes (use `512` on GB200 for alignment) | -| `--spacing` | Inter-shot spacing in microseconds | - -> **Note:** The `--spacing` value should be set to at least the per-shot decode -> time to avoid overrunning the input ring buffer. If syndromes arrive faster -> than the decoder can process them, the buffer fills up and messages are lost. -> Use a `--spacing` value at or above the observed decode time for sustained -> operation. - -### GPU Selection - -For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC. -For example, on a GB200 system where `mlx5_5` is on NUMA node 1, -use `--gpu 2` or `--gpu 3`. 
Check NUMA locality with: - -```bash -cat /sys/class/infiniband/<device>/device/numa_node -``` - -### Network Sanity Check - -Before running, verify that the bridge IP is assigned to exactly one interface: - -```bash -ip addr show | grep 192.168.0.1 -``` - -If multiple interfaces show the same IP, remove the duplicate to avoid -routing ambiguity that silently drops RDMA packets. - ---- - -## Orchestration Script Reference - -``` -hololink_qldpc_graph_decoder_test.sh [options] -``` - -### Modes - -| Flag | Description | -|---|---| -| `--emulate` | Use FPGA emulator (no real FPGA needed) | -| *(default)* | FPGA mode (requires real FPGA) | - -### Actions - -| Flag | Description | -|---|---| -| `--build` | Build all required tools before running | -| `--setup-network` | Configure ConnectX network interfaces | -| `--no-run` | Skip running the test (useful with `--build`) | - -### Build Options - -| Flag | Default | Description | -|---|---|---| -| `--hsb-dir DIR` | `/workspaces/holoscan-sensor-bridge` | holoscan-sensor-bridge source directory | -| `--cuda-quantum-dir DIR` | `/workspaces/cuda-quantum` | cuda-quantum source directory | -| `--cuda-qx-dir DIR` | `/workspaces/cudaqx` | cudaqx source directory | -| `--jobs N` | `nproc` | Parallel build jobs | - -### Network Options - -| Flag | Default | Description | -|---|---|---| -| `--device DEV` | auto-detect | ConnectX IB device name | -| `--bridge-ip ADDR` | `10.0.0.1` | Bridge tool IP address | -| `--emulator-ip ADDR` | `10.0.0.2` | Emulator IP (emulate mode) | -| `--fpga-ip ADDR` | `192.168.0.2` | FPGA IP address | -| `--mtu N` | `4096` | MTU size | - -### Run Options - -| Flag | Default | Description | -|---|---|---| -| `--gpu N` | `0` | GPU device ID | -| `--timeout N` | `60` | Timeout in seconds | -| `--num-shots N` | all available | Limit number of syndrome shots | -| `--page-size N` | `384` | Ring buffer slot size in bytes | -| `--num-pages N` | `128` | Number of ring buffer slots | -| `--spacing N` | `10` | Inter-shot 
spacing in microseconds | -| `--no-verify` | *(verify)* | Skip correction verification | -| `--control-port N` | `8193` | UDP control port for emulator | diff --git a/docs/sphinx/components/qec/introduction.rst b/docs/sphinx/components/qec/introduction.rst index 3297ed76..61a4e41e 100644 --- a/docs/sphinx/components/qec/introduction.rst +++ b/docs/sphinx/components/qec/introduction.rst @@ -861,6 +861,7 @@ Additional quantum gates can be applied, and only when `get_corrections` is call For detailed information on real-time decoding, see: * :doc:`/examples_rst/qec/realtime_decoding` - Complete Guide with Examples +* :doc:`/examples_rst/qec/realtime_relay_bp` - Relay BP Decoding with HSB RDMA * :doc:`/api/qec/cpp_api` - C++ API Reference (see Real-Time Decoding section) * :doc:`/api/qec/python_api` - Python API Reference (see Real-Time Decoding section) diff --git a/docs/sphinx/examples_rst/qec/examples.rst b/docs/sphinx/examples_rst/qec/examples.rst index 79247213..a20b387b 100644 --- a/docs/sphinx/examples_rst/qec/examples.rst +++ b/docs/sphinx/examples_rst/qec/examples.rst @@ -10,4 +10,5 @@ Examples that illustrate how to use CUDA-QX for application development are avai Code-Capacity QEC Circuit-Level QEC Decoders - Real-Time Decoding \ No newline at end of file + Real-Time Decoding + Relay BP Decoding with HSB RDMA \ No newline at end of file diff --git a/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst b/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst new file mode 100644 index 00000000..c15ea3a8 --- /dev/null +++ b/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst @@ -0,0 +1,509 @@ +Relay BP Decoding with HSB RDMA +================================ + +This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP +decoder using CUDA-Q's realtime host dispatch system. 
The decoder runs as a
+CPU-launched CUDA graph (``HOST_LOOP`` dispatch path) and can operate in three
+configurations:
+
+- **CI unit test** -- standalone executable, no FPGA or network hardware needed
+- **Emulated end-to-end test** -- software FPGA emulator replaces real hardware
+- **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE
+
+Prerequisites
+-------------
+
+Hardware
+^^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 25 25 20
+
+   * - Configuration
+     - GPU
+     - ConnectX NIC
+     - FPGA
+   * - CI unit test
+     - Any CUDA-capable GPU
+     - Not required
+     - Not required
+   * - Emulated E2E
+     - CUDA GPU with GPUDirect RDMA
+     - Required (loopback cable)
+     - Not required
+   * - FPGA E2E
+     - CUDA GPU with GPUDirect RDMA
+     - Required
+     - Required
+
+Tested platforms: DGX Spark, GB200.
+
+Software
+^^^^^^^^
+
+- **CUDA Toolkit**: 12.6 or 13.0
+- **CUDA-Q SDK**: pre-installed (provides ``libcudaq``, ``libnvqir``, ``nvq++``)
+- **nv-qldpc-decoder plugin**: the proprietary nv-qldpc-decoder shared library
+  (``libcudaq-qec-nv-qldpc-decoder.so``). Required at runtime for all
+  three configurations.
+
+Source Repositories
+^^^^^^^^^^^^^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 40 30
+
+   * - Repository
+     - URL
+     - Version
+   * - **cudaqx**
+     - https://github.com/NVIDIA/cudaqx
+     - ``main`` branch (or your feature branch)
+   * - **cuda-quantum** (realtime)
+     - https://github.com/NVIDIA/cuda-quantum
+     - Commit ``9ce3d2e886``
+   * - **holoscan-sensor-bridge**
+     - https://github.com/nvidia-holoscan/holoscan-sensor-bridge
+     - Tag ``2.6.0-EA2``
+
+``cuda-quantum`` provides ``libcudaq-realtime`` (the host dispatcher, ring
+buffer management, and dispatch kernel). ``holoscan-sensor-bridge`` provides
+the Hololink ``GpuRoceTransceiver`` library for RDMA transport.
+
+.. note::
+
+   ``holoscan-sensor-bridge`` is only needed for the emulated and FPGA
+   end-to-end tests. The CI unit test requires only ``libcudaq-realtime``.
+
+Repository Layout
+-----------------
+
+Key files within ``cudaqx``:
+
+.. code-block:: text
+
+   libs/qec/
+     unittests/
+       realtime/
+         qec_graph_decode_test/
+           test_realtime_qldpc_graph_decoding.cpp   # CI unit test
+         qec_roce_decode_test/
+           data/
+             config_nv_qldpc_relay.yml              # Relay BP decoder config
+             syndromes_nv_qldpc_relay.txt           # 100 test syndrome shots
+       utils/
+         hololink_qldpc_graph_decoder_bridge.cpp    # Bridge tool (RDMA <-> decoder)
+         hololink_qldpc_graph_decoder_test.sh       # Orchestration script
+         hololink_fpga_syndrome_playback.cpp        # Playback tool (loads syndromes)
+
+The FPGA emulator is in the ``cuda-quantum`` repository:
+
+.. code-block:: text
+
+   cuda-quantum/realtime/
+     unittests/utils/
+       hololink_fpga_emulator.cpp   # Software FPGA emulator
+
+Building
+--------
+
+CI unit test only (no Hololink tools)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If you only need to run the CI unit test, you can build without
+``holoscan-sensor-bridge``:
+
+.. code-block:: bash
+
+   # 1. Build libcudaq-realtime
+   git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
+   cd cudaq-realtime-src
+   git checkout 9ce3d2e886
+   cd realtime && mkdir -p build && cd build
+   cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime ..
+   ninja && ninja install
+   cd ../../..
+
+   # 2. Build cudaqx with the nv-qldpc-decoder test
+   cmake -S cudaqx -B cudaqx/build \
+     -DCMAKE_BUILD_TYPE=Release \
+     -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
+     -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
+     -DCUDAQX_ENABLE_LIBS="qec" \
+     -DCUDAQX_INCLUDE_TESTS=ON
+   cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding
+
+Full build (CI test + Hololink bridge/playback tools)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To also build the bridge and playback tools for emulated or FPGA testing:
+
+.. code-block:: bash
+
+   # 1.
Clone cuda-quantum (realtime)
+   git clone --filter=blob:none --no-checkout \
+     https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
+   cd cudaq-realtime-src
+   git sparse-checkout init --cone
+   git sparse-checkout set realtime
+   git checkout 9ce3d2e886
+   cd ..
+
+   # 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2)
+   # Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger).
+   # If your system cmake is older: pip install cmake
+   git clone --branch 2.6.0-EA2 \
+     https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git
+   cd holoscan-sensor-bridge
+
+   # Strip operators we don't need to avoid configure failures from missing deps
+   sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d;
+     /add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d;
+     /add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d;
+     /add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d;
+     /add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d;
+     /add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d;
+     /add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \
+     src/hololink/operators/CMakeLists.txt
+
+   mkdir -p build && cd build
+   cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
+     -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \
+     -DHOLOLINK_BUILD_PYTHON=OFF \
+     -DHOLOLINK_BUILD_TESTS=OFF \
+     -DHOLOLINK_BUILD_TOOLS=OFF \
+     -DHOLOLINK_BUILD_EXAMPLES=OFF \
+     -DHOLOLINK_BUILD_EMULATOR=OFF ..
+   cmake --build . --target gpu_roce_transceiver hololink_core
+   cd ../..
+
+   # 3. Build libcudaq-realtime with Hololink tools enabled
+   # This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge
+   # tool) as well as the FPGA emulator.
+   cd cudaq-realtime-src/realtime && mkdir -p build && cd build
+   cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \
+     -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \
+     -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \
+     -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \
+     ..
+   ninja && ninja install
+   cd ../../..
+
+   # 4. Build cudaqx with Hololink tools enabled
+   cmake -S cudaqx -B cudaqx/build \
+     -DCMAKE_BUILD_TYPE=Release \
+     -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
+     -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
+     -DCUDAQX_ENABLE_LIBS="qec" \
+     -DCUDAQX_INCLUDE_TESTS=ON \
+     -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \
+     -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \
+     -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build
+   cmake --build cudaqx/build --target \
+     test_realtime_qldpc_graph_decoding \
+     hololink_qldpc_graph_decoder_bridge \
+     hololink_fpga_syndrome_playback
+
+Using the orchestration script
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The orchestration script can build everything automatically:
+
+.. code-block:: bash
+
+   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
+     --build \
+     --hsb-dir /path/to/holoscan-sensor-bridge \
+     --cuda-quantum-dir /path/to/cuda-quantum \
+     --no-run
+
+CI Unit Test
+------------
+
+The CI unit test (``test_realtime_qldpc_graph_decoding``) exercises the full
+host dispatch decode path without any network hardware. It:
+
+1. Loads the Relay BP config and syndrome data from YAML/text files
+2. Creates the decoder via the ``decoder::get("nv-qldpc-decoder", ...)`` plugin API
+3. Captures a CUDA graph of the decode pipeline
+4. Wires ``libcudaq-realtime``'s host dispatcher (HOST_LOOP) to a ring buffer
+5. Writes RPC requests into the ring buffer, the host dispatcher launches the
+   CUDA graph, and the test verifies corrections
+
+Running
+^^^^^^^
+
+..
code-block:: bash
+
+   cd cudaqx/build
+
+   # The nv-qldpc-decoder plugin must be discoverable at runtime.
+   # Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path:
+   export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so
+
+   ./libs/qec/unittests/test_realtime_qldpc_graph_decoding
+
+Expected output:
+
+.. code-block:: text
+
+   [==========] Running 1 test from 1 test suite.
+   [----------] 1 test from RealtimeQLDPCGraphDecodingTest
+   [ RUN      ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots
+   ...
+   [       OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms)
+   [==========] 1 test from 1 test suite ran.
+   [  PASSED  ] 1 test.
+
+Emulated End-to-End Test
+------------------------
+
+The emulated test replaces the physical FPGA with a software emulator. Three
+processes run concurrently:
+
+1. **Emulator** -- receives syndromes via the UDP control plane, sends them
+   to the bridge via RDMA, and captures corrections
+2. **Bridge** -- runs the host dispatcher and CUDA graph decode loop on the GPU,
+   receiving syndromes and sending corrections via RDMA
+3. **Playback** -- loads syndrome data into the emulator's BRAM and triggers
+   playback, then verifies corrections
+
+Requirements
+^^^^^^^^^^^^
+
+- ConnectX NIC with a loopback cable connecting both ports (the emulator
+  sends RDMA traffic out one port and the bridge receives on the other)
+- Software dependencies (DOCA, Holoscan SDK, etc.) as described in the
+  `cuda-quantum realtime build guide `__
+- All three tools built (bridge, playback, emulator)
+
+Running
+^^^^^^^
+
+.. code-block:: bash
+
+   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
+     --emulate \
+     --build \
+     --setup-network \
+     --hsb-dir /path/to/holoscan-sensor-bridge
+
+The ``--setup-network`` flag configures the ConnectX interface with the
+appropriate IP addresses and MTU. It only needs to be run once per boot.
+
+After the initial build and network setup, subsequent runs are faster:
+
+.. code-block:: bash
+
+   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate
+
+FPGA End-to-End Test
+--------------------
+
+The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two
+processes run:
+
+1. **Bridge** -- same as emulated mode
+2. **Playback** -- loads syndromes into the FPGA's BRAM and triggers playback,
+   then reads back corrections from the FPGA's capture RAM to verify them
+
+Requirements
+^^^^^^^^^^^^
+
+- FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via
+  direct cable or switch. Bitfiles for supported FPGA vendors are available
+  `here `__.
+  See the `cuda-quantum realtime user guide `__
+  for FPGA setup instructions.
+- FPGA IP and bridge IP on the same subnet
+- ConnectX device name (e.g., ``mlx5_4``, ``mlx5_5``)
+
+Running
+^^^^^^^
+
+.. code-block:: bash
+
+   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
+     --build \
+     --setup-network \
+     --device mlx5_5 \
+     --bridge-ip 192.168.0.1 \
+     --fpga-ip 192.168.0.2 \
+     --gpu 2 \
+     --page-size 512 \
+     --hsb-dir /path/to/holoscan-sensor-bridge
+
+Key parameters for FPGA mode:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Parameter
+     - Description
+   * - ``--device``
+     - ConnectX IB device name (e.g., ``mlx5_5``)
+   * - ``--bridge-ip``
+     - IP address assigned to the ConnectX interface
+   * - ``--fpga-ip``
+     - FPGA's IP address
+   * - ``--gpu``
+     - GPU device ID (choose NUMA-local GPU for lowest latency)
+   * - ``--page-size``
+     - Ring buffer slot size in bytes (use ``512`` on GB200 for alignment)
+   * - ``--spacing``
+     - Inter-shot spacing in microseconds
+
+.. note::
+
+   The ``--spacing`` value should be set to at least the per-shot decode
+   time to avoid overrunning the input ring buffer. If syndromes arrive faster
+   than the decoder can process them, the buffer fills up and messages are lost.
+   Use a ``--spacing`` value at or above the observed decode time for sustained
+   operation.
+
+GPU Selection
+^^^^^^^^^^^^^
+
+For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC.
+For example, on a GB200 system where ``mlx5_5`` is on NUMA node 1,
+use ``--gpu 2`` or ``--gpu 3``. Check NUMA locality with:
+
+.. code-block:: bash
+
+   cat /sys/class/infiniband//device/numa_node
+
+Network Sanity Check
+^^^^^^^^^^^^^^^^^^^^
+
+Before running, verify that the bridge IP is assigned to exactly one interface:
+
+.. code-block:: bash
+
+   ip addr show | grep 192.168.0.1
+
+If multiple interfaces show the same IP, remove the duplicate to avoid
+routing ambiguity that silently drops RDMA packets.
+
+Orchestration Script Reference
+------------------------------
+
+.. code-block:: text
+
+   hololink_qldpc_graph_decoder_test.sh [options]
+
+Modes
+^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Flag
+     - Description
+   * - ``--emulate``
+     - Use FPGA emulator (no real FPGA needed)
+   * - *(default)*
+     - FPGA mode (requires real FPGA)
+
+Actions
+^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Flag
+     - Description
+   * - ``--build``
+     - Build all required tools before running
+   * - ``--setup-network``
+     - Configure ConnectX network interfaces
+   * - ``--no-run``
+     - Skip running the test (useful with ``--build``)
+
+Build Options
+^^^^^^^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 30 45
+
+   * - Flag
+     - Default
+     - Description
+   * - ``--hsb-dir DIR``
+     - ``/workspaces/holoscan-sensor-bridge``
+     - holoscan-sensor-bridge source directory
+   * - ``--cuda-quantum-dir DIR``
+     - ``/workspaces/cuda-quantum``
+     - cuda-quantum source directory
+   * - ``--cuda-qx-dir DIR``
+     - ``/workspaces/cudaqx``
+     - cudaqx source directory
+   * - ``--jobs N``
+     - ``nproc``
+     - Parallel build jobs
+
+Network Options
+^^^^^^^^^^^^^^^
+
+..
list-table::
+   :header-rows: 1
+   :widths: 25 20 55
+
+   * - Flag
+     - Default
+     - Description
+   * - ``--device DEV``
+     - auto-detect
+     - ConnectX IB device name
+   * - ``--bridge-ip ADDR``
+     - ``10.0.0.1``
+     - Bridge tool IP address
+   * - ``--emulator-ip ADDR``
+     - ``10.0.0.2``
+     - Emulator IP (emulate mode)
+   * - ``--fpga-ip ADDR``
+     - ``192.168.0.2``
+     - FPGA IP address
+   * - ``--mtu N``
+     - ``4096``
+     - MTU size
+
+Run Options
+^^^^^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 15 60
+
+   * - Flag
+     - Default
+     - Description
+   * - ``--gpu N``
+     - ``0``
+     - GPU device ID
+   * - ``--timeout N``
+     - ``60``
+     - Timeout in seconds
+   * - ``--num-shots N``
+     - all available
+     - Limit number of syndrome shots
+   * - ``--page-size N``
+     - ``384``
+     - Ring buffer slot size in bytes
+   * - ``--num-pages N``
+     - ``128``
+     - Number of ring buffer slots
+   * - ``--spacing N``
+     - ``10``
+     - Inter-shot spacing in microseconds
+   * - ``--no-verify``
+     - *(verify)*
+     - Skip correction verification
+   * - ``--control-port N``
+     - ``8193``
+     - UDP control port for emulator

From 8f4ddec7041e40347304d56718c295eaa80a5d04 Mon Sep 17 00:00:00 2001
From: Chuck Ketcham
Date: Mon, 6 Apr 2026 21:19:49 +0000
Subject: [PATCH 4/5] Update commit SHA in instructions

Signed-off-by: Chuck Ketcham
---
 docs/sphinx/examples_rst/qec/realtime_relay_bp.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst b/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst
index c15ea3a8..47dd0933 100644
--- a/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst
+++ b/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst
@@ -63,7 +63,7 @@ Source Repositories
      - ``main`` branch (or your feature branch)
    * - **cuda-quantum** (realtime)
      - https://github.com/NVIDIA/cuda-quantum
-     - Commit ``9ce3d2e886``
+     - Commit ``bb21b7a031``
    * - **holoscan-sensor-bridge**
      - https://github.com/nvidia-holoscan/holoscan-sensor-bridge
      - Tag ``2.6.0-EA2``
@@ -120,7 +120,7
@@ If you only need to run the CI unit test, you can build without
    # 1. Build libcudaq-realtime
    git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
    cd cudaq-realtime-src
-   git checkout 9ce3d2e886
+   git checkout bb21b7a031
    cd realtime && mkdir -p build && cd build
    cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime ..
    ninja && ninja install
@@ -148,7 +148,7 @@ To also build the bridge and playback tools for emulated or FPGA testing:
    cd cudaq-realtime-src
    git sparse-checkout init --cone
    git sparse-checkout set realtime
-   git checkout 9ce3d2e886
+   git checkout bb21b7a031
    cd ..
 
    # 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2)

From ede9ba9eed503edbaa6a09e8fc045df36524575c Mon Sep 17 00:00:00 2001
From: Chuck Ketcham
Date: Wed, 8 Apr 2026 19:43:42 +0000
Subject: [PATCH 5/5] Address PR review comments for Relay BP decoding guide

Rename document title and references from "Relay BP Decoding with HSB
RDMA" to "Relay BP Decoding with CUDA-Q Realtime". Replace hardcoded
cuda-quantum commit SHA (bb21b7a031) with the releases/v0.14.1 branch
name in the source repositories table and build instructions.

Signed-off-by: Chuck Ketcham
---
 docs/sphinx/components/qec/introduction.rst        |  2 +-
 docs/sphinx/examples_rst/qec/examples.rst          |  2 +-
 docs/sphinx/examples_rst/qec/realtime_relay_bp.rst | 10 +++++-----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/sphinx/components/qec/introduction.rst b/docs/sphinx/components/qec/introduction.rst
index 61a4e41e..12c5b3cb 100644
--- a/docs/sphinx/components/qec/introduction.rst
+++ b/docs/sphinx/components/qec/introduction.rst
@@ -861,7 +861,7 @@ Additional quantum gates can be applied, and only when `get_corrections` is call
 For detailed information on real-time decoding, see:
 
 * :doc:`/examples_rst/qec/realtime_decoding` - Complete Guide with Examples
-* :doc:`/examples_rst/qec/realtime_relay_bp` - Relay BP Decoding with HSB RDMA
+* :doc:`/examples_rst/qec/realtime_relay_bp` - Relay BP Decoding with CUDA-Q Realtime
 * :doc:`/api/qec/cpp_api` - C++ API Reference (see Real-Time Decoding section)
 * :doc:`/api/qec/python_api` - Python API Reference (see Real-Time Decoding section)
 
diff --git a/docs/sphinx/examples_rst/qec/examples.rst b/docs/sphinx/examples_rst/qec/examples.rst
index a20b387b..6e91e8c6 100644
--- a/docs/sphinx/examples_rst/qec/examples.rst
+++ b/docs/sphinx/examples_rst/qec/examples.rst
@@ -11,4 +11,4 @@ Examples that illustrate how to use CUDA-QX for application development are avai
     Circuit-Level QEC
     Decoders
     Real-Time Decoding
-    Relay BP Decoding with HSB RDMA
\ No newline at end of file
+    Relay BP Decoding with CUDA-Q Realtime
\ No newline at end of file
diff --git a/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst b/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst
index 47dd0933..379f777c 100644
--- a/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst
+++ b/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst
@@ -1,5 +1,5 @@
-Relay BP Decoding with HSB RDMA
-================================
+Relay BP Decoding with CUDA-Q Realtime
+========================================
 
 This guide explains how
to build, test, and run the nv-qldpc-decoder Relay BP
 decoder using CUDA-Q's realtime host dispatch system. The decoder runs as a
@@ -63,7 +63,7 @@ Source Repositories
      - ``main`` branch (or your feature branch)
    * - **cuda-quantum** (realtime)
      - https://github.com/NVIDIA/cuda-quantum
-     - Commit ``bb21b7a031``
+     - Branch ``releases/v0.14.1``
    * - **holoscan-sensor-bridge**
      - https://github.com/nvidia-holoscan/holoscan-sensor-bridge
      - Tag ``2.6.0-EA2``
@@ -120,7 +120,7 @@ If you only need to run the CI unit test, you can build without
    # 1. Build libcudaq-realtime
    git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
    cd cudaq-realtime-src
-   git checkout bb21b7a031
+   git checkout releases/v0.14.1
    cd realtime && mkdir -p build && cd build
    cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime ..
    ninja && ninja install
@@ -148,7 +148,7 @@ To also build the bridge and playback tools for emulated or FPGA testing:
    cd cudaq-realtime-src
    git sparse-checkout init --cone
    git sparse-checkout set realtime
-   git checkout bb21b7a031
+   git checkout releases/v0.14.1
    cd ..
 
    # 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2)