diff --git a/docs/realtime_qldpc_relay_bp_guide.md b/docs/realtime_qldpc_relay_bp_guide.md new file mode 100644 index 00000000..8b906319 --- /dev/null +++ b/docs/realtime_qldpc_relay_bp_guide.md @@ -0,0 +1,407 @@ +# Realtime nv-qldpc-decoder Relay BP Decoding Guide + +This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP decoder +using CUDA-Q's realtime host dispatch system. The decoder runs as a +CPU-launched CUDA graph (`HOST_LOOP` dispatch path) and can operate in three +configurations: + +- **CI unit test** -- standalone executable, no FPGA or network hardware needed +- **Emulated end-to-end test** -- software FPGA emulator replaces real hardware +- **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE + +--- + +## Table of Contents + +1. [Prerequisites](#prerequisites) +2. [Repository Layout](#repository-layout) +3. [Building](#building) +4. [CI Unit Test](#ci-unit-test) +5. [Emulated End-to-End Test](#emulated-end-to-end-test) +6. [FPGA End-to-End Test](#fpga-end-to-end-test) +7. [Orchestration Script Reference](#orchestration-script-reference) + +--- + +## Prerequisites + +### Hardware + +| Configuration | GPU | ConnectX NIC | FPGA | +|---|---|---|---| +| CI unit test | Any CUDA-capable GPU | Not required | Not required | +| Emulated E2E | CUDA GPU with GPUDirect RDMA | Required (loopback cable) | Not required | +| FPGA E2E | CUDA GPU with GPUDirect RDMA | Required | Required | + +Tested platforms: DGX Spark, GB200. + +### Software + +- **CUDA Toolkit**: 12.6 or 13.0 +- **CUDA-Q SDK**: pre-installed (provides `libcudaq`, `libnvqir`, `nvq++`) +- **`nv-qldpc-decoder` plugin**: the proprietary nv-qldpc-decoder shared library + (`libcudaq-qec-nv-qldpc-decoder.so`). Required at runtime for all + three configurations. 
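Before building, it can help to confirm that the tools named in the software prerequisites are on `PATH`. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the tool list in the usage comment assumes `nvcc` comes from the CUDA Toolkit and `nvq++` from the CUDA-Q SDK, with `cmake` and `ninja` used by the build steps later in this guide.

```bash
# Hypothetical helper (not shipped with cudaqx): print each named tool
# that cannot be found on PATH. Prints nothing when all tools are present.
missing_tools() {
  local tool
  for tool in "$@"; do
    # `command -v` is the POSIX way to test whether a command resolves.
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage (tool names taken from the prerequisites and build steps):
#   missing_tools nvcc nvq++ cmake ninja
```

An empty result only shows that the executables resolve. It does not check the CUDA Toolkit version or that the `nv-qldpc-decoder` plugin library is present.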
+ +### Source Repositories + +| Repository | URL | Version | +|---|---|---| +| **cudaqx** | | `main` branch (or your feature branch) | +| **cuda-quantum** (realtime) | | Commit `9ce3d2e886` | +| **holoscan-sensor-bridge** | | Tag `2.6.0-EA2` | + +`cuda-quantum` provides `libcudaq-realtime` (the host dispatcher, ring buffer +management, and dispatch kernel). `holoscan-sensor-bridge` provides the +Hololink `GpuRoceTransceiver` library for RDMA transport. + +> **Note:** `holoscan-sensor-bridge` is only needed for the emulated and FPGA +> end-to-end tests. The CI unit test requires only `libcudaq-realtime`. + +--- + +## Repository Layout + +Key files within `cudaqx`: + +``` +libs/qec/ + unittests/ + realtime/ + qec_graph_decode_test/ + test_realtime_qldpc_graph_decoding.cpp # CI unit test + qec_roce_decode_test/ + data/ + config_nv_qldpc_relay.yml # Relay BP decoder config + syndromes_nv_qldpc_relay.txt # 100 test syndrome shots + utils/ + hololink_qldpc_graph_decoder_bridge.cpp # Bridge tool (RDMA ↔ decoder) + hololink_qldpc_graph_decoder_test.sh # Orchestration script + hololink_fpga_syndrome_playback.cpp # Playback tool (loads syndromes) +``` + +The FPGA emulator is in the `cuda-quantum` repository: + +``` +cuda-quantum/realtime/ + unittests/utils/ + hololink_fpga_emulator.cpp # Software FPGA emulator +``` + +--- + +## Building + +### CI unit test only (no Hololink tools) + +If you only need to run the CI unit test, you can build without +`holoscan-sensor-bridge`: + +```bash +# 1. Build libcudaq-realtime +git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src +cd cudaq-realtime-src +git checkout 9ce3d2e886 +cd realtime && mkdir -p build && cd build +cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime .. +ninja && ninja install +cd ../../.. + +# 2. 
Build cudaqx with the nv-qldpc-decoder test +cmake -S cudaqx -B cudaqx/build \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ + -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ + -DCUDAQX_ENABLE_LIBS="qec" \ + -DCUDAQX_INCLUDE_TESTS=ON +cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding +``` + +### Full build (CI test + Hololink bridge/playback tools) + +To also build the bridge and playback tools for emulated or FPGA testing: + +```bash +# 1. Clone cuda-quantum (realtime) +git clone --filter=blob:none --no-checkout \ + https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src +cd cudaq-realtime-src +git sparse-checkout init --cone +git sparse-checkout set realtime +git checkout 9ce3d2e886 +cd .. + +# 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2) +# Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger). +# If your system cmake is older: pip install cmake +git clone --branch 2.6.0-EA2 \ + https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git +cd holoscan-sensor-bridge + +# Strip operators we don't need to avoid configure failures from missing deps +sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d; + /add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d; + /add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d; + /add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d; + /add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d; + /add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d; + /add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \ + src/hololink/operators/CMakeLists.txt + +mkdir -p build && cd build +cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \ + -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \ + -DHOLOLINK_BUILD_PYTHON=OFF \ + -DHOLOLINK_BUILD_TESTS=OFF \ + -DHOLOLINK_BUILD_TOOLS=OFF \ + -DHOLOLINK_BUILD_EXAMPLES=OFF \ + -DHOLOLINK_BUILD_EMULATOR=OFF .. 
+cmake --build . --target gpu_roce_transceiver hololink_core +cd ../.. + +# 3. Build libcudaq-realtime with Hololink tools enabled +# This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge tool) +# as well as the FPGA emulator. +cd cudaq-realtime-src/realtime && mkdir -p build && cd build +cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \ + -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \ + -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \ + -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \ + .. +ninja && ninja install +cd ../../.. + +# 4. Build cudaqx with Hololink tools enabled +cmake -S cudaqx -B cudaqx/build \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ + -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ + -DCUDAQX_ENABLE_LIBS="qec" \ + -DCUDAQX_INCLUDE_TESTS=ON \ + -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \ + -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \ + -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build +cmake --build cudaqx/build --target \ + test_realtime_qldpc_graph_decoding \ + hololink_qldpc_graph_decoder_bridge \ + hololink_fpga_syndrome_playback +``` + +### Using the orchestration script + +The orchestration script can build everything automatically: + +```bash +./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --build \ + --hsb-dir /path/to/holoscan-sensor-bridge \ + --cuda-quantum-dir /path/to/cuda-quantum \ + --no-run +``` + +--- + +## CI Unit Test + +The CI unit test (`test_realtime_qldpc_graph_decoding`) exercises the full +host dispatch decode path without any network hardware. It: + +1. Loads the Relay BP config and syndrome data from YAML/text files +2. Creates the decoder via the `decoder::get("nv-qldpc-decoder", ...)` plugin API +3. Captures a CUDA graph of the decode pipeline +4. Wires `libcudaq-realtime`'s host dispatcher (HOST_LOOP) to a ring buffer +5. 
Writes RPC requests into the ring buffer, the host dispatcher launches the + CUDA graph, and the test verifies corrections + +### Running + +```bash +cd cudaqx/build + +# The nv-qldpc-decoder plugin must be discoverable at runtime. +# Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path: +export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so + +./libs/qec/unittests/test_realtime_qldpc_graph_decoding +``` + +Expected output: + +``` +[==========] Running 1 test from 1 test suite. +[----------] 1 test from RealtimeQLDPCGraphDecodingTest +[ RUN ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots +... +[ OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms) +[==========] 1 test from 1 test suite ran. +[ PASSED ] 1 test. +``` + +--- + +## Emulated End-to-End Test + +The emulated test replaces the physical FPGA with a software emulator. Three +processes run concurrently: + +1. **Emulator** -- receives syndromes via the UDP control plane, sends them + to the bridge via RDMA, and captures corrections +2. **Bridge** -- runs the host dispatcher and CUDA graph decode loop on the GPU, + receiving syndromes and sending corrections via RDMA +3. **Playback** -- loads syndrome data into the emulator's BRAM and triggers + playback, then verifies corrections + +### Requirements + +- ConnectX NIC with a loopback cable connecting both ports (the emulator + sends RDMA traffic out one port and the bridge receives on the other) +- Software dependencies (DOCA, Holoscan SDK, etc.) 
as described in the + [cuda-quantum realtime build guide](https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/building.md) +- All three tools built (bridge, playback, emulator) + +### Running + +```bash +./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --emulate \ + --build \ + --setup-network \ + --hsb-dir /path/to/holoscan-sensor-bridge +``` + +The `--setup-network` flag configures the ConnectX interface with the +appropriate IP addresses and MTU. It only needs to be run once per boot. + +After the initial build and network setup, subsequent runs are faster: + +```bash +./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate +``` + +--- + +## FPGA End-to-End Test + +The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two +processes run: + +1. **Bridge** -- same as emulated mode +2. **Playback** -- loads syndromes into the FPGA's BRAM and triggers playback, + then reads back corrections from the FPGA's capture RAM to verify them + +### Requirements + +- FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via + direct cable or switch. Bitfiles for supported FPGA vendors are available + [here](https://edge.urm.nvidia.com/artifactory/sw-holoscan-thirdparty-generic-local/QEC/HSB-2.6.0-EA/). + See the [cuda-quantum realtime user guide](https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/user_guide.md) + for FPGA setup instructions. 
+- FPGA IP and bridge IP on the same subnet +- ConnectX device name (e.g., `mlx5_4`, `mlx5_5`) + +### Running + +```bash +./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --build \ + --setup-network \ + --device mlx5_5 \ + --bridge-ip 192.168.0.1 \ + --fpga-ip 192.168.0.2 \ + --gpu 2 \ + --page-size 512 \ + --hsb-dir /path/to/holoscan-sensor-bridge +``` + +Key parameters for FPGA mode: + +| Parameter | Description | +|---|---| +| `--device` | ConnectX IB device name (e.g., `mlx5_5`) | +| `--bridge-ip` | IP address assigned to the ConnectX interface | +| `--fpga-ip` | FPGA's IP address | +| `--gpu` | GPU device ID (choose NUMA-local GPU for lowest latency) | +| `--page-size` | Ring buffer slot size in bytes (use `512` on GB200 for alignment) | +| `--spacing` | Inter-shot spacing in microseconds | + +> **Note:** The `--spacing` value should be set to at least the per-shot decode +> time to avoid overrunning the input ring buffer. If syndromes arrive faster +> than the decoder can process them, the buffer fills up and messages are lost. +> Use a `--spacing` value at or above the observed decode time for sustained +> operation. + +### GPU Selection + +For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC. +For example, on a GB200 system where `mlx5_5` is on NUMA node 1, +use `--gpu 2` or `--gpu 3`. Check NUMA locality with: + +```bash +cat /sys/class/infiniband//device/numa_node +``` + +### Network Sanity Check + +Before running, verify that the bridge IP is assigned to exactly one interface: + +```bash +ip addr show | grep 192.168.0.1 +``` + +If multiple interfaces show the same IP, remove the duplicate to avoid +routing ambiguity that silently drops RDMA packets. 
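The `grep` check above is a quick visual test, but a plain `grep 192.168.0.1` treats the dots as wildcards and also matches longer addresses such as `192.168.0.10`. The sketch below counts exact matches instead. `count_addr_entries` is a hypothetical helper, not part of the orchestration script; it assumes the one-line-per-address format produced by `ip -o -4 addr show`, where the token after `inet` is `ADDRESS/PREFIX`.

```bash
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# how many address entries carry exactly the given IPv4 address.
count_addr_entries() {
  local ip="$1"
  awk -v ip="$ip" '
    {
      # Find the "inet" token; the next field is ADDRESS/PREFIX.
      for (i = 1; i < NF; i++) {
        if ($i == "inet") {
          split($(i + 1), parts, "/")
          if (parts[1] == ip) n++
        }
      }
    }
    END { print n + 0 }   # n + 0 prints 0 when nothing matched
  '
}

# Usage:
#   ip -o -4 addr show | count_addr_entries 192.168.0.1
```

A result of `1` corresponds to the "exactly one interface" condition described above. A result of `0` means the bridge IP is not assigned yet, and anything larger indicates a duplicate to remove.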
+ +--- + +## Orchestration Script Reference + +``` +hololink_qldpc_graph_decoder_test.sh [options] +``` + +### Modes + +| Flag | Description | +|---|---| +| `--emulate` | Use FPGA emulator (no real FPGA needed) | +| *(default)* | FPGA mode (requires real FPGA) | + +### Actions + +| Flag | Description | +|---|---| +| `--build` | Build all required tools before running | +| `--setup-network` | Configure ConnectX network interfaces | +| `--no-run` | Skip running the test (useful with `--build`) | + +### Build Options + +| Flag | Default | Description | +|---|---|---| +| `--hsb-dir DIR` | `/workspaces/holoscan-sensor-bridge` | holoscan-sensor-bridge source directory | +| `--cuda-quantum-dir DIR` | `/workspaces/cuda-quantum` | cuda-quantum source directory | +| `--cuda-qx-dir DIR` | `/workspaces/cudaqx` | cudaqx source directory | +| `--jobs N` | `nproc` | Parallel build jobs | + +### Network Options + +| Flag | Default | Description | +|---|---|---| +| `--device DEV` | auto-detect | ConnectX IB device name | +| `--bridge-ip ADDR` | `10.0.0.1` | Bridge tool IP address | +| `--emulator-ip ADDR` | `10.0.0.2` | Emulator IP (emulate mode) | +| `--fpga-ip ADDR` | `192.168.0.2` | FPGA IP address | +| `--mtu N` | `4096` | MTU size | + +### Run Options + +| Flag | Default | Description | +|---|---|---| +| `--gpu N` | `0` | GPU device ID | +| `--timeout N` | `60` | Timeout in seconds | +| `--num-shots N` | all available | Limit number of syndrome shots | +| `--page-size N` | `384` | Ring buffer slot size in bytes | +| `--num-pages N` | `128` | Number of ring buffer slots | +| `--spacing N` | `10` | Inter-shot spacing in microseconds | +| `--no-verify` | *(verify)* | Skip correction verification | +| `--control-port N` | `8193` | UDP control port for emulator |
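
The `--spacing` and `--num-pages` options interact: if the per-shot decode time exceeds the spacing, the ring buffer backlog grows until all slots are occupied. The sketch below estimates how long that takes under a simple steady-rate model, which is an assumption made for illustration and not a description of the dispatcher's actual behavior. Shots arrive every `SPACING_US` microseconds and one shot completes every `DECODE_US` microseconds, so the backlog grows at `1/SPACING_US - 1/DECODE_US` slots per microsecond. `ring_fill_time_us` is a hypothetical helper, and the 12 µs decode time in the usage comment is a made-up value.

```bash
# Hypothetical helper: estimate the time in microseconds until a ring buffer
# of NUM_PAGES slots fills, assuming constant arrival and decode intervals.
#   fill time = NUM_PAGES * SPACING_US * DECODE_US / (DECODE_US - SPACING_US)
# Prints "never" when decoding keeps up (DECODE_US <= SPACING_US).
ring_fill_time_us() {
  local pages="$1" spacing="$2" decode="$3"
  if (( decode <= spacing )); then
    echo "never"
  else
    # Integer arithmetic; the result is truncated to whole microseconds.
    echo $(( pages * spacing * decode / (decode - spacing) ))
  fi
}

# Usage with the script defaults (128 slots, 10 us spacing) and an
# illustrative 12 us decode time:
#   ring_fill_time_us 128 10 12
```

Under this model the defaults with a 12 µs decode time give about 7680 µs, or roughly 768 shots, before messages start to be lost. Raising `--spacing` to at least the observed decode time removes the overrun entirely.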