diff --git a/docs/sphinx/components/qec/introduction.rst b/docs/sphinx/components/qec/introduction.rst index 3297ed76..12c5b3cb 100644 --- a/docs/sphinx/components/qec/introduction.rst +++ b/docs/sphinx/components/qec/introduction.rst @@ -861,6 +861,7 @@ Additional quantum gates can be applied, and only when `get_corrections` is call For detailed information on real-time decoding, see: * :doc:`/examples_rst/qec/realtime_decoding` - Complete Guide with Examples +* :doc:`/examples_rst/qec/realtime_relay_bp` - Relay BP Decoding with CUDA-Q Realtime * :doc:`/api/qec/cpp_api` - C++ API Reference (see Real-Time Decoding section) * :doc:`/api/qec/python_api` - Python API Reference (see Real-Time Decoding section) diff --git a/docs/sphinx/examples_rst/qec/examples.rst b/docs/sphinx/examples_rst/qec/examples.rst index 79247213..6e91e8c6 100644 --- a/docs/sphinx/examples_rst/qec/examples.rst +++ b/docs/sphinx/examples_rst/qec/examples.rst @@ -10,4 +10,5 @@ Examples that illustrate how to use CUDA-QX for application development are avai Code-Capacity QEC Circuit-Level QEC Decoders - Real-Time Decoding \ No newline at end of file + Real-Time Decoding + Relay BP Decoding with CUDA-Q Realtime \ No newline at end of file diff --git a/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst b/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst new file mode 100644 index 00000000..379f777c --- /dev/null +++ b/docs/sphinx/examples_rst/qec/realtime_relay_bp.rst @@ -0,0 +1,509 @@ +Relay BP Decoding with CUDA-Q Realtime +======================================== + +This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP +decoder using CUDA-Q's realtime host dispatch system. 
The decoder runs as a +CPU-launched CUDA graph (``HOST_LOOP`` dispatch path) and can operate in three +configurations: + +- **CI unit test** -- standalone executable, no FPGA or network hardware needed +- **Emulated end-to-end test** -- software FPGA emulator replaces real hardware +- **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE + +Prerequisites +------------- + +Hardware +^^^^^^^^ + +.. list-table:: + :header-rows: 1 + :widths: 20 25 25 20 + + * - Configuration + - GPU + - ConnectX NIC + - FPGA + * - CI unit test + - Any CUDA-capable GPU + - Not required + - Not required + * - Emulated E2E + - CUDA GPU with GPUDirect RDMA + - Required (loopback cable) + - Not required + * - FPGA E2E + - CUDA GPU with GPUDirect RDMA + - Required + - Required + +Tested platforms: DGX Spark, GB200. + +Software +^^^^^^^^ + +- **CUDA Toolkit**: 12.6 or 13.0 +- **CUDA-Q SDK**: pre-installed (provides ``libcudaq``, ``libnvqir``, ``nvq++``) +- **nv-qldpc-decoder plugin**: the proprietary nv-qldpc-decoder shared library + (``libcudaq-qec-nv-qldpc-decoder.so``). Required at runtime for all + three configurations. + +Source Repositories +^^^^^^^^^^^^^^^^^^^ + +.. list-table:: + :header-rows: 1 + :widths: 30 40 30 + + * - Repository + - URL + - Version + * - **cudaqx** + - https://github.com/NVIDIA/cudaqx + - ``main`` branch (or your feature branch) + * - **cuda-quantum** (realtime) + - https://github.com/NVIDIA/cuda-quantum + - Branch ``releases/v0.14.1`` + * - **holoscan-sensor-bridge** + - https://github.com/nvidia-holoscan/holoscan-sensor-bridge + - Tag ``2.6.0-EA2`` + +``cuda-quantum`` provides ``libcudaq-realtime`` (the host dispatcher, ring +buffer management, and dispatch kernel). ``holoscan-sensor-bridge`` provides +the Hololink ``GpuRoceTransceiver`` library for RDMA transport. + +.. note:: + + ``holoscan-sensor-bridge`` is only needed for the emulated and FPGA + end-to-end tests. The CI unit test requires only ``libcudaq-realtime``. 
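Before building, it can help to confirm that the command-line tools used in the build steps are on ``PATH``. The following is a minimal sketch, not part of any of the repositories above: ``missing_tools`` is a hypothetical helper, and the tool list is inferred from the commands in this guide (``nvcc`` from the CUDA Toolkit, ``nvq++`` from the CUDA-Q SDK).

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of cudaqx or cuda-quantum):
# prints one "missing: <tool>" line for every command not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
}

# Tool list inferred from the commands used in this guide.
missing_tools git cmake ninja nvcc nvq++
```

No output means every listed tool was found. The check says nothing about versions, so the CUDA Toolkit and cmake version requirements still need to be verified separately.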
+ +Repository Layout +----------------- + +Key files within ``cudaqx``: + +.. code-block:: text + + libs/qec/ + unittests/ + realtime/ + qec_graph_decode_test/ + test_realtime_qldpc_graph_decoding.cpp # CI unit test + qec_roce_decode_test/ + data/ + config_nv_qldpc_relay.yml # Relay BP decoder config + syndromes_nv_qldpc_relay.txt # 100 test syndrome shots + utils/ + hololink_qldpc_graph_decoder_bridge.cpp # Bridge tool (RDMA <-> decoder) + hololink_qldpc_graph_decoder_test.sh # Orchestration script + hololink_fpga_syndrome_playback.cpp # Playback tool (loads syndromes) + +The FPGA emulator is in the ``cuda-quantum`` repository: + +.. code-block:: text + + cuda-quantum/realtime/ + unittests/utils/ + hololink_fpga_emulator.cpp # Software FPGA emulator + +Building +-------- + +CI unit test only (no Hololink tools) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If you only need to run the CI unit test, you can build without +``holoscan-sensor-bridge``: + +.. code-block:: bash + + # 1. Build libcudaq-realtime + git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src + cd cudaq-realtime-src + git checkout releases/v0.14.1 + cd realtime && mkdir -p build && cd build + cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime .. + ninja && ninja install + cd ../../.. + + # 2. Build cudaqx with the nv-qldpc-decoder test + cmake -S cudaqx -B cudaqx/build \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ + -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ + -DCUDAQX_ENABLE_LIBS="qec" \ + -DCUDAQX_INCLUDE_TESTS=ON + cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding + +Full build (CI test + Hololink bridge/playback tools) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To also build the bridge and playback tools for emulated or FPGA testing: + +.. code-block:: bash + + # 1. 
Clone cuda-quantum (realtime) + git clone --filter=blob:none --no-checkout \ + https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src + cd cudaq-realtime-src + git sparse-checkout init --cone + git sparse-checkout set realtime + git checkout releases/v0.14.1 + cd .. + + # 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2) + # Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger). + # If your system cmake is older: pip install cmake + git clone --branch 2.6.0-EA2 \ + https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git + cd holoscan-sensor-bridge + + # Strip operators we don't need to avoid configure failures from missing deps + sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d; + /add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d; + /add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d; + /add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d; + /add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d; + /add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d; + /add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \ + src/hololink/operators/CMakeLists.txt + + mkdir -p build && cd build + cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \ + -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \ + -DHOLOLINK_BUILD_PYTHON=OFF \ + -DHOLOLINK_BUILD_TESTS=OFF \ + -DHOLOLINK_BUILD_TOOLS=OFF \ + -DHOLOLINK_BUILD_EXAMPLES=OFF \ + -DHOLOLINK_BUILD_EMULATOR=OFF .. + cmake --build . --target gpu_roce_transceiver hololink_core + cd ../.. + + # 3. Build libcudaq-realtime with Hololink tools enabled + # This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge + # tool) as well as the FPGA emulator. 
+ cd cudaq-realtime-src/realtime && mkdir -p build && cd build + # Absolute path to the holoscan-sensor-bridge clone built in step 2 + HSB_DIR="$(realpath ../../../holoscan-sensor-bridge)" + cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \ + -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \ + -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR="$HSB_DIR" \ + -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR="$HSB_DIR/build" \ + .. + ninja && ninja install + cd ../../.. + + # 4. Build cudaqx with Hololink tools enabled + cmake -S cudaqx -B cudaqx/build \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ + -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ + -DCUDAQX_ENABLE_LIBS="qec" \ + -DCUDAQX_INCLUDE_TESTS=ON \ + -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \ + -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \ + -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build + cmake --build cudaqx/build --target \ + test_realtime_qldpc_graph_decoding \ + hololink_qldpc_graph_decoder_bridge \ + hololink_fpga_syndrome_playback + +Using the orchestration script +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The orchestration script can build everything automatically: + +.. code-block:: bash + + ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --build \ + --hsb-dir /path/to/holoscan-sensor-bridge \ + --cuda-quantum-dir /path/to/cuda-quantum \ + --no-run + +CI Unit Test +------------ + +The CI unit test (``test_realtime_qldpc_graph_decoding``) exercises the full +host dispatch decode path without any network hardware. It: + +1. Loads the Relay BP config and syndrome data from YAML/text files +2. Creates the decoder via the ``decoder::get("nv-qldpc-decoder", ...)`` plugin API +3. Captures a CUDA graph of the decode pipeline +4. Wires ``libcudaq-realtime``'s host dispatcher (HOST_LOOP) to a ring buffer +5. Writes RPC requests into the ring buffer so that the host dispatcher launches + the CUDA graph, then verifies the corrections + +Running +^^^^^^^ + +..
code-block:: bash + + cd cudaqx/build + + # The nv-qldpc-decoder plugin must be discoverable at runtime. + # Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path: + export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so + + ./libs/qec/unittests/test_realtime_qldpc_graph_decoding + +Expected output: + +.. code-block:: text + + [==========] Running 1 test from 1 test suite. + [----------] 1 test from RealtimeQLDPCGraphDecodingTest + [ RUN ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots + ... + [ OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms) + [==========] 1 test from 1 test suite ran. + [ PASSED ] 1 test. + +Emulated End-to-End Test +------------------------ + +The emulated test replaces the physical FPGA with a software emulator. Three +processes run concurrently: + +1. **Emulator** -- receives syndromes via the UDP control plane, sends them + to the bridge via RDMA, and captures corrections +2. **Bridge** -- runs the host dispatcher and CUDA graph decode loop on the GPU, + receiving syndromes and sending corrections via RDMA +3. **Playback** -- loads syndrome data into the emulator's BRAM and triggers + playback, then verifies corrections + +Requirements +^^^^^^^^^^^^ + +- ConnectX NIC with a loopback cable connecting both ports (the emulator + sends RDMA traffic out one port and the bridge receives on the other) +- Software dependencies (DOCA, Holoscan SDK, etc.) as described in the + `cuda-quantum realtime build guide `__ +- All three tools built (bridge, playback, emulator) + +Running +^^^^^^^ + +.. code-block:: bash + + ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --emulate \ + --build \ + --setup-network \ + --hsb-dir /path/to/holoscan-sensor-bridge + +The ``--setup-network`` flag configures the ConnectX interface with the +appropriate IP addresses and MTU. It only needs to be run once per boot. 
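To see what ``--setup-network`` left in place, the network interface and MTU behind each RDMA device can be read from sysfs. The following is a read-only sketch that assumes the standard Linux sysfs layout. The function name ``list_rdma_netdevs`` and its optional root argument are illustrative and not part of the orchestration script.

```shell
#!/usr/bin/env bash
# List each RDMA device with its network interface and current MTU.
# Read-only; prints nothing when no RDMA devices are present.
list_rdma_netdevs() {
  local root="${1:-/sys}"   # overridable root, handy for dry runs
  local dev net
  for dev in "$root"/class/infiniband/*; do
    [ -d "$dev" ] || continue
    for net in "$dev"/device/net/*; do
      [ -e "$net/mtu" ] || continue
      echo "$(basename "$dev") $(basename "$net") mtu=$(cat "$net/mtu")"
    done
  done
}

list_rdma_netdevs
```

Each output line names an IB device, its interface, and the MTU, which can be compared against the ``--mtu`` value passed to the script.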
+ +After the initial build and network setup, subsequent runs are faster: + +.. code-block:: bash + + ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate + +FPGA End-to-End Test +-------------------- + +The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two +processes run: + +1. **Bridge** -- same as emulated mode +2. **Playback** -- loads syndromes into the FPGA's BRAM and triggers playback, + then reads back corrections from the FPGA's capture RAM to verify them + +Requirements +^^^^^^^^^^^^ + +- FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via + direct cable or switch. Bitfiles for supported FPGA vendors are available + `here `__. + See the `cuda-quantum realtime user guide `__ + for FPGA setup instructions. +- FPGA IP and bridge IP on the same subnet +- ConnectX device name (e.g., ``mlx5_4``, ``mlx5_5``) + +Running +^^^^^^^ + +.. code-block:: bash + + ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \ + --build \ + --setup-network \ + --device mlx5_5 \ + --bridge-ip 192.168.0.1 \ + --fpga-ip 192.168.0.2 \ + --gpu 2 \ + --page-size 512 \ + --hsb-dir /path/to/holoscan-sensor-bridge + +Key parameters for FPGA mode: + +.. list-table:: + :header-rows: 1 + :widths: 20 80 + + * - Parameter + - Description + * - ``--device`` + - ConnectX IB device name (e.g., ``mlx5_5``) + * - ``--bridge-ip`` + - IP address assigned to the ConnectX interface + * - ``--fpga-ip`` + - FPGA's IP address + * - ``--gpu`` + - GPU device ID (choose NUMA-local GPU for lowest latency) + * - ``--page-size`` + - Ring buffer slot size in bytes (use ``512`` on GB200 for alignment) + * - ``--spacing`` + - Inter-shot spacing in microseconds + +.. note:: + + The ``--spacing`` value should be set to at least the per-shot decode + time to avoid overrunning the input ring buffer. If syndromes arrive faster + than the decoder can process them, the buffer fills up and messages are lost. 
+ Use a ``--spacing`` value at or above the observed decode time for sustained + operation. + +GPU Selection +^^^^^^^^^^^^^ + +For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC. +For example, on a GB200 system where ``mlx5_5`` is on NUMA node 1, +use ``--gpu 2`` or ``--gpu 3``. Check NUMA locality with: + +.. code-block:: bash + + # Replace <device> with the ConnectX IB device name, e.g. mlx5_5 + cat /sys/class/infiniband/<device>/device/numa_node + +Network Sanity Check +^^^^^^^^^^^^^^^^^^^^ + +Before running, verify that the bridge IP is assigned to exactly one interface: + +.. code-block:: bash + + # -w and -F match the exact address, not 192.168.0.10 or similar + ip addr show | grep -wF 192.168.0.1 + +If multiple interfaces show the same IP, remove the duplicate to avoid +routing ambiguity that silently drops RDMA packets. + +Orchestration Script Reference +------------------------------ + +.. code-block:: text + + hololink_qldpc_graph_decoder_test.sh [options] + +Modes +^^^^^ + +.. list-table:: + :header-rows: 1 + :widths: 20 80 + + * - Flag + - Description + * - ``--emulate`` + - Use FPGA emulator (no real FPGA needed) + * - *(default)* + - FPGA mode (requires real FPGA) + +Actions +^^^^^^^ + +.. list-table:: + :header-rows: 1 + :widths: 20 80 + + * - Flag + - Description + * - ``--build`` + - Build all required tools before running + * - ``--setup-network`` + - Configure ConnectX network interfaces + * - ``--no-run`` + - Skip running the test (useful with ``--build``) + +Build Options +^^^^^^^^^^^^^ + +.. list-table:: + :header-rows: 1 + :widths: 25 30 45 + + * - Flag + - Default + - Description + * - ``--hsb-dir DIR`` + - ``/workspaces/holoscan-sensor-bridge`` + - holoscan-sensor-bridge source directory + * - ``--cuda-quantum-dir DIR`` + - ``/workspaces/cuda-quantum`` + - cuda-quantum source directory + * - ``--cuda-qx-dir DIR`` + - ``/workspaces/cudaqx`` + - cudaqx source directory + * - ``--jobs N`` + - ``nproc`` + - Parallel build jobs + +Network Options +^^^^^^^^^^^^^^^ + +..
list-table:: + :header-rows: 1 + :widths: 25 20 55 + + * - Flag + - Default + - Description + * - ``--device DEV`` + - auto-detect + - ConnectX IB device name + * - ``--bridge-ip ADDR`` + - ``10.0.0.1`` + - Bridge tool IP address + * - ``--emulator-ip ADDR`` + - ``10.0.0.2`` + - Emulator IP (emulate mode) + * - ``--fpga-ip ADDR`` + - ``192.168.0.2`` + - FPGA IP address + * - ``--mtu N`` + - ``4096`` + - MTU size + +Run Options +^^^^^^^^^^^ + +.. list-table:: + :header-rows: 1 + :widths: 25 15 60 + + * - Flag + - Default + - Description + * - ``--gpu N`` + - ``0`` + - GPU device ID + * - ``--timeout N`` + - ``60`` + - Timeout in seconds + * - ``--num-shots N`` + - all available + - Limit number of syndrome shots + * - ``--page-size N`` + - ``384`` + - Ring buffer slot size in bytes + * - ``--num-pages N`` + - ``128`` + - Number of ring buffer slots + * - ``--spacing N`` + - ``10`` + - Inter-shot spacing in microseconds + * - ``--no-verify`` + - *(verify)* + - Skip correction verification + * - ``--control-port N`` + - ``8193`` + - UDP control port for emulator
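As a rough worked example of choosing ``--spacing`` from a measured decode time, the sketch below rounds up with some headroom and multiplies the slot size by the slot count to estimate the ring buffer payload. The 42 microsecond decode time and the 20% headroom are made-up illustrative values, not measurements. The size estimate assumes the buffer is simply ``--page-size`` times ``--num-pages`` with no extra metadata.

```shell
#!/usr/bin/env bash
# Sketch: derive --spacing from a per-shot decode time.
# decode_us and headroom_pct are illustrative assumptions, not measured values.
decode_us=42
headroom_pct=20

# Integer round-up so the spacing never falls below decode time plus headroom.
spacing=$(( (decode_us * (100 + headroom_pct) + 99) / 100 ))

page_size=512   # slot size suggested in this guide for GB200
num_pages=128   # documented default
ring_bytes=$(( page_size * num_pages ))   # estimate: slots x slot size

echo "spacing=${spacing}us ring_buffer=${ring_bytes}B"
# Print the flags only; pass them to hololink_qldpc_graph_decoder_test.sh by hand.
echo "--spacing ${spacing} --page-size ${page_size} --num-pages ${num_pages}"
```

With these inputs the sketch yields a spacing of 51 microseconds and an estimated 65536 bytes of ring buffer slots.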