Add amdxdna HAL driver for AMD XDNA NPUs #37
Open
jtuyls wants to merge 1 commit into
Force-pushed from e4dc83e to 565102b.
Author (Collaborator): @benvanik Could you help review this PR?
Force-pushed from c59c420 to 24bcc85.
Force-pushed from 584a451 to 9df8d43.
jtuyls added a commit to jtuyls/FastFlowLM that referenced this pull request on Jun 17, 2026:
Batch each FLM runlist into one HRX ERT_CMD_CHAIN (forward_runlist) instead of one synchronous dispatch per kernel, amortizing per-dispatch submit/completion overhead. On by default in the shim; set FLM_CHAIN=0 to fall back to per-dispatch. Requires the HRX amdxdna command-chain support in ROCm/hrx-system#37.
Measured (Qwen3-0.6B, Strix Point, flm bench): decode 45.1/33.0/16.8/10.0 tok/s at 1k/4k/16k/32k (vs 39.6/30.2/15.0/9.4 per-dispatch).
Adds bench/: a standalone, HRX-only microbenchmark (libhrx.so only, no shim/ runtime) replaying one captured runlist as an ERT_CMD_CHAIN vs separate dispatches. The shim's env-gated FLM_DUMP_RUNLIST capture regenerates the (uncommitted) runlist artifacts locally. Self-contained: references only this branch + an HRX checkout/build.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the amdxdna HAL driver with Linux KMQ and Windows MCDM native shims, PDIX and XADX executable schemas, async queue execution, transfer-queue support, native command recording/submission paths, HRX runtime integration, and amdxdna host CI coverage. Includes unit and CTS coverage for allocator/buffer/device/event/semaphore/executable paths, command-buffer planning and caches, transfer queue behavior, XADX/PDIX artifact handling, and platform shim utilities.
Force-pushed from 9df8d43 to 637809f.
Introduce an IREE HAL runtime driver that targets AMD XDNA NPUs through the DRM ioctl ABI of the in-kernel amdxdna driver (kernel-managed queue, KMQ). The driver is self-contained: a small static user-space shim links into the runtime and talks to the device directly, keeping the dependency surface minimal. It is gated by the IREE_HAL_DRIVER_AMDXDNA build option and wired into the HAL driver registry, init.c, and libhrx's GPU-driver selection.
Executable format: a new pdi_executable_def.fbs ("PDIX") schema models an executable as a shared PDI pool plus per-entry-point runs. Each run is an XAie transaction ("TXN") control-code stream with an optional control-packet data_payload (array reconfiguration) and an optional host patch_table. Entry points reference PDIs by index, so several can share a single loaded PDI (e.g. manually merged kernels).
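For reviewers who want a mental model of the layout described above, here is a rough Python sketch of the shape. It is not the actual pdi_executable_def.fbs schema: only `data_payload` and `patch_table` are names taken from this description, while the class names, `txn`, `pdi_index`, the patch-table entry shape, and the `verify` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Run:
    txn: bytes                            # TXN control-code stream (field name assumed)
    data_payload: Optional[bytes] = None  # optional control-packet data
    # Entry shape is an assumption: (byte offset in txn, binding index).
    patch_table: Optional[List[Tuple[int, int]]] = None


@dataclass
class EntryPoint:
    name: str
    pdi_index: int                        # index into the shared PDI pool
    runs: List[Run] = field(default_factory=list)


@dataclass
class PdixExecutable:
    pdis: List[bytes]                     # shared PDI pool
    entry_points: List[EntryPoint]


def verify(exe: PdixExecutable) -> None:
    """Reject entry points whose PDI index falls outside the pool."""
    for ep in exe.entry_points:
        if not 0 <= ep.pdi_index < len(exe.pdis):
            raise ValueError(f"{ep.name}: pdi_index {ep.pdi_index} out of range")


def pdis_in_use(exe: PdixExecutable) -> List[int]:
    """Distinct PDI indices referenced by entry points, in ascending order."""
    return sorted({ep.pdi_index for ep in exe.entry_points})
```

Because entry points carry an index rather than an embedded PDI, two entry points pointing at index 0 resolve to a single PDI in this model, which is the sharing case mentioned above.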
Submission paths: by default each command is submitted as ERT_START_CU and the firmware patches shim-DMA addresses. An opt-in path (--amdxdna_cmd_chain) batches a command buffer's dispatches into one ERT_CMD_CHAIN, host-patching the buffer-descriptor addresses from the compiler-emitted patch_table (validated for npu4).
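The difference between the two paths can be illustrated with a toy planner. This is only a sketch of the batching idea: the opcode names are used as plain strings, and `plan_submissions` and its `cmd_chain` parameter are hypothetical stand-ins for the driver's behavior with and without --amdxdna_cmd_chain.

```python
from typing import List, Tuple

# A command is modeled as (opcode name, dispatches it carries).
Command = Tuple[str, List[str]]


def plan_submissions(dispatches: List[str], cmd_chain: bool = False) -> List[Command]:
    """Group a command buffer's dispatches into submitted commands."""
    if not dispatches:
        return []
    if cmd_chain:
        # Opt-in path: one chained command carries every dispatch.
        return [("ERT_CMD_CHAIN", list(dispatches))]
    # Default path: one command per dispatch.
    return [("ERT_START_CU", [d]) for d in dispatches]
```

In this model a three-dispatch command buffer yields three submissions by default and one with chaining, so any fixed per-submission cost is paid once per command buffer instead of once per dispatch.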
Device model: the driver uses host-side timeline semaphores and a single-worker async queue that defers HAL queue ops until their waits are satisfied and serializes all NPU access. The PR also adds the allocator, buffers, a no-op executable cache, events, and the Linux/KMQ native binding.
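The deferral behavior can be sketched with a small single-threaded simulation. This is an illustrative model, not the driver's code: `TimelineSemaphore`, `Op`, `SingleWorkerQueue`, and `pump` are hypothetical names, and a real worker would block on a thread rather than poll in a loop.

```python
class TimelineSemaphore:
    """Host-side counter that only moves forward."""

    def __init__(self, value=0):
        self.value = value

    def signal(self, value):
        if value < self.value:
            raise ValueError("timeline value must not go backwards")
        self.value = value

    def reached(self, value):
        return self.value >= value


class Op:
    def __init__(self, name, waits=(), signals=()):
        self.name = name
        self.waits = list(waits)      # (semaphore, value) pairs to wait for
        self.signals = list(signals)  # (semaphore, value) pairs to signal after


class SingleWorkerQueue:
    def __init__(self):
        self.pending = []
        self.executed = []

    def submit(self, op):
        self.pending.append(op)

    def pump(self):
        """Run ready ops one at a time until nothing else can make progress."""
        progressed = True
        while progressed:
            progressed = False
            for op in list(self.pending):
                if all(sem.reached(v) for sem, v in op.waits):
                    self.pending.remove(op)
                    self.executed.append(op.name)  # the only place device work would run
                    for sem, v in op.signals:
                        sem.signal(v)
                    progressed = True
                    break  # rescan from the oldest pending op
        return self.executed
```

Submitting a dispatch that waits on value 1 before the upload that signals value 1 still executes the upload first, and an op whose wait is never satisfied simply stays pending.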
Shim: the user-space shim under shim/linux/kmq is self-contained, exception-free code rewritten from amd/xdna-driver's shim, plus the verbatim kernel UAPI (amdxdna_accel.h, GPL-2.0-WITH-syscall-note) and ERT ABI (ert.h, dual-licensed) headers. Provenance and per-file licenses are documented in shim/linux/kmq/README.md; per-file SPDX headers are authoritative.
Tests cover the allocator, buffers, async queue, driver, executable parsing/verification, semaphores/events, and the host patch-table logic (TXN op sizing, sentinel constant patching, and address patching).
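As a rough picture of what address patching involves, the sketch below writes buffer addresses into a copy of a TXN byte stream. The encoding is an assumption made for illustration only: a flat 64-bit little-endian slot per patch, with each patch-table entry given as a hypothetical (byte offset, binding index) pair. The real patch_table layout and buffer-descriptor encoding are defined by the compiler and the PDIX schema, not by this example.

```python
import struct
from typing import Sequence, Tuple


def patch_addresses(
    txn: bytes,
    patch_table: Sequence[Tuple[int, int]],
    buffer_addrs: Sequence[int],
) -> bytes:
    """Return a copy of txn with each patched slot set to its buffer address."""
    out = bytearray(txn)  # never mutate the cached original stream
    for offset, binding in patch_table:
        if not 0 <= binding < len(buffer_addrs):
            raise IndexError(f"binding {binding} has no buffer")
        if offset < 0 or offset + 8 > len(out):
            raise ValueError(f"patch at offset {offset} overruns the stream")
        # Assumed slot format: one unsigned 64-bit little-endian address.
        struct.pack_into("<Q", out, offset, buffer_addrs[binding])
    return bytes(out)
```

Patching a copy keeps the original stream reusable across submissions with different buffers, and the bounds checks mirror the kind of validation a host-side patcher needs before touching control code.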