Add amdxdna HAL driver for AMD XDNA NPUs #37

Open
jtuyls wants to merge 1 commit into ROCm:main from jtuyls:amdxdna-hal-native
jtuyls commented Jun 5, 2026

Collaborator

Introduce an IREE HAL runtime driver targeting AMD XDNA NPUs directly through the in-kernel amdxdna driver's DRM ioctl ABI (kernel-managed queue / KMQ). It is self-contained: a small static user-space shim links into the runtime and talks to the device directly, keeping the dependency surface minimal. The driver is gated by the IREE_HAL_DRIVER_AMDXDNA build option and wired into the HAL driver registry, init.c, and libhrx's GPU-driver selection.

Executable format: a new pdi_executable_def.fbs ("PDIX") schema models an executable as a shared PDI pool plus per-entry-point runs. Each run is an XAie transaction ("TXN") control-code stream with an optional control-packet data_payload (array reconfiguration) and an optional host patch_table. Entry points reference PDIs by index, so several can share a single loaded PDI (e.g. manually merged kernels).
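The shape described above (one shared PDI pool, entry points that reference PDIs by index, and runs carrying a TXN stream plus optional payload and patch table) can be sketched as plain data classes. This is only an illustration of the indexing scheme: the class names, field types, and the `verify` helper are hypothetical and are not taken from `pdi_executable_def.fbs`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    txn: bytes                                # TXN control-code stream
    data_payload: Optional[bytes] = None      # optional control-packet data
    patch_table: Optional[List[int]] = None   # optional host patch entries (layout hypothetical)


@dataclass
class EntryPoint:
    name: str
    pdi_index: int                            # index into Executable.pdis
    runs: List[Run] = field(default_factory=list)


@dataclass
class Executable:
    pdis: List[bytes]                         # shared PDI pool
    entry_points: List[EntryPoint]


def verify(exe: Executable) -> List[str]:
    """Return a list of structural problems; an empty list means the model is consistent."""
    errors = []
    for ep in exe.entry_points:
        if not 0 <= ep.pdi_index < len(exe.pdis):
            errors.append(f"{ep.name}: pdi_index {ep.pdi_index} out of range")
        if not ep.runs:
            errors.append(f"{ep.name}: no runs")
        for i, run in enumerate(ep.runs):
            if not run.txn:
                errors.append(f"{ep.name}: run {i} has an empty TXN stream")
    return errors


def pdis_to_load(exe: Executable) -> List[int]:
    """Distinct PDI indices referenced by any entry point."""
    return sorted({ep.pdi_index for ep in exe.entry_points})


# Two entry points sharing one PDI, as with manually merged kernels.
exe = Executable(
    pdis=[b"pdi-0"],
    entry_points=[
        EntryPoint("matmul", 0, [Run(txn=b"\x01\x02")]),
        EntryPoint("softmax", 0, [Run(txn=b"\x03", data_payload=b"\x00")]),
    ],
)
```

Because entry points hold an index rather than an embedded PDI, `pdis_to_load(exe)` yields a single entry here even though two entry points exist.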

Submission paths: by default each command is submitted as ERT_START_CU and the firmware patches shim-DMA addresses. An opt-in path (--amdxdna_cmd_chain) batches a command buffer's dispatches into one ERT_CMD_CHAIN, host-patching the buffer-descriptor addresses from the compiler-emitted patch_table (validated for npu4).
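To make the host-patching idea concrete, here is a minimal sketch of applying a patch table to a control-code stream before submission. The entry layout (byte offset, argument index, addend) and the 64-bit little-endian address encoding are assumptions made for illustration; the PR does not spell out the real `patch_table` format or the buffer-descriptor layout.

```python
import struct
from typing import NamedTuple, Sequence


class PatchEntry(NamedTuple):
    offset: int      # byte offset into the control-code stream (hypothetical)
    arg_index: int   # which dispatch argument supplies the base address
    addend: int      # constant added to the argument's device address


def apply_patches(txn: bytes,
                  patches: Sequence[PatchEntry],
                  arg_addresses: Sequence[int]) -> bytes:
    """Return a copy of txn with each patched slot overwritten by an address."""
    out = bytearray(txn)  # never mutate the compiler-emitted stream in place
    for p in patches:
        if not 0 <= p.arg_index < len(arg_addresses):
            raise IndexError(f"patch references missing argument {p.arg_index}")
        if p.offset < 0 or p.offset + 8 > len(out):
            raise ValueError(f"patch offset {p.offset} is outside the stream")
        address = (arg_addresses[p.arg_index] + p.addend) & 0xFFFFFFFFFFFFFFFF
        struct.pack_into("<Q", out, p.offset, address)
    return bytes(out)


# One 8-byte address slot at offset 8, filled from argument 0 plus 0x10.
patched = apply_patches(bytes(16), [PatchEntry(8, 0, 0x10)], [0x1000])
```

Validating the offset and argument index before writing keeps a malformed table from corrupting neighbouring control code, which is the kind of check the patch-table tests mentioned in this PR would plausibly exercise.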

Device model: the driver uses host-side timeline semaphores and a single-worker async queue. The queue defers HAL queue ops until their waits are satisfied and serializes all NPU access. The PR also adds the allocator, buffers, a no-op executable cache, events, and the Linux/KMQ native binding.
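The deferral behaviour can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the driver's code: the real queue presumably runs on a worker thread, and `TimelineSemaphore`, `QueueOp`, and `DeferredQueue` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class TimelineSemaphore:
    """Host-side counter that only moves forward."""

    def __init__(self, initial: int = 0):
        self.value = initial

    def signal(self, value: int) -> None:
        self.value = max(self.value, value)

    def reached(self, value: int) -> bool:
        return self.value >= value


@dataclass(eq=False)  # identity comparison so list.remove() targets this exact op
class QueueOp:
    waits: List[Tuple[TimelineSemaphore, int]]
    action: Callable[[], None]
    signals: List[Tuple[TimelineSemaphore, int]]


class DeferredQueue:
    """Runs ops one at a time, holding back any whose waits are not yet met."""

    def __init__(self):
        self.pending: List[QueueOp] = []

    def submit(self, op: QueueOp) -> None:
        self.pending.append(op)

    def pump(self) -> int:
        """Execute every op that is or becomes ready; return how many ran."""
        executed = 0
        progressed = True
        while progressed:
            progressed = False
            for op in list(self.pending):
                if all(sem.reached(v) for sem, v in op.waits):
                    self.pending.remove(op)
                    op.action()  # the only place "device" work happens
                    for sem, v in op.signals:
                        sem.signal(v)
                    executed += 1
                    progressed = True
        return executed


a, b = TimelineSemaphore(), TimelineSemaphore()
log: List[str] = []
queue = DeferredQueue()
# Submitted first, but must wait for a >= 1.
queue.submit(QueueOp([(a, 1)], lambda: log.append("second"), [(b, 1)]))
# No waits, so it runs first and unblocks the op above.
queue.submit(QueueOp([], lambda: log.append("first"), [(a, 1)]))
ran = queue.pump()
```

Since every action runs from the single `pump` loop, device access is serialized by construction, and submission order does not have to match execution order.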

Shim: the user-space shim under shim/linux/kmq is self-contained, exception-free code rewritten from amd/xdna-driver's shim, plus the verbatim kernel UAPI (amdxdna_accel.h, GPL-2.0-WITH-syscall-note) and ERT ABI (ert.h, dual-licensed) headers. Provenance and per-file licenses are documented in shim/linux/kmq/README.md; per-file SPDX headers are authoritative.

Tests cover the allocator, buffers, async queue, driver, executable parsing/verification, semaphores/events, and the host patch-table logic (TXN op sizing, sentinel constant patching, and address patching).

jtuyls force-pushed the amdxdna-hal-native branch 3 times, most recently from e4dc83e to 565102b, June 5, 2026 20:29
jtuyls requested a review from benvanik June 5, 2026 20:31

jtuyls commented Jun 8, 2026

Collaborator Author

@benvanik Could you help review this PR?

jtuyls force-pushed the amdxdna-hal-native branch 2 times, most recently from c59c420 to 24bcc85, June 15, 2026 19:36
jtuyls changed the title from "Add amdxdna HAL driver for AMD XDNA NPUs (Linux KMQ)" to "Add amdxdna HAL driver for AMD XDNA NPUs" Jun 15, 2026
jtuyls force-pushed the amdxdna-hal-native branch 6 times, most recently from 584a451 to 9df8d43, June 17, 2026 11:12
jtuyls added a commit to jtuyls/FastFlowLM that referenced this pull request Jun 17, 2026
Batch each FLM runlist into one HRX ERT_CMD_CHAIN (forward_runlist) instead of
one synchronous dispatch per kernel, amortizing per-dispatch submit/completion
overhead. On by default in the shim; set FLM_CHAIN=0 to fall back to per-dispatch.
Requires the HRX amdxdna command-chain support in ROCm/hrx-system#37.

Measured (Qwen3-0.6B, Strix Point, flm bench): decode 45.1/33.0/16.8/10.0 tok/s
at 1k/4k/16k/32k (vs 39.6/30.2/15.0/9.4 per-dispatch).

Adds bench/: a standalone, HRX-only microbenchmark (libhrx.so only, no shim/
runtime) replaying one captured runlist as an ERT_CMD_CHAIN vs separate
dispatches. The shim's env-gated FLM_DUMP_RUNLIST capture regenerates the
(uncommitted) runlist artifacts locally. Self-contained: references only this
branch + an HRX checkout/build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the amdxdna HAL driver with Linux KMQ and Windows MCDM native shims, PDIX and XADX executable schemas, async queue execution, transfer-queue support, native command recording/submission paths, HRX runtime integration, and amdxdna host CI coverage.

Includes unit and CTS coverage for allocator/buffer/device/event/semaphore/executable paths, command-buffer planning and caches, transfer queue behavior, XADX/PDIX artifact handling, and platform shim utilities.
jtuyls force-pushed the amdxdna-hal-native branch from 9df8d43 to 637809f, June 17, 2026 11:54