Add multi-dtype support to elementwise examples and new elementwise_arith ops#48

Open
erwei-xilinx wants to merge 9 commits into main from multi-dtype-vec-add

Conversation


@erwei-xilinx erwei-xilinx commented Apr 10, 2026

Summary

Extends all elementwise examples to support multiple data types and adds new elementwise arithmetic operations (sub, mul, div, square).

Infrastructure: dtype-aware transform script placeholders

  • Driver-level dtype detection from Linalg IR (_detect_element_type) and placeholder substitution (@DTYPE@, @PAD_VAL@, @VECTOR_SIZE@) resolved before transform library injection. Fully backward-compatible: no-op when no placeholders are present.
  • Transform library: pad_and_promote_{unary,binary}_{bf16,f32,i8,i16} sequences for all dtype/arity combinations.

Multi-dtype support for existing examples

NPU2 (Strix / AIE2P):

| Example | bf16 | f32 (bf16-emu) | i8 | i16 |
|---|---|---|---|---|
| vec-add | Pass | Pass | Pass | Pass |
| axpy | Pass | Pass | -- | Pass |
| relu | Pass | Pass | -- | Pass |
| sigmoid | Pass | Pass | -- | -- |
| silu | Pass | Pass | -- | -- |
| gelu | Pass | Pass | -- | -- |
| swiglu | Pass | Pass | -- | -- |
| leaky_relu | Pass | Pass | -- | -- |

NPU1 (Phoenix / AIE2):

| Example | bf16 | f32 (bf16-emu) | i8 | i16 |
|---|---|---|---|---|
| vec-add | Pass | Pass | Fail¹ | Pass |
| axpy | Pass | Pass | Fail¹ | Pass |
| relu | Pass | Pass | Fail¹ | Pass |
| sigmoid | Pass | Fail³ | -- | -- |
| silu | Pass | Pass | -- | -- |
| gelu | -- (no AIE2 transform) | -- | -- | -- |
| swiglu | Pass | Pass | -- | -- |
| leaky_relu | Pass | Fail² | -- | -- |

All examples accept --dtype and --bf16-emulation CLI arguments. Default behavior (bf16, no args) is identical to before.
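The shared CLI surface might look like the sketch below. The flag names come from the PR; the choices, default, and help text are assumptions.

```python
import argparse

def make_parser():
    """Sketch of the CLI arguments shared by the elementwise examples.

    --dtype defaults to bf16 so the no-argument behavior stays identical
    to the previous single-dtype examples (assumed default).
    """
    p = argparse.ArgumentParser(description="elementwise example")
    p.add_argument("--dtype", choices=["bf16", "f32", "i8", "i16"],
                   default="bf16",
                   help="element type for inputs and outputs")
    p.add_argument("--bf16-emulation", action="store_true",
                   help="run f32 through bf16 emulation")
    return p
```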

New example: elementwise_arith

A single multi-op example (--op sub|mul|div|square) that selects the unary or binary transform script based on op arity, and auto-detects the NPU version to pick the correct variant (AIE2 or AIE2P).
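The selection logic could be as simple as the sketch below. `detect_npu_version()` and the `transform_{unary,binary}_{aie2,aie2p}.mlir` naming appear in the PR; the arity mapping and version encoding here are assumptions.

```python
# Assumed op-arity classification: square is unary; sub/mul/div are binary.
UNARY_OPS = {"square"}
BINARY_OPS = {"sub", "mul", "div"}

def select_transform_script(op, npu_version):
    """Pick the transform script for an op given the detected NPU version
    (assumed here to be 1 for Phoenix/AIE2, 2 for Strix/AIE2P)."""
    arity = "unary" if op in UNARY_OPS else "binary"
    suffix = "aie2p" if npu_version == 2 else "aie2"
    return f"transform_{arity}_{suffix}.mlir"
```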

NPU2 (Strix / AIE2P):

| Op | bf16 | f32 (bf16-emu) |
|---|---|---|
| sub | Pass | Pass |
| mul | Pass | Pass |
| div | -- | Pass (f32-only) |
| square | Pass | Pass |

NPU1 (Phoenix / AIE2):

| Op | bf16 | f32 (bf16-emu) |
|---|---|---|
| sub | Pass | Pass |
| mul | Pass | Pass |
| div | -- | Pass (f32-only) |
| square | Pass | Pass |

Known limitations

  • i8 vector ops: Only arith.addi works (vec-add on NPU2). All other i8 ops (muli, maxsi, subi) fail at aircc. On NPU1, even vec-add i8 fails at Peano LLC. Filed Xilinx/mlir-aie#3027.
  • i16 vector ops: Only arith.addi works (vec-add). subi/muli also fail at aircc (same issue).
  • bf16 divf: Not supported on AIE hardware. div example is f32-only.
  • ¹ i8 on NPU1: Peano LLC cannot lower i8 vector operations to AIE2 machine code.
  • ² leaky_relu f32 on NPU1: Peano LLC cannot lower arith.cmpf + arith.select with bf16-emulation on AIE2.
  • ³ sigmoid f32 on NPU1: transform_aie2.mlir hardcodes bf16 padding value; type mismatch with f32 input. Tracked in issue #49 ("PR #48 multi-dtype tests: NPU1 (Phoenix/AIE2) results — 15/20 pass, 5 failures").

Test plan

  • Full regression suite on NPU2: 16 passed, 0 regressions (matvec pre-existing fail, matmul_i8_m128 pre-existing timeout)
  • All 8 modified elementwise examples pass bf16 + f32 on NPU2
  • New elementwise_arith: sub/mul/div/square pass for supported dtypes on NPU2
  • vec-add: bf16/f32/i8/i16 all pass on NPU2
  • NPU1 (Phoenix/AIE2): 15/20 existing tests pass, 5 fail (3x i8 Peano, 1x leaky_relu f32 Peano, 1x sigmoid f32 transform bug)
  • NPU1: all 7 elementwise_arith tests pass (sub/mul/div/square x bf16/f32)
  • CI build validation

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 10, 2026 03:27
Extends the vec-add example to support bf16, f32 (via bf16-emulation),
i8, and i16 data types, inspired by mlir-air's triton_vec_add test.

Driver changes:
- Add dtype detection from Linalg IR (_detect_element_type)
- Add placeholder substitution (@dtype@, @PAD_VAL@, @VECTOR_SIZE@) in
  transform scripts, resolved before library injection based on the
  IR element type and NPU version. Backward-compatible: no-op when
  no placeholders are present.

Transform library:
- Add pad_and_promote_binary_{f32,i8,i16} sequences alongside the
  existing bf16 variant.

Vec-add example:
- Add --dtype and --bf16-emulation CLI arguments
- Transform scripts now use @dtype@ and @VECTOR_SIZE@ placeholders,
  making them dtype-generic across both AIE2 and AIE2P.

Tested on NPU2 (Strix/AIE2P): all 4 dtypes pass correctness checks
across vector sizes 1024-32768.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Extends the examples/vec-add pipeline to run vector addition across multiple element types (bf16 default, f32 via bf16-emulation, i8, i16) by introducing dtype-aware transform-script placeholder substitution in the backend driver and adding corresponding transform-library sequences.

Changes:

  • Add CLI dtype selection to vec-add.py and wire bf16-emulation via an env var for f32 runs.
  • Make vec-add transform scripts dtype-/NPU-generic via @DTYPE@ and @VECTOR_SIZE@ placeholders.
  • Add pad_and_promote_binary_{f32,i8,i16} sequences and add driver-side placeholder substitution based on detected element type + NPU version.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| examples/vec-add/vec-add.py | Adds --dtype/--bf16-emulation and per-dtype input generation + tolerances. |
| examples/vec-add/transform_aie2p.mlir | Switches to placeholder-based includes for dtype + vector size. |
| examples/vec-add/transform_aie2.mlir | Switches to placeholder-based includes for dtype + vector size. |
| amd_triton_npu/backend/transform_library/elementwise.mlir | Adds pad/promotion sequences for f32/i8/i16 binary elementwise ops. |
| amd_triton_npu/backend/driver.py | Adds element-type detection and placeholder substitution prior to transform-library injection. |
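The per-dtype tolerances mentioned for vec-add.py might look like the sketch below. The tolerance values and the checking helper are hypothetical; the actual numbers in the PR may differ.

```python
# Assumed per-dtype tolerances: floating-point paths (including f32 via
# bf16 emulation) get a relative tolerance; integer adds must be exact.
TOLERANCES = {
    "bf16": 1e-2,
    "f32":  1e-2,  # f32 runs go through bf16 emulation, so bf16-level error
    "i8":   0,
    "i16":  0,
}

def check(expected, actual, dtype):
    """Elementwise comparison with a dtype-dependent relative tolerance."""
    tol = TOLERANCES[dtype]
    return all(abs(e - a) <= tol * max(1.0, abs(e))
               for e, a in zip(expected, actual))
```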


erwei-xilinx and others added 5 commits April 9, 2026 20:35
- Only call detect_npu_version() when @VECTOR_SIZE@ placeholder is
  actually present, avoiding failures in environments without xrt-smi
- Raise ValueError with supported types when an unsupported element
  type is detected but placeholders are present
- Fix _detect_element_type docstring to match actual behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update generate_readme.py registry to reflect multi-dtype support
(bf16, f32, i8, i16) and regenerate examples/README.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
examples/README.md is auto-generated by generate_readme.py in CI
and should not be committed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend axpy and relu to support bf16, f32 (bf16-emulation), i8, and
i16 using the same @dtype@/@VECTOR_SIZE@ placeholder mechanism as
vec-add.

Transform library: add pad_and_promote_unary_{f32,i8,i16} sequences.

Tested on NPU2 (Strix/AIE2P):
- bf16, f32, i16: pass for both axpy and relu
- i8: compiles through triton-shared-opt and AIR transforms but fails
  at aircc (arith.muli/maxsi not supported for i8 vectors on AIE2P).
  vec-add i8 works because it only uses arith.addi.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend sigmoid, silu, gelu, swiglu, and leaky_relu examples to support
f32 input via bf16-emulation, in addition to the existing bf16.

All transform scripts updated with @dtype@/@VECTOR_SIZE@ placeholders.
The @cast_bf16_only_ops and @cast_cmpf_and_select_ops phases work
correctly for both bf16 and f32 inputs -- for f32, the cast converts
f32 vector ops to bf16 at the MLIR level (equivalent to what
bf16-emulation does at the LLVM level).

Tested on NPU2 (Strix/AIE2P): all 5 examples pass correctness checks
for both bf16 and f32 across vector sizes 1024-32768.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
erwei-xilinx and others added 2 commits April 9, 2026 22:01
New multi-op example supporting sub, mul, div, and square with
--op and --dtype CLI arguments. Auto-selects unary or binary
transform script based on op arity.

Supported dtypes: bf16 and f32 (via bf16-emulation). Integer types
(i16) fail at aircc for subi/muli -- only addi works for integer
vectors on AIE2P (tracked in Xilinx/mlir-aie#3027).

div is f32-only (arith.divf has no bf16 hardware support on AIE2P).

Tested on NPU2 (Strix/AIE2P): sub, mul, div, square all pass for
their supported dtypes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx changed the title from "Add multi-dtype support to vec-add (bf16, f32, i8, i16)" to "Add multi-dtype support to elementwise examples and new elementwise_arith ops" on Apr 10, 2026
Create transform_binary_aie2.mlir and transform_unary_aie2.mlir for
NPU1 targets — content is identical to the AIE2P variants since
@dtype@ and @VECTOR_SIZE@ placeholders handle the differences.

Update elementwise_arith.py to auto-detect the NPU version via
detect_npu_version() and select the correct transform script suffix
(aie2 vs aie2p) instead of hardcoding aie2p.

Update generate_readme.py get_device_support() to use glob patterns
so it detects both transform_aie2.mlir and transform_*_aie2.mlir
naming conventions used by multi-op examples.
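The glob-based detection described above can be sketched as follows. `get_device_support` and the two file-name conventions come from the commit message; the directory-scanning details are assumptions.

```python
from pathlib import Path

def get_device_support(example_dir):
    """Detect which NPU targets an example supports by looking for
    transform scripts. The pattern transform_*aie2.mlir matches both
    transform_aie2.mlir and transform_binary_aie2.mlir, but not the
    aie2p variants (the literal 'aie2.mlir' suffix excludes 'aie2p.mlir')."""
    devices = []
    for dev in ("aie2", "aie2p"):
        if list(Path(example_dir).glob(f"transform_*{dev}.mlir")):
            devices.append(dev)
    return devices
```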

Tested on NPU1 (Phoenix/AIE2): all 7 test cases pass
(sub bf16/f32, mul bf16/f32, div f32, square bf16/f32).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>