Add multi-dtype support to elementwise examples and new elementwise_arith ops#48

Open
erwei-xilinx wants to merge 9 commits into main from multi-dtype-vec-add

Conversation


@erwei-xilinx erwei-xilinx commented Apr 10, 2026

Summary

Extends all elementwise examples to support multiple data types and adds new elementwise arithmetic operations (sub, mul, div, square).

Infrastructure: dtype-aware transform script placeholders

  • Driver-level dtype detection from Linalg IR (_detect_element_type) and placeholder substitution (@DTYPE@, @PAD_VAL@, @VECTOR_SIZE@) resolved before transform library injection. Fully backward-compatible: no-op when no placeholders are present.
  • Transform library: pad_and_promote_{unary,binary}_{bf16,f32,i8,i16} sequences for all dtype/arity combinations.

Multi-dtype support for existing examples

NPU2 (Strix / AIE2P):

| Example | bf16 | f32 (bf16-emu) | i8 | i16 |
|---|---|---|---|---|
| vec-add | Pass | Pass | Pass | Pass |
| axpy | Pass | Pass | -- | Pass |
| relu | Pass | Pass | -- | Pass |
| sigmoid | Pass | Pass | -- | -- |
| silu | Pass | Pass | -- | -- |
| gelu | Pass | Pass | -- | -- |
| swiglu | Pass | Pass | -- | -- |
| leaky_relu | Pass | Pass | -- | -- |

NPU1 (Phoenix / AIE2):

| Example | bf16 | f32 (bf16-emu) | i8 | i16 |
|---|---|---|---|---|
| vec-add | Pass | Pass | Fail¹ | Pass |
| axpy | Pass | Pass | Fail¹ | Pass |
| relu | Pass | Pass | Fail¹ | Pass |
| sigmoid | Pass | Fail³ | -- | -- |
| silu | Pass | Pass | -- | -- |
| gelu | -- (no AIE2 transform) | -- | -- | -- |
| swiglu | Pass | Pass | -- | -- |
| leaky_relu | Pass | Fail² | -- | -- |

All examples accept --dtype and --bf16-emulation CLI arguments. Default behavior (bf16, no args) is identical to before.
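The shared CLI surface might look like the sketch below. The flag names come from the PR; the choices, default, and help text are assumptions.

```python
import argparse

def make_parser():
    """Sketch of the CLI arguments shared by the elementwise examples.

    --dtype defaults to bf16 so the no-argument behavior stays identical
    to the previous single-dtype examples (assumed default).
    """
    p = argparse.ArgumentParser(description="elementwise example")
    p.add_argument("--dtype", choices=["bf16", "f32", "i8", "i16"],
                   default="bf16",
                   help="element type for inputs and outputs")
    p.add_argument("--bf16-emulation", action="store_true",
                   help="run f32 through bf16 emulation")
    return p
```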

New example: elementwise_arith

A single multi-op example (--op sub|mul|div|square) that selects the unary or binary transform script based on op arity, and auto-detects the NPU version to pick the correct variant (AIE2 or AIE2P).
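The selection logic could be as simple as the sketch below. `detect_npu_version()` and the `transform_{unary,binary}_{aie2,aie2p}.mlir` naming appear in the PR; the arity mapping and version encoding here are assumptions.

```python
# Assumed op-arity classification: square is unary; sub/mul/div are binary.
UNARY_OPS = {"square"}
BINARY_OPS = {"sub", "mul", "div"}

def select_transform_script(op, npu_version):
    """Pick the transform script for an op given the detected NPU version
    (assumed here to be 1 for Phoenix/AIE2, 2 for Strix/AIE2P)."""
    arity = "unary" if op in UNARY_OPS else "binary"
    suffix = "aie2p" if npu_version == 2 else "aie2"
    return f"transform_{arity}_{suffix}.mlir"
```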

NPU2 (Strix / AIE2P):

| Op | bf16 | f32 (bf16-emu) |
|---|---|---|
| sub | Pass | Pass |
| mul | Pass | Pass |
| div | -- | Pass (f32-only) |
| square | Pass | Pass |

NPU1 (Phoenix / AIE2):

| Op | bf16 | f32 (bf16-emu) |
|---|---|---|
| sub | Pass | Pass |
| mul | Pass | Pass |
| div | -- | Pass (f32-only) |
| square | Pass | Pass |

Known limitations

  • i8 vector ops: Only arith.addi works (vec-add on NPU2). All other i8 ops (muli, maxsi, subi) fail at aircc. On NPU1, even vec-add i8 fails at Peano LLC. Filed Xilinx/mlir-aie#3027.
  • i16 vector ops: Only arith.addi works (vec-add). subi/muli also fail at aircc (same issue).
  • bf16 divf: Not supported on AIE hardware. div example is f32-only.
  • ¹ i8 on NPU1: Peano LLC cannot lower i8 vector operations to AIE2 machine code.
  • ² leaky_relu f32 on NPU1: Peano LLC cannot lower arith.cmpf + arith.select with bf16-emulation on AIE2.
  • ³ sigmoid f32 on NPU1: transform_aie2.mlir hardcodes bf16 padding value; type mismatch with f32 input. Tracked in issue #49 ("PR #48 multi-dtype tests: NPU1 (Phoenix/AIE2) results — 15/20 pass, 5 failures").

Test plan

  • Full regression suite on NPU2: 16 passed, 0 regressions (matvec pre-existing fail, matmul_i8_m128 pre-existing timeout)
  • All 8 modified elementwise examples pass bf16 + f32 on NPU2
  • New elementwise_arith: sub/mul/div/square pass for supported dtypes on NPU2
  • vec-add: bf16/f32/i8/i16 all pass on NPU2
  • NPU1 (Phoenix/AIE2): 15/20 existing tests pass, 5 fail (3x i8 Peano, 1x leaky_relu f32 Peano, 1x sigmoid f32 transform bug)
  • NPU1: all 7 elementwise_arith tests pass (sub/mul/div/square x bf16/f32)
  • CI build validation

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 10, 2026 03:27
Extends the vec-add example to support bf16, f32 (via bf16-emulation),
i8, and i16 data types, inspired by mlir-air's triton_vec_add test.

Driver changes:
- Add dtype detection from Linalg IR (_detect_element_type)
- Add placeholder substitution (@dtype@, @PAD_VAL@, @VECTOR_SIZE@) in
  transform scripts, resolved before library injection based on the
  IR element type and NPU version. Backward-compatible: no-op when
  no placeholders are present.

Transform library:
- Add pad_and_promote_binary_{f32,i8,i16} sequences alongside the
  existing bf16 variant.

Vec-add example:
- Add --dtype and --bf16-emulation CLI arguments
- Transform scripts now use @dtype@ and @VECTOR_SIZE@ placeholders,
  making them dtype-generic across both AIE2 and AIE2P.

Tested on NPU2 (Strix/AIE2P): all 4 dtypes pass correctness checks
across vector sizes 1024-32768.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Extends the examples/vec-add pipeline to run vector addition across multiple element types (bf16 default, f32 via bf16-emulation, i8, i16) by introducing dtype-aware transform-script placeholder substitution in the backend driver and adding corresponding transform-library sequences.

Changes:

  • Add CLI dtype selection to vec-add.py and wire bf16-emulation via an env var for f32 runs.
  • Make vec-add transform scripts dtype-/NPU-generic via @DTYPE@ and @VECTOR_SIZE@ placeholders.
  • Add pad_and_promote_binary_{f32,i8,i16} sequences and add driver-side placeholder substitution based on detected element type + NPU version.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| examples/vec-add/vec-add.py | Adds --dtype/--bf16-emulation and per-dtype input generation + tolerances. |
| examples/vec-add/transform_aie2p.mlir | Switches to placeholder-based includes for dtype + vector size. |
| examples/vec-add/transform_aie2.mlir | Switches to placeholder-based includes for dtype + vector size. |
| amd_triton_npu/backend/transform_library/elementwise.mlir | Adds pad/promotion sequences for f32/i8/i16 binary elementwise ops. |
| amd_triton_npu/backend/driver.py | Adds element-type detection and placeholder substitution prior to transform-library injection. |
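The per-dtype tolerances mentioned for vec-add.py might look like the sketch below. The tolerance values and the checking helper are hypothetical; the actual numbers in the PR may differ.

```python
# Assumed per-dtype tolerances: floating-point paths (including f32 via
# bf16 emulation) get a relative tolerance; integer adds must be exact.
TOLERANCES = {
    "bf16": 1e-2,
    "f32":  1e-2,  # f32 runs go through bf16 emulation, so bf16-level error
    "i8":   0,
    "i16":  0,
}

def check(expected, actual, dtype):
    """Elementwise comparison with a dtype-dependent relative tolerance."""
    tol = TOLERANCES[dtype]
    return all(abs(e - a) <= tol * max(1.0, abs(e))
               for e, a in zip(expected, actual))
```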


erwei-xilinx and others added 5 commits April 9, 2026 20:35
- Only call detect_npu_version() when @VECTOR_SIZE@ placeholder is
  actually present, avoiding failures in environments without xrt-smi
- Raise ValueError with supported types when an unsupported element
  type is detected but placeholders are present
- Fix _detect_element_type docstring to match actual behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update generate_readme.py registry to reflect multi-dtype support
(bf16, f32, i8, i16) and regenerate examples/README.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
examples/README.md is auto-generated by generate_readme.py in CI
and should not be committed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend axpy and relu to support bf16, f32 (bf16-emulation), i8, and
i16 using the same @dtype@/@VECTOR_SIZE@ placeholder mechanism as
vec-add.

Transform library: add pad_and_promote_unary_{f32,i8,i16} sequences.

Tested on NPU2 (Strix/AIE2P):
- bf16, f32, i16: pass for both axpy and relu
- i8: compiles through triton-shared-opt and AIR transforms but fails
  at aircc (arith.muli/maxsi not supported for i8 vectors on AIE2P).
  vec-add i8 works because it only uses arith.addi.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend sigmoid, silu, gelu, swiglu, and leaky_relu examples to support
f32 input via bf16-emulation, in addition to the existing bf16.

All transform scripts updated with @dtype@/@VECTOR_SIZE@ placeholders.
The @cast_bf16_only_ops and @cast_cmpf_and_select_ops phases work
correctly for both bf16 and f32 inputs -- for f32, the cast converts
f32 vector ops to bf16 at the MLIR level (equivalent to what
bf16-emulation does at the LLVM level).

Tested on NPU2 (Strix/AIE2P): all 5 examples pass correctness checks
for both bf16 and f32 across vector sizes 1024-32768.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
erwei-xilinx and others added 2 commits April 9, 2026 22:01
New multi-op example supporting sub, mul, div, and square with
--op and --dtype CLI arguments. Auto-selects unary or binary
transform script based on op arity.

Supported dtypes: bf16 and f32 (via bf16-emulation). Integer types
(i16) fail at aircc for subi/muli -- only addi works for integer
vectors on AIE2P (tracked in Xilinx/mlir-aie#3027).

div is f32-only (arith.divf has no bf16 hardware support on AIE2P).

Tested on NPU2 (Strix/AIE2P): sub, mul, div, square all pass for
their supported dtypes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx changed the title from "Add multi-dtype support to vec-add (bf16, f32, i8, i16)" to "Add multi-dtype support to elementwise examples and new elementwise_arith ops" on Apr 10, 2026
Create transform_binary_aie2.mlir and transform_unary_aie2.mlir for
NPU1 targets — content is identical to the AIE2P variants since
@dtype@ and @VECTOR_SIZE@ placeholders handle the differences.

Update elementwise_arith.py to auto-detect the NPU version via
detect_npu_version() and select the correct transform script suffix
(aie2 vs aie2p) instead of hardcoding aie2p.

Update generate_readme.py get_device_support() to use glob patterns
so it detects both transform_aie2.mlir and transform_*_aie2.mlir
naming conventions used by multi-op examples.
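The glob-based detection described above can be sketched as follows. `get_device_support` and the two file-name conventions come from the commit message; the directory-scanning details are assumptions.

```python
from pathlib import Path

def get_device_support(example_dir):
    """Detect which NPU targets an example supports by looking for
    transform scripts. The pattern transform_*aie2.mlir matches both
    transform_aie2.mlir and transform_binary_aie2.mlir, but not the
    aie2p variants (the literal 'aie2.mlir' suffix excludes 'aie2p.mlir')."""
    devices = []
    for dev in ("aie2", "aie2p"):
        if list(Path(example_dir).glob(f"transform_*{dev}.mlir")):
            devices.append(dev)
    return devices
```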

Tested on NPU1 (Phoenix/AIE2): all 7 test cases pass
(sub bf16/f32, mul bf16/f32, div f32, square bf16/f32).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>