Problem Description
The "A Minimal Vector Add Kernel" code snippet from docs/quickstart.rst aborts with an assertion failure (core dump) when executed:
python: /workspace/FlyDSL/include/flydsl/Dialect/Fly/Utils/IntTupleUtils.h:930: IntTuple mlir::fly::intTupleSlice(const IntTupleBuilder<IntTuple>&, IntTuple, IntTupleAttr) [with IntTuple = IntTupleAttr]: Assertion `coord.rank() == tuple.rank() && "Mismatched ranks in slice"' failed.
Aborted (core dumped)
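For readers unfamiliar with the message, the condition in the assertion text (`coord.rank() == tuple.rank()`) can be illustrated with a toy model in plain Python. This is only a sketch of what a rank check on a slice coordinate means. It is not FlyDSL's implementation, the helper names `rank` and `slice_modes` are hypothetical, and it makes no claim about which call in the snippet triggers the failure.

```python
def rank(t):
    """Number of top-level modes; a bare int or None counts as rank 1."""
    return len(t) if isinstance(t, tuple) else 1


def slice_modes(shape, coord):
    """Toy slice: keep the modes of `shape` where `coord` is None.

    Hypothetical helper for illustration only; it mirrors the rank
    check named in the assertion message, nothing more.
    """
    if rank(coord) != rank(shape):
        raise ValueError(
            f"Mismatched ranks in slice: coord rank {rank(coord)} "
            f"vs tuple rank {rank(shape)}"
        )
    if not isinstance(shape, tuple):
        # Rank-1 case: None keeps the mode, an index drops it.
        return shape if coord is None else ()
    kept = tuple(s for s, c in zip(shape, coord) if c is None)
    return kept[0] if len(kept) == 1 else kept


# A rank-2 shape sliced with a rank-2 coordinate is accepted.
tile = slice_modes((64, 2), (None, 1))

# A rank-2 coordinate applied to a rank-1 shape is rejected.
try:
    slice_modes(tile, (None, 3))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

In this toy model, a coordinate with two entries is only valid against a shape with two top-level modes, which is the kind of mismatch the assertion message reports.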
Operating System
Ubuntu 24.04.4 LTS (Noble Numbat)
CPU
AMD EPYC 9575F 64-Core Processor
GPU
AMD Instinct MI355X
ROCm Version
ROCm 7.1.0
Steps to Reproduce
Running it from an interactive IPython shell:
root@smci355-ccs-aus-m02-13:/workspace/FlyDSL# ipython
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.11.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: `?` alone on a line will brings up IPython's help
In [1]: import torch
   ...: import flydsl.compiler as flyc
   ...: import flydsl.expr as fx
   ...:
   ...: @flyc.kernel
   ...: def vectorAddKernel(
   ...:     A: fx.Tensor, B: fx.Tensor, C: fx.Tensor,
   ...:     block_dim: fx.Constexpr[int],
   ...: ):
   ...:     bid = fx.block_idx.x
   ...:     tid = fx.thread_idx.x
   ...:
   ...:     # Partition tensors by block using layout algebra
   ...:     tA = fx.logical_divide(A, fx.make_layout(block_dim, 1))
   ...:     tB = fx.logical_divide(B, fx.make_layout(block_dim, 1))
   ...:     tC = fx.logical_divide(C, fx.make_layout(block_dim, 1))
   ...:
   ...:     tA = fx.slice(tA, (None, bid))
   ...:     tB = fx.slice(tB, (None, bid))
   ...:     tC = fx.slice(tC, (None, bid))
   ...:
   ...:     # Allocate register fragments, load, compute, store
   ...:     RABTy = fx.MemRefType.get(fx.T.f32(), fx.LayoutType.get(1, 1),
   ...:                               fx.AddressSpace.Register)
   ...:     copyAtom = fx.make_copy_atom(fx.UniversalCopy32b(), fx.Float32)
   ...:     rA = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
   ...:     rB = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
   ...:     rC = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
   ...:
   ...:     fx.copy_atom_call(copyAtom, fx.slice(tA, (None, tid)), rA)
   ...:     fx.copy_atom_call(copyAtom, fx.slice(tB, (None, tid)), rB)
   ...:
   ...:     vC = fx.arith.addf(fx.memref_load_vec(rA), fx.memref_load_vec(rB))
   ...:     fx.memref_store_vec(vC, rC)
   ...:     fx.copy_atom_call(copyAtom, rC, fx.slice(tC, (None, tid)))
   ...:
   ...: @flyc.jit
   ...: def vectorAdd(
   ...:     A: fx.Tensor, B: fx.Tensor, C,
   ...:     n: fx.Int32,
   ...:     const_n: fx.Constexpr[int],
   ...:     stream: fx.Stream = fx.Stream(None),
   ...: ):
   ...:     block_dim = 64
   ...:     grid_x = (n + block_dim - 1) // block_dim
   ...:     vectorAddKernel(A, B, C, block_dim).launch(
   ...:         grid=(grid_x, 1, 1), block=[block_dim, 1, 1], stream=stream,
   ...:     )
   ...:
   ...: # Usage
   ...: n = 128
   ...: A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
   ...: B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
   ...: C = torch.zeros(n, dtype=torch.float32).cuda()
   ...: vectorAdd(A, B, C, n, n + 1, stream=torch.cuda.Stream())
   ...: torch.cuda.synchronize()
   ...: print("Result correct:", torch.allclose(C, A + B))
python: /workspace/FlyDSL/include/flydsl/Dialect/Fly/Utils/IntTupleUtils.h:930: IntTuple mlir::fly::intTupleSlice(const IntTupleBuilder<IntTuple>&, IntTuple, IntTupleAttr) [with IntTuple = IntTupleAttr]: Assertion `coord.rank() == tuple.rank() && "Mismatched ranks in slice"' failed.
Aborted (core dumped)
Running it from a Python script:
root@smci355-ccs-aus-m02-13:/workspace/FlyDSL# cat > vec_add.py
import torch
import flydsl.compiler as flyc
import flydsl.expr as fx
@flyc.kernel
def vectorAddKernel(
    A: fx.Tensor, B: fx.Tensor, C: fx.Tensor,
    block_dim: fx.Constexpr[int],
):
    bid = fx.block_idx.x
    tid = fx.thread_idx.x
    # Partition tensors by block using layout algebra
    tA = fx.logical_divide(A, fx.make_layout(block_dim, 1))
    tB = fx.logical_divide(B, fx.make_layout(block_dim, 1))
    tC = fx.logical_divide(C, fx.make_layout(block_dim, 1))
    tA = fx.slice(tA, (None, bid))
    tB = fx.slice(tB, (None, bid))
    tC = fx.slice(tC, (None, bid))
    # Allocate register fragments, load, compute, store
    RABTy = fx.MemRefType.get(fx.T.f32(), fx.LayoutType.get(1, 1),
                              fx.AddressSpace.Register)
    copyAtom = fx.make_copy_atom(fx.UniversalCopy32b(), fx.Float32)
    rA = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
    rB = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
    rC = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
    fx.copy_atom_call(copyAtom, fx.slice(tA, (None, tid)), rA)
    fx.copy_atom_call(copyAtom, fx.slice(tB, (None, tid)), rB)
    vC = fx.arith.addf(fx.memref_load_vec(rA), fx.memref_load_vec(rB))
    fx.memref_store_vec(vC, rC)
    fx.copy_atom_call(copyAtom, rC, fx.slice(tC, (None, tid)))
@flyc.jit
def vectorAdd(
    A: fx.Tensor, B: fx.Tensor, C,
    n: fx.Int32,
    const_n: fx.Constexpr[int],
    stream: fx.Stream = fx.Stream(None),
):
    block_dim = 64
    grid_x = (n + block_dim - 1) // block_dim
    vectorAddKernel(A, B, C, block_dim).launch(
        grid=(grid_x, 1, 1), block=[block_dim, 1, 1], stream=stream,
    )
# Usage
n = 128
A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
C = torch.zeros(n, dtype=torch.float32).cuda()
vectorAdd(A, B, C, n, n + 1, stream=torch.cuda.Stream())
torch.cuda.synchronize()
print("Result correct:", torch.allclose(C, A + B))
root@smci355-ccs-aus-m02-13:/workspace/FlyDSL# python vec_add.py
python: /workspace/FlyDSL/include/flydsl/Dialect/Fly/Utils/IntTupleUtils.h:930: IntTuple mlir::fly::intTupleSlice(const IntTupleBuilder<IntTuple>&, IntTuple, IntTupleAttr) [with IntTuple = IntTupleAttr]: Assertion `coord.rank() == tuple.rank() && "Mismatched ranks in slice"' failed.
Aborted (core dumped)
Additional Information
FlyDSL commit: f63022e