[Issue]: "A Minimal Vector Add Kernel" from `docs/quickstart.rst` segfaults

### Problem Description

The "A Minimal Vector Add Kernel" code snippet from `docs/quickstart.rst` crashes when executed. The process aborts on a failed assertion in `intTupleSlice` and dumps core:

```text
python: /workspace/FlyDSL/include/flydsl/Dialect/Fly/Utils/IntTupleUtils.h:930: IntTuple mlir::fly::intTupleSlice(const IntTupleBuilder<IntTuple>&, IntTuple, IntTupleAttr) [with IntTuple = IntTupleAttr]: Assertion `coord.rank() == tuple.rank() && "Mismatched ranks in slice"' failed.
Aborted (core dumped)
```

### Operating System

Ubuntu 24.04.4 LTS (Noble Numbat)

### CPU

AMD EPYC 9575F 64-Core Processor

### GPU

AMD Instinct MI355X

### ROCm Version

ROCm 7.1.0

### Steps to Reproduce

Running it from an interactive IPython shell:

```text
root@smci355-ccs-aus-m02-13:/workspace/FlyDSL# ipython
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.11.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: `?` alone on a line will brings up IPython's help

In [1]: import torch
   ...: import flydsl.compiler as flyc
   ...: import flydsl.expr as fx
   ...:
   ...: @flyc.kernel
   ...: def vectorAddKernel(
   ...:     A: fx.Tensor, B: fx.Tensor, C: fx.Tensor,
   ...:     block_dim: fx.Constexpr[int],
   ...: ):
   ...:     bid = fx.block_idx.x
   ...:     tid = fx.thread_idx.x
   ...:
   ...:     # Partition tensors by block using layout algebra
   ...:     tA = fx.logical_divide(A, fx.make_layout(block_dim, 1))
   ...:     tB = fx.logical_divide(B, fx.make_layout(block_dim, 1))
   ...:     tC = fx.logical_divide(C, fx.make_layout(block_dim, 1))
   ...:
   ...:     tA = fx.slice(tA, (None, bid))
   ...:     tB = fx.slice(tB, (None, bid))
   ...:     tC = fx.slice(tC, (None, bid))
   ...:
   ...:     # Allocate register fragments, load, compute, store
   ...:     RABTy = fx.MemRefType.get(fx.T.f32(), fx.LayoutType.get(1, 1),
   ...:                               fx.AddressSpace.Register)
   ...:     copyAtom = fx.make_copy_atom(fx.UniversalCopy32b(), fx.Float32)
   ...:     rA = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
   ...:     rB = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
   ...:     rC = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
   ...:
   ...:     fx.copy_atom_call(copyAtom, fx.slice(tA, (None, tid)), rA)
   ...:     fx.copy_atom_call(copyAtom, fx.slice(tB, (None, tid)), rB)
   ...:
   ...:     vC = fx.arith.addf(fx.memref_load_vec(rA), fx.memref_load_vec(rB))
   ...:     fx.memref_store_vec(vC, rC)
   ...:     fx.copy_atom_call(copyAtom, rC, fx.slice(tC, (None, tid)))
   ...:
   ...: @flyc.jit
   ...: def vectorAdd(
   ...:     A: fx.Tensor, B: fx.Tensor, C,
   ...:     n: fx.Int32,
   ...:     const_n: fx.Constexpr[int],
   ...:     stream: fx.Stream = fx.Stream(None),
   ...: ):
   ...:     block_dim = 64
   ...:     grid_x = (n + block_dim - 1) // block_dim
   ...:     vectorAddKernel(A, B, C, block_dim).launch(
   ...:         grid=(grid_x, 1, 1), block=[block_dim, 1, 1], stream=stream,
   ...:     )
   ...:
   ...: # Usage
   ...: n = 128
   ...: A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
   ...: B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
   ...: C = torch.zeros(n, dtype=torch.float32).cuda()
   ...: vectorAdd(A, B, C, n, n + 1, stream=torch.cuda.Stream())
   ...: torch.cuda.synchronize()
   ...: print("Result correct:", torch.allclose(C, A + B))
python: /workspace/FlyDSL/include/flydsl/Dialect/Fly/Utils/IntTupleUtils.h:930: IntTuple mlir::fly::intTupleSlice(const IntTupleBuilder<IntTuple>&, IntTuple, IntTupleAttr) [with IntTuple = IntTupleAttr]: Assertion `coord.rank() == tuple.rank() && "Mismatched ranks in slice"' failed.
Aborted (core dumped)
```

Running it from a Python script:

```text
root@smci355-ccs-aus-m02-13:/workspace/FlyDSL# cat > vec_add.py
import torch
import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def vectorAddKernel(
    A: fx.Tensor, B: fx.Tensor, C: fx.Tensor,
    block_dim: fx.Constexpr[int],
):
    bid = fx.block_idx.x
    tid = fx.thread_idx.x

    # Partition tensors by block using layout algebra
    tA = fx.logical_divide(A, fx.make_layout(block_dim, 1))
    tB = fx.logical_divide(B, fx.make_layout(block_dim, 1))
    tC = fx.logical_divide(C, fx.make_layout(block_dim, 1))

    tA = fx.slice(tA, (None, bid))
    tB = fx.slice(tB, (None, bid))
    tC = fx.slice(tC, (None, bid))

    # Allocate register fragments, load, compute, store
    RABTy = fx.MemRefType.get(fx.T.f32(), fx.LayoutType.get(1, 1),
                              fx.AddressSpace.Register)
    copyAtom = fx.make_copy_atom(fx.UniversalCopy32b(), fx.Float32)
    rA = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
    rB = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
    rC = fx.memref_alloca(RABTy, fx.make_layout(1, 1))

    fx.copy_atom_call(copyAtom, fx.slice(tA, (None, tid)), rA)
    fx.copy_atom_call(copyAtom, fx.slice(tB, (None, tid)), rB)

    vC = fx.arith.addf(fx.memref_load_vec(rA), fx.memref_load_vec(rB))
    fx.memref_store_vec(vC, rC)
    fx.copy_atom_call(copyAtom, rC, fx.slice(tC, (None, tid)))

@flyc.jit
def vectorAdd(
    A: fx.Tensor, B: fx.Tensor, C,
    n: fx.Int32,
    const_n: fx.Constexpr[int],
    stream: fx.Stream = fx.Stream(None),
):
    block_dim = 64
    grid_x = (n + block_dim - 1) // block_dim
    vectorAddKernel(A, B, C, block_dim).launch(
        grid=(grid_x, 1, 1), block=[block_dim, 1, 1], stream=stream,
    )

# Usage
n = 128
A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
C = torch.zeros(n, dtype=torch.float32).cuda()
vectorAdd(A, B, C, n, n + 1, stream=torch.cuda.Stream())
torch.cuda.synchronize()
print("Result correct:", torch.allclose(C, A + B))
root@smci355-ccs-aus-m02-13:/workspace/FlyDSL# python vec_add.py
python: /workspace/FlyDSL/include/flydsl/Dialect/Fly/Utils/IntTupleUtils.h:930: IntTuple mlir::fly::intTupleSlice(const IntTupleBuilder<IntTuple>&, IntTuple, IntTupleAttr) [with IntTuple = IntTupleAttr]: Assertion `coord.rank() == tuple.rank() && "Mismatched ranks in slice"' failed.
Aborted (core dumped)
```

### Additional Information

FlyDSL commit: https://github.com/ROCm/FlyDSL/commit/f63022e19aec6af894efbefcfd2f4ac79ac87956
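
One possible reading of the assertion, offered as a guess rather than a confirmed diagnosis: the kernel slices each tensor twice with a two-element coordinate, first with `(None, bid)` and then with `(None, tid)`. If the first slice already reduces the divided tensor to rank 1, the second slice would pass a rank-2 coordinate to a rank-1 tuple, which matches the "Mismatched ranks in slice" message. The sketch below is a plain-Python model of that rank check only. It does not import or call FlyDSL, and the helper `slice_shape` and the shapes it uses are hypothetical stand-ins, not the library's actual behavior.

```python
# Plain-Python model of a rank check on slicing. Hypothetical helper, not FlyDSL code.

def slice_shape(shape, coord):
    """Keep modes whose coordinate is None; drop modes that are indexed."""
    modes = shape if isinstance(shape, tuple) else (shape,)
    coords = coord if isinstance(coord, tuple) else (coord,)
    if len(coords) != len(modes):
        raise ValueError(
            f"Mismatched ranks in slice: coord rank {len(coords)}, "
            f"tuple rank {len(modes)}"
        )
    kept = tuple(m for m, c in zip(modes, coords) if c is None)
    return kept[0] if len(kept) == 1 else kept


n, block_dim = 128, 64

# Assumption: logical_divide yields a rank-2 shape (block_dim, number_of_blocks).
divided = (block_dim, n // block_dim)

# First slice, modeled on fx.slice(tA, (None, bid)): rank 2 in, rank 1 out.
per_block = slice_shape(divided, (None, 0))

# Second slice, modeled on fx.slice(tA, (None, tid)): rank-2 coord on a rank-1 shape.
try:
    slice_shape(per_block, (None, 0))
    second_slice_error = None
except ValueError as exc:
    second_slice_error = str(exc)

print("shape after first slice:", per_block)
print("second slice error:", second_slice_error)
```

If this reading is right, the snippet in `docs/quickstart.rst` may need a different coordinate for the per-thread slice, but that has not been verified against the FlyDSL sources.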
