Root cause hypothesis: The AMDGPU LLVM backend's scratch memory lowering for CDNA targets uses flat addressing (i64), and the instruction selection for i64 = FrameIndex fails when the kernel has a large stack frame. This code path is likely not exercised by typical HPC/AI workloads. The same kernel may compile successfully on RDNA targets where scratch is accessed via buffer instructions with i32 offsets.
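One way to probe this hypothesis outside of the full pipeline is to feed a tiny kernel with a large private (scratch) array straight to `llc` for one CDNA and one RDNA target and compare the results. The sketch below only generates the LLVM IR and the two command lines. The kernel name `big_frame`, the frame size, and the choice of `gfx90a` and `gfx1100` as representative targets are my assumptions, not taken from the failing kernel, and I have not confirmed that this exact IR triggers the selection failure.

```python
# Sketch: generate a large-stack-frame AMDGPU kernel in LLVM IR and the llc
# invocations for one CDNA and one RDNA target. All names and sizes here are
# illustrative assumptions, not taken from the failing Genesis kernel.

IR_TEMPLATE = """\
target triple = "amdgcn-amd-amdhsa"

define amdgpu_kernel void @big_frame(ptr addrspace(1) %out, i32 %idx) {
entry:
  ; Private (scratch) allocation; addrspace(5) is the AMDGPU private space.
  %buf = alloca [N_ELEMS x i32], align 4, addrspace(5)
  %slot = getelementptr inbounds [N_ELEMS x i32], ptr addrspace(5) %buf, i32 0, i32 %idx
  ; Volatile accesses keep the stack object from being optimized away.
  store volatile i32 1, ptr addrspace(5) %slot, align 4
  %val = load volatile i32, ptr addrspace(5) %slot, align 4
  store i32 %val, ptr addrspace(1) %out, align 4
  ret void
}
"""


def make_ir(n_elems: int) -> str:
    """Return the IR text with the scratch array sized to n_elems i32 values."""
    return IR_TEMPLATE.replace("N_ELEMS", str(n_elems))


def llc_command(ir_path: str, mcpu: str) -> list[str]:
    """Build an llc command line that compiles ir_path for the given GPU."""
    return [
        "llc",
        "-mtriple=amdgcn-amd-amdhsa",
        f"-mcpu={mcpu}",
        ir_path,
        "-o",
        f"big_frame.{mcpu}.s",
    ]


if __name__ == "__main__":
    ir_path = "big_frame.ll"
    with open(ir_path, "w") as f:
        f.write(make_ir(1 << 20))  # 4 MiB of scratch; size is a guess

    # gfx90a (CDNA2) and gfx1100 (RDNA3) are assumed representative targets.
    for mcpu in ("gfx90a", "gfx1100"):
        print(" ".join(llc_command(ir_path, mcpu)))
```

If the CDNA command fails in instruction selection while the RDNA one succeeds, that would support the hypothesis; if both succeed, the trigger probably depends on something specific to the original kernel.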
Related to Genesis issue Genesis-Embodied-AI/Genesis#2570.
Thanks for checking
David