I encountered this error and have since resolved it. I want to document the process here and share it with other developers.
It turns out my driver version was 550, and nvidia-smi showed that it supports CUDA only up to 12.4. I switched to a CUDA 12.3 driver, and everything started working. The error only occurred in some of my code, specifically the code that uses TMA; other code, such as regular CUTLASS code, did not raise it. The TMA implementation seems to be more sensitive to the driver and CUDA version combination.
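
For anyone who wants to check for a similar mismatch, here is a minimal sketch, not a definitive diagnostic. It assumes that `nvidia-smi` prints a `CUDA Version: X.Y` field in its header and that `nvcc --version` prints a `release X.Y` string. The function names (`parse_driver_cuda_version`, `parse_toolkit_version`, `toolkit_exceeds_driver`) are hypothetical helpers written for this example, and the "toolkit newer than driver" rule is only a rough heuristic, not the exact cause of the error above.

```python
import re
import shutil
import subprocess


def parse_driver_cuda_version(smi_output):
    """Return (major, minor) from the 'CUDA Version: X.Y' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_toolkit_version(nvcc_output):
    """Return (major, minor) from the 'release X.Y' string, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_exceeds_driver(driver, toolkit):
    """Heuristic (assumption): flag a toolkit newer than the driver's maximum."""
    return toolkit > driver


def run(cmd):
    """Run a command and return its stdout, or '' if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return ""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


if __name__ == "__main__":
    driver = parse_driver_cuda_version(run(["nvidia-smi"]))
    toolkit = parse_toolkit_version(run(["nvcc", "--version"]))
    print("Max CUDA version supported by driver:", driver)
    print("CUDA toolkit version (nvcc):", toolkit)
    if driver and toolkit and toolkit_exceeds_driver(driver, toolkit):
        print("Toolkit is newer than what the driver reports; possible mismatch.")
```

If either tool is missing, the corresponding version is printed as `None` and no comparison is made.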