Conversation
|
Oh, AMD support has entered the chat 🚀 |
|
Made some adjustments and can confirm that this works on Windows (native ROCm 7 via TheRock) as well. Built it in the console, so we can get past these warnings:
I got curious and checked what Dependencies reports. Out of the three, … My custom-built … Now that it loads, I'm curious whether it actually works as intended or just errors out. I’m experiencing GPU hangs. After some debugging, I suspect it’s related to VMM + ROCm on Windows. Summary:
Suspected root cause: The AMD Windows WDDM driver may not fully support access to memory allocated via the VMM APIs. |
|
If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us. |
Now that ComfyUI x AMD is official, and this PR paves the way for ROCm Linux users to use it, it would be great to have |
|
@asagi4 Just wanted to check in - is there any update or further progress on this PR? |
Force-pushed from eb2e747 to e95bb5c
|
@tvukovic-amd Well, I can't do much beyond running hipify and making it compile. I don't know enough about ROCm to debug any issues. I rebased against master to get it to compile again, but it's untested. |
|
With the latest master it seems to be completely broken: all VRAM allocations fail with |
|
After @asagi4 confirmed that the latest updates break …, I'm sure …
After further benchmarking: some workloads still trigger GPU hangs, while others run fine. Previously, neither ran successfully. It seems that the new |
|
@0xDELUXA you mean you can run hipify without changes to master? How did you manage that? |
Using the script in my fork: https://github.com/0xDELUXA/comfy-aimdo_win-rocm/blob/master/build-rocm-windows.bat |
|
Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all. |
ROCm: |
|
I managed to locally fix things so that aimdo works for me again. Apparently vrambuf_create somehow works on CUDA without aligning to the chunk size, but with HIP (on Linux?) it fails. I don't know why it works on Windows. |
|
I haven’t encountered any OOMs in my workflows, but occasionally the GPU hangs at 100% usage. It would be great if Windows and Linux ROCm were even more similar. |
Force-pushed from e95bb5c to 9c4c215
|
With these changes things work for me again on Linux, or at least one workflow ran successfully. Previously, pretty much all allocations failed with "invalid argument" when mapping new VRAM allocations, presumably because the VRAM buffers weren't aligned to the defined chunk size. |
|
Hm, with the latest changes to master the fixing has gotten a bit more complicated, because aimdo's overriding functions have result types that don't match the CUDA functions, and hipify/clang doesn't like that. For example, they're declared to return int in the header, but the actual function prototype says cudaError_t. In addition, the actual aimdo implementations return CUresult... I'll try to see what happens if I just fix the return types and cast the return values, but that seems like something that should be fixed regardless of ROCm, since I don't think relying on implicit casts from integers is very good behaviour. @rattus128 what do you think? |
Force-pushed from 9c4c215 to 51d4d2f
|
Now it compiles, loads, and appears to work again. Haven't stress-tested it, though. |
|
Have you run any workload that exceeds VRAM and would OOM without it? Does the original example.py work on your system? Another thing: the ROCm documentation states that VMM is “under development” on Windows. Some APIs are even marked as beta on Linux, so I can’t really do anything to get it to work reliably on Windows. |
|
@0xDELUXA I haven't stress-tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know what exactly it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument". I wonder if it's because the pointer it's working with is … I can't help with Windows at all, unfortunately. It's been a long time since I last used it for anything. |
I see. I don’t really think the
Not a problem. With the build script from my fork, on Windows, it's as you said: "at least it compiles and runs, so it's a start." |
|
I'm rather curious about how your AMD Linux implementation behaves. Could you try running example.py, please? My output on Windows is this: |
|
@0xDELUXA I can't run it at all because it tries to import a function called vbars_analyze that doesn't seem to exist anywhere. |
|
I needed to modify it as well, and this one works for me. Commented out … |
|
I fixed the script and it gives me this: |
|
I see. I've also added some debug output, but shouldn't the script also print |
|
It might be that it runs like that because everything fits into VRAM. If I change the layer counts, at some point I just get OOMs. I don't think it's properly offloading anything automatically. |
Force-pushed from abc6671 to bd6b9bf
|
The context APIs are deprecated on HIP; apparently you're supposed to use device/stream APIs instead? I'm not sure if there's a simple 1:1 mapping or if the CUDA code could be changed to use non-deprecated HIP APIs. |
Another difference is that your output shows: … I always manually copy this file from: … It could be that this is what causes the hangs for you. Edit: if you've copied that, I assume you're using TheRock and not these wheels. |
We also get warnings about this on Windows; nevertheless, it builds aimdo. |
The amdhip64_7.dll (from the driver) caused the issue. I updated the example output; there is a difference, but only in
Could you please provide the example where enabling a LoRA causes a performance regression while comfy_aimdo is enabled, so I can double-check it on my machine?
Great! I was wondering why it doesn't work on your end.
Did something change? I can't seem to reproduce my own issue anymore XD Except for this one, which I assume occurs on your side too. Not having access to Sage/Flash makes aimdo kind of not worth using, sadly. |
I only followed your instructions and tested with 2 venvs:
and copied amdhip64_7.dll from … For this case, I installed triton-windows from https://github.com/triton-lang/triton-windows/actions/runs/22558044670 according to the instructions and tried to run Flux.2 Klein 9B. Triton error: |
|
Which attention type? Sage or Flash? |
|
This happens when running Sage attention. |
I'm using torch
Could you try Flash too? |
With Flash attention it gives the following error while running the model: |
So, the same error. I’ve opened an issue on … I've also updated the build script to automatically copy |
I fixed this bug |
I see. Don't really understand how I could get the same |
|
It comes from the installed Triton. Did you install this one: https://github.com/triton-lang/triton-windows/actions/runs/22558044670? |
Since |
|
When running |
|
The only thing I'm still unclear on is:
|
|
We at triton-windows are still busy fixing some unit tests on RDNA3 GPUs. Once the existing unit tests pass, I may find some time to check this issue.
I see. Thanks for the update and for looking into it later. |
Contribution Agreement
This is not really intended for merging as is, but for reference. hipify-clang can convert the CUDA code to HIP code pretty easily with a few fixes, and it actually allows you to run aimdo on ROCm.
You might have to make sure your Python venv is using your system ROCm libraries for this to work.
It does not work perfectly (I'm still getting PyTorch OOMs when it should be freeing memory), but workflows can run and produce good output.
I am not able to test this, but the HIP code should be compilable as is on NVIDIA platforms too. If you run build-rocm on an NVIDIA platform, hipcc and hipconfig should set it up to link against CUDA instead of ROCm, and the result should be basically identical to the CUDA implementation.