
[DRAFT] Barebones ROCM support #2

Open
asagi4 wants to merge 16 commits into Comfy-Org:master from asagi4:hack/rocm-support

Conversation

asagi4 commented Feb 5, 2026

Contribution Agreement

  • I agree that my contributions are licensed under the GPLv3.
  • I grant Comfy Org the rights to relicense these contributions as outlined in CONTRIBUTING.md.

This is not really intended for merging as is, but for reference. hipify-clang can convert the CUDA code to HIP pretty easily with a few fixes, and the result actually lets you run aimdo on ROCm.

You might have to make sure your Python venv is using your system ROCm libraries for this to work.

It does not work perfectly (I'm still getting PyTorch OOMs when it should be freeing memory), but workflows can run and produce good output.

I am not able to test this, but the HIP code should also be compilable as is on Nvidia platforms. If you run build-rocm on an Nvidia platform, hipcc and hipconfig should set it up to link against CUDA instead of ROCm, and the result should be essentially identical to the CUDA implementation.

0xDELUXA commented Feb 6, 2026

Oh, AMD support has entered the chat 🚀

0xDELUXA commented Feb 7, 2026

Made some adjustments and can confirm that this works on Windows (native ROCm 7 via TheRock) as well. Built aimdo.dll locally, installed this custom wheel, and got:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB)
DynamicVRAM support detected and enabled

in the console.

So we can get past these warnings:
No working comfy-aimdo install detected. DynamicVRAM support disabled. Falling back to legacy ModelPatcher. VRAM estimates may be unreliable especially on Windows
NOTE: comfy-aimdo is currently only support for Nvidia GPUs

pip install comfy-aimdo automatically installs the Windows (Nvidia-only) package. It does include an aimdo.dll, but on AMD it produces the following error:

comfy-aimdo failed to load: E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll: Could not find module 'E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll' (or one of its dependencies). Try using the full path with constructor syntax.

I got curious and checked what Dependencies reports. Out of the three .dlls it requires, we AMD users are missing nvcuda.dll.

My custom-built aimdo.dll, which actually loads on AMD, replaces the nvcuda.dll dependency with amdhip64_7.dll.

Now that it loads, I'm curious whether it actually works as intended or just errors out.

Edit:

I’m experiencing GPU hangs. After some debugging, I suspect it’s related to VMM + ROCm on Windows.

Summary:
VMM allocation APIs report success, but the GPU cannot reliably access the allocated memory.

  1. All hipMemCreate, hipMemMap, and hipMemSetAccess calls return success.
  2. hipMemsetD8 also returns success (the async operation is queued).
  3. hipDeviceSynchronize completes without errors.
  4. PyTorch kernel hangs when attempting to use the memory.

Suspected root cause: The AMD Windows WDDM driver may not fully support access to memory allocated via the VMM APIs.

@tvukovic-amd

If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

@0xDELUXA

> If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

Now that ComfyUI x AMD is official, and this PR paves the way for ROCm Linux users to use it, it would be great to have comfy-aimdo running on ROCm Windows too. Theoretically, what is preventing it from working? I've tried many things, but it seems there’s something I haven’t been able to figure out.

@tvukovic-amd

@asagi4 Just wanted to check in - is there any update or further progress on this PR?

asagi4 (author) commented Feb 19, 2026

@tvukovic-amd Well, I can't do much beyond running hipify and making it compile. I don't know enough about ROCm to debug any issues.

I rebased against master to get it to compile again, but it's untested.

asagi4 (author) commented Feb 19, 2026

With the latest master it seems to be completely broken: all VRAM allocations fail with aimdo: hip_src/vrambuf.c:56:ERROR:VRAM Allocation failed (non OOM) and torch throws an OOM exception immediately.

0xDELUXA commented Feb 20, 2026

After @asagi4 confirmed that the latest updates break comfy-aimdo on AMD (Linux), I decided to try building the version checked out from the master branch. I have a very long, workaround-upon-workaround build script that I use on Windows (mainly for hipify; otherwise it just doesn't work), and somehow it avoids the GPU hang issue I was getting when comfy-aimdo was enabled.

I'm confident comfy-aimdo is actually being used here, based on the (filtered) console output:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB) DynamicVRAM support detected and enabled
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 0 patches attached.
Model Initializing ...
Model Initialization complete!
Prompt executed in X seconds

Edit:

After further benchmarking, some workloads still trigger GPU hangs, while others run fine. Previously, neither ran successfully. It seems the new Model Initializing... phase is quite heavy on AMD, and that is where it occasionally hangs.

asagi4 (author) commented Feb 20, 2026

@0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

0xDELUXA commented Feb 20, 2026

> @0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

Using the script in my fork: https://github.com/0xDELUXA/comfy-aimdo_win-rocm/blob/master/build-rocm-windows.bat

asagi4 (author) commented Feb 20, 2026

Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all.

0xDELUXA commented Feb 20, 2026

> Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++) but I don't see you dealing with that at all

ROCm: 7.12.0a20260218
PyTorch: 2.12.0a0+rocm7.12.0a20260218
OS: Windows 11

asagi4 (author) commented Feb 20, 2026

I managed to locally fix things so that aimdo works for me again. I think vrambuf_create has some alignment issue that appears with HIP. Diff for the hipified source here:

diff -ru hip_src/vrambuf.c hip_src_fixed2/vrambuf.c
--- hip_src/vrambuf.c   2026-02-20 20:34:56.698464966 +0200
+++ hip_src_fixed2/vrambuf.c    2026-02-20 20:32:52.685112770 +0200
@@ -7,8 +7,16 @@
 SHARED_EXPORT
 void *vrambuf_create(int device, size_t max_size) {
     VramBuffer *buf;
+    if ((max_size / VRAM_CHUNK_SIZE) * VRAM_CHUNK_SIZE < max_size) {
+       log(ERROR, "??? alignment %zu\n", max_size);
+       max_size = ((max_size / VRAM_CHUNK_SIZE) + 1) * VRAM_CHUNK_SIZE;
+       log(ERROR, "??? fixed alignment %zu\n", max_size);
+    }

-    buf = (VramBuffer *)calloc(1, sizeof(*buf) + sizeof(hipMemGenericAllocationHandle_t) * max_size / VRAM_CHUNK_SIZE);
+    size_t size = 0;
+    size = sizeof(*buf) + (sizeof(hipMemGenericAllocationHandle_t) * (max_size / VRAM_CHUNK_SIZE));
+    log(ERROR, "vrambuf_create calloc %zu\n", size);
+    buf = (VramBuffer *)calloc(1, size);
     if (!buf) {
         return NULL;
     }
@@ -53,7 +61,7 @@
         }
         if ((err = three_stooges(buf->base_ptr + buf->allocated, to_allocate, buf->device, &handle)) != hipSuccess) {
             if (err != hipErrorOutOfMemory) {
-                log(ERROR, "VRAM Allocation failed (non OOM): %d\n", err);
+                log(ERROR, "VRAM Allocation failed (non OOM): %s\n", hipGetErrorString(err));
                 return false;
             }
             log(DEBUG, "Pytorch allocator attempt exceeds available VRAM ...\n");

Apparently vrambuf_create somehow works on CUDA without aligning to the chunk size, but with HIP (on Linux, at least) it fails. I don't know why it works on Windows.

0xDELUXA commented Feb 20, 2026

I haven’t encountered any OOMs in my workflows, but occasionally the GPU hangs at 100% usage. It would be great if Windows and Linux ROCm were even more similar.

asagi4 (author) commented Feb 20, 2026

With these changes, things work for me again on Linux, or at least one workflow ran successfully. Previously, pretty much all allocations failed with "invalid argument" when mapping new VRAM allocations, presumably because the VRAM buffers weren't aligned to the defined chunk size.

asagi4 (author) commented Feb 22, 2026

Hm, with the latest changes to master the fixup has gotten a bit more complicated, because aimdo's overriding functions have result types that don't match the CUDA functions they override, and hipify/clang doesn't like that.

For example, they're declared to return int in the header, but the actual function prototype says cudaError_t. In addition, the actual aimdo implementations return CUresult values...

I'll see what happens if I just fix the return types and cast the return values, but that seems like something that should be fixed regardless of ROCm, since I don't think relying on implicit casts from integers is very good behaviour.

@rattus128 what do you think?

asagi4 (author) commented Feb 22, 2026

Now it compiles, loads and appears to work again.

Haven't stress-tested though.

0xDELUXA commented Feb 22, 2026

Have you run any workload that exceeds VRAM and would OOM without comfy-aimdo?

Does the original example.py work on your system?

Another thing is that the ROCm documentation states that VMM is “under development” on Windows. Some APIs are even marked as beta on Linux too, so I can’t really do anything to get it to work reliably on Windows.

asagi4 (author) commented Feb 22, 2026

@0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know exactly what it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument".

I wonder if, since the pointer it's working with is vrambuf->base_addr + vrambuf->allocated, some allocation patterns end up producing an invalid pointer.

I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

0xDELUXA commented Feb 22, 2026

> @0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know exactly what it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument".
>
> I wonder if, since the pointer it's working with is vrambuf->base_addr + vrambuf->allocated, some allocation patterns end up producing an invalid pointer.

I see. I don’t really think the comfy-aimdo dev has much insight into the AMD side, so it’s just us. I assume there will still be things that work reliably on Nvidia but not as well on AMD.

> I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

Not a problem - with the build script from my fork on Windows, as you said, "at least it compiles and runs, so it's a start."

0xDELUXA commented Feb 23, 2026

I'm rather curious how your AMD Linux implementation behaves. Could you try running example.py, please? My output on Windows is this.

asagi4 (author) commented Feb 23, 2026

@0xDELUXA I can't run it at all because it tries to import a function called vbars_analyze that doesn't seem to exist anywhere.

0xDELUXA commented Feb 23, 2026

I needed to modify it as well, and this one works for me. Commented out vbars_analyze, etc.

asagi4 (author) commented Feb 23, 2026

I fixed the script and it gives me this:

Init complete
aimdo: hip_src/control.c:67:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 7900 XTX (VRAM: 24560 MB)
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=131072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xabacef0
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xabacef0
##################### Run the first model #######################
Some weights will be loaded and stay there for all iterations
Some weights will be offloaded

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[First Load] Populated weight at offset: 400.0M
[First Load] Populated weight at offset: 800.0M
[First Load] Populated weight at offset: 1200.0M
[First Load] Populated weight at offset: 1600.0M
[First Load] Populated weight at offset: 2000.0M
[First Load] Populated weight at offset: 2400.0M
[First Load] Populated weight at offset: 2800.0M
[First Load] Populated weight at offset: 3200.0M
[First Load] Populated weight at offset: 3600.0M
[First Load] Populated weight at offset: 4000.0M
[First Load] Populated weight at offset: 4400.0M
[First Load] Populated weight at offset: 4800.0M
[First Load] Populated weight at offset: 5200.0M
[First Load] Populated weight at offset: 5600.0M
[First Load] Populated weight at offset: 6000.0M
[First Load] Populated weight at offset: 6400.0M
[First Load] Populated weight at offset: 6800.0M
[First Load] Populated weight at offset: 7200.0M
[First Load] Populated weight at offset: 7600.0M
[First Load] Populated weight at offset: 8000.0M
[First Load] Populated weight at offset: 8400.0M
[First Load] Populated weight at offset: 8800.0M
[First Load] Populated weight at offset: 9200.0M
[First Load] Populated weight at offset: 9600.0M
[First Load] Populated weight at offset: 10000.0M
[First Load] Populated weight at offset: 10400.0M
[First Load] Populated weight at offset: 10800.0M
[First Load] Populated weight at offset: 11200.0M
[First Load] Populated weight at offset: 11600.0M
[First Load] Populated weight at offset: 12000.0M
[First Load] Populated weight at offset: 12400.0M
[First Load] Populated weight at offset: 12800.0M
[First Load] Populated weight at offset: 13200.0M
[First Load] Populated weight at offset: 13600.0M
[First Load] Populated weight at offset: 14000.0M
[First Load] Populated weight at offset: 14400.0M
[First Load] Populated weight at offset: 14800.0M
[First Load] Populated weight at offset: 15200.0M
[First Load] Populated weight at offset: 15600.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    16400 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     7820 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 16000 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=3072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xb135160
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xb135160
##################### Run the second model #######################
Everything will be loaded and will displace some weights of the first model

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!
[First Load] Populated weight at offset: 603.2421875M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17824 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6396 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 3544] Ptr: 0x7fa5bb000000 | Size:  622592k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     608 MB
##################### Run the first model again #######################
Some weights will still be loaded from before and be there first iteration
Some weights will get re-loaded on the first interation
The rest will be offloaded again

aimdo: hip_src/model-vbar.c:234:DEBUG:vbar_prioritize vbar=0xabacef0
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[No Load Needed] Reusing weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17616 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6604 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Some of the ERROR logs from aimdo aren't actually errors; they're just things I added that I wanted to log without enabling debug logging.
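The AttributeError: 'NoneType' object has no attribute 'vbar_free' at the end of the run is a classic interpreter-shutdown issue: by the time __del__ runs, the module-level FFI handle may already have been torn down to None. A defensive pattern that tolerates this (names are hypothetical, this is not the actual model_vbar.py code):

```python
class ModelVBAR:
    """Sketch of a __del__ that tolerates interpreter shutdown.

    `lib` stands in for the loaded aimdo library handle; at shutdown,
    module globals and attributes may already be gone, so cleanup must
    re-check everything it touches and must never raise.
    """

    def __init__(self, lib, handle):
        self._lib = lib
        self._handle = handle

    def close(self):
        # getattr guards against partially-constructed or torn-down objects.
        lib = getattr(self, "_lib", None)
        handle = getattr(self, "_handle", None)
        if lib is not None and handle is not None:
            lib.vbar_free(handle)  # hypothetical FFI call, mirrors the traceback
            self._handle = None
            self._lib = None

    def __del__(self):
        # Exceptions in __del__ are only "ignored" with a noisy traceback,
        # as seen in the log above, so swallow shutdown-time races instead.
        try:
            self.close()
        except Exception:
            pass
```

Exposing close() explicitly (or a context manager) and treating __del__ as a best-effort fallback also makes the free deterministic instead of GC-timing-dependent.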

0xDELUXA commented Feb 23, 2026

I see. I've also added some debug output, but shouldn't the script also print [Offloaded] alongside [First Load] and [No Load Needed], considering the "Some weights will be offloaded" and "The rest will be offloaded again" comments included in the script by rattus128?
Based on the outputs, this is the main difference between comfy-aimdo on AMD Linux and Windows at present.
Which AMD GPU do you have, btw? Mine has 16 GB of VRAM; if yours has more, that could explain the offload difference.

asagi4 (author) commented Feb 23, 2026

It might be that it runs like that because everything fits into VRAM. If I change the layer counts, at some point I just get OOMs. I don't think it's properly offloading anything automatically.

asagi4 force-pushed the hack/rocm-support branch from abc6671 to bd6b9bf on March 9, 2026 15:36
asagi4 (author) commented Mar 9, 2026

Context APIs are deprecated on HIP; apparently you're supposed to use the device/stream APIs instead?

I'm not sure if there's a simple 1:1 mapping, or if the CUDA code could be changed to map to non-deprecated HIP APIs.

@0xDELUXA

0xDELUXA commented Mar 9, 2026

Thank you for providing the new wheel. The script works well without hangs/crashes. Here is the local output. It has a couple of differences from your output: for example, [First Load] Populated weight at offset: 12210.0M happens on my machine, while your output shows [Offloaded] offset: 12210.0M. I also tried the Flux.2 [Klein] 9B Text to Image workflow with the new comfy-aimdo wheel, but it still hangs during model load.

Another difference is that your output shows:
HIP Library Path: C:\WINDOWS\SYSTEM32\amdhip64_7.dll
This means aimdo uses the system-wide amdhip64_7.dll that is installed with Adrenalin.

I always manually copy this file from:
ComfyUI\venv\Lib\site-packages\_rocm_sdk_core\bin\
into:
ComfyUI\venv\Lib\site-packages\comfy_aimdo\
to ensure it uses the one provided by TheRock.

It could be that this is what causes the hangs for you.
I also had hangs here, but they were caused by aimdo using Adrenalin's amdhip64_7.dll.

Edit:
I'm pretty sure this is the issue on your side. I just tested it, and example.py works that way, but ComfyUI doesn't work at all.

If you've copied that .dll, and aimdo loads it, then you shouldn't see HIP Library Path: C:\WINDOWS\SYSTEM32\amdhip64_7.dll anymore. That way it should work.

I assume you're using TheRock and not these wheels.
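For anyone scripting this step, a small helper along these lines works (the venv layout is assumed from the paths above, not from any comfy_aimdo API):

```python
import shutil
from pathlib import Path


def hip_dll_paths(venv_root):
    """Return (source, destination) for TheRock's amdhip64_7.dll in a venv.

    Layout assumed from the thread:
    site-packages/_rocm_sdk_core/bin/ -> site-packages/comfy_aimdo/
    """
    site = Path(venv_root) / "Lib" / "site-packages"
    src = site / "_rocm_sdk_core" / "bin" / "amdhip64_7.dll"
    dst = site / "comfy_aimdo" / "amdhip64_7.dll"
    return src, dst


def copy_hip_dll(venv_root):
    """Copy TheRock's HIP runtime next to aimdo so it wins over Adrenalin's."""
    src, dst = hip_dll_paths(venv_root)
    if not src.is_file():
        return False  # TheRock wheels not installed in this venv
    shutil.copy2(src, dst)
    return True
```

After the copy, aimdo's "HIP Library Path" log line should point into site-packages instead of C:\WINDOWS\SYSTEM32.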

@0xDELUXA

0xDELUXA commented Mar 9, 2026

context APIs are deprecated on HIP. Apparently you're supposed to use device / stream APIs?

I'm not sure if there's a simple 1:1 mapping or if the cuda code could be changed to map to non-deprecated HIP APIs

We also get warnings about this on Windows; nevertheless, it builds aimdo.

@tvukovic-amd

tvukovic-amd commented Mar 10, 2026

amdhip64_7.dll (from the driver) caused the issue. I updated the example output; it now differs only in [First Load] Populated weight at offset: 12210.0M on Iteration 0.
The Flux.2 [Klein] 9B Text to Image workflow from the templates no longer hangs and executes properly. I compared performance results with and without comfy-aimdo:

  • Flux.2 Klein 9B Distilled - comfy_aimdo from your wheel 1.17 it/s, freshly built comfy_aimdo 1.2 it/s, without comfy_aimdo 1.17 it/s
  • Flux.2 Klein 9B - comfy_aimdo from your wheel 0.58 it/s, freshly built comfy_aimdo 0.6 it/s, without comfy_aimdo 0.62 it/s.

Can you please provide an example where enabling a LoRA causes a performance regression with comfy_aimdo enabled, so I can double-check it on my machine?

@0xDELUXA

0xDELUXA commented Mar 10, 2026

Great! I was wondering why it didn't work on your end.

Can you please provide the example when enabling LoRA causes performance regression when comfy_aimdo is enabled to double check on my machine?

Did something change? I can't seem to reproduce my own issue anymore XD

Except for this one, which I assume occurs on your side too. Not having access to Sage/Flash makes aimdo kind of not worth using, sadly.

@tvukovic-amd

I only followed your instructions and tested with 2 venvs:

  • one with comfy_aimdo you provided yesterday
  • another one with a freshly built comfy_aimdo

and copied amdhip64_7.dll from venv\Lib\site-packages\_rocm_sdk_core\bin\ to venv\Lib\site-packages\comfy_aimdo\.

Following this case, I installed triton-windows from https://github.com/triton-lang/triton-windows/actions/runs/22558044670 per the instructions and tried to run Flux.2 Klein 9B. The Triton error ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) from the issue didn't appear, but new errors occurred while running the model:

Error running sage attention: Command '['c:\\develop\\ComfyUI\\venv\\Lib\\site-packages\\_rocm_sdk_core\\lib\\llvm\\bin\\clang-cl.exe', 'C:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.c', '/nologo', '/O2', '/LD', '/wd4819', '/std:c11', '/IC:\\develop\\ComfyUI\\venv\\Lib\\site-packages\\triton\\backends\\amd\\include', '/IC:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f', '/IC:\\Python312\\Include', '/IC:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.44.35207\\include', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\shared', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\ucrt', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\um', '/FoC:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.cp312-win_amd64.obj', '/link', '/LIBPATH:C:\\Python312\\libs', '/LIBPATH:C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.44.35207\\lib\\x64', '/LIBPATH:C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\ucrt\\x64', '/LIBPATH:C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\um\\x64', 'python312.lib', '/OUT:C:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.cp312-win_amd64.pyd', '/IMPLIB:C:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.cp312-win_amd64.lib', '/PDB:C:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.cp312-win_amd64.pdb']' returned non-zero exit status 1., using pytorch attention instead.
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1078,39): error: call to undeclared function 'alloca'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
 1078 |   PyObject **args_data = (PyObject **)alloca(num_args * sizeof(PyObject *));
      |                                       ^
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1078,26): warning: cast to 'PyObject **' (aka 'struct _object **') from smaller integer type 'int' [-Wint-to-pointer-cast]
 1078 |   PyObject **args_data = (PyObject **)alloca(num_args * sizeof(PyObject *));
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1089,19): warning: cast to 'void **' from smaller integer type 'int' [-Wint-to-pointer-cast]
 1089 |   void **params = (void **)alloca(num_params * sizeof(void *));
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1102,24): error: incompatible integer to pointer conversion assigning to 'void *' from 'int' [-Wint-conversion]
 1102 |     params[params_idx] = alloca(extractor.size);
      |                        ^ ~~~~~~~~~~~~~~~~~~~~~~
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1108,22): error: incompatible integer to pointer conversion assigning to 'void *' from 'int' [-Wint-conversion]
 1108 |   params[params_idx] = alloca(sizeof(void *));
      |                      ^ ~~~~~~~~~~~~~~~~~~~~~~
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1113,22): error: incompatible integer to pointer conversion assigning to 'void *' from 'int' [-Wint-conversion]
 1113 |   params[params_idx] = alloca(sizeof(void *));
      |                      ^ ~~~~~~~~~~~~~~~~~~~~~~
2 warnings and 4 errors generated.

@0xDELUXA

Which attention type? Sage or Flash?

@tvukovic-amd

tvukovic-amd commented Mar 10, 2026

This happens when running sage attention.
FYI, I am using the latest nightly TheRock build for torch, torchvision and torchaudio.

@0xDELUXA

0xDELUXA commented Mar 10, 2026

FYI I am using latest nightly theRock build for torch, torchvision and torchaudio.

I'm using torch 2.12.0a0+rocm7.12.0a20260304

This happens with running sage attention.

Could you try Flash too?

@tvukovic-amd

With flash attention it gives the following error while running the model: Flash Attention failed, using default SDPA: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?).

@0xDELUXA

0xDELUXA commented Mar 11, 2026

So the same error. I’ve opened an issue on triton-windows and reached out to the maintainers on Discord. It’s now in their hands.

I also updated the build script to automatically copy amdhip64_7.dll, which is mandatory on Windows.

@tvukovic-amd

tvukovic-amd commented Mar 11, 2026

I fixed this bug: in venv\Lib\site-packages\triton\backends\amd\driver.c, #include <malloc.h> needs to be added after #include <windows.h>. After that, sage attention also hits the same error: Error running sage attention: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?), using pytorch attention instead.

@0xDELUXA

I see. I don't really understand how I could get the same cpu tensor? error even without this change.

@tvukovic-amd

It comes from the installed Triton. Did you install this one: https://github.com/triton-lang/triton-windows/actions/runs/22558044670?

@0xDELUXA

0xDELUXA commented Mar 11, 2026

Since post26 is now available on PyPI, I used pip install triton-windows.

@tvukovic-amd

When running pip install triton-windows I don't need any additional fixes; it only has the Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) error.

@0xDELUXA

0xDELUXA commented Mar 11, 2026

The only thing I'm still unclear on is:

  • Why does the upstream Triton work on Linux with comfy-aimdo, but triton-windows doesn't (on Windows)?
  • Why does it work on Nvidia + triton-windows but not on AMD?

@woct0rdho

ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) basically means Triton expects a tensor to be on the GPU but it's actually on the CPU. I guess it's not an issue in Triton but an issue somewhere else.

We at triton-windows are still busy fixing some unit tests on RDNA3 GPUs. Once the existing unit tests pass, I may find time to check this issue.
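As an illustration only (a toy stand-in, not Triton's actual launcher code), the check behind that message amounts to rejecting any kernel argument whose tensor lives on the CPU:

```python
class FakeTensor:
    """Toy stand-in for a framework tensor; only the device attribute matters."""

    def __init__(self, device):
        self.device = device  # e.g. "cuda"/"hip" for GPU, "cpu" otherwise


def check_kernel_args(*tensors):
    """Reject CPU tensors before a (hypothetical) kernel launch."""
    for i, t in enumerate(tensors):
        if t.device == "cpu":
            raise ValueError(
                f"Pointer argument (at {i}) cannot be accessed from Triton "
                "(cpu tensor?)"
            )
```

So when the message shows up with comfy-aimdo enabled but not without it, the suspect is some tensor that the offloading machinery left (or moved back) on the host before the Triton kernel ran.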

@0xDELUXA

I see. Thanks for the update and for looking into it later.
