
[DRAFT] Barebones ROCM support #2

Open
asagi4 wants to merge 16 commits into Comfy-Org:master from asagi4:hack/rocm-support

Conversation

asagi4 commented Feb 5, 2026

Contribution Agreement

  • I agree that my contributions are licensed under the GPLv3.
  • I grant Comfy Org the rights to relicense these contributions as outlined in CONTRIBUTING.md.

This is not really intended for merging as is, but for reference. hipify-clang can convert the CUDA code to HIP pretty easily with a few fixes, and the result actually lets you run aimdo on ROCm.

You might have to make sure your Python venv is using your system ROCm libraries for this to work.

It does not work perfectly (I'm still getting PyTorch OOMs when it should be freeing memory), but workflows can run and produce good output.

I am not able to test this, but the HIP code should also be compilable as is on Nvidia platforms. If you run build-rocm on an Nvidia platform, hipcc and hipconfig should set it up to link against CUDA instead of ROCm, and the result should be essentially identical to the CUDA implementation.

0xDELUXA commented Feb 6, 2026

Oh, AMD support has entered the chat 🚀

0xDELUXA commented Feb 7, 2026

Made some adjustments and can confirm that this works on Windows (native ROCm 7 via TheRock) as well. Built aimdo.dll locally, installed this custom wheel, and got:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB)
DynamicVRAM support detected and enabled

in the console.

So we can get past these warnings:
No working comfy-aimdo install detected. DynamicVRAM support disabled. Falling back to legacy ModelPatcher. VRAM estimates may be unreliable especially on Windows
NOTE: comfy-aimdo is currently only support for Nvidia GPUs

pip install comfy-aimdo automatically installs the Windows (Nvidia-only) package. It does include an aimdo.dll, but on AMD it produces the following error:

comfy-aimdo failed to load: E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll: Could not find module 'E:\ComfyUI\venv\Lib\site-packages\comfy_aimdo\aimdo.dll' (or one of its dependencies). Try using the full path with constructor syntax.

I got curious and checked what Dependencies reports. Out of the three .dlls it requires, we AMD users are missing nvcuda.dll.

My custom-built aimdo.dll, which actually loads on AMD, replaces the nvcuda.dll dependency with amdhip64_7.dll.

Now that it loads, I'm curious whether it actually works as intended or just errors out.

Edit:

I’m experiencing GPU hangs. After some debugging, I suspect it’s related to VMM + ROCm on Windows.

Summary:
VMM allocation APIs report success, but the GPU cannot reliably access the allocated memory.

  1. All hipMemCreate, hipMemMap, and hipMemSetAccess calls return success.
  2. hipMemsetD8 also returns success (the async operation is queued).
  3. hipDeviceSynchronize completes without errors.
  4. PyTorch kernel hangs when attempting to use the memory.

Suspected root cause: The AMD Windows WDDM driver may not fully support access to memory allocated via the VMM APIs.

@tvukovic-amd

If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

@0xDELUXA

> If you need any assistance from the AMD team or have additional questions regarding ROCm on Windows, please feel free to reach out to us.

Now that ComfyUI x AMD is official, and this PR paves the way for ROCm Linux users to use it, it would be great to have comfy-aimdo running on ROCm Windows too. Theoretically, what is preventing it from working? I've tried many things, but it seems there’s something I haven’t been able to figure out.

@tvukovic-amd

@asagi4 Just wanted to check in - is there any update or further progress on this PR?

asagi4 (author) commented Feb 19, 2026

@tvukovic-amd Well, I can't do much beyond running hipify and making it compile. I don't know enough about ROCm to debug any issues.

I rebased against master to get it to compile again, but it's untested.

asagi4 (author) commented Feb 19, 2026

With the latest master it seems to be completely broken: all VRAM allocations fail with aimdo: hip_src/vrambuf.c:56:ERROR:VRAM Allocation failed (non OOM) and torch throws an OOM exception immediately.

0xDELUXA commented Feb 20, 2026

After @asagi4 confirmed that the latest updates break comfy-aimdo on AMD (Linux), I decided to try building the version checked out from the master branch. I have a very long, workaround-upon-workaround build script that I use on Windows (mainly for hipify; otherwise it just doesn't work), and somehow it avoids the GPU hang issue I was getting when comfy-aimdo was enabled.

I'm confident comfy-aimdo is actually being used here, based on the (filtered) console output:

aimdo: hip_src\control.c:51:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 9060 XT (VRAM: 16304 MB) DynamicVRAM support detected and enabled
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 0 patches attached.
Model Initializing ...
Model Initialization complete!
Prompt executed in X seconds

Edit:

After further benchmarking, some workloads still trigger GPU hangs, while others run fine. Previously, neither ran successfully. It seems the new Model Initializing... phase is quite heavy on AMD, and that is where it occasionally hangs.

asagi4 (author) commented Feb 20, 2026

@0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

0xDELUXA commented Feb 20, 2026

> @0xDELUXA you mean you can run hipify without changes to master? How did you manage that?

Using the script in my fork: https://github.com/0xDELUXA/comfy-aimdo_win-rocm/blob/master/build-rocm-windows.bat

asagi4 (author) commented Feb 20, 2026

Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++), but I don't see you dealing with that at all.

0xDELUXA commented Feb 20, 2026

> Which version of ROCm do you have? My hipify-clang fails because it treats the implicit void* casts as errors (I think because it tries to compile the code as C++) but I don't see you dealing with that at all

ROCm: 7.12.0a20260218
PyTorch: 2.12.0a0+rocm7.12.0a20260218
OS: Windows 11

asagi4 (author) commented Feb 20, 2026

I managed to locally fix things so that aimdo works for me again. I think vrambuf_create has some alignment issue that appears with HIP. Diff for the hipified source here:

diff -ru hip_src/vrambuf.c hip_src_fixed2/vrambuf.c
--- hip_src/vrambuf.c   2026-02-20 20:34:56.698464966 +0200
+++ hip_src_fixed2/vrambuf.c    2026-02-20 20:32:52.685112770 +0200
@@ -7,8 +7,16 @@
 SHARED_EXPORT
 void *vrambuf_create(int device, size_t max_size) {
     VramBuffer *buf;
+    if ((max_size / VRAM_CHUNK_SIZE) * VRAM_CHUNK_SIZE < max_size) {
+       log(ERROR, "??? alignment %zu\n", max_size);
+       max_size = ((max_size / VRAM_CHUNK_SIZE) + 1) * VRAM_CHUNK_SIZE;
+       log(ERROR, "??? fixed alignment %zu\n", max_size);
+    }

-    buf = (VramBuffer *)calloc(1, sizeof(*buf) + sizeof(hipMemGenericAllocationHandle_t) * max_size / VRAM_CHUNK_SIZE);
+    size_t size = 0;
+    size = sizeof(*buf) + (sizeof(hipMemGenericAllocationHandle_t) * (max_size / VRAM_CHUNK_SIZE));
+    log(ERROR, "vrambuf_create calloc %zu\n", size);
+    buf = (VramBuffer *)calloc(1, size);
     if (!buf) {
         return NULL;
     }
@@ -53,7 +61,7 @@
         }
         if ((err = three_stooges(buf->base_ptr + buf->allocated, to_allocate, buf->device, &handle)) != hipSuccess) {
             if (err != hipErrorOutOfMemory) {
-                log(ERROR, "VRAM Allocation failed (non OOM): %d\n", err);
+                log(ERROR, "VRAM Allocation failed (non OOM): %s\n", hipGetErrorString(err));
                 return false;
             }
             log(DEBUG, "Pytorch allocator attempt exceeds available VRAM ...\n");

Apparently vrambuf_create somehow works on CUDA without aligning to the chunk size, but with HIP (on Linux, at least) it fails. I don't know why it works on Windows.

0xDELUXA commented Feb 20, 2026

I haven’t encountered any OOMs in my workflows, but occasionally the GPU hangs at 100% usage. It would be great if Windows and Linux ROCm were even more similar.

asagi4 (author) commented Feb 20, 2026

With these changes, things work for me again on Linux, or at least one workflow ran successfully. Previously, pretty much all allocations failed with "invalid argument" when mapping new VRAM allocations, presumably because the VRAM buffers weren't aligned to the defined chunk size.

asagi4 (author) commented Feb 22, 2026

Hm, with the latest changes to master the fixup has gotten a bit more complicated, because aimdo's overriding functions have result types that don't match the CUDA functions they override, and hipify/clang doesn't like that.

For example, they're declared to return int in the header, but the actual function prototype says cudaError_t. In addition, the actual aimdo implementations return CUresult values...

I'll see what happens if I just fix the return types and cast the return values, but that seems like something that should be fixed regardless of ROCm, since I don't think relying on implicit casts from integers is very good behaviour.

@rattus128 what do you think?

asagi4 (author) commented Feb 22, 2026

Now it compiles, loads and appears to work again.

Haven't stress-tested though.

0xDELUXA commented Feb 22, 2026

Have you run any workload that exceeds VRAM and would OOM without comfy-aimdo?

Does the original example.py work on your system?

Another thing is that the ROCm documentation states that VMM is “under development” on Windows. Some APIs are even marked as beta on Linux too, so I can’t really do anything to get it to work reliably on Windows.

asagi4 (author) commented Feb 22, 2026

@0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know exactly what it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument".

I wonder if, since the pointer it's working with is vrambuf->base_addr + vrambuf->allocated, some allocation patterns end up producing an invalid pointer.

I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

0xDELUXA commented Feb 22, 2026

> @0xDELUXA I haven't stress tested things much, so it's possible that the code isn't very useful as is and fails under memory pressure, but at least it compiles and runs, so it's a start. I also suspect that failing when vrambuffer allocations aren't aligned to the chunk size is a bug that's just masked by some CUDA-specific behaviour. I don't know exactly what it's doing wrong, but with ROCm the hipified cuMemSetAccess calls fail with "invalid argument".
>
> I wonder if, since the pointer it's working with is vrambuf->base_addr + vrambuf->allocated, some allocation patterns end up producing an invalid pointer.

I see. I don’t really think the comfy-aimdo dev has much insight into the AMD side, so it’s just us. I assume there will still be things that work reliably on Nvidia but not as well on AMD.

> I can't help with Windows at all unfortunately. It's been a long time since I last used it for anything.

Not a problem - with the build script from my fork on Windows, as you said, "at least it compiles and runs, so it's a start."

0xDELUXA commented Feb 23, 2026

I'm rather curious how your AMD Linux implementation behaves. Could you try running example.py, please? My output on Windows is this.

asagi4 (author) commented Feb 23, 2026

@0xDELUXA I can't run it at all because it tries to import a function called vbars_analyze that doesn't seem to exist anywhere.

0xDELUXA commented Feb 23, 2026

I needed to modify it as well, and this one works for me. Commented out vbars_analyze, etc.

asagi4 (author) commented Feb 23, 2026

I fixed the script and it gives me this:

Init complete
aimdo: hip_src/control.c:67:INFO:comfy-aimdo inited for GPU: AMD Radeon RX 7900 XTX (VRAM: 24560 MB)
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=131072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xabacef0
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xabacef0
##################### Run the first model #######################
Some weights will be loaded and stay there for all iterations
Some weights will be offloaded

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[First Load] Populated weight at offset: 400.0M
[First Load] Populated weight at offset: 800.0M
[First Load] Populated weight at offset: 1200.0M
[First Load] Populated weight at offset: 1600.0M
[First Load] Populated weight at offset: 2000.0M
[First Load] Populated weight at offset: 2400.0M
[First Load] Populated weight at offset: 2800.0M
[First Load] Populated weight at offset: 3200.0M
[First Load] Populated weight at offset: 3600.0M
[First Load] Populated weight at offset: 4000.0M
[First Load] Populated weight at offset: 4400.0M
[First Load] Populated weight at offset: 4800.0M
[First Load] Populated weight at offset: 5200.0M
[First Load] Populated weight at offset: 5600.0M
[First Load] Populated weight at offset: 6000.0M
[First Load] Populated weight at offset: 6400.0M
[First Load] Populated weight at offset: 6800.0M
[First Load] Populated weight at offset: 7200.0M
[First Load] Populated weight at offset: 7600.0M
[First Load] Populated weight at offset: 8000.0M
[First Load] Populated weight at offset: 8400.0M
[First Load] Populated weight at offset: 8800.0M
[First Load] Populated weight at offset: 9200.0M
[First Load] Populated weight at offset: 9600.0M
[First Load] Populated weight at offset: 10000.0M
[First Load] Populated weight at offset: 10400.0M
[First Load] Populated weight at offset: 10800.0M
[First Load] Populated weight at offset: 11200.0M
[First Load] Populated weight at offset: 11600.0M
[First Load] Populated weight at offset: 12000.0M
[First Load] Populated weight at offset: 12400.0M
[First Load] Populated weight at offset: 12800.0M
[First Load] Populated weight at offset: 13200.0M
[First Load] Populated weight at offset: 13600.0M
[First Load] Populated weight at offset: 14000.0M
[First Load] Populated weight at offset: 14400.0M
[First Load] Populated weight at offset: 14800.0M
[First Load] Populated weight at offset: 15200.0M
[First Load] Populated weight at offset: 15600.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    16400 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     7820 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 16000 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
aimdo: hip_src/model-vbar.c:181:DEBUG:vbar_allocate (start): size=3072M, device=0
aimdo: hip_src/model-vbar.c:208:DEBUG:vbar_allocate (return): vbar=0xb135160
aimdo: hip_src/model-vbar.c:260:DEBUG:vbar_get vbar=0xb135160
##################### Run the second model #######################
Everything will be loaded and will displace some weights of the first model

aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!

Iteration 0
[First Load] Populated weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 633339904
aimdo: hip_src/vrambuf.c:16:ERROR:vrambuffer max_size not aligned to chunk size!
[First Load] Populated weight at offset: 603.2421875M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 603.2421875M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17824 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6396 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 3544] Ptr: 0x7fa5bb000000 | Size:  622592k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     608 MB
##################### Run the first model again #######################
Some weights will still be loaded from before and be there first iteration
Some weights will get re-loaded on the first interation
The rest will be offloaded again

aimdo: hip_src/model-vbar.c:234:DEBUG:vbar_prioritize vbar=0xabacef0
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400

Iteration 0
[No Load Needed] Reusing weight at offset: 0.0M
aimdo: hip_src/vrambuf.c:10:ERROR:Creating vrambuffer of size 419430400
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 1
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 2
[No Load Needed] Reusing weight at offset: 0.0M
[No Load Needed] Reusing weight at offset: 400.0M
[No Load Needed] Reusing weight at offset: 800.0M
[No Load Needed] Reusing weight at offset: 1200.0M
[No Load Needed] Reusing weight at offset: 1600.0M
[No Load Needed] Reusing weight at offset: 2000.0M
[No Load Needed] Reusing weight at offset: 2400.0M
[No Load Needed] Reusing weight at offset: 2800.0M
[No Load Needed] Reusing weight at offset: 3200.0M
[No Load Needed] Reusing weight at offset: 3600.0M
[No Load Needed] Reusing weight at offset: 4000.0M
[No Load Needed] Reusing weight at offset: 4400.0M
[No Load Needed] Reusing weight at offset: 4800.0M
[No Load Needed] Reusing weight at offset: 5200.0M
[No Load Needed] Reusing weight at offset: 5600.0M
[No Load Needed] Reusing weight at offset: 6000.0M
[No Load Needed] Reusing weight at offset: 6400.0M
[No Load Needed] Reusing weight at offset: 6800.0M
[No Load Needed] Reusing weight at offset: 7200.0M
[No Load Needed] Reusing weight at offset: 7600.0M
[No Load Needed] Reusing weight at offset: 8000.0M
[No Load Needed] Reusing weight at offset: 8400.0M
[No Load Needed] Reusing weight at offset: 8800.0M
[No Load Needed] Reusing weight at offset: 9200.0M
[No Load Needed] Reusing weight at offset: 9600.0M
[No Load Needed] Reusing weight at offset: 10000.0M
[No Load Needed] Reusing weight at offset: 10400.0M
[No Load Needed] Reusing weight at offset: 10800.0M
[No Load Needed] Reusing weight at offset: 11200.0M
[No Load Needed] Reusing weight at offset: 11600.0M
[No Load Needed] Reusing weight at offset: 12000.0M
[No Load Needed] Reusing weight at offset: 12400.0M
[No Load Needed] Reusing weight at offset: 12800.0M
[No Load Needed] Reusing weight at offset: 13200.0M
[No Load Needed] Reusing weight at offset: 13600.0M
[No Load Needed] Reusing weight at offset: 14000.0M
[No Load Needed] Reusing weight at offset: 14400.0M
[No Load Needed] Reusing weight at offset: 14800.0M
[No Load Needed] Reusing weight at offset: 15200.0M
[No Load Needed] Reusing weight at offset: 15600.0M

Iteration 3
...

Iteration 4
...

Iteration 5
...

Iteration 6
...

Iteration 7
...

Iteration 8
...

Iteration 9
...
aimdo: hip_src/pyt-cu-plug-alloc.c:89:DEBUG:Pytorch is freeing VRAM ...
aimdo: hip_src/control.c:34:DEBUG:--- VRAM Stats ---
aimdo: hip_src/control.c:37:DEBUG:  Aimdo Recorded Usage:    17616 MB
aimdo: hip_src/control.c:38:DEBUG:  Cuda:     6604 MB /   24560 MB Free
aimdo: hip_src/model-vbar.c:53:DEBUG:---------------- VBAR Usage ---------------
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xb135160: Actual Resident VRAM = 1216 MB
aimdo: hip_src/model-vbar.c:83:DEBUG:VBAR 0xabacef0: Actual Resident VRAM = 16000 MB
aimdo: hip_src/model-vbar.c:86:DEBUG:Total VRAM for VBARs: 17216 MB
aimdo: hip_src/pyt-cu-plug-alloc.c:21:DEBUG:--- Allocation Analysis Start ---
aimdo: hip_src/pyt-cu-plug-alloc.c:30:DEBUG:  [Bucket 1591] Ptr: 0x7fa6c6e00000 | Size:  409600k
aimdo: hip_src/pyt-cu-plug-alloc.c:39:DEBUG:1 Active Allocations for a total of     400 MB
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Exception ignored in: <function ModelVBAR.__del__ at 0x7fae20bee7a0>
Traceback (most recent call last):
  File "/home/sd/git/comfy-aimdo/comfy_aimdo/model_vbar.py", line 95, in __del__
AttributeError: 'NoneType' object has no attribute 'vbar_free'
Some of the ERROR logs from aimdo aren't actually errors; they're just things I added that I wanted to log without enabling debug logging.
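The AttributeError: 'NoneType' object has no attribute 'vbar_free' at the end of the run is a classic interpreter-shutdown issue: by the time __del__ runs, the module-level FFI handle may already have been torn down to None. A defensive pattern that tolerates this (names are hypothetical, this is not the actual model_vbar.py code):

```python
class ModelVBAR:
    """Sketch of a __del__ that tolerates interpreter shutdown.

    `lib` stands in for the loaded aimdo library handle; at shutdown,
    module globals and attributes may already be gone, so cleanup must
    re-check everything it touches and must never raise.
    """

    def __init__(self, lib, handle):
        self._lib = lib
        self._handle = handle

    def close(self):
        # getattr guards against partially-constructed or torn-down objects.
        lib = getattr(self, "_lib", None)
        handle = getattr(self, "_handle", None)
        if lib is not None and handle is not None:
            lib.vbar_free(handle)  # hypothetical FFI call, mirrors the traceback
            self._handle = None
            self._lib = None

    def __del__(self):
        # Exceptions in __del__ are only "ignored" with a noisy traceback,
        # as seen in the log above, so swallow shutdown-time races instead.
        try:
            self.close()
        except Exception:
            pass
```

Exposing close() explicitly (or a context manager) and treating __del__ as a best-effort fallback also makes the free deterministic instead of GC-timing-dependent.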

0xDELUXA commented Feb 23, 2026

I see. I've also added some debug output, but shouldn't the script also print [Offloaded] alongside [First Load] and [No Load Needed], considering the "Some weights will be offloaded" and "The rest will be offloaded again" comments included in the script by rattus128?
Based on the outputs, this is the main difference between comfy-aimdo on AMD Linux and Windows at present.
Which AMD GPU do you have, btw? Mine has 16 GB of VRAM; if yours has more, that could explain the offload difference.

asagi4 (author) commented Feb 23, 2026

It might be that it runs like that because everything fits into VRAM. If I change the layer counts, at some point I just get OOMs. I don't think it's properly offloading anything automatically.

asagi4 force-pushed the hack/rocm-support branch from abc6671 to bd6b9bf on March 9, 2026 15:36
asagi4 (author) commented Mar 9, 2026

Context APIs are deprecated on HIP; apparently you're supposed to use the device/stream APIs instead?

I'm not sure if there's a simple 1:1 mapping, or if the CUDA code could be changed to map to non-deprecated HIP APIs.

@0xDELUXA

0xDELUXA commented Mar 9, 2026

Thank you for providing the new wheel. The script works well without hangs/crashes. Here is the local output. It has a couple of differences from your output: for example, [First Load] Populated weight at offset: 12210.0M happens on my machine, while your output shows [Offloaded] offset: 12210.0M. I also tried the Flux.2 [Klein] 9B Text to Image workflow with the new comfy-aimdo wheel, but it still hangs during model load.

Another difference is that your output shows:
HIP Library Path: C:\WINDOWS\SYSTEM32\amdhip64_7.dll
This means aimdo uses the system-wide amdhip64_7.dll that is installed with Adrenalin.

I always manually copy this file from:
ComfyUI\venv\Lib\site-packages\_rocm_sdk_core\bin\
into:
ComfyUI\venv\Lib\site-packages\comfy_aimdo\
to ensure it uses the one provided by TheRock.

It could be that this is what causes the hangs for you.
I also had hangs here, but they were caused by aimdo using Adrenalin's amdhip64_7.dll.

Edit:
I'm pretty sure this is the issue on your side. I just tested it, and example.py works that way, but ComfyUI doesn't work at all.

If you've copied that .dll, and aimdo loads it, then you shouldn't see HIP Library Path: C:\WINDOWS\SYSTEM32\amdhip64_7.dll anymore. That way it should work.

I assume you're using TheRock and not these wheels.
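For anyone scripting this step, a small helper along these lines works (the venv layout is assumed from the paths above, not from any comfy_aimdo API):

```python
import shutil
from pathlib import Path


def hip_dll_paths(venv_root):
    """Return (source, destination) for TheRock's amdhip64_7.dll in a venv.

    Layout assumed from the thread:
    site-packages/_rocm_sdk_core/bin/ -> site-packages/comfy_aimdo/
    """
    site = Path(venv_root) / "Lib" / "site-packages"
    src = site / "_rocm_sdk_core" / "bin" / "amdhip64_7.dll"
    dst = site / "comfy_aimdo" / "amdhip64_7.dll"
    return src, dst


def copy_hip_dll(venv_root):
    """Copy TheRock's HIP runtime next to aimdo so it wins over Adrenalin's."""
    src, dst = hip_dll_paths(venv_root)
    if not src.is_file():
        return False  # TheRock wheels not installed in this venv
    shutil.copy2(src, dst)
    return True
```

After the copy, aimdo's "HIP Library Path" log line should point into site-packages instead of C:\WINDOWS\SYSTEM32.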

@0xDELUXA

0xDELUXA commented Mar 9, 2026

context APIs are deprecated on HIP. Apparently you're supposed to use device / stream APIs?

I'm not sure if there's a simple 1:1 mapping or if the cuda code could be changed to map to non-deprecated HIP APIs

We also get warnings about this on Windows; nevertheless, it builds aimdo.

@tvukovic-amd

tvukovic-amd commented Mar 10, 2026

amdhip64_7.dll (from the driver) caused the issue. I updated the example output; it now differs only in [First Load] Populated weight at offset: 12210.0M on Iteration 0.
The Flux.2 [Klein] 9B Text to Image workflow from the templates no longer hangs and executes properly. I compared performance results with and without comfy-aimdo:

  • Flux.2 Klein 9B Distilled - comfy_aimdo from your wheel 1.17 it/s, freshly built comfy_aimdo 1.2 it/s, without comfy_aimdo 1.17 it/s
  • Flux.2 Klein 9B - comfy_aimdo from your wheel 0.58 it/s, freshly built comfy_aimdo 0.6 it/s, without comfy_aimdo 0.62 it/s.

Can you please provide an example where enabling a LoRA causes a performance regression with comfy_aimdo enabled, so I can double-check it on my machine?

@0xDELUXA

0xDELUXA commented Mar 10, 2026

Great! I was wondering why it didn't work on your end.

Can you please provide the example when enabling LoRA causes performance regression when comfy_aimdo is enabled to double check on my machine?

Did something change? I can't seem to reproduce my own issue anymore XD

Except for this one, which I assume occurs on your side too. Not having access to Sage/Flash makes aimdo kind of not worth using, sadly.

@tvukovic-amd

I only followed your instructions and tested with 2 venvs:

  • one with comfy_aimdo you provided yesterday
  • another one with a freshly built comfy_aimdo

and copied amdhip64_7.dll from venv\Lib\site-packages\_rocm_sdk_core\bin\ to venv\Lib\site-packages\comfy_aimdo\.

Following this case, I installed triton-windows from https://github.com/triton-lang/triton-windows/actions/runs/22558044670 per the instructions and tried to run Flux.2 Klein 9B. The Triton error ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) from the issue didn't appear, but new errors occurred while running the model:

Error running sage attention: Command '['c:\\develop\\ComfyUI\\venv\\Lib\\site-packages\\_rocm_sdk_core\\lib\\llvm\\bin\\clang-cl.exe', 'C:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.c', '/nologo', '/O2', '/LD', '/wd4819', '/std:c11', '/IC:\\develop\\ComfyUI\\venv\\Lib\\site-packages\\triton\\backends\\amd\\include', '/IC:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f', '/IC:\\Python312\\Include', '/IC:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.44.35207\\include', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\shared', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\ucrt', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\um', '/FoC:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.cp312-win_amd64.obj', '/link', '/LIBPATH:C:\\Python312\\libs', '/LIBPATH:C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.44.35207\\lib\\x64', '/LIBPATH:C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\ucrt\\x64', '/LIBPATH:C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\um\\x64', 'python312.lib', '/OUT:C:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.cp312-win_amd64.pyd', '/IMPLIB:C:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.cp312-win_amd64.lib', '/PDB:C:\\Users\\master\\AppData\\Local\\Temp\\tmpddmy1m2f\\hip_utils.cp312-win_amd64.pdb']' returned non-zero exit status 1., using pytorch attention instead.
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1078,39): error: call to undeclared function 'alloca'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
 1078 |   PyObject **args_data = (PyObject **)alloca(num_args * sizeof(PyObject *));
      |                                       ^
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1078,26): warning: cast to 'PyObject **' (aka 'struct _object **') from smaller integer type 'int' [-Wint-to-pointer-cast]
 1078 |   PyObject **args_data = (PyObject **)alloca(num_args * sizeof(PyObject *));
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1089,19): warning: cast to 'void **' from smaller integer type 'int' [-Wint-to-pointer-cast]
 1089 |   void **params = (void **)alloca(num_params * sizeof(void *));
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1102,24): error: incompatible integer to pointer conversion assigning to 'void *' from 'int' [-Wint-conversion]
 1102 |     params[params_idx] = alloca(extractor.size);
      |                        ^ ~~~~~~~~~~~~~~~~~~~~~~
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1108,22): error: incompatible integer to pointer conversion assigning to 'void *' from 'int' [-Wint-conversion]
 1108 |   params[params_idx] = alloca(sizeof(void *));
      |                      ^ ~~~~~~~~~~~~~~~~~~~~~~
C:\Users\master\AppData\Local\Temp\tmpak3fmqyt\hip_utils.c(1113,22): error: incompatible integer to pointer conversion assigning to 'void *' from 'int' [-Wint-conversion]
 1113 |   params[params_idx] = alloca(sizeof(void *));
      |                      ^ ~~~~~~~~~~~~~~~~~~~~~~
2 warnings and 4 errors generated.

@0xDELUXA

Which attention type? Sage or Flash?

@tvukovic-amd

tvukovic-amd commented Mar 10, 2026

This happens when running sage attention.
FYI, I am using the latest nightly TheRock build for torch, torchvision and torchaudio.

@0xDELUXA

0xDELUXA commented Mar 10, 2026

FYI I am using latest nightly theRock build for torch, torchvision and torchaudio.

I'm using torch 2.12.0a0+rocm7.12.0a20260304

This happens with running sage attention.

Could you try Flash too?

@tvukovic-amd

With flash attention it gives the following error while running the model: Flash Attention failed, using default SDPA: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?).

@0xDELUXA

0xDELUXA commented Mar 11, 2026

So the same error. I’ve opened an issue on triton-windows and reached out to the maintainers on Discord. It’s now in their hands.

I also updated the build script to automatically copy amdhip64_7.dll, which is mandatory on Windows.

@tvukovic-amd

tvukovic-amd commented Mar 11, 2026

I fixed this bug: in venv\Lib\site-packages\triton\backends\amd\driver.c, #include <malloc.h> needs to be added after #include <windows.h>. After that, sage attention also hits the same error: Error running sage attention: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?), using pytorch attention instead.

@0xDELUXA

I see. I don't really understand how I could get the same cpu tensor? error even without this change.

@tvukovic-amd

It comes from the installed Triton. Did you install this one: https://github.com/triton-lang/triton-windows/actions/runs/22558044670?

@0xDELUXA

0xDELUXA commented Mar 11, 2026

Since post26 is now available on PyPI, I used pip install triton-windows.

@tvukovic-amd

When running pip install triton-windows I don't need any additional fixes; it only has the Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) error.

@0xDELUXA

0xDELUXA commented Mar 11, 2026

The only thing I'm still unclear on is:

  • Why does the upstream Triton work on Linux with comfy-aimdo, but triton-windows doesn't (on Windows)?
  • Why does it work on Nvidia + triton-windows but not on AMD?

@woct0rdho

ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) basically means Triton expects a tensor to be on the GPU but it's actually on the CPU. I guess it's not an issue in Triton but an issue somewhere else.

We at triton-windows are still busy fixing some unit tests on RDNA3 GPUs. Once the existing unit tests pass, I may find time to check this issue.
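As an illustration only (a toy stand-in, not Triton's actual launcher code), the check behind that message amounts to rejecting any kernel argument whose tensor lives on the CPU:

```python
class FakeTensor:
    """Toy stand-in for a framework tensor; only the device attribute matters."""

    def __init__(self, device):
        self.device = device  # e.g. "cuda"/"hip" for GPU, "cpu" otherwise


def check_kernel_args(*tensors):
    """Reject CPU tensors before a (hypothetical) kernel launch."""
    for i, t in enumerate(tensors):
        if t.device == "cpu":
            raise ValueError(
                f"Pointer argument (at {i}) cannot be accessed from Triton "
                "(cpu tensor?)"
            )
```

So when the message shows up with comfy-aimdo enabled but not without it, the suspect is some tensor that the offloading machinery left (or moved back) on the host before the Triton kernel ran.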

@0xDELUXA

I see. Thanks for the update and for looking into it later.
