Fix: Add memory precheck before VAE decode to prevent crash #12109
tvukovic-amd wants to merge 3 commits into Comfy-Org:master
Conversation
We have made a lot of VAE VRAM fixes recently, so pre-emptively tiling is going to false-positive without a fairly major audit of the VAE estimates. Some of them (like LTX) have non-trivial, non-linear VRAM consumption patterns. The comfy-aimdo project is trying to take things the other way and control the allocator under VRAM pressure, to get us away from the maintenance of accurate model estimates: #11845. No AMD support yet though. Is there a path forward on PyTorch being able to allocate with a clean exception on OOM? Which VAEs are the worst offenders, and how much VRAM are you generally trying to support?
Not the OP, but the most VRAM is consumed by the video VAEs; all the new image VAEs are very efficient by default. On another note: even though you recently reduced the LTX 2 VAE VRAM consumption by a third, ComfyUI still offloads the whole model before decoding with the LTX 2 VAE (likely because the VAE memory estimation wasn't changed). This is critical for 16GB VRAM users, since the model and TEs cannot be kept within 32GB RAM and the model spills onto the pagefile. A custom node that sets the amount of model to offload before decoding would be very nice to have, since the LTX 2 VAE takes only 3GB of VRAM with tiled decoding, yet ComfyUI still offloads the whole model into RAM and the pagefile.
#NOTE: We don't know what tensors were allocated to stack variables at the time of the
#exception and the exception itself refs them all until we get out of this except block.
#So we just set a flag for tiler fallback so that tensor gc can happen once the
#exception is fully off the books.
I think you need to set do_tile = True here to actually do the tiled VAE retry.
I think this patch would be fairly helpful on AMD especially. Some VAE VRAM estimates with AMD seem to be kind of bonkers; the Flux VAE requests 11.6GB of VRAM to decode a 1 megapixel image and somehow I don't think it actually uses anywhere near that much.
EDIT: I just did a quick memory dump after a VAE decode. Torch maximum memory usage was about 6.6GB, and that would probably include the loaded VAE model and anything else that might be in VRAM. I'm not sure how to accurately tell what the actual VAE decoding used, but clearly not 11.6GB
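A quick way to take that kind of measurement is PyTorch's CUDA memory statistics API (reset the peak counter, run the decode, read the peak back). This is a sketch, not the PR's code; note the caveat from the comment above: the peak includes the loaded VAE weights and anything else already resident, not just the decode's working set.

```python
def fmt_gib(nbytes: int) -> str:
    """Format a byte count as GiB with one decimal place."""
    return f"{nbytes / 1024**3:.1f} GiB"

def measure_peak_vram(decode_fn):
    """Run decode_fn and report the peak VRAM torch allocated during it.

    Returns None when no CUDA (or ROCm-as-CUDA) device is available.
    """
    import torch  # imported here so fmt_gib works without torch installed
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()
    decode_fn()
    torch.cuda.synchronize()
    return fmt_gib(torch.cuda.max_memory_allocated())
```

`torch.cuda.max_memory_allocated()` only counts tensors torch itself allocated, so driver-side and cache overhead can push the true figure somewhat higher.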
You are right, do_tile should be set to True in this case. I pushed a new commit with the appropriate changes.
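The pattern being discussed (set a flag inside the except block, retry only after it exits) can be sketched as below. The decoder callables and the generic MemoryError are stand-ins for the real code, which catches torch's CUDA OOM exception:

```python
def decode_with_fallback(decode_full, decode_tiled, samples):
    """Try a full VAE decode; on OOM, fall back to tiled decode.

    The flag is set inside the except block and the retry happens after
    it, because the exception object refs all the stack tensors live at
    the time of the OOM; they can only be garbage-collected once the
    exception is fully off the books, freeing room for the tiled retry.
    """
    do_tile = False
    try:
        return decode_full(samples)
    except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
        do_tile = True   # only set a flag here; do not allocate yet
    if do_tile:
        return decode_tiled(samples)
```

Retrying inside the except block instead would keep all those dead tensors alive during the tiled attempt, which is exactly when memory is scarcest.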
Has anyone tried enabling PyTorch/SDPA attention in VAE by changing this line in to
Hopefully, AMD support will be available in the future as well. Nvm, the second-ever PR there is already about it: Comfy-Org/comfy-aimdo#2 XD
Tried with this change but VAE decoder still causes
I see. But then what's the point of automatically using split attention in VAE on all AMD GPUs? For me, PyTorch attention is faster/better, and it's the default on NVIDIA as well. I mean, on all GPUs except AMD.
The VAE decoder uses the same amount of memory if pytorch_attention_enabled_vae is set to True, except for the internal attention matrix computation, which is a small portion of the total pipeline. If we use the changes from this PR for the memory precheck (to switch to the tiled VAE decoder when there is not enough memory) and enable PyTorch/SDPA attention in the VAE, models execute correctly.
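For reference, the PyTorch attention path discussed here comes down to a single torch.nn.functional.scaled_dot_product_attention call, which can avoid materializing the full attention matrix that split attention instead processes in chunks. A minimal CPU sketch with toy shapes (not ComfyUI's actual VAE code) showing it matches the explicit computation:

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch 1, 4 heads, 16 tokens, head dim 8. A real VAE attends
# over h*w spatial tokens, which is why memory grows with resolution.
q = torch.randn(1, 4, 16, 8)
k = torch.randn(1, 4, 16, 8)
v = torch.randn(1, 4, 16, 8)

# Fused SDPA path.
out_sdpa = F.scaled_dot_product_attention(q, k, v)

# Explicit reference: materializes the full 16x16 attention matrix.
scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
out_ref = scores.softmax(dim=-1) @ v

assert torch.allclose(out_sdpa, out_ref, atol=1e-5)
```

On CUDA/ROCm builds, SDPA can dispatch to fused kernels; which backend is picked (and how much memory it saves) depends on hardware and dtype.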
On my end, with MIOpen/cuDNN off, PyTorch attention uses a little more peak VRAM than split attention during VAE decode. Both take mere seconds as long as they fit in memory, but exceeding memory freezes Windows. (ComfyUI does not detect the danger or auto-fall back to tiled decode; I suspect it might be double-counting shared memory on my iGPU system.) 1600x1280 is right on the knife's edge for me: it usually works, but any funny business can put it over the top, so I've stayed with split attention for VAE decode. For sampler steps, though, PyTorch attention is appreciably faster.
Adding a memory precheck before VAE decode to prevent Windows 0xC0000005 access violation crashes, particularly on devices with limited VRAM.

Problem

VAE decode could trigger 0xC0000005 (access violation) crashes when memory is exhausted (--highvram, --gpu-only, or insufficient CPU RAM). The existing OOM exception handling couldn't catch these crashes because they occur at the driver/system level before PyTorch can raise an exception.

Solution

Added a proactive memory check (use_tiled_vae_decode()) that evaluates memory conditions before attempting decode, taking the --highvram/--gpu-only flags and --disable-smart-memory into account. If any condition fails, switch to tiled VAE decode preemptively.