Dynamic VRAM support#427

Draft
rattus128 wants to merge 6 commits intocity96:mainfrom
rattus128:dynamic-vram

Conversation

Contributor

@rattus128 rattus128 commented Mar 5, 2026

The new dynamic VRAM system in comfy-core enhances both RAM and VRAM management. Models are no longer offloaded from VRAM to RAM (which has a habit of becoming swap) and are now loaded asynchronously on the sampler's first iteration. This gives a significant speedup to big multi-model workflows on low-resource systems. VRAM is offloaded on demand, so there is no need for VRAM usage estimates anymore.

The core has already upstreamed several of the resource saving features of GGUF in various forms.

  • Core linear layers are now initialized unallocated, to avoid the naked commit charge for the empty tensor.
  • Models are loaded with assign=True to avoid a deep copy and committed memory on model load (GGUF does similar, but by hooking _load_state_dict).
  • The safetensors file is mmapped read-only to avoid that commit charge, as GGUF already does.

So this implements a QuantizedTensor backend and subclasses the new ModelPatcherDynamic to bring GGUF+dynamic without needing custom ops.

The patcher subclass is needed to hook the lora patching in on-the-fly. Otherwise it's just: load the state dict into the new QuantizedTensor and go.
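
The dequantize-on-demand idea behind the QuantizedTensor backend can be illustrated roughly like this (the class and method names here are hypothetical stand-ins, not the actual ComfyUI-GGUF API):

```python
import torch

class SimpleQuantizedTensor:
    """Toy stand-in: keep the raw quantized payload from the file and only
    reconstruct a dense tensor when a layer actually computes with it."""

    def __init__(self, qdata: torch.Tensor, scale: float, shape):
        self.qdata = qdata    # e.g. int8 payload, as stored on disk
        self.scale = scale
        self.shape = shape

    def dequantize(self, dtype=torch.float32) -> torch.Tensor:
        # Dense weights exist only transiently, at compute time.
        return (self.qdata.to(dtype) * self.scale).reshape(self.shape)

# Simple symmetric int8 round-trip as a sanity check.
w = torch.tensor([[0.5, -1.0], [2.0, 0.0]])
scale = w.abs().max().item() / 127
qt = SimpleQuantizedTensor(torch.round(w / scale).to(torch.int8), scale, w.shape)
assert torch.allclose(qt.dequantize(), w, atol=scale)
```

The real GGUF qtypes (Q6_K, Q8_0, ...) use block-wise formats rather than a single per-tensor scale, but the lifecycle is the same: quantized bytes at rest, dense tensors only on demand.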

This brings the full feature set of the core comfy caster to GGUF, including async offload (and async primary load), pinned memory, and now dynamic management.

There's some boilerplate to implement a downgrade back to ModelPatcher. This is needed for things like torch compile and hooks, where Dynamic VRAM support is TBD.
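
The downgrade path amounts to reconstructing the wrapper as the legacy class when an unsupported feature is requested. A hedged sketch (class and function names are stand-ins, not the real ComfyUI API):

```python
class ModelPatcher:                        # stand-in for the comfy base class
    def __init__(self, model):
        self.model = model

class ModelPatcherDynamic(ModelPatcher):   # stand-in for the new subclass
    def downgrade(self) -> ModelPatcher:
        # Rebuild a plain legacy patcher around the same underlying model.
        return ModelPatcher(self.model)

def make_patcher(model, needs_torch_compile: bool) -> ModelPatcher:
    patcher = ModelPatcherDynamic(model)
    # Dynamic VRAM support for torch compile is TBD, so fall back to legacy.
    return patcher.downgrade() if needs_torch_compile else patcher
```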

Still drafting; I will post some more performance results. I am going to pull a RAM stick and go for some 16GB-RAM workflows with GGUF.

Example Test conditions:

WAN2.2 14B Q8 GGUF, 640x640x81f, RTX5090, Linux, 96GB, 2x Runs (disk caches warm with model first runs)

Before

Prompt executed in 60.31 seconds
Prompt executed in 55.99 seconds

After

Prompt executed in 48.75 seconds
Prompt executed in 43.35 seconds

Vibe code. To be reviewed.
If in dynamic mode, load GGUF as a QT.
Refactor this to support the new reconstructability protocol in the
comfy core. This is needed for DynamicVRAM (to support legacy
demotion for fallbacks). Add the logic for dynamic_vram construction.

This is also needed for worksplit multi-gpu branch where the model
is deep-cloned via reconstruction to put the model on two parallel
GPUs.
Factor this out to a helper and implement the new core reconstruction
protocol. Consider the mmap_released flag 1:1 with the underlying model
such that it moves with the base model in model_override.
m8rr commented Mar 6, 2026

https://github.com/rattus128/ComfyUI-GGUF/tree/dynamic-vram

Is this the same thing?
I used the above and got the following error.


D:\AI\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --disable-api-nodes --output-directory E:\output --temp-directory E:\output
Setting output directory to: E:\output
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Checkpoint files will always be loaded safely.
Total VRAM 12282 MB, total RAM 32085 MB
pytorch version: 2.10.0+cu130
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 SUPER : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 14438.0
working around nvidia conv3d memory bug.
Using pytorch attention
aimdo: src-win/cuda-detour.c:77:INFO:aimdo_setup_hooks: found driver at 00007FFB60C00000, installing 4 hooks
aimdo: src-win/cuda-detour.c:61:DEBUG:install_hook_entrys: hooks successfully installed
aimdo: src/control.c:66:INFO:comfy-aimdo inited for GPU: NVIDIA GeForce RTX 4070 SUPER (VRAM: 12281 MB)
DynamicVRAM support detected and enabled
Python version: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.16.3
Setting temp directory to: E:\output\temp
ComfyUI frontend version: 1.39.19
[Prompt Server] web root: D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfyui_frontend_package\static
ComfyUI-GGUF: Allowing full torch compile

Import times for custom nodes:
   0.0 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-KJNodes
   0.1 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.056s (created=0, skipped_existing=81, orphans_pruned=0, total_seen=85)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
gguf qtypes: F32 (289), Q6_K (337)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 262208
Dequantizing token_embd.weight to prevent runtime OOM.
clip missing: ['multi_modal_projector.mm_input_projection_weight', 
....
....
'vision_model.post_layernorm.weight', 'vision_model.post_layernorm.bias']
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load LTXAVTEModel_
Model LTXAVTEModel_ prepared for dynamic VRAM loading. 50881MB Staged. 0 patches attached. Force pre-loaded 290 weights: 2995 KB.
!!! Exception during processing !!! shape '[4096, 3840]' is invalid for input of size 12902400
Traceback (most recent call last):
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 524, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 333, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 307, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 295, in process_inputs
    result = f(**inputs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\nodes.py", line 80, in encode
    return (clip.encode_from_tokens_scheduled(tokens), )
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 313, in encode_from_tokens_scheduled
    pooled_dict = self.encode_from_tokens(tokens, return_pooled=return_pooled, return_dict=True)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 377, in encode_from_tokens
    o = self.cond_stage_model.encode_token_weights(tokens)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\lt.py", line 167, in encode_token_weights
    out, pooled, extra = self.gemma3_12b.encode_token_weights(token_weight_pairs)
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 45, in encode_token_weights
    o = self.encode(to_encode)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 306, in encode
    return self(tokens)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 279, in forward
    outputs = self.transformer(None, attention_mask_model, embeds=embeds, num_tokens=num_tokens, intermediate_output=intermediate_output, final_layer_norm_intermediate=self.layer_norm_hidden_state, dtype=torch.float32, embeds_info=embeds_info)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 794, in forward
    return self.model(input_ids, *args, **kwargs)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 719, in forward
    x, current_kv = layer(
                    ~~~~~^
        x=x,
        ^^^^
    ...<3 lines>...
        past_key_value=past_kv,
        ^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 605, in forward
    x, present_key_value = self.self_attn(
                           ~~~~~~~~~~~~~~^
        hidden_states=x,
        ^^^^^^^^^^^^^^^^
    ...<4 lines>...
        sliding_window=sliding_window,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 466, in forward
    xq = self.q_proj(hidden_states)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 373, in forward
    return self.forward_comfy_cast_weights(*args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 365, in forward_comfy_cast_weights
    weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
                                   ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 228, in cast_bias_weight
    return cast_bias_weight_with_vbar(s, dtype, device, bias_dtype, non_blocking, compute_dtype, want_requant)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 148, in cast_bias_weight_with_vbar
    comfy.model_management.cast_to_gathered(xfer_source, pin)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\model_management.py", line 1204, in cast_to_gathered
    dest_views = comfy.memory_management.interpret_gathered_like(tensors, r)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\memory_management.py", line 71, in interpret_gathered_like
    actuals[attr] = gathered[offset:offset+size].view(dtype=template.dtype).view(template.shape)
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
RuntimeError: shape '[4096, 3840]' is invalid for input of size 12902400
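
For what it's worth, the sizes in that final RuntimeError are consistent with a quantized/dequantized confusion: the `.view()` expects a dense [4096, 3840] buffer, but the element count it received matches the Q6_K payload size for that same weight (Q6_K packs 256 weights into 210 bytes). This is a back-of-envelope reading of the numbers, not a confirmed diagnosis:

```python
# The .view() expects shape[0] * shape[1] dense elements...
elements = 4096 * 3840              # 15,728,640
# ...but the gathered buffer size matches the Q6_K-quantized byte count
# for the same weight (210 bytes per 256 weights in Q6_K).
q6k_bytes = elements * 210 // 256   # 12,902,400 -> the size in the error
assert q6k_bytes == 12902400
```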

