Draft
Conversation
Vibe code. To be reviewed.
If in dynamic mode, load GGUF as a QT.
Refactor this to support the new reconstructability protocol in the comfy core. This is needed for DynamicVRAM (to support legacy demotion for fallbacks). Add the logic for dynamic_vram construction. This is also needed for worksplit multi-gpu branch where the model is deep-cloned via reconstruction to put the model on two parallel GPUs.
Factor this out to a helper and implement the new core reconstruction protocol. Treat the mmap_released flag as 1:1 with the underlying model, so that it moves with the base model in model_override.
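As a rough sketch of what such a reconstruction protocol could look like (all names below are hypothetical illustrations, not the actual comfy core API): the model records the arguments it was built from, so a deep clone can be produced by re-running the constructor, and per-instance state like the mmap_released flag stays attached to the underlying base model so it travels with it.

```python
class ReconstructableModel:
    """Hypothetical sketch (not the actual comfy core API): a model
    that can be deep-cloned by re-running its own constructor
    ("reconstruction") instead of copying live tensors."""

    def __init__(self, ckpt_path, device="cpu"):
        self.ckpt_path = ckpt_path
        self.device = device
        # State tied 1:1 to the underlying model; it must travel with
        # the base model rather than with any wrapper/override.
        self.mmap_released = False

    def reconstruction_kwargs(self):
        # Everything needed to build an equivalent instance from scratch.
        return {"ckpt_path": self.ckpt_path, "device": self.device}

    def reconstruct(self, **overrides):
        kwargs = {**self.reconstruction_kwargs(), **overrides}
        clone = type(self)(**kwargs)
        clone.mmap_released = self.mmap_released  # flag moves with the model
        return clone

# Deep-clone onto a second GPU, as the worksplit branch would:
base = ReconstructableModel("model.gguf", device="cuda:0")
base.mmap_released = True
twin = base.reconstruct(device="cuda:1")
```

The key property is that the clone is rebuilt from constructor arguments rather than copied, which is what lets the worksplit branch place an equivalent model on a second GPU.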
https://github.com/rattus128/ComfyUI-GGUF/tree/dynamic-vram Is this the same thing?
The new dynamic VRAM system in the comfy core enhances both RAM and VRAM management. Models are no longer offloaded from VRAM to RAM (which has a habit of becoming swap) and are now loadable asynchronously on the sampler's first iteration. This gives a significant speedup to big multi-model workflows on low-resource systems. VRAM offloading is managed by demand offloading, such that there is no need for VRAM usage estimates anymore.
The core has already upstreamed several of the resource saving features of GGUF in various forms.
So this implements a QuantizedTensor backend and subclasses the new ModelPatcherDynamic to bring GGUF+dynamic without needing custom ops.
The patcher subclass is needed to handle the lora hooking on-the-fly. Otherwise it's just load the state dict into the new QuantizedTensor and go.
This brings the full feature-set of the core comfy caster to GGUF, including async-offload (and async primary load), pinned memory, and now the dynamic management.
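A minimal illustration of the on-the-fly patching idea (hypothetical names throughout; the real QuantizedTensor and ModelPatcherDynamic APIs live in this repo and comfy core respectively): lora deltas are not baked into the stored quantized weight, but applied to the dequantized values each time the weight is loaded.

```python
class QuantizedTensor:
    """Hypothetical stand-in for a GGUF-backed quantized weight."""
    def __init__(self, qdata, scale):
        self.qdata = qdata
        self.scale = scale

    def dequantize(self):
        # Toy dequantization: scale the stored integer data.
        return [q * self.scale for q in self.qdata]

class DynamicGGUFPatcher:
    """Sketch of a dynamic patcher subclass: patches (e.g. lora
    deltas) are applied on-the-fly at weight-load time, so the
    quantized data itself stays untouched and reusable."""
    def __init__(self, weights):
        self.weights = weights   # name -> QuantizedTensor
        self.patches = {}        # name -> list of additive deltas

    def add_patch(self, name, delta):
        self.patches.setdefault(name, []).append(delta)

    def load_weight(self, name):
        w = self.weights[name].dequantize()
        for delta in self.patches.get(name, []):
            w = [a + b for a, b in zip(w, delta)]
        return w

patcher = DynamicGGUFPatcher({"w": QuantizedTensor([1, 2, 3], scale=2)})
patcher.add_patch("w", [1, 1, 1])
```

Because the quantized source data is never modified, patches can be added or removed without reloading the model, which is the point of hooking the lora at load time rather than via custom ops.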
There's some boilerplate to implement downgrade back to ModelPatcher. This is needed for things like torch compiler and hooks where Dynamic VRAM is TBD.
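The downgrade path could look roughly like this (again, hypothetical names): when a feature the dynamic patcher does not yet support is requested, fall back by rebuilding the model under the plain patcher.

```python
class ModelPatcher:
    """Hypothetical stand-in for the classic, fully-supported patcher."""
    supports_hooks = True
    def __init__(self, model):
        self.model = model

class ModelPatcherDynamic(ModelPatcher):
    """Hypothetical dynamic-VRAM patcher; hooks/compile support is TBD."""
    supports_hooks = False

    def downgrade(self):
        # Legacy demotion: rebuild as a plain ModelPatcher so the
        # unsupported feature keeps working at the cost of dynamic VRAM.
        return ModelPatcher(self.model)

def patcher_for(model, needs_hooks):
    p = ModelPatcherDynamic(model)
    if needs_hooks and not p.supports_hooks:
        return p.downgrade()
    return p
```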
Still drafting and will post some more performance results. I am going to pull a RAM stick and go for some 16GB-RAM flows with GGUF.
Example test conditions:
WAN2.2 14B Q8 GGUF, 640x640x81f, RTX5090, Linux, 96GB, 2x Runs (disk caches warm with model first runs)
Before
After