
feat(isolation): DynamicVRAM compatibility for process isolation#12775

Merged
pollockjj merged 1 commit into Comfy-Org:pyisolate-support from pollockjj:pyisolate-pr-20260304
Mar 5, 2026
Conversation


@pollockjj pollockjj commented Mar 5, 2026

Summary

DynamicVRAM's on-demand model loading/offloading conflicted with process isolation in three ways: RPC tensor transport stalls from mid-call GPU offload, race conditions between model lifecycle and active RPC operations, and false positive memory leak detection from changed finalizer patterns.

  • Marshal CUDA tensors to CPU before RPC transport for dynamic models
  • Add operation state tracking + quiescence waits at workflow boundaries
  • Distinguish proxy reference release from actual leaks in cleanup_models_gc
  • Fix init order: DynamicVRAM must initialize before isolation proxies
  • Add RPC timeouts to prevent indefinite hangs on model unavailability
  • Prevent proxy-of-proxy chains from DynamicVRAM model reload cycles
  • Add torch.device/torch.dtype serializers for new DynamicVRAM RPC paths
  • Guard isolation overhead so non-isolated workflows are unaffected
  • Migrate env var to PYISOLATE_CHILD
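The first bullet — marshalling CUDA tensors to CPU before RPC transport — can be sketched roughly as below. The file summary in the review mentions a `_to_cpu_for_rpc` helper; this sketch uses a duck-typed stand-in for `torch.Tensor` so it runs without torch, and is illustrative rather than the PR's actual code:

```python
class FakeCudaTensor:
    # Stand-in for torch.Tensor so the sketch runs without torch;
    # real code would check isinstance(obj, torch.Tensor) instead.
    def __init__(self, on_gpu=True):
        self.is_cuda = on_gpu

    def cpu(self):
        return FakeCudaTensor(on_gpu=False)


def to_cpu_for_rpc(obj):
    # Recursively move device tensors to CPU before RPC transport, so
    # DynamicVRAM can offload the GPU copy mid-call without the RPC
    # serializer stalling on unmapped device memory.
    if hasattr(obj, "is_cuda"):
        return obj.cpu() if obj.is_cuda else obj
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu_for_rpc(x) for x in obj)
    if isinstance(obj, dict):
        return {k: to_cpu_for_rpc(v) for k, v in obj.items()}
    return obj
```

On the proxy side, a matching device-restore step would move the tensors back to their original device after transport.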

Test plan

  • Run baseline (non-isolated) workflow to confirm no regression
  • Run isolated workflow with DynamicVRAM enabled — verify no RPC stalls or tensor transport hangs
  • Run sequential isolated workflows — verify quiescence waits prevent cleanup race conditions
  • Confirm no false positive "memory leak" warnings in logs during normal model cycling
  • Verify non-isolated workflows show zero isolation overhead (guard checks)

API Node PR Checklist

Scope

  • Is API Node Change

Pricing & Billing

  • Need pricing update
  • No pricing update

If Need pricing update:

  • Metronome rate cards updated
  • Auto‑billing tests updated and passing

QA

  • QA done
  • QA not required

Comms

  • Informed Kosinkadink

Copilot AI review requested due to automatic review settings March 5, 2026 05:51
@socket-security
Copy link

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Updated | comfyui-workflow-templates 0.9.4 ⏵ 0.9.5 | 100 | 100 | 98 | 100 | 100 |

View full report

Contributor

Copilot AI left a comment


Pull request overview

This PR adds DynamicVRAM compatibility with process isolation, resolving three conflict classes: RPC tensor transport stalls during mid-call GPU offload, race conditions between model lifecycle and active RPC operations, and false-positive memory leak detection.

Changes:

  • Marshal CUDA tensors to CPU at RPC boundaries for dynamic models and add device-restore helpers on the proxy side
  • Add per-operation state tracking and quiescence waits (wait_for_model_patcher_quiescence) at workflow boundaries to prevent cleanup races
  • Distinguish proxy reference release from actual memory leaks in cleanup_models_gc (isolation-aware branching), fix is_dead logic, and add NULL-safety for model_finalizer
  • New WAN21_SCAIL model support (detection, model class, node, sampling type)
  • Migrate isolation env var checks from PYISOLATE_ISOLATION_ACTIVE to PYISOLATE_CHILD; guard isolation overhead behind args.use_process_isolation; fix DynamicVRAM init order; add --disable-dynamic-vram CLI flag; extend enables_dynamic_vram() logic

Reviewed changes

Copilot reviewed 36 out of 38 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| comfy/isolation/model_sampling_proxy.py | CPU marshalling for RPC tensor transport; adds timeout, logging, and _to_cpu_for_rpc helper |
| comfy/isolation/model_patcher_proxy_registry.py | Operation state tracking, quiescence waits, and CPU marshalling for apply_model/latent RPC paths |
| comfy/isolation/proxies/base.py | RPC timeout support, structured debug logging for all proxy RPC calls |
| comfy/isolation/__init__.py | Adds wait_for_model_patcher_quiescence, quiescence calls in notify/flush, guards eviction on isolated-only graphs |
| execution.py | Guards isolation overhead with args.use_process_isolation; adds quiescence waits at workflow boundaries |
| comfy/model_management.py | cleanup_models_gc isolation-aware split; is_dead/dead_state refactor; is_wsl() helper; _isolation_mode_enabled() |
| comfy/model_base.py | ModelType.IMG_TO_IMG_FLOW; WAN21_FlowRVS/WAN21_SCAIL model classes; isolation-guarded __reduce__ in ModelSampling |
| comfy/ldm/wan/model.py | SCAILWanModel with RoPE-encoded pose conditioning |
| comfy/supported_models.py | WAN21_SCAIL config registered in models list |
| comfy_extras/nodes_wan.py | WanSCAILToVideo node with pose/reference latent support |
| comfy/sd.py | disable_dynamic flag threaded through CLIP loading; load_clip_model_patcher and load_checkpoint_guess_config_clip_only helpers for DynamicVRAM reload |
| comfy/model_patcher.py | get_clone_model_override, get_non_dynamic_delegate; model_override param in clone |
| comfy/samplers.py | Env-var migration; get_non_dynamic_delegate call when cond has hooks |
| comfy/cli_args.py | --disable-dynamic-vram flag; removes DynamicVRAM from PerformanceFeature; updated enables_dynamic_vram() |
| main.py | DynamicVRAM init order fix (before isolation proxies); WSL guard added |
| comfy/isolation/adapter.py | torch.device/torch.dtype serializers; dynamic ModelSampling class registration; proxy-of-proxy prevention |
| comfy/isolation/model_patcher_proxy.py | Device restoration after CPU-marshalled RPC; get_non_dynamic_delegate guard; _model_sampling cache |
| utils/install_util.py | get_required_packages_versions() centralizes requirements parsing |
| app/frontend_management.py | Delegates to new get_required_packages_versions() |
| node_helpers.py | conditioning_set_values_with_timestep_range helper |
| comfy/hooks.py / comfy/k_diffusion/sampling.py / comfy/context_windows.py | Env-var migration; sampling substep fix |
| requirements.txt / pyproject.toml | Dependency bumps and host isolation config |
| .pyisolate_venvs/*/cache/ | Generated cache files (should be gitignored) |
| tests/execution/test_jobs.py / tests-unit/app_test/frontend_manager_test.py | Test updates for text previewable type and cache clearing |


Comment on lines +29 to +45
```python
PACKAGE_VERSIONS = {}
def get_required_packages_versions():
    if len(PACKAGE_VERSIONS) > 0:
        return PACKAGE_VERSIONS.copy()
    out = PACKAGE_VERSIONS
    try:
        with open(requirements_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip().replace(">=", "==")
                s = line.split("==")
                if len(s) == 2:
                    version_str = s[-1]
                    if not is_valid_version(version_str):
                        logging.error(f"Invalid version format in requirements.txt: {version_str}")
                        continue
                    out[s[0]] = version_str
        return out.copy()
```
Copilot AI Mar 5, 2026

The get_required_packages_versions function has a subtle cache inconsistency. On success it populates PACKAGE_VERSIONS (the global dict out) and returns out.copy(). On failure (e.g., FileNotFoundError), it returns None without modifying PACKAGE_VERSIONS. However, on partial success (some lines parsed, then the with open() block raises partway through—unlikely but possible), PACKAGE_VERSIONS would be partially filled but None is returned. Subsequent calls would then find len(PACKAGE_VERSIONS) > 0 and return the partial data as if it were complete. This is a latent race condition in the cache population logic.
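One way to close the partial-population window described above is to parse into a local dict and publish it to the module-level cache only after the whole file has been read. A simplified sketch (the `requirements_path` default and the dropped version validation are illustrative, not the PR's code):

```python
import logging

PACKAGE_VERSIONS = {}

def get_required_packages_versions(requirements_path="requirements.txt"):
    if PACKAGE_VERSIONS:
        return PACKAGE_VERSIONS.copy()
    parsed = {}  # local scratch dict: a failure mid-parse leaves the cache untouched
    try:
        with open(requirements_path, "r", encoding="utf-8") as f:
            for line in f:
                parts = line.strip().replace(">=", "==").split("==")
                if len(parts) == 2:
                    parsed[parts[0]] = parts[1]
    except OSError as e:
        logging.error("Error reading %s: %s", requirements_path, e)
        return {}  # empty dict keeps .get() call sites from crashing on None
    PACKAGE_VERSIONS.update(parsed)  # publish only after a complete parse
    return PACKAGE_VERSIONS.copy()
```

Because the cache is populated in one step at the end, a later call can never observe a half-filled `PACKAGE_VERSIONS` as if it were complete.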

```python
        return wrapped

    async def flush_transport_state(self) -> int:
        if os.environ.get("PYISOLATE_ISOLATION_ACTIVE") != "1":
```
Copilot AI Mar 5, 2026

In comfy/isolation/extension_wrapper.py, the flush_transport_state method (line 408) still checks os.environ.get("PYISOLATE_ISOLATION_ACTIVE") != "1" rather than the migrated PYISOLATE_CHILD env var. Similarly, line 309 uses PYISOLATE_ISOLATION_ACTIVE. The PR description says the migration is to PYISOLATE_CHILD, but these two places in extension_wrapper.py were not updated, leaving inconsistent env-var checks. The same applies to model_patcher_proxy_utils.py line 16 and runtime_helpers.py line 179.

Suggested change

```diff
-        if os.environ.get("PYISOLATE_ISOLATION_ACTIVE") != "1":
+        if os.environ.get("PYISOLATE_CHILD") != "1":
```

Comment on lines 235 to +259
```python
    def _call(self, method_name: str, *args: Any) -> Any:
        rpc = self._get_rpc()
        method = getattr(rpc, method_name)
        result = method(self._instance_id, *args)
        timeout_ms = self._rpc_timeout_ms()
        start_epoch = time.time()
        start_perf = time.perf_counter()
        thread_id = threading.get_ident()
        call_id = "%s:%s:%s:%.6f" % (
            self._instance_id,
            method_name,
            thread_id,
            start_perf,
        )
        logger.debug(
            "ISO:modelsampling_rpc_start method=%s instance_id=%s call_id=%s start_ts=%.6f thread=%s timeout_ms=%s",
            method_name,
            self._instance_id,
            call_id,
            start_epoch,
            thread_id,
            timeout_ms,
        )
        if asyncio.iscoroutine(result):
            result = asyncio.wait_for(result, timeout=timeout_ms / 1000.0)
```
Copilot AI Mar 5, 2026


In ModelSamplingProxy._call, the RPC method is called on line 238 (result = method(self._instance_id, *args)) before timeout_ms, start_epoch, and start_perf are computed. This means:

  1. The asyncio.wait_for timeout wraps the already-resolved result of the synchronous/completed RPC call, making it a no-op.
  2. The "ISO:modelsampling_rpc_start" log message is emitted after the RPC call has already returned, so it does not represent the actual start of the operation.
  3. The elapsed_ms measurement captures post-call overhead only, not the actual RPC duration.

The timing setup, logging, and asyncio.wait_for wrapping should all happen before result = method(...) is called.
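The reordering the comment asks for can be sketched as follows: create the coroutine, start the clock, then await under the timeout. Names and the simplified sync fast path are illustrative, not the PR's exact code:

```python
import asyncio
import time

async def call_with_timeout(method, instance_id, *args, timeout_ms=5000):
    # Timing and timeout must be set up BEFORE the call completes;
    # wrapping an already-resolved result in wait_for is a no-op.
    start = time.perf_counter()
    coro = method(instance_id, *args)  # creates the coroutine, does not await it
    if asyncio.iscoroutine(coro):
        result = await asyncio.wait_for(coro, timeout=timeout_ms / 1000.0)
    else:
        result = coro  # synchronous fast path: already a plain value
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Here `elapsed_ms` covers the full await, so it reflects the actual RPC duration, and the timeout genuinely bounds the in-flight call.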

Comment on lines +46 to +51
```python
    except FileNotFoundError:
        logging.error("requirements.txt not found.")
        return None
    except Exception as e:
        logging.error(f"Error reading requirements.txt: {e}")
        return None
```
Copilot AI Mar 5, 2026


In get_required_packages_versions, when FileNotFoundError or another exception is caught, the function returns None. However, at the call sites (get_required_frontend_version and get_required_templates_version) the result is passed directly to .get(...) — e.g. get_required_packages_versions().get("comfyui-frontend-package", None). If the function returns None, this will raise AttributeError: 'NoneType' object has no attribute 'get', crashing the frontend version check instead of gracefully returning None.

Previously, the callers had their own error handling. The callers should check for None before calling .get(), or the function should return an empty dict on error instead of None.
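The empty-dict alternative keeps chained `.get()` calls safe without touching the call sites. A small self-contained sketch (the `_safe` name and simplified parsing are hypothetical, for illustration only):

```python
def get_required_packages_versions_safe(path="requirements.txt"):
    # Return {} rather than None on error, so callers can chain
    # .get() without a None check.
    try:
        with open(path, "r", encoding="utf-8") as f:
            pairs = (ln.strip().replace(">=", "==").split("==") for ln in f)
            return {p[0]: p[1] for p in pairs if len(p) == 2}
    except OSError:
        return {}

# Call sites then degrade gracefully instead of raising AttributeError:
frontend_version = get_required_packages_versions_safe("missing.txt").get(
    "comfyui-frontend-package")  # -> None, not a crash
```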

Comment on lines +140 to +152
```python
    def _run_operation_with_lease(self, instance_id: str, method_name: str, fn):
        with self._operation_state_cv:
            state = self._get_or_create_operation_state(instance_id)
            lease = state.lease
        with lease:
            _, start_perf = self._begin_operation(instance_id, method_name)
            try:
                result = fn()
            except Exception as exc:
                self._end_operation(instance_id, method_name, start_perf, error=exc)
                raise
            self._end_operation(instance_id, method_name, start_perf)
            return result
```
Copilot AI Mar 5, 2026


The _run_operation_with_lease method acquires a threading.Lock (per-instance "lease") synchronously inside an async method. Since all async registry methods (partially_load, patch_model, etc.) are awaitable and likely called from an async event loop, blocking on a threading.Lock within an async context can deadlock or starve the event loop. In Python's asyncio, blocking a thread on a lock inside an async coroutine means the event loop's thread cannot process other tasks while waiting.

This matters for methods like wait_all_idle, which waits on the same _operation_state_cv (which wraps self._lock) while _run_operation_with_lease may be holding it. If both are awaited from the same single-threaded asyncio loop, deadlock is possible.
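If the registry ran entirely on the event loop, one conventional fix is to make the per-instance lease an `asyncio.Lock`, which suspends the waiting coroutine instead of blocking the loop's thread. A minimal sketch under that assumption (class and method names are illustrative, not the PR's code):

```python
import asyncio

class OperationState:
    def __init__(self):
        self.lease = asyncio.Lock()  # awaiting this never blocks the loop thread
        self.active = 0

async def run_operation_with_lease(state, fn):
    async with state.lease:  # cooperative wait: other coroutines keep running
        state.active += 1
        try:
            return await fn()
        finally:
            state.active -= 1  # always decrement, even on error
```

The trade-off is that every caller must be a coroutine on the same loop; code invoked from worker threads would still need a `threading.Lock` or a thread-safe bridge.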

Comment on lines +28 to +35
```python
import comfy_aimdo.control

if enables_dynamic_vram():
    if not comfy_aimdo.control.init():
        logging.warning(
            "DynamicVRAM requested, but comfy-aimdo failed to initialize early. "
            "Will fall back to legacy model loading if device init fails."
        )
```
Copilot AI Mar 5, 2026


The comfy_aimdo.control.init() call at the top of main.py (lines 30-35) happens before logging is configured, before torch is imported, and before comfy.model_management is available. If comfy_aimdo.control.init() requires any of these (e.g., torch or CUDA), it will fail at a point where meaningful error handling is not yet in place. The original code placed init() much later after torch was confirmed imported and comfy.model_management was available. The logging.warning on line 32 is also called before the logging system is configured by setup_logger, so the warning level/format may not be set correctly.

Comment on lines +202 to +234
```python
    async def wait_for_idle(self, instance_id: str, timeout_ms: int = 0) -> bool:
        timeout_s = None if timeout_ms <= 0 else (timeout_ms / 1000.0)
        deadline = None if timeout_s is None else (time.monotonic() + timeout_s)
        with self._operation_state_cv:
            while True:
                active = self._operation_states.get(instance_id)
                if active is None or active.active_count == 0:
                    return True
                if deadline is None:
                    self._operation_state_cv.wait()
                    continue
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._operation_state_cv.wait(timeout=remaining)

    async def wait_all_idle(self, timeout_ms: int = 0) -> bool:
        timeout_s = None if timeout_ms <= 0 else (timeout_ms / 1000.0)
        deadline = None if timeout_s is None else (time.monotonic() + timeout_s)
        with self._operation_state_cv:
            while True:
                has_active = any(
                    state.active_count > 0 for state in self._operation_states.values()
                )
                if not has_active:
                    return True
                if deadline is None:
                    self._operation_state_cv.wait()
                    continue
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._operation_state_cv.wait(timeout=remaining)
```
Copilot AI Mar 5, 2026


The wait_for_idle async method holds the _operation_state_cv (a threading.Condition) lock while calling self._operation_state_cv.wait(). This is a blocking call that will block the asyncio event loop thread if called from a coroutine running in that thread. The same applies to wait_all_idle. These methods are async but use blocking threading.Condition.wait() internally, which will block the event loop and prevent other coroutines from running while waiting for operations to complete.
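One conventional way to keep a `threading.Condition` wait from stalling the loop is to run the blocking wait in a worker thread via `asyncio.to_thread`. A sketch with a hypothetical `Registry` reduced to a single active-operation counter (not the PR's actual class):

```python
import asyncio
import threading
import time

class Registry:
    def __init__(self):
        self._cv = threading.Condition()
        self._active = 0  # total in-flight operations

    def _wait_idle_blocking(self, timeout_s):
        # Plain threading wait; safe to block because it runs off-loop.
        deadline = None if timeout_s is None else time.monotonic() + timeout_s
        with self._cv:
            while self._active > 0:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return False
                self._cv.wait(timeout=remaining)
            return True

    async def wait_all_idle(self, timeout_ms=0):
        timeout_s = None if timeout_ms <= 0 else timeout_ms / 1000.0
        # Offload the blocking wait so the event loop keeps servicing
        # the very RPCs that will decrement the counter and notify.
        return await asyncio.to_thread(self._wait_idle_blocking, timeout_s)
```

Operations would increment `_active` on start and, on completion, decrement it and call `self._cv.notify_all()` under the condition's lock.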

@pollockjj pollockjj merged commit 623a9d2 into Comfy-Org:pyisolate-support Mar 5, 2026
18 checks passed