
feat(isolation): DynamicVRAM compatibility for process isolation#12775

Merged
pollockjj merged 1 commit into Comfy-Org:pyisolate-support from pollockjj:pyisolate-pr-20260304
Mar 5, 2026
Conversation


@pollockjj pollockjj commented Mar 5, 2026

Summary

DynamicVRAM's on-demand model loading/offloading conflicted with process isolation in three ways: RPC tensor transport stalls from mid-call GPU offload, race conditions between model lifecycle and active RPC operations, and false positive memory leak detection from changed finalizer patterns.

  • Marshal CUDA tensors to CPU before RPC transport for dynamic models
  • Add operation state tracking + quiescence waits at workflow boundaries
  • Distinguish proxy reference release from actual leaks in cleanup_models_gc
  • Fix init order: DynamicVRAM must initialize before isolation proxies
  • Add RPC timeouts to prevent indefinite hangs on model unavailability
  • Prevent proxy-of-proxy chains from DynamicVRAM model reload cycles
  • Add torch.device/torch.dtype serializers for new DynamicVRAM RPC paths
  • Guard isolation overhead so non-isolated workflows are unaffected
  • Migrate env var to PYISOLATE_CHILD
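The first bullet — marshalling CUDA tensors to CPU before RPC transport — can be sketched roughly as below. The file summary in the review mentions a `_to_cpu_for_rpc` helper; this sketch uses a duck-typed stand-in for `torch.Tensor` so it runs without torch, and is illustrative rather than the PR's actual code:

```python
class FakeCudaTensor:
    # Stand-in for torch.Tensor so the sketch runs without torch;
    # real code would check isinstance(obj, torch.Tensor) instead.
    def __init__(self, on_gpu=True):
        self.is_cuda = on_gpu

    def cpu(self):
        return FakeCudaTensor(on_gpu=False)


def to_cpu_for_rpc(obj):
    # Recursively move device tensors to CPU before RPC transport, so
    # DynamicVRAM can offload the GPU copy mid-call without the RPC
    # serializer stalling on unmapped device memory.
    if hasattr(obj, "is_cuda"):
        return obj.cpu() if obj.is_cuda else obj
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu_for_rpc(x) for x in obj)
    if isinstance(obj, dict):
        return {k: to_cpu_for_rpc(v) for k, v in obj.items()}
    return obj
```

On the proxy side, a matching device-restore step would move the tensors back to their original device after transport.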

Test plan

  • Run baseline (non-isolated) workflow to confirm no regression
  • Run isolated workflow with DynamicVRAM enabled — verify no RPC stalls or tensor transport hangs
  • Run sequential isolated workflows — verify quiescence waits prevent cleanup race conditions
  • Confirm no false positive "memory leak" warnings in logs during normal model cycling
  • Verify non-isolated workflows show zero isolation overhead (guard checks)

API Node PR Checklist

Scope

  • Is API Node Change

Pricing & Billing

  • Need pricing update
  • No pricing update

If Need pricing update:

  • Metronome rate cards updated
  • Auto‑billing tests updated and passing

QA

  • QA done
  • QA not required

Comms

  • Informed Kosinkadink

Copilot AI review requested due to automatic review settings March 5, 2026 05:51
@socket-security
Copy link

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Updated | comfyui-workflow-templates 0.9.4 ⏵ 0.9.5 | 100 | 100 | 98 | 100 | 100 |

View full report

Contributor

Copilot AI left a comment


Pull request overview

This PR adds DynamicVRAM compatibility with process isolation, resolving three conflict classes: RPC tensor transport stalls during mid-call GPU offload, race conditions between model lifecycle and active RPC operations, and false-positive memory leak detection.

Changes:

  • Marshal CUDA tensors to CPU at RPC boundaries for dynamic models and add device-restore helpers on the proxy side
  • Add per-operation state tracking and quiescence waits (wait_for_model_patcher_quiescence) at workflow boundaries to prevent cleanup races
  • Distinguish proxy reference release from actual memory leaks in cleanup_models_gc (isolation-aware branching), fix is_dead logic, and add NULL-safety for model_finalizer
  • New WAN21_SCAIL model support (detection, model class, node, sampling type)
  • Migrate isolation env var checks from PYISOLATE_ISOLATION_ACTIVE to PYISOLATE_CHILD; guard isolation overhead behind args.use_process_isolation; fix DynamicVRAM init order; add --disable-dynamic-vram CLI flag; extend enables_dynamic_vram() logic

Reviewed changes

Copilot reviewed 36 out of 38 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| comfy/isolation/model_sampling_proxy.py | CPU marshalling for RPC tensor transport; adds timeout, logging, and _to_cpu_for_rpc helper |
| comfy/isolation/model_patcher_proxy_registry.py | Operation state tracking, quiescence waits, and CPU marshalling for apply_model/latent RPC paths |
| comfy/isolation/proxies/base.py | RPC timeout support, structured debug logging for all proxy RPC calls |
| comfy/isolation/__init__.py | Adds wait_for_model_patcher_quiescence, quiescence calls in notify/flush, guards eviction on isolated-only graphs |
| execution.py | Guards isolation overhead with args.use_process_isolation; adds quiescence waits at workflow boundaries |
| comfy/model_management.py | cleanup_models_gc isolation-aware split; is_dead/dead_state refactor; is_wsl() helper; _isolation_mode_enabled() |
| comfy/model_base.py | ModelType.IMG_TO_IMG_FLOW; WAN21_FlowRVS/WAN21_SCAIL model classes; isolation-guarded __reduce__ in ModelSampling |
| comfy/ldm/wan/model.py | SCAILWanModel with RoPE-encoded pose conditioning |
| comfy/supported_models.py | WAN21_SCAIL config registered in models list |
| comfy_extras/nodes_wan.py | WanSCAILToVideo node with pose/reference latent support |
| comfy/sd.py | disable_dynamic flag threaded through CLIP loading; load_clip_model_patcher and load_checkpoint_guess_config_clip_only helpers for DynamicVRAM reload |
| comfy/model_patcher.py | get_clone_model_override, get_non_dynamic_delegate; model_override param in clone |
| comfy/samplers.py | Env-var migration; get_non_dynamic_delegate call when cond has hooks |
| comfy/cli_args.py | --disable-dynamic-vram flag; removes DynamicVRAM from PerformanceFeature; updated enables_dynamic_vram() |
| main.py | DynamicVRAM init order fix (before isolation proxies); WSL guard added |
| comfy/isolation/adapter.py | torch.device/torch.dtype serializers; dynamic ModelSampling class registration; proxy-of-proxy prevention |
| comfy/isolation/model_patcher_proxy.py | Device restoration after CPU-marshalled RPC; get_non_dynamic_delegate guard; _model_sampling cache |
| utils/install_util.py | get_required_packages_versions() centralizes requirements parsing |
| app/frontend_management.py | Delegates to new get_required_packages_versions() |
| node_helpers.py | conditioning_set_values_with_timestep_range helper |
| comfy/hooks.py / comfy/k_diffusion/sampling.py / comfy/context_windows.py | Env-var migration; sampling substep fix |
| requirements.txt / pyproject.toml | Dependency bumps and host isolation config |
| .pyisolate_venvs/*/cache/ | Generated cache files (should be gitignored) |
| tests/execution/test_jobs.py / tests-unit/app_test/frontend_manager_test.py | Test updates for text previewable type and cache clearing |


Comment on lines +29 to +45
```python
PACKAGE_VERSIONS = {}
def get_required_packages_versions():
    if len(PACKAGE_VERSIONS) > 0:
        return PACKAGE_VERSIONS.copy()
    out = PACKAGE_VERSIONS
    try:
        with open(requirements_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip().replace(">=", "==")
                s = line.split("==")
                if len(s) == 2:
                    version_str = s[-1]
                    if not is_valid_version(version_str):
                        logging.error(f"Invalid version format in requirements.txt: {version_str}")
                        continue
                    out[s[0]] = version_str
        return out.copy()
```
Copilot AI Mar 5, 2026

The get_required_packages_versions function has a subtle cache inconsistency. On success it populates PACKAGE_VERSIONS (the global dict out) and returns out.copy(). On failure (e.g., FileNotFoundError), it returns None without modifying PACKAGE_VERSIONS. However, on partial success (some lines parsed, then the with open() block raises partway through—unlikely but possible), PACKAGE_VERSIONS would be partially filled but None is returned. Subsequent calls would then find len(PACKAGE_VERSIONS) > 0 and return the partial data as if it were complete. This is a latent race condition in the cache population logic.
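One way to close the partial-population window described above is to parse into a local dict and publish it to the module-level cache only after the whole file has been read. A simplified sketch (the `requirements_path` default and the dropped version validation are illustrative, not the PR's code):

```python
import logging

PACKAGE_VERSIONS = {}

def get_required_packages_versions(requirements_path="requirements.txt"):
    if PACKAGE_VERSIONS:
        return PACKAGE_VERSIONS.copy()
    parsed = {}  # local scratch dict: a failure mid-parse leaves the cache untouched
    try:
        with open(requirements_path, "r", encoding="utf-8") as f:
            for line in f:
                parts = line.strip().replace(">=", "==").split("==")
                if len(parts) == 2:
                    parsed[parts[0]] = parts[1]
    except OSError as e:
        logging.error("Error reading %s: %s", requirements_path, e)
        return {}  # empty dict keeps .get() call sites from crashing on None
    PACKAGE_VERSIONS.update(parsed)  # publish only after a complete parse
    return PACKAGE_VERSIONS.copy()
```

Because the cache is populated in one step at the end, a later call can never observe a half-filled `PACKAGE_VERSIONS` as if it were complete.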

```python
        return wrapped

    async def flush_transport_state(self) -> int:
        if os.environ.get("PYISOLATE_ISOLATION_ACTIVE") != "1":
```
Copilot AI Mar 5, 2026

In comfy/isolation/extension_wrapper.py, the flush_transport_state method (line 408) still checks os.environ.get("PYISOLATE_ISOLATION_ACTIVE") != "1" rather than the migrated PYISOLATE_CHILD env var. Similarly, line 309 uses PYISOLATE_ISOLATION_ACTIVE. The PR description says the migration is to PYISOLATE_CHILD, but these two places in extension_wrapper.py were not updated, leaving inconsistent env-var checks. The same applies to model_patcher_proxy_utils.py line 16 and runtime_helpers.py line 179.

Suggested change

```diff
-        if os.environ.get("PYISOLATE_ISOLATION_ACTIVE") != "1":
+        if os.environ.get("PYISOLATE_CHILD") != "1":
```

Comment on lines 235 to +259
```python
    def _call(self, method_name: str, *args: Any) -> Any:
        rpc = self._get_rpc()
        method = getattr(rpc, method_name)
        result = method(self._instance_id, *args)
        timeout_ms = self._rpc_timeout_ms()
        start_epoch = time.time()
        start_perf = time.perf_counter()
        thread_id = threading.get_ident()
        call_id = "%s:%s:%s:%.6f" % (
            self._instance_id,
            method_name,
            thread_id,
            start_perf,
        )
        logger.debug(
            "ISO:modelsampling_rpc_start method=%s instance_id=%s call_id=%s start_ts=%.6f thread=%s timeout_ms=%s",
            method_name,
            self._instance_id,
            call_id,
            start_epoch,
            thread_id,
            timeout_ms,
        )
        if asyncio.iscoroutine(result):
            result = asyncio.wait_for(result, timeout=timeout_ms / 1000.0)
```
Copilot AI Mar 5, 2026


In ModelSamplingProxy._call, the RPC method is called on line 238 (result = method(self._instance_id, *args)) before timeout_ms, start_epoch, and start_perf are computed. This means:

  1. The asyncio.wait_for timeout wraps the already-resolved result of the synchronous/completed RPC call, making it a no-op.
  2. The "ISO:modelsampling_rpc_start" log message is emitted after the RPC call has already returned, so it does not represent the actual start of the operation.
  3. The elapsed_ms measurement captures post-call overhead only, not the actual RPC duration.

The timing setup, logging, and asyncio.wait_for wrapping should all happen before result = method(...) is called.
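The reordering the comment asks for can be sketched as follows: create the coroutine, start the clock, then await under the timeout. Names and the simplified sync fast path are illustrative, not the PR's exact code:

```python
import asyncio
import time

async def call_with_timeout(method, instance_id, *args, timeout_ms=5000):
    # Timing and timeout must be set up BEFORE the call completes;
    # wrapping an already-resolved result in wait_for is a no-op.
    start = time.perf_counter()
    coro = method(instance_id, *args)  # creates the coroutine, does not await it
    if asyncio.iscoroutine(coro):
        result = await asyncio.wait_for(coro, timeout=timeout_ms / 1000.0)
    else:
        result = coro  # synchronous fast path: already a plain value
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Here `elapsed_ms` covers the full await, so it reflects the actual RPC duration, and the timeout genuinely bounds the in-flight call.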

Comment on lines +46 to +51
```python
    except FileNotFoundError:
        logging.error("requirements.txt not found.")
        return None
    except Exception as e:
        logging.error(f"Error reading requirements.txt: {e}")
        return None
```
Copilot AI Mar 5, 2026


In get_required_packages_versions, when FileNotFoundError or another exception is caught, the function returns None. However, at the call sites (get_required_frontend_version and get_required_templates_version) the result is passed directly to .get(...) — e.g. get_required_packages_versions().get("comfyui-frontend-package", None). If the function returns None, this will raise AttributeError: 'NoneType' object has no attribute 'get', crashing the frontend version check instead of gracefully returning None.

Previously, the callers had their own error handling. The callers should check for None before calling .get(), or the function should return an empty dict on error instead of None.
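The empty-dict alternative keeps chained `.get()` calls safe without touching the call sites. A small self-contained sketch (the `_safe` name and simplified parsing are hypothetical, for illustration only):

```python
def get_required_packages_versions_safe(path="requirements.txt"):
    # Return {} rather than None on error, so callers can chain
    # .get() without a None check.
    try:
        with open(path, "r", encoding="utf-8") as f:
            pairs = (ln.strip().replace(">=", "==").split("==") for ln in f)
            return {p[0]: p[1] for p in pairs if len(p) == 2}
    except OSError:
        return {}

# Call sites then degrade gracefully instead of raising AttributeError:
frontend_version = get_required_packages_versions_safe("missing.txt").get(
    "comfyui-frontend-package")  # -> None, not a crash
```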

Comment on lines +140 to +152
```python
    def _run_operation_with_lease(self, instance_id: str, method_name: str, fn):
        with self._operation_state_cv:
            state = self._get_or_create_operation_state(instance_id)
            lease = state.lease
        with lease:
            _, start_perf = self._begin_operation(instance_id, method_name)
            try:
                result = fn()
            except Exception as exc:
                self._end_operation(instance_id, method_name, start_perf, error=exc)
                raise
            self._end_operation(instance_id, method_name, start_perf)
            return result
```
Copilot AI Mar 5, 2026


The _run_operation_with_lease method acquires a threading.Lock (per-instance "lease") synchronously inside an async method. Since all async registry methods (partially_load, patch_model, etc.) are awaitable and likely called from an async event loop, blocking on a threading.Lock within an async context can deadlock or starve the event loop. In Python's asyncio, blocking a thread on a lock inside an async coroutine means the event loop's thread cannot process other tasks while waiting.

This matters for methods like wait_all_idle, which waits on the same _operation_state_cv (which wraps self._lock) while _run_operation_with_lease may be holding it. If both are awaited from the same single-threaded asyncio loop, deadlock is possible.
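If the registry ran entirely on the event loop, one conventional fix is to make the per-instance lease an `asyncio.Lock`, which suspends the waiting coroutine instead of blocking the loop's thread. A minimal sketch under that assumption (class and method names are illustrative, not the PR's code):

```python
import asyncio

class OperationState:
    def __init__(self):
        self.lease = asyncio.Lock()  # awaiting this never blocks the loop thread
        self.active = 0

async def run_operation_with_lease(state, fn):
    async with state.lease:  # cooperative wait: other coroutines keep running
        state.active += 1
        try:
            return await fn()
        finally:
            state.active -= 1  # always decrement, even on error
```

The trade-off is that every caller must be a coroutine on the same loop; code invoked from worker threads would still need a `threading.Lock` or a thread-safe bridge.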

Comment on lines +28 to +35
```python
import comfy_aimdo.control

if enables_dynamic_vram():
    if not comfy_aimdo.control.init():
        logging.warning(
            "DynamicVRAM requested, but comfy-aimdo failed to initialize early. "
            "Will fall back to legacy model loading if device init fails."
        )
```
Copilot AI Mar 5, 2026


The comfy_aimdo.control.init() call at the top of main.py (lines 30-35) happens before logging is configured, before torch is imported, and before comfy.model_management is available. If comfy_aimdo.control.init() requires any of these (e.g., torch or CUDA), it will fail at a point where meaningful error handling is not yet in place. The original code placed init() much later after torch was confirmed imported and comfy.model_management was available. The logging.warning on line 32 is also called before the logging system is configured by setup_logger, so the warning level/format may not be set correctly.

Comment on lines +202 to +234
```python
    async def wait_for_idle(self, instance_id: str, timeout_ms: int = 0) -> bool:
        timeout_s = None if timeout_ms <= 0 else (timeout_ms / 1000.0)
        deadline = None if timeout_s is None else (time.monotonic() + timeout_s)
        with self._operation_state_cv:
            while True:
                active = self._operation_states.get(instance_id)
                if active is None or active.active_count == 0:
                    return True
                if deadline is None:
                    self._operation_state_cv.wait()
                    continue
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._operation_state_cv.wait(timeout=remaining)

    async def wait_all_idle(self, timeout_ms: int = 0) -> bool:
        timeout_s = None if timeout_ms <= 0 else (timeout_ms / 1000.0)
        deadline = None if timeout_s is None else (time.monotonic() + timeout_s)
        with self._operation_state_cv:
            while True:
                has_active = any(
                    state.active_count > 0 for state in self._operation_states.values()
                )
                if not has_active:
                    return True
                if deadline is None:
                    self._operation_state_cv.wait()
                    continue
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._operation_state_cv.wait(timeout=remaining)
```
Copilot AI Mar 5, 2026


The wait_for_idle async method holds the _operation_state_cv (a threading.Condition) lock while calling self._operation_state_cv.wait(). This is a blocking call that will block the asyncio event loop thread if called from a coroutine running in that thread. The same applies to wait_all_idle. These methods are async but use blocking threading.Condition.wait() internally, which will block the event loop and prevent other coroutines from running while waiting for operations to complete.
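One conventional way to keep a `threading.Condition` wait from stalling the loop is to run the blocking wait in a worker thread via `asyncio.to_thread`. A sketch with a hypothetical `Registry` reduced to a single active-operation counter (not the PR's actual class):

```python
import asyncio
import threading
import time

class Registry:
    def __init__(self):
        self._cv = threading.Condition()
        self._active = 0  # total in-flight operations

    def _wait_idle_blocking(self, timeout_s):
        # Plain threading wait; safe to block because it runs off-loop.
        deadline = None if timeout_s is None else time.monotonic() + timeout_s
        with self._cv:
            while self._active > 0:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return False
                self._cv.wait(timeout=remaining)
            return True

    async def wait_all_idle(self, timeout_ms=0):
        timeout_s = None if timeout_ms <= 0 else timeout_ms / 1000.0
        # Offload the blocking wait so the event loop keeps servicing
        # the very RPCs that will decrement the counter and notify.
        return await asyncio.to_thread(self._wait_idle_blocking, timeout_s)
```

Operations would increment `_active` on start and, on completion, decrement it and call `self._cv.notify_all()` under the condition's lock.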

@pollockjj pollockjj merged commit 623a9d2 into Comfy-Org:pyisolate-support Mar 5, 2026
18 checks passed