feat: Add CacheProvider API for external distributed caching #12056

Open
deepme987 wants to merge 33 commits into master from feat/cache-provider-api

Conversation

@deepme987

Summary

Adds a public CacheProvider API that enables external cache providers to integrate with ComfyUI's caching system. This allows distributed caching across multiple ComfyUI instances (e.g., Kubernetes pods) without monkey-patching internal methods.

Key changes:

  • New comfy_execution/cache_provider.py module with public API:
    • CacheProvider abstract base class with on_lookup, on_store, should_cache methods
    • CacheContext and CacheValue dataclasses for type-safe data passing
    • Thread-safe provider registry (register_cache_provider, get_cache_providers, etc.)
    • Utility functions: serialize_cache_key, contains_nan, estimate_value_size
  • Modified comfy_execution/caching.py to call registered providers on cache miss/store
  • Modified execution.py to notify providers on prompt start/end
  • Deterministic cache key hashing using canonicalize + JSON (not pickle) for cross-pod consistency
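The provider surface described above can be pictured with a small stand-in sketch. The real classes live in comfy_execution/cache_provider.py (and, per later commits in this PR, the hooks became async); the simplified synchronous stand-ins below only mirror the documented method names, and InMemoryProvider is a hypothetical toy implementation:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class CacheValue:
    outputs: Any

class CacheProvider(ABC):
    @abstractmethod
    def on_lookup(self, context: str) -> Optional[CacheValue]: ...

    @abstractmethod
    def on_store(self, context: str, value: CacheValue) -> None: ...

    def should_cache(self, context: str, value: CacheValue) -> bool:
        # Default: cache everything; implementations may filter.
        return True

class InMemoryProvider(CacheProvider):
    """Toy provider backed by a dict keyed by the cache-key hash string."""
    def __init__(self) -> None:
        self._store: Dict[str, CacheValue] = {}

    def on_lookup(self, context: str) -> Optional[CacheValue]:
        return self._store.get(context)

    def on_store(self, context: str, value: CacheValue) -> None:
        if self.should_cache(context, value):
            self._store[context] = value
```

A real provider would replace the dict with a shared backend (e.g. an external key-value store) so that multiple instances see the same entries.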

Why this matters:

  • Replaces fragile monkey-patching approach with a stable, public API
  • Enables Comfy Cloud (and other deployments) to share cached results across pods
  • Reduces redundant computation for expensive nodes (KSampler, VAEDecode, etc.)

Test plan

  • Added unit tests in tests-unit/execution_test/test_cache_provider.py
    • Tests for _canonicalize deterministic ordering
    • Tests for serialize_cache_key hash consistency
    • Tests for contains_nan detection
    • Tests for estimate_value_size
    • Tests for provider registry (register/unregister/clear)
  • Verified deterministic hashing works across Python sessions
  • Tested with Comfy Cloud's SidecarCacheProvider implementation

🤖 Generated with Claude Code

deepme987 and others added 4 commits January 19, 2026 16:43
Introduces a public API for external cache providers, enabling distributed
caching across multiple ComfyUI instances (e.g., Kubernetes pods).

New files:
- comfy_execution/cache_provider.py: CacheProvider ABC, CacheContext/CacheValue
  dataclasses, thread-safe provider registry, serialization utilities

Modified files:
- comfy_execution/caching.py: Add provider hooks to BasicCache (_notify_providers_store,
  _check_providers_lookup), subcache exclusion, prompt ID propagation
- execution.py: Add prompt lifecycle hooks (on_prompt_start/on_prompt_end) to
  PromptExecutor, set _current_prompt_id on caches

Key features:
- Local-first caching (check local before external for performance)
- NaN detection to prevent incorrect external cache hits
- Subcache exclusion (ephemeral subgraph results not cached externally)
- Thread-safe provider snapshot caching
- Graceful error handling (provider errors logged, never break execution)
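The "never break execution" guarantee amounts to shielding every provider call. A minimal sketch of that pattern, with a hypothetical `notify_providers_safely` helper (not the PR's actual function name):

```python
import logging

logger = logging.getLogger(__name__)

def notify_providers_safely(providers, value):
    """Invoke each provider hook; log failures instead of raising,
    so a misbehaving provider can never abort prompt execution."""
    for provider in providers:
        try:
            provider(value)
        except Exception:
            logger.exception("Cache provider %r failed; continuing", provider)

# A provider that always raises must not stop the others:
calls = []
def bad(_):
    raise RuntimeError("boom")
def good(v):
    calls.append(v)

notify_providers_safely([bad, good], "result")
```

After the call, `calls` holds `"result"`: the failing provider was logged and skipped, and the healthy one still ran.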

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Pickle serialization is NOT deterministic across Python sessions due
to hash randomization affecting frozenset iteration order. This causes
distributed caching to fail because different pods compute different
hashes for identical cache keys.

Fix: Use _canonicalize() + JSON serialization which ensures deterministic
ordering regardless of Python's hash randomization.

This is critical for cross-pod cache key consistency in Kubernetes
deployments.
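The deterministic approach can be illustrated with a simplified sketch: canonicalize to JSON with sorted keys, then hash. (The PR's real _serialize_cache_key also canonicalizes tuples, frozensets, and other types; this toy version handles only JSON-native data.)

```python
import hashlib
import json

def serialize_cache_key(key) -> str:
    """Deterministic hex digest: JSON with sorted keys and fixed
    separators produces byte-identical output across processes,
    regardless of hash randomization."""
    blob = json.dumps(key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

h1 = serialize_cache_key({"b": 2, "a": 1})
h2 = serialize_cache_key({"a": 1, "b": 2})
assert h1 == h2  # insertion order no longer affects the hash
```

By contrast, pickling a frozenset of strings can yield different bytes in different Python sessions, because PYTHONHASHSEED changes the set's iteration order.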

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add comprehensive tests for _canonicalize deterministic ordering
- Add tests for serialize_cache_key hash consistency
- Add tests for contains_nan utility
- Add tests for estimate_value_size
- Add tests for provider registry (register, unregister, clear)
- Move json import to top-level (fix inline import)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
deepme987 and others added 2 commits January 24, 2026 15:34
Fixes ruff F821 (undefined name) and F401 (unused import) errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Frozensets can only contain hashable types, so use nested frozensets
instead of dicts. Added separate test for dict handling via serialize_cache_key.
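The hashability constraint is easy to demonstrate; representing nested mappings as frozensets of (key, value) pairs is the workaround used here:

```python
# dicts are unhashable, so they cannot be frozenset members:
try:
    frozenset([{"a": 1}])
    raise AssertionError("should have raised")
except TypeError:
    pass  # "unhashable type: 'dict'"

# Workaround: represent a nested mapping as a frozenset of
# (key, value) pairs, which is itself hashable.
nested = frozenset([("a", 1), ("b", frozenset([("c", 2)]))])
assert nested in {nested}  # usable as a set member
```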

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@deepme987 deepme987 marked this pull request as draft January 24, 2026 10:19
@deepme987 deepme987 marked this pull request as ready for review January 28, 2026 18:56
deepme987 and others added 4 commits January 29, 2026 19:42
- Add Caching class to comfy_api/latest/__init__.py that re-exports
  from comfy_execution.cache_provider (source of truth)
- Fix docstring: "Skip large values" instead of "Skip small values"
  (small compute-heavy values are good cache targets)
- Maintain backward compatibility: comfy_execution.cache_provider
  imports still work

Usage:
    from comfy_api.latest import Caching

    class MyProvider(Caching.CacheProvider):
        def on_lookup(self, context): ...
        def on_store(self, context, value): ...

    Caching.register_provider(MyProvider())

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Change docstring from "Skip large values" to "Skip if download time > compute time"
which better captures the cost/benefit tradeoff for external caching.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Remove prescriptive filtering suggestions - let implementations
decide their own caching logic based on their use case.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add ui field to CacheValue dataclass (default None)
- Pass ui when creating CacheValue for external providers
- Use result.ui (or default {}) when returning from external cache lookup

This allows external cache implementations to store/retrieve UI data
if desired, while remaining optional for implementations that skip it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@deepme987 deepme987 requested a review from guill January 29, 2026 14:54
Clearer name since objects are also cached locally - this specifically
checks for external caching eligibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@deepme987 deepme987 requested a review from guill February 3, 2026 15:54
- Make on_lookup/on_store async on CacheProvider ABC
- Simplify CacheContext: replace cache_key + cache_key_bytes with
  cache_key_hash (str hex digest)
- Make registry/utility functions internal (_prefix)
- Trim comfy_api.latest.Caching exports to core API only
- Make cache get/set async throughout caching.py hierarchy
- Use asyncio.create_task for fire-and-forget on_store
- Add NaN gating before provider calls in Core
- Add await to 5 cache call sites in execution.py
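The fire-and-forget store pattern can be sketched in isolation. This is a toy model, not the PR's actual code: `cache_set` and `on_store` are hypothetical names standing in for the local set and the provider hook:

```python
import asyncio

local = []
stored = []

async def on_store(value):
    # Stand-in for a slow external write (e.g. a network call).
    await asyncio.sleep(0)
    stored.append(value)

async def cache_set(value):
    """Store locally, then schedule the external store without
    awaiting it, so execution is never blocked on a provider."""
    local.append(value)
    # Keep a reference to the task; otherwise it may be
    # garbage-collected before it runs.
    return asyncio.create_task(on_store(value))

async def main():
    task = await cache_set(42)
    # Real fire-and-forget code does not await the task; we await
    # here only so the demo can observe the side effect.
    await task

asyncio.run(main())
```

After `main` completes, both `local` and `stored` contain the value, but the caller of `cache_set` returned before the external write finished.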

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds an asynchronous, pluggable distributed cache provider system and integrates it into caching and execution. Introduces CacheContext, CacheValue, and an abstract CacheProvider with async on_lookup/on_store plus lifecycle hooks; a thread-safe provider registry and utilities for deterministic canonicalization, cache-key hashing, NaN detection, and size estimation. Converts core cache APIs to async, propagates per-prompt IDs to caches, consults external providers on misses and warms local cache on hits, and notifies providers of prompt lifecycle events from the executor. Adds unit tests covering utilities and registry behavior.

Changes

  • Public API Exposure (comfy_api/latest/__init__.py):
    Adds a public Caching class under Types that re-exports CacheProvider, CacheContext, CacheValue, and registration helpers (register_provider, unregister_provider); updates __all__ and Types.
  • Cache Provider Framework (comfy_execution/cache_provider.py):
    Adds CacheContext and CacheValue dataclasses, a CacheProvider ABC with async on_lookup/on_store, optional should_cache and lifecycle hooks, a provider registry (register_cache_provider/unregister_cache_provider), and internal utilities for deterministic canonicalization, key serialization (SHA-256 with pickle fallback), NaN detection, and size estimation. Exposes registry query/management helpers.
  • Caching Layer Integration (comfy_execution/caching.py):
    Converts core cache APIs to async, adds _is_subcache and _current_prompt_id propagation, consults external providers on cache misses and warms the local cache on provider hits, fires provider stores asynchronously (fire-and-forget), and adds validation/safeguards for external caching (e.g., NaN handling). Updates many internal call paths to async and adds local sync variants (get_local, set_local).
  • Execution Integration (execution.py):
    Switches execution to await async cache operations for outputs and objects, adds _notify_prompt_lifecycle(self, event: str, prompt_id: str) on PromptExecutor, and invokes prompt start/end notifications in a finally block.
  • Graph internals (comfy_execution/graph.py):
    Replaces some uses of the main cache accessors with local variants (get_local/set_local) in ExecutionList to avoid external-path interactions for those internal flows.
  • Unit Tests (tests-unit/execution_test/test_cache_provider.py):
    Adds tests covering canonicalization, deterministic key hashing, NaN detection, size estimation (with optional PyTorch checks), provider registry behavior (register/unregister/clear/duplicates), dataclass integrity, and a MockCacheProvider for interaction tests.
Sequence Diagram

sequenceDiagram
    participant Executor as Executor
    participant LocalCache as Local Cache
    participant Provider as Cache Provider(s)
    
    rect rgba(100, 150, 200, 0.5)
    Note over Executor,Provider: Cache Lookup Flow
    Executor->>LocalCache: await get(node_id)
    LocalCache->>LocalCache: Check local storage
    alt Cache Hit
        LocalCache-->>Executor: Return cached value
    else Cache Miss
        LocalCache->>Provider: await on_lookup(context)
        Provider-->>LocalCache: CacheValue or None
        alt Provider Hit
            LocalCache->>LocalCache: Warm local cache
            LocalCache-->>Executor: Return external value
        else Provider Miss
            LocalCache-->>Executor: Return None
        end
    end
    end
    
    rect rgba(150, 200, 100, 0.5)
    Note over Executor,Provider: Cache Store Flow
    Executor->>LocalCache: await set(node_id, value)
    LocalCache->>LocalCache: Store locally
    LocalCache->>Provider: _notify_providers_store(context, value)
    Provider->>Provider: await on_store(context, value)
    Note over Provider: Fire-and-forget
    LocalCache-->>Executor: Complete
    end
    
    rect rgba(200, 150, 100, 0.5)
    Note over Executor,Provider: Lifecycle Management
    Executor->>Provider: on_prompt_start(prompt_id)
    Note over Provider: Initialize for prompt
    Executor->>Executor: Execute nodes with caching
    Executor->>Provider: on_prompt_end(prompt_id)
    Note over Provider: Cleanup for prompt
    end
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 19.15%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title accurately and concisely describes the primary change: introducing a public CacheProvider API for external distributed caching.
  • Description check ✅ Passed: The description is comprehensive and directly related to the changeset, explaining the purpose, key changes, and rationale for the new CacheProvider API.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_execution/cache_provider.py`:
- Around line 178-202: The helper functions are defined with leading underscores
(_get_cache_providers, _has_cache_providers, _clear_cache_providers) while
callers in the PR import non-underscored names, causing import/runtime errors;
add thin public wrappers named get_cache_providers(), has_cache_providers(), and
clear_cache_providers() that simply call and return the existing underscored
implementations (preserving behavior around _providers_snapshot, _providers_lock
and _providers) or alternatively rename all callers to use the existing
underscored names—ensure the public wrappers delegate to _get_cache_providers,
_has_cache_providers, and _clear_cache_providers so existing internal
locking/caching behavior is preserved.
- Around line 260-267: Replace the process-local id()-based fallback in
comfy_execution/cache_provider.py so that when both deterministic serialization
and pickle serialization of cache_key fail the function returns None
(fail-closed) instead of hashing id(cache_key); specifically modify the except
block around pickle.dumps to return None on exception rather than
hashlib.sha256(str(id(cache_key)).encode()).hexdigest(). Then update
comfy_execution/caching.py to check the cache_key_hash variable and skip calling
the provider (do not attempt get/set on the provider) when cache_key_hash is
None so caching is bypassed for non-serializable keys.

In `@comfy_execution/caching.py`:
- Around line 231-235: Remove the unused imported names from the provider-hook
import blocks: in the import that currently imports _has_cache_providers,
_get_cache_providers, CacheContext, CacheValue, _serialize_cache_key,
_contains_nan, _logger remove CacheContext and _serialize_cache_key so only
_has_cache_providers, _get_cache_providers, CacheValue, _contains_nan, and
_logger remain; likewise remove CacheContext from the later import block (the
one around line ~270) so it does not get imported where unused. Keep the other
symbols (_has_cache_providers, _get_cache_providers, CacheValue, _contains_nan,
_logger) intact and rerun CI/lint to confirm F401 is resolved.
- Around line 212-223: The code calls self._check_providers_lookup on every
local cache miss (in _get_immediate), causing unnecessary external lookups for
non-output caches like caches.objects; restrict external provider checks to
output-value keys only by guarding the call with a check against the cache-key
type from cache_key_set (e.g., use a method such as
cache_key_set.is_output_key(node_id) or cache_key_set.is_output(cache_key) if
available) so that _check_providers_lookup(cache_key) is invoked only when the
key represents an output value; apply the same guard to the other affected block
(lines ~268-305) that currently triggers provider lookups on non-output misses.

In `@tests-unit/execution_test/test_cache_provider.py`:
- Around line 135-174: Tests assume serialize_cache_key returns raw SHA256 bytes
but the cache-provider contract returns a hex string; update assertions in
test_returns_bytes and test_dict_in_cache_key (and similar locations) to assert
isinstance(result, str) and len(result) == 64 (SHA256 hex length). Replace any
uses of non-existent CacheContext fields cache_key or cache_key_bytes with
cache_key_hash when constructing CacheContext instances (refer to CacheContext).
Finally, update tests that mock provider methods to follow the provider's async
signatures (use async def test helpers and await the provider methods or return
awaitable mocks) so they match the async method shape used by the provider
interface.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f5bd39 and 4cbe4fe.

📒 Files selected for processing (5)
  • comfy_api/latest/__init__.py
  • comfy_execution/cache_provider.py
  • comfy_execution/caching.py
  • execution.py
  • tests-unit/execution_test/test_cache_provider.py

- Remove unused CacheContext and _serialize_cache_key imports from
  caching.py (now handled by _build_context helper)
- Update test_cache_provider.py to use _-prefixed internal names
- Update tests for new CacheContext.cache_key_hash field (str)
- Make MockCacheProvider methods async to match ABC

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
deepme987 and others added 4 commits March 3, 2026 16:52
Make all docstrings and comments generic for the OSS codebase.
Remove references to Kubernetes, Redis, GCS, pods, and other
infrastructure-specific terminology.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip verbose docstrings and section banners to match existing minimal
documentation style used throughout the codebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add docstring with usage example to Caching class matching the
  convention used by sibling APIs (Execution.set_progress, ComfyExtension)
- Remove non-deterministic pickle fallback from _serialize_cache_key;
  return None on JSON failure instead of producing unretrievable hashes
- Move cache_provider imports to top of execution.py (no circular dep)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback:
- Move CacheProvider/CacheContext/CacheValue definitions to
  comfy_api/latest/_caching.py (source of truth for public API)
- comfy_execution/cache_provider.py re-exports types from there
- Build _providers_snapshot eagerly on register/unregister instead
  of lazy memoization in _get_cache_providers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
deepme987 and others added 3 commits March 3, 2026 18:32
Address review feedback from guill:
- Rename _contains_nan to _contains_self_unequal, use not (x == x)
  instead of math.isnan to catch any self-unequal value
- Remove Unhashable and repr() fallbacks from _canonicalize; raise
  ValueError for unknown types so _serialize_cache_key returns None
  and external caching is skipped (fail-closed)
- Update tests for renamed function and new fail-closed behavior
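The `not (x == x)` trick works because NaN is the canonical self-unequal value, but it also catches any object whose `__eq__` behaves that way. A simplified recursive sketch for flat containers (the PR's real _contains_self_unequal may differ in which container types it walks):

```python
import math

def contains_self_unequal(x) -> bool:
    """True if any leaf value is not equal to itself (NaN, or any
    object overriding __eq__ that way)."""
    if isinstance(x, (list, tuple, set)):
        return any(contains_self_unequal(v) for v in x)
    return not (x == x)

assert contains_self_unequal(float("nan"))
assert contains_self_unequal([1.0, math.nan, 2.0])
assert not contains_self_unequal([1.0, "a", 2.0])
```

Using `not (x == x)` rather than `math.isnan` avoids a TypeError on non-float values and needs no isinstance checks on the leaves.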

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CacheContext is imported from _caching and re-exported for use by
caching.py. Add noqa comment to satisfy the linter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Subcache nodes (from node expansion) now participate in external
provider store/lookup. Previously skipped to avoid duplicates, but
the cost of missing partial-expansion cache hits outweighs redundant
stores — especially with looping behavior on the horizon.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@deepme987 deepme987 requested a review from guill March 4, 2026 04:26
Comment on lines +142 to +143
register_cache_provider as register_provider,
unregister_cache_provider as unregister_provider,
Member

We should define these as small wrappers here rather than just importing them. As-is, someone could add/remove a param from register_provider and not realize that it'll be breaking the public API.

Author

something like this?

Define register_provider and unregister_provider as wrapper functions
in the Caching class instead of re-importing. This locks the public
API signature in comfy_api/ so internal changes can't accidentally
break it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@deepme987 deepme987 requested a review from guill March 6, 2026 18:57
from ._caching import CacheProvider, CacheContext, CacheValue

@staticmethod
def register_provider(provider: "Caching.CacheProvider") -> None:
Member

Please follow the pattern of class Execution or class NodeReplacement in this file. To work with both process isolation and our versioning system:

  1. This class must inherit from ProxiedSingleton.
  2. The functions should be defined with this as a singleton (i.e. not staticmethods).
  3. The functions should be defined as async. (People can get an automatically converted non-async version by importing ComfyAPISync.)

return
_providers.append(provider)
_providers_snapshot = tuple(_providers)
_logger.info(f"Registered cache provider: {provider.__class__.__name__}")
Member

nit: Should this be a debug-level log line?

Author

done

deepme987 and others added 4 commits March 9, 2026 14:13
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Caching as a nested class inside ComfyAPI_latest inheriting from
ProxiedSingleton with async instance methods, matching the Execution
and NodeReplacement patterns. Retains standalone Caching class for
direct import convenience.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Follow the Execution/NodeReplacement pattern — the public API methods
contain the actual logic operating on cache_provider module state,
not wrapper functions delegating to free functions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove duplicate standalone Caching class. Define it once as a nested
class in ComfyAPI_latest (matching Execution/NodeReplacement pattern),
with a module-level alias for import convenience.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@deepme987 deepme987 requested a review from guill March 9, 2026 21:37

@dataclass
class CacheContext:
prompt_id: str
Member

Why do we need the prompt_id in the CacheContext? It seems like we might be able to simplify some things (e.g. storing the prompt id in the caches themselves) if we remove this. It also seems like we wouldn't want to care about the prompt id when determining caching (since the whole point of caching is to cache across different prompt ids).

Remove prompt_id from CacheContext — it's not relevant for cache
matching and added unnecessary plumbing (_current_prompt_id on every
cache). Lifecycle hooks still receive prompt_id directly.

Include type name in canonicalized primitives so that int 7 and
str "7" produce distinct hashes. Also canonicalize dict keys properly
instead of str() coercion.
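Type tagging can be sketched with a simplified canonicalizer. This is an illustrative model of the fix, not the PR's actual _canonicalize (which also handles frozensets and fails closed on unknown types); `canonicalize` and `key_hash` here are hypothetical names:

```python
import hashlib
import json

def canonicalize(x):
    """Tag each value with its type name so int 7 and str "7"
    canonicalize differently; dict keys are canonicalized too,
    not coerced with str()."""
    if isinstance(x, dict):
        items = [[canonicalize(k), canonicalize(v)] for k, v in x.items()]
        return ["dict", sorted(items, key=repr)]
    if isinstance(x, (list, tuple)):
        return [type(x).__name__, [canonicalize(v) for v in x]]
    return [type(x).__name__, x]

def key_hash(x) -> str:
    blob = json.dumps(canonicalize(x), separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

assert key_hash(7) != key_hash("7")                # type-tagged: distinct
assert key_hash({1: "a"}) != key_hash({"1": "a"})  # dict keys canonicalized
```

Without the type tags, JSON would render `7` and `"7"` as different tokens but `str()`-coerced dict keys would collide, so tagging every level keeps the scheme uniform.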

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@deepme987 deepme987 requested a review from guill March 10, 2026 04:10