Phambinh/wip cached allocs pingpong#781

Open
phambinhfin wants to merge 5 commits into main from
phambinh/wip_cached-allocs-pingpong

Conversation

@phambinhfin phambinhfin commented Apr 6, 2026

| Mode | Command Buffer Types | Cached Allocs | Median | vs Baseline |
|---|---|---|---|---|
| A. Baseline | none | n/a | 13.601 ms | -- |
| B. FUSION only | FUSION | ON | 14.369 ms | +5.6% (overhead > savings) |
| C. No collectives | FUSION + CUBLAS + CUBLASLT + CUSTOM_CALL | ON | 13.413 ms | -1.4% |
| D. All types | FUSION + CUBLAS + CUBLASLT + CUSTOM_CALL + COLLECTIVES | ON | 10.236 ms | -24.7% |
| E. All types (broken) | same as D | OFF | 287.085 ms | +2010% (re-tracing every step) |

Remove two overly-conservative checks from rocm_command_buffer.cc
(ported from phambinh/full_vmm_solution branch):

1. Trace(): Remove the rejection of empty traced HIP graphs, which caused
   segfaults when a custom call launches no GPU ops.
2. PrepareFinalization(): Remove empty node insertion into empty graphs,
   which could crash with conditional nodes.

Baseline for command buffer performance profiling on main branch.

Implements buffer allocation caching (XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS)
to stabilize buffer addresses across execution steps. Key changes:

- CachedAllocator: wraps DeviceAddressAllocator, intercepts Deallocate for
  cached buffers to prevent TearDown from freeing them
- AllocationsCache: per-device cache mapping allocation indices to addresses
- BufferForAllocation: checks cache before allocating, caches on miss
- ExecuteAsyncOnStreamImpl: uses CachedAllocator for both BufferAllocations
  and ExecutionOutput

Evaluation results:
- Successfully stabilizes addresses (0 updates needed per step)
- hipGraphLaunch submit is fast (~4ms)
- BUT GPU-side graph execution takes ~54s/step (vs 42ms baseline)
- Root cause: ROCm HIP graph execution has fundamental performance issue
  where captured graphs lose kernel concurrency, causing sequential execution
- This is the same bug that VMM dual-mapping also works around by enabling
  the VA remapping code path (which uses is_initialization=true recording)

The CachedAllocator approach alone cannot fix the ROCm HIP graph execution
performance bug. It must be combined with the VA remapping code path or
an equivalent fix to the graph capture/execution mechanism.

Progress on Pavel's buffer caching approach:

1. CachedAllocator stabilizes temp/live-out buffer addresses (0 updates)
2. collective_init_done flag prevents re-recording in Initialize() path
3. Parameter memcpy: copies donated param data to cached stable address

Current state:
- Allocation 621 (donated parameter) changes address every step
- Every 4th step hits graph cache (~280ms), others re-trace (~45s)
- The param memcpy updates buffer_allocations but the TracedCommandBuffer
  sub-graph cache inside individual commands still detects address changes
  and re-traces those sub-graphs
- Need to investigate why cached address isn't propagated to
  TracedCommandBuffer's recorded_allocs comparison

Key insight: the slow path is NOT hipGraphInstantiate or hipGraphLaunch.
It's TracedCommandBuffer::GetOrTraceCommandBuffer doing
hipStreamBeginCapture/EndCapture for each command that uses the changed
allocation. This capture is ~45s for the full model.

Move address stabilization logic from GpuExecutable into
CommandBufferThunk::ExecuteOnStream where it can intercept after
UpdateBufferAllocations and before Record/Submit. This ensures both
the coarse-grained recorded_allocs check and the fine-grained
TracedCommandBuffer cache see stable addresses.

Key changes in command_buffer_thunk.cc:
- After UpdateBufferAllocations detects address changes, stabilize
  by allocating persistent device buffers and memcpy'ing data
- Build stable_allocs/stable_params with cached addresses for Record
- After Submit, copy-back results from persistent to original buffers
- Track alloc_cache_ready to skip stabilization on first record
- collective_init_done prevents redundant Initialize re-recording

Key changes in command_buffer_thunk.h:
- Add alloc_address_cache map to ExecutorCommandBuffer
- Add alloc_cache_ready flag
- Add collective_init_done flag

Performance results:
- Baseline (no cmd buffers):    44.1 ms/step
- VMM dual-mapping approach:    47.4 ms/step (9.4% overhead)
- CachedAllocs approach:        42.5 ms/step (3.6% FASTER than baseline)
- No segfaults, 20/20 steps completed, correct loss values

The CachedAllocs approach outperforms even the baseline because
HIP graph replay is more efficient than op-by-op kernel dispatch.

DeviceAddressAllocator::Allocate() is UNIMPLEMENTED in the thunk
execution context, causing command buffer stabilization to silently
fail. Switch to StreamExecutor::Allocate() which is always available.

With this fix, donated parameter addresses are properly stabilized
and TracedCommandBuffer cache hits on every step after the first:

  Baseline (no cmd buffers):       13.6ms/step
  All cmd buffer types + fix:      10.2ms/step  (-24.7%)
  All cmd buffer types, no fix:   287.1ms/step  (re-traces every step)
@phambinhfin added the `claude-review` label (Request a Claude AI code review for this PR) on Apr 6, 2026
Comment on lines +332 to +346
se::DeviceAddressBase persistent =
    executor->Allocate(current.size(), /*memory_space=*/0);
if (!persistent.is_null()) {
  TF_RETURN_IF_ERROR(params.stream->MemcpyD2D(
      &persistent, current, current.size()));
  cache[idx] = persistent;
  cmd_buffer->recorded_allocs[idx] = persistent;
  copyback_indices.push_back(idx);
  ++stabilized;
} else {
  LOG(WARNING) << "CachedAllocs: failed to allocate persistent "
               << "buffer for alloc " << idx
               << " size=" << current.size();
}
++it;

Memory leak: persistent allocations are never freed.

executor->Allocate() creates device memory that is stored in alloc_address_cache, but there is no cleanup path. The ExecutorCommandBuffer has no destructor that frees these allocations, and the GpuExecutable destructor only frees entries in cached_allocs_ (the executable-level cache), not the thunk-level alloc_address_cache.

This means every persistent buffer allocated here leaks GPU memory until process exit, and the leak grows with the number of stabilized allocations across all executors.

Consider adding a destructor to ExecutorCommandBuffer that iterates alloc_address_cache and frees each allocation via StreamExecutor::Deallocate, or unifying this cache with the executable-level cached_allocs_ so a single cleanup path covers both.

Comment on lines +224 to +227
    !cmd_buffer->collective_init_done &&
    commands_.requires_initialization();
if (cmd_buffer->warmup_done && (first_init || needs_collective_init)) {
  cmd_buffer->collective_init_done = true;

Bug: collective_init_done set before initialization completes.

The flag is set to true at line 227 before commands_.Record(...) runs (line 248). If Record fails via TF_RETURN_IF_ERROR, the flag stays true, preventing any future retry of collective initialization. This means a transient failure permanently disables collective init.

Move cmd_buffer->collective_init_done = true; to after the Record call succeeds (e.g., after line 249).

Comment on lines +1237 to 1244
  effective_allocator = allocator.get();
}

ExecutionOutput result(/*on_device_shape=*/program_shape_.result(),
                       memory_allocator, device_ordinal,
                       effective_allocator, device_ordinal,
                       executor->device_ordinal());

ASSIGN_OR_RETURN(BufferAllocations buffer_allocations,

Data race: CachedAllocator holds a raw pointer to cached_ptrs used outside the lock.

The CachedAllocator stores a raw pointer to cache.cached_ptrs (a flat_hash_set), which is protected by allocs_cache_mutex_. However, CachedAllocator::Deallocate reads cached_ptrs_ (via contains()) outside the lock — e.g., during TearDown or ExecutionOutput destruction.

Meanwhile, BufferForAllocation (line 1076) inserts into cached_ptrs under the lock. A concurrent Deallocate call reading the same flat_hash_set without holding the lock is undefined behavior.

Options:

  1. Have CachedAllocator take its own lock (or share allocs_cache_mutex_) when accessing cached_ptrs_.
  2. Use a thread-safe set for cached_ptrs.

@@ -400,18 +400,6 @@ absl::Status RocmCommandBuffer::Trace(
VLOG(5) << "Traced into the GPU command buffer graph " << graph_ << " (took "

Removing empty-graph safety checks may cause ROCm crashes.

This PR removes both:

  1. The Trace() check that returned an error when a traced HIP graph had zero root nodes
  2. The PrepareFinalization() empty-node insertion that prevented crashes from empty graphs

The CUDA path still retains both its empty-graph check in Trace() and its PrepareFinalization() empty-body handling. The PR description provides no justification for why this is safe on ROCm. If any custom call traces an empty graph (no HIP operations launched), this will crash at instantiation time.

Could you add a comment explaining why this is safe, or add a test that verifies empty-graph behavior on ROCm?

Comment on lines 408 to +420
}
// Free cached allocations through the real (underlying) allocator.
if (enable_cached_allocs_) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  for (auto& [device_ordinal, cache] : cached_allocs_) {
    auto it = cached_allocators_.find(device_ordinal);
    if (it == cached_allocators_.end() || !it->second) continue;
    auto* real_allocator =
        static_cast<CachedAllocator*>(it->second.get());
    for (auto& [idx, addr] : cache.entries) {
      if (!addr.is_null()) {
        (void)real_allocator->DeallocateUnderlying(device_ordinal, addr);
      }

Destructor silently swallows deallocation errors.

(void)real_allocator->DeallocateUnderlying(device_ordinal, addr) discards the absl::Status. If deallocation fails (e.g., device context already torn down, or wrong device ordinal), the error is silently lost. At minimum, log a warning:

Suggested change (replace the inner deallocation loop):

for (auto& [idx, addr] : cache.entries) {
  if (!addr.is_null()) {
    auto status = real_allocator->DeallocateUnderlying(device_ordinal, addr);
    if (!status.ok()) {
      LOG(WARNING) << "CachedAllocs: failed to deallocate buffer "
                   << idx << ": " << status;
    }
  }
}

Also, the static_cast<CachedAllocator*>(it->second.get()) on line 414 is unnecessary — cached_allocators_ stores std::unique_ptr<CachedAllocator>, so .get() already returns CachedAllocator*.

Comment on lines +75 to +122
// Wraps a real DeviceAddressAllocator and intercepts Deallocate calls for
// cached buffer addresses, preventing TearDown from freeing them.
class CachedAllocator : public se::DeviceAddressAllocator {
 public:
  CachedAllocator(se::DeviceAddressAllocator* underlying,
                  const absl::flat_hash_set<void*>* cached_ptrs)
      : DeviceAddressAllocator(underlying->platform()),
        underlying_(underlying),
        cached_ptrs_(cached_ptrs) {}

  absl::StatusOr<se::ScopedDeviceAddress<uint8_t>> Allocate(
      int device_ordinal, uint64_t size, bool retry_on_failure,
      int64_t memory_space) override {
    return underlying_->Allocate(device_ordinal, size, retry_on_failure,
                                 memory_space);
  }

  absl::Status Deallocate(int device_ordinal,
                          se::DeviceAddressBase mem) override {
    if (cached_ptrs_ && cached_ptrs_->contains(mem.opaque())) {
      return absl::OkStatus();
    }
    return underlying_->Deallocate(device_ordinal, mem);
  }

  absl::StatusOr<se::Stream*> GetStream(int device_ordinal) override {
    return underlying_->GetStream(device_ordinal);
  }

  bool AllowsAsynchronousDeallocation() const override {
    return underlying_->AllowsAsynchronousDeallocation();
  }

  // Bypass the cache and deallocate directly through the underlying allocator.
  absl::Status DeallocateUnderlying(int device_ordinal,
                                    se::DeviceAddressBase mem) {
    return underlying_->Deallocate(device_ordinal, mem);
  }

 private:
  se::DeviceAddressAllocator* underlying_;
  const absl::flat_hash_set<void*>* cached_ptrs_;
};

struct AllocationsCache {
  absl::flat_hash_map<BufferAllocation::Index, se::DeviceAddressBase> entries;
  absl::flat_hash_set<void*> cached_ptrs;
};

CachedAllocator is an implementation detail exposed in a public header.

CachedAllocator and AllocationsCache are only used internally by GpuExecutable, but they are defined in the public header gpu_executable.h. This pollutes the public API and exposes implementation details (raw pointer members, internal caching strategy).

Consider moving these to an anonymous namespace in gpu_executable.cc or to a separate *_internal.h header.

Comment on lines +1138 to +1164

int64_t n_params = 0, n_const = 0, n_thread = 0, n_alloc = 0, n_cached = 0;
for (int64_t i = 0; i < num_buffers; ++i) {
  const BufferAllocation& allocation = *allocations[i];
  if (allocation.is_thread_local())
    n_thread++;
  else if (allocation.is_entry_computation_parameter())
    n_params++;
  else if (allocation.is_constant())
    n_const++;
  else
    n_alloc++;

  ASSIGN_OR_RETURN(
      buffers.emplace_back(),
      BufferForAllocation(arguments, globals, allocation, memory_allocator,
                          device_ordinal, i, allocate_granularity));
  RETURN_IF_ERROR(CheckAlignment(allocation, buffers.back(), i));
}

if (enable_cached_allocs_) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  n_cached = cached_allocs_[device_ordinal].entries.size();
}

VLOG(1) << "GenBufAllocs dev=" << device_ordinal
        << " total=" << num_buffers << " params=" << n_params

Debug counters computed unconditionally, even when VLOG is disabled.

The counters n_params, n_const, n_thread, n_alloc are incremented for every allocation on every call to GenerateBufferAllocations, but the output is gated behind VLOG(1). Consider gating the counting behind VLOG_IS_ON(1) to avoid the overhead:

if (VLOG_IS_ON(1)) {
  int64_t n_params = 0, n_const = 0, n_thread = 0, n_alloc = 0;
  // ... counting loop ...
  VLOG(1) << "GenBufAllocs ...";
}

Comment on lines 71 to +77
      enable_command_buffers_during_profiling_(
          enable_command_buffers_during_profiling),
      enable_command_buffer_va_remapping_(enable_command_buffer_va_remapping),
      enable_cached_allocs_([] {
        const char* env = std::getenv("XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS");
        return env && (std::string(env) == "1" || std::string(env) == "true");
      }()),

Nit: std::getenv is called redundantly at two locations.

The env var XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS is read here (per CommandBufferThunk construction) and also in GpuExecutable's constructor (gpu_executable.cc:393). This creates maintenance risk — if one location is changed, the other could be forgotten. std::getenv is also not thread-safe with respect to setenv/putenv.

Consider reading it once and passing the value through the construction chain, or using a tsl::DebugOptions flag instead of an env var.

Comment on lines +1046 to 1077

// Cache all internally-allocated buffers (temp + live-out) for address
// stability. The CachedAllocator wrapper prevents TearDown and
// ExecutionOutput from freeing cached addresses.
bool cacheable = enable_cached_allocs_;

if (cacheable) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  auto& cache = cached_allocs_[device_ordinal];
  auto it = cache.entries.find(arg_idx);
  if (it != cache.entries.end() &&
      it->second.size() >= static_cast<uint64_t>(buffer_size)) {
    return it->second;
  }
}

ASSIGN_OR_RETURN(
    se::ScopedDeviceAddress<uint8_t> buffer,
    memory_allocator->Allocate(device_ordinal, buffer_size,
                               /*retry_on_failure=*/true,
                               /*memory_space=*/allocation.color()));
buffer_address = buffer.Release();

if (cacheable) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  auto& cache = cached_allocs_[device_ordinal];
  cache.entries[arg_idx] = buffer_address;
  cache.cached_ptrs.insert(buffer_address.opaque());
}
}
return buffer_address;
}

TOCTOU race: double-locking of allocs_cache_mutex_ allows duplicate allocations.

The mutex is acquired for the cache lookup (line 1049), released, then re-acquired for the cache insertion (line 1072). Between the two lock acquisitions, another thread could insert the same arg_idx, causing:

  1. A redundant allocation via memory_allocator->Allocate
  2. The first allocation being silently overwritten in cache.entries, leaking device memory

Consider holding the lock across the entire lookup-allocate-insert sequence, or using a per-entry lock/flag to prevent concurrent allocation for the same index.


claude bot commented Apr 6, 2026

Review Summary

This PR implements a cached allocations optimization for GPU command buffer thunks, gated behind XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS. When enabled, donated parameter addresses are stabilized via D2D memcpy to/from persistent buffers, avoiding costly command buffer re-tracing. The PR also removes empty-graph safety checks from the ROCm command buffer path.

Key issues found (see inline comments for details):

  • Memory leak — Thunk-level alloc_address_cache persistent allocations are never freed (no destructor cleanup path)
  • Data race — CachedAllocator reads cached_ptrs without holding allocs_cache_mutex_, while BufferForAllocation concurrently inserts into it
  • TOCTOU race — Double-locking in BufferForAllocation allows duplicate allocations and memory leaks between lock acquisitions
  • Bug — collective_init_done flag set before Record() completes; a transient failure permanently disables collective init
  • Safety concern — ROCm empty-graph checks removed without justification; CUDA path retains equivalent checks
  • Minor — Destructor swallows deallocation errors silently; debug counters computed unconditionally; redundant std::getenv calls; CachedAllocator exposed in public header
  • No tests for the caching mechanism

The concurrency issues (data race on cached_ptrs, TOCTOU in BufferForAllocation) and the memory leak in alloc_address_cache should be addressed before merging.

@github-actions bot removed the `claude-review` label on Apr 6, 2026