Phambinh/wip cached allocs pingpong#781

Open
phambinhfin wants to merge 5 commits into main from
phambinh/wip_cached-allocs-pingpong

Conversation

@phambinhfin phambinhfin commented Apr 6, 2026

| Mode | Command Buffer Types | Cached Allocs | Median | vs Baseline |
|---|---|---|---|---|
| A. Baseline | none | n/a | 13.601 ms | -- |
| B. FUSION only | FUSION | ON | 14.369 ms | +5.6% (overhead > savings) |
| C. No collectives | FUSION + CUBLAS + CUBLASLT + CUSTOM_CALL | ON | 13.413 ms | -1.4% |
| D. All types | FUSION + CUBLAS + CUBLASLT + CUSTOM_CALL + COLLECTIVES | ON | 10.236 ms | -24.7% |
| E. All types (broken) | same as D | OFF | 287.085 ms | +2010% (re-tracing every step) |

Remove two overly-conservative checks from rocm_command_buffer.cc
(ported from phambinh/full_vmm_solution branch):

1. Trace(): Remove the rejection of empty traced HIP graphs, which caused
   segfaults when a custom call launches no GPU ops.
2. PrepareFinalization(): Remove empty node insertion into empty graphs,
   which could crash with conditional nodes.

Baseline for command buffer performance profiling on main branch.

Implements buffer allocation caching (XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS)
to stabilize buffer addresses across execution steps. Key changes:

- CachedAllocator: wraps DeviceAddressAllocator, intercepts Deallocate for
  cached buffers to prevent TearDown from freeing them
- AllocationsCache: per-device cache mapping allocation indices to addresses
- BufferForAllocation: checks cache before allocating, caches on miss
- ExecuteAsyncOnStreamImpl: uses CachedAllocator for both BufferAllocations
  and ExecutionOutput

Evaluation results:
- Successfully stabilizes addresses (0 updates needed per step)
- hipGraphLaunch submit is fast (~4ms)
- BUT GPU-side graph execution takes ~54s/step (vs 42ms baseline)
- Root cause: ROCm HIP graph execution has fundamental performance issue
  where captured graphs lose kernel concurrency, causing sequential execution
- This is the same bug that VMM dual-mapping also works around by enabling
  the VA remapping code path (which uses is_initialization=true recording)

The CachedAllocator approach alone cannot fix the ROCm HIP graph execution
performance bug. It must be combined with the VA remapping code path or
an equivalent fix to the graph capture/execution mechanism.

Progress on Pavel's buffer caching approach:

1. CachedAllocator stabilizes temp/live-out buffer addresses (0 updates)
2. collective_init_done flag prevents re-recording in Initialize() path
3. Parameter memcpy: copies donated param data to cached stable address

Current state:
- Allocation 621 (donated parameter) changes address every step
- Every 4th step hits graph cache (~280ms), others re-trace (~45s)
- The param memcpy updates buffer_allocations but the TracedCommandBuffer
  sub-graph cache inside individual commands still detects address changes
  and re-traces those sub-graphs
- Need to investigate why cached address isn't propagated to
  TracedCommandBuffer's recorded_allocs comparison

Key insight: the slow path is NOT hipGraphInstantiate or hipGraphLaunch.
It's TracedCommandBuffer::GetOrTraceCommandBuffer doing
hipStreamBeginCapture/EndCapture for each command that uses the changed
allocation. This capture is ~45s for the full model.

Move address stabilization logic from GpuExecutable into
CommandBufferThunk::ExecuteOnStream where it can intercept after
UpdateBufferAllocations and before Record/Submit. This ensures both
the coarse-grained recorded_allocs check and the fine-grained
TracedCommandBuffer cache see stable addresses.

Key changes in command_buffer_thunk.cc:
- After UpdateBufferAllocations detects address changes, stabilize
  by allocating persistent device buffers and memcpy'ing data
- Build stable_allocs/stable_params with cached addresses for Record
- After Submit, copy-back results from persistent to original buffers
- Track alloc_cache_ready to skip stabilization on first record
- collective_init_done prevents redundant Initialize re-recording

Key changes in command_buffer_thunk.h:
- Add alloc_address_cache map to ExecutorCommandBuffer
- Add alloc_cache_ready flag
- Add collective_init_done flag

Performance results:
- Baseline (no cmd buffers):    44.1 ms/step
- VMM dual-mapping approach:    47.4 ms/step (9.4% overhead)
- CachedAllocs approach:        42.5 ms/step (3.6% FASTER than baseline)
- No segfaults, 20/20 steps completed, correct loss values

The CachedAllocs approach outperforms even the baseline because
HIP graph replay is more efficient than op-by-op kernel dispatch.

DeviceAddressAllocator::Allocate() is UNIMPLEMENTED in the thunk
execution context, causing command buffer stabilization to silently
fail. Switch to StreamExecutor::Allocate() which is always available.

With this fix, donated parameter addresses are properly stabilized
and TracedCommandBuffer cache hits on every step after the first:

  Baseline (no cmd buffers):       13.6ms/step
  All cmd buffer types + fix:      10.2ms/step  (-24.7%)
  All cmd buffer types, no fix:   287.1ms/step  (re-traces every step)
@phambinhfin added the `claude-review` label (Request a Claude AI code review for this PR) on Apr 6, 2026
Comment on lines +332 to +346
se::DeviceAddressBase persistent =
    executor->Allocate(current.size(), /*memory_space=*/0);
if (!persistent.is_null()) {
  TF_RETURN_IF_ERROR(params.stream->MemcpyD2D(
      &persistent, current, current.size()));
  cache[idx] = persistent;
  cmd_buffer->recorded_allocs[idx] = persistent;
  copyback_indices.push_back(idx);
  ++stabilized;
} else {
  LOG(WARNING) << "CachedAllocs: failed to allocate persistent "
               << "buffer for alloc " << idx
               << " size=" << current.size();
}
++it;

Memory leak: persistent allocations are never freed.

executor->Allocate() creates device memory that is stored in alloc_address_cache, but there is no cleanup path. The ExecutorCommandBuffer has no destructor that frees these allocations, and the GpuExecutable destructor only frees entries in cached_allocs_ (the executable-level cache), not the thunk-level alloc_address_cache.

This means every persistent buffer allocated here leaks GPU memory until process exit, and the leak grows with the number of stabilized allocations across all executors.

Consider adding a destructor to ExecutorCommandBuffer that iterates alloc_address_cache and frees each allocation via StreamExecutor::Deallocate, or unifying this cache with the executable-level cached_allocs_ so a single cleanup path covers both.

Comment on lines +224 to +227
    !cmd_buffer->collective_init_done &&
    commands_.requires_initialization();
if (cmd_buffer->warmup_done && (first_init || needs_collective_init)) {
  cmd_buffer->collective_init_done = true;

Bug: collective_init_done set before initialization completes.

The flag is set to true at line 227 before commands_.Record(...) runs (line 248). If Record fails via TF_RETURN_IF_ERROR, the flag stays true, preventing any future retry of collective initialization. This means a transient failure permanently disables collective init.

Move cmd_buffer->collective_init_done = true; to after the Record call succeeds (e.g., after line 249).

Comment on lines +1237 to 1244
  effective_allocator = allocator.get();
}

ExecutionOutput result(/*on_device_shape=*/program_shape_.result(),
                       memory_allocator, device_ordinal,
                       effective_allocator, device_ordinal,
                       executor->device_ordinal());

ASSIGN_OR_RETURN(BufferAllocations buffer_allocations,

Data race: CachedAllocator holds a raw pointer to cached_ptrs used outside the lock.

The CachedAllocator stores a raw pointer to cache.cached_ptrs (a flat_hash_set), which is protected by allocs_cache_mutex_. However, CachedAllocator::Deallocate reads cached_ptrs_ (via contains()) outside the lock — e.g., during TearDown or ExecutionOutput destruction.

Meanwhile, BufferForAllocation (line 1076) inserts into cached_ptrs under the lock. A concurrent Deallocate call reading the same flat_hash_set without holding the lock is undefined behavior.

Options:

  1. Have CachedAllocator take its own lock (or share allocs_cache_mutex_) when accessing cached_ptrs_.
  2. Use a thread-safe set for cached_ptrs.

@@ -400,18 +400,6 @@ absl::Status RocmCommandBuffer::Trace(
VLOG(5) << "Traced into the GPU command buffer graph " << graph_ << " (took "

Removing empty-graph safety checks may cause ROCm crashes.

This PR removes both:

  1. The Trace() check that returned an error when a traced HIP graph had zero root nodes
  2. The PrepareFinalization() empty-node insertion that prevented crashes from empty graphs

The CUDA path still retains both its empty-graph check in Trace() and its PrepareFinalization() empty-body handling. The PR description provides no justification for why this is safe on ROCm. If any custom call traces an empty graph (no HIP operations launched), this will crash at instantiation time.

Could you add a comment explaining why this is safe, or add a test that verifies empty-graph behavior on ROCm?

Comment on lines 408 to +420
}
// Free cached allocations through the real (underlying) allocator.
if (enable_cached_allocs_) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  for (auto& [device_ordinal, cache] : cached_allocs_) {
    auto it = cached_allocators_.find(device_ordinal);
    if (it == cached_allocators_.end() || !it->second) continue;
    auto* real_allocator =
        static_cast<CachedAllocator*>(it->second.get());
    for (auto& [idx, addr] : cache.entries) {
      if (!addr.is_null()) {
        (void)real_allocator->DeallocateUnderlying(device_ordinal, addr);
      }

Destructor silently swallows deallocation errors.

(void)real_allocator->DeallocateUnderlying(device_ordinal, addr) discards the absl::Status. If deallocation fails (e.g., device context already torn down, or wrong device ordinal), the error is silently lost. At minimum, log a warning:

Suggested change (replace the inner deallocation loop):

for (auto& [idx, addr] : cache.entries) {
  if (!addr.is_null()) {
    auto status = real_allocator->DeallocateUnderlying(device_ordinal, addr);
    if (!status.ok()) {
      LOG(WARNING) << "CachedAllocs: failed to deallocate buffer "
                   << idx << ": " << status;
    }
  }
}

Also, the static_cast<CachedAllocator*>(it->second.get()) on line 414 is unnecessary — cached_allocators_ stores std::unique_ptr<CachedAllocator>, so .get() already returns CachedAllocator*.

Comment on lines +75 to +122
// Wraps a real DeviceAddressAllocator and intercepts Deallocate calls for
// cached buffer addresses, preventing TearDown from freeing them.
class CachedAllocator : public se::DeviceAddressAllocator {
 public:
  CachedAllocator(se::DeviceAddressAllocator* underlying,
                  const absl::flat_hash_set<void*>* cached_ptrs)
      : DeviceAddressAllocator(underlying->platform()),
        underlying_(underlying),
        cached_ptrs_(cached_ptrs) {}

  absl::StatusOr<se::ScopedDeviceAddress<uint8_t>> Allocate(
      int device_ordinal, uint64_t size, bool retry_on_failure,
      int64_t memory_space) override {
    return underlying_->Allocate(device_ordinal, size, retry_on_failure,
                                 memory_space);
  }

  absl::Status Deallocate(int device_ordinal,
                          se::DeviceAddressBase mem) override {
    if (cached_ptrs_ && cached_ptrs_->contains(mem.opaque())) {
      return absl::OkStatus();
    }
    return underlying_->Deallocate(device_ordinal, mem);
  }

  absl::StatusOr<se::Stream*> GetStream(int device_ordinal) override {
    return underlying_->GetStream(device_ordinal);
  }

  bool AllowsAsynchronousDeallocation() const override {
    return underlying_->AllowsAsynchronousDeallocation();
  }

  // Bypass the cache and deallocate directly through the underlying allocator.
  absl::Status DeallocateUnderlying(int device_ordinal,
                                    se::DeviceAddressBase mem) {
    return underlying_->Deallocate(device_ordinal, mem);
  }

 private:
  se::DeviceAddressAllocator* underlying_;
  const absl::flat_hash_set<void*>* cached_ptrs_;
};

struct AllocationsCache {
  absl::flat_hash_map<BufferAllocation::Index, se::DeviceAddressBase> entries;
  absl::flat_hash_set<void*> cached_ptrs;
};

CachedAllocator is an implementation detail exposed in a public header.

CachedAllocator and AllocationsCache are only used internally by GpuExecutable, but they are defined in the public header gpu_executable.h. This pollutes the public API and exposes implementation details (raw pointer members, internal caching strategy).

Consider moving these to an anonymous namespace in gpu_executable.cc or to a separate *_internal.h header.

Comment on lines +1138 to +1164

int64_t n_params = 0, n_const = 0, n_thread = 0, n_alloc = 0, n_cached = 0;
for (int64_t i = 0; i < num_buffers; ++i) {
  const BufferAllocation& allocation = *allocations[i];
  if (allocation.is_thread_local())
    n_thread++;
  else if (allocation.is_entry_computation_parameter())
    n_params++;
  else if (allocation.is_constant())
    n_const++;
  else
    n_alloc++;

  ASSIGN_OR_RETURN(
      buffers.emplace_back(),
      BufferForAllocation(arguments, globals, allocation, memory_allocator,
                          device_ordinal, i, allocate_granularity));
  RETURN_IF_ERROR(CheckAlignment(allocation, buffers.back(), i));
}

if (enable_cached_allocs_) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  n_cached = cached_allocs_[device_ordinal].entries.size();
}

VLOG(1) << "GenBufAllocs dev=" << device_ordinal
        << " total=" << num_buffers << " params=" << n_params

Debug counters computed unconditionally, even when VLOG is disabled.

The counters n_params, n_const, n_thread, n_alloc are incremented for every allocation on every call to GenerateBufferAllocations, but the output is gated behind VLOG(1). Consider gating the counting behind VLOG_IS_ON(1) to avoid the overhead:

if (VLOG_IS_ON(1)) {
  int64_t n_params = 0, n_const = 0, n_thread = 0, n_alloc = 0;
  // ... counting loop ...
  VLOG(1) << "GenBufAllocs ...";
}

Comment on lines 71 to +77
      enable_command_buffers_during_profiling_(
          enable_command_buffers_during_profiling),
      enable_command_buffer_va_remapping_(enable_command_buffer_va_remapping),
      enable_cached_allocs_([] {
        const char* env = std::getenv("XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS");
        return env && (std::string(env) == "1" || std::string(env) == "true");
      }()),

Nit: std::getenv is called redundantly at two locations.

The env var XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS is read here (per CommandBufferThunk construction) and also in GpuExecutable's constructor (gpu_executable.cc:393). This creates maintenance risk — if one location is changed, the other could be forgotten. std::getenv is also not thread-safe with respect to setenv/putenv.

Consider reading it once and passing the value through the construction chain, or using a tsl::DebugOptions flag instead of an env var.

Comment on lines +1046 to 1077

// Cache all internally-allocated buffers (temp + live-out) for address
// stability. The CachedAllocator wrapper prevents TearDown and
// ExecutionOutput from freeing cached addresses.
bool cacheable = enable_cached_allocs_;

if (cacheable) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  auto& cache = cached_allocs_[device_ordinal];
  auto it = cache.entries.find(arg_idx);
  if (it != cache.entries.end() &&
      it->second.size() >= static_cast<uint64_t>(buffer_size)) {
    return it->second;
  }
}

ASSIGN_OR_RETURN(
    se::ScopedDeviceAddress<uint8_t> buffer,
    memory_allocator->Allocate(device_ordinal, buffer_size,
                               /*retry_on_failure=*/true,
                               /*memory_space=*/allocation.color()));
buffer_address = buffer.Release();

if (cacheable) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  auto& cache = cached_allocs_[device_ordinal];
  cache.entries[arg_idx] = buffer_address;
  cache.cached_ptrs.insert(buffer_address.opaque());
}
}
return buffer_address;
}

TOCTOU race: double-locking of allocs_cache_mutex_ allows duplicate allocations.

The mutex is acquired for the cache lookup (line 1049), released, then re-acquired for the cache insertion (line 1072). Between the two lock acquisitions, another thread could insert the same arg_idx, causing:

  1. A redundant allocation via memory_allocator->Allocate
  2. The first allocation being silently overwritten in cache.entries, leaking device memory

Consider holding the lock across the entire lookup-allocate-insert sequence, or using a per-entry lock/flag to prevent concurrent allocation for the same index.


claude bot commented Apr 6, 2026

Review Summary

This PR implements a cached allocations optimization for GPU command buffer thunks, gated behind XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS. When enabled, donated parameter addresses are stabilized via D2D memcpy to/from persistent buffers, avoiding costly command buffer re-tracing. The PR also removes empty-graph safety checks from the ROCm command buffer path.

Key issues found (see inline comments for details):

  • Memory leak — Thunk-level alloc_address_cache persistent allocations are never freed (no destructor cleanup path)
  • Data race — CachedAllocator reads cached_ptrs without holding allocs_cache_mutex_, while BufferForAllocation concurrently inserts into it
  • TOCTOU race — Double-locking in BufferForAllocation allows duplicate allocations and memory leaks between lock acquisitions
  • Bug — collective_init_done flag set before Record() completes; a transient failure permanently disables collective init
  • Safety concern — ROCm empty-graph checks removed without justification; CUDA path retains equivalent checks
  • Minor — Destructor swallows deallocation errors silently; debug counters computed unconditionally; redundant std::getenv calls; CachedAllocator exposed in public header
  • No tests for the caching mechanism

The concurrency issues (data race on cached_ptrs, TOCTOU in BufferForAllocation) and the memory leak in alloc_address_cache should be addressed before merging.

@github-actions bot removed the `claude-review` label on Apr 6, 2026