Remove two overly-conservative checks from rocm_command_buffer.cc (ported from the phambinh/full_vmm_solution branch):

1. Trace(): remove the rejection of empty traced HIP graphs, which caused segfaults when custom calls don't launch GPU ops.
2. PrepareFinalization(): remove the empty-node insertion into empty graphs, which could crash with conditional nodes.

Baseline for command buffer performance profiling on the main branch.
Implements buffer allocation caching (XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS) to stabilize buffer addresses across execution steps.

Key changes:
- CachedAllocator: wraps DeviceAddressAllocator and intercepts Deallocate for cached buffers, preventing TearDown from freeing them
- AllocationsCache: per-device cache mapping allocation indices to addresses
- BufferForAllocation: checks the cache before allocating, caches on miss
- ExecuteAsyncOnStreamImpl: uses CachedAllocator for both BufferAllocations and ExecutionOutput

Evaluation results:
- Successfully stabilizes addresses (0 updates needed per step)
- hipGraphLaunch submit is fast (~4 ms)
- BUT GPU-side graph execution takes ~54 s/step (vs. a 42 ms baseline)
- Root cause: ROCm HIP graph execution has a fundamental performance issue where captured graphs lose kernel concurrency, causing sequential execution
- This is the same bug that VMM dual-mapping also works around by enabling the VA remapping code path (which uses is_initialization=true recording)

The CachedAllocator approach alone cannot fix the ROCm HIP graph execution performance bug. It must be combined with the VA remapping code path or an equivalent fix to the graph capture/execution mechanism.
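The Deallocate-interception idea can be sketched independently of XLA's types. Below is a minimal, hypothetical `Allocator` interface standing in for `DeviceAddressAllocator`; all names here are illustrative, not XLA's actual API:

```cpp
#include <cstdint>
#include <cstdlib>
#include <unordered_set>

// Hypothetical stand-in for DeviceAddressAllocator.
class Allocator {
 public:
  virtual ~Allocator() = default;
  virtual void* Allocate(uint64_t size) = 0;
  virtual bool Deallocate(void* ptr) = 0;
};

// Trivial concrete allocator used for demonstration only.
class MallocAllocator : public Allocator {
 public:
  void* Allocate(uint64_t size) override { return std::malloc(size); }
  bool Deallocate(void* ptr) override {
    std::free(ptr);
    ++freed_count;
    return true;
  }
  int freed_count = 0;  // counts real frees, to observe interception
};

// Wraps a real allocator and turns Deallocate into a no-op for pointers
// registered in the cached set, so generic teardown paths cannot free them.
class CachedAllocator : public Allocator {
 public:
  CachedAllocator(Allocator* underlying,
                  const std::unordered_set<void*>* cached_ptrs)
      : underlying_(underlying), cached_ptrs_(cached_ptrs) {}

  void* Allocate(uint64_t size) override {
    return underlying_->Allocate(size);
  }

  bool Deallocate(void* ptr) override {
    if (cached_ptrs_ && cached_ptrs_->count(ptr)) return true;  // intercepted
    return underlying_->Deallocate(ptr);
  }

 private:
  Allocator* underlying_;
  const std::unordered_set<void*>* cached_ptrs_;
};
```

A cached pointer survives a Deallocate call, which is exactly what lets the address stay stable across steps (and, as noted later in this review, also why an explicit cleanup path is needed).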
Progress on Pavel's buffer caching approach:

1. CachedAllocator stabilizes temp/live-out buffer addresses (0 updates)
2. collective_init_done flag prevents re-recording in the Initialize() path
3. Parameter memcpy: copies donated param data to the cached stable address

Current state:
- Allocation 621 (donated parameter) changes address every step
- Every 4th step hits the graph cache (~280 ms); the others re-trace (~45 s)
- The param memcpy updates buffer_allocations, but the TracedCommandBuffer sub-graph cache inside individual commands still detects address changes and re-traces those sub-graphs
- Need to investigate why the cached address isn't propagated to TracedCommandBuffer's recorded_allocs comparison

Key insight: the slow path is NOT hipGraphInstantiate or hipGraphLaunch. It's TracedCommandBuffer::GetOrTraceCommandBuffer doing hipStreamBeginCapture/EndCapture for each command that uses the changed allocation. This capture takes ~45 s for the full model.
Move address stabilization logic from GpuExecutable into CommandBufferThunk::ExecuteOnStream, where it can intercept after UpdateBufferAllocations and before Record/Submit. This ensures that both the coarse-grained recorded_allocs check and the fine-grained TracedCommandBuffer cache see stable addresses.

Key changes in command_buffer_thunk.cc:
- After UpdateBufferAllocations detects address changes, stabilize by allocating persistent device buffers and memcpy'ing the data
- Build stable_allocs/stable_params with cached addresses for Record
- After Submit, copy results back from the persistent buffers to the originals
- Track alloc_cache_ready to skip stabilization on the first record
- collective_init_done prevents redundant Initialize re-recording

Key changes in command_buffer_thunk.h:
- Add an alloc_address_cache map to ExecutorCommandBuffer
- Add an alloc_cache_ready flag
- Add a collective_init_done flag

Performance results:
- Baseline (no cmd buffers): 44.1 ms/step
- VMM dual-mapping approach: 47.4 ms/step (9.4% overhead)
- CachedAllocs approach: 42.5 ms/step (3.6% FASTER than baseline)
- No segfaults, 20/20 steps completed, correct loss values

The CachedAllocs approach outperforms even the baseline because HIP graph replay is more efficient than op-by-op kernel dispatch.
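The stabilize-then-copy-back control flow described above can be sketched with host memory standing in for device buffers. Everything here is an illustrative stand-in (real code would use device allocations and stream D2D memcpys; `Stabilize`/`RestoreResults` are hypothetical names):

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

// A (pointer, size) view of one buffer allocation.
struct Buffer {
  char* data;
  size_t size;
};

// Persistent storage per allocation index; addresses stay stable across calls.
struct StableCache {
  std::map<int, std::vector<char>> persistent;
};

// Record of a redirected buffer so results can be copied back after Submit.
struct CopyBack {
  int idx;
  char* original;
  size_t size;
};

// Before Record/Submit: copy each changed buffer into its persistent slot and
// point execution at the stable address (stand-in for the D2D memcpy step).
std::vector<CopyBack> Stabilize(std::vector<Buffer>& bufs, StableCache& cache) {
  std::vector<CopyBack> copyback;
  for (int i = 0; i < static_cast<int>(bufs.size()); ++i) {
    auto& slot = cache.persistent[i];
    if (slot.size() < bufs[i].size) slot.resize(bufs[i].size);
    if (slot.data() != bufs[i].data) {
      std::memcpy(slot.data(), bufs[i].data, bufs[i].size);
      copyback.push_back({i, bufs[i].data, bufs[i].size});
      bufs[i].data = slot.data();  // command buffer now sees a stable address
    }
  }
  return copyback;
}

// After Submit: results were written into the persistent buffers; copy them
// back so callers holding the original addresses observe the outputs.
void RestoreResults(const StableCache& cache, const std::vector<CopyBack>& cb) {
  for (const auto& c : cb) {
    std::memcpy(c.original, cache.persistent.at(c.idx).data(), c.size);
  }
}
```

On a later step, a buffer that already points at its persistent slot is skipped entirely, which is the "0 updates" cache-hit path.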
DeviceAddressAllocator::Allocate() is UNIMPLEMENTED in the thunk execution context, causing command buffer stabilization to silently fail. Switch to StreamExecutor::Allocate(), which is always available. With this fix, donated parameter addresses are properly stabilized and the TracedCommandBuffer cache hits on every step after the first:

- Baseline (no cmd buffers): 13.6 ms/step
- All cmd buffer types + fix: 10.2 ms/step (-24.7%)
- All cmd buffer types, no fix: 287.1 ms/step (re-traces every step)
```cpp
se::DeviceAddressBase persistent =
    executor->Allocate(current.size(), /*memory_space=*/0);
if (!persistent.is_null()) {
  TF_RETURN_IF_ERROR(params.stream->MemcpyD2D(
      &persistent, current, current.size()));
  cache[idx] = persistent;
  cmd_buffer->recorded_allocs[idx] = persistent;
  copyback_indices.push_back(idx);
  ++stabilized;
} else {
  LOG(WARNING) << "CachedAllocs: failed to allocate persistent "
               << "buffer for alloc " << idx
               << " size=" << current.size();
}
++it;
```
Memory leak: persistent allocations are never freed.
executor->Allocate() creates device memory that is stored in alloc_address_cache, but there is no cleanup path. The ExecutorCommandBuffer has no destructor that frees these allocations, and the GpuExecutable destructor only frees entries in cached_allocs_ (the executable-level cache), not the thunk-level alloc_address_cache.
This means every persistent buffer allocated here leaks GPU memory until process exit, and the leak grows with the number of stabilized allocations across all executors.
Consider adding a destructor to ExecutorCommandBuffer that iterates alloc_address_cache and frees each allocation via StreamExecutor::Deallocate, or unifying this cache with the executable-level cached_allocs_ so a single cleanup path covers both.
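The suggested destructor fix can be sketched with stand-in types (`OwningAllocCache` is a hypothetical name; the deallocation callback stands in for StreamExecutor::Deallocate):

```cpp
#include <functional>
#include <map>
#include <utility>

// Sketch of the proposed fix: the object owning the cache frees every cached
// allocation in its destructor, giving a single cleanup path. Illustrative
// stand-in for ExecutorCommandBuffer's alloc_address_cache.
class OwningAllocCache {
 public:
  explicit OwningAllocCache(std::function<void(void*)> deallocate)
      : deallocate_(std::move(deallocate)) {}

  ~OwningAllocCache() {
    // Every entry is returned to the real deallocator exactly once.
    for (auto& [idx, ptr] : entries_) {
      if (ptr != nullptr) deallocate_(ptr);
    }
  }

  void Insert(int idx, void* ptr) { entries_[idx] = ptr; }

 private:
  std::map<int, void*> entries_;
  std::function<void(void*)> deallocate_;
};
```

With ownership tied to the cache object's lifetime, the leak cannot grow past the lifetime of the executor that holds the cache.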
```cpp
    !cmd_buffer->collective_init_done &&
    commands_.requires_initialization();
if (cmd_buffer->warmup_done && (first_init || needs_collective_init)) {
  cmd_buffer->collective_init_done = true;
```
Bug: collective_init_done set before initialization completes.
The flag is set to true at line 227 before commands_.Record(...) runs (line 248). If Record fails via TF_RETURN_IF_ERROR, the flag stays true, preventing any future retry of collective initialization. This means a transient failure permanently disables collective init.
Move cmd_buffer->collective_init_done = true; to after the Record call succeeds (e.g., after line 249).
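The set-after-success ordering can be shown with a minimal stand-in (`InitState` and the `record` callback are hypothetical; `record` plays the role of `commands_.Record(...)`):

```cpp
// Sketch of the suggested ordering: the done-flag is set only after the
// fallible step succeeds, so a transient failure leaves retry possible.
struct InitState {
  bool collective_init_done = false;

  template <typename RecordFn>
  bool Initialize(RecordFn record) {
    if (collective_init_done) return true;  // already initialized
    if (!record()) return false;  // transient failure: flag stays false
    collective_init_done = true;  // set only once Record succeeded
    return true;
  }
};
```

Compare with the reviewed code, which flips the flag before calling Record: there, one failed Record permanently skips collective init on every later step.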
```diff
   effective_allocator = allocator.get();
 }

 ExecutionOutput result(/*on_device_shape=*/program_shape_.result(),
-                       memory_allocator, device_ordinal,
+                       effective_allocator, device_ordinal,
                        executor->device_ordinal());

 ASSIGN_OR_RETURN(BufferAllocations buffer_allocations,
```
Data race: CachedAllocator holds a raw pointer to cached_ptrs used outside the lock.
The CachedAllocator stores a raw pointer to cache.cached_ptrs (a flat_hash_set), which is protected by allocs_cache_mutex_. However, CachedAllocator::Deallocate reads cached_ptrs_ (via contains()) outside the lock — e.g., during TearDown or ExecutionOutput destruction.
Meanwhile, BufferForAllocation (line 1076) inserts into cached_ptrs under the lock. A concurrent Deallocate call reading the same flat_hash_set without holding the lock is undefined behavior.
Options:

- Have CachedAllocator take its own lock (or share allocs_cache_mutex_) when accessing cached_ptrs_.
- Use a thread-safe set for cached_ptrs.
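The first option can be sketched as a small self-synchronizing wrapper around the pointer set (illustrative types; the real code would share allocs_cache_mutex_ rather than own a new mutex):

```cpp
#include <mutex>
#include <unordered_set>

// Sketch: membership tests and inserts go through the same mutex, so a
// Deallocate-time Contains() can never race with a BufferForAllocation-time
// Insert() on the underlying hash set.
class LockedCachedPtrs {
 public:
  void Insert(void* p) {
    std::lock_guard<std::mutex> lock(mu_);
    ptrs_.insert(p);
  }

  bool Contains(void* p) const {
    std::lock_guard<std::mutex> lock(mu_);  // read is also synchronized
    return ptrs_.count(p) != 0;
  }

 private:
  mutable std::mutex mu_;  // mutable so const Contains() can lock
  std::unordered_set<void*> ptrs_;
};
```

CachedAllocator::Deallocate would then call Contains() instead of touching the raw flat_hash_set, eliminating the unsynchronized read.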
```diff
@@ -400,18 +400,6 @@ absl::Status RocmCommandBuffer::Trace(
   VLOG(5) << "Traced into the GPU command buffer graph " << graph_ << " (took "
```
Removing empty-graph safety checks may cause ROCm crashes.
This PR removes both:
- The Trace() check that returned an error when a traced HIP graph had zero root nodes
- The PrepareFinalization() empty-node insertion that prevented crashes from empty graphs
The CUDA path still retains both its empty-graph check in Trace() and its PrepareFinalization() empty-body handling. The PR description provides no justification for why this is safe on ROCm. If any custom call traces an empty graph (no HIP operations launched), this will crash at instantiation time.
Could you add a comment explaining why this is safe, or add a test that verifies empty-graph behavior on ROCm?
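For reference, the shape of the removed guard can be sketched with stub types (the real check counts nodes via the HIP graph API, e.g. hipGraphGetNodes; `FakeGraph` and `ValidateTracedGraph` here are hypothetical stand-ins):

```cpp
#include <string>
#include <vector>

// Stub graph: a vector of node ids stands in for a captured hipGraph_t.
struct FakeGraph {
  std::vector<int> nodes;
};

// Mirrors the removed guard: reject a zero-node trace before instantiation,
// returning an error message (empty string means OK).
std::string ValidateTracedGraph(const FakeGraph& g) {
  if (g.nodes.empty()) {
    return "traced HIP graph is empty; custom call launched no GPU ops";
  }
  return "";
}
```

A regression test along these lines, exercised against a custom call that launches no kernels, would make the removal's safety verifiable.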
```cpp
}
// Free cached allocations through the real (underlying) allocator.
if (enable_cached_allocs_) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  for (auto& [device_ordinal, cache] : cached_allocs_) {
    auto it = cached_allocators_.find(device_ordinal);
    if (it == cached_allocators_.end() || !it->second) continue;
    auto* real_allocator =
        static_cast<CachedAllocator*>(it->second.get());
    for (auto& [idx, addr] : cache.entries) {
      if (!addr.is_null()) {
        (void)real_allocator->DeallocateUnderlying(device_ordinal, addr);
      }
```
Destructor silently swallows deallocation errors.
(void)real_allocator->DeallocateUnderlying(device_ordinal, addr) discards the absl::Status. If deallocation fails (e.g., device context already torn down, or wrong device ordinal), the error is silently lost. At minimum, log a warning:
Suggested change:

```cpp
for (auto& [idx, addr] : cache.entries) {
  if (!addr.is_null()) {
    auto status = real_allocator->DeallocateUnderlying(device_ordinal, addr);
    if (!status.ok()) {
      LOG(WARNING) << "CachedAllocs: failed to deallocate buffer "
                   << idx << ": " << status;
    }
  }
}
```
Also, the static_cast<CachedAllocator*>(it->second.get()) on line 414 is unnecessary — cached_allocators_ stores std::unique_ptr<CachedAllocator>, so .get() already returns CachedAllocator*.
```cpp
// Wraps a real DeviceAddressAllocator and intercepts Deallocate calls for
// cached buffer addresses, preventing TearDown from freeing them.
class CachedAllocator : public se::DeviceAddressAllocator {
 public:
  CachedAllocator(se::DeviceAddressAllocator* underlying,
                  const absl::flat_hash_set<void*>* cached_ptrs)
      : DeviceAddressAllocator(underlying->platform()),
        underlying_(underlying),
        cached_ptrs_(cached_ptrs) {}

  absl::StatusOr<se::ScopedDeviceAddress<uint8_t>> Allocate(
      int device_ordinal, uint64_t size, bool retry_on_failure,
      int64_t memory_space) override {
    return underlying_->Allocate(device_ordinal, size, retry_on_failure,
                                 memory_space);
  }

  absl::Status Deallocate(int device_ordinal,
                          se::DeviceAddressBase mem) override {
    if (cached_ptrs_ && cached_ptrs_->contains(mem.opaque())) {
      return absl::OkStatus();
    }
    return underlying_->Deallocate(device_ordinal, mem);
  }

  absl::StatusOr<se::Stream*> GetStream(int device_ordinal) override {
    return underlying_->GetStream(device_ordinal);
  }

  bool AllowsAsynchronousDeallocation() const override {
    return underlying_->AllowsAsynchronousDeallocation();
  }

  // Bypass the cache and deallocate directly through the underlying allocator.
  absl::Status DeallocateUnderlying(int device_ordinal,
                                    se::DeviceAddressBase mem) {
    return underlying_->Deallocate(device_ordinal, mem);
  }

 private:
  se::DeviceAddressAllocator* underlying_;
  const absl::flat_hash_set<void*>* cached_ptrs_;
};

struct AllocationsCache {
  absl::flat_hash_map<BufferAllocation::Index, se::DeviceAddressBase> entries;
  absl::flat_hash_set<void*> cached_ptrs;
};
```
CachedAllocator is an implementation detail exposed in a public header.
CachedAllocator and AllocationsCache are only used internally by GpuExecutable, but they are defined in the public header gpu_executable.h. This pollutes the public API and exposes implementation details (raw pointer members, internal caching strategy).
Consider moving these to an anonymous namespace in gpu_executable.cc or to a separate *_internal.h header.
```cpp
int64_t n_params = 0, n_const = 0, n_thread = 0, n_alloc = 0, n_cached = 0;
for (int64_t i = 0; i < num_buffers; ++i) {
  const BufferAllocation& allocation = *allocations[i];
  if (allocation.is_thread_local())
    n_thread++;
  else if (allocation.is_entry_computation_parameter())
    n_params++;
  else if (allocation.is_constant())
    n_const++;
  else
    n_alloc++;

  ASSIGN_OR_RETURN(
      buffers.emplace_back(),
      BufferForAllocation(arguments, globals, allocation, memory_allocator,
                          device_ordinal, i, allocate_granularity));
  RETURN_IF_ERROR(CheckAlignment(allocation, buffers.back(), i));
}

if (enable_cached_allocs_) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  n_cached = cached_allocs_[device_ordinal].entries.size();
}

VLOG(1) << "GenBufAllocs dev=" << device_ordinal
        << " total=" << num_buffers << " params=" << n_params
```
Debug counters computed unconditionally, even when VLOG is disabled.
The counters n_params, n_const, n_thread, n_alloc are incremented for every allocation on every call to GenerateBufferAllocations, but the output is gated behind VLOG(1). Consider gating the counting behind VLOG_IS_ON(1) to avoid the overhead:
```cpp
if (VLOG_IS_ON(1)) {
  int64_t n_params = 0, n_const = 0, n_thread = 0, n_alloc = 0;
  // ... counting loop ...
  VLOG(1) << "GenBufAllocs ...";
}
```

```cpp
      enable_command_buffers_during_profiling_(
          enable_command_buffers_during_profiling),
      enable_command_buffer_va_remapping_(enable_command_buffer_va_remapping),
      enable_cached_allocs_([] {
        const char* env = std::getenv("XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS");
        return env && (std::string(env) == "1" || std::string(env) == "true");
      }()),
```
Nit: std::getenv is called redundantly at two locations.
The env var XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS is read here (per CommandBufferThunk construction) and also in GpuExecutable's constructor (gpu_executable.cc:393). This creates maintenance risk — if one location is changed, the other could be forgotten. std::getenv is also not thread-safe with respect to setenv/putenv.
Consider reading it once and passing the value through the construction chain, or using a tsl::DebugOptions flag instead of an env var.
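The read-once refactor can be sketched as follows (`ParseBoolEnv`, `GpuExecutableOptions`, and `MakeOptions` are illustrative names, not existing XLA symbols):

```cpp
#include <cstdlib>
#include <string>

// Parse a boolean env-var value the same way both call sites currently do.
bool ParseBoolEnv(const char* value) {
  if (value == nullptr) return false;
  std::string v(value);
  return v == "1" || v == "true";
}

// Hypothetical options struct threaded through the construction chain so
// every consumer receives the same parsed value.
struct GpuExecutableOptions {
  bool use_cached_allocs = false;
};

GpuExecutableOptions MakeOptions() {
  GpuExecutableOptions opts;
  // Single std::getenv call; no second read in CommandBufferThunk.
  opts.use_cached_allocs =
      ParseBoolEnv(std::getenv("XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS"));
  return opts;
}
```

Besides removing the duplication, this confines the non-thread-safe std::getenv call to one well-defined point early in construction.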
```cpp
// Cache all internally-allocated buffers (temp + live-out) for address
// stability. The CachedAllocator wrapper prevents TearDown and
// ExecutionOutput from freeing cached addresses.
bool cacheable = enable_cached_allocs_;

if (cacheable) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  auto& cache = cached_allocs_[device_ordinal];
  auto it = cache.entries.find(arg_idx);
  if (it != cache.entries.end() &&
      it->second.size() >= static_cast<uint64_t>(buffer_size)) {
    return it->second;
  }
}

ASSIGN_OR_RETURN(
    se::ScopedDeviceAddress<uint8_t> buffer,
    memory_allocator->Allocate(device_ordinal, buffer_size,
                               /*retry_on_failure=*/true,
                               /*memory_space=*/allocation.color()));
buffer_address = buffer.Release();

if (cacheable) {
  absl::MutexLock lock(&allocs_cache_mutex_);
  auto& cache = cached_allocs_[device_ordinal];
  cache.entries[arg_idx] = buffer_address;
  cache.cached_ptrs.insert(buffer_address.opaque());
}
}
return buffer_address;
}
```
TOCTOU race: double-locking of allocs_cache_mutex_ allows duplicate allocations.
The mutex is acquired for the cache lookup (line 1049), released, then re-acquired for the cache insertion (line 1072). Between the two lock acquisitions, another thread could insert the same arg_idx, causing:
- A redundant allocation via memory_allocator->Allocate
- The first allocation being silently overwritten in cache.entries, leaking device memory
Consider holding the lock across the entire lookup-allocate-insert sequence, or using a per-entry lock/flag to prevent concurrent allocation for the same index.
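The single-critical-section fix can be sketched generically (`GetOrAllocate` is a hypothetical helper; `AllocFn` stands in for memory_allocator->Allocate):

```cpp
#include <mutex>
#include <unordered_map>

// Sketch: lookup, allocation, and insertion happen under one lock, so two
// threads cannot both miss on the same index and allocate twice.
template <typename AllocFn>
void* GetOrAllocate(std::unordered_map<int, void*>& cache, std::mutex& mu,
                    int idx, AllocFn alloc) {
  std::lock_guard<std::mutex> lock(mu);  // one critical section, no TOCTOU gap
  auto it = cache.find(idx);
  if (it != cache.end()) return it->second;  // hit: reuse cached address
  void* buffer = alloc();  // note: allocation now runs under the lock
  cache[idx] = buffer;
  return buffer;
}
```

The trade-off is that a potentially slow device allocation now runs while holding the lock; if that is unacceptable, a per-index flag (the "per-entry lock" option above) lets other indices proceed concurrently while still serializing allocation of the same index.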
Review Summary

This PR implements a cached-allocations optimization for GPU command buffer thunks, gated behind the XLA_COMMAND_BUFFERS_USE_CACHED_ALLOCS environment variable. Key issues found (see inline comments for details):

- Memory leak: persistent allocations stored in alloc_address_cache are never freed
- collective_init_done is set before Record succeeds, so a transient failure permanently disables collective init
- Data race: CachedAllocator reads cached_ptrs outside allocs_cache_mutex_
- TOCTOU race: double-locking in BufferForAllocation can allocate the same index twice
- Removed ROCm empty-graph safety checks lack justification or test coverage

The concurrency issues (the data race on cached_ptrs and the TOCTOU window in BufferForAllocation) should be addressed before this feature is enabled beyond experimentation.