Conversation
Port CUDA's VMM allocator to ROCm/HIP using HIP VMM APIs (hipMemCreate, hipMemAddressReserve, hipMemMap, hipMemSetAccess, hipStreamWriteValue64). This enables per-GPU memory access control on AMD GPUs, replacing the all-or-nothing hipDeviceMallocFinegrained approach that grants every GPU access and disables L2 cache.

Four layers matching the CUDA structure:
- RocmRawMemoryAllocation: RAII wrapper for hipMemCreate/hipMemRelease
- RocmMemoryReservation: RAII wrapper for virtual address reservation, mapping, access control, and unmapping
- RocmVmmAllocator: simple all-in-one allocator (create + reserve + map + setAccess in a single Allocate call)
- RocmDeviceAddressVmmAllocator: advanced allocator with GPU timeline-based deferred deallocation via hipStreamWriteValue64 and signal memory

Key differences from the CUDA implementation:
- hipMemGenericAllocationHandle_t is a pointer type (not integer)
- hipStreamWriteValue64 requires signal memory (hipMallocSignalMemory) instead of pinned host memory (cuMemHostAlloc)
- hipDeviceptr_t is void* requiring PtrAdd() helper for offset arithmetic
- All wrap:: calls use nullptr/0ULL for proper template type deduction
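The hipDeviceptr_t difference mentioned above can be illustrated with a minimal host-side sketch. The typedef and helper below follow the description in the commit message but are hypothetical, not the actual XLA code:

```cpp
#include <cassert>
#include <cstddef>

// hipDeviceptr_t is void* on ROCm, unlike CUdeviceptr (an integer type on
// CUDA), so pointer-plus-offset arithmetic needs an explicit byte-wise cast.
// A PtrAdd() helper of roughly this shape hides the cast (sketch only).
using DevicePtr = void*;  // stands in for hipDeviceptr_t

inline DevicePtr PtrAdd(DevicePtr base, size_t offset_bytes) {
  return static_cast<char*>(base) + offset_bytes;
}
```

On CUDA the equivalent is just `ptr + offset` on an integer handle; on ROCm the cast must be made explicit somewhere, and centralizing it in one helper keeps the allocator code readable.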
Add four test files mirroring the CUDA VMM test structure:
- rocm_raw_memory_allocation_test: CreateAllocation, AddressReflectsHandle, SizeIsAtLeastRequested
- rocm_memory_reservation_test: CreateReservation, MapToWrongType, MapToSingleAllocation, ScopedMappingUnmapsOnDestruction, MapToMultipleAllocations, TwoReservationsDifferentAddresses
- rocm_vmm_allocator_test: AllocateAndFree, AllocateZeroBytes, MemcpyRoundTrip (parameterized with RdmaEnabled/RdmaDisabled)
- rocm_device_address_vmm_allocator_test: 13 single-GPU tests covering allocate/deallocate, memory read/write, stream accessors, deferred deallocation, VA reuse, destructor safety, and error handling; 2 multi-GPU tests for cross-device allocation isolation
RocmEvent was missing the Synchronize() override, causing ExecuteThunksWithVaRemapping to fail with "UNIMPLEMENTED: Not supported for this Event." The VA remapping path calls unmap_event->Synchronize() to wait for the GPU to finish before unmapping virtual addresses. Implements hipEventSynchronize matching the CUDA CudaEvent::Synchronize pattern.
…:SetAccess

SetAccess was only granting read/write access to the owning device, causing GPU memory access faults when peer GPUs tried to read VMM allocations during multi-GPU pmap/collective operations. Now iterates over all devices and grants access to P2P-capable peers, matching the pattern already used in RocmVmmAllocator (Layer 3).
Pre-allocate physical memory slots with permanent VA mappings at startup. The GPU signals slot completion via hipStreamWriteValue64 to coherent host memory; the CPU checks with a non-blocking memory read before reusing a slot. After the first iteration, per-iteration overhead is just D2D memcpy for params — no hipMemMap, no hipMemUnmap, no hipEventSynchronize.

Key components:
- CircularVmmPool: creates N physical chunks, each permanently mapped to its own VA range. AcquireNextSlot/ReleaseSlot for slot lifecycle.
- ExecuteThunksWithCircularVmmPool in gpu_executable.cc: pools all command buffer allocations, copies param data from BFC into pool before execution, copies live-out results back after execution.
- xla_gpu_enable_circular_vmm_pool flag (field 468) to opt in.
- va_remapping enabled for CommandBufferThunk when circular pool is active, so graphs are recorded once and replayed without re-recording.

Benchmark results on 8x MI300X (transformer training step):
- Baseline (no cmd buf): 114.5 ms/iter
- BFC + COLLECTIVES: 234.2 ms/iter (2x slower)
- Circular VMM + COLLECTIVES: 12.0 ms/iter (9.5x faster than baseline)

Benchmark on 2x MI300X (5 matmul+tanh + allreduce):
- BFC + COLLECTIVES: 19,074 us/iter
- Circular VMM + COLLECTIVES: 1,107 us/iter (17.2x faster)

All correctness tests pass (tanh, pmap+allreduce, grad, 10-layer model, different inputs per iteration).
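The slot-reuse rule described above can be sketched host-side. This is a simplified, hypothetical model: a std::atomic stands in for the coherent signal memory word that hipStreamWriteValue64 would update, and the struct name is invented for illustration:

```cpp
#include <atomic>
#include <cstdint>

// Host-side sketch of the circular pool's reuse check. The GPU bumps the
// timeline once per completed iteration; before reusing the slot for
// iteration i (with N slots, slot = i % N), the CPU needs iterations up to
// i - N + 1 to have signaled completion.
struct SlotTimeline {
  std::atomic<uint64_t> completed{0};  // stand-in for the signal memory word
  uint64_t num_slots;

  explicit SlotTimeline(uint64_t n) : num_slots(n) {}

  // Non-blocking check: is the slot for `iteration` free to reuse?
  bool SlotReady(uint64_t iteration) const {
    if (iteration < num_slots) return true;  // first lap, slots untouched
    uint64_t required = iteration - num_slots + 1;
    return completed.load(std::memory_order_acquire) >= required;
  }
};
```

With N ≥ 2, iteration i can begin while iteration i−1 is still on the GPU, which is where the overlap benefit comes from.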
absl::MutexLock lock(&circular_pool_mutex_);
pool_state = &circular_pools_[executor];
}
Bug: thread safety — pool_state accessed without lock
The mutex circular_pool_mutex_ is held only while looking up/inserting the CircularPoolState entry. After that, pool_state->pool, pool_state->iteration_count++, etc. are accessed without the lock.
If two threads call ExecuteThunksWithCircularVmmPool for the same executor concurrently (e.g., multi-threaded inference), they can race on iteration_count++ (non-atomic, no lock) — two iterations could use the same slot concurrently, corrupting data.
Consider either:
- Holding the mutex for the entire function body (but watch for deadlocks with GPU calls), or
- Making iteration_count an std::atomic<uint64_t> and adding a per-pool mutex for the initialization path.
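A minimal sketch of the second option (struct and method names hypothetical): fetch_add is a single read-modify-write, so two racing threads always receive distinct iteration numbers, which a plain `uint64_t++` cannot guarantee.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of a per-pool atomic iteration counter. Each caller atomically
// claims the current value and advances it, so concurrent callers can never
// observe the same iteration (and hence never race on the same slot).
struct PoolCounter {
  std::atomic<uint64_t> iteration_count{0};

  uint64_t NextIteration() {
    return iteration_count.fetch_add(1, std::memory_order_relaxed);
  }
};
```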
Resolved -- addressed in this revision. iteration_count is now std::atomic<uint64_t> with fetch_add(1), and pool initialization uses double-checked locking with a per-pool mutex.
uint64_t required = iteration - num_slots_ + 1;
uint64_t completed = __atomic_load_n(timeline_, __ATOMIC_ACQUIRE);
while (completed < required) {
  completed = __atomic_load_n(timeline_, __ATOMIC_ACQUIRE);
}
Bug: unbounded spin-wait with no backoff or timeout
AcquireNextSlot spin-waits in a tight loop reading the timeline counter. If the GPU stalls or the signal is never written (e.g., GPU hang), the CPU thread will spin forever consuming 100% of a core.
At minimum, add a timeout that returns an error status after a reasonable period (e.g., 30s). A sched_yield() or short std::this_thread::sleep_for in the loop body would also reduce CPU waste during normal waits.
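A bounded wait along the suggested lines could look roughly like this (shape and names hypothetical; `std::this_thread::yield` stands in for `sched_yield`, and the caller would turn the `false` return into an error status):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Sketch of a bounded spin-wait: spin on the timeline counter in short
// bursts, yield between bursts, and give up after a deadline instead of
// burning a core forever if the counter never advances (e.g., GPU hang).
bool WaitForTimeline(const std::atomic<uint64_t>& timeline, uint64_t required,
                     std::chrono::milliseconds timeout) {
  constexpr int kSpinBudget = 1000;  // tight spins before yielding
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  int spins = 0;
  while (timeline.load(std::memory_order_acquire) < required) {
    if (++spins >= kSpinBudget) {
      if (std::chrono::steady_clock::now() >= deadline) {
        return false;  // caller maps this to a DeadlineExceeded-style error
      }
      std::this_thread::yield();  // back off once the spin budget is spent
      spins = 0;
    }
  }
  return true;
}
```

Checking the clock only once per spin burst keeps the common (fast) path free of system calls.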
Resolved -- addressed in this revision. AcquireNextSlot now has a 30-second timeout with DeadlineExceededError, spin-wait with kMaxSpinIterations limit, and sched_yield() for backoff.
LOG(INFO) << absl::StrFormat(
    "CommandBufferThunk::Initialize: warmup_done=%d state=%d "
    "va_remapping=%d requires_init=%d will_record=%d",
    warmup, static_cast<int>(state), enable_command_buffer_va_remapping_,
    commands_.requires_initialization(), will_record);

// Log the addresses that will be used for recording
if (will_record) {
  for (auto idx : commands_.allocs_indices()) {
    auto addr = execute_params.buffer_allocations->GetDeviceAddress(idx);
    LOG(INFO) << absl::StrFormat(
        "  Initialize record addr[%d]: %p size=%d", idx, addr.opaque(),
        addr.size());
Cleanup: LOG(INFO) in hot path should be VLOG
Multiple LOG(INFO) statements are added to Initialize and ExecuteOnStream that fire on every iteration. These will produce massive log output in production and add measurable overhead. The existing code already uses VLOG(2)/VLOG(3) for this purpose.
These should be converted to VLOG(2) or VLOG(3) before merging, or removed if they were only needed during development.
Resolved -- addressed in this revision. The LOG(INFO) statements in Initialize and ExecuteOnStream have been converted to VLOG(3).
xla/service/gpu/gpu_executable.cc
LOG(INFO) << absl::StreamFormat(
    "ExecuteThunks: cmd_buffer_allocs=%d circular_vmm_pool=%d "
    "va_remapping=%d",
    command_buffer_allocation_indexes_.size(), enable_circular_vmm_pool,
    enable_command_buffer_va_remapping);
Cleanup: LOG(INFO) in hot path — should be VLOG
This fires on every invocation of ExecuteThunks. The original code used XLA_VLOG_DEVICE(3, ...). Should be restored to VLOG(3) or XLA_VLOG_DEVICE to avoid flooding production logs.
Resolved -- addressed in this revision. The ExecuteThunks logging now uses XLA_VLOG_DEVICE(3, ...) instead of LOG(INFO).
"Number of slots in the circular VMM pool (default 2)."));
}  // NOLINT(readability/fn_size)

// Allocates flag_values and flag_objects; this function must not be called more
Bug: default value / help text mismatch
The actual default is set to 1 (line 507: set_xla_gpu_circular_vmm_pool_slots(1)), but the flag help text here says "(default 2)".
With 1 slot, AcquireNextSlot will spin-wait for GPU completion on every iteration (no overlap), negating the primary benefit of the circular pool. Either fix the help text to say "default 1" or change the default to 2.
Partially resolved. The flag help text in debug_options_flags.cc now correctly says "(default 1)", but the proto comment in xla.proto still says "(default 2)" -- see xla_gpu_circular_vmm_pool_slots field comment. Please update the proto comment to match the actual default of 1.
Resolved -- fully addressed in this revision. Both the flag help text in debug_options_flags.cc and the proto comment in xla.proto now correctly say "(default 1)".
Resolved: both the flag help text (line 3034) and proto comment now correctly say "(default 1)", matching the actual default value.
xla/service/gpu/gpu_executable.cc
absl::btree_set<BufferAllocation::Index> pool_indexes;
absl::btree_set<BufferAllocation::Index> copy_indexes;
if (buffer_assignment_) {
  for (BufferAllocation::Index idx : command_buffer_allocation_indexes_) {
    const auto& alloc = buffer_assignment_->GetAllocation(idx);
    if (alloc.is_constant() || alloc.size() == 0) continue;
    pool_indexes.insert(idx);
    if (alloc.is_entry_computation_parameter() || alloc.maybe_live_out()) {
      copy_indexes.insert(idx);
    }
  }
Performance: pool_indexes and copy_indexes recomputed every iteration
These sets are derived from buffer_assignment_ and command_buffer_allocation_indexes_, which don't change between iterations. They should be computed once during pool initialization and cached in CircularPoolState, rather than rebuilt on every call.
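The suggested one-time caching could look roughly like this (a sketch with types simplified to standard containers; the real code uses absl::btree_set, BufferAllocation metadata, and BufferAllocation::Index):

```cpp
#include <functional>
#include <set>
#include <vector>

// Sketch: classify allocation indices once and reuse the cached sets on
// every later call. The predicate stands in for the
// is_entry_computation_parameter() / maybe_live_out() checks; the candidate
// indices don't change between iterations, so the work happens only once.
struct CachedIndexSets {
  bool cached = false;
  std::set<int> pool_indexes;
  std::set<int> copy_indexes;

  void ComputeOnce(const std::vector<int>& candidates,
                   const std::function<bool(int)>& needs_copy) {
    if (cached) return;  // already classified; skip on subsequent calls
    for (int idx : candidates) {
      pool_indexes.insert(idx);
      if (needs_copy(idx)) copy_indexes.insert(idx);
    }
    cached = true;
  }
};
```

In a multi-threaded setting the `cached` flag itself would need synchronization, which is exactly the follow-up concern raised later in this review.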
Resolved -- addressed in this revision. pool_indexes and copy_indexes are now cached in CircularPoolState with an indexes_cached flag and double-checked locking, computed once at init.
xla/service/gpu/gpu_executable.h
// Track last-seen BFC addresses to skip redundant D2D memcpy.
absl::flat_hash_map<BufferAllocation::Index, void*> last_param_addrs;
Unused field: last_param_addrs is never read or written
This map is declared with a comment about skipping redundant D2D memcpy, but it's never used anywhere in the implementation. Either remove it or implement the optimization it's intended for.
Resolved -- addressed in this revision. The last_param_addrs field has been removed.
xla/service/gpu/gpu_executable.h
Thunk::ExecutableSource executable_source, bool block_host_until_done);

struct CircularPoolState {
  std::shared_ptr<void> pool;
Nit: std::shared_ptr<void> for type erasure is fragile
This works correctly because unique_ptr<CircularVmmPool> converting to shared_ptr<void> captures the right deleter. However, it sacrifices type safety and requires static_cast<CircularVmmPool*> at every use site.
Since the header already has a #if TENSORFLOW_USE_ROCM include for CircularVmmPool in the .cc file, consider forward-declaring CircularVmmPool here and using std::unique_ptr<se::gpu::CircularVmmPool> directly (with a custom deleter or forward-declared destructor). This eliminates the type-erased casts.
const auto& alloc = buffer_assignment_->GetAllocation(i);
if (alloc.maybe_live_out()) {
  auto bfc_addr = buffer_allocations.GetDeviceAddress(i);
  if (!bfc_addr.is_null() && bfc_addr.size() > 0) {
    se::DeviceAddressBase bfc_dst(bfc_addr.opaque(), bfc_addr.size());
    se::DeviceAddressBase pool_src(pool_addr.opaque(), bfc_addr.size());
    TF_RETURN_IF_ERROR(run_options->stream()->MemcpyD2D(
        &bfc_dst, pool_src, bfc_addr.size()));
  }
}
}
}
Potential bug: live-out copy-back not synchronized when block_host_until_done is true
If block_host_until_done is true, ExecuteThunksImpl synchronizes the stream before returning. These D2D copy-back operations are enqueued after that sync, meaning the caller may read stale data at the BFC address.
Consider adding a stream synchronization after the copy-back when block_host_until_done is true:
if (block_host_until_done) {
  TF_RETURN_IF_ERROR(run_options->stream()->BlockHostUntilDone());
}
Resolved -- addressed in this revision. The copy-back D2D memcpy calls now happen before BlockHostUntilDone, so the sync correctly covers both the thunk execution and the copy-back.
xla/service/gpu/gpu_executable.cc
LOG(INFO) << absl::StrFormat(
    "CircularVmmPool iter=%d slot=%d/%d: %d pool addrs",
    iteration, slot_idx, pool->num_slots(), slot_addresses.size());
Cleanup: LOG(INFO) per-iteration — should be VLOG
This logs on every iteration. Should be VLOG(3) or similar to avoid flooding production logs.
Resolved -- addressed in this revision. Per-iteration logging in ExecuteThunksWithCircularVmmPool now uses VLOG(3). The remaining LOG(INFO) calls are one-time pool initialization messages.
// via hipStreamWriteValue64 after each iteration; the CPU reads it to
// determine when a slot is safe to reuse.
volatile uint64_t* timeline_;
void* timeline_host_ptr_;  // Raw pointer for hipHostFree.
Nit: misleading comment
Comment says "Raw pointer for hipHostFree" but the destructor actually calls hipFree, not hipHostFree. The comment should say "Raw pointer for hipFree".
Resolved -- addressed in this revision. Comment now correctly reads 'Raw pointer for hipFree (signal memory).'.
size_t granularity = 0;
TF_RETURN_IF_ERROR(ToStatus(wrap::hipMemGetAllocationGranularity(
Nit: unused parameter is_rdma_supported
GetVmmAllocationProperties takes is_rdma_supported but never uses it. Either remove the parameter or implement the RDMA support path.
Resolved -- addressed in this revision. GetVmmAllocationProperties no longer takes is_rdma_supported, and the constructor documents the parameter as unused with a comment explaining why.
Review Summary

This PR adds a circular VMM pool for ROCm to eliminate per-iteration map/unmap overhead.

Key issues to address before merge:
See inline comments for details and additional suggestions.
Fixes from Claude code review on PR #774:
1. Convert LOG(INFO) to VLOG(3) in hot paths (command_buffer_thunk.cc, gpu_executable.cc) to avoid flooding production logs
2. Fix help text: default pool slots is 1, not 2
3. Thread safety: use std::atomic<uint64_t> for iteration_count and add per-pool mutex for initialization
4. Cache pool_indexes/copy_indexes in CircularPoolState (computed once at init, reused every iteration instead of recomputed)
5. Add stream sync after live-out copy-back when block_host_until_done is true, ensuring caller reads correct data at BFC address
6. Remove unused last_param_addrs field from CircularPoolState
- Convert LOG(INFO) to VLOG(3) in hot paths to avoid flooding logs
- Fix help text: default pool slots is 1, not 2
- Thread safety: use std::atomic for iteration_count, add per-pool mutex for initialization
- Cache pool_indexes/copy_indexes in CircularPoolState instead of recomputing every iteration
- Add stream sync after live-out copy-back when block_host_until_done
- Remove unused last_param_addrs field
d67610c to af977d0
- Add timeout (30s) and sched_yield to AcquireNextSlot spin-wait to prevent unbounded CPU burn if GPU stalls
- Remove unused is_rdma_supported parameter from RocmVmmAllocator (hipMemAllocationProp has no RDMA flag on ROCm)
- Fix misleading comment: timeline pointer freed with hipFree, not hipHostFree (allocated via hipExtMallocWithFlags signal memory)
xla/service/gpu/gpu_executable.cc
}

// Cache buffer classification on first call (doesn't change between iters).
if (!pool_state->indexes_cached) {
Bug: data race in double-checked locking on indexes_cached
indexes_cached is a plain bool (declared at gpu_executable.h:449), but it is read here outside the lock. Another thread may be writing true to it (at line 1548) under the lock. Reading a non-atomic variable concurrently with a write is undefined behavior in the C++ memory model.
Either:
- Change indexes_cached to std::atomic<bool> and use load(std::memory_order_acquire) / store(true, std::memory_order_release) for the double-checked locking pattern, or
- Remove the outer check and always acquire the lock (the lock is uncontended after the first call, so the overhead is negligible).
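A minimal sketch of the first option, using generic names rather than the XLA types: the acquire load on the fast path pairs with the release store made under the mutex, so once a reader sees the flag set it also sees the fully written state.

```cpp
#include <atomic>
#include <mutex>

// Sketch of double-checked locking with a proper atomic flag. The unlocked
// acquire load synchronizes with the release store inside the critical
// section, avoiding the data race a plain bool would have.
struct LazyState {
  std::atomic<bool> initialized{false};
  std::mutex mu;
  int cached_value = 0;  // stands in for the cached index sets

  int Get() {
    if (!initialized.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> lock(mu);
      if (!initialized.load(std::memory_order_relaxed)) {
        cached_value = 42;  // one-time expensive initialization
        initialized.store(true, std::memory_order_release);
      }
    }
    return cached_value;
  }
};
```

The inner load can be relaxed because it happens under the mutex; only the unlocked outer load needs acquire ordering.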
Resolved -- addressed in this revision. indexes_cached replaced by std::atomic initialized with proper acquire/release memory ordering for safe double-checked locking.
Resolved: indexes_cached has been replaced by std::atomic<bool> initialized with proper acquire/release ordering, fixing the data race.
xla/service/gpu/gpu_executable.cc
if (pool_state->pool == nullptr) {
  int num_slots = has_module()
      ? module_config().debug_options().xla_gpu_circular_vmm_pool_slots()
      : 1;

  if (pool_indexes.empty()) {
    return ExecuteThunksImpl(
        has_module() ? &module_config().debug_options() : nullptr,
        module_name_, unique_id, *thunk_executor_, executable_source,
        run_options, buffer_allocations, block_host_until_done,
        execution_stream_ids_, collective_memory_cache_);
  }

  std::vector<uint64_t> buffer_sizes;
  buffer_sizes.reserve(pool_indexes.size());
  for (BufferAllocation::Index idx : pool_indexes) {
    buffer_sizes.push_back(buffer_allocations.GetDeviceAddress(idx).size());
  }

  TF_ASSIGN_OR_RETURN(
      auto pool,
      se::gpu::CircularVmmPool::Create(executor, buffer_sizes, num_slots));

  LOG(INFO) << absl::StrFormat(
      "CircularVmmPool: created %d slots for module %s on device %d "
      "(%d command buffer allocations)",
      num_slots, module_name_, executor->device_ordinal(),
      command_buffer_allocation_indexes_.size());

  pool_state->pool = std::move(pool);
Bug: pool initialization race -- pool == nullptr checked without lock
If two threads call ExecuteThunksWithCircularVmmPool concurrently for the same executor before the pool is initialized, both will see pool_state->pool == nullptr and both will create a pool. The second std::move(pool) assignment overwrites the first, and any iteration already in-flight on the first pool will use stale state.
This should use the same double-checked locking pattern as indexes_cached (once that is fixed to use atomics), or simply hold pool_state->mu during initialization:
if (pool_state->pool == nullptr) {
  absl::MutexLock lock(&pool_state->mu);
  if (pool_state->pool == nullptr) {
    // ... create pool ...
  }
}
Resolved -- addressed in this revision. Pool initialization is now guarded by atomic initialized + per-pool mutex double-checked locking. The pool creation, index computation, and initialized flag are all protected within the same critical section.
Resolved: pool initialization is now inside an atomic double-checked locking pattern with a per-pool mutex, eliminating the race.
xla/xla.proto
// map/unmap overhead entirely after startup.
optional bool xla_gpu_enable_circular_vmm_pool = 468;

// Number of slots in the circular VMM pool (default 2). Higher values allow
Nit: proto comment says "default 2" but actual default is 1
The default set in debug_options_flags.cc is set_xla_gpu_circular_vmm_pool_slots(1). This comment should say "(default 1)" to match. The flag help text was already corrected.
Resolved -- addressed in this revision. Proto comment now correctly says "(default 1)".
Resolved: proto comment now correctly says "(default 1)", matching the actual default.
RocmMemoryReservation::~RocmMemoryReservation() {
  if (ptr_ == nullptr) {
    return;
  }
  std::unique_ptr<ActivateContext> activation = executor_->Activate();
  auto unmap_status =
      ToStatus(wrap::hipMemUnmap(ptr_, size_), "Error unmapping ROCm memory");
Nit: potential double-unmap when used with ScopedMapping
The destructor calls hipMemUnmap(ptr_, size_) on the full reservation range. However, if a ScopedMapping was created via MapTo() and is destroyed before the reservation (which is the case in CircularVmmPool::Slot due to C++ reverse-declaration-order destruction), ScopedMapping::~ScopedMapping() will have already called UnMap on the mapped sub-range.
This means the same range gets unmapped twice: once by the ScopedMapping destructor, then again here. HIP likely tolerates this (the LOG(ERROR) will fire silently), but it could mask real mapping bugs.
Consider either:
- Tracking whether sub-ranges are already unmapped, or
- Documenting that the reservation destructor is the sole owner of unmap responsibility and ScopedMapping should be released (moved-from) before the reservation is destroyed.
Resolved -- addressed in this revision. The CircularVmmPool destructor now explicitly releases ScopedMappings before clearing slots, ensuring only one unmap occurs per range.
Resolved: the destructor now explicitly releases ScopedMappings before clearing slots, avoiding the double-unmap scenario.
int device_count = 0;
TF_RETURN_IF_ERROR(ToStatus(wrap::hipGetDeviceCount(&device_count)));
for (int peer = 0; peer < device_count; peer++) {
  hipMemAccessDesc access_desc = GetVmmAccessDescriptor(peer);
  auto access_status =
      ToStatus(wrap::hipMemSetAccess(ptr, padded_size, &access_desc, 1));
  if (!access_status.ok()) {
    if (peer == executor->device_ordinal()) {
      return access_status;
    }
    VLOG(3) << "Could not set VMM access for peer device " << peer << ": "
            << access_status;
  }
}
Nit: inconsistent peer access pattern vs. RocmMemoryReservation::SetAccess
This loop unconditionally tries hipMemSetAccess for every device and silently logs failures for non-owning peers. In contrast, RocmMemoryReservation::SetAccess (in rocm_memory_reservation.cc:111-120) first checks CanEnablePeerAccessTo() and skips devices that don't support P2P access.
Both code paths serve the same purpose — setting VMM access permissions for peer devices. It would be more robust and consistent to use the same pattern (checking CanEnablePeerAccessTo() first) here as well, especially since unconditionally calling hipMemSetAccess on unsupported peers could produce spurious log noise on multi-GPU systems without full P2P connectivity.
Resolved -- addressed in this revision. rocm_vmm_allocator.cc now checks CanEnablePeerAccessTo() before calling hipMemSetAccess, matching the pattern in RocmMemoryReservation::SetAccess.
Resolved: peer access is now checked via CanEnablePeerAccessTo() before calling hipMemSetAccess, consistent with the pattern in RocmMemoryReservation::SetAccess.
Re-review Summary

Good progress — the majority of issues from the initial review have been addressed. Of the 12 original findings, 10 are fully resolved and 1 is partially resolved. Two new issues were identified in this revision, plus two minor nits on the new VMM infrastructure code.

Remaining issues (from thread updates + new findings)

Bugs (2 new):
Partially resolved (1):
Nits (2 new):
Still open from initial review (1, no developer reply):
- Fix data race: use std::atomic<bool> with acquire/release for double-checked locking on pool initialization
- Fix pool initialization race: merge index caching and pool creation into single mutex-protected block, checked via atomic initialized flag
- Fix proto comment: default slots is 1, not 2
- Fix double-unmap: release ScopedMappings before reservation destructor in CircularVmmPool to avoid unmapping same range twice
- Align peer access pattern in rocm_vmm_allocator.cc with RocmMemoryReservation::SetAccess (check CanEnablePeerAccessTo first)
"@com_google_absl//absl/status",
"@com_google_absl//absl/status:statusor",
"@com_google_absl//absl/strings:str_format",
"@com_google_absl//absl/types:span",
Bug: missing BUILD dependency @com_google_absl//absl/time
circular_vmm_pool.cc includes absl/time/clock.h and absl/time/time.h (for absl::Now(), absl::Seconds, absl::Time in AcquireNextSlot), but this BUILD target does not declare @com_google_absl//absl/time as a dependency. This will cause a build failure when strict dependency checking is enabled (Bazel --strict_deps / layering check).
"@com_google_absl//absl/types:span",
"@com_google_absl//absl/time",
Resolved ✓ — addressed in this revision.
std::vector<Slot> slots;
slots.reserve(num_slots);
for (int i = 0; i < num_slots; ++i) {
  TF_ASSIGN_OR_RETURN(auto physical,
Bug: signal memory leaked on error path
timeline_host_ptr is allocated via hipExtMallocWithFlags at line 115 but is only freed inside ~CircularVmmPool. If any of the TF_ASSIGN_OR_RETURN calls in this loop fail (e.g., RocmRawMemoryAllocation::Create or RocmMemoryReservation::Create), the function returns early without constructing a CircularVmmPool object, so the destructor never runs and timeline_host_ptr is leaked.
Consider wrapping timeline_host_ptr in a cleanup guard (e.g., absl::Cleanup) that calls hipFree and is dismissed/released only after the CircularVmmPool is successfully constructed:
auto timeline_cleanup = absl::MakeCleanup([&]() {
  wrap::hipFree(timeline_host_ptr);
});
// ... slot creation loop ...
std::move(timeline_cleanup).Cancel();  // dismiss before returning
return absl::WrapUnique(new CircularVmmPool(...));
Resolved ✓ — addressed in this revision.
Re-review Summary

Good progress — 11 of 15 prior findings have been addressed. 2 new issues found in this cycle (posted inline). 4 prior findings remain open without developer reply (see existing inline threads).

New findings (2):
Still open from prior review (4):
🤖 Generated with Claude Code
Re-review Summary

6 previously flagged issues are now resolved:
3 prior issues still open (no new comments posted — see existing inline threads):
No new issues found in this revision. Good progress on addressing the prior review feedback.
- Add absl::Cleanup guard for timeline signal memory in CircularVmmPool::Create so hipExtMallocWithFlags allocation is freed if slot creation fails partway through
- Add missing BUILD deps: @com_google_absl//absl/time and @com_google_absl//absl/cleanup for circular_vmm_pool target
- Replace std::shared_ptr<void> with typed std::shared_ptr<CircularVmmPool> using forward declaration, eliminating unsafe static_cast at use sites
std::shared_ptr<stream_executor::gpu::CircularVmmPool> pool
    ABSL_GUARDED_BY(mu);
std::atomic<uint64_t> iteration_count{0};
absl::btree_set<BufferAllocation::Index> pool_indexes ABSL_GUARDED_BY(mu);
absl::btree_set<BufferAllocation::Index> copy_indexes ABSL_GUARDED_BY(mu);
Nit: ABSL_GUARDED_BY(mu) annotations are inaccurate after initialization
pool, pool_indexes, and copy_indexes are annotated ABSL_GUARDED_BY(mu), but after the one-time initialization they are read without holding mu in ExecuteThunksWithCircularVmmPool (lines 1582-1585, 1593 of gpu_executable.cc). This is safe because the atomic double-checked locking on initialized guarantees the fields are immutable post-init, but the annotations will produce false positives with Clang's thread safety analysis (-Wthread-safety).
Consider either:
- Removing the ABSL_GUARDED_BY(mu) annotations and adding a comment that these fields are write-once under mu and immutable thereafter (reads protected by the initialized acquire fence), or
- Keeping the lock held for reads as well (unnecessary overhead for the hot path).
if (block_host_until_done) {
  TF_RETURN_IF_ERROR(run_options->stream()->BlockHostUntilDone());
}

// GPU signals slot completion so the CPU knows when this slot is safe to
// reuse (non-blocking write via hipStreamWriteValue64).
TF_RETURN_IF_ERROR(pool->ReleaseSlot(run_options->stream(), iteration));
[Medium] ReleaseSlot (which enqueues hipStreamWriteValue64) is called after BlockHostUntilDone. This means the GPU-side timeline signal write is not covered by the host synchronization — if the caller destroys the pool (or the process exits) immediately after this function returns with block_host_until_done=true, the signal write may not have executed yet.
Consider swapping the order so ReleaseSlot is enqueued before the sync:
// GPU signals slot completion so the CPU knows when this slot is safe to
// reuse (non-blocking write via hipStreamWriteValue64).
TF_RETURN_IF_ERROR(pool->ReleaseSlot(run_options->stream(), iteration));
if (block_host_until_done) {
  TF_RETURN_IF_ERROR(run_options->stream()->BlockHostUntilDone());
}
This ensures the timeline signal is included in the synchronization fence.
std::shared_ptr<stream_executor::gpu::CircularVmmPool> pool
    ABSL_GUARDED_BY(mu);
std::atomic<uint64_t> iteration_count{0};
absl::btree_set<BufferAllocation::Index> pool_indexes ABSL_GUARDED_BY(mu);
absl::btree_set<BufferAllocation::Index> copy_indexes ABSL_GUARDED_BY(mu);
[Low] pool, pool_indexes, and copy_indexes are annotated ABSL_GUARDED_BY(mu), but after initialization they are read without holding mu in the hot path (lines 1582-1593 of gpu_executable.cc). This is functionally safe because these are write-once fields protected by the initialized atomic, but the annotations are inconsistent and will produce false positives if Clang's thread safety analysis is enabled.
Consider either:
- Removing ABSL_GUARDED_BY(mu) (since the fields are effectively immutable post-init and the atomic<bool> initialized provides the barrier), or
- Adding ABSL_NO_THREAD_SAFETY_ANALYSIS to the hot-path reader function.
int device_count = 0;
TF_RETURN_IF_ERROR(ToStatus(wrap::hipGetDeviceCount(&device_count)));
for (int peer = 0; peer < device_count; peer++) {
  if (peer != executor->device_ordinal()) {
    auto peer_executor_or =
        const_cast<Platform*>(executor->GetPlatform())
            ->ExecutorForDevice(peer);
    if (!peer_executor_or.ok() ||
        !executor->CanEnablePeerAccessTo(peer_executor_or.value())) {
      continue;
    }
  }
  hipMemAccessDesc access_desc = GetVmmAccessDescriptor(peer);
  TF_RETURN_IF_ERROR(ToStatus(
      wrap::hipMemSetAccess(ptr, padded_size, &access_desc, 1),
[Medium] If hipMemSetAccess fails (line 118-119) or hipGetDeviceCount fails (line 106), TF_RETURN_IF_ERROR returns immediately without cleaning up the already-mapped memory. At this point hipMemMap has succeeded (line 90), but the error path never calls hipMemUnmap, hipMemAddressFree, or hipMemRelease — leaking both the VA mapping and the physical allocation handle.
Consider using an absl::Cleanup guard after the successful hipMemMap to ensure proper cleanup on all subsequent error paths:
```cpp
auto map_cleanup = absl::MakeCleanup([&]() {
  wrap::hipMemUnmap(ptr, padded_size);
  wrap::hipMemAddressFree(ptr, padded_size);
  wrap::hipMemRelease(handle);
});
// ... hipMemSetAccess calls ...
std::move(map_cleanup).Cancel();
return std::make_tuple(ptr, padded_size, handle);
```
```cpp
auto* pool = pool_state->pool.get();
uint64_t iteration = pool_state->iteration_count.fetch_add(1);
int slot_idx = iteration % pool->num_slots();
```
[Nit] `slot_idx` is only used in the `VLOG(3)` on line 1601. In builds where `VLOG(3)` compiles out, this may trigger an unused-variable warning. Consider moving this computation inside the `VLOG` statement or guarding it with `VLOG_IS_ON(3)`.
Re-review Summary (Round 3)

All 19 previous inline findings have been addressed and resolved — nice work on the fixes across the board (signal memory leak, thread safety, logging levels, annotation mismatches, unused fields, etc.). This round identified 4 new findings (2 medium, 1 low, 1 nit), posted inline:
Overall the circular VMM pool design is solid — pre-allocated slots with permanent VA mappings and GPU timeline signaling is a clean approach to eliminating per-iteration map/unmap overhead.

🤖 Generated with Claude Code
xla/service/gpu/gpu_executable.cc
```cpp
    pool_state->initialized.load(std::memory_order_acquire)) {
  auto cache_it = pool_state->alias_cache.find(arg_idx);
  if (cache_it != pool_state->alias_cache.end()) {
    return cache_it->second;
```
[Medium] `alias_cache` always returns slot-0 alias VA addresses regardless of which slot the current iteration will actually use. With the default `num_slots=1` this is correct (slot 0 is the only slot, so alias and graph VAs share the same physical memory). However, with `num_slots>1`, there is a potential correctness issue:

- `GenerateBufferAllocations` calls `BufferForAllocation`, which returns slot-0 alias addresses for all pooled live-out buffers (via this `alias_cache` lookup).
- The live-out result buffers in the `ExecutionOutput` are set from these slot-0 alias addresses (line 1283: `buffer_allocations.GetDeviceAddress(output_info.allocation_index)`).
- `ExecuteThunksWithCircularVmmPool` then runs the graph on slot X (where X = iteration % N), writing results to slot-X's graph VA (different physical memory from slot 0).
- Downstream consumers read from the slot-0 alias VA in the `ExecutionOutput`, which points to slot-0's physical memory, not slot-X's.
Result: with N>1, live-out data from iterations that don't land on slot 0 would be read from the wrong physical memory.
Since the default is N=1 and the PR description recommends N=1, this is safe in practice today. But if someone sets `xla_gpu_circular_vmm_pool_slots>1`, it could produce silent data corruption. Consider either:

- Documenting that N>1 is not yet supported, or
- Updating the alias cache per-iteration (or making `ExecuteThunksWithCircularVmmPool` patch the result buffers after execution).
```cpp
absl::StatusOr<std::vector<DeviceAddressBase>>
CircularVmmPool::AcquireNextSlotAliases(uint64_t iteration) {
  int slot_idx = iteration % num_slots_;

  if (iteration >= static_cast<uint64_t>(num_slots_)) {
    uint64_t required = iteration - num_slots_ + 1;
    uint64_t completed = __atomic_load_n(timeline_, __ATOMIC_ACQUIRE);
    constexpr int kMaxSpinIterations = 1000;
    constexpr auto kTimeout = absl::Seconds(30);
    auto deadline = absl::Now() + kTimeout;
    int spin_count = 0;
    while (completed < required) {
      if (++spin_count > kMaxSpinIterations) {
        if (absl::Now() > deadline) {
          return absl::DeadlineExceededError(absl::StrFormat(
              "CircularVmmPool: timed out waiting for alias slot %d "
              "(required=%d, completed=%d)",
              iteration % num_slots_, required, completed));
        }
        sched_yield();
        spin_count = 0;
      }
      completed = __atomic_load_n(timeline_, __ATOMIC_ACQUIRE);
    }
  }

  return slots_[slot_idx].alias_addresses;
```
[Nit] `AcquireNextSlotAliases` duplicates the entire spin-wait logic from `AcquireNextSlot` (lines 199-228). Consider extracting the spin-wait into a private helper, e.g. `WaitForSlotAvailable(uint64_t iteration)`, so both methods call it before returning their respective address vectors. This avoids maintaining the same timeout/yield logic in two places.
Re-review Summary

Resolved (2): Missing
Still open from prior review (5):
New findings (2):
Force-pushed from 341069b to bea785e
… pool

The circular VMM pool was forcing enable_va_remapping=true in CommandBufferThunk, which skips the warmup/initialization path that NCCL collectives require. This caused a segfault at step 3 on MaxText LLaMA 3 8B with 8x MI300X FSDP-8.

The fix: don't set enable_va_remapping when only the circular pool is enabled. The pool provides permanently stable VA addresses, so CommandBufferThunk naturally detects no address changes and replays the graph without re-recording. This allows the normal warmup path to run, properly initializing NCCL collectives.
Force-pushed from bea785e to 33a92a8
Circular VMM Pool for Update-Free Command Buffer Execution on ROCm
Problem
Enabling COLLECTIVES in XLA command buffers (HIP graphs) on ROCm causes severe performance degradation. Every iteration triggers a full NCCL stream capture cycle when buffer addresses change, making COLLECTIVES 5–15× slower than running without command buffers.
Solution
Pre-allocate physical memory with permanent virtual address (VA) mappings at startup. The HIP graph is recorded once with stable VA addresses and replayed forever. Per-iteration overhead is reduced to lightweight D2D memcpy for parameter data — no `hipMemMap`, no `hipMemUnmap`, no `hipEventSynchronize`.

Architecture:
GPU-to-CPU synchronization uses `hipStreamWriteValue64` to write a monotonic timeline counter to signal memory. The CPU checks with a non-blocking atomic read (`__atomic_load_n` with acquire semantics, cost: ~1 nanosecond) before reusing a slot. A 30-second timeout with `sched_yield` backoff prevents unbounded spin-wait.

Why D2D Copy Instead of Map/Unmap?
Measured on 8× MI300X (ROCm 7.2):
`__atomic_load_n` (timeline read)

Full per-iteration cycle (8 GPUs, 1 MB buffers):
D2D memcpy runs at HBM bandwidth (~3.35 TB/s on MI300X), making even large buffer copies cheaper than a single driver call. The NVIDIA remap cycle explodes with multiple GPUs due to driver lock contention.
(Benchmark source: `vmm_benchmark.cpp`, compiled with `hipcc -O2`)
MaxText LLaMA 3 8B on 8× MI300X (FSDP-8, synthetic data, 10 steps)
F+C+CL+CC = FUSION,CUBLAS,CUBLASLT,CUSTOM_CALL
Loss convergence is identical across all configs (10.871 → 8.973).
Micro-benchmarks
Accuracy: All 5 numerical accuracy tests pass with zero bit-level difference across 20 iterations (tanh, pmap+allreduce, grad, varying inputs, large matmul on 8 GPUs).
Key Design Decisions
- `va_remapping=true` in CommandBufferThunk — tells the thunk to record once and replay forever.
- `std::atomic<bool>` with acquire/release for double-checked locking, `std::atomic<uint64_t>` for iteration counter.
- `hipExtMallocWithFlags(hipMallocSignalMemory)`, freed with `absl::Cleanup` guard on error paths.
- `sched_yield()` backoff in `AcquireNextSlot`.
New (3):

- `xla/stream_executor/rocm/circular_vmm_pool.h/.cc` — Pool implementation
- `xla/stream_executor/rocm/circular_vmm_pool_test.cc` — Unit tests

Modified (10):

- `xla/xla.proto` — `xla_gpu_enable_circular_vmm_pool` (field 468), `xla_gpu_circular_vmm_pool_slots` (field 469)
- `xla/debug_options_flags.cc` — Flag defaults and registration
- `xla/service/gpu/gpu_executable.h/.cc` — `ExecuteThunksWithCircularVmmPool`, `CircularPoolState`
- `xla/service/gpu/thunk_emitter.cc` — Enable `va_remapping` when circular pool is active
- `xla/backends/gpu/runtime/command_buffer_conversion_pass.cc` — Same
- `xla/stream_executor/rocm/rocm_vmm_allocator.h/.cc` — Remove unused param, align peer access
- `xla/stream_executor/rocm/BUILD`, `xla/service/gpu/BUILD` — Targets and deps

Enable