[WIP] Extend memory tracking resources tool#3051

Open
huuanhhuyn wants to merge 2 commits into NVIDIA:main from huuanhhuyn:fix-alloc-mislabel

@huuanhhuyn huuanhhuyn commented Jun 8, 2026

NVTX range labels can be mis-attributed in memory_tracking_resources CSV

Issue summary

In the CSV produced by raft::memory_tracking_resources, an allocation can be recorded under the wrong nvtx_range. The allocation size is correct; only the NVTX range label is wrong. A unit test in this PR reproduces the issue.

Root cause

Stats are recorded on two different timelines:

  • Allocation and deallocation counters are updated synchronously on the allocating thread,
    at the instant allocate()/deallocate() is called.
  • CSV rows are written asynchronously by a background sampler thread,
    which reads the currently active NVTX range at the moment it writes the row.

The peak is an interval maximum (it persists until the next row resets it), but
the range label is a point-in-time read. If the sampler lags past a range
boundary (e.g. during a large/slow allocation+free), it stamps the carried-over
peak with whichever range is active at write time, not the range that was active when the allocation occurred.
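
The mismatch can be illustrated without any threads. The following is a minimal single-threaded sketch, not the raft implementation: `tracker_model`, `row`, `sample` and `simulate_lagging_sampler` are hypothetical names, and the "lagging sampler" is modelled simply by calling `sample()` late. It assumes the sampler resets the peak to the current usage after writing a row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the tracker state (not the raft classes).
struct tracker_model {
  std::size_t current = 0;               // bytes currently allocated
  std::size_t peak    = 0;               // max of `current` since the last row
  std::vector<std::string> range_stack;  // active NVTX-like ranges

  // Counters are updated immediately, on the "allocating thread".
  void allocate(std::size_t n) { current += n; peak = std::max(peak, current); }
  void deallocate(std::size_t n) { assert(n <= current); current -= n; }
  void push(std::string name) { range_stack.push_back(std::move(name)); }
  void pop() { assert(!range_stack.empty()); range_stack.pop_back(); }
};

struct row {
  std::size_t peak;   // interval maximum carried over since the previous row
  std::string range;  // point-in-time read of the active range
};

// What the sampler does whenever it finally gets to run.
row sample(tracker_model& t) {
  row r{t.peak, t.range_stack.empty() ? std::string{} : t.range_stack.back()};
  t.peak = t.current;  // assumed reset of the interval maximum
  return r;
}

std::vector<row> simulate_lagging_sampler() {
  constexpr std::size_t GiB = std::size_t{1} << 30;
  constexpr std::size_t MiB = std::size_t{1} << 20;
  tracker_model t;
  std::vector<row> rows;

  t.push("range 2");
  t.allocate(GiB);
  t.deallocate(GiB);
  t.pop();  // the sampler has not run yet

  t.push("range 3");
  t.allocate(4 * MiB);
  rows.push_back(sample(t));  // 1 GiB peak is stamped with "range 3"
  t.deallocate(4 * MiB);
  t.pop();
  rows.push_back(sample(t));  // 4 MiB peak is stamped with the empty range
  return rows;
}
```

In this model the first row carries the 1 GiB peak under "range 3" and the second row carries the 4 MiB peak under the empty range, which is the same shift by one range seen in the reproduction below.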

Issue Reproduction

The MemoryTrackingResources.MismatchedRangeLabeling test allocates 10 KiB, 1 GiB,
and 4 MiB inside three consecutive NVTX ranges, one allocation per range. The resulting CSV:

| timestamp_us | host_current | host_peak | host_total_alloc | host_total_freed |   | nvtx_depth | nvtx_range |
|---|---|---|---|---|---|---|---|
| 292898 | 10 KiB | 10 KiB | 10 KiB | 0 | 0 | 1 | 1. expect 10 KB |
| 298492 | 1 GiB | 1 GiB | 1 GiB + 10 KiB | 10 KiB | 0 | 1 | 2. expect 1 GiB |
| 303858 | 4 MiB | 1 GiB | ~1.004 GiB | 1 GiB + 10 KiB | 0 | 1 | 3. expect 4 MiB |
| 308529 | 0 | 4 MiB ⚡ | ~1.004 GiB | ~1.004 GiB | 0 | 0 | "" |
| 309647 | 0 | 0 | ~1.004 GiB | ~1.004 GiB | 0 | 0 | "" |

[Figure: mismatch_range_label_peaks]

  • The slow 1 GiB alloc/free delays the sampler, so the 1 GiB peak is written after
    range 2 has already been popped and lands on range 3.
  • The 4 MiB peak of range 3 is delayed in turn until range 3 has been popped, so it lands on the empty range.

Consequently, we get a misleading profiling result where an allocation can be associated with a wrong range.

Desired Behavior

  1. Every NVTX range in which (de)allocations occurred must be recorded.
  2. Each recording must be attributed to the correct range.
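
One possible way to satisfy both requirements is to attribute the peak on the allocating thread, at the moment allocate()/deallocate() runs, instead of at sampling time. The sketch below only illustrates that idea and is not necessarily the approach this PR takes. `attributing_tracker`, `range_record` and `simulate_attributed` are hypothetical names; the model is single-threaded, credits only the innermost range, and ignores events outside any range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One output record per range that saw at least one (de)allocation.
struct range_record {
  std::string range;
  std::size_t peak;
};

// Hypothetical tracker that labels usage when the event happens.
class attributing_tracker {
 public:
  void push(std::string name) { active_.push_back(frame{std::move(name), npos}); }
  void pop() { assert(!active_.empty()); active_.pop_back(); }

  void allocate(std::size_t n) {
    current_ += n;
    touch();  // record the new level under the range active right now
  }
  void deallocate(std::size_t n) {
    assert(n <= current_);
    touch();  // record the level before the free, so a free-only range still appears
    current_ -= n;
  }

  const std::vector<range_record>& records() const { return records_; }

 private:
  struct frame {
    std::string name;
    std::size_t record;  // index into records_, or npos if none yet
  };
  static constexpr std::size_t npos = static_cast<std::size_t>(-1);

  void touch() {
    if (active_.empty()) { return; }  // simplification: skip events outside ranges
    frame& f = active_.back();
    if (f.record == npos) {  // create the record lazily on the first event
      records_.push_back(range_record{f.name, 0});
      f.record = records_.size() - 1;
    }
    records_[f.record].peak = std::max(records_[f.record].peak, current_);
  }

  std::size_t current_ = 0;
  std::vector<frame> active_;
  std::vector<range_record> records_;
};

// Same event sequence as the lagging-sampler scenario: 1 GiB, then 4 MiB.
std::vector<range_record> simulate_attributed() {
  constexpr std::size_t GiB = std::size_t{1} << 30;
  constexpr std::size_t MiB = std::size_t{1} << 20;
  attributing_tracker t;
  t.push("range 2");
  t.allocate(GiB);
  t.deallocate(GiB);
  t.pop();
  t.push("range 3");
  t.allocate(4 * MiB);
  t.deallocate(4 * MiB);
  t.pop();
  return t.records();
}
```

Because the label is taken when the counter changes, the result no longer depends on when a sampler runs. A real implementation would additionally need thread safety and a policy for nested ranges, both of which this sketch leaves out.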

copy-pr-bot Bot commented Jun 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@huuanhhuyn huuanhhuyn force-pushed the fix-alloc-mislabel branch from 8d634df to 9ed0e23 on June 16, 2026 11:32
@huuanhhuyn huuanhhuyn force-pushed the fix-alloc-mislabel branch from 9ed0e23 to ef5b83f on June 16, 2026 11:42
@huuanhhuyn huuanhhuyn force-pushed the fix-alloc-mislabel branch from ef5b83f to e852f30 on June 16, 2026 11:43
@huuanhhuyn huuanhhuyn changed the title from "[WIP] Reproduce allocation mislabelling issue" to "[WIP] Extend memory tracking resources tool" on Jun 16, 2026