[ROCm] Fix LoadKernel to use refcounted module path for proper cleanup by magaonka-amd · Pull Request #798 · ROCm/xla

magaonka-amd · 2026-04-13T01:06:54Z

Motivation

ROCm's RocmExecutor has two kernel/module loading paths with
different caching behavior:

LoadModule → LoadModuleFromHsaco → populates both
gpu_binary_to_module_ (refcounted) and in_memory_modules_
LoadKernel → directly inserts into in_memory_modules_ only,
bypassing gpu_binary_to_module_ entirely

When UnloadKernel calls UnloadGpuBinary, it only looks in
gpu_binary_to_module_ for cleanup. Since LoadKernel never populated
that map, the cleanup was a no-op: in_memory_modules_ entries and
loaded hipModule_t objects were never removed, leaking until
executor destruction.

This caused stale module cache entries when CustomKernelThunk's
owned HSACO buffers were freed and reallocated at the same address.
The cache returned old modules that didn't contain the expected
kernels, producing flaky hipErrorNotFound failures in sort and
LuSolve tests under xdist parallelism.

Fix by routing LoadKernel through LoadModuleFromHsaco, so every
loaded module participates in the refcount mechanism and is properly
cleaned up when the last kernel referencing it is destroyed.

With this fix properly wiring LoadKernel through the
refcounted LoadModuleFromHsaco path, stale cache entries are cleaned
up before address reuse can occur. The content-hash key introduced
in PR openxla#40419 is no longer needed as a workaround.

Revert to using raw pointer identity as the ModuleHandle key
while keeping the hipPeekAtLastError
guard in GetModuleFunction for diagnostic purposes.

claude · 2026-04-13T01:13:56Z

Claude Code Review Summary

Clean PR — no issues found.

This change correctly fixes a resource leak in RocmExecutor::LoadKernel by routing through the refcounted LoadModuleFromHsaco path instead of directly inserting into in_memory_modules_. The pattern now matches the CUDA executor's LoadKernel/LoadModuleFromCuBin approach exactly. The revert of the content-hash ModuleHandle key is sound since the root cause (missing refcount path) is now addressed, making raw pointer identity sufficient again.

Lock acquisition order, map access patterns, and BUILD dependency cleanup are all correct.

🤖 Reviewed with Claude Code

magaonka-amd · 2026-04-13T01:20:02Z

I'm inclined to keep this change from my original PR hence did partial revert here. I think it is useful to stop execution if kernel is not found else get ourselves into undefined behavior which is much harder to debug.
https://github.com/openxla/xla/pull/40419/changes#diff-41b3141e3358c4655366766128707b51ed7ad7ec0d7f7f0059770c01fced1859R179

but I'm open to suggestions on this.

draganmladjenovic · 2026-04-13T12:09:11Z

I'm inclined to keep this change from my original PR hence did partial revert here. I think it is useful to stop execution if kernel is not found else get ourselves into undefined behavior which is much harder to debug. https://github.com/openxla/xla/pull/40419/changes#diff-41b3141e3358c4655366766128707b51ed7ad7ec0d7f7f0059770c01fced1859R179

but I'm open to suggestions on this.
Remove it. It makes no sense to spray code with hipPeakAtLastError.

ROCm's RocmExecutor has two kernel/module loading paths with different caching behavior: 1. LoadModule → LoadModuleFromHsaco → populates both gpu_binary_to_module_ (refcounted) and in_memory_modules_ 2. LoadKernel → directly inserts into in_memory_modules_ only, bypassing gpu_binary_to_module_ entirely When UnloadKernel calls UnloadGpuBinary, it only looks in gpu_binary_to_module_ for cleanup. Since LoadKernel never populated that map, the cleanup was a no-op: in_memory_modules_ entries and loaded hipModule_t objects were never removed, leaking until executor destruction. This caused stale module cache entries when CustomKernelThunk's owned HSACO buffers were freed and reallocated at the same address. The cache returned old modules that didn't contain the expected kernels, producing flaky hipErrorNotFound failures in sort and LuSolve tests under xdist parallelism. Fix by routing LoadKernel through LoadModuleFromHsaco, so every loaded module participates in the refcount mechanism and is properly cleaned up when the last kernel referencing it is destroyed.

With the previous commit properly wiring LoadKernel through the refcounted LoadModuleFromHsaco path, stale cache entries are cleaned up before address reuse can occur. The content-hash key introduced in PR openxla#40419 is no longer needed as a workaround. Revert to using raw pointer identity as the ModuleHandle key (matching CUDA's approach) and remove the hipPeekAtLastError guard that was added as part of the same workaround.

magaonka-amd · 2026-04-13T23:11:27Z

Okay Dragan removed and pushed PR upstream thanks for feedback.
openxla#40847 pushed fix upstream with new fix and revert of my old fix.

draganmladjenovic

LGTM. It seem to match what cuda does.

magaonka-amd added the claude-review Request a Claude AI code review for this PR label Apr 13, 2026

magaonka-amd force-pushed the fix/rocm-loadkernel-module-refcount branch from 43aee8e to 440bfd0 Compare April 13, 2026 01:12

github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 13, 2026

magaonka-amd force-pushed the fix/rocm-loadkernel-module-refcount branch from 440bfd0 to db0e17b Compare April 13, 2026 01:17

magaonka-amd marked this pull request as ready for review April 13, 2026 01:21

magaonka-amd mentioned this pull request Apr 13, 2026

[ROCm] Cherry-pick: Fix HSACO module cache using pointer-based key causing stale lookups #780

Draft

magaonka-amd requested review from draganmladjenovic and pemeliya April 13, 2026 01:25

magaonka-amd force-pushed the fix/rocm-loadkernel-module-refcount branch from db0e17b to 2c1f58f Compare April 13, 2026 22:54

magaonka-amd added 2 commits April 13, 2026 17:58

magaonka-amd force-pushed the fix/rocm-loadkernel-module-refcount branch from 2c1f58f to 9eb2d65 Compare April 13, 2026 22:59

draganmladjenovic approved these changes Apr 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] Fix LoadKernel to use refcounted module path for proper cleanup#798

[ROCm] Fix LoadKernel to use refcounted module path for proper cleanup#798
magaonka-amd wants to merge 2 commits intoROCm:mainfrom
magaonka-amd:fix/rocm-loadkernel-module-refcount

magaonka-amd commented Apr 13, 2026

Uh oh!

claude bot commented Apr 13, 2026

Uh oh!

magaonka-amd commented Apr 13, 2026

Uh oh!

draganmladjenovic commented Apr 13, 2026

Uh oh!

magaonka-amd commented Apr 13, 2026

Uh oh!

draganmladjenovic left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

magaonka-amd commented Apr 13, 2026

Motivation

Uh oh!

claude bot commented Apr 13, 2026

Claude Code Review Summary

Uh oh!

magaonka-amd commented Apr 13, 2026

Uh oh!

draganmladjenovic commented Apr 13, 2026

Uh oh!

magaonka-amd commented Apr 13, 2026

Uh oh!

draganmladjenovic left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants