
[ROCm] Fix LoadKernel to use refcounted module path for proper cleanup#798

Open
magaonka-amd wants to merge 2 commits into ROCm:main from magaonka-amd:fix/rocm-loadkernel-module-refcount

Conversation

@magaonka-amd

Motivation

ROCm's RocmExecutor has two kernel/module loading paths with
different caching behavior:

  1. LoadModule → LoadModuleFromHsaco → populates both
     gpu_binary_to_module_ (refcounted) and in_memory_modules_
  2. LoadKernel → directly inserts into in_memory_modules_ only,
     bypassing gpu_binary_to_module_ entirely

When UnloadKernel calls UnloadGpuBinary, it only looks in
gpu_binary_to_module_ for cleanup. Since LoadKernel never populated
that map, the cleanup was a no-op: in_memory_modules_ entries and
loaded hipModule_t objects were never removed, leaking until
executor destruction.

This caused stale module cache entries when CustomKernelThunk's
owned HSACO buffers were freed and reallocated at the same address.
The cache returned old modules that didn't contain the expected
kernels, producing flaky hipErrorNotFound failures in sort and
LuSolve tests under xdist parallelism.

Fix by routing LoadKernel through LoadModuleFromHsaco, so every
loaded module participates in the refcount mechanism and is properly
cleaned up when the last kernel referencing it is destroyed.

With LoadKernel now properly wired through the refcounted
LoadModuleFromHsaco path, stale cache entries are cleaned up before
address reuse can occur. The content-hash key introduced in
PR openxla#40419 is no longer needed as a workaround.

Revert to using raw pointer identity as the ModuleHandle key, while
keeping the hipPeekAtLastError guard in GetModuleFunction for
diagnostic purposes.

@magaonka-amd magaonka-amd added the claude-review Request a Claude AI code review for this PR label Apr 13, 2026
@magaonka-amd magaonka-amd force-pushed the fix/rocm-loadkernel-module-refcount branch from 43aee8e to 440bfd0 Compare April 13, 2026 01:12
@claude

claude bot commented Apr 13, 2026

Claude Code Review Summary

Clean PR — no issues found.

This change correctly fixes a resource leak in RocmExecutor::LoadKernel by routing through the refcounted LoadModuleFromHsaco path instead of directly inserting into in_memory_modules_. The pattern now matches the CUDA executor's LoadKernel/LoadModuleFromCuBin approach exactly. The revert of the content-hash ModuleHandle key is sound since the root cause (missing refcount path) is now addressed, making raw pointer identity sufficient again.

Lock acquisition order, map access patterns, and BUILD dependency cleanup are all correct.

🤖 Reviewed with Claude Code

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 13, 2026
@magaonka-amd magaonka-amd force-pushed the fix/rocm-loadkernel-module-refcount branch from 440bfd0 to db0e17b Compare April 13, 2026 01:17
@magaonka-amd
Author

I'm inclined to keep this change from my original PR, hence the partial revert here. I think it is useful to stop execution if the kernel is not found; otherwise we get into undefined behavior, which is much harder to debug.
https://github.com/openxla/xla/pull/40419/changes#diff-41b3141e3358c4655366766128707b51ed7ad7ec0d7f7f0059770c01fced1859R179

but I'm open to suggestions on this.

@draganmladjenovic

> I'm inclined to keep this change from my original PR, hence the partial revert here. I think it is useful to stop execution if the kernel is not found; otherwise we get into undefined behavior, which is much harder to debug. https://github.com/openxla/xla/pull/40419/changes#diff-41b3141e3358c4655366766128707b51ed7ad7ec0d7f7f0059770c01fced1859R179
>
> but I'm open to suggestions on this.

Remove it. It makes no sense to spray the code with hipPeekAtLastError.

@magaonka-amd magaonka-amd force-pushed the fix/rocm-loadkernel-module-refcount branch from db0e17b to 2c1f58f Compare April 13, 2026 22:54
With the previous commit properly wiring LoadKernel through the
refcounted LoadModuleFromHsaco path, stale cache entries are cleaned
up before address reuse can occur. The content-hash key introduced
in PR openxla#40419 is no longer needed as a workaround.

Revert to using raw pointer identity as the ModuleHandle key
(matching CUDA's approach) and remove the hipPeekAtLastError guard
that was added as part of the same workaround.
@magaonka-amd magaonka-amd force-pushed the fix/rocm-loadkernel-module-refcount branch from 2c1f58f to 9eb2d65 Compare April 13, 2026 22:59
@magaonka-amd
Author

Okay Dragan, removed it and pushed the PR upstream, thanks for the feedback.
openxla#40847 carries the new fix upstream, together with the revert of my old fix.


@draganmladjenovic draganmladjenovic left a comment


LGTM. It seems to match what CUDA does.
