[ROCm] Fix LoadKernel to use refcounted module path for proper cleanup#798
[ROCm] Fix LoadKernel to use refcounted module path for proper cleanup#798magaonka-amd wants to merge 2 commits intoROCm:mainfrom
Conversation
43aee8e to
440bfd0
Compare
Claude Code Review SummaryClean PR — no issues found. This change correctly fixes a resource leak in Lock acquisition order, map access patterns, and BUILD dependency cleanup are all correct. 🤖 Reviewed with Claude Code |
440bfd0 to
db0e17b
Compare
|
I'm inclined to keep this change from my original PR hence did partial revert here. I think it is useful to stop execution if kernel is not found else get ourselves into undefined behavior which is much harder to debug. but I'm open to suggestions on this. |
|
db0e17b to
2c1f58f
Compare
ROCm's RocmExecutor has two kernel/module loading paths with
different caching behavior:
1. LoadModule → LoadModuleFromHsaco → populates both
gpu_binary_to_module_ (refcounted) and in_memory_modules_
2. LoadKernel → directly inserts into in_memory_modules_ only,
bypassing gpu_binary_to_module_ entirely
When UnloadKernel calls UnloadGpuBinary, it only looks in
gpu_binary_to_module_ for cleanup. Since LoadKernel never populated
that map, the cleanup was a no-op: in_memory_modules_ entries and
loaded hipModule_t objects were never removed, leaking until
executor destruction.
This caused stale module cache entries when CustomKernelThunk's
owned HSACO buffers were freed and reallocated at the same address.
The cache returned old modules that didn't contain the expected
kernels, producing flaky hipErrorNotFound failures in sort and
LuSolve tests under xdist parallelism.
Fix by routing LoadKernel through LoadModuleFromHsaco, so every
loaded module participates in the refcount mechanism and is properly
cleaned up when the last kernel referencing it is destroyed.
With the previous commit properly wiring LoadKernel through the refcounted LoadModuleFromHsaco path, stale cache entries are cleaned up before address reuse can occur. The content-hash key introduced in PR openxla#40419 is no longer needed as a workaround. Revert to using raw pointer identity as the ModuleHandle key (matching CUDA's approach) and remove the hipPeekAtLastError guard that was added as part of the same workaround.
2c1f58f to
9eb2d65
Compare
|
Okay Dragan removed and pushed PR upstream thanks for feedback. |
draganmladjenovic
left a comment
There was a problem hiding this comment.
LGTM. It seem to match what cuda does.
Motivation
ROCm's RocmExecutor has two kernel/module loading paths with
different caching behavior:
gpu_binary_to_module_ (refcounted) and in_memory_modules_
bypassing gpu_binary_to_module_ entirely
When UnloadKernel calls UnloadGpuBinary, it only looks in
gpu_binary_to_module_ for cleanup. Since LoadKernel never populated
that map, the cleanup was a no-op: in_memory_modules_ entries and
loaded hipModule_t objects were never removed, leaking until
executor destruction.
This caused stale module cache entries when CustomKernelThunk's
owned HSACO buffers were freed and reallocated at the same address.
The cache returned old modules that didn't contain the expected
kernels, producing flaky hipErrorNotFound failures in sort and
LuSolve tests under xdist parallelism.
Fix by routing LoadKernel through LoadModuleFromHsaco, so every
loaded module participates in the refcount mechanism and is properly
cleaned up when the last kernel referencing it is destroyed.
With this fix properly wiring LoadKernel through the
refcounted LoadModuleFromHsaco path, stale cache entries are cleaned
up before address reuse can occur. The content-hash key introduced
in PR openxla#40419 is no longer needed as a workaround.
Revert to using raw pointer identity as the ModuleHandle key
while keeping the hipPeekAtLastError
guard in GetModuleFunction for diagnostic purposes.