Fix CallStack.register orphan lock if exception fires before try/finally#896
Fix CallStack.register orphan lock if exception fires before try/finally#896shaunc wants to merge 1 commit into
Conversation
CallStack.register currently calls self._lock.acquire() and self._stack.append(...) BEFORE entering the try/finally that releases the lock. Any exception fired between the acquire and the try-block -- for example raised by CallFrame.__init__, an OOM, or an asynchronously-injected interrupt -- orphans the lock; subsequent threads queued on it block forever. Restructure with `with self._lock:` so the release is unconditional. No behavior change on the happy path; strictly stronger under failure. Adds a regression test in numba_cuda/numba/cuda/tests/nocuda that stubs CallFrame to raise on construction and asserts the lock is still free afterwards (verified to fail on the previous code).
|
That's quite a huge test for that small change hahaha! I did run the test before and after and verified that this does fix the failing test. On the one hand, I want to say that this is good enough of a reason to merge this PR. However, it almost feels like we are testing functionality of cpython's I don't know if perhaps the cleanest path forward to prevents issue like this from occurring while minimizing tech debt accumulation is to use a linter to suggest using context managers (e.g. this rule). All of these questions, I leave for the numba-cuda maintainers to answer :) |
Fix
CallStack.registerorphan-lock when an exception fires before thetry/finallySummary
numba_cuda.numba.cuda.typing.context.CallStack.registeracquiresself._lockand appends aCallFrametoself._stackoutside thetry/finallyblock that releases the lock. Any exception raised betweenthe
acquire()call and entry into thetryblock leaves the lock heldforever; subsequent threads (or re-entrant compiles on the same
thread, after enough nesting) deadlock waiting for it.
This PR restructures
registerto usewith self._lock:so release isunconditional, and adds a regression test that simulates a failure
inside
CallFrame.__init__and asserts the lock is no longer held.No behavior change on the happy path.
The defect
Current code (
numba_cuda/numba/cuda/typing/context.py, lines 64-76):If anything raises between (A) and entry into the
tryblock, thelock is leaked. Concrete ways that can happen in practice:
CallFrame.__init__itself raises (e.g. an attribute lookup on apathological
func_id, or a downstream-suppliedtargetwhoseinitialisation has side effects).
MemoryErrorbetween thebytecode that returns from
acquire()and the bytecode thatpushes the
tryframe.those same two points -- e.g. a
KeyboardInterrupt, or aPyThreadState_SetAsyncExcinjected by another thread toforcibly cancel a long compile.
In any of those cases the lock is held by no live frame, the
context manager never runs, and the next caller of
registerblocksindefinitely on
self._lock.acquire(). Because_lockis anRLock, a single-threaded reproducer needs nested registrations todeadlock, but in the multi-threaded compile case any other thread
attempting to register is stuck.
The fix
with self._lock:is the idiomatic Python lock pattern. The releasehappens unconditionally, regardless of where inside the block the
exception fires, including during the
CallFrame(...)constructorexpression. No behavior change on the happy path.
Reproducer test
Added at
numba_cuda/numba/cuda/tests/nocuda/test_callstack_register_orphan_lock.py.The test patches
CallFrameto raise on construction, callsCallStack.register(...)inside anassertRaises, and then provesthe lock is free by attempting
_lock.acquire(blocking=False)from adifferent thread (so
RLockre-entrancy can't mask the bug).Behavior on the two code paths:
AssertionError: CallStack._lock was orphaned after CallFrame() raised.A
test_lock_released_on_normal_exitsanity test also verifies thehappy path still releases the lock and pops the stack.
The test lives under
tests/nocuda/because it exercises only thepure-Python typing scaffolding, with no CUDA runtime required.
Downstream context
We hit this surface in the FactFiber
ovidproject, anovid/Numba-CUDA-based JIT pipeline that supports forcible cancellation
of long-running compiles. When a compile is cancelled mid-flight via
PyThreadState_SetAsyncExc, the asynchronous exception can landbetween
_lock.acquire()and entry into thetryblock, orphaning thelock and wedging the next compile attempt indefinitely. We are
shipping a defensive monkey-patch on top of
numba_cuda(trackedinternally as
ovid-j4oj) that applies essentially the changeproposed here. Landing this fix upstream lets us retire the
monkey-patch.
We do not believe the defect requires our cancellation use case to be
real -- any path that can raise between (A) and (B) suffices. We are
just the use case that surfaced it consistently.
Checklist
ruff checkandruff format --checkpass on the touched files(using the repo's pinned
ruff==0.11.2).context.py.CLA.md) -- pending; not signed in thisbranch.
Notes for reviewers
CUDA-capable build matrix; I exercised the affected test directly
via
python tests/nocuda/test_callstack_register_orphan_lock.py.The
nocudatest directory is by design CPU-only, so CI should beable to run it without GPU resources.
case) if the reviewers prefer; the minimal form here keeps the
signal clean.