Fix CallStack.register orphan lock if exception fires before try/finally by shaunc · Pull Request #896 · NVIDIA/numba-cuda

shaunc · 2026-06-03T23:45:46Z

Fix `CallStack.register` orphan-lock when an exception fires before the `try/finally`

Summary

numba_cuda.numba.cuda.typing.context.CallStack.register acquires
self._lock and appends a CallFrame to self._stack outside the
try/finally block that releases the lock. Any exception raised between
the acquire() call and entry into the try block leaves the lock held
forever; subsequent threads (or re-entrant compiles on the same
thread, after enough nesting) deadlock waiting for it.

This PR restructures register to use with self._lock: so release is
unconditional, and adds a regression test that simulates a failure
inside CallFrame.__init__ and asserts the lock is no longer held.

No behavior change on the happy path.

The defect

Current code (numba_cuda/numba/cuda/typing/context.py, lines 64-76):

@contextlib.contextmanager
def register(self, target, typeinfer, func_id, args):
    # guard compiling the same function with the same signature
    if self.match(func_id.func, args):
        msg = "compiler re-entrant to the same function signature"
        raise errors.NumbaRuntimeError(msg)
    self._lock.acquire()                                           # (A)
    self._stack.append(CallFrame(target, typeinfer, func_id, args))# (B)
    try:
        yield
    finally:
        self._stack.pop()
        self._lock.release()

If anything raises between (A) and entry into the try block, the
lock is leaked. Concrete ways that can happen in practice:

CallFrame.__init__ itself raises (e.g. an attribute lookup on a
pathological func_id, or a downstream-supplied target whose
initialisation has side effects).
The Python interpreter raises an MemoryError between the
bytecode that returns from acquire() and the bytecode that
pushes the try frame.
An asynchronous interrupt is delivered to this thread between
those same two points -- e.g. a KeyboardInterrupt, or a
PyThreadState_SetAsyncExc injected by another thread to
forcibly cancel a long compile.

In any of those cases the lock is held by no live frame, the
context manager never runs, and the next caller of register blocks
indefinitely on self._lock.acquire(). Because _lock is an
RLock, a single-threaded reproducer needs nested registrations to
deadlock, but in the multi-threaded compile case any other thread
attempting to register is stuck.

The fix

@contextlib.contextmanager
def register(self, target, typeinfer, func_id, args):
    if self.match(func_id.func, args):
        msg = "compiler re-entrant to the same function signature"
        raise errors.NumbaRuntimeError(msg)
    # Use `with self._lock:` so an exception between acquire and the
    # try-block (e.g. raised by CallFrame.__init__, an OOM, or an
    # asynchronously-injected interrupt) cannot orphan the lock.
    with self._lock:
        self._stack.append(CallFrame(target, typeinfer, func_id, args))
        try:
            yield
        finally:
            self._stack.pop()

with self._lock: is the idiomatic Python lock pattern. The release
happens unconditionally, regardless of where inside the block the
exception fires, including during the CallFrame(...) constructor
expression. No behavior change on the happy path.

Reproducer test

Added at
numba_cuda/numba/cuda/tests/nocuda/test_callstack_register_orphan_lock.py.

The test patches CallFrame to raise on construction, calls
CallStack.register(...) inside an assertRaises, and then proves
the lock is free by attempting _lock.acquire(blocking=False) from a
different thread (so RLock re-entrancy can't mask the bug).

Behavior on the two code paths:

Against the current upstream code, the test fails with
AssertionError: CallStack._lock was orphaned after CallFrame() raised.
Against this PR, the test passes.

A test_lock_released_on_normal_exit sanity test also verifies the
happy path still releases the lock and pops the stack.

The test lives under tests/nocuda/ because it exercises only the
pure-Python typing scaffolding, with no CUDA runtime required.

Downstream context

We hit this surface in the FactFiber ovid project, an
ovid/Numba-CUDA-based JIT pipeline that supports forcible cancellation
of long-running compiles. When a compile is cancelled mid-flight via
PyThreadState_SetAsyncExc, the asynchronous exception can land
between _lock.acquire() and entry into the try block, orphaning the
lock and wedging the next compile attempt indefinitely. We are
shipping a defensive monkey-patch on top of numba_cuda (tracked
internally as ovid-j4oj) that applies essentially the change
proposed here. Landing this fix upstream lets us retire the
monkey-patch.

We do not believe the defect requires our cancellation use case to be
real -- any path that can raise between (A) and (B) suffices. We are
just the use case that surfaced it consistently.

Checklist

Bug fix scoped to a single defect.
Regression test added (and verified to fail on the previous code).
No behavior change on the happy path.
ruff check and ruff format --check pass on the touched files
(using the repo's pinned ruff==0.11.2).
SPDX headers on the new test file; existing header preserved on
context.py.
CLA signed (see CLA.md) -- pending; not signed in this
branch.

Notes for reviewers

I could not run the full numba-cuda test suite locally without a
CUDA-capable build matrix; I exercised the affected test directly
via python tests/nocuda/test_callstack_register_orphan_lock.py.
The nocuda test directory is by design CPU-only, so CI should be
able to run it without GPU resources.
Happy to expand the test (e.g. add a multi-thread deadlock-detector
case) if the reviewers prefer; the minimal form here keeps the
signal clean.

CallStack.register currently calls self._lock.acquire() and self._stack.append(...) BEFORE entering the try/finally that releases the lock. Any exception fired between the acquire and the try-block -- for example raised by CallFrame.__init__, an OOM, or an asynchronously-injected interrupt -- orphans the lock; subsequent threads queued on it block forever. Restructure with `with self._lock:` so the release is unconditional. No behavior change on the happy path; strictly stronger under failure. Adds a regression test in numba_cuda/numba/cuda/tests/nocuda that stubs CallFrame to raise on construction and asserts the lock is still free afterwards (verified to fail on the previous code).

copy-pr-bot · 2026-06-03T23:45:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

acosmicflamingo · 2026-06-04T15:26:30Z

That's quite a huge test for that small change hahaha! I did run the test before and after and verified that this does fix the failing test. On the one hand, I want to say that this is good enough of a reason to merge this PR.

However, it almost feels like we are testing functionality of cpython's threading module, no? After all, using _RLock context manager functionality on the surface seems to be the same as what is currently being done (where enter and exit dunder methods are calling the same functions in that same order).

I don't know if perhaps the cleanest path forward to prevents issue like this from occurring while minimizing tech debt accumulation is to use a linter to suggest using context managers (e.g. this rule). All of these questions, I leave for the numba-cuda maintainers to answer :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CallStack.register orphan lock if exception fires before try/finally#896

Fix CallStack.register orphan lock if exception fires before try/finally#896
shaunc wants to merge 1 commit into
NVIDIA:mainfrom
shaunc:fix/callstack-register-orphan-lock

shaunc commented Jun 3, 2026

Uh oh!

copy-pr-bot Bot commented Jun 3, 2026

Uh oh!

acosmicflamingo commented Jun 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaunc commented Jun 3, 2026

Fix CallStack.register orphan-lock when an exception fires before the try/finally

Summary

The defect

The fix

Reproducer test

Downstream context

Checklist

Notes for reviewers

Uh oh!

copy-pr-bot Bot commented Jun 3, 2026

Uh oh!

acosmicflamingo commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix `CallStack.register` orphan-lock when an exception fires before the `try/finally`

acosmicflamingo commented Jun 4, 2026 •

edited

Loading