Skip to content

Fix CallStack.register orphan lock if exception fires before try/finally#896

Open
shaunc wants to merge 1 commit into
NVIDIA:mainfrom
shaunc:fix/callstack-register-orphan-lock
Open

Fix CallStack.register orphan lock if exception fires before try/finally#896
shaunc wants to merge 1 commit into
NVIDIA:mainfrom
shaunc:fix/callstack-register-orphan-lock

Conversation

@shaunc

@shaunc shaunc commented Jun 3, 2026

Copy link
Copy Markdown

Fix CallStack.register orphan-lock when an exception fires before the try/finally

Summary

numba_cuda.numba.cuda.typing.context.CallStack.register acquires
self._lock and appends a CallFrame to self._stack outside the
try/finally block that releases the lock. Any exception raised between
the acquire() call and entry into the try block leaves the lock held
forever; subsequent threads (or re-entrant compiles on the same
thread, after enough nesting) deadlock waiting for it.

This PR restructures register to use with self._lock: so release is
unconditional, and adds a regression test that simulates a failure
inside CallFrame.__init__ and asserts the lock is no longer held.

No behavior change on the happy path.

The defect

Current code (numba_cuda/numba/cuda/typing/context.py, lines 64-76):

@contextlib.contextmanager
def register(self, target, typeinfer, func_id, args):
    # guard compiling the same function with the same signature
    if self.match(func_id.func, args):
        msg = "compiler re-entrant to the same function signature"
        raise errors.NumbaRuntimeError(msg)
    self._lock.acquire()                                           # (A)
    self._stack.append(CallFrame(target, typeinfer, func_id, args))# (B)
    try:
        yield
    finally:
        self._stack.pop()
        self._lock.release()

If anything raises between (A) and entry into the try block, the
lock is leaked. Concrete ways that can happen in practice:

  • CallFrame.__init__ itself raises (e.g. an attribute lookup on a
    pathological func_id, or a downstream-supplied target whose
    initialisation has side effects).
  • The Python interpreter raises an MemoryError between the
    bytecode that returns from acquire() and the bytecode that
    pushes the try frame.
  • An asynchronous interrupt is delivered to this thread between
    those same two points -- e.g. a KeyboardInterrupt, or a
    PyThreadState_SetAsyncExc injected by another thread to
    forcibly cancel a long compile.

In any of those cases the lock is held by no live frame, the
context manager never runs, and the next caller of register blocks
indefinitely on self._lock.acquire(). Because _lock is an
RLock, a single-threaded reproducer needs nested registrations to
deadlock, but in the multi-threaded compile case any other thread
attempting to register is stuck.

The fix

@contextlib.contextmanager
def register(self, target, typeinfer, func_id, args):
    if self.match(func_id.func, args):
        msg = "compiler re-entrant to the same function signature"
        raise errors.NumbaRuntimeError(msg)
    # Use `with self._lock:` so an exception between acquire and the
    # try-block (e.g. raised by CallFrame.__init__, an OOM, or an
    # asynchronously-injected interrupt) cannot orphan the lock.
    with self._lock:
        self._stack.append(CallFrame(target, typeinfer, func_id, args))
        try:
            yield
        finally:
            self._stack.pop()

with self._lock: is the idiomatic Python lock pattern. The release
happens unconditionally, regardless of where inside the block the
exception fires, including during the CallFrame(...) constructor
expression. No behavior change on the happy path.

Reproducer test

Added at
numba_cuda/numba/cuda/tests/nocuda/test_callstack_register_orphan_lock.py.

The test patches CallFrame to raise on construction, calls
CallStack.register(...) inside an assertRaises, and then proves
the lock is free by attempting _lock.acquire(blocking=False) from a
different thread (so RLock re-entrancy can't mask the bug).

Behavior on the two code paths:

  • Against the current upstream code, the test fails with
    AssertionError: CallStack._lock was orphaned after CallFrame() raised.
  • Against this PR, the test passes.

A test_lock_released_on_normal_exit sanity test also verifies the
happy path still releases the lock and pops the stack.

The test lives under tests/nocuda/ because it exercises only the
pure-Python typing scaffolding, with no CUDA runtime required.

Downstream context

We hit this surface in the FactFiber ovid project, an
ovid/Numba-CUDA-based JIT pipeline that supports forcible cancellation
of long-running compiles. When a compile is cancelled mid-flight via
PyThreadState_SetAsyncExc, the asynchronous exception can land
between _lock.acquire() and entry into the try block, orphaning the
lock and wedging the next compile attempt indefinitely. We are
shipping a defensive monkey-patch on top of numba_cuda (tracked
internally as ovid-j4oj) that applies essentially the change
proposed here. Landing this fix upstream lets us retire the
monkey-patch.

We do not believe the defect requires our cancellation use case to be
real -- any path that can raise between (A) and (B) suffices. We are
just the use case that surfaced it consistently.

Checklist

  • Bug fix scoped to a single defect.
  • Regression test added (and verified to fail on the previous code).
  • No behavior change on the happy path.
  • ruff check and ruff format --check pass on the touched files
    (using the repo's pinned ruff==0.11.2).
  • SPDX headers on the new test file; existing header preserved on
    context.py.
  • CLA signed (see CLA.md) -- pending; not signed in this
    branch.

Notes for reviewers

  • I could not run the full numba-cuda test suite locally without a
    CUDA-capable build matrix; I exercised the affected test directly
    via python tests/nocuda/test_callstack_register_orphan_lock.py.
    The nocuda test directory is by design CPU-only, so CI should be
    able to run it without GPU resources.
  • Happy to expand the test (e.g. add a multi-thread deadlock-detector
    case) if the reviewers prefer; the minimal form here keeps the
    signal clean.

CallStack.register currently calls self._lock.acquire() and
self._stack.append(...) BEFORE entering the try/finally that releases
the lock. Any exception fired between the acquire and the try-block --
for example raised by CallFrame.__init__, an OOM, or an
asynchronously-injected interrupt -- orphans the lock; subsequent
threads queued on it block forever.

Restructure with `with self._lock:` so the release is unconditional.
No behavior change on the happy path; strictly stronger under failure.

Adds a regression test in numba_cuda/numba/cuda/tests/nocuda that
stubs CallFrame to raise on construction and asserts the lock is
still free afterwards (verified to fail on the previous code).
@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@acosmicflamingo

acosmicflamingo commented Jun 4, 2026

Copy link
Copy Markdown

That's quite a huge test for that small change hahaha! I did run the test before and after and verified that this does fix the failing test. On the one hand, I want to say that this is good enough of a reason to merge this PR.

However, it almost feels like we are testing functionality of cpython's threading module, no? After all, using _RLock context manager functionality on the surface seems to be the same as what is currently being done (where enter and exit dunder methods are calling the same functions in that same order).

I don't know if perhaps the cleanest path forward to prevents issue like this from occurring while minimizing tech debt accumulation is to use a linter to suggest using context managers (e.g. this rule). All of these questions, I leave for the numba-cuda maintainers to answer :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants