[CPU] Abort kernel execution on assertion failure instead of segfaulting by hughperkins · Pull Request #419 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-03-16T04:46:10Z

On CPU, when debug mode is enabled, out-of-bounds array accesses trigger a runtime assertion that records the error but allows execution to continue -- leading to a SIGSEGV before Python can retrieve the error.

Fix this by using setjmp/longjmp: each CPU task runner (range_for, struct_for, mesh_for, serial) sets up a jmp_buf via RuntimeContext, and the new quadrants_assert_format_ctx function longjmps back on failure. The existing check_runtime_error path then raises QuadrantsAssertionError.
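
For readers unfamiliar with the pattern, here is a minimal sketch of a setjmp guard around a single serial task. It is not the code in this PR: `Context`, `assert_ctx`, `run_guarded` and `write_index_7` are hypothetical stand-ins for RuntimeContext, quadrants_assert_format_ctx and a task runner, and the error record is reduced to a boolean.

```cpp
#include <cassert>
#include <csetjmp>
#include <cstdio>

// Hypothetical stand-in for RuntimeContext; only the fields needed here.
struct Context {
  std::jmp_buf *abort_jmp_buf = nullptr;  // non-null while a guard is active
  bool error_recorded = false;            // stands in for the runtime error record
};

// Hypothetical assertion helper: record the failure, then jump back to the guard.
void assert_ctx(Context *ctx, bool condition, const char *message) {
  assert(ctx != nullptr);
  if (condition) return;
  ctx->error_recorded = true;
  std::fprintf(stderr, "assertion failed: %s\n", message);
  if (ctx->abort_jmp_buf != nullptr) {
    std::longjmp(*ctx->abort_jmp_buf, 1);  // does not return
  }
}

using TaskFn = void (*)(Context *ctx, float *data, int size);

// Runs one task under a guard. Returns true if the task ran to completion,
// false if an assertion aborted it.
bool run_guarded(Context *ctx, TaskFn task, float *data, int size) {
  std::jmp_buf buf;  // lives on this frame, so it outlives the task call
  ctx->abort_jmp_buf = &buf;
  bool completed;
  if (setjmp(buf) == 0) {
    task(ctx, data, size);  // normal path
    completed = true;
  } else {
    completed = false;      // reached via longjmp from assert_ctx
  }
  ctx->abort_jmp_buf = nullptr;  // never leave a pointer to a dead frame
  return completed;
}

// Example task: bounds-check index 7, then write to it.
void write_index_7(Context *ctx, float *data, int size) {
  const int i = 7;
  assert_ctx(ctx, i < size, "index out of bounds");
  data[i] = 1.0f;  // never reached when the check fails
}
```

Calling `run_guarded(&ctx, write_index_7, small, 4)` on a 4-element array returns false with `ctx.error_recorded` set, and the out-of-bounds store never executes. The frames skipped by the jump hold only trivially destructible objects, which is what keeps the jump well-defined in C++.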

GPU architectures are unaffected (they already kill threads via asm).

Made-with: Cursor

hughperkins · 2026-03-16T16:47:45Z

tests/python/test_debug.py

+
+    @qd.kernel
+    def write_oob(a: qd.types.ndarray(dtype=qd.f32, ndim=1)):
+        for i in range(10):


I don't think this test is valid, because this is a parallel loop? We should make it serial, I think?

hughperkins · 2026-03-16T16:52:39Z

Opus 4.6 review:

Branch hp/cpu-longjmp-after-assert — Analysis

Problem it solves: On the CPU backend, when a runtime assertion fires (e.g., out-of-bounds ndarray access in debug mode), the assertion was logged but execution continued. The subsequent memory access would then dereference invalid memory and segfault, which is a terrible user experience: you get a crash instead of a clean Python-side AssertionError. On GPU this isn't a problem because asm("exit;") kills the thread immediately after the assertion.
How it works: The fix uses setjmp/longjmp to abort kernel execution on the CPU backend after an assertion failure:

• context.h: Adds a cpu_abort_jmp_buf pointer to RuntimeContext. When non-null, it points to a jmp_buf set up by the caller.
• kernel_launcher.cpp: Before running each task, sets up a setjmp guard. If longjmp fires, it breaks out of the task loop, skipping remaining tasks.
• runtime.cpp: Three parallel-for variants (struct-for, range-for, mesh-for) each get the same pattern: a per-thread setjmp guard wrapping the task body. A new function quadrants_assert_format_ctx is added that calls the existing quadrants_assert_format (to print the error) and then longjmps back if the assertion failed.
• codegen_llvm.cpp: The codegen for AssertStmt now emits a call to quadrants_assert_format_ctx (passing the context) on CPU, while GPU still uses the original quadrants_assert_format (passing just the runtime).
• Tests: Four new tests in test_debug.py covering 1D, 2D, and small-array OOB access (all expecting AssertionError instead of a segfault), plus a sanity test that in-bounds access still works correctly.
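
The per-thread guard described for runtime.cpp can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `Context`, `assert_ctx`, `run_chunk`, `range_for` and `checked_write` are hypothetical names, the workers run one after another instead of on real threads to keep the sketch dependency-free, and the shared error counter would need to be atomic with real threads.

```cpp
#include <cassert>
#include <csetjmp>

// Hypothetical stand-in for RuntimeContext. Copies share the error counter,
// but each copy carries its own jump-buffer pointer.
struct Context {
  std::jmp_buf *cpu_abort_jmp_buf = nullptr;
  int *error_count = nullptr;
};

// Hypothetical assertion helper: count the failure, then abort the current worker.
void assert_ctx(Context *ctx, bool condition) {
  assert(ctx != nullptr && ctx->error_count != nullptr);
  if (condition) return;
  ++*ctx->error_count;
  if (ctx->cpu_abort_jmp_buf != nullptr) {
    std::longjmp(*ctx->cpu_abort_jmp_buf, 1);  // does not return
  }
}

using Body = void (*)(Context *ctx, int i, float *data, int size);

// One worker's chunk of the loop. The worker copies the context and installs
// its own stack-allocated jmp_buf, so an abort here cannot disturb another worker.
bool run_chunk(const Context &shared, Body body, int begin, int end,
               float *data, int size) {
  Context local = shared;  // per-worker copy
  std::jmp_buf buf;
  local.cpu_abort_jmp_buf = &buf;
  if (setjmp(buf) != 0) {
    return false;  // an assertion fired somewhere in this chunk
  }
  for (int i = begin; i < end; ++i) {
    body(&local, i, data, size);
  }
  return true;
}

// Splits [0, n) into chunks and runs each under its own guard.
// Returns the number of chunks that were aborted.
int range_for(const Context &shared, Body body, int n, int num_workers,
              float *data, int size) {
  int aborted = 0;
  const int chunk = (n + num_workers - 1) / num_workers;
  for (int w = 0; w < num_workers; ++w) {
    const int begin = w * chunk;
    const int end = begin + chunk < n ? begin + chunk : n;
    if (!run_chunk(shared, body, begin, end, data, size)) ++aborted;
  }
  return aborted;
}

// Example loop body: bounds check, then store.
void checked_write(Context *ctx, int i, float *data, int size) {
  assert_ctx(ctx, i >= 0 && i < size);
  data[i] = 1.0f;
}
```

With 10 iterations over a 4-element array and 2 workers, the first worker writes indices 0 to 3 and aborts at index 4, while the second aborts immediately at index 5. Both failures are counted and no out-of-bounds store happens.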

Opinion:
This is a clean, well-scoped fix. The setjmp/longjmp pattern is the right approach here: it's the standard C/C++ mechanism for non-local jumps when you can't use exceptions (and in JIT'd code you can't). A few observations:
• Correctness looks good. The jmp_buf is stack-allocated at each guard site and cleaned up (= nullptr) after the task completes, so there's no dangling pointer risk. Each parallel thread gets its own jmp_buf via the per-thread RuntimeContext copy.
• Scope is minimal. Only CPU is affected; GPU codepath is untouched.
• One subtlety: longjmp out of JIT'd code skips any destructors or cleanup that might be on the JIT'd stack. For quadrants kernels this should be fine since the JIT'd code operates on raw buffers with no RAII, but it's worth keeping in mind if the JIT'd code ever acquires resources.
• The tests are thorough: they cover the exact segfault scenario (small array OOB), multi-dimensional access, and verify the happy path isn't broken by the setjmp machinery.

Overall this looks solid and ready for review/merge.

On Windows x64, longjmp performs SEH-based stack unwinding which requires proper unwind tables (.pdata/.xdata) for every frame on the call stack. JIT-compiled code does not register these tables, so longjmp from JIT'd code crashes the process — causing all Windows OOB-check tests to fail with worker crashes. Replace the mechanism: quadrants_assert_format_ctx now returns 1 on failure instead of calling longjmp, and the codegen emits a conditional ret-void after each assert call on CPU. Task runners check the cpu_assert_failed flag after each body call to break out of their loops.
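
A rough sketch of the flag-based replacement, written as ordinary C++ instead of generated code. All names are hypothetical except `cpu_assert_failed`, which the commit message mentions. Whether the assert helper itself sets that flag is an assumption here, and the early `return` in the body stands in for the conditional ret-void the codegen emits.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for RuntimeContext.
struct Context {
  bool cpu_assert_failed = false;
};

// Hypothetical stand-in for quadrants_assert_format_ctx:
// returns 1 on failure and 0 otherwise, and never jumps.
int assert_ctx(Context *ctx, bool condition, const char *message) {
  assert(ctx != nullptr);
  if (condition) return 0;
  ctx->cpu_assert_failed = true;  // assumption: the helper sets the flag
  std::fprintf(stderr, "assertion failed: %s\n", message);
  return 1;
}

// Hand-written equivalent of a generated loop body. The early return plays
// the role of the conditional ret-void emitted after each assert call.
void checked_write(Context *ctx, int i, float *data, int size) {
  if (assert_ctx(ctx, i >= 0 && i < size, "index out of bounds")) return;
  data[i] = static_cast<float>(i);
}

using Body = void (*)(Context *ctx, int i, float *data, int size);

// Task runner: checks the flag after each body call and stops looping once
// it is set. Returns how many times the body was called.
int range_for(Context *ctx, Body body, int begin, int end, float *data, int size) {
  int calls = 0;
  for (int i = begin; i < end; ++i) {
    body(ctx, i, data, size);
    ++calls;
    if (ctx->cpu_assert_failed) break;
  }
  return calls;
}
```

Running `range_for(&ctx, checked_write, 0, 10, data, 4)` calls the body five times: indices 0 to 3 are written, index 4 fails the check and returns before the store, and the runner then sees the flag and breaks. Control always returns through normal function returns, so no stack unwinding through JIT frames is involved.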

hughperkins added 3 commits March 15, 2026 21:45

Merge branch 'main' into hp/cpu-longjmp-after-assert

72b23e3

Fix pre-commit lint: clang-format and unused import

8f51ba5

hughperkins added 3 commits March 16, 2026 11:06

Merge branch 'main' into hp/cpu-longjmp-after-assert

41cb051

Merge remote-tracking branch 'origin/main' into hp/cpu-longjmp-after-…

0006e75

…assert # Conflicts: # quadrants/runtime/cpu/kernel_launcher.cpp

hughperkins force-pushed the hp/cpu-longjmp-after-assert branch from c552c09 to 0006e75 · March 16, 2026 18:39

Merge remote-tracking branch 'origin/main' into hp/cpu-longjmp-after-…

661c9fe

…assert # Conflicts: # quadrants/runtime/cpu/kernel_launcher.cpp

hughperkins wants to merge 7 commits into main from hp/cpu-longjmp-after-assert
