[Vulkan] Skip 3x3 sym_eig tests + dedup OpTypeArray on NVIDIA #706

Open
hughperkins wants to merge 1 commit into main from hp/vulkan-sym-eig-segfault

@hughperkins
Collaborator

… in pipeline creation

NVIDIA driver 580.76.05 SIGSEGVs in libnvidia-gpucomp.so / libnvidia-glvkspirv.so during compute-pipeline creation for the fully-inlined _sym_eig3x3 (Eigen3 computeDirect Cardano method + dsyevq3 Givens-rotation fallback) shader. The emitted SPIR-V is accepted by spirv-val --target-env vulkan1.3 and round-trips cleanly through spirv-cross, so the bug is in NVIDIA's SPIR-V → NVVM frontend, not Quadrants codegen — test_sym_eig_sort_order already documents the same crash and skips the n=3 case (see comment there).

Two changes:

  1. tests/python/test_eig.py — skip the four affected tests on Vulkan (test_sym_eig3x3_identity_f{32,64}, test_sym_eig3x3_f{32,64}) with a matching comment pointing at the same pre-existing driver quirk. n=2 and n>=4 are unaffected.

  2. quadrants/codegen/spirv/spirv_ir_builder.{h,cpp} — dedup OpTypeArray declarations in get_function_array_type / get_array_type. The Jacobi path was emitting six independent float[3] / float[9] types for the same local SoA, which trips strict drivers (NVIDIA actually crashes in pipeline creation on the duplicated-type variant — separate code path from the above, but same blast radius) and leaves observable _arr_float_uint_3_0 / ..._1 / ..._2 aliases in QD_DUMP_IR and spirv-cross output. Separate caches for the Function-scope vs. ArrayStride-decorated buffer variants — sharing one cache would re-apply ArrayStride to Function-scope arrays and re-introduce VUID-StandaloneSpirv-None-10684. This dedup is independent from the sym_eig skip (alone it isn't sufficient to make _sym_eig3x3 compile on NVIDIA) but is a real bug worth fixing on its own.
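
The dedup described in item 2 can be sketched as a small model in Python. This is not the code in `spirv_ir_builder.cpp`: the class `ArrayTypeBuilder`, its fields, and the simplified SPIR-V text it records are hypothetical, and only the two method names are taken from the description above. It shows one cache per variant, so a Function-scope array never picks up an `ArrayStride` decoration from the buffer variant.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ArrayTypeBuilder:
    """Toy stand-in for a SPIR-V IR builder (hypothetical, for illustration)."""

    _ids: count = field(default_factory=lambda: count(1))
    # Two caches on purpose: Function-scope arrays are undecorated, buffer
    # arrays carry ArrayStride, so the variants must not share a result id.
    _function_arrays: dict = field(default_factory=dict)
    _buffer_arrays: dict = field(default_factory=dict)
    emitted: list = field(default_factory=list)

    def get_function_array_type(self, elem: str, length: int) -> int:
        """Return the id of an undecorated array type, emitting it once."""
        key = (elem, length)
        if key not in self._function_arrays:
            type_id = next(self._ids)
            self.emitted.append(f"%{type_id} = OpTypeArray {elem} {length}")
            self._function_arrays[key] = type_id
        return self._function_arrays[key]

    def get_array_type(self, elem: str, length: int, stride: int) -> int:
        """Return the id of an ArrayStride-decorated array type, emitting it once."""
        key = (elem, length, stride)
        if key not in self._buffer_arrays:
            type_id = next(self._ids)
            self.emitted.append(f"%{type_id} = OpTypeArray {elem} {length}")
            self.emitted.append(f"OpDecorate %{type_id} ArrayStride {stride}")
            self._buffer_arrays[key] = type_id
        return self._buffer_arrays[key]


builder = ArrayTypeBuilder()
# Six requests for the same local float[3] now resolve to one declaration.
local_ids = {builder.get_function_array_type("float", 3) for _ in range(6)}
buffer_id = builder.get_array_type("float", 3, stride=4)
```

In this model the six local requests share one id, and the buffer variant gets a different id with its own decoration, which is the separation the description argues for.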


@hughperkins
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b18f47c31

Comment thread tests/python/test_eig.py


def _test_sym_eig3x3(dt, a00):
    if qd.lang.impl.current_cfg().arch == qd.vulkan:

[P2] Gate 3x3 skip to affected Vulkan drivers

The new arch == qd.vulkan guard skips these tests on every Vulkan implementation, but the failure described here is NVIDIA-driver-specific; in the same file, _test_sym_eig_sort_order notes that n == 3 runs cleanly on AMD Vulkan. As written, AMD/Intel Vulkan runs will now always skip _sym_eig3x3 coverage, so real regressions in the 3x3 path on unaffected Vulkan stacks can no longer be detected. Please narrow this skip to the problematic vendor/driver condition instead of all Vulkan backends.
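
One possible shape for the narrower guard is sketched below. The helper `should_skip_sym_eig3x3` and its string arguments are hypothetical. How the test would obtain the device vendor from Quadrants is not shown in this thread, so the vendor name is passed in as plain text here.

```python
def should_skip_sym_eig3x3(arch: str, device_vendor: str) -> bool:
    """Return True only for the Vulkan + NVIDIA combination.

    Both arguments are plain strings in this sketch; a real test would
    derive them from the active backend and the physical device it reports.
    """
    return arch == "vulkan" and "nvidia" in device_vendor.lower()


# AMD and Intel Vulkan stacks keep their 3x3 coverage under this guard.
for vendor in ("NVIDIA Corporation", "AMD", "Intel"):
    print(vendor, should_skip_sym_eig3x3("vulkan", vendor))
```

With a guard like this, the skip applies only where the crash was reported, and non-Vulkan backends are never affected.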

Collaborator Author

Also, these should probably be xfail, not skip, I would think.
