[CI] Add slow marker and remove un-necessary tests by hughperkins · Pull Request #711 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-05-21T10:11:26Z

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Add a `slow` pytest marker, mark the worst-case tests with it, and have `tests/run_tests.py` skip those tests by default (use `--run-slow` to include them, or `pytest -m slow` to run only those). Picked from macOS CI per-file timing (QD_FILE_TIMING=1, run 26083950810): phase 1 totals 6415s across 8641 test calls; the slowest 3 files alone (test_eig, test_tile16, test_linalg) cover 55%. The cost of test_eig / test_make_spd is super-linear in matrix size n (n=12 ≈ 5x n=9). Marked slow: - Parametrize cases n in {6, 9, 12} (and 7..11 for inverse_large) across test_eig.py and test_linalg.py. - Rectangular (9, 12) / (12, 3) cases in test_frobenius_inner_rectangular. - test_matmul_chain_qipc_sizes_{f32,f64} (>130s each on macOS CI). - test_clear_all_gradients (180s/invocation). - test_reset_ndarrays::test_ndarray_doesnt_crash_on_gc (127s). - test_mpm88::{test_mpm88, test_mpm88_numpy_and_ndarray} (~30s/invocation). - test_struct::test_2d_nested (122s/invocation). run_tests.py composes `not slow` with any user-supplied `-m` expression, so existing CI invocations like `-m "not needs_torch"` become `(not needs_torch) and not slow`. Note that this also drops slow tests from GPU / Linux / macOS CI runs — a separate workflow (or `--run-slow` job) is needed if we still want to exercise the n>=6 / n=12 paths in CI.

The previous lists ([4, 5, 6, 9, 12], [2, 3, 4, 6, 9, 12], [5..12], etc.) gave the Householder/QR path a lot of redundant size coverage. For routine CI we only need to exercise a small size + the largest supported size (12, which also doubles as the slow-marked stress case): if a bug shows up only at n=7 or n=11 it almost certainly also shows up at n=12. test_eig.py sym_eig_general_{f32,f64} [4,5,6,9,12] -> [4, 12*] make_spd_{f32,f64} [4,6,9,12] -> [4, 12*] sym_eig_alpha_identity_f64 [4,6,9,12] -> [4, 12*] make_spd_idempotent_f64 [4,6,9,12] -> [4, 12*] make_spd_negative_definite_zero_f64 [4,6,9,12] -> [4, 12*] sym_eig_sort_order_{f32,f64} [2,3,4,6,9,12] -> [3, 12*] test_linalg.py frobenius_inner_{f32,f64} [2,3,6,9,12] -> [3, 12*] inverse_large_{f32,f64} [5..12] -> [5, 12*] * n=12 retains the `slow` marker, so default `run_tests.py` invocations only hit n=4 / n=3 / n=5. `--run-slow` runs both. Closed-form 2x2/3x3 paths in test_sym_eig_sort_order: dropped n=2 in favour of n=3 (per directive); the 2x2 path is still covered by test_sym_eig2x2_{f32,f64}. The 3x3 closed-form path stays covered by n=3. Other parametrize lists left untouched: - rectangular (rows, cols) tuples in test_frobenius_inner_rectangular (it's varying shape, not pure size). - test_mat_inverse_size's `range(1, 5)` (tiny sizes only). - `a00` integer parametrize in test_sym_eig3x3_{f32,f64}.

The blocked-Cholesky demo previously hard-coded N=92, N_ENVS=4096, WARMUP=50, ITERS=200 as module globals. The unit-test wrapper test_tile16_cholesky_blocked_demo runs the demo as a subprocess and only cares that it returns 0; at the hard-coded sizes that takes ~74 s on cluster CUDA, dominated by JIT-compiling 3 large unrolled kernels at N=92 and running the 4096-env x 250-iter benchmark loop. Expose all four knobs as command-line flags with the previous values as defaults, so: python misc/demos/cholesky_blocked.py # unchanged, full demo python misc/demos/cholesky_blocked.py --n 32 --n-envs 64 \ --num-warmup 1 --num-iters 1 # smoke-mode The test will switch to the smoke-mode invocation in a follow-up commit so it stops dominating the slow critical path. Flag names (--n, --n-envs, --num-warmup, --num-iters) follow the user spec; using argparse + ArgumentDefaultsHelpFormatter so --help shows the full demo defaults.

Pass small CLI overrides (--n 32 --n-envs 64 --num-warmup 1 --num-iters 1) so the demo runs end-to-end in seconds instead of ~74 s. The test contract is just "demo exits 0"; it doesn't read any of the benchmark numbers, so the smaller workload still satisfies the smoke test. The full N=92 / N_ENVS=4096 / 50+200-iter demo is still what humans running misc/demos/cholesky_blocked.py see by default (argparse defaults match the previous hard-coded values). Together with the previous commit, this drops the test_tile16_cholesky_blocked_demo wall time on cluster CUDA from ~74 s to (expected) a few seconds, removing the largest remaining single-test outlier on hp/mark-slow-tests.

Previously the test hard-coded the qipc IPC sizes (9x12) · (12x12) · (12x9). On cluster CUDA those two cases (f32 + f64) take ~92.7s and ~87.3s respectively -- the top two single-test outliers in the suite, each holding one xdist worker for ~90s of contiguous JIT-compile + unrolled-FMA work. Parametrize `_test_matmul_chain` on (rows_a, cols_a, cols_b, cols_c). Default lane runs the small (3,4,4,3) chain to exercise the same Matrix.__matmul__ codegen path; the original (9,12,12,9) qipc-sized chain is slow-marked so it still runs on --run-slow (i.e. CI's nightly / release lane, once that's wired up). Estimated saving: ~180s CPU, ~70s wall (these tests were on the critical path of the branch run). No function-level coverage lost: both f32 and f64 versions still run the same chain by default, just at a smaller size.

Previously hard-coded N=30 (900 particles), n_grid=120, steps=32 -- 26s on cluster CUDA. The test's actual contract is that the AD-validation checker raises QuadrantsAssertionError on the global-data-access violation in g2p (`v[f, p] = new_v`), which fires on the first substep regardless of grid / particle / step counts. Parametrize on (particles_side, n_grid_size, num_steps) with a small default (8, 32, 4) and slow-marked original (30, 120, 32). The default still exercises the same diff-MPM pipeline (p2g / grid_op / g2p, qd.ad.Tape with validation=True, `with pytest.raises(...)`) and still triggers the assertion error. Estimated CPU saving: ~22s; wall saving ~3s on the branch run.

…ne op-parametrized test The three reduce variants (and the three scan variants) shared an identical kernel signature, identical input shape, and differed only in (a) which qd.algorithms.device_<op> function they called and (b) overflow vs bitwise-exact verification. Collapse each triple into a single op-parametrized test: test_device_reduce(op, dtype, N) # op in {add, min, max} test_device_exclusive_scan(op, dtype, N) # op in {add, min, max} Behavior, coverage and the parametrize space are unchanged -- pytest still collects the same number of parametrize cases, just under unified test names. This is purely a code-dedup refactor (~130 LOC less) which makes the next op-axis sampling change (if/when we choose to drop A vs B vs C from the sweep) a one-line edit.

…run_tests help string Pure formatting fix from `pre-commit run -a`; no behavior change.

…ppers into 2 op-parametrized tests Lines 3608-3694 in test_simt.py were 18 ~5-line wrappers each calling ``_check_full_matches_tiled(subgroup.<op>, subgroup.<op>_tiled, ...)``. Lines 3841-3848 were 2 more, parametrized on dtype. ``_check_full_matches_tiled`` already accepts the full / tiled functions as Python arguments (closure-captured into ``@qd.kernel``), so collapsing the family is a pure dedup move: test_subgroup_full_matches_tiled(op_name, host_init) # 18 cases: {reduce, inclusive, exclusive}_{add,min,max,mul,and,or,xor} on qd.i32 test_subgroup_full_matches_tiled_float(op_name, dtype) # 4 cases: {reduce_add, inclusive_add} x {qd.f32, qd.f64} Behavior + coverage unchanged (still 22 parametrize cases, same dtype + init configurations). Pytest ids are designed to match the original test-name suffixes (e.g. ``[reduce_add]``, ``[inclusive_mul]``) so ``-k`` selectors and test reports stay readable. Drops ~50 LOC net.

…zed tests The six block-reduce tests (3 single-output + 3 broadcast) share an identical kernel skeleton, parametrize axes, and verification loop. They only differ in which `block.reduce_*` function is called (closure-captured into `@qd.kernel` via getattr), the host-side reference oracle, the init pattern (sequential for `add` so the running sum has signal; permuted hash for `min` / `max` so the result depends on lanes other than first / last), and the float tolerance regime (relative for accumulating `add`, absolute for picker `min` / `max`). Collapse the six tests into two op-parametrized tests: test_block_reduce(sg_per_block, dtype, op_name, ...) # single-output, 3 ops test_block_reduce_all(sg_per_block, dtype, op_name, ...) # broadcast, 3 ops Parametrize space is unchanged (3 sg x 5 dtype x 3 op = 45 cases per fused test, matching the original 3 tests x 15 cases each). Pytest ids use plain `[add|min|max]` suffixes so `-k` selectors remain readable. Drops ~100 LOC of boilerplate -- two new small helpers (`_init_block_reduce_src` and `_assert_block_reduce_close`) capture the per-op behavioral differences in one place each.

…zed test The three block inclusive scan tests share the same kernel skeleton and only differ in the closure-captured `block.inclusive_<op>` function, the host-side reference oracle, the init pattern (sequential for `add` -- sums grow with prefix length; permuted for `min` / `max` -- result depends on lanes other than first / last), and the float tolerance regime (relative for `add`, absolute for `min` / `max`). Collapse into one op-parametrized test: test_block_inclusive(sg_per_block, dtype, op_name, ...) Identical param count to the original three tests (3 sg x 5 dtype x 3 op = 45 cases vs original 3 x 15). Pulls a shared `_assert_block_scan_close` helper out so the int / relative-float / absolute-float regime is encoded in one place; the relative-float branch keeps the floor-on-tol-base trick needed by the original `test_block_exclusive_add` (also routed through the same helper). `test_block_exclusive_add` stays as its own function for now because the matching exclusive `min` / `max` cases need dtype-derived sentinel identities + ``isinf`` handling that's different enough that fusing them in would create more branches than it removes; can address that in a follow-up if needed.

…trized test `test_block_exclusive_min` and `test_block_exclusive_max` share the same permuted-init pattern and only differ in the dtype-derived sentinel identity (``+inf`` / ``iinfo.max`` for min, ``-inf`` / ``iinfo.min`` for max) and the inf-sign check at lane 0. Collapse into one op-parametrized test that takes ``(op_name, sentinel_fn, py_op, inf_sign)`` and dispatches via getattr + the (already module-level) `_PY_MIN` / `_PY_MAX` lambdas. Identical param count to the original pair (3 sg x 5 dtype x 2 op = 30 cases vs original 2 x 15 each = 30). `test_block_exclusive_add` remains its own function because the integer identity is `0` (not `iinfo.max/min`) and the init pattern is sequential -- different enough that fusing it in would add more branches than it removes. Drops ~30 LOC.

The PR's `Check line wrapping` CI agent flagged three comments wrapped at the AI-default ~78-90c instead of the project's 120c target. Reflow each to the full target width: - tests/python/test_tile16.py:1791 (78c -> 120c) docstring for test_tile16_cholesky_blocked_demo. - tests/python/test_ad_gdar_diffmpm.py:8 (85c -> 120c) the "defaults shrink ..." comment above the parametrize block. - tests/run_tests.py:60 (90c -> 120c) the "--run-slow opts back in" comment. Also collapse the dangling-backslash continuation in misc/demos/cholesky_blocked.py's Usage example onto one line (69c -> 109c). No behavior change; comments only. Verified via the cursor find-underwrapped skill that the remaining flagged runs in my diff are all 103-116c with save~=0 (already-tight runs the greedy heuristic still reports), comfortably in the agent's "not borderline" exemption.

Hugh requested in PR #709 review comment that the testing bullet collapse to just a pointer at unit_testing.md, since the long inline summary duplicates the dedicated doc immediately below.

@sample

Documents the test launcher, the @pytest.mark.slow marker (whole-test and parametrize-case variants), how to write a new parametrized test with the test_utils.test decorator, and the Advanced section with the per-test timeout, kernel compilation cache, and per-file timing knobs. Modeled on the structure of the equivalent doc on hp/mark-slow-tests (after Hugh's two rounds of PR review feedback there) but with all @pytest.mark.sample references stripped, since the @sample marker is not part of this branch.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a85c6ecbcc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-21T10:16:39Z

    sm70: Can only run on GPU with compute capability 7.0 or higher.
    needs_torch: mark test as requiring PyTorch.
+    slow: mark test (or parametrize case) as slow. Skipped by default by tests/run_tests.py;
+        pass --run-slow to include them, or directly `pytest -m slow` to run only the slow ones.


Keep slow marker description in one ini entry

The wrapped description line is parsed as a second marker entry, so pytest now registers an unintended marker named pass --run-slow to include them, or directly \pytest -m slow` to run only the slow ones.instead of treating it as part ofslow's description. You can see this in pytest --markers, and it can confuse marker tooling (and future strict marker validation) because tests/pytest.inino longer defines markers as onename: description` per line.

Useful? React with 👍 / 👎.

github-actions · 2026-05-21T10:46:05Z

Total: 12 file(s) changed, +296 -473 code lines.

github-actions · 2026-05-21T11:41:07Z

Diff coverage: 97% · 180 lines, 5 missing

duburcqa · 2026-05-21T12:16:22Z

ok to merge

Hugh Perkins (deskai7) added 15 commits May 21, 2026 02:58

[Style] black: reformat test_tile16_cholesky_blocked_demo cmd list + …

8acaaec

…run_tests help string Pure formatting fix from `pre-commit run -a`; no behavior change.

[Doc] contributing.md: shorten testing bullet per PR review

7389b5f

Hugh requested in PR #709 review comment that the testing bullet collapse to just a pointer at unit_testing.md, since the long inline summary duplicates the dedicated doc immediately below.

hughperkins assigned duburcqa May 21, 2026

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Add slow marker and remove un-necessary tests#711

[CI] Add slow marker and remove un-necessary tests#711
hughperkins wants to merge 15 commits into
mainfrom
hp/mark-slow-tests-only

hughperkins commented May 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

duburcqa commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hughperkins commented May 21, 2026

Brief Summary

Walkthrough

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

duburcqa commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants