Skip to content

[CI] Add slow marker and remove un-necessary tests#711

Open
hughperkins wants to merge 15 commits into
mainfrom
hp/mark-slow-tests-only
Open

[CI] Add slow marker and remove un-necessary tests#711
hughperkins wants to merge 15 commits into
mainfrom
hp/mark-slow-tests-only

Conversation

@hughperkins
Copy link
Copy Markdown
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Hugh Perkins (deskai7) added 15 commits May 21, 2026 02:58
Add a `slow` pytest marker, mark the worst-case tests with it, and have
`tests/run_tests.py` skip those tests by default (use `--run-slow` to include
them, or `pytest -m slow` to run only those).

Picked from macOS CI per-file timing (QD_FILE_TIMING=1, run 26083950810):
phase 1 totals 6415s across 8641 test calls; the slowest 3 files alone
(test_eig, test_tile16, test_linalg) cover 55%. The cost of test_eig /
test_make_spd is super-linear in matrix size n (n=12 ≈ 5x n=9).

Marked slow:

  - Parametrize cases n in {6, 9, 12} (and 7..11 for inverse_large) across
    test_eig.py and test_linalg.py.
  - Rectangular (9, 12) / (12, 3) cases in test_frobenius_inner_rectangular.
  - test_matmul_chain_qipc_sizes_{f32,f64} (>130s each on macOS CI).
  - test_clear_all_gradients (180s/invocation).
  - test_reset_ndarrays::test_ndarray_doesnt_crash_on_gc (127s).
  - test_mpm88::{test_mpm88, test_mpm88_numpy_and_ndarray} (~30s/invocation).
  - test_struct::test_2d_nested (122s/invocation).

run_tests.py composes `not slow` with any user-supplied `-m` expression, so
existing CI invocations like `-m "not needs_torch"` become
`(not needs_torch) and not slow`. Note that this also drops slow tests from
GPU / Linux / macOS CI runs — a separate workflow (or `--run-slow` job) is
needed if we still want to exercise the n>=6 / n=12 paths in CI.
The previous lists ([4, 5, 6, 9, 12], [2, 3, 4, 6, 9, 12], [5..12], etc.) gave
the Householder/QR path a lot of redundant size coverage. For routine CI we
only need to exercise a small size + the largest supported size (12, which
also doubles as the slow-marked stress case): if a bug shows up only at
n=7 or n=11 it almost certainly also shows up at n=12.

  test_eig.py
    sym_eig_general_{f32,f64}             [4,5,6,9,12]     -> [4, 12*]
    make_spd_{f32,f64}                    [4,6,9,12]       -> [4, 12*]
    sym_eig_alpha_identity_f64            [4,6,9,12]       -> [4, 12*]
    make_spd_idempotent_f64               [4,6,9,12]       -> [4, 12*]
    make_spd_negative_definite_zero_f64   [4,6,9,12]       -> [4, 12*]
    sym_eig_sort_order_{f32,f64}          [2,3,4,6,9,12]   -> [3, 12*]
  test_linalg.py
    frobenius_inner_{f32,f64}             [2,3,6,9,12]     -> [3, 12*]
    inverse_large_{f32,f64}               [5..12]          -> [5, 12*]

* n=12 retains the `slow` marker, so default `run_tests.py` invocations only
  hit n=4 / n=3 / n=5. `--run-slow` runs both.

Closed-form 2x2/3x3 paths in test_sym_eig_sort_order: dropped n=2 in favour
of n=3 (per directive); the 2x2 path is still covered by
test_sym_eig2x2_{f32,f64}. The 3x3 closed-form path stays covered by n=3.

Other parametrize lists left untouched:
  - rectangular (rows, cols) tuples in test_frobenius_inner_rectangular (it's
    varying shape, not pure size).
  - test_mat_inverse_size's `range(1, 5)` (tiny sizes only).
  - `a00` integer parametrize in test_sym_eig3x3_{f32,f64}.
The blocked-Cholesky demo previously hard-coded N=92, N_ENVS=4096,
WARMUP=50, ITERS=200 as module globals. The unit-test wrapper
test_tile16_cholesky_blocked_demo runs the demo as a subprocess and
only cares that it returns 0; at the hard-coded sizes that takes ~74 s
on cluster CUDA, dominated by JIT-compiling 3 large unrolled kernels
at N=92 and running the 4096-env x 250-iter benchmark loop.

Expose all four knobs as command-line flags with the previous values as
defaults, so:

    python misc/demos/cholesky_blocked.py                                # unchanged, full demo
    python misc/demos/cholesky_blocked.py --n 32 --n-envs 64 \
        --num-warmup 1 --num-iters 1                                    # smoke-mode

The test will switch to the smoke-mode invocation in a follow-up
commit so it stops dominating the slow critical path.

Flag names (--n, --n-envs, --num-warmup, --num-iters) follow the user
spec; using argparse + ArgumentDefaultsHelpFormatter so --help shows
the full demo defaults.
Pass small CLI overrides (--n 32 --n-envs 64 --num-warmup 1
--num-iters 1) so the demo runs end-to-end in seconds instead of ~74 s.
The test contract is just "demo exits 0"; it doesn't read any of the
benchmark numbers, so the smaller workload still satisfies the smoke
test.

The full N=92 / N_ENVS=4096 / 50+200-iter demo is still what humans
running misc/demos/cholesky_blocked.py see by default (argparse
defaults match the previous hard-coded values).

Together with the previous commit, this drops the
test_tile16_cholesky_blocked_demo wall time on cluster CUDA from
~74 s to (expected) a few seconds, removing the largest remaining
single-test outlier on hp/mark-slow-tests.
Previously the test hard-coded the qipc IPC sizes (9x12) · (12x12) ·
(12x9). On cluster CUDA those two cases (f32 + f64) take ~92.7s and
~87.3s respectively -- the top two single-test outliers in the suite,
each holding one xdist worker for ~90s of contiguous JIT-compile +
unrolled-FMA work.

Parametrize `_test_matmul_chain` on (rows_a, cols_a, cols_b, cols_c).
Default lane runs the small (3,4,4,3) chain to exercise the same
Matrix.__matmul__ codegen path; the original (9,12,12,9) qipc-sized
chain is slow-marked so it still runs on --run-slow (i.e. CI's nightly
/ release lane, once that's wired up).

Estimated saving: ~180s CPU, ~70s wall (these tests were on the
critical path of the branch run).

No function-level coverage lost: both f32 and f64 versions still run
the same chain by default, just at a smaller size.
Previously hard-coded N=30 (900 particles), n_grid=120, steps=32 -- 26s
on cluster CUDA. The test's actual contract is that the AD-validation
checker raises QuadrantsAssertionError on the global-data-access
violation in g2p (`v[f, p] = new_v`), which fires on the first substep
regardless of grid / particle / step counts.

Parametrize on (particles_side, n_grid_size, num_steps) with a small
default (8, 32, 4) and slow-marked original (30, 120, 32). The default
still exercises the same diff-MPM pipeline (p2g / grid_op / g2p,
qd.ad.Tape with validation=True, `with pytest.raises(...)`) and still
triggers the assertion error.

Estimated CPU saving: ~22s; wall saving ~3s on the branch run.
…ne op-parametrized test

The three reduce variants (and the three scan variants) shared an identical
kernel signature, identical input shape, and differed only in (a) which
qd.algorithms.device_<op> function they called and (b) overflow vs
bitwise-exact verification. Collapse each triple into a single op-parametrized
test:

  test_device_reduce(op, dtype, N)            # op in {add, min, max}
  test_device_exclusive_scan(op, dtype, N)    # op in {add, min, max}

Behavior, coverage and the parametrize space are unchanged -- pytest still
collects the same number of parametrize cases, just under unified test names.
This is purely a code-dedup refactor (~130 LOC less) which makes the next
op-axis sampling change (if/when we choose to drop A vs B vs C from the
sweep) a one-line edit.
…run_tests help string

Pure formatting fix from `pre-commit run -a`; no behavior change.
…ppers into 2 op-parametrized tests

Lines 3608-3694 in test_simt.py were 18 ~5-line wrappers each calling
``_check_full_matches_tiled(subgroup.<op>, subgroup.<op>_tiled, ...)``.
Lines 3841-3848 were 2 more, parametrized on dtype. ``_check_full_matches_tiled``
already accepts the full / tiled functions as Python arguments (closure-captured
into ``@qd.kernel``), so collapsing the family is a pure dedup move:

  test_subgroup_full_matches_tiled(op_name, host_init)
      # 18 cases: {reduce, inclusive, exclusive}_{add,min,max,mul,and,or,xor} on qd.i32

  test_subgroup_full_matches_tiled_float(op_name, dtype)
      # 4 cases: {reduce_add, inclusive_add} x {qd.f32, qd.f64}

Behavior + coverage unchanged (still 22 parametrize cases, same dtype + init
configurations). Pytest ids are designed to match the original test-name
suffixes (e.g. ``[reduce_add]``, ``[inclusive_mul]``) so ``-k`` selectors and
test reports stay readable. Drops ~50 LOC net.
…zed tests

The six block-reduce tests (3 single-output + 3 broadcast) share an identical
kernel skeleton, parametrize axes, and verification loop. They only differ in
which `block.reduce_*` function is called (closure-captured into `@qd.kernel`
via getattr), the host-side reference oracle, the init pattern (sequential for
`add` so the running sum has signal; permuted hash for `min` / `max` so the
result depends on lanes other than first / last), and the float tolerance
regime (relative for accumulating `add`, absolute for picker `min` / `max`).
Collapse the six tests into two op-parametrized tests:

  test_block_reduce(sg_per_block, dtype, op_name, ...)        # single-output, 3 ops
  test_block_reduce_all(sg_per_block, dtype, op_name, ...)    # broadcast, 3 ops

Parametrize space is unchanged (3 sg x 5 dtype x 3 op = 45 cases per fused
test, matching the original 3 tests x 15 cases each). Pytest ids use plain
`[add|min|max]` suffixes so `-k` selectors remain readable. Drops ~100 LOC of
boilerplate -- two new small helpers (`_init_block_reduce_src` and
`_assert_block_reduce_close`) capture the per-op behavioral differences in one
place each.
…zed test

The three block inclusive scan tests share the same kernel skeleton and only
differ in the closure-captured `block.inclusive_<op>` function, the host-side
reference oracle, the init pattern (sequential for `add` -- sums grow with
prefix length; permuted for `min` / `max` -- result depends on lanes other
than first / last), and the float tolerance regime (relative for `add`,
absolute for `min` / `max`). Collapse into one op-parametrized test:

  test_block_inclusive(sg_per_block, dtype, op_name, ...)

Identical param count to the original three tests (3 sg x 5 dtype x 3 op =
45 cases vs original 3 x 15). Pulls a shared `_assert_block_scan_close`
helper out so the int / relative-float / absolute-float regime is encoded in
one place; the relative-float branch keeps the floor-on-tol-base trick
needed by the original `test_block_exclusive_add` (also routed through the
same helper). `test_block_exclusive_add` stays as its own function for now
because the matching exclusive `min` / `max` cases need dtype-derived
sentinel identities + ``isinf`` handling that's different enough that
fusing them in would create more branches than it removes; can address
that in a follow-up if needed.
…trized test

`test_block_exclusive_min` and `test_block_exclusive_max` share the same
permuted-init pattern and only differ in the dtype-derived sentinel identity
(``+inf`` / ``iinfo.max`` for min, ``-inf`` / ``iinfo.min`` for max) and the
inf-sign check at lane 0. Collapse into one op-parametrized test that takes
``(op_name, sentinel_fn, py_op, inf_sign)`` and dispatches via getattr +
the (already module-level) `_PY_MIN` / `_PY_MAX` lambdas.

Identical param count to the original pair (3 sg x 5 dtype x 2 op = 30 cases
vs original 2 x 15 each = 30). `test_block_exclusive_add` remains its own
function because the integer identity is `0` (not `iinfo.max/min`) and the
init pattern is sequential -- different enough that fusing it in would add
more branches than it removes. Drops ~30 LOC.
The PR's `Check line wrapping` CI agent flagged three comments wrapped at
the AI-default ~78-90c instead of the project's 120c target. Reflow each
to the full target width:

  - tests/python/test_tile16.py:1791  (78c -> 120c) docstring for
    test_tile16_cholesky_blocked_demo.
  - tests/python/test_ad_gdar_diffmpm.py:8  (85c -> 120c) the
    "defaults shrink ..." comment above the parametrize block.
  - tests/run_tests.py:60  (90c -> 120c) the "--run-slow opts back in"
    comment.

Also collapse the dangling-backslash continuation in
misc/demos/cholesky_blocked.py's Usage example onto one line (69c -> 109c).

No behavior change; comments only. Verified via the cursor
find-underwrapped skill that the remaining flagged runs in my diff are
all 103-116c with save~=0 (already-tight runs the greedy heuristic still
reports), comfortably in the agent's "not borderline" exemption.
Hugh requested in PR #709 review comment that the testing bullet collapse
to just a pointer at unit_testing.md, since the long inline summary
duplicates the dedicated doc immediately below.
Documents the test launcher, the @pytest.mark.slow marker (whole-test
and parametrize-case variants), how to write a new parametrized test
with the test_utils.test decorator, and the Advanced section with the
per-test timeout, kernel compilation cache, and per-file timing knobs.

Modeled on the structure of the equivalent doc on hp/mark-slow-tests
(after Hugh's two rounds of PR review feedback there) but with all
@pytest.mark.sample references stripped, since the @sample marker is
not part of this branch.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a85c6ecbcc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread tests/pytest.ini
sm70: Can only run on GPU with compute capability 7.0 or higher.
needs_torch: mark test as requiring PyTorch.
slow: mark test (or parametrize case) as slow. Skipped by default by tests/run_tests.py;
pass --run-slow to include them, or directly `pytest -m slow` to run only the slow ones.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep slow marker description in one ini entry

The wrapped description line is parsed as a second marker entry, so pytest now registers an unintended marker named pass --run-slow to include them, or directly \pytest -m slow` to run only the slow ones.instead of treating it as part ofslow's description. You can see this in pytest --markers, and it can confuse marker tooling (and future strict marker validation) because tests/pytest.inino longer defines markers as onename: description` per line.

Useful? React with 👍 / 👎.

@github-actions
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown

@duburcqa
Copy link
Copy Markdown
Contributor

ok to merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants