[CI] Add slow marker and remove un-necessary tests#711
Conversation
Add a `slow` pytest marker, mark the worst-case tests with it, and have
`tests/run_tests.py` skip those tests by default (use `--run-slow` to include
them, or `pytest -m slow` to run only those).
Picked from macOS CI per-file timing (QD_FILE_TIMING=1, run 26083950810):
phase 1 totals 6415s across 8641 test calls; the slowest 3 files alone
(test_eig, test_tile16, test_linalg) cover 55%. The cost of test_eig /
test_make_spd is super-linear in matrix size n (n=12 ≈ 5x n=9).
Marked slow:
- Parametrize cases n in {6, 9, 12} (and 7..11 for inverse_large) across
test_eig.py and test_linalg.py.
- Rectangular (9, 12) / (12, 3) cases in test_frobenius_inner_rectangular.
- test_matmul_chain_qipc_sizes_{f32,f64} (>130s each on macOS CI).
- test_clear_all_gradients (180s/invocation).
- test_reset_ndarrays::test_ndarray_doesnt_crash_on_gc (127s).
- test_mpm88::{test_mpm88, test_mpm88_numpy_and_ndarray} (~30s/invocation).
- test_struct::test_2d_nested (122s/invocation).
run_tests.py composes `not slow` with any user-supplied `-m` expression, so
existing CI invocations like `-m "not needs_torch"` become
`(not needs_torch) and not slow`. Note that this also drops slow tests from
GPU / Linux / macOS CI runs — a separate workflow (or `--run-slow` job) is
needed if we still want to exercise the n>=6 / n=12 paths in CI.
The previous lists ([4, 5, 6, 9, 12], [2, 3, 4, 6, 9, 12], [5..12], etc.) gave
the Householder/QR path a lot of redundant size coverage. For routine CI we
only need to exercise a small size + the largest supported size (12, which
also doubles as the slow-marked stress case): if a bug shows up only at
n=7 or n=11 it almost certainly also shows up at n=12.
test_eig.py
sym_eig_general_{f32,f64} [4,5,6,9,12] -> [4, 12*]
make_spd_{f32,f64} [4,6,9,12] -> [4, 12*]
sym_eig_alpha_identity_f64 [4,6,9,12] -> [4, 12*]
make_spd_idempotent_f64 [4,6,9,12] -> [4, 12*]
make_spd_negative_definite_zero_f64 [4,6,9,12] -> [4, 12*]
sym_eig_sort_order_{f32,f64} [2,3,4,6,9,12] -> [3, 12*]
test_linalg.py
frobenius_inner_{f32,f64} [2,3,6,9,12] -> [3, 12*]
inverse_large_{f32,f64} [5..12] -> [5, 12*]
* n=12 retains the `slow` marker, so default `run_tests.py` invocations only
hit n=4 / n=3 / n=5. `--run-slow` runs both.
Closed-form 2x2/3x3 paths in test_sym_eig_sort_order: dropped n=2 in favour
of n=3 (per directive); the 2x2 path is still covered by
test_sym_eig2x2_{f32,f64}. The 3x3 closed-form path stays covered by n=3.
Other parametrize lists left untouched:
- rectangular (rows, cols) tuples in test_frobenius_inner_rectangular (it's
varying shape, not pure size).
- test_mat_inverse_size's `range(1, 5)` (tiny sizes only).
- `a00` integer parametrize in test_sym_eig3x3_{f32,f64}.
The blocked-Cholesky demo previously hard-coded N=92, N_ENVS=4096,
WARMUP=50, ITERS=200 as module globals. The unit-test wrapper
test_tile16_cholesky_blocked_demo runs the demo as a subprocess and
only cares that it returns 0; at the hard-coded sizes that takes ~74 s
on cluster CUDA, dominated by JIT-compiling 3 large unrolled kernels
at N=92 and running the 4096-env x 250-iter benchmark loop.
Expose all four knobs as command-line flags with the previous values as
defaults, so:
python misc/demos/cholesky_blocked.py # unchanged, full demo
python misc/demos/cholesky_blocked.py --n 32 --n-envs 64 \
--num-warmup 1 --num-iters 1 # smoke-mode
The test will switch to the smoke-mode invocation in a follow-up
commit so it stops dominating the slow critical path.
Flag names (--n, --n-envs, --num-warmup, --num-iters) follow the user
spec; using argparse + ArgumentDefaultsHelpFormatter so --help shows
the full demo defaults.
Pass small CLI overrides (--n 32 --n-envs 64 --num-warmup 1 --num-iters 1) so the demo runs end-to-end in seconds instead of ~74 s. The test contract is just "demo exits 0"; it doesn't read any of the benchmark numbers, so the smaller workload still satisfies the smoke test. The full N=92 / N_ENVS=4096 / 50+200-iter demo is still what humans running misc/demos/cholesky_blocked.py see by default (argparse defaults match the previous hard-coded values). Together with the previous commit, this drops the test_tile16_cholesky_blocked_demo wall time on cluster CUDA from ~74 s to (expected) a few seconds, removing the largest remaining single-test outlier on hp/mark-slow-tests.
Previously the test hard-coded the qipc IPC sizes (9x12) · (12x12) · (12x9). On cluster CUDA those two cases (f32 + f64) take ~92.7s and ~87.3s respectively -- the top two single-test outliers in the suite, each holding one xdist worker for ~90s of contiguous JIT-compile + unrolled-FMA work. Parametrize `_test_matmul_chain` on (rows_a, cols_a, cols_b, cols_c). Default lane runs the small (3,4,4,3) chain to exercise the same Matrix.__matmul__ codegen path; the original (9,12,12,9) qipc-sized chain is slow-marked so it still runs on --run-slow (i.e. CI's nightly / release lane, once that's wired up). Estimated saving: ~180s CPU, ~70s wall (these tests were on the critical path of the branch run). No function-level coverage lost: both f32 and f64 versions still run the same chain by default, just at a smaller size.
Previously hard-coded N=30 (900 particles), n_grid=120, steps=32 -- 26s on cluster CUDA. The test's actual contract is that the AD-validation checker raises QuadrantsAssertionError on the global-data-access violation in g2p (`v[f, p] = new_v`), which fires on the first substep regardless of grid / particle / step counts. Parametrize on (particles_side, n_grid_size, num_steps) with a small default (8, 32, 4) and slow-marked original (30, 120, 32). The default still exercises the same diff-MPM pipeline (p2g / grid_op / g2p, qd.ad.Tape with validation=True, `with pytest.raises(...)`) and still triggers the assertion error. Estimated CPU saving: ~22s; wall saving ~3s on the branch run.
…ne op-parametrized test
The three reduce variants (and the three scan variants) shared an identical
kernel signature, identical input shape, and differed only in (a) which
qd.algorithms.device_<op> function they called and (b) overflow vs
bitwise-exact verification. Collapse each triple into a single op-parametrized
test:
test_device_reduce(op, dtype, N) # op in {add, min, max}
test_device_exclusive_scan(op, dtype, N) # op in {add, min, max}
Behavior, coverage and the parametrize space are unchanged -- pytest still
collects the same number of parametrize cases, just under unified test names.
This is purely a code-dedup refactor (~130 LOC less) which makes the next
op-axis sampling change (if/when we choose to drop A vs B vs C from the
sweep) a one-line edit.
…run_tests help string Pure formatting fix from `pre-commit run -a`; no behavior change.
…ppers into 2 op-parametrized tests
Lines 3608-3694 in test_simt.py were 18 ~5-line wrappers each calling
``_check_full_matches_tiled(subgroup.<op>, subgroup.<op>_tiled, ...)``.
Lines 3841-3848 were 2 more, parametrized on dtype. ``_check_full_matches_tiled``
already accepts the full / tiled functions as Python arguments (closure-captured
into ``@qd.kernel``), so collapsing the family is a pure dedup move:
test_subgroup_full_matches_tiled(op_name, host_init)
# 18 cases: {reduce, inclusive, exclusive}_{add,min,max,mul,and,or,xor} on qd.i32
test_subgroup_full_matches_tiled_float(op_name, dtype)
# 4 cases: {reduce_add, inclusive_add} x {qd.f32, qd.f64}
Behavior + coverage unchanged (still 22 parametrize cases, same dtype + init
configurations). Pytest ids are designed to match the original test-name
suffixes (e.g. ``[reduce_add]``, ``[inclusive_mul]``) so ``-k`` selectors and
test reports stay readable. Drops ~50 LOC net.
…zed tests The six block-reduce tests (3 single-output + 3 broadcast) share an identical kernel skeleton, parametrize axes, and verification loop. They only differ in which `block.reduce_*` function is called (closure-captured into `@qd.kernel` via getattr), the host-side reference oracle, the init pattern (sequential for `add` so the running sum has signal; permuted hash for `min` / `max` so the result depends on lanes other than first / last), and the float tolerance regime (relative for accumulating `add`, absolute for picker `min` / `max`). Collapse the six tests into two op-parametrized tests: test_block_reduce(sg_per_block, dtype, op_name, ...) # single-output, 3 ops test_block_reduce_all(sg_per_block, dtype, op_name, ...) # broadcast, 3 ops Parametrize space is unchanged (3 sg x 5 dtype x 3 op = 45 cases per fused test, matching the original 3 tests x 15 cases each). Pytest ids use plain `[add|min|max]` suffixes so `-k` selectors remain readable. Drops ~100 LOC of boilerplate -- two new small helpers (`_init_block_reduce_src` and `_assert_block_reduce_close`) capture the per-op behavioral differences in one place each.
…zed test The three block inclusive scan tests share the same kernel skeleton and only differ in the closure-captured `block.inclusive_<op>` function, the host-side reference oracle, the init pattern (sequential for `add` -- sums grow with prefix length; permuted for `min` / `max` -- result depends on lanes other than first / last), and the float tolerance regime (relative for `add`, absolute for `min` / `max`). Collapse into one op-parametrized test: test_block_inclusive(sg_per_block, dtype, op_name, ...) Identical param count to the original three tests (3 sg x 5 dtype x 3 op = 45 cases vs original 3 x 15). Pulls a shared `_assert_block_scan_close` helper out so the int / relative-float / absolute-float regime is encoded in one place; the relative-float branch keeps the floor-on-tol-base trick needed by the original `test_block_exclusive_add` (also routed through the same helper). `test_block_exclusive_add` stays as its own function for now because the matching exclusive `min` / `max` cases need dtype-derived sentinel identities + ``isinf`` handling that's different enough that fusing them in would create more branches than it removes; can address that in a follow-up if needed.
…trized test `test_block_exclusive_min` and `test_block_exclusive_max` share the same permuted-init pattern and only differ in the dtype-derived sentinel identity (``+inf`` / ``iinfo.max`` for min, ``-inf`` / ``iinfo.min`` for max) and the inf-sign check at lane 0. Collapse into one op-parametrized test that takes ``(op_name, sentinel_fn, py_op, inf_sign)`` and dispatches via getattr + the (already module-level) `_PY_MIN` / `_PY_MAX` lambdas. Identical param count to the original pair (3 sg x 5 dtype x 2 op = 30 cases vs original 2 x 15 each = 30). `test_block_exclusive_add` remains its own function because the integer identity is `0` (not `iinfo.max/min`) and the init pattern is sequential -- different enough that fusing it in would add more branches than it removes. Drops ~30 LOC.
The PR's `Check line wrapping` CI agent flagged three comments wrapped at
the AI-default ~78-90c instead of the project's 120c target. Reflow each
to the full target width:
- tests/python/test_tile16.py:1791 (78c -> 120c) docstring for
test_tile16_cholesky_blocked_demo.
- tests/python/test_ad_gdar_diffmpm.py:8 (85c -> 120c) the
"defaults shrink ..." comment above the parametrize block.
- tests/run_tests.py:60 (90c -> 120c) the "--run-slow opts back in"
comment.
Also collapse the dangling-backslash continuation in
misc/demos/cholesky_blocked.py's Usage example onto one line (69c -> 109c).
No behavior change; comments only. Verified via the cursor
find-underwrapped skill that the remaining flagged runs in my diff are
all 103-116c with save~=0 (already-tight runs the greedy heuristic still
reports), comfortably in the agent's "not borderline" exemption.
Hugh requested in PR #709 review comment that the testing bullet collapse to just a pointer at unit_testing.md, since the long inline summary duplicates the dedicated doc immediately below.
Documents the test launcher, the @pytest.mark.slow marker (whole-test and parametrize-case variants), how to write a new parametrized test with the test_utils.test decorator, and the Advanced section with the per-test timeout, kernel compilation cache, and per-file timing knobs. Modeled on the structure of the equivalent doc on hp/mark-slow-tests (after Hugh's two rounds of PR review feedback there) but with all @pytest.mark.sample references stripped, since the @sample marker is not part of this branch.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a85c6ecbcc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| sm70: Can only run on GPU with compute capability 7.0 or higher. | ||
| needs_torch: mark test as requiring PyTorch. | ||
| slow: mark test (or parametrize case) as slow. Skipped by default by tests/run_tests.py; | ||
| pass --run-slow to include them, or directly `pytest -m slow` to run only the slow ones. |
There was a problem hiding this comment.
Keep slow marker description in one ini entry
The wrapped description line is parsed as a second marker entry, so pytest now registers an unintended marker named pass --run-slow to include them, or directly \pytest -m slow` to run only the slow ones.instead of treating it as part ofslow's description. You can see this in pytest --markers, and it can confuse marker tooling (and future strict marker validation) because tests/pytest.inino longer defines markers as onename: description` per line.
Useful? React with 👍 / 👎.
|
ok to merge |
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough