Skip to content

[CI] Optimize unit tests including adding stochasticity#709

Draft
hughperkins wants to merge 30 commits into
mainfrom
hp/mark-slow-tests
Draft

[CI] Optimize unit tests including adding stochasticity#709
hughperkins wants to merge 30 commits into
mainfrom
hp/mark-slow-tests

Conversation

@hughperkins
Copy link
Copy Markdown
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Hugh Perkins (deskai7) added 2 commits May 19, 2026 02:47
Add a `slow` pytest marker, mark the worst-case tests with it, and have
`tests/run_tests.py` skip those tests by default (use `--run-slow` to include
them, or `pytest -m slow` to run only those).

Picked from macOS CI per-file timing (QD_FILE_TIMING=1, run 26083950810):
phase 1 totals 6415s across 8641 test calls; the slowest 3 files alone
(test_eig, test_tile16, test_linalg) cover 55%. The cost of test_eig /
test_make_spd is super-linear in matrix size n (n=12 ≈ 5x n=9).

Marked slow:

  - Parametrize cases n in {6, 9, 12} (and 7..11 for inverse_large) across
    test_eig.py and test_linalg.py.
  - Rectangular (9, 12) / (12, 3) cases in test_frobenius_inner_rectangular.
  - test_matmul_chain_qipc_sizes_{f32,f64} (>130s each on macOS CI).
  - test_clear_all_gradients (180s/invocation).
  - test_reset_ndarrays::test_ndarray_doesnt_crash_on_gc (127s).
  - test_mpm88::{test_mpm88, test_mpm88_numpy_and_ndarray} (~30s/invocation).
  - test_struct::test_2d_nested (122s/invocation).

run_tests.py composes `not slow` with any user-supplied `-m` expression, so
existing CI invocations like `-m "not needs_torch"` become
`(not needs_torch) and not slow`. Note that this also drops slow tests from
GPU / Linux / macOS CI runs — a separate workflow (or `--run-slow` job) is
needed if we still want to exercise the n>=6 / n=12 paths in CI.
The previous lists ([4, 5, 6, 9, 12], [2, 3, 4, 6, 9, 12], [5..12], etc.) gave
the Householder/QR path a lot of redundant size coverage. For routine CI we
only need to exercise a small size + the largest supported size (12, which
also doubles as the slow-marked stress case): if a bug shows up only at
n=7 or n=11 it almost certainly also shows up at n=12.

  test_eig.py
    sym_eig_general_{f32,f64}             [4,5,6,9,12]     -> [4, 12*]
    make_spd_{f32,f64}                    [4,6,9,12]       -> [4, 12*]
    sym_eig_alpha_identity_f64            [4,6,9,12]       -> [4, 12*]
    make_spd_idempotent_f64               [4,6,9,12]       -> [4, 12*]
    make_spd_negative_definite_zero_f64   [4,6,9,12]       -> [4, 12*]
    sym_eig_sort_order_{f32,f64}          [2,3,4,6,9,12]   -> [3, 12*]
  test_linalg.py
    frobenius_inner_{f32,f64}             [2,3,6,9,12]     -> [3, 12*]
    inverse_large_{f32,f64}               [5..12]          -> [5, 12*]

* n=12 retains the `slow` marker, so default `run_tests.py` invocations only
  hit n=4 / n=3 / n=5. `--run-slow` runs both.

Closed-form 2x2/3x3 paths in test_sym_eig_sort_order: dropped n=2 in favour
of n=3 (per directive); the 2x2 path is still covered by
test_sym_eig2x2_{f32,f64}. The 3x3 closed-form path stays covered by n=3.

Other parametrize lists left untouched:
  - rectangular (rows, cols) tuples in test_frobenius_inner_rectangular (it's
    varying shape, not pure size).
  - test_mat_inverse_size's `range(1, 5)` (tiny sizes only).
  - `a00` integer parametrize in test_sym_eig3x3_{f32,f64}.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d915b74e68

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".



@pytest.mark.parametrize("n", [5, 6, 7, 8, 9, 10, 11, 12])
@pytest.mark.parametrize("n", [5, pytest.param(12, marks=pytest.mark.slow)])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore deleted intermediate matrix-size cases

When --run-slow is used, this test still no longer exercises sizes 6 through 11 because the parametrization was reduced from [5, 6, 7, 8, 9, 10, 11, 12] to only 5 and slow 12. That makes the new slow-test mode unable to recover the coverage that was removed from default CI, so regressions that only appear at intermediate matrix sizes will no longer be caught; the slow cases should be marked with pytest.param(..., marks=pytest.mark.slow) rather than deleted.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I dont think we have to test for all these intermediate sizes.

@github-actions
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown

Hugh Perkins (deskai7) added 2 commits May 19, 2026 05:03
Round 1 of the slow-marking pass dropped cluster CUDA wall time from
1303s (main) to 563s on hp/mark-slow-tests (-56.8%). The next critical
path bottleneck on branch was a small set of single-test outliers each
holding one xdist worker for 10-73 seconds:

  test_tile16_cholesky_blocked_demo[cuda]        72.77s
  test_field_max_num_args[cuda]                  40.17s
  test_src_ll_cache_has_return[..., *]      4 x ~12.7s = ~50s
  test_src_ll_cache_modify_sub_func[cuda]        13.68s
  test_tile16_shared_array_cholesky[cuda]        12.03s
  test_tile16_ger_sub[cuda-...]                  11.19s + 3 sibling cases
  test_tile16_syr_sub[cuda-...]                  11.04s + 3 sibling cases
  test_tile16_vec_proxy_syr_sub_3d[cuda-...]     10.59s + 1 sibling case
  test_tile16_shared_array_roundtrip[cuda]       10.49s
  test_mesh_localize_mapping0[cuda]              11.66s
  test_ad_gdar_diffmpm.py::test_gdar_mpm[cuda]   11.41s

Slow-mark all of them so the default `pytest -m "not slow"` run skips
them. They still run on the dedicated slow / CI suites via --run-slow
or `-m slow`. Target: push the wall-time reduction from 56.8% past the
66% goal.
After round 2 the wall-time ratio sat at 461s / 1303s = 0.354 (-64.6%),
just shy of the 66% target. The remaining critical path is now the
long tail of test_tile16.py tests in the 9-11s range with no single
dominator. Slow-mark another batch of test_tile16 outliers to push
just past the target:

  test_tile16_vec_proxy_multi_column_accumulate (2 cases, ~10.7s avg)
  test_tile16_slice_ger_sub_via_outer            (4 cases, ~10s avg)
  test_tile16_vec_proxy_ger_sub_2d/_3d           (~10s each)
  test_tile16_potrf_then_trsm                    (4 cases, ~9.8s)
  test_tile16_shared_array_partial_cols          (3 cases, ~9.6s)
  test_tile16_vec_proxy_partial_rows             (2 cases, ~9.4s)
  test_tile16_outer_symmetric_same_variable      (~9.4s)
  test_tile16_vec_proxy_shared_array             (~9.2s)

These are all tile16 stress / blas-style tests. They still run under
--run-slow and on the dedicated slow CI lane.
@github-actions
Copy link
Copy Markdown

Per discussion: marking a *larger parameter* slow (when smaller params
of the same test still run) is acceptable -- the function is still
exercised, just at a smaller size. But marking an *entire test
function* slow drops it from the default suite entirely with no
remaining coverage, which isn't acceptable.

Revert function-level @pytest.mark.slow from:

  test_tile16.py: 14 functions (cholesky_blocked_demo, ger_sub, syr_sub,
                  shared_array_roundtrip, shared_array_cholesky,
                  shared_array_partial_cols, vec_proxy_*, slice_ger_sub_*,
                  outer_symmetric_*, potrf_then_trsm, multi_column_accumulate)
  test_src_ll_cache.py: test_src_ll_cache_has_return, _modify_sub_func
  test_ad_gdar_diffmpm.py: test_gdar_mpm
  test_mesh.py: test_mesh_localize_mapping0  (also drop unused `import pytest`)
  test_field.py: test_field_max_num_args
  test_linalg.py: test_matmul_chain_qipc_sizes_f32/f64
  test_struct.py: test_2d_nested
  test_mpm88.py: test_mpm88, test_mpm88_numpy_and_ndarray
  test_reset_ndarrays.py: test_ndarray_doesnt_crash_on_gc
  test_clear_all_gradients.py: test_clear_all_gradients (also drop unused
                               `import pytest`)

Preserved (still acceptable): param-level slow marks where a smaller
parameter of the same test continues to run by default. These live in
test_eig.py and test_linalg.py (n=6/9/12 cases for sym_eig / make_spd /
inverse_large / frobenius_inner / rectangular pairs).
@github-actions
Copy link
Copy Markdown

Hugh Perkins (deskai7) added 4 commits May 19, 2026 06:48
The blocked-Cholesky demo previously hard-coded N=92, N_ENVS=4096,
WARMUP=50, ITERS=200 as module globals. The unit-test wrapper
test_tile16_cholesky_blocked_demo runs the demo as a subprocess and
only cares that it returns 0; at the hard-coded sizes that takes ~74 s
on cluster CUDA, dominated by JIT-compiling 3 large unrolled kernels
at N=92 and running the 4096-env x 250-iter benchmark loop.

Expose all four knobs as command-line flags with the previous values as
defaults, so:

    python misc/demos/cholesky_blocked.py                                # unchanged, full demo
    python misc/demos/cholesky_blocked.py --n 32 --n-envs 64 \
        --num-warmup 1 --num-iters 1                                    # smoke-mode

The test will switch to the smoke-mode invocation in a follow-up
commit so it stops dominating the slow critical path.

Flag names (--n, --n-envs, --num-warmup, --num-iters) follow the user
spec; using argparse + ArgumentDefaultsHelpFormatter so --help shows
the full demo defaults.
Pass small CLI overrides (--n 32 --n-envs 64 --num-warmup 1
--num-iters 1) so the demo runs end-to-end in seconds instead of ~74 s.
The test contract is just "demo exits 0"; it doesn't read any of the
benchmark numbers, so the smaller workload still satisfies the smoke
test.

The full N=92 / N_ENVS=4096 / 50+200-iter demo is still what humans
running misc/demos/cholesky_blocked.py see by default (argparse
defaults match the previous hard-coded values).

Together with the previous commit, this drops the
test_tile16_cholesky_blocked_demo wall time on cluster CUDA from
~74 s to (expected) a few seconds, removing the largest remaining
single-test outlier on hp/mark-slow-tests.
Previously the test hard-coded the qipc IPC sizes (9x12) · (12x12) ·
(12x9). On cluster CUDA those two cases (f32 + f64) take ~92.7s and
~87.3s respectively -- the top two single-test outliers in the suite,
each holding one xdist worker for ~90s of contiguous JIT-compile +
unrolled-FMA work.

Parametrize `_test_matmul_chain` on (rows_a, cols_a, cols_b, cols_c).
Default lane runs the small (3,4,4,3) chain to exercise the same
Matrix.__matmul__ codegen path; the original (9,12,12,9) qipc-sized
chain is slow-marked so it still runs on --run-slow (i.e. CI's nightly
/ release lane, once that's wired up).

Estimated saving: ~180s CPU, ~70s wall (these tests were on the
critical path of the branch run).

No function-level coverage lost: both f32 and f64 versions still run
the same chain by default, just at a smaller size.
Previously hard-coded N=30 (900 particles), n_grid=120, steps=32 -- 26s
on cluster CUDA. The test's actual contract is that the AD-validation
checker raises QuadrantsAssertionError on the global-data-access
violation in g2p (`v[f, p] = new_v`), which fires on the first substep
regardless of grid / particle / step counts.

Parametrize on (particles_side, n_grid_size, num_steps) with a small
default (8, 32, 4) and slow-marked original (30, 120, 32). The default
still exercises the same diff-MPM pipeline (p2g / grid_op / g2p,
qd.ad.Tape with validation=True, `with pytest.raises(...)`) and still
triggers the assertion error.

Estimated CPU saving: ~22s; wall saving ~3s on the branch run.
@github-actions
Copy link
Copy Markdown

Hugh Perkins (deskai7) added 6 commits May 19, 2026 08:02
…ne op-parametrized test

The three reduce variants (and the three scan variants) shared an identical
kernel signature, identical input shape, and differed only in (a) which
qd.algorithms.device_<op> function they called and (b) overflow vs
bitwise-exact verification. Collapse each triple into a single op-parametrized
test:

  test_device_reduce(op, dtype, N)            # op in {add, min, max}
  test_device_exclusive_scan(op, dtype, N)    # op in {add, min, max}

Behavior, coverage and the parametrize space are unchanged -- pytest still
collects the same number of parametrize cases, just under unified test names.
This is purely a code-dedup refactor (~130 LOC less) which makes the next
op-axis sampling change (if/when we choose to drop A vs B vs C from the
sweep) a one-line edit.
…run_tests help string

Pure formatting fix from `pre-commit run -a`; no behavior change.
…ppers into 2 op-parametrized tests

Lines 3608-3694 in test_simt.py were 18 ~5-line wrappers each calling
``_check_full_matches_tiled(subgroup.<op>, subgroup.<op>_tiled, ...)``.
Lines 3841-3848 were 2 more, parametrized on dtype. ``_check_full_matches_tiled``
already accepts the full / tiled functions as Python arguments (closure-captured
into ``@qd.kernel``), so collapsing the family is a pure dedup move:

  test_subgroup_full_matches_tiled(op_name, host_init)
      # 18 cases: {reduce, inclusive, exclusive}_{add,min,max,mul,and,or,xor} on qd.i32

  test_subgroup_full_matches_tiled_float(op_name, dtype)
      # 4 cases: {reduce_add, inclusive_add} x {qd.f32, qd.f64}

Behavior + coverage unchanged (still 22 parametrize cases, same dtype + init
configurations). Pytest ids are designed to match the original test-name
suffixes (e.g. ``[reduce_add]``, ``[inclusive_mul]``) so ``-k`` selectors and
test reports stay readable. Drops ~50 LOC net.
…zed tests

The six block-reduce tests (3 single-output + 3 broadcast) share an identical
kernel skeleton, parametrize axes, and verification loop. They only differ in
which `block.reduce_*` function is called (closure-captured into `@qd.kernel`
via getattr), the host-side reference oracle, the init pattern (sequential for
`add` so the running sum has signal; permuted hash for `min` / `max` so the
result depends on lanes other than first / last), and the float tolerance
regime (relative for accumulating `add`, absolute for picker `min` / `max`).
Collapse the six tests into two op-parametrized tests:

  test_block_reduce(sg_per_block, dtype, op_name, ...)        # single-output, 3 ops
  test_block_reduce_all(sg_per_block, dtype, op_name, ...)    # broadcast, 3 ops

Parametrize space is unchanged (3 sg x 5 dtype x 3 op = 45 cases per fused
test, matching the original 3 tests x 15 cases each). Pytest ids use plain
`[add|min|max]` suffixes so `-k` selectors remain readable. Drops ~100 LOC of
boilerplate -- two new small helpers (`_init_block_reduce_src` and
`_assert_block_reduce_close`) capture the per-op behavioral differences in one
place each.
…zed test

The three block inclusive scan tests share the same kernel skeleton and only
differ in the closure-captured `block.inclusive_<op>` function, the host-side
reference oracle, the init pattern (sequential for `add` -- sums grow with
prefix length; permuted for `min` / `max` -- result depends on lanes other
than first / last), and the float tolerance regime (relative for `add`,
absolute for `min` / `max`). Collapse into one op-parametrized test:

  test_block_inclusive(sg_per_block, dtype, op_name, ...)

Identical param count to the original three tests (3 sg x 5 dtype x 3 op =
45 cases vs original 3 x 15). Pulls a shared `_assert_block_scan_close`
helper out so the int / relative-float / absolute-float regime is encoded in
one place; the relative-float branch keeps the floor-on-tol-base trick
needed by the original `test_block_exclusive_add` (also routed through the
same helper). `test_block_exclusive_add` stays as its own function for now
because the matching exclusive `min` / `max` cases need dtype-derived
sentinel identities + ``isinf`` handling that's different enough that
fusing them in would create more branches than it removes; can address
that in a follow-up if needed.
…trized test

`test_block_exclusive_min` and `test_block_exclusive_max` share the same
permuted-init pattern and only differ in the dtype-derived sentinel identity
(``+inf`` / ``iinfo.max`` for min, ``-inf`` / ``iinfo.min`` for max) and the
inf-sign check at lane 0. Collapse into one op-parametrized test that takes
``(op_name, sentinel_fn, py_op, inf_sign)`` and dispatches via getattr +
the (already module-level) `_PY_MIN` / `_PY_MAX` lambdas.

Identical param count to the original pair (3 sg x 5 dtype x 2 op = 30 cases
vs original 2 x 15 each = 30). `test_block_exclusive_add` remains its own
function because the integer identity is `0` (not `iinfo.max/min`) and the
init pattern is sequential -- different enough that fusing it in would add
more branches than it removes. Drops ~30 LOC.
@github-actions
Copy link
Copy Markdown

Hugh Perkins (deskai7) and others added 8 commits May 19, 2026 09:04
Add a new opt-in marker for tests whose parametrize space is intentionally
large but where running every case every CI run is overkill. Used like:

    @pytest.mark.sample(n=4)             # keep 4 of N cases per run
    @pytest.mark.sample(fraction=0.25)   # keep 25% per run, min 1
    @pytest.mark.parametrize(...)
    ...

Over many runs each parametrize case asymptotically gets covered
(Pr[hit after r runs] = 1 - (1 - keep/total)^r). Reproducibility hooks:

  - whole sample reproducible: --sample-seed=<S> (seed printed in report
    header on every sampled run);
  - single failing case: paste the failing nodeid -- the sampler's
    len(group) <= 1 short-circuit keeps it without any flag;
  - exhaustive run: --no-sample.

Key implementation choices:

  - Seed picked on the *controller* in pytest_configure, not in
    pytest_collection_modifyitems. With pytest-xdist the latter runs per
    worker, so workers would otherwise each pick different seeds and
    sample different subsets, breaking the --sample-seed contract.
  - Per-test RNG keyed on (global_seed, nodeid_prefix). Adding / renaming /
    tweaking the mark on test_A does not shift test_B's sample, so
    routine refactors don't migrate failures.
  - Stratified per test function: each @sample-marked test keeps >= 1
    case per run (no silent zero-case runs).
  - Sampling runs *after* marker-based filtering. Composes with --run-slow:
    --no-sample --run-slow is the truly-exhaustive combo for nightly /
    release-gate runs.

Applied to two highest-cardinality test_tile16.py tests as a starter set
(test_tile16_load_store, 32 cases -> 4; test_tile16_cholesky, 36 cases ->
4). Combined saves ~150-180s per cluster CI run with >=99% case coverage
within a 50-PR window.

Docs: new docs/source/user_guide/unit_testing.md consolidates the testing
content from contributing.md plus full docs for the slow / sample markers.
contributing.md trimmed to just reference unit_testing.md for testing
specifics. index.md wires the new page into the Testing toctree.

run_tests.py exposes --sample-seed and --no-sample passthrough flags.
pytest.ini registers the `sample` marker so --strict-markers doesn't
complain. Marker is also registered dynamically by conftest.py
(addinivalue_line) so external callers using the conftest in isolation
still work.
Tighten per-PR coverage on the two initial @sample-marked tests:
test_tile16_load_store (32 cases -> 6/run) and test_tile16_cholesky
(36 cases -> 6/run). Trades a small per-CI saving (now ~150s instead of
~180s) for materially better single-run coverage: ~65% / 60% after 5
runs (was ~50% / ~42%), ~98% / 97% after 20 runs (was ~93% / ~90%).
Same convergence math, larger keep-fraction. Doc + comment updated to
match.
The @pytest.mark.sample machinery added in 62eb3aa / a4badcf was wrapped
at the AI-default ~110c rather than the project target 120c. Re-flow prose
blocks in conftest.py and test_tile16.py to use the full line width. No
behavior change.
The previous implementation of ``pytest_configure`` ran on every xdist worker
and called ``random.randrange()`` independently, so each worker computed a
different seed and (via ``pytest_collection_modifyitems``) sampled a different
subset of cases. xdist then aborted the run with "Different tests were collected
between gw0 and gwN".

Fix: use xdist's ``pytest_configure_node`` controller-only hook to stash the
controller-chosen seed in the worker's ``workerinput`` dict, and have
``pytest_configure`` on the worker side read from there instead of generating
its own seed. With this change ``--sample-seed`` is consistent across all
workers and the @sample-marked subset matches on every worker, so collection
no longer diverges.

Smoke-tested via cluster bench: previous run failed with the collection
mismatch at the 9s mark; this run gets past collection and exercises the
sampler on real xdist workers.
…env var

The previous attempt to use xdist's ``pytest_configure_node`` /
``workerinput`` hook didn't fire when conftest.py lives below rootdir
(here: ``tests/python/conftest.py`` while rootdir is ``tests/``), so each
worker re-drew its own seed and collection still diverged.

Switch to an environment-variable handoff (``QD_SAMPLE_SEED``). xdist's popen
gateway inherits ``os.environ`` from the controller process, so the seed
chosen once on the controller is seen identically by every worker regardless
of conftest depth. Logic:

  - ``--no-sample``                  -> sampler disabled, no seed needed.
  - ``--sample-seed=N`` on argv      -> xdist forwards argv; every process sees it.
  - ``QD_SAMPLE_SEED`` in os.environ -> worker inherits the controller's seed.
  - else (controller / no xdist)     -> pick once, set os.environ for workers.
…eed argv

xdist's collection-consistency check requires every worker to sample the same
parametrize subset. Earlier attempts to propagate the per-run seed via
``config.option`` (replicated by xdist), via ``pytest_configure_node`` /
``workerinput`` (only fires for conftests at rootdir level), and via
``os.environ`` (xdist snapshots env before pytest_configure mutates it) all
left workers drawing independent seeds and aborting collection with
"Different tests were collected between gw0 and gwN".

Switch to the only mechanism that reliably propagates across xdist workers:
argv. ``tests/run_tests.py`` now picks the seed once per invocation and passes
``--sample-seed=<S>`` to pytest, which xdist forwards verbatim to every worker
subprocess. ``pytest_configure`` in tests/python/conftest.py only picks a
single-process fallback seed when run as a non-xdist controller (no
``workerinput``) and no ``--sample-seed`` is on argv -- the path used when
someone invokes pytest directly without going through run_tests.py.
…ion_modifyitems call

Co-authored-by: Cursor <cursoragent@cursor.com>
…tical order

Final piece of the seed-propagation puzzle. Even with the same seed reaching
every worker (via run_tests.py injecting --sample-seed on argv, which xdist
forwards), workers were still selecting different parametrize cases.

Root cause: pytest does NOT guarantee that ``items`` (and therefore the
per-group list) lands in the same in-memory order on every xdist worker --
order can drift from collection-phase parallelism, dict iteration, etc. So
``rng.sample(group, k)`` was picking the same indices on every worker, but
those indices resolved to *different* item objects in differently-ordered
lists, leaving each worker with a different subset and triggering xdist's
"Different tests were collected between gw0 and gwN" abort.

Fix: sort the group by ``nodeid`` (a content-derived total order) before
sampling. Also switch the kept-set from ``id(it)`` to ``it.nodeid`` for the
same reason -- ``id()`` is per-process and the comparison loop iterates the
original (possibly differently ordered) group, but ``nodeid`` is identical
on every worker. Drop the temporary debug log added in c4b81bd (it
confirmed the seed propagated correctly; the remaining bug was order).
@github-actions
Copy link
Copy Markdown

Comment thread docs/source/user_guide/unit_testing.md Outdated
Hugh requested in PR #709 review comment that the testing bullet collapse
to just a pointer at unit_testing.md, since the long inline summary
duplicates the dedicated doc immediately below.
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Hugh Perkins (deskai7) and others added 2 commits May 19, 2026 11:22
Address Hugh's 15 review comments on PR #709, all on unit_testing.md:

- Soften opener: "For one-shot setup..." -> "For setup..."
- Trim slow-marker intro: drop "(or, more commonly, a specific
  pytest.param case inside a parametrize list)" and the trailing
  "-- long enough that the default test suite skips it" clause.
- Trim slow-marker patterns list: drop "rare", "always", and
  "and there's no smaller variant"; drop "(preferred when applicable)"
  and the long descriptive paragraph about size-axis parametrize on
  test_eig.py / test_linalg.py / test_ad_gdar_diffmpm.py.
- @sample bullets: "coverage convergence" -> "asymptotic coverage";
  drop "for that test".
- @sample placement: clarify that decorator order is irrelevant
  (pytest attaches function-level markers regardless of order), so
  "like any other marker" is accurate.
- Drop the Convergence-math paragraph and the "When *not* to use
  @sample" paragraph (both more academic than the rest of the doc).
- New "## Advanced" section at the bottom for optional / infra detail
  that most contributors don't need on first read: per-test timeout,
  kernel compilation cache, per-file timing breakdown, and the
  @sample + xdist seed-propagation mechanism.
- Drop the "## CI checks" section entirely.
Comment thread docs/source/user_guide/contributing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
Comment thread docs/source/user_guide/unit_testing.md Outdated
@hughperkins hughperkins changed the title [CI] Mark slow tests, to speed up CI [CI] Optimize unit tests May 19, 2026
Apply Hugh's 4 follow-up comments on PR #709:

- contributing.md L11: trim "See unit_testing.md for the full reference:
  launcher flags, env vars, ..." down to just "See unit_testing.md."
- unit_testing.md L77: "With this pattern" -> "In this specific example".
- unit_testing.md L88: drop the "Position within the decorator stack is
  irrelevant..." sentence I added in b4e3355. The shorter
  "Apply it like any other marker:" reads better.
- unit_testing.md L102: "How to reproduce." ->
  "How to reproduce failing tests."
@github-actions
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown


### `@pytest.mark.sample(...)`

Marks a single heavily-parametrized test as opting in to **per-run stochastic sub-selection** of its parametrize cases. Use when:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this idea at all. But you do what you want.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your concern is that when one makes a PR, and tests run multiple times, sometimes they might succeed, sometimes they might fail?

Thoughts on a compromise where the random seed is deterministically fixed by the name of the branch? So, re-running the tests on a specific PR won't change which tests run.

- the test's parametrize space is large (≥ ~16 cases),
- each parametrize case is roughly independent (covering an independent corner case rather than a single bug class),
- running every case every CI run is overkill, and
- asymptotic coverage over many runs is acceptable.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is ever the case in my opinion.


- the test's parametrize space is large (≥ ~16 cases),
- each parametrize case is roughly independent (covering an independent corner case rather than a single bug class),
- running every case every CI run is overkill, and
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm more in favour of nightly vs PR jobs. With a defined selection of required subset running on all PRs.

@hughperkins hughperkins changed the title [CI] Optimize unit tests [CI] Optimize unit tests including adding stochasticity May 21, 2026
@hughperkins hughperkins marked this pull request as draft May 21, 2026 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants