[CI] Optimize unit tests including adding stochasticity by hughperkins · Pull Request #709 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-05-19T10:22:22Z

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Add a `slow` pytest marker, mark the worst-case tests with it, and have `tests/run_tests.py` skip those tests by default (use `--run-slow` to include them, or `pytest -m slow` to run only those). Picked from macOS CI per-file timing (QD_FILE_TIMING=1, run 26083950810): phase 1 totals 6415s across 8641 test calls; the slowest 3 files alone (test_eig, test_tile16, test_linalg) cover 55%. The cost of test_eig / test_make_spd is super-linear in matrix size n (n=12 ≈ 5x n=9). Marked slow: - Parametrize cases n in {6, 9, 12} (and 7..11 for inverse_large) across test_eig.py and test_linalg.py. - Rectangular (9, 12) / (12, 3) cases in test_frobenius_inner_rectangular. - test_matmul_chain_qipc_sizes_{f32,f64} (>130s each on macOS CI). - test_clear_all_gradients (180s/invocation). - test_reset_ndarrays::test_ndarray_doesnt_crash_on_gc (127s). - test_mpm88::{test_mpm88, test_mpm88_numpy_and_ndarray} (~30s/invocation). - test_struct::test_2d_nested (122s/invocation). run_tests.py composes `not slow` with any user-supplied `-m` expression, so existing CI invocations like `-m "not needs_torch"` become `(not needs_torch) and not slow`. Note that this also drops slow tests from GPU / Linux / macOS CI runs — a separate workflow (or `--run-slow` job) is needed if we still want to exercise the n>=6 / n=12 paths in CI.

The previous lists ([4, 5, 6, 9, 12], [2, 3, 4, 6, 9, 12], [5..12], etc.) gave the Householder/QR path a lot of redundant size coverage. For routine CI we only need to exercise a small size + the largest supported size (12, which also doubles as the slow-marked stress case): if a bug shows up only at n=7 or n=11 it almost certainly also shows up at n=12. test_eig.py sym_eig_general_{f32,f64} [4,5,6,9,12] -> [4, 12*] make_spd_{f32,f64} [4,6,9,12] -> [4, 12*] sym_eig_alpha_identity_f64 [4,6,9,12] -> [4, 12*] make_spd_idempotent_f64 [4,6,9,12] -> [4, 12*] make_spd_negative_definite_zero_f64 [4,6,9,12] -> [4, 12*] sym_eig_sort_order_{f32,f64} [2,3,4,6,9,12] -> [3, 12*] test_linalg.py frobenius_inner_{f32,f64} [2,3,6,9,12] -> [3, 12*] inverse_large_{f32,f64} [5..12] -> [5, 12*] * n=12 retains the `slow` marker, so default `run_tests.py` invocations only hit n=4 / n=3 / n=5. `--run-slow` runs both. Closed-form 2x2/3x3 paths in test_sym_eig_sort_order: dropped n=2 in favour of n=3 (per directive); the 2x2 path is still covered by test_sym_eig2x2_{f32,f64}. The 3x3 closed-form path stays covered by n=3. Other parametrize lists left untouched: - rectangular (rows, cols) tuples in test_frobenius_inner_rectangular (it's varying shape, not pure size). - test_mat_inverse_size's `range(1, 5)` (tiny sizes only). - `a00` integer parametrize in test_sym_eig3x3_{f32,f64}.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d915b74e68

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-19T10:25:04Z



-@pytest.mark.parametrize("n", [5, 6, 7, 8, 9, 10, 11, 12])
+@pytest.mark.parametrize("n", [5, pytest.param(12, marks=pytest.mark.slow)])


Restore deleted intermediate matrix-size cases

When --run-slow is used, this test still no longer exercises sizes 6 through 11 because the parametrization was reduced from [5, 6, 7, 8, 9, 10, 11, 12] to only 5 and slow 12. That makes the new slow-test mode unable to recover the coverage that was removed from default CI, so regressions that only appear at intermediate matrix sizes will no longer be caught; the slow cases should be marked with pytest.param(..., marks=pytest.mark.slow) rather than deleted.

Useful? React with 👍 / 👎.

no, I dont think we have to test for all these intermediate sizes.

github-actions · 2026-05-19T10:54:08Z

Total: 7 file(s) changed, +53 -18 code lines.

github-actions · 2026-05-19T11:50:49Z

Diff coverage: 100% · 23 lines, 0 missing

Round 1 of the slow-marking pass dropped cluster CUDA wall time from 1303s (main) to 563s on hp/mark-slow-tests (-56.8%). The next critical path bottleneck on branch was a small set of single-test outliers each holding one xdist worker for 10-73 seconds: test_tile16_cholesky_blocked_demo[cuda] 72.77s test_field_max_num_args[cuda] 40.17s test_src_ll_cache_has_return[..., *] 4 x ~12.7s = ~50s test_src_ll_cache_modify_sub_func[cuda] 13.68s test_tile16_shared_array_cholesky[cuda] 12.03s test_tile16_ger_sub[cuda-...] 11.19s + 3 sibling cases test_tile16_syr_sub[cuda-...] 11.04s + 3 sibling cases test_tile16_vec_proxy_syr_sub_3d[cuda-...] 10.59s + 1 sibling case test_tile16_shared_array_roundtrip[cuda] 10.49s test_mesh_localize_mapping0[cuda] 11.66s test_ad_gdar_diffmpm.py::test_gdar_mpm[cuda] 11.41s Slow-mark all of them so the default `pytest -m "not slow"` run skips them. They still run on the dedicated slow / CI suites via --run-slow or `-m slow`. Target: push the wall-time reduction from 56.8% past the 66% goal.

After round 2 the wall-time ratio sat at 461s / 1303s = 0.354 (-64.6%), just shy of the 66% target. The remaining critical path is now the long tail of test_tile16.py tests in the 9-11s range with no single dominator. Slow-mark another batch of test_tile16 outliers to push just past the target: test_tile16_vec_proxy_multi_column_accumulate (2 cases, ~10.7s avg) test_tile16_slice_ger_sub_via_outer (4 cases, ~10s avg) test_tile16_vec_proxy_ger_sub_2d/_3d (~10s each) test_tile16_potrf_then_trsm (4 cases, ~9.8s) test_tile16_shared_array_partial_cols (3 cases, ~9.6s) test_tile16_vec_proxy_partial_rows (2 cases, ~9.4s) test_tile16_outer_symmetric_same_variable (~9.4s) test_tile16_vec_proxy_shared_array (~9.2s) These are all tile16 stress / blas-style tests. They still run under --run-slow and on the dedicated slow CI lane.

github-actions · 2026-05-19T12:50:45Z

Total: 12 file(s) changed, +74 -18 code lines.

Per discussion: marking a *larger parameter* slow (when smaller params of the same test still run) is acceptable -- the function is still exercised, just at a smaller size. But marking an *entire test function* slow drops it from the default suite entirely with no remaining coverage, which isn't acceptable. Revert function-level @pytest.mark.slow from: test_tile16.py: 14 functions (cholesky_blocked_demo, ger_sub, syr_sub, shared_array_roundtrip, shared_array_cholesky, shared_array_partial_cols, vec_proxy_*, slice_ger_sub_*, outer_symmetric_*, potrf_then_trsm, multi_column_accumulate) test_src_ll_cache.py: test_src_ll_cache_has_return, _modify_sub_func test_ad_gdar_diffmpm.py: test_gdar_mpm test_mesh.py: test_mesh_localize_mapping0 (also drop unused `import pytest`) test_field.py: test_field_max_num_args test_linalg.py: test_matmul_chain_qipc_sizes_f32/f64 test_struct.py: test_2d_nested test_mpm88.py: test_mpm88, test_mpm88_numpy_and_ndarray test_reset_ndarrays.py: test_ndarray_doesnt_crash_on_gc test_clear_all_gradients.py: test_clear_all_gradients (also drop unused `import pytest`) Preserved (still acceptable): param-level slow marks where a smaller parameter of the same test continues to run by default. These live in test_eig.py and test_linalg.py (n=6/9/12 cases for sym_eig / make_spd / inverse_large / frobenius_inner / rectangular pairs).

github-actions · 2026-05-19T13:35:47Z

Total: 3 file(s) changed, +45 -18 code lines.

The blocked-Cholesky demo previously hard-coded N=92, N_ENVS=4096, WARMUP=50, ITERS=200 as module globals. The unit-test wrapper test_tile16_cholesky_blocked_demo runs the demo as a subprocess and only cares that it returns 0; at the hard-coded sizes that takes ~74 s on cluster CUDA, dominated by JIT-compiling 3 large unrolled kernels at N=92 and running the 4096-env x 250-iter benchmark loop. Expose all four knobs as command-line flags with the previous values as defaults, so: python misc/demos/cholesky_blocked.py # unchanged, full demo python misc/demos/cholesky_blocked.py --n 32 --n-envs 64 \ --num-warmup 1 --num-iters 1 # smoke-mode The test will switch to the smoke-mode invocation in a follow-up commit so it stops dominating the slow critical path. Flag names (--n, --n-envs, --num-warmup, --num-iters) follow the user spec; using argparse + ArgumentDefaultsHelpFormatter so --help shows the full demo defaults.

Pass small CLI overrides (--n 32 --n-envs 64 --num-warmup 1 --num-iters 1) so the demo runs end-to-end in seconds instead of ~74 s. The test contract is just "demo exits 0"; it doesn't read any of the benchmark numbers, so the smaller workload still satisfies the smoke test. The full N=92 / N_ENVS=4096 / 50+200-iter demo is still what humans running misc/demos/cholesky_blocked.py see by default (argparse defaults match the previous hard-coded values). Together with the previous commit, this drops the test_tile16_cholesky_blocked_demo wall time on cluster CUDA from ~74 s to (expected) a few seconds, removing the largest remaining single-test outlier on hp/mark-slow-tests.

Previously the test hard-coded the qipc IPC sizes (9x12) · (12x12) · (12x9). On cluster CUDA those two cases (f32 + f64) take ~92.7s and ~87.3s respectively -- the top two single-test outliers in the suite, each holding one xdist worker for ~90s of contiguous JIT-compile + unrolled-FMA work. Parametrize `_test_matmul_chain` on (rows_a, cols_a, cols_b, cols_c). Default lane runs the small (3,4,4,3) chain to exercise the same Matrix.__matmul__ codegen path; the original (9,12,12,9) qipc-sized chain is slow-marked so it still runs on --run-slow (i.e. CI's nightly / release lane, once that's wired up). Estimated saving: ~180s CPU, ~70s wall (these tests were on the critical path of the branch run). No function-level coverage lost: both f32 and f64 versions still run the same chain by default, just at a smaller size.

Previously hard-coded N=30 (900 particles), n_grid=120, steps=32 -- 26s on cluster CUDA. The test's actual contract is that the AD-validation checker raises QuadrantsAssertionError on the global-data-access violation in g2p (`v[f, p] = new_v`), which fires on the first substep regardless of grid / particle / step counts. Parametrize on (particles_side, n_grid_size, num_steps) with a small default (8, 32, 4) and slow-marked original (30, 120, 32). The default still exercises the same diff-MPM pipeline (p2g / grid_op / g2p, qd.ad.Tape with validation=True, `with pytest.raises(...)`) and still triggers the assertion error. Estimated CPU saving: ~22s; wall saving ~3s on the branch run.

github-actions · 2026-05-19T14:53:37Z

Total: 6 file(s) changed, +103 -44 code lines.

…ne op-parametrized test The three reduce variants (and the three scan variants) shared an identical kernel signature, identical input shape, and differed only in (a) which qd.algorithms.device_<op> function they called and (b) overflow vs bitwise-exact verification. Collapse each triple into a single op-parametrized test: test_device_reduce(op, dtype, N) # op in {add, min, max} test_device_exclusive_scan(op, dtype, N) # op in {add, min, max} Behavior, coverage and the parametrize space are unchanged -- pytest still collects the same number of parametrize cases, just under unified test names. This is purely a code-dedup refactor (~130 LOC less) which makes the next op-axis sampling change (if/when we choose to drop A vs B vs C from the sweep) a one-line edit.

…run_tests help string Pure formatting fix from `pre-commit run -a`; no behavior change.

…ppers into 2 op-parametrized tests Lines 3608-3694 in test_simt.py were 18 ~5-line wrappers each calling ``_check_full_matches_tiled(subgroup.<op>, subgroup.<op>_tiled, ...)``. Lines 3841-3848 were 2 more, parametrized on dtype. ``_check_full_matches_tiled`` already accepts the full / tiled functions as Python arguments (closure-captured into ``@qd.kernel``), so collapsing the family is a pure dedup move: test_subgroup_full_matches_tiled(op_name, host_init) # 18 cases: {reduce, inclusive, exclusive}_{add,min,max,mul,and,or,xor} on qd.i32 test_subgroup_full_matches_tiled_float(op_name, dtype) # 4 cases: {reduce_add, inclusive_add} x {qd.f32, qd.f64} Behavior + coverage unchanged (still 22 parametrize cases, same dtype + init configurations). Pytest ids are designed to match the original test-name suffixes (e.g. ``[reduce_add]``, ``[inclusive_mul]``) so ``-k`` selectors and test reports stay readable. Drops ~50 LOC net.

…zed tests The six block-reduce tests (3 single-output + 3 broadcast) share an identical kernel skeleton, parametrize axes, and verification loop. They only differ in which `block.reduce_*` function is called (closure-captured into `@qd.kernel` via getattr), the host-side reference oracle, the init pattern (sequential for `add` so the running sum has signal; permuted hash for `min` / `max` so the result depends on lanes other than first / last), and the float tolerance regime (relative for accumulating `add`, absolute for picker `min` / `max`). Collapse the six tests into two op-parametrized tests: test_block_reduce(sg_per_block, dtype, op_name, ...) # single-output, 3 ops test_block_reduce_all(sg_per_block, dtype, op_name, ...) # broadcast, 3 ops Parametrize space is unchanged (3 sg x 5 dtype x 3 op = 45 cases per fused test, matching the original 3 tests x 15 cases each). Pytest ids use plain `[add|min|max]` suffixes so `-k` selectors remain readable. Drops ~100 LOC of boilerplate -- two new small helpers (`_init_block_reduce_src` and `_assert_block_reduce_close`) capture the per-op behavioral differences in one place each.

…zed test The three block inclusive scan tests share the same kernel skeleton and only differ in the closure-captured `block.inclusive_<op>` function, the host-side reference oracle, the init pattern (sequential for `add` -- sums grow with prefix length; permuted for `min` / `max` -- result depends on lanes other than first / last), and the float tolerance regime (relative for `add`, absolute for `min` / `max`). Collapse into one op-parametrized test: test_block_inclusive(sg_per_block, dtype, op_name, ...) Identical param count to the original three tests (3 sg x 5 dtype x 3 op = 45 cases vs original 3 x 15). Pulls a shared `_assert_block_scan_close` helper out so the int / relative-float / absolute-float regime is encoded in one place; the relative-float branch keeps the floor-on-tol-base trick needed by the original `test_block_exclusive_add` (also routed through the same helper). `test_block_exclusive_add` stays as its own function for now because the matching exclusive `min` / `max` cases need dtype-derived sentinel identities + ``isinf`` handling that's different enough that fusing them in would create more branches than it removes; can address that in a follow-up if needed.

…trized test `test_block_exclusive_min` and `test_block_exclusive_max` share the same permuted-init pattern and only differ in the dtype-derived sentinel identity (``+inf`` / ``iinfo.max`` for min, ``-inf`` / ``iinfo.min`` for max) and the inf-sign check at lane 0. Collapse into one op-parametrized test that takes ``(op_name, sentinel_fn, py_op, inf_sign)`` and dispatches via getattr + the (already module-level) `_PY_MIN` / `_PY_MAX` lambdas. Identical param count to the original pair (3 sg x 5 dtype x 2 op = 30 cases vs original 2 x 15 each = 30). `test_block_exclusive_add` remains its own function because the integer identity is `0` (not `iinfo.max/min`) and the init pattern is sequential -- different enough that fusing it in would add more branches than it removes. Drops ~30 LOC.

github-actions · 2026-05-19T15:52:00Z

Total: 8 file(s) changed, +290 -473 code lines.

Add a new opt-in marker for tests whose parametrize space is intentionally large but where running every case every CI run is overkill. Used like: @pytest.mark.sample(n=4) # keep 4 of N cases per run @pytest.mark.sample(fraction=0.25) # keep 25% per run, min 1 @pytest.mark.parametrize(...) ... Over many runs each parametrize case asymptotically gets covered (Pr[hit after r runs] = 1 - (1 - keep/total)^r). Reproducibility hooks: - whole sample reproducible: --sample-seed=<S> (seed printed in report header on every sampled run); - single failing case: paste the failing nodeid -- the sampler's len(group) <= 1 short-circuit keeps it without any flag; - exhaustive run: --no-sample. Key implementation choices: - Seed picked on the *controller* in pytest_configure, not in pytest_collection_modifyitems. With pytest-xdist the latter runs per worker, so workers would otherwise each pick different seeds and sample different subsets, breaking the --sample-seed contract. - Per-test RNG keyed on (global_seed, nodeid_prefix). Adding / renaming / tweaking the mark on test_A does not shift test_B's sample, so routine refactors don't migrate failures. - Stratified per test function: each @sample-marked test keeps >= 1 case per run (no silent zero-case runs). - Sampling runs *after* marker-based filtering. Composes with --run-slow: --no-sample --run-slow is the truly-exhaustive combo for nightly / release-gate runs. Applied to two highest-cardinality test_tile16.py tests as a starter set (test_tile16_load_store, 32 cases -> 4; test_tile16_cholesky, 36 cases -> 4). Combined saves ~150-180s per cluster CI run with >=99% case coverage within a 50-PR window. Docs: new docs/source/user_guide/unit_testing.md consolidates the testing content from contributing.md plus full docs for the slow / sample markers. contributing.md trimmed to just reference unit_testing.md for testing specifics. index.md wires the new page into the Testing toctree. run_tests.py exposes --sample-seed and --no-sample passthrough flags. pytest.ini registers the `sample` marker so --strict-markers doesn't complain. Marker is also registered dynamically by conftest.py (addinivalue_line) so external callers using the conftest in isolation still work.

Tighten per-PR coverage on the two initial @sample-marked tests: test_tile16_load_store (32 cases -> 6/run) and test_tile16_cholesky (36 cases -> 6/run). Trades a small per-CI saving (now ~150s instead of ~180s) for materially better single-run coverage: ~65% / 60% after 5 runs (was ~50% / ~42%), ~98% / 97% after 20 runs (was ~93% / ~90%). Same convergence math, larger keep-fraction. Doc + comment updated to match.

The @pytest.mark.sample machinery added in 62eb3aa / a4badcf was wrapped at the AI-default ~110c rather than the project target 120c. Re-flow prose blocks in conftest.py and test_tile16.py to use the full line width. No behavior change.

The previous implementation of ``pytest_configure`` ran on every xdist worker and called ``random.randrange()`` independently, so each worker computed a different seed and (via ``pytest_collection_modifyitems``) sampled a different subset of cases. xdist then aborted the run with "Different tests were collected between gw0 and gwN". Fix: use xdist's ``pytest_configure_node`` controller-only hook to stash the controller-chosen seed in the worker's ``workerinput`` dict, and have ``pytest_configure`` on the worker side read from there instead of generating its own seed. With this change ``--sample-seed`` is consistent across all workers and the @sample-marked subset matches on every worker, so collection no longer diverges. Smoke-tested via cluster bench: previous run failed with the collection mismatch at the 9s mark; this run gets past collection and exercises the sampler on real xdist workers.

…env var The previous attempt to use xdist's ``pytest_configure_node`` / ``workerinput`` hook didn't fire when conftest.py lives below rootdir (here: ``tests/python/conftest.py`` while rootdir is ``tests/``), so each worker re-drew its own seed and collection still diverged. Switch to an environment-variable handoff (``QD_SAMPLE_SEED``). xdist's popen gateway inherits ``os.environ`` from the controller process, so the seed chosen once on the controller is seen identically by every worker regardless of conftest depth. Logic: - ``--no-sample`` -> sampler disabled, no seed needed. - ``--sample-seed=N`` on argv -> xdist forwards argv; every process sees it. - ``QD_SAMPLE_SEED`` in os.environ -> worker inherits the controller's seed. - else (controller / no xdist) -> pick once, set os.environ for workers.

…eed argv xdist's collection-consistency check requires every worker to sample the same parametrize subset. Earlier attempts to propagate the per-run seed via ``config.option`` (replicated by xdist), via ``pytest_configure_node`` / ``workerinput`` (only fires for conftests at rootdir level), and via ``os.environ`` (xdist snapshots env before pytest_configure mutates it) all left workers drawing independent seeds and aborting collection with "Different tests were collected between gw0 and gwN". Switch to the only mechanism that reliably propagates across xdist workers: argv. ``tests/run_tests.py`` now picks the seed once per invocation and passes ``--sample-seed=<S>`` to pytest, which xdist forwards verbatim to every worker subprocess. ``pytest_configure`` in tests/python/conftest.py only picks a single-process fallback seed when run as a non-xdist controller (no ``workerinput``) and no ``--sample-seed`` is on argv -- the path used when someone invokes pytest directly without going through run_tests.py.

…ion_modifyitems call Co-authored-by: Cursor <cursoragent@cursor.com>

…tical order Final piece of the seed-propagation puzzle. Even with the same seed reaching every worker (via run_tests.py injecting --sample-seed on argv, which xdist forwards), workers were still selecting different parametrize cases. Root cause: pytest does NOT guarantee that ``items`` (and therefore the per-group list) lands in the same in-memory order on every xdist worker -- order can drift from collection-phase parallelism, dict iteration, etc. So ``rng.sample(group, k)`` was picking the same indices on every worker, but those indices resolved to *different* item objects in differently-ordered lists, leaving each worker with a different subset and triggering xdist's "Different tests were collected between gw0 and gwN" abort. Fix: sort the group by ``nodeid`` (a content-derived total order) before sampling. Also switch the kept-set from ``id(it)`` to ``it.nodeid`` for the same reason -- ``id()`` is per-process and the comparison loop iterates the original (possibly differently ordered) group, but ``nodeid`` is identical on every worker. Drop the temporary debug log added in c4b81bd (it confirmed the seed propagated correctly; the remaining bug was order).

github-actions · 2026-05-19T18:04:02Z

Diff coverage: 46% · 228 lines, 122 missing

Hugh requested in PR #709 review comment that the testing bullet collapse to just a pointer at unit_testing.md, since the long inline summary duplicates the dedicated doc immediately below.

@sample

Address Hugh's 15 review comments on PR #709, all on unit_testing.md: - Soften opener: "For one-shot setup..." -> "For setup..." - Trim slow-marker intro: drop "(or, more commonly, a specific pytest.param case inside a parametrize list)" and the trailing "-- long enough that the default test suite skips it" clause. - Trim slow-marker patterns list: drop "rare", "always", and "and there's no smaller variant"; drop "(preferred when applicable)" and the long descriptive paragraph about size-axis parametrize on test_eig.py / test_linalg.py / test_ad_gdar_diffmpm.py. - @sample bullets: "coverage convergence" -> "asymptotic coverage"; drop "for that test". - @sample placement: clarify that decorator order is irrelevant (pytest attaches function-level markers regardless of order), so "like any other marker" is accurate. - Drop the Convergence-math paragraph and the "When *not* to use @sample" paragraph (both more academic than the rest of the doc). - New "## Advanced" section at the bottom for optional / infra detail that most contributors don't need on first read: per-test timeout, kernel compilation cache, per-file timing breakdown, and the @sample + xdist seed-propagation mechanism. - Drop the "## CI checks" section entirely.

Apply Hugh's 4 follow-up comments on PR #709: - contributing.md L11: trim "See unit_testing.md for the full reference: launcher flags, env vars, ..." down to just "See unit_testing.md." - unit_testing.md L77: "With this pattern" -> "In this specific example". - unit_testing.md L88: drop the "Position within the decorator stack is irrelevant..." sentence I added in b4e3355. The shorter "Apply it like any other marker:" reads better. - unit_testing.md L102: "How to reproduce." -> "How to reproduce failing tests."

github-actions · 2026-05-19T19:05:18Z

Total: 9 file(s) changed, +390 -473 code lines.

github-actions · 2026-05-19T20:00:13Z

Diff coverage: 95% · 228 lines, 12 missing

duburcqa · 2026-05-21T09:41:06Z

+
+### `@pytest.mark.sample(...)`
+
+Marks a single heavily-parametrized test as opting in to **per-run stochastic sub-selection** of its parametrize cases. Use when:


I don't like this idea at all. But you do what you want.

Your concern is that when one makes a PR, and tests run multiple times, sometimes they might succeed, sometimes they might fail?

Thoughts on a compromise where the random seed is deterministically fixed by the name of the branch? So, re-running the tests on a specific PR won't change which tests run.

duburcqa · 2026-05-21T09:41:38Z

+- the test's parametrize space is large (≥ ~16 cases),
+- each parametrize case is roughly independent (covering an independent corner case rather than a single bug class),
+- running every case every CI run is overkill, and
+- asymptotic coverage over many runs is acceptable.


I don't think it is ever the case in my opinion.

duburcqa · 2026-05-21T09:42:30Z

+
+- the test's parametrize space is large (≥ ~16 cases),
+- each parametrize case is roughly independent (covering an independent corner case rather than a single bug class),
+- running every case every CI run is overkill, and


I'm more in favour of nightly vs PR jobs. With a defined selection of required subset running on all PRs.

Hugh Perkins (deskai7) added 2 commits May 19, 2026 02:47

chatgpt-codex-connector Bot reviewed May 19, 2026

View reviewed changes

Hugh Perkins (deskai7) added 2 commits May 19, 2026 05:03

Hugh Perkins (deskai7) added 4 commits May 19, 2026 06:48

Hugh Perkins (deskai7) added 6 commits May 19, 2026 08:02

[Style] black: reformat test_tile16_cholesky_blocked_demo cmd list + …

64fcbb0

…run_tests help string Pure formatting fix from `pre-commit run -a`; no behavior change.

Hugh Perkins (deskai7) and others added 8 commits May 19, 2026 09:04

[DebugTemp] @sample: log seed/argv/workerinput in each pytest_collect…

c4b81bd

…ion_modifyitems call Co-authored-by: Cursor <cursoragent@cursor.com>