diff --git a/docs/source/user_guide/contributing.md b/docs/source/user_guide/contributing.md index ec97b9529f..8325303fba 100644 --- a/docs/source/user_guide/contributing.md +++ b/docs/source/user_guide/contributing.md @@ -2,25 +2,13 @@ ## Good practice reminder -* *testing*: Any new features or modified code should be tested. You have to run the test suite using `python tests/run_tests.py` which sets up the right test environment for `pytest`. CLI arguments are forwarded to `pytest`. Do not use `pytest` directly as it behaves differently. To see a per-file timing breakdown (useful for identifying slow test files), set `QD_FILE_TIMING=1` — e.g. `QD_FILE_TIMING=1 python tests/run_tests.py`. This is enabled by default in the Mac CI job and the results appear in the GitHub Actions job summary. +* *testing*: Any new features or modified code should be tested. see [unit_testing.md](unit_testing.md) * *format/linter*: Before pushing any commits, ensure you set up `pre-commit` and run it using `pre-commit run -a` * No need to force push to keep a clean history as the merging is eventually done by squashing commits. ## Running tests -Run the test suite with `python tests/run_tests.py`. CLI arguments are forwarded to pytest. For example, to run only Metal tests matching a keyword: - -``` -python tests/run_tests.py --arch metal -k "test_tile16_cholesky" -``` - -The target architecture can also be set via the `QD_WANTED_ARCHS` environment variable (comma-separated, e.g. `QD_WANTED_ARCHS=metal,vulkan`). - -### Kernel compilation cache - -During test runs, compiled kernels are cached to disk so that the same kernel is not recompiled after each `qd.reset()`/`qd.init()` cycle. - -A fresh, empty cache directory is created for each test session by pytest's [`tmp_path_factory`](https://docs.pytest.org/en/stable/how-to/tmp_path.html) (typically under `/tmp/pytest-of-/pytest-/qdcache0/`). Old session directories are cleaned up automatically by pytest's retention policy. This cache is separate from the user-facing `~/.cache/quadrants/` cache. +See [unit_testing.md](unit_testing.md). ## Creating your build/dev environment diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md index b648f97527..c824a270e7 100644 --- a/docs/source/user_guide/index.md +++ b/docs/source/user_guide/index.md @@ -82,6 +82,7 @@ init_options :maxdepth: 1 :titlesonly: +unit_testing kernel_coverage ``` diff --git a/docs/source/user_guide/unit_testing.md b/docs/source/user_guide/unit_testing.md new file mode 100644 index 0000000000..7ce8147e40 --- /dev/null +++ b/docs/source/user_guide/unit_testing.md @@ -0,0 +1,189 @@ +# Unit testing + +This page documents how to run, write, and tune the Quadrants Python unit test suite. For setup of the build / dev environment, see [contributing.md](contributing.md). + +## Running the tests + +The test suite is run via the project's launcher, **not** by invoking `pytest` directly: + +``` +python tests/run_tests.py +``` + +The launcher sets up the test-only env vars (kernel offline cache, watchdog, xdist worker count, etc.) and forwards any unrecognised flags to pytest. Calling `pytest` directly skips that setup and behaves differently. + +Common one-liners: + +``` +# run one file +python tests/run_tests.py test_tile16 + +# run one test (any pytest -k expression) +python tests/run_tests.py -k test_tile16_cholesky + +# run on a specific backend (or comma-separated list) +python tests/run_tests.py --arch cuda +python tests/run_tests.py --arch metal -k tile16 + +# same, via env var (handy for CI) +QD_WANTED_ARCHS=metal,vulkan python tests/run_tests.py + +# rerun the last failing tests first +python tests/run_tests.py -f + +# stop at the first failure +python tests/run_tests.py -x +``` + +The target architecture can also be set via `QD_WANTED_ARCHS` (comma-separated; supports `^arch` to exclude rather than include). + +## Markers + +Tests can opt into two project-specific markers, in addition to pytest's built-in ones (`skip`, `xfail`, etc.). + +### `@pytest.mark.slow` + +Marks a test as **slow**. `tests/run_tests.py` adds `-m "not slow"` to the pytest invocation by default; pass `--run-slow` to opt back in: + +``` +# default: skip slow +python tests/run_tests.py + +# include slow +python tests/run_tests.py --run-slow + +# slow ONLY (e.g. nightly job) +python tests/run_tests.py -m slow --run-slow +``` + +The marker is used in two patterns: + +1. **Whole-test slow**: the whole test takes a long time. + + ```python + @pytest.mark.slow + def test_thing_that_is_always_slow(): + ... + ``` + +2. **Slow-marked parametrize case**: + + ```python + @pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) + def test_sym_eig_general(n): + ... + ``` + + In this specific example the default suite still exercises the code path; the slow lane just adds the larger-size variant for full coverage. + +### `@pytest.mark.sample(...)` + +Marks a single heavily-parametrized test as opting in to **per-run stochastic sub-selection** of its parametrize cases. Use when: + +- the test's parametrize space is large (≥ ~16 cases), +- each parametrize case is roughly independent (covering an independent corner case rather than a single bug class), +- running every case every CI run is overkill, and +- asymptotic coverage over many runs is acceptable. + +Apply it like any other marker: + +```python +@pytest.mark.sample(n=6) # keep 6 of N cases per run +# OR +@pytest.mark.sample(fraction=0.25) # keep 25% of cases per run, min 1 +@pytest.mark.parametrize("size", [...]) +@pytest.mark.parametrize("dtype", [...]) +@pytest.mark.parametrize("layout", [...]) +@test_utils.test(arch=qd.gpu) +def test_thing(size, dtype, layout): + ... +``` + +**How to reproduce failing tests.** Three levels of reproducibility: + +1. **One failing case** — paste the failing nodeid from the CI log. Pytest already prints the full nodeid on failure: + + ``` + FAILED tests/python/test_tile16.py::test_tile16_load_store[arch=cuda-qd_dtype0-ndarray-16-32-4-8-7-11] + ``` + + Just rerun it directly: + + ``` + python tests/run_tests.py -k "test_tile16_load_store and ndarray-16-32-4-8-7-11" + # or, if you want the exact nodeid (bypasses -k matching): + pytest "tests/python/test_tile16.py::test_tile16_load_store[arch=cuda-qd_dtype0-ndarray-16-32-4-8-7-11]" + ``` + + When pytest narrows collection to a single nodeid, the sampler's `len(group) <= 1` short-circuit keeps it. **No `--sample-seed` flag needed.** + +2. **The exact subset of a failing run** — useful when several cases failed and you want to bisect or reproduce the whole sample locally. The report header of every run prints the seed used: + + ``` + sample-seed=1834729104 (reproduce the same sample: --sample-seed=1834729104; ...) + ``` + + Then locally: + + ``` + python tests/run_tests.py --sample-seed=1834729104 + ``` + +3. **Exhaustive run** — for release gates, coverage-debt audits, or a periodic "did anything regress in any branch of the parametrize space" sweep. Disables the sampler entirely; every `@sample`-marked test runs every parametrize case: + + ``` + python tests/run_tests.py --no-sample + ``` + +**Per-test RNG independence.** Each `@sample`-marked test's subsample is seeded from `(global_seed, test_nodeid_prefix)`, so adding / renaming / tweaking the mark on `test_A` does NOT shift the sample of `test_B`. Routine refactors don't cause samples to migrate file-wide. + +**Composition with `slow`.** Sampling runs **after** marker-based filtering. With `--run-slow` not passed (the default), slow-marked parametrize cases drop out first, then the sampler sub-selects from the remaining (fast) cases. The intersection is the right composition: `--no-sample --run-slow` is the truly-exhaustive combo. + +## Writing new tests + +The standard recipe combines `@test_utils.test(...)` (arch / option matrix) with `@pytest.mark.parametrize`: + +```python +import pytest +import quadrants as qd +from tests import test_utils + + +@pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) +@test_utils.test(arch=qd.gpu, default_fp=qd.f32) +def test_my_thing(n): + ... +``` + +`@test_utils.test` is what wires the test into the per-backend matrix and applies platform exclusions (`exclude=`), extension requirements (`require=`, e.g. `qd.extension.data64` for f64 tests), and per-test options (`default_fp`, `fast_math`, etc.). See `tests/test_utils.py` for the full surface. + +Common helpers in `tests/test_utils.py`: + +- `test_utils.skip_if_f64_unsupported(dtype)` — skip the current test at runtime if `dtype == qd.f64` and the active backend can't carry f64 through buffer I/O (Metal, MoltenVK on Darwin). Use inside a parametrized test that sweeps both f32 and f64. +- `test_utils.expected_archs()` — list of archs that the current `QD_WANTED_ARCHS` allows. Used to skip tests with no satisfiable arch. + +## Advanced + +Optional knobs and runtime details. The defaults work for most contributors. + +### Per-test timeout + +Per-test timeouts default to 600 s and are enforced by `pytest_hardtle`, a CFFI-compiled C watchdog that can kill tests hung in native GPU calls even when the GIL is held. + +### Kernel compilation cache + +During each test session the kernel compilation cache lives in a fresh, empty temp directory created by pytest's [`tmp_path_factory`](https://docs.pytest.org/en/stable/how-to/tmp_path.html) — typically `/tmp/pytest-of-/pytest-/qdcache0/`. Old session directories are cleaned up automatically by pytest's retention policy. This cache is separate from the user-facing `~/.cache/quadrants/` cache, and avoids recompiling identical kernels after each `qd.reset()` / `qd.init()` cycle within a session. + +### Per-file timing breakdown + +Set `QD_FILE_TIMING=1` to print a per-file duration summary at the end of the session: + +``` +QD_FILE_TIMING=1 python tests/run_tests.py +``` + +This is enabled by default in the Mac CI job; the results appear in the GitHub Actions job summary and are the primary tool for identifying slow test files. + +### `@sample` + xdist seed propagation + +`tests/run_tests.py` picks the per-run sample seed before pytest is launched and passes it via `--sample-seed=` on argv. xdist forwards argv to every worker, so all workers see the same seed and produce identical samples; without this, each worker would subsample independently and `--sample-seed=` wouldn't reproduce. The per-test RNG inside `pytest_collection_modifyitems` is then derived deterministically via `sha256(f"{seed}|{nodeid_prefix}")`, which is what makes the **Per-test RNG independence** property above hold. diff --git a/misc/demos/cholesky_blocked.py b/misc/demos/cholesky_blocked.py index 8dbcb3fbb9..3c72dd39fd 100644 --- a/misc/demos/cholesky_blocked.py +++ b/misc/demos/cholesky_blocked.py @@ -1,13 +1,14 @@ #!/usr/bin/env python3 -"""Benchmark 92x92 blocked Cholesky factorization using Tile16x16. +"""Benchmark NxN blocked Cholesky factorization using Tile16x16. Three kernels compared: 1. Baseline: scalar Cholesky-Crout, 64 threads, shared memory, 2*N+1 sequential syncs. Thread 0 computes each diagonal, remaining threads parallelize off-diagonal updates. -2. Blocked: 6x6 grid of 16x16 tiles, 16 threads, shared memory, scalar Crout for diagonal blocks. Same blocking - structure as Tile16x16 but all data lives in shared memory with block.sync() between every step. +2. Blocked: ceil(N/16) x ceil(N/16) grid of 16x16 tiles, 16 threads, shared memory, scalar Crout for diagonal + blocks. Same blocking structure as Tile16x16 but all data lives in shared memory with block.sync() between + every step. 3. Tile16x16: same blocked structure but fully register-resident via Tile16x16. No shared memory, zero syncs. Prior tiles read from global memory (L2). @@ -20,22 +21,37 @@ tile16 (Tile16x16, no shared memory) 16 533 5.19x Usage: - python misc/demos/cholesky_blocked.py + python misc/demos/cholesky_blocked.py [--n N] [--n-envs N_ENVS] [--num-warmup WARMUP] [--num-iters ITERS] """ +import argparse import time import numpy as np import quadrants as qd -N = 92 + +def _parse_args(): + p = argparse.ArgumentParser( + description="Blocked Cholesky NxN benchmark (3 kernels: baseline / blocked / tile16).", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, + ) + p.add_argument("--n", type=int, default=92, help="Matrix dimension N (NxN SPD).") + p.add_argument("--n-envs", type=int, default=4096, help="Number of independent environments.") + p.add_argument("--num-warmup", type=int, default=50, help="Warmup iterations per kernel.") + p.add_argument("--num-iters", type=int, default=200, help="Timed iterations per kernel.") + return p.parse_args() + + +_args = _parse_args() +N = _args.n TILE = 16 -N_BLOCKS = (N + TILE - 1) // TILE # 6 -N_PADDED = N_BLOCKS * TILE # 96, rounded up for blocked kernel SharedArrays -N_ENVS = 4096 -WARMUP = 50 -ITERS = 200 +N_BLOCKS = (N + TILE - 1) // TILE +N_PADDED = N_BLOCKS * TILE # rounded up for blocked kernel SharedArrays +N_ENVS = _args.n_envs +WARMUP = _args.num_warmup +ITERS = _args.num_iters qd.init(arch=qd.gpu) diff --git a/tests/pytest.ini b/tests/pytest.ini index 5ee5ec16b2..3fbc75158c 100644 --- a/tests/pytest.ini +++ b/tests/pytest.ini @@ -3,3 +3,9 @@ markers = run_in_serial: mark test to run serially(usually for resource intensive tests). sm70: Can only run on GPU with compute capability 7.0 or higher. needs_torch: mark test as requiring PyTorch. + slow: mark test (or parametrize case) as slow. Skipped by default by tests/run_tests.py; + pass --run-slow to include them, or directly `pytest -m slow` to run only the slow ones. + sample(fraction=None, n=None): per-test stochastic parametrize subsampling. Pass exactly one of + `fraction` (0..1) or `n` (>= 1). Implemented in tests/python/conftest.py. See + docs/source/user_guide/unit_testing.md for the reproducibility recipes (--sample-seed, + --no-sample, nodeid-paste). diff --git a/tests/python/conftest.py b/tests/python/conftest.py index 9e8f816a11..30f5e0fd89 100644 --- a/tests/python/conftest.py +++ b/tests/python/conftest.py @@ -1,5 +1,7 @@ import gc +import hashlib import os +import random import sys import time @@ -15,6 +17,159 @@ pytest_rerunfailures.works_with_current_xdist = lambda: True +# --------------------------------------------------------------------------- +# @pytest.mark.sample(...) -- per-test stochastic parametrize subsampling +# --------------------------------------------------------------------------- +# +# Some tests parametrize so widely (test_tile16_load_store, test_tile16_cholesky, ...) that running every case on every +# CI run is wasteful: the parametrize axes are intentionally varied to cover corner cases, but most runs would get the +# same signal from a small random subset. ``@pytest.mark.sample(n=...)`` or ``@pytest.mark.sample(fraction=...)`` opts a +# *single* test into per-run random sub-selection. Over many runs, each parametrize case asymptotically gets covered +# (Pr[hit after k runs] = 1 - (1 - keep/total)^k). +# +# Reproducibility hooks: +# - whole-suite: ``--sample-seed=`` reproduces the exact same trimmed set (header prints the seed used). +# - single failing case: paste the failing nodeid into ``pytest `` -- the sampler's ``len(group) <= 1`` +# short-circuit keeps it; no flags needed. +# - exhaustive run (release gate / coverage audit): ``--no-sample`` skips the sampler entirely. +# +# Per-test RNG keyed on ``(seed, nodeid_prefix)``: adding / renaming a @sample-marked test does NOT shift any other +# test's sample. Routine refactors don't migrate failures. + + +def pytest_addoption(parser): + parser.addoption( + "--sample-seed", + type=int, + default=None, + help="Seed for @pytest.mark.sample subsampling. If absent, a fresh seed is picked and printed " + "in the report header so a failing run can be reproduced via --sample-seed=.", + ) + parser.addoption( + "--no-sample", + action="store_true", + default=False, + help="Disable @pytest.mark.sample subsampling -- run every parametrize case of every marked test. " + "Use for exhaustive CI release gates / coverage-debt audits.", + ) + + +@pytest.hookimpl(tryfirst=True) +def pytest_configure(config): + # The marker is registered here (rather than only in pytest.ini) so callers that use + # `--strict-markers` don't blow up if they happen to import this conftest in isolation. + config.addinivalue_line( + "markers", + "sample(fraction=None, n=None): per-test stochastic parametrize subsampling. Pass exactly one of " + "`fraction` (0..1) or `n` (>= 1). Seed printed in report header; rerun the same sample with " + "--sample-seed=; rerun every case with --no-sample; rerun a single failing case by pasting its nodeid.", + ) + # Seed propagation contract: the seed must reach the controller AND every xdist worker as the same value, or + # xdist's collection-consistency check fails with "Different tests were collected between gw0 and gwN". argv is + # forwarded by xdist to every worker, so we require the seed to live on argv as ``--sample-seed=N``. ``tests/ + # run_tests.py`` picks a seed once per run and injects it; direct ``pytest`` invocations either pass + # ``--sample-seed`` explicitly (reproducibility) or fall back to a single-process seed picked below. We do NOT + # mutate ``os.environ`` here -- env-var inheritance into xdist worker subprocesses is not guaranteed for runtime + # mutations, only for vars present when pytest itself was launched. + if ( + not config.getoption("--no-sample") + and config.getoption("--sample-seed") is None + and not hasattr(config, "workerinput") # single-process / non-xdist controller only. + ): + config.option.sample_seed = random.randrange(0, 2**31) + + +def pytest_report_header(config): + if config.getoption("--no-sample"): + return "sample: --no-sample (every @sample-marked test runs every parametrize case)" + seed = config.getoption("--sample-seed") + if seed is None: + return None + return ( + f"sample-seed={seed} (reproduce the same sample: --sample-seed={seed}; " + f"reproduce a single failure: paste its nodeid; run every case: --no-sample)" + ) + + +def _sample_keep_count(mark, group_size, group_key): + """Resolve ``@pytest.mark.sample(fraction=..., n=...)`` for a group of ``group_size`` parametrize cases. + + Exactly one of ``fraction`` (0..1) or ``n`` (int >= 1) must be passed; ``UsageError`` otherwise. The result is + clamped to ``[1, group_size]`` so every @sample-marked test runs at least one case per run (no silent zero-case + runs even if e.g. ``fraction * group_size`` rounds to zero on a 1-case group). + """ + fraction = mark.kwargs.get("fraction") + n = mark.kwargs.get("n") + if (fraction is None) == (n is None): + raise pytest.UsageError( + f"@pytest.mark.sample on {group_key!r}: pass exactly one of `fraction` or `n`, got " + f"fraction={fraction!r}, n={n!r}" + ) + if fraction is not None: + return max(1, int(round(group_size * float(fraction)))) + return max(1, min(int(n), group_size)) + + +def pytest_collection_modifyitems(config, items): + if config.getoption("--no-sample"): + return + seed = config.getoption("--sample-seed") + if seed is None: + # Defensive: pytest_configure didn't run (e.g. someone imported this module manually). Nothing to do. + return + + # Group items by test function (strip the parametrize bracket suffix). Per-function stratification is what + # guarantees every @sample-marked test keeps at least one case per run -- uniform sampling across all items + # could otherwise drop a 2-case marked test entirely. + groups: dict[str, list] = {} + for item in items: + key = item.nodeid.split("[", 1)[0] + groups.setdefault(key, []).append(item) + + keep, deselected = [], [] + # ``sorted(groups)`` so the iteration order (and therefore any incidental RNG advance) is reproducible across + # Python versions / dict insertion orders. Per-test RNG is keyed below so this only matters for the (cheap) + # bookkeeping order. + for key in sorted(groups): + group = groups[key] + mark = group[0].get_closest_marker("sample") + if mark is None or len(group) <= 1: + # No sample mark -> every case runs. Also: a single-item group means either the test only had one + # parametrize case to begin with, or pytest narrowed collection to a specific nodeid -- both cases + # should run as-is. This is what makes "paste failing nodeid" work without --no-sample. + keep.extend(group) + continue + keep_n = _sample_keep_count(mark, len(group), key) + # Per-test RNG: keyed on (seed, key) so: + # - Independence: adding / renaming / tweaking the @sample mark on test_A does NOT shift the sample of test_B. + # Routine refactors don't cause failures to migrate file-wide. + # - Locality: when debugging, you can reason about one test's sample without simulating all the others' RNG + # advances. + # Seed mixing uses sha256 of a canonical ``f"{seed}|{key}"`` rather than ``random.Random((seed, key))``: tuple + # seeding goes through ``_sha512(repr(a).encode())`` in CPython 3.10+ which IS deterministic in principle but + # raises a ``DeprecationWarning: Seeding based on hashing is deprecated`` and is slated for removal. We pin to + # an explicit hash so the sample is reproducible across Python versions and not at the mercy of stdlib churn. + # CRITICAL: ``rng.sample(group_sorted, ...)`` rather than ``rng.sample(group, ...)``. xdist workers each run + # ``pytest_collection_modifyitems`` independently and pytest does NOT guarantee that ``items`` (and therefore + # ``group``) lands in the same in-memory order on every worker. With the same seed but a differently-ordered + # list, ``rng.sample`` would pick the same indices but those indices would resolve to different items, so + # workers would collect different subsets and xdist's collection-consistency check would abort the run with + # "Different tests were collected between gw0 and gwN". Sorting by ``nodeid`` (a content-derived total order) + # forces every worker to sample from an identical sequence. + group_sorted = sorted(group, key=lambda it: it.nodeid) + mixed = int.from_bytes(hashlib.sha256(f"{seed}|{key}".encode()).digest()[:8], "big") + rng = random.Random(mixed) + kept_nodeids = {it.nodeid for it in rng.sample(group_sorted, k=keep_n)} + for it in group: + (keep if it.nodeid in kept_nodeids else deselected).append(it) + + if deselected: + # ``pytest_deselected`` is the supported way to report filtered-out items so pytest's summary shows them as + # deselected (not silently dropped). xdist also forwards this to the controller correctly. + config.hook.pytest_deselected(items=deselected) + items[:] = keep + + @pytest.fixture(scope="session", autouse=True) def _offline_cache_dir(tmp_path_factory): """Enable the kernel compilation disk cache for the test session. diff --git a/tests/python/test_ad_gdar_diffmpm.py b/tests/python/test_ad_gdar_diffmpm.py index cd6bb32a04..8fd3c56d56 100644 --- a/tests/python/test_ad_gdar_diffmpm.py +++ b/tests/python/test_ad_gdar_diffmpm.py @@ -5,14 +5,25 @@ from tests import test_utils +# Defaults shrink particle / grid / steps counts so the JIT compile + AD-tape replay stays cheap; the slow-marked +# entry keeps the original (N=30, n_grid=120, steps=32) workload that runs on --run-slow. The point of the test is +# that the AD-validation checker fires on the global-data-access violation in g2p (`v[f, p] = new_v`), which happens +# on the first substep regardless of size. +@pytest.mark.parametrize( + "particles_side,n_grid_size,num_steps", + [ + (8, 32, 4), + pytest.param(30, 120, 32, marks=pytest.mark.slow), + ], +) @test_utils.test(require=qd.extension.assertion, debug=True) -def test_gdar_mpm(): +def test_gdar_mpm(particles_side, n_grid_size, num_steps): real = qd.f32 dim = 2 - N = 30 # reduce to 30 if run out of GPU memory + N = particles_side n_particles = N * N - n_grid = 120 + n_grid = n_grid_size dx = 1 / n_grid inv_dx = 1 / dx dt = 3e-4 @@ -21,8 +32,8 @@ def test_gdar_mpm(): E = 100 mu = E la = E - max_steps = 32 - steps = 32 + max_steps = num_steps + steps = num_steps gravity = 9.8 target = [0.3, 0.6] diff --git a/tests/python/test_algorithms.py b/tests/python/test_algorithms.py index e4b4ac9960..508732ce3b 100644 --- a/tests/python/test_algorithms.py +++ b/tests/python/test_algorithms.py @@ -320,86 +320,79 @@ def _rand_reduce_host(rng, dtype, N, *, bound=1000): return rng.integers(-bound, bound, size=N, dtype=np_dt) -@pytest.mark.parametrize("N", _REDUCE_SIZES) -@pytest.mark.parametrize("dtype", _REDUCE_DTYPES) -@test_utils.test(arch=qd.gpu) -def test_device_reduce_add(dtype, N): - """device_reduce_add matches numpy.sum across the full size sweep + dtype set.""" - _skip_if_dtype_unsupported(dtype) - inp, out = _alloc_input_out(dtype, N) - rng = np.random.default_rng(seed=1234) - host = _rand_reduce_host(rng, dtype, N) - _fill_field(inp, host) +_REDUCE_OPS = ["add", "min", "max"] - qd.algorithms.device_reduce_add(inp, out=out) - got = out.to_numpy()[0] +def _reduce_host(rng, op, dtype, N): + """Generate the test input for a reduce of `op` on `dtype` x N values. + + ``add`` uses small uniform / bounded values so float sums stay representable; ``min`` and ``max`` use a wider + range (-10..10 for floats, +-10000 for ints) since picking-an-element is bitwise-exact regardless of magnitude. + """ + if op == "add": + return _rand_reduce_host(rng, dtype, N) if _is_float(dtype): - expected = float(np.sum(host.astype(np.float64))) - rtol, atol = (_F32_REDUCE_RTOL, _F32_REDUCE_ATOL) if dtype == qd.f32 else (_F64_RTOL, _F64_ATOL) - assert math.isclose( - got, expected, rel_tol=rtol, abs_tol=atol - ), f"{dtype} reduce_add(N={N}): got {got}, expected {expected}" - else: - # Promote to Python int for an arbitrary-width reference; mask both sides to dtype width to handle the - # u32 / u64 mod-wrap case at large N. - mod = 1 << (32 if dtype in (qd.i32, qd.u32) else 64) if _is_unsigned(dtype) else None - ref = int( - np.sum(host.astype(np.int64 if dtype in (qd.i32, qd.u32) else (np.int64 if dtype == qd.i64 else np.uint64))) - ) # noqa: E501 - got_int = int(got) - if mod is not None: - ref &= mod - 1 - got_int &= mod - 1 - assert got_int == ref, f"{dtype} reduce_add(N={N}): got {got_int}, expected {ref}" + return rng.uniform(-10.0, 10.0, size=N).astype(_DTYPE_TO_NP[dtype]) + return _rand_reduce_host(rng, dtype, N, bound=10000) -@pytest.mark.parametrize("N", _REDUCE_SIZES) -@pytest.mark.parametrize("dtype", _REDUCE_DTYPES) -@test_utils.test(arch=qd.gpu) -def test_device_reduce_min(dtype, N): - """device_reduce_min(identity=type-positive-extreme) matches numpy.min.""" +def _check_reduce(op, dtype, N): + """Run ``device_reduce_(arr)`` and verify against ``numpy.(arr)``. + + ``add`` accumulates so it needs (a) wider integer promotion + mod-wrap masking for u32/u64 and (b) per-N float + tolerance. ``min`` / ``max`` pick one input element, so they're bitwise-exact for both ints and floats. + """ _skip_if_dtype_unsupported(dtype) inp, out = _alloc_input_out(dtype, N) rng = np.random.default_rng(seed=1234) - if _is_float(dtype): - host = rng.uniform(-10.0, 10.0, size=N).astype(_DTYPE_TO_NP[dtype]) - else: - host = _rand_reduce_host(rng, dtype, N, bound=10000) + host = _reduce_host(rng, op, dtype, N) _fill_field(inp, host) - qd.algorithms.device_reduce_min(inp, out=out) + qd_fn = getattr(qd.algorithms, f"device_reduce_{op}") + qd_fn(inp, out=out) got = out.to_numpy()[0] - expected = host.min() + if op == "add": + if _is_float(dtype): + expected = float(np.sum(host.astype(np.float64))) + rtol, atol = (_F32_REDUCE_RTOL, _F32_REDUCE_ATOL) if dtype == qd.f32 else (_F64_RTOL, _F64_ATOL) + assert math.isclose( + got, expected, rel_tol=rtol, abs_tol=atol + ), f"{dtype} reduce_add(N={N}): got {got}, expected {expected}" + else: + # Promote to Python int for an arbitrary-width reference; mask both sides to dtype width to handle the + # u32 / u64 mod-wrap case at large N. + mod = 1 << (32 if dtype in (qd.i32, qd.u32) else 64) if _is_unsigned(dtype) else None + ref = int( + np.sum( + host.astype(np.int64 if dtype in (qd.i32, qd.u32) else (np.int64 if dtype == qd.i64 else np.uint64)) + ) + ) # noqa: E501 + got_int = int(got) + if mod is not None: + ref &= mod - 1 + got_int &= mod - 1 + assert got_int == ref, f"{dtype} reduce_add(N={N}): got {got_int}, expected {ref}" + return + + expected = host.min() if op == "min" else host.max() if _is_float(dtype): assert got == pytest.approx(expected, abs=1e-6 if dtype == qd.f32 else 1e-12) else: - assert int(got) == int(expected), f"{dtype} reduce_min(N={N}): got {got}, expected {expected}" + assert int(got) == int(expected), f"{dtype} reduce_{op}(N={N}): got {got}, expected {expected}" +@pytest.mark.parametrize("op", _REDUCE_OPS) @pytest.mark.parametrize("N", _REDUCE_SIZES) @pytest.mark.parametrize("dtype", _REDUCE_DTYPES) @test_utils.test(arch=qd.gpu) -def test_device_reduce_max(dtype, N): - """device_reduce_max(identity=type-negative-extreme) matches numpy.max.""" - _skip_if_dtype_unsupported(dtype) - inp, out = _alloc_input_out(dtype, N) - rng = np.random.default_rng(seed=1234) - if _is_float(dtype): - host = rng.uniform(-10.0, 10.0, size=N).astype(_DTYPE_TO_NP[dtype]) - else: - host = _rand_reduce_host(rng, dtype, N, bound=10000) - _fill_field(inp, host) - - qd.algorithms.device_reduce_max(inp, out=out) - got = out.to_numpy()[0] - expected = host.max() +def test_device_reduce(op, dtype, N): + """``device_reduce_{add,min,max}`` match numpy across the full size sweep + dtype set. - if _is_float(dtype): - assert got == pytest.approx(expected, abs=1e-6 if dtype == qd.f32 else 1e-12) - else: - assert int(got) == int(expected), f"{dtype} reduce_max(N={N}): got {got}, expected {expected}" + Unified across the three op variants. ``add`` accumulates so it needs overflow / precision-aware comparison; + ``min`` / ``max`` pick one element of the input and are bitwise-exact. + """ + _check_reduce(op, dtype, N) @test_utils.test(arch=qd.gpu) @@ -454,101 +447,80 @@ def _scan_dtype_mask(dtype): return -1 -@pytest.mark.parametrize("N", _SCAN_SIZES) -@pytest.mark.parametrize("dtype", _SCAN_DTYPES) -@test_utils.test(arch=qd.gpu) -def test_device_exclusive_scan_add(dtype, N): - """device_exclusive_scan_add(out[i] = sum(arr[0:i])) matches numpy.cumsum-shifted across the full 6-dtype set.""" - _skip_if_dtype_unsupported(dtype) - inp, out = _alloc_scan_input_out(dtype, N) - rng = np.random.default_rng(seed=1234) - host = _rand_reduce_host(rng, dtype, N, bound=100) - _fill_field(inp, host) +_SCAN_OPS = ["add", "min", "max"] - qd.algorithms.device_exclusive_scan_add(inp, out=out) - got = out.to_numpy() +def _scan_host(rng, op, dtype, N): + """Generate the test input for a scan of `op` on `dtype` x N values. Same rationale as ``_reduce_host``.""" + if op == "add": + return _rand_reduce_host(rng, dtype, N, bound=100) if _is_float(dtype): - ref = np.concatenate([[0.0], np.cumsum(host.astype(np.float64))[:-1]]) - rtol, atol = _f32_scan_tol(N) if dtype == qd.f32 else (_F64_RTOL, _F64_ATOL) - np.testing.assert_allclose( - got.astype(np.float64), - ref, - rtol=rtol, - atol=atol, - err_msg=f"{dtype} scan_add(N={N})", - ) - else: - # Promote to a width that survives the cumulative sum: u64 / i64 inputs use a Python int reference; smaller - # ints can still use int64. - promote = np.int64 if dtype in (qd.i32, qd.u32, qd.i64) else np.uint64 - host_wide = host.astype(promote) - ref = np.concatenate([[promote(0)], np.cumsum(host_wide)[:-1]]).astype(promote) - mask = _scan_dtype_mask(dtype) - got_view = got.astype(np.int64 if dtype != qd.u64 else np.uint64) - if mask != -1: - got_view = got_view & promote(mask) - ref = ref & promote(mask) - np.testing.assert_array_equal( - got_view, - ref, - err_msg=f"{dtype} scan_add(N={N})", - ) + return rng.uniform(-10.0, 10.0, size=N).astype(_DTYPE_TO_NP[dtype]) + return _rand_reduce_host(rng, dtype, N, bound=10000) -@pytest.mark.parametrize("N", _SCAN_SIZES) -@pytest.mark.parametrize("dtype", _SCAN_DTYPES) -@test_utils.test(arch=qd.gpu) -def test_device_exclusive_scan_min(dtype, N): - """device_exclusive_scan_min(out[i] = min(arr[0:i])) matches numpy.minimum.accumulate-shifted across the full - 6-dtype set.""" +def _check_scan(op, dtype, N): + """Run ``device_exclusive_scan_(arr)`` and verify against ``numpy..accumulate``-shifted. + + Like the reduce family, ``add`` accumulates (overflow / precision care) while ``min`` / ``max`` are + bitwise-exact in both float and int paths. + """ _skip_if_dtype_unsupported(dtype) inp, out = _alloc_scan_input_out(dtype, N) rng = np.random.default_rng(seed=1234) np_dt = _DTYPE_TO_NP[dtype] - if _is_float(dtype): - host = rng.uniform(-10.0, 10.0, size=N).astype(np_dt) - else: - host = _rand_reduce_host(rng, dtype, N, bound=10000) + host = _scan_host(rng, op, dtype, N) _fill_field(inp, host) - qd.algorithms.device_exclusive_scan_min(inp, out=out) + qd_fn = getattr(qd.algorithms, f"device_exclusive_scan_{op}") + qd_fn(inp, out=out) got = out.to_numpy() + if op == "add": + if _is_float(dtype): + ref = np.concatenate([[0.0], np.cumsum(host.astype(np.float64))[:-1]]) + rtol, atol = _f32_scan_tol(N) if dtype == qd.f32 else (_F64_RTOL, _F64_ATOL) + np.testing.assert_allclose( + got.astype(np.float64), + ref, + rtol=rtol, + atol=atol, + err_msg=f"{dtype} scan_add(N={N})", + ) + else: + # Promote to a width that survives the cumulative sum: u64 / i64 inputs use a Python int reference; + # smaller ints can still use int64. + promote = np.int64 if dtype in (qd.i32, qd.u32, qd.i64) else np.uint64 + host_wide = host.astype(promote) + ref = np.concatenate([[promote(0)], np.cumsum(host_wide)[:-1]]).astype(promote) + mask = _scan_dtype_mask(dtype) + got_view = got.astype(np.int64 if dtype != qd.u64 else np.uint64) + if mask != -1: + got_view = got_view & promote(mask) + ref = ref & promote(mask) + np.testing.assert_array_equal(got_view, ref, err_msg=f"{dtype} scan_add(N={N})") + return + + np_accum = np.minimum.accumulate if op == "min" else np.maximum.accumulate + identity_table = _MIN_IDENTITY if op == "min" else _MAX_IDENTITY if _is_float(dtype): - ref = np.concatenate([[float("inf")], np.minimum.accumulate(host.astype(np.float64))[:-1]]).astype(np_dt) - atol = 0 if dtype == qd.f32 else 0 # min is bitwise-exact for monotone ops on float - np.testing.assert_allclose(got, ref, rtol=0, atol=atol, err_msg=f"{dtype} scan_min(N={N})") + identity = float("inf") if op == "min" else float("-inf") + ref = np.concatenate([[identity], np_accum(host.astype(np.float64))[:-1]]).astype(np_dt) + np.testing.assert_allclose(got, ref, rtol=0, atol=0, err_msg=f"{dtype} scan_{op}(N={N})") else: - ref = np.concatenate([[np_dt(_MIN_IDENTITY[dtype])], np.minimum.accumulate(host)[:-1]]).astype(np_dt) - np.testing.assert_array_equal(got, ref, err_msg=f"{dtype} scan_min(N={N})") + ref = np.concatenate([[np_dt(identity_table[dtype])], np_accum(host)[:-1]]).astype(np_dt) + np.testing.assert_array_equal(got, ref, err_msg=f"{dtype} scan_{op}(N={N})") +@pytest.mark.parametrize("op", _SCAN_OPS) @pytest.mark.parametrize("N", _SCAN_SIZES) @pytest.mark.parametrize("dtype", _SCAN_DTYPES) @test_utils.test(arch=qd.gpu) -def test_device_exclusive_scan_max(dtype, N): - """device_exclusive_scan_max(out[i] = max(arr[0:i])) matches numpy.maximum.accumulate-shifted across the full - 6-dtype set.""" - _skip_if_dtype_unsupported(dtype) - inp, out = _alloc_scan_input_out(dtype, N) - rng = np.random.default_rng(seed=1234) - np_dt = _DTYPE_TO_NP[dtype] - if _is_float(dtype): - host = rng.uniform(-10.0, 10.0, size=N).astype(np_dt) - else: - host = _rand_reduce_host(rng, dtype, N, bound=10000) - _fill_field(inp, host) - - qd.algorithms.device_exclusive_scan_max(inp, out=out) - got = out.to_numpy() - - if _is_float(dtype): - ref = np.concatenate([[float("-inf")], np.maximum.accumulate(host.astype(np.float64))[:-1]]).astype(np_dt) - np.testing.assert_allclose(got, ref, rtol=0, atol=0, err_msg=f"{dtype} scan_max(N={N})") - else: - ref = np.concatenate([[np_dt(_MAX_IDENTITY[dtype])], np.maximum.accumulate(host)[:-1]]).astype(np_dt) - np.testing.assert_array_equal(got, ref, err_msg=f"{dtype} scan_max(N={N})") +def test_device_exclusive_scan(op, dtype, N): + """``device_exclusive_scan_{add,min,max}`` match ``numpy.{cumsum, minimum.accumulate, maximum.accumulate}``-shifted + across the full size sweep + dtype set. Unified across the three op variants; same overflow vs bitwise-exact + handling as the reduce family.""" + _check_scan(op, dtype, N) @test_utils.test(arch=qd.gpu) diff --git a/tests/python/test_eig.py b/tests/python/test_eig.py index 53647a6eef..a8b5153dd6 100644 --- a/tests/python/test_eig.py +++ b/tests/python/test_eig.py @@ -295,7 +295,7 @@ def run(): np.testing.assert_allclose(A_reconstructed, A_np, rtol=tol, atol=tol) -@pytest.mark.parametrize("n", [4, 5, 6, 9, 12]) +@pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) @pytest.mark.parametrize( "factory", [ @@ -311,7 +311,7 @@ def test_sym_eig_general_f32(n, factory): _test_sym_eig_general(n, qd.f32, factory) -@pytest.mark.parametrize("n", [4, 5, 6, 9, 12]) +@pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) @pytest.mark.parametrize( "factory", [ @@ -358,7 +358,7 @@ def run(): np.testing.assert_allclose(A_spd_qd, expected, rtol=tol, atol=tol) -@pytest.mark.parametrize("n", [4, 6, 9, 12]) +@pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) @pytest.mark.parametrize( "factory", [_sym_eig_factory_indefinite, _sym_eig_factory_random, _sym_eig_factory_spd], @@ -368,7 +368,7 @@ def test_make_spd_f32(n, factory): _test_make_spd(n, qd.f32, factory) -@pytest.mark.parametrize("n", [4, 6, 9, 12]) +@pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) @pytest.mark.parametrize( "factory", [_sym_eig_factory_indefinite, _sym_eig_factory_random, _sym_eig_factory_spd], @@ -404,7 +404,7 @@ def run(): np.testing.assert_allclose(Q.T @ Q, np.eye(n), rtol=tol, atol=tol) -@pytest.mark.parametrize("n", [4, 6, 9, 12]) +@pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) @pytest.mark.parametrize("alpha", [0.0, 1.0, -2.5]) @test_utils.test(require=qd.extension.data64, arch=qd.gpu, default_fp=qd.f64, fast_math=False) def test_sym_eig_alpha_identity_f64(n, alpha): @@ -445,7 +445,7 @@ def project(src: qd.types.NDArray[mat_t, 1], dst: qd.types.NDArray[mat_t, 1]): ) -@pytest.mark.parametrize("n", [4, 6, 9, 12]) +@pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) @pytest.mark.parametrize( "factory", [_sym_eig_factory_indefinite, _sym_eig_factory_negative_definite, _sym_eig_factory_spd], @@ -455,7 +455,7 @@ def test_make_spd_idempotent_f64(n, factory): _test_make_spd_idempotent(n, qd.f64, factory) -@pytest.mark.parametrize("n", [4, 6, 9, 12]) +@pytest.mark.parametrize("n", [4, pytest.param(12, marks=pytest.mark.slow)]) @test_utils.test(require=qd.extension.data64, arch=qd.gpu, default_fp=qd.f64, fast_math=False) def test_make_spd_negative_definite_zero_f64(n): """A symmetric matrix with all-negative eigenvalues projects to the zero matrix (``Q · diag(max(λ, 0)) · Qᵀ`` @@ -535,13 +535,13 @@ def run(): ), f"column {i} is not the eigenvector of eigvals[{i}]={eigvals_qd[i]}: residual={residual}" -@pytest.mark.parametrize("n", [2, 3, 4, 6, 9, 12]) +@pytest.mark.parametrize("n", [3, pytest.param(12, marks=pytest.mark.slow)]) @test_utils.test(arch=qd.gpu, default_fp=qd.f32, fast_math=False) def test_sym_eig_sort_order_f32(n): _test_sym_eig_sort_order(n, qd.f32) -@pytest.mark.parametrize("n", [2, 3, 4, 6, 9, 12]) +@pytest.mark.parametrize("n", [3, pytest.param(12, marks=pytest.mark.slow)]) @test_utils.test(require=qd.extension.data64, arch=qd.gpu, default_fp=qd.f64, fast_math=False) def test_sym_eig_sort_order_f64(n): _test_sym_eig_sort_order(n, qd.f64) diff --git a/tests/python/test_linalg.py b/tests/python/test_linalg.py index dfa31495bc..59925ee2ce 100644 --- a/tests/python/test_linalg.py +++ b/tests/python/test_linalg.py @@ -154,13 +154,13 @@ def run(): assert out_self[None] == test_utils.approx(A.to_numpy().__pow__(2).sum(), rel=tol, abs=tol) -@pytest.mark.parametrize("n", [2, 3, 6, 9, 12]) +@pytest.mark.parametrize("n", [3, pytest.param(12, marks=pytest.mark.slow)]) @test_utils.test(arch=qd.gpu, default_fp=qd.f32, fast_math=False) def test_frobenius_inner_f32(n): _test_frobenius_inner(n, qd.f32) -@pytest.mark.parametrize("n", [2, 3, 6, 9, 12]) +@pytest.mark.parametrize("n", [3, pytest.param(12, marks=pytest.mark.slow)]) @test_utils.test(require=qd.extension.data64, arch=qd.gpu, default_fp=qd.f64, fast_math=False) def test_frobenius_inner_f64(n): _test_frobenius_inner(n, qd.f64) @@ -189,36 +189,52 @@ def run(): assert out[None] == test_utils.approx(expected, rel=tol, abs=tol) -@pytest.mark.parametrize("rows,cols", [(9, 12), (12, 3), (2, 4)]) +@pytest.mark.parametrize( + "rows,cols", + [ + pytest.param(9, 12, marks=pytest.mark.slow), + pytest.param(12, 3, marks=pytest.mark.slow), + (2, 4), + ], +) @test_utils.test(arch=qd.gpu, default_fp=qd.f32, fast_math=False) def test_frobenius_inner_rectangular_f32(rows, cols): _test_frobenius_inner_rectangular(rows, cols, qd.f32) -@pytest.mark.parametrize("rows,cols", [(9, 12), (12, 3), (2, 4)]) +@pytest.mark.parametrize( + "rows,cols", + [ + pytest.param(9, 12, marks=pytest.mark.slow), + pytest.param(12, 3, marks=pytest.mark.slow), + (2, 4), + ], +) @test_utils.test(require=qd.extension.data64, arch=qd.gpu, default_fp=qd.f64, fast_math=False) def test_frobenius_inner_rectangular_f64(rows, cols): _test_frobenius_inner_rectangular(rows, cols, qd.f64) -def _test_matmul_chain(dt): - """3-way matmul chain at qipc IPC sizes: (9×12) · (12×12) · (12×9) → (9×9). +def _test_matmul_chain(rows_a, cols_a, cols_b, cols_c, dt): + """3-way matmul chain: ``(rows_a × cols_a) · (cols_a × cols_b) · (cols_b × cols_c) → (rows_a × cols_c)``. - Verifies that ``Matrix.__matmul__`` compiles and is numerically correct at the largest size qipc needs. Quadrants - imposes no enforced size cap on matmul, but the unrolled `static(range)` triple loop produces ~1296 FMAs per - intermediate, so this test catches compile-time blow-up or back-end miscompiles at large sizes. + Verifies that ``Matrix.__matmul__`` compiles and is numerically correct at the requested size. Quadrants + imposes no enforced size cap on matmul, but the unrolled `static(range)` triple loop produces + ``rows_a * cols_a * cols_b + rows_a * cols_b * cols_c`` FMAs per kernel call, so this test catches compile-time + blow-up or back-end miscompiles at large sizes. The largest parametrize value is the chain qipc actually uses; + smaller values are cheap sanity checks that the same code path still works. """ np_dt = np.float32 if dt == qd.f32 else np.float64 - A_np = np.random.default_rng(0xCA70).standard_normal((9, 12)).astype(np_dt) - B_np = np.random.default_rng(0xCA71).standard_normal((12, 12)).astype(np_dt) - C_np = np.random.default_rng(0xCA72).standard_normal((12, 9)).astype(np_dt) + A_np = np.random.default_rng(0xCA70).standard_normal((rows_a, cols_a)).astype(np_dt) + B_np = np.random.default_rng(0xCA71).standard_normal((cols_a, cols_b)).astype(np_dt) + C_np = np.random.default_rng(0xCA72).standard_normal((cols_b, cols_c)).astype(np_dt) - A = qd.Matrix.field(9, 12, dtype=dt, shape=()) - B = qd.Matrix.field(12, 12, dtype=dt, shape=()) - C = qd.Matrix.field(12, 9, dtype=dt, shape=()) - AB = qd.Matrix.field(9, 12, dtype=dt, shape=()) - ABC_chained = qd.Matrix.field(9, 9, dtype=dt, shape=()) - ABC_staged = qd.Matrix.field(9, 9, dtype=dt, shape=()) + A = qd.Matrix.field(rows_a, cols_a, dtype=dt, shape=()) + B = qd.Matrix.field(cols_a, cols_b, dtype=dt, shape=()) + C = qd.Matrix.field(cols_b, cols_c, dtype=dt, shape=()) + AB = qd.Matrix.field(rows_a, cols_b, dtype=dt, shape=()) + ABC_chained = qd.Matrix.field(rows_a, cols_c, dtype=dt, shape=()) + ABC_staged = qd.Matrix.field(rows_a, cols_c, dtype=dt, shape=()) A.from_numpy(A_np) B.from_numpy(B_np) @@ -241,14 +257,25 @@ def run(): np.testing.assert_allclose(ABC_chained.to_numpy(), ABC_staged.to_numpy(), rtol=tol, atol=tol) +# qipc's actual size is (9,12,12,9) -- the largest chain it instantiates. We also keep a tiny (3,4,4,3) chain so +# the default fast lane still exercises the same Matrix.__matmul__ codegen path without paying the ~90s/case +# CUDA JIT cost of the qipc-sized chain. +_MATMUL_CHAIN_SHAPES = [ + (3, 4, 4, 3), + pytest.param(9, 12, 12, 9, marks=pytest.mark.slow), +] + + +@pytest.mark.parametrize("rows_a,cols_a,cols_b,cols_c", _MATMUL_CHAIN_SHAPES) @test_utils.test(arch=qd.gpu, default_fp=qd.f32, fast_math=False) -def test_matmul_chain_qipc_sizes_f32(): - _test_matmul_chain(qd.f32) +def test_matmul_chain_qipc_sizes_f32(rows_a, cols_a, cols_b, cols_c): + _test_matmul_chain(rows_a, cols_a, cols_b, cols_c, qd.f32) +@pytest.mark.parametrize("rows_a,cols_a,cols_b,cols_c", _MATMUL_CHAIN_SHAPES) @test_utils.test(require=qd.extension.data64, arch=qd.gpu, default_fp=qd.f64, fast_math=False) -def test_matmul_chain_qipc_sizes_f64(): - _test_matmul_chain(qd.f64) +def test_matmul_chain_qipc_sizes_f64(rows_a, cols_a, cols_b, cols_c): + _test_matmul_chain(rows_a, cols_a, cols_b, cols_c, qd.f64) @test_utils.test() @@ -434,7 +461,7 @@ def run(): np.testing.assert_allclose(M @ inv_np, np.eye(n_), rtol=tol, atol=tol) -@pytest.mark.parametrize("n", [5, 6, 7, 8, 9, 10, 11, 12]) +@pytest.mark.parametrize("n", [5, pytest.param(12, marks=pytest.mark.slow)]) @pytest.mark.parametrize( "factory", [_inverse_diagonally_dominant, _inverse_spd, _inverse_pivoting_required], @@ -444,7 +471,7 @@ def test_inverse_large_f32(n, factory): _test_inverse_at_size(n, qd.f32, factory) -@pytest.mark.parametrize("n", [5, 6, 7, 8, 9, 10, 11, 12]) +@pytest.mark.parametrize("n", [5, pytest.param(12, marks=pytest.mark.slow)]) @pytest.mark.parametrize( "factory", [_inverse_diagonally_dominant, _inverse_spd, _inverse_pivoting_required], diff --git a/tests/python/test_simt.py b/tests/python/test_simt.py index 95e3438e41..8c44a40bf9 100644 --- a/tests/python/test_simt.py +++ b/tests/python/test_simt.py @@ -887,81 +887,57 @@ def _ref_reduce_max(values): return max(values) -@pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) -@pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) -@test_utils.test(arch=qd.gpu) -def test_block_reduce_add(dtype, sg_per_block): - """Block sum-reduce: thread 0 of each block holds `sum(src[block_base:block_base+block_dim])`.""" - _skip_if_f64_unsupported(dtype) - block_dim = sg_per_block * _arch_subgroup_size() - NUM_BLOCKS = 4 - N = NUM_BLOCKS * block_dim - src = qd.field(dtype=dtype, shape=N) - dst = qd.field(dtype=dtype, shape=NUM_BLOCKS) - - @qd.kernel - def foo(): - qd.loop_config(block_dim=block_dim) - for i in range(N): - tid = i % block_dim - agg = block.reduce_add(src[i], block_dim, dtype) - if tid == 0: - dst[i // block_dim] = agg - - _init_field(src, N, dtype) - foo() - - for b in range(NUM_BLOCKS): - block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_reduce_add(block_vals) - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert dst[b] == expected, f"block {b}: got {dst[b]}, expected {expected}" - else: - assert abs(dst[b] - expected) < 1e-4 * abs(expected), f"block {b}: got {dst[b]}, expected {expected}" +# The three single-output reduces (`test_block_reduce_{add,min,max}`) and their three broadcast siblings +# (`test_block_reduce_all_{add,min,max}`) share the same kernel skeleton, parametrize axes, and verification loop; +# they differ only in (a) which `block.reduce_*` function gets called, (b) the host-side reference oracle, (c) the +# init pattern (sequential for `add` so the running sum has signal, permuted hash for `min` / `max` so the result +# depends on lanes other than first / last), and (d) the float tolerance regime (`add` accumulates so it uses a +# relative tol; `min` / `max` pick one element of the input and use an absolute tol). +_BLOCK_REDUCE_OP_CASES = [ + # (op_name, ref_fn, init_permuted, tol_relative) + pytest.param("add", _ref_reduce_add, False, True, id="add"), + pytest.param("min", _ref_reduce_min, True, False, id="min"), + pytest.param("max", _ref_reduce_max, True, False, id="max"), +] -@pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) -@pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) -@test_utils.test(arch=qd.gpu) -def test_block_reduce_min(dtype, sg_per_block): - """Block min-reduce: thread 0 of each block holds `min(src[block_base:block_base+block_dim])`.""" - _skip_if_f64_unsupported(dtype) - block_dim = sg_per_block * _arch_subgroup_size() - NUM_BLOCKS = 4 - N = NUM_BLOCKS * block_dim - src = qd.field(dtype=dtype, shape=N) - dst = qd.field(dtype=dtype, shape=NUM_BLOCKS) - - @qd.kernel - def foo(): - qd.loop_config(block_dim=block_dim) +def _init_block_reduce_src(src, N, dtype, *, permuted): + """Initialize ``src[0:N]`` for a block reduce test. ``permuted=False`` is the sequential ``1..N`` init from + ``_init_field`` (good for add); ``permuted=True`` is the stable hash ``((i * 1009) % 997) + 1`` so the per-block + min / max depends on lanes other than first / last.""" + if permuted: for i in range(N): - tid = i % block_dim - agg = block.reduce_min(src[i], block_dim, dtype) - if tid == 0: - dst[i // block_dim] = agg + v = ((i * 1009) % 997) + 1 + src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v + else: + _init_field(src, N, dtype) - # Permuted (non-monotone) initialisation so the min depends on lanes other than the first / last. - for i in range(N): - v = ((i * 1009) % 997) + 1 # in [1, 997]; stable hash, no collisions w/ block_dim values up to 256 - src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v - foo() - for b in range(NUM_BLOCKS): - block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_reduce_min(block_vals) - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert dst[b] == expected, f"block {b}: got {dst[b]}, expected {expected}" - else: - assert abs(dst[b] - expected) < 1e-5, f"block {b}: got {dst[b]}, expected {expected}" +def _assert_block_reduce_close(actual, expected, dtype, *, tol_relative, ctx): + """Assert ``actual ~= expected`` per the block-reduce tolerance regime. + Int dtypes compare exactly. Floats use relative tolerance ``1e-4 * |expected|`` for accumulating ops (sums grow + with block_dim, so a relative bound is the only thing that stays meaningful across the 32 / 128 / 256 / 64 / 256 / + 512 block-size sweep), and absolute tolerance ``1e-5`` for picker ops (min / max pick one element so the + magnitude is whatever was in the input -- a small absolute bound suffices). + """ + if dtype in _BLOCK_REDUCE_INT_DTYPES: + assert actual == expected, f"{ctx}: got {actual}, expected {expected}" + elif tol_relative: + assert abs(actual - expected) < 1e-4 * abs(expected), f"{ctx}: got {actual}, expected {expected}" + else: + assert abs(actual - expected) < 1e-5, f"{ctx}: got {actual}, expected {expected}" + +@pytest.mark.parametrize("op_name,ref_fn,init_permuted,tol_relative", _BLOCK_REDUCE_OP_CASES) @pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) @pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) @test_utils.test(arch=qd.gpu) -def test_block_reduce_max(dtype, sg_per_block): - """Block max-reduce: thread 0 of each block holds `max(src[block_base:block_base+block_dim])`.""" +def test_block_reduce(dtype, sg_per_block, op_name, ref_fn, init_permuted, tol_relative): + """Block reduce: thread 0 of each block holds ``(src[block_base:block_base+block_dim])``. Unified across + ``add`` / ``min`` / ``max`` -- op-name is closure-captured into ``@qd.kernel``.""" _skip_if_f64_unsupported(dtype) + op_fn = getattr(block, f"reduce_{op_name}") block_dim = sg_per_block * _arch_subgroup_size() NUM_BLOCKS = 4 N = NUM_BLOCKS * block_dim @@ -973,34 +949,29 @@ def foo(): qd.loop_config(block_dim=block_dim) for i in range(N): tid = i % block_dim - agg = block.reduce_max(src[i], block_dim, dtype) + agg = op_fn(src[i], block_dim, dtype) if tid == 0: dst[i // block_dim] = agg - for i in range(N): - v = ((i * 1009) % 997) + 1 - src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v + _init_block_reduce_src(src, N, dtype, permuted=init_permuted) foo() for b in range(NUM_BLOCKS): block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_reduce_max(block_vals) - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert dst[b] == expected, f"block {b}: got {dst[b]}, expected {expected}" - else: - assert abs(dst[b] - expected) < 1e-5, f"block {b}: got {dst[b]}, expected {expected}" + expected = ref_fn(block_vals) + _assert_block_reduce_close(dst[b], expected, dtype, tol_relative=tol_relative, ctx=f"block {b}") +@pytest.mark.parametrize("op_name,ref_fn,init_permuted,tol_relative", _BLOCK_REDUCE_OP_CASES) @pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) @pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) @test_utils.test(arch=qd.gpu) -def test_block_reduce_all_add(dtype, sg_per_block): - """Block sum-reduce broadcast: every thread of each block holds the block-wide sum. - - Verifies the broadcast variant by writing the per-thread output to a flat field, then asserting every thread of a - given block reads the same aggregate. - """ +def test_block_reduce_all(dtype, sg_per_block, op_name, ref_fn, init_permuted, tol_relative): + """Block reduce broadcast: every thread of each block holds the block-wide ````. Verified by writing the + per-thread output to a flat field, then asserting every thread of a given block reads the same aggregate. + Unified across ``add`` / ``min`` / ``max``.""" _skip_if_f64_unsupported(dtype) + op_fn = getattr(block, f"reduce_all_{op_name}") block_dim = sg_per_block * _arch_subgroup_size() NUM_BLOCKS = 4 N = NUM_BLOCKS * block_dim @@ -1011,90 +982,17 @@ def test_block_reduce_all_add(dtype, sg_per_block): def foo(): qd.loop_config(block_dim=block_dim) for i in range(N): - dst[i] = block.reduce_all_add(src[i], block_dim, dtype) + dst[i] = op_fn(src[i], block_dim, dtype) - _init_field(src, N, dtype) + _init_block_reduce_src(src, N, dtype, permuted=init_permuted) foo() for b in range(NUM_BLOCKS): block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_reduce_add(block_vals) + expected = ref_fn(block_vals) for j in range(block_dim): actual = dst[b * block_dim + j] - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert actual == expected, f"block {b} thread {j}: got {actual}, expected {expected}" - else: - assert abs(actual - expected) < 1e-4 * abs( - expected - ), f"block {b} thread {j}: got {actual}, expected {expected}" - - -@pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) -@pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) -@test_utils.test(arch=qd.gpu) -def test_block_reduce_all_min(dtype, sg_per_block): - """Block min-reduce broadcast: every thread reads the block-wide min.""" - _skip_if_f64_unsupported(dtype) - block_dim = sg_per_block * _arch_subgroup_size() - NUM_BLOCKS = 4 - N = NUM_BLOCKS * block_dim - src = qd.field(dtype=dtype, shape=N) - dst = qd.field(dtype=dtype, shape=N) - - @qd.kernel - def foo(): - qd.loop_config(block_dim=block_dim) - for i in range(N): - dst[i] = block.reduce_all_min(src[i], block_dim, dtype) - - for i in range(N): - v = ((i * 1009) % 997) + 1 - src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v - foo() - - for b in range(NUM_BLOCKS): - block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_reduce_min(block_vals) - for j in range(block_dim): - actual = dst[b * block_dim + j] - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert actual == expected, f"block {b} thread {j}: got {actual}, expected {expected}" - else: - assert abs(actual - expected) < 1e-5, f"block {b} thread {j}: got {actual}, expected {expected}" - - -@pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) -@pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) -@test_utils.test(arch=qd.gpu) -def test_block_reduce_all_max(dtype, sg_per_block): - """Block max-reduce broadcast: every thread reads the block-wide max.""" - _skip_if_f64_unsupported(dtype) - block_dim = sg_per_block * _arch_subgroup_size() - NUM_BLOCKS = 4 - N = NUM_BLOCKS * block_dim - src = qd.field(dtype=dtype, shape=N) - dst = qd.field(dtype=dtype, shape=N) - - @qd.kernel - def foo(): - qd.loop_config(block_dim=block_dim) - for i in range(N): - dst[i] = block.reduce_all_max(src[i], block_dim, dtype) - - for i in range(N): - v = ((i * 1009) % 997) + 1 - src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v - foo() - - for b in range(NUM_BLOCKS): - block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_reduce_max(block_vals) - for j in range(block_dim): - actual = dst[b * block_dim + j] - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert actual == expected, f"block {b} thread {j}: got {actual}, expected {expected}" - else: - assert abs(actual - expected) < 1e-5, f"block {b} thread {j}: got {actual}, expected {expected}" + _assert_block_reduce_close(actual, expected, dtype, tol_relative=tol_relative, ctx=f"block {b} thread {j}") # --- Block scan tests ------------------------------------------------------------------ @@ -1147,46 +1045,45 @@ def _ref_exclusive_scan_op(values, op, identity): return out -@pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) -@pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) -@test_utils.test(arch=qd.gpu) -def test_block_inclusive_add(dtype, sg_per_block): - """Block inclusive prefix sum: thread `i` holds `sum(src[block_base..i])`.""" - _skip_if_f64_unsupported(dtype) - block_dim = sg_per_block * _arch_subgroup_size() - NUM_BLOCKS = 4 - N = NUM_BLOCKS * block_dim - src = qd.field(dtype=dtype, shape=N) - dst = qd.field(dtype=dtype, shape=N) - - @qd.kernel - def foo(): - qd.loop_config(block_dim=block_dim) - for i in range(N): - dst[i] = block.inclusive_add(src[i], block_dim, dtype) +# The four scan tests in this group (`test_block_inclusive_{add,min,max}` + `test_block_exclusive_add`) share the +# kernel skeleton; only the per-op reference oracle, init pattern, and float tolerance differ. `add` accumulates +# (sequential init, relative tol); `min` / `max` pick (permuted init, absolute tol). Exclusive `min` / `max` get +# their own dedicated test below because they need a dtype-derived sentinel identity (+inf / iinfo(max), -inf / +# iinfo(min)) at lane 0 with explicit ``isinf`` handling -- different enough that fusing them in would create more +# branches than it removes. +_PY_MIN = lambda a, b: a if a < b else b # noqa: E731 (intentional 1-line lambda for ref oracle) +_PY_MAX = lambda a, b: a if a > b else b # noqa: E731 + +_BLOCK_INCLUSIVE_SCAN_OP_CASES = [ + # (op_name, ref_fn, init_permuted, tol_relative) + pytest.param("add", _ref_inclusive_scan_add, False, True, id="add"), + pytest.param("min", lambda vals: _ref_inclusive_scan_op(vals, _PY_MIN, 0), True, False, id="min"), + pytest.param("max", lambda vals: _ref_inclusive_scan_op(vals, _PY_MAX, 0), True, False, id="max"), +] - _init_field(src, N, dtype) - foo() - for b in range(NUM_BLOCKS): - block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_inclusive_scan_add(block_vals) - for j in range(block_dim): - actual = dst[b * block_dim + j] - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert actual == expected[j], f"block {b} thread {j}: got {actual}, expected {expected[j]}" - else: - assert abs(actual - expected[j]) < 1e-4 * abs( - expected[j] + 1.0 - ), f"block {b} thread {j}: got {actual}, expected {expected[j]}" +def _assert_block_scan_close(actual, expected_j, dtype, *, tol_relative, ctx): + """Per-thread assertion for block scan tests. Same int / relative-float / absolute-float regime as + ``_assert_block_reduce_close`` but with a floor on the relative-tol base so the first few prefixes (where + ``expected_j`` is near zero) don't tighten the bound to zero.""" + if dtype in _BLOCK_REDUCE_INT_DTYPES: + assert actual == expected_j, f"{ctx}: got {actual}, expected {expected_j}" + elif tol_relative: + tol_base = abs(expected_j) if abs(expected_j) > 1.0 else 1.0 + assert abs(actual - expected_j) < 1e-4 * tol_base, f"{ctx}: got {actual}, expected {expected_j}" + else: + assert abs(actual - expected_j) < 1e-5, f"{ctx}: got {actual}, expected {expected_j}" +@pytest.mark.parametrize("op_name,ref_fn,init_permuted,tol_relative", _BLOCK_INCLUSIVE_SCAN_OP_CASES) @pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) @pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) @test_utils.test(arch=qd.gpu) -def test_block_exclusive_add(dtype, sg_per_block): - """Block exclusive prefix sum: thread `i` holds `sum(src[block_base..i-1])`; thread 0 holds 0.""" +def test_block_inclusive(dtype, sg_per_block, op_name, ref_fn, init_permuted, tol_relative): + """Block inclusive prefix scan: thread ``i`` holds ``(src[block_base..i])``. Unified across ``add`` / ``min`` + / ``max``.""" _skip_if_f64_unsupported(dtype) + op_fn = getattr(block, f"inclusive_{op_name}") block_dim = sg_per_block * _arch_subgroup_size() NUM_BLOCKS = 4 N = NUM_BLOCKS * block_dim @@ -1197,31 +1094,24 @@ def test_block_exclusive_add(dtype, sg_per_block): def foo(): qd.loop_config(block_dim=block_dim) for i in range(N): - dst[i] = block.exclusive_add(src[i], block_dim, dtype) + dst[i] = op_fn(src[i], block_dim, dtype) - _init_field(src, N, dtype) + _init_block_reduce_src(src, N, dtype, permuted=init_permuted) foo() for b in range(NUM_BLOCKS): block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_exclusive_scan_add(block_vals) + expected = ref_fn(block_vals) for j in range(block_dim): actual = dst[b * block_dim + j] - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert actual == expected[j], f"block {b} thread {j}: got {actual}, expected {expected[j]}" - else: - # First thread's expected is 0; gate the relative tolerance so it doesn't blow up. - tol_base = abs(expected[j]) if abs(expected[j]) > 1.0 else 1.0 - assert ( - abs(actual - expected[j]) < 1e-4 * tol_base - ), f"block {b} thread {j}: got {actual}, expected {expected[j]}" + _assert_block_scan_close(actual, expected[j], dtype, tol_relative=tol_relative, ctx=f"block {b} thread {j}") @pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) @pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) @test_utils.test(arch=qd.gpu) -def test_block_inclusive_min(dtype, sg_per_block): - """Block inclusive prefix min.""" +def test_block_exclusive_add(dtype, sg_per_block): + """Block exclusive prefix sum: thread ``i`` holds ``sum(src[block_base..i-1])``; thread 0 holds 0.""" _skip_if_f64_unsupported(dtype) block_dim = sg_per_block * _arch_subgroup_size() NUM_BLOCKS = 4 @@ -1233,66 +1123,37 @@ def test_block_inclusive_min(dtype, sg_per_block): def foo(): qd.loop_config(block_dim=block_dim) for i in range(N): - dst[i] = block.inclusive_min(src[i], block_dim, dtype) + dst[i] = block.exclusive_add(src[i], block_dim, dtype) - for i in range(N): - v = ((i * 1009) % 997) + 1 - src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v + _init_field(src, N, dtype) foo() - py_min = lambda a, b: a if a < b else b # noqa: E731 (intentional 1-line lambda for ref oracle) for b in range(NUM_BLOCKS): block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_inclusive_scan_op(block_vals, py_min, 0) + expected = _ref_exclusive_scan_add(block_vals) for j in range(block_dim): actual = dst[b * block_dim + j] - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert actual == expected[j], f"block {b} thread {j}: got {actual}, expected {expected[j]}" - else: - assert abs(actual - expected[j]) < 1e-5, f"block {b} thread {j}: got {actual}, expected {expected[j]}" - - -@pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) -@pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) -@test_utils.test(arch=qd.gpu) -def test_block_inclusive_max(dtype, sg_per_block): - """Block inclusive prefix max.""" - _skip_if_f64_unsupported(dtype) - block_dim = sg_per_block * _arch_subgroup_size() - NUM_BLOCKS = 4 - N = NUM_BLOCKS * block_dim - src = qd.field(dtype=dtype, shape=N) - dst = qd.field(dtype=dtype, shape=N) - - @qd.kernel - def foo(): - qd.loop_config(block_dim=block_dim) - for i in range(N): - dst[i] = block.inclusive_max(src[i], block_dim, dtype) + _assert_block_scan_close(actual, expected[j], dtype, tol_relative=True, ctx=f"block {b} thread {j}") - for i in range(N): - v = ((i * 1009) % 997) + 1 - src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v - foo() - py_max = lambda a, b: a if a > b else b # noqa: E731 - for b in range(NUM_BLOCKS): - block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_inclusive_scan_op(block_vals, py_max, 0) - for j in range(block_dim): - actual = dst[b * block_dim + j] - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert actual == expected[j], f"block {b} thread {j}: got {actual}, expected {expected[j]}" - else: - assert abs(actual - expected[j]) < 1e-5, f"block {b} thread {j}: got {actual}, expected {expected[j]}" +_BLOCK_EXCLUSIVE_MINMAX_CASES = [ + # (op_name, sentinel_fn, py_op, inf_sign) + pytest.param("min", _block_exclusive_min_sentinel, _PY_MIN, 1, id="min"), + pytest.param("max", _block_exclusive_max_sentinel, _PY_MAX, -1, id="max"), +] +@pytest.mark.parametrize("op_name,sentinel_fn,py_op,inf_sign", _BLOCK_EXCLUSIVE_MINMAX_CASES) @pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) @pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) @test_utils.test(arch=qd.gpu) -def test_block_exclusive_min(dtype, sg_per_block): - """Block exclusive prefix min; thread 0 holds the dtype-derived identity (``+inf`` / ``np.iinfo(dtype).max``).""" +def test_block_exclusive_minmax(dtype, sg_per_block, op_name, sentinel_fn, py_op, inf_sign): + """Block exclusive prefix ```` for ``op in {min, max}``; thread 0 of each block holds the dtype-derived + identity (``+inf`` / ``iinfo(dtype).max`` for min, ``-inf`` / ``iinfo(dtype).min`` for max). The float ``inf`` / + ``-inf`` lane-0 identity gets a sign-only check because ``inf - inf`` (or ``(-inf) - (-inf)``) is ``NaN`` and the + standard ``abs(diff) < tol`` compare would fail spuriously.""" _skip_if_f64_unsupported(dtype) + op_fn = getattr(block, f"exclusive_{op_name}") block_dim = sg_per_block * _arch_subgroup_size() NUM_BLOCKS = 4 N = NUM_BLOCKS * block_dim @@ -1303,25 +1164,23 @@ def test_block_exclusive_min(dtype, sg_per_block): def foo(): qd.loop_config(block_dim=block_dim) for i in range(N): - dst[i] = block.exclusive_min(src[i], block_dim, dtype) + dst[i] = op_fn(src[i], block_dim, dtype) - for i in range(N): - v = ((i * 1009) % 997) + 1 - src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v + _init_block_reduce_src(src, N, dtype, permuted=True) foo() - sentinel = _block_exclusive_min_sentinel(dtype) - py_min = lambda a, b: a if a < b else b # noqa: E731 + sentinel = sentinel_fn(dtype) for b in range(NUM_BLOCKS): block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_exclusive_scan_op(block_vals, py_min, sentinel) + expected = _ref_exclusive_scan_op(block_vals, py_op, sentinel) for j in range(block_dim): actual = dst[b * block_dim + j] if dtype in _BLOCK_REDUCE_INT_DTYPES: assert actual == expected[j], f"block {b} thread {j}: got {actual}, expected {expected[j]}" elif math.isinf(expected[j]): - # Thread 0 of each block gets the +inf identity; ``inf - inf`` is NaN, so check by equality / sign. - assert math.isinf(actual) and actual > 0, f"block {b} thread {j}: got {actual}, expected {expected[j]}" + assert math.isinf(actual) and ( + actual > 0 if inf_sign > 0 else actual < 0 + ), f"block {b} thread {j}: got {actual}, expected {expected[j]}" else: assert abs(actual - expected[j]) < 1e-5, f"block {b} thread {j}: got {actual}, expected {expected[j]}" @@ -1455,45 +1314,6 @@ def kern(): assert actual_ranks == ref_ranks, f"ranks mismatch (pattern={key_pattern})" -@pytest.mark.parametrize("dtype", _BLOCK_REDUCE_DTYPES) -@pytest.mark.parametrize("sg_per_block", _BLOCK_REDUCE_SG_PER_BLOCK) -@test_utils.test(arch=qd.gpu) -def test_block_exclusive_max(dtype, sg_per_block): - """Block exclusive prefix max; thread 0 holds the dtype-derived identity (``-inf`` / ``np.iinfo(dtype).min``).""" - _skip_if_f64_unsupported(dtype) - block_dim = sg_per_block * _arch_subgroup_size() - NUM_BLOCKS = 4 - N = NUM_BLOCKS * block_dim - src = qd.field(dtype=dtype, shape=N) - dst = qd.field(dtype=dtype, shape=N) - - @qd.kernel - def foo(): - qd.loop_config(block_dim=block_dim) - for i in range(N): - dst[i] = block.exclusive_max(src[i], block_dim, dtype) - - for i in range(N): - v = ((i * 1009) % 997) + 1 - src[i] = v if dtype in _BLOCK_REDUCE_INT_DTYPES else 1.0 * v - foo() - - sentinel = _block_exclusive_max_sentinel(dtype) - py_max = lambda a, b: a if a > b else b # noqa: E731 - for b in range(NUM_BLOCKS): - block_vals = [src[b * block_dim + j] for j in range(block_dim)] - expected = _ref_exclusive_scan_op(block_vals, py_max, sentinel) - for j in range(block_dim): - actual = dst[b * block_dim + j] - if dtype in _BLOCK_REDUCE_INT_DTYPES: - assert actual == expected[j], f"block {b} thread {j}: got {actual}, expected {expected[j]}" - elif math.isinf(expected[j]): - # Thread 0 of each block gets the -inf identity; ``-inf - -inf`` is NaN, so check by equality / sign. - assert math.isinf(actual) and actual < 0, f"block {b} thread {j}: got {actual}, expected {expected[j]}" - else: - assert abs(actual - expected[j]) < 1e-5, f"block {b} thread {j}: got {actual}, expected {expected[j]}" - - @pytest.mark.parametrize("dtype", [qd.i32, qd.f32, qd.f64]) @test_utils.test(arch=qd.gpu) def test_subgroup_shuffle_broadcast(dtype): @@ -3604,94 +3424,45 @@ def _init_full_bitwise(src, n): src[i] = 1 << (i % 7) -@test_utils.test(arch=qd.gpu) -def test_subgroup_reduce_add(): - _check_full_matches_tiled(subgroup.reduce_add, subgroup.reduce_add_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_reduce_all_add(): - _check_full_matches_tiled(subgroup.reduce_all_add, subgroup.reduce_all_add_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_reduce_min(): - _check_full_matches_tiled(subgroup.reduce_min, subgroup.reduce_min_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_reduce_max(): - _check_full_matches_tiled(subgroup.reduce_max, subgroup.reduce_max_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_reduce_all_min(): - _check_full_matches_tiled(subgroup.reduce_all_min, subgroup.reduce_all_min_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_reduce_all_max(): - _check_full_matches_tiled(subgroup.reduce_all_max, subgroup.reduce_all_max_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_inclusive_add(): - _check_full_matches_tiled(subgroup.inclusive_add, subgroup.inclusive_add_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_inclusive_min(): - _check_full_matches_tiled(subgroup.inclusive_min, subgroup.inclusive_min_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_inclusive_max(): - _check_full_matches_tiled(subgroup.inclusive_max, subgroup.inclusive_max_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_inclusive_mul(): - _check_full_matches_tiled(subgroup.inclusive_mul, subgroup.inclusive_mul_tiled, host_init=_init_full_small_int) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_inclusive_and(): - _check_full_matches_tiled(subgroup.inclusive_and, subgroup.inclusive_and_tiled, host_init=_init_full_bitwise) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_inclusive_or(): - _check_full_matches_tiled(subgroup.inclusive_or, subgroup.inclusive_or_tiled, host_init=_init_full_bitwise) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_inclusive_xor(): - _check_full_matches_tiled(subgroup.inclusive_xor, subgroup.inclusive_xor_tiled, host_init=_init_full_bitwise) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_exclusive_add(): - _check_full_matches_tiled(subgroup.exclusive_add, subgroup.exclusive_add_tiled) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_exclusive_mul(): - _check_full_matches_tiled(subgroup.exclusive_mul, subgroup.exclusive_mul_tiled, host_init=_init_full_small_int) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_exclusive_and(): - _check_full_matches_tiled(subgroup.exclusive_and, subgroup.exclusive_and_tiled, host_init=_init_full_bitwise) - - -@test_utils.test(arch=qd.gpu) -def test_subgroup_exclusive_or(): - _check_full_matches_tiled(subgroup.exclusive_or, subgroup.exclusive_or_tiled, host_init=_init_full_bitwise) +# Each entry is a thin ``_check_full_matches_tiled(subgroup.X, subgroup.X_tiled, ...)`` wrapper. Collapsed into one +# op-parametrized test to drop ~80 LOC of duplication. The pytest ids match the names of the original +# ``test_subgroup_`` functions so test reports / `-k` selectors stay stable. +_FULL_VS_TILED_INT_CASES = [ + pytest.param("reduce_add", None, id="reduce_add"), + pytest.param("reduce_all_add", None, id="reduce_all_add"), + pytest.param("reduce_min", None, id="reduce_min"), + pytest.param("reduce_max", None, id="reduce_max"), + pytest.param("reduce_all_min", None, id="reduce_all_min"), + pytest.param("reduce_all_max", None, id="reduce_all_max"), + pytest.param("inclusive_add", None, id="inclusive_add"), + pytest.param("inclusive_min", None, id="inclusive_min"), + pytest.param("inclusive_max", None, id="inclusive_max"), + # `mul` needs bounded inputs (2**N overflows i32 quickly); bitwise ops need a per-lane bit pattern that's + # non-zero on every lane so AND has signal and OR / XOR have varied bits. + pytest.param("inclusive_mul", _init_full_small_int, id="inclusive_mul"), + pytest.param("inclusive_and", _init_full_bitwise, id="inclusive_and"), + pytest.param("inclusive_or", _init_full_bitwise, id="inclusive_or"), + pytest.param("inclusive_xor", _init_full_bitwise, id="inclusive_xor"), + pytest.param("exclusive_add", None, id="exclusive_add"), + pytest.param("exclusive_mul", _init_full_small_int, id="exclusive_mul"), + pytest.param("exclusive_and", _init_full_bitwise, id="exclusive_and"), + pytest.param("exclusive_or", _init_full_bitwise, id="exclusive_or"), + pytest.param("exclusive_xor", _init_full_bitwise, id="exclusive_xor"), +] +@pytest.mark.parametrize("op_name,host_init", _FULL_VS_TILED_INT_CASES) @test_utils.test(arch=qd.gpu) -def test_subgroup_exclusive_xor(): - _check_full_matches_tiled(subgroup.exclusive_xor, subgroup.exclusive_xor_tiled, host_init=_init_full_bitwise) +def test_subgroup_full_matches_tiled(op_name, host_init): + """For each subgroup op ``X``, verify ``subgroup.X(v)`` matches ``subgroup.X_tiled(v, log2_group_size())`` + lane-by-lane on ``qd.i32``. Covers reduce / inclusive / exclusive families; bitwise ops + ``mul`` use a custom + initializer that keeps the per-lane aggregate bounded.""" + full_fn = getattr(subgroup, op_name) + tiled_fn = getattr(subgroup, f"{op_name}_tiled") + kwargs = {} + if host_init is not None: + kwargs["host_init"] = host_init + _check_full_matches_tiled(full_fn, tiled_fn, **kwargs) @test_utils.test(arch=qd.gpu) @@ -3836,16 +3607,15 @@ def k(): # accidentally cast through i32 inside a wrapper. +@pytest.mark.parametrize("op_name", ["reduce_add", "inclusive_add"]) @pytest.mark.parametrize("dtype", [qd.f32, qd.f64]) @test_utils.test(arch=qd.gpu) -def test_subgroup_reduce_add_float(dtype): - _check_full_matches_tiled(subgroup.reduce_add, subgroup.reduce_add_tiled, dtype=dtype) - - -@pytest.mark.parametrize("dtype", [qd.f32, qd.f64]) -@test_utils.test(arch=qd.gpu) -def test_subgroup_inclusive_add_float(dtype): - _check_full_matches_tiled(subgroup.inclusive_add, subgroup.inclusive_add_tiled, dtype=dtype) +def test_subgroup_full_matches_tiled_float(op_name, dtype): + """Float-dtype coverage of the dtype-agnostic ``full`` wrappers (``reduce_add``, ``inclusive_add``). One f32 + one + f64 case per family is enough to catch an i32-only regression in a wrapper.""" + full_fn = getattr(subgroup, op_name) + tiled_fn = getattr(subgroup, f"{op_name}_tiled") + _check_full_matches_tiled(full_fn, tiled_fn, dtype=dtype) @pytest.mark.parametrize("dtype", [qd.f32, qd.f64]) diff --git a/tests/python/test_tile16.py b/tests/python/test_tile16.py index 97480c7d1d..9bed5bc277 100644 --- a/tests/python/test_tile16.py +++ b/tests/python/test_tile16.py @@ -92,6 +92,12 @@ def k1(src_arr: Ann, dst_arr: Ann): np.testing.assert_allclose(dst.to_numpy(), np.eye(_TILE, dtype=np_dtype)) +# 8 geometries x 2 tensor_type x 2 qd_dtype = 32 parametrize cases. The geometries enumerate hand-picked corner cases +# (origin, non-zero src/dst offsets, partial cols/rows, oversize backing array); coverage of any single geometry is +# more valuable than running every combination every CI run. ``@pytest.mark.sample(n=6)`` keeps 6 of the 32 cases per +# run; after k runs each specific case is hit with probability 1 - (26/32)^k = 1 - 0.8125^k (~65% after 5 runs, ~98% +# after 20). See docs/source/user_guide/unit_testing.md for the reproducibility recipes. +@pytest.mark.sample(n=6) @pytest.mark.parametrize( "src_row, src_col, row_offset, col_offset, ncols, nrows", [ @@ -439,6 +445,10 @@ def k1( np.testing.assert_allclose(out.to_numpy(), expected, atol=atol) +# 3 dst_delta x 3 src_offset x 2 tensor_type x 2 qd_dtype = 36 parametrize cases. Each case is an independent offset / +# delta combo; running 6 random ones per CI run with ~97% convergence over 20 runs is the right tradeoff given each +# case takes ~5s of cluster wall time. See unit_testing.md. +@pytest.mark.sample(n=6) @pytest.mark.parametrize("tensor_type", [qd.ndarray, qd.field]) @pytest.mark.parametrize("dst_delta", [0, 3, 16]) @pytest.mark.parametrize("src_offset", [0, 5, 32]) @@ -1776,8 +1786,25 @@ def write_eye_f32(dst: Ann32): @test_utils.test(arch=[qd.cuda]) def test_tile16_cholesky_blocked_demo(): - """Smoke-test that misc/demos/cholesky_blocked.py runs to completion.""" + """Smoke-test that misc/demos/cholesky_blocked.py runs to completion. + + Uses small CLI overrides (N=32, N_ENVS=64, 1 warmup + 1 timed iter) so the JIT compile of the 3 unrolled kernels + and the benchmark loop both stay cheap. The demo's defaults (N=92, N_ENVS=4096, 50+200 iters) are exercised by + anyone running the script manually, not by CI. + """ demo = Path(__file__).resolve().parents[2] / "misc" / "demos" / "cholesky_blocked.py" - result = subprocess.run([sys.executable, str(demo)], capture_output=True, text=True, timeout=300) + cmd = [ + sys.executable, + str(demo), + "--n", + "32", + "--n-envs", + "64", + "--num-warmup", + "1", + "--num-iters", + "1", + ] + result = subprocess.run(cmd, capture_output=True, text=True, timeout=300) if result.returncode != 0: pytest.fail(f"cholesky_blocked.py exited with code {result.returncode}\nstderr:\n{result.stderr}") diff --git a/tests/run_tests.py b/tests/run_tests.py index e2419add42..9e033a89d4 100644 --- a/tests/run_tests.py +++ b/tests/run_tests.py @@ -1,6 +1,7 @@ import argparse import importlib.util import os +import random def _test_python(args, default_dir="python"): @@ -56,8 +57,26 @@ def _test_python(args, default_dir="python"): pytest_args += ["--cov-append"] if args.keys: pytest_args += ["-k", args.keys] - if args.marks: - pytest_args += ["-m", args.marks] + # By default we exclude tests marked `slow` (eig / make_spd at n>=6, inverse_large at n>=6, mpm88, etc. -- see + # tests/pytest.ini for the marker). `--run-slow` opts back in. If the user passes their own `-m` expression we + # AND `not slow` onto it so the exclusion still applies, unless they explicitly opt out via `--run-slow`. + marks_expr = args.marks + if not args.run_slow: + marks_expr = f"({marks_expr}) and not slow" if marks_expr else "not slow" + if marks_expr: + pytest_args += ["-m", marks_expr] + if args.no_sample: + pytest_args += ["--no-sample"] + else: + # Pick the run's @pytest.mark.sample seed here (before pytest is launched) and pass it via --sample-seed on + # argv. This is the most reliable way to propagate the seed to xdist workers: xdist forwards argv to every + # worker subprocess, so all workers and the controller see the exact same value, sample identical subsets, + # and xdist's collection-consistency check passes. (Setting the seed inside ``pytest_configure`` doesn't + # work because ``os.environ`` mutation there happens after xdist has already snapshotted the env it ships + # to workers, and ``pytest_configure_node`` only fires for conftests at the rootdir level.) + if args.sample_seed is None: + args.sample_seed = random.randrange(0, 2**31) + pytest_args += [f"--sample-seed={args.sample_seed}"] if args.failed_first: pytest_args += ["--failed-first"] if args.fail_fast: @@ -161,7 +180,34 @@ def test(): default=None, dest="marks", type=str, - help="Only run tests with specific marks", + help="Only run tests with specific marks. `not slow` is appended automatically " "unless --run-slow is passed.", + ) + parser.add_argument( + "--run-slow", + required=False, + default=False, + dest="run_slow", + action="store_true", + help="Include tests marked `slow` (excluded by default). Has no effect if -m is " + "given an explicit expression that already mentions `slow`.", + ) + parser.add_argument( + "--sample-seed", + required=False, + default=None, + type=int, + dest="sample_seed", + help="Seed for @pytest.mark.sample subsampling. Defaults to a fresh seed picked per run " + "(printed in the report header). Pass the seed from a failing CI run to reproduce its sample.", + ) + parser.add_argument( + "--no-sample", + required=False, + default=False, + dest="no_sample", + action="store_true", + help="Disable @pytest.mark.sample subsampling -- run every parametrize case of every marked test. " + "Use for exhaustive CI release gates / coverage-debt audits.", ) parser.add_argument( "-f",