fix(test): increase LER sample count to 100k to fix stochastic flakes on A100 #60

Merged
ivanbasov merged 7 commits into NVIDIA:main from ivanbasov:worktree-qa
Apr 10, 2026

Conversation

@ivanbasov
Member

@ivanbasov ivanbasov commented Apr 9, 2026

Summary

  • Increases test_ler_improves_with_bd_noise_model sample count from 50k/20k (full/CI) to 100k/100k
  • Restores the 1.5x degradation threshold (unchanged from upstream)
  • Fixes an inverted ratio in the debug print (ler_no_bd / ler_with_bd → ler_with_bd / ler_no_bd)
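
A minimal sketch of the check and the corrected debug print summarized above. The helper name `check_bd_degradation` and its signature are hypothetical and are not taken from the repository; only the 1.5x threshold, the ratio direction, and the names `ler_with_bd` / `ler_no_bd` come from this PR. It assumes `ler_no_bd` is strictly positive.

```python
def check_bd_degradation(ler_with_bd: float, ler_no_bd: float,
                         max_ratio: float = 1.5) -> float:
    """Return the degradation ratio; raise AssertionError above max_ratio.

    Hypothetical helper for illustration, not the repository's test code.
    """
    # Degradation is the with-BD LER relative to the no-BD LER.
    # The debug print previously reported the inverse of this ratio.
    ratio = ler_with_bd / ler_no_bd
    print(f"degradation ratio (ler_with_bd / ler_no_bd): {ratio:.3f}x")

    # Equivalent in form to assertLessEqual(ler_with_bd, max_ratio * ler_no_bd).
    assert ler_with_bd <= max_ratio * ler_no_bd, (
        f"LER degraded by {ratio:.3f}x, above the {max_ratio}x threshold"
    )
    return ratio
```

Called with the values from the failing A100 run (`ler_with_bd=2.3e-3`, `ler_no_bd=1.48e-3`), this sketch raises, because the ratio is slightly above 1.5x.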

Motivation

A DVS sanity run on A100-PCIE-40GB (ga100_p1001_0200, py312+cu126) observed a spurious failure: ler_with_bd=2.3e-3 vs ler_no_bd=1.48e-3, ratio 1.554x — just 0.054x above the 1.5x threshold. All other 5 Python/CUDA combos on the same GPU and all other GPU configurations (GH100, GH200, GB100, GB200, GB300) passed cleanly, confirming a stochastic flake, not a code bug.

Root cause: at N=50k and LER ~1.5e-3 the ratio estimator has ~17% standard deviation, placing the 1.5x threshold at only ~3.1σ — roughly 0.11% false-failure probability per test, or ~5% per full DVS run across all combos.

Raising to N=100k moves the threshold to ~4.3σ (<0.001% per test):

| N (samples) | Threshold distance from 1.0x | P(false fail) per test |
|---|---|---|
| 20,000 (old CI) | 1.9σ | 2.6% |
| 50,000 (old full) | 3.1σ | 0.11% |
| 100,000 (new) | 4.3σ | <0.001% |

Wall-clock cost is ~2s per run (CPU-only test), acceptable for both DVS and CI.
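
The σ and false-failure figures above can be approximately reproduced with a simple counting model. This is a sketch under stated assumptions, not necessarily how the numbers in this PR were derived: each LER estimate is assumed to see about N × LER logical errors with Poisson-like noise, the two estimates are assumed independent with a true ratio of 1.0, and the ratio is treated as normally distributed. The function names are illustrative; the 48-combo figure is taken from the commit message in this PR.

```python
import math


def ratio_rel_std(n_samples: int, ler: float) -> float:
    """Approximate relative std dev of the ratio of two independent LER estimates."""
    expected_errors = n_samples * ler  # expected logical errors per arm
    # Each arm contributes a relative variance of 1 / expected_errors.
    return math.sqrt(2.0 / expected_errors)


def sigma_distance(n_samples: int, ler: float = 1.5e-3,
                   threshold: float = 1.5) -> float:
    """Distance of the threshold from the expected 1.0x ratio, in std devs."""
    return (threshold - 1.0) / ratio_rel_std(n_samples, ler)


def false_fail_prob(n_samples: int, ler: float = 1.5e-3,
                    threshold: float = 1.5) -> float:
    """One-sided normal tail probability of exceeding the threshold by chance."""
    z = sigma_distance(n_samples, ler, threshold)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def per_run_fail_prob(p_single: float, n_combos: int = 48) -> float:
    """Probability that at least one of n_combos independent tests fails."""
    return 1.0 - (1.0 - p_single) ** n_combos


for n in (20_000, 50_000, 100_000):
    p = false_fail_prob(n)
    print(f"N={n:>7,}: {sigma_distance(n):.2f} sigma, "
          f"P(false fail)={p:.2e}, per 48-combo run={per_run_fail_prob(p):.2e}")
```

Because the relative standard deviation scales as 1/√N in this model, doubling N from 50k to 100k multiplies the σ distance by about √2.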

Test plan

  • Confirm test_ler_improves_with_bd_noise_model passes on GPU short matrix tests across all Python/CUDA combos including A100 py312+cu126
  • Confirm no other TestLERComparison tests were changed

🤖 Generated with Claude Code

ivanbasov and others added 5 commits March 30, 2026 11:54
…fault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The strict assertLess (zero tolerance) in test_ler_improves_with_bd_noise_model
is statistically fragile: at N=20000-50000 samples and LER ~1.5e-3, the ratio
estimator has ~17% std dev, so the strict inequality fails with non-trivial
probability on slow/low-memory GPUs (e.g. A100-PCIE-40GB).

Replace with assertLessEqual(ler_with_bd, 2.0 * ler_no_bd). The 2x threshold
is ~5.9σ above the expected 1.0x ratio, reducing false-failure probability to
<0.001% per run while still catching real regressions.

Also fix the degradation ratio print (was inverted: ler_no_bd/ler_with_bd).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Upstream introduced a 1.5x tolerance (replacing strict assertLess) but the
DVS sanity run on A100-PCIE-40GB / py312+cu126 shows the 1.5x threshold is
still too tight: observed ratio 1.554x (no_bd=1.48e-3, with_bd=2.3e-3),
failing by only 0.00008. All other 5 py/CUDA combos and all other GPU
configurations (GH100, GH200, GB100, GB200, GB300) passed cleanly, confirming
a stochastic flake rather than a regression.

At N=20000-50000 samples with LER ~1.5e-3 the ratio estimator has ~17% std dev,
placing the 1.5x threshold at only ~2.9σ (~1% false-failure rate per 6-combo
run). Raising to 2.0x puts it at ~5.9σ (<0.001% per run) while still catching
any genuine regression in the boundary-detector implementation.

Also updates the docstring to reference 2.0x instead of 1.5x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov requested review from bmhowe23 and huaweil-nv April 9, 2026 15:22
@bmhowe23
Collaborator

bmhowe23 commented Apr 9, 2026

How long does this test take? I am wondering if doubling the number of samples would give us enough statistical resolving power to fix the "fluke". If that doesn't take too long, I would prefer that because I am slightly uncomfortable allowing worse and worse LER measurements to sneak through this wider threshold.
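
The trade-off raised here, a wider threshold letting real degradations through, can be illustrated with a rough detection-power estimate. This is a sketch under assumptions that do not come from the PR itself: Poisson-like counting noise on each LER estimate, independent estimates, and a normal approximation for the observed ratio. The function name and the example degradation of 1.75x are hypothetical.

```python
import math


def pass_probability(true_ratio: float, n_samples: int,
                     ler_no_bd: float = 1.5e-3,
                     threshold: float = 1.5) -> float:
    """Approximate P(observed ratio <= threshold) for a given true ratio."""
    errors_no_bd = n_samples * ler_no_bd        # expected errors without BD
    errors_with_bd = errors_no_bd * true_ratio  # expected errors with BD
    # Relative variances of the two estimates add for their ratio.
    rel_std = math.sqrt(1.0 / errors_no_bd + 1.0 / errors_with_bd)
    z = (threshold - true_ratio) / (true_ratio * rel_std)
    # Standard normal CDF evaluated at z.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# Compare how often a hypothetical true 1.75x degradation would slip
# through each threshold at N=100k.
for thr in (1.5, 2.0):
    p = pass_probability(1.75, 100_000, threshold=thr)
    print(f"threshold {thr}x: P(test passes despite 1.75x degradation) = {p:.3f}")
```

Under these assumptions, a true 1.75x degradation would pass a 2.0x threshold most of the time but would rarely pass a 1.5x threshold, which is the motivation for keeping the tighter threshold and adding samples instead.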

ivanbasov and others added 2 commits April 9, 2026 10:08
With N=50k the 2.0x threshold sits at ~3.06σ giving ~0.11% false-failure
probability per test; combined across the full DVS matrix (~48 combos per
run) that is ~5% chance of a spurious failure each DVS cycle.

Raising both full-run and CI counts to 100k brings the threshold to ~4.3σ
(P < 0.001% per test) while adding only ~2s of wall-clock time per run —
an acceptable trade-off even in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tistical soundness

N=100k puts the 1.5x threshold at ~4.3σ above the expected 1.0x ratio
(P < 0.001% per test), eliminating the stochastic flakes seen on A100 with
the previous N=50k (which was only ~3.1σ, ~0.11% per test / ~5% per DVS run).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov changed the title from "fix(test): loosen BD LER tolerance to 2x to prevent stochastic flakes" to "fix(test): increase LER sample count to 100k to fix stochastic flakes on A100" Apr 10, 2026
@ivanbasov ivanbasov merged commit a61c149 into NVIDIA:main Apr 10, 2026
17 checks passed
@ivanbasov ivanbasov deleted the worktree-qa branch April 10, 2026 00:22
@huaweil-nv
Collaborator

It has been verified that it works properly with the current configuration.

ivanbasov added a commit that referenced this pull request Apr 10, 2026
… on A100 (#60)

* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* fix(test): loosen BD LER tolerance to 2x to prevent stochastic flakes

The strict assertLess (zero tolerance) in test_ler_improves_with_bd_noise_model
is statistically fragile: at N=20000-50000 samples and LER ~1.5e-3, the ratio
estimator has ~17% std dev, so the strict inequality fails with non-trivial
probability on slow/low-memory GPUs (e.g. A100-PCIE-40GB).

Replace with assertLessEqual(ler_with_bd, 2.0 * ler_no_bd). The 2x threshold
is ~5.9σ above the expected 1.0x ratio, reducing false-failure probability to
<0.001% per run while still catching real regressions.

Also fix the degradation ratio print (was inverted: ler_no_bd/ler_with_bd).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): increase LER sample count to 100k for both full and CI runs

With N=50k the 2.0x threshold sits at ~3.06σ giving ~0.11% false-failure
probability per test; combined across the full DVS matrix (~48 combos per
run) that is ~5% chance of a spurious failure each DVS cycle.

Raising both full-run and CI counts to 100k brings the threshold to ~4.3σ
(P < 0.001% per test) while adding only ~2s of wall-clock time per run —
an acceptable trade-off even in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): restore 1.5x LER threshold, backed by 100k samples for statistical soundness

N=100k puts the 1.5x threshold at ~4.3σ above the expected 1.0x ratio
(P < 0.001% per test), eliminating the stochastic flakes seen on A100 with
the previous N=50k (which was only ~3.1σ, ~0.11% per test / ~5% per DVS run).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>