test: harden two load-sensitive tests against CPU oversubscription#86
Conversation
There was a problem hiding this comment.
Sorry @leszek3737, you have reached your weekly rate limit of 500000 diff characters.
Please try again later or upgrade to continue using Sourcery
There was a problem hiding this comment.
Code Review
This pull request increases the retry attempts in debug log tests to handle cross-test interference and replaces a timed receive with a blocking receive in viewer tests to prevent failures under heavy CPU load. The reviewer recommends using a generous timeout instead of an unbounded blocking receive to avoid potential indefinite hangs in CI if the background thread fails.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| done_rx | ||
| .recv_timeout(std::time::Duration::from_secs(TEST_CHANNEL_TIMEOUT_SECS)) | ||
| .unwrap(); | ||
| .recv() | ||
| .expect("worker should send done after observing cancel"); |
There was a problem hiding this comment.
Using an unbounded blocking recv() in a unit test can cause the entire test suite to hang indefinitely in CI if the background thread panics or fails to send a message. To harden the test against CPU starvation without risking indefinite hangs, use recv_timeout with a generous timeout (e.g., 10 seconds).
| done_rx | |
| .recv_timeout(std::time::Duration::from_secs(TEST_CHANNEL_TIMEOUT_SECS)) | |
| .unwrap(); | |
| .recv() | |
| .expect("worker should send done after observing cancel"); | |
| done_rx | |
| .recv_timeout(std::time::Duration::from_secs(10)) | |
| .expect("worker should send done after observing cancel"); |
Greptile SummaryPurely test-only hardening against CPU-oversubscription flakiness; production code is untouched. Both changes are well-reasoned and correctly documented.
Confidence Score: 5/5Safe to merge — test-only changes with no production code touched. Both changes address real flakiness mechanisms with correct fixes. The blocking recv() is safe because the cancel flag is asserted true on the line immediately above it, making the worker's send guaranteed. The 50-retry budget is a pragmatic defensive measure for a test that races against process-global statics that other concurrent tests can perturb. No files require special attention. Important Files Changed
Reviews (1): Last reviewed commit: "test: harden two load-sensitive tests ag..." | Re-trigger Greptile |
|
Both tests pass. Here's my review. ReviewScope & CorrectnessTest-only changes, production code untouched — matches the PR description. Both tests pass locally. The unused The viewer change is sound: The debug_log comment is excellent: it precisely documents the failure mode (process-global Issues / Suggestions1. Unbounded done_rx
.recv_timeout(std::time::Duration::from_secs(30))
.expect("worker should send done after observing cancel");2. Root cause of debug_log flakiness is untreated ( 3. Magic number 50 ( 4. Pre-existing (not blocking): memory ordering — VerdictLow-risk, mergeable as-is. The one change I'd push for before merge is #1 — keep the deadline bounded. The rest are follow-up/nice-to-have. |
Test-only; production code untouched. An investigation (1220 suite runs under CPU saturation, zero failures) showed the suite is deterministic under a normal single-cargo run — the "flake" seen during a parallel-agent wave was shared-worktree cargo contention, not test logic. Two latent fragilities are hardened defensively: - viewer: test_viewer_loader_drop_cancels_worker waited on a detached busy-wait (yield_now) worker with recv_timeout(1s).unwrap(); under heavy load the starved worker could miss the 1s deadline. Drop sets the cancel flag (Release; worker loads Relaxed, eventual visibility guaranteed), so the worker is guaranteed to send — widen to a generous 30s hang-guard that tolerates starvation yet still fails fast on a real deadlock instead of hanging the test binary. - debug_log: log_truncates_oversized_file shares process-global LOG_FILE/CHECK_COUNTER/TEST_CACHE_HOME with concurrent production log() calls from other parallel tests. Re-create the oversized file each attempt so every retry genuinely exercises truncation, name the retry budget (RETRY_BUDGET) with a rationale, and document the interference and the per-test-redirect tech-debt follow-up.
|
All checks pass: ReviewNote: The earlier bot reviews (greptile, gemini, opencode-agent) appear to be against an older revision — they critique an unbounded Scope & correctnessTest-only, production code untouched — matches the PR description and respects the project's "no production code in test PRs" intent. Both changes are sound:
A change the PR body understatesThe diff does more than "widen the retry budget 20→50" — it moves the oversized-file creation inside the loop (was a one-shot for attempt in 0..RETRY_BUDGET {
reset_for_test();
std::fs::write(&path, vec![b'X'; (MAX_LOG_SIZE_BYTES + 1) as usize])...
log(format_args!("attempt {attempt}"));
...
}This is arguably the real fix. The pre-existing loop only ever exercised truncation on iteration 0 — once Minor nits (non-blocking)
Conventions
VerdictApprove. Low-risk, well-documented test hardening. The restructure inside the loop is a real improvement the PR body doesn't emphasize — worth knowing for the maintainer on merge, but not a blocker. |


Both are test-only changes; production code is untouched. An investigation (1220 suite runs under CPU saturation, zero failures) showed the suite is deterministic under a normal single-cargo run — the "flake" seen during a parallel-agent wave was shared-worktree cargo contention, not test logic. These two latent fragilities are hardened defensively:
viewer: test_viewer_loader_drop_cancels_worker waited on a detached busy-wait (yield_now) worker with recv_timeout(1s).unwrap(); under heavy load the starved worker could miss the 1s deadline. Drop sets the cancel flag, so the worker is guaranteed to send — switch to a blocking recv() and drop the now-dead TEST_CHANNEL_TIMEOUT_SECS constant.
debug_log: log_truncates_oversized_file shares process-global LOG_FILE/CHECK_COUNTER/TEST_CACHE_HOME with concurrent production log() calls from other parallel tests. Widen the retry budget (20 -> 50) and document why the loop must not be collapsed into a single attempt.