Skip to content

test: harden two load-sensitive tests against CPU oversubscription#86

Merged
leszek3737 merged 1 commit into
mainfrom
fa9
Jun 15, 2026
Merged

test: harden two load-sensitive tests against CPU oversubscription#86
leszek3737 merged 1 commit into
mainfrom
fa9

Conversation

@leszek3737

Copy link
Copy Markdown
Owner

Both are test-only changes; production code is untouched. An investigation (1220 suite runs under CPU saturation, zero failures) showed the suite is deterministic under a normal single-cargo run — the "flake" seen during a parallel-agent wave was shared-worktree cargo contention, not test logic. These two latent fragilities are hardened defensively:

  • viewer: test_viewer_loader_drop_cancels_worker waited on a detached busy-wait (yield_now) worker with recv_timeout(1s).unwrap(); under heavy load the starved worker could miss the 1s deadline. Drop sets the cancel flag, so the worker is guaranteed to send — switch to a blocking recv() and drop the now-dead TEST_CHANNEL_TIMEOUT_SECS constant.

  • debug_log: log_truncates_oversized_file shares process-global LOG_FILE/CHECK_COUNTER/TEST_CACHE_HOME with concurrent production log() calls from other parallel tests. Widen the retry budget (20 -> 50) and document why the loop must not be collapsed into a single attempt.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @leszek3737, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request increases the retry attempts in debug log tests to handle cross-test interference and replaces a timed receive with a blocking receive in viewer tests to prevent failures under heavy CPU load. The reviewer recommends using a generous timeout instead of an unbounded blocking receive to avoid potential indefinite hangs in CI if the background thread fails.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/ui/viewer/tests.rs
Comment on lines 94 to +96
done_rx
.recv_timeout(std::time::Duration::from_secs(TEST_CHANNEL_TIMEOUT_SECS))
.unwrap();
.recv()
.expect("worker should send done after observing cancel");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using an unbounded blocking recv() in a unit test can cause the entire test suite to hang indefinitely in CI if the background thread panics or fails to send a message. To harden the test against CPU starvation without risking indefinite hangs, use recv_timeout with a generous timeout (e.g., 10 seconds).

Suggested change
done_rx
.recv_timeout(std::time::Duration::from_secs(TEST_CHANNEL_TIMEOUT_SECS))
.unwrap();
.recv()
.expect("worker should send done after observing cancel");
done_rx
.recv_timeout(std::time::Duration::from_secs(10))
.expect("worker should send done after observing cancel");

@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown

Greptile Summary

Purely test-only hardening against CPU-oversubscription flakiness; production code is untouched. Both changes are well-reasoned and correctly documented.

  • viewer/tests.rs – replaces recv_timeout(1s).unwrap() with a blocking recv(), justified by the pre-existing assert!(cancel.load(...)) that guarantees the cancel flag is already set before the recv, making the send from the worker inevitable.
  • debug_log.rs – widens the retry budget from 20 → 50 and adds a detailed block comment explaining the process-global interference from concurrent production log() calls that TEST_MUTEX does not exclude.

Confidence Score: 5/5

Safe to merge — test-only changes with no production code touched.

Both changes address real flakiness mechanisms with correct fixes. The blocking recv() is safe because the cancel flag is asserted true on the line immediately above it, making the worker's send guaranteed. The 50-retry budget is a pragmatic defensive measure for a test that races against process-global statics that other concurrent tests can perturb.

No files require special attention.

Important Files Changed

Filename Overview
src/ui/viewer/tests.rs Removes TEST_CHANNEL_TIMEOUT_SECS and switches recv_timeout(1s).unwrap() to blocking recv().expect(). The pre-assert on the cancel flag guarantees the worker will send before recv() is called, so no hang is possible.
src/app/debug_log.rs Retry count for log_truncates_oversized_file bumped from 20 to 50 with a detailed comment explaining why concurrent log() calls from other threads perturb CHECK_COUNTER/LOG_FILE despite TEST_MUTEX. Logic is unchanged.

Reviews (1): Last reviewed commit: "test: harden two load-sensitive tests ag..." | Re-trigger Greptile

@opencode-agent

Copy link
Copy Markdown

Both tests pass. Here's my review.

Review

Scope & Correctness

Test-only changes, production code untouched — matches the PR description. Both tests pass locally. The unused TEST_CHANNEL_TIMEOUT_SECS constant is correctly removed (grep confirms no other references).

The viewer change is sound: CancellableLoader::drop at src/ui/viewer/loader.rs:76 stores cancel = true, and the worker at tests.rs:73-79 busy-waits on that flag then unconditionally calls done_tx.send(()). So a blocking recv() will eventually return — the worker is guaranteed to send.

The debug_log comment is excellent: it precisely documents the failure mode (process-global LOG_FILE/CHECK_COUNTER/TEST_CACHE_HOME shared with concurrent parallel-test debug_log! calls) and why the loop can't be collapsed.

Issues / Suggestions

1. Unbounded recv() trades a soft failure for a potential CI hang (src/ui/viewer/tests.rs:94-96)
Switching from recv_timeout(1s) to blocking recv() removes any upper bound. Today the worker is a trivial yield_now loop, so it's safe. But if a future regression breaks cancel-on-drop, or the worker grows real I/O, this test will hang the entire test binary indefinitely instead of failing fast. A generous but finite timeout preserves the load-tolerance win while bounding CI cost:

done_rx
    .recv_timeout(std::time::Duration::from_secs(30))
    .expect("worker should send done after observing cancel");

2. Root cause of debug_log flakiness is untreated (src/app/debug_log.rs)
The retry loop is a symptom. The actual problem is that debug_log! mutates process-global statics (LOG_FILE, CHECK_COUNTER, TEST_CACHE_HOME) that aren't isolated from concurrent parallel tests calling the production macro. The PR explicitly scopes this as defensive hardening — fair — but please track the design fix (e.g., a thread-local/per-test redirect gate that production log() checks first) as follow-up debt. As-is, the test remains inherently racy; "50 retries, 0 failures in 1220 runs" is empirical, not a guarantee.

3. Magic number 50 (src/app/debug_log.rs:254)
The bump from 20 → 50 isn't justified anywhere. Given sleep(5ms) per iteration, worst case is now ~250ms of sleeping plus I/O. A named constant with a one-line budget rationale (e.g., "empirically sufficient under N-way parallel cargo test; ~250ms worst case") would document the choice and make future tuning self-explanatory.

4. Pre-existing (not blocking): memory orderingdrop uses Ordering::Release while the worker loads with Ordering::Relaxed. Fine for a standalone boolean (Relaxed still guarantees eventual visibility of the flag itself), but slightly unconventional. Worth a note since this PR leans on that guarantee.

Verdict

Low-risk, mergeable as-is. The one change I'd push for before merge is #1 — keep the deadline bounded. The rest are follow-up/nice-to-have.

New%20session%20-%202026-06-15T02%3A36%3A48.357Z
opencode session  |  github run

Test-only; production code untouched. An investigation (1220 suite runs
under CPU saturation, zero failures) showed the suite is deterministic
under a normal single-cargo run — the "flake" seen during a parallel-agent
wave was shared-worktree cargo contention, not test logic. Two latent
fragilities are hardened defensively:

- viewer: test_viewer_loader_drop_cancels_worker waited on a detached
  busy-wait (yield_now) worker with recv_timeout(1s).unwrap(); under heavy
  load the starved worker could miss the 1s deadline. Drop sets the cancel
  flag (Release; worker loads Relaxed, eventual visibility guaranteed), so
  the worker is guaranteed to send — widen to a generous 30s hang-guard
  that tolerates starvation yet still fails fast on a real deadlock instead
  of hanging the test binary.

- debug_log: log_truncates_oversized_file shares process-global
  LOG_FILE/CHECK_COUNTER/TEST_CACHE_HOME with concurrent production log()
  calls from other parallel tests. Re-create the oversized file each
  attempt so every retry genuinely exercises truncation, name the retry
  budget (RETRY_BUDGET) with a rationale, and document the interference and
  the per-test-redirect tech-debt follow-up.
@opencode-agent

Copy link
Copy Markdown

All checks pass: cargo fmt --check clean, cargo clippy --all-targets -D warnings clean, both modified tests pass, and TEST_CHANNEL_TIMEOUT_SECS has no remaining references. Here's my review.


Review

Note: The earlier bot reviews (greptile, gemini, opencode-agent) appear to be against an older revision — they critique an unbounded recv() and recommend a 30s timeout, but the PR as merged already uses recv_timeout(Duration::from_secs(30)). Those suggestions are already satisfied; none of the "before merge" blockers apply.

Scope & correctness

Test-only, production code untouched — matches the PR description and respects the project's "no production code in test PRs" intent. Both changes are sound:

  • viewer/tests.rs:88-100 — Drop at loader.rs:76 stores cancel = true (Release); the worker's Relaxed load + yield_now loop is sufficient for eventual visibility of a standalone boolean (no accompanying data to synchronize). The 30s timeout preserves the load-tolerance win while bounding deadlocks. Comment correctly explains the ordering. Removal of TEST_CHANNEL_TIMEOUT_SECS is verified dead.
  • debug_log.rs:234-283 — The added block comment is excellent: it precisely names the three globals (CHECK_COUNTER, LOG_FILE, TEST_CACHE_HOME), explains why TEST_MUTEX is insufficient (it only serializes the two sibling tests), and tracks the principled fix as TODO(tech-debt).

A change the PR body understates

The diff does more than "widen the retry budget 20→50" — it moves the oversized-file creation inside the loop (was a one-shot File::create before the loop):

for attempt in 0..RETRY_BUDGET {
    reset_for_test();
    std::fs::write(&path, vec![b'X'; (MAX_LOG_SIZE_BYTES + 1) as usize])...
    log(format_args!("attempt {attempt}"));
    ...
}

This is arguably the real fix. The pre-existing loop only ever exercised truncation on iteration 0 — once log() truncated, the file was small and every subsequent iteration would just append, never re-testing the truncation path. By re-stomping the file to oversized each iteration, every retry is now a genuine truncation probe, which is why the retry budget is now meaningful. The new comment ("Creating it once before the loop would only test the first iteration…") documents this well, but the PR body should call it out as a behavioral fix, not just a budget bump.

Minor nits (non-blocking)

  1. debug_log.rs:258RETRY_BUDGET is const and referenced in the assert message via {RETRY_BUDGET} (works via inline const format args — good). Consider also deriving the comment's "~250ms" claim from the constant (50 * 5ms) — already done in the comment, nice.
  2. debug_log.rs:282 — Assert message now reads "log() never truncated the oversized file within {RETRY_BUDGET} retries". Good — accurate and self-documenting.
  3. viewer/tests.rs:91-96 — Comment is precise. One tiny thing: the comment says the worker is "guaranteed to send" — strictly, done_tx.send(()) ignores the Result, so even if the receiver were dropped the worker wouldn't panic. That's fine; the guarantee the test relies on is "the send will be attempted," which is true.

Conventions

  • No println!/dbg! introduced. ✓
  • unwrap() use is consistent with the rest of the test file (the AGENTS.md carve-out permits unwrap_used/expect_used on tests). ✓
  • No new #[allow(...)] added. ✓
  • Comments explain why, not what. ✓

Verdict

Approve. Low-risk, well-documented test hardening. The restructure inside the loop is a real improvement the PR body doesn't emphasize — worth knowing for the maintainer on merge, but not a blocker.

New%20session%20-%202026-06-15T02%3A46%3A28.938Z
opencode session  |  github run

@leszek3737 leszek3737 merged commit 8486e0b into main Jun 15, 2026
6 checks passed
@leszek3737 leszek3737 deleted the fa9 branch June 15, 2026 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant