test: harden two load-sensitive tests against CPU oversubscription by leszek3737 · Pull Request #86 · leszek3737/LibreCommander

leszek3737 · 2026-06-15T02:36:31Z

Both are test-only changes; production code is untouched. An investigation (1220 suite runs under CPU saturation, zero failures) showed the suite is deterministic under a normal single-cargo run — the "flake" seen during a parallel-agent wave was shared-worktree cargo contention, not test logic. These two latent fragilities are hardened defensively:

viewer: test_viewer_loader_drop_cancels_worker waited on a detached busy-wait (yield_now) worker with recv_timeout(1s).unwrap(); under heavy load the starved worker could miss the 1s deadline. Drop sets the cancel flag, so the worker is guaranteed to send — switch to a blocking recv() and drop the now-dead TEST_CHANNEL_TIMEOUT_SECS constant.
debug_log: log_truncates_oversized_file shares process-global LOG_FILE/CHECK_COUNTER/TEST_CACHE_HOME with concurrent production log() calls from other parallel tests. Widen the retry budget (20 -> 50) and document why the loop must not be collapsed into a single attempt.

sourcery-ai

Sorry @leszek3737, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

gemini-code-assist

Code Review

This pull request increases the retry attempts in debug log tests to handle cross-test interference and replaces a timed receive with a blocking receive in viewer tests to prevent failures under heavy CPU load. The reviewer recommends using a generous timeout instead of an unbounded blocking receive to avoid potential indefinite hangs in CI if the background thread fails.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-15T02:37:15Z

    done_rx
-        .recv_timeout(std::time::Duration::from_secs(TEST_CHANNEL_TIMEOUT_SECS))
-        .unwrap();
+        .recv()
+        .expect("worker should send done after observing cancel");


Using an unbounded blocking recv() in a unit test can cause the entire test suite to hang indefinitely in CI if the background thread panics or fails to send a message. To harden the test against CPU starvation without risking indefinite hangs, use recv_timeout with a generous timeout (e.g., 10 seconds).

Suggested change

done_rx

.recv_timeout(std::time::Duration::from_secs(TEST_CHANNEL_TIMEOUT_SECS))

.unwrap();

.recv()

.expect("worker should send done after observing cancel");

done_rx

.recv_timeout(std::time::Duration::from_secs(10))

.expect("worker should send done after observing cancel");

greptile-apps · 2026-06-15T02:39:35Z

Greptile Summary

Purely test-only hardening against CPU-oversubscription flakiness; production code is untouched. Both changes are well-reasoned and correctly documented.

viewer/tests.rs – replaces recv_timeout(1s).unwrap() with a blocking recv(), justified by the pre-existing assert!(cancel.load(...)) that guarantees the cancel flag is already set before the recv, making the send from the worker inevitable.
debug_log.rs – widens the retry budget from 20 → 50 and adds a detailed block comment explaining the process-global interference from concurrent production log() calls that TEST_MUTEX does not exclude.

Confidence Score: 5/5

Safe to merge — test-only changes with no production code touched.

Both changes address real flakiness mechanisms with correct fixes. The blocking recv() is safe because the cancel flag is asserted true on the line immediately above it, making the worker's send guaranteed. The 50-retry budget is a pragmatic defensive measure for a test that races against process-global statics that other concurrent tests can perturb.

No files require special attention.

Important Files Changed

Filename	Overview
src/ui/viewer/tests.rs	Removes TEST_CHANNEL_TIMEOUT_SECS and switches recv_timeout(1s).unwrap() to blocking recv().expect(). The pre-assert on the cancel flag guarantees the worker will send before recv() is called, so no hang is possible.
src/app/debug_log.rs	Retry count for log_truncates_oversized_file bumped from 20 to 50 with a detailed comment explaining why concurrent log() calls from other threads perturb CHECK_COUNTER/LOG_FILE despite TEST_MUTEX. Logic is unchanged.

_{Reviews (1): Last reviewed commit: "test: harden two load-sensitive tests ag..." | Re-trigger Greptile}

opencode-agent · 2026-06-15T02:43:21Z

Both tests pass. Here's my review.

Review

Scope & Correctness

Test-only changes, production code untouched — matches the PR description. Both tests pass locally. The unused TEST_CHANNEL_TIMEOUT_SECS constant is correctly removed (grep confirms no other references).

The viewer change is sound: CancellableLoader::drop at src/ui/viewer/loader.rs:76 stores cancel = true, and the worker at tests.rs:73-79 busy-waits on that flag then unconditionally calls done_tx.send(()). So a blocking recv() will eventually return — the worker is guaranteed to send.

The debug_log comment is excellent: it precisely documents the failure mode (process-global LOG_FILE/CHECK_COUNTER/TEST_CACHE_HOME shared with concurrent parallel-test debug_log! calls) and why the loop can't be collapsed.

Issues / Suggestions

1. Unbounded recv() trades a soft failure for a potential CI hang (src/ui/viewer/tests.rs:94-96)
Switching from recv_timeout(1s) to blocking recv() removes any upper bound. Today the worker is a trivial yield_now loop, so it's safe. But if a future regression breaks cancel-on-drop, or the worker grows real I/O, this test will hang the entire test binary indefinitely instead of failing fast. A generous but finite timeout preserves the load-tolerance win while bounding CI cost:

done_rx
    .recv_timeout(std::time::Duration::from_secs(30))
    .expect("worker should send done after observing cancel");

2. Root cause of debug_log flakiness is untreated (src/app/debug_log.rs)
The retry loop is a symptom. The actual problem is that debug_log! mutates process-global statics (LOG_FILE, CHECK_COUNTER, TEST_CACHE_HOME) that aren't isolated from concurrent parallel tests calling the production macro. The PR explicitly scopes this as defensive hardening — fair — but please track the design fix (e.g., a thread-local/per-test redirect gate that production log() checks first) as follow-up debt. As-is, the test remains inherently racy; "50 retries, 0 failures in 1220 runs" is empirical, not a guarantee.

3. Magic number 50 (src/app/debug_log.rs:254)
The bump from 20 → 50 isn't justified anywhere. Given sleep(5ms) per iteration, worst case is now ~250ms of sleeping plus I/O. A named constant with a one-line budget rationale (e.g., "empirically sufficient under N-way parallel cargo test; ~250ms worst case") would document the choice and make future tuning self-explanatory.

4. Pre-existing (not blocking): memory ordering — drop uses Ordering::Release while the worker loads with Ordering::Relaxed. Fine for a standalone boolean (Relaxed still guarantees eventual visibility of the flag itself), but slightly unconventional. Worth a note since this PR leans on that guarantee.

Verdict

Low-risk, mergeable as-is. The one change I'd push for before merge is #1 — keep the deadline bounded. The rest are follow-up/nice-to-have.

opencode session | github run

Test-only; production code untouched. An investigation (1220 suite runs under CPU saturation, zero failures) showed the suite is deterministic under a normal single-cargo run — the "flake" seen during a parallel-agent wave was shared-worktree cargo contention, not test logic. Two latent fragilities are hardened defensively: - viewer: test_viewer_loader_drop_cancels_worker waited on a detached busy-wait (yield_now) worker with recv_timeout(1s).unwrap(); under heavy load the starved worker could miss the 1s deadline. Drop sets the cancel flag (Release; worker loads Relaxed, eventual visibility guaranteed), so the worker is guaranteed to send — widen to a generous 30s hang-guard that tolerates starvation yet still fails fast on a real deadlock instead of hanging the test binary. - debug_log: log_truncates_oversized_file shares process-global LOG_FILE/CHECK_COUNTER/TEST_CACHE_HOME with concurrent production log() calls from other parallel tests. Re-create the oversized file each attempt so every retry genuinely exercises truncation, name the retry budget (RETRY_BUDGET) with a rationale, and document the interference and the per-test-redirect tech-debt follow-up.

opencode-agent · 2026-06-15T02:51:15Z

All checks pass: cargo fmt --check clean, cargo clippy --all-targets -D warnings clean, both modified tests pass, and TEST_CHANNEL_TIMEOUT_SECS has no remaining references. Here's my review.

Review

Note: The earlier bot reviews (greptile, gemini, opencode-agent) appear to be against an older revision — they critique an unbounded recv() and recommend a 30s timeout, but the PR as merged already uses recv_timeout(Duration::from_secs(30)). Those suggestions are already satisfied; none of the "before merge" blockers apply.

Scope & correctness

Test-only, production code untouched — matches the PR description and respects the project's "no production code in test PRs" intent. Both changes are sound:

viewer/tests.rs:88-100 — Drop at loader.rs:76 stores cancel = true (Release); the worker's Relaxed load + yield_now loop is sufficient for eventual visibility of a standalone boolean (no accompanying data to synchronize). The 30s timeout preserves the load-tolerance win while bounding deadlocks. Comment correctly explains the ordering. Removal of TEST_CHANNEL_TIMEOUT_SECS is verified dead.
debug_log.rs:234-283 — The added block comment is excellent: it precisely names the three globals (CHECK_COUNTER, LOG_FILE, TEST_CACHE_HOME), explains why TEST_MUTEX is insufficient (it only serializes the two sibling tests), and tracks the principled fix as TODO(tech-debt).

A change the PR body understates

The diff does more than "widen the retry budget 20→50" — it moves the oversized-file creation inside the loop (was a one-shot File::create before the loop):

for attempt in 0..RETRY_BUDGET {
    reset_for_test();
    std::fs::write(&path, vec![b'X'; (MAX_LOG_SIZE_BYTES + 1) as usize])...
    log(format_args!("attempt {attempt}"));
    ...
}

This is arguably the real fix. The pre-existing loop only ever exercised truncation on iteration 0 — once log() truncated, the file was small and every subsequent iteration would just append, never re-testing the truncation path. By re-stomping the file to oversized each iteration, every retry is now a genuine truncation probe, which is why the retry budget is now meaningful. The new comment ("Creating it once before the loop would only test the first iteration…") documents this well, but the PR body should call it out as a behavioral fix, not just a budget bump.

Minor nits (non-blocking)

debug_log.rs:258 — RETRY_BUDGET is const and referenced in the assert message via {RETRY_BUDGET} (works via inline const format args — good). Consider also deriving the comment's "~250ms" claim from the constant (50 * 5ms) — already done in the comment, nice.
debug_log.rs:282 — Assert message now reads "log() never truncated the oversized file within {RETRY_BUDGET} retries". Good — accurate and self-documenting.
viewer/tests.rs:91-96 — Comment is precise. One tiny thing: the comment says the worker is "guaranteed to send" — strictly, done_tx.send(()) ignores the Result, so even if the receiver were dropped the worker wouldn't panic. That's fine; the guarantee the test relies on is "the send will be attempted," which is true.

Conventions

No println!/dbg! introduced. ✓
unwrap() use is consistent with the rest of the test file (the AGENTS.md carve-out permits unwrap_used/expect_used on tests). ✓
No new #[allow(...)] added. ✓
Comments explain why, not what. ✓

Verdict

Approve. Low-risk, well-documented test hardening. The restructure inside the loop is a real improvement the PR body doesn't emphasize — worth knowing for the maintainer on merge, but not a blocker.

opencode session | github run

sourcery-ai Bot reviewed Jun 15, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 15, 2026

View reviewed changes

leszek3737 force-pushed the fa9 branch from 368f118 to 177d307 Compare June 15, 2026 02:46

leszek3737 merged commit 8486e0b into main Jun 15, 2026
6 checks passed

leszek3737 deleted the fa9 branch June 15, 2026 02:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: harden two load-sensitive tests against CPU oversubscription#86

test: harden two load-sensitive tests against CPU oversubscription#86
leszek3737 merged 1 commit into
mainfrom
fa9

leszek3737 commented Jun 15, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

greptile-apps Bot commented Jun 15, 2026

Uh oh!

opencode-agent Bot commented Jun 15, 2026

Uh oh!

opencode-agent Bot commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

leszek3737 commented Jun 15, 2026

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Jun 15, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

opencode-agent Bot commented Jun 15, 2026

Review

Scope & Correctness

Issues / Suggestions

Verdict

Uh oh!

opencode-agent Bot commented Jun 15, 2026

Review

Scope & correctness

A change the PR body understates

Minor nits (non-blocking)

Conventions

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant