Skip to content

Hardcoded 10s run-service readiness timeout aborts local dev on macOS (no config hook) #392

@atoomic

Description

@atoomic

Summary

On macOS, make test in this repo fails reliably (~10s in) with a hardcoded 10-second timeout when the harness waits for a freshly-spawned run service to announce itself over IPC. The run service eventually does come up, but the handshake window in Test2::Harness2::_wait_for_run_service_ready is too tight for Darwin's IPC::Manager::Service::Handle::ready latency. CI (Linux containers) is faster and consistently slides in under 10s, so this is invisible in automation — but every local-dev run on a Mac aborts before any test can execute.

Reproduction

$ git clone https://github.com/Test-More/Test2-Harness.git
$ cd Test2-Harness
$ perl Makefile.PL && make
$ YATH_TEST_KEEP_LOGS=1 make test

Observed: both yath test … invocations exit with status 1 after ~10s. test.pl reports:

[test.pl] yath run 1 (allow fork) completed in 10.6s (exit=256)
[test.pl] yath run 2 (T2_NO_FORK=1) completed in 10.6s (exit=256)

Inspecting the kept harness log shows the real failure sequence:

service_started
run_queued
run_started
(warn)  Collector IPC send failed (kind 'collector_artifacts') from 'collector:run-…'
(die)   timeout waiting for run service '<run_id>' to come up
(warn)  Messages waiting at disconnect for harness at …/IPC/Manager/Service/Handle.pm line 61

No per-test output is produced because the whole run aborts before the collector can launch a test.

Environment: macOS 15.3, perl 5.38.4 (perlbrew), IPC::Manager 0.000029. Timed consistently at ~10.1s from run_started to the timeout croak.

Root cause

lib/Test2/Harness2.pm:1120-1131 implements _wait_for_run_service_ready:

sub _wait_for_run_service_ready {
    my ($self, $run_id) = @_;

    my $handle   = $self->_run_service_handle($run_id);
    my $deadline = time + 10;                   # <-- hardcoded 10s
    until ($handle->ready) {
        croak "timeout waiting for run service '$run_id' to come up"
            if time > $deadline;
        tinysleep(0.02);
    }
    return $handle;
}

The 10 is a magic literal — no env-var override, no config knob, no caller-supplied cap. When $handle->ready (IPC::Manager::Service::Handle::ready) does not flip true within 10 wall-clock seconds, the harness croaks and the whole run is lost.

On the macOS box used for reproduction:

  • service_started / run_queued / run_started events all arrive well inside 1s.
  • _wait_for_run_service_ready enters, never sees ready become truthy, and croaks exactly at 10s.

The 10s budget dates from the 9ea82b795 / b8be1469a run-service aggregation refactor (RunService: aggregate test_job_* + broadcast run_state, Collector: add Observer chain + TestObserver, drop collector_exiting). Before that refactor, the harness did not need a separate run service to accept IPC before launching test collectors; the budget was chosen for Linux container scheduling where IPC sockets bind essentially instantly.

The first symptom, Collector IPC send failed (kind 'collector_artifacts') from 'collector:run-…', originates at lib/Test2/Harness2/Collector.pm:809 — the collector tries to send to the not-yet-registered run-service name, IPC::Manager queues the message, and when the harness-side timeout then fires we get the Messages waiting at disconnect warn on shutdown. That's fallout; the actual root is the hardcoded 10s cap.

Why this is not covered by existing trackers

  • #388Spawn::finish peer-gone race — is about shutdown tolerance in the caller after the harness has exited. Different code path (caller-side sync_request, not handshake), different failure shape (peer … went away / is not a valid message recipient on a dead peer), different fix (make finish/terminate tolerate peer-gone). Related only in the sense that both involve IPC-Manager peer state.
  • #391 — visible-TODO for the finish() swallow in tests — is a test-side diagnostic, also unrelated to run-service startup.

This ticket is the missing startup-race tracker.

Proposed fix

Two layers, land independently:

1. Config hook (low risk, unblocks local dev immediately)

Make the 10s cap configurable at lib/Test2/Harness2.pm:1124:

my $cap = $ENV{YATH_RUN_SERVICE_READY_TIMEOUT} // 10;
my $deadline = time + $cap;

Documented as an env override so developers on slower machines (macOS, VMs, resource-constrained containers) can raise the cap without patching source. Default remains 10s, so no behavior change on machines where it already works.

2. Investigate why Darwin is slow

Separately, understand why IPC::Manager::Service::Handle::ready takes more than 10s on macOS when it takes milliseconds on Linux:

  • Is this a select()/poll()-vs-kqueue() path difference inside IPC::Manager?
  • Does the run-service child actually finish its ready advertisement but the parent's handle not observe it? (Check by adding a one-off warn inside ready() or tinysleeping longer and inspecting the end state.)
  • Is there a real flaw where the run service's listening socket is created AFTER the IPC-Manager registration message is sent, so the parent's ready never settles?

This second layer may surface a real bug in the run-service spawn sequence rather than a pure timing issue. Worth pairing with an strace/dtruss run on both platforms.

Suggested title

Hardcoded 10s run-service readiness timeout aborts local dev on macOS (no config hook)

Affected files / line numbers

  • lib/Test2/Harness2.pm:1120-1131 — the 10s spin-wait.
  • lib/Test2/Harness2/Collector.pm:809 — the collector_artifacts send that surfaces the first warn.
  • lib/App/Yath2/Command/test.pm — the caller that loses everything when the harness croaks mid-run.

Workaround (until fixed)

None in tree. A patched local copy of Test2::Harness2.pm:1124 with my $deadline = time + 60; lets make test get past this point on the reproduction box. The config hook above is the right fix.

References

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions