Summary
On macOS, make test in this repo fails reliably (~10s in) with a hardcoded 10-second timeout when the harness waits for a freshly-spawned run service to announce itself over IPC. The run service eventually does come up, but the handshake window in Test2::Harness2::_wait_for_run_service_ready is too tight for Darwin's IPC::Manager::Service::Handle::ready latency. CI (Linux containers) is faster and consistently slides in under 10s, so this is invisible in automation — but every local-dev run on a Mac aborts before any test can execute.
Reproduction
$ git clone https://github.com/Test-More/Test2-Harness.git
$ cd Test2-Harness
$ perl Makefile.PL && make
$ YATH_TEST_KEEP_LOGS=1 make test
Observed: both yath test … invocations exit with status 1 after ~10s. test.pl reports:
[test.pl] yath run 1 (allow fork) completed in 10.6s (exit=256)
[test.pl] yath run 2 (T2_NO_FORK=1) completed in 10.6s (exit=256)
Inspecting the kept harness log shows the real failure sequence:
service_started
run_queued
run_started
(warn) Collector IPC send failed (kind 'collector_artifacts') from 'collector:run-…'
(die) timeout waiting for run service '<run_id>' to come up
(warn) Messages waiting at disconnect for harness at …/IPC/Manager/Service/Handle.pm line 61
No per-test output is produced because the whole run aborts before the collector can launch a test.
Environment: macOS 15.3, perl 5.38.4 (perlbrew), IPC::Manager 0.000029. Timed consistently at ~10.1s from run_started to the timeout croak.
Root cause
lib/Test2/Harness2.pm:1120-1131 implements _wait_for_run_service_ready:
sub _wait_for_run_service_ready {
my ($self, $run_id) = @_;
my $handle = $self->_run_service_handle($run_id);
my $deadline = time + 10; # <-- hardcoded 10s
until ($handle->ready) {
croak "timeout waiting for run service '$run_id' to come up"
if time > $deadline;
tinysleep(0.02);
}
return $handle;
}
The 10 is a magic literal — no env-var override, no config knob, no caller-supplied cap. When $handle->ready (IPC::Manager::Service::Handle::ready) does not flip true within 10 wall-clock seconds, the harness croaks and the whole run is lost.
On the macOS box used for reproduction:
service_started / run_queued / run_started events all arrive well inside 1s.
_wait_for_run_service_ready enters, never sees ready become truthy, and croaks exactly at 10s.
The 10s budget dates from the 9ea82b795 / b8be1469a run-service aggregation refactor (RunService: aggregate test_job_* + broadcast run_state, Collector: add Observer chain + TestObserver, drop collector_exiting). Before that refactor, the harness did not need a separate run service to accept IPC before launching test collectors; the budget was chosen for Linux container scheduling where IPC sockets bind essentially instantly.
The first symptom, Collector IPC send failed (kind 'collector_artifacts') from 'collector:run-…', originates at lib/Test2/Harness2/Collector.pm:809 — the collector tries to send to the not-yet-registered run-service name, IPC::Manager queues the message, and when the harness-side timeout then fires we get the Messages waiting at disconnect warn on shutdown. That's fallout; the actual root is the hardcoded 10s cap.
Why this is not covered by existing trackers
- #388 —
Spawn::finish peer-gone race — is about shutdown tolerance in the caller after the harness has exited. Different code path (caller-side sync_request, not handshake), different failure shape (peer … went away / is not a valid message recipient on a dead peer), different fix (make finish/terminate tolerate peer-gone). Related only in the sense that both involve IPC-Manager peer state.
- #391 — visible-TODO for the
finish() swallow in tests — is a test-side diagnostic, also unrelated to run-service startup.
This ticket is the missing startup-race tracker.
Proposed fix
Two layers, land independently:
1. Config hook (low risk, unblocks local dev immediately)
Make the 10s cap configurable at lib/Test2/Harness2.pm:1124:
my $cap = $ENV{YATH_RUN_SERVICE_READY_TIMEOUT} // 10;
my $deadline = time + $cap;
Documented as an env override so developers on slower machines (macOS, VMs, resource-constrained containers) can raise the cap without patching source. Default remains 10s, so no behavior change on machines where it already works.
2. Investigate why Darwin is slow
Separately, understand why IPC::Manager::Service::Handle::ready takes more than 10s on macOS when it takes milliseconds on Linux:
- Is this a
select()/poll()-vs-kqueue() path difference inside IPC::Manager?
- Does the run-service child actually finish its
ready advertisement but the parent's handle not observe it? (Check by adding a one-off warn inside ready() or tinysleeping longer and inspecting the end state.)
- Is there a real flaw where the run service's listening socket is created AFTER the IPC-Manager registration message is sent, so the parent's
ready never settles?
This second layer may surface a real bug in the run-service spawn sequence rather than a pure timing issue. Worth pairing with an strace/dtruss run on both platforms.
Suggested title
Hardcoded 10s run-service readiness timeout aborts local dev on macOS (no config hook)
Affected files / line numbers
lib/Test2/Harness2.pm:1120-1131 — the 10s spin-wait.
lib/Test2/Harness2/Collector.pm:809 — the collector_artifacts send that surfaces the first warn.
lib/App/Yath2/Command/test.pm — the caller that loses everything when the harness croaks mid-run.
Workaround (until fixed)
None in tree. A patched local copy of Test2::Harness2.pm:1124 with my $deadline = time + 60; lets make test get past this point on the reproduction box. The config hook above is the right fix.
References
Summary
On macOS,
make testin this repo fails reliably (~10s in) with a hardcoded 10-second timeout when the harness waits for a freshly-spawned run service to announce itself over IPC. The run service eventually does come up, but the handshake window inTest2::Harness2::_wait_for_run_service_readyis too tight for Darwin'sIPC::Manager::Service::Handle::readylatency. CI (Linux containers) is faster and consistently slides in under 10s, so this is invisible in automation — but every local-dev run on a Mac aborts before any test can execute.Reproduction
Observed: both
yath test …invocations exit with status 1 after ~10s.test.plreports:Inspecting the kept harness log shows the real failure sequence:
No per-test output is produced because the whole run aborts before the collector can launch a test.
Environment: macOS 15.3, perl 5.38.4 (perlbrew),
IPC::Manager 0.000029. Timed consistently at ~10.1s fromrun_startedto thetimeoutcroak.Root cause
lib/Test2/Harness2.pm:1120-1131implements_wait_for_run_service_ready:The
10is a magic literal — no env-var override, no config knob, no caller-supplied cap. When$handle->ready(IPC::Manager::Service::Handle::ready) does not flip true within 10 wall-clock seconds, the harnesscroaks and the whole run is lost.On the macOS box used for reproduction:
service_started/run_queued/run_startedevents all arrive well inside 1s._wait_for_run_service_readyenters, never seesreadybecome truthy, and croaks exactly at 10s.The 10s budget dates from the
9ea82b795/b8be1469arun-service aggregation refactor (RunService: aggregate test_job_* + broadcast run_state,Collector: add Observer chain + TestObserver, drop collector_exiting). Before that refactor, the harness did not need a separate run service to accept IPC before launching test collectors; the budget was chosen for Linux container scheduling where IPC sockets bind essentially instantly.The first symptom,
Collector IPC send failed (kind 'collector_artifacts') from 'collector:run-…', originates atlib/Test2/Harness2/Collector.pm:809— the collector tries to send to the not-yet-registered run-service name,IPC::Managerqueues the message, and when the harness-side timeout then fires we get theMessages waiting at disconnectwarn on shutdown. That's fallout; the actual root is the hardcoded 10s cap.Why this is not covered by existing trackers
Spawn::finishpeer-gone race — is about shutdown tolerance in the caller after the harness has exited. Different code path (caller-sidesync_request, not handshake), different failure shape (peer … went away/is not a valid message recipienton a dead peer), different fix (makefinish/terminatetolerate peer-gone). Related only in the sense that both involve IPC-Manager peer state.finish()swallow in tests — is a test-side diagnostic, also unrelated to run-service startup.This ticket is the missing startup-race tracker.
Proposed fix
Two layers, land independently:
1. Config hook (low risk, unblocks local dev immediately)
Make the 10s cap configurable at
lib/Test2/Harness2.pm:1124:Documented as an env override so developers on slower machines (macOS, VMs, resource-constrained containers) can raise the cap without patching source. Default remains 10s, so no behavior change on machines where it already works.
2. Investigate why Darwin is slow
Separately, understand why
IPC::Manager::Service::Handle::readytakes more than 10s on macOS when it takes milliseconds on Linux:select()/poll()-vs-kqueue()path difference insideIPC::Manager?readyadvertisement but the parent's handle not observe it? (Check by adding a one-offwarninsideready()ortinysleeping longer and inspecting the end state.)readynever settles?This second layer may surface a real bug in the run-service spawn sequence rather than a pure timing issue. Worth pairing with an
strace/dtrussrun on both platforms.Suggested title
Hardcoded 10s run-service readiness timeout aborts local dev on macOS (no config hook)Affected files / line numbers
lib/Test2/Harness2.pm:1120-1131— the 10s spin-wait.lib/Test2/Harness2/Collector.pm:809— thecollector_artifactssend that surfaces the first warn.lib/App/Yath2/Command/test.pm— the caller that loses everything when the harnesscroaks mid-run.Workaround (until fixed)
None in tree. A patched local copy of
Test2::Harness2.pm:1124withmy $deadline = time + 60;letsmake testget past this point on the reproduction box. The config hook above is the right fix.References
9ea82b795,b8be1469a,67c4c2077,a9c77f5ad.