Hardcoded 10s run-service readiness timeout aborts local dev on macOS (no config hook)

## Summary

On macOS, `make test` in this repo fails reliably (~10s in) with a hardcoded 10-second timeout when the harness waits for a freshly-spawned run service to announce itself over IPC. The run service eventually does come up, but the handshake window in `Test2::Harness2::_wait_for_run_service_ready` is too tight for Darwin's `IPC::Manager::Service::Handle::ready` latency. CI (Linux containers) is faster and consistently slides in under 10s, so this is invisible in automation — but every local-dev run on a Mac aborts before any test can execute.

## Reproduction

```
$ git clone https://github.com/Test-More/Test2-Harness.git
$ cd Test2-Harness
$ perl Makefile.PL && make
$ YATH_TEST_KEEP_LOGS=1 make test
```

Observed: both `yath test …` invocations exit with status 1 after ~10s. `test.pl` reports:

```
[test.pl] yath run 1 (allow fork) completed in 10.6s (exit=256)
[test.pl] yath run 2 (T2_NO_FORK=1) completed in 10.6s (exit=256)
```

Inspecting the kept harness log shows the real failure sequence:

```
service_started
run_queued
run_started
(warn)  Collector IPC send failed (kind 'collector_artifacts') from 'collector:run-…'
(die)   timeout waiting for run service '<run_id>' to come up
(warn)  Messages waiting at disconnect for harness at …/IPC/Manager/Service/Handle.pm line 61
```

No per-test output is produced because the whole run aborts before the collector can launch a test.

Environment: macOS 15.3, perl 5.38.4 (perlbrew), `IPC::Manager 0.000029`. Timed consistently at ~10.1s from `run_started` to the `timeout` croak.

## Root cause

`lib/Test2/Harness2.pm:1120-1131` implements `_wait_for_run_service_ready`:

```perl
sub _wait_for_run_service_ready {
    my ($self, $run_id) = @_;

    my $handle   = $self->_run_service_handle($run_id);
    my $deadline = time + 10;                   # <-- hardcoded 10s
    until ($handle->ready) {
        croak "timeout waiting for run service '$run_id' to come up"
            if time > $deadline;
        tinysleep(0.02);
    }
    return $handle;
}
```

The `10` is a magic literal — no env-var override, no config knob, no caller-supplied cap. When `$handle->ready` (`IPC::Manager::Service::Handle::ready`) does not flip true within 10 wall-clock seconds, the harness `croak`s and the whole run is lost.

On the macOS box used for reproduction:

- `service_started` / `run_queued` / `run_started` events all arrive well inside 1s.
- `_wait_for_run_service_ready` enters, never sees `ready` become truthy, and croaks exactly at 10s.

The 10s budget dates from the `9ea82b795` / `b8be1469a` run-service aggregation refactor (`RunService: aggregate test_job_* + broadcast run_state`, `Collector: add Observer chain + TestObserver, drop collector_exiting`). Before that refactor, the harness did not need a separate run service to accept IPC before launching test collectors; the budget was chosen for Linux container scheduling where IPC sockets bind essentially instantly.

The first symptom, `Collector IPC send failed (kind 'collector_artifacts') from 'collector:run-…'`, originates at `lib/Test2/Harness2/Collector.pm:809` — the collector tries to send to the not-yet-registered run-service name, `IPC::Manager` queues the message, and when the harness-side timeout then fires we get the `Messages waiting at disconnect` warn on shutdown. That's fallout; the actual root is the hardcoded 10s cap.

## Why this is not covered by existing trackers

- [#388](https://github.com/Test-More/Test2-Harness/issues/388) — `Spawn::finish` peer-gone race — is about **shutdown** tolerance in the caller after the harness has exited. Different code path (caller-side `sync_request`, not handshake), different failure shape (`peer … went away` / `is not a valid message recipient` on a dead peer), different fix (make `finish`/`terminate` tolerate peer-gone). Related only in the sense that both involve IPC-Manager peer state.
- [#391](https://github.com/Test-More/Test2-Harness/issues/391) — visible-TODO for the `finish()` swallow in tests — is a test-side diagnostic, also unrelated to run-service startup.

This ticket is the missing startup-race tracker.

## Proposed fix

Two layers, land independently:

### 1. Config hook (low risk, unblocks local dev immediately)

Make the 10s cap configurable at `lib/Test2/Harness2.pm:1124`:

```perl
my $cap = $ENV{YATH_RUN_SERVICE_READY_TIMEOUT} // 10;
my $deadline = time + $cap;
```

Documented as an env override so developers on slower machines (macOS, VMs, resource-constrained containers) can raise the cap without patching source. Default remains 10s, so no behavior change on machines where it already works.

### 2. Investigate why Darwin is slow

Separately, understand why `IPC::Manager::Service::Handle::ready` takes more than 10s on macOS when it takes milliseconds on Linux:

- Is this a `select()`/`poll()`-vs-`kqueue()` path difference inside `IPC::Manager`?
- Does the run-service child actually finish its `ready` advertisement but the parent's handle not observe it? (Check by adding a one-off `warn` inside `ready()` or `tinysleep`ing longer and inspecting the end state.)
- Is there a real flaw where the run service's listening socket is created AFTER the IPC-Manager registration message is sent, so the parent's `ready` never settles?

This second layer may surface a real bug in the run-service spawn sequence rather than a pure timing issue. Worth pairing with an `strace`/`dtruss` run on both platforms.

## Suggested title

`Hardcoded 10s run-service readiness timeout aborts local dev on macOS (no config hook)`

## Affected files / line numbers

- `lib/Test2/Harness2.pm:1120-1131` — the 10s spin-wait.
- `lib/Test2/Harness2/Collector.pm:809` — the `collector_artifacts` send that surfaces the first warn.
- `lib/App/Yath2/Command/test.pm` — the caller that loses everything when the harness `croak`s mid-run.

## Workaround (until fixed)

None in tree. A patched local copy of `Test2::Harness2.pm:1124` with `my $deadline = time + 60;` lets `make test` get past this point on the reproduction box. The config hook above is the right fix.

## References

- PR: #384 (surfaced the issue while working on CI stability; not a regression from the PR).
- Related issues: #388 (shutdown race), #391 (test-side TODO visibility).
- Commits that introduced the current run-service handshake shape: `9ea82b795`, `b8be1469a`, `67c4c2077`, `a9c77f5ad`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hardcoded 10s run-service readiness timeout aborts local dev on macOS (no config hook) #392

Summary

Reproduction

Root cause

Why this is not covered by existing trackers

Proposed fix

1. Config hook (low risk, unblocks local dev immediately)

2. Investigate why Darwin is slow

Suggested title

Affected files / line numbers

Workaround (until fixed)

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Hardcoded 10s run-service readiness timeout aborts local dev on macOS (no config hook) #392

Description

Summary

Reproduction

Root cause

Why this is not covered by existing trackers

Proposed fix

1. Config hook (low risk, unblocks local dev immediately)

2. Investigate why Darwin is slow

Suggested title

Affected files / line numbers

Workaround (until fixed)

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions