[codex] Apply managed attribution restart decisions#334
Conversation
hexinw-nvidia
left a comment
There was a problem hiding this comment.
Thanks for getting attribution wired into the restart path — the integration shape works. Before merge, I'd like to propose a design change that I think simplifies the code and yields a meaningful latency win. Happy to discuss before you spend cycles refactoring.
Proposal: trigger on the agent, decide on the rendezvous leader
Today (this PR): every agent runs _run_attribution() synchronously inside _handle_restart_decision(), then communicates the verdict to itself via self._graceful_stop_requested and re-checks it in two no-restart exit branches in _invoke_run_with_any_failed_policy.
Two consequences:
- Total recovery wall time = attribution_time + rendezvous_time (serialized).
- N agents each pay attribution cost; if backends disagree on racing inputs, agents disagree on outcome.
- The
_graceful_stop_requestedflag is read in two places because there are two no-restart exit paths feeding the same finalization logic.
Proposed: fire attribution non-blocking from _handle_restart_decision(), and read the verdict in the rendezvous "decide" step on the leader only, plumbed through the existing barrier state.
# Every agent — non-blocking trigger
def _handle_restart_decision(self, ...):
self._attribution_client.trigger_async(cycle_id=self._current_cycle_id, ...)
# No waiting. No flag. Existing restart-decision logic stays.
# Rendezvous leader, during the "decide" step
def _decide_round_outcome(self):
decision = self._attribution_client.get_decision(
cycle_id=self._current_cycle_id,
timeout=self._cfg.attribution_decision_timeout,
default=AttributionDecision.RESTART, # safe default
)
if decision is AttributionDecision.STOP:
self._barrier_state.set_shutdown(reason=GRACEFUL_ATTRIBUTION)
# … continue normal rendezvous open …Non-leader agents do nothing new — they observe shutdown via the existing rendezvous protocol and exit cleanly through RendezvousGracefulExitError, which launcher.py:501 already handles. No _graceful_stop_requested, no two-place check.
Why it's worth the refactor
- Latency overlap. Attribution work runs in parallel with rendezvous wait. Total =
max(X, Y)instead ofX + Y. For LLM-based LogSage / large FR analyses this is a real saving. - 1× attribution cost instead of N× — the leader is the single consumer.
- No racing outcomes. Single source of truth via rendezvous state.
- No side-channel flag. Verdict travels through the rendezvous protocol the same way
set_shutdown()already does. The two no-restart branches collapse to one —
no DRY violation to refactor away later. - Better failure mode. A hung attribution backend no longer blocks the agent; it blocks one bounded-timeout lookup on the leader with a safe-default fallback.
Things to nail down
- Correlation key.
NVRxCycleInfo(#258, in v0.6) is the natural carrier — addcycle_idso "agent triggered" and "leader reads" agree on which failure. - Leader identity. Confirm which rank runs
_decide_round_outcomein barrier rendezvous v2. The rank that calledopen_rendezvous()is the obvious choice; idempotent reads onattrsvcguard against leader change between trigger and read. - Decision timeout.
attribution_decision_timeout≤ rendezvous timeout (otherwise we defeat the overlap). Default small (e.g., 10 s) with safe-defaultRESTART. Surface as--ft-attribution-decision-timeout. - Trigger fan-in. N agents may trigger on the same failure —
attrsvcmust dedupe bycycle_id. - Dropped-result handling. If attribution returns STOP after the leader already opened the round for restart, log + drop — don't retro-shutdown.
- Where the trigger lives. Agent vs.
RankMonitorServer. Server-side trigger keeps_handle_restart_decisionattribution-agnostic and is my preference. - Failure modes. Unreachable
attrsvcon trigger → log + continue. Unreachable on read → leader uses safe default. Document both.
Path forward
Two options:
- Retarget this PR. The agent-side trigger call stays in
_handle_restart_decision, just non-blocking; the verdict read moves into the rendezvous decide step;_graceful_stop_requestedand the two-place check disappear. Net diff is comparable in size, IMO. - Land this as a baseline + follow-up. Merge as-is, then a v0.7 design item for the rendezvous-centralized refactor. Risk: the flag-based integration becomes the de-facto contract and harder to remove.
I'd push for (1). What do you think?
hexinw-nvidia
left a comment
There was a problem hiding this comment.
Second comment, separate concern from the latency proposal: the attribution config surface in this PR is duplicated and the terminology is inconsistent. I think it can be consolidated before merge.
What we have today (this PR + pre-existing attrsvc path)
| Concept | Type | CLI | Used for |
|---|---|---|---|
attribution_endpoint |
Optional[str] (one URL) |
--ft-attribution-endpoint (pre-existing) |
The standalone attrsvc service (#248 / #318) |
attribution_backends |
Optional[List[str]] |
--ft-attribution-backend (this PR, repeatable) |
Direct queries (mcp or HTTP) bypassing attrsvc |
Both also carry their own timeout / dry-run, both feed _validate_attribution_requires_per_cycle_applog, both have nearly identical help text — and they overlap semantically. An HTTP entry in attribution_backends is an attribution endpoint.
Specific issues
- Two ways to express the same thing.
attribution_endpoint = "http://x"andattribution_backends = ["http://x"]describe the same underlying call. No documented precedence when both are set. - "Endpoint" vs "backend" terminology. Same concept, two names. Pick one.
mcpas a magic string inside a list of URLs. Mixing"mcp"and"http://..."in one list isn't self-describing. URI-scheme tagging (e.g.,mcp:) lets argparse / YAML readers parse uniformly.attribution_timeout_seconds/attribution_dry_runscope is ambiguous. Help text says "direct attribution"; do they apply to theattribution_endpointpath too? Diff doesn't make it clear.- Multi-backend aggregation rule is implicit. "If any backend recommends STOP → stop" is in prose only; no
--ft-attribution-aggregation any|allknob.
Consolidated design: typed flags per source kind
One config block, one flag per source kind, no magic strings.
CLI:
# Use the standalone attrsvc (single URL)
--ft-attrsvc-url http://attrsvc.internal:8000
# Enable local MCP-based attribution
--ft-attribution-enable-mcp
# Add direct HTTP attribution backends (repeatable)
--ft-attribution-http-url http://other-backend:8000
# Shared knobs
--ft-attribution-timeout 60
--ft-attribution-dry-run
--ft-attribution-aggregation any
YAML:
fault_tolerance:
attribution:
attrsvc_url: http://attrsvc.internal:8000
enable_mcp: true
http_urls:
- http://other-backend:8000
timeout_seconds: 60
dry_run: false
aggregation: any # any | all (default: any)Internally the launcher folds these into a single list of source objects before dispatch — implementation stays unified, surface stays self-documenting:
# pseudocode
sources = []
if cfg.attribution.attrsvc_url:
sources.append(HttpSource(cfg.attribution.attrsvc_url, kind="attrsvc"))
if cfg.attribution.enable_mcp:
sources.append(McpSource())
for url in cfg.attribution.http_urls or []:
sources.append(HttpSource(url, kind="direct"))What that buys
- One config block, self-documenting flags. Every flag's name tells you what kind of thing it configures. No literal
mcpmasquerading as a URL, no scheme-suffix syntax to memorize. - Shared timeout / dry-run / aggregation across all source kinds — orthogonal to source kind, as they should be.
- One validation block, one merge block, one dispatch path.
_validate_attribution_requires_per_cycle_applog's double-check collapses to one branch. - Composes with the latency proposal. The "trigger on agent, decide on leader" design operates on this same source list — no second redesign later.
Tradeoff
Adding a new source kind in the future requires a new flag, not a new scheme handler. For a small fixed set (attrsvc, mcp, http) this is fine. If you anticipate many kinds, a single repeatable --ft-attribution-source URL flag (with URI-scheme dispatch) scales better — but for today's three kinds, typed flags win on clarity.
Migration
--ft-attribution-endpoint→ folded into--ft-attrsvc-url. Keep accepting it for one release with aDeprecationWarning; remove in v0.7.--ft-attribution-backend→ folded into--ft-attribution-http-url(when the value is a URL) or--ft-attribution-enable-mcp(when the value is"mcp"). Same one-release deprecation window.
Greptile SummaryThis PR routes all fault-tolerance attribution restart-veto decisions through the launcher-managed
Confidence Score: 5/5The change is safe to merge; the attribution veto path is well-isolated, the flag lifecycle is correctly bounded to a single loop iteration, and the refactored validation function is semantically equivalent to what it replaces. The new code path is narrow: No files require special attention, though Important Files Changed
Reviews (2): Last reviewed commit: "Fix managed attribution restart test" | Re-trigger Greptile |
| def test_handle_restart_decision_stops_when_attribution_service_says_stop(self): | ||
| """Returns False before restart when the managed attribution service recommends stop.""" | ||
| agent = self._make_agent() | ||
| agent._progress_tracker = MagicMock() | ||
| agent._remaining_restarts = 2 | ||
| agent._rdzv_handler._attribution_service = MagicMock() | ||
| agent._rdzv_handler._attribution_service.get_last_result.return_value = True | ||
| agent._rdzv_handler._barrier_state = MagicMock() | ||
|
|
||
| with patch.object(agent, '_restart_workers') as mock_restart: | ||
| result = agent._handle_restart_decision( | ||
| role="test", spec=self.spec, log_msg="[%s] restarting" | ||
| ) | ||
|
|
||
| self.assertFalse(result) | ||
| self.assertTrue(agent._graceful_stop_requested) | ||
| agent._rdzv_handler._barrier_state.set_shutdown.assert_called_once() | ||
| agent._progress_tracker.analyze_previous_cycle.assert_not_called() | ||
| mock_restart.assert_not_called() |
There was a problem hiding this comment.
Attribution veto path never exercised — test assertions will fail at runtime
_make_agent() constructs LocalElasticAgent with the default is_store_host=False. Because _run_attribution() returns False immediately when not self._is_store_host, the attribution veto branch in _handle_restart_decision is never entered. The mock's get_last_result is never called, _graceful_stop_requested stays False, and set_shutdown is never invoked. As a result three of the four assertions would fail: self.assertTrue(agent._graceful_stop_requested), _barrier_state.set_shutdown.assert_called_once(), and _progress_tracker.analyze_previous_cycle.assert_not_called(). The test needs is_store_host=True passed to LocalElasticAgent (or agent._is_store_host = True set before the call) for the intended code path to be reachable.
| def test_handle_restart_decision_stops_when_attribution_service_says_stop(self): | |
| """Returns False before restart when the managed attribution service recommends stop.""" | |
| agent = self._make_agent() | |
| agent._progress_tracker = MagicMock() | |
| agent._remaining_restarts = 2 | |
| agent._rdzv_handler._attribution_service = MagicMock() | |
| agent._rdzv_handler._attribution_service.get_last_result.return_value = True | |
| agent._rdzv_handler._barrier_state = MagicMock() | |
| with patch.object(agent, '_restart_workers') as mock_restart: | |
| result = agent._handle_restart_decision( | |
| role="test", spec=self.spec, log_msg="[%s] restarting" | |
| ) | |
| self.assertFalse(result) | |
| self.assertTrue(agent._graceful_stop_requested) | |
| agent._rdzv_handler._barrier_state.set_shutdown.assert_called_once() | |
| agent._progress_tracker.analyze_previous_cycle.assert_not_called() | |
| mock_restart.assert_not_called() | |
| def test_handle_restart_decision_stops_when_attribution_service_says_stop(self): | |
| """Returns False before restart when the managed attribution service recommends stop.""" | |
| agent = self._make_agent() | |
| agent._is_store_host = True | |
| agent._progress_tracker = MagicMock() |
cf58a44 to
dfd42f6
Compare
Summary
AttributionServicefor the previously submitted cycle log before spending a restart.Validation
git diff --checkPYTHONPYCACHEPREFIX=/private/tmp/nvrx-pycache-managed-only python3 -m compileall -q src/nvidia_resiliency_ext/fault_tolerance tests/fault_tolerance/unit/test_launcher.py tests/fault_tolerance/unit/test_config.py tests/fault_tolerance/unit/test_attribution_manager.pyblack --check/isort --check-only/ruff checkon the touched Python files using the pinned lint tools in/private/tmp/nvrx-lint-toolsPYTHONPATH=src:/private/tmp/nvrx-test-tools, but local collection is blocked because this Python environment is missingtorch.Scope
This PR intentionally leaves out direct attribution backends, timing instrumentation, peer-abort signaling, Slack/dataflow wiring, attribution runner changes, and worker lifecycle hardening.