[codex] Apply managed attribution restart decisions by namitdhameja · Pull Request #334 · NVIDIA/nvidia-resiliency-ext

namitdhameja · 2026-05-13T16:29:09Z

Summary

Use the launcher-managed attribution service as the only FT attribution restart-decision path.
Query the rendezvous-owned AttributionService for the previously submitted cycle log before spending a restart.
Remove the direct backend CLI/YAML/config surface from the PR and keep unit coverage on the managed-service veto behavior.

Validation

git diff --check
PYTHONPYCACHEPREFIX=/private/tmp/nvrx-pycache-managed-only python3 -m compileall -q src/nvidia_resiliency_ext/fault_tolerance tests/fault_tolerance/unit/test_launcher.py tests/fault_tolerance/unit/test_config.py tests/fault_tolerance/unit/test_attribution_manager.py
black --check / isort --check-only / ruff check on the touched Python files using the pinned lint tools in /private/tmp/nvrx-lint-tools
Focused pytest was attempted with PYTHONPATH=src:/private/tmp/nvrx-test-tools, but local collection is blocked because this Python environment is missing torch.

Scope

This PR intentionally leaves out direct attribution backends, timing instrumentation, peer-abort signaling, Slack/dataflow wiring, attribution runner changes, and worker lifecycle hardening.

hexinw-nvidia

Thanks for getting attribution wired into the restart path — the integration shape works. Before merge, I'd like to propose a design change that I think simplifies the code and yields a meaningful latency win. Happy to discuss before you spend cycles refactoring.

Proposal: trigger on the agent, decide on the rendezvous leader

Today (this PR): every agent runs _run_attribution() synchronously inside _handle_restart_decision(), then communicates the verdict to itself via self._graceful_stop_requested and re-checks it in two no-restart exit branches in _invoke_run_with_any_failed_policy.

Two consequences:

Total recovery wall time = attribution_time + rendezvous_time (serialized).
N agents each pay attribution cost; if backends disagree on racing inputs, agents disagree on outcome.
The _graceful_stop_requested flag is read in two places because there are two no-restart exit paths feeding the same finalization logic.

Proposed: fire attribution non-blocking from _handle_restart_decision(), and read the verdict in the rendezvous "decide" step on the leader only, plumbed through the existing barrier state.

# Every agent — non-blocking trigger
def _handle_restart_decision(self, ...):
    self._attribution_client.trigger_async(cycle_id=self._current_cycle_id, ...)
    # No waiting. No flag. Existing restart-decision logic stays.

# Rendezvous leader, during the "decide" step
def _decide_round_outcome(self):
    decision = self._attribution_client.get_decision(
        cycle_id=self._current_cycle_id,
        timeout=self._cfg.attribution_decision_timeout,
        default=AttributionDecision.RESTART,   # safe default
    )
    if decision is AttributionDecision.STOP:
        self._barrier_state.set_shutdown(reason=GRACEFUL_ATTRIBUTION)
    # … continue normal rendezvous open …

Non-leader agents do nothing new — they observe shutdown via the existing rendezvous protocol and exit cleanly through RendezvousGracefulExitError, which launcher.py:501 already handles. No _graceful_stop_requested, no two-place check.

Why it's worth the refactor

Latency overlap. Attribution work runs in parallel with rendezvous wait. Total = max(X, Y) instead of X + Y. For LLM-based LogSage / large FR analyses this is a real saving.
1× attribution cost instead of N× — the leader is the single consumer.
No racing outcomes. Single source of truth via rendezvous state.
No side-channel flag. Verdict travels through the rendezvous protocol the same way set_shutdown() already does. The two no-restart branches collapse to one —
no DRY violation to refactor away later.
Better failure mode. A hung attribution backend no longer blocks the agent; it blocks one bounded-timeout lookup on the leader with a safe-default fallback.

Things to nail down

Correlation key. NVRxCycleInfo (#258, in v0.6) is the natural carrier — add cycle_id so "agent triggered" and "leader reads" agree on which failure.
Leader identity. Confirm which rank runs _decide_round_outcome in barrier rendezvous v2. The rank that called open_rendezvous() is the obvious choice; idempotent reads on attrsvc guard against leader change between trigger and read.
Decision timeout. attribution_decision_timeout ≤ rendezvous timeout (otherwise we defeat the overlap). Default small (e.g., 10 s) with safe-default RESTART. Surface as --ft-attribution-decision-timeout.
Trigger fan-in. N agents may trigger on the same failure — attrsvc must dedupe by cycle_id.
Dropped-result handling. If attribution returns STOP after the leader already opened the round for restart, log + drop — don't retro-shutdown.
Where the trigger lives. Agent vs. RankMonitorServer. Server-side trigger keeps _handle_restart_decision attribution-agnostic and is my preference.
Failure modes. Unreachable attrsvc on trigger → log + continue. Unreachable on read → leader uses safe default. Document both.

Path forward

Two options:

Retarget this PR. The agent-side trigger call stays in _handle_restart_decision, just non-blocking; the verdict read moves into the rendezvous decide step; _graceful_stop_requested and the two-place check disappear. Net diff is comparable in size, IMO.
Land this as a baseline + follow-up. Merge as-is, then a v0.7 design item for the rendezvous-centralized refactor. Risk: the flag-based integration becomes the de-facto contract and harder to remove.

I'd push for (1). What do you think?

hexinw-nvidia

Second comment, separate concern from the latency proposal: the attribution config surface in this PR is duplicated and the terminology is inconsistent. I think it can be consolidated before merge.

What we have today (this PR + pre-existing `attrsvc` path)

Concept	Type	CLI	Used for
`attribution_endpoint`	`Optional[str]` (one URL)	`--ft-attribution-endpoint` (pre-existing)	The standalone `attrsvc` service (#248 / #318)
`attribution_backends`	`Optional[List[str]]`	`--ft-attribution-backend` (this PR, repeatable)	Direct queries (`mcp` or HTTP) bypassing `attrsvc`

Both also carry their own timeout / dry-run, both feed _validate_attribution_requires_per_cycle_applog, both have nearly identical help text — and they overlap semantically. An HTTP entry in attribution_backends is an attribution endpoint.

Specific issues

Two ways to express the same thing. attribution_endpoint = "http://x" and attribution_backends = ["http://x"] describe the same underlying call. No documented precedence when both are set.
"Endpoint" vs "backend" terminology. Same concept, two names. Pick one.
mcp as a magic string inside a list of URLs. Mixing "mcp" and "http://..." in one list isn't self-describing. URI-scheme tagging (e.g., mcp:) lets argparse / YAML readers parse uniformly.
attribution_timeout_seconds / attribution_dry_run scope is ambiguous. Help text says "direct attribution"; do they apply to the attribution_endpoint path too? Diff doesn't make it clear.
Multi-backend aggregation rule is implicit. "If any backend recommends STOP → stop" is in prose only; no --ft-attribution-aggregation any|all knob.

Consolidated design: typed flags per source kind

One config block, one flag per source kind, no magic strings.

CLI:

# Use the standalone attrsvc (single URL)
--ft-attrsvc-url http://attrsvc.internal:8000

# Enable local MCP-based attribution
--ft-attribution-enable-mcp

# Add direct HTTP attribution backends (repeatable)
--ft-attribution-http-url http://other-backend:8000

# Shared knobs
--ft-attribution-timeout 60
--ft-attribution-dry-run
--ft-attribution-aggregation any

YAML:

fault_tolerance:
  attribution:
    attrsvc_url: http://attrsvc.internal:8000
    enable_mcp: true
    http_urls:
      - http://other-backend:8000
    timeout_seconds: 60
    dry_run: false
    aggregation: any        # any | all (default: any)

Internally the launcher folds these into a single list of source objects before dispatch — implementation stays unified, surface stays self-documenting:

# pseudocode
sources = []
if cfg.attribution.attrsvc_url:
    sources.append(HttpSource(cfg.attribution.attrsvc_url, kind="attrsvc"))
if cfg.attribution.enable_mcp:
    sources.append(McpSource())
for url in cfg.attribution.http_urls or []:
    sources.append(HttpSource(url, kind="direct"))

What that buys

One config block, self-documenting flags. Every flag's name tells you what kind of thing it configures. No literal mcp masquerading as a URL, no scheme-suffix syntax to memorize.
Shared timeout / dry-run / aggregation across all source kinds — orthogonal to source kind, as they should be.
One validation block, one merge block, one dispatch path. _validate_attribution_requires_per_cycle_applog's double-check collapses to one branch.
Composes with the latency proposal. The "trigger on agent, decide on leader" design operates on this same source list — no second redesign later.

Tradeoff

Adding a new source kind in the future requires a new flag, not a new scheme handler. For a small fixed set (attrsvc, mcp, http) this is fine. If you anticipate many kinds, a single repeatable --ft-attribution-source URL flag (with URI-scheme dispatch) scales better — but for today's three kinds, typed flags win on clarity.

Migration

--ft-attribution-endpoint → folded into --ft-attrsvc-url. Keep accepting it for one release with a DeprecationWarning; remove in v0.7.
--ft-attribution-backend → folded into --ft-attribution-http-url (when the value is a URL) or --ft-attribution-enable-mcp (when the value is "mcp"). Same one-release deprecation window.

greptile-apps · 2026-05-13T19:33:12Z

Greptile Summary

This PR routes all fault-tolerance attribution restart-veto decisions through the launcher-managed AttributionService held by the rendezvous handler, removing any direct backend path from _handle_restart_decision. It adds a _graceful_stop_requested flag that lets the veto propagate cleanly out of the monitoring loop as a RendezvousGracefulExitError.

New _run_attribution() helper checks attribution only on the store-host node, reads _attribution_service.get_last_result() from the rendezvous handler, and calls _barrier_state.set_shutdown() before returning False from the decision method.
_handle_restart_decision update short-circuits before progress tracking when attribution says stop, setting _graceful_stop_requested = True; callers in both the UNHEALTHY/FAILED and HEALTHY paths then raise RendezvousGracefulExitError after stopping workers.
_validate_attribution_requires_per_cycle_applog refactor is a semantically-equivalent restructuring (early-return style) of the original boolean compound condition.

Confidence Score: 5/5

The change is safe to merge; the attribution veto path is well-isolated, the flag lifecycle is correctly bounded to a single loop iteration, and the refactored validation function is semantically equivalent to what it replaces.

The new code path is narrow: _run_attribution is a one-level getattr call with two early-exits, and _graceful_stop_requested is always cleared before any exception propagates. Existing tests for _validate_attribution_requires_per_cycle_applog cover all four input combinations. The new test correctly exercises the veto branch by setting _is_store_host = True, which was the gap flagged in a previous review round.

No files require special attention, though launcher.py carries the is True identity check that could silently drop a stop decision if the attribution service API contract ever widens beyond strict Python booleans.

Important Files Changed

Filename	Overview
src/nvidia_resiliency_ext/fault_tolerance/launcher.py	Adds `_graceful_stop_requested` flag, `_run_attribution()` helper, and attribution veto path in `_handle_restart_decision`; also refactors `_validate_attribution_requires_per_cycle_applog` with equivalent semantics but using `is True` identity check for the stop decision.
tests/fault_tolerance/unit/test_launcher.py	Pins `_attribution_service = None` on the shared mock spec, adds a well-structured test for the attribution veto path, and correctly sets `_is_store_host = True` to exercise the new code branch.

_{Reviews (2): Last reviewed commit: "Fix managed attribution restart test" | Re-trigger Greptile}

greptile-apps · 2026-05-13T19:33:16Z

+    def test_handle_restart_decision_stops_when_attribution_service_says_stop(self):
+        """Returns False before restart when the managed attribution service recommends stop."""
+        agent = self._make_agent()
+        agent._progress_tracker = MagicMock()
+        agent._remaining_restarts = 2
+        agent._rdzv_handler._attribution_service = MagicMock()
+        agent._rdzv_handler._attribution_service.get_last_result.return_value = True
+        agent._rdzv_handler._barrier_state = MagicMock()
+
+        with patch.object(agent, '_restart_workers') as mock_restart:
+            result = agent._handle_restart_decision(
+                role="test", spec=self.spec, log_msg="[%s] restarting"
+            )
+
+        self.assertFalse(result)
+        self.assertTrue(agent._graceful_stop_requested)
+        agent._rdzv_handler._barrier_state.set_shutdown.assert_called_once()
+        agent._progress_tracker.analyze_previous_cycle.assert_not_called()
+        mock_restart.assert_not_called()


Attribution veto path never exercised — test assertions will fail at runtime

_make_agent() constructs LocalElasticAgent with the default is_store_host=False. Because _run_attribution() returns False immediately when not self._is_store_host, the attribution veto branch in _handle_restart_decision is never entered. The mock's get_last_result is never called, _graceful_stop_requested stays False, and set_shutdown is never invoked. As a result three of the four assertions would fail: self.assertTrue(agent._graceful_stop_requested), _barrier_state.set_shutdown.assert_called_once(), and _progress_tracker.analyze_previous_cycle.assert_not_called(). The test needs is_store_host=True passed to LocalElasticAgent (or agent._is_store_host = True set before the call) for the intended code path to be reachable.

Suggested change

def test_handle_restart_decision_stops_when_attribution_service_says_stop(self):

"""Returns False before restart when the managed attribution service recommends stop."""

agent = self._make_agent()

agent._progress_tracker = MagicMock()

agent._remaining_restarts = 2

agent._rdzv_handler._attribution_service = MagicMock()

agent._rdzv_handler._attribution_service.get_last_result.return_value = True

agent._rdzv_handler._barrier_state = MagicMock()

with patch.object(agent, '_restart_workers') as mock_restart:

result = agent._handle_restart_decision(

role="test", spec=self.spec, log_msg="[%s] restarting"

)

self.assertFalse(result)

self.assertTrue(agent._graceful_stop_requested)

agent._rdzv_handler._barrier_state.set_shutdown.assert_called_once()

agent._progress_tracker.analyze_previous_cycle.assert_not_called()

mock_restart.assert_not_called()

def test_handle_restart_decision_stops_when_attribution_service_says_stop(self):

"""Returns False before restart when the managed attribution service recommends stop."""

agent = self._make_agent()

agent._is_store_host = True

agent._progress_tracker = MagicMock()

namitdhameja changed the title ~~[codex] Integrate attribution restart decisions~~ [codex] Integrate FT attribution restart decisions May 13, 2026

hexinw-nvidia reviewed May 13, 2026

View reviewed changes

namitdhameja changed the title ~~[codex] Integrate FT attribution restart decisions~~ [codex] Apply managed attribution restart decisions May 13, 2026

hexinw-nvidia reviewed May 13, 2026

View reviewed changes

hexinw-nvidia marked this pull request as ready for review May 13, 2026 19:30

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

namitdhameja marked this pull request as draft May 13, 2026 19:40

Apply managed attribution restart decisions

dfd42f6

namitdhameja force-pushed the ft-attribution-integration branch from cf58a44 to dfd42f6 Compare May 14, 2026 16:42

Wait for GPU memory only before restart

2175d94

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[codex] Apply managed attribution restart decisions#334

[codex] Apply managed attribution restart decisions#334
namitdhameja wants to merge 2 commits into
mainfrom
ft-attribution-integration

namitdhameja commented May 13, 2026 •

edited

Loading

Uh oh!

hexinw-nvidia left a comment

Uh oh!

hexinw-nvidia left a comment

Uh oh!

greptile-apps Bot commented May 13, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

namitdhameja commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Scope

Uh oh!

hexinw-nvidia left a comment

Choose a reason for hiding this comment

Proposal: trigger on the agent, decide on the rendezvous leader

Why it's worth the refactor

Things to nail down

Path forward

Uh oh!

hexinw-nvidia left a comment

Choose a reason for hiding this comment

What we have today (this PR + pre-existing attrsvc path)

Specific issues

Consolidated design: typed flags per source kind

What that buys

Tradeoff

Migration

Uh oh!

greptile-apps Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

greptile-apps Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

namitdhameja commented May 13, 2026 •

edited

Loading

What we have today (this PR + pre-existing `attrsvc` path)

greptile-apps Bot commented May 13, 2026 •

edited

Loading