Skip to content

[codex] Apply managed attribution restart decisions#334

Draft
namitdhameja wants to merge 2 commits into
mainfrom
ft-attribution-integration
Draft

[codex] Apply managed attribution restart decisions#334
namitdhameja wants to merge 2 commits into
mainfrom
ft-attribution-integration

Conversation

@namitdhameja

@namitdhameja namitdhameja commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Use the launcher-managed attribution service as the only FT attribution restart-decision path.
  • Query the rendezvous-owned AttributionService for the previously submitted cycle log before spending a restart.
  • Remove the direct backend CLI/YAML/config surface from the PR and keep unit coverage on the managed-service veto behavior.

Validation

  • git diff --check
  • PYTHONPYCACHEPREFIX=/private/tmp/nvrx-pycache-managed-only python3 -m compileall -q src/nvidia_resiliency_ext/fault_tolerance tests/fault_tolerance/unit/test_launcher.py tests/fault_tolerance/unit/test_config.py tests/fault_tolerance/unit/test_attribution_manager.py
  • black --check / isort --check-only / ruff check on the touched Python files using the pinned lint tools in /private/tmp/nvrx-lint-tools
  • Focused pytest was attempted with PYTHONPATH=src:/private/tmp/nvrx-test-tools, but local collection is blocked because this Python environment is missing torch.

Scope

This PR intentionally leaves out direct attribution backends, timing instrumentation, peer-abort signaling, Slack/dataflow wiring, attribution runner changes, and worker lifecycle hardening.

@namitdhameja namitdhameja changed the title [codex] Integrate attribution restart decisions [codex] Integrate FT attribution restart decisions May 13, 2026

@hexinw-nvidia hexinw-nvidia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting attribution wired into the restart path — the integration shape works. Before merge, I'd like to propose a design change that I think simplifies the code and yields a meaningful latency win. Happy to discuss before you spend cycles refactoring.

Proposal: trigger on the agent, decide on the rendezvous leader

Today (this PR): every agent runs _run_attribution() synchronously inside _handle_restart_decision(), then communicates the verdict to itself via self._graceful_stop_requested and re-checks it in two no-restart exit branches in _invoke_run_with_any_failed_policy.

Two consequences:

  1. Total recovery wall time = attribution_time + rendezvous_time (serialized).
  2. N agents each pay attribution cost; if backends disagree on racing inputs, agents disagree on outcome.
  3. The _graceful_stop_requested flag is read in two places because there are two no-restart exit paths feeding the same finalization logic.

Proposed: fire attribution non-blocking from _handle_restart_decision(), and read the verdict in the rendezvous "decide" step on the leader only, plumbed through the existing barrier state.

# Every agent — non-blocking trigger
def _handle_restart_decision(self, ...):
    self._attribution_client.trigger_async(cycle_id=self._current_cycle_id, ...)
    # No waiting. No flag. Existing restart-decision logic stays.

# Rendezvous leader, during the "decide" step
def _decide_round_outcome(self):
    decision = self._attribution_client.get_decision(
        cycle_id=self._current_cycle_id,
        timeout=self._cfg.attribution_decision_timeout,
        default=AttributionDecision.RESTART,   # safe default
    )
    if decision is AttributionDecision.STOP:
        self._barrier_state.set_shutdown(reason=GRACEFUL_ATTRIBUTION)
    # … continue normal rendezvous open …

Non-leader agents do nothing new — they observe shutdown via the existing rendezvous protocol and exit cleanly through RendezvousGracefulExitError, which launcher.py:501 already handles. No _graceful_stop_requested, no two-place check.

Why it's worth the refactor

  1. Latency overlap. Attribution work runs in parallel with rendezvous wait. Total = max(X, Y) instead of X + Y. For LLM-based LogSage / large FR analyses this is a real saving.
  2. 1× attribution cost instead of N× — the leader is the single consumer.
  3. No racing outcomes. Single source of truth via rendezvous state.
  4. No side-channel flag. Verdict travels through the rendezvous protocol the same way set_shutdown() already does. The two no-restart branches collapse to one —
    no DRY violation to refactor away later.
  5. Better failure mode. A hung attribution backend no longer blocks the agent; it blocks one bounded-timeout lookup on the leader with a safe-default fallback.

Things to nail down

  • Correlation key. NVRxCycleInfo (#258, in v0.6) is the natural carrier — add cycle_id so "agent triggered" and "leader reads" agree on which failure.
  • Leader identity. Confirm which rank runs _decide_round_outcome in barrier rendezvous v2. The rank that called open_rendezvous() is the obvious choice; idempotent reads on attrsvc guard against leader change between trigger and read.
  • Decision timeout. attribution_decision_timeout ≤ rendezvous timeout (otherwise we defeat the overlap). Default small (e.g., 10 s) with safe-default RESTART. Surface as --ft-attribution-decision-timeout.
  • Trigger fan-in. N agents may trigger on the same failure — attrsvc must dedupe by cycle_id.
  • Dropped-result handling. If attribution returns STOP after the leader already opened the round for restart, log + drop — don't retro-shutdown.
  • Where the trigger lives. Agent vs. RankMonitorServer. Server-side trigger keeps _handle_restart_decision attribution-agnostic and is my preference.
  • Failure modes. Unreachable attrsvc on trigger → log + continue. Unreachable on read → leader uses safe default. Document both.

Path forward

Two options:

  1. Retarget this PR. The agent-side trigger call stays in _handle_restart_decision, just non-blocking; the verdict read moves into the rendezvous decide step; _graceful_stop_requested and the two-place check disappear. Net diff is comparable in size, IMO.
  2. Land this as a baseline + follow-up. Merge as-is, then a v0.7 design item for the rendezvous-centralized refactor. Risk: the flag-based integration becomes the de-facto contract and harder to remove.

I'd push for (1). What do you think?

@namitdhameja namitdhameja changed the title [codex] Integrate FT attribution restart decisions [codex] Apply managed attribution restart decisions May 13, 2026

@hexinw-nvidia hexinw-nvidia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second comment, separate concern from the latency proposal: the attribution config surface in this PR is duplicated and the terminology is inconsistent. I think it can be consolidated before merge.

What we have today (this PR + pre-existing attrsvc path)

Concept Type CLI Used for
attribution_endpoint Optional[str] (one URL) --ft-attribution-endpoint (pre-existing) The standalone attrsvc service (#248 / #318)
attribution_backends Optional[List[str]] --ft-attribution-backend (this PR, repeatable) Direct queries (mcp or HTTP) bypassing attrsvc

Both also carry their own timeout / dry-run, both feed _validate_attribution_requires_per_cycle_applog, both have nearly identical help text — and they overlap semantically. An HTTP entry in attribution_backends is an attribution endpoint.

Specific issues

  1. Two ways to express the same thing. attribution_endpoint = "http://x" and attribution_backends = ["http://x"] describe the same underlying call. No documented precedence when both are set.
  2. "Endpoint" vs "backend" terminology. Same concept, two names. Pick one.
  3. mcp as a magic string inside a list of URLs. Mixing "mcp" and "http://..." in one list isn't self-describing. URI-scheme tagging (e.g., mcp:) lets argparse / YAML readers parse uniformly.
  4. attribution_timeout_seconds / attribution_dry_run scope is ambiguous. Help text says "direct attribution"; do they apply to the attribution_endpoint path too? Diff doesn't make it clear.
  5. Multi-backend aggregation rule is implicit. "If any backend recommends STOP → stop" is in prose only; no --ft-attribution-aggregation any|all knob.

Consolidated design: typed flags per source kind

One config block, one flag per source kind, no magic strings.

CLI:

# Use the standalone attrsvc (single URL)
--ft-attrsvc-url http://attrsvc.internal:8000

# Enable local MCP-based attribution
--ft-attribution-enable-mcp

# Add direct HTTP attribution backends (repeatable)
--ft-attribution-http-url http://other-backend:8000

# Shared knobs
--ft-attribution-timeout 60
--ft-attribution-dry-run
--ft-attribution-aggregation any

YAML:

fault_tolerance:
  attribution:
    attrsvc_url: http://attrsvc.internal:8000
    enable_mcp: true
    http_urls:
      - http://other-backend:8000
    timeout_seconds: 60
    dry_run: false
    aggregation: any        # any | all (default: any)

Internally the launcher folds these into a single list of source objects before dispatch — implementation stays unified, surface stays self-documenting:

# pseudocode
sources = []
if cfg.attribution.attrsvc_url:
    sources.append(HttpSource(cfg.attribution.attrsvc_url, kind="attrsvc"))
if cfg.attribution.enable_mcp:
    sources.append(McpSource())
for url in cfg.attribution.http_urls or []:
    sources.append(HttpSource(url, kind="direct"))

What that buys

  • One config block, self-documenting flags. Every flag's name tells you what kind of thing it configures. No literal mcp masquerading as a URL, no scheme-suffix syntax to memorize.
  • Shared timeout / dry-run / aggregation across all source kinds — orthogonal to source kind, as they should be.
  • One validation block, one merge block, one dispatch path. _validate_attribution_requires_per_cycle_applog's double-check collapses to one branch.
  • Composes with the latency proposal. The "trigger on agent, decide on leader" design operates on this same source list — no second redesign later.

Tradeoff

Adding a new source kind in the future requires a new flag, not a new scheme handler. For a small fixed set (attrsvc, mcp, http) this is fine. If you anticipate many kinds, a single repeatable --ft-attribution-source URL flag (with URI-scheme dispatch) scales better — but for today's three kinds, typed flags win on clarity.

Migration

  • --ft-attribution-endpoint → folded into --ft-attrsvc-url. Keep accepting it for one release with a DeprecationWarning; remove in v0.7.
  • --ft-attribution-backend → folded into --ft-attribution-http-url (when the value is a URL) or --ft-attribution-enable-mcp (when the value is "mcp"). Same one-release deprecation window.

@hexinw-nvidia hexinw-nvidia marked this pull request as ready for review May 13, 2026 19:30
@greptile-apps

greptile-apps Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR routes all fault-tolerance attribution restart-veto decisions through the launcher-managed AttributionService held by the rendezvous handler, removing any direct backend path from _handle_restart_decision. It adds a _graceful_stop_requested flag that lets the veto propagate cleanly out of the monitoring loop as a RendezvousGracefulExitError.

  • New _run_attribution() helper checks attribution only on the store-host node, reads _attribution_service.get_last_result() from the rendezvous handler, and calls _barrier_state.set_shutdown() before returning False from the decision method.
  • _handle_restart_decision update short-circuits before progress tracking when attribution says stop, setting _graceful_stop_requested = True; callers in both the UNHEALTHY/FAILED and HEALTHY paths then raise RendezvousGracefulExitError after stopping workers.
  • _validate_attribution_requires_per_cycle_applog refactor is a semantically-equivalent restructuring (early-return style) of the original boolean compound condition.

Confidence Score: 5/5

The change is safe to merge; the attribution veto path is well-isolated, the flag lifecycle is correctly bounded to a single loop iteration, and the refactored validation function is semantically equivalent to what it replaces.

The new code path is narrow: _run_attribution is a one-level getattr call with two early-exits, and _graceful_stop_requested is always cleared before any exception propagates. Existing tests for _validate_attribution_requires_per_cycle_applog cover all four input combinations. The new test correctly exercises the veto branch by setting _is_store_host = True, which was the gap flagged in a previous review round.

No files require special attention, though launcher.py carries the is True identity check that could silently drop a stop decision if the attribution service API contract ever widens beyond strict Python booleans.

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/fault_tolerance/launcher.py Adds _graceful_stop_requested flag, _run_attribution() helper, and attribution veto path in _handle_restart_decision; also refactors _validate_attribution_requires_per_cycle_applog with equivalent semantics but using is True identity check for the stop decision.
tests/fault_tolerance/unit/test_launcher.py Pins _attribution_service = None on the shared mock spec, adds a well-structured test for the attribution veto path, and correctly sets _is_store_host = True to exercise the new code branch.

Reviews (2): Last reviewed commit: "Fix managed attribution restart test" | Re-trigger Greptile

Comment on lines +548 to +566
def test_handle_restart_decision_stops_when_attribution_service_says_stop(self):
"""Returns False before restart when the managed attribution service recommends stop."""
agent = self._make_agent()
agent._progress_tracker = MagicMock()
agent._remaining_restarts = 2
agent._rdzv_handler._attribution_service = MagicMock()
agent._rdzv_handler._attribution_service.get_last_result.return_value = True
agent._rdzv_handler._barrier_state = MagicMock()

with patch.object(agent, '_restart_workers') as mock_restart:
result = agent._handle_restart_decision(
role="test", spec=self.spec, log_msg="[%s] restarting"
)

self.assertFalse(result)
self.assertTrue(agent._graceful_stop_requested)
agent._rdzv_handler._barrier_state.set_shutdown.assert_called_once()
agent._progress_tracker.analyze_previous_cycle.assert_not_called()
mock_restart.assert_not_called()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Attribution veto path never exercised — test assertions will fail at runtime

_make_agent() constructs LocalElasticAgent with the default is_store_host=False. Because _run_attribution() returns False immediately when not self._is_store_host, the attribution veto branch in _handle_restart_decision is never entered. The mock's get_last_result is never called, _graceful_stop_requested stays False, and set_shutdown is never invoked. As a result three of the four assertions would fail: self.assertTrue(agent._graceful_stop_requested), _barrier_state.set_shutdown.assert_called_once(), and _progress_tracker.analyze_previous_cycle.assert_not_called(). The test needs is_store_host=True passed to LocalElasticAgent (or agent._is_store_host = True set before the call) for the intended code path to be reachable.

Suggested change
def test_handle_restart_decision_stops_when_attribution_service_says_stop(self):
"""Returns False before restart when the managed attribution service recommends stop."""
agent = self._make_agent()
agent._progress_tracker = MagicMock()
agent._remaining_restarts = 2
agent._rdzv_handler._attribution_service = MagicMock()
agent._rdzv_handler._attribution_service.get_last_result.return_value = True
agent._rdzv_handler._barrier_state = MagicMock()
with patch.object(agent, '_restart_workers') as mock_restart:
result = agent._handle_restart_decision(
role="test", spec=self.spec, log_msg="[%s] restarting"
)
self.assertFalse(result)
self.assertTrue(agent._graceful_stop_requested)
agent._rdzv_handler._barrier_state.set_shutdown.assert_called_once()
agent._progress_tracker.analyze_previous_cycle.assert_not_called()
mock_restart.assert_not_called()
def test_handle_restart_decision_stops_when_attribution_service_says_stop(self):
"""Returns False before restart when the managed attribution service recommends stop."""
agent = self._make_agent()
agent._is_store_host = True
agent._progress_tracker = MagicMock()

@namitdhameja namitdhameja marked this pull request as draft May 13, 2026 19:40
@namitdhameja namitdhameja force-pushed the ft-attribution-integration branch from cf58a44 to dfd42f6 Compare May 14, 2026 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants