feat(skills): add nvrx attribution workflow bundle by sbak5 · Pull Request #312 · NVIDIA/nvidia-resiliency-ext

sbak5 · 2026-04-24T05:17:59Z

Summary

add the nvrx-attr skill bundle with log analysis, flight-recorder analysis, and a Slurm-based fault-injection feedback loop
harden the fault-loop scripts and docs for local/site configuration, including local slurm.conf defaults and local user.env support
improve log-analysis robustness with retry/backoff handling and reduce noisy Torch C++ logging in the workload script

Notes

scripts/user.env is intentionally local and untracked
branch includes follow-up cleanup to remove internal defaults from docs and templates

greptile-apps · 2026-04-24T22:35:13Z

Greptile Summary

This PR adds the nvrx-attr skill bundle, introducing log-analysis, FR-analysis, and a Slurm-based fault-injection feedback loop, alongside retry/backoff hardening in nvrx_logsage.py and logger-level restoration in fr_attribution.py.

--emit-stdout in fr_attribution.py emits a Python list repr instead of the analysis text (result[0] is the output_list returned by output_handler, not the string inside it), causing watch_and_analyze.sh to silently receive an unparseable FR_OUT and score every experiment's FR dimension as no_dumps.

Confidence Score: 3/5

Not safe to merge as-is — the FR analysis --emit-stdout path silently produces wrong output, breaking FR scoring in the feedback loop.

One clear P1 defect: fr_attribution.py --emit-stdout emits a Python list repr instead of the analysis text string, causing watch_and_analyze.sh to pass unparseable content to the judge scorer and score all FR dimensions as no_dumps. Score pulled below the P1 ceiling of 4 because the loop's primary output — the FR accuracy column in the report — is silently wrong for every experiment, and several previously-flagged P1s remain unresolved.

src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_attribution.py — --emit-stdout fix required; src/nvidia_resiliency_ext/attribution/log_analyzer/nvrx_logsage.py — _retry_return_application_errors UnboundLocalError at retries=0 still needs a safe fallback initializer.

Important Files Changed

Filename	Overview
src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_attribution.py	Adds `--emit-stdout`, fixes logger level restoration with try/finally, drops completed-PG analysis, and refactors table parsing — but `--emit-stdout` prints a Python list repr instead of the analysis text, silently breaking FR scoring in the feedback loop.
src/nvidia_resiliency_ext/attribution/log_analyzer/nvrx_logsage.py	Adds exponential-backoff retry wrappers and `--emit-stdout` flag; `_with_exponential_backoff` correctly handles retries=0, but `_retry_return_application_errors` still has an `UnboundLocalError` when retries=0 (previously flagged).
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/watch_and_analyze.sh	New orchestration script that polls SLURM job states, runs log- and FR-analysis, and calls the judge scorer; logic is sound except FR output will be a Python list repr (upstream bug in fr_attribution.py --emit-stdout).
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/score_attribution.py	New LLM-judge scorer with FR rank normalization and structured JSON output; graceful error handling; no new logic issues beyond dependency on correct FR output format.
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/l4_gb200_reduced.sh	New Llama4-Scout SBATCH script with fault injection, cache staging, and user.env-based local config; dead FAULT_LABEL assignment previously flagged; otherwise well-structured.
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/prepare_node_alloc.sh	New batch submission orchestrator with workload resolution and pool management; correctly captures pre-existing env vars before sourcing user.env.
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/n3_super_gb200_fi.sh	New Nemotron3-Super SBATCH script with CUDA graph and DeepEP support; uses `_i${FAULT_AT_ITER}` naming convention consistent with docs.
tests/attribution/unit/test_nvrx_logsage_retry.py	Adds unit test for `_with_exponential_backoff` with retries=0; covers the fixed code path correctly.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[prepare_node_alloc.sh\nsubmit pool 2 jobs at a time] -->|writes tracking TSV| B[experiments.tsv]
    A -->|sbatch| C[l4_gb200_reduced.sh\nor n3_super_gb200_fi.sh]
    C -->|writes| D[EXPERIMENT_DIR/logs/slurm/*.main_workload.log]
    C -->|writes| E[EXPERIMENT_DIR/checkpoints/_dump_*]
    B --> F[watch_and_analyze.sh\npoll squeue until job done]
    D --> F
    E --> F
    F -->|--emit-stdout| G[nvrx_logsage.py\nlog analysis correct]
    F -->|--emit-stdout| H[fr_attribution.py\nFR analysis prints list repr]
    G -->|LOG_OUT 5-field text| I[score_attribution.py\nLLM judge]
    H -->|FR_OUT Python list repr\nnot parseable| I
    I -->|JSON scores| J[experiments_report.md]

_{Reviews (6): Last reviewed commit: "style(tests): sort nvrx logsage retry im..." | Re-trigger Greptile}

greptile-apps · 2026-04-27T17:56:54Z

+            logger.error(
+                "Log-analysis extraction failed after %d attempts; last status: %s",
+                retries,
+                last_status,
+            )
+            return app_data
+
+        backoff = _sleep_with_backoff(attempt, retries, backoff, max_backoff, jitter)
+
+    return app_data
+
+
+def _with_exponential_backoff(llm_call, checkpoint_saved: bool) -> tuple[str, str, str, str, str]:
+    retries, initial_backoff, max_backoff, jitter = _log_analysis_retry_config()
+    backoff = initial_backoff
+
+    for attempt in range(1, retries + 1):
+        try:
+            result = llm_call()
+            if result and not any(field == LOGSAGE_LLM_ENDPOINT_FAILED for field in result[:4]):
+                return result
+            last_error = LOGSAGE_LLM_ENDPOINT_FAILED
+        except Exception as exc:
+            last_error = str(exc)
+            logger.warning("Log-analysis LLM attempt %d/%d failed: %s", attempt, retries, exc)
+
+        if attempt == retries:
+            logger.error(
+                "Log-analysis LLM failed after %d attempts; last error: %s",
+                retries,
+                last_error,
+            )
+            return (
+                ATTR_LLM_FAILURE,
+                ATTR_LLM_FAILURE,
+                ATTR_LLM_FAILURE,
+                ATTR_LLM_FAILURE,
+                str(checkpoint_saved),
+            )
+
+        backoff = _sleep_with_backoff(attempt, retries, backoff, max_backoff, jitter)
+
+
 class NVRxLogAnalyzer(NVRxAttribution):
    def __init__(self, args: Union[argparse.Namespace, Mapping[str, Any]]):
        from nvidia_resiliency_ext.attribution.api_keys import load_nvidia_api_key


Implicit None return when retries=0

_with_exponential_backoff has the same zero-retries blind spot as the already-flagged _retry_return_application_errors. When NVRX_LOG_ANALYSIS_LLM_RETRIES=0, range(1, 1) produces no iterations and the function falls off the end returning None, violating the tuple[str, str, str, str, str] annotation. The caller appends this into result, which is then iterated over in main() — the if not result: continue guard saves the --emit-stdout path, but any other consumer of the return value (e.g. a wrapper that indexes the tuple) will raise a TypeError.

Fix: initialise a safe fallback before the loop and add a final return after it:

def _with_exponential_backoff(llm_call, checkpoint_saved: bool) -> tuple[str, str, str, str, str]: retries, initial_backoff, max_backoff, jitter = _log_analysis_retry_config() backoff = initial_backoff last_error: str = "no attempts made (retries=0)" for attempt in range(1, retries + 1): ... return ( ATTR_LLM_FAILURE, ATTR_LLM_FAILURE, ATTR_LLM_FAILURE, ATTR_LLM_FAILURE, str(checkpoint_saved), )

namitdhameja · 2026-04-28T01:11:26Z

+    if not api_key:
+        raise ValueError(
+            "Judge API key not found. Set JUDGE_API_KEY/JUDGE_API_KEY_FILE, "
+            "or NVIDIA_API_KEY/NVIDIA_API_KEY_FILE, or create ~/.nvidia_api_key"


same here LLM_API_KEY

Same as above.

greptile-apps · 2026-04-28T22:17:43Z

        action='store_true',
        help='Convert the trace file to json file, if the trace is binary, for debugging',
    )
+    parser.add_argument(
+        '--emit-stdout',
+        action='store_true',
+        help='Print final FR summary table to stdout for machine consumers',
+    )

    args = parser.parse_args()

    analyzer = CollectiveAnalyzer(args)
-    analyzer.run_sync(args)
+    result = analyzer.run_sync(args)
+
+    if args.emit_stdout and isinstance(result, tuple) and result:
+        payload = result[0]
+        if isinstance(payload, dict):
+            text = payload.get("analysis_text", "")
+            if text:
+                print(text)
+        elif payload:
+            print(payload)


 if __name__ == "__main__":


--emit-stdout prints Python list repr instead of analysis text

run_sync() returns (output_list, overall_state) (from output_handler), so result[0] is a Python list of (text, AttributionState) tuples, not a string. print(payload) therefore outputs something like [('PGID | ...\n...', <AttributionState.CONTINUE: 1>)] — the list's repr() with embedded \n escaped — rather than the clean table text.

watch_and_analyze.sh captures this as FR_OUT and passes it to score_attribution.py, which calls parse_fr_missing_ranks(). Because the newlines are escaped in the repr, splitlines() sees a single line that matches the "Missing Ranks" in line guard and is skipped; the function returns an empty set and silently falls back to "no_dumps". FR rank-correctness is never scored.

The isinstance(payload, dict) branch is also dead code because result[0] is always a list.

Fix: unwrap the inner text from the first item in output_list:

if args.emit_stdout and isinstance(result, tuple) and result: output_list = result[0] if isinstance(output_list, list) and output_list: text = output_list[0][0] if isinstance(output_list[0], (list, tuple)) else output_list[0] if isinstance(text, dict): text = text.get("analysis_text", "") if text: print(text) elif isinstance(output_list, dict): text = output_list.get("analysis_text", "") if text: print(text) elif output_list: print(output_list)

sbak5 requested a review from namitdhameja April 24, 2026 05:18

sbak5 marked this pull request as ready for review April 24, 2026 22:30

sbak5 added the ci-approved Approved to run CI label Apr 24, 2026

greptile-apps Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread src/nvidia_resiliency_ext/attribution/log_analyzer/nvrx_logsage.py

Comment thread src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_attribution.py

Comment thread src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/user.env.example Outdated

sbak5 requested a review from continue-revolution April 24, 2026 23:11

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

namitdhameja requested changes Apr 28, 2026

View reviewed changes

sbak5 added 20 commits April 28, 2026 13:46

feat(skills): add nvrx-attr skill bundle

b2a035e

chore(skills): remove extraneous nvrx-attr artifacts

df5f5c7

feat(skills): harden nvrx-attr fault injection workflow

a146d54

chore(skills): add slurm defaults template

f73ff8b

feat(skills): add local env support for fault loop

2275c6a

chore(skills): reduce torch cpp log verbosity

3f2016d

style(skills): format changed python files

43e39f7

fix(skills): wire feedback-loop analysis outputs

251df4e

fix(skills): refine feedback-loop scoring config

3b791bc

feat(skills): add n3 super fault-loop workload

af9c325

fix(skills): harden n3 fault-loop analysis

9a12531

fix(skills): repair fr-analysis wrapper symlink

1e23cb9

docs(skills): clarify local env configuration

5e30a25

refactor(skills): simplify local config wiring

810c534

fix(fr): restore logger level on analysis errors

d0fac00

refactor(skills): require local env for fault loop

950b97b

fix(skills): normalize FR segment scoring

269a3e3

fix(log-analysis): handle zero LLM retries

59ebefe

fix(skills): use LLM API key names

0e1d6db

fix(skills): resolve user env for spooled jobs

26e6429

sbak5 force-pushed the sbak/attr_skills_pr branch from 1e7a76e to 26e6429 Compare April 28, 2026 21:37

style(tests): sort nvrx logsage retry imports

d4a3999

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

namitdhameja approved these changes Apr 28, 2026

View reviewed changes

sbak5 merged commit ee16c45 into NVIDIA:main Apr 29, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(skills): add nvrx attribution workflow bundle#312

feat(skills): add nvrx attribution workflow bundle#312
sbak5 merged 21 commits into
NVIDIA:mainfrom
sbak5:sbak/attr_skills_pr

sbak5 commented Apr 24, 2026

Uh oh!

greptile-apps Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot Apr 27, 2026

Uh oh!

Uh oh!

namitdhameja Apr 28, 2026

Uh oh!

sbak5 Apr 28, 2026

Uh oh!

Uh oh!

greptile-apps Bot Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sbak5 commented Apr 24, 2026

Summary

Notes

Uh oh!

greptile-apps Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

namitdhameja Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

sbak5 Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented Apr 24, 2026 •

edited

Loading