Skip to content

feat(skills): add nvrx attribution workflow bundle#312

Merged
sbak5 merged 21 commits into
NVIDIA:mainfrom
sbak5:sbak/attr_skills_pr
Apr 29, 2026
Merged

feat(skills): add nvrx attribution workflow bundle#312
sbak5 merged 21 commits into
NVIDIA:mainfrom
sbak5:sbak/attr_skills_pr

Conversation

@sbak5

@sbak5 sbak5 commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Summary

  • add the nvrx-attr skill bundle with log analysis, flight-recorder analysis, and a Slurm-based fault-injection feedback loop
  • harden the fault-loop scripts and docs for local/site configuration, including local slurm.conf defaults and local user.env support
  • improve log-analysis robustness with retry/backoff handling and reduce noisy Torch C++ logging in the workload script

Notes

  • scripts/user.env is intentionally local and untracked
  • branch includes follow-up cleanup to remove internal defaults from docs and templates

@sbak5 sbak5 requested a review from namitdhameja April 24, 2026 05:18
@sbak5 sbak5 marked this pull request as ready for review April 24, 2026 22:30
@sbak5 sbak5 added the ci-approved Approved to run CI label Apr 24, 2026
@greptile-apps

greptile-apps Bot commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds the nvrx-attr skill bundle, introducing log-analysis, FR-analysis, and a Slurm-based fault-injection feedback loop, alongside retry/backoff hardening in nvrx_logsage.py and logger-level restoration in fr_attribution.py.

  • --emit-stdout in fr_attribution.py emits a Python list repr instead of the analysis text (result[0] is the output_list returned by output_handler, not the string inside it), causing watch_and_analyze.sh to silently receive an unparseable FR_OUT and score every experiment's FR dimension as no_dumps.

Confidence Score: 3/5

Not safe to merge as-is — the FR analysis --emit-stdout path silently produces wrong output, breaking FR scoring in the feedback loop.

One clear P1 defect: fr_attribution.py --emit-stdout emits a Python list repr instead of the analysis text string, causing watch_and_analyze.sh to pass unparseable content to the judge scorer and score all FR dimensions as no_dumps. Score pulled below the P1 ceiling of 4 because the loop's primary output — the FR accuracy column in the report — is silently wrong for every experiment, and several previously-flagged P1s remain unresolved.

src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_attribution.py — --emit-stdout fix required; src/nvidia_resiliency_ext/attribution/log_analyzer/nvrx_logsage.py — _retry_return_application_errors UnboundLocalError at retries=0 still needs a safe fallback initializer.

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_attribution.py Adds --emit-stdout, fixes logger level restoration with try/finally, drops completed-PG analysis, and refactors table parsing — but --emit-stdout prints a Python list repr instead of the analysis text, silently breaking FR scoring in the feedback loop.
src/nvidia_resiliency_ext/attribution/log_analyzer/nvrx_logsage.py Adds exponential-backoff retry wrappers and --emit-stdout flag; _with_exponential_backoff correctly handles retries=0, but _retry_return_application_errors still has an UnboundLocalError when retries=0 (previously flagged).
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/watch_and_analyze.sh New orchestration script that polls SLURM job states, runs log- and FR-analysis, and calls the judge scorer; logic is sound except FR output will be a Python list repr (upstream bug in fr_attribution.py --emit-stdout).
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/score_attribution.py New LLM-judge scorer with FR rank normalization and structured JSON output; graceful error handling; no new logic issues beyond dependency on correct FR output format.
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/l4_gb200_reduced.sh New Llama4-Scout SBATCH script with fault injection, cache staging, and user.env-based local config; dead FAULT_LABEL assignment previously flagged; otherwise well-structured.
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/prepare_node_alloc.sh New batch submission orchestrator with workload resolution and pool management; correctly captures pre-existing env vars before sourcing user.env.
src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/n3_super_gb200_fi.sh New Nemotron3-Super SBATCH script with CUDA graph and DeepEP support; uses _i${FAULT_AT_ITER} naming convention consistent with docs.
tests/attribution/unit/test_nvrx_logsage_retry.py Adds unit test for _with_exponential_backoff with retries=0; covers the fixed code path correctly.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[prepare_node_alloc.sh\nsubmit pool 2 jobs at a time] -->|writes tracking TSV| B[experiments.tsv]
    A -->|sbatch| C[l4_gb200_reduced.sh\nor n3_super_gb200_fi.sh]
    C -->|writes| D[EXPERIMENT_DIR/logs/slurm/*.main_workload.log]
    C -->|writes| E[EXPERIMENT_DIR/checkpoints/_dump_*]
    B --> F[watch_and_analyze.sh\npoll squeue until job done]
    D --> F
    E --> F
    F -->|--emit-stdout| G[nvrx_logsage.py\nlog analysis correct]
    F -->|--emit-stdout| H[fr_attribution.py\nFR analysis prints list repr]
    G -->|LOG_OUT 5-field text| I[score_attribution.py\nLLM judge]
    H -->|FR_OUT Python list repr\nnot parseable| I
    I -->|JSON scores| J[experiments_report.md]
Loading

Reviews (6): Last reviewed commit: "style(tests): sort nvrx logsage retry im..." | Re-trigger Greptile

Comment thread src/nvidia_resiliency_ext/attribution/log_analyzer/nvrx_logsage.py
Comment thread src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/user.env.example Outdated
Comment on lines 155 to 200
logger.error(
"Log-analysis extraction failed after %d attempts; last status: %s",
retries,
last_status,
)
return app_data

backoff = _sleep_with_backoff(attempt, retries, backoff, max_backoff, jitter)

return app_data


def _with_exponential_backoff(llm_call, checkpoint_saved: bool) -> tuple[str, str, str, str, str]:
retries, initial_backoff, max_backoff, jitter = _log_analysis_retry_config()
backoff = initial_backoff

for attempt in range(1, retries + 1):
try:
result = llm_call()
if result and not any(field == LOGSAGE_LLM_ENDPOINT_FAILED for field in result[:4]):
return result
last_error = LOGSAGE_LLM_ENDPOINT_FAILED
except Exception as exc:
last_error = str(exc)
logger.warning("Log-analysis LLM attempt %d/%d failed: %s", attempt, retries, exc)

if attempt == retries:
logger.error(
"Log-analysis LLM failed after %d attempts; last error: %s",
retries,
last_error,
)
return (
ATTR_LLM_FAILURE,
ATTR_LLM_FAILURE,
ATTR_LLM_FAILURE,
ATTR_LLM_FAILURE,
str(checkpoint_saved),
)

backoff = _sleep_with_backoff(attempt, retries, backoff, max_backoff, jitter)


class NVRxLogAnalyzer(NVRxAttribution):
def __init__(self, args: Union[argparse.Namespace, Mapping[str, Any]]):
from nvidia_resiliency_ext.attribution.api_keys import load_nvidia_api_key

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Implicit None return when retries=0

_with_exponential_backoff has the same zero-retries blind spot as the already-flagged _retry_return_application_errors. When NVRX_LOG_ANALYSIS_LLM_RETRIES=0, range(1, 1) produces no iterations and the function falls off the end returning None, violating the tuple[str, str, str, str, str] annotation. The caller appends this into result, which is then iterated over in main() — the if not result: continue guard saves the --emit-stdout path, but any other consumer of the return value (e.g. a wrapper that indexes the tuple) will raise a TypeError.

Fix: initialise a safe fallback before the loop and add a final return after it:

def _with_exponential_backoff(llm_call, checkpoint_saved: bool) -> tuple[str, str, str, str, str]:
    retries, initial_backoff, max_backoff, jitter = _log_analysis_retry_config()
    backoff = initial_backoff
    last_error: str = "no attempts made (retries=0)"

    for attempt in range(1, retries + 1):
        ...

    return (
        ATTR_LLM_FAILURE,
        ATTR_LLM_FAILURE,
        ATTR_LLM_FAILURE,
        ATTR_LLM_FAILURE,
        str(checkpoint_saved),
    )

Comment thread src/nvidia_resiliency_ext/skills/nvrx-attr/scripts/score_attribution.py Outdated
if not api_key:
raise ValueError(
"Judge API key not found. Set JUDGE_API_KEY/JUDGE_API_KEY_FILE, "
"or NVIDIA_API_KEY/NVIDIA_API_KEY_FILE, or create ~/.nvidia_api_key"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here LLM_API_KEY

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@sbak5 sbak5 force-pushed the sbak/attr_skills_pr branch from 1e7a76e to 26e6429 Compare April 28, 2026 21:37
Comment on lines 1171 to 1195
action='store_true',
help='Convert the trace file to json file, if the trace is binary, for debugging',
)
parser.add_argument(
'--emit-stdout',
action='store_true',
help='Print final FR summary table to stdout for machine consumers',
)

args = parser.parse_args()

analyzer = CollectiveAnalyzer(args)
analyzer.run_sync(args)
result = analyzer.run_sync(args)

if args.emit_stdout and isinstance(result, tuple) and result:
payload = result[0]
if isinstance(payload, dict):
text = payload.get("analysis_text", "")
if text:
print(text)
elif payload:
print(payload)


if __name__ == "__main__":

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 --emit-stdout prints Python list repr instead of analysis text

run_sync() returns (output_list, overall_state) (from output_handler), so result[0] is a Python list of (text, AttributionState) tuples, not a string. print(payload) therefore outputs something like [('PGID | ...\n...', <AttributionState.CONTINUE: 1>)] — the list's repr() with embedded \n escaped — rather than the clean table text.

watch_and_analyze.sh captures this as FR_OUT and passes it to score_attribution.py, which calls parse_fr_missing_ranks(). Because the newlines are escaped in the repr, splitlines() sees a single line that matches the "Missing Ranks" in line guard and is skipped; the function returns an empty set and silently falls back to "no_dumps". FR rank-correctness is never scored.

The isinstance(payload, dict) branch is also dead code because result[0] is always a list.

Fix: unwrap the inner text from the first item in output_list:

if args.emit_stdout and isinstance(result, tuple) and result:
    output_list = result[0]
    if isinstance(output_list, list) and output_list:
        text = output_list[0][0] if isinstance(output_list[0], (list, tuple)) else output_list[0]
        if isinstance(text, dict):
            text = text.get("analysis_text", "")
        if text:
            print(text)
    elif isinstance(output_list, dict):
        text = output_list.get("analysis_text", "")
        if text:
            print(text)
    elif output_list:
        print(output_list)

@sbak5 sbak5 merged commit ee16c45 into NVIDIA:main Apr 29, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved Approved to run CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants