feat(skills): add nvrx attribution workflow bundle#312
Conversation
Greptile SummaryThis PR adds the
Confidence Score: 3/5Not safe to merge as-is — the FR analysis --emit-stdout path silently produces wrong output, breaking FR scoring in the feedback loop. One clear P1 defect: src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_attribution.py — Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[prepare_node_alloc.sh\nsubmit pool 2 jobs at a time] -->|writes tracking TSV| B[experiments.tsv]
A -->|sbatch| C[l4_gb200_reduced.sh\nor n3_super_gb200_fi.sh]
C -->|writes| D[EXPERIMENT_DIR/logs/slurm/*.main_workload.log]
C -->|writes| E[EXPERIMENT_DIR/checkpoints/_dump_*]
B --> F[watch_and_analyze.sh\npoll squeue until job done]
D --> F
E --> F
F -->|--emit-stdout| G[nvrx_logsage.py\nlog analysis correct]
F -->|--emit-stdout| H[fr_attribution.py\nFR analysis prints list repr]
G -->|LOG_OUT 5-field text| I[score_attribution.py\nLLM judge]
H -->|FR_OUT Python list repr\nnot parseable| I
I -->|JSON scores| J[experiments_report.md]
Reviews (6): Last reviewed commit: "style(tests): sort nvrx logsage retry im..." | Re-trigger Greptile |
| logger.error( | ||
| "Log-analysis extraction failed after %d attempts; last status: %s", | ||
| retries, | ||
| last_status, | ||
| ) | ||
| return app_data | ||
|
|
||
| backoff = _sleep_with_backoff(attempt, retries, backoff, max_backoff, jitter) | ||
|
|
||
| return app_data | ||
|
|
||
|
|
||
| def _with_exponential_backoff(llm_call, checkpoint_saved: bool) -> tuple[str, str, str, str, str]: | ||
| retries, initial_backoff, max_backoff, jitter = _log_analysis_retry_config() | ||
| backoff = initial_backoff | ||
|
|
||
| for attempt in range(1, retries + 1): | ||
| try: | ||
| result = llm_call() | ||
| if result and not any(field == LOGSAGE_LLM_ENDPOINT_FAILED for field in result[:4]): | ||
| return result | ||
| last_error = LOGSAGE_LLM_ENDPOINT_FAILED | ||
| except Exception as exc: | ||
| last_error = str(exc) | ||
| logger.warning("Log-analysis LLM attempt %d/%d failed: %s", attempt, retries, exc) | ||
|
|
||
| if attempt == retries: | ||
| logger.error( | ||
| "Log-analysis LLM failed after %d attempts; last error: %s", | ||
| retries, | ||
| last_error, | ||
| ) | ||
| return ( | ||
| ATTR_LLM_FAILURE, | ||
| ATTR_LLM_FAILURE, | ||
| ATTR_LLM_FAILURE, | ||
| ATTR_LLM_FAILURE, | ||
| str(checkpoint_saved), | ||
| ) | ||
|
|
||
| backoff = _sleep_with_backoff(attempt, retries, backoff, max_backoff, jitter) | ||
|
|
||
|
|
||
| class NVRxLogAnalyzer(NVRxAttribution): | ||
| def __init__(self, args: Union[argparse.Namespace, Mapping[str, Any]]): | ||
| from nvidia_resiliency_ext.attribution.api_keys import load_nvidia_api_key |
There was a problem hiding this comment.
Implicit
None return when retries=0
_with_exponential_backoff has the same zero-retries blind spot as the already-flagged _retry_return_application_errors. When NVRX_LOG_ANALYSIS_LLM_RETRIES=0, range(1, 1) produces no iterations and the function falls off the end returning None, violating the tuple[str, str, str, str, str] annotation. The caller appends this into result, which is then iterated over in main() — the if not result: continue guard saves the --emit-stdout path, but any other consumer of the return value (e.g. a wrapper that indexes the tuple) will raise a TypeError.
Fix: initialise a safe fallback before the loop and add a final return after it:
def _with_exponential_backoff(llm_call, checkpoint_saved: bool) -> tuple[str, str, str, str, str]:
retries, initial_backoff, max_backoff, jitter = _log_analysis_retry_config()
backoff = initial_backoff
last_error: str = "no attempts made (retries=0)"
for attempt in range(1, retries + 1):
...
return (
ATTR_LLM_FAILURE,
ATTR_LLM_FAILURE,
ATTR_LLM_FAILURE,
ATTR_LLM_FAILURE,
str(checkpoint_saved),
)| if not api_key: | ||
| raise ValueError( | ||
| "Judge API key not found. Set JUDGE_API_KEY/JUDGE_API_KEY_FILE, " | ||
| "or NVIDIA_API_KEY/NVIDIA_API_KEY_FILE, or create ~/.nvidia_api_key" |
There was a problem hiding this comment.
same here LLM_API_KEY
1e7a76e to
26e6429
Compare
| action='store_true', | ||
| help='Convert the trace file to json file, if the trace is binary, for debugging', | ||
| ) | ||
| parser.add_argument( | ||
| '--emit-stdout', | ||
| action='store_true', | ||
| help='Print final FR summary table to stdout for machine consumers', | ||
| ) | ||
|
|
||
| args = parser.parse_args() | ||
|
|
||
| analyzer = CollectiveAnalyzer(args) | ||
| analyzer.run_sync(args) | ||
| result = analyzer.run_sync(args) | ||
|
|
||
| if args.emit_stdout and isinstance(result, tuple) and result: | ||
| payload = result[0] | ||
| if isinstance(payload, dict): | ||
| text = payload.get("analysis_text", "") | ||
| if text: | ||
| print(text) | ||
| elif payload: | ||
| print(payload) | ||
|
|
||
|
|
||
| if __name__ == "__main__": |
There was a problem hiding this comment.
--emit-stdout prints Python list repr instead of analysis text
run_sync() returns (output_list, overall_state) (from output_handler), so result[0] is a Python list of (text, AttributionState) tuples, not a string. print(payload) therefore outputs something like [('PGID | ...\n...', <AttributionState.CONTINUE: 1>)] — the list's repr() with embedded \n escaped — rather than the clean table text.
watch_and_analyze.sh captures this as FR_OUT and passes it to score_attribution.py, which calls parse_fr_missing_ranks(). Because the newlines are escaped in the repr, splitlines() sees a single line that matches the "Missing Ranks" in line guard and is skipped; the function returns an empty set and silently falls back to "no_dumps". FR rank-correctness is never scored.
The isinstance(payload, dict) branch is also dead code because result[0] is always a list.
Fix: unwrap the inner text from the first item in output_list:
if args.emit_stdout and isinstance(result, tuple) and result:
output_list = result[0]
if isinstance(output_list, list) and output_list:
text = output_list[0][0] if isinstance(output_list[0], (list, tuple)) else output_list[0]
if isinstance(text, dict):
text = text.get("analysis_text", "")
if text:
print(text)
elif isinstance(output_list, dict):
text = output_list.get("analysis_text", "")
if text:
print(text)
elif output_list:
print(output_list)
Summary
nvrx-attrskill bundle with log analysis, flight-recorder analysis, and a Slurm-based fault-injection feedback loopslurm.confdefaults and localuser.envsupportNotes
scripts/user.envis intentionally local and untracked