Skip to content

Heuristic record end of cycle log time.#351

Merged
hexinw-nvidia merged 3 commits into
NVIDIA:mainfrom
hexinw-nvidia:end_of_log
Jun 21, 2026
Merged

Heuristic record end of cycle log time.#351
hexinw-nvidia merged 3 commits into
NVIDIA:mainfrom
hexinw-nvidia:end_of_log

Conversation

@hexinw-nvidia

Copy link
Copy Markdown
Contributor

No description provided.

@greptile-apps

greptile-apps Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a heuristic mechanism to the gRPC log aggregation server to detect and record when a per-cycle application log file has finished receiving data, emitting an INFO "Heuristic cycle log complete" marker. It also refactors _shutdown_cycle_info_reporter_safely into a reusable helper, adds an early shutdown call on the SignalException path, and reorders the control-plane shutdown so the cycle info reporter is finalized before the gRPC log servers are torn down.

  • grpc_log_server.py: Tracks last_chunk_time and last_flush_time per cycle-log file under buffer_lock; _scan_cycle_log_completions runs after each periodic flush and on shutdown (force_idle=True, include_latest=True) to emit completion markers once a non-latest cycle log has been idle past the configurable completion_idle_timeout.
  • launcher.py: Extracts the reporter shutdown into _shutdown_cycle_info_reporter_safely (exception-isolated, with logger.warning on failure) and calls it early in the SignalException handler so the current cycle is closed before workers are torn down.
  • control_plane.py: Moves cycle_info_reporter.shutdown() to run before stop_grpc_log_servers, ensuring cycle records are finalized before the log transport shuts down.

Confidence Score: 5/5

Safe to merge. The heuristic completion logic is consistent with the dual-lock design, the new shutdown paths are correctly exception-isolated, and CycleInfoReporter.shutdown() is idempotent so the multi-call pattern on the signal path causes no functional harm.

No defects were found on any changed path. The cycle-log completion detection correctly defers marking a cycle complete until it is no longer the latest, and the shutdown path forces completion for all cycles including the current one. Lock discipline is maintained throughout the new scan method, and the test suite covers the key edge cases.

grpc_log_server.py is the most complex file in the diff and is worth a careful read, particularly the interaction between buffer_lock acquisition in _scan_cycle_log_completions and the shutdown flush sequence, but no issues were found there.

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/shared_utils/grpc_log_server.py Core of this PR: adds heuristic cycle-log completion detection. Introduces _cycle_log_info regex parsing, per-file tracking of last_chunk_time/last_flush_time/completion_logged under buffer_lock, and a _scan_cycle_log_completions method. The logic is sound; one minor cosmetic ordering issue where completion_logged is set inside buffer_lock but the INFO log emitted after releasing it.
src/nvidia_resiliency_ext/fault_tolerance/launcher.py Extracts shutdown logic into _shutdown_cycle_info_reporter_safely helper (exception-safe) and adds an early call on SignalException path. Since CycleInfoReporter.shutdown() is idempotent, the double-call from the signal path + launch_agent finally block is harmless.
src/nvidia_resiliency_ext/fault_tolerance/control_plane.py Reorders shutdown: cycle_info_reporter.shutdown() is now called before stop_grpc_log_servers. Ordering is logically correct — cycle info is finalized before the log transport is torn down.
tests/shared_utils/test_grpc_log_server.py Adds TestHeuristicCycleLogCompletion with five scenarios covering current-cycle suppression, shutdown force-completion, later-cycle triggers prior completion, non-cycle-log ignored, and late-chunk warning + re-completion.
tests/fault_tolerance/unit/test_launcher.py Adds a focused unit test for _shutdown_cycle_info_reporter_safely that verifies exception isolation and warning logging.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant W as Worker Node
    participant GS as LogAggregationServicer
    participant PF as PeriodicFlushThread
    participant CIR as CycleInfoReporter

    W->>GS: StreamLogs(train_cycle3.log chunk)
    GS->>GS: "last_chunk_time = now (under buffer_lock)"
    GS->>GS: append to buffer

    PF->>GS: _flush_file_buffer_locked(train_cycle3.log)
    GS->>GS: "last_flush_time = now (under buffer_lock)"
    PF->>GS: _scan_cycle_log_completions()
    GS->>GS: skip: cycle3 is latest in family

    W->>GS: StreamLogs(train_cycle4.log chunk)
    GS->>GS: "latest_cycle_by_family[train.log] = 4"

    PF->>GS: _scan_cycle_log_completions()
    GS->>GS: "cycle3 < latest(4) → eligible"
    GS->>GS: "scan_time - last_chunk_time >= idle_timeout?"
    GS-->>GS: emit Heuristic cycle log complete: train_cycle3.log

    Note over GS,CIR: On shutdown (SignalException or normal exit)
    CIR->>CIR: shutdown_cycle_info_reporter_safely()
    CIR->>CIR: report_cycle_end()
    GS->>GS: "_scan_cycle_log_completions(include_latest=True, force_idle=True)"
    GS-->>GS: emit Heuristic cycle log complete: train_cycle4.log
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant W as Worker Node
    participant GS as LogAggregationServicer
    participant PF as PeriodicFlushThread
    participant CIR as CycleInfoReporter

    W->>GS: StreamLogs(train_cycle3.log chunk)
    GS->>GS: "last_chunk_time = now (under buffer_lock)"
    GS->>GS: append to buffer

    PF->>GS: _flush_file_buffer_locked(train_cycle3.log)
    GS->>GS: "last_flush_time = now (under buffer_lock)"
    PF->>GS: _scan_cycle_log_completions()
    GS->>GS: skip: cycle3 is latest in family

    W->>GS: StreamLogs(train_cycle4.log chunk)
    GS->>GS: "latest_cycle_by_family[train.log] = 4"

    PF->>GS: _scan_cycle_log_completions()
    GS->>GS: "cycle3 < latest(4) → eligible"
    GS->>GS: "scan_time - last_chunk_time >= idle_timeout?"
    GS-->>GS: emit Heuristic cycle log complete: train_cycle3.log

    Note over GS,CIR: On shutdown (SignalException or normal exit)
    CIR->>CIR: shutdown_cycle_info_reporter_safely()
    CIR->>CIR: report_cycle_end()
    GS->>GS: "_scan_cycle_log_completions(include_latest=True, force_idle=True)"
    GS-->>GS: emit Heuristic cycle log complete: train_cycle4.log
Loading

Reviews (4): Last reviewed commit: "Address review feedback for log completi..." | Re-trigger Greptile

Comment thread src/nvidia_resiliency_ext/fault_tolerance/launcher.py Outdated
Comment thread src/nvidia_resiliency_ext/shared_utils/grpc_log_server.py Outdated
@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label Jun 20, 2026
@hexinw-nvidia hexinw-nvidia requested a review from sbak5 June 20, 2026 23:03
@hexinw-nvidia hexinw-nvidia merged commit 4813046 into NVIDIA:main Jun 21, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved Approved to run CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants