Skip to content

Latest commit

 

History

History
322 lines (246 loc) · 8.35 KB

File metadata and controls

322 lines (246 loc) · 8.35 KB

Exit Code Strategy Architecture

← Back to Main Architecture

Table of Contents

Overview

The exit code strategy system determines how bssh reports success or failure when executing commands across multiple nodes.

Breaking Change: The default exit code behavior matches standard MPI tools (mpirun, srun, mpiexec). This improves compatibility with distributed computing workflows and enables better error diagnostics.

Old Behavior:

  • Returns exit code 0 only if all nodes succeeded
  • Returns exit code 1 if any node failed (discarding actual exit codes)

New Behavior:

  • Returns the main rank's exit code by default (matching MPI standard)
  • Preserves actual exit codes (139=SIGSEGV, 137=OOM, 124=timeout, etc.)
  • Use --require-all-success flag for old behavior

Design Rationale

Why MainRank is Default

  1. MPI Standard Compliance: All standard MPI tools (mpirun, srun, mpiexec) return rank 0's exit code
  2. Better Diagnostics: Actual exit codes preserved for debugging (segfault, OOM, timeout)
  3. CI/CD Integration: Exit-code-based decisions work naturally
  4. Information Preservation: No loss of error details
  5. Industry Practice: Aligns with HPC and distributed computing conventions

Use Case Alignment

  • 90% of use cases: MPI workloads, distributed computing, CI/CD
  • 10% of use cases: Health checks, monitoring (use --require-all-success)

Main Rank Detection

The system identifies the "main rank" (rank 0) using hierarchical fallback:

pub fn identify_main_rank(nodes: &[Node]) -> Option<usize> {
 // 1. Check Backend.AI CLUSTER_ROLE environment variable
 if env::var("BACKENDAI_CLUSTER_ROLE").ok == Some("main".to_string) {
 // Try to match by hostname
 if let Ok(host) = env::var("BACKENDAI_CLUSTER_HOST") {
 if let Some(idx) = nodes.iter.position(|n| n.host == host) {
 return Some(idx);
 }
 }
 }

 // 2. Fallback: First node is main rank (standard convention)
 if !nodes.is_empty {
 Some(0)
 } else {
 None
 }
}

Detection Priority:

  1. BACKENDAI_CLUSTER_ROLE=main + BACKENDAI_CLUSTER_HOST match
  2. First node in the node list (index 0)

Backend.AI Integration: Automatic detection in multi-node sessions without configuration.

Exit Code Strategies

Three strategies are available to handle different scenarios:

1. MainRank (Default)

Behavior: Returns the main rank's actual exit code.

Use Cases:

  • MPI workloads and distributed computing
  • CI/CD pipelines requiring exit code inspection
  • Shell scripts with error handling logic
  • When debugging requires specific exit codes

Example:

bssh exec "mpirun -n 16 ./simulation"
EXIT_CODE=$?

case $EXIT_CODE in
 0) echo "Success!"; deploy_results ;;
 139) echo "Segfault!"; collect_core_dump ;;
 137) echo "OOM!"; retry_with_more_memory ;;
 124) echo "Timeout!"; extend_time_limit ;;
 *) echo "Failed: $EXIT_CODE"; exit $EXIT_CODE ;;
esac

Implementation:

ExitCodeStrategy::MainRank => {
 main_idx
 .and_then(|i| results.get(i))
 .map(|r| r.get_exit_code)
 .unwrap_or(1) // No main rank identified → failure
}

2. RequireAllSuccess

Behavior: Returns 0 only if all nodes succeeded, 1 otherwise.

CLI Flag: --require-all-success

Use Cases:

  • Health checks and monitoring
  • Cluster validation
  • When any failure should be treated equally
  • Legacy scripts requiring old behavior

Example:

bssh --require-all-success exec "disk-check"
if [ $? -ne 0 ]; then
 alert_ops "Node failure detected"
fi

Implementation:

ExitCodeStrategy::RequireAllSuccess => {
 if results.iter.any(|r| !r.is_success) {
 1
 } else {
 0
 }
}

3. MainRankWithFailureCheck (Hybrid Mode)

Behavior: Returns main rank's exit code if non-zero, or 1 if main succeeded but others failed.

CLI Flag: --check-all-nodes

Use Cases:

  • Production deployments requiring both diagnostics and completeness
  • When you need detailed error codes but also want to catch failures on any node

Example:

bssh --check-all-nodes exec "mpirun ./program"
# Main failed → main's exit code
# Main OK + others failed → 1
# All OK → 0

Implementation:

ExitCodeStrategy::MainRankWithFailureCheck => {
 let main_code = main_idx
 .and_then(|i| results.get(i))
 .map(|r| r.get_exit_code)
 .unwrap_or(0);

 let other_failed = results.iter
 .enumerate
 .any(|(i, r)| Some(i) != main_idx && !r.is_success);

 if main_code != 0 {
 main_code // Main failed → return its code
 } else if other_failed {
 1 // Main OK but others failed → 1
 } else {
 0 // All OK
 }
}

Strategy Comparison

Scenario Main Exit Other Exits MainRank RequireAllSuccess MainRankWithFailureCheck
All success 0 0,0,0 0 0 0
Main failed 139 (SIGSEGV) 0,0,0 139 1 139
Other failed 0 1,0,0 0 1 1
All failed 1 1,1,1 1 1 1
Main timeout 124 0,0,0 124 1 124
Main OOM 137 0,0,0 137 1 137

Bold values show where strategies differ.

Implementation Details

File Structure

src/executor/
├── rank_detector.rs # Main rank identification
├── exit_strategy.rs # Exit code calculation strategies
├── result_types.rs # ExecutionResult with is_main_rank field
├── mod.rs # Re-exports RankDetector and ExitCodeStrategy
└── parallel.rs # Marks main rank in results

src/commands/
└── exec.rs # Applies exit strategy based on CLI flags

src/
└── cli.rs # CLI flags: --require-all-success, --check-all-nodes

ExecutionResult Enhancement

pub struct ExecutionResult {
 pub node: Node,
 pub result: Result<CommandResult>,
 pub is_main_rank: bool,
}

impl ExecutionResult {
 pub fn get_exit_code(&self) -> i32 {
 match &self.result {
 Ok(cmd_result) => cmd_result.exit_status as i32,
 Err(_) => 1, // Connection error → exit code 1
 }
 }
}

Executor Integration

The ParallelExecutor automatically marks the main rank:

fn collect_results(&self, results: Vec<...>) -> Result<Vec<ExecutionResult>> {
 let mut execution_results = Vec::new;
 // ... collect results ...

 // Identify and mark the main rank
 if let Some(main_idx) = RankDetector::identify_main_rank(&self.nodes) {
 if let Some(main_result) = execution_results.get_mut(main_idx) {
 main_result.is_main_rank = true;
 }
 }

 Ok(execution_results)
}

Command Integration

The exec command determines strategy from CLI flags:

let strategy = if params.require_all_success {
 ExitCodeStrategy::RequireAllSuccess
} else if params.check_all_nodes {
 ExitCodeStrategy::MainRankWithFailureCheck
} else {
 ExitCodeStrategy::MainRank // Default
};

let main_idx = RankDetector::identify_main_rank(&nodes);
let exit_code = strategy.calculate(&results, main_idx);

if exit_code != 0 {
 std::process::exit(exit_code);
}

Migration Guide

For MPI Workloads (No Changes Needed)

# Before: Exit code discarded
bssh exec "mpirun ./program"
# Returns: 1 (just "failed", no details)

# After: Exit code preserved
bssh exec "mpirun ./program"
# Returns: 139 (SIGSEGV - immediate diagnosis!)

# ✅ No changes needed - behavior improved

For Health Checks (Add Flag)

# Before: Implicit all-must-succeed
bssh exec "health-check"

# After: Add --require-all-success flag
bssh --require-all-success exec "health-check"

# ⚠️ Action required: Add flag to preserve behavior

Performance Considerations

  • Zero Overhead: Main rank detection is O(n) single pass
  • Strategy Selection: Compile-time resolution via enum dispatch
  • No Allocations: All calculations on stack
  • Minimal Latency: <1μs added to exit path

Security Considerations

  • Exit Code Range: Limited to 0-255 (POSIX standard)
  • No Injection: Exit codes are integers, not strings
  • Deterministic: Same inputs → same output (no randomness)

Related Documentation: