Exit Code Strategy Architecture

Overview
Design Rationale
Main Rank Detection
Exit Code Strategies
Strategy Comparison
Implementation Details
Migration Guide

Overview

The exit code strategy system determines how bssh reports success or failure when executing commands across multiple nodes.

Breaking Change: The default exit code behavior matches standard MPI tools (mpirun, srun, mpiexec). This improves compatibility with distributed computing workflows and enables better error diagnostics.

Old Behavior:

Returns exit code 0 only if all nodes succeeded
Returns exit code 1 if any node failed (discarding actual exit codes)

New Behavior:

Returns the main rank's exit code by default (matching MPI standard)
Preserves actual exit codes (139=SIGSEGV, 137=OOM, 124=timeout, etc.)
Use --require-all-success flag for old behavior

Design Rationale

Why MainRank is Default

MPI Standard Compliance: All standard MPI tools (mpirun, srun, mpiexec) return rank 0's exit code
Better Diagnostics: Actual exit codes preserved for debugging (segfault, OOM, timeout)
CI/CD Integration: Exit-code-based decisions work naturally
Information Preservation: No loss of error details
Industry Practice: Aligns with HPC and distributed computing conventions

Use Case Alignment

90% of use cases: MPI workloads, distributed computing, CI/CD
10% of use cases: Health checks, monitoring (use --require-all-success)

Main Rank Detection

The system identifies the "main rank" (rank 0) using hierarchical fallback:

pub fn identify_main_rank(nodes: &[Node]) -> Option<usize> {
 // 1. Check Backend.AI CLUSTER_ROLE environment variable
 if env::var("BACKENDAI_CLUSTER_ROLE").ok == Some("main".to_string) {
 // Try to match by hostname
 if let Ok(host) = env::var("BACKENDAI_CLUSTER_HOST") {
 if let Some(idx) = nodes.iter.position(|n| n.host == host) {
 return Some(idx);
 }
 }
 }

 // 2. Fallback: First node is main rank (standard convention)
 if !nodes.is_empty {
 Some(0)
 } else {
 None
 }
}

Detection Priority:

BACKENDAI_CLUSTER_ROLE=main + BACKENDAI_CLUSTER_HOST match
First node in the node list (index 0)

Backend.AI Integration: Automatic detection in multi-node sessions without configuration.

Exit Code Strategies

Three strategies are available to handle different scenarios:

1. MainRank (Default)

Behavior: Returns the main rank's actual exit code.

Use Cases:

MPI workloads and distributed computing
CI/CD pipelines requiring exit code inspection
Shell scripts with error handling logic
When debugging requires specific exit codes

Example:

bssh exec "mpirun -n 16 ./simulation"
EXIT_CODE=$?

case $EXIT_CODE in
 0) echo "Success!"; deploy_results ;;
 139) echo "Segfault!"; collect_core_dump ;;
 137) echo "OOM!"; retry_with_more_memory ;;
 124) echo "Timeout!"; extend_time_limit ;;
 *) echo "Failed: $EXIT_CODE"; exit $EXIT_CODE ;;
esac

Implementation:

ExitCodeStrategy::MainRank => {
 main_idx
 .and_then(|i| results.get(i))
 .map(|r| r.get_exit_code)
 .unwrap_or(1) // No main rank identified → failure
}

2. RequireAllSuccess

Behavior: Returns 0 only if all nodes succeeded, 1 otherwise.

CLI Flag: --require-all-success

Use Cases:

Health checks and monitoring
Cluster validation
When any failure should be treated equally
Legacy scripts requiring old behavior

Example:

bssh --require-all-success exec "disk-check"
if [ $? -ne 0 ]; then
 alert_ops "Node failure detected"
fi

Implementation:

ExitCodeStrategy::RequireAllSuccess => {
 if results.iter.any(|r| !r.is_success) {
 1
 } else {
 0
 }
}

3. MainRankWithFailureCheck (Hybrid Mode)

Behavior: Returns main rank's exit code if non-zero, or 1 if main succeeded but others failed.

CLI Flag: --check-all-nodes

Use Cases:

Production deployments requiring both diagnostics and completeness
When you need detailed error codes but also want to catch failures on any node

Example:

bssh --check-all-nodes exec "mpirun ./program"
# Main failed → main's exit code
# Main OK + others failed → 1
# All OK → 0

Implementation:

ExitCodeStrategy::MainRankWithFailureCheck => {
 let main_code = main_idx
 .and_then(|i| results.get(i))
 .map(|r| r.get_exit_code)
 .unwrap_or(0);

 let other_failed = results.iter
 .enumerate
 .any(|(i, r)| Some(i) != main_idx && !r.is_success);

 if main_code != 0 {
 main_code // Main failed → return its code
 } else if other_failed {
 1 // Main OK but others failed → 1
 } else {
 0 // All OK
 }
}

Strategy Comparison

Scenario	Main Exit	Other Exits	MainRank	RequireAllSuccess	MainRankWithFailureCheck
All success	0	0,0,0	0	0	0
Main failed	139 (SIGSEGV)	0,0,0	139	1	139
Other failed	0	1,0,0	0	1	1
All failed	1	1,1,1	1	1	1
Main timeout	124	0,0,0	124	1	124
Main OOM	137	0,0,0	137	1	137

Bold values show where strategies differ.

Implementation Details

File Structure

src/executor/
├── rank_detector.rs # Main rank identification
├── exit_strategy.rs # Exit code calculation strategies
├── result_types.rs # ExecutionResult with is_main_rank field
├── mod.rs # Re-exports RankDetector and ExitCodeStrategy
└── parallel.rs # Marks main rank in results

src/commands/
└── exec.rs # Applies exit strategy based on CLI flags

src/
└── cli.rs # CLI flags: --require-all-success, --check-all-nodes

ExecutionResult Enhancement

pub struct ExecutionResult {
 pub node: Node,
 pub result: Result<CommandResult>,
 pub is_main_rank: bool,
}

impl ExecutionResult {
 pub fn get_exit_code(&self) -> i32 {
 match &self.result {
 Ok(cmd_result) => cmd_result.exit_status as i32,
 Err(_) => 1, // Connection error → exit code 1
 }
 }
}

Executor Integration

The ParallelExecutor automatically marks the main rank:

fn collect_results(&self, results: Vec<...>) -> Result<Vec<ExecutionResult>> {
 let mut execution_results = Vec::new;
 // ... collect results ...

 // Identify and mark the main rank
 if let Some(main_idx) = RankDetector::identify_main_rank(&self.nodes) {
 if let Some(main_result) = execution_results.get_mut(main_idx) {
 main_result.is_main_rank = true;
 }
 }

 Ok(execution_results)
}

Command Integration

The exec command determines strategy from CLI flags:

let strategy = if params.require_all_success {
 ExitCodeStrategy::RequireAllSuccess
} else if params.check_all_nodes {
 ExitCodeStrategy::MainRankWithFailureCheck
} else {
 ExitCodeStrategy::MainRank // Default
};

let main_idx = RankDetector::identify_main_rank(&nodes);
let exit_code = strategy.calculate(&results, main_idx);

if exit_code != 0 {
 std::process::exit(exit_code);
}

Migration Guide

For MPI Workloads (No Changes Needed)

# Before: Exit code discarded
bssh exec "mpirun ./program"
# Returns: 1 (just "failed", no details)

# After: Exit code preserved
bssh exec "mpirun ./program"
# Returns: 139 (SIGSEGV - immediate diagnosis!)

# ✅ No changes needed - behavior improved

For Health Checks (Add Flag)

# Before: Implicit all-must-succeed
bssh exec "health-check"

# After: Add --require-all-success flag
bssh --require-all-success exec "health-check"

# ⚠️ Action required: Add flag to preserve behavior

Performance Considerations

Zero Overhead: Main rank detection is O(n) single pass
Strategy Selection: Compile-time resolution via enum dispatch
No Allocations: All calculations on stack
Minimal Latency: <1μs added to exit path

Security Considerations

Exit Code Range: Limited to 0-255 (POSIX standard)
No Injection: Exit codes are integers, not strings
Deterministic: Same inputs → same output (no randomness)

Related Documentation:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exit Code Strategy Architecture

Table of Contents

Overview

Design Rationale

Why MainRank is Default

Use Case Alignment

Main Rank Detection

Exit Code Strategies

1. MainRank (Default)

2. RequireAllSuccess

3. MainRankWithFailureCheck (Hybrid Mode)

Strategy Comparison

Implementation Details

File Structure

ExecutionResult Enhancement

Executor Integration

Command Integration

Migration Guide

For MPI Workloads (No Changes Needed)

For Health Checks (Add Flag)

Performance Considerations

Security Considerations

FilesExpand file tree

exit-code-strategy.md

Latest commit

History

exit-code-strategy.md

File metadata and controls

Exit Code Strategy Architecture

Table of Contents

Overview

Design Rationale

Why MainRank is Default

Use Case Alignment

Main Rank Detection

Exit Code Strategies

1. MainRank (Default)

2. RequireAllSuccess

3. MainRankWithFailureCheck (Hybrid Mode)

Strategy Comparison

Implementation Details

File Structure

ExecutionResult Enhancement

Executor Integration

Command Integration

Migration Guide

For MPI Workloads (No Changes Needed)

For Health Checks (Add Flag)

Performance Considerations

Security Considerations