Fault-tolerant Linux Utility for Resource Management
An Erlang-based, SLURM-compatible job scheduler designed for high availability, zero-downtime operations, and seamless horizontal scaling.
Note: This project was developed with the assistance of generative AI (Claude by Anthropic). The architecture, code, documentation, and TLA+ specifications were created through AI-assisted development.
| Component | Status |
|---|---|
| Unit Tests | 2400+ passing |
| Integration Tests | 22/22 passing |
| SLURM Compatibility | 141 tests (132 pass, 9 skip) |
| Protocol Fuzzing | 33K+ property tests |
| TLA+ Verification | All specs pass |
| Performance | Benchmarked (see docs/BENCHMARKS.md) |
FLURM is a next-generation workload manager that speaks the SLURM protocol while leveraging Erlang/OTP's battle-tested concurrency primitives. It provides fault tolerance, hot code reloading, and distributed consensus without the operational complexity of traditional HPC schedulers.
- Hot Code Reload: Update scheduler logic without dropping jobs or connections (see the sketch after this list)
- Zero-Downtime Failover: Multi-controller consensus with automatic leader election
- Dynamic Scaling: Add or remove compute nodes without cluster restarts
- No Global Locks: Lock-free scheduling using Erlang's actor model
- SLURM Protocol Compatible: Drop-in replacement for existing SLURM clients
- Built-in Observability: Prometheus metrics, distributed tracing, and live introspection
- Deterministic Testing: TLA+ specifications and simulation-based testing
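To make the hot-reload claim concrete, the sketch below shows the standard OTP mechanics from a shell attached to a running node (the module name is illustrative). Running processes switch to the new version at their next fully-qualified `Module:Function` call, so in-flight jobs and connections are untouched:

```erlang
%% Compile and load a new version of a module on the live node.
1> c("src/flurm_sched_backfill.erl").
{ok,flurm_sched_backfill}
%% Drop the old version once no process is still executing it.
2> code:soft_purge(flurm_sched_backfill).
true
```

To build FLURM from source you will need: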
- Erlang/OTP 28 (OTP 26+ compatible)
- rebar3 3.22+
- (Optional) Docker for containerized deployment
- (Optional) MUNGE for authentication in production
```bash
# Clone the repository
git clone https://github.com/zoratu/flurm.git
cd flurm

# Fetch dependencies and compile
rebar3 compile

# Run unit tests
rebar3 eunit

# Build the release
rebar3 release

# Start a single-node controller (development mode)
_build/default/rel/flurmctld/bin/flurmctld foreground

# In a separate terminal, check the controller is running
rebar3 shell
```

```erlang
%% From the Erlang shell
1> flurm_controller_app:status().
```

```bash
# Submit a job (using standard SLURM commands)
sbatch --wrap="echo Hello FLURM"

# Check job status
squeue

# View node status
sinfo

# Cancel a job
scancel <job_id>
```

Create a `flurm.config` file:
```erlang
[
  {flurm, [
    {cluster_name, "my-cluster"},
    {controllers, [
      {"controller1.example.com", 6817},
      {"controller2.example.com", 6817},
      {"controller3.example.com", 6817}
    ]},
    {slurmd_port, 6818},
    {slurmctld_port, 6817},
    {scheduler_plugin, flurm_sched_backfill},
    {checkpoint_interval, 60000}
  ]}
].
```

With three controllers listed, the Raft group keeps a majority quorum and stays available through the loss of any single controller.

```mermaid
flowchart TB
subgraph Clients["Client Layer"]
sbatch
squeue
scancel
sinfo
scontrol
sacct
salloc
end
subgraph Controller["Controller Layer"]
Protocol["Protocol Decoder<br/>(flurm_protocol.erl)"]
subgraph Managers["Core Managers"]
Job["Job<br/>Manager"]
Queue["Queue<br/>Manager"]
Sched["Scheduler<br/>Engine"]
Node["Node<br/>Manager"]
Part["Partition<br/>Manager"]
end
State["State Manager<br/>(Raft Consensus + Mnesia)"]
end
subgraph Compute["Compute Layer"]
N1["Node 1<br/>(flurmnd)"]
N2["Node 2<br/>(flurmnd)"]
NN["Node N<br/>(flurmnd)"]
end
Clients -->|SLURM Protocol TCP| Protocol
Protocol --> Managers
Managers --> State
State --> Compute
```
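The protocol decoder at the front of the controller layer turns SLURM's length-prefixed binary framing into Erlang terms. As a rough, simplified sketch (the real header carries more fields; see `flurm_protocol.erl` for the actual layout), binary pattern matching does most of the work:

```erlang
%% Illustrative only: decode a simplified SLURM-style message header.
%% Field widths here are assumptions, not the real wire layout.
-module(flurm_protocol_sketch).
-export([decode/1]).

decode(<<Version:16/big, Flags:16/big, MsgType:16/big,
         BodyLen:32/big, Body:BodyLen/binary, Rest/binary>>) ->
    %% A complete frame: hand the body to the per-message decoder.
    {ok, #{version => Version, flags => Flags,
           msg_type => MsgType, body => Body}, Rest};
decode(_Partial) ->
    %% Not enough bytes buffered yet; caller waits for more TCP data.
    {more, need_data}.
```

How FLURM compares with SLURM: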
| Feature | SLURM | FLURM |
|---|---|---|
| Job Scheduling | Yes | Yes |
| Fair Share | Yes | Yes |
| Backfill Scheduling | Yes | Yes |
| Preemption | Yes | Yes |
| Job Arrays | Yes | Yes |
| Node Health Monitoring | Yes | Yes |
| Accounting | Yes | Yes |
| Hot Code Reload | No | Yes |
| Zero-Downtime Upgrades | No | Yes |
| Lock-Free Scheduling | No | Yes |
| Built-in Consensus | Limited | Raft |
| Live State Inspection | Limited | Full REPL |
| Deterministic Testing | No | TLA+ & SimTest |
| Protocol Version | 23.x compatible | 23.x compatible |
| Max Controllers | 2 (active/passive) | Unlimited (Raft) |
| Failover Time | 30-60 seconds | < 1 second |
| Language | C | Erlang/OTP |
- Usage Guide - How to use FLURM (SLURM command reference)
- Quick Start - Get running in 5 minutes
- Architecture Overview - System design and OTP structure
- Protocol Reference - SLURM binary protocol details
- Development Guide - Contributing to FLURM
- AI Agent Guide - Guide for AI-assisted development
- Deployment Guide - Production deployment
- Operations Guide - Day-to-day operations and troubleshooting
- Security Guide - Security model, authentication, and hardening
- Migration Guide - Migrating from SLURM to FLURM
- Testing Guide - How to test FLURM
- SLURM Compatibility Testing - 141 SLURM-native tests (132 pass, 9 skip)
- SLURM Client Testing - Testing with real SLURM clients
- Benchmarks - Performance benchmarks and results
- Code Coverage - Coverage strategy and targets
- SLURM Differences - Key differences between SLURM and FLURM
FLURM is in active development (as of February 2026); phases 7-8 of the implementation are complete. The following components are implemented:
- SLURM protocol decoder/encoder (75% coverage)
- Basic job submission and management
- Node registration and heartbeat
- Partition management
- Fair share scheduler
- Raft consensus integration (Ra library)
- Controller failover
- Hot code reloading (slurm.conf live reload)
- srun support (interactive jobs) - I/O forwarding, task exit handling
- Job steps management
- sacctmgr (accounting management)
- slurmdbd (accounting daemon)
- Unit test suite (2400+ tests)
- SLURM compatibility test suite (141 tests, 132 pass, 9 skip)
- Protocol fuzzing (33K+ PropEr property tests)
- Deterministic simulation framework (FoundationDB-style)
- Performance benchmarks (3M+ ops/sec job submission)
- Multi-node cluster tests (Docker Compose)
- Integration test framework (22/22 tests passing)
- TLA+ model checking (Federation, Accounting, Migration specs)
- GPU scheduling (GRES) - Full generic resource support
- Burst buffer support - Stage-in/stage-out operations
- Job arrays - Full array syntax with throttling
- Job dependencies - afterok, afterany, afternotok, singleton
- Preemption - Checkpoint, requeue, suspend modes
- Reservations - Maintenance windows and user reservations
- License management - Cluster-wide license tracking
- Federation support (partial)

Planned:

- Full federation with cross-cluster job submission
- Kubernetes operator deployment
- SPANK plugin compatibility layer
- Integration test coverage for core modules
FLURM passes 132 out of 141 SLURM-native compatibility tests derived from the SchedMD SLURM testsuite. Tests cover all major SLURM CLI tools (including sacct and salloc) using real SLURM clients against a FLURM Docker cluster:
| Test Category | Tests | Status |
|---|---|---|
| sinfo output and formatting | 10 | All pass |
| sbatch submission and options | 16 | All pass |
| squeue output and filtering | 16 | All pass |
| scontrol show job/partition/node | 13 | 11 pass, 2 skip |
| scancel job cancellation | 7 | All pass |
| Job lifecycle (pending/running/completed) | 6 | All pass |
| Resource scheduling (CPU/memory) | 6 | All pass |
| Stress tests (concurrent submissions) | 5 | All pass |
| Node resource tracking | 5 | All pass |
| Edge cases (empty queues, invalid jobs) | 10 | All pass |
| SBATCH directives (#SBATCH parsing) | 4 | All pass |
| Time limit formats (HH:MM:SS, D-HH:MM:SS) | 4 | All pass |
| Node detail tracking | 4 | 3 pass, 1 skip |
| Mixed workload | 2 | All pass |
| Python-suite scancel filtering | 4 | All pass |
| Python-suite sbatch environment | 4 | All pass |
| Python-suite squeue format | 3 | All pass |
| Python-suite sinfo node states | 3 | All pass |
| Python-suite job output | 4 | All pass |
| Python-suite scontrol extended | 4 | All pass |
| sacct accounting queries | 5 | 2 pass, 3 skip |
| salloc interactive allocation | 6 | 3 pass, 3 skip |
The 9 skips break down as: 3 pre-existing (--output path display, scontrol hold, scontrol update JobName), 3 sacct job-data queries (slurmdbd is connected but job records are not yet stored), and 3 timing-dependent salloc tests.
The test suite runs automatically in the pre-commit hook when Docker containers are running. See SLURM Compatibility Testing for details.
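The protocol layer is also fuzzed with PropEr (the 33K+ property tests noted above). A representative round-trip property looks roughly like the following; the generator and the encode/decode names are assumptions, not the project's actual API:

```erlang
-module(flurm_protocol_prop).
-include_lib("proper/include/proper.hrl").

%% Encoding any generated message and decoding it back must yield the
%% same term with no trailing bytes.
prop_encode_decode_roundtrip() ->
    ?FORALL(Msg, flurm_proto_gen:message(),          %% hypothetical generator
            begin
                Bin = iolist_to_binary(flurm_protocol:encode(Msg)),
                {ok, Decoded, <<>>} = flurm_protocol:decode(Bin),
                Decoded =:= Msg
            end).
```

A run such as `proper:quickcheck(prop_encode_decode_roundtrip(), 10000).` exercises ten thousand random messages.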
The srun command for interactive jobs now works with FLURM. The implementation includes:
- Full I/O forwarding from node daemon back to srun client
- Task exit code reporting with proper waitpid format conversion
- Job step creation and launch task protocol
- RESPONSE_LAUNCH_TASKS and MESSAGE_TASK_EXIT message handling
Status: srun is fully functional for basic interactive jobs. Commands like srun hostname, srun echo "Hello World", and srun <script> work correctly.
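For reference, the "waitpid format" above is the POSIX wait(2) status word: a normal exit code is packed into bits 8-15, while a terminating signal sits in the low 7 bits. A minimal sketch of the conversion (helper names hypothetical; core-dump and stop bits ignored):

```erlang
-module(wait_status_sketch).
-export([encode/1, decode/1]).

%% Pack a task result into a wait(2)-style status integer.
encode({exited, Code})  -> (Code band 16#ff) bsl 8;
encode({signaled, Sig}) -> Sig band 16#7f.

%% Unpack a status integer back into a task result.
decode(Status) when Status band 16#7f =:= 0 ->
    {exited, (Status bsr 8) band 16#ff};
decode(Status) ->
    {signaled, Status band 16#7f}.
```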
Working CLI commands: sbatch, squeue, scancel, sinfo, scontrol show job/partition/node, srun, sacct, salloc
We welcome contributions! Please see our Development Guide for details on:
- Setting up your development environment
- Code style and conventions
- Submitting pull requests
- Testing requirements
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Write tests for your changes
- Ensure all tests pass (`rebar3 eunit && rebar3 ct`)
- Run the linter (`rebar3 lint`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
GNU GPLv3
- The SLURM team at SchedMD for creating the industry-standard workload manager
- The Erlang/OTP team for the incredible runtime
- The TLA+ community for formal verification tools
FLURM - Because your HPC cluster deserves fault tolerance.