
FLURM

Fault-tolerant Linux Utility for Resource Management

An Erlang-based, SLURM-compatible job scheduler designed for high availability, zero-downtime operations, and seamless horizontal scaling.


Note: This project was developed with the assistance of generative AI (Claude by Anthropic). The architecture, code, documentation, and TLA+ specifications were created through AI-assisted development.

Build Status

| Component | Status |
| --- | --- |
| Unit Tests | 2400+ passing |
| Integration Tests | 22/22 passing |
| SLURM Compatibility | 141 tests (132 pass, 9 skip) |
| Protocol Fuzzing | 33K+ property tests |
| TLA+ Verification | All specs pass |
| Performance | Benchmarked (see docs/BENCHMARKS.md) |

Overview

FLURM is a next-generation workload manager that speaks the SLURM protocol while leveraging Erlang/OTP's battle-tested concurrency primitives. It provides fault tolerance, hot code reloading, and distributed consensus without the operational complexity of traditional HPC schedulers.

Key Features

  • Hot Code Reload: Update scheduler logic without dropping jobs or connections (see the sketch after this list)
  • Zero-Downtime Failover: Multi-controller consensus with automatic leader election
  • Dynamic Scaling: Add or remove compute nodes without cluster restarts
  • No Global Locks: Lock-free scheduling using Erlang's actor model
  • SLURM Protocol Compatible: Drop-in replacement for existing SLURM clients
  • Built-in Observability: Prometheus metrics, distributed tracing, and live introspection
  • Deterministic Testing: TLA+ specifications and simulation-based testing
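
As a concrete illustration of the hot-code-reload feature above, the sketch below swaps the backfill scheduler module on a live controller from an attached Erlang shell. The module name flurm_sched_backfill is taken from the sample configuration further down; a production upgrade would normally go through an OTP release upgrade rather than an ad-hoc reload, so treat this as a sketch, not FLURM's documented procedure.

%% Minimal sketch: reload the scheduler module on a running node.
%% Existing jobs and connections are untouched; subsequent scheduling
%% calls are served by the freshly loaded code.
1> code:purge(flurm_sched_backfill).
2> code:load_file(flurm_sched_backfill).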

Quick Start

Prerequisites

  • Erlang/OTP 28 (OTP 26+ compatible)
  • rebar3 3.22+
  • (Optional) Docker for containerized deployment
  • (Optional) MUNGE for authentication in production

Installation

# Clone the repository
git clone https://github.com/zoratu/flurm.git
cd flurm

# Fetch dependencies and compile
rebar3 compile

# Run unit tests
rebar3 eunit

# Build the release
rebar3 release

# Start a single-node controller (development mode)
_build/default/rel/flurmctld/bin/flurmctld foreground

Verify Installation

# In a separate terminal, check the controller is running
rebar3 shell

# From the Erlang shell
1> flurm_controller_app:status().

Basic Usage

# Submit a job (using standard SLURM commands)
sbatch --wrap="echo Hello FLURM"

# Check job status
squeue

# View node status
sinfo

# Cancel a job
scancel <job_id>

Configuration

Create a flurm.config file:

[
  {flurm, [
    {cluster_name, "my-cluster"},
    {controllers, [
      {"controller1.example.com", 6817},
      {"controller2.example.com", 6817},
      {"controller3.example.com", 6817}
    ]},
    {slurmd_port, 6818},
    {slurmctld_port, 6817},
    {scheduler_plugin, flurm_sched_backfill},
    {checkpoint_interval, 60000}
  ]}
].
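
Runtime components read these settings from the application environment. The sketch below shows how the keys above could be consumed with application:get_env/3; the module name is hypothetical, and it assumes the release loads flurm.config as its application configuration.

%% Hypothetical helper illustrating how the configuration keys above are read.
-module(flurm_config_example).
-export([controllers/0, scheduler/0]).

%% Controller endpoints as {Host, Port} tuples, with a single-node default.
controllers() ->
    application:get_env(flurm, controllers, [{"localhost", 6817}]).

%% Scheduler plugin module, e.g. flurm_sched_backfill.
scheduler() ->
    application:get_env(flurm, scheduler_plugin, flurm_sched_backfill).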

Architecture Overview

flowchart TB
    subgraph Clients["Client Layer"]
        sbatch
        squeue
        scancel
        sinfo
        scontrol
        sacct
        salloc
    end

    subgraph Controller["Controller Layer"]
        Protocol["Protocol Decoder<br/>(flurm_protocol.erl)"]

        subgraph Managers["Core Managers"]
            Job["Job<br/>Manager"]
            Queue["Queue<br/>Manager"]
            Sched["Scheduler<br/>Engine"]
            Node["Node<br/>Manager"]
            Part["Partition<br/>Manager"]
        end

        State["State Manager<br/>(Raft Consensus + Mnesia)"]
    end

    subgraph Compute["Compute Layer"]
        N1["Node 1<br/>(flurmnd)"]
        N2["Node 2<br/>(flurmnd)"]
        NN["Node N<br/>(flurmnd)"]
    end

    Clients -->|SLURM Protocol TCP| Protocol
    Protocol --> Managers
    Managers --> State
    State --> Compute
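
Read top to bottom: a client message is decoded by flurm_protocol.erl, dispatched to the appropriate manager, and any resulting state change is committed through the Raft-backed state manager before compute nodes are contacted. The sketch below shows that hand-off for a job submission; apart from flurm_protocol, which appears in the diagram, the module and function names are illustrative rather than FLURM's actual API.

%% Hypothetical sketch of the controller-side hand-off shown in the diagram.
%% Only flurm_protocol is named in the diagram; the manager APIs are assumed.
-module(flurm_flow_example).
-export([handle_submit/1]).

handle_submit(Binary) ->
    %% 1. Decode the SLURM wire message into a job specification.
    {ok, JobSpec} = flurm_protocol:decode(Binary),
    %% 2. The job manager validates the request and assigns a job id.
    {ok, JobId} = gen_server:call(flurm_job_manager, {submit, JobSpec}),
    %% 3. The new job is committed through the Raft-backed state manager
    %%    before any compute node is instructed to run it.
    ok = gen_server:call(flurm_state_manager, {commit, {job_added, JobId, JobSpec}}),
    {ok, JobId}.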

SLURM vs FLURM Capabilities

| Feature | SLURM | FLURM |
| --- | --- | --- |
| Job Scheduling | Yes | Yes |
| Fair Share | Yes | Yes |
| Backfill Scheduling | Yes | Yes |
| Preemption | Yes | Yes |
| Job Arrays | Yes | Yes |
| Node Health Monitoring | Yes | Yes |
| Accounting | Yes | Yes |
| Hot Code Reload | No | Yes |
| Zero-Downtime Upgrades | No | Yes |
| Lock-Free Scheduling | No | Yes |
| Built-in Consensus | Limited | Raft |
| Live State Inspection | Limited | Full REPL |
| Deterministic Testing | No | TLA+ & SimTest |
| Protocol Version | 23.x compatible | 23.x compatible |
| Max Controllers | 2 (active/passive) | Unlimited (Raft) |
| Failover Time | 30-60 seconds | < 1 second |
| Language | C | Erlang/OTP |

Documentation

  • Getting Started
  • Core Documentation
  • Operations & Deployment
  • Testing & Performance
  • Reference

Project Status

FLURM is currently in active development (February 2026). Phases 7 and 8 of the implementation are complete. The following components are implemented:

Core Components

  • SLURM protocol decoder/encoder (75% coverage)
  • Basic job submission and management
  • Node registration and heartbeat
  • Partition management
  • Fair share scheduler
  • Raft consensus integration (Ra library)
  • Controller failover
  • Hot code reloading (slurm.conf live reload)
  • srun support (interactive jobs) - I/O forwarding, task exit handling
  • Job steps management
  • sacctmgr (accounting management)
  • slurmdbd (accounting daemon)

Testing & Verification

  • Unit test suite (2400+ tests)
  • SLURM compatibility test suite (141 tests, 132 pass, 9 skip)
  • Protocol fuzzing (33K+ PropEr property tests; see the sketch after this list)
  • Deterministic simulation framework (FoundationDB-style)
  • Performance benchmarks (3M+ ops/sec job submission)
  • Multi-node cluster tests (Docker Compose)
  • Integration test framework (22/22 tests passing)
  • TLA+ model checking (Federation, Accounting, Migration specs)
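
The protocol-fuzzing entry above is, at its core, a round-trip property: any message that FLURM encodes must decode back to itself. The PropEr-style sketch below illustrates the idea; the msg() generator and the flurm_protocol encode/decode function names are assumptions for illustration, not the suite's actual code.

%% Sketch of a PropEr round-trip property over the wire protocol.
-module(prop_flurm_protocol_example).
-include_lib("proper/include/proper.hrl").
-export([prop_encode_decode_roundtrip/0]).

prop_encode_decode_roundtrip() ->
    ?FORALL(Msg, msg(),
            begin
                Bin = flurm_protocol:encode(Msg),
                {ok, Msg} =:= flurm_protocol:decode(Bin)
            end).

%% Toy message generator; the real suite fuzzes full SLURM message records.
msg() ->
    ?LET({JobId, Name}, {pos_integer(), non_empty(list(range($a, $z)))},
         #{job_id => JobId, name => Name}).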

Advanced Features

  • GPU scheduling (GRES) - Full generic resource support
  • Burst buffer support - Stage-in/stage-out operations
  • Job arrays - Full array syntax with throttling
  • Job dependencies - afterok, afterany, afternotok, singleton
  • Preemption - Checkpoint, requeue, suspend modes
  • Reservations - Maintenance windows and user reservations
  • License management - Cluster-wide license tracking
  • Federation support (partial)

Planned Features

  • Full federation with cross-cluster job submission
  • Kubernetes operator deployment
  • SPANK plugin compatibility layer
  • Integration test coverage for core modules

SLURM Compatibility Test Results

FLURM passes 132 out of 141 SLURM-native compatibility tests derived from the SchedMD SLURM testsuite. Tests cover all major SLURM CLI tools (including sacct and salloc) using real SLURM clients against a FLURM Docker cluster:

| Test Category | Tests | Status |
| --- | --- | --- |
| sinfo output and formatting | 10 | All pass |
| sbatch submission and options | 16 | All pass |
| squeue output and filtering | 16 | All pass |
| scontrol show job/partition/node | 13 | 11 pass, 2 skip |
| scancel job cancellation | 7 | All pass |
| Job lifecycle (pending/running/completed) | 6 | All pass |
| Resource scheduling (CPU/memory) | 6 | All pass |
| Stress tests (concurrent submissions) | 5 | All pass |
| Node resource tracking | 5 | All pass |
| Edge cases (empty queues, invalid jobs) | 10 | All pass |
| SBATCH directives (#SBATCH parsing) | 4 | All pass |
| Time limit formats (HH:MM:SS, D-HH:MM:SS) | 4 | All pass |
| Node detail tracking | 4 | 3 pass, 1 skip |
| Mixed workload | 2 | All pass |
| Python-suite scancel filtering | 4 | All pass |
| Python-suite sbatch environment | 4 | All pass |
| Python-suite squeue format | 3 | All pass |
| Python-suite sinfo node states | 3 | All pass |
| Python-suite job output | 4 | All pass |
| Python-suite scontrol extended | 4 | All pass |
| sacct accounting queries | 5 | 2 pass, 3 skip |
| salloc interactive allocation | 6 | 3 pass, 3 skip |

The 9 skipped tests break down as: 3 original skips (--output path display, scontrol hold, scontrol update JobName), 3 sacct job-data queries (the DBD is connected, but job records are not yet stored), and 3 timing-dependent salloc tests.

The test suite runs automatically in the pre-commit hook when Docker containers are running. See SLURM Compatibility Testing for details.

Known Limitations

srun Interactive Job Protocol

The srun command for interactive jobs is now working with FLURM. The implementation includes:

  • Full I/O forwarding from node daemon back to srun client
  • Task exit code reporting with proper waitpid format conversion
  • Job step creation and launch task protocol
  • RESPONSE_LAUNCH_TASKS and MESSAGE_TASK_EXIT message handling

Status: srun is fully functional for basic interactive jobs. Commands like srun hostname, srun echo "Hello World", and srun <script> work correctly.
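
The waitpid format conversion mentioned above follows the standard POSIX wait-status layout: a normal exit stores the exit code in bits 8-15, while death by signal stores the signal number in the low bits. A minimal sketch (the function name is illustrative, not FLURM's actual helper):

%% Sketch: encode a task's termination in waitpid/wait(2) status format.
exit_to_wait_status({exit_code, Code}) when Code >= 0, Code =< 255 ->
    Code bsl 8;                      % normal exit: code in bits 8-15
exit_to_wait_status({signal, Sig}) when Sig > 0, Sig < 128 ->
    Sig.                             % killed by signal: signal in bits 0-6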

Working CLI commands: sbatch, squeue, scancel, sinfo, scontrol show job/partition/node, srun, sacct, salloc

Contributing

We welcome contributions! Please see our Development Guide for details on:

  • Setting up your development environment
  • Code style and conventions
  • Submitting pull requests
  • Testing requirements

Contribution Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Write tests for your changes
  4. Ensure all tests pass (rebar3 eunit && rebar3 ct)
  5. Run the linter (rebar3 lint)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

License

GNU GPLv3

Acknowledgments

  • The SLURM team at SchedMD for creating the industry-standard workload manager
  • The Erlang/OTP team for the incredible runtime
  • The TLA+ community for formal verification tools

FLURM - Because your HPC cluster deserves fault tolerance.
