Fault-tolerant Linux Utility for Resource Management
An Erlang-based, SLURM-compatible job scheduler designed for high availability, zero-downtime operations, and seamless horizontal scaling.
Note: This project was developed with the assistance of generative AI (Claude by Anthropic). The architecture, code, documentation, and TLA+ specifications were created through AI-assisted development.
| Component | Status |
|---|---|
| Unit Tests | 2400+ passing |
| Integration Tests | 22/22 passing |
| SLURM Compatibility | 141 tests (132 pass, 9 skip) |
| Protocol Fuzzing | 33K+ property tests |
| TLA+ Verification | All specs pass |
| Performance | Benchmarked (see docs/BENCHMARKS.md) |
FLURM is a next-generation workload manager that speaks the SLURM protocol while leveraging Erlang/OTP's battle-tested concurrency primitives. It provides fault tolerance, hot code reloading, and distributed consensus without the operational complexity of traditional HPC schedulers.
- Hot Code Reload: Update scheduler logic without dropping jobs or connections (see the sketch after this list)
- Zero-Downtime Failover: Multi-controller consensus with automatic leader election
- Dynamic Scaling: Add or remove compute nodes without cluster restarts
- No Global Locks: Lock-free scheduling using Erlang's actor model
- SLURM Protocol Compatible: Drop-in replacement for existing SLURM clients
- Built-in Observability: Prometheus metrics, distributed tracing, and live introspection
- Deterministic Testing: TLA+ specifications and simulation-based testing
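To make the hot-reload claim concrete, the sketch below shows the standard OTP mechanics from a shell attached to a running node (the module name is illustrative). Running processes switch to the new version at their next fully-qualified `Module:Function` call, so in-flight jobs and connections are untouched:

```erlang
%% Compile and load a new version of a module on the live node.
1> c("src/flurm_sched_backfill.erl").
{ok,flurm_sched_backfill}
%% Drop the old version once no process is still executing it.
2> code:soft_purge(flurm_sched_backfill).
true
```

To build FLURM from source you will need: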
- Erlang/OTP 28 (OTP 26+ compatible)
- rebar3 3.22+
- (Optional) Docker for containerized deployment
- (Optional) MUNGE for authentication in production
```bash
# Clone the repository
git clone https://github.com/zoratu/flurm.git
cd flurm

# Fetch dependencies and compile
rebar3 compile

# Run unit tests
rebar3 eunit

# Build the release
rebar3 release

# Start a single-node controller (development mode)
_build/default/rel/flurmctld/bin/flurmctld foreground

# In a separate terminal, check the controller is running
rebar3 shell
```

```erlang
%% From the Erlang shell
1> flurm_controller_app:status().
```

```bash
# Submit a job (using standard SLURM commands)
sbatch --wrap="echo Hello FLURM"

# Check job status
squeue

# View node status
sinfo

# Cancel a job
scancel <job_id>
```

Create a `flurm.config` file:
```erlang
[
  {flurm, [
    {cluster_name, "my-cluster"},
    {controllers, [
      {"controller1.example.com", 6817},
      {"controller2.example.com", 6817},
      {"controller3.example.com", 6817}
    ]},
    {slurmd_port, 6818},
    {slurmctld_port, 6817},
    {scheduler_plugin, flurm_sched_backfill},
    {checkpoint_interval, 60000}
  ]}
].
```

With three controllers listed, the Raft group keeps a majority quorum and stays available through the loss of any single controller.

```mermaid
flowchart TB
subgraph Clients["Client Layer"]
sbatch
squeue
scancel
sinfo
scontrol
sacct
salloc
end
subgraph Controller["Controller Layer"]
Protocol["Protocol Decoder<br/>(flurm_protocol.erl)"]
subgraph Managers["Core Managers"]
Job["Job<br/>Manager"]
Queue["Queue<br/>Manager"]
Sched["Scheduler<br/>Engine"]
Node["Node<br/>Manager"]
Part["Partition<br/>Manager"]
end
State["State Manager<br/>(Raft Consensus + Mnesia)"]
end
subgraph Compute["Compute Layer"]
N1["Node 1<br/>(flurmnd)"]
N2["Node 2<br/>(flurmnd)"]
NN["Node N<br/>(flurmnd)"]
end
Clients -->|SLURM Protocol TCP| Protocol
Protocol --> Managers
Managers --> State
State --> Compute
```
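The protocol decoder at the front of the controller layer turns SLURM's length-prefixed binary framing into Erlang terms. As a rough, simplified sketch (the real header carries more fields; see `flurm_protocol.erl` for the actual layout), binary pattern matching does most of the work:

```erlang
%% Illustrative only: decode a simplified SLURM-style message header.
%% Field widths here are assumptions, not the real wire layout.
-module(flurm_protocol_sketch).
-export([decode/1]).

decode(<<Version:16/big, Flags:16/big, MsgType:16/big,
         BodyLen:32/big, Body:BodyLen/binary, Rest/binary>>) ->
    %% A complete frame: hand the body to the per-message decoder.
    {ok, #{version => Version, flags => Flags,
           msg_type => MsgType, body => Body}, Rest};
decode(_Partial) ->
    %% Not enough bytes buffered yet; caller waits for more TCP data.
    {more, need_data}.
```

How FLURM compares with SLURM: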
| Feature | SLURM | FLURM |
|---|---|---|
| Job Scheduling | Yes | Yes |
| Fair Share | Yes | Yes |
| Backfill Scheduling | Yes | Yes |
| Preemption | Yes | Yes |
| Job Arrays | Yes | Yes |
| Node Health Monitoring | Yes | Yes |
| Accounting | Yes | Yes |
| Hot Code Reload | No | Yes |
| Zero-Downtime Upgrades | No | Yes |
| Lock-Free Scheduling | No | Yes |
| Built-in Consensus | Limited | Raft |
| Live State Inspection | Limited | Full REPL |
| Deterministic Testing | No | TLA+ & SimTest |
| Protocol Version | 23.x compatible | 23.x compatible |
| Max Controllers | 2 (active/passive) | Unlimited (Raft) |
| Failover Time | 30-60 seconds | < 1 second |
| Language | C | Erlang/OTP |
- Usage Guide - How to use FLURM (SLURM command reference)
- Quick Start - Get running in 5 minutes
- Architecture Overview - System design and OTP structure
- Protocol Reference - SLURM binary protocol details
- Development Guide - Contributing to FLURM
- AI Agent Guide - Guide for AI-assisted development
- Deployment Guide - Production deployment
- Operations Guide - Day-to-day operations and troubleshooting
- Security Guide - Security model, authentication, and hardening
- Migration Guide - Migrating from SLURM to FLURM
- Testing Guide - How to test FLURM
- SLURM Compatibility Testing - 141 SLURM-native tests (132 pass, 9 skip)
- SLURM Client Testing - Testing with real SLURM clients
- Benchmarks - Performance benchmarks and results
- Code Coverage - Coverage strategy and targets
- SLURM Differences - Key differences between SLURM and FLURM
FLURM is in active development (as of February 2026); phases 7-8 of the implementation are complete. The following components are implemented:
- SLURM protocol decoder/encoder (75% coverage)
- Basic job submission and management
- Node registration and heartbeat
- Partition management
- Fair share scheduler
- Raft consensus integration (Ra library)
- Controller failover
- Hot code reloading (slurm.conf live reload)
- srun support (interactive jobs) - I/O forwarding, task exit handling
- Job steps management
- sacctmgr (accounting management)
- slurmdbd (accounting daemon)
- Unit test suite (2400+ tests)
- SLURM compatibility test suite (141 tests, 132 pass, 9 skip)
- Protocol fuzzing (33K+ PropEr property tests)
- Deterministic simulation framework (FoundationDB-style)
- Performance benchmarks (3M+ ops/sec job submission)
- Multi-node cluster tests (Docker Compose)
- Integration test framework (22/22 tests passing)
- TLA+ model checking (Federation, Accounting, Migration specs)
- GPU scheduling (GRES) - Full generic resource support
- Burst buffer support - Stage-in/stage-out operations
- Job arrays - Full array syntax with throttling
- Job dependencies - afterok, afterany, afternotok, singleton
- Preemption - Checkpoint, requeue, suspend modes
- Reservations - Maintenance windows and user reservations
- License management - Cluster-wide license tracking
- Federation support (partial)

Planned:

- Full federation with cross-cluster job submission
- Kubernetes operator deployment
- SPANK plugin compatibility layer
- Integration test coverage for core modules
FLURM passes 132 out of 141 SLURM-native compatibility tests derived from the SchedMD SLURM testsuite. Tests cover all major SLURM CLI tools (including sacct and salloc) using real SLURM clients against a FLURM Docker cluster:
| Test Category | Tests | Status |
|---|---|---|
| sinfo output and formatting | 10 | All pass |
| sbatch submission and options | 16 | All pass |
| squeue output and filtering | 16 | All pass |
| scontrol show job/partition/node | 13 | 11 pass, 2 skip |
| scancel job cancellation | 7 | All pass |
| Job lifecycle (pending/running/completed) | 6 | All pass |
| Resource scheduling (CPU/memory) | 6 | All pass |
| Stress tests (concurrent submissions) | 5 | All pass |
| Node resource tracking | 5 | All pass |
| Edge cases (empty queues, invalid jobs) | 10 | All pass |
| SBATCH directives (#SBATCH parsing) | 4 | All pass |
| Time limit formats (HH:MM:SS, D-HH:MM:SS) | 4 | All pass |
| Node detail tracking | 4 | 3 pass, 1 skip |
| Mixed workload | 2 | All pass |
| Python-suite scancel filtering | 4 | All pass |
| Python-suite sbatch environment | 4 | All pass |
| Python-suite squeue format | 3 | All pass |
| Python-suite sinfo node states | 3 | All pass |
| Python-suite job output | 4 | All pass |
| Python-suite scontrol extended | 4 | All pass |
| sacct accounting queries | 5 | 2 pass, 3 skip |
| salloc interactive allocation | 6 | 3 pass, 3 skip |
The 9 skips break down as: 3 pre-existing (--output path display, scontrol hold, scontrol update JobName), 3 sacct job-data queries (slurmdbd is connected but job records are not yet stored), and 3 timing-dependent salloc tests.
The test suite runs automatically in the pre-commit hook when Docker containers are running. See SLURM Compatibility Testing for details.
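The protocol layer is also fuzzed with PropEr (the 33K+ property tests noted above). A representative round-trip property looks roughly like the following; the generator and the encode/decode names are assumptions, not the project's actual API:

```erlang
-module(flurm_protocol_prop).
-include_lib("proper/include/proper.hrl").

%% Encoding any generated message and decoding it back must yield the
%% same term with no trailing bytes.
prop_encode_decode_roundtrip() ->
    ?FORALL(Msg, flurm_proto_gen:message(),          %% hypothetical generator
            begin
                Bin = iolist_to_binary(flurm_protocol:encode(Msg)),
                {ok, Decoded, <<>>} = flurm_protocol:decode(Bin),
                Decoded =:= Msg
            end).
```

A run such as `proper:quickcheck(prop_encode_decode_roundtrip(), 10000).` exercises ten thousand random messages.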
The srun command for interactive jobs now works with FLURM. The implementation includes:
- Full I/O forwarding from node daemon back to srun client
- Task exit code reporting with proper waitpid format conversion
- Job step creation and launch task protocol
- RESPONSE_LAUNCH_TASKS and MESSAGE_TASK_EXIT message handling
Status: srun is fully functional for basic interactive jobs. Commands like srun hostname, srun echo "Hello World", and srun <script> work correctly.
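For reference, the "waitpid format" above is the POSIX wait(2) status word: a normal exit code is packed into bits 8-15, while a terminating signal sits in the low 7 bits. A minimal sketch of the conversion (helper names hypothetical; core-dump and stop bits ignored):

```erlang
-module(wait_status_sketch).
-export([encode/1, decode/1]).

%% Pack a task result into a wait(2)-style status integer.
encode({exited, Code})  -> (Code band 16#ff) bsl 8;
encode({signaled, Sig}) -> Sig band 16#7f.

%% Unpack a status integer back into a task result.
decode(Status) when Status band 16#7f =:= 0 ->
    {exited, (Status bsr 8) band 16#ff};
decode(Status) ->
    {signaled, Status band 16#7f}.
```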
Working CLI commands: sbatch, squeue, scancel, sinfo, scontrol show job/partition/node, srun, sacct, salloc
We welcome contributions! Please see our Development Guide for details on:
- Setting up your development environment
- Code style and conventions
- Submitting pull requests
- Testing requirements
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Write tests for your changes
- Ensure all tests pass (`rebar3 eunit && rebar3 ct`)
- Run the linter (`rebar3 lint`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
GNU GPLv3
- The SLURM team at SchedMD for creating the industry-standard workload manager
- The Erlang/OTP team for the incredible runtime
- The TLA+ community for formal verification tools
FLURM - Because your HPC cluster deserves fault tolerance.