feat: add scheduler backends (local/k8s/slurm) with unified CLI #10

Merged
TerrenceZhangX merged 56 commits into microsoft:main from TerrenceZhangX:zhangt/scheduler-support
Mar 20, 2026
TerrenceZhangX commented Mar 19, 2026

Summary

Add a schedulers/ package with three job scheduler backends for stage profiling, plus a unified flowsim CLI.

What's New

Scheduler Backends (schedulers/)

  • local.py — run profiling in a local Docker container
  • k8s.py — submit as Kubernetes Jobs (PVC/hostPath storage)
  • slurm.py — submit via sbatch/squeue/scancel (CLI mode); supports docker, enroot, and bare-metal container runtimes; auto-mounts output dir; --exclusive GPU allocation
  • base.py — abstract base class + ProfileJobSpec dataclass; auto-generated job names include timestamp for uniqueness
  • config.py — YAML config loading from ~/.flowsim/ with env var overrides
  • templates/ — annotated config templates for K8s and Slurm
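
The shape of base.py can be pictured with a minimal sketch. The field list below is only a hypothetical subset taken from the CLI flags in the Usage section, and EchoScheduler is an illustrative toy backend, not one of the three real ones; only the names ProfileJobSpec, BaseScheduler, and the render/submit/dry_run methods come from this PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ProfileJobSpec:
    """Assumed subset of the profiling parameters (mirrors the CLI flags)."""
    model_path: str
    collect: str = "all"
    tp: int = 1
    bs: int = 1
    input_len: int = 2048
    existing_ctx: int = 0
    decode_tokens: int = 2
    gpus: int = 1


class BaseScheduler(ABC):
    @abstractmethod
    def render(self, spec: ProfileJobSpec) -> str:
        """Return the backend-specific job definition as text."""

    @abstractmethod
    def submit(self, spec: ProfileJobSpec) -> str:
        """Submit the job and return an identifier."""

    def dry_run(self, spec: ProfileJobSpec) -> str:
        # A dry run only renders; nothing is sent to a cluster.
        return self.render(spec)


class EchoScheduler(BaseScheduler):
    """Toy backend (hypothetical) used only to exercise the interface."""

    def render(self, spec: ProfileJobSpec) -> str:
        return f"profile {spec.model_path} tp={spec.tp} bs={spec.bs} gpus={spec.gpus}"

    def submit(self, spec: ProfileJobSpec) -> str:
        return "echo-1"
```

Keeping dry_run() in the base class means each backend only has to implement render() and submit().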

CLI (scripts/cli/)

  • flowsim init {k8s,slurm} — install config template (or --config to use your own)
  • flowsim submit — submit profiling jobs (local/k8s/slurm), with --dry-run and --sweep-file
  • flowsim status/logs/cancel/list — job lifecycle management
  • --sweep support for multi-point profiling in one job

Test Infrastructure (tests/integration/infra/)

  • Kind cluster config + GPU passthrough (CDI mode)
  • Slurm Docker Compose cluster (slurmctld + compute node with runtime: nvidia); host paths parameterized via $HOST_WORKSPACE
  • dev-setup.sh / dev-teardown.sh for one-click test environments

Tests

  • 45 unit tests (test_scheduler_cli.py)
  • Integration tests for all 3 backends (test_scheduler.py)

Stats

  • 23 new files, 5 modified
  • +4,908 / −179 lines

Usage

# Install
pip install -e .

# Initialize config
flowsim init k8s
flowsim init slurm --config my-slurm.yaml

# Submit — same workload, different schedulers
flowsim submit --scheduler local \
    --collect all \
    --model-path workload/models/configs/Qwen3-235B-A22B \
    --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \
    --extra-server-opts "--load-format dummy"

flowsim submit --scheduler k8s \
    --collect all \
    --model-path workload/models/configs/Qwen3-235B-A22B \
    --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \
    --extra-server-opts "--load-format dummy"

flowsim submit --scheduler slurm \
    --collect all \
    --model-path workload/models/configs/Qwen3-235B-A22B \
    --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \
    --slurm-partition gpu \
    --extra-server-opts "--load-format dummy"

# Multi-point sweep (works with any scheduler)
flowsim submit --scheduler local \
    --collect all \
    --model-path workload/models/configs/Qwen3-235B-A22B \
    --sweep 1:2048:0 4:2048:0 8:2048:0 \
    --decode-tokens 2 --gpus 1 \
    --extra-server-opts "--load-format dummy"

# Job management
flowsim status --scheduler k8s --job <job-name>
flowsim logs   --scheduler k8s --job <job-name>
flowsim cancel --scheduler k8s --job <job-name>
flowsim list   --scheduler slurm

Terrence Zhang added 30 commits March 17, 2026 03:46
Add a scheduler package with:
- ProfileJobSpec dataclass for all profiling parameters
- BaseScheduler ABC with render/submit/dry_run interface
- K8sScheduler: generates valid K8s Job YAML with GPU resources,
  PVC/hostPath volumes, nodeSelector, serviceAccount support
- SlurmScheduler: generates sbatch scripts with docker/enroot/bare-metal
  container runtimes, module loading, and custom #SBATCH directives
- scripts/submit_profile.py: unified CLI entry point with --scheduler
  {k8s,slurm}, --dry-run (default) and --submit modes

Zero external dependencies — uses only Python stdlib.
K8s:
- render() now builds a dict and serializes via yaml.safe_dump (falls
  back to json.dumps if PyYAML is absent). Fixes YAML injection when
  values contain : # or quotes.
- submit() uses the 'kubernetes' Python client (kubeconfig / in-cluster).
- New args: --k8s-kubeconfig, --k8s-context.
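
A rough sketch of the dict-then-serialize approach described above, assuming a much smaller manifest than the real one. The function name render_job, its parameters, and the container name are hypothetical; the app=flowsim label is the one mentioned later in this PR for job listing.

```python
import json

try:
    import yaml  # PyYAML is optional here
except ImportError:
    yaml = None


def render_job(name: str, image: str, command: list[str], gpus: int) -> str:
    """Build the Job as a dict so values are never spliced into YAML text."""
    manifest = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {"app": "flowsim"}},
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "profile",
                        "image": image,
                        "command": command,
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }
    if yaml is not None:
        # The serializer quotes values containing ':', '#', or quotes as needed.
        return yaml.safe_dump(manifest, sort_keys=False)
    # Fallback when PyYAML is absent: JSON output is also accepted by YAML parsers.
    return json.dumps(manifest, indent=2)
```

Because quoting is left to the serializer, an argument such as "a: b # c" round-trips unchanged instead of being read as a mapping or a comment.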

Slurm:
- submit() now posts to slurmrestd REST API via urllib.request (stdlib).
- Supports JWT auth, configurable API version (v0.0.39–v0.0.41+),
  and TLS certificate verification toggle.
- New args: --slurm-rest-url, --slurm-jwt-token, --slurm-api-version,
  --slurm-no-verify-ssl.

render() / dry-run remain zero-dependency (stdlib only).
submit() requires 'kubernetes' package for K8s; Slurm uses stdlib.
- Core deps: requests, perfetto, numpy, pandas
- Optional dependency groups:
  k8s:  kubernetes>=27.0, PyYAML>=6.0
  slurm: (stdlib only, no extra deps)
  sim:  scalesim, scipy, torch
  viz:  matplotlib, seaborn
  api:  fastapi, pydantic, uvicorn
  dev:  black, pytest
  all:  everything
- Entry point: flowsim-submit -> scripts.submit_profile:main
- requires-python >= 3.10
- Add scripts/__init__.py so 'scripts' is a findable package
- Remove sys.path hack from submit_profile.py (not needed after install)
- Add [tool.setuptools.packages.find] with explicit include list
  (excludes tests/ and backend/ from the installable package)
- Improve K8s submit error: catch both kubeconfig and in-cluster
  failures and show a single clear message with --k8s-kubeconfig hint

Verified: pip install -e '.[k8s]' -> flowsim-submit --dry-run works.
- Add scripts/cli.py with subcommand routing (flowsim {submit, ...})
- Entry point changed: flowsim-submit -> flowsim
- 'flowsim submit' delegates to submit_profile.main()
- Extensible for future subcommands (profile, parse, simulate)
Removed the redundant --submit flag. The subcommand name already implies
submission; --dry-run is the opt-out.
- Slurm: fail fast if --slurm-rest-url or --slurm-jwt-token missing
- K8s: warn to stderr when no explicit kubeconfig/context provided
- --dry-run skips validation (no cluster needed for manifest preview)
Connection params now read from environment variables as defaults,
so you don't have to pass them every invocation:

K8s:
  KUBECONFIG             -> --k8s-kubeconfig
  FLOWSIM_K8S_NAMESPACE  -> --k8s-namespace
  FLOWSIM_K8S_CONTEXT    -> --k8s-context

Slurm:
  FLOWSIM_SLURM_REST_URL     -> --slurm-rest-url
  FLOWSIM_SLURM_JWT_TOKEN    -> --slurm-jwt-token
  FLOWSIM_SLURM_PARTITION    -> --slurm-partition
  FLOWSIM_SLURM_TIME         -> --slurm-time
  FLOWSIM_SLURM_API_VERSION  -> --slurm-api-version

CLI flags override env vars. Env var names shown in --help.
No more built-in defaults for cluster connection params. Users must
configure before submitting:

  flowsim init                 # copies templates to ~/.flowsim/
  vim ~/.flowsim/k8s.yaml      # fill in kubeconfig, namespace, etc.
  vim ~/.flowsim/slurm.yaml    # fill in rest_url, partition, etc.
  flowsim submit ...           # works

Changes:
- Add 'flowsim init' subcommand (copies templates, --force to overwrite)
- Split config into ~/.flowsim/k8s.yaml and ~/.flowsim/slurm.yaml
- Templates have empty REQUIRED fields — submit fails if unfilled
- Config loader: schedulers/config.py with per-scheduler load functions
- Priority: CLI flag > env var > config file (no silent fallbacks)
- Slurm jwt_token_cmd: execute a command to get token at submit time
- --dry-run skips all validation (no config needed for preview)
- flowsim init k8s --kubeconfig ... --namespace ...
- flowsim init slurm --rest-url ... --partition ... --account ...
- Required fields enforced by argparse, --help shows everything
- --force to overwrite existing config
- Demote --dry-run to [debug] in submit help text
- Remove template-copy approach, use _save_yaml() directly
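
The priority order (CLI flag > env var > config file) could be sketched as follows. A later commit in this PR mentions a resolve_default() helper in config.py; the signature used here is an assumption for illustration, not the actual one.

```python
import os
from typing import Any, Mapping, Optional


def resolve_default(cli_value: Optional[Any], env_var: str,
                    config: Mapping[str, Any], key: str) -> Optional[Any]:
    """Return the first value that is set: CLI flag, then env var, then config."""
    if cli_value is not None:
        return cli_value
    env_value = os.environ.get(env_var)
    if env_value:
        return env_value
    # An empty string in the config file counts as "not filled in".
    return config.get(key) or None
```

Returning None when nothing is set lets the caller fail with an explicit message instead of silently falling back to a built-in default.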
Docker test environments:
- kind-multi-node.yaml: 1 control-plane + 2 workers (GPU 0, GPU 1)
- slurm-compose.yaml: slurmctld + 2 slurmd (GPU 0, GPU 1) + slurmrestd
- slurm-node.dockerfile + slurm.conf: Slurm 23.11 with JWT auth

PD disaggregation:
- ProfileJobSpec: disagg_mode, disagg_transfer_backend, disagg_bootstrap_port,
  disagg_prefill_pp, disagg_ib_device
- as_prefill() / as_decode() helpers for creating PD pairs
- BaseScheduler: render_pd_pair() and submit_pd_pair()
- CLI: --pd flag submits prefill + decode job pair
- --disagg-transfer-backend (mooncake/nixl), --disagg-bootstrap-port, etc.

Bugfix:
- resolve_jwt_token: catch FileNotFoundError when jwt_token_cmd binary missing
- dev-setup.sh: auto-installs kind/kubectl, creates kind cluster, starts
  Slurm compose, runs flowsim init — all in one command
- dev-teardown.sh: tears down both clusters cleanly
- Supports 'kind', 'slurm', or 'all' (default) targets
- Verified: kind cluster creation + K8s Job submit + PD pair submit all work
- LocalScheduler runs profiling via subprocess on this machine
- --local-gpus to set CUDA_VISIBLE_DEVICES (e.g. '0' or '0,1')
- --local-workdir for custom working directory
- No cluster config needed; replaces manual 'python scripts/run_stage_profile.py'
- Supports --pd for local PD disaggregation testing
- Skips cluster connection validation for local scheduler
Tests cover:
- ProfileJobSpec: job name, server opts, disagg params, as_prefill/decode
- K8sScheduler.render: YAML validity, namespace, GPU resources, PVC,
  hostPath, nodeSelector, serviceAccount, labels, PD pair
- SlurmScheduler.render: shebang, sbatch directives, docker/enroot/bare,
  modules, extra sbatch, constraint, time parsing
- LocalScheduler.render: GPU selection, workdir, env vars
- CLI init: help, required args, bad kubeconfig, save/load config,
  overwrite protection, --force
- CLI submit: help, dry-run for local/k8s/slurm, PD pair, nixl backend
- Config: save/load yaml, jwt_token static/cmd/bad_cmd, cfg_get

All tests run inside the FlowSim Docker container.
…8s submit without PVC

- log_dir is now derived as {output_dir}/logs/ (single volume covers both)
- LocalScheduler.submit() tees stdout/stderr to log files in real time
- K8s submit refuses if no --k8s-pvc or --k8s-host-output-dir (prevents data loss)
- Slurm output_dir defaults to ~/flowsim_traces (shared filesystem)
- Local output_dir defaults to {project}/stage_traces/
- Add flowsim status/logs subcommands (K8s via API, Slurm via slurmrestd, local via log files)
- Submit prints result location + follow-up commands after every job
- Add integration tests for local scheduler
- TestLocalScheduler: real TP=1 profiling, verify traces + logs + status/logs CLI
- TestK8sScheduler: dry-run YAML (PVC mount, hostPath, log paths), refuse without
  storage, real Job submit to Kind cluster with status/logs verification
- TestSlurmScheduler: dry-run sbatch script (output_dir, log_dir, PD pair)

Results: 9 passed, 1 skipped (K8s real submit skipped in container, passes on host)
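
The real-time tee of stdout/stderr to log files mentioned for LocalScheduler.submit() can be approximated with one reader thread per stream. This is a sketch under assumptions: run_and_tee and _pump are hypothetical names, and the .stderr.log suffix is inferred from the .stdout.log file names shown elsewhere in this PR.

```python
import subprocess
import sys
import threading
from pathlib import Path


def _pump(stream, log_path: Path, echo) -> None:
    # Copy each line to the log file and to the console as soon as it arrives.
    with log_path.open("w") as log:
        for line in stream:
            log.write(line)
            log.flush()
            echo.write(line)
            echo.flush()


def run_and_tee(cmd: list[str], log_dir: Path, job_name: str) -> int:
    """Run cmd, mirroring its output to <log_dir>/<job_name>.{stdout,stderr}.log."""
    log_dir.mkdir(parents=True, exist_ok=True)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          text=True) as proc:
        threads = [
            threading.Thread(target=_pump, args=(
                proc.stdout, log_dir / f"{job_name}.stdout.log", sys.stdout)),
            threading.Thread(target=_pump, args=(
                proc.stderr, log_dir / f"{job_name}.stderr.log", sys.stderr)),
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    # Leaving the with-block waits for the process, so returncode is set.
    return proc.returncode
```

Reading both pipes concurrently avoids the deadlock that can occur when one pipe fills up while the other is being drained.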
- Add JobResult dataclass: submit() now returns structured data
  (job_id, scheduler, state, output_dir, message) instead of string
- Add flowsim cancel: K8s (delete_namespaced_job), Slurm (DELETE via
  slurmrestd), local (no-op for synchronous jobs)
- Add flowsim list: list FlowSim jobs with --status filter
  K8s (label_selector=app=flowsim), Slurm (slurmrestd /jobs),
  local (scan log files)
- Add --follow / -f to flowsim logs: shows tail -f / kubectl logs -f
  commands for real-time log streaming
- submit_pd_pair() now returns list[JobResult] instead of string
- Post-submit output shows cancel/list/follow commands
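
A minimal sketch of the structured return value, using the five field names listed above. The field types, the default for message, and the follow_up_commands() helper are assumptions; the command strings follow the Job management examples in the PR description.

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    job_id: str
    scheduler: str
    state: str
    output_dir: str
    message: str = ""


def follow_up_commands(result: JobResult) -> list[str]:
    """Hypothetical helper: build the hints printed after a submit."""
    base = f"--scheduler {result.scheduler} --job {result.job_id}"
    return [
        f"flowsim status {base}",
        f"flowsim logs   {base}",
        f"flowsim cancel {base}",
    ]
```

Returning a dataclass instead of a string lets callers such as tests read job_id or state directly without parsing message text.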
Two-pass argparse: peek --scheduler with a minimal pre-parser,
then add only the relevant scheduler's options before full parse.
'flowsim submit --scheduler local --help' no longer shows k8s/slurm args.
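
The two-pass technique can be sketched with argparse's parse_known_args(). The option set per scheduler is trimmed to one flag each, and the "local" default for --scheduler is an assumption made so the sketch runs without arguments.

```python
import argparse


def build_parser(argv: list[str]) -> argparse.ArgumentParser:
    # Pass 1: a minimal pre-parser that only understands --scheduler.
    # add_help=False keeps --help from being consumed at this stage.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--scheduler", choices=["local", "k8s", "slurm"],
                     default="local")
    known, _unused = pre.parse_known_args(argv)

    # Pass 2: the full parser registers only the selected backend's options.
    parser = argparse.ArgumentParser(prog="flowsim submit", parents=[pre])
    parser.add_argument("--model-path")
    if known.scheduler == "k8s":
        parser.add_argument("--k8s-namespace")
    elif known.scheduler == "slurm":
        parser.add_argument("--slurm-partition")
    else:
        parser.add_argument("--local-gpus")
    return parser
```

Usage: call build_parser(sys.argv[1:]).parse_args(sys.argv[1:]); the help text then lists only the options of the chosen scheduler.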
Most systems (Ubuntu, Debian) don't have a 'python' symlink by default.
Before: flowsim-perf-qwen3-8b_1773771736.stdout.log
After:  flowsim-perf-qwen3-8b_20260317_184236.stdout.log

list_jobs() regex updated to support both old epoch and new formats.
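
One way to accept both file-name formats is a single pattern with two timestamp alternatives. The pattern and the parse_log_name() helper below are an illustrative sketch, not the regex used in list_jobs(); treating the old epoch value as UTC is also an assumption.

```python
import re
from datetime import datetime, timezone

# <job>_<YYYYMMDD_HHMMSS>.stdout.log (new) or <job>_<10-digit epoch>.stdout.log (old)
LOG_NAME = re.compile(
    r"^(?P<job>.+?)_(?P<stamp>\d{8}_\d{6}|\d{10})\.stdout\.log$"
)


def parse_log_name(filename: str):
    """Return (job_name, datetime) or None if the name matches neither format."""
    match = LOG_NAME.match(filename)
    if match is None:
        return None
    stamp = match.group("stamp")
    if "_" in stamp:
        when = datetime.strptime(stamp, "%Y%m%d_%H%M%S")
    else:
        # Old format: seconds since the epoch, interpreted here as UTC.
        when = datetime.fromtimestamp(int(stamp), tz=timezone.utc).replace(tzinfo=None)
    return match.group("job"), when
```

Both example names from the commit message resolve to the job name flowsim-perf-qwen3-8b.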
- Add submit_via='cli' mode to SlurmScheduler, using sbatch/squeue/scancel
  subprocess calls instead of slurmrestd REST API (which has JWT auth issues
  in Slurm 23.11 docker containers).
- Add cli_prefix param for running commands via docker exec.
- Use scontrol show job for status (works without slurmdbd).
- Slurm compose: base image on flowsim-image:latest, compile Slurm 23.11
  with NVML support, cgroup/v1, explicit GRES config.
- Slurm test passes in ~76s (same as K8s test).
- K8s test uses host mount for traces (no docker cp).
- All three backends (local, k8s, slurm) tested and working.
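
The CLI mode described above boils down to building command lines, optionally prefixed, and parsing their text output. The helper names below are hypothetical and the container name in the usage note is only an example; the "Submitted batch job" line and the JobState= field are the standard sbatch and scontrol output formats.

```python
import re
import shlex


def with_prefix(cmd: list[str], cli_prefix: str = "") -> list[str]:
    """Prepend an optional wrapper such as 'docker exec <container>'."""
    return shlex.split(cli_prefix) + list(cmd)


def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the numeric id from 'Submitted batch job <id>'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return match.group(1)


def parse_job_state(scontrol_stdout: str) -> str:
    """Extract JobState from 'scontrol show job <id>' output."""
    match = re.search(r"\bJobState=(\w+)", scontrol_stdout)
    return match.group(1) if match else "UNKNOWN"
```

For example, with_prefix(["scancel", "42"], "docker exec slurmctld") yields a list that can be passed straight to subprocess.run() without a shell.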
Usage:
  flowsim submit --scheduler local --collect perf --model-path Qwen/Qwen3-8B \
      --sweep 1:2048:0 4:8192:0 16:2048:4096

Or from file:
  flowsim submit --scheduler local --collect perf --model-path Qwen/Qwen3-8B \
      --sweep-file sweep_points.txt

Each point is a BS:INPUT_LEN:CTX tuple. One server launch, multiple
profile points sequentially. Backwards compatible: without --sweep,
--bs/--input-len/--existing-ctx still works as single-point.
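
Parsing a BS:INPUT_LEN:CTX tuple might look like the sketch below. A later commit names parse_sweep_point() and load_sweep_file(); the signatures, the SweepPoint type, the validation rules, and the one-point-per-line file format with '#' comments are all assumptions here.

```python
from typing import Iterable, NamedTuple


class SweepPoint(NamedTuple):
    bs: int
    input_len: int
    existing_ctx: int


def parse_sweep_point(text: str) -> SweepPoint:
    """Parse 'BS:INPUT_LEN:CTX', for example '16:2048:4096'."""
    parts = text.strip().split(":")
    if len(parts) != 3:
        raise ValueError(f"expected BS:INPUT_LEN:CTX, got {text!r}")
    bs, input_len, ctx = (int(part) for part in parts)
    if bs < 1 or input_len < 1 or ctx < 0:
        raise ValueError(f"out-of-range sweep point: {text!r}")
    return SweepPoint(bs, input_len, ctx)


def parse_sweep_lines(lines: Iterable[str]) -> list[SweepPoint]:
    """Parse one point per line, skipping blank lines and '#' comments."""
    points = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if line:
            points.append(parse_sweep_point(line))
    return points
```

A file-based loader would then only need to open the file and pass the handle to parse_sweep_lines().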
Two tests in TestLocalSweep:
- test_sweep_inline: --sweep 1:2048:0 1:4096:0 1:2048:2048
- test_sweep_file:   same points read from a temp file

Also fix: use single --sweep with multiple values (nargs=+)
instead of repeated --sweep flags which argparse would override.
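
The argparse behavior behind this fix can be reproduced in a few lines. With the default store action, a repeated flag replaces the earlier value, while nargs="+" collects all values given to one flag; make_parser() is just a hypothetical wrapper for the demonstration.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # One flag, one or more values: --sweep A B C
    parser.add_argument("--sweep", nargs="+", default=[])
    return parser


single = make_parser().parse_args(
    ["--sweep", "1:2048:0", "4:2048:0", "8:2048:0"]).sweep
repeated = make_parser().parse_args(
    ["--sweep", "1:2048:0", "--sweep", "4:2048:0"]).sweep

print(single)    # ['1:2048:0', '4:2048:0', '8:2048:0']
print(repeated)  # ['4:2048:0'] -- the second flag overwrote the first
```

An alternative would be action="append", which keeps repeated flags but yields a list of lists when combined with nargs="+".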
- Extract resolve_default() to config.py (was _d() duplicated in submit/status)
- Extract parse_sweep_point()/load_sweep_file() to scripts/__init__.py
- K8s: submit() reuses _load_k8s() instead of duplicating kubeconfig logic
- K8s: remove unused kubernetes imports in status()/logs()
- Local: move inline imports (glob/re/shlex/threading) to module level
- Local: remove dead if-branch in list_jobs (always set Completed)
- Slurm: default submit_via='cli', deprecate REST mode with DeprecationWarning
- Slurm: add TODO for _logs_cli (currently returns status info only)
- CLI: flowsim init slurm supports --submit-via/--cli-prefix, rest-url optional
- Template: slurm.yaml updated for CLI-first workflow
- run_stage_profile: fix _run_perf sentinel bs=0 -> Optional[int]
…faults)

- slurm.py: fix module docstring (no longer says 'posts to slurmrestd')
- local.py: remove unused stderr/stderr_size vars in list_jobs()
- k8s.py: extract _k8s_job_state() helper (was duplicated in status+list_jobs)
- README: update Slurm default to cli, mark REST as deprecated, fix init example
- Delete all slurmrestd REST methods (submit/cancel/status/logs/list)
- Remove ssl, urllib, json imports from slurm.py
- Remove REST constructor params (rest_url, jwt_token, api_version, verify_ssl, submit_via)
- Remove resolve_jwt_token() from config.py
- Remove REST CLI args from submit_profile.py, status_profile.py, cli.py
- Strip REST fields from slurm.yaml template
- Remove JWT-related tests, update init/submit tests
- Rewrite schedulers/README.md entirely in English, no REST references
- 56 unit tests pass, net -524 lines
Terrence Zhang added 17 commits March 18, 2026 23:48
…templates

- Move dev-setup.sh, dev-teardown.sh, slurm-compose.yaml,
  slurm-node.dockerfile, kind-multi-node.yaml, slurm.conf,
  cgroup.conf, gres.conf from dockerfiles/ to tests/integration/infra/
- Delete schedulers/templates/ (unused by code; flowsim init generates
  config directly from CLI args)
- Update all path references in README, config.py, test files, and
  shell script comments
- dockerfiles/ now contains only cuda12.6.dockerfile (app image)
- Remove disagg_mode, disagg_transfer_backend, disagg_bootstrap_port,
  disagg_prefill_pp, disagg_ib_device fields from ProfileJobSpec
- Remove as_prefill(), as_decode(), render_pd_pair(), submit_pd_pair()
- Remove --pd, --disagg-* CLI args from submit_profile.py
- Remove PD branch from main() submit/dry-run logic
- Remove 8 PD-related unit tests
- Remove PD Disaggregation section from README
- 48 unit tests pass
- flowsim init k8s → writes commented k8s.yaml template to ~/.flowsim/
- flowsim init slurm → writes commented slurm.yaml template
- Users edit the file directly (comments explain each field)
- Removed ~60 lines of argparse init code
- Kept --force overwrite logic
- Updated README examples and tests (43 pass)
- flowsim init k8s --config my.yaml  → installs user file to ~/.flowsim/
- flowsim init k8s                   → writes annotated template (unchanged)
- Added 2 tests: config copy + missing file error
- Move annotated templates to schedulers/templates/{k8s,slurm}.yaml
- flowsim init k8s → copies bundled template to ~/.flowsim/
- flowsim init k8s --config my.yaml → copies user file instead
- Remove inline template strings from cli.py
- Root README: replace manual docker run profile/parse with flowsim submit
- Schedulers README: remove redundant How It Works, inline YAML examples, scattered test sections
- Unify model/params across both READMEs (Qwen3-235B-A22B, tp=1, gpus=1, --load-format dummy)
- Add Scheduler Backends section to root README linking to schedulers/README.md
…ilter

- Add timestamp suffix (-MMDD-HHMMSS) to auto-generated job names for uniqueness
- Add #SBATCH --exclusive to Slurm scripts for profiling GPU isolation
- Remove flowsim- prefix filter from Slurm list_jobs (let users filter)
- Add --sweep-file to scheduler README Common Parameters table
- cli.py → cli/__init__.py (entry point + init command)
- submit_profile.py → cli/submit.py (flowsim submit)
- status_profile.py → cli/manage.py (flowsim status/logs/list/cancel)
- Update all import paths in tests
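
The -MMDD-HHMMSS suffix for auto-generated job names mentioned above can be sketched as follows. The make_job_name() function and the way the model name is slugified are assumptions; the flowsim-<collect>-<model> stem is modeled on the log file names shown earlier in this PR.

```python
import re
from datetime import datetime
from typing import Optional


def make_job_name(collect: str, model_path: str,
                  now: Optional[datetime] = None) -> str:
    """Build 'flowsim-<collect>-<model>-<MMDD>-<HHMMSS>'."""
    now = now or datetime.now()
    # Use the last path component, lower-cased, with unsafe characters replaced.
    model = model_path.rstrip("/").split("/")[-1].lower()
    slug = re.sub(r"[^a-z0-9-]+", "-", model).strip("-")
    return f"flowsim-{collect}-{slug}-{now:%m%d-%H%M%S}"
```

Accepting now as a parameter keeps the function deterministic in tests; two submissions in the same second would still collide, which this sketch does not address.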
Replace deploy.resources.reservations with runtime:nvidia +
NVIDIA_VISIBLE_DEVICES to fix NVML initialization failure in slurmd-0.
… header

- Remove misleading 'local' suffix (file tests all 3 backends)
- Add test methodology (How It Works) and Pass Criteria to docstring
- Update file references in schedulers/README.md
Sync the output tree in both README.md and schedulers/README.md to
reflect the actual directory layout produced by profiling jobs:
- Add logs/ with server, shape_server, and job log entries
- Add merged/ and shape_traces/ + shape_parsed/ inside point dirs
- Add brief descriptions of each subdirectory in root README
TerrenceZhangX marked this pull request as ready for review March 19, 2026 23:57
TerrenceZhangX requested a review from Copilot March 19, 2026 23:57
TerrenceZhangX self-assigned this Mar 19, 2026
Copilot AI left a comment

Pull request overview

This PR introduces a new schedulers/ package (local/K8s/Slurm backends) and a unified flowsim CLI to submit and manage stage-profiling jobs across environments, including sweep (multi-point) profiling support and associated unit/integration test infrastructure.

Changes:

  • Added scheduler backends (local, k8s, slurm) built around a shared ProfileJobSpec + BaseScheduler API.
  • Added unified flowsim CLI (init, submit, status/logs/list/cancel) with YAML-based config templates under ~/.flowsim/.
  • Added unit tests plus integration tests and provisioning scripts for Kind and a docker-compose Slurm test cluster; enhanced stage profiling script with --sweep / --sweep-file.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| tests/unit/test_scheduler_cli.py | Unit coverage for CLI parsing, config install, and backend renderers. |
| tests/integration/test_scheduler.py | End-to-end integration tests for local/k8s/slurm and sweep output validation. |
| tests/integration/infra/slurm.conf | Slurm test cluster configuration used by docker-compose infra. |
| tests/integration/infra/slurm-node.dockerfile | Container image for Slurm test cluster nodes (built atop flowsim image). |
| tests/integration/infra/slurm-compose.yaml | Docker compose topology for the local Slurm integration cluster. |
| tests/integration/infra/kind-multi-node.yaml | Kind cluster definition for K8s integration testing with GPU passthrough. |
| tests/integration/infra/gres.conf | Slurm GRES GPU definition for the test cluster. |
| tests/integration/infra/cgroup.conf | Slurm cgroup plugin config for containerized environments. |
| tests/integration/infra/dev-setup.sh | One-shot setup for Kind + Slurm test environments. |
| tests/integration/infra/dev-teardown.sh | Teardown script for test clusters. |
| simulator/base_parser.py | Small refactor in annotation parsing initialization. |
| scripts/run_stage_profile.py | Adds sweep support and adjusts defaults/log-dir handling for profiling runs. |
| scripts/cli/submit.py | Adds flowsim submit implementation (render/dry-run/submit) and config-based defaults. |
| scripts/cli/manage.py | Adds job lifecycle management commands (status/logs/list/cancel) across schedulers. |
| scripts/cli/__init__.py | Adds unified flowsim entry point and init config installer. |
| scripts/__init__.py | Adds shared sweep parsing utilities used by CLI + profiling script. |
| schedulers/templates/slurm.yaml | Slurm config template installed via flowsim init slurm. |
| schedulers/templates/k8s.yaml | K8s config template installed via flowsim init k8s. |
| schedulers/slurm.py | Slurm backend: sbatch script rendering + CLI-mode submission/status/logs/list/cancel. |
| schedulers/local.py | Local backend: docker run rendering + synchronous execution with log capture. |
| schedulers/k8s.py | K8s backend: Job manifest rendering and Python-client submission/status/logs/list/cancel. |
| schedulers/config.py | Loads per-scheduler YAML configs with env-var override support. |
| schedulers/base.py | Defines common dataclasses/interfaces and command rendering for schedulers. |
| schedulers/__init__.py | Public exports for scheduler package. |
| schedulers/README.md | User docs for scheduler usage, config, parameters, and output layout. |
| pyproject.toml | Packages CLI entry point and declares dependencies/extras. |
| README.md | Updates top-level documentation to use flowsim CLI and documents stage profiling/schedulers. |
| .gitignore | Ignores generated stage traces directory. |

TerrenceZhangX and others added 7 commits March 19, 2026 19:09
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Replace absolute /home/administrator/… bind mounts with:
- ${HOST_WORKSPACE} env var for the read-only /workspace mount
- Relative path ../../../stage_traces for the writable traces mount

dev-setup.sh now exports HOST_WORKSPACE (defaults to parent of
repo root) before invoking docker compose.
Without mounting spec.output_dir into the container, traces and logs
are written to the ephemeral container filesystem and lost on exit.

Docker mode: prepend -v output_dir:output_dir to the mount list.
Enroot mode: append output_dir:output_dir to --container-mounts.
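
The two mount rules above can be sketched as small helpers that build the runtime arguments. The function names and the extra_mounts parameter are hypothetical, and the docker run flags shown are a minimal subset; only the output_dir:output_dir mapping, the -v flag, and --container-mounts come from the commit message.

```python
def docker_run_args(image: str, output_dir: str,
                    extra_mounts: list[str]) -> list[str]:
    """Docker mode: the output_dir mount is prepended to the mount list."""
    args = ["docker", "run", "--rm"]
    for mount in [f"{output_dir}:{output_dir}", *extra_mounts]:
        args += ["-v", mount]
    args.append(image)
    return args


def enroot_mounts_flag(output_dir: str, extra_mounts: list[str]) -> str:
    """Enroot mode: the output_dir mount is appended to --container-mounts."""
    mounts = [*extra_mounts, f"{output_dir}:{output_dir}"]
    return "--container-mounts=" + ",".join(mounts)
```

Mapping the directory to the same path inside the container means the paths the profiler writes are valid on the host without translation.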
TerrenceZhangX merged commit f1859ce into microsoft:main Mar 20, 2026
3 checks passed