feat: add scheduler backends (local/k8s/slurm) with unified CLI #10
Merged
TerrenceZhangX merged 56 commits into microsoft:main on Mar 20, 2026
added 30 commits
March 17, 2026 03:46
Add a scheduler package with:
- ProfileJobSpec dataclass for all profiling parameters
- BaseScheduler ABC with render/submit/dry_run interface
- K8sScheduler: generates valid K8s Job YAML with GPU resources,
PVC/hostPath volumes, nodeSelector, serviceAccount support
- SlurmScheduler: generates sbatch scripts with docker/enroot/bare-metal
container runtimes, module loading, and custom #SBATCH directives
- scripts/submit_profile.py: unified CLI entry point with --scheduler
{k8s,slurm}, --dry-run (default) and --submit modes
Zero external dependencies — uses only Python stdlib.
K8s:
- render() now builds a dict and serializes via yaml.safe_dump (falls back
  to json.dumps if PyYAML is absent). Fixes YAML injection when values
  contain : # or quotes.
- submit() uses the 'kubernetes' Python client (kubeconfig / in-cluster).
- New args: --k8s-kubeconfig, --k8s-context.
Slurm:
- submit() now posts to slurmrestd REST API via urllib.request (stdlib).
- Supports JWT auth, configurable API version (v0.0.39–v0.0.41+), and TLS
  certificate verification toggle.
- New args: --slurm-rest-url, --slurm-jwt-token, --slurm-api-version,
  --slurm-no-verify-ssl.
render() / dry-run remain zero-dependency (stdlib only). submit() requires
the 'kubernetes' package for K8s; Slurm uses stdlib.
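The dict-then-serialize approach described in this commit can be sketched as follows. This is a minimal illustration, not the PR's actual render() code: the function name `render_manifest`, its parameters, and the manifest fields shown are assumptions chosen for the example. The point is that the serializer, not string formatting, decides how to quote values containing `:`, `#`, or quotes.

```python
import json

try:
    import yaml  # PyYAML is optional here, mirroring the fallback in the commit
except ImportError:
    yaml = None


def render_manifest(job_name: str, image: str, command: list[str]) -> str:
    """Hypothetical helper: build the Job as a dict, then serialize it."""
    manifest = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "profile", "image": image, "command": command}
                    ],
                }
            }
        },
    }
    if yaml is not None:
        # safe_dump quotes or escapes values as needed.
        return yaml.safe_dump(manifest, sort_keys=False)
    # Fallback when PyYAML is absent: emit JSON instead.
    return json.dumps(manifest, indent=2)


# A value that would break a naive f-string template.
tricky = ["sh", "-c", "echo 'a: b' # not a comment"]
text = render_manifest("demo-job", "example/image:latest", tricky)

# Round-trip to confirm the value survived serialization unchanged.
loaded = yaml.safe_load(text) if yaml is not None else json.loads(text)
```

Round-tripping the rendered text, as the last line does, is a cheap way to check that no value was reinterpreted by the YAML parser.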
- Core deps: requests, perfetto, numpy, pandas
- Optional dependency groups:
  k8s: kubernetes>=27.0, PyYAML>=6.0
  slurm: (stdlib only, no extra deps)
  sim: scalesim, scipy, torch
  viz: matplotlib, seaborn
  api: fastapi, pydantic, uvicorn
  dev: black, pytest
  all: everything
- Entry point: flowsim-submit -> scripts.submit_profile:main
- requires-python >= 3.10

- Add scripts/__init__.py so 'scripts' is a findable package
- Remove sys.path hack from submit_profile.py (not needed after install)
- Add [tool.setuptools.packages.find] with explicit include list
  (excludes tests/ and backend/ from the installable package)
- Improve K8s submit error: catch both kubeconfig and in-cluster failures
  and show a single clear message with --k8s-kubeconfig hint
Verified: pip install -e '.[k8s]' -> flowsim-submit --dry-run works.
- Add scripts/cli.py with subcommand routing (flowsim {submit, ...})
- Entry point changed: flowsim-submit -> flowsim
- 'flowsim submit' delegates to submit_profile.main()
- Extensible for future subcommands (profile, parse, simulate)
Removed the redundant --submit flag. The subcommand name already implies submission; --dry-run is the opt-out.
- Slurm: fail fast if --slurm-rest-url or --slurm-jwt-token missing
- K8s: warn to stderr when no explicit kubeconfig/context provided
- --dry-run skips validation (no cluster needed for manifest preview)

Connection params now read from environment variables as defaults, so
you don't have to pass them every invocation:
K8s:
  KUBECONFIG -> --k8s-kubeconfig
  FLOWSIM_K8S_NAMESPACE -> --k8s-namespace
  FLOWSIM_K8S_CONTEXT -> --k8s-context
Slurm:
  FLOWSIM_SLURM_REST_URL -> --slurm-rest-url
  FLOWSIM_SLURM_JWT_TOKEN -> --slurm-jwt-token
  FLOWSIM_SLURM_PARTITION -> --slurm-partition
  FLOWSIM_SLURM_TIME -> --slurm-time
  FLOWSIM_SLURM_API_VERSION -> --slurm-api-version
CLI flags override env vars. Env var names shown in --help.
No more built-in defaults for cluster connection params. Users must
configure before submitting:
  flowsim init                 # copies templates to ~/.flowsim/
  vim ~/.flowsim/k8s.yaml      # fill in kubeconfig, namespace, etc.
  vim ~/.flowsim/slurm.yaml    # fill in rest_url, partition, etc.
  flowsim submit ...           # works
Changes:
- Add 'flowsim init' subcommand (copies templates, --force to overwrite)
- Split config into ~/.flowsim/k8s.yaml and ~/.flowsim/slurm.yaml
- Templates have empty REQUIRED fields; submit fails if unfilled
- Config loader: schedulers/config.py with per-scheduler load functions
- Priority: CLI flag > env var > config file (no silent fallbacks)
- Slurm jwt_token_cmd: execute a command to get token at submit time
- --dry-run skips all validation (no config needed for preview)
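The "CLI flag > env var > config file" priority from this commit can be illustrated with a small resolver. This is a sketch under stated assumptions: `resolve_setting` and its signature are hypothetical and are not taken from schedulers/config.py. Only the precedence order and the env var name come from the commit messages.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    env_var: str,
    config: Mapping[str, str],
    key: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical resolver: CLI flag, then env var, then config file.

    Returns None when nothing is set, so the caller can fail loudly
    instead of silently falling back to a built-in default.
    """
    if cli_value is not None:
        return cli_value
    env_value = env.get(env_var)
    if env_value:
        return env_value
    file_value = config.get(key)
    # Treat an empty template field the same as a missing one.
    return file_value if file_value else None


cfg = {"namespace": "from-file", "context": ""}
env = {"FLOWSIM_K8S_NAMESPACE": "from-env"}

from_cli = resolve_setting("from-cli", "FLOWSIM_K8S_NAMESPACE", cfg, "namespace", env)
from_env = resolve_setting(None, "FLOWSIM_K8S_NAMESPACE", cfg, "namespace", env)
from_file = resolve_setting(None, "FLOWSIM_K8S_NAMESPACE", cfg, "namespace", {})
unset = resolve_setting(None, "FLOWSIM_K8S_CONTEXT", cfg, "context", {})
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.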
- flowsim init k8s --kubeconfig ... --namespace ...
- flowsim init slurm --rest-url ... --partition ... --account ...
- Required fields enforced by argparse, --help shows everything
- --force to overwrite existing config
- Demote --dry-run to [debug] in submit help text
- Remove template-copy approach, use _save_yaml() directly

Docker test environments:
- kind-multi-node.yaml: 1 control-plane + 2 workers (GPU 0, GPU 1)
- slurm-compose.yaml: slurmctld + 2 slurmd (GPU 0, GPU 1) + slurmrestd
- slurm-node.dockerfile + slurm.conf: Slurm 23.11 with JWT auth
PD disaggregation:
- ProfileJobSpec: disagg_mode, disagg_transfer_backend,
  disagg_bootstrap_port, disagg_prefill_pp, disagg_ib_device
- as_prefill() / as_decode() helpers for creating PD pairs
- BaseScheduler: render_pd_pair() and submit_pd_pair()
- CLI: --pd flag submits prefill + decode job pair
- --disagg-transfer-backend (mooncake/nixl), --disagg-bootstrap-port, etc.
Bugfix:
- resolve_jwt_token: catch FileNotFoundError when jwt_token_cmd binary missing

- dev-setup.sh: auto-installs kind/kubectl, creates kind cluster, starts
  Slurm compose, runs flowsim init, all in one command
- dev-teardown.sh: tears down both clusters cleanly
- Supports 'kind', 'slurm', or 'all' (default) targets
- Verified: kind cluster creation + K8s Job submit + PD pair submit all work

- LocalScheduler runs profiling via subprocess on this machine
- --local-gpus to set CUDA_VISIBLE_DEVICES (e.g. '0' or '0,1')
- --local-workdir for custom working directory
- No cluster config needed; replaces manual 'python scripts/run_stage_profile.py'
- Supports --pd for local PD disaggregation testing
- Skips cluster connection validation for local scheduler

Tests cover:
- ProfileJobSpec: job name, server opts, disagg params, as_prefill/decode
- K8sScheduler.render: YAML validity, namespace, GPU resources, PVC,
  hostPath, nodeSelector, serviceAccount, labels, PD pair
- SlurmScheduler.render: shebang, sbatch directives, docker/enroot/bare,
  modules, extra sbatch, constraint, time parsing
- LocalScheduler.render: GPU selection, workdir, env vars
- CLI init: help, required args, bad kubeconfig, save/load config,
  overwrite protection, --force
- CLI submit: help, dry-run for local/k8s/slurm, PD pair, nixl backend
- Config: save/load yaml, jwt_token static/cmd/bad_cmd, cfg_get
All tests run inside the FlowSim Docker container.
…8s submit without PVC
- log_dir is now derived as {output_dir}/logs/ (single volume covers both)
- LocalScheduler.submit() tees stdout/stderr to log files in real time
- K8s submit refuses if no --k8s-pvc or --k8s-host-output-dir (prevents data loss)
- Slurm output_dir defaults to ~/flowsim_traces (shared filesystem)
- Local output_dir defaults to {project}/stage_traces/
- Add flowsim status/logs subcommands (K8s via API, Slurm via slurmrestd, local via log files)
- Submit prints result location + follow-up commands after every job
- Add integration tests for local scheduler
…of dumping content
- TestLocalScheduler: real TP=1 profiling, verify traces + logs +
  status/logs CLI
- TestK8sScheduler: dry-run YAML (PVC mount, hostPath, log paths), refuse
  without storage, real Job submit to Kind cluster with status/logs
  verification
- TestSlurmScheduler: dry-run sbatch script (output_dir, log_dir, PD pair)
Results: 9 passed, 1 skipped (K8s real submit skipped in container, passes on host)
- Add JobResult dataclass: submit() now returns structured data (job_id,
  scheduler, state, output_dir, message) instead of string
- Add flowsim cancel: K8s (delete_namespaced_job), Slurm (DELETE via
  slurmrestd), local (no-op for synchronous jobs)
- Add flowsim list: list FlowSim jobs with --status filter. K8s
  (label_selector=app=flowsim), Slurm (slurmrestd /jobs), local (scan log
  files)
- Add --follow / -f to flowsim logs: shows tail -f / kubectl logs -f
  commands for real-time log streaming
- submit_pd_pair() now returns list[JobResult] instead of string
- Post-submit output shows cancel/list/follow commands
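A structured result of this kind might look like the sketch below. The five field names come from the commit message; the field types, the default for `message`, and the `summarize` helper are assumptions made for illustration, not the PR's definitions.

```python
from dataclasses import asdict, dataclass


@dataclass
class JobResult:
    """Sketch of a structured submit() result; types are assumed."""

    job_id: str
    scheduler: str
    state: str
    output_dir: str
    message: str = ""


def summarize(result: JobResult) -> str:
    """Hypothetical one-line summary that a CLI could print after submit."""
    return f"[{result.scheduler}] {result.job_id}: {result.state} -> {result.output_dir}"


result = JobResult(
    job_id="demo-job",
    scheduler="local",
    state="Completed",
    output_dir="stage_traces/demo-job",
)
line = summarize(result)
as_dict = asdict(result)  # convenient for JSON output or test assertions
```

Returning a dataclass rather than a string lets callers such as `status`, `list`, and tests inspect individual fields instead of parsing text.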
Two-pass argparse: peek --scheduler with a minimal pre-parser, then add only the relevant scheduler's options before full parse. 'flowsim submit --scheduler local --help' no longer shows k8s/slurm args.
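The two-pass pattern can be sketched with `parse_known_args`. The flag names below appear elsewhere in this PR, but the `BACKEND_OPTIONS` table, `build_parser`, and the help strings are hypothetical; the real submit command registers many more options.

```python
import argparse

# Illustrative subset of per-backend options (flag names from the PR,
# help strings invented for this sketch).
BACKEND_OPTIONS = {
    "local": [("--local-gpus", "GPU ids for CUDA_VISIBLE_DEVICES")],
    "k8s": [("--k8s-namespace", "Kubernetes namespace")],
    "slurm": [("--slurm-partition", "Slurm partition")],
}


def build_parser(argv: list[str]) -> argparse.ArgumentParser:
    # Pass 1: a minimal pre-parser that only understands --scheduler.
    # add_help=False keeps --help from being consumed at this stage.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--scheduler", choices=sorted(BACKEND_OPTIONS))
    known, _rest = pre.parse_known_args(argv)

    # Pass 2: the full parser inherits --scheduler and adds only the
    # options that belong to the selected backend.
    parser = argparse.ArgumentParser(prog="flowsim submit", parents=[pre])
    for flag, help_text in BACKEND_OPTIONS.get(known.scheduler, []):
        parser.add_argument(flag, help=help_text)
    return parser


argv = ["--scheduler", "local", "--local-gpus", "0,1"]
args = build_parser(argv).parse_args(argv)
help_text = build_parser(argv).format_help()
```

Because unknown flags are ignored in the first pass, the pre-parser does not need to know about any backend-specific option.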
Many systems (e.g. Ubuntu, Debian) don't have a 'python' symlink by default.
Before: flowsim-perf-qwen3-8b_1773771736.stdout.log
After:  flowsim-perf-qwen3-8b_20260317_184236.stdout.log
list_jobs() regex updated to support both old epoch and new formats.
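A pattern that accepts both filename formats could look like the following sketch. The regex and `parse_log_name` are assumptions written for this example, not the expression used by list_jobs(); the `stderr` alternative is also assumed, since only `.stdout.log` names are shown above.

```python
import re
from typing import Optional

# Accept either YYYYMMDD_HHMMSS (new) or a 10-digit epoch (old) suffix.
LOG_NAME = re.compile(
    r"^(?P<job>.+)_(?P<stamp>\d{8}_\d{6}|\d{10})\.(?P<stream>stdout|stderr)\.log$"
)


def parse_log_name(name: str) -> Optional[tuple[str, str, str]]:
    """Return (job, timestamp, stream) or None if the name does not match."""
    match = LOG_NAME.match(name)
    if match is None:
        return None
    return match["job"], match["stamp"], match["stream"]


old_style = parse_log_name("flowsim-perf-qwen3-8b_1773771736.stdout.log")
new_style = parse_log_name("flowsim-perf-qwen3-8b_20260317_184236.stdout.log")
unrelated = parse_log_name("notes.txt")
```

Listing the longer datetime alternative first keeps the new format from being split at its inner underscore.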
- Add submit_via='cli' mode to SlurmScheduler, using sbatch/squeue/scancel
  subprocess calls instead of slurmrestd REST API (which has JWT auth
  issues in Slurm 23.11 docker containers).
- Add cli_prefix param for running commands via docker exec.
- Use scontrol show job for status (works without slurmdbd).
- Slurm compose: base image on flowsim-image:latest, compile Slurm 23.11
  with NVML support, cgroup/v1, explicit GRES config.
- Slurm test passes in ~76s (same as K8s test).
- K8s test uses host mount for traces (no docker cp).
- All three backends (local, k8s, slurm) tested and working.
Usage:
flowsim submit --scheduler local --collect perf --model-path Qwen/Qwen3-8B \
--sweep 1:2048:0 4:8192:0 16:2048:4096
Or from file:
flowsim submit --scheduler local --collect perf --model-path Qwen/Qwen3-8B \
--sweep-file sweep_points.txt
Each point is a BS:INPUT_LEN:CTX tuple. One server launch, multiple
profile points sequentially. Backwards compatible: without --sweep,
--bs/--input-len/--existing-ctx still works as single-point.
Two tests in TestLocalSweep:
- test_sweep_inline: --sweep 1:2048:0 1:4096:0 1:2048:2048
- test_sweep_file: same points read from a temp file
Also fix: use single --sweep with multiple values (nargs=+) instead of
repeated --sweep flags which argparse would override.
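Parsing `BS:INPUT_LEN:CTX` tuples with a single `nargs="+"` option can be sketched as below. `SweepPoint` and `parse_point` are hypothetical names for this illustration and are not the helpers in scripts/__init__.py; only the tuple format and the `nargs="+"` choice come from the commits.

```python
import argparse
from typing import NamedTuple


class SweepPoint(NamedTuple):
    bs: int
    input_len: int
    existing_ctx: int


def parse_point(text: str) -> SweepPoint:
    """Parse one 'BS:INPUT_LEN:CTX' string into integers."""
    parts = text.strip().split(":")
    if len(parts) != 3:
        raise ValueError(f"expected BS:INPUT_LEN:CTX, got {text!r}")
    # int() raises ValueError on non-numeric fields, which argparse
    # turns into a normal usage error when used as a type= callable.
    bs, input_len, ctx = (int(part) for part in parts)
    return SweepPoint(bs, input_len, ctx)


parser = argparse.ArgumentParser()
# One flag that collects several values; repeating a plain "store" flag
# would keep only the last occurrence.
parser.add_argument("--sweep", nargs="+", type=parse_point,
                    metavar="BS:INPUT_LEN:CTX")

args = parser.parse_args(["--sweep", "1:2048:0", "4:8192:0", "16:2048:4096"])
```

Each entry in `args.sweep` is then one profile point to run against the same server launch.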
- Extract resolve_default() to config.py (was _d() duplicated in submit/status)
- Extract parse_sweep_point()/load_sweep_file() to scripts/__init__.py
- K8s: submit() reuses _load_k8s() instead of duplicating kubeconfig logic
- K8s: remove unused kubernetes imports in status()/logs()
- Local: move inline imports (glob/re/shlex/threading) to module level
- Local: remove dead if-branch in list_jobs (always set Completed)
- Slurm: default submit_via='cli', deprecate REST mode with DeprecationWarning
- Slurm: add TODO for _logs_cli (currently returns status info only)
- CLI: flowsim init slurm supports --submit-via/--cli-prefix, rest-url optional
- Template: slurm.yaml updated for CLI-first workflow
- run_stage_profile: fix _run_perf sentinel bs=0 -> Optional[int]

…faults)
- slurm.py: fix module docstring (no longer says 'posts to slurmrestd')
- local.py: remove unused stderr/stderr_size vars in list_jobs()
- k8s.py: extract _k8s_job_state() helper (was duplicated in status+list_jobs)
- README: update Slurm default to cli, mark REST as deprecated, fix init example

- Delete all slurmrestd REST methods (submit/cancel/status/logs/list)
- Remove ssl, urllib, json imports from slurm.py
- Remove REST constructor params (rest_url, jwt_token, api_version,
  verify_ssl, submit_via)
- Remove resolve_jwt_token() from config.py
- Remove REST CLI args from submit_profile.py, status_profile.py, cli.py
- Strip REST fields from slurm.yaml template
- Remove JWT-related tests, update init/submit tests
- Rewrite schedulers/README.md entirely in English, no REST references
- 56 unit tests pass, net -524 lines
added 17 commits
March 18, 2026 23:48
…templates
- Move dev-setup.sh, dev-teardown.sh, slurm-compose.yaml,
  slurm-node.dockerfile, kind-multi-node.yaml, slurm.conf, cgroup.conf,
  gres.conf from dockerfiles/ to tests/integration/infra/
- Delete schedulers/templates/ (unused by code; flowsim init generates
  config directly from CLI args)
- Update all path references in README, config.py, test files, and shell
  script comments
- dockerfiles/ now contains only cuda12.6.dockerfile (app image)

- Remove disagg_mode, disagg_transfer_backend, disagg_bootstrap_port,
  disagg_prefill_pp, disagg_ib_device fields from ProfileJobSpec
- Remove as_prefill(), as_decode(), render_pd_pair(), submit_pd_pair()
- Remove --pd, --disagg-* CLI args from submit_profile.py
- Remove PD branch from main() submit/dry-run logic
- Remove 8 PD-related unit tests
- Remove PD Disaggregation section from README
- 48 unit tests pass

- flowsim init k8s → writes commented k8s.yaml template to ~/.flowsim/
- flowsim init slurm → writes commented slurm.yaml template
- Users edit the file directly (comments explain each field)
- Removed ~60 lines of argparse init code
- Kept --force overwrite logic
- Updated README examples and tests (43 pass)

- flowsim init k8s --config my.yaml → installs user file to ~/.flowsim/
- flowsim init k8s → writes annotated template (unchanged)
- Added 2 tests: config copy + missing file error
- Move annotated templates to schedulers/templates/{k8s,slurm}.yaml
- flowsim init k8s → copies bundled template to ~/.flowsim/
- flowsim init k8s --config my.yaml → copies user file instead
- Remove inline template strings from cli.py
- Root README: replace manual docker run profile/parse with flowsim submit
- Schedulers README: remove redundant How It Works, inline YAML examples,
  scattered test sections
- Unify model/params across both READMEs (Qwen3-235B-A22B, tp=1, gpus=1,
  --load-format dummy)
- Add Scheduler Backends section to root README linking to schedulers/README.md
… decode-tokens to 2
…ilter
- Add timestamp suffix (-MMDD-HHMMSS) to auto-generated job names for uniqueness
- Add #SBATCH --exclusive to Slurm scripts for profiling GPU isolation
- Remove flowsim- prefix filter from Slurm list_jobs (let users filter)
- Add --sweep-file to scheduler README Common Parameters table
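The -MMDD-HHMMSS suffix can be produced with a datetime format spec, as in this minimal sketch. `with_timestamp` is a hypothetical helper; how the PR builds the base job name is not shown here.

```python
from datetime import datetime
from typing import Optional


def with_timestamp(base: str, now: Optional[datetime] = None) -> str:
    """Append a -MMDD-HHMMSS suffix so repeated submissions get unique names."""
    moment = now if now is not None else datetime.now()
    return f"{base}-{moment:%m%d-%H%M%S}"


# Fixed time for a reproducible example.
name = with_timestamp("flowsim-perf-qwen3-8b", datetime(2026, 3, 19, 14, 5, 9))
```

A one-second resolution is enough to separate manual submissions, though two jobs created within the same second would still collide.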
- cli.py → cli/__init__.py (entry point + init command)
- submit_profile.py → cli/submit.py (flowsim submit)
- status_profile.py → cli/manage.py (flowsim status/logs/list/cancel)
- Update all import paths in tests
Replace deploy.resources.reservations with runtime:nvidia + NVIDIA_VISIBLE_DEVICES to fix NVML initialization failure in slurmd-0.
… header
- Remove misleading 'local' suffix (file tests all 3 backends)
- Add test methodology (How It Works) and Pass Criteria to docstring
- Update file references in schedulers/README.md

Sync the output tree in both README.md and schedulers/README.md to reflect
the actual directory layout produced by profiling jobs:
- Add logs/ with server, shape_server, and job log entries
- Add merged/ and shape_traces/ + shape_parsed/ inside point dirs
- Add brief descriptions of each subdirectory in root README
Pull request overview
This PR introduces a new schedulers/ package (local/K8s/Slurm backends) and a unified flowsim CLI to submit and manage stage-profiling jobs across environments, including sweep (multi-point) profiling support and associated unit/integration test infrastructure.
Changes:
- Added scheduler backends (`local`, `k8s`, `slurm`) built around a shared `ProfileJobSpec` + `BaseScheduler` API.
- Added unified `flowsim` CLI (`init`, `submit`, `status`/`logs`/`list`/`cancel`) with YAML-based config templates under `~/.flowsim/`.
- Added unit tests plus integration tests and provisioning scripts for Kind and a docker-compose Slurm test cluster; enhanced stage profiling script with `--sweep`/`--sweep-file`.
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| tests/unit/test_scheduler_cli.py | Unit coverage for CLI parsing, config install, and backend renderers. |
| tests/integration/test_scheduler.py | End-to-end integration tests for local/k8s/slurm and sweep output validation. |
| tests/integration/infra/slurm.conf | Slurm test cluster configuration used by docker-compose infra. |
| tests/integration/infra/slurm-node.dockerfile | Container image for Slurm test cluster nodes (built atop flowsim image). |
| tests/integration/infra/slurm-compose.yaml | Docker compose topology for the local Slurm integration cluster. |
| tests/integration/infra/kind-multi-node.yaml | Kind cluster definition for K8s integration testing with GPU passthrough. |
| tests/integration/infra/gres.conf | Slurm GRES GPU definition for the test cluster. |
| tests/integration/infra/cgroup.conf | Slurm cgroup plugin config for containerized environments. |
| tests/integration/infra/dev-setup.sh | One-shot setup for Kind + Slurm test environments. |
| tests/integration/infra/dev-teardown.sh | Teardown script for test clusters. |
| simulator/base_parser.py | Small refactor in annotation parsing initialization. |
| scripts/run_stage_profile.py | Adds sweep support and adjusts defaults/log-dir handling for profiling runs. |
| scripts/cli/submit.py | Adds flowsim submit implementation (render/dry-run/submit) and config-based defaults. |
| scripts/cli/manage.py | Adds job lifecycle management commands (status/logs/list/cancel) across schedulers. |
| `scripts/cli/__init__.py` | Adds unified flowsim entry point and init config installer. |
| `scripts/__init__.py` | Adds shared sweep parsing utilities used by CLI + profiling script. |
| schedulers/templates/slurm.yaml | Slurm config template installed via flowsim init slurm. |
| schedulers/templates/k8s.yaml | K8s config template installed via flowsim init k8s. |
| schedulers/slurm.py | Slurm backend: sbatch script rendering + CLI-mode submission/status/logs/list/cancel. |
| schedulers/local.py | Local backend: docker run rendering + synchronous execution with log capture. |
| schedulers/k8s.py | K8s backend: Job manifest rendering and Python-client submission/status/logs/list/cancel. |
| schedulers/config.py | Loads per-scheduler YAML configs with env-var override support. |
| schedulers/base.py | Defines common dataclasses/interfaces and command rendering for schedulers. |
| `schedulers/__init__.py` | Public exports for scheduler package. |
| schedulers/README.md | User docs for scheduler usage, config, parameters, and output layout. |
| pyproject.toml | Packages CLI entry point and declares dependencies/extras. |
| README.md | Updates top-level documentation to use flowsim CLI and documents stage profiling/schedulers. |
| .gitignore | Ignores generated stage traces directory. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Replace absolute /home/administrator/… bind mounts with:
- ${HOST_WORKSPACE} env var for the read-only /workspace mount
- Relative path ../../../stage_traces for the writable traces mount
dev-setup.sh now exports HOST_WORKSPACE (defaults to parent of
repo root) before invoking docker compose.
Without mounting spec.output_dir into the container, traces and logs are
written to the ephemeral container filesystem and lost on exit.
Docker mode: prepend -v output_dir:output_dir to the mount list.
Enroot mode: append output_dir:output_dir to --container-mounts.
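The two mount rules can be sketched as small helpers. `docker_mount_args` and `enroot_container_mounts` are hypothetical names and do not reflect the structure of schedulers/slurm.py; the sketch only shows the prepend and append behavior described in this commit, using an identity mapping so the path is the same on host and in the container.

```python
def docker_mount_args(output_dir: str, extra_mounts: list[str]) -> list[str]:
    """Docker mode: put the output_dir bind mount first, then user mounts."""
    mounts = [f"{output_dir}:{output_dir}", *extra_mounts]
    args: list[str] = []
    for mount in mounts:
        args += ["-v", mount]
    return args


def enroot_container_mounts(output_dir: str, extra_mounts: list[str]) -> str:
    """Enroot mode: append output_dir to the comma-separated mount list."""
    return ",".join([*extra_mounts, f"{output_dir}:{output_dir}"])


docker_args = docker_mount_args("/data/traces", ["/models:/models:ro"])
enroot_value = enroot_container_mounts("/data/traces", ["/models:/models:ro"])
```

Mounting the directory at the same path inside the container means the profiling command can use `output_dir` unchanged in both environments.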
Summary
Add a `schedulers/` package with three job scheduler backends for stage profiling, plus a unified `flowsim` CLI.
What's New
Scheduler Backends (`schedulers/`)
- `local.py`: run profiling in a local Docker container
- `k8s.py`: submit as Kubernetes Jobs (PVC/hostPath storage)
- `slurm.py`: submit via sbatch/squeue/scancel (CLI mode); supports docker, enroot, and bare-metal container runtimes; auto-mounts output dir; `--exclusive` GPU allocation
- `base.py`: abstract base class + `ProfileJobSpec` dataclass; auto-generated job names include timestamp for uniqueness
- `config.py`: YAML config loading from `~/.flowsim/` with env var overrides
- `templates/`: annotated config templates for K8s and Slurm
CLI (`scripts/cli/`)
- `flowsim init {k8s,slurm}`: install config template (or `--config` to use your own)
- `flowsim submit`: submit profiling jobs (local/k8s/slurm), with `--dry-run` and `--sweep-file`
- `flowsim status/logs/cancel/list`: job lifecycle management
- `--sweep` support for multi-point profiling in one job
Test Infrastructure (`tests/integration/infra/`)
- …`runtime: nvidia`); host paths parameterized via `$HOST_WORKSPACE`
- `dev-setup.sh` / `dev-teardown.sh` for one-click test environments
Tests
- …`test_scheduler_cli.py`)
- …`test_scheduler.py`)
Stats
Usage