diff --git a/.gitignore b/.gitignore index 706276b..b70854b 100644 --- a/.gitignore +++ b/.gitignore @@ -6,4 +6,5 @@ tests/test-artifacts/ unknown_kernels.json /artifacts /server_profile -/server_simulate \ No newline at end of file +/server_simulate +/stage_traces/ \ No newline at end of file diff --git a/README.md b/README.md index c4a674e..6fa4289 100644 --- a/README.md +++ b/README.md @@ -20,6 +20,8 @@ The project supports rapid deployment using Docker, includes scripts for environ ## Table of Contents - [Getting Started](#getting-started) +- [Stage Profiling](#stage-profiling) +- [Scheduler Backends](#scheduler-backends) - [For Developers](#for-developers) - [Risks and limitations](#risks-and-limitations) - [License](#license) @@ -49,228 +51,130 @@ make build-docker This creates a local image named `flowsim-image` with FlowSim patches already applied to sglang. -### 2. Run Profile → Parse → Simulate +### 2. Profile (Generate Traces) -Create workspace directories on your host for storing traces and results: +Use `flowsim submit` to capture stage-separated traces (EXTEND + DECODE), parse them, and run cross-rank analysis — all in one step. See [Stage Profiling](#stage-profiling) for how stages and collection modes work. 
```bash -mkdir -p /data/flowsim-profile -mkdir -p /data/flowsim-simulate -``` - -#### Step 1: Profile (Generate Traces) - -```bash -sudo docker run --gpus=all \ - -v /data/flowsim-profile:/workspace/profile \ - -v /data/flowsim-simulate:/workspace/simulate \ - -w /flowsim \ - --cap-add=SYS_ADMIN \ - --network=host \ - --shm-size 911G \ - flowsim-image \ - python scripts/run_profile.py \ - --profile-dir /workspace/profile \ - --log-dir /workspace/profile/logs \ - --bench-timeout 3600 \ - --server-opts "--model-path /flowsim/workload/models/configs/deepseek/ --load-format dummy --tp 4 --ep 4 --host 0.0.0.0 --port 30001 --attention-backend flashinfer --disable-cuda-graph" \ - --bench-opts "--backend sglang --host 0.0.0.0 --port 30001 --dataset-name defined-len --prefill-decode-lens 1024:8 --num-prompts 1 --profile" -``` - -**What this does:** -- Starts an sglang server with profiling enabled -- Runs benchmark requests against it -- Generates `*.trace.json.gz` files in `/data/flowsim-profile` (mounted as `/workspace/profile`) - -**Note:** The first run will be slow (~10 minutes) due to DeepGEMM kernel warmup and compilation. For stable performance, avoid using `--rm` flag and reuse the same container using `sudo docker exec -it bash`. Subsequent runs with similar configurations will be faster. - -**Tip:** -- Adjust `--server-opts` and `--bench-opts` to match your model, parallelism (TP/DP/EP), and workload requirements. All `sglang.launch_server` and `bench_serving.py` parameters are supported. -- Trace files can be visualized using [Perfetto UI](https://ui.perfetto.dev/) by uploading the `.trace.json.gz` files directly. -- For multi-GPU profiling (TP > 1), merge individual traces into a single file for a global view: - ```bash - python /flowsim/utils/merge_trace.py \ - --trace_dir /data/flowsim-profile \ - --output /data/flowsim-profile/merged_trace.json - ``` - Then visualize the merged trace at [Perfetto UI](https://ui.perfetto.dev/). 
- -#### Step 2: Parse (Convert Trace to CSV) - -```bash -sudo docker run --rm \ - -v /data/flowsim-profile:/workspace/profile \ - -v /data/flowsim-simulate:/workspace/simulate \ - -w /flowsim \ - flowsim-image \ - python -m scripts.run_parse \ - --trace-file /workspace/profile/your-trace-name-TP-0.trace.json.gz \ - --output-dir /workspace/simulate +pip install -e . +flowsim submit --scheduler local \ + --collect all \ + --model-path workload/models/configs/Qwen3-235B-A22B \ + --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \ + --extra-server-opts "--load-format dummy" ``` -Replace `your-trace-name-TP-0.trace.json.gz` with the actual filename from step 1. - -**What this does:** -- Parses the trace file -- Extracts kernel-level information (operator, shapes, dtypes) -- Generates a CSV file and JSON summary in `/data/flowsim-simulate` (mounted as `/workspace/simulate`) - -**Fallback:** If you don't have a GPU or can't run profiling, use the demo trace shipped with the repo: +For K8s / Slurm clusters, see [Scheduler Backends](#scheduler-backends). +**Tip:** Trace files can be visualized at [Perfetto UI](https://ui.perfetto.dev/). For multi-GPU traces, merge them first: ```bash -sudo docker run --rm \ - -v /data/flowsim-simulate:/workspace/simulate \ - -w /flowsim \ - flowsim-image \ - python -m scripts.run_parse \ - --trace-file /flowsim/demo/deepseekv3-TP-0.trace.json.gz \ - --output-dir /workspace/simulate +python utils/merge_trace.py --trace_dir stage_traces/local/*/bs1_input2048_ctx0 --output merged.json ``` -#### Step 3: Simulate (Run Hardware Simulation) +### 3. Simulate (Run Hardware Simulation) -This step requires a running LLMCompass backend. 
First, build the backend image: +Build and start the LLMCompass backend, then submit parsed traces for kernel-level simulation: ```bash +# Build backend image sudo docker build -t llmcompass-backend -f backend/LLMCompass/Dockerfile backend/LLMCompass/ -``` -Then start the backend: - -```bash -# Terminal 1: Start LLMCompass backend +# Terminal 1: Start backend sudo docker run --rm -p 8000:8000 llmcompass-backend -``` -Then in another terminal, run the simulation: - -```bash # Terminal 2: Run simulation -sudo docker run --rm \ - --network=host \ - -v /data/flowsim-profile:/workspace/profile \ - -v /data/flowsim-simulate:/workspace/simulate \ +sudo docker run --rm --network=host \ + -v /data/flowsim:/workspace \ flowsim-image \ python -m scripts.run_simulate \ - --trace-file /workspace/profile/your-trace-name-TP-0.trace.json.gz \ + --trace-file /workspace/traces/bs1_input2048_ctx0/*-TP-0-EXTEND.trace.json.gz \ --api-url http://127.0.0.1:8000 \ --artifact-dir /workspace/simulate/llmcompass ``` -**What this does:** -- Parses the trace into kernels -- Submits each kernel to the LLMCompass backend `/tasks` API -- Polls until all tasks complete -- Writes request/response artifacts to `/workspace/simulate/llmcompass` - -### 3. Inspect Results - -All generated files are available on your host at `/data/`: +### 4. Inspect Results ```bash -ls -lh /data/flowsim-profile/ # Raw trace files -ls -lh /data/flowsim-simulate/ # Parsed CSV, summary, simulation artifacts +ls -lh /data/flowsim/traces/ # Stage-separated traces + parsed CSVs +ls -lh /data/flowsim/simulate/ # Simulation artifacts ``` --- -## Stage Profiling (`run_stage_profile.py`) +## Stage Profiling -`scripts/run_stage_profile.py` is the single entry-point for **stage-separated** profiling: it captures prefill (EXTEND) and decode traces independently, parses them, runs cross-rank kernel analysis, and optionally collects kernel input shapes. 
+FlowSim performs **stage-separated** profiling: it captures prefill (EXTEND) and decode traces independently, parses them, runs cross-rank kernel analysis, and optionally collects kernel input shapes. -### Quick reference +### How stages work Each profiling request produces **two** stage-separated traces: - **EXTEND** (prefill) — processes `input_len` new tokens (with optional `existing_ctx` tokens already in KV cache) -- **DECODE** — profiler captures `decode-tokens` decode batch steps - -The profiler captures exactly **one** EXTEND batch and **decode-tokens** DECODE batches per run. +- **DECODE** — captures `decode-tokens` decode batch steps (default 2) -| Flag | Description | Default | -|---|---|---| -| `--input-len` | Number of new prefill tokens per request (EXTEND) | 2048 | -| `--existing-ctx` | Tokens already in KV cache from a prior request (0 = cold prefill) | 0 | -| `--bs` | Batch size (concurrent requests) | 1 | -| `--decode-tokens` | Number of decode tokens to generate (= number of decode batches profiled) | 32 | +### Collection modes | Mode | What it does | |---|---| -| `--collect perf` | Profile a single (bs, input_len, existing_ctx) point → trace (EXTEND + DECODE) → parse → cross-rank analysis | -| `--collect shapes` | Re-run **without CUDA graph** to capture kernel input shapes, then merge into timing CSVs (both EXTEND and DECODE) | -| `--collect all` | Both phases back-to-back (auto-restarts the server in between). Requires `--launch-server`. | - -`--collect` is required. Use `perf`, `shapes`, or `all`. 
+| `--collect perf` | Profile a single (bs, input_len, existing_ctx) point → trace → parse → cross-rank analysis | +| `--collect shapes` | Re-run **without CUDA graph** to capture kernel input shapes, then merge into timing CSVs | +| `--collect all` | Both phases back-to-back (auto-restarts the server in between) | ### Examples -**Cold prefill** (server already running): - ```bash -python3 scripts/run_stage_profile.py \ +# Basic profiling +flowsim submit --scheduler local \ --collect perf \ - --bs 1 --input-len 2048 --decode-tokens 32 \ - --output-dir /workspace/traces \ - --host 0.0.0.0 --port 30001 -``` + --model-path workload/models/configs/Qwen3-235B-A22B \ + --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \ + --extra-server-opts "--load-format dummy" -**With existing KV cache context:** - -```bash -python3 scripts/run_stage_profile.py \ +# With existing KV cache context +flowsim submit --scheduler local \ --collect perf \ - --bs 4 --input-len 512 --existing-ctx 4096 --decode-tokens 32 \ - --output-dir /workspace/traces \ - --launch-server \ - --server-opts "--model-path Qwen/Qwen3-235B-A22B-FP8 --tp 4 --host 0.0.0.0 --port 30001" -``` + --model-path workload/models/configs/Qwen3-235B-A22B \ + --tp 1 --bs 4 --input-len 512 --existing-ctx 4096 --decode-tokens 2 --gpus 1 \ + --extra-server-opts "--load-format dummy" -**Collect shapes only** (requires a no-CUDA-graph server): - -```bash -python3 scripts/run_stage_profile.py \ - --collect shapes \ - --output-dir /workspace/sweep_P1_tp4 \ - --launch-server \ - --server-opts "--model-path Qwen/Qwen3-235B-A22B-FP8 --tp 4 --host 0.0.0.0 --port 30001" -``` - -When `--collect shapes` is used with `--launch-server`, the server is automatically started with `--disable-cuda-graph --disable-cuda-graph-padding`. 
- -**Full pipeline** (perf → auto-restart → shapes → merge): +# Full pipeline (perf + shapes) +flowsim submit --scheduler local \ + --collect all \ + --model-path workload/models/configs/Qwen3-235B-A22B \ + --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \ + --extra-server-opts "--load-format dummy" -```bash -python3 scripts/run_stage_profile.py \ +# Multi-point sweep +flowsim submit --scheduler local \ --collect all \ - --output-dir /workspace/sweep_P1_tp4 \ - --launch-server \ - --server-opts "--model-path Qwen/Qwen3-235B-A22B-FP8 --tp 4 --host 0.0.0.0 --port 30001" + --model-path workload/models/configs/Qwen3-235B-A22B \ + --sweep 1:2048:0 4:2048:0 8:2048:0 --decode-tokens 2 --gpus 1 \ + --extra-server-opts "--load-format dummy" ``` +For K8s / Slurm clusters, replace `--scheduler local` with `k8s` or `slurm`. See [schedulers/README.md](schedulers/README.md) for full scheduler documentation. ### Output structure ``` -sweep_P1_tp4/ -├── sweep_summary.json +stage_traces/{scheduler}/{YYYYMMDD_HHMMSS}/ ├── bs1_input2048_ctx0/ -│ ├── *-TP-*-EXTEND.trace.json.gz -│ ├── *-TP-*-DECODE.trace.json.gz -│ ├── parsed/ -│ │ ├── TP-0-EXTEND.csv -│ │ ├── TP-0-DECODE.csv -│ │ └── ... +│ ├── *.trace.json.gz +│ ├── parsed/*.csv +│ ├── merged/*_merged.trace.csv +│ ├── shape_traces/ + shape_parsed/ │ ├── analysis_extend.json │ └── analysis_decode.json -└── ... +├── logs/ +│ ├── server_*.{stdout,stderr}.log +│ ├── shape_server_*.{stdout,stderr}.log +│ └── {job_name}_*.{stdout,stderr}.log +└── sweep_summary.json ``` -After `--collect shapes`, each `parsed/TP-*-DECODE.csv` gains a `Dims` column with kernel tensor shapes. - -### Helper scripts - -| Script | Purpose | -|---|---| -| `tests/integration/test_stage_profile_configs.py` | Integration tests for `--collect {perf,shapes,all}` across parallelism configs. Run with `pytest` inside Docker. Filter with `RUN_CONFIGS=P1`. | +- `parsed/`: Per-rank timing CSVs extracted from traces. 
+- `merged/`: Timing + shape columns joined into a single CSV per rank/stage. +- `shape_traces/` / `shape_parsed/`: Raw and parsed shape-profiling traces (generated by `--collect shapes` or `--collect all`). +- `logs/`: Server, shape-server, and job stdout/stderr logs. ### Utilities (`utils/`) @@ -278,11 +182,16 @@ After `--collect shapes`, each `parsed/TP-*-DECODE.csv` gains a `Dims` column wi |---|---| | `utils/cross_rank_agg.py` | Cross-rank kernel aggregation (symmetric collectives → min, asymmetric → max, compute → mean) | | `utils/shape_merge.py` | Merge kernel shape data into timing CSVs | -| `utils/net.py` | Shared networking helpers (`wait_for_port`) | | `utils/merge_trace.py` | Merge multi-rank traces into a single Perfetto-compatible file | --- +## Scheduler Backends + +For submitting profiling jobs to **local Docker**, **Kubernetes**, or **Slurm** clusters, use the `flowsim` CLI. See [schedulers/README.md](schedulers/README.md) for full documentation including per-scheduler parameters, configuration, and environment variables. 
+ +--- + ## For Developers ### Customizing Profiling Workloads diff --git a/pyproject.toml b/pyproject.toml index 0b237ec..c91de8a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,3 +1,74 @@ +[build-system] +requires = ["setuptools>=68.0"] +build-backend = "setuptools.build_meta" + +[project] +name = "flowsim" +version = "0.1.0" +description = "Workload simulation pipeline for kernel-level inference profiling" +readme = "README.md" +license = {text = "MIT"} +requires-python = ">=3.10" +dependencies = [ + "requests>=2.28", + "perfetto>=0.7", + "numpy>=1.24", + "pandas>=1.5", + "PyYAML>=6.0", +] + +[project.optional-dependencies] +# Scheduler backends ------------------------------------------------------- +k8s = [ + "kubernetes>=27.0", # K8s Python client for remote job submission +] +slurm = [] # Slurm REST API uses stdlib urllib only + +# Full simulation stack (matches Dockerfile) -------------------------------- +sim = [ + "scalesim>=2.0", + "scipy>=1.10", + "torch>=2.0", +] + +# Visualization ------------------------------------------------------------- +viz = [ + "matplotlib>=3.7", + "seaborn>=0.12", +] + +# Backend API --------------------------------------------------------------- +api = [ + "fastapi>=0.100", + "pydantic>=2.0", + "uvicorn>=0.23", +] + +# Development --------------------------------------------------------------- +dev = [ + "black>=23.0", + "pytest>=7.0", +] + +# Everything ---------------------------------------------------------------- +all = [ + "flowsim[k8s,sim,viz,api,dev]", +] + +[tool.setuptools.packages.find] +include = [ + "schedulers*", + "scripts*", + "simulator*", + "utils*", +] + +[project.scripts] +flowsim = "scripts.cli:main" + [tool.black] line-length = 80 include = '\.pyi?$' + +[tool.pytest.ini_options] +testpaths = ["tests"] diff --git a/schedulers/README.md b/schedulers/README.md new file mode 100644 index 0000000..3994cb6 --- /dev/null +++ b/schedulers/README.md @@ -0,0 +1,242 @@ +# FlowSim Schedulers + +FlowSim 
supports three scheduler backends for submitting GPU profiling jobs:
+
+| Backend | Use Case | Runs On | Dependencies |
+|---------|----------|---------|--------------|
+| **local** | Single-machine dev/testing | Host Docker container | Docker + NVIDIA GPU |
+| **k8s** | Kubernetes cluster | K8s Job Pod | `kubernetes` Python package |
+| **slurm** | HPC cluster | Slurm compute node | Slurm CLI (`sbatch`/`squeue`/`scancel`) |
+
+## Quick Start
+
+```bash
+pip install -e .
+flowsim --help
+```
+
+## Common Workflow
+
+```bash
+# Submit a job (same interface for all backends)
+flowsim submit --scheduler <local|k8s|slurm> \
+  --collect <perf|shapes|all> \
+  --model-path <model-path> \
+  --tp 1 --bs 1 --input-len 2048 --decode-tokens 2 --gpus 1
+
+# Job lifecycle
+flowsim list --scheduler <scheduler>
+flowsim status --scheduler <scheduler> --job <job-id>
+flowsim logs --scheduler <scheduler> --job <job-id>
+flowsim cancel --scheduler <scheduler> --job <job-id>
+
+# Preview without submitting
+flowsim submit --scheduler <scheduler> ... --dry-run
+
+# Multi-point sweep
+flowsim submit --scheduler <scheduler> \
+  --collect all --model-path workload/models/configs/Qwen3-235B-A22B \
+  --sweep 1:2048:0 4:2048:0 8:2048:0 --gpus 1
+```
+
+### Common Parameters
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--collect` | `perf` / `shapes` / `all` | required |
+| `--model-path` | HuggingFace model path | required |
+| `--tp` | Tensor parallelism | `1` |
+| `--dp` | Data parallelism | `1` |
+| `--bs` | Batch size | `1` |
+| `--input-len` | Input sequence length | `2048` |
+| `--existing-ctx` | Existing KV cache length | `0` |
+| `--decode-tokens` | Decode batches to profile | `2` |
+| `--gpus` | GPU count | `1` |
+| `--image` | Docker image | `flowsim-image:latest` |
+| `--output-dir` | Output directory | `stage_traces/{scheduler}/{timestamp}/` |
+| `--extra-server-opts` | Extra sglang server flags (quoted string) | `""` |
+| `--sweep` | Multi-point sweep `BS:INPUT_LEN:CTX` (repeatable) | empty |
+| `--sweep-file` | File with one `BS:INPUT_LEN:CTX` per line (mutually exclusive with 
`--sweep`) | none | +| `--job-name` | Custom job name | auto-generated | +| `--dry-run` | Print script only | `false` | + +--- + +## 1. Local Scheduler + +Runs profiling via `docker run` on the host machine. + +```bash +flowsim submit --scheduler local \ + --collect all \ + --model-path workload/models/configs/Qwen3-235B-A22B \ + --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \ + --local-gpus 0 \ + --extra-server-opts "--load-format dummy" +``` + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `--local-gpus` | `CUDA_VISIBLE_DEVICES` (e.g. `0` or `0,1`) | all GPUs | +| `--local-workdir` | Host working directory | FlowSim project root | + +--- + +## 2. Kubernetes Scheduler + +Submits profiling jobs as Kubernetes Jobs. Supports PVC and hostPath storage. + +### Setup + +```bash +flowsim init k8s # install bundled template +flowsim init k8s --config my-cluster.yaml # or use your own +# Edit ~/.flowsim/k8s.yaml +``` + +### Usage + +```bash +flowsim submit --scheduler k8s \ + --collect all \ + --model-path workload/models/configs/Qwen3-235B-A22B \ + --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \ + --extra-server-opts "--load-format dummy" +``` + +### Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `--k8s-namespace` | K8s namespace | `default` | +| `--k8s-kubeconfig` | kubeconfig path | `~/.kube/config` | +| `--k8s-context` | kubeconfig context | current context | +| `--k8s-pvc` | PVC name for traces | empty | +| `--k8s-host-output-dir` | hostPath (when no PVC) | empty | +| `--k8s-node-selector` | Node selector `KEY=VALUE` (repeatable) | empty | +| `--k8s-service-account` | ServiceAccount | empty | +| `--k8s-shm-size` | Shared memory size | `16Gi` | +| `--k8s-runtime-class` | RuntimeClass (e.g. `nvidia`) | empty | + +--- + +## 3. Slurm Scheduler + +Generates sbatch scripts and submits via `sbatch`/`squeue`/`scancel`. 
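+
+For illustration, a rendered script has roughly this shape — the directive values and paths below are assumptions (actual values come from your config file, CLI flags, and the auto-generated job name):
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=flowsim-perf-qwen3-235b-a22b-bs1-il2048-0101-000000
+#SBATCH --partition=gpu
+#SBATCH --time=02:00:00
+#SBATCH --gres=gpu:1
+
+# Same stage-profiling entry point used by every backend
+python3 scripts/run_stage_profile.py --collect perf --launch-server \
+    --server-opts '--model-path workload/models/configs/Qwen3-235B-A22B --tp 1 --host 0.0.0.0 --port 30001' \
+    --decode-tokens 2 --bs 1 --input-len 2048 --existing-ctx 0 \
+    --output-dir stage_traces/slurm/20250101_000000
+```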
+ +### Setup + +```bash +flowsim init slurm # install bundled template +flowsim init slurm --config my-slurm.yaml # or use your own +# Edit ~/.flowsim/slurm.yaml +``` + +### Usage + +```bash +flowsim submit --scheduler slurm \ + --collect all \ + --model-path workload/models/configs/Qwen3-235B-A22B \ + --tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \ + --slurm-partition gpu \ + --extra-server-opts "--load-format dummy" +``` + +For remote clusters, use `--slurm-cli-prefix`: +```bash +flowsim submit --scheduler slurm ... \ + --slurm-cli-prefix "docker exec -i slurmctld" +# or: --slurm-cli-prefix "ssh login-node" +``` + +### Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `--slurm-partition` | Slurm partition | empty | +| `--slurm-time` | Job time limit | `02:00:00` | +| `--slurm-account` | Billing account | empty | +| `--slurm-constraint` | Node constraint | empty | +| `--slurm-cli-prefix` | Shell prefix for remote CLI | empty | +| `--slurm-container-runtime` | `docker` / `enroot` / `none` | `none` | +| `--slurm-container-mounts` | Container mounts | empty | +| `--slurm-module` | `module load` commands (repeatable) | empty | +| `--slurm-extra-sbatch` | Extra `#SBATCH` directives (repeatable) | empty | + +--- + +## Configuration + +Config files live in `~/.flowsim/` and are installed via `flowsim init`. +Templates with comments are in `schedulers/templates/`. 
+ +``` +~/.flowsim/ +├── k8s.yaml +└── slurm.yaml +``` + +**Priority** (highest to lowest): +CLI flag → environment variable → config file → built-in default + +### Environment Variables + +| Variable | Overrides | Example | +|----------|-----------|--------| +| `KUBECONFIG` | `--k8s-kubeconfig` | `/home/user/.kube/config` | +| `FLOWSIM_K8S_NAMESPACE` | `--k8s-namespace` | `ml-team` | +| `FLOWSIM_K8S_CONTEXT` | `--k8s-context` | `kind-flowsim` | +| `FLOWSIM_K8S_CONFIG` | Config file path | `/etc/flowsim/k8s.yaml` | +| `FLOWSIM_SLURM_PARTITION` | `--slurm-partition` | `gpu-h100` | +| `FLOWSIM_SLURM_TIME` | `--slurm-time` | `04:00:00` | +| `FLOWSIM_SLURM_CONFIG` | Config file path | `/etc/flowsim/slurm.yaml` | + +--- + +## Output Structure + +``` +stage_traces/{scheduler}/{YYYYMMDD_HHMMSS}/ +├── bs1_input2048_ctx0/ +│ ├── *.trace.json.gz +│ ├── parsed/*.csv +│ ├── merged/*_merged.trace.csv +│ ├── shape_traces/ + shape_parsed/ +│ ├── analysis_extend.json +│ └── analysis_decode.json +├── logs/ +│ ├── server_*.{stdout,stderr}.log +│ ├── shape_server_*.{stdout,stderr}.log +│ └── {job_name}_*.{stdout,stderr}.log +└── sweep_summary.json +``` + +--- + +## Development + +### Test Clusters + +```bash +# Kind (K8s) — GPU passthrough via CDI +bash tests/integration/infra/dev-setup.sh kind +bash tests/integration/infra/dev-teardown.sh kind + +# Slurm — Docker Compose cluster +cd tests/integration/infra/ +docker compose -f slurm-compose.yaml up -d +docker compose -f slurm-compose.yaml down -v +``` + +### Running Tests + +```bash +# Unit tests +python -m pytest tests/unit/test_scheduler_cli.py -v + +# Integration tests +python -m pytest tests/integration/test_scheduler.py::TestK8sScheduler -v -x +python -m pytest tests/integration/test_scheduler.py::TestSlurmScheduler -v -x +``` + diff --git a/schedulers/__init__.py b/schedulers/__init__.py new file mode 100644 index 0000000..7e0df35 --- /dev/null +++ b/schedulers/__init__.py @@ -0,0 +1,15 @@ +"""Scheduler backends for submitting 
FlowSim profiling jobs.""" + +from schedulers.base import BaseScheduler, JobResult, ProfileJobSpec +from schedulers.k8s import K8sScheduler +from schedulers.local import LocalScheduler +from schedulers.slurm import SlurmScheduler + +__all__ = [ + "BaseScheduler", + "JobResult", + "K8sScheduler", + "LocalScheduler", + "ProfileJobSpec", + "SlurmScheduler", +] diff --git a/schedulers/base.py b/schedulers/base.py new file mode 100644 index 0000000..ac71548 --- /dev/null +++ b/schedulers/base.py @@ -0,0 +1,218 @@ +"""Abstract base class for FlowSim job schedulers.""" + +from __future__ import annotations + +import abc +import shlex +import time +from dataclasses import dataclass, field +from typing import Optional, Sequence + + +@dataclass +class JobResult: + """Structured return value from ``submit()``.""" + + job_id: str + scheduler: str # "local", "k8s", "slurm" + state: str # "Submitted", "Completed", "Failed" + output_dir: str = "" + message: str = "" + + +@dataclass +class ProfileJobSpec: + """All parameters needed to run a stage-profiling job. + + The scheduler backends render this into a K8s Job YAML or Slurm + sbatch script. 
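+
+    Example (values illustrative)::
+
+        spec = ProfileJobSpec(
+            collect="perf",
+            model_path="workload/models/configs/Qwen3-235B-A22B",
+            tp=1,
+            bs=1,
+            input_len=2048,
+        )
+        cmd = spec.build_shell_command()  # single quoted shell string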
+ """ + + # -- Profiling workload -- + collect: str # "perf", "shapes", or "all" + model_path: str + tp: int = 1 + dp: int = 1 + bs: int = 1 + input_len: int = 2048 + existing_ctx: int = 0 + decode_tokens: int = 32 + warmup_n: int = 5 + disable_chunked_prefill: bool = False + max_prefill_tokens: int = 131072 + + # -- Infrastructure -- + image: str = "flowsim-image:latest" + gpus: int = 1 # total GPU count (must be >= tp * dp) + host: str = "0.0.0.0" + port: int = 30001 + output_dir: str = "/flowsim/stage_traces" + job_name: str = "" + + # -- Sweep: explicit list of (bs, input_len, existing_ctx) tuples -- + sweep_points: list[tuple[int, int, int]] = field(default_factory=list) + + # -- Extra server opts (appended verbatim) -- + extra_server_opts: str = "" + + def build_server_opts(self) -> str: + """Build the ``--server-opts`` string for run_stage_profile.py.""" + parts = [ + f"--model-path {self.model_path}", + f"--tp {self.tp}", + f"--host {self.host}", + f"--port {self.port}", + ] + if self.dp > 1: + parts.append(f"--dp {self.dp}") + if self.extra_server_opts: + parts.append(self.extra_server_opts) + return " ".join(parts) + + @property + def log_dir(self) -> str: + """Server logs go under ``{output_dir}/logs/``.""" + return self.output_dir + "/logs" + + def build_profile_command(self) -> list[str]: + """Build the full ``python scripts/run_stage_profile.py ...`` command.""" + cmd = [ + "python3", + "scripts/run_stage_profile.py", + "--collect", + self.collect, + "--launch-server", + "--server-opts", + self.build_server_opts(), + "--decode-tokens", + str(self.decode_tokens), + "--warmup-n", + str(self.warmup_n), + "--host", + self.host, + "--port", + str(self.port), + "--output-dir", + self.output_dir, + "--log-dir", + self.log_dir, + ] + if self.sweep_points: + cmd.append("--sweep") + for bs, il, ctx in self.sweep_points: + cmd.append(f"{bs}:{il}:{ctx}") + else: + cmd.extend(["--bs", str(self.bs)]) + cmd.extend(["--input-len", str(self.input_len)]) + 
cmd.extend(["--existing-ctx", str(self.existing_ctx)]) + if self.disable_chunked_prefill: + cmd.append("--disable-chunked-prefill") + cmd.extend(["--max-prefill-tokens", str(self.max_prefill_tokens)]) + return cmd + + def build_shell_command(self) -> str: + """Build a single shell command string (properly quoted).""" + cmd = self.build_profile_command() + # Quote the --server-opts value since it contains spaces + quoted = [] + i = 0 + while i < len(cmd): + if cmd[i] == "--server-opts" and i + 1 < len(cmd): + quoted.append(cmd[i]) + quoted.append(shlex.quote(cmd[i + 1])) + i += 2 + else: + quoted.append(cmd[i]) + i += 1 + return " ".join(quoted) + + def default_job_name(self) -> str: + """Generate a default job name from workload params. + + Auto-generated names include a short timestamp suffix + (``-MMDD-HHMMSS``) so repeated submissions of the same + workload get distinct names. User-supplied ``--job-name`` + values are returned as-is. + """ + if self.job_name: + return self.job_name + model_short = self.model_path.split("/")[-1].lower().replace(".", "-") + ts = time.strftime("%m%d-%H%M%S") + if self.sweep_points: + name = f"flowsim-{self.collect}-{model_short}-sweep{len(self.sweep_points)}pt-{ts}" + else: + name = f"flowsim-{self.collect}-{model_short}-bs{self.bs}-il{self.input_len}-{ts}" + return name + + +class BaseScheduler(abc.ABC): + """Abstract scheduler backend.""" + + @abc.abstractmethod + def render(self, spec: ProfileJobSpec) -> str: + """Render the job manifest / script as a string.""" + + @abc.abstractmethod + def submit(self, spec: ProfileJobSpec) -> JobResult: + """Submit the job and return a structured :class:`JobResult`.""" + + def cancel(self, job_id: str) -> str: + """Cancel a running or pending job. Returns a status message.""" + raise NotImplementedError( + f"{type(self).__name__} does not support cancel" + ) + + def status(self, job_id: str) -> dict: + """Query job status. Returns dict with at least 'state' key. 
+ + Subclasses should return:: + + { + "state": "Pending" | "Running" | "Succeeded" | "Failed" | ..., + "message": "human-readable detail", + "output_hint": "where to find trace files", + } + """ + raise NotImplementedError( + f"{type(self).__name__} does not support status queries" + ) + + def logs( + self, job_id: str, *, tail: int = 100, follow: bool = False + ) -> str: + """Retrieve recent log output for a job. + + Parameters + ---------- + job_id : str + Job name (K8s) or job ID (Slurm) or log prefix (local). + tail : int + Number of lines from the end to return. + follow : bool + If True, stream logs in real time (blocking). + """ + raise NotImplementedError( + f"{type(self).__name__} does not support log retrieval" + ) + + def list_jobs(self, *, status_filter: str = "") -> list[dict]: + """List jobs managed by this scheduler. + + Parameters + ---------- + status_filter : str + If non-empty, only return jobs matching this state + (e.g., ``"Running"``, ``"Succeeded"``, ``"PENDING"``). + + Returns + ------- + list[dict] + Each dict has at least ``{"job_id": ..., "state": ..., "name": ...}``. + """ + raise NotImplementedError( + f"{type(self).__name__} does not support list" + ) + + def dry_run(self, spec: ProfileJobSpec) -> str: + """Render and return the manifest without submitting.""" + return self.render(spec) diff --git a/schedulers/config.py b/schedulers/config.py new file mode 100644 index 0000000..10c7f8d --- /dev/null +++ b/schedulers/config.py @@ -0,0 +1,88 @@ +"""Load FlowSim scheduler config from per-scheduler YAML files. + +Config file lookup (per scheduler): + +K8s: + 1. ``FLOWSIM_K8S_CONFIG`` env var + 2. ``~/.flowsim/k8s.yaml`` + +Slurm: + 1. ``FLOWSIM_SLURM_CONFIG`` env var + 2. ``~/.flowsim/slurm.yaml`` + +Priority (highest → lowest): + CLI flag > env var > config file > built-in default + +Run ``flowsim init k8s`` or ``flowsim init slurm`` to install +a config template under ``~/.flowsim/``. Templates are in +``schedulers/templates/``. 
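+
+Example (illustrative; the ``partition`` key is an assumed config entry)::
+
+    from schedulers.config import load_slurm_config, resolve_default
+
+    cfg = load_slurm_config()
+    partition = resolve_default(
+        "FLOWSIM_SLURM_PARTITION", cfg, "partition", fallback=""
+    )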
+""" + +from __future__ import annotations + +import os +from pathlib import Path + +import yaml as _yaml + + +def _load_yaml(path: Path) -> dict: + with open(path) as f: + return _yaml.safe_load(f) or {} + + +_CONFIG_DIR = Path.home() / ".flowsim" + + +def _save_yaml(path: Path, data: dict) -> None: + """Write a dict to a YAML file.""" + path.parent.mkdir(parents=True, exist_ok=True) + with open(path, "w") as f: + _yaml.safe_dump(data, f, default_flow_style=False, sort_keys=False) + + +def _resolve_path(env_var: str, filename: str) -> Path | None: + """Return the config file path, or None if it doesn't exist.""" + env = os.environ.get(env_var) + if env: + p = Path(env) + return p if p.is_file() else None + default = _CONFIG_DIR / filename + return default if default.is_file() else None + + +def load_k8s_config() -> dict: + """Load ``~/.flowsim/k8s.yaml`` (or ``FLOWSIM_K8S_CONFIG``).""" + path = _resolve_path("FLOWSIM_K8S_CONFIG", "k8s.yaml") + if path is None: + return {} + try: + return _load_yaml(path) + except Exception: + return {} + + +def load_slurm_config() -> dict: + """Load ``~/.flowsim/slurm.yaml`` (or ``FLOWSIM_SLURM_CONFIG``).""" + path = _resolve_path("FLOWSIM_SLURM_CONFIG", "slurm.yaml") + if path is None: + return {} + try: + return _load_yaml(path) + except Exception: + return {} + + +def cfg_get(cfg: dict, key: str, fallback: str = "") -> str: + """Get a value from a flat config dict, or fallback.""" + val = cfg.get(key) + if val is not None: + return str(val) + return fallback + + +def resolve_default( + env_var: str, cfg: dict, key: str, fallback: str = "" +) -> str: + """Resolve a config value: env var > config file > fallback.""" + return os.environ.get(env_var, "") or cfg_get(cfg, key, fallback) diff --git a/schedulers/k8s.py b/schedulers/k8s.py new file mode 100644 index 0000000..8c07771 --- /dev/null +++ b/schedulers/k8s.py @@ -0,0 +1,415 @@ +"""Kubernetes Job scheduler for FlowSim profiling. 
+ +Uses the ``kubernetes`` Python client for remote submission. +The ``render()`` / ``dry_run()`` path uses stdlib only (json fallback if +PyYAML is not installed — JSON is valid YAML 1.2 and ``kubectl`` accepts it). +""" + +from __future__ import annotations + +import json + +from schedulers.base import BaseScheduler, JobResult, ProfileJobSpec + + +def _k8s_job_state(status) -> str: + """Derive a human-readable state string from a K8s Job status object.""" + if status.succeeded and status.succeeded > 0: + return "Succeeded" + if status.failed and status.failed > 0: + return "Failed" + if status.active and status.active > 0: + return "Running" + return "Pending" + + +# Optional: nicer YAML output for dry-run. +try: + import yaml as _yaml # type: ignore[import-untyped] + + def _dump(obj: dict) -> str: + return _yaml.safe_dump(obj, default_flow_style=False, sort_keys=False) + +except ImportError: + _yaml = None # type: ignore[assignment] + + def _dump(obj: dict) -> str: # type: ignore[misc] + return json.dumps(obj, indent=2, ensure_ascii=False) + "\n" + + +class K8sScheduler(BaseScheduler): + """Generate and optionally submit a Kubernetes Job for profiling. + + Parameters + ---------- + namespace : str + Kubernetes namespace for the Job. + kubeconfig : str, optional + Path to a kubeconfig file. When empty, the ``kubernetes`` client + tries in-cluster config, then ``~/.kube/config``. + context : str, optional + kubeconfig context to activate. + pvc_name : str, optional + Name of a PersistentVolumeClaim to mount for trace output. + If empty, uses ``emptyDir`` (traces are lost when the pod exits). + host_output_dir : str, optional + If set (and *pvc_name* is empty), use a ``hostPath`` volume at + this path instead of a PVC. + node_selector : dict, optional + Kubernetes nodeSelector labels (e.g., ``{"gpu": "a100"}``). + service_account : str, optional + ServiceAccount name for the pod. + shm_size : str + Size of ``/dev/shm`` (shared memory). Defaults to ``"16Gi"``. 
+ runtime_class_name : str, optional + Kubernetes RuntimeClass name for the pod (e.g., ``"nvidia"`` for + CDI-based GPU injection in Kind clusters). + """ + + def __init__( + self, + *, + namespace: str = "default", + kubeconfig: str = "", + context: str = "", + pvc_name: str = "", + host_output_dir: str = "", + node_selector: dict[str, str] | None = None, + service_account: str = "", + shm_size: str = "16Gi", + runtime_class_name: str = "", + ) -> None: + self.namespace = namespace + self.kubeconfig = kubeconfig + self.context = context + self.pvc_name = pvc_name + self.host_output_dir = host_output_dir + self.node_selector = node_selector or {} + self.service_account = service_account + self.shm_size = shm_size + self.runtime_class_name = runtime_class_name + + def render(self, spec: ProfileJobSpec) -> str: + return _dump(self._build_job_dict(spec)) + + # ----------------------------------------------------------------- + # Build a plain-dict manifest (used by both render and submit) + # ----------------------------------------------------------------- + def _build_job_dict(self, spec: ProfileJobSpec) -> dict: + """Return the Job manifest as a nested Python dict.""" + job_name = spec.default_job_name()[:63] + cmd = spec.build_profile_command() + + # volumes + mounts + volume_mounts = [{"name": "dshm", "mountPath": "/dev/shm"}] + volumes: list[dict] = [ + { + "name": "dshm", + "emptyDir": {"medium": "Memory", "sizeLimit": self.shm_size}, + }, + ] + if self.pvc_name: + volume_mounts.append( + {"name": "output", "mountPath": spec.output_dir} + ) + volumes.append( + { + "name": "output", + "persistentVolumeClaim": {"claimName": self.pvc_name}, + } + ) + elif self.host_output_dir: + # Mount at base traces dir so the full directory structure + # (e.g. k8s/{timestamp}/bs1_...) is preserved on the host. 
+ volume_mounts.append( + {"name": "output", "mountPath": "/flowsim/stage_traces"} + ) + volumes.append( + { + "name": "output", + "hostPath": { + "path": self.host_output_dir, + "type": "DirectoryOrCreate", + }, + } + ) + + container = { + "name": "profiler", + "image": spec.image, + "imagePullPolicy": "IfNotPresent", + "workingDir": "/flowsim", + "command": cmd, + "env": [{"name": "SGLANG_PROFILE_KERNELS", "value": "1"}], + "resources": { + "limits": {"nvidia.com/gpu": str(spec.gpus)}, + "requests": {"nvidia.com/gpu": str(spec.gpus)}, + }, + "volumeMounts": volume_mounts, + } + + pod_spec: dict = { + "restartPolicy": "Never", + "containers": [container], + "volumes": volumes, + } + if self.runtime_class_name: + pod_spec["runtimeClassName"] = self.runtime_class_name + if self.service_account: + pod_spec["serviceAccountName"] = self.service_account + if self.node_selector: + pod_spec["nodeSelector"] = dict(self.node_selector) + + return { + "apiVersion": "batch/v1", + "kind": "Job", + "metadata": { + "name": job_name, + "namespace": self.namespace, + "labels": { + "app": "flowsim", + "component": "profiling", + "collect": spec.collect, + }, + }, + "spec": { + "backoffLimit": 0, + "ttlSecondsAfterFinished": 86400, + "template": { + "metadata": { + "labels": {"app": "flowsim", "component": "profiling"} + }, + "spec": pod_spec, + }, + }, + } + + def submit(self, spec: ProfileJobSpec) -> JobResult: + """Submit via the ``kubernetes`` Python client (``pip install kubernetes``).""" + if not self.pvc_name and not self.host_output_dir: + raise ValueError( + "No persistent storage configured. " + "Set --k8s-pvc or --k8s-host-output-dir to avoid losing traces when the pod exits." 
+ ) + batch_api, _ = self._load_k8s() + + body = self._build_job_dict(spec) + resp = batch_api.create_namespaced_job( + namespace=self.namespace, + body=body, + ) + return JobResult( + job_id=resp.metadata.name, + scheduler="k8s", + state="Submitted", + output_dir=spec.output_dir, + message=f"job.batch/{resp.metadata.name} created (namespace={resp.metadata.namespace})", + ) + + # ----------------------------------------------------------------- + # Helpers shared by status / logs + # ----------------------------------------------------------------- + + def _load_k8s(self): + """Load kubeconfig and return (BatchV1Api, CoreV1Api). + + Raises RuntimeError with actionable message on failure. + """ + try: + from kubernetes import client as k8s_client, config as k8s_config + except ImportError: + raise RuntimeError( + "The 'kubernetes' package is required. " + "Install it with: pip install kubernetes" + ) + + config_kwargs: dict = {} + if self.kubeconfig: + config_kwargs["config_file"] = self.kubeconfig + if self.context: + config_kwargs["context"] = self.context + try: + k8s_config.load_kube_config(**config_kwargs) + except k8s_config.ConfigException: + try: + k8s_config.load_incluster_config() + except k8s_config.ConfigException: + hint = ( + " Try --k8s-kubeconfig /path/to/kubeconfig." + if not self.kubeconfig + else "" + ) + raise RuntimeError( + "No valid Kubernetes configuration found. " + "Checked kubeconfig file and in-cluster environment." 
+ hint + ) + + return k8s_client.BatchV1Api(), k8s_client.CoreV1Api() + + def cancel(self, job_id: str) -> str: + """Delete a K8s Job (and its pods) by name.""" + from kubernetes import client as k8s_client + + batch_api, _ = self._load_k8s() + batch_api.delete_namespaced_job( + name=job_id, + namespace=self.namespace, + body=k8s_client.V1DeleteOptions(propagation_policy="Foreground"), + ) + return f"job.batch/{job_id} deleted (namespace={self.namespace})" + + def status(self, job_id: str) -> dict: + """Query K8s Job status by job name.""" + batch_api, core_api = self._load_k8s() + + job = batch_api.read_namespaced_job( + name=job_id, namespace=self.namespace + ) + + # Determine state + state = _k8s_job_state(job.status) + + # Pod info + pods = core_api.list_namespaced_pod( + namespace=self.namespace, + label_selector=f"job-name={job_id}", + ) + pod_statuses = [] + for pod in pods.items: + phase = pod.status.phase + node = pod.spec.node_name or "unscheduled" + pod_statuses.append(f"{pod.metadata.name} ({phase}, node={node})") + + output_hint = "" + if self.pvc_name: + output_hint = f"Traces persisted on PVC '{self.pvc_name}'" + elif self.host_output_dir: + output_hint = f"Traces at hostPath {self.host_output_dir} on the scheduled node" + else: + output_hint = "WARNING: no PVC or hostPath configured — traces will be lost when pod exits" + + msg_parts = [ + f"Job: {job_id} Namespace: {self.namespace} State: {state}" + ] + if pod_statuses: + msg_parts.append("Pods: " + ", ".join(pod_statuses)) + msg_parts.append(output_hint) + + return { + "state": state, + "message": "\n".join(msg_parts), + "output_hint": output_hint, + } + + def logs( + self, job_id: str, *, tail: int = 100, follow: bool = False + ) -> str: + """Show where logs are and how to access them for a K8s Job.""" + _, core_api = self._load_k8s() + + pods = core_api.list_namespaced_pod( + namespace=self.namespace, + label_selector=f"job-name={job_id}", + ) + if not pods.items: + return ( + f"No pods found 
for job {job_id} in namespace {self.namespace}" + ) + + if follow: + # Stream logs from the first running/succeeded pod + for pod in pods.items: + name = pod.metadata.name + if pod.status.phase in ("Running", "Succeeded"): + # Use kubectl follow since the Python client follow is blocking + return ( + f"Follow logs:\n" + f" kubectl logs -f {name} -n {self.namespace}" + ) + name = pods.items[0].metadata.name + return f"Follow logs:\n kubectl logs -f {name} -n {self.namespace}" + + parts: list[str] = [] + + # Pod info + for pod in pods.items: + name = pod.metadata.name + phase = pod.status.phase + parts.append(f"Pod: {name} ({phase})") + + parts.append("") + + # Commands to view pod stdout + parts.append("View profiling script output:") + for pod in pods.items: + name = pod.metadata.name + parts.append(f" kubectl logs {name} -n {self.namespace}") + parts.append( + f" kubectl logs {name} -n {self.namespace} --tail={tail}" + ) + + parts.append("") + + # Persistent log files + if self.pvc_name: + parts.append( + f"Server logs + traces persisted on PVC '{self.pvc_name}'." 
+ ) + parts.append("Copy to local machine:") + for pod in pods.items: + name = pod.metadata.name + if pod.status.phase in ("Running", "Succeeded"): + parts.append( + f" kubectl cp {self.namespace}/{name}:/flowsim/stage_traces ./stage_traces" + ) + break + else: + parts.append( + " (pod not running — mount the PVC in another pod to retrieve files)" + ) + elif self.host_output_dir: + parts.append(f"Server logs + traces at hostPath on the node:") + parts.append(f" {self.host_output_dir}/") + parts.append(f" {self.host_output_dir}/logs/") + # Identify node + for pod in pods.items: + if pod.spec.node_name: + parts.append(f" Node: {pod.spec.node_name}") + parts.append( + f" scp {pod.spec.node_name}:{self.host_output_dir}/ ./stage_traces/" + ) + break + + return "\n".join(parts) + + def list_jobs(self, *, status_filter: str = "") -> list[dict]: + """List FlowSim Jobs in the namespace (label: app=flowsim).""" + batch_api, _ = self._load_k8s() + + jobs = batch_api.list_namespaced_job( + namespace=self.namespace, + label_selector="app=flowsim", + ) + result: list[dict] = [] + for job in jobs.items: + state = _k8s_job_state(job.status) + + if status_filter and state.lower() != status_filter.lower(): + continue + + created = "" + if job.metadata.creation_timestamp: + created = job.metadata.creation_timestamp.strftime( + "%Y-%m-%d %H:%M:%S" + ) + + result.append( + { + "job_id": job.metadata.name, + "name": job.metadata.name, + "state": state, + "namespace": self.namespace, + "created": created, + } + ) + return result diff --git a/schedulers/local.py b/schedulers/local.py new file mode 100644 index 0000000..7015d28 --- /dev/null +++ b/schedulers/local.py @@ -0,0 +1,369 @@ +"""Local scheduler — run profiling via Docker on the local machine. + +``render()`` returns the ``docker run`` command string. +``submit()`` executes it as a subprocess, with stdout/stderr tee'd to log files. +The profiling runs inside the FlowSim Docker image with GPU access. 
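The point of `_shell_quote` is that the inner profiling command must survive embedding in `bash -c '...'` even when it contains its own quotes; `shlex.quote` and `shlex.split` round-trip exactly:

```python
import shlex

inner = "python scripts/run_profile.py --bench-opts '--num-prompts 1'"
cmd = f"bash -c {shlex.quote(inner)}"
# Splitting the outer command recovers the inner one verbatim,
# embedded single quotes included.
assert shlex.split(cmd) == ["bash", "-c", inner]
```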
+""" + +from __future__ import annotations + +import glob +import os +import re +import shlex +import subprocess +import sys +import threading +import time + +from schedulers.base import BaseScheduler, JobResult, ProfileJobSpec + + +def _shell_quote(s: str) -> str: + """Quote a string for safe embedding in a bash -c '...' invocation.""" + return shlex.quote(s) + + +class LocalScheduler(BaseScheduler): + """Run profiling jobs locally inside a Docker container. + + Parameters + ---------- + gpus : str + GPU device IDs for Docker ``--gpus`` (e.g., ``"0"`` or ``"0,1"``). + Empty string means all GPUs. + workdir : str + Host directory to use as the FlowSim project root for log scanning. + Defaults to the FlowSim project root on the host. + """ + + def __init__( + self, + *, + gpus: str = "", + workdir: str = "", + ) -> None: + self.gpus = gpus + self.workdir = workdir or self._find_project_root() + + @staticmethod + def _find_project_root() -> str: + """Walk up from this file to find the FlowSim project root.""" + d = os.path.dirname(os.path.abspath(__file__)) + # schedulers/ is one level below project root + return os.path.dirname(d) + + @staticmethod + def _check_image_exists(image: str) -> None: + """Raise if the Docker image is not available locally.""" + result = subprocess.run( + ["docker", "image", "inspect", image], + capture_output=True, + timeout=10, + ) + if result.returncode != 0: + raise SystemExit( + f"[local] Docker image '{image}' not found.\n" + f"Build it first, e.g.:\n" + f" docker build -t {image} -f dockerfiles/cuda12.6.dockerfile ." + ) + + def _docker_gpu_flag(self) -> str: + """Build the ``--gpus`` flag for ``docker run``.""" + if not self.gpus: + return "--gpus all" + return f"--gpus '\"device={self.gpus}\"'" + + def _host_output_dir(self, spec_output_dir: str) -> str: + """Host directory that gets bind-mounted into the container. + + Mirrors the container path structure under the host workdir. + e.g. 
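The host/container path mirroring described above boils down to stripping the `/flowsim/` prefix and re-rooting under the host workdir; a standalone restatement of `_host_output_dir`:

```python
import os

def host_output_dir(workdir: str, spec_output_dir: str) -> str:
    # Strip the /flowsim/ container prefix so the same relative layout
    # (e.g. stage_traces/local/{ts}) appears under the host workdir.
    rel = spec_output_dir
    if rel.startswith("/flowsim/"):
        rel = rel[len("/flowsim/"):]
    return os.path.join(workdir, rel)

assert (host_output_dir("/home/me/flowsim", "/flowsim/stage_traces/local/20260317_211318")
        == "/home/me/flowsim/stage_traces/local/20260317_211318")
# Absolute paths outside /flowsim/ are left as-is relative to joining rules.
assert host_output_dir("/home/me/flowsim", "stage_traces/x") == "/home/me/flowsim/stage_traces/x"
```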
container ``/flowsim/stage_traces/local/20260317_211318`` + → host ``{workdir}/stage_traces/local/20260317_211318`` + """ + # spec_output_dir is like /flowsim/stage_traces/local/{ts} + # Strip the /flowsim/ prefix to get the relative path + rel = spec_output_dir + if rel.startswith("/flowsim/"): + rel = rel[len("/flowsim/") :] + return os.path.join(self.workdir, rel) + + def _build_docker_cmd(self, spec: ProfileJobSpec) -> str: + """Build the full ``docker run`` command. + + Paths in *spec* (model_path, output_dir, log_dir) are expected to be + relative to the project root or absolute container paths (``/flowsim/…``). + The container workdir is ``/flowsim``, so relative paths resolve + correctly without any string replacement. + """ + job_name = spec.default_job_name()[:63] + host_output = self._host_output_dir(spec.output_dir) + container_output = ( + spec.output_dir + ) # e.g. /flowsim/stage_traces/local/{ts} + + inner_cmd = spec.build_shell_command() + + parts = [ + "docker run --rm", + f"--name {job_name}", + self._docker_gpu_flag(), + "--ipc=host --shm-size=16g", + "--network=host", + f"-e SGLANG_PROFILE_KERNELS=1", + f"-v {host_output}:{container_output}", + f"-v {self.workdir}/simulator:/flowsim/simulator", + f"-v {self.workdir}/scripts:/flowsim/scripts", + f"-w /flowsim", + spec.image, + f"bash -c {_shell_quote(inner_cmd)}", + ] + return " \\\n ".join(parts) + + def render(self, spec: ProfileJobSpec) -> str: + return self._build_docker_cmd(spec) + + def submit(self, spec: ProfileJobSpec) -> JobResult: + """Launch a Docker container for profiling. + + stdout and stderr are streamed to the terminal *and* saved to + log files under ``spec.output_dir/logs/`` on the host. 
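The tee pattern used by `submit` — one daemon thread per stream, writing each line to both the live terminal and a log file — can be demonstrated with an in-memory sink standing in for the log file:

```python
import io
import subprocess
import sys
import threading

def tee(src, dest_file, dest_stream):
    # Copy each line from the child process to both destinations.
    for line in src:
        text = line.decode("utf-8", errors="replace")
        dest_stream.write(text)
        dest_file.write(text)

proc = subprocess.Popen(
    [sys.executable, "-c", "print('hello from child')"],
    stdout=subprocess.PIPE,
)
log = io.StringIO()
t = threading.Thread(target=tee, args=(proc.stdout, log, sys.stdout), daemon=True)
t.start()
proc.wait()
t.join()
assert log.getvalue().strip() == "hello from child"
```

Joining the threads after `proc.wait()` matters: the pipes only hit EOF once the child exits, and the join guarantees the log file has every line before the return code is inspected.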
+ """ + self._check_image_exists(spec.image) + + # Ensure host output dir exists before mounting + host_output = self._host_output_dir(spec.output_dir) + log_dir = os.path.join(host_output, "logs") + os.makedirs(log_dir, exist_ok=True) + + docker_cmd = self._build_docker_cmd(spec) + job_name = spec.default_job_name() + ts = time.strftime("%Y%m%d_%H%M%S") + + # Remove stale container with the same name (e.g. from a killed run) + subprocess.run( + ["docker", "rm", "-f", job_name[:63]], + capture_output=True, + timeout=10, + ) + stdout_path = os.path.join(log_dir, f"{job_name}_{ts}.stdout.log") + stderr_path = os.path.join(log_dir, f"{job_name}_{ts}.stderr.log") + + print(f"[local] Running {job_name} in Docker...") + print(f"[local] image: {spec.image}") + print(f"[local] gpus: {self.gpus or 'all'}") + print(f"[local] host output: {host_output}") + print(f"[local] logs: {stdout_path}") + print(f"[local] {stderr_path}") + print(f"[local] cmd:\n {docker_cmd}") + print() + + with open(stdout_path, "w") as fout, open(stderr_path, "w") as ferr: + proc = subprocess.Popen( + docker_cmd, + shell=True, + cwd=self.workdir, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + ) + + def _tee(src, dest_file, dest_stream): + for line in src: + dest_stream.buffer.write(line) + dest_stream.buffer.flush() + dest_file.write(line.decode("utf-8", errors="replace")) + dest_file.flush() + + t_out = threading.Thread( + target=_tee, + args=(proc.stdout, fout, sys.stdout), + daemon=True, + ) + t_err = threading.Thread( + target=_tee, + args=(proc.stderr, ferr, sys.stderr), + daemon=True, + ) + t_out.start() + t_err.start() + proc.wait() + t_out.join() + t_err.join() + + if proc.returncode != 0: + return JobResult( + job_id=job_name, + scheduler="local", + state="Failed", + output_dir=host_output, + message=( + f"{job_name} FAILED (exit code {proc.returncode})\n" + f"stdout log: {stdout_path}\n" + f"stderr log: {stderr_path}" + ), + ) + return JobResult( + job_id=job_name, + 
scheduler="local", + state="Completed", + output_dir=host_output, + message=( + f"{job_name} completed successfully\n" + f"stdout log: {stdout_path}\n" + f"stderr log: {stderr_path}" + ), + ) + + def cancel(self, job_id: str) -> str: + """Stop the Docker container for a local job. + + The Docker container name is truncated to 63 characters when created. + To ensure we stop the correct container even if a longer job id is + provided (for example, the full job name), apply the same truncation + here before calling ``docker stop``. + """ + container_name = job_id[:63] + proc = subprocess.run( + ["docker", "stop", container_name], + capture_output=True, + text=True, + timeout=30, + ) + if proc.returncode == 0: + return f"Stopped container {container_name}" + return f"Could not stop container {container_name}: {proc.stderr.strip()}" + + def _find_log_dirs(self) -> list[str]: + """Find all log directories under stage_traces/{scheduler}/*/logs/.""" + base = os.path.join(self.workdir, "stage_traces", "local") + # New layout: stage_traces/local/{ts}/logs/ + dirs = sorted(glob.glob(os.path.join(base, "*/logs"))) + # Also check legacy flat layout: stage_traces/logs/ + legacy = os.path.join(self.workdir, "stage_traces", "logs") + if os.path.isdir(legacy): + dirs.append(legacy) + return dirs + + def status(self, job_id: str) -> dict: + """Check local job status by looking for log files. + + ``job_id`` is the job name prefix used in log filenames. 
+ """ + matches = [] + for log_dir in self._find_log_dirs(): + matches.extend( + sorted( + glob.glob(os.path.join(log_dir, f"{job_id}_*.stdout.log")) + ) + ) + + if not matches: + return { + "state": "NotFound", + "message": f"No logs found for job '{job_id}'", + "output_hint": "", + } + + latest = matches[-1] + stderr_log = latest.replace(".stdout.log", ".stderr.log") + # trace_dir is the parent of logs/ + trace_dir = os.path.dirname(os.path.dirname(latest)) + + return { + "state": "Completed", + "message": ( + f"Latest log: {latest}\n" + f"Stderr log: {stderr_log}\n" + f"Traces dir: {trace_dir}" + ), + "output_hint": trace_dir, + } + + def logs( + self, job_id: str, *, tail: int = 100, follow: bool = False + ) -> str: + """List log files for a local job and print access commands.""" + matches = [] + for log_dir in self._find_log_dirs(): + matches.extend( + sorted(glob.glob(os.path.join(log_dir, f"{job_id}_*"))) + ) + + if not matches: + for log_dir in self._find_log_dirs(): + matches.extend( + sorted(glob.glob(os.path.join(log_dir, f"*{job_id}*"))) + ) + + if not matches: + return f"No logs found matching '{job_id}'" + + if follow: + stdout_files = sorted( + f for f in matches if f.endswith(".stdout.log") + ) + if stdout_files: + return f"Follow logs with:\n tail -f {stdout_files[-1]}" + return f"No stdout log found to follow for '{job_id}'" + + log_dir = os.path.dirname(matches[-1]) + parts = [f"Log directory: {log_dir}", ""] + parts.append(f"Files ({len(matches)}):") + for p in matches: + size = os.path.getsize(p) + parts.append(f" {os.path.basename(p)} ({size:,} bytes)") + + # Provide commands + parts.append("") + parts.append("View logs:") + stdout_files = sorted(f for f in matches if f.endswith(".stdout.log")) + stderr_files = sorted(f for f in matches if f.endswith(".stderr.log")) + if stdout_files: + parts.append(f" less {stdout_files[-1]}") + if stderr_files: + parts.append(f" less {stderr_files[-1]}") + if stdout_files: + parts.append("") + 
parts.append("Follow logs:") + parts.append(f" tail -f {stdout_files[-1]}") + + trace_dir = os.path.dirname(log_dir) # parent of logs/ + parts.append("") + parts.append(f"Trace files: {trace_dir}") + parts.append(f" ls {trace_dir}") + + return "\n".join(parts) + + def list_jobs(self, *, status_filter: str = "") -> list[dict]: + """List local jobs by scanning log files.""" + matches = [] + for log_dir in self._find_log_dirs(): + matches.extend( + sorted(glob.glob(os.path.join(log_dir, "*.stdout.log"))) + ) + + jobs: list[dict] = [] + for path in matches: + basename = os.path.basename(path) + # Parse: {job_name}_{YYYYMMDD_HHMMSS}.stdout.log + # Also support old epoch format {job_name}_{digits}.stdout.log + m = re.match(r"^(.+)_(\d{8}_\d{6}|\d{10,})\.stdout\.log$", basename) + if not m: + continue + name = m.group(1) + ts = m.group(2) + state = "Completed" + jobs.append( + { + "job_id": name, + "name": name, + "state": state, + "timestamp": ts, + } + ) + + if status_filter: + filt = status_filter.lower() + jobs = [j for j in jobs if j["state"].lower() == filt] + + return jobs diff --git a/schedulers/slurm.py b/schedulers/slurm.py new file mode 100644 index 0000000..543b22f --- /dev/null +++ b/schedulers/slurm.py @@ -0,0 +1,332 @@ +"""Slurm sbatch scheduler for FlowSim profiling. + +``render()`` / ``dry_run()`` produce a standalone bash script (zero deps). +``submit()`` pipes the script to ``sbatch`` via subprocess (CLI mode). + +Requires ``sbatch``/``squeue``/``scancel`` on PATH (or reachable +via ``cli_prefix``, e.g. ``"docker exec slurmctld"``). +""" + +from __future__ import annotations + +import shlex +import subprocess + +from schedulers.base import BaseScheduler, JobResult, ProfileJobSpec + + +class SlurmScheduler(BaseScheduler): + """Generate and optionally submit an sbatch script for profiling. + + Parameters + ---------- + partition : str + Slurm partition to submit to. + time_limit : str + Wall-clock time limit (e.g., ``"01:00:00"``). 
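The filename pattern that `list_jobs` parses accepts both the current `{job_name}_{YYYYMMDD_HHMMSS}` form and the legacy epoch-seconds suffix; the greedy group backtracks correctly past underscores inside the job name:

```python
import re

# Same pattern as list_jobs: {job_name}_{YYYYMMDD_HHMMSS}.stdout.log,
# plus the legacy {job_name}_{epoch-seconds} form.
pattern = re.compile(r"^(.+)_(\d{8}_\d{6}|\d{10,})\.stdout\.log$")

m = pattern.match("bs1_in2048_ctx0_20260317_211318.stdout.log")
assert m is not None
assert m.group(1) == "bs1_in2048_ctx0"   # underscores in the name survive
assert m.group(2) == "20260317_211318"
assert pattern.match("myjob_1710000000.stdout.log") is not None  # legacy epoch
assert pattern.match("myjob.stdout.log") is None                 # no timestamp
```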
+ account : str, optional + ``--account`` for which allocation to charge. + constraint : str, optional + ``--constraint`` node feature (e.g., ``"gpu80g"``). + container_runtime : str + How to run the container inside the allocation. + ``"docker"`` -> ``docker run`` + ``"enroot"`` -> ``srun --container-image`` + ``"none"`` -> run bare-metal (no container) + container_mounts : str + Bind-mount string passed to the container runtime + (e.g., ``"/data:/data"``). + modules : list[str] + ``module load`` commands to run before the job + (relevant for ``"none"`` runtime). + extra_sbatch : list[str] + Additional ``#SBATCH`` lines, each *without* the ``#SBATCH`` prefix. + cli_prefix : str + Shell prefix for CLI commands (e.g. ``"docker exec -i slurmctld"``). + """ + + def __init__( + self, + *, + partition: str = "gpu", + time_limit: str = "02:00:00", + account: str = "", + constraint: str = "", + container_runtime: str = "none", + container_mounts: str = "", + modules: list[str] | None = None, + extra_sbatch: list[str] | None = None, + cli_prefix: str = "", + ) -> None: + self.partition = partition + self.time_limit = time_limit + self.account = account + self.constraint = constraint + self.container_runtime = container_runtime + self.container_mounts = container_mounts + self.modules = modules or [] + self.extra_sbatch = extra_sbatch or [] + self.cli_prefix = cli_prefix + + def render(self, spec: ProfileJobSpec) -> str: + job_name = spec.default_job_name() + cmd = spec.build_shell_command() + + lines = [ + "#!/bin/bash", + f"#SBATCH --job-name={job_name}", + f"#SBATCH --partition={self.partition}", + f"#SBATCH --gpus-per-node={spec.gpus}", + f"#SBATCH --ntasks=1", + f"#SBATCH --exclusive", + f"#SBATCH --time={self.time_limit}", + f"#SBATCH --output={spec.output_dir}/{job_name}_%j.log", + ] + + if self.account: + lines.append(f"#SBATCH --account={self.account}") + if self.constraint: + lines.append(f"#SBATCH --constraint={self.constraint}") + for extra in self.extra_sbatch: 
+ lines.append(f"#SBATCH {extra}") + + lines.append("") + lines.append("set -euo pipefail") + lines.append("") + + # Ensure output dir exists (needed for #SBATCH --output) + lines.append(f"mkdir -p {spec.output_dir}") + lines.append("") + + if self.modules: + for mod in self.modules: + lines.append(f"module load {mod}") + lines.append("") + + lines.append("export SGLANG_PROFILE_KERNELS=1") + lines.append("") + + if self.container_runtime == "docker": + # Always mount output_dir so traces/logs persist on the host. + mounts = f" -v {spec.output_dir}:{spec.output_dir}" + if self.container_mounts: + mounts += f" -v {self.container_mounts}" + lines.append( + f"docker run --gpus all --ipc=host --shm-size=16g" + f"{mounts} -w /flowsim {spec.image} \\" + ) + lines.append(f" {cmd}") + elif self.container_runtime == "enroot": + # Always mount output_dir so traces/logs persist on the host. + out_mount = f"{spec.output_dir}:{spec.output_dir}" + if self.container_mounts: + all_mounts = f"{self.container_mounts},{out_mount}" + else: + all_mounts = out_mount + lines.append( + f"srun --container-image={spec.image}" + f" --container-workdir=/flowsim" + f" --container-mounts={all_mounts} \\" + ) + lines.append(f" {cmd}") + elif self.container_runtime == "none": + lines.append(f"cd /flowsim") + lines.append(cmd) + else: + raise ValueError( + f"Unknown container_runtime: {self.container_runtime!r}. 
" + "Choose from: docker, enroot, none" + ) + + lines.append("") + return "\n".join(lines) + + def submit(self, spec: ProfileJobSpec) -> JobResult: + """Submit the job via ``sbatch``.""" + return self._submit_cli(spec) + + # ------------------------------------------------------------------ + # CLI helpers + # ------------------------------------------------------------------ + + def _cli_cmd(self, *args: str) -> list[str]: + """Build a command list, prepending ``cli_prefix`` if set.""" + prefix = shlex.split(self.cli_prefix) if self.cli_prefix else [] + return prefix + list(args) + + def _cli_run( + self, + *args: str, + input_data: str | None = None, + timeout: int = 60, + ) -> subprocess.CompletedProcess: + """Run a Slurm CLI command and return the CompletedProcess.""" + cmd = self._cli_cmd(*args) + return subprocess.run( + cmd, + capture_output=True, + text=True, + input=input_data, + timeout=timeout, + ) + + def _submit_cli(self, spec: ProfileJobSpec) -> JobResult: + """Submit via ``sbatch`` (piping the script on stdin).""" + script = self.render(spec) + job_name = spec.default_job_name() + + r = self._cli_run("sbatch", "--parsable", input_data=script, timeout=30) + if r.returncode != 0: + raise RuntimeError( + f"sbatch failed (exit {r.returncode}):\n{r.stderr}" + ) + + job_id = r.stdout.strip().split(";")[ + 0 + ] # parsable: "jobid" or "jobid;cluster" + return JobResult( + job_id=job_id, + scheduler="slurm", + state="Submitted", + output_dir=spec.output_dir, + message=f"Submitted batch job {job_id}", + ) + + def cancel(self, job_id: str) -> str: + """Cancel a Slurm job.""" + return self._cancel_cli(job_id) + + def status(self, job_id: str) -> dict: + """Query Slurm job status.""" + return self._status_cli(job_id) + + def logs( + self, job_id: str, *, tail: int = 100, follow: bool = False + ) -> str: + """Show Slurm job log information.""" + return self._logs_cli(job_id, tail=tail, follow=follow) + + def list_jobs(self, *, status_filter: str = "") -> 
list[dict]: + """List Slurm jobs.""" + return self._list_jobs_cli(status_filter=status_filter) + + # ------------------------------------------------------------------ + # CLI implementations + # ------------------------------------------------------------------ + + def _cancel_cli(self, job_id: str) -> str: + r = self._cli_run("scancel", job_id) + if r.returncode != 0: + raise RuntimeError(f"scancel failed: {r.stderr}") + return f"Cancelled Slurm job {job_id}" + + def _status_cli(self, job_id: str) -> dict: + # Use scontrol show job — works for both running and completed jobs + # (completed jobs stay in memory for MinJobAge seconds, default 300s) + r = self._cli_run("scontrol", "show", "job", job_id) + if r.returncode != 0 or not r.stdout.strip(): + return { + "state": "Unknown", + "message": f"No job found with ID {job_id}", + "output_hint": "", + } + + # Parse key=value output + fields: dict[str, str] = {} + for token in r.stdout.replace("\n", " ").split(): + if "=" in token: + k, _, v = token.partition("=") + fields[k] = v + + state = fields.get("JobState", "UNKNOWN") + name = fields.get("JobName", "") + nodes = fields.get("NodeList", "") + output_file = fields.get("StdOut", "") + + # Normalize Slurm uppercase states to capitalized format + _STATE_MAP = { + "PENDING": "Pending", + "RUNNING": "Running", + "SUSPENDED": "Suspended", + "COMPLETED": "Completed", + "CANCELLED": "Cancelled", + "FAILED": "Failed", + "TIMEOUT": "Timeout", + "NODE_FAIL": "Failed", + "PREEMPTED": "Preempted", + "OUT_OF_MEMORY": "Failed", + } + state = _STATE_MAP.get(state, state) + + msg_parts = [ + f"Job ID: {job_id} Name: {name} State: {state}", + f"Nodes: {nodes}" if nodes else "Nodes: (not yet assigned)", + ] + if output_file: + msg_parts.append(f"Output log: {output_file}") + + return { + "state": state, + "message": "\n".join(msg_parts), + "output_hint": output_file, + } + + def _logs_cli( + self, job_id: str, *, tail: int = 100, follow: bool = False + ) -> str: + info = 
self._status_cli(job_id) + output_file = info.get("output_hint", "") + + if not output_file: + return info["message"] + "\n(no log file path found)" + + # Try to read the log file via CLI prefix (handles remote Slurm) + if follow: + return ( + f"{info['message']}\n\n" + f"Follow logs:\n" + f" tail -f {output_file}" + ) + + r = self._cli_run("tail", f"-{tail}", output_file, timeout=15) + if r.returncode == 0 and r.stdout.strip(): + return r.stdout + + # Fallback: file may not exist yet or be on a remote node + return ( + f"{info['message']}\n\n" + f"Log file: {output_file}\n" + f"View on login node:\n" + f" tail -{tail} {output_file}\n" + f"Follow:\n" + f" tail -f {output_file}" + ) + + def _list_jobs_cli(self, *, status_filter: str = "") -> list[dict]: + r = self._cli_run( + "squeue", + "-o", + "%i|%j|%T|%P|%N", + "-h", + ) + if r.returncode != 0: + raise RuntimeError(f"squeue failed: {r.stderr}") + result: list[dict] = [] + for line in r.stdout.strip().splitlines(): + if not line.strip(): + continue + parts = line.split("|", 4) + name = parts[1] if len(parts) > 1 else "" + state = parts[2] if len(parts) > 2 else "UNKNOWN" + if status_filter and state.upper() != status_filter.upper(): + continue + result.append( + { + "job_id": parts[0] if parts else "", + "name": name, + "state": state, + "partition": parts[3] if len(parts) > 3 else "", + "nodes": parts[4] if len(parts) > 4 else "", + } + ) + return result diff --git a/schedulers/templates/k8s.yaml b/schedulers/templates/k8s.yaml new file mode 100644 index 0000000..8f548de --- /dev/null +++ b/schedulers/templates/k8s.yaml @@ -0,0 +1,27 @@ +# FlowSim Kubernetes scheduler config +# Copy to ~/.flowsim/k8s.yaml and edit: +# flowsim init k8s --config schedulers/templates/k8s.yaml + +# Path to kubeconfig file (required) +kubeconfig: ~/.kube/config + +# Kubeconfig context (empty = current-context) +context: "" + +# Kubernetes namespace (required) +namespace: default + +# Persistent storage for trace output (set one): +# 
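The pipe-delimited format requested from `squeue` keeps parsing trivial; a sketch of the per-row logic used by `_list_jobs_cli`:

```python
def parse_squeue_row(line: str) -> dict:
    # squeue -h -o "%i|%j|%T|%P|%N" emits one pipe-delimited row per job:
    # job id | name | state | partition | node list
    parts = line.split("|", 4)
    return {
        "job_id": parts[0] if parts else "",
        "name": parts[1] if len(parts) > 1 else "",
        "state": parts[2] if len(parts) > 2 else "UNKNOWN",
        "partition": parts[3] if len(parts) > 3 else "",
        "nodes": parts[4] if len(parts) > 4 else "",
    }

row = parse_squeue_row("12345|bs1_in2048_ctx0|RUNNING|gpu|node001")
assert row["job_id"] == "12345"
assert row["state"] == "RUNNING"
assert row["nodes"] == "node001"
```

Capping the split at 4 means a node list containing pipes (unlikely, but cheap to guard) lands whole in the final field.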
pvc: my-traces-pvc +# host_output_dir: /data/flowsim-traces +pvc: "" +host_output_dir: "" + +# Service account for the job pod (empty = default) +service_account: "" + +# Shared memory size (for /dev/shm in the pod) +shm_size: "16Gi" + +# RuntimeClass (e.g. "nvidia" for CDI GPU passthrough) +runtime_class_name: "" diff --git a/schedulers/templates/slurm.yaml b/schedulers/templates/slurm.yaml new file mode 100644 index 0000000..5f27328 --- /dev/null +++ b/schedulers/templates/slurm.yaml @@ -0,0 +1,27 @@ +# FlowSim Slurm scheduler config +# Copy to ~/.flowsim/slurm.yaml and edit: +# flowsim init slurm --config schedulers/templates/slurm.yaml + +# Slurm partition (required) +partition: gpu + +# Billing account (empty = default) +account: "" + +# Job time limit +time: "02:00:00" + +# Node constraint (e.g. "h100") +constraint: "" + +# CLI prefix for remote sbatch/squeue/scancel +# Examples: +# "docker exec -i slurmctld" (via Docker container) +# "ssh login-node" (via SSH) +cli_prefix: "" + +# Container runtime: docker | enroot | none +container_runtime: none + +# Container mount spec (for enroot/docker) +container_mounts: "" diff --git a/scripts/__init__.py b/scripts/__init__.py new file mode 100644 index 0000000..e785b75 --- /dev/null +++ b/scripts/__init__.py @@ -0,0 +1,36 @@ +"""Shared utilities for FlowSim CLI scripts.""" + + +def parse_sweep_point(s: str) -> tuple[int, int, int]: + """Parse a ``BS:INPUT_LEN:CTX`` string into an int 3-tuple. + + Raises :class:`ValueError` on bad input. + """ + parts = s.strip().split(":") + if len(parts) != 3: + raise ValueError( + f"Bad sweep point {s!r}: expected BS:INPUT_LEN:CTX " + f"(e.g. 1:2048:0)" + ) + try: + return int(parts[0]), int(parts[1]), int(parts[2]) + except ValueError: + raise ValueError( + f"Bad sweep point {s!r}: all three values must be integers" + ) + + +def load_sweep_file(path: str) -> list[tuple[int, int, int]]: + """Read sweep points from a file (one ``BS:INPUT_LEN:CTX`` per line). 
+ + Blank lines and ``#`` comments are skipped. + Raises :class:`ValueError` on bad entries. + """ + points: list[tuple[int, int, int]] = [] + with open(path) as f: + for line in f: + line = line.strip() + if not line or line.startswith("#"): + continue + points.append(parse_sweep_point(line)) + return points diff --git a/scripts/cli/__init__.py b/scripts/cli/__init__.py new file mode 100644 index 0000000..9d4755e --- /dev/null +++ b/scripts/cli/__init__.py @@ -0,0 +1,167 @@ +"""FlowSim CLI — unified entry point. + +Usage:: + + flowsim init k8s # create ~/.flowsim/k8s.yaml template + flowsim init slurm # create ~/.flowsim/slurm.yaml template + flowsim submit --scheduler k8s --collect perf --model-path ... + flowsim submit ... --dry-run # debug: preview manifest +""" + +from __future__ import annotations + +import argparse +import sys +from pathlib import Path + +_CONFIG_DIR = Path.home() / ".flowsim" +_TEMPLATES_DIR = ( + Path(__file__).resolve().parent.parent.parent / "schedulers" / "templates" +) + + +def _cmd_init(argv: list[str]) -> int: + """Install a scheduler config to ~/.flowsim/. + + Without --config: copies the bundled template from schedulers/templates/. + With --config: copies the specified file. 
+ """ + parser = argparse.ArgumentParser( + prog="flowsim init", + description=( + "Install scheduler config under ~/.flowsim/.\n\n" + "Examples:\n" + " flowsim init k8s # install bundled template\n" + " flowsim init k8s --config my.yaml # install your own file\n" + " flowsim init slurm --force # overwrite existing" + ), + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + parser.add_argument( + "scheduler", + choices=["k8s", "slurm"], + help="Scheduler type", + ) + parser.add_argument( + "--config", + "-c", + default="", + help="Path to a config YAML to install (default: bundled template)", + ) + parser.add_argument( + "--force", + action="store_true", + help="Overwrite existing config file", + ) + args = parser.parse_args(argv) + + dst = _CONFIG_DIR / f"{args.scheduler}.yaml" + + if dst.exists() and not args.force: + print( + f"Error: {dst} already exists (use --force to overwrite)", + file=sys.stderr, + ) + return 1 + + if args.config: + src = Path(args.config).expanduser() + else: + src = _TEMPLATES_DIR / f"{args.scheduler}.yaml" + + if not src.is_file(): + print(f"Error: config file not found: {src}", file=sys.stderr) + return 1 + + import shutil + + _CONFIG_DIR.mkdir(parents=True, exist_ok=True) + shutil.copy2(src, dst) + print(f"Installed {src} → {dst}") + print( + f"Edit {dst}, then run: flowsim submit --scheduler " + f"{args.scheduler} ..." 
+ ) + return 0 + + +def main(argv: list[str] | None = None) -> int: + parser = argparse.ArgumentParser( + prog="flowsim", + description="FlowSim: workload simulation pipeline CLI", + ) + sub = parser.add_subparsers(dest="command") + sub.required = True + + sub.add_parser( + "init", + help="Configure a scheduler (k8s/slurm) and save to ~/.flowsim/", + add_help=False, + ) + sub.add_parser( + "submit", + help="Submit a profiling job to K8s or Slurm", + add_help=False, + ) + sub.add_parser( + "status", + help="Query job status (local/k8s/slurm)", + add_help=False, + ) + sub.add_parser( + "logs", + help="Retrieve job logs (local/k8s/slurm)", + add_help=False, + ) + sub.add_parser( + "list", + help="List FlowSim jobs (local/k8s/slurm)", + add_help=False, + ) + sub.add_parser( + "cancel", + help="Cancel a running job (k8s/slurm)", + add_help=False, + ) + + args, remaining = parser.parse_known_args(argv) + + if args.command == "init": + return _cmd_init(remaining) + + if args.command == "submit": + from scripts.cli.submit import main as submit_main + + submit_main(remaining) + return 0 + + if args.command == "status": + from scripts.cli.manage import main_status + + main_status(remaining) + return 0 + + if args.command == "logs": + from scripts.cli.manage import main_logs + + main_logs(remaining) + return 0 + + if args.command == "list": + from scripts.cli.manage import main_list + + main_list(remaining) + return 0 + + if args.command == "cancel": + from scripts.cli.manage import main_cancel + + main_cancel(remaining) + return 0 + + parser.print_help() + return 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/cli/manage.py b/scripts/cli/manage.py new file mode 100644 index 0000000..bf389ab --- /dev/null +++ b/scripts/cli/manage.py @@ -0,0 +1,211 @@ +#!/usr/bin/env python3 +"""Query FlowSim profiling job status, logs, list, and cancel. 
+ +Usage examples +-------------- + +Check K8s job status:: + + flowsim status --scheduler k8s --job flowsim-perf-qwen3-8b-bs1-il2048 + +Get K8s job logs:: + + flowsim logs --scheduler k8s --job flowsim-perf-qwen3-8b-bs1-il2048 + +Follow K8s job logs:: + + flowsim logs --scheduler k8s --job flowsim-perf-qwen3-8b-bs1-il2048 --follow + +List all FlowSim jobs:: + + flowsim list --scheduler k8s + flowsim list --scheduler k8s --status Running + +Cancel a job:: + + flowsim cancel --scheduler k8s --job flowsim-perf-qwen3-8b-bs1-il2048 +""" + +from __future__ import annotations + +import argparse +import sys + +from schedulers.config import ( + cfg_get, + load_k8s_config, + load_slurm_config, + resolve_default, +) +from schedulers.k8s import K8sScheduler +from schedulers.local import LocalScheduler +from schedulers.slurm import SlurmScheduler + +_d = resolve_default + + +def _add_scheduler_args(p: argparse.ArgumentParser) -> None: + """Add common scheduler choice arg (first pass only).""" + p.add_argument( + "--scheduler", + choices=["local", "k8s", "slurm"], + required=True, + ) + + +def _add_scheduler_specific_args( + p: argparse.ArgumentParser, scheduler: str +) -> None: + """Add only the args relevant to the chosen scheduler (second pass).""" + k8s_cfg = load_k8s_config() + slurm_cfg = load_slurm_config() + + if scheduler == "local": + p.add_argument("--local-workdir", default="") + + elif scheduler == "k8s": + p.add_argument( + "--k8s-namespace", + default=_d( + "FLOWSIM_K8S_NAMESPACE", k8s_cfg, "namespace", "default" + ), + ) + p.add_argument( + "--k8s-kubeconfig", + default=_d("KUBECONFIG", k8s_cfg, "kubeconfig", ""), + ) + p.add_argument( + "--k8s-context", + default=_d("FLOWSIM_K8S_CONTEXT", k8s_cfg, "context", ""), + ) + p.add_argument( + "--k8s-pvc", + default=cfg_get(k8s_cfg, "pvc", ""), + ) + p.add_argument( + "--k8s-host-output-dir", + default=cfg_get(k8s_cfg, "host_output_dir", ""), + ) + + elif scheduler == "slurm": + p.add_argument( + "--slurm-cli-prefix", 
+ default=cfg_get(slurm_cfg, "cli_prefix", ""), + ) + + +def _build_scheduler(args: argparse.Namespace): + if args.scheduler == "local": + return LocalScheduler(workdir=getattr(args, "local_workdir", "")) + elif args.scheduler == "k8s": + return K8sScheduler( + namespace=args.k8s_namespace, + kubeconfig=args.k8s_kubeconfig, + context=args.k8s_context, + pvc_name=getattr(args, "k8s_pvc", "") or "", + host_output_dir=getattr(args, "k8s_host_output_dir", "") or "", + ) + else: + return SlurmScheduler( + cli_prefix=args.slurm_cli_prefix, + ) + + +def _parse_two_pass( + p: argparse.ArgumentParser, argv: list[str] | None = None +) -> argparse.Namespace: + """Two-pass parse: peek --scheduler, add scheduler-specific args, full parse.""" + _pre = argparse.ArgumentParser(add_help=False) + _pre.add_argument("--scheduler", choices=["local", "k8s", "slurm"]) + pre, _ = _pre.parse_known_args(argv) + _add_scheduler_specific_args(p, pre.scheduler) + return p.parse_args(argv) + + +def main_status(argv: list[str] | None = None) -> None: + p = argparse.ArgumentParser(description="Query FlowSim job status.") + _add_scheduler_args(p) + p.add_argument("--job", required=True, help="Job name or ID") + args = _parse_two_pass(p, argv) + + scheduler = _build_scheduler(args) + try: + info = scheduler.status(args.job) + print(f"State: {info['state']}") + print(info["message"]) + except Exception as exc: + print(f"Error: {exc}", file=sys.stderr) + sys.exit(1) + + +def main_logs(argv: list[str] | None = None) -> None: + p = argparse.ArgumentParser(description="Retrieve FlowSim job logs.") + _add_scheduler_args(p) + p.add_argument("--job", required=True, help="Job name or ID") + p.add_argument( + "--tail", + type=int, + default=100, + help="Number of log lines (default: 100)", + ) + p.add_argument( + "--follow", "-f", action="store_true", help="Follow log output" + ) + args = _parse_two_pass(p, argv) + + scheduler = _build_scheduler(args) + try: + text = scheduler.logs(args.job, tail=args.tail, 
follow=args.follow) + print(text) + except Exception as exc: + print(f"Error: {exc}", file=sys.stderr) + sys.exit(1) + + +def main_list(argv: list[str] | None = None) -> None: + p = argparse.ArgumentParser(description="List FlowSim jobs.") + _add_scheduler_args(p) + p.add_argument( + "--status", + default="", + help="Filter by job state (e.g. Running, Succeeded, PENDING)", + ) + args = _parse_two_pass(p, argv) + + scheduler = _build_scheduler(args) + try: + jobs = scheduler.list_jobs(status_filter=args.status) + if not jobs: + print("No jobs found.") + return + # Print table header + headers = list(jobs[0].keys()) + widths = { + h: max(len(h), max(len(str(j.get(h, ""))) for j in jobs)) + for h in headers + } + header_line = " ".join(h.upper().ljust(widths[h]) for h in headers) + print(header_line) + print("-" * len(header_line)) + for job in jobs: + print( + " ".join(str(job.get(h, "")).ljust(widths[h]) for h in headers) + ) + except Exception as exc: + print(f"Error: {exc}", file=sys.stderr) + sys.exit(1) + + +def main_cancel(argv: list[str] | None = None) -> None: + p = argparse.ArgumentParser(description="Cancel a FlowSim job.") + _add_scheduler_args(p) + p.add_argument("--job", required=True, help="Job name or ID to cancel") + args = _parse_two_pass(p, argv) + + scheduler = _build_scheduler(args) + try: + msg = scheduler.cancel(args.job) + print(msg) + except Exception as exc: + print(f"Error: {exc}", file=sys.stderr) + sys.exit(1) diff --git a/scripts/cli/submit.py b/scripts/cli/submit.py new file mode 100644 index 0000000..e59d697 --- /dev/null +++ b/scripts/cli/submit.py @@ -0,0 +1,490 @@ +#!/usr/bin/env python3 +"""Submit FlowSim profiling jobs locally, to Kubernetes, or to Slurm. 
+ +Usage examples +-------------- + +Run locally (no cluster needed): + + flowsim submit \\ + --scheduler local \\ + --collect perf \\ + --model-path Qwen/Qwen3-8B \\ + --tp 1 --local-gpus 0 + +Dry-run (print Kubernetes Job YAML to stdout): + + flowsim submit \\ + --scheduler k8s \\ + --collect perf \\ + --model-path Qwen/Qwen3-235B-A22B-FP8 \\ + --tp 4 --gpus 4 \\ + --bs 1 --input-len 2048 --decode-tokens 2 \\ + --image flowsim-image:latest \\ + --k8s-namespace default \\ + --k8s-pvc flowsim-traces \\ + --dry-run + +Dry-run (print Slurm sbatch script to stdout): + + flowsim submit \\ + --scheduler slurm \\ + --collect perf \\ + --model-path Qwen/Qwen3-235B-A22B-FP8 \\ + --tp 4 --gpus 4 \\ + --slurm-partition gpu-a100 \\ + --slurm-time 02:00:00 \\ + --dry-run + +Submit directly to cluster (omit --dry-run): + + flowsim submit \\ + --scheduler k8s \\ + ... +""" +from __future__ import annotations + +import argparse +import os +import sys + +from schedulers.base import ProfileJobSpec +from schedulers.config import ( + cfg_get, + load_k8s_config, + load_slurm_config, + resolve_default, +) +from schedulers.k8s import K8sScheduler +from schedulers.local import LocalScheduler +from schedulers.slurm import SlurmScheduler +from scripts import load_sweep_file, parse_sweep_point + +# Short alias for argparse default= expressions +_d = resolve_default + + +def _parse_args(argv: list[str] | None = None) -> argparse.Namespace: + # Load per-scheduler config files for defaults + k8s_cfg = load_k8s_config() + slurm_cfg = load_slurm_config() + + p = argparse.ArgumentParser( + description="Submit FlowSim profiling jobs to local, K8s, or Slurm.", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=__doc__, + ) + + # -- Scheduler choice -- + p.add_argument( + "--scheduler", + choices=["local", "k8s", "slurm"], + required=True, + help="Scheduler backend.", + ) + + # -- Profiling workload (mirrors run_stage_profile.py) -- + wl = p.add_argument_group("workload") + 
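Both `manage.py` and this `_parse_args` rely on the same trick: peek at `--scheduler` with `parse_known_args`, then register only that backend's flags before the full parse. A minimal standalone sketch of the pattern; the `build_parser` name and demo flags are illustrative, not this module's API:

```python
import argparse


def build_parser(argv: list[str]) -> argparse.ArgumentParser:
    """Two-pass argparse: peek at --scheduler, then add backend flags."""
    # Pass 1: a tolerant pre-parser; parse_known_args ignores anything
    # it does not recognize, so backend-specific flags don't error here.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--scheduler", choices=["local", "k8s", "slurm"])
    peek, _unknown = pre.parse_known_args(argv)

    # Pass 2: the real parser carries only the chosen backend's options,
    # keeping --help output and error messages scoped to one scheduler.
    p = argparse.ArgumentParser(prog="demo")
    p.add_argument(
        "--scheduler", choices=["local", "k8s", "slurm"], required=True
    )
    if peek.scheduler == "k8s":
        p.add_argument("--k8s-namespace", default="default")
    elif peek.scheduler == "slurm":
        p.add_argument("--slurm-partition", default="gpu")
    return p


argv = ["--scheduler", "k8s", "--k8s-namespace", "prod"]
args = build_parser(argv).parse_args(argv)
print(args.k8s_namespace)  # prod
```

The payoff is that a Slurm user never sees (or collides with) K8s-only flags, at the cost of parsing the command line twice.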
wl.add_argument( + "--collect", + choices=["perf", "shapes", "all"], + required=True, + ) + wl.add_argument("--model-path", required=True, help="HF model path") + wl.add_argument("--tp", type=int, default=1) + wl.add_argument("--dp", type=int, default=1) + wl.add_argument("--bs", type=int, default=1, help="Batch size") + wl.add_argument("--input-len", type=int, default=2048) + wl.add_argument("--existing-ctx", type=int, default=0) + wl.add_argument("--decode-tokens", type=int, default=2) + wl.add_argument("--warmup-n", type=int, default=5) + wl.add_argument( + "--disable-chunked-prefill", + action="store_true", + ) + wl.add_argument("--max-prefill-tokens", type=int, default=131072) + wl.add_argument( + "--extra-server-opts", + default="", + help="Extra server options appended verbatim", + ) + wl.add_argument( + "--sweep", + type=str, + nargs="+", + default=[], + metavar="BS:INPUT_LEN:CTX", + help=( + "Profile multiple (bs, input_len, existing_ctx) points in one job. " + "Each value is a colon-separated tuple, e.g. --sweep 1:2048:0 4:8192:0. " + "Overrides --bs, --input-len, --existing-ctx." + ), + ) + wl.add_argument( + "--sweep-file", + type=str, + default="", + metavar="FILE", + help=( + "Read sweep points from a file (one BS:INPUT_LEN:CTX per line, " + "# comments allowed). Overrides --bs, --input-len, --existing-ctx." 
+ ), + ) + + # -- Infrastructure -- + infra = p.add_argument_group("infrastructure") + infra.add_argument("--image", default="flowsim-image:latest") + infra.add_argument( + "--gpus", + type=int, + default=1, + help="Total GPU count", + ) + infra.add_argument("--host", default="0.0.0.0") + infra.add_argument("--port", type=int, default=30001) + infra.add_argument("--output-dir", default="") + infra.add_argument("--job-name", default="") + + # -- Action -- + p.add_argument( + "--dry-run", + action="store_true", + help="[debug] Print rendered manifest without submitting", + ) + + # ---- Two-pass: peek at --scheduler, then add only relevant args ---- + # Use a minimal pre-parser to avoid required-arg errors during peek. + _pre = argparse.ArgumentParser(add_help=False) + _pre.add_argument("--scheduler", choices=["local", "k8s", "slurm"]) + pre, _ = _pre.parse_known_args(argv) + + if pre.scheduler == "local": + loc = p.add_argument_group("local options") + loc.add_argument( + "--local-gpus", + default="", + help="CUDA_VISIBLE_DEVICES for local execution (e.g. 
'0' or '0,1')", + ) + loc.add_argument( + "--local-workdir", + default="", + help="Working directory for local execution (default: FlowSim project root)", + ) + + elif pre.scheduler == "k8s": + k8s = p.add_argument_group( + "kubernetes options (config: ~/.flowsim/k8s.yaml)" + ) + k8s.add_argument( + "--k8s-namespace", + default=_d( + "FLOWSIM_K8S_NAMESPACE", k8s_cfg, "namespace", "default" + ), + help="K8s namespace (env: FLOWSIM_K8S_NAMESPACE)", + ) + k8s.add_argument( + "--k8s-kubeconfig", + default=_d("KUBECONFIG", k8s_cfg, "kubeconfig", ""), + help="Path to kubeconfig file (env: KUBECONFIG)", + ) + k8s.add_argument( + "--k8s-context", + default=_d("FLOWSIM_K8S_CONTEXT", k8s_cfg, "context", ""), + help="kubeconfig context (env: FLOWSIM_K8S_CONTEXT)", + ) + k8s.add_argument( + "--k8s-pvc", + default=cfg_get(k8s_cfg, "pvc", ""), + help="PVC name for output volume (omit for emptyDir)", + ) + k8s.add_argument( + "--k8s-host-output-dir", + default=cfg_get(k8s_cfg, "host_output_dir", ""), + help="hostPath for output (used when --k8s-pvc is empty)", + ) + k8s.add_argument( + "--k8s-node-selector", + action="append", + default=[], + metavar="KEY=VALUE", + help="Node selector labels (repeatable)", + ) + k8s.add_argument( + "--k8s-service-account", + default=cfg_get(k8s_cfg, "service_account", ""), + ) + k8s.add_argument( + "--k8s-shm-size", + default=cfg_get(k8s_cfg, "shm_size", "16Gi"), + ) + k8s.add_argument( + "--k8s-runtime-class", + default=cfg_get(k8s_cfg, "runtime_class_name", ""), + help="RuntimeClass for pod (e.g. 
'nvidia' for CDI mode)", + ) + + elif pre.scheduler == "slurm": + slurm = p.add_argument_group( + "slurm options (config: ~/.flowsim/slurm.yaml)" + ) + slurm.add_argument( + "--slurm-partition", + default=_d("FLOWSIM_SLURM_PARTITION", slurm_cfg, "partition", ""), + help="Slurm partition (env: FLOWSIM_SLURM_PARTITION)", + ) + slurm.add_argument( + "--slurm-time", + default=_d("FLOWSIM_SLURM_TIME", slurm_cfg, "time", "02:00:00"), + help="Wall time limit (env: FLOWSIM_SLURM_TIME)", + ) + slurm.add_argument( + "--slurm-account", + default=cfg_get(slurm_cfg, "account", ""), + ) + slurm.add_argument( + "--slurm-constraint", + default=cfg_get(slurm_cfg, "constraint", ""), + ) + slurm.add_argument( + "--slurm-container-runtime", + choices=["docker", "enroot", "none"], + default=cfg_get(slurm_cfg, "container_runtime", "none"), + ) + slurm.add_argument( + "--slurm-container-mounts", + default=cfg_get(slurm_cfg, "container_mounts", ""), + ) + # Modules from config (list) + CLI (append) + cfg_modules = ( + slurm_cfg.get("modules") + if isinstance(slurm_cfg.get("modules"), list) + else [] + ) + slurm.add_argument( + "--slurm-module", + action="append", + default=[str(m) for m in cfg_modules], + help="Modules to load (repeatable, merged with config)", + ) + slurm.add_argument( + "--slurm-extra-sbatch", + action="append", + default=[], + metavar="DIRECTIVE", + help="Extra #SBATCH directives (repeatable, without prefix)", + ) + slurm.add_argument( + "--slurm-cli-prefix", + default=cfg_get(slurm_cfg, "cli_prefix", ""), + help='Shell prefix for CLI mode (e.g. 
"docker exec -i slurmctld")', + ) + + return p.parse_args(argv) + + +def _parse_sweep_points(args) -> list[tuple[int, int, int]]: + """Resolve sweep points from --sweep / --sweep-file args.""" + if args.sweep and args.sweep_file: + sys.exit("Error: --sweep and --sweep-file are mutually exclusive") + try: + if args.sweep: + return [parse_sweep_point(s) for s in args.sweep] + if args.sweep_file: + return load_sweep_file(args.sweep_file) + except ValueError as e: + sys.exit(str(e)) + return [] + + +def _build_spec(args: argparse.Namespace) -> ProfileJobSpec: + sweep_points = _parse_sweep_points(args) + return ProfileJobSpec( + collect=args.collect, + model_path=args.model_path, + tp=args.tp, + dp=args.dp, + bs=args.bs, + input_len=args.input_len, + existing_ctx=args.existing_ctx, + decode_tokens=args.decode_tokens, + warmup_n=args.warmup_n, + disable_chunked_prefill=args.disable_chunked_prefill, + max_prefill_tokens=args.max_prefill_tokens, + image=args.image, + gpus=args.gpus, + host=args.host, + port=args.port, + output_dir=args.output_dir, + job_name=args.job_name, + extra_server_opts=args.extra_server_opts, + sweep_points=sweep_points, + ) + + +def _build_scheduler(args: argparse.Namespace): + if args.scheduler == "local": + return LocalScheduler( + gpus=args.local_gpus, + workdir=args.local_workdir, + ) + elif args.scheduler == "k8s": + node_sel = {} + for item in args.k8s_node_selector: + k, _, v = item.partition("=") + if not v: + sys.exit( + f"Bad --k8s-node-selector format: {item!r} (use KEY=VALUE)" + ) + node_sel[k] = v + return K8sScheduler( + namespace=args.k8s_namespace, + kubeconfig=args.k8s_kubeconfig, + context=args.k8s_context, + pvc_name=args.k8s_pvc, + host_output_dir=args.k8s_host_output_dir, + node_selector=node_sel, + service_account=args.k8s_service_account, + shm_size=args.k8s_shm_size, + runtime_class_name=args.k8s_runtime_class, + ) + else: + return SlurmScheduler( + partition=args.slurm_partition, + time_limit=args.slurm_time, + 
account=args.slurm_account,
+            constraint=args.slurm_constraint,
+            container_runtime=args.slurm_container_runtime,
+            container_mounts=args.slurm_container_mounts,
+            modules=args.slurm_module,
+            extra_sbatch=args.slurm_extra_sbatch,
+            cli_prefix=args.slurm_cli_prefix,
+        )
+
+
+def main(argv: list[str] | None = None) -> None:
+    args = _parse_args(argv)
+
+    # Smart default for output_dir, shared by all schedulers.
+    # Layout: stage_traces/{scheduler}/{timestamp}/
+    import time as _time
+
+    _ts = _time.strftime("%Y%m%d_%H%M%S")
+    if not args.output_dir:
+        args.output_dir = f"/flowsim/stage_traces/{args.scheduler}/{_ts}"
+
+    # Validate required connection params before submit
+    if not args.dry_run and args.scheduler != "local":
+        _validate_connection(args)
+
+    # For local scheduler, convert absolute host model_path to relative
+    # so it resolves correctly inside the container (workdir=/flowsim).
+    if args.scheduler == "local" and os.path.isabs(args.model_path):
+        # submit.py lives in scripts/cli/, so the project root is three
+        # directory levels up from this file, not two.
+        project_root = os.path.dirname(
+            os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+        )
+        if args.model_path.startswith(project_root):
+            args.model_path = os.path.relpath(args.model_path, project_root)
+
+    spec = _build_spec(args)
+    scheduler = _build_scheduler(args)
+
+    if args.dry_run:
+        print(scheduler.dry_run(spec))
+    else:
+        result = scheduler.submit(spec)
+        print(result.message)
+
+        # Tell user where to find results
+        print()
+        print(f"Traces: {result.output_dir}")
+        print(f"Logs: {result.output_dir}/logs/")
+        job_id = result.job_id
+        sched = args.scheduler
+
+        if sched == "k8s":
+            if args.k8s_pvc:
+                print(f"  (persisted on PVC '{args.k8s_pvc}')")
+            else:
+                print(
+                    f"  (persisted at hostPath '{args.k8s_host_output_dir}' on the node)"
+                )
+            print(
+                f"\nTo check status: flowsim status --scheduler k8s --job {job_id}"
+            )
+            print(
+                f"To view logs: flowsim logs --scheduler k8s --job {job_id}"
+            )
+            print(
+                f"To follow logs: flowsim logs --scheduler k8s --job {job_id} --follow"
+            )
+            print(
+                f"To cancel: flowsim cancel --scheduler k8s --job {job_id}"
+            )
+        elif sched == "slurm":
+            print("  (on cluster shared filesystem)")
+            print(
+                f"\nTo check status: flowsim status --scheduler slurm --job {job_id}"
+            )
+            print(
+                f"To view logs: flowsim logs --scheduler slurm --job {job_id}"
+            )
+            print(
+                f"To cancel: flowsim cancel --scheduler slurm --job {job_id}"
+            )
+        else:
+            print(
+                f"\nTo view logs: flowsim logs --scheduler local --job {job_id}"
+            )
+        print(f"To list all jobs: flowsim list --scheduler {sched}")
+
+
+_INIT_HINT = "Run 'flowsim init' to create config files."
+
+
+def _validate_connection(args: argparse.Namespace) -> None:
+    """Fail fast if required cluster connection params are missing."""
+    if args.scheduler == "k8s":
+        if not args.k8s_namespace:
+            sys.exit(
+                "Error: K8s namespace not set.\n"
+                "Set it in ~/.flowsim/k8s.yaml, FLOWSIM_K8S_NAMESPACE env var,\n"
+                f"or --k8s-namespace flag. 
{_INIT_HINT}" + ) + # Traces + logs must survive pod termination + if not args.k8s_pvc and not args.k8s_host_output_dir: + sys.exit( + "Error: no persistent storage configured for K8s job output.\n" + "Traces and logs are written to output_dir inside the pod —\n" + "without a volume mount they are lost when the pod exits.\n\n" + "Set one of:\n" + " --k8s-pvc (PersistentVolumeClaim)\n" + " --k8s-host-output-dir (hostPath on the node)\n\n" + "Or configure in ~/.flowsim/k8s.yaml:\n" + " pvc: my-traces-pvc\n" + " # or\n" + " host_output_dir: /data/flowsim-traces" + ) + # kubeconfig is optional (in-cluster auto-discovery), but warn + if not args.k8s_kubeconfig and not args.k8s_context: + print( + "Note: no kubeconfig or context specified. " + "Will try ~/.kube/config and in-cluster auto-discovery.", + file=sys.stderr, + ) + elif args.scheduler == "slurm": + if not args.slurm_partition: + sys.exit( + "Error: missing required Slurm config:\n" + " - partition (--slurm-partition)\n\n" + f"Set it in ~/.flowsim/slurm.yaml or via CLI flag.\n" + + _INIT_HINT + ) + + +if __name__ == "__main__": + main() diff --git a/scripts/run_stage_profile.py b/scripts/run_stage_profile.py index 8346e3b..36505ec 100644 --- a/scripts/run_stage_profile.py +++ b/scripts/run_stage_profile.py @@ -61,14 +61,14 @@ python scripts/run_stage_profile.py \\ --collect perf \\ --host 0.0.0.0 --port 30001 \\ - --bs 1 --input-len 2048 --decode-tokens 32 \\ + --bs 1 --input-len 2048 --decode-tokens 2 \\ --output-dir /flowsim/stage_traces Example — with existing KV cache context python scripts/run_stage_profile.py \\ --collect perf \\ --host 0.0.0.0 --port 30001 \\ - --bs 4 --input-len 512 --existing-ctx 4096 --decode-tokens 32 \\ + --bs 4 --input-len 512 --existing-ctx 4096 --decode-tokens 2 \\ --output-dir /flowsim/stage_traces Example — launch server + full pipeline (perf → shapes) @@ -107,12 +107,13 @@ ) from utils.net import wait_for_port from utils.shape_merge import merge_shapes_dir +from scripts import 
load_sweep_file, parse_sweep_point # --------------------------------------------------------------------------- # Defaults # --------------------------------------------------------------------------- DEFAULT_WARMUP_N = 5 -DEFAULT_DECODE_TOKENS = 32 +DEFAULT_DECODE_TOKENS = 2 DEFAULT_MAX_PREFILL_TOKENS = 131072 @@ -700,6 +701,31 @@ def parse_args(argv: Optional[list] = None) -> argparse.Namespace: default="/flowsim/stage_traces", help="Root directory for trace output", ) + + sweep = p.add_argument_group("sweep (multi-point profiling)") + sweep.add_argument( + "--sweep", + type=str, + nargs="+", + default=[], + metavar="BS:INPUT_LEN:CTX", + help=( + "Profile multiple (bs, input_len, existing_ctx) points in one job. " + "Each value is a colon-separated tuple, e.g. --sweep 1:2048:0 4:8192:0 16:2048:4096. " + "Overrides --bs, --input-len, --existing-ctx." + ), + ) + sweep.add_argument( + "--sweep-file", + type=str, + default="", + metavar="FILE", + help=( + "Read sweep points from a file (one BS:INPUT_LEN:CTX per line, " + "# comments allowed). Overrides --bs, --input-len, --existing-ctx." 
+ ), + ) + srv = p.add_argument_group("server launch (optional)") srv.add_argument( "--launch-server", @@ -714,13 +740,27 @@ def parse_args(argv: Optional[list] = None) -> argparse.Namespace: ) srv.add_argument( "--log-dir", - default="/flowsim/tests/test-artifacts", - help="Directory for server logs", + default="", + help="Directory for server logs (default: {output-dir}/logs/)", ) return p.parse_args(argv) +def _load_sweep_points(args) -> list[tuple[int, int, int]]: + """Resolve sweep points from --sweep, --sweep-file, or single-point args.""" + if args.sweep and args.sweep_file: + print("[ERROR] --sweep and --sweep-file are mutually exclusive") + raise SystemExit(1) + + if args.sweep: + return [parse_sweep_point(s) for s in args.sweep] + if args.sweep_file: + return load_sweep_file(args.sweep_file) + # Single-point from --bs / --input-len / --existing-ctx + return [(args.bs, args.input_len, args.existing_ctx)] + + # --------------------------------------------------------------------------- # Phase runners # --------------------------------------------------------------------------- @@ -759,11 +799,20 @@ def _start_server( return proc -def _run_perf(args, summary: list[dict]) -> int: +def _run_perf( + args, + summary: list[dict], + *, + bs: Optional[int] = None, + input_len: Optional[int] = None, + existing_ctx: Optional[int] = None, +) -> int: """Collect traces for a single (bs, input_len, existing_ctx, decode_tokens) point.""" - bs = args.bs - input_len = args.input_len - existing_ctx = args.existing_ctx + bs = bs if bs is not None else args.bs + input_len = input_len if input_len is not None else args.input_len + existing_ctx = ( + existing_ctx if existing_ctx is not None else args.existing_ctx + ) tag = f"bs{bs}_input{input_len}_ctx{existing_ctx}" sub_dir = os.path.join(args.output_dir, tag) @@ -873,6 +922,10 @@ def _write_summary(args, summary: list[dict]) -> None: def main(argv: Optional[list] = None) -> int: args = parse_args(argv) + # Default log_dir to 
{output_dir}/logs/ if not specified + if not args.log_dir: + args.log_dir = os.path.join(args.output_dir, "logs") + if args.decode_tokens < 2: print( "[ERROR] --decode-tokens must be >= 2. " @@ -883,6 +936,14 @@ def main(argv: Optional[list] = None) -> int: server_proc = None summary: list[dict] = [] + sweep_points = _load_sweep_points(args) + is_sweep = len(sweep_points) > 1 + + if is_sweep: + print(f"\n[sweep] {len(sweep_points)} points to profile:") + for i, (bs, il, ctx) in enumerate(sweep_points): + print(f" [{i+1}] bs={bs} input_len={il} existing_ctx={ctx}") + print() try: # ================================================================== @@ -904,7 +965,10 @@ def main(argv: Optional[list] = None) -> int: print(" PHASE 1 / 2 : PERF COLLECTION") print("=" * 60 + "\n") server_proc = _start_server(args, disable_cuda_graph=False) - _run_perf(args, summary) + for idx, (bs, il, ctx) in enumerate(sweep_points): + if is_sweep: + print(f"\n[sweep] Point {idx+1}/{len(sweep_points)}") + _run_perf(args, summary, bs=bs, input_len=il, existing_ctx=ctx) _write_summary(args, summary) print("\n[server] Shutting down for shape pass …") kill_server(server_proc) @@ -925,7 +989,10 @@ def main(argv: Optional[list] = None) -> int: if args.collect == "perf": if args.launch_server: server_proc = _start_server(args, disable_cuda_graph=False) - _run_perf(args, summary) + for idx, (bs, il, ctx) in enumerate(sweep_points): + if is_sweep: + print(f"\n[sweep] Point {idx+1}/{len(sweep_points)}") + _run_perf(args, summary, bs=bs, input_len=il, existing_ctx=ctx) _write_summary(args, summary) return 0 diff --git a/simulator/base_parser.py b/simulator/base_parser.py index ca9cadb..2b77967 100644 --- a/simulator/base_parser.py +++ b/simulator/base_parser.py @@ -319,12 +319,12 @@ def _parse_events(self) -> list[tuple]: else: # Case 2: If no ext_id, we need to find the shape from user annotations # Key Identification Methodology: Annotation is overlapped with kernel + dims_anno = "N/A" + 
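This `base_parser.py` hunk hoists the `dims_anno` / `input_type_anno` / `desc_anno` initialization out of the annotation loop: initialized per iteration, a match found while scanning one annotation was reset before later iterations finished. A toy sketch of the corrected loop-carried state, using invented event dicts rather than the parser's real trace schema:

```python
# Loop-carried "best match" fields are initialized once, before the
# scan, so a hit in one iteration is not wiped out by the next.
kernel = {"ts": 100, "dur": 50}  # kernel occupies [100, 150)
annotation_events = [
    {"name": "ProfilerStep#1", "ts": 90, "dur": 200},
    {"name": "gemm", "ts": 110, "dur": 20, "dims": "[4096,4096]"},
    {"name": "unrelated", "ts": 400, "dur": 10},
]

dims_anno = "N/A"  # hoisted out of the loop, as in the hunk
for anno in annotation_events:
    if "ProfilerStep" in anno.get("name", ""):
        continue  # skip framework step markers, as the parser does
    start = anno.get("ts", 0)
    end = start + anno.get("dur", 0)
    # Keep the annotation whose time window overlaps the kernel's
    if start < kernel["ts"] + kernel["dur"] and end > kernel["ts"]:
        dims_anno = anno.get("dims", "N/A")

print(dims_anno)  # [4096,4096]
```

With the initialization inside the loop, the non-overlapping `"unrelated"` iteration would have started from `"N/A"` again, and any match consumed on a later pass would be lost.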
input_type_anno = "N/A" + desc_anno = "" for anno_idx, anno in enumerate(annotation_events): if anno_idx in used_annotations: continue - dims_anno = "N/A" - input_type_anno = "N/A" - desc_anno = "" if "ProfilerStep" in anno.get("name", ""): continue anno_start = anno.get("ts", 0) diff --git a/tests/integration/infra/cgroup.conf b/tests/integration/infra/cgroup.conf new file mode 100644 index 0000000..68de2cc --- /dev/null +++ b/tests/integration/infra/cgroup.conf @@ -0,0 +1,3 @@ +# cgroup.conf — use cgroup v1 (only v1 plugin available; v2 host is compatible +# via the unified/hybrid hierarchy mount) +CgroupPlugin=cgroup/v1 diff --git a/tests/integration/infra/dev-setup.sh b/tests/integration/infra/dev-setup.sh new file mode 100755 index 0000000..02e447f --- /dev/null +++ b/tests/integration/infra/dev-setup.sh @@ -0,0 +1,363 @@ +#!/usr/bin/env bash +# dev-setup.sh — one-shot setup for FlowSim test clusters (kind + Slurm) +# +# Usage: +# ./tests/integration/infra/dev-setup.sh # setup both kind + slurm +# ./tests/integration/infra/dev-setup.sh kind # kind only +# ./tests/integration/infra/dev-setup.sh slurm # slurm only +# +# Teardown: +# ./tests/integration/infra/dev-teardown.sh + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +KIND_VERSION="v0.27.0" +KIND_CLUSTER_NAME="flowsim" +KIND_WORKERS=("${KIND_CLUSTER_NAME}-worker") +KUBECTL_STABLE_URL="https://dl.k8s.io/release/stable.txt" +HELM_INSTALL_URL="https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3" +NVIDIA_CTK_KEYRING="/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg" + +log() { printf "\033[1;32m[setup]\033[0m %s\n" "$*"; } +warn() { printf "\033[1;33m[setup]\033[0m %s\n" "$*"; } +err() { printf "\033[1;31m[setup]\033[0m %s\n" "$*" >&2; exit 1; } + +# ---------------------------------------------------------------- +# Dependency checks & auto-install +# ---------------------------------------------------------------- +ensure_docker() { + command -v docker >/dev/null || 
err "Docker is required but not installed."
+    docker info >/dev/null 2>&1 || err "Docker daemon not running."
+    log "Docker: $(docker --version)"
+}
+
+ensure_kind() {
+    if command -v kind >/dev/null; then
+        log "kind already installed: $(kind version)"
+        return
+    fi
+    log "Installing kind ${KIND_VERSION}..."
+    curl -fsSLo /tmp/kind "https://kind.sigs.k8s.io/dl/${KIND_VERSION}/kind-linux-amd64"
+    chmod +x /tmp/kind
+    sudo mv /tmp/kind /usr/local/bin/kind
+    log "kind installed: $(kind version)"
+}
+
+ensure_kubectl() {
+    if command -v kubectl >/dev/null; then
+        log "kubectl already installed"
+        return
+    fi
+    log "Installing kubectl..."
+    local ver
+    ver="$(curl -fsSL "${KUBECTL_STABLE_URL}")"
+    curl -fsSLo /tmp/kubectl "https://dl.k8s.io/release/${ver}/bin/linux/amd64/kubectl"
+    chmod +x /tmp/kubectl
+    sudo mv /tmp/kubectl /usr/local/bin/kubectl
+    # NB: `--short` was removed in kubectl 1.28, and this script installs
+    # the latest stable release, so plain `version --client` is used.
+    log "kubectl installed: $(kubectl version --client 2>/dev/null | head -1 || true)"
+}
+
+# ----------------------------------------------------------------
+# Kind cluster with NVIDIA GPU via CDI
+# (Official approach from NVIDIA k8s-device-plugin demo)
+# https://github.com/NVIDIA/k8s-device-plugin/tree/main/demo/clusters/kind
+# ----------------------------------------------------------------
+ensure_nvidia_runtime() {
+    # Docker must use nvidia as default runtime so Kind node containers get GPU access
+    command -v nvidia-ctk >/dev/null || err "nvidia-container-toolkit is required (nvidia-ctk not found)."
+    command -v nvidia-smi >/dev/null || err "NVIDIA driver not found (nvidia-smi missing)."
+    log "nvidia-ctk: $(nvidia-ctk --version 2>&1 | head -1)"
+
+    if ! docker info 2>/dev/null | grep -q "Default Runtime: nvidia"; then
+        log "Setting nvidia as default Docker runtime..."
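The default-runtime check above shells out to `docker info` and greps. The same detection can be done in Python if the FlowSim CLI ever needs it; this is a sketch against a canned sample of `docker info` text, not an API this patch adds:

```python
def default_runtime(docker_info: str) -> str:
    """Extract 'Default Runtime: X' from `docker info` text output.

    Mirrors the grep in dev-setup.sh; returns "" when the field is
    absent. The field is indented under the Server section, so each
    line is stripped before matching.
    """
    for line in docker_info.splitlines():
        line = line.strip()
        if line.startswith("Default Runtime:"):
            return line.split(":", 1)[1].strip()
    return ""


# Representative sample, not captured from a real daemon.
sample = """\
Server:
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
"""
print(default_runtime(sample))  # nvidia
```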
+ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default + sudo systemctl restart docker + log "Docker restarted with nvidia runtime as default" + else + log "Docker already using nvidia as default runtime" + fi + + # Required: accept-nvidia-visible-devices-as-volume-mounts must be true + # for Kind GPU passthrough via /var/run/nvidia-container-devices/all + local cfg="/etc/nvidia-container-runtime/config.toml" + if grep -qE '^\s*accept-nvidia-visible-devices-as-volume-mounts\s*=\s*true' "$cfg" 2>/dev/null; then + log "accept-nvidia-visible-devices-as-volume-mounts already enabled" + else + log "Enabling accept-nvidia-visible-devices-as-volume-mounts in $cfg..." + if grep -qE '#?\s*accept-nvidia-visible-devices-as-volume-mounts' "$cfg" 2>/dev/null; then + sudo sed -i 's/#*\s*accept-nvidia-visible-devices-as-volume-mounts.*/accept-nvidia-visible-devices-as-volume-mounts = true/' "$cfg" + else + echo 'accept-nvidia-visible-devices-as-volume-mounts = true' | sudo tee -a "$cfg" >/dev/null + fi + sudo systemctl restart docker + log "Host nvidia-container-runtime config updated and Docker restarted" + fi +} + +ensure_helm() { + if command -v helm >/dev/null; then + log "helm already installed: $(helm version --short 2>/dev/null)" + return + fi + log "Installing helm..." + curl -fsSL "${HELM_INSTALL_URL}" | bash + log "helm installed: $(helm version --short)" +} + +setup_kind() { + ensure_docker + ensure_nvidia_runtime + ensure_kind + ensure_kubectl + ensure_helm + + if kind get clusters 2>/dev/null | grep -q "^${KIND_CLUSTER_NAME}$"; then + warn "kind cluster '${KIND_CLUSTER_NAME}' already exists, skipping creation" + else + log "Creating kind cluster '${KIND_CLUSTER_NAME}' (1 control-plane + 1 GPU worker)..." 
+ kind create cluster --name "${KIND_CLUSTER_NAME}" \ + --config "${SCRIPT_DIR}/kind-multi-node.yaml" + fi + + # ── Post-creation: configure GPU support inside each worker node ── + for worker in "${KIND_WORKERS[@]}"; do + log "=== Configuring ${worker} ===" + + # Step 1: Unmount masked /proc/driver/nvidia + log "Unmounting /proc/driver/nvidia in ${worker}..." + docker exec "${worker}" umount -R /proc/driver/nvidia 2>/dev/null || true + + # Step 2: Install nvidia-container-toolkit inside the worker node + log "Installing nvidia-container-toolkit inside ${worker}..." + docker exec "${worker}" bash -c "apt-get update && apt-get install -y gpg" + docker exec "${worker}" bash -c "\ + curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \ + | gpg --dearmor -o ${NVIDIA_CTK_KEYRING} \ + && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \ + | sed 's#deb https://#deb [signed-by=${NVIDIA_CTK_KEYRING}] https://#g' \ + | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \ + && apt-get update \ + && apt-get install -y nvidia-container-toolkit" + + # Step 3: Configure CDI mode in containerd inside worker + log "Configuring CDI mode for containerd in ${worker}..." + docker exec "${worker}" bash -c "\ + nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ \ + && nvidia-ctk runtime configure --runtime=containerd --cdi.enabled --config-source=command \ + && systemctl restart containerd" + + # Step 4: Label worker node for GPU presence + kubectl --context "kind-${KIND_CLUSTER_NAME}" label node "${worker}" \ + --overwrite nvidia.com/gpu.present=true + done + + # Step 5: Create nvidia RuntimeClass + log "Creating nvidia RuntimeClass..." 
+  kubectl --context "kind-${KIND_CLUSTER_NAME}" apply -f - <<'RTEOF'
+apiVersion: node.k8s.io/v1
+kind: RuntimeClass
+metadata:
+  name: nvidia
+handler: nvidia
+RTEOF
+
+  # Step 6: Deploy per-node NVIDIA device plugin DaemonSets
+  # Each worker gets its own DaemonSet with a specific NVIDIA_VISIBLE_DEVICES
+  # so the device plugin only discovers/advertises that worker's assigned GPU.
+  # (Helm's single DaemonSet can't set different env per node.)
+  log "Deploying NVIDIA device plugin (per-node GPU assignment)..."
+  local CTX="kind-${KIND_CLUSTER_NAME}"
+  local PLUGIN_IMAGE="nvcr.io/nvidia/k8s-device-plugin:v0.17.1"
+  local gpu_idx=0
+  for worker in "${KIND_WORKERS[@]}"; do
+    local ds_name="nvidia-device-plugin-${worker##*-}"  # e.g. nvidia-device-plugin-worker
+    kubectl --context "$CTX" apply -f - <<DPEOF
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: nvidia-device-plugin
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: ${ds_name}
+  namespace: nvidia-device-plugin
+spec:
+  selector:
+    matchLabels:
+      name: ${ds_name}
+  template:
+    metadata:
+      labels:
+        name: ${ds_name}
+    spec:
+      nodeSelector:
+        kubernetes.io/hostname: ${worker}
+      containers:
+        - name: nvidia-device-plugin-ctr
+          image: ${PLUGIN_IMAGE}
+          env:
+            - name: NVIDIA_VISIBLE_DEVICES
+              value: "${gpu_idx}"
+          securityContext:
+            privileged: true
+          volumeMounts:
+            - name: device-plugin
+              mountPath: /var/lib/kubelet/device-plugins
+      volumes:
+        - name: device-plugin
+          hostPath:
+            path: /var/lib/kubelet/device-plugins
+DPEOF
+    gpu_idx=$((gpu_idx + 1))
+  done
+
+  # Step 8: Preload flowsim image into each worker's containerd
+  # (kind nodes cannot pull images from the host Docker daemon)
+  local FLOWSIM_IMAGE="flowsim-image:latest"
+  if docker image inspect "${FLOWSIM_IMAGE}" >/dev/null 2>&1; then
+    for worker in "${KIND_WORKERS[@]}"; do
+      if docker exec "${worker}" crictl images 2>/dev/null | grep -q "flowsim-image.*latest"; then
+        log "${FLOWSIM_IMAGE} already loaded in ${worker}, skipping"
+      else
+        log "Loading ${FLOWSIM_IMAGE} into ${worker} (~34GB, may take several minutes)..."
+        if command -v pv >/dev/null; then
+          docker save "${FLOWSIM_IMAGE}" | pv -f -a -b | \
+            docker exec -i "${worker}" ctr -n k8s.io images import -
+        else
+          docker save "${FLOWSIM_IMAGE}" | \
+            docker exec -i "${worker}" ctr -n k8s.io images import -
+        fi
+        log "${FLOWSIM_IMAGE} loaded into ${worker}"
+      fi
+    done
+  else
+    warn "${FLOWSIM_IMAGE} not found on host, skipping image load (build it first)"
+  fi
+
+  # Step 9: Wait for GPU resources
+  log "Waiting for nvidia.com/gpu resources to appear (up to 180s)..."
+  local gpu_retries=36
+  while true; do
+    gpu_count=$(kubectl --context "kind-${KIND_CLUSTER_NAME}" get nodes \
+      -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' 2>/dev/null \
+      | grep -cE '^[1-9]' || true)
+    if [ "${gpu_count}" -ge 1 ]; then
+      log "GPUs registered on ${gpu_count} node(s)"
+      break
+    fi
+    gpu_retries=$((gpu_retries - 1))
+    if [ "${gpu_retries}" -le 0 ]; then
+      warn "GPUs not registered after 180s — debugging info:"
+      kubectl --context "kind-${KIND_CLUSTER_NAME}" get pods -n nvidia-device-plugin -o wide 2>/dev/null || true
+      kubectl --context "kind-${KIND_CLUSTER_NAME}" describe nodes 2>/dev/null | grep -A5 "Allocatable" || true
+      break
+    fi
+    sleep 5
+  done
+
+  # Step 10: Init FlowSim K8s config
+  log "Initializing FlowSim K8s config..."
+  flowsim init k8s \
+    --kubeconfig "${HOME}/.kube/config" \
+    --context "kind-${KIND_CLUSTER_NAME}" \
+    --namespace default \
+    --host-output-dir /tmp/flowsim-traces \
+    --runtime-class-name nvidia \
+    --force
+
+  log "Cluster nodes:"
+  kubectl --context "kind-${KIND_CLUSTER_NAME}" get nodes -o wide
+  echo
+
+  log "GPU resources:"
+  kubectl --context "kind-${KIND_CLUSTER_NAME}" get nodes \
+    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' 2>/dev/null || true
+  echo
+
+  log "Kind cluster with GPU (CDI mode) ready."
+}
+
+# ----------------------------------------------------------------
+# Slurm cluster (docker compose)
+# ----------------------------------------------------------------
+setup_slurm() {
+  ensure_docker
+
+  if ! docker compose version >/dev/null 2>&1; then
+    err "docker compose v2 is required but not available."
+  fi
+
+  # HOST_WORKSPACE is used by slurm-compose.yaml for the read-only /workspace mount.
+  REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
+  export HOST_WORKSPACE="${HOST_WORKSPACE:-$(dirname "${REPO_ROOT}")}"
+
+  log "Building and starting Slurm cluster (slurmctld + 1 slurmd)..."
+  log "  HOST_WORKSPACE=${HOST_WORKSPACE}"
+  docker compose -f "${SCRIPT_DIR}/slurm-compose.yaml" up -d --build
+
+  log "Waiting for slurmctld to become ready..."
+  local retries=30
+  while ! docker exec slurmctld sinfo >/dev/null 2>&1; do
+    retries=$((retries - 1))
+    if [ "${retries}" -le 0 ]; then
+      err "slurmctld did not become ready in time"
+    fi
+    sleep 2
+  done
+
+  log "Slurm cluster status:"
+  docker exec slurmctld sinfo
+  echo
+
+  log "Initializing FlowSim Slurm config..."
+  flowsim init slurm \
+    --rest-url "http://localhost:6820" \
+    --partition normal \
+    --account default \
+    --jwt-token-cmd "docker exec slurmctld scontrol token lifespan=3600" \
+    --force
+  echo
+  log "Slurm cluster ready. Test with:"
+  log "  flowsim submit --scheduler slurm --collect perf --model-path <model> --dry-run"
+}
+
+# ----------------------------------------------------------------
+# Main
+# ----------------------------------------------------------------
+target="${1:-all}"
+
+case "${target}" in
+  kind)
+    setup_kind
+    ;;
+  slurm)
+    setup_slurm
+    ;;
+  all)
+    setup_kind
+    echo
+    setup_slurm
+    ;;
+  *)
+    echo "Usage: $0 [kind|slurm|all]"
+    exit 1
+    ;;
+esac
+
+echo
+log "All done. 
Teardown with: ./tests/integration/infra/dev-teardown.sh" diff --git a/tests/integration/infra/dev-teardown.sh b/tests/integration/infra/dev-teardown.sh new file mode 100755 index 0000000..c5e74ee --- /dev/null +++ b/tests/integration/infra/dev-teardown.sh @@ -0,0 +1,48 @@ +#!/usr/bin/env bash +# dev-teardown.sh — tear down FlowSim test clusters +# +# Usage: +# ./tests/integration/infra/dev-teardown.sh # teardown both +# ./tests/integration/infra/dev-teardown.sh kind # kind only +# ./tests/integration/infra/dev-teardown.sh slurm # slurm only + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +KIND_CLUSTER_NAME="flowsim" + +log() { printf "\033[1;32m[teardown]\033[0m %s\n" "$*"; } +warn() { printf "\033[1;33m[teardown]\033[0m %s\n" "$*"; } + +teardown_kind() { + # Delete device plugin namespace (contains per-node DaemonSets) + if command -v kubectl >/dev/null; then + kubectl delete namespace nvidia-device-plugin --ignore-not-found 2>/dev/null || true + fi + if command -v kind >/dev/null && kind get clusters 2>/dev/null | grep -q "^${KIND_CLUSTER_NAME}$"; then + log "Deleting kind cluster '${KIND_CLUSTER_NAME}'..." + kind delete cluster --name "${KIND_CLUSTER_NAME}" + else + warn "kind cluster '${KIND_CLUSTER_NAME}' not found, skipping" + fi +} + +teardown_slurm() { + if docker compose -f "${SCRIPT_DIR}/slurm-compose.yaml" ps --quiet 2>/dev/null | head -1 | grep -q .; then + log "Stopping Slurm containers..." + docker compose -f "${SCRIPT_DIR}/slurm-compose.yaml" down -v + else + warn "Slurm containers not running, skipping" + fi +} + +target="${1:-all}" + +case "${target}" in + kind) teardown_kind ;; + slurm) teardown_slurm ;; + all) teardown_kind; teardown_slurm ;; + *) echo "Usage: $0 [kind|slurm|all]"; exit 1 ;; +esac + +log "Done." 
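Both setup scripts above rely on the same retry-with-deadline pattern: poll a readiness command (`sinfo` in the controller container, the munge socket check, the `nvidia.com/gpu` allocatable query) until it succeeds or a timeout expires. A minimal Python sketch of that pattern — the `wait_until` helper is hypothetical, not part of the repo:

```python
import subprocess
import time


def wait_until(cmd, timeout_s=60.0, interval_s=2.0):
    """Poll ``cmd`` (an argv list) until it exits 0 or the deadline passes.

    Returns True on success, False on timeout — the Python analogue of the
    ``while ! docker exec slurmctld sinfo; do sleep 2; done`` loops in
    dev-setup.sh.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # capture_output=True keeps the probe's stdout/stderr off the console
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0:
            return True
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # "true" exits 0 immediately; "false" never succeeds within the deadline.
    print(wait_until(["true"], timeout_s=5))                    # -> True
    print(wait_until(["false"], timeout_s=1, interval_s=0.2))   # -> False
```

Bounding every wait by a deadline (rather than looping forever) is what lets the setup scripts fail fast with a diagnostic instead of hanging CI when a daemon never comes up.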
diff --git a/tests/integration/infra/gres.conf b/tests/integration/infra/gres.conf new file mode 100644 index 0000000..745eeac --- /dev/null +++ b/tests/integration/infra/gres.conf @@ -0,0 +1,3 @@ +# Slurm GRES config — explicit GPU definition (AutoDetect=nvml requires +# cgroup v2 which is not available; define GPU manually) +Name=gpu Type=nvidia File=/dev/nvidia0 Count=1 diff --git a/tests/integration/infra/kind-multi-node.yaml b/tests/integration/infra/kind-multi-node.yaml new file mode 100644 index 0000000..7ec1b68 --- /dev/null +++ b/tests/integration/infra/kind-multi-node.yaml @@ -0,0 +1,37 @@ +# Kind cluster config — 1 control-plane + 1 GPU worker node +# +# GPU support via CDI mode (NVIDIA k8s-device-plugin official approach). +# See: https://github.com/NVIDIA/k8s-device-plugin/tree/main/demo/clusters/kind +# +# The single worker binds GPU 0. Change the containerPath index to +# assign a different GPU. +# +# Pre-requisites (host): +# - Docker with nvidia as default runtime +# - accept-nvidia-visible-devices-as-volume-mounts = true +# in /etc/nvidia-container-runtime/config.toml +# - kind, kubectl, helm +# +# Usage: +# ./tests/integration/infra/dev-setup.sh kind +# +# Teardown: +# ./tests/integration/infra/dev-teardown.sh kind + +kind: Cluster +apiVersion: kind.x-k8s.io/v1alpha4 + +nodes: + - role: control-plane + + # Worker — GPU 0 only + - role: worker + extraMounts: + - hostPath: /dev/null + containerPath: /var/run/nvidia-container-devices/0 + - hostPath: /path/to/host/workspace + containerPath: /workspace + readOnly: true + # Writable mount so K8s pods can write traces directly to host + - hostPath: /path/to/host/stage_traces + containerPath: /host-stage-traces diff --git a/tests/integration/infra/slurm-compose.yaml b/tests/integration/infra/slurm-compose.yaml new file mode 100644 index 0000000..b9ba09a --- /dev/null +++ b/tests/integration/infra/slurm-compose.yaml @@ -0,0 +1,141 @@ +# Slurm test cluster — slurmctld + 1 compute node (GPU 0) +# +# 
Requires HOST_WORKSPACE env var pointing to the directory containing +# model weights (mounted read-only into containers as /workspace). +# +# Usage: +# export HOST_WORKSPACE=/path/to/workspace +# cd tests/integration/infra/ +# docker compose -f slurm-compose.yaml up -d +# +# # Wait for cluster to be ready (~30s) +# docker exec slurmctld sinfo +# +# # Get JWT token for REST API +# docker exec slurmctld scontrol token lifespan=3600 +# +# # Init FlowSim +# flowsim init slurm --rest-url http://localhost:6820 \ +# --partition normal --account default \ +# --jwt-token-cmd "docker exec slurmctld scontrol token lifespan=3600" \ +# --force +# +# # Submit a job +# flowsim submit --scheduler slurm --collect perf \ +# --model-path /models/Qwen-7B --gpus 1 +# +# # Teardown +# docker compose -f slurm-compose.yaml down -v +# # Or from project root: +# docker compose -f tests/integration/infra/slurm-compose.yaml down -v + +x-slurm-base: &slurm-base + build: + context: . + dockerfile: slurm-node.dockerfile + volumes: + - slurm-etc:/etc/slurm + - munge-socket:/run/munge + # Share workspace for model weights / traces + - ${HOST_WORKSPACE:?set HOST_WORKSPACE to the directory containing model weights}:/workspace:ro + networks: + - slurm-net + +services: + # ---- Munge (shared auth daemon) ---- + munge: + <<: *slurm-base + container_name: munge + hostname: munge + command: > + bash -c " + if [ ! 
-f /etc/munge/munge.key ]; then + mungekey --create --force + fi + chown munge:munge /etc/munge/munge.key + chmod 400 /etc/munge/munge.key + mkdir -p /run/munge + chown munge:munge /run/munge + chmod 755 /run/munge + gosu munge munged --foreground + " + volumes: + - munge-key:/etc/munge + - munge-socket:/run/munge + + # ---- Controller ---- + slurmctld: + <<: *slurm-base + container_name: slurmctld + hostname: slurmctld + command: > + bash -c " + mkdir -p /run/munge && chown munge:munge /run/munge + until [ -S /run/munge/munge.socket.2 ]; do sleep 0.5; done + slurmctld -D -vvv + " + depends_on: + - munge + volumes: + - slurm-etc:/etc/slurm + - munge-key:/etc/munge:ro + - munge-socket:/run/munge + - slurm-state:/var/spool/slurmctld + + # ---- Compute node 0 (GPU 0) ---- + slurmd-0: + <<: *slurm-base + container_name: slurmd-0 + hostname: slurmd-0 + runtime: nvidia + environment: + NVIDIA_VISIBLE_DEVICES: "0" + command: > + bash -c " + mkdir -p /run/munge && chown munge:munge /run/munge + until [ -S /run/munge/munge.socket.2 ]; do sleep 0.5; done + slurmd -D -vvv + " + depends_on: + - slurmctld + volumes: + - slurm-etc:/etc/slurm:ro + - munge-key:/etc/munge:ro + - munge-socket:/run/munge + - ${HOST_WORKSPACE:?set HOST_WORKSPACE}:/workspace:ro + # Writable mount so traces appear on host + - ../../../stage_traces:/flowsim/stage_traces + # Cgroup needed by slurmd + - /sys/fs/cgroup:/sys/fs/cgroup:rw + + # ---- REST API (optional, for REST mode) ---- + # slurmrestd: + # <<: *slurm-base + # container_name: slurmrestd + # hostname: slurmrestd + # command: > + # bash -c " + # mkdir -p /run/munge && chown munge:munge /run/munge + # until [ -S /run/munge/munge.socket.2 ]; do sleep 0.5; done + # gosu slurm slurmrestd -a rest_auth/jwt 0.0.0.0:6820 -vvv -s slurmctld + # " + # depends_on: + # - slurmctld + # ports: + # - "6820:6820" + # cap_add: + # - SYS_ADMIN + # volumes: + # - slurm-etc:/etc/slurm:ro + # - munge-key:/etc/munge:ro + # - munge-socket:/run/munge + +volumes: + 
slurm-etc: + slurm-state: + munge-key: + munge-socket: + +networks: + slurm-net: + driver: bridge diff --git a/tests/integration/infra/slurm-node.dockerfile b/tests/integration/infra/slurm-node.dockerfile new file mode 100644 index 0000000..8b79db0 --- /dev/null +++ b/tests/integration/infra/slurm-node.dockerfile @@ -0,0 +1,55 @@ +# Slurm node image — controller, compute, and REST API +# +# Based on flowsim-image so compute nodes have the full Python/sglang +# environment. Slurm 23.11 is compiled on top with JWT + NVML GRES. +# Used by slurm-compose.yaml. + +FROM flowsim-image:latest + +ENV DEBIAN_FRONTEND=noninteractive + +# Slurm build dependencies + munge +RUN apt-get update && apt-get install -y --no-install-recommends \ + gosu \ + libhttp-parser-dev \ + libjson-c-dev \ + libjwt-dev \ + libmunge-dev \ + munge \ + && rm -rf /var/lib/apt/lists/* + +# Install Slurm 23.11 from source (slurmrestd + JWT auth + NVML GRES) +ARG SLURM_VERSION=23.11.10 +RUN cd /tmp && \ + wget -q https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2 && \ + tar xjf slurm-${SLURM_VERSION}.tar.bz2 && \ + cd slurm-${SLURM_VERSION} && \ + ./configure \ + --prefix=/usr \ + --sysconfdir=/etc/slurm \ + --with-jwt \ + --with-http-parser \ + --with-json \ + --with-nvml \ + --enable-slurmrestd && \ + make -j"$(nproc)" && \ + make install && \ + rm -rf /tmp/slurm-* + +# Create required directories and users +RUN useradd -r -s /sbin/nologin slurm 2>/dev/null || true && \ + mkdir -p /etc/slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm && \ + chown slurm:slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm + +# Slurm config +COPY slurm.conf /etc/slurm/slurm.conf +COPY gres.conf /etc/slurm/gres.conf +COPY cgroup.conf /etc/slurm/cgroup.conf + +# JWT key for REST API auth +RUN dd if=/dev/urandom bs=32 count=1 2>/dev/null | base64 > /etc/slurm/jwt_hs256.key && \ + chown slurm:slurm /etc/slurm/jwt_hs256.key && \ + chmod 0600 /etc/slurm/jwt_hs256.key + +WORKDIR /flowsim +CMD 
["bash"]
diff --git a/tests/integration/infra/slurm.conf b/tests/integration/infra/slurm.conf
new file mode 100644
index 0000000..ea7611b
--- /dev/null
+++ b/tests/integration/infra/slurm.conf
@@ -0,0 +1,51 @@
+# slurm.conf — minimal single-node cluster for FlowSim testing
+#
+# Controller: slurmctld
+# Compute:    slurmd-0 (1 GPU)
+# REST API:   not provisioned in this test configuration
+
+ClusterName=flowsim
+SlurmctldHost=slurmctld
+
+# Auth
+AuthType=auth/munge
+AuthAltTypes=auth/jwt
+AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key
+
+# Paths
+SlurmctldPidFile=/var/run/slurmctld.pid
+SlurmdPidFile=/var/run/slurmd.pid
+StateSaveLocation=/var/spool/slurmctld
+SlurmdSpoolDir=/var/spool/slurmd
+SlurmctldLogFile=/var/log/slurm/slurmctld.log
+SlurmdLogFile=/var/log/slurm/slurmd.log
+
+# Scheduling
+SchedulerType=sched/backfill
+SelectType=select/cons_tres
+SelectTypeParameters=CR_Core_Memory
+
+# Accounting (disabled — no slurmdbd in test cluster)
+JobAcctGatherType=jobacct_gather/none
+
+# Task management — disable cgroups (not available in containers)
+TaskPlugin=task/none
+ProctrackType=proctrack/linuxproc
+JobContainerType=job_container/none
+
+# Timeouts
+SlurmctldTimeout=30
+SlurmdTimeout=30
+InactiveLimit=0
+MinJobAge=300
+KillWait=30
+Waittime=0
+
+# GRES (GPU): defined explicitly in gres.conf (no NVML autodetect)
+GresTypes=gpu
+
+# Partitions — single compute node for testing
+PartitionName=normal Nodes=slurmd-0 Default=YES MaxTime=INFINITE State=UP
+
+# Node definition — 1 GPU (CPUs/memory match hardware)
+NodeName=slurmd-0 CPUs=112 RealMemory=128000 Gres=gpu:1 State=UNKNOWN
diff --git a/tests/integration/test_scheduler.py b/tests/integration/test_scheduler.py
new file mode 100644
index 0000000..7ecaf9b
--- /dev/null
+++ b/tests/integration/test_scheduler.py
@@ -0,0 +1,830 @@
+"""Integration tests for all FlowSim scheduler backends.
+
+How It Works
+------------
+Each test class exercises one scheduler backend end-to-end through the
+``flowsim`` CLI (the same commands a user would run). 
The flow is: + +1. ``flowsim submit`` — submit a ``--collect all`` profiling job. +2. ``flowsim list`` — verify the job appears in the listing. +3. ``flowsim status`` — poll until Completed / Succeeded (up to 20 min). +4. Validate outputs on the host file system. + +Infrastructure is auto-provisioned by session-scoped fixtures: + +* **Local** — uses Docker on the host directly (no extra infra). +* **K8s** — spins up a Kind cluster via ``dev-setup.sh kind``. +* **Slurm** — spins up a docker-compose Slurm cluster via + ``dev-setup.sh slurm`` (slurmctld + slurmd-0 with GPU 0). + +Pass Criteria +------------- +* Job reaches Completed/Succeeded within the timeout. +* Stage-separated trace files exist (EXTEND + DECODE ``.trace.json.gz``). +* Parsed CSVs exist under ``parsed/`` with non-zero rows. +* GEMM kernels: EXTEND ``dim0 == bs * input_len``, DECODE ``dim0 == bs``. +* FlashAttn kernels: EXTEND dims contain ``[bs, input_len + existing_ctx]`` (±1). +* ``analysis_extend.json`` and ``analysis_decode.json`` are valid JSON. +* After ``--collect shapes``, ``Dims`` column is present in merged CSVs. +* Sweep jobs produce per-point subdirs + ``sweep_summary.json``. +* Log files (stdout/stderr) exist under ``logs/``. + +Requirements +------------ +* Docker with ``flowsim-image:latest`` built. +* GPU-equipped host machine. +* ``tests/integration/infra/dev-setup.sh`` available. + +Environment Variables +--------------------- +``MODEL`` + Model path (default: ``workload/models/configs/Qwen3-235B-A22B``). +``LOAD_FORMAT`` + Load format (default: ``dummy``). 
+ +Usage +----- + # All scheduler tests: + python -m pytest tests/integration/test_scheduler.py -v -x + + # Single backend: + python -m pytest tests/integration/test_scheduler.py -v -x -k "local" + python -m pytest tests/integration/test_scheduler.py -v -x -k "k8s" + python -m pytest tests/integration/test_scheduler.py -v -x -k "slurm" +""" + +import ast +import csv +import glob +import json +import os +import subprocess +import sys +import tempfile +import time + +import pytest + +from schedulers.base import JobResult, ProfileJobSpec +from schedulers.local import LocalScheduler + +_PROJECT_ROOT = os.path.abspath( + os.path.join(os.path.dirname(__file__), "..", "..") +) +_DEV_SETUP = os.path.join( + _PROJECT_ROOT, "tests", "integration", "infra", "dev-setup.sh" +) +_DEV_TEARDOWN = os.path.join( + _PROJECT_ROOT, "tests", "integration", "infra", "dev-teardown.sh" +) + +MODEL = os.environ.get("MODEL", "workload/models/configs/Qwen3-235B-A22B") +LOAD_FORMAT = os.environ.get("LOAD_FORMAT", "dummy") + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def _flowsim_cli( + *args: str, timeout: int = 1200 +) -> subprocess.CompletedProcess: + """Run a ``flowsim`` subcommand via Python entry point.""" + cmd = [ + sys.executable, + "-u", + "-c", + "from scripts.cli import main; main()", + *args, + ] + env = os.environ.copy() + env["PYTHONPATH"] = _PROJECT_ROOT + (":" + env.get("PYTHONPATH", "")) + env["PYTHONUNBUFFERED"] = "1" + return subprocess.run( + cmd, + capture_output=True, + text=True, + cwd=_PROJECT_ROOT, + env=env, + timeout=timeout, + ) + + +def _assert_traces(output_dir: str) -> None: + """Assert EXTEND + DECODE traces and parsed CSVs exist.""" + traces = glob.glob( + os.path.join(output_dir, "**/*.trace.json.gz"), recursive=True + ) + assert len(traces) > 0, f"No trace files under {output_dir}" + extend = [t for t in traces if "EXTEND" in 
os.path.basename(t)] + decode = [t for t in traces if "DECODE" in os.path.basename(t)] + assert len(extend) > 0, "No EXTEND traces" + assert len(decode) > 0, "No DECODE traces" + + csvs = glob.glob( + os.path.join(output_dir, "**/parsed/*.csv"), recursive=True + ) + assert len(csvs) > 0, f"No parsed CSVs under {output_dir}" + # At least EXTEND should be parsed; DECODE CSV may be absent for short sequences + extend_csvs = [c for c in csvs if "EXTEND" in os.path.basename(c)] + assert len(extend_csvs) > 0, "No EXTEND parsed CSVs" + + +def _assert_logs(output_dir: str) -> None: + """Assert server log files exist under {output_dir}/logs/.""" + log_dir = os.path.join(output_dir, "logs") + assert os.path.isdir(log_dir), f"Log directory not found: {log_dir}" + log_files = os.listdir(log_dir) + assert len(log_files) > 0, f"No log files in {log_dir}" + stdout_logs = [f for f in log_files if f.endswith(".stdout.log")] + stderr_logs = [f for f in log_files if f.endswith(".stderr.log")] + assert len(stdout_logs) > 0, f"No stdout logs in {log_dir}" + assert len(stderr_logs) > 0, f"No stderr logs in {log_dir}" + # At least one log should be non-empty + sizes = [os.path.getsize(os.path.join(log_dir, f)) for f in stdout_logs] + assert max(sizes) > 0, "All stdout logs are empty" + + +# --------------------------------------------------------------------------- +# Shape validation helpers (same logic as test_stage_profile_configs.py) +# --------------------------------------------------------------------------- +def _read_csv(path): + with open(path, newline="") as f: + return list(csv.DictReader(f)) + + +_GEMM_NAME_PATTERNS = ("nvjet", "cublasLt", "cublas_", "cutlass_gemm") + + +def _first_matmul_dim0(rows): + """Return dim0 of the first GEMM kernel (the M dimension).""" + for row in rows: + if row.get("op", "") == "matmul": + dims = ast.literal_eval(row["Dims"]) + return dims[0][0] + for row in rows: + name = row["Name"] + dims_str = row.get("Dims", "N/A") + if dims_str == "N/A" or 
not dims_str:
+            continue
+        if any(pat in name for pat in _GEMM_NAME_PATTERNS):
+            dims = ast.literal_eval(dims_str)
+            if len(dims) >= 2 and len(dims[0]) == 2 and len(dims[1]) == 2:
+                return dims[0][0]
+    return None
+
+
+def _attention_seqlen_pair(rows, bs, seq_len):
+    """Check that [bs, seq_len] (or +1) appears in FlashAttn dims."""
+    for row in rows:
+        name = row["Name"]
+        if "FlashAttn" not in name:
+            continue
+        if "Combine" in name or "prepare" in name:
+            continue
+        dims = ast.literal_eval(row["Dims"])
+        for d in dims:
+            if (
+                isinstance(d, list)
+                and len(d) == 2
+                and d[0] == bs
+                and d[1] in (seq_len, seq_len + 1)
+            ):
+                return d
+    return None
+
+
+def _validate_shapes(output_dir, bs, input_len, existing_ctx):
+    """Validate GEMM dim0 and FlashAttn seqlen in merged/shape_parsed CSVs."""
+    tag = f"bs{bs}_input{input_len}_ctx{existing_ctx}"
+    for csv_subdir in ("merged", "shape_parsed"):
+        extend_csvs = sorted(
+            glob.glob(
+                os.path.join(output_dir, tag, csv_subdir, "*TP-0*EXTEND*.csv")
+            )
+        )
+        decode_csvs = sorted(
+            glob.glob(
+                os.path.join(output_dir, tag, csv_subdir, "*TP-0*DECODE*.csv")
+            )
+        )
+        if extend_csvs and decode_csvs:
+            break
+    else:
+        pytest.fail(
+            f"No EXTEND+DECODE CSVs for TP-0 in {output_dir}/{tag}/{{merged,shape_parsed}}/"
+        )
+
+    extend_rows = _read_csv(extend_csvs[0])
+    decode_rows = _read_csv(decode_csvs[0])
+
+    # EXTEND first GEMM dim0 == bs * input_len
+    ext_gemm_dim0 = _first_matmul_dim0(extend_rows)
+    assert ext_gemm_dim0 is not None, "No matmul kernel found in EXTEND CSV"
+    expected_ext = bs * input_len
+    assert (
+        ext_gemm_dim0 == expected_ext
+    ), f"EXTEND first GEMM dim0={ext_gemm_dim0}, expected bs*input_len={expected_ext}"
+
+    # EXTEND FlashAttn dims contain [bs, seq_len]
+    seq_len = input_len + existing_ctx
+    attn_pair = _attention_seqlen_pair(extend_rows, bs, seq_len)
+    assert (
+        attn_pair is not None
+    ), f"No FlashAttention dim matching [bs={bs}, seqlen={seq_len}(+1)] in EXTEND CSV"
+
+    # DECODE first 
GEMM dim0 == bs + dec_gemm_dim0 = _first_matmul_dim0(decode_rows) + assert dec_gemm_dim0 is not None, "No matmul kernel found in DECODE CSV" + assert ( + dec_gemm_dim0 == bs + ), f"DECODE first GEMM dim0={dec_gemm_dim0}, expected bs={bs}" + + +# ===================================================================== +# LOCAL SCHEDULER — real profiling (4-step flow) +# ===================================================================== +class TestLocalScheduler: + """Run real profiling via ``flowsim`` CLI on the local Docker scheduler. + + Flow per test point: + 1. ``flowsim submit`` — submit the job (collect all) + 2. ``flowsim list`` — verify the job appears + 3. ``flowsim status`` — poll until Completed + 4. Validate trace CSVs — GEMM dim0, FlashAttn seqlen for EXTEND & DECODE + """ + + _TP1_POINTS = [ + {"bs": 1, "input_len": 2048, "existing_ctx": 0, "decode_tokens": 2}, + {"bs": 1, "input_len": 2048, "existing_ctx": 2048, "decode_tokens": 2}, + ] + + @pytest.mark.parametrize( + "point", + _TP1_POINTS, + ids=[ + f"bs{p['bs']}_il{p['input_len']}_ctx{p['existing_ctx']}" + for p in _TP1_POINTS + ], + ) + def test_local_tp1_all(self, point): + bs = point["bs"] + input_len = point["input_len"] + existing_ctx = point["existing_ctx"] + decode_tokens = point["decode_tokens"] + + # ── Step 1: submit ── + r = _flowsim_cli( + "submit", + "--scheduler", + "local", + "--collect", + "all", + "--model-path", + MODEL, + "--tp", + "1", + "--bs", + str(bs), + "--input-len", + str(input_len), + "--existing-ctx", + str(existing_ctx), + "--decode-tokens", + str(decode_tokens), + "--warmup-n", + "2", + "--gpus", + "1", + "--local-gpus", + "0", + "--extra-server-opts", + f"--load-format {LOAD_FORMAT}", + ) + if r.returncode != 0: + print("STDOUT:", r.stdout[-3000:]) + print("STDERR:", r.stderr[-3000:]) + assert r.returncode == 0, f"flowsim submit failed (exit {r.returncode})" + + # Extract job_id from output (line like "flowsim-all-... 
completed successfully") + combined = r.stdout + r.stderr + job_id = None + for line in combined.splitlines(): + if "flowsim-all-" in line: + for word in line.split(): + if word.startswith("flowsim-all-"): + job_id = word.rstrip(".,;:") + break + if job_id: + break + assert ( + job_id + ), f"Could not find job_id in submit output:\n{combined[-1000:]}" + + # ── Step 2: list — verify job appears ── + r_list = _flowsim_cli("list", "--scheduler", "local") + assert r_list.returncode == 0, "flowsim list failed" + assert ( + job_id in r_list.stdout + ), f"Job {job_id} not found in list output:\n{r_list.stdout}" + + # ── Step 3: status — should be Completed (submit is synchronous) ── + r_status = _flowsim_cli( + "status", "--scheduler", "local", "--job", job_id + ) + assert r_status.returncode == 0, "flowsim status failed" + status_out = r_status.stdout.lower() + assert ( + "completed" in status_out + ), f"Job {job_id} not completed:\n{r_status.stdout}" + + # ── Step 4: validate trace CSVs ── + # Extract output_dir from status output (Traces dir: ...) 
+        output_dir = None
+        for line in r_status.stdout.splitlines():
+            if "Traces dir:" in line:
+                output_dir = line.split("Traces dir:", 1)[1].strip()
+                break
+        assert output_dir and os.path.isdir(
+            output_dir
+        ), f"Could not find traces dir in status output:\n{r_status.stdout}"
+        _assert_traces(output_dir)
+        _assert_logs(output_dir)
+        _validate_shapes(
+            output_dir, bs=bs, input_len=input_len, existing_ctx=existing_ctx
+        )
+
+
+# =====================================================================
+# Cluster setup helpers & fixtures
+# =====================================================================
+
+
+def _run_dev_setup(target: str) -> None:
+    """Run ``tests/integration/infra/dev-setup.sh <target>`` and assert success."""
+    r = subprocess.run(
+        ["bash", _DEV_SETUP, target],
+        capture_output=True,
+        text=True,
+        cwd=_PROJECT_ROOT,
+        timeout=300,
+    )
+    if r.returncode != 0:
+        raise RuntimeError(
+            f"dev-setup.sh {target} failed (exit {r.returncode}):\n"
+            f"stdout: {r.stdout[-2000:]}\nstderr: {r.stderr[-2000:]}"
+        )
+
+
+def _run_dev_teardown(target: str) -> None:
+    """Run ``tests/integration/infra/dev-teardown.sh <target>``."""
+    subprocess.run(
+        ["bash", _DEV_TEARDOWN, target],
+        capture_output=True,
+        text=True,
+        cwd=_PROJECT_ROOT,
+        timeout=120,
+    )
+
+
+def _kind_cluster_running() -> bool:
+    """Check if the Kind cluster named 'flowsim' is reachable."""
+    try:
+        r = subprocess.run(
+            ["kubectl", "--context", "kind-flowsim", "get", "nodes"],
+            capture_output=True,
+            text=True,
+            timeout=15,
+        )
+        return r.returncode == 0 and "Ready" in r.stdout
+    except Exception:
+        return False
+
+
+@pytest.fixture(scope="session")
+def kind_cluster():
+    """Ensure Kind cluster is running; auto-setup if needed.
+
+    The cluster is kept alive after the test session to avoid
+    re-loading the 34 GB image every time. Use ``dev-teardown.sh kind``
+    to clean up manually. 
+ """ + if not _kind_cluster_running(): + _run_dev_setup("kind") + assert _kind_cluster_running(), "Kind cluster not reachable after setup" + yield + + +@pytest.fixture(scope="session") +def slurm_cluster(): + """Ensure Slurm cluster is running; auto-setup if needed. + + Cluster is kept alive after tests. Use ``dev-teardown.sh slurm`` + to clean up manually. + """ + if not _slurm_cluster_running(): + _run_dev_setup("slurm") + assert _slurm_cluster_running(), "Slurm cluster not reachable after setup" + yield + + +# ===================================================================== +# K8S SCHEDULER +# ===================================================================== +class TestK8sScheduler: + """K8s scheduler: real submit to Kind cluster. + + Automatically sets up the Kind cluster via ``dev-setup.sh`` if not + already running. + """ + + def test_k8s_real_submit_to_kind(self, kind_cluster): + """Submit a real Job to Kind cluster: submit → list → status → retrieve → validate.""" + import shutil + import tempfile + + job_name = f"test-integ-{int(time.time()) % 100000}" + local_traces = tempfile.mkdtemp(prefix="flowsim-k8s-traces-") + + try: + # ── Step 0: clean stale test traces on host ── + host_traces = os.path.join(_PROJECT_ROOT, "stage_traces") + os.makedirs(host_traces, exist_ok=True) + + # ── Step 1: submit (host mount for trace retrieval) ── + r = _flowsim_cli( + "submit", + "--scheduler", + "k8s", + "--collect", + "all", + "--model-path", + MODEL, + "--tp", + "1", + "--bs", + "1", + "--input-len", + "2048", + "--existing-ctx", + "0", + "--decode-tokens", + "2", + "--warmup-n", + "2", + "--gpus", + "1", + "--k8s-namespace", + "default", + "--k8s-host-output-dir", + "/host-stage-traces", + "--job-name", + job_name, + "--extra-server-opts", + f"--load-format {LOAD_FORMAT}", + ) + combined = r.stdout + r.stderr + if r.returncode != 0: + print("Submit output:", combined[-3000:]) + assert r.returncode == 0, f"K8s submit failed: {combined[-1000:]}" + + # ── Step 
2: list — verify job appears ── + r_list = _flowsim_cli("list", "--scheduler", "k8s") + assert r_list.returncode == 0 + assert ( + job_name in r_list.stdout + ), f"Job {job_name} not in list:\n{r_list.stdout}" + + # ── Step 3: status — poll until Completed/Succeeded (max 20 min) ── + deadline = time.time() + 1200 + state = "" + while time.time() < deadline: + r_status = _flowsim_cli( + "status", "--scheduler", "k8s", "--job", job_name + ) + assert r_status.returncode == 0 + state = r_status.stdout.lower() + if "completed" in state or "succeeded" in state: + break + if "failed" in state: + pytest.fail(f"K8s job failed:\n{r_status.stdout}") + time.sleep(15) + assert ( + "completed" in state or "succeeded" in state + ), f"K8s job did not complete in time:\n{r_status.stdout}" + + # ── Step 4: traces are on host via Kind mount ── + # output_dir inside container: /flowsim/stage_traces/k8s/{ts} + # host_output_dir on worker: /host-stage-traces + # → host: {project}/stage_traces/k8s/{ts}/ + k8s_traces = os.path.join(host_traces, "k8s") + assert os.path.isdir( + k8s_traces + ), f"No k8s traces dir at {k8s_traces}" + # Find the latest timestamped subdir + ts_dirs = sorted(os.listdir(k8s_traces)) + assert ts_dirs, f"No timestamp dirs in {k8s_traces}" + local_traces = os.path.join(k8s_traces, ts_dirs[-1]) + + # ── Step 5: validate trace CSVs ── + _assert_traces(local_traces) + _assert_logs(local_traces) + _validate_shapes(local_traces, bs=1, input_len=2048, existing_ctx=0) + + finally: + # Cleanup: cancel job (traces stay on host for inspection) + _flowsim_cli("cancel", "--scheduler", "k8s", "--job", job_name) + + +# ===================================================================== +# SLURM SCHEDULER +# ===================================================================== + + +def _slurm_cluster_running() -> bool: + """Check if local Slurm test cluster (docker compose) is running.""" + try: + r = subprocess.run( + ["docker", "exec", "slurmctld", "sinfo", "-h"], + 
capture_output=True, + text=True, + timeout=10, + ) + return r.returncode == 0 and r.stdout.strip() != "" + except Exception: + return False + + +# CLI prefix for running Slurm commands inside the slurmctld container. +# Uses -i so sbatch can read scripts from stdin. +_SLURM_CLI_PREFIX = "docker exec -i slurmctld" + + +class TestSlurmScheduler: + """Slurm scheduler: real submit to local docker-compose cluster. + + Uses CLI mode (sbatch/squeue/scancel) — no slurmrestd needed. + Automatically sets up the Slurm cluster via ``dev-setup.sh slurm`` + if not already running. + """ + + def test_slurm_real_submit(self, slurm_cluster): + """Submit to local Slurm cluster: submit → list → status → retrieve → validate.""" + + # Compute node has /flowsim/stage_traces mounted writable to host. + # output_dir inside the container maps directly to the host. + host_traces = os.path.join(_PROJECT_ROOT, "stage_traces") + os.makedirs(host_traces, exist_ok=True) + ts = time.strftime("%Y%m%d_%H%M%S") + output_dir = f"/flowsim/stage_traces/slurm/{ts}" + + job_id = None + try: + # ── Step 1: submit (CLI mode, container_runtime=none) ── + r = _flowsim_cli( + "submit", + "--scheduler", + "slurm", + "--collect", + "all", + "--model-path", + MODEL, + "--tp", + "1", + "--bs", + "1", + "--input-len", + "2048", + "--existing-ctx", + "0", + "--decode-tokens", + "2", + "--warmup-n", + "2", + "--gpus", + "1", + "--slurm-partition", + "normal", + "--slurm-cli-prefix", + _SLURM_CLI_PREFIX, + "--slurm-container-runtime", + "none", + "--output-dir", + output_dir, + "--extra-server-opts", + f"--load-format {LOAD_FORMAT}", + ) + combined = r.stdout + r.stderr + if r.returncode != 0: + print("Submit output:", combined[-3000:]) + assert r.returncode == 0, f"Slurm submit failed: {combined[-1000:]}" + + # Extract job_id from output (line like "Submitted batch job 123") + for line in combined.splitlines(): + if "submitted" in line.lower(): + for word in line.split(): + if word.isdigit(): + job_id = word + 
break + if job_id: + break + assert ( + job_id + ), f"Could not find job_id in submit output:\n{combined[-1000:]}" + + # ── Step 2: status — poll until Completed (max 20 min) ── + deadline = time.time() + 1200 + state = "" + while time.time() < deadline: + r_status = _flowsim_cli( + "status", + "--scheduler", + "slurm", + "--job", + job_id, + "--slurm-cli-prefix", + _SLURM_CLI_PREFIX, + ) + assert r_status.returncode == 0 + state = r_status.stdout.lower() + if "completed" in state or "succeeded" in state: + break + if "failed" in state: + pytest.fail(f"Slurm job failed:\n{r_status.stdout}") + time.sleep(15) + assert ( + "completed" in state or "succeeded" in state + ), f"Slurm job did not complete in time:\n{r_status.stdout}" + + # ── Step 3: traces are on host via mount ── + slurm_traces = os.path.join(host_traces, "slurm") + assert os.path.isdir( + slurm_traces + ), f"No slurm traces dir at {slurm_traces}" + ts_dirs = sorted(os.listdir(slurm_traces)) + assert ts_dirs, f"No test dirs in {slurm_traces}" + local_traces = os.path.join(slurm_traces, ts_dirs[-1]) + + # ── Step 4: validate trace CSVs ── + _assert_traces(local_traces) + _assert_logs(local_traces) + _validate_shapes(local_traces, bs=1, input_len=2048, existing_ctx=0) + + finally: + # Cleanup: cancel job (traces stay on host for inspection) + if job_id: + _flowsim_cli( + "cancel", + "--scheduler", + "slurm", + "--job", + job_id, + "--slurm-cli-prefix", + _SLURM_CLI_PREFIX, + ) + + +# ===================================================================== +# SWEEP — multi-point profiling in a single job +# ===================================================================== + +# Three lightweight points: different (bs, input_len, existing_ctx) +_SWEEP_POINTS = [ + (1, 2048, 0), + (1, 4096, 0), + (1, 2048, 2048), +] + + +def _assert_sweep_output( + host_output_dir: str, points: list[tuple[int, int, int]] +) -> None: + """Validate that every sweep point produced traces and parsed CSVs.""" + for bs, il, ctx in 
points: + tag = f"bs{bs}_input{il}_ctx{ctx}" + point_dir = os.path.join(host_output_dir, tag) + assert os.path.isdir(point_dir), f"Missing sweep point dir: {point_dir}" + _assert_traces(point_dir) + + # sweep_summary.json should exist at the root + summary_path = os.path.join(host_output_dir, "sweep_summary.json") + assert os.path.isfile(summary_path), f"Missing {summary_path}" + with open(summary_path) as f: + summary = json.load(f) + assert len(summary) == len( + points + ), f"Expected {len(points)} entries in sweep_summary.json, got {len(summary)}" + for entry in summary: + assert entry["traces"] > 0, f"Point {entry} has 0 traces" + + +class TestLocalSweep: + """Multi-point sweep via ``--sweep`` and ``--sweep-file`` on local scheduler. + + Validates that one job profiles all requested points and produces + correct directory structure, traces, and sweep_summary.json. + """ + + def test_sweep_inline(self): + """Submit a 3-point sweep using inline --sweep tuples.""" + sweep_args = [f"{bs}:{il}:{ctx}" for bs, il, ctx in _SWEEP_POINTS] + + r = _flowsim_cli( + "submit", + "--scheduler", + "local", + "--collect", + "perf", + "--model-path", + MODEL, + "--tp", + "1", + "--decode-tokens", + "2", + "--warmup-n", + "2", + "--gpus", + "1", + "--local-gpus", + "0", + "--extra-server-opts", + f"--load-format {LOAD_FORMAT}", + "--sweep", + *sweep_args, + ) + combined = r.stdout + r.stderr + if r.returncode != 0: + print("STDOUT:", r.stdout[-3000:]) + print("STDERR:", r.stderr[-3000:]) + assert r.returncode == 0, f"sweep submit failed (exit {r.returncode})" + + # Find host output dir from submit output + output_dir = None + for line in combined.splitlines(): + if "Traces:" in line: + output_dir = line.split("Traces:", 1)[1].strip() + break + assert output_dir and os.path.isdir( + output_dir + ), f"Could not find traces dir in output:\n{combined[-1000:]}" + + _assert_sweep_output(output_dir, _SWEEP_POINTS) + _assert_logs(output_dir) + + def test_sweep_file(self): + """Submit a 
3-point sweep reading points from a file.""" + with tempfile.NamedTemporaryFile( + mode="w", suffix=".txt", delete=False, prefix="sweep_" + ) as f: + f.write("# bs:input_len:existing_ctx\n") + for bs, il, ctx in _SWEEP_POINTS: + f.write(f"{bs}:{il}:{ctx}\n") + sweep_file = f.name + + try: + r = _flowsim_cli( + "submit", + "--scheduler", + "local", + "--collect", + "perf", + "--model-path", + MODEL, + "--tp", + "1", + "--decode-tokens", + "2", + "--warmup-n", + "2", + "--gpus", + "1", + "--local-gpus", + "0", + "--extra-server-opts", + f"--load-format {LOAD_FORMAT}", + "--sweep-file", + sweep_file, + ) + combined = r.stdout + r.stderr + if r.returncode != 0: + print("STDOUT:", r.stdout[-3000:]) + print("STDERR:", r.stderr[-3000:]) + assert ( + r.returncode == 0 + ), f"sweep-file submit failed (exit {r.returncode})" + + # Find host output dir from submit output + output_dir = None + for line in combined.splitlines(): + if "Traces:" in line: + output_dir = line.split("Traces:", 1)[1].strip() + break + assert output_dir and os.path.isdir( + output_dir + ), f"Could not find traces dir in output:\n{combined[-1000:]}" + + _assert_sweep_output(output_dir, _SWEEP_POINTS) + _assert_logs(output_dir) + finally: + os.unlink(sweep_file) diff --git a/tests/unit/test_scheduler_cli.py b/tests/unit/test_scheduler_cli.py new file mode 100644 index 0000000..9f9c5ab --- /dev/null +++ b/tests/unit/test_scheduler_cli.py @@ -0,0 +1,513 @@ +"""Unit tests for the scheduler CLI (flowsim init / submit) and backends.""" + +from __future__ import annotations + +import os +import tempfile +from pathlib import Path +from unittest import mock + +import pytest +import yaml + +from schedulers.base import ProfileJobSpec +from schedulers.k8s import K8sScheduler +from schedulers.local import LocalScheduler +from schedulers.slurm import SlurmScheduler + +# ========================================================================= +# ProfileJobSpec +# 
========================================================================= + + +class TestProfileJobSpec: + """Tests for ProfileJobSpec dataclass methods.""" + + @pytest.fixture() + def spec(self) -> ProfileJobSpec: + return ProfileJobSpec( + collect="perf", + model_path="Qwen/Qwen3-8B", + tp=2, + bs=4, + input_len=1024, + ) + + def test_default_job_name(self, spec: ProfileJobSpec): + name = spec.default_job_name() + assert name.startswith("flowsim-perf-qwen3-8b-bs4-il1024-") + + def test_custom_job_name(self, spec: ProfileJobSpec): + spec.job_name = "my-job" + assert spec.default_job_name() == "my-job" + + def test_build_server_opts_basic(self, spec: ProfileJobSpec): + opts = spec.build_server_opts() + assert "--model-path Qwen/Qwen3-8B" in opts + assert "--tp 2" in opts + + def test_build_server_opts_dp(self, spec: ProfileJobSpec): + spec.dp = 4 + assert "--dp 4" in spec.build_server_opts() + + def test_build_server_opts_extra(self, spec: ProfileJobSpec): + spec.extra_server_opts = "--some-flag" + assert "--some-flag" in spec.build_server_opts() + + def test_build_profile_command(self, spec: ProfileJobSpec): + cmd = spec.build_profile_command() + assert cmd[0] == "python3" + assert "scripts/run_stage_profile.py" in cmd[1] + assert "--collect" in cmd + assert "perf" in cmd + assert "--bs" in cmd + assert "4" in cmd + + def test_build_shell_command_quotes_server_opts(self, spec: ProfileJobSpec): + shell = spec.build_shell_command() + # server-opts contains spaces, must be quoted + assert "--server-opts '" in shell or '--server-opts "' in shell + + +# ========================================================================= +# K8sScheduler.render +# ========================================================================= + + +class TestK8sScheduler: + """Tests for K8s Job manifest generation.""" + + @pytest.fixture() + def scheduler(self) -> K8sScheduler: + return K8sScheduler( + namespace="ml-team", + kubeconfig="/fake/kubeconfig", + context="prod", + 
shm_size="32Gi", + ) + + @pytest.fixture() + def spec(self) -> ProfileJobSpec: + return ProfileJobSpec( + collect="perf", + model_path="Qwen/Qwen3-8B", + gpus=2, + ) + + def test_render_valid_yaml(self, scheduler, spec): + rendered = scheduler.render(spec) + doc = yaml.safe_load(rendered) + assert doc["apiVersion"] == "batch/v1" + assert doc["kind"] == "Job" + + def test_render_namespace(self, scheduler, spec): + doc = yaml.safe_load(scheduler.render(spec)) + assert doc["metadata"]["namespace"] == "ml-team" + + def test_render_gpu_resources(self, scheduler, spec): + doc = yaml.safe_load(scheduler.render(spec)) + container = doc["spec"]["template"]["spec"]["containers"][0] + assert container["resources"]["limits"]["nvidia.com/gpu"] == "2" + + def test_render_shm_size(self, scheduler, spec): + doc = yaml.safe_load(scheduler.render(spec)) + volumes = doc["spec"]["template"]["spec"]["volumes"] + dshm = [v for v in volumes if v["name"] == "dshm"][0] + assert dshm["emptyDir"]["sizeLimit"] == "32Gi" + + def test_render_pvc_volume(self, spec): + sched = K8sScheduler(namespace="default", pvc_name="my-pvc") + doc = yaml.safe_load(sched.render(spec)) + volumes = doc["spec"]["template"]["spec"]["volumes"] + pvc_vol = [v for v in volumes if v["name"] == "output"] + assert len(pvc_vol) == 1 + assert pvc_vol[0]["persistentVolumeClaim"]["claimName"] == "my-pvc" + + def test_render_host_output_dir(self, spec): + sched = K8sScheduler(namespace="default", host_output_dir="/data/out") + doc = yaml.safe_load(sched.render(spec)) + volumes = doc["spec"]["template"]["spec"]["volumes"] + host_vol = [v for v in volumes if v["name"] == "output"] + assert len(host_vol) == 1 + assert host_vol[0]["hostPath"]["path"] == "/data/out" + + def test_render_node_selector(self, spec): + sched = K8sScheduler(namespace="default", node_selector={"gpu": "h100"}) + doc = yaml.safe_load(sched.render(spec)) + pod_spec = doc["spec"]["template"]["spec"] + assert pod_spec["nodeSelector"]["gpu"] == "h100" + + def 
test_render_service_account(self, spec): + sched = K8sScheduler(namespace="default", service_account="runner") + doc = yaml.safe_load(sched.render(spec)) + pod_spec = doc["spec"]["template"]["spec"] + assert pod_spec["serviceAccountName"] == "runner" + + def test_render_labels(self, scheduler, spec): + doc = yaml.safe_load(scheduler.render(spec)) + labels = doc["metadata"]["labels"] + assert labels["app"] == "flowsim" + assert labels["collect"] == "perf" + + +# ========================================================================= +# SlurmScheduler.render +# ========================================================================= + + +class TestSlurmScheduler: + """Tests for Slurm sbatch script generation.""" + + @pytest.fixture() + def scheduler(self) -> SlurmScheduler: + return SlurmScheduler( + partition="gpu-h100", + time_limit="01:00:00", + account="my-proj", + ) + + @pytest.fixture() + def spec(self) -> ProfileJobSpec: + return ProfileJobSpec( + collect="perf", + model_path="Qwen/Qwen3-8B", + gpus=4, + ) + + def test_render_shebang(self, scheduler, spec): + script = scheduler.render(spec) + assert script.startswith("#!/bin/bash\n") + + def test_render_sbatch_directives(self, scheduler, spec): + script = scheduler.render(spec) + assert "#SBATCH --partition=gpu-h100" in script + assert "#SBATCH --gpus-per-node=4" in script + assert "#SBATCH --exclusive" in script + assert "#SBATCH --time=01:00:00" in script + assert "#SBATCH --account=my-proj" in script + + def test_render_env_vars(self, scheduler, spec): + script = scheduler.render(spec) + assert "SGLANG_PROFILE_KERNELS=1" in script + + def test_render_command(self, scheduler, spec): + script = scheduler.render(spec) + assert "scripts/run_stage_profile.py" in script + assert "--collect perf" in script + + def test_render_docker_runtime(self, spec): + sched = SlurmScheduler( + partition="gpu", + container_runtime="docker", + container_mounts="/data:/data", + ) + script = sched.render(spec) + assert "docker 
run" in script + assert "-v /data:/data" in script + # output_dir is always auto-mounted + assert f"-v {spec.output_dir}:{spec.output_dir}" in script + + def test_render_enroot_runtime(self, spec): + sched = SlurmScheduler( + partition="gpu", + container_runtime="enroot", + ) + script = sched.render(spec) + assert "srun --container-image" in script + # output_dir is always auto-mounted + assert f"{spec.output_dir}:{spec.output_dir}" in script + + def test_render_modules(self, spec): + sched = SlurmScheduler( + partition="gpu", + modules=["cuda/12.6", "anaconda3"], + ) + script = sched.render(spec) + assert "module load cuda/12.6" in script + assert "module load anaconda3" in script + + def test_render_extra_sbatch(self, spec): + sched = SlurmScheduler( + partition="gpu", + extra_sbatch=["--mem=64G", "--exclusive"], + ) + script = sched.render(spec) + assert "#SBATCH --mem=64G" in script + assert "#SBATCH --exclusive" in script + + def test_render_constraint(self, spec): + sched = SlurmScheduler(partition="gpu", constraint="gpu80g") + script = sched.render(spec) + assert "#SBATCH --constraint=gpu80g" in script + + +# ========================================================================= +# LocalScheduler.render +# ========================================================================= + + +class TestLocalScheduler: + """Tests for local execution backend.""" + + @pytest.fixture(autouse=True) + def _skip_image_check(self): + with mock.patch.object(LocalScheduler, "_check_image_exists"): + yield + + @pytest.fixture() + def spec(self) -> ProfileJobSpec: + return ProfileJobSpec( + collect="perf", + model_path="Qwen/Qwen3-8B", + ) + + def test_render_with_gpus(self, spec): + sched = LocalScheduler(gpus="0,1") + output = sched.render(spec) + assert "device=0,1" in output + assert "docker run" in output + + def test_render_without_gpus(self, spec): + sched = LocalScheduler(gpus="") + output = sched.render(spec) + assert "CUDA_VISIBLE_DEVICES" not in output + + def 
test_render_has_command(self, spec): + sched = LocalScheduler() + output = sched.render(spec) + assert "scripts/run_stage_profile.py" in output + assert "SGLANG_PROFILE_KERNELS=1" in output + + def test_render_workdir(self, spec): + sched = LocalScheduler(workdir="/my/project") + output = sched.render(spec) + # Docker mode: workdir is used for log scanning, not in the docker command + assert "docker run" in output + assert "scripts/run_stage_profile.py" in output + + def test_dry_run_equals_render(self, spec): + sched = LocalScheduler(gpus="0") + assert sched.dry_run(spec) == sched.render(spec) + + +# ========================================================================= +# CLI: flowsim init +# ========================================================================= + + +class TestCLIInit: + """Tests for `flowsim init` subcommand.""" + + def test_init_no_args_shows_help(self, capsys): + from scripts.cli import _cmd_init + + with pytest.raises(SystemExit) as exc_info: + _cmd_init([]) + assert exc_info.value.code != 0 + + def test_init_k8s_creates_template(self, tmp_path: Path): + config_dir = tmp_path / "flowsim" + with mock.patch("scripts.cli._CONFIG_DIR", config_dir): + from scripts.cli import _cmd_init + + rc = _cmd_init(["k8s"]) + assert rc == 0 + cfg_file = config_dir / "k8s.yaml" + assert cfg_file.exists() + content = cfg_file.read_text() + assert "kubeconfig:" in content + assert "namespace:" in content + # Template should have comments + assert content.startswith("#") + # Should be valid YAML + cfg = yaml.safe_load(content) + assert "kubeconfig" in cfg + assert "namespace" in cfg + + def test_init_slurm_creates_template(self, tmp_path: Path): + config_dir = tmp_path / "flowsim" + with mock.patch("scripts.cli._CONFIG_DIR", config_dir): + from scripts.cli import _cmd_init + + rc = _cmd_init(["slurm"]) + assert rc == 0 + cfg_file = config_dir / "slurm.yaml" + assert cfg_file.exists() + content = cfg_file.read_text() + assert "partition:" in content + assert 
"cli_prefix:" in content + # Template should have comments + assert content.startswith("#") + cfg = yaml.safe_load(content) + assert "partition" in cfg + + def test_init_refuses_overwrite(self, tmp_path: Path): + config_dir = tmp_path / "flowsim" + config_dir.mkdir() + (config_dir / "slurm.yaml").write_text("existing: true\n") + + with mock.patch("scripts.cli._CONFIG_DIR", config_dir): + from scripts.cli import _cmd_init + + rc = _cmd_init(["slurm"]) + assert rc != 0 # should refuse + + def test_init_force_overwrite(self, tmp_path: Path): + config_dir = tmp_path / "flowsim" + config_dir.mkdir() + (config_dir / "slurm.yaml").write_text("existing: true\n") + + with mock.patch("scripts.cli._CONFIG_DIR", config_dir): + from scripts.cli import _cmd_init + + rc = _cmd_init(["slurm", "--force"]) + assert rc == 0 + content = (config_dir / "slurm.yaml").read_text() + assert "partition:" in content + assert "existing" not in content + + def test_init_config_copies_file(self, tmp_path: Path): + # User has an existing config + user_cfg = tmp_path / "my-k8s.yaml" + user_cfg.write_text("namespace: prod\nkubeconfig: /etc/kube\n") + + config_dir = tmp_path / "flowsim" + with mock.patch("scripts.cli._CONFIG_DIR", config_dir): + from scripts.cli import _cmd_init + + rc = _cmd_init(["k8s", "--config", str(user_cfg)]) + assert rc == 0 + installed = config_dir / "k8s.yaml" + assert installed.exists() + cfg = yaml.safe_load(installed.read_text()) + assert cfg["namespace"] == "prod" + + def test_init_config_missing_file(self): + from scripts.cli import _cmd_init + + rc = _cmd_init(["k8s", "--config", "/nonexistent/path.yaml"]) + assert rc != 0 + + +# ========================================================================= +# CLI: flowsim submit (parse/dry-run only, no actual submission) +# ========================================================================= + + +class TestCLISubmit: + """Tests for `flowsim submit` argument parsing and dry-run.""" + + @pytest.fixture(autouse=True) + 
def _skip_image_check(self): + with mock.patch.object(LocalScheduler, "_check_image_exists"): + yield + + def _run(self, *args: str, expect_ok: bool = True) -> str: + """Run submit via the Python function, capture stdout.""" + from scripts.cli.submit import main as submit_main + import io + from contextlib import redirect_stdout + + buf = io.StringIO() + with redirect_stdout(buf): + submit_main(list(args)) + return buf.getvalue() + + def test_submit_help(self, capsys): + from scripts.cli.submit import main as submit_main + + with pytest.raises(SystemExit) as exc_info: + submit_main(["--help"]) + assert exc_info.value.code == 0 + out = capsys.readouterr().out + assert "--scheduler" in out + assert "local" in out + + def test_submit_missing_required(self): + from scripts.cli.submit import main as submit_main + + with pytest.raises(SystemExit): + submit_main([]) + + def test_submit_local_dry_run(self): + out = self._run( + "--scheduler", + "local", + "--collect", + "perf", + "--model-path", + "Qwen/Qwen3-8B", + "--dry-run", + ) + assert "scripts/run_stage_profile.py" in out + assert "SGLANG_PROFILE_KERNELS=1" in out + + def test_submit_local_dry_run_with_gpus(self): + out = self._run( + "--scheduler", + "local", + "--collect", + "perf", + "--model-path", + "Qwen/Qwen3-8B", + "--local-gpus", + "0,1", + "--dry-run", + ) + assert "device=0,1" in out + + def test_submit_k8s_dry_run(self): + out = self._run( + "--scheduler", + "k8s", + "--collect", + "perf", + "--model-path", + "Qwen/Qwen3-8B", + "--k8s-namespace", + "default", + "--dry-run", + ) + assert "apiVersion: batch/v1" in out + assert "kind: Job" in out + + def test_submit_slurm_dry_run(self): + out = self._run( + "--scheduler", + "slurm", + "--collect", + "perf", + "--model-path", + "Qwen/Qwen3-8B", + "--slurm-partition", + "gpu", + "--dry-run", + ) + assert "#!/bin/bash" in out + assert "#SBATCH --partition=gpu" in out + + +# ========================================================================= +# Config 
loading +# ========================================================================= + + +class TestConfig: + """Tests for config file loading and saving.""" + + def test_save_and_load_yaml(self, tmp_path: Path): + from schedulers.config import _save_yaml, _load_yaml + + data = {"partition": "gpu", "account": "proj"} + path = tmp_path / "test.yaml" + _save_yaml(path, data) + loaded = _load_yaml(path) + assert loaded == data + + def test_cfg_get(self): + from schedulers.config import cfg_get + + cfg = {"key": "value", "empty": ""} + assert cfg_get(cfg, "key", "default") == "value" + assert cfg_get(cfg, "empty", "default") == "" + assert cfg_get(cfg, "missing", "default") == "default"
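
The `cfg_get` assertions above pin down a subtle contract: a key that is present but falsy (like the empty string) wins over the default, and only a genuinely missing key falls back. A minimal sketch consistent with those assertions — the real `schedulers.config` helper may differ in details:

```python
from typing import Any


def cfg_get(cfg: dict, key: str, default: Any = None) -> Any:
    """Return cfg[key] when the key is present (even if falsy), else default.

    Sketch of the contract exercised by TestConfig.test_cfg_get; the
    actual schedulers.config implementation may carry extra behavior.
    """
    # Membership test, not truthiness: "" and 0 are valid config values
    # and must not be silently replaced by the default.
    return cfg[key] if key in cfg else default
```

The membership test is the point: the tempting one-liner `cfg.get(key) or default` would wrongly turn an empty string into the default, failing the `cfg_get(cfg, "empty", "default") == ""` assertion. (Python's built-in `dict.get(key, default)` already has the correct semantics.)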