Merged
56 commits
28301e0
feat: add K8s and Slurm scheduler backends for profiling jobs
Mar 17, 2026
26c9f47
feat: switch to proper API clients for remote submission
Mar 17, 2026
52294a6
chore: add proper pyproject.toml with dependency declarations
Mar 17, 2026
54b6152
fix: make pip install -e . and entry point actually work
Mar 17, 2026
9e1c1f4
refactor: unified CLI as 'flowsim submit' instead of 'flowsim-submit'
Mar 17, 2026
d37d8f3
fix: 'flowsim submit' submits by default, --dry-run to preview
Mar 17, 2026
af48f0c
fix: validate cluster connection params before submit
Mar 17, 2026
87a2c33
feat: support env vars for cluster connection params
Mar 17, 2026
63ab491
feat: config-first approach with flowsim init
Mar 17, 2026
7116fef
refactor: flowsim init takes CLI args instead of interactive prompts
Mar 17, 2026
8987c38
feat: PD disaggregation support + multi-node Docker test infra
Mar 17, 2026
6f1bda2
chore: add dev-setup.sh/dev-teardown.sh for one-click test clusters
Mar 17, 2026
d2bb08e
feat: add local scheduler backend — flowsim submit --scheduler local
Mar 17, 2026
e5e303c
test: add 61 unit tests for scheduler CLI, backends, and config
Mar 17, 2026
c60cd11
feat: persistent logs under output_dir, flowsim status/logs, refuse K…
Mar 17, 2026
ea3c27a
fix: flowsim logs shows all log files (stdout + stderr) with listing
Mar 17, 2026
eb46c36
fix: flowsim logs shows file locations + actionable commands instead …
Mar 17, 2026
0e59219
test: integration tests for all 3 scheduler backends (local/k8s/slurm)
Mar 17, 2026
2c36af0
feat: align CLI with standard job platform APIs
Mar 17, 2026
8cd62f8
fix: CLI only shows scheduler-specific args based on --scheduler
Mar 17, 2026
84c8953
fix: use python3 instead of python in profile command
Mar 17, 2026
fab6314
fix: use YYYYMMDD_HHMMSS timestamp in log filenames
Mar 17, 2026
8f76052
scheduler: add Slurm CLI mode (sbatch/squeue/scancel) + integration test
Mar 18, 2026
3edd5f4
slurm: use YYYYMMDD_HHMMSS timestamp for output dirs (consistent with…
Mar 18, 2026
9bc2d94
docs: add scheduler README with CLI usage and architecture overview
Mar 18, 2026
a92e32b
profile: add --sweep for multi-point profiling in one job
Mar 18, 2026
0aeff89
test: add sweep integration tests (inline + file)
Mar 18, 2026
f58d791
refactor: dedup shared utilities, deprecate Slurm REST mode
Mar 18, 2026
3152a72
review: fix remaining issues (stale docstring, unused vars, README de…
Mar 18, 2026
19973cb
remove Slurm REST dead code, rewrite README in English
Mar 18, 2026
892eeac
review: normalize Slurm states, implement _logs_cli, dedup image chec…
Mar 18, 2026
73cedbe
fix: remove --slurm-submit-via from integration tests
Mar 18, 2026
0a30f7f
refactor: move test infra to tests/integration/infra/, delete unused …
Mar 18, 2026
7831272
remove untested PD disaggregation code
Mar 18, 2026
b6dbbbb
simplify flowsim init: write annotated template instead of argparse
Mar 19, 2026
95028db
add --config flag to flowsim init
Mar 19, 2026
ac41690
use template files for flowsim init instead of inline strings
Mar 19, 2026
da8ab00
update README: reflect template-file init with --config option
Mar 19, 2026
059f3ea
docs: streamline READMEs, unify examples, remove legacy manual workflow
Mar 19, 2026
b0dfdd5
format: fix with black
Mar 19, 2026
9e2541a
docs: add --existing-ctx and --decode-tokens to all examples, default…
Mar 19, 2026
880fe05
refactor: remove PyYAML fallback, make it a core dependency
Mar 19, 2026
236548a
fix: reject k8s submit when no PVC or hostPath configured
Mar 19, 2026
9daee82
docs: add missing parameters
Mar 19, 2026
5e3d1bb
fix: unique job names, Slurm exclusive GPU, remove list_jobs prefix f…
Mar 19, 2026
2a718a7
refactor: restructure CLI into scripts/cli/ subpackage
Mar 19, 2026
31dc15b
fix: use runtime:nvidia for slurm compute node GPU access
Mar 19, 2026
b7ec2cb
refactor: rename test_scheduler_local.py → test_scheduler.py, rewrite…
Mar 19, 2026
7d40889
docs: update output structure to include logs, merged, and shape dirs
Mar 19, 2026
86dd517
Update scripts/cli/submit.py
TerrenceZhangX Mar 20, 2026
2dbb896
Update schedulers/local.py
TerrenceZhangX Mar 20, 2026
f30329a
Update tests/integration/infra/kind-multi-node.yaml
TerrenceZhangX Mar 20, 2026
4718764
Update tests/integration/infra/slurm-compose.yaml
TerrenceZhangX Mar 20, 2026
8f79054
Update tests/integration/infra/slurm.conf
TerrenceZhangX Mar 20, 2026
7d64cfc
fix: parameterize hardcoded host paths in slurm-compose
Mar 20, 2026
76e75bd
fix: auto-mount output_dir in docker/enroot container modes
Mar 20, 2026
3 changes: 2 additions & 1 deletion .gitignore
@@ -6,4 +6,5 @@ tests/test-artifacts/
unknown_kernels.json
/artifacts
/server_profile
/server_simulate
/stage_traces/
237 changes: 73 additions & 164 deletions README.md
@@ -20,6 +20,8 @@ The project supports rapid deployment using Docker, includes scripts for environ
## Table of Contents

- [Getting Started](#getting-started)
- [Stage Profiling](#stage-profiling)
- [Scheduler Backends](#scheduler-backends)
- [For Developers](#for-developers)
- [Risks and limitations](#risks-and-limitations)
- [License](#license)
@@ -49,240 +51,147 @@ make build-docker

This creates a local image named `flowsim-image` with FlowSim patches already applied to sglang.

### 2. Profile (Generate Traces)

Use `flowsim submit` to capture stage-separated traces (EXTEND + DECODE), parse them, and run cross-rank analysis — all in one step. See [Stage Profiling](#stage-profiling) for how stages and collection modes work.

```bash
pip install -e .
flowsim submit --scheduler local \
--collect all \
--model-path workload/models/configs/Qwen3-235B-A22B \
--tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \
--extra-server-opts "--load-format dummy"
```

For K8s / Slurm clusters, see [Scheduler Backends](#scheduler-backends).

**Tip:** Trace files can be visualized at [Perfetto UI](https://ui.perfetto.dev/). For multi-GPU traces, merge them first:

```bash
python utils/merge_trace.py --trace_dir stage_traces/local/*/bs1_input2048_ctx0 --output merged.json
```
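
Conceptually, the merge step concatenates each rank's `traceEvents` into one Perfetto-compatible JSON while keeping ranks distinguishable. A minimal sketch of that idea — not the actual `utils/merge_trace.py` implementation, which may assign tracks differently:

```python
import gzip
import json
from pathlib import Path


def merge_perfetto_traces(trace_paths, output_path):
    """Concatenate traceEvents from per-rank trace files into one file.

    Illustrative only: assigns each input file its own pid so every
    rank appears as a separate track in the Perfetto UI.
    """
    merged = {"traceEvents": []}
    for rank, path in enumerate(sorted(str(p) for p in trace_paths)):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            trace = json.load(f)
        # Perfetto traces are either {"traceEvents": [...]} or a bare list.
        events = trace["traceEvents"] if isinstance(trace, dict) else trace
        for ev in events:
            ev = dict(ev)
            ev["pid"] = rank  # one track per rank
            merged["traceEvents"].append(ev)
    Path(output_path).write_text(json.dumps(merged))
```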

### 3. Simulate (Run Hardware Simulation)

Build and start the LLMCompass backend, then submit parsed traces for kernel-level simulation:

```bash
# Build backend image
sudo docker build -t llmcompass-backend -f backend/LLMCompass/Dockerfile backend/LLMCompass/
```

Then start the backend:

```bash
# Terminal 1: Start backend
sudo docker run --rm -p 8000:8000 llmcompass-backend
```

Then in another terminal, run the simulation:

```bash
# Terminal 2: Run simulation
sudo docker run --rm --network=host \
-v /data/flowsim:/workspace \
flowsim-image \
python -m scripts.run_simulate \
--trace-file /workspace/traces/bs1_input2048_ctx0/*-TP-0-EXTEND.trace.json.gz \
--api-url http://127.0.0.1:8000 \
--artifact-dir /workspace/simulate/llmcompass
```

**What this does:**
- Parses the trace into kernels
- Submits each kernel to the LLMCompass backend `/tasks` API
- Polls until all tasks complete
- Writes request/response artifacts to `/workspace/simulate/llmcompass`
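
The submit-then-poll loop above can be sketched as follows; the transport callables and the `task_id` / `status` field names are placeholders for illustration, not the actual LLMCompass API schema:

```python
import time


def run_kernels(kernels, post, get, poll_interval=0.0):
    """Submit every kernel, then poll until all tasks finish.

    `post(kernel)` is assumed to return a dict carrying a task id, and
    `get(task_id)` a dict carrying a status -- hypothetical shapes.
    """
    pending = [post(k)["task_id"] for k in kernels]
    results = {}
    while pending:
        still_pending = []
        for tid in pending:
            reply = get(tid)
            if reply["status"] == "done":
                results[tid] = reply
            else:
                still_pending.append(tid)
        pending = still_pending
        if pending and poll_interval:
            time.sleep(poll_interval)  # back off between polling rounds
    return results
```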

### 4. Inspect Results

```bash
ls -lh /data/flowsim/traces/    # Stage-separated traces + parsed CSVs
ls -lh /data/flowsim/simulate/  # Simulation artifacts
```

---

## Stage Profiling

FlowSim performs **stage-separated** profiling: it captures prefill (EXTEND) and decode traces independently, parses them, runs cross-rank kernel analysis, and optionally collects kernel input shapes.

### How stages work

Each profiling request produces **two** stage-separated traces:
- **EXTEND** (prefill) — processes `input_len` new tokens (with optional `existing_ctx` tokens already in KV cache)
- **DECODE** — captures `decode-tokens` decode batch steps (default 2)

### Collection modes

| Mode | What it does |
|---|---|
| `--collect perf` | Profile a single (bs, input_len, existing_ctx) point → trace → parse → cross-rank analysis |
| `--collect shapes` | Re-run **without CUDA graph** to capture kernel input shapes, then merge into timing CSVs |
| `--collect all` | Both phases back-to-back (auto-restarts the server in between) |

### Examples

```bash
# Basic profiling
flowsim submit --scheduler local \
--collect perf \
--model-path workload/models/configs/Qwen3-235B-A22B \
--tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \
--extra-server-opts "--load-format dummy"

# With existing KV cache context
flowsim submit --scheduler local \
--collect perf \
--model-path workload/models/configs/Qwen3-235B-A22B \
--tp 1 --bs 4 --input-len 512 --existing-ctx 4096 --decode-tokens 2 --gpus 1 \
--extra-server-opts "--load-format dummy"

# Full pipeline (perf + shapes)
flowsim submit --scheduler local \
--collect all \
--model-path workload/models/configs/Qwen3-235B-A22B \
--tp 1 --bs 1 --input-len 2048 --existing-ctx 0 --decode-tokens 2 --gpus 1 \
--extra-server-opts "--load-format dummy"

# Multi-point sweep
flowsim submit --scheduler local \
--collect all \
--model-path workload/models/configs/Qwen3-235B-A22B \
--sweep 1:2048:0 4:2048:0 8:2048:0 --decode-tokens 2 --gpus 1 \
--extra-server-opts "--load-format dummy"
```

For K8s / Slurm clusters, replace `--scheduler local` with `k8s` or `slurm`. See [schedulers/README.md](schedulers/README.md) for full scheduler documentation.

### Output structure

```
stage_traces/{scheduler}/{YYYYMMDD_HHMMSS}/
├── bs1_input2048_ctx0/
│   ├── *.trace.json.gz
│   ├── parsed/*.csv
│   ├── merged/*_merged.trace.csv
│   ├── shape_traces/ + shape_parsed/
│   ├── analysis_extend.json
│   └── analysis_decode.json
├── logs/
│   ├── server_*.{stdout,stderr}.log
│   ├── shape_server_*.{stdout,stderr}.log
│   └── {job_name}_*.{stdout,stderr}.log
└── sweep_summary.json
```
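
The `{YYYYMMDD_HHMMSS}` component is a plain `strftime` timestamp. A sketch of how such a run directory can be derived (the helper name is illustrative, not part of the CLI):

```python
from datetime import datetime
from pathlib import PurePosixPath


def run_dir(scheduler, base="stage_traces", now=None):
    """Build stage_traces/{scheduler}/{YYYYMMDD_HHMMSS} for one run."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return PurePosixPath(base) / scheduler / stamp
```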

After `--collect shapes`, each `parsed/TP-*-DECODE.csv` gains a `Dims` column with kernel tensor shapes.
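
The shape-merge step is essentially a join of shape records onto timing records. A minimal sketch, assuming rows keyed by a kernel-name field (the real `utils/shape_merge.py` may match on more than the name):

```python
def merge_shapes_into_timing(timing_rows, shape_rows, key="Kernel"):
    """Attach a 'Dims' column to timing rows by joining on kernel name.

    Field names ('Kernel', 'Dims') are illustrative; kernels with no
    captured shape get an empty Dims value.
    """
    dims_by_kernel = {row[key]: row["Dims"] for row in shape_rows}
    return [
        {**row, "Dims": dims_by_kernel.get(row[key], "")}
        for row in timing_rows
    ]
```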

- `parsed/`: Per-rank timing CSVs extracted from traces.
- `merged/`: Timing + shape columns joined into a single CSV per rank/stage.
- `shape_traces/` / `shape_parsed/`: Raw and parsed shape-profiling traces (generated by `--collect shapes` or `--collect all`).
- `logs/`: Server, shape-server, and job stdout/stderr logs.

### Helper scripts

| Script | Purpose |
|---|---|
| `tests/integration/test_stage_profile_configs.py` | Integration tests for `--collect {perf,shapes,all}` across parallelism configs. Run with `pytest` inside Docker. Filter with `RUN_CONFIGS=P1`. |

### Utilities (`utils/`)

| File | Purpose |
|---|---|
| `utils/cross_rank_agg.py` | Cross-rank kernel aggregation (symmetric collectives → min, asymmetric → max, compute → mean) |
| `utils/shape_merge.py` | Merge kernel shape data into timing CSVs |
| `utils/net.py` | Shared networking helpers (`wait_for_port`) |
| `utils/merge_trace.py` | Merge multi-rank traces into a single Perfetto-compatible file |
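
The aggregation rule in `utils/cross_rank_agg.py` reduces each kernel's per-rank durations to a single number. A sketch of just that rule (the function and kind names are illustrative):

```python
from statistics import mean


def aggregate_across_ranks(durations, kind):
    """Reduce per-rank durations for one kernel to a single value.

    Symmetric collectives take the min (ranks that finish late were
    waiting, not working), asymmetric collectives take the max, and
    compute kernels take the mean across ranks.
    """
    if kind == "symmetric_collective":
        return min(durations)
    if kind == "asymmetric_collective":
        return max(durations)
    if kind == "compute":
        return mean(durations)
    raise ValueError(f"unknown kernel kind: {kind!r}")
```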

---

## Scheduler Backends

For submitting profiling jobs to **local Docker**, **Kubernetes**, or **Slurm** clusters, use the `flowsim` CLI. See [schedulers/README.md](schedulers/README.md) for full documentation including per-scheduler parameters, configuration, and environment variables.

---

## For Developers

### Customizing Profiling Workloads