Commit 871d423

Authored by TerrenceZhangX (Tao Zhang) and Copilot

Stage-Separated Profiling with Warm KV Cache Support (#9)

Add a stage-separated profiling pipeline for SGLang inference servers.

Motivation:
- Decouple prefill/decode profiling: N+M runs instead of N×M
- Warm KV cache support (`--existing-ctx`) for agentic/long-context workloads

New files:
- scripts/run_stage_profile.py — main orchestrator (warmup → profile → parse → analyze)
- utils/cross_rank_agg.py — cross-rank kernel aggregation
- utils/shape_merge.py — merge kernel shapes into timing CSVs
- utils/net.py — shared wait_for_port helper
- tests/integration/test_stage_profile_configs.py
- tests/unit/test_cross_rank_agg.py, tests/unit/test_shape_merge.py

Modified:
- simulator/base_parser.py — None-safe dtype lookups
- simulator/benchmarks/nccl_benchmarks.py — ValueError guard
- README.md — stage profiling docs
- Several test files updated for compatibility

Co-authored-by: Tao Zhang <zhangt@microsoft.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

1 parent 25a452b commit 871d423

17 files changed

Lines changed: 3230 additions & 84 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -1,5 +1,6 @@
 *.csv
 *.pyc
+__pycache__/
 *.egg-info
 tests/test-artifacts/
 unknown_kernels.json
```

README.md

Lines changed: 109 additions & 0 deletions
@@ -174,6 +174,115 @@ ls -lh /data/flowsim-simulate/  # Parsed CSV, summary, simulation artifacts

---

## Stage Profiling (`run_stage_profile.py`)

`scripts/run_stage_profile.py` is the single entry point for **stage-separated** profiling: it captures prefill (EXTEND) and decode traces independently, parses them, runs cross-rank kernel analysis, and optionally collects kernel input shapes.
### Quick reference

Each profiling request produces **two** stage-separated traces:

- **EXTEND** (prefill) — processes `input_len` new tokens (with optional `existing_ctx` tokens already in the KV cache)
- **DECODE** — the profiler captures `decode-tokens` decode batch steps

The profiler captures exactly **one** EXTEND batch and `decode-tokens` DECODE batches per run.
| Flag | Description | Default |
|---|---|---|
| `--input-len` | Number of new prefill tokens per request (EXTEND) | 2048 |
| `--existing-ctx` | Tokens already in the KV cache from a prior request (0 = cold prefill) | 0 |
| `--bs` | Batch size (concurrent requests) | 1 |
| `--decode-tokens` | Number of decode tokens to generate (= number of decode batches profiled) | 32 |
| Mode | What it does |
|---|---|
| `--collect perf` | Profile a single (bs, input_len, existing_ctx) point → trace (EXTEND + DECODE) → parse → cross-rank analysis |
| `--collect shapes` | Re-run **without CUDA graph** to capture kernel input shapes, then merge them into the timing CSVs (both EXTEND and DECODE) |
| `--collect all` | Both phases back-to-back (auto-restarts the server in between). Requires `--launch-server`. |

`--collect` is required; use `perf`, `shapes`, or `all`.
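The mode selection maps naturally onto a required choice flag. A minimal sketch of how `--collect` and the sizing flags from the tables above could be declared with `argparse` — a hypothetical reconstruction, not the actual code in `run_stage_profile.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI surface described above; defaults match the flag table.
    p = argparse.ArgumentParser(prog="run_stage_profile.py")
    p.add_argument("--collect", required=True, choices=["perf", "shapes", "all"],
                   help="perf = timing traces, shapes = kernel shapes, all = both")
    p.add_argument("--input-len", type=int, default=2048,
                   help="new prefill tokens per request (EXTEND)")
    p.add_argument("--existing-ctx", type=int, default=0,
                   help="tokens already in KV cache (0 = cold prefill)")
    p.add_argument("--bs", type=int, default=1, help="concurrent requests")
    p.add_argument("--decode-tokens", type=int, default=32,
                   help="decode tokens generated = decode batches profiled")
    return p

args = build_parser().parse_args(["--collect", "perf", "--bs", "4"])
print(args.collect, args.bs, args.input_len)  # perf 4 2048
```

Declaring `--collect` with `required=True` and `choices=[...]` makes omitting it, or passing anything other than the three modes, a hard parse error rather than a silent misconfiguration.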
### Examples

**Cold prefill** (server already running):

```bash
python3 scripts/run_stage_profile.py \
  --collect perf \
  --bs 1 --input-len 2048 --decode-tokens 32 \
  --output-dir /workspace/traces \
  --host 0.0.0.0 --port 30001
```
**With existing KV cache context:**

```bash
python3 scripts/run_stage_profile.py \
  --collect perf \
  --bs 4 --input-len 512 --existing-ctx 4096 --decode-tokens 32 \
  --output-dir /workspace/traces \
  --launch-server \
  --server-opts "--model-path Qwen/Qwen3-235B-A22B-FP8 --tp 4 --host 0.0.0.0 --port 30001"
```
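For the warm-cache run above, each request's EXTEND batch computes only the new tokens while attending over the full cached-plus-new context. A quick illustration of the token accounting (an illustration only, not repo code):

```python
def extend_token_accounting(bs: int, input_len: int, existing_ctx: int) -> dict:
    # Per the flag semantics above: existing_ctx tokens are already in the KV
    # cache; input_len new tokens are prefilled per request.
    return {
        "new_tokens_per_request": input_len,
        "attended_context_per_request": existing_ctx + input_len,
        "new_tokens_per_extend_batch": bs * input_len,
    }

acc = extend_token_accounting(bs=4, input_len=512, existing_ctx=4096)
print(acc)  # 512 new tokens each, 4608 attended, 2048 new tokens in the batch
```

This is why warm-cache EXTEND profiles differ from cold prefill at the same `input_len`: attention cost scales with the 4608-token context while the GEMM-bound work scales with the 512 new tokens.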
**Collect shapes only** (requires a no-CUDA-graph server):

```bash
python3 scripts/run_stage_profile.py \
  --collect shapes \
  --output-dir /workspace/sweep_P1_tp4 \
  --launch-server \
  --server-opts "--model-path Qwen/Qwen3-235B-A22B-FP8 --tp 4 --host 0.0.0.0 --port 30001"
```

When `--collect shapes` is used with `--launch-server`, the server is automatically started with `--disable-cuda-graph --disable-cuda-graph-padding`.
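With `--launch-server`, the orchestrator must block until the freshly spawned server is actually accepting connections before sending profiling requests. The repo factors this into `utils/net.py` as `wait_for_port`; a plausible sketch of such a helper (the exact signature in the repo is an assumption):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 300.0,
                  interval: float = 1.0) -> bool:
    """Poll until a TCP connect to (host, port) succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A caller would invoke this right after spawning the server process and abort the run, rather than hang, if the helper returns `False`.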
**Full pipeline** (perf → auto-restart → shapes → merge):

```bash
python3 scripts/run_stage_profile.py \
  --collect all \
  --output-dir /workspace/sweep_P1_tp4 \
  --launch-server \
  --server-opts "--model-path Qwen/Qwen3-235B-A22B-FP8 --tp 4 --host 0.0.0.0 --port 30001"
```
### Output structure

```
sweep_P1_tp4/
├── sweep_summary.json
├── bs1_input2048_ctx0/
│   ├── *-TP-*-EXTEND.trace.json.gz
│   ├── *-TP-*-DECODE.trace.json.gz
│   ├── parsed/
│   │   ├── TP-0-EXTEND.csv
│   │   ├── TP-0-DECODE.csv
│   │   └── ...
│   ├── analysis_extend.json
│   └── analysis_decode.json
└── ...
```
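A downstream consumer can discover per-config artifacts by globbing the tree above. A short sketch, where the directory and file patterns come from the layout shown and nothing else is assumed:

```python
from pathlib import Path

def list_stage_artifacts(sweep_dir: str) -> dict:
    # Collect parsed per-rank CSVs and per-stage analysis JSONs for each
    # bs*_input*_ctx* config directory in the sweep output.
    out = {}
    for cfg in sorted(Path(sweep_dir).glob("bs*_input*_ctx*")):
        out[cfg.name] = {
            "extend_csvs": sorted(p.name for p in cfg.glob("parsed/TP-*-EXTEND.csv")),
            "decode_csvs": sorted(p.name for p in cfg.glob("parsed/TP-*-DECODE.csv")),
            "analysis": sorted(p.name for p in cfg.glob("analysis_*.json")),
        }
    return out
```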
After `--collect shapes`, each `parsed/TP-*-DECODE.csv` gains a `Dims` column with kernel tensor shapes.
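The merge performed by `utils/shape_merge.py` can be pictured as joining the no-CUDA-graph shape capture onto the timing rows by kernel name and occurrence order. A simplified sketch — the column names other than `Dims` are illustrative assumptions, not the repo's actual schema:

```python
from collections import defaultdict

def merge_shapes(timing_rows: list[dict], shape_rows: list[dict]) -> list[dict]:
    # timing_rows and shape_rows are lists of CSV-row dicts keyed by "Kernel";
    # shape_rows also carry "Dims". Match the i-th occurrence of each name.
    shapes = defaultdict(list)
    for row in shape_rows:
        shapes[row["Kernel"]].append(row["Dims"])
    seen = defaultdict(int)
    merged = []
    for row in timing_rows:
        name = row["Kernel"]
        idx = seen[name]
        # Leave Dims empty when the shape run saw fewer occurrences.
        dims = shapes[name][idx] if idx < len(shapes[name]) else ""
        seen[name] += 1
        merged.append({**row, "Dims": dims})
    return merged
```

Matching by occurrence order rather than timestamp is what makes the join robust to the timing differences between the CUDA-graph perf run and the no-CUDA-graph shape run.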
### Helper scripts

| Script | Purpose |
|---|---|
| `tests/integration/test_stage_profile_configs.py` | Integration tests for `--collect {perf,shapes,all}` across parallelism configs. Run with `pytest` inside Docker; filter with `RUN_CONFIGS=P1`. |
### Utilities (`utils/`)

| File | Purpose |
|---|---|
| `utils/cross_rank_agg.py` | Cross-rank kernel aggregation (symmetric collectives → min, asymmetric → max, compute → mean) |
| `utils/shape_merge.py` | Merge kernel shape data into timing CSVs |
| `utils/net.py` | Shared networking helpers (`wait_for_port`) |
| `utils/merge_trace.py` | Merge multi-rank traces into a single Perfetto-compatible file |
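The aggregation rules in the table above (symmetric collectives → min, asymmetric → max, compute → mean) can be sketched as follows; the kernel classifier here is a stand-in assumption, not the logic actually used by `utils/cross_rank_agg.py`:

```python
from statistics import mean

def classify(kernel_name: str) -> str:
    # Stand-in classifier; the real rules live in utils/cross_rank_agg.py.
    if "AllReduce" in kernel_name or "AllGather" in kernel_name:
        return "symmetric_collective"
    if "Send" in kernel_name or "Recv" in kernel_name:
        return "asymmetric_collective"
    return "compute"

def aggregate_across_ranks(per_rank_us: dict[str, list[float]]) -> dict[str, float]:
    # per_rank_us maps kernel name -> one duration (us) per TP rank.
    rules = {"symmetric_collective": min,   # slowest rank gates; min strips wait time
             "asymmetric_collective": max,  # the busiest rank is the real cost
             "compute": mean}               # ranks do symmetric work; average out noise
    return {name: rules[classify(name)](durs) for name, durs in per_rank_us.items()}

print(aggregate_across_ranks({
    "ncclAllReduce": [105.0, 98.0, 101.0, 99.0],
    "ncclSend":      [12.0, 30.0, 11.0, 13.0],
    "gemm_fp8":      [50.0, 52.0, 51.0, 51.0],
}))
```

Taking the minimum for symmetric collectives reflects that their recorded durations include time spent waiting for stragglers, so the fastest rank best approximates the pure communication cost.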
---

## For Developers

### Customizing Profiling Workloads
