Open
32 commits
0d89afc
Initial modifications for sys info
anandhu-eng May 14, 2026
1769b6f
Add A40 (Prefill) + H100 (Decode) sysinfo example config
anandhu-eng May 15, 2026
8bb0649
further modification
anandhu-eng May 20, 2026
bb9b6cc
update doc, add per run datadictionary capture
anandhu-eng May 28, 2026
5d5ca3c
fixes for run_metadata
anandhu-eng May 28, 2026
26138e3
Merge branch 'main' into sysinfochanges
anandhu-eng May 28, 2026
011a625
fix: address PR review comments for sys_info capture
anandhu-eng May 29, 2026
ef10d77
fix: rename BenchmarkConfig.system_info → sys_info_capture and fix tests
anandhu-eng May 29, 2026
9b88a81
fix: sort imports in test_sysinfo_command.py (ruff)
anandhu-eng May 29, 2026
3184d91
test: add failure scenario tests for sys info capture
anandhu-eng Jun 2, 2026
e134a90
Merge branch 'main' into sysinfochanges
anandhu-eng Jun 2, 2026
7b8e276
Merge branch 'main' into sysinfochanges
anandhu-eng Jun 3, 2026
7550f81
refactor: rename sys_info_capture field to system_info in BenchmarkCo…
anandhu-eng Jun 4, 2026
1db9d01
Merge branch 'main' into sysinfochanges
anandhu-eng Jun 4, 2026
db721b0
fix: regenerate config templates after merge from main
anandhu-eng Jun 4, 2026
435e0ff
regenerated templates
anandhu-eng Jun 4, 2026
f6777b3
Delete examples/sysinfo_a40_prefill_h100_decode.yaml
anandhu-eng Jun 4, 2026
19b20bb
Address cve-2026-34993 (#333)
arekay-nv Jun 4, 2026
8d99b66
Merge branch 'main' into sysinfochanges
arekay-nv Jun 4, 2026
99b548b
fix: address PR review comments for run_metadata and sysinfo
anandhu-eng Jun 4, 2026
979d984
improve readme + force use report dir for sysinfo output
anandhu-eng Jun 5, 2026
4e1121b
pre commit
anandhu-eng Jun 5, 2026
d1539be
Merge branch 'main' into sysinfochanges
anandhu-eng Jun 5, 2026
60989d6
Merge branch 'main' into sysinfochanges
anandhu-eng Jun 8, 2026
5e58ffe
feat: make mlc-scripts an optional dependency for system_info capture
anandhu-eng Jun 11, 2026
fe21d12
test: fix test_capture assertion to match updated install hint
anandhu-eng Jun 11, 2026
1a39d48
style: apply ruff-format to test_capture.py
anandhu-eng Jun 11, 2026
7358b79
Merge branch 'main' into sysinfochanges
anandhu-eng Jun 11, 2026
8d418a1
fix: update sysinfo install hint to use local editable install
anandhu-eng Jun 11, 2026
fbcacea
Merge branch 'sysinfochanges' of https://github.com/mlcommons/endpoin…
anandhu-eng Jun 11, 2026
b87b96a
Merge branch 'main' into sysinfochanges
attafosu Jun 18, 2026
87f2d66
Merge branch 'main' into sysinfochanges
arekay-nv Jun 24, 2026
6 changes: 6 additions & 0 deletions AGENTS.md
@@ -31,6 +31,7 @@ uv run inference-endpoint probe --endpoints http://localhost:8765 --model test-m
uv run inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
uv run inference-endpoint benchmark online --endpoints URL --model NAME --dataset PATH --load-pattern poisson --target-qps 100
uv run inference-endpoint benchmark from-config --config config.yaml
uv run inference-endpoint sysinfo from-config --config config.yaml
```

### Backward-compatible setup (pip + venv)
@@ -60,6 +61,7 @@ inference-endpoint probe --endpoints http://localhost:8765 --model test-model
inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
inference-endpoint benchmark online --endpoints URL --model NAME --dataset PATH --load-pattern poisson --target-qps 100
inference-endpoint benchmark from-config --config config.yaml
inference-endpoint sysinfo from-config --config config.yaml
```

## Architecture
@@ -125,6 +127,7 @@ CLI is auto-generated from `config/schema.py` Pydantic models via cyclopts. Fiel

- **CLI mode** (`offline`/`online`): cyclopts constructs `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` (subclasses in `config/schema.py`) directly from CLI args. Type locked via `Literal`. `--dataset` is repeatable with TOML-style format `[perf|acc:]<path>[,key=value...]` (e.g. `--dataset data.csv,samples=500,parser.prompt=article`). Full accuracy support via `accuracy_config.eval_method=pass_at_1` etc.
- **YAML mode** (`from-config`): `BenchmarkConfig.from_yaml_file()` loads YAML, resolves env vars, and auto-selects the right subclass via Pydantic discriminated union. Optional `--timeout`/`--mode` overrides via `config.with_updates()`.
- **sysinfo from-config**: `SysInfoFileConfig.from_yaml_file()` reads the `system_info` key from the YAML file (extra top-level keys are ignored), then invokes `capture_system_info()`, which calls the `get-mlperf-multi-node-system-info` mlcflow script. Supports optional `node_config` for function-based node groupings (written to a temp YAML file passed as `node_config_file` to mlcflow). Independent of benchmark runs.
- **eval**: Not yet implemented (raises `CLIError` with a tracking issue link)

### Config Construction & Validation
@@ -165,6 +168,9 @@ src/inference_endpoint/
│ │ ├── __init__.py
│ │ ├── cli.py # benchmark_app: offline, online, from-config subcommands
│ │ └── execute.py # Phased execution: setup/run_threaded/finalize + BenchmarkContext
│ ├── sysinfo/
│ │ ├── __init__.py
│ │ └── cli.py # sysinfo_app: from-config subcommand (standalone sys info capture)
│ ├── probe.py # ProbeConfig + execute_probe()
│ ├── info.py # execute_info()
│ ├── validate.py # execute_validate()
197 changes: 188 additions & 9 deletions docs/commands/DESIGN.md
@@ -36,6 +36,7 @@ already-parsed models rather than raw `argparse.Namespace` objects.
| `info` | `main.py` | `commands/info.py` | Implemented |
| `validate-yaml` | `main.py` | `commands/validate.py` | Implemented |
| `init` | `main.py` | `commands/init.py` | Implemented |
| `sysinfo from-config` | `commands/sysinfo/cli.py` | `sys_info/capture.py` | Implemented |
| `eval` | `main.py` | inline stub (`CLIError`) | Reserved, not implemented |

## CLI Structure
@@ -53,6 +54,9 @@ inference-endpoint
| +-- online
| +-- from-config
|
+-- sysinfo
| +-- from-config
|
+-- probe
+-- info
+-- validate-yaml
@@ -94,8 +98,182 @@ commands/benchmark/execute.py::run_benchmark()
+-- construct endpoint client + sample issuer
+-- run BenchmarkSession in threaded wrapper
+-- finalize metrics and optional accuracy scoring
+-- if system_info is configured:
    write run_metadata.json
    capture_system_info() → mlcflow (hardware + serving config)
    patch run_metadata.json with serving config values
```

## System Info Capture

> **Requires the `sysinfo` optional dependency.** Install it with:
>
> ```bash
> # uv (recommended)
> uv sync --extra sysinfo
> # or pass --extra sysinfo directly to uv run, e.g.:
> uv run --extra sysinfo inference-endpoint benchmark from-config --config config.yaml
>
> # pip (from repo root)
> pip install -e ".[sysinfo]"
> ```
>
> If `mlc-scripts` is not installed and `system_info` is configured, the benchmark still completes and results are written first; system info capture is then attempted, fails with an error log, and the process exits 0.

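The fallback described in the note above can be sketched as follows. This is a minimal illustration, not the project's code: the module name `mlc`, the function names, and the hint text are all assumptions made for the example.

```python
import importlib.util
import logging

logger = logging.getLogger("sysinfo_sketch")

# Hypothetical hint text; the real message lives in the project source.
INSTALL_HINT = 'pip install -e ".[sysinfo]"'


def dependency_available(module_name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def try_capture(module_name: str = "mlc") -> bool:
    """Attempt capture; log an error and return False if the dependency is missing.

    The default module name "mlc" is an assumption for illustration.
    """
    if not dependency_available(module_name):
        logger.error(
            "system_info capture failed: %s is not installed (try: %s)",
            module_name,
            INSTALL_HINT,
        )
        return False
    # The real capture call would go here.
    return True
```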
System info capture collects hardware/software details from one or more nodes and writes a structured JSON file for MLPerf inference submissions. It runs in two contexts:

- **Standalone** (`sysinfo from-config`): triggered manually, independent of any benchmark run. See [Standalone Command (`sysinfo from-config`)](#standalone-command-sysinfo-from-config) below for a full example.
- **Integrated** (`benchmark from-config`): triggered automatically after a benchmark run completes if `system_info` is present in the config. For example:

```yaml
name: "llama3.1-8b-vllm-perf-c1000"
version: "1.0"
type: "online"

model_params:
  name: "meta-llama/Llama-3.1-8B-Instruct"
  temperature: 0.0
  top_p: 1.0
  max_new_tokens: 128

datasets:
  - name: cnn_dailymail::llama3_8b
    type: performance
    samples: 13368
    parser:
      input: prompt

settings:
  runtime:
    min_duration_ms: 600000
    max_duration_ms: 3600000
    scheduler_random_seed: 137
    dataloader_random_seed: 111
    n_samples_to_issue: 13368

  load_pattern:
    type: "concurrency"
    target_concurrency: 1000

  client:
    num_workers: 4

endpoint_config:
  endpoints:
    - "http://localhost:11001"
  api_key: null

report_dir: sglang_perf_c1000

system_info:
  system_name: H100x8_SGLang
  ssh_ids:
    - user@inference-node
  accelerator_backend: cuda
  exclude_current_system: true
  skip_ssh_key_file: false
  serving_node: user@inference-node
  endpoint_url: http://localhost:11001
  serving_framework: sglang
```

Then run:

```bash
inference-endpoint benchmark from-config --config config.yaml
```

System info capture runs automatically at the end of the benchmark.

Both paths call `sys_info/capture.py::capture_system_info()` and produce the same output JSON, as defined by the endpoints spec. If `run_metadata.json` is present in `report_dir`, both also patch it with serving configuration values extracted from the inference server's startup log. In integrated execution mode the file is always present, because the benchmark writes it before capture runs; in standalone execution mode it is patched only if it already exists there.
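
The patch-if-present behavior can be illustrated with a short sketch. The function name and the top-level merge layout are assumptions for the example; the real schema of `run_metadata.json` is defined by the project.

```python
import json
from pathlib import Path


def patch_run_metadata(report_dir: str, serving_config: dict) -> bool:
    """Merge serving config values into run_metadata.json if the file exists.

    Returns True when the file was patched and False when it was absent.
    """
    path = Path(report_dir) / "run_metadata.json"
    if not path.is_file():
        # Standalone mode with no earlier benchmark run: nothing to patch.
        return False
    metadata = json.loads(path.read_text())
    # Hypothetical layout: values are merged at the top level of the document.
    metadata.update(serving_config)
    path.write_text(json.dumps(metadata, indent=2))
    return True
```

Returning a boolean instead of raising keeps the standalone case, where the file may legitimately be missing, from being treated as a failure.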

### Config Reference (`SysInfoCaptureConfig`)

| Field | Type | Description |
| ------------------------ | ------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `system_name` | str | **Required.** Name of the system under test (e.g. `"H100x8_vLLM"`). Used as the MLPerf submission system identifier. |
| `ssh_ids` | `list[str]` | **Required.** Nodes to collect hardware info from. Format: `user@host` or `user@host:port`. |
| `accelerator_backend` | `"cuda"` \| `"rocm"` \| `"xpu"` \| `"none"` | GPU backend on the target nodes. Default: `"none"`. |
| `exclude_current_system` | bool | Skip the machine running this command; collect from `ssh_ids` only. Default: `false`. |
| `skip_ssh_key_file` | bool | Assume SSH key auth is pre-configured (skips mlcflow key-file lookup). Default: `false`. |
| `node_config` | object | Optional function-based node groupings (Prefill/Decode/etc). Maps function names to lists of `{node_name, no_of_nodes}` entries. `node_name` is matched as a case-insensitive substring against the detected GPU model name. |
| `serving_node` | str | SSH target for the inference server (`user@host` or `user@host:port`). When set, the capture SSHes into this node and reads `/tmp/serving.log` to extract serving config. Server stdout/stderr **must** be redirected there. |
| `endpoint_url` | str | Base URL of the running inference server. Probed via HTTP to detect the serving framework name and version (e.g. `"vLLM 0.9.0"`). |
| `serving_framework` | `"auto"` \| `"vllm"` \| `"sglang"` \| `"trtllm"` | Serving engine type used for startup log parsing. Default: `"auto"` (detected from the endpoint). |
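
As an illustration of the `user@host` / `user@host:port` format accepted by `ssh_ids` and `serving_node`, here is a minimal parsing sketch. `SshTarget` and `parse_ssh_id` are hypothetical names, not part of the project, and IPv6 literals are deliberately out of scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SshTarget:
    user: str
    host: str
    port: Optional[int] = None  # None means "use the SSH default"


def parse_ssh_id(value: str) -> SshTarget:
    """Parse 'user@host' or 'user@host:port' into its parts."""
    user, sep, rest = value.partition("@")
    if not sep or not user or not rest:
        raise ValueError(f"expected user@host[:port], got {value!r}")
    host, colon, port = rest.partition(":")
    if not host:
        raise ValueError(f"missing host in {value!r}")
    if colon:
        if not port.isdigit():
            raise ValueError(f"invalid port in {value!r}")
        return SshTarget(user, host, int(port))
    return SshTarget(user, host)
```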

### Capture Flow

System info capture is powered by the [mlperf-automations](https://github.com/mlcommons/mlperf-automations) project and orchestrates the following steps:

1. **Hardware collection** — SSHes into each node listed in `ssh_ids` and collects CPU, memory, GPU, networking, and OS information. If `exclude_current_system` is false, the local machine is also probed. Results are merged into a single `system_desc.json`.

2. **Serving config extraction** _(if `serving_node` is set)_ — SSHes into the inference server node and reads its startup log to extract parallelism settings (`tensor_parallel`, `pipeline_parallel`, `expert_parallel`) and batch size.

3. **Framework detection** _(if `endpoint_url` is set)_ — probes the live inference server via HTTP to detect the framework name and version (e.g. `"vLLM 0.9.0"`, `"SGLang 0.4.2"`). HTTP detection takes priority over the log-based result.

4. **Output** — always writes `system_desc.json` to the configured report directory. If a `run_metadata.json` is present in `report_dir`, it is also patched with the extracted serving config values (fields that could not be parsed remain `null`).
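
Two rules from the steps above, HTTP detection taking priority over the log-based result and unparsed fields remaining `null`, can be sketched as below. The function names and the exact key list are assumptions for illustration; the real field names are whatever the capture script writes.

```python
from typing import Optional

# Hypothetical key list, taken from the fields named in this document.
SERVING_FIELDS = ("tensor_parallel", "pipeline_parallel", "expert_parallel", "batch")


def resolve_framework(http_detected: Optional[str], log_detected: Optional[str]) -> Optional[str]:
    """Prefer the framework string detected over HTTP; fall back to the log."""
    return http_detected or log_detected


def build_serving_config(parsed: dict) -> dict:
    """Keep every expected field; values that were not parsed stay None (JSON null)."""
    return {field: parsed.get(field) for field in SERVING_FIELDS}
```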

### Standalone Command (`sysinfo from-config`)

```bash
inference-endpoint sysinfo from-config -c examples/sysinfo_example.yaml
```

```yaml
report_dir: results/h100_sysinfo/ # output directory

system_info:
  system_name: H100x7_vLLM
  ssh_ids:
    - root@ssh1:22 # prefill node 1
    - root@ssh2:22 # prefill node 2
    - root@ssh3:22 # decode node 1
    - root@ssh4:22 # decode node 2
    - root@ssh5:22 # decode node 3
    - root@ssh6:22 # decode node 4
    - root@ssh7:22 # decode node 5

  accelerator_backend: cuda
  exclude_current_system: true # master node is orchestrator-only
  skip_ssh_key_file: false

  # serving_node: where the inference server process is running.
  # Server stdout/stderr must be redirected to /tmp/serving.log on that node.
  # If multiple serving nodes exist, point to any one; all nodes are assumed
  # to run the same serving framework version.
  serving_node: root@ssh1:22

  node_config: # optional: function-based node groupings
    Prefill:
      - node_name: NVIDIA H100 # case-insensitive substring of detected GPU model
        no_of_nodes: 2
    Decode:
      - node_name: NVIDIA H100
        no_of_nodes: 5
```

Output is written to `report_dir/system_desc.json`.

### `node_config` Validation

When `node_config` is provided, the automations script enforces:

- Every `node_name` must match at least one probed node's GPU model string (case-insensitive substring). Unmatched names return an error.
- For each unique `node_name`, the total `no_of_nodes` across all function groups must not exceed the number of nodes of that type actually probed. Declaring more nodes than were SSHed into is an error.
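
The two rules above can be sketched in a few lines. This is not the automations script itself: `validate_node_config` is a hypothetical helper, and it assumes one detected GPU model string per probed node.

```python
from collections import Counter


def validate_node_config(node_config: dict, probed_gpu_models: list) -> list:
    """Return a list of error strings; an empty list means the config is valid.

    probed_gpu_models holds one GPU model string per probed node (an assumption).
    """
    # Sum no_of_nodes per node_name across all function groups.
    requested = Counter()
    for entries in node_config.values():
        for entry in entries:
            requested[entry["node_name"].lower()] += entry["no_of_nodes"]

    errors = []
    for name, total in requested.items():
        # Case-insensitive substring match against each probed node.
        available = sum(1 for model in probed_gpu_models if name in model.lower())
        if available == 0:
            errors.append(f"node_name {name!r} matches no probed node")
        elif total > available:
            errors.append(f"node_name {name!r}: {total} nodes declared, {available} probed")
    return errors
```

With the standalone example above (2 Prefill plus 5 Decode nodes, all `NVIDIA H100`), seven probed H100 nodes pass, while six would be reported as an error.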

### Error Handling

In standalone execution mode (`sysinfo from-config`), any capture failure propagates as an error and exits non-zero.

In integrated execution mode (triggered at the end of a benchmark run), `system_info` failures never abort the benchmark — `results.json` and `run_metadata.json` are written before capture runs, so benchmark output is always complete. Capture errors are logged but the process exits 0.

**Two warnings not to ignore:**

- **SSH failure on a node**: the error is logged in the MLC output, but capture continues with the remaining nodes. The resulting `system_desc.json` will be missing that node's hardware info. Always verify that `system_desc.json` is complete before submission.
- **Serving config unavailable**: if the serving node is unreachable, the serving config fields in `run_metadata.json` (`tensor_parallel`, `pipeline_parallel`, `batch`, etc.) remain `null`. Check the MLC log and re-run `sysinfo from-config` manually if needed.
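
The difference between the two modes can be sketched as a small wrapper. `run_capture` and its `integrated` flag are hypothetical; the sketch only shows the propagate-versus-log policy described in this section.

```python
import logging
from typing import Callable

logger = logging.getLogger("sysinfo_sketch")


def run_capture(capture: Callable[[], None], integrated: bool) -> int:
    """Run the capture step and return a process exit code."""
    try:
        capture()
    except Exception:
        if not integrated:
            # Standalone mode: let the failure propagate to the caller.
            raise
        # Integrated mode: benchmark results are already on disk, so only log.
        logger.exception("system_info capture failed; benchmark results are unaffected")
    return 0
```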

---

## `probe` Command

`probe` is a lightweight connectivity check built on the same endpoint/client stack as the main
@@ -141,12 +319,13 @@ not been implemented yet.

## Integration Points

| Dependency | Role |
| --------------------------- | ---------------------------------------------------------------- |
| `main.py` | App definition, logging setup, global error handling |
| `config/` | Defines CLI/YAML schema models and config loading |
| `dataset_manager/` | Loads performance and accuracy datasets |
| `endpoint_client/` | Sends requests to endpoint workers |
| `load_generator/session.py` | Runs the benchmark session |
| `metrics/` | Aggregates and reports benchmark results |
| `evaluation/` | Scores collected accuracy datasets during benchmark finalization |
| Dependency | Role |
| --------------------------- | -------------------------------------------------------------------- |
| `main.py` | App definition, logging setup, global error handling |
| `config/` | Defines CLI/YAML schema models and config loading |
| `dataset_manager/` | Loads performance and accuracy datasets |
| `endpoint_client/` | Sends requests to endpoint workers |
| `load_generator/session.py` | Runs the benchmark session |
| `metrics/` | Aggregates and reports benchmark results |
| `evaluation/` | Scores collected accuracy datasets during benchmark finalization |
| `sys_info/` | Invokes mlcflow to collect hardware/software/serving info from nodes |
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -76,6 +76,10 @@ dependencies = [
]

[project.optional-dependencies]
sysinfo = [
    # Required for system_info capture (mlperf submission workflow)
    "mlc-scripts==1.1.0",
]
sql = [
    # SQL event logger (swappable backends, default sqlite)
    "sqlalchemy==2.0.48",
@@ -99,6 +103,7 @@ dev = [
test = [
    # Includes optional dependencies for full test coverage
    "inference-endpoint[sql]",
    "inference-endpoint[sysinfo]",
    # Testing framework
    "pytest==9.0.3",
    "pytest-asyncio==1.3.0",