mlcommons · anandhu-eng · May 14, 2026 · May 15, 2026 · May 20, 2026 · May 28, 2026
@@ -31,6 +31,7 @@ uv run inference-endpoint probe --endpoints http://localhost:8765 --model test-m
 uv run inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
 uv run inference-endpoint benchmark online --endpoints URL --model NAME --dataset PATH --load-pattern poisson --target-qps 100
 uv run inference-endpoint benchmark from-config --config config.yaml
+uv run inference-endpoint sysinfo from-config --config config.yaml
 ```
 
 ### Backward-compatible setup (pip + venv)
@@ -60,6 +61,7 @@ inference-endpoint probe --endpoints http://localhost:8765 --model test-model
 inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
 inference-endpoint benchmark online --endpoints URL --model NAME --dataset PATH --load-pattern poisson --target-qps 100
 inference-endpoint benchmark from-config --config config.yaml
+inference-endpoint sysinfo from-config --config config.yaml
 ```
 
 ## Architecture
@@ -125,6 +127,7 @@ CLI is auto-generated from `config/schema.py` Pydantic models via cyclopts. Fiel
 
 - **CLI mode** (`offline`/`online`): cyclopts constructs `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` (subclasses in `config/schema.py`) directly from CLI args. Type locked via `Literal`. `--dataset` is repeatable with TOML-style format `[perf|acc:]<path>[,key=value...]` (e.g. `--dataset data.csv,samples=500,parser.prompt=article`). Full accuracy support via `accuracy_config.eval_method=pass_at_1` etc.
 - **YAML mode** (`from-config`): `BenchmarkConfig.from_yaml_file()` loads YAML, resolves env vars, and auto-selects the right subclass via Pydantic discriminated union. Optional `--timeout`/`--mode` overrides via `config.with_updates()`.
+- **sysinfo from-config**: `SysInfoFileConfig.from_yaml_file()` reads the `system_info` key from the YAML file (extra top-level keys are ignored). It invokes `capture_system_info()`, which calls the `get-mlperf-multi-node-system-info` mlcflow script. Supports an optional `node_config` for function-based node groupings (written to a temp YAML file that is passed to mlcflow as `node_config_file`). Independent of benchmark runs.
 - **eval**: Not yet implemented (raises `CLIError` with a tracking issue link)
 
 ### Config Construction & Validation
@@ -165,6 +168,9 @@ src/inference_endpoint/
 │   │   ├── __init__.py
 │   │   ├── cli.py             # benchmark_app: offline, online, from-config subcommands
 │   │   └── execute.py         # Phased execution: setup/run_threaded/finalize + BenchmarkContext
+│   ├── sysinfo/
+│   │   ├── __init__.py
+│   │   └── cli.py             # sysinfo_app: from-config subcommand (standalone sys info capture)
 │   ├── probe.py               # ProbeConfig + execute_probe()
 │   ├── info.py                # execute_info()
 │   ├── validate.py            # execute_validate()

@@ -36,6 +36,7 @@ already-parsed models rather than raw `argparse.Namespace` objects.
 | `info`                  | `main.py`                   | `commands/info.py`              | Implemented               |
 | `validate-yaml`         | `main.py`                   | `commands/validate.py`          | Implemented               |
 | `init`                  | `main.py`                   | `commands/init.py`              | Implemented               |
+| `sysinfo from-config`   | `commands/sysinfo/cli.py`   | `sys_info/capture.py`           | Implemented               |
 | `eval`                  | `main.py`                   | inline stub (`CLIError`)        | Reserved, not implemented |
 
 ## CLI Structure
@@ -53,6 +54,9 @@ inference-endpoint
   |     +-- online
   |     +-- from-config
   |
+  +-- sysinfo
+  |     +-- from-config
+  |
   +-- probe
   +-- info
   +-- validate-yaml
@@ -94,8 +98,182 @@ commands/benchmark/execute.py::run_benchmark()
   +-- construct endpoint client + sample issuer
   +-- run BenchmarkSession in threaded wrapper
   +-- finalize metrics and optional accuracy scoring
+  +-- if system_info is configured:
+        write run_metadata.json
+        capture_system_info() → mlcflow (hardware + serving config)
+        patch run_metadata.json with serving config values
 ```
 
+## System Info Capture
+
+> **Requires the `sysinfo` optional dependency.** Install it with:
+>
+> ```bash
+> # uv (recommended)
+> uv sync --extra sysinfo
+> # or pass --extra sysinfo directly to uv run, e.g.:
+> uv run --extra sysinfo inference-endpoint benchmark from-config --config config.yaml
+>
+> # pip (from repo root)
+> pip install -e ".[sysinfo]"
+> ```
+>
+> If `mlc-scripts` is not installed and `system_info` is configured, the benchmark still completes and results are written first; system info capture is then attempted, fails with an error log, and the process exits 0.
+
+System info capture collects hardware/software details from one or more nodes and writes a structured JSON file for MLPerf inference submissions. It runs in two contexts:
+
+- **Standalone** (`sysinfo from-config`): triggered manually, independent of any benchmark run. See [Standalone Command (`sysinfo from-config`)](#standalone-command-sysinfo-from-config) below for a full example.
+- **Integrated** (`benchmark from-config`): triggered automatically after a benchmark run completes if `system_info` is present in the config. For example:
+
+  ```yaml
+  name: "llama3.1-8b-vllm-perf-c1000"
+  version: "1.0"
+  type: "online"
+
+  model_params:
+    name: "meta-llama/Llama-3.1-8B-Instruct"
+    temperature: 0.0
+    top_p: 1.0
+    max_new_tokens: 128
+
+  datasets:
+    - name: cnn_dailymail::llama3_8b
+      type: performance
+      samples: 13368
+      parser:
+        input: prompt
+
+  settings:
+    runtime:
+      min_duration_ms: 600000
+      max_duration_ms: 3600000
+      scheduler_random_seed: 137
+      dataloader_random_seed: 111
+      n_samples_to_issue: 13368
+
+    load_pattern:
+      type: "concurrency"
+      target_concurrency: 1000
+
+    client:
+      num_workers: 4
+
+  endpoint_config:
+    endpoints:
+      - "http://localhost:11001"
+    api_key: null
+
+  report_dir: sglang_perf_c1000
+
+  system_info:
+    system_name: H100x8_SGLang
+    ssh_ids:
+      - user@inference-node
+    accelerator_backend: cuda
+    exclude_current_system: true
+    skip_ssh_key_file: false
+    serving_node: user@inference-node
+    endpoint_url: http://localhost:11001
+    serving_framework: sglang
+  ```
+
+  Then run:
+
+  ```bash
+  inference-endpoint benchmark from-config --config config.yaml
+  ```
+
+  System info capture runs automatically at the end of the benchmark.
+
+Both paths call `sys_info/capture.py::capture_system_info()` and produce the same output JSON (as per the endpoints spec). Both also patch `run_metadata.json` with serving configuration values extracted from the inference server's startup log, provided the file is present in `report_dir`. In integrated mode the file is always present, because the benchmark writes it before capture runs; in standalone mode it is patched only if it already exists there.
+
+### Config Reference (`SysInfoCaptureConfig`)
+
+| Field                    | Type                                             | Description                                                                                                                                                                                                                  |
+| ------------------------ | ------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `system_name`            | str                                              | **Required.** Name of the system under test (e.g. `"H100x8_vLLM"`). Used as the MLPerf submission system identifier.                                                                                                         |
+| `ssh_ids`                | `list[str]`                                      | **Required.** Nodes to collect hardware info from. Format: `user@host` or `user@host:port`.                                                                                                                                  |
+| `accelerator_backend`    | `"cuda"` \| `"rocm"` \| `"xpu"` \| `"none"`      | GPU backend on the target nodes. Default: `"none"`.                                                                                                                                                                          |
+| `exclude_current_system` | bool                                             | Skip the machine running this command; collect from `ssh_ids` only. Default: `false`.                                                                                                                                        |
+| `skip_ssh_key_file`      | bool                                             | Assume SSH key auth is pre-configured (skips mlcflow key-file lookup). Default: `false`.                                                                                                                                     |
+| `node_config`            | object                                           | Optional function-based node groupings (Prefill/Decode/etc). Maps function names to lists of `{node_name, no_of_nodes}` entries. `node_name` is matched as a case-insensitive substring against the detected GPU model name. |
+| `serving_node`           | str                                              | SSH target for the inference server (`user@host` or `user@host:port`). When set, the capture SSHes into this node and reads `/tmp/serving.log` to extract serving config. Server stdout/stderr **must** be redirected there. |
+| `endpoint_url`           | str                                              | Base URL of the running inference server. Probed via HTTP to detect the serving framework name and version (e.g. `"vLLM 0.9.0"`).                                                                                            |
+| `serving_framework`      | `"auto"` \| `"vllm"` \| `"sglang"` \| `"trtllm"` | Serving engine type used for startup log parsing. Default: `"auto"` (detected from the endpoint).                                                                                                                            |
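
The `user@host` / `user@host:port` format used by `ssh_ids` and `serving_node` can be illustrated with a small parser. This is a minimal sketch, not the project's code: `SshTarget` and `parse_ssh_id` are hypothetical names, and the sketch assumes plain hostnames or IPv4 addresses (an IPv6 literal would need different handling).

```python
from typing import NamedTuple, Optional


class SshTarget(NamedTuple):
    user: str
    host: str
    port: Optional[int]  # None when the ssh_id carries no explicit port


def parse_ssh_id(ssh_id: str) -> SshTarget:
    """Split 'user@host' or 'user@host:port' into its parts (hypothetical helper)."""
    user, at, rest = ssh_id.partition("@")
    host, colon, port = rest.partition(":")
    if not (user and at and host):
        raise ValueError(f"expected user@host or user@host:port, got {ssh_id!r}")
    if not colon:
        return SshTarget(user, host, None)
    if not port.isdecimal():
        raise ValueError(f"invalid port in {ssh_id!r}")
    return SshTarget(user, host, int(port))


print(parse_ssh_id("root@ssh1:22"))         # SshTarget(user='root', host='ssh1', port=22)
print(parse_ssh_id("user@inference-node"))  # SshTarget(user='user', host='inference-node', port=None)
```

Rejecting malformed entries at config load time, rather than when SSH is attempted, would surface typos before a long benchmark run starts; whether the real schema validates this early is not stated in the PR.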
+
+### Capture Flow
+
+System info capture is powered by the [mlperf-automations](https://github.com/mlcommons/mlperf-automations) project and orchestrates the following steps:
+
+1. **Hardware collection** — SSHes into each node listed in `ssh_ids` and collects CPU, memory, GPU, networking, and OS information. If `exclude_current_system` is false, the local machine is also probed. Results are merged into a single `system_desc.json`.
+
+2. **Serving config extraction** _(if `serving_node` is set)_ — SSHes into the inference server node and reads its startup log to extract parallelism settings (`tensor_parallel`, `pipeline_parallel`, `expert_parallel`) and batch size.
+
+3. **Framework detection** _(if `endpoint_url` is set)_ — probes the live inference server via HTTP to detect the framework name and version (e.g. `"vLLM 0.9.0"`, `"SGLang 0.4.2"`). HTTP detection takes priority over the log-based result.
+
+4. **Output** — always writes `system_desc.json` to the configured report directory. If a `run_metadata.json` is present in `report_dir`, it is also patched with the extracted serving config values (fields that could not be parsed remain `null`).
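
Steps 3 and 4 above can be sketched as follows. This is an illustration under stated assumptions, not the mlperf-automations implementation: `resolve_framework`, `patch_run_metadata`, and the `serving_config` key are hypothetical, and the real layout of `run_metadata.json` may differ. Only the precedence rule (HTTP over log), the "patch only if present" behavior, and the `null` fallback come from the text.

```python
import json
import tempfile
from pathlib import Path

# Field names taken from the docs above; the real file may use others.
SERVING_FIELDS = ("tensor_parallel", "pipeline_parallel", "expert_parallel", "batch")


def resolve_framework(http_detected, log_detected):
    """HTTP detection wins; fall back to the log-based result."""
    return http_detected if http_detected is not None else log_detected


def patch_run_metadata(report_dir, parsed):
    """Patch run_metadata.json in report_dir if it exists; return True if patched."""
    path = Path(report_dir) / "run_metadata.json"
    if not path.exists():
        return False  # e.g. standalone mode with no prior benchmark run
    metadata = json.loads(path.read_text())
    # Fields that could not be parsed stay None, which serializes as JSON null.
    # "serving_config" is a hypothetical key chosen for this sketch.
    metadata["serving_config"] = {field: parsed.get(field) for field in SERVING_FIELDS}
    path.write_text(json.dumps(metadata, indent=2))
    return True


with tempfile.TemporaryDirectory() as report_dir:
    print(patch_run_metadata(report_dir, {"tensor_parallel": 8}))  # False
    (Path(report_dir) / "run_metadata.json").write_text("{}")
    print(patch_run_metadata(report_dir, {"tensor_parallel": 8}))  # True
    print((Path(report_dir) / "run_metadata.json").read_text())

print(resolve_framework("SGLang 0.4.2", "sglang"))  # SGLang 0.4.2
```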
+
+### Standalone Command (`sysinfo from-config`)
+
+```bash
+inference-endpoint sysinfo from-config -c examples/sysinfo_example.yaml
+```
+
+```yaml
+report_dir: results/h100_sysinfo/ # output directory
+
+system_info:
+  system_name: H100x7_vLLM
+  ssh_ids:
+    - root@ssh1:22 # prefill node 1
+    - root@ssh2:22 # prefill node 2
+    - root@ssh3:22 # decode node 1
+    - root@ssh4:22 # decode node 2
+    - root@ssh5:22 # decode node 3
+    - root@ssh6:22 # decode node 4
+    - root@ssh7:22 # decode node 5
+
+  accelerator_backend: cuda
+  exclude_current_system: true # master node is orchestrator-only
+  skip_ssh_key_file: false
+
+  # serving_node: where the inference server process is running.
+  # Server stdout/stderr must be redirected to /tmp/serving.log on that node.
+  # If multiple serving nodes exist, point to any one — all nodes are assumed
+  # to run the same serving framework version.
+  serving_node: root@ssh1:22
+
+  node_config: # optional: function-based node groupings
+    Prefill:
+      - node_name: NVIDIA H100 # case-insensitive substring of detected GPU model
+        no_of_nodes: 2
+    Decode:
+      - node_name: NVIDIA H100
+        no_of_nodes: 5
+```
+
+Output is written to `report_dir/system_desc.json`.
+
+### `node_config` Validation
+
+When `node_config` is provided, the automations script enforces:
+
+- Every `node_name` must match at least one probed node's GPU model string (case-insensitive substring). Unmatched names return an error.
+- For each unique `node_name`, the total `no_of_nodes` across all function groups must not exceed the number of nodes of that type actually probed. Declaring more nodes than were SSHed into is an error.
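
The two rules above can be sketched in Python. This is an illustrative sketch, not the automations script itself: `validate_node_config`, its arguments, and the error messages are hypothetical. It also assumes that `node_name` values differing only in case count as the same name, which the text does not state.

```python
from collections import defaultdict


def validate_node_config(node_config, probed_gpu_models):
    """Check node_config against the GPU model strings of the probed nodes.

    node_config: {function_name: [{"node_name": str, "no_of_nodes": int}, ...]}
    probed_gpu_models: one GPU model string per probed node (hypothetical input).
    Returns a list of error strings; an empty list means the config is valid.
    """
    errors = []
    declared = defaultdict(int)  # lower-cased node_name -> total no_of_nodes
    for entries in node_config.values():
        for entry in entries:
            declared[entry["node_name"].lower()] += entry["no_of_nodes"]

    for name, total in declared.items():
        # Rule 1: case-insensitive substring match against probed GPU models.
        available = sum(1 for model in probed_gpu_models if name in model.lower())
        if available == 0:
            errors.append(f"node_name {name!r} matches no probed node")
        # Rule 2: the declared total must not exceed the number of matching nodes.
        elif total > available:
            errors.append(f"node_name {name!r}: declared {total}, probed {available}")
    return errors


probed = ["NVIDIA H100 80GB HBM3"] * 7
config = {
    "Prefill": [{"node_name": "NVIDIA H100", "no_of_nodes": 2}],
    "Decode": [{"node_name": "NVIDIA H100", "no_of_nodes": 5}],
}
print(validate_node_config(config, probed))      # []
print(validate_node_config(config, probed[:6]))  # one error: declared 7, probed 6
```

With the seven-node example from the standalone config, two Prefill plus five Decode nodes exactly match the seven probed hosts; dropping one host from `ssh_ids` would trip the second rule.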
+
+### Error Handling
+
+In standalone execution mode (`sysinfo from-config`), any capture failure propagates as an error and exits non-zero.
+
+In integrated execution mode (triggered at the end of a benchmark run), `system_info` failures never abort the benchmark — `results.json` and `run_metadata.json` are written before capture runs, so benchmark output is always complete. Capture errors are logged but the process exits 0.
+
+**Two warnings not to ignore:**
+
+- **SSH failure on a node**: the error is logged in the MLC output, but capture continues with the remaining nodes. The resulting `system_desc.json` will be missing that node's hardware info. Always verify that `system_desc.json` looks complete before submission.
+- **Serving config unavailable**: if the serving node is unreachable, the serving config fields in `run_metadata.json` (`tensor_parallel`, `pipeline_parallel`, `batch`, etc.) remain `null`. Check the MLC log and re-run `sysinfo from-config` manually if needed.
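
The difference between the two modes can be summarized in a few lines. This is a hedged sketch of the behavior described above, not the code in `commands/sysinfo/cli.py` or `commands/benchmark/execute.py`: `run_capture` and its `integrated` flag are hypothetical names.

```python
import logging

logger = logging.getLogger("sysinfo_sketch")


def run_capture(capture, *, integrated):
    """Run a capture callable; return True on success.

    Standalone mode re-raises, so the CLI can exit non-zero.
    Integrated mode logs the failure and lets the benchmark exit 0.
    """
    try:
        capture()
    except Exception:
        if not integrated:
            raise
        logger.exception("system info capture failed; benchmark results are unaffected")
        return False
    return True


def failing_capture():
    raise RuntimeError("ssh unreachable")


print(run_capture(failing_capture, integrated=True))  # False (error is logged)
try:
    run_capture(failing_capture, integrated=False)
except RuntimeError as exc:
    print(f"standalone mode propagates: {exc}")
```

Because integrated mode swallows the error, the exit code alone cannot confirm that capture succeeded; the log and the contents of `system_desc.json` are the only signals.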
+
+---
+
 ## `probe` Command
 
 `probe` is a lightweight connectivity check built on the same endpoint/client stack as the main
@@ -141,12 +319,13 @@ not been implemented yet.
 
 ## Integration Points
 
-| Dependency                  | Role                                                             |
-| --------------------------- | ---------------------------------------------------------------- |
-| `main.py`                   | App definition, logging setup, global error handling             |
-| `config/`                   | Defines CLI/YAML schema models and config loading                |
-| `dataset_manager/`          | Loads performance and accuracy datasets                          |
-| `endpoint_client/`          | Sends requests to endpoint workers                               |
-| `load_generator/session.py` | Runs the benchmark session                                       |
-| `metrics/`                  | Aggregates and reports benchmark results                         |
-| `evaluation/`               | Scores collected accuracy datasets during benchmark finalization |
+| Dependency                  | Role                                                                 |
+| --------------------------- | -------------------------------------------------------------------- |
+| `main.py`                   | App definition, logging setup, global error handling                 |
+| `config/`                   | Defines CLI/YAML schema models and config loading                    |
+| `dataset_manager/`          | Loads performance and accuracy datasets                              |
+| `endpoint_client/`          | Sends requests to endpoint workers                                   |
+| `load_generator/session.py` | Runs the benchmark session                                           |
+| `metrics/`                  | Aggregates and reports benchmark results                             |
+| `evaluation/`               | Scores collected accuracy datasets during benchmark finalization     |
+| `sys_info/`                 | Invokes mlcflow to collect hardware/software/serving info from nodes |
@@ -76,6 +76,10 @@ dependencies = [
 ]
 
 [project.optional-dependencies]
+sysinfo = [
+    # Required for system_info capture (mlperf submission workflow)
+    "mlc-scripts==1.1.0",
+]
 sql = [
     # SQL event logger (swappable backends, default sqlite)
     "sqlalchemy==2.0.48",
@@ -99,6 +103,7 @@ dev = [
 test = [
     # Includes optional dependencies for full test coverage
     "inference-endpoint[sql]",
+    "inference-endpoint[sysinfo]",
     # Testing framework
     "pytest==9.0.3",
     "pytest-asyncio==1.3.0",