75 changes: 75 additions & 0 deletions .github/configs/amd-master.yaml
@@ -293,6 +293,28 @@ glm5-fp8-mi355x-sglang:
search-space:
- { tp: 8, conc-start: 4, conc-end: 64 }

glm5-fp8-mi355x-atom:
image: TBD
model: zai-org/GLM-5-FP8
model-prefix: glm5
runner: mi355x
precision: fp8
framework: atom
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 128 }
- isl: 1024
osl: 8192
search-space:
- { tp: 8, conc-start: 4, conc-end: 128 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 128 }

kimik2.5-int4-mi355x-vllm:
image: vllm/vllm-openai-rocm:v0.18.0
model: moonshotai/Kimi-K2.5
@@ -363,6 +385,31 @@ kimik2.5-fp4-mi355x-vllm:
- { tp: 8, conc-start: 4, conc-end: 64 }
- { tp: 4, conc-start: 4, conc-end: 64 }

kimik2.5-fp4-mi355x-atom:
image: TBD
model: amd/Kimi-K2.5-MXFP4
model-prefix: kimik2.5
runner: mi355x
precision: fp4
framework: atom
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 128 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- isl: 1024
osl: 8192
search-space:
- { tp: 8, conc-start: 4, conc-end: 128 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 128 }
- { tp: 4, conc-start: 4, conc-end: 128 }

minimaxm2.5-fp8-mi355x-vllm:
image: vllm/vllm-openai-rocm:v0.18.0
model: MiniMaxAI/MiniMax-M2.5
@@ -391,6 +438,34 @@ minimaxm2.5-fp8-mi355x-vllm:
- { tp: 4, conc-start: 4, conc-end: 64 }
- { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }

minimaxm2.5-fp8-mi355x-atom:
image: TBD
model: MiniMaxAI/MiniMax-M2.5
model-prefix: minimaxm2.5
runner: mi355x
precision: fp8
framework: atom
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 2, conc-start: 4, conc-end: 128 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
- isl: 1024
osl: 8192
search-space:
- { tp: 2, conc-start: 4, conc-end: 128 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
- isl: 8192
osl: 1024
search-space:
- { tp: 2, conc-start: 4, conc-end: 128 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }

minimaxm2.5-fp8-mi300x-vllm:
image: vllm/vllm-openai-rocm:v0.16.0
model: MiniMaxAI/MiniMax-M2.5
79 changes: 79 additions & 0 deletions benchmarks/single_node/glm5_fp8_mi355x_atom.sh
@@ -0,0 +1,79 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE \
DP_ATTENTION

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export OMP_NUM_THREADS=1

# Calculate max-model-len based on ISL and OSL
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
CALCULATED_MAX_MODEL_LEN=""
else
CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

if [ "$EP_SIZE" -gt 1 ]; then
EP=" --enable-expert-parallel"
else
EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x

python3 -m atom.entrypoints.openai_server \
--model $MODEL \
--server-port $PORT \
-tp $TP \
--kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
Contributor: disable prefix caching for consistency?

--trust-remote-code \
> $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
Comment on lines +46 to +55 (Contributor):
🔴 All three new non-MTP ATOM benchmark scripts (glm5_fp8_mi355x_atom.sh, kimik2.5_fp4_mi355x_atom.sh, minimaxm2.5_fp8_mi355x_atom.sh) are missing the BLOCK_SIZE=${BLOCK_SIZE:-16} variable and --block-size $BLOCK_SIZE flag present in all other non-MTP ATOM scripts. Without this, the ATOM server uses its framework default block size, which may differ from the tuned value of 16 and could cause performance degradation or OOM at high concurrency.

Extended reasoning...

What the bug is and how it manifests

All three new non-MTP ATOM benchmark scripts introduced in this PR are missing the --block-size parameter when launching the ATOM inference server. The existing non-MTP ATOM scripts (dsr1_fp8_mi355x_atom.sh, dsr1_fp4_mi355x_atom.sh, gptoss_fp4_mi355x_atom.sh) all set BLOCK_SIZE=${BLOCK_SIZE:-16} and pass --block-size $BLOCK_SIZE to the server. The three new scripts omit this entirely, so the ATOM framework will use whatever block size it defaults to internally.

The specific code path that triggers it

In all three new scripts (e.g. glm5_fp8_mi355x_atom.sh lines 46–55), the server launch invocation is:

python3 -m atom.entrypoints.openai_server \
    --model $MODEL \
    --server-port $PORT \
    -tp $TP \
    --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
    --trust-remote-code \
    > $SERVER_LOG 2>&1 &

Compare with dsr1_fp8_mi355x_atom.sh (lines 45 and 51):

BLOCK_SIZE=${BLOCK_SIZE:-16}
...
    --block-size $BLOCK_SIZE > $SERVER_LOG 2>&1 &

Why existing code doesn't prevent it

There is no default --block-size guard in the shared benchmark_lib.sh — each individual script is responsible for passing this flag. The three new scripts simply don't include it.

Addressing the refutation

One verifier noted that the MTP ATOM scripts (dsr1_fp4_mi355x_atom_mtp.sh and dsr1_fp8_mi355x_atom_mtp.sh) added in PR #673 also omit --block-size. However, those scripts use --method mtp for speculative decoding, which is architecturally distinct from standard inference and may have different (or conflicting) block-size requirements. The new scripts in this PR are standard (non-MTP) inference scripts and should logically follow the non-MTP pattern, not the MTP pattern. The omission in MTP scripts appears intentional for MTP-specific reasons, not a general deprecation of the flag.

Impact

ATOM's block size directly controls KV cache page allocation. If the framework default differs from 16, it can cause increased memory fragmentation, reduced KV cache utilization at high concurrency (conc-end: 128 or 256 in these configs), or OOM errors during sweep benchmarks. Given that the concurrency ranges here (up to 256 for minimaxm2.5) exceed those in the original ATOM scripts, the risk is higher.
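The tail-block waste described above can be made concrete with a little shell arithmetic. This is illustrative only: the actual ATOM framework default is not confirmed here, so 64 is a hypothetical stand-in, and a 9,000-token sequence stands in for an ISL=1024/OSL=8192 request shortened by the random range ratio:

```shell
#!/usr/bin/env bash
# KV-cache pages are allocated in whole blocks, so the final block of each
# sequence wastes (pages * block_size - seq_len) token slots.
seq_len=9000   # hypothetical request length (ISL+OSL shortened by range ratio)
conc=128       # conc-end from the new configs

bs=16          # tuned value from the existing non-MTP ATOM scripts
pages_16=$(( (seq_len + bs - 1) / bs ))
waste_16=$(( pages_16 * bs - seq_len ))

bs=64          # hypothetical framework default, for comparison
pages_64=$(( (seq_len + bs - 1) / bs ))
waste_64=$(( pages_64 * bs - seq_len ))

echo "block_size=16: $waste_16 wasted slots/request, $(( waste_16 * conc )) total at CONC=$conc"
echo "block_size=64: $waste_64 wasted slots/request, $(( waste_64 * conc )) total at CONC=$conc"
```

Per request the waste is small, but it scales with concurrency, and a block size of 16 bounds tail waste to at most 15 slots per sequence.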

Step-by-step proof

  1. User runs glm5-fp8-mi355x-atom benchmark with ISL=1024, OSL=8192, TP=8, CONC=128.
  2. Harness calls glm5_fp8_mi355x_atom.sh — no BLOCK_SIZE variable is set.
  3. Server launches without --block-size; ATOM uses its internal default (e.g. 32 or 64).
  4. At CONC=128 with OSL=8192, KV cache demand is high; a larger block size leads to more wasted space per page and potentially OOM, while the tuned value of 16 was established specifically for MI355X ATOM.
  5. For comparison, dsr1_fp8_mi355x_atom.sh with the same hardware would launch with --block-size 16 and succeed.

How to fix

Add the following line to each new script, matching the pattern in the existing non-MTP ATOM scripts:

BLOCK_SIZE=${BLOCK_SIZE:-16}

And add --block-size $BLOCK_SIZE to the python3 -m atom.entrypoints.openai_server invocation in all three new scripts.
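Applied to glm5_fp8_mi355x_atom.sh, the fix would look roughly like the sketch below. This is a dry-run form that only prints the launch command rather than starting a server; the MODEL, PORT, and TP defaults are illustrative stand-ins for values the harness normally exports:

```shell
#!/usr/bin/env bash
# Default the block size to the tuned value used by the existing
# non-MTP ATOM scripts, overridable from the environment.
BLOCK_SIZE=${BLOCK_SIZE:-16}

# Illustrative defaults so the sketch is self-contained.
MODEL=${MODEL:-zai-org/GLM-5-FP8}
PORT=${PORT:-8888}
TP=${TP:-8}

# Build the server invocation with --block-size appended; echo instead of
# exec so the command can be inspected without launching anything.
LAUNCH_CMD="python3 -m atom.entrypoints.openai_server --model $MODEL --server-port $PORT -tp $TP --kv_cache_dtype fp8 --block-size $BLOCK_SIZE --trust-remote-code"
echo "$LAUNCH_CMD"
```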

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

export PYTHONDONTWRITEBYTECODE=1
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
80 changes: 80 additions & 0 deletions benchmarks/single_node/kimik2.5_fp4_mi355x_atom.sh
@@ -0,0 +1,80 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE \
DP_ATTENTION

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export OMP_NUM_THREADS=1

# Calculate max-model-len based on ISL and OSL
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
CALCULATED_MAX_MODEL_LEN=""
else
CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

if [ "$EP_SIZE" -gt 1 ]; then
EP=" --enable-expert-parallel"
else
EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x

python3 -m atom.entrypoints.openai_server \
--model $MODEL \
--server-port $PORT \
-tp $TP \
--kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
Contributor: disable prefix caching for consistency?

--trust-remote-code \
> $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

export PYTHONDONTWRITEBYTECODE=1
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
80 changes: 80 additions & 0 deletions benchmarks/single_node/minimaxm2.5_fp8_mi355x_atom.sh
@@ -0,0 +1,80 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE \
DP_ATTENTION

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export OMP_NUM_THREADS=1

# Calculate max-model-len based on ISL and OSL
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
CALCULATED_MAX_MODEL_LEN=""
else
CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

if [ "$EP_SIZE" -gt 1 ]; then
EP=" --enable-expert-parallel"
else
EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x

python3 -m atom.entrypoints.openai_server \
--model $MODEL \
--server-port $PORT \
-tp $TP \
--kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
Contributor: disable prefix caching for consistency?

--trust-remote-code \
> $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

export PYTHONDONTWRITEBYTECODE=1
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -1108,3 +1108,12 @@
description:
- "Update vLLM image from v0.15.1 to v0.18.0 for gptoss H100 and H200 configs"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/960

- config-keys:
- kimik2.5-fp4-mi355x-atom
- glm5-fp8-mi355x-atom
- minimaxm2.5-fp8-mi355x-atom
description:
- "New model support on ATOM framework"
- "Kimi-K2.5 FP4, GLM-5 FP8, and MiniMax-M2.5 FP8 configs added for MI355X ATOM"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/963
Contributor:
🟡 The perf-changelog.yaml entry added in this PR references the wrong PR number: pr-link points to pull/954, but this is PR #963. PR #954 is a separate merged PR ('kimik2.5 int4 changes from mi355x'). The link should be updated to https://github.com/SemiAnalysisAI/InferenceX/pull/963 to correctly attribute the three ATOM configs (kimik2.5-fp4-mi355x-atom, glm5-fp8-mi355x-atom, minimaxm2.5-fp8-mi355x-atom) to this PR.

Extended reasoning...

Bug description: The last entry added to perf-changelog.yaml in this PR uses an incorrect pr-link value. The link points to https://github.com/SemiAnalysisAI/InferenceX/pull/954, which is a different, already-merged PR ('mi325x: port kimik2.5 int4 changes from mi355x PR #950', commit cec542a). The current PR being submitted is #963.

Code path: In perf-changelog.yaml at the final entry (around line 1119), the newly added block reads:

- config-keys:
    - kimik2.5-fp4-mi355x-atom
    - glm5-fp8-mi355x-atom
    - minimaxm2.5-fp8-mi355x-atom
  description:
    - "New model support on ATOM framework"
    - "Kimi-K2.5 FP4, GLM-5 FP8, and MiniMax-M2.5 FP8 configs added for MI355X ATOM"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/954

Why existing code doesn't prevent it: The changelog is a static YAML file with no automated validation of PR numbers against the current PR context. There is no CI check that cross-references the pr-link value with the actual PR number being submitted, making copy-paste errors like this easy to introduce.
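No such check exists in the repo today; a hypothetical CI guard (all names here are illustrative, not an existing script) could grep the newest entry's pr-link and compare it against the PR number CI injects:

```shell
#!/usr/bin/env bash
# Hypothetical CI guard: compare the last pr-link in the changelog
# against the PR number under review (PR_NUMBER would come from CI).
PR_NUMBER=${PR_NUMBER:-963}

# Stand-in for the tail of perf-changelog.yaml, reproducing the bad entry.
entry='- config-keys:
    - glm5-fp8-mi355x-atom
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/954'

# Extract the last pull/NNN token and compare it to the expected number.
last_link=$(printf '%s\n' "$entry" | grep -o 'pull/[0-9]*' | tail -n 1)
if [ "$last_link" != "pull/$PR_NUMBER" ]; then
  echo "pr-link mismatch: found $last_link, expected pull/$PR_NUMBER"
  STATUS=1
else
  STATUS=0
fi
```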

Impact: While this is a documentation-only error with no functional impact, the changelog serves as an important audit trail for which PR introduced which configuration. With the wrong link, a future developer investigating kimik2.5-fp4-mi355x-atom, glm5-fp8-mi355x-atom, or minimaxm2.5-fp8-mi355x-atom would be led to PR #954, which is about an entirely different set of changes (MI325X INT4 configs), creating false traceability.

Step-by-step proof:

  1. This PR is #963 ("[AMD/ROCm] ATOM support for new models: Kimi-K2.5 FP4, GLM-5 FP8, and MiniMax-M2.5"), as confirmed by the PR metadata.
  2. The git log shows commit cec542a with message "mi325x: port kimik2.5 int4 changes from mi355x PR #950"; that commit corresponds to PR #954 ("mi325: update kimi int4").
  3. The last perf-changelog.yaml entry uses pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/954.
  4. The correct value should be https://github.com/SemiAnalysisAI/InferenceX/pull/963.

Fix: Change the pr-link in the final changelog entry from .../pull/954 to .../pull/963.
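With that change, the corrected entry (content taken from the changelog block quoted above) would read:

```yaml
- config-keys:
    - kimik2.5-fp4-mi355x-atom
    - glm5-fp8-mi355x-atom
    - minimaxm2.5-fp8-mi355x-atom
  description:
    - "New model support on ATOM framework"
    - "Kimi-K2.5 FP4, GLM-5 FP8, and MiniMax-M2.5 FP8 configs added for MI355X ATOM"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/963
```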