feat(llama3.1-8b): expose vLLM engine params as CLI arguments#2553

Open
ssaketh-ch wants to merge 5 commits into mlcommons:master from ssaketh-ch:feat/vllm-configurable-engine-params

ssaketh-ch commented Feb 26, 2026

Summary

Expose vLLM engine memory and scheduling parameters as CLI arguments in the
Llama 3.1-8B benchmark harness, removing the need to edit source code to tune them.

Problem

When running the vLLM-backed SUT, parameters like gpu_memory_utilization,
max_num_seqs, and max_num_batched_tokens are hardcoded in load_model().
This causes two practical problems:

  1. Out-of-memory (OOM) crashes -- on GPUs with less VRAM, or when running
    alongside other processes, there is no way to reduce memory usage without
    modifying source code.
  2. No visibility into tunable knobs -- users hitting performance or memory
    issues have no obvious way to know which parameters exist or what values to
    try. The only option is to read the source, edit it, and re-run.

This is especially painful during bring-up on new hardware where the right
memory configuration is not known in advance.

Solution

Add 7 CLI flags to main.py and thread them through to the SUT and
SUTServer constructors and load_model() calls:

  • --gpu-memory-utilization -- reduce if hitting OOM (default: 0.90)
  • --max-num-batched-tokens -- cap total tokens scheduled per step
  • --max-num-seqs -- limit concurrent request slots
  • --block-size -- KV cache paging granularity
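  • --enforce-eager / --no-enforce-eager -- run in eager mode, skipping CUDA graph capture
  • --enable-chunked-prefill / --no-enable-chunked-prefill -- split long prefills into chunks
  • --max-model-len -- cap the maximum context length (prompt plus output), which bounds per-sequence KV cache needs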
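
A minimal sketch of how these flags could be declared with argparse. The defaults of 0.90 and 256 come from this PR; the None defaults for the remaining flags and the build_parser helper name are assumptions for illustration, not the code in main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing only the seven engine flags."""
    parser = argparse.ArgumentParser(description="vLLM engine tuning flags (sketch)")
    parser.add_argument("--gpu-memory-utilization", type=float, default=0.90,
                        help="Fraction of GPU memory vLLM may use; reduce if hitting OOM")
    parser.add_argument("--max-num-batched-tokens", type=int, default=None,
                        help="Cap on total tokens scheduled per step")
    parser.add_argument("--max-num-seqs", type=int, default=256,
                        help="Max concurrent sequences (default: 256)")
    parser.add_argument("--block-size", type=int, default=None,
                        help="KV cache paging granularity")
    # BooleanOptionalAction (Python 3.9+) generates the paired --no-... form.
    # default=None is an assumption meaning "not set by the user".
    parser.add_argument("--enforce-eager", action=argparse.BooleanOptionalAction,
                        default=None, help="Run in eager mode")
    parser.add_argument("--enable-chunked-prefill", action=argparse.BooleanOptionalAction,
                        default=None, help="Split long prefills into chunks")
    parser.add_argument("--max-model-len", type=int, default=None,
                        help="Cap on the context length")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--gpu-memory-utilization", "0.80", "--max-num-seqs", "128", "--no-enforce-eager"]
)
print(args.gpu_memory_utilization, args.max_num_seqs, args.enforce_eager)
```

Using None for unset flags lets later code tell "user did not pass this" apart from an explicit value.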

Backward Compatibility

All flags default to vLLM's own defaults. Existing run scripts and
user.conf setups are completely unaffected.
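
One way to keep unset flags from overriding vLLM's defaults is to forward only the values the user actually supplied. The sketch below assumes unset flags arrive as None; build_engine_kwargs and ENGINE_PARAM_NAMES are hypothetical names, and the real load_model() in this PR may thread the values differently.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Engine parameters covered by the new CLI flags (attribute names as argparse
# would produce them from the flag names).
ENGINE_PARAM_NAMES = (
    "gpu_memory_utilization",
    "max_num_batched_tokens",
    "max_num_seqs",
    "block_size",
    "enforce_eager",
    "enable_chunked_prefill",
    "max_model_len",
)


def build_engine_kwargs(args: Any) -> Dict[str, Any]:
    """Collect only the parameters that were explicitly set (not None)."""
    kwargs: Dict[str, Any] = {}
    for name in ENGINE_PARAM_NAMES:
        value = getattr(args, name, None)
        if value is not None:
            kwargs[name] = value
    return kwargs


# Stand-in for parsed CLI arguments; only two values were provided.
args = SimpleNamespace(
    gpu_memory_utilization=0.80,
    max_num_seqs=128,
    max_num_batched_tokens=None,
    block_size=None,
    enforce_eager=None,
    enable_chunked_prefill=None,
    max_model_len=None,
)
engine_kwargs = build_engine_kwargs(args)
print(engine_kwargs)
# Assumption: the resulting dict would then be passed to the engine
# constructor, e.g. LLM(model=..., **engine_kwargs).
```

Note that a False value is forwarded, since only None is treated as "unset".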

Example

Reducing memory pressure on a smaller GPU:

python main.py --vllm --scenario Offline \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 128

Hardcoded vLLM memory and scheduling parameters are now configurable
via CLI flags with defaults that preserve existing behavior. No change
to model outputs or sampling -- only memory allocation and scheduling.

New flags:
  --gpu-memory-utilization
  --max-num-batched-tokens
  --max-num-seqs
  --enable-prefix-caching / --no-enable-prefix-caching
  --block-size
  --enforce-eager / --no-enforce-eager
  --enable-chunked-prefill / --no-enable-chunked-prefill
  --max-model-len
@ssaketh-ch ssaketh-ch requested a review from a team as a code owner February 26, 2026 15:01
github-actions Bot commented Feb 26, 2026

Contributor

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@hanyunfan

Contributor

WG: Assigned to Thomas for review.

attafosu left a comment

Contributor

Please address the comment on prefix caching; after that, this should be good to go.

Comment thread language/llama3.1-8b/SUT_VLLM.py Outdated
gpu_memory_utilization=gpu_memory_utilization,
max_num_batched_tokens=max_num_batched_tokens,
max_num_seqs=max_num_seqs,
enable_prefix_caching=enable_prefix_caching,

Contributor

Prefix caching should strictly be off (this is per the rules of the benchmark)

Author

Hi,
I opened two new commits to remove it. Please take a look.

Comment thread language/llama3.1-8b/main.py Outdated
default=256,
help="Max concurrent sequences (default: 256)",
)
parser.add_argument(

Contributor

I'd remove this from the CLI to avoid any confusion: the benchmark rules do not allow prefix caching.

Author

Hi,
I opened two new commits to remove it. Please take a look.

As requested, I removed the prefix-caching flag from main.py to avoid any potential confusion, since the benchmark requires it to always be False.
Again, as mentioned, I removed prefix caching as a tunable parameter entirely since the benchmark doesn't allow it.
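
A minimal sketch of how engine arguments could be guarded so that prefix caching stays off no matter what a caller passes. The helper name with_benchmark_constraints is hypothetical and not part of this PR; enable_prefix_caching is the engine argument from the SUT_VLLM.py snippet in the review thread.

```python
from typing import Any, Dict


def with_benchmark_constraints(engine_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of engine_kwargs with prefix caching pinned to False.

    Raises ValueError if the caller tried to enable prefix caching, so a
    rule violation fails loudly instead of being silently overridden.
    """
    if engine_kwargs.get("enable_prefix_caching"):
        raise ValueError("prefix caching is not allowed by the benchmark rules")
    pinned = dict(engine_kwargs)  # copy so the caller's dict is not mutated
    pinned["enable_prefix_caching"] = False
    return pinned


# Tunable values pass through unchanged; prefix caching is always False.
safe_kwargs = with_benchmark_constraints({"gpu_memory_utilization": 0.80})
print(safe_kwargs)
```

Raising on an explicit True, rather than quietly resetting it, is a design choice of this sketch.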
@ssaketh-ch ssaketh-ch requested a review from attafosu April 1, 2026 08:05

attafosu left a comment

Contributor

Please sign the CLA and we'll merge next week.

ssaketh-ch commented Apr 23, 2026

Author

Hi,
I have signed the CLA. Let me know if there are any further steps I need to take.
Sorry for the delay; processing took longer than I expected.

@ssaketh-ch ssaketh-ch requested a review from attafosu April 23, 2026 03:33