Hi! I've been looking into TokenPowerBench for energy profiling and noticed that the VLLMEngine wrapper currently relies on the default settings of the installed vLLM environment.
In recent vLLM releases, advanced optimizations such as Chunked Prefill, Automatic Prefix Caching (APC), and CUDA graphs are enabled or heavily used by default. Chunked Prefill smooths out prefill power spikes by interleaving prefill with decode, APC skips prefill compute entirely for repeated prefixes (common in repetitive datasets like Alpaca), and CUDA graphs reduce kernel launch overhead, so together they can substantially alter the raw hardware energy measurements.
I wanted to get your thoughts on this: how do you currently account for these software-level optimizations when evaluating your metrics? Do you test with them on or off?
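For context, here is a minimal sketch of how one might pin these features off for a worst-case / unoptimized baseline. The parameter names (`enable_prefix_caching`, `enable_chunked_prefill`, `enforce_eager`) are from recent vLLM releases and may differ across versions, so please treat this as an assumption to verify against whichever vLLM version TokenPowerBench targets:

```python
# Hypothetical engine kwargs to disable the default optimizations when
# constructing vLLM's LLM engine (names assumed from recent vLLM releases;
# verify against your installed version's EngineArgs).
baseline_engine_kwargs = {
    "enable_prefix_caching": False,   # turn off APC so repeated prefixes are recomputed
    "enable_chunked_prefill": False,  # keep prefill as a single burst (preserves power spikes)
    "enforce_eager": True,            # skip CUDA graph capture, run in eager mode
}

# Example usage (commented out, requires a GPU and an installed vLLM):
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", **baseline_engine_kwargs)
```

Comparing energy results with these kwargs against the defaults would make the contribution of each software-level optimization explicit, rather than folding it silently into the hardware numbers.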
Also, which specific version(s) of vLLM did you use when developing and testing TokenPowerBench? And are you currently testing new versions of vLLM with this benchmark?
Looking forward to hearing your insights!