Hi! I've been looking into TokenPowerBench for energy profiling and noticed that the VLLMEngine wrapper currently relies on the default settings of the installed vLLM environment.
In recent vLLM releases, advanced optimizations such as Chunked Prefill, Automatic Prefix Caching (APC), and CUDA graphs are enabled or heavily used by default. Chunked Prefill smooths out prefill power spikes by interleaving prefill with decode, APC skips prefill compute entirely for repeated prefixes (common in repetitive datasets like Alpaca), and CUDA graphs reduce kernel launch overhead, so together they can substantially alter the raw hardware energy measurements.
I wanted to get your thoughts on this: how do you currently account for these software-level optimizations when evaluating your metrics? Do you test with them on or off?
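For context, here is a minimal sketch of how one might pin these features off for a worst-case / unoptimized baseline. The parameter names (`enable_prefix_caching`, `enable_chunked_prefill`, `enforce_eager`) are from recent vLLM releases and may differ across versions, so please treat this as an assumption to verify against whichever vLLM version TokenPowerBench targets:

```python
# Hypothetical engine kwargs to disable the default optimizations when
# constructing vLLM's LLM engine (names assumed from recent vLLM releases;
# verify against your installed version's EngineArgs).
baseline_engine_kwargs = {
    "enable_prefix_caching": False,   # turn off APC so repeated prefixes are recomputed
    "enable_chunked_prefill": False,  # keep prefill as a single burst (preserves power spikes)
    "enforce_eager": True,            # skip CUDA graph capture, run in eager mode
}

# Example usage (commented out, requires a GPU and an installed vLLM):
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", **baseline_engine_kwargs)
```

Comparing energy results with these kwargs against the defaults would make the contribution of each software-level optimization explicit, rather than folding it silently into the hardware numbers.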
Also, which specific version(s) of vLLM did you use when developing and testing TokenPowerBench? And are you currently testing new versions of vLLM with this benchmark?
Looking forward to hearing your insights!