Question: Handling vLLM default optimizations and tested versions #1

@Denizdius

Description

Hi! I've been looking into TokenPowerBench for energy profiling and noticed that the VLLMEngine wrapper currently relies on the default settings of the installed vLLM environment.

In recent vLLM releases, advanced optimizations such as chunked prefill, Automatic Prefix Caching (APC), and CUDA graphs are enabled or heavily utilized by default. Because these features smooth out prefill power spikes and skip compute on repetitive datasets like Alpaca, they can substantially alter the raw hardware energy measurements.
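For concreteness, here is a minimal sketch of engine arguments that pin these optimizations off for a controlled baseline run. The parameter names follow vLLM's documented `LLM`/`EngineArgs` interface; whether the VLLMEngine wrapper forwards them is an assumption on my part:

```python
# Engine arguments that disable the default optimizations discussed above.
# Assumption: TokenPowerBench's VLLMEngine wrapper passes kwargs through
# to vllm.LLM (this is hypothetical, not confirmed from the repo).
baseline_engine_args = {
    "enforce_eager": True,            # disable CUDA graph capture
    "enable_prefix_caching": False,   # disable Automatic Prefix Caching (APC)
    "enable_chunked_prefill": False,  # disable chunked prefill
}

# Usage (requires a GPU and an installed vLLM):
#   from vllm import LLM
#   llm = LLM(model="<model-name>", **baseline_engine_args)
```

Comparing energy numbers with and without these flags would make the contribution of each software optimization explicit.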

I wanted to get your thoughts on this: how do you currently account for these software-level optimizations when evaluating your metrics? Do you test with them on or off?

Also, which specific version(s) of vLLM did you use when developing and testing TokenPowerBench? And are you currently testing newer vLLM releases with this benchmark?

Looking forward to hearing your insights!
