--llm flag is parsed but not used; OMB_ANSWER_LLM env default is 'groq' (not 'gemini') #15

@King-Brownie

Summary

Two compounding issues in the answer-LLM wiring at commit 45fa380 mean that omb run --llm gemini still dispatches answer generation to Groq:

  1. --llm is parsed but never threaded to get_answer_llm(). At src/memory_bench/cli.py#L39 the flag is captured into a local llm variable, but cli.py#L68 calls get_answer_llm() with no arguments:

    mode=get_mode(mode, llm=get_answer_llm()),

  2. get_answer_llm() defaults to "groq". At src/memory_bench/llm/__init__.py#L24:

    def get_answer_llm() -> LLM:
        provider = os.environ.get("OMB_ANSWER_LLM", "groq")
        ...

    Combined effect: the --llm flag has no influence on answer generation. The OMB_ANSWER_LLM env var is the only way to actually pick an answer LLM, and its fallback of "groq" disagrees with --llm's documented default of "gemini".
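The interaction can be shown in isolation with a minimal sketch. The names `answer_provider` and `run` are hypothetical stand-ins, not the repository's code; the sketch only mirrors the pattern described above (a parsed flag that is never passed on, plus an env lookup that falls back to "groq").

```python
import os


def answer_provider() -> str:
    # Mirrors the described lookup: env var first, else the "groq" fallback.
    return os.environ.get("OMB_ANSWER_LLM", "groq")


def run(llm: str = "gemini") -> str:
    # `llm` is accepted but never forwarded, as described for cli.py#L68,
    # so the returned provider depends only on the environment.
    return answer_provider()
```

With OMB_ANSWER_LLM unset, `run(llm="gemini")` in this sketch resolves to "groq", which is the behaviour reported here.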

Reproducer

git clone https://github.com/vectorize-io/agent-memory-benchmark.git
cd agent-memory-benchmark
git checkout 45fa380
# Even though --llm gemini matches the documented default, Groq is invoked:
unset OMB_ANSWER_LLM
uv run --python 3.12 omb run --dataset locomo --memory bm25 --split locomo10 --query-limit 1 --llm gemini
# → answer phase hits memory_bench/llm/groq.py:32 (GroqLLM.generate) and
#   fails with APIConnectionError, 401, etc., depending on whether
#   GROQ_API_KEY is set and on network egress to api.groq.com.

Workaround for callers today: OMB_ANSWER_LLM=gemini omb run ... (env var wins because --llm is unused).

Suggested fix

Two minimal, independent fixes:

  1. Honor --llm at the call site (cli.py#L68):

    answer_llm = get_llm(llm) if llm else get_answer_llm()
    mode=get_mode(mode, llm=answer_llm),

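    This makes --llm gemini actually do what it documents. OMB_ANSWER_LLM keeps working for callers who prefer the env-var path, but only when llm is falsy at this point: if --llm always carries its "gemini" default, the env var is never consulted, so the flag's default would need to be None (or empty) for the fallback to apply.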

  2. Align get_answer_llm() default with --llm's default (llm/__init__.py#L24):

    provider = os.environ.get("OMB_ANSWER_LLM", "gemini")

    This matches the Gemini 2.5 Flash judge baseline that landed in 45fa380 and the default documented in the README and CLI. With the current "groq" default, the CLI ships with internally inconsistent semantics.
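For illustration, the precedence both fixes aim for (explicit flag, then OMB_ANSWER_LLM, then "gemini") could be sketched as a single resolver. `resolve_answer_provider` and `DEFAULT_PROVIDER` are hypothetical names, not existing functions in the repo, and the sketch assumes the CLI passes None when --llm is omitted.

```python
import os
from typing import Mapping, Optional

# Assumed shared default, matching the documented --llm default.
DEFAULT_PROVIDER = "gemini"


def resolve_answer_provider(
    flag: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the provider name: flag first, then env var, then default."""
    env = os.environ if env is None else env
    if flag:
        # An explicit --llm value wins over the environment.
        return flag
    # No flag given: fall back to OMB_ANSWER_LLM, then to the default.
    return env.get("OMB_ANSWER_LLM", DEFAULT_PROVIDER)
```

Keeping the default in one place would also prevent the flag and the env fallback from drifting apart again.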

Happy to file a PR with both fixes if helpful — let me know whether you'd prefer one combined PR or two separated by concern.
