Force clarifier to search before asking clarification questions (#234)#245
Force clarifier to search before asking clarification questions (#234)#245torkian wants to merge 6 commits into
Conversation
Closes NVIDIA-AI-Blueprints#234. The clarifier agent had tool-calling bound to a search loop, but weak or non-frontier models routinely skipped the search and asked the user a clarification question immediately. The expected behavior is to gather context with the available tools first and only ask once a genuine ambiguity remains. This is fixed at the graph level so it works model-agnostically rather than relying on prompt engineering for any one model: - New force_search node in the LangGraph workflow. When the LLM tries to ask for clarification on the first turn, tools are configured, and no tool call has happened yet, decide_route sends the run through force_search before ask_for_clarification. - Guidance is injected ephemerally in agent_node (same pattern as the existing JSON_REMINDER_AFTER_TOOLS) rather than written into state.messages, so it cannot leak into helpers like get_latest_user_query that surface text back to the user. - A force_search_used boolean on ClarifierAgentState makes the nudge fire at most once per run, so a model that ignores the nudge falls through to ask_for_clarification instead of looping. - prompts/research_clarification.j2 is updated so the Tool Usage section instructs the model to "Search first, ask second" for any unfamiliar entity, acronym, or term. Frontier models follow this voluntarily and skip the structural nudge entirely. End-to-end reproducer added at scripts/e2e_clarifier_search_test.py. Against nvidia/nemotron-3-nano-30b-a3b (the project's default clarifier model) with a real Tavily web search tool: develop: 0 tool calls, user prompted immediately (bug) fix: 1 tool call, user never prompted (fixed) The mock-based unit suite covers the new helper, the four routing branches, and the regression flagged during review where the guidance message could otherwise leak into the invalid-format fallback text. All 148 clarifier tests pass. Signed-off-by: Torkian <torkian@mac.com>
Greptile SummaryThis PR fixes the bug where the clarifier agent would prompt the user for clarification before using its bound search tools, by injecting an inline force-search retry inside
Confidence Score: 5/5Safe to merge — the inline retry is tightly guarded by iteration and tool-invocation checks, the two API-contract bugs from prior review rounds are both addressed, and 147 tests including dedicated regression tests for each fix all pass. The force-search nudge fires exactly once per conversation (gated on iteration==0 and absence of prior tool calls), is ephemeral (FORCE_SEARCH_GUIDANCE never enters state), and only retry_response is persisted — eliminating the consecutive-assistant-message violation. The skip-command HumanMessage interleaving and the budget-exhaustion early-exit guard close the other two previously identified issues. No new execution paths are left unguarded. No files require special attention; the one note is a test-assertion coverage gap that does not affect production behavior. Important Files Changed
Reviews (5): Last reviewed commit: "Fix consecutive assistant messages on th..." | Re-trigger Greptile |
|
@torkian Thank you for your contribution. We will review soon. |
ruff (project config: force-single-line=true) flagged the combined `from langchain_core.messages import HumanMessage, ToolMessage` line in scripts/e2e_clarifier_search_test.py. Split it into two single-line imports to match the rest of the codebase, and re-run ruff format. Signed-off-by: Torkian <torkian@mac.com>
Greptile review on PR NVIDIA-AI-Blueprints#245 flagged two real issues: P1 (agent.py): the ephemeral force_search guidance injection in agent_node was missing the iteration==0 guard that decide_route already applies. After force_search fired once and a stubborn model still refused to call a tool, ask_for_clarification ran, the user replied, and iteration advanced to 1 - but the next agent_node call still saw force_search_used=True with no tool invocations in state.messages and re-injected the nudge. The model then received "issue one focused tool call now" immediately after the user's clarifying answer, causing a gratuitous search instead of synthesizing the reply. Fixed by adding state.iteration == 0 to the injection condition. P2 (e2e script): the PASS verdict gated on tool_calls >= 1 only, without verifying len(user_prompts) == 0. A run where the LLM both searched AND prompted the user would have falsely printed "before any user prompt" and exited 0. Now we require both conditions and print distinct FAIL messages for the two failure modes. Also cleaned up an unused mock side_effect in test_force_search_guidance_not_in_state_messages (the user replies "skip" which forces completion, so only two LLM calls actually happen), and added test_force_search_guidance_not_injected_after_user_reply as a regression test for the P1 fix. 149 clarifier tests pass. Signed-off-by: Torkian <torkian@mac.com>
|
Thanks for the review! Addressed both findings in 8f18f80:
Also cleaned up an unused mock side_effect in |
cdgamarose-nv
left a comment
There was a problem hiding this comment.
Thanks for your PR! Left some comments
| @@ -0,0 +1,135 @@ | |||
| """End-to-end test for issue #234 fix. | |||
There was a problem hiding this comment.
Don't include this script in the PR. You can add test and results in PR description.
There was a problem hiding this comment.
Done — removed scripts/e2e_clarifier_search_test.py in 1377209. Moved the end-to-end test approach and before/after results into the PR description.
|
|
||
| return "ask_for_clarification" | ||
|
|
||
| async def force_search_node(state: ClarifierAgentState): |
There was a problem hiding this comment.
Wondering if we need to add a node or a variable in state. I was thinking something like this would be sufficient:
response = await bound_llm.ainvoke(messages)
if (
self.tools
and state.iteration == 0
and not self._has_tool_invocations(state.messages)
and not getattr(response, "tool_calls", None)
and self._is_needed(response.content)
):
logger.info("Clarifier: model skipped search; injecting guidance and retrying once")
retry_messages = messages + [response, HumanMessage(content=FORCE_SEARCH_GUIDANCE)]
retry_response = await bound_llm.ainvoke(retry_messages)
return {"messages": [response, retry_response]}
return {"messages": [response]}
There was a problem hiding this comment.
Good call — adopted your inline approach in 1377209. Dropped the force_search node, its routing branch/edges, and the ClarifierAgentState.force_search_used field; the nudge-and-retry now lives directly in agent_node exactly as you sketched.
The one-shot guarantee holds without the flag from two conditions alone:
iteration == 0— only the first turn; once the user replies, iteration advances and it never fires again.not _has_tool_invocations(state.messages)— after a successful forced search the tool-call message is in history (even while iteration is still 0), so it won't re-nudge.
FORCE_SEARCH_GUIDANCE is passed only in the local retry_messages and never returned to state, so it can't leak into get_latest_user_query. Net −70 source lines, behavior unchanged (same LLM call sequence + final message history). All 147 clarifier tests pass, including the force-search and ephemeral-guidance regression tests.
Address review feedback on NVIDIA-AI-Blueprints#245: replace the dedicated force_search graph node and the ClarifierAgentState.force_search_used flag with a single inline retry inside agent_node. When, on the first turn, the model asks for clarification without first using its bound search tools, agent_node now nudges it once with FORCE_SEARCH_GUIDANCE and re-invokes inline, returning both the original response and the retry. This keeps the behavior model-agnostic without adding graph state or an extra node hop. The one-shot guarantee holds from two conditions alone, so the force_search_used flag is no longer needed: - iteration == 0: only fires on the first turn; once the user replies, iteration advances and it never fires again. - not _has_tool_invocations(state.messages): once any tool call is in history (e.g. after a successful forced search, even while iteration is still 0), it never re-nudges. FORCE_SEARCH_GUIDANCE is sent only in the local retry_messages and is never returned to state, so it cannot leak into get_latest_user_query or other history lookups. Also remove scripts/e2e_clarifier_search_test.py from the PR per review; the end-to-end test approach and results are documented in the PR description instead. Behavior is unchanged (same LLM call sequence and final message history); all 147 clarifier tests pass, including the force-search behavioral and ephemeral-guidance regression tests. Signed-off-by: Torkian <torkian@mac.com>
Greptile review on NVIDIA-AI-Blueprints#245: the inline search-before-clarify retry returned both the skipped clarification response and the retry response, {"messages": [response, retry_response]}. When the retry carries a tool call (the happy path), the persisted history became [..., AIMessage(clarification), AIMessage(tool_call), ToolMessage, ...] — two consecutive assistant-role messages, which the OpenAI Chat Completions and Anthropic Messages APIs reject with a 400. NVIDIA NIM tolerates it, so it did not surface in the NIM-backed e2e run, and the mocked unit tests did not enforce message ordering. Return only the retry response. The first attempt was already shown to the model inside the local retry_messages (ephemeral, never persisted), so dropping it from state loses nothing and keeps a valid sequence regardless of whether the retry is a tool call or another clarification. Add a regression test asserting no two consecutive AIMessages are ever sent to the LLM across a forced-search run; it fails against the old [response, retry_response] return and passes with the fix. Signed-off-by: Torkian <torkian@mac.com>
While auditing the message ordering for the forced-search fix, a separate pre-existing bug surfaced in the skip-command branch. When a user types a skip command, ask_clarification returned an AIMessage(complete) without persisting the user's reply, leaving it directly after the prior AIMessage(clarification) — two consecutive assistant messages. The unconditional ask_for_clarification -> agent edge then re-entered agent_node with the clarification budget exhausted, which appended a third AIMessage(complete). With enable_plan_approval enabled this corrupted history flowed into the planner's ainvoke, producing a 400 from the OpenAI/Anthropic APIs (NVIDIA NIM tolerates it). Two coordinated fixes: - The skip branch now persists the user's reply as a HumanMessage before the completion AIMessage, so the prior and new assistant turns are separated by a human message. - agent_node's budget-exhausted branch no longer emits a completion when the last message is already a completion AIMessage (the skip branch's), avoiding the duplicate on graph re-entry. Add regression tests covering both the persisted history ordering and the planner ainvoke sequence under enable_plan_approval; both fail against the prior behavior and pass with the fix. Signed-off-by: Torkian <torkian@mac.com>
Closes #234.
Summary
The clarifier agent already had tool-calling bound to a search loop, but weak or non-frontier models routinely skipped the search and asked the user a clarification question immediately. The expected behavior — and what the issue asks for — is to gather context with the available tools first, and only ask the user once a genuine ambiguity remains.
This fixes the bug at the agent level (the issue explicitly says "this should not be overly prompt engineered for one model"), with a small prompt update as a supporting hint.
Changes
agent_node. On the first turn, if the model returns a clarification request without using its bound search tools, the node nudges it once withFORCE_SEARCH_GUIDANCEand re-invokes the LLM inline, returning both the original response and the retry. No extra graph node or state field is needed.iteration == 0(so it never fires after the user has replied) andnot _has_tool_invocations(state.messages)(so once any tool call is in history — e.g. after a successful forced search, even while iteration is still 0 — it never re-nudges). A model that stubbornly refuses to search after the single nudge simply falls through to asking the user; no loop.FORCE_SEARCH_GUIDANCEis sent only in the localretry_messagesand is never returned to state, so it cannot leak intoget_latest_user_queryor other history lookups (which would otherwise surface internal scaffolding in fallback text).prompts/research_clarification.j2Tool Usage section rewritten as "Search first, ask second" so frontier models follow this voluntarily and skip the nudge entirely.How I verified (end-to-end, against a real LLM + web search)
I exercised the full clarifier graph against a real NVIDIA NIM-hosted LLM and a real Tavily web search tool, with model
nvidia/nemotron-3-nano-30b-a3b(the repo default inconfigs/config_cli_default.yml) and the query "Research the latest NVIDIA AI announcements from 2026":developDeterministic across three runs. (The throwaway reproducer script is not included in the PR, per review — these are its results.)
Tests
150 clarifier tests pass. Key additions in
tests/aiq_agent/agents/clarifier/test_agent.py:TestHasToolInvocations— the helper that detects prior tool calls in the message history.TestClarifierForceSearch:test_force_search_fires_when_llm_skips_tools— the bug path: model clarifies → nudge + inline retry → model emits a tool call → completes; user never prompted.test_force_search_skipped_when_no_tools— no behavior change when no tools are configured.test_force_search_fires_at_most_once— no loop if the model stubbornly refuses tools after the nudge.test_force_search_not_triggered_when_llm_searches_first— no extra detour for models that already search.test_force_search_guidance_not_in_state_messages— regression test: the internal guidance must never appear in user-facing fallback text.test_force_search_guidance_not_injected_after_user_reply— regression test: the nudge does not re-fire after the user has answered.test_forced_retry_does_not_emit_consecutive_assistant_messages— regression test for the message-ordering fix below.TestClarifierSkipMessageOrdering— regression tests for the skip-path ordering fix (persisted history + planner sequence underenable_plan_approval).Message-ordering correctness (from review)
Per @cdgamarose-nv's review the implementation was simplified to an inline retry; a follow-up Greptile review then flagged that returning both the skipped and retried responses produced two consecutive assistant-role messages once the retry carried a tool call, which the OpenAI/Anthropic APIs reject with a 400 (NVIDIA NIM tolerates it). Fixed by returning only the retry response. Auditing the same invariant across the graph surfaced a separate pre-existing instance on the skip-clarification path (skip emitted a completion without persisting the user reply, and graph re-entry appended a duplicate completion) — also fixed, with regression tests for both.
Test plan
pytest tests/aiq_agent/agents/clarifier/— 150/150 passingruff checkandruff formatcleandevelopagainst the repo's default model