14 changes: 14 additions & 0 deletions src/cuga/backend/cuga_graph/nodes/shared/base_agent.py
@@ -155,6 +155,20 @@ def get_chain(
schema = APIPlannerOutputWX
parser = PydanticOutputParser(pydantic_object=schema)
if wx_json_mode == "response_format":
    # vLLM's xgrammar guided decoding fails to compile an FSM for
    # schemas with $defs/$ref (vllm#21148), returning empty content
    # (completion_tokens~=2, finish_reason=stop). Only apply the
    # prompt-based fallback when the schema actually has that shape;
    # flat schemas (e.g. PlanControllerOutput, NextAgentPlan) keep
    # working under guided decoding and are left on the existing
    # with_structured_output path.
    if "$defs" in schema.model_json_schema():
        logger.debug(
            "Schema has $defs/$ref; using prompt-based JSON parsing "
            "for watsonx (response_format triggers empty content)"
        )
        chain = prompt_template | llm | parser
        return chain.with_retry(stop_after_attempt=3)
    return BaseAgent.create_validated_structured_output_chain(llm, schema, prompt_template)
elif wx_json_mode == "function_calling" or wx_json_mode == "json_mode":
    chain = prompt_template | llm.with_structured_output(schema, method=wx_json_mode)
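As an aside, the `"$defs"` gate in the diff can be illustrated with a toy pair of pydantic models. `Nested` and `Flat` below are hypothetical stand-ins for `APIPlannerOutputWX` and `PlanControllerOutput` (not repo code): pydantic v2 emits a `$defs`/`$ref` section only when one model references another, which is exactly the shape the guided decoder chokes on.

```python
# Toy illustration (not PR code): pydantic only emits $defs/$ref when a
# model embeds another model, so a top-level key check is enough to
# distinguish the two schema shapes the diff's comment describes.
from pydantic import BaseModel


class Step(BaseModel):
    name: str


class Nested(BaseModel):  # stand-in for APIPlannerOutputWX: refers to Step
    steps: list[Step]


class Flat(BaseModel):  # stand-in for PlanControllerOutput: primitives only
    next_agent: str


assert "$defs" in Nested.model_json_schema()      # would take the fallback
assert "$defs" not in Flat.model_json_schema()    # stays on guided decoding
```

The check is deliberately shallow: a top-level `"$defs"` key is present whenever pydantic has to factor out a referenced submodel, so no recursive schema walk is needed.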
Comment on lines 157 to 174
coderabbitai (Bot) commented on Mar 26, 2026


⚠️ Potential issue | 🟠 Major

Don't keep function_calling on the known-broken watsonx path.

The new inline note says watsonx function_calling returns no tool_calls, but Line 167 still routes that mode through llm.with_structured_output(...). Any caller that explicitly selects wx_json_mode="function_calling" will keep seeing the old failure. Either send that mode through the same prompt+parser fallback or reject it with a clear error.
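A minimal sketch of the reviewer's "reject it with a clear error" option (my own illustration, not PR code; `check_wx_json_mode` and `BROKEN_WX_MODES` are hypothetical names):

```python
# Hedged sketch of option 2 from the review: fail loudly on the modes
# known to hit the vLLM guided-decoding bug, instead of silently
# returning empty content. Names and message text are illustrative.
BROKEN_WX_MODES = {"function_calling", "json_mode"}


def check_wx_json_mode(wx_json_mode: str) -> str:
    if wx_json_mode in BROKEN_WX_MODES:
        raise NotImplementedError(
            f"wx_json_mode={wx_json_mode!r} hits the vLLM guided-decoding "
            "failure on ChatWatsonx (vllm#21148); use 'response_format' "
            "(prompt-based fallback) instead"
        )
    return wx_json_mode
```

The other option the reviewer offers is simply widening the `response_format` branch's `prompt_template | llm | parser` fallback to also cover `"function_calling"`; either way, a caller can no longer be routed into the broken `with_structured_output` path unnoticed.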

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/cuga/backend/cuga_graph/nodes/shared/base_agent.py` around lines 157-168, the branch still sends wx_json_mode == "function_calling" into llm.with_structured_output, which is known-broken for watsonx. Update the wx_json_mode dispatch so that "function_calling" is no longer routed into llm.with_structured_output: either treat it the same as "response_format" by using the prompt_template | llm | parser fallback with chain.with_retry(...), or explicitly raise a clear error when wx_json_mode == "function_calling", so callers cannot accidentally end up on the broken path.

Collaborator Author


Valid point. However, no caller in the codebase explicitly passes wx_json_mode="function_calling" for watsonx today — the default is "response_format" and the only explicit overrides are "no_format" (for code_agent, reflection, etc.).

The function_calling and json_mode branches at line 167 are pre-existing code that was not modified by this PR. They would only be hit if a caller explicitly opts in.

That said, if someone does explicitly pass function_calling for watsonx, they'd hit the same vLLM issue. We can address that in a follow-up if needed, but changing pre-existing untouched branches is out of scope for this fix.



@haroldship, that's a fair and reasonable scoping decision — the current PR is a focused fix for the response_format empty-content bug, and touching untouched pre-existing branches would expand its scope unnecessarily.

Would you like me to open a follow-up GitHub issue to track the risk that wx_json_mode="function_calling" (and "json_mode") on ChatWatsonx will hit the same vLLM guided-decoding failure if a caller ever opts in? That way it's documented and won't be forgotten.


🧠 Learnings used
Learnt from: haroldship
Repo: cuga-project/cuga-agent PR: 67
File: src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py:377-397
Timestamp: 2026-03-23T08:37:18.878Z
Learning: In `src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`, when `PromptUtils.find_tools` raises an `OutputParserException` or other exception (i.e., the shortlister LLM returns malformed/empty JSON), the correct behaviour is to return a short user-facing error string to the agent — NOT a fallback tool list. This was an explicit design decision: returning all tools on LLM failure bloats the agent context and masks bugs; the error string lets the agent LLM decide whether to retry or proceed. This is distinct from the soft-filtering fallback (which applies when `filter_criteria` is provided but no tools match the criteria).
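The behaviour this learning describes can be sketched in a few lines (illustrative only: `shortlist`, `find_tools_safe`, and the stubbed `OutputParserException` are my own names, not the repo's `PromptUtils.find_tools`):

```python
# Hedged sketch of the learning: on a parse failure from the shortlister
# LLM, return a short user-facing error string -- never a fallback
# "all tools" list, which would bloat context and mask the bug.
class OutputParserException(Exception):
    """Stub standing in for langchain's OutputParserException."""


def shortlist(raw: str) -> list[str]:
    if not raw.strip():
        raise OutputParserException("empty LLM output")
    return raw.split(",")


def find_tools_safe(raw: str):
    try:
        return shortlist(raw)
    except Exception:
        # Deliberately NOT `return all_tools`: the error string lets the
        # agent LLM decide whether to retry or proceed without tools.
        return "Error: tool shortlisting failed; retry or proceed without tools."


assert find_tools_safe("a,b") == ["a", "b"]
assert find_tools_safe("  ").startswith("Error:")
```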

Collaborator Author


yes open the issue



🧩 Analysis chain

🏁 Script executed:

gh issue create \
  --repo cuga-project/cuga-agent \
  --title "fix(watsonx): guard function_calling and json_mode against vLLM guided-decoding failure" \
  --body "## Background

PR `#89` fixed the \`response_format\` empty-content bug on \`ChatWatsonx\` by switching to a prompt-based \`PydanticOutputParser\` fallback. During review it was noted that the \`function_calling\` and \`json_mode\` branches in \`BaseAgent.get_chain\` (\`src/cuga/backend/cuga_graph/nodes/shared/base_agent.py\`, ~line 167) still route through \`llm.with_structured_output(...)\`, which is known-broken on vLLM versions prior to 0.8.2 (see [vllm#15236](https://github.com/vllm-project/vllm/issues/15236) and [vllm#21148](https://github.com/vllm-project/vllm/issues/21148)).

## Risk

No caller currently passes \`wx_json_mode=\"function_calling\"\` or \`wx_json_mode=\"json_mode\"\` explicitly for a \`ChatWatsonx\` LLM — the default is \`\"response_format\"\` and current overrides use \`\"no_format\"\`. However, if a future caller opts in to either of these modes, they will silently hit the same vLLM guided-decoding failure (empty content / no tool_calls returned).

## Suggested Fix

Either:
1. Route \`\"function_calling\"\` and \`\"json_mode\"\` through the same \`prompt_template | llm | parser\` fallback with \`.with_retry(stop_after_attempt=3)\` (same approach as the \`\"response_format\"\` fix in PR `#89`), **or**
2. Raise a clear \`ValueError\` / \`NotImplementedError\` when \`wx_json_mode\` is \`\"function_calling\"\` or \`\"json_mode\"\` on \`ChatWatsonx\`, so callers get an explicit error rather than a silent failure.

This fix should be **reverted or revisited** once IBM upgrades the deployed vLLM to a fixed version (0.8.2+).

## References

- PR `#89` (fix for \`response_format\` mode): https://github.com/cuga-project/cuga-agent/pull/89
- Review comment: https://github.com/cuga-project/cuga-agent/pull/89#discussion_r2996097287
- Issue `#88` (original empty-content bug): https://github.com/cuga-project/cuga-agent/issues/88
- Requested by: `@haroldship`
" \
  --assignee haroldship

Length of output: 52


@haroldship, done! Follow-up issue #90 has been opened and assigned to you:

👉 #90

It documents the latent risk that wx_json_mode="function_calling" and "json_mode" on ChatWatsonx will silently hit the same vLLM guided-decoding failure, links back to this PR and issue #88, and proposes the two remediation options (prompt+parser fallback or an explicit error). The note to revisit once IBM upgrades to vLLM 0.8.2+ is included as well.


