fix(watsonx): use prompt-based JSON parsing instead of response_format #89
haroldship wants to merge 2 commits into main
Conversation
vLLM's guided decoding behind watsonx returns empty content when response_format is set (json_schema or json_object). Fall back to prompt_template | llm | PydanticOutputParser for watsonx structured output until IBM upgrades vLLM to 0.8.2+. Closes #88
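The fallback pattern described above (prompt-embedded format instructions, then an output parser with retries) can be sketched in plain Python. This is a minimal stdlib-only illustration of the idea, not the project's actual LangChain chain; `Plan`, `build_prompt`, and `run_chain` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class Plan:
    # Illustrative stand-in for a Pydantic output schema.
    action: str
    reason: str


FORMAT_INSTRUCTIONS = (
    'Respond ONLY with a JSON object like {"action": "...", "reason": "..."}.'
)


def build_prompt(task: str) -> str:
    # Instead of sending a response_format parameter to the backend,
    # embed the schema guidance directly in the prompt text.
    return f"{task}\n\n{FORMAT_INSTRUCTIONS}"


def parse_output(raw: str) -> Plan:
    # Parser step: validate the model's free-text reply against the schema.
    data = json.loads(raw)
    return Plan(action=data["action"], reason=data["reason"])


def run_chain(task: str, llm) -> Plan:
    # prompt -> llm -> parser, with a simple retry on malformed JSON,
    # mirroring chain.with_retry(stop_after_attempt=3).
    prompt = build_prompt(task)
    for attempt in range(3):
        try:
            return parse_output(llm(prompt))
        except (json.JSONDecodeError, KeyError):
            if attempt == 2:
                raise
```

With a mocked `llm` callable, `run_chain("Open the page", lambda p: '{"action": "click", "reason": "demo"}')` returns a parsed `Plan` without any guided decoding on the server side.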
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/shared/base_agent.py`:
- Around line 157-168: The branch still sends wx_json_mode == "function_calling"
into llm.with_structured_output which is known-broken for watsonx; update the
conditional logic in the wx_json_mode handling (the branch around wx_json_mode,
llm.with_structured_output, prompt_template, parser and chain) so that
"function_calling" is not routed into llm.with_structured_output — either treat
"function_calling" the same as "response_format" by using the prompt_template |
llm | parser fallback with chain.with_retry(...) or explicitly raise a clear
error when wx_json_mode == "function_calling", ensuring callers cannot
accidentally get routed through llm.with_structured_output.
- Around line 156-166: APIPlannerAgent.create() builds a prompt_template without
attaching format instructions, so when wx_json_mode falls back to the
prompt-based route (PydanticOutputParser instance `parser` and chain
`prompt_template | llm | parser`) the prompt lacks the `{format_instructions}`
guidance and parsing repeatedly fails; fix by passing
format_instructions=BaseAgent.get_format_instructions(parser) into
load_prompt_simple when creating prompt_template in APIPlannerAgent.create() so
the prompt includes the parser’s schema guidance (consistent with
ShortlisterAgent/BrowserPlannerAgent patterns).
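The second finding is about wiring the parser's schema guidance into the prompt. A minimal stdlib sketch of that wiring, with `get_format_instructions` and `load_prompt_simple` as hypothetical analogues of the project's helpers:

```python
import json


def get_format_instructions(schema: dict) -> str:
    # Hypothetical analogue of PydanticOutputParser.get_format_instructions():
    # render the JSON schema into prompt text so the model knows the shape.
    return "Return JSON matching this schema:\n" + json.dumps(schema, indent=2)


def load_prompt_simple(template: str, **partials: str) -> str:
    # Hypothetical analogue of load_prompt_simple: pre-fill template slots.
    return template.format(**partials)


# Illustrative schema; the real one comes from the agent's Pydantic model.
schema = {"properties": {"result": {"type": "array"}}, "required": ["result"]}

prompt = load_prompt_simple(
    "You are the API planner.\n{format_instructions}",
    format_instructions=get_format_instructions(schema),
)
```

Without the `format_instructions` slot filled, the prompt-based route has no schema guidance at all, so the parser fails repeatedly, which is the failure mode the finding describes.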
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 09ee8749-1855-448c-a621-d23c7024c8ac
📒 Files selected for processing (1)
src/cuga/backend/cuga_graph/nodes/shared/base_agent.py
```diff
 if wx_json_mode == "response_format":
-    return BaseAgent.create_validated_structured_output_chain(llm, schema, prompt_template)
+    # Avoid any response_format parameter for watsonx. vLLM's guided decoding
+    # (json_schema and json_mode) returns empty content on complex schemas
+    # (vllm#15236, vllm#21148). function_calling also fails (no tool_calls).
+    # Fall back to prompt-based format instructions + PydanticOutputParser.
+    logger.debug(
+        "Using prompt-based JSON parsing for watsonx (response_format triggers empty content)"
+    )
+    chain = prompt_template | llm | parser
+    return chain.with_retry(stop_after_attempt=3)
 elif wx_json_mode == "function_calling" or wx_json_mode == "json_mode":
     chain = prompt_template | llm.with_structured_output(schema, method=wx_json_mode)
```
Don't keep function_calling on the known-broken watsonx path.
The new inline note says watsonx function_calling returns no tool_calls, but Line 167 still routes that mode through llm.with_structured_output(...). Any caller that explicitly selects wx_json_mode="function_calling" will keep seeing the old failure. Either send that mode through the same prompt+parser fallback or reject it with a clear error.
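One way to implement the reviewer's suggestion is a single guard that routes every broken watsonx mode through the prompt+parser fallback and rejects anything else explicitly. This is a hedged sketch with plain callables, not the actual `BaseAgent` code; the real method also handles modes like `no_format` that are out of scope here.

```python
def build_watsonx_chain(wx_json_mode, prompt_template, llm, parser):
    # Illustrative guard: never route watsonx through with_structured_output,
    # since response_format, json_mode, and function_calling are all broken
    # on the deployed vLLM (< 0.8.2). Names mirror the PR discussion only.
    if wx_json_mode in ("response_format", "function_calling", "json_mode"):
        # Same prompt-based fallback for all three broken modes.
        return lambda x: parser(llm(prompt_template(x)))
    # Fail loudly instead of silently producing empty content.
    raise ValueError(f"Unsupported wx_json_mode for watsonx: {wx_json_mode!r}")
```

With this shape, a caller that opts into `function_calling` gets the working fallback rather than the known-broken `llm.with_structured_output(...)` path.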
Valid point. However, no caller in the codebase explicitly passes wx_json_mode="function_calling" for watsonx today — the default is "response_format" and the only explicit overrides are "no_format" (for code_agent, reflection, etc.).
The function_calling and json_mode branches at line 167 are pre-existing code that was not modified by this PR. They would only be hit if a caller explicitly opts in.
That said, if someone does explicitly pass function_calling for watsonx, they'd hit the same vLLM issue. We can address that in a follow-up if needed, but changing pre-existing untouched branches is out of scope for this fix.
@haroldship, that's a fair and reasonable scoping decision — the current PR is a focused fix for the response_format empty-content bug, and touching untouched pre-existing branches would expand its scope unnecessarily.
Would you like me to open a follow-up GitHub issue to track the risk that wx_json_mode="function_calling" (and "json_mode") on ChatWatsonx will hit the same vLLM guided-decoding failure if a caller ever opts in? That way it's documented and won't be forgotten.
🧠 Learnings used
Learnt from: haroldship
Repo: cuga-project/cuga-agent PR: 67
File: src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py:377-397
Timestamp: 2026-03-23T08:37:18.878Z
Learning: In `src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`, when `PromptUtils.find_tools` raises an `OutputParserException` or other exception (i.e., the shortlister LLM returns malformed/empty JSON), the correct behaviour is to return a short user-facing error string to the agent — NOT a fallback tool list. This was an explicit design decision: returning all tools on LLM failure bloats the agent context and masks bugs; the error string lets the agent LLM decide whether to retry or proceed. This is distinct from the soft-filtering fallback (which applies when `filter_criteria` is provided but no tools match the criteria).
yes open the issue
🧩 Analysis chain
🏁 Script executed:
```shell
gh issue create \
  --repo cuga-project/cuga-agent \
  --title "fix(watsonx): guard function_calling and json_mode against vLLM guided-decoding failure" \
  --body "## Background
PR \`#89\` fixed the \`response_format\` empty-content bug on \`ChatWatsonx\` by switching to a prompt-based \`PydanticOutputParser\` fallback. During review it was noted that the \`function_calling\` and \`json_mode\` branches in \`BaseAgent.get_chain\` (\`src/cuga/backend/cuga_graph/nodes/shared/base_agent.py\`, ~line 167) still route through \`llm.with_structured_output(...)\`, which is known-broken on vLLM versions prior to 0.8.2 (see [vllm#15236](https://github.com/vllm-project/vllm/issues/15236) and [vllm#21148](https://github.com/vllm-project/vllm/issues/21148)).

## Risk
No caller currently passes \`wx_json_mode=\"function_calling\"\` or \`wx_json_mode=\"json_mode\"\` explicitly for a \`ChatWatsonx\` LLM — the default is \`\"response_format\"\` and current overrides use \`\"no_format\"\`. However, if a future caller opts in to either of these modes, they will silently hit the same vLLM guided-decoding failure (empty content / no tool_calls returned).

## Suggested Fix
Either:
1. Route \`\"function_calling\"\` and \`\"json_mode\"\` through the same \`prompt_template | llm | parser\` fallback with \`.with_retry(stop_after_attempt=3)\` (same approach as the \`\"response_format\"\` fix in PR \`#89\`), **or**
2. Raise a clear \`ValueError\` / \`NotImplementedError\` when \`wx_json_mode\` is \`\"function_calling\"\` or \`\"json_mode\"\` on \`ChatWatsonx\`, so callers get an explicit error rather than a silent failure.

This fix should be **reverted or revisited** once IBM upgrades the deployed vLLM to a fixed version (0.8.2+).

## References
- PR \`#89\` (fix for \`response_format\` mode): https://github.com/cuga-project/cuga-agent/pull/89
- Review comment: https://github.com/cuga-project/cuga-agent/pull/89#discussion_r2996097287
- Issue \`#88\` (original empty-content bug): https://github.com/cuga-project/cuga-agent/issues/88
- Requested by: \`@haroldship\`
" \
  --assignee haroldship
```
Length of output: 52
@haroldship, done! Follow-up issue #90 has been opened and assigned to you:
👉 #90
It documents the latent risk that wx_json_mode="function_calling" and "json_mode" on ChatWatsonx will silently hit the same vLLM guided-decoding failure, links back to this PR and issue #88, and proposes the two remediation options (prompt+parser fallback or an explicit error). The note to revisit once IBM upgrades to vLLM 0.8.2+ is included as well.
## Summary

- `ChatWatsonx.with_structured_output(schema, method="json_schema")` sends `response_format: {type: json_schema}` to the vLLM backend, which triggers xgrammar guided decoding
- Schemas with `$defs`/`$ref` (Pydantic v2 nested models) return empty content (`content=''`, `completion_tokens≈2`, `finish_reason=stop`)
- `function_calling` mode also fails (the model returns no `tool_calls`)
- Fix: `prompt_template | llm | PydanticOutputParser`; no `response_format` is sent, and the model follows the JSON examples already in the shortlister system prompt

## Root Cause

vLLM issue vllm#21148: xgrammar FSM compilation failure for schemas with `$defs`/`$ref`. The `ShortListerOutputLite` schema (Pydantic v2 with a nested `APIDetails` model) produces exactly this shape:

```json
{
  "$defs": { "APIDetails": { ... } },
  "properties": {
    "result": {
      "items": { "$ref": "#/$defs/APIDetails" },
      "type": "array"
    }
  }
}
```

Two conditions are required to trigger the bug consistently:

1. The schema uses `$defs`/`$ref` (flat single-level schemas do not trigger it)
2. … `response_format` constraint and produces content anyway

## Reproduction

Confirmed reproducible with `ChatWatsonx` + the real shortlister system prompt. See repro script in PR comments.

## Test plan

- `response_format: json_schema` (with real shortlister prompt): 100% empty content at all tool counts (5, 20, 50, 96)
- `prompt | llm | parser` baseline: works at 5 and 96 tools
- No `OutputParserException` in M3 eval (`--max-samples-per-domain 2`)

Closes #88
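The `$defs`/`$ref` shape described in the root cause can be checked programmatically. The schema below is hand-written in the form Pydantic v2 emits for a nested model; the `ShortListerOutputLite`/`APIDetails` field names are illustrative, and `uses_ref` is a hypothetical helper, not project code.

```python
# Hand-written JSON Schema in the shape Pydantic v2 emits for a nested model.
shortlister_schema = {
    "$defs": {
        "APIDetails": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        }
    },
    "type": "object",
    "properties": {
        "result": {
            "type": "array",
            "items": {"$ref": "#/$defs/APIDetails"},
        }
    },
    "required": ["result"],
}


def uses_ref(schema: dict) -> bool:
    # Detect the $defs/$ref shape that trips xgrammar's FSM compilation;
    # flat single-level schemas come back False.
    def walk(node):
        if isinstance(node, dict):
            return "$ref" in node or any(walk(v) for v in node.values())
        if isinstance(node, list):
            return any(walk(v) for v in node)
        return False

    return "$defs" in schema and walk(schema.get("properties", {}))
```

A guard like this could decide at chain-construction time whether a schema is safe to send as `response_format` or must take the prompt-based route.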
cc @sami-marreed for review
🤖 Generated with Claude Code