
fix(watsonx): use prompt-based JSON parsing instead of response_format #89

Open
haroldship wants to merge 2 commits into main from fix/watsonx-empty-response-format

Conversation

Collaborator

@haroldship haroldship commented Mar 26, 2026

Summary

  • ChatWatsonx.with_structured_output(schema, method="json_schema") sends response_format: {type: json_schema} to the vLLM backend, which triggers xgrammar guided decoding
  • xgrammar fails to compile an FSM for schemas containing $defs/$ref (Pydantic v2 nested models), returning empty content (content='', completion_tokens≈2, finish_reason=stop)
  • function_calling mode also fails (the model returns no tool_calls)
  • Fix: use prompt_template | llm | PydanticOutputParser — no response_format is sent, and the model follows the JSON examples already in the shortlister system prompt

Root Cause

vLLM issue vllm#21148: xgrammar FSM compilation failure for schemas with $defs/$ref. The ShortListerOutputLite schema (Pydantic v2 with a nested APIDetails model) produces exactly this shape:

{
  "$defs": { "APIDetails": { ... } },
  "properties": {
    "result": { "items": { "$ref": "#/$defs/APIDetails" }, "type": "array" }
  }
}

Two conditions are required to trigger the bug consistently:

  1. A schema using $defs/$ref (flat single-level schemas do not trigger it)
  2. The real shortlister system prompt (9 KB, with few-shot examples) — with a short prompt the model happens to ignore the response_format constraint and produces content anyway
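The $defs/$ref shape can be checked directly from the Pydantic models. A minimal sketch mirroring ShortListerOutputLite:

```python
# Minimal sketch showing why a nested Pydantic v2 model emits $defs/$ref,
# the schema shape that trips xgrammar's FSM compilation.
from typing import List

from pydantic import BaseModel


class APIDetails(BaseModel):
    name: str
    relevance_score: float
    reasoning: str


class ShortListerOutputLite(BaseModel):
    result: List[APIDetails]


schema = ShortListerOutputLite.model_json_schema()

# Pydantic v2 hoists the nested model into $defs and references it via $ref,
# rather than inlining it as a flat single-level schema.
assert "$defs" in schema and "APIDetails" in schema["$defs"]
assert schema["properties"]["result"]["items"] == {"$ref": "#/$defs/APIDetails"}
```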

Reproduction

Confirmed reproducible with ChatWatsonx + the real shortlister system prompt. See repro script in PR comments.

Test plan

  • Verified response_format: json_schema (with real shortlister prompt) → 100% empty content at all tool counts (5, 20, 50, 96)
  • Verified prompt | llm | parser baseline → works at 5 and 96 tools
  • Verified fix resolves OutputParserException in M3 eval (--max-samples-per-domain 2)
  • Run full M3 benchmark evaluation with fix

Closes #88

cc @sami-marreed for review

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Streamlined JSON response handling for Watsonx backend with simplified chain processing while preserving retry behavior.

  vLLM's guided decoding behind watsonx returns empty content when
  response_format is set (json_schema or json_object). Fall back to
  prompt_template | llm | PydanticOutputParser for watsonx structured
  output until IBM upgrades vLLM to 0.8.2+.

  Closes #88

coderabbitai bot commented Mar 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38e3c1bf-2266-4b5f-880b-5a58859daac9

📥 Commits

Reviewing files that changed from the base of the PR and between e710528 and 2fb6373.

📒 Files selected for processing (1)
  • src/cuga/backend/cuga_graph/nodes/shared/base_agent.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cuga/backend/cuga_graph/nodes/shared/base_agent.py

📝 Walkthrough


For Watsonx ChatWatsonx when wx_json_mode == "response_format", BaseAgent.get_chain now builds prompt_template | llm | PydanticOutputParser(schema) and wraps that chain with with_retry(stop_after_attempt=3) instead of delegating to create_validated_structured_output_chain(...). A debug log message was added.

Changes

Cohort / File(s) | Summary
Watsonx Response-Format Chain Fallback (src/cuga/backend/cuga_graph/nodes/shared/base_agent.py) | Replaced the call to BaseAgent.create_validated_structured_output_chain() for the Watsonx response_format path with an explicit `prompt_template | llm | PydanticOutputParser(schema)` chain wrapped in `with_retry(stop_after_attempt=3)`, and added a debug log message.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Poem

🐰 I hopped through code with nimble feet,
Swapped structured binds for a simpler beat,
Prompt, LLM, parser — now they play,
Three retries hum and save the day,
Watsonx sings JSON, neat and sweet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The title clearly and specifically describes the main change: switching from response_format-based to prompt-based JSON parsing for the watsonx backend.
Linked Issues Check | ✅ Passed | The code changes implement the primary fix from issue #88: replacing response_format with prompt-based JSON parsing for watsonx when wx_json_mode='response_format'.
Out of Scope Changes Check | ✅ Passed | Changes are narrowly focused on the BaseAgent.get_chain method for watsonx JSON mode handling; all modifications directly address the response_format bug workaround described in issue #88.


@haroldship haroldship requested a review from sami-marreed March 26, 2026 16:06
@haroldship haroldship added the bug Something isn't working label Mar 26, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/shared/base_agent.py`:
- Around line 157-168: The branch still sends wx_json_mode == "function_calling"
into llm.with_structured_output which is known-broken for watsonx; update the
conditional logic in the wx_json_mode handling (the branch around wx_json_mode,
llm.with_structured_output, prompt_template, parser and chain) so that
"function_calling" is not routed into llm.with_structured_output — either treat
"function_calling" the same as "response_format" by using the prompt_template |
llm | parser fallback with chain.with_retry(...) or explicitly raise a clear
error when wx_json_mode == "function_calling", ensuring callers cannot
accidentally get routed through llm.with_structured_output.
- Around line 156-166: APIPlannerAgent.create() builds a prompt_template without
attaching format instructions, so when wx_json_mode falls back to the
prompt-based route (PydanticOutputParser instance `parser` and chain
`prompt_template | llm | parser`) the prompt lacks the `{format_instructions}`
guidance and parsing repeatedly fails; fix by passing
format_instructions=BaseAgent.get_format_instructions(parser) into
load_prompt_simple when creating prompt_template in APIPlannerAgent.create() so
the prompt includes the parser’s schema guidance (consistent with
ShortlisterAgent/BrowserPlannerAgent patterns).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 09ee8749-1855-448c-a621-d23c7024c8ac

📥 Commits

Reviewing files that changed from the base of the PR and between 2b363a1 and e710528.

📒 Files selected for processing (1)
  • src/cuga/backend/cuga_graph/nodes/shared/base_agent.py

Comment on lines 157 to 168
 if wx_json_mode == "response_format":
-    return BaseAgent.create_validated_structured_output_chain(llm, schema, prompt_template)
+    # Avoid any response_format parameter for watsonx. vLLM's guided decoding
+    # (json_schema and json_mode) returns empty content on complex schemas
+    # (vllm#15236, vllm#21148). function_calling also fails (no tool_calls).
+    # Fall back to prompt-based format instructions + PydanticOutputParser.
+    logger.debug(
+        "Using prompt-based JSON parsing for watsonx (response_format triggers empty content)"
+    )
+    chain = prompt_template | llm | parser
+    return chain.with_retry(stop_after_attempt=3)
 elif wx_json_mode == "function_calling" or wx_json_mode == "json_mode":
     chain = prompt_template | llm.with_structured_output(schema, method=wx_json_mode)

@coderabbitai coderabbitai bot Mar 26, 2026


⚠️ Potential issue | 🟠 Major

Don't keep function_calling on the known-broken watsonx path.

The new inline note says watsonx function_calling returns no tool_calls, but Line 167 still routes that mode through llm.with_structured_output(...). Any caller that explicitly selects wx_json_mode="function_calling" will keep seeing the old failure. Either send that mode through the same prompt+parser fallback or reject it with a clear error.
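The reject-with-a-clear-error option can be sketched in isolation. All names here are illustrative, not the actual BaseAgent.get_chain API; the point is only that the known-broken modes fail loudly instead of silently returning empty content.

```python
# Hypothetical routing helper (illustrative names, not real base_agent code):
# modes known to break on watsonx/vLLM guided decoding are rejected up front.
BROKEN_WX_MODES = {"function_calling", "json_mode"}


def route_wx_json_mode(wx_json_mode: str) -> str:
    """Decide how to build the chain for a given watsonx JSON mode."""
    if wx_json_mode == "response_format":
        # prompt_template | llm | parser fallback, wrapped in with_retry(3)
        return "prompt_parser_fallback"
    if wx_json_mode in BROKEN_WX_MODES:
        raise ValueError(
            f"wx_json_mode={wx_json_mode!r} is unsupported on ChatWatsonx: "
            "vLLM guided decoding returns empty content / no tool_calls "
            "(vllm#15236, vllm#21148)"
        )
    return "no_format"
```

The alternative (routing the broken modes through the same prompt+parser fallback) is the same function with the `raise` replaced by `return "prompt_parser_fallback"`.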


Collaborator Author


Valid point. However, no caller in the codebase explicitly passes wx_json_mode="function_calling" for watsonx today — the default is "response_format" and the only explicit overrides are "no_format" (for code_agent, reflection, etc.).

The function_calling and json_mode branches at line 167 are pre-existing code that was not modified by this PR. They would only be hit if a caller explicitly opts in.

That said, if someone does explicitly pass function_calling for watsonx, they'd hit the same vLLM issue. We can address that in a follow-up if needed, but changing pre-existing untouched branches is out of scope for this fix.



@haroldship, that's a fair and reasonable scoping decision — the current PR is a focused fix for the response_format empty-content bug, and touching untouched pre-existing branches would expand its scope unnecessarily.

Would you like me to open a follow-up GitHub issue to track the risk that wx_json_mode="function_calling" (and "json_mode") on ChatWatsonx will hit the same vLLM guided-decoding failure if a caller ever opts in? That way it's documented and won't be forgotten.


🧠 Learnings used
Learnt from: haroldship
Repo: cuga-project/cuga-agent PR: 67
File: src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py:377-397
Timestamp: 2026-03-23T08:37:18.878Z
Learning: In `src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`, when `PromptUtils.find_tools` raises an `OutputParserException` or other exception (i.e., the shortlister LLM returns malformed/empty JSON), the correct behaviour is to return a short user-facing error string to the agent — NOT a fallback tool list. This was an explicit design decision: returning all tools on LLM failure bloats the agent context and masks bugs; the error string lets the agent LLM decide whether to retry or proceed. This is distinct from the soft-filtering fallback (which applies when `filter_criteria` is provided but no tools match the criteria).

Collaborator Author


yes open the issue



🧩 Analysis chain

🏁 Script executed:

gh issue create \
  --repo cuga-project/cuga-agent \
  --title "fix(watsonx): guard function_calling and json_mode against vLLM guided-decoding failure" \
  --body "## Background

PR `#89` fixed the \`response_format\` empty-content bug on \`ChatWatsonx\` by switching to a prompt-based \`PydanticOutputParser\` fallback. During review it was noted that the \`function_calling\` and \`json_mode\` branches in \`BaseAgent.get_chain\` (\`src/cuga/backend/cuga_graph/nodes/shared/base_agent.py\`, ~line 167) still route through \`llm.with_structured_output(...)\`, which is known-broken on vLLM versions prior to 0.8.2 (see [vllm#15236](https://github.com/vllm-project/vllm/issues/15236) and [vllm#21148](https://github.com/vllm-project/vllm/issues/21148)).

## Risk

No caller currently passes \`wx_json_mode=\"function_calling\"\` or \`wx_json_mode=\"json_mode\"\` explicitly for a \`ChatWatsonx\` LLM — the default is \`\"response_format\"\` and current overrides use \`\"no_format\"\`. However, if a future caller opts in to either of these modes, they will silently hit the same vLLM guided-decoding failure (empty content / no tool_calls returned).

## Suggested Fix

Either:
1. Route \`\"function_calling\"\` and \`\"json_mode\"\` through the same \`prompt_template | llm | parser\` fallback with \`.with_retry(stop_after_attempt=3)\` (same approach as the \`\"response_format\"\` fix in PR `#89`), **or**
2. Raise a clear \`ValueError\` / \`NotImplementedError\` when \`wx_json_mode\` is \`\"function_calling\"\` or \`\"json_mode\"\` on \`ChatWatsonx\`, so callers get an explicit error rather than a silent failure.

This fix should be **reverted or revisited** once IBM upgrades the deployed vLLM to a fixed version (0.8.2+).

## References

- PR `#89` (fix for \`response_format\` mode): https://github.com/cuga-project/cuga-agent/pull/89
- Review comment: https://github.com/cuga-project/cuga-agent/pull/89#discussion_r2996097287
- Issue `#88` (original empty-content bug): https://github.com/cuga-project/cuga-agent/issues/88
- Requested by: `@haroldship`
" \
  --assignee haroldship

Length of output: 52


@haroldship, done! Follow-up issue #90 has been opened and assigned to you:

👉 #90

It documents the latent risk that wx_json_mode="function_calling" and "json_mode" on ChatWatsonx will silently hit the same vLLM guided-decoding failure, links back to this PR and issue #88, and proposes the two remediation options (prompt+parser fallback or an explicit error). The note to revisit once IBM upgrades to vLLM 0.8.2+ is included as well.



@haroldship
Collaborator Author

Reproduction script

Two conditions are required to trigger the bug consistently:

  1. Schema uses $defs/$ref (Pydantic v2 nested model — ShortListerOutputLite qualifies)
  2. The real shortlister system prompt (9 KB with few-shot examples) — short prompts don't trigger it reliably

Run with uv run python scripts/repro_watsonx_response_format.py after copying scripts/.env.example to scripts/.env and filling in credentials.

scripts/.env.example
WATSONX_URL=https://us-south.ml.cloud.ibm.com
WATSONX_PROJECT_ID=your-project-id-here
WATSONX_APIKEY=your-api-key-here
# or: WATSONX_API_KEY=your-api-key-here
MODEL_NAME=openai/gpt-oss-120b
scripts/repro_watsonx_response_format.py
"""
Minimal reproduction script for watsonx response_format empty content bug.

When ChatWatsonx.with_structured_output(schema, method="json_schema") is called
with a large prompt (many tool definitions) and a schema using $defs/$ref
(Pydantic models with nested objects/arrays), the watsonx API returns empty
content (content='', completion_tokens~=2, finish_reason=stop).

This reproduces the exact code path used in cuga's shortlister:
  prompt_utils.find_tools
    -> create_validated_structured_output_chain
    -> llm.with_structured_output(ShortListerOutputLite, method="json_schema")
    -> ChatWatsonx.bind(response_format={type: json_schema, schema: model_json_schema()})

Prompt structure mirrors find_tools in prompt_utils.py:
  system:  shortlister/system.jinja2 (long, with few-shot examples)
  human:   "Current Apps: {all_apps}\nCurrent Available Tools: {all_tools}"
  ai:      "Sure, now give me the intent"
  human:   "User Intent: {query}"

This was observed with model openai/gpt-oss-120b on us-south.ml.cloud.ibm.com
using langchain-ibm 1.0.2 / ibm-watsonx-ai 1.4.11.

Usage:
    cp scripts/.env.example scripts/.env
    # Fill in WATSONX_APIKEY (or WATSONX_API_KEY) in scripts/.env
    uv run python scripts/repro_watsonx_response_format.py

Requires:
    langchain-ibm>=1.0.2
    ibm-watsonx-ai>=1.4.11
    python-dotenv
    pydantic
    jinja2
"""

import json
import os
from pathlib import Path
from typing import List

from dotenv import load_dotenv

# Load from scripts/.env if present, else fall back to project root .env
load_dotenv(Path(__file__).parent / ".env", override=True)
load_dotenv(override=False)

WATSONX_URL = os.environ["WATSONX_URL"]
WATSONX_PROJECT_ID = os.environ["WATSONX_PROJECT_ID"]
WATSONX_API_KEY = os.environ.get("WATSONX_API_KEY") or os.environ["WATSONX_APIKEY"]
MODEL_ID = os.environ.get("MODEL_NAME", "openai/gpt-oss-120b").strip('"')

# Path to the actual shortlister system prompt (relative to cuga-agent repo root)
SHORTLISTER_SYSTEM_PROMPT_PATH = (
    Path(__file__).parent.parent
    / "src/cuga/backend/cuga_graph/nodes/cuga_lite/prompts/shortlister/system.jinja2"
)

from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ibm import ChatWatsonx


# ---------------------------------------------------------------------------
# Schema — exact copy of ShortListerOutputLite (uses $defs/$ref via Pydantic v2)
# ---------------------------------------------------------------------------

class APIDetails(BaseModel):
    name: str = Field(description="API Name")
    relevance_score: float = Field(description="Relevance score")
    reasoning: str = Field(description="Reasoning")


class ShortListerOutputLite(BaseModel):
    result: List[APIDetails]


# ---------------------------------------------------------------------------
# Prompt builder — mirrors prompt_utils.find_tools message structure
# ---------------------------------------------------------------------------

def load_system_prompt() -> str:
    """Load the actual shortlister system prompt, stripping Jinja2 conditionals."""
    if SHORTLISTER_SYSTEM_PROMPT_PATH.exists():
        try:
            from jinja2 import Environment
            env = Environment()
            tmpl = env.from_string(SHORTLISTER_SYSTEM_PROMPT_PATH.read_text())
            return tmpl.render(instructions=None, memory=None)
        except Exception:
            return SHORTLISTER_SYSTEM_PROMPT_PATH.read_text()
    # Fallback: short system prompt
    return (
        "You are a tool selection assistant. Given a user query and a list of tools, "
        "select the most relevant tools and return a JSON object matching the schema."
    )


def build_prompt(system: str) -> ChatPromptTemplate:
    """3-turn structure matching prompt_utils.find_tools."""
    # Escape {{ }} in the rendered system prompt so LangChain doesn't treat
    # JSON examples (e.g. {"thoughts": ...}) as template variables.
    escaped_system = system.replace("{", "{{").replace("}", "}}")
    return ChatPromptTemplate.from_messages([
        ("system", escaped_system),
        ("human", "Current Apps: {all_apps}\nCurrent Available Tools: {all_tools}"),
        ("ai", "Sure, now give me the intent"),
        ("human", "User Intent: {query}"),
    ])


# ---------------------------------------------------------------------------
# Synthetic tool / app generators
# ---------------------------------------------------------------------------

def make_tools(n: int) -> dict:
    """Generate n synthetic movielens-style tool definitions."""
    endpoints = [
        ("get_movies", "Returns a list of movies", {"movie_id": "integer", "genre": "string", "year": "integer"}),
        ("get_ratings", "Returns user ratings for movies", {"user_id": "integer", "movie_id": "integer"}),
        ("get_users", "Returns user information", {"user_id": "integer", "age_group": "string"}),
        ("get_genres", "Returns all genres", {"genre_id": "integer"}),
        ("get_directors", "Returns director information", {"director_id": "integer", "movie_id": "integer"}),
        ("get_actors", "Returns actors in a movie", {"actor_id": "integer", "movie_id": "integer"}),
        ("get_tags", "Returns user-assigned tags for movies", {"tag_id": "integer", "movie_id": "integer"}),
        ("get_links", "Returns external links for movies", {"movie_id": "integer"}),
        ("search_movies", "Full-text search over movie titles", {"query": "string", "limit": "integer"}),
        ("get_avg_rating", "Returns average rating for a movie", {"movie_id": "integer"}),
    ]
    tools = {}
    for i in range(n):
        base = endpoints[i % len(endpoints)]
        name, desc, params = base
        tool_name = f"{name}_{i // len(endpoints)}" if i >= len(endpoints) else name
        props = {k: {"type": v, "description": f"The {k} parameter"} for k, v in params.items()}
        tools[tool_name] = {
            "name": tool_name,
            "description": desc,
            "args_schema": {
                "type": "object",
                "properties": props,
                "required": list(params.keys())[:1],
            },
            "_response_schemas": {
                "200": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "id": {"type": "integer"},
                            "title": {"type": "string"},
                            "value": {"type": "number"},
                        }
                    }
                }
            }
        }
    return tools


def make_apps() -> dict:
    return {
        "movielens": {
            "name": "movielens",
            "description": "MovieLens dataset API — movies, ratings, users, genres, directors, actors.",
        }
    }


# ---------------------------------------------------------------------------
# Call helpers
# ---------------------------------------------------------------------------

def call_with_structured_output(label: str, tools: dict, system: str) -> None:
    """Mirrors create_validated_structured_output_chain."""
    llm = ChatWatsonx(
        model_id=MODEL_ID,
        url=WATSONX_URL,
        project_id=WATSONX_PROJECT_ID,
        apikey=WATSONX_API_KEY,
        params={"temperature": 0.1, "max_new_tokens": 1024},
    )
    prompt = build_prompt(system)
    chain = prompt | llm.with_structured_output(ShortListerOutputLite, method="json_schema")
    try:
        result = chain.invoke({
            "all_apps": json.dumps(make_apps()),
            "all_tools": json.dumps(tools),
            "query": "find movies with high user ratings and list their director IDs",
        })
        status = "✅ OK   "
        detail = f"got {len(result.result)} tools"
    except Exception as e:
        status = "❌ FAIL "
        detail = repr(str(e)[:300])
    print(f"  {status}  [{label}]")
    print(f"           {detail}\n")


def call_no_response_format(label: str, tools: dict, system: str) -> None:
    """Baseline: prompt | llm | PydanticOutputParser (no response_format)"""
    llm = ChatWatsonx(
        model_id=MODEL_ID,
        url=WATSONX_URL,
        project_id=WATSONX_PROJECT_ID,
        apikey=WATSONX_API_KEY,
        params={"temperature": 0.1, "max_new_tokens": 1024},
    )
    parser = PydanticOutputParser(pydantic_object=ShortListerOutputLite)
    escaped_system = system.replace("{", "{{").replace("}", "}}")
    sys_with_format = escaped_system + "\n\nReturn JSON matching this format:\n{format_instructions}"
    prompt = ChatPromptTemplate.from_messages([
        ("system", sys_with_format),
        ("human", "Current Apps: {all_apps}\nCurrent Available Tools: {all_tools}"),
        ("ai", "Sure, now give me the intent"),
        ("human", "User Intent: {query}"),
    ])
    chain = prompt | llm | parser
    try:
        result = chain.invoke({
            "all_apps": json.dumps(make_apps()),
            "all_tools": json.dumps(tools),
            "query": "find movies with high user ratings and list their director IDs",
            "format_instructions": parser.get_format_instructions(),
        })
        status = "✅ OK   "
        detail = f"got {len(result.result)} tools"
    except Exception as e:
        status = "❌ FAIL "
        detail = repr(str(e)[:300])
    print(f"  {status}  [{label}]")
    print(f"           {detail}\n")


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"Model: {MODEL_ID}")
    print(f"URL:   {WATSONX_URL}\n")

    system = load_system_prompt()
    using_real = SHORTLISTER_SYSTEM_PROMPT_PATH.exists()
    print(f"System prompt: {'real shortlister template' if using_real else 'fallback (short)'} ({len(system)} chars)\n")

    for n_tools in [5, 20, 50, 96]:
        tools = make_tools(n_tools)
        tools_json = json.dumps(tools)
        print("=" * 60)
        print(f"{n_tools} tools (~{len(tools_json)} chars)")
        print("=" * 60)
        call_no_response_format(f"no response_format ({n_tools} tools)", tools, system)
        call_with_structured_output(f"json_schema response_format ({n_tools} tools)", tools, system)

Sample output showing the bug:

Model: openai/gpt-oss-120b
URL:   https://us-south.ml.cloud.ibm.com

System prompt: real shortlister template (9866 chars)

============================================================
5 tools (~2425 chars)
============================================================
  ✅ OK     [no response_format (5 tools)]
           got 3 tools

  ❌ FAIL   [json_schema response_format (5 tools)]
           'Invalid json output: \nFor troubleshooting, visit: ...'

============================================================
96 tools (~45834 chars)
============================================================
  ✅ OK     [no response_format (96 tools)]
           got 2 tools

  ❌ FAIL   [json_schema response_format (96 tools)]
           'Invalid json output: \nFor troubleshooting, visit: ...'

@sami-marreed
Contributor

Why stability tests fail


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: watsonx vLLM guided decoding returns empty content with response_format

2 participants