
fix(watsonx): use prompt-based JSON parsing instead of response_format #89

Open
haroldship wants to merge 2 commits into main from fix/watsonx-empty-response-format

Conversation

Collaborator

@haroldship haroldship commented Mar 26, 2026

Summary

  • ChatWatsonx.with_structured_output(schema, method="json_schema") sends response_format: {type: json_schema} to the vLLM backend, which triggers xgrammar guided decoding
  • xgrammar fails to compile an FSM for schemas containing $defs/$ref (Pydantic v2 nested models), returning empty content (content='', completion_tokens≈2, finish_reason=stop)
  • function_calling mode also fails (the model returns no tool_calls)
  • Fix: use prompt_template | llm | PydanticOutputParser — no response_format is sent, and the model follows the JSON examples already in the shortlister system prompt

Root Cause

vLLM issue vllm#21148: xgrammar FSM compilation failure for schemas with $defs/$ref. The ShortListerOutputLite schema (Pydantic v2 with a nested APIDetails model) produces exactly this shape:

{
  "$defs": { "APIDetails": { ... } },
  "properties": {
    "result": { "items": { "$ref": "#/$defs/APIDetails" }, "type": "array" }
  }
}

Two conditions are required to trigger the bug consistently:

  1. A schema using $defs/$ref (flat single-level schemas do not trigger it)
  2. The real shortlister system prompt (9 KB, with few-shot examples) — with a short prompt the model happens to ignore the response_format constraint and produces content anyway
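The $defs/$ref shape can be checked directly from the Pydantic models. A minimal sketch mirroring ShortListerOutputLite:

```python
# Minimal sketch showing why a nested Pydantic v2 model emits $defs/$ref,
# the schema shape that trips xgrammar's FSM compilation.
from typing import List

from pydantic import BaseModel


class APIDetails(BaseModel):
    name: str
    relevance_score: float
    reasoning: str


class ShortListerOutputLite(BaseModel):
    result: List[APIDetails]


schema = ShortListerOutputLite.model_json_schema()

# Pydantic v2 hoists the nested model into $defs and references it via $ref,
# rather than inlining it as a flat single-level schema.
assert "$defs" in schema and "APIDetails" in schema["$defs"]
assert schema["properties"]["result"]["items"] == {"$ref": "#/$defs/APIDetails"}
```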

Reproduction

Confirmed reproducible with ChatWatsonx + the real shortlister system prompt. See repro script in PR comments.

Test plan

  • Verified response_format: json_schema (with real shortlister prompt) → 100% empty content at all tool counts (5, 20, 50, 96)
  • Verified prompt | llm | parser baseline → works at 5 and 96 tools
  • Verified fix resolves OutputParserException in M3 eval (--max-samples-per-domain 2)
  • Run full M3 benchmark evaluation with fix

Closes #88

cc @sami-marreed for review

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Streamlined JSON response handling for Watsonx backend with simplified chain processing while preserving retry behavior.

  vLLM's guided decoding behind watsonx returns empty content when
  response_format is set (json_schema or json_object). Fall back to
  prompt_template | llm | PydanticOutputParser for watsonx structured
  output until IBM upgrades vLLM to 0.8.2+.

  Closes #88

coderabbitai bot commented Mar 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38e3c1bf-2266-4b5f-880b-5a58859daac9

📥 Commits

Reviewing files that changed from the base of the PR and between e710528 and 2fb6373.

📒 Files selected for processing (1)
  • src/cuga/backend/cuga_graph/nodes/shared/base_agent.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cuga/backend/cuga_graph/nodes/shared/base_agent.py

📝 Walkthrough


For Watsonx ChatWatsonx when wx_json_mode == "response_format", BaseAgent.get_chain now builds prompt_template | llm | PydanticOutputParser(schema) and wraps that chain with with_retry(stop_after_attempt=3) instead of delegating to create_validated_structured_output_chain(...). A debug log message was added.

Changes

Cohort / File(s) | Summary
Watsonx Response-Format Chain Fallback (src/cuga/backend/cuga_graph/nodes/shared/base_agent.py) | Replaced the call to BaseAgent.create_validated_structured_output_chain() for the Watsonx response_format path with an explicit `prompt_template | llm | PydanticOutputParser(schema)` chain wrapped in `with_retry(stop_after_attempt=3)`, and added a debug log message.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Poem

🐰 I hopped through code with nimble feet,
Swapped structured binds for a simpler beat,
Prompt, LLM, parser — now they play,
Three retries hum and save the day,
Watsonx sings JSON, neat and sweet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The title clearly and specifically describes the main change: switching from response_format-based to prompt-based JSON parsing for the watsonx backend.
Linked Issues Check | ✅ Passed | The code changes implement the primary fix from issue #88: replacing response_format with prompt-based JSON parsing for watsonx when wx_json_mode='response_format'.
Out of Scope Changes Check | ✅ Passed | Changes are narrowly focused on the BaseAgent.get_chain method for watsonx JSON mode handling; all modifications directly address the response_format bug workaround described in issue #88.


@haroldship haroldship requested a review from sami-marreed March 26, 2026 16:06
@haroldship haroldship added the bug Something isn't working label Mar 26, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/shared/base_agent.py`:
- Around line 157-168: The branch still sends wx_json_mode == "function_calling"
into llm.with_structured_output which is known-broken for watsonx; update the
conditional logic in the wx_json_mode handling (the branch around wx_json_mode,
llm.with_structured_output, prompt_template, parser and chain) so that
"function_calling" is not routed into llm.with_structured_output — either treat
"function_calling" the same as "response_format" by using the prompt_template |
llm | parser fallback with chain.with_retry(...) or explicitly raise a clear
error when wx_json_mode == "function_calling", ensuring callers cannot
accidentally get routed through llm.with_structured_output.
- Around line 156-166: APIPlannerAgent.create() builds a prompt_template without
attaching format instructions, so when wx_json_mode falls back to the
prompt-based route (PydanticOutputParser instance `parser` and chain
`prompt_template | llm | parser`) the prompt lacks the `{format_instructions}`
guidance and parsing repeatedly fails; fix by passing
format_instructions=BaseAgent.get_format_instructions(parser) into
load_prompt_simple when creating prompt_template in APIPlannerAgent.create() so
the prompt includes the parser’s schema guidance (consistent with
ShortlisterAgent/BrowserPlannerAgent patterns).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 09ee8749-1855-448c-a621-d23c7024c8ac

📥 Commits

Reviewing files that changed from the base of the PR and between 2b363a1 and e710528.

📒 Files selected for processing (1)
  • src/cuga/backend/cuga_graph/nodes/shared/base_agent.py

Comment on lines 157 to 168
 if wx_json_mode == "response_format":
-    return BaseAgent.create_validated_structured_output_chain(llm, schema, prompt_template)
+    # Avoid any response_format parameter for watsonx. vLLM's guided decoding
+    # (json_schema and json_mode) returns empty content on complex schemas
+    # (vllm#15236, vllm#21148). function_calling also fails (no tool_calls).
+    # Fall back to prompt-based format instructions + PydanticOutputParser.
+    logger.debug(
+        "Using prompt-based JSON parsing for watsonx (response_format triggers empty content)"
+    )
+    chain = prompt_template | llm | parser
+    return chain.with_retry(stop_after_attempt=3)
 elif wx_json_mode == "function_calling" or wx_json_mode == "json_mode":
     chain = prompt_template | llm.with_structured_output(schema, method=wx_json_mode)

@coderabbitai coderabbitai bot Mar 26, 2026


⚠️ Potential issue | 🟠 Major

Don't keep function_calling on the known-broken watsonx path.

The new inline note says watsonx function_calling returns no tool_calls, but Line 167 still routes that mode through llm.with_structured_output(...). Any caller that explicitly selects wx_json_mode="function_calling" will keep seeing the old failure. Either send that mode through the same prompt+parser fallback or reject it with a clear error.
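The reject-with-a-clear-error option can be sketched in isolation. All names here are illustrative, not the actual BaseAgent.get_chain API; the point is only that the known-broken modes fail loudly instead of silently returning empty content.

```python
# Hypothetical routing helper (illustrative names, not real base_agent code):
# modes known to break on watsonx/vLLM guided decoding are rejected up front.
BROKEN_WX_MODES = {"function_calling", "json_mode"}


def route_wx_json_mode(wx_json_mode: str) -> str:
    """Decide how to build the chain for a given watsonx JSON mode."""
    if wx_json_mode == "response_format":
        # prompt_template | llm | parser fallback, wrapped in with_retry(3)
        return "prompt_parser_fallback"
    if wx_json_mode in BROKEN_WX_MODES:
        raise ValueError(
            f"wx_json_mode={wx_json_mode!r} is unsupported on ChatWatsonx: "
            "vLLM guided decoding returns empty content / no tool_calls "
            "(vllm#15236, vllm#21148)"
        )
    return "no_format"
```

The alternative (routing the broken modes through the same prompt+parser fallback) is the same function with the `raise` replaced by `return "prompt_parser_fallback"`.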


Collaborator Author


Valid point. However, no caller in the codebase explicitly passes wx_json_mode="function_calling" for watsonx today — the default is "response_format" and the only explicit overrides are "no_format" (for code_agent, reflection, etc.).

The function_calling and json_mode branches at line 167 are pre-existing code that was not modified by this PR. They would only be hit if a caller explicitly opts in.

That said, if someone does explicitly pass function_calling for watsonx, they'd hit the same vLLM issue. We can address that in a follow-up if needed, but changing pre-existing untouched branches is out of scope for this fix.



@haroldship, that's a fair and reasonable scoping decision — the current PR is a focused fix for the response_format empty-content bug, and touching untouched pre-existing branches would expand its scope unnecessarily.

Would you like me to open a follow-up GitHub issue to track the risk that wx_json_mode="function_calling" (and "json_mode") on ChatWatsonx will hit the same vLLM guided-decoding failure if a caller ever opts in? That way it's documented and won't be forgotten.


🧠 Learnings used
Learnt from: haroldship
Repo: cuga-project/cuga-agent PR: 67
File: src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py:377-397
Timestamp: 2026-03-23T08:37:18.878Z
Learning: In `src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`, when `PromptUtils.find_tools` raises an `OutputParserException` or other exception (i.e., the shortlister LLM returns malformed/empty JSON), the correct behaviour is to return a short user-facing error string to the agent — NOT a fallback tool list. This was an explicit design decision: returning all tools on LLM failure bloats the agent context and masks bugs; the error string lets the agent LLM decide whether to retry or proceed. This is distinct from the soft-filtering fallback (which applies when `filter_criteria` is provided but no tools match the criteria).

Collaborator Author


yes open the issue



🧩 Analysis chain

🏁 Script executed:

gh issue create \
  --repo cuga-project/cuga-agent \
  --title "fix(watsonx): guard function_calling and json_mode against vLLM guided-decoding failure" \
  --body "## Background

PR `#89` fixed the \`response_format\` empty-content bug on \`ChatWatsonx\` by switching to a prompt-based \`PydanticOutputParser\` fallback. During review it was noted that the \`function_calling\` and \`json_mode\` branches in \`BaseAgent.get_chain\` (\`src/cuga/backend/cuga_graph/nodes/shared/base_agent.py\`, ~line 167) still route through \`llm.with_structured_output(...)\`, which is known-broken on vLLM versions prior to 0.8.2 (see [vllm#15236](https://github.com/vllm-project/vllm/issues/15236) and [vllm#21148](https://github.com/vllm-project/vllm/issues/21148)).

## Risk

No caller currently passes \`wx_json_mode=\"function_calling\"\` or \`wx_json_mode=\"json_mode\"\` explicitly for a \`ChatWatsonx\` LLM — the default is \`\"response_format\"\` and current overrides use \`\"no_format\"\`. However, if a future caller opts in to either of these modes, they will silently hit the same vLLM guided-decoding failure (empty content / no tool_calls returned).

## Suggested Fix

Either:
1. Route \`\"function_calling\"\` and \`\"json_mode\"\` through the same \`prompt_template | llm | parser\` fallback with \`.with_retry(stop_after_attempt=3)\` (same approach as the \`\"response_format\"\` fix in PR `#89`), **or**
2. Raise a clear \`ValueError\` / \`NotImplementedError\` when \`wx_json_mode\` is \`\"function_calling\"\` or \`\"json_mode\"\` on \`ChatWatsonx\`, so callers get an explicit error rather than a silent failure.

This fix should be **reverted or revisited** once IBM upgrades the deployed vLLM to a fixed version (0.8.2+).

## References

- PR `#89` (fix for \`response_format\` mode): https://github.com/cuga-project/cuga-agent/pull/89
- Review comment: https://github.com/cuga-project/cuga-agent/pull/89#discussion_r2996097287
- Issue `#88` (original empty-content bug): https://github.com/cuga-project/cuga-agent/issues/88
- Requested by: `@haroldship`
" \
  --assignee haroldship

Length of output: 52


@haroldship, done! Follow-up issue #90 has been opened and assigned to you:

👉 #90

It documents the latent risk that wx_json_mode="function_calling" and "json_mode" on ChatWatsonx will silently hit the same vLLM guided-decoding failure, links back to this PR and issue #88, and proposes the two remediation options (prompt+parser fallback or an explicit error). The note to revisit once IBM upgrades to vLLM 0.8.2+ is included as well.



@haroldship
Collaborator Author

Reproduction script

Two conditions are required to trigger the bug consistently:

  1. Schema uses $defs/$ref (Pydantic v2 nested model — ShortListerOutputLite qualifies)
  2. The real shortlister system prompt (9 KB with few-shot examples) — short prompts don't trigger it reliably

Run with uv run python scripts/repro_watsonx_response_format.py after copying scripts/.env.example to scripts/.env and filling in credentials.

scripts/.env.example
WATSONX_URL=https://us-south.ml.cloud.ibm.com
WATSONX_PROJECT_ID=your-project-id-here
WATSONX_APIKEY=your-api-key-here
# or: WATSONX_API_KEY=your-api-key-here
MODEL_NAME=openai/gpt-oss-120b
scripts/repro_watsonx_response_format.py
"""
Minimal reproduction script for watsonx response_format empty content bug.

When ChatWatsonx.with_structured_output(schema, method="json_schema") is called
with a large prompt (many tool definitions) and a schema using $defs/$ref
(Pydantic models with nested objects/arrays), the watsonx API returns empty
content (content='', completion_tokens~=2, finish_reason=stop).

This reproduces the exact code path used in cuga's shortlister:
  prompt_utils.find_tools
    -> create_validated_structured_output_chain
    -> llm.with_structured_output(ShortListerOutputLite, method="json_schema")
    -> ChatWatsonx.bind(response_format={type: json_schema, schema: model_json_schema()})

Prompt structure mirrors find_tools in prompt_utils.py:
  system:  shortlister/system.jinja2 (long, with few-shot examples)
  human:   "Current Apps: {all_apps}\nCurrent Available Tools: {all_tools}"
  ai:      "Sure, now give me the intent"
  human:   "User Intent: {query}"

This was observed with model openai/gpt-oss-120b on us-south.ml.cloud.ibm.com
using langchain-ibm 1.0.2 / ibm-watsonx-ai 1.4.11.

Usage:
    cp scripts/.env.example scripts/.env
    # Fill in WATSONX_APIKEY (or WATSONX_API_KEY) in scripts/.env
    uv run python scripts/repro_watsonx_response_format.py

Requires:
    langchain-ibm>=1.0.2
    ibm-watsonx-ai>=1.4.11
    python-dotenv
    pydantic
    jinja2
"""

import json
import os
from pathlib import Path
from typing import List

from dotenv import load_dotenv

# Load from scripts/.env if present, else fall back to project root .env
load_dotenv(Path(__file__).parent / ".env", override=True)
load_dotenv(override=False)

WATSONX_URL = os.environ["WATSONX_URL"]
WATSONX_PROJECT_ID = os.environ["WATSONX_PROJECT_ID"]
WATSONX_API_KEY = os.environ.get("WATSONX_API_KEY") or os.environ["WATSONX_APIKEY"]
MODEL_ID = os.environ.get("MODEL_NAME", "openai/gpt-oss-120b").strip('"')

# Path to the actual shortlister system prompt (relative to cuga-agent repo root)
SHORTLISTER_SYSTEM_PROMPT_PATH = (
    Path(__file__).parent.parent
    / "src/cuga/backend/cuga_graph/nodes/cuga_lite/prompts/shortlister/system.jinja2"
)

from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ibm import ChatWatsonx


# ---------------------------------------------------------------------------
# Schema — exact copy of ShortListerOutputLite (uses $defs/$ref via Pydantic v2)
# ---------------------------------------------------------------------------

class APIDetails(BaseModel):
    name: str = Field(description="API Name")
    relevance_score: float = Field(description="Relevance score")
    reasoning: str = Field(description="Reasoning")


class ShortListerOutputLite(BaseModel):
    result: List[APIDetails]


# ---------------------------------------------------------------------------
# Prompt builder — mirrors prompt_utils.find_tools message structure
# ---------------------------------------------------------------------------

def load_system_prompt() -> str:
    """Load the actual shortlister system prompt, stripping Jinja2 conditionals."""
    if SHORTLISTER_SYSTEM_PROMPT_PATH.exists():
        try:
            from jinja2 import Environment
            env = Environment()
            tmpl = env.from_string(SHORTLISTER_SYSTEM_PROMPT_PATH.read_text())
            return tmpl.render(instructions=None, memory=None)
        except Exception:
            return SHORTLISTER_SYSTEM_PROMPT_PATH.read_text()
    # Fallback: short system prompt
    return (
        "You are a tool selection assistant. Given a user query and a list of tools, "
        "select the most relevant tools and return a JSON object matching the schema."
    )


def build_prompt(system: str) -> ChatPromptTemplate:
    """3-turn structure matching prompt_utils.find_tools."""
    # Escape {{ }} in the rendered system prompt so LangChain doesn't treat
    # JSON examples (e.g. {"thoughts": ...}) as template variables.
    escaped_system = system.replace("{", "{{").replace("}", "}}")
    return ChatPromptTemplate.from_messages([
        ("system", escaped_system),
        ("human", "Current Apps: {all_apps}\nCurrent Available Tools: {all_tools}"),
        ("ai", "Sure, now give me the intent"),
        ("human", "User Intent: {query}"),
    ])


# ---------------------------------------------------------------------------
# Synthetic tool / app generators
# ---------------------------------------------------------------------------

def make_tools(n: int) -> dict:
    """Generate n synthetic movielens-style tool definitions."""
    endpoints = [
        ("get_movies", "Returns a list of movies", {"movie_id": "integer", "genre": "string", "year": "integer"}),
        ("get_ratings", "Returns user ratings for movies", {"user_id": "integer", "movie_id": "integer"}),
        ("get_users", "Returns user information", {"user_id": "integer", "age_group": "string"}),
        ("get_genres", "Returns all genres", {"genre_id": "integer"}),
        ("get_directors", "Returns director information", {"director_id": "integer", "movie_id": "integer"}),
        ("get_actors", "Returns actors in a movie", {"actor_id": "integer", "movie_id": "integer"}),
        ("get_tags", "Returns user-assigned tags for movies", {"tag_id": "integer", "movie_id": "integer"}),
        ("get_links", "Returns external links for movies", {"movie_id": "integer"}),
        ("search_movies", "Full-text search over movie titles", {"query": "string", "limit": "integer"}),
        ("get_avg_rating", "Returns average rating for a movie", {"movie_id": "integer"}),
    ]
    tools = {}
    for i in range(n):
        base = endpoints[i % len(endpoints)]
        name, desc, params = base
        tool_name = f"{name}_{i // len(endpoints)}" if i >= len(endpoints) else name
        props = {k: {"type": v, "description": f"The {k} parameter"} for k, v in params.items()}
        tools[tool_name] = {
            "name": tool_name,
            "description": desc,
            "args_schema": {
                "type": "object",
                "properties": props,
                "required": list(params.keys())[:1],
            },
            "_response_schemas": {
                "200": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "id": {"type": "integer"},
                            "title": {"type": "string"},
                            "value": {"type": "number"},
                        }
                    }
                }
            }
        }
    return tools


def make_apps() -> dict:
    return {
        "movielens": {
            "name": "movielens",
            "description": "MovieLens dataset API — movies, ratings, users, genres, directors, actors.",
        }
    }


# ---------------------------------------------------------------------------
# Call helpers
# ---------------------------------------------------------------------------

def call_with_structured_output(label: str, tools: dict, system: str) -> None:
    """Mirrors create_validated_structured_output_chain."""
    llm = ChatWatsonx(
        model_id=MODEL_ID,
        url=WATSONX_URL,
        project_id=WATSONX_PROJECT_ID,
        apikey=WATSONX_API_KEY,
        params={"temperature": 0.1, "max_new_tokens": 1024},
    )
    prompt = build_prompt(system)
    chain = prompt | llm.with_structured_output(ShortListerOutputLite, method="json_schema")
    try:
        result = chain.invoke({
            "all_apps": json.dumps(make_apps()),
            "all_tools": json.dumps(tools),
            "query": "find movies with high user ratings and list their director IDs",
        })
        status = "✅ OK   "
        detail = f"got {len(result.result)} tools"
    except Exception as e:
        status = "❌ FAIL "
        detail = repr(str(e)[:300])
    print(f"  {status}  [{label}]")
    print(f"           {detail}\n")


def call_no_response_format(label: str, tools: dict, system: str) -> None:
    """Baseline: prompt | llm | PydanticOutputParser (no response_format)"""
    llm = ChatWatsonx(
        model_id=MODEL_ID,
        url=WATSONX_URL,
        project_id=WATSONX_PROJECT_ID,
        apikey=WATSONX_API_KEY,
        params={"temperature": 0.1, "max_new_tokens": 1024},
    )
    parser = PydanticOutputParser(pydantic_object=ShortListerOutputLite)
    escaped_system = system.replace("{", "{{").replace("}", "}}")
    sys_with_format = escaped_system + "\n\nReturn JSON matching this format:\n{format_instructions}"
    prompt = ChatPromptTemplate.from_messages([
        ("system", sys_with_format),
        ("human", "Current Apps: {all_apps}\nCurrent Available Tools: {all_tools}"),
        ("ai", "Sure, now give me the intent"),
        ("human", "User Intent: {query}"),
    ])
    chain = prompt | llm | parser
    try:
        result = chain.invoke({
            "all_apps": json.dumps(make_apps()),
            "all_tools": json.dumps(tools),
            "query": "find movies with high user ratings and list their director IDs",
            "format_instructions": parser.get_format_instructions(),
        })
        status = "✅ OK   "
        detail = f"got {len(result.result)} tools"
    except Exception as e:
        status = "❌ FAIL "
        detail = repr(str(e)[:300])
    print(f"  {status}  [{label}]")
    print(f"           {detail}\n")


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"Model: {MODEL_ID}")
    print(f"URL:   {WATSONX_URL}\n")

    system = load_system_prompt()
    using_real = SHORTLISTER_SYSTEM_PROMPT_PATH.exists()
    print(f"System prompt: {'real shortlister template' if using_real else 'fallback (short)'} ({len(system)} chars)\n")

    for n_tools in [5, 20, 50, 96]:
        tools = make_tools(n_tools)
        tools_json = json.dumps(tools)
        print("=" * 60)
        print(f"{n_tools} tools (~{len(tools_json)} chars)")
        print("=" * 60)
        call_no_response_format(f"no response_format ({n_tools} tools)", tools, system)
        call_with_structured_output(f"json_schema response_format ({n_tools} tools)", tools, system)

Sample output showing the bug:

Model: openai/gpt-oss-120b
URL:   https://us-south.ml.cloud.ibm.com

System prompt: real shortlister template (9866 chars)

============================================================
5 tools (~2425 chars)
============================================================
  ✅ OK     [no response_format (5 tools)]
           got 3 tools

  ❌ FAIL   [json_schema response_format (5 tools)]
           'Invalid json output: \nFor troubleshooting, visit: ...'

============================================================
96 tools (~45834 chars)
============================================================
  ✅ OK     [no response_format (96 tools)]
           got 2 tools

  ❌ FAIL   [json_schema response_format (96 tools)]
           'Invalid json output: \nFor troubleshooting, visit: ...'

@sami-marreed
Contributor

Why stability tests fail


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: watsonx vLLM guided decoding returns empty content with response_format

2 participants