fix: fall back to Ollama native /api/chat for thinking-mode models (fixes #26)#38

Open
nandanadileep wants to merge 1 commit into nikmcfly:main from nandanadileep:fix/ollama-thinking-mode-empty-response

Conversation

@nandanadileep

Problem

Thinking-mode models (e.g. gemma4:26b) generate internal <|think|> reasoning tokens that exhaust max_tokens before producing visible content. Ollama's OpenAI-compatible /v1/chat/completions endpoint strips those tokens and returns empty content, causing 500 errors when starting simulations.

Fix

In LLMClient.chat(), after calling the OpenAI-compat endpoint, check if content is empty. If it is and we're talking to an Ollama server, retry via the native /api/chat endpoint, which surfaces the visible response correctly.

Changes in backend/app/utils/llm_client.py:

  • Added _ollama_native_base() — strips /v1 suffix to get the Ollama host URL
  • Added _chat_via_ollama_native() — POSTs to /api/chat with stream=false, carries over temperature and num_ctx
  • In chat(): triggers the fallback only when content is falsy and _is_ollama() is true — fully backwards-compatible, zero impact on non-Ollama or non-thinking-mode models
  • Fixed a latent NoneType crash: re.sub(…, content or '') guards against None content even without the fallback

Test plan

  • Run a simulation with a standard model (e.g. qwen2.5:32b) — behaviour unchanged
  • Run a simulation with gemma4:26b — should now return visible response instead of 500
  • Verify _ollama_native_base() strips /v1 from http://localhost:11434/v1 correctly
  • Non-Ollama endpoints (OpenAI, etc.) are unaffected — fallback never fires
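The third bullet of the test plan could be pinned down with a small unit test. The regex re-implementation below is illustrative; the real helper lives in `backend/app/utils/llm_client.py` and would be imported instead of redefined:

```python
import re


def _ollama_native_base(base_url: str) -> str:
    # Illustrative stand-in for the real helper in backend/app/utils/llm_client.py.
    return re.sub(r"/v1/?$", "", base_url)


def test_strips_v1_suffix():
    assert _ollama_native_base("http://localhost:11434/v1") == "http://localhost:11434"


def test_non_ollama_url_unchanged():
    # A URL without a /v1 suffix must pass through untouched.
    assert _ollama_native_base("https://api.openai.com") == "https://api.openai.com"
```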

Fixes #26

Thinking-mode models (e.g. Gemma 4) generate internal reasoning tokens
that can exhaust max_tokens before producing any visible content.
Ollama's OpenAI-compatible /v1/chat/completions endpoint strips those
tokens and returns empty content, causing 500 errors in simulations.

When LLMClient.chat() receives empty content from an Ollama endpoint,
it now retries via the native /api/chat endpoint which correctly returns
the visible response. The fallback is backwards-compatible and only
triggers on empty responses.

Fixes nikmcfly#26


Development

Successfully merging this pull request may close these issues.

Thinking-mode models (Gemma 4) return empty responses via Ollama OpenAI-compatible endpoint
