fix(vlm): default max_tokens=32768 exceeds gpt-4o-mini completion cap → silent 0-memory extraction (#2751) #2755
…cap (volcengine#2751) The unset-fallback default of 32768 exceeds the 16384 completion-token cap of the OpenAI VLM backend's own default model (gpt-4o-mini), so a default-configured deployment gets an HTTP 400 that the memory-extraction path swallows -> a silent 0-memory extraction. Lower the fallback to a named _DEFAULT_MAX_TOKENS=16384 and fall back only when max_tokens is genuinely unset (None), leaving explicit values untouched.
volcengine#2751) Pins that the unset default is 16384 (<= gpt-4o-mini cap), that an explicit value is honored, and that an explicit falsy 0 is not silently replaced by the default.
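The difference between the old `or` fallback and the `is not None` check that these commits describe can be sketched as follows. This is a minimal illustration, not the backend's code: the two function names are hypothetical, and only `_DEFAULT_MAX_TOKENS`, 16384, and 32768 come from the PR.

```python
from typing import Optional

# Completion-token cap of gpt-4o / gpt-4o-mini, as stated in the PR.
_DEFAULT_MAX_TOKENS = 16384


def resolve_with_or(max_tokens: Optional[int]) -> int:
    """Hypothetical model of the previous behavior.

    `or` treats every falsy value (None and 0) as unset.
    """
    return max_tokens or 32768


def resolve_with_none_check(max_tokens: Optional[int]) -> int:
    """Hypothetical model of the patched behavior.

    Only a genuinely unset value (None) falls back to the default.
    """
    return max_tokens if max_tokens is not None else _DEFAULT_MAX_TOKENS


if __name__ == "__main__":
    for value in (None, 0, 512):
        print(value, resolve_with_or(value), resolve_with_none_check(value))
```

With `or`, an explicit `0` is silently rewritten to the default; with the `None` check it is passed through, so a misconfiguration fails visibly at the provider.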
@chenjw could you take a look at what the background for adding this default was?
chenjw left a comment
Review note: the default max_tokens fix is useful, but please keep the existing reasoning-model unset behavior.
    max_tokens = (
        self.max_tokens if self.max_tokens is not None else _DEFAULT_MAX_TOKENS
    )
    kwargs["max_completion_tokens" if is_reasoning else "max_tokens"] = max_tokens
One concern: this now gives reasoning models a default max_completion_tokens when max_tokens is unset. Existing behavior intentionally omitted both token fields for reasoning models without max_tokens (see test_reasoning_model_without_max_tokens_omits_both), and this PR currently breaks that test. Could we keep the 16384 fallback only for non-reasoning models, and only set max_completion_tokens for reasoning models when max_tokens is explicitly configured?
Addresses @chenjw review: the original fix lowered the unset max_tokens fallback to 16384 for ALL models, but volcengine#2751 is specific to the gpt-4o family (gpt-4o / gpt-4o-mini cap completion at 16384). Reasoning models (gpt-5/o1/o3/o4) advertise larger completion limits and spend a hidden reasoning-token budget out of max_completion_tokens, so the 16384 cap would needlessly truncate them. Make the unset fallback conditional on is_reasoning: reasoning models keep their prior 32768 default (sent as max_completion_tokens); the gpt-4o-family default stays at 16384. Explicitly configured max_tokens (including 0) is still passed through unchanged. Adds reasoning-model regression tests.
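The conditional fallback described in this commit message might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual patch: `is_reasoning_model` is a hypothetical prefix-based stand-in for the backend's `_is_reasoning_model`, and `_REASONING_DEFAULT_MAX_TOKENS` and `token_kwargs` are names invented here. The model families and the 16384/32768 values are taken from the commit message.

```python
from typing import Any, Dict, Optional

_DEFAULT_MAX_TOKENS = 16384  # gpt-4o-family completion cap (from the PR)
_REASONING_DEFAULT_MAX_TOKENS = 32768  # hypothetical name for the prior default


def is_reasoning_model(model: str) -> bool:
    """Hypothetical stand-in for the backend's _is_reasoning_model check."""
    return model.lower().startswith(("gpt-5", "o1", "o3", "o4"))


def token_kwargs(model: str, max_tokens: Optional[int]) -> Dict[str, Any]:
    """Build the token-limit kwarg for a request.

    An unset (None) max_tokens falls back to a per-family default;
    an explicit value, including 0, is passed through unchanged.
    """
    reasoning = is_reasoning_model(model)
    if max_tokens is None:
        max_tokens = (
            _REASONING_DEFAULT_MAX_TOKENS if reasoning else _DEFAULT_MAX_TOKENS
        )
    # Reasoning models take the limit under a different field name.
    key = "max_completion_tokens" if reasoning else "max_tokens"
    return {key: max_tokens}


if __name__ == "__main__":
    print(token_kwargs("gpt-4o-mini", None))
    print(token_kwargs("o3-mini", None))
    print(token_kwargs("gpt-4o", 0))
```

Note that this differs from the reviewer's request above, which was to omit both token fields for reasoning models when `max_tokens` is unset; the sketch follows the commit message instead.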

Fixes #2751.
Problem
The OpenAI VLM backend's unset-fallback default is `max_tokens = self.max_tokens or 32768`, applied at both kwargs builders (`_build_text_kwargs`, `_build_vision_kwargs`). But the backend's own default model is `gpt-4o-mini` (`self.model or "gpt-4o-mini"`), and `gpt-4o`/`gpt-4o-mini` cap completion at 16384 tokens. So with the default configuration (default model, `max_tokens` unset), every request sends `max_tokens=32768`, which the provider rejects with an HTTP 400.
That 400 is caught and swallowed in `openviking/session/compressor_v2.py` (`logger.error("[...] Failed to extract: {e}", exc_info=True)`), so the commit still returns `200` with `total_memories=0`. The failure is silent: a default-configured deployment extracts 0 memories with no surfaced error, and users mistake it for "extraction found nothing." Reported on a clean `0.4.4` install and reproducible on `main` (#2751).
Fix
Lower the unset-fallback default to a named `_DEFAULT_MAX_TOKENS = 16384` (the completion-token cap of the backend's own default model) at both builders. This is consistent with how the backend already handles model quirks at build time (e.g. `_is_reasoning_model`). Memory-extraction outputs are small JSON, so 16384 introduces no real truncation while still guarding against runaway generation.
The fallback now fires only when `max_tokens` is genuinely unset (`None`), switching from `or` to an explicit `is not None` check, so an explicitly configured value (including a degenerate `0`) is passed through unchanged and surfaces loudly instead of being silently rewritten. Callers that need a larger budget set `max_tokens` explicitly, which is honored as before.
Tests
`tests/unit/test_vlm_default_max_tokens.py` pins: that the unset default is `16384` (≤ default-model cap) for both text and vision kwargs; that an explicit `max_tokens=512` is honored; and that an explicit `max_tokens=0` is not replaced by the default.
These would fail against the previous `or 32768`.
Scope / known limitation
This corrects the reported case (the default `gpt-4o-mini` config). A model configured without `max_tokens` whose completion cap is below 16384 (e.g. older `gpt-4-turbo`/`gpt-3.5-turbo` at 4096) would still hit the same swallowed-400 path; the workaround there is to set `vlm.max_tokens` to the model's cap. If you'd prefer a fully model-agnostic guard, a follow-up could catch the provider's cap-exceeded `BadRequestError`, parse the advertised cap, clamp, and retry once. Happy to add that in a separate PR if you want it. I kept this PR minimal and scoped to the reported default-config regression.
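For discussion, the clamp-and-retry follow-up mentioned above could be shaped roughly like this. Everything here is a hypothetical sketch: `CapExceededError` stands in for the provider's `BadRequestError`, the error-message pattern is an assumed format that a real provider may not match, and `fake_send` simulates a backend with a 16384 cap so the example runs without network access.

```python
import re
from typing import Callable, Optional


class CapExceededError(Exception):
    """Hypothetical stand-in for the provider's BadRequestError."""


# Assumed message shape; the real provider's wording may differ.
_CAP_PATTERN = re.compile(r"at most (\d+) completion tokens")


def parse_cap(message: str) -> Optional[int]:
    """Extract the advertised completion cap from an error message, if any."""
    match = _CAP_PATTERN.search(message)
    return int(match.group(1)) if match else None


def call_with_clamp(send: Callable[[int], str], max_tokens: int) -> str:
    """Call send(max_tokens); on a cap error, clamp and retry exactly once."""
    try:
        return send(max_tokens)
    except CapExceededError as exc:
        cap = parse_cap(str(exc))
        if cap is None or cap >= max_tokens:
            raise  # no usable cap in the message, so clamping cannot help
        return send(cap)  # a second failure propagates to the caller


def fake_send(max_tokens: int) -> str:
    """Simulated backend with a 16384 completion cap (illustration only)."""
    if max_tokens > 16384:
        raise CapExceededError(
            "This model supports at most 16384 completion tokens"
        )
    return f"ok:{max_tokens}"


if __name__ == "__main__":
    print(call_with_clamp(fake_send, 32768))
    print(call_with_clamp(fake_send, 512))
```

Retrying only once keeps the guard bounded: if the clamped request still fails, the error surfaces instead of looping.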