Skip to content

fix(vlm): default max_tokens=32768 exceeds gpt-4o-mini completion cap → silent 0-memory extraction (#2751)#2755

Open
r266-tech wants to merge 3 commits into
volcengine:mainfrom
r266-tech:fix/vlm-default-max-tokens-2751
Open

fix(vlm): default max_tokens=32768 exceeds gpt-4o-mini completion cap → silent 0-memory extraction (#2751)#2755
r266-tech wants to merge 3 commits into
volcengine:mainfrom
r266-tech:fix/vlm-default-max-tokens-2751

Conversation

@r266-tech

Copy link
Copy Markdown
Contributor

Fixes #2751.

Problem

The OpenAI VLM backend's unset-fallback default is max_tokens = self.max_tokens or 32768, applied at both kwargs builders (_build_text_kwargs, _build_vision_kwargs). But the backend's own default model is gpt-4o-mini (self.model or "gpt-4o-mini"), and gpt-4o / gpt-4o-mini cap completion at 16384 tokens. So with the default configuration (default model, max_tokens unset), every request sends max_tokens=32768, which the provider rejects:

400 BadRequest: max_tokens is too large: 32768. This model supports at most 16384 completion tokens

That 400 is caught and swallowed in openviking/session/compressor_v2.py (logger.error("[...] Failed to extract: {e}", exc_info=True)), so the commit still returns 200 with total_memories=0. The failure is silent: a default-configured deployment extracts 0 memories with no surfaced error, and users mistake it for "extraction found nothing." Reported on a clean 0.4.4 install and reproducible on main (#2751).

Fix

Lower the unset-fallback default to a named _DEFAULT_MAX_TOKENS = 16384 — the completion-token cap of the backend's own default model — at both builders. This is consistent with how the backend already handles model quirks at build time (e.g. _is_reasoning_model). Memory-extraction outputs are small JSON, so 16384 introduces no real truncation while still guarding against runaway generation.

The fallback now fires only when max_tokens is genuinely unset (None) — switched from or to an explicit is not None check — so an explicitly configured value (including a degenerate 0) is passed through unchanged and surfaces loudly instead of being silently rewritten. Callers that need a larger budget set max_tokens explicitly, which is honored as before.

Tests

tests/unit/test_vlm_default_max_tokens.py pins:

  • the unset default resolves to 16384 (≤ default-model cap) for both text and vision kwargs;
  • an explicit max_tokens=512 is honored;
  • an explicit max_tokens=0 is not replaced by the default.

These would fail against the previous or 32768.

Scope / known limitation

This corrects the reported case (the default gpt-4o-mini config). A model configured without max_tokens whose completion cap is below 16384 (e.g. older gpt-4-turbo / gpt-3.5-turbo at 4096) would still hit the same swallowed-400 path; the workaround there is to set vlm.max_tokens to the model's cap. If you'd prefer a fully model-agnostic guard, a follow-up could catch the provider's cap-exceeded BadRequestError, parse the advertised cap, clamp, and retry once — happy to add that in a separate PR if you want it. I kept this PR minimal and scoped to the reported default-config regression.

…cap (volcengine#2751)

The unset-fallback default of 32768 exceeds the 16384 completion-token cap of the
OpenAI VLM backend's own default model (gpt-4o-mini), so a default-configured
deployment gets an HTTP 400 that the memory-extraction path swallows -> a silent
0-memory extraction. Lower the fallback to a named _DEFAULT_MAX_TOKENS=16384 and
fall back only when max_tokens is genuinely unset (None), leaving explicit values
untouched.
volcengine#2751)

Pins the unset default to 16384 (<= gpt-4o-mini cap), that an explicit value is
honored, and that an explicit falsy 0 is not silently replaced by the default.
@github-actions

Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

2751 - Fully compliant

Compliant requirements:

  • Lowered default max_tokens to 16384, matching gpt-4o-mini/gpt-4o completion cap
  • Changed fallback logic to only use default when max_tokens is genuinely None, preserving explicit values including 0
  • Added regression tests for default fallback, explicit max_tokens, and explicit 0
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

Copy link
Copy Markdown

PR Code Suggestions ✨

No code suggestions found for the PR.

@qin-ctx

qin-ctx commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator
image

@chenjw 帮看下添加这个默认的背景是什么

@chenjw chenjw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review note: the default max_tokens fix is useful, but please keep the existing reasoning-model unset behavior.

max_tokens = (
self.max_tokens if self.max_tokens is not None else _DEFAULT_MAX_TOKENS
)
kwargs["max_completion_tokens" if is_reasoning else "max_tokens"] = max_tokens

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern: this now gives reasoning models a default max_completion_tokens when max_tokens is unset. Existing behavior intentionally omitted both token fields for reasoning models without max_tokens (see test_reasoning_model_without_max_tokens_omits_both), and this PR currently breaks that test. Could we keep the 16384 fallback only for non-reasoning models, and only set max_completion_tokens for reasoning models when max_tokens is explicitly configured?

@ZaynJarvis ZaynJarvis added bug Something isn't working scenario:kernel Core server, runtime, storage, retrieval, SDK, CLI, or Studio behavior. urgency:bug Incorrect behavior with a bounded fix path. labels Jun 23, 2026
Addresses @chenjw review: the original fix lowered the unset max_tokens
fallback to 16384 for ALL models, but volcengine#2751 is specific to the gpt-4o
family (gpt-4o / gpt-4o-mini cap completion at 16384). Reasoning models
(gpt-5/o1/o3/o4) advertise larger completion limits and spend a hidden
reasoning-token budget out of max_completion_tokens, so the 16384 cap
would needlessly truncate them.

Make the unset fallback conditional on is_reasoning: reasoning models keep
their prior 32768 default (sent as max_completion_tokens); the gpt-4o-family
default stays at 16384. Explicitly configured max_tokens (including 0) is
still passed through unchanged. Adds reasoning-model regression tests.
@r266-tech

Copy link
Copy Markdown
Contributor Author

@/tmp/reply-2755.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Review effort 2/5 scenario:kernel Core server, runtime, storage, retrieval, SDK, CLI, or Studio behavior. urgency:bug Incorrect behavior with a bounded fix path.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug] VLM default max_tokens=32768 exceeds completion-token limit of common models (e.g. gpt-4o-mini → 400), silently yielding 0 extracted memories

4 participants