fix(vlm): default max_tokens=32768 exceeds gpt-4o-mini completion cap → silent 0-memory extraction (#2751) by r266-tech · Pull Request #2755 · volcengine/OpenViking

r266-tech · 2026-06-21T08:49:38Z

Problem

The OpenAI VLM backend's unset-fallback default is max_tokens = self.max_tokens or 32768, applied at both kwargs builders (_build_text_kwargs, _build_vision_kwargs). But the backend's own default model is gpt-4o-mini (self.model or "gpt-4o-mini"), and gpt-4o / gpt-4o-mini cap completion at 16384 tokens. So with the default configuration (default model, max_tokens unset), every request sends max_tokens=32768, which the provider rejects:

400 BadRequest: max_tokens is too large: 32768. This model supports at most 16384 completion tokens

That 400 is caught and swallowed in openviking/session/compressor_v2.py (logger.error("[...] Failed to extract: {e}", exc_info=True)), so the commit still returns 200 with total_memories=0. The failure is silent: a default-configured deployment extracts 0 memories with no surfaced error, and users mistake it for "extraction found nothing." Reported on a clean 0.4.4 install and reproducible on main (#2751).

Fix

Lower the unset-fallback default to a named _DEFAULT_MAX_TOKENS = 16384 — the completion-token cap of the backend's own default model — at both builders. This is consistent with how the backend already handles model quirks at build time (e.g. _is_reasoning_model). Memory-extraction outputs are small JSON, so 16384 introduces no real truncation while still guarding against runaway generation.

The fallback now fires only when max_tokens is genuinely unset (None) — switched from or to an explicit is not None check — so an explicitly configured value (including a degenerate 0) is passed through unchanged and surfaces loudly instead of being silently rewritten. Callers that need a larger budget set max_tokens explicitly, which is honored as before.

Tests

tests/unit/test_vlm_default_max_tokens.py pins:

the unset default resolves to 16384 (≤ default-model cap) for both text and vision kwargs;
an explicit max_tokens=512 is honored;
an explicit max_tokens=0 is not replaced by the default.

These would fail against the previous or 32768.

Scope / known limitation

This corrects the reported case (the default gpt-4o-mini config). A model configured without max_tokens whose completion cap is below 16384 (e.g. older gpt-4-turbo / gpt-3.5-turbo at 4096) would still hit the same swallowed-400 path; the workaround there is to set vlm.max_tokens to the model's cap. If you'd prefer a fully model-agnostic guard, a follow-up could catch the provider's cap-exceeded BadRequestError, parse the advertised cap, clamp, and retry once — happy to add that in a separate PR if you want it. I kept this PR minimal and scoped to the reported default-config regression.

…cap (volcengine#2751) The unset-fallback default of 32768 exceeds the 16384 completion-token cap of the OpenAI VLM backend's own default model (gpt-4o-mini), so a default-configured deployment gets an HTTP 400 that the memory-extraction path swallows -> a silent 0-memory extraction. Lower the fallback to a named _DEFAULT_MAX_TOKENS=16384 and fall back only when max_tokens is genuinely unset (None), leaving explicit values untouched.

volcengine#2751) Pins the unset default to 16384 (<= gpt-4o-mini cap), that an explicit value is honored, and that an explicit falsy 0 is not silently replaced by the default.

github-actions · 2026-06-21T08:50:41Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅ 2751 - Fully compliant Compliant requirements: Lowered default `max_tokens` to 16384, matching gpt-4o-mini/gpt-4o completion cap Changed fallback logic to only use default when `max_tokens` is genuinely None, preserving explicit values including 0 Added regression tests for default fallback, explicit max_tokens, and explicit 0
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-06-21T08:50:52Z

PR Code Suggestions ✨

No code suggestions found for the PR.

qin-ctx · 2026-06-23T06:28:02Z

@chenjw 帮看下添加这个默认的背景是什么

chenjw

Review note: the default max_tokens fix is useful, but please keep the existing reasoning-model unset behavior.

chenjw · 2026-06-23T07:32:32Z

+        max_tokens = (
+            self.max_tokens if self.max_tokens is not None else _DEFAULT_MAX_TOKENS
+        )
        kwargs["max_completion_tokens" if is_reasoning else "max_tokens"] = max_tokens


One concern: this now gives reasoning models a default max_completion_tokens when max_tokens is unset. Existing behavior intentionally omitted both token fields for reasoning models without max_tokens (see test_reasoning_model_without_max_tokens_omits_both), and this PR currently breaks that test. Could we keep the 16384 fallback only for non-reasoning models, and only set max_completion_tokens for reasoning models when max_tokens is explicitly configured?

@chenjw

Addresses @chenjw review: the original fix lowered the unset max_tokens fallback to 16384 for ALL models, but volcengine#2751 is specific to the gpt-4o family (gpt-4o / gpt-4o-mini cap completion at 16384). Reasoning models (gpt-5/o1/o3/o4) advertise larger completion limits and spend a hidden reasoning-token budget out of max_completion_tokens, so the 16384 cap would needlessly truncate them. Make the unset fallback conditional on is_reasoning: reasoning models keep their prior 32768 default (sent as max_completion_tokens); the gpt-4o-family default stays at 16384. Explicitly configured max_tokens (including 0) is still passed through unchanged. Adds reasoning-model regression tests.

r266-tech · 2026-06-23T16:54:33Z

@/tmp/reply-2755.md

r266-tech added 2 commits June 21, 2026 16:48

test(vlm): regression for default max_tokens fallback within model cap (

5beabc5

volcengine#2751) Pins the unset default to 16384 (<= gpt-4o-mini cap), that an explicit value is honored, and that an explicit falsy 0 is not silently replaced by the default.

github-project-automation Bot added this to OpenViking project Jun 21, 2026

github-project-automation Bot moved this to Backlog in OpenViking project Jun 21, 2026

github-actions Bot added the Review effort 2/5 label Jun 21, 2026

chenjw reviewed Jun 23, 2026

View reviewed changes

ZaynJarvis added bug Something isn't working scenario:kernel Core server, runtime, storage, retrieval, SDK, CLI, or Studio behavior. urgency:bug Incorrect behavior with a bounded fix path. labels Jun 23, 2026

ZaynJarvis mentioned this pull request Jun 23, 2026

[Bug] VLM default max_tokens=32768 exceeds completion-token limit of common models (e.g. gpt-4o-mini → 400), silently yielding 0 extracted memories #2751

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(vlm): default max_tokens=32768 exceeds gpt-4o-mini completion cap → silent 0-memory extraction (#2751)#2755

fix(vlm): default max_tokens=32768 exceeds gpt-4o-mini completion cap → silent 0-memory extraction (#2751)#2755
r266-tech wants to merge 3 commits into
volcengine:mainfrom
r266-tech:fix/vlm-default-max-tokens-2751

r266-tech commented Jun 21, 2026

Uh oh!

github-actions Bot commented Jun 21, 2026

Uh oh!

github-actions Bot commented Jun 21, 2026

Uh oh!

qin-ctx commented Jun 23, 2026

Uh oh!

chenjw left a comment

Uh oh!

chenjw Jun 23, 2026

Uh oh!

r266-tech commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

r266-tech commented Jun 21, 2026

Problem

Fix

Tests

Scope / known limitation

Uh oh!

github-actions Bot commented Jun 21, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented Jun 21, 2026

PR Code Suggestions ✨

Uh oh!

qin-ctx commented Jun 23, 2026

Uh oh!

chenjw left a comment

Choose a reason for hiding this comment

Uh oh!

chenjw Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

r266-tech commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants