Skip to content

[producer] Prevent RecordTooLargeException when batched partial updates exceed size limit#2660

Open
sixpluszero wants to merge 4 commits into
linkedin:mainfrom
sixpluszero:feature/batch-size-limit
Open

[producer] Prevent RecordTooLargeException when batched partial updates exceed size limit#2660
sixpluszero wants to merge 4 commits into
linkedin:mainfrom
sixpluszero:feature/batch-size-limit

Conversation

@sixpluszero

Copy link
Copy Markdown
Contributor

Problem Statement

When BatchingVeniceWriter merges multiple partial update (UPDATE) records for the same key, the merged payload can exceed the 950KB producer size limit (DEFAULT_MAX_SIZE_FOR_USER_PAYLOAD_PER_MESSAGE_IN_BYTES). This happens when updates touch different fields — the merged result contains all fields and grows beyond any individual update's size.

When the oversized merged record is produced via VeniceWriter.update(), it throws RecordTooLargeException. Since UPDATE operations do not support chunking, there is no recovery path — the exception propagates and causes nearline producer failures.

Solution

Added a size-aware fallback in BatchingVeniceWriter.maybeUpdateRecordUpdatePayload():

  1. Normal path (zero overhead): Merge all updates optimistically and serialize once. If the result fits within the limit, proceed as before.
  2. Fallback path (rare): If the merged result exceeds the limit, redo the merge incrementally with size checks after each step. When the next merge would exceed the limit, produce the accumulated result as an intermediate UPDATE, then start a new accumulation.

This preserves as much batching as possible while preventing oversize records. The fallback only triggers when the merged payload actually exceeds the limit, so there is no performance impact on the normal path.

Key changes:

  • maybeUpdateRecordUpdatePayload(): Added size check after optimistic merge; calls fallback if exceeded.
  • mergeAndProduceWithSizeLimit(): New method that re-does the merge incrementally, producing intermediate results when the limit would be exceeded.
  • produceIntermediateUpdate(): New helper that produces an intermediate merged UPDATE directly to the internal VeniceWriter with proper callback management.
  • getMaxSizeForUserPayloadPerMessageInBytes(): Package-private getter for testability.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.
      • The WARN log only fires in the rare fallback path (when merged payload exceeds 950KB), so rate limiting is not needed.

Concurrency-Specific Checks

  • Code has no race conditions or thread safety issues. All new code runs under the existing ReentrantLock in checkAndMaybeProduceBatchRecord().
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • testMergedUpdateExceedsSizeLimitProducesIntermediateBatch — 3 UPDATE records touching different fields; verifies intermediate + final produce when merged result exceeds limit.
  • testMergedUpdateUnderSizeLimitStillMerges — Regression test verifying normal merge path is unaffected.
  • testMergedUpdateWithPutAnchorExceedsSizeLimit — PUT + 2 UPDATEs where full merge exceeds limit; verifies splitting with PUT anchor.

All existing BatchingVeniceWriterTest tests continue to pass.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

🤖 Generated with Claude Code

exceed size limit

When BatchingVeniceWriter merges multiple partial updates for the same key,
the merged payload can exceed the 950KB producer size limit, causing
RecordTooLargeException since UPDATE operations don't support chunking.

This adds a size-aware fallback: after the optimistic merge, if the result
exceeds the limit, the merge is redone incrementally. When the next merge
would exceed the limit, the accumulated result is produced as an intermediate
UPDATE and a new accumulation starts. This preserves as much batching as
possible while preventing oversize records.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 25, 2026 05:12

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses nearline producer failures caused by RecordTooLargeException when BatchingVeniceWriter merges multiple partial UPDATEs for the same key into a single oversized payload (UPDATEs can’t be chunked).

Changes:

  • Add a size check after the optimistic merged UPDATE serialization and fall back to an incremental merge that emits intermediate UPDATEs when needed.
  • Introduce helpers to split/produce intermediate UPDATEs and a getter for the max payload size to improve testability.
  • Add unit tests covering oversize-merge splitting behavior and regression coverage for the normal merge path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
internal/venice-common/src/main/java/com/linkedin/venice/writer/BatchingVeniceWriter.java Adds size-aware fallback merging/splitting to prevent producing oversized merged UPDATE records.
internal/venice-common/src/test/java/com/linkedin/venice/writer/BatchingVeniceWriterTest.java Adds unit tests validating the new splitting behavior and the unchanged normal merge behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sixpluszero and others added 2 commits March 25, 2026 00:51
…e after

split, use atLeast(2)

- Chain intermediate produce future to shared produceResultFuture so async
  failures propagate to callers
- Re-serialize single update through superset schema after split for
consistency
- Complete produceResultFuture exceptionally on synchronous failures too
- Use atLeast(2) in tests instead of times(2) to avoid brittleness from
  serialized size variations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
correctness

- testIntermediateProduceFailurePropagesToProduceResultFuture: verifies
  that when VeniceWriter.update() throws during intermediate produce, the
  shared produceResultFuture is completed exceptionally
- testMultipleSplitsWithFiveUpdates: 5 updates with a tight limit forcing
  3+ splits, verifies all 5 callbacks are completed (no dropped callbacks)
- testIntermediateProducePayloadContainsCorrectFields: verifies each
  intermediate/final payload deserializes to a valid update record and
  the final payload contains expected field data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 6, 2026 16:29

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

When the final merge exceeds the limit but lastValidSerialized is null
(e.g., anchor is DELETE with no prior accumulated UPDATEs), the
intermediate produce is skipped. Previously, dependentCallbackList was
cleared unconditionally, dropping callbacks and leaving callers hanging.
Now callbacks are preserved on the final record when no intermediate
produce occurred.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown

Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.

@github-actions github-actions Bot added the stale label May 7, 2026
@github-actions

Copy link
Copy Markdown

Closing this pull request due to 37 days of inactivity. This is not a judgment on the value of the work. If you would like to continue, please reopen or open a new PR and we will be happy to take another look. Thank you again for contributing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants