Skip to content

Fix SizeLimitedStream cursor over-counting elements across pages#45

Merged
lengare merged 5 commits into
mainfrom
lesianlen/feature/size-limited-stream-cursor-drift
May 26, 2026
Merged

Fix SizeLimitedStream cursor over-counting elements across pages#45
lengare merged 5 commits into
mainfrom
lesianlen/feature/size-limited-stream-cursor-drift

Conversation

@lengare
Copy link
Copy Markdown
Contributor

@lengare lengare commented May 26, 2026

What and why? 🤔

SizeLimitedStream is meant to cap a stream at limit elements across a paginated session, but a limit: 20 served across four pages stream ever served 18: three pages of 5, a starved 4th page of 3, then exhaustion.

Root cause. SizeLimitedStreamCursor::_combine_with used max($a, $b) + 1. The +1 is correct only for the within-page fold (StreamResult::get_combined_cursor), but a typical paginating caller also re-combines a page's already-combined cursor with the incoming request cursor — combining two aggregates — so the +1 double-counts each page boundary once per page. current_size drifts above the true count by pages − 1:

After page true served current_size (before)
1 5 5
2 10 11
3 15 17
4 18 21 → next page sees balance ≤ 0

By page 4 the drifted size (17) leaves balance = 20 − 17 = 3, so the page is clamped to 3 and the stream stops short of its own limit.

Fix. Make the cursor's count a true absolute cumulative position and combine idempotently:

  • SizeLimitedStream::_enumerate tags each element with current_size + $index + 1 (not a constant + 1).
  • SizeLimitedStreamCursor::_combine_with uses plain max (no + 1).

Both changes are required together. After the fix, current_size runs 5 → 10 → 15 → 20 and the stream serves exactly 20.

Testing Steps ✍️

  • vendor/bin/phpunit tests/unit/Tumblr/StreamBuilder/Streams/SizeLimitedStreamTest.php
  • vendor/bin/phpunit tests/unit/Tumblr/StreamBuilder/StreamCursors/SizeLimitedStreamCursorTest.php
  • Confirm the new testEnumerateReachesLimitAcrossPages passes — it walks four equal pages and asserts the cursor's cumulative size is 5 → 10 → 15 → 20 (it fails on the old + 1 arithmetic with 11 !== 10 at page 2).
  • On Tumblr's devbox, you can test using the testing instructions provided in https://github.tumblr.net/Tumblr/tumblr/pull/67056

You're it! 👑

@Automattic/stream-builders

@lengare lengare requested a review from a team as a code owner May 26, 2026 16:39
@lengare lengare requested a review from koke May 26, 2026 16:39
@coveralls
Copy link
Copy Markdown

coveralls commented May 26, 2026

Coverage Status

coverage: 80.255%. remained the same — lesianlen/feature/size-limited-stream-cursor-drift into main

@lengare lengare requested a review from lucila May 26, 2026 16:50
Comment thread tests/unit/Tumblr/StreamBuilder/Streams/SizeLimitedStreamTest.php Outdated
Comment thread tests/unit/Tumblr/StreamBuilder/Streams/SizeLimitedStreamTest.php
Co-authored-by: Lucila Stancato <lucila@users.noreply.github.com>

// Limit reached: next page is empty and exhaustive, and the inner stream is not queried again.
$result = $size_limited_stream->enumerate($page_size, $cursor);
$this->assertSame(0, $result->get_size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$this->assertSame(0, $result->get_size());
$this->assertSame(2, $result->get_size());

can we make stream return 3 elements instead of 5 so that we need to fetch more pages. The last page should contain less elements => 2. because the inner stream returns 3 elements each time ,but size limited stream will truncate the results....

pagination would result in:

| page | elements | exhausted | count so far |
|1 | 3 | false | 3 |
|2 | 3 | false | 6 |
|3 | 3 | false | 9 |
|4 | 3 | false | 12 |
|5 | 3 | false | 15 |
|6 | 3 | false | 18 |
|7 | 2 | false | 20 |

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — reworked the test to walk 7 pages and assert the truncated final page, matching your table: pages serve 3, 3, 3, 3, 3, 3, 2 for limit: 20, with exactly(7) inner queries.

One adjustment worth flagging: I set the requested page size to 3 (equal to the inner batch) rather than having the inner return fewer than the requested count. SizeLimitedStream computes its own exhaustion as count(derived) < $count, so if we request 5 and the inner returns 3, every short page reports is_exhaustive === true — which would contradict the "exhausted: false" rows in your table and means a real consumer would stop after page 1. Requesting 3 keeps count == derived on full pages, so they stay non-exhausted exactly as charted, and the last page still truncates 3 → 2 against the remaining budget.

(The single-page "inner returns fewer than requested" path is already covered by test_enumerate_less_count.)

Copy link
Copy Markdown
Contributor

@lucila lucila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we adjust unit tests?

@lucila
Copy link
Copy Markdown
Contributor

lucila commented May 26, 2026

SizeLimitedStream computes its own exhaustion as count(derived) < $count, so if we request 5 and the inner returns 3, every short page reports is_exhaustive === true — which would contradict the "exhausted: false" rows in your table and means a real consumer would stop after page 1

🤔

shouldn't the exhaust condition depend on the inner stream??

original code

$result = $this->stream->enumerate($count, $cursor->get_inner_cursor(), $tracer, $option);
$elements = $result->get_elements();
...
// if this page size is greater than stream's limit, we don't want to keep paginating.
$is_exhausted = count($derived_elements) < $count || count($derived_elements) >= $this->limit;

suggestion

$result = $this->stream->enumerate($count, $cursor->get_inner_cursor(), $tracer, $option);
$elements = $result->get_elements();
...
// if this page size is greater than stream's limit, we don't want to keep paginating.
$is_exhausted = $result->is_exhaustive() || count($derived_elements) >= $this->limit;

Copy link
Copy Markdown
Contributor

@lucila lucila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on the fix.

Since you are improving SizeLimitedStream, it would be nice to address this early exhaustion issue: #45 (comment)

SizeLimitedStream inferred exhaustion from count(derived) < requested
count, which prematurely ended pagination whenever an inner stream
returned a short-but-not-exhausted page (e.g. a FilteredStream that spent
its retries). Trust the inner stream's is_exhaustive() signal instead;
the limit is still enforced by the >= limit clause and the balance check.

Replace test_enumerate_less_count (which only covered the deleted
heuristic) with test_enumerate_inner_exhausted / _inner_not_exhausted.
@lengare
Copy link
Copy Markdown
Contributor Author

lengare commented May 26, 2026

Good call — applied in fe469c7.

@lengare lengare merged commit 0ef7536 into main May 26, 2026
11 checks passed
@lengare lengare deleted the lesianlen/feature/size-limited-stream-cursor-drift branch May 26, 2026 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants