Skip to content

Bounded-heap Top-N sort for ORDER BY ... LIMIT#24

Open
iammartinbelobrad wants to merge 1 commit into
1biot:mainfrom
iammartinbelobrad:feat/bounded-heap-order-by-limit
Open

Bounded-heap Top-N sort for ORDER BY ... LIMIT#24
iammartinbelobrad wants to merge 1 commit into
1biot:mainfrom
iammartinbelobrad:feat/bounded-heap-order-by-limit

Conversation

@iammartinbelobrad
Copy link
Copy Markdown

Summary

ORDER BY previously materialised the entire result stream into an in-memory array before sorting it with usort(), even when the query only asked for a handful of rows via LIMIT. For a library whose whole purpose is streaming over large files, this made ORDER BY ... LIMIT the dominant memory bottleneck: a LIMIT 10 over a million-row source still buffered all million rows just to throw away 999 990 of them.

This PR adds a bounded-heap (Top-N) sort path that activates whenever an upper bound is known (ORDER BY ... LIMIT). Memory drops from O(N) to O(offset + limit).

How it works

Results\Stream::applySorting() now branches:

  • ORDER BY with LIMIT → new applyBoundedSort(), backed by a dedicated BoundedSortHeap (SplHeap) of capacity offset + limit. The row that would sort last sits at the root and is evicted the instant the heap exceeds capacity, so it never holds more than offset + limit rows.
  • ORDER BY without LIMIT (or OFFSET-only, which has no upper bound) → unchanged full usort() path, extracted into applyFullSort().

Additional details:

  • Sort keys are evaluated once per row instead of on every comparison.
  • Ties are broken by insertion order, so the output is identical to the stable usort() path — including multi-key ORDER BY and ASC/DESC mixes.
  • ExplainCollector instrumentation (timers, in/out row counts) is preserved.

Behavioural impact

None for callers. OFFSET-only and unbounded ORDER BY queries keep their existing behaviour; bounded queries return the same rows in the same order, just with far less memory.

Verification

  • PHPCS (PSR-12) — clean
  • PHPStan level 8 — no errors
  • PHPUnit — full suite passes (1401 tests, 5645 assertions), incl. a new dedicated BoundedSortHeapTest (targeted cases + 60 randomized scenarios cross-checked against a stable usort + slice reference)
  • Memory: ORDER BY score DESC LIMIT 10 over 200 000 rows peaks at ~4 MB (heap holds 10 rows) instead of buffering the whole dataset

Files changed

  • src/Results/BoundedSortHeap.php — new fixed-capacity Top-N heap
  • src/Results/Stream.phpapplySorting() dispatch + applyBoundedSort() / applyFullSort()
  • tests/FQL/Tests/Results/BoundedSortHeapTest.php — coverage

🤖 Generated with Claude Code

ORDER BY previously buffered the entire result stream into an array
before sorting, even when only a few rows were requested via LIMIT.
For a streaming file-query library this made ORDER BY ... LIMIT the
dominant memory bottleneck.

When an upper bound is known, applySorting() now feeds rows into a
fixed-capacity BoundedSortHeap (capacity = offset + limit) instead of
materialising everything, dropping memory from O(N) to O(offset + limit).
Sort keys are evaluated once per row rather than on every comparison,
and ties are broken by insertion order so the output matches the stable
usort() path. OFFSET-only and unbounded ORDER BY keep the full-sort path
(extracted into applyFullSort()).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@1biot 1biot self-requested a review June 5, 2026 10:07
Copy link
Copy Markdown
Owner

@1biot 1biot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. Summary

Really nice work — this is a genuinely valuable optimization for a streaming library, and the implementation is correct and well-documented. 👍

I traced both the heap eviction logic and the stable tie-break (seq + array_reverse) by hand, including DESC and multi-key cases, and they faithfully reproduce the stable usort() ordering. I especially appreciate the testMatchesStableSortAndSliceReference cross-check against a usort + array_slice reference over 60 randomized scenarios — that's exactly the right way to prove equivalence.

I also verified the optimization isn't defeated by Query::execute() wrapping sorted results in Results\InMemory: the sorting + limiting happens inside getIterator() before iterator_to_array() collects the (already bounded) output, so the peak really does drop from O(N) to O(offset + limit).

Just one thing I'd like changed before merge (test location), plus a few optional nits below.

2. Require changes

Move tests/FQL/Tests/Results/BoundedSortHeapTest.php to tests/Results/BoundedSortHeapTest.php

The namespace is FQL\Tests\Results, which in this repo maps to tests/Results/ (see the sibling tests/Results/DescribeResultTest.php). The new file introduces a tests/FQL/Tests/... directory tree that doesn't exist anywhere else in the project. PHPUnit still discovers it (it scans tests/ recursively and there's no PSR-4 autoload-dev), so the suite passes — but could you move it to tests/Results/ to match the existing layout? Thanks!

3. Nit - cosmetic refactor

Most of the diff in Stream.php is unrelated to the sort change — it's the \GeneratorGenerator, \TraversableTraversable and \FQL\…\ExpressionNodeExpressionNode import cleanup applied file-wide. It's fine as a change, but it inflates the diff and makes the actual Top-N logic harder to spot in review. No need to revert here, but for future PRs splitting pure-cosmetic refactors into their own commit/PR makes review much easier.

4. Nit - EXPLAIN change numbers

Heads-up: the sort phase now reports incrementOut for min(N, offset+limit) rows instead of all N (applyBoundedSort only emits the retained rows). This is arguably more accurate, but it is an observable change in EXPLAIN output for ORDER BY ... LIMIT. Might be worth a one-line note in the PR description / docs so it doesn't surprise anyone reading EXPLAIN.

5. Nit - CHANGELOG

Could you add a CHANGELOG.md entry? (3.3.0) The repo conventions track notable changes there, and a memory optimization like this is worth recording.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants