Bounded-heap Top-N sort for ORDER BY ... LIMIT#24
Conversation
ORDER BY previously buffered the entire result stream into an array before sorting, even when only a few rows were requested via LIMIT. For a streaming file-query library this made ORDER BY ... LIMIT the dominant memory bottleneck. When an upper bound is known, applySorting() now feeds rows into a fixed-capacity BoundedSortHeap (capacity = offset + limit) instead of materialising everything, dropping memory from O(N) to O(offset + limit). Sort keys are evaluated once per row rather than on every comparison, and ties are broken by insertion order so the output matches the stable usort() path. OFFSET-only and unbounded ORDER BY keep the full-sort path (extracted into applyFullSort()). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1biot
left a comment
There was a problem hiding this comment.
1. Summary
Really nice work — this is a genuinely valuable optimization for a streaming library, and the implementation is correct and well-documented. 👍
I traced both the heap eviction logic and the stable tie-break (seq + array_reverse) by hand, including DESC and multi-key cases, and they faithfully reproduce the stable usort() ordering. I especially appreciate the testMatchesStableSortAndSliceReference cross-check against a usort + array_slice reference over 60 randomized scenarios — that's exactly the right way to prove equivalence.
I also verified the optimization isn't defeated by Query::execute() wrapping sorted results in Results\InMemory: the sorting + limiting happens inside getIterator() before iterator_to_array() collects the (already bounded) output, so the peak really does drop from O(N) to O(offset + limit).
Just one thing I'd like changed before merge (test location), plus a few optional nits below.
2. Require changes
Move tests/FQL/Tests/Results/BoundedSortHeapTest.php to tests/Results/BoundedSortHeapTest.php
The namespace is FQL\Tests\Results, which in this repo maps to tests/Results/ (see the sibling tests/Results/DescribeResultTest.php). The new file introduces a tests/FQL/Tests/... directory tree that doesn't exist anywhere else in the project. PHPUnit still discovers it (it scans tests/ recursively and there's no PSR-4 autoload-dev), so the suite passes — but could you move it to tests/Results/ to match the existing layout? Thanks!
3. Nit - cosmetic refactor
Most of the diff in Stream.php is unrelated to the sort change — it's the \Generator → Generator, \Traversable → Traversable and \FQL\…\ExpressionNode → ExpressionNode import cleanup applied file-wide. It's fine as a change, but it inflates the diff and makes the actual Top-N logic harder to spot in review. No need to revert here, but for future PRs splitting pure-cosmetic refactors into their own commit/PR makes review much easier.
4. Nit - EXPLAIN change numbers
Heads-up: the sort phase now reports incrementOut for min(N, offset+limit) rows instead of all N (applyBoundedSort only emits the retained rows). This is arguably more accurate, but it is an observable change in EXPLAIN output for ORDER BY ... LIMIT. Might be worth a one-line note in the PR description / docs so it doesn't surprise anyone reading EXPLAIN.
5. Nit - CHANGELOG
Could you add a CHANGELOG.md entry? (3.3.0) The repo conventions track notable changes there, and a memory optimization like this is worth recording.
Summary
ORDER BYpreviously materialised the entire result stream into an in-memory array before sorting it withusort(), even when the query only asked for a handful of rows viaLIMIT. For a library whose whole purpose is streaming over large files, this madeORDER BY ... LIMITthe dominant memory bottleneck: aLIMIT 10over a million-row source still buffered all million rows just to throw away 999 990 of them.This PR adds a bounded-heap (Top-N) sort path that activates whenever an upper bound is known (
ORDER BY ... LIMIT). Memory drops from O(N) to O(offset + limit).How it works
Results\Stream::applySorting()now branches:ORDER BYwithLIMIT→ newapplyBoundedSort(), backed by a dedicatedBoundedSortHeap(SplHeap) of capacityoffset + limit. The row that would sort last sits at the root and is evicted the instant the heap exceeds capacity, so it never holds more thanoffset + limitrows.ORDER BYwithoutLIMIT(orOFFSET-only, which has no upper bound) → unchanged fullusort()path, extracted intoapplyFullSort().Additional details:
usort()path — including multi-keyORDER BYandASC/DESCmixes.ExplainCollectorinstrumentation (timers, in/out row counts) is preserved.Behavioural impact
None for callers.
OFFSET-only and unboundedORDER BYqueries keep their existing behaviour; bounded queries return the same rows in the same order, just with far less memory.Verification
BoundedSortHeapTest(targeted cases + 60 randomized scenarios cross-checked against a stableusort+ slice reference)ORDER BY score DESC LIMIT 10over 200 000 rows peaks at ~4 MB (heap holds 10 rows) instead of buffering the whole datasetFiles changed
src/Results/BoundedSortHeap.php— new fixed-capacity Top-N heapsrc/Results/Stream.php—applySorting()dispatch +applyBoundedSort()/applyFullSort()tests/FQL/Tests/Results/BoundedSortHeapTest.php— coverage🤖 Generated with Claude Code