streaming-partitioned aggregate protocol

rustyconover · claude · rustyconover · commit 81f723670263 · 2026-05-02T18:23:12.000-04:00
Add a third execution path for VGI aggregate functions, alongside the
existing GROUP BY (update/combine/finalize) and windowed (window/_init/
_batch) paths. Functions opt in via Meta.streaming_partitioned=True;
the VGI DuckDB extension's optimizer rule replaces eligible LogicalWindow
nodes with a custom streaming operator that pipes input chunks straight
to the worker — no DuckDB-side partition materialisation.

Three new classmethod hooks on AggregateFunction:

    streaming_open(params) -> StreamingState
    streaming_chunk(chunk, state, partition_key_count, order_key_count, params) -> pa.Array
    streaming_close(state, params) -> None

The worker holds concurrent per-partition state in a hash map keyed by
partition tuple; each input row updates its partition's state and emits
a snapshot. Memory is bounded by partitions × per-partition-state, not
by row count — the structural answer to "running aggregate over
unbounded ordered input."
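
A minimal sketch of that dispatch, assuming a running sum as the
aggregate. Rows are plain tuples here rather than Arrow record batches,
and running_sum_chunk is a hypothetical stand-in for streaming_chunk,
not code from this change:

```python
from collections import defaultdict

# Hypothetical stand-in for streaming_chunk. Each row is a tuple laid out as
# (partition keys..., order keys..., value); real chunks are Arrow batches.
def running_sum_chunk(rows, state, partition_key_count, order_key_count):
    value_index = partition_key_count + order_key_count
    out = []
    for row in rows:
        key = tuple(row[:partition_key_count])  # partition tuple
        state[key] += row[value_index]          # update that partition's state
        out.append(state[key])                  # snapshot for this input row
    return out

# Roughly what streaming_open would hand back: one entry per partition seen.
state = defaultdict(int)

# Two chunks with interleaved partitions; state carries over between calls.
first = running_sum_chunk([("a", 1, 10), ("b", 1, 5), ("a", 2, 7)], state, 1, 1)
second = running_sum_chunk([("b", 2, 1), ("a", 3, 3)], state, 1, 1)
```

After both chunks the map holds two entries, one per partition, however
many rows were fed through it.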

Wire protocol: three new unary RPCs (aggregate_streaming_open,
aggregate_streaming_chunk, aggregate_streaming_close), all carrying
the standard {request: binary} envelope shape. Session state is held
in an in-process LRU cache for the fast path and persisted to
FunctionStorage (under the existing aggregate_window_partition_put key)
so chunk RPCs landing on a different worker pool entry can rehydrate
correctly. Same affinity pattern as the windowed path.
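
The cache-then-rehydrate lookup might look roughly like the sketch
below. SessionCache, the shared dict standing in for FunctionStorage,
and the use of pickle are all assumptions made for illustration; the
actual storage API and serialisation are not shown in this change:

```python
import pickle
from collections import OrderedDict

class SessionCache:
    """Hypothetical two-tier session store: in-process LRU plus a durable map."""

    def __init__(self, storage, capacity=2):
        self.storage = storage    # stand-in for FunctionStorage (bytes values)
        self.capacity = capacity
        self.lru = OrderedDict()  # execution_id -> live state object

    def _remember(self, execution_id, state):
        self.lru[execution_id] = state
        self.lru.move_to_end(execution_id)
        while len(self.lru) > self.capacity:
            self.lru.popitem(last=False)  # evict the least recently used entry

    def put(self, execution_id, state):
        self._remember(execution_id, state)
        self.storage[execution_id] = pickle.dumps(state)  # visible to other workers

    def get(self, execution_id):
        if execution_id in self.lru:  # fast path: same worker as before
            self.lru.move_to_end(execution_id)
            return self.lru[execution_id]
        state = pickle.loads(self.storage[execution_id])  # rehydrate elsewhere
        self._remember(execution_id, state)
        return state

shared = {}  # plays the role of storage reachable from every pool entry
worker_a = SessionCache(shared)
worker_b = SessionCache(shared)
worker_a.put(b"exec-1", {"partition_states": {("a",): 17}})
rehydrated = worker_b.get(b"exec-1")  # the chunk RPC landed on another worker
```

In this sketch the persisted copy is only as fresh as the last put, so
a worker that mutates state in place would need to call put again after
each chunk for a different worker to see the update.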

Eligibility (enforced by the C++ optimizer rule, not this change):
cumulative frame only, no EXCLUDE/DISTINCT/FILTER/arg-orders,
no const-arg parameters in v1. Queries that don't satisfy all of
these conditions fall back to the standard windowed path; the
streaming path is additive, not a replacement.
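
As an illustration only, the conditions above can be written as a single
predicate. The real rule lives in the C++ extension; WindowCall and
choose_path below are hypothetical names, and the fields are a
simplified summary of one window expression:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WindowCall:
    """Hypothetical summary of one window expression (not the extension's type)."""
    streaming_partitioned: bool
    frame_start: str = "UNBOUNDED PRECEDING"
    frame_end: str = "CURRENT ROW"
    has_exclude: bool = False
    is_distinct: bool = False
    has_filter: bool = False
    has_arg_order: bool = False
    const_arg_count: int = 0

def choose_path(call: WindowCall) -> str:
    # Cumulative frame: everything from the partition start up to this row.
    cumulative = (
        call.frame_start == "UNBOUNDED PRECEDING"
        and call.frame_end == "CURRENT ROW"
    )
    # Any of these modifiers disqualifies the streaming path.
    blocked = (
        call.has_exclude or call.is_distinct
        or call.has_filter or call.has_arg_order
    )
    eligible = (
        call.streaming_partitioned
        and cumulative
        and not blocked
        and call.const_arg_count == 0  # v1 limitation
    )
    return "streaming" if eligible else "windowed"

plain = choose_path(WindowCall(streaming_partitioned=True))
filtered = choose_path(WindowCall(streaming_partitioned=True, has_filter=True))
sliding = choose_path(WindowCall(streaming_partitioned=True, frame_start="3 PRECEDING"))
```

Every failing condition lands on the same fallback, which is what keeps
the streaming path additive.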

Documented in docs/aggregate-functions.md — including a note that
pre-aggregation (GROUP BY ... + OVER) is the right pattern for most
analytics shapes; the streaming path's unique value is for shapes
where pre-aggregation isn't algebraically valid (per-fill running
views, very high cardinality, future continuous feeds).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/aggregate-functions.md b/docs/aggregate-functions.md
@@ -286,6 +286,74 @@ worker = Worker(
 
 The framework automatically detects `AggregateFunction` subclasses and registers them with the correct function type in the catalog.
 
+## Streaming-Partitioned Variant
+
+For `OVER (PARTITION BY ... ORDER BY ...)` queries against unbounded inputs (e.g. running aggregates across years of trade history), the standard windowed path materializes each partition in DuckDB memory before the aggregate sees it. That is fine for bounded data, but it runs out of memory at scale.
+
+The `streaming_partitioned` opt-in routes those queries through a custom physical operator in the VGI DuckDB extension: input chunks pipe directly to the worker, the worker maintains concurrent per-partition state in a hash map keyed by partition tuple, and each input chunk produces a same-length output array of cumulative snapshots. No DuckDB-side partition materialization; memory is bounded by `partitions × state_per_partition`, not by row count.
+
+```python
+class MyRunningAgg(AggregateFunction[MyState]):
+    class Meta:
+        name = "my_running_agg"
+        streaming_partitioned = True   # opt-in
+        # supports_window may also be set; the optimizer chooses the
+        # streaming path for eligible queries and falls back to the
+        # windowed path otherwise.
+
+    @classmethod
+    def streaming_open(cls, params: ProcessParams[None]) -> dict[str, Any]:
+        # Build cross-partition session state. Returned object lives in
+        # an in-process cache for the duration of the session and is
+        # also persisted to FunctionStorage so chunk RPCs landing on a
+        # different pool worker can rehydrate.
+        return {"partition_states": {}}
+
+    @classmethod
+    def streaming_chunk(
+        cls,
+        chunk: pa.RecordBatch,
+        streaming_state: dict[str, Any],
+        partition_key_count: int,
+        order_key_count: int,
+        params: ProcessParams[None],
+    ) -> pa.Array:
+        # Column layout in `chunk`:
+        #   [partition_key_cols..., order_key_cols..., value_cols...]
+        # Return one output value per input row (cumulative snapshot
+        # at that row's position in its partition's order).
+        ...
+
+    @classmethod
+    def streaming_close(cls, streaming_state, params) -> None:
+        # Cleanup hook (called once per session). Default: no-op.
+        ...
+```
+
+**Eligibility for the streaming path** is decided by the extension's optimizer rule and requires:
+
+- `streaming_partitioned = True` on the function's Meta.
+- A cumulative frame: `ROWS/RANGE/GROUPS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` (or the implicit cumulative frame DuckDB emits when only `ORDER BY` is given).
+- No `EXCLUDE`, `DISTINCT`, `FILTER (WHERE ...)`, or aggregate-arg `ORDER BY`.
+- The worker function declares no const-arg parameters (v1 limitation).
+
+Queries that don't satisfy all of these fall back to the standard windowed path automatically. The streaming path is opt-in and additive — it does not replace `update`/`combine`/`finalize`, which still service `GROUP BY` queries normally.
+
+**When pre-aggregation is the better answer.** For most analytics shapes — "EOD positions per book per day, carrying forward across days" — pre-aggregating the input is the cleanest pattern in plain SQL:
+
+```sql
+WITH per_period_net AS (
+  SELECT book, period_key, symbol, SUM(quantity) AS quantity
+  FROM trades GROUP BY book, period_key, symbol
+)
+SELECT book, period_key,
+       my_running_agg(symbol, quantity)
+         OVER (PARTITION BY book ORDER BY period_key) AS running
+FROM per_period_net;
+```
+
+The pre-aggregate collapses fills within each period before the OVER sees them, so the OVER emits one row per `(book, period_key, symbol)` group rather than one row per fill, which is usually what the query is after. The streaming path is the right tool when pre-aggregation isn't viable: per-fill running views, very high symbol cardinality per partition, or aggregates whose state isn't algebraically reducible by a pre-aggregate.
+
 ## Example Functions
 
 See `vgi/examples/aggregate.py` for complete implementations:
diff --git a/vgi/aggregate_function.py b/vgi/aggregate_function.py
@@ -515,3 +515,90 @@ def window_batch(
             cls.window(rid, frames, partition, window_state, params)
             for rid, frames in zip(row_ids, subframes, strict=True)
         ]
+
+    # ------------------------------------------------------------------
+    # Optional streaming-partitioned callbacks
+    # ------------------------------------------------------------------
+    # Enable by setting ``Meta.streaming_partitioned = True`` and overriding
+    # ``streaming_chunk()`` (and optionally ``streaming_open`` /
+    # ``streaming_close``).
+    #
+    # Streaming-partitioned aggregates handle queries shaped like
+    # ``f(...) OVER (PARTITION BY p ORDER BY o)`` with a cumulative frame
+    # (``UNBOUNDED PRECEDING -> CURRENT ROW``) where the input is too large
+    # to materialise in DuckDB memory but compresses heavily into per-
+    # partition state. The framework streams input chunks to the worker;
+    # the worker maintains concurrent per-partition state in a hash map and
+    # emits one output row per input row.
+
+    @classmethod
+    def streaming_open(cls, params: ProcessParams[Any]) -> Any:
+        """Build cross-partition global state for a streaming session.
+
+        Called once when ``aggregate_streaming_open`` arrives, before any
+        chunk is processed. Return any object (it lives in an in-process
+        cache keyed by ``execution_id`` for the duration of the session).
+
+        Typical contents: a ``dict`` of per-partition aggregate states
+        (populated lazily as new partition keys appear in input chunks),
+        plus any cross-partition resources to share — symbol intern
+        tables, allocator pools, prepared output buffers.
+
+        Default implementation returns ``None`` (no shared state). The
+        function still runs in that case, but locals in ``streaming_chunk``
+        do not survive between chunks, so per-partition state would have
+        to be kept somewhere else.
+        """
+        return None
+
+    @classmethod
+    def streaming_chunk(
+        cls,
+        chunk: pa.RecordBatch,
+        streaming_state: Any,
+        partition_key_count: int,
+        order_key_count: int,
+        params: ProcessParams[Any],
+    ) -> "pa.Array | list[Any]":
+        """Process one chunk of streaming input.
+
+        Args:
+            chunk: Input rows for this batch. Schema layout is
+                ``[partition_key_cols..., order_key_cols..., value_cols...]``
+                — the first ``partition_key_count`` columns are partition
+                keys (used to dispatch to the right per-partition state),
+                the next ``order_key_count`` are order keys (informational;
+                may be used to verify monotonicity), the rest are the
+                function's value arguments in declaration order.
+            streaming_state: Whatever ``streaming_open`` returned. The
+                framework passes the same object on every chunk; mutate
+                in place to accumulate state across chunks.
+            partition_key_count: Number of leading columns that form the
+                partition key.
+            order_key_count: Number of columns following the partition key
+                that form the order key.
+            params: Shared ``ProcessParams``.
+
+        Returns:
+            Either a :class:`pa.Array` of length ``chunk.num_rows`` matching
+            the function's output type, or a list of the same length
+            (which the framework converts via ``pa.array``). Each output
+            value is the cumulative aggregate snapshot at that input
+            row's position in its partition's order.
+        """
+        raise NotImplementedError(
+            f"{cls.__name__}: Meta.streaming_partitioned=True requires overriding streaming_chunk()"
+        )
+
+    @classmethod
+    def streaming_close(cls, streaming_state: Any, params: ProcessParams[Any]) -> None:
+        """Tear down streaming session state.
+
+        Called once when ``aggregate_streaming_close`` arrives, after the
+        last chunk. Use to release any external resources held by
+        ``streaming_state``. The framework drops its reference after this
+        call, so anything not held elsewhere is GCed naturally.
+
+        Default implementation is a no-op.
+        """
+        return None
diff --git a/vgi/catalog/catalog_interface.py b/vgi/catalog/catalog_interface.py
@@ -367,6 +367,11 @@ class FunctionInfo(CatalogSchemaObject, ArrowSerializableDataclass):
     distinct_dependent: DistinctDependence = DistinctDependence.NOT_DISTINCT_DEPENDENT
     # True if the aggregate implements the window() callback
     supports_window: bool = False
+    # True if the aggregate opts into the streaming-partitioned protocol —
+    # ``aggregate_streaming_open`` / ``_chunk`` / ``_close``. The DuckDB
+    # extension's optimizer rule may rewrite eligible LogicalWindow nodes to
+    # use this path.
+    streaming_partitioned: bool = False
 
     # True if a table-in-out function declares a finalize/finish stage.
     # The C++ extension uses this to conditionally register
@@ -2133,6 +2138,7 @@ def _function_to_info(self, func_cls: type, schema_name: str) -> FunctionInfo:
             order_dependent=meta.order_dependent,
             distinct_dependent=meta.distinct_dependent,
             supports_window=meta.supports_window,
+            streaming_partitioned=meta.streaming_partitioned,
             has_finalize=meta.has_finalize,
             # Settings
             required_settings=meta.required_settings,
diff --git a/vgi/metadata.py b/vgi/metadata.py
@@ -328,6 +328,7 @@ class ResolvedMetadata:
     order_dependent: OrderDependence = OrderDependence.NOT_ORDER_DEPENDENT
     distinct_dependent: DistinctDependence = DistinctDependence.NOT_DISTINCT_DEPENDENT
     supports_window: bool = False
+    streaming_partitioned: bool = False
 
     # Table-in-out specific: True if the function has a meaningful finalize phase
     # (override of finalize()/finish()). Used by the C++ extension to decide
@@ -359,6 +360,7 @@ def to_dict(self) -> dict[str, Any]:
             "order_dependent": self.order_dependent.name,
             "distinct_dependent": self.distinct_dependent.name,
             "supports_window": self.supports_window,
+            "streaming_partitioned": self.streaming_partitioned,
             "has_finalize": self.has_finalize,
         }
 
@@ -387,6 +389,7 @@ def from_dict(d: dict[str, Any]) -> ResolvedMetadata:
             order_dependent=OrderDependence[d.get("order_dependent", "NOT_ORDER_DEPENDENT")],
             distinct_dependent=DistinctDependence[d.get("distinct_dependent", "NOT_DISTINCT_DEPENDENT")],
             supports_window=d.get("supports_window", False),
+            streaming_partitioned=d.get("streaming_partitioned", False),
             has_finalize=d.get("has_finalize", False),
         )
 
@@ -777,6 +780,7 @@ def _normalize_examples(
         "order_dependent",
         "distinct_dependent",
         "supports_window",
+        "streaming_partitioned",
         # Scalar function specific
         "output_type",  # pa.DataType | type[AnyArrow] for scalar functions
     }
@@ -941,6 +945,7 @@ def resolve_metadata(cls: type) -> ResolvedMetadata:
         order_dependent=attrs.get("order_dependent", OrderDependence.NOT_ORDER_DEPENDENT),
         distinct_dependent=attrs.get("distinct_dependent", DistinctDependence.NOT_DISTINCT_DEPENDENT),
         supports_window=bool(attrs.get("supports_window", False)),
+        streaming_partitioned=bool(attrs.get("streaming_partitioned", False)),
         has_finalize=_detect_has_finalize(cls, function_type),
     )
 
@@ -1028,6 +1033,7 @@ def _detect_has_finalize(cls: type, function_type: CatalogFunctionType) -> bool:
         pa.field("order_dependent", pa.string()),
         pa.field("distinct_dependent", pa.string()),
         pa.field("supports_window", pa.bool_()),
+        pa.field("streaming_partitioned", pa.bool_()),
         pa.field("has_finalize", pa.bool_()),
     ]
 )
diff --git a/vgi/protocol.py b/vgi/protocol.py
@@ -1057,6 +1057,93 @@ class AggregateWindowBatchResponse(ArrowSerializableDataclass):
     result_batch: bytes  # Full IPC stream bytes (count rows, output schema)
 
 
+# ---------------------------------------------------------------------------
+# Aggregate Streaming-Partitioned RPC Types
+# ---------------------------------------------------------------------------
+# Streaming protocol for partitioned aggregates whose state compresses
+# heavily relative to input rows (e.g. portfolio_agg's positions dict vs
+# millions of fills). DuckDB streams input chunks to the worker; the worker
+# maintains concurrent per-partition state in a hash map keyed by partition
+# key, dispatches each row to its partition's state, and emits one snapshot
+# per input row. No DuckDB-side partition materialisation. Cumulative
+# semantics only (UNBOUNDED PRECEDING -> CURRENT ROW); other frame shapes
+# fall back to the non-streaming path.
+
+
+@dataclass(frozen=True, slots=True, kw_only=True)
+class AggregateStreamingOpenRequest(ArrowSerializableDataclass):
+    """Request for aggregate_streaming_open — start a streaming session.
+
+    The worker resolves the function, calls ``streaming_open`` to build the
+    cross-partition global state, and returns an ``execution_id`` that
+    subsequent chunk/close calls reference.
+
+    ``input_schema`` is the schema of every chunk shipped via
+    ``streaming_chunk``. The first ``partition_key_count`` columns are
+    partition-key columns (used by the worker to dispatch rows to the right
+    per-partition state). The next ``order_key_count`` columns are
+    order-key columns (informational; the worker may verify monotonicity).
+    Remaining columns are the function's value arguments, in declaration
+    order.
+    """
+
+    function_name: str
+    arguments: Annotated[Arguments, ArrowType(pa.binary())]
+    input_schema: Annotated[pa.Schema, ArrowType(pa.binary())]
+    partition_key_count: int
+    order_key_count: int
+    output_schema: Annotated[pa.Schema, ArrowType(pa.binary())]
+    settings: Annotated[pa.RecordBatch | None, ArrowType(pa.binary())] = None
+    secrets: Annotated[pa.RecordBatch | None, ArrowType(pa.binary())] = None
+    attach_id: bytes | None = None
+
+
+@dataclass(frozen=True, slots=True, kw_only=True)
+class AggregateStreamingOpenResponse(ArrowSerializableDataclass):
+    """Response from aggregate_streaming_open — session token."""
+
+    execution_id: bytes
+
+
+@dataclass(frozen=True, slots=True, kw_only=True)
+class AggregateStreamingChunkRequest(ArrowSerializableDataclass):
+    """Request for aggregate_streaming_chunk — process one input chunk.
+
+    ``input_batch`` schema must match the ``input_schema`` agreed at
+    ``streaming_open``. The worker iterates rows, dispatches to per-partition
+    state by the partition-key columns, applies the function's update logic,
+    and returns a same-length output array.
+    """
+
+    function_name: str
+    execution_id: bytes
+    input_batch: bytes  # Full IPC stream bytes
+    attach_id: bytes | None = None
+
+
+@dataclass(frozen=True, slots=True, kw_only=True)
+class AggregateStreamingChunkResponse(ArrowSerializableDataclass):
+    """Response from aggregate_streaming_chunk — same-length output batch."""
+
+    result_batch: bytes  # Full IPC stream bytes (one row per input row)
+
+
+@dataclass(frozen=True, slots=True, kw_only=True)
+class AggregateStreamingCloseRequest(ArrowSerializableDataclass):
+    """Request for aggregate_streaming_close — end the session, free state."""
+
+    function_name: str
+    execution_id: bytes
+    attach_id: bytes | None = None
+
+
+@dataclass(frozen=True, slots=True, kw_only=True)
+class AggregateStreamingCloseResponse(ArrowSerializableDataclass):
+    """Response from aggregate_streaming_close — empty ack."""
+
+    pass
+
+
 # ---------------------------------------------------------------------------
 # VGI Protocol
 # ---------------------------------------------------------------------------
@@ -1138,6 +1225,26 @@ def aggregate_window_batch(self, request: AggregateWindowBatchRequest) -> Aggreg
         """Compute ``count`` window output rows in one batched RPC."""
         ...
 
+    # ========== Aggregate Streaming-Partitioned Methods (optional, all unary) ==========
+
+    def aggregate_streaming_open(
+        self, request: AggregateStreamingOpenRequest
+    ) -> AggregateStreamingOpenResponse:
+        """Start a streaming-partitioned aggregate session."""
+        ...
+
+    def aggregate_streaming_chunk(
+        self, request: AggregateStreamingChunkRequest
+    ) -> AggregateStreamingChunkResponse:
+        """Process one input chunk; returns one output row per input row."""
+        ...
+
+    def aggregate_streaming_close(
+        self, request: AggregateStreamingCloseRequest
+    ) -> AggregateStreamingCloseResponse:
+        """End the streaming session, free per-session state."""
+        ...
+
     # ========== Catalog - Discovery ==========
 
     def catalog_catalogs(self) -> CatalogsResponse:
diff --git a/vgi/worker.py b/vgi/worker.py