feat(routing): upstream (northbound) declaration aggregation #2631

Open
BOURBONCASK wants to merge 1 commit into eclipse-zenoh:main from BOURBONCASK:feature/upstream-northbound-agg

BOURBONCASK commented Jun 2, 2026

This follows up on the discussion in #2630.

Up front, so there's no confusion about what this is: it's an experiment I built to deal with a real
routing-table scaling problem at edge-to-cloud scale, and I'm sharing it mostly as a concrete
reference / conversation starter rather than something I expect to land as-is. I understand a broader
Zenoh 2.0 redesign is only just starting to be discussed, with the scope still taking shape — so rather
than assume where this area lands, please read this as "here's one way it could look, and here's what I
learned doing it," offered as possible input to that conversation. I'm very happy to reshape it, cut it
down, rebase it onto whatever makes sense, or just leave it as something others can borrow from.

What it's for

When many downstream sessions each declare K subscribers/queryables under a shared key-expression prefix
and a router forwards them up into a router mesh, every upstream router ends up holding ~N×K
routing-table Resources (N branches × K keys). The routing table tends to be the first limit you hit.
Zenoh's existing aggregation.subscribers / publishers only collapses a session's own declarations,
not what a router forwards upstream for the sessions below it — so this extends config aggregation to a
router's northbound forwarding, letting an upstream router keep one Resource per configured prefix
instead of one per forwarded key.

What it does

When a northbound router HAT forwards a downstream subscriber/queryable whose key-expression is included
by a configured aggregation.upstream.{subscribers,queryables} prefix, it folds the per-key children
into a single ${prefix} declaration upstream and suppresses the children there — keeping them
registered in the source region so downward routing is unchanged. There's no new wire type (the aggregate
is an ordinary DeclareSubscriber / DeclareQueryable), and it's opt-in (an empty aggregation.upstream
takes the existing propagation path).

aggregation: { upstream: { subscribers: ["example/**"], queryables: ["example/**"] } }
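
To make the fold decision concrete, here is a rough sketch rather than the code in this PR. It assumes a simplified inclusion test that only understands prefixes of the form `<literal>/**`; `includes`, `Upstream` and `upstream_decl` are hypothetical names used for illustration, and real key-expression matching is richer than this.

```rust
/// Simplified stand-in for key-expression inclusion (hypothetical helper).
/// Only handles prefixes shaped like `<literal>/**`; anything else must be equal.
fn includes(prefix: &str, key: &str) -> bool {
    match prefix.strip_suffix("/**") {
        // In this sketch `a/**` includes `a` itself and everything below `a/`.
        Some(stem) => {
            key == stem || key.strip_prefix(stem).map_or(false, |rest| rest.starts_with('/'))
        }
        None => prefix == key,
    }
}

/// What the forwarding router would declare upstream for one downstream key.
#[derive(Debug, PartialEq)]
enum Upstream<'a> {
    /// Folded: only the configured prefix is declared upstream.
    Aggregate(&'a str),
    /// Not covered by any prefix: the existing per-key propagation path.
    PerKey(&'a str),
}

/// Picks the first configured prefix that includes `key`, if any.
fn upstream_decl<'a>(prefixes: &[&'a str], key: &'a str) -> Upstream<'a> {
    match prefixes.iter().copied().find(|p| includes(p, key)) {
        Some(prefix) => Upstream::Aggregate(prefix),
        None => Upstream::PerKey(key),
    }
}
```

With `["example/**"]` configured, `example/a/b` maps to the `example/**` aggregate while `other/a` keeps its per-key declaration, and an empty prefix list always takes the per-key path.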

A few design notes (in case they're useful for the broader discussion)

The aggregate Resources are created once when the gateway is built and the fold path just looks them
up — creating a Resource needs the whole Tables, which isn't reachable from inside a single HAT, and
doing it once means each aggregate's match-set is wired exactly once, so reconnect/churn can't pile up
duplicate cross-links.

The aggregate queryable is advertised complete=false (presence, not authority), so BestMatching falls
through to the real per-key queryable — a genuinely-complete source is never shadowed and no complete
state can go stale across owner churn.
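
As an illustration of why the aggregate is advertised as non-complete, the sketch below assumes a simplified BestMatching rule: prefer the first complete route if one exists, otherwise fan out to every matching route. The rule as written here is an assumption for the example, and `Route` and `best_matching` are hypothetical names, not the dispatcher's types.

```rust
/// One candidate destination for a query (hypothetical, simplified).
#[derive(Debug, PartialEq)]
struct Route {
    name: &'static str,
    complete: bool,
}

/// Assumed BestMatching rule for this sketch: the first complete route wins;
/// with no complete route, every matching route receives the query.
fn best_matching(routes: &[Route]) -> Vec<&Route> {
    match routes.iter().find(|r| r.complete) {
        Some(route) => vec![route],
        None => routes.iter().collect(),
    }
}
```

Under that rule, an aggregate with `complete: false` listed ahead of a complete `example/a` queryable is skipped in favour of the real one. Had the aggregate been advertised as complete, it would have been selected first and shadowed the real source.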

For target=AllComplete, a route entry that is non-complete, whose matched resource covers the query, and
that points at a router is treated as a transparent forwarder (a small in-process flag on the query route
entry, set only at the router-net fold site, with no wire change). AllComplete passes through it so the next
router re-applies the filter against its real children; BestMatching is untouched.
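
The AllComplete behaviour can be pictured with the following sketch. The struct, field and function names are hypothetical, and the three-condition predicate only mirrors the prose above; in the PR the flag is set at one specific fold site rather than recomputed per query.

```rust
/// Simplified query route entry (hypothetical shape for illustration).
#[derive(Debug, PartialEq)]
struct QueryRoute {
    target: &'static str,
    complete: bool,
    /// In-process only: never serialized on the wire.
    transparent_forwarder: bool,
}

/// The three conditions from the description, combined.
fn is_transparent_forwarder(complete: bool, covers_query: bool, points_at_router: bool) -> bool {
    !complete && covers_query && points_at_router
}

/// AllComplete keeps complete entries plus transparent forwarders, so the
/// next router can re-apply the completeness filter to its real children.
fn all_complete_targets(routes: &[QueryRoute]) -> Vec<&'static str> {
    routes
        .iter()
        .filter(|r| r.complete || r.transparent_forwarder)
        .map(|r| r.target)
        .collect()
}
```

A non-complete local queryable is still filtered out; only the flagged router entry is let through, which keeps the exception narrow.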

It's northbound-only, the fold goes through a refcounted ledger, and route caches are invalidated on fold
and teardown. Suspicious prefixes (a bare ** root, the @/ admin-space, duplicates, mutual inclusion)
get a startup warning.
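
A startup check of this kind could look roughly like the sketch below. It is illustrative only: `prefix_warnings` and `includes` are hypothetical names, inclusion is simplified to `<literal>/**` prefixes, and "mutual inclusion" is interpreted here as any pair where one prefix includes the other.

```rust
use std::collections::HashSet;

/// Simplified inclusion for `<stem>/**` prefixes only (hypothetical helper).
fn includes(outer: &str, inner: &str) -> bool {
    let Some(stem) = outer.strip_suffix("/**") else {
        return outer == inner;
    };
    inner == stem || inner.strip_prefix(stem).map_or(false, |rest| rest.starts_with('/'))
}

/// Returns one human-readable warning per suspicious finding.
fn prefix_warnings(prefixes: &[&str]) -> Vec<String> {
    let mut warnings = Vec::new();
    let mut seen = HashSet::new();
    for (i, p) in prefixes.iter().enumerate() {
        if *p == "**" {
            warnings.push(format!("`{p}` is a bare root wildcard"));
        }
        if p.starts_with("@/") {
            warnings.push(format!("`{p}` targets the admin space"));
        }
        if !seen.insert(*p) {
            warnings.push(format!("`{p}` is listed more than once"));
            continue; // overlaps were already reported for the first copy
        }
        // Compare against earlier prefixes only, so each pair is reported once.
        for q in &prefixes[..i] {
            if p != q && (includes(p, q) || includes(q, p)) {
                warnings.push(format!("`{p}` and `{q}` overlap"));
            }
        }
    }
    warnings
}
```

The function only reports; it does not reject the configuration, matching the "warning, not error" behaviour described above.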

Trade-offs

At the upstream node, per-key ACL / QoS / interceptors and admin-space enumeration see only the
${prefix} aggregate (so per-key policy belongs on the forwarding router); the wildcard aggregate can
forward data toward the forwarding router for unsubscribed keys; and liveliness tokens are intentionally
not folded (a liveliness sample's key is the token's key, so a folded ${prefix}/** token couldn't
enumerate per-key presence or signal a per-key loss). These are noted alongside the config in
DEFAULT_CONFIG.json5 and the schema docs.

Numbers

From an in-process loopback-TCP benchmark:

  • Upstream routing-table cardinality drops from O(N·K) to O(N) — e.g. 5000 → 100 entries at N=100, K=50.
    This is exact by construction (the fold produces one aggregate per branch), independent of RAM.
  • Roughly 6–9× less whole-process RSS at the upstream node (K=50) — a process-level measurement and a
    secondary signal; the A/B delta is dominated by the cardinality collapse, which is the load-bearing
    result.
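
The cardinality figure is plain arithmetic, sketched below with hypothetical helper names. It assumes, as stated above, that the fold leaves exactly one aggregate per branch.

```rust
/// Upstream entries without folding: one per forwarded key, for every branch.
fn unfolded_entries(branches: u64, keys_per_branch: u64) -> u64 {
    branches * keys_per_branch
}

/// Upstream entries with folding, assuming one aggregate per branch.
fn folded_entries(branches: u64) -> u64 {
    branches
}

/// How many times smaller the folded table is.
fn reduction_factor(branches: u64, keys_per_branch: u64) -> u64 {
    unfolded_entries(branches, keys_per_branch) / folded_entries(branches)
}
```

For N=100 and K=50 this gives 5000 entries unfolded, 100 folded, and a reduction factor equal to K.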

What changed

It's roughly 480 lines of routing code plus config and docs, with a similar amount of tests, all behind
the empty-config fast path. The bulk lives in the router HAT (the per-prefix fold/suppress/teardown for
subscribers and queryables). Supporting pieces add the config and its docs, the pre-created aggregates and
the prefix validation in the dispatcher tables, the transparent-forwarder flag on the query route entry
(threaded through the other HATs' query paths), and the tests.

Tests

  • Deterministic, in-process (MockFace, no sleeps): K downstream subscribers collapse to exactly one
    aggregate on the upstream face, and undeclaring all of them withdraws exactly one aggregate.
  • Real loopback-TCP integration: subscriber/queryable collapse K→1 plus delivery and teardown; wildcard
    get fans out to all children; a missing-key get returns empty without hanging; target=AllComplete
    reaches a complete child through the aggregate while a non-complete child stays empty (with a negative
    control); two branches under the same prefix don't shadow each other; cross-mesh propagation; plus an
    ignored scale bench.
  • The existing unit / regions / queryable / matching / acl / adminspace / qos suites stay
    green, clippy --deny warnings is clean, and there's no behaviour change when the config is unset.
    (Also built + run on aarch64 with the same results.)
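
The first test's expectation (K declarations produce one aggregate, and undeclaring all of them withdraws exactly one) is what a refcounted ledger gives you. The sketch below is a minimal illustration under that assumption; `FoldLedger` and `Action` are hypothetical names and not the types used in the PR.

```rust
use std::collections::HashMap;

/// What the forwarding router should send upstream after a ledger update.
#[derive(Debug, PartialEq)]
enum Action {
    Declare,
    Undeclare,
    Nothing,
}

/// Counts how many folded children currently sit behind each prefix.
#[derive(Default)]
struct FoldLedger {
    counts: HashMap<String, usize>,
}

impl FoldLedger {
    /// A child was folded under `prefix`; only the first one declares upstream.
    fn fold(&mut self, prefix: &str) -> Action {
        let n = self.counts.entry(prefix.to_string()).or_insert(0);
        *n += 1;
        if *n == 1 { Action::Declare } else { Action::Nothing }
    }

    /// A folded child went away; only the last one withdraws the aggregate.
    fn unfold(&mut self, prefix: &str) -> Action {
        let Some(n) = self.counts.get_mut(prefix) else {
            return Action::Nothing; // unknown prefix: nothing to withdraw
        };
        *n -= 1;
        if *n == 0 {
            self.counts.remove(prefix);
            Action::Undeclare
        } else {
            Action::Nothing
        }
    }
}
```

Folding the same prefix twice yields `Declare` then `Nothing`, and unfolding twice yields `Nothing` then `Undeclare`, so reconnects and churn cannot produce duplicate upstream declarations in this model.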

Compatibility

No protocol/wire change; additive opt-in config (defaults empty); MSRV (1.75) clean; built on the existing
regions/gateway routing model.


Extend config-driven aggregation to a router's northbound forwarding. When a northbound
router HAT forwards a downstream subscriber/queryable whose key-expression is included by a
configured `aggregation.upstream.{subscribers,queryables}` prefix, the per-key children are
folded into a single `${prefix}` declaration toward the upstream and suppressed there, while
staying registered in the source region so downward routing is unchanged. An upstream router
then holds one routing Resource per configured prefix instead of one per forwarded key.

The stock `aggregation.subscribers`/`publishers` only collapses a session's own declarations at
the session boundary; this collapses what a router forwards upstream on behalf of the sessions
below it -- the cost that grows as O(N*K) (N downstream branches, K keys each) on every upstream
router in a mesh.

Design:
- No new wire type: the aggregate is an ordinary DeclareSubscriber/DeclareQueryable, so mesh
  propagation, matching and admin are unchanged.
- Opt-in: an empty `aggregation.upstream` takes the existing propagation path (no-op fast path).
- Aggregate Resources are pre-created at gateway build; the fold path only looks them up (a
  Resource needs the full Tables, unreachable from inside a single HAT), so each aggregate's
  match-set is wired once and reconnect/churn cannot accumulate duplicate cross-links.
- The aggregate queryable is advertised complete=false, so BestMatching falls through to the
  real per-key queryable: a genuinely-complete source is never shadowed and no completeness
  state can go stale across owner churn.
- target=AllComplete reaches complete children behind the aggregate via a transparent-forwarder
  flag on the in-process query route entry (set only at the router-net fold site, no wire
  change); BestMatching is untouched.
- Liveliness tokens are intentionally not folded (a liveliness sample's key is the token's own
  key, so a folded wildcard token could neither enumerate the live set nor signal a per-key
  removal).
- A startup check warns on suspicious prefixes (a bare `**` root, the `@/` admin-space,
  duplicates, mutually-including prefixes).

Trade-offs (documented in DEFAULT_CONFIG.json5 and the config schema): at the upstream node,
per-key ACL/QoS-overwrite/interceptors and admin-space enumeration see only the `${prefix}`
aggregate, and the wildcard aggregate can forward data toward the forwarding router for
unsubscribed keys.

Tests: a deterministic in-process (MockFace) fold/teardown test, plus real loopback-TCP
integration tests covering subscriber/queryable collapse, delivery and teardown, wildcard get
fan-out, missing-key empty (no hang), target=AllComplete reaching a complete child through the
aggregate with a non-complete negative control, no cross-branch shadowing, and cross-mesh
propagation. MSRV 1.75; clippy --deny warnings clean; no behaviour change when the config is unset.

Signed-off-by: yifei.ma <yifeima98@gmail.com>