feat(routing): upstream (northbound) declaration aggregation by BOURBONCASK · Pull Request #2631 · eclipse-zenoh/zenoh

BOURBONCASK · 2026-06-02T09:18:03Z

This follows up on the discussion in #2630.

Up front, so there's no confusion about what this is: it's an experiment I built to deal with a real
routing-table scaling problem at edge-to-cloud scale, and I'm sharing it mostly as a concrete
reference / conversation starter rather than something I expect to land as-is. I understand a broader
Zenoh 2.0 redesign is only just starting to be discussed, with the scope still taking shape — so rather
than assume where this area lands, please read this as "here's one way it could look, and here's what I
learned doing it," offered as possible input to that conversation. I'm very happy to reshape it, cut it
down, rebase it onto whatever makes sense, or just leave it as something others can borrow from.

What it's for

When many downstream sessions each declare K subscribers/queryables under a shared key-expression prefix
and a router forwards them up into a router mesh, every upstream router ends up holding ~N×K
routing-table Resources (N branches × K keys). The routing table tends to be the first limit you hit.
Zenoh's existing aggregation.subscribers / publishers settings only collapse a session's own declarations,
not what a router forwards upstream for the sessions below it. This PR therefore extends config aggregation
to a router's northbound forwarding, letting an upstream router keep one Resource per configured prefix
instead of one per forwarded key.

What it does

When a northbound router HAT forwards a downstream subscriber/queryable whose key-expression is included
by a configured aggregation.upstream.{subscribers,queryables} prefix, it folds the per-key children
into a single ${prefix} declaration upstream and suppresses the children there — keeping them
registered in the source region so downward routing is unchanged. There's no new wire type (the aggregate
is an ordinary DeclareSubscriber / DeclareQueryable), and it's opt-in (an empty aggregation.upstream
takes the existing propagation path).

aggregation: { upstream: { subscribers: ["example/**"], queryables: ["example/**"] } }
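As a rough illustration of the fold decision described above, here is a minimal sketch. It is not the PR's code: `prefix_includes` and `fold_target` are hypothetical names, and the matcher is deliberately simplified to prefixes of the form `<base>/**` (real Zenoh key-expression inclusion also handles `*` and `**` in other positions).

```rust
/// Hypothetical, simplified inclusion check: only understands prefixes of the
/// form `<base>/**`. `**` matches zero or more chunks, so `<base>` itself is
/// included as well as anything under `<base>/`.
fn prefix_includes(prefix: &str, key: &str) -> bool {
    match prefix.strip_suffix("/**") {
        Some(base) => {
            key == base
                || key
                    .strip_prefix(base)
                    .map_or(false, |rest| rest.starts_with('/'))
        }
        // Not a `<base>/**` prefix: this sketch falls back to exact equality.
        None => prefix == key,
    }
}

/// Returns the first configured prefix that includes `key`, if any.
/// `None` means "take the existing propagation path" (including the case
/// where no prefixes are configured at all).
fn fold_target<'a>(prefixes: &'a [&'a str], key: &str) -> Option<&'a str> {
    prefixes.iter().copied().find(|p| prefix_includes(p, key))
}
```

With `["example/**"]` configured, `fold_target` maps `example/a/b` to the `example/**` aggregate and leaves `other/a` on the normal path; with an empty list every key takes the normal path.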

A few design notes (in case they're useful for the broader discussion)

The aggregate Resources are created once when the gateway is built and the fold path just looks them
up — creating a Resource needs the whole Tables, which isn't reachable from inside a single HAT, and
doing it once means each aggregate's match-set is wired exactly once, so reconnect/churn can't pile up
duplicate cross-links.

The aggregate queryable is advertised complete=false (presence, not authority), so BestMatching falls
through to the real per-key queryable — a genuinely-complete source is never shadowed and no complete
state can go stale across owner churn.

For target=AllComplete, a route entry that is non-complete, whose matched resource covers the query, and
that points at a router is treated as a transparent forwarder (a small in-process flag on the query route
entry, set only at the router-net fold site; no wire change). AllComplete passes through it so the next
router re-applies the filter against its real children; BestMatching is untouched.
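The AllComplete rule can be pictured with a small predicate. This is only a sketch under assumptions: `RouteEntry` and its field names are hypothetical stand-ins for the in-process query route entry, not the types used in the PR.

```rust
/// Hypothetical stand-in for an in-process query route entry.
#[derive(Clone, Copy, Debug)]
struct RouteEntry {
    /// The queryable behind this entry was declared complete.
    complete: bool,
    /// The matched resource covers the query's key expression.
    covers_query: bool,
    /// The entry's destination is a router (not a client or peer session).
    points_at_router: bool,
    /// In-process flag, assumed to be set only where the fold happens.
    transparent_forwarder: bool,
}

/// Decides whether a query with target=AllComplete is sent along `entry`.
/// A complete entry is admitted as usual. A non-complete entry is admitted
/// only when it is a flagged aggregate that covers the query and points at a
/// router, so the next router can re-apply the filter to its real children.
fn all_complete_admits(entry: &RouteEntry) -> bool {
    entry.complete
        || (entry.transparent_forwarder && entry.covers_query && entry.points_at_router)
}
```

An ordinary non-complete queryable is still filtered out, which corresponds to the negative control mentioned in the tests.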

It's northbound-only, the fold goes through a refcounted ledger, and route caches are invalidated on fold
and teardown. Suspicious prefixes (a bare ** root, the @/ admin-space, duplicates, mutual inclusion)
get a startup warning.
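To make the refcounted-ledger idea concrete, here is a minimal sketch. `FoldLedger` and `UpstreamAction` are hypothetical names chosen for illustration; the actual ledger in the PR may track more state (for example per-face or per-key detail) and also has to invalidate route caches, which this sketch leaves out.

```rust
use std::collections::HashMap;

/// What the forwarding router would send upstream as a result of a change.
#[derive(Debug, PartialEq, Eq)]
enum UpstreamAction {
    DeclareAggregate,
    UndeclareAggregate,
    Nothing,
}

/// Hypothetical ledger: counts folded children per configured prefix.
#[derive(Default)]
struct FoldLedger {
    refs: HashMap<String, usize>,
}

impl FoldLedger {
    /// Records one folded child. Only the first child under a prefix
    /// causes the aggregate to be declared upstream.
    fn fold(&mut self, prefix: &str) -> UpstreamAction {
        let count = self.refs.entry(prefix.to_string()).or_insert(0);
        *count += 1;
        if *count == 1 {
            UpstreamAction::DeclareAggregate
        } else {
            UpstreamAction::Nothing
        }
    }

    /// Removes one folded child. Only the last child under a prefix
    /// causes the aggregate to be withdrawn upstream.
    fn unfold(&mut self, prefix: &str) -> UpstreamAction {
        let Some(count) = self.refs.get_mut(prefix) else {
            return UpstreamAction::Nothing; // unknown prefix: nothing to undo
        };
        *count -= 1;
        if *count == 0 {
            self.refs.remove(prefix);
            return UpstreamAction::UndeclareAggregate;
        }
        UpstreamAction::Nothing
    }
}
```

Folding K children under one prefix yields exactly one `DeclareAggregate`, and unfolding all K yields exactly one `UndeclareAggregate`, which is the property the deterministic in-process test checks.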

Trade-offs

At the upstream node, per-key ACL / QoS / interceptors and admin-space enumeration see only the
${prefix} aggregate (so per-key policy belongs on the forwarding router); the wildcard aggregate can
forward data toward the forwarding router for unsubscribed keys; and liveliness tokens are intentionally
not folded (a liveliness sample's key is the token's key, so a folded ${prefix}/** token couldn't
enumerate per-key presence or signal a per-key loss). These are noted alongside the config in
DEFAULT_CONFIG.json5 and the schema docs.

Numbers

From an in-process loopback-TCP benchmark:

- Upstream routing-table cardinality drops from O(N·K) to O(N), e.g. 5000 → 100 entries at N=100, K=50.
  This is exact by construction (the fold produces one aggregate per branch), independent of RAM.
- Roughly 6–9× less whole-process RSS at the upstream node (K=50). This is a process-level measurement and a
  secondary signal; the A/B delta is dominated by the cardinality collapse, which is the load-bearing
  result.
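The cardinality figure is plain arithmetic. The sketch below assumes, as stated above, that the fold produces one aggregate per branch; `upstream_entries` is a hypothetical helper, not part of the benchmark.

```rust
/// Number of routing-table Resources an upstream router holds for
/// `branches` downstream branches that each forward `keys_per_branch` keys.
/// Assumes one aggregate per branch when aggregation is enabled.
fn upstream_entries(branches: u64, keys_per_branch: u64, aggregated: bool) -> u64 {
    if aggregated {
        branches // one aggregate per branch: O(N)
    } else {
        branches * keys_per_branch // one Resource per forwarded key: O(N*K)
    }
}
```

At N=100 and K=50 this gives 5000 entries without aggregation and 100 with it.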

What changed

It's roughly 480 lines of routing code plus config and docs, with a similar amount of tests, all behind
the empty-config fast path. The bulk lives in the router HAT (the per-prefix fold/suppress/teardown for
subscribers and queryables). Supporting pieces add the config and its docs, the pre-created aggregates and
the prefix validation in the dispatcher tables, the transparent-forwarder flag on the query route entry
(threaded through the other HATs' query paths), and the tests.

Tests

- Deterministic, in-process (MockFace, no sleeps): K downstream subscribers collapse to exactly one
  aggregate on the upstream face, and undeclaring all of them withdraws exactly one aggregate.
- Real loopback-TCP integration: subscriber/queryable collapse K→1 plus delivery and teardown; wildcard
  get fans out to all children; a missing-key get returns empty without hanging; target=AllComplete
  reaches a complete child through the aggregate while a non-complete child stays empty (with a negative
  control); two branches under the same prefix don't shadow each other; cross-mesh propagation; plus an
  ignored scale bench.
- The existing unit / regions / queryable / matching / acl / adminspace / qos suites stay
  green, clippy --deny warnings is clean, and there's no behaviour change when the config is unset.
  (Also built + run on aarch64 with the same results.)

Compatibility

No protocol/wire change; additive opt-in config (defaults empty); MSRV (1.75) clean; built on the existing
regions/gateway routing model.


Extend config-driven aggregation to a router's northbound forwarding.

When a north-bound router HAT forwards a downstream subscriber/queryable whose key-expression is included by a configured `aggregation.upstream.{subscribers,queryables}` prefix, the per-key children are folded into a single `${prefix}` declaration toward the upstream and suppressed there, while staying registered in the source region so downward routing is unchanged. An upstream router then holds one routing Resource per configured prefix instead of one per forwarded key.

The stock `aggregation.subscribers`/`publishers` only collapses a session's own declarations at the session boundary; this collapses what a router forwards upstream on behalf of the sessions below it -- the cost that grows as O(N*K) (N downstream branches, K keys each) on every upstream router in a mesh.

Design:
- No new wire type: the aggregate is an ordinary DeclareSubscriber/DeclareQueryable, so mesh propagation, matching and admin are unchanged.
- Opt-in: an empty `aggregation.upstream` takes the existing propagation path (no-op fast path).
- Aggregate Resources are pre-created at gateway build; the fold path only looks them up (a Resource needs the full Tables, unreachable from inside a single HAT), so each aggregate's match-set is wired once and reconnect/churn cannot accumulate duplicate cross-links.
- The aggregate queryable is advertised complete=false, so BestMatching falls through to the real per-key queryable: a genuinely-complete source is never shadowed and no completeness state can go stale across owner churn.
- target=AllComplete reaches complete children behind the aggregate via a transparent-forwarder flag on the in-process query route entry (set only at the router-net fold site, no wire change); BestMatching is untouched.
- Liveliness tokens are intentionally not folded (a liveliness sample's key is the token's own key, so a folded wildcard token could neither enumerate the live set nor signal a per-key removal).
- A startup check warns on suspicious prefixes (a bare `**` root, the `@/` admin-space, duplicates, mutually-including prefixes).

Trade-offs (documented in DEFAULT_CONFIG.json5 and the config schema): at the upstream node, per-key ACL/QoS-overwrite/interceptors and admin-space enumeration see only the `${prefix}` aggregate, and the wildcard aggregate can forward data toward the forwarding router for unsubscribed keys.

Tests: a deterministic in-process (MockFace) fold/teardown test, plus real loopback-TCP integration tests covering subscriber/queryable collapse, delivery and teardown, wildcard get fan-out, missing-key empty (no hang), target=AllComplete reaching a complete child through the aggregate with a non-complete negative control, no cross-branch shadowing, and cross-mesh propagation.

MSRV 1.75; clippy --deny warnings clean; no behaviour change when the config is unset.

Signed-off-by: yifei.ma <yifeima98@gmail.com>

BOURBONCASK wants to merge 1 commit into eclipse-zenoh:main from BOURBONCASK:feature/upstream-northbound-agg