
[HLD] Monitor Link Group #2308

Open
srodd-nexthop wants to merge 3 commits into sonic-net:master from nexthop-ai:srodd.monitor-link-hld

srodd-nexthop commented Apr 27, 2026

What

Add HLD for the Monitor Link Group feature — link state tracking that automatically brings downlink interfaces admin-down when the number of operational uplinks in a group falls below a configurable threshold (min-uplinks). Prevents traffic black-holing in topologies where downlinks depend on upstream connectivity.

Why

In multi-homed server, aggregation, and border leaf deployments, servers connected to downlinks continue forwarding traffic even after uplinks fail. Monitor Link Group signals that failure to servers by taking the downlinks down, triggering failover to a redundant path.

Design summary

  • MonitorLinkGroupMgr: new Orch running as a sibling inside intfmgrd. Subscribes to CONFIG_DB:MONITOR_LINK_GROUP and STATE_DB:PORT_TABLE/LAG_TABLE. Owns the group state machine (DOWN / PENDING / UP).
  • State machine: group transitions UP when uplink_up_count >= min-uplinks. Optional link-up-delay holds the group in PENDING before transitioning UP, preventing churn from uplink flaps. No delay is applied at startup.
  • Enforcement split: MonitorLinkGroupMgr writes allow_up / force_down per-interface to STATE_DB:MONITOR_LINK_GROUP_MEMBER. PortMgr (portmgrd) and TeamMgr (teammgrd) subscribe and merge this with the configured admin_status before applying the effective state to the interface. MonitorLinkGroupMgr never calls ip link set directly.
  • Multi-group membership: an interface can be an uplink in one group and a downlink in another simultaneously. force_down wins if any group forces it down.
  • Supports both Ethernet and PortChannel interfaces.
  • YANG model in sonic-monitor-link-group.yang with leafref union for interface validation.
  • show monitor-link CLI reads STATE_DB:MONITOR_LINK_GROUP_STATE per group.
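The group state machine in the design summary can be sketched roughly as follows. This is a minimal illustration, not the MonitorLinkGroupMgr code: the class and field names are hypothetical, time is passed in explicitly instead of using a timer, and "no delay at startup" is assumed to mean that the first evaluation skips the PENDING hold.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GroupState(Enum):
    DOWN = "DOWN"
    PENDING = "PENDING"
    UP = "UP"


@dataclass
class MonitorLinkGroup:
    """Hypothetical model of one group; not the real daemon class."""
    min_uplinks: int
    link_up_delay: float = 0.0            # seconds; 0 disables the hold
    state: GroupState = GroupState.DOWN
    pending_since: Optional[float] = None
    started: bool = False                 # False until the first evaluation

    def evaluate(self, uplink_up_count: int, now: float) -> GroupState:
        startup = not self.started
        self.started = True
        if uplink_up_count < self.min_uplinks:
            # Below threshold: go (or stay) DOWN and cancel any pending hold.
            self.state = GroupState.DOWN
            self.pending_since = None
        elif self.state is GroupState.DOWN:
            if startup or self.link_up_delay <= 0:
                # Assumption: the first evaluation applies no delay.
                self.state = GroupState.UP
            else:
                self.state = GroupState.PENDING
                self.pending_since = now
        elif self.state is GroupState.PENDING:
            if now - self.pending_since >= self.link_up_delay:
                self.state = GroupState.UP
                self.pending_since = None
        return self.state
```

With `min_uplinks=2` and `link_up_delay=5.0`, an uplink flap after startup moves the group to DOWN, then to PENDING once two uplinks are back, and to UP only after the threshold has held for five seconds.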

Repositories changed

| Repo | Change |
|---|---|
| sonic-swss-common | STATE_MONITOR_LINK_GROUP_STATE_TABLE_NAME, STATE_MONITOR_LINK_GROUP_MEMBER_TABLE_NAME macros in schema.h |
| sonic-swss | MonitorLinkGroupMgr (new), intfmgrd, PortMgr, TeamMgr |
| sonic-yang-models | sonic-monitor-link-group.yang |
| sonic-utilities | show monitor-link CLI |

Yang Changes: sonic-net/sonic-buildimage#27004
swss: sonic-net/sonic-swss#4523
Show: sonic-net/sonic-utilities#4497
swss-common: sonic-net/sonic-swss-common#1181
sonic-mgmt tests: sonic-net/sonic-mgmt#24555

@mssonicbld (Collaborator)

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

srodd-nexthop force-pushed the srodd.monitor-link-hld branch from 849b32e to 820ac15 on April 28, 2026 06:56
srodd-nexthop force-pushed the srodd.monitor-link-hld branch from 342260c to 088ce64 on April 28, 2026 07:55
High-level design document for the Monitor Link Group feature.
Covers architecture, DB schema, state machine, configuration,
warmboot/fastboot behavior, restrictions, and test cases.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
srodd-nexthop force-pushed the srodd.monitor-link-hld branch from 088ce64 to 77b25cd on April 28, 2026 08:00

**Producers:** PortSyncd / Kernel (Ethernet), TeamSyncd / Teamd (PortChannel)
**Consumer:** MonitorLinkGroupMgr

Collaborator

If the port/LAG is down due to Monitor Link Group, it's worth showing that in the output of the show interface status command.

Author

srodd-nexthop, May 13, 2026

Addressed at §9.1 ("show interface status / show interface description admin column"). For interfaces MLG is holding down, the Admin column renders error-down (mlg) instead of plain down, naming the source explicitly so operators can tell policy-driven hold from user shutdown.
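As a rough illustration of the rendering described in this reply, the Admin column could be derived as below. The function name and arguments are hypothetical and this is not the sonic-utilities code. The sketch also assumes that an explicit user shutdown is still shown as plain down, which the reply does not state.

```python
def admin_column(configured_admin_status: str, held_by_mlg: bool) -> str:
    """Return the Admin column text for one interface row (illustrative only)."""
    if configured_admin_status != "up":
        # Assumption: a user shutdown is shown as plain "down",
        # even if a group would also hold the interface.
        return "down"
    if held_by_mlg:
        # Policy-driven hold: name the source so it is not mistaken
        # for a user shutdown.
        return "error-down (mlg)"
    return "up"
```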

Comment thread doc/monitor_link/monitor-link_HLD.md Outdated
type leafref { path /lag:sonic-portchannel/lag:PORTCHANNEL/lag:PORTCHANNEL_LIST/lag:name; }
type string { pattern ""; }
}
ordered-by user;
Collaborator

Why do we need this ordered-by?

Author

You're right — there's no functional dependency on insertion order; we just count entries. Dropped ordered-by user from both leaf-lists (commit 58f6a42).

end

MonitorLinkGroupMgr[MonitorLinkGroupMgr<br/>intfmgrd]
PortMgr[PortMgr<br/>portmgrd]
Collaborator

From the intfmgr/portmgr perspective, it would be good to clarify whether the Monitor Link Group determines the final downlink (owner port/LAG) status that is ultimately conveyed to the ASIC.

Author

MLG does not determine the final ASIC state. MonitorLinkGroupMgr publishes the desired hold state to STATE_DB (MONITOR_LINK_GROUP_MEMBER); the interface/port-channel owners (PortMgr / TeamMgr) subscribe and decide what to apply downstream. MLG is advisory from the ASIC's perspective — the final admin state stays owned by the existing port/LAG management daemons.
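The merge that PortMgr / TeamMgr perform could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, the per-group hold values are taken to be the strings allow_up and force_down from the PR description, and the configured admin_status is assumed to be "up" or "down".

```python
from typing import Iterable


def effective_admin_status(configured_admin_status: str,
                           group_holds: Iterable[str]) -> str:
    """Combine the configured admin_status with the per-group hold states.

    group_holds carries one value per group the interface belongs to,
    each either "allow_up" or "force_down" (assumed encoding).
    """
    if configured_admin_status != "up":
        # The operator's shutdown always stands; MLG never raises an interface.
        return "down"
    if any(hold == "force_down" for hold in group_holds):
        # force_down wins if any group forces the interface down.
        return "down"
    return "up"
```

An interface with no group membership passes an empty list and keeps its configured state.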

// "set-to-empty instead of delete" convention
leaf-list uplinks {
type union {
type leafref { path /port:sonic-port/port:PORT/port:PORT_LIST/port:name; }
Collaborator

Is a port allowed here if it is already a member of a port-channel?

Author

Yes, those are accepted. The schema doesn't check whether a physical port is also a port-channel member; the operator decides. Noted in §10 (By-design restrictions, item 6) — commit 58f6a42.

…ing, cycle detection, and show CLI

Major updates to reflect the in-flight implementation:

  R-4: rename uplinks/downlinks to monitored-links/managed-links throughout
       the document (definitions, schema tables, JSON examples, state machine,
       sequence diagrams, ASCII topology, requirements list, restrictions,
       and the YANG block).

  R-6: add a 'Dependency-cycle rejection' subsection under multi-group/cross-role
       support. Describes the directed dependency graph the daemon builds at SET
       time, the strongly-connected-component check, and the observable signal
       (no STATE_DB entry plus SWSS_LOG_ERROR) for cycle-forming configurations.

  R-7: drop the empty-string defaults on the monitored-links and managed-links
       leaf-lists. Updated YANG block reflects the cleaner schema.

  R-10: add the second YANG 'must' constraint bounding min-monitored-links by
        count(monitored-links). Restrictions section updated accordingly.

  PR-A: extend the STATE_DB schema table with last_state_change_{from,to,time},
        pending_start_time (set on entry to PENDING, cleared on entry to UP),
        and total_transitions. All transition-tracking fields are optional so
        legacy consumers ignore them safely.

  PR-B: show CLI sample now renders 'Last change:', 'Transitions:', and
        '(elapsed: Xs, remaining: Ys)' for PENDING groups. Documented field
        semantics and the OVERDUE fallback when the timer overshoots.

  PR-C: new paragraph documenting 'error-down (mlg)' rendering in
        'show interface status' and 'show interface description' for
        MLG-held managed interfaces, plus the per-source tag convention.

Section 11 Testing Requirements rewritten as a structured plan (unit tests +
system tests + negative tests) referencing the parallel sonic-mgmt PR
sonic-net/sonic-mgmt#24555, using Step/Goal/Expected-results tables aligned
with the Overlay-ECMP HLD format.

Revision bumped to 0.3.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
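The dependency-cycle rejection named in R-6 can be illustrated with a small strongly-connected-component check. This is a sketch, not the daemon's implementation: the graph shape is an assumption (one node per interface, one edge from each monitored link to each managed link of the same group), and all function names are hypothetical. It uses Tarjan's algorithm as one common way to compute SCCs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical group shape: (monitored_links, managed_links)
Group = Tuple[Iterable[str], Iterable[str]]


def build_graph(groups: Dict[str, Group]) -> Dict[str, Set[str]]:
    """Add an edge monitored -> managed for every pair within a group."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for monitored, managed in groups.values():
        managed = list(managed)
        for src in monitored:
            for dst in managed:
                graph[src].add(dst)
                graph[dst]  # touch the key so every node is present
    return graph


def strongly_connected_components(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Tarjan's algorithm (recursive; fine for small configurations)."""
    index: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    sccs: List[List[str]] = []

    def visit(node: str) -> None:
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            sccs.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return sccs


def forms_cycle(groups: Dict[str, Group]) -> bool:
    """True if the configuration contains a dependency cycle."""
    graph = build_graph(groups)
    for component in strongly_connected_components(graph):
        if len(component) > 1:
            return True
        node = component[0]
        if node in graph[node]:  # self-loop: monitored and managed in one group
            return True
    return False
```

A chain such as Ethernet0 -> Ethernet4 -> Ethernet8 passes the check; adding a group that monitors Ethernet8 and manages Ethernet0 closes the loop and would be rejected.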
…rced

Addresses two review comments from @venkatmahalingam:

- Line 527: dropped 'ordered-by user' from monitored-links and
  managed-links leaf-lists. No functional dependency on insertion
  order; default ordered-by system is sufficient.

- Line 524: added a by-design restriction in section 10 stating that
  the schema does not enforce mutual exclusion between physical port
  membership in monitored-links / managed-links and port-channel
  membership. Operator responsibility.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>