[HLD] Monitor Link Group#2308
/azp run
No pipelines are associated with this pull request.
849b32e to 820ac15
342260c to 088ce64
High-level design document for the Monitor Link Group feature. Covers architecture, DB schema, state machine, configuration, warmboot/fastboot behavior, restrictions, and test cases.
Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
088ce64 to 77b25cd
**Producers:** PortSyncd / Kernel (Ethernet), TeamSyncd / Teamd (PortChannel)
**Consumer:** MonitorLinkGroupMgr
If the port/LAG is down because of Monitor Link Group, it is worth indicating that in the output of the interface status show command.
Addressed at §9.1 ("show interface status / show interface description admin column"). For interfaces MLG is holding down, the Admin column renders error-down (mlg) instead of plain down, naming the source explicitly so operators can tell policy-driven hold from user shutdown.
      type leafref { path /lag:sonic-portchannel/lag:PORTCHANNEL/lag:PORTCHANNEL_LIST/lag:name; }
      type string { pattern ""; }
    }
    ordered-by user;
Why do we need this ordered-by?
You're right — there's no functional dependency on insertion order; we just count entries. Dropped ordered-by user from both leaf-lists (commit 58f6a42).
    end

    MonitorLinkGroupMgr[MonitorLinkGroupMgr<br/>intfmgrd]
    PortMgr[PortMgr<br/>portmgrd]
From the intf mgr/port-mgr perspective, it may be worth clarifying whether the Monitor Link Group determines the final downlink (owner port/LAG) status that is ultimately conveyed to the ASIC.
MLG does not determine the final ASIC state. MonitorLinkGroupMgr publishes the desired hold state to STATE_DB (MONITOR_LINK_GROUP_MEMBER); the interface/port-channel owners (PortMgr / TeamMgr) subscribe and decide what to apply downstream. MLG is advisory from the ASIC's perspective — the final admin state stays owned by the existing port/LAG management daemons.
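The split of responsibility described in this reply can be sketched roughly as below. This is only an illustration, not the PortMgr/TeamMgr code: the function name `effective_admin_status` and the dict of per-group verdicts are hypothetical. The `allow_up` / `force_down` values and the rule that `force_down` wins if any group forces the interface down are taken from the PR description.

```python
def effective_admin_status(configured_admin_status, group_verdicts):
    """Merge the configured admin_status with per-group MLG verdicts.

    configured_admin_status: "up" or "down", as configured by the user.
    group_verdicts: hypothetical mapping of group name -> "allow_up" or
        "force_down", standing in for the MONITOR_LINK_GROUP_MEMBER data.
    """
    # A user shutdown always wins: MLG never raises an admin-down port.
    if configured_admin_status != "up":
        return "down"
    # force_down from any single group holds the interface down.
    if any(verdict == "force_down" for verdict in group_verdicts.values()):
        return "down"
    return "up"
```

The owner daemon stays the only writer of the final admin state; the MLG verdict is just one more input to that decision.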
    // "set-to-empty instead of delete" convention
    leaf-list uplinks {
      type union {
        type leafref { path /port:sonic-port/port:PORT/port:PORT_LIST/port:name; }
Do we allow a port that is part of a port-channel?
Yes, those are accepted. The schema doesn't check whether a physical port is also a port-channel member; the operator decides. Noted in §10 (By-design restrictions, item 6) — commit 58f6a42.
…ing, cycle detection, and show CLI
Major updates to reflect the in-flight implementation:
R-4: rename uplinks/downlinks to monitored-links/managed-links throughout
the document (definitions, schema tables, JSON examples, state machine,
sequence diagrams, ASCII topology, requirements list, restrictions,
and the YANG block).
R-6: add a 'Dependency-cycle rejection' subsection under multi-group/cross-role
support. Describes the directed dependency graph the daemon builds at SET
time, the strongly-connected-component check, and the observable signal
(no STATE_DB entry plus SWSS_LOG_ERROR) for cycle-forming configurations.
R-7: drop the empty-string defaults on the monitored-links and managed-links
leaf-lists. Updated YANG block reflects the cleaner schema.
R-10: add the second YANG 'must' constraint bounding min-monitored-links by
count(monitored-links). Restrictions section updated accordingly.
PR-A: extend the STATE_DB schema table with last_state_change_{from,to,time},
pending_start_time (set on entry to PENDING, cleared on entry to UP),
and total_transitions. All transition-tracking fields are optional so
legacy consumers ignore them safely.
PR-B: show CLI sample now renders 'Last change:', 'Transitions:', and
'(elapsed: Xs, remaining: Ys)' for PENDING groups. Documented field
semantics and the OVERDUE fallback when the timer overshoots.
PR-C: new paragraph documenting 'error-down (mlg)' rendering in
'show interface status' and 'show interface description' for
MLG-held managed interfaces, plus the per-source tag convention.
Section 11 Testing Requirements rewritten as a structured plan (unit tests +
system tests + negative tests) referencing the parallel sonic-mgmt PR
sonic-net/sonic-mgmt#24555, using Step/Goal/Expected-results tables aligned
with the Overlay-ECMP HLD format.
Revision bumped to 0.3.
Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
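The dependency-cycle rejection mentioned under R-6 can be illustrated with a small sketch. Several things here are assumptions rather than the daemon's actual logic: the edge model (one edge from every monitored link to every managed link of the same group), the function name `has_cycle`, and the input shape are all hypothetical, and the sketch uses a plain depth-first search for a back edge instead of the strongly-connected-component check the HLD describes.

```python
def has_cycle(groups):
    """Return True if the group configuration forms a dependency cycle.

    groups: hypothetical mapping of group name ->
        (monitored_links, managed_links), each a list of interface names.
    """
    # Assumed edge model: monitored link -> managed link within a group.
    graph = {}
    for monitored, managed in groups.values():
        for src in monitored:
            graph.setdefault(src, set()).update(managed)
        for dst in managed:
            graph.setdefault(dst, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: nxt is still on the stack
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)
```

Under this model a chain such as Ethernet0 -> Ethernet4 -> Ethernet8 across two groups is accepted, while two groups that monitor each other's managed link are rejected.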
…rced
Addresses two review comments from @venkatmahalingam:
- Line 527: dropped 'ordered-by user' from monitored-links and managed-links leaf-lists. No functional dependency on insertion order; default ordered-by system is sufficient.
- Line 524: added a by-design restriction in section 10 stating that the schema does not enforce mutual exclusion between physical port membership in monitored-links / managed-links and port-channel membership. Operator responsibility.
Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
What
Add HLD for the Monitor Link Group feature: link state tracking that automatically brings downlink interfaces admin-down when the number of operational uplinks in a group falls below a configurable threshold (min-uplinks). Prevents traffic black-holing in topologies where downlinks depend on upstream connectivity.
Why
In multi-homed server, aggregation, and border leaf deployments, servers connected to downlinks continue forwarding traffic even after uplinks fail. Monitor Link Group signals that failure to servers by taking the downlinks down, triggering failover to a redundant path.
Design summary
- Orch running as a sibling inside intfmgrd. Subscribes to CONFIG_DB:MONITOR_LINK_GROUP and STATE_DB:PORT_TABLE / LAG_TABLE. Owns the group state machine (DOWN / PENDING / UP).
- uplink_up_count >= min-uplinks. Optional link-up-delay holds the group in PENDING before transitioning UP, preventing churn from uplink flaps. No delay is applied at startup.
- allow_up / force_down per-interface to STATE_DB:MONITOR_LINK_GROUP_MEMBER. PortMgr (portmgrd) and TeamMgr (teammgrd) subscribe and merge this with the configured admin_status before applying the effective state to the interface. MonitorLinkGroupMgr never calls ip link set directly.
- force_down wins if any group forces it down.
- sonic-monitor-link-group.yang with leafref union for interface validation.
- show monitor-link CLI reads STATE_DB:MONITOR_LINK_GROUP_STATE per group.
Repositories changed
- STATE_MONITOR_LINK_GROUP_STATE_TABLE_NAME, STATE_MONITOR_LINK_GROUP_MEMBER_TABLE_NAME macros in schema.h
- MonitorLinkGroupMgr (new), intfmgrd, PortMgr, TeamMgr
- sonic-monitor-link-group.yang
- show monitor-link CLI
Yang Changes: sonic-net/sonic-buildimage#27004
swss: sonic-net/sonic-swss#4523
Show: sonic-net/sonic-utilities#4497
swss-common: sonic-net/sonic-swss-common#1181
sonic-mgmt tests: sonic-net/sonic-mgmt#24555
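The group state machine from the design summary (DOWN / PENDING / UP, with the min-uplinks threshold and the optional link-up-delay) can be sketched as below. This is a simplified model under stated assumptions, not the MonitorLinkGroupMgr implementation: the class name `GroupState`, the polling-style `update` method, and the explicit `now` and `startup` arguments are hypothetical, and the real daemon presumably uses a timer rather than caller-supplied timestamps.

```python
DOWN, PENDING, UP = "DOWN", "PENDING", "UP"


class GroupState:
    """Hypothetical model of one monitor link group."""

    def __init__(self, min_uplinks, link_up_delay=0.0):
        self.min_uplinks = min_uplinks
        self.link_up_delay = link_up_delay  # seconds; 0 disables PENDING
        self.state = DOWN
        self.pending_since = None

    def update(self, uplink_up_count, now, startup=False):
        """Re-evaluate the group for the current number of up uplinks."""
        if uplink_up_count < self.min_uplinks:
            # Threshold not met: drop to DOWN and cancel any pending hold.
            self.state, self.pending_since = DOWN, None
        elif self.state == DOWN:
            if startup or self.link_up_delay <= 0:
                # No delay is applied at startup or when none is configured.
                self.state = UP
            else:
                self.state, self.pending_since = PENDING, now
        elif (self.state == PENDING
              and now - self.pending_since >= self.link_up_delay):
            self.state, self.pending_since = UP, None
        return self.state
```

For example, with `min_uplinks=2` and `link_up_delay=5.0`, a group that recovers its second uplink at t=2 stays in PENDING until t=7, and an uplink flap during that window returns it to DOWN and restarts the hold on the next recovery.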