Skip to content

fix(gcu): stop injecting phantom admin_status moves in patch sorter#4546

Open
rookie-who wants to merge 1 commit into
sonic-net:masterfrom
rookie-who:fix/gcu-sorter-no-phantom-admin-status
Open

fix(gcu): stop injecting phantom admin_status moves in patch sorter#4546
rookie-who wants to merge 1 commit into
sonic-net:masterfrom
rookie-who:fix/gcu-sorter-no-phantom-admin-status

Conversation

@rookie-who
Copy link
Copy Markdown
Contributor

Problem Statement

RequiredValueMoveExtender in the GCU patch sorter injects admin_status=down moves even when the original patch does not include any admin_status changes. This causes the sorter to produce operations the caller never requested.

How we got here

While investigating flaky configlet.test_add_rack.test_add_rack failures on Cisco-8102-C64 and Mellanox SN4600C/SN4700 platforms (build 20251110.26), we traced the root cause to the patch sorter producing a phantom admin_status round-trip:

  1. The test generates a removal patch that removes BGP, buffer, queue, and PFC config for a T0 neighbor, but does not include admin_status=down for the decommissioned port
  2. The sorter's RequiredValueMoveExtender injects admin_status=down because YANG requires the port to be down before removing buffer/queue config
  3. After all buffer/queue removals complete, LowLevelMoveGenerator sees that the current state (admin_status=down) differs from the target (admin_status=up, since the patch didn't touch it) and generates a restore admin_status=up
  4. CONFIG_DB ends with admin_status=up for a port with no operational config

A manual config interface shutdown workaround ran after the patch, but on Cisco/Mellanox platforms with dynamic buffer management, buffermgrd raced with the manual shutdown and reset port state, causing the DB comparison to fail.

Design discussion

We considered several options:

Option A — Filter out-of-scope moves after sorting: Remove any sorted move that touches paths not in the original patch. Problem: the intermediate admin_status=down is required for YANG validation of buffer removals — can't simply filter it out.

Option B — Caller responsibility: Don't change the sorter. Require callers to always produce YANG-complete patches. If you're removing buffers, include admin_status=down.

Chosen approach — Hybrid: Stop the sorter from injecting phantom operations when the patch doesn't touch the required path, while preserving the ordering protection when the patch does include admin_status changes. If a patch is YANG-incomplete (removes buffers without setting admin_status=down), the sorter will report "no valid ordering" instead of silently inventing operations.

The principle: GCU should be a faithful executor — it sorts and applies what the patch specifies, nothing more. If YANG requires admin_status=down before removing buffer config, the patch itself must include that operation.

Historical context

RequiredValueMoveExtender was added in PR #1998 (Feb 2022, fixing issue #1930) for the AddRack scenario. In AddRack, the patch explicitly includes admin_status=up (adding a new port). The extender ensures:

  • Port-critical config (buffers, queues) is applied while the port is admin down
  • admin_status=up is applied last, after all critical config is in place

This injection was correct for AddRack (patch includes admin_status), but also fired in RemoveRack (patch has no admin_status op), creating the phantom down/up round-trip.

Fix

Modified RequiredValueMoveExtender.extend() to only inject admin_status=down moves when the original patch actually modifies admin_status (current != target for the required path). If current == target for admin_status, the patch didn't touch it and no injection occurs.

Preserved: The flip protection behavior — when the patch DOES include admin_status changes, the extender still prevents premature admin_status=up while port-critical changes remain. This ensures the AddRack ordering guarantee is maintained.

Companion PR

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

RequiredValueMoveExtender was injecting admin_status=down moves even
when the original patch did not include any admin_status changes.
This caused the sorter to:

1. Inject admin_status=down (for YANG dependency ordering)
2. Restore admin_status=up at the end (to match target config)

The restore created a phantom operation the caller never requested,
leaving the port in an unexpected state. On platforms with dynamic
buffer management (Cisco, Mellanox), this raced with buffermgrd and
caused flaky test failures (configlet.test_add_rack.test_add_rack).

The fix: only inject admin_status=down when the original patch
actually modifies admin_status (i.e., current != target for that
path). If the patch does not touch admin_status, the extender no
longer injects — the patch is considered incomplete and the sorter
will report 'no valid ordering', requiring the caller to provide a
YANG-complete patch.

The flip protection behavior is preserved: when the patch DOES
include admin_status changes, the extender still prevents premature
admin_status=up while port-critical config changes remain.

Background: RequiredValueMoveExtender was added in PR sonic-net#1998 (Feb 2022,
fixing issue sonic-net#1930) for the AddRack scenario where the patch includes
admin_status=up. The injection was needed to bring the port down
before applying critical config, then bring it back up at the end.
However, the injection also fired in the RemoveRack scenario where
the patch has no admin_status operation, creating the phantom
down/up round-trip.

Signed-off-by: rookie-who <rookie-who@users.noreply.github.com>
@rookie-who rookie-who force-pushed the fix/gcu-sorter-no-phantom-admin-status branch from 281c989 to e4fbdc4 Compare May 15, 2026 09:51
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants