fix(gcu): stop injecting phantom admin_status moves in patch sorter#4546
Open
rookie-who wants to merge 1 commit into
Open
fix(gcu): stop injecting phantom admin_status moves in patch sorter#4546rookie-who wants to merge 1 commit into
rookie-who wants to merge 1 commit into
Conversation
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
RequiredValueMoveExtender was injecting admin_status=down moves even when the original patch did not include any admin_status changes. This caused the sorter to: 1. Inject admin_status=down (for YANG dependency ordering) 2. Restore admin_status=up at the end (to match target config) The restore created a phantom operation the caller never requested, leaving the port in an unexpected state. On platforms with dynamic buffer management (Cisco, Mellanox), this raced with buffermgrd and caused flaky test failures (configlet.test_add_rack.test_add_rack). The fix: only inject admin_status=down when the original patch actually modifies admin_status (i.e., current != target for that path). If the patch does not touch admin_status, the extender no longer injects — the patch is considered incomplete and the sorter will report 'no valid ordering', requiring the caller to provide a YANG-complete patch. The flip protection behavior is preserved: when the patch DOES include admin_status changes, the extender still prevents premature admin_status=up while port-critical config changes remain. Background: RequiredValueMoveExtender was added in PR sonic-net#1998 (Feb 2022, fixing issue sonic-net#1930) for the AddRack scenario where the patch includes admin_status=up. The injection was needed to bring the port down before applying critical config, then bring it back up at the end. However, the injection also fired in the RemoveRack scenario where the patch has no admin_status operation, creating the phantom down/up round-trip. Signed-off-by: rookie-who <rookie-who@users.noreply.github.com>
281c989 to
e4fbdc4
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem Statement
RequiredValueMoveExtenderin the GCU patch sorter injectsadmin_status=downmoves even when the original patch does not include anyadmin_statuschanges. This causes the sorter to produce operations the caller never requested.How we got here
While investigating flaky
configlet.test_add_rack.test_add_rackfailures on Cisco-8102-C64 and Mellanox SN4600C/SN4700 platforms (build20251110.26), we traced the root cause to the patch sorter producing a phantomadmin_statusround-trip:admin_status=downfor the decommissioned portRequiredValueMoveExtenderinjectsadmin_status=downbecause YANG requires the port to be down before removing buffer/queue configLowLevelMoveGeneratorsees that the current state (admin_status=down) differs from the target (admin_status=up, since the patch didn't touch it) and generates a restoreadmin_status=upadmin_status=upfor a port with no operational configA manual
config interface shutdownworkaround ran after the patch, but on Cisco/Mellanox platforms with dynamic buffer management,buffermgrdraced with the manual shutdown and reset port state, causing the DB comparison to fail.Design discussion
We considered several options:
Option A — Filter out-of-scope moves after sorting: Remove any sorted move that touches paths not in the original patch. Problem: the intermediate
admin_status=downis required for YANG validation of buffer removals — can't simply filter it out.Option B — Caller responsibility: Don't change the sorter. Require callers to always produce YANG-complete patches. If you're removing buffers, include
admin_status=down.Chosen approach — Hybrid: Stop the sorter from injecting phantom operations when the patch doesn't touch the required path, while preserving the ordering protection when the patch does include admin_status changes. If a patch is YANG-incomplete (removes buffers without setting admin_status=down), the sorter will report "no valid ordering" instead of silently inventing operations.
The principle: GCU should be a faithful executor — it sorts and applies what the patch specifies, nothing more. If YANG requires
admin_status=downbefore removing buffer config, the patch itself must include that operation.Historical context
RequiredValueMoveExtenderwas added in PR #1998 (Feb 2022, fixing issue #1930) for the AddRack scenario. In AddRack, the patch explicitly includesadmin_status=up(adding a new port). The extender ensures:admin_status=upis applied last, after all critical config is in placeThis injection was correct for AddRack (patch includes admin_status), but also fired in RemoveRack (patch has no admin_status op), creating the phantom down/up round-trip.
Fix
Modified
RequiredValueMoveExtender.extend()to only injectadmin_status=downmoves when the original patch actually modifies admin_status (current != target for the required path). If current == target for admin_status, the patch didn't touch it and no injection occurs.Preserved: The flip protection behavior — when the patch DOES include admin_status changes, the extender still prevents premature
admin_status=upwhile port-critical changes remain. This ensures the AddRack ordering guarantee is maintained.Companion PR
test_add_racktest to includeadmin_status=downin the removal patch, making the patch YANG-complete