fix(configlet): include admin_status=down in removal patch for decommissioned ports#24658
Open
rookie-who wants to merge 1 commit into
Open
Conversation
…issioned ports The test_add_rack test was flaky on Cisco and Mellanox platforms because the GCU removal patch did not include an explicit admin_status=down operation for ports losing their T0 neighbor config. Without admin_status=down in the patch, GCU's patch sorter would: 1. Inject a temporary admin_status=down to satisfy YANG dependencies before removing buffer/queue config 2. Restore admin_status=up at the end (since the target config still had admin_status=up for the port) This left the port as admin_status=up in CONFIG_DB after the patch completed. A manual 'config interface shutdown' workaround ran after the patch, but on Cisco/Mellanox platforms with dynamic buffer management, buffermgrd would race with the manual shutdown and reset the port state back to the CONFIG_DB value (up), causing the DB comparison to fail. The fix addresses two issues: 1. Include admin_status=down in the removal patch for decommissioned ports, so GCU receives a complete patch with the correct final state. GCU should execute exactly what it is told -- it should not need to invent operations that the caller did not request. 2. Remove the manual 'config interface shutdown' workaround after the patch, since admin_status=down is now part of the patch itself and the workaround is no longer needed. Signed-off-by: rookie-who <rookie-who@users.noreply.github.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Contributor
|
@rimunagala please review from GCU point of view. @bingwang-ms please review from AddRack and AddCluster on L-DT2 point of view. |
Contributor
|
202511: #24657 |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates configlet generic patch generation for T0 removal so decommissioned ports explicitly get admin_status=down, and removes the post-removal manual shutdown workaround.
Changes:
- Adds logic to append
PORT/<port>/admin_status=downpatch operations for ports listed intor_data. - Removes the manual
config interface shutdownloop after applying the removal patch.
Comment on lines
+107
to
+117
| for port_name in decommissioned_ports: | ||
| path = "/PORT/{}/admin_status".format(port_name) | ||
| if path not in admin_status_paths: | ||
| # Only add if the port exists in source config with admin_status != down | ||
| src_port = src_json.get("PORT", {}).get(port_name, {}) | ||
| if src_port.get("admin_status", "up") != "down": | ||
| jpatch.append({ | ||
| "op": "replace", | ||
| "path": path, | ||
| "value": "down" | ||
| }) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem Statement
The
configlet.test_add_rack.test_add_racktest has been flaky on Cisco-8102-C64 and Mellanox SN4600C/SN4700 platforms, while consistently passing on Arista. The failure manifests asPORT|Ethernet136retainingadmin_status: upin CONFIG_DB and APP_DB after the T0 removal patch, causing the DB comparison to fail across all 8 retry cycles.Root Cause Analysis
When the test generates a removal patch (removing a T0 neighbor's config — BGP, buffers, queues, PFC, etc.), the patch does not include an explicit
admin_status=downfor the decommissioned port. The patch only containsremoveoperations for BGP_NEIGHBOR, BUFFER_PG, BUFFER_QUEUE, PORT_QOS_MAP, etc.GCU's patch sorter then does the following:
admin_status=downas an intermediate step to satisfy YANG dependency ordering (YANG requires the port to be down before removing buffer/queue config)admin_status=upas the last sorted change, because the target config still has the port withadmin_status=upThis means GCU is inventing operations the caller never requested. The principle is: GCU should be a faithful executor — it should sort and apply exactly what the patch specifies, nothing more. If the patch cannot be applied without injecting phantom operations, then the patch is incomplete.
After the patch completes with
admin_status=upin CONFIG_DB, a manualconfig interface shutdownworkaround runs. However, on Cisco and Mellanox platforms with dynamic buffer management,buffermgrdraces with the manual shutdown — it re-initializes buffer config after GCU removed it, which triggers port re-processing and can resetadmin_statusback to the CONFIG_DB value (up).Why Arista passes: Arista platforms use a static buffer model where
buffermgrddoes not race with the manual shutdown.Fix
Include
admin_status=downin the removal patch for ports being decommissioned. This ensures the GCU target config hasadmin_status=down, so the sorter will not inject a phantom restore-to-up operation.Remove the manual
config interface shutdownworkaround after the patch, since the patch now correctly specifies the desired final state.This approach follows the principle that the patch should contain the complete desired state — GCU should not need to invent operations that the caller did not request.