Skip to content

fix(configlet): include admin_status=down in removal patch for decommissioned ports#24658

Open
rookie-who wants to merge 1 commit into
sonic-net:masterfrom
rookie-who:fix/configlet-test-add-rack-admin-status-master
Open

fix(configlet): include admin_status=down in removal patch for decommissioned ports#24658
rookie-who wants to merge 1 commit into
sonic-net:masterfrom
rookie-who:fix/configlet-test-add-rack-admin-status-master

Conversation

@rookie-who
Copy link
Copy Markdown
Contributor

Problem Statement

The configlet.test_add_rack.test_add_rack test has been flaky on Cisco-8102-C64 and Mellanox SN4600C/SN4700 platforms, while consistently passing on Arista. The failure manifests as PORT|Ethernet136 retaining admin_status: up in CONFIG_DB and APP_DB after the T0 removal patch, causing the DB comparison to fail across all 8 retry cycles.

Root Cause Analysis

When the test generates a removal patch (removing a T0 neighbor's config — BGP, buffers, queues, PFC, etc.), the patch does not include an explicit admin_status=down for the decommissioned port. The patch only contains remove operations for BGP_NEIGHBOR, BUFFER_PG, BUFFER_QUEUE, PORT_QOS_MAP, etc.

GCU's patch sorter then does the following:

  1. Injects admin_status=down as an intermediate step to satisfy YANG dependency ordering (YANG requires the port to be down before removing buffer/queue config)
  2. Restores admin_status=up as the last sorted change, because the target config still has the port with admin_status=up

This means GCU is inventing operations the caller never requested. The principle is: GCU should be a faithful executor — it should sort and apply exactly what the patch specifies, nothing more. If the patch cannot be applied without injecting phantom operations, then the patch is incomplete.

After the patch completes with admin_status=up in CONFIG_DB, a manual config interface shutdown workaround runs. However, on Cisco and Mellanox platforms with dynamic buffer management, buffermgrd races with the manual shutdown — it re-initializes buffer config after GCU removed it, which triggers port re-processing and can reset admin_status back to the CONFIG_DB value (up).

Why Arista passes: Arista platforms use a static buffer model where buffermgrd does not race with the manual shutdown.

Fix

  1. Include admin_status=down in the removal patch for ports being decommissioned. This ensures the GCU target config has admin_status=down, so the sorter will not inject a phantom restore-to-up operation.

  2. Remove the manual config interface shutdown workaround after the patch, since the patch now correctly specifies the desired final state.

This approach follows the principle that the patch should contain the complete desired state — GCU should not need to invent operations that the caller did not request.

…issioned ports

The test_add_rack test was flaky on Cisco and Mellanox platforms because
the GCU removal patch did not include an explicit admin_status=down
operation for ports losing their T0 neighbor config.

Without admin_status=down in the patch, GCU's patch sorter would:
1. Inject a temporary admin_status=down to satisfy YANG dependencies
   before removing buffer/queue config
2. Restore admin_status=up at the end (since the target config still
   had admin_status=up for the port)

This left the port as admin_status=up in CONFIG_DB after the patch
completed. A manual 'config interface shutdown' workaround ran after
the patch, but on Cisco/Mellanox platforms with dynamic buffer
management, buffermgrd would race with the manual shutdown and reset
the port state back to the CONFIG_DB value (up), causing the DB
comparison to fail.

The fix addresses two issues:

1. Include admin_status=down in the removal patch for decommissioned
   ports, so GCU receives a complete patch with the correct final
   state. GCU should execute exactly what it is told -- it should not
   need to invent operations that the caller did not request.

2. Remove the manual 'config interface shutdown' workaround after the
   patch, since admin_status=down is now part of the patch itself and
   the workaround is no longer needed.

Signed-off-by: rookie-who <rookie-who@users.noreply.github.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@vaibhavhd
Copy link
Copy Markdown
Contributor

@rimunagala please review from GCU point of view.

@bingwang-ms please review from AddRack and AddCluster on L-DT2 point of view.

@vaibhavhd
Copy link
Copy Markdown
Contributor

202511: #24657

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates configlet generic patch generation for T0 removal so decommissioned ports explicitly get admin_status=down, and removes the post-removal manual shutdown workaround.

Changes:

  • Adds logic to append PORT/<port>/admin_status=down patch operations for ports listed in tor_data.
  • Removes the manual config interface shutdown loop after applying the removal patch.

Comment on lines +107 to +117
for port_name in decommissioned_ports:
path = "/PORT/{}/admin_status".format(port_name)
if path not in admin_status_paths:
# Only add if the port exists in source config with admin_status != down
src_port = src_json.get("PORT", {}).get(port_name, {})
if src_port.get("admin_status", "up") != "down":
jpatch.append({
"op": "replace",
"path": path,
"value": "down"
})
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants