Skip to content

[smartswitch] Use dpu_halt_services_timeout from platform.json, fallback to HALT_TIMEOUT=60#376

Open
rameshraghupathy wants to merge 7 commits into
sonic-net:masterfrom
rameshraghupathy:fix-dpu-halt-timeout
Open

[smartswitch] Use dpu_halt_services_timeout from platform.json, fallback to HALT_TIMEOUT=60#376
rameshraghupathy wants to merge 7 commits into
sonic-net:masterfrom
rameshraghupathy:fix-dpu-halt-timeout

Conversation

@rameshraghupathy
Copy link
Copy Markdown
Contributor

Use dpu_halt_services_timeout from platform.json, fallback to HALT_TIMEOUT=60
Platform DPU HALT takes more than 60s under some conditions such as fstrim taking longer. SO increasing the timeout to 120s for now.

…ack to HALT_TIMEOUT=60

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy Changes are fine. Could you fix the coverage?

@rameshraghupathy
Copy link
Copy Markdown
Contributor Author

@rameshraghupathy Changes are fine. Could you fix the coverage?

@vvolam Sure, will do.

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Copy Markdown
Contributor

vvolam commented Apr 21, 2026

@rameshraghupathy please fix PR description and use PR template

Copy link
Copy Markdown
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. PR description contradicts the code

The description says "increasing the timeout to 120s for now" but the code reads dpu_halt_services_timeout from platform.json with a 60s fallback — it doesn't hardcode 120. Please update the PR description to accurately explain the platform.json lookup mechanism and the fallback behavior. Also please use the PR template.

Comment thread host_modules/reboot.py
if timeout is None:
return HALT_TIMEOUT

return int(timeout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. No validation on timeout value

int(timeout) can return 0 or negative values. A platform.json with "dpu_halt_services_timeout": 0 or -5 would cause the halt polling loop to exit immediately and falsely report success/failure.

Suggestion:

val = int(timeout)
return val if val > 0 else HALT_TIMEOUT

Comment thread host_modules/reboot.py

return int(timeout)
except (OSError, ValueError, TypeError, AttributeError, json.JSONDecodeError):
return HALT_TIMEOUT
Copy link
Copy Markdown
Contributor

@vvolam vvolam Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3. Silent fallback — no logging

When platform.json is missing/unreadable/invalid, the function silently returns HALT_TIMEOUT. Adding a logger.info() in this except block would help operators diagnose timeout-related issues on the DPU.

Suggestion:

except (OSError, ValueError, TypeError, AttributeError, json.JSONDecodeError) as e:
    logger.info("%s: Failed to read dpu_halt_services_timeout from platform.json: %s. Using default %d", MOD_NAME, e, HALT_TIMEOUT)
    return HALT_TIMEOUT

mock.patch("builtins.open", mock.mock_open(read_data=json.dumps(mock_data))),
mock.patch("host_modules.reboot.json.load", return_value=mock_data),
):
assert host_reboot.get_dpu_halt_services_timeout() == HALT_TIMEOUT
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4. Missing test: key absent from platform.json

There's a test for None value but not for when the key is entirely missing from the JSON (e.g., mock_data = {}). While the code path is the same (data.get()None → fallback), it's a distinct real-world scenario worth covering explicitly — it's the most common case on non-SmartSwitch platforms.

Comment thread host_modules/reboot.py
# Periodically check every 5 seconds until PMON container is stopped or timeout occurs
logger.info("%s: Waiting until services are halted or timeout occurs", MOD_NAME)
timeout = HALT_TIMEOUT
timeout = get_dpu_halt_services_timeout()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5. Existing halt tests are now fragile

test_execute_reboot_success_halt and test_execute_reboot_fail_halt_timeout don't mock get_dpu_halt_services_timeout(). They currently work because there's no real platform.json in the test environment, so the function hits an exception and falls back to 60. If the test environment ever has a platform.json (e.g., running tests on an actual switch or in a build environment with platform packages installed), these tests would use an unexpected timeout value.

Suggestion: add mock.patch("reboot.get_dpu_halt_services_timeout", return_value=60) to both halt test methods.

@vvolam
Copy link
Copy Markdown
Contributor

vvolam commented Apr 30, 2026

@rameshraghupathy could you address the comments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants