[smartswitch] Use dpu_halt_services_timeout from platform.json, fallback to HALT_TIMEOUT=60#376
Conversation
…ack to HALT_TIMEOUT=60 Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
vvolam
left a comment
There was a problem hiding this comment.
@rameshraghupathy Changes are fine. Could you fix the coverage?
@vvolam Sure, will do. |
Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Ramesh Raghupathy <ram@cisco.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@rameshraghupathy please fix PR description and use PR template |
vvolam
left a comment
There was a problem hiding this comment.
1. PR description contradicts the code
The description says "increasing the timeout to 120s for now" but the code reads dpu_halt_services_timeout from platform.json with a 60s fallback — it doesn't hardcode 120. Please update the PR description to accurately explain the platform.json lookup mechanism and the fallback behavior. Also please use the PR template.
| if timeout is None: | ||
| return HALT_TIMEOUT | ||
|
|
||
| return int(timeout) |
There was a problem hiding this comment.
2. No validation on timeout value
int(timeout) can return 0 or negative values. A platform.json with "dpu_halt_services_timeout": 0 or -5 would cause the halt polling loop to exit immediately and falsely report success/failure.
Suggestion:
val = int(timeout)
return val if val > 0 else HALT_TIMEOUT|
|
||
| return int(timeout) | ||
| except (OSError, ValueError, TypeError, AttributeError, json.JSONDecodeError): | ||
| return HALT_TIMEOUT |
There was a problem hiding this comment.
3. Silent fallback — no logging
When platform.json is missing/unreadable/invalid, the function silently returns HALT_TIMEOUT. Adding a logger.info() in this except block would help operators diagnose timeout-related issues on the DPU.
Suggestion:
except (OSError, ValueError, TypeError, AttributeError, json.JSONDecodeError) as e:
logger.info("%s: Failed to read dpu_halt_services_timeout from platform.json: %s. Using default %d", MOD_NAME, e, HALT_TIMEOUT)
return HALT_TIMEOUT| mock.patch("builtins.open", mock.mock_open(read_data=json.dumps(mock_data))), | ||
| mock.patch("host_modules.reboot.json.load", return_value=mock_data), | ||
| ): | ||
| assert host_reboot.get_dpu_halt_services_timeout() == HALT_TIMEOUT |
There was a problem hiding this comment.
4. Missing test: key absent from platform.json
There's a test for None value but not for when the key is entirely missing from the JSON (e.g., mock_data = {}). While the code path is the same (data.get() → None → fallback), it's a distinct real-world scenario worth covering explicitly — it's the most common case on non-SmartSwitch platforms.
| # Periodically check every 5 seconds until PMON container is stopped or timeout occurs | ||
| logger.info("%s: Waiting until services are halted or timeout occurs", MOD_NAME) | ||
| timeout = HALT_TIMEOUT | ||
| timeout = get_dpu_halt_services_timeout() |
There was a problem hiding this comment.
5. Existing halt tests are now fragile
test_execute_reboot_success_halt and test_execute_reboot_fail_halt_timeout don't mock get_dpu_halt_services_timeout(). They currently work because there's no real platform.json in the test environment, so the function hits an exception and falls back to 60. If the test environment ever has a platform.json (e.g., running tests on an actual switch or in a build environment with platform packages installed), these tests would use an unexpected timeout value.
Suggestion: add mock.patch("reboot.get_dpu_halt_services_timeout", return_value=60) to both halt test methods.
|
@rameshraghupathy could you address the comments? |
Use dpu_halt_services_timeout from platform.json, fallback to HALT_TIMEOUT=60
Platform DPU HALT takes more than 60s under some conditions such as fstrim taking longer. SO increasing the timeout to 120s for now.