Skip to content

Add support for hamgrd rehydration after crashes#159

Open
BYGX-wcr wants to merge 5 commits into
sonic-net:masterfrom
BYGX-wcr:hamgrd-rehydration
Open

Add support for hamgrd rehydration after crashes#159
BYGX-wcr wants to merge 5 commits into
sonic-net:masterfrom
BYGX-wcr:hamgrd-rehydration

Conversation

@BYGX-wcr
Copy link
Copy Markdown
Contributor

What I did

Add support for NPU-driven Hamgrd to rehydrate after crashes.

Why I did it

Handle unplanned hamgrd failures

How I verified it

Added a UT

Details if related

Signed-off-by: BYGX-wcr <wcr@live.cn>
Copilot AI review requested due to automatic review settings April 30, 2026 00:24
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds crash-recovery (“rehydration”) support to the NPU-driven hamgrd HA scope actor so it can resume operation after an unplanned restart by leveraging persisted STATE_DB state.

Changes:

  • Introduces a new HaEvent::Rehydration event to represent crash-recovery mode.
  • Persists/restores the in-memory target_ha_scope_state via local_target_asic_ha_state in STATE_DB, and adds logic to re-apply idempotent side effects on restart.
  • Makes COUNTERS-related consumer bridges best-effort and adds a unit test covering Active-state rehydration.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
crates/hamgrd/src/actors/ha_scope/npu.rs Implements rehydration detection, target-state persistence/restore, and rehydration side effects; makes COUNTERS subscriptions best-effort.
crates/hamgrd/src/actors/ha_scope/mod.rs Adds HaEvent::Rehydration string mapping and extends/introduces NPU-driven unit tests for rehydration.

Comment thread crates/hamgrd/src/actors/ha_scope/npu.rs Outdated
Comment on lines +1188 to +1191
if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
internal
.get_mut(NpuDashHaScopeState::table_name())
.clone_from(&fvs);
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persist_target_state_if_changed() silently drops serialization failures from swss_serde::to_field_values(&npu_state). If this fails, the target state won't be persisted and crash recovery will misbehave, but there will be no signal in logs/tests. Consider returning a Result<()> (or at least logging the error at error!/warn!) so persistence failures are visible and can be handled.

Suggested change
if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
internal
.get_mut(NpuDashHaScopeState::table_name())
.clone_from(&fvs);
match swss_serde::to_field_values(&npu_state) {
Ok(fvs) => {
internal
.get_mut(NpuDashHaScopeState::table_name())
.clone_from(&fvs);
}
Err(err) => {
eprintln!(
"failed to persist target HA scope state for NPU: {}",
err
);
}

Copilot uses AI. Check for mistakes.
Comment thread crates/hamgrd/src/actors/ha_scope/npu.rs
Comment thread crates/hamgrd/src/actors/ha_scope/npu.rs
Signed-off-by: BYGX-wcr <wcr@live.cn>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@BYGX-wcr
Copy link
Copy Markdown
Contributor Author

/azpw run

@mssonicbld
Copy link
Copy Markdown

⚠️ Notice: /azpw run only runs failed jobs now. If you want to trigger a whole pipline run, please rebase your branch or close and reopen the PR.
💡 Tip: You can also use /azpw retry to retry failed jobs directly.

Retrying failed(or canceled) jobs...

@mssonicbld
Copy link
Copy Markdown

Retrying failed(or canceled) stages in build 1102063:

✅Stage Build:

  • Job amd64/ubuntu-22.04: retried.

Signed-off-by: BYGX-wcr <wcr@live.cn>
Copilot AI review requested due to automatic review settings May 6, 2026 18:12
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

/// Persist `target_ha_scope_state` to `local_target_asic_ha_state` in STATE_DB
/// so it survives hamgrd crashes. Only writes when the value has changed.
fn persist_target_state_if_changed(&self, state: &mut State) {
let internal = state.internal();
Comment on lines +1188 to +1189
if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
internal.get_mut(NpuDashHaScopeState::table_name()).clone_from(&fvs);
Signed-off-by: BYGX-wcr <wcr@live.cn>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: BYGX-wcr <wcr@live.cn>
Copilot AI review requested due to automatic review settings May 6, 2026 18:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +1161 to +1171
self.target_ha_scope_state = self
.base
.get_npu_ha_scope_state(internal)
.and_then(|s| s.local_target_asic_ha_state)
.and_then(|s| match s.as_str() {
"active" => Some(TargetState::Active),
"standby" => Some(TargetState::Standby),
"standalone" => Some(TargetState::Standalone),
"dead" => Some(TargetState::Dead),
_ => None,
});
Comment on lines +1195 to +1196
if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
internal.get_mut(NpuDashHaScopeState::table_name()).clone_from(&fvs);
Comment on lines +1240 to +1244
HaState::Active => {
// Re-activate Active role on DPU with persisted term (no increment)
self.send_heartbeat_to_peer(state)?;
let _ = self.update_dpu_ha_scope_table_with_params(state, HaRole::Active.as_str_name());
}
@BYGX-wcr BYGX-wcr requested review from vivekrnv and zjswhhh May 18, 2026 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants