Add support for hamgrd rehydration after crashes by BYGX-wcr · Pull Request #159 · sonic-net/sonic-dash-ha

BYGX-wcr · 2026-04-30T00:24:56Z

What I did

Add support for NPU-driven hamgrd to rehydrate after crashes.

Why I did it

Handle unplanned hamgrd failures

How I verified it

Added a unit test (UT) covering Active-state rehydration.

Details if related

Signed-off-by: BYGX-wcr <wcr@live.cn>

mssonicbld · 2026-04-30T00:25:03Z

/azp run

azure-pipelines · 2026-04-30T00:25:12Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-04-30T00:28:48Z

/azp run

azure-pipelines · 2026-04-30T00:28:57Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Adds crash-recovery (“rehydration”) support to the NPU-driven hamgrd HA scope actor so it can resume operation after an unplanned restart by leveraging persisted STATE_DB state.

Changes:

Introduces a new HaEvent::Rehydration event to represent crash-recovery mode.
Persists/restores the in-memory target_ha_scope_state via local_target_asic_ha_state in STATE_DB, and adds logic to re-apply idempotent side effects on restart.
Makes COUNTERS-related consumer bridges best-effort and adds a unit test covering Active-state rehydration.
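To make the overview concrete, here is a minimal sketch of how a restart could be told apart from a first launch by checking for a persisted target state. Only `HaEvent::Rehydration` and the `local_target_asic_ha_state` field come from this PR; the `Launch` variant, the string values, the `FieldValues` alias, and `startup_event` are assumptions made for illustration and may not match the real hamgrd code.

```rust
use std::collections::HashMap;

/// Trimmed-down event enum. Only `Rehydration` is named in the PR;
/// `Launch` and the string values are placeholders for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HaEvent {
    Launch,
    Rehydration,
}

impl HaEvent {
    /// Event -> string mapping (hypothetical strings).
    fn as_str(self) -> &'static str {
        match self {
            HaEvent::Launch => "launch",
            HaEvent::Rehydration => "rehydration",
        }
    }

    /// String -> event mapping, the inverse of `as_str`.
    fn parse(s: &str) -> Option<HaEvent> {
        match s {
            "launch" => Some(HaEvent::Launch),
            "rehydration" => Some(HaEvent::Rehydration),
            _ => None,
        }
    }
}

/// Stand-in for one STATE_DB row: field name -> value.
type FieldValues = HashMap<String, String>;

/// Pick the first event after start-up: a non-empty persisted target
/// state is treated as evidence of a crash, so we rehydrate.
fn startup_event(persisted_row: &FieldValues) -> HaEvent {
    match persisted_row.get("local_target_asic_ha_state") {
        Some(value) if !value.is_empty() => HaEvent::Rehydration,
        _ => HaEvent::Launch,
    }
}
```

With an empty row this sketch picks `Launch`; once the row carries a target state such as `active`, it picks `Rehydration`.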

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
crates/hamgrd/src/actors/ha_scope/npu.rs	Implements rehydration detection, target-state persistence/restore, and rehydration side effects; makes COUNTERS subscriptions best-effort.
crates/hamgrd/src/actors/ha_scope/mod.rs	Adds `HaEvent::Rehydration` string mapping and extends/introduces NPU-driven unit tests for rehydration.

Copilot · 2026-04-30T00:29:38Z

+        if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
+            internal
+                .get_mut(NpuDashHaScopeState::table_name())
+                .clone_from(&fvs);


persist_target_state_if_changed() silently drops serialization failures from swss_serde::to_field_values(&npu_state). If this fails, the target state won't be persisted and crash recovery will misbehave, but there will be no signal in logs/tests. Consider returning a Result<()> (or at least logging the error at error!/warn!) so persistence failures are visible and can be handled.

Suggested change

-        if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
-            internal
-                .get_mut(NpuDashHaScopeState::table_name())
-                .clone_from(&fvs);
+        match swss_serde::to_field_values(&npu_state) {
+            Ok(fvs) => {
+                internal
+                    .get_mut(NpuDashHaScopeState::table_name())
+                    .clone_from(&fvs);
+            }
+            Err(err) => {
+                eprintln!(
+                    "failed to persist target HA scope state for NPU: {}",
+                    err
+                );
+            }
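The review comment also mentions returning a `Result<()>` as an alternative to logging. A minimal sketch of that variant follows. The `PersistError` type, the `FieldValues` alias, and the closure standing in for `swss_serde::to_field_values` are all hypothetical, chosen so the example compiles without the hamgrd crates.

```rust
use std::collections::HashMap;
use std::fmt;

/// Stand-in for one STATE_DB row: field name -> value.
type FieldValues = HashMap<String, String>;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
struct PersistError(String);

impl fmt::Display for PersistError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to persist target HA scope state: {}", self.0)
    }
}

impl std::error::Error for PersistError {}

/// Serialize, then overwrite the table row. A serialization failure is
/// returned to the caller instead of being dropped, and the row is left
/// untouched in that case.
fn persist_target_state<F>(table: &mut FieldValues, serialize: F) -> Result<(), PersistError>
where
    F: FnOnce() -> Result<FieldValues, PersistError>,
{
    let fvs = serialize()?;
    table.clone_from(&fvs);
    Ok(())
}
```

Propagating the error lets the caller decide whether to log, retry, or fail the state transition, which a bare `if let Ok(..)` cannot express.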

mssonicbld · 2026-04-30T04:48:35Z

/azp run

azure-pipelines · 2026-04-30T04:48:45Z

Azure Pipelines successfully started running 1 pipeline(s).

BYGX-wcr · 2026-04-30T21:32:47Z

/azpw run

mssonicbld · 2026-04-30T21:32:49Z

⚠️ Notice: /azpw run only runs failed jobs now. If you want to trigger a whole pipeline run, please rebase your branch or close and reopen the PR.
💡 Tip: You can also use /azpw retry to retry failed jobs directly.

Retrying failed (or canceled) jobs...

mssonicbld · 2026-04-30T21:32:51Z

Retrying failed (or canceled) stages in build 1102063:

✅Stage Build:

Job amd64/ubuntu-22.04: retried.

mssonicbld · 2026-05-06T18:12:56Z

/azp run

azure-pipelines · 2026-05-06T18:13:07Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

+    /// Persist `target_ha_scope_state` to `local_target_asic_ha_state` in STATE_DB
+    /// so it survives hamgrd crashes. Only writes when the value has changed.
+    fn persist_target_state_if_changed(&self, state: &mut State) {
+        let internal = state.internal();
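The "only writes when the value has changed" behaviour described in the doc comment can be sketched as below. This is an illustration under assumptions, not the PR's implementation: the row is modelled as a plain `HashMap`, and `persist_if_changed` with its `bool` return value is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Stand-in for one STATE_DB row: field name -> value.
type FieldValues = HashMap<String, String>;

/// Field name taken from the PR description.
const FIELD: &str = "local_target_asic_ha_state";

/// Write `target` into the row only if it differs from what is stored.
/// Returns `true` when a write (or removal) actually happened.
fn persist_if_changed(table: &mut FieldValues, target: Option<&str>) -> bool {
    let current = table.get(FIELD).map(String::as_str);
    if current == target {
        return false; // nothing changed, skip the write
    }
    match target {
        Some(value) => {
            table.insert(FIELD.to_string(), value.to_string());
        }
        None => {
            table.remove(FIELD);
        }
    }
    true
}
```

Calling it twice with the same target writes once; the second call returns `false` and leaves the row alone.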


+        if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
+            internal.get_mut(NpuDashHaScopeState::table_name()).clone_from(&fvs);


mssonicbld · 2026-05-06T18:24:03Z

/azp run

azure-pipelines · 2026-05-06T18:24:13Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

+        self.target_ha_scope_state = self
+            .base
+            .get_npu_ha_scope_state(internal)
+            .and_then(|s| s.local_target_asic_ha_state)
+            .and_then(|s| match s.as_str() {
+                "active" => Some(TargetState::Active),
+                "standby" => Some(TargetState::Standby),
+                "standalone" => Some(TargetState::Standalone),
+                "dead" => Some(TargetState::Dead),
+                _ => None,
+            });
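One way to keep the restore path above in sync with the persist path is to pair a `FromStr` implementation with an `as_str` method, so both directions share one set of string literals. The sketch below is self-contained and assumes a `TargetState` enum with the four variants shown in the diff; `UnknownTargetState` and `restore_target_state` are hypothetical names.

```rust
use std::str::FromStr;

/// Mirrors the four variants matched in the diff above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TargetState {
    Active,
    Standby,
    Standalone,
    Dead,
}

impl TargetState {
    /// String form written to STATE_DB.
    fn as_str(self) -> &'static str {
        match self {
            TargetState::Active => "active",
            TargetState::Standby => "standby",
            TargetState::Standalone => "standalone",
            TargetState::Dead => "dead",
        }
    }
}

/// Hypothetical error carrying the unrecognized string.
#[derive(Debug, PartialEq, Eq)]
struct UnknownTargetState(String);

impl FromStr for TargetState {
    type Err = UnknownTargetState;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "active" => Ok(TargetState::Active),
            "standby" => Ok(TargetState::Standby),
            "standalone" => Ok(TargetState::Standalone),
            "dead" => Ok(TargetState::Dead),
            other => Err(UnknownTargetState(other.to_string())),
        }
    }
}

/// Same shape as the diff: a missing or unknown value restores to `None`.
fn restore_target_state(persisted: Option<&str>) -> Option<TargetState> {
    persisted.and_then(|s| s.parse().ok())
}
```

A round-trip check over all four variants then catches any drift between the two mappings.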


+        if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
+            internal.get_mut(NpuDashHaScopeState::table_name()).clone_from(&fvs);


+            HaState::Active => {
+                // Re-activate Active role on DPU with persisted term (no increment)
+                self.send_heartbeat_to_peer(state)?;
+                let _ = self.update_dpu_ha_scope_table_with_params(state, HaRole::Active.as_str_name());
+            }
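The comment in the diff says the Active role is re-applied with the persisted term and no increment. The sketch below shows why that keeps the side effect idempotent. It assumes that a normal transition to Active bumps the term, which this page does not confirm; `DpuHaScopeRow`, `ScopeState`, and their field names are hypothetical.

```rust
/// Hypothetical view of the row pushed to the DPU HA scope table.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DpuHaScopeRow {
    ha_role: &'static str,
    ha_term: u64,
}

/// Hypothetical in-memory state holding the current term.
struct ScopeState {
    term: u64,
}

impl ScopeState {
    /// Normal transition to Active: assumed here to start a new term.
    fn activate(&mut self) -> DpuHaScopeRow {
        self.term += 1;
        DpuHaScopeRow { ha_role: "active", ha_term: self.term }
    }

    /// Rehydration: re-send the Active role with the term that was
    /// persisted before the crash. Takes `&self`, so it cannot bump it.
    fn rehydrate_active(&self) -> DpuHaScopeRow {
        DpuHaScopeRow { ha_role: "active", ha_term: self.term }
    }
}
```

Because `rehydrate_active` never changes the term, repeating it after several crashes in a row produces the same row each time.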


add support for hamgrd rehydration

75aff98

Signed-off-by: BYGX-wcr <wcr@live.cn>

Copilot AI review requested due to automatic review settings April 30, 2026 00:24

Copilot started reviewing on behalf of BYGX-wcr April 30, 2026 00:25

r12f closed this Apr 30, 2026

r12f reopened this Apr 30, 2026

Copilot AI reviewed Apr 30, 2026

fix formatting

f74ef73

Signed-off-by: BYGX-wcr <wcr@live.cn>

remove recreation of pending operation ids

f19d4f8

Signed-off-by: BYGX-wcr <wcr@live.cn>

Copilot AI review requested due to automatic review settings May 6, 2026 18:12

Copilot started reviewing on behalf of BYGX-wcr May 6, 2026 18:13

Copilot AI reviewed May 6, 2026

add rehydration_needed flag and the handling logic

2fd9a5d

Signed-off-by: BYGX-wcr <wcr@live.cn>

fix formatting

68346db

Signed-off-by: BYGX-wcr <wcr@live.cn>

Copilot AI review requested due to automatic review settings May 6, 2026 18:46

Copilot started reviewing on behalf of BYGX-wcr May 6, 2026 18:46

Copilot AI reviewed May 6, 2026

BYGX-wcr mentioned this pull request May 6, 2026

Update SmartSwitch HA HLD sonic-net/SONiC#2180

Merged

BYGX-wcr requested review from vivekrnv and zjswhhh May 18, 2026 21:06

4 participants