Add support for hamgrd rehydration after crashes #159
Signed-off-by: BYGX-wcr <wcr@live.cn>
Pull request overview
Adds crash-recovery (“rehydration”) support to the NPU-driven hamgrd HA scope actor so it can resume operation after an unplanned restart by leveraging persisted STATE_DB state.
Changes:
- Introduces a new `HaEvent::Rehydration` event to represent crash-recovery mode.
- Persists/restores the in-memory `target_ha_scope_state` via `local_target_asic_ha_state` in `STATE_DB`, and adds logic to re-apply idempotent side effects on restart.
- Makes COUNTERS-related consumer bridges best-effort and adds a unit test covering Active-state rehydration.
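For readers skimming the change list, the persist-only-when-changed behaviour can be illustrated with a minimal sketch. This is not the PR's code: `FakeStateDb` and its methods are hypothetical stand-ins for the `STATE_DB` table access in hamgrd, and only the field name `local_target_asic_ha_state` comes from the PR.

```rust
use std::collections::HashMap;

/// Field name taken from the PR; everything else here is illustrative.
pub const TARGET_FIELD: &str = "local_target_asic_ha_state";

/// Hypothetical in-memory stand-in for one STATE_DB row.
#[derive(Debug, Default)]
pub struct FakeStateDb {
    pub row: HashMap<String, String>,
    /// Counts actual writes so redundant persists are observable.
    pub writes: usize,
}

impl FakeStateDb {
    /// Writes `target` only if it differs from the stored value.
    /// Returns true when a write happened.
    pub fn persist_if_changed(&mut self, target: &str) -> bool {
        if self.row.get(TARGET_FIELD).map(String::as_str) == Some(target) {
            return false;
        }
        self.row.insert(TARGET_FIELD.to_string(), target.to_string());
        self.writes += 1;
        true
    }

    /// Reads back the persisted target state, if any, after a restart.
    pub fn restore(&self) -> Option<&str> {
        self.row.get(TARGET_FIELD).map(String::as_str)
    }
}
```

A restarted process that sees the same row would recover the last persisted value from `restore()` without any in-memory state surviving the crash.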
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| crates/hamgrd/src/actors/ha_scope/npu.rs | Implements rehydration detection, target-state persistence/restore, and rehydration side effects; makes COUNTERS subscriptions best-effort. |
| crates/hamgrd/src/actors/ha_scope/mod.rs | Adds HaEvent::Rehydration string mapping and extends/introduces NPU-driven unit tests for rehydration. |
    if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
        internal
            .get_mut(NpuDashHaScopeState::table_name())
            .clone_from(&fvs);
persist_target_state_if_changed() silently drops serialization failures from swss_serde::to_field_values(&npu_state). If this fails, the target state won't be persisted and crash recovery will misbehave, but there will be no signal in logs/tests. Consider returning a Result<()> (or at least logging the error at error!/warn!) so persistence failures are visible and can be handled.
-    if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
-        internal
-            .get_mut(NpuDashHaScopeState::table_name())
-            .clone_from(&fvs);
+    match swss_serde::to_field_values(&npu_state) {
+        Ok(fvs) => {
+            internal
+                .get_mut(NpuDashHaScopeState::table_name())
+                .clone_from(&fvs);
+        }
+        Err(err) => {
+            eprintln!(
+                "failed to persist target HA scope state for NPU: {}",
+                err
+            );
+        }
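The review comment also mentions returning a `Result<()>` as an alternative to logging. A minimal sketch of that shape follows, under stated assumptions: `to_field_values` here is a hypothetical local function standing in for `swss_serde::to_field_values`, and `SerializeError`, `FieldValues`, and `persist_target_state` are illustrative names, not hamgrd types.

```rust
use std::collections::HashMap;
use std::fmt;

/// Illustrative alias for a table row of field/value pairs.
pub type FieldValues = HashMap<String, String>;

/// Hypothetical error type for a failed serialization.
#[derive(Debug, PartialEq, Eq)]
pub struct SerializeError(pub String);

impl fmt::Display for SerializeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "serialization failed: {}", self.0)
    }
}

impl std::error::Error for SerializeError {}

/// Stand-in serializer (assumption): rejects an empty target state.
pub fn to_field_values(target: &str) -> Result<FieldValues, SerializeError> {
    if target.is_empty() {
        return Err(SerializeError("empty target state".to_string()));
    }
    let mut fvs = FieldValues::new();
    fvs.insert("local_target_asic_ha_state".to_string(), target.to_string());
    Ok(fvs)
}

/// Propagates the serialization error instead of dropping it,
/// and leaves the table untouched when serialization fails.
pub fn persist_target_state(
    table: &mut FieldValues,
    target: &str,
) -> Result<(), SerializeError> {
    let fvs = to_field_values(target)?;
    table.clone_from(&fvs);
    Ok(())
}
```

With this shape the caller decides whether a persistence failure is fatal or merely logged, which keeps the policy out of the helper.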
    /// Persist `target_ha_scope_state` to `local_target_asic_ha_state` in STATE_DB
    /// so it survives hamgrd crashes. Only writes when the value has changed.
    fn persist_target_state_if_changed(&self, state: &mut State) {
        let internal = state.internal();

    if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
        internal.get_mut(NpuDashHaScopeState::table_name()).clone_from(&fvs);
    self.target_ha_scope_state = self
        .base
        .get_npu_ha_scope_state(internal)
        .and_then(|s| s.local_target_asic_ha_state)
        .and_then(|s| match s.as_str() {
            "active" => Some(TargetState::Active),
            "standby" => Some(TargetState::Standby),
            "standalone" => Some(TargetState::Standalone),
            "dead" => Some(TargetState::Dead),
            _ => None,
        });
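One possible way to keep the string mapping above in a single place is a `FromStr` implementation. The sketch below is an assumption, not what the PR does: the `TargetState` enum is redeclared locally with the four variants visible in the hunk, and `UnknownTargetState` and `restore_target_state` are hypothetical names.

```rust
use std::str::FromStr;

/// Local copy of the four variants visible in the hunk above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetState {
    Active,
    Standby,
    Standalone,
    Dead,
}

/// Hypothetical error carrying the unrecognized persisted string,
/// so a caller could log it instead of silently ignoring it.
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownTargetState(pub String);

impl FromStr for TargetState {
    type Err = UnknownTargetState;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "active" => Ok(TargetState::Active),
            "standby" => Ok(TargetState::Standby),
            "standalone" => Ok(TargetState::Standalone),
            "dead" => Ok(TargetState::Dead),
            other => Err(UnknownTargetState(other.to_string())),
        }
    }
}

/// Mirrors the hunk's behaviour: a missing or unknown value yields None.
pub fn restore_target_state(persisted: Option<&str>) -> Option<TargetState> {
    persisted.and_then(|s| s.parse::<TargetState>().ok())
}
```

The `Err` branch makes an unexpected value in `STATE_DB` distinguishable from an absent one, which the `_ => None` arm cannot express.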
    if let Ok(fvs) = swss_serde::to_field_values(&npu_state) {
        internal.get_mut(NpuDashHaScopeState::table_name()).clone_from(&fvs);

    HaState::Active => {
        // Re-activate Active role on DPU with persisted term (no increment)
        self.send_heartbeat_to_peer(state)?;
        let _ = self.update_dpu_ha_scope_table_with_params(state, HaRole::Active.as_str_name());
    }
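The comment in the `HaState::Active` arm says the role is re-applied with the persisted term and no increment. A minimal sketch of why that makes the step idempotent is below; `DpuHaScope` and `rehydrate_active` are hypothetical names for illustration and do not model the real DPU table or the heartbeat.

```rust
/// Hypothetical view of what the DPU holds for one HA scope.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DpuHaScope {
    pub role: Option<String>,
    pub term: u64,
}

/// Re-applies the Active role using the persisted term as-is.
/// Because nothing is incremented, repeating the call is harmless.
pub fn rehydrate_active(dpu: &mut DpuHaScope, persisted_term: u64) {
    dpu.role = Some("active".to_string());
    dpu.term = persisted_term;
}
```

Calling `rehydrate_active` once or several times with the same persisted term leaves the scope in the same state, which is the property a crash-recovery path needs if it may run again after another restart.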
What I did
Add support for NPU-driven hamgrd to rehydrate after crashes.
Why I did it
To handle unplanned hamgrd failures.
How I verified it
Added a unit test covering Active-state rehydration.
Details if related