
Update SmartSwitch HA HLD #2180

Merged
BYGX-wcr merged 16 commits into sonic-net:master from BYGX-wcr:NPU-driven-hamgrd
Apr 17, 2026

@BYGX-wcr
Contributor

@BYGX-wcr BYGX-wcr commented Jan 12, 2026

Signed-off-by: BYGX-wcr <wcr@live.cn>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

Although the state machine is driven by HAmgrd running on the NPU, the health signals that trigger unplanned state machine transitions are still expected to be driven by the DPU.
Specifically, we expect the DPU to implement the following two health monitoring mechanisms:
1. DPU-to-DPU liveness probing: The data path and packet format of the DPU-to-DPU probe will be the same as those defined in the main HA design doc. Please refer to the [DPU-to-DPU data plane channel design](./smart-switch-ha-hld.md#4352-dpu-to-dpu-data-plane-channel). Upon detecting a remote DPU failure event, the local DPU should notify the local HAmgrd via the [DASH SAI event notification API](https://github.com/opencomputeproject/SAI/blob/master/experimental/saiswitchextensions.h).
2. DPU self health check: The DPU should monitor its own health and, where possible, report failures via pmon.
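
As a rough illustration of the two mechanisms above, the sketch below models both health sources as one event type that an NPU-side consumer could inspect. `HealthSource`, `HealthSignal`, and `needs_unplanned_handling` are hypothetical names chosen for this example; the excerpt does not define any message format.

```python
from dataclasses import dataclass
from enum import Enum


class HealthSource(Enum):
    # Hypothetical labels for the two mechanisms listed in the excerpt.
    PEER_PROBE = "dpu-to-dpu-probe"  # remote DPU liveness, raised via SAI notification
    SELF_CHECK = "dpu-self-check"    # local DPU health, reported via pmon


@dataclass(frozen=True)
class HealthSignal:
    source: HealthSource
    healthy: bool


def needs_unplanned_handling(signal: HealthSignal) -> bool:
    """Assumption: only failure reports feed the unplanned transition logic."""
    return not signal.healthy
```

Both sources share one shape here so that the consumer does not need to care which mechanism detected the failure; whether the real design does this is not stated in the excerpt.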

Contributor

Can we add details about these health checkers? I.e. what exact DB entries are we checking?

Contributor Author

The self-health check is meant to be implemented by the vendors. Yet, we certainly should define a uniform PMON messaging format.


The state transition graph for DPU-scope-NPU-driven HA is shown as below:

<p align="center"><img alt="HA state transition" src="./images/dpu-scope-npu-driven-ha-state-transition.svg"></p>

Contributor

Instead of wording like "problem detected", can we put in more details, i.e. PeerLost SAI notification?

Can we have a full picture of what exact events will trigger what transition?

Contributor Author

added

@BYGX-wcr BYGX-wcr changed the title Add dpu-scope npu-driven HA HLD Update SmartSwitch HA HLD Jan 16, 2026

Hence, to summarize, we are skipping designing ENI-level pipeline probe here.

#### 6.3.3. Failure Notification

Contributor

TOC is not updated.

Contributor Author

updated


#### 7.2.2 State: Connecting

- Iniate connections to the peer DPU

Contributor

typo.

Contributor Author

fixed


- Initiate connections to the peer DPU
- On successful connection to the peer DPU, transition to *Connected*
- On connection failures, follow the standard procedure to transition to *Standalone*
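
A minimal sketch of the Connecting behaviour quoted above, assuming a hypothetical `connect` callable that returns `True` on success. With `max_attempts=1` it matches the quoted text; a larger value is an illustrative assumption and not part of the excerpt.

```python
from typing import Callable


def connecting_step(connect: Callable[[], bool], max_attempts: int = 1) -> str:
    """Return the name of the state that follows Connecting."""
    for _ in range(max_attempts):
        if connect():
            return "Connected"
    # All attempts failed: fall back as described in the excerpt.
    return "Standalone"
```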

Contributor

we need to do retry here on connection failure.

Contributor Author

noted

- On acquiring the voting result, perform the following state changes:
  * transition to *InitializingToActive* if the vote is won
  * transition to *InitializingToStandby* if the vote is lost
- On failure to get the voting result, follow the standard procedure to transition to *Standalone*
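
The vote handling above can be sketched as a single function. Representing "no result" as `None` is an assumption made for this example; the excerpt does not say how a missing voting result is signalled.

```python
from typing import Optional


def next_state_after_vote(won: Optional[bool]) -> str:
    """Map a voting outcome to the next state name from the excerpt."""
    if won is None:
        # No voting result was obtained.
        return "Standalone"
    return "InitializingToActive" if won else "InitializingToStandby"
```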

Contributor

replace standard procedure to transition to *Standalone* with [Entering standalone setup](#1014-entering-standalone-setup)

Standalone is a state. This will be misleading, as it could end up with state that is not standalone.

Contributor Author

updated

- Listening for `PlannedSwitchover` and `PlannedShutdown`
  * On receiving `PlannedSwitchover`, transition to *SwitchingToActive*
  * On receiving `PlannedShutdown`, transition to *Destroying*
- Listening for health signals
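
One way to express the planned-event handling listed above is a lookup table. The excerpt does not show which state this list belongs to, so the function takes the current state as a parameter; leaving the state unchanged for any other event is an assumption of this sketch.

```python
# Event names and target states are taken from the excerpt.
PLANNED_EVENT_TRANSITIONS = {
    "PlannedSwitchover": "SwitchingToActive",
    "PlannedShutdown": "Destroying",
}


def on_planned_event(current_state: str, event: str) -> str:
    """Return the next state; events not in the table keep the current state."""
    return PLANNED_EVENT_TRANSITIONS.get(event, current_state)
```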

Contributor

maybe we should just refer to unplanned events handling.

Contributor

there can be many ways to fail unplanned.

Contributor Author

if that is the case, there is no need to list this in every single item.

Contributor

yep!

- Request approval from the SDN controller to become the active DPU
- Once the approval is received, transition to *Active*

#### 7.2.7 State: PendingStandbyRoleActivation

Contributor

also need to update the switch over and launch process below.

Contributor Author

updated launch process. Not seen any changes needed in switchover.


#### 6.3.3. Failure Notification

Although we let vendors design their own probing mechanisms, we specify a uniform failure event notification interface: [DASH SAI event notification API](https://github.com/opencomputeproject/SAI/blob/master/experimental/saiswitchextensions.h). The health signal will be reflected in the `dp_channel_is_alive` field in the `DASH_HA_SET_STATE` table of `STATE_DB` (per-DPU).
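
A minimal sketch of how a consumer might interpret the `dp_channel_is_alive` field, assuming the table row has already been read into a plain dictionary of strings. The `"true"`/`"false"` encoding and the treatment of a missing field as "not alive" are assumptions; the excerpt names the field and table but not the value format, and no database client API is used here.

```python
from typing import Mapping


def peer_channel_alive(row: Mapping[str, str]) -> bool:
    """Interpret dp_channel_is_alive from one DASH_HA_SET_STATE row."""
    # Assumed encoding: the string "true" means alive; anything else does not.
    return row.get("dp_channel_is_alive", "false").strip().lower() == "true"
```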

Contributor

@zjswhhh zjswhhh Jan 26, 2026

It should be DPU_STATE_DB, instead of DPU's STATE_DB.

Contributor Author

updated

| Standalone | Heartbeat to pair is lost. Acting like a standalone setup. | Yes | Yes | Yes | Yes | No | No | Acting like the peer doesn’t exist, hence skip all data syncs. <br/><br/>However, the peer can still be connected, because in certain failure cases, we have to drive the DPU to standalone mode for mitigation. |
| SwitchingToActive | Connected and preparing to switch over to active. | No | Yes | Yes | Yes | Yes | No | SwitchingToActive is a transient state that helps the old active move to standby; hence it accepts flow sync from the old active, and it makes decisions and syncs flows back once the old active has moved to standby.<br/><br/>Bulk sync is not used in SwitchOver at all. |
| SwitchingToStandby | Connected, leaving active state and preparing to become standby. | Yes | Tunneled to pair | Tunneled to pair | Yes | No | No | SwitchingToStandby is a transient state that helps the new active move from the SwitchingToActive state to the active state. It is identical to standby except that it responds to the NPU probe in order to tunnel traffic. This is needed to make sure we only ever have one decider at any moment during the transition.<br/><br/>Bulk sync is not used in SwitchOver at all. |
| PendingActiveRoleActivation | Connected, waiting for approval to become active. | No | Drop | Drop | No | No | No | PendingActiveRoleActivation is an intermediate state where we allow the SDN controller to block the DPU until the flow programming has been completed. |

Contributor

Why InitializingToActive can init flow sync but PendingActiveRoleActivation can't?

According to the state transition, we enter PendingActiveRoleActivation from InitializingToActive when standby is ready. So it seems to be an even more advanced/ready state?

Contributor Author

InitializingToActive can only initiate bulk sync. Initiating flow sync is only allowed in Active state.

@BYGX-wcr BYGX-wcr merged commit 5da3879 into sonic-net:master Apr 17, 2026
2 checks passed
@BYGX-wcr BYGX-wcr deleted the NPU-driven-hamgrd branch April 17, 2026 20:36
5 participants