KVM HA: fence by confirming host power state (fix host stuck in Fencing when already powered off) by andrijapanicsb · Pull Request #13377 · apache/cloudstack

andrijapanicsb · 2026-06-08T16:20:15Z

Description

When a KVM host with host-HA + out-of-band management (OOBM) enabled is hard powered off (forced chassis-off from the BMC, or a real power/cable failure), CloudStack never transitions the host to Down and therefore never restarts its VMs on other hosts — the host stays in Alert/Disconnected indefinitely.

Root cause: the host-HA state machine declares a host dead (HAState.Fenced → investigator Status.Down) only after a successful OOBM power-off. Against an already-off chassis the BMC rejects the power-off (the Redfish driver maps OFF to GracefulShutdown, which returns HTTP 409 when the system is already off), so KVMHAProvider.fence() reports failure and the host stays stuck in the Fencing state — which HAManagerImpl.getHostStatusFromHAConfig() maps to Status.Disconnected, not Status.Down. VM-HA is therefore never invoked, and the VMs are only recovered once the original (dead) host is powered back on, at which point the pending power-off finally succeeds.
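The status mapping described above can be pictured with a small sketch. The enum and class names here are hypothetical stand-ins, not CloudStack's real types, and only the two mappings named in the description (Fencing → Disconnected, Fenced → Down) are modeled.

```java
// Hypothetical stand-ins for the HA state and host status discussed above;
// these are NOT the real CloudStack classes.
enum SketchHaState { FENCING, FENCED }

enum SketchHostStatus { DISCONNECTED, DOWN }

final class HaStatusSketch {
    private HaStatusSketch() { }

    // Only the two mappings stated in the description are modeled here.
    static SketchHostStatus hostStatusFor(SketchHaState state) {
        switch (state) {
            case FENCED:
                return SketchHostStatus.DOWN;
            case FENCING:
                return SketchHostStatus.DISCONNECTED;
            default:
                throw new IllegalArgumentException("unmodeled state: " + state);
        }
    }

    // VM-HA is only invoked once the host is reported Down.
    static boolean vmHaCanStart(SketchHaState state) {
        return hostStatusFor(state) == SketchHostStatus.DOWN;
    }
}
```

A host that never leaves Fencing therefore never satisfies `vmHaCanStart`, which is the stuck behaviour reported in the issue.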

Observed in production with Redfish/iDRAC. Full root-cause analysis and management-server log evidence are in #13376.

Fix

Fencing now succeeds based on the actual chassis power state, not the power-off command's return code:

- if the host is already powered off (OOBM STATUS == Off) → treat it as fenced (no power-off issued);
- otherwise issue a best-effort power-off and then confirm via OOBM STATUS;
- only a confirmed Off state counts as a successful fence; if the state cannot be confirmed (e.g. an unreachable BMC) the fence fails and is retried, to avoid split-brain.
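The three rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `KVMHAProvider`: the `SketchPowerState` enum, the `StateDrivenFenceSketch` class and the two functional parameters are hypothetical, and an empty `Optional` stands in for "power state could not be confirmed".

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical power states; a real OOBM driver may report more detail.
enum SketchPowerState { ON, OFF }

final class StateDrivenFenceSketch {
    private StateDrivenFenceSketch() { }

    /**
     * @param readPowerState queries OOBM STATUS; empty means the state could
     *                       not be confirmed (e.g. an unreachable BMC)
     * @param powerOff       issues the power-off; its result is ignored on purpose
     * @return true only when the chassis is confirmed Off
     */
    static boolean fence(Supplier<Optional<SketchPowerState>> readPowerState,
                         BooleanSupplier powerOff) {
        // Already off: the host is fenced and no power-off is issued.
        if (isOff(readPowerState.get())) {
            return true;
        }
        // Best effort: a failure (e.g. HTTP 409) is not decisive by itself.
        try {
            powerOff.getAsBoolean();
        } catch (RuntimeException e) {
            // Fall through to the confirmation step.
        }
        // Only a confirmed Off counts; unknown or On means the fence failed.
        return isOff(readPowerState.get());
    }

    private static boolean isOff(Optional<SketchPowerState> state) {
        return state.isPresent() && state.get() == SketchPowerState.OFF;
    }
}
```

Returning false for an unconfirmed state leaves the retry decision to the caller, which matches the split-brain concern: VMs are never restarted elsewhere on a guess.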

This is OOBM-driver-agnostic (works for ipmitool, Redfish and nested-cloudstack drivers).

Additionally, the Redfish driver now maps PowerOperation.OFF to ForceOff (a hard power-off) instead of GracefulShutdown — consistent with the ipmitool driver and appropriate for fencing an unresponsive host; SOFT remains the graceful ACPI shutdown. Also fixes a latent String.format argument-count bug on the Redfish STATUS branch.
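As a rough illustration of the new mapping (not the actual driver code), the sketch below maps the two operations named above to Redfish `ResetType` values. The enum and class names are hypothetical; `ForceOff` and `GracefulShutdown` are standard `ResetType` values of the Redfish `ComputerSystem.Reset` action.

```java
// Hypothetical subset of power operations; only OFF and SOFT are described above.
enum SketchPowerOperation { OFF, SOFT }

final class RedfishResetTypeSketch {
    private RedfishResetTypeSketch() { }

    // Maps an operation to the ResetType string sent to the BMC.
    static String resetTypeFor(SketchPowerOperation op) {
        switch (op) {
            case OFF:
                return "ForceOff";          // hard power-off, suitable for fencing
            case SOFT:
                return "GracefulShutdown";  // graceful ACPI shutdown
            default:
                throw new IllegalArgumentException("unmodeled operation: " + op);
        }
    }

    // Builds the JSON body for the reset action; one placeholder, one argument.
    static String resetRequestBody(SketchPowerOperation op) {
        return String.format("{\"ResetType\": \"%s\"}", resetTypeFor(op));
    }
}
```

A hard power-off does not depend on the guest OS responding, which is why it suits fencing an unresponsive host.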

Fixes: #13376

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature/enhancement (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
build/CI

Bug Severity

How Has This Been Tested?

Unit tests added to KVMHostHATest (all green) covering the fence behaviour:

- host already off → fenced without issuing a power-off;
- power-off succeeds, STATUS confirms Off → fenced;
- power-off command fails (HTTP 409) but STATUS confirms Off → still fenced (the regression for this issue);
- power state cannot be confirmed (unreachable BMC) → fence fails (no split-brain);
- OOBM not enabled → fence fails.

mvn -pl plugins/hypervisors/kvm -Dtest=KVMHostHATest test
=> Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

Note on reproduction: the original symptom reproduces on real Redfish hardware (power-off-when-off → HTTP 409). Software/nested OOBM drivers whose power-off is idempotent (e.g. the nested-cloudstack driver's stopVirtualMachine, which is a no-op on an already-stopped VM) do not exhibit the bug, so the deterministic coverage is provided by the unit tests above.

KVMHAProvider.fence() declared a host fenced only when the out-of-band power-off command reported success. Against an already-off chassis the BMC rejects the power-off (e.g. Redfish returns HTTP 409), so fence() failed and the host stayed stuck in the Fencing HA state, which maps to Disconnected (not Down). VM-HA therefore never restarted the VMs until the dead host was powered back on.

Fencing now succeeds based on the actual chassis power state:
- if the host is already powered off (OOBM STATUS == Off), treat it as fenced;
- otherwise issue a best-effort power-off and confirm via OOBM STATUS;
- only a confirmed Off state counts as success; if the state cannot be confirmed (e.g. unreachable BMC) the fence fails and is retried, to avoid split-brain.

Also map Redfish PowerOperation.OFF to ForceOff (hard power-off) instead of GracefulShutdown, consistent with the ipmitool driver and appropriate for fencing an unresponsive host (SOFT remains the graceful ACPI shutdown).

Fixes apache#13376

codecov · 2026-06-08T16:28:50Z

Codecov Report

❌ Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.68%. Comparing base (21b2025) to head (65a3e99).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...va/org/apache/cloudstack/kvm/ha/KVMHAProvider.java | 85.18% | 3 Missing and 1 partial ⚠️ |
| ...fbandmanagement/driver/redfish/RedfishWrapper.java | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff            @@
##               4.22   #13377   +/-   ##
=========================================
  Coverage     17.67%   17.68%           
- Complexity    15792    15798    +6     
=========================================
  Files          5922     5922           
  Lines        533165   533184   +19     
  Branches      65208    65211    +3     
=========================================
+ Hits          94242    94273   +31     
+ Misses       428276   428264   -12     
  Partials      10647    10647

| Flag | Coverage Δ | |
|---|---|---|
| uitests | `3.69% <ø> (ø)` | |
| unittests | `18.75% <79.31%> (+<0.01%)` | ⬆️ |

andrijapanicsb · 2026-06-08T22:38:14Z

@blueorangutan package kvm

blueorangutan · 2026-06-08T22:40:03Z

@andrijapanicsb a [SL] Jenkins job has been kicked to build packages. It will be bundled with kvm SystemVM template(s). I'll keep you posted as I make progress.

blueorangutan · 2026-06-08T23:27:44Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18199

andrijapanicsb · 2026-06-08T23:46:44Z

@blueorangutan test ol9 kvm-ol9 keepEnv

blueorangutan · 2026-06-08T23:48:03Z

@andrijapanicsb a [SL] Trillian-Jenkins test job (ol9 mgmt + kvm-ol9) has been kicked to run smoke tests

andrijapanicsb · 2026-06-11T00:46:13Z

@blueorangutan test ol9 kvm-ol9

blueorangutan · 2026-06-11T00:48:04Z

@andrijapanicsb a [SL] Trillian-Jenkins test job (ol9 mgmt + kvm-ol9) has been kicked to run smoke tests

andrijapanicsb · 2026-06-11T03:18:47Z

TL;DR --> hw/sw setup + what was tested + clock-measured timing results for VM HA to kick in:

- KVM host: HPE DL360 / iLO5 (Ubuntu 24.04 for both the management server and the KVM host)
- Driver: IPMI 2.0 (the Redfish driver, which is the main reason for this PR, is yet to be tested)
- CloudStack: stock 4.22.1.0 vs. 4.22.1.0 with this PR (applied by replacing the fat JAR)
- Primary storage: OCFS2 SharedMountPoint

What was measured/tested:

- Functional testing (the feature is not broken; Redfish, which was the reason for this patch, is yet to be tested)
  - NFS primary storage was not tested (and assuming NOT NFSv3 = no locking of qcow2 = not an important factor/variable)
- Only partial tuning was done (see the global settings below), because the focus was on a different thing and not on minimal VM downtime
- The PR also reduced VM downtime:
  - down from about 8 minutes to about 2.5 minutes (confirmed by running the test twice, not only once)
- Clock-measured timing results with and without this patch

KVM.ha global configs changed:

| Setting | Test Value | Default Value |
|---|---|---|
| `kvm.ha.health.check.timeout` | 15 | 10 |
| `kvm.ha.activity.check.timeout` | 30 | 60 |
| `kvm.ha.activity.check.interval` | 30 | 60 |
| `kvm.ha.activity.check.max.attempts` | 5 | 10 |
| `kvm.ha.activity.check.failure.ratio` | 0.6 | 0.7 |
| `kvm.ha.degraded.max.period` | 180 | 300 |
| `kvm.ha.recover.wait.period` | 180 | 600 |
| `kvm.ha.fence.timeout` | 120 | 60 |
| `kvm.ha.recover.failure.threshold` | 0 | 1 |

The last setting makes CloudStack skip any attempt to "recover" the host with the BMC POWER RESET command (i.e. it tries 0 times). Instead, it fences the host immediately with the BMC POWER OFF command, since the host has already reached the "Degraded" state and needs help (kill or fix).

Testing premise: we don't care whether the host is recovered or stays powered off.
We care about minimal VM downtime when the host is in a bad state:
- once the host is declared unhealthy, STONITH/fence it immediately and ensure VM-HA kicks in, instead of retrying one or more host resets without triggering VM-HA. We can't guarantee that the host will be fine after the OS reboot, so don't risk long VM downtime during the recovery period.

Host HA fencing improvement: handle already-powered-off hosts and reduce HA VM restart delay

This PR addresses a Host HA fencing scenario observed during testing on a physical environment using HPE iLO5 / BMC-based out-of-band management with the IPMI driver (Redfish, which was the reason for this patch, is yet to be tested).

The test environment was based on Apache CloudStack 4.22.1 with KVM. Primary storage was configured as CloudStack shared mount point storage backed by an OCFS2 clustered filesystem, which is now supported for Host HA. Host HA was enabled only on a single selected host for this test.

On that host, we placed two VMs:

| VM type | HA setting | Expected behavior after host failure |
|---|---|---|
| HA-enabled VM | Created from a compute offering with HA enabled | Should be restarted on another suitable host after fencing |
| Non-HA VM | Created from a compute offering without HA | Should remain stopped and not be restarted automatically |

A fat jar was produced from a branch based directly on the CloudStack 4.22.1 tag. The jar was extracted from the built RPM package and used for testing.

Scenario being tested

The test intentionally simulated a somewhat unusual but important failure scenario: the host was manually powered off through the BMC / IPMI / iLO interface before CloudStack completed its Host HA fencing flow.

This scenario matters because, depending on the out-of-band driver implementation, sending a power-off command to a chassis that is already powered off may return an error (Redfish does this; IPMI is not affected) or otherwise be interpreted as a failed fencing operation.

The important point is that CloudStack should not treat “the host is already powered off” as a fencing failure. If the final power state is off, the host is effectively fenced and VM HA can safely proceed.

Logic introduced by the patch

The patched logic changes the fencing flow to be state-driven instead of relying only on the return status of the BMC power-off command.

The intended behavior (after the host reaches the Degraded state) is:

- Before sending a power-off command, query the current chassis power state.
- If the chassis is already powered off, treat the host as already fenced.
- If the chassis is still powered on, send the power-off command.
- Do not rely only on the raw command return code.
- After the command completes, query the chassis power state again.
- If the chassis is confirmed powered off, mark the host as fenced / down and allow VM HA to proceed.
- If the chassis is still powered on, fencing should not be considered successful.

In short: the final observed power state is what matters. If the chassis is off, the host is fenced.

Test results

The test confirmed the expected VM HA behavior:

| Test case | Manual chassis power-off time | Host reached Alert state | Host marked Down / fenced | HA-caused "VM.START" event | Approx. time until HA restart |
|---|---|---|---|---|---|
| Before patch | 16:00:00 | 16:02:30 | 16:07:55 | 16:07:56 | ~7m 56s |
| With patch | 16:16:00 | Not separately recorded | 16:18:39 | 16:18:40 | ~2m 40s |

Before the patch, the host reached Alert state after approximately 2 minutes and 30 seconds, but it was not marked Down / fenced until 16:07:55. VM-HA then started the HA-enabled VM only: the VM.START event was observed one second later, at 16:07:56. This means the HA-enabled VM experienced roughly 8 minutes of downtime before the restart began.
VMs that are not HA-enabled were marked as down (it is debatable whether this is "OK" behaviour: if the underlying infrastructure dies, the user still expects the VM to be running).

With the patched logic (replacing the fat jar), the same type of test was repeated. The chassis was manually powered off at 16:16:00. The host was marked Down / fenced at 16:18:39, and the HA-enabled VM start event was observed one second later, at 16:18:40. This reduced the time before HA restart from roughly 8 minutes to roughly 2 minutes and 40 seconds.

The non-HA VM was not restarted in either case, which is the expected behavior.

Result

The patch reduced the observed HA VM restart delay by approximately 5 minutes and 16 seconds in this test scenario.
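That figure follows directly from the timestamps in the results table. A small sketch of the arithmetic, with illustrative class and method names:

```java
import java.time.Duration;
import java.time.LocalTime;

final class HaDelaySketch {
    private HaDelaySketch() { }

    // Elapsed time between two same-day wall-clock timestamps (HH:mm:ss).
    static Duration between(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to));
    }

    // Renders a duration as "Xm Ys".
    static String format(Duration d) {
        return d.toMinutes() + "m " + (d.getSeconds() % 60) + "s";
    }

    // Power-off to VM.START, before and with the patch, and the difference.
    static String summary() {
        Duration before = between("16:00:00", "16:07:56");
        Duration after = between("16:16:00", "16:18:40");
        return format(before) + " -> " + format(after)
                + " (saved " + format(before.minus(after)) + ")";
    }
}
```

`HaDelaySketch.summary()` yields `7m 56s -> 2m 40s (saved 5m 16s)`.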

More importantly, it makes the fencing logic safer and more deterministic: if the host is already powered off, CloudStack should recognize that condition as a successful fencing state rather than waiting longer or treating the operation as failed because the power-off command itself did not behave as expected (as with the Redfish protocol).

This allows Host HA to proceed much sooner while still preserving the important safety rule: VM HA should only be triggered after the host has been confirmed powered off / fenced.

blueorangutan · 2026-06-11T15:39:10Z

[SF] Trillian test result (tid-16285)
Environment: kvm-ol9 (x2), zone: Advanced Networking with Mgmt server ol9
Total time taken: 51009 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr13377-t16285-kvm-ol9.zip
Smoke tests completed. 149 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File

boring-cyborg Bot added the component:kvm label Jun 8, 2026

andrijapanicsb requested review from DaanHoogland, GabrielBrascher, harikrishna-patnala and sureshanaparti June 8, 2026 17:20

github-actions Bot mentioned this pull request Jun 8, 2026

[repo-status] Daily Status Report – June 8, 2026 #13381

Closed

apache deleted a comment from blueorangutan Jun 8, 2026

winterhazel added this to the 4.22.2 milestone Jun 9, 2026

winterhazel added status:needs-testing status:needs-review labels Jun 9, 2026

github-actions Bot mentioned this pull request Jun 9, 2026

[repo-status] Daily Status Report – June 9, 2026 #13386

Open

apache deleted a comment from blueorangutan Jun 10, 2026

andrijapanicsb wants to merge 1 commit into apache:4.22 from andrijapanicsb:fix/kvm-ha-fence-already-off

3 participants