
KVM HA: fence by confirming host power state (fix host stuck in Fencing when already powered off)#13377

Open: andrijapanicsb wants to merge 1 commit into apache:4.22 from andrijapanicsb:fix/kvm-ha-fence-already-off

Conversation

@andrijapanicsb

Contributor

Description

When a KVM host with host-HA + out-of-band management (OOBM) enabled is hard powered off (forced chassis-off from the BMC, or a real power/cable failure), CloudStack never transitions the host to Down and therefore never restarts its VMs on other hosts — the host stays in Alert/Disconnected indefinitely.

Root cause: the host-HA state machine declares a host dead (HAState.Fenced → investigator Status.Down) only after a successful OOBM power-off. Against an already-off chassis the BMC rejects the power-off (the Redfish driver maps OFF to GracefulShutdown, which returns HTTP 409 when the system is already off), so KVMHAProvider.fence() reports failure and the host stays stuck in the Fencing state — which HAManagerImpl.getHostStatusFromHAConfig() maps to Status.Disconnected, not Status.Down. VM-HA is therefore never invoked, and the VMs are only recovered once the original (dead) host is powered back on, at which point the pending power-off finally succeeds.
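The failure mode described above can be illustrated with a minimal, hypothetical sketch. None of the types below are CloudStack classes; `Bmc`, `AlreadyOffChassis`, `fenceByCommandResult` and `hostStatusFor` are names invented for illustration, and only the two state mappings named in the description are modelled.

```java
// Hypothetical sketch: none of these types are CloudStack classes.
final class PreFixFenceSketch {
    /** Minimal stand-in for an out-of-band power controller (a BMC). */
    interface Bmc {
        /** Returns true if the BMC accepted the power-off command. */
        boolean powerOff();
    }

    /** A chassis that is already off and rejects a further power-off. */
    static final class AlreadyOffChassis implements Bmc {
        @Override
        public boolean powerOff() {
            return false; // models the HTTP 409 described above
        }
    }

    /** Pre-fix rule: the fence result is simply the power-off command result. */
    static boolean fenceByCommandResult(Bmc bmc) {
        return bmc.powerOff();
    }

    /** The two HA-state-to-host-status mappings named in the description. */
    static String hostStatusFor(String haState) {
        switch (haState) {
            case "Fencing": return "Disconnected";
            case "Fenced":  return "Down";
            default: throw new IllegalArgumentException("not covered by this sketch: " + haState);
        }
    }

    public static void main(String[] args) {
        boolean fenced = fenceByCommandResult(new AlreadyOffChassis());
        String haState = fenced ? "Fenced" : "Fencing";
        // The host never leaves Fencing, so it is reported as Disconnected, not Down.
        System.out.println(haState + " -> " + hostStatusFor(haState));
    }
}
```

Because the command result is the only input, an already-off chassis can never produce a successful fence in this model, however many times the fence is retried.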

Observed in production with Redfish/iDRAC. Full root-cause analysis and management-server log evidence are in #13376.

Fix

Fencing now succeeds based on the actual chassis power state, not the power-off command's return code:

  • if the host is already powered off (OOBM STATUS == Off) → treat it as fenced (no power-off issued);
  • otherwise issue a best-effort power-off and then confirm via OOBM STATUS;
  • only a confirmed Off state counts as a successful fence; if the state cannot be confirmed (e.g. an unreachable BMC) the fence fails and is retried, to avoid split-brain.

This is OOBM-driver-agnostic (works for ipmitool, Redfish and nested-cloudstack drivers).
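As a rough illustration of the state-driven rule, here is a minimal sketch under stated assumptions. `PowerController`, `PowerState`, `Fake` and `fence` are hypothetical names and not CloudStack's OOBM API; the sketch only shows the decision order (status first, best-effort power-off, status again).

```java
import java.util.Optional;

// Hypothetical sketch: these types are illustrative, not CloudStack's OOBM API.
final class StateDrivenFenceSketch {
    enum PowerState { ON, OFF }

    /** Stand-in for any OOBM driver (ipmitool, Redfish, nested-cloudstack). */
    interface PowerController {
        /** Current chassis power state, or empty if the BMC cannot be reached. */
        Optional<PowerState> status();
        /** Best-effort power-off; may return false or throw (e.g. on HTTP 409). */
        boolean powerOff();
    }

    /** Returns true only when the chassis is confirmed Off. */
    static boolean fence(PowerController oobm) {
        if (isConfirmedOff(oobm)) {
            return true; // already off: fenced, no power-off issued
        }
        try {
            oobm.powerOff(); // return value deliberately ignored
        } catch (RuntimeException e) {
            // A rejected command is not decisive; the status check below is.
        }
        // Unknown and On both count as "not fenced", so the caller retries.
        return isConfirmedOff(oobm);
    }

    private static boolean isConfirmedOff(PowerController oobm) {
        try {
            return oobm.status().filter(s -> s == PowerState.OFF).isPresent();
        } catch (RuntimeException e) {
            return false; // unreachable BMC: the state cannot be confirmed
        }
    }

    /** Scripted controller for trying the sketch out. */
    static final class Fake implements PowerController {
        private Optional<PowerState> state;
        private final Optional<PowerState> stateAfterPowerOff;
        private final boolean commandResult;
        int powerOffCalls = 0;

        Fake(Optional<PowerState> initial, Optional<PowerState> stateAfterPowerOff, boolean commandResult) {
            this.state = initial;
            this.stateAfterPowerOff = stateAfterPowerOff;
            this.commandResult = commandResult;
        }

        @Override
        public Optional<PowerState> status() {
            return state;
        }

        @Override
        public boolean powerOff() {
            powerOffCalls++;
            state = stateAfterPowerOff;
            return commandResult;
        }
    }
}
```

For example, `fence(new Fake(Optional.of(PowerState.ON), Optional.of(PowerState.OFF), false))` models a power-off command that reports failure while the chassis still ends up off; the sketch treats that as fenced because only the confirmed state matters.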

Additionally, the Redfish driver now maps PowerOperation.OFF to ForceOff (a hard power-off) instead of GracefulShutdown — consistent with the ipmitool driver and appropriate for fencing an unresponsive host; SOFT remains the graceful ACPI shutdown. Also fixes a latent String.format argument-count bug on the Redfish STATUS branch.
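The mapping change can be pictured with the following sketch. The class, enum and method names are hypothetical and do not reproduce the driver's code; `ForceOff` and `GracefulShutdown` are `ResetType` values defined for the Redfish `ComputerSystem.Reset` action. The description does not say which form the `String.format` argument-count bug took, so the sketch simply shows how Java behaves in both directions.

```java
import java.util.MissingFormatArgumentException;

// Hypothetical sketch: the enum and method names are illustrative, not the driver's code.
final class RedfishResetMappingSketch {
    /** Only the two operations discussed above are modelled. */
    enum PowerOperation { OFF, SOFT }

    /** Maps a power operation to a Redfish ResetType value. */
    static String resetTypeFor(PowerOperation op) {
        switch (op) {
            case OFF:  return "ForceOff";         // hard power-off, suitable for fencing
            case SOFT: return "GracefulShutdown"; // graceful ACPI shutdown
            default:   throw new IllegalArgumentException("unmapped: " + op);
        }
    }

    /** JSON body for a reset request. */
    static String resetBody(PowerOperation op) {
        return String.format("{\"ResetType\": \"%s\"}", resetTypeFor(op));
    }

    /** Too few arguments for the format string: String.format throws. */
    static boolean tooFewArgumentsThrows() {
        try {
            String.format("%s/%s", "only-one");
            return false;
        } catch (MissingFormatArgumentException e) {
            return true;
        }
    }

    /** Too many arguments: the extras are silently ignored. */
    static String tooManyArguments() {
        return String.format("%s", "used", "ignored");
    }
}
```

Either mismatch compiles without error, which is presumably why such a bug can stay latent on a rarely exercised branch.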

Fixes: #13376

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature/enhancement (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

Unit tests added to KVMHostHATest (all green) covering the fence behaviour:

  • host already off → fenced without issuing a power-off;
  • power-off succeeds, STATUS confirms Off → fenced;
  • power-off command fails (HTTP 409) but STATUS confirms Off → still fenced (the regression for this issue);
  • power state cannot be confirmed (unreachable BMC) → fence fails (no split-brain);
  • OOBM not enabled → fence fails.
mvn -pl plugins/hypervisors/kvm -Dtest=KVMHostHATest test
=> Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

Note on reproduction: the original symptom reproduces on real Redfish hardware (power-off-when-off → HTTP 409). Software/nested OOBM drivers whose power-off is idempotent (e.g. the nested-cloudstack driver's stopVirtualMachine, which is a no-op on an already-stopped VM) do not exhibit the bug, so the deterministic coverage is provided by the unit tests above.

KVMHAProvider.fence() declared a host fenced only when the out-of-band power-off
command reported success. Against an already-off chassis the BMC rejects the
power-off (e.g. Redfish returns HTTP 409), so fence() failed and the host stayed
stuck in the Fencing HA state, which maps to Disconnected (not Down). VM-HA
therefore never restarted the VMs until the dead host was powered back on.

Fencing now succeeds based on the actual chassis power state:
 - if the host is already powered off (OOBM STATUS == Off), treat it as fenced;
 - otherwise issue a best-effort power-off and confirm via OOBM STATUS;
 - only a confirmed Off state counts as success; if the state cannot be confirmed
   (e.g. unreachable BMC) the fence fails and is retried, to avoid split-brain.

Also map Redfish PowerOperation.OFF to ForceOff (hard power-off) instead of
GracefulShutdown, consistent with the ipmitool driver and appropriate for fencing
an unresponsive host (SOFT remains the graceful ACPI shutdown).

Fixes apache#13376
@codecov

codecov Bot commented Jun 8, 2026


Codecov Report

❌ Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.68%. Comparing base (21b2025) to head (65a3e99).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...va/org/apache/cloudstack/kvm/ha/KVMHAProvider.java | 85.18% | 3 Missing and 1 partial ⚠️ |
| ...fbandmanagement/driver/redfish/RedfishWrapper.java | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff            @@
##               4.22   #13377   +/-   ##
=========================================
  Coverage     17.67%   17.68%           
- Complexity    15792    15798    +6     
=========================================
  Files          5922     5922           
  Lines        533165   533184   +19     
  Branches      65208    65211    +3     
=========================================
+ Hits          94242    94273   +31     
+ Misses       428276   428264   -12     
  Partials      10647    10647           

| Flag | Coverage Δ |
|---|---|
| uitests | 3.69% <ø> (ø) |
| unittests | 18.75% <79.31%> (+<0.01%) ⬆️ |


@andrijapanicsb

Contributor Author

@blueorangutan package kvm

@blueorangutan


@andrijapanicsb a [SL] Jenkins job has been kicked to build packages. It will be bundled with kvm SystemVM template(s). I'll keep you posted as I make progress.

@blueorangutan


Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18199

@apache apache deleted a comment from blueorangutan Jun 8, 2026
@andrijapanicsb

Contributor Author

@blueorangutan test ol9 kvm-ol9 keepEnv

@blueorangutan


@andrijapanicsb a [SL] Trillian-Jenkins test job (ol9 mgmt + kvm-ol9) has been kicked to run smoke tests

@andrijapanicsb

Contributor Author

@blueorangutan test ol9 kvm-ol9

@blueorangutan


@andrijapanicsb a [SL] Trillian-Jenkins test job (ol9 mgmt + kvm-ol9) has been kicked to run smoke tests

@andrijapanicsb

Contributor Author

TL;DR: hardware/software setup, what was tested, and clock-measured timings for VM HA to kick in:

  • KVM host: HPE DL360 / iLO5 (Ubuntu 24.04 on both the management server and the KVM host)
  • Driver: IPMI 2.0 (the Redfish driver, which is the main reason for this PR, is yet to be tested)
  • CloudStack: stock 4.22.1.0 vs. 4.22.1.0 + this PR (fat JAR replaced)
  • Primary storage: OCFS2 SharedMountPoint

What was measured/tested:

  • Functional testing (the feature is not broken; Redfish, which was the reason for this patch, is yet to be tested)
    • NFS Primary storage not tested (and assuming NOT NFSv3 = no locking of qcow2 = not an important factor/variable)
  • Only partial tuning was done (see the global configs below), because the focus was on a different question than minimal VM downtime
  • The PR also reduced VM downtime:
    • down from about 8 minutes to about 2.5 minutes (confirmed by running the test twice)
  • Clock-measured timings with AND without this patch/PR

KVM.ha global configs changed:

| Setting | Test Value | Default Value |
|---|---|---|
| kvm.ha.health.check.timeout | 15 | 10 |
| kvm.ha.activity.check.timeout | 30 | 60 |
| kvm.ha.activity.check.interval | 30 | 60 |
| kvm.ha.activity.check.max.attempts | 5 | 10 |
| kvm.ha.activity.check.failure.ratio | 0.6 | 0.7 |
| kvm.ha.degraded.max.period | 180 | 300 |
| kvm.ha.recover.wait.period | 180 | 600 |
| kvm.ha.fence.timeout | 120 | 60 |
| kvm.ha.recover.failure.threshold | 0 | 1 |

The last setting makes CloudStack skip any attempt to "recover" the host with the BMC POWER RESET command (i.e. it tries 0 times). Instead, it fences the host immediately via the BMC POWER OFF command, since the host has already reached the "Degraded" state and needs help (kill or fix).

  • Testing premise: we don't care whether the host is recovered or stays powered off.
  • We care about minimal VM downtime when the host is in a bad state
    • i.e. once the host is declared unhealthy, STONITH/fence it immediately and make sure VM-HA kicks in, instead of retrying one or more host resets without triggering VM-HA. We can't guarantee that the host will be fine after the OS reboot, so don't risk long VM downtime during the recovery period.

Host HA fencing improvement: handle already-powered-off hosts and reduce HA VM restart delay

This PR addresses a Host HA fencing scenario observed during testing on a physical environment using HPE iLO5 / BMC-based out-of-band management with the IPMI driver (Redfish, which was the reason for this patch, is yet to be tested).

The test environment was based on Apache CloudStack 4.22.1 with KVM. Primary storage was configured as CloudStack shared mount point storage backed by an OCFS2 clustered filesystem, which is now supported for Host HA. Host HA was enabled only on a single selected host for this test.

On that host, we placed two VMs:

| VM type | HA setting | Expected behavior after host failure |
|---|---|---|
| HA-enabled VM | Created from a compute offering with HA enabled | Should be restarted on another suitable host after fencing |
| Non-HA VM | Created from a compute offering without HA | Should remain stopped and not be restarted automatically |

A fat jar was produced from a branch based directly on the CloudStack 4.22.1 tag. The jar was extracted from the built RPM package and used for testing.

Scenario being tested

The test intentionally simulated a somewhat unusual but important failure scenario: the host was manually powered off through the BMC / IPMI / iLO interface before CloudStack completed its Host HA fencing flow.

This scenario matters because, depending on the out-of-band driver implementation, sending a power-off command to a chassis that is already powered off may return an error (Redfish does this; IPMI is not affected) or otherwise be interpreted as a failed fencing operation.

The important point is that CloudStack should not treat “the host is already powered off” as a fencing failure. If the final power state is off, the host is effectively fenced and VM HA can safely proceed.

Logic introduced by the patch

The patched logic changes the fencing flow to be state-driven instead of relying only on the return status of the BMC power-off command.

The intended behavior (after the host reaches the Degraded state) is:

  1. Before sending a power-off command, query the current chassis power state.
  2. If the chassis is already powered off, treat the host as already fenced.
  3. If the chassis is still powered on, send the power-off command.
  4. Do not rely only on the raw command return code.
  5. After the command completes, query the chassis power state again.
  6. If the chassis is confirmed powered off, mark the host as fenced / down and allow VM HA to proceed.
  7. If the chassis is still powered on, fencing should not be considered successful.

In short: the final observed power state is what matters. If the chassis is off, the host is fenced.

Test results

The test confirmed the expected VM HA behavior:

| Test case | Manual chassis power-off time | Host reached Alert state | Host marked Down / fenced | HA-caused "VM.START" event | Approx. time until HA restart |
|---|---|---|---|---|---|
| Before patch | 16:00:00 | 16:02:30 | 16:07:55 | 16:07:56 | ~7m 56s |
| With patch | 16:16:00 | Not separately recorded | 16:18:39 | 16:18:40 | ~2m 40s |

Before the patch, the host reached the Alert state after approximately 2 minutes and 30 seconds, but it was not marked Down / fenced until 16:07:55. VM-HA then started the HA-enabled VM only: the VM.START event was observed one second later, at 16:07:56. This means the HA-enabled VM experienced roughly 8 minutes of downtime before the restart began.
VMs that are not HA-enabled were marked as down (it is debatable whether this is "OK" behaviour: if the underlying infrastructure dies, the user still expects the VM to be running).

With the patched logic (replacing the fat jar), the same type of test was repeated. The chassis was manually powered off at 16:16:00. The host was marked Down / fenced at 16:18:39, and the HA-enabled VM start event was observed one second later, at 16:18:40. This reduced the time before HA restart from roughly 8 minutes to roughly 2 minutes and 40 seconds.

The non-HA VM was not restarted in either case, which is the expected behavior.

Result

The patch reduced the observed HA VM restart delay by approximately 5 minutes and 16 seconds in this test scenario.
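The figure above follows directly from the timestamps in the results table. A small sketch of the arithmetic, using only `java.time` from the standard library (the class and method names are made up for this example):

```java
import java.time.Duration;
import java.time.LocalTime;

// Illustrative arithmetic only; the timestamps are the ones in the results table.
final class HaDelaySketch {
    /** Elapsed time between two same-day wall-clock times in HH:mm:ss form. */
    static Duration between(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
    }

    /** Formats a duration as minutes and seconds. */
    static String format(Duration d) {
        return d.toMinutes() + "m " + (d.getSeconds() % 60) + "s";
    }

    public static void main(String[] args) {
        Duration before = between("16:00:00", "16:07:56"); // power-off to VM.START, before patch
        Duration after = between("16:16:00", "16:18:40");  // power-off to VM.START, with patch
        System.out.println("before patch: " + format(before));             // 7m 56s
        System.out.println("with patch:   " + format(after));              // 2m 40s
        System.out.println("reduction:    " + format(before.minus(after))); // 5m 16s
    }
}
```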

More importantly, it makes the fencing logic safer and more deterministic: if the host is already powered off, CloudStack should recognize that condition as a successful fencing state rather than waiting longer or treating the operation as failed because the power-off command itself did not behave as expected (as seen with Redfish).

This allows Host HA to proceed much sooner while still preserving the important safety rule: VM HA should only be triggered after the host has been confirmed powered off / fenced.

@blueorangutan

Copy link
Copy Markdown

[SF] Trillian test result (tid-16285)
Environment: kvm-ol9 (x2), zone: Advanced Networking with Mgmt server ol9
Total time taken: 51009 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr13377-t16285-kvm-ol9.zip
Smoke tests completed. 149 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

