Commit 8adde4d

backup(nas): fix parent-checkpoint recreation for incremental after VM restart
Found while testing on a real libvirt 10.0.0 / qemu 8.2 host: an incremental backup taken after the VM had been (re)started since the last backup FAILED. CloudStack rebuilds the domain XML on every VM start, which wipes libvirt's checkpoint registry, but the dirty bitmap persists on the qcow2 (QEMU reloads it). The old code built a minimal checkpoint XML and ran a fresh 'checkpoint-create', which QEMU rejects with 'Bitmap already exists'. The qemu-img bitmap --remove fallback then cannot run because the running VM holds a write lock on the image.

Fix: register the parent with 'checkpoint-create --redefine' using the FULL checkpoint XML (checkpoint-dumpxml output), which adopts the existing bitmap. libvirt's checkpoint RNG schema rejects a minimal/synthesized XML, so the full dump is required:

- dump-on-create: persist <bitmap>.checkpoint.xml next to each backup on the NAS
- on-recreate: redefine from the parent backup's saved XML; if it is missing (a pre-fix backup) or redefine fails, fall back to a full backup so the chain restarts.

Verified on host xcbn-paix-host3 (libvirt 10.0.0): full->dirty->incremental, then restart->redefine->incremental, all succeed; the previously broken path now works.
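The on-recreate rule in this message reduces to a three-way decision for each incremental backup. Below is a minimal POSIX-shell sketch of that decision. It assumes a hypothetical `resolve_parent_mode` helper and a hypothetical `VIRSH` variable so the libvirt calls can be stubbed. The `virsh` subcommands and flags (`checkpoint-list --name`, `checkpoint-create --xmlfile ... --redefine`) are the ones the commit itself uses. This illustrates the logic; it is not the code in `nasbackup.sh`.

```shell
# Sketch: resolve_parent_mode and VIRSH are hypothetical names, not part of
# nasbackup.sh. VIRSH lets tests swap in a stub; it defaults to the real virsh.
VIRSH="${VIRSH:-virsh}"

# Prints "incremental" when the parent checkpoint is, or can again be, known to
# libvirt; prints "full" when the backup chain has to restart.
#   $1: domain name   $2: parent checkpoint/bitmap name
#   $3: path of the checkpoint XML saved next to the parent backup
resolve_parent_mode() {
  local vm="$1" parent="$2" saved_xml="$3"

  # 1. Still registered: the VM has not been restarted since the parent backup.
  #    -F matches the name literally, -x requires a whole-line match.
  if "$VIRSH" -c qemu:///system checkpoint-list "$vm" --name 2>/dev/null |
       grep -qxF "$parent"; then
    echo incremental
    return 0
  fi

  # 2. Registry wiped but bitmap still on the qcow2: adopt it by redefining
  #    from the full XML that checkpoint-dumpxml produced for the parent.
  if [ -f "$saved_xml" ] &&
     "$VIRSH" -c qemu:///system checkpoint-create "$vm" \
       --xmlfile "$saved_xml" --redefine >/dev/null 2>&1; then
    echo incremental
    return 0
  fi

  # 3. No saved XML (pre-fix backup) or redefine rejected: restart with a full.
  echo full
}
```

Called as, for example, `mode=$(resolve_parent_mode "$VM" "$BITMAP_PARENT" "$parent_cp_xml")`, a result of `full` corresponds to the commit's `INCREMENTAL_FALLBACK=full` branch.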
1 parent ba74778 commit 8adde4d

1 file changed

Lines changed: 23 additions & 28 deletions

scripts/vm/hypervisor/kvm/nasbackup.sh

@@ -157,37 +157,28 @@ backup_running_vm() {
       ;;
   esac
 
-  # When incremental, verify the parent bitmap still exists on the running domain.
-  # CloudStack rebuilds the libvirt domain XML on every VM start, so libvirt's checkpoint
-  # registry is wiped — but the bitmap may still exist on the qcow2 itself (we pre-seed
-  # one on stopped-VM backups, see backup_stopped_vm). If the parent is missing from
-  # libvirt's view, recreate it. If it's missing entirely (qcow2 too), this falls through
-  # to a fresh-create which captures all writes since — slightly larger but correct.
+  # When incremental, make sure the parent checkpoint is registered with libvirt. CloudStack
+  # rebuilds the domain XML on every VM start, which wipes libvirt's in-memory checkpoint
+  # registry, while the dirty bitmap persists on the qcow2 (QEMU re-loads it on start). A
+  # fresh checkpoint-create cannot be used then — QEMU reports "Bitmap already exists" — and
+  # qemu-img cannot drop the bitmap on a running disk (the image is write-locked). The parent
+  # must instead be re-registered with --redefine, using the FULL checkpoint XML this script
+  # saved alongside the parent backup when it was taken (a minimal/synthesized XML is rejected
+  # by libvirt's checkpoint schema on redefine, so the full dump is required).
   if [[ "$effective_mode" == "incremental" ]]; then
     if ! virsh -c qemu:///system checkpoint-list "$VM" --name 2>/dev/null | grep -qx "$BITMAP_PARENT"; then
-      cat > $dest/recreate-checkpoint.xml <<XML
-<domaincheckpoint><name>$BITMAP_PARENT</name><disks>
-$(virsh -c qemu:///system domblklist "$VM" --details 2>/dev/null | awk '$2=="disk"{printf "<disk name=\"%s\"/>\n", $3}')
-</disks></domaincheckpoint>
-XML
-      if ! virsh -c qemu:///system checkpoint-create "$VM" --xmlfile $dest/recreate-checkpoint.xml > /dev/null 2>&1; then
-        # If a bitmap of the same name already lives on the qcow2 (pre-seeded by an
-        # offline backup) libvirt 7.2+ should reuse it during checkpoint-create. Older
-        # libvirt fails the create — clean up the orphan bitmap and retry as a fresh.
-        local retried_ok=1
-        for disk_path in $(virsh -c qemu:///system domblklist "$VM" --details 2>/dev/null | awk '$2=="disk"{print $4}'); do
-          [[ -f "$disk_path" ]] && qemu-img bitmap --remove "$disk_path" "$BITMAP_PARENT" 2>/dev/null || true
-        done
-        if ! virsh -c qemu:///system checkpoint-create "$VM" --xmlfile $dest/recreate-checkpoint.xml > /dev/null 2>&1; then
-          retried_ok=0
-        fi
-        if [[ "$retried_ok" == "0" ]]; then
-          echo "Failed to recreate parent bitmap $BITMAP_PARENT for $VM"
-          cleanup
-          exit 1
-        fi
+      parent_first="${PARENT_PATHS%%,*}"
+      parent_cp_xml="$(dirname "$parent_first")/$BITMAP_PARENT.checkpoint.xml"
+      if [[ -f "$parent_cp_xml" ]] && \
+         virsh -c qemu:///system checkpoint-create "$VM" --xmlfile "$parent_cp_xml" --redefine > /dev/null 2>&1; then
+        : # parent checkpoint re-registered; the incremental can proceed against it
+      else
+        # No saved checkpoint XML (e.g. a backup taken before this fix) or redefine failed.
+        # Fall back to a full so the chain restarts cleanly instead of failing the backup.
+        echo "INCREMENTAL_FALLBACK=full (parent checkpoint $BITMAP_PARENT could not be re-registered)"
+        effective_mode="full"
+        BITMAP_PARENT=""
       fi
-      rm -f $dest/recreate-checkpoint.xml
     fi
   fi
 
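The new code finds the parent's saved XML from the first entry of `PARENT_PATHS`. The `%%,*` expansion implies that `PARENT_PATHS` is a comma-separated list of the parent backup's file paths; this sketch assumes that format. The lookup is isolated in a hypothetical `parent_checkpoint_xml` helper so it can be checked on its own:

```shell
# Hypothetical helper mirroring the two lines the commit adds before the
# redefine. Assumes PARENT_PATHS-style input: comma-separated backup file paths.
#   $1: parent backup paths   $2: parent bitmap/checkpoint name
parent_checkpoint_xml() {
  local parent_paths="$1" bitmap="$2" first
  # ${var%%,*} removes the longest suffix starting with ",", i.e. everything
  # from the first comma onward, leaving only the first path.
  first="${parent_paths%%,*}"
  # The XML is looked up in the same directory as the parent backup's files.
  printf '%s/%s.checkpoint.xml\n' "$(dirname "$first")" "$bitmap"
}
```

Only the first entry is consulted, presumably because all disks of one backup share the same `$dest` directory. If the first entry were a relative path, `dirname` would return `.` and the XML would be resolved against the script's working directory. The sketch therefore assumes absolute paths.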
@@ -344,6 +335,10 @@ XML
   virsh -c qemu:///system domjobinfo $VM --completed
   du -sb $dest | cut -f1
   if [[ -n "$BITMAP_NEW" ]]; then
+    # Persist the FULL checkpoint XML next to this backup so a later incremental can
+    # re-register this checkpoint with --redefine after the VM restarts (which wipes
+    # libvirt's checkpoint registry but leaves the bitmap on the qcow2).
+    virsh -c qemu:///system checkpoint-dumpxml "$VM" "$BITMAP_NEW" > "$dest/$BITMAP_NEW.checkpoint.xml" 2>/dev/null || true
     # Echo the bitmap name on its own line so the Java caller can capture it for backup_details.
     echo "BITMAP_CREATED=$BITMAP_NEW"
   fi
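In the added dump step, the shell opens and truncates `$dest/$BITMAP_NEW.checkpoint.xml` before `virsh` runs. If `checkpoint-dumpxml` fails, `|| true` keeps the backup going but leaves an empty XML file behind. The recreate path only tests `-f`, so a later incremental would try `--redefine` on that empty file. That attempt should be rejected and end in the full-backup fallback, so the result is safe, but the stray file is misleading. The sketch below is a possible variant, not what the commit does. It uses a hypothetical `save_checkpoint_xml` helper and the same hypothetical `VIRSH` override for stubbing. It writes to a temporary name and keeps the file only when the dump succeeded and is non-empty:

```shell
# Hypothetical variant of the dump step added in this commit (not what the
# commit does): keep <bitmap>.checkpoint.xml only if the dump succeeded and
# produced output. VIRSH is an assumed override so tests can stub virsh.
VIRSH="${VIRSH:-virsh}"

#   $1: domain name   $2: new checkpoint/bitmap name   $3: backup directory
save_checkpoint_xml() {
  local vm="$1" bitmap="$2" dir="$3"
  local final="$dir/$bitmap.checkpoint.xml"
  local tmp="$final.tmp"

  # '>' truncates its target before virsh starts, so dump into a temporary
  # name rather than the final one.
  if "$VIRSH" -c qemu:///system checkpoint-dumpxml "$vm" "$bitmap" \
       >"$tmp" 2>/dev/null && [ -s "$tmp" ]; then
    mv -f "$tmp" "$final"
    return 0
  fi
  rm -f "$tmp"   # drop the empty or partial dump
  return 1
}
```

In the script it could be called as `save_checkpoint_xml "$VM" "$BITMAP_NEW" "$dest" || true`, which keeps a failed dump non-fatal as the original `|| true` does. The recreate check could then use `-s` instead of `-f`.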
