Add OVNDBBackup and OVNDBRestore CRDs for managed backup/restore#559

Open: lmiccini wants to merge 4 commits into openstack-k8s-operators:main from lmiccini:ovndbbackup


@lmiccini lmiccini commented Apr 20, 2026

Introduces two new Custom Resource Definitions for automated OVN database backup and restore operations:

OVNDBBackup:

  • Scheduled backups via CronJob using ovsdb-client backup
  • Configurable retention policy
  • TLS support for database connections
  • Persistent storage for backup files (survives CR deletion)
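One way a retention policy like the one listed above could be applied is to sort the backup files and drop everything except the newest N. The sketch below is only an illustration: `backupsToDelete` and the file-name layout are hypothetical and are not taken from this PR; it assumes names embed a timestamp that sorts lexically in chronological order.

```go
package main

import (
	"fmt"
	"sort"
)

// backupsToDelete returns the backup names that fall outside the retention
// window, oldest first. Hypothetical helper, not the operator's code.
// Assumption: lexical order of the names equals chronological order.
func backupsToDelete(names []string, keep int) []string {
	if keep < 0 {
		keep = 0
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	if len(sorted) <= keep {
		return nil
	}
	return sorted[:len(sorted)-keep]
}

func main() {
	// Made-up file names for illustration only.
	files := []string{
		"ovnnb-20260420T150000.db",
		"ovnnb-20260418T150000.db",
		"ovnnb-20260419T150000.db",
	}
	fmt.Println(backupsToDelete(files, 2))
}
```

With `keep` set to 2, only the oldest of the three example files is selected for deletion.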

OVNDBRestore:

  • Phase-based state machine: Validating → ScalingDown → Restoring → ScalingUp → Completed
  • Annotation-based replica override prevents higher-level operators from interfering during restore
  • Force-deletes pods during scale-down (preStop hooks hang when all RAFT members terminate simultaneously)
  • Deletes PVCs to prevent stale RAFT membership state
  • Copies standalone backup to pod-0 PVC and lets ovn-ctl handle the standalone-to-clustered conversion on startup
  • Controlled scale-up: pod-0 first, then remaining replicas
  • Post-restore DB verification via exec
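The phase sequence above can be sketched as a small linear state machine. This is a minimal illustration, not the controller's actual code: the `Phase` type, the constant names, and `nextPhase` are assumptions, and failure handling is left out entirely.

```go
package main

import "fmt"

// Phase names mirror the ones listed in the PR description; the Go
// identifiers themselves are hypothetical.
type Phase string

const (
	PhaseValidating  Phase = "Validating"
	PhaseScalingDown Phase = "ScalingDown"
	PhaseRestoring   Phase = "Restoring"
	PhaseScalingUp   Phase = "ScalingUp"
	PhaseCompleted   Phase = "Completed"
)

var phaseOrder = []Phase{
	PhaseValidating,
	PhaseScalingDown,
	PhaseRestoring,
	PhaseScalingUp,
	PhaseCompleted,
}

// nextPhase returns the phase that follows current once its work is done.
// Completed is terminal; an unknown or empty phase starts at Validating.
func nextPhase(current Phase) Phase {
	for i, p := range phaseOrder {
		if p == current {
			if i == len(phaseOrder)-1 {
				return p
			}
			return phaseOrder[i+1]
		}
	}
	return PhaseValidating
}

func main() {
	// Walk a fresh restore (empty phase) through to completion.
	p := Phase("")
	for p != PhaseCompleted {
		p = nextPhase(p)
		fmt.Println(p)
	}
}
```

In a real reconciler each phase would advance only after its own work succeeds, so a step like this would be driven by the reconcile loop rather than a plain `for`.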

Also includes:

  • Functional tests for both controllers
  • Webhooks with validation
  • Documentation

@openshift-ci openshift-ci Bot requested review from slawqo and stuggi April 20, 2026 15:02
@openshift-ci
Contributor

openshift-ci Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lmiccini
Once this PR has been reviewed and has the lgtm label, please assign stuggi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Introduces two new Custom Resource Definitions for automated OVN database
backup and restore operations:

OVNDBBackup:
- Scheduled backups via CronJob using ovsdb-client backup
- Configurable retention policy
- TLS support for database connections
- Persistent storage for backup files (survives CR deletion)

OVNDBRestore:
- Phase-based state machine: Validating → ScalingDown → Restoring →
  ScalingUp → Completed
- Annotation-based replica override prevents higher-level operators
  from interfering during restore
- Force-deletes pods during scale-down (preStop hooks hang when all
  RAFT members terminate simultaneously)
- Deletes non-pod-0 PVCs to prevent stale RAFT membership state
- Copies standalone backup to pod-0 PVC and lets ovn-ctl handle the
  standalone-to-clustered conversion on startup
- Controlled scale-up: pod-0 first, then remaining replicas
- Post-restore DB verification via exec
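The "pod-0 first, then remaining replicas" step can be pictured as a function that picks the replica count to request at each reconcile. This is a sketch under assumptions: `scaleUpTarget` and its parameters are hypothetical names, and how the controller actually detects pod-0 readiness is not shown in this conversation.

```go
package main

import "fmt"

// scaleUpTarget returns the replica count to request during a staged
// scale-up: a single replica until pod-0 is ready, then the full count.
// Hypothetical helper for illustration only.
func scaleUpTarget(desired int32, pod0Ready bool) int32 {
	if desired <= 0 {
		return 0
	}
	if !pod0Ready {
		// Only pod-0 runs at first so it can start from the restored
		// database before any other member joins the cluster.
		return 1
	}
	return desired
}

func main() {
	fmt.Println(scaleUpTarget(3, false)) // pod-0 not ready yet
	fmt.Println(scaleUpTarget(3, true))  // pod-0 ready, scale to full size
}
```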

Also includes:
- Functional tests for both controllers
- Webhooks with validation
- Documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/1f189e72fde444b2908cea46d7183592

✔️ openstack-k8s-operators-content-provider SUCCESS in 44m 52s
ovn-operator-tempest-multinode FAILURE in 27m 15s

@karelyatin
Contributor

@lmiccini can you add some more context/Jira here? I also noticed some other backup-related effort as part of #557.

Tests the full backup/restore lifecycle: deploy OVN with 3 replicas,
seed test data, create backup, trigger manual backup job, restore from
backup, and verify data survives the restore. Force-deletes pods during
cleanup to avoid preStop hook hangs when the entire cluster is torn down.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmiccini
Author

> @lmiccini can you add some more context/Jira here? I also noticed some other backup-related effort as part of #557.

Hey Yatin, yes this is to provide an API around that script, following the mariadb-operator approach. This way we can orchestrate backups and restores.

Delete pod-0's PVC in phaseScaleDown and recreate it in phaseRestore
before creating the restore Job. With local-storage, pod-0's PVC may
be bound to a PV on a different node than the backup PVC, causing a
volume node affinity conflict that prevents the restore Job pod from
scheduling. Recreating the PVC lets WaitForFirstConsumer bind it to
a PV on the same node as the backup PVC.
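The node affinity conflict described above can be modelled in a few lines. This is a deliberately simplified sketch, not the Kubernetes scheduler: `commonNode` is a hypothetical helper, each argument stands for the node a bound local PV is pinned to, and an empty string models a PVC that is still unbound under WaitForFirstConsumer.

```go
package main

import "fmt"

// commonNode reports the single node on which a pod mounting all the given
// volumes could run. An empty string means "not pinned yet" and matches any
// node. Simplified model for illustration only.
func commonNode(pinned ...string) (string, bool) {
	node := ""
	for _, n := range pinned {
		if n == "" {
			continue // unbound PVC: will follow the pod's placement
		}
		if node != "" && node != n {
			return "", false // two volumes pinned to different nodes
		}
		node = n
	}
	return node, true
}

func main() {
	// Node names are made up. Backup PVC on worker-1, old pod-0 PVC on
	// worker-2: the restore Job pod cannot be scheduled.
	fmt.Println(commonNode("worker-1", "worker-2"))
	// After recreating pod-0's PVC it is unbound, so the pod can land on
	// worker-1 and the new PVC binds there.
	fmt.Println(commonNode("worker-1", ""))
}
```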

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/fac4823f5d0443d2b4f5d73efcc02a72

✔️ openstack-k8s-operators-content-provider SUCCESS in 46m 27s
ovn-operator-tempest-multinode FAILURE in 21m 24s

@lmiccini
Author

recheck

Allow specifying a BACKUP_TIMESTAMP on backup jobs and a backupTimestamp
field on OVNDBRestore so that OVN DB backups can participate in a
coordinated full-environment backup/restore workflow alongside Galera
and OADP using a single shared timestamp.
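A shared timestamp of this kind could be consumed roughly as follows. Only the `BACKUP_TIMESTAMP` variable name comes from the commit message; the `backupTimestamp` helper, the timestamp format, and the file name are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// backupTimestamp prefers an externally supplied timestamp so several
// components can tag their backups identically, and falls back to a locally
// generated one otherwise. Hypothetical helper, not the operator's code.
func backupTimestamp(fromEnv, fallback string) string {
	if fromEnv != "" {
		return fromEnv
	}
	return fallback
}

func main() {
	// Assumed format; the PR conversation does not state the real one.
	now := time.Now().UTC().Format("20060102T150405Z")
	ts := backupTimestamp(os.Getenv("BACKUP_TIMESTAMP"), now)
	// Hypothetical file name used only to show where the value ends up.
	fmt.Printf("ovnnb-%s.db\n", ts)
}
```

Passing the same value to every backup job in a coordinated run would let a later restore select matching backups across components by that one timestamp.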

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c55d7e0e04dc4eeb99e749d2c1bb0a04

✔️ openstack-k8s-operators-content-provider SUCCESS in 44m 06s
ovn-operator-tempest-multinode FAILURE in 26m 50s

@lmiccini
Author

recheck

@otherwiseguy

otherwiseguy commented Apr 23, 2026

I have not yet read the patch, but I'm not sure what the use case is for scheduled ovsdb backups. We already have an active-active cluster, and aside from non-changing initial configuration data, the entire database can be generated from the neutron db and ovn-northd. In addition, any backup will be out of date and need to be synced with the neutron db--basically the same process as not having a backup.

@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

PR needs rebase.

