Add creation of OVN DB backups #557

Open

otherwiseguy wants to merge 1 commit into openstack-k8s-operators:main from otherwiseguy:OSPRH-28871

Conversation

@otherwiseguy

otherwiseguy commented Apr 17, 2026

In case rollback is required, generate backup files for the OVN DB files.

Resolves: OSPRH-28871

Signed-off-by: Terry Wilson <twilson@redhat.com>
openshift-ci bot requested review from slawqo and stuggi on April 17, 2026 01:00
@openshift-ci
Contributor

openshift-ci Bot commented Apr 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: otherwiseguy
Once this PR has been reviewed and has the lgtm label, please assign slawqo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci Bot commented Apr 17, 2026

Hi @otherwiseguy. Thanks for your PR.

I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

done
}

function wait_for_ovsdb_client {
Author

This is removed because ovsdb-client is not run in the background; there is nothing to wait on.

Contributor

Actually, this was not added for any background task, but for cases when the pod is terminated in the middle of an ovsdb-client run. It's similar to what we do for the ovsdb-tool handling, where an empty db caused issues in the past when pods were terminated.
Or did you observe some issue with these checks, or is this just a cleanup task?

Author

Reading the code, the only reason I could see for the ovsdb-tool wait was that setup.sh ran ovn-ctl with '&' to background it, and ovsdb-tool/ovsdb_tool() is called in ovs-lib in things like create_cluster, upgrade_cluster, etc.

ovsdb-client, from what I can tell, is only called via ovn-ctl through ovs-lib's upgrade_cluster(), which we would never call with this patch applied because we pass --no-db-cluster-schema-upgrade to ovn-ctl. So since the only place it will ever be called is ultimately from setup.sh, in the foreground, there would never be any reason to wait for the execution of ovsdb-client.

Let me know if I'm missing something.
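
A minimal sketch of the distinction being discussed, assuming plain shell semantics (variable names are taken from the quoted script; both calls are illustrative, not the PR's exact code):

# Backgrounded: the script can outlive the child, so an EXIT trap
# that waits for it is meaningful.
ovsdb-client convert unix:$db_socket "$db_schema" &
trap 'wait $!' EXIT

# Foreground: the shell blocks until ovsdb-client returns, so by the
# time any EXIT trap fires there is no conversion left to wait on.
ovsdb-client convert unix:$db_socket "$db_schema"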

Contributor

Since ovsdb-client is being called at:

if test X"`ovsdb-client needs-conversion unix:$db_socket "$db_schema"`" = Xyes; then
    echo "Database conversion needed for ${DB_TYPE} database, performing conversion..."
    # Set trap to wait for ovsdb-client to finish conversion
    trap wait_for_ovsdb_client EXIT
    if ovsdb-client convert unix:$db_socket "$db_schema"; then
        echo "Database conversion completed successfully for ${DB_TYPE} database"
    else
        echo "Error: Database conversion failed for ${DB_TYPE} database"
        trap - EXIT
        return 1
    fi
    # Wait for conversion to complete and remove trap
    wait_for_ovsdb_client
    trap - EXIT
fi

The wait was added for the foreground task itself, to cover cases when pods are terminated/recreated/deleted for any reason through operator or human involvement. It may not have the same side effect as with ovsdb-tool (in that case it resulted in an empty db), but it still doesn't look harmful. Or are you seeing some issue with keeping it?

# call to ovn-ctl directly instead of start-${DB_TYPE}-db-server to pass
# extra_args after --
set /usr/share/ovn/scripts/ovn-ctl --no-monitor
set "$@" --no-db-cluster-schema-upgrade
Author

This prevents ovn-ctl from upgrading the schema, so that the only schema upgrade happens in our check_and_convert_db step.
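
For context, this set-based pattern typically ends with an exec of the accumulated arguments. A sketch of the hand-off (the run_${DB_TYPE}_ovsdb subcommand and extra args here are assumptions, not the PR's exact code):

set /usr/share/ovn/scripts/ovn-ctl --no-monitor
set "$@" --no-db-cluster-schema-upgrade
# "$@" now holds the full ovn-ctl command line; extra args for
# ovsdb-server go after the -- separator.
exec "$@" run_${DB_TYPE}_ovsdb -- $extra_args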

@otherwiseguy
Author

Tested manually by running RHOSO 18.0.1 ovndb containers and then switching to the current ci-framework-installed image (nb schema 7.3.0 -> 7.12.0); the backup file was created.

db_version=$(ovsdb-client get-schema-version "unix:$db_socket" "$db_name")
local backup_file="${DB_FILE%.db}-backup-v${db_version}.db"
echo "Backing up ${db_name} v${db_version} to ${backup_file}..."
ovsdb-client backup "unix:$db_socket" "$db_name" > "$backup_file"
Contributor

The ovndbcluster[-nb/-sb] pod has this startup_probe:
startupProbe := &corev1.Probe{
    // TODO might need tuning
    TimeoutSeconds:      5,
    PeriodSeconds:       3,
    FailureThreshold:    20,
    InitialDelaySeconds: 3,
}
It means that it has 63 seconds (FailureThreshold * PeriodSeconds + InitialDelaySeconds) to become ready. On large deployments, can this operation take more than a minute?

Author

@averdagu It's conceivable it could be a problem. With a test SBDB that I just loaded with 2M FDB entries (around 250MB), it took around 35 seconds on my laptop to do the backup. I've seen 500MB DBs occasionally. It'd be pretty unlikely to be a problem with the NBDB, though I've seen some bigger NBDBs with tons of ACLs, etc.
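
A quick way to sanity-check backup time against that ~63-second budget (the socket path and output file are assumptions for illustration):

# Time a backup of a large southbound DB and compare against the
# startup-probe window.
time ovsdb-client backup unix:/run/ovn/ovnsb_db.sock OVN_Southbound > /tmp/sb-backup.db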

Author

@averdagu is it ok to bump the timeout? Or should we just insist that backup is a manual process that is done before upgrade?
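
If it stays manual, the pre-upgrade flow could look roughly like this (a sketch; socket paths and file names are assumptions):

# Before the upgrade: snapshot the northbound DB.
ovsdb-client backup unix:/run/ovn/ovnnb_db.sock OVN_Northbound > ovnnb-pre-upgrade.db

# If rollback is needed: replay the snapshot into the running server.
ovsdb-client restore unix:/run/ovn/ovnnb_db.sock OVN_Northbound < ovnnb-pre-upgrade.db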

Contributor

At least it would not be a problem for the startup and liveness probes, as those check for the ovsdb-server process, and we run the conversion after starting that. The only thing that should be confirmed is whether the readiness probes, where we check cluster membership, are impacted by this.

@karelyatin
Contributor

Also observed backup/restore activity in #559. Are those linked, or was that unknown and just a parallel effort? Just trying to get some context on all these parallel efforts and see if we can refine them together.

@otherwiseguy
Author

Also observed backup/restore activity in #559. Are those linked, or was that unknown and just a parallel effort? Just trying to get some context on all these parallel efforts and see if we can refine them together.

Someone asked me in a jira issue if I thought that we needed something like that, and I told them that I didn't think there would be any benefit to doing it. I wasn't aware someone had gone ahead and implemented it. Given that the neutron db is the source of truth, and outside of some initial static configuration that the operator would already handle, I believe we aim for all of the OVSDB DBs to be regenerable from the contents of the neutron DB. Add to that that the point of an active-active cluster is to have live backups as well, and that any manually taken backup would have to be synced to the neutron db just as an empty db would, and I don't really see the benefit of doing #559. Maybe someone else can come up with a reason I can't think of.

@stuggi
Contributor

stuggi commented Apr 24, 2026

Also observed backup/restore activity in #559. Are those linked, or was that unknown and just a parallel effort? Just trying to get some context on all these parallel efforts and see if we can refine them together.

Someone asked me in a jira issue if I thought that we needed something like that, and I told them that I didn't think there would be any benefit to doing it. I wasn't aware someone had gone ahead and implemented it. Given that the neutron db is the source of truth, and outside of some initial static configuration that the operator would already handle, I believe we aim for all of the OVSDB DBs to be regenerable from the contents of the neutron DB. Add to that that the point of an active-active cluster is to have live backups as well, and that any manually taken backup would have to be synced to the neutron db just as an empty db would, and I don't really see the benefit of doing #559. Maybe someone else can come up with a reason I can't think of.

That was me. FYI, for now I have submitted a PR to our upstream full-backup doc to optionally add OVN backups: openstack-k8s-operators/dev-docs#172. Let me know if you think it should be the default and we can change it.

The reason why @lmiccini started #559 is to look into an interface for OVN like the one there is for galera. It was done/started after you shared the correct steps.


db_name="OVN_Southbound"
fi
local db_version
db_version=$(ovsdb-client get-schema-version "unix:$db_socket" "$db_name")
Contributor

The only concern here is that these can pile up, one per db upgrade, so we should decide how many of them to persist.
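
One possible way to cap the pile-up, using the naming scheme from the patch (a sketch; the retention count of three is an assumption):

# Keep only the three most recent backups for this DB file.
ls -1t "${DB_FILE%.db}"-backup-v*.db 2>/dev/null | tail -n +4 | xargs -r rm -f --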

Author

I think I'm just going to abandon the automatic backup idea. They can document making manual backups, and honestly I think the idea of backups is ultimately kind of silly when you have to do an ovn-db-sync-util run in any case, and it can recreate everything from nothing. (Bugs in that util are the only real reason for the backup, and we should fix those, but a manual "just in case" process seems sufficient.)
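
For reference, the repair path alluded to here is the Neutron sync utility, roughly (config-file paths are assumptions):

# Rebuild the OVN DB contents from the Neutron database (the source
# of truth); repair mode syncs differences into a running OVN DB.
neutron-ovn-db-sync-util \
  --config-file /etc/neutron/neutron.conf \
  --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
  --ovn-neutron_sync_mode repair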
