Add creation of OVN DB backups #557
Conversation
In case rollback is required, generate backup files for the OVN DB files.

Resolves: OSPRH-28871
Signed-off-by: Terry Wilson <twilson@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: otherwiseguy. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hi @otherwiseguy. Thanks for your PR. I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
    done
    }

    function wait_for_ovsdb_client {
This is removed because ovsdb-client is not run in the background, so there is nothing to wait on.
Actually, this was not added for any background task but for cases when pods are terminated in the middle of an ovsdb-client run. It's similar to what we do for ovsdb-tool handling, where an empty db caused issues in the past when pods were terminated.
Or did you observe some issue with these checks, or is this just a cleanup task?
Reading the code, the only reason I could see for the ovsdb-tool wait was that setup.sh ran ovn-ctl with '&' to background it, and ovsdb-tool/ovsdb_tool() is called in ovs-lib by things like create_cluster, upgrade_cluster, etc.
ovsdb-client, from what I can tell, is only called via ovn-ctl through ovs-lib's upgrade_cluster(), which we would never call with this patch applied because we pass --no-db-cluster-schema-upgrade to ovn-ctl. So, since the only place it will ever be called from is ultimately setup.sh, in the foreground, there would never be any reason to wait for the execution of ovsdb-client.
Let me know if I'm missing something.
Since ovsdb-client is being called at ovn-operator/templates/ovndbcluster/bin/functions (lines 52 to 69 in 55b3554), the waits were added for the foreground tasks themselves, to cover cases when pods are terminated/recreated/deleted for any reason, whether by the operator or by human involvement. It may not have a side effect like the ovsdb-tool case (which resulted in an empty db), but it still doesn't look harmful. Or are you seeing some issue with keeping it?
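(For context, the body of the removed helper is not shown in this hunk; the sketch below is a hypothetical reconstruction of a wait of that shape, blocking until no ovsdb-client process remains:)

    # Hypothetical sketch of the removed wait helper: block until no
    # ovsdb-client process is still running, so the pod is not torn
    # down in the middle of a client operation.
    function wait_for_ovsdb_client {
        while pgrep -x ovsdb-client > /dev/null; do
            sleep 1
        done
    }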
    # call to ovn-ctl directly instead of start-${DB_TYPE}-db-server to pass
    # extra_args after --
    set /usr/share/ovn/scripts/ovn-ctl --no-monitor
    set "$@" --no-db-cluster-schema-upgrade
This will prevent ovn-ctl from upgrading the schema, so that we alone upgrade the schema during the check_and_convert_db step.
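(To make the quoted hunk concrete: set rebuilds the positional parameters, so "$@" later expands to the full ovn-ctl command line. A minimal sketch, in which the run_nb_ovsdb target and the trailing extra_args are illustrative assumptions:)

    # Rebuild "$@" so it expands to the full command line later:
    set /usr/share/ovn/scripts/ovn-ctl --no-monitor
    set "$@" --no-db-cluster-schema-upgrade
    # For an NB pod this is roughly equivalent to running:
    #   /usr/share/ovn/scripts/ovn-ctl --no-monitor \
    #       --no-db-cluster-schema-upgrade run_nb_ovsdb -- <extra_args>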
Tested manually by running RHOSO 18.0.1 ovndb containers and then switching to the current ci-framework-installed image (nb schema 7.3.0 -> 7.12.0); a backup file was created.
    db_version=$(ovsdb-client get-schema-version "unix:$db_socket" "$db_name")
    local backup_file="${DB_FILE%.db}-backup-v${db_version}.db"
    echo "Backing up ${db_name} v${db_version} to ${backup_file}..."
    ovsdb-client backup "unix:$db_socket" "$db_name" > "$backup_file"
The ovndbcluster[-nb/-sb] pod has this startup_probe:
    startupProbe := &corev1.Probe{
        // TODO might need tuning
        TimeoutSeconds:      5,
        PeriodSeconds:       3,
        FailureThreshold:    20,
        InitialDelaySeconds: 3,
    }
It means the pod has 63 seconds (FailureThreshold * PeriodSeconds + InitialDelaySeconds = 20 * 3 + 3) to become ready. On a large deployment, can this operation take more than a minute?
@averdagu It's conceivable it could be a problem. With a test SBDB that I just loaded with 2M FDB entries (around 250MB), it took around 35 seconds on my laptop to do the backup. I've seen 500MB DBs occasionally. It'd be pretty unlikely to be a problem with the NBDB, though I've seen some bigger NBDBs with tons of ACLs, etc.
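(For anyone reproducing the measurement, the timing can be taken directly against a running server; the socket path below is the conventional default and is an assumption about the deployment:)

    # Time a backup of the SB database over its local unix socket.
    time ovsdb-client backup unix:/var/run/ovn/ovnsb_db.sock OVN_Southbound > /tmp/sb-backup.db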
@averdagu is it ok to bump the timeout? Or should we just insist that backup is a manual process that is done before upgrade?
At least it would not be a problem with the startup and liveness probes, as those check for the ovsdb-server process, and we run the conversion after starting it. The only thing that should be confirmed is whether the readiness probes, where we check cluster membership, are impacted by this.
Also observed backup/restore activity in #559. Are those linked, or was that unknown and just a parallel effort? Just trying to get some context on all this parallel work and see if we can refine it together.
Someone asked me in a Jira issue if I thought we needed something like that, and I told them I didn't think there would be any benefit. I wasn't aware someone had gone ahead and implemented it. The Neutron DB is the source of truth, and outside of some initial static configuration that the operator would already handle, I believe we aim for all of the OVSDB DBs to be regenerable from the contents of the Neutron DB. Add to that that the point of an active-active cluster is to have live backups, and that any manually taken backup would have to be synced to the Neutron DB the same as an empty db would, and I don't really see the benefit of doing #559. Maybe someone else can come up with a reason I can't think of.
That was me. FYI, for now I have submitted a PR to our upstream full-backup doc to optionally add OVN backups: openstack-k8s-operators/dev-docs#172. Let me know if you think it should be the default and we can change it to be. The reason @lmiccini started #559 is to look into an interface for OVN like the one there is for Galera. It was done/started after you shared the correct steps.
| db_name="OVN_Southbound" | ||
| fi | ||
| local db_version | ||
| db_version=$(ovsdb-client get-schema-version "unix:$db_socket" "$db_name") |
The only concern here is that these can pile up, one per db upgrade, so we should check how many of them we want to persist.
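(If the automatic backups were kept, one hypothetical way to cap how many persist would be to keep only, say, the three newest files matching this patch's naming scheme:)

    # Illustrative only: delete all but the three newest backups
    # produced by the ${DB_FILE%.db}-backup-v<version>.db scheme.
    ls -t "${DB_FILE%.db}"-backup-v*.db 2>/dev/null | tail -n +4 | xargs -r rm -f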
I think I'm just going to abandon the automatic backup idea. We can document making manual backups, and honestly I think the idea of backups is ultimately kind of silly when, no matter what, you have to do an ovn-db-sync-util run in any case, and it can create everything from nothing. (Bugs in that util are the only real reason for the backup, and we should fix those, but a manual "just in case" process seems sufficient.)
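(For reference, the manual "just in case" process amounts to standard ovsdb-client backup/restore calls; the socket path below is the conventional default and may differ inside the pods:)

    # Manual backup of the NB database over its local unix socket:
    ovsdb-client backup unix:/var/run/ovn/ovnnb_db.sock OVN_Northbound > nb-backup.db
    # ...and, if a rollback is ever needed, restore from that file:
    ovsdb-client restore unix:/var/run/ovn/ovnnb_db.sock OVN_Northbound < nb-backup.db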