fix: recover mongodb after backup restore #27
wallyxjh wants to merge 1 commit into labring:fix/v0.9.3
Conversation
Pull request overview
This PR improves MongoDB’s startup/readiness behavior in lorry for the “restored from backup” path, aiming to recover from missing replica-set metadata and/or credential drift so role detection and labeling can resume correctly.
Changes:
- Extend local probing (`isMaster`/`replSetGetStatus`) to detect RS initialization vs credential mismatch and avoid stale Go-driver topology.
- Add an auto-recovery flow (init RS, create/sync root, reconnect client), primarily for the first member after restore.
- Add local `isMaster` fallbacks for member role/leader checks and reduce readiness probe failures during transient recovery.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pkg/lorry/engines/mongodb/types.go | Adds fields to IsMasterResp needed for local role/RS detection. |
| pkg/lorry/engines/mongodb/manager.go | Implements reconnect logic, local init checks, startup recovery hooks, and isMaster fallbacks. |
| pkg/lorry/engines/mongodb/get_replica_role.go | Adds restore auto-recovery orchestration and softens replica-role errors during readiness probing. |
```go
mgr.Logger.Info("Create Root failed, trying to update existing user password", "error", err)
updateErr := UpdateUserPass(ctx, client, config.Username, config.Password)
if updateErr != nil {
	mgr.Logger.Info("Update existing root password also failed", "error", updateErr.Error())
	return err
}
```
When the fallback UpdateUserPass fails, the function returns the original CreateUser error (err) rather than updateErr. Returning updateErr (or combining them) would better reflect the final failure and make debugging restore recovery issues easier.
```diff
@@ -245,13 +327,74 @@ func (mgr *Manager) CreateRoot(ctx context.Context) error {
 	mgr.Logger.Info(fmt.Sprintf("Create user: %s, passwd: %s, roles: %v", config.Username, config.Password, role))
```
This log line prints the root password in plaintext. Please redact/remove the password from logs (and ideally avoid logging credentials entirely) to prevent secret leakage in pod logs and aggregated logging systems.
```diff
- mgr.Logger.Info(fmt.Sprintf("Create user: %s, passwd: %s, roles: %v", config.Username, config.Password, role))
+ mgr.Logger.Info(fmt.Sprintf("Create user: %s, roles: %v", config.Username, role))
```
```go
defer client.Disconnect(context.TODO()) //nolint:errcheck
```
Disconnect(context.TODO()) can block without a timeout and ignores cancellation, which is risky in probe/role-detection paths. Prefer using a short context.WithTimeout(context.Background(), ...) (or reuse ctx if it has sufficient timeout) when disconnecting clients created for local probes.
```go
	mgr.Logger.Info("DB is not ready (unauth connection also failed)", "error", unauthErr)
	return false
}
defer unauthClient.Disconnect(context.TODO()) //nolint:errcheck
```
Disconnect(context.TODO()) here can block probe execution indefinitely since it has no timeout. Use a bounded context (e.g., context.WithTimeout(context.Background(), ...)) for disconnects on the unauth client so readiness/startup probes can't hang beyond kubelet timeouts.
```diff
- defer unauthClient.Disconnect(context.TODO()) //nolint:errcheck
+ defer func() {
+ 	disconnectCtx, disconnectCancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
+ 	defer disconnectCancel()
+ 	_ = unauthClient.Disconnect(disconnectCtx)
+ }()
```
```go
cluster := dcs.GetStore().GetClusterFromCache()
if mgr.IsFirstMember() {
	// Use a fresh context with generous timeout for recovery operations
	// (rs.initiate + CreateRoot + reconnect can take several seconds)
	recoveryCtx, recoveryCancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer recoveryCancel()
	if fixErr := mgr.tryAutoRecoverFromRestore(recoveryCtx, cluster); fixErr != nil {
		mgr.Logger.Info("DB auto-recovery failed during startup probe", "error", fixErr.Error())
```
GetClusterFromCache() may return nil (it ignores errors from GetCluster()), but the result is passed into tryAutoRecoverFromRestore. If cluster is nil, initializeFromRestore -> InitializeCluster -> InitiateReplSet will dereference cluster.Members and panic during the startup probe. Please guard against a nil cluster here (or fetch the cluster with an error and handle it) before attempting auto-recovery.
```go
func (mgr *Manager) tryAutoRecoverFromRestore(ctx context.Context, cluster *dcs.Cluster) error {
	// Step 1: Use isMaster to check if RS is initialized (works without auth on direct connection)
	resp, isMasterErr := mgr.getLocalIsMaster(ctx)
	if isMasterErr == nil && resp != nil {
		// isMaster succeeded. Check if RS is configured.
		if resp.SetName != "" {
			// RS is initialized - the issue is likely just auth mismatch.
			mgr.Logger.Info("RS is initialized (isMaster reports setName), syncing root password", "setName", resp.SetName)
			return mgr.syncRootPassword(ctx)
		}
		// isMaster returned OK but no setName - RS is not initialized
		mgr.Logger.Info("RS not initialized (isMaster has no setName), initiating replica set")
		return mgr.initializeFromRestore(ctx, cluster)
	}
```
tryAutoRecoverFromRestore can be called with a nil cluster (e.g., GetClusterFromCache() returns nil when it can't fetch the Cluster). Both initializeFromRestore paths call InitializeCluster(ctx, cluster) which will panic when cluster is nil. Consider validating cluster != nil at the top of this recovery flow and returning a clear error (or skipping recovery) instead of risking a nil dereference.
```go
mgr.Logger.Info("Get local unauth client failed, trying auth client to update root password", "error", err.Error())
updateErr := UpdateUserPass(ctx, mgr.Client, config.Username, config.Password)
if updateErr != nil {
	mgr.Logger.Info("Update root password with auth client also failed", "error", updateErr.Error())
	return err
}
```
If UpdateUserPass fails, this returns the original NewLocalUnauthClient error (err) instead of the more relevant updateErr. Returning updateErr (or wrapping both) would make failures diagnosable and avoid hiding the real cause.
Summary
This PR fixes the MongoDB restore-from-backup startup path in lorry.
After MongoDB is restored from backup, `mongod` may already be running, but lorry can still mark the instance as not ready because `rs.status()` or authenticated connections fail. As a result, the replica role cannot be identified correctly, Pod readiness stays abnormal, role labels are not recovered, and the first member may remain stuck in an uninitialized state.
Root Cause
The restored data may be inconsistent with the expected cluster state. This mainly happens in two cases:
- the replica-set configuration is missing from the restored data, so the replica set appears uninitialized;
- the stored root credentials no longer match the ones lorry expects, so authenticated connections fail.
At the same time, MongoDB HA initialization is disabled by default, so the normal fallback flow for `InitializeCluster` and `CreateRoot` is never triggered. In addition, the Go Mongo driver may cache an invalid topology when the replica set is not initialized, causing authenticated replica-set clients to keep failing.
Changes
This PR adds a startup-time auto-recovery flow for MongoDB restore scenarios:
- Probe the local instance with `isMaster` and `replSetGetStatus` to determine whether the issue is an uninitialized replica set or a credential mismatch.
- If the replica set is uninitialized, initiate it and create/sync the root user, then reconnect the client; if the credentials drifted, sync the root password.
- Add local `isMaster` fallback logic for replica role detection, leader detection, and cluster initialization checks.
Impact
This change mainly affects MongoDB startup readiness, cluster initialization, root password synchronization, replica role detection, and leader detection in lorry. It specifically covers the first startup path after backup restore, especially for the first member.
Other database engines are not affected.
Validation
```shell
go test ./pkg/lorry/engines/mongodb/...
```