
fix: recover mongodb after backup restore #27

Open
wallyxjh wants to merge 1 commit into labring:fix/v0.9.3 from wallyxjh:fix/mongodb-restore-backup

Conversation

@wallyxjh
Collaborator

Summary

This PR fixes the MongoDB restore-from-backup startup path in lorry.

After MongoDB is restored from backup, mongod may already be running, but lorry can still mark the instance as not ready because rs.status() or authenticated connections fail. As a result, the replica role cannot be identified correctly, Pod readiness stays abnormal, role labels are not recovered, and the first member may remain stuck in an uninitialized state.

Root Cause

The restored data may be inconsistent with the expected cluster state. This mainly happens in two cases:

  1. The restored data does not contain replica set metadata.
  2. The restored data still contains the old root user/password while the current secret has changed.

At the same time, MongoDB HA initialization is disabled by default, so the normal fallback flow for InitializeCluster and CreateRoot is never triggered. In addition, the Go Mongo driver may cache an invalid topology when the replica set is not initialized, causing authenticated replica-set clients to keep failing.
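The two failure cases above can be told apart from a local unauthenticated probe: `isMaster` reports a `setName` only when the replica set is configured, so an empty `setName` indicates case 1, while a configured set combined with failing authenticated connections points to case 2. A minimal sketch of that decision, assuming illustrative names (`RestoreAction`, `classifyRestoreState` are not lorry's actual API):

```go
package main

import "fmt"

// RestoreAction is the recovery path chosen after probing a restored mongod.
type RestoreAction int

const (
	ActionNone         RestoreAction = iota // instance looks healthy
	ActionInitiateRS                        // case 1: replica set metadata missing, run rs.initiate
	ActionSyncRootPass                      // case 2: RS exists but root credentials drifted
)

// classifyRestoreState maps the local probe result to a recovery action:
// an unauthenticated isMaster reports setName only for an initialized
// replica set, so an empty setName means the restored data lacks RS
// metadata; a configured RS plus failing auth means the backup carried a
// stale root password.
func classifyRestoreState(setName string, authOK bool) RestoreAction {
	if setName == "" {
		return ActionInitiateRS
	}
	if !authOK {
		return ActionSyncRootPass
	}
	return ActionNone
}

func main() {
	fmt.Println(classifyRestoreState("", false))    // 1 (initiate replica set)
	fmt.Println(classifyRestoreState("rs0", false)) // 2 (sync root password)
	fmt.Println(classifyRestoreState("rs0", true))  // 0 (healthy)
}
```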

Changes

This PR adds a startup-time auto-recovery flow for MongoDB restore scenarios.

  • Use a local unauthenticated connection together with isMaster and replSetGetStatus to determine whether the issue is an uninitialized replica set or a credential mismatch.
  • If the current Pod is the first member, automatically initialize the replica set, create or update the root password, and reconnect the client to refresh the cached topology.
  • Add local isMaster fallback logic for replica role detection, leader detection, and cluster initialization checks.
  • Reconnect the MongoDB client after replica set initialization or credential synchronization to avoid stale topology state in the Go driver.
  • Avoid surfacing transient role-detection failures as readiness probe errors during recovery.
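For the first member, the bullets above amount to a fixed recovery order: initiate the replica set, then create or update root, then reconnect so the Go driver drops its cached topology. A sketch of that orchestration with each step injected as a function (the step names and struct are illustrative, not lorry's real methods):

```go
package main

import "fmt"

// recoverySteps holds the three operations the startup recovery runs in
// order. Reconnect must come last: the Go driver can cache a topology
// discovered before rs.initiate, so authenticated replica-set clients stay
// broken until a fresh client is built.
type recoverySteps struct {
	initiateRS func() error
	ensureRoot func() error
	reconnect  func() error
}

func runRestoreRecovery(isFirstMember bool, s recoverySteps) error {
	if !isFirstMember {
		// Non-first members wait for the first member to initiate the RS.
		return nil
	}
	if err := s.initiateRS(); err != nil {
		return fmt.Errorf("initiate replica set: %w", err)
	}
	if err := s.ensureRoot(); err != nil {
		return fmt.Errorf("create or update root user: %w", err)
	}
	if err := s.reconnect(); err != nil {
		return fmt.Errorf("reconnect to refresh topology: %w", err)
	}
	return nil
}

func main() {
	var order []string
	step := func(name string) func() error {
		return func() error { order = append(order, name); return nil }
	}
	_ = runRestoreRecovery(true, recoverySteps{
		initiateRS: step("initiate"),
		ensureRoot: step("root"),
		reconnect:  step("reconnect"),
	})
	fmt.Println(order) // [initiate root reconnect]
}
```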

Impact

This change mainly affects MongoDB startup readiness, cluster initialization, root password synchronization, replica role detection, and leader detection in lorry. It specifically covers the first startup path after backup restore, especially for the first member.

Other database engines are not affected.

Validation

  • go test ./pkg/lorry/engines/mongodb/...


Copilot AI left a comment


Pull request overview

This PR improves MongoDB’s startup/readiness behavior in lorry for the “restored from backup” path, aiming to recover from missing replica-set metadata and/or credential drift so role detection and labeling can resume correctly.

Changes:

  • Extend local probing (isMaster / replSetGetStatus) to detect RS initialization vs credential mismatch and avoid stale Go-driver topology.
  • Add auto-recovery flow (init RS, create/sync root, reconnect client) primarily for the first member after restore.
  • Add local isMaster fallbacks for member role/leader checks and reduce readiness probe failures during transient recovery.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| pkg/lorry/engines/mongodb/types.go | Adds fields to IsMasterResp needed for local role/RS detection. |
| pkg/lorry/engines/mongodb/manager.go | Implements reconnect logic, local init checks, startup recovery hooks, and isMaster fallbacks. |
| pkg/lorry/engines/mongodb/get_replica_role.go | Adds restore auto-recovery orchestration and softens replica-role errors during readiness probing. |


Comment on lines +332 to +337
mgr.Logger.Info("Create Root failed, trying to update existing user password", "error", err)
updateErr := UpdateUserPass(ctx, client, config.Username, config.Password)
if updateErr != nil {
	mgr.Logger.Info("Update existing root password also failed", "error", updateErr.Error())
	return err
}

Copilot AI Apr 14, 2026


When the fallback UpdateUserPass fails, the function returns the original CreateUser error (err) rather than updateErr. Returning updateErr (or combining them) would better reflect the final failure and make debugging restore recovery issues easier.

@@ -245,13 +327,74 @@ func (mgr *Manager) CreateRoot(ctx context.Context) error {
mgr.Logger.Info(fmt.Sprintf("Create user: %s, passwd: %s, roles: %v", config.Username, config.Password, role))

Copilot AI Apr 14, 2026


This log line prints the root password in plaintext. Please redact/remove the password from logs (and ideally avoid logging credentials entirely) to prevent secret leakage in pod logs and aggregated logging systems.

Suggested change
mgr.Logger.Info(fmt.Sprintf("Create user: %s, passwd: %s, roles: %v", config.Username, config.Password, role))
mgr.Logger.Info(fmt.Sprintf("Create user: %s, roles: %v", config.Username, role))

Comment on lines +365 to +366
defer client.Disconnect(context.TODO()) //nolint:errcheck


Copilot AI Apr 14, 2026


Disconnect(context.TODO()) can block without a timeout and ignores cancellation, which is risky in probe/role-detection paths. Prefer using a short context.WithTimeout(context.Background(), ...) (or reuse ctx if it has sufficient timeout) when disconnecting clients created for local probes.

mgr.Logger.Info("DB is not ready (unauth connection also failed)", "error", unauthErr)
return false
}
defer unauthClient.Disconnect(context.TODO()) //nolint:errcheck

Copilot AI Apr 14, 2026


Disconnect(context.TODO()) here can block probe execution indefinitely since it has no timeout. Use a bounded context (e.g., context.WithTimeout(context.Background(), ...)) for disconnects on the unauth client so readiness/startup probes can't hang beyond kubelet timeouts.

Suggested change
defer unauthClient.Disconnect(context.TODO()) //nolint:errcheck
defer func() {
disconnectCtx, disconnectCancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer disconnectCancel()
_ = unauthClient.Disconnect(disconnectCtx)
}()

Comment on lines +439 to +446
cluster := dcs.GetStore().GetClusterFromCache()
if mgr.IsFirstMember() {
	// Use a fresh context with generous timeout for recovery operations
	// (rs.initiate + CreateRoot + reconnect can take several seconds)
	recoveryCtx, recoveryCancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer recoveryCancel()
	if fixErr := mgr.tryAutoRecoverFromRestore(recoveryCtx, cluster); fixErr != nil {
		mgr.Logger.Info("DB auto-recovery failed during startup probe", "error", fixErr.Error())

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetClusterFromCache() may return nil (it ignores errors from GetCluster()), but the result is passed into tryAutoRecoverFromRestore. If cluster is nil, initializeFromRestore -> InitializeCluster -> InitiateReplSet will dereference cluster.Members and panic during the startup probe. Please guard against a nil cluster here (or fetch the cluster with an error and handle it) before attempting auto-recovery.

Comment on lines +50 to +63
func (mgr *Manager) tryAutoRecoverFromRestore(ctx context.Context, cluster *dcs.Cluster) error {
	// Step 1: Use isMaster to check if RS is initialized (works without auth on direct connection)
	resp, isMasterErr := mgr.getLocalIsMaster(ctx)
	if isMasterErr == nil && resp != nil {
		// isMaster succeeded. Check if RS is configured.
		if resp.SetName != "" {
			// RS is initialized - the issue is likely just auth mismatch.
			mgr.Logger.Info("RS is initialized (isMaster reports setName), syncing root password", "setName", resp.SetName)
			return mgr.syncRootPassword(ctx)
		}
		// isMaster returned OK but no setName - RS is not initialized
		mgr.Logger.Info("RS not initialized (isMaster has no setName), initiating replica set")
		return mgr.initializeFromRestore(ctx, cluster)
	}

Copilot AI Apr 14, 2026


tryAutoRecoverFromRestore can be called with a nil cluster (e.g., GetClusterFromCache() returns nil when it can't fetch the Cluster). Both initializeFromRestore paths call InitializeCluster(ctx, cluster) which will panic when cluster is nil. Consider validating cluster != nil at the top of this recovery flow and returning a clear error (or skipping recovery) instead of risking a nil dereference.

Copilot uses AI. Check for mistakes.
Comment on lines +311 to +316
mgr.Logger.Info("Get local unauth client failed, trying auth client to update root password", "error", err.Error())
updateErr := UpdateUserPass(ctx, mgr.Client, config.Username, config.Password)
if updateErr != nil {
	mgr.Logger.Info("Update root password with auth client also failed", "error", updateErr.Error())
	return err
}

Copilot AI Apr 14, 2026


If UpdateUserPass fails, this returns the original NewLocalUnauthClient error (err) instead of the more relevant updateErr. Returning updateErr (or wrapping both) would make failures diagnosable and avoid hiding the real cause.
