
fix: recover mongodb after backup restore #27

Open
wallyxjh wants to merge 1 commit into labring:fix/v0.9.3 from wallyxjh:fix/mongodb-restore-backup

Conversation

@wallyxjh
Collaborator

Summary

This PR fixes the MongoDB restore-from-backup startup path in lorry.

After MongoDB is restored from backup, mongod may already be running, but lorry can still mark the instance as not ready because rs.status() or authenticated connections fail. As a result, the replica role cannot be identified correctly, Pod readiness stays abnormal, role labels are not recovered, and the first member may remain stuck in an uninitialized state.

Root Cause

The restored data may be inconsistent with the expected cluster state. This mainly happens in two cases:

  1. The restored data does not contain replica set metadata.
  2. The restored data still contains the old root user/password while the current secret has changed.

At the same time, MongoDB HA initialization is disabled by default, so the normal fallback flow for InitializeCluster and CreateRoot is never triggered. In addition, the Go Mongo driver may cache an invalid topology when the replica set is not initialized, causing authenticated replica-set clients to keep failing.
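The two failure cases above can be told apart from a local unauthenticated probe: `isMaster` reports a `setName` only when the replica set is configured, so an empty `setName` indicates case 1, while a configured set combined with failing authenticated connections points to case 2. A minimal sketch of that decision, assuming illustrative names (`RestoreAction`, `classifyRestoreState` are not lorry's actual API):

```go
package main

import "fmt"

// RestoreAction is the recovery path chosen after probing a restored mongod.
type RestoreAction int

const (
	ActionNone         RestoreAction = iota // instance looks healthy
	ActionInitiateRS                        // case 1: replica set metadata missing, run rs.initiate
	ActionSyncRootPass                      // case 2: RS exists but root credentials drifted
)

// classifyRestoreState maps the local probe result to a recovery action:
// an unauthenticated isMaster reports setName only for an initialized
// replica set, so an empty setName means the restored data lacks RS
// metadata; a configured RS plus failing auth means the backup carried a
// stale root password.
func classifyRestoreState(setName string, authOK bool) RestoreAction {
	if setName == "" {
		return ActionInitiateRS
	}
	if !authOK {
		return ActionSyncRootPass
	}
	return ActionNone
}

func main() {
	fmt.Println(classifyRestoreState("", false))    // 1 (initiate replica set)
	fmt.Println(classifyRestoreState("rs0", false)) // 2 (sync root password)
	fmt.Println(classifyRestoreState("rs0", true))  // 0 (healthy)
}
```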

Changes

This PR adds a startup-time auto-recovery flow for MongoDB restore scenarios.

  • Use a local unauthenticated connection together with isMaster and replSetGetStatus to determine whether the issue is an uninitialized replica set or a credential mismatch.
  • If the current Pod is the first member, automatically initialize the replica set, create or update the root password, and reconnect the client to refresh the cached topology.
  • Add local isMaster fallback logic for replica role detection, leader detection, and cluster initialization checks.
  • Reconnect the MongoDB client after replica set initialization or credential synchronization to avoid stale topology state in the Go driver.
  • Avoid surfacing transient role-detection failures as readiness probe errors during recovery.
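For the first member, the bullets above amount to a fixed recovery order: initiate the replica set, then create or update root, then reconnect so the Go driver drops its cached topology. A sketch of that orchestration with each step injected as a function (the step names and struct are illustrative, not lorry's real methods):

```go
package main

import "fmt"

// recoverySteps holds the three operations the startup recovery runs in
// order. Reconnect must come last: the Go driver can cache a topology
// discovered before rs.initiate, so authenticated replica-set clients stay
// broken until a fresh client is built.
type recoverySteps struct {
	initiateRS func() error
	ensureRoot func() error
	reconnect  func() error
}

func runRestoreRecovery(isFirstMember bool, s recoverySteps) error {
	if !isFirstMember {
		// Non-first members wait for the first member to initiate the RS.
		return nil
	}
	if err := s.initiateRS(); err != nil {
		return fmt.Errorf("initiate replica set: %w", err)
	}
	if err := s.ensureRoot(); err != nil {
		return fmt.Errorf("create or update root user: %w", err)
	}
	if err := s.reconnect(); err != nil {
		return fmt.Errorf("reconnect to refresh topology: %w", err)
	}
	return nil
}

func main() {
	var order []string
	step := func(name string) func() error {
		return func() error { order = append(order, name); return nil }
	}
	_ = runRestoreRecovery(true, recoverySteps{
		initiateRS: step("initiate"),
		ensureRoot: step("root"),
		reconnect:  step("reconnect"),
	})
	fmt.Println(order) // [initiate root reconnect]
}
```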

Impact

This change mainly affects MongoDB startup readiness, cluster initialization, root password synchronization, replica role detection, and leader detection in lorry. It specifically covers the first startup path after backup restore, especially for the first member.

Other database engines are not affected.

Validation

  • go test ./pkg/lorry/engines/mongodb/...


Copilot AI left a comment


Pull request overview

This PR improves MongoDB’s startup/readiness behavior in lorry for the “restored from backup” path, aiming to recover from missing replica-set metadata and/or credential drift so role detection and labeling can resume correctly.

Changes:

  • Extend local probing (isMaster / replSetGetStatus) to detect RS initialization vs credential mismatch and avoid stale Go-driver topology.
  • Add auto-recovery flow (init RS, create/sync root, reconnect client) primarily for the first member after restore.
  • Add local isMaster fallbacks for member role/leader checks and reduce readiness probe failures during transient recovery.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| pkg/lorry/engines/mongodb/types.go | Adds fields to IsMasterResp needed for local role/RS detection. |
| pkg/lorry/engines/mongodb/manager.go | Implements reconnect logic, local init checks, startup recovery hooks, and isMaster fallbacks. |
| pkg/lorry/engines/mongodb/get_replica_role.go | Adds restore auto-recovery orchestration and softens replica-role errors during readiness probing. |


Comment on lines +332 to +337
mgr.Logger.Info("Create Root failed, trying to update existing user password", "error", err)
updateErr := UpdateUserPass(ctx, client, config.Username, config.Password)
if updateErr != nil {
	mgr.Logger.Info("Update existing root password also failed", "error", updateErr.Error())
	return err
}

Copilot AI Apr 14, 2026


When the fallback UpdateUserPass fails, the function returns the original CreateUser error (err) rather than updateErr. Returning updateErr (or combining them) would better reflect the final failure and make debugging restore recovery issues easier.

@@ -245,13 +327,74 @@ func (mgr *Manager) CreateRoot(ctx context.Context) error {
mgr.Logger.Info(fmt.Sprintf("Create user: %s, passwd: %s, roles: %v", config.Username, config.Password, role))

Copilot AI Apr 14, 2026


This log line prints the root password in plaintext. Please redact/remove the password from logs (and ideally avoid logging credentials entirely) to prevent secret leakage in pod logs and aggregated logging systems.

Suggested change
mgr.Logger.Info(fmt.Sprintf("Create user: %s, passwd: %s, roles: %v", config.Username, config.Password, role))
mgr.Logger.Info(fmt.Sprintf("Create user: %s, roles: %v", config.Username, role))

Comment on lines +365 to +366
defer client.Disconnect(context.TODO()) //nolint:errcheck


Copilot AI Apr 14, 2026


Disconnect(context.TODO()) can block without a timeout and ignores cancellation, which is risky in probe/role-detection paths. Prefer using a short context.WithTimeout(context.Background(), ...) (or reuse ctx if it has sufficient timeout) when disconnecting clients created for local probes.

mgr.Logger.Info("DB is not ready (unauth connection also failed)", "error", unauthErr)
return false
}
defer unauthClient.Disconnect(context.TODO()) //nolint:errcheck

Copilot AI Apr 14, 2026


Disconnect(context.TODO()) here can block probe execution indefinitely since it has no timeout. Use a bounded context (e.g., context.WithTimeout(context.Background(), ...)) for disconnects on the unauth client so readiness/startup probes can't hang beyond kubelet timeouts.

Suggested change
defer unauthClient.Disconnect(context.TODO()) //nolint:errcheck
defer func() {
disconnectCtx, disconnectCancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer disconnectCancel()
_ = unauthClient.Disconnect(disconnectCtx)
}()

Comment on lines +439 to +446
cluster := dcs.GetStore().GetClusterFromCache()
if mgr.IsFirstMember() {
	// Use a fresh context with generous timeout for recovery operations
	// (rs.initiate + CreateRoot + reconnect can take several seconds)
	recoveryCtx, recoveryCancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer recoveryCancel()
	if fixErr := mgr.tryAutoRecoverFromRestore(recoveryCtx, cluster); fixErr != nil {
		mgr.Logger.Info("DB auto-recovery failed during startup probe", "error", fixErr.Error())

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetClusterFromCache() may return nil (it ignores errors from GetCluster()), but the result is passed into tryAutoRecoverFromRestore. If cluster is nil, initializeFromRestore -> InitializeCluster -> InitiateReplSet will dereference cluster.Members and panic during the startup probe. Please guard against a nil cluster here (or fetch the cluster with an error and handle it) before attempting auto-recovery.

Comment on lines +50 to +63
func (mgr *Manager) tryAutoRecoverFromRestore(ctx context.Context, cluster *dcs.Cluster) error {
	// Step 1: Use isMaster to check if RS is initialized (works without auth on direct connection)
	resp, isMasterErr := mgr.getLocalIsMaster(ctx)
	if isMasterErr == nil && resp != nil {
		// isMaster succeeded. Check if RS is configured.
		if resp.SetName != "" {
			// RS is initialized - the issue is likely just auth mismatch.
			mgr.Logger.Info("RS is initialized (isMaster reports setName), syncing root password", "setName", resp.SetName)
			return mgr.syncRootPassword(ctx)
		}
		// isMaster returned OK but no setName - RS is not initialized
		mgr.Logger.Info("RS not initialized (isMaster has no setName), initiating replica set")
		return mgr.initializeFromRestore(ctx, cluster)
	}

Copilot AI Apr 14, 2026


tryAutoRecoverFromRestore can be called with a nil cluster (e.g., GetClusterFromCache() returns nil when it can't fetch the Cluster). Both initializeFromRestore paths call InitializeCluster(ctx, cluster) which will panic when cluster is nil. Consider validating cluster != nil at the top of this recovery flow and returning a clear error (or skipping recovery) instead of risking a nil dereference.

Copilot uses AI. Check for mistakes.
Comment on lines +311 to +316
mgr.Logger.Info("Get local unauth client failed, trying auth client to update root password", "error", err.Error())
updateErr := UpdateUserPass(ctx, mgr.Client, config.Username, config.Password)
if updateErr != nil {
	mgr.Logger.Info("Update root password with auth client also failed", "error", updateErr.Error())
	return err
}

Copilot AI Apr 14, 2026


If UpdateUserPass fails, this returns the original NewLocalUnauthClient error (err) instead of the more relevant updateErr. Returning updateErr (or wrapping both) would make failures diagnosable and avoid hiding the real cause.
