Fixed multiple configNode bugs by Caideyipi · Pull Request #17609 · apache/iotdb

Caideyipi · 2026-05-06T08:57:51Z

Description

In Simple mode, a successful response will be returned stating "Metadata has taken effect, but WAL has not been flushed to disk." Configuration changes will be lost after a restart.
In the Simple mode, some files have the same sorting key, posing a risk of log replay in disorder.
The return TSStatus of ConsensusManager.write() is not checked, and the exception is only a warn followed by proceeding to the success branch. This means that if consensus writing is rejected, redirected, or the state machine returns a failure, the caller will still receive ACCEPT, while the local heartbeat cache/metrics have already been created in advance, leading to a divergence between metadata and runtime states.
The value returned by getRegionGroupCount() increases by 1 every time it is counted, and this value is used for capacity expansion judgment, display, and maxRegionGroupNum calculation, which directly affects the region scaling decision.
When taking a snapshot, the regionMaintainTaskList lock is not acquired before traversing. Concurrent offer/poll may throw ConcurrentModificationException or put the maintenance queue snapshot into an inconsistent state.
If the pipe listener fails to process the snapshot, the entire takeSnapshot() method will return false; however, ConfigRegionStateMachine combines and absorbs the failures of loadSnapshot and the pipe listener, which can amplify auxiliary link failures into consensus snapshot failures, and even mask the real failure of snapshot loading.

This PR has:

Key changed/added classes (or packages if there are too many classes) in this PR

codecov · 2026-05-06T10:01:18Z

Codecov Report

❌ Patch coverage is 42.80000% with 143 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.95%. Comparing base (011bdec) to head (65ae336).

Files with missing lines	Patch %	Lines
...che/iotdb/confignode/manager/node/NodeManager.java	42.30%	75 Missing ⚠️
...nsensus/statemachine/ConfigRegionStateMachine.java	42.59%	62 Missing ⚠️
...gnode/persistence/executor/ConfigPlanExecutor.java	0.00%	2 Missing ⚠️
...onfignode/persistence/partition/PartitionInfo.java	71.42%	2 Missing ⚠️
...de/procedure/impl/node/AddConfigNodeProcedure.java	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17609      +/-   ##
============================================
+ Coverage     40.90%   40.95%   +0.05%     
  Complexity     2610     2610              
============================================
  Files          5186     5186              
  Lines        351420   351563     +143     
  Branches      44991    45018      +27     
============================================
+ Hits         143758   143995     +237     
+ Misses       207662   207568      -94

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

CRZbulabula

Overall

This PR fixes 6 real correctness bugs in ConfigNode (WAL ordering, log sorting, unchecked consensus writes, off-by-one, concurrent snapshot, pipe-listener failure masking). Changes are well-structured, rollback logic is carefully designed, and test coverage for core scenarios is solid.

Inline comments below cover a few items worth attention before merge.

CRZbulabula · 2026-05-07T09:22:31Z

+      if (persistStatus.getCode() != TSStatusCode.SUCCESS_STATUS.getStatusCode()) {
+        return persistStatus;
+      }
+    }


WAL replay idempotency requirement

With the new write-ahead ordering (persist → execute), if persistPlanForSimpleConsensus succeeds at line 125 but executor.executeNonQueryPlan fails at line 133, the plan is already persisted in WAL and will be replayed on restart.

This is correct WAL semantics, but it introduces a hard requirement: all ConfigPhysicalPlan implementations must be idempotent under replay. For example, a RegisterDataNodePlan replayed against an already-registered node must not fail or double-register.

Is this idempotency property currently guaranteed across all ConfigPhysicalPlan types? If not, this could cause issues on crash recovery in edge cases.

CRZbulabula · 2026-05-07T09:22:31Z

      ByteBuffer buffer = plan.serializeToByteBuffer();
      buffer.position(buffer.limit());
      simpleLogWriter.write(buffer);
+      simpleLogWriter.force();


force() on every write makes scheduled flush redundant

Now that every write is synchronously forced here, the scheduled flushWALForSimpleConsensus thread (lines 447-455) becomes redundant — it calls simpleLogWriter.force() periodically, but there's nothing left to flush.

This is harmless (double-force is idempotent) but worth either:

Removing the scheduled thread to avoid confusion, or

Adding a brief comment explaining the intentional belt-and-suspenders approach.

For ConfigNode metadata operations (low write frequency), the per-write force() is the right durability choice.

CRZbulabula · 2026-05-07T09:22:31Z

+          "Persist current ConfigPhysicalPlan for ConfigNode SimpleConsensus mode failed", e);
+      return new TSStatus(TSStatusCode.EXECUTE_STATEMENT_ERROR.getStatusCode())
+          .setMessage(
+              "Persist ConfigNode SimpleConsensus log failed: " + String.valueOf(e.getMessage()));


Nit: String.valueOf(e.getMessage()) is redundant — string concatenation with + already handles null by converting to "null". Can simplify to:

"Persist ConfigNode SimpleConsensus log failed: " + e.getMessage()

CRZbulabula · 2026-05-07T09:22:31Z

+      final List<RegionMaintainTask> copiedRegionMaintainTaskList;
+      synchronized (regionMaintainTaskList) {
+        copiedRegionMaintainTaskList = new ArrayList<>(regionMaintainTaskList);
+      }


Good fix — synchronized copy before iterating. One minor observation: this copies the entire list under the lock. If regionMaintainTaskList can grow very large, this could cause a brief pause. In practice, for ConfigNode this list is typically small, so this is fine. Just noting for future awareness.

CRZbulabula

PTAL.

CRZbulabula · 2026-05-07T10:08:17Z

              }
            });
-    return result.getAndIncrement();
+    return result.get();


Is this a bug?

# Conflicts: # iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/consensus/statemachine/ConfigRegionStateMachine.java # iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/procedure/impl/node/AddConfigNodeProcedure.java

luoluoyuyu

Review summary

Fixes several ConfigNode correctness issues including WAL persist-before-execute and rollback on execute failure. Before merge, document or test that all ConfigPhysicalPlan types are idempotent under WAL replay when persist succeeds and execute fails.

Also resolve CI failures in this PR.

luoluoyuyu · 2026-05-27T08:03:56Z

  protected TSStatus write(ConfigPhysicalPlan plan) {
+    SimpleConsensusPersistResult persistResult = null;
+    if (ConsensusFactory.SIMPLE_CONSENSUS.equals(CONF.getConfigNodeConsensusProtocolClass())) {
+      persistResult = persistPlanForSimpleConsensus(plan);


With persist before execute, a successful persistPlanForSimpleConsensus followed by a failed executeNonQueryPlan leaves the plan in WAL and it will be replayed on restart. Every ConfigPhysicalPlan executed on replay must be idempotent. Please add an audit or tests for the plan types touched by this PR.

luoluoyuyu · 2026-05-27T08:03:56Z

+
    TSStatus result;
    try {
      result = executor.executeNonQueryPlan(plan);


rollbackFailedPlanForSimpleConsensus on execute failure is the right safety net. Ensure rollback failure paths are logged clearly enough for operators when WAL and in-memory state may diverge.

# Conflicts: # iotdb-core/node-commons/src/main/java/org/apache/iotdb/commons/path/PartialPath.java

Caideyipi · 2026-06-03T03:47:06Z

Addressed the review feedback and pushed the updates.

Changes:

Added WAL rollback/replay coverage: failed SimpleConsensus writes now assert that the rolled-back WAL contains no replayable plan.
Added idempotency coverage for replay-relevant NodeInfo plans touched by this PR: RegisterDataNodePlan, ApplyConfigNodePlan, and UpdateVersionInfoPlan.
Improved SimpleConsensus rollback failure logging with plan type, log file, truncate offset, and endIndexBeforeWrite; the returned status now also notes that the persisted plan may replay after restart if rollback fails.
Removed newly introduced TODO comments; todo-check is now passing.
Merged latest master and resolved the PartialPath conflict by keeping master's implementation, so the PR is mergeable again.

Verification:

mvn -pl iotdb-core/confignode spotless:apply
mvn -pl iotdb-core/node-commons spotless:apply
mvn -pl iotdb-core/confignode '-Dtest=ConfigRegionStateMachineTest,NodeInfoTest' test
git diff --check origin/master...HEAD

Current CI note: the Simple (17) failures were checkout/network failures on self-hosted runners (gnutls_handshake before tests started). I reran the failed jobs. todo-check has passed; the remaining jobs are still pending.

sonarqubecloud · 2026-06-03T03:47:57Z

Quality Gate passed

Issues
6 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

# Conflicts: # iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/consensus/statemachine/ConfigRegionStateMachine.java # iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/persistence/executor/ConfigPlanExecutor.java

Fixed multiple config bugs

2047d3a

CRZbulabula reviewed May 7, 2026

View reviewed changes

CRZbulabula requested changes May 7, 2026

View reviewed changes

Caideyipi added 4 commits May 22, 2026 14:02

fix

1d9fe36

fix

1b86d98

Update ConfigRegionStateMachine.java

2e4c652

This comment was marked as outdated.

Sign in to view

luoluoyuyu suggested changes May 27, 2026

View reviewed changes

Caideyipi added 2 commits June 3, 2026 11:30

Address ConfigNode review feedback

cddb4b1

Merge remote-tracking branch 'origin/master' into config-bug

65ae336

# Conflicts: # iotdb-core/node-commons/src/main/java/org/apache/iotdb/commons/path/PartialPath.java

Caideyipi added 2 commits June 29, 2026 14:55

Fix compile

c39aed2

Uh oh!

Conversation

Caideyipi commented May 6, 2026

Description

Key changed/added classes (or packages if there are too many classes) in this PR

Uh oh!

codecov Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

CRZbulabula left a comment

Choose a reason for hiding this comment

Overall

Uh oh!

CRZbulabula May 7, 2026

Choose a reason for hiding this comment

Uh oh!

CRZbulabula May 7, 2026

Choose a reason for hiding this comment

Uh oh!

CRZbulabula May 7, 2026

Choose a reason for hiding this comment

Uh oh!

CRZbulabula May 7, 2026

Choose a reason for hiding this comment

Uh oh!

CRZbulabula left a comment

Choose a reason for hiding this comment

Uh oh!

CRZbulabula May 7, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

luoluoyuyu left a comment

Choose a reason for hiding this comment

Review summary

Uh oh!

luoluoyuyu May 27, 2026

Choose a reason for hiding this comment

Uh oh!

luoluoyuyu May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Caideyipi commented Jun 3, 2026

Uh oh!

sonarqubecloud Bot commented Jun 3, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented May 6, 2026 •

edited

Loading