Skip to content

Use procedures for region create/delete to fix the delete-region ghost task#18033

Open
CRZbulabula wants to merge 5 commits into
masterfrom
improve-region-maintain-async
Open

Use procedures for region create/delete to fix the delete-region ghost task#18033
CRZbulabula wants to merge 5 commits into
masterfrom
improve-region-maintain-async

Conversation

@CRZbulabula

@CRZbulabula CRZbulabula commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

The ConfigNode-leader periodically drained a persistent RegionMaintainer queue (PartitionManager.maintainRegionReplicas, every 10s). For region deletion it sent synchronous deleteRegion RPCs to DataNodes.

For a data region this deletion can be slow (removing TsFiles + the consensus peer) and may outlive the RPC timeout. When it does:

  1. The ConfigNode times out and retries the same task 10s later.
  2. On the DataNode the second call re-runs the deletion, which now returns SUCCESS because the region is already gone (ConsensusGroupNotExistException is swallowed and deleting an already-removed region is a no-op).
  3. The ConfigNode treats that SUCCESS as completion, polls the task off the queue, and forgets it — while the first call is still running on the DataNode.

The result is a ghost task: the cluster keeps executing a deletion the coordinator has already discarded.

Approach

Region deletion moves off the fire-and-forget queue and onto a Procedure, while region creation (replica re-creation) stays on the RegionMaintainer queue, which never had the ghost-task race (a create returns a final, unambiguous status from a single RPC).

Deletion → RemoveRegionGroupProcedure

A new RemoveRegionGroupProcedure deletes a whole region group, replica by replica, reusing the existing migration primitives (submitDeleteOldRegionPeerTaskdeleteOldRegionPeer/deleteLocalPeer, then waitTaskFinish polling for the real terminal state). It needs no quorum, is idempotent, tolerates an already-absent peer, and fails loudly rather than skipping, so the parent can never drop a partition table while region data is still on disk. It gains persistence and retry across leader changes like the other region procedures.

Callers now spawn it as a child instead of offering delete tasks to the queue:

  • DeleteDatabaseProcedure — one RemoveRegionGroupProcedure per region group, completed before the DatabasePartitionTable is removed.
  • CreateRegionGroupsProcedure (no-quorum branch) — one RemoveRegionGroupProcedure to clean up the replicas that did get created.

To make this reuse safe, a latent bug in RegionMigrateService.DeleteOldRegionPeerTask was fixed: on a deletePeer/deleteRegion failure it called taskFail but fell through to taskSucceed, reporting failure as success — it now returns, and deletePeer treats ConsensusGroupNotExistException as success. This also hardens the RemoveRegionPeer / region-migration path.

Creation stays on the RegionMaintainer queue

PartitionManager.maintainRegionReplicas and the RegionCleaner scheduler are kept, but now only recreate failed replicas. The dead delete branch (the DELETE_REGION async RPC) is removed, and legacy RegionDeleteTasks that an upgraded ConfigNode might replay from a consensus log or snapshot are filtered out at the two PartitionInfo ingestion points (offerRegionMaintainTasks, processLoadSnapshot) so they can never get stuck at a region's queue head and block re-creation of that region's other replicas.

Cleanup / readability

  • Removed the now-dead async CnToDnAsyncRequestType.DELETE_REGION enum + dispatch and the two orphaned ManagerMessages log constants (en + zh).
  • Refactored maintainRegionReplicas for readability now that the queue is create-only: dropped the currentType/switch machinery and the duplicated SchemaRegion/DataRegion blocks, replaced the nested Optional.ifPresent + leader check with an early return, and extracted submitRegionCreateTasks(type, tasks) / collectSuccessfulRegions(...). Per-region FIFO ordering, the failed-head-blocks-region semantics, and the in-memory/persisted poll lockstep are preserved; the loop now batches both region types per round (throughput-only change). Net −50 lines in the method.

Upgrade safety (keep deserialization, don't delete)

RegionDeleteTask, the RegionMaintainType.DELETE enum value, and the RegionMaintainTask.Factory deserialization arms are kept byte-identical to master (master persisted these to the Ratis log and snapshots), so a rolling upgrade can still replay an old log / load an old snapshot — the delete tasks are deserialized and then dropped at ingestion. The retained OfferRegionMaintainTasks / PollRegionMaintainTask / PollSpecificRegionMaintainTask plan types likewise keep consensus-log replay compatible.

Tests

  • RemoveRegionGroupProcedureTest — serialization round-trip through ProcedureFactory (exercises the new ProcedureType code + factory registration) and the replica-deletion state machine.
  • PartitionInfoTest.testLegacyRegionDeleteTasksAreFiltered — a legacy delete task offered to / loaded from a snapshot is dropped, leaving only create tasks queued, and survives a snapshot round-trip.
  • Existing ConfigPhysicalPlanSerDeTest for the retained plan types stays green (proves consensus-log replay compatibility).

All confignode + datanode unit tests above pass; spotless:check and checkstyle are clean.

…t task

The ConfigNode-leader periodically drained a persistent RegionMaintainer
queue (PartitionManager.maintainRegionReplicas) by sending synchronous
deleteRegion RPCs to DataNodes. A slow data-region deletion could outlive
the RPC timeout; the ConfigNode retried 10s later, and the DataNode's
second call returned SUCCESS (the region was already gone and
ConsensusGroupNotExistException was swallowed). The ConfigNode then polled
the task off the queue and believed it was done, while the first call was
still running on the DataNode - a "ghost task".

Replace the whole RegionMaintainer queue with Procedures that mirror the
existing AddRegionPeerProcedure/RemoveRegionPeerProcedure pattern:

- DataNode: new deleteRegionAsync(TMaintainPeerReq) RPC plus a
  DeleteRegionTask in RegionMigrateService. The deletion runs in the
  background and records its terminal state in taskResultMap, so the
  ConfigNode polls getRegionMaintainResult for progress
  (PROCESSING/SUCCESS/FAIL/TASK_NOT_EXIST) instead of trusting a single
  RPC response. The task returns immediately after taskFail and never
  falls through to taskSucceed, so a failed deletion can never be reported
  as success.

- ConfigNode: new DeleteRegionProcedure (async submit + waitTaskFinish
  poll, taskId = procId, idempotent on retry/leader-change) and
  CreateRegionProcedure (the create RPC already returns a final status, so
  it stays synchronous and the procedure only adds persistence and retry).
  DeleteDatabaseProcedure and CreateRegionGroupsProcedure now spawn these
  as child procedures instead of offering to the queue, and every
  completed task is logged.

- Upgrade safety: the OfferRegionMaintainTasks/PollRegionMaintainTask/
  PollSpecificRegionMaintainTask plan types and classes are kept
  (@deprecated, still deserialized and applied as no-ops) so an old
  consensus log still replays. The PartitionInfo snapshot has no version
  header, so the trailing RegionMaintainer block is still consumed on load
  (and the tasks discarded with a WARN) and written as empty on save,
  keeping the byte layout unchanged.

Add DeleteRegionProcedureTest/CreateRegionProcedureTest and a
PartitionInfo test that loads a legacy snapshot containing pending tasks.
Address review feedback: DeleteRegionProcedure now reuses the existing
deleteOldRegionPeer async plumbing (submitDeleteOldRegionPeerTask ->
DeleteOldRegionPeerTask), which already runs deleteLocalPeer + delete
region data on a background thread keyed by taskId and is polled via
getRegionMaintainResult. This removes the duplicate deleteRegionAsync RPC,
the DeleteRegionTask, submitDeleteRegionTask (both the service and handler
layers), the CnToDnSyncRequestType.DELETE_REGION_ASYNC entry, and the
related i18n constant.

Also fix a latent bug in DeleteOldRegionPeerTask.run() that this reuse
exposes: on a deletePeer/deleteRegion failure it called taskFail but did
NOT return, then fell through to taskSucceed - so a failed deletion was
reported as success. It now returns after taskFail. deletePeer also treats
ConsensusGroupNotExistException as success (the peer may already be gone
when deleting a whole region replica), so a retry converges cleanly.
@CRZbulabula

Copy link
Copy Markdown
Contributor Author

Follow-up (6306e96): simplified the DataNode side. Instead of adding a new deleteRegionAsync RPC + DeleteRegionTask, DeleteRegionProcedure now reuses the existing deleteOldRegionPeer plumbing (submitDeleteOldRegionPeerTaskDeleteOldRegionPeerTask), which already does deleteLocalPeer + delete-region-data on a background thread keyed by taskId and is polled via getRegionMaintainResult — exactly what region-replica deletion needs.

This removes the duplicated deleteRegionAsync RPC, DeleteRegionTask, submitDeleteRegionTask (service + handler), the DELETE_REGION_ASYNC request type and its i18n constant (net −254 lines on the DataNode side).

It also fixes a latent bug in DeleteOldRegionPeerTask.run() that the reuse exposes: on a deletePeer/deleteRegion failure it called taskFail but did not return, then fell through to taskSucceed, so a failed deletion was reported as success. It now returns after taskFail, and deletePeer treats ConsensusGroupNotExistException as success so a retry converges cleanly. This also hardens the existing RemoveRegionPeer/migration path that already uses this task.

The ConfigNode-side CreateRegionProcedure/DeleteRegionProcedure remain separate procedures: Add/RemoveRegionPeerProcedure model peer-membership changes (coordinator, consensus pipes, transfer-leader, resetPeerList rollback, partition-table edits), which is a different concern from region-group create/delete, so reusing them there would drag in irrelevant/harmful steps.

@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 17.47573% with 170 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.41%. Comparing base (1d893f1) to head (4ae30e3).
⚠️ Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
...confignode/manager/partition/PartitionManager.java 0.00% 75 Missing ⚠️
...cedure/impl/region/RemoveRegionGroupProcedure.java 23.95% 73 Missing ⚠️
...edure/impl/region/CreateRegionGroupsProcedure.java 0.00% 7 Missing ⚠️
...procedure/impl/schema/DeleteDatabaseProcedure.java 0.00% 4 Missing ⚠️
.../apache/iotdb/db/service/RegionMigrateService.java 0.00% 4 Missing ⚠️
...onfignode/persistence/partition/PartitionInfo.java 72.72% 3 Missing ⚠️
...ignode/procedure/state/RemoveRegionGroupState.java 0.00% 2 Missing ⚠️
...b/confignode/procedure/store/ProcedureFactory.java 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18033      +/-   ##
============================================
+ Coverage     41.26%   41.41%   +0.15%     
  Complexity      318      318              
============================================
  Files          5273     5288      +15     
  Lines        368462   369569    +1107     
  Branches      47692    47809     +117     
============================================
+ Hits         152040   153069    +1029     
- Misses       216422   216500      +78     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…roups via procedure

When CreateRegionGroupsProcedure creates a RegionGroup and some replicas fail
but the group still reaches a serving quorum, the missing replicas are offered
to the RegionMaintainer queue (RegionCreateTask) and recreated by the
RegionCleaner. When the group cannot reach a quorum, its created replicas are
torn down by a single child RemoveRegionGroupProcedure.

DeleteDatabaseProcedure deletes each region group through a child
RemoveRegionGroupProcedure, which removes every replica's peer and data with
deleteOldRegionPeer (no quorum required) and polls the DataNode for the async
result, so the DatabasePartitionTable is dropped only after the regions are
gone.

Drop CreateRegionProcedure and DeleteRegionProcedure; RemoveRegionGroupProcedure
reuses procedure type code 207.
Region deletion is now owned by RemoveRegionGroupProcedure, so the
periodic RegionMaintainer (PartitionManager.maintainRegionReplicas)
should only recreate failed region replicas.

- Remove the DELETE execution branch (the DELETE_REGION async RPC) and
  the RegionMaintainType.DELETE special-casing from the task-selection
  loop. The switch keeps a defensive DELETE arm that only logs, so a
  regression cannot silently stall the loop.
- Filter out legacy RegionDeleteTasks at the two PartitionInfo ingestion
  points (offerRegionMaintainTasks replay and processLoadSnapshot) so a
  delete task carried over from an upgraded snapshot/consensus log can
  never get stuck at the head of a region's queue and block recreation
  of that region's other replicas.
- Remove the now-dead async CnToDnAsyncRequestType.DELETE_REGION enum +
  dispatch and the two orphaned ManagerMessages constants (en + zh).
- Keep RegionDeleteTask, the RegionMaintainType.DELETE enum value, and
  the Factory deserialization arms unchanged for rolling-upgrade replay
  and snapshot-load compatibility. The DataNode deleteRegion Thrift
  handler is also left intact as a cross-version RPC surface.

Add PartitionInfoTest#testLegacyRegionDeleteTasksAreFiltered covering
both ingestion filters.
The RegionMaintainer queue is now create-only, so the loop no longer
needs the currentType/switch machinery or the duplicated Schema/Data
blocks. Restructure it without changing behavior:

- Replace the nested Optional.ifPresent + isLeader with an early return.
- Group queued tasks into one FIFO sub-queue per region (create tasks
  only; a non-create task is defensively logged and skipped) and drain
  them head-by-head, advancing each region one task at a time so offer
  order is preserved.
- Extract submitRegionCreateTasks() (one batched create RPC per region
  type) and collectSuccessfulRegions(), removing the duplicated
  SchemaRegion/DataRegion code.
- Bucket the per-round heads by region type and submit both types in the
  same round; the per-region durable poll is keyed by the typed
  TConsensusGroupId, so Schema and Data stay independent. This is a
  throughput improvement only.

Rename the guard message to UNEXPECTED_NON_CREATE_REGION_MAINTAIN_TASK_
SKIPPED since the branch now catches any non-create task, not only the
legacy delete task.

Net -50 lines. Behavior verified to be preserved (per-region FIFO,
failed-head blocking, lockstep in-memory/persisted poll, no NPE, no
infinite loop). confignode UTs green.
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant