Use procedures for region create/delete to fix the delete-region ghost task#18033
Use procedures for region create/delete to fix the delete-region ghost task#18033CRZbulabula wants to merge 5 commits into
Conversation
…t task The ConfigNode-leader periodically drained a persistent RegionMaintainer queue (PartitionManager.maintainRegionReplicas) by sending synchronous deleteRegion RPCs to DataNodes. A slow data-region deletion could outlive the RPC timeout; the ConfigNode retried 10s later, and the DataNode's second call returned SUCCESS (the region was already gone and ConsensusGroupNotExistException was swallowed). The ConfigNode then polled the task off the queue and believed it was done, while the first call was still running on the DataNode - a "ghost task". Replace the whole RegionMaintainer queue with Procedures that mirror the existing AddRegionPeerProcedure/RemoveRegionPeerProcedure pattern: - DataNode: new deleteRegionAsync(TMaintainPeerReq) RPC plus a DeleteRegionTask in RegionMigrateService. The deletion runs in the background and records its terminal state in taskResultMap, so the ConfigNode polls getRegionMaintainResult for progress (PROCESSING/SUCCESS/FAIL/TASK_NOT_EXIST) instead of trusting a single RPC response. The task returns immediately after taskFail and never falls through to taskSucceed, so a failed deletion can never be reported as success. - ConfigNode: new DeleteRegionProcedure (async submit + waitTaskFinish poll, taskId = procId, idempotent on retry/leader-change) and CreateRegionProcedure (the create RPC already returns a final status, so it stays synchronous and the procedure only adds persistence and retry). DeleteDatabaseProcedure and CreateRegionGroupsProcedure now spawn these as child procedures instead of offering to the queue, and every completed task is logged. - Upgrade safety: the OfferRegionMaintainTasks/PollRegionMaintainTask/ PollSpecificRegionMaintainTask plan types and classes are kept (@deprecated, still deserialized and applied as no-ops) so an old consensus log still replays. The PartitionInfo snapshot has no version header, so the trailing RegionMaintainer block is still consumed on load (and the tasks discarded with a WARN) and written as empty on save, keeping the byte layout unchanged. Add DeleteRegionProcedureTest/CreateRegionProcedureTest and a PartitionInfo test that loads a legacy snapshot containing pending tasks.
Address review feedback: DeleteRegionProcedure now reuses the existing deleteOldRegionPeer async plumbing (submitDeleteOldRegionPeerTask -> DeleteOldRegionPeerTask), which already runs deleteLocalPeer + delete region data on a background thread keyed by taskId and is polled via getRegionMaintainResult. This removes the duplicate deleteRegionAsync RPC, the DeleteRegionTask, submitDeleteRegionTask (both the service and handler layers), the CnToDnSyncRequestType.DELETE_REGION_ASYNC entry, and the related i18n constant. Also fix a latent bug in DeleteOldRegionPeerTask.run() that this reuse exposes: on a deletePeer/deleteRegion failure it called taskFail but did NOT return, then fell through to taskSucceed - so a failed deletion was reported as success. It now returns after taskFail. deletePeer also treats ConsensusGroupNotExistException as success (the peer may already be gone when deleting a whole region replica), so a retry converges cleanly.
|
Follow-up (6306e96): simplified the DataNode side. Instead of adding a new This removes the duplicated It also fixes a latent bug in The ConfigNode-side |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18033 +/- ##
============================================
+ Coverage 41.26% 41.41% +0.15%
Complexity 318 318
============================================
Files 5273 5288 +15
Lines 368462 369569 +1107
Branches 47692 47809 +117
============================================
+ Hits 152040 153069 +1029
- Misses 216422 216500 +78 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…roups via procedure When CreateRegionGroupsProcedure creates a RegionGroup and some replicas fail but the group still reaches a serving quorum, the missing replicas are offered to the RegionMaintainer queue (RegionCreateTask) and recreated by the RegionCleaner. When the group cannot reach a quorum, its created replicas are torn down by a single child RemoveRegionGroupProcedure. DeleteDatabaseProcedure deletes each region group through a child RemoveRegionGroupProcedure, which removes every replica's peer and data with deleteOldRegionPeer (no quorum required) and polls the DataNode for the async result, so the DatabasePartitionTable is dropped only after the regions are gone. Drop CreateRegionProcedure and DeleteRegionProcedure; RemoveRegionGroupProcedure reuses procedure type code 207.
Region deletion is now owned by RemoveRegionGroupProcedure, so the periodic RegionMaintainer (PartitionManager.maintainRegionReplicas) should only recreate failed region replicas. - Remove the DELETE execution branch (the DELETE_REGION async RPC) and the RegionMaintainType.DELETE special-casing from the task-selection loop. The switch keeps a defensive DELETE arm that only logs, so a regression cannot silently stall the loop. - Filter out legacy RegionDeleteTasks at the two PartitionInfo ingestion points (offerRegionMaintainTasks replay and processLoadSnapshot) so a delete task carried over from an upgraded snapshot/consensus log can never get stuck at the head of a region's queue and block recreation of that region's other replicas. - Remove the now-dead async CnToDnAsyncRequestType.DELETE_REGION enum + dispatch and the two orphaned ManagerMessages constants (en + zh). - Keep RegionDeleteTask, the RegionMaintainType.DELETE enum value, and the Factory deserialization arms unchanged for rolling-upgrade replay and snapshot-load compatibility. The DataNode deleteRegion Thrift handler is also left intact as a cross-version RPC surface. Add PartitionInfoTest#testLegacyRegionDeleteTasksAreFiltered covering both ingestion filters.
The RegionMaintainer queue is now create-only, so the loop no longer needs the currentType/switch machinery or the duplicated Schema/Data blocks. Restructure it without changing behavior: - Replace the nested Optional.ifPresent + isLeader with an early return. - Group queued tasks into one FIFO sub-queue per region (create tasks only; a non-create task is defensively logged and skipped) and drain them head-by-head, advancing each region one task at a time so offer order is preserved. - Extract submitRegionCreateTasks() (one batched create RPC per region type) and collectSuccessfulRegions(), removing the duplicated SchemaRegion/DataRegion code. - Bucket the per-round heads by region type and submit both types in the same round; the per-region durable poll is keyed by the typed TConsensusGroupId, so Schema and Data stay independent. This is a throughput improvement only. Rename the guard message to UNEXPECTED_NON_CREATE_REGION_MAINTAIN_TASK_ SKIPPED since the branch now catches any non-create task, not only the legacy delete task. Net -50 lines. Behavior verified to be preserved (per-region FIFO, failed-head blocking, lockstep in-memory/persisted poll, no NPE, no infinite loop). confignode UTs green.
|



Problem
The ConfigNode-leader periodically drained a persistent RegionMaintainer queue (
PartitionManager.maintainRegionReplicas, every 10s). For region deletion it sent synchronousdeleteRegionRPCs to DataNodes.For a data region this deletion can be slow (removing TsFiles + the consensus peer) and may outlive the RPC timeout. When it does:
SUCCESSbecause the region is already gone (ConsensusGroupNotExistExceptionis swallowed and deleting an already-removed region is a no-op).SUCCESSas completion, polls the task off the queue, and forgets it — while the first call is still running on the DataNode.The result is a ghost task: the cluster keeps executing a deletion the coordinator has already discarded.
Approach
Region deletion moves off the fire-and-forget queue and onto a Procedure, while region creation (replica re-creation) stays on the RegionMaintainer queue, which never had the ghost-task race (a create returns a final, unambiguous status from a single RPC).
Deletion →
RemoveRegionGroupProcedureA new
RemoveRegionGroupProceduredeletes a whole region group, replica by replica, reusing the existing migration primitives (submitDeleteOldRegionPeerTask→deleteOldRegionPeer/deleteLocalPeer, thenwaitTaskFinishpolling for the real terminal state). It needs no quorum, is idempotent, tolerates an already-absent peer, and fails loudly rather than skipping, so the parent can never drop a partition table while region data is still on disk. It gains persistence and retry across leader changes like the other region procedures.Callers now spawn it as a child instead of offering delete tasks to the queue:
DeleteDatabaseProcedure— oneRemoveRegionGroupProcedureper region group, completed before theDatabasePartitionTableis removed.CreateRegionGroupsProcedure(no-quorum branch) — oneRemoveRegionGroupProcedureto clean up the replicas that did get created.To make this reuse safe, a latent bug in
RegionMigrateService.DeleteOldRegionPeerTaskwas fixed: on a deletePeer/deleteRegion failure it calledtaskFailbut fell through totaskSucceed, reporting failure as success — it now returns, anddeletePeertreatsConsensusGroupNotExistExceptionas success. This also hardens the RemoveRegionPeer / region-migration path.Creation stays on the RegionMaintainer queue
PartitionManager.maintainRegionReplicasand theRegionCleanerscheduler are kept, but now only recreate failed replicas. The dead delete branch (theDELETE_REGIONasync RPC) is removed, and legacyRegionDeleteTasks that an upgraded ConfigNode might replay from a consensus log or snapshot are filtered out at the twoPartitionInfoingestion points (offerRegionMaintainTasks,processLoadSnapshot) so they can never get stuck at a region's queue head and block re-creation of that region's other replicas.Cleanup / readability
CnToDnAsyncRequestType.DELETE_REGIONenum + dispatch and the two orphanedManagerMessageslog constants (en + zh).maintainRegionReplicasfor readability now that the queue is create-only: dropped thecurrentType/switchmachinery and the duplicated SchemaRegion/DataRegion blocks, replaced the nestedOptional.ifPresent+ leader check with an early return, and extractedsubmitRegionCreateTasks(type, tasks)/collectSuccessfulRegions(...). Per-region FIFO ordering, the failed-head-blocks-region semantics, and the in-memory/persisted poll lockstep are preserved; the loop now batches both region types per round (throughput-only change). Net −50 lines in the method.Upgrade safety (keep deserialization, don't delete)
RegionDeleteTask, theRegionMaintainType.DELETEenum value, and theRegionMaintainTask.Factorydeserialization arms are kept byte-identical to master (master persisted these to the Ratis log and snapshots), so a rolling upgrade can still replay an old log / load an old snapshot — the delete tasks are deserialized and then dropped at ingestion. The retainedOfferRegionMaintainTasks/PollRegionMaintainTask/PollSpecificRegionMaintainTaskplan types likewise keep consensus-log replay compatible.Tests
RemoveRegionGroupProcedureTest— serialization round-trip throughProcedureFactory(exercises the newProcedureTypecode + factory registration) and the replica-deletion state machine.PartitionInfoTest.testLegacyRegionDeleteTasksAreFiltered— a legacy delete task offered to / loaded from a snapshot is dropped, leaving only create tasks queued, and survives a snapshot round-trip.ConfigPhysicalPlanSerDeTestfor the retained plan types stays green (proves consensus-log replay compatibility).All confignode + datanode unit tests above pass;
spotless:checkandcheckstyleare clean.