Skip to content

IGNITE-25825 Async write intent cleanup (dry run)#7980

Open
ascherbakoff wants to merge 5 commits intoapache:mainfrom
gridgain:ignite-25825-ww
Open

IGNITE-25825 Async write intent cleanup (dry run)#7980
ascherbakoff wants to merge 5 commits intoapache:mainfrom
gridgain:ignite-25825-ww

Conversation

@ascherbakoff
Copy link
Copy Markdown
Contributor

@ascherbakoff ascherbakoff commented Apr 13, 2026

A temporary PR, will be removed after merging the IGNITE-24963

@ascherbakoff ascherbakoff requested a review from isapego as a code owner April 13, 2026 12:51
Copilot AI review requested due to automatic review settings April 13, 2026 12:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces infrastructure for asynchronous write-intent cleanup and refactors deadlock-prevention behavior across the lock manager, transaction manager, and a broad set of tests/integration tests.

Changes:

  • Refactors deadlock prevention into explicit policies (wait-die / wound-wait / no-wait / timeout / reversed wait-die) and updates lock acquisition/conflict handling + tests accordingly.
  • Adds a “kill transaction” network message and wires wound-wait “fail owner” action to send kill requests to coordinators.
  • Reworks write-intent switch / cleanup race handling using a new partition-level inflight tracker, and adjusts retry/“retriable” exception detection.

Reviewed changes

Copilot reviewed 67 out of 67 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
modules/transactions/src/testFixtures/java/org/apache/ignite/internal/tx/test/LockWaiterMatcher.java New matcher for “future waits”.
modules/transactions/src/testFixtures/java/org/apache/ignite/internal/tx/test/LockFutureMatcher.java New matcher for granted lock.
modules/transactions/src/testFixtures/java/org/apache/ignite/internal/tx/test/LockConflictMatcher.java New matcher for conflict error.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/WoundWaitDeadlockPreventionRollbackFailActionTest.java Tests wound-wait rollback failAction.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/WoundWaitDeadlockPreventionNoOpFailActionTest.java Tests wound-wait no-op failAction.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/WaitDieDeadlockPreventionTest.java Updates to new conflict matcher + key helpers.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/TransactionIdsTest.java Adds test for new TransactionIds.hash.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/TimeoutDeadlockPreventionTest.java Switches to TimeoutDeadlockPreventionPolicy + awaits matcher.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/ReversedWaitDieDeadlockPreventionTest.java Moves to dedicated reversed wait-die policy.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/NoWaitDeadlockPreventionTest.java Moves to NoWaitDeadlockPreventionPolicy.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/NoneDeadlockPreventionTest.java Adds awaits matcher + key helper updates.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/LockManagerTxLabelTest.java Switches to NoWaitDeadlockPreventionPolicy.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/impl/OrphanDetectorTxLabelTest.java Parameterizes over policies.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/impl/OrphanDetectorTest.java Parameterizes over policies; refactors lock manager init.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/HeapLockManagerEventsTest.java Parameterizes lock events tests over policies.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/CoarseGrainedLockManagerTest.java Updates coarse-lock behavior expectations/order.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/AbstractLockManagerTest.java Removes large abstract lock manager test suite.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/AbstractLockManagerEventsTest.java Uses new lockKey helpers.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/AbstractLockingTest.java Refactors test base: tx fixtures, lockKey helpers, id ordering.
modules/transactions/src/test/java/org/apache/ignite/internal/tx/AbstractDeadlockPreventionTest.java Refactors to matcher-based assertions + shared tx fixtures.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/TransactionIds.java Adds hash() and compare() helpers.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/message/TxMessageGroup.java Adds TX_KILL_MESSAGE type.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/message/TxKillMessage.java New message for coordinator kill.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockManager.java Exposes active deadlock prevention policy.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockKey.java Improves toString for ByteBuffer keys.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/Lock.java Adds equals/hashCode.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/InternalTransaction.java Adds enlistFailedException hook.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/WoundWaitDeadlockPreventionPolicy.java New wound-wait policy.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/WaitDieDeadlockPreventionPolicy.java Rewrites policy in terms of allowWait/invertedWaitOrder.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/TxMessageSender.java Adds kill() sender for coordinators.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/TxManagerImpl.java Uses wound-wait by default; adds tx-kill handling; refactors replica response handler.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/TxIdPriorityComparator.java Delegates comparator to TransactionIds.compare.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/TimeoutDeadlockPreventionPolicy.java New timeout policy.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/ReversedWaitDieDeadlockPreventionPolicy.java New reversed wait-die policy.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/ReadWriteTransactionImpl.java Makes killed volatile; exposes enlistFailedException; adjusts kill flag timing.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/PublicApiThreadingTransaction.java Forwards enlistFailedException.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/NoWaitDeadlockPreventionPolicy.java New no-wait policy.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/HeapLockManager.java Major lock manager refactor: allowWait-based conflicts, sealing semantics, new ordering logic.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/DeadlockPreventionPolicyImpl.java Removes old configurable policy implementation.
modules/transactions/src/main/java/org/apache/ignite/internal/tx/DeadlockPreventionPolicy.java New policy contract: allowWait/failAction/invertedWaitOrder.
modules/transactions/src/integrationTest/java/org/apache/ignite/tx/distributed/ItTransactionRecoveryTest.java Adapts tests to policy-dependent wait order.
modules/transactions/src/integrationTest/java/org/apache/ignite/internal/tx/ItRunInTransactionTest.java Synchronizes test behavior across policy modes.
modules/transactions/src/integrationTest/java/org/apache/ignite/internal/disaster/ItDisasterRecoveryReconfigurationTest.java Skips test for incompatible policy mode.
modules/table/src/testFixtures/java/org/apache/ignite/internal/table/TxAbstractTest.java Adapts lock-ordering and retry expectations to policy.
modules/table/src/main/java/org/apache/ignite/internal/table/distributed/storage/InternalTableImpl.java Uses enlistFailedException on inflight rejection; retries based on RetriableTransactionException.
modules/table/src/main/java/org/apache/ignite/internal/table/distributed/replicator/PartitionReplicaListener.java Replaces tx cleanup readiness state with PartitionInflights tracking.
modules/table/src/main/java/org/apache/ignite/internal/table/distributed/replicator/PartitionInflights.java New inflight tracker for enlist/cleanup race avoidance.
modules/table/src/integrationTest/java/org/apache/ignite/internal/table/ItOperationRetryTest.java Skips tests for incompatible policy mode.
modules/table/src/integrationTest/java/org/apache/ignite/distributed/ItTxStateLocalMapTest.java Adjusts local-state checks for unlock-only path.
modules/table/src/integrationTest/java/org/apache/ignite/distributed/ItTxAbstractDistributedTestSingleNode.java Skips timeout test for incompatible policy mode.
modules/sql-engine/src/integrationTest/java/org/apache/ignite/internal/sql/engine/systemviews/ItLocksSystemViewTest.java Makes conflict test policy-order-aware.
modules/sql-engine/src/integrationTest/java/org/apache/ignite/internal/sql/engine/ItDmlTest.java Adjusts scan-with-tx test variable ordering.
modules/runner/src/integrationTest/java/org/apache/ignite/internal/table/ItTableScanTest.java Makes scan/write conflict tests policy-order-aware.
modules/runner/src/integrationTest/java/org/apache/ignite/internal/table/ItDataConsistencyTest.java Changes test workload/validation; adds lock-empty assertion.
modules/runner/src/integrationTest/java/org/apache/ignite/internal/benchmark/TpccBenchmarkNodeRunner.java New runner for TPC-C benchmark node.
modules/replicator/src/main/java/org/apache/ignite/internal/replicator/exception/ReplicaUnavailableException.java Marks as RetriableTransactionException.
modules/replicator/src/main/java/org/apache/ignite/internal/replicator/exception/ReplicationException.java Removes RetriableTransactionException marker.
modules/platforms/cpp/tests/odbc-test/transaction_test.cpp Disables failing test (TODO).
modules/partition-replicator/src/main/java/org/apache/ignite/internal/partition/replicator/handlers/TxFinishReplicaRequestHandler.java Makes cleanup fire-and-forget (async).
modules/core/src/testFixtures/java/org/apache/ignite/internal/testframework/IgniteTestUtils.java Adds ensureFutureNotCompleted helper.
modules/core/src/main/java/org/apache/ignite/internal/lang/NodeStoppingException.java Removes retry marker interfaces.
modules/client/src/integrationTest/java/org/apache/ignite/internal/streamer/ItClientDataStreamerLoadTest.java Improves verification/logging; policy-dependent allowances.
modules/client/src/integrationTest/java/org/apache/ignite/internal/client/ItThinClientTransactionsTest.java Adjusts tests for policy-dependent order.
modules/api/src/test/java/org/apache/ignite/tx/RunInTransactionRetryTest.java Updates retry expectations for commit/rollback failures.
modules/api/src/main/java/org/apache/ignite/tx/RunInTransactionInternalImpl.java Makes commit retriable; refactors async flow.
modules/api/src/main/java/org/apache/ignite/tx/IgniteTransactions.java Updates docs: retries no longer mention TimeoutException explicitly.
Comments suppressed due to low confidence (1)

modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/TxManagerImpl.java:1276

  • handleReplicaAsyncResponse used to ignore messages with a non-null correlationId (i.e. responses handled by the request/response layer). The new implementation processes any ReplicaResponse regardless of correlationId, which risks removing inflights / triggering cleanup handling for unrelated correlated invocations. Restore the correlationId == null guard (or an equivalent filter) to ensure only direct async responses are handled here.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 226 to 239
return finishTransaction(enlistedPartitionGroups, txId, commit, commitTimestamp)
.thenCompose(txResult -> {
.thenApply(txResult -> {
boolean actualCommit = txResult.transactionState() == COMMITTED;
HybridTimestamp actualCommitTs = txResult.commitTimestamp();

return txManager.cleanup(replicationGroupId, enlistedPartitions, actualCommit, actualCommitTs, txId)
.thenApply(v -> txResult);
try {
txManager.cleanup(replicationGroupId, enlistedPartitions, actualCommit, actualCommitTs, txId)
.thenApply(v -> txResult);
} catch (Exception e) {
LOG.warn("Failed to cleanup a transaction [id=" + txId + ']', e);
}

return txResult;
});
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finishAndCleanup no longer waits for txManager.cleanup(...), but the current code also drops the returned future entirely (cleanup(...).thenApply(...) is not returned or observed). This makes cleanup failures invisible (the try/catch won’t catch async failures) and can leave resources/locks uncleaned without any signal. Consider explicitly attaching a whenComplete/logging handler (and/or returning/chaining the cleanup future if callers must observe completion) instead of creating an unused stage.

Copilot uses AI. Check for mistakes.
isFinishedDueToTimeout
), publicCause));
// TODO IGNITE-28461 fail fast if TxContext.err != null.
return failedFuture(tx.enlistFailedException());
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When addInflight fails, the code now returns tx.enlistFailedException() which does not include the table/partition context that was previously present in this failure path (tableName/partId). This makes troubleshooting much harder for users. Consider preserving that context in the thrown exception message (e.g., wrap/enrich enlistFailedException() with tableName/partId).

Suggested change
return failedFuture(tx.enlistFailedException());
TransactionException enlistFailedException = tx.enlistFailedException();
return failedFuture(new TransactionException(
enlistFailedException.traceId(),
enlistFailedException.code(),
format(
"{} [tableName={}, partitionId={}]",
enlistFailedException.getMessage(),
tableName,
partId
),
enlistFailedException
));

Copilot uses AI. Check for mistakes.
PRIMARY_REPLICA_AWAIT_ERR,
PRIMARY_REPLICA_AWAIT_TIMEOUT_ERR
);
return ExceptionUtils.hasCause(e, RetriableTransactionException.class);
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exceptionAllowsImplicitTxRetry now relies solely on the RetriableTransactionException marker. Some errors that were previously retried here (e.g. GROUP_OVERLOADED_ERR via GroupOverloadedException) do not implement this marker, so implicit-tx retries may silently stop happening. Either ensure all previously-retriable exceptions implement RetriableTransactionException, or keep a small explicit allowlist fallback for the relevant error codes.

Suggested change
return ExceptionUtils.hasCause(e, RetriableTransactionException.class);
if (ExceptionUtils.hasCause(e, RetriableTransactionException.class)) {
return true;
}
for (Throwable cause = e; cause != null; cause = cause.getCause()) {
if (cause instanceof IgniteException
&& ((IgniteException) cause).code() == org.apache.ignite.lang.ErrorGroups.Replicator.GROUP_OVERLOADED_ERR) {
return true;
}
}
return false;

Copilot uses AI. Check for mistakes.
Comment on lines +90 to +92
public static int hash(UUID txId, int divisor) {
return spread(txId.hashCode()) % divisor;
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TransactionIds.hash will throw ArithmeticException for divisor <= 0 and uses % which is easy to misuse from callers. Since this is a public utility, consider validating divisor > 0 (and/or using Math.floorMod(...)) to make the contract explicit and avoid surprising failures if the method is reused outside the current CONCURRENCY usage.

Copilot uses AI. Check for mistakes.
Comment on lines +48 to +54
@RepeatedTest(10)
public void testHash() {
Random r = new Random(0);
UUID id = UUID.randomUUID();
int div = 1 + r.nextInt(32);
int hash = TransactionIds.hash(id, div);
assertTrue(hash >= 0 && hash < div, id + " " + div);
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testHash seeds Random inside the test (new Random(0)), so every repetition uses the same div value; the @RepeatedTest(10) doesn’t increase coverage. Consider moving the Random outside the test method (or varying the seed/divisor) so repetitions actually exercise different divisors/inputs.

Copilot uses AI. Check for mistakes.
@ascherbakoff ascherbakoff changed the title IGNITE-25825 Async write intent cleanup IGNITE-25825 Async write intent cleanup (dry run) Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants