Split large pending writes queue entries#3970

Open
ohadzeliger wants to merge 29 commits into FoundationDB:main from ohadzeliger:split-queue-entries

Conversation


@ohadzeliger ohadzeliger commented Feb 25, 2026

This PR allows Pending Writes Queue entries to be split using the SplitHelper, so that large entries can be saved in the queue. It contains both the SplitHelper work and the PendingWriteQueue work.
On the SplitHelper side, the work introduces an abstraction of key construction and value writing in order to support VersionInKey: the concept of having an incomplete version in the keys of the split records. There are two implementations: DefaultKeyValueSplitHelper (the current behavior) and VersioningKeyValueSplitHelper (the new one, in support of the queue).
Care was taken to ensure no changes to the existing behavior (for backwards compatibility). The original test, SplitHelperTest, remains almost intact, while a new test, SplitHelperMultipleTransactionsTest, was added to cover the new functionality. The new test borrows heavily from the original, with the intent of testing the behavior across multiple transactions (since the first transaction has to commit in order to complete the version in the key, so that the entry can be located).
Two test cases, saveWithSplitAndIncompleteVersions and deleteWithSplitAndVersion, were not copied over as they have no relevance for the new VersionInKey.
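To make the splitting idea concrete, here is a rough stdlib sketch (the class, method names, and chunk size are all hypothetical; the real SplitHelper works against FDB's Tuple layer and value-size limits): a large value is written as consecutive index-suffixed chunks and reassembled on read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    // Hypothetical chunk size; the real SplitHelper splits based on FDB's value-size limits.
    static final int SPLIT_SIZE = 100_000;

    // Write side: break a large value into chunks stored under key/0, key/1, ...
    static List<byte[]> split(byte[] value) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < value.length; off += SPLIT_SIZE) {
            chunks.add(Arrays.copyOfRange(value, off, Math.min(off + SPLIT_SIZE, value.length)));
        }
        return chunks;
    }

    // Read side: reassemble the chunks scanned back in index order.
    static byte[] join(List<byte[]> chunks) {
        byte[] out = new byte[chunks.stream().mapToInt(c -> c.length).sum()];
        int off = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, off, c.length);
            off += c.length;
        }
        return out;
    }
}
```

The queue work in this PR is about making the keys of such chunks carry an incomplete version, which is what VersionInKey adds on top of this basic shape.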

This is considered a breaking change, as it changes the data structure of the Lucene index's pending writes queue. The index should be disabled before the deployment and re-enabled afterwards.

Resolves #3981

ohadzeliger and others added 19 commits February 13, 2026 16:29
# Conflicts:
#	fdb-record-layer-lucene/src/main/java/com/apple/foundationdb/record/lucene/directory/FDBDirectory.java
#	fdb-record-layer-lucene/src/main/java/com/apple/foundationdb/record/lucene/directory/PendingWriteQueue.java
#	fdb-record-layer-lucene/src/test/java/com/apple/foundationdb/record/lucene/directory/PendingWriteQueueTest.java
Adds a useVersionInKey boolean to SplitHelperTestConfig, a keyHelper(int localVersion)
factory method, and extends allValidConfigs() to produce versioning variants of all
existing configs (excluding isDryRun=true, which is incompatible with versionstamps).

This lays the groundwork for multi-transaction test methods that exercise
VersioningSplitKeyValueHelper alongside the existing DefaultSplitKeyValueHelper tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix a bug in SplitHelper.writeSplitRecord where split keys were packed
using subspace.subspace(key) + Tuple.from(index), producing an
inconsistent layout compared to the unsplit path for
VersioningSplitKeyValueHelper. Now both call packSplitKey(subspace,
key.add(index)), which VersioningSplitKeyValueHelper handles as
[subspace][vs][key][index] — consistent with [subspace][vs][key][0]
for unsplit records.
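A minimal model of the layout consistency described in this commit, treating packed keys as element lists (illustrative only; the real code packs FDB Tuples): both the split and unsplit paths produce [subspace][vs][key][index], differing only in the trailing index.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyLayoutSketch {
    // Model of packSplitKey(subspace, key.add(index)) for the versioning helper:
    // [subspace elements][versionstamp][key elements][index]
    static List<Object> packSplitKey(List<Object> subspace, Object versionstamp,
                                     List<Object> key, long index) {
        List<Object> packed = new ArrayList<>(subspace);
        packed.add(versionstamp);
        packed.addAll(key);
        packed.add(index);
        return packed;
    }
}
```

An unsplit record (index 0) and every split chunk then share the same [subspace][vs][key] prefix, which is what makes a range scan over one record return all of its pieces.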

Add SplitHelperTestConfig.useVersionInKey and keyHelper(localVersion)
to drive tests through VersioningSplitKeyValueHelper. Extend
allValidConfigs() to cover both helpers. Add Assumptions.assumeFalse
guards where tests are incompatible with versioning keys.

Introduce saveOnly + verifySuccessfullySaved helpers split from
saveSuccessfully, and new saveWithSplitMultipleTransactions test that
writes in tx1 and verifies in tx2, covering both default and versioning
key helpers.

Add parameterized defaultHelperSplitKeyEquivalence test asserting that
the old call site subspace.subspace(key).pack(index) equals the new
subspace.pack(key.add(index)) for DefaultSplitKeyValueHelper across
all relevant key/index combinations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eTransactions

- Add useVersionInKey and localVersion parameters to writeDummyRecord,
  threading both through to writeDummyKV
- writeDummyKV now handles versionstamp prepending and FDB write dispatch
  (SET_VERSIONSTAMPED_KEY vs plain set) based on useVersionInKey
- Remove writeDummyVersionedRecord (now redundant with writeDummyRecord)
- Add deleteWithSplitMultipleTransactions: write in tx1, delete in tx2,
  verify empty in tx3, covering both default and versioning key helpers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eTransactions

- Add writeDummyRecordsMultipleTransactions(boolean useVersionInKey): writes
  50 records in tx1 (with FDBRecordVersion when useVersionInKey=false, with
  versionstamp keys when true), then scans back in tx2 to build expected
  FDBRawRecord list with accurate committed sizes
- Add scanMultipleRecordsWithVersioningKey: parallel to scanMultipleRecords,
  uses writeDummyRecordsMultipleTransactions(true) for setup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parallel to scanContinuations, uses writeDummyRecordsMultipleTransactions(true)
for setup so records are written with versionstamp keys and read back in a
separate transaction with accurate committed sizes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ohadzeliger ohadzeliger changed the title Draft PR for Split queue entries Split large pending writes queue entries Mar 2, 2026
@ohadzeliger ohadzeliger requested review from ScottDugas and jjezra and removed request for ScottDugas March 2, 2026 15:57
@ohadzeliger ohadzeliger marked this pull request as ready for review March 2, 2026 15:57
@ohadzeliger ohadzeliger self-assigned this Mar 2, 2026
@ohadzeliger ohadzeliger added enhancement New feature or request Run mixed-mode Label to add to Pull Requests to have it run mixed mode tests labels Mar 2, 2026
@ohadzeliger ohadzeliger requested a review from ScottDugas March 2, 2026 16:44
@ScottDugas ScottDugas left a comment:

There are a fair number of comments about how the clarity of the test could be improved. For anything other than quick one-line changes, it probably makes sense to merge and follow up with improved readability of the test, so that we can be confident we're testing the same thing as before.

if (splitLongRecords || previousSizeInfo == null || previousSizeInfo.isVersionedInline()) {
if (splitKeyHelper.shouldClearBeforeWrite() && (splitLongRecords || previousSizeInfo == null || previousSizeInfo.isVersionedInline())) {
// Note that the clearPreviousSplitRecords also removes version splits from the context cache
// This is not currently supported for the case where we have versions in the keys since we can't trace the old values down
Collaborator:

If we can't know the previous value, how can we have a previousSizeInfo?

Contributor Author:

When shouldClearBeforeWrite() == true, the assumption is that we have values we can use to clear the previous keys (with or without the sizeInfo).
When shouldClearBeforeWrite() == false, sizeInfo is ignored, as we don't call clear at all (and chances are we will see null there).

@ohadzeliger ohadzeliger Mar 9, 2026:

With the changed logic, if the key has an incomplete version, then we can't clear older records based on size info since the size info either has complete version (previous transaction was committed) or it has incomplete versions (in which case the keys are not in the RYW cache)

* -
*/
@Tag(Tags.RequiresFDB)
public class SplitHelperMultipleTransactionsTest extends FDBRecordStoreTestBase {
Collaborator:

Probably not as part of this PR, but it might be worth some sort of chaos testing. (i.e. random key shapes, random bytes for values of random lengths, etc.)

Contributor Author:

Agree

@ohadzeliger ohadzeliger added the breaking change Changes that are not backwards compatible label Mar 3, 2026
* @param serialized serialized representation
* @param version the version to store inline with this record
*/
public static void saveWithSplit(@Nonnull final FDBRecordContext context, @Nonnull final Subspace subspace,
Collaborator:

One to look out for is reading/deleting an item within the same transaction as the one that inserted it. That may mean fishing through the versionMutationCache on the FDBRecordContext for keys with versions that haven't been updated if that hasn't already been done.

if (splitLongRecords || previousSizeInfo == null || previousSizeInfo.isVersionedInline()) {
if (splitKeyHelper.shouldClearBeforeWrite() && (splitLongRecords || previousSizeInfo == null || previousSizeInfo.isVersionedInline())) {
// Note that the clearPreviousSplitRecords also removes version splits from the context cache
// This is not currently supported for the case where we have versions in the keys since we can't trace the old values down
Collaborator:

Maybe I don't understand this, but it seems like we should be able to do this. The versionMutationCache always uses a fixed value (all \xff bytes) for the uncommitted version, so two keys with the same FDBRecordVersion in it should appear in the same location of the versionMutationCache.

As it happens, I think this is fine for the use case of the pending writes queue, as a different local version is claimed with each insert. Theoretically, you could imagine leveraging this code to support versionstamps in primary keys (which we currently don't, but this gets us pretty close to being able to support), in which case this may be more important

Contributor Author:

The issue faced was that the cache contains specific keys with split suffixes which can span multiple cache entries. In the "incomplete version in key" case we would need to replace them all when a new record gets written with the same local version. One way to introduce that would be to group all the splits by the local version and replace the entire group every time, but that seems a little excessive since there is no use case for this currently.

Collaborator:

I think we do group all of the keys by their local version. The mutations currently go into a NavigableMap, which is sorted by unsigned byte order. Incomplete versions all get assigned 10 \xff bytes as their global version, and so they'll all be there in the cache in consecutive keys. If there's no use case for updating large values within the same transaction as they're included, then we could punt this, though
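The grouping described here can be sketched with stdlib types (a model, not the Record Layer's actual versionMutationCache; the 12-byte shape is an assumption of this sketch): if every incomplete version serializes as the same 10-byte 0xFF placeholder followed by the 2-byte local version, then in a map sorted by unsigned byte order all split keys for one local version sit in one contiguous range.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

public class VersionCacheSketch {
    // Placeholder for an uncommitted global version: 10 bytes of 0xFF,
    // followed by the 2-byte local version (big-endian).
    static byte[] incompleteVersion(int localVersion) {
        ByteBuffer buf = ByteBuffer.allocate(12);
        for (int i = 0; i < 10; i++) {
            buf.put((byte) 0xFF);
        }
        buf.putShort((short) localVersion);
        return buf.array();
    }

    // Append a split-index suffix to a version prefix.
    static byte[] withSuffix(byte[] prefix, byte suffix) {
        byte[] out = Arrays.copyOf(prefix, prefix.length + 1);
        out[prefix.length] = suffix;
        return out;
    }

    // All split keys sharing a local version form one contiguous range in the sorted map.
    static NavigableMap<byte[], String> splitsOf(int localVersion, NavigableMap<byte[], String> cache) {
        return cache.subMap(incompleteVersion(localVersion), true,
                            incompleteVersion(localVersion + 1), false);
    }

    public static void main(String[] args) {
        NavigableMap<byte[], String> cache = new TreeMap<>(Arrays::compareUnsigned);
        cache.put(withSuffix(incompleteVersion(1), (byte) 0), "v1-split0");
        cache.put(withSuffix(incompleteVersion(1), (byte) 1), "v1-split1");
        cache.put(withSuffix(incompleteVersion(2), (byte) 0), "v2-split0");
        System.out.println(splitsOf(1, cache).size()); // both splits of local version 1
    }
}
```

Under this model, replacing "the entire group" for a local version is a single subrange clear, which matches the suggestion above that the grouping already exists in the NavigableMap.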


public static Stream<Arguments> testConfigs() {
return SplitHelperTestConfig.allValidConfigs().map(Arguments::of);
public static Stream<Arguments> testConfigsNoVersionInKey() {
Collaborator:

It's a bit odd that the way this turned out, all of the tests wound up with the NoVersionInKey case. It kind of seems like what you'd want is for testConfigs to return the same set of configs as earlier, and then you could test out instances where the key has a version in separate cases in this file

Contributor Author:

This was the original attempt. Things got complicated as the versionInKey required structural changes to the tests (save in one transaction, keep local and global version, construct actual key, assert in another transaction), which was incompatible with the structure of the test.
I think that a future enhancement (separate from the actual refactoring) would be to break the large sequential tests into smaller ones, where we could affect the way the assertions are made (one or two transactions) separately from the actual flow of the test logic.

public class SplitHelperMultipleTransactionsTest extends FDBRecordStoreTestBase {

// From the traditional nursery rhyme
private static final byte[] HUMPTY_DUMPTY =
Collaborator:

I also noted the duplication. Given that the distinction seems to just be whether the methods use versionstamp or not, I think this could mostly just be combined with the other tests. I suppose a common helper could be extracted if we wanted to separate out the versionstamp and non-versionstamp tests for code organization purposes. But I don't think, for example, that the multiple transaction-ness of these causes much issue, if they lived in the same file as the others

@alecgrieser (Collaborator):

I haven't totally gone through all of the tests, but I figured I'd add what I had before taking off for the day

@alecgrieser alecgrieser left a comment:

There's some concern about the particulars of the API here, but I think the approach is sound, in that I don't believe it would cause trouble for existing data, and it does seem like it would enable large values into the PendingWriteQueue. Fixing up the API in a follow up seems acceptable

@ScottDugas ScottDugas left a comment:

We discussed offline that we should take in some of the api changes, and test improvements, but I have some comments on the most recent PR changes.

@Nonnull
private TestDocument createHugeDocument() {
String hugeString = "Hello ".repeat(100_000);
Random random = ThreadLocalRandom.current();
Collaborator:

Probably better to use @RandomSeedSource, although I would be quite surprised if this test changes behavior with a different seed.

Contributor Author:

The calling methods use @MethodSource already; not sure how to combine that with @RandomSeedSource.

Collaborator:

The RandomSeedSource itself is based on RandomizedTestUtils.randomSeeds:

return RandomizedTestUtils.randomSeeds(fixedSeeds).map(Arguments::of);

I think the only provider is a boolean provider, so creating a method arguments provider that returns something like:

static Stream<Arguments> seedAndUseCompression() {
    return RandomizedTestUtils.randomSeeds(/* some array of fixed seeds */)
        .flatMap(seed -> Stream.of(Arguments.of(seed, false), Arguments.of(seed, true)));
}

Would do it, I think

Collaborator:

Yeah, you can't use @RandomSeedSource, but there's actually a cleaner way to do that, you could do:

static Stream<Arguments> seedAndUseCompression() {
    return ParameterizedTestUtils.cartesianProduct(
            RandomizedTestUtils.randomSeeds(/* some array of fixed seeds */),
            ParameterizedTestUtils.booleans("useCompression"));
}

Using cartesianProduct just makes this code a bit cleaner (IMHO), using booleans means that the description of the arguments will be useCompression/!useCompression in our test logs/ides which is clearer than just true/false
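For illustration, the shape of such a provider can be approximated with stdlib streams (RandomizedTestUtils and ParameterizedTestUtils are the project's own helpers; this sketch substitutes a plain list of seeds and bare Object arrays for JUnit Arguments):

```java
import java.util.List;
import java.util.stream.Stream;

public class CartesianSketch {
    // Cross every random seed with both boolean values, as cartesianProduct would.
    static Stream<Object[]> seedAndUseCompression(List<Long> seeds) {
        return seeds.stream()
                .flatMap(seed -> Stream.of(
                        new Object[] {seed, false},
                        new Object[] {seed, true}));
    }
}
```

Each seed yields two argument sets, so N seeds produce 2N parameterized test invocations.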


@Nonnull
private TestDocument createHugeDocument() {
String hugeString = "Hello ".repeat(100_000);
Collaborator:

Is there any value in a different test that has a document that compresses well? I don't think there's any max size for SplitHelper, so I don't know that you could observe any change in behavior, other than trying to save something that is 10,000,000 bytes (https://apple.github.io/foundationdb/known-limitations.html) but very compressible. I'm not sure we want a test using 10 MB of memory just for this.

Contributor Author:

I don't think there is a limit on the total record size, so there is no discernible difference between N splits and 500*N splits, AFAICT

Collaborator:

That limit is the limit on the amount of data you can have in a single transaction. So if you tried to save an 11,000,000-byte string of "repeat me", it would save successfully with compression, and fail if it just had the splits.
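The claim is easy to check with stdlib deflate (sizes are illustrative; the relevant FDB limit is the documented 10,000,000-byte transaction size): a highly repetitive 11 MB value compresses to a small fraction of the limit, while its raw splits would exceed it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

public class CompressionSketch {
    // Deflate the value and report the compressed byte count.
    static int compressedSize(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.size();
    }

    public static void main(String[] args) {
        // ~11.25 MB of highly repetitive text: over the 10,000,000-byte transaction
        // limit if stored raw, but tiny once deflated.
        byte[] raw = "repeat me ".repeat(1_125_000).getBytes(StandardCharsets.US_ASCII);
        System.out.println(raw.length);          // 11250000
        System.out.println(compressedSize(raw)); // a small fraction of the raw size
    }
}
```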

@jjezra jjezra left a comment:

LGTM

@alecgrieser alecgrieser left a comment:

LGTM. There are a few things in the tests, I suppose, but we could do that later. I do also think that there's an approach here that would let us remove any keys that were previously written within the same transaction, but I'm okay with punting that for a later improvement

Comment on lines +150 to +151
// an incomplete version in the key means we shouldn't delete previous values for the record (since they all have
// completed versions by now)
Collaborator:

I don't think this is quite right. The keys only get associated with a complete version once they've been committed. Saying that we don't currently support updating records with incomplete versions in the key may be more accurate



private List<FDBRawRecord> writeDummyRecordsMultipleTransactions(boolean useVersionInKey) {
final byte[] valueVersion = "_cushions_".getBytes(StandardCharsets.US_ASCII);
// Generate primary keys using a generalization of the Fibonacci formula: https://oeis.org/A247698
Collaborator:

I believe I wrote the original version of this, and there's no reason besides it being a little more whimsical. That's also why it's _cushions_ (both are in oblique references to the same person, who is also referenced in the OEIS entry)


*/
@MethodSource("splitKeyEquivalenceCases")
@ParameterizedTest(name = "defaultHelperSplitKeyEquivalence[key={0}, index={1}]")
void defaultHelperSplitKeyEquivalence(Tuple key, long index) {
Collaborator:

The name here is probably a bit confusing given that DefaultSplitKeyValueHelper no longer exists. Perhaps this test is no longer necessary?


private Tuple toIncompleteKey(Tuple key, int localVersion, boolean versionInKey) {
if (versionInKey) {
return Tuple.from(Versionstamp.incomplete(localVersion)).addAll(key);
Collaborator:

I don't have much concern, but it would probably be good to followup with some tests where the version is elsewhere in the key




if (current != null) {
// This should never happen. It means that the same key (and suffix) and local version were used for subsequent
// write. It is most likely not an intended flow and this check will protect against that.
// It is an incomplete check since the same record can have different suffixes to the primary key that would not collide.
Collaborator:

Suggested change
// It is an incomplete check since the same record can have different suffixes to the primary key that would not collide.
// It is an incomplete check since the same record can have different suffixes to the primary key that would not collide.
// Namely, if one is split, and one is not split, they will not overlap. Anything else and they would.

* @param sizeInfo optional size information to populate
*/
@SuppressWarnings("PMD.CloseResource")
public static void saveWithSplit(@Nonnull final FDBRecordContext context, @Nonnull final Subspace subspace,
Collaborator:

Does it make sense with this new API to support dryRun in the same way?
I don't think it should be done in this PR, and it's not really a priority, but there is some inconsistency that might be worth resolving eventually.


Labels

breaking change Changes that are not backwards compatible enhancement New feature or request Run mixed-mode Label to add to Pull Requests to have it run mixed mode tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Split large Lucene Pending Writes Queue entries

4 participants