New incremental cleanup strategy: 63%-88% faster #2903

Merged
djcb merged 2 commits into djcb:master from dcolascione:cleanup-perf
Feb 24, 2026

Conversation

@dcolascione
Contributor

This change adds a new cleanup mode that avoids cleanup having to re-traverse the directories the index pass just looked at. Additionally, we efficiently query the Xapian database by walking the term list instead of doing multiple point-wise path lookups.

I'd noticed that most of my time in mu's cleanup pass consisted of B-tree lookups in Xapian (one 8KB pread64 at a time). The point lookups forced Xapian to traverse from the root of the B-tree to the leaf for every single message. Additionally, in order to join on the message path, we had to do another B-tree traversal after locating each message term. Now we just walk the terms in order, which is much more efficient, as we touch each B-tree node only once.

On my system, with 1371861 total messages, the total time of mu index (no lazy check):

--nocleanup: 3.6s
incremental cleanup: 4.2s (0.6s in cleanup)
legacy cleanup: 5.2s (1.6s in cleanup)

With the new mode, we save 1.0s of the 1.6s cleanup, so we're ~63% faster.

But the incremental cleanup works even better with lazy checking. If I enable --lazy-check, dirty only my INBOX (360778 messages), and run index, I get:

--nocleanup: 0.9s
incremental cleanup: 1.1s (0.2s in cleanup)
legacy cleanup: 2.5s (1.6s in cleanup)

We save 1.4s out of 1.6s for ~88% speedup.

This change also fixes a timestamp bug: we should be storing the start time of the index pass in metadata, not the end time, so that on the next index pass, we notice messages that arrived between the two times.

All tests pass. You can set the environment variable MU_NO_INCREMENTAL_CLEANUP to use the legacy cleanup path instead.

@djcb
Owner

djcb commented Feb 19, 2026

Ah, nice work, thank you! I'll take a closer look in the next days.

Owner

@djcb djcb left a comment


Overall, looks good!

I'm not too familiar with ranges yet, so a good reason to learn a bit!

A few questions (see review).

This also currently does not compile on the macOS CI build:
https://github.com/djcb/mu/actions/runs/22084626928/job/63816567686
(oh you fixed already, great!)

std::vector<Store::Id> ids_to_remove;

xapian_db().request_transaction();

Owner


This loop could use a comment, i.e., a few lines on what we're doing here.

* B-tree traversals.
*
* @param ids vector with terms for the message
* @param progress_fn called occasionally to update number of removed messages;
Owner


The function declaration below calls this terms, and not ids.

Perhaps use a const vector& of strings here instead? That seems more idiomatic.
That doesn't preclude using a span in the implementation, of course, to work on subsets.

Contributor Author


Will change the names. Happy to change the type too, but first check out https://abseil.io/tips/93 and the vector section of https://queue.acm.org/detail.cfm?id=3372264

Passing containers (including strings) around by const reference is a long-standing C++ pessimization and pet peeve of mine: a const reference forces the compiler to emit a double pointer dereference (chase the reference to the container, then chase the data pointer inside the container to the contents). Passing borrowed container slices as std::span (and strings as string_view) lets the compiler avoid the double dereference and (IMHO) is more elegant to boot.

Your codebase though. Just let me know if you're sure.

ids_to_remove.insert(ids_to_remove.end(), mset.begin(), mset.end());
}

// Sort the IDs to remove to make Xapian tree traversal easier
Owner


To what extent are we depending on Xapian implementation details here?

Contributor Author


Xapian contractually iterates over terms in ascending byte-lexicographic order, so it's only natural to suppose that its storage engines will generally get the best locality when accessing terms in that order. We're allowed to delete in any order we want, so if we have to choose one, this seems reasonable. There's no functional dependency on Xapian internals, and sadly, Xapian has no public bulk-delete API, AFAIK:

https://xapian.org/docs/apidoc/html/classXapian_1_1Database.html#abf8de9d7fe351a347e7fa9af605a71bb

@djcb djcb merged commit 474a0ec into djcb:master Feb 24, 2026
2 checks passed
@djcb
Owner

djcb commented Feb 24, 2026

Okay, I have merged this now, thank you!

@dcolascione
Contributor Author

Oh, thanks! I was going to get around to making those changes.
