New incremental cleanup strategy: 63%-88% faster #2903
Conversation
This change adds a new cleanup mode that avoids cleanup having to re-traverse the directories the index pass just looked at. Additionally, we efficiently query the Xapian database by walking the term list instead of doing multiple point-wise path lookups.

I'd noticed that most of my time in mu's cleanup pass consisted of B-tree lookups in Xapian (one 8KB pread64 at a time). The point lookups forced Xapian to traverse from the root of the B-tree to the leaf for every single message. Additionally, in order to join on the message path, we had to do *another* B-tree traversal after locating each message term. Now we just walk the terms in order, which is much more efficient, as we touch each B-tree node only once.

On my system, with 1371861 total messages, the total time of mu index (no lazy check):

--nocleanup: 3.6s
incremental cleanup: 4.2s (0.6s in cleanup)
legacy cleanup: 5.2s (1.6s in cleanup)

With the new mode, we save 1.0s of the 1.6s cleanup, so we're ~63% faster.

But the incremental cleanup works even better with lazy checking. If I enable --lazy-check, dirty only my INBOX (360778 messages), and run index, I get:

--nocleanup: 0.9s
incremental cleanup: 1.1s (0.2s in cleanup)
legacy cleanup: 2.5s (1.6s in cleanup)

We save 1.4s out of 1.6s, for an ~88% speedup.

This change also fixes a timestamp bug: we should be storing the *start* time of the index pass in metadata, not the end time, so that on the next index pass, we notice messages that arrived between the two times.

All tests pass. You can set the environment variable MU_NO_INCREMENTAL_CLEANUP to use the legacy cleanup path instead.
Ah, nice work, thank you! I'll take a closer look in the next few days.
Overall, looks good!
I'm not too familiar with ranges yet, so a good reason to learn a bit!
A few questions (see review).
This also currently does not compile on the MacOS CI build:
https://github.com/djcb/mu/actions/runs/22084626928/job/63816567686
(oh, you already fixed it, great!)
```cpp
std::vector<Store::Id> ids_to_remove;

xapian_db().request_transaction();
```
This loop could use a comment, i.e., a few lines on what we're doing here.
```cpp
 * B-tree traversals.
 *
 * @param ids vector with terms for the message
 * @param progress_fn called occasionally to update number of removed messages;
```
The function declaration below calls this `terms`, not `ids`.
Perhaps use a `const vector&` of strings here instead? That seems more idiomatic.
That doesn't preclude using a span in the implementation of course, to work on subsets.
Will change the names. Happy to change the type too, but first check out https://abseil.io/tips/93 and the vector section of https://queue.acm.org/detail.cfm?id=3372264
Passing containers (including strings) around by const reference is a long-standing C++ pessimization and pet peeve of mine: a const reference forces the compiler to emit a double pointer dereference (chase the reference to the container, then chase the data pointer inside the container to the contents). Passing borrowed container slices as std::span (and strings as string_view) lets the compiler avoid the double dereference and (IMHO) is more elegant to boot.
Your codebase though. Just let me know if you're sure.
```cpp
	ids_to_remove.insert(ids_to_remove.end(), mset.begin(), mset.end());
}

// Sort the IDs to remove to make Xapian tree traversal easier
```
To what extent are we depending on Xapian implementation details here?
Xapian contractually iterates over terms in ascending byte-lexicographic order, so it's only natural to suppose that its storage engines will generally get the best locality when accessing terms in that order. We're allowed to delete in any order we want, so if we have to choose one, ascending order seems reasonable. There's no functional dependency on Xapian internals. And sadly, Xapian has no public bulk-delete API, AFAIK:
https://xapian.org/docs/apidoc/html/classXapian_1_1Database.html#abf8de9d7fe351a347e7fa9af605a71bb
Okay, I have merged this now, thank you!
Oh, thanks! I was going to get around to making those changes.