New incremental cleanup strategy: 63%-88% faster #2903
Conversation
This change adds a new cleanup mode that avoids cleanup having to re-traverse the directories the index pass just looked at. Additionally, we efficiently query the Xapian database by walking the term list instead of doing multiple point-wise path lookups.

I'd noticed that most of my time in mu's cleanup pass consisted of B-tree lookups in Xapian (one 8KB pread64 at a time). The point lookups forced Xapian to traverse from the root of the B-tree to the leaf for every single message. Additionally, in order to join on the message path, we had to do *another* B-tree traversal after locating each message term. Now we just walk the terms in order, which is much more efficient, as we touch each B-tree node only once.

On my system, with 1371861 total messages, the total time of mu index (no lazy check):

--nocleanup: 3.6s
incremental cleanup: 4.2s (0.6s in cleanup)
legacy cleanup: 5.2s (1.6s in cleanup)

With the new mode, we save 1.0s of the 1.6s cleanup, so we're ~63% faster.

But the incremental cleanup works even better with lazy checking. If I enable --lazy-check, dirty only my INBOX (360778 messages), and run index, I get:

--nocleanup: 0.9s
incremental cleanup: 1.1s (0.2s in cleanup)
legacy cleanup: 2.5s (1.6s in cleanup)

We save 1.4s out of 1.6s, for an ~88% speedup.

This change also fixes a timestamp bug: we should be storing the *start* time of the index pass in metadata, not the end time, so that on the next index pass, we notice messages that arrived between the two times.

All tests pass. You can set the environment variable MU_NO_INCREMENTAL_CLEANUP to use the legacy cleanup path instead.
Ah, nice work, thank you! I'll take a closer look in the next few days.
Overall, looks good!
I'm not too familiar with ranges yet, so a good reason to learn a bit!
A few questions (see review).
This also currently does not compile on the MacOS CI build:
https://github.com/djcb/mu/actions/runs/22084626928/job/63816567686
(oh, you already fixed it, great!)
```cpp
std::vector<Store::Id> ids_to_remove;

xapian_db().request_transaction();
```
This loop could use a comment, i.e., a few lines on what we're doing here.
```cpp
 * B-tree traversals.
 *
 * @param ids vector with terms for the message
 * @param progress_fn called occasionally to update number of removed messages;
```
The function declaration below calls this `terms`, not `ids`.
Perhaps use a `const vector&` of strings here instead? That seems more idiomatic.
That doesn't preclude using a span in the implementation of course, to work on subsets.
Will change the names. Happy to change the type too, but first check out https://abseil.io/tips/93 and the vector section of https://queue.acm.org/detail.cfm?id=3372264
Passing containers (including strings) around by const reference is a long-standing C++ pessimization and pet peeve of mine: a const reference forces the compiler to emit a double pointer dereference (chase the reference to the container, then chase the data pointer inside the container to the contents). Passing borrowed container slices as std::span (and strings as string_view) lets the compiler avoid the double dereference and (IMHO) is more elegant to boot.
Your codebase though. Just let me know if you're sure.
```cpp
	ids_to_remove.insert(ids_to_remove.end(), mset.begin(), mset.end());
}

// Sort the IDs to remove to make Xapian tree traversal easier
```
To what extent are we depending on Xapian implementation details here?
Xapian contractually iterates over terms in ascending byte-lexicographic order, so it's only natural to suppose that its storage engines will generally get the best locality when accessing terms in that order. We're allowed to delete in any order we want, so if we have to choose one, ascending order seems reasonable. There's no functional dependency on Xapian internals. And sadly, Xapian has no public bulk-delete API, AFAIK:
https://xapian.org/docs/apidoc/html/classXapian_1_1Database.html#abf8de9d7fe351a347e7fa9af605a71bb
Okay, I have merged this now, thank you!
Oh, thanks! I was going to get around to making those changes.