Set NoFreelistSync on bolt metadata databases to avoid write amplification #6882
ZRHann wants to merge 1 commit
…cation
bbolt rewrites the entire freelist on every transaction commit. Once a metadata database accumulates many free pages (for example after GC frees most of a large database), each small metadata write amplifies into a multi-MB disk write. With a high commit rate this can saturate disk write bandwidth and, because every commit holds bbolt's single writer lock while fsyncing, stall otherwise-cheap build operations.
Set NoFreelistSync on the bolt databases that buildkit opens itself (containerdmeta.db, metadata_v2.db and cache.db). The freelist is not data; it is reconstructed by scanning the database on open, so durability is unaffected.
containerd applies the same option to its own metadata database for the same reason: containerd/containerd#6761
Signed-off-by: ZRHann <zrhann@foxmail.com>
@tonistiigi friendly ping
tonistiigi left a comment
Thanks.
This seems preferable, but I think one of the downsides is that the initial load time increases. I'm not sure by how much, or whether this is a practical problem. If you have data, please share.
So I think we should do this, but we should also add periodic compaction (in the same release cycle). I played around with this a while back and need to see if I still have my experiments somewhere, unless you want to attempt it as well. We changed all our DB usage to an interface for a similar reason: so that we could swap out the DB if needed.
Looks like this missed the history db and the cachedigest db. I guess these omissions were not intentional.
I tested this on the affected
Happy to keep following up on the periodic compaction. It would be great if you could share your old experiments if you still have them.
Motivation
bbolt rewrites the entire freelist on every transaction commit — it is not written incrementally. So once one of the databases buildkit opens accumulates many free pages (e.g. after GC frees most of a database that previously grew large), even a commit that changes a single key turns into a multi-MB write.
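To give a sense of scale, the arithmetic can be sketched as follows. This is a rough estimate, not bbolt's code: it assumes each free or pending page ID is serialized as 8 bytes after a 16-byte page header, 4 KiB pages, and a hypothetical count of 1,000,000 free pages. The names freelistBytes and pagesNeeded are made up for illustration.

```go
package main

import "fmt"

const (
	pageHeaderSize = 16   // assumption: size of bbolt's on-disk page header, in bytes
	pageIDSize     = 8    // each free or pending page ID is stored as a 64-bit integer
	pageSize       = 4096 // assumption: 4 KiB database pages
)

// freelistBytes estimates how many bytes are serialized when a freelist
// holding n page IDs (free + pending) is written out.
func freelistBytes(n uint64) uint64 {
	return pageHeaderSize + pageIDSize*n
}

// pagesNeeded rounds a byte count up to whole database pages.
func pagesNeeded(bytes uint64) uint64 {
	return (bytes + pageSize - 1) / pageSize
}

func main() {
	const freePages = 1_000_000 // hypothetical free + pending page count
	size := freelistBytes(freePages)
	fmt.Printf("freelist: %d bytes (%.1f MB), %d pages per commit\n",
		size, float64(size)/1e6, pagesNeeded(size))
}
```

Under these assumptions, a freelist of one million entries costs about 8 MB per commit, no matter how small the logical change is.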
Evidence: (*Tx).commitFreelist → freelist.Write, which serializes the whole freelist; the written size is pageHeader + 8 bytes × (free + pending page count).
What we observed in production
containerdmeta.db had grown to ~4.8 GB while live data was only ~10 KB: GC had freed almost everything, leaving a huge freelist that was rewritten on every commit. Inspected with the bbolt CLI (go install go.etcd.io/bbolt/cmd/bbolt@latest):
So every commit wrote ~9 MB regardless of the logical change, and because writes serialize behind bbolt's single writer lock (each commit holds it across the fsync), this kept the disk pegged and made even RUN ls take seconds.
For reference, compacting the same database drops it from 4.8 GB to 128 KB (bbolt compact -o out.db containerdmeta.db), confirming almost all of it was free pages.
Change
Set NoFreelistSync: true on the bolt databases buildkit opens itself:
- containerdmeta.db (runc worker)
- metadata_v2.db (cache metadata)
- cache.db (solver cache; already opened with NoSync)
Testing
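For context on the measurement used in this section: on Linux, /proc/self/io exposes a wchar counter that tracks the bytes the process has passed to write-family syscalls. The following is a minimal sketch of how such a counter can be sampled around a workload, not the program used for the numbers in this PR. The helper names parseWchar and readWchar are hypothetical, and the 1 MiB temp-file write is only a stand-in for the bolt commit loop.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseWchar extracts the wchar counter from the text of /proc/<pid>/io,
// which consists of "key: value" lines.
func parseWchar(procIO string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(procIO))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if ok && strings.TrimSpace(key) == "wchar" {
			return strconv.ParseUint(strings.TrimSpace(val), 10, 64)
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("wchar field not found")
}

// readWchar samples the counter for the current process (Linux only).
func readWchar() (uint64, error) {
	data, err := os.ReadFile("/proc/self/io")
	if err != nil {
		return 0, err
	}
	return parseWchar(string(data))
}

func main() {
	before, err := readWchar()
	if err != nil {
		fmt.Println("cannot read /proc/self/io:", err)
		return
	}

	// Stand-in workload: write 1 MiB to a temporary file.
	f, err := os.CreateTemp("", "wchar-demo-*")
	if err != nil {
		fmt.Println("cannot create temp file:", err)
		return
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.Write(make([]byte, 1<<20)); err != nil {
		fmt.Println("write failed:", err)
		return
	}

	after, err := readWchar()
	if err != nil {
		fmt.Println("cannot read /proc/self/io:", err)
		return
	}
	fmt.Println("bytes written during workload:", after-before)
}
```

Sampling the counter before and after the loop and taking the difference gives the bytes written per run; dividing by the number of commits gives an approximate per-commit write size.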
Copied the affected containerdmeta.db to local disk and ran a small Go program that opens it exactly as buildkit does and commits single-key transactions in a tight loop, measuring bytes actually written through /proc/self/io (wchar):
The loop is run with opts = nil (current behaviour) and opts = &bolt.Options{NoFreelistSync: true}. Single writer, 10 s, on a copy of the real database; numbers below are stable across repeated runs:
[Results table comparing the default options (nil) with NoFreelistSync; the values did not survive extraction.]
Prior art
containerd sets the same option on its own metadata database (meta.db), and has since 2022:
- NoFreelistSync = true (plugins/metadata/plugin.go)
buildkit opens its bolt databases directly rather than through containerd's metadata plugin, so it never inherited this change.