Set NoFreelistSync on bolt metadata databases to avoid write amplification #6882

Open
ZRHann wants to merge 1 commit into moby:master from ZRHann:nofreelistsync-metadata-db

@ZRHann ZRHann commented Jun 18, 2026

Motivation

bbolt rewrites the entire freelist on every transaction commit — it is not written incrementally. So once one of the databases buildkit opens accumulates many free pages (e.g. after GC frees most of a database that previously grew large), even a commit that changes a single key turns into a multi-MB write.

Evidence:

  • In bbolt, every commit calls (*Tx).commitFreelist → freelist.Write, which serializes the whole freelist; the written size is pageHeader + 8 bytes × (free + pending page count).
  • This is a known bbolt property — see etcd-io/bbolt#401, where a maintainer notes: "The list is not differential, so with each transaction it's written entirely." (The same issue reports the resulting symptom: a single small request taking "ALL of the available I/O bandwidth".)

What we observed in production

containerdmeta.db had grown to ~4.8 GB while live data was only ~10 KB — GC had freed almost everything, leaving a huge freelist that was rewritten on every commit. Inspected with the bbolt CLI (go install go.etcd.io/bbolt/cmd/bbolt@latest):

$ bbolt page containerdmeta.db 0          # meta page
Freelist:   <pgid=279088>
HWM:        <pgid=1266413>                # 1,266,413 pages × 4 KB ≈ 4.8 GB

$ bbolt page containerdmeta.db 279088     # the freelist page
Page Type:  freelist
Total Size: 8966144 bytes                 # ≈ 8.5 MB rewritten on EVERY commit
Item Count: 1118119                       # 1.12M free pages

$ bbolt stats containerdmeta.db
  Number of keys/value pairs: 122
  Bytes actually used for leaf data: 10532   # ≈ 10 KB live data

So every commit wrote ~9 MB regardless of the logical change, and because writes serialize behind bbolt's single writer lock (each commit holds it across the fsync), this kept the disk pegged and made even RUN ls take seconds.

For reference, compacting the same database drops it from 4.8 GB to 128 KB (bbolt compact -o out.db containerdmeta.db), confirming almost all of it was free pages.
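
As a rough cross-check of the numbers above, the sizes can be recomputed from the page counts. This is a minimal sketch, assuming a 4096-byte page and a 16-byte bbolt page header; both constants and the function names are illustrative assumptions, not values taken from the PR.

```go
package main

import "fmt"

const (
	pageSize       int64 = 4096 // assumption: 4 KB pages, as in the HWM arithmetic above
	pageHeaderSize int64 = 16   // assumption: size of bbolt's page header in bytes
)

// freelistBytes estimates the serialized freelist size:
// one page header plus 8 bytes per listed page ID.
func freelistBytes(pageIDs int64) int64 {
	return pageHeaderSize + 8*pageIDs
}

// fileBytes estimates the file size implied by the high-water-mark page ID.
func fileBytes(hwm int64) int64 {
	return hwm * pageSize
}

func main() {
	fmt.Println("freelist bytes (estimate):", freelistBytes(1118119))
	fmt.Println("file bytes (estimate):   ", fileBytes(1266413))
}
```

The freelist estimate comes out slightly below the 8966144 bytes reported by `bbolt page`; the reported figure is a whole number of pages and may also cover pending pages that the item count does not, so the two should be read as agreeing only approximately.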

Change

Set NoFreelistSync: true on the bolt databases buildkit opens itself:

  • containerdmeta.db (runc worker)
  • metadata_v2.db (cache metadata)
  • cache.db (solver cache; already opened with NoSync)

Testing

Copied the affected containerdmeta.db to local disk and ran a small Go program that opens it exactly as buildkit does and commits single-key transactions in a tight loop, measuring bytes actually written through /proc/self/io (wchar):

// opts: nil == current buildkit behaviour, vs &bolt.Options{NoFreelistSync: true}
db, err := bolt.Open(path, 0644, opts)
if err != nil {
    log.Fatal(err)
}
defer db.Close()
err = db.Update(func(tx *bolt.Tx) error {            // ensure a bucket exists
    _, e := tx.CreateBucketIfNotExists([]byte("probe"))
    return e
})
if err != nil {
    log.Fatal(err)
}

w0 := procSelfIOWchar()                              // "wchar:" from /proc/self/io
start := time.Now()
var commits int
var lat []time.Duration
for time.Since(start) < 10*time.Second {             // tight single-writer loop
    t0 := time.Now()
    err := db.Update(func(tx *bolt.Tx) error {       // one tiny single-key commit
        return tx.Bucket([]byte("probe")).Put([]byte(fmt.Sprintf("k-%d", commits)), []byte("v"))
    })
    if err != nil {
        log.Fatal(err)
    }
    lat = append(lat, time.Since(t0))                // per-commit latency
    commits++
}
written := procSelfIOWchar() - w0
elapsed := time.Since(start)

writePerCommit  := written / int64(commits)              // ≈ on-disk freelist + dirty pages
writeThroughput := float64(written) / elapsed.Seconds()  // bytes/s actually written
sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })
median, p99 := lat[len(lat)/2], lat[len(lat)*99/100]     // latency percentiles

The loop is run with opts = nil (current behaviour) and opts = &bolt.Options{NoFreelistSync: true}. Single writer, 10 s, on a copy of the real database; numbers below are stable across repeated runs:

| config | write per commit | write throughput | commit latency (median / p99) |
| --- | --- | --- | --- |
| current (nil) | 8.98 MB | 369 MB/s | 27 ms / 32 ms |
| NoFreelistSync | 0.016 MB | 6 MB/s | 2.7 ms / 3.7 ms |
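
The table's columns can be sanity-checked against each other: throughput divided by write per commit gives an implied commit rate, which should roughly match the reciprocal of the median latency in a single-writer loop. The sketch below does that arithmetic on the reported figures; the type and method names are illustrative assumptions.

```go
package main

import "fmt"

// result holds one row of the benchmark table (values as reported in the PR).
type result struct {
	name         string
	mbPerCommit  float64 // write per commit, MB
	mbPerSecond  float64 // write throughput, MB/s
	medianMillis float64 // median commit latency, ms
}

// commitsFromThroughput is the commit rate implied by the write columns.
func (r result) commitsFromThroughput() float64 {
	return r.mbPerSecond / r.mbPerCommit
}

// commitsFromLatency is the commit rate implied by the median latency,
// assuming commits run back to back on a single writer.
func (r result) commitsFromLatency() float64 {
	return 1000 / r.medianMillis
}

func main() {
	rows := []result{
		{"current (nil)", 8.98, 369, 27},
		{"NoFreelistSync", 0.016, 6, 2.7},
	}
	for _, r := range rows {
		fmt.Printf("%-15s %.0f commits/s (throughput) vs %.0f commits/s (latency)\n",
			r.name, r.commitsFromThroughput(), r.commitsFromLatency())
	}
	fmt.Printf("write per commit reduced by about %.0fx\n",
		rows[0].mbPerCommit/rows[1].mbPerCommit)
}
```

Both rows are internally consistent under this check: roughly 40 commits per second for the current behaviour and roughly 370 with NoFreelistSync, with bytes written per commit dropping by a factor of several hundred.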

Prior art

containerd sets the same option on its own metadata database (meta.db), and has since 2022 (containerd/containerd#6761).

buildkit opens its bolt databases directly rather than through containerd's metadata plugin, so it never inherited this change.

…cation

bbolt rewrites the entire freelist on every transaction commit. Once a
metadata database accumulates many free pages (for example after GC frees
most of a large database), each small metadata write amplifies into a
multi-MB disk write. With a high commit rate this can saturate disk write
bandwidth and, because every commit holds bbolt's single writer lock while
fsyncing, stall otherwise-cheap build operations.

Set NoFreelistSync on the bolt databases that buildkit opens itself
(containerdmeta.db, metadata_v2.db and cache.db). The freelist is not data;
it is reconstructed by scanning the database on open, so durability is
unaffected. containerd applies the same option to its own metadata database
for the same reason: containerd/containerd#6761

Signed-off-by: ZRHann <zrhann@foxmail.com>
@ZRHann ZRHann force-pushed the nofreelistsync-metadata-db branch from e73c7c2 to 41af39f Compare June 18, 2026 07:57
@ZRHann ZRHann marked this pull request as ready for review June 18, 2026 08:10
ZRHann commented Jun 25, 2026

@tonistiigi friendly ping

@tonistiigi tonistiigi left a comment

Thanks.

This seems preferable, but I think one of the downsides is that the initial load time increases. I'm not sure how much, or whether this is a practical problem. If you have data, please share.

So I think we should do this, but we should also add periodic compaction (in the same release cycle). I played around with this a while back and need to see if I still have my experiments somewhere, unless you want to attempt it as well. We changed all our DB usage to an interface for a similar reason, so that we could support swapping out the DB if needed.

Looks like this missed the history db and the cachedigest db. I guess these omissions were not intentional.

@fiam

ZRHann commented Jun 30, 2026

I tested this on the affected containerdmeta.db (~4.8 GB on disk):

  • open without NoFreelistSync: ~100 ms
  • open with NoFreelistSync: ~200 ms
  • after bbolt compact (file → ~128 KB), open with NoFreelistSync: ~0.14 ms

ZRHann commented Jun 30, 2026

Happy to keep following up on the periodic compaction. Would be great if you could share your old experiments if you still have them.
