Ghost LRU entries persist indefinitely after file deletion — file-not-found path doesn't evict from index #893

@im-nick-adams

Summary

When a CAS blob is deleted from disk (e.g., via the overwrite path in Add()) after a concurrent GET has already looked up the key in the LRU and released the lock, the subsequent os.Open fails with a file-not-found ("not exist") error. The warning is logged, but the stale LRU entry is never removed, so the same warning fires on every subsequent access to that key until the server is restarted.
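To make the failure mode concrete, here is a minimal toy model of it. This is a sketch only: `toyCache`, `get`, and `countGhostHits` are hypothetical names invented for illustration and are not bazel-remote code. It assumes an in-memory index that is consulted under a mutex and a failed open that is reported but never evicts the entry.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// toyCache is a hypothetical stand-in for the disk cache: an in-memory
// index guarded by a mutex, plus blobs stored as files under dir.
type toyCache struct {
	mu    sync.Mutex
	dir   string
	index map[string]struct{} // keys the cache believes are on disk
}

// get mimics the reported behaviour: a failed open is returned to the
// caller, but the index entry is left in place.
func (c *toyCache) get(key string) (bool, error) {
	c.mu.Lock()
	_, indexed := c.index[key]
	c.mu.Unlock()
	if !indexed {
		return false, nil
	}
	f, err := os.Open(filepath.Join(c.dir, key))
	if err != nil {
		return true, err // no eviction: the ghost entry survives
	}
	return true, f.Close()
}

// countGhostHits indexes one blob, deletes its file behind the cache's
// back, and returns how many of n lookups hit the stale entry.
func countGhostHits(n int) (int, error) {
	dir, err := os.MkdirTemp("", "ghost")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	c := &toyCache{dir: dir, index: map[string]struct{}{}}
	blob := filepath.Join(dir, "blob")
	if err := os.WriteFile(blob, []byte("x"), 0o600); err != nil {
		return 0, err
	}
	c.index["blob"] = struct{}{}
	if err := os.Remove(blob); err != nil {
		return 0, err
	}

	hits := 0
	for i := 0; i < n; i++ {
		if indexed, err := c.get("blob"); indexed && os.IsNotExist(err) {
			hits++
		}
	}
	return hits, nil
}

func main() {
	hits, err := countGhostHits(5)
	if err != nil {
		panic(err)
	}
	fmt.Println("stale hits:", hits) // prints "stale hits: 5"
}
```

Every lookup hits the stale entry because nothing ever removes it from the index, which matches the repeated warnings described above.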

Impact

On a moderately active server (~10 concurrent Bazel users), we observed 20,145 phantom warnings over 2 days (~7/minute on average), producing persistent "Lost inputs no longer available remotely" failures for Bazel clients using --remote_download_toplevel. A restart temporarily resolves the issue by rebuilding the LRU from disk, but phantoms accumulate again within hours.

Root Cause

In cache/disk/disk.go, availableOrTryProxy():

// Slow path retry:
c.mu.Lock()
item, listElem = c.lru.Get(key)
if listElem != nil {
    blobPath = path.Join(c.dir, c.FileLocation(...))
    f, err = os.Open(blobPath)
}
c.mu.Unlock()

if err != nil {
    // WARNING: logs but does NOT evict from LRU
    log.Printf("Warning: expected %q to exist on disk, undersized cache?", blobPath)
}

Compare to the decompression-error path ~20 lines later, which correctly self-heals:

if err != nil {
    log.Printf("Warning: expected item to be on disk, but something happened...")
    c.mu.Lock()
    c.lru.RemoveElement(listElem)  // ← correctly removes dead entry
    c.mu.Unlock()
}

Suggested Fix

Add LRU eviction in the file-not-found path, matching the existing decompression-error pattern:

if err != nil {
    log.Printf("Warning: expected %q to exist on disk, undersized cache?", blobPath)
    c.mu.Lock()
    if listElem != nil {
        c.lru.RemoveElement(listElem)
    }
    c.mu.Unlock()
}

This makes ghost entries self-healing on first access rather than persisting indefinitely.
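As a self-contained illustration of the self-healing idea, the sketch below evicts an index entry when the open fails with a not-exist error. All names here (`healingCache`, `add`, `open`, `has`, `demoSelfHeal`) are hypothetical and the index is a plain map plus `container/list`, not the project's LRU type. It also adds one thing the snippet above does not: before evicting, it checks that the index still points at the element that was originally looked up, on the assumption that a concurrent add may have replaced it in the meantime.

```go
package main

import (
	"container/list"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sync"
)

// healingCache is a hypothetical index: a recency list plus a key map.
type healingCache struct {
	mu    sync.Mutex
	dir   string
	order *list.List               // most recently added at the front
	index map[string]*list.Element // key -> element in order
}

func newHealingCache(dir string) *healingCache {
	return &healingCache{dir: dir, order: list.New(), index: map[string]*list.Element{}}
}

// add (re)indexes key, always creating a fresh list element.
func (c *healingCache) add(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.index[key]; ok {
		c.order.Remove(el)
	}
	c.index[key] = c.order.PushFront(key)
}

func (c *healingCache) has(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.index[key]
	return ok
}

// open returns the blob file, evicting the index entry if the file is gone.
func (c *healingCache) open(key string) (*os.File, error) {
	c.mu.Lock()
	el, ok := c.index[key]
	c.mu.Unlock()
	if !ok {
		return nil, fs.ErrNotExist
	}
	f, err := os.Open(filepath.Join(c.dir, key))
	if errors.Is(err, fs.ErrNotExist) {
		c.mu.Lock()
		// Evict only if the index still holds the element we looked up;
		// a concurrent add may have replaced it with a live entry.
		if cur, ok := c.index[key]; ok && cur == el {
			c.order.Remove(el)
			delete(c.index, key)
		}
		c.mu.Unlock()
	}
	return f, err
}

// demoSelfHeal reports whether the first open missed with not-exist and
// whether the key is still indexed afterwards.
func demoSelfHeal() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "heal")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	c := newHealingCache(dir)
	blob := filepath.Join(dir, "blob")
	if err := os.WriteFile(blob, []byte("x"), 0o600); err != nil {
		return false, false, err
	}
	c.add("blob")
	if err := os.Remove(blob); err != nil {
		return false, false, err
	}
	_, openErr := c.open("blob")
	return errors.Is(openErr, fs.ErrNotExist), c.has("blob"), nil
}

func main() {
	missed, still, err := demoSelfHeal()
	if err != nil {
		panic(err)
	}
	fmt.Println("first open missed:", missed, "still indexed:", still)
}
```

Whether the identity check is needed in disk.go depends on how the real LRU handles overwrites, which this sketch does not model.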

Environment

  • bazel-remote commit: b857daf1f63c641dc3fe6105a674a6e9ed81cf35
  • Go 1.25.6
  • Local disk storage (no NFS), zstd compression mode
  • ~847K cached files, 126 GB on disk, 1.2 TB max

Reproduction

  1. Start bazel-remote with multiple concurrent clients
  2. Trigger overwrites (same CAS hash uploaded by different clients, or after bazel clean --expunge)
  3. Concurrent GET requests for overwritten blobs hit the race window
  4. Once triggered, the phantom entry persists until the server is restarted; it shows up as repeated "expected ... to exist on disk" warnings for the same path
