Environment
- concread 0.5.10, plus diagnostic commit dd13f6a
- rustc 1.88.0
- Consumer: 389-ds-base
Problem
Under sustained concurrent load, evict_to_len panics because a node popped from the freq/rec linked list has no corresponding entry in the hash trie. cache.get_mut(&key) returns None for a key that was inserted moments earlier in the same commit.
This happens when a read transaction completes and try_quiesce_stats opportunistically acquires the write lock to commit pending cache maintenance.
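For readers unfamiliar with this path, here is a minimal sketch of the "opportunistic commit" pattern described above. It is not concread's code: `WriteState` and `try_quiesce` are hypothetical names, and a plain `std::sync::Mutex` stands in for the cache's write lock. The point is only that the finishing reader commits if the lock happens to be free and otherwise skips the work without blocking.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the cache's write-side state.
struct WriteState {
    /// Maintenance work queued up by readers (illustrative only).
    pending: Vec<u64>,
    /// Number of commits performed so far.
    commits: usize,
}

/// Called when a read finishes: commit pending maintenance only if the
/// write lock is free right now. Never blocks the reader.
fn try_quiesce(state: &Mutex<WriteState>) -> bool {
    match state.try_lock() {
        Ok(mut guard) => {
            // We won the lock: drain the pending work and "commit".
            guard.pending.clear();
            guard.commits += 1;
            true
        }
        // Lock is held elsewhere (or poisoned): skip, someone else will commit.
        Err(_) => false,
    }
}
```

Under this assumption, the commit (and therefore the eviction) can run on any reader thread that happens to win the lock, which matches the backtraces below where `commit` is reached from `cache_char_read_complete`.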
Initial crash (concread 0.5.10, no diagnostics)
#13 ARCache::evict_to_len (mod.rs:1346)
r = Option::None
owned = LLNodeOwned { inner: 0x7f91877cfb40 }
#14 ARCache::evict (mod.rs:1455)
rec_to_len=120587, freq_to_len=4243, delta=1, p=86482
#15 ARCache::commit (mod.rs:1694)
commit_txid=40047203
#17 ARCache::try_quiesce_stats (mod.rs:693)
tlocal: items=0, hit: len=0
CursorWrite.length=136719, stats.includes=1
#18 cache_char_read_complete (cache.rs:228)
Diagnostic build crash (0.5.10 + dd13f6a)
I added a scan for duplicates in the source list and a check of the destination ghost list, then reproduced the crash:
evict_to_len: KEY MISSING from cache map for key "<dn_value>".
Popped node ptr=0x7fb8d8ac7470, node_txid=41318110, node_size=1
commit_txid=41318110, ll.len()=107677, to_ll.len()=34329,
target_size=107677, dupes_remaining_in_src_ll=0,
key_found_in_dest_ghost_ll=false
#12 ARCache::evict_to_len (mod.rs:1457)
in_to_ll = false, dupes_in_ll = 0
r = Option::None
owned = LLNodeOwned { inner: 0x7fb8d8ac7470 }
#13 ARCache::evict (mod.rs:1584)
rec_to_len=107677, freq_to_len=17153, delta=1, p=84990
#14 ARCache::commit (mod.rs:1881)
commit_txid=41318110
#16 ARCache::try_quiesce_stats (mod.rs:711)
tlocal: items=0, hit: len=0
CursorWrite.length=159160, stats.includes=1
#17 cache_char_read_complete (cache.rs:228)
IIUC, node_txid == commit_txid means the node was added to the rec list during this commit by one of the drain functions. node_size=1 is HAUNTED_SIZE, consistent with a Haunted revival (I filed the drain_inc_rx size bug fix). No duplicates in the source list, key not in the ghost list. The key string itself is intact and readable in the panic message -- the linked list node's memory is fine. The hash trie just can't find it.
I wasn't able to find any place where we remove a cache map entry without also removing the corresponding linked list node, and I don't see a problem in the 389-ds-base code either.
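To make the violated invariant concrete, here is a toy model of the check. It is a sketch under heavy assumptions, not the real `evict_to_len`: a `HashMap` stands in for the hash trie, `VecDeque`s stand in for the linked lists, and `Slot`, `MissingKey`, and `evict_one` are hypothetical names. The diagnostic fields mirror the ones printed in the panic message above.

```rust
use std::collections::{HashMap, VecDeque};

/// Simplified entry state; the real cache tracks more than this.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Slot {
    Resident,
    Ghost,
}

/// What the diagnostic build reports when the invariant is broken.
#[derive(Debug, PartialEq)]
struct MissingKey {
    key: String,
    dupes_remaining_in_src_ll: usize,
    key_found_in_dest_ghost_ll: bool,
}

/// Pop the oldest key from `src` and demote it to the ghost list `dest`.
/// Invariant being checked: every key in `src` has an entry in `map`.
fn evict_one(
    map: &mut HashMap<String, Slot>,
    src: &mut VecDeque<String>,
    dest: &mut VecDeque<String>,
) -> Result<Option<String>, MissingKey> {
    let Some(key) = src.pop_front() else {
        return Ok(None); // nothing left to evict
    };
    match map.get_mut(&key) {
        Some(slot) => {
            *slot = Slot::Ghost;
            dest.push_back(key.clone());
            Ok(Some(key))
        }
        // The list had a node for this key but the map does not know it.
        None => Err(MissingKey {
            dupes_remaining_in_src_ll: src.iter().filter(|k| **k == key).count(),
            key_found_in_dest_ghost_ll: dest.contains(&key),
            key,
        }),
    }
}
```

In this model the failure in the report corresponds to the `Err` branch with `dupes_remaining_in_src_ll == 0` and `key_found_in_dest_ghost_ll == false`: the list node exists exactly once, and the map has no entry for it at all.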