Correctness, robustness, and quality improvements by dahankzter · Pull Request #53 · Petroniuss/groupcache

dahankzter · 2026-02-19T07:15:21Z

Summary

A batch of fixes, performance improvements, and quality-of-life changes found during a systematic audit of the codebase. Each commit is self-contained and independently reviewable.

Correctness fixes

Replace unwrap with proper error on empty gRPC response — a peer returning GetResponse { value: None } previously caused a panic; now returns a descriptive error
Add default max capacity to main_cache — without a bound, the main cache could grow without limit and exhaust memory (default: 100k entries)
Store SocketAddr directly in VNode — eliminates a parse().unwrap() that could panic on malformed strings; custom Hash impl preserves existing ring distribution
Eliminate TOCTOU race in add_peer / remove_peer — double-checked locking prevents duplicate peer entries under concurrent service discovery
Don't hold Arc across sleep in service discovery loop — the previous while let Some(cache) = cache.upgrade() pattern kept a strong reference alive during sleep, preventing Drop from firing;
restructured to loop { sleep; upgrade }

API improvements

Make GroupcacheError a public enum — consumers can now pattern-match on error variants (Loading, Transport, Deserialization, etc.) instead of relying on string matching
Implement Display for GroupcachePeer — useful for logging and diagnostics
Log remote load errors — remote load failures that fall back to local load were previously silent; now logged at warn level

Performance

Use Arc<str> instead of String for cache keys — avoids heap allocation on every cache lookup/insert; reference-count increment replaces full string clone
Use rmp_serde::from_slice instead of from_read — zero-copy deserialization path for byte slices

Housekeeping

Abort service discovery task on shutdown — GroupcacheInner now implements Drop to abort the background service discovery task via OnceLock<AbortHandle>
Clean up build.rs — remove tonic-prost-build from runtime [dependencies] (kept only in [build-dependencies])
Add Makefile — targets for build, test, lint, coverage (lcov/html/summary), clean, and help
Comprehensive test coverage — 15 new unit + integration tests bringing line coverage to ~98%; generated proto code excluded from reports

Test plan

All 46 tests pass (cargo test --workspace)
Clippy clean (cargo clippy --workspace -- -D warnings)
Formatting clean (cargo fmt --all -- --check)
~98% line coverage on library code (cargo llvm-cov --workspace --all-features --ignore-filename-regex 'groupcache_pb\.rs')
Verify no behavioral regressions in downstream consumers

load_remotely() called .unwrap() on the optional `value` field from GetResponse, which would panic if a peer returned an empty response (e.g. due to a bug or proto version mismatch). Replace with a new EmptyResponse error variant that propagates cleanly through the error chain instead of crashing the node. Also fix a pre-existing clippy::obfuscated_if_else warning in set_peers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

The default main_cache had no max_capacity set, meaning it would grow without bound until the process ran out of memory. Set a default limit of 100k items (consistent with hot_cache having 10k). Users can still override this via the builder's main_cache() method. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

VNode previously stored the peer address as a formatted string "{addr}_{id}" and reconstructed it via split('_') + parse(), both with unwrap(). This would panic on malformed data. Store the SocketAddr and vnode id as proper typed fields instead, with a custom Hash impl that preserves the original string-based hash distribution to avoid reshuffling the ring on upgrade. Also changed VNODES_PER_PEER from i32 to u32 since it is a count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

add_peer had a time-of-check-to-time-of-use race: it checked for peer existence under a read lock, dropped it, performed an async connect, then took the write lock to insert. Another task could add the same peer in between. Fix by re-checking under the write lock after connecting. remove_peer had a similar (though less impactful) race between the read-lock check and write-lock removal. Simplify by using a single write lock for both check and removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

When a remote peer load fails, groupcache falls back to loading locally. The remote error was previously discarded entirely, making it very difficult to diagnose production issues like flaky peers or serialization mismatches. Now log the error at warn level before falling back, so operators have visibility into remote failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

GroupcacheError was an opaque struct wrapping InternalGroupcacheError via #[error(transparent)], so users could only .to_string() errors with no way to distinguish loader failures from transport errors or deserialization issues. Replace with a public enum whose variants (Loading, Transport, Deserialization, Connection, EmptyResponse, etc.) allow callers to match on error causes and handle them appropriately. Error message formats are preserved for backward compatibility. Also fix a pre-existing flaky test that hardcoded 127.0.0.1:8080 as an unreachable address — replace with a dynamically allocated port that is guaranteed to have nothing listening. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

- Move tonic-prost-build to build-dependencies only (was incorrectly also listed in [dependencies], pulling it into the runtime dep tree) - Replace confusing `if true { return }` + dead code pattern in build.rs with a clear `if false` guard and instructions for regenerating the protobuf code Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

Replace String-based cache keys with Arc<str> to reduce heap allocations on hot paths. String keys require a full heap copy on every clone, while Arc<str> clones via a cheap atomic refcount increment. This matters because the same key may appear in the singleflight group, main_cache, and hot_cache. Cache lookups (get/remove) still accept &str thanks to Arc<str>: Borrow<str>, so callers are unaffected. A public CacheKey type alias is exported for users who build custom Cache instances via the builder. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

Previously the background service discovery task spawned by tokio::spawn had its JoinHandle dropped immediately. While the task's Weak<GroupcacheInner> reference eventually causes the loop to exit, there's a lag of up to one full poll interval (default 10s) before the task notices. Store the AbortHandle via OnceLock and abort it in the Drop impl of GroupcacheInner, giving immediate cancellation when the last Groupcache clone is dropped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

The previous `while let Some(cache) = cache.upgrade()` pattern bound the Arc for the entire loop iteration, including during tokio::time::sleep. This prevented GroupcacheInner::drop from firing (and thus the AbortHandle from triggering) until the sleep completed — delaying shutdown by up to one poll interval. Restructure to sleep first, then upgrade the Weak. During sleep only a Weak reference exists, so dropping all Groupcache clones allows Drop to fire immediately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

from_read wraps the input in a Read adapter, adding unnecessary indirection when we already have a byte slice. from_slice operates directly on &[u8]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

Delegates to the inner SocketAddr display, giving clean output like "127.0.0.1:8080" for logging and error messages instead of requiring Debug formatting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

Targets: all (default) - fmt, lint, test, doc build / check - compile / type-check test - cargo test --workspace clippy / fmt / lint - quality checks fmt-fix - auto-fix formatting doc - build documentation coverage - lcov report (coverage/lcov.info) coverage-summary - per-file summary to stdout coverage-html - HTML report (coverage/html/) clean - remove build artifacts and coverage data help - show all targets Also adds coverage/ to .gitignore. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

Add unit tests for error conversion paths, Display impl, and service discovery loop. Add integration tests for add_peer idempotency, set_peers with new peers, HTTPS connection, and remove on disconnected peer. Exclude generated proto code from coverage reports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>

henrik-leovegas and others added 14 commits February 18, 2026 22:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correctness, robustness, and quality improvements#53

Correctness, robustness, and quality improvements#53
dahankzter wants to merge 14 commits intoPetroniuss:mainfrom
dahankzter:main

dahankzter commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dahankzter commented Feb 19, 2026

Summary

Correctness fixes

API improvements

Performance

Housekeeping

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants