Correctness, robustness, and quality improvements#53
Open
dahankzter wants to merge 14 commits intoPetroniuss:mainfrom
Open
Correctness, robustness, and quality improvements#53dahankzter wants to merge 14 commits intoPetroniuss:mainfrom
dahankzter wants to merge 14 commits intoPetroniuss:mainfrom
Conversation
load_remotely() called .unwrap() on the optional `value` field from GetResponse, which would panic if a peer returned an empty response (e.g. due to a bug or proto version mismatch). Replace with a new EmptyResponse error variant that propagates cleanly through the error chain instead of crashing the node. Also fix a pre-existing clippy::obfuscated_if_else warning in set_peers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
The default main_cache had no max_capacity set, meaning it would grow without bound until the process ran out of memory. Set a default limit of 100k items (consistent with hot_cache having 10k). Users can still override this via the builder's main_cache() method. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
VNode previously stored the peer address as a formatted string
"{addr}_{id}" and reconstructed it via split('_') + parse(), both
with unwrap(). This would panic on malformed data. Store the
SocketAddr and vnode id as proper typed fields instead, with a custom
Hash impl that preserves the original string-based hash distribution
to avoid reshuffling the ring on upgrade.
Also changed VNODES_PER_PEER from i32 to u32 since it is a count.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
add_peer had a time-of-check-to-time-of-use race: it checked for peer existence under a read lock, dropped it, performed an async connect, then took the write lock to insert. Another task could add the same peer in between. Fix by re-checking under the write lock after connecting. remove_peer had a similar (though less impactful) race between the read-lock check and write-lock removal. Simplify by using a single write lock for both check and removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
When a remote peer load fails, groupcache falls back to loading locally. The remote error was previously discarded entirely, making it very difficult to diagnose production issues like flaky peers or serialization mismatches. Now log the error at warn level before falling back, so operators have visibility into remote failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
GroupcacheError was an opaque struct wrapping InternalGroupcacheError via #[error(transparent)], so users could only .to_string() errors with no way to distinguish loader failures from transport errors or deserialization issues. Replace with a public enum whose variants (Loading, Transport, Deserialization, Connection, EmptyResponse, etc.) allow callers to match on error causes and handle them appropriately. Error message formats are preserved for backward compatibility. Also fix a pre-existing flaky test that hardcoded 127.0.0.1:8080 as an unreachable address — replace with a dynamically allocated port that is guaranteed to have nothing listening. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
- Move tonic-prost-build to build-dependencies only (was incorrectly
also listed in [dependencies], pulling it into the runtime dep tree)
- Replace confusing `if true { return }` + dead code pattern in
build.rs with a clear `if false` guard and instructions for
regenerating the protobuf code
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Replace String-based cache keys with Arc<str> to reduce heap allocations on hot paths. String keys require a full heap copy on every clone, while Arc<str> clones via a cheap atomic refcount increment. This matters because the same key may appear in the singleflight group, main_cache, and hot_cache. Cache lookups (get/remove) still accept &str thanks to Arc<str>: Borrow<str>, so callers are unaffected. A public CacheKey type alias is exported for users who build custom Cache instances via the builder. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Previously the background service discovery task spawned by tokio::spawn had its JoinHandle dropped immediately. While the task's Weak<GroupcacheInner> reference eventually causes the loop to exit, there's a lag of up to one full poll interval (default 10s) before the task notices. Store the AbortHandle via OnceLock and abort it in the Drop impl of GroupcacheInner, giving immediate cancellation when the last Groupcache clone is dropped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
The previous `while let Some(cache) = cache.upgrade()` pattern bound the Arc for the entire loop iteration, including during tokio::time::sleep. This prevented GroupcacheInner::drop from firing (and thus the AbortHandle from triggering) until the sleep completed — delaying shutdown by up to one poll interval. Restructure to sleep first, then upgrade the Weak. During sleep only a Weak reference exists, so dropping all Groupcache clones allows Drop to fire immediately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
from_read wraps the input in a Read adapter, adding unnecessary indirection when we already have a byte slice. from_slice operates directly on &[u8]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Delegates to the inner SocketAddr display, giving clean output like "127.0.0.1:8080" for logging and error messages instead of requiring Debug formatting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Targets: all (default) - fmt, lint, test, doc build / check - compile / type-check test - cargo test --workspace clippy / fmt / lint - quality checks fmt-fix - auto-fix formatting doc - build documentation coverage - lcov report (coverage/lcov.info) coverage-summary - per-file summary to stdout coverage-html - HTML report (coverage/html/) clean - remove build artifacts and coverage data help - show all targets Also adds coverage/ to .gitignore. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Add unit tests for error conversion paths, Display impl, and service discovery loop. Add integration tests for add_peer idempotency, set_peers with new peers, HTTPS connection, and remove on disconnected peer. Exclude generated proto code from coverage reports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A batch of fixes, performance improvements, and quality-of-life changes found during a systematic audit of the codebase. Each commit is self-contained and independently reviewable.
Correctness fixes
unwrapwith proper error on empty gRPC response — a peer returningGetResponse { value: None }previously caused a panic; now returns a descriptive errormain_cache— without a bound, the main cache could grow without limit and exhaust memory (default: 100k entries)SocketAddrdirectly inVNode— eliminates aparse().unwrap()that could panic on malformed strings; customHashimpl preserves existing ring distributionadd_peer/remove_peer— double-checked locking prevents duplicate peer entries under concurrent service discoveryArcacross sleep in service discovery loop — the previouswhile let Some(cache) = cache.upgrade()pattern kept a strong reference alive duringsleep, preventingDropfrom firing;restructured to
loop { sleep; upgrade }API improvements
GroupcacheErrora public enum — consumers can now pattern-match on error variants (Loading, Transport, Deserialization, etc.) instead of relying on string matchingDisplayforGroupcachePeer— useful for logging and diagnosticswarnlevelPerformance
Arc<str>instead ofStringfor cache keys — avoids heap allocation on every cache lookup/insert; reference-count increment replaces full string clonermp_serde::from_sliceinstead offrom_read— zero-copy deserialization path for byte slicesHousekeeping
GroupcacheInnernow implementsDropto abort the background service discovery task viaOnceLock<AbortHandle>build.rs— removetonic-prost-buildfrom runtime[dependencies](kept only in[build-dependencies])Test plan
cargo test --workspace)cargo clippy --workspace -- -D warnings)cargo fmt --all -- --check)cargo llvm-cov --workspace --all-features --ignore-filename-regex 'groupcache_pb\.rs')