Skip to content

Correctness, robustness, and quality improvements#53

Open
dahankzter wants to merge 14 commits intoPetroniuss:mainfrom
dahankzter:main
Open

Correctness, robustness, and quality improvements#53
dahankzter wants to merge 14 commits intoPetroniuss:mainfrom
dahankzter:main

Conversation

@dahankzter
Copy link

Summary

A batch of fixes, performance improvements, and quality-of-life changes found during a systematic audit of the codebase. Each commit is self-contained and independently reviewable.

Correctness fixes

  • Replace unwrap with proper error on empty gRPC response — a peer returning GetResponse { value: None } previously caused a panic; now returns a descriptive error
  • Add default max capacity to main_cache — without a bound, the main cache could grow without limit and exhaust memory (default: 100k entries)
  • Store SocketAddr directly in VNode — eliminates a parse().unwrap() that could panic on malformed strings; custom Hash impl preserves existing ring distribution
  • Eliminate TOCTOU race in add_peer / remove_peer — double-checked locking prevents duplicate peer entries under concurrent service discovery
  • Don't hold Arc across sleep in service discovery loop — the previous while let Some(cache) = cache.upgrade() pattern kept a strong reference alive during sleep, preventing Drop from firing;
    restructured to loop { sleep; upgrade }

API improvements

  • Make GroupcacheError a public enum — consumers can now pattern-match on error variants (Loading, Transport, Deserialization, etc.) instead of relying on string matching
  • Implement Display for GroupcachePeer — useful for logging and diagnostics
  • Log remote load errors — remote load failures that fall back to local load were previously silent; now logged at warn level

Performance

  • Use Arc<str> instead of String for cache keys — avoids heap allocation on every cache lookup/insert; reference-count increment replaces full string clone
  • Use rmp_serde::from_slice instead of from_read — zero-copy deserialization path for byte slices

Housekeeping

  • Abort service discovery task on shutdownGroupcacheInner now implements Drop to abort the background service discovery task via OnceLock<AbortHandle>
  • Clean up build.rs — remove tonic-prost-build from runtime [dependencies] (kept only in [build-dependencies])
  • Add Makefile — targets for build, test, lint, coverage (lcov/html/summary), clean, and help
  • Comprehensive test coverage — 15 new unit + integration tests bringing line coverage to ~98%; generated proto code excluded from reports

Test plan

  • All 46 tests pass (cargo test --workspace)
  • Clippy clean (cargo clippy --workspace -- -D warnings)
  • Formatting clean (cargo fmt --all -- --check)
  • ~98% line coverage on library code (cargo llvm-cov --workspace --all-features --ignore-filename-regex 'groupcache_pb\.rs')
  • Verify no behavioral regressions in downstream consumers

henrik-leovegas and others added 14 commits February 18, 2026 22:35
load_remotely() called .unwrap() on the optional `value` field from
GetResponse, which would panic if a peer returned an empty response
(e.g. due to a bug or proto version mismatch). Replace with a new
EmptyResponse error variant that propagates cleanly through the error
chain instead of crashing the node.

Also fix a pre-existing clippy::obfuscated_if_else warning in
set_peers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
The default main_cache had no max_capacity set, meaning it would grow
without bound until the process ran out of memory. Set a default limit
of 100k items (consistent with hot_cache having 10k). Users can still
override this via the builder's main_cache() method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
VNode previously stored the peer address as a formatted string
"{addr}_{id}" and reconstructed it via split('_') + parse(), both
with unwrap(). This would panic on malformed data. Store the
SocketAddr and vnode id as proper typed fields instead, with a custom
Hash impl that preserves the original string-based hash distribution
to avoid reshuffling the ring on upgrade.

Also changed VNODES_PER_PEER from i32 to u32 since it is a count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
add_peer had a time-of-check-to-time-of-use race: it checked for
peer existence under a read lock, dropped it, performed an async
connect, then took the write lock to insert. Another task could add
the same peer in between. Fix by re-checking under the write lock
after connecting.

remove_peer had a similar (though less impactful) race between the
read-lock check and write-lock removal. Simplify by using a single
write lock for both check and removal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
When a remote peer load fails, groupcache falls back to loading
locally. The remote error was previously discarded entirely, making
it very difficult to diagnose production issues like flaky peers or
serialization mismatches. Now log the error at warn level before
falling back, so operators have visibility into remote failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
GroupcacheError was an opaque struct wrapping InternalGroupcacheError
via #[error(transparent)], so users could only .to_string() errors
with no way to distinguish loader failures from transport errors or
deserialization issues.

Replace with a public enum whose variants (Loading, Transport,
Deserialization, Connection, EmptyResponse, etc.) allow callers to
match on error causes and handle them appropriately. Error message
formats are preserved for backward compatibility.

Also fix a pre-existing flaky test that hardcoded 127.0.0.1:8080 as
an unreachable address — replace with a dynamically allocated port
that is guaranteed to have nothing listening.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
- Move tonic-prost-build to build-dependencies only (was incorrectly
  also listed in [dependencies], pulling it into the runtime dep tree)
- Replace confusing `if true { return }` + dead code pattern in
  build.rs with a clear `if false` guard and instructions for
  regenerating the protobuf code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Replace String-based cache keys with Arc<str> to reduce heap
allocations on hot paths. String keys require a full heap copy on
every clone, while Arc<str> clones via a cheap atomic refcount
increment. This matters because the same key may appear in the
singleflight group, main_cache, and hot_cache.

Cache lookups (get/remove) still accept &str thanks to
Arc<str>: Borrow<str>, so callers are unaffected.

A public CacheKey type alias is exported for users who build
custom Cache instances via the builder.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Previously the background service discovery task spawned by
tokio::spawn had its JoinHandle dropped immediately. While the
task's Weak<GroupcacheInner> reference eventually causes the loop
to exit, there's a lag of up to one full poll interval (default
10s) before the task notices.

Store the AbortHandle via OnceLock and abort it in the Drop impl
of GroupcacheInner, giving immediate cancellation when the last
Groupcache clone is dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
The previous `while let Some(cache) = cache.upgrade()` pattern
bound the Arc for the entire loop iteration, including during
tokio::time::sleep. This prevented GroupcacheInner::drop from
firing (and thus the AbortHandle from triggering) until the sleep
completed — delaying shutdown by up to one poll interval.

Restructure to sleep first, then upgrade the Weak. During sleep
only a Weak reference exists, so dropping all Groupcache clones
allows Drop to fire immediately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
from_read wraps the input in a Read adapter, adding unnecessary
indirection when we already have a byte slice. from_slice operates
directly on &[u8].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Delegates to the inner SocketAddr display, giving clean output
like "127.0.0.1:8080" for logging and error messages instead of
requiring Debug formatting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Targets:
  all (default)      - fmt, lint, test, doc
  build / check      - compile / type-check
  test               - cargo test --workspace
  clippy / fmt / lint - quality checks
  fmt-fix            - auto-fix formatting
  doc                - build documentation
  coverage           - lcov report (coverage/lcov.info)
  coverage-summary   - per-file summary to stdout
  coverage-html      - HTML report (coverage/html/)
  clean              - remove build artifacts and coverage data
  help               - show all targets

Also adds coverage/ to .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Add unit tests for error conversion paths, Display impl, and service
discovery loop. Add integration tests for add_peer idempotency,
set_peers with new peers, HTTPS connection, and remove on disconnected
peer. Exclude generated proto code from coverage reports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Henrik Ma Johansson <dahankzter@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants