Add persistent disk cache for blocklists with ETag/Last-Modified support#242
Add persistent disk cache for blocklists with ETag/Last-Modified support#242
Conversation
|
😎 Merged manually by @abhimehro - details. |
|
👋 Development Partner is reviewing this PR. Will provide feedback shortly. |
Co-authored-by: abhimehro <84992105+abhimehro@users.noreply.github.com>
|
👋 Development Partner is reviewing this PR. Will provide feedback shortly. |
|
👋 Development Partner is reviewing this PR. Will provide feedback shortly. |
|
👋 Development Partner is reviewing this PR. Will provide feedback shortly. |
|
Hi @copilot, there was an interruption during the workflow where you working through the implementation plan. Could you please resume where you left off and complete everything? Thanks again! I really appreciate your help. |
- Added platform-specific cache directory helper (Linux, macOS, Windows) - Implemented cache loading/saving with graceful error handling - Modified _gh_get() to support conditional HTTP requests (304 Not Modified) - Added cache statistics tracking (hits, misses, validations, errors) - Display cache statistics in sync summary - Added comprehensive tests for disk cache functionality - Fixed test_cache_optimization mock to accept headers parameter - All cache-related tests passing (18/18) Co-authored-by: abhimehro <84992105+abhimehro@users.noreply.github.com>
Completed! I've resumed and finished implementing the persistent disk cache with HTTP ETag/Last-Modified support. What's been done: How it works: Example output: Ready for merge! The implementation is complete and all checks pass. (Commit: 070777b) |
|
👋 Development Partner is reviewing this PR. Will provide feedback shortly. |
There was a problem hiding this comment.
Pull request overview
Adds a persistent on-disk cache for blocklist JSON downloads, using HTTP conditional requests (ETag / Last-Modified) to reduce cold-start sync time and reporting cache effectiveness in the end-of-sync summary.
Changes:
- Introduces platform-specific disk cache directory resolution plus load/save routines for
blocklists.json. - Updates
_gh_get()to send conditional headers and handle304 Not Modifiedby reusing cached data. - Adds a new disk-cache test suite and updates an existing cache optimization test mock to accept
headers.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
main.py |
Implements disk cache infra, conditional HTTP support in _gh_get, cache stats printing, and cache load/save integration. |
tests/test_disk_cache.py |
Adds comprehensive unit tests for cache dir selection, load/save behavior, conditional headers, and stats tracking. |
tests/test_cache_optimization.py |
Updates stream mock signature to accept headers and sets status_code for compatibility with new _gh_get logic. |
| global _cache_stats | ||
|
|
||
| # First check: Quick check without holding lock for long | ||
| with _cache_lock: | ||
| if url in _cache: | ||
| _cache_stats["hits"] += 1 | ||
| return _cache[url] |
There was a problem hiding this comment.
_cache_stats is mutated from multiple threads (warm_up_cache and sync thread pools both call _gh_get). The current “+= 1” updates can lose increments and make the reported stats inaccurate. If these stats are meant to be reliable, protect updates with a lock or use a thread-safe counter pattern.
| # Create temporary cache directory for testing | ||
| self.temp_dir = tempfile.mkdtemp() | ||
| self.original_get_cache_dir = main.get_cache_dir | ||
|
|
||
| def tearDown(self): | ||
| """Clean up after each test.""" | ||
| main._cache.clear() | ||
| main._disk_cache.clear() | ||
| main.validate_folder_url.cache_clear() | ||
| main._cache_stats = {"hits": 0, "misses": 0, "validations": 0, "errors": 0} | ||
|
|
||
| # Clean up temp directory | ||
| import shutil | ||
| if os.path.exists(self.temp_dir): | ||
| shutil.rmtree(self.temp_dir) | ||
|
|
There was a problem hiding this comment.
Tests overwrite main.get_cache_dir via assignment but never restore it, which can leak state into other test modules (order-dependent failures). Prefer patch.object(main, "get_cache_dir") within each test or restore self.original_get_cache_dir in tearDown().
| def test_conditional_request_with_etag(self): | ||
| """Test that conditional requests use ETag from disk cache.""" | ||
| test_url = "https://example.com/test.json" | ||
| test_data = {"group": {"group": "Test"}, "domains": ["example.com"]} | ||
|
|
||
| # Pre-populate disk cache with ETag | ||
| main._disk_cache[test_url] = { | ||
| "data": test_data, | ||
| "etag": "abc123", | ||
| "last_modified": None, | ||
| "fetched_at": 1234567890.0, | ||
| "last_validated": 1234567890.0 | ||
| } | ||
|
|
||
| def mock_stream(method, url, headers=None): | ||
| # Verify If-None-Match header was sent | ||
| self.assertIsNotNone(headers) | ||
| self.assertEqual(headers.get("If-None-Match"), "abc123") | ||
|
|
||
| # Return 304 Not Modified | ||
| mock_response = MagicMock() | ||
| mock_response.status_code = 304 | ||
| mock_response.raise_for_status = MagicMock() | ||
| mock_response.headers = {} | ||
| mock_response.__enter__ = MagicMock(return_value=mock_response) | ||
| mock_response.__exit__ = MagicMock(return_value=False) | ||
| return mock_response | ||
|
|
||
| with patch.object(main._gh, 'stream', side_effect=mock_stream): | ||
| result = main._gh_get(test_url) | ||
|
|
||
| # Should return cached data | ||
| self.assertEqual(result, test_data) | ||
| # Should count as validation, not miss | ||
| self.assertEqual(main._cache_stats["validations"], 1) | ||
| self.assertEqual(main._cache_stats["misses"], 0) | ||
|
|
||
| def test_conditional_request_with_last_modified(self): | ||
| """Test that conditional requests use Last-Modified from disk cache.""" | ||
| test_url = "https://example.com/test.json" | ||
| test_data = {"group": {"group": "Test"}, "domains": ["example.com"]} | ||
|
|
||
| # Pre-populate disk cache with Last-Modified | ||
| main._disk_cache[test_url] = { | ||
| "data": test_data, | ||
| "etag": None, | ||
| "last_modified": "Mon, 01 Jan 2024 00:00:00 GMT", | ||
| "fetched_at": 1234567890.0, | ||
| "last_validated": 1234567890.0 | ||
| } | ||
|
|
||
| def mock_stream(method, url, headers=None): | ||
| # Verify If-Modified-Since header was sent | ||
| self.assertIsNotNone(headers) | ||
| self.assertEqual(headers.get("If-Modified-Since"), "Mon, 01 Jan 2024 00:00:00 GMT") | ||
|
|
There was a problem hiding this comment.
The conditional-request tests don’t assert that headers with missing values are omitted. Given _gh_get persists "etag"/"last_modified" keys even when the values are None, add assertions that If-None-Match / If-Modified-Since are absent (not present with None) when the cached value is missing, so the tests catch the httpx header-type regression.
| if r.status_code == 304: | ||
| if cached_entry and "data" in cached_entry: | ||
| log.debug(f"Cache validated (304) for {sanitize_for_log(url)}") | ||
| _cache_stats["validations"] += 1 | ||
|
|
||
| # Update in-memory cache with validated data | ||
| data = cached_entry["data"] | ||
| with _cache_lock: | ||
| _cache[url] = data | ||
|
|
||
| # Update timestamp in disk cache to track last validation | ||
| cached_entry["last_validated"] = time.time() | ||
| return data |
There was a problem hiding this comment.
The 304 handling path returns cached_entry["data"] without validating it. If the on-disk cache is structurally wrong but still valid JSON, a 304 response will keep returning bad data and can break sync until the user manually deletes the cache. Validate the cached payload (e.g., via validate_folder_data) before accepting a 304; if invalid, drop the cache entry and perform an unconditional re-fetch.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
👋 Development Partner is reviewing this PR. Will provide feedback shortly. |
Implementation Complete ✅
Context
Cold-start sync currently downloads all blocklists every time. This implementation adds persistent disk caching with HTTP ETag/Last-Modified support to reduce cold-start sync time by 30-50%.
Changes Made
Cache Infrastructure:
~/.cache/ctrld-sync, macOS:~/Library/Caches/ctrld-sync, Windows:%LOCALAPPDATA%\ctrld-sync\cache)HTTP Cache Support:
_gh_get()to send conditional requests usingIf-None-Match(ETag) andIf-Modified-SinceheadersUser Experience:
Example Output:
Security Measures
Testing
Performance Impact
Expected: 30-50% faster cold-start syncs with warm cache
Mechanism: Avoids re-downloading unchanged blocklists via HTTP 304 responses
Fallback: Full download if cache is empty or corrupted
Files Changed
main.py: Added cache infrastructure, modified_gh_get()for conditional requests, integrated cache loading/saving inmain()tests/test_disk_cache.py: New comprehensive test suite for disk cache functionalitytests/test_cache_optimization.py: Updated existing test mock to support headers parameterOriginal prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.