Skip to content

Add cache TTL and --clear-cache flag to persistent blocklist cache#390

Merged
abhimehro merged 3 commits intomainfrom
copilot/implement-disk-cache-blocklist
Feb 20, 2026
Merged

Add cache TTL and --clear-cache flag to persistent blocklist cache#390
abhimehro merged 3 commits intomainfrom
copilot/implement-disk-cache-blocklist

Conversation

Copy link

Copilot AI commented Feb 19, 2026

Every sync downloaded all blocklists from scratch with no reuse between runs. The disk cache infrastructure (ETag/304 handling, platform-aware storage) was already in place but lacked TTL enforcement and a manual invalidation path.

Changes

Cache TTL (main.py)

  • Added CACHE_TTL_SECONDS = 24 * 60 * 60 (24h default)
  • _gh_get() now checks last_validated against TTL before making any HTTP request:
    • Within TTL → serve from disk cache directly, no network call (counted as hits)
    • Beyond TTL → send conditional request (If-None-Match / If-Modified-Since); still avoids full re-download on 304
last_validated = cached_entry.get("last_validated", 0)
if time.time() - last_validated < CACHE_TTL_SECONDS:
    # Serve from disk, skip HTTP entirely
    data = cached_entry["data"]
    with _cache_lock:
        _cache[url] = data
    _cache_stats["hits"] += 1
    return data
# Beyond TTL: fall through to conditional request

--clear-cache flag (main.py)

  • New CLI argument; deletes blocklists.json from the platform cache dir, clears in-memory _disk_cache, exits 0
  • Runs immediately after load_disk_cache(), before any sync work begins

Tests (tests/test_disk_cache.py)

  • test_ttl_within_ttl_returns_disk_cache_without_request — fresh entry, no HTTP call made
  • test_ttl_expired_sends_conditional_request — stale entry triggers If-None-Match header
  • test_clear_cache_deletes_file — cache file removed and in-memory map cleared
Original prompt

This section details on the original issue you should resolve

<issue_title>[Code Quality] Implement persistent disk cache for blocklist downloads</issue_title>
<issue_description>### Description

Every sync operation downloads all blocklists from scratch, even if they haven't changed. Implementing a persistent disk cache with ETag/Last-Modified validation can reduce cold-start sync time by 30-50%.

Problem

From the Performance Research discussion (#219):

  • Cold start downloads all blocklists every time
  • No persistent caching between runs
  • Cache warmup performs sequential HTTP fetches
  • Wasted bandwidth downloading unchanged blocklists

Current behavior:

  • Every run fetches 23+ blocklists from scratch
  • No validation of whether remote content changed
  • Large blocklists (100k+ rules) downloaded repeatedly

Suggested Changes

Implement disk-based cache with HTTP validation:

  1. Cache storage:

    • Store downloaded blocklist content in ~/.cache/ctrld-sync/
    • Include metadata (ETag, Last-Modified, downloaded timestamp)
  2. Cache validation:

    • Send If-None-Match (ETag) or If-Modified-Since headers
    • Handle 304 Not Modified responses (use cached content)
    • Only download if remote content changed (200 response)
  3. Cache invalidation:

    • Configurable cache TTL (default: 24 hours)
    • Manual cache clear option (--clear-cache flag)
    • Automatic invalidation on HTTP errors
  4. Graceful degradation:

    • Fall back to direct download if cache corrupted
    • Handle missing cache directory gracefully

Files Affected

  • main.py - Add caching logic to blocklist fetch functions
  • New: cache.py (optional) - Separate module for cache management
  • requirements.txt / pyproject.toml - Add dependencies if needed

Success Criteria

  • Blocklist downloads use persistent disk cache
  • Cache validates with ETag/Last-Modified headers
  • 304 Not Modified responses use cached content
  • Cache TTL configurable (default 24 hours)
  • --clear-cache flag to manually invalidate
  • Cold-start sync time reduced by 30-50% (measure before/after)
  • All existing tests pass
  • New tests for cache hit/miss scenarios

Priority

High - Significant user experience improvement for repeated syncs

Effort Estimate

2-3 days (design, implementation, testing)

Performance Metrics

Before (baseline to measure):

  • Time to first folder sync (cold start)
  • Total sync time for 23 folders
  • Network bandwidth used

After (target improvements):

  • 30-50% reduction in cold-start sync time
  • 80-90% reduction in bandwidth for unchanged blocklists

Source

Extracted from Daily Perf Improver discussion abhimehro/ctrld-sync#219

🔍 Task mining by Discussion Task Miner - Code Quality Improvement Agent

To install this workflow, run gh aw add github/gh-aw/.github/workflows/discussion-task-miner.md@94662b1dee8ce96c876ba9f33b3ab8be32de82a4. View source at https://github.com/github/gh-aw/tree/94662b1dee8ce96c876ba9f33b3ab8be32de82a4/.github/workflows/discussion-task-miner.md.

  • expires on Feb 20, 2026, 9:12 AM UTC

Comments on the Issue (you are @copilot in this section)


🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.

@trunk-io
Copy link

trunk-io bot commented Feb 19, 2026

😎 Merged manually by @abhimehro - details.

Copilot AI and others added 2 commits February 19, 2026 23:56
Co-authored-by: abhimehro <84992105+abhimehro@users.noreply.github.com>
Co-authored-by: abhimehro <84992105+abhimehro@users.noreply.github.com>
Copilot AI changed the title [WIP] Implement persistent disk cache for blocklist downloads Add cache TTL and --clear-cache flag to persistent blocklist cache Feb 20, 2026
Copilot AI requested a review from abhimehro February 20, 2026 00:01
@abhimehro abhimehro marked this pull request as ready for review February 20, 2026 00:15
Copilot AI review requested due to automatic review settings February 20, 2026 00:15
@github-actions
Copy link

👋 Development Partner is reviewing this PR. Will provide feedback shortly.

with _cache_lock:
_cache[url] = data
_cache_stats["hits"] += 1
log.debug(f"Disk cache hit (within TTL) for {sanitize_for_log(url)}")

Check warning

Code scanning / Prospector (reported by Codacy)

Use lazy % formatting in logging functions (logging-fstring-interpolation) Warning

Use lazy % formatting in logging functions (logging-fstring-interpolation)
else:
print(f"{Colors.CYAN}ℹ No cache file found, nothing to clear{Colors.ENDC}")
_disk_cache.clear()
exit(0)

Check warning

Code scanning / Prospector (reported by Codacy)

Consider using sys.exit() (consider-using-sys-exit) Warning

Consider using sys.exit() (consider-using-sys-exit)

Check warning

Code scanning / Prospector (reported by Codacy)

Access to a protected member _disk_cache of a client class (protected-access) Warning test

Access to a protected member _disk_cache of a client class (protected-access)

Check warning

Code scanning / Prospector (reported by Codacy)

Access to a protected member _cache_stats of a client class (protected-access) Warning test

Access to a protected member _cache_stats of a client class (protected-access)

Check warning

Code scanning / Prospector (reported by Codacy)

Access to a protected member _cache_stats of a client class (protected-access) Warning test

Access to a protected member _cache_stats of a client class (protected-access)

Check warning

Code scanning / Prospector (reported by Codacy)

Access to a protected member _gh_get of a client class (protected-access) Warning test

Access to a protected member _gh_get of a client class (protected-access)

Check warning

Code scanning / Prospector (reported by Codacy)

Access to a protected member _cache_stats of a client class (protected-access) Warning test

Access to a protected member _cache_stats of a client class (protected-access)

Check warning

Code scanning / Prospector (reported by Codacy)

Access to a protected member _cache_stats of a client class (protected-access) Warning test

Access to a protected member _cache_stats of a client class (protected-access)

Check warning

Code scanning / Prospector (reported by Codacy)

Access to a protected member _disk_cache of a client class (protected-access) Warning test

Access to a protected member _disk_cache of a client class (protected-access)

Check warning

Code scanning / Prospector (reported by Codacy)

Access to a protected member _disk_cache of a client class (protected-access) Warning test

Access to a protected member _disk_cache of a client class (protected-access)

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Trailing whitespace Warning test

Trailing whitespace

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Trailing whitespace Warning test

Trailing whitespace

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Trailing whitespace Warning test

Trailing whitespace
else:
print(f"{Colors.CYAN}ℹ No cache file found, nothing to clear{Colors.ENDC}")
_disk_cache.clear()
exit(0)

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Consider using sys.exit() Warning

Consider using sys.exit()
print(f"{Colors.GREEN}✓ Cleared blocklist cache: {cache_file}{Colors.ENDC}")
except OSError as e:
print(f"{Colors.FAIL}✗ Failed to clear cache: {e}{Colors.ENDC}")
exit(1)

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Consider using sys.exit() Warning

Consider using sys.exit()

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Access to a protected member _disk_cache of a client class Note test

Access to a protected member _disk_cache of a client class

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Access to a protected member _disk_cache of a client class Note test

Access to a protected member _disk_cache of a client class

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Access to a protected member _gh of a client class Note test

Access to a protected member _gh of a client class

# Handle --clear-cache: delete cache file and exit immediately
if args.clear_cache:
global _disk_cache

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Using the global statement Note

Using the global statement
with _cache_lock:
_cache[url] = data
_cache_stats["hits"] += 1
log.debug(f"Disk cache hit (within TTL) for {sanitize_for_log(url)}")

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Use lazy % formatting in logging functions Note

Use lazy % formatting in logging functions

Check warning

Code scanning / Pylint (reported by Codacy)

Instance of 'PurePath' has no 'exists' member Warning test

Instance of 'PurePath' has no 'exists' member

Check warning

Code scanning / Pylint (reported by Codacy)

Instance of 'PurePath' has no 'unlink' member Warning test

Instance of 'PurePath' has no 'unlink' member

Check warning

Code scanning / Pylint (reported by Codacy)

Instance of 'PurePath' has no 'exists' member Warning test

Instance of 'PurePath' has no 'exists' member

Check warning

Code scanning / Pylint (reported by Codacy)

Instance of 'PurePath' has no 'write_text' member Warning test

Instance of 'PurePath' has no 'write_text' member

Check warning

Code scanning / Pylint (reported by Codacy)

Trailing whitespace Warning test

Trailing whitespace

Check notice

Code scanning / Pylint (reported by Codacy)

Access to a protected member _gh_get of a client class Note test

Access to a protected member _gh_get of a client class

Check notice

Code scanning / Pylint (reported by Codacy)

Unused argument 'url' Note test

Unused argument 'url'

Check notice

Code scanning / Pylint (reported by Codacy)

Access to a protected member _cache_stats of a client class Note test

Access to a protected member _cache_stats of a client class

Check notice

Code scanning / Pylint (reported by Codacy)

Unused argument 'method' Note test

Unused argument 'method'

Check notice

Code scanning / Pylint (reported by Codacy)

Access to a protected member _disk_cache of a client class Note test

Access to a protected member _disk_cache of a client class
@abhimehro abhimehro merged commit 6aa20ec into main Feb 20, 2026
65 checks passed
@abhimehro abhimehro deleted the copilot/implement-disk-cache-blocklist branch February 20, 2026 00:18
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds cache TTL enforcement and a manual cache clearing mechanism to the existing persistent disk cache infrastructure. The implementation enables blocklists to be served directly from disk within a 24-hour TTL window without any HTTP requests, reducing network traffic and improving cold-start performance for frequent syncs.

Changes:

  • Added 24-hour TTL for disk cache entries to avoid unnecessary HTTP requests within the TTL window
  • Added --clear-cache CLI flag to manually invalidate and delete the persistent cache file
  • Added comprehensive tests for TTL behavior (within/beyond TTL) and cache clearing functionality

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

File Description
main.py Added CACHE_TTL_SECONDS constant, TTL check in _gh_get() before making HTTP requests, --clear-cache argument and handler
tests/test_disk_cache.py Added three new tests: TTL cache hit without HTTP request, TTL expiry triggering conditional request, and cache file deletion
uv.lock Removed empty metadata section (automated cleanup)
Comments suppressed due to low confidence (3)

main.py:1274

  • The blocklist_fetches counter is incremented at line 1260 before checking if a TTL cache hit occurs. When an entry is within TTL (lines 1267-1274), no HTTP request is made, but the counter has already been incremented. This makes the "Blocklist fetches" statistic misleading.

The counter should only be incremented when an actual HTTP request is made to fetch or validate a blocklist. Move the counter increment to after the TTL check, so it's only incremented when we proceed to make a network request.

    # Track that we're about to make a blocklist fetch
    with _cache_lock:
        _api_stats["blocklist_fetches"] += 1
    
    # Check disk cache for TTL-based hit or conditional request headers
    headers = {}
    cached_entry = _disk_cache.get(url)
    if cached_entry:
        last_validated = cached_entry.get("last_validated", 0)
        if time.time() - last_validated < CACHE_TTL_SECONDS:
            # Within TTL: return cached data directly without any HTTP request
            data = cached_entry["data"]
            with _cache_lock:
                _cache[url] = data
            _cache_stats["hits"] += 1
            log.debug(f"Disk cache hit (within TTL) for {sanitize_for_log(url)}")
            return data

tests/test_disk_cache.py:339

  • The test verifies that no HTTP request is made and that cache stats are correct, but it doesn't verify that the blocklist_fetches counter is not incremented. Since this counter is currently being incremented even for TTL cache hits (see the related bug in main.py), this test should also verify that _api_stats["blocklist_fetches"] remains at 0 when serving from TTL cache.
    def test_ttl_within_ttl_returns_disk_cache_without_request(self):
        """Test that disk cache entries within TTL are returned without an HTTP request."""
        test_url = "https://example.com/test.json"
        test_data = {"group": {"group": "Test"}, "domains": ["example.com"]}
        
        # Pre-populate disk cache with a recent last_validated timestamp (within TTL)
        main._disk_cache[test_url] = {
            "data": test_data,
            "etag": "fresh123",
            "last_modified": None,
            "fetched_at": time.time(),
            "last_validated": time.time(),  # Just validated - within TTL
        }
        
        http_called = []
        
        def mock_stream(method, url, headers=None):
            http_called.append(url)
            mock_response = MagicMock()
            mock_response.status_code = 200
            mock_response.headers = {"Content-Type": "application/json"}
            mock_response.__enter__ = MagicMock(return_value=mock_response)
            mock_response.__exit__ = MagicMock(return_value=False)
            return mock_response
        
        with patch.object(main._gh, 'stream', side_effect=mock_stream):
            result = main._gh_get(test_url)
        
        # HTTP should NOT have been called (within TTL)
        self.assertEqual(len(http_called), 0)
        # Result should be the cached data
        self.assertEqual(result, test_data)
        # Should count as a hit
        self.assertEqual(main._cache_stats["hits"], 1)
        self.assertEqual(main._cache_stats["misses"], 0)
        self.assertEqual(main._cache_stats["validations"], 0)

tests/test_disk_cache.py:373

  • The test verifies the conditional request behavior when TTL is expired, but it doesn't verify the blocklist_fetches counter. This test should verify that _api_stats["blocklist_fetches"] is incremented by 1 when a conditional request is sent (even though it results in a 304 response), since an HTTP request was actually made.
    def test_ttl_expired_sends_conditional_request(self):
        """Test that disk cache entries beyond TTL trigger a conditional HTTP request."""
        test_url = "https://example.com/test.json"
        test_data = {"group": {"group": "Test"}, "domains": ["example.com"]}
        
        # Pre-populate disk cache with an old last_validated (beyond TTL)
        main._disk_cache[test_url] = {
            "data": test_data,
            "etag": "stale123",
            "last_modified": None,
            "fetched_at": 0.0,       # very old
            "last_validated": 0.0,   # very old - beyond any TTL
        }
        
        def mock_stream(method, url, headers=None):
            # Conditional request should be sent with If-None-Match
            self.assertEqual(headers.get("If-None-Match"), "stale123")
            mock_response = MagicMock()
            mock_response.status_code = 304
            mock_response.raise_for_status = MagicMock()
            mock_response.headers = {}
            mock_response.__enter__ = MagicMock(return_value=mock_response)
            mock_response.__exit__ = MagicMock(return_value=False)
            return mock_response
        
        with patch.object(main._gh, 'stream', side_effect=mock_stream):
            result = main._gh_get(test_url)
        
        # Should return cached data (304 response)
        self.assertEqual(result, test_data)
        # Should count as validation (conditional request)
        self.assertEqual(main._cache_stats["validations"], 1)
        self.assertEqual(main._cache_stats["hits"], 0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Code Quality] Implement persistent disk cache for blocklist downloads

3 participants