6 changes: 6 additions & 0 deletions .jules/bolt.md
@@ -63,9 +63,15 @@
## 2026-02-17 - [Cache DNS Lookups by Hostname]
**Learning:** When validating multiple URLs from the same host (e.g., githubusercontent), caching based on the full URL still triggers redundant DNS lookups for each unique path. Extracting hostname validation into a separate `@lru_cache` function avoids repeated blocking `getaddrinfo` calls for the same domain.
**Action:** Identify expensive validation steps (like DNS) that depend on a subset of the input (hostname vs full URL) and cache them independently.
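A minimal sketch of this pattern (names are illustrative, not from the codebase): the cache key is the hostname rather than the full URL, so every path under the same host reuses one resolved entry.

```python
from functools import lru_cache
from urllib.parse import urlparse
import socket

@lru_cache(maxsize=256)
def resolve_host(hostname: str) -> bool:
    """Resolve at most once per hostname; cached across all URLs sharing it."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def validate_url(url: str) -> bool:
    # Cache key is the hostname, not the full URL, so
    # https://host/a and https://host/b hit the same cache entry.
    hostname = urlparse(url).hostname
    return hostname is not None and resolve_host(hostname)
```

`resolve_host.cache_info()` can confirm that repeated URLs on one host produce cache hits instead of blocking `getaddrinfo` calls.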

## 2024-03-24 - [Avoid Regex on Simple Strings]
**Learning:** Running complex regex substitutions on every log message (for sanitization) introduces measurable CPU overhead, especially when most strings don't contain sensitive patterns. Simple string checks (`in`) are orders of magnitude faster than regex execution.
**Action:** Add early return checks (e.g., `if "://" in s:`) before invoking expensive regex operations in hot paths like logging or string sanitization.
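A minimal sketch of the early-return guard (the pattern shown is a hypothetical stand-in, not the project's real sanitizer):

```python
import re

# Hypothetical sanitizer pattern; the real one lives in the project.
_URL_RE = re.compile(r"https?://\S+")

def sanitize(message: str) -> str:
    # Fast path: a substring check is far cheaper than invoking the regex
    # engine, and most log messages contain no URL at all.
    if "://" not in message:
        return message
    return _URL_RE.sub("<redacted>", message)
```

Only messages that can possibly match ever pay the regex cost, which matters when `sanitize` sits on a hot logging path.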

## 2024-03-24 - Thread Pool Churn
**Learning:** Python's `ThreadPoolExecutor` incurs measurable overhead (thread creation/shutdown) when created/destroyed repeatedly inside loops, even with small worker counts.
**Action:** Lift `ThreadPoolExecutor` creation to the highest possible scope and pass it down as a dependency (using `contextlib.nullcontext` for flexible ownership).
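A minimal sketch of the ownership pattern with `contextlib.nullcontext` (function and names are illustrative): the callee reuses a caller-owned executor when given one, and only creates (and shuts down) its own as a fallback.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import nullcontext
from typing import List, Optional

def process_items(items: List[int], executor: Optional[ThreadPoolExecutor] = None) -> List[int]:
    # If the caller owns an executor, nullcontext leaves it open on exit;
    # otherwise create one locally and let the with-block shut it down.
    cm = nullcontext(executor) if executor is not None else ThreadPoolExecutor(max_workers=4)
    with cm as pool:
        return list(pool.map(lambda x: x * 2, items))

# The caller creates the pool once and reuses it across many calls,
# avoiding thread creation/shutdown churn inside the loop:
with ThreadPoolExecutor(max_workers=4) as shared:
    results = [process_items([1, 2, 3], shared) for _ in range(3)]
```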

## 2026-02-19 - [Minimize JSON Serialization Overhead]

**Learning:** Using `indent=2` in `json.dump` significantly increases file size (newlines + spaces) and CPU time for formatting/I/O, especially for large datasets like blocklists. Removing indentation reduces file size by ~30% and speeds up I/O.
**Action:** Avoid pretty-printing (`indent=...`) for internal machine-readable cache files. Only use it for human-readable debug output.
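The size difference is easy to demonstrate (the data below is a toy stand-in for a large machine-readable cache such as a blocklist):

```python
import json

# Toy stand-in for a large machine-readable cache file.
data = {"blocked": [f"host{i}.example" for i in range(1000)]}

pretty = json.dumps(data, indent=2)              # human-readable, much larger
compact = json.dumps(data)                       # default: no newlines or indentation
tight = json.dumps(data, separators=(",", ":"))  # also drops spaces after "," and ":"
```

`separators=(",", ":")` squeezes out the remaining whitespace the default separators keep, for files only machines will read.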
18 changes: 11 additions & 7 deletions main.py
```diff
@@ -804,7 +804,8 @@ def save_disk_cache() -> None:
     # This prevents corrupted cache if process is killed mid-write
     temp_file = cache_file.with_suffix(".tmp")
     with open(temp_file, "w", encoding="utf-8") as f:
-        json.dump(_disk_cache, f, indent=2)
+        # OPTIMIZATION: Removed indent=2 to reduce cache file size and speed up I/O
+        json.dump(_disk_cache, f)

     # Set file permissions to user-only (rw-------)
     if platform.system() != "Windows":
```
```diff
@@ -1948,35 +1949,38 @@ def push_rules(
     # Using a local reference to the match method avoids function call overhead
     # in the hot loop. This provides a measurable speedup for large lists.
     match_rule = RULE_PATTERN.match
+    # Optimization 3: Hoist append method and sanitize call to avoid repeated lookups
+    append = filtered_hostnames.append
+    sanitized_folder_name = sanitize_for_log(folder_name)

     for h in unique_hostnames:
         if h in existing_rules:
             continue

         if not match_rule(h):
             log.warning(
-                f"Skipping unsafe rule in {sanitize_for_log(folder_name)}: {sanitize_for_log(h)}"
+                f"Skipping unsafe rule in {sanitized_folder_name}: {sanitize_for_log(h)}"
             )
             skipped_unsafe += 1
             continue

-        filtered_hostnames.append(h)
+        append(h)

     if skipped_unsafe > 0:
         log.warning(
-            f"Folder {sanitize_for_log(folder_name)}: skipped {skipped_unsafe} unsafe rules"
+            f"Folder {sanitized_folder_name}: skipped {skipped_unsafe} unsafe rules"
         )

     duplicates_count = original_count - len(filtered_hostnames) - skipped_unsafe

     if duplicates_count > 0:
         log.info(
-            f"Folder {sanitize_for_log(folder_name)}: skipping {duplicates_count} duplicate rules"
+            f"Folder {sanitized_folder_name}: skipping {duplicates_count} duplicate rules"
         )

     if not filtered_hostnames:
         log.info(
-            f"Folder {sanitize_for_log(folder_name)} - no new rules to push after filtering duplicates"
+            f"Folder {sanitized_folder_name} - no new rules to push after filtering duplicates"
         )
         return True
```
```diff
@@ -1993,7 +1997,7 @@ def push_rules(
     str_do = str(do)
     str_status = str(status)
     str_group = str(folder_id)
-    sanitized_folder_name = sanitize_for_log(folder_name)
+    # sanitized_folder_name is already computed above
     progress_label = f"Folder {sanitized_folder_name}"

     def process_batch(batch_idx: int, batch_data: List[str]) -> Optional[List[str]]:
```
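The hoisting change in the hunks above can be reduced to a standalone sketch (names are illustrative): binding a bound method to a local once moves the repeated attribute lookups (`RULE_PATTERN.match`, `result.append`) out of the hot loop.

```python
import re

RULE_PATTERN = re.compile(r"^[a-z0-9.-]+$")

def filter_rules(hostnames):
    # Hoist bound methods out of the loop: each dotted lookup would
    # otherwise be repeated on every iteration.
    match_rule = RULE_PATTERN.match
    result = []
    append = result.append
    for h in hostnames:
        if match_rule(h):
            append(h)
    return result
```

The win is small per iteration but measurable on large rule lists, which is why the PR applies it only inside the hot loop rather than everywhere.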