Draft: Recycle storage scale plan (design doc) #17
jdidion wants to merge 2 commits
Conversation
Design for keeping recycle dedup and display fast as the log grows past ~5K entries. The plan is backend-adaptive:

- **Obsidian backend:** rotate `Curaitor/Recycle.md` into monthly archives (`Curaitor/Recycle/YYYY-MM.md`) and maintain a derived TSV index at `.curaitor/recycle-index.tsv` for O(1) dedup probes. Markdown stays the human-readable source of truth; the index is rebuilt on mismatch.
- **SQLite backend:** add `url_normalized` + `tag` columns and indexes; dedup becomes a single indexed SELECT. Markdown regeneration is opt-in via a CLI export.

Interface additions (backward-compatible): `hasRecycled(url)` for O(1) membership tests, `loadRecycleRecent(limit)` for bounded reads.

Not yet implemented: the spec gates the work behind a threshold (~5K entries OR >100ms dedup wall time).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
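To make the O(1) membership claim concrete, here is a minimal sketch of a dedup probe against the TSV index. The normalization rules shown (lowercase host, drop scheme, leading `www.`, and trailing slash) are illustrative assumptions rather than settled spec, and the column layout assumes `url_normalized` in the first TSV column as described in the implementation PR below:

```python
from pathlib import Path
from urllib.parse import urlsplit

def normalize_url(url: str) -> str:
    """Illustrative normalization: drop scheme, leading www., and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{parts.path.rstrip('/')}{query}"

def load_recycle_index(vault: Path) -> set[str]:
    """Read .curaitor/recycle-index.tsv into a set; each probe is then O(1)."""
    tsv = vault / ".curaitor" / "recycle-index.tsv"
    urls: set[str] = set()
    for line in tsv.read_text(encoding="utf-8").splitlines():
        if line.startswith(("#", "url_normalized")) or not line.strip():
            continue  # skip the checksum header, the column header, and blanks
        urls.add(line.split("\t", 1)[0])  # first column is url_normalized
    return urls

def has_recycled(url: str, index: set[str]) -> bool:
    """The spec's hasRecycled(url) semantics: an O(1) membership test."""
    return normalize_url(url) in index
```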
jdidion added a commit to jdidion/claude-plugins that referenced this pull request on May 10, 2026
…#86)

> **Note:** This MR was largely generated by Claude and has not been completely reviewed by me (the human). You should feel free to defer your review until this warning has been removed.

## Summary

Partial implementation of [curaitor PR #17](jdidion/curaitor#17) (Recycle Storage Scale Plan), specifically the Obsidian-side TSV index + fast-path dedup. Skips the monthly-partitioning and SQLite-backend work because:

- The current Recycle.md is 370 lines, with 1175 unique URLs across live + archives, still well below the 5K trigger the design doc set.
- Monthly partitioning is already done by `scripts/recycle-rollover.py`; no new partitioning work is needed.
- SQLite-backend changes belong in the curaitor TypeScript app, not the claude-plugins Python scripts.

This PR validates the fast-path design on real data so it's ready to pay off when the log grows.

## What shipped

### `scripts/recycle-reindex.py` (new)

One-shot TSV builder. Scans `Curaitor/Recycle.md` plus the most-recent N monthly archives, extracts unique normalized URLs, and writes `.curaitor/recycle-index.tsv` in this format:

```
# recycle-index v1 checksum=<sha256-over-source-files>
url_normalized                                   source_file         title
genomeweb.com/genomic-research/ssri-approach...  Recycle.md          Genotype-Guided SSRI...
arxiv.org/abs/2404.12345                         Recycle-2026-04.md  Some arxiv paper
...
```

It dedups across files (the same URL in live Recycle.md AND an archive yields one row). The header line embeds a SHA-256 checksum over the source markdown bytes so consumers can detect drift. It is idempotent, safe to re-run, and writes atomically via tmp + rename. Supports `--dry-run` and `--json` modes for observability.
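To make the checksum-stamped atomic write concrete, here is a minimal sketch; `extract_urls` is a simplified stand-in for the script's real markdown extraction, so treat this as illustration rather than the shipped implementation:

```python
import hashlib
import os
import re
from pathlib import Path

LINK_RE = re.compile(r"\[([^\]]*)\]\((https?://[^)\s]+)\)")

def extract_urls(markdown: str) -> list[tuple[str, str]]:
    """Stand-in extractor: yields (url_normalized, title) pairs from markdown links."""
    return [(m.group(2).split("://", 1)[1].rstrip("/"), m.group(1))
            for m in LINK_RE.finditer(markdown)]

def write_index(vault: Path, sources: list[Path]) -> None:
    """Build .curaitor/recycle-index.tsv atomically: write a tmp file, then rename."""
    checksum = hashlib.sha256()
    rows: dict[str, tuple[str, str]] = {}  # url_normalized -> (source_file, title)
    for src in sources:
        data = src.read_bytes()
        checksum.update(data)  # checksum covers the raw source markdown bytes
        for url, title in extract_urls(data.decode("utf-8")):
            rows.setdefault(url, (src.name, title))  # first hit wins: dedup across files
    out = vault / ".curaitor" / "recycle-index.tsv"
    out.parent.mkdir(parents=True, exist_ok=True)
    tmp = out.with_name(out.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        f.write(f"# recycle-index v1 checksum={checksum.hexdigest()}\n")
        f.write("url_normalized\tsource_file\ttitle\n")
        for url, (source, title) in sorted(rows.items()):
            f.write(f"{url}\t{source}\t{title}\n")
    os.replace(tmp, out)  # atomic on POSIX: readers see the old or new file, never a partial one
```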
### `scripts/triage-write.py`: fast-path integration

- New `_load_recycle_tsv(vault)`: reads the TSV if its checksum matches the current Recycle.md + archives. Returns a `set[str]` of normalized URLs on success, `None` on any failure (missing file / stale checksum / malformed).
- `build_recycle_index(vault)`: prefers the TSV and falls back to a line-by-line markdown parse on a miss. On a miss it **kicks off a background rebuild** via `Popen` (fire-and-forget, `start_new_session=True`) so the next call hits the fast path. A sketch of this miss-then-rebuild flow follows the PR description below.
- `build_url_index(vault)`: reuses the cached TSV for the recycle portion and only runs the markdown parse for recycle sources when the TSV is unavailable. The fallback is the current line-by-line parse.

**Every code path remains correct without the TSV**; the TSV is pure optimization.

## Measured performance

Benchmark on synthetic Recycle.md files (local disk):

| N rows | Fast path | Slow path | Speedup |
|--------|-----------|-----------|---------|
| 1,000  | 0.6 ms    | 2.3 ms    | 3.8×    |
| 5,000  | 1.4 ms    | 11.4 ms   | 8.1×    |
| 20,000 | 4.0 ms    | 44.4 ms   | 11.1×   |

On the **live Google Drive-backed vault** (1175 URLs across 3 files), the fast path is currently ~3× slower than the fallback (20.8 ms vs 6.9 ms): the checksum computation reads all 3 source files, and Google Drive's filesystem latency dominates at this scale. This inverts past ~3-5K rows, where the fallback's per-row parse cost grows linearly while the TSV's per-row lookup stays constant. The design doc's "<1ms p99 on indexed path" holds on local filesystems; Google Drive adds a filesystem-latency constant that benchmarks need to account for.

**Parity verified**: on the real vault (1175 URLs), fast path and slow path return exactly the same set with zero diff.

## What's NOT in this PR (deferred)

- **Monthly-partition strategy from the design doc**: already done by `scripts/recycle-rollover.py`. No work needed.
- **SQLite backend schema additions** (`url_normalized`, `tag` columns + indexes): belongs in the curaitor TypeScript app, not here.
- **StorageBackend interface additions** (`hasRecycled(url)`, `loadRecycleRecent(limit)`): TypeScript app, not here.
- **SQLite export CLI** (`npm run export:recycle-md`): TypeScript app.
- **Cron-wrapper auto-reindex pre-triage**: not needed; the background rebuild on miss converges within one triage cycle.

## Initial TSV build

The user's vault already has `.curaitor/recycle-index.tsv` as of this PR commit (built during spike verification). Future cron runs pick it up automatically. For any other install: run `python3 scripts/recycle-reindex.py` once, or just run any triage; the miss-handler will auto-rebuild.

## Test plan

- [ ] Next triage cron fire hits the TSV fast path (verify via timing in `triage-write.py` output if we add instrumentation later, or trust the parity check).
- [ ] Hand-edit Recycle.md → next triage hits a checksum mismatch → fallback runs → background rebuild kicks off → the triage after that is fast again. (Hand-test once Gemma is re-enabled, since cron is currently slow-path-only when local_triage is off.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
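The miss-then-rebuild sketch referenced above, assuming the TSV format shown earlier; `parse_recycle_markdown` is a simplified stand-in for the script's real line-by-line fallback, not the shipped parser:

```python
import hashlib
import re
import subprocess
import sys
from pathlib import Path

LINK_RE = re.compile(r"\((https?://[^)\s]+)\)")

def parse_recycle_markdown(sources: list[Path]) -> set[str]:
    """Stand-in slow path: line-by-line parse of the source markdown files."""
    urls: set[str] = set()
    for src in sources:
        for line in src.read_text(encoding="utf-8").splitlines():
            urls.update(u.split("://", 1)[1].rstrip("/") for u in LINK_RE.findall(line))
    return urls

def _load_recycle_tsv(vault: Path, sources: list[Path]) -> set[str] | None:
    """Return the indexed URL set, or None on a missing / stale / malformed TSV."""
    tsv = vault / ".curaitor" / "recycle-index.tsv"
    try:
        lines = tsv.read_text(encoding="utf-8").splitlines()
        h = hashlib.sha256()
        for src in sources:
            h.update(src.read_bytes())
        # Header format: "# recycle-index v1 checksum=<sha256>"
        if not lines[0].endswith(f"checksum={h.hexdigest()}"):
            return None  # stale: source markdown changed since the index was built
        return {ln.split("\t", 1)[0] for ln in lines[2:] if ln.strip()}
    except (OSError, IndexError, UnicodeDecodeError):
        return None

def build_recycle_index(vault: Path, sources: list[Path]) -> set[str]:
    urls = _load_recycle_tsv(vault, sources)
    if urls is not None:
        return urls  # fast path: checksum matched, trust the prebuilt set
    # Miss: fire-and-forget rebuild so the *next* call hits the fast path,
    # while this call still answers correctly via the slow parse.
    subprocess.Popen(
        [sys.executable, "scripts/recycle-reindex.py"],
        start_new_session=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return parse_recycle_markdown(sources)
```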
The TSV fast-path + auto-rebuild shipped in jdidion/claude-plugins#86 (2026-05-10). Update the status block to reflect the partial implementation and what's still deferred (SQLite backend schema, StorageBackend interface additions, markdown-export CLI). Include measured benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`docs/SPEC-recycle-scale.md`: backend-adaptive plan for keeping recycle dedup and display fast as the log grows past ~5K entries.

- Obsidian backend: rotate `Curaitor/Recycle.md` into monthly archives + a derived TSV index at `.curaitor/recycle-index.tsv`. Markdown stays the human source of truth.
- SQLite backend: `url_normalized` + `tag` columns and indexes; dedup = a single indexed SELECT.
- Interface: `hasRecycled(url)` for O(1) membership, `loadRecycleRecent(limit)` for bounded reads.

Status update (2026-05-10)
Obsidian-side portion is built, shipped in jdidion/claude-plugins#86:

- `scripts/recycle-reindex.py`: one-shot TSV builder (checksum-stamped, idempotent, atomic write).
- `scripts/triage-write.py`: fast-path integration in `build_recycle_index` + `build_url_index`, with background auto-rebuild on checksum mismatch.

Still deferred (TypeScript app work in this repo):
- SQLite backend schema additions (`url_normalized`, `tag` columns + indexes).
- `StorageBackend.hasRecycled(url)` + `StorageBackend.loadRecycleRecent(limit)` interface additions.
- Markdown-export CLI (`npm run export:recycle-md`).

Rationale for splitting: the current Recycle.md is 370 lines (1175 unique URLs including archives), well below the 5K trigger, and the SQLite backend is blocked on a separate storage-migration refactor. Doing the Obsidian-side work early validated the design on real data; the SQLite work follows when the storage migration lands.
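For the deferred SQLite side, a hedged sketch of what the spec's single-indexed-SELECT dedup and bounded read could look like. The `recycle` table name and the `rowid`-based recency ordering are assumptions (the real schema belongs to the TypeScript app and hasn't been written); it is sketched in Python's sqlite3 for consistency with the rest of this page:

```python
import sqlite3

# Assumed schema additions per the spec: url_normalized + tag columns and indexes.
# Table name "recycle" is a placeholder; apply via db.executescript(MIGRATION).
MIGRATION = """
ALTER TABLE recycle ADD COLUMN url_normalized TEXT;
ALTER TABLE recycle ADD COLUMN tag TEXT;
CREATE UNIQUE INDEX IF NOT EXISTS idx_recycle_url ON recycle(url_normalized);
CREATE INDEX IF NOT EXISTS idx_recycle_tag ON recycle(tag);
"""

def has_recycled(db: sqlite3.Connection, url_normalized: str) -> bool:
    """Dedup probe as a single SELECT served by idx_recycle_url."""
    row = db.execute(
        "SELECT 1 FROM recycle WHERE url_normalized = ? LIMIT 1",
        (url_normalized,),
    ).fetchone()
    return row is not None

def load_recycle_recent(db: sqlite3.Connection, limit: int) -> list[tuple[str, str]]:
    """Bounded read for display; rowid stands in for an insertion-order timestamp."""
    return db.execute(
        "SELECT url_normalized, tag FROM recycle ORDER BY rowid DESC LIMIT ?",
        (limit,),
    ).fetchall()
```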
Context
Test plan
- [ ] SQLite schema additions (`url_normalized`, `tag`, indexes) don't break the storage-migration plan already on record.
- [ ] `hasRecycled` + `loadRecycleRecent` interface additions look right; no other new methods needed.

🤖 Generated with Claude Code