Draft: Recycle storage scale plan (design doc) #17
jdidion wants to merge 2 commits
Conversation
Design for keeping recycle dedup and display fast as the log grows past ~5K entries. The plan is backend-adaptive:

- **Obsidian backend:** rotate `Curaitor/Recycle.md` into monthly archives (`Curaitor/Recycle/YYYY-MM.md`) and maintain a derived TSV index at `.curaitor/recycle-index.tsv` for O(1) dedup probes. Markdown stays the human-readable source of truth; the index is rebuilt on mismatch.
- **SQLite backend:** add `url_normalized` + `tag` columns and indexes; dedup becomes a single indexed SELECT. Markdown regeneration is opt-in via a CLI export.

Interface additions (backward-compatible): `hasRecycled(url)` for O(1) membership tests, `loadRecycleRecent(limit)` for bounded reads.

Not yet implemented: the spec gates the work behind a threshold (~5K entries OR >100ms dedup wall time).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
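To make the O(1) membership claim concrete, here is a minimal sketch of a dedup probe against the TSV index. The normalization rules shown (lowercase host, drop scheme, leading `www.`, and trailing slash) are illustrative assumptions rather than settled spec, and the column layout assumes `url_normalized` in the first TSV column as described in the implementation PR below:

```python
from pathlib import Path
from urllib.parse import urlsplit

def normalize_url(url: str) -> str:
    """Illustrative normalization: drop scheme, leading www., and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{parts.path.rstrip('/')}{query}"

def load_recycle_index(vault: Path) -> set[str]:
    """Read .curaitor/recycle-index.tsv into a set; each probe is then O(1)."""
    tsv = vault / ".curaitor" / "recycle-index.tsv"
    urls: set[str] = set()
    for line in tsv.read_text(encoding="utf-8").splitlines():
        if line.startswith(("#", "url_normalized")) or not line.strip():
            continue  # skip the checksum header, the column header, and blanks
        urls.add(line.split("\t", 1)[0])  # first column is url_normalized
    return urls

def has_recycled(url: str, index: set[str]) -> bool:
    """The spec's hasRecycled(url) semantics: an O(1) membership test."""
    return normalize_url(url) in index
```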
jdidion added a commit to jdidion/claude-plugins that referenced this pull request on May 10, 2026
…#86)

> **Note:** This MR was largely generated by Claude and has not been completely reviewed by me (the human). You should feel free to defer your review until this warning has been removed.

## Summary

Partial implementation of [curaitor PR #17](jdidion/curaitor#17) (Recycle Storage Scale Plan), specifically the Obsidian-side TSV index + fast-path dedup. Skips the monthly-partitioning and SQLite-backend work because:

- The current Recycle.md is 370 lines, with 1175 unique URLs across live + archives, still well below the 5K trigger the design doc set.
- Monthly partitioning is already done by `scripts/recycle-rollover.py`; no new partitioning work is needed.
- SQLite-backend changes belong in the curaitor TypeScript app, not the claude-plugins Python scripts.

This PR validates the fast-path design on real data so it's ready to pay off when the log grows.

## What shipped

### `scripts/recycle-reindex.py` (new)

One-shot TSV builder. Scans `Curaitor/Recycle.md` plus the most-recent N monthly archives, extracts unique normalized URLs, and writes `.curaitor/recycle-index.tsv` in this format:

```
# recycle-index v1 checksum=<sha256-over-source-files>
url_normalized                                   source_file         title
genomeweb.com/genomic-research/ssri-approach...  Recycle.md          Genotype-Guided SSRI...
arxiv.org/abs/2404.12345                         Recycle-2026-04.md  Some arxiv paper
...
```

It dedups across files (the same URL in live Recycle.md AND an archive yields one row). The header line embeds a SHA-256 checksum over the source markdown bytes so consumers can detect drift. It is idempotent, safe to re-run, and writes atomically via tmp + rename. Supports `--dry-run` and `--json` modes for observability.
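To make the checksum-stamped atomic write concrete, here is a minimal sketch; `extract_urls` is a simplified stand-in for the script's real markdown extraction, so treat this as illustration rather than the shipped implementation:

```python
import hashlib
import os
import re
from pathlib import Path

LINK_RE = re.compile(r"\[([^\]]*)\]\((https?://[^)\s]+)\)")

def extract_urls(markdown: str) -> list[tuple[str, str]]:
    """Stand-in extractor: yields (url_normalized, title) pairs from markdown links."""
    return [(m.group(2).split("://", 1)[1].rstrip("/"), m.group(1))
            for m in LINK_RE.finditer(markdown)]

def write_index(vault: Path, sources: list[Path]) -> None:
    """Build .curaitor/recycle-index.tsv atomically: write a tmp file, then rename."""
    checksum = hashlib.sha256()
    rows: dict[str, tuple[str, str]] = {}  # url_normalized -> (source_file, title)
    for src in sources:
        data = src.read_bytes()
        checksum.update(data)  # checksum covers the raw source markdown bytes
        for url, title in extract_urls(data.decode("utf-8")):
            rows.setdefault(url, (src.name, title))  # first hit wins: dedup across files
    out = vault / ".curaitor" / "recycle-index.tsv"
    out.parent.mkdir(parents=True, exist_ok=True)
    tmp = out.with_name(out.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        f.write(f"# recycle-index v1 checksum={checksum.hexdigest()}\n")
        f.write("url_normalized\tsource_file\ttitle\n")
        for url, (source, title) in sorted(rows.items()):
            f.write(f"{url}\t{source}\t{title}\n")
    os.replace(tmp, out)  # atomic on POSIX: readers see the old or new file, never a partial one
```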
### `scripts/triage-write.py`: fast-path integration

- New `_load_recycle_tsv(vault)`: reads the TSV if its checksum matches the current Recycle.md + archives. Returns a `set[str]` of normalized URLs on success, `None` on any failure (missing file / stale checksum / malformed).
- `build_recycle_index(vault)`: prefers the TSV and falls back to a line-by-line markdown parse on a miss. On a miss it **kicks off a background rebuild** via `Popen` (fire-and-forget, `start_new_session=True`) so the next call hits the fast path. A sketch of this miss-then-rebuild flow follows the PR description below.
- `build_url_index(vault)`: reuses the cached TSV for the recycle portion and only runs the markdown parse for recycle sources when the TSV is unavailable. The fallback is the current line-by-line parse.

**Every code path remains correct without the TSV**; the TSV is pure optimization.

## Measured performance

Benchmark on synthetic Recycle.md files (local disk):

| N rows | Fast path | Slow path | Speedup |
|--------|-----------|-----------|---------|
| 1,000  | 0.6 ms    | 2.3 ms    | 3.8×    |
| 5,000  | 1.4 ms    | 11.4 ms   | 8.1×    |
| 20,000 | 4.0 ms    | 44.4 ms   | 11.1×   |

On the **live Google Drive-backed vault** (1175 URLs across 3 files), the fast path is currently ~3× slower than the fallback (20.8 ms vs 6.9 ms): the checksum computation reads all 3 source files, and Google Drive's filesystem latency dominates at this scale. This inverts past ~3-5K rows, where the fallback's per-row parse cost grows linearly while the TSV's per-row lookup stays constant. The design doc's "<1ms p99 on indexed path" holds on local filesystems; Google Drive adds a filesystem-latency constant that benchmarks need to account for.

**Parity verified**: on the real vault (1175 URLs), fast path and slow path return exactly the same set with zero diff.

## What's NOT in this PR (deferred)

- **Monthly-partition strategy from the design doc**: already done by `scripts/recycle-rollover.py`. No work needed.
- **SQLite backend schema additions** (`url_normalized`, `tag` columns + indexes): belongs in the curaitor TypeScript app, not here.
- **StorageBackend interface additions** (`hasRecycled(url)`, `loadRecycleRecent(limit)`): TypeScript app, not here.
- **SQLite export CLI** (`npm run export:recycle-md`): TypeScript app.
- **Cron-wrapper auto-reindex pre-triage**: not needed; the background rebuild on miss converges within one triage cycle.

## Initial TSV build

The user's vault already has `.curaitor/recycle-index.tsv` as of this PR commit (built during spike verification). Future cron runs pick it up automatically. For any other install: run `python3 scripts/recycle-reindex.py` once, or just run any triage; the miss-handler will auto-rebuild.

## Test plan

- [ ] Next triage cron fire hits the TSV fast path (verify via timing in `triage-write.py` output if we add instrumentation later, or trust the parity check).
- [ ] Hand-edit Recycle.md → next triage hits a checksum mismatch → fallback runs → background rebuild kicks off → the triage after that is fast again. (Hand-test once Gemma is re-enabled, since cron is currently slow-path-only when local_triage is off.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
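The miss-then-rebuild sketch referenced above, assuming the TSV format shown earlier; `parse_recycle_markdown` is a simplified stand-in for the script's real line-by-line fallback, not the shipped parser:

```python
import hashlib
import re
import subprocess
import sys
from pathlib import Path

LINK_RE = re.compile(r"\((https?://[^)\s]+)\)")

def parse_recycle_markdown(sources: list[Path]) -> set[str]:
    """Stand-in slow path: line-by-line parse of the source markdown files."""
    urls: set[str] = set()
    for src in sources:
        for line in src.read_text(encoding="utf-8").splitlines():
            urls.update(u.split("://", 1)[1].rstrip("/") for u in LINK_RE.findall(line))
    return urls

def _load_recycle_tsv(vault: Path, sources: list[Path]) -> set[str] | None:
    """Return the indexed URL set, or None on a missing / stale / malformed TSV."""
    tsv = vault / ".curaitor" / "recycle-index.tsv"
    try:
        lines = tsv.read_text(encoding="utf-8").splitlines()
        h = hashlib.sha256()
        for src in sources:
            h.update(src.read_bytes())
        # Header format: "# recycle-index v1 checksum=<sha256>"
        if not lines[0].endswith(f"checksum={h.hexdigest()}"):
            return None  # stale: source markdown changed since the index was built
        return {ln.split("\t", 1)[0] for ln in lines[2:] if ln.strip()}
    except (OSError, IndexError, UnicodeDecodeError):
        return None

def build_recycle_index(vault: Path, sources: list[Path]) -> set[str]:
    urls = _load_recycle_tsv(vault, sources)
    if urls is not None:
        return urls  # fast path: checksum matched, trust the prebuilt set
    # Miss: fire-and-forget rebuild so the *next* call hits the fast path,
    # while this call still answers correctly via the slow parse.
    subprocess.Popen(
        [sys.executable, "scripts/recycle-reindex.py"],
        start_new_session=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return parse_recycle_markdown(sources)
```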
The TSV fast-path + auto-rebuild shipped in jdidion/claude-plugins#86 (2026-05-10). Update the status block to reflect the partial implementation and what's still deferred (SQLite backend schema, StorageBackend interface additions, markdown-export CLI). Include measured benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`docs/SPEC-recycle-scale.md`: backend-adaptive plan for keeping recycle dedup and display fast as the log grows past ~5K entries.

- Obsidian backend: rotate `Curaitor/Recycle.md` into monthly archives + a derived TSV index at `.curaitor/recycle-index.tsv`. Markdown stays the human source of truth.
- SQLite backend: `url_normalized` + `tag` columns and indexes; dedup = a single indexed SELECT.
- Interface: `hasRecycled(url)` for O(1) membership, `loadRecycleRecent(limit)` for bounded reads.

Status update (2026-05-10)
Obsidian-side portion is built, shipped in jdidion/claude-plugins#86:

- `scripts/recycle-reindex.py`: one-shot TSV builder (checksum-stamped, idempotent, atomic write).
- `scripts/triage-write.py`: fast-path integration in `build_recycle_index` + `build_url_index`, with background auto-rebuild on checksum mismatch.

Still deferred (TypeScript app work in this repo):
- SQLite backend schema additions (`url_normalized`, `tag` columns + indexes).
- `StorageBackend.hasRecycled(url)` + `StorageBackend.loadRecycleRecent(limit)` interface additions.
- Markdown-export CLI (`npm run export:recycle-md`).

Rationale for splitting: the current Recycle.md is 370 lines (1175 unique URLs including archives), well below the 5K trigger, and the SQLite backend is blocked on a separate storage-migration refactor. Doing the Obsidian-side work early validated the design on real data; the SQLite work follows when the storage migration lands.
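For the deferred SQLite side, a hedged sketch of what the spec's single-indexed-SELECT dedup and bounded read could look like. The `recycle` table name and the `rowid`-based recency ordering are assumptions (the real schema belongs to the TypeScript app and hasn't been written); it is sketched in Python's sqlite3 for consistency with the rest of this page:

```python
import sqlite3

# Assumed schema additions per the spec: url_normalized + tag columns and indexes.
# Table name "recycle" is a placeholder; apply via db.executescript(MIGRATION).
MIGRATION = """
ALTER TABLE recycle ADD COLUMN url_normalized TEXT;
ALTER TABLE recycle ADD COLUMN tag TEXT;
CREATE UNIQUE INDEX IF NOT EXISTS idx_recycle_url ON recycle(url_normalized);
CREATE INDEX IF NOT EXISTS idx_recycle_tag ON recycle(tag);
"""

def has_recycled(db: sqlite3.Connection, url_normalized: str) -> bool:
    """Dedup probe as a single SELECT served by idx_recycle_url."""
    row = db.execute(
        "SELECT 1 FROM recycle WHERE url_normalized = ? LIMIT 1",
        (url_normalized,),
    ).fetchone()
    return row is not None

def load_recycle_recent(db: sqlite3.Connection, limit: int) -> list[tuple[str, str]]:
    """Bounded read for display; rowid stands in for an insertion-order timestamp."""
    return db.execute(
        "SELECT url_normalized, tag FROM recycle ORDER BY rowid DESC LIMIT ?",
        (limit,),
    ).fetchall()
```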
Context
Test plan
- [ ] SQLite schema additions (`url_normalized`, `tag`, indexes) don't break the storage-migration plan already on record.
- [ ] `hasRecycled` + `loadRecycleRecent` interface additions look right; no other new methods needed.

🤖 Generated with Claude Code