
curaitor: Recycle TSV index fast-path (partial impl of PR #17 design)#86

Merged
jdidion merged 1 commit into main from feat/curaitor-recycle-tsv-index on May 10, 2026
Conversation

@jdidion (Owner) commented May 10, 2026

Note: This MR was largely generated by Claude and has not been completely reviewed by me (the human). You should feel free to defer your review until this warning has been removed.

Summary

Partial implementation of curaitor PR #17 (Recycle Storage Scale Plan) — specifically the Obsidian-side TSV index + fast-path dedup. Skips the monthly-partitioning and SQLite-backend work because:

  • Current Recycle.md is 370 lines, with 1175 unique URLs across live + archives; still well below the 5K trigger the design doc set.
  • Monthly partitioning is already done by scripts/recycle-rollover.py; no new partitioning work needed.
  • SQLite-backend changes belong in the curaitor TypeScript app, not the claude-plugins Python scripts.

This PR validates the fast-path design on real data so it's ready to pay off when the log grows.

What shipped

scripts/recycle-reindex.py (new)

One-shot TSV builder. Scans Curaitor/Recycle.md + the most-recent N monthly archives, extracts unique normalized URLs, writes .curaitor/recycle-index.tsv with format:

```
# recycle-index v1 checksum=<sha256-over-source-files>
url_normalized	source_file	title
genomeweb.com/genomic-research/ssri-approach...	Recycle.md	Genotype-Guided SSRI...
arxiv.org/abs/2404.12345	Recycle-2026-04.md	Some arxiv paper
...
```

Dedups across files (same URL in live Recycle.md AND an archive → one row). Header line embeds a SHA-256 checksum over the source markdown bytes so consumers can detect drift.

Idempotent, safe to re-run, atomic write via tmp + rename. Supports --dry-run and --json modes for observability.
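
For concreteness, a minimal sketch of the builder's core shape (checksum header, first-occurrence dedup, atomic tmp + rename write). The helper names and the exact URL-normalization rule are assumptions for illustration, not the shipped code:

```python
import hashlib
import os
import re
from pathlib import Path

def checksum_sources(paths: list[Path]) -> str:
    """SHA-256 over the raw bytes of every source markdown file, in order."""
    h = hashlib.sha256()
    for p in paths:
        h.update(p.read_bytes())
    return h.hexdigest()

def normalize_url(url: str) -> str:
    """Assumed normalization: drop scheme, leading www., and trailing slash."""
    url = re.sub(r"^https?://", "", url.strip())
    url = re.sub(r"^www\.", "", url)
    return url.rstrip("/")

def write_index(vault: Path, rows: dict[str, tuple[str, str]], sources: list[Path]) -> None:
    """rows maps normalized URL -> (source_file, title); first occurrence wins."""
    out = vault / ".curaitor" / "recycle-index.tsv"
    tmp = out.with_name(out.name + ".tmp")
    lines = [f"# recycle-index v1 checksum={checksum_sources(sources)}",
             "url_normalized\tsource_file\ttitle"]
    lines += [f"{url}\t{src}\t{title}" for url, (src, title) in sorted(rows.items())]
    tmp.write_text("\n".join(lines) + "\n", encoding="utf-8")
    os.replace(tmp, out)  # atomic on the same filesystem; readers never see a partial file
```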

scripts/triage-write.py — fast-path integration

  • New _load_recycle_tsv(vault): reads the TSV if its checksum matches the current Recycle.md + archives. Returns a set[str] of normalized URLs on success, None on any failure (missing file / stale checksum / malformed).
  • build_recycle_index(vault): prefers the TSV; falls back to line-by-line markdown parse on miss. On a miss, kicks off a background rebuild via Popen (fire-and-forget, start_new_session=True) so the next call hits the fast path.
  • build_url_index(vault): reuses the cached TSV for the recycle portion; only runs markdown parse for recycle sources when the TSV is unavailable.

The fallback is the current line-by-line parse. Every code path remains correct without the TSV; the TSV is pure optimization.
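
A hedged sketch of that miss handling, reusing checksum_sources() from the sketch above; recycle_source_files() and parse_recycle_markdown() stand in for the existing helpers and are assumptions, not the actual code:

```python
import subprocess
import sys
from pathlib import Path

def _load_recycle_tsv(vault: Path) -> set[str] | None:
    """Fast path: return normalized URLs from the TSV, or None on any miss."""
    index = vault / ".curaitor" / "recycle-index.tsv"
    try:
        lines = index.read_text(encoding="utf-8").splitlines()
        stored = lines[0].split("checksum=")[1].strip()
        if stored != checksum_sources(recycle_source_files(vault)):  # stale
            return None
        # Skip the header comment and the column row; first column is the URL.
        return {line.split("\t")[0] for line in lines[2:] if line.strip()}
    except (OSError, IndexError):
        return None  # missing or malformed file counts as a miss

def build_recycle_index(vault: Path) -> set[str]:
    urls = _load_recycle_tsv(vault)
    if urls is not None:
        return urls  # fast path
    # Miss: kick off a fire-and-forget rebuild so the next call is fast,
    # then fall back to the existing line-by-line markdown parse.
    subprocess.Popen(
        [sys.executable, "scripts/recycle-reindex.py"],
        start_new_session=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return parse_recycle_markdown(vault)
```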

Measured performance

Benchmark on synthetic Recycle.md files (local disk):

| N rows | Fast path | Slow path | Speedup |
| ------ | --------- | --------- | ------- |
| 1,000  | 0.6ms     | 2.3ms     | 3.8×    |
| 5,000  | 1.4ms     | 11.4ms    | 8.1×    |
| 20,000 | 4.0ms     | 44.4ms    | 11.1×   |

On the live Google Drive-backed vault (1175 URLs across 3 files), the fast path is currently ~3× slower than the fallback (20.8ms vs 6.9ms): the checksum computation reads all 3 source files, and Google Drive's filesystem latency dominates at this scale. This inverts past ~3-5K rows, where the fallback's per-row parse cost grows linearly while the TSV's per-row lookup stays constant. The design doc's "<1ms p99 on indexed path" holds on local filesystems; Google Drive adds a filesystem-latency constant that benchmarks need to account for.
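
Back-of-envelope on those numbers (assuming the fallback scales roughly linearly and the ~21ms checksum cost stays roughly flat): the fallback runs at about 6.9ms / 1175 ≈ 6µs per row, so it overtakes the fast path's fixed cost around 21ms / 6µs ≈ 3,500 rows, consistent with the 3-5K crossover above.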

Parity verified: on the real vault (1175 URLs), fast path and slow path return exactly the same set with zero diff.
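
The parity check amounts to comparing the two sets; an illustrative version (not the exact script used), built on the sketched helpers above:

```python
fast = _load_recycle_tsv(vault)          # TSV fast path
slow = parse_recycle_markdown(vault)     # line-by-line fallback
assert fast is not None, "TSV miss: rebuild before comparing"
diff = fast ^ slow                       # symmetric difference
print(f"fast={len(fast)} slow={len(slow)} diff={len(diff)}")
assert not diff, sorted(diff)[:10]
```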

What's NOT in this PR (deferred)

  • Monthly-partition strategy from the design doc — already done by scripts/recycle-rollover.py. No work needed.
  • SQLite backend schema additions (url_normalized, tag columns + indexes) — belongs in the curaitor TypeScript app, not here.
  • StorageBackend interface additions (hasRecycled(url), loadRecycleRecent(limit)) — TypeScript app, not here.
  • SQLite export CLI (npm run export:recycle-md) — TypeScript app.
  • Cron-wrapper auto-reindex pre-triage — not needed; the background rebuild on miss converges within one triage cycle.

Initial TSV build

The user's vault already has .curaitor/recycle-index.tsv as of this PR commit (built during spike verification). Future cron runs pick it up automatically.

For any other install: run python3 scripts/recycle-reindex.py once. Or just run any triage — the miss-handler will auto-rebuild.

Test plan

  • The next triage cron run hits the TSV fast path (verify via timing in triage-write.py output if instrumentation is added later, or trust the parity check).
  • Hand-edit Recycle.md → the next triage hits a checksum mismatch → fallback runs → background rebuild → the triage after that is fast again. (Hand-test once Gemma is re-enabled, since cron is currently slow-path-only when local_triage is off.)

🤖 Generated with Claude Code

Implements the Obsidian-side TSV-index portion of the Recycle
storage scale plan (curaitor#17). Skips the monthly-partition
work (already done by recycle-rollover.py) and the SQLite-backend
additions (TypeScript app, not this repo).

New scripts/recycle-reindex.py builds .curaitor/recycle-index.tsv
from the live Recycle.md + most-recent N monthly archives. Each
row: url_normalized\tsource_file\ttitle. Header embeds a SHA-256
over source-file bytes for drift detection. Idempotent, atomic
write, supports --dry-run and --json.

scripts/triage-write.py gains _load_recycle_tsv() (fast-path
reader that verifies the stored checksum matches current sources)
and uses it in build_recycle_index() and build_url_index(). On
any miss (stale / malformed / missing TSV) it falls back to the
existing line-by-line markdown parse AND kicks off a background
rebuild via Popen+start_new_session so next call is fast.

Parity verified on the real vault (1175 URLs across 3 files);
fast path and slow path return identical sets. Benchmarked on
local disk: 3.8x at 1K rows, 8.1x at 5K, 11.1x at 20K. Google
Drive filesystem latency makes the fast path currently slower
than fallback at today's 1175-row scale (checksum I/O dominates),
but the crossover happens around 3-5K rows, which is the original
design's trigger point anyway.

Current vault already has the initial TSV built. Future cron
runs + any new install auto-self-heal via the on-miss rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jdidion jdidion marked this pull request as ready for review May 10, 2026 16:15
@jdidion jdidion merged commit ad76e80 into main May 10, 2026
@jdidion jdidion deleted the feat/curaitor-recycle-tsv-index branch May 10, 2026 16:23
jdidion added a commit to jdidion/curaitor that referenced this pull request May 10, 2026
The TSV fast-path + auto-rebuild shipped in
jdidion/claude-plugins#86 (2026-05-10). Update the status block
to reflect partial-implementation + what's still deferred
(SQLite backend schema, StorageBackend interface additions,
markdown-export CLI). Include measured benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdidion added a commit that referenced this pull request May 10, 2026
> **Note:** This MR was largely generated by Claude and has not been
completely reviewed by me (the human). You should feel free to defer
your review until this warning has been removed.

## Summary

Release curaitor v0.4.0 — bumps
`plugins/curaitor/.claude-plugin/plugin.json` version from `0.3.0` to
`0.4.0`. Includes the full run of cron + triage refactors landed in the
last few days.

## Release notes

### Cron architecture: Claude-free

The scheduled `/cu:discover` and `/cu:triage` jobs no longer depend on
Claude auth. Both skills' cron paths were replaced by headless Python
orchestrators that run feeds/Instapaper fetch → dedup → Gemma pre-pass →
deterministic routing → optional level-2 Claude queue for interactive
drain.

- **#79** headless `/cu:discover` cron (`scripts/discover-cron.py`)
- **#81** headless `/cu:triage` cron (`scripts/triage-cron.py`) —
Instapaper source, hard-routes LinkedIn / video / podcast URLs to
pending-claude-review, runs Gemma on text-article survivors
- **#80** hardened cron log tail (`## Cron end` marker + zero-output
annotation); dropped the Feedly backend and its 2 gated-journal feeds
(operational friction outweighed coverage)

### Dedup gets source-aware + a fast path

- **#84** Instapaper saves of URLs already in `Curaitor/Ignored/` now
rescue the note to Inbox (fresh classification + `prior_*` audit keys)
instead of silently appending `(duplicate)` to Recycle.md. Fixes a
regression where explicit user saves were being dropped.
- **#86** TSV index fast-path at `.curaitor/recycle-index.tsv` with
SHA-256 drift detection and background auto-rebuild. Partial
implementation of curaitor#17's design; Obsidian-side only, SQLite work
deferred.

### Gemma reliability

- **#85** consistency check + one-shot repair re-prompt in
`scripts/local-triage.py`. Catches the `(confidence=high-not-interested,
verdict=review)` contradictions Gemma-4-26B-A4B-it-MLX-4bit produces at
batch=1 (measured 17.7% rate on real Ignored corpus). On persistent
contradiction, flags `error: contradiction_unresolved` and escalates to
Claude. Gemma re-enabled in user-local config alongside this.
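
An illustrative sketch of the check's shape; field names and the repair-prompt hook are assumptions, not the actual code in `scripts/local-triage.py`:

```python
def check_and_repair(result: dict, reprompt) -> dict:
    """One-shot repair of the contradictory (high-not-interested, review) pair."""
    def contradictory(r: dict) -> bool:
        return (r.get("confidence") == "high-not-interested"
                and r.get("verdict") == "review")

    if not contradictory(result):
        return result
    repaired = reprompt(result)  # single repair re-prompt to Gemma
    if contradictory(repaired):
        repaired["error"] = "contradiction_unresolved"  # escalates to Claude
    return repaired
```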

### Skills / UX

- **#82** banned subjective quality judgments in `cu:read` / `cu:review`
/ `cu:review-ignored` (no more "excellent paper", "major figure").
Explicit allow-list for factual descriptors.
- **#87** reviewed-keep flag: `/cu:read skip` and `/cu:review y`/`r`
stamp `review_status: kept-after-review`. `/cu:read` startup surfaces
previously-kept articles in a distinct section above fresh arrivals.

## Verification

- Plugin install path is direct symlinks from
`~/projects/curaitor-review/.claude/skills/` into
`~/projects/claude-plugins/plugins/curaitor/`. All merged PRs already
reached the running interactive sessions; the version bump is for
marketplace index accuracy.
- Cron wrapper + the 3 Python orchestrators have been exercised
end-to-end this week.

## What's deferred to v0.5.0+

- Spike (c) Gemma batch-4-8 — only if v0.4.0's consistency check proves
insufficient in measured data.
- SQLite backend schema additions for Recycle (curaitor app TypeScript
work, separate repo).
- Full `/cu:triage` interactive-mode rewrite (skill doc currently has
both headless + interactive instructions; could be cleaner).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
