curaitor: Recycle TSV index fast-path (partial impl of PR #17 design) #86
Merged
Conversation
Implements the Obsidian-side TSV-index portion of the Recycle storage scale plan (curaitor#17). Skips the monthly-partition work (already done by recycle-rollover.py) and the SQLite-backend additions (TypeScript app, not this repo).

New scripts/recycle-reindex.py builds .curaitor/recycle-index.tsv from the live Recycle.md + the most-recent N monthly archives. Each row: url_normalized\tsource_file\ttitle. The header embeds a SHA-256 over source-file bytes for drift detection. Idempotent, atomic write, supports --dry-run and --json.

scripts/triage-write.py gains _load_recycle_tsv() (a fast-path reader that verifies the stored checksum against the current sources) and uses it in build_recycle_index() and build_url_index(). On any miss (stale / malformed / missing TSV) it falls back to the existing line-by-line markdown parse AND kicks off a background rebuild via Popen + start_new_session so the next call is fast.

Parity verified on the real vault (1175 URLs across 3 files); fast path and slow path return identical sets. Benchmarked on local disk: 3.8x at 1K rows, 8.1x at 5K, 11.1x at 20K. Google Drive filesystem latency makes the fast path currently slower than the fallback at today's 1175-row scale (checksum I/O dominates), but the crossover happens around 3-5K rows, which is the original design's trigger point anyway.

The current vault already has the initial TSV built. Future cron runs + any new install self-heal via the on-miss rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdidion added a commit to jdidion/curaitor that referenced this pull request on May 10, 2026
The TSV fast-path + auto-rebuild shipped in jdidion/claude-plugins#86 (2026-05-10). Update the status block to reflect the partial implementation + what's still deferred (SQLite backend schema, StorageBackend interface additions, markdown-export CLI). Include measured benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdidion added a commit that referenced this pull request on May 10, 2026
> **Note:** This MR was largely generated by Claude and has not been completely reviewed by me (the human). You should feel free to defer your review until this warning has been removed.

## Summary

Release curaitor v0.4.0 — bumps `plugins/curaitor/.claude-plugin/plugin.json` version from `0.3.0` to `0.4.0`. Includes the full run of cron + triage refactors landed in the last few days.

## Release notes

### Cron architecture: Claude-free

The scheduled `/cu:discover` and `/cu:triage` jobs no longer depend on Claude auth. Both skills' cron path was replaced by headless Python orchestrators that run feeds/Instapaper fetch → dedup → Gemma pre-pass → deterministic routing → optional level-2 Claude queue for interactive drain.

- **#79** headless `/cu:discover` cron (`scripts/discover-cron.py`)
- **#81** headless `/cu:triage` cron (`scripts/triage-cron.py`) — Instapaper source, hard-routes LinkedIn / video / podcast URLs to pending-claude-review, runs Gemma on text-article survivors
- **#80** hardened cron log tail (`## Cron end` marker + zero-output annotation); dropped the Feedly backend and its 2 gated-journal feeds (operational friction outweighed coverage)

### Dedup gets source-aware + a fast path

- **#84** Instapaper saves of URLs already in `Curaitor/Ignored/` now rescue the note to Inbox (fresh classification + `prior_*` audit keys) instead of silently appending `(duplicate)` to Recycle.md. Fixes a regression where explicit user saves were being dropped.
- **#86** TSV index fast-path at `.curaitor/recycle-index.tsv` with SHA-256 drift detection and background auto-rebuild. Partial implementation of curaitor#17's design; Obsidian-side only, SQLite work deferred.

### Gemma reliability

- **#85** consistency check + one-shot repair re-prompt in `scripts/local-triage.py`. Catches the `(confidence=high-not-interested, verdict=review)` contradictions Gemma-4-26B-A4B-it-MLX-4bit produces at batch=1 (measured 17.7% rate on real Ignored corpus). On persistent contradiction, flags `error: contradiction_unresolved` and escalates to Claude. Gemma re-enabled in user-local config alongside this.

### Skills / UX

- **#82** banned subjective quality judgments in `cu:read` / `cu:review` / `cu:review-ignored` (no more "excellent paper", "major figure"). Explicit allow-list for factual descriptors.
- **#87** reviewed-keep flag: `/cu:read skip` and `/cu:review y`/`r` stamp `review_status: kept-after-review`. `/cu:read` startup surfaces previously-kept articles in a distinct section above fresh arrivals.

## Verification

- Plugin install path is direct symlinks from `~/projects/curaitor-review/.claude/skills/` into `~/projects/claude-plugins/plugins/curaitor/`. All merged PRs already reached the running interactive sessions; the version bump is for marketplace index accuracy.
- Cron wrapper + the 3 Python orchestrators have been exercised end-to-end this week.

## What's deferred to v0.5.0+

- Spike (c) Gemma batch-4-8 — only if v0.4.0's consistency check proves insufficient in measured data.
- SQLite backend schema additions for Recycle (curaitor app TypeScript work, separate repo).
- Full `/cu:triage` interactive-mode rewrite (skill doc currently has both headless + interactive instructions; could be cleaner).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
Partial implementation of curaitor PR #17 (Recycle Storage Scale Plan): specifically the Obsidian-side TSV index + fast-path dedup. Skips the monthly-partitioning and SQLite-backend work because:

- Monthly partitioning is already handled by `scripts/recycle-rollover.py`; no new partitioning work needed.
- The SQLite backend belongs in the curaitor TypeScript app, not this repo.

This PR validates the fast-path design on real data so it's ready to pay off when the log grows.
## What shipped
### `scripts/recycle-reindex.py` (new)

One-shot TSV builder. Scans `Curaitor/Recycle.md` + the most-recent N monthly archives, extracts unique normalized URLs, and writes `.curaitor/recycle-index.tsv` with one row per URL: `url_normalized\tsource_file\ttitle`. Dedups across files (same URL in live Recycle.md AND an archive → one row). The header line embeds a SHA-256 checksum over the source markdown bytes so consumers can detect drift.

Idempotent, safe to re-run, atomic write via tmp + rename. Supports `--dry-run` and `--json` modes for observability.

### `scripts/triage-write.py`: fast-path integration

- `_load_recycle_tsv(vault)`: reads the TSV if its checksum matches the current Recycle.md + archives. Returns a `set[str]` of normalized URLs on success, `None` on any failure (missing file / stale checksum / malformed).
- `build_recycle_index(vault)`: prefers the TSV; falls back to the line-by-line markdown parse on a miss. On a miss, it also kicks off a background rebuild via `Popen` (fire-and-forget, `start_new_session=True`) so the next call hits the fast path.
- `build_url_index(vault)`: reuses the cached TSV for the recycle portion; only runs the markdown parse for recycle sources when the TSV is unavailable.

The fallback is the current line-by-line parse. Every code path remains correct without the TSV; the TSV is pure optimization. A sketch of the fast path and miss-handler follows.
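For concreteness, here is a minimal sketch of the shape of that fast path. The vault layout, archive glob, TSV header format, and the fallback's name (`_parse_recycle_markdown`) are illustrative assumptions, not the shipped code; only `_load_recycle_tsv`, `build_recycle_index`, the SHA-256-over-source-bytes check, and the `Popen` + `start_new_session=True` rebuild come from this PR:

```python
import hashlib
import subprocess
import sys
from pathlib import Path


def _source_files(vault: Path) -> list[Path]:
    # Hypothetical layout: the live log plus monthly archives; the real
    # archive naming comes from recycle-rollover.py and may differ.
    live = vault / "Curaitor" / "Recycle.md"
    archives = sorted((vault / "Curaitor").glob("Recycle-*.md"))
    return [live, *archives]


def _sources_sha256(files: list[Path]) -> str:
    # One digest over the raw bytes of every source file, in order, so any
    # edit to Recycle.md or an archive invalidates the stored index.
    h = hashlib.sha256()
    for f in files:
        h.update(f.read_bytes())
    return h.hexdigest()


def _load_recycle_tsv(vault: Path) -> set[str] | None:
    """Fast path: URL set from the TSV, or None on any miss."""
    tsv = vault / ".curaitor" / "recycle-index.tsv"
    try:
        lines = tsv.read_text(encoding="utf-8").splitlines()
        # Assumed header shape: the first line carries "sha256=<hex>".
        stored = lines[0].split("sha256=", 1)[1].strip()
    except (OSError, IndexError):
        return None  # missing or malformed TSV
    if stored != _sources_sha256(_source_files(vault)):
        return None  # stale: sources changed since the index was built
    # Rows are url_normalized \t source_file \t title; only column 0 matters here.
    return {line.split("\t", 1)[0] for line in lines[1:] if line.strip()}


def build_recycle_index(vault: Path) -> set[str]:
    urls = _load_recycle_tsv(vault)
    if urls is not None:
        return urls
    # Miss: fire-and-forget background rebuild, detached from this process
    # group via start_new_session, so the *next* call hits the fast path.
    # (Script path is illustrative; the real call resolves it relative to
    # the plugin install.)
    subprocess.Popen(
        [sys.executable, "scripts/recycle-reindex.py"],
        start_new_session=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return _parse_recycle_markdown(vault)  # existing line-by-line fallback
```

Returning `None` rather than raising keeps the caller's logic to a single `is not None` check, so the fast path can be dropped in without touching the fallback's error handling.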
## Measured performance
Benchmark on synthetic Recycle.md files (local disk):

| Rows | Fast path vs. fallback |
| --- | --- |
| 1,000 | 3.8× faster |
| 5,000 | 8.1× faster |
| 20,000 | 11.1× faster |
On the live Google Drive-backed vault (1175 URLs across 3 files), the fast path is currently ~3× slower than the fallback (20.8 ms vs 6.9 ms) because the checksum computation reads all 3 source files and Google Drive's filesystem latency dominates at this scale. This inverts past ~3-5K rows, where the fallback's parse cost grows linearly with row count while the TSV path's checksum cost stays roughly constant. The design doc's "<1ms p99 on indexed path" holds on local filesystems; Google Drive adds a filesystem-latency constant that benchmarks need to account for.
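The crossover argument (constant checksum I/O vs. linear parse cost) is straightforward to sanity-check with a coarse harness along these lines; `synthetic_vault` is a hypothetical fixture builder that writes an N-row Recycle.md, and the two entry points are the ones from the sketch above:

```python
import time


def median_seconds(fn, *args, runs: int = 25) -> float:
    # Median wall-clock time over `runs` calls; coarse, but stable enough
    # to see a 3-11x gap and where the two curves cross.
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[runs // 2]


# for rows in (1_000, 5_000, 20_000):
#     vault = synthetic_vault(rows)  # hypothetical fixture builder
#     print(rows,
#           median_seconds(_load_recycle_tsv, vault),
#           median_seconds(_parse_recycle_markdown, vault))
```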
Parity verified: on the real vault (1175 URLs), fast path and slow path return exactly the same set with zero diff.
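A parity check of this shape is worth keeping around for future changes; a sketch, reusing the illustrative names from above:

```python
def check_parity(vault) -> None:
    # Both paths must produce the identical normalized-URL set.
    fast = _load_recycle_tsv(vault)
    slow = _parse_recycle_markdown(vault)
    assert fast is not None, "TSV missing or stale; run recycle-reindex.py first"
    extra, missing = fast - slow, slow - fast
    assert not extra and not missing, (
        f"index drift: {len(extra)} extra, {len(missing)} missing"
    )
```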
## What's NOT in this PR (deferred)
- Monthly partitioning: already handled by `scripts/recycle-rollover.py`. No work needed.
- SQLite backend schema additions (`url_normalized`, `tag` columns + indexes): belongs in the curaitor TypeScript app, not here.
- `StorageBackend` interface additions (`hasRecycled(url)`, `loadRecycleRecent(limit)`): TypeScript app, not here.
- Markdown-export CLI (`npm run export:recycle-md`): TypeScript app.

## Initial TSV build
The user's vault already has `.curaitor/recycle-index.tsv` as of this PR commit (built during spike verification). Future cron runs pick it up automatically.

For any other install: run `python3 scripts/recycle-reindex.py` once, or just run any triage; the miss-handler will auto-rebuild.
## Test plan

- Parity check on the real vault: fast path and fallback return identical URL sets (verified above).
- Benchmarks on synthetic Recycle.md files at 1K / 5K / 20K rows (table above).
- Fast-path hits are observable in `triage-write.py` output if we add instrumentation later, or trust the parity check.

🤖 Generated with Claude Code