
curaitor: Recycle TSV index fast-path (partial impl of PR #17 design)#86

Merged
jdidion merged 1 commit into main from feat/curaitor-recycle-tsv-index on May 10, 2026
Conversation

@jdidion (Owner) commented May 10, 2026

Note: This MR was largely generated by Claude and has not been completely reviewed by me (the human). You should feel free to defer your review until this warning has been removed.

Summary

Partial implementation of curaitor PR #17 (Recycle Storage Scale Plan) — specifically the Obsidian-side TSV index + fast-path dedup. Skips the monthly-partitioning and SQLite-backend work because:

  • Current Recycle.md is 370 lines, with 1175 unique URLs across live + archives; still well below the 5K trigger the design doc set.
  • Monthly partitioning is already done by scripts/recycle-rollover.py; no new partitioning work needed.
  • SQLite-backend changes belong in the curaitor TypeScript app, not the claude-plugins Python scripts.

This PR validates the fast-path design on real data so it's ready to pay off when the log grows.

What shipped

scripts/recycle-reindex.py (new)

One-shot TSV builder. Scans Curaitor/Recycle.md + the most-recent N monthly archives, extracts unique normalized URLs, writes .curaitor/recycle-index.tsv with format:

```
# recycle-index v1 checksum=<sha256-over-source-files>
url_normalized	source_file	title
genomeweb.com/genomic-research/ssri-approach...	Recycle.md	Genotype-Guided SSRI...
arxiv.org/abs/2404.12345	Recycle-2026-04.md	Some arxiv paper
...
```

Dedups across files (same URL in live Recycle.md AND an archive → one row). Header line embeds a SHA-256 checksum over the source markdown bytes so consumers can detect drift.

Idempotent, safe to re-run, atomic write via tmp + rename. Supports --dry-run and --json modes for observability.
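
For concreteness, a minimal sketch of the builder's core shape (checksum header, first-occurrence dedup, atomic tmp + rename write). The helper names and the exact URL-normalization rule are assumptions for illustration, not the shipped code:

```python
import hashlib
import os
import re
from pathlib import Path

def checksum_sources(paths: list[Path]) -> str:
    """SHA-256 over the raw bytes of every source markdown file, in order."""
    h = hashlib.sha256()
    for p in paths:
        h.update(p.read_bytes())
    return h.hexdigest()

def normalize_url(url: str) -> str:
    """Assumed normalization: drop scheme, leading www., and trailing slash."""
    url = re.sub(r"^https?://", "", url.strip())
    url = re.sub(r"^www\.", "", url)
    return url.rstrip("/")

def write_index(vault: Path, rows: dict[str, tuple[str, str]], sources: list[Path]) -> None:
    """rows maps normalized URL -> (source_file, title); first occurrence wins."""
    out = vault / ".curaitor" / "recycle-index.tsv"
    tmp = out.with_name(out.name + ".tmp")
    lines = [f"# recycle-index v1 checksum={checksum_sources(sources)}",
             "url_normalized\tsource_file\ttitle"]
    lines += [f"{url}\t{src}\t{title}" for url, (src, title) in sorted(rows.items())]
    tmp.write_text("\n".join(lines) + "\n", encoding="utf-8")
    os.replace(tmp, out)  # atomic on the same filesystem; readers never see a partial file
```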

scripts/triage-write.py — fast-path integration

  • New _load_recycle_tsv(vault): reads the TSV if its checksum matches the current Recycle.md + archives. Returns a set[str] of normalized URLs on success, None on any failure (missing file / stale checksum / malformed).
  • build_recycle_index(vault): prefers the TSV; falls back to line-by-line markdown parse on miss. On a miss, kicks off a background rebuild via Popen (fire-and-forget, start_new_session=True) so the next call hits the fast path.
  • build_url_index(vault): reuses the cached TSV for the recycle portion; only runs markdown parse for recycle sources when the TSV is unavailable.

The fallback is the current line-by-line parse. Every code path remains correct without the TSV; the TSV is pure optimization.
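
A hedged sketch of that miss handling, reusing checksum_sources() from the sketch above; recycle_source_files() and parse_recycle_markdown() stand in for the existing helpers and are assumptions, not the actual code:

```python
import subprocess
import sys
from pathlib import Path

def _load_recycle_tsv(vault: Path) -> set[str] | None:
    """Fast path: return normalized URLs from the TSV, or None on any miss."""
    index = vault / ".curaitor" / "recycle-index.tsv"
    try:
        lines = index.read_text(encoding="utf-8").splitlines()
        stored = lines[0].split("checksum=")[1].strip()
        if stored != checksum_sources(recycle_source_files(vault)):  # stale
            return None
        # Skip the header comment and the column row; first column is the URL.
        return {line.split("\t")[0] for line in lines[2:] if line.strip()}
    except (OSError, IndexError):
        return None  # missing or malformed file counts as a miss

def build_recycle_index(vault: Path) -> set[str]:
    urls = _load_recycle_tsv(vault)
    if urls is not None:
        return urls  # fast path
    # Miss: kick off a fire-and-forget rebuild so the next call is fast,
    # then fall back to the existing line-by-line markdown parse.
    subprocess.Popen(
        [sys.executable, "scripts/recycle-reindex.py"],
        start_new_session=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return parse_recycle_markdown(vault)
```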

Measured performance

Benchmark on synthetic Recycle.md files (local disk):

| N rows | Fast path | Slow path | Speedup |
| ------ | --------- | --------- | ------- |
| 1,000  | 0.6ms     | 2.3ms     | 3.8×    |
| 5,000  | 1.4ms     | 11.4ms    | 8.1×    |
| 20,000 | 4.0ms     | 44.4ms    | 11.1×   |

On the live Google Drive-backed vault (1175 URLs across 3 files), the fast path is currently ~3× slower than the fallback (20.8ms vs 6.9ms): the checksum computation reads all 3 source files, and Google Drive's filesystem latency dominates at this scale. This inverts past ~3-5K rows, where the fallback's per-row parse cost grows linearly while the TSV's per-row lookup stays constant. The design doc's "<1ms p99 on indexed path" holds on local filesystems; Google Drive adds a filesystem-latency constant that benchmarks need to account for.
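
Back-of-envelope on those numbers (assuming the fallback scales roughly linearly and the ~21ms checksum cost stays roughly flat): the fallback runs at about 6.9ms / 1175 ≈ 6µs per row, so it overtakes the fast path's fixed cost around 21ms / 6µs ≈ 3,500 rows, consistent with the 3-5K crossover above.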

Parity verified: on the real vault (1175 URLs), fast path and slow path return exactly the same set with zero diff.
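
The parity check amounts to comparing the two sets; an illustrative version (not the exact script used), built on the sketched helpers above:

```python
fast = _load_recycle_tsv(vault)          # TSV fast path
slow = parse_recycle_markdown(vault)     # line-by-line fallback
assert fast is not None, "TSV miss: rebuild before comparing"
diff = fast ^ slow                       # symmetric difference
print(f"fast={len(fast)} slow={len(slow)} diff={len(diff)}")
assert not diff, sorted(diff)[:10]
```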

What's NOT in this PR (deferred)

  • Monthly-partition strategy from the design doc — already done by scripts/recycle-rollover.py. No work needed.
  • SQLite backend schema additions (url_normalized, tag columns + indexes) — belongs in the curaitor TypeScript app, not here.
  • StorageBackend interface additions (hasRecycled(url), loadRecycleRecent(limit)) — TypeScript app, not here.
  • SQLite export CLI (npm run export:recycle-md) — TypeScript app.
  • Cron-wrapper auto-reindex pre-triage — not needed; the background rebuild on miss converges within one triage cycle.

Initial TSV build

The user's vault already has .curaitor/recycle-index.tsv as of this PR commit (built during spike verification). Future cron runs pick it up automatically.

For any other install: run python3 scripts/recycle-reindex.py once. Or just run any triage — the miss-handler will auto-rebuild.

Test plan

  • The next triage cron run hits the TSV fast path (verify via timing in triage-write.py output if instrumentation is added later, or trust the parity check).
  • Hand-edit Recycle.md → the next triage hits a checksum mismatch → fallback runs → background rebuild → the triage after that is fast again. (Hand-test once Gemma is re-enabled, since cron is currently slow-path-only when local_triage is off.)

🤖 Generated with Claude Code

Implements the Obsidian-side TSV-index portion of the Recycle
storage scale plan (curaitor#17). Skips the monthly-partition
work (already done by recycle-rollover.py) and the SQLite-backend
additions (TypeScript app, not this repo).

New scripts/recycle-reindex.py builds .curaitor/recycle-index.tsv
from the live Recycle.md + most-recent N monthly archives. Each
row: url_normalized\tsource_file\ttitle. Header embeds a SHA-256
over source-file bytes for drift detection. Idempotent, atomic
write, supports --dry-run and --json.

scripts/triage-write.py gains _load_recycle_tsv() (fast-path
reader that verifies the stored checksum matches current sources)
and uses it in build_recycle_index() and build_url_index(). On
any miss (stale / malformed / missing TSV) it falls back to the
existing line-by-line markdown parse AND kicks off a background
rebuild via Popen+start_new_session so next call is fast.

Parity verified on the real vault (1175 URLs across 3 files);
fast path and slow path return identical sets. Benchmarked on
local disk: 3.8x at 1K rows, 8.1x at 5K, 11.1x at 20K. Google
Drive filesystem latency makes the fast path currently slower
than fallback at today's 1175-row scale (checksum I/O dominates),
but the crossover happens around 3-5K rows, which is the original
design's trigger point anyway.

Current vault already has the initial TSV built. Future cron
runs + any new install auto-self-heal via the on-miss rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jdidion jdidion marked this pull request as ready for review May 10, 2026 16:15
@jdidion jdidion merged commit ad76e80 into main May 10, 2026
@jdidion jdidion deleted the feat/curaitor-recycle-tsv-index branch May 10, 2026 16:23
jdidion added a commit to jdidion/curaitor that referenced this pull request May 10, 2026
The TSV fast-path + auto-rebuild shipped in
jdidion/claude-plugins#86 (2026-05-10). Update the status block
to reflect partial-implementation + what's still deferred
(SQLite backend schema, StorageBackend interface additions,
markdown-export CLI). Include measured benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdidion added a commit that referenced this pull request May 10, 2026
> **Note:** This MR was largely generated by Claude and has not been
completely reviewed by me (the human). You should feel free to defer
your review until this warning has been removed.

## Summary

Release curaitor v0.4.0 — bumps
`plugins/curaitor/.claude-plugin/plugin.json` version from `0.3.0` to
`0.4.0`. Includes the full run of cron + triage refactors landed in the
last few days.

## Release notes

### Cron architecture: Claude-free

The scheduled `/cu:discover` and `/cu:triage` jobs no longer depend on
Claude auth. Both skills' cron paths were replaced by headless Python
orchestrators that run feeds/Instapaper fetch → dedup → Gemma pre-pass →
deterministic routing → optional level-2 Claude queue for interactive
drain.

- **#79** headless `/cu:discover` cron (`scripts/discover-cron.py`)
- **#81** headless `/cu:triage` cron (`scripts/triage-cron.py`) —
Instapaper source, hard-routes LinkedIn / video / podcast URLs to
pending-claude-review, runs Gemma on text-article survivors
- **#80** hardened cron log tail (`## Cron end` marker + zero-output
annotation); dropped the Feedly backend and its 2 gated-journal feeds
(operational friction outweighed coverage)

### Dedup gets source-aware + a fast path

- **#84** Instapaper saves of URLs already in `Curaitor/Ignored/` now
rescue the note to Inbox (fresh classification + `prior_*` audit keys)
instead of silently appending `(duplicate)` to Recycle.md. Fixes a
regression where explicit user saves were being dropped.
- **#86** TSV index fast-path at `.curaitor/recycle-index.tsv` with
SHA-256 drift detection and background auto-rebuild. Partial
implementation of curaitor#17's design; Obsidian-side only, SQLite work
deferred.

### Gemma reliability

- **#85** consistency check + one-shot repair re-prompt in
`scripts/local-triage.py`. Catches the `(confidence=high-not-interested,
verdict=review)` contradictions Gemma-4-26B-A4B-it-MLX-4bit produces at
batch=1 (measured 17.7% rate on real Ignored corpus). On persistent
contradiction, flags `error: contradiction_unresolved` and escalates to
Claude. Gemma re-enabled in user-local config alongside this.
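
An illustrative sketch of the check's shape; field names and the repair-prompt hook are assumptions, not the actual code in `scripts/local-triage.py`:

```python
def check_and_repair(result: dict, reprompt) -> dict:
    """One-shot repair of the contradictory (high-not-interested, review) pair."""
    def contradictory(r: dict) -> bool:
        return (r.get("confidence") == "high-not-interested"
                and r.get("verdict") == "review")

    if not contradictory(result):
        return result
    repaired = reprompt(result)  # single repair re-prompt to Gemma
    if contradictory(repaired):
        repaired["error"] = "contradiction_unresolved"  # escalates to Claude
    return repaired
```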

### Skills / UX

- **#82** banned subjective quality judgments in `cu:read` / `cu:review`
/ `cu:review-ignored` (no more "excellent paper", "major figure").
Explicit allow-list for factual descriptors.
- **#87** reviewed-keep flag: `/cu:read skip` and `/cu:review y`/`r`
stamp `review_status: kept-after-review`. `/cu:read` startup surfaces
previously-kept articles in a distinct section above fresh arrivals.

## Verification

- Plugin install path is direct symlinks from
`~/projects/curaitor-review/.claude/skills/` into
`~/projects/claude-plugins/plugins/curaitor/`. All merged PRs already
reached the running interactive sessions; the version bump is for
marketplace index accuracy.
- Cron wrapper + the 3 Python orchestrators have been exercised
end-to-end this week.

## What's deferred to v0.5.0+

- Spike (c) Gemma batch-4-8 — only if v0.4.0's consistency check proves
insufficient in measured data.
- SQLite backend schema additions for Recycle (curaitor app TypeScript
work, separate repo).
- Full `/cu:triage` interactive-mode rewrite (skill doc currently has
both headless + interactive instructions; could be cleaner).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
