feat(parser): content-hash diff-summary disk cache (#845) #858
Merged
PR 5 of the #845 sprint. Adds a per-repo, content-hash-keyed disk cache for LLM-summarized diffs so re-runs of `coco commit` / `changelog` / `recap` / `review` skip the LLM entirely for files whose diffs haven't changed since the prior run.

Cache key: `sha256(diff + model + SUMMARIZE_PROMPT_HASH)`. Switching models or editing the summarization prompt invalidates entries automatically.

Storage lives at `$XDG_CACHE_HOME/coco/diff-summaries/summaries.<repo-hash>.json` with a 500-entry hard cap and LRU eviction. Best-effort throughout: read/write failures fall back to the LLM path, never load-bearing.

Wired into both `summarizeFileDiff` (pre-process pass) and `summarizeDirectoryDiff` (wave consolidation). New `coco cache` command exposes `clear` and `info` subcommands. `COCO_NO_CACHE=1` opts out for users who want a guaranteed fresh run.

Bench harness gains `--repeat` (cold pass + warm pass) and `--no-cache` flags. New baseline captures both passes.

Bench numbers (cold → warm wall-clock):
- medium: 8506ms → 4ms (99.95% faster)
- large: 14417ms → 6ms (99.96%)
- feature-add: 13064ms → 4ms (99.97%)
- refactor: 51116ms → 6ms (99.99%)
- initial-commit: 16967ms → 7ms (99.96%)
- docs-update: 24384ms → 2ms (99.99%)
- monorepo: 88958ms → 26ms (99.97%)

24 new tests (19 cache module + 5 cache command). Full suite: 5134 passing.
…6858 The repo-key is just a 16-char filename suffix; it never needed sha1 specifically. Switching to sha256 + truncate keeps the same behavior (deterministic short identifier, same length on disk) and clears the DevSkim weak-hash alert without an inline suppression.
Summary
PR 5 of the #845 diff-condensing pipeline sprint. Adds a per-repo, content-hash-keyed disk cache so re-runs of `coco commit`/`changelog`/`recap`/`review` skip the LLM entirely for files whose diffs haven't changed since the prior run. Cache reads and writes are best-effort and gated by an opt-out env var.
What changed
- `src/lib/parsers/default/utils/diffSummaryCache.ts`: new module. JSON-on-disk cache with sha256 keys, 500-entry hard cap, LRU eviction, schema-versioned envelope. 19 unit tests cover key derivation, path resolution, round-trip, repo isolation, eviction, touch, clear, corrupt-file robustness.
- `src/lib/langchain/chains/summarize/prompt.ts`: exports `SUMMARIZE_PROMPT_HASH` (sha256 of the template, first 16 chars). Cache keys mix this in so any prompt edit invalidates prior entries automatically.
- `summarizeFileDiff` (in `summarizeLargeFiles.ts`) and `summarizeDirectoryDiff` (in `summarizeDiffs.ts`): check the cache after the trivial-shape skip, return synthetic results on hit, write on success.
- `src/commands/cache/`: new `coco cache <subcommand>` command. `clear` removes the file; `info` reports entry count, on-disk size, total summary tokens, last-saved timestamp.
- `bin/benchmark.ts`: new `--repeat` flag runs each fixture cold + warm so PRs can demonstrate cache wins. New `--no-cache` flag for reproducing pre-PR-5 numbers.
- `.bench/baseline.json`: refreshed to capture both cold and warm passes.
Cache key
`sha256(diff + '\x1f' + model + '\x1f' + promptHash)`
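A minimal sketch of how this key could be derived with Node's built-in `crypto` module. The helper names `deriveCacheKey` and `hashPrompt` are hypothetical and chosen for illustration; the actual exports in `diffSummaryCache.ts` and `prompt.ts` may differ.

```typescript
import { createHash } from 'node:crypto'

// Unit separator (0x1f) keeps field boundaries unambiguous, so
// ("ab", "c") and ("a", "bc") never produce the same hash input.
const SEP = '\x1f'

// Hypothetical helper: sha256 of the prompt template, first 16 hex chars,
// mirroring how the PR describes SUMMARIZE_PROMPT_HASH.
export function hashPrompt(template: string): string {
  return createHash('sha256').update(template).digest('hex').slice(0, 16)
}

// Hypothetical helper: full 64-char hex key over diff, model, and prompt hash.
export function deriveCacheKey(diff: string, model: string, promptHash: string): string {
  return createHash('sha256')
    .update([diff, model, promptHash].join(SEP))
    .digest('hex')
}
```

Because the separator is a control character that should not appear in model names or hex hashes, concatenation cannot silently merge two different inputs into one key.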
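The model and prompt hash are mixed into the key, so:
- Switching models invalidates entries.
- Editing `SUMMARIZE_PROMPT` invalidates entries (different prompt → different output, no manual bumps required).
Storage
Entries live at `$XDG_CACHE_HOME/coco/diff-summaries/summaries.<repo-hash>.json`, one file per repo, with a 500-entry hard cap and LRU eviction. The `<repo-hash>` suffix is a 16-character truncated sha256.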
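The storage layout and eviction policy this PR describes could be sketched roughly as follows. This is an illustration under assumptions, not the module's code: the `CacheEntry` shape, the `lastUsed` field, the `~/.cache` fallback, and hashing the repo root path for the file suffix are all guesses made for the example.

```typescript
import { createHash } from 'node:crypto'
import { homedir } from 'node:os'
import { join } from 'node:path'

const MAX_ENTRIES = 500 // hard cap stated in the PR

// Assumed entry shape; field names are illustrative.
export interface CacheEntry {
  summary: string
  lastUsed: number // epoch ms, refreshed on every hit ("touch")
}

// Resolves $XDG_CACHE_HOME/coco/diff-summaries/summaries.<repo-hash>.json.
// Assumptions: ~/.cache fallback (the XDG default) and a repo hash derived
// from the repo root path, truncated to 16 hex chars.
export function cachePath(
  repoRoot: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const base = env.XDG_CACHE_HOME || join(homedir(), '.cache')
  const repoHash = createHash('sha256').update(repoRoot).digest('hex').slice(0, 16)
  return join(base, 'coco', 'diff-summaries', `summaries.${repoHash}.json`)
}

// Drops least-recently-used entries until the map fits under the cap.
export function evictLru(entries: Map<string, CacheEntry>, cap: number = MAX_ENTRIES): void {
  if (entries.size <= cap) return
  const oldestFirst = Array.from(entries.entries()).sort(
    (a, b) => a[1].lastUsed - b[1].lastUsed,
  )
  for (const [key] of oldestFirst.slice(0, entries.size - cap)) {
    entries.delete(key)
  }
}
```

Sorting on every eviction is O(n log n), which is cheap at a 500-entry cap and keeps the on-disk format a plain JSON object.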
Opt-out
`COCO_NO_CACHE=1` disables both reads and writes for the entire run. Useful for benchmarking, debugging unexpected output, or forcing a full regeneration.
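One way the opt-out and the best-effort behavior could fit together is a read-through wrapper like the sketch below. The `SummaryStore` interface and `summarizeWithCache` function are hypothetical names for illustration, and the check assumes only the exact value `1` disables the cache; the real gating logic may accept other values.

```typescript
// Hypothetical store interface; the real cache module's API may differ.
export interface SummaryStore {
  get(key: string): string | undefined
  set(key: string, summary: string): void
}

type Env = Record<string, string | undefined>

// Assumption: only COCO_NO_CACHE=1 (exactly) turns the cache off.
export function cacheEnabled(env: Env = process.env): boolean {
  return env.COCO_NO_CACHE !== '1'
}

export async function summarizeWithCache(
  key: string,
  store: SummaryStore,
  summarize: () => Promise<string>, // stands in for the LLM call
  env: Env = process.env,
): Promise<string> {
  const enabled = cacheEnabled(env)
  if (enabled) {
    try {
      const hit = store.get(key)
      if (hit !== undefined) return hit
    } catch {
      // Best-effort: a failed read falls through to the LLM path.
    }
  }
  const summary = await summarize()
  if (enabled) {
    try {
      store.set(key, summary)
    } catch {
      // Best-effort: a failed write never fails the run.
    }
  }
  return summary
}
```

With the opt-out set, neither `get` nor `set` is touched, so the run behaves as if the cache did not exist.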
Bench numbers
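Bench with `--repeat` against all fixtures (cold pass clears the cache first; warm pass runs again with full cache):

| Fixture | Cold (ms) | Warm (ms) | Reduction |
| --- | --- | --- | --- |
| medium | 8506 | 4 | 99.95% |
| large | 14417 | 6 | 99.96% |
| feature-add | 13064 | 4 | 99.97% |
| refactor | 51116 | 6 | 99.99% |
| initial-commit | 16967 | 7 | 99.96% |
| docs-update | 24384 | 2 | 99.99% |
| monorepo | 88958 | 26 | 99.97% |
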
The cold numbers match PR 4's baseline (the cache miss path is identical to the prior code path). The win is the warm column: same fixture, unchanged inputs, second run is essentially free.
The realistic case for users is "edit one file → re-commit": the cache hit rate is at least (N-1)/N, where N is the file count, so wall-clock time collapses to roughly one LLM call instead of N.
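To make that arithmetic concrete, here is a small sketch. The `rerunCost` helper is hypothetical, and it assumes one LLM call per changed file while ignoring any directory-level consolidation calls that a changed file might also trigger.

```typescript
// Hypothetical helper: estimates LLM calls and hit rate for a re-run
// where `changed` of `total` files have a different diff than last time.
// Simplifying assumption: one LLM call per changed file.
export function rerunCost(total: number, changed: number): { llmCalls: number; hitRate: number } {
  const hits = total - changed
  return {
    llmCalls: changed,
    hitRate: total === 0 ? 0 : hits / total,
  }
}
```

For a 20-file change set where one file was edited, `rerunCost(20, 1)` gives one LLM call and a hit rate of 19/20 (0.95), compared with 20 calls on a cold run.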
Test plan
- `npm run test` (5134 passing, 0 regressions; +24 new)
- `npm run lint`
- `npm run build`: confirmed CLI builds cleanly
- `npm run build:schema`: schema unchanged
- `node dist/index.js cache --help`: command surfaces correctly
- `node dist/index.js cache info`: shows real entries from a recent bench run (80 entries, 15.1 KB)
- `npm run bench -- --repeat`: confirms warm wins across all fixtures