feat(parser): content-hash diff-summary disk cache (#845) by gfargo · Pull Request #858 · gfargo/coco

gfargo · 2026-05-06T05:13:58Z

Summary

PR 5 of the #845 diff-condensing pipeline sprint. Adds a per-repo, content-hash-keyed disk cache so re-runs of coco commit / changelog / recap / review skip the LLM entirely for files whose diffs haven't changed since the prior run. Cache writes are best-effort and gated by an opt-out env var.

What changed

src/lib/parsers/default/utils/diffSummaryCache.ts — new module. JSON-on-disk cache with sha256 keys, 500-entry hard cap, LRU eviction, schema-versioned envelope. 19 unit tests cover key derivation, path resolution, round-trip, repo isolation, eviction, touch, clear, corrupt-file robustness.
src/lib/langchain/chains/summarize/prompt.ts — exports SUMMARIZE_PROMPT_HASH (sha256 of the template, first 16 chars). Cache keys mix this in so any prompt edit invalidates prior entries automatically.
summarizeFileDiff (in summarizeLargeFiles.ts) and summarizeDirectoryDiff (in summarizeDiffs.ts) — check the cache after the trivial-shape skip, return synthetic results on hit, write on success.
src/commands/cache/ — new coco cache <subcommand> command. clear removes the file; info reports entry count, on-disk size, total summary tokens, last-saved timestamp.
bin/benchmark.ts — new --repeat flag runs each fixture cold + warm so PRs can demonstrate cache wins. New --no-cache flag for reproducing pre-PR-5 numbers.
.bench/baseline.json — refreshed to capture both cold and warm passes.
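
The lookup flow described for `summarizeFileDiff` and `summarizeDirectoryDiff` (check the cache, return on hit, write on success) can be sketched as a read-through wrapper. This is a minimal illustration, not the PR's code: `withSummaryCache`, `SummaryStore`, and `Summarizer` are hypothetical names, and the store is assumed to be any object with `get`/`set`. The trivial-shape skip mentioned above is assumed to have run before this function is called.

```typescript
// Hypothetical read-through wrapper; names and shapes are assumptions.
type Summarizer = (diff: string) => Promise<string>;

interface SummaryStore {
  get(key: string): string | undefined;
  set(key: string, summary: string): void;
}

async function withSummaryCache(
  key: string,
  diff: string,
  store: SummaryStore,
  summarize: Summarizer,
): Promise<{ summary: string; fromCache: boolean }> {
  let cached: string | undefined;
  try {
    cached = store.get(key);
  } catch {
    cached = undefined; // a read failure is treated as a cache miss
  }
  if (cached !== undefined) {
    return { summary: cached, fromCache: true };
  }

  // Miss: run the summarizer. If it throws, nothing is written to the store.
  const summary = await summarize(diff);
  try {
    store.set(key, summary);
  } catch {
    // write failures are swallowed; the summary is still returned
  }
  return { summary, fromCache: false };
}
```

A plain `Map<string, string>` satisfies `SummaryStore`, which makes the wrapper easy to exercise without touching the disk.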

Cache key

`sha256(diff + '\x1f' + model + '\x1f' + promptHash)`

The model and prompt hash are mixed into the key, so:

Switching providers/models invalidates entries (different models produce different summaries).
Editing SUMMARIZE_PROMPT invalidates entries (different prompt → different output, no manual bumps required).
Same diff text on the same model + prompt → cache hit, instant return.
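
As an illustration of the key formula above, here is a minimal sketch using Node's built-in `crypto` module. The function names `promptHash` and `cacheKey` and the sample template text are assumptions made for the example; only the formula itself and the "first 16 chars" truncation come from this PR description.

```typescript
import { createHash } from "node:crypto";

// ASCII unit separator, as written in the key formula ('\x1f').
const UNIT_SEPARATOR = "\x1f";

// Hypothetical stand-in for the real summarization template.
const SAMPLE_PROMPT = "Summarize the following diff:\n{diff}";

// sha256 of the template, truncated to the first 16 hex characters.
function promptHash(template: string): string {
  return createHash("sha256").update(template).digest("hex").slice(0, 16);
}

// sha256(diff + '\x1f' + model + '\x1f' + promptHash), hex-encoded.
function cacheKey(diff: string, model: string, promptHashValue: string): string {
  return createHash("sha256")
    .update([diff, model, promptHashValue].join(UNIT_SEPARATOR))
    .digest("hex");
}

const exampleKey = cacheKey("+added line\n", "some-model", promptHash(SAMPLE_PROMPT));
```

The separator presumably exists to keep field boundaries distinct: without it, the pairs `("ab", "c")` and `("a", "bc")` would concatenate to the same string and collide.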

Storage

Path: `$XDG_CACHE_HOME/coco/diff-summaries/summaries.<sha1(repoPath, 16-char)>.json`
Per-repo isolation (no cross-repo pollution; mirrors the boot cache from feat(log-tui): per-repo disk cache of the last commit-log fetch (#808) #828).
500 entries hard cap, LRU eviction on overflow (oldest `lastAccessedAt` evicted first).
Best-effort: read failures fall back to "no cache" (LLM runs as before); write failures are swallowed silently. The cache is never load-bearing.
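
The path layout and eviction rule above might look roughly like the following. This is a sketch under stated assumptions: `cachePathFor`, `evictLru`, and the `CacheEntry` shape are hypothetical, the `~/.cache` fallback when `XDG_CACHE_HOME` is unset follows the XDG convention rather than anything stated here, and the repo hash uses truncated sha256 as in the follow-up commit on this PR.

```typescript
import { createHash } from "node:crypto";
import { homedir } from "node:os";
import { join } from "node:path";

const MAX_ENTRIES = 500; // hard cap from the PR description

// Hypothetical entry shape; only `lastAccessedAt` is named in the PR.
interface CacheEntry {
  summary: string;
  lastAccessedAt: number; // epoch milliseconds
}

// Resolve the per-repo cache file path.
function cachePathFor(
  repoPath: string,
  env: Record<string, string | undefined> = process.env,
): string {
  // Assumption: fall back to ~/.cache when XDG_CACHE_HOME is unset.
  const base = env.XDG_CACHE_HOME || join(homedir(), ".cache");
  const repoHash = createHash("sha256").update(repoPath).digest("hex").slice(0, 16);
  return join(base, "coco", "diff-summaries", `summaries.${repoHash}.json`);
}

// Keep the most recently accessed entries; the oldest `lastAccessedAt` goes first.
function evictLru(
  entries: Record<string, CacheEntry>,
  maxEntries: number = MAX_ENTRIES,
): Record<string, CacheEntry> {
  const newestFirst = Object.entries(entries).sort(
    ([, a], [, b]) => b.lastAccessedAt - a.lastAccessedAt,
  );
  return Object.fromEntries(newestFirst.slice(0, maxEntries));
}
```

Sorting the whole map on every overflow is O(n log n), which seems acceptable at a 500-entry cap.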

Opt-out

`COCO_NO_CACHE=1` disables both reads and writes for the entire run. Useful for benchmarking, debugging unexpected output, or forcing a full regeneration.
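
A gate of this kind could be as small as the sketch below. `cacheEnabled` is a hypothetical helper name, and treating only the exact value `1` as the opt-out is an assumption; the PR description does not say how other values are handled.

```typescript
// Hypothetical gate: callers would skip both cache reads and writes when this is false.
function cacheEnabled(
  env: Record<string, string | undefined> = process.env,
): boolean {
  // Assumption: only the literal value "1" disables the cache.
  return env.COCO_NO_CACHE !== "1";
}
```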

Bench numbers

Bench with `--repeat` against all fixtures (cold pass clears the cache first; warm pass runs again with full cache):

| Fixture | Cold (ms) | Warm (ms) | Reduction |
| --- | --- | --- | --- |
| tiny | 1 | 0 | (early-exit, no cache touched) |
| medium | 8506 | 4 | 99.95% |
| large | 14417 | 6 | 99.96% |
| feature-add | 13064 | 4 | 99.97% |
| refactor | 51116 | 6 | 99.99% |
| initial-commit | 16967 | 7 | 99.96% |
| docs-update | 24384 | 2 | 99.99% |
| monorepo | 88958 | 26 | 99.97% |
| dep-bump | 0 | 0 | (early-exit, no cache touched) |

The cold numbers match PR 4's baseline (the cache miss path is identical to the prior code path). The win is the warm column: same fixture, unchanged inputs, second run is essentially free.
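
For readers checking the Reduction column, the percentages are consistent with `(1 - warm / cold) * 100` rounded to two decimals. The helper below is a hypothetical sketch of that arithmetic, not part of the bench harness.

```typescript
// Percentage reduction in wall-clock time from the cold pass to the warm pass.
function reductionPercent(coldMs: number, warmMs: number): string {
  if (coldMs <= 0) {
    return "n/a"; // nothing to compare against when the cold pass took 0 ms
  }
  return ((1 - warmMs / coldMs) * 100).toFixed(2) + "%";
}

const mediumReduction = reductionPercent(8506, 4); // "99.95%"
const monorepoReduction = reductionPercent(88958, 26); // "99.97%"
```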

The realistic case for users is "edit one file → re-commit": the cache hit rate is at least (N-1)/N, where N is the number of files, so wall-clock time collapses to roughly one LLM call instead of N.
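
The "edit one file" case can be illustrated with a toy simulation. This is only a sketch: `countLlmCalls` is a hypothetical helper, and a `Set` keyed on the raw diff text stands in for the real hashed disk cache.

```typescript
// Count how many diffs would need an LLM call given what is already cached.
// Each miss is added to the cache, mimicking a write on success.
function countLlmCalls(diffs: string[], cache: Set<string>): number {
  let calls = 0;
  for (const diff of diffs) {
    if (!cache.has(diff)) {
      calls += 1;
      cache.add(diff);
    }
  }
  return calls;
}

const cache = new Set<string>();
const firstRun = ["diff-a", "diff-b", "diff-c", "diff-d", "diff-e"];
const coldCalls = countLlmCalls(firstRun, cache); // every file misses

// One file is edited, so only its diff text changes before the second run.
const secondRun = ["diff-a", "diff-b-edited", "diff-c", "diff-d", "diff-e"];
const warmCalls = countLlmCalls(secondRun, cache); // only the edited file misses
```

With five files, the second run hits on four of them, which is the (N-1)/N rate for N = 5.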

Test plan

npm run test (5134 passing, 0 regressions; +24 new)
npm run lint
npm run build — confirmed CLI builds cleanly
npm run build:schema — schema unchanged
node dist/index.js cache --help — command surfaces correctly
node dist/index.js cache info — shows real entries from a recent bench run (80 entries, 15.1 KB)
npm run bench -- --repeat — confirms warm wins across all fixtures

PR 5 of the #845 sprint. Adds a per-repo, content-hash-keyed disk cache for LLM-summarized diffs so re-runs of `coco commit` / `changelog` / `recap` / `review` skip the LLM entirely for files whose diffs haven't changed since the prior run.

Cache key: `sha256(diff + model + SUMMARIZE_PROMPT_HASH)`. Switching models or editing the summarization prompt invalidates entries automatically.

Storage lives at `$XDG_CACHE_HOME/coco/diff-summaries/summaries.<repo-hash>.json` with a 500-entry hard cap and LRU eviction. Best-effort throughout: read/write failures fall back to the LLM path, never load-bearing.

Wired into both `summarizeFileDiff` (pre-process pass) and `summarizeDirectoryDiff` (wave consolidation). New `coco cache` command exposes `clear` and `info` subcommands. `COCO_NO_CACHE=1` opts out for users who want a guaranteed fresh run.

Bench harness gains `--repeat` (cold pass + warm pass) and `--no-cache` flags. New baseline captures both passes.

Bench numbers (cold → warm wall-clock):
- medium: 8506ms → 4ms (99.95% faster)
- large: 14417ms → 6ms (99.96%)
- feature-add: 13064ms → 4ms (99.97%)
- refactor: 51116ms → 6ms (99.99%)
- initial-commit: 16967ms → 7ms (99.96%)
- docs-update: 24384ms → 2ms (99.99%)
- monorepo: 88958ms → 26ms (99.97%)

24 new tests (19 cache module + 5 cache command). Full suite: 5134 passing.

…6858 The repo-key is just a 16-char filename suffix; it never needed sha1 specifically. Switching to sha256 + truncate keeps the same behavior (deterministic short identifier, same length on disk) and clears the DevSkim weak-hash alert without an inline suppression.

github-advanced-security AI found potential problems May 6, 2026

Comment thread src/lib/parsers/default/utils/diffSummaryCache.ts Fixed

gfargo merged commit 32b4345 into main May 6, 2026
9 checks passed

gfargo deleted the feat/diff-summary-cache-845 branch May 6, 2026 13:12

gfargo merged 2 commits into main from feat/diff-summary-cache-845
