
feat(parser): content-hash diff-summary disk cache (#845) #858

Merged
gfargo merged 2 commits into main from feat/diff-summary-cache-845
May 6, 2026

@gfargo
Owner

@gfargo gfargo commented May 6, 2026

Summary

PR 5 of the #845 diff-condensing pipeline sprint. Adds a per-repo, content-hash-keyed disk cache so re-runs of coco commit / changelog / recap / review skip the LLM entirely for files whose diffs haven't changed since the prior run. Cache writes are best-effort and gated by an opt-out env var.

What changed

  • src/lib/parsers/default/utils/diffSummaryCache.ts — new module. JSON-on-disk cache with sha256 keys, 500-entry hard cap, LRU eviction, schema-versioned envelope. 19 unit tests cover key derivation, path resolution, round-trip, repo isolation, eviction, touch, clear, corrupt-file robustness.
  • src/lib/langchain/chains/summarize/prompt.ts — exports SUMMARIZE_PROMPT_HASH (sha256 of the template, first 16 chars). Cache keys mix this in so any prompt edit invalidates prior entries automatically.
  • summarizeFileDiff (in summarizeLargeFiles.ts) and summarizeDirectoryDiff (in summarizeDiffs.ts) — check the cache after the trivial-shape skip, return synthetic results on hit, write on success.
  • src/commands/cache/ — new coco cache <subcommand> command. clear removes the file; info reports entry count, on-disk size, total summary tokens, last-saved timestamp.
  • bin/benchmark.ts — new --repeat flag runs each fixture cold + warm so PRs can demonstrate cache wins. New --no-cache flag for reproducing pre-PR-5 numbers.
  • .bench/baseline.json — refreshed to capture both cold and warm passes.
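
The check-then-write flow described for summarizeFileDiff and summarizeDirectoryDiff can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `SummaryCache`, `createMemoryCache`, and `summarizeWithCache` are hypothetical names, and an in-memory map stands in for the JSON file on disk.

```typescript
// Hypothetical shapes: the real module's types are not shown in this PR.
type SummaryCache = {
  get(key: string): string | undefined;
  set(key: string, summary: string): void;
};

// Stand-in for the LLM summarization call.
type Summarizer = (diff: string) => Promise<string>;

// In-memory stand-in for the disk cache, for illustration only.
function createMemoryCache(): SummaryCache {
  const store = new Map<string, string>();
  return {
    get: (key) => store.get(key),
    set: (key, summary) => {
      store.set(key, summary);
    },
  };
}

// Read-through wrapper: return the cached summary on a hit,
// otherwise call the summarizer and store the result on success.
async function summarizeWithCache(
  diff: string,
  key: string,
  cache: SummaryCache,
  summarize: Summarizer,
): Promise<{ summary: string; fromCache: boolean }> {
  const cached = cache.get(key);
  if (cached !== undefined) {
    return { summary: cached, fromCache: true };
  }

  // A summarizer failure propagates to the caller; nothing is cached.
  const summary = await summarize(diff);

  try {
    cache.set(key, summary);
  } catch {
    // Best-effort write: a cache failure must not fail the run.
  }
  return { summary, fromCache: false };
}
```

On a second call with the same key, the summarizer is not invoked at all, which is where the warm-pass savings come from.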

Cache key

`sha256(diff + '\x1f' + model + '\x1f' + promptHash)`

The model and prompt hash are mixed into the key, so:

  • Switching providers/models invalidates entries (different models produce different summaries).
  • Editing SUMMARIZE_PROMPT invalidates entries (different prompt → different output, no manual bumps required).
  • Same diff text on the same model + prompt → cache hit, instant return.
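
A minimal sketch of this key derivation using Node's built-in `crypto` module. The function name `deriveCacheKey`, the placeholder template, and the choice of hex encoding for the 16-character prompt hash are assumptions for illustration; only the `sha256(diff + '\x1f' + model + '\x1f' + promptHash)` shape comes from this PR.

```typescript
import { createHash } from 'node:crypto';

// ASCII unit separator, used so field boundaries stay unambiguous.
const SEP = '\x1f';

// Placeholder template (assumption): the real SUMMARIZE_PROMPT lives in prompt.ts.
const SUMMARIZE_PROMPT = 'Summarize the following diff:\n{diff}';

// First 16 characters of the template's sha256 (hex encoding assumed).
const SUMMARIZE_PROMPT_HASH = createHash('sha256')
  .update(SUMMARIZE_PROMPT)
  .digest('hex')
  .slice(0, 16);

// Hypothetical helper: hashes diff, model, and prompt hash joined by the separator.
function deriveCacheKey(diff: string, model: string, promptHash: string): string {
  return createHash('sha256')
    .update([diff, model, promptHash].join(SEP))
    .digest('hex');
}
```

The separator matters: without it, the inputs `('ab', 'c')` and `('a', 'bc')` would concatenate to the same string and collide.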

Storage

  • Path: `$XDG_CACHE_HOME/coco/diff-summaries/summaries.<sha1(repoPath, 16-char)>.json`
  • Per-repo isolation (no cross-repo pollution; mirrors the boot cache from #828, "feat(log-tui): per-repo disk cache of the last commit-log fetch (#808)").
  • 500 entries hard cap, LRU eviction on overflow (oldest `lastAccessedAt` evicted first).
  • Best-effort: read failures fall back to "no cache" (LLM runs as before); write failures are swallowed silently. The cache is never load-bearing.
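
The path and eviction rules above could look something like the following sketch. `cachePathFor`, `evictLru`, and the `CacheEntry` shape are hypothetical names. The fallback to `~/.cache` when `XDG_CACHE_HOME` is unset is an assumption taken from the XDG Base Directory convention, not something this PR states. The repo hash here is sha256 truncated to 16 hex characters, following the follow-up commit on this PR that replaced sha1.

```typescript
import { createHash } from 'node:crypto';
import { homedir } from 'node:os';
import { join } from 'node:path';

// Assumed entry shape: only `lastAccessedAt` is named in the PR.
type CacheEntry = { summary: string; lastAccessedAt: number };

const MAX_ENTRIES = 500;

// Resolve the per-repo cache file. The ~/.cache fallback is an assumption.
function cachePathFor(
  repoPath: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const base = env.XDG_CACHE_HOME || join(homedir(), '.cache');
  const repoHash = createHash('sha256').update(repoPath).digest('hex').slice(0, 16);
  return join(base, 'coco', 'diff-summaries', `summaries.${repoHash}.json`);
}

// Keep the `cap` most recently accessed entries; the oldest
// `lastAccessedAt` values are dropped first.
function evictLru(
  entries: Record<string, CacheEntry>,
  cap: number = MAX_ENTRIES,
): Record<string, CacheEntry> {
  const newestFirst = Object.entries(entries).sort(
    ([, a], [, b]) => b.lastAccessedAt - a.lastAccessedAt,
  );
  const kept: Record<string, CacheEntry> = {};
  for (const [key, entry] of newestFirst.slice(0, cap)) {
    kept[key] = entry;
  }
  return kept;
}
```

Sorting the whole map on overflow is cheap at a 500-entry cap; a larger cache would probably want an ordered structure instead.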

Opt-out

`COCO_NO_CACHE=1` disables both reads and writes for the entire run. Useful for benchmarking, debugging unexpected output, or forcing a full regeneration.
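
One way the opt-out could gate both directions, combined with the best-effort behaviour, is sketched below. The helper names are hypothetical, and treating only the literal value `1` as "disabled" is an assumption; the PR does not say how other values are handled.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: only the exact value '1' disables the cache.
function isCacheDisabled(env: Env): boolean {
  return env.COCO_NO_CACHE === '1';
}

// Guarded read: returns undefined when opted out or when the read throws,
// so callers fall through to the LLM path.
function guardedRead(env: Env, read: () => string | undefined): string | undefined {
  if (isCacheDisabled(env)) return undefined;
  try {
    return read();
  } catch {
    return undefined;
  }
}

// Guarded write: skipped when opted out; failures are swallowed.
function guardedWrite(env: Env, write: () => void): void {
  if (isCacheDisabled(env)) return;
  try {
    write();
  } catch {
    // Best-effort: the cache is never load-bearing.
  }
}
```

In a real run the `env` argument would be `process.env`; passing it explicitly keeps the helpers easy to test.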

Bench numbers

Bench with `--repeat` against all fixtures (cold pass clears the cache first; warm pass runs again with full cache):

| Fixture | Cold (ms) | Warm (ms) | Reduction |
|---|---:|---:|---|
| tiny | 1 | 0 | (early-exit, no cache touched) |
| medium | 8506 | 4 | 99.95% |
| large | 14417 | 6 | 99.96% |
| feature-add | 13064 | 4 | 99.97% |
| refactor | 51116 | 6 | 99.99% |
| initial-commit | 16967 | 7 | 99.96% |
| docs-update | 24384 | 2 | 99.99% |
| dep-bump | 0 | 0 | (early-exit, no cache touched) |
| monorepo | 88958 | 26 | 99.97% |

The cold numbers match PR 4's baseline (the cache miss path is identical to the prior code path). The win is the warm column: same fixture, unchanged inputs, second run is essentially free.

The realistic case for users is "edit one file → re-commit": the cache hit rate is at least (N-1)/N, where N is the file count, so wall-clock time collapses to one LLM call instead of N.
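
For reference, the arithmetic behind the Reduction column and the hit-rate estimate can be written out as below. The function names are illustrative only; the inputs in the usage note are the cold and warm timings reported above.

```typescript
// Percentage of wall-clock time removed by the warm pass.
function reductionPct(coldMs: number, warmMs: number): number {
  return (1 - warmMs / coldMs) * 100;
}

// Expected hit rate when `changedFiles` of `fileCount` diffs are new
// and every other diff is already cached.
function hitRate(fileCount: number, changedFiles: number = 1): number {
  return (fileCount - changedFiles) / fileCount;
}
```

For the medium fixture, `reductionPct(8506, 4).toFixed(2)` gives `"99.95"`, and a 10-file commit with one edited file gives `hitRate(10)` of `0.9`.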

Test plan

  • npm run test (5134 passing, 0 regressions; +24 new)
  • npm run lint
  • npm run build — confirmed CLI builds cleanly
  • npm run build:schema — schema unchanged
  • node dist/index.js cache --help — command surfaces correctly
  • node dist/index.js cache info — shows real entries from a recent bench run (80 entries, 15.1 KB)
  • npm run bench -- --repeat — confirms warm wins across all fixtures

PR 5 of the #845 sprint. Adds a per-repo, content-hash-keyed disk
cache for LLM-summarized diffs so re-runs of `coco commit` /
`changelog` / `recap` / `review` skip the LLM entirely for files
whose diffs haven't changed since the prior run.

Cache key: `sha256(diff + model + SUMMARIZE_PROMPT_HASH)`. Switching
models or editing the summarization prompt invalidates entries
automatically. Storage lives at
`$XDG_CACHE_HOME/coco/diff-summaries/summaries.<repo-hash>.json`
with a 500-entry hard cap and LRU eviction. Best-effort throughout —
read/write failures fall back to the LLM path, never load-bearing.

Wired into both `summarizeFileDiff` (pre-process pass) and
`summarizeDirectoryDiff` (wave consolidation). New `coco cache`
command exposes `clear` and `info` subcommands. `COCO_NO_CACHE=1`
opts out for users who want a guaranteed fresh run.

Bench harness gains `--repeat` (cold pass + warm pass) and
`--no-cache` flags. New baseline captures both passes.

Bench numbers (cold → warm wall-clock):
- medium: 8506ms → 4ms (99.95% faster)
- large: 14417ms → 6ms (99.96%)
- feature-add: 13064ms → 4ms (99.97%)
- refactor: 51116ms → 6ms (99.99%)
- initial-commit: 16967ms → 7ms (99.96%)
- docs-update: 24384ms → 2ms (99.99%)
- monorepo: 88958ms → 26ms (99.97%)

24 new tests (19 cache module + 5 cache command). Full suite: 5134
passing.
Comment thread src/lib/parsers/default/utils/diffSummaryCache.ts Fixed
…6858

The repo-key is just a 16-char filename suffix; it never needed sha1
specifically. Switching to sha256 + truncate keeps the same behavior
(deterministic short identifier, same length on disk) and clears the
DevSkim weak-hash alert without an inline suppression.
@gfargo gfargo merged commit 32b4345 into main May 6, 2026
9 checks passed
@gfargo gfargo deleted the feat/diff-summary-cache-845 branch May 6, 2026 13:12