Summary
Add a client-side content transform mechanism, analogous to git's clean/smudge filters, so Lore can store and address a canonical form of a file while the working tree keeps its native form. The primary motivation is to restore fragment-level deduplication (and unlock the existing text-aware three-way merge) for compressed container formats, where content-defined chunking currently cannot help.
Motivation: compressed containers defeat content-defined chunking
Lore's fragment-level dedup via FastCDC is described as being as effective on a multi-gigabyte binary as on a kilobyte of text. That holds for raw content, but it breaks down for compressed containers. A small logical edit to a gzip or zip file changes almost the entire compressed byte stream from the edit point onward, so the downstream chunks no longer match any stored fragment, dedup collapses, and Lore re-stores nearly the whole file on each commit.
This affects a large, common category of files, not a niche one: gzip/zip wrappers, Office Open XML (.docx, .xlsx, .pptx), OpenDocument, many DAW project formats, and various packaged game assets. As a concrete example, an Ableton Live set (.als) is a single gzip-compressed XML document. Changing one MIDI note rewrites the whole compressed stream, so every save stores a fresh copy even though the logical delta is tiny.
If Lore stored the decompressed canonical form instead, FastCDC would see a small chunk delta per save and dedup as designed, and that canonical form would also be eligible for the three-way text merge Lore already ships.
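The effect can be reproduced outside Lore with a small experiment. The sketch below is illustrative only: `chunk` is a toy gear-hash chunker standing in for FastCDC (it is not Lore's implementation), the `<LiveSet>`/`<Note>` schema is made up rather than the real .als format, and SHA-256 is an assumed fragment identifier. It edits one note near the start of a document and measures how many bytes of the new version fall into chunks that were already stored, first for the raw XML and then for the gzip-compressed form.

```python
import gzip
import hashlib
import random

# Gear table for the toy rolling hash (seeded so the run is reproducible).
_table_rng = random.Random(1)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk(data: bytes, bits: int = 10, min_size: int = 256) -> list[bytes]:
    """Toy content-defined chunker: a simplified stand-in, NOT FastCDC itself."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # After 64 shifts older bytes fall out, so h depends on the last 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        # Cut when the top `bits` bits are zero (about once per 2**bits bytes).
        if i + 1 - start >= min_size and (h >> (64 - bits)) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reused_fraction(old: bytes, new: bytes) -> float:
    """Fraction of `new`'s bytes that sit in chunks already present in `old`."""
    known = {hashlib.sha256(c).digest() for c in chunk(old)}
    reused = sum(len(c) for c in chunk(new) if hashlib.sha256(c).digest() in known)
    return reused / len(new)


def make_set(notes) -> bytes:
    """Render notes as XML. Hypothetical schema, not the real Ableton format."""
    body = "".join(
        f'<Note id="{i}" pitch="{p}" velocity="{v}" time="{t:.3f}"/>\n'
        for i, (p, v, t) in enumerate(notes)
    )
    return f"<LiveSet>\n{body}</LiveSet>\n".encode()


rng = random.Random(0)
notes = [(rng.randrange(128), rng.randrange(128), rng.random() * 64)
         for _ in range(5000)]
before = make_set(notes)

# The "small logical edit": change the pitch of a single note.
edited = list(notes)
p, v, t = edited[10]
edited[10] = ((p + 1) % 128, v, t)
after = make_set(edited)

raw_reuse = reused_fraction(before, after)
gz_reuse = reused_fraction(gzip.compress(before, mtime=0),
                           gzip.compress(after, mtime=0))
print(f"raw XML reuse:  {raw_reuse:.1%}")
print(f"gzip reuse:     {gz_reuse:.1%}")
```

On the raw XML only the chunk around the edit should be new, while the compressed form is expected to reuse far less. Exact percentages depend on the zlib version and the chunker parameters, so none are quoted here.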
Why this fits Lore
- It makes the headline dedup feature deliver for an entire file category, rather than treating compressed files as a degraded case.
- The server and immutable store stay fully content-agnostic. The transform is purely client-side, between the working tree and staging, so content-addressing, fragments, and the storage protocol are unchanged. Version control remains one consumer of opaque-byte storage.
- There is precedent for declarative per-path rules at the working-tree boundary: .loreignore as an outbound filter and .lore/view as an inbound filter. A transform declaration is a natural sibling.
Sketch of proposed behavior
- A declaration mapping path patterns to a named transform driver, in the spirit of .gitattributes (for example, *.als filter=gzip in a .loreattributes file).
- A clean transform (working tree to store) and a smudge transform (store to working tree).
- A built-in gzip driver would cover .als, .adg, .adv, and many others. A pluggable external-program interface would allow custom canonicalizers, for example pretty-printing or stable attribute ordering of XML for cleaner diffs.
- Dirtiness detection compares canonical forms (re-run clean), not raw working-tree bytes, since recompression is not guaranteed to be byte-identical to the original.
Alternatives considered
- Transform with an external tool before staging (what I do today). Drawbacks: the working tree and the repository disagree on file contents, the transform is not atomic with staging, and if the compressed form is what gets committed, dedup still suffers.
- .loreignore can exclude but not transform.
- Committing a manually decompressed file loses the native file the application opens.
Native support resolves all three.
Open questions and design considerations
- Round-trip determinism and the dirtiness model (git solves this by re-running clean and comparing).
- Security of filter execution when running external programs from repository-provided config in untrusted repositories (git has useful prior art, and prior CVEs, here).
- Whether this should build on the client-side hooks already discussed for the roadmap, or be a first-class filter mechanism, and how it relates to the pluggable per-content-type resolvers mentioned for merge.
- Interaction with locking and with binary conflict resolution.
Process
I searched open and closed issues and did not find an existing request for this. Per CONTRIBUTING, this likely touches public APIs and config and may warrant a Lore Enhancement Proposal. I am opening it as a feature request first to gauge interest and direction, and I am happy to move it to #feature-requests or draft a LEP if that is preferred.
Context: I am prototyping version control for Ableton Live projects on top of Lore, and this is the one rough edge where Lore's storage strengths are not currently reachable for the project's core file type.