From ad8eb0c583694070cb11b8bdcc34a880c33607ed Mon Sep 17 00:00:00 2001
From: shikokuchuo <53399081+shikokuchuo@users.noreply.github.com>
Date: Thu, 21 May 2026 10:24:46 +0100
Subject: [PATCH] docs(attribution): document UTF-16 / UTF-8 encoding boundary

Adds a contract doc in claude-notes/designs/ plus file-header /
struct-doc anchors on the two load-bearing AttributionRun types.
Frames the dual encoding as inherited from each side's substrate
(tree-sitter + SourceInfo on the Rust side, JS string primitives +
Monaco + Automerge on the JS side) rather than a stylistic choice,
so it isn't "harmonized" later.
---
 .../designs/attribution-encoding-contract.md | 129 ++++++++++++++++++
 crates/quarto-core/src/attribution/types.rs  |   7 +
 hub-client/src/services/attribution-runs.ts  |   4 +
 3 files changed, 140 insertions(+)
 create mode 100644 claude-notes/designs/attribution-encoding-contract.md

diff --git a/claude-notes/designs/attribution-encoding-contract.md b/claude-notes/designs/attribution-encoding-contract.md
new file mode 100644
index 000000000..5ad71867e
--- /dev/null
+++ b/claude-notes/designs/attribution-encoding-contract.md
@@ -0,0 +1,129 @@
+# Attribution encoding contract
+
+**Status:** Active (Phase 5 of the attribution pipeline, see
+`claude-notes/plans/2026-05-06-attribution-pipeline.md`).
+**Conversion site:** `buildAttributionPayload` in
+`hub-client/src/hooks/useAttribution.ts`.
+**Key types:** `AttributionRun` (TS) in
+`hub-client/src/services/attribution-runs.ts`, `AttributionRun` (Rust)
+in `crates/quarto-core/src/attribution/types.rs`.
+
+## Summary
+
+Attribution intentionally uses **two coordinate spaces** for run
+boundaries, joined at exactly one site. The JS side speaks UTF-16
+code units because that is what Automerge text patches, JS string
+indexing, and Monaco editor offsets all natively produce. The Rust
+side speaks UTF-8 byte offsets because that is what `&str` indexing
+and `SourceInfo` ranges use. The conversion happens at the WASM
+wire, immediately before the JSON payload is shipped to the Rust
+pipeline.
+
+This is not a bug to be "harmonized." Both sides are correct for
+their own domain; collapsing to a single encoding would force one
+side to fight its primitives on every offset.
+
+## The two spaces
+
+| Side | Space | Why |
+|---|---|---|
+| Automerge (Rust crate, compiled with `utf16-indexing`) | UTF-16 code units | Crate default; verified at `crates/quarto-hub/src/automerge_api_tests.rs::test_text_encoding_is_utf16` |
+| Automerge (JS via `@automerge/automerge`) | UTF-16 code units | Native to JS strings; `patch.value.length` and `patch.length` on `diff()` output are UTF-16 |
+| Monaco editor (presence cursors) | UTF-16 code units | Native to `model.getOffsetAt` / `model.getPositionAt`; passes through `A.getCursorPosition` unchanged |
+| Run-list replay (`attribution-runs.ts`) | UTF-16 code units | Direct passthrough of Automerge patches |
+| Rust attribution (`AttributionRun`, `query_byte_range`, `GitBlameProvider`) | UTF-8 byte offsets | Aligns with `&str` indexing and the rest of `SourceInfo` |
+
+## Single conversion site
+
+`buildAttributionPayload(state, sourceText, identities)` in
+`hub-client/src/hooks/useAttribution.ts` calls
+`buildCharToByteMap(text)` (iterates `text.charCodeAt(i)` with
+explicit surrogate-pair handling, returning a `Uint32Array` of byte
+offsets keyed by UTF-16 index), then `runsCharToByteOffsets(runs, map)`
+to translate each run's `start`/`end` from char offsets to byte
+offsets before `JSON.stringify`.
+
+All Automerge-driven JS code upstream of `buildAttributionPayload`
+stays in UTF-16. All Rust code downstream of the JSON wire stays in
+UTF-8 bytes. The wire format itself carries bytes.
+
+## Surrogate-pair handling
+
+`buildCharToByteMap` walks UTF-16 code units, not code points. A
+surrogate pair (e.g. an emoji or non-BMP CJK character) occupies two
+consecutive UTF-16 indices, both of which receive a map entry: the
+high-surrogate index maps to the byte offset *before* the 4-byte
+UTF-8 sequence, the low-surrogate index to the offset *after*.
+Automerge does not emit splice positions that land mid-surrogate, so
+a run boundary on the low-surrogate index should not occur in
+practice; if it does, the map keeps the translation well-defined.
+
+Test coverage: `hub-client/src/services/attribution-runs.test.ts`
+exercises ASCII (identity), 2-byte (Latin-1 supplement), 3-byte
+(CJK), and 4-byte (surrogate-pair) cases.
+
+## Failure modes if "harmonized"
+
+The encoding split is not a stylistic preference on either side.
+Each side's encoding is forced by the substrate immediately beneath
+it; collapsing to a single encoding doesn't simplify the design, it
+relocates the translation cost to a worse place.
+
+**The encodings are inherited, not chosen.** On the Rust side, every
+caller of `query_byte_range` already has `start` and `end` in byte
+form because the layers below produce byte offsets natively:
+tree-sitter returns node positions as bytes (`Range.start_byte` /
+`end_byte`), `SourceInfo` carries those bytes through every AST
+transform, `&str` indexing requires bytes, and `GitBlameProvider`
+consumes line-based blame output which is byte-positional. On the
+JS side, `string[i]`, `charCodeAt(i)`, `string.length`, Automerge
+text splice positions, Monaco's `getOffsetAt`, and the DOM Selection
+API all return UTF-16 code units natively. Attribution is the
+*consumer* of coordinate systems chosen many layers below it on
+both sides: there is no UTF-16 plane to ask tree-sitter for, and
+no UTF-8 plane to ask Monaco for.
+
+**Force UTF-8 on the JS side.** Every Automerge patch position,
+every Monaco cursor, and every `string[i]` would need byte
+translation per use. Cost moves from one conversion per payload
+(debounced at ~500 ms) to one conversion per editor interaction,
+many orders of magnitude more work, in the editor hot path rather
+than the producer's cold debounced path.
+
+**Force UTF-16 on the Rust side, option (a): translate at every
+query.** Every `query_byte_range` call becomes a
+`byte_range → utf16_range → query` chain. For a document with
+thousands of AST nodes, every render triggers thousands of
+conversions. Translation cost moves from O(payloads) to
+O(AST nodes × renders), into the rendering hot path.
+
+**Force UTF-16 on the Rust side, option (b): translate the
+substrate.** Rewrite tree-sitter integration, `SourceInfo`, every
+AST transform, citeproc, link rewriting, diagnostics, and
+serialization to track UTF-16 code units rather than bytes. Massive
+ripple for no win: `&str` still indexes by bytes, so a byte ↔
+code-unit map would have to travel alongside every range anyway.
+
+The current design picks the WASM wire as the conversion point
+because it is the smallest, coldest, most auditable surface: one
+function, debounced, off the rendering hot path.
+
+## Soft floor on async desync
+
+`runsCharToByteOffsets` uses `charToByte[r.start] ?? r.start` rather
+than asserting in-bounds. This is deliberate: between the moment a
+run list is computed (via `buildRunListAttribution` /
+`updateRunListAttribution`) and the moment `buildAttributionPayload`
+reads `sourceTextRef.current`, the document can receive a remote
+Automerge change. A deletion-shaped race produces runs whose `end`
+exceeds the new `sourceText.length`. The `??` falls back to the raw
+UTF-16 offset for that one frame; the next debounced update heals
+it. Converting to a hard assertion would null the payload on benign
+races for no correctness gain.
+
+## Related
+
+- `claude-notes/plans/2026-05-06-attribution-pipeline.md`: original Phase 5 plan
+- `crates/quarto-hub/src/automerge_api_tests.rs`: UTF-16 feature check + emoji splice tests
+- `crates/quarto-core/src/attribution/types.rs::AttributionRun`: Rust-side byte-offset types
+- `crates/quarto-lsp-core/src/types.rs::Position`: a separate UTF-16 surface (LSP spec); not connected to attribution
diff --git a/crates/quarto-core/src/attribution/types.rs b/crates/quarto-core/src/attribution/types.rs
index d57572316..672b7193c 100644
--- a/crates/quarto-core/src/attribution/types.rs
+++ b/crates/quarto-core/src/attribution/types.rs
@@ -21,6 +21,13 @@ use crate::format::{Format, FormatIdentifier};
 /// A contiguous byte-range run attributed to a single author at a
 /// single point in time.
 ///
+/// `start` and `end` are UTF-8 **byte** offsets into the source text,
+/// deliberately distinct from the UTF-16 code units used on the JS
+/// side (`hub-client/src/services/attribution-runs.ts`) and by
+/// Automerge text splice positions. Conversion happens once, at the
+/// WASM wire, inside `buildAttributionPayload`. See
+/// `claude-notes/designs/attribution-encoding-contract.md`.
+///
 /// `actor` is `Arc` (not `String`) so the same Arc is shared
 /// across every run by the same author. For a doc with 5
 /// contributors and 1000 runs this is 5 string allocations + 1000
diff --git a/hub-client/src/services/attribution-runs.ts b/hub-client/src/services/attribution-runs.ts
index 6e96783fd..3c4086cb1 100644
--- a/hub-client/src/services/attribution-runs.ts
+++ b/hub-client/src/services/attribution-runs.ts
@@ -15,6 +15,10 @@
  * Algorithm reference (and known-good baseline): the prototype branch
  * `feat/node-attribution` carries this file along with the consumer-side
  * surface and the `attribution-runs.test.ts` invariant suite.
+ *
+ * See `claude-notes/designs/attribution-encoding-contract.md` for the
+ * full statement of the UTF-16 / UTF-8 boundary and why both sides are
+ * correct in their own coordinate space.
  */
 
 import { diff } from '@automerge/automerge';
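
For readers of this patch, the conversion described in the contract document (a UTF-16 index to UTF-8 byte map, followed by per-run translation with the `??` soft floor) could look roughly like the sketch below. The function names `buildCharToByteMap` and `runsCharToByteOffsets` come from the document, but the bodies are a reconstruction and not the code in `useAttribution.ts`. The `Run` shape (in particular the `actor` field) and the trailing map entry for `text.length` are assumptions made for illustration.

```typescript
// Hypothetical run shape: only `start` and `end` are described in the
// contract document; `actor` is an illustrative placeholder.
interface Run {
  start: number;
  end: number;
  actor: string;
}

// Map every UTF-16 index to the UTF-8 byte offset at which that code unit
// begins. Assumption: one extra entry at index text.length holds the total
// byte length, so an exclusive `end` at the end of the text still resolves.
function buildCharToByteMap(text: string): Uint32Array {
  const map = new Uint32Array(text.length + 1);
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    map[i] = bytes;
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (isHigh && next >= 0xdc00 && next <= 0xdfff) {
      // Surrogate pair: one 4-byte UTF-8 sequence. The high surrogate keeps
      // the offset before the sequence, the low surrogate gets the offset after.
      bytes += 4;
      map[i + 1] = bytes;
      i++; // skip the low surrogate, which is already mapped
    } else if (unit < 0x80) {
      bytes += 1;
    } else if (unit < 0x800) {
      bytes += 2;
    } else {
      bytes += 3; // rest of the BMP (a lone surrogate is also counted as 3 bytes)
    }
  }
  map[text.length] = bytes;
  return map;
}

// Translate run boundaries from UTF-16 indices to byte offsets. An index
// outside the map reads as `undefined`, so `??` falls back to the raw
// UTF-16 offset instead of throwing (the "soft floor").
function runsCharToByteOffsets(runs: Run[], charToByte: Uint32Array): Run[] {
  return runs.map((r) => ({
    ...r,
    start: charToByte[r.start] ?? r.start,
    end: charToByte[r.end] ?? r.end,
  }));
}

// "a" (1 byte), "é" (2), "日" (3), "😀" (4, two UTF-16 units), "b" (1)
const sampleText = "aé日😀b";
const sampleMap = buildCharToByteMap(sampleText);
// Array.from(sampleMap) → [0, 1, 3, 6, 10, 10, 11]

const payloadRuns = runsCharToByteOffsets(
  [{ start: 1, end: 5, actor: "alice" }],
  sampleMap,
);
// payloadRuns[0] → { start: 1, end: 10, actor: "alice" }

// Stale run after a remote deletion shrank the text to "ab":
// `end` is out of range, so it passes through unchanged for this frame.
const staleRuns = runsCharToByteOffsets(
  [{ start: 1, end: 5, actor: "alice" }],
  buildCharToByteMap("ab"),
);
// staleRuns[0] → { start: 1, end: 5, actor: "alice" }
```

In this sketch the stale run keeps its raw `end` of 5 and is expected to be corrected by the next debounced update, which matches the behaviour the "Soft floor on async desync" section describes.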