From ad8eb0c583694070cb11b8bdcc34a880c33607ed Mon Sep 17 00:00:00 2001
From: shikokuchuo <53399081+shikokuchuo@users.noreply.github.com>
Date: Thu, 21 May 2026 10:24:46 +0100
Subject: [PATCH] docs(attribution): document UTF-16 / UTF-8 encoding boundary

Adds a contract doc in claude-notes/designs/ plus file-header /
struct-doc anchors on the two load-bearing AttributionRun types.
Frames the dual encoding as inherited from each side's substrate
(tree-sitter + SourceInfo on the Rust side, JS string primitives +
Monaco + Automerge on the JS side) rather than a stylistic choice,
so it isn't "harmonized" later.
---
 .../designs/attribution-encoding-contract.md  | 129 ++++++++++++++++++
 crates/quarto-core/src/attribution/types.rs   |   7 +
 hub-client/src/services/attribution-runs.ts   |   4 +
 3 files changed, 140 insertions(+)
 create mode 100644 claude-notes/designs/attribution-encoding-contract.md

diff --git a/claude-notes/designs/attribution-encoding-contract.md b/claude-notes/designs/attribution-encoding-contract.md
new file mode 100644
index 000000000..5ad71867e
--- /dev/null
+++ b/claude-notes/designs/attribution-encoding-contract.md
@@ -0,0 +1,129 @@
+# Attribution encoding contract
+
+**Status:** Active (Phase 5 of the attribution pipeline, see
+`claude-notes/plans/2026-05-06-attribution-pipeline.md`).
+**Conversion site:** `buildAttributionPayload` in
+`hub-client/src/hooks/useAttribution.ts`.
+**Key types:** `AttributionRun` (TS) in
+`hub-client/src/services/attribution-runs.ts`, `AttributionRun` (Rust)
+in `crates/quarto-core/src/attribution/types.rs`.
+
+## Summary
+
+Attribution intentionally uses **two coordinate spaces** for run
+boundaries, joined at exactly one site. The JS side speaks UTF-16
+code units because that is what Automerge text patches, JS string
+indexing, and Monaco editor offsets all natively produce. The Rust
+side speaks UTF-8 byte offsets because that is what `&str` indexing
+and `SourceInfo` ranges use. The conversion happens at the WASM
+wire, immediately before the JSON payload is shipped to the Rust
+pipeline.
+
+This is not a bug to be "harmonized." Both sides are correct for
+their own domain; collapsing to a single encoding would force one
+side to fight its primitives on every offset.
+
+## The two spaces
+
+| Side | Space | Why |
+|---|---|---|
+| Automerge (Rust crate, compiled with `utf16-indexing`) | UTF-16 code units | Selected by the `utf16-indexing` feature; verified at `crates/quarto-hub/src/automerge_api_tests.rs::test_text_encoding_is_utf16` |
+| Automerge (JS via `@automerge/automerge`) | UTF-16 code units | Native to JS strings; `patch.value.length` and `patch.length` on `diff()` output are UTF-16 |
+| Monaco editor (presence cursors) | UTF-16 code units | Native to `model.getOffsetAt` / `model.getPositionAt`; passes through `A.getCursorPosition` unchanged |
+| Run-list replay (`attribution-runs.ts`) | UTF-16 code units | Direct passthrough of Automerge patches |
+| Rust attribution (`AttributionRun`, `query_byte_range`, `GitBlameProvider`) | UTF-8 byte offsets | Aligns with `&str` indexing and the rest of `SourceInfo` |
+
+## Single conversion site
+
+`buildAttributionPayload(state, sourceText, identities)` in
+`hub-client/src/hooks/useAttribution.ts` first calls
+`buildCharToByteMap(text)`, which iterates `text.charCodeAt(i)` with
+explicit surrogate-pair handling and returns a `Uint32Array` of UTF-8
+byte offsets keyed by UTF-16 index. It then calls
+`runsCharToByteOffsets(runs, map)` to translate each run's `start`/`end`
+from UTF-16 code-unit offsets to byte offsets before `JSON.stringify`.
+
+All Automerge-driven JS code upstream of `buildAttributionPayload`
+stays in UTF-16. All Rust code downstream of the JSON wire stays in
+UTF-8 bytes. The wire format itself carries bytes.
+
+## Surrogate-pair handling
+
+`buildCharToByteMap` walks UTF-16 code units, not code points. A
+surrogate pair (e.g. an emoji or non-BMP CJK character) occupies two
+consecutive UTF-16 indices, both of which receive a map entry: the
+high-surrogate index maps to the byte offset *before* the 4-byte
+UTF-8 sequence, the low-surrogate index to the offset *after*.
+Automerge does not emit splice positions that land mid-surrogate, so
+a run boundary on the low-surrogate index should not occur in
+practice; if it does, the map keeps the translation well-defined.
+
+Test coverage: `hub-client/src/services/attribution-runs.test.ts`
+exercises ASCII (identity), 2-byte (Latin-1 supplement), 3-byte
+(CJK), and 4-byte (surrogate-pair) cases.
+
+## Failure modes if "harmonized"
+
+The encoding split is not a stylistic preference on either side.
+Each side's encoding is forced by the substrate immediately beneath
+it; collapsing to a single encoding doesn't simplify the design, it
+relocates the translation cost to a worse place.
+
+**The encodings are inherited, not chosen.** On the Rust side, every
+caller of `query_byte_range` already has `start` and `end` in byte
+form because the layers below produce byte offsets natively:
+tree-sitter returns node positions as bytes (`Range.start_byte` /
+`end_byte`), `SourceInfo` carries those bytes through every AST
+transform, `&str` indexing requires bytes, and `GitBlameProvider`
+consumes line-based blame output which is byte-positional. On the
+JS side, `string[i]`, `charCodeAt(i)`, `string.length`, Automerge
+text splice positions, Monaco's `getOffsetAt`, and the DOM Selection
+API all index or count in UTF-16 code units. Attribution is the
+*consumer* of coordinate systems chosen many layers below it on
+both sides: there is no UTF-16 view to ask tree-sitter for, and
+no UTF-8 view to ask Monaco for.
+
+**Force UTF-8 on the JS side.** Every Automerge patch position,
+every Monaco cursor, and every `string[i]` would need byte
+translation per use. Cost moves from one conversion per payload
+(debounced at ~500 ms) to one conversion per editor interaction,
+which is far more frequent and lands in the editor hot path rather
+than the producer's cold debounced path.
+
+**Force UTF-16 on the Rust side, option (a) — translate at every
+query.** Every `query_byte_range` call becomes a
+`byte_range → utf16_range → query` chain. For a document with
+thousands of AST nodes, every render triggers thousands of
+conversions. Translation cost moves from O(payloads) to
+O(AST nodes × renders), into the rendering hot path.
+
+**Force UTF-16 on the Rust side, option (b) — translate the
+substrate.** Rewrite tree-sitter integration, `SourceInfo`, every
+AST transform, citeproc, link rewriting, diagnostics, and
+serialization to track UTF-16 code units rather than bytes. The
+change ripples widely and gains nothing: `&str` still indexes by bytes,
+so a byte ↔ code-unit map would have to travel alongside every range.
+
+The current design picks the WASM wire as the conversion point
+because it is the smallest, coldest, most auditable surface: one
+function, debounced, off the rendering hot path.
+
+## Soft floor on async desync
+
+`runsCharToByteOffsets` uses `charToByte[r.start] ?? r.start` rather
+than asserting in-bounds. This is deliberate: between the moment a
+run list is computed (via `buildRunListAttribution` /
+`updateRunListAttribution`) and the moment `buildAttributionPayload`
+reads `sourceTextRef.current`, the document can receive a remote
+Automerge change. A deletion-shaped race produces runs whose `end`
+exceeds the new `sourceText.length`, so the map lookup is out of
+range and yields `undefined`. The `??` then falls back to the raw
+UTF-16 offset for that one frame; the next debounced update heals it.
+A hard assertion would null the payload for no correctness gain.
+
+## Related
+
+- `claude-notes/plans/2026-05-06-attribution-pipeline.md` — original Phase 5 plan
+- `crates/quarto-hub/src/automerge_api_tests.rs` — UTF-16 feature check + emoji splice tests
+- `crates/quarto-core/src/attribution/types.rs::AttributionRun` — Rust-side byte-offset types
+- `crates/quarto-lsp-core/src/types.rs::Position` — a separate UTF-16 surface (LSP spec); not connected to attribution
diff --git a/crates/quarto-core/src/attribution/types.rs b/crates/quarto-core/src/attribution/types.rs
index d57572316..672b7193c 100644
--- a/crates/quarto-core/src/attribution/types.rs
+++ b/crates/quarto-core/src/attribution/types.rs
@@ -21,6 +21,13 @@ use crate::format::{Format, FormatIdentifier};
 /// A contiguous byte-range run attributed to a single author at a
 /// single point in time.
 ///
+/// `start` and `end` are UTF-8 **byte** offsets into the source text,
+/// deliberately distinct from the UTF-16 code units used on the JS
+/// side (`hub-client/src/services/attribution-runs.ts`) and by
+/// Automerge text splice positions. Conversion happens once, at the
+/// WASM wire, inside `buildAttributionPayload`. See
+/// `claude-notes/designs/attribution-encoding-contract.md`.
+///
 /// `actor` is `Arc<str>` (not `String`) so the same Arc is shared
 /// across every run by the same author. For a doc with 5
 /// contributors and 1000 runs this is 5 string allocations + 1000
diff --git a/hub-client/src/services/attribution-runs.ts b/hub-client/src/services/attribution-runs.ts
index 6e96783fd..3c4086cb1 100644
--- a/hub-client/src/services/attribution-runs.ts
+++ b/hub-client/src/services/attribution-runs.ts
@@ -15,6 +15,10 @@
  * Algorithm reference (and known-good baseline): the prototype branch
  * `feat/node-attribution` carries this file along with the consumer-side
  * surface and the `attribution-runs.test.ts` invariant suite.
+ *
+ * See `claude-notes/designs/attribution-encoding-contract.md` for the
+ * full statement of the UTF-16 / UTF-8 boundary and why both sides are
+ * correct in their own coordinate space.
  */
 
 import { diff } from '@automerge/automerge';