Affected spec section
/standard/system-profiles/ (locator validation contract) and /standard/identifier-syntax/ § Deterministic identity (which currently leaves canonical form unspecified at the profile level).
Motivation
The deterministic UUID seed uses the exact normalized locator bytes. If a citation-system profile permits multiple regex-valid spellings of the same passage, each spelling mints a distinct permanent identity — a hard identity split for the same conceptual reference, with no way to merge after publication.
Examples that today's profiles silently allow:
- Bekker: 514a1 vs 514a01 (line leading zero) vs 0514a1 (page leading zero). Same line of Aristotle, three permanent IDs.
- Bible (current bible-book-chapter-verse): John.3.16, john.3.16, JOHN.3.16. Three permanent IDs for John 3:16.
The current editorial rule says references must represent attested reference points, but that rule is not machine-checkable and does not prevent two contributors from minting different spellings of the same point.
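The identity split can be illustrated with a minimal sketch. It assumes a name-based UUID (UUIDv5) derived from the locator string; the namespace value, the `mint` helper, and the `system:locator` seed layout are hypothetical stand-ins, not the spec's actual seed construction.

```python
import uuid

# Hypothetical namespace for illustration only; the real seed layout is
# defined by the identifier-syntax spec, not by this sketch.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/textrefs-sketch")


def mint(system: str, locator: str) -> uuid.UUID:
    """Derive a name-based UUID from the exact locator string, with no folding."""
    return uuid.uuid5(NAMESPACE, f"{system}:{locator}")


# Three regex-valid spellings of the same Bekker line.
spellings = ["514a1", "514a01", "0514a1"]
ids = {s: mint("bekker", s) for s in spellings}

for spelling, identifier in ids.items():
    print(spelling, identifier)

# Each spelling hashes to a different value, so each mints its own identity.
print(len(set(ids.values())), "distinct IDs")
```

Because the hash input is the literal string, nothing downstream can tell that the three values refer to the same line.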
Proposed change
Require every citation-system profile to declare and enforce canonical forms along two axes:
- ASCII digit sequences. Profiles MUST state whether leading zeros are permitted in numeric components. Default: forbidden unless the citation tradition uses them.
- ASCII letter case. Profiles MUST state the canonical case for letter components. Default: case-sensitive match against the profile's declared casing.
Enforcement MUST be by locator_regex rejection, not by silent folding/normalization at minting time. Rejection at the validation boundary avoids normalization_version churn — folding would force a profile version bump on every refinement, and would invalidate already-minted IDs.
Scope is intentionally narrow:
- ASCII digits only — does not touch decimal separators, thousands separators, or non-ASCII digits.
- ASCII case only — does not propose Unicode case folding, NFC/NFKC normalization choices, or whitespace handling.
The two concrete bugs that motivated this (textrefs/registry#2 Bekker leading-zero forms, textrefs/registry#3 Bible case variants) become single-line constraints in the per-profile regex once this is normative.
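As a rough sketch of what such single-line constraints might look like, the patterns below forbid leading zeros and pin letter case. Both patterns and the `validate` helper are illustrative assumptions, not the registry's actual locator_regex values, and the Bible pattern assumes a capitalized-first-letter casing purely for the example.

```python
import re

# Illustrative patterns only. [0-9] is spelled out because \d in Python
# also matches non-ASCII digits, which is outside this proposal's scope.
BEKKER_STRICT = re.compile(r"[1-9][0-9]*[ab][1-9][0-9]*")
BIBLE_STRICT = re.compile(r"[1-3]?[A-Z][a-z]+\.[1-9][0-9]*\.[1-9][0-9]*")


def is_canonical(pattern: re.Pattern, locator: str) -> bool:
    """True only if the whole locator matches; no case-insensitive flag is used."""
    return pattern.fullmatch(locator) is not None


def validate(pattern: re.Pattern, locator: str) -> str:
    """Return the locator unchanged, or reject it. It is never folded."""
    if not is_canonical(pattern, locator):
        raise ValueError(f"non-canonical locator rejected: {locator!r}")
    return locator


for candidate in ["514a1", "514a01", "0514a1"]:
    print(candidate, is_canonical(BEKKER_STRICT, candidate))

for candidate in ["John.3.16", "john.3.16", "JOHN.3.16"]:
    print(candidate, is_canonical(BIBLE_STRICT, candidate))
```

Since `validate` returns its input byte-for-byte or raises, the string that gets hashed is always the string the contributor authored.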
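Alternatives considered
- Silent folding/normalization at minting time instead of regex rejection. Rejected: it forces normalization_version bumps for any refinement, and means the bytes hashed don't match the locator authored, which is surprising for anyone recomputing IDs from the published record.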
Compatibility impact
breaking — major version bump required
(Profiles change shape and become stricter. Existing minted references must be re-checked against the new constraints; the review found no live data that would break, but this needs a CI sweep before the bump.)
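One possible shape for that sweep, as a minimal sketch: re-run every already-minted locator through the stricter pattern and report the ones it would reject. The `sweep` helper, the sample locators, and the Bekker pattern are hypothetical; a real check would read locators from the registry data.

```python
import re
from typing import Iterable


def sweep(pattern: re.Pattern, locators: Iterable[str]) -> list[str]:
    """Return the locators that the stricter pattern would reject."""
    return [loc for loc in locators if pattern.fullmatch(loc) is None]


# Hypothetical stricter Bekker pattern and sample data, for illustration only.
strict = re.compile(r"[1-9][0-9]*[ab][1-9][0-9]*")
minted = ["514a1", "1094a1", "514a01"]

failures = sweep(strict, minted)
if failures:
    print("would break under the new constraints:", failures)
else:
    print("no minted locators affected")
```

A CI job could fail the build whenever the returned list is non-empty, which would turn the "no live data would break" finding into a repeatable check.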
Target spec version
v0.2.0-draft
Related