standard: require canonical ASCII digit and case forms in citation-system profiles #13

Description

@maehr

Affected spec section

/standard/system-profiles/ (locator validation contract) and /standard/identifier-syntax/ § Deterministic identity (which currently leaves canonical form unspecified at the profile level).

Motivation

The deterministic UUID seed uses the exact normalized locator bytes. If a citation-system profile permits multiple regex-valid spellings of the same passage, each spelling mints a distinct permanent identity — a hard identity split for the same conceptual reference, with no way to merge after publication.

Examples that today's profiles silently allow:

  • Bekker: 514a1, 514a01 (leading zero on the line number), 0514a1 (leading zero on the page number): the same line of Aristotle, three permanent IDs.
  • Bible (current bible-book-chapter-verse): John.3.16, john.3.16, JOHN.3.16 — three permanent IDs for John 3:16.
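
To make the identity split concrete, here is a minimal sketch. It assumes a name-based UUID (`uuid.uuid5`) over the exact locator string; the namespace value, the `system:locator` seed layout, and the `mint_id` helper are all hypothetical, since the spec's actual seed construction is not quoted in this issue.

```python
import uuid

# Hypothetical namespace, for illustration only; not the spec's real one.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/textrefs")


def mint_id(system: str, locator: str) -> uuid.UUID:
    """Derive a deterministic UUID from the exact locator spelling.

    The "system:locator" seed layout is an assumption made for this sketch.
    """
    return uuid.uuid5(EXAMPLE_NAMESPACE, f"{system}:{locator}")


# Three spellings of the same Bekker line, three spellings of John 3:16.
bekker_ids = {s: mint_id("bekker", s) for s in ("514a1", "514a01", "0514a1")}
bible_ids = {
    s: mint_id("bible-book-chapter-verse", s)
    for s in ("John.3.16", "john.3.16", "JOHN.3.16")
}

for spelling, minted in {**bekker_ids, **bible_ids}.items():
    print(f"{spelling:>10} -> {minted}")
```

Any byte-exact seeding scheme behaves the same way: one differing character in the locator produces an unrelated identifier, and nothing in the identifier itself reveals that the spellings refer to the same passage.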

The current editorial rule says references must represent attested reference points, but that rule is not machine-checkable and does not prevent two contributors from minting different spellings of the same point.

Proposed change

Require every citation-system profile to declare and enforce canonical forms along two axes:

  1. ASCII digit sequences. Profiles MUST state whether leading zeros are permitted in numeric components. Default: forbidden unless the citation tradition uses them.
  2. ASCII letter case. Profiles MUST state the canonical case for letter components. Default: case-sensitive match against the profile's declared casing.

Enforcement MUST be by locator_regex rejection, not by silent folding or normalization at minting time. Rejecting at the validation boundary avoids normalization_version churn: folding would force a profile version bump on every refinement and would invalidate already-minted IDs.

Scope is intentionally narrow:

  • ASCII digits only — does not touch decimal separators, thousands separators, or non-ASCII digits.
  • ASCII case only — does not propose Unicode case folding, NFC/NFKC normalization choices, or whitespace handling.

The two concrete bugs that motivated this (textrefs/registry#2 Bekker leading-zero forms, textrefs/registry#3 Bible case variants) become single-line constraints in the per-profile regex once this is normative.
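
As a rough illustration of what such single-line constraints could look like, the sketch below rejects non-canonical spellings instead of folding them. Both patterns are hypothetical simplifications written for this example (the Bible pattern lists a single book), not the registry's actual locator_regex values, and `is_canonical` is an assumed helper name.

```python
import re

# Hypothetical, simplified patterns; the real profiles' regexes may differ.
# Digit runs must start with 1-9, so leading zeros are rejected.
# No re.IGNORECASE flag, so only the declared casing matches.
LOCATOR_REGEX = {
    "bekker": re.compile(r"[1-9][0-9]*[ab][1-9][0-9]*"),
    "bible-book-chapter-verse": re.compile(r"John\.[1-9][0-9]*\.[1-9][0-9]*"),
}


def is_canonical(system: str, locator: str) -> bool:
    """Return True only if the whole locator matches the profile's pattern.

    Nothing is folded or rewritten: a non-canonical spelling is rejected.
    """
    return LOCATOR_REGEX[system].fullmatch(locator) is not None


for system, locator in [
    ("bekker", "514a1"),
    ("bekker", "514a01"),
    ("bekker", "0514a1"),
    ("bible-book-chapter-verse", "John.3.16"),
    ("bible-book-chapter-verse", "john.3.16"),
    ("bible-book-chapter-verse", "JOHN.3.16"),
]:
    verdict = "accept" if is_canonical(system, locator) else "reject"
    print(f"{system}: {locator} -> {verdict}")
```

Because the validator only accepts or rejects, the bytes that reach the UUID seed are always the bytes the contributor submitted, which is what keeps normalization_version out of the picture.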

Alternatives considered

Compatibility impact

breaking — major version bump required

(Profiles change shape and become stricter. Existing minted references must be re-checked against the new constraints; the review found no live data that would break, but this needs a CI sweep before the bump.)
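
The CI sweep mentioned above might be sketched as follows. This assumes minted references can be listed as (system, locator) pairs; the `sweep` function, the record shape, and the stricter patterns are hypothetical stand-ins rather than the registry's actual tooling.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical stricter patterns standing in for the post-change profiles.
STRICT_PATTERNS = {
    "bekker": re.compile(r"[1-9][0-9]*[ab][1-9][0-9]*"),
    "bible-book-chapter-verse": re.compile(r"John\.[1-9][0-9]*\.[1-9][0-9]*"),
}


def sweep(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return every (system, locator) pair the stricter profiles would reject.

    Records whose system has no known profile are reported as failures too,
    so they get a manual look rather than passing silently.
    """
    failures = []
    for system, locator in records:
        pattern = STRICT_PATTERNS.get(system)
        if pattern is None or pattern.fullmatch(locator) is None:
            failures.append((system, locator))
    return failures


# Made-up sample data; a real sweep would read the registry's minted references.
sample = [
    ("bekker", "514a1"),
    ("bekker", "514a01"),
    ("bible-book-chapter-verse", "John.3.16"),
    ("bible-book-chapter-verse", "john.3.16"),
]
failures = sweep(sample)
for system, locator in failures:
    print(f"would be rejected: {system} {locator}")
```

A CI job could fail the build whenever the returned list is non-empty, which would surface any already-minted reference that needs a decision before the version bump.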

Target spec version

v0.2.0-draft

Related
