
standard: revisit locator validation mechanism (regex + examples vs. alternatives) #9

Description

@maehr

Affected spec section

/standard/system-profiles/ — locator validation contract for citation-system profiles.

Motivation

Two registry issues filed today both tweak locator_regex to fix specific bugs, but they share a deeper question worth surfacing at the spec level before we keep patching individual profiles:

The question: is locator_regex + examples the right validation contract for a citation-system profile at all?

Reasons to revisit:

  1. The deterministic UUID seed uses the exact normalized locator bytes, so any regex-valid spelling difference (leading zeros, case, abbreviation form) mints a distinct permanent identity. A regex alone is a weak fence against identity splits.
  2. Some axes that profiles actually need to validate cannot be expressed in a regex at all — most obviously Bible book vocabulary, which is an enumerated set, not a character pattern.
  3. ECMAScript regex engines use backtracking, so allowing arbitrary contributor-supplied regexes is a latent DoS/CI-stall risk (see review notes on RE2-compatible subsets).
  4. Tightening the regex to enforce canonical forms (no leading zeros, canonical case) tangles syntactic validation with normalization rules in a way that is hard to read and easy to get wrong.
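
To make reason 1 concrete, here is a minimal sketch of an identity split. It assumes a UUIDv5-style derivation as a stand-in for the deterministic seed; the namespace, the regex, and the `mint` helper are all hypothetical and not taken from the spec or any registry profile.

```python
import re
import uuid

# Hypothetical namespace and locator regex, for illustration only.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/citation-system/demo")
LOCATOR_REGEX = re.compile(r"\d+[a-e]")


def mint(locator: str) -> uuid.UUID:
    """Validate a locator against the regex, then derive an ID from its exact bytes."""
    if LOCATOR_REGEX.fullmatch(locator) is None:
        raise ValueError(f"locator rejected by regex: {locator!r}")
    # uuid5 hashes the exact string, so any spelling difference changes the result.
    return uuid.uuid5(NAMESPACE, locator)


# Both spellings pass the regex, yet they mint two different permanent identities.
canonical = mint("327a")
padded = mint("0327a")
print(canonical == padded)
```

The regex accepts both spellings of the same place in the text, so nothing stops the second identity from being minted unless the canonical form is enforced somewhere else.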

This isn't a request to ship a new design today — it's a request to decide what we want the contract to be before more profiles land.

Proposed change

Open the design question for discussion. Sketch options without picking one:

  1. Status quo, tightened. Keep locator_regex but constrain it to an RE2-compatible / linear-time subset, and require profiles to declare canonical digit and case forms enforced by rejection rather than folding.
  2. Regex + structured fields. Keep locator_regex for the grammar around named capture groups, and add structured fields beside it for things regex can't express — e.g. vocabulary: for enumerated tokens (Bible books, Stephanus columns), canonical_case:, leading_zeros: forbid.
  3. Declarative validator schema. Replace the single regex with a small declarative shape: named components with per-component types and constraints (integer min=1 no-leading-zeros, enum: [Genesis, …], regex: [ab]). A compiler derives the regex.
  4. Drop syntactic validation from the profile. Rely on minting-time review and on the resolver. Cheapest spec, riskiest data.
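
As a rough illustration of what option 3 could look like, the sketch below derives a regex from typed components. The component classes (`Integer`, `Enum`, `Literal`), the `compile_locator` helper, and the two-book vocabulary are assumptions made up for this example; the actual schema shape is exactly what this issue leaves open.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Integer:
    """Positive integer (min=1) with leading zeros forbidden."""
    name: str

    def pattern(self) -> str:
        return f"(?P<{self.name}>[1-9][0-9]*)"


@dataclass(frozen=True)
class Enum:
    """Enumerated vocabulary, e.g. Bible book names."""
    name: str
    values: tuple

    def pattern(self) -> str:
        # Longest first, so a value is never shadowed by one of its own prefixes.
        ordered = sorted(self.values, key=len, reverse=True)
        alternatives = "|".join(re.escape(v) for v in ordered)
        return f"(?P<{self.name}>{alternatives})"


@dataclass(frozen=True)
class Literal:
    """Fixed separator text between components."""
    text: str

    def pattern(self) -> str:
        return re.escape(self.text)


def compile_locator(parts) -> re.Pattern:
    """Derive one regex from the declared components, in order."""
    return re.compile("".join(p.pattern() for p in parts))


# Hypothetical profile with a deliberately tiny vocabulary.
BIBLE = compile_locator([
    Enum("book", ("Genesis", "Exodus")),
    Literal(" "),
    Integer("chapter"),
    Literal(":"),
    Integer("verse"),
])

match = BIBLE.fullmatch("Genesis 1:1")
print(match.groupdict() if match else None)
```

Because contributors only declare components, the generated pattern stays within a small set of constructs (character classes, plain alternation, one unnested quantifier), which also speaks to the backtracking concern in option 1. Whether that trade-off is worth the extra schema surface is part of the discussion.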

Alternatives considered

The four options above are the alternatives. Reasonable hybrids exist (e.g. option 2 with option 1's RE2 constraint).

Compatibility impact

additive (no breaking change)

(The discussion is additive. A chosen outcome might later require breaking changes to existing profiles; that would be handled in a follow-up proposal.)

Target spec version

v0.2.0-draft
