Clarify the relationship between the literal and execution encodings #72

@tahonermann

The relationship between the literal (compile-time) and execution (locale-dependent, run-time) encodings is not clear in the standard. Intuition argues that the two encodings must be compatible, meaning that any character encoded using the literal encoding must have the same encoding in the execution encoding. But such a relationship is commonly violated. Corentin asked the following questions during discussion of P2297R0 in the 2021-02-24 SG16 telecon. Note that the first case uses a character from the basic source character set, but the second does not.

isalpha('a') // Can this ever return false?
isalpha('é') // Can this ever return false?
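A minimal sketch for probing these two calls at run time, under stated assumptions: the helper name `is_alpha_in_locale` is hypothetical, locale names other than "C" are platform-specific, and the byte 0xE9 stands in for 'é' on the assumption of an ISO-8859-1 or Windows-1252 literal encoding. The cast to `unsigned char` is there because passing a negative `char` value to `std::isalpha` is undefined behavior.

```cpp
#include <cctype>
#include <clocale>
#include <optional>
#include <string>

// Hypothetical helper: classify one code unit under the named LC_CTYPE locale.
// Returns std::nullopt if the locale is not available on this platform.
// Note: std::setlocale changes global state and is not safe to call
// concurrently with other locale-dependent functions.
inline std::optional<bool> is_alpha_in_locale(char c, const char* locale_name) {
    // Copy the current locale name; the returned pointer may be invalidated
    // by the next call to std::setlocale.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return std::nullopt;  // locale not supported here
    }

    // The literal encoding fixed the value of c at compile time; the
    // classification below depends only on the run-time locale.
    const bool result = std::isalpha(static_cast<unsigned char>(c)) != 0;

    std::setlocale(LC_CTYPE, saved.c_str());  // restore the previous locale
    return result;
}
```

For example, `is_alpha_in_locale('\xE9', "C")` is expected to yield false, because the "C" locale classifies only the basic uppercase and lowercase letters as alphabetic, while a locale whose encoding is ISO-8859-1 may classify the same byte as alphabetic.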

In principle, both of these can return false, and in practice, there are cases where even characters of the basic source character set lack representation in the execution encoding or are encoded differently. This happens with Shift-JIS, where U+00A5 YEN SIGN (¥) is substituted for U+005C REVERSE SOLIDUS (\) relative to ASCII, and in various EBCDIC code pages, where the following basic source character set members are not part of the "invariant subset of EBCDIC":

  • U+0021 EXCLAMATION MARK (!)
  • U+0023 NUMBER SIGN (#)
  • U+005B LEFT SQUARE BRACKET ([)
  • U+005C REVERSE SOLIDUS (\)
  • U+005D RIGHT SQUARE BRACKET (])
  • U+005E CIRCUMFLEX ACCENT (^)
  • U+007B LEFT CURLY BRACKET ({)
  • U+007C VERTICAL LINE (|)
  • U+007D RIGHT CURLY BRACKET (})
  • U+007E TILDE (~)
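As a rough illustration, the list above can be turned into a compile-time check that flags characters whose encoding may vary across EBCDIC code pages. The names `ebcdic_variant_code_points`, `is_ebcdic_variant`, and `first_ebcdic_variant` are hypothetical, and the table contains only the ten code points listed in this issue. `char32_t` literals are used so that the check itself does not depend on the ordinary literal encoding.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// The ten code points listed above as outside the invariant subset of EBCDIC.
inline constexpr std::array<char32_t, 10> ebcdic_variant_code_points = {
    0x0021,  // EXCLAMATION MARK
    0x0023,  // NUMBER SIGN
    0x005B,  // LEFT SQUARE BRACKET
    0x005C,  // REVERSE SOLIDUS
    0x005D,  // RIGHT SQUARE BRACKET
    0x005E,  // CIRCUMFLEX ACCENT
    0x007B,  // LEFT CURLY BRACKET
    0x007C,  // VERTICAL LINE
    0x007D,  // RIGHT CURLY BRACKET
    0x007E,  // TILDE
};

// True if cp is one of the listed code points.
constexpr bool is_ebcdic_variant(char32_t cp) {
    for (char32_t v : ebcdic_variant_code_points) {
        if (v == cp) return true;
    }
    return false;
}

// Index of the first listed code point in text, or npos if there is none.
constexpr std::size_t first_ebcdic_variant(std::u32string_view text) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (is_ebcdic_variant(text[i])) return i;
    }
    return std::u32string_view::npos;
}

static_assert(is_ebcdic_variant(U'['));
static_assert(!is_ebcdic_variant(U'a'));
static_assert(first_ebcdic_variant(U"int a[3];") == 5);
static_assert(first_ebcdic_variant(U"x = y + 1;") == std::u32string_view::npos);
```

Even a declaration as ordinary as `int a[3];` contains a listed character, which suggests how little source text stays within the invariant subset.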

Further complications arise when a program compiled with a literal encoding such as UTF-8 is run in an environment with, for example, a Windows-1252 execution encoding. In this case, each lead and trail code unit of a UTF-8 sequence is perceived to be an individual character. The situation is even worse for encodings like Shift-JIS, where a trailing code unit sequence may contain code units that themselves match the encoding of a single-byte character.
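Both effects can be sketched with explicit byte values, so that the example does not depend on the encoding of this source text. The function names are hypothetical, the UTF-8 counter assumes well-formed input, and the Shift-JIS lead byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are an assumption that follows the common CP932 layout. The bytes 0xC3 0xA9 are UTF-8 for U+00E9 (é), and 0x83 0x5C is the Shift-JIS encoding of KATAKANA LETTER SO (ソ), whose trail byte equals 0x5C.

```cpp
#include <cstddef>
#include <string_view>

// Count code points in well-formed UTF-8 by skipping continuation bytes
// (those of the form 10xxxxxx).
constexpr std::size_t utf8_code_points(std::string_view bytes) {
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Assumed lead byte ranges for double-byte Shift-JIS characters (CP932 layout).
constexpr bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Find 0x5C where it is a character of its own, skipping the trail byte of
// each double-byte character. Returns npos if there is no such occurrence.
constexpr std::size_t find_sjis_0x5c(std::string_view bytes) {
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const auto b = static_cast<unsigned char>(bytes[i]);
        if (is_sjis_lead(b)) {
            ++i;  // step over the trail byte
            continue;
        }
        if (b == 0x5C) return i;
    }
    return std::string_view::npos;
}

// One UTF-8 character, but two characters to a single-byte encoding:
// Windows-1252 maps 0xC3 and 0xA9 to two separate characters.
static_assert(utf8_code_points("\xC3\xA9") == 1);
static_assert(std::string_view("\xC3\xA9").size() == 2);

// A byte-wise search reports 0x5C inside the double-byte character;
// the encoding-aware search does not.
static_assert(std::string_view("\x83\x5C").find('\x5C') == 1);
static_assert(find_sjis_0x5c("\x83\x5C") == std::string_view::npos);
static_assert(find_sjis_0x5c("\x83\x5C\x5C") == 2);
```

A function that scans bytes for a path separator or an escape character behaves like the byte-wise search here, which is why the Shift-JIS case is harder to detect than the UTF-8 one.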

Possibilities for addressing this include:

  • declaring no relationship between the literal and execution encodings,
  • specifying a subset of the basic source character set that consists of characters known not to have invariant encoding with respect to known execution encodings,
  • stating that a program that is run in an environment where the execution encoding is not compatible with the literal encoding exhibits undefined behavior if a string literal encodes a non-compatible character and that string is passed to an execution-sensitive function.

Labels

  • clarification: Something isn't clear
  • paper submitted: A paper proposing a specific solution has been submitted
