A portable mechanism to specify source file encoding #71

@tahonermann

The set of encodings accepted for source files and the encoding actually used to interpret a source file are implementation-defined. From [lex.phases]p1.1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. ...

Many compilers accept source files in more than one encoding. For example, gcc allows the source file encoding to be specified with the -finput-charset option and Visual C++ with the /source-charset option, but the selected encoding is then applied to all source files. Some compilers allow a per-file source encoding to be specified with a BOM or with an in-source syntax. For example, Visual C++ will recognize a source file with a UTF-8 BOM as being UTF-8 encoded, and IBM's xlC compiler allows a source file to specify its encoding with a #pragma filetag directive. The pragma approach is similar to the Python encoding declaration or the HTML encoding declaration.

The lack of a portable per-file mechanism to indicate source file encoding is an impediment to incremental adoption of UTF-8, since projects cannot rely on their public-facing header files being interpreted as UTF-8 encoded.

P2295 proposes requiring implementations to support UTF-8 encoded source, but leaves the mechanism for selecting the UTF-8 encoding implementation-defined. That leaves implementations free to choose different, possibly even conflicting, mechanisms. Such conflicts exist today: Visual C++ will honor a BOM, whereas gcc will reject one unless it has already been directed to interpret the source file as UTF-8 encoded.

The Unicode guidance for use of a BOM to determine file encoding is not clear. A paper was recently submitted to the Unicode consortium to clarify that guidance.

Possibilities for portably specifying a per-file source encoding include:

  • A magic comment (like Python).
  • A pragma directive (like IBM xlC).
  • A BOM (like Visual C++).
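
To make the first and third options concrete, here is a minimal sketch of how a tool might look for a per-file encoding declaration. Everything in it is an assumption for illustration: the function name declared_encoding is hypothetical, the `// coding: <name>` comment spelling is borrowed from Python's convention rather than from any proposal, and no existing compiler is claimed to behave this way. The pragma option is not covered.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Characters allowed in an encoding name (assumption: same set Python uses).
inline bool is_name_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 ||
           c == '-' || c == '_' || c == '.';
}

// Hypothetical helper, not part of any compiler or of P2295.
// Returns the encoding a source file declares for itself, if any.
std::optional<std::string> declared_encoding(std::string_view bytes) {
    // Option 1: a UTF-8 BOM (EF BB BF) at the very start of the file.
    constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";
    if (bytes.substr(0, utf8_bom.size()) == utf8_bom) {
        return "UTF-8";
    }

    // Option 2: a magic comment on the first line, e.g. "// coding: utf-8".
    // The "//" prefix and the "coding" keyword are assumptions.
    std::string_view line = bytes.substr(0, bytes.find('\n'));
    if (line.substr(0, 2) != "//") {
        return std::nullopt;
    }
    constexpr std::string_view key = "coding";
    std::size_t pos = line.find(key);
    if (pos == std::string_view::npos) {
        return std::nullopt;
    }
    pos += key.size();
    if (pos >= line.size() || (line[pos] != ':' && line[pos] != '=')) {
        return std::nullopt;
    }
    ++pos;  // skip ':' or '='
    while (pos < line.size() && (line[pos] == ' ' || line[pos] == '\t')) {
        ++pos;
    }
    std::size_t end = pos;
    while (end < line.size() && is_name_char(line[end])) {
        ++end;
    }
    if (end == pos) {
        return std::nullopt;  // keyword present but no encoding name
    }
    return std::string(line.substr(pos, end - pos));
}
```

In this sketch the BOM takes precedence over a magic comment. That ordering is an arbitrary choice, and a real proposal would need to say what happens when the two disagree.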

Labels: enhancement (New feature or request), paper needed (A paper proposing a specific solution is needed)