A portable mechanism to specify source file encoding

The set of encodings accepted for source files and the encoding actually used to interpret a source file are implementation-defined.  From [[lex.phases]p1.1](http://eel.is/c++draft/lex.phases#1.1):

> Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.  The set of physical source file characters accepted is implementation-defined. ...

Many compilers support multiple source file encodings.  For example, gcc allows the source file encoding to be specified with the `-finput-charset` option, and Visual C++ with the `/source-charset` option, but the specified encoding is then applied to every source file read during that compilation, headers included.  Some compilers allow a per-file source encoding to be specified with a BOM or with an in-source syntax.  For example, Visual C++ recognizes a source file with a UTF-8 BOM as UTF-8 encoded, and IBM's xlC compiler allows a source file to specify its encoding with a [`#pragma filetag` directive](https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm).  The latter is similar to the [Python encoding declaration](https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations) or the [HTML encoding declaration](https://www.w3.org/TR/html52/document-metadata.html#specifying-the-documents-character-encoding).

The lack of a portable per-file mechanism to indicate source file encoding is an impediment to incremental adoption of UTF-8, since projects cannot rely on their public-facing header files being interpreted as UTF-8 encoded.

[P2295](https://wg21.link/p2295) proposes requiring implementations to support UTF-8 encoded source, but leaves the mechanism by which the UTF-8 encoding is selected implementation-defined.  That leaves open the possibility of implementations choosing different, possibly even conflicting, mechanisms.  Such conflicts exist today: Visual C++ honors a BOM, whereas gcc rejects one unless it has already been directed to interpret the source file as UTF-8 encoded.
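To make the BOM case concrete, here is a minimal sketch of how a tool might test the raw bytes of a source file for a UTF-8 BOM (the bytes `EF BB BF`) before lexing. The names `utf8_bom`, `has_utf8_bom`, and `strip_utf8_bom` are hypothetical and do not correspond to any compiler's actual implementation; C++17 is assumed.

```cpp
#include <cstddef>
#include <string_view>

// U+FEFF (the byte order mark) encoded as UTF-8 is the byte sequence EF BB BF.
inline constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";

// Hypothetical helper: true if the raw bytes of a source file begin with a UTF-8 BOM.
constexpr bool has_utf8_bom(std::string_view bytes) noexcept {
    // substr clamps the length, so inputs shorter than three bytes are safe.
    return bytes.substr(0, utf8_bom.size()) == utf8_bom;
}

// Hypothetical helper: the same bytes with a leading UTF-8 BOM removed, if present.
constexpr std::string_view strip_utf8_bom(std::string_view bytes) noexcept {
    return has_utf8_bom(bytes) ? bytes.substr(utf8_bom.size()) : bytes;
}
```

An implementation that honors the BOM would treat a `true` result as selecting UTF-8; one that does not would pass the three bytes on to be decoded in whatever encoding is otherwise in effect.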

The Unicode guidance on using a BOM to determine file encoding is not clear.  A [paper](https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf) was recently submitted to the Unicode Consortium to clarify that guidance.

Possibilities for portably specifying a per-file source encoding include:

- A magic comment (like Python).
- A `pragma` directive (like IBM xlC).
- A BOM (like Visual C++).
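The first two options could be recognized along the following lines. This is only a sketch under stated assumptions: the `// coding: <name>` comment form is hypothetical (modeled on Python's declaration, not proposed syntax), the `#pragma filetag("<name>")` matching is deliberately stricter than a real preprocessor would be, only the first line is examined, and the function names are invented for illustration. C++17 is assumed.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Characters allowed in an encoding name in this sketch (an assumption).
inline bool is_name_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '-' || c == '_' || c == '.';
}

// Hypothetical magic comment: the first line is a `//` comment containing
// "coding:" or "coding=" followed by an encoding name.
std::optional<std::string> encoding_from_magic_comment(std::string_view source) {
    std::string_view line = source.substr(0, source.find('\n'));
    if (line.substr(0, 2) != "//") return std::nullopt;
    std::size_t pos = line.find("coding");
    if (pos == std::string_view::npos) return std::nullopt;
    pos += 6;  // skip past "coding"
    if (pos >= line.size() || (line[pos] != ':' && line[pos] != '=')) return std::nullopt;
    ++pos;
    while (pos < line.size() && (line[pos] == ' ' || line[pos] == '\t')) ++pos;
    std::size_t end = pos;
    while (end < line.size() && is_name_char(line[end])) ++end;
    if (end == pos) return std::nullopt;  // no name after the separator
    return std::string(line.substr(pos, end - pos));
}

// Pragma form, simplified: the first line must begin exactly with
// `#pragma filetag("` and the name runs up to the closing quote.
std::optional<std::string> encoding_from_filetag(std::string_view source) {
    constexpr std::string_view prefix = "#pragma filetag(\"";
    std::string_view line = source.substr(0, source.find('\n'));
    if (line.substr(0, prefix.size()) != prefix) return std::nullopt;
    line.remove_prefix(prefix.size());
    std::size_t end = line.find('"');
    if (end == std::string_view::npos || end == 0) return std::nullopt;
    return std::string(line.substr(0, end));
}
```

Both forms share a bootstrapping constraint: the declaration itself has to be readable before the encoding is known, which in practice limits it to characters that are encoded identically in every candidate encoding.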
