The set of encodings accepted for source files and the encoding actually used to interpret a source file are implementation-defined. From [lex.phases]p1.1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. ...
Many compilers support multiple encodings that can be used with source files. For example, gcc allows the source file encoding to be specified with the -finput-charset option and Visual C++ with the /source-charset option, but that encoding is then applied to all source files. Some compilers allow a per-file source encoding to be specified with a BOM or with an in-source syntax. For example, Visual C++ will recognize a source file with a UTF-8 BOM as being UTF-8 encoded and IBM's xlC compiler allows a source file to specify its encoding with a #pragma filetag directive. The latter is similar to the Python encoding declaration or the HTML encoding declaration.
The lack of a portable per-file mechanism to indicate source file encoding is an impediment to incremental adoption of UTF-8, since projects cannot rely on their public-facing header files being interpreted as UTF-8 encoded.
P2295 proposes requiring implementations to support UTF-8 encoded source, but leaves the mechanism by which the UTF-8 encoding is selected implementation-defined. That presents the possibility of implementations choosing different, possibly even conflicting, mechanisms. Such conflicts exist today: Visual C++ will honor a BOM, whereas gcc will reject one unless it has already been directed to interpret such a source file as UTF-8 encoded.
The Unicode guidance on using a BOM to determine file encoding is not clear. A paper was recently submitted to the Unicode Consortium to clarify that guidance.
Possibilities for portably specifying a per-file source encoding include:
- A magic comment (like Python).
- A #pragma directive (like IBM xlC).
- A BOM (like Visual C++).
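
Of the three options, the BOM is the simplest to detect mechanically. The following is a minimal sketch, not a description of how any particular compiler does it: it assumes the raw bytes of a source file are already in memory, and the helper names has_utf8_bom and strip_utf8_bom are hypothetical. It checks for the UTF-8 encoding of U+FEFF (the bytes EF BB BF) at the start of the buffer.

```cpp
#include <string_view>

// Hypothetical helper: reports whether the buffer begins with the
// UTF-8 byte order mark (bytes EF BB BF).
constexpr bool has_utf8_bom(std::string_view bytes) noexcept {
    return bytes.size() >= 3
        && static_cast<unsigned char>(bytes[0]) == 0xEF
        && static_cast<unsigned char>(bytes[1]) == 0xBB
        && static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: returns the buffer with a leading UTF-8 BOM
// removed, or the buffer unchanged if no BOM is present.
constexpr std::string_view strip_utf8_bom(std::string_view bytes) noexcept {
    return has_utf8_bom(bytes) ? bytes.substr(3) : bytes;
}
```

A tool that treats the BOM as a per-file encoding indicator would decode the file as UTF-8 when has_utf8_bom returns true and otherwise fall back to whatever default encoding it was configured with. A magic comment or #pragma would instead require scanning the first lines of the file in an ASCII-compatible manner before the encoding is known.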