Question

I'm trying to understand universal character names (UCNs) in C11, and I've found that the N1570 draft of the C11 standard says much less than the C++11 standard about Translation Phases 1 and 5 and the formation and handling of UCNs within them. This is what each has to say:

Translation Phase 1

N1570 Draft C11 5.1.1.2p1.1:

Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

C++11 2.2p1.1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

Translation Phase 5

N1570 Draft C11 5.1.1.2p1.5:

Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; [...]

C++ 2.2p1.5:

Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; [...]

(emphasis added to highlight the differences)

The Questions

  1. In the C++11 standard, it is very clear that source file characters not in the basic source character set are converted to UCNs, and that they are treated exactly as a UCN in that same place would have been, with the sole exception of raw string literals. Is the same true of C11? When a C11 compiler sees a multibyte UTF-8 character such as °, does it too translate it to \u00b0 in phase 1 and treat it just as if \u00b0 had appeared there instead?

  2. To put it differently: at the end of which translation phase, if any, are the following snippets of code first transformed into textually equivalent forms in C11?

    const char* hell° = "hell°";
    

    and

    const char* hell\u00b0 = "hell\u00b0";
    
  3. If in 2., the answer is "in none", then during which translation phase are those two identifiers first understood to refer to the same thing, despite being textually different?

  4. In C11, are UCNs in character/string literals also converted in phase 5? If so, why omit this from the draft standard?
  5. How are UCNs in identifiers (as opposed to in character/string literals as already mentioned) handled in both C11 and C++11? Are they also converted in phase 5? Or is this something implementation-defined? Does GCC for instance print out such identifiers in UCN-coded form, or in actual UTF-8?

Solution

Comments turned into an answer

Interesting question!

The C standard can leave more of the conversions unstated because they are implementation-defined (and C has no raw strings to confuse the issue).

  1. What it says in the C standard is sufficient, except that it leaves your question 1 unanswerable.
  2. Q2 has to be 'Phase 5', I think, with the caveat that 'textually equivalent' really means 'the token stream is equivalent'.
  3. Q3 is strictly N/A, but Phase 7 is probably the answer.
  4. Q4 is 'yes', and the standard covers it because it mentions 'escape sequences', and UCNs are escape sequences (a sketch illustrating this follows the list).
  5. Q5 is 'Phase 5' too.
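
To illustrate the answers to Q2 and Q4 concretely: a minimal sketch, assuming UTF-8 is used for both the source and execution character sets, in which both spellings of the string literal should compare equal by the end of phase 5.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *raw = "hell°";       /* extended character written directly */
    const char *ucn = "hell\u00b0";  /* the same character spelled as a UCN */

    /* Assuming a UTF-8 execution character set, phase 5 converts both
       literals to the same execution-set bytes. */
    puts(strcmp(raw, ucn) == 0 ? "identical" : "different");
    return 0;
}

Whether the two spellings also become identical earlier (say, after phase 1) is exactly what the C standard leaves unstated.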

Can the C++11-mandated processes in Phases 1 and 5 be taken as compliant within the wording of C11 (putting aside raw strings)?

I think they are effectively the same; the difference arises primarily from the raw literal issue, which is specific to C++. Generally, the C and C++ standards try not to make things gratuitously different, and in particular try to keep the workings of the preprocessor and the low-level character parsing the same in both (which has been easier since C99 added support for C++ // comments, but which evidently got harder with the addition of raw literals to C++11).

One day, I'll have to study the raw literal notations and their implications more thoroughly.

OTHER TIPS

First, please note that these distinctions have existed since 1998; UCNs were first introduced in C++98, a new standard at the time (ISO/IEC 14882, 1st edition, 1998), and then made their way into the C99 revision of the C standard; but the C committee (and existing implementers, with their pre-existing implementations) did not feel the C++ way was the only way to achieve the trick, particularly with respect to corner cases and to character sets smaller than, or simply different from, Unicode; for example, the requirement to ship mapping tables from every supported encoding to Unicode was a concern for C vendors in 1998.

  1. The C standard (consciously) avoids deciding this and lets the compiler choose how to proceed. While your reasoning obviously takes place in a context where UTF-8 is used for both the source and execution character sets, there is a large (and pre-existing) range of C99/C11 compilers that use different sets, and the committee felt it should not restrict implementers too much on this issue. In my experience, most compilers keep the two spellings distinct in practice (for performance reasons).
  2. Because of this freedom, some compilers can make the two forms identical after phase 1 (as a C++ compiler shall), while others can leave them distinct as late as phase 7 for the first degree character (the one in the identifier); the second degree character (the one in the string literal) ought to be the same after phase 5, assuming the degree character is part of the extended execution character set supported by the implementation. A small probe of the identifier case follows.
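
Here is a minimal probe of that identifier behavior, as a sketch: é (U+00E9) is substituted for °, because U+00B0 does not appear in the identifier character ranges of C11 Annex D, and the sketch further assumes the compiler accepts extended characters written directly in UTF-8 source (not all do).

#include <stdio.h>

int hell\u00e9 = 42;   /* declared with the UCN spelling */

int main(void)
{
    /* Referenced with the character written directly in UTF-8: however
       late the implementation internally unifies the two spellings, a
       conforming compiler must resolve both to the same identifier. */
    printf("%d\n", hellé);
    return 0;
}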

For the other questions, I won't add anything to Jonathan's answer.

As for your additional question about whether the more deterministic C++ process is Standard-C-compliant: it is clearly a goal for it to be so, and if you find a corner case showing otherwise (a C++11-compliant preprocessor that would not conform to the C99 and C11 standards), then you should consider raising it with the WG14 committee as a potential defect.

Obviously, the reverse is not true: it is possible to write a preprocessor whose handling of UCNs complies with C99/C11 but not with the C++ standards; the most obvious difference shows up with

#define str(t) #t
#define str_is(x, y)  const char * x = y " is " str(y)
str_is(hell°,      "hell°");
str_is(hell\u00b0, "hell\u00b0");

which a C-compliant preprocessor can render in much the same way as your examples (and most do so), and which will therefore have distinct renderings; but I am under the impression that a C++-compliant preprocessor is required to transform both into the (strictly equivalent)

const char* hell°      = "hell°"       " is " "\"hell\\u00b0\"";
const char* hell\u00b0 = "hell\u00b0"  " is " "\"hell\\u00b0\"";
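
To check which rendering a given toolchain actually produces, a small self-contained probe can be built from the same macros; this is a sketch, with é again substituted for ° so that the identifier is valid, and the output depends on how the preprocessor handles the UCN when stringizing.

#include <stdio.h>

#define str(t) #t
#define str_is(x, y) const char *x = y " is " str(y)

str_is(hell\u00e9, "hell\u00e9");

int main(void)
{
    /* Prints  hellé is "hell\u00e9"  if the preprocessor kept the UCN
       spelling when stringizing, or  hellé is "hellé"  if it replaced
       the UCN by the character itself first. */
    puts(hell\u00e9);
    return 0;
}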

Last, but not least, I believe not many compilers are fully compliant at this level of detail!

Licensed under: CC-BY-SA with attribution