Unicode case folding to upper case

Question 1

Summary: The wording used by Microsoft is confusing, to say the least. It appears that simple upper case mapping should be done, though I can't be certain.

Background

Part of the confusion might be the difference between case folding and case mapping. Case mapping maps every character to a designated case. Case folding, while it is based on lower-casing, is defined to result in case-less characters (UTR #21 §1.3).

Now there are two variants of case mapping and case folding: simple and full. Unlike the simple transformation, the full one can change string length, and as you rightly point out, it is not needed here. The specification specifically mentions simple, which is probably the only clear thing in this answer. I do feel the need to mention for future reference that the current Unicode Standard (6.3.0) states that the default case transformation is the full one, though the version Microsoft references (3.1.1) does not appear to make this distinction.
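To make the simple/full distinction concrete, here is a small Python sketch. Python's str.upper() applies the full mapping of whatever Unicode version the interpreter ships with (not the version the specification references). The function approx_simple_upper is a hypothetical helper of mine that only approximates the simple mapping by refusing any expansion; a faithful implementation would read the simple uppercase field of UnicodeData.txt instead.

```python
def full_upper(text):
    # Full case mapping as implemented by Python; may change the length.
    return text.upper()


def approx_simple_upper(text):
    # ASSUMPTION: approximates simple upper-casing by keeping a character
    # unchanged whenever its full mapping expands to several characters.
    # This is not exact for every code point; use UnicodeData.txt for that.
    result = []
    for ch in text:
        mapped = ch.upper()
        result.append(mapped if len(mapped) == 1 else ch)
    return "".join(result)


word = "stra\u00dfe"  # contains U+00DF LATIN SMALL LETTER SHARP S
print(len(word), len(full_upper(word)), len(approx_simple_upper(word)))
```

The full mapping turns the six-character word into seven characters, while the length-preserving variant leaves U+00DF alone.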

Spec Analysis

(...) convert to upper-case with the Unicode Default Case Conversion Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.

To me this quote appears to suggest they want upper case, and simply made an error by saying case folding instead of case mapping. But then comes that reference you quoted:

For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.

They actually mention the case folding data file! At this point, I'm not sure what to think. My main line of thought is that Microsoft wants case folding but erroneously thought that it was based on upper casing rather than lower casing. Even this is a stretch, but it's the closest I've been able to come to reconciling this possible contradiction, and I hope there's a better explanation.

I've found in section 2.6.1 the following which supports some form of upper-casing:

[...] the directory entry name is compared using a special case-insensitive upper-case mapping, described in Red-Black Tree.

Note that they do in fact use the term mapping here.
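If the first quote is read literally (upper-case, then compare UTF-16 code units by binary value), the comparison could be sketched as follows. This is only my reading of that one sentence: the names utf16_units and compare_names are mine, str.upper() is Python's full mapping rather than the simple one, and any further ordering rules elsewhere in the specification are not modelled.

```python
def utf16_units(text):
    # Encode to little-endian UTF-16 and split into 16-bit code units.
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def compare_names(a, b):
    # ASSUMPTION: upper-case both names with Python's (full) mapping,
    # then compare the resulting UTF-16 code units numerically.
    # Returns a negative number, zero, or a positive number.
    ua = utf16_units(a.upper())
    ub = utf16_units(b.upper())
    return (ua > ub) - (ua < ub)


print(compare_names("Root Entry", "ROOT ENTRY"))
```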

The exception list

Taking a look at the exception list for the mentioned Windows XP and Windows Server 2003, most entries are subtractions, suggesting code points Microsoft wants to keep distinct. However, in the table, the code points are actually listed in reverse order to the Unicode case folding data file.

One interpretation of this is that it's just a display quirk. This idea is shot down by the last row where they subtract the case transformation 0x03C2 -> 0x03C2. This transformation does not exist in the data file since the transformation 0x03C2 -> 0x03C3 does (an unlisted case transformation is considered to transform to itself).

Another interpretation is that they do in fact erroneously believe that it's the reverse mapping that's the correct one. As you mentioned though, this runs into trouble, as the reverse mapping is not always straightforward. Otherwise, this interpretation would be fine.

A third interpretation is to consider their reference to the Unicode case folding data file wrong. This of course makes me feel uneasy, but if they actually did mean case mapping originally, they might have just provided the link as a quick reference point. The exception list they mention does have column headers such as "Lowercase UTF-16 code point", but we know that case folding is in fact case-less.

As an aside, I did look at the exception list for the later operating systems, hoping to gain some more insight. I found more confusion. In particular the addition of 0x03C3 -> 0x03A3 troubles me. Since the exception list and the Unicode file list their code points in the opposite order, it appears that the transformation is already in the data file and doesn't need to be added. This part of the specification does not want to be understood!
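For reference, this is how the three sigma code points discussed above behave in Python. Note the assumptions: Python uses the Unicode data bundled with the interpreter, not the 3.0.1/3.1 files the specification cites, and casefold() and upper() are the full variants (which happen to be single-character results for these code points). The helper name describe is mine.

```python
FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
SMALL_SIGMA = "\u03c3"    # GREEK SMALL LETTER SIGMA
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA


def describe(ch):
    # Report the folded and upper-cased code point of a single character
    # as four-digit hex strings.
    return {
        "fold": "%04X" % ord(ch.casefold()),
        "upper": "%04X" % ord(ch.upper()),
    }


for ch in (FINAL_SIGMA, SMALL_SIGMA, CAPITAL_SIGMA):
    print("%04X" % ord(ch), describe(ch))
```

All three fold to 03C3 and upper-case to 03A3, which is why a subtracted 0x03C2 -> 0x03C2 entry and an added 0x03C3 -> 0x03A3 entry are hard to interpret against the folding file.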

Conclusion

If you've read all of the above, you'll probably guess that this conclusion is going to be less than ideal. Clearly at one or more points, the specification is in error, but it's hard to tell where. Really there are three possibilities depending on your interpretation as to what kind of case transformation needs to be done.

Simple upper case mapping
Simple case folding, followed by simple upper case mapping
Simple case folding

To me it seems like Microsoft does in fact want upper casing. From there I believe that the case folding reference is an error, and as such my guess is they just want simple upper case mapping.

I highly doubt it's the last option (simple case folding only) though. The other two options would give very similar results, with only a small number of code points possibly giving different results.
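To see how the three candidate transformations can disagree, here is a rough Python sketch that treats each one as a comparison key. The function names are mine, and the simple variants are only approximated by leaving a character unchanged whenever Python's full mapping would expand it; the results also reflect the Unicode version bundled with Python, not the one the specification cites.

```python
def _per_char(text, transform):
    # ASSUMPTION: approximate a simple (length-preserving) transformation
    # by ignoring any mapping that expands to more than one character.
    result = []
    for ch in text:
        mapped = transform(ch)
        result.append(mapped if len(mapped) == 1 else ch)
    return "".join(result)


def key_upper(text):
    # Option 1: upper case mapping only.
    return _per_char(text, str.upper)


def key_fold(text):
    # Option 3: case folding only.
    return _per_char(text, str.casefold)


def key_fold_then_upper(text):
    # Option 2: case folding followed by upper case mapping.
    return key_upper(key_fold(text))


KELVIN = "\u212a"  # KELVIN SIGN, which folds to "k" but has no uppercase mapping
for name, key in (("upper", key_upper),
                  ("fold+upper", key_fold_then_upper),
                  ("fold", key_fold)):
    print(name, key(KELVIN) == key("K"))
```

With upper-casing alone the Kelvin sign stays distinct from "K", while either option involving case folding treats them as equal, so this is one of the few code points where the choice matters.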

It seems like the only way to know for sure would be to either contact Microsoft, or painstakingly look at binaries to see which method is followed.

Question 2

In 3.13 Default Case Algorithms (p. 115) of The Unicode Standard Version 6.2 – Core Specification the text refers to UnicodeData.txt. This contains:

03B2;GREEK SMALL LETTER BETA;Ll;0;L;;;;;N;;;0392;;0392
03D0;GREEK BETA SYMBOL;Ll;0;L;<compat> 03B2;;;;N;GREEK SMALL LETTER CURLED BETA;;0392;;0392

which indicates that the Greek small letter beta (U+03B2) and the Greek beta symbol (U+03D0) both have U+0392 as their simple uppercase mapping, so the two should indeed map to the same character when upper-cased. As an aside, the <compat> entry indicates that the two symbols have some level of compatibility. The file also contains the remainder of the bidirectional case conversion you are looking for. You may also need to look at SpecialCasing.txt for boundary cases.
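As a minimal sketch of reading those entries, the two lines quoted above can be split on semicolons; in UnicodeData.txt the simple uppercase, lowercase and titlecase mappings are the last three fields (indices 12, 13 and 14), and an empty field means the character maps to itself. The function name parse_simple_mappings and the dictionary layout are my own choices.

```python
LINES = [
    "03B2;GREEK SMALL LETTER BETA;Ll;0;L;;;;;N;;;0392;;0392",
    "03D0;GREEK BETA SYMBOL;Ll;0;L;<compat> 03B2;;;;N;"
    "GREEK SMALL LETTER CURLED BETA;;0392;;0392",
]


def parse_simple_mappings(line):
    # Split one UnicodeData.txt record into its semicolon-separated fields.
    fields = line.split(";")
    code = int(fields[0], 16)

    def mapping(field):
        # An empty mapping field means "maps to itself".
        return int(field, 16) if field else code

    return {
        "code": code,
        "name": fields[1],
        "decomposition": fields[5],
        "upper": mapping(fields[12]),
        "lower": mapping(fields[13]),
        "title": mapping(fields[14]),
    }


for line in LINES:
    entry = parse_simple_mappings(line)
    print("%04X -> upper %04X" % (entry["code"], entry["upper"]))
```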