Question

Given an NFC-normalized string, if I apply full case folding to it, can I assume that the result is also NFC-normalized?

I don't understand what the Unicode standard is trying to tell me in this quote:

Normalization also interacts with case folding. For any string X, let Q(X) = NFC(toCasefold(NFD(X))). In other words, Q(X) is the result of normalizing X, then case folding the result, then putting the result into Normalization Form NFC format. Because of the way normalization and case folding are defined, Q(Q(X)) = Q(X). Repeatedly applying Q does not change the result; case folding is closed under canonical normalization for either Normalization Form NFC or NFD.

Solution

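No. A Unicode string that is in NFC might no longer be in NFC after case folding. An example is U+00DF (LATIN SMALL LETTER SHARP S) followed by U+0301 (COMBINING ACUTE ACCENT): the sharp s folds to "ss", and the second "s" then composes with the accent.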

X = U+00DF U+0301
NFC(X) = U+00DF U+0301
toCasefold(NFC(X)) = U+0073 U+0073 U+0301
NFC(toCasefold(NFC(X))) = U+0073 U+015B
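
The steps above can be reproduced with a short Python sketch. It assumes that Python's `str.casefold()` is an acceptable stand-in for the standard's toCasefold (full case folding) and that `unicodedata.normalize` provides NFC; the helper names `nfc` and `codepoints` are made up for this illustration.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def codepoints(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# X = LATIN SMALL LETTER SHARP S followed by COMBINING ACUTE ACCENT
x = "\u00df\u0301"

folded = nfc(x).casefold()   # toCasefold(NFC(X)), assuming casefold() == full case folding
renormalized = nfc(folded)   # NFC(toCasefold(NFC(X)))

print(codepoints(nfc(x)))        # U+00DF U+0301
print(codepoints(folded))        # U+0073 U+0073 U+0301
print(codepoints(renormalized))  # U+0073 U+015B
print(folded == renormalized)    # False: the folded string is not in NFC
```

If the result must be in NFC, normalizing once more after folding is the safe choice.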

Other answers

You have asked two questions:

Question 1: Is toCasefold(NFC(X)) binary equal to NFC(toCasefold(NFC(X)))?

The standard doesn't explicitly answer this question. (I would expect the answer is yes, that case folding does not affect normalization, but I have no proof.)

Question 2: What is the Unicode standard telling me in the quote?

The standard is only saying it is not necessary to do case folding again after canonical normalization. In other words, canonical normalization (to NFC or NFD form) does not change the case of any characters from uppercase to lowercase or vice versa. This doesn't answer your first question.

It is not saying whether or not it is necessary to do canonical normalization again after case folding.
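
The property from the quote, Q(Q(X)) = Q(X), can be spot-checked with a small Python sketch. This is only a check on a few hand-picked sample strings, not a proof, and it assumes that `str.casefold()` matches the standard's toCasefold; the function name `q` and the sample list are invented for this illustration.

```python
import unicodedata


def q(s: str) -> str:
    """Q(X) = NFC(toCasefold(NFD(X))), with str.casefold() assumed as toCasefold."""
    decomposed = unicodedata.normalize("NFD", s)
    folded = decomposed.casefold()
    return unicodedata.normalize("NFC", folded)


# A few hand-picked samples; any other strings could be used instead.
samples = ["\u00df\u0301", "\u00c5", "Stra\u00dfe"]

for s in samples:
    once = q(s)
    twice = q(once)
    # Applying Q a second time should leave the result unchanged.
    print(repr(s), repr(once), once == twice)
```

Note that Q ends with an NFC step, so its stability says nothing about whether toCasefold(NFC(X)) on its own is already in NFC.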

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow