Properties of combining diacritics

https://stackoverflow.com/questions/8281443

09-03-2021

Question

For combining diacritics, are they counted as letters? As far as I know, they can only combine with other letters in well-formed Unicode.

The ICU function that determines whether a Unicode code point is a letter takes only a single code point, so for any given code point it can't know whether it has been combined with a diacritic or, if it is itself a diacritic, what it has been combined with. I'm trying to implement something akin to a Unicode-aware regex, using a construct like

while(is_letter(codepoint))

However, I'm quite concerned about what's going to happen if codepoint is actually a diacritic, which would combine with a previous code point, or some other combining mark.

Is this safe to do? Or will I have to explicitly find and ignore diacritics and other combining marks?

Edit: What I really need to do is iterate characters, not codepoints.

This question is a victim of the XY problem. I need to raise a question about my actual problem.

Solution

I'm not totally clear on what you're trying to do, so I apologize in advance if this isn't the answer you're looking for, but:

For combining diacritics, are they counted as letters?

Broadly speaking, diacritics are counted as "marks" rather than "letters". For example, U+0301 COMBINING ACUTE ACCENT, as in <ś>, is a "nonspacing mark", which is one of three kinds of "mark". However, the "modifier letters", which are counted as "letters", might nonetheless be thought of as diacritics; for example, U+02C0 MODIFIER LETTER GLOTTAL STOP, as in <sˀ>, is a "modifier letter".
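To make the distinction concrete, here is a minimal sketch that looks up the general category of the two characters mentioned above. It uses Python's standard unicodedata module rather than ICU, purely for illustration; the helper names is_letter and is_mark are my own and are not part of any library.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    # Hypothetical helper: true for any "L*" general category (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")

def is_mark(ch: str) -> bool:
    # Hypothetical helper: true for any "M*" general category (Mn, Mc, Me).
    return unicodedata.category(ch).startswith("M")

for ch in ("s", "\u0301", "\u02C0"):
    # Print the code point, its character name, and its two-letter general category.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
```

With these helpers, is_letter is false for U+0301 and true for U+02C0, so a per-code-point letter test stops at the combining accent but continues through the modifier letter.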

If you look through the main file of the Unicode Character Database, UnicodeData.txt (warning: it's a 1.3 MB text file), you can get a sense for which characters are classified as "modifier letters" (Lm) and which as "nonspacing marks" (Mn), "spacing marks" (Mc), or "enclosing marks" (Me).
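Given those categories, one way the loop from the question might be adapted is to accept a letter and then consume any marks that follow it. The sketch below is only an approximation under that assumption: scan_word is a hypothetical name, it uses Python's unicodedata module instead of ICU, and it does not implement full grapheme cluster segmentation as described in Unicode Standard Annex #29.

```python
import unicodedata

def scan_word(text: str, start: int = 0) -> int:
    """Return the index just past a run of letters, where each letter
    may be followed by any number of combining marks (Mn, Mc, Me)."""
    i = start
    n = len(text)
    # A unit must begin with a letter (any "L*" category).
    while i < n and unicodedata.category(text[i]).startswith("L"):
        i += 1
        # Absorb the marks attached to that letter.
        while i < n and unicodedata.category(text[i]).startswith("M"):
            i += 1
    return i

word_end = scan_word("s\u0301a b")  # stops at the space after "śa"
```

Note that a mark with no preceding letter is not consumed, so a string that starts with a stray combining accent yields an empty run.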

Licensed under: CC-BY-SA with attribution