What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?

They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly missing something here. Why the distinction?


The Unicode Standard, Chapter 3, D52

  • Combining character: A character with the General Category of Combining Mark (M).
  • Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me).
  • All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
  • The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation.
  • These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
  • The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non-joiner. The combining character is said to apply to that base character.
  • There may be no such base character, such as when a combining character is at the start of text or follows a control or format character—for example, a carriage return, tab, or right-left mark. In such cases, the combining characters are called isolated combining characters.
  • With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
  • The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.

The Unicode Standard, Chapter 3, D59

  • Grapheme extender: A character with the property Grapheme_Extend.
  • Grapheme extender characters consist of all nonspacing marks, zero width joiner, zero width non-joiner, U+FF9E, U+FF9F, and a small number of spacing marks.
  • A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character. zero width joiner and zero width non-joiner are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.
  • The small number of spacing marks that have the property Grapheme_Extend are all the second parts of a two-part combining mark.
  • The set of characters with the Grapheme_Extend property and the set of characters with the Grapheme_Base property are disjoint, by definition.

Solution

The difference in actual usage is that combining characters are defined as a General Category for rough classification of characters and grapheme extenders are mainly used for UAX #29 text segmentation.

EDIT: Since you offered a bounty, I can elaborate a bit.

Combining characters are characters that can't be used as stand-alone characters but must be combined with another character. They're used to define combining character sequences.
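As a rough illustration of definition D52 quoted in the question, Python's standard `unicodedata` module exposes both the General Category and the canonical combining class. The helper name `is_combining_character` is made up for this sketch, and the values reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def is_combining_character(ch: str) -> bool:
    """True if ch has General Category Mn, Mc, or Me (D52)."""
    return unicodedata.category(ch).startswith("M")

# COMBINING ACUTE ACCENT, BENGALI VOWEL SIGN II,
# COMBINING ENCLOSING CIRCLE, BENGALI LETTER KA
for ch in ("\u0301", "\u09C0", "\u20DD", "\u0995"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),  # canonical combining class
        is_combining_character(ch),
    )
```

U+09C0 is a combining character (Mc) whose canonical combining class is zero, which is the "reverse is not the case" situation that D52 mentions.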

Grapheme extenders were introduced in Unicode 3.2 to be used in Unicode Technical Report #29: Text Boundaries (then in a proposed status, now known as Unicode Standard Annex #29: Unicode Text Segmentation). The main use is to define grapheme clusters. Grapheme clusters are basically user-perceived characters. According to UAX #29:

Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.

The main difference is that grapheme extenders don't include most of the spacing marks (the set is actually smaller than the set of combining characters). Most of the spacing marks are vowel signs for Asian scripts. In these scripts, vowels are sometimes written by modifying a consonant character. If this modification takes up horizontal space (spacing mark), it used to be treated as a separate user-perceived character that forms its own (legacy) grapheme cluster. Later versions of UAX #29 changed this by introducing extended grapheme clusters, in which most, but not all, spacing marks do not break a cluster.

I think the key sentence from the standard is: "A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character." Combining characters, on the other hand, also include spacing marks that are applied to the left or right. There are a few exceptions, though (see property Other_Grapheme_Extend).

Example

U+0995 BENGALI LETTER KA:

ক

U+09C0 BENGALI VOWEL SIGN II (combining character, but not a grapheme extender):

ী

Combination of the two:

কী

This is a single combining character sequence consisting of two legacy grapheme clusters. The vowel sign can't be used by itself, but it still counts as a legacy grapheme cluster. A text editor, for example, could allow the cursor to be placed between the two characters.
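The example can be sketched in Python. This is only an approximation under stated assumptions: the standard library does not expose the Grapheme_Extend property, so the sketch treats it as Mn + Me plus a hand-picked subset of Other_Grapheme_Extend limited to the code points named in this thread. The names `OTHER_GRAPHEME_EXTEND_SUBSET`, `is_grapheme_extender`, and `legacy_clusters` are invented here, and the segmentation ignores every other UAX #29 rule (CR LF, controls, Hangul syllables).

```python
import unicodedata

# Assumption: incomplete subset of Other_Grapheme_Extend, restricted to the
# code points mentioned in this thread. The full list is in PropList.txt
# and varies between Unicode versions.
OTHER_GRAPHEME_EXTEND_SUBSET = {
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER (a grapheme extender per the quoted D59)
    "\u302E",  # HANGUL SINGLE DOT TONE MARK
    "\u302F",  # HANGUL DOUBLE DOT TONE MARK
    "\uFF9E",  # HALFWIDTH KATAKANA VOICED SOUND MARK
    "\uFF9F",  # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
}

def is_grapheme_extender(ch: str) -> bool:
    """Approximate Grapheme_Extend as Mn + Me + the subset above."""
    return (
        unicodedata.category(ch) in ("Mn", "Me")
        or ch in OTHER_GRAPHEME_EXTEND_SUBSET
    )

def legacy_clusters(text: str) -> list[str]:
    """Simplified segmentation: an extender joins the preceding cluster,
    every other character starts a new one."""
    clusters: list[str] = []
    for ch in text:
        if clusters and is_grapheme_extender(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(len(legacy_clusters("\u0995\u09C0")))  # KA + VOWEL SIGN II
print(len(legacy_clusters("e\u0301")))       # e + COMBINING ACUTE ACCENT
```

Under these assumptions the Bengali pair splits into two clusters, because U+09C0 is a spacing mark (Mc) and not an extender, while "e" plus the acute accent stays a single cluster.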

There are over 300 combining characters like this which do not extend graphemes, and four characters which are not combining but do extend graphemes.

Other tips

I’ve posted this question on the Unicode mailing list and got some more responses. I’ll post some of them here.

Tom Gewecke wrote:

I'm not an expert on this aspect of Unicode, but I understand that "grapheme extender" is a finer distinction in character properties designed to be used in certain specific and complex processes like grapheme breaking. You might find this blog article helpful in seeing where it comes into play: http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-4.html

PS The answer by nwellnhof at StackOverflow is an excellent explanation of this issue in my view.

Philippe Verdy wrote:

Many grapheme extenders are not "combining characters". Combining characters are classified this way for legacy reasons (the very weak "general category" property) and this property is normatively stabilized. As well most combining characters have a non-zero combining class and they are stabilized for the purpose of normalization.

Grapheme extenders include characters that are also NOT combining characters but controls (e.g. joiners). Some grapheme clusters are also more complex in some scripts (there are extenders encoded BEFORE the base character, and they cannot be classified as combining characters because combining characters are always encoded AFTER a base character).

For legacy reasons (and roundtrip compatibility with older standards) not all scripts are encoded using the UCS character model using combining characters. (E.g. the Thai script; not following the "logical" encoding order; but following the model used in TIS-620 and other standards based on it; including for Windows, and *nix/*nux).

Richard Wordingham wrote:

Spacing combining marks (category Mc) are in general not grapheme extenders. The ones that are included are mostly included so that the boundaries between 'legacy grapheme clusters' (http://www.unicode.org/reports/tr29/tr29-23.html) are invariant under canonical equivalence. There are six grapheme extenders that are not nonspacing (Mn) or enclosing (Me) and are not needed by this rule:

  • ZWNJ
  • ZWJ
  • U+302E HANGUL SINGLE DOT TONE MARK
  • U+302F HANGUL DOUBLE DOT TONE MARK
  • U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
  • U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

I can see that it will sometimes be helpful to ZWNJ and ZWJ along with the previous base character. The fullwidth soundmarks U+3099 and U+309A are included for reasons of canonical equivalence, so it makes sense to include their halfwidth versions.

I don't actually see the logic for including U+302E and U+302F. If you're going to encourage forcing someone who's typed the wrong base character before a sequence of 3 non-spacing marks to retype the lot, you may as well do the same with Hangul tone marks.
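To make the canonical-equivalence point in the message above a little more concrete, here is a small sketch using Python's standard `unicodedata.normalize`. It only shows how the characters mentioned behave under normalization; it does not compute grapheme cluster boundaries. The Bengali code points U+09CB, U+09C7, and U+09BE are not named in the thread and are picked here as an illustration of a two-part vowel sign.

```python
import unicodedata

# Fullwidth case: HIRAGANA KA + U+3099 composes canonically into HIRAGANA GA.
composed = unicodedata.normalize("NFC", "\u304B\u3099")
print([f"U+{ord(c):04X}" for c in composed])

# Halfwidth case: U+FF9E has only a compatibility mapping, so NFC leaves
# the pair untouched and only NFKC folds it into a single katakana letter.
halfwidth = "\uFF76\uFF9E"
print(unicodedata.normalize("NFC", halfwidth) == halfwidth)
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", halfwidth)])

# Two-part vowel sign: BENGALI VOWEL SIGN O decomposes canonically into
# U+09C7 followed by U+09BE, the "second part" kind of spacing mark.
decomposed = unicodedata.normalize("NFD", "\u09CB")
print([f"U+{ord(c):04X}" for c in decomposed])
```

If the second part of such a decomposition did not extend the cluster, the composed and decomposed spellings of the same text would segment differently, which is the invariance the message refers to.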

May I quote from Yannis Haralambous' Fonts and Encodings, page 116f.:

The idea is that a script or a system of notation is sometimes too finely divided into characters. And when we have cut constructs up into characters, there is no way to put them back together again to rebuild larger characters. For example, Catalan has the ligature ‘ŀl’. This ligature is encoded as two Unicode characters: an ‘ŀ’ 0x0140 latin small letter l with middle dot and an ordinary ‘l’. But this division may not always be what we want.
Suppose that we wish to place a circumflex accent over this ligature, as we might well wish to do with the ligatures ‘œ’ and ‘æ’. How can this be done in Unicode? To allow users to build up characters in constructs that play the rôle of new characters, Unicode introduced three new properties (grapheme base, grapheme extension, grapheme link) and one new character: 0x034F combining grapheme joiner.

So the way I see it, this means that grapheme extenders are used to apply (for example) accents on characters that are themselves composed of several characters.

Licensed under: CC-BY-SA with attribution