Question

When writing interpreters for PDF, HTML and other documents we need to deal with a variety of white-space characters and additional non-printing characters. The ASCII ones are well defined, but how many others are likely to be found in practice? A typical example is this cluster from ISO 10646 (I think):

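&#8194;   &ensp;     &#x2002;   en space
&#8195;   &emsp;     &#x2003;   em space
&#8201;   &thinsp;   &#x2009;   thin space
&#8204;   &zwnj;     &#x200C;   zero width non-joiner
&#8205;   &zwj;      &#x200D;   zero width joiner
&#8206;   &lrm;      &#x200E;   left-to-right mark
&#8207;   &rlm;      &#x200F;   right-to-left mark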

(For obvious reasons the characters do not appear above!).


Solution

Unicode will be with us, in increasing quantity, for a long time. If an HTML or XML document is written in UTF-8 encoded Unicode, then you should expect any and all of these to appear.

In Unicode (Unicode Character Database, property White_Space) the following code points are defined as whitespace:

U+0009–U+000D (control characters: TAB, LF, VT, FF and CR)
U+0020 SPACE
U+0085 NEL (NEXT LINE, a control character)
U+00A0 NBSP (NO-BREAK SPACE)
U+1680 OGHAM SPACE MARK
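U+180E MONGOLIAN VOWEL SEPARATOR (whitespace only up to Unicode 6.2; since Unicode 6.3 it is a format character, category Cf, and no longer White_Space)
U+2000–U+200A (fixed-width typographic spaces, EN QUAD through HAIR SPACE)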
U+2028 LS (LINE SEPARATOR)
U+2029 PS (PARAGRAPH SEPARATOR)
U+202F NNBSP (NARROW NO-BREAK SPACE)
U+205F MMSP (MEDIUM MATHEMATICAL SPACE)
U+3000 IDEOGRAPHIC SPACE
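
As a rough illustration of how an interpreter might use this list, here is a minimal Python sketch. The names `WHITE_SPACE`, `is_white_space` and `collapse_white_space` are made up for this example. The table follows the list above, except that U+180E is left out on the assumption that you target Unicode 6.3 or later, where it no longer has the White_Space property.

```python
# Code points carrying the Unicode White_Space property (Unicode 6.3 and later).
# U+180E is omitted on purpose; add it back if you must match older Unicode data.
WHITE_SPACE = frozenset(
    [
        *range(0x0009, 0x000E),  # TAB, LF, VT, FF, CR
        0x0020,                  # SPACE
        0x0085,                  # NEL
        0x00A0,                  # NO-BREAK SPACE
        0x1680,                  # OGHAM SPACE MARK
        *range(0x2000, 0x200B),  # EN QUAD through HAIR SPACE
        0x2028,                  # LINE SEPARATOR
        0x2029,                  # PARAGRAPH SEPARATOR
        0x202F,                  # NARROW NO-BREAK SPACE
        0x205F,                  # MEDIUM MATHEMATICAL SPACE
        0x3000,                  # IDEOGRAPHIC SPACE
    ]
)


def is_white_space(ch: str) -> bool:
    """Return True if the single character ch is in the table above."""
    return ord(ch) in WHITE_SPACE


def collapse_white_space(text: str) -> str:
    """Replace each run of whitespace with one U+0020 and drop runs at either end."""
    out = []
    in_run = False
    for ch in text:
        if ord(ch) in WHITE_SPACE:
            in_run = True
            continue
        if in_run and out:
            out.append(" ")
        out.append(ch)
        in_run = False
    return "".join(out)
```

For example, `collapse_white_space("a\u2003\u2009b\u00a0")` returns `"a b"`. The zero-width joiners and direction marks from the question are not whitespace, so this sketch leaves them in place. Python's built-in `str.isspace()` is close to this table but not identical (it also accepts U+001C–U+001F), which is why the sketch spells the set out.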

OTHER TIPS

In the development world there's at least one more (most often used in web development):

  &nbsp; // non-breaking space
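
In HTML these characters usually arrive as character references such as `&nbsp;` rather than as raw code points. A small sketch, assuming Python's standard `html` and `unicodedata` modules, shows one way to see what a reference decodes to. The helper name `describe_entities` is invented for this example.

```python
import html
import unicodedata


def describe_entities(markup: str) -> list[tuple[str, str]]:
    """Decode character references, then list (code point, Unicode name) per character."""
    decoded = html.unescape(markup)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in decoded
    ]


for code_point, name in describe_entities("&nbsp;&ensp;&#x200C;&#8207;"):
    print(code_point, name)
# U+00A0 NO-BREAK SPACE
# U+2002 EN SPACE
# U+200C ZERO WIDTH NON-JOINER
# U+200F RIGHT-TO-LEFT MARK
```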

But the further you get into the design world, the more space and invisible characters you see. Publishing software normally has:

  • space - the regular SPACE
  • en space
  • em space
  • thin space
  • hair space
  • non-breaking space
  • non-breaking fixed width space
  • sixth space
  • quarter space
  • third space
  • punctuation space
  • flush space
  • figure space
  • ...
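
Several of the menu names above appear to line up with named Unicode characters. The mapping below is a guess based on name similarity, not something taken from any publishing tool's documentation. "Flush space" and "non-breaking fixed width space" are left out because it is not clear which code points they correspond to. `PUBLISHING_SPACES` and `code_point_for` are hypothetical names.

```python
import unicodedata

# Assumed correspondence: publishing-menu name -> Unicode character name.
PUBLISHING_SPACES = {
    "space": "SPACE",
    "en space": "EN SPACE",
    "em space": "EM SPACE",
    "thin space": "THIN SPACE",
    "hair space": "HAIR SPACE",
    "non-breaking space": "NO-BREAK SPACE",
    "sixth space": "SIX-PER-EM SPACE",
    "quarter space": "FOUR-PER-EM SPACE",
    "third space": "THREE-PER-EM SPACE",
    "punctuation space": "PUNCTUATION SPACE",
    "figure space": "FIGURE SPACE",
}


def code_point_for(menu_name: str) -> int:
    """Look up the code point for a menu name via its assumed Unicode name."""
    return ord(unicodedata.lookup(PUBLISHING_SPACES[menu_name]))


for menu_name in PUBLISHING_SPACES:
    print(f"{menu_name}: U+{code_point_for(menu_name):04X}")
```

Every code point in this mapping falls inside the whitespace ranges listed in the accepted answer, so a parser that handles that list already covers these.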
Licensed under: CC-BY-SA with attribution