Question

In PHP, what's the most elegant way to get the complete list (array of strings) of all the Unicode whitespace characters, encoded in utf8?

I need that to generate test data.

Was it helpful?

Solution

This email contains a list of all Unicode whitespace characters encoded in UTF-8, UTF-16, and HTML.

edit

Originally answered Feb 9 '10 (!). Really guys, if the information is outdated, you can add your own answer, rather than complain. Just google for the URL mentioned in my answer, and earn some rep:

The mail has been archived here (took me seconds), and the whitespace table is even mentioned in the introduction

static $whitespace = array(
    "SPACE" => "\x20",
    "NO-BREAK SPACE" => "\xc2\xa0",
    "OGHAM SPACE MARK" => "\xe1\x9a\x80",
    "EN QUAD" => "\xe2\x80\x80",
    "EM QUAD" => "\xe2\x80\x81",
    "EN SPACE" => "\xe2\x80\x82",
    "EM SPACE" => "\xe2\x80\x83",
    "THREE-PER-EM SPACE" => "\xe2\x80\x84",
    "FOUR-PER-EM SPACE" => "\xe2\x80\x85",
    "SIX-PER-EM SPACE" => "\xe2\x80\x86",
    "FIGURE SPACE" => "\xe2\x80\x87",
    "PUNCTUATION SPACE" => "\xe2\x80\x88",
    "THIN SPACE" => "\xe2\x80\x89",
    "HAIR SPACE" => "\xe2\x80\x8a",
    "ZERO WIDTH SPACE" => "\xe2\x80\x8b",
    "NARROW NO-BREAK SPACE" => "\xe2\x80\xaf",
    "MEDIUM MATHEMATICAL SPACE" => "\xe2\x81\x9f",
    "IDEOGRAPHIC SPACE" => "\xe3\x80\x80",
);

OTHER TIPS

Years later, this question still has top results on Google when looking for unicode whitespace characters. devio's answer is great, but incomplete. As of this writing (October 2017) Wikipedia has a list of whitespace characters here: https://en.wikipedia.org/wiki/Whitespace_character

This list has specifies 25 code points, whereas the currently accepted answer lists 18. Including the seven other code points, the list is:

U+0009  character tabulation
U+000A  line feed
U+000B  line tabulation
U+000C  form feed
U+000D  carriage return
U+0020  space
U+0085  next line
U+00A0  no-break space
U+1680  ogham space mark
U+180E  mongolian vowel separator
U+2000  en quad
U+2001  em quad
U+2002  en space
U+2003  em space
U+2004  three-per-em space
U+2005  four-per-em space
U+2006  six-per-em space
U+2007  figure space
U+2008  punctuation space
U+2009  thin space
U+200A  hair space
U+200B  zero width space
U+200C  zero width non-joiner
U+200D  zero width joiner
U+2028  line separator
U+2029  paragraph separator
U+202F  narrow no-break space
U+205F  medium mathematical space
U+2060  word joiner
U+3000  ideographic space
U+FEFF  zero width non-breaking space

http://en.wikipedia.org/wiki/Space_%28punctuation%29#Spaces_in_Unicode

Unfortunately, it doesn't give UTF-8, but it does have the character in the web page, so you could cut and paste into your editor (if it saves in UTF-8). Alternatively, http://www.fileformat.info/info/unicode/char/180E/index.htm gives UTF-8 (replace "180E" with the hex UTF-16 value you are looking up).

This also gives a couple extra characters that @devio's excellent answer misses.

0x9 b'\t'
0xa b'\n'
0xb b'\x0b'
0xc b'\x0c'
0xd b'\r'
0x20 b' '
0x85 b'\xc2\x85'
0xa0 b'\xc2\xa0'
0x1680 b'\xe1\x9a\x80'
0x180e b'\xe1\xa0\x8e'
0x2000 b'\xe2\x80\x80'
0x2001 b'\xe2\x80\x81'
0x2002 b'\xe2\x80\x82'
0x2003 b'\xe2\x80\x83'
0x2004 b'\xe2\x80\x84'
0x2005 b'\xe2\x80\x85'
0x2006 b'\xe2\x80\x86'
0x2007 b'\xe2\x80\x87'
0x2008 b'\xe2\x80\x88'
0x2009 b'\xe2\x80\x89'
0x200a b'\xe2\x80\x8a'
0x200b b'\xe2\x80\x8b'
0x200c b'\xe2\x80\x8c'
0x200d b'\xe2\x80\x8d'
0x2028 b'\xe2\x80\xa8'
0x2029 b'\xe2\x80\xa9'
0x202f b'\xe2\x80\xaf'
0x205f b'\xe2\x81\x9f'
0x2060 b'\xe2\x81\xa0'
0x3000 b'\xe3\x80\x80'
0xfeff b'\xef\xbb\xbf'
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top