If your input is UTF-8 encoded and you want to count Unicode graphemes, you can do this:
$count = preg_match_all('/\X/u', $text);
Here is some explanation. A Unicode grapheme (more precisely, a grapheme cluster) is what a reader perceives as a single character: a base codepoint plus any combining marks that follow it.
mb_strlen($text, 'UTF-8') would count combining marks as separate characters (and strlen($text) would give you the total byte count).
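To make the difference concrete, here is a minimal sketch comparing the three counts on one string. The sample string is my own choice (an "e" followed by U+0301 COMBINING ACUTE ACCENT, written as raw UTF-8 bytes), and it assumes the mbstring extension is loaded:

```php
<?php
// Sample input (an assumption for illustration): "e" + U+0301 COMBINING ACUTE ACCENT.
// U+0301 is encoded in UTF-8 as the two bytes 0xCC 0x81.
$text = "e\xCC\x81";

$bytes      = strlen($text);                  // bytes: 3
$codepoints = mb_strlen($text, 'UTF-8');      // codepoints: 2
$graphemes  = preg_match_all('/\X/u', $text); // graphemes: 1

printf("%d bytes, %d codepoints, %d grapheme(s)\n", $bytes, $codepoints, $graphemes);
```

All three numbers are "the length" of the same string; which one you want depends on whether you care about storage, codepoints, or what the user sees.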
Since, judging by a comment of yours, your input could have some characters converted to their HTML entity equivalents, you should first run it through html_entity_decode():
$count = preg_match_all('/\X/u', html_entity_decode($text, ENT_QUOTES, 'UTF-8'));
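As a rough illustration of why the decoding step matters, the sketch below counts the same text with and without it. The sample string is made up for the example:

```php
<?php
// Hypothetical input containing HTML entities.
$text = 'caf&eacute; &amp; cr&egrave;me';

// Without decoding, every character of each entity is counted on its own.
$raw = preg_match_all('/\X/u', $text);

// After decoding, each entity collapses to the single character it stands for.
$decoded = preg_match_all('/\X/u', html_entity_decode($text, ENT_QUOTES, 'UTF-8'));

printf("raw: %d, decoded: %d\n", $raw, $decoded); // raw: 30, decoded: 12
```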
UPDATE
The intl extension (bundled with PHP since 5.3, and also available from PECL) now provides grapheme_strlen() and other grapheme_*() functions, provided the extension is installed and enabled.
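If you cannot be sure intl is available, one possible approach is a small wrapper that prefers grapheme_strlen() and falls back to the regex. This is only a sketch: count_graphemes() is a hypothetical helper name, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of PHP): counts graphemes in a UTF-8 string.
function count_graphemes(string $text): int
{
    // Prefer intl when it is loaded.
    if (function_exists('grapheme_strlen')) {
        $n = grapheme_strlen($text);
        if (is_int($n)) {
            return $n;
        }
        // grapheme_strlen() returned false or null; try the regex below.
    }

    // Fallback: \X matches one extended grapheme cluster in PCRE.
    $n = preg_match_all('/\X/u', $text);
    if ($n === false) {
        // With the /u modifier, preg_match_all() fails on invalid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $n;
}
```

Usage would then be something like count_graphemes(html_entity_decode($text, ENT_QUOTES, 'UTF-8')).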