Question

Can anyone explain this weird behavior of the Unicode strlen function in PHP's intl extension?

var_dump(grapheme_strlen("a\r\n"));  // int(3): ASCII 'a', then CR and LF counted separately
var_dump(grapheme_strlen("の\r\n")); // int(2): 'の', then CR LF counted as one
var_dump(grapheme_strlen("\r\n"));   // int(2): CR and LF counted separately

It seems grapheme_strlen counts "\r\n" (CR LF, two separate code points used as the line ending on Windows) as a single grapheme. That would be reasonable given the function's name, but it only happens when the string contains a non-ASCII character. In a pure-ASCII string, CR and LF are counted separately. Why?


Answer

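This is a bug. grapheme_strlen should follow the grapheme cluster boundary rules defined in Unicode Standard Annex #29 (Unicode Text Segmentation). The standard explicitly says not to break between CR and LF (rule GB3), so "\r\n" should always count as one grapheme cluster, whatever else is in the string.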
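For comparison, here is a minimal sketch of the counts UAX #29 calls for, using the PCRE \X escape instead of the intl extension. count_grapheme_clusters is a hypothetical helper name, and the sketch assumes a PHP build whose PCRE library implements \X as an extended grapheme cluster (PCRE 8.32 or later, or PCRE2). Older PCRE versions used a simpler definition of \X and may give different numbers.

```php
<?php
// Hypothetical helper: counts extended grapheme clusters with PCRE's \X.
// Assumption: the bundled PCRE treats CR LF as a single cluster (UAX #29, GB3).
function count_grapheme_clusters(string $s): int
{
    $n = preg_match_all('/\X/u', $s);
    if ($n === false) {
        // preg_match_all() returns false on failure, e.g. invalid UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $n;
}

var_dump(count_grapheme_clusters("a\r\n"));  // int(2)
var_dump(count_grapheme_clusters("の\r\n")); // int(2)
var_dump(count_grapheme_clusters("\r\n"));   // int(1)
```

Under that assumption, CR LF counts as one cluster in all three strings, which is what grapheme_strlen would be expected to return as well.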

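If you have a look at the PHP source, grapheme_strlen takes a shortcut for ASCII-only strings: it simply returns the byte length without running the grapheme cluster logic at all. That is why CR and LF are counted separately in "a\r\n" and "\r\n", but as one cluster in "の\r\n", where the non-ASCII character forces the full segmentation path.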
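The shortcut can be modelled in plain PHP to show how it produces the numbers from the question. This is only an illustrative sketch, not the actual C code from the intl extension: model_grapheme_strlen is a hypothetical name, and the non-ASCII branch uses PCRE's \X as a stand-in for ICU's break iterator (assuming a PCRE version where \X keeps CR LF together).

```php
<?php
// Illustrative model of the behavior described above, not the real source.
function model_grapheme_strlen(string $s): int
{
    // Fast path: no byte above 0x7F means the string is pure ASCII,
    // so the byte length is returned and CR LF is never examined.
    if (preg_match('/[^\x00-\x7F]/', $s) === 0) {
        return strlen($s);
    }

    // Slow path: count grapheme clusters (stand-in for the ICU iterator).
    $n = preg_match_all('/\X/u', $s);
    if ($n === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $n;
}

var_dump(model_grapheme_strlen("a\r\n"));  // int(3): ASCII fast path
var_dump(model_grapheme_strlen("の\r\n")); // int(2): full segmentation
var_dump(model_grapheme_strlen("\r\n"));   // int(2): ASCII fast path
```

The fast path is only valid if every ASCII byte is its own grapheme cluster, and CR LF is the one ASCII sequence for which that does not hold.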

License: CC-BY-SA with attribution
Not affiliated with StackOverflow