Question

Normally I would just do this.

$str = preg_replace('#(\d+)#', ' $1 ', $str);

If I knew the input would always be UTF-8, I would add the lowercase "u" modifier to the pattern and I think I would be fine. But because of reports that UTF-8 can take 2x, and in some cases 3x, the storage space that the native character set would need, I'm trying not to restrict the application to UTF-8.

Thus, I'm trying to stay away from my favorite preg_ functions.

Most things have been fairly simple so far, but I'm a little stuck on replacements where I'd normally use character classes in preg_ such as "\d".


Solution

Implement a storage wrapper with mb_convert_encoding so internally you only have to manipulate UTF-8.

(I still think you should require UTF-8 and save everyone a lot of trouble.)
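A minimal sketch of what such a wrapper could look like, assuming the mbstring extension is loaded and PHP 7 or later. The function names (fromStorage, toStorage, spaceOutDigits) and the $storageEncoding parameter are hypothetical, chosen only for illustration; mb_convert_encoding and preg_replace are the real calls.

```php
<?php
// Hypothetical helper: convert from the storage encoding to UTF-8.
function fromStorage(string $raw, string $storageEncoding): string
{
    return mb_convert_encoding($raw, 'UTF-8', $storageEncoding);
}

// Hypothetical helper: convert from UTF-8 back to the storage encoding.
function toStorage(string $utf8, string $storageEncoding): string
{
    return mb_convert_encoding($utf8, $storageEncoding, 'UTF-8');
}

// All string manipulation happens on UTF-8, so the "u" modifier is safe here.
function spaceOutDigits(string $raw, string $storageEncoding): string
{
    $utf8 = fromStorage($raw, $storageEncoding);
    $result = preg_replace('#(\d+)#u', ' $1 ', $utf8);
    if ($result === null) {
        // preg_replace returns null on error, e.g. malformed UTF-8 input.
        throw new RuntimeException('preg_replace failed');
    }
    return toStorage($result, $storageEncoding);
}

// Usage: "café42x" stored as ISO-8859-1 (é is the single byte 0xE9).
$stored = "caf\xE9" . "42x";
$spaced = spaceOutDigits($stored, 'ISO-8859-1');
```

The data stays in its native encoding at rest, and only the in-memory copy is UTF-8, so the storage-size concern from the question does not apply to what is written back.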

OTHER TIPS

UTF-8 is designed so that any byte in the encoded output with a value of 127 or less is always the ASCII character with that value, and never part of a multibyte sequence. So in this situation you can keep treating the string as if it were ASCII without causing problems, since spaces and digits are all ASCII.

See the description at http://en.wikipedia.org/wiki/UTF-8, which shows that every byte in a multibyte sequence has its most significant bit set (i.e. is > 127).
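As a quick illustration of the point above, here is a small sketch that runs a byte-oriented replacement (no "u" modifier) on a UTF-8 string and checks that the multibyte characters survive. The sample string is made up, and the pattern uses [0-9] instead of \d purely to be explicit that only ASCII digits are matched; the validity check assumes the mbstring extension is available.

```php
<?php
// "café123€" written as raw UTF-8 bytes: é = C3 A9, € = E2 82 AC.
$str = "caf\xC3\xA9" . "123" . "\xE2\x82\xAC";

// No "u" modifier: the pattern is matched byte by byte.
// Every byte of é and € is above 127, so [0-9] can never match inside them.
$out = preg_replace('#([0-9]+)#', ' $1 ', $str);

// The result is still well-formed UTF-8.
$stillValid = mb_check_encoding($out, 'UTF-8');

echo $out, "\n"; // prints "café 123 €"
```

The inserted spaces are themselves single ASCII bytes, so they cannot split a multibyte sequence either.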

Licensed under: CC-BY-SA with attribution