Question

I currently set mbstring.func_overload = 7 so that the standard string functions work with UTF-8.

I am thinking of refactoring all function calls to use the mb_* functions instead.

Do you think this is necessary, or will PHP 6 or a newer version solve the multibyte problem in another way?


Solution

for string that are utf-8 (of course)

Yes, of course. There are many things you can do with strings though. UTF-8 is backwards compatible with ASCII. If you only want to operate on the ASCII characters of a string, it may or may not make a difference. It depends on what you need to do with your strings.

If you want a direct answer: No, you should not refactor every function call to an mb_ function, because that is likely overkill. Should you check, for each of your use cases, whether a multibyte UTF-8 string can affect the result, and refactor accordingly? Yes.
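As a rough illustration of where the difference shows up, here is a small sketch. It assumes the mbstring extension is loaded and that mbstring.func_overload is off (0), so strlen() and substr() count bytes. The sample strings are made up for the example.

```php
<?php
// "café" written with explicit bytes: "é" is one character but two bytes in UTF-8.
$s = "caf\xC3\xA9";

// Counting: the byte-wise and character-wise functions disagree.
$bytes = strlen($s);              // number of bytes
$chars = mb_strlen($s, 'UTF-8');  // number of characters

// Splitting on an ASCII delimiter is safe with the byte-wise function,
// because bytes inside a UTF-8 multibyte sequence are never ASCII bytes.
$parts = explode(',', "caf\xC3\xA9,th\xC3\xA9");

// Truncating is not safe: the byte-wise cut lands in the middle of "é".
$cutBytes = substr($s, 0, 4);              // ends with a dangling lead byte
$cutChars = mb_substr($s, 0, 4, 'UTF-8');  // the full four characters
```

Calls like the explode() one can stay as they are; calls like the substr() one are the ones worth refactoring.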

OTHER TIPS

Overloading is not recommended if you use libraries that other people wrote, or if you publish a library or framework for others. Here are three reasons.

  1. Overloading can break libraries that don't expect it.
  2. Your framework can break in environments where overloading is not enabled.
  3. Depending on overloading reduces the number of prospective users of your framework, because of reason 2.

A good example of reason 1 is miscalculating the byte size for the HTTP Content-Length field with strlen. The cause is that the overloaded strlen function does not return the number of bytes but the number of characters. You can see real-world issues in CakePHP and Zend_Http_Client.
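One way a library can protect itself against this is to ask mbstring for the byte count explicitly. The sketch below is only an illustration: body_byte_length() is a hypothetical helper name, not something taken from CakePHP or Zend_Http_Client. It relies on the '8bit' encoding of mb_strlen(), which treats every byte as one character.

```php
<?php
// Hypothetical helper: return the length of $body in bytes,
// whether or not strlen() has been overloaded by mbstring.func_overload.
function body_byte_length($body)
{
    if (function_exists('mb_strlen')) {
        // With the '8bit' encoding each byte counts as one character.
        return mb_strlen($body, '8bit');
    }
    // Without mbstring there is no overloading, so strlen() counts bytes.
    return strlen($body);
}

$body = "caf\xC3\xA9";  // "café": 4 characters, 5 bytes in UTF-8
$header = 'Content-Length: ' . body_byte_length($body);
```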

Edit: deprecating mbstring.func_overload is under consideration for PHP 5.5 or 5.6 (according to a mail from the mbstring maintainer in April 2012), so you should avoid mbstring.func_overload now.

The recommended policy for handling multibyte characters across various platforms is to use mbstring, intl, or iconv directly. If you really need fallback functions for handling multibyte characters, use function_exists().
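A minimal sketch of the function_exists() approach, for counting characters. The name utf8_length() is hypothetical, and the last branch assumes the input is valid UTF-8: it counts every byte that is not a continuation byte (0x80 to 0xBF), which equals the number of characters in a well-formed string.

```php
<?php
// Hypothetical helper: count the characters of a UTF-8 string using
// whichever extension is available, with a pure-PHP last resort.
function utf8_length($str)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($str, 'UTF-8');
    }
    if (function_exists('iconv_strlen')) {
        return iconv_strlen($str, 'UTF-8');
    }
    // Fallback (assumes valid UTF-8): count bytes that start a character,
    // i.e. everything except continuation bytes 0x80-0xBF.
    return preg_match_all('/[^\x80-\xBF]/', $str);
}
```

The check runs on every call here for brevity; a library would more likely define the fallback once at load time, only when the extension function is missing.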

You can see examples of this in WordPress and MediaWiki.

  1. WordPress: wp-includes/compat.php
  2. MediaWiki: Fallback Class

Some CMSes, such as Drupal (unicode.inc), introduce a multibyte abstraction layer.

I don't think an abstraction layer is a good idea. In most cases fewer than 10 multibyte functions are needed, and the multibyte functions are easy to use directly. A layer that switches the handling to mbstring, intl, or iconv depending on which of these modules are installed also decreases performance.

The minimum requirement for handling multibyte characters is mb_substr() plus handling of invalid byte sequences. You can see fallback functions for mb_substr() in the CMSes above.
I wrote about handling invalid byte sequences in this answer: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems
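To make those two requirements concrete, here is a rough sketch. Both function names are hypothetical and the code is not taken from WordPress, MediaWiki, or Drupal. The first function is a pure-PHP substitute for mb_substr() built on the PCRE /u modifier; the second replaces invalid byte sequences using mbstring, so it assumes that extension is loaded. How many substitute characters an invalid sequence produces can differ between PHP versions.

```php
<?php
// Hypothetical fallback for mb_substr() when mbstring is not installed.
// Returns false if $str is not valid UTF-8.
function utf8_substr_fallback($str, $start, $length = null)
{
    // With /u, "." matches one whole UTF-8 character; /s lets it match "\n".
    // preg_match_all() returns false when the subject is invalid UTF-8.
    if (preg_match_all('/./us', $str, $matches) === false) {
        return false;
    }
    // A null $length means "up to the end", as in mb_substr().
    return implode('', array_slice($matches[0], $start, $length));
}

// Hypothetical helper: replace invalid byte sequences with "?".
// Requires the mbstring extension.
function utf8_scrub($str)
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character(0x3F);          // 0x3F is "?"
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the setting
    return $clean;
}
```

Scrubbing input once at the boundary means the substring code never has to deal with the false return value.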

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow