Why is iconv generating an illegal character error?

https://stackoverflow.com/questions/12528430

PHP
php-5.2

03-07-2021

Question

I'm trying to iron out the warnings and notices from a script. The script includes the following:

$clean_string = iconv('UTF-8', 'UTF-8//IGNORE', $supplier.' => '.$product_name);

As I understand it, the purpose of this line, as intended by the original author of the script, is to remove non-UTF-8 characters from the string, but obviously any non-UTF-8 characters in the input will cause iconv to throw an illegal character warning.

To solve this, my idea was to do something like the following:

$clean_string = iconv(mb_detect_encoding($supplier.' => '.$product_name), 'UTF-8//IGNORE', $supplier.' => '.$product_name);

Oddly however, mb_detect_encoding() is returning UTF-8 as the detected encoding!

The letter e with an accent (é) is an example of a character that causes this behaviour.

I realise I'm mixing multibyte libraries between detection and conversion, but I couldn't find an encoding detection function in the iconv library.

I've considered using the mb_convert_encoding() function to clean the string up into UTF-8, but the PHP documentation isn't clear what happens to characters that cannot be represented.

I am using PHP 5.2.17, and with the glibc iconv implementation, library version 2.5.

Can anyone offer any suggestions on how to clean the string into UTF-8, or insight into why this behaviour occurs?

Solution

Your example:

$string     = $supplier . ' => ' . $product_name;
$stringUtf8 = iconv('UTF-8', 'UTF-8//IGNORE', $string);

With PHP 5.2 this might work for you. In later PHP versions, if the input is not strictly valid UTF-8, iconv will drop the whole string (you will get an empty string). I mention this only in case you were not aware of it.
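Because the result of a failed conversion differs between versions, it can help to check the return value explicitly instead of trusting it. The following is only a minimal sketch: the helper name `clean_utf8_or_null` is hypothetical, and it assumes the mbstring extension is loaded so that `mb_check_encoding()` is available.

```php
<?php
// Hypothetical helper (not part of the original script): returns a valid
// UTF-8 string, or null when no usable result could be produced.
function clean_utf8_or_null($string)
{
    // Already valid UTF-8: nothing needs to be stripped.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }

    // Suppress the notice/warning and inspect the return value instead.
    $converted = @iconv('UTF-8', 'UTF-8//IGNORE', $string);

    // Depending on the PHP and iconv versions, a failed conversion may come
    // back as false, so treat anything that is not valid UTF-8 as a failure.
    if ($converted === false || !mb_check_encoding($converted, 'UTF-8')) {
        return null;
    }

    return $converted;
}
```

A caller can then decide what to do with `null` (log the input, fall back to another source encoding, and so on) rather than silently continuing with an empty value.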

Then you try mb_detect_encoding to find out the original encoding:

$string     = $supplier . ' => ' . $product_name;
$encoding   = mb_detect_encoding($string);
$stringUtf8 = iconv($encoding, 'UTF-8//IGNORE', $string);

As I already linked in a comment, mb_detect_encoding relies on guesswork and cannot work reliably. It tries to help you, but it cannot detect the encoding very well. That is inherent to the problem: the same bytes can be valid in more than one encoding. You can try setting strict mode to true:

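// Use the currently configured detection order as the candidate list.
$order    = mb_detect_order();
// Third argument true enables strict mode.
$encoding = mb_detect_encoding($string, $order, true);
if (false === $encoding) {
    // None of the candidate encodings matched: fail loudly instead of guessing.
    throw new UnexpectedValueException(
        sprintf(
            'Unable to detect input encoding with mb_detect_encoding, order was: %s',
            implode(', ', $order)
        )
    );
}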
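To see what strict mode changes for the character mentioned in the question, here is a small sketch. It assumes the problematic input holds "é" as the single ISO-8859-1 byte 0xE9 (the question does not say which encoding the data is really in), and the candidate list is likewise an assumption.

```php
<?php
// "é" as a single ISO-8859-1 byte; on its own this is not valid UTF-8.
$latin1 = "caf\xE9";
// The same word correctly encoded as UTF-8 (two bytes for "é").
$utf8 = "caf\xC3\xA9";

// Strict mode with UTF-8 as the only candidate: the invalid input is rejected.
$strictSingle = mb_detect_encoding($latin1, array('UTF-8'), true);

// Strict mode with an explicit candidate list: UTF-8 is ruled out because the
// bytes are invalid, so the remaining candidate is reported.
$strictList = mb_detect_encoding($latin1, array('UTF-8', 'ISO-8859-1'), true);

// Valid UTF-8 passes the strict check.
$strictValid = mb_detect_encoding($utf8, array('UTF-8'), true);

var_dump($strictSingle, $strictList, $strictValid);
```

Note that a match against ISO-8859-1 says little, because every byte sequence is valid in that encoding. It only tells you that the earlier candidates were ruled out.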

Beyond that, you might also need to translate the encoding names (and/or validate them against the supported encodings) between the two libraries (iconv and mbstring), because they do not always use the same name for the same encoding.
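One possible way to do that validation is to probe iconv with the name before using it. This is a sketch under assumptions: the helper names `iconv_supports` and `to_iconv_name` are hypothetical, the alias table is purely illustrative, and the probe relies on iconv() returning false when it does not recognise a charset name.

```php
<?php
// Hypothetical helper: true when iconv accepts $name as a source charset.
function iconv_supports($name)
{
    // Converting an empty string is cheap; an unknown charset name makes
    // iconv() return false (the accompanying warning is suppressed here).
    return @iconv($name, 'UTF-8', '') !== false;
}

// Hypothetical helper: maps an mbstring name to one iconv accepts, or false.
function to_iconv_name($mbName, $aliases)
{
    // Prefer an explicit alias, otherwise try the mbstring name unchanged.
    $name = isset($aliases[$mbName]) ? $aliases[$mbName] : $mbName;

    return iconv_supports($name) ? $name : false;
}

// Illustrative alias table only; fill it with the names your data needs.
$aliases = array('SJIS-win' => 'CP932');
```

With this in place, the name reported by mb_detect_encoding can be passed through `to_iconv_name()` first, and a `false` result handled as an error instead of being handed to iconv.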

I hope this helps you better understand why some things might not work, and how you can find the error cases and filter the input using the standard PHP extensions.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow