iconv with ASCII//TRANSLIT triggers ErrorException: "iconv(): Detected an illegal character in input string"

StackOverflow https://stackoverflow.com/questions/21702816

  •  09-10-2022

Question

First of all, I should say that I am new to multilingual conversions.

I have strings that I want to lowercase as UTF-8 if possible (something like a clean URL), and I use

$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);

to achieve my requirement (a lowercase UTF-8 string).

However, when I stress-test that function with "çokGüŞelLl" using CocoaRestClient, I get Ã as $str (thanks to my client?) and iconv triggers an error complaining about an illegal character in the input string (Ã).

What is the problem with iconv? The string is already encoded as UTF-8 by utf8_encode($str). How can it contain an illegal character?

Note: I have read the questions here that suggest suppressing the error with @iconv, but I don't think ending up with empty database entries is a good solution.


Thanks for all the answers; I will read and try to understand each of them.


Solution 2

If you encode çokGüŞelLl as UTF-8 you should get the following bytes:

var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"

That's a check you must do. You also have this:

utf8_encode($str)

Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.

So, whatever reason you have for treating your original UTF-8 (as stored in the DB) as ISO-8859-1 and re-encoding it, I'm afraid it's corrupting your data.

Other suggestions

The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
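To make the "funny results" concrete, here is a minimal sketch of what happens to a single character. It assumes the mbstring extension is available and uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), which is deprecated as of PHP 8.2; the variable names are only illustrative.

```php
<?php
// "ç" as UTF-8 bytes, written with escapes so the file's own encoding does not matter.
$utf8 = "\xC3\xA7";
echo bin2hex($utf8), "\n"; // c3a7

// Stand-in for utf8_encode(): treat every byte as an ISO-8859-1 character
// and re-encode it as UTF-8. Each of the two bytes becomes its own character.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // c383c2a7

// Reversing the step recovers the original bytes for this input.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8); // bool(true)
```

The bytes c383 are the UTF-8 encoding of Ã, which is consistent with the Ã reported in the question.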

Ensure that your data is proper UTF-8 before saving it to your database:

// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
    throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}

// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);

Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.

$string = $database->getSomeRecordWithUnicode();

echo mb_strtolower($string);

Done!

PS: If you want to ensure that your database uses exactly the same encoding as PHP, use either utf8mb4 as the character set (with utf8mb4_unicode_ci as the default collation for proper sorting) or a BLOB (binary) data type.

PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.

About HTML forms

Since you asked in the comments on your question: how data is sent to your server has nothing to do with the locale the user has set in their operating system. It depends on the client's browser. Modern browsers submit form data in the page's encoding, which is normally UTF-8. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept UTF-8. Drupal does that on all its forms.

<!doctype html>
<html>
<body>
    <form accept-charset="UTF-8">

Now all browsers should encode the data they submit as UTF-8.

You're double encoding. First you set your database to UTF-8, which means your data is already UTF-8 encoded. Then you call utf8_encode() on the string you pass to iconv(), but that input is already UTF-8. Try removing the utf8_encode() call from the iconv() line.
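As a rough sketch of that suggestion, the three lines from the question could look like this without utf8_encode(). The function name cleanSlug() is hypothetical, mbstring and iconv are assumed to be available, and how non-ASCII letters are transliterated depends on the iconv implementation and the current locale, so no output is shown for such input.

```php
<?php
// Hypothetical helper; the name cleanSlug() is not from the original post.
function cleanSlug(string $str): string
{
    // Reject input that is not valid UTF-8 instead of re-encoding it.
    if (!mb_check_encoding($str, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // No utf8_encode() here: the input is already UTF-8.
    // The transliteration result is implementation- and locale-dependent.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    if ($ascii === false) {
        throw new RuntimeException('iconv() could not convert the input.');
    }

    // Keep only letters, digits and underscores, as in the question.
    $ascii = (string) preg_replace('/[^a-zA-Z0-9_]/', '', $ascii);

    // Only ASCII is left at this point, so strtolower() is sufficient.
    return strtolower($ascii);
}

echo cleanSlug('Hello World_1'), "\n"; // helloworld_1
```

Throwing on invalid input, rather than silencing iconv with @, avoids the empty database entries the question is worried about.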

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow