Question

Like many other PHP developers, I have had issues with character encoding. This question outlines the steps I go through to ensure that my data is saved and output as UTF-8. I would like advice on what else I should consider or change in my current approach.

I have a MySQL database with DEFAULT CHARACTER SET utf8, and my tables have the collation utf8_general_ci.
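One piece that sits alongside the table collation is the charset of the client connection itself; if it is not set, MySQL may transcode data on the way in or out. A minimal sketch using mysqli (host, credentials, and database name here are placeholders):

```php
<?php
// Sketch: the connection must also speak utf8, otherwise MySQL
// transcodes between the client charset and the column charset.
// Connection details below are illustrative placeholders.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8'); // matches the utf8_general_ci tables
```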

I am using a PHP script to read data from an RSS feed and save that data to my database. Before saving, I check whether the data is UTF-8 by doing the following:

protected function _convertToUTF8($content) {
    $enc = mb_detect_encoding($content);
    return mb_convert_encoding($content, "UTF-8", $enc);
}
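A slightly more defensive variant of the helper above (a sketch, with an assumed candidate list) first checks whether the input already validates as UTF-8, and only falls back to detection in strict mode when it does not:

```php
<?php
// Hypothetical helper: prefer validation over detection.
function convertToUtf8(string $content): string
{
    // If the bytes already form valid UTF-8, leave them untouched.
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content;
    }

    // Detection is only a guess: constrain it to a known candidate
    // list and enable strict mode so invalid matches are rejected.
    $enc = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);

    // ISO-8859-1 maps every byte, so it is a safe last-resort source.
    return mb_convert_encoding($content, 'UTF-8', $enc !== false ? $enc : 'ISO-8859-1');
}
```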

When outputting this data to a webpage, I set the headers in PHP:

header("Content-type: text/html; charset=utf-8");

and I also set the Content-Type meta tag to UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

So far everything works as expected: I get no funny characters in the output and all is going smoothly. But should I be changing or considering anything else when dealing with this data?

The problem I am now having is outputting this data to a text (CSV) file. I am using fwrite(), which successfully creates the file, but the third party I am passing the file to says it is not UTF-8. I am not sure the data is being output as UTF-8; how can I check this? When I log into the remote server over SSH, cat shows Itâs, vim shows Itâ~@~Ys, and less shows It<E2><80><99>s. What am I missing here?
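(For context, the bytes E2 80 99 that less shows are exactly the UTF-8 encoding of a right single quote, U+2019, which suggests the file already contains UTF-8 and the SSH terminal is simply rendering it as Latin-1.) One way to verify this from PHP is to validate the raw bytes of the file; a sketch, with a hypothetical helper name:

```php
<?php
// Hypothetical check: do a file's bytes form valid UTF-8?
function isUtf8File(string $path): bool
{
    $bytes = file_get_contents($path);
    return $bytes !== false && mb_check_encoding($bytes, 'UTF-8');
}
```

From the shell, `file -bi export.csv` also reports the charset it detects for the file.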

Thanks in advance!


Solution 2

In the end, a BOM was required for the external application to read the file properly.
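Prepending the UTF-8 BOM is a three-byte write before the CSV content; a minimal sketch with a hypothetical helper (path and rows are illustrative):

```php
<?php
// Hypothetical helper: write CSV rows prefixed with a UTF-8 BOM
// (EF BB BF) so BOM-sniffing consumers recognise the file as UTF-8.
function writeCsvWithBom(string $path, array $rows): void
{
    $fh = fopen($path, 'w');
    fwrite($fh, "\xEF\xBB\xBF");  // the three-byte UTF-8 BOM
    foreach ($rows as $row) {
        fputcsv($fh, $row);       // regular UTF-8 CSV content
    }
    fclose($fh);
}
```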

OTHER TIPS

You cannot detect the encoding of arbitrary data. Encoding is always meta-information that travels alongside the data itself.

Even though mb_detect_encoding() tries its best to do so, you should never use it to handle data automatically. Since the encoding cannot be reliably determined from the data itself, this function cannot determine it either.

Don't rely on it. Use it only for manual inspection when debugging a problem, or as a last-resort fallback, but never in standard data processing. And even then, do not trust that information too much.

How can I say so? Just one example: a text can be valid US-ASCII, and a detection routine will report it as valid UTF-8, because US-ASCII is a strict subset of UTF-8. And that is just one example; the truth is far more complex.

So take it for granted that you cannot detect the encoding from the raw data alone.

Instead, look for the meta-information that specifies the encoding. If no encoding information is given, look up the default encoding in the specification documents for the data's transport format.

In your case of storing data from RSS feeds, look for the information in the HTTP response headers and/or the XML prologue, which normally declares the document's encoding (for example, encoding="ISO-8859-1").
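Reading the declaration out of the XML prologue is a small regex match; a sketch, with a hypothetical helper name:

```php
<?php
// Sketch: read the encoding attribute from an XML declaration,
// e.g. encoding="ISO-8859-1". Per the XML spec, a document with
// no declaration (and no BOM) defaults to UTF-8.
function xmlDeclaredEncoding(string $xml): string
{
    if (preg_match('/^<\?xml[^>]*encoding=["\']([^"\']+)["\']/', $xml, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8';
}
```

The HTTP Content-Type response header (its charset parameter) takes precedence over the prologue when both are present.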

As your database expects data encoded as UTF-8, your processing must ensure that only UTF-8 data is put into the database. So determine the declared encoding of the incoming data and then perform the conversion steps needed, but do not rely on mb_detect_encoding() to perform them.
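Once the declared encoding is known (from the header or prologue), the conversion is an explicit call rather than a guess; a minimal sketch with a hypothetical helper:

```php
<?php
// Sketch: convert feed text to UTF-8 using the *declared* source
// encoding ($declared would come from the HTTP header or the XML
// prologue), instead of guessing with mb_detect_encoding().
function toUtf8(string $content, string $declared): string
{
    if (strcasecmp($declared, 'UTF-8') === 0) {
        return $content; // already the target encoding
    }
    return mb_convert_encoding($content, 'UTF-8', $declared);
}
```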

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow