Storing a serialized object in a MySQL database

https://stackoverflow.com/questions/9109097

21-04-2021

Question

I have a big PHP object that I want to serialize and store in a MySQL database. The table encoding is UTF-8, and the encoding of the column that holds the serialized object is also UTF-8.

The problem is the object holds a text string containing French characters.

For example:

Merci d'avoir passé commande avec Lovre. Voici le récapitulatif de votre commande

When I serialize the object and then unserialize it again directly, the string is preserved in the correct format.

However, when I store the serialized object in a MySQL database, retrieve it again, and then unserialize it, the string becomes this:

Merci d'avoir passÃ© commande avec Lovre. Voici le rÃ©capitulatif de votre commande

Something goes wrong when I store the object in the database.

Notes:

The object is stored using the Propel ORM.
The column type is text.
The string is stored in and read from an HTML file.

Solution

The strings created by serialize are binary strings: they don't have a specific charset encoding but are just an "array" of bytes (where one byte is 8 bits, an octet).

If you now take such a string and tell your database that it is LATIN-1 encoded and your database stores it into a text-field with UTF-8 encoding, the database will transparently change the encoding from LATIN-1 into UTF-8. UTF-8 is a charset encoding that uses more than one byte per character for some characters, for example those you give in your question like é.

The character é is then stored as Ã© inside the database: its two UTF-8 bytes (0xC3 0xA9) are read as the two LATIN-1 characters Ã and ©, and each of those is re-encoded to UTF-8.

If you now fetch the data from the database without specifying in which encoding you need it, the database will return it as UTF-8.

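Now unserialize has a problem because the binary string has been modified in a way that makes it invalid: serialize records the byte length of every string it writes, and that length no longer matches once the conversion has added bytes.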
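A minimal sketch of one way this corruption can arise, assuming the mbstring extension is available. The database conversion is only simulated here with mb_convert_encoding(); the variable names are illustrative and nothing in it is taken from the asker's code.

```php
<?php
// "passé" as UTF-8 bytes; é is the two bytes 0xC3 0xA9.
$text = "pass\xC3\xA9";

// serialize() stores the BYTE length of the string: s:6:"passé";
$serialized = serialize($text);

// Simulate the database treating those bytes as LATIN-1 and converting them
// to UTF-8 (assumption: a plain ISO-8859-1 to UTF-8 conversion).
$stored = mb_convert_encoding($serialized, 'UTF-8', 'ISO-8859-1');

var_dump(strlen($serialized));   // int(13)
var_dump(strlen($stored));       // int(15): é now occupies four bytes
var_dump(@unserialize($stored)); // bool(false): the recorded length 6 no longer fits
```

The @ only silences the notice that unserialize() raises on malformed input; the false return value is what signals the failure.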

Instead, you need to either tell your database that it should not modify the encoding when it stores the serialized string, e.g. by choosing the right column type and encoding (a binary field such as BLOB, Binary Large Object, see the MySQL docs; see also Binary Types in the Propel docs) -or- revert the charset encoding back to the original format when you fetch the data from the database. The first approach (binary field) is better because the bytes are then stored and returned unchanged, which is exactly what you're looking for.

For the data that has been already stored into the database in a wrong format, you need to correct the data. To do that you first need to find out which re-encoding was applied, e.g. from which charset to which charset. I assume it's LATIN-1 but there is no guarantee. You need to review the encoding of your current application data and processes to find out.

After you've found out, encode the values back from UTF-8 to the original encoding.
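As a rough sketch of that last step, assuming the re-encoding really was LATIN-1 to UTF-8 and was applied exactly once, the reversal could look like the following. The helper name undoLatin1ToUtf8 is hypothetical, the damage is simulated in the same script, and mbstring is required.

```php
<?php
/**
 * Hypothetical helper: undo a single LATIN-1 -> UTF-8 re-encoding.
 * Assumption: the source charset was ISO-8859-1; use the charset you
 * actually identified in your own application instead.
 */
function undoLatin1ToUtf8(string $stored): string
{
    return mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');
}

// Simulate a damaged row: serialize, then re-encode as the database would.
$original = serialize("r\xC3\xA9capitulatif"); // "récapitulatif" in UTF-8
$damaged  = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

$repaired = undoLatin1ToUtf8($damaged);

var_dump($repaired === $original);                           // bool(true)
var_dump(unserialize($repaired) === "r\xC3\xA9capitulatif"); // bool(true)
```

Run something like this against a copy of the data first. MySQL's latin1 is close to Windows-1252 rather than strict ISO-8859-1, so values containing bytes in the 0x80-0x9F range may need 'Windows-1252' as the target charset.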

Other tips

Make sure to use UTF-8 everywhere - it sounds like you missed something.

In your case, I think you've forgotten to set the correct charset for your database connection (using a SET NAMES statement or mysql_set_charset()) - but that's hard to say without seeing your code (and I don't know Propel).
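For illustration only: mysql_set_charset() belongs to the old mysql extension, which was removed in PHP 7. With PDO the connection charset can be given in the DSN, and with mysqli via set_charset(). The helper buildMysqlDsn and the host, database, and credential values below are hypothetical placeholders, and how Propel passes this setting through is not covered here.

```php
<?php
/**
 * Hypothetical helper: build a PDO MySQL DSN with an explicit connection charset.
 */
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// 'localhost' and 'shop' are placeholder values.
$dsn = buildMysqlDsn('localhost', 'shop');
echo $dsn, "\n"; // mysql:host=localhost;dbname=shop;charset=utf8

// With a reachable server and real credentials (placeholders here):
// $pdo = new PDO($dsn, 'user', 'secret');
//
// The mysqli equivalent, once connected:
// $mysqli->set_charset('utf8');
```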

The following is a quote from chazomaticus, who has given a perfect answer in "UTF-8 all the way through", listing all the points you have to take care of:

Storage:

Specify utf8_unicode_ci (or equivalent) collation on all tables and text columns in your database. This makes MySQL physically store and retrieve values natively in UTF-8.

Retrieval:

In PHP, in whatever DB wrapper you use, you'll need to set the connection charset to utf8. This way, MySQL does no conversion from its native UTF-8 when it hands data off to PHP. Note that if you don't use a DB wrapper, you'll probably have to issue a query to tell MySQL to give you results in UTF-8: SET NAMES 'utf8' (as soon as you connect).

Delivery:

You've got to tell PHP to deliver the proper headers to the client, so text will be interpreted as UTF-8. In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type header yourself, which is just more work but has the same effect.

Submission:

You want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.

Note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.

Although, on that front, you'll still want to verify every submitted string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously.
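A small sketch of that check, assuming mbstring is installed. The helper name requireUtf8 is made up for this example; in real code the value would come from the submitted form data rather than from a literal.

```php
<?php
/**
 * Hypothetical helper: accept a submitted value only if it is valid UTF-8.
 */
function requireUtf8(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $value;
}

$valid   = "pass\xC3\xA9"; // é encoded as UTF-8
$invalid = "pass\xE9";     // é encoded as LATIN-1; a lone 0xE9 is not valid UTF-8

var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)
```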

Processing:

This is, unfortunately, the hard part. You need to make sure that every time you process a UTF-8 string, you do so safely. Easiest way to do this is by making extensive use of PHP's mbstring extension.

PHP's string operations are NOT by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
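To make the difference concrete, here is a short comparison of a byte-oriented function and its mbstring counterpart on the same UTF-8 string. It assumes mbstring is installed; bin2hex() is used only to make the raw bytes visible.

```php
<?php
$word = "r\xC3\xA9cap"; // "récap" in UTF-8: 5 characters, 6 bytes

var_dump(strlen($word));             // int(6): counts bytes
var_dump(mb_strlen($word, 'UTF-8')); // int(5): counts characters

// Taking "the first two characters":
$byteCut = substr($word, 0, 2);             // cuts é in half
$charCut = mb_substr($word, 0, 2, 'UTF-8'); // keeps é whole

var_dump(bin2hex($byteCut)); // string(4) "72c3"
var_dump(bin2hex($charCut)); // string(6) "72c3a9"
```

The substr() result ends in a dangling 0xC3 byte, so it is no longer valid UTF-8.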

To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.

Also, I feel like this should be said somewhere, even though it may seem obvious: every PHP or HTML file you'll be serving should be encoded in valid UTF-8.

Note that you don't need to use UTF-8 - the important part is to use the same charset everywhere, whatever charset that might be. But if you need to change things anyway, use UTF-8.

I always store serialized data using base64_encode(). Serialized data sometimes causes problems, but in its base64-encoded form only simple ASCII characters remain.
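A minimal sketch of that round trip, with an illustrative array standing in for the real object:

```php
<?php
// Illustrative data; "pass\xC3\xA9" is "passé" in UTF-8.
$data = ['message' => "pass\xC3\xA9 commande"];

// The encoded value contains only A-Z, a-z, 0-9, "+", "/" and "=",
// so a charset conversion in the database cannot change its bytes.
$encoded = base64_encode(serialize($data));
var_dump(preg_match('/^[A-Za-z0-9+\/=]+$/', $encoded)); // int(1)

// Reading it back reverses both steps.
$restored = unserialize(base64_decode($encoded));
var_dump($restored === $data); // bool(true)
```

The trade-off is size: base64 output is about a third larger than its input, and the stored value is no longer readable in the database.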

I strongly recommend using json_encode instead of serialize. Some day you will find yourself trying to use that data from somewhere that is not PHP, and having it stored as JSON makes it readable everywhere; JSON is a well-established standard, and virtually every language supports decoding it.
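A short sketch of the JSON variant, again with an illustrative array rather than the asker's object:

```php
<?php
// Illustrative data; "pass\xC3\xA9" is "passé" in UTF-8.
$data = ['message' => "pass\xC3\xA9 commande"];

// By default json_encode() escapes non-ASCII characters, so the output is
// plain ASCII and unaffected by charset conversions.
$json = json_encode($data);
echo $json, "\n"; // {"message":"pass\u00e9 commande"}

// JSON_UNESCAPED_UNICODE keeps the UTF-8 bytes as they are.
$readable = json_encode($data, JSON_UNESCAPED_UNICODE);

// Decode back into an associative array.
$decoded = json_decode($json, true);
var_dump($decoded === $data); // bool(true)
```

Unlike unserialize, json_decode does not restore PHP classes: objects come back as stdClass instances or arrays, and only public properties are encoded by default. For a large object, that may mean writing your own conversion to and from arrays.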

The answer about using utf8 everywhere holds! :-D

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow