Question

If a user writes a string of Arabic text into a Facebook comment and saves it, what character set and collation does the data store use?

I don't believe they use a MySQL table for comments, but I've experimented with the topic using a localhost MySQL table, where I stored some Arabic in a binary character column.

It transformed the text into what was presumably an escaped sequence of characters, and once saved, it stayed that way.

As for i18n: even when I have Facebook set to English, text typed in non-Western scripts still saves and displays correctly.

Any insight into how they've achieved this?


Solution

First: I don't know for sure, but I don't believe MySQL comes into play anywhere for this.

The right thing to do is store it as UTF-8 in <some-system>, period. That system might as well be MySQL, I guess. I don't know the specifics, but I do believe MySQL (and PHP, for that matter**) are not really up to par on UTF-8/Unicode support, so they can show some "glitches". For example, you need to execute "SET NAMES utf8" (or set the connection character set some other way) first thing after opening the connection for UTF-8 to work reliably, which might be why your test didn't work. Also, MySQL's utf8 character set only stores UTF-8 sequences of up to 3 bytes, not 4. [edit] That is fixed in 5.5+ through the separate utf8mb4 character set. [edit] Arabic is not affected either way: its characters need 2 or 3 bytes.
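To make the 2/3/4-byte point concrete, here is a small sketch in Python (not PHP or MySQL; it only inspects the UTF-8 encoding itself). The helper names utf8_width and fits_in_three_bytes are made up for this illustration, and the three sample characters are my own picks.

```python
import unicodedata


def utf8_width(ch: str) -> int:
    """Return how many bytes a single code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """Hypothetical check: True if every code point needs at most 3 bytes,
    i.e. the text would fit a storage layer limited to 3-byte UTF-8."""
    return all(utf8_width(ch) <= 3 for ch in text)


samples = [
    "\u0639",      # a letter from the basic Arabic block
    "\uFDFA",      # an Arabic presentation-form ligature
    "\U0001F600",  # an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    name = unicodedata.name(ch, "UNKNOWN")
    print(f"U+{ord(ch):04X} {name}: {utf8_width(ch)} byte(s)")
```

Under these assumptions, ordinary Arabic text passes fits_in_three_bytes, while anything containing an emoji or another supplementary-plane character does not.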

And while we're on glitches: in PHP, I remember things like strlen() returning the number of bytes instead of the number of characters. If I'm not mistaken, it has a family of mb_XXX functions (multibyte string) that should handle UTF-8 better. [edit] Turns out it does.
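The bytes-versus-characters difference is easy to see outside PHP too. The following is a rough Python analogy, not PHP code: byte_count mimics what a byte-oriented length function reports, and char_count mimics a multibyte-aware one. Both function names are invented for this sketch, and char_count counts Unicode code points, which is not always the same as user-perceived characters.

```python
def byte_count(text: str) -> int:
    """Length of the text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def char_count(text: str) -> int:
    """Length of the text in Unicode code points."""
    return len(text)


# The Arabic word "marhaba" (hello), written with escapes so the
# exact code points are visible: meem, reh, hah, beh, alef.
word = "\u0645\u0631\u062d\u0628\u0627"

print("code points:", char_count(word))
print("UTF-8 bytes:", byte_count(word))
```

Each of these five letters takes 2 bytes in UTF-8, so the two counts differ by a factor of two for this word, while for plain ASCII text they would be equal.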

I don't see how i18n and setting Facebook to English (or Swahili, for that matter) would affect this at all. That setting is just the language used in the interface (and probably affects date/time formatting etc.); it has nothing to do with user-generated content.

Oh, almost forgot the obligatory link: Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

** Just mentioning it because it usually goes hand-in-hand with MySQL.

Licensed under: CC-BY-SA with attribution