Question

I am taking an XML file and reading it into various strings before writing them to a database. However, I am having difficulty with German characters.

The XML file starts off

<?xml version="1.0" encoding="UTF-8"?>

Then an example of where I am having problems is this part

<name><![CDATA[PONS Großwörterbuch Deutsch als Fremdsprache Android]]></name>

My PHP has this relevant section

$dom = new DOMDocument();
$domNode = $xmlReader->expand();
$element = $dom->appendChild($domNode);
$domString = utf8_encode($dom->saveXML($element));
$product = new SimpleXMLElement($domString);

//read in data
$arr = $product->attributes();
$link_ident = $arr["id"];
$link_id =  $platform . "" . $link_ident;
$link_name = $product->name;

So $link_name becomes PONS GroÃwörterbuch Deutsch als Fremdsprache Android

I then did a

$link_name = utf8_decode($link_name);

Which when I echoed back in terminal worked fine

PONS GroÃwörterbuch Deutsch als Fremdsprache Android as is now 
PONS Großwörterbuch Deutsch als Fremdsprache Android after utf8decode 

However when it is written into my database it appears as:

PONS Kompaktwörterbuch Deutsch-Englisch (Android)

The collation for link_name in MySQL is utf8_general_ci

How should I be doing this to get it correctly written into my database?

This is the code I use to write to the database

$link_name = utf8_decode($link_name);
$link_id = mysql_real_escape_string($link_id);
$link_name = mysql_real_escape_string($link_name);
$description = mysql_real_escape_string($description);
$metadesc = mysql_real_escape_string($metadesc);
$link_created = mysql_real_escape_string($link_created);
$link_modified = mysql_real_escape_string($link_modified);
$website = mysql_real_escape_string($website);
$cost = mysql_real_escape_string($cost);
$image_name = mysql_real_escape_string($image_name);
$query = "REPLACE into jos_mt_links
(link_id, link_name, alias, link_desc, user_id, link_published,link_approved, metadesc, link_created, link_modified, website, price)
VALUES ('$link_id','$link_name','$link_name','$description','63','1','1','$metadesc','$link_created','$link_modified','$website','$cost')";
echo $link_name . " has been inserted ";

and when I run it from shell I see

PONS Kompaktwörterbuch Deutsch-Englisch (Android) has been inserted

Solution

You've got a UTF-8 string from an XML file, and you're putting it into a UTF-8 database. So there is no encoding or decoding to be done: just put the original string into the database. Make sure you've called mysql_set_charset('utf8') first (MySQL spells the character set name without a hyphen) to tell the database there are UTF-8 strings coming.
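As a rough sketch of that advice: the string that comes out of the XML parser is already UTF-8, so it can be passed to the database untouched. The example below assumes the SimpleXML extension is available, uses a made-up id of 42, and swaps in mysqli with a prepared statement because the mysql_* functions from the question were removed in PHP 7. saveLink is a hypothetical helper, and it writes only two of the columns from the question's query.

```php
<?php
// The <name> element from the question, written with explicit UTF-8 bytes
// ("ß" = C3 9F, "ö" = C3 B6) so the example does not depend on how this
// file itself is saved. The id value 42 is made up for illustration.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . "<product id=\"42\"><name><![CDATA[PONS Gro\xC3\x9Fw\xC3\xB6rterbuch]]></name></product>";

$product = new SimpleXMLElement($xml);

// No utf8_encode() or utf8_decode() here: SimpleXML hands back UTF-8.
$link_name = (string) $product->name;

/**
 * Hypothetical helper: store one row using a UTF-8 connection.
 * Only link_id and link_name are written; the real table has more columns.
 */
function saveLink(mysqli $db, string $linkId, string $linkName): bool
{
    // Tell MySQL that the bytes sent over this connection are UTF-8.
    $db->set_charset('utf8');

    $stmt = $db->prepare(
        'REPLACE INTO jos_mt_links (link_id, link_name) VALUES (?, ?)'
    );
    if ($stmt === false) {
        return false;
    }

    // Bound parameters replace the mysql_real_escape_string() calls.
    $stmt->bind_param('ss', $linkId, $linkName);
    return $stmt->execute();
}
```

With an open connection in $db, a call such as saveLink($db, $link_id, $link_name) would then replace the hand-built query string.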

utf8_decode and utf8_encode are misleadingly named. They are only for converting between UTF-8 and ISO-8859-1 encodings. Calling utf8_decode, which converts to ISO-8859-1, will naturally lose any characters you have that don't fit in that encoding. You should generally avoid these functions unless there's a specific place where you need to be using 8859-1.
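To make this concrete, here is a small sketch of what happened to the "ß" in the question. It assumes the mbstring extension is loaded and uses mb_convert_encoding with explicit ISO-8859-1 and UTF-8 arguments as a stand-in for utf8_encode and utf8_decode, since it performs the same conversions.

```php
<?php
// "ß" in UTF-8 is the two bytes C3 9F.
$utf8 = "Gro\xC3\x9F";

// Same conversion as utf8_encode(): every input byte is treated as one
// ISO-8859-1 character, so C3 and 9F are each encoded again.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), "\n"; // 47726fc383c29f

// Same conversion as utf8_decode(): this undoes the double encoding,
// which is why the second call in the question appeared to fix things.
$restored = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8); // bool(true)

// Applied to text that was correct UTF-8 all along, the same conversion
// loses every character that ISO-8859-1 cannot represent, such as "€".
$euro = "\xE2\x82\xAC"; // U+20AC
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($roundTrip === $euro); // bool(false)
```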

You should not consider what the terminal shows when you echo a string to be definitive. The terminal has its own encoding problems and especially under Windows it is likely to be impossible to output every character properly. On a Western Windows install the system code page (which the terminal will use to turn the bytes PHP spits out into characters to display on-screen) will be code page 1252, which is similar to but not the same as ISO-8859-1. This is why utf8_decode, which spits out ISO-8859-1, appeared to make the text appear as you expected. But that's of little use. Internally you should be using UTF-8 for all strings.
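Instead of trusting what the terminal draws, one option is to look at the bytes themselves. The helper below is a hypothetical debugging aid, not part of the original code, and assumes the mbstring extension is loaded for mb_check_encoding.

```php
<?php
/**
 * Describe a string by its bytes rather than by how a terminal renders it.
 * (Hypothetical helper for debugging.)
 */
function describeBytes(string $s): string
{
    $valid = mb_check_encoding($s, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return bin2hex($s) . ' (' . $valid . ')';
}

// "Groß" encoded as UTF-8: ß is the two bytes C3 9F.
echo describeBytes("Gro\xC3\x9F"), "\n"; // 47726fc39f (valid UTF-8)

// "Groß" encoded as ISO-8859-1: ß is the single byte DF.
echo describeBytes("Gro\xDF"), "\n";     // 47726fdf (not valid UTF-8)
```

If a value is reported as valid UTF-8 right before it is written, any garbling seen afterwards points at the connection character set or at the tool used to view the data.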

OTHER TIPS

You must use the mb_convert_encoding or iconv function before you write into your database.

Licensed under: CC-BY-SA with attribution