Question

Using ARC2, textual data gets corrupted.

My RDF input file is in UTF-8. It gets loaded into ARC2, which uses a MySQL backend, through a LOAD <path/to/file.rdf> query. The MySQL database is in UTF-8 too, as a check with phpMyAdmin confirms.

However, the textual data gets corrupted. After several conversion checks, the problem seems to be that the original UTF-8 file is believed to be in ISO-8859-1, and converted to UTF-8 once again.

Example: "surmonté" → "surmonteÌ".

This "surmonteÌ" is actually stored as UTF-8 in the database.
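
The suspected double conversion can be reproduced in isolation. This is only a minimal sketch. It assumes the é in the file is stored in decomposed form (an "e" followed by the combining acute accent U+0301), which would be consistent with the "eÌ" seen above. It needs the mbstring extension and PHP 7 or later for the \u{...} escape, and the variable names are purely illustrative.

```php
<?php
// "surmonté" with a decomposed é: "e" followed by U+0301,
// which is the byte sequence 0xCC 0x81 in UTF-8 (assumption, see above).
$original = "surmonte\u{0301}";

// The mistake: treat the UTF-8 bytes as ISO-8859-1 and convert them to UTF-8 again.
$corrupted = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo the mistake: convert from UTF-8 back to ISO-8859-1 to get the original bytes.
$repaired = mb_convert_encoding($corrupted, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n";   // 7375726d6f6e7465cc81
echo bin2hex($corrupted), "\n";  // 7375726d6f6e7465c38cc281
var_dump($repaired === $original); // bool(true)
```

The byte 0xCC becomes "Ì" (0xC3 0x8C) and 0x81 becomes an invisible control character (0xC2 0x81), which matches the corrupted string found in the database.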

Is this related to the way ARC2 opens files (digging through the code, not exhaustively but quite deep, did not show anything suspicious), or could this be a more general case with PHP and MySQL?

How can I make sure the imported data is not wrongly re-encoded but taken as the original?


Solution

The code uses two ARC2 functions: $store->setUp(), which creates the database and tables if need be, and query(LOAD…, as detailed in the question.

It turns out that setUp() must not be called in the same script as the load, or at least not during the same execution. The solution I chose was to write two separate scripts, one to initialize the database and another to load the data, but simply commenting out the initialization once it is done also works. In any case, the trick is to make sure the load does not run right after the initialization.
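
One way to keep the two steps in separate executions is sketched below, as a single command-line script that runs only the step named in its first argument. The include path, the credentials, the store name and the init/load argument names are all placeholders made up for illustration; only ARC2::getStore(), isSetUp(), setUp(), query() and getErrors() come from ARC2 itself.

```php
<?php
// Sketch only: adjust the include path and the configuration to your setup.
include_once dirname(__FILE__) . '/arc/ARC2.php';

$config = array(
    'db_host'    => 'localhost', // placeholder
    'db_name'    => 'my_db',     // placeholder
    'db_user'    => 'user',      // placeholder
    'db_pwd'     => 'secret',    // placeholder
    'store_name' => 'my_store',  // placeholder
);

$store = ARC2::getStore($config);
$step = isset($argv[1]) ? $argv[1] : '';

if ($step === 'init') {
    // First execution: create the tables, then stop.
    if (!$store->isSetUp()) {
        $store->setUp();
    }
} elseif ($step === 'load') {
    // Later execution: the tables already exist, so only load the data.
    $store->query('LOAD <file://' . dirname(__FILE__) . '/path/to/file.rdf>');
} else {
    fwrite(STDERR, "Usage: php {$argv[0]} init|load\n");
    exit(1);
}

// Report anything ARC2 recorded as an error during this step.
if ($errors = $store->getErrors()) {
    print_r($errors);
}
```

Run it once with the init argument, then again with the load argument, so that the load never shares a connection with the table creation.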

This happens because SET NAMES utf8 is issued on the DB connection only after collation detection, and that detection does not seem to work properly in MySQL when the database has just been created. I submitted a pull request with a fix.


As a side note, the LOAD <path/to/file.rdf> construct from the question is inefficient: the path is resolved as a relative web address, so the server ends up downloading the file from itself over the network. It is much more efficient to use a construct such as:

 $store->query('LOAD <file://' . dirname(__FILE__) . '/path/to/file.rdf>');
Licensed under: CC-BY-SA with attribution