Question

I'm using XMLReader to parse a large XML file (1GB+) from a third party. The file declares its encoding as UTF-8 (<?xml version="1.0" encoding="utf-8" ?>), although it isn't actually UTF-8.

XMLReader throws an error because the content doesn't match the declared encoding, but not until it has already processed most of the file.

Exception message:

Input is not proper UTF-8, indicate encoding

I have determined that the real encoding of the file is ISO-8859-1, and it will work fine if I manually specify this when calling $reader->open().
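For reference, the override mentioned above is the second argument to XMLReader::open(). A minimal sketch, assuming the real encoding is already known; the function name firstElementText() is made up for illustration and is not from the original post:

```php
<?php
// Sketch: read a file whose prolog claims UTF-8 but whose bytes are ISO-8859-1.
// firstElementText() is an illustrative name, not part of XMLReader.
function firstElementText(string $path, ?string $encoding = null): ?string
{
    $reader = new XMLReader();
    // The second argument tells the parser which encoding to use; as the
    // question reports, this takes precedence over the declaration in the prolog.
    if (!$reader->open($path, $encoding)) {
        return null;
    }
    $text = null;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            // Text content of the first element encountered.
            $text = $reader->readString();
            break;
        }
    }
    $reader->close();
    return $text;
}
```

XMLReader hands strings back as UTF-8 whatever the input encoding was, so the rest of the script does not need to know which encoding was used.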

The problem is that my script needs to parse unknown files from the database, so it has to rely on the encoding declared within each file. I need a way to parse any file regardless of its encoding. Are there any suggestions for doing this?
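One possible way to avoid failing late is to validate the bytes before parsing starts. The sketch below only separates "valid UTF-8" from "everything else" and then falls back to a guessed encoding; it is not real encoding detection. The name guessEncoding() and the ISO-8859-1 fallback are assumptions, and the mbstring extension is required:

```php
<?php
// Sketch: decide up front whether a file is valid UTF-8, reading it line by
// line so that a 1GB+ file is never held in memory at once.
// guessEncoding() is an illustrative name; the fallback is a guess.
function guessEncoding(string $path, string $fallback = 'ISO-8859-1'): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // fgets() splits on "\n", which never occurs inside a UTF-8
        // multi-byte sequence, so each line can be validated on its own.
        while (($line = fgets($handle)) !== false) {
            if (!mb_check_encoding($line, 'UTF-8')) {
                return $fallback;
            }
        }
    } finally {
        fclose($handle);
    }
    return 'UTF-8';
}
```

The result could then be passed as the second argument to $reader->open(). The cost is one extra pass over the file, and a file with no line breaks would still be read in a single piece.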

Solution

I figured out that vim is pretty good at converting from one encoding to another.

My trick is to parse the file normally, and when the encoding error is encountered just re-encode the file with vim and start parsing again.

Here's the rough idea:

$xmlFile = '/path/to/file.xml';

// Parse the file in a loop
while(...)
{

    try
    {
        // Normal parsing logic...

        $reader->readOuterXml();

        //...
    }
    catch(Exception $ex)
    {
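        // getXMLEncoding() is a helper (not shown) that returns the encoding
        // declared in the file's XML prolog; fall back to UTF-8 if none is found
        $encoding = getXMLEncoding($xmlFile) ?: 'utf-8';

        // Have vim rewrite the file in the declared encoding and save it in place
        exec(sprintf(VIM_PATH . ' -c "set fileencoding=%s" -c "wq" "%s"', $encoding, $xmlFile));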

        // File has been re-encoded
        // The real encoding should now match the declared encoding

        // -> Go back to the beginning and parse the file again
    }

}
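The getXMLEncoding() helper is not shown in the answer. A hypothetical implementation might look like the sketch below, which assumes the file starts in an ASCII-compatible encoding (so not UTF-16) and that the declaration fits in the first 256 bytes. parseDeclaredEncoding() is an illustrative name:

```php
<?php
// Hypothetical implementation of the getXMLEncoding() helper used above.
// parseDeclaredEncoding() is split out so the logic can be exercised on strings.
function parseDeclaredEncoding(string $head): ?string
{
    // Optional UTF-8 BOM, then the XML declaration with encoding="..." or encoding='...'.
    $pattern = '/^(?:\xEF\xBB\xBF)?\s*<\?xml[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';
    if (preg_match($pattern, $head, $m)) {
        return $m[2];
    }
    return null;
}

function getXMLEncoding(string $path): ?string
{
    // The declaration, if present, sits at the very start of the file,
    // so only the first few hundred bytes are read.
    $head = file_get_contents($path, false, null, 0, 256);
    return $head === false ? null : parseDeclaredEncoding($head);
}
```

Returning null when no declaration is found keeps the `?: 'utf-8'` fallback in the calling code working as written.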

Using this method might garble one or two characters, but that's far better than having the parse fail completely. Ideally the third party would mark their files correctly.

My system is Windows, so the vim arguments might be different on Linux (don't know).
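If depending on vim is a concern, the same re-encoding step could be done in PHP itself. This is a different technique from the vim call above, sketched under the assumption that the real encoding is a single-byte one such as ISO-8859-1 and that the mbstring extension is available. The function and parameter names are illustrative:

```php
<?php
// Sketch: re-encode a file in fixed-size chunks using PHP instead of vim.
// Chunking is only safe when the SOURCE encoding is single-byte
// (e.g. ISO-8859-1), because a chunk boundary can then never split a character.
function reencodeFile(string $source, string $target, string $from = 'ISO-8859-1', string $to = 'UTF-8'): void
{
    $in = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Cannot open source or target file');
    }
    while (!feof($in)) {
        $chunk = fread($in, 1 << 20); // 1 MiB at a time
        if ($chunk === false || $chunk === '') {
            break;
        }
        fwrite($out, mb_convert_encoding($chunk, $to, $from));
    }
    fclose($in);
    fclose($out);
}
```

Unlike the vim command, this writes to a separate target file, so the original stays untouched if the guessed source encoding turns out to be wrong.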

OTHER TIPS

Use simplexml_load_file to parse the XML. To avoid encoding problems, run utf8_encode on the data (it converts ISO-8859-1 to UTF-8, and is deprecated as of PHP 8.2).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow