Question

I have downloaded the XML dump of the Stack Overflow site. While transferring the dump into a MySQL database I keep running into the following error: Got an Exception: Character reference "some character set like &#x10" is an invalid XML character.

I used UltraEdit (it is an 800 MB file) to remove some characters from the file, but each time I remove one set of invalid characters and rerun the parser, I get a new error identifying more invalid characters. Any suggestions on how to solve this?

Cheers all,

j

Solution

Which dump are you using? There were problems with the first version (not just invalid characters, but also < appearing where it shouldn't), but they should have been fixed in the second dump.

For what it's worth, I fixed the invalid characters in the original using two regex replaces. Replace "&#x0[12345678BCEF];" and "" each with "?" - treating them both as regular expressions, of course.
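As a rough illustration of the regex-replace idea, here is a minimal Java sketch. It does not reproduce the exact patterns from this answer (the second one is not shown above); instead it uses a single pattern of my own that matches hexadecimal character references to code points below #x20 other than tab (#x9), line feed (#xA) and carriage return (#xD). The class and method names are hypothetical, and decimal references such as &#16; are not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class CharRefCleaner {
    // Matches &#x0; .. &#x1F; (with optional leading zeros), except the
    // references for tab (9), line feed (A) and carriage return (D).
    private static final Pattern INVALID_REF =
            Pattern.compile("&#x0*(?:[0-8BCEFbcef]|1[0-9A-Fa-f]);");

    // Replaces each invalid control-character reference in the line with "?".
    static String clean(String line) {
        return INVALID_REF.matcher(line).replaceAll("?");
    }
}
```

For a file of this size you would probably want to call clean on each line while streaming the dump through a reader and writer, rather than loading the whole file into memory.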

OTHER TIPS

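The set of characters permitted in XML is defined by the Char production of the XML 1.0 specification: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. As you can see, #x10 is not one of them. If such characters are present in the Stack Overflow dump, then it's not XML compliant.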
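The XML 1.0 character ranges can be written down as a small predicate. The sketch below is only an illustration: XmlChars, isXmlChar and stripInvalid are hypothetical names, and it works on raw characters in a string, not on character references like &#x10; in the text of the file.

```java
// Hypothetical helper, not part of any library.
final class XmlChars {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of s with every code point outside those ranges dropped.
    static String stripInvalid(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().filter(XmlChars::isXmlChar).forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Iterating by code point rather than by char keeps characters above #xFFFF intact and drops unpaired surrogates, which fall in the excluded #xD800-#xDFFF range.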

Alternatively, you're reading the XML using the wrong character encoding.

You should convert your file to UTF-8. I develop in Java; below is my conversion routine, which also strips control characters while copying the file.

import java.io.*;
import java.nio.charset.StandardCharsets;

// Copies xmlfile to "<name>.utf8", dropping raw control characters, and returns
// the path of the new file. Numeric references such as &#x10; are left untouched.
public String FileUTF8Cleaner (File xmlfile) {

    String out = xmlfile+".utf8";
    File outFile = new File(out);
    // Braces added: in the original only the first println was conditional.
    if (outFile.exists()) {
        System.out.println("### File conversion process ### Deleting utf8 file");
        outFile.delete();
        System.out.println("### File conversion process ### Deleting utf8 file [DONE!]");
    }

    System.out.println("### File conversion process ### Converting file");
    // try-with-resources closes both streams even if an exception is thrown.
    // Assumption: the input is readable as UTF-8; change the reader's charset if not.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
                 new FileInputStream(xmlfile), StandardCharsets.UTF_8));
         BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream(outFile), StandardCharsets.UTF_8))) {

        String strLine;
        while ((strLine = br.readLine()) != null) {
            // Remove control characters, but keep tab, which XML allows.
            bw.write(strLine.replaceAll("[\\p{Cc}&&[^\\t]]", ""));
            bw.write("\n");
        }
        System.out.println("### File conversion process ### Converting file [DONE!]");

    } catch(IOException e) {
        e.printStackTrace();
    }

    System.out.println("### File conversion process ### Processing file : "+xmlfile.getAbsolutePath()+" [DONE!]");
    return out;

}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow