Question

I need to load wikipedia revision histories into POJOs, so I'm using JAXB to unmarshall the wikipeida data dump (well, individual pages of it). The problem is that the text nodes occasionally contain entities that are not defined in the wikipedia xml dump. eg: ° (`°' pleases keep in mind that I do not know the complete set of entities that I need to be able to read. My input file is 3tb, so let's just assume that everything html can render is in there.).

How can I configure JAXB to handle entities that are not valid xml?

Here is the SAX Exception that JAXB throws when it encounters an undefined entity:

Exception in thread "main" javax.xml.bind.UnmarshalException

 - with linked exception:

[org.xml.sax.SAXParseException: The entity "deg" was referenced, but not declared.]

    at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)

    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(UnmarshallerImpl.java:481)

    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:199)

    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:168)

    at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:137)

    at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:184)

    at com.stottlerhenke.tools.wikiparse.WikipediaIO.readPage(WikipediaIO.java:73)

    at com.stottlerhenke.tools.wikiparse.WikipediaIO.main(WikipediaIO.java:53)

Caused by: org.xml.sax.SAXParseException: The entity "deg" was referenced, but not declared.

    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)

    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)

    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)

    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)

    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)

    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)

    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)

    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)

    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)

    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)

    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)

    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:195)

Edit: The input that triggered that exception is the complete revision history for the wikipedia article on the Arctic Circle. The XSD used to generate the JAXB classes is here: http://www.mediawiki.org/xml/export-0.3.xsd

Edit: The source of this problem was an error on my part -- I was using an initial extractor that did not maintain encoded entities properly. However, I did find a way around this, should anyone have the problem I thought I had. See below.

Was it helpful?

Solution 2

This is a hack, but it works in a pinch.

I downloaded the html entity definitions from w3.org, and set the doctype of the input xml file to xhtml-transitional, but directed the doctype url to a local dtd:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1-transitional.dtd">

xhtml1-transitional.dtd, in turn, requires:

  • xhtml-lat1.ent
  • xhtml-special.ent
  • xhtml-symbol.ent

which I sucked down and put along side xhtml1-transitional.dtd

(All files are available at: http://www.w3.org/TR/xhtml1/DTD/ )

Like I said, ugly as hell, but it did seem to do the job.

OTHER TIPS

Resolving entities is not the job of JAXB's. It's the job of the underlying XML parser.

What you could do is:

  • read the data yourself using DOM
  • replace all unresolved entities by something you wish
  • then, let JAXB handle the result
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top