How to ignore whitespace while reading a file to produce an XML DOM

https://stackoverflow.com/questions/229310

04-07-2019
|

Question

I'm trying to read a file to produce a DOM Document, but the file has whitespace and newlines and I'm trying to ignore them, but I couldn't:

DocumentBuilderFactory docfactory=DocumentBuilderFactory.newInstance();
docfactory.setIgnoringElementContentWhitespace(true);

I see in Javadoc that setIgnoringElementContentWhitespace method operates only when the validating flag is enabled, but I haven't the DTD or XML Schema for the document.

What can I do?

Update

I don't like the idea of introduce mySelf < !ELEMENT... declarations and i have tried the solution proposed in the forum pointed by Tomalak, but it doesn't work, i have used java 1.6 in an linux environment. I think if no more is proposed i will make a few methods to ignore whitespace text nodes

Solution

‘IgnoringElementContentWhitespace’ is not about removing all pure-whitespace text nodes, only whitespace nodes whose parents are described in the schema as having ELEMENT content — that is to say, they only contain other elements and never text.

If you don't have a schema (DTD or XSD) in use, element content defaults to MIXED, so this parameter will never have any effect. (Unless the parser provides a non-standard DOM extension to treat all unknown elements as containing ELEMENT content, which as far as I know the ones available for Java do not.)

You could hack the document on the way into the parser to include the schema information, for example by adding an internal subset to the < !DOCTYPE ... [...] > declaration containing < !ELEMENT ... > declarations, then use the IgnoringElementContentWhitespace parameter.

Or, possibly easier, you could just strip out the whitespace nodes, either in a post-process, or as they come in using an LSParserFilter.

OTHER TIPS

This is a (really) late answer, but here is how I solved it. I wrote my own implementation of a NodeList class. It simply ignores text nodes that are empty. Code follows:

private static class NdLst implements NodeList, Iterable<Node> {

    private List<Node> nodes;

    public NdLst(NodeList list) {
        nodes = new ArrayList<Node>();
        for (int i = 0; i < list.getLength(); i++) {
            if (!isWhitespaceNode(list.item(i))) {
                nodes.add(list.item(i));
            }
        }
    }

    @Override
    public Node item(int index) {
        return nodes.get(index);
    }

    @Override
    public int getLength() {
        return nodes.size();
    }

    private static boolean isWhitespaceNode(Node n) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            String val = n.getNodeValue();
            return val.trim().length() == 0;
        } else {
            return false;
        }
    }

    @Override
    public Iterator<Node> iterator() {
        return nodes.iterator();
    }
}

You then wrap all of your NodeLists in this class and it will effectively ignore all whitespace nodes. (Which I define as Text Nodes with 0-length trimmed text.)

It also has the added benefit of being able to be used in a for-each loop.

I made it works by doing this

DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        dbFactory.setIgnoringElementContentWhitespace(true);
        dbFactory.setSchema(schema);
        dbFactory.setNamespaceAware(true);
NodeList nodeList = element.getElementsByTagNameNS("*", "associate");

I ended up following @bobince's idea of using an LSParserFilter. Yes, the interface is documented at https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/ls/LSParserFilter.html but it's very hard to find good example/explanation material. After considerable searching I located DOM Level 3 Load and Save XML Reference Guide at http://www.informit.com/articles/article.aspx?p=31297&seqNum=29 (Nicholas Chase, Mar 14, 2003). That helped me considerably. Here are portions of my code, which does an XML diff with org.custommonkey.xmlunit. (This is a tool written on my own time to help me with paid work, so I have left a lot of things, like better exception handling, for when things are slow.)

I especially like the use of an LSParserFilter because, for my purpose, I will likely add an option in the future to ignore id attributes too, which should be an easy enhancement with this framework.

// A small portion of my main class.
// Other imports may be necessary...
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;

Document controlDoc = null;
Document testDoc = null;
try {
    System.setProperty(DOMImplementationRegistry.PROPERTY, "org.apache.xerces.dom.DOMImplementationSourceImpl");
    DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
    DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
    LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
    LSParserFilter filter = new InputFilter();
    builder.setFilter(filter);
    controlDoc = builder.parseURI(files[0].getPath());
    testDoc = builder.parseURI(files[1].getPath());
} catch (Exception exc) {
    System.out.println(exc.getMessage());
}

//--------------------------------------

import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.NodeFilter;

public class InputFilter implements LSParserFilter {

    public short acceptNode(Node node) {
        if (Utils.isNewline(node)) {
            return NodeFilter.FILTER_REJECT;
        }
        return NodeFilter.FILTER_ACCEPT;
    }

    public int getWhatToShow() {
        return NodeFilter.SHOW_ALL;
    }

    public short startElement(Element elem) {
        return LSParserFilter.FILTER_ACCEPT;
    }

}

//-------------------------------------
// From my Utils.java:

    public static boolean isNewline(Node node) {
        return (node.getNodeType() == Node.TEXT_NODE) && node.getTextContent().equals("\n");
    }

Try this:

private static Document prepareXML(String param) throws ParserConfigurationException, SAXException, IOException {

        param = param.replaceAll(">\\s+<", "><").trim();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        InputSource in = new InputSource(new StringReader(param));
        return builder.parse(in);

    }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow