Removing duplicated newlines/tabs/whitespaces in XML character element

Question

The first part - collapsing runs of whitespace - is relatively easy, though I don't think the parser will do it for you:

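import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// inputStream is the XML source supplied by the caller
InputSource stream = new InputSource(inputStream);
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);

// select every text node and collapse each run of 2+ whitespace characters to one space
NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
    XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
  Text text = (Text) nodes.item(i);
  text.setTextContent(text.getTextContent().replaceAll("\\s{2,}", " "));
}

// check results
TransformerFactory.newInstance()
    .newTransformer()
    .transform(new DOMSource(doc), new StreamResult(System.out));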
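
To make the behaviour of that pattern concrete, here is a minimal sketch that applies the same `replaceAll` call to plain strings. The class name `CollapseDemo` and the method `collapse` are hypothetical names used only for illustration.

```java
public class CollapseDemo {
    // Same pattern as in the loop above: a run of two or more
    // whitespace characters (spaces, tabs, newlines) becomes one space.
    public static String collapse(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    public static void main(String[] args) {
        // A mixed run of space, newline, tab and spaces is collapsed.
        System.out.println(collapse("a \n\t  b").equals("a b"));
        // A single newline is not a run of 2+, so it is left untouched.
        System.out.println(collapse("a\nb").equals("a\nb"));
    }
}
```

Note that a lone tab or newline survives with `\\s{2,}`. If single whitespace characters should also be normalized to a space, `\\s+` would probably be the pattern to use instead, but that depends on what the question actually needs.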

This is the hard part:

If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left.

The parser will always turn "&#x9;" into "\t" before your code sees the text, so the two forms cannot be told apart afterwards - you may need to write your own XML parser.
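
A small sketch of why this is hard, using the standard JAXP DOM parser: a tab written as a character reference and a tab written literally produce the same text content. The class name `CharRefDemo` and the helper `rootText` are hypothetical, introduced only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharRefDemo {
    // Parses the XML string and returns the text content of the root element.
    public static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Tab written as a numeric character reference in the XML source.
        String encoded = rootText("<a>x&#x9;y</a>");
        // Tab written as a literal tab character in the XML source.
        String literal = rootText("<a>x\ty</a>");
        // Both arrive in the DOM as the same three-character string.
        System.out.println(encoded.equals(literal));
    }
}
```

Once the document is in the DOM, the information about which tabs were encoded is gone, so any post-parse whitespace cleanup will treat both the same way.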

According to the author of Saxon:

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.