The first part - replacing runs of multiple whitespace characters - is relatively easy, though I don't think the parser will do it for you:
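    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.xpath.*;
    import org.w3c.dom.*;
    import org.xml.sax.InputSource;

    // inputStream is the InputStream holding the XML document
    InputSource stream = new InputSource(inputStream);
    XPath xpath = XPathFactory.newInstance().newXPath();
    Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);
    // select every text node in the document
    NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
            XPathConstants.NODESET);
    for (int i = 0; i < nodes.getLength(); i++) {
        Text text = (Text) nodes.item(i);
        // collapse each run of two or more whitespace characters to one space
        text.setTextContent(text.getTextContent().replaceAll("\\s{2,}", " "));
    }
    // check results
    TransformerFactory.newInstance()
            .newTransformer()
            .transform(new DOMSource(doc), new StreamResult(System.out));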
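To make the behaviour of the regex concrete, here is a small sketch that applies it to a plain string. The class name WhitespaceRegexDemo and the collapseAll variant using "\\s+" are my own additions for comparison, not part of the code above. The point is that "\\s{2,}" only touches runs of two or more whitespace characters, so a single tab or newline is left alone:

```java
public class WhitespaceRegexDemo {
    // Same pattern as in the answer: only runs of 2+ whitespace characters are replaced.
    public static String collapseRuns(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    // Hypothetical alternative: every whitespace run, even a single tab or newline,
    // becomes one space.
    public static String collapseAll(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String sample = "a  b\tc\n\n d";
        // The lone tab between b and c survives collapseRuns.
        System.out.println(collapseRuns(sample).equals("a b\tc d")); // prints true
        // collapseAll turns that tab into a space as well.
        System.out.println(collapseAll(sample).equals("a b c d"));   // prints true
    }
}
```

Which variant you want depends on whether isolated tabs and newlines should also become spaces.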
This is the hard part:
If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left.
The parser will always turn "&#x9;" into "\t" before your code sees the text, so there is no way to tell the two apart afterwards - you may need to write your own XML parser.
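You can check this for yourself with a minimal sketch like the one below. It uses the standard JAXP DocumentBuilder; the class name CharRefDemo and the helper rootText are made up for the illustration. It parses one document that uses the character reference and one that contains a literal tab, and compares what the DOM hands back:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharRefDemo {
    // Parses the XML string and returns the text content of the root element.
    public static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the demo short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String viaReference = rootText("<a>x&#x9;y</a>"); // tab written as a character reference
        String viaLiteral = rootText("<a>x\ty</a>");      // tab written as a literal character
        // Both arrive as the same string, so the application cannot tell them apart.
        System.out.println(viaReference.equals(viaLiteral)); // prints true
    }
}
```

By the time the text node exists, the reference has already been expanded, which is why a post-processing step on the DOM cannot treat the two cases differently.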
According to the author of Saxon:
I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.