Question

<node> test
    test
    test
</node>

I want my XML parser to read the characters in <node> and:

  1. Replace newlines and tabs with spaces and collapse multiple spaces into one. As a result, the text should look like "test test test".
  2. If the node contains XML-encoded characters, i.e. tabs (&#x9;), newlines (&#xA;) or spaces (&#x20;), they should be left as they are.

I'm trying the code below, but it preserves the duplicated whitespace.

  dbf = DocumentBuilderFactory.newInstance();
  dbf.setIgnoringComments( true );
  dbf.setNamespaceAware( namespaceAware );
  db = dbf.newDocumentBuilder();
  doc = db.parse( inputStream );

Is there any way to do what I want?

Thanks!


Solution

The first part, collapsing runs of whitespace, is relatively easy, though I don't think the parser will do it for you. You can post-process the text nodes after parsing:

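import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

InputSource stream = new InputSource(inputStream);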
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);

NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
    XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
  Text text = (Text) nodes.item(i);
  // \s+ rather than \s{2,}, so that a single tab or newline also becomes a space
  text.setTextContent(text.getTextContent().replaceAll("\\s+", " "));
}

// check results
TransformerFactory.newInstance()
    .newTransformer()
    .transform(new DOMSource(doc), new StreamResult(System.out));
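
One detail worth checking in the text-node pass is the regular expression. A pattern such as `\\s{2,}` only rewrites runs of two or more whitespace characters, so a lone tab or newline survives, whereas `\\s+` rewrites single ones as well. The small check below illustrates the difference; the class and method names are made up for this example.

```java
class WhitespaceRegexCheck {

    // Replaces only runs of two or more whitespace characters.
    static String collapseRuns(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    // Replaces every whitespace run, including single tabs and newlines.
    static String collapseAll(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String text = "a\tb\n\nc";
        // The single tab between a and b is left alone here.
        System.out.println(collapseRuns(text).equals("a\tb c")); // prints "true"
        // Here every whitespace run becomes one space.
        System.out.println(collapseAll(text).equals("a b c"));   // prints "true"
    }
}
```

Neither variant trims the leading or trailing space of a text node; call `trim()` on the final string if you need exactly "test test test".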

This is the hard part:

If the node contains XML-encoded characters, i.e. tabs (&#x9;), newlines (&#xA;) or spaces (&#x20;), they should be left as they are.

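The parser will always turn "&#x9;" into "\t" before your code sees the text, so in the DOM a referenced tab is indistinguishable from a literal one. To keep that distinction you may need to write your own XML parser.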
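
A quick way to see this for yourself is to parse the same element twice, once with a character reference and once with a literal tab, and compare the text content. This is only a small check using the standard JAXP DOM parser; `CharRefDemo` and `rootText` are names invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefDemo {

    // Parses the XML string and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String viaReference = rootText("<node>a&#x9;b</node>");
        String viaLiteral = rootText("<node>a\tb</node>");
        // Both strings contain a real tab character between a and b.
        System.out.println(viaReference.equals(viaLiteral)); // prints "true"
    }
}
```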

According to the author of Saxon:

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
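
If you really do need the distinction and can live with a hack, one possible workaround is to hide the whitespace references from the parser before it runs. The sketch below is my own assumption, not something the parser or Saxon offers: it swaps numeric references to tab, LF, CR and space for private-use sentinel characters (U+E009, U+E00A, U+E00D, U+E020, chosen arbitrarily), parses, collapses the remaining whitespace, and then turns the sentinels back into the real characters. It assumes the document itself never uses those sentinel characters and has no such references inside CDATA sections or comments, where they would be plain text. All names (`CharRefPreserver`, `protect`, `restore`, `normalizedRootText`) are hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CharRefPreserver {

    // Any numeric character reference, decimal or hexadecimal.
    private static final Pattern NUM_REF = Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    private static boolean isXmlWhitespace(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD || cp == 0x20;
    }

    // Replaces whitespace references with sentinels; other references are kept.
    static String protect(String xml) {
        Matcher m = NUM_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // out-of-range number: leave it for the parser to reject
            }
            String replacement = isXmlWhitespace(cp)
                ? String.valueOf((char) (0xE000 + cp))
                : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Turns sentinels back into the whitespace characters they stand for.
    static String restore(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            boolean sentinel = c >= 0xE000 && isXmlWhitespace(c - 0xE000);
            sb.append(sentinel ? (char) (c - 0xE000) : c);
        }
        return sb.toString();
    }

    // Collapses literal whitespace in text nodes, then restores the sentinels
    // in text nodes and attribute values.
    static void rewrite(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(restore(node.getNodeValue().replaceAll("\\s+", " ")));
        }
        NamedNodeMap attrs = node.getAttributes(); // null unless node is an element
        for (int i = 0; attrs != null && i < attrs.getLength(); i++) {
            attrs.item(i).setNodeValue(restore(attrs.item(i).getNodeValue()));
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            rewrite(children.item(i));
        }
    }

    static String normalizedRootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(protect(xml))));
            rewrite(doc.getDocumentElement());
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String text = normalizedRootText("<node>a\n\t b&#x9;&#x9;c&#xA;d</node>");
        // Literal whitespace is collapsed; the referenced tabs and newline survive.
        System.out.println(text.equals("a b\t\tc\nd")); // prints "true"
    }
}
```

Note that this works on the raw XML string, so it has to run before parsing and needs the whole document in memory. Whether the characters come back out as references when you serialize the DOM again depends on the serializer.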

Licensed under: CC-BY-SA with attribution