Question

Be gentle.

I'm trying to use javax.xml.transform.Transformer to format an XML string so that it comes out indented, with no stray spaces between the tags. If the input has no white space between the tags, it works fine. If it does, the output is not indented properly. I'll post an example below. I tried to follow this topic: http://forums.sun.com/thread.jspa?messageID=2054303#2699961, without success.

The code:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setIgnoringElementContentWhitespace(true);
DocumentBuilder builder = factory.newDocumentBuilder();
DOMImplementation domImpl = builder.getDOMImplementation();
DOMImplementationLS ls = (DOMImplementationLS) domImpl.getFeature("LS", "3.0");
LSInput in = ls.createLSInput();
in.setByteStream(new ByteArrayInputStream(input.getBytes()));
LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS,
    "http://www.w3.org/2001/XMLSchema");
Document xmlInput = parser.parse(in);

StringWriter stringWriter = new StringWriter();
StreamResult xmlOutput = new StreamResult(stringWriter);
TransformerFactory f = TransformerFactory.newInstance();
f.setAttribute("indent-number", 2);

Transformer transformer = f.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
transformer.transform(new DOMSource(xmlInput), xmlOutput);

If there is no white space between the tags:

input: <tag><nested>    hello   </nested></tag>
output:
<tag>
  <nested>    hello   </nested>
</tag>

If there is white space between the tags:

input: <tag>  <nested>    hello   </nested></tag>
output:
<tag>  <nested>    hello   </nested>
</tag>

This is on JVM 1.6.

Is there something obviously wrong here?

Solution

This appears to be an issue with the transformer implementation. I've created a small test class that reads a String with no white space or line breaks as XML and creates a transformer from an XSLT stylesheet (also from a String). The stylesheet requests indentation through indent="yes" on xsl:output, which is basically another way of achieving what you've done with transformer.setOutputProperty(OutputKeys.INDENT, "yes").

Here it is:

package transformation;

import java.io.StringReader;

import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformerTest {

    public static void main(String[] args) throws Exception {

        final String xmlSample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><tag><nested>hello</nested></tag>";
        final String stylesheet = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"><xsl:output method=\"xml\" version=\"1.0\" indent=\"yes\"/><xsl:template match=\"node()|@*\"><xsl:copy><xsl:apply-templates select=\"node()|@*\"/></xsl:copy></xsl:template></xsl:stylesheet>";

        final TransformerFactory factory = TransformerFactory.newInstance();

        final Source xslSource = new StreamSource(new StringReader(stylesheet));
        final Transformer transformer = factory.newTransformer(xslSource);

        final Source source = new StreamSource(new StringReader(xmlSample));
        final Result result = new StreamResult(System.out);

        transformer.transform(source, result);

    }

}

Now the curious thing is, results vary based on the transformer I use. If I don't place any TransformerFactory implementation on the classpath (using the default implementation in the JRE libs), the result is this:

<?xml version="1.0" encoding="UTF-8"?>
<tag>
<nested>hello</nested>
</tag>

Not correct, since the nested tag isn't indented.
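
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the JRE's built-in transformer seems to use an indent amount of 0 unless told otherwise, so indent="yes" alone only adds line breaks. The question's code sets the implementation-specific "{http://xml.apache.org/xslt}indent-amount" output property, which the test class above does not. A minimal sketch that adds it (the class and method names are made up for illustration, and it uses an identity transformer instead of the stylesheet to stay short):

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of the original answer.
class IndentAmountSketch {

    static String format(String xml) throws Exception {
        // Identity transformer; OutputKeys.INDENT plays the role of indent="yes".
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");

        // Implementation-specific property, taken from the question's code.
        // Assumption: the transformer in use is Xalan-derived (as the JRE
        // default is) and therefore recognizes it; other processors may not.
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(format("<tag><nested>hello</nested></tag>"));
    }
}
```

If the assumption holds, the nested element should come out indented by two spaces. Whether the root element starts on its own line after the XML declaration still depends on the implementation and JRE version.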

Then, by adding a recent Xalan implementation on the classpath (xalan.jar and serializer.jar, still using JRE default parsers/DOM builders), I get this:

<?xml version="1.0" encoding="UTF-8"?><tag>
<nested>hello</nested>
</tag>

Still not correct: the root tag is on the same line as the XML declaration, and the nested tag still isn't indented.

To be honest, this surprised me quite a bit. I'd understand if white space between tags or around text nodes influenced the indentation, since the transformer might assume some of it is non-ignorable. But seeing straightforward XML like that mangled is plain weird. I thought writing to the console might have something to do with it, so I tried streaming to a file instead. Same result.

It's kind of weird that long-standing transformer implementations still behave this way. Still, it's not nearly as bad as the time I noticed that using a Validator created from a Schema resulted in attributes being dropped from the "enhanced" XML output.

So it would seem there's not much to be done about this, apart from trying to find other processors and see if they're having the same problem. Maybe Saxon is worth a shot. This bug report is interesting too (it is for Java 1.5, however): http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6296446

OTHER TIPS

The transformer doesn't seem to like the white space, so the simplest solution seems to be to remove it:

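    public String prettyPrintXML(String inXML) {

        // The transformer doesn't like white space between tags, so remove it.
        // Caveats: this also trims leading white space inside text content,
        // and it does not cope with '>' inside comments or CDATA sections.
        String[] bits = inXML.trim().split(">");
        StringBuilder stripped = new StringBuilder();
        for (String bit : bits) {
            // split() consumed each '>', so put it back after the trimmed piece.
            stripped.append(bit.trim()).append(">");
        }
        inXML = stripped.toString();

        // The transformer call itself is not shown in this answer; the
        // cleaned-up string is returned so it can be passed on.
        return inXML;
    }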

Pass the inXML into your transformer and off you go.
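
As an alternative to splitting the string, the same idea can be applied to the parsed DOM: drop the text nodes that contain only white space, then run the transformer. This is a sketch under assumptions, not a tested fix for every transformer. The class and method names are invented for illustration, and the indent-amount property is the implementation-specific one from the question. (The Javadoc for setIgnoringElementContentWhitespace notes that it requires a validating parser, which would explain why it had no effect in the question.)

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answers.
class WhitespaceStrippingPrettyPrinter {

    static String prettyPrint(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Select every text node that consists of white space only.
        NodeList blanks = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//text()[normalize-space(.) = '']", doc, XPathConstants.NODESET);

        // Remove them; text such as "    hello   " is left untouched.
        for (int i = 0; i < blanks.getLength(); i++) {
            Node blank = blanks.item(i);
            blank.getParentNode().removeChild(blank);
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // Assumption: a Xalan-derived transformer that recognizes this property.
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(prettyPrint("<tag>  <nested>    hello   </nested></tag>"));
    }
}
```

Unlike the string-splitting approach, this leaves the white space inside text content alone. The trade-off is that an element whose only content is white space, such as <a> </a>, ends up empty.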

Licensed under: CC-BY-SA with attribution