Question

The code below does not transform the input data into valid XML. I say this because I don't expect Transformer to generate output that contains invalid XML characters (I'm talking about the &).

Here is the code:

package com.example.test.formatter;

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import android.test.AndroidTestCase;
import android.util.Log;

public class XmlTest extends AndroidTestCase {

    public void testFormat() {

        try {
            String arbitraryInput = "Arbitrary input: \uD83D"; // we don't have control over this input

            DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
            Document document = documentBuilder.newDocument();

            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            Transformer transformer = transformerFactory.newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "true");

            StringWriter stringWriter = new StringWriter();
            StreamResult result = new StreamResult(stringWriter);
            DOMSource source = new DOMSource(document);

            Element root = document.createElement("root");
            Element subElement = document.createElement("key");
            subElement.setTextContent(arbitraryInput);
            root.appendChild(subElement);

            document.appendChild(root);

            stringWriter.getBuffer().setLength(0);
            transformer.transform(source, result);

            String parsed = stringWriter.toString(); // <root><key>Arbitrary input: &#55357;</key></root>
            Log.e("parsed", parsed);
        }
        catch(Throwable ex) {
            ex.printStackTrace();
        }

    }

}

I was expecting to get something like

<root><key>Arbitrary input: &amp; #55357;</key></root>

But instead I get:

<root><key>Arbitrary input: &#55357;</key></root>

So, what should I do if I want to get valid XML output from Transformer?

Thanks!

EDIT:

I think the output is invalid because when I try to process the produced XML with PHP like this:

<?php

$data = "<root><key>Arbitrary input: &#55357;</key></root>";

$xmlDocument = new \DOMDocument();
$xmlDocument->loadXML($data);

I get a warning (or exception if the environment was configured to throw exceptions on warnings):

PHP Warning:  DOMDocument::loadXML(): xmlParseCharRef: invalid xmlChar value 55357 in Entity, line: 1 in /tmp/test.php on line 6
PHP Stack trace:
PHP   1. {main}() /tmp/test.php:0
PHP   2. DOMDocument->loadXML() /tmp/test.php:6

Please note that if I process the following data with DOMDocument (PHP) instead, everything works fine:

$data = " <root><key>Arbitrary input: &amp; #55357;</key></root>";

Either the Java Transformer or PHP's DOMDocument is doing something wrong. Can you point out which one?

Thanks!


Solution

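After some more investigation: \uD83D on its own is indeed an invalid character. The range \uD800 to \uDFFF is reserved by the Unicode standard for lead and trail surrogates, and no characters will ever be assigned there. A lead surrogate such as \uD83D is only meaningful when it is immediately followed by a trail surrogate (\uDC00 to \uDFFF), and XML 1.0 excludes the whole surrogate range from its allowed characters.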
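To make the surrogate point concrete, here is a minimal sketch of how one might detect such input in Java before handing it to the DOM. The class and method names (SurrogateCheck, hasUnpairedSurrogate) are made up for illustration; only the java.lang.Character calls are standard API.

```java
class SurrogateCheck {

    /** Returns true if s contains a surrogate code unit that is not part of a valid lead/trail pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the trail surrogate as well
                } else {
                    return true; // lead surrogate with no trail surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // trail surrogate with no lead surrogate before it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The input from the question: a lone lead surrogate.
        System.out.println(hasUnpairedSurrogate("Arbitrary input: \uD83D"));
        // A complete pair (U+1F600) is a valid character.
        System.out.println(hasUnpairedSurrogate("Arbitrary input: \uD83D\uDE00"));
    }
}
```

The first call reports an unpaired surrogate and the second does not, which matches the explanation above: the problem is the missing trail surrogate, not the Transformer.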

The numeric character reference emitted by the Java transformer would be correct if only the character were valid. But since it is not, you are trying to assemble an invalid XML document, and PHP's DOMDocument is right to reject it.

The construct

<root><key>Arbitrary input: &amp; #55357;</key></root>

clearly does not reflect the input data. It means the value of key is the literal text

Arbitrary input: & #55357;

which is different from what you want it to be.
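If the input really is out of your control, one option is to sanitize it before calling setTextContent. The following is only a sketch under assumptions: the XmlTextSanitizer class is hypothetical, and replacing disallowed code points with U+FFFD (rather than dropping them or rejecting the input) is a choice made here for illustration. The allowed ranges follow the Char production of XML 1.0.

```java
class XmlTextSanitizer {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 does not allow with U+FFFD. */
    static String sanitize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // For an unpaired surrogate, codePointAt returns the surrogate value itself,
            // which falls in the excluded range D800-DFFF and is therefore replaced.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            out.appendCodePoint(isXmlChar(cp) ? cp : 0xFFFD);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = sanitize("Arbitrary input: \uD83D");
        System.out.println(cleaned.equals("Arbitrary input: \uFFFD"));
    }
}
```

In the test from the question this would be used as subElement.setTextContent(XmlTextSanitizer.sanitize(arbitraryInput)), so the Transformer only ever sees characters it can serialize as valid XML.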

Licensed under: CC-BY-SA with attribution