Preserve encoding after SAX parsing

https://stackoverflow.com/questions/19686463

01-07-2022

Question

I have an XML document that contains attributes like the following:

<Tag Body="&lt;p&gt;">

I want to preserve the text in the Body attribute exactly as-is; however, the parser converts it to "<p>". I want to keep the literal characters "&", "l", "t", ";", etc.

I'm using the Java SAX API to parse the XML document like so:

    SAXParserFactory spf = SAXParserFactory.newInstance();
    SAXParser saxParser = spf.newSAXParser();
    XMLReader xmlReader = saxParser.getXMLReader();
    xmlReader.setContentHandler(new MyHandler());
    xmlReader.setErrorHandler(new MyErrorHandler(System.err));
    xmlReader.parse(convertToFileURL(myFileName));

The relevant code in MyHandler.java is:

    public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
            throws SAXException
    {
        if (qName.equals("Tag")) {
            String Body = atts.getValue("Body");
            char[] s = Body.toCharArray();  // s[0] will be '<', but I want it to be '&'
        }
    }

How can I get the parsing method to leave the attribute text alone and not try to convert anything?

Solution

I'll answer my own question.

I didn't find a way to stop the parser from unescaping the text in the first place, but I did find a workaround (thanks @user1516873): re-escape it afterwards using Apache Commons:

    String Body = atts.getValue("Body");
    String Body_escaped = StringEscapeUtils.escapeXml(Body);

This achieves the desired result: for the example above, Body_escaped is "&lt;p&gt;".
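For anyone who wants to try the workaround without adding a dependency, here is a minimal, self-contained sketch. The class ReEscapeDemo and its helpers escapeXml and parseBody are hypothetical names made up for this illustration; the hand-written escapeXml only stands in for the Commons call and covers just the five predefined XML entities. As far as I know, newer Commons Lang 3 releases deprecate StringEscapeUtils.escapeXml in favour of escapeXml10/escapeXml11, so check the version you have.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ReEscapeDemo {

    // Hypothetical stand-in for StringEscapeUtils.escapeXml:
    // re-escapes only the five predefined XML entities.
    static String escapeXml(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Parses the XML with SAX and returns the re-escaped Body attribute
    // of the first <Tag> element, or null if there is none.
    static String parseBody(String xml) {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals("Tag") && result[0] == null) {
                    // atts.getValue has already been unescaped by the parser
                    result[0] = escapeXml(atts.getValue("Body"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        // prints: &lt;p&gt;
        System.out.println(parseBody("<Tag Body=\"&lt;p&gt;\"/>"));
    }
}
```

One caveat with this approach: re-escaping reproduces an escaped form, not necessarily the original bytes. An attribute written as Body="&#60;p&#62;" in the file is unescaped to the same "<p>" and therefore also comes back as "&lt;p&gt;".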

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow