Why are some characters missing when I parse an XML tag using SaxParser?

https://stackoverflow.com/questions/18460518

26-06-2022

Question

I am parsing an XML response of almost 90,000 characters in my Android application using SaxParser. The XML looks like the following:

 <Registration>
     <Client>   
         <Name>John</Name>
         <ID>1</ID>
         <Date>2013:08:22T03:43:44</Date>
     </Client>  
     <Client>   
         <Name>James</Name>
         <ID>2</ID>
         <Date>2013:08:23T16:28:00</Date>
     </Client>
     <Client>   
         <Name>Eric</Name>
         <ID>3</ID>
         <Date>2013:08:23T19:04:15</Date>
     </Client>

     ..... 
 </Registration>

Sometimes the parser misses some characters from the Date tag: instead of returning 2013:08:23T19:04:15, it returns 2013:08:23T. I tried to strip all whitespace from the response XML string using the following line of code:

 responseStr = responseStr.replaceAll("\\s","");

But then I get the following exception:

 Parsing exception: org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 16: not well-formed (invalid token)

The following is the code I am using for parsing:

 try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            DefaultHandler handler = new DefaultHandler() {
                public void startElement(String uri, String localName,String qName, Attributes attributes) throws SAXException {
                    tagName = qName;
                }

                public void endElement(String uri, String localName, String qName) throws SAXException {

                }

                public void characters(char ch[], int start, int length) throws SAXException {
                    if(tagName.equals("Name")){
                        obj = new RegisteredUser();
                        String str = new String(ch, start, length);
                        obj.setName(str);
                    }else if(tagName.equals("ID")){
                        String str = new String(ch, start, length);
                        obj.setId(str);
                    }else if(tagName.equals("Date")){
                        String str = new String(ch, start, length);
                        obj.setDate(str);

                        users.add(obj);
                    }
                }

                public void startDocument() throws SAXException {
                    System.out.println("document started");
                }

                public void endDocument() throws SAXException {
                    System.out.println("document ended");
                }
            };

            saxParser.parse(new InputSource(new StringReader(resp)), handler);

        }catch(Exception e){
            System.out.println("Parsing exception: "+e);
            System.out.println("exception");

        }

Any idea why the parser is skipping characters from a tag, and how I can solve this problem? Thanks in advance.

Solution

It's possible that characters() is called more than once for any given text node.

In that case you'll have to concatenate the result yourself!

This happens when an internal buffer of the parser ends while there is still content left in the text node. Instead of enlarging the buffer (which could require a lot of memory when the text node is large), the parser delivers the text in pieces and lets the client code handle it.

You want something like this:

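// Accumulates the text of the element currently being read.
StringBuilder textContent = new StringBuilder();

public void startElement(String uri, String localName,String qName, Attributes attributes) throws SAXException {
    tagName = qName;
    // A new element starts: discard any text collected so far.
    textContent.setLength(0);
}
public void characters(char ch[], int start, int length) throws SAXException {
    // May be called several times per text node, so append instead of assigning.
    textContent.append(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
    // Only now is the text of the element complete.
    String text = textContent.toString();
    // handle text here
}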

Of course this code can be improved to only track the text content for nodes you actually care about.
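
As a rough sketch of that improvement, the handler below buffers text the same way but only stores values when a Name, ID, or Date element closes, and adds the user when its Client element closes. The class names ClientSaxDemo and ClientHandler, the parse helper, and the plain fields on RegisteredUser are assumptions made for illustration; the RegisteredUser class in the question has its own setters and is not shown there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ClientSaxDemo {

    // Hypothetical stand-in for the RegisteredUser class from the question.
    static class RegisteredUser {
        String name;
        String id;
        String date;
    }

    static class ClientHandler extends DefaultHandler {
        final List<RegisteredUser> users = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private RegisteredUser current;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0); // forget whitespace or text seen before this tag
            if (qName.equals("Client")) {
                current = new RegisteredUser();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several pieces.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null) {
                String value = text.toString().trim();
                switch (qName) {
                    case "Name":   current.name = value; break;
                    case "ID":     current.id = value; break;
                    case "Date":   current.date = value; break;
                    case "Client": users.add(current); current = null; break;
                    default:       break; // elements we do not care about
                }
            }
            text.setLength(0);
        }
    }

    // Parses the XML string and returns the collected users.
    static List<RegisteredUser> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ClientHandler handler = new ClientHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.users;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Registration><Client><Name>Eric</Name><ID>3</ID>"
                + "<Date>2013:08:23T19:04:15</Date></Client></Registration>";
        for (RegisteredUser u : parse(xml)) {
            System.out.println(u.name + " " + u.id + " " + u.date);
        }
    }
}
```

Creating the user on the Client start tag and adding it on the Client end tag also removes the dependency on Name always coming first and Date always coming last, which the code in the question relies on.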

Other suggestions

As others mentioned, the characters method may be called multiple times. It is up to the SAX parser implementation whether to return all contiguous character data in a single chunk or to split it into several chunks. See the docs: SAX Parser characters

You're incorrectly assuming that all the characters in a text node will be read at once and sent to the characters() method. It's not the case. The characters() method can be called multiple times for a single text node.

You should append all the chars to a StringBuilder and then only convert to a String or Date when endElement() is called.
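
Here is a small sketch of that advice for the Date element. To make the chunking visible without depending on any parser's buffer size, it calls the handler callbacks by hand and delivers the text in two pieces, the same split the question observed. The names DateChunkDemo, DateHandler, and parseInChunks are made up for this example, and the pattern yyyy:MM:dd'T'HH:mm:ss is an assumption read off the sample XML.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class DateChunkDemo {

    static class DateHandler extends DefaultHandler {
        final List<Date> dates = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        // Assumed pattern, based on values such as 2013:08:23T19:04:15.
        private final SimpleDateFormat format =
                new SimpleDateFormat("yyyy:MM:dd'T'HH:mm:ss", Locale.US);

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // collect every piece
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (qName.equals("Date")) {
                try {
                    // Convert only once the full text of the element is known.
                    dates.add(format.parse(text.toString().trim()));
                } catch (ParseException e) {
                    throw new SAXException("Unparseable date: " + text, e);
                }
            }
            text.setLength(0);
        }
    }

    // Simulates a parser that splits one Date text node into two characters() calls.
    static Date parseInChunks(String first, String second) throws SAXException {
        DateHandler handler = new DateHandler();
        handler.startElement("", "Date", "Date", new AttributesImpl());
        char[] a = first.toCharArray();
        handler.characters(a, 0, a.length);
        char[] b = second.toCharArray();
        handler.characters(b, 0, b.length);
        handler.endElement("", "Date", "Date");
        return handler.dates.get(0);
    }

    public static void main(String[] args) throws SAXException {
        Date date = parseInChunks("2013:08:23T", "19:04:15");
        System.out.println(
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US).format(date));
    }
}
```

A handler that converted the text inside characters() would see only 2013:08:23T on the first call and either fail to parse it or store a truncated value.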

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow