Question

I'm using a ContentHandler to parse custom HTML with CSS styles. The problem is that the ContentHandler misbehaves when I try to parse HTML containing a UL tag: it calls startElement(), then endElement(), and only then characters().

Here is my HTML

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>ul.ul1{list-style-type:image;}
</style>
</head>
<body>
<ul class="ul1">List</ul>
<ul class="ul2">List</ul>
</body>
</html>

Here is sample code to test the parser

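import android.text.Spanned;
import android.util.Log;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

public class ContentHandler implements org.xml.sax.ContentHandler {
    public ContentHandler() {
    }

    public Spanned getResult() {
        // Building the result is omitted in this sample; only the callbacks are logged.
        return null;
    }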

    @Override
    public void setDocumentLocator(Locator locator) {
    }

    @Override
    public void startDocument() throws SAXException {
    }

    @Override
    public void endDocument() throws SAXException {
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        Log.d("html_parser", "start " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        Log.d("html_parser", "end " + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String bodyText = new String(ch, start, length);
        Log.d("html_parser", bodyText);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
    }

    @Override
    public void processingInstruction(String target, String data) throws SAXException {
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
    }
}

And LogCat output

02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start html
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start head
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start meta
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end meta
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start style
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ ul.ul1{list-style-type:image;}
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end style
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end head
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start body
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ List
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ List
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end body
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end html

Please note that parsing HTML without a UL tag works fine. Also note that org.ccil.cowan.tagsoup.jaxp.SAXParserImpl is used for parsing.

Solution

I tested your problem and found something interesting. You are using a SAX parser to parse HTML, but HTML differs from XML in many ways; for example, tags can be left unclosed. org.ccil.cowan.tagsoup.jaxp.SAXParserImpl is what lets us parse HTML with SAX. That parser also wraps the content in some additional tags (see https://github.com/websdotcom/tagsoup#what-tagsoup-does). Look at the HTML in the code below: when the list has the correct structure (a ul that contains li items), it is processed normally. So I think this looks like a bug in the TagSoup library.

import android.test.AndroidTestCase;
import android.util.Log;

import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import javax.xml.parsers.SAXParser;

/**
 * Created by kulik on 1/5/14.
 */
public class SaxTest extends AndroidTestCase {
    private static final String TAG = "SaxTest";

    public void testSax() {
        String testString = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
                "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n" +
                "<style>ul.ul1{list-style-type:image;}\n" +
                "</style>\n" +
                "</head>\n" +
                "<body>\n" +
                "<ul class=\"ul1\">List</ul>\n" +
                "<ul class=\"ul2\">" +
                "<li> li1</li>\n" +
                "<li> li2</li>\n" +
                "</ul>" +
                "</body>\n" +
                "</html>";

        Reader reader = new StringReader(testString);
        try {
            SAXParser sp = SAXParserImpl.newInstance(null);
            XMLReader xr = sp.getXMLReader();

            DefaultHandler myHandler = new ContentHandler();
            xr.setContentHandler(myHandler);
            xr.parse(new InputSource(reader));
        } catch (SAXException e) {
            Log.e(TAG, "", e);
        } catch (IOException e) {
            Log.e(TAG, "", e);
        }
    }

    public class ContentHandler extends DefaultHandler  {

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
            Log.d("html_parser", "start " + localName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            Log.d("html_parser", "end " + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            String bodyText = new String(ch, start, length);
            Log.d("html_parser", bodyText);
        }
    }
}

And the log:

  D/html_parser﹕ start html
  D/html_parser﹕ start head
  D/html_parser﹕ start meta
  D/html_parser﹕ end meta
  D/html_parser﹕ start style
  D/html_parser﹕ ul.ul1{list-style-type:image;}
  D/html_parser﹕ end style
  D/html_parser﹕ end head
  D/html_parser﹕ start body
  D/html_parser﹕ start ul
  D/html_parser﹕ end ul
  D/html_parser﹕ List
  D/html_parser﹕ start ul
  D/html_parser﹕ start li
  D/html_parser﹕ li1
  D/html_parser﹕ end li
  D/html_parser﹕ start li
  D/html_parser﹕ li2
  D/html_parser﹕ end li
  D/html_parser﹕ end ul
  D/html_parser﹕ end body
  D/html_parser﹕ end html

So you can implement your handler to catch that situation, because I think it only affects <ul> tags that contain no <li>. Maybe it happens because:

    The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted wheneve.......

    http://home.ccil.org/~cowan/XML/tagsoup/

    You can also ask the TagSoup team.
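
One possible way to catch that situation in a handler is sketched below. It is only an illustration under assumptions: UlTextHandler and getUlTexts() are hypothetical names, not part of TagSoup, and the sketch assumes the event order seen in the logs above (start ul, end ul, then the text). It remembers when a ul was closed without anything reported inside it and treats the very next characters() call as that list's text.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of TagSoup): collects text that the parser
// reports right after a <ul> that was closed with nothing inside it.
class UlTextHandler extends DefaultHandler {
    // True while a <ul> is open and no child element or text has been reported in it.
    private boolean ulOpenAndEmpty;
    // True when the previous event was the end of such an empty <ul>.
    private boolean emptyUlJustClosed;
    private final List<String> ulTexts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        ulOpenAndEmpty = "ul".equals(localName);
        emptyUlJustClosed = false;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        emptyUlJustClosed = ulOpenAndEmpty && "ul".equals(localName);
        ulOpenAndEmpty = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (emptyUlJustClosed) {
            // Assumption: this text was written inside the <ul> in the source HTML.
            ulTexts.add(new String(ch, start, length));
        }
        ulOpenAndEmpty = false;
        emptyUlJustClosed = false;
    }

    public List<String> getUlTexts() {
        return ulTexts;
    }
}
```

An instance can be passed to xr.setContentHandler(...) in the same way as the handler in the test above. Note the limits of the heuristic: a genuinely empty ul followed by ordinary body text would be misattributed, and only the first characters() chunk after the closing tag is captured.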

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow