I tested your problem and found something interesting. You are using a SAX parser to parse HTML, but HTML differs from XML in many ways; for example, tags can sometimes be left unclosed. org.ccil.cowan.tagsoup.jaxp.SAXParserImpl lets us parse HTML anyway. That parser also wraps the content in some additional tags: https://github.com/websdotcom/tagsoup#what-tagsoup-does. Look at the HTML in the code below, where the first ul holds bare text with no li. If you give the content a correct structure, it is processed normally. So I think this looks like a bug in the TagSoup library.
import android.test.AndroidTestCase;
import android.util.Log;

import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import javax.xml.parsers.SAXParser;

/**
 * Created by kulik on 1/5/14.
 */
public class SaxTest extends AndroidTestCase {

    private static final String TAG = "SaxTest";

    public void testSax() {
        String testString = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
                "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n" +
                "<style>ul.ul1{list-style-type:image;}\n" +
                "</style>\n" +
                "</head>\n" +
                "<body>\n" +
                "<ul class=\"ul1\">List</ul>\n" +
                "<ul class=\"ul2\">" +
                "<li> li1</li>\n" +
                "<li> li2</li>\n" +
                "</ul>" +
                "</body>\n" +
                "</html>";
        Reader reader = new StringReader(testString);
        try {
            SAXParser sp = SAXParserImpl.newInstance(null);
            XMLReader xr = sp.getXMLReader();
            DefaultHandler myHandler = new ContentHandler();
            xr.setContentHandler(myHandler);
            xr.parse(new InputSource(reader));
        } catch (SAXException e) {
            Log.e(TAG, "", e);
        } catch (IOException e) {
            Log.e(TAG, "", e);
        }
    }

    public class ContentHandler extends DefaultHandler {

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
            Log.d("html_parser", "start " + localName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            Log.d("html_parser", "end " + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            String bodyText = new String(ch, start, length);
            Log.d("html_parser", bodyText);
        }
    }
}
And the log:
D/html_parser﹕ start html
D/html_parser﹕ start head
D/html_parser﹕ start meta
D/html_parser﹕ end meta
D/html_parser﹕ start style
D/html_parser﹕ ul.ul1{list-style-type:image;}
D/html_parser﹕ end style
D/html_parser﹕ end head
D/html_parser﹕ start body
D/html_parser﹕ start ul
D/html_parser﹕ end ul
D/html_parser﹕ List
D/html_parser﹕ start ul
D/html_parser﹕ start li
D/html_parser﹕ li1
D/html_parser﹕ end li
D/html_parser﹕ start li
D/html_parser﹕ li2
D/html_parser﹕ end li
D/html_parser﹕ end ul
D/html_parser﹕ end body
D/html_parser﹕ end html
So you can implement your handler to catch that situation, because I think this is connected only with tags without any
The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted wheneve.......
http://home.ccil.org/~cowan/XML/tagsoup/
You can also ask the TagSoup team.
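
As a rough illustration of the "implement your handler to catch that situation" idea, here is a minimal sketch. It assumes the only symptom is the one visible in the log above: text that was written directly inside a ul is reported right after "end ul". The class name DisplacedTextHandler and the method getDisplaced() are made up for this example and are not part of TagSoup. The main method does not call TagSoup at all; it just replays the body part of the logged events by hand and prints to System.out instead of android.util.Log, so it runs on a plain JVM.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

// Hypothetical handler: collects text that shows up right after a closed <ul>.
public class DisplacedTextHandler extends DefaultHandler {

    // Local name of the element closed most recently; reset to null when a new element starts.
    private String lastClosed;

    // Non-whitespace text chunks that arrived directly after "end ul".
    private final List<String> displaced = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        lastClosed = null;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        lastClosed = localName;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        // Heuristic (an assumption): such text probably belonged inside the ul that just closed.
        if (!text.isEmpty() && "ul".equals(lastClosed)) {
            displaced.add(text);
        }
    }

    public List<String> getDisplaced() {
        return displaced;
    }

    // Small helper so the replay below stays readable.
    private void text(String s) {
        char[] ch = s.toCharArray();
        characters(ch, 0, ch.length);
    }

    public static void main(String[] args) {
        DisplacedTextHandler h = new DisplacedTextHandler();
        Attributes none = new AttributesImpl();

        // Replay of the body events from the log, in the same order.
        h.startElement("", "ul", "ul", none);
        h.endElement("", "ul", "ul");
        h.text("List");
        h.startElement("", "ul", "ul", none);
        h.startElement("", "li", "li", none);
        h.text("li1");
        h.endElement("", "li", "li");
        h.startElement("", "li", "li", none);
        h.text("li2");
        h.endElement("", "li", "li");
        h.endElement("", "ul", "ul");

        System.out.println(h.getDisplaced()); // prints [List]
    }
}
```

Note that this heuristic also flags ordinary text that legitimately follows a list, so in a real handler you would probably want to check it against the structure you expect. To try it with TagSoup, pass an instance to xr.setContentHandler(...) in place of ContentHandler in the test above.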