Question

I need to parse about 100 kB of HTML data and this simply causes huge performance issues on Android. I've tried both the built-in XML parser and JTidy.

The built-in XML parser gives me a parsing time of about half a second, which I can easily live with. The problem is that it's a bad idea to use an XML parser on messy HTML code, so this is not an option. (I tried preprocessing, but it even started complaining about valid HTML, so...)

I googled a bit, and JTidy was suggested for cleaning up the code before passing it to an XML parser. This was an absolute nightmare: with JTidy as a preprocessing step, parsing now takes approximately 7 seconds.

So now my only alternative really is regex. What do you think?


Solution

It depends on whether you are the owner of the HTML.

If (as I understand it) you do not own the HTML data and can't influence how it is formatted, then you will probably find this useful: Parse HTML in Android
But if the HTML is really bad, the result can't be guaranteed, and you may prefer working with regex. Even browsers switch to quirks mode when they deal with "bad" HTML, with no guarantee that it is displayed correctly.
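
If you do go the regex route, it tends to work best when you only need one narrow piece of data rather than the whole document structure. Here is a minimal sketch, assuming the goal is to pull link targets out of messy markup; the class name LinkExtractor and the method extractLinks are made up for illustration, and the pattern only covers plain href attributes on anchor tags, so treat it as a starting point rather than a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: extracts href values
// from <a> tags without requiring well-formed HTML.
class LinkExtractor {
    // Compiled once and reused; recompiling on every call adds avoidable cost.
    // Accepts double-quoted, single-quoted, and unquoted attribute values.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three capture groups is set per match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    links.add(m.group(g));
                    break;
                }
            }
        }
        return links;
    }
}
```

Because the pattern never looks for closing tags, unclosed or badly nested elements do not stop it. The trade-off is that it knows nothing about comments, scripts, or entity escaping, so it can pick up hrefs from commented-out markup.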

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow