Is HTML parsing (in Java/Android) then extracting data from it, an effective way of getting a webpage's content?

https://stackoverflow.com/questions/10007785

29-05-2021

Question

So, I'm using HTTP POST requests in Java on Android to log into a website and then download the page's entire HTML. After that, I use Pattern/Matcher (regex) to find all the elements I need, extract them from the HTML, and discard everything else. For instance, when I extract this:

String extractions = "<td>Good day sir</td>";

Then I use:

// Strings are immutable, so the result has to be assigned back
extractions = extractions.replaceAll("<td>", "").replaceAll("</td>", "");

I do this multiple times until I have all the data needed from that site, before I display it in some kind of list.

I'm not particularly stuck on anything, but can you tell me whether this is an effective/efficient/fast way of getting data from a page and processing it, or whether there are faster ways? Sometimes my program takes a long time to get certain data (although mostly that's when I'm on 3G on my phone).

Solution

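Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be: each replaceAll call compiles its pattern again, rescans the string, and builds a new copy of it, even though the Matcher had already located the text you wanted.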
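
If you do stay with regex, a minimal sketch of a cheaper variant is to compile the pattern once and let a capture group return the cell text directly, so no replaceAll passes are needed afterwards. The class name CellExtractor, the method extractCells, and the sample HTML are made up for illustration, and the pattern assumes bare <td> tags like the one in the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellExtractor {
    // Compiled once and reused. The capture group holds the text between
    // the tags, so nothing has to be stripped after the match.
    private static final Pattern CELL =
            Pattern.compile("<td>(.*?)</td>", Pattern.DOTALL);

    // Hypothetical helper: returns the text of every bare <td> cell.
    static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(m.group(1).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr><td>Good day sir</td><td>Second cell</td></tr>";
        System.out.println(extractCells(html)); // [Good day sir, Second cell]
    }
}
```

This only removes the redundant passes over the string. It is still regex on HTML, with all the fragility that implies.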

In any case, let me offer one more possible solution (depending on your use case).

It's called YQL (Yahoo Query Language). http://developer.yahoo.com/yql/

Here is a console for it so you can play around with it. http://developer.yahoo.com/yql/console/

YQL is the lazy developer's way to build your own API on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're OK with that, then I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the HTML you're targeting keeps changing and its tags are not always valid).

OTHER TIPS

Using regex to parse a website is always a bad idea:

How to use regular expressions to parse HTML in Java?

Using regular expressions to parse HTML: why not?
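
To make the problem concrete, here is a small sketch of how a naive pattern silently drops cells as soon as the markup varies. The pattern is an assumption about what such an extraction might look like (the question does not show its actual regex), and the class name and sample HTML are invented for the demo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveRegexPitfall {
    // Assumed naive pattern: matches only a bare, lower-case <td> on one line.
    private static final Pattern NAIVE_CELL = Pattern.compile("<td>(.*?)</td>");

    static List<String> naiveCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = NAIVE_CELL.matcher(html);
        while (m.find()) {
            cells.add(m.group(1));
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr>"
                + "<td>plain</td>"                       // matched
                + "<td class=\"hl\">has attribute</td>"  // missed: attribute
                + "<TD>upper case</TD>"                  // missed: case
                + "<td>line\nbreak</td>"                 // missed: '.' stops at newline
                + "</tr>";
        System.out.println(naiveCells(html)); // [plain]
    }
}
```

All four cells are valid HTML, but only one is found, and nothing signals that the other three were lost. Each case can be patched with flags or a longer pattern, but real pages keep producing new ones, which is why a proper parser is the usual recommendation.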

Have a look at the Apache Tika library for extracting text from HTML. It also ships parsers for many other formats, such as PDF: http://tika.apache.org/

Licensed under: CC-BY-SA with attribution
