Scraping Yahoo Answers with Jsoup

https://stackoverflow.com/questions/21342483

02-10-2022

Question

I am trying to scrape the results of a keyword search on Yahoo Answers, in my case "alcohol addiction". I am using Jsoup and modifying the URL to step through the pages of search results. However, even though I request the URL for 'Newest' results, I keep getting the 'Relevance' results, and, worse, the results are not exactly the same as what the browser shows.

For instance, the URL for Newest results is: http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=new

And for relevant results, the URL is: http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=rel

The "1" in the s parameter changes to 2, 3, 4, etc. as you go to the next page (there are 10 results per page).
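For illustration, the URL pattern described above can be sketched as a small helper. The class and method names (SearchUrls, buildSearchUrl) are hypothetical and not part of the original code; only the parameters p, s and sort come from the URLs shown above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the search URL for a given query, page and sort order.
final class SearchUrls {
    private static final String BASE = "http://answers.yahoo.com/search/search_result";

    static String buildSearchUrl(String query, int page, String sort) {
        // URLEncoder encodes spaces as '+', which matches the URLs in the question
        String encodedQuery = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return BASE + "?p=" + encodedQuery + "&s=" + page + "&sort=" + sort;
    }

    public static void main(String[] args) {
        // Print the URLs for the first three pages of 'Newest' results
        for (int page = 1; page <= 3; page++) {
            System.out.println(buildSearchUrl("alcohol addiction", page, "new"));
        }
    }
}
```

Building the URL in one place makes it easy to log exactly what is requested for each page and to switch between sort=new and sort=rel.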

Here's what I do to scrape the page:

String urlID = "";
String end = "&sort=new";
String glob = "http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=";
int forumID = 0; // page counter, substituted into the "s" parameter

while (nextPageIsThere) {
    forumID++;
    System.out.println("Now extracting the page: " + forumID);
    try {
        // Build the URL for the current page and fetch it with Jsoup
        urlID = glob + forumID + end;
        System.out.println(urlID);
        exdoc = Jsoup.connect(urlID).get();
        java.util.Date date = new java.util.Date();
    } catch (IOException e) {
        e.printStackTrace();
    }

...

What's even more confusing is that even when I increase the page number, and the output shows the URL changing to:

http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=2&sort=new

and

http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=3&sort=new

it still scrapes the same results as page 1 over and over again. I know my code is not wrong; I've been debugging it for hours. I think it has something to do with Jsoup.connect and/or Yahoo Answers possibly blocking bots, although I'm not convinced that's really it.

Does anyone know why this might be happening?

Solution

Jsoup works with static HTML only. It does not execute JavaScript, so it cannot parse dynamic pages like this one, where the content is loaded after the initial page load through an Ajax request or modified by JavaScript.
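One way to confirm that every request really returns the same static HTML is to fingerprint each response and check for repeats. The following is a minimal sketch using only the standard library; the class name PageFingerprints and its methods are hypothetical, and the strings in main stand in for HTML that would come from the fetched document (for example exdoc.outerHtml()).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers a SHA-256 fingerprint of every page body seen so far.
final class PageFingerprints {
    private final Set<String> seen = new HashSet<>();

    // Returns true if an identical HTML body was already recorded.
    boolean isRepeat(String html) {
        return !seen.add(sha256Hex(html));
    }

    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PageFingerprints fingerprints = new PageFingerprints();
        // Placeholder bodies; in the scraper these would be the fetched pages
        String[] fetchedPages = { "<html>page 1</html>", "<html>page 1</html>" };
        for (int i = 0; i < fetchedPages.length; i++) {
            if (fingerprints.isRepeat(fetchedPages[i])) {
                System.out.println("Page " + (i + 1) + " repeats an earlier response");
            }
        }
    }
}
```

Exact hashing only catches byte-identical responses. If the server embeds timestamps or tokens in each page, comparing the extracted result titles instead may be a more reliable check.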

Try reading this page with HtmlUnit instead. It is a headless browser rather than a plain parser, and it supports JavaScript-driven pages.

It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
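As a rough sketch of this approach, the page can be loaded with HtmlUnit so that its scripts run, and the rendered markup then handed to Jsoup for the usual selector-based extraction. This assumes HtmlUnit 2.x and Jsoup are on the classpath (HtmlUnit 3.x uses the package org.htmlunit instead of com.gargoylesoftware.htmlunit). The class name RenderedPageFetcher and the 5 second wait are illustrative choices, not values from the original post.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

// Hypothetical helper: renders a page with HtmlUnit, then parses the result with Jsoup.
final class RenderedPageFetcher {

    static Document fetchRendered(String url) throws IOException {
        // The no-argument constructor uses HtmlUnit's default simulated browser
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            // Real-world pages often contain script errors; do not abort on them
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setCssEnabled(false);

            HtmlPage page = webClient.getPage(url);
            // Give background Ajax calls up to 5 seconds to finish (assumed timeout)
            webClient.waitForBackgroundJavaScript(5_000);

            // Hand the rendered DOM to Jsoup so existing selectors keep working
            return Jsoup.parse(page.asXml(), url);
        }
    }

    public static void main(String[] args) throws IOException {
        String url = "http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=2&sort=new";
        Document doc = fetchRendered(url);
        System.out.println(doc.title());
    }
}
```

A BrowserVersion constant can be passed to the WebClient constructor to choose which browser is simulated. How long to wait for background JavaScript depends on the page, so the timeout may need tuning.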

Licensed under: CC-BY-SA with attribution
