I am trying to scrape the results of a keyword search on Yahoo Answers, in my case "alcohol addiction." I am using Jsoup and modifying the URL to step through the pages of search results. However, even though I request the URL for 'Newest' results, I keep getting the 'Relevance' results, and what's worse, the results are not exactly the same as what the browser shows.
For instance, the URL for Newest results is:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=new
And for relevant results, the URL is:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=rel
The "1" after "s=" changes to 2, 3, 4, etc. as you go to the next page (there are 10 results per page).
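To make the URL pattern concrete, here is a minimal sketch of how these URLs could be assembled from a query, a page number, and a sort order. The class name YahooSearchUrls and the method build are hypothetical helpers introduced only for illustration; they are not part of the scraper shown below, and the sketch assumes the "p", "s", and "sort" parameters behave as described above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds search URLs in the pattern described above.
class YahooSearchUrls {
    private static final String BASE = "http://answers.yahoo.com/search/search_result";

    // query: raw search phrase; page: value for "s"; sort: "new" or "rel"
    static String build(String query, int page, String sort) {
        try {
            // URLEncoder turns spaces into '+', matching the URLs in the question
            String encoded = URLEncoder.encode(query, "UTF-8");
            return BASE + "?p=" + encoded + "&s=" + page + "&sort=" + sort;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Print the first three 'Newest' result pages
        for (int page = 1; page <= 3; page++) {
            System.out.println(build("alcohol addiction", page, "new"));
        }
    }
}
```

Encoding the query instead of hard-coding "alcohol+addiction" keeps the same pattern usable for other search phrases.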
Here's what I do to scrape the page:
// exdoc (a Jsoup Document) and nextPageIsThere are declared outside this snippet
String urlID = "";
String end = "&sort=new";
String glob = "http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=";
int forumID = 0;  // page number, placed after "s="
while (nextPageIsThere) {
    forumID++;
    System.out.println("Now extracting the page: " + forumID);
    try {
        urlID = glob + forumID + end;
        System.out.println(urlID);
        exdoc = Jsoup.connect(urlID).get();
        java.util.Date date = new java.util.Date();
    } catch (IOException e) {
        e.printStackTrace();
    }
...
What's even more confusing is that even when I increase the page number, and the console output shows the URL changing to:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=2&sort=new
and
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=3&sort=new
it still scrapes the same content as page 1 over and over again. I don't believe the bug is in my code; I've been debugging it for hours. I think it has something to do with Jsoup.connect and/or Yahoo Answers possibly blocking bots, but at the same time I'm not convinced that's really it.
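The repeated-page symptom can be confirmed programmatically rather than by eye. Below is a minimal sketch, assuming the text of each fetched page has already been extracted into a String (for example with Jsoup's Document.text()). The class RepeatedPageDetector and its method isRepeat are hypothetical names introduced for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers a digest of every page seen so far.
class RepeatedPageDetector {
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true if identical page text was passed in before.
    boolean isRepeat(String pageText) {
        // Set.add returns false when the digest is already present
        return !seenDigests.add(sha256Hex(pageText));
    }

    private static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RepeatedPageDetector detector = new RepeatedPageDetector();
        // Placeholder strings standing in for the text of fetched pages
        String[] pages = {"results A", "results A", "results B"};
        for (int i = 0; i < pages.length; i++) {
            System.out.println("page " + (i + 1) + " repeat? " + detector.isRepeat(pages[i]));
        }
    }
}
```

Inside the scraping loop, a check like this could end the loop as soon as a page repeats. Pages that embed timestamps or ads may differ byte for byte even when the listed results are identical, so hashing only the extracted result titles is likely to be more reliable than hashing the whole page.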
Does anyone know why this might be happening?