Question

I am trying to build a web scraper that, for this example, scrapes news articles from Reuters.com. I want to get the title and date of each article. I know I will ultimately have to pull the page source from each address and then parse the HTML using something like jsoup.

My question is: how do I make sure I do this for every news article on Reuters.com? How do I know I have hit all the reuters.com addresses? Are there any APIs that can help me with this?

Solution

What you are describing is web crawling plus web scraping. You have to visit every link matching some criteria (crawling) and then extract the content from each page (scraping). I've never used them, but here are two Java frameworks for the job:

  1. http://wiki.apache.org/nutch/NutchTutorial
  2. https://code.google.com/p/crawler4j/

Of course, you will still need jsoup (or something similar) to parse the content after you've collected the URLs.
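To make the crawling half concrete, here is a minimal sketch of the loop that such frameworks run for you, using only the JDK. It is not how Nutch or crawler4j are implemented. The names CrawlSketch, crawl, sameHost and linkExtractor are made up for illustration, and the link extractor is a placeholder: in a real crawler it would fetch the page and return its links, for example with jsoup via Jsoup.connect(url).get() followed by select("a[href]") and attr("abs:href"). The example runs against a hypothetical in-memory link graph, so it needs no network access.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class CrawlSketch {

    /**
     * Breadth-first crawl restricted to the seed's host.
     * linkExtractor stands in for "fetch this page and return the links on it".
     * Returns the URLs visited, in the order they were visited.
     */
    static Set<String> crawl(String seed,
                             Function<String, List<String>> linkExtractor,
                             int maxPages) {
        String host = URI.create(seed).getHost();
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already processed this address
            }
            // This is also the point where you would scrape title and date.
            for (String link : linkExtractor.apply(url)) {
                if (sameHost(link, host) && !visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    /** True if the link parses as a URI and its host equals the given host. */
    static boolean sameHost(String link, String host) {
        try {
            return host != null && host.equalsIgnoreCase(URI.create(link).getHost());
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: skip it
        }
    }

    public static void main(String[] args) {
        // Hypothetical in-memory link graph standing in for real pages.
        Map<String, List<String>> fakeSite = Map.of(
            "http://example.com/", List.of("http://example.com/a", "http://other.org/x"),
            "http://example.com/a", List.of("http://example.com/", "http://example.com/b"),
            "http://example.com/b", List.of());

        Set<String> pages = crawl("http://example.com/",
                url -> fakeSite.getOrDefault(url, List.of()), 100);
        System.out.println(pages);
    }
}
```

Note that a crawl like this only finds pages reachable by following links from the seed, so it cannot by itself prove that every article was hit. The host check is an exact match, so it would need loosening if the site spreads articles across subdomains.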

Update: see the question "Sending cookies in request with crawler4j?" for a better list of crawlers. Nutch is pretty good, but very complicated if all you want is to crawl one site. crawler4j is very simple, but I don't know whether it supports cookies (if that matters to you, it's a deal breaker).

OTHER TIPS

Try this website: http://scrape4me.com/

I was able to generate this URL for the headline: http://scrape4me.com/api?url=http%3A%2F%2Fwww.reuters.com%2F&head=head&elm=&item[][DIV.topStory]=0&ch=ch

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow