Question

What I do is:
- crawl the page
- fetch all links on the page and put them in a list
- start a new crawler, which visits each link in the list
- download them

There must be a quicker way, where I can download the links directly while I visit the page? Thanks!


Solution

crawler4j automatically does this process for you. You first add one or more seed pages; these are the pages that are fetched and processed first. crawler4j then extracts all the links in those pages and passes them to your shouldVisit function. If you really want to crawl all of them, this function should simply return true for every URL. If you only want to crawl pages within a specific domain, you can check the URL and return true or false based on that.

The URLs for which your shouldVisit returns true are then fetched by the crawler threads, and the same process is performed on them.
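To make that concrete, here is a minimal sketch of such a crawler, assuming the crawler4j 4.x API (the shouldVisit signature differs slightly in older releases). The domain filter and output folder are placeholder values. It saves each fetched page to disk directly in visit, so downloading happens while crawling instead of in a second pass:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DownloadingCrawler extends WebCrawler {

    // Placeholder domain filter; adjust to the site you want to crawl.
    private static final String DOMAIN = "https://www.example.com/";

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Only follow links that stay inside the target domain.
        return url.getURL().toLowerCase().startsWith(DOMAIN);
    }

    @Override
    public void visit(Page page) {
        // Called once per fetched page: write the raw bytes straight to disk,
        // so there is no separate "download" pass after crawling.
        // The hash-based file name is a simplistic scheme for the sketch.
        String name = page.getWebURL().getURL().hashCode() + ".html";
        try {
            Files.createDirectories(Paths.get("downloads"));
            Files.write(Paths.get("downloads", name), page.getContentData());
        } catch (IOException e) {
            System.err.println("Could not save " + page.getWebURL().getURL());
        }
    }
}
```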

The example code here is a good starting point.
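The setup around that crawler class looks roughly like the following (again a sketch against the crawler4j 4.x API; the storage folder, seed URL, and thread count are placeholder values you would tune yourself):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-storage"); // placeholder path

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed page(s): crawling starts here; extracted links are passed
        // through shouldVisit automatically.
        controller.addSeed("https://www.example.com/");

        // Number of concurrent crawler threads; tune to your bandwidth and RAM.
        int numberOfCrawlers = 8;
        controller.start(DownloadingCrawler.class, numberOfCrawlers);
    }
}
```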

OTHER TIPS

The general approach would be to separate the crawling and the downloading tasks into separate worker threads, with a maximum number of threads depending on your memory requirements (i.e. the maximum RAM you want to use for storing all this information).
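In plain Java, that pattern might look something like the sketch below: the crawler hands each discovered URL to a bounded pool of downloader threads. The class and method names here are made up for illustration, not part of any library:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DownloadPool {

    // Cap the number of worker threads to bound memory and connection usage.
    private final ExecutorService workers = Executors.newFixedThreadPool(10);

    /** Called by the crawler as soon as it discovers a link. */
    public void submit(String url) {
        workers.execute(() -> download(url));
    }

    private void download(String url) {
        String name = url.hashCode() + ".bin";
        try (InputStream in = new URL(url).openStream()) {
            Files.createDirectories(Paths.get("downloads"));
            Files.copy(in, Paths.get("downloads", name));
        } catch (Exception e) {
            System.err.println("Failed to download " + url + ": " + e.getMessage());
        }
    }

    public void shutdown() {
        workers.shutdown();
    }
}
```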

However, crawler4j already gives you this functionality. By splitting downloading and crawling into separate threads, you try to maximize the utilization of your connection, pulling down as much data as your connection can handle and as the servers providing the information can send you. The natural limitation is that, even if you spawn 1,000 threads, if the servers are only giving you content at 0.3 KB per second each, that is still only 300 KB per second that you'll be downloading. But you just don't have any control over that aspect of it, I'm afraid.

The other way to increase the speed is to run the crawler on a system with a fatter pipe to the internet, since your maximum download speed is, I'm guessing, the limiting factor to how fast you can get data currently. For example, if you were running the crawling on an AWS instance (or any of the cloud application platforms), you would benefit from their extremely high speed connections to backbones, and shorten the amount of time it takes to crawl a collection of websites by effectively expanding your bandwidth far beyond what you're going to get at a home or office connection (unless you work at an ISP, that is).

It's theoretically possible that, in a situation where your pipe is extremely large, the limitation starts to become the maximum write speed of your disk, for any data that you're saving to local (or network) disk storage.

Licensed under: CC-BY-SA with attribution