Question

I'm looking for a web crawler with the ability to grab the page's CSS. I don't need any other fancy crawling abilities.

I'm trying to make my way through Xapian, Nutch and Heritrix. They all seem a bit complex. If anyone has experience or a recommendation, I would love to hear it. An accessible tutorial for any of the above platforms is also welcome.

David


Solution

You are right: don't use those. They are far too heavyweight for this job.

Use: Crawler4j

Follow the tutorial on the project's site to build a simple crawler.

The only changes you need are in MyCrawler.java: remove "css" from the FILTERS pattern, and in the visit() method add a simple condition as follows:

if (url.contains(".css")) {
    // do what you need with it
}

That's it - you are good!
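If the plain substring test turns out to be too loose (it would also match something like page.css.html), a slightly stricter check is possible. The following is only a sketch using the JDK, not crawler4j code: the class name CssUrlCheckSketch, the helpers shouldVisit and isCssUrl, and the exact extension list in FILTERS are assumptions for illustration, so adapt them to whatever pattern your copy of the tutorial uses.

```java
import java.net.URI;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of crawler4j.
class CssUrlCheckSketch {
    // Illustrative extension filter; note that "css" is NOT in the list,
    // so stylesheet URLs are no longer skipped.
    static final Pattern FILTERS =
            Pattern.compile(".*(\\.(js|gif|jpe?g|png|mp3|zip|gz))$");

    // True when the URL is not excluded by the extension filter.
    static boolean shouldVisit(String url) {
        return !FILTERS.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    // True when the URL's path (ignoring any query string) ends with ".css".
    static boolean isCssUrl(String url) {
        try {
            String path = URI.create(url).getPath();
            return path != null && path.toLowerCase(Locale.ROOT).endsWith(".css");
        } catch (IllegalArgumentException e) {
            // Malformed URL: treat it as "not a stylesheet".
            return false;
        }
    }
}
```

Inside visit() you would then call isCssUrl(url) in place of url.contains(".css"). Checking the parsed path means a URL such as style.css?v=2 still counts, while page.css.html does not.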

OTHER TIPS

I recommend using a plain HttpClient and a simple regex. You can store the responses in a file, a database, or an archive of your own (see Heritrix).

This keeps things simple compared with a heavyweight crawler. Since there are usually only a few CSS files per domain, you can safely skip complex URL following within the domain.
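As a rough sketch of that idea: fetch the page, pull the stylesheet links out with a regex, then fetch each one. This version assumes the JDK's built-in java.net.http.HttpClient (Java 11+) rather than any particular third-party client. The class name CssGrabberSketch and the methods extractCssLinks and fetch are made up for illustration. The regexes only handle ordinary quoted href attributes on link tags, so inline style blocks and @import rules are not covered.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class CssGrabberSketch {
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_STYLESHEET =
            Pattern.compile("rel\\s*=\\s*[\"']?stylesheet", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns absolute URLs of all <link rel="stylesheet"> tags found in the HTML.
    static List<String> extractCssLinks(String html, URI base) {
        List<String> links = new ArrayList<>();
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            String linkTag = tag.group();
            if (!REL_STYLESHEET.matcher(linkTag).find()) {
                continue; // icons, feeds, and so on
            }
            Matcher href = HREF.matcher(linkTag);
            if (href.find()) {
                try {
                    // Resolve relative hrefs against the page URL.
                    links.add(base.resolve(href.group(1).trim()).toString());
                } catch (IllegalArgumentException e) {
                    // Skip hrefs that are not valid URIs.
                }
            }
        }
        return links;
    }

    // Downloads a URL and returns the response body as text.
    static String fetch(HttpClient client, URI uri)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

To use it, create a client with HttpClient.newHttpClient(), call fetch on the page URL, pass the result to extractCssLinks, and call fetch again for each returned link before storing the text wherever you like. The default client does not follow redirects, so you may want to build it with followRedirects(HttpClient.Redirect.NORMAL).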

Cheers!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow