Java CSS Crawler

https://stackoverflow.com/questions/4707598

11-10-2019

Question

I'm looking for a web crawler with the ability to grab the page's CSS. I don't need any other fancy crawling abilities.

I'm trying to make my way through Xapian, Nutch and Heritrix, but they all seem a bit complex for this. If anyone has experience with them or a recommendation, I would love to hear it. An accessible tutorial for any of these platforms is also welcome.

David

Solution

You are right: don't use those, they are far too heavy for this task.

Use: Crawler4j

Follow the onsite tutorial for a simple crawler.

The only changes you need are in MyCrawler.java. First, remove "css" from the FILTERS pattern so stylesheet URLs are no longer skipped. Then, in the visit() method, add a simple condition as follows:

// url of the page currently being visited
String url = page.getWebURL().getURL();
if (url.contains(".css")) {
    // do what you need with it
}

That's it - you are good!
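To make the two changes concrete, here is a minimal sketch of the filtering logic in plain Java, with no crawler4j dependency so it can be run on its own. The class name CssFilterSketch and the method isStylesheet are hypothetical, and the FILTERS pattern is a shortened stand-in modeled on the tutorial's, not a copy of it. In a real crawler the same checks would sit inside the shouldVisit() and visit() methods of your WebCrawler subclass. The stylesheet check is a slightly stricter variant of url.contains(".css"): it ignores any query string or fragment and requires the path to end in ".css".

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class CssFilterSketch {
    // Assumption: a shortened extension filter in the style of the tutorial's FILTERS.
    // "css" has been removed from the alternation so stylesheets are not skipped.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(js|bmp|gif|jpe?g|png|tiff?|mp3|mp4|avi|mov|pdf|zip|rar|gz))$");

    // Mirrors the decision made in the crawler's shouldVisit(): skip filtered extensions.
    static boolean shouldVisit(String url) {
        return !FILTERS.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    // Mirrors the condition placed in visit(): is this URL a stylesheet?
    static boolean isStylesheet(String url) {
        // Drop the query string or fragment before checking the extension.
        String path = url.toLowerCase(Locale.ROOT).split("[?#]", 2)[0];
        return path.endsWith(".css");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/style.css",
            "http://example.com/logo.png",
            "http://example.com/theme.css?v=3"
        };
        for (String url : samples) {
            System.out.println(url + " visit=" + shouldVisit(url)
                    + " css=" + isStylesheet(url));
        }
    }
}
```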

OTHER TIPS

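I recommend using a plain HTTP client (for example HttpClient) and a simple regex to pull the stylesheet links out of each page. You can store the responses in a file, a database, or an archive format of your own (see Heritrix for an example).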

This keeps things simple compared with using a heavyweight crawler. Since there are usually only a few CSS files per domain, you can safely skip complex URL following within the domain.
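A rough sketch of this approach, under some assumptions: it uses the JDK's built-in java.net.http.HttpClient (Java 11+) as a stand-in for whichever HTTP client you prefer, and the class name CssLinkExtractor, its method names, and the example URL are all made up for illustration. The regex only looks at href attributes inside link tags and keeps the ones containing ".css", so it will miss stylesheets pulled in via @import or inline style blocks.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssLinkExtractor {
    // Captures the href value of any <link ...> tag, case-insensitively.
    private static final Pattern LINK_HREF = Pattern.compile(
            "<link\\b[^>]*\\shref\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns absolute URLs of stylesheets referenced in the HTML.
    static List<String> extractCssUrls(String html, URI base) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK_HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            if (!href.toLowerCase().contains(".css")) {
                continue; // icons, feeds and other non-CSS links
            }
            try {
                urls.add(base.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // Malformed href; skip it rather than abort the whole page.
            }
        }
        return urls;
    }

    // Downloads a URL and returns the response body as text.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder start page; pass a real one as the first argument.
        String page = args.length > 0 ? args[0] : "http://example.com/";
        for (String cssUrl : extractCssUrls(fetch(page), URI.create(page))) {
            String css = fetch(cssUrl);
            System.out.println(cssUrl + " (" + css.length() + " chars)");
        }
    }
}
```

From here, storing each stylesheet is a matter of writing the fetched text to a file or database row keyed by its URL.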

Cheers !

Licensed under: CC-BY-SA with attribution
