Java CSS Crawler
11-10-2019
Question
I'm looking for a web crawler with the ability to grab the page's CSS. I don't need any other fancy crawling abilities.
I'm trying to make my way through Xapian, Nutch and Heritrix, but they all seem a bit complex. If anyone has experience with them or a recommendation, I would love to hear it. An accessible tutorial for any of these platforms would also be welcome.
David
Solution
You are right: don't use those. They are way too heavy for this.
Use: Crawler4j
Follow the onsite tutorial for a simple crawler.
The only change you need is in MyCrawler.java:
- Remove "css" from the FILTERS pattern.
- In the visit() method, put a simple condition as follows:
if (url.contains(".css")) {
    // do what you need with it
}
That's it - you are good!
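For illustration, the two changes above can be sketched without any crawler dependency. This is only a rough sketch of the logic, not crawler4j's actual code: CssUrlFilter, shouldVisit() and isCss() are hypothetical names, and the extension list in FILTERS is an assumption modeled on the tutorial's pattern with "css" taken out. The isCss() check is a slightly stricter alternative to url.contains(".css"), since it ignores the query string and does not match names like "my.css.html".

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j: it only mirrors the filtering logic.
class CssUrlFilter {
    // Extension filter in the style of the tutorial's FILTERS pattern, with "css"
    // removed so stylesheet URLs are no longer rejected. The list is an assumption.
    static final Pattern FILTERS =
            Pattern.compile(".*(\\.(js|gif|jpe?g|png|mp3|mp4|zip|gz))$");

    // The decision a shouldVisit()-style method would make for a URL.
    static boolean shouldVisit(String url) {
        return !FILTERS.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    // True when the URL's path ends in ".css", ignoring any query string or fragment.
    static boolean isCss(String url) {
        String path = url.split("[?#]", 2)[0];
        return path.toLowerCase(Locale.ROOT).endsWith(".css");
    }
}
```

Inside visit(), the condition would then read if (CssUrlFilter.isCss(url)) { ... } instead of the contains() check.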
OTHER TIPS
I recommend using plain HttpClient and a simple regex. You can store the responses in a file, a database, or an archive of your own (see Heritrix).
It keeps things simple compared with a heavyweight crawler. Since there are only a few CSS files per domain, you can safely skip complex URL following within the domain.
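A minimal sketch of that approach, under some assumptions: it uses the JDK's built-in java.net.http.HttpClient (Java 11+) rather than any particular third-party client, and StylesheetLinks, fetch() and extract() are made-up names. The regexes only handle quoted href values on link tags whose rel is exactly "stylesheet", so treat it as a starting point rather than a robust HTML parser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fetch a page, then pull stylesheet URLs out with regexes.
class StylesheetLinks {
    static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    static final Pattern REL_STYLESHEET =
            Pattern.compile("\\brel\\s*=\\s*[\"']?stylesheet\\b", Pattern.CASE_INSENSITIVE);
    static final Pattern HREF =
            Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Downloads the body of a URL as text (used for both the HTML page and each CSS file).
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Returns absolute URLs of all <link rel="stylesheet"> tags found in the HTML.
    // baseUrl should include a path (at least "/") so relative hrefs resolve correctly.
    static List<String> extract(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> urls = new ArrayList<>();
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            String linkTag = tag.group();
            if (!REL_STYLESHEET.matcher(linkTag).find()) {
                continue; // skip icons, feeds, and other non-stylesheet links
            }
            Matcher href = HREF.matcher(linkTag);
            if (href.find()) {
                urls.add(base.resolve(href.group(1)).toString());
            }
        }
        return urls;
    }
}
```

Typical use would be to call fetch() on a page, pass the result to extract(), and then fetch() each returned URL and store the CSS wherever suits you.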
Cheers!