Question

Is there any fast (maybe multi-threaded) way to crawl my site (clicking on all local links) to look for 404/500 errors (i.e. ensure 200 response)?

I also want to be able to set it to only click into 1 of each type of link. So if I have 1000 category pages, it only clicks into one.

Is http://code.google.com/p/crawler4j/ a good option?

I'd like something that is super easy to set up, and I'd prefer PHP over Java (though if Java is significantly faster, that would be ok).


Solution

You can use the old and stable Xenu tool to crawl your site.

You can configure it to use 100 threads and sort the results by status code (500, 404, 403, 200).

OTHER TIPS

You could implement this pretty easily with any number of open source python projects:

  1. Mechanize seems pretty popular
  2. Beautiful Soup and urllib

You'd crawl the site using one of those libraries and check the server response for each page, which should be pretty straightforward.
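For instance, here is a minimal single-threaded sketch of that approach, assuming Python 3's urllib and Beautiful Soup 4; the start URL is a placeholder you'd replace with your own site:

```python
# Crawl local links and report any URL that doesn't return 200.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

start_url = "https://example.com/"          # placeholder: your site
allowed_host = urlparse(start_url).netloc   # only follow local links

seen, queue, bad = set(), [start_url], {}

while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    try:
        resp = urlopen(url, timeout=10)
        status = resp.getcode()
        html = resp.read()
    except HTTPError as e:
        bad[url] = e.code                   # 404, 500, 403, ...
        continue
    except URLError:
        bad[url] = "unreachable"
        continue
    if status != 200:
        bad[url] = status
        continue
    # Queue every local link found on the page.
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == allowed_host:
            queue.append(link)

for url, status in bad.items():
    print(status, url)
```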

However, if you have a sitemap (or any sort of list of all your URLs), you could just try to open each one using cURL or urllib and check the response, without needing to crawl at all.
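A rough sketch of that no-crawl check, assuming your URLs are in a plain text file (the filename `urls.txt` is just a placeholder, one URL per line):

```python
# Check each URL from a list (e.g. exported from your sitemap) without crawling.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

with open("urls.txt") as f:                 # placeholder filename
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        code = urlopen(url, timeout=10).getcode()
    except HTTPError as e:
        code = e.code
    except URLError:
        code = "unreachable"
    if code != 200:
        print(code, url)
```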

Define "fast"? how big is your site? cURL would be a good start: http://curl.haxx.se/docs/manual.html

Unless you have a really immense site and need to test it on a time scale of seconds, just enumerate the URLs into a list and try each one.
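If sequential checking does turn out to be too slow, a threaded variant along these lines would cover the multi-threaded requirement; this sketch uses Python's standard-library ThreadPoolExecutor, and the URL list and worker count are placeholders to tune for your site:

```python
# Threaded variant: check many URLs concurrently with a worker pool.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def status_of(url):
    try:
        return url, urlopen(url, timeout=10).getcode()
    except HTTPError as e:
        return url, e.code
    except URLError:
        return url, "unreachable"

urls = ["https://example.com/", "https://example.com/category/1"]  # placeholders

with ThreadPoolExecutor(max_workers=20) as pool:   # tune worker count to taste
    for url, code in pool.map(status_of, urls):
        if code != 200:
            print(code, url)
```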

Licensed under: CC-BY-SA with attribution