How to crawl my site to detect 404/500 errors?
22-06-2021
Question
Is there any fast (maybe multi-threaded) way to crawl my site (clicking on all local links) to look for 404/500 errors (i.e. ensure 200 response)?
I also want to be able to set it to only click into 1 of each type of link. So if I have 1000 category pages, it only clicks into one.
Is http://code.google.com/p/crawler4j/ a good option?
I'd like something that is super easy to set up, and I'd prefer PHP over Java (though if Java is significantly faster, that would be ok).
Solution
You can use the old and stable Xenu's Link Sleuth tool to crawl your site.
You can configure it to use 100 threads and sort the results by status code (500, 404, 403, 200).
OTHER TIPS
You could implement this pretty easily with any number of open-source Python projects:
- Mechanize seems pretty popular
- Beautiful Soup and urllib
You'd crawl the site using one of those methods and check the server response, which should be pretty straightforward.
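As a rough sketch of that approach using only the Python standard library (so neither Mechanize nor Beautiful Soup is strictly required), link extraction can be done with `html.parser`; the crawl limit and timeout below are arbitrary assumptions:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, limit=100):
    """Breadth-first crawl of same-host links, recording each URL's status code."""
    host = urlparse(start_url).netloc
    seen, queue, statuses = set(), [start_url], {}
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = urlopen(url, timeout=10)
            statuses[url] = resp.status
            parser = LinkExtractor()
            parser.feed(resp.read().decode("utf-8", errors="replace"))
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == host:  # stay on the local site
                    queue.append(absolute)
        except HTTPError as e:
            statuses[url] = e.code   # 404, 500, ...
        except URLError:
            statuses[url] = None     # connection failure
    return statuses
```

Anything in the returned dict that isn't 200 is a page to investigate.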
However, if you have a sitemap (or any sort of list with all of your URLs), you could just try and open each one using cURL, or urllib, and get your response without the need to crawl.
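If you already have such a list, a minimal status check with `urllib` might look like the following; the timeout value is an assumption, and the URL list is whatever your sitemap yields:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def check_urls(urls):
    """Return a {url: status_code} mapping; None marks a connection failure."""
    results = {}
    for url in urls:
        try:
            results[url] = urlopen(url, timeout=10).status
        except HTTPError as e:
            results[url] = e.code    # 404, 500, ...
        except URLError:
            results[url] = None      # DNS/connection error
    return results

def broken(results):
    """Filter the results down to URLs that did not return 200."""
    return {url: status for url, status in results.items() if status != 200}
```

Feeding `broken(check_urls(my_urls))` into a report then gives exactly the 404/500 list the question asks for.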
Define "fast". How big is your site? cURL would be a good start: http://curl.haxx.se/docs/manual.html
Unless you have a really immense site and need to test it on a time scale of seconds, just enumerate the URLs into a list and try each one.
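And if the list does turn out to be large, the question's multi-threaded requirement can be sketched with `concurrent.futures`; the worker count is arbitrary, and the `probe` parameter is a hypothetical hook that makes the function testable without network access:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def status_of(url):
    """Fetch one URL and return (url, status); None marks a connection failure."""
    try:
        return url, urlopen(url, timeout=10).status
    except HTTPError as e:
        return url, e.code
    except URLError:
        return url, None

def check_concurrently(urls, workers=20, probe=status_of):
    """Check many URLs in parallel threads and return {url: status}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(probe, urls))
```

Because the work is I/O-bound, threads (rather than processes) are enough to get a large speedup here.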