Question

I am running crawler4j to find the HTTP status code for one million URLs. I have not set any filters to exclude URLs from processing.
I get a proper response for 90% of the URLs, but 10% are missing from the output.
They don't even appear in the handlePageStatusCode() method of my WebCrawler subclass. They are probably not processed due to various issues.
Is it possible to find those missing URLs so I can reprocess them? Can the crawling process be improved so that no URLs are missed?


Solution

Yes, and we have!

Please use the latest version of crawler4j, as I have added many methods to catch different types of exceptions.

Now, when you extend WebCrawler, simply override the hook methods it exposes: https://github.com/yasserg/crawler4j/blob/master/src/main/java/edu/uci/ics/crawler4j/crawler/WebCrawler.java

For example: onPageBiggerThanMaxSize, onUnexpectedStatusCode, onContentFetchError, onUnhandledException, etc.
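As a sketch of how these hooks can be used to capture the missing URLs, you could override them in your WebCrawler subclass and collect every failed URL into a shared set. The method signatures below are assumed to match crawler4j v4.1; check WebCrawler.java in your installed version, since they have changed between releases:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

// Sketch: record every URL that fails so it can be inspected (or re-crawled) later.
// Signatures assumed from crawler4j v4.1 -- verify against your version's WebCrawler.java.
public class StatusCodeCrawler extends WebCrawler {

    // Thread-safe set shared across crawler instances (each runs in its own thread).
    private static final Set<String> failedUrls = ConcurrentHashMap.newKeySet();

    @Override
    protected void onPageBiggerThanMaxSize(String urlStr, long pageSize) {
        failedUrls.add(urlStr);
    }

    @Override
    protected void onUnexpectedStatusCode(String urlStr, int statusCode,
                                          String contentType, String description) {
        failedUrls.add(urlStr);
    }

    @Override
    protected void onContentFetchError(WebURL webUrl) {
        failedUrls.add(webUrl.getURL());
    }

    @Override
    protected void onUnhandledException(WebURL webUrl, Throwable e) {
        failedUrls.add(webUrl.getURL());
    }

    // After the crawl finishes, dump this set to see which URLs never
    // reached handlePageStatusCode().
    public static Set<String> getFailedUrls() {
        return failedUrls;
    }
}
```

After CrawlController.start() returns, reading getFailedUrls() tells you exactly which of the one million URLs were dropped and why you never saw them in handlePageStatusCode().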

Just note that when those methods are called, the page was not processed for a specific reason, so simply adding it again as a seed will often hit the same problem...

Anyway, the latest version of crawler4j handles many pages much better, so just by upgrading to v4.1 (the current version at the time of writing) or later you will be able to crawl many more pages.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow