Question

I am using the crawler4j library to crawl some websites, but I have a problem when I run the process twice: it only works the first time. The second time it doesn't give any error, but it does nothing.

I think the library is saving the crawled URLs, and that is why the second call does nothing.

I found some related information here, but not the solution:

http://code.google.com/p/crawler4j/wiki/FrequentlyAskedQuestions

Thanks in advance,

Hibernator.


The solution

Your crawl storage folder was written to during the first run. Furthermore, it cannot be deleted automatically (to allow a recrawl) because access to its files is denied, so on the second run the program reads that folder and concludes that all URLs have already been crawled. You need to modify crawler4j so that it completely closes its access to the crawl storage folder. See this issue: https://code.google.com/p/crawler4j/issues/detail?id=157
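As a workaround (assuming the first controller has actually released its lock on the folder), you can delete the crawl storage folder between runs so the second crawl starts from a clean state. Below is a minimal sketch; the path `/tmp/crawler4j-storage` is hypothetical and should match whatever you pass to `CrawlConfig.setCrawlStorageFolder`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class CrawlStorageCleaner {

    // Recursively delete the crawl storage folder so the next
    // crawl does not see the frontier database from the previous run.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return; // nothing to clean up
        }
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order: delete children before their parent directories.
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical storage path; use the same value you give to
        // CrawlConfig.setCrawlStorageFolder(...) in your crawler setup.
        Path storage = Paths.get("/tmp/crawler4j-storage");
        deleteRecursively(storage);
        // ...now build a fresh CrawlController pointing at `storage`.
    }
}
```

Note this only helps if the previous run's file handles are closed; if the JVM still holds the folder open (the situation described above), the deletion itself will fail until the underlying issue is fixed.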

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow