Question

I have built a custom crawler using crawler4j. My app creates a lot of controllers, and after a while the number of threads hits the system maximum and the JVM throws an exception. Even though I call ShutDown() on each controller, set it to null, and call System.gc(), the threads stay alive and the app eventually crashes.

Using jvisualvm.exe (Java VisualVM), I saw that at one point my app reached 931 threads.

Is there a way I can immediately kill all the threads created by the CrawlController object of the crawler4j project? (or any other object for that matter)


Solution

I just spent 2 hours struggling with the exact same problem and finally found the source of the bug. If you create a controller and don't start it, shutdown() won't kill any of the threads it created. Instead, you have to use the following:

controller.shutdown();
controller.getPageFetcher().shutdown();

where controller is your instance of CrawlController.
I also raised this as an issue on the crawler4j project page, and it looks like it will be fixed in the 3.6 release.
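
To check that a shutdown call really releases its threads without opening VisualVM, one option is to count live threads by name before and after the call. The following is only a sketch: a plain ExecutorService stands in for the CrawlController (crawler4j is not used here), and the class name, method names, and the "demo-crawler-" thread-name prefix are made up for illustration. The thread names crawler4j actually uses may differ.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ThreadLeakCheck {
    // Hypothetical name prefix for this demo; not a crawler4j convention.
    static final String PREFIX = "demo-crawler-";

    // Counts live threads whose name starts with the given prefix.
    static int liveThreads(String prefix) {
        int n = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(prefix)) {
                n++;
            }
        }
        return n;
    }

    // Starts `workers` pool threads, shuts the pool down, and returns
    // {live threads before shutdown, live threads after shutdown}.
    static int[] countBeforeAndAfterShutdown(int workers) {
        AtomicInteger id = new AtomicInteger();
        ThreadFactory named = r -> new Thread(r, PREFIX + id.incrementAndGet());
        ExecutorService pool = Executors.newFixedThreadPool(workers, named);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> { }); // each submit starts one worker until the pool is full
        }
        int before = liveThreads(PREFIX);

        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
            // Worker threads exit a moment after termination is signalled, so poll briefly.
            for (int i = 0; i < 100 && liveThreads(PREFIX) > 0; i++) {
                Thread.sleep(20);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return new int[] { before, liveThreads(PREFIX) };
    }

    public static void main(String[] args) {
        int[] counts = countBeforeAndAfterShutdown(4);
        System.out.println("before shutdown: " + counts[0] + ", after shutdown: " + counts[1]);
    }
}
```

In a real crawler, the same liveThreads check could be run around controller.shutdown() and controller.getPageFetcher().shutdown(), using whatever prefix the leaked threads show in VisualVM.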

OTHER TIPS

Ephraim is correct. There are two issues in crawler4j:

  1. CrawlController does not close its Environment object.
  2. CrawlController does not close its PageFetcher object.

https://code.google.com/r/yonid-crawler4j/

I have done my best to create a version that shuts down properly after a start (startunblocking), and that also has a forceShutdown for cases where you create a controller and never call a start function.

ShutDown() politely asks the threads to finish their jobs and shuts down afterwards, but what if the threads are running endless tasks that never finish? Have you tried shutdownNow()? It interrupts running tasks before they are finished and shuts the threads down immediately.
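
The difference is easiest to see on a plain java.util.concurrent.ExecutorService, which is where shutdown() and shutdownNow() are defined. Nothing above states that CrawlController itself exposes shutdownNow(), so treat this as a sketch of the general idea and not of the crawler4j API. The class name and the endless task are invented for the demo. Note that shutdownNow() only interrupts the worker threads, so a task stops only if it responds to interruption.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ShutdownDemo {
    // Returns {terminated after shutdown(), terminated after shutdownNow()}.
    static boolean[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            // Endless task: it only stops once its thread is interrupted.
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag so the loop exits
                }
            }
        });

        boolean afterShutdown = false;
        boolean afterShutdownNow = false;
        try {
            // shutdown() stops accepting new tasks but lets the running one continue.
            pool.shutdown();
            afterShutdown = pool.awaitTermination(300, TimeUnit.MILLISECONDS);

            // shutdownNow() interrupts the worker, which ends the loop above.
            pool.shutdownNow();
            afterShutdownNow = pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return new boolean[] { afterShutdown, afterShutdownNow };
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("terminated after shutdown():    " + result[0]);
        System.out.println("terminated after shutdownNow(): " + result[1]);
    }
}
```

A task that swallows InterruptedException, or blocks in a call that ignores interrupts, would survive shutdownNow() as well. In that case the blocking resource itself has to be closed.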

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow