Question

I am trying to run boilerpipe with Python multiprocessing in order to parse RSS feeds from multiple sources. The problem is that one of the worker processes hangs after processing some links. The whole flow works if I remove the pool and process the links in a plain loop.

Here is my multiprocessing code:

from multiprocessing import Pool

proc_pool = Pool(processes=4)
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link, ), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()

This is my boilerpipe code which is being called inside process_link_for_feeds():

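from boilerpipe.extract import Extractor  # python-boilerpipe wrapper

def parse_using_bp(in_url):
    # Return the extracted article HTML, or "" if the URL is skipped or extraction fails.
    if ContentParser.url_skip_p.match(in_url):
        return ""
    try:
        extractor = Extractor(extractor='ArticleExtractor', url=in_url)
        return extractor.getHTML()
    except Exception as e:
        # Catch Exception rather than BaseException so KeyboardInterrupt and
        # SystemExit still propagate. A single formatted string prints the
        # same way on Python 2 and Python 3.
        print("Something's wrong at Boilerpipe --> %s --> %s" % (in_url, e))
        return ""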

I have no idea why it hangs. Is there something wrong in the proc_pool code?


Solution

Can you try threading instead? Multiprocessing mainly pays off when the work is CPU bound, whereas fetching feeds over the network is largely I/O bound. Also, boilerpipe already includes protection for use with threading, which suggests that it may need similar protection under multiprocessing.

If you really need multiprocessing, I will try to figure out how to patch boilerpipe.

Here is what I think should be a drop-in replacement using threading. It uses multiprocessing.pool.ThreadPool, a thread-backed pool that exposes the same interface as multiprocessing.Pool. The only change is from Pool(...) to multiprocessing.pool.ThreadPool(...). One caveat: I'm not sure boilerpipe's multithreading check will detect the thread pool as having activeCount() > 1.

from multiprocessing.pool import ThreadPool  # thread-backed pool with the same API as Pool

# ...
proc_pool = ThreadPool(processes=4)  # this is the only difference
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link, ), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()
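
To check the activeCount() concern without involving boilerpipe at all, here is a minimal sketch that reports how many threads are alive while a ThreadPool worker runs. The names report_thread_count and check_pool_threads are hypothetical stand-ins (report_thread_count takes the place of process_link_for_feeds), and the sketch assumes boilerpipe's check really is based on the thread count, as described above. It also collects results with get(timeout=...), which raises multiprocessing.TimeoutError instead of blocking forever, so a stuck worker shows up as an error rather than a silent hang.

```python
import threading
from multiprocessing.pool import ThreadPool


def report_thread_count(task_id):
    # Hypothetical stand-in for process_link_for_feeds: it only reports how
    # many threads are alive while this worker is running.
    return task_id, threading.active_count()


def check_pool_threads(num_tasks=8, workers=4, timeout=10):
    # Run num_tasks stand-in jobs on a ThreadPool and return their results
    # in submission order.
    pool = ThreadPool(processes=workers)
    try:
        pending = [pool.apply_async(report_thread_count, (i,))
                   for i in range(num_tasks)]
        # get(timeout=...) raises multiprocessing.TimeoutError if a worker
        # does not finish in time, instead of waiting indefinitely.
        return [handle.get(timeout=timeout) for handle in pending]
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    for task_id, count in check_pool_threads():
        print("task %d saw %d active threads" % (task_id, count))
```

Every task should see more than one active thread, since the pool's worker threads run alongside the main thread. If that holds on your setup, a check based on activeCount() > 1 ought to fire inside the pool, but that part is an assumption about boilerpipe and worth verifying against its source.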
License: CC-BY-SA with attribution
Not affiliated with StackOverflow