Threading with Hadoop Streaming

https://stackoverflow.com/questions/18087121

23-06-2022
|

Question

I am making use of Hadoop streaming to write a python based HTML grabber. I find that running a single threaded python script is slow. I want to modify it to a multithreaded version. Does anyone know what would be a good number to set the number of threads in the mapper to. I am not sure of the specs of each node of the cluster but I assume that it would support atleast two threads.

Solution

I tried to use threading with python, there were issues with the Global Interpreter Lock. Ported code to use the multiprocessing module, internally hadoop assigns as many mappers as there are cores in the cluster, hence multiprocessing is not the way to go if you need speed up. Multithreading if performed right might give some speedup

OTHER TIPS

I have not use hadoop streaming for html grabber but here is a post that talking about how urllib2 work s using multiple thread (not multipleprocessing package, just simple multi thread).

Hope can be helpful.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow