Question

I am using Python's multiprocessing module to do bulk downloads over FTP. However, when I try to open more than 5 FTP sessions, an EOFError is raised, which I take to mean the host is disconnecting me for opening too many sessions.

The only solution I see is to open a single FTP object and pass it to the necessary methods. The problem is that multiprocessing uses pickling to move objects around, and FTP objects can't be pickled, so this is not possible. My question is thus: is it possible to work around this by finding a way to pickle FTP objects?

My code is of the following form:

from ftplib import FTP

def get_file(name):
    # code here

def worker(name_list, out_q):
    # download every file in name_list and hand the results back
    lst = []
    for name in name_list:
        lst.append(get_file(name))
    out_q.put(lst)

if __name__ == '__main__':

    # establish the FTP connection
    ftp = FTP('ftp.blah.blah', 'anonymous', 'meow')

    # multiprocessing code here

The get_file function needs access to the FTP connection, and if I create the connection outside of the if __name__ == '__main__' block, then a new FTP connection is created each time a process runs through the code.


Solution

I don't really understand why you would want to do that:

  • create a bunch of processes to download stuff in parallel
  • but only use one FTP object, in effect serializing the download

How exactly does this solve your problem?

Still, rather than serializing the FTP object, create a dedicated process that handles all FTP requests and devise a mini-language for communicating with it: your other processes send (easily pickleable) messages of the form "get src dst".
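A minimal sketch of that idea, assuming plain strings of the form "get src dst" travel over a multiprocessing Queue and that None is used as the stop signal. The names parse_command, ftp_server_loop and connect are made up for illustration, and file names containing spaces would need a different message format:

```python
from ftplib import FTP


def parse_command(line):
    """Split a 'get src dst' message into (verb, src, dst)."""
    verb, src, dst = line.split()
    if verb != "get":
        raise ValueError(f"unknown command: {verb!r}")
    return verb, src, dst


def connect():
    # Host and credentials as given in the question.
    return FTP('ftp.blah.blah', 'anonymous', 'meow')


def ftp_server_loop(requests, connect=connect):
    """Own the single FTP connection and serve requests until None arrives."""
    ftp = connect()
    try:
        for line in iter(requests.get, None):  # None is the stop signal
            _, src, dst = parse_command(line)
            with open(dst, "wb") as fh:
                # Stream the remote file src into the local file dst.
                ftp.retrbinary(f"RETR {src}", fh.write)
    finally:
        ftp.quit()
```

You would start it with something like Process(target=ftp_server_loop, args=(requests,)), where requests is a multiprocessing.Queue. The other processes then only call requests.put("get remote.txt local.txt"), and the parent puts None on the queue when everything is done. Note that the downloads still run one at a time, since a single process owns the connection.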

EDIT: Just checked the documentation for ftplib. Nowhere does it say that an FTP object can handle concurrent calls, so assume it can't!

So, I would do this:

  • create MAX_CONNECTIONS (e.g. 5) FTP worker processes, each with its own connection
  • have them contact a master process that holds a queue of files to retrieve
  • each worker retrieves a task from the queue, downloads the file, and checks with the master for more work
  • repeat until the work is done
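Roughly, the steps above could look like the following. This is only a sketch under a few assumptions: the "master" is just the parent process holding a multiprocessing Queue, one None per worker marks the end of the work, and file contents are returned in memory (for large files you would write to disk instead). The names connect and download_all are invented for the example:

```python
from ftplib import FTP
from multiprocessing import Process, Queue

MAX_CONNECTIONS = 5  # stay at or below what the host tolerates


def connect():
    # Host and credentials as given in the question.
    return FTP('ftp.blah.blah', 'anonymous', 'meow')


def worker(tasks, results, connect=connect):
    """Open one connection, then download names until a None arrives."""
    ftp = connect()
    try:
        for name in iter(tasks.get, None):
            chunks = []
            ftp.retrbinary(f"RETR {name}", chunks.append)
            results.put((name, b"".join(chunks)))
    finally:
        ftp.quit()


def download_all(names, n_workers=MAX_CONNECTIONS):
    """Fan a list of file names out over n_workers processes."""
    tasks, results = Queue(), Queue()
    for name in names:
        tasks.put(name)
    for _ in range(n_workers):
        tasks.put(None)  # one stop signal per worker
    procs = [Process(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    # Drain the results before joining so no worker blocks on a full pipe.
    # No error handling here: a failed download would leave this waiting.
    downloaded = dict(results.get() for _ in names)
    for p in procs:
        p.join()
    return downloaded
```

Call download_all(list_of_names) from inside the if __name__ == '__main__' block. At most MAX_CONNECTIONS sessions are open at any time, because each worker opens exactly one connection and reuses it for every file it fetches.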

OTHER TIPS

You might be able to work around the problem by creating a pickleable class that wraps the FTP object. Essentially, you bind the FTP constructor arguments in your wrapper class; once the wrapper is deserialized in the receiving process, the FTP object is instantiated there.
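Something along these lines, as a rough sketch. LazyFTP is a made-up name, and the factory parameter is only there so the class can be tried without a real server. Keep in mind that every process that unpickles the wrapper opens its own connection, so this gets around the pickling error but does not by itself reduce the number of sessions:

```python
import pickle
from ftplib import FTP


class LazyFTP:
    """Hypothetical wrapper: pickles the constructor arguments, not the socket."""

    def __init__(self, *args, factory=FTP, **kwargs):
        self._args = args
        self._kwargs = kwargs
        self._factory = factory
        self._ftp = None  # created on first use, in whichever process uses it

    @property
    def ftp(self):
        # Connect lazily so that each process ends up with its own session.
        if self._ftp is None:
            self._ftp = self._factory(*self._args, **self._kwargs)
        return self._ftp

    def __getstate__(self):
        # Drop the live connection; only the arguments get pickled.
        state = self.__dict__.copy()
        state["_ftp"] = None
        return state
```

With this, LazyFTP('ftp.blah.blah', 'anonymous', 'meow') can be passed to a worker process, and the worker calls wrapper.ftp.retrbinary(...) as usual.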

Licensed under: CC-BY-SA with attribution