I'm developing a Python script that monitors a directory (using libinotify) for new files; for each new file it does some processing and then copies it to a storage server. We were using an NFS mount but had some performance issues, and now we are testing with FTP. It looks like FTP uses far fewer resources than NFS (the load stays under 2, whereas with NFS it was above 5).

The problem we have now is the number of connections left open in the TIME_WAIT state. The storage server peaks at about 15k connections in TIME_WAIT.

I was wondering if there is some way to reuse a previous connection for new transfers.

Does anyone know if there is a way to do that?

Thanks


Solution

Here's a new answer, based on the comments to the previous one.

We'll use a single TCP socket and send each file by alternating its name and its contents, each encoded as a netstring, all in one big stream.
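For reference, a netstring encodes a chunk of bytes as its decimal length, a colon, the bytes themselves, and a trailing comma. For example, a file named foo.txt (a made-up name) whose contents are the five bytes hello would go over the wire as:

7:foo.txt,5:hello,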

I'm assuming Python 2.6, that the filesystems on both sides use the same encoding, and that you don't need lots of concurrent clients (though you might occasionally need, say, two: the real one and a tester). And I'm again assuming you've got a module filegenerator whose generate() function registers with inotify, queues up notifications, and yields them one by one.
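That filegenerator module is only posited here, but a minimal sketch of what it could look like, using the pyinotify library (the library choice, the watched path, and the IN_CLOSE_WRITE event are all my assumptions, not part of the original), might be:

import Queue

import pyinotify

WATCH_DIR = '/path/to/read/from/'  # assumption: the directory you monitor

def generate():
    # Register with inotify, queue up notifications, yield filenames.
    q = Queue.Queue()

    class Handler(pyinotify.ProcessEvent):
        def process_IN_CLOSE_WRITE(self, event):
            # Fires once a writer closes the file, so it's complete.
            q.put(event.pathname)

    wm = pyinotify.WatchManager()
    notifier = pyinotify.ThreadedNotifier(wm, Handler())
    notifier.start()
    wm.add_watch(WATCH_DIR, pyinotify.IN_CLOSE_WRITE)
    try:
        while True:
            yield q.get()
    finally:
        notifier.stop()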

client.py:

import contextlib
import socket

import filegenerator

HOST = 'storage.example.com'  # placeholder: your storage server's address

sock = socket.socket()
with contextlib.closing(sock):
    sock.connect((HOST, 12345))
    for filename in filegenerator.generate():
        with open(filename, 'rb') as f:  # binary mode: no newline or charset munging
            contents = f.read()
        # Frame the name and the contents as two netstrings: length:bytes,
        buf = '{0}:{1},{2}:{3},'.format(len(filename), filename,
                                        len(contents), contents)
        sock.sendall(buf)

server.py:

import contextlib
import socket
import threading
from itertools import izip

def pairs(iterable):
    # Lazily group a stream into consecutive (name, contents) pairs.
    # (izip instead of zip: 2.x's zip is eager and would consume the
    # whole stream before yielding anything.)
    return izip(*[iter(iterable)] * 2)

def netstrings(conn):
    # Incrementally parse "length:bytes," netstring frames off the socket.
    buf = ''
    while True:
        newbuf = conn.recv(1536 * 1024)
        if not newbuf:
            return  # client closed the connection
        buf += newbuf
        while True:
            colon = buf.find(':')
            if colon == -1:
                break  # no complete length prefix yet; read more
            length = int(buf[:colon])
            if len(buf) < colon + length + 2:
                break  # frame not fully received yet; read more
            if buf[colon + length + 1] != ',':
                raise ValueError('Not a netstring')
            yield buf[colon + 1:colon + length + 1]
            buf = buf[colon + length + 2:]

def client(conn):
    # Handle one connection: unpack (name, contents) pairs and write
    # each file out in binary mode.
    with contextlib.closing(conn):
        for filename, contents in pairs(netstrings(conn)):
            with open(filename, 'wb') as f:
                f.write(contents)

sock = socket.socket()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # allow quick restarts
with contextlib.closing(sock):
    sock.bind(('0.0.0.0', 12345))
    sock.listen(1)
    while True:
        conn, addr = sock.accept()
        t = threading.Thread(target=client, args=[conn])
        t.daemon = True  # don't block interpreter exit on open connections
        t.start()
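To smoke-test the pair on one machine (an assumption about your setup), run server.py in one terminal, set HOST = 'localhost' in client.py, run it in another, and drop a file into the watched directory; it should reappear in the server's working directory.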

If you need more than about 200 clients on Windows, 100 on Linux and BSD (including Mac), or a dozen on less capable platforms, you probably want an event-loop design instead of a threaded design, using epoll on Linux, kqueue on BSD, and I/O completion ports on Windows. This can be painful, but fortunately there are frameworks that wrap everything up for you. Two popular (and very different) choices are Twisted and gevent.

One nice thing about gevent in particular is that you can write threaded code today, and with a handful of simple changes turn it into event-based code like magic, as sketched below.
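For example, here's a minimal sketch of that conversion applied to server.py above (assuming gevent 1.x; the two lines at the top are the only change):

from gevent import monkey
monkey.patch_all()  # must run before the stdlib imports below

import contextlib
import socket
import threading

# ... the rest of server.py is unchanged. socket.socket, conn.recv, and
# threading.Thread now cooperate as greenlets on an event loop instead of
# OS threads, so thousands of mostly-idle connections become cheap.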

On the other hand, if you're eventually going to want event-based code, it's probably better to learn and use a framework from the start, so you don't have to deal with all the fiddly bits (accepting connections, looping around recv until you have a full message, shutting down cleanly, and so on) and can just write the parts you care about. After all, more than half the code above is boilerplate for things every server shares, so if you don't have to write it, why bother?


In a comment, you said:

Also the files are binary, so it's possible that I'll have problems if client encodings are different from the server's.

Notice that I opened each file in binary mode ('rb' and 'wb'), and intentionally chose a protocol (netstrings) that can handle binary strings without trying to interpret them as characters or treat embedded NUL characters as EOF or anything like that. And, while I'm using str.format, in Python 2.x that won't do any implicit encoding unless you feed it unicode strings or give it locale-based format types, neither of which I'm doing. (Note that in 3.x, you'd need to use bytes instead of str, which would change a bit of the code.)
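For the curious, here's roughly what the client's framing code might look like in 3.x (a sketch assuming Python 3.5+, where bytes supports %-formatting, and assuming UTF-8 as the filename encoding on the wire):

name = filename.encode('utf-8')  # assumption: UTF-8 filenames on the wire
buf = b'%d:%s,%d:%s,' % (len(name), name, len(contents), contents)
sock.sendall(buf)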

In other words, the client and server encodings don't enter into it; you're doing a binary transfer exactly the same as FTP's I mode.


But what if you wanted the opposite, to transfer text and reencode automatically for the target system? There are three easy ways to do that:

  1. Send the client's encoding (either once at the top, or once per file), and on the server, decode from the client and reencode to the local file.
  2. Do everything in text/unicode mode, even the socket. This is silly, and in 2.x it's also hard to do.
  3. Define a wire encoding, say UTF-8. The client is responsible for decoding files and encoding to UTF-8 before sending; the server is responsible for decoding UTF-8 on receive and encoding files.

Going with the third option, assuming that the files are going to be in your default filesystem encoding, the changed client code is:

import io, sys  # add these imports at the top of client.py

with io.open(filename, 'r', encoding=sys.getfilesystemencoding()) as f:
    contents = f.read().encode('utf-8')

And on the server:

import io, sys  # likewise at the top of server.py

with io.open(filename, 'w', encoding=sys.getfilesystemencoding()) as f:
    f.write(contents.decode('utf-8'))

The io.open function also, by default, uses universal newlines, so the client will translate anything into Unix-style newlines, and the server will translate to its own native newline type.

Note that FTP's T mode actually doesn't do any re-encoding; it only does newline conversion (and a more limited version of it than io.open gives you).

Other tips

Yes, you can reuse connections with ftplib. All you have to do is not close them and keep using them.

For example, assuming you've got a module filegenerator whose generate() method registers with inotify, queues up notifications, and yields them one by one:

import ftplib
import os

import filegenerator

ftp = ftplib.FTP('ftp.example.com')
ftp.login()
ftp.cwd('/path/to/store/stuff')

os.chdir('/path/to/read/from/')

for filename in filegenerator.generate():
    with open(filename, 'rb') as f:
        # One control connection for the whole run; each STOR opens only
        # a short-lived data connection.
        ftp.storbinary('STOR {0}'.format(filename), f)

ftp.quit()  # politely end the session (close() also works)
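One caveat: a control connection held open that long can be dropped by an idle timeout or a network hiccup. A minimal reconnect-and-retry helper might look like this (the helper and its single retry are my own sketch, not part of the original answer):

def store_with_retry(ftp, filename):
    # Hypothetical helper: retry once, after reconnecting, on any FTP error.
    try:
        with open(filename, 'rb') as f:
            ftp.storbinary('STOR {0}'.format(filename), f)
    except ftplib.all_errors:
        ftp.connect('ftp.example.com')  # reconnect and retry once
        ftp.login()
        ftp.cwd('/path/to/store/stuff')
        with open(filename, 'rb') as f:
            ftp.storbinary('STOR {0}'.format(filename), f)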

I'm a bit confused by this:

The problem we have now is the number of connections left open in the TIME_WAIT state.

It sounds like your problem is not that you create a new connection for each file, but that you never close the old ones. In which case the solution is easy: just close them.


Either that, or you're trying to do them all in parallel, but don't realize that's what you're doing.

If you want some parallelism, but not unboundedly so, you can easily create, e.g., a pool of 4 threads, each with an open ftplib connection, each reading from a queue, plus an inotify thread that just pushes onto that queue, as sketched below.
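A minimal sketch of that design (Python 2; the host, path, and pool size are placeholders):

import ftplib
import threading
import Queue

import filegenerator

POOL_SIZE = 4  # placeholder: tune to taste
q = Queue.Queue()

def worker():
    # Each worker holds one FTP connection open for its whole lifetime.
    ftp = ftplib.FTP('ftp.example.com')
    ftp.login()
    ftp.cwd('/path/to/store/stuff')
    while True:
        filename = q.get()
        with open(filename, 'rb') as f:
            ftp.storbinary('STOR {0}'.format(filename), f)
        q.task_done()

for _ in range(POOL_SIZE):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

# The inotify side just feeds the queue.
for filename in filegenerator.generate():
    q.put(filename)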
