Question

I'm developing part of a system where processes are limited to about 350MB of RAM; we use cx_Oracle to download files from an external system for processing.

The external system stores files as BLOBs, and we can grab them doing something like this:

# ... set up Oracle connection, then
cursor.execute(u"""SELECT   filename, data, filesize
                   FROM    FILEDATA
                   WHERE   ID = :id""", id=the_one_you_wanted)
filename, lob, filesize = cursor.fetchone()

with open(filename, "w") as the_file:
    the_file.write(lob.read())

lob.read() will obviously fail with MemoryError when we hit a file larger than 300-350MB, so we've tried something like this instead of reading it all at once:

read_size = 0
chunk_size = lob.getchunksize() * 100
while read_size < filesize:
    data = lob.read(chunk_size, read_size + 1)
    read_size += len(data)
    the_file.write(data)

Unfortunately, we still get MemoryError after several iterations. From the time lob.read() is taking, and the out-of-memory condition we eventually get, it looks as if lob.read() is pulling ( chunk_size + read_size ) bytes from the database every time. That is, reads are taking O(n) time and O(n) memory, even though the buffer is quite a bit smaller.

To work around this, we've tried something like:

read_size = 0
while read_size < filesize:
    q = u'''SELECT dbms_lob.substr(data, 2000, %s)
            FROM FILEDATA WHERE ID = :id''' % (read_bytes + 1)
    cursor.execute(q, id=filedataid[0])
    row = cursor.fetchone()
    read_bytes += len(row[0])
    the_file.write(row[0])

This pulls 2000 bytes (argh) at a time, and takes forever (something like two hours for a 1.5GB file). Why 2000 bytes? According to the Oracle docs, dbms_lob.substr() stores its return value in a RAW, which is limited to 2000 bytes.

Is there some way I can store the dbms_lob.substr() results in a larger data object and read maybe a few megabytes at a time? How do I do this with cx_Oracle?

Was it helpful?

Solution

I think that the argument order in lob.read() is reversed in your code. The first argument should be the offset, the second argument should be the amount to read. This would explain the O(n) time and memory usage.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top