Question

I have recently joined a new company and am new to Python (their preferred scripting language) and have been working with cx_Oracle to create some ETL processes. The scripts I have built so far have been single-threaded jobs that select the subset of columns I need from an Oracle source DB and write the output to a named pipe, where an external process is waiting to read that data and insert it into the target.

This worked fine until I got to some tables in the 500 million to 2 billion row range. The job still works, but it takes many hours to complete. These large source tables are partitioned, so I have been researching ways to coordinate parallel reads of different partitions so that two or more threads can work concurrently, each writing to a separate named pipe.

Is there an elegant way in cx_Oracle to handle multiple threads reading from different partitions of the same table?

Here's my current (simple) code:

import cx_Oracle
import csv

# connect via SQL*Net string or by each segment in a separate argument
connection = cx_Oracle.connect("user/password@TNS")


# pipe-delimited output, backslash as escape character, no quoting
csv.register_dialect('pipe_delimited', escapechar='\\', delimiter='|', quoting=csv.QUOTE_NONE)

cursor = connection.cursor()
f = open("<path_to_named_pipe>", "w")

writer = csv.writer(f, dialect='pipe_delimited', lineterminator="\n")
cursor.execute("""SELECT <column_list> from <SOURCE_TABLE>""")
for row in cursor:
    writer.writerow(row)
f.close()

Some of my source tables have over 1000 partitions so hard-coding the partition names in isn't the preferred option. I have been thinking about setting up arrays of partition names and iterating through them, but if folks have other ideas I'd love to hear them.


Solution

First of all, you need to make sure that *cx_Oracle* is thread-safe. Since it implements the Python DB API Spec v2.0, all you need to do is check the threadsafety module global. A value of 1 or higher means that separate threads can each open their own connection to the DB and run queries at the same time; values 2 and 3 additionally mean that threads may share a connection (and, for 3, cursors as well). A straightforward way to do this is to use the threading module, which is pretty easy to use. This is a short and sweet article on how to get started with it.
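
As a rough illustration of the idea, here is a minimal sketch, not a tested production job. It assumes the partition names can be read from Oracle's USER_TAB_PARTITIONS dictionary view (so nothing is hard-coded), that each worker thread gets its own connection and its own named pipe, and that the external readers are already waiting on those pipes. The function names list_partitions, export_partitions and run_parallel, and the connect factory argument, are made up for this example.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

csv.register_dialect("pipe_delimited", escapechar="\\", delimiter="|",
                     quoting=csv.QUOTE_NONE)


def list_partitions(connection, table_name):
    """Read the partition names of table_name from the data dictionary."""
    cursor = connection.cursor()
    cursor.execute(
        "SELECT partition_name FROM user_tab_partitions "
        "WHERE table_name = :tname ORDER BY partition_position",
        {"tname": table_name.upper()},
    )
    return [row[0] for row in cursor]


def export_partitions(connect, table_name, column_list, work_queue, out_path):
    """Worker: pull partition names off the queue and dump each to out_path."""
    connection = connect()  # one connection per thread, never shared
    try:
        cursor = connection.cursor()
        cursor.arraysize = 5000  # assumed value; tune for your row width
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f, dialect="pipe_delimited",
                                lineterminator="\n")
            while True:
                try:
                    partition = work_queue.get_nowait()
                except Empty:
                    break  # no partitions left for this worker
                # Partition names cannot be bind variables; they come from
                # the data dictionary here, not from user input.
                cursor.execute(
                    f"SELECT {column_list} FROM {table_name} "
                    f"PARTITION ({partition})"
                )
                for row in cursor:
                    writer.writerow(row)
    finally:
        connection.close()


def run_parallel(connect, table_name, column_list, out_paths):
    """Start one worker per output path and wait for all of them."""
    connection = connect()
    try:
        partitions = list_partitions(connection, table_name)
    finally:
        connection.close()

    work_queue = Queue()
    for name in partitions:
        work_queue.put(name)

    with ThreadPoolExecutor(max_workers=len(out_paths)) as pool:
        futures = [
            pool.submit(export_partitions, connect, table_name,
                        column_list, work_queue, path)
            for path in out_paths
        ]
        for future in futures:
            future.result()  # re-raises any exception from a worker
    return partitions


# Possible usage with cx_Oracle (placeholders as in the question):
#   import cx_Oracle
#   assert cx_Oracle.threadsafety >= 1
#   run_parallel(
#       lambda: cx_Oracle.connect("user/password@TNS", threaded=True),
#       "<SOURCE_TABLE>", "<column_list>",
#       ["<path_to_named_pipe_1>", "<path_to_named_pipe_2>"],
#   )
```

A shared queue is used rather than a fixed split of the partition list so that a thread that finishes a small partition simply picks up the next one. The number of output paths sets the degree of parallelism, so with 1000+ partitions you still only run as many threads (and named pipes) as you choose.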

Of course, there are no guarantees that running your queries in parallel will result in a significant performance gain (the DB engine, I/O, etc. can all be the limiting factor), but it's definitely worth a try. Good luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow