Question

I have a table that has to be populated every hour from data pulled out of Greenplum. This table is stored on the Greenplum server.

So, what I want to know is: what method (a Python script, Windows scheduler, or anything else) would be suitable for scheduling the query (written in PostgreSQL) to run every hour, given that my data can be as large as 60 GB or more?

Can someone share sample code for this?

Solution

You will want to do a parallel COPY of ranges of the data out of Greenplum and into PostgreSQL. Make sure PostgreSQL is set up for fast data loading. If possible, use an UNLOGGED table; otherwise use wal_level = 'minimal' at least.
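If the target table can be recreated each hour, a minimal sketch of setting it up as UNLOGGED might look like the following (the DSN, table name, and columns are placeholders; wal_level, by contrast, is set in postgresql.conf and needs a server restart to change):

# Sketch only: create the load target as an UNLOGGED table so the bulk COPY
# bypasses WAL. The connection string, table name, and columns are placeholders.
import psycopg2

conn = psycopg2.connect("host=postgres-box dbname=mydb user=loader")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE UNLOGGED TABLE IF NOT EXISTS mytable (
            id      bigint,
            payload text
        )
    """)
conn.close()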

How many parallel workers depends on the PostgreSQL server's I/O subsystem more than anything. Test and see.

I would recommend using Python with psycopg2 and the copy_expert cursor method (see the psycopg2 docs). Use multiprocessing with a pipe to share a file-like object between a reader worker and a writer worker, with the reader connected to Greenplum and the writer to PostgreSQL.

So effectively each worker does something a bit like the following shell pseudo-code:

psql -h greenplum-box \
     -c "COPY (SELECT * FROM mytable WHERE id BETWEEN 1 AND 10000) TO STDOUT" \
  | \
psql -h postgres-box \
     -c "COPY mytable FROM STDIN";

(but you connect the two up using psycopg2, copy_expert, multiprocessing, and a pipe).
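Here is one way that might look in Python. This is a minimal sketch, assuming a Unix-like host where multiprocessing forks so the pipe file descriptors are inherited by the children; the DSNs, the table name mytable, and the id split are placeholders, and error handling is omitted:

# One reader/writer pair per id range, joined by an OS pipe.
import os
from multiprocessing import Process

import psycopg2

GREENPLUM_DSN = "host=greenplum-box dbname=mydb user=loader"   # assumption
POSTGRES_DSN = "host=postgres-box dbname=mydb user=loader"     # assumption


def read_range(read_fd, write_fd, lo, hi):
    """Reader: COPY a range of rows out of Greenplum into the pipe."""
    os.close(read_fd)                       # this process only writes
    conn = psycopg2.connect(GREENPLUM_DSN)
    with os.fdopen(write_fd, "wb") as pipe_out, conn, conn.cursor() as cur:
        # The bounds are trusted integers here; don't interpolate user input.
        cur.copy_expert(
            "COPY (SELECT * FROM mytable WHERE id BETWEEN %d AND %d) TO STDOUT"
            % (lo, hi),
            pipe_out)
    conn.close()


def write_range(read_fd, write_fd):
    """Writer: COPY whatever arrives on the pipe into PostgreSQL."""
    os.close(write_fd)                      # this process only reads
    conn = psycopg2.connect(POSTGRES_DSN)
    with os.fdopen(read_fd, "rb") as pipe_in, conn, conn.cursor() as cur:
        cur.copy_expert("COPY mytable FROM STDIN", pipe_in)
    conn.close()


if __name__ == "__main__":
    # Placeholder split; tune the number of ranges against the PostgreSQL
    # server's I/O capacity, as noted above.
    ranges = [(1, 10000), (10001, 20000), (20001, 30000)]
    workers = []
    for lo, hi in ranges:
        read_fd, write_fd = os.pipe()
        reader = Process(target=read_range, args=(read_fd, write_fd, lo, hi))
        writer = Process(target=write_range, args=(read_fd, write_fd))
        reader.start()
        writer.start()
        # The parent must drop its copies of both pipe ends, or the writer
        # never sees end-of-file when the reader finishes.
        os.close(read_fd)
        os.close(write_fd)
        workers += [reader, writer]
    for w in workers:
        w.join()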

Do all the usual fast-loading work, like creating indexes after the load rather than before it. See how to speed up insertion performance in PostgreSQL.

If you have the disk space, create a table like dataload_temp, populate it, then in one transaction drop the old one and rename the new one to the old one's name. That way there's minimal disruption.
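A minimal sketch of that final step, assuming psycopg2, a staging table named dataload_temp, and an id column to index (all names here are placeholders):

# Post-load step: index the freshly loaded staging table, then swap it in.
import psycopg2

conn = psycopg2.connect("host=postgres-box dbname=mydb user=loader")  # assumed DSN
with conn, conn.cursor() as cur:
    # Build the index after the bulk load, not before.
    cur.execute("CREATE INDEX ON dataload_temp (id)")
    # Drop and rename in one transaction so readers never see a missing table.
    cur.execute("DROP TABLE IF EXISTS mytable")
    cur.execute("ALTER TABLE dataload_temp RENAME TO mytable")
conn.close()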

Alternatively, take a look at pg_bulkload for off-line (but non-streaming) bulk data loading.

Licensed under: CC-BY-SA with attribution