Frage

I've been given the task of writing an ETL (Extract, Transform, Load) process between a PostgreSQL 9.1 database hosted on Heroku (we can call it the Master) to another, application-purposed copy of the data that will be in another Heroku (Cedar Stack) hosted PostgreSQL database. Our primary development stack is Python 2.7.2, Django 1.3.3 and PostgreSQL 9.1. As many of you may know, the file system in Heroku is limited in what you can do, and I'm not sure if I completely understand what the rules are for the Ephemeral Filesystem.

So, I'm trying to figure out what my options are here. The obvious one is that I can just write a Django management command and have two separate database connections (and a destination and source set of models) and pump the data over that way and handle the ETL in the process. While effect, my initial tests shows this is a very slow approach. Obviously, a faster approach would be to use PostreSQL COPY functionality. But, normally if I was doing this I would be able to write it out to a file and then use psql to pull it in. Any one done anything like this between two dedicated PostgreSQL databases on Heroku? Any advice or tips will be appreciated.

War es hilfreich?

Lösung

One solution may be to do the whole ETL process in Postgres land. That is, use the dblink extension to pull data from the source database into the target database. This may or may not be sufficient, but it's worth investigating.

You are free to use the filesystem on a heroku dyno, but I don't think this is a bullet proof solution. The way it works is that you can write to the filesystem just fine, but as soon as that process exits, away goes the data within it. The size of that filesystem is not guaranteed at all, but it is quite large, unless you need multiple hundreds of GBs worth of storage.

Finally, you can speed up some of the process by turning some session level postgres knobs. Instead of listing them here, just read it up on the excellent postgres docs.

EDIT: We now support the Postgres FDW, a better alternative to dblink: http://www.postgresql.org/docs/current/static/postgres-fdw.html

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top