I'm a Python developer with pretty good RDBMS experience. I need to process a fairly large amount of data (approx 500GB). The data is sitting in approximately 1200 CSV files in S3 buckets. I have written a script in Python and can run it on a server. However, it is way too slow. Based on the current speed and the amount of data, it will take approximately 50 days to get through all of the files (and of course, the deadline is WELL before that).
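To put that in perspective, here is a quick back-of-the-envelope calculation of what my current rate works out to. It uses only the rough figures above (500GB, 1200 files, 50 days) and assumes decimal gigabytes and evenly sized files, so treat the results as approximate:

```python
# Rough figures from the description above; all approximate.
TOTAL_BYTES = 500 * 10**9   # approx 500GB (decimal GB assumed)
FILE_COUNT = 1200           # approx number of CSV files
SERIAL_DAYS = 50            # estimated run time of the current script


def serial_rates(total_bytes, file_count, days):
    """Return (bytes per second, minutes per file) for a single serial run."""
    seconds = days * 24 * 60 * 60
    bytes_per_second = float(total_bytes) / seconds
    minutes_per_file = float(days * 24 * 60) / file_count
    return bytes_per_second, minutes_per_file


bps, mpf = serial_rates(TOTAL_BYTES, FILE_COUNT, SERIAL_DAYS)
print("%.0f KB/s, %.0f minutes per file" % (bps / 1000.0, mpf))
```

That comes out to roughly 116 KB/s, or about an hour per file, which is why I suspect the processing itself (rather than the raw data volume) is the bottleneck.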
Note: the processing is your basic ETL type of stuff, nothing terribly fancy. I could easily just pump it into a temp schema in PostgreSQL and then run scripts on top of it. But, again, from my initial testing, this would be way too slow.
Note: A brand new PostgreSQL 9.1 database will be its final destination.
So, I was thinking about spinning up a bunch of EC2 instances and having each one process a batch of the files, with the batches running in parallel. But I have never done something like this before, so I've been looking around for ideas, etc.
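To make the batching idea concrete, this is roughly how I picture dividing the work. It is only a sketch: the S3 key names are made up, the 5-day target is a placeholder (my real deadline is different), and it assumes throughput scales linearly with the number of instances, which may not hold if the PostgreSQL load becomes the bottleneck:

```python
import math


def workers_needed(serial_days, deadline_days):
    """Instances needed to finish in time, assuming linear speedup."""
    return int(math.ceil(float(serial_days) / deadline_days))


def split_into_batches(keys, n_workers):
    """Deal the keys out round-robin so each worker gets a similar share."""
    return [keys[i::n_workers] for i in range(n_workers)]


# Hypothetical key names standing in for the ~1200 CSV files in S3.
keys = ["data/file_%04d.csv" % i for i in range(1200)]

# 50 days serial (my estimate); 5-day target is an assumed placeholder.
n = workers_needed(50, 5)
batches = split_into_batches(keys, n)

print(n, [len(b) for b in batches])
```

Each instance would then be handed one of these batches and run my existing script over just those files.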
Again, I'm a Python developer, so it seems like Fabric + boto might be promising. I have used boto from time to time, but I have no experience with Fabric.
I know from reading and research that this is probably a great job for Hadoop, but I don't know it, I can't afford to hire someone to do it, and the timeline doesn't allow for a learning curve. I should also note that it's kind of a one-time deal, so I don't need to build a really elegant solution. I just need it to work and to get through all of the data by the end of the year.
Also, I know this is not a simple Stack Overflow kind of question (something like "how can I reverse a list in python"). But what I'm hoping for is someone to read this and say, "I do something similar and use XYZ... it's great!"
I guess what I'm asking is: does anybody know of anything out there that I could use to accomplish this task, given that I'm a Python developer, I don't know Hadoop or Java, and I have a tight timeline that prevents me from learning a new technology like Hadoop or a new language?
Thanks for reading. I look forward to any suggestions.