I/O and data sharing: If your I/O is low, you can leave your data on the master node and share it with the other nodes over NFS. If you have a lot of I/O, I would recommend using an S3 bucket.
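For the S3 route, here is a minimal sketch of what the per-job I/O could look like. It assumes the boto3 library and uses a placeholder bucket name and key layout, neither of which comes from your setup:

```python
import boto3

# Hypothetical bucket and key names, purely for illustration.
BUCKET = "my-starcluster-data"

s3 = boto3.client("s3")

# Push an input file to S3 before submitting the jobs.
s3.upload_file("inputs/sample_001.dat", BUCKET, "inputs/sample_001.dat")

# On a worker node: pull the input, process it, push the result back.
s3.download_file(BUCKET, "inputs/sample_001.dat", "/scratch/sample_001.dat")
# ... run your processing on /scratch/sample_001.dat ...
s3.upload_file("/scratch/sample_001.out", BUCKET, "outputs/sample_001.out")
```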
Distribution: Your bash script that launches multiple qsub calls is the right thing to do. It's up to you whether each job gets a single file or a few files at once.
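Your launcher is a bash script; for illustration, here is the same idea sketched in Python. The job-script name, file names, batch size, and qsub flags are placeholders you would adapt:

```python
import subprocess

# Hypothetical file list and job script; adapt to your own naming.
FILES = ["sample_%03d.dat" % i for i in range(1, 101)]
BATCH_SIZE = 10  # one job works on 10 files rather than one job per file

def submit_batch(batch):
    # The job script would download its files from S3, process them,
    # and upload the results; the file names are passed as arguments.
    cmd = ["qsub", "-cwd", "process_batch.sh"] + batch
    subprocess.check_call(cmd)

for i in range(0, len(FILES), BATCH_SIZE):
    submit_batch(FILES[i:i + BATCH_SIZE])
```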
Scaling: Think of your parallel jobs running on the cluster as separate tasks. It's up to you to run one or more instances of your application on each node. E.g., if you use cr1.8xlarge nodes, you have 32 cores per node: you can launch one instance of your app there using all 32 cores, or 4 instances using 8 cores each. See the "slots" configuration for each node within Open Grid Engine. (If you want to run one big instance of your app that combines the cores of multiple nodes, I never did that, so I can't help you with it.) Then, to add a node, you can use the "addnode" command from StarCluster. Once the node is up, OGS will automatically distribute jobs there too. You could also use the StarCluster load balancer to automatically add/remove nodes.
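For example, to pack four 8-core jobs onto a 32-slot node, each job would request 8 slots through a parallel environment. A rough sketch, assuming a PE named "smp" and the same hypothetical job script as above (the actual PE name depends on your cluster; `qconf -spl` lists the ones available):

```python
import subprocess

# Request 8 slots so OGS can schedule four such jobs per 32-slot node.
# "smp" is an assumed parallel-environment name; check `qconf -spl` for yours.
subprocess.check_call([
    "qsub",
    "-cwd",
    "-pe", "smp", "8",      # reserve 8 slots on one node
    "process_batch.sh",     # hypothetical job script
    "sample_001.dat",
])
```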
So, here is my suggestion:

1. Extract your files to S3.
2. Launch StarCluster.
3. Using your bash script, qsub a job for every few files (it may be more efficient for one job to work on, say, 10 files than to have a job for every single file).
4. Your application must read its input from, and write its output to, S3.
5. When the queue is empty, have a script check the results to make sure all jobs ran well, and reschedule any job whose output is missing.
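For step 5, a rough sketch of such a checker, assuming boto3 and the same placeholder bucket and key layout as above (inputs under `inputs/`, results under `outputs/`, `.dat` in and `.out` out are all my assumptions):

```python
import subprocess
import boto3

BUCKET = "my-starcluster-data"   # hypothetical bucket name
s3 = boto3.client("s3")

def list_keys(prefix):
    """Return the set of object names (basenames) under a prefix."""
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"].split("/")[-1])
    return keys

inputs = list_keys("inputs/")
outputs = list_keys("outputs/")

# Any input without a matching result gets rescheduled.
missing = [name for name in inputs if name.replace(".dat", ".out") not in outputs]
for name in missing:
    subprocess.check_call(["qsub", "-cwd", "process_batch.sh", name])
```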
- I don't know how your aggregation is done, so I can't say.
- I never used hadoop, so I can't help there either.
- You don't need to make your Python script MPI executable.
- If you use a heterogeneous cluster, then you know from the start how many cores will be available on each node.
- If you define a node with 32 cores to have 4 slots, then you should make your jobs use at most 8 cores each.