I/O and data sharing: If your I/O is low, you can leave your data on the master node and share it with the other nodes over NFS. If you have a lot of I/O, I would recommend using an S3 bucket.
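For the S3 route, here is a minimal sketch of what the per-job I/O could look like. It assumes the boto3 library and uses a placeholder bucket name and key layout, neither of which comes from your setup:

```python
import boto3

# Hypothetical bucket and key names, purely for illustration.
BUCKET = "my-starcluster-data"

s3 = boto3.client("s3")

# Push an input file to S3 before submitting the jobs.
s3.upload_file("inputs/sample_001.dat", BUCKET, "inputs/sample_001.dat")

# On a worker node: pull the input, process it, push the result back.
s3.download_file(BUCKET, "inputs/sample_001.dat", "/scratch/sample_001.dat")
# ... run your processing on /scratch/sample_001.dat ...
s3.upload_file("/scratch/sample_001.out", BUCKET, "outputs/sample_001.out")
```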
Distribution: Your bash script that launches multiple qsub calls is the right thing to do. It's up to you whether each job gets a single file or a few files at once.
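Your launcher is a bash script; for illustration, here is the same idea sketched in Python. The job-script name, file names, batch size, and qsub flags are placeholders you would adapt:

```python
import subprocess

# Hypothetical file list and job script; adapt to your own naming.
FILES = ["sample_%03d.dat" % i for i in range(1, 101)]
BATCH_SIZE = 10  # one job works on 10 files rather than one job per file

def submit_batch(batch):
    # The job script would download its files from S3, process them,
    # and upload the results; the file names are passed as arguments.
    cmd = ["qsub", "-cwd", "process_batch.sh"] + batch
    subprocess.check_call(cmd)

for i in range(0, len(FILES), BATCH_SIZE):
    submit_batch(FILES[i:i + BATCH_SIZE])
```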
Scaling: Think of your parallel jobs running on the cluster as separate tasks. It's up to you to run one or more instances of your application on each node. E.g., if you use cr1.8xlarge nodes, you have 32 cores per node: you can launch one instance of your app there using all 32 cores, or 4 instances using 8 cores each. See the "slots" configuration for each node within Open Grid Engine. (If you want to run one big instance of your app that combines the cores of multiple nodes, I never did that, so I can't help you with it.) Then, to add a node, you can use the "addnode" command from StarCluster. Once the node is up, OGS will automatically distribute jobs there too. You could also use the StarCluster load balancer to automatically add/remove nodes.
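For example, to pack four 8-core jobs onto a 32-slot node, each job would request 8 slots through a parallel environment. A rough sketch, assuming a PE named "smp" and the same hypothetical job script as above (the actual PE name depends on your cluster; `qconf -spl` lists the ones available):

```python
import subprocess

# Request 8 slots so OGS can schedule four such jobs per 32-slot node.
# "smp" is an assumed parallel-environment name; check `qconf -spl` for yours.
subprocess.check_call([
    "qsub",
    "-cwd",
    "-pe", "smp", "8",      # reserve 8 slots on one node
    "process_batch.sh",     # hypothetical job script
    "sample_001.dat",
])
```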
So, here is my suggestion:

1. Extract your files to S3.
2. Launch StarCluster.
3. Using your bash script, qsub a job for every few files (it may be more efficient for one job to work on, say, 10 files than to have a job for every single file).
4. Your application must read its input from, and write its output to, S3.
5. When the queue is empty, have a script check the results to make sure all jobs ran well, and reschedule any job whose output is missing.
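For step 5, a rough sketch of such a checker, assuming boto3 and the same placeholder bucket and key layout as above (inputs under `inputs/`, results under `outputs/`, `.dat` in and `.out` out are all my assumptions):

```python
import subprocess
import boto3

BUCKET = "my-starcluster-data"   # hypothetical bucket name
s3 = boto3.client("s3")

def list_keys(prefix):
    """Return the set of object names (basenames) under a prefix."""
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"].split("/")[-1])
    return keys

inputs = list_keys("inputs/")
outputs = list_keys("outputs/")

# Any input without a matching result gets rescheduled.
missing = [name for name in inputs if name.replace(".dat", ".out") not in outputs]
for name in missing:
    subprocess.check_call(["qsub", "-cwd", "process_batch.sh", name])
```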
- I don't know how your aggregation is done, so I can't say.
- I never used hadoop, so I can't help there either.
- You don't need to make your Python script MPI executable.
- If you use a heterogeneous cluster, then you know from the start how many cores will be available on each node.
- If you define a node with 32 cores to have 4 slots, then you should make your jobs use at most 8 cores each.